[Java] Class CharsetToolkit
- groovy.util.CharsetToolkit
Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Mark (BOM), are correctly detected. For UTF-8 files with no BOM, the charset should also be detected if the buffer is large enough.
A 4 KB byte buffer is used to guess the encoding.
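The BOM-less UTF-8 case can be illustrated with a small heuristic. The following is a minimal sketch and not the CharsetToolkit implementation: the class `EncodingGuessSketch`, its method names, and the simplified validation rules are assumptions made for illustration. Only the 4 KB buffer size comes from the description above, and BOM handling is deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names and rules are hypothetical, not the library's code.
final class EncodingGuessSketch {
    // Only the first 4 KB of the content are inspected.
    static final int BUFFER_SIZE = 4096;

    /** Guesses a charset from the head of a file, returning fallback for other 8-bit data. */
    static Charset guess(byte[] head, Charset fallback) {
        int len = Math.min(head.length, BUFFER_SIZE);
        boolean highBitSeen = false;
        for (int i = 0; i < len; i++) {
            if ((head[i] & 0x80) != 0) {
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return StandardCharsets.US_ASCII; // only 7-bit bytes were seen
        }
        return isPlausibleUtf8(head, len) ? StandardCharsets.UTF_8 : fallback;
    }

    /** Simplified check: every multi-byte sequence must have well-formed continuation bytes. */
    static boolean isPlausibleUtf8(byte[] b, int len) {
        boolean sawMultiByte = false;
        int i = 0;
        while (i < len) {
            int lead = b[i] & 0xFF;
            int trailing;
            if (lead < 0x80) {
                i++; // plain ASCII byte
                continue;
            } else if (lead >= 0xC2 && lead <= 0xDF) {
                trailing = 1;
            } else if (lead >= 0xE0 && lead <= 0xEF) {
                trailing = 2;
            } else if (lead >= 0xF0 && lead <= 0xF4) {
                trailing = 3;
            } else {
                return false; // stray continuation byte or invalid lead byte
            }
            if (i + trailing >= len) {
                break; // sequence cut off at the end of the buffer: stop without judging it
            }
            for (int k = 1; k <= trailing; k++) {
                if ((b[i + k] & 0xC0) != 0x80) {
                    return false; // continuation bytes must look like 10xxxxxx
                }
            }
            sawMultiByte = true;
            i += trailing + 1;
        }
        return sawMultiByte;
    }
}
```

A larger buffer gives such a heuristic more multi-byte sequences to examine, which is why detection without a BOM depends on the buffer size.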
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
- Authors:
- Guillaume Laforge
Constructor Summary
| Constructor and description |
|---|
| CharsetToolkit(File file) Constructor of the CharsetToolkit utility class. |
Methods Summary
| Type Params | Return Type | Name and description |
|---|---|---|
| | static Charset[] | getAvailableCharsets() Retrieves all the available Charsets on the platform, among which the default charset. |
| | Charset | getCharset() |
| | Charset | getDefaultCharset() Retrieves the default Charset. |
| | static Charset | getDefaultSystemCharset() Retrieves the default charset of the system. |
| | boolean | getEnforce8Bit() Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding. |
| | BufferedReader | getReader() Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered. |
| | boolean | hasUTF16BEBom() Has a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2). |
| | boolean | hasUTF16LEBom() Has a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
| | boolean | hasUTF8Bom() Has a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors). |
| | void | setDefaultCharset(Charset defaultCharset) Defines the default Charset used in case the buffer represents an 8-bit Charset. |
| | void | setEnforce8Bit(boolean enforce) If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. |
Inherited Methods Summary
| Methods inherited from class | Name |
|---|---|
class Object | wait, wait, wait, equals, toString, hashCode, getClass, notify, notifyAll |
Constructor Detail
public CharsetToolkit(File file)
Constructor of the CharsetToolkit utility class.
- Parameters:
- file - the file whose encoding we want to know.
Method Detail
public static Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, among which the default charset.
- Returns:
- an array of Charsets.
public Charset getCharset()
Returns the charset guessed for the file.
public Charset getDefaultCharset()
Retrieves the default Charset.
public static Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
- Returns:
- the default Charset.
public boolean getEnforce8Bit()
Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
- Returns:
- the flag value: true if the default charset is returned instead of US-ASCII.
public BufferedReader getReader()
Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered.
- Throws:
- FileNotFoundException if the file is not found.
- Returns:
- a BufferedReader
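The idea of returning a LineNumberReader typed as a BufferedReader, decoding with a chosen charset, can be sketched with the standard library alone. This is an illustrative assumption about the general pattern, not the CharsetToolkit source: `CharsetReaderSketch`, `openReader`, and `firstLine` are hypothetical names, and the sketch reads from an in-memory stream rather than a File.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Illustrative sketch only; not the library's implementation.
final class CharsetReaderSketch {
    /** Wraps the stream in a LineNumberReader (a BufferedReader subclass) decoding with charset. */
    static BufferedReader openReader(InputStream in, Charset charset) {
        return new LineNumberReader(new InputStreamReader(in, charset));
    }

    /** Hypothetical helper: decodes the content and returns its first line. */
    static String firstLine(byte[] content, Charset charset) {
        try (BufferedReader reader = openReader(new ByteArrayInputStream(content), charset)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a real file, passing a `java.io.FileInputStream` to `openReader` would surface a FileNotFoundException when the file is missing, which matches the exception documented above.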
public boolean hasUTF16BEBom()
Has a Byte Order Mark for UTF-16 Big Endian (utf-16 and ucs-2).
- Returns:
- true if the buffer has a BOM for UTF-16 Big Endian.
public boolean hasUTF16LEBom()
Has a Byte Order Mark for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
- Returns:
- true if the buffer has a BOM for UTF-16 Low Endian.
public boolean hasUTF8Bom()
Has a Byte Order Mark for UTF-8 (used by Microsoft's Notepad and other editors).
- Returns:
- true if the buffer has a BOM for UTF-8.
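For reference, the three BOM checks amount to comparing the first bytes of the buffer against fixed signatures: FE FF for UTF-16 big endian, FF FE for UTF-16 little endian, and EF BB BF for UTF-8. The sketch below shows that comparison. `BomSketch` and its static methods are hypothetical names that take the buffer as a parameter, unlike the instance methods documented here.

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class BomSketch {
    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 little endian BOM: FF FE (also the first two bytes of the UTF-32LE BOM). */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE;
    }

    /** UTF-16 big endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }
}
```

The `& 0xFF` mask is needed because Java bytes are signed, so a raw comparison such as `b[0] == 0xEF` would never be true.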
public void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
- Parameters:
- defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. Such a file may simply contain no characters in the range 128-255 yet, while being, or later becoming, a file encoded with the default charset rather than US-ASCII.
- Parameters:
- enforce - true to return the default charset instead of US-ASCII, false to allow US-ASCII.
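The effect of the flag can be sketched as a single decision on a buffer that contains only 7-bit bytes. This is a simplified illustration under stated assumptions, not the library's code: `Enforce8BitSketch` and `resolve` are hypothetical names, and any buffer with a high-bit byte is simply mapped to the default charset here, ignoring UTF-8 detection.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical and UTF-8 detection is omitted.
final class Enforce8BitSketch {
    /** Picks a charset for the buffer, honoring the enforce8Bit flag for pure ASCII content. */
    static Charset resolve(byte[] buffer, Charset defaultCharset, boolean enforce8Bit) {
        for (byte b : buffer) {
            if ((b & 0x80) != 0) {
                // Not pure ASCII: this sketch just falls back to the default charset.
                return defaultCharset;
            }
        }
        // Pure ASCII so far: report US-ASCII only when the flag allows it.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }
}
```

Enforcing the default charset is the safer choice when a file that is pure ASCII today may later receive accented or other 8-bit characters.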
© 2003-2020 The Apache Software Foundation
Licensed under the Apache license.
https://docs.groovy-lang.org/2.4.21/html/gapi/groovy/util/CharsetToolkit.html