[Java] Class CharsetToolkit

groovy.util.CharsetToolkit

Utility class to guess the encoding of a given text file.

Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker (BOM), are correctly detected. For UTF-8 files with no BOM, the charset should also be detected, provided the buffer is large enough.

A byte buffer of 4 KB is used to guess the encoding.

Usage:

 CharsetToolkit toolkit = new CharsetToolkit(file);

 // guess the encoding
 Charset guessedCharset = toolkit.getCharset();

 // create a reader with the correct charset
 BufferedReader reader = toolkit.getReader();

 // read the file content
 String line;
 while ((line = reader.readLine()) != null)
 {
     System.out.println(line);
 }
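The BOM-less UTF-8 case mentioned above can be illustrated with a strict decode of the buffer. This is a minimal sketch using only the JDK, not the code `CharsetToolkit` itself runs; the class name `Utf8Probe` and the method `looksLikeUtf8` are hypothetical names introduced for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class Utf8Probe {

    /** Returns true if the first {@code length} bytes decode as strict UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer, int length) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buffer, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII input also passes this check, and a fixed-size buffer that cuts a multi-byte sequence at its end would be rejected, so a real detector has to handle both cases separately.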

Authors:: Guillaume Laforge

Constructor Summary

Constructors
Constructor and description
`CharsetToolkit (File file)` Constructor of the `CharsetToolkit` utility class.

Methods Summary

Methods
Type Params	Return Type	Name and description
	`static Charset[]`	`getAvailableCharsets()` Retrieves all the available `Charset`s on the platform, among which the default `charset`.
	`Charset`	`getCharset()`
	`Charset`	`getDefaultCharset()` Retrieves the default `Charset`.
	`static Charset`	`getDefaultSystemCharset()` Retrieves the default charset of the system.
	`boolean`	`getEnforce8Bit()` Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
	`BufferedReader`	`getReader()` Gets a `BufferedReader` (indeed a `LineNumberReader`) from the `File` specified in the constructor of `CharsetToolkit` using the charset discovered or the default charset if an 8-bit `Charset` is encountered.
	`boolean`	`hasUTF16BEBom()` Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
	`boolean`	`hasUTF16LEBom()` Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
	`boolean`	`hasUTF8Bom()` Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).
	`void`	`setDefaultCharset(Charset defaultCharset)` Defines the default `Charset` used in case the buffer represents an 8-bit `Charset`.
	`void`	`setEnforce8Bit(boolean enforce)` If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII.

Inherited Methods Summary

Inherited Methods
Methods inherited from class	Name
`class Object`	`wait, wait, wait, equals, toString, hashCode, getClass, notify, notifyAll`

Constructor Detail

public CharsetToolkit(File file)

Constructor of the CharsetToolkit utility class.

Parameters:: file - the file whose encoding we want to know.

Method Detail

public static Charset[] getAvailableCharsets()

Retrieves all the available Charsets on the platform, among which the default charset.

Returns:: an array of Charsets.
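For comparison, the same information is available directly from the JDK. The sketch below shows one way such an array could be built from `Charset.availableCharsets()`; the class name `AvailableCharsets` and the method `list` are hypothetical, and this is not claimed to be the library's own implementation.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class AvailableCharsets {

    /** Returns every charset the running JVM supports, as an array. */
    static Charset[] list() {
        // availableCharsets() maps canonical names to Charset instances.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }
}
```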

public Charset getCharset()

public Charset getDefaultCharset()

Retrieves the default Charset.

public static Charset getDefaultSystemCharset()

Retrieves the default charset of the system.

Returns:: the default Charset.

public boolean getEnforce8Bit()

Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.

Returns:: the value of the enforce8Bit flag; when true, the default charset is returned in place of US-ASCII.

public BufferedReader getReader()

Gets a BufferedReader (indeed a LineNumberReader) from the File specified in the constructor of CharsetToolkit using the charset discovered or the default charset if an 8-bit Charset is encountered.

Throws:: FileNotFoundException if the file is not found.

Returns:: a BufferedReader
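A reader of this kind can be assembled from JDK classes once a charset is known. The following is a sketch under that assumption: `CharsetReaders`, `open`, and `countLines` are hypothetical names, the charset is passed in rather than guessed, and any BOM handling is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class CharsetReaders {

    /** Opens the file with an explicit charset; LineNumberReader is a BufferedReader. */
    static BufferedReader open(Path file, Charset charset) throws IOException {
        return new LineNumberReader(
                new InputStreamReader(Files.newInputStream(file), charset));
    }

    /** Counts lines, closing the reader even if reading fails. */
    static int countLines(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = open(file, charset)) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```

Whichever way the reader is obtained, closing it with try-with-resources avoids leaking the underlying file handle.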

public boolean hasUTF16BEBom()

Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

Returns:: true if the buffer has a BOM for UTF-16 Big Endian.

public boolean hasUTF16LEBom()

Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

Returns:: true if the buffer has a BOM for UTF-16 Little Endian.

public boolean hasUTF8Bom()

Has a Byte Order Marker for UTF-8 (Used by Microsoft's Notepad and other editors).

Returns:: true if the buffer has a BOM for UTF-8.
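The three BOM checks come down to comparing the first bytes of the buffer with well-known signatures: `EF BB BF` for UTF-8, `FE FF` for UTF-16 Big Endian, and `FF FE` for UTF-16 Little Endian. The sketch below shows those comparisons on a raw byte array; `BomSniffer` and its method names are hypothetical and operate on a buffer supplied by the caller rather than on a file.

```java
// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class BomSniffer {

    /** UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** UTF-16 Big Endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFE
                && (b[1] & 0xFF) == 0xFF;
    }

    /** UTF-16 Little Endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFF
                && (b[1] & 0xFF) == 0xFE;
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed, so a raw comparison such as `b[0] == 0xEF` would never be true.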

public void setDefaultCharset(Charset defaultCharset)

Defines the default Charset used in case the buffer represents an 8-bit Charset.

Parameters:: defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

public void setEnforce8Bit(boolean enforce)

If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. A file may currently contain no special characters in the range 128-255 and still be, or later become, a file encoded with the default charset rather than US-ASCII.

Parameters:: enforce - true to return the default charset instead of US-ASCII, false to allow US-ASCII to be returned.
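The interaction between the enforce8Bit flag and the default charset can be sketched as a small decision function. This is an illustration of the rule described above, not the library's detection code: `AsciiPolicy` and `resolve` are hypothetical names, and the sketch deliberately ignores UTF-8 and UTF-16 detection, falling back to the default charset for any buffer containing bytes above 127.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of groovy.util.CharsetToolkit.
final class AsciiPolicy {

    /**
     * Picks a charset for the buffer. Assumption: only the 7-bit versus
     * 8-bit decision is modeled; Unicode detection is out of scope here.
     */
    static Charset resolve(byte[] buffer, boolean enforce8Bit, Charset defaultCharset) {
        for (byte b : buffer) {
            if (b < 0) {
                // High bit set (value 128-255): not 7-bit ASCII.
                return defaultCharset;
            }
        }
        // Every byte is 7-bit: report US-ASCII only if the flag allows it.
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }
}
```

With the flag set, a file that happens to contain only ASCII today is still read with the default charset, so later edits that add accented characters do not change how it is decoded.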

© 2003-2020 The Apache Software Foundation
Licensed under the Apache license.
https://docs.groovy-lang.org/2.4.21/html/gapi/groovy/util/CharsetToolkit.html