Convert a byte array to a String with a byte order mark present
Java does a pretty good job of converting a byte array to a string, using the String constructor new String(byteArray, charset). But if a byte order mark (BOM) is present, the result can be surprising. A UTF-8 encoded file may start with a BOM (the bytes EF BB BF). The Unicode standard permits this but does not recommend it, and it is common because Windows applications such as Notepad have traditionally written one by default when saving UTF-8 text. Java's UTF-8 decoder does not strip the BOM, so when the bytes are converted to a String, a bogus character, U+FEFF (zero width no-break space), is placed at the beginning of the string.
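Something similar happens when converting UTF-16. If the endianness is specified in the Charset name ('UTF-16BE' or 'UTF-16LE'), a BOM is not expected, so a BOM in the data is decoded as an ordinary character, U+FEFF, which ends up at the beginning of the string. With the plain 'UTF-16' charset, by contrast, the decoder uses the BOM to pick the byte order and removes it.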
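To make the behavior described above concrete, here is a small sketch that decodes BOM-prefixed bytes and inspects the first character. The class name BomDemo and the sample byte arrays are made up for illustration; only the standard String constructor and StandardCharsets constants are used.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // UTF-8 BOM (EF BB BF) followed by "hi"
    static final byte[] UTF8_WITH_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
    // UTF-16 big-endian BOM (FE FF) followed by "hi"
    static final byte[] UTF16BE_WITH_BOM = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};

    public static void main(String[] args) {
        String utf8 = new String(UTF8_WITH_BOM, StandardCharsets.UTF_8);
        String utf16be = new String(UTF16BE_WITH_BOM, StandardCharsets.UTF_16BE);
        String utf16 = new String(UTF16BE_WITH_BOM, StandardCharsets.UTF_16);

        // The BOM survives decoding as U+FEFF, so each string is one char too long.
        System.out.println(utf8.length() + " " + (utf8.charAt(0) == '\uFEFF'));       // 3 true
        System.out.println(utf16be.length() + " " + (utf16be.charAt(0) == '\uFEFF')); // 3 true
        // The generic UTF-16 charset consumes the BOM itself.
        System.out.println(utf16.length() + " " + utf16.equals("hi"));                // 2 true
    }
}
```

Because U+FEFF is invisible when printed, the extra character usually shows up only indirectly, for example as a failed equals() or startsWith() comparison.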
What follows is a short method to fix this by stripping this bogus character out of the returned string.
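import java.nio.charset.Charset;

public static String convertToString(byte[] bytes, Charset charset) {
    String ret = new String(bytes, charset);
    // Strip only if the decoder left the BOM (U+FEFF) at the front of the string;
    // the plain "UTF-16" charset, for example, already consumes it.
    if (ret.isEmpty() || ret.charAt(0) != '\uFEFF') {
        return ret;
    }
    // The length checks avoid an ArrayIndexOutOfBoundsException on short input.
    if ( (bytes.length >= 3) && (bytes[0] == (byte) 0xEF) && (bytes[1] == (byte) 0xBB) && (bytes[2] == (byte) 0xBF) ) {
        ret = ret.substring(1);   // UTF-8 BOM
    } else if ( (bytes.length >= 2) && (bytes[0] == (byte) 0xFE) && (bytes[1] == (byte) 0xFF) ) {
        ret = ret.substring(1);   // UTF-16 big-endian BOM
    } else if ( (bytes.length >= 2) && (bytes[0] == (byte) 0xFF) && (bytes[1] == (byte) 0xFE) ) {
        ret = ret.substring(1);   // UTF-16 little-endian BOM
    }
    return ret;
}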
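An alternative to comparing raw bytes is to look at the decoded string instead. The bogus character is always U+FEFF, so it can be dropped whenever it appears first. This is a minimal sketch of that variant, not the method above; the class BomStripper and the helper decodeWithoutBom are hypothetical names, and it assumes the charset passed in matches the actual encoding of the bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomStripper {
    // Hypothetical helper, not part of the original method: decode, then drop
    // a single leading U+FEFF if the decoder left one in place.
    public static String decodeWithoutBom(byte[] bytes, Charset charset) {
        String s = new String(bytes, charset);
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        System.out.println(decodeWithoutBom(utf8, StandardCharsets.UTF_8));       // hi
        System.out.println(decodeWithoutBom(utf16le, StandardCharsets.UTF_16LE)); // hi
        System.out.println(decodeWithoutBom(new byte[0], StandardCharsets.UTF_8).isEmpty()); // true
    }
}
```

This version is shorter and handles empty input, and it leaves the string alone when the decoder has already consumed the BOM. The trade-off is that it cannot tell a BOM apart from a deliberate U+FEFF at the start of the text.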