Skip to content

Commit

Permalink
Stating endianness of UTF-16 and UTF-32 was an error when BOM present.
Browse files Browse the repository at this point in the history
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
Also it broke unit tests in using programs since a conversion from UTF-8
to UTF-16LE/BE would create a text without BOM, and a conversion from
UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM, which changed
existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
  • Loading branch information
Jehan committed Dec 4, 2015
1 parent 2856e68 commit e5234d6
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions src/nsUniversalDetector.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)
case '\xFE':
if ('\xFF' == aBuf[1])
/* FE FF: UTF-16, big endian BOM. */
mDetectedCharset = "UTF-16BE";
mDetectedCharset = "UTF-16";
break;
case '\xFF':
if ('\xFE' == aBuf[1])
Expand All @@ -131,12 +131,12 @@ nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)
aBuf[3] == '\x00')
{
/* FF FE 00 00: UTF-32 (LE). */
mDetectedCharset = "UTF-32LE";
mDetectedCharset = "UTF-32";
}
else
{
/* FF FE: UTF-16, little endian BOM. */
mDetectedCharset = "UTF-16LE";
mDetectedCharset = "UTF-16";
}
}
break;
Expand All @@ -147,7 +147,7 @@ nsresult nsUniversalDetector::HandleData(const char* aBuf, PRUint32 aLen)
aBuf[3] == '\xFF')
{
/* 00 00 FE FF: UTF-32 (BE). */
mDetectedCharset = "UTF-32BE";
mDetectedCharset = "UTF-32";
}
break;
}
Expand Down

0 comments on commit e5234d6

Please sign in to comment.