FIX: UNICODE Byte Order Marks Ignored by Internet Explorer 4.0x
ID: Q190837
A UNICODE HTML page that includes a Byte Order Mark displays garbage characters when viewed in Internet Explorer 4.0 and 4.01. If the UNICODE HTML page is in Big-Endian order, the viewed source of the page also contains garbage characters, primarily a black box character.
Internet Explorer 4.0x does not support the use of Byte Order Marks in UNICODE HTML files.
The only resolution at this time is to normalize UNICODE HTML files and strip their Byte Order Marks before they are displayed in Internet Explorer 4.0 or 4.01. Normalization requires that all UNICODE characters in an HTML file be in Little-Endian format, least significant byte first. For instance, in Big-Endian format, the left angle bracket UNICODE character appears as 00 3C in a binary dump. This character must be byte swapped to 3C 00, Little-Endian format, before it is processed and displayed by Internet Explorer.
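The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not a Microsoft-supplied tool: the function name normalize_utf16_for_ie4 is hypothetical, and the sketch assumes that a file with no Byte Order Mark is already in Little-Endian format.

```python
import codecs


def normalize_utf16_for_ie4(data: bytes) -> bytes:
    """Return the page as Little-Endian UNICODE bytes with no Byte Order Mark."""
    if data.startswith(codecs.BOM_UTF16_BE):      # dump starts with FE FF
        text = data[2:].decode("utf-16-be")       # Big-Endian: bytes need swapping
    elif data.startswith(codecs.BOM_UTF16_LE):    # dump starts with FF FE
        text = data[2:].decode("utf-16-le")       # already Little-Endian
    else:
        # Assumption: unmarked input is already Little-Endian.
        text = data.decode("utf-16-le")
    # "utf-16-le" writes least significant byte first and adds no mark.
    return text.encode("utf-16-le")
```

For example, a Big-Endian input that dumps as FE FF 00 3C comes back as 3C 00: the mark is stripped and the left angle bracket is byte swapped.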
Microsoft has confirmed this to be a bug in the Microsoft products listed at the beginning of this article. This bug was corrected in Microsoft Internet Explorer 5.
The UNICODE Specification Version 2.0 describes a "Byte Order Mark" in
section 2.4, but does not insist on its use. According to the
specification, the byte sequence FE FF at the beginning of a file indicates
that the following characters are probably UNICODE, normalized for the
memory architecture of the current machine. If the byte sequence FF FE is
found at the beginning of a file it indicates that the remaining bytes are
not normalized and should be byte swapped before use.
Windows, and therefore Internet Explorer, assumes that the memory
architecture of the machine it is running on is "Little Endian." The
first byte of a two-byte sequence is actually the least significant byte.
In UNICODE 00 3C is the left angle bracket character. In Little Endian
this character would be stored in memory as 3C 00; the least significant
portion, 3C, comes first.
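The two byte layouts of the 00 3C character can be reproduced with Python's standard struct module. This is only an illustration of the byte ordering; the variable names are hypothetical.

```python
import struct

CHAR_VALUE = 0x003C  # the 16-bit UNICODE value discussed above

little_endian = struct.pack("<H", CHAR_VALUE)  # least significant byte first
big_endian = struct.pack(">H", CHAR_VALUE)     # most significant byte first

print(little_endian.hex(" "))  # 3c 00
print(big_endian.hex(" "))     # 00 3c
```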
A Little Endian format UNICODE HTML file is permitted by the UNICODE
standard to include the UNICODE Byte Order Mark FE FF at the beginning of
the file. In Little Endian, the Byte Order Mark is swapped like all
characters so a binary dump of the Byte Order Mark would actually display
as FF FE. In other words, the Byte Order Mark is UNICODE FE FF, but since
Little Endian machines automatically swap their bytes, a binary dump of the
mark would be FF FE.
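As a rough sketch, the byte order of a file can be read from the first two bytes of its binary dump. The function name byte_order_from_dump is hypothetical; the check looks only at the raw bytes as they appear in the file.

```python
def byte_order_from_dump(data: bytes) -> str:
    """Classify a UNICODE file by the first two bytes of its binary dump."""
    first_two = data[:2]
    if first_two == b"\xff\xfe":
        # The mark FE FF stored least significant byte first.
        return "little-endian"
    if first_two == b"\xfe\xff":
        # The mark FE FF stored most significant byte first.
        return "big-endian"
    return "unmarked"
```

A file whose dump begins FF FE 3C 00 is reported as "little-endian", and one that begins FE FF 00 3C as "big-endian".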
When Internet Explorer 4.0x processes a UNICODE HTML file containing the
Little Endian, FF FE, normalized UNICODE mark, it ignores the purpose of
the mark and displays the Byte Order Mark as two UNICODE characters. In
English these look like garbage characters, somewhat like "ÿþ".
If the non-normalized UNICODE Byte Order Mark, FF FE, is encountered in a
file, it indicates that the characters should be byte swapped (in a Little
Endian architecture FF FE would appear as FE FF if the file were dumped).
Internet Explorer does not recognize this form of the Byte Order Mark
either, and since UNICODE FF FE is not a valid UNICODE character, Internet
Explorer will not display garbage characters. Internet Explorer 4.0X will
also not swap the bytes, so nothing at all will be displayed. If the HTML
source for the page is viewed from within Internet Explorer a mix of valid
and invalid characters will be seen, the invalid characters appearing as
small, dark box characters.
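The effect of skipping the byte swap can be imitated in Python by decoding Big-Endian bytes as if they were Little-Endian. This sketch only imitates the symptom and makes no claim about how Internet Explorer is implemented; the sample page text and variable names are hypothetical.

```python
# A tiny Big-Endian page: its dump starts FE FF 00 3C 00 48 ...
big_endian_page = "\ufeff<HTML>".encode("utf-16-be")

# Read the bytes as Little-Endian without swapping them first.
misread = big_endian_page.decode("utf-16-le")

# List the resulting character values in hexadecimal.
code_points = [f"{ord(ch):04X}" for ch in misread]
print(code_points[:2])  # ['FFFE', '3C00']
```

The mark becomes FFFE, which is not a valid UNICODE character, and the left angle bracket becomes 3C00 rather than 003C, so no markup is recognized.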
For additional information, please see the following article(s) in the Microsoft Knowledge Base:
Q102025 Explanation of Big Endian and Little Endian Architecture
The UNICODE Standard, Version 2.0, The Unicode Consortium
Additional query words: UNICODE HTML ENDIAN
Keywords : kbhtml kbIE400bug kbIE401bug kbIE401sp1bug kbIE500fix
Version :
Platform :
Issue type : kbbug
Last Reviewed: April 8, 1999