FIX: UNICODE Byte Order Marks Ignored by Internet Explorer 4.0x

ID: Q190837


The information in this article applies to:


SYMPTOMS

A UNICODE HTML page that includes a Byte Order Mark shows garbage characters when displayed in Internet Explorer 4.0 and 4.01. If the UNICODE HTML page is in Big-Endian order, the viewed source of the page also contains garbage characters, primarily a black box character.


CAUSE

Internet Explorer 4.0x does not support the use of Byte Order Marks in UNICODE HTML files.


RESOLUTION

The only resolution at this time is to normalize UNICODE HTML files and strip their Byte Order Marks before they are displayed in Internet Explorer 4.0 or 4.01. Normalization requires that all UNICODE characters in an HTML file be in Little-Endian format, least significant byte first. For instance, in Big-Endian format, the left angle bracket UNICODE character appears as 00 3C in a binary dump. This character must be byte swapped to 3C 00, the Little-Endian format, before being processed and displayed by Internet Explorer.
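The normalization described above can be sketched as follows. This is a minimal illustration in Python, not a Microsoft-supplied tool; the function name normalize_for_ie4 is hypothetical, and the sketch assumes the input is 16-bit UNICODE whose byte order is identified by a leading Byte Order Mark.

```python
BOM_LE = b"\xff\xfe"  # Byte Order Mark as dumped from a Little-Endian file
BOM_BE = b"\xfe\xff"  # Byte Order Mark as dumped from a Big-Endian file


def normalize_for_ie4(data: bytes) -> bytes:
    """Return Little-Endian bytes with no leading Byte Order Mark.

    Hypothetical helper: assumes `data` holds 16-bit UNICODE characters.
    Data without a mark is returned unchanged.
    """
    if data.startswith(BOM_LE):
        # Already Little-Endian: only the mark has to be removed.
        return data[2:]
    if data.startswith(BOM_BE):
        body = bytearray(data[2:])
        if len(body) % 2:
            raise ValueError("16-bit data must contain an even number of bytes")
        # Swap the two bytes of every character (00 3C becomes 3C 00).
        body[0::2], body[1::2] = body[1::2], body[0::2]
        return bytes(body)
    return data


# Big-Endian "<h" with a mark becomes Little-Endian "<h" without one.
print(normalize_for_ie4(b"\xfe\xff\x00\x3c\x00\x68"))
```

The result can be written back to disk in binary mode and then opened in Internet Explorer 4.0x.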


STATUS

Microsoft has confirmed this to be a bug in the Microsoft products listed at the beginning of this article. This bug was corrected in Microsoft Internet Explorer 5.


MORE INFORMATION

The UNICODE Specification Version 2.0 describes a "Byte Order Mark" in section 2.4, but does not insist on its use. According to the specification, the value FE FF at the beginning of a file, read in the byte order of the current machine, indicates that the following characters are probably UNICODE and are normalized for the memory architecture of that machine. If the value FF FE is found at the beginning of a file instead, it indicates that the remaining bytes are not normalized and should be byte swapped before use.

Windows, and therefore Internet Explorer, assumes that the memory architecture of the machine it is running on is "Little Endian": the first byte of a two-byte sequence is the least significant byte. In UNICODE, 00 3C is the left angle bracket character. In Little Endian this character is stored in memory as 3C 00; the least significant portion, 3C, comes first.
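The two storage orders can be seen by encoding the character both ways. The following is a small Python sketch; hex_dump is a hypothetical helper written for this illustration.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes the way a binary dump would show them."""
    return " ".join(f"{b:02X}" for b in data)


LEFT_ANGLE = "<"  # UNICODE value 003C

little = LEFT_ANGLE.encode("utf-16-le")  # least significant byte first
big = LEFT_ANGLE.encode("utf-16-be")     # most significant byte first

print(hex_dump(little))  # 3C 00
print(hex_dump(big))     # 00 3C
```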

A Little Endian format UNICODE HTML file is permitted by the UNICODE standard to include the UNICODE Byte Order Mark FE FF at the beginning of the file. In Little Endian, the Byte Order Mark is swapped like all characters so a binary dump of the Byte Order Mark would actually display as FF FE. In other words, the Byte Order Mark is UNICODE FE FF, but since Little Endian machines automatically swap their bytes, a binary dump of the mark would be FF FE.
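The same swap can be checked for the mark itself. This Python sketch encodes the UNICODE value FEFF in both byte orders and compares the results with the constants in the standard codecs module; hex_dump is a hypothetical helper used only for display.

```python
import codecs


def hex_dump(data: bytes) -> str:
    """Format bytes the way a binary dump would show them."""
    return " ".join(f"{b:02X}" for b in data)


BOM = "\ufeff"  # the Byte Order Mark as a UNICODE value

bom_le = BOM.encode("utf-16-le")
bom_be = BOM.encode("utf-16-be")

print(hex_dump(bom_le))  # FF FE
print(hex_dump(bom_be))  # FE FF

# The standard library defines the same two byte sequences.
assert bom_le == codecs.BOM_UTF16_LE
assert bom_be == codecs.BOM_UTF16_BE
```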

When Internet Explorer 4.0x processes a UNICODE HTML file that begins with the normalized Little Endian mark (FF FE in a binary dump), it ignores the purpose of the mark and displays the Byte Order Mark as two UNICODE characters. In English these look like garbage characters, somewhat like "ÿþ".
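One way to see where two stray characters could come from is to treat each byte of the mark as a character of its own. The Python sketch below does that with the Latin-1 code page; the choice of Latin-1 is an assumption made for illustration, and the article does not state how Internet Explorer 4.0x interprets the bytes internally.

```python
mark = b"\xff\xfe"  # Byte Order Mark as dumped from a Little-Endian file

# Assumption: each byte is read as one Latin-1 character.
as_characters = mark.decode("latin-1")

print(len(as_characters))                          # 2
print([f"U+{ord(c):04X}" for c in as_characters])  # ['U+00FF', 'U+00FE']
```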

If the non-normalized UNICODE Byte Order Mark, FF FE, is encountered in a file, it indicates that the characters should be byte swapped (on a Little Endian architecture, FF FE would appear as FE FF if the file were dumped). Internet Explorer does not recognize this form of the Byte Order Mark either. Because UNICODE FF FE is not a valid UNICODE character, Internet Explorer does not display garbage characters for it. Internet Explorer 4.0x also does not swap the bytes, so nothing at all is displayed. If the HTML source for the page is viewed from within Internet Explorer, a mix of valid and invalid characters is seen, with the invalid characters appearing as small, dark boxes.
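The effect of skipping the byte swap can be reproduced outside the browser. This Python sketch encodes a short tag in Big Endian order and then reads it back as Little Endian without swapping; the sample text "<html>" is an assumption chosen for illustration.

```python
source = "<html>"

big_endian = source.encode("utf-16-be")   # 00 3C 00 68 00 74 ...
misread = big_endian.decode("utf-16-le")  # read without byte swapping

# Every character lands on an unrelated UNICODE value.
print([f"U+{ord(c):04X}" for c in misread])
# ['U+3C00', 'U+6800', 'U+7400', 'U+6D00', 'U+6C00', 'U+3E00']

# Reading the same bytes in the correct order recovers the tag.
print(big_endian.decode("utf-16-be"))  # <html>
```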


REFERENCES

For additional information, please see the following article(s) in the Microsoft Knowledge Base:

Q102025 Explanation of Big Endian and Little Endian Architecture
The UNICODE Standard, Version 2.0, The Unicode Consortium

Additional query words: UNICODE HTML ENDIAN


Keywords          : kbhtml kbIE400bug kbIE401bug kbIE401sp1bug kbIE500fix 
Version           : 
Platform          : 
Issue type        : kbbug 

Last Reviewed: April 8, 1999