Unicode and Microsoft Windows NT

ID: q99884

The information in this article applies to:

Windows NT version 3.1 employs a relatively new standard of character representation called Unicode. This new standard allows for greater flexibility in adding support for localized versions of Microsoft Windows NT.

MORE INFORMATION

The first and most prominent character standard in use by computers today is ASCII. This format is adequate for English, but as computers became more popular in European countries, the limitations of ASCII became clear.

In an effort to overcome some of these limitations, the International Organization for Standardization (ISO) established a new standard called Latin-1 that defined European characters that were omitted from ASCII. Microsoft Windows modified the Latin-1 standard even further and called the resulting character set Windows ANSI. However, because they continue to use an 8-bit coding scheme, Latin-1 and Windows ANSI can represent only 256 unique symbols--considerably fewer than the 10,000 symbols that are common in such languages as Chinese, Korean, and Japanese. In addition to the language barriers, as the capabilities of computers broaden beyond uppercase, mono-spaced fonts, the requirements for a large set of unique characters (for example, letters, punctuation, mathematical and technical symbols, and publishing characters) have also grown far beyond the capabilities of 8-bit text.
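The 8-bit limit described above can be illustrated with a minimal Python sketch. The function name fits_in_8_bit is hypothetical, and Python's "latin-1" and "cp1252" codecs are used here as stand-ins for Latin-1 and Windows ANSI; the mapping tables in a particular Windows release may differ.

```python
# An 8-bit code has 2**8 possible values, so it can name at most 256 symbols.
EIGHT_BIT_CAPACITY = 2 ** 8


def fits_in_8_bit(text, codec="latin-1"):
    """Return True if every character in text has a code in the 8-bit codec.

    Hypothetical helper for illustration; the default codec is an assumption.
    """
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# The same byte can mean different things in Latin-1 and Windows ANSI:
# 0x93 is a control code in Latin-1 but a left double quotation mark in cp1252.
BYTE_0X93_AS_LATIN1 = bytes([0x93]).decode("latin-1")
BYTE_0X93_AS_ANSI = bytes([0x93]).decode("cp1252")
```

For example, fits_in_8_bit("café") is true, while fits_in_8_bit("日本語") is false because Japanese characters have no code in an 8-bit Western character set.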

The lowest level of localization (adaptation to a particular language) is the actual binary representation of characters: the code set. To overcome the limitations of the earlier coding methods, several major computer companies, including Apple Computer, Inc., Sun Microsystems, Inc., Xerox Corp., and International Business Machines Corp. (IBM), formed Unicode Inc., a non-profit consortium, to define a new standard for international character sets. At the same time, the ISO began developing its own standard. Eventually, the two efforts merged and became Unicode. The standard is published as "The Unicode Standard: Worldwide Character Encoding."

Unicode employs a 16-bit coding scheme that allows for 65,536 distinct characters--more than enough to include all languages in use today. In addition, it supports several archaic or arcane languages and scripts such as Sanskrit and Egyptian hieroglyphs. Unicode also includes representations for punctuation marks, mathematical symbols, and dingbats, with room left for future expansion. Because Unicode establishes a unique code for each character in each script, Windows NT can ensure that the character translation from one language to another is accurate.
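The arithmetic behind the 16-bit scheme can be sketched in a few lines of Python. The helper to_16bit_units is a hypothetical name, and the sketch assumes the text contains only characters that fit in a single 16-bit code, as this article describes; Python's "utf-16-le" codec stands in for the 16-bit representation.

```python
import struct

# A 16-bit code has 2**16 possible values.
SIXTEEN_BIT_CAPACITY = 2 ** 16


def to_16bit_units(text):
    """Return the 16-bit code values that represent text.

    Hypothetical helper for illustration. It encodes the text as
    little-endian 16-bit values and unpacks them as unsigned shorts.
    """
    data = text.encode("utf-16-le")
    count = len(data) // 2  # two bytes per 16-bit value
    return list(struct.unpack("<%dH" % count, data))


# One English letter, one accented European letter, one Japanese character.
SAMPLE_UNITS = to_16bit_units("Aé日")
```

Each of the three sample characters occupies exactly one 16-bit value (0x0041, 0x00E9, and 0x65E5), regardless of which script it belongs to.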

Unicode in Windows NT

Unicode is the native code set of Windows NT, but the Win32 subsystem provides both ANSI and Unicode support. Character strings in the system, including object names, path names, and file and directory names, are represented with 16-bit Unicode characters. The Win32 subsystem converts any ANSI characters it receives into Unicode strings before manipulating them. It then converts them back to ANSI, if necessary, upon exit from the system.
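The convert-on-entry, convert-on-exit pattern described above can be sketched as follows. This is an illustration only, not the Win32 implementation: ansi_entry_point and normalize_path are hypothetical names, and Python's "cp1252" codec is assumed as a stand-in for the Windows ANSI code page.

```python
# Assumption: cp1252 approximates the Windows ANSI code page.
ANSI_CODEC = "cp1252"


def normalize_path(path):
    """Hypothetical internal routine that works only on Unicode strings."""
    return path.replace("/", "\\")


def ansi_entry_point(ansi_bytes, unicode_worker):
    """Accept an ANSI string, do the work in Unicode, and return ANSI.

    Hypothetical wrapper that mirrors the pattern in the text:
    convert to Unicode on entry, convert back to ANSI on exit.
    """
    wide_text = ansi_bytes.decode(ANSI_CODEC)  # ANSI -> Unicode on entry
    result = unicode_worker(wide_text)         # all processing is Unicode
    return result.encode(ANSI_CODEC)           # Unicode -> ANSI on exit


ANSI_RESULT = ansi_entry_point(b"C:/caf\xe9/readme.txt", normalize_path)
```

A caller that passes and receives ANSI strings never sees the Unicode form; only the internal routine does.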

References:

Unicode Inc., 1965 Charleston Road, Mountain View, CA 94043, Phone: (415) 961-4189

"Inside Windows NT," by Helen Custer, Microsoft Press, 1992

"Program Migration to Unicode," by Asmus Freytag, Proceedings of the First Unicode Implementers Workshop, The Unicode Consortium, Mountain View, California, August 1991

"Adapt Your Program for Worldwide Use with Windows Internationalization Support," by William S. Hall, Microsoft Systems Journal, Vol. 6, No. 6, Nov./Dec. 1991

"Operating Systems: Design and Implementation," by Andrew S. Tanenbaum, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1987

Additional query words: prodnt win32

Keywords          : kbother
Version           : 3.1
Platform          : WINDOWS

Last Reviewed: August 25, 1998