This is an archive of the original scripts.sil.org site, preserved as a historical reference. Some of the content is outdated. Please consult our other sites for more current information: software.sil.org, ScriptSource, FDBP, and silfontdev



Short URL: https://scripts.sil.org/WindowsCodepages

Windows and Codepages

Martin Hosken, 1997-12-29


Abstract

This document examines how Windows 95 handles multi-lingual computing. It looks at Languages, Codepages, Locales, Unicode and Fonts with particular reference to their support in Windows 95.

An alternative title for this document might be: “How to add a new script to Windows 95 and fail”.

Introduction

For those people requiring the availability of different scripts on their computers, a number of tools and approaches are available for Windows 3.1. How appropriate are such tools and approaches in Windows 95 which has better support for multilingual computing?

Here we introduce some of the basic concepts used in the rest of this discussion.

Unicode — Unicode is a 16-bit character set. Its primary purpose is for data interchange, just like ASCII. Whilst it aims to support every language, as we shall see, care should be taken in assuming that something which supports Unicode will necessarily support your language.

Language ID — A Language ID is a 16-bit number used to identify a particular language. Amongst other things, a particular language has one sort order associated. For this reason, the language ID is broken into two parts: a 10-bit primary language ID and a 6-bit sub-language ID. For example, US English = 0x0409, whilst UK English = 0x0809. Spanish (Traditional Sort) = 0x040a and Spanish (Modern Sort) = 0x0c0a.
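As an illustration of the split described above, here is a minimal Python sketch. The function name `split_langid` is hypothetical; the layout (low 10 bits for the primary language, high 6 bits for the sub-language) follows the description in the text.

```python
def split_langid(langid):
    """Split a 16-bit language ID into (primary, sub-language) parts."""
    if not 0 <= langid <= 0xFFFF:
        raise ValueError("language ID must fit in 16 bits")
    primary = langid & 0x3FF  # low 10 bits: primary language ID
    sub = langid >> 10        # high 6 bits: sub-language ID
    return primary, sub


if __name__ == "__main__":
    examples = [
        ("US English", 0x0409),
        ("UK English", 0x0809),
        ("Spanish (Traditional Sort)", 0x040A),
        ("Spanish (Modern Sort)", 0x0C0A),
    ]
    for name, langid in examples:
        primary, sub = split_langid(langid)
        print(f"{name}: primary=0x{primary:02x}, sub={sub}")
```

Both English entries share primary ID 0x09 and both Spanish entries share 0x0a; only the sub-language part differs.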

Locale — Each language has a locale identified by the language ID. A locale specifies how to represent certain information, e.g. dates, monetary values, month names, in the language. It contains no information on how data is stored or sorted in any scripts of the language. In Windows 95 and Windows NT all locale information is stored in Unicode.

Codepage — Each different encoding in a system needs to have information describing how to map to and from Unicode, to describe the semantics of each character (e.g. upper to lowercase mapping, identifying numbers) and to give default and language specific sorting information. Each encoding, or codepage, is given a 16-bit number identifying it.

The rest of this document is a discussion of how all this is implemented in Windows 95 and associated products. We start by examining how we might expect it all to work, and then look at the problems of this and resulting realities. Finally we take a quick peek into the fog trying to guess what the future might hold.

The Windows 95 Solution

Files

Windows 95 has a single locale file, Windows\System\Locale.nls, which holds all the locale information for every language.

Windows 95 also holds one file for each codepage. The names of these files are usually of the form Windows\System\cp_nnnn.nls where nnnn is the codepage number, in decimal. The particular file for a codepage is referenced via the registry at key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage

Within this key, each codepage number has an entry which references a file relative to Windows\System.

Fonts

TrueType fonts are a technology within themselves. Each font consists of a number of tables holding various pieces of information pertaining to rendering. One of the tables (cmap) is used to map between the external codepoints and the internal glyphs. In Windows (all versions) this mapping is from a 16-bit value (assumed to be Unicode) to a glyph in the font.

Applications normally store data in an 8-bit form, requiring a mapping from the 8-bit form to the 16-bit form used by a font. This is where codepages come into play. They hold the 8-bit to 16-bit mapping information. In Windows 3.11 there is one, de facto, mapping. The precise nature of the mapping is dictated by the national version of Windows that you have. So, for example, US Windows supports codepage 1252; Thai Windows supports codepage 874; and so on. This also corresponds to the default codepage provided with a particular national version of Windows 95. Windows also supports a few other minor mappings: Symbol and OEM (corresponding to DOS), but again, these are fixed and not extensible. Windows 95, on the other hand, theoretically, can support any number of codepages. This is particularly useful when doing multilingual computing.
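To make the 8-bit to 16-bit mapping concrete, the following Python sketch decodes the same byte under two codepages. It uses Python's own built-in `cp1252` and `cp874` codecs as stand-ins for the Windows codepage files, which is an assumption made for illustration; the helper name `to_unicode` is hypothetical.

```python
def to_unicode(data, codepage):
    """Map 8-bit data to Unicode text using the named codepage."""
    return data.decode(codepage)


if __name__ == "__main__":
    raw = bytes([0xA1])  # one 8-bit value, as an application might store it
    for codepage in ("cp1252", "cp874"):
        text = to_unicode(raw, codepage)
        print(f"byte 0xA1 under {codepage} is U+{ord(text):04X}")
```

The same stored byte becomes an inverted exclamation mark (U+00A1) under codepage 1252 and a Thai letter (U+0E01) under codepage 874, which is why the codepage must be known before a font can be asked for the right glyph.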

Once you get into the realm of multiple codepages, a font needs to indicate which codepages it supports. This is done, to a limited extent, within a TrueType font. Details of how this works, and the limitations it imposes, are covered in the next section.

Windows 95 Implementation

Introduction

Given the solution presented by Windows 95, adding a new orthography to Windows 95 would consist merely in producing a codepage for the encoding and any locale entries for the languages which use that codepage. These could then be inserted into the appropriate locations, and, hey presto! we can work with the new orthography in all our applications.

Unfortunately Windows 95 has various problems to overcome, and the solution to these problems results in a severe limiting of the openness of the system.

Let us return to the problem of a font indicating which codepages it supports. This is a necessary activity in order to provide scripting support by choice of font. In a program such as Word, each font is listed with the scripts it supports. Thus, if you install the multilingual extensions to Windows 95, you will have large versions of such fonts as Times New Roman. When you pull down a font selection list in, say, Word, you will see that Times New Roman can be selected in various forms, including Central European, Greek, etc. In order for Windows to give you this list, it is necessary for it to be able to interrogate the font in question to see which scripts (or codepages) it supports.

The information is provided by means of a 64-bit bitfield, stored in the OS/2 table of the TrueType font file, in which each codepage in question is allocated one of the bits. If the font supports that codepage, then the corresponding bit is set.
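A small Python sketch shows how such a bitfield might be interpreted. The bit assignments are copied from Appendix A; the names `CODEPAGE_BITS`, `supported_codepages` and `is_symbol_font` are hypothetical, and the sketch assumes the 64-bit value has already been read out of the font as a single integer.

```python
# Bit number -> codepage number, as listed in Appendix A.
CODEPAGE_BITS = {
    0: 1252, 1: 1250, 2: 1251, 3: 1253, 4: 1254, 5: 1255, 6: 1256, 7: 1257,
    16: 874, 17: 932, 18: 936, 19: 949, 20: 950, 21: 1361,
    48: 869, 49: 866, 50: 865, 51: 864, 52: 863, 53: 862, 54: 861, 55: 860,
    56: 857, 57: 855, 58: 852, 59: 775, 60: 737, 61: 708, 62: 850, 63: 437,
}
SYMBOL_BIT = 31  # Appendix A: bit 31 is used for Symbol fonts


def supported_codepages(bitfield):
    """Return the codepages whose bits are set, in bit order."""
    return [cp for bit, cp in sorted(CODEPAGE_BITS.items())
            if (bitfield >> bit) & 1]


def is_symbol_font(bitfield):
    """Return True if the Symbol bit is set."""
    return bool((bitfield >> SYMBOL_BIT) & 1)


if __name__ == "__main__":
    field = (1 << 0) | (1 << 3)  # Latin 1 and Greek
    print(supported_codepages(field))
    print(is_symbol_font(field))
```

Because the table is fixed, a codepage that is not in it has no bit to set, which is the limitation discussed next.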

Toward our end of adding a new script to Windows, therefore, all we need do is allocate one of those bits to a codepage of our choice and everything is fine. The difficulty is in how to do this. Because there are so few bits, Windows may just as well hard-code the allocation of bits to codepage numbers, which is what it does. There is no way to add a new bit-to-codepage-number mapping to the system. Thus, if Windows does not know about your codepage at design time, it cannot be properly integrated.

The overall upshot of this is that such applications as Word and WordPad do not support codepages beyond a restricted set.

Good News

Thankfully, this lack of a mapping is not insurmountable. At the API level (the level at which programs interact with Windows internally) any codepage is referenceable and usable, if care is taken. Programmers are referred to the MultiByteToWideChar() type function calls which map from 8-bit to Unicode.

Porting from Windows 3.1 to Windows 95

In order to support fonts indicating which codepages they support, the TrueType specification underwent a quiet change between Windows 3.1 and Windows 95. As a result it is possible that a font may work perfectly adequately in Windows 3.1 but not at all in Windows 95. This is because it has not got the codepage information in it. For much of the time Windows 95 guesses quite happily, but this should not be relied upon.

Another change that was added at the same time was that a font can indicate which Unicode ranges it supports. I am not sure what this is used for yet, but I have my suspicions. For a table of which bit means which codepage, see Appendix A: Codepage Bitfields, and for a table of Unicode ranges, see Appendix B: Unicode Subset Bitfields.

If it is necessary to add any of this information to a font, there are a number of tools to help. Typecaster, from version 3, supports two commands at the start of a .cst file. codepage_range is followed by two 32-bit hex values separated by commas, and indicates the codepage bitfield to be included in the font. unicode_range is followed by four 32-bit hex values separated by commas, and indicates the Unicode ranges supported by the font. For example:

code 1
uni 3

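This is the default, used for an ANSI font, and indicates codepage 1252: the codepage value 1 sets bit 0 (codepage 1252 in Appendix A), and the Unicode value 3 sets bits 0 and 1 (Basic Latin and Latin-1 Supplement in Appendix B). Notice that missing values are assumed to be 0.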
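The arithmetic behind such values can be sketched in Python. The function names are hypothetical. The sketch assumes that the first 32-bit value carries bits 0 to 31 and each later value carries the next 32 bits; that ordering agrees with the example above (where the value 1 means bit 0) but is otherwise an assumption, not something confirmed for Typecaster.

```python
def bits_to_words(bits, word_count):
    """Pack bit numbers into a list of 32-bit values, lowest bits first."""
    words = [0] * word_count
    for bit in bits:
        index, offset = divmod(bit, 32)
        if not 0 <= index < word_count:
            raise ValueError(f"bit {bit} does not fit in {word_count} words")
        words[index] |= 1 << offset
    return words


def format_words(words):
    """Render the values as comma-separated hex strings."""
    return ",".join(f"{word:x}" for word in words)


if __name__ == "__main__":
    # Codepage bitfield: two 32-bit values; bit 0 is codepage 1252.
    print(format_words(bits_to_words([0], 2)))
    # Unicode bitfield: four 32-bit values; bits 0 and 1 are
    # Basic Latin and Latin-1 Supplement.
    print(format_words(bits_to_words([0, 1], 4)))
```

Looking up the bit numbers in the appendices and packing them this way gives the hex values to supply for any other combination of codepages or Unicode ranges.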

Fontographer version 4.1 and beyond allows the insertion of the necessary information.

An interim Perl v4 program called hackos2 exists. It allows the manipulation of the OS/2 table in a TrueType font, which is the table that contains the appropriate bitfields.

To mimic the behaviour of Windows 3.1, it is most likely that a user will want to make their font an ANSI font and indicate that it supports codepage 1252. Symbol fonts, whilst not having a codepage file, do have a codepage bit associated with them.

Multilingual Extensions

As another example of what is going on, we can look at the multilingual extensions supplied with Windows 95. To install them, go to Add / Remove Programs in the control panel and click on the Windows Setup tab. From there, select Multilingual Extensions and click OK. You will have to restart Windows to gain the full benefits.

Here is what installing these extensions does.

  • Adds a bunch of codepage files to your Windows\System directory and updates the registry accordingly.
  • Replaces your system fonts (Times New Roman, Arial, Lucida Sans, Courier New, etc.) with large fonts that encompass all the codepages added.
  • Adds different language keyboards. Setting a particular language keyboard indicates to the application the language associated with that keyboard. This is used in such applications as Word 97.

Overall this is probably a worthwhile thing to do if you are intending to work with any scripts beyond Western European.

Unicode: The Future

As far as Windows and NT are concerned, the future is Unicode. This means that underlying storage will be increasingly Unicode. For example, Word 97 stores its data in Unicode, as will WordPad and other applications, and uses conversion techniques to generate 8-bit data when necessary.

Example: Word 97

One of the difficulties encountered with Word 97 sometimes occurs with a font change, when data unpredictably either disappears into little boxes or, when saving, converts to question marks. What is going on?

Word 97 keeps track of which codepage data is entered with. In the case of a Symbol font, there is no associated codepage, due to the vagaries of Unicode. Thus Word 97 converts the data directly into Unicode (and incidentally gives it the system codepage).

Then a user decides to change font to one with a different encoding. In the case of a supported codepage, Word will not allow the user to change the encoding. In the case of Symbol encoding, Word allows you to change the font to one which supports the system codepage. But that font need not support the Unicode values used by the Symbol encoding (U+F020 to U+F0FF), and so those characters are converted into boxes.

There is a mechanism in later versions of Word 97 (Service Release 1) to allow conversion from fonts using alien (to the bitfield system) codepages into fonts with known codepages. But then there is a problem with typing since the 8-bit key-codes are converted using the known, converted, codepage, rather than the alien codepage. So we cannot fully support a new codepage that way.

A second problem arises when storing as 8-bit ASCII text. Word 97 converts the data to ASCII via the system codepage (see the ACP entry in the codepage section of the registry). This conversion, from one codepage to another via Unicode, makes a best approximation to an 8-bit form of the characters, resulting in, for example, the letter a being output rather than a hooked-a or, when there is no good approximation, a question mark. Since the system has no idea what Symbols are, they all get converted to question marks.
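The question-mark behaviour can be imitated in Python. This is only a sketch of what we think is going on, not Word's actual code: the function names are hypothetical, the Symbol mapping (byte value plus 0xF000) is inferred from the Unicode range the Symbol encoding occupies, and Python's `errors="replace"` handler substitutes a question mark without attempting any best approximation.

```python
def symbol_bytes_to_unicode(data):
    """Place 8-bit Symbol-font data at U+F000 plus the byte value (assumed)."""
    return "".join(chr(0xF000 + byte) for byte in data)


def save_as_8bit(text, codepage="cp1252"):
    """Convert Unicode text to 8-bit data, using '?' for anything unmappable."""
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    stored = symbol_bytes_to_unicode(b"AB")
    print([f"U+{ord(ch):04X}" for ch in stored])
    print(save_as_8bit(stored))   # no codepage 1252 equivalent exists
    print(save_as_8bit("plain"))  # ordinary text survives unchanged
```

Since codepage 1252 has no entries for the U+F0xx values, every Symbol character comes out as a question mark, while ordinary Latin text is unaffected.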

This, at least, is what we think Word 97 is up to. Its handling of codepages, and especially Symbol fonts, is consistent in that the same thing happens every time, but not necessarily logical when compared with behaviour in other parts of the program. (For example, try converting some text from Times New Roman to Symbol and back again.)

Conclusion

The future trend towards Unicode support has major implications for those wishing to work with scripts not specified in the version of Unicode that is implemented.

Firstly, there is more information held about characters than just how to render them. There is all sorts of semantic information to do with case, directionality, diacritics, etc. At the moment, this is stored in the codepage, thus allowing one codepage to effectively give a different semantic meaning to a Unicode character than another codepage. NT and probably Windows will tend towards a centralised semantic database for the whole of Unicode. As it is, this is achievable, through compression, in about 9K bytes.
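To give a feel for what a centralised semantic database looks like, the sketch below queries the Unicode character database that ships with Python's standard library (`unicodedata`). It is offered as an analogy only and says nothing about how Windows or NT store this information; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Collect a few semantic properties of a single character."""
    return {
        "category": unicodedata.category(ch),    # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),   # directionality class
        "combining": unicodedata.combining(ch),  # non-zero for combining marks
        "lower": ch.lower(),                     # case mapping
    }


if __name__ == "__main__":
    samples = ["A", "\u05d0", "\u0301", "\uf041"]
    for ch in samples:
        print(f"U+{ord(ch):04X}", describe(ch))
```

In such a database the properties belong to the Unicode value itself, so an uppercase A is an uppercase letter whatever the font draws, and a Private Use value such as U+F041 is reported only as private use (category Co).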

The implication for multilingual users is that it will be increasingly difficult to reinterpret characters to our own ends. Our existing technique of saying that an A acute looks like a high tone diacritic in another font is not going to work so well.

Secondly, Unicode is a data transfer standard, as ASCII was, and rendering directly from Unicode is sometimes very difficult. Our fonts are going to have to become smarter, as will our rendering technology. Scripting issues will increasingly have to become a speciality rather than something that OWLs can necessarily deal with unaided.

Having said all this, as an organisation we are not in an unhealthy position and if we keep working at it, we can stay that way.

Appendix A: Codepage Bitfields

ANSI

Bit      Code page  Description
0        1252       Latin 1
1        1250       Latin 2: Eastern Europe
2        1251       Cyrillic
3        1253       Greek
4        1254       Turkish
5        1255       Hebrew
6        1256       Arabic
7        1257       Baltic
8 - 15              Reserved for ANSI

ANSI and OEM

Bit      Code page  Description
16       874        Thai
17       932        Japanese, Shift-JIS
18       936        Chinese: Simplified chars, PRC and Singapore
19       949        Korean Unified Hangeul Code (Hangeul TongHabHyung Code)
20       950        Chinese: Traditional chars, Taiwan and Hong Kong
21       1361       Korean (Johab)
22 - 29             Reserved for alternate ANSI and OEM
30 - 31             Reserved by system. (Bit 31 is used for Symbol Fonts)

OEM

Bit      Code page  Description
32 - 47             Reserved for OEM
48       869        IBM Greek
49       866        MS-DOS Russian
50       865        MS-DOS Nordic
51       864        Arabic
52       863        MS-DOS Canadian French
53       862        Hebrew
54       861        MS-DOS Icelandic
55       860        MS-DOS Portuguese
56       857        IBM Turkish
57       855        IBM Cyrillic; primarily Russian
58       852        Latin 2
59       775        Baltic
60       737        Greek; former 437 G
61       708        Arabic; ASMO 708
62       850        Western European/Latin 1
63       437        US

Appendix B: Unicode Subset Bitfields

Bit Description
0 Basic Latin
1 Latin-1 Supplement
2 Latin Extended-A
3 Latin Extended-B
4 IPA Extensions
5 Spacing Modifier Letters
6 Combining Diacritical Marks
7 Basic Greek
8 Greek Symbols and Coptic
9 Cyrillic
10 Armenian
11 Basic Hebrew
12 Hebrew Extended
13 Basic Arabic
14 Arabic Extended
15 Devanagari
16 Bengali
17 Gurmukhi
18 Gujarati
19 Oriya
20 Tamil
21 Telugu
22 Kannada
23 Malayalam
24 Thai
25 Lao
26 Basic Georgian
27 Georgian Extended
28 Hangul Jamo
29 Latin Extended Additional
30 Greek Extended
31 General Punctuation
32 Subscripts and Superscripts
33 Currency Symbols
34 Combining Diacritical Marks for Symbols
35 Letter-like Symbols
36 Number Forms
37 Arrows
38 Mathematical Operators
39 Miscellaneous Technical
40 Control Pictures
41 Optical Character Recognition
42 Enclosed Alphanumerics
43 Box Drawing
44 Block Elements
45 Geometric Shapes
46 Miscellaneous Symbols
47 Dingbats
48 Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49 Hiragana
50 Katakana
51 Bopomofo
52 Hangul Compatibility Jamo
53 CJK Miscellaneous
54 Enclosed CJK
55 CJK Compatibility
56 Hangul
57 Reserved for Unicode Subranges
58 Reserved for Unicode Subranges
59 CJK Unified Ideographs
60 Private Use Area
61 CJK Compatibility Ideographs
62 Alphabetic Presentation Forms
63 Arabic Presentation Forms-A
64 Combining Half Marks
65 CJK Compatibility Forms
66 Small Form Variants
67 Arabic Presentation Forms-B
68 Halfwidth and Fullwidth Forms
69 Specials
70-127 Reserved for Unicode Subranges

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.