Computers & Writing Systems
Windows and Codepages
Abstract

This document examines how Windows 95 handles multilingual computing. It looks at Languages, Codepages, Locales, Unicode and Fonts, with particular reference to their support in Windows 95. An alternative title for this document might be: “How to add a new script to Windows 95 and fail”.

Introduction

For those people requiring the availability of different scripts on their computers, a number of tools and approaches are available for Windows 3.1. How appropriate are such tools and approaches in Windows 95, which has better support for multilingual computing? Here we introduce some of the basic concepts used in the rest of this discussion.

Unicode: Unicode is a 16-bit character set. Its primary purpose is data interchange, just like ASCII. Whilst it aims to support every language, as we shall see, care should be taken in assuming that something which supports Unicode will necessarily support your language.

Language ID: A Language ID is a 16-bit number used to identify a particular language. Amongst other things, a particular language ID has exactly one sort order associated with it, so a language with more than one sort order needs more than one ID. For this reason, the language ID is broken into two parts: a 10-bit primary language ID and a 6-bit sub-language ID. For example, US English = 0x0409, whilst UK English = 0x0809. Spanish (Traditional Sort) = 0x040a and Spanish (Modern Sort) = 0x0c0a.

Locale: Each language has a locale identified by the language ID. A locale specifies how to represent certain information in the language, e.g. dates, monetary values and month names. It contains no information on how data is stored or sorted in any scripts of the language. In Windows 95 and Windows NT all locale information is stored in Unicode.

Codepage: Each different encoding in a system needs to have information describing how to map to and from Unicode, to describe the semantics of each character (e.g. upper to lowercase mapping, identifying numbers) and to give default and language-specific sorting information.
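The split of a Language ID into its two parts, as described above, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the placement of the primary language ID in the low 10 bits is inferred from the examples given (US and UK English share the low bits 0x09, both Spanish entries share 0x0a).

```python
def split_langid(langid):
    """Split a 16-bit language ID into (primary, sub-language) parts.

    Assumption: the primary language ID occupies the low 10 bits and
    the sub-language ID the high 6 bits, as the examples suggest.
    """
    primary = langid & 0x3FF          # low 10 bits
    sublang = (langid >> 10) & 0x3F   # high 6 bits
    return primary, sublang


def make_langid(primary, sublang):
    """Rebuild a language ID from its two parts (inverse of split_langid)."""
    return ((sublang & 0x3F) << 10) | (primary & 0x3FF)


# The four language IDs quoted in the text.
EXAMPLES = {
    "US English": 0x0409,
    "UK English": 0x0809,
    "Spanish (Traditional Sort)": 0x040A,
    "Spanish (Modern Sort)": 0x0C0A,
}

for name, langid in EXAMPLES.items():
    primary, sublang = split_langid(langid)
    print(f"{name}: primary=0x{primary:02x}, sub-language={sublang}")
```

Under this reading, the two English entries differ only in their sub-language part, and the same holds for the two Spanish entries, which is how one language can carry two sort orders.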
Each encoding, or codepage, is given a 16-bit number identifying it.

The rest of this document is a discussion of how all this is implemented in Windows 95 and associated products. We start by examining how we might expect it all to work, and then look at the problems with this and the resulting realities. Finally we take a quick peek into the fog, trying to guess what the future might hold.

The Windows 95 Solution

Files

Windows 95 has a single locale file, Windows\System\Locale.nls, which holds all the locale information for every language. Windows 95 also holds one file for each codepage. The names of these files are usually of the form Windows\System\cp_nnnn.nls, where nnnn is the codepage number in decimal. The particular file for a codepage is referenced via the registry at key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\control\Nls\Codepage

Within this key, each codepage number has an entry which references a file relative to Windows\System.

Fonts

TrueType fonts are a technology in themselves. Each font consists of a number of tables holding various pieces of information pertaining to rendering. One of the tables (cmap) is used to map between the external codepoints and the internal glyphs. In Windows (all versions) this mapping is from a 16-bit value (assumed to be Unicode) to a glyph in the font.

Applications normally store data in an 8-bit form, requiring a mapping from the 8-bit form to the 16-bit form used by a font. This is where codepages come into play: they hold the 8-bit to 16-bit mapping information.

In Windows 3.11 there is one, de facto, mapping. The precise nature of the mapping is dictated by the national version of Windows that you have. So, for example, US Windows supports codepage 1252; Thai Windows supports codepage 874; and so on. This also corresponds to the default codepage provided with a particular national version of Windows 95.
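To make the 8-bit to 16-bit mapping concrete, here is a minimal sketch using Python's built-in cp1252 and cp874 codecs as stand-ins for the codepage tables. It does not read the Windows .nls files, and the helper name to_unicode is hypothetical. It simply shows that the same 8-bit value maps to different Unicode values under different codepages.

```python
def to_unicode(data, codepage):
    """Map 8-bit data to Unicode text using the numbered codepage.

    Python's bundled codecs stand in for the Windows codepage files;
    only codepages that Python ships a codec for will work here.
    """
    return data.decode(f"cp{codepage}")


sample = bytes([0x41, 0xA1])  # 'A' followed by the 8-bit value 0xA1

for codepage in (1252, 874):
    text = to_unicode(sample, codepage)
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"codepage {codepage}: {codepoints}")
```

The ASCII letter maps to U+0041 in both cases, while 0xA1 becomes U+00A1 (inverted exclamation mark) under codepage 1252 and U+0E01 (Thai KO KAI) under codepage 874.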
Windows also supports a few other minor mappings: Symbol and OEM (corresponding to DOS), but again, these are fixed and not extensible.

Windows 95, on the other hand, can in theory support any number of codepages. This is particularly useful when doing multilingual computing. Once you get into the realm of multiple codepages, a font needs to indicate which codepages it supports. This is done, to a limited extent, within a TrueType font. Details of how this works, and the limitations it imposes, are covered in the next section.

Windows 95 Implementation

Introduction

Given the solution presented by Windows 95, adding a new orthography to Windows 95 would consist merely of producing a codepage for the encoding and any locale entries for the languages which use that codepage. These could then be inserted into the appropriate locations and, hey presto, we can work with the new orthography in all our applications. Unfortunately Windows 95 has various problems to overcome, and the solution to these problems severely limits the openness of the system.

Let us return to the problem of a font indicating which codepages it supports. This is necessary in order to provide scripting support by choice of font. In a program such as Word, each font is listed with the scripts it supports. Thus, if you install the multilingual extensions to Windows 95, you will have large versions of such fonts as Times New Roman. When you pull down a font selection list in, say, Word, you will see that Times New Roman can be selected in various forms, including Central European, Greek, etc.

In order for Windows to give you this list, it must be able to interrogate the font in question to see which scripts (or codepages) it supports. The information is provided by means of a 64-bit bitfield, stored in the OS/2 table of the TrueType font file, in which each codepage in question is allocated one of the bits.
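As a rough illustration of how such a bitfield can be read, the sketch below decodes a 64-bit value supplied as two 32-bit halves. The function names are hypothetical, and the bit assignments shown are a partial list taken from the published OS/2 table specification rather than from this document, so treat them as an assumption to be checked against the font specification you are working with. It is also assumed that the first 32-bit value holds bits 0 to 31.

```python
# Partial, illustrative table of codepage bits (assumed from the
# published OS/2 table specification; not a complete list).
CODEPAGE_BITS = {
    0: "1252 (Latin 1)",
    1: "1250 (Latin 2: Central European)",
    2: "1251 (Cyrillic)",
    3: "1253 (Greek)",
    4: "1254 (Turkish)",
    5: "1255 (Hebrew)",
    6: "1256 (Arabic)",
    7: "1257 (Baltic)",
    16: "874 (Thai)",
    31: "Symbol",
}


def set_bits(low, high=0):
    """Return the positions of the bits set in the 64-bit bitfield.

    `low` is assumed to hold bits 0 to 31 and `high` bits 32 to 63.
    """
    value = ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
    return [bit for bit in range(64) if value & (1 << bit)]


def describe(low, high=0):
    """Name each set bit, flagging bits missing from the partial table."""
    return [
        CODEPAGE_BITS.get(bit, f"bit {bit} (not in this partial table)")
        for bit in set_bits(low, high)
    ]


print(describe(0x00000001))   # a font claiming only codepage 1252
print(describe(0x0000000B))   # bits 0, 1 and 3 set
print(describe(0x80000000))   # a Symbol font
```

A table like CODEPAGE_BITS is fixed at the time it is written, which is the heart of the limitation discussed in this section: a codepage with no bit allocated to it cannot be announced by a font.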
If the font supports a given codepage, then the corresponding bit is set. Toward our end of adding a new script to Windows, therefore, all we need do is allocate one of those bits to a codepage of our choice and everything is fine. The difficulty is in how to do this. Because there are so few bits, Windows may just as well hard-code the allocation of bits to codepage numbers, which is what it does. There is no way to add a new mapping from a bitfield entry to a codepage number to the system. Thus, if Windows does not know about your codepage at design time, it cannot be properly integrated.

The overall upshot of this is that such applications as Word and WordPad do not support codepages beyond a restricted set.

Good News

Thankfully, this lack of a mapping is not insurmountable. At the API level (the level at which programs interact with Windows internally) any codepage can be referenced and used, if care is taken. Programmers are referred to function calls of the MultiByteToWideChar() type, which map from 8-bit data to Unicode.

Porting from Windows 3.1 to Windows 95

In order to support fonts indicating which codepages they support, the TrueType specification underwent a quiet change between Windows 3.1 and Windows 95. As a result it is possible that a font may work perfectly adequately in Windows 3.1 but not at all in Windows 95, because it lacks the codepage information. For much of the time Windows 95 guesses quite happily, but this should not be relied upon.

Another change made at the same time was that a font can indicate which Unicode ranges it supports. I am not sure what this is used for yet, but I have my suspicions.

For a table of which bit means which codepage, see Appendix A: Codepage Bitfields, and for a table of Unicode ranges, see Appendix B: Unicode Subset Bitfields.

If it is necessary to add any of this information to a font, there are a number of tools to help.

Typecaster, from version 3, supports two commands at the start of a .cst file.
codepage_range is followed by two 32-bit hex values separated by commas, and indicates the codepage bitfield to be included in the font. unicode_range is followed by four 32-bit hex values separated by commas, and indicates the Unicode ranges supported by the font. For example: code 1 uni 3 This is the default, used for an ANSI font, and indicates codepage 1252. Notice that missing values are assumed to be 0.

Fontographer version 4.1 and beyond allows the insertion of the necessary information.

An interim Perl 4 program called hackos2 exists, which allows manipulation of the OS/2 table in a TrueType font, the table that contains the appropriate bitfields.

To mimic the behaviour of Windows 3.1, it is most likely that a user will want to make their font an ANSI font and indicate that it supports codepage 1252. Symbol fonts, whilst not having a codepage file, do have a codepage bit associated with them.

Multilingual Extensions

As another example of what is going on, we can look at the multilingual extensions supplied with Windows 95. To install them, go to in the control panel and click on the tab. From there, select and click . You will have to restart Windows to gain the full benefits.

Here is what installing these extensions does.
Overall this is probably a worthwhile thing to do if you are intending to work with any scripts beyond Western European.

Unicode: The Future

As far as Windows and NT are concerned, the future is Unicode. This means that underlying storage will be increasingly Unicode. For example, Word 97 uses Unicode to store its data, as will WordPad, etc., and will use conversion techniques to generate 8-bit data when necessary.

Example: Word 97

One of the difficulties encountered with Word 97 sometimes occurs with a font change, when data unpredictably either turns into little boxes or, when saving, converts to question marks. What is going on?

Word 97 keeps track of which codepage data is entered with. In the case of a Symbol font, there is no associated codepage, due to the vagaries of Unicode. Thus Word 97 converts the data directly into Unicode (and incidentally gives it the system codepage). Then a user decides to change font to one with a different encoding. In the case of a supported codepage, Word will not allow the user to change the encoding. In the case of Symbol encoding, Word allows you to change the font to one which supports the system codepage. But that font need not support the Unicode values used by the Symbol encoding (U+F020 to U+F0FF), and so those characters are converted into boxes.

There is a mechanism in later versions of Word 97 (Service Release 1) to allow conversion from fonts using alien (to the bitfield system) codepages into fonts with known codepages. But then there is a problem with typing, since the 8-bit key-codes are converted using the known, converted, codepage rather than the alien codepage. So we cannot fully support a new codepage that way.

A second problem arises when storing as 8-bit text. Word 97 converts the data to 8-bit form via the system codepage (see the ACP entry in the codepage section of the registry). This conversion, from one codepage to another via Unicode, makes a best approximation to an 8-bit form of the characters, resulting in, for example, the letter a being output rather than a hooked a; or, when there is no good approximation, a question mark. Since the system has no idea what Symbols are, they all get converted to question marks.

This, at least, is what we think Word 97 is up to. Its handling of codepages, and especially Symbol fonts, is consistent in that the same thing happens every time, but not necessarily logical when compared with behaviour in other parts of the program. (For example, try converting some text from Times New Roman to Symbol and back again.)

Conclusion

The future trend towards Unicode support has major implications for those wishing to work with scripts not specified in the version of Unicode that is implemented.
Firstly, there is more information held about characters than just how to render them. There is all sorts of semantic information to do with case, directionality, diacritics, etc. At the moment, this is stored in the codepage, thus allowing one codepage to effectively give a different semantic meaning to a Unicode character than another codepage. NT and probably Windows will tend towards a centralised semantic database for the whole of Unicode. As it is, this is achievable, through compression, in about 9K bytes. The implication for multilingual users is that it will be increasingly difficult to reinterpret characters to our own ends. Our existing technique of saying that an A acute looks like a high tone diacritic in another font is not going to work so well.

Secondly, Unicode is a data transfer standard, as ASCII was, and rendering directly from Unicode is sometimes very difficult. Our fonts are going to have to become smarter, as will our rendering technology. Scripting issues will increasingly have to become a speciality rather than something that OWLs can necessarily deal with unaided.

Having said all this, as an organisation we are not in an unhealthy position and, if we keep working at it, we can stay that way.

Appendix A: Codepage Bitfields
ANSI
ANSI and OEM
OEM

Appendix B: Unicode Subset Bitfields
© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.