Computers & Writing Systems
TECkit mapping language conversion
The object of this tutorial is to walk through the process of creating and using a conversion mapping written in the TECkit mapping language. The SIL IPA93 legacy font is used as the example, and the example files are worked using the TECkit language approach. Before beginning, you will need to work through the document "Computer setup for encoding conversion".

Making a copy of the font in your work area
Note: It is also possible to make a copy of the font by clicking and dragging, but doing so increases the chance of mistakenly moving the font rather than copying it. It is also possible to work directly with the font file in the Fonts folder, but then the files you generate will end up in that folder as well.

Generating a draft mapping using Encore2Unicode
Note: This function accomplishes the same result as dragging the font file and dropping it onto the shortcut, but reduces the chance of “dropping” the file in the wrong place.

The Encore2Unicode program will put up a window on the screen. This window will remain empty while the program is working. When the draft mapping file has been created, the program will display “Encore2Unicode finished: press Return to exit...”.

Note: If the program detects a problem, the window will display information to help you find what is wrong. This can occur if you mistakenly give it the wrong type of file. If you launch the program by double-clicking on it, it will give you information on how to run the program from the command line.
Note: The content of any .map file must be considered a draft. The Encore2Unicode program has done perhaps eighty or ninety percent of the work, but you must review and correct this draft.

Editing the draft mapping using TECkit Unicode Mapping Editor
Comments
; Draft TECkit mapping file generated by Encore2Unicode from C:UTTutorial3-TECkit IPA93Ipa93dr.ttf
Header
EncodingName "(REG_ID)-SILDoulos_IPA93-(VERSION)"
DescriptiveName ""
Version "0"
Contact "mailto:(YOUR_ADDRESS_HERE)"
RegistrationAuthority "(REG_NAME)"
RegistrationName "SILDoulos IPA93-(VERSION)"
;*** Replace "(REG_ID)" with "SIL" or other organization identifier
;*** Replace "(REG_NAME)" with "SIL International" or other organization name
;*** Replace "(VERSION)" with year the encoding was introduced
;*** Replace "(YOUR_ADDRESS_HERE)" with your email address
;*** Replace font name with other encoding identifier if appropriate
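The header comments list several placeholders that must be replaced by hand. As a convenience, a short script can flag any that are still present. This is only a sketch: the function name `unreplaced_placeholders` is hypothetical, and it assumes that everything after the first “;” on a line is a comment (which would misfire on a quoted string containing a semicolon).

```python
import re

# Placeholder tokens named in the draft header comments.
PLACEHOLDER = re.compile(r"\((REG_ID|REG_NAME|VERSION|YOUR_ADDRESS_HERE)\)")


def unreplaced_placeholders(map_text):
    """Return (line_number, token) pairs for placeholders outside comments."""
    found = []
    for number, line in enumerate(map_text.splitlines(), start=1):
        # Assumption: text after the first ';' is a comment and can be ignored.
        code = line.split(";", 1)[0]
        for match in PLACEHOLDER.finditer(code):
            found.append((number, match.group(0)))
    return found


draft = '''EncodingName "(REG_ID)-SILDoulos_IPA93-(VERSION)"
Version "0"
Contact "mailto:someone@example.org"
;*** Replace "(REG_ID)" with "SIL" or other organization identifier
'''

for number, token in unreplaced_placeholders(draft):
    print(f"line {number}: {token} still needs replacing")
```

The instructional “;***” comment lines are skipped on purpose, so only header statements that still carry a placeholder are reported.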
Normalization flags and control code classes

We’ll leave the normalization flags alone for now.

Note: Observe the syntax for defining a class of characters, either byte or Unicode. This is used to map the byte control codes (hex 00 through 1F) to the Unicode ones (hex 0000 through 001F). Later in this exercise, we will be defining classes for use in describing the context in which a mapping should occur.

Check character assignments - PUA character notes

Note: Since this is a tutorial, we'll be providing a lot of the information. For your own mapping, you'll need to do the research.
0x22 <> U+0069 ; latin_small_letter_i -- SILID 1411*

and consult the entry at the bottom of the file for SILID 1411:

;* SILID 1411: Typically SILID 1411 is used with diacritics and so should map to U+0069 [Basic Latin]. If the orthography distinguishes dotted and dotless 'i' characters, then U+0131 [Latin Extended-A] may be appropriate.

In this case, we do want the U+0069 entry, since the "dotless i" is used when diacritics must be placed over the letter "i". We'll be looking at this entry later, but for now leave it as it is.

Note: TECkit also supports referencing Unicode characters by their names ("latin_small_letter_i", for example) rather than by number.
0xAD <> U+0320 ; combining_minus_sign_below
0xC9 <> U+F180 ; superscript m [PUA]
0xCB <> U+F181 ; superscript left-tail n [PUA]
0xD4 <> U+F182 ; superscript eng [PUA]
0xDC <> U+0304 ; combining_macron
;* SILID 6401: See U+0334 [Combining Diacritical Marks]. Combinations that use a combining overlay (U+0334 [Combining Diacritical Marks]..U+0338 [Combining Diacritical Marks], U+20E5 [Combining Diacritical Marks for Symbols]) should probably be handled as a single, non-decomposed character, and we should probably try to get that character into Unicode before we consider encoding it using the combining overlay characters.

One possibility is to map this character to U+0334; however, it seems better to map specific combinations, for example:

0x6c 0xF2 <> U+026B ; latin_small_letter_l_with_middle_tilde

Comment out the entry for 0xF2 by placing a “;” in front of it, then add the above specific mapping.
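To see why a specific two-byte combination can coexist with single-byte entries, it may help to picture a converter that always tries the longest byte sequence first. The following is a minimal sketch of that idea, not TECkit's own engine. The table holds only two entries, and the identity mapping for 0x6C is an assumption made for illustration.

```python
# Byte-to-Unicode table: keys are byte sequences, values are Unicode strings.
TABLE = {
    b"\x6c": "\u006c",      # plain 'l' (assumed identity mapping, for illustration)
    b"\x6c\xf2": "\u026b",  # 'l' + overlay tilde -> latin_small_letter_l_with_middle_tilde
}


def decode_longest_match(data, table):
    """Convert legacy bytes to Unicode, preferring the longest matching key."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + length]
            if chunk in table:
                out.append(table[chunk])
                i += length
                break
        else:
            # No entry covers this byte, e.g. a lone 0xF2 once its entry is commented out.
            raise ValueError(f"no mapping for byte 0x{data[i]:02X} at offset {i}")
    return "".join(out)


text = decode_longest_match(b"\x6c\x6c\xf2", TABLE)
print([f"U+{ord(ch):04X}" for ch in text])
```

With the 0xF2 entry commented out, a stray overlay byte that does not follow an “l” has no mapping at all, which is what the `ValueError` branch models.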
Check character assignments - Did E2U choose correctly?

Although the mapping file will now compile without errors (try it if you want!), it is still not correct. You’re probably tired of being told that Encore2Unicode creates a draft mapping that must be reviewed carefully, but careful review is exactly what is needed next.

0x27 <> U+0027 ; apostrophe -- quoteright = SILID 9038* (mapped as codepage 1252 character)

The asterisk (*) after the SILID indicates that there is a note at the bottom of the file:

;* SILID 9038: The appropriate mapping of Encore glyph SILID 9038 depends upon how it is used. The following are preferred: for apostrophe (in contractions), U+2019 [General Punctuation]; for quotation, U+2018 [General Punctuation] and U+2019 [General Punctuation] (may use U+201A [General Punctuation] or U+201B [General Punctuation] depending upon locale conventions); for indicating primary stress in IPA transcription, U+02C8 [Spacing Modifier Letters] (U+02B9 [Spacing Modifier Letters] or U+02CA [Spacing Modifier Letters] may be preferred typographically in some dictionaries); for glottalization, ejective consonants or orthographic glottal stop, U+02BC [Spacing Modifier Letters]; for minutes or feet...

In this case, the correct choice is U+02BC, since this font was designed for IPA and this character is the glottal stop. Make the change in the mapping file.

0x27 <> U+02BC ; modifier_letter_apostrophe

Note: The most important thing for you to consider in developing your mapping is how the apostrophe character is used. If it is used as an apostrophe in contractions, you will most likely want to use U+2019.
0x47 <> U+0047 ; latin_capital_letter_g -- SILID 3009* (mapped as codepage 1252 character)

and at the corresponding note at the bottom of the file:

;* SILID 3009: If the small cap attribute is not being used for a character distinction, Encore glyph SILID 3009 should be mapped to U+0047 [Basic Latin].

In this case, the correct choice is U+0262, an IPA character. Make this change in the mapping file.

0x47 <> U+0262 ; latin_letter_small_capital_g
0x49 <> U+026A ; latin_letter_small_capital_i
0x59 <> U+028F ; latin_letter_small_capital_y
0x67 <> U+0261 ; latin_small_letter_script_g

Note: Remember to check whether the characters that Encore2Unicode generated are what you really want. Don’t judge by the shape of the glyph, but by the description of how the character is used. Check the notes at the bottom of the file.
0x7C <> U+0020 U+031A ; space combining_left_angle_above -- SILID 9086

Remove the “U+0020”. This extra space character, which Encore2Unicode generated, is an error. (This will likely be corrected in a later version of the program.)
0x3D <> U+0331 ; combining_macron_below -- SILID 0130 (space glyph ignored) + SILID 6609*

Note: The “SILID 0130 (space glyph ignored)” message is included when Encore2Unicode encounters an overstriking diacritic. You may want to make a global replacement that changes “SILID 0130 (space glyph ignored) + ” to nothing, to eliminate this part of the comment.

As it turns out, E2U chose the wrong character for 0x3D. This is the same character as in the entry for 0xAD, which you corrected previously. Both refer to the same combining diacritic, U+0320, and the entry for 0x3D should read:

0x3D <> U+0320 ; combining_minus_sign_below

Make this change in your file. Later we’ll be adding constraints to these (and other) diacritics so that the correct choice will be made when mapping from Unicode back to the legacy font encoding.
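Once two bytes map to the same Unicode character, the reverse direction becomes ambiguous until a context is added. A quick way to list the entries that will need constraints is to group the byte codes by their Unicode target. The sketch below uses three entries from this tutorial; the function name `ambiguous_reverse` is hypothetical.

```python
from collections import defaultdict

# (legacy byte, Unicode character) pairs taken from the corrected entries.
ENTRIES = [
    (0x3D, "\u0320"),  # combining_minus_sign_below
    (0xAD, "\u0320"),  # combining_minus_sign_below
    (0xDC, "\u0304"),  # combining_macron
]


def ambiguous_reverse(entries):
    """Return Unicode targets that are reached from more than one legacy byte."""
    by_unicode = defaultdict(list)
    for byte, char in entries:
        by_unicode[char].append(byte)
    return {char: codes for char, codes in by_unicode.items() if len(codes) > 1}


for char, codes in ambiguous_reverse(ENTRIES).items():
    sources = ", ".join(f"0x{code:02X}" for code in codes)
    print(f"U+{ord(char):04X} is produced by {sources}")
```

Each Unicode value reported this way needs a context on at least one of its entries so that the Unicode-to-bytes direction has a single answer.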
0x8E <> U+007C ; vertical_line -- SILID 9090*
0x92 <> U+007C U+007C ; vertical_line vertical_line -- SILID 9090* + SILID 9090*
0x96 <> U+007C ; vertical_line -- bar = SILID 9055*

The correct characters for the first two entries are U+01C0 (latin letter dental click) and U+01C1 (latin letter lateral click). Correct these entries in your file to read:

0x8E <> U+01C0 ; latin_letter_dental_click
0x92 <> U+01C1 ; latin_letter_lateral_click
0x84 <> U+2225 ; parallel_to -- SILID 9078*

The note at the end of the file (referenced by the SILID 9078*) says:

;* SILID 9078: See also U+01C1 [Latin Extended-B] and U+2016 [General Punctuation].

The character that Encore2Unicode chose was the mathematical symbol for “is parallel to”. The better choice here is U+2016 (double vertical line). Make this correction in your file:

0x84 <> U+2016 ; double_vertical_line
0x97 <> U+0021 ; exclamation_mark -- exclam = SILID 9016*

Correct this entry to use U+01C3 (latin letter retroflex click):

0x97 <> U+01C3 ; latin_letter_retroflex_click
0x8A <> U+02E5 U+02E5 ; modifier_letter_extra_high_tone_bar modifier_letter_extra_high_tone_bar -- SILID 0655
0x91 <> U+02E6 U+02E6 ; modifier_letter_high_tone_bar modifier_letter_high_tone_bar -- SILID 0644
0x95 <> U+02E7 U+02E7 ; modifier_letter_mid_tone_bar modifier_letter_mid_tone_bar -- SILID 0633
0x9A <> U+02E8 U+02E8 ; modifier_letter_low_tone_bar modifier_letter_low_tone_bar -- SILID 0622
0x9F <> U+02E9 U+02E9 ; modifier_letter_extra_low_tone_bar modifier_letter_extra_low_tone_bar -- SILID 0611

Correct these five entries by deleting the extra Unicode character. (Correct the comment as well.)
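After a round of hand corrections it is easy to change a comment without changing the code point, or the other way round. One way to catch this is to compare each comment name with the official Unicode character name. The sketch below is an assumption-laden helper, not part of TECkit: it handles only simple entries with one byte and one code point, silently skips everything else, and assumes that the comment name is the Unicode name in lower case with spaces and hyphens turned into underscores, as the entries in this tutorial appear to be.

```python
import re
import unicodedata

# Matches simple entries of the form: 0xNN <> U+XXXX ; name ...
ENTRY = re.compile(r"^0x([0-9A-Fa-f]{2})\s*<>\s*U\+([0-9A-Fa-f]{4,6})\s*;\s*(\S+)")


def teckit_style_name(char):
    """Unicode name in the lower-case, underscore-joined style used in the comments."""
    # Characters without a name (for example PUA code points) yield "".
    return unicodedata.name(char, "").lower().replace(" ", "_").replace("-", "_")


def mismatched_names(lines):
    """Return (line, expected_name) for entries whose comment name disagrees."""
    problems = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # multi-character or contextual entries are not checked here
        expected = teckit_style_name(chr(int(match.group(2), 16)))
        if match.group(3) != expected:
            problems.append((line, expected))
    return problems


entries = [
    "0x8E <> U+01C0 ; latin_letter_dental_click",
    "0x8A <> U+02E5 ; modifier_letter_extra_high_tone_bar",
    "0x97 <> U+0021 ; latin_letter_retroflex_click",  # comment edited, code point not
]

for line, expected in mismatched_names(entries):
    print(f"check: {line!r} (U+ value is named {expected!r})")
```

Entries that map to the Private Use Area have no Unicode name, so they will always be reported and need to be checked by eye.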
0xCA <> U+2002 ; en_space -- composite: SILID 0130*

;* SILID 0130: Possible mappings include space U+0020 [Basic Latin]; no-break space U+00A0 [Latin-1 Supplement]; en space U+2002 [General Punctuation]; three-per-em space U+2004 [General Punctuation]; four-per-em space U+2005 [General Punctuation]; six-per-em space U+2006 [General Punctuation]; figure space U+2007 [General Punctuation]; punctuation space U+2008 [General Punctuation]; thin space U+2009 [General Punctuation]; hair space U+200A [General Punctuation]; zero width space U+200B [General Punctuation]; ideographic space U+3000 [CJK Symbols and Punctuation] and zero width no-break space U+FEFF [Arabic Presentation Forms-B].

There are many spaces to choose from, and Encore2Unicode doesn’t have enough information, so it made its best guess. This character should be the “hair space”, U+200A. Make this change in the file.

Adding context to constrain mappings

Tip: Since this tutorial was written, the TECkit mapping language has been enhanced to allow contexts to be defined once and then referenced in each entry which needs them.

Note: In the following steps, “(*)” indicates a step (or part of a step) that can be omitted if time is short. The resulting mapping file will be able to handle the specific test data, but will not be a general solution. The goal of the tutorial is to show you the process rather than to reproduce this specific mapping.
0x22 <> U+0069 ; latin_small_letter_i
0x69 <> U+0069 ; latin_small_letter_i

The first is a “dotless i” used when diacritics are placed over an “i”. The second is the normal “dotted i”. When converting from legacy byte encodings to Unicode, both map to the Unicode character U+0069. This is fine, since the Unicode font is smart enough to remove the dot when putting on an accent. However, in order to convert from Unicode back to the legacy byte encoding, we need to supply a context to constrain one of these mappings. (We could constrain both, but it is easier to pick one as the default case.)

When converting from Unicode to bytes, we want to pick the “dotless i” (0x22) when it is followed by a diacritic that is rendered above it. Rather than list all the possible combinations, we’ll make use of the “class” construct we saw earlier under “Normalization flags and control code classes”. We also need to allow for the possibility of a diacritic that is rendered under the “i” coming between the “i” and the diacritic over it. For your reference, the diacritics that appear in the SIL IPA93 legacy font are listed in the table in Appendix I.

After the [CTL] <> [CTL] statement, create a Unicode class called “uDia” and assign to it all the diacritics that are in the “Above” section of the table in Appendix I. Then create another Unicode class called “lDia” and assign to it all the diacritics that are in the “Below” section of the table. You should end up with something like this:

; diacritics which appear above a base character
UniClass [uDia] = (U+0300 U+0301 U+0302 U+0303 U+0304 U+0306 U+0308 U+030A U+030B U+030C U+030F U+033D U+0361)
; diacritics which appear below a base character
UniClass [lDia] = (U+0318 U+0319 U+031C U+031D U+031E U+031F U+0320 U+0324 U+0325 U+0329 U+032A U+032C U+032F U+0330 U+0339 U+033A U+033B U+033C)

Tip: If you want to break a long line, you will need to place a “\” character at the end of the line.
Armed with these class definitions, we can write the mapping for the “dotless i”. Since the constraint applies when mapping from Unicode to bytes, the context is added to the Unicode side of the mapping. Change the entry for 0x22 to be:

0x22 <> U+0069 / _ [lDia]? [uDia]

The “_” indicates the character being mapped (U+0069 in this case). The “[lDia]?” indicates zero or one occurrences of a member of the “lDia” class. The “[uDia]” indicates one occurrence of a member of the “uDia” class.

(*) Now add a similar context to the entries for 0xAE (dotless i with stroke) and 0xBE (dotless j).
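The effect of the context “_ [lDia]? [uDia]” can be sketched outside TECkit as a small lookahead test. The example below is an illustration only, using the two class definitions from this tutorial; the helper names `char_class` and `encode_i` are hypothetical, and the code handles just the choice between 0x22 and 0x69 rather than a full conversion.

```python
import re

# Class members copied from the UniClass definitions in this tutorial.
U_DIA = [0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0308,
         0x030A, 0x030B, 0x030C, 0x030F, 0x033D, 0x0361]
L_DIA = [0x0318, 0x0319, 0x031C, 0x031D, 0x031E, 0x031F, 0x0320, 0x0324, 0x0325,
         0x0329, 0x032A, 0x032C, 0x032F, 0x0330, 0x0339, 0x033A, 0x033B, 0x033C]


def char_class(code_points):
    """Build a regex character class from a list of code points."""
    return "[" + "".join(re.escape(chr(cp)) for cp in code_points) + "]"


# Equivalent of the context "[lDia]? [uDia]" that follows the "_".
DOTLESS_CONTEXT = re.compile(char_class(L_DIA) + "?" + char_class(U_DIA))


def encode_i(text, index):
    """Choose the legacy byte for the U+0069 found at text[index]."""
    if text[index] != "\u0069":
        raise ValueError("expected U+0069 at the given index")
    # Dotless i (0x22) only when an above-diacritic follows,
    # optionally preceded by one below-diacritic.
    if DOTLESS_CONTEXT.match(text, index + 1):
        return 0x22
    return 0x69  # default: normal dotted i


for sample in ["i\u0301", "i\u0330\u0301", "i\u0330", "in"]:
    print([f"U+{ord(ch):04X}" for ch in sample], "->", f"0x{encode_i(sample, 0):02X}")
```

Constraining only the 0x22 entry and leaving 0x69 as the unconstrained default mirrors the choice made in the mapping file: the default applies whenever the context does not match.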
0xDD <> U+0300 ; combining_grave_accent

to:

0xDD <> U+0300 / [iWid] [lDia]? _ ; i-width combining_grave_accent
0xBC <> U+0330 ; combining_tilde_below

to:

0xBC <> U+0330 / [iWid] _ ; i-width combining_tilde_below
Note: The constraints used in the actual .map file are more complex than those used in this tutorial and allow combinations such as “i/tilde/acute” to be mapped correctly from Unicode to the legacy IPA font (where the acute accent must be “i-width high” because of the tilde).
Using DropTEC
Appendix I - Diacritics in SIL IPA93 font
Diacritics occurring above the base character
Diacritics occurring below the base character

© 2003-2024 SIL International, all rights reserved, unless otherwise noted elsewhere on this page.