DMDX Help.


Unicode conversion from RTF character set to code page notes.

   

   The Unicode code path in DMDX converts RTF character set references in item files to code page references and then runs the text through Windows routines that convert it to UTF-8 or UTF-16 (depending on whether the text occurs within quotes or not).  Those routines (MultiByteToWideChar and WideCharToMultiByte) require code pages, not RTF character set references, so DMDX has to convert between the two.  Unfortunately there's no complete table of character sets and their code page equivalents, so I've had to cobble together a list from various sources.  It's included here because if your language doesn't have a charset reference DMDX understands, it's probably going to mojibake your text but good.

   In the following table the first number on each row is the code page that will be used if a match is found and the second number is the character set ("charset") value that can be found in the font definition table in the RTF item file.  Last comes a font name, which DMDX uses to guess the code page when there's no matching charset.  Entries with a -1 charset are ones for which no charset reference was found, so they are strictly name-based guesses, and NULL for a name means the entry is just a charset reference.  Text following a // is a comment.  Also note that the name search is by substring, so "Japan" matches "Japanese" as well as "Japan", and it is case insensitive.  If anyone really needs it I could implement a keyword to add to this table, and as long as you put it first thing in your item file it would probably translate correctly, but I'm guessing that finding an editor that uses Unicode instead of double byte (or other) code page references is a better bet.  Double byte code page references are pretty passé these days.

{ 1252, 0, NULL }, // from https://msdn.microsoft.com/en-us/library/cc194829.aspx; a more complete (better) list is here: https://docs.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers
{ 932, 128, "SHIFTJIS" }, 
{ 949, 129, "HANGUL" },
{ 936, 134, "GB2312" }, // Chinese (listed later as SimSun but we might need a 936, -1, "Chinese" entry)
{ 950, 136, "CHINESEBIG5" },
{ 1253, 161, "GREEK" },
{ 1254, 162, "TURKISH" },
{ 1255, 177, "HEBREW" },
{ 1256, 178, "ARABIC" },
{ 1257, 186, "BALTIC" },
{ 1251, 204, "RUSSIAN" },
{ 874, 222, "THAI" },
{ 1250, 238, NULL },
{ 1258, 163, "Vietnamese" }, // inferred from usage and https://en.wikipedia.org/wiki/Code_page
{ 936, -1, "SimSun" }, // from https://support.office.com/en-gb/article/choose-text-encoding-when-you-open-and-save-files-60d59c21-88b5-4006-831c-d536d42fd861#bm4
{ 950, -1, "MingLiU" },
{ 1251, -1, "Cyrillic" },
{ 57004, -1, "Tamil" },
{ 57002, -1, "Nepali" },
{ 57002, -1, "Konkani" },
{ 57002, -1, "Hindi" },
{ 57006, -1, "Assamese" },
{ 57003, -1, "Bengali" },
{ 57010, -1, "Gujarati" },
{ 57008, -1, "Kannada" },
{ 57009, -1, "Malayalam" },
{ 57007, -1, "Oriya" },
{ 57002, -1, "Marathi" },
{ 57011, -1, "Punjabi" },
{ 57005, -1, "Telugu" },
{ 57002, -1, "Sanskrit" },
{ 932, -1, "Japan" },
{ 949, -1, "Korean" },
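
   The lookup described above can be sketched roughly as follows.  This is only an illustration and not DMDX's actual source: the names charset_map, contains_nocase and lookup_code_page are made up for the example, the table is abridged, and the order of the search (charset value first, font name second, 0 when nothing matches) is an assumption drawn from the description rather than something confirmed by the code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical row layout: code page, RTF charset (-1 = none), font name (NULL = none). */
struct charset_map {
    int code_page;
    int charset;
    const char *name;
};

/* Abridged copy of the table above, just enough rows to show both kinds of match. */
static const struct charset_map table[] = {
    { 1252, 0, NULL },
    { 932, 128, "SHIFTJIS" },
    { 1251, 204, "RUSSIAN" },
    { 936, -1, "SimSun" },
    { 57002, -1, "Hindi" },
    { 932, -1, "Japan" },
};

/* Case-insensitive substring test: returns 1 if needle occurs anywhere in hay. */
int contains_nocase(const char *hay, const char *needle)
{
    assert(hay != NULL && needle != NULL);
    for (; *hay; hay++) {
        size_t i = 0;
        while (needle[i] && hay[i] &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i]))
            i++;
        if (needle[i] == '\0')
            return 1;   /* reached the end of needle, so it matched here */
    }
    return 0;
}

/* Assumed search order: exact charset match first, then a guess from the font name.
   Pass charset = -1 when the font definition has no charset value.
   Returns 0 (a made-up "no match" value) when nothing in the table fits. */
int lookup_code_page(int charset, const char *font_name)
{
    size_t n = sizeof table / sizeof table[0];
    size_t i;

    if (charset >= 0)
        for (i = 0; i < n; i++)
            if (table[i].charset == charset)
                return table[i].code_page;

    if (font_name != NULL)
        for (i = 0; i < n; i++)
            if (table[i].name != NULL && contains_nocase(font_name, table[i].name))
                return table[i].code_page;

    return 0;
}
```

   With this sketch, lookup_code_page(204, "Arial CYR") returns 1251 through the charset value, while lookup_code_page(-1, "MS Japanese Gothic") (an invented font name) returns 932 because "Japan" is a substring of the font name.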

 



