1.4.2 • Published 8 years ago

@balderdash/codepage v1.4.2

License
Apache-2.0

Codepages for JS

Codepages are character encodings. In many contexts, single- or double-byte character sets are used in lieu of Unicode encodings. The codepages map between characters and numbers.

unicode.org hosts lists of mappings. The build script automatically downloads and parses the mappings in order to generate the full script. The pages.csv description in codepage.md controls which codepages are used.
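As a rough illustration of the parsing step, the sketch below reads text in the layout used by the single-byte mapping tables on unicode.org: one entry per line, with the byte value and the Unicode code point as hexadecimal columns and `#` starting a comment. The function name parseMapping and the sample rows are hypothetical and are not taken from the build script, which may handle more cases (for example, double-byte tables).

```javascript
// Hypothetical helper: parse mapping-file text into a { byteValue: character }
// lookup. This is a sketch, not the actual build script.
function parseMapping(text) {
  var dec = {};
  text.split(/\r?\n/).forEach(function (line) {
    // Drop comments and surrounding whitespace.
    var body = line.replace(/#.*$/, '').trim();
    if (!body) return;
    var cols = body.split(/\s+/);
    // Rows with no Unicode column mark undefined bytes; skip them.
    if (cols.length < 2) return;
    var code = parseInt(cols[0], 16);
    var unicode = parseInt(cols[1], 16);
    if (isNaN(code) || isNaN(unicode)) return;
    // String.fromCharCode is enough for this sample (BMP characters only).
    dec[code] = String.fromCharCode(unicode);
  });
  return dec;
}

var sample = [
  '# sample rows in the mapping-file layout',
  '0x41\t0x0041\t#LATIN CAPITAL LETTER A',
  '0x80\t0x20AC\t#EURO SIGN',
  '0x81\t      \t#UNDEFINED'
].join('\n');

var table = parseMapping(sample);
console.log(table[0x80]); // €
```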

Setup

In node:

var cptable = require('codepage');

In the browser:

<script src="cptable.js"></script>
<script src="cputils.js"></script>

Alternatively, use the full version in the dist folder:

<script src="cptable.full.js"></script>

The complete set of codepages is large because of the Double Byte Character Set (DBCS) encodings. This repo also provides a much smaller file that includes only the single-byte (SBCS) codepages (sbcs.js), as well as a file limited to the codepages known to be used in Excel (cpexcel.js).

If you know which codepages you need, you can include individual scripts for each codepage. The individual files are provided in the bits/ directory. For example, to include only the Mac codepages:

<script src="bits/10000.js"></script>
<script src="bits/10006.js"></script>
<script src="bits/10007.js"></script>
<script src="bits/10029.js"></script>
<script src="bits/10079.js"></script>
<script src="bits/10081.js"></script>

All of the browser scripts define and append to the cptable object. To rename the object, edit the JSVAR shell variable in make.sh and run the script.

The utility functions are contained in cputils.js, which assumes that the appropriate codepage scripts have already been loaded.

Usage

The codepages are indexed by number. To get the Unicode character for a given codepoint, use the dec property:

var unicode_cp10000_255 = cptable[10000].dec[255]; // ˇ

To get the codepoint for a given character, use the enc property:

var cp10000_711 = cptable[10000].enc[String.fromCharCode(711)]; // 255
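The two properties are inverses of each other. The sketch below uses a tiny hand-written table (the name mini and its three entries are made up for illustration and are not a real codepage) to show how an enc-style lookup could be derived from a dec-style one.

```javascript
// Hypothetical three-entry table; real tables come from the codepage scripts.
var mini = { dec: { 65: 'A', 128: '\u20AC', 255: '\u02C7' }, enc: {} };

// Build the reverse lookup: character -> code.
Object.keys(mini.dec).forEach(function (code) {
  mini.enc[mini.dec[code]] = Number(code);
});

console.log(mini.dec[255]);                      // ˇ
console.log(mini.enc[String.fromCharCode(711)]); // 255
```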

There are a few utilities that deal with strings and buffers:

var 汇总 = cptable.utils.decode(936, [0xbb,0xe3,0xd7,0xdc]);
var buf =  cptable.utils.encode(936,  汇总);
var sushi= cptable.utils.decode(65001, [0xf0,0x9f,0x8d,0xa3]); // 🍣
var sbuf = cptable.utils.encode(65001, sushi);
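For codepage 65001 (UTF-8), the byte values in the example can be cross-checked in Node with the built-in Buffer alone. This is only a sanity check of the bytes; it does not describe how cptable.utils works internally, and the variable names are chosen for this sketch.

```javascript
// Uses only Node built-ins; cptable is not involved.
var sushiBytes = Buffer.from([0xf0, 0x9f, 0x8d, 0xa3]);
var sushiText = sushiBytes.toString('utf8');

console.log(sushiText);                      // 🍣
console.log(sushiText.length);               // 2 (one surrogate pair)
console.log(Buffer.from(sushiText, 'utf8')); // <Buffer f0 9f 8d a3>
```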

cptable.utils.encode(CP, data, ofmt) accepts a String or Array of characters and returns a representation controlled by ofmt:

  • By default, the output is a Buffer (or an Array) of bytes (integers between 0 and 255).
  • If ofmt == 'str', the output is a String o where o.charCodeAt(i) is the ith byte.
  • If ofmt == 'arr', the output is an Array of bytes.
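The three output shapes can be pictured with the sketch below. The helper formatBytes is hypothetical and only mimics the behavior described in the list above; it is not the library's implementation.

```javascript
// Hypothetical helper mirroring the described ofmt options.
function formatBytes(bytes, ofmt) {
  if (ofmt === 'str') {
    // One character per byte, so o.charCodeAt(i) is the ith byte.
    return bytes.map(function (b) { return String.fromCharCode(b); }).join('');
  }
  if (ofmt === 'arr') return bytes.slice();
  // Default: a Buffer where available, otherwise a plain Array.
  return typeof Buffer !== 'undefined' ? Buffer.from(bytes) : bytes.slice();
}

var bytes = [0xbb, 0xe3, 0xd7, 0xdc];
var asStr = formatBytes(bytes, 'str');
console.log(asStr.charCodeAt(0) === 0xbb);        // true
console.log(formatBytes(bytes, 'arr'));           // [ 187, 227, 215, 220 ]
console.log(Buffer.isBuffer(formatBytes(bytes))); // true in Node
```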

Known Excel Codepages

A much smaller script, including only the codepages known to be used in Excel, is available under the name cpexcel. It exposes the same variable cptable and is suitable as a drop-in replacement when the full codepage tables are not needed.

In node:

var cptable = require('codepage/dist/cpexcel.full');

Rolling your own script

The make.sh script in the repo can take a manifest and generate JS source.

Usage:

bash make.sh path_to_manifest output_file_name JSVAR

where

  • JSVAR is the name of the exported variable (generally cptable)
  • output_file_name is the output file (e.g. cpexcel.js, cptable.js)
  • path_to_manifest is the path to the manifest file.

The manifest file is expected to be a CSV with 3 columns:

<codepage number>,<source>,<size>

If a source is specified, the script tries to download the specified file and parse it. The file is expected to follow the format used on the unicode.org site. The size should be 1 for a single-byte codepage and 2 for a double-byte codepage. For mixed codepages (which use some single-byte and some double-byte codes), the script assumes the mapping is a prefix code and generates efficient JS code.
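To make the manifest layout concrete, here is a minimal sketch of reading such a CSV. The function parseManifest, the sample rows, and the placeholder URLs are all hypothetical; the real parsing is done by the build scripts and may differ.

```javascript
// Hypothetical manifest reader; not the code used by make.sh.
function parseManifest(text) {
  return text.split(/\r?\n/).filter(function (line) {
    return line.trim().length > 0;
  }).map(function (line) {
    var cols = line.split(',');
    return {
      codepage: parseInt(cols[0], 10),
      source: cols[1] || null,    // empty column: nothing to download
      size: parseInt(cols[2], 10) // 1 = single-byte, 2 = double-byte
    };
  });
}

// Made-up rows; the URLs are placeholders, not real mapping locations.
var manifest = parseManifest([
  '37,https://example.invalid/CP037.TXT,1',
  '708,,1',
  '936,https://example.invalid/CP936.TXT,2'
].join('\n'));

console.log(manifest[1]); // { codepage: 708, source: null, size: 1 }
```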

Generated scripts include only the mapping. Concatenate (cat) a mapping with cputils.js to produce a complete script like cpexcel.full.js.

Building the complete script

This script uses voc. The script to build the codepage tables and the JS source is codepage.md, so building is as simple as voc codepage.md.

Generated Codepages

The complete list of hardcoded codepages can be found in the file pages.csv.

Some codepages are easier to implement algorithmically. Because these are hardcoded in the utilities, they have no corresponding entry in pages.csv (they are "magic").

| CP# | Information | Description |
| --- | --- | --- |
| 37 | unicode.org | IBM EBCDIC US-Canada |
| 437 | unicode.org | OEM United States |
| 500 | unicode.org | IBM EBCDIC International |
| 620 | NLS | Mazovia (Polish) MS-DOS |
| 708 | MakeEncoding.cs | Arabic (ASMO 708) |
| 720 | MakeEncoding.cs | Arabic (Transparent ASMO); Arabic (DOS) |
| 737 | unicode.org | OEM Greek (formerly 437G); Greek (DOS) |
| 775 | unicode.org | OEM Baltic; Baltic (DOS) |
| 850 | unicode.org | OEM Multilingual Latin 1; Western European (DOS) |
| 852 | unicode.org | OEM Latin 2; Central European (DOS) |
| 855 | unicode.org | OEM Cyrillic (primarily Russian) |
| 857 | unicode.org | OEM Turkish; Turkish (DOS) |
| 858 | MakeEncoding.cs | OEM Multilingual Latin 1 + Euro symbol |
| 860 | unicode.org | OEM Portuguese; Portuguese (DOS) |
| 861 | unicode.org | OEM Icelandic; Icelandic (DOS) |
| 862 | unicode.org | OEM Hebrew; Hebrew (DOS) |
| 863 | unicode.org | OEM French Canadian; French Canadian (DOS) |
| 864 | unicode.org | OEM Arabic; Arabic (864) |
| 865 | unicode.org | OEM Nordic; Nordic (DOS) |
| 866 | unicode.org | OEM Russian; Cyrillic (DOS) |
| 869 | unicode.org | OEM Modern Greek; Greek, Modern (DOS) |
| 870 | MakeEncoding.cs | IBM EBCDIC Multilingual/ROECE (Latin 2) |
| 874 | unicode.org | Windows Thai |
| 875 | unicode.org | IBM EBCDIC Greek Modern |
| 895 | NLS | Kamenický (Czech) MS-DOS |
| 932 | unicode.org | Japanese Shift-JIS |
| 936 | unicode.org | Simplified Chinese GBK |
| 949 | unicode.org | Korean |
| 950 | unicode.org | Traditional Chinese Big5 |
| 1026 | unicode.org | IBM EBCDIC Turkish (Latin 5) |
| 1047 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System |
| 1140 | MakeEncoding.cs | IBM EBCDIC US-Canada (037 + Euro symbol) |
| 1141 | MakeEncoding.cs | IBM EBCDIC Germany (20273 + Euro symbol) |
| 1142 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway (20277 + Euro symbol) |
| 1143 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden (20278 + Euro symbol) |
| 1144 | MakeEncoding.cs | IBM EBCDIC Italy (20280 + Euro symbol) |
| 1145 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain (20284 + Euro symbol) |
| 1146 | MakeEncoding.cs | IBM EBCDIC United Kingdom (20285 + Euro symbol) |
| 1147 | MakeEncoding.cs | IBM EBCDIC France (20297 + Euro symbol) |
| 1148 | MakeEncoding.cs | IBM EBCDIC International (500 + Euro symbol) |
| 1149 | MakeEncoding.cs | IBM EBCDIC Icelandic (20871 + Euro symbol) |
| 1200 | magic | Unicode UTF-16, little endian (BMP of ISO 10646) |
| 1201 | magic | Unicode UTF-16, big endian |
| 1250 | unicode.org | Windows Central Europe |
| 1251 | unicode.org | Windows Cyrillic |
| 1252 | unicode.org | Windows Latin I |
| 1253 | unicode.org | Windows Greek |
| 1254 | unicode.org | Windows Turkish |
| 1255 | unicode.org | Windows Hebrew |
| 1256 | unicode.org | Windows Arabic |
| 1257 | unicode.org | Windows Baltic |
| 1258 | unicode.org | Windows Vietnam |
| 1361 | MakeEncoding.cs | Korean (Johab) |
| 10000 | unicode.org | MAC Roman |
| 10001 | MakeEncoding.cs | Japanese (Mac) |
| 10002 | MakeEncoding.cs | MAC Traditional Chinese (Big5) |
| 10003 | MakeEncoding.cs | Korean (Mac) |
| 10004 | MakeEncoding.cs | Arabic (Mac) |
| 10005 | MakeEncoding.cs | Hebrew (Mac) |
| 10006 | unicode.org | Greek (Mac) |
| 10007 | unicode.org | Cyrillic (Mac) |
| 10008 | MakeEncoding.cs | MAC Simplified Chinese (GB 2312) |
| 10010 | MakeEncoding.cs | Romanian (Mac) |
| 10017 | MakeEncoding.cs | Ukrainian (Mac) |
| 10021 | MakeEncoding.cs | Thai (Mac) |
| 10029 | unicode.org | MAC Latin 2 (Central European) |
| 10079 | unicode.org | Icelandic (Mac) |
| 10081 | unicode.org | Turkish (Mac) |
| 10082 | MakeEncoding.cs | Croatian (Mac) |
| 12000 | magic | Unicode UTF-32, little endian byte order |
| 12001 | magic | Unicode UTF-32, big endian byte order |
| 20000 | MakeEncoding.cs | CNS Taiwan (Chinese Traditional) |
| 20001 | MakeEncoding.cs | TCA Taiwan |
| 20002 | MakeEncoding.cs | Eten Taiwan (Chinese Traditional) |
| 20003 | MakeEncoding.cs | IBM5550 Taiwan |
| 20004 | MakeEncoding.cs | TeleText Taiwan |
| 20005 | MakeEncoding.cs | Wang Taiwan |
| 20105 | MakeEncoding.cs | Western European IA5 (IRV International Alphabet 5) 7-bit |
| 20106 | MakeEncoding.cs | IA5 German (7-bit) |
| 20107 | MakeEncoding.cs | IA5 Swedish (7-bit) |
| 20108 | MakeEncoding.cs | IA5 Norwegian (7-bit) |
| 20127 | magic | US-ASCII (7-bit) |
| 20261 | MakeEncoding.cs | T.61 |
| 20269 | MakeEncoding.cs | ISO 6937 Non-Spacing Accent |
| 20273 | MakeEncoding.cs | IBM EBCDIC Germany |
| 20277 | MakeEncoding.cs | IBM EBCDIC Denmark-Norway |
| 20278 | MakeEncoding.cs | IBM EBCDIC Finland-Sweden |
| 20280 | MakeEncoding.cs | IBM EBCDIC Italy |
| 20284 | MakeEncoding.cs | IBM EBCDIC Latin America-Spain |
| 20285 | MakeEncoding.cs | IBM EBCDIC United Kingdom |
| 20290 | MakeEncoding.cs | IBM EBCDIC Japanese Katakana Extended |
| 20297 | MakeEncoding.cs | IBM EBCDIC France |
| 20420 | MakeEncoding.cs | IBM EBCDIC Arabic |
| 20423 | MakeEncoding.cs | IBM EBCDIC Greek |
| 20424 | MakeEncoding.cs | IBM EBCDIC Hebrew |
| 20833 | MakeEncoding.cs | IBM EBCDIC Korean Extended |
| 20838 | MakeEncoding.cs | IBM EBCDIC Thai |
| 20866 | MakeEncoding.cs | Russian Cyrillic (KOI8-R) |
| 20871 | MakeEncoding.cs | IBM EBCDIC Icelandic |
| 20880 | MakeEncoding.cs | IBM EBCDIC Cyrillic Russian |
| 20905 | MakeEncoding.cs | IBM EBCDIC Turkish |
| 20924 | MakeEncoding.cs | IBM EBCDIC Latin 1/Open System (1047 + Euro symbol) |
| 20932 | MakeEncoding.cs | Japanese (JIS 0208-1990 and 0212-1990) |
| 20936 | MakeEncoding.cs | Simplified Chinese (GB2312-80) |
| 20949 | MakeEncoding.cs | Korean Wansung |
| 21025 | MakeEncoding.cs | IBM EBCDIC Cyrillic Serbian-Bulgarian |
| 21027 | NLS | Extended/Ext Alpha Lowercase |
| 21866 | MakeEncoding.cs | Ukrainian Cyrillic (KOI8-U) |
| 28591 | unicode.org | ISO 8859-1 Latin 1 (Western European) |
| 28592 | unicode.org | ISO 8859-2 Latin 2 (Central European) |
| 28593 | unicode.org | ISO 8859-3 Latin 3 |
| 28594 | unicode.org | ISO 8859-4 Baltic |
| 28595 | unicode.org | ISO 8859-5 Cyrillic |
| 28596 | unicode.org | ISO 8859-6 Arabic |
| 28597 | unicode.org | ISO 8859-7 Greek |
| 28598 | unicode.org | ISO 8859-8 Hebrew (ISO-Visual) |
| 28599 | unicode.org | ISO 8859-9 Turkish |
| 28600 | unicode.org | ISO 8859-10 Latin 6 |
| 28601 | unicode.org | ISO 8859-11 Latin (Thai) |
| 28603 | unicode.org | ISO 8859-13 Latin 7 (Estonian) |
| 28604 | unicode.org | ISO 8859-14 Latin 8 (Celtic) |
| 28605 | unicode.org | ISO 8859-15 Latin 9 |
| 28606 | unicode.org | ISO 8859-16 Latin 10 |
| 29001 | MakeEncoding.cs | Europa 3 |
| 38598 | MakeEncoding.cs | ISO 8859-8 Hebrew (ISO-Logical) |
| 50220 | MakeEncoding.cs | ISO 2022 JIS Japanese with no halfwidth Katakana |
| 50221 | MakeEncoding.cs | ISO 2022 JIS Japanese with halfwidth Katakana |
| 50222 | MakeEncoding.cs | ISO 2022 Japanese JIS X 0201-1989 (1 byte Kana-SO/SI) |
| 50225 | MakeEncoding.cs | ISO 2022 Korean |
| 50227 | MakeEncoding.cs | ISO 2022 Simplified Chinese |
| 51932 | MakeEncoding.cs | EUC Japanese |
| 51936 | MakeEncoding.cs | EUC Simplified Chinese |
| 51949 | MakeEncoding.cs | EUC Korean |
| 52936 | MakeEncoding.cs | HZ-GB2312 Simplified Chinese |
| 54936 | MakeEncoding.cs | GB18030 Simplified Chinese (4 byte) |
| 57002 | MakeEncoding.cs | ISCII Devanagari |
| 57003 | MakeEncoding.cs | ISCII Bengali |
| 57004 | MakeEncoding.cs | ISCII Tamil |
| 57005 | MakeEncoding.cs | ISCII Telugu |
| 57006 | MakeEncoding.cs | ISCII Assamese |
| 57007 | MakeEncoding.cs | ISCII Oriya |
| 57008 | MakeEncoding.cs | ISCII Kannada |
| 57009 | MakeEncoding.cs | ISCII Malayalam |
| 57010 | MakeEncoding.cs | ISCII Gujarati |
| 57011 | MakeEncoding.cs | ISCII Punjabi |
| 65000 | magic | Unicode (UTF-7) |
| 65001 | magic | Unicode (UTF-8) |

Note that MakeEncoding.cs deviates from unicode.org for some codepages. In the case of direct conflicts, unicode.org takes precedence. Where the unicode.org listing does not prescribe a value, the MakeEncoding.cs value is used.

NLS refers to the National Language Support files supplied in various versions of Windows. In older versions of Windows (e.g. Windows 98) these files followed the pattern CP_#.NLS, but newer versions use the pattern C_#.NLS.
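As an illustration of why the "magic" entries need no table, UTF-16 little endian (1200) can be decoded with arithmetic alone. The function below is a sketch written for this document; the name decodeUTF16LE is hypothetical, and the library's own routine may differ.

```javascript
// Hypothetical sketch: combine each pair of bytes (low byte first)
// into one UTF-16 code unit. A trailing odd byte is ignored.
function decodeUTF16LE(bytes) {
  var out = '';
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

console.log(decodeUTF16LE([0x48, 0x00, 0x69, 0x00])); // Hi
console.log(decodeUTF16LE([0xc7, 0x02]));             // ˇ
```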
