1.0.4 • Published 1 year ago
multibyte
multibyte provides common string functions that respect multibyte Unicode characters.
npm install multibyte
The problem and the solution
JavaScript strings are encoded as UTF-16, and their indexes, length, and most String methods operate on 16-bit code units rather than on Unicode code points. Characters outside the Basic Multilingual Plane (like newer emoji) need more than 2 bytes in UTF-16, so they are stored as a surrogate pair of 2 code units and get split in half in many situations.
If you display Unicode text from a UTF-8 source, you need these multibyte
functions, which take advantage of the fact that Array.from(string)
iterates by code point and therefore keeps surrogate pairs intact.
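To see why iterating by code point helps, the following sketch uses only built-in JavaScript (no multibyte functions) to compare UTF-16 code units with the code points that Array.from(string) yields. The variable names are illustrative and not part of the package.

```javascript
// Built-ins only; variable names are illustrative, not part of multibyte.
const text = 'a\u{1F680}c'; // "a", the rocket emoji (U+1F680), "c"

// split('') cuts the string into UTF-16 code units.
const codeUnits = text.split('');

// Array.from() uses the string iterator, which walks code points.
const codePoints = Array.from(text);

console.log(text.length);       // 4
console.log(codeUnits.length);  // 4
console.log(codePoints.length); // 3
console.log(codePoints[1] === '\u{1F680}'); // true
```

The rocket occupies two code units but a single code point, which is why the two counts differ.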
import {
charAt,
codePointAt,
length,
slice,
split,
truncateBytes,
} from 'multibyte';
// JavaScript String.prototype.charAt() can return a UTF-16 surrogate
'a🚀c'.charAt(1); // ❌ "\ud83d" (half a rocket)
charAt('a🚀c', 1); // ✅ "🚀"
// JavaScript String.prototype.codePointAt() can return a UTF-16 surrogate
'🚀abc'.codePointAt(1); // ❌ 56960 (the low surrogate of the rocket emoji)
codePointAt('🚀abc', 1); // ✅ 97 (the letter a)
// JavaScript returns length in UTF-16 code units, not Unicode characters
'a🚀c'.length; // ❌ 4
length('a🚀c'); // ✅ 3
// JavaScript slices along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(2, 3); // ❌ "\ude80" (half a rocket)
slice('a🚀cdef', 2, 3); // ✅ "c"
// JavaScript splits along UTF-16 boundaries, not Unicode characters
'a🚀c'.split(''); // ❌ ["a", "\ud83d", "\ude80", "c"]
split('a🚀c', ''); // ✅ ["a", "🚀", "c"]
// JavaScript slices strings along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(0, 2); // ❌ "a\ud83d" (half a rocket)
truncateBytes('a🚀cdef', 2); // ✅ "a" (including the rocket would exceed the 2-byte limit)
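For readers curious how code-point-aware helpers like these can be built on Array.from(string), here is a minimal sketch. It is not multibyte's actual source: the sketch* names and toCodePoints are hypothetical, and sketchTruncateBytes assumes that "bytes" means UTF-8 bytes as measured by TextEncoder.

```javascript
// Hypothetical helpers for illustration; not multibyte's actual source.
const toCodePoints = (str) => Array.from(str);

function sketchCharAt(str, index) {
  // Out-of-range indexes yield '' to mirror String.prototype.charAt().
  return toCodePoints(str)[index] ?? '';
}

function sketchLength(str) {
  return toCodePoints(str).length;
}

function sketchSlice(str, start, end) {
  return toCodePoints(str).slice(start, end).join('');
}

// Assumption: "bytes" means UTF-8 bytes, measured with TextEncoder.
function sketchTruncateBytes(str, maxBytes) {
  const encoder = new TextEncoder();
  let result = '';
  let used = 0;
  // for...of walks the string by code point, so characters stay whole.
  for (const char of str) {
    const size = encoder.encode(char).length;
    if (used + size > maxBytes) break;
    result += char;
    used += size;
  }
  return result;
}

console.log(sketchCharAt('a\u{1F680}c', 1) === '\u{1F680}'); // true
console.log(sketchLength('a\u{1F680}c'));                    // 3
console.log(sketchSlice('a\u{1F680}cdef', 2, 3));            // "c"
console.log(sketchTruncateBytes('a\u{1F680}cdef', 2));       // "a"
```

Converting to an array on every call keeps the sketch short; for very long strings, walking the string iterator directly would avoid building the intermediate array.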
BOM (Byte order mark) - U+FEFF
Under the hood, all these functions strip a leading BOM if present.
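As an illustration of what stripping a leading BOM involves, here is a minimal sketch in plain JavaScript. The stripBom name is hypothetical and this is not necessarily how multibyte implements it.

```javascript
// Hypothetical helper; the name stripBom is an assumption for illustration.
function stripBom(str) {
  // U+FEFF at index 0 is a byte order mark; drop it if present.
  return str.charCodeAt(0) === 0xfeff ? str.slice(1) : str;
}

console.log(stripBom('\ufeffabc'));              // "abc"
console.log(stripBom('abc'));                    // "abc"
console.log(Array.from('\ufeffabc').length);     // 4 (the BOM counts as a code point)
console.log(Array.from(stripBom('\ufeffabc')).length); // 3
```

Without this step, an invisible BOM would count as a character and shift every index by one.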