1.0.4 • Published 12 months ago
multibyte
multibyte provides common string functions that respect multibyte Unicode characters.
npm install multibyte
The problem and the solution
JavaScript strings are encoded as UTF-16, and indexing, length, and slicing all operate on 16-bit code units rather than on code points. Characters outside the Basic Multilingual Plane (such as most emoji) are stored as a surrogate pair of two code units, so many built-in string operations can split them in half.
If you display Unicode text, for example from a UTF-8 source, these multibyte functions help: they take advantage of the fact that Array.from(string) splits a string by code point rather than by UTF-16 code unit.
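To illustrate the Array.from(string) idea, here is a minimal sketch of code-point-aware helpers. The names cpLength, cpCharAt, and cpSlice are hypothetical and this is not multibyte's actual source; it only shows the general technique of converting the string to an array of code points before operating on it.

```javascript
// Hypothetical helpers (not multibyte's implementation): each one converts
// the string to an array of code points first, then works on that array.
function cpLength(str) {
  return Array.from(str).length;
}

function cpCharAt(str, index) {
  // Out-of-range indexes yield an empty string, like String.prototype.charAt
  return Array.from(str)[index] ?? '';
}

function cpSlice(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}

// "a", rocket emoji (U+1F680), "c"; written with an escape to keep the source ASCII
const text = 'a\u{1F680}c';

console.log(text.length);                          // 4 (UTF-16 code units)
console.log(cpLength(text));                       // 3 (code points)
console.log(cpCharAt(text, 1) === '\u{1F680}');    // true
console.log(cpSlice(text, 1, 3) === '\u{1F680}c'); // true
```

Note that this works per code point, not per grapheme cluster: sequences such as an emoji followed by a skin-tone modifier still count as more than one element.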
import {
  charAt,
  codePointAt,
  length,
  slice,
  split,
  truncateBytes,
} from 'multibyte';
// JavaScript String.prototype.charAt() can return a UTF-16 surrogate
'a🚀c'.charAt(1); // ❌ "\ud83d" (half a rocket)
charAt('a🚀c', 1); // ✅ "🚀"
// JavaScript String.prototype.codePointAt() indexes by UTF-16 code unit, so it can return a lone surrogate
'🚀abc'.codePointAt(1); // ❌ 56960 (0xDE80, the low surrogate of the rocket emoji)
codePointAt('🚀abc', 1); // ✅ 97 (the letter a)
// JavaScript returns length in UTF-16, not Unicode characters
'a🚀c'.length; // ❌ 4
length('a🚀c'); // ✅ 3
// JavaScript slices along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(2, 3); // ❌ "\ude80" (half a rocket)
slice('a🚀cdef', 2, 3); // ✅ "c"
// JavaScript splits along UTF-16 boundaries, not Unicode characters
'a🚀c'.split(''); // ❌ ["a", "\ud83d", "\ude80", "c"]
split('a🚀c', ''); // ✅ ["a", "🚀", "c"]
// Truncating with String.prototype.slice() can cut a character in half; truncateBytes() keeps only whole characters
'a🚀cdef'.slice(0, 2); // ❌ "a\ud83d" (half a rocket)
truncateBytes('a🚀cdef', 2); // ✅ "a" (the rocket does not fit in the 2-byte limit, so it is dropped whole)
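Byte-limited truncation that never splits a character can be sketched as follows. This is an illustration only: the function name truncateUtf8 is hypothetical, and it assumes the limit is measured in UTF-8 bytes, which the text above does not state for truncateBytes.

```javascript
// Hypothetical sketch, not multibyte's implementation.
// Assumption: the byte limit is counted in UTF-8 bytes.
function truncateUtf8(str, maxBytes) {
  const encoder = new TextEncoder(); // always encodes as UTF-8
  let used = 0;
  let out = '';
  // Iterating a string with for...of yields whole code points,
  // so a surrogate pair is never split.
  for (const ch of str) {
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break; // next character would exceed the limit
    used += size;
    out += ch;
  }
  return out;
}

const sample = 'a\u{1F680}cdef'; // "a", rocket emoji, "cdef"

console.log(truncateUtf8(sample, 2) === 'a');          // true: the 4-byte rocket does not fit
console.log(truncateUtf8(sample, 5) === 'a\u{1F680}'); // true: 1 + 4 bytes
```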
BOM (Byte order mark) - U+FEFF
Under the hood, all these functions strip a leading BOM if present.
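Stripping a leading BOM can be sketched like this. The helper name stripBom is hypothetical and is not part of the API shown above; it only illustrates the check described here.

```javascript
// Hypothetical helper, not multibyte's implementation:
// remove a single leading byte order mark (U+FEFF) if the string starts with one.
function stripBom(str) {
  return str.charCodeAt(0) === 0xfeff ? str.slice(1) : str;
}

console.log(stripBom('\uFEFFabc') === 'abc'); // true
console.log(stripBom('abc') === 'abc');       // true: strings without a BOM are unchanged
```

U+FEFF lies in the Basic Multilingual Plane, so it occupies exactly one UTF-16 code unit and slice(1) removes it cleanly.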