1.0.4 • Published 12 months ago

multibyte v1.0.4

License
ISC

multibyte


multibyte provides common string functions that respect multibyte Unicode characters.

npm install multibyte

The problem and the solution

JavaScript strings are encoded as UTF-16, and their indices, length, and most String methods operate on 16-bit code units rather than on code points. Unicode characters outside the Basic Multilingual Plane (like newer emoji) need 4 bytes in UTF-16 and are stored as a surrogate pair, so they get split into 2 code units in many situations.
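To make the distinction concrete, here is a small sketch using only built-in JavaScript APIs (nothing from this package). It compares three ways of measuring the rocket emoji; the variable names are illustrative only.

```javascript
// U+1F680 (rocket emoji) lies outside the Basic Multilingual Plane,
// so UTF-16 stores it as a surrogate pair of two 16-bit code units.
const rocket = '\u{1F680}';

const codeUnits = rocket.length; // counts UTF-16 code units
const codePoints = Array.from(rocket).length; // iterates by code point
const utf8Bytes = new TextEncoder().encode(rocket).length; // UTF-8 byte count

console.log(codeUnits, codePoints, utf8Bytes); // 2 1 4
console.log(rocket.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(rocket.charCodeAt(1).toString(16)); // de80 (low surrogate)
```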

If you display Unicode text from a UTF-8 source, you need these multibyte functions, which take advantage of the fact that Array.from(string) iterates by code point and therefore never splits a surrogate pair.

import {
  charAt,
  codePointAt,
  length,
  slice,
  split,
  truncateBytes,
} from 'multibyte';

// JavaScript String.prototype.charAt() can return a UTF-16 surrogate
'a🚀c'.charAt(1); //  ❌ "\ud83d" (half a rocket)
charAt('a🚀c', 1); // ✅ "🚀"

// JavaScript String.prototype.codePointAt() can return a UTF-16 surrogate
'🚀abc'.codePointAt(1); //  ❌ 56960 (low surrogate of the rocket emoji)
codePointAt('🚀abc', 1); // ✅ 97 (the letter a)

// JavaScript returns length in UTF-16 code units, not Unicode characters
'a🚀c'.length; //  ❌ 4
length('a🚀c'); // ✅ 3

// JavaScript slices along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(2, 3); //  ❌ "\ude80" (half a rocket)
slice('a🚀cdef', 2, 3); // ✅ "c"

// JavaScript splits along UTF-16 boundaries, not Unicode characters
'a🚀c'.split(''); //  ❌ ["a", "\ud83d", "\ude80", "c"]
split('a🚀c', ''); // ✅ ["a", "🚀", "c"]

// JavaScript slices strings along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(0, 2); //       ❌ "a\ud83d" (half a rocket)
truncateBytes('a🚀cdef', 2); // ✅ "a" (the rocket does not fit within the 2-byte limit)
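For readers curious how functions like these can be built on Array.from(string), the following is a minimal sketch and not the package's actual source. All helper names (toCodePoints, lengthSketch, charAtSketch, sliceSketch, truncateUtf8Sketch) are hypothetical, and the truncation helper assumes that "bytes" means UTF-8 encoded bytes, which this README does not confirm.

```javascript
// Hypothetical helpers for illustration; NOT the multibyte package's code.
// Each one converts the string to an array of code points before indexing.
const toCodePoints = (str) => Array.from(str);

const lengthSketch = (str) => toCodePoints(str).length;

// Returns '' for out-of-range indices, mirroring String.prototype.charAt().
const charAtSketch = (str, index) => toCodePoints(str)[index] ?? '';

const sliceSketch = (str, start, end) =>
  toCodePoints(str).slice(start, end).join('');

// ASSUMPTION: the byte budget is measured in UTF-8 bytes.
// Whole code points are appended until the next one would exceed the budget.
function truncateUtf8Sketch(str, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let out = '';
  for (const ch of str) { // for...of iterates by code point
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

const sample = 'a\u{1F680}cdef'; // "a", rocket emoji, "cdef"
console.log(lengthSketch(sample)); // 6
console.log(charAtSketch(sample, 1) === '\u{1F680}'); // true
console.log(sliceSketch(sample, 2, 3)); // "c"
console.log(truncateUtf8Sketch(sample, 2)); // "a"
```

Note that code-point iteration keeps surrogate pairs intact but still separates sequences made of several code points, such as an emoji followed by a skin-tone modifier.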

BOM (Byte order mark) - U+FEFF

Under the hood, all these functions strip a leading BOM if present.
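As an illustration of what stripping a leading BOM can look like, here is a hedged sketch. The stripBomSketch helper is hypothetical and is not taken from the package; it only shows the general idea of dropping U+FEFF when it is the first character.

```javascript
// Hypothetical helper for illustration; not the package's actual code.
// U+FEFF is a single UTF-16 code unit, so slice(1) removes exactly the BOM.
const BOM = '\uFEFF';
const stripBomSketch = (str) => (str.startsWith(BOM) ? str.slice(1) : str);

console.log(stripBomSketch('\uFEFFabc')); // "abc"
console.log(stripBomSketch('abc')); // "abc"
console.log(stripBomSketch('\uFEFFabc').length); // 3
```

Without this step, a string read from a file that begins with a BOM would report one extra character at index 0.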