base114688 v1.0.0
base114688
Base114688 is a binary-to-text encoding optimized for serializing data in UTF-32 (i.e. in the fewest Unicode characters possible). Inspired by base65536, base114688 attempts to squeeze in even more safe Unicode code points to produce a "more efficient" encoding.
Installation
$ npm install base114688
Note: base114688 uses BigInt, which is supported in Node.js >= 10.9 and some browsers (mostly Chrome).
Usage
// ES6 modules
import base114688 from 'base114688';
// CommonJS
const base114688 = require('base114688').default;
let data = new Uint8Array([123,24,25,84]);
let encoded = base114688.encode(data);
console.log(encoded); //R𘝂𢉸㶌𧴧𐫏㾳𪫫𭾓軴𞴈R
let decoded = base114688.decode(encoded);
console.log(decoded); // Uint8Array(4) [ 123, 24, 25, 84 ]
How does it work
Base114688 encodes a stream of bytes by taking consecutive chunks of 21 bytes and encoding each chunk as a series of 10 Unicode characters. 21 bytes can represent a total of 256^21 ≈ 3.74×10^50 values. Since 10 digits in radix-114688 can represent a total of 114688^10 ≈ 3.94×10^50 values, 10 digits are enough to encode the 21 bytes. (By the way, 114688 is 65536×7/4.)
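The capacity argument can be checked directly with BigInt arithmetic. This is only a quick sanity check of the numbers quoted above, not code taken from the library; the variable names are made up for illustration.

```javascript
// Count the values representable by 21 bytes and by 10 radix-114688 digits.
const byteValues = 256n ** 21n;      // 2^168
const digitValues = 114688n ** 10n;  // (2^14 * 7)^10

console.log(digitValues >= byteValues);    // true: 10 digits are enough
console.log(114688n ** 9n >= byteValues);  // false: 9 digits are not
console.log(65536 * 7 / 4 === 114688);     // true
```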
To encode a group of 21 bytes, the bytes are combined into a single 168-bit integer, most significant byte first. Since JavaScript's Number type cannot represent integers larger than 2^53 exactly, BigInt must be used. The 168-bit number is converted into 10 radix-114688 digits by repeatedly dividing by 114688 and assigning the remainder to each consecutive digit.
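A minimal sketch of this step, working with numeric digit values only. The function names are hypothetical, the mapping from digits to actual Unicode characters is left out because it is not described here, and the digit order (first remainder first) is one reading of the description rather than a confirmed detail of the library. The inverse function is included only to check the round trip.

```javascript
const RADIX = 114688n;

// Combine 21 bytes into one 168-bit BigInt, most significant byte first.
function blockToBigInt(bytes) {
  if (bytes.length !== 21) throw new RangeError('expected 21 bytes');
  let n = 0n;
  for (const b of bytes) n = (n << 8n) | BigInt(b);
  return n;
}

// Split the integer into 10 radix-114688 digits by repeated division.
// Assumption: the first remainder (least significant digit) comes first.
function blockToDigits(bytes) {
  let n = blockToBigInt(bytes);
  const digits = [];
  for (let i = 0; i < 10; i++) {
    digits.push(Number(n % RADIX));
    n /= RADIX; // BigInt division truncates
  }
  return digits;
}

// Inverse of blockToDigits: rebuild the integer, then peel off 21 bytes.
function digitsToBlock(digits) {
  let n = 0n;
  for (let i = digits.length - 1; i >= 0; i--) n = n * RADIX + BigInt(digits[i]);
  const bytes = new Uint8Array(21);
  for (let i = 20; i >= 0; i--) {
    bytes[i] = Number(n & 0xffn);
    n >>= 8n;
  }
  return bytes;
}

const block = new Uint8Array(21); // zero-filled
block.set([123, 24, 25, 84]);
const digits = blockToDigits(block);
console.log(digits.length); // 10
```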
Since the number of bytes to encode likely won't be divisible by 21, a number of padding bytes is appended to the stream. A single radix-114688 digit encoding the number of padding bytes (a number from 0 to 20) is then added to both the start and the end of the stream of characters. It is placed at both ends to make it easier to select the text in a text field on a website, for example.
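The padding count can be sketched as follows. The helper names are hypothetical, and zero is assumed as the padding byte value, since the value actually used is not stated here.

```javascript
const BLOCK_SIZE = 21;

// Pad a byte array to a multiple of 21 bytes and report the padding count.
// Assumption: padding bytes are zeros.
function padToBlocks(data) {
  const padding = (BLOCK_SIZE - (data.length % BLOCK_SIZE)) % BLOCK_SIZE;
  const padded = new Uint8Array(data.length + padding); // zero-filled
  padded.set(data);
  return { padded, padding };
}

// Drop the padding bytes again after decoding.
function unpad(padded, padding) {
  return padded.subarray(0, padded.length - padding);
}

const { padded, padding } = padToBlocks(new Uint8Array([123, 24, 25, 84]));
console.log(padded.length, padding); // 21 17
```

For the 4-byte input from the usage example this gives one 21-byte block with 17 padding bytes, which is consistent with the 12-character output shown there: one padding digit, 10 data characters, and the padding digit again.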
The encoding process is analogous to how the base85 encoding works.
Base114688 is able to encode 21 bytes per 10 characters, or 2.1 bytes per character. That's 5% more efficient than base65536, which encodes exactly 2 bytes per character. However, since this encoding relies on BigInt arithmetic, it is probably pretty slow and not very optimized.
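To see when the 5% gain actually shows up, the output lengths can be estimated from the scheme described above. This is a rough sketch with hypothetical helper names: it assumes 10 characters per padded block plus the two padding-count characters for base114688, and one character per 2 bytes (rounded up) for base65536.

```javascript
// Estimated number of characters produced for `byteLength` input bytes.
// Assumption: every 21-byte block (including a padded last one) yields
// 10 characters, plus one padding-count character at each end.
function base114688Length(byteLength) {
  return Math.ceil(byteLength / 21) * 10 + 2;
}

// Assumption: base65536 emits one character per 2 bytes, rounding up.
function base65536Length(byteLength) {
  return Math.ceil(byteLength / 2);
}

console.log(base114688Length(2100), base65536Length(2100)); // 1002 1050
console.log(base114688Length(4), base65536Length(4));       // 12 2
```

Under these assumptions the advantage only appears for longer inputs; for very short inputs the padding and the two extra characters dominate.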
Why?
Credits
Much inspiration for this project came from projects by qntm.
Both his article about safe Unicode code points and his safe code point generator were very useful when making this project.
License
MIT