0.0.0 • Published 5 years ago

zdag v0.0.0

Weekly downloads
-
License
(Apache-2.0 AND M...
Repository
-
Last release
5 years ago

zdag

This is an experimental block format for IPLD. It leverages the constraints of the IPLD data model to shave as many bytes off the block format as we can without compression.

Features:

  • fully deterministic
  • deduplicates (compresses) common values like strings, bytes, map keys.
  • deduplicates (compresses) all links (cids)
  • supports the full IPLD data model
  • links can be parsed without reading the entire block
  • link prefixes are de-duplicated (compressed)
  • (experiment) value data compression.
    • by compressing the string/byte value and map key data separate from the links and structure we target only the most easily compressable data.
  • paths can be read without fully parsing most values
  • some paths can even be ruled out of being available simply by checking the lengths of potential map keys

Experimental Value Compression

Constraints:

  • Does not support anything not in the IPLD data model
  • Blocks must be written transactionally (no streaming)
  • Every serialization must be a single, complete value type

block structure

links | values | structure

A block is broken into 3 parts and each part has different parsing rules

| ...cids | null byte |
| length | ...values(len | data) |
| structure |

The first section is all the links (cids) and is ended by a null byte.

The second section is all the values (map keys, string values, byte values). The length of this section is the first varint, so there is no closing delimiter.

The third section is the actual value type structure, which will reference cids and string/byte values from the prior sections. Other types are written inline into the structure.

// each line is an structure example
[ 0 ] // 0 int
[ 1 ] // 1 int
[ 101, 102 ] // 102 int (penalty byte due to conflict with table entries)
[ 102, reference varint ] // string
[ 103, reference varint ] // bytes
[ 108, ...(reference varint, table entry) ] // map
[ 109, ...(table entry) ] // list

table

0 - 100 : inline varint
100 : list delimiter
101 : varint (limited allowable use)
102 : utf8 string reference
103 : bytes reference
104 : null
105 : true
106 : false
107 : float
108 : map
109 : list
110 : cid reference
111 : signed varint
112 : signed float
113 : zero point float (not implemented)
114 : typed map (not implemented)
115 : typed list (not implemented)
116+ : inline varint

writing cids

  • write the byte representation of every CID
  • write null byte

writing values

  • pepare every value
    • recursively parse through the entire object
      • convert all string values and map keys to binary representation
      • collect all binary values
      • de-duplicate every value's binary representation
    • order all values
      • length first, then byte comparison
    • write the increase in length from the prior value (beginning at zero)
    • write the value
  • write the length of the prepared values
  • write the prepared values

Writing this way enforces deterministic ordering and de-duplication. It also keeps the encoded length numbers as small as possible, leading to a smaller varint.

writing structure

The block stucture is written as a single table entry.

If the root block stucture is a map or list the trailing delimiter (0 for map, 100 for list) MUST be removed.

writing table entries

writing map value

  • write 108
  • for every entry in the list
    • write the increase over the previous map key + 1
    • write the table entry for the value
  • write a null byte

Writing this way enforces deterministic order of map keys. It also keeps the encoded index numbers as small as possible, leading to a smaller varint.

No two map keys can reference the same value index. MUST throw.

writing list value

  • write 109
  • for every value in the list
    • write the table entry
  • write 100

writing varints

For any whole positive

  • If the number is between 100 and 112 write 101, varint
  • For any other number just write the varint

For negative integers:

  • write 111
  • write the varint

writing floats

For positive floats:

  • write 107
  • write the mantissa length as a varint
  • write the integer for the full value (float divided by Math.pow(10, mantissaLength)) as a varint

For negative floats:

  • write 112
  • write the mantissa length as a varint
  • write the integer for the full value (float divided by Math.pow(10, mantissaLength)) as a varint

writing constants (null, true, false)

  • if null: write 104
  • if true: write 105
  • if false: write 106

writing references (cid, string, bytes)

  • if link: write 110 and then the varint index of the link
  • if string: write 102 and then the varint index of the value
  • if bytes: write 103 and then the varint index of the value

parsing cids

  • parse each cid until null byte

parsing values

  • first varint is the length of the entire values section
  • parse each value
    • read() varint for increase from prior length (begins at zero)
    • value is written after the varint for the given length (prior length + increase)

parsing structure

  • structure is parsed as a single table entry value
  • for map and list entries the final trailing delimiter (0 for map, 100 for list) is removed

parsing a table entry

We use read() to refer to the next read after this varint is read

  • < 100 - This varint is the inline number value
  • 100 - List delimiter
  • 101 - If it's a varint then the next read() is the full varint value
  • 102 - The next read() is a varint for the index of the value and this value is a string value.
  • 103 - The next read() is a varint for the index of the value and this value is a bytes value.
  • 104 - This is null
  • 105 - This is true
  • 106 - This is false
  • 107 - The next two read()s are varints parsed into a float value.
    • The first varint is the left side of the decimal value (positive or negative integer)
    • The second varint is the rigth side of the decimal value
      • This means that the decoding rules may need to vary for positive and negative floats, as the interpretation of this value is positive when applied to a positive float and negative when applied to a negative float
  • 108 - All of the following read()s are in this map until a null byte
    • The first read() is a reference to the value index and should be represented by a string key. The index is offset by +1 so that a null byte can be used to end the map.
    • The next read() is the map value and is a table entry table entry
    • read these recursively until you hit a null byte
  • 109 - All of the following read()s are in this list until a 100 delimiter
    • Every read() is a table entry you hit a 100 delimiter
  • 110 - The next read() is a varint for the index of the cid value.
  • 111 - The next read() is a signed varint, follow the varint parsing rules but make the value negative
  • 112 - The next read() is a signed float, follow float parsing rules but make the left side negative
  • 113 - The next read() is a zero point float, the next varint is the integer representing the mantissa below zero. (not implemented)
  • 114 - Typed map (not implemented)
  • 115 - Typed list (not implemented)
  • 116+ - This varint is the inline number value

value and cid sorting algorithm

value sorting rules

  • only the binary form is stored
  • sort length first
  • then sort by byte comparison

Sort:

[ 1, 2, 3, 4 ]
[ 1, 5, 5, 5 ]
[ 2, 0, 0, 0, 0, 0]

Serializes as:

[ 4 ] [ 1, 2, 3, 4 ]
[ 0 ] [ 1, 5, 5, 5 ]
[ 2 ] [ 2, 0, 0, 0, 0, 0]

cid sorting / compression rules

The link section of the block is sorted with CID specific rules in order to compress the links by de-duplicating common prefixes.

  • All CIDv0 entries come first.
  • All cid's with a digest length greater than 4 are compressed by common prefixes.
    • It's possible to have an identity multihash with a valid length below zero. These must be written serially ordered by length first

The sorting of CIDs:

[ 18 ] [ length=32 ] [ digest-1 ]
[ 18 ] [ length=32 ] [ digest-2 ]
[ 1 , codec=1, hashfn=1 ] [ length=0 ] [ ]
[ 1 , codec=1, hashfn=1 ] [ length=1 ] [ 1 ]
[ 1 , codec=1, hashfn=1 ] [ length=1 ] [ 2 ]
[ 1 , codec=1, hashfn=1 ] [ length=245 ] [ digest-1 ]
[ 1 , codec=1, hashfn=1 ] [ length=245 ] [ digest-2 ]
[ 1 , codec=1, hashfn=1 ] [ length=245 ] [ digest-3 ]
[ 1 , codec=1, hashfn=1 ] [ length=250 ] [ digest-1 ]
[ 1 , codec=1, hashfn=1 ] [ length=250 ] [ digest-2 ]
[ 1 , codec=1, hashfn=1 ] [ length=250 ] [ digest-3 ]
[ 1 , codec=2, hashfn=1 ] [ length=250 ] [ digest-1 ]
[ 1 , codec=2, hashfn=1 ] [ length=250 ] [ digest-2 ]
[ 1 , codec=2, hashfn=2 ] [ length=250 ] [ digest-1 ]
[ 1 , codec=2, hashfn=2 ] [ length=250 ] [ digest-2 ]

Is serialized as:

[ 18 ] [ length=32 ] [ digest-1 ]
       [ length=32 ] [ digest-2 ]
[ 1 , codec=1, hashfn=1 ] [ length=0 ] [ ]
[ 1 , codec=1, hashfn=1 ] [ length=1 ] [ 1 ]
[ 1 , codec=1, hashfn=1 ] [ length=1 ] [ 2 ]
[ 1 , codec=1, hashfn=1 ] [ length=245 ] [ digest-2 ]
                          [ length=245 ] [ digest-3 ]
                          [ length=250 ] [ digest-1 ]
                          [ length=250 ] [ digest-2 ]
                          [ length=250 ] [ digest-3 ]
[ 1 , codec=2, hashfn=1 ] [ length=250 ] [ digest-1 ]
                          [ length=250 ] [ digest-2 ]
[ 1 , codec=2, hashfn=2 ] [ length=250 ] [ digest-1 ]
                          [ length=250 ] [ digest-2 ]

map sorting algorithm

  • map keys are sorted by their string value reference number
  • this means the map sorting naturally corresponds to the value sorting algorithm

Optimizations

All optimizations are required in order to guarantee determinism.

Inline Structure when no Links or Values

When no links or values are present the two nullbytes should be dropped. If the first byte in the structure is less than 19 (lower will conflict with CIDv0 and other potential future multiformats), you must prepend 101.

Typed lists and maps (not implemented)

Any map or list that only has entries that are of the same type, and that type is not integer, MUST use typed lists and maps. Integer typed lists don't use this feature because it would be an extra byte.

All empty maps and lists must be typed. This is so that any schema validation can be done without additional parsing when the list is typed to anyting but an integer.

Examples

Structure Examples

Note that all the following examples do not have any links or values and as such do not have proceeding null bytes.

[ 1, 2 ]

/* serializes to */

109 // list
1   // 1
2   // 2
    // omit trailing delimiter when list or map is root structure
[ 1, [ 2, 3 ] ]

/* serializes to */

109  // list
1    // 1
109  // list
2    // 2
3    // 2
100  // end list
     // omit trailing delimiter when list or map is root structure
[ 1, [ null ], 3 ]

/* serializes to */

109  // list
1    // 1
109  // list
104  // null
100  // end list
3    // 3
     // omit trailing delimiter when list or map is root structure

The following examples have values and no CIDs, which is why they begin with a null byte.

{ "hello": "world" }

/* serializes to */

0                        // no links
12                       // length of values header
5                        // +5 length offset
104, 101, 108, 108, 111  // "hello"
0                        // +0 length offset
119, 111, 114, 108, 100  // "world"
108                      // map
1                        // (+1 map key offset) - 1
1                        // value index
                         // omit trailing delimiter when list or map is root structure
[ { "hello": "world", "world": "hello" } ]

/* serializes to */

0                        // no links
12                       // length of values header
5                        // +5 length offset
104, 101, 108, 108, 111  // "hello"
0                        // +0 length offset
119, 111, 114, 108, 100  // "world"
109
108                      // map
1                        // (+1 map key offset ) + 1
1                        // value index
1                        // (+1 map key offset ) - 1
0                        // value index
0                        // map end
                         // omit trailing delimiter when list or map is root structure

0 - 100 : inline varint 100 : list delimiter 101 : varint (limited allowable use) 102 : utf8 string reference 103 : bytes reference 104 : null 105 : true 106 : false 107 : float 108 : map 109 : list 110 : cid reference 111 : signed varint 112 : signed float 113 : zero point float (not implemented) 114 : string typed map (not implemented) 115 : string typed list (not implemented) 116 : byte value typed map 117 : byte value type list 119 : float typed map 120 : float typed list 121 : signed varint map 122 : signed varint list 123 : signed float map 124 : signed float list 125 : empty map 126 : empty list 118+ : inline varint

Typed list

[ 'x', 'y', 'z' ]

/* serializes to */

0       // no links
6       // value header length
1, 120  // +1 offset length, "x"
0, 121  // +0 offset length, "y"
0, 122  // +0 offset length, "z"
115     // string typed list
1       // value index + 1
2       // value index + 1
3       // value index + 1
        // trailing zero is ommitted during structure inlining