update 0.12.0

Chik-Network · Feb 5, 2025 · e63c0f3 · e63c0f3
1 parent 3211546
commit e63c0f3
Show file tree

Hide file tree

Showing 12 changed files with 364 additions and 47 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -3,7 +3,7 @@ members = ["fuzz", "tools", "wasm", "wheel"]
 
 [package]
 name = "klvmr"
-version = "0.11.0"
+version = "0.12.0"
 authors = ["Richard Kiss <him@richardkiss.com>"]
 edition = "2021"
 license = "Apache-2.0"

diff --git a/benches/serialize.rs b/benches/serialize.rs
@@ -35,8 +35,8 @@ fn serialize_benchmark(c: &mut Criterion) {
         group.bench_function(format!("Serializer {name}"), |b| {
             b.iter(|| {
                 let start = Instant::now();
-                let mut ser = Serializer::default();
-                let _ = ser.add(&a, node, None);
+                let mut ser = Serializer::new(None);
+                let _ = ser.add(&a, node);
                 black_box(ser.into_inner());
                 start.elapsed()
             })

diff --git a/docs/compressed-serialization.md b/docs/compressed-serialization.md
@@ -0,0 +1,195 @@
+# Compressed Serialization
+
+With the Chik hard fork at height 5'496'000, KLVM can be serialized in a more
+space efficient form, referring back to previous sub-trees instead of
+duplicating them. These references are referred to as "back references".
+
+## Format
+
+The original serialization format had 3 tokens.
+
+- `0xff` - a pair, followed by the left and right sub-trees.
+- Atom - values, in the form of an array of bytes. For more details, see [KLVM
+  serialization](https://chiklisp.com/klvm/#serialization).
+
+A back reference is introduced by `0xfe` followed by an atom. The atom refers
+back to an already decoded sub tree. The bits are interpreted just like an
+environment lookup in KLVM. The bits are inspected one at a time, from least
+significant to most significant bits, in big-endian order.
+
+## Paths
+
+```
+            +----------+----------+----------+----------+
+byte index: |  byte 0  |  byte 1  |  byte 2  |  byte 3  |
+            +----------+----------+----------+----------+
+ bit index: | 76543210 | 76543210 | 76543210 | 76543210 |
+            +----------+----------+----------+----------+
+
+bit traversal direction:                          <- x
+```
+
+A `0` bit means follow the left sub-tree while a `1` bit means follow the right
+sub-tree. The last 1-bit is the terminator, and means we should pick the node at
+the current location in the tree.
+
+e.g. The reference `0b1011` means:
+
+- right
+- right
+- left
+- (terminator bit)
+
+It follows the path below:
+
+```
+                 [*]
+                /   \
+               /     \
+              /       \ 1
+             /         \
+            /           \
+           /             \
+         [ ]             [*]
+        /   \           /   \ 1
+       /     \         /     \
+     [ ]     [ ]     [ ]     [*]
+     / \     / \     /  \  0 / \
+   [ ] [ ] [ ] [ ] [ ] [ ] [*] [ ]
+```
+
+How environment lookups work is also described in the
+[chiklisp documentation](https://chiklisp.com/klvm/#environment).
+
+## Parsing
+
+Back references refer into the "parse stack". This is a KLVM tree that's updated
+as we parse, so what a back reference refers to changes as we parse the
+serialized KLVM tree. To understand what the parse stack is, we first need to
+look at how KLVM is parsed.
+
+The parser has a stack of _operations_ and a stack of the parsed results (the
+parse stack).
+
+There are 2 operations that can be pushed onto the operations stack:
+
+- `Cons` - Construct a pair (cons box)
+- `Traverse` - parse a sub-tree
+
+As outlined in the [Format](#Format) section, there are two tokens we can
+encounter when parsing; an atom or a pair (followed by the left- and right
+sub-trees).
+
+We keep popping operations off of the op-stack until it's empty. We take the
+following actions depending on the operation:
+
+- `Traverse`, inspect the next byte of the input stream. If it's a pair (`0xff`)
+  we push `Cons`, `Traverse`, `Traverse` onto the operations stack. If it's an
+  atom, parse the atom and push it into the parse stack.
+
+- `Cons`, pop two nodes from the parse stack, create a new pair with those nodes
+  as the left and right side. Push the resulting pair onto the stack.
+
+### Example
+
+To parse the tokens: `0xff` `1` `0xff` `2` `foobar`, the two stacks end up like
+this while parsing. The stacks grow to the right in this illustration.
+
+| step              | op-stack                       | parse-stack              |
+| ----------------- | ------------------------------ | ------------------------ |
+| 1, initial state  | Traverse                       |                          |
+| 2, parse `0xff`   | Cons, Traverse, Traverse       |                          |
+| 3, parse `1`      | Cons, Traverse                 | `1`                      |
+| 4, parse `0xff`   | Cons, Cons, Traverse, Traverse | `1`                      |
+| 5, parse `2`      | Cons, Cons, Traverse           | `1`, `2`                 |
+| 6, parse `foobar` | Cons, Cons                     | `1`, `2`, `foobar`       |
+| 7, pop2 and cons  | Cons                           | `1`, (`2` . `foobar`)    |
+| 8, pop2 and cons  |                                | (`1` . (`2` . `foobar`)) |
+
+## Parse stack
+
+When a back-reference token (`0xfe`) is encountered, the parse stack in that
+current state is used as the environment for the back-reference path to look up
+what node to place at this position in the resulting tree.
+
+The parse stack is itself a LISP list of items. The top of the stack is the head
+of the list.
+
+e.g.
+
+The stack `1`, `2`, `3`, would have the following LISP structure:
+
+```
+(`1` . (`2` . (`3` . NIL)))
+```
+
+A back reference to `3` would be: `0b1100` (right, left).
+
+### Example back-reference
+
+Consider the following LISP structure: ((`1` . `2`) . (`1` . `2`))  
+It can be serialized as `0xff` `0xff` `1` `2` `0xfe` `0b10`
+
+The parsing steps would be as follows:
+
+| step                | op-stack                                 | parse-stack                 |
+| ------------------- | ---------------------------------------- | --------------------------- |
+| 1, initial state    | Traverse                                 |                             |
+| 2, parse `0xff`     | Cons, Traverse, Traverse                 |                             |
+| 3, parse `0xff`     | Cons, Traverse, Cons, Traverse, Traverse |                             |
+| 4, parse `1`        | Cons, Traverse, Cons, Traverse           | `1`                         |
+| 5, parse `2`        | Cons, Traverse, Cons                     | `1`, `2`                    |
+| 6, pop2 and cons    | Cons, Traverse                           | (`1` . `2`)                 |
+| 7, parse `0xfe` `2` | Cons                                     | (`1` . `2`), (`1` . `2`)    |
+| 8, pop2 and cons    |                                          | ((`1` . `2`) . (`1` . `2`)) |
+
+### Referencing the stack itself
+
+Back references aren't limited to just referencing items in the stack, but can
+reference any node in the stack. For example, consider parsing the following
+structure:
+
+`0xff` `foobar` `0xff` `foobar` NIL
+
+| step              | op-stack                 | parse-stack |
+| ----------------- | ------------------------ | ----------- |
+| 1, initial state  | Traverse                 |             |
+| 2, parse `0xff`   | Cons, Traverse, Traverse |             |
+| 3, parse `foobar` | Cons, Traverse           | `foobar`    |
+
+At this point, rather than parsing the next `0xff` pair, we could have a back
+reference (`0xfe`) with a path pointing to the root of the parse stack. In LISP
+form, the parse stack will be (`foobar` . NIL) - a list with one item. The rest
+of the KLVM tree is just the second `foobar` followed by the list terminator. It
+can be replaced with the parse stack itself. i.e. We can use a back-reference of
+`1`. We then get the NIL and the cons box "for free". It's implied by the parse
+stack.
+
+In this scenario, the rest of the parsing steps are:
+
+| step                | op-stack | parse-stack                   |
+| ------------------- | -------- | ----------------------------- |
+| 4, parse `0xfe` `1` | Cons     | `foobar`, (`foobar` . NIL)    |
+| 5, pop2 and cons    |          | (`foobar` . (`foobar` . NIL)) |
+
+In practice, however, this rarely happens.
+
+## Generating back references
+
+When serializing with compression, we need to assign a tree-hash and an
+(uncompressed) serialized length to every node. When deciding whether to output
+the sub-tree itself or a back-reference, we need to know whether we have already
+serialized an identical sub tree. If we have, we then have to perform a search
+from that node up all of its parents until we reach the top of the parse stack.
+This requires a data structure that knows about the parents of all nodes.
+
+This search is performed in `find_path()`. There may be multiple paths leading
+to the stack (if the same structure is repeated in multiple places). We pick the
+_shortest_ path. This path may still be quite long, if the stack is deep or if
+the node is found deep down in a KLVM structure. We need to compare the length
+of the path against the serialized-length of the subtree. If the path is longer,
+it would be a net loss to replace it with a back reference.
+
+During serialization, we need to track what the parse-stack will look like when
+deserializing, since this is part of the structure we need to search through
+when finding paths to previous sub trees.
diff --git a/fuzz/fuzz_targets/incremental_serializer.rs b/fuzz/fuzz_targets/incremental_serializer.rs
@@ -94,10 +94,10 @@ fn do_fuzz(data: &[u8], short_atoms: bool) {
     {
         node_idx += 1;
 
-        let mut ser = Serializer::new();
-        let (done, _) = ser.add(&allocator, first_step, Some(sentinel)).unwrap();
+        let mut ser = Serializer::new(Some(sentinel));
+        let (done, _) = ser.add(&allocator, first_step).unwrap();
         assert!(!done);
-        let (done, _) = ser.add(&allocator, second_step, None).unwrap();
+        let (done, _) = ser.add(&allocator, second_step).unwrap();
         assert!(done);
 
         // now, make sure that we deserialize to the exact same structure, by

diff --git a/fuzz/fuzz_targets/serializer.rs b/fuzz/fuzz_targets/serializer.rs
@@ -19,8 +19,8 @@ fn do_fuzz(data: &[u8], short_atoms: bool) {
 
     let b1 = node_to_bytes_backrefs(&allocator, program).unwrap();
 
-    let mut ser = Serializer::new();
-    let (done, _) = ser.add(&allocator, program, None).unwrap();
+    let mut ser = Serializer::new(None);
+    let (done, _) = ser.add(&allocator, program).unwrap();
     assert!(done);
     let b2 = ser.into_inner();
 

diff --git a/src/serde/identity_hash.rs b/src/serde/identity_hash.rs
@@ -25,6 +25,7 @@ impl Hasher for IdentityHash {
     }
 }
 
+#[derive(Clone)]
 pub struct RandomState(u64);
 
 impl Default for RandomState {
-Original file line number
+Diff line change
@@ Expand Up / @@ -25,6 +25,7 @@ impl Hasher for IdentityHash { @@
         }
     }
+    #[derive(Clone)]
     pub struct RandomState(u64);
     impl Default for RandomState {
@@ Expand Down @@