From 5897d4845f214eba608660e410383a18ba115745 Mon Sep 17 00:00:00 2001
From: Titusz Pan
Date: Wed, 3 Jan 2024 13:23:55 +0100
Subject: [PATCH] Deployed d77f83a with MkDocs version: 1.5.3

---
 iep-0008/index.html                    | 178 ++++++++++++++++++++++++-
 images/iscc-iep-0008-f10-data-code.png | Bin 0 -> 28593 bytes
 index.html                             |   2 +-
 search/search_index.json               |   2 +-
 sitemap.xml                            |  36 ++---
 sitemap.xml.gz                         | Bin 284 -> 284 bytes
 6 files changed, 195 insertions(+), 23 deletions(-)
 create mode 100755 images/iscc-iep-0008-f10-data-code.png

diff --git a/iep-0008/index.html b/iep-0008/index.html
index 7cef8aa..a143131 100755
--- a/iep-0008/index.html
+++ b/iep-0008/index.html
@@ -197,11 +197,67 @@
  • + IEP-0008 - Data-Code +
  • @@ -261,6 +317,54 @@ @@ -290,7 +394,7 @@

    I Status: -TBD +DRAFT Type: @@ -306,7 +410,7 @@

    I Updated: -2023-12-28 +2024-01-03 @@ -317,12 +421,80 @@

    I developed at the International Organization for Standardization as ISO/DIS 24138

    +

    1. General

    +
      +
    1. The Data-Code shall be a similarity hash for any kind of data regardless of its media type.
    2. The Data-Code shall cluster digital assets that have near-identical data.
    3. Small differences (as a proportion of the whole) in referent data shall yield identical Data-Codes.
    4. More significant differences in referent data shall produce similar Data-Codes that can be compared against each other to estimate the data-similarity of the referents.
    5. The Data-Code shall be resistant to data shifting and reordering sequences of data within referent data.
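One common way to compare two similarity hashes of equal length is the normalized Hamming distance between their bit strings. The sketch below assumes that comparison method, which this section does not itself prescribe. The function names `hamming_distance` and `similarity` are illustrative only, and the inputs are assumed to be the raw bodies of two Data-Codes of the same bit length.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("bodies must be non-empty and of equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of matching bits; 1.0 means the two bodies are identical."""
    return 1.0 - hamming_distance(a, b) / (len(a) * 8)
```

Under this assumption, two referents with identical Data-Codes score 1.0, and lower scores suggest larger differences in the underlying data.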
    +
    +

    NOTE

    +

    Changes of the Data-Code do not reflect semantic or syntactic changes of the content.

    +
    +

    2. Format

    +

    The Data-Code shall have the data format illustrated in Figure 10:

    +
    +


    +
    Figure 10 - Data format of the Data-Code
    +
    +
    +

    EXAMPLE 1: 64-bit Data-Code in its canonical form:

    +

    ISCC:GAAWAIBQLNWP7X32

    +
    +
    +

    EXAMPLE 2: 256-bit Data-Code in its canonical form:

    +

    ISCC:GADWAIBQLNWP7X32J3INMAMDUJ4QMN67BBQKVTVZIWHXQ7QJIKHYTBY
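The two examples can be taken apart with a short sketch. It assumes that the part after the `ISCC:` prefix is unpadded RFC 4648 base32, and that the first two decoded bytes form a header of four 4-bit fields (main type, sub type, version, length), with the length field counting 32-bit units minus one. That layout is an assumption taken from outside this section, and the function name `decode_data_code` is hypothetical.

```python
import base64


def decode_data_code(code: str):
    """Split a canonical code into (main_type, sub_type, version, bit_length, body)."""
    text = code[5:] if code.upper().startswith("ISCC:") else code
    # base64.b32decode expects '=' padding up to a multiple of 8 characters.
    padded = text + "=" * (-len(text) % 8)
    raw = base64.b32decode(padded)
    # Assumed header layout: four 4-bit fields packed into the first two bytes.
    main_type, sub_type = raw[0] >> 4, raw[0] & 0x0F
    version, length_field = raw[1] >> 4, raw[1] & 0x0F
    bit_length = (length_field + 1) * 32  # assumed: 32-bit units, stored minus one
    body = raw[2:]
    if len(body) * 8 != bit_length:
        raise ValueError("body length does not match the header length field")
    return main_type, sub_type, version, bit_length, body


short = decode_data_code("ISCC:GAAWAIBQLNWP7X32")
long_ = decode_data_code("ISCC:GADWAIBQLNWP7X32J3INMAMDUJ4QMN67BBQKVTVZIWHXQ7QJIKHYTBY")
print(short[3], long_[3])  # bit lengths read from the assumed header
```

With these assumptions, the body of EXAMPLE 1 is the first 8 bytes of the body of EXAMPLE 2, which is why the two strings differ only in their third character.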

    +
    +

    3. Inputs

    +

    The input for calculating the Data-Code shall be the bytes of a file, without reference to their meaning or structure.
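In practice, this means reading the file in binary mode and passing the bytes on unchanged, with no decoding or parsing. A minimal sketch, in which the helper name `read_blocks` and the block size are arbitrary choices made for illustration:

```python
import io


def read_blocks(stream, block_size=65536):
    """Yield successive raw byte blocks from a binary stream until it is exhausted."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield block


# Any binary stream works; a file would be opened with open(path, "rb").
payload = io.BytesIO(b"\x00\x01 raw bytes, never decoded \xff")
total = sum(len(block) for block in read_blocks(payload, block_size=8))
```

Reading in blocks keeps memory use bounded for large files while leaving the byte sequence itself untouched.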

    +

    4. Outputs

    +

    Data-Code processing shall generate the following output elements:

    +
      +
    • iscc: the Data-Code in its canonical form (required).
    +

    5. Processing

    +

    An ISCC processor shall calculate the Data-Code as follows:

    +
      +
    1. Split the data into variable-sized chunks with an average chunk size of 1024 bytes using the content-defined chunking (CDC) algorithm.
    2. Calculate the 32-bit integer hash of each chunk using the XXH32 algorithm.
    3. Apply the minhash algorithm to the array of 32-bit integers to calculate the ISCC-BODY of the Data-Code with appropriate length.
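The shape of these three steps can be sketched with the standard library alone. This is not the reference implementation and will not produce conforming Data-Codes. The gear table, the minimum and maximum chunk sizes, the permutation constants, and the one-bit-per-value compression are all assumptions made for illustration, and `zlib.crc32` stands in for XXH32, which is not part of the Python standard library.

```python
import hashlib
import zlib

MASK32 = 0xFFFFFFFF
PRIME = (1 << 61) - 1  # Mersenne prime used as modulus (illustrative choice)


def _u32(tag: bytes, i: int) -> int:
    """Derive a reproducible 32-bit constant from a tag and an index."""
    return int.from_bytes(hashlib.sha256(tag + i.to_bytes(2, "big")).digest()[:4], "big")


GEAR = [_u32(b"gear", i) for i in range(256)]  # hypothetical rolling-hash table
PERMS = [(_u32(b"a", i) | 1, _u32(b"b", i)) for i in range(64)]  # hypothetical (a, b) pairs


def cdc_chunks(data: bytes, avg=1024, min_size=256, max_size=4096):
    """Cut where a rolling hash of the most recent bytes matches a fixed bit pattern."""
    mask = avg - 1  # avg is assumed to be a power of two
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK32
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data) or not data:
        yield data[start:]  # trailing chunk, or one empty chunk for empty input


def minhash(features, perms=PERMS):
    """For each permutation, keep the smallest permuted feature value."""
    return [min(((a * f + b) % PRIME) & MASK32 for f in features) for a, b in perms]


def data_code_body(data: bytes) -> bytes:
    """Illustrative 64-bit body: the lowest bit of each of 64 minhash values."""
    features = [zlib.crc32(chunk) for chunk in cdc_chunks(data)]  # CRC32 stand-in for XXH32
    bits = 0
    for value in minhash(features):
        bits = (bits << 1) | (value & 1)
    return bits.to_bytes(8, "big")
```

Because chunk boundaries depend on the content rather than on byte offsets, inserting a few bytes changes only the chunks near the edit, so most chunk hashes and therefore most minhash values stay the same.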
    +
    +

    NOTE

    +

    For further technical details, see the source code in modules code_data.py and minhash.py of the reference implementation.

    +
    +

    6. Conformance

    +

    An implementation of the Data-Code algorithm shall be regarded as conforming to the standard if it creates the same Data-Code as the reference implementation for the same data input.

    +
    +

    NOTE

    +

    The ISCC reference implementation uses the open source XXHASH library [1] for XXH32 chunk hashing, and appropriate use of this software will generate the same codes as the reference implementation.

    +
    +

    Bibliography

    +
    +
    +
      +
    1. Collet, Yann. xxHash: Extremely fast hash algorithm. Accessed July 2022, available at https://cyan4973.github.io/xxHash/
    +