Merge pull request #148 from n1analytics/develop
Merge v0.9.0 into master
nbgl authored Aug 14, 2018
2 parents cffabb3 + cc589e9 commit e3cb0a8
Showing 27 changed files with 3,026 additions and 62 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -5,6 +5,7 @@ sudo: false

python:
- '3.6'
- '3.7-dev'
- 'nightly'
- 'pypy3'

@@ -13,6 +14,7 @@ env:

matrix:
allow_failures:
- python: '3.7-dev'
- python: 'nightly'
- python: 'pypy3'

133 changes: 123 additions & 10 deletions CHANGELOG.rst
@@ -1,30 +1,144 @@
0.9.0
=====

This release contains a major overhaul of Anonlink’s API and introduces support for multi-party linkage.

The changes are all additive, so the previous API continues to work. That API has now been deprecated and will be removed in a future release. The deprecation timeline is:

- v0.9.0: old API deprecated
- v0.10.0: use of old API raises a warning
- v0.11.0: remove old API

Major changes
-------------
- Introduce abstract similarity functions. The Sørensen–Dice coefficient is now just one possible similarity function.
- Implement Hamming similarity as a similarity function.
- Permit linkage of records other than CLKs (bring your own similarity function).
- Similarity functions now return multiple contiguous arrays instead of a list of tuples.
- Candidate pairs from similarity functions are now always sorted.
- Introduce a standard type for storing candidate pairs. This is now used consistently throughout the API.
- Provide a function for multiparty candidate generation. It takes multiple datasets and compares them against each other using a similarity function.
- Extend the greedy solver to multiparty problems.
- The greedy solver also takes the new candidate pairs type.
- Implement serialisation and deserialisation of candidate pairs.
- Multiple files with serialised candidate pairs can be merged without loading everything into memory at once; a sketch follows this list.
- Introduce type annotations in the new API.
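
A minimal sketch of the serialisation workflow. The function names ``dump_candidate_pairs``, ``load_candidate_pairs``, and ``merge_streams`` are assumptions here; check ``anonlink.serialization`` for the exact API.

.. code-block:: python

    import io

    import anonlink

    # candidate_pairs_a and candidate_pairs_b are outputs of
    # find_candidate_pairs, as in the usage examples below.
    # Serialise them to separate streams (in-memory here; real code
    # would use files opened in binary mode).
    buf_a, buf_b = io.BytesIO(), io.BytesIO()
    anonlink.serialization.dump_candidate_pairs(candidate_pairs_a, buf_a)  # assumed name
    anonlink.serialization.dump_candidate_pairs(candidate_pairs_b, buf_b)  # assumed name
    buf_a.seek(0)
    buf_b.seek(0)

    # Merge the serialised streams without loading either fully into memory.
    merged = io.BytesIO()
    anonlink.serialization.merge_streams([buf_a, buf_b], merged)  # assumed name
    merged.seek(0)
    candidate_pairs = anonlink.serialization.load_candidate_pairs(merged)  # assumed name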

Minor changes
-------------
- Automatically test on Python 3.7.
- Remove support for Python 3.5 and below.
- Update Clkhash dependency to 0.11.
- Minor documentation and style improvements in ``anonlink.concurrency``.
- Provide a convenience function for generating valid candidate pairs from a chunk; see the sketch after this list.
- Change the format of a chunk and move the type definition to ``anonlink.typechecking``.
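
A rough sketch of the chunked workflow. The helper names ``split_to_chunks`` and ``process_chunk`` and their signatures are assumptions here; consult ``anonlink.concurrency`` for the real API.

.. code-block:: python

    import anonlink

    # filters0 and filters1 are sequences of Bloom filters.
    dataset_sizes = [len(filters0), len(filters1)]

    results = []
    # Assumed helper: split the comparison space into roughly equal
    # chunks, each naming two datasets and a row range within each.
    for chunk in anonlink.concurrency.split_to_chunks(1_000_000, dataset_sizes=dataset_sizes):
        # Assumed helper: compare only the block described by ``chunk``,
        # returning candidate pairs indexed into the full datasets.
        results.append(anonlink.concurrency.process_chunk(
            chunk, (filters0, filters1),
            anonlink.similarities.dice_coefficient, 0.7))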

New modules
-----------
- ``anonlink.blocking``: Implementation of functions that assign blocks to every record. These are generally used to optimise matching.
- ``anonlink.candidate_generation``: Finding candidate pairs from multiple datasets using a similarity function.
- ``anonlink.serialization``: Tools for serialisation and deserialisation of candidate pairs. Also permits efficient merging of multiple files of serialised candidate pairs.
- ``anonlink.similarities``: Exposes different similarity functions that can be used to compare records. Currently implemented are ``hamming_similarity`` and ``dice_coefficient``.
- ``anonlink.solving``: Exposes solvers that turn candidate pairs into a concrete matching. Currently, only the ``greedy_solve`` function is exposed. A combined sketch of the new modules follows this list.
- ``anonlink.typechecking``: Types for Mypy and other typecheckers.
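
The following end-to-end multiparty sketch ties the new modules together using the calls shown in the usage examples below; the toy 64-bit Bloom filters are made up for illustration.

.. code-block:: python

    from bitarray import bitarray

    import anonlink

    # Three toy datasets. Record 0 is identical in all three,
    # so the solver should place those records in one group.
    common = bitarray('01' * 32)
    datasets = [
        [common, bitarray('0011' * 16)],
        [common, bitarray('0001' * 16)],
        [common, bitarray('1110' * 16)],
    ]

    candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
        datasets,
        anonlink.similarities.dice_coefficient,  # or hamming_similarity
        0.8)
    groups = anonlink.solving.greedy_solve(candidate_pairs)
    # Each group is a collection of (dataset index, record index)
    # pairs, e.g. one group containing (0, 0), (1, 0) and (2, 0).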

Deprecated modules
------------------
- ``anonlink.bloommatcher`` is replaced by ``anonlink.similarities``. The Tanimoto coefficient functions currently have no replacement.
- ``anonlink.distributed_processing`` is deprecated with no replacement.
- ``anonlink.network_flow`` is deprecated with no replacement.
- ``anonlink.util`` is deprecated with no replacement.

New usage examples
------------------
Before
~~~~~~

.. code-block:: python

    >>> dataset0[0]
    (bitarray('0111101001001100101001001010101000100100010010011011010110110000'),
     0,
     28)
    >>> dataset1[0]
    (bitarray('1100101101001110100001110000110000110101110010101001010001110100'),
     3,
     30)
    >>> candidate_pairs = anonlink.entitymatch.calculate_filter_similarity(
    ...     dataset0, dataset1, k=len(dataset1), threshold=0.7)
    >>> candidate_pairs[0:3]
    [(1, 0.75, 6), (1, 0.75, 96), (1, 0.7457627118644068, 13)]
    >>> mapping = anonlink.entitymatch.greedy_solver(candidate_pairs)
    >>> mapping
    {1: 6,
     2: 44,
     3: 86,
     4: 4,
     5: 61,
     6: 10,
     ...
After
~~~~~~
- The function generating candidate pairs needs only the bloom filters. It does not need the record indices or the popcounts.
- The same function returns a tuple of arrays, instead of a list of tuples.
- The solvers return groups of 2-tuples (dataset index, record index) instead of a mapping.

.. code-block:: python

    >>> dataset0[0]
    bitarray('0111101001001100101001001010101000100100010010011011010110110000')
    >>> dataset1[0]
    bitarray('0101001110110000101110101101110000110001010000000011010010100011')
    >>> datasets = [dataset0, dataset1]
    >>> candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
    ...     datasets,
    ...     anonlink.similarities.dice_coefficient,
    ...     0.7)
    >>> candidate_pairs[0][:3]
    array('d', [1.0, 0.9850746268656716, 0.9841269841269841])
    >>> candidate_pairs[1][0][:3]
    array('I', [0, 0, 0])
    >>> candidate_pairs[1][1][:3]
    array('I', [1, 1, 1])
    >>> candidate_pairs[2][0][:3]
    array('I', [85, 66, 83])
    >>> candidate_pairs[2][1][:3]
    array('I', [82, 62, 79])
    >>> groups = anonlink.solving.greedy_solve(candidate_pairs)
    >>> groups
    ([(0, 85), (1, 82)],
     [(0, 66), (1, 62)],
     [(0, 83), (1, 79)],
     [(0, 49), (1, 44)],
     [(0, 20), (1, 22)],
     ...
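
Unpacking the structure printed above makes the layout of the new candidate pairs type easier to read; a small sketch:

.. code-block:: python

    # candidate_pairs is (similarities, dataset index arrays, record
    # index arrays), one entry per pair, sorted by decreasing similarity.
    sims, (dset_is0, dset_is1), (rec_is0, rec_is1) = candidate_pairs
    for sim, d0, d1, r0, r1 in zip(sims, dset_is0, dset_is1, rec_is0, rec_is1):
        print(f'record {r0} of dataset {d0} ~ record {r1} of dataset {d1}: {sim:.3f}')
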
0.8.2
=====
Fix discrepancies between the Python and C++ versions. #102
Utility added to ``anonlink/concurrency.py`` to help with chunking.
Better GitHub status messages posted by Jenkins.

0.8.1
=====
Minor updates and fixes. Code cleanup.
- Remove checking of chunk size to prevent crashes on small chunks.

0.8.0
=====
Fix the greedy solver so that mappings are set by the first match instead of being repeatedly overwritten. #89

Other improvements
------------------
- Order of the ``k`` and ``threshold`` parameters is now consistent across the library
- Limit the size of ``k`` to prevent out-of-memory denial of service
- Fix misaligned pointer handling #77

0.7.1
=====
Removed the default values for the threshold and "top k results" parameters
throughout, as these parameters should always be determined by the requirements
at the call site. This modifies the API of the functions
@@ -34,28 +148,27 @@ at the call site. This modifies the API of the functions
be specified in every case.

0.7.0
=====
Introduces support for comparing "arbitrary" length cryptographic linkage keys.
The benchmark is much more comprehensive and more comparable between releases; see the
README for an example report.

Other improvements
------------------
- Internal C/C++ cleanup/refactoring and optimization.
- Expose the native popcount implementation to Python.
- Bug fix to avoid configuring a logger.
- Testing now uses ``py.test`` and runs on `Travis CI <https://travis-ci.org/n1analytics/anonlink/>`_

0.6.3
=====
Small fix to logging setup.

0.6.2 - Changelog init
======================
``anonlink`` computes similarity scores, and/or best guess matches between two sets
of *cryptographic linkage keys* (hashed entity records).

4 changes: 2 additions & 2 deletions Jenkinsfile.groovy
@@ -13,7 +13,7 @@ GITHUB_TEST_CONTEXT = "jenkins/test"
GITHUB_RELEASE_CONTEXT = "jenkins/release"

def configs = [
[os: 'linux', pythons: ['python3.4', 'python3.5', 'python3.6'], compilers: ['clang', 'gcc']],
[os: 'linux', pythons: ['python3.6'], compilers: ['clang', 'gcc']],
[os: 'osx', pythons: ['python3.6', 'python3.7'], compilers: ['clang']]
]

@@ -123,7 +123,7 @@ node('GPU 1') {
stage('Release') {
try {
commit.setInProgressStatus(GITHUB_RELEASE_CONTEXT);
build('python3.5', 'gcc', 'GPU 1', true)
build('python3.6', 'gcc', 'GPU 1', true)
commit.setSuccessStatus(GITHUB_RELEASE_CONTEXT)
} catch (Exception e) {
commit.setFailStatus("Release failed", GITHUB_RELEASE_CONTEXT);
5 changes: 5 additions & 0 deletions anonlink/__init__.py
@@ -1,9 +1,14 @@
import pkg_resources

from anonlink import blocking
from anonlink import bloommatcher
from anonlink import candidate_generation
from anonlink import concurrency
from anonlink import entitymatch
from anonlink import network_flow
from anonlink import serialization
from anonlink import solving
from anonlink import typechecking

__version__ = pkg_resources.get_distribution('anonlink').version
__author__ = 'N1 Analytics'
9 changes: 4 additions & 5 deletions anonlink/benchmark.py
@@ -2,8 +2,7 @@
from timeit import default_timer as timer

from clkhash.key_derivation import generate_key_lists
from clkhash.schema import get_schema_types
from clkhash.bloomfilter import calculate_bloom_filters
from clkhash.bloomfilter import stream_bloom_filters
from clkhash.randomnames import NameList

from anonlink.entitymatch import *
@@ -120,9 +119,9 @@ def compare_python_c(ntotal=10000, nsubset=6000, frac=0.8):
nml = NameList(ntotal)
sl1, sl2 = nml.generate_subsets(nsubset, frac)

keys = generate_key_lists(('test1', 'test2'), len(nml.schema))
filters1 = calculate_bloom_filters(sl1, get_schema_types(nml.schema), keys)
filters2 = calculate_bloom_filters(sl2, get_schema_types(nml.schema), keys)
keys = generate_key_lists(('test1', 'test2'), len(nml.schema_types))
filters1 = tuple(stream_bloom_filters(sl1, keys, nml.SCHEMA))
filters2 = tuple(stream_bloom_filters(sl2, keys, nml.SCHEMA))

# Pure Python version
start = timer()