Skip to content

Latest commit

 

History

History
513 lines (366 loc) · 21.1 KB

README.md

File metadata and controls

513 lines (366 loc) · 21.1 KB

Build Status Try fuzzball on RunKit

fuzzball.js logo

Easy to use and powerful fuzzy string matching.

Mostly a JavaScript port of the TheFuzz (formerly fuzzywuzzy) Python library, with some additional features like token similarity sorting and wildcard support.

Demo here comparing some of the different scorers/options. Auto-generated API Docs here.

Contents

Installation

Using NPM

npm install fuzzball

Browser (using pre-built umd bundle, make sure script is utf-8 if page isn't already)

<script charset="UTF-8" src="dist/fuzzball.umd.min.js"></script>
<script>
fuzzball.ratio("fuzz", "fuzzy");
</script>

or as module

<script charset="UTF-8" type="module"">
import {ratio} from './dist/esm/fuzzball.esm.min.js';
console.log(ratio('fuzz', 'fuzzy'));
</script>

Also available are two lite bundles which leave out select features in exchange for a smaller file size. See lite section below.

Usage

Basic Overview

fuzz = require('fuzzball');

fuzz.ratio("hello world", "hiyyo wyrld");
        64
fuzz.token_set_ratio("fuzzy was a bear", "a fuzzy bear fuzzy was");
        100

options = {scorer: fuzz.token_set_ratio};
choices = ["Hood, Harry", "Mr. Minor", "Mr. Henry Hood"];

fuzz.extract("mr. harry hood", choices, options);

// [choice, score, index/key]
[ [ 'Hood, Harry', 100, 0 ],
  [ 'Mr. Henry Hood', 85, 2 ],
  [ 'Mr. Minor', 40, 1 ] ]

/**
* Set options.returnObjects = true to get back
* an array of {choice, score, key} objects instead of tuples
*/

fuzz.extractAsync("mr. harry hood", choices, options, function (err, results){/* do stuff */});

fuzz.extractAsPromised("mr. harry hood", choices, options).then(res => {/* do stuff */});

// Cancel search

// For Node versions 14 and below and IE, use the legacy cancelToken option instead
// see the "Async and Cancellation" section below

const abortController = new AbortController();
options.abortController = abortController;

fuzz.extractAsPromised("gonna get canceled", choices, options)
        .then(res => {/* do stuff */})
        .catch((e) => {
                if (e.message === 'aborted') console.log('Search was aborted!')
        });

abortController.abort();

Simple Ratio

// "!" Stripped and lowercased in pre-processing by default
fuzz.ratio("this is a test", "This is a test!");
        100

Partial Ratio

Highest scoring substring of the longer string vs. the shorter string.

// Still 100, substring of 2nd is a perfect match of the first
fuzz.partial_ratio("test", "testing");
        100

Token Sort Ratio

Tokenized, sorted, and then recombined before scoring.

fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear");
        91
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear");
        100

Token Set Ratio

Highest of 3 scores comparing the set intersection, intersection + difference 1 to 2, and intersection + difference 2 to 1.

fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear");
        84
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear");
        100

If you set options.trySimple to true it will add the simple ratio to the token_set_ratio test suite as well. This can help smooth out occational irregularities in how much differences in the first letter of a token will get penalized.

Token Similarity Sort Ratio

Instead of sorting alphabetically, tokens will be sorted by similarity to the smaller set. Useful if the matching token may have a different first letter, but performs a bit slower. You can also use similarity sorting when calculating token_set_ratio by setting sortBySimilarity to true.

Not available in the lite builds and sorting doesn't yet take wildcards or collation into account. Based off this fuzzywuzzy PR by Exquisition. (seatgeek/fuzzywuzzy#296)

fuzz.token_sort_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
        58
fuzz.token_set_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
        61
fuzz.token_similarity_sort_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
        68
fuzz.token_set_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar', {sortBySimilarity: true})
        71

Distance

Unmodified Levenshtein distance without any additional ratio calculations.

fuzz.distance("fuzzy was a bear", "fozzy was a bear");
        1

Other Scoring Options

  • partial_token_set_ratio (options.trySimple = true will add the partial_ratio to the test suite, note this function will always return 100 if there are any tokens in common)
  • partial_token_sort_ratio
  • partial_token_similarity_sort_ratio
  • WRatio (runs tests based on relative string length and returns weighted top score, current default scorer in fuzzywuzzy extract)

Blog post with overview of scoring algorithms can be found here.

Pre-Processing

Pre-processing to remove non-alphanumeric characters run by default unless options.full_process is set to false.

// Eh, don't need to clean it up..
// Set options.force_ascii to true to remove all non-ascii letters as well, default: false
fuzz.ratio("this is a test", "this is a test!", {full_process: false});
        97

Or run separately.. (run beforehand to avoid a bit of performance overhead)

// force_ascii will strip out non-ascii characters except designated wildcards
fuzz.full_process("myt^eäXt!");
        myt eäxt
fuzz.full_process("myt^eäXt!", {force_ascii: true});
        myt ext

Consecutive white space will be collapsed unless options.collapseWhitespace = false, default true. Setting to false will match the behavior in fuzzywuzzy. Only affects the non-token scorers.

Collation and Unicode Stuff

To use collation when calculating edit distance, set useCollator to true. Will be ignored if Intl.Collator does not exist in your enviroment. (node 0.10 and under, IE10 and under)

Setting useCollator to true will have an impact on performance, so if you have really large number of choices may be best to pre-process (i.e. lodash _.deburr) instead if possible.

options = {useCollator: true};
fuzz.ratio("this is ä test", "this is a test", options);
        100

If your strings contain code points beyond the basic multilingual plane (BMP), set astral to true. If your strings contain astral symbols and this is not set, those symbols will be treated as multiple characters and the ratio will be off a bit. (This will have some impact on performance, which is why it is turned off by default.)

options = {astral: true};
fuzz.ratio("ab🐴c", "ab🐴d", options);
        75

If astral is set to true, full_process will be set to false automatically, as the current alphanumeric check only supports BMP.

When astral is true it will also normalize your strings before scoring, as long as String.prototype.normalize exists in your environment, but will not attempt to polyfill. (So if you need to compare unnormalized strings in IE, normalize separately) You can set the normalize option to false if you want different representations not to match, but is true by default.

Batch Extract

Search list of choices for top results.

fuzz.extract(query, choices, options);
fuzz.extractAsync(query, choices, options, function(err, results) { /* do stuff */ }); (internal loop will be non-blocking)
fuzz.extractAsPromised(query, choices, options).then(results => { /* do stuff */ }); (Promise will not be polyfilled)

Simple: array of strings, or object in form of {key: "string"}

The scorer defaults to fuzz.ratio if not specified.

With array of strings

query = "polar bear";
choices = ["brown bear", "polar bear", "koala bear"];

results = fuzz.extract(query, choices);

// [choice, score, index]
[ [ 'polar bear', 100, 1 ],
  [ 'koala bear', 80, 2 ],
  [ 'brown bear', 60, 0 ] ]

With object

query = "polar bear";
choicesObj = {id1: "brown bear",
              id2: "polar bear",
              id3: "koala bear"};

results = fuzz.extract(query, choicesObj);

// [choice, score, key]
[ [ 'polar bear', 100, 'id2' ],
  [ 'koala bear', 80, 'id3' ],
  [ 'brown bear', 60, 'id1' ] ]

Return objects

options = {returnObjects: true}
results = fuzz.extract(query, choicesObj, options);

[ { choice: 'polar bear', score: 100, key: 'id2' },
  { choice: 'koala bear', score: 80, key: 'id3' },
  { choice: 'brown bear', score: 60, key: 'id1' } ]

Less simple: array of objects, or object in form of {key: choice}, with processor function + options

Optional processor function takes a choice and returns the string which will be used for scoring. Each choice can be a string or an object, as long as the processor function can accept it and return a string.

query = "126abzx";
choices = [{id: 345, model: "123abc"},
           {id: 346, model: "123efg"},
           {id: 347, model: "456abdzx"}];

options = {
        scorer: fuzz.partial_ratio, // Any function that takes two values and returns a score, default: ratio
        processor: choice => choice.model,  // Takes choice object, returns string, default: no processor. Must supply if choices are not already strings.
        limit: 2, // Max number of top results to return, default: no limit / 0.
        cutoff: 50, // Lowest score to return, default: 0
        unsorted: false // Results won't be sorted if true, default: false. If true limit will be ignored.
};

results = fuzz.extract(query, choices, options);

// [choice, score, index/key]
[ [ { id: 347, model: '456abdzx' }, 71, 2 ],
  [ { id: 345, model: '123abc' }, 67, 0 ] ]

The processor function will only run on choices, so if your processor function modifies text in any way be sure to do the same to your query for unbiased results. This and default scorer are a slight departure from current fuzzywuzzy behavior.

Multiple Fields

If you want to use more than one field for scoring, can do stuff like combine two fields in a processor function before scoring.

processor = choice => choice.field1 + " " + choice.field2;

For more complex behavior you can provide a custom scorer, say for a weighted score of two fields, or to include additional types of data. When using a custom scorer both the query and each choice can be any type of value, as long as your scorer can handle the respective parameters correctly.

query = {name: "tiger", gender: "female"}

choices = [{name: "tigger", gender: "male"},
           {name: "lulu", gender: "female"},
           {name: "chad ochocinco", gender: "male"}]

function myCustomScorer(query, choice, options) {
        if (query.gender !== choice.gender) return 0;
        else return fuzz.ratio(query.name, choice.name, options);
}

options = {scorer: myCustomScorer}
results = fuzz.extract(query, choices, options);

(if you still wanted to use a separate processor function for whatever reason, the processor function would need to return something your scorer accepts)

Async and Cancellation

When using extractAsPromised or extractAsync, you might want to cancel the action before it has finished. It can be done using CancelToken or AbortController.

For performance, by default only every 256th loop will be async, but set asyncLoopOffset to change. It is most likely not worth changing this.

AbortController

AbortController is present in modern browsers and from node version 15+. It is currently a standard way how to cancel running operations.

// or use AbortController to cancel search
const abortController = new AbortController();

options.abortController = abortController;
options.asyncLoopOffset = 64;

fuzz.extractAsPromised("gonna get aborted", choices, options)
        .then(res => {/* do stuff */})
        .catch((e) => {
                if (e.message === 'aborted') console.log('I got aborted!')
        });

abortController.abort();

CancelToken

For older browsers and node versions you can use cancel token. It is an in

let cancelToken = { canceled: false };

options.cancelToken = cancelToken;
options.asyncLoopOffset = 64;

fuzz.extractAsPromised("gonna get canceled", choices, options)
        .then(res => {/* do stuff */})
        .catch((e) => {
                if (e.message === 'canceled') console.log('I got canceled!')
        });

// ...

cancelToken.canceled = true;

Wildcards

Set options.wildcards to a string containing wildcard characters to be used as wildcards when calculating edit distance. Each character in the string will be treated as a wildcard, and wildcards are case insensitive unless options.full_process is set to false.

options = {wildcards: "*x"}; // '*' and 'x' are both wildcards
fuzz.ratio('fuzzba*l', 'fuXxball', options);
        100

Notes: Wildcards are currently not supported when astral is set to true. The set operations in the token_set_ratio's will now take wildcards into account, unless using fuzzball_lite. In fuzzball_lite the set operations are currently still not wildcard aware to avoid the extra dependencies, so the token_set scores in lite when using wildcards will differ.

Fuzzy Dedupe

Convenience function to take a list of items containing duplicates and uses fuzzy matching to identify and remove duplicates. Uses extract to identify duplicates that score greater than a user defined threshold/cutoff. Then, it looks for the longest item in the duplicate list since we assume this item contains the most entity information and returns that. It breaks string length ties on an alphabetical sort.

To keep the map of which items were matched with each unique value set options.keepmap = true. Other than that, available options are the same as in extract except that the cutoff will default to 70 if not supplied, limit will be ignored if given, and each item must either be a string, or if using a processor function the post-processed item must be a string.

Note: as the cutoff DECREASES the number of duplicates that are found INCREASES. This means that the returned deduplicated list will likely be shorter. Raise the threshold for fuzzy_dedupe to be less sensitive.

contains_dupes = ['fuzzy wuzzy', 'fuzzy wuzz', 'not a dupe'];
options = {cutoff: 85, scorer: fuzz.token_set_ratio}
fuzz.dedupe(contains_dupes, options)

// [item, index/key of item in original list]
[ [ 'fuzzy wuzzy', 0 ],
  [ 'not a dupe', 2 ] ]

options.keepmap = true;
fuzz.dedupe(contains_dupes, options)

// [item, index/key of item in original list, [output of fuzz.extract for item]]
[ [ 'fuzzy wuzzy', 0, [ [Object], [Object] ] ],
  [ 'not a dupe', 2, [ [Object] ] ] ]

Performance Optimization

If you have a large list of terms that you're searching repeatedly, and you need to boost performance, can do some of the processing beforehand. For all scorers you can run full_process() on all of the choices beforehand, and then set options.full_process to false. With the token scorers you can run some of the additional processing beforehand. Exactly how depends on if using with the extract function or as standalone functions. (Also, if you wanted to use an alternate tokenizer could sub them for the functions used below)

If using either "token_sort" scorer with the extract function: You can set the property "proc_sorted" of each choice object and it will use that instead of running process_and_sort() again. If you supply a processor function when proc_sorted is set the processor will not get used. (Will need to make sure each choice is an object, even if just "choice = new String(choice)")

query = fuzz.full_process("126-Abzx");
choices = [{id: 345, model: "123-abc"},
           {id: 346, model: "efg-123"},
           {id: 347, model: "456 abdzx"}];
for (choice of choices) {
        choice.proc_sorted = fuzz.process_and_sort(fuzz.full_process(choice.model));
}
options = {
        scorer: fuzz.token_sort_ratio,
        full_process: false
};
results = fuzz.extract(query, choices, options);

If using either "token_sort" scorer as standalone functions: Set options.proc_sorted = true and process both strings beforehand.

str1 = "Abe Lincoln";
str2 = "Lincoln, Abe";

str1 = fuzz.process_and_sort(fuzz.full_process(str1));
str2 = fuzz.process_and_sort(fuzz.full_process(str2));
fuzz.token_sort_ratio(str1, str2, {proc_sorted: true});
        100

If using either "token_set" scorer with extract: You can set the property "tokens" of each choice object and it will use that instead of running unique_tokens() again. Processor functions will be ignored if "tokens" is set. (Will need to make sure each choice is an object, even if just "choice = new String(choice)")

query = fuzz.full_process("126-Abzx");
choices = [{id: 345, model: "123-abc"},
           {id: 346, model: "efg-123"},
           {id: 347, model: "456 abdzx"}];

for (choice of choices) {
        choice.tokens = fuzz.unique_tokens(fuzz.full_process(choice.model));
}
options = {
        scorer: fuzz.token_set_ratio,
        full_process: false
};
results = fuzz.extract(query, choices, options);

If using either "token_set" scorer as standalone functions: Tokenize both strings beforehand and attach them to options.tokens as a two element array.

str1 = "fluffy head man";
str2 = "heady fluffy head";

str1_tokens = fuzz.unique_tokens(fuzz.full_process(str1));
str2_tokens = fuzz.unique_tokens(fuzz.full_process(str2));

options = {tokens: [str1_tokens, str2_tokens]};

// Still have to include first two args for validation but they won't be used for scoring
fuzz.token_set_ratio(str1, str2, options);
        85

Pass options to fuzz.unique_tokens as the second argument if you're using wildcards for it to be wildcard aware.

Alternate Ratio Calculations

If you want to use difflib's ratio function for all ratio calculations, which differs slightly from the default python-Levenshtein style behavior, you can specify options.ratio_alg = "difflib". The difflib calculation is a bit different in that it's based on matching characters rather than true minimum edit distance, but the results are usually pretty similar. Difflib uses the formula 2.0*M / T where M is the number of matches, and T is the total number of elements in both sequences. This mirrors the behavior of fuzzywuzzy when not using python-Levenshtein. Not all features (wildcards, collation) supported when using difflib ratio.

Except when using difflib, the ratios are calculated as ((str1.length + str2.length) - distance) / (str1.length + str2.length), where distance is calculated with a substitution cost of 2. This follows the behavior of python-Levenshtein, however the fuzz.distance function still uses a cost of 1 by default for all operations if just calculating distance and not a ratio.

Not all scoring options are available if using the difflib calculation. (i.e. useCollator, wildcards)

Lite Versions

Also available are the fuzzball_lite and fuzzball_ultra_lite versions if you need a smaller file size. These are located in fuzzbal/lite and fuzzball/ultra_lite. The full version has been reworked to only pull in the needed parts from difflib though, so the difference between it and lite isn't quite as much as before.

The lite version doesn't include the partial ratio functions, and only has limited wildcard support. The ultra_lite version doesn't include those and also leaves out support for collation or correct astral symbols handling, the extract functions are not as optimized for large datasets, and it's alphanumeric check will strip out all non-ascii characters.

The full, lite and ultra_lite bundles currently weight in at a compressed 28kB, 20kB, and 8kB, respectively.

Credits

In addition to all dependencies..

Distance calculations based on leven and fast-levenshtein.

Default ratio formula is based on python-Levenshtein.

Unicode alphanumerica check from XRegExp.

Substring matching from difflib.js.

Inspiration to add cancellation and async optimization from fuzzysort

Thanks to dvisg and voidevector on reddit for .d.ts feedback.

Contributions

Pull requests welcome.