Dynamic pruning strategies such as MaxScore, Wand and BlockMaxWand require a max score value (a.k.a. term upper bound) associated with each term in the lexicon. The ms-generate tool adds a new data structure, called the max score index, to an existing index. The new data structure is an array of floats stored on file (extension: .ms) and indexed by term id: it maps each term id in the lexicon to the corresponding max score. The tool also updates the index properties file to record information about the newly created max score index.
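As an illustration of what a term upper bound is, the following minimal sketch computes, for each term id, the maximum score over that term's posting list and stores it in a float array indexed by term id. It is not the code used by ms-generate: the `MaxScoreSketch` class, the `Scorer` interface and the in-memory posting arrays are hypothetical stand-ins for the real index and weighting model.

```java
public class MaxScoreSketch {

    /** Hypothetical stand-in for a weighting model such as BM25. */
    interface Scorer {
        float score(int termFrequency, int docLength);
    }

    /**
     * Returns an array mapping each term id to the largest score found
     * in its posting list. tfs[t][i] and docLens[t][i] describe the i-th
     * posting of term t (assumed layout, for illustration only).
     * Every term is assumed to have at least one posting.
     */
    static float[] computeMaxScores(int[][] tfs, int[][] docLens, Scorer scorer) {
        float[] maxScores = new float[tfs.length];
        for (int termId = 0; termId < tfs.length; termId++) {
            // Start from -infinity so that weighting models producing
            // negative scores are handled correctly.
            float max = Float.NEGATIVE_INFINITY;
            for (int i = 0; i < tfs[termId].length; i++) {
                float s = scorer.score(tfs[termId][i], docLens[termId][i]);
                max = Math.max(max, s);
            }
            maxScores[termId] = max;
        }
        return maxScores;
    }
}
```

A dynamic pruning strategy can then look up `maxScores[termId]` to bound the contribution of a term without traversing its postings.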
To generate a max score index, computing the max scores with the BM25 weighting model:
./target/bin/ms-generate -index /path/to/old/index/cw09b.properties -wm BM25
The index properties file now includes the weighting model used to generate the max score.
The ms-generate tool accepts the following options.
-index [String] (required)
Fully qualified filename of one of the files of an existing Terrier index. The parameter is split automatically into a Terrier path and prefix.
-wm [String] (required)
Class implementing the weighting model to use when computing the max score.
If no fully qualified class name is provided, the it.cnr.isti.hpclab.matching.structures. prefix is prepended to the argument given.
The currently implemented weighting models are BM25, LM and DLH13.
-p [Number] (optional)
Number of threads to use. The value is capped at the number of available cores. Default: 1.
Multi-threaded execution is experimental: use with caution, since threads compete for available memory!
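The way the -wm and -p values are interpreted can be sketched as follows. This is only an illustration of the behavior described above, not the tool's actual code: the `OptionSketch` class and its method names are hypothetical, and treating "contains a dot" as the test for a fully qualified class name is an assumption.

```java
public class OptionSketch {

    // Package prefix documented for the -wm option.
    static final String DEFAULT_PREFIX = "it.cnr.isti.hpclab.matching.structures.";

    /**
     * Resolves the -wm argument to a class name.
     * Assumption: an argument without any dot is a short name.
     */
    static String resolveWeightingModel(String arg) {
        return arg.contains(".") ? arg : DEFAULT_PREFIX + arg;
    }

    /** Caps the requested thread count at the number of available cores. */
    static int effectiveThreads(int requested) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Never go below one thread, never above the core count.
        return Math.max(1, Math.min(requested, cores));
    }
}
```

Under these assumptions, `-wm BM25` resolves to `it.cnr.isti.hpclab.matching.structures.BM25`, while a name that already contains a package is left unchanged.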
Once generated, the new max score index can be accessed and used like any other Terrier 5 index data structure:
import it.cnr.isti.hpclab.maxscore.structures.MaxScoreIndex;
...
MaxScoreIndex msi = (MaxScoreIndex) index.getIndexStructure("maxscore");
...
When an index is opened for query processing, the max score array is fully loaded into main memory. To save memory, it can instead be accessed from file through a small cache of 10Ki entries. This behavior is controlled by the preload.maxscore.index system property (default: true).
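The trade-off between preloading and cached access can be illustrated with a small sketch. It reads the preload.maxscore.index property with its documented default and shows one possible bounded cache in front of a slower lookup. The `MaxScoreCacheSketch` class, the `LruFloatCache` type, the loader function and the least-recently-used eviction policy are all assumptions made for illustration; the source does not specify how the real cache evicts entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class MaxScoreCacheSketch {

    // 10Ki entries, as stated for the file-backed access mode.
    static final int CACHE_ENTRIES = 10 * 1024;

    /** Reads the system property, defaulting to true (preload everything). */
    static boolean preloadRequested() {
        return Boolean.parseBoolean(System.getProperty("preload.maxscore.index", "true"));
    }

    /** Hypothetical bounded cache; LRU eviction is an assumption. */
    static final class LruFloatCache {
        private final IntFunction<Float> loader; // stands in for a read from file
        private final Map<Integer, Float> cache;
        int misses = 0;

        LruFloatCache(int capacity, IntFunction<Float> loader) {
            this.loader = loader;
            // accessOrder = true makes iteration order least-recently-used first.
            this.cache = new LinkedHashMap<Integer, Float>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Float> eldest) {
                    return size() > capacity;
                }
            };
        }

        float get(int termId) {
            Float hit = cache.get(termId);
            if (hit != null) {
                return hit;
            }
            misses++;
            float value = loader.apply(termId);
            cache.put(termId, value);
            return value;
        }
    }
}
```

As with any JVM system property, the value can be set with `-Dpreload.maxscore.index=false` on the java command line.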
- The weighting models used for max score generation and batch query processing must be the same. A new index property is used to ensure this.
- For memory problems, check the appassembler Maven plugin parameter in the POM file.
Developed by Nicola Tonellotto, ISTI-CNR.