Similarity Lab provides a set of tools and trained embedding matrices for historical language analysis.
With this toolset you can track changes in the significance of words across several decades.
You'll find two sets of word embedding matrices along with their corresponding vocabularies. These matrices were obtained from two major news corpora: thousands of articles from The New York Times and The Guardian.
- Track significance changes across the years
- Measure cosine distance between words
- Perform analogy tests
- Analyse change tendencies
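The operations listed above can be illustrated with plain NumPy. The following is a minimal sketch on toy data, not the SimiLab API: the function names `cosine_distance`, `analogy`, and `drift`, the four-word vocabulary, and the tiny vectors are all hypothetical and chosen only for illustration. The `drift` helper also assumes that the matrices of different years live in a shared (aligned) vector space, which is an assumption of this sketch.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vocab, matrix, a, b, c):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    index = {word: i for i, word in enumerate(vocab)}
    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    # Normalise rows so that a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ (target / np.linalg.norm(target))
    # Exclude the three query words from the candidates.
    for word in (a, b, c):
        scores[index[word]] = -np.inf
    return vocab[int(np.argmax(scores))]

def drift(word, vocab, matrices_by_year):
    """Cosine distance of a word's vector in each year to its first-year vector.

    Assumes every yearly matrix shares the same vocabulary order and an
    aligned vector space (hypothetical setup for this sketch).
    """
    i = vocab.index(word)
    years = sorted(matrices_by_year)
    base = matrices_by_year[years[0]][i]
    return {y: cosine_distance(base, matrices_by_year[y][i]) for y in years}

# Toy data, invented for illustration only.
vocab = ["man", "woman", "king", "queen"]
matrix = np.array([
    [1.0, 0.0, 0.0],  # man
    [1.0, 1.0, 0.0],  # woman
    [1.0, 0.0, 1.0],  # king
    [1.0, 1.0, 1.0],  # queen
])

print(cosine_distance(matrix[0], matrix[2]))
print(analogy(vocab, matrix, "man", "woman", "king"))
```

With real embeddings, `matrix` would be one of the provided matrices and `vocab` its corresponding vocabulary; how those files are loaded depends on their format, which this sketch does not cover.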
You can get all the files mentioned above by cloning the repo. This may take a while because of the size of the matrices, so be patient.
git clone https://github.com/CID-ITBA/similarity-lab.git
We have also made a Python package, available via pip, to interface with the matrices. It's an active project, so make sure to check for upcoming updates.
pip install SimiLab
You can find examples and documentation at our Read the Docs site.
We are seeking to expand our collection of word corpora, so any good reference to a new source will be appreciated.
This project is under the MIT license.
@cselmo | @MT2321 | @PabloSML |
---|---|---|
Member of @CID-ITBA and @CoNexDat for the OpLaDyn project | Member of @CID-ITBA | Member of @CID-ITBA |