IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces (EMNLP 2022)

kellymarchisio/isovec

IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

This is an implementation of the experiments and combination system presented in:

If you use this software for academic research, please cite the paper above.

Requirements

  • python3
  • pytorch
  • sklearn
  • scipy
  • numpy
  • indic-nlp-library
  • torchtext

Setup

  • Download third party packages: cd third_party && sh get_third_party.sh && cd ..
    • Note: On a Mac with an M1 chip, word2vec might not build. To fix this, change -march=native to -mcpu=apple-m1 in word2vec's makefile, and replace fgetc_unlocked/fputc_unlocked with getc_unlocked/putc_unlocked. You will also need to use gshuf instead of shuf in src/train.py.
  • Download and make data: cd data && sh make_data.sh
  • Download and make train/dev/test dictionaries: cd data/dicts && sh create_dicts.sh
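
The three setup steps above can be chained in one script. This is a minimal sketch, assuming each step is meant to be started from the repository root; the helper names run_in and setup_all and the DRY_RUN switch are illustrative and not part of the repository. With DRY_RUN=1 (the default here) the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: run each setup step from the repository root.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# run_in DIR CMD...: run CMD inside DIR in a subshell, so the caller's
# working directory is unchanged afterwards. (Hypothetical helper.)
run_in() {
    dir="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "(cd $dir && $*)"
    else
        (cd "$dir" && "$@")
    fi
}

# setup_all: the three setup steps from the list above, stopping at the first failure.
setup_all() {
    run_in third_party sh get_third_party.sh &&
    run_in data sh make_data.sh &&
    run_in data/dicts sh create_dicts.sh
}

setup_all
```

Running the steps in subshells avoids having to cd back to the root between them. Set DRY_RUN=0 to actually execute the commands.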

Usage

To reproduce Table 1 in the paper (Baselines), run:

  • sh baseline.sh $system $lang $seed
    • For instance, run sh baseline.sh w2v uk for the official word2vec trained on Ukrainian.
    • system choices: {isovec, w2v}
    • lang choices: {uk, bn, ta, en}
  • For instance, after training the English and Ukrainian baseline w2v spaces, you can map them and evaluate dictionary precision with: sh map-and-eval.sh baseline w2v uk en dev
    • Results will be in exps/baseline/w2v/uk-en/*out
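
The baseline workflow above (train two spaces, then map and evaluate) could be wrapped as follows. This is a sketch under assumptions: the function name baseline_pair, the run helper, and the DRY_RUN switch are hypothetical, and the last argument is called "split" here only because the example passes dev. The seed is omitted, as in the example commands; the usage line suggests it can be passed as a third argument to baseline.sh.

```shell
#!/bin/sh
# Sketch only: train two baseline spaces, then map and evaluate them.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print or execute a command. (Hypothetical helper.)
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# baseline_pair SYSTEM SRC TGT SPLIT
#   SYSTEM: isovec or w2v; SRC/TGT: language codes such as uk, bn, ta, en.
baseline_pair() {
    system="$1"; src="$2"; tgt="$3"; split="$4"
    run sh baseline.sh "$system" "$src" &&
    run sh baseline.sh "$system" "$tgt" &&
    run sh map-and-eval.sh baseline "$system" "$src" "$tgt" "$split"
}

baseline_pair w2v uk en dev
```

With DRY_RUN=0, the evaluation output for this call would be expected under exps/baseline/w2v/uk-en/, as noted above.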

To run IsoVec in reference to a fixed embedding space (main experiments):

  • Example Goal: Train a Ukrainian embedding space with RSIM-U, in reference to a fixed English space.
  • Step 1: Train the fixed English space with sh baseline.sh isovec en
  • Step 2: Train the Ukrainian space with: sh run-isovec.sh rsim-u uk en
    • Choices of IsoVec training algorithm are l2, proc-l2, proc-l2-init, rsim, rsim-init, rsim-u, and evs-u, for L2, Proc-L2, Proc-L2+Init, RSIM, RSIM+Init, RSIM-U, and EVS-U, as detailed in Sections 4.3 and 4.4 of the paper.
  • Step 3: Map & Evaluate the spaces with: sh map-and-eval.sh isovec rsim-u uk en dev
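
Steps 1 to 3 can be combined into one script that also checks the algorithm name against the choices listed above. This is a sketch, not the repository's own tooling: isovec_pair, run, and DRY_RUN are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch only: train a fixed reference space, train an IsoVec space against it,
# then map and evaluate.
# DRY_RUN=1 (the default in this sketch) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print or execute a command. (Hypothetical helper.)
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# isovec_pair ALGO SRC REF SPLIT
#   ALGO: one of the training algorithms listed in the README.
#   SRC:  language to train; REF: language of the fixed reference space.
isovec_pair() {
    algo="$1"; src="$2"; ref="$3"; split="$4"
    case "$algo" in
        l2|proc-l2|proc-l2-init|rsim|rsim-init|rsim-u|evs-u) ;;
        *) echo "unknown algorithm: $algo" >&2; return 2 ;;
    esac
    run sh baseline.sh isovec "$ref" &&                 # Step 1: fixed reference space
    run sh run-isovec.sh "$algo" "$src" "$ref" &&       # Step 2: IsoVec training
    run sh map-and-eval.sh isovec "$algo" "$src" "$ref" "$split"   # Step 3: map and evaluate
}

isovec_pair rsim-u uk en dev
```

Rejecting an unknown algorithm name up front avoids training the reference space before finding out that the second step cannot run.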
