ipa-grammar

Basic grammar for parsing International Phonetic Alphabet (IPA) transcriptions

Setup

To render parse trees as images, install Graphviz with your package manager of choice.

For example, from Homebrew on macOS:

$ brew install graphviz
...

Install the dependencies in a Python virtual environment:

$ python3 -m venv ipa
$ source ipa/bin/activate
(ipa) $ pip3 install -U pip
(ipa) $ pip3 install -r requirements.txt
...

To use the virtual environment in the Jupyter notebook, run:

(ipa) $ ipython kernel install --user --name=ipa
(ipa) $ jupyter notebook ipa_grammar.ipynb

Then, choose the kernel with the name of the virtual environment:

Select the "ipa" kernel

ipa_grammar.py

The ipa_grammar.py script has a basic CLI that reads a "sentence" from a file (or stdin) and parses it with a given .lark grammar. It pretty-prints the parse tree as text and can also write a .gv graph that Graphviz's dot program renders as an image.

(ipa) $ ./ipa_grammar.py -h
usage: ipa_grammar.py [-h] [-o OUTPUT] [-g GRAMMAR] input

positional arguments:
  input                 path to file to read input from (use "-" to read from stdin)

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        path to file where graphviz graph will be written (default: None)
  -g GRAMMAR, --grammar GRAMMAR
                        path to .lark grammar file (default: ipa.lark)
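The help text above can be reproduced with a few lines of argparse. The following is a minimal sketch, not the script's actual source: the function names build_parser and read_sentence are hypothetical, and stripping trailing whitespace from the input is an assumption. The "(default: ...)" suffixes in the help output suggest argparse.ArgumentDefaultsHelpFormatter, which is what the sketch uses.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the CLI described by the help text."""
    # ArgumentDefaultsHelpFormatter appends "(default: ...)" to option help.
    parser = argparse.ArgumentParser(
        prog="ipa_grammar.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "input",
        help='path to file to read input from (use "-" to read from stdin)',
    )
    parser.add_argument(
        "-o", "--output",
        help="path to file where graphviz graph will be written",
    )
    parser.add_argument(
        "-g", "--grammar",
        default="ipa.lark",
        help="path to .lark grammar file",
    )
    return parser


def read_sentence(path: str, stdin=sys.stdin) -> str:
    """Read the text to parse; "-" means stdin (stripping is an assumption)."""
    if path == "-":
        return stdin.read().strip()
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling build_parser().parse_args(["-"]) leaves output as None and grammar as "ipa.lark", matching the defaults shown in the help text.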

For example:

(ipa) $ echo '[kʰæt]' > cat-transcription.txt
(ipa) $ ./ipa_grammar.py cat-transcription.txt -g ipa.lark -o cat.gv
transcription
  phonetic
    syllables
      None
      None
      syllable
        onset
          consonant
            k
            cfeatures
              cfeature	ʰ
        rime
          nucleus
            vowel
              æ
              None
              None
          coda
            consonant
              t
              None

(ipa) $ dot -Tpng -o cat.png cat.gv

This will generate a graphical parse tree in the file cat.png:

Graphical parse tree
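To make the .gv format less opaque, here is a small sketch of how a parse tree can be written out as a Graphviz digraph using only the standard library. It is an illustration, not how ipa_grammar.py does it: the nested (label, children) representation, the TREE fragment (hand-copied from the syllable subtree above, with the None placeholders omitted), and the to_dot function are all assumptions made for this example.

```python
from itertools import count

# Nested (label, children) pairs: a hand-copied fragment of the tree above.
TREE = ("syllable", [
    ("onset", [
        ("consonant", [
            ("k", []),
            ("cfeatures", [("cfeature", [("ʰ", [])])]),
        ]),
    ]),
    ("rime", [
        ("nucleus", [("vowel", [("æ", [])])]),
        ("coda", [("consonant", [("t", [])])]),
    ]),
])


def to_dot(tree, name="parse"):
    """Return DOT source with one node per tree node and one edge per parent-child link."""
    ids = count()
    lines = [f"digraph {name} {{"]

    def visit(node):
        label, children = node
        node_id = f"n{next(ids)}"  # ids are assigned in pre-order
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  {node_id} [label="{escaped}"];')
        for child in children:
            child_id = visit(child)
            lines.append(f"  {node_id} -> {child_id};")
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(TREE))
```

Saving the printed text to a file and passing it to dot -Tpng, as in the command above, should produce a comparable picture for this fragment.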

If you try to parse some text that the grammar does not license as a valid transcription, you'll get an error like this:

(ipa) $ echo '/ˈɡɹæ.mə(ɹ)/' | ./ipa_grammar.py -
No terminal matches '(' in the current parser context, at line 1 col 9

/ˈɡɹæ.mə(ɹ)/
        ^
Expected one of: 
	* LEFTTONECONTOUR
	* V
	* STRESS
	* RIGHTTONECONTOUR
	* TONEMARK
	* VBAR
	* LINK
	* VFEATURE
	* XFEATURE
	* SLASH
	* BREAK
	* __ANON_0
	* LENGTH
	* TONESTEP
	* C
	* DOUBLEBREVE

Tests

To run the tests:

(ipa) $ ./tests/run.zsh 
/mǎi mài mâi mái/ PASS
/ˈkatən/ PASS
[ˈkhætn̩] PASS
[ˈdʒæk|pɹəˌpɛəɹɪŋ ðə ˈweɪ|wɛnt ˈɒn‖] PASS
[↑bɪn.ðɛɹ↘|↑dɐn.ðæt↘‖] PASS
[túrán↑tʃí nè] PASS
[xɤn˧˥ xaʊ˨˩˦] PASS
[ˈɹɪðm̩] PASS
[ˈhuːˀsð̩ɣ] PASS
[ˈsr̩t͡sɛ] PASS
[ɹ̝̍] PASS
[ʙ̞̍] PASS
èlʊ́kʊ́nyá PASS
huʔ˩˥ PASS
mā PASS
nu.jam.ɬ̩ PASS
a˩˥˥˩˦˥˩˨˧˦˧ PASS
[u ↑ˈvẽ.tu ˈnɔ.ɾtɯ ku.mɯˈso.ɐ.suˈpɾaɾ.kõˈmũi.tɐ ˩˧fu.ɾiɐ | mɐʃ ↑ˈku̯ɐ̃.tu.maiʃ.su˩˧pɾa.vɐ | maiz ↑u.viɐ↓ˈʒɐ̃.tɯ.si.ɐk.õʃ↓ˈɡa.va.suɐ ˧˩ka.pɐ | ɐˈtɛ ↑kiu ˈvẽ.tu ˈnɔɾ.tɯ ˧˩d̥z̥ʃtiu ǁ] PASS
( while read l; do; echo -n "$l " | tee /dev/stderr | ( ./ipa_grammar.py - > )  5.20s user 0.58s system 94% cpu 6.122 total
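The loop in the timing line feeds each test transcription to the script and reports a result per line. A rough Python equivalent of that pattern is sketched below under stated assumptions: run_cases and rejects_parentheses are hypothetical names, the parser is passed in as a plain callable rather than invoking ipa_grammar.py, and the "FAIL" label is assumed, since the output above only shows PASS.

```python
def run_cases(cases, parse):
    """Call parse(case) for each case and label it PASS or FAIL.

    `parse` is any callable that raises an exception on invalid input.
    """
    results = []
    for case in cases:
        try:
            parse(case)
        except Exception:
            results.append((case, "FAIL"))  # "FAIL" label is an assumption
        else:
            results.append((case, "PASS"))
    return results


def rejects_parentheses(text):
    """Stand-in parser for the demo: it only rejects parentheses."""
    if "(" in text or ")" in text:
        raise ValueError(f"unexpected parenthesis in {text!r}")


if __name__ == "__main__":
    for case, status in run_cases(["[kʰæt]", "/ˈɡɹæ.mə(ɹ)/"], rejects_parentheses):
        print(case, status)
```

Swapping the stand-in for a function that runs the real grammar would give a per-line report in the same shape as the output above.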

Known Issues

The grammar is not comprehensive, and its parsing of syllable structure does not work in all cases. For example, it does not disambiguate consonant clusters that could span a syllable boundary, nor adjacent vowels that might belong to different syllables.
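The consonant-cluster problem is easy to see by enumeration. The sketch below is illustrative only and independent of the grammar: cluster_splits and syllabifications are hypothetical helpers, and segments are passed as a list so that multi-character symbols stay intact. A cluster of n consonants between two vowels can be divided between coda and onset in n + 1 ways, and without phonotactic knowledge there is no basis for preferring one.

```python
def cluster_splits(cluster):
    """Return every (coda, onset) division of an intervocalic consonant cluster."""
    return [(cluster[:i], cluster[i:]) for i in range(len(cluster) + 1)]


def syllabifications(left_vowel, cluster, right_vowel):
    """Render each division with "." marking the syllable boundary."""
    return [
        f"{left_vowel}{''.join(coda)}.{''.join(onset)}{right_vowel}"
        for coda, onset in cluster_splits(cluster)
    ]


if __name__ == "__main__":
    # Segments of English "extra" without stress marking.
    for candidate in syllabifications("ɛ", ["k", "s", "t", "ɹ"], "ə"):
        print(candidate)
```

For the four-consonant cluster in this example the function yields five candidates, from ɛ.kstɹə to ɛkstɹ.ə; choosing among them is the kind of decision a language-specific grammar would have to encode.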

To Do

  • Write a grammar for IPA extensions
  • Write grammars for specific languages taking phonotactics into account