Create scripts for phonetic transcription of Czech, Slovak and Polish.
lukyjanek committed Sep 7, 2019
0 parents commit 9c85534
Showing 5 changed files with 773 additions and 0 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.txt
@@ -0,0 +1,4 @@
Version 1 (v1) [released on 7 Sept 2019]
- create script for automatic phonetic transcription of Czech according to listed linguistic works (in README.md)
- create script for automatic phonetic transcription of Slovak according to listed linguistic works (in README.md)
- create script for automatic phonetic transcription of Polish according to listed linguistic works (in README.md)
70 changes: 70 additions & 0 deletions README.md
@@ -0,0 +1,70 @@
# Automatic phonetic transcription of the Czech, Slovak and Polish languages
This repository contains a rule-based implementation of the phonetic transcription of Czech, Slovak and Polish into the [International Phonetic Alphabet](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) (IPA). The rules and IPA symbols used are based on phonological, phonetic, and orthoepic studies (listed below) of these West Slavic languages.

`CHANGELOG.txt` lists the changes in each version. The current version is version 1.

## Usage
These scripts can be imported into any Python project or run as shell scripts. Three usage examples are shown below.

**1. Import the functions into your Python 3 project.**
```python
from phon_czech import ipa_czech
from phon_slovak import ipa_slovak
from phon_polish import ipa_polish

word1 = ipa_czech('všichni')
text1 = ipa_czech('Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv.')

word2 = ipa_slovak('všetci')
text2 = ipa_slovak('Všetci ľudia sa rodia slobodní a rovní si do dôstojnosti a práv.')

word3 = ipa_polish('wszyscy')
text3 = ipa_polish('Wszyscy ludzie rodzą się wolni i równi pod względem godności i praw.')

print(word1, word2, word3, sep='\n')
print(text1, text2, text3, sep='\n')
```

**2. Read from stdin in a shell pipeline.**
```bash
echo -e 'všichni' | python3 phon_czech.py
echo -e 'Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv.' | python3 phon_czech.py

echo -e 'všetci' | python3 phon_slovak.py
echo -e 'Všetci ľudia sa rodia slobodní a rovní si do dôstojnosti a práv.' | python3 phon_slovak.py

echo -e 'wszyscy' | python3 phon_polish.py
echo -e 'Wszyscy ludzie rodzą się wolni i równi pod względem godności i praw.' | python3 phon_polish.py
```

```bash
cat 'path-to-input-file' | python3 phon_czech.py
cat 'path-to-input-file' | python3 phon_slovak.py
cat 'path-to-input-file' | python3 phon_polish.py
```

**3. Read from a file whose path is given as the first argument.**
```bash
python3 phon_czech.py 'path-to-input-file'
python3 phon_slovak.py 'path-to-input-file'
python3 phon_polish.py 'path-to-input-file'
```
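
The line-by-line behaviour of the shell modes can also be reproduced from Python. The following is only a minimal sketch under stated assumptions: `transcribe_lines` and `fake_transcribe` are hypothetical names that are not part of this repository, and the stand-in transcriber exists only so that the snippet runs without the scripts on the import path. In a real project, one of `ipa_czech`, `ipa_slovak` or `ipa_polish` would be passed instead.

```python
import io
from typing import Callable, Iterable, Iterator


def transcribe_lines(lines: Iterable[str],
                     transcribe: Callable[[str], str]) -> Iterator[str]:
    """Yield one transcription per input line (hypothetical helper)."""
    for line in lines:
        # strip the trailing newline, as the scripts do in shell mode
        yield transcribe(line.strip())


def fake_transcribe(text: str) -> str:
    """Stand-in for ipa_czech & co.; it only wraps the text in brackets."""
    return '[' + text + ']'


# any iterable of lines works: an open file, sys.stdin, or an in-memory buffer
sample = io.StringIO('všichni\nlidé\n')
result = list(transcribe_lines(sample, fake_transcribe))
print(result)
```

Passing the transcriber as an argument keeps the helper independent of the language, so the same loop could serve all three scripts.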

## Based on these studies
- BALOWSKI, Mieczysław. 1993. Fonetika a fonologie současné polštiny. Praha: Karolinum. ISBN: 80-7066-793-1.
- DUDÁŠOVÁ-KRIŠŠÁKOVÁ, Júlia. 1999. Fonologický systém spisovnej slovenčiny a poľštiny z typologického hľadiska. Slavica Slovaca. 34(1), 16-24. ISSN: 0037-6787.
- KAJANOVÁ-SCHULZOVÁ, Oľga. 1970. Úvod do fonetiky slovenčiny. Bratislava: Slovenské pedagogické nakladateľstvo.
- KRÁĽ, Ábeľ; SABOL, Ján. 1989. Fonetika a fonológia. Bratislava: Slovenské pedagogické nakladateľstvo. ISBN: 80-08-00036-8.
- KRČMOVÁ, Marie. 2016. Úvod do fonetiky a fonologie pro bohemisty. Ostrava: Universitas Ostraviensis. ISBN: 978-80-7368-636-9.
- KRČMOVÁ, Marie. 2017. TRANSKRIPCE. In: Petr Karlík, Marek Nekula, Jana Pleskalová (eds.), CzechEncy - Nový encyklopedický slovník češtiny.
URL: https://www.czechency.org/slovnik/TRANSKRIPCE.
- KRČMOVÁ, Marie. 2017. ORTOEPIE. In: Petr Karlík, Marek Nekula, Jana Pleskalová (eds.), CzechEncy - Nový encyklopedický slovník češtiny.
URL: https://www.czechency.org/slovnik/ORTOEPIE.
- LIPOWSKI, Jaroslav. 2011. Operatívna fonetika slovenčiny, češtiny a poľštiny. Wrocław: Wydawnictwo Uniwersytetu Wrocławskiego. ISBN: 978-80-7294-511-5.
- LOTKO, Edvard. 1999. Ke konfrontaci příbuzných jazyků. In: Srovnávací a bohemistické studie. Olomouc: Vydavatelství Univerzity Palackého, 9-19. ISBN: 978-80-244-2201-5.
- PALKOVÁ, Zdena. 1994. Fonetika a fonologie češtiny. Praha: Karolinum. ISBN: 80-7066-843-1.
- PAULINY, Eugen. 1979. Slovenská fonológia. Bratislava: Slovenské pedagogické nakladateľstvo.
- ZEMAN, Jiří. 2008. Základy české ortoepie. Hradec Králové: Gaudeamus. ISBN: 978-80-7041-778-2.

- Fonetická transkripce češtiny. Fonetický ústav, Filozofická fakulta, Univerzita Karlova. URL: https://fonetika.ff.cuni.cz/o-fonetice/foneticka-transkripce/o-foneticke-transkripci/.
- International Phonetic Alphabet. URL: https://www.internationalphoneticassociation.org/redirected_home.
237 changes: 237 additions & 0 deletions phon_czech.py
@@ -0,0 +1,237 @@
#!/usr/bin/env python3
# coding: utf-8

"""Phonetic transcription of Czech text to IPA."""

import re
import sys


# function for the phonetic transcription of Czech language to IPA
def ipa_czech(text):
    """Phonetic transcription to IPA of given Czech text or word."""
    # set transcription table (IPA)
    vowels = {'a': 'a', 'e': 'ɛ', 'i': 'ɪ', 'y': 'ɪ', 'o': 'ɔ', 'u': 'u',
              'á': 'aː', 'é': 'ɛː', 'í': 'iː', 'ý': 'iː', 'ó': 'ɔː',
              'ú': 'uː', 'ů': 'uː', 'ě': 'ɛ'}

    sonors = {'l': 'l', 'm': 'm', 'n': 'n', 'ň': 'ɲ', 'r': 'r', 'j': 'j'}

    voice_voice = {'dz': 'd͡z', 'dž': 'd͡ʒ', 'v': 'v', 'g': 'ɡ', 'b': 'b',
                   'z': 'z', 'ž': 'ʒ', 'd': 'd', 'ď': 'ɟ', 'h': 'ɦ',
                   'ch': 'ɣ', 'x': 'ks', 'w': 'v', 'ř': 'r̝', 'q': 'kv'}

    voice_voiceless = {'dz': 't͡s', 'dž': 't͡ʃ', 'v': 'f', 'g': 'k', 'b': 'p',
                       'z': 's', 'ž': 'ʃ', 'd': 't', 'ď': 'c', 'h': 'x',
                       'ch': 'x', 'x': 'ks', 'w': 'f', 'ř': 'r̝̊', 'q': 'kf'}

    voiceless_voiceless = {'c': 't͡s', 'č': 't͡ʃ', 'f': 'f', 'k': 'k',
                           'p': 'p', 's': 's', 'š': 'ʃ', 't': 't', 'ť': 'c'}

    voiceless_voice = {'c': 'd͡z', 'č': 'd͡ʒ', 'f': 'v', 'k': 'ɡ', 'p': 'b',
                       's': 'z', 'š': 'ʒ', 't': 'd', 'ť': 'ɟ'}

    # exceptions: prefixes ending in a vowel (no diphthong across the seam)
    vowel_prefixes = ('nade', 'obe', 'pode', 'přede', 'roze', 'se', 've',
                      'vze', 'ze', 'ne', 'vele', 'ante', 'de', 'pre', 're',
                      'vice', 'na', 'za', 'leda', 'pa', 'pra', 'sotva', 'ana',
                      'dia', 'extra', 'hepta', 'hexa', 'infra', 'intra',
                      'kontra', 'meta', 'para', 'supra', 'tetra', 'ultra',
                      'mimo', 'místo', 'okolo', 'polo', 'skoro', 'alo',
                      'hetero', 'homo', 'hypo', 'iso', 'kvadro', 'makro',
                      'mezzo', 'mikro', 'proto', 'pseudo', 'retro', 'mono')

    # TODO: foreign words

    # split into clauses
    text = text.replace('...', '.')
    parts = re.split(r'[,;\.\!\?\"\-\–$]', text)
    delimiters = [l for l in text if l in ',;.!?"-–']

    # transcribe clauses
    transcripted_parts = list()
    for part in parts:
        # check input
        if not part:
            transcripted_parts.append('')
            continue

        # turn the text into a list of letters, keeping digraphs together
        part = part.lower().strip()
        part = part.replace('ch', 'A').replace('dz', 'B').replace('dž', 'C')
        digraphs = {'A': 'ch', 'B': 'dz', 'C': 'dž'}
        part = list(part)
        for l in range(len(part)):
            if part[l] in digraphs:
                part[l] = digraphs[part[l]]

        # transcribed output (starts as a copy of the input letters)
        ipa = [l for l in part]

        # find out intervals for neutralization and assimilation
        posit_vowel = [-1] + [i for i in range(len(part)) if part[i] in vowels]
        posit_sonor = [i for i in range(len(part)) if part[i] in sonors]

        # neutralization (everything after the last vowel or sonorant)
        j = posit_vowel[-1]
        if posit_sonor and posit_sonor[-1] > posit_vowel[-1]:
            j = posit_sonor[-1]

        i = len(part) - 1
        while i > j:
            if part[i] in voice_voiceless:
                ipa[i] = voice_voiceless[part[i]]
            elif part[i] in voiceless_voiceless:
                ipa[i] = voiceless_voiceless[part[i]]
            elif part[i] in sonors:
                ipa[i] = sonors[part[i]]
            i -= 1

        # transcription and assimilation (right to left, vowel by vowel)
        while posit_vowel:
            i, k = j, j
            j = posit_vowel.pop()
            # assimilation type: None = unknown, True = voiced,
            # False = voiceless
            voice = None
            while i > j:
                # transcription of vowels
                if part[i] in vowels:
                    # diphthongs ou, eu, au
                    if (part[i] in 'aeo' and len(part) > i+1
                            and part[i+1] == 'u'):
                        test = [True if p == ''.join(part[i+1-len(p):i+1])
                                else False
                                for p in vowel_prefixes]
                        if any(test):
                            ipa[i] = vowels[part[i]] + ' ʔ'
                        else:
                            ipa[i] = vowels[part[i]] + 'u̯'
                            ipa[i+1] = ''
                    # preceding i/í (i > 0 keeps index -1 from wrapping
                    # around to the last letter of the clause)
                    elif i > 0 and part[i-1] in 'ií':
                        ipa[i] = 'j ' + vowels[part[i]]
                    # otherwise
                    else:
                        ipa[i] = vowels[part[i]]
                    # beginning of a word (glottal plosive)
                    if i == 0 or part[i-1] == ' ' and part[i-2] in vowels:
                        ipa[i] = 'ʔ ' + ipa[i]

                # transcription of sonorants and consonants
                elif k != i:
                    # sonorants
                    if part[i] in sonors:
                        voice = None
                        # m, n
                        if part[i] in 'mn':
                            # nn
                            if part[i] == 'n' and part[i+1] == 'n':
                                ipa[i] = ''
                            # nk, ng
                            elif part[i] == 'n' and part[i+1] in 'kg':
                                ipa[i] = 'ŋ'
                            # mv, mf
                            elif part[i] == 'm' and part[i+1] in 'vf':
                                ipa[i] = 'ɱ'
                            # ni, ní
                            elif part[i] == 'n' and part[i+1] in 'ií':
                                ipa[i] = 'ɲ'
                            # mně, mě, ně
                            elif part[i+1] == 'ě':
                                if part[i] == 'n':
                                    ipa[i] = 'ɲ'
                                else:
                                    ipa[i] = 'm ɲ'
                            # otherwise
                            else:
                                ipa[i] = sonors[part[i]]
                        # otherwise
                        else:
                            ipa[i] = sonors[part[i]]
                    # kk
                    elif part[i] == 'k' and part[i+1] == 'k':
                        ipa[i] = ''
                    # choose type of assimilation
                    elif voice is None:
                        # voiced
                        if part[i] in voice_voice:
                            voice = True
                            # v (does not trigger assimilation)
                            if part[i] == 'v':
                                voice = None
                            # bě, vě
                            if part[i] in 'bv' and part[i+1] == 'ě':
                                ipa[i] = voice_voice[part[i]] + ' j'
                            # di, dí, dě
                            elif part[i] == 'd' and part[i+1] in 'iíě':
                                ipa[i] = 'ɟ'
                            # ř
                            elif part[i] == 'ř' and i != 0:
                                if part[i-1] in voiceless_voiceless:
                                    ipa[i] = voice_voiceless[part[i]]
                                    voice = False
                                else:
                                    ipa[i] = voice_voice[part[i]]
                            # otherwise
                            else:
                                ipa[i] = voice_voice[part[i]]
                        # voiceless
                        elif part[i] in voiceless_voiceless:
                            voice = False
                            # pě, fě
                            if part[i] in 'pf' and part[i+1] == 'ě':
                                ipa[i] = voiceless_voiceless[part[i]] + ' j'
                            # ti, tí, tě
                            elif part[i] == 't' and part[i+1] in 'iíě':
                                ipa[i] = 'c'
                            # otherwise
                            else:
                                ipa[i] = voiceless_voiceless[part[i]]
                    # assimilation
                    else:
                        # voiced group
                        if voice is True and part[i] in voice_voice:
                            ipa[i] = voice_voice[part[i]]
                        elif voice is True and part[i] in voiceless_voice:
                            ipa[i] = voiceless_voice[part[i]]
                        # voiceless group
                        elif voice is False and part[i] in voice_voiceless:
                            ipa[i] = voice_voiceless[part[i]]
                        elif voice is False and part[i] in voiceless_voiceless:
                            ipa[i] = voiceless_voiceless[part[i]]

                i -= 1

        # clean empty cells and save transcribed clauses
        ipa = list(filter(None, ipa))
        transcripted_parts.append(ipa)

    # return transcribed text
    transcripted_parts = [' '.join(part) for part in transcripted_parts]
    transcripted = ''
    i = 0
    while i < len(delimiters):
        transcripted += transcripted_parts[i] + delimiters[i]
        i += 1
    if i < len(transcripted_parts):
        transcripted += transcripted_parts[-1]

    # mark sentence boundaries with || and minor boundaries with |
    transcripted = re.sub(r'\.|\?|\!|\;|\"', ' || ', transcripted)
    transcripted = re.sub(r'\,|\-|\–', ' | ', transcripted)
    return transcripted


# run the script when it is used in a shell (with stdin or a path to a file)
if __name__ == '__main__':

    if not sys.stdin.isatty():  # read from stdin
        for line in sys.stdin:
            print(ipa_czech(line.strip()), sep='\t')

    else:  # read from file
        if len(sys.argv) == 2:
            with open(sys.argv[1], mode='r', encoding='utf-8') as f:
                for line in f:
                    print(ipa_czech(line.strip()), sep='\t')
        else:
            print('Error: Use script in pipeline or give the path '
                  'to the relevant file in the first argument.')
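
The two preprocessing steps at the top of `ipa_czech` (splitting the text into clauses and turning each clause into a list of letters with the digraphs `ch`, `dz` and `dž` kept whole) can be looked at in isolation. The snippet below is a simplified sketch, not the script's own code: `split_clauses` and `tokenize` are hypothetical names, and the digraphs are found with a left-to-right scan instead of the placeholder replacement used in the script.

```python
import re

# Czech letter pairs that the script treats as a single unit
DIGRAPHS = ('ch', 'dz', 'dž')

# clause punctuation, mirroring the delimiters collected in the script
CLAUSE_PATTERN = r'[,;.!?"\-–]'


def split_clauses(text):
    """Return (clauses, delimiters) for the given text (hypothetical helper)."""
    clauses = re.split(CLAUSE_PATTERN, text)
    delimiters = re.findall(CLAUSE_PATTERN, text)
    return clauses, delimiters


def tokenize(clause):
    """Lower-case a clause and split it into letters, keeping digraphs whole."""
    clause = clause.lower().strip()
    tokens = []
    i = 0
    while i < len(clause):
        pair = clause[i:i + 2]
        if pair in DIGRAPHS:
            tokens.append(pair)      # consume both characters of the digraph
            i += 2
        else:
            tokens.append(clause[i])
            i += 1
    return tokens


clauses, delimiters = split_clauses('Chci džem, ne med.')
tokens = [tokenize(clause) for clause in clauses]
print(delimiters)
print(tokens)
```

Splitting on a single punctuation character always yields one more clause than there are delimiters, which is why the script appends the last transcribed part separately after interleaving the others with their delimiters.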