Create scripts for phonetic transcription of Czech, Slovak and Polish.
0 parents
commit 9c85534
Showing 5 changed files with 773 additions and 0 deletions.

@@ -0,0 +1,4 @@

Version 1 (v1) [released on 7 Sept 2019]
- create script for automatic phonetic transcription of Czech according to the linguistic works listed in README.md
- create script for automatic phonetic transcription of Slovak according to the linguistic works listed in README.md
- create script for automatic phonetic transcription of Polish according to the linguistic works listed in README.md

@@ -0,0 +1,70 @@

# Automatic phonetic transcription of the Czech, Slovak and Polish languages
This repository contains code for a rule-based approach to the phonetic transcription of the Czech, Slovak and Polish languages into the [International Phonetic Alphabet](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) (IPA). The rules and IPA symbols used are based on phonological, phonetic, and orthoepic studies (listed below) of these West Slavic languages.

`CHANGELOG.txt` lists the changes in each version. The current version is version 1.

## Usage
These scripts can be imported into any Python project or run as shell scripts. Three examples of how to use them are shown below.

**1. Import the functions into your Python 3 project.**
```python
from phon_czech import ipa_czech
from phon_slovak import ipa_slovak
from phon_polish import ipa_polish

word1 = ipa_czech('všichni')
text1 = ipa_czech('Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv.')

word2 = ipa_slovak('všetci')
text2 = ipa_slovak('Všetci ľudia sa rodia slobodní a rovní si do dôstojnosti a práv.')

word3 = ipa_polish('wszyscy')
text3 = ipa_polish('Wszyscy ludzie rodzą się wolni i równi pod względem godności i praw.')

print(word1, word2, word3, sep='\n')
print(text1, text2, text3, sep='\n')
```

**2. Read from stdin in a shell pipeline.**
```bash
echo -e 'všichni' | python3 phon_czech.py
echo -e 'Všichni lidé rodí se svobodní a sobě rovní co do důstojnosti a práv.' | python3 phon_czech.py

echo -e 'všetci' | python3 phon_slovak.py
echo -e 'Všetci ľudia sa rodia slobodní a rovní si do dôstojnosti a práv.' | python3 phon_slovak.py

echo -e 'wszyscy' | python3 phon_polish.py
echo -e 'Wszyscy ludzie rodzą się wolni i równi pod względem godności i praw.' | python3 phon_polish.py
```

```bash
cat 'path-to-input-file' | python3 phon_czech.py
cat 'path-to-input-file' | python3 phon_slovak.py
cat 'path-to-input-file' | python3 phon_polish.py
```

**3. Read from a file given as the first argument.**
```bash
python3 phon_czech.py 'path-to-input-file'
python3 phon_slovak.py 'path-to-input-file'
python3 phon_polish.py 'path-to-input-file'
```

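As a minimal sketch of the first usage mode, an imported transcription function can be applied line by line to any iterable of text. The helper name `transcribe_lines` and the stand-in `fake_ipa` are hypothetical and not part of this repository; in a real project one would pass `ipa_czech`, `ipa_slovak` or `ipa_polish` instead, assuming the modules are importable.

```python
import io


def transcribe_lines(lines, transcribe):
    """Yield (source, transcription) pairs for every non-empty line.

    `transcribe` is any callable that maps a string to a string.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield line, transcribe(line)


if __name__ == '__main__':
    # Stand-in transcriber (an assumption for illustration only), so the
    # sketch runs without the repository on the import path.
    def fake_ipa(text):
        return '[' + text.lower() + ']'

    sample = io.StringIO('Všichni\n\nlidé\n')
    for source, ipa in transcribe_lines(sample, fake_ipa):
        # Tab-separated output: original line, then its transcription.
        print(source, ipa, sep='\t')
```

Passing the transcriber as an argument keeps the helper independent of the language, so the same loop could serve all three scripts.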
## Based on these studies
- BALOWSKI, Mieczysław. 1993. Fonetika a fonologie současné polštiny. Praha: Karolinum. ISBN: 80-7066-793-1.
- DUDÁŠOVÁ-KRIŠŠÁKOVÁ, Júlia. 1999. Fonologický systém spisovnej slovenčiny a poľštiny z typologického hľadiska. Slavica Slovaca. 34(1), 16-24. ISSN: 0037-6787.
- KAJANOVÁ-SCHULZOVÁ, Oľga. 1970. Úvod do fonetiky slovenčiny. Bratislava: Slovenské pedagogické nakladateľstvo.
- KRÁĽ, Ábeľ; SABOL, Ján. 1989. Fonetika a fonológia. Bratislava: Slovenské pedagogické nakladateľstvo. ISBN: 80-08-00036-8.
- KRČMOVÁ, Marie. 2016. Úvod do fonetiky a fonologie pro bohemisty. Ostrava: Universitas Ostraviensis. ISBN: 978-80-7368-636-9.
- KRČMOVÁ, Marie. 2017. TRANSKRIPCE. In: Petr Karlík, Marek Nekula, Jana Pleskalová (eds.), CzechEncy - Nový encyklopedický slovník češtiny.
  URL: https://www.czechency.org/slovnik/TRANSKRIPCE.
- KRČMOVÁ, Marie. 2017. ORTOEPIE. In: Petr Karlík, Marek Nekula, Jana Pleskalová (eds.), CzechEncy - Nový encyklopedický slovník češtiny.
  URL: https://www.czechency.org/slovnik/ORTOEPIE.
- LIPOWSKI, Jaroslav. 2011. Operatívna fonetika slovenčiny, češtiny a poľštiny. Wrocław: Wydawnictwo Uniwersytetu Wrocławskiego. ISBN: 978-80-7294-511-5.
- LOTKO, Edvard. 1999. Ke konfrontaci příbuzných jazyků. In: Srovnávací a bohemistické studie. Olomouc: Vydavatelství Univerzity Palackého, 9-19. ISBN: 978-80-244-2201-5.
- PALKOVÁ, Zdena. 1994. Fonetika a fonologie češtiny. Praha: Karolinum. ISBN: 80-7066-843-1.
- PAULINY, Eugen. 1979. Slovenská fonológia. Bratislava: Slovenské pedagogické nakladateľstvo.
- ZEMAN, Jiří. 2008. Základy české ortoepie. Hradec Králové: Gaudeamus. ISBN: 978-80-7041-778-2.

- Fonetická transkripce češtiny. Fonetický ústav, Filozofická fakulta, Univerzita Karlova. URL: https://fonetika.ff.cuni.cz/o-fonetice/foneticka-transkripce/o-foneticke-transkripci/.
- International Phonetic Alphabet. URL: https://www.internationalphoneticassociation.org/redirected_home.

@@ -0,0 +1,237 @@

```python
#!/usr/bin/env python3
# coding: utf-8

"""Phonetic transcription of Czech text to IPA."""

import re
import sys


# function for the phonetic transcription of the Czech language to IPA
def ipa_czech(text):
    """Phonetic transcription to IPA of given Czech text or word."""
    # set transcription table (IPA)
    vowels = {'a': 'a', 'e': 'ɛ', 'i': 'ɪ', 'y': 'ɪ', 'o': 'ɔ', 'u': 'u',
              'á': 'aː', 'é': 'ɛː', 'í': 'iː', 'ý': 'iː', 'ó': 'ɔː',
              'ú': 'uː', 'ů': 'uː', 'ě': 'ɛ'}

    sonors = {'l': 'l', 'm': 'm', 'n': 'n', 'ň': 'ɲ', 'r': 'r', 'j': 'j'}

    voice_voice = {'dz': 'd͡z', 'dž': 'd͡ʒ', 'v': 'v', 'g': 'ɡ', 'b': 'b',
                   'z': 'z', 'ž': 'ʒ', 'd': 'd', 'ď': 'ɟ', 'h': 'ɦ',
                   'ch': 'ɣ', 'x': 'ks', 'w': 'v', 'ř': 'r̝', 'q': 'kv'}

    voice_voiceless = {'dz': 't͡s', 'dž': 't͡ʃ', 'v': 'f', 'g': 'k', 'b': 'p',
                       'z': 's', 'ž': 'ʃ', 'd': 't', 'ď': 'c', 'h': 'x',
                       'ch': 'x', 'x': 'ks', 'w': 'f', 'ř': 'r̝̊', 'q': 'kf'}

    voiceless_voiceless = {'c': 't͡s', 'č': 't͡ʃ', 'f': 'f', 'k': 'k',
                           'p': 'p', 's': 's', 'š': 'ʃ', 't': 't', 'ť': 'c'}

    voiceless_voice = {'c': 'd͡z', 'č': 'd͡ʒ', 'f': 'v', 'k': 'ɡ', 'p': 'b',
                       's': 'z', 'š': 'ʒ', 't': 'd', 'ť': 'ɟ'}

    # exceptions
    vowel_prefixes = ('nade', 'obe', 'pode', 'přede', 'roze', 'se', 've',
                      'vze', 'ze', 'ne', 'vele', 'ante', 'de', 'pre', 're',
                      'vice', 'na', 'za', 'leda', 'pa', 'pra', 'sotva', 'ana',
                      'dia', 'extra', 'hepta', 'hexa', 'infra', 'intra',
                      'kontra', 'meta', 'para', 'supra', 'tetra', 'ultra',
                      'mimo', 'místo', 'okolo', 'polo', 'skoro', 'alo',
                      'hetero', 'homo', 'hypo', 'iso', 'kvadro', 'makro',
                      'mezzo', 'mikro', 'proto', 'pseudo', 'retro', 'mono')

    # TODO: foreign words

    # split into clauses (the character class must match `delimiters` below)
    text = text.replace('...', '.')
    parts = re.split(r'[,;\.\!\?\"\-\–]', text)
    delimiters = [l for l in text if l in ',;.!?"-–']

    # transcribe clauses
    transcripted_parts = list()
    for part in parts:
        # check input
        if not part:
            transcripted_parts.append('')
            continue

        # convert text to a list of letters, keeping digraphs as one item
        part = part.lower().strip()
        part = part.replace('ch', 'A').replace('dz', 'B').replace('dž', 'C')
        digraphs = {'A': 'ch', 'B': 'dz', 'C': 'dž'}
        part = list(part)
        for l in range(len(part)):
            if part[l] in digraphs:
                part[l] = digraphs[part[l]]

        # transcribed output, initialized with the input letters
        ipa = [l for l in part]

        # find out intervals for neutralization and assimilation
        posit_vowel = [-1] + [i for i in range(len(part)) if part[i] in vowels]
        posit_sonor = [i for i in range(len(part)) if part[i] in sonors]

        # neutralization (includes j so that a final 'ň' is mapped too)
        j = posit_vowel[-1]
        if posit_sonor and posit_sonor[-1] > posit_vowel[-1]:
            j = posit_sonor[-1]

        i = len(part) - 1
        while i >= max(j, 0):
            if part[i] in voice_voiceless:
                ipa[i] = voice_voiceless[part[i]]
            elif part[i] in voiceless_voiceless:
                ipa[i] = voiceless_voiceless[part[i]]
            elif part[i] in sonors:
                ipa[i] = sonors[part[i]]
            i -= 1

        # transcription and assimilation
        while posit_vowel:
            i, k = j, j
            j = posit_vowel.pop()
            voice = None  # assimil. type (N=unknown, T=voiced, F=voiceless)
            while i > j:
                # transcription of vowels
                if part[i] in vowels:
                    # diphthongs ou, eu, au
                    if part[i] in 'aeo' and len(part) > i+1 \
                            and part[i+1] == 'u':
                        test = [True if p == ''.join(part[i+1-len(p):i+1])
                                else False
                                for p in vowel_prefixes]
                        if any(test):
                            ipa[i] = vowels[part[i]] + ' ʔ'
                        else:
                            ipa[i] = vowels[part[i]] + 'u̯'
                            ipa[i+1] = ''
                    # i/í preceding (i > 0 avoids wrapping to the last letter)
                    elif i > 0 and part[i-1] in 'ií':
                        ipa[i] = 'j ' + vowels[part[i]]
                    # otherwise
                    else:
                        ipa[i] = vowels[part[i]]
                    # initial of word (glottal plosive)
                    if i == 0 or part[i-1] == ' ' and part[i-2] in vowels:
                        ipa[i] = 'ʔ ' + ipa[i]

                # transcription of sonors and consonants
                elif k != i:
                    # sonors
                    if part[i] in sonors:
                        voice = None
                        # m, n
                        if part[i] in 'mn':
                            # nn
                            if part[i] == 'n' and part[i+1] == 'n':
                                ipa[i] = ''
                            # nk, ng
                            elif part[i] == 'n' and part[i+1] in 'kg':
                                ipa[i] = 'ŋ'
                            # mv, mf
                            elif part[i] == 'm' and part[i+1] in 'vf':
                                ipa[i] = 'ɱ'
                            # ni, ní
                            elif part[i] == 'n' and part[i+1] in 'ií':
                                ipa[i] = 'ɲ'
                            # mně, mě, ně
                            elif part[i+1] == 'ě':
                                if part[i] == 'n':
                                    ipa[i] = 'ɲ'
                                else:
                                    ipa[i] = 'm ɲ'
                            # otherwise
                            else:
                                ipa[i] = sonors[part[i]]
                        # otherwise
                        else:
                            ipa[i] = sonors[part[i]]
                    # kk
                    elif part[i] == 'k' and part[i+1] == 'k':
                        ipa[i] = ''
                    # choose type of assimilation
                    elif voice is None:
                        # voiced
                        if part[i] in voice_voice:
                            voice = True
                            # v
                            if part[i] == 'v':
                                voice = None
                            # bě, vě
                            if part[i] in 'bv' and part[i+1] == 'ě':
                                ipa[i] = voice_voice[part[i]] + ' j'
                            # di, dí, dě
                            elif part[i] == 'd' and part[i+1] in 'iíě':
                                ipa[i] = 'ɟ'
                            # ř
                            elif part[i] == 'ř' and i != 0:
                                if part[i-1] in voiceless_voiceless:
                                    ipa[i] = voice_voiceless[part[i]]
                                    voice = False
                                else:
                                    ipa[i] = voice_voice[part[i]]
                            # otherwise
                            else:
                                ipa[i] = voice_voice[part[i]]
                        # voiceless
                        elif part[i] in voiceless_voiceless:
                            voice = False
                            # pě, fě
                            if part[i] in 'pf' and part[i+1] == 'ě':
                                ipa[i] = voiceless_voiceless[part[i]] + ' j'
                            # ti, tí, tě
                            elif part[i] == 't' and part[i+1] in 'iíě':
                                ipa[i] = 'c'
                            # otherwise
                            else:
                                ipa[i] = voiceless_voiceless[part[i]]
                    # assimilation
                    else:
                        # voiced group
                        if voice is True and part[i] in voice_voice:
                            ipa[i] = voice_voice[part[i]]
                        elif voice is True and part[i] in voiceless_voice:
                            ipa[i] = voiceless_voice[part[i]]
                        # voiceless group
                        elif voice is False and part[i] in voice_voiceless:
                            ipa[i] = voice_voiceless[part[i]]
                        elif voice is False and part[i] in voiceless_voiceless:
                            ipa[i] = voiceless_voiceless[part[i]]

                i -= 1

        # clean empty cells and save transcribed clauses
        ipa = list(filter(None, ipa))
        transcripted_parts.append(ipa)

    # return transcribed text
    transcripted_parts = [' '.join(part) for part in transcripted_parts]
    transcripted = ''
    i = 0
    while i < len(delimiters):
        transcripted += transcripted_parts[i] + delimiters[i]
        i += 1
    if i < len(transcripted_parts):
        transcripted += transcripted_parts[-1]

    transcripted = re.sub(r'\.|\?|\!|\;|\"', ' || ', transcripted)
    transcripted = re.sub(r'\,|\-|\–', ' | ', transcripted)
    return transcripted


# run as a script from the shell (reads stdin or a file given by its path)
if __name__ == '__main__':

    if not sys.stdin.isatty():  # read from stdin
        for line in sys.stdin:
            print(ipa_czech(line.strip()), sep='\t')

    else:  # read from file
        if len(sys.argv) == 2:
            with open(sys.argv[1], mode='r', encoding='utf-8') as f:
                for line in f:
                    print(ipa_czech(line.strip()), sep='\t')
        else:
            print('Error: Use script in pipeline or give the path '
                  'to the relevant file in the first argument.')
```
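
The two preprocessing steps at the start of `ipa_czech` can also be sketched in isolation: splitting the text into clauses, and turning each clause into a list of letters in which `ch`, `dz` and `dž` stay whole. The helpers `split_clauses` and `tokenize` below are hypothetical names, and they use a capturing `re.split` and a regex alternation rather than the placeholder substitution in the commit, so this is an illustration under those assumptions and not the author's implementation.

```python
import re

# Hypothetical helpers, not part of the commit.

# Clause delimiters handled by the script: , ; . ! ? " - and the en dash.
_DELIMITER = re.compile(r'([,;.!?"\-–])')

# Digraphs are listed before the single-character fallback so that 'ch'
# is matched as one letter instead of 'c' followed by 'h'.
_LETTER = re.compile(r'ch|dz|dž|.', re.DOTALL)


def split_clauses(text):
    """Return (clauses, delimiters) with len(clauses) == len(delimiters) + 1."""
    # With a capturing group, re.split keeps the delimiters in the result:
    # even indices hold clause text, odd indices hold the delimiters.
    pieces = _DELIMITER.split(text.replace('...', '.'))
    return pieces[0::2], pieces[1::2]


def tokenize(clause):
    """Lower-case a clause and split it into letters, keeping digraphs whole."""
    return _LETTER.findall(clause.lower().strip())


if __name__ == '__main__':
    clauses, delimiters = split_clauses('Chodba, džbán!')
    print(clauses, delimiters)
    print([tokenize(clause) for clause in clauses])
```

Because the delimiters come back interleaved with the clauses, the two lists stay aligned without a second pass over the text.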