All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Improved meanings parsing (experimental)
- Update dependencies
- Improved meanings parsing (experimental)
- Parse meanings and add "meanings" field to output when the `include_meanings` param is True in the `parse_entry` call.
- Update value for xml property "xsi:schemaLocation" to "http://www.mediawiki.org/xml/export-0.11/"
- "wikicode" field from page model (not used)
- Performance improvements
- Add dict comprehension to improve performance
- Add None type to language
- Small improvements
- Rename "syllables" to "hyphenation"
- Allow specifying the path to the dump file in the `WiktionaryDump` class
- Update dependencies
- Refactor internally to use `pydantic` models
- Remove `Record` class
- Remove config options
- Add pydantic models
- Add WiktionaryParser and WiktionaryDump classes
- pass parsed wikitext internally to extraction methods
- update tests
- Update dependencies
- Use dataclasses instead of dicts internally
- Add "page_id" and "index" field to output (if a page contains multiple entries, the index indicates the position of the word in the page)
- Add tests for POS and language parsing
- BREAKING: Removed the ability to load custom methods from outside the package. The same can be achieved by setting the "wiki_text" field in the config dict and parsing the Wikitext manually.
- Make sure "title" is of base string type (not `etree._ElementUnicodeResult`)
- Improve typing
- Fix type errors
- Add method to parse rhymes
- Add tests for rhymes parsing
- Improve lemma parsing
- Add tests for lemma parsing
- Make config dict keys optional
- Add development instructions to README.md
- Add tests for syllable parsing
- Add tests for IPA parsing
- Add VSCode launch.json
- Add config dict
- Add config option to optionally include wikitext in output (disabled by default)
- Update dependencies
- Replace Autopep with black
- Ignore inflected forms, regional slang, Austrian/Swiss dialect etc. when parsing IPA-templates from now on
- `ignored_prefixes` is now part of a config dict
- Improve syllable parsing
- Improve IPA parsing
- `pyphen` as fallback for syllables parsing
- Change repository and package name from `wiktionary_de_parser` to `wiktionary-de-parser`
- Make `lemma` and `inflected` required fields
- Removed typing_extensions again
- Added typing_extensions
- More type hints
- Type hint for iterable (Record)
- removed None type dict entries in flexion parsing result
- minor flexion parsing improvements
- Converted repository to Poetry project
- Renamed `langCode` to `lang_code`
- Started to implement tests and type hints
- Updated regular expression and improved flexion parsing
- improve dash parsing in table values
- `MANIFEST.in`: added langcode files
- `syllables.py`: improved syllables parsing
- `language.py`: added field `langCode` (providing ISO639-1 language code)
- `language.py`: renamed field `language` to `lang`
- `README.md`: updated readme
- `ipa.py`: IPA parsing improvement
- `pos.py`: added 'Deklinierte Form' as POS (can be Substantiv, Adjektiv, Artikel, Pronomen)
- `ipa.py`: Match correct paragraph in WikiText for parsing IPA
- `syllables.py`: Improved syllables parsing
- `ipa.py`: Make IPA field a `list` (support multiple IPA transcriptions for one word)
- `ipa.py`: Improved IPA parsing
- `pos.py`: Prevent duplicate POS names
- `pos.py`: Toponym was a Dict key when the template 'Deutsch Toponym Übersicht' was present (should be nested noun value)
- Python package support
- repository structure
- README.md
- allow 'Genus 1' - 'Genus 4' in flexion dictionary
- added `inflected` field to indicate whether entry is for inflected word
- put 'Genus' back to the flexion dictionary
- strip values in `lemma.py`, `language.py`, `ipa.py`
- accept `Vorlage-Test` in regex pattern in `pos.py` & `language.py`
- accept `Merkspruch` in `pos.py`
- improved regex for section splitting
- improved regex for POS matching
- fix missing POS names when there is a POS template
- language codes
- loading custom methods via `custom_methods` argument in class constructor and `load_methods` function
- Changelog.md (this file)
- load all files from `methods` folder and initialize them as extraction methods
- extraction methods must return a `Dict()` now
- `flexion.py`: returns 'genus' and flexion info separately
- `method_names` in `__init__.py`
- initial release