Jyut Dictionary Database Schema

Aaron Tan edited this page Sep 8, 2023 · 1 revision

The Jyut Dictionary database, currently on version 3, consists of seven (7) tables.

Code reference

If you prefer code, see https://github.com/aaronhktan/jyut-dict/blob/main/src/dictionaries/database/database.py for the database creation script.

Tables

entries

The entries table consists of six (6) columns. They are: entry_id, traditional, simplified, pinyin, jyutping, and frequency. The schema enforces uniqueness on the set of (traditional, simplified, pinyin, jyutping).

  • The entry_id for the set of (traditional, simplified, pinyin, jyutping) is not constant between different database files! I made this decision to allow arbitrary entry additions from a variety of sources without needing a centralized index of pre-existing entries in all the databases.
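
As a rough illustration of what the uniqueness constraint implies, here is a minimal sketch using Python's sqlite3 module and an in-memory database. Only the column names and the unique tuple come from the description above; the column types, the helper name find_entry_id, and the sample row (including the romanization format) are assumptions for illustration. The creation script linked above remains the authoritative definition.

```python
import sqlite3

# Column names and the UNIQUE tuple follow the description above.
# The column types are assumptions, not taken from the real schema.
ENTRIES_SQL = """
CREATE TABLE entries(
    entry_id INTEGER PRIMARY KEY,
    traditional TEXT,
    simplified TEXT,
    pinyin TEXT,
    jyutping TEXT,
    frequency REAL,
    UNIQUE(traditional, simplified, pinyin, jyutping)
)
"""

INSERT_SQL = (
    "INSERT INTO entries(traditional, simplified, pinyin, jyutping, frequency) "
    "VALUES (?, ?, ?, ?, ?)"
)


def find_entry_id(conn, traditional, simplified, pinyin, jyutping):
    """Hypothetical helper: look up the file-local entry_id by the unique tuple."""
    row = conn.execute(
        "SELECT entry_id FROM entries "
        "WHERE traditional = ? AND simplified = ? AND pinyin = ? AND jyutping = ?",
        (traditional, simplified, pinyin, jyutping),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(ENTRIES_SQL)

# Illustrative sample row; the exact romanization format is an assumption.
sample = ("廣東話", "广东话", "guang3 dong1 hua4", "gwong2 dung1 waa2")
conn.execute(INSERT_SQL, sample + (0.0,))

# Inserting the same tuple again violates the UNIQUE constraint.
try:
    conn.execute(INSERT_SQL, sample + (0.0,))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

entry_id = find_entry_id(conn, *sample)
missing_id = find_entry_id(conn, "x", "x", "x", "x")
```

Since entry_id differs from file to file, code that merges or compares databases would match rows on the four-column tuple, as find_entry_id does, and never on entry_id itself.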

sources

The sources table consists of nine (9) columns. They are: source_id, sourcename, sourceshortname, version, description, legal, link, update_url, and other. The sourcename must be unique for each row.

  • Like entry_id, the mapping from source_id to source is not consistent between different database files.
  • The link column should contain a link to the original location where the source can be found.
  • The update_url is currently unused. I added it originally intending for Jyut Dictionary to discover updates for dictionaries that were already downloaded, but have not (yet) built this feature.
  • The other column contains a comma-separated list drawn from "words" and "sentences". This indicates to Jyut Dictionary whether to copy only rows from the entries and definitions tables ("words"), or to also copy the chinese_sentences, nonchinese_sentences, sentence_links, and definitions_chinese_sentences_links tables ("sentences").
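
The handling of the other column might be sketched as follows. The function name tables_to_copy is hypothetical, and the sketch assumes each keyword independently selects its own group of tables; how Jyut Dictionary actually treats a value containing only "sentences" is not stated here.

```python
# Table groups, using the table names from the headings on this page.
WORD_TABLES = ["entries", "definitions"]
SENTENCE_TABLES = [
    "chinese_sentences",
    "nonchinese_sentences",
    "sentence_links",
    "definitions_chinese_sentences_links",
]


def tables_to_copy(other):
    """Hypothetical helper: map a source's `other` value to table names.

    Assumes `other` is a comma-separated string such as "words" or
    "words,sentences"; surrounding whitespace and empty items are ignored.
    """
    kinds = {part.strip() for part in other.split(",") if part.strip()}
    tables = []
    if "words" in kinds:
        tables += WORD_TABLES
    if "sentences" in kinds:
        tables += SENTENCE_TABLES
    return tables


words_only = tables_to_copy("words")
words_and_sentences = tables_to_copy("words, sentences")
```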

definitions

The definitions table contains five (5) columns. They are: definition_id, definition, label, fk_entry_id, and fk_source_id. The set of (definition, label, fk_entry_id, fk_source_id) is enforced unique by the schema.

  • Like entry_id, definition_id <-> definition mapping is not constant between database files or database versions.
  • The label contains any label that should be displayed with a definition (generally a part-of-speech/POS indicator).
  • fk_entry_id references the entry that this definition is for.
  • fk_source_id references the source that provides this definition. Notice that definitions reference a source, but entries do not! I made this decision because multiple sources may provide definitions for one entry, so an entry doesn't belong to a single source and linking it to any particular one would serve no purpose.
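
To show how the two foreign keys fit together, here is a minimal sketch in which one entry receives definitions from two sources. The column types, the REFERENCES clauses, and all sample data ("Source A", "Source B", the definitions) are assumptions for illustration, and the sources table is trimmed to three columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed types; column names follow the descriptions on this page.
CREATE TABLE entries(
    entry_id INTEGER PRIMARY KEY,
    traditional TEXT, simplified TEXT, pinyin TEXT, jyutping TEXT,
    frequency REAL,
    UNIQUE(traditional, simplified, pinyin, jyutping)
);
-- Trimmed to three of the nine columns for brevity.
CREATE TABLE sources(
    source_id INTEGER PRIMARY KEY,
    sourcename TEXT UNIQUE,
    sourceshortname TEXT
);
CREATE TABLE definitions(
    definition_id INTEGER PRIMARY KEY,
    definition TEXT,
    label TEXT,
    fk_entry_id INTEGER REFERENCES entries(entry_id),
    fk_source_id INTEGER REFERENCES sources(source_id),
    UNIQUE(definition, label, fk_entry_id, fk_source_id)
);
""")

# One entry, two made-up sources, one definition from each source.
conn.execute(
    "INSERT INTO entries VALUES (1, '廣東話', '广东话', "
    "'guang3 dong1 hua4', 'gwong2 dung1 waa2', 0.0)"
)
conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [(1, "Source A", "A"), (2, "Source B", "B")],
)
conn.executemany(
    "INSERT INTO definitions(definition, label, fk_entry_id, fk_source_id) "
    "VALUES (?, ?, ?, ?)",
    [("Cantonese", "noun", 1, 1), ("the Cantonese language", "", 1, 2)],
)

# All definitions for one entry, each tagged with the source that provides it.
rows = conn.execute(
    "SELECT s.sourcename, d.label, d.definition "
    "FROM definitions AS d "
    "JOIN sources AS s ON s.source_id = d.fk_source_id "
    "WHERE d.fk_entry_id = ? "
    "ORDER BY s.sourcename",
    (1,),
).fetchall()
```

The single entries row is shared by both sources, which is why the source reference lives on definitions and not on entries.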

chinese_sentences

TODO: Fill this out

nonchinese_sentences

TODO: Fill this out

sentence_links

TODO: Fill this out

definitions_chinese_sentences_links

TODO: Fill this out