If you use these data, please cite
- the original source
Tjuka, Annika; Forkel, Robert; Rzymski, Christoph; and List, Johann-Mattis (2025): CLICS⁴: An Improved Database of Cross-Linguistic Colexifications [Dataset, Version 0.4]. Passau: MCL Chair at the University of Passau.
- the derived dataset using the DOI of the particular released version you were using
This dataset is licensed under a CC-BY-4.0 license
Available online at https://clics.clld.org
The CLICS⁴ workflow differs slightly from the one we used for CLICS³. We have drastically increased the number of datasets, but we also apply stricter selection criteria for the languages to be included. As a result, the numbers of concepts and language varieties differ as well.
The following points summarize major differences between CLICS³ and CLICS⁴:
- more datasets in CLICS⁴: CLICS⁴ now uses 98 datasets, while CLICS³ used 30
- fully transcribed data instead of data in orthography: CLICS⁴ now uses data fully transcribed to IPA, ignoring all datasets that only offer orthography (this results in fewer languages at times, despite the increase in datasets)
To install the required packages, first download the clics4 package with Git and then install it with pip, ideally in a fresh virtual environment. The following lines also check out the version that we used in this demo.
$ git clone https://github.com/clics/clics4.git
$ cd clics4
$ git checkout v0.4
$ pip install -e .
In order to do a fresh download of all the data that we use in CLICS⁴, you need to run the following command:
$ cldfbench download lexibank_clics4.py
Before you can run the code, make sure that you have downloaded all data and obtained copies of Glottolog, Concepticon, and CLTS. An easy way to obtain these with the help of cldfbench is to run the command `cldfbench catconfig` and follow the instructions there. If you use a Windows machine, some additional preparation is needed; please follow the respective instructions in Snee (2024).
If you have successfully run the `catconfig` subcommand, just type:
$ cldfbench lexibank.makecldf --glottolog-version=v5.1 --concepticon-version=v3.3.0 --clts-version=v2.3.0 lexibank_clics4.py
Otherwise, specify the explicit locations of the repositories for Glottolog, Concepticon, and CLTS as follows:
$ cldfbench lexibank.makecldf --glottolog-repos=Path2Glottolog --concepticon-repos=Path2Concepticon --clts-repos=Path2CLTS --glottolog-version=v5.1 --concepticon-version=v3.3.0 --clts-version=v2.3.0 lexibank_clics4.py
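The two variants of the command differ only in whether the repository locations are passed explicitly. As a minimal sketch, the following Python helper assembles the argument list for either case. The function name `build_makecldf_command` and the example paths are hypothetical; the flags and version numbers are the ones used in the commands above.

```python
import shlex

# Catalog versions as pinned in the commands above.
CATALOG_VERSIONS = {
    "glottolog": "v5.1",
    "concepticon": "v3.3.0",
    "clts": "v2.3.0",
}


def build_makecldf_command(repos=None, script="lexibank_clics4.py"):
    """Return the argument list for the makecldf call.

    `repos` optionally maps a catalog name to the path of a local clone.
    It is only needed when the catalogs were not configured via catconfig.
    """
    repos = repos or {}
    unknown = set(repos) - set(CATALOG_VERSIONS)
    if unknown:
        raise ValueError(f"unknown catalog(s): {sorted(unknown)}")

    argv = ["cldfbench", "lexibank.makecldf"]
    # Explicit repository locations come first, if any were given.
    for name in CATALOG_VERSIONS:
        if name in repos:
            argv.append(f"--{name}-repos={repos[name]}")
    # The version pins are always passed.
    for name, version in CATALOG_VERSIONS.items():
        argv.append(f"--{name}-version={version}")
    argv.append(script)
    return argv


if __name__ == "__main__":
    # Case 1: catalogs already configured with catconfig.
    print(shlex.join(build_makecldf_command()))
    # Case 2: explicit (hypothetical) paths to local clones.
    print(shlex.join(build_makecldf_command({
        "glottolog": "/path/to/glottolog",
        "concepticon": "/path/to/concepticon",
        "clts": "/path/to/clts",
    })))
```

The helper only prints the command; to execute it, the returned list could be passed to `subprocess.run` inside the virtual environment in which the package was installed.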
We consider this release of CLICS⁴ generally good enough, with respect to the data, to be used in publications (small errors are always possible when such large amounts of data are aggregated from different sources). However, we emphasize that there are a couple of shortcomings that we will try to address before publishing a new version of CLICS that succeeds the current version 3.0 at https://clics.clld.org. Before publishing this new CLLD version of CLICS⁴, we will implement a new representation of the data that adheres to the representation of ParameterNetworks in the new CLDF specification.
- Varieties: 3,420 (linked to 2,149 different Glottocodes)
- Concepts: 1,730 (linked to 1,730 different Concepticon concept sets)
- Lexemes: 1,443,325
- Sources: 94
- Synonymy: 1.10
- Invalid lexemes: 0
- Tokens: 8,107,217
- Segments: 2,034 (0 BIPA errors, 0 CLTS sound class errors, 2,026 CLTS modified)
- Inventory size (avg): 40.74
- Languages linked to bookkeeping languoids in Glottolog:
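
To make the synonymy figure in the statistics above more tangible, here is a minimal sketch of one way such an index can be computed. It assumes that synonymy means the number of lexemes per concept within a variety, averaged over all varieties; this definition is an assumption and may differ from the computation used by the toolchain that generated the statistics. The function name and the toy data are hypothetical and not taken from CLICS⁴.

```python
from collections import Counter, defaultdict


def synonymy_index(forms):
    """Mean number of forms per concept within a variety, averaged over varieties.

    `forms` is an iterable of (variety, concept, form) triples.
    """
    per_variety = defaultdict(Counter)
    for variety, concept, _form in forms:
        per_variety[variety][concept] += 1
    if not per_variety:
        raise ValueError("no forms given")
    # For each variety: total number of forms divided by number of concepts.
    ratios = [sum(c.values()) / len(c) for c in per_variety.values()]
    return sum(ratios) / len(ratios)


# Hypothetical toy data, not taken from CLICS⁴.
toy_forms = [
    ("variety_a", "HAND", "f1"),
    ("variety_a", "ARM", "f1"),
    ("variety_a", "TREE", "f2"),
    ("variety_a", "TREE", "f3"),  # second form for TREE: a synonym
    ("variety_b", "HAND", "f4"),
    ("variety_b", "ARM", "f5"),
]

if __name__ == "__main__":
    print(round(synonymy_index(toy_forms), 2))
```

Under this definition, a value of 1.0 would mean that every concept is expressed by exactly one form in every variety, and values above 1.0 indicate that some concepts have more than one form.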
Name | GitHub user | Description | Role |
---|---|---|---|
Annika Tjuka | @annikatjuka | maintainer | Author |
Christoph Rzymski | @chrzyki | maintainer | Author |
Robert Forkel | @xrotwang | maintainer | Author |
Johann-Mattis List | @LinguList | maintainer | Author |
The following CLDF datasets are available in cldf: