If you use these data, please cite this dataset using the DOI of the specific released version you used.
This dataset is licensed under a CC-BY-4.0 license.
The repository includes:
- data in CLDF
- commands to compile the colexifications
- inventory of concepts, Lexibank datasets, and language families
- table with forms
- output including graphs, plots, and degrees
- raw data folder for local clones of the Lexibank repositories
- scripts for the analyses of ARI and degree values
To run the code, you need to install the following. We recommend using a fresh virtual environment.
Clone the GitHub repository in a local folder:
$ git clone https://github.com/clics/clicsbp.git
Change the directory to clicsbp and run:
$ pip install -e .
Check that everything worked by typing cldfbench -h. If the output lists commands starting with clicsbp. (for example, clicsbp.coverage), you will be able to run the following code.
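The check described above can also be scripted. The sketch below is one possible way to do it. The helper name has_clicsbp_commands is hypothetical and not part of the repository, and the sketch assumes that the help output of cldfbench lists the registered subcommands by name.

```shell
# Hypothetical helper: reads help text on stdin and succeeds if it
# mentions at least one subcommand starting with "clicsbp.".
has_clicsbp_commands() {
    grep -q 'clicsbp\.'
}

# Example (requires an installed cldfbench):
#   cldfbench -h 2>&1 | has_clicsbp_commands && echo "clicsbp commands found"
```

The helper only inspects text, so it cannot tell whether the commands themselves run correctly.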
In addition, you need the packages from our other reference catalogs (Glottolog, Concepticon, CLTS) and the pyclics package. Make sure to install pyclics by downloading the GitHub repository, checking out the branch colexifications, and then installing the package by running pip install -e . from inside the repository.
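The pyclics installation described above might look like the following sketch. The repository URL is an assumption that is not stated in this README, so verify it before running anything. The commands are wrapped in a function, here called install_pyclics (a hypothetical name), so that nothing is downloaded until the function is called.

```shell
# Assumed value: check the URL against the pyclics project page.
PYCLICS_REPO="https://github.com/clics/pyclics.git"
# Branch named in the installation instructions.
PYCLICS_BRANCH="colexifications"

install_pyclics() {
    # Clone the repository with the requested branch checked out,
    # then install the local copy in editable mode.
    git clone --branch "$PYCLICS_BRANCH" "$PYCLICS_REPO" pyclics &&
    pip install -e ./pyclics
}
```

With the virtual environment active, calling install_pyclics once should be enough.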
The following command downloads the Lexibank datasets into the local raw folder:
$ cldfbench download lexibank_clicsbp.py
To create the CLDF dataset with the colexifications aggregated from the Lexibank word lists, use:
$ cldfbench lexibank.makecldf lexibank_clicsbp.py --concepticon-version=v3.1.0 --clts-version=v2.2.0 --glottolog-version=v4.8
Note that the versions of the reference catalogs change and might need to be adapted in the future.
Now, you are able to compute the colexifications and perform the analysis. First, run:
$ cldfbench clicsbp.colexifications
$ cldfbench clicsbp.colexify_all_data
Calculate the coverage of the data with:
$ cldfbench clicsbp.coverage
The ARI values can be computed for each of the semantic domains by typing:
$ cldfbench clicsbp.ari --tag "human body part"
$ cldfbench clicsbp.ari --tag "emotion"
$ cldfbench clicsbp.ari --tag "color"
Examine the degree distributions for a given weight, for example the language weight, with:
$ cldfbench clicsbp.degrees --weight "language"
To create images of the networks with body part colexifications for each of the 20 language families, use:
$ cldfbench clicsbp.plotgraphs --weight=Cognate_Count_Weighted --tag="human body part"
The cognitive relations associated with body part colexifications can be explored by creating pie charts of the data:
$ cldfbench clicsbp.piecharts --weight=Language_Count_Weighted
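The commands from the download step onwards can be chained in a small wrapper script. The following is a sketch only. The step and pipeline functions and the CLICSBP_RUN variable are hypothetical names, not part of the repository, and the catalog versions are copied from the makecldf command above and may need updating. By default the script only prints the commands. Set CLICSBP_RUN=1 to execute them.

```shell
#!/bin/sh
set -eu

# Dry run by default: commands are printed, not executed.
CLICSBP_RUN="${CLICSBP_RUN:-0}"

# Print a command and, if CLICSBP_RUN=1, execute it.
step() {
    printf '+ %s\n' "$*"
    if [ "$CLICSBP_RUN" = "1" ]; then
        "$@"
    fi
}

pipeline() {
    # Download the word lists and build the CLDF dataset.
    step cldfbench download lexibank_clicsbp.py
    step cldfbench lexibank.makecldf lexibank_clicsbp.py \
        --concepticon-version=v3.1.0 --clts-version=v2.2.0 --glottolog-version=v4.8

    # Compute colexifications and coverage.
    step cldfbench clicsbp.colexifications
    step cldfbench clicsbp.colexify_all_data
    step cldfbench clicsbp.coverage

    # One ARI run per semantic domain.
    for tag in "human body part" "emotion" "color"; do
        step cldfbench clicsbp.ari --tag "$tag"
    done

    # Degree distributions, network plots, and pie charts.
    step cldfbench clicsbp.degrees --weight "language"
    step cldfbench clicsbp.plotgraphs --weight=Cognate_Count_Weighted --tag="human body part"
    step cldfbench clicsbp.piecharts --weight=Language_Count_Weighted
}

pipeline
```

Because of set -e, a real run stops at the first failing command, which keeps later steps from working on incomplete output. The printed lines are a log only and drop the quoting around multi-word arguments.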
- Varieties: 1,028
- Concepts: 1,500
- Lexemes: 662,159
- Sources: 0
- Synonymy: 1.13
- Invalid lexemes: 0
- Tokens: 3,861,880
- Segments: 1,390 (0 BIPA errors, 0 CLTS sound class errors, 1,382 CLTS modified)
- Inventory size (avg): 44.04
- Languages linked to bookkeeping languoids in Glottolog:
- Entries missing sources: 662159/662159 (100.00%)
The following CLDF datasets are available in cldf:
- CLDF Wordlist at cldf/cldf-metadata.json