Paraly: An (annotated) dataset for exploring the concept of paralysis (fr. ‘paralysie’) in a digital corpus of French Literature

PRs Welcome Open Code Open Data Open Science

Structure

paraly/
├── docs/
│   └── paraly_annotation_guidelines.pdf
├── data/
│   ├── training/
│   │   ├── train_fasttext_dataset.txt
│   │   ├── test_fasttext_dataset.txt
│   │   └── dev_fasttext_dataset.txt
│   ├── model/
│   │   ├── training.log
│   │   ├── test.tsv
│   │   ├── paraly_camembert_large_multilabel.pt
│   │   ├── loss.tsv
│   │   └── dev.tsv
│   ├── errors/
│   │   └── all_metadata_errors.csv
│   ├── corpus/
│   │   ├── 20_paraly_metadata.csv
│   │   ├── 20_paraly_data_TEI.xml/
│   │   ├── 20_paraly_corpus.cec6
│   │   ├── 19_paraly_metadata.csv
│   │   ├── 19_paraly_data_TEI.xml/
│   │   ├── 19_paraly_corpus.cec6
│   │   ├── 18_paraly_metadata.csv
│   │   ├── 18_paraly_data_TEI.xml/
│   │   └── 18_paraly_corpus.cec6
│   └── annotations/
│       ├── 20_paraly_annotations_v1.xlsx
│       ├── 20_paraly_annotations_v1.csv
│       ├── 19_paraly_annotations_v1.xlsx
│       ├── 19_paraly_annotations_v1.csv
│       ├── 18_paraly_annotations_v1.xlsx
│       └── 18_paraly_annotations_v1.csv
├── code/
│   ├── training/
│   │   ├── train_fc.py
│   │   └── README_training.md
│   ├── splitting/
│   │   ├── prepare_training_data.py
│   │   └── README_splitting.md
│   ├── merging/
│   │   ├── README_merging.md
│   │   └── Merge.ipynb
│   ├── extraction/
│   │   ├── query.txt
│   │   ├── Starten.bat
│   │   ├── Skript.cecs
│   │   └── README_extraction.md
│   ├── collection/
│   │   ├── get_metadata_for_corpus.ipynb
│   │   ├── get_metadata_for_all_books.ipynb
│   │   ├── get_OCRed_books_from_gallica.ipynb
│   │   └── comment_metadata_in_html_files.ipynb
│   └── app/
│       └── app.py
├── README.md
└── LICENSE.md

Collection

The full digital collection, organized by period, is available on Gallica at https://gallica.bnf.fr/html/und/litteratures/les-classiques-de-la-litterature-acces-par-periode?mode=desktop. Our focus is on the following collections:

The OCR-ed books and their metadata were downloaded using the scripts in ./code/collection/.

Annotation

The annotated data in ./data/annotations/ is labeled "c" (concrete), "f" (figurative), or "cf" (an "inter"-category).
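
As a quick sanity check on the annotation files, the label distribution can be counted with the standard library. This is only a minimal sketch: the column name `label` and the comma delimiter are assumptions, not taken from the files themselves, so check the header of the actual CSV before using it.

```python
import csv
from collections import Counter

# Hypothetical column name; check the header of the actual CSV files.
LABEL_COLUMN = "label"
# The three labels named in this README.
KNOWN_LABELS = {"c", "f", "cf"}


def count_labels(csv_file, label_column=LABEL_COLUMN):
    """Count how often each annotation label occurs in an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        label = (row.get(label_column) or "").strip().lower()
        # Keep unexpected or empty values visible instead of dropping them.
        counts[label if label in KNOWN_LABELS else "other"] += 1
    return counts


# Possible usage (the path is taken from the repository structure above):
# with open("data/annotations/19_paraly_annotations_v1.csv",
#           encoding="utf-8", newline="") as f:
#     print(count_labels(f))
```

Values outside the three documented labels are collected under `other`, so annotation slips show up in the counts.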

Model

The multilabel classifier paraly_camembert_large_multilabel was trained with the Flair library using the script in ./code/training/ and is openly available on Hugging Face.
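
The training files in ./data/training/ are named `*_fasttext_dataset.txt`, which suggests the fastText classification format: one example per line, with one or more `__label__<name>` tokens followed by the text. The sketch below parses such lines into label lists and text. The `__label__` prefix is the fastText default and is assumed here, not confirmed against the Paraly files; the helper names are illustrative.

```python
LABEL_PREFIX = "__label__"  # fastText default; assumed for the Paraly files


def parse_fasttext_line(line, prefix=LABEL_PREFIX):
    """Split one fastText-style line into (labels, text)."""
    tokens = line.strip().split()
    labels = []
    # Labels come first; stop at the first token without the prefix.
    while tokens and tokens[0].startswith(prefix):
        labels.append(tokens.pop(0)[len(prefix):])
    # Note: re-joining the tokens normalizes runs of whitespace in the text.
    return labels, " ".join(tokens)


def read_fasttext(lines, prefix=LABEL_PREFIX):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_fasttext_line(line, prefix) for line in lines if line.strip()]


# Possible usage (the path is taken from the repository structure above):
# with open("data/training/train_fasttext_dataset.txt", encoding="utf-8") as f:
#     examples = read_fasttext(f)
```

A line may carry several labels, which matches a multilabel setup; how the "cf" category is encoded in these files is not stated in this README.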

App

The app for using the classifier is openly available via Hugging Face Spaces.

License

Unless stated otherwise, this work is licensed as follows: