UB-Mannheim/paraly

Paraly: An (annotated) dataset for exploring the concept of paralysis (fr. ‘paralysie’) in a digital corpus of French Literature



Structure

paraly/
├── docs/
│   └── paraly_annotation_guidelines.pdf
├── data/
│   ├── training/
│   │   ├── train_fasttext_dataset.txt
│   │   ├── test_fasttext_dataset.txt
│   │   └── dev_fasttext_dataset.txt
│   ├── model/
│   │   ├── training.log
│   │   ├── test.tsv
│   │   ├── paraly_camembert_large_multilabel.pt
│   │   ├── loss.tsv
│   │   └── dev.tsv
│   ├── errors/
│   │   └── all_metadata_errors.csv
│   ├── corpus/
│   │   ├── 20_paraly_metadata.csv
│   │   ├── 20_paraly_data_TEI.xml/
│   │   ├── 20_paraly_corpus.cec6
│   │   ├── 19_paraly_metadata.csv
│   │   ├── 19_paraly_data_TEI.xml/
│   │   ├── 19_paraly_corpus.cec6
│   │   ├── 18_paraly_metadata.csv
│   │   ├── 18_paraly_data_TEI.xml/
│   │   └── 18_paraly_corpus.cec6
│   └── annotations/
│       ├── 20_paraly_annotations_v1.xlsx
│       ├── 20_paraly_annotations_v1.csv
│       ├── 19_paraly_annotations_v1.xlsx
│       ├── 19_paraly_annotations_v1.csv
│       ├── 18_paraly_annotations_v1.xlsx
│       └── 18_paraly_annotations_v1.csv
├── code/
│   ├── training/
│   │   ├── train_fc.py
│   │   └── README_training.md
│   ├── splitting/
│   │   ├── prepare_training_data.py
│   │   └── README_splitting.md
│   ├── merging/
│   │   ├── README_merging.md
│   │   └── Merge.ipynb
│   ├── extraction/
│   │   ├── query.txt
│   │   ├── Starten.bat
│   │   ├── Skript.cecs
│   │   └── README_extraction.md
│   ├── collection/
│   │   ├── get_metadata_for_corpus.ipynb
│   │   ├── get_metadata_for_all_books.ipynb
│   │   ├── get_OCRed_books_from_gallica.ipynb
│   │   └── comment_metadata_in_html_files.ipynb
│   └── app/
│       └── app.py
├── README.md
└── LICENSE.md

Collection

The full digital collection, organized by period, is available at https://gallica.bnf.fr/html/und/litteratures/les-classiques-de-la-litterature-acces-par-periode?mode=desktop. Our focus is on the following collections:

The OCR-ed books and their metadata were downloaded using the scripts in ./code/collection/.

Annotation

The annotated data in ./data/annotations/ is labeled "c" (concrete), "f" (figurative), or "cf" (an "inter" category between the two).
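
As a quick sanity check on the label set, the annotations can be tallied per category. The following is a minimal sketch, not the project's own tooling: the column names `sentence` and `label` are assumptions (the actual headers in the annotation CSVs may differ), and the sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the real files live in ./data/annotations/ and their
# column names are NOT confirmed by this README ("sentence"/"label" are assumed).
SAMPLE_CSV = """sentence,label
Il resta paralysé du bras gauche.,c
La peur paralysait toute la ville.,f
Une paralysie le gagnait peu à peu.,cf
"""

# The three categories named in the Annotation section.
VALID_LABELS = {"c", "f", "cf"}


def count_labels(csv_file, label_column="label"):
    """Count annotation labels, rejecting anything outside c / f / cf."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        label = row[label_column].strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        counts[label] += 1
    return counts


if __name__ == "__main__":
    print(count_labels(io.StringIO(SAMPLE_CSV)))
```

To run this on a real file, pass an open file handle instead of the in-memory sample, for example `open("data/annotations/19_paraly_annotations_v1.csv", encoding="utf-8", newline="")`, and adjust `label_column` to the actual header.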

Model

The multilabel classifier paraly_camembert_large_multilabel was trained with the Flair library, using the script in ./code/training/, and is openly available on Hugging Face.
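
The splits in ./data/training/ carry "fasttext" in their file names. Assuming they follow the FastText classification format (one example per line, each label prefixed with `__label__`, followed by the text), a line can be split into its labels and text as sketched below. The function name and the example line are hypothetical, and the exact layout of the real files should be checked before relying on this.

```python
# FastText classification format (assumed for the files in ./data/training/):
#   __label__<name> [__label__<name> ...] <text>
LABEL_PREFIX = "__label__"


def parse_fasttext_line(line):
    """Split one FastText-style line into (labels, text).

    Leading tokens that start with the label prefix are collected as labels;
    everything after them is treated as the example text.
    """
    tokens = line.strip().split()
    labels = []
    i = 0
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    return labels, " ".join(tokens[i:])


if __name__ == "__main__":
    # Invented example line, for illustration only.
    print(parse_fasttext_line("__label__f La peur paralysait toute la ville."))
```

A line may carry more than one `__label__` token, which is how this format accommodates multilabel examples.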

App

The app for using the classifier is openly available via Hugging Face Spaces.

License

Unless stated otherwise, this work is licensed under the terms given in LICENSE.md.
