digi-gt

Ground truth for the digitized historic collections of Universitätsbibliothek Mannheim.

The transcriptions were done with eScriptorium, a transcription platform developed as part of the Scripta and RESILIENCE projects (https://gitlab.com/scripta/escriptorium/).

After the transcriptions were exported as PAGE XML files, files without any transcription were deleted, and empty lines were removed from the remaining files:

# Remove PAGE XML files without any transcription.
rm -v $(grep -L "<Unicode>..*</Unicode>" *.xml)
# Remove carriage returns and empty lines in PAGE XML files.
perl -i -ne 'tr|\r||d; next if /^\s*$/; print' *.xml

List of transcriptions

Links