The Citation Extractor and Classifier (CEC) is software that automatically annotates in-text citations in academic papers provided as PDF files.
This page describes the Citation Extraction Service (CEX) component, which isolates the bibliographic references in a scholarly article so that further analysis is easier and faster. The extractor works as follows: the input PDF is processed into a structured output in TEI-XML format. From that output, a JSON file is created containing the sentences that include citations and the name of the section in which each sentence appears. This file is passed to CIC, which processes the sentences further and annotates each with its citation function.
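The TEI-XML to JSON step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes GROBID was asked to segment sentences (so the body contains `<s>` elements) and that in-text citations appear as `<ref type="bibr">`. The function name `citation_sentences` and the JSON keys `section`, `sentence`, and `citations` are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace used in GROBID output


def citation_sentences(tei_xml: str) -> list[dict]:
    """Collect sentences that contain a bibliographic <ref>, with their section title."""
    root = ET.fromstring(tei_xml)
    records = []
    for div in root.iter(f"{TEI}div"):
        head = div.find(f"{TEI}head")
        section = "".join(head.itertext()).strip() if head is not None else ""
        # Only direct <p> children, so nested <div> elements are not counted twice.
        for paragraph in div.findall(f"{TEI}p"):
            for sentence in paragraph.iter(f"{TEI}s"):
                refs = [r for r in sentence.iter(f"{TEI}ref") if r.get("type") == "bibr"]
                if refs:
                    records.append({
                        "section": section,  # hypothetical key names
                        "sentence": " ".join("".join(sentence.itertext()).split()),
                        "citations": ["".join(r.itertext()).strip() for r in refs],
                    })
    return records


# Tiny hand-written TEI fragment used only for illustration.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p><s>Prior work studied citation functions <ref type="bibr" target="#b0">[1]</ref>.</s>
<s>This sentence has no citation.</s></p></div>
</body></text></TEI>"""

print(json.dumps(citation_sentences(sample), indent=2))
```

Only the first sentence of the sample is kept, because the second contains no bibliographic reference.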
The CEX project uses GROBID, a machine learning library designed to extract, parse, and restructure raw documents, such as PDFs, into structured XML/TEI-encoded documents. Specifically, it uses GROBID's citation model, trained with one specific configuration, Training 5, out of six available training configurations (Pagnotta, O. (2024). CEX Project - trained GROBID citation models. Zenodo. https://doi.org/10.5281/zenodo.10529709).
In this setup, the GROBID Python Client is configured to use the processFulltextDocument option rather than processReferences. This choice matches the service's objective: the entire text must be processed in order to extract and isolate the sentences that contain citations, which are then used as input for the classifier. GROBID's modular design makes it possible to plug in the retrained citation model, but only the citation model was retrained, not the full-text model. As a result, the sections annotated in the generated XML may be inconsistent with those of the original PDF.
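For illustration, the sketch below calls the GROBID HTTP service directly rather than going through the GROBID Python Client that the project uses. It assumes a GROBID server on the default `http://localhost:8070`, uses the `processFulltextDocument` endpoint with sentence segmentation switched on, and relies on the third-party `requests` package for the upload. The function names are hypothetical.

```python
from urllib.parse import urljoin


def fulltext_request(base_url: str = "http://localhost:8070") -> tuple[str, dict]:
    """Build the endpoint URL and form fields for a full-text extraction call."""
    url = urljoin(base_url.rstrip("/") + "/", "api/processFulltextDocument")
    # segmentSentences=1 asks GROBID to wrap each sentence in an <s> element.
    data = {"segmentSentences": "1"}
    return url, data


def process_pdf(pdf_path: str, base_url: str = "http://localhost:8070") -> str:
    """Send one PDF to GROBID and return the TEI-XML text (needs `requests`)."""
    import requests  # third-party; imported lazily so the helper above has no dependency

    url, data = fulltext_request(base_url)
    with open(pdf_path, "rb") as pdf:
        # GROBID expects the PDF in the multipart form field named "input".
        response = requests.post(url, files={"input": pdf}, data=data, timeout=120)
    response.raise_for_status()
    return response.text


url, data = fulltext_request()
print(url, data)
```

Calling `process_pdf("paper.pdf")` would return the TEI-XML for that file, provided the server is running. The file name here is only a placeholder.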
Section information in the results should therefore be interpreted with caution, given this limitation of the training configuration used in GROBID.
- Extract in-text citations, along with the sentences they appear in and the corresponding section titles, from a single PDF.
- The output is provided as a ZIP archive containing both the XML/TEI file generated by GROBID and a JSON file that includes the extracted citations, sentences, and section titles.
- Option to perform semantic alignment of the original PDF section titles with those that conform to the Discourse Element Ontology (DEO).
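A client receiving the ZIP archive described above could unpack it along these lines. This is a sketch under assumptions: the file names inside the archive are not documented here, so the members are located by extension only, and the function name `read_cex_archive` is hypothetical.

```python
import io
import json
import zipfile


def read_cex_archive(zip_bytes: bytes) -> tuple[str, object]:
    """Return (tei_xml_text, parsed_json) from an archive holding one XML and one JSON file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        # Assumption: exactly one member ends in .xml and one in .json.
        xml_name = next(n for n in names if n.lower().endswith(".xml"))
        json_name = next(n for n in names if n.lower().endswith(".json"))
        tei_xml = archive.read(xml_name).decode("utf-8")
        payload = json.loads(archive.read(json_name).decode("utf-8"))
    return tei_xml, payload


# Build a stand-in archive in memory; real archives come from the service.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("paper.tei.xml", "<TEI/>")
    archive.writestr("paper.json", json.dumps([{"section": "Introduction"}]))

tei, data = read_cex_archive(buffer.getvalue())
print(tei, data)
```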
To get a local copy up and running, follow the steps below.
- Java: OpenJDK 17
  openjdk version "17.0.11" 2024-04-16
  OpenJDK Runtime Environment (build 17.0.11+9-Ubuntu-122.04.1)
  OpenJDK 64-Bit Server VM (build 17.0.11+9-Ubuntu-122.04.1, mixed mode, sharing)
- Gradle: Version 0.8.0
Note: all operations must be performed as root (su).
- Clone the GROBID repo (https://github.com/kermitt2/grobid.git) into cex/src/
- Update the following directories with the corresponding model from extractor/cex/src/train_data: (1) grobid-home/models/citation, (2) grobid-trainer/resources/dataset/citation
- Substitute the model in grobid-home/models/citation with model5.wapiti from Pagnotta, O. (2024). CEX Project - trained GROBID citation models. Zenodo. https://doi.org/10.5281/zenodo.10529709, renaming it model.wapiti.
- Update the directory grobid-trainer/resources/dataset/citation/corpus with all the TEI-XML annotated files from Pagnotta, O. (2024). CEX Project - GROBID annotation aligned Gold Standard (Versione 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10529646.
- Run GROBID: ./gradlew run from cec/extractor/cex/src/grobid
- Create a Python virtual environment: python -m venv <your_venv>
- Activate the virtual environment: source <your_venv>/bin/activate
- Install the libraries: cd cex && pip install -r requirements.txt
- Run the app: python main.py
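Before starting the app, it can help to confirm that the GROBID service launched in the earlier step is reachable. The sketch below assumes GROBID listens on its default address, `http://localhost:8070`, and uses its `/api/isalive` endpoint. The function names are hypothetical and not part of CEX.

```python
import urllib.request


def grobid_alive_url(base_url: str = "http://localhost:8070") -> str:
    """Return the health-check URL for a GROBID server."""
    return base_url.rstrip("/") + "/api/isalive"


def is_grobid_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with 'true'."""
    try:
        with urllib.request.urlopen(grobid_alive_url(base_url), timeout=timeout) as response:
            return response.status == 200 and response.read().strip() == b"true"
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (all OSError subclasses).
        return False


if __name__ == "__main__":
    print("GROBID reachable:", is_grobid_alive())
```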
Configuration:
To change the PREFIX variable, edit cex/main.py. The default is PREFIX = /cex/
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have a suggestion that would make this project better, please fork the repo and create a pull request. If this sounds too complex, you can simply open an issue with the tag "enhancement". Don't forget to give the project a star!
Distributed under the ISC License. See LICENCE for more information.