Citation Extraction Service (CEX) [Current Release: Beta]

About The Project

The Citation Extractor and Classifier (CEC) is software that automatically annotates in-text citations in academic papers provided as PDF files.

This page describes the Citation Extraction Service (CEX) component, which isolates the bibliographic references of a scholarly article so that further analysis is easier and faster. The extractor works as follows: the input PDF is processed into a structured TEI-XML document. From that document, the service builds a JSON file containing the sentences that include citations and the name of the section in which each sentence appears. The JSON file is then passed to CIC, which annotates each citation with its citation function.
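The TEI-XML to JSON step can be pictured with a minimal sketch. It is not the project's implementation: the function name, the JSON field names (`section`, `sentence`, `citations`), and the sample document are assumptions made for illustration. It relies only on the TEI namespace and on GROBID's convention of marking in-text citations as `<ref type="bibr">` inside sentence elements (`<s>`), which are present when sentence segmentation is requested.

```python
import json
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def extract_citation_sentences(tei_xml: str) -> list[dict]:
    """Collect the sentences that contain at least one in-text citation."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    records = []
    # Each section of the body is a <div>; its title, if any, is in <head>.
    for div in body.iter(f"{{{TEI_URI}}}div"):
        head = div.find("tei:head", TEI_NS)
        section = "".join(head.itertext()).strip() if head is not None else ""
        # Only look at sentences in paragraphs directly under this <div>.
        for sentence in div.iterfind("tei:p/tei:s", TEI_NS):
            refs = sentence.findall("tei:ref[@type='bibr']", TEI_NS)
            if not refs:
                continue  # sentence without citations
            records.append({
                # Field names are hypothetical, not the CEX output schema.
                "section": section,
                "sentence": " ".join("".join(sentence.itertext()).split()),
                "citations": ["".join(r.itertext()).strip() for r in refs],
            })
    return records


# Made-up sample document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Introduction</head>
<p><s>Citation functions have been studied before <ref type="bibr" target="#b0">(Doe, 2020)</ref>.</s>
<s>This sentence cites nothing.</s></p></div>
</body></text></TEI>"""

print(json.dumps(extract_citation_sentences(SAMPLE_TEI), indent=2))
```

In this sketch a sentence without a citation marker is dropped, and the section name is taken from the nearest enclosing `<div>` heading.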

Technical Overview

The CEX project uses GROBID, a machine learning library that extracts, parses, and restructures raw documents, such as PDFs, into structured XML/TEI-encoded documents. Specifically, it relies on GROBID's citation model, trained with one of six available training configurations, Training 5 (Pagnotta, O. (2024). CEX Project - trained GROBID citation models. Zenodo. https://doi.org/10.5281/zenodo.10529709).

In this setup, the GROBID Python Client is configured to use the processFulltextDocument service rather than processReferences. This matches the service's objective: processing the entire text to extract and isolate the sentences that contain citations, which are then used as input for the classifier. GROBID's modular design makes it possible to swap in the retrained citation model, but only the citation model was trained, not the full-text model. As a result, the sections annotated in the generated XML may be inconsistent with those in the original PDF.
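As a rough illustration of the full-text request, the sketch below calls GROBID's REST endpoint directly instead of going through the GROBID Python Client that the project uses. The function names are hypothetical, and the base URL assumes a GROBID server on its default port 8070. The `requests` library is a third-party dependency of this sketch.

```python
from pathlib import Path


def fulltext_endpoint(base_url: str) -> str:
    """Build the URL of GROBID's full-text service from the server base URL."""
    return base_url.rstrip("/") + "/api/processFulltextDocument"


def pdf_to_tei(pdf_path: str, base_url: str = "http://localhost:8070") -> str:
    """Send one PDF to a running GROBID server and return the TEI-XML text."""
    import requests  # third-party; imported lazily so the helper above needs nothing

    with Path(pdf_path).open("rb") as pdf:
        response = requests.post(
            fulltext_endpoint(base_url),
            files={"input": pdf},            # the PDF goes in the "input" form field
            data={"segmentSentences": "1"},  # ask GROBID to mark sentence boundaries
            timeout=300,
        )
    response.raise_for_status()
    return response.text


# Example (requires a running GROBID server and a local PDF):
# tei_xml = pdf_to_tei("paper.pdf")
```

Requesting sentence segmentation is a choice of this sketch: it makes it straightforward to isolate the sentences that contain citations afterwards.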

Section titles and boundaries in the output should therefore be interpreted with caution, since they come from a model that was not retrained for this project.

Key Features

  • Extract in-text citations, along with the sentences they appear in and the corresponding section titles, from a single PDF.
  • The output is provided as a ZIP archive containing both the XML/TEI file generated by GROBID and a JSON file that includes the extracted citations, sentences, and section titles.
  • Option to perform semantic alignment of the original PDF section titles with those that conform to the Discourse Element Ontology (DEO).
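For readers who want to consume the output, here is a minimal sketch of unpacking the ZIP archive. The function name is hypothetical, and because the file names inside the archive are not specified here, the sketch simply picks the first `.xml` and the first `.json` entry it finds.

```python
import json
import zipfile


def read_cex_archive(zip_path: str) -> tuple[str, object]:
    """Return (TEI-XML text, parsed JSON content) from an output archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        # Entry names are not assumed; match on file extension only.
        xml_name = next(n for n in names if n.lower().endswith(".xml"))
        json_name = next(n for n in names if n.lower().endswith(".json"))
        tei_xml = archive.read(xml_name).decode("utf-8")
        citations = json.loads(archive.read(json_name))
    return tei_xml, citations
```

If the archive lacks either file, `next` raises `StopIteration`. A caller that needs a friendlier error can catch it and report which entry is missing.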

Getting Started

To get a local copy up and running, follow the steps below.

Development environment

  • Java: OpenJDK 17
    openjdk version "17.0.11" 2024-04-16
    OpenJDK Runtime Environment (build 17.0.11+9-Ubuntu-122.04.1)
    OpenJDK 64-Bit Server VM (build 17.0.11+9-Ubuntu-122.04.1, mixed mode, sharing)
  • Gradle: Version 0.8.0

Installation

Note: all operations must be performed as root (for example, via su).

Init Grobid

  • Clone the GROBID repo (https://github.com/kermitt2/grobid.git) into cex/src/
  • Update the following directories with the corresponding content from extractor/cex/src/train_data: (1) grobid-home/models/citation, (2) grobid-trainer/resources/dataset/citation
  • Replace the model in grobid-home/models/citation with model5.wapiti from Pagnotta, O. (2024). CEX Project - trained GROBID citation models. Zenodo. https://doi.org/10.5281/zenodo.10529709, and rename it to model.wapiti.
  • Update the directory grobid-trainer/resources/dataset/citation/corpus with all the TEI-XML annotated files from Pagnotta, O. (2024). CEX Project - GROBID annotation aligned Gold Standard (Versione 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10529646.
  • Run GROBID: ./gradlew run from cec/extractor/cex/src/grobid
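The model replacement step can also be scripted. The sketch below is one possible way to do it, not part of the project: the function name is hypothetical, and keeping a backup copy of the existing model is a choice made here so that the change can be undone.

```python
import shutil
from pathlib import Path


def install_citation_model(downloaded_model: str, grobid_dir: str) -> Path:
    """Copy a trained model into GROBID's citation model directory as model.wapiti."""
    target_dir = Path(grobid_dir) / "grobid-home" / "models" / "citation"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "model.wapiti"
    if target.exists():
        # Keep the previous model next to the new one (assumption of this sketch).
        shutil.copy2(target, target.with_name("model.wapiti.bak"))
    shutil.copy2(downloaded_model, target)
    return target


# Example, with the downloaded file in the current directory:
# install_citation_model("model5.wapiti", "cex/src/grobid")
```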

Prepare/Run the python service

  • Create a Python virtual environment: python -m venv <your_venv>
  • Activate the virtual environment: source <your_venv>/bin/activate
  • Install the dependencies: cd cex && pip install -r requirements.txt
  • Run the app: python main.py
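Since the Python service depends on a running GROBID instance, it can help to check that GROBID is reachable before starting the app. The sketch below queries GROBID's health-check endpoint. The function name and the `opener` parameter are hypothetical, and the base URL assumes GROBID's default port 8070.

```python
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070",
                    opener=urllib.request.urlopen) -> bool:
    """Return True if the GROBID server answers its health-check endpoint."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with opener(url, timeout=5) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except OSError:
        # Covers connection errors and urllib.error.URLError (an OSError subclass).
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

The `opener` argument exists only so the function can be exercised without a live server. In normal use the default is enough.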

Configuration: to change the PREFIX variable, edit cex/main.py. The default is PREFIX = /cex/

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have a suggestion that would make this project better, please fork the repo and create a pull request. If that sounds too complex, you can simply open an issue with the tag "enhancement". Don't forget to give the project a star!

License

Distributed under the ISC License. See LICENCE for more information.