This repository contains a comprehensive pipeline for processing PDF documents, extracting features, and classifying them into different categories. Access the Streamlit dashboard for quick inference here: https://clascify-doraemon.streamlit.app/. Access the detailed project report here: https://drive.google.com/file/d/1Ni8SkQJm62X6zkBp52kb3Qx-SMwoqsJd/view?usp=sharing
```
├── Dataset/
│   ├── pdfs/
│   │   ├── publishable/
│   │   └── nonpublishable/
│   ├── texts/
│   │   ├── publishable/
│   │   └── nonpublishable/
│   ├── keywords/
│   └── vectors/
├── KDSH_2025_Dataset/
├── Sample/
│   ├── pdfs/
│   ├── texts/
│   ├── keywords/
│   └── vectors/
├── sci-pdf-parser/
│   ├── vila/
│   └── main.py
├── Binary_classification.py
├── Conference_classification.py
├── Corruption.py
├── Dashboard.py
├── FULL_CODE.ipynb
├── Inference.py
├── Mistral7b_Instruct_1.py
├── Mistral7b_Instruct_2.py
├── PDFparserFITZ.py
├── Pathway_inference.py
├── Scibert_embeddings.py
├── credentials.json
├── doraemon_binary_classifier.pt
├── doraemon_conference_classifier.pt
├── requirements.txt
└── results.csv
```
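The directory layout above can be scaffolded before the first run with a short script. The paths are taken from the tree; the script itself is not part of the repository:

```python
from pathlib import Path

# Directories the pipeline reads from or writes to (taken from the tree above).
DIRS = [
    "Dataset/pdfs/publishable",
    "Dataset/pdfs/nonpublishable",
    "Dataset/texts/publishable",
    "Dataset/texts/nonpublishable",
    "Dataset/keywords",
    "Dataset/vectors",
    "Sample/pdfs",
    "Sample/texts",
    "Sample/keywords",
    "Sample/vectors",
]

def scaffold(root="."):
    """Create every directory the pipeline expects, if it does not exist yet."""
    for d in DIRS:
        Path(root, d).mkdir(parents=True, exist_ok=True)
```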
- Clone the repository:
  ```
  git clone https://github.com/who-else-but-arjun/claSCIfy.git
  ```
- Navigate to the project directory:
  ```
  cd claSCIfy
  ```
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- Place PDFs in the appropriate directories:
  - Save publishable PDFs in `Dataset/pdfs/publishable/`.
  - Save non-publishable PDFs in `Dataset/pdfs/nonpublishable/`.
- Run PDF Parser:
  - Convert PDFs to JSON format:
    ```
    python PDFparserFITZ.py
    ```
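`PDFparserFITZ.py` converts each PDF into JSON. Its exact schema is internal to the repository, but the serialisation step can be sketched as below, assuming the page texts have already been extracted (e.g. with PyMuPDF's `page.get_text()`); the field names here are illustrative, not the repo's actual format:

```python
import json

def pages_to_json(filename, pages):
    """Serialise extracted page texts into a JSON document.

    `pages` is a list of plain-text strings, one per page.
    The schema below is an assumption for illustration only.
    """
    doc = {
        "filename": filename,
        "num_pages": len(pages),
        "pages": [{"page": i + 1, "text": t} for i, t in enumerate(pages)],
    }
    return json.dumps(doc, ensure_ascii=False, indent=2)
```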
- Corrupt Text Data:
  - Create non-publishable datasets:
    ```
    python Corruption.py
    ```
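`Corruption.py` synthesises non-publishable text from clean text. Its exact transformations are internal to the repository; as one simple illustration (not the repo's actual method), fluency can be degraded by randomly dropping words and swapping adjacent ones:

```python
import random

def corrupt(text, drop_prob=0.15, swap_prob=0.1, seed=None):
    """Randomly drop words and swap adjacent words to degrade fluency.

    This is an illustrative corruption, not the repository's actual one.
    """
    rng = random.Random(seed)
    # Drop each word independently with probability `drop_prob`.
    words = [w for w in text.split() if rng.random() > drop_prob]
    # Swap adjacent words with probability `swap_prob`.
    for i in range(len(words) - 1):
        if rng.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```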
- Extract Features:
  - Generate feature vectors and keywords:
    ```
    python Scibert_embeddings.py
    ```
  - Feature vectors are saved in `Dataset/vectors/`.
  - Keywords are saved in `Dataset/keywords/`.
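`Scibert_embeddings.py` produces both SciBERT vectors and keyword files. Loading SciBERT is beyond a README snippet, but the keyword side can be illustrated with a simple frequency-based extractor; the stop-word list and scoring below are assumptions, not the repository's method:

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption, not the repo's).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on",
             "with", "we", "that", "this", "are", "as", "by", "be", "it"}

def extract_keywords(text, k=10):
    """Return the k most frequent non-stop-word tokens as keywords."""
    tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]
```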
- Binary Classification:
  - Train the binary classification model:
    ```
    python Binary_classification.py
    ```
  - This script uses `doraemon_binary_classifier.pt` for training.
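The `.pt` checkpoint name and the binary task come from the repository; the architecture below is an assumed minimal MLP head over 768-dim SciBERT document vectors, shown only to illustrate the shape of such a model, not the actual one:

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """Illustrative MLP head over a 768-dim SciBERT document vector.

    The hidden size and dropout rate are assumptions for this sketch.
    """
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),  # single logit: publishable vs. not
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = BinaryClassifier()
# After training, weights would typically be saved with:
# torch.save(model.state_dict(), "doraemon_binary_classifier.pt")
```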
- Conference Classification:
  - Train the conference classification model:
    ```
    python Conference_classification.py
    ```
  - This script uses `doraemon_conference_classifier.pt` and classifies documents into conferences (EMNLP, KDD, TMLR, CVPR, NEURIPS).
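The five conference labels come from the source; mapping a model's class scores back to conference names could look like the sketch below (the index order is an assumption and must match the one fixed at training time):

```python
# The label order here is an assumption for illustration.
CONFERENCES = ["CVPR", "EMNLP", "KDD", "NEURIPS", "TMLR"]

def predict_conference(scores):
    """Return the conference whose score (logit or probability) is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CONFERENCES[best]
```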
- Place test PDFs in the `Sample/pdfs/` directory.
- Run inference:
  ```
  python Inference.py
  ```
- Results:
  - The results are saved in `results.csv`.
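The columns of `results.csv` are internal to `Inference.py`; a minimal sketch of writing one row per PDF, with assumed column names, could be:

```python
import csv

def write_results(rows, path="results.csv"):
    """Write one row per PDF to a CSV file.

    The column names below are assumptions for illustration,
    not necessarily those produced by Inference.py.
    """
    fields = ["filename", "publishable", "conference"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```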
- Justification Generation:
  - Use the following scripts for generating justifications:
    ```
    python Mistral7b_Instruct_1.py
    python Mistral7b_Instruct_2.py
    ```
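The Mistral scripts generate natural-language justifications for each decision. Building such a prompt might look like the hypothetical template below; the wording and field choices are assumptions, not the repository's actual prompt:

```python
def build_prompt(title, verdict, conference, keywords):
    """Assemble an instruction-style prompt for a Mistral-7B-Instruct model.

    The template wording here is hypothetical; only the [INST]...[/INST]
    wrapper follows Mistral's instruct format.
    """
    return (
        "[INST] You are a paper reviewer. Justify the following decision.\n"
        f"Paper: {title}\n"
        f"Decision: {verdict}\n"
        f"Recommended venue: {conference}\n"
        f"Key topics: {', '.join(keywords)}\n"
        "Write a short justification. [/INST]"
    )
```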
- Pathway Inference:
  - Run the Pathway connector and vector store service:
    ```
    python Pathway_inference.py
    ```
- Deploy the dashboard:
  ```
  streamlit run Dashboard.py
  ```
- Use the dashboard for quick PDF inferences and assessments.
- `Binary_classification.py`: Trains and runs the binary classification model for publishability vs. non-publishability.
- `Conference_classification.py`: Trains and runs the conference classification model for conference prediction.
- `Corruption.py`: Corrupts text data to create non-publishable datasets.
- `Dashboard.py`: Deploys a Streamlit dashboard for quick PDF inference.
- `FULL_CODE.ipynb`: Contains the full pipeline code in Jupyter notebook format.
- `Inference.py`: Runs inference on the sample data and saves the results.
- `Mistral7b_Instruct_1.py` & `Mistral7b_Instruct_2.py`: Two different approaches for generating justifications using Mistral.
- `PDFparserFITZ.py`: Parses PDFs into JSON format.
- `Pathway_inference.py`: Integrates Pathway features such as the Google Drive connector and vector store server to fetch PDFs and process them.
- `Scibert_embeddings.py`: Creates feature vectors using SciBERT embeddings.
- `main.py` (in `sci-pdf-parser/`): Parses PDFs into JSON format using the VILA model.
- Ensure all dependencies are installed by running:
  ```
  pip install -r requirements.txt
  ```
- Verify that all necessary files are placed in their respective directories.
- Check log outputs for specific errors during execution.