- Introduction
- Tech Stack
- Prerequisites
- Installation
- Usage
- Project Structure
- Supported Formats
- Limitations
- Contributing
A simple web application that automatically extracts personal information (Name, Phone Number, Address) from PDF documents and populates a web form. The system uses Python for PDF processing, Node.js for the API server, and React for the frontend interface.
- PDF text extraction and parsing
- Automatic detection of personal information
- Real-time form population
- RESTful API endpoints
- Web-based interface
- Support for various PDF formats
- Error handling and validation
- Node.js
- Python
- pdfplumber
- React
- Tailwind CSS
Before running this project, make sure you have:
-
Clone the repository:
git clone https://github.com/raghav-97/pdf-extractor.git cd pdf-extractor
-
Install Frontend Dependencies:
npm install
-
Navigate to backend folder and install backend Dependencies:
cd backend npm install
-
Install Python Dependencies
cd backend/python pip install -r requirements.txt
-
Start the backend server: Navigate to backend folder
cd backend // run this in \pdf-extractor\backend directory node server.js
-
Start the frontend Development server:
cd .. // make sure you are in \pdf-extractor directory npm run dev
Your project structure should look like this:
pdf-extractor/
├── backend/
│ ├── node_modules/
│ ├── python/
│ │ ├── pdf_parser.py
│ │ └── requirements.txt
│ ├── uploads/ # For storing uploaded PDFs
│ ├── package.json
│ ├── package-lock.json
│ └── server.js
├── node_modules/
├── public/
├── src/
├── .gitignore
├── eslint.config.js
├── index.html
├── package.json
├── postcss.config.js
├── README.md
├── tailwind.config.js
└── vite.config.js
The application can extract information from PDFs that:
Contain searchable text (not scanned images) Have information labeled with common patterns:
- "Name:"
- "Phone:"
- "Address:"
- Password-protected PDFs
- Scanned PDFs without OCR
- Image-only PDFs
- Information should be preceded by labels (e.g., "Name:", "Phone:", "Address:")
- Text should be in a readable format
-
Fork the repository
-
Create a feature branch:
git checkout -b feature/my-feature
-
Commit changes:
git commit -am 'Add new feature'
-
Push to branch:
git push origin feature/my-feature
-
Submit a Pull Request