Skip to content

Given a single-page PDF, extract the table and output it into Excel format.

Notifications You must be signed in to change notification settings

palakjadwani/pdf-table-to-excel-extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

pdf-table-to-excel-extraction

Given a single-page PDF, extract the table and output it into Excel format.

The top row of the Excel file will be the headers of the table. Each row in the table corresponds to one Excel row, and likewise for Excel columns.

  • Pdf is first converted into an image using Pdf2image library.
  • Image is then converted to grayscale.
  • Median blur is done to remove noise.
  • Used Tesseract to apply OCR and extract text from image and to localize each area of text in the image.
  • Extracted the bounding box coordinates of the text region.
  • Extracted PDF columns on the basis of bounding box 'top' coordinate and form column dataframe.
  • Adjusted rows to its respective row number in column dataframe.
  • Transposed the dataframe and set column names.
  • Standardized date entries in YYYY/MM/DD format.
  • Standardize numerical entries in ###.## format by removing commas.
  • Write dataframe to excel file.

About

Given a single-page PDF, extract the table and output it into Excel format.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages