Implementation of Cosine Document Distance algorithm.
Required Libraries
For simple PDFs: a) import pdfToTxt from pdfReader b) pdfminer.six
For scanned PDFs (OCR to text): a) PIL b) wand c) pyocr d) tesseract-ocr e) ImageMagick 6.9.9-37 (install magic with pip and the ImageMagick software from the official site) f) GhostScript
Installation Tutorial :
For Word Documents: a) python-docx
Additional Libraries: a) nltk
- Place the documents in the same directory as the program.
The output is approximate, because the cosine document distance algorithm compares only word frequencies and ignores word order and meaning.
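The word-frequency comparison described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the program's actual code: the function names `word_frequencies` and `cosine_distance` are hypothetical, and the tokenization (lowercased runs of letters and digits) is an assumption. The distance is the angle between the two frequency vectors, from 0 (same word proportions) to pi/2 (no words in common).

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count each lowercased alphanumeric word (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(text_a, text_b):
    """Return the angle in radians between two word-frequency vectors."""
    freq_a = word_frequencies(text_a)
    freq_b = word_frequencies(text_b)

    # Dot product over words of the first document; words missing from
    # the second document count as 0.
    dot = sum(count * freq_b[word] for word, count in freq_a.items())

    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both documents must contain at least one word")

    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


if __name__ == "__main__":
    print(cosine_distance("the quick brown fox", "the lazy brown dog"))
```

Because only counts are compared, two documents with the same words in a different order get a distance of about 0, which is one reason the result should be read as a rough similarity measure.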
Algorithm Reference: