Scrape proceedings to find software links #2
Some tools that seem useful for getting started:
I've used pdf2txt before, and it gives mildly usable results. I'll write a script (it'll be bash, hopefully that's okay) that dumps PDFs to strings of text, which may or may not be usable. It's also possible to dump XML with some features.
Sorry, the tool is actually called PDFMiner:
There is a USB key around with the proceedings. Look for
Running
yields an XML file. Interestingly, the link to GitHub is inside an href, so searching for href will certainly help.
Yep, PDFMiner is not the best choice when pdftohtml already extracts the hrefs.
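A minimal sketch of the href search discussed above, assuming the XML dump has already been read into a string. The function name `extract_hrefs` and the sample markup are hypothetical. The sketch uses a regular expression rather than an XML parser so that it still works if the dump is not well-formed.

```python
import re
from html import unescape

# Matches href="..." or href='...' anywhere in the dump.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def extract_hrefs(xml_text):
    """Return href values in order of appearance, without duplicates."""
    found = []
    for match in HREF_RE.finditer(xml_text):
        # Undo XML/HTML entity escaping such as &amp; inside the attribute.
        url = unescape(match.group(2)).strip()
        if url and url not in found:
            found.append(url)
    return found


# Hypothetical sample line, not taken from any real proceedings file.
sample = '<text top="1"><a href="https://github.com/example/project">code</a></text>'
print(extract_hrefs(sample))
```

This only finds links that the PDF actually marks up as hrefs. Links that appear as plain text need a separate pass.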
The link in das15.pdf does not have an http or www prefix, and it is not href'ed within the PDF. The link looks like this: It would be great if we could still catch it as a positive. The problem is that the ".edu" suffix is also present in email addresses.
So we'll have to filter out matches that contain an at-sign.
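A rough sketch of that filter, assuming the PDF text is already available as a plain string. The suffix list, the function name `find_bare_links`, and the `example.edu` addresses are all hypothetical. The idea is to match bare host names ending in a known suffix and to reject any match that sits directly next to an at-sign.

```python
import re

# Hypothetical, deliberately short suffix list; extend as needed.
TLDS = r"(?:edu|org|com|net|io)"

BARE_LINK_RE = re.compile(
    r"(?<![@\w.-])"                 # not preceded by '@' or by more of the same token
    r"(?:[a-z0-9-]+\.)+" + TLDS +   # host name ending in a known suffix
    r"(?![a-z0-9@-])"               # suffix must end here and not run into an '@'
    r"(?:/\S*)?",                   # optional path
    re.IGNORECASE,
)


def find_bare_links(text):
    """Return scheme-less links found in text, skipping email addresses."""
    links = []
    for match in BARE_LINK_RE.finditer(text):
        # Drop punctuation that belongs to the sentence, not to the link.
        links.append(match.group(0).rstrip(".,;:)"))
    return links


# Hypothetical text; the real link from das15.pdf is not reproduced here.
text = "Code at cs.example.edu/~das/software. Contact das@cs.example.edu for details."
print(find_bare_links(text))
```

The lookbehind is what drops email addresses: the host part of `das@cs.example.edu` is always preceded by an at-sign, a word character, or a dot, so no match can start inside it.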
A starting set of proceedings:
http://www.jmlr.org/proceedings/