Scrape proceedings to find software links #2

Open
chengsoonong opened this issue Jul 9, 2015 · 8 comments

Comments

@chengsoonong
Member

A starting set of proceedings:
http://www.jmlr.org/proceedings/

@ahonkela added this to the MLOSS15 ICML Hackathon milestone Jul 9, 2015
@ahonkela
Member

Some tools that seem useful for getting started:
http://pdftohtml.sourceforge.net/
http://www.crummy.com/software/BeautifulSoup/

@rcurtin

rcurtin commented Jul 10, 2015

I've used pdf2txt before, and it gives mildly usable results. I'll write up a script (it'll be bash; hopefully that's okay) that dumps PDFs to strings of text, which may or may not be usable. It's also possible to dump XML, which keeps some extra features.

@rcurtin

rcurtin commented Jul 10, 2015

Sorry, the tool is actually called PDFMiner:
http://www.unixuser.org/~euske/python/pdfminer/

@lostanlen

A USB key with the proceedings is going around. Look for
icml2015/amid15.pdf
for a sample paper with a link to code.

@lostanlen

Running

pdftohtml -xml filename.pdf

yields an XML file. Interestingly, the link to GitHub sits inside an href, so searching for hrefs will certainly help.
@rcurtin has tried pdfminer in the meantime, and it does not show hrefs.
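
For anyone picking this up, here is a minimal sketch of pulling the href values out of the XML that pdftohtml produces. It uses a plain regular expression rather than an XML parser, on the assumption that the output may not always be well-formed. The function names and the prefix filter are hypothetical, not part of any existing script in this project.

```python
import html
import re

# Matches href="..." attributes anywhere in the text (assumes double quotes).
HREF_RE = re.compile(r'href="([^"]*)"')


def extract_hrefs(xml_text):
    """Return every href value found in the XML text, in document order."""
    # html.unescape turns entities such as &amp; back into plain characters.
    return [html.unescape(value) for value in HREF_RE.findall(xml_text)]


def external_hrefs(xml_text):
    """Keep only hrefs that look like web links, dropping internal anchors."""
    prefixes = ("http://", "https://", "www.")
    return [h for h in extract_hrefs(xml_text) if h.startswith(prefixes)]
```

Reading the XML file into a string and passing it to `external_hrefs` should give a list of candidate software links per paper; whether each one really points at code still needs a separate check.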

@rcurtin

rcurtin commented Jul 10, 2015

Yep, pdfminer is not the best choice when pdftohtml is already extracting hrefs.

@lostanlen

The link in das15.pdf does not have any http or www prefix, and it is not href'ed within the PDF. The link looks like this:
lists.cs.princeton.edu/pipermail/topic- models/attachments/20140424/8eea8833/attachment-0001.zip

It would be great if we could still catch it as a positive. The problem is that the ".edu" suffix is also present in email addresses.

@lostanlen

So we'll have to treat anything containing the at-sign @ as a negative. Thanks @rcurtin
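
A rough sketch of that filter, in case it helps: it scans whitespace-separated tokens, skips anything containing @, and keeps tokens that look like a host name with a known suffix and an optional path, with or without an http prefix. The suffix list, the function name, and the sample strings are assumptions for illustration only. A link that the PDF extraction splits with a space would only be matched up to the space.

```python
import re

# Illustrative suffix list (an assumption); extend it as needed.
TLDS = ("edu", "org", "com", "net", "io")

# Optional scheme, one or more dotted labels, a known suffix, optional path.
BARE_URL_RE = re.compile(
    r"^(?:https?://)?(?:[A-Za-z0-9-]+\.)+(?:%s)(?:/\S*)?$" % "|".join(TLDS)
)


def find_link_candidates(text):
    """Return tokens that look like links, ignoring email addresses."""
    candidates = []
    for token in text.split():
        # Drop punctuation that commonly wraps a link in running text.
        token = token.strip(".,;:()[]<>")
        if "@" in token:
            continue  # email addresses count as negatives
        if BARE_URL_RE.match(token):
            candidates.append(token)
    return candidates
```

For example, `find_link_candidates("Code at code.example.edu/project/release.zip. Contact someone@example.edu")` keeps the first address and drops the email.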
