Scrape proceedings to find software links #2

chengsoonong · 2015-07-09T08:52:08Z

A starting set of proceedings:
http://www.jmlr.org/proceedings/

ahonkela · 2015-07-10T13:06:30Z

Some tools that seem useful for getting started:
http://pdftohtml.sourceforge.net/
http://www.crummy.com/software/BeautifulSoup/

rcurtin · 2015-07-10T15:19:05Z

I've used pdf2txt before, and it provides mildly usable results. I'll write up a script (it'll be bash, hopefully that's okay) that will dump PDFs to strings of text which may or may not be usable. It's also possible to dump XML with some features.

rcurtin · 2015-07-10T15:19:54Z

Sorry, the tool is actually called PDFMiner:
http://www.unixuser.org/~euske/python/pdfminer/

lostanlen · 2015-07-10T15:34:52Z

A USB key with the proceedings is going around. Look for
icml2015/amid15.pdf
for a sample paper with a link to code.

lostanlen · 2015-07-10T15:40:57Z

Running

pdftohtml -xml filename.pdf

yields an XML file. Interestingly, the link to GitHub is within an href, so searching for href attributes will certainly help.
@rcurtin has tried pdfminer in the meantime, and it does not show the hrefs.
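
For anyone who wants to try this, here is a minimal sketch of pulling the hrefs out of the XML text that pdftohtml -xml produces. It uses only the Python standard library rather than BeautifulSoup, and the names HrefCollector, extract_hrefs and SAMPLE are made up for illustration. The sample fragment is hand-written to resemble the output and is not taken from a real paper, so treat the attribute layout as an assumption.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in the input."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(xml_text):
    """Return all href values found in the given XML/HTML text, in order."""
    parser = HrefCollector()
    parser.feed(xml_text)
    parser.close()
    return parser.hrefs


# Hypothetical fragment, written by hand for illustration only.
SAMPLE = """<page number="1">
<text top="100" left="50" width="300" height="12" font="0">Code: <a href="https://github.com/example/project">https://github.com/example/project</a></text>
<text top="120" left="50" width="300" height="12" font="0">No link on this line</text>
</page>"""

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(href)
```

A lenient parser is used here on purpose, on the assumption that the generated XML may not always be well-formed; a strict XML parser would also work when it is.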

rcurtin · 2015-07-10T15:41:50Z

Yep, pdfminer is not the best choice when pdftohtml is already extracting hrefs.

lostanlen · 2015-07-10T15:46:05Z

das15.pdf does not have any http or www prefix, and it is not href'ed within the PDF. The link looks like this:
lists.cs.princeton.edu/pipermail/topic- models/attachments/20140424/8eea8833/attachment-0001.zip

It would be great if we could still catch it as a positive. The problem is that the ".edu" suffix is also present in email addresses.

lostanlen · 2015-07-10T15:52:41Z

So we'll have to treat anything containing an at-sign (@) as a negative. Thanks @rcurtin
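
A rough sketch of that heuristic, in case it helps: split the extracted text on whitespace, drop any token containing @, and keep tokens that look like a host name with a known suffix, with or without an http prefix. The function name find_links, the pattern LINK_PATTERN and the short suffix list are assumptions for illustration, not something we have agreed on, and the example text is made up.

```python
import re

# Assumed, deliberately short list of suffixes; extend as needed.
LINK_PATTERN = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)+(?:edu|org|com|net|io)(?:/\S*)?$",
    re.IGNORECASE,
)


def find_links(text):
    """Return tokens that look like links, skipping email addresses."""
    links = []
    for token in text.split():
        # Trim punctuation that commonly surrounds a link in running text.
        token = token.strip(".,;:()[]<>\"'")
        if "@" in token:
            # Email addresses share suffixes like .edu, so count them as negatives.
            continue
        if LINK_PATTERN.match(token):
            links.append(token)
    return links


if __name__ == "__main__":
    # Hypothetical text, not taken from any of the proceedings.
    example = (
        "Code is at lists.example.edu/pipermail/attachment.zip; "
        "contact jane@cs.example.edu for questions."
    )
    print(find_links(example))
```

Because it works token by token, this will truncate a link that the PDF extraction has split with a space, so broken links like the one in das15.pdf would need to be rejoined first.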

ahonkela added this to the MLOSS15 ICML Hackathon milestone Jul 9, 2015
