This repository has been archived by the owner on Jul 28, 2021. It is now read-only.

Known failure texts #1

Open
ikarth opened this issue Nov 3, 2015 · 7 comments

Comments

@ikarth

ikarth commented Nov 3, 2015

Just a short list of Project Gutenberg files that are known failures:

00ws110.txt (and a lot of the other early William Shakespeare texts.)
zncli10.txt
zen10.txt
25019-0.txt
25012-8.txt

As a general rule, numbered files seem to be standardized enough to pass, mostly, while the earlier files with letters in the file name are more hit-or-miss. Non-English files also tend to fail for some reason. Also, books with names too long to fit on one line tend to fail.
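The naming rule above can be sketched as a quick triage step. The pattern here is an assumption inferred from the file names mentioned in this thread (including the pgNNN.txt form), and `naming_style` is a hypothetical helper, not something taken from guten-gutter:

```python
import re

# Assumed pattern for "numbered" names such as 25019-0.txt or pg62.txt.
# Anything else (00ws110.txt, zncli10.txt, ...) is treated as a legacy name.
NUMBERED = re.compile(r"^(?:pg)?\d+(?:-\d+)?\.txt$")


def naming_style(filename):
    """Return 'numbered' for digit-only style names, otherwise 'legacy'."""
    return "numbered" if NUMBERED.match(filename) else "legacy"


if __name__ == "__main__":
    for name in ["00ws110.txt", "zncli10.txt", "zen10.txt", "25019-0.txt"]:
        print(name, naming_style(name))
```

This only sorts files into the two buckets; as the list above shows, numbered files can still fail.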

@ikarth
Author

ikarth commented Nov 3, 2015

The non-English failures may be due to the really long German titles on the books I was looking at; the Chinese ones I glanced at seem to be fine.

@cpressey
Member

cpressey commented Nov 3, 2015

Handling titles that extend over 2 lines should definitely be fixed, if it can be (it probably can.)
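One possible way to cope with wrapped titles, as a rough sketch only. It assumes the title sits in a `Title:` header field and that continuation lines are indented, which is an assumption about the file layout and says nothing about how guten-gutter actually parses headers. `read_title` is a hypothetical name:

```python
def read_title(lines):
    """Join a 'Title:' header with any indented continuation lines.

    Returns None when no 'Title:' line is found.
    """
    parts = None
    for line in lines:
        if parts is not None:
            # Assumed convention: a continuation line is indented and non-blank.
            if line.startswith((" ", "\t")) and line.strip():
                parts.append(line.strip())
            else:
                break
        elif line.startswith("Title:"):
            parts = [line[len("Title:"):].strip()]
    if parts is None:
        return None
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    # Made-up header for illustration, not copied from a real Gutenberg file.
    header = [
        "Title: A Very Long Title That Does Not",
        "       Fit on One Line",
        "",
        "Author: Nobody",
    ]
    print(read_title(header))
```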

I suspect some of the texts listed might be "old" ones. zncli10.txt, for example, I eventually did find, but I had to use a web search instead of Gutenberg's search. It does indeed fail on it because the "produced by" regexp is inadequate.

I'll see what I can do for it, and the other ones, shortly.

Is the boilerplate on non-English books in English? If not, ... it's going to be tough going, and I may just cop out and disclaim that this tool is only suitable for English works...

@ikarth
Author

ikarth commented Nov 3, 2015

Yes, these are from the April 2010 DVD. I believe that Gutenberg has modernized some of the old texts since then, but they seem to frown on me downloading the entire site at once.

The boilerplate on non-English works appears to be universally in English, at least for the ones I looked at. If this holds true, I imagine that it would be substantially easier to detect, say, where a Chinese or Cyrillic text begins and ends.

@cpressey
Member

I realized I could run the script on every text file I've downloaded from Gutenberg, like so:

cd my_gutenberg_texts
mkdir tmp
guten-gutter --output-dir=tmp/ *

and it will report which ones it fails on. So I'll add them here (I've renamed them for my own convenience but retained the original filename at the end):

ProducedByStripper failed to clean 'A_Princess_of_Mars_pg62.txt'
ProducedByStripper failed to clean 'Around_the_World_in_80_Days_pg103.txt'
ProducedByStripper failed to clean 'The_Island_of_Doctor_Moreau_pg159.txt'
ProducedByStripper failed to clean 'The_Time_Machine_pg35.txt'
ProducedByStripper failed to clean 'War_and_Peace_pg2600.txt'
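A report like the one above can be turned into a list of file names for follow-up. This is a minimal sketch, assuming each failure is printed on its own line in exactly the format shown; whether the tool writes these lines to stdout or stderr is not stated here, so the sketch just takes the captured text. `failed_files` is a hypothetical helper, not part of guten-gutter:

```python
import re

# Matches lines such as: ProducedByStripper failed to clean 'The_Time_Machine_pg35.txt'
FAILURE = re.compile(r"^(?P<stripper>\w+) failed to clean '(?P<path>.+)'$")


def failed_files(report):
    """Map each stripper name to the list of files it failed on."""
    failures = {}
    for line in report.splitlines():
        match = FAILURE.match(line.strip())
        if match:
            failures.setdefault(match.group("stripper"), []).append(match.group("path"))
    return failures


if __name__ == "__main__":
    report = (
        "ProducedByStripper failed to clean 'A_Princess_of_Mars_pg62.txt'\n"
        "ProducedByStripper failed to clean 'The_Time_Machine_pg35.txt'\n"
    )
    for stripper, paths in failed_files(report).items():
        print(stripper, paths)
```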

@cpressey
Member

zen10.txt = http://www.gutenberg.org/cache/epub/34/pg34.txt = "This is a COPYRIGHTED Project Gutenberg eBook, Details Below" = WONTFIX... or at least NOTINCLINEDTOFIX... because I think the primary purpose of this tool is to extract the public domain contents of PG texts. I'll clarify this in the README.

@ikarth
Author

ikarth commented Nov 19, 2015

Note, for those wishing to sort out only the public domain texts: copyright information is included in the metadata in Project Gutenberg's RDF catalog: https://www.gutenberg.org/wiki/Gutenberg:Feeds
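For anyone who wants to try this, here is a rough sketch of the filtering step. It assumes the catalog records are RDF/XML with a `dcterms:rights` element whose text begins with "Public domain" for public domain works; the element name, the namespace, and the wording are assumptions to check against the actual catalog files. The sample record is hand-written for illustration and `is_public_domain` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Assumed element: <dcterms:rights> in the Dublin Core terms namespace.
DCTERMS_RIGHTS = "{http://purl.org/dc/terms/}rights"

# Hand-written sample record, not copied from the real catalog.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/0">
    <dcterms:rights>Public domain in the USA.</dcterms:rights>
  </pgterms:ebook>
</rdf:RDF>"""


def is_public_domain(rdf_xml):
    """Return True if any rights statement starts with 'public domain'."""
    root = ET.fromstring(rdf_xml)
    rights = [(el.text or "").strip().lower() for el in root.iter(DCTERMS_RIGHTS)]
    return any(text.startswith("public domain") for text in rights)


if __name__ == "__main__":
    print(is_public_domain(SAMPLE))
```

Records with no rights statement at all come back as False here, which errs on the side of skipping a text.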

@cpressey
Member

cpressey commented Dec 2, 2015

Have added some commits (c593b4d, 35c6f40) that handle zncli10.txt and pg159.txt (and maybe others like them, as a side effect).

For pg62.txt and the others listed in #1 (comment), the problem is that they don't have a "Produced by" line. In fact, the script cleans them fine; it just emits this warning when it tries to remove that line and can't find it.

Not 100% sure of the best way to handle this, because there is no general way to distinguish between a text having no "produced by" line and a text having a "produced by" line in a format that we don't recognize.
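One heuristic that might narrow this down, offered as a sketch and not as a general solution: warn only when a header line looks like a credit line but does not match the recognized format, and stay quiet when nothing looks like one. Both patterns below are placeholders; `RECOGNIZED` stands in for whatever the tool really accepts, and the keyword list in `HINT` is a guess:

```python
import re

# Placeholder for the format the tool already recognizes (an assumption).
RECOGNIZED = re.compile(r"^Produced by .+", re.IGNORECASE)
# Guessed keywords that might show up in an unrecognized credit line.
HINT = re.compile(r"\b(produced|prepared|transcribed|proofread)\b", re.IGNORECASE)


def classify_credit(header_lines):
    """Return 'recognized', 'suspect', or 'absent' for a block of header lines."""
    if any(RECOGNIZED.match(line.strip()) for line in header_lines):
        return "recognized"
    if any(HINT.search(line) for line in header_lines):
        return "suspect"  # worth a warning: possibly an unknown format
    return "absent"  # probably no credit line at all, so no warning


if __name__ == "__main__":
    print(classify_credit(["Title: Example", "Produced by Someone"]))
    print(classify_credit(["This etext was prepared by Someone"]))
    print(classify_credit(["Title: Example", "Author: Nobody"]))
```

The keyword check will produce both false positives and false negatives, so it only reduces noise; it does not settle the question.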
