The goal of this project was simple: create a single Python script that scrapes quotes from goodreads.com.
Python was chosen simply due to the availability of libraries. Initially, I tried using the BeautifulSoup library,
but found it too inflexible.
Instead, upon the recommendation of a friend I used Scrapy and found it significantly faster and easier to use.
Python 3 and Scrapy are the only installations this project needs. There are only four things you need to edit
for a successful scrape with a different author:
- Change the link from the Tolkien one to that of whichever author's quotes you want.
- Set the page to start at (leave as 1 if desired).
- Set the page to end at (inclusive; make sure you do not ask for more pages than exist!).
- Change the author's name (used in the output format).
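
As a rough illustration of how those four settings fit together, here is a minimal sketch in plain Python. It is not the spider's actual code: the variable names, the placeholder URL, the `?page=` query parameter, and the file name pattern are all assumptions made for the example.

```python
# Hypothetical settings; names and values are illustrative, not the spider's own.
AUTHOR_URL = "https://www.goodreads.com/author/quotes/AUTHOR_ID.Author_Name"  # placeholder link
START_PAGE = 1         # first page to scrape
END_PAGE = 3           # last page to scrape (inclusive)
AUTHOR_NAME = "Tolkien"  # used only when naming output files


def page_urls(base_url, start, end):
    """Return one URL per page from start to end, inclusive.

    Assumes pages are selected with a "?page=N" query parameter.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return [f"{base_url}?page={n}" for n in range(start, end + 1)]


def output_filename(author, page):
    """Build a file name containing the author and page number (assumed pattern)."""
    return f"{author}-quotes-page-{page}.json"


if __name__ == "__main__":
    for number, url in enumerate(page_urls(AUTHOR_URL, START_PAGE, END_PAGE), START_PAGE):
        print(url, "->", output_filename(AUTHOR_NAME, number))
```

Because the end page is inclusive, the range runs to `end + 1`. Nothing here checks that the requested pages actually exist, so that part is still up to you.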
From here, cd into your directory and enter the command: scrapy crawl GoodReadsSpider
After that, Scrapy will take care of the rest and you should have a directory full of JSON files! The only element in these files
is an array with all of the scraped quotes from the respective page numbers (which will be included in the file names).
You are free, of course, to use the quotes however you like, but I personally recommend this project's partner, my QuoteGenerator.
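
If you would rather work with the output directly, the following is a minimal sketch of how the files might be read back into one list. It relies only on the fact that each file holds a single JSON array; the function name `load_quotes` and the directory name in the usage line are assumptions for the example.

```python
import json
from pathlib import Path


def load_quotes(directory):
    """Collect the quotes from every .json file in a directory into one list.

    Each file is expected to contain a single top-level JSON array.
    Files are read in name order, so "page-10" sorts before "page-2".
    """
    quotes = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        if isinstance(data, list):  # skip anything that is not an array
            quotes.extend(data)
    return quotes


if __name__ == "__main__":
    all_quotes = load_quotes("output")  # hypothetical directory name
    print(f"Loaded {len(all_quotes)} quotes")
```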
Please note, this was done as an exercise in Scrapy and little more! I have not included any filters for quotes, and
GoodReads sometimes includes non-English quotes. They also do not specify who said what, so I could not add that feature.
As a final note, I ask that you please not abuse this project to create a bunch of traffic for GoodReads!
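
One way to keep the traffic light is to throttle the crawl in the Scrapy project's settings.py. The setting names below are standard Scrapy settings, but the values are only suggestions and are not taken from this project's configuration.

```python
# settings.py (sketch): suggested values, not this project's actual configuration.

ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
DOWNLOAD_DELAY = 2                   # wait roughly 2 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # fetch one page at a time from the site
AUTOTHROTTLE_ENABLED = True          # let Scrapy slow down if the server is slow
```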