# caazzi/book_scraping
In this second exercise, we will put into practice the scraping techniques covered in this morning's lecture. The goal is to automatically extract information from a website with Python.

The website we are scraping is books.toscrape.com, a website created exactly for our purpose: to learn how to scrape!

The goal is to automatically retrieve information about the books on sale, like their title, price, rating, etc. The trick is that the website is paginated. Can you see how? Do you foresee it being a difficulty?

## Setup

The goal is to scrape the website and then use pandas to visualize the extracted information. For this exercise, it still makes sense to work in a Notebook.

```bash
jupyter notebook
```

Go ahead and open a new Python Notebook in the `~/code/<user.github_nickname>/{{local_path_to("02-Data-Toolkit/02-Data-Sourcing/02-Scraping")}}` folder.

Start your notebook with the following imports in the first code cell:

```python
import requests
from bs4 import BeautifulSoup

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib
```

## First request

Insert a new cell and work on the TODOs (the starter code is the same as in the lecture's slides!):

```python
url = "http://books.toscrape.com/"

# TODO: Use `requests` to do an HTTP request to fetch data located at that URL
# TODO: Create a `BeautifulSoup` instance with that data
```
View solution

This code is quite generic and should be the same as in the lecture! If you already have a scraping project, a common shortcut is to open it and copy-paste those first lines.

```python
url = "http://books.toscrape.com/"

# This is where we do an HTTP request to get the HTML from the website
response = requests.get(url)

# And this is where we feed that HTML to the parser
soup = BeautifulSoup(response.content, "html.parser")
```
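Not part of the exercise, but a useful habit: check that the request actually succeeded before feeding anything to the parser. A minimal sketch:

```python
# Optional sanity check: 200 means the page was fetched successfully.
print(response.status_code)

# Or raise an exception on any 4xx/5xx response instead of parsing an error page:
response.raise_for_status()
```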

`soup` is now a variable holding the parsed document, on which we can run our queries. To build those queries, you need to analyze the "Books to Scrape" website's HTML with the browser inspector.

Can you spot which HTML element contains one book? Is it identical for each book?

View solution

The `<article />` element with the class `product_pod` is what we are looking for! All the books on the page have exactly the same structure, which is exactly what we need for parsing.

```html
<article class="product_pod">
  <!-- [...] -->
</article>
```

Now that we have identified the relevant HTML, we can use the `soup` Python variable to query the document. Let's use the searching by CSS class approach. Insert a new cell and try to select all books in the HTML. Store them in a `books_html` variable.

View solution

```python
books_html = soup.find_all("article", class_="product_pod")
len(books_html)
```
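As a side note, the same selection can be written with a CSS selector via `.select()`; this is an equivalent query, not a different result:

```python
# Equivalent query using a CSS selector instead of find_all + class_:
books_html = soup.select("article.product_pod")
len(books_html)  # should give the same count
```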

Now that we have a `books_html` variable containing all the HTML `<article />` elements, let's focus on one book (the first!) and try to extract all the information we need from that HTML fragment.

## Parsing one book

It's a good time for you to insert a Markdown cell and type in the following:

```markdown
## Parsing _one_ book
```

Of course you can write more text! The goal is to document your train of thought so that you eventually have a well-documented, well-structured notebook.

Let's have a look at the HTML fragment of the first book. Insert a code cell and type in:

```python
books_html[0]
```

Great! We now have a smaller piece of HTML to deal with. We can chain `.find()` calls on this HTML fragment to extract the 3 pieces of information we need from it.

Let's start with the book title. Try to retrieve this information from books_html[0] and store it in a book_title variable.

View solution

The title is located in an HTML link tag `<a />` inside the `<h3 />` tag. So we need to first `.find()` the `h3`, then the `a`:

```python
books_html[0].find("h3").find("a")
```

That's almost it. Now we need to select the title in the `<a />` tag's attributes:

```python
books_html[0].find("h3").find("a").attrs
```

The line above returns a dict. You can now select the right key!

```python
book_title = books_html[0].find("h3").find("a").attrs["title"]
book_title
```
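Side note: BeautifulSoup tags also support dictionary-style access to their attributes, so the same title can be fetched without going through `.attrs` explicitly:

```python
# Equivalent one-liner: a Tag can be indexed like a dict of its attributes.
book_title = books_html[0].find("h3").find("a")["title"]
```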

Awesome! Let's now try to retrieve the price of that book. Going back to the element inspection in the browser, we find that the price is located within a `<p class="price_color"></p>`. Try to put the price in a `book_price` variable, and be careful: we want a float!

View solution

As with the `<article />` we used to select the books, we are going to use the "searching by CSS class" approach, combined with the `.string` attribute:

```python
books_html[0].find("p", class_="price_color").string
```

The thing is that we want to extract a number (here, a Python float) rather than just text. We need to get rid of the first character (£) by slicing the string, then pass the sliced string to the `float()` function to convert it:

```python
book_price = float(books_html[0].find("p", class_="price_color").string[1:])
book_price
```
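Slicing off the first character works because every price on this site starts with a single £ sign. If you wanted something a bit more defensive (purely optional, a sketch rather than the exercise's solution), a regex can extract the numeric part regardless of the currency symbol:

```python
import re

raw_price = books_html[0].find("p", class_="price_color").string
# Keep only digits and the decimal point, then convert to float.
book_price = float(re.sub(r"[^\d.]", "", raw_price))
book_price
```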

Finally, we need to get the rating (how many yellow stars the book has). Back in the browser inspector, we can see that there is a `<p class="star-rating TEXT"></p>` where TEXT can take the values "One", "Two", "Three", "Four" or "Five". This one is a bit more difficult, then, but doable. Insert a cell and copy-paste the following code:

```python
book_stars_html = books_html[0].find("p", class_="star-rating")
book_stars_html
book_stars_html.attrs['class']
```

In Python, you can use the in keyword to check if an element is contained in a list. For instance:

```python
cities = ['paris', 'london', 'brussels']

if 'berlin' in cities:
    print("Berlin is available")
else:
    print("Sorry, Berlin is not available")
```

❓ Define a function `parse_rating` which takes a list of classes (from the `<p />`) and returns the rating from 1 to 5:

```python
def parse_rating(rating_classes):
    # TODO: Look at `rating_classes` and return the correct rating
    # e.g. of an argument for `rating_classes`: [ 'star-rating', 'Three' ]
    # "One" => 1
    # "Two" => 2
    # "Three" => 3
    # "Four" => 4
    # "Five" => 5
    return 0
```
View solution

```python
def parse_rating(rating_classes):
    if 'One' in rating_classes:
        return 1
    elif 'Two' in rating_classes:
        return 2
    elif 'Three' in rating_classes:
        return 3
    elif 'Four' in rating_classes:
        return 4
    elif 'Five' in rating_classes:
        return 5
    else:
        return 0
```
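The if/elif chain does the job; an equivalent, more compact variant (just an alternative sketch, not the official solution) uses a lookup dict:

```python
# Same behavior with a lookup table: return the first class that maps
# to a rating, or 0 if none of them does.
RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_rating(rating_classes):
    for css_class in rating_classes:
        if css_class in RATINGS:
            return RATINGS[css_class]
    return 0

parse_rating(['star-rating', 'Three'])  # => 3
```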

Once you have implemented this function, you can use it to read the book's rating! Insert a new code cell and copy-paste the following code:

```python
book_rating = parse_rating(books_html[0].find("p", class_="star-rating").attrs['class'])
```

## Parsing all books

Once again, it's a good time to insert a Markdown cell and type in the following:

```markdown
## Parsing _all_ books
```

We now need to glue together all the code above and put it inside a for loop over the `books_html` variable (the list that `soup.find_all` returned earlier)!

We are going to store the information collected about the books in a Python dict. This dictionary will have three keys, and the values will be lists to which we append whatever we find in the HTML:

- Title => ["A light in the attic", "Tipping the Velvet", ...]
- Price => [51.77, 53.74, ...]
- Rating => [3, 1, ...]

We store the information this way because we aim to hand it to pandas, and, conveniently enough, this format allows us to create a DataFrame from it very easily.

Insert a new cell and initialize this dictionary:

```python
books_dict = { 'Title': [], 'Price': [], 'Rating': [] }
```

❓ Implement a loop that will iterate over books_html to populate the books_dict dictionary by reusing all the code from above.

View solution

In a new cell, we write the for loop and copy-paste the code from above:

```python
for book in books_html:
    title = book.find("h3").find("a").attrs["title"]
    price = float(book.find("p", class_="price_color").text[1:])
    rating = parse_rating(book.find("p", class_="star-rating").attrs['class'])
    books_dict["Title"].append(title)
    books_dict["Price"].append(price)
    books_dict["Rating"].append(rating)
```

Have a look at the results with the following cells:

```python
books_dict
len(books_dict)          # You should have 3 key:value pairs
len(books_dict["Title"]) # Each value should contain 20 elements, one per book on the web page
```
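If you'd rather have a check that fails loudly than eyeball the numbers, here is a minimal sketch (assuming the first page does list 20 books):

```python
# Optional: fail loudly if the three lists ever get out of sync.
assert len(books_dict["Title"]) == len(books_dict["Price"]) == len(books_dict["Rating"]) == 20
```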

## Loading data in Pandas

New section! Don't forget about the Markdown cell to document your process as you go.

The `books_dict` looks good, so let's now load that data into pandas with the `pandas.DataFrame.from_dict` function:

```python
books_df = pd.DataFrame.from_dict(books_dict)
books_df
```
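Before plotting, it can be worth double-checking that pandas inferred the column types we worked for (this check is optional, not part of the exercise):

```python
# Title should be object (strings), Price float64, Rating int64.
books_df.dtypes
```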

Looks great! Let's generate a small plot to celebrate. The plot will show how many books there are per possible Rating:

```python
books_df.groupby("Rating").count()["Title"].plot(kind="bar")
```

## Test your code!

Add and run the following cell to test your code:

```python
from nbresult import ChallengeResult

result = ChallengeResult('books',
    columns=books_df.columns,
    title=str(books_df.loc[0,'Title']),
    price=books_df.loc[0,'Price'],
    rating=books_df.loc[0,'Rating']
)
result.write()
print(result.check())
```

Then you can commit and push your code 🚀

Quite a lot of books have a very poor rating (1). Is it only the first page? What about the other pages? Time to look at page 2 and beyond!

## Going through all the pages of the catalogue

New section! Don't forget about the Markdown cell.

On books.toscrape.com, scroll down to the bottom and click on the "Next" button. Do it again. Do you see the pattern of the URL for the different pages?

View solution

```python
page = 1
url = f"http://books.toscrape.com/catalogue/page-{page}.html"
url
```
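Note that this pattern also covers the first page, so one URL template works for the whole catalogue. A quick optional check:

```python
# page-1.html exists too, so no special case is needed for the home page.
requests.get("http://books.toscrape.com/catalogue/page-1.html").status_code  # expect 200
```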

We need another for loop: one which will iterate from page 1 to 50 and do the scraping. While we are testing, let's focus on scraping pages 1 to 3 only:

```python
MAX_PAGE = 3
for page in range(1, MAX_PAGE + 1):
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    print(url)
```

Seems like the loop is working! Let's replace the print with the actual scraping code, then nest the for loop that scrapes all the books of the current page inside it. All the code is already in your notebook; time to pick it up and glue everything together!

View solution

```python
all_books_dict = { 'Title': [], 'Price': [], 'Rating': [] }

MAX_PAGE = 50
for page in range(1, MAX_PAGE + 1):
    print(f"Parsing page {page}...")
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        title = book.find("h3").find("a").attrs["title"]
        price = float(book.find("p", class_="price_color").text[1:])
        rating = parse_rating(book.find("p", class_="star-rating").attrs["class"])
        all_books_dict["Title"].append(title)
        all_books_dict["Price"].append(price)
        all_books_dict["Rating"].append(rating)

print("Done!")
```
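If you wanted to make this loop a bit more robust (a hypothetical hardening, not required for this site), you could stop when a page is missing and pause between requests to stay polite:

```python
import time

for page in range(1, MAX_PAGE + 1):
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    response = requests.get(url)
    if response.status_code != 200:  # past the last page the site returns a 404
        print(f"Page {page} not found, stopping.")
        break
    # ... parse the books exactly as above ...
    time.sleep(0.5)  # small delay so we don't hammer the server
```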

All good? Check that you actually parsed MAX_PAGE * 20 books with:

```python
len(all_books_dict["Title"])
```

Time to load `all_books_dict` into a Pandas DataFrame:

```python
all_books_df = pd.DataFrame.from_dict(all_books_dict)
all_books_df.tail()
```

Let's see how expensive the books are:

```python
all_books_df["Price"].hist()
```

And how well rated they are:

```python
all_books_df.groupby("Rating").count()["Title"].plot(kind="bar")
```

## Saving the data for later

Right now, all the scraped data lives in the Notebook's memory and will be lost as soon as we Ctrl + C it. That would be a shame, so a good practice is to save the results of a successful scraping session to a file.

For that we will use one of the writers pandas provides. We can write a DataFrame to disk like this:

```python
all_books_df.to_csv("books.csv")
```

If you'd rather have a regular Excel file, it's possible! Install the writer engine first:

```bash
pip install XlsxWriter
```

```python
all_books_df.to_excel('books.xlsx', sheet_name='Books')
```

A good practice is to create a data pipeline where one process scrapes and dumps the data to CSV, and another one reads the data back from the CSV file and goes on to analyze it through a pandas DataFrame!
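As a minimal sketch of that pipeline idea, reusing the books.csv written above (note that to_csv also wrote the DataFrame index, hence index_col=0):

```python
# Step 2 of the pipeline: read the dumped CSV back, no scraping involved.
books_from_disk = pd.read_csv("books.csv", index_col=0)
books_from_disk.head()
```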

💡 Don't forget to push your code to GitHub
