# Coding Tutorial Week 7: Web Scraping

This week focuses on three skills that, used together, allow you to perform web scraping.

We work on them in the following order:

1. Regular Expressions `re`
2. Making HTTP requests
3. Parsing `html` with `beautifulsoup`

## Regular Expressions

Regular expressions are a tool for matching sequences of characters (i.e. strings).

The built-in library `re` contains most of the functionality we need for this lecture, but for broader Unicode support you may want to use the third-party `regex` package (which needs to be installed separately).

To begin, let's start with the sentence _How now brown cow?_

### How now brown cow?

In [None]:
import re

sentence = "How now brown cow?"

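To build a regular expression you can use a regular Python string, but it is usually preferable to write the pattern as a raw string (a string prefaced by `r`), so that backslashes such as `\b` are passed to the regex engine unchanged.

We then compile the raw string with the `re.compile` function.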

Let's first create a pattern that matches the word "cow".

In [None]:
pattern = re.compile(r"cow")
pattern
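
The reason for the `r` prefix is easiest to see with `\b`. The sketch below (variable names are only for illustration) compares the same pattern written as a regular string and as a raw string: in a regular string Python turns `\b` into a backspace character before `re` ever sees it.

```python
import re

# In a regular string "\b" is the backspace character; in a raw string it
# stays as backslash + b, which the regex engine reads as a word boundary.
plain = "\bcow\b"
raw = r"\bcow\b"

plain_len = len(plain)  # 5 characters: backspace, c, o, w, backspace
raw_len = len(raw)      # 7 characters: \, b, c, o, w, \, b

plain_match = re.search(plain, "How now brown cow?")
raw_match = re.search(raw, "How now brown cow?")

print(plain_match)         # None (the sentence contains no backspace characters)
print(raw_match.group(0))  # cow
```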

There are several functions that can be used with this pattern. The ones we are interested in are:

- `re.search`
    - Related: `re.findall`
- `re.split`
- `re.sub`

In [None]:
# Remember: the pattern is "cow"

match = re.search(pattern, sentence)
print(match[0])

print(re.findall(pattern, sentence))
print(re.split(pattern, sentence))
print(re.sub(pattern, "fox", sentence))
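
As a rough guide to what each of these calls returns, the sketch below repeats them using the compiled pattern's own methods (`pattern.search(...)` behaves like `re.search(pattern, ...)`). It also shows that `search` returns `None` when nothing matches, which is worth checking before indexing into the result.

```python
import re

sentence = "How now brown cow?"
pattern = re.compile(r"cow")

match = pattern.search(sentence)      # a Match object, or None if nothing matches
print(match[0], match.span())         # cow (14, 17)

print(pattern.findall(sentence))      # ['cow']
print(pattern.split(sentence))        # ['How now brown ', '?']
print(pattern.sub("fox", sentence))   # How now brown fox?

# No match: search returns None, so guard before using match[0]
missing = re.search(r"pig", sentence)
print(missing is None)                # True
```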

Using a literal search pattern does not really demonstrate the power of regular expressions.

The pattern `\w+ow\b` will match words that end with "ow".

- `\w` indicates the set of all word characters (letters, digits and the underscore).
    - `+` indicates 1 or more occurrences of the previous RE.
- `o`, `w` are the exact letters `o` and `w`.
- `\b` matches the null string at the start or end of a word.

In [None]:
print(sentence)
pattern = re.compile(r"\w+ow\b")

print(re.search(pattern, sentence)) # Only finds the first instance
print(re.findall(pattern, sentence)) # Returns all matches
print(re.sub(pattern, "fox", sentence)) # Substitutes all matches with "fox"
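
To see what `\w` and `\b` each contribute, here is a short sketch with illustrative inputs. In Python, `\w` matches digits and the underscore as well as letters, and dropping `\b` lets the pattern stop in the middle of "brown".

```python
import re

sentence = "How now brown cow?"

# \w covers letters, digits and the underscore
word_chars = re.findall(r"\w+", "snake_case v2 ok!")
print(word_chars)        # ['snake_case', 'v2', 'ok']

# With \b the match must end at a word boundary
with_boundary = re.findall(r"\w+ow\b", sentence)
print(with_boundary)     # ['How', 'now', 'cow']

# Without \b, "brow" (inside "brown") also matches
without_boundary = re.findall(r"\w+ow", sentence)
print(without_boundary)  # ['How', 'now', 'brow', 'cow']
```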

Maybe we only want to match "How" and "cow": `[Hc]ow\b`

- `[Hc]` matches the letters `H` or `c`

In [None]:
print(sentence)
pattern = re.compile(r"[Hc]ow\b")

print(re.search(pattern, sentence)) # Only finds the first instance
print(re.findall(pattern, sentence)) # Returns all matches
print(re.sub(pattern, "fox", sentence)) # Substitutes all matches with "fox"

Match all lowercase words: `\b[a-z]+\b`

- `\b` matches the null string at the beginning of the word.
- `[a-z]` defines the set of all lowercase latin letters.
    - `+` indicates one or more of the previous RE
- `\b` matches the null string at the end of the word.

In [None]:
print(sentence)
pattern = re.compile(r"\b[a-z]+\b")

print(re.search(pattern, sentence)) # Only finds the first instance
print(re.findall(pattern, sentence)) # Returns all matches
print(re.sub(pattern, "fox", sentence)) # Substitutes all matches with "fox"

### The Irish Rover

In [None]:
lyrics = """
On the Fourth of July, 1806
We set sail from the sweet Cove of Cork
We were sailing away with a cargo of bricks
For the Grand City Hall in New York
'Twas a wonderful craft, she was rigged fore and aft
And oh, how the wild wind drove her
She stood several blasts, she had twenty seven masts
And they called her The Irish Rover
We had one million bags of the best Sligo rags
We had two million barrels of stone
We had three million sides of old blind horses hides
We had four million barrels of bones
We had five million hogs and six million dogs
Seven million barrels of porter
We had eight million bails of old nanny goats' tails
In the hold of the Irish Rover
"""

Let's extract all of the words that begin with a capital letter and are 3 letters or longer.

In [None]:
r = re.compile() # Fill in the blank

re.findall(r, lyrics)
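
One possible way to fill in the blank, offered as a sketch rather than the intended answer: `[A-Z]` matches the capital letter, and `[a-z]{2,}` requires at least two more lowercase letters (`{2,}` means "two or more of the previous RE"). The two sample lines are copied from the lyrics so that the block runs on its own, and the name `capitalised` is only illustrative.

```python
import re

sample = """We set sail from the sweet Cove of Cork
For the Grand City Hall in New York"""

# A capital letter followed by two or more lowercase letters, as a whole word
capitalised = re.compile(r"\b[A-Z][a-z]{2,}\b")
found = capitalised.findall(sample)
print(found)
# ['Cove', 'Cork', 'For', 'Grand', 'City', 'Hall', 'New', 'York']
```

"We" is skipped because it has only two letters.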

Let's substitute all the "x million"s with an actual number.

In [None]:
numbers = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']
for i, number in enumerate(numbers, 1): # Start from 1
    lyrics = re.sub(
        re.compile(number+" million", re.IGNORECASE), # re.IGNORECASE to include Seven
        str(i)+",000,000",
        lyrics)
print(lyrics)
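
An alternative sketch makes a single pass over the text: `re.sub` also accepts a function as the replacement, which receives each match and returns the replacement string. The dictionary `word_to_digit` and the helper `to_digits` are names made up for this example, and the pattern uses a capture group `(...)` with `|` (alternation) to match any one of the number words.

```python
import re

word_to_digit = {"five": 5, "six": 6, "seven": 7}

# (five|six|seven) captures whichever number word matched
pattern = re.compile(
    r"\b(" + "|".join(word_to_digit) + r") million\b", re.IGNORECASE
)

def to_digits(match):
    # Lower-case the captured word so that "Seven" finds the "seven" key
    return f"{word_to_digit[match.group(1).lower()]},000,000"

sample = "We had five million hogs and six million dogs\nSeven million barrels of porter"
result = pattern.sub(to_digits, sample)
print(result)
# We had 5,000,000 hogs and 6,000,000 dogs
# 7,000,000 barrels of porter
```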

## Sub-Groups and Greedy Matching

- Use `()` to create capture groups within your regex.
- By default, the quantifiers `+`, `*` and `?` are greedy: they match as many characters as possible.
- Add an additional `?` after a quantifier (`+?`, `*?`, `??`) to make it non-greedy, so that it matches as few characters as possible.

In [None]:
s = '<html><head><title>Regex</title>'
# Let's get the title

r = re.compile(r"<title>([A-Za-z ]+)</title>") # [A-Za-z ] = letters and spaces only
re.findall(r, s)
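
The sketch below shows how a capture group relates to the whole match, using the same string. It also illustrates a common pitfall with character ranges: `[A-z]` spans every ASCII character from `A` to `z`, which includes some punctuation, whereas `[A-Za-z]` covers letters only. The name `title_pattern` is only illustrative.

```python
import re

s = '<html><head><title>Regex</title>'
title_pattern = re.compile(r"<title>([A-Za-z ]+)</title>")

match = title_pattern.search(s)
print(match.group(0))   # <title>Regex</title>  (the whole match)
print(match.group(1))   # Regex                 (the first capture group)

# With one group in the pattern, findall returns the group, not the whole match
print(title_pattern.findall(s))   # ['Regex']

# Pitfall: the range A-z also covers the ASCII characters between "Z" and "a"
wide = re.findall(r"[A-z]", "a_b^C")
narrow = re.findall(r"[A-Za-z]", "a_b^C")
print(wide)     # ['a', '_', 'b', '^', 'C']
print(narrow)   # ['a', 'b', 'C']
```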

In [None]:
# Greedy matching - how do we get just one tag?
r = re.compile(r'<.*>')

re.findall(r, s)

In [None]:
# Non-greedy solution
r = re.compile(r'<.*?>')

re.findall(r, s)
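
For reference, the sketch below puts the greedy and non-greedy results side by side. It also shows a third option that does not rely on laziness: a negated character class `[^>]`, which matches any character except `>`.

```python
import re

s = '<html><head><title>Regex</title>'

greedy = re.findall(r"<.*>", s)
print(greedy)    # ['<html><head><title>Regex</title>']

lazy = re.findall(r"<.*?>", s)
print(lazy)      # ['<html>', '<head>', '<title>', '</title>']

# Alternative: forbid ">" inside the tag instead of relying on laziness
negated = re.findall(r"<[^>]+>", s)
print(negated)   # ['<html>', '<head>', '<title>', '</title>']
```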

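Challenge: Let's extract all the "x million ___" phrases. (If you ran the substitution cell above, `lyrics` no longer contains the word "million", so re-run the cell that defines `lyrics` first.)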

In [None]:
r = re.compile() # Fill in the blank

re.findall(r, lyrics)
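
One possible approach to the challenge, as a sketch on two lines copied from the original lyrics: with two capture groups in the pattern, `findall` returns a list of tuples, one tuple per match. The name `million_pattern` is only illustrative.

```python
import re

sample = "We had five million hogs and six million dogs\nSeven million barrels of porter"

# Two capture groups: the number word and the word after "million"
million_pattern = re.compile(r"(\w+) million (\w+)")
pairs = million_pattern.findall(sample)
print(pairs)
# [('five', 'hogs'), ('six', 'dogs'), ('Seven', 'barrels')]
```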

## Requests

We use the `requests` library to make HTTP requests and retrieve webpages.

In [None]:
import requests

One way to use `requests` is to create a `Session` object and then call its `.get()` method. A `Session` reuses the underlying connection and keeps cookies across requests.

In [None]:
url = "http://books.toscrape.com"

session = requests.Session()
page = session.get(url)
print(page.text)

It's good practice to wrap the initial request in `try`/`except`.

`try`/`except` runs the code under `try` until an _exception_ (an error) occurs, then runs the matching `except` block instead of stopping the program.

If we try to use the url `books.toscrape.com`, with no scheme (such as `http://`), then we will get a `requests.exceptions.MissingSchema` error.

In [None]:
url = "books.toscrape.com"

session = requests.Session()
try:
    page = session.get(url)
except requests.exceptions.MissingSchema:
    # No scheme was given, so prepend one and try again
    url = "http://"+url
    print("Retrying with "+url)
    page = session.get(url)

print("Request returned Status Code "+str(page.status_code))
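
An alternative to catching the exception is to check the scheme before making the request. The sketch below uses `urlsplit` from the standard library; the helper name `ensure_scheme` and its `default` parameter are made up for this example, and it only handles the simple case of a bare host name.

```python
from urllib.parse import urlsplit

def ensure_scheme(url, default="http"):
    """Return url unchanged if it has an http(s) scheme, else prepend one."""
    if urlsplit(url).scheme in ("http", "https"):
        return url
    return f"{default}://{url}"

print(ensure_scheme("books.toscrape.com"))         # http://books.toscrape.com
print(ensure_scheme("http://books.toscrape.com"))  # http://books.toscrape.com
```

The result can then be passed to `session.get()` as before.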


## BeautifulSoup

BeautifulSoup contains tools for navigating `html` code.

Using a parsing library is not strictly necessary, but real-world `html` does not need to be "complete" (i.e. missing tags are permissible), so a library such as this is much easier to use than trying to parse raw `html` directly.

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# We can just print out the entire document
print(soup.prettify())

In [None]:
# Let's inspect the header

print(soup.head.prettify())

In [None]:
# Let's get all links in the page

links = soup.find_all('a', href=True)

for link in links:
    print(link.text, link['href'])

In [None]:
# That output is hard to read, so let's use some regex to tidy up the whitespace:
# collapse each run of whitespace to a single space, then strip the ends

for link in links:
    print(re.sub(r"\s+", " ", link.text).strip(), link['href'])
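
The difference between deleting whitespace and collapsing it is easy to miss, so here is a small sketch on a made-up label (`raw_label` is an illustrative string, not actual output from the page). Deleting every `\s` character also removes the spaces between words, while replacing each run `\s+` with a single space and stripping the ends keeps the words apart.

```python
import re

raw_label = "\n      A Light in the\n   Attic\n   "

# Removing every whitespace character also removes the spaces between words
removed = re.sub(r"\s", "", raw_label)
print(removed)     # ALightintheAttic

# Collapsing each run of whitespace to one space, then trimming the ends
collapsed = re.sub(r"\s+", " ", raw_label).strip()
print(collapsed)   # A Light in the Attic
```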

In [None]:
# It'd be easier to read in a table, so let's use pandas!

import pandas as pd

links_df = pd.DataFrame(
    data={
        # Collapse runs of whitespace to one space and trim the ends
        "Label": [re.sub(r"\s+", " ", link.text).strip() for link in links],
        "Link": [link['href'] for link in links]
    }
)
links_df

In [None]:
# Some of the links look a bit strange: their label is empty (for example, links that wrap an image rather than text).

links_df[links_df['Label'].apply(lambda x: len(x)==0)]

In [None]:
# Keep only the links whose text contains at least one non-whitespace character
text_links = [link for link in links if re.search(r"\S", link.text)]

links_df = pd.DataFrame(
    data={
        "Label": [re.sub(r"\s+", " ", link.text).strip() for link in text_links],
        "Link": [link['href'] for link in text_links]
    }
)
links_df.tail()

## Using the Inspector with Web Scraping

When it's just one webpage we want to scrape repeatedly, or a series of similarly-formatted webpages, we can use the Inspector to see if there are any patterns we may be able to use.

Let's try and get information on all of the books listed on the first page of the website.

Using the Inspector, we can see that the information for each book is contained in an `html` element of the form `<article class="product_pod">`.

We can use this to isolate all of those elements.

In [None]:
product_pods = soup.find_all('article', class_="product_pod")
len(product_pods)

In [None]:
# Let's look at just one of the elements

print(product_pods[0].prettify())

In [None]:
def get_product_info(product_pod):
    # Title can be accessed from img alt
    image_elem = product_pod.div.a.img
    title = image_elem['alt']
    # Rating can be accessed from class (css) on star-rating
    rating_elem = product_pod.find('p', class_=re.compile(r"star-rating .*"))
    rating = rating_elem['class'][1]+"/Five" # Second class attribute
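    # Price is the first run of digits and dots in the product_price text
    price_elem = product_pod.find('div', class_="product_price")
    price = re.search(re.compile(r"[0-9.]+"), price_elem.text)[0]
    # Link is the href of the first anchor in the pod
    link = product_pod.find('a', href=True)['href']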
    return title, rating, price, link
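
The function returns the rating and price as strings. If numeric values are needed later, they could be converted along the following lines. This is a sketch under assumptions: `price_text` and `rating_word` are hypothetical strings shaped like the extracted values, and `word_to_stars` is a made-up lookup table.

```python
import re

# Hypothetical strings shaped like the values get_product_info extracts
price_text = "\n£51.77\n\nIn stock\n"
rating_word = "Three"

# Digits, a dot, then exactly two digits: stricter than "[0-9.]+"
price = float(re.search(r"\d+\.\d{2}", price_text)[0])

# Map the rating word (taken from the CSS class) to a number
word_to_stars = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
stars = word_to_stars[rating_word]

print(price, stars)   # 51.77 3
```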

In [None]:
product_info = []
for pod in product_pods:
    product_info.append(get_product_info(pod))

product_info = pd.DataFrame(product_info, columns=["Title", "Rating", "Price", 'Link'])

In [None]:
product_info.loc[:, 'Link'] = product_info['Link'].apply(lambda x: url+"/"+x)
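
String concatenation works here because every link is relative to the site root. A more general option is `urljoin` from the standard library, which resolves a relative link against the URL of the page it was found on. In the sketch below the paths are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com"
relative = "catalogue/some-book_1/index.html"   # made-up path for illustration

joined = urljoin(base, relative)
print(joined)
# http://books.toscrape.com/catalogue/some-book_1/index.html

# A trailing slash on the base does not produce a double slash
joined_slash = urljoin(base + "/", relative)
print(joined_slash)
# http://books.toscrape.com/catalogue/some-book_1/index.html

# Relative to a page deeper in the site, ".." steps up one directory
page_url = "http://books.toscrape.com/catalogue/page-2.html"
stepped_up = urljoin(page_url, "../media/cover.jpg")
print(stepped_up)
# http://books.toscrape.com/media/cover.jpg
```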

In [None]:
product_info