Lecture 7 - Mining the Web
Musashi Harukawa, DPIR
7th Week Hilary 2020
The aim of this lecture/tutorial is twofold: to understand what actually happens when we access information on the web, and to master the tools for automating that process. Writing a web scraper takes a lot of time and effort; mastery of the tools will enable you to build these more efficiently, and to focus on the logic instead of the mechanics.
We are all familiar with how to access information on the Internet: we enter the URL of the website we want to visit into the browser. To understand how to automate this process, we need to understand which portions of this process can be automated.

Let's look under the hood to see what actually happens. When you type a website into your browser's URL bar and hit enter, your browser sends an HTTP(S) GET request to the specified URL.

Let's break this down:

- HTTP(S)
- GET
- URL
HTTP is an application-level data transfer protocol used on the Internet. HTTPS is a secured version of this protocol. Other such protocols include FTP (file transfer protocol), SSH (secure shell), etc.

A GET request is an HTTP request for retrieving information from the target web server.
URL

A URL (uniform resource locator) is a reference to a web resource. They have the generic syntax¹:

```
URI = scheme:[//authority]path[?query][#fragment]
authority = [userinfo@]host[:port]

        userinfo          host     port
        ┌──┴───┐ ┌──────┴──────┐ ┌┴┐
https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
└─┬─┘   └────────────┬─────────────┘└───────┬───────┘└────────────┬─────────────┘└─┬┘
scheme           authority                path                  query          fragment
```
URLs Explained

```
             host
        ┌──────┴───────┐
https://muhark.github.io/dpir-intro-python/Week7/lecture.html#/urls-explained
└─┬─┘   └──────┬───────┘└─────────────────┬─────────────────┘└──────┬───────┘
scheme     authority                    path                     fragment
```

- Some components (`userinfo@` and `:port`) are optional, and not seen here.
- The `host` is essentially the name of the web server.

(Heavy oversimplification on this slide)
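Python's standard library can perform this decomposition for you; a minimal sketch using `urllib.parse`:

```python
from urllib.parse import urlparse

parts = urlparse("https://muhark.github.io/dpir-intro-python/Week7/lecture.html#/urls-explained")
print(parts.scheme)    # https
print(parts.netloc)    # muhark.github.io (the authority/host)
print(parts.path)      # /dpir-intro-python/Week7/lecture.html
print(parts.fragment)  # /urls-explained
```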
Web servers are not actually identified by their URL, but usually by an IP address, e.g. 216.58.210.206. The URL must therefore be converted to an IP address. This conversion is done by DNS (the Domain Name System).
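You can watch this resolution happen from Python with the standard library's `socket` module. A minimal illustration (the returned address varies by time and location):

```python
import socket

# Ask DNS for the IPv4 address behind a hostname
print(socket.gethostbyname("www.google.com"))  # e.g. 216.58.210.206
```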
Your request now having been routed to the correct address, the web server sends back the requested resources, typically `html`, `css` and `javascript` files. Other exchanges are possible, such as retrieving `pdf` documents directly from webpages, or interacting with a `php` server. A web browser is actually a specialised piece of software that can render and display all kinds of document formats, and allow you to interact with them via a graphical interface².
Many websites you deal with will be a mixture of `html`, `css` and `javascript`.

- `html` is more appropriately described as a data structure than a language. It provides the "skeleton" of the webpage and the textual elements.
- `css` is also a data structure, but provides styling information that informs much of the aesthetics of the webpage.
- `javascript` is a programming language that runs programs on the client side (i.e. in your computer) and creates interactive elements on webpages.

Most browsers allow you to inspect the source of the webpage that you are viewing. I recommend that you use Chrome/Chromium/Firefox.
| Mac | Windows/Linux |
|---|---|
| Command+Option+I | F12 or Control+Shift+I |
The devtools in the browser allow you to inspect the `html` and `css` files that generate the webpages. Recognising how these are structured is key to web scraping.
I use three terms to refer to different things; the difference will become clearer as I discuss them. It's also useful to distinguish between trace data and structured data embedded in websites.
Some websites provide an application programming interface (API), which is an interface specifically designed for automated querying/retrieval. Some of these have dedicated Python libraries (e.g. `tweepy` for Twitter).

The following libraries are key for building web scrapers:

- `requests`: For making generic http(s) requests.
- `beautifulsoup`: "Cleans up" and provides a powerful interface for navigating `html`.
- `re`: Regular expressions library for powerful string matching.

Some glaring omissions from this list:

- `scrapy`: For building deployable web crawlers.
- `selenium`: Web testing library; can handle `javascript` and be programmed to behave much more like a human than a bot.

Here's a function I wrote/adapted from Web Scraping with Python, 2nd Ed.:
```python
import requests
from bs4 import BeautifulSoup

def safe_get(url, parser='html.parser'):
    try:
        session = requests.Session()
        # Identify ourselves with browser-like headers
        headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                   "User-Agent": "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0"}
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:  # base class for requests' errors
        return None
    return BeautifulSoup(req.text, parser)
```
Consider the `try`/`except` block:

```python
try:
    session = requests.Session()
    [...]
    req = session.get(url, headers=headers)
except requests.exceptions.RequestException:
    return None
```

`try`/`except` is a control flow structure:

- Python first attempts to execute the code in the `try` block, i.e. from `session = ...` to `... headers=headers)`.
- If `requests.exceptions.RequestException` is raised anywhere in that block, then the code `return None` is executed instead.

`requests`
There are just three things the `requests` library is being used for:

- Opening a `Session`.
- Sending a `GET` command to the provided url with a customized header.
- Providing the text of the response (`req.text`).
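A hypothetical usage of `safe_get` (the URL is illustrative):

```python
soup = safe_get("https://www.example.com")
if soup is None:
    print("Request failed")
else:
    print(soup.title.text)  # the text of the page's <title> tag
```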
`html`

Hypertext Markup Language, or `html`, is the core of all webpages.

Some key `html` tags:

- `<head>...</head>`: Defines the header block
- `<h1>...</h1>`: Section header level 1
- `<p>...</p>`: Paragraph
- `<div>...</div>`: Defines a section in a document

Tags can also carry attributes, as in `<section id="title-slide">...</section>`; the `class` attribute allows for style inheritance from `css`.

`html` example:

```html
<section id="title-slide">
  <h1 class="title">Introduction to Python for Social Science</h1>
  <p class="subtitle">Lecture 7 - Mining the Web</p>
  <p class="author">Musashi Harukawa, DPIR</p>
  <p class="date">7th Week Hilary 2020</p>
</section>
```
`BeautifulSoup`

`html` does not actually need to be "correct" to function; most parsers can deal with issues such as missing tags, etc. `BeautifulSoup` parses and "cleans" `html`, then represents the document as Python objects with useful methods for navigating and searching the tree, e.g.:

- `.text`: Accesses the text contents of the tag.
- `.children`: Iterates over the direct children of the element.
- `.parent`: Returns the immediate parent.
- `.parents`: Iterates up the tree through each parent.

`BeautifulSoup` also provides search methods such as `find()`, `find_all()`, etc. These searches can take tag names, attribute filters such as `id`, `class_`, and so on.
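A minimal sketch of these methods, reusing part of the `html` example above:

```python
from bs4 import BeautifulSoup

html = '''<section id="title-slide">
  <h1 class="title">Introduction to Python for Social Science</h1>
  <p class="subtitle">Lecture 7 - Mining the Web</p>
</section>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)                            # Introduction to Python for Social Science
print(soup.find('p', class_='subtitle').text)  # Lecture 7 - Mining the Web
print(soup.h1.parent['id'])                    # title-slide
```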
`re`
A regular expression (RE) is a sequence of characters that defines a search pattern. These patterns are used to systematically and flexibly search through strings.
Python provides its own implementation of regular expressions in the built-in library `re`.
Regular expressions are constructed of regular characters, which match themselves literally, and metacharacters, which have special meanings.

For instance, the regular expression `a.` contains:

- `a`: the regular character lowercase 'a'.
- `.`: the metacharacter matching any character except a newline.
Some key metacharacters:

- `.`: Any character other than `newline` (the character denoting that the subsequent character should be on a new line).
- `[]`: Defines a character set:
  - `[adw]` defines the set of any of `a`, `d` or `w`.
  - `[a-z]` defines the set of the 26 lowercase latin letters.
  - `[A-Za-z0-9]` defines the set of all latin letters and the numbers 0-9.
  - `^` at the beginning of the set negates it; `[^eng]` matches all characters other than `e`, `n` or `g`.
- `\w` matches all Unicode word characters, including numbers and underscore.
- `\W` matches all Unicode non-word characters, i.e. `[^\w]`.
- `\s`/`\S` match respectively all whitespace and non-whitespace characters.
Quantifiers:

- `*`: Matches 0 or more consecutive instances of the previous RE. Matches as many as possible.
- `+`: Matches 1 or more consecutive instances of the previous RE. Matches as many as possible.
- `?`: Matches 0 or 1 instances of the previous RE.
- `{m}`: Matches exactly `m` instances of the previous RE.
- `{m,n}`: Matches between `m` and `n` instances of the previous RE.

Examples:

- `[a-z]*` matches 0 or more instances of lowercase latin letters.
- `[^0-9]?` matches 0 or 1 instances of a non-number character.
- `[abc]{4,}[0-9]` will match 4 or more occurrences of `a`, `b` or `c` followed by a single number:
  - `aaaaabac1`: match
  - `abcabc`: no match (no number)
  - `abc0`: no match (only three preceding occurrences)
  - `abcb01`: match (`abcb` followed by `0`)
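A quick check of that last example with Python's `re` module:

```python
import re

pattern = re.compile(r'[abc]{4,}[0-9]')

for s in ['aaaaabac1', 'abcabc', 'abc0', 'abcb01']:
    # re.search looks for the pattern anywhere in the string
    print(s, bool(pattern.search(s)))
# aaaaabac1 True / abcabc False / abc0 False / abcb01 True
```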
Anchors and grouping:

- `^`: Matches the empty string at the beginning of a string.
- `$`: Matches the empty string at the end of a string.
- `\`: Converts the subsequent metacharacter to a literal.
- `()`: Defines subgroups within the regular expression that can be extracted individually.

Say you have a webpage containing a large number of links, each leading to a csv file that you want to download.
A possible workflow (sketched in code below):

1. Use `requests` to retrieve the `html` of the target webpage.
2. Parse the `html` and create a searchable object with `BeautifulSoup`.
3. Find the `<div>` in the webpage containing the download links.
4. Extract the links matching the regex `.*\.csv`.
5. Use `requests` to recursively download the objects and write them to disk.
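A minimal sketch of this workflow; the page URL, the `div` id, and the assumption that the links are absolute are all illustrative:

```python
import os
import re
import time

import requests
from bs4 import BeautifulSoup

# 1-2. Retrieve and parse the target page (hypothetical URL)
resp = requests.get("https://example.com/data")
soup = BeautifulSoup(resp.text, "html.parser")

# 3. Find the <div> containing the download links (the id is an assumption)
div = soup.find("div", id="downloads")

# 4. Extract the links matching the regex
links = div.find_all("a", href=re.compile(r".*\.csv"))

# 5. Download each file and write it to disk
for a in links:
    href = a["href"]
    r = requests.get(href)
    with open(os.path.basename(href), "wb") as f:
        f.write(r.content)
    time.sleep(5)  # pause between requests (see the note on ethics below)
```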
Another use case: finding Twitter accounts linked from a set of webpages.

1. Use `requests` and `BeautifulSoup` to retrieve and parse each webpage.
2. Search the `<body>` for a link or menu matching the regex `[Tt]witter`.
3. Links may be external (`^(https?://)?twitter.com/.*`) or internal (`^/.*`).
4. If the link resolves to `twitter.com`, then append it to the output.
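A sketch of the matching step, assuming `soup` is a parsed page; the helper name is illustrative:

```python
import re

twitter_re = re.compile(r'^(https?://)?(www\.)?twitter\.com/')

def find_twitter_links(soup):
    """Collect hrefs in the <body> that point to twitter.com or mention Twitter."""
    out = []
    for a in soup.body.find_all('a', href=True):
        if twitter_re.match(a['href']) or re.search(r'[Tt]witter', a.get_text()):
            out.append(a['href'])
    return out
```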
Some tips:

- Find the level of the `html`-tree that contains all relevant matches. This can be found using the Inspector Tool on most browsers.
- `css` classes are often a very helpful tool for finding a series of like objects. Inspect the elements that you are looking for; maybe they are all contained in objects that possess the same `class` attribute.

Unlike the tools we have discussed so far in this course, the tools used for web scraping can easily have unintended and damaging consequences.
For the most part, it is unlikely that you will bring down a website; most web servers have countermeasures in place which will block IP addresses in response to a sudden high volume of requests. Nevertheless, make sure to use the `sleep` function from the `time` library to wait 5-15 seconds between requests (a minimal illustration follows).

I am not a lawyer, and this does not constitute legal advice.
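A minimal illustration of such rate limiting; the `urls` list is illustrative:

```python
import random
import time

urls = ["https://www.example.com"]  # hypothetical list of target pages

for url in urls:
    soup = safe_get(url)  # defined earlier in this lecture
    # ... process the page ...
    time.sleep(random.uniform(5, 15))  # wait 5-15 seconds between requests
```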
Social Science Use Cases