Lecture 7 - Mining the Web
Musashi Harukawa, DPIR
7th Week Hilary 2020
The aim of this lecture/tutorial is twofold:
Writing a web scraper takes a lot of time and effort. Mastery of the tools will enable you to build these more efficiently, and focus on the logic instead of the mechanics.
We are all familiar with how to access information on the Internet.
We open a browser and type in the URL of the website we want to visit. To understand how to automate this process, we need to understand which portions of it can be automated.
Let’s look under the hood to see what actually happens.
When you type a website into your browser’s URL bar and hit enter:
Your browser sends an HTTP(S) GET request to the specified URL. Let's break this down:
- HTTP is an application-level data transfer protocol used on the Internet. HTTPS is a secured version of this protocol.
- Other such protocols include FTP (file transfer protocol), SSH (secure shell), etc.
- A GET request is an HTTP request for retrieving information from the target web server.
URL
A URL (uniform resource locator) is a reference to a web resource. They have the generic syntax:
URI = scheme:[//authority]path[?query][#fragment]
authority = [userinfo@]host[:port]

       userinfo       host      port
        ┌──┴───┐ ┌──────┴──────┐ ┌┴┐
https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
└─┬─┘   └─────────────┬────────────┘└───────┬───────┘ └────────────┬────────────┘ └┬┘
scheme            authority                 path                  query          fragment
URLs Explained

             host
        ┌──────┴───────┐
https://muhark.github.io/dpir-intro-python/Week7/lecture.html#/urls-explained
└─┬─┘   └──────┬───────┘└─────────────────┬─────────────────┘└──────┬───────┘
scheme     authority              path                        fragment
Some components (e.g. userinfo@ and :port) are optional, and not seen here. The host is essentially the name of the web server. (Heavy oversimplification on this slide.)
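As a quick sketch of how these components can be pulled apart in Python, the standard-library urllib.parse module splits a URL into the parts labelled above (the example URL is the one from the diagram):

```python
from urllib.parse import urlparse

# Decompose the example URL from the diagram above
parts = urlparse("https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top")

print(parts.scheme)    # 'https'
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 123
print(parts.path)      # '/forum/questions/'
print(parts.query)     # 'tag=networking&order=newest'
print(parts.fragment)  # 'top'
```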
Web servers are not actually located at the URL, but at an IP address, e.g. 216.58.210.206. The URL must therefore be converted to an IP address; this conversion is done by DNS (the Domain Name System).
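A minimal way to see this conversion yourself is Python's built-in socket module; the hostname here is just an example, and the address returned will vary:

```python
import socket

# Ask DNS for the IP address behind a hostname (the result varies by location and time)
print(socket.gethostbyname("www.google.com"))
```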
Your request now having been routed to the correct address, the web server returns the requested documents, usually html, css and javascript.
Other interactions are possible, such as retrieving pdf documents directly from webpages, or interacting with a php server. A web browser is actually a specialised piece of software that can render and display all kinds of document formats, and allows you to interact with them via a graphical interface.
Many websites you deal with will be a mixture of html, css and javascript.
- html is more appropriately described as a data structure than a language. It provides the "skeleton" of the webpage and the textual elements.
- css is also a data structure, but provides styling information that informs much of the aesthetics of the webpage.
- javascript is a programming language that runs programs on the client side (i.e. on your computer) and creates interactive elements on webpages.

Most browsers allow you to inspect the source of the webpage that you are viewing. I recommend that you use Chrome/Chromium/Firefox.
| Mac | Windows/Linux |
|---|---|
| Command+Option+I | F12 or Control+Shift+I |
The devtools in the browser allow you to inspect the html and css files that generate the webpages. Recognising how these are structured is key to web scraping.
I use the following three terms to refer to different things:
The difference will become clearer as I discuss them.
It’s useful to distinguish between trace data, and structured data embedded in websites.
Some websites provide an application programming interface (API), which is an interface specifically designed for automated querying/retrieval.
Many popular APIs have dedicated Python wrapper libraries (e.g. tweepy for Twitter). The following libraries are key for building web scrapers:
- requests: For making generic http(s) requests.
- beautifulsoup: "Cleans up" and provides a powerful interface for navigating html.
- re: Regular expressions library for powerful string matching.

Some glaring omissions from this list:
- scrapy: For building deployable web crawlers.
- selenium: Web testing library; can handle javascript and be programmed to behave much more like a human than a bot.

Here's a function I wrote/adapted from Web Scraping with Python, 2nd Ed.:
import requests
from bs4 import BeautifulSoup

def safe_get(url, parser='html.parser'):
    """Request a page; return a parsed BeautifulSoup object, or None on failure."""
    try:
        # Use a Session and browser-like headers so the request looks less bot-like
        session = requests.Session()
        headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                   "User-Agent": "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0"
                   }
        req = session.get(url, headers=headers)
    except requests.exceptions.RequestException:
        # Any failed request (connection error, timeout, etc.) returns None
        return None
    return BeautifulSoup(req.text, parser)
try/except

try:
    session = requests.Session()
    [...]
    req = session.get(url, headers=headers)
except requests.exceptions.RequestException:
    return None
try/except is a control flow structure.
The code from session = ... to ... headers=headers) is attempted; if requests.exceptions.RequestException is raised, then the code return None is executed instead.

requests

There are just three things the requests library is being used for:
- Creating a Session.
- Sending a GET command to the provided url with a customized header.
- Raising an exception (caught by the except clause) if the request fails.
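As a usage sketch, the function returns a parsed page on success and None on failure (the URL below is purely illustrative):

```python
# Hypothetical usage of safe_get() defined above
page = safe_get("https://muhark.github.io/dpir-intro-python/")

if page is None:
    print("Request failed")
else:
    print(page.title.text)  # contents of the page's <title> tag
```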
html

Hypertext Markup Language, or html, is the core of all webpages.
html tags:
- <head>...</head>: Defines the header block
- <h1>...</h1>: Section header level 1
- <p>...</p>: Paragraph
- <div>...</div>: Defines a section in a document
- Tags can carry attributes, e.g. <section id="title-slide">...</section>; the class attribute allows for style inheritance from css.

html example:

<section id="title-slide">
<h1 class="title">Introduction to Python for Social Science</h1>
<p class="subtitle">Lecture 7 - Mining the Web</p>
<p class="author">Musashi Harukawa, DPIR</p>
<p class="date">7th Week Hilary 2020</p>
</section>
BeautifulSoup

html does not actually need to be "correct" to function; most parsers can deal with issues such as missing tags, etc. BeautifulSoup parses and "cleans" html, then represents the document as Python objects with useful methods for navigating and searching the tree, e.g.:
- .text: Returns the text contained within the tag (stripping the html markup).
- .children: Iterates over the direct children of the element.
- .parent: Returns the immediate parent.
- .parents: Iterates up the tree through each parent.

Searching a BeautifulSoup object is done with find(), find_all(), etc. These searches can take tag names, attributes such as id, class_, and so on, as sketched below.
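A minimal sketch of these methods, applied to the html example above (assuming the bs4 package is installed; the html is supplied as a string here rather than fetched):

```python
from bs4 import BeautifulSoup

html = """
<section id="title-slide">
  <h1 class="title">Introduction to Python for Social Science</h1>
  <p class="subtitle">Lecture 7 - Mining the Web</p>
  <p class="author">Musashi Harukawa, DPIR</p>
  <p class="date">7th Week Hilary 2020</p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("h1", class_="title")  # search by tag name and class
print(title.text)                        # 'Introduction to Python for Social Science'
print(title.parent.get("id"))            # 'title-slide'

for p in soup.find_all("p"):             # find_all returns every matching tag
    print(p.get("class"), p.text)
```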
re

A regular expression (RE) is a sequence of characters that defines a search pattern. These patterns are used to systematically and flexibly search through strings.
Python provides its own implementation of regular expressions through the built-in library re.
Regular expressions are constructed of regular characters and metacharacters.

For instance, the regular expression a. contains:
- a: the regular character lowercase 'a'
- .: the metacharacter matching any character except a newline

Key metacharacters include:
- .: Any character other than a newline (the character denoting that the subsequent text should start on a new line)
- []: Defines a character set.
Character sets:
- [adw] defines the set of any of a, d or w.
- [a-z] defines the set of the 26 lowercase latin letters.
- [A-Za-z0-9] defines the set of all latin letters and the numbers 0-9.
- ^ at the beginning of the set negates the following; [^eng] is all characters other than e, n or g.
- \w matches all Unicode word characters, including numbers and underscore.
- \W matches all Unicode non-word characters, i.e. [^\w].
- \s and \S match respectively all whitespace and non-whitespace characters.

Quantifiers:
- *: Matches 0 or more consecutive instances of the previous RE. Matches as many as possible.
- +: Matches 1 or more consecutive instances of the previous RE. Matches as many as possible.
- ?: Matches 0 or 1 instances of the previous RE.
- {m}: Matches exactly m instances of the previous RE.
- {m,n}: Matches between m and n instances of the previous RE.

Examples:
- [a-z]* matches 0 or more instances of lowercase latin letters.
- [^0-9]? matches 0 or 1 instances of a non-number character.
- [abc]{4,}[0-9] will match 4 or more occurrences of a, b or c followed by a single number, e.g.:
aaaaabac1, abcabcabc0, abcb01

Anchors, escapes and groups:
- ^: Matches the null character at the beginning of a string.
- $: Matches the null character at the end of a string.
- \: Converts the subsequent metacharacter to a literal.
- (): Defines subgroups within the regular expression that can be extracted individually.
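A short sketch of these patterns with the built-in re library (the first three test strings are the matching examples above; the last is added as a non-matching contrast):

```python
import re

# Four or more of a/b/c, followed by a single digit
pattern = re.compile(r"[abc]{4,}[0-9]")

for s in ["aaaaabac1", "abcabcabc0", "abcb01", "ab1"]:
    print(s, bool(pattern.search(s)))  # True for the first three, False for 'ab1'
```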
Use case: say you have a webpage containing a large number of links, each leading to a csv file that you want to download (sketched below).
- Use requests to retrieve the html of the target webpage.
- Parse the html and create a searchable object with BeautifulSoup.
- Find the <div> in the webpage containing the download links.
- Extract all links matching the regex .*\.csv.
- Use requests to recursively download the objects and write them to disk.
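A hedged sketch of that workflow; the URL, the div id and the assumption of absolute links are all hypothetical, and a real page needs its own selectors (found via the inspector):

```python
import os
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical page listing csv downloads; replace with the real target
url = "https://www.example.com/data"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Hypothetical: the download links sit inside <div id="downloads">
downloads = soup.find("div", id="downloads")

# Keep only anchors whose href ends in .csv
for link in downloads.find_all("a", href=re.compile(r".*\.csv$")):
    href = link["href"]  # assumes absolute URLs; relative ones would need urljoin()
    r = requests.get(href)
    # Write each file to disk, named after the last part of the URL
    with open(os.path.basename(href), "wb") as f:
        f.write(r.content)
```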
A second use case: checking a webpage for links to Twitter (sketched below).
- Use requests and BeautifulSoup to retrieve and parse the page.
- Search the <body> for a link or menu matching the regex [Tt]witter.
- The link may be external, ^(https?://)?twitter.com/.*, or internal, ^/.*.
- If the link resolves to twitter.com, then append it to the output.
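A similar sketch for the Twitter case; again the target URL is hypothetical, and a real run would loop over a list of sites:

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical target site
soup = BeautifulSoup(requests.get("https://www.example.com").text, "html.parser")

twitter_links = []
for a in soup.body.find_all("a", href=True):
    # Keep anchors whose link or text mentions Twitter
    if re.search(r"[Tt]witter", a["href"]) or re.search(r"[Tt]witter", a.get_text()):
        twitter_links.append(a["href"])

# External links look like ^(https?://)?twitter.com/...; internal ones like ^/...
external = [l for l in twitter_links if re.match(r"^(https?://)?(www\.)?twitter\.com/", l)]
print(external)
```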
Some general tips:
- Identify the portion of the html-tree that contains all relevant matches. This can be found using the Inspector Tool on most browsers.
- css classes are often a very helpful tool for finding a series of like objects. Inspect the elements that you are looking for; maybe they are all contained in objects that possess the same class attribute.

Unlike the tools we have discussed so far in this course, the tools used for web scraping can easily have unintended and damaging consequences.

For the most part, it is unlikely that you will bring down a website; most web servers have countermeasures in place which will block IP addresses in response to a sudden high volume of requests. Nevertheless, make sure to:
- Use the sleep function from the time library to wait 5-15 seconds between requests.

I am not a lawyer, and this does not constitute legal advice.
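A minimal sketch of such a delay, using the random library to vary the wait (the URL list is hypothetical):

```python
import time
import random
import requests

urls = ["https://www.example.com/page1", "https://www.example.com/page2"]  # hypothetical list

for url in urls:
    response = requests.get(url)
    # ... process response.text here ...
    time.sleep(random.uniform(5, 15))  # pause 5-15 seconds before the next request
```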
Social Science Use Cases