Introduction to Python for Social Science

Lecture 1 - Introduction to Python

Musashi Harukawa, DPIR

1st Week Hilary 2021

Course Overview

Schedule

8-week long course, delivered weekly on Teams.

1600-1645: Lecture
1645-1730: Coding Tutorial
1730-1745: 15 minute intermission
1745-1830: Workshop

Topics:

Introduction to Python and the Development Environment
Data Structures and pandas I
Data Structures and pandas II
Data Visualisation
Machine Learning with scikit-learn I
Machine Learning with scikit-learn II
Web Scraping with BeautifulSoup and regex
Web Scraping with Selenium

I’ve chosen five topics for the eight weeks.
The first week will be focused on getting you to understand a bit better what Python is, and will teach some basics.
The second and third weeks focus on working with data: cleaning, reshaping, reading and writing, and, most importantly, understanding it.
The fourth week will introduce two libraries for making data-based visuals; these will be publication grade, and can be used for academic papers or in a professional setting.
The fifth and sixth weeks look at machine learning. The library we will use is not cutting-edge, but it has an amazing array of well-documented algorithms and is ideal for teaching ML basics.
The seventh week gives a flying introduction to automating the collection of data from web pages. Web scraping is something I get asked about a lot, and takes longer than a week to teach, but hopefully I can give you a strong starting point to develop your own scrapers.
The final week was originally on NLP, but given that there will be a separate module for this taught by Ash in TT, I decided to expand the web scraping module.

Lecture

Lectures will contain a mixture of theory, methodological motivation, and contextual information.
Largely social science methodology, with a bit of computer science.
The slides will be made available at the following:
- https://github.com/muhark/dpir-intro-python
- Canvas
Lectures will also be uploaded to Panopto (details to follow)

Coding Tutorial

In this section I explain the nitty-gritty of actually implementing these problems in code.
Open up the notebooks on your laptop and code along with me while we work through examples!

Workshop

In the workshop, you will work through a number of set programming problems and discussion questions.
I will be available during this period to answer questions and clarify points.

Office Hours

Time tbd, depending on time zones and demand. Currently thinking Tuesday morning.

Feedback

This course is very new!
Feedback can either be:
- sent to me at musashi.harukawa@politics.ox.ac.uk
- written into a questionnaire circulated in Week 3

Week 1: Introduction to Python and the Development Environment

Overview:

This week will cover the following points:

What is Python?…
… and what can I use it for?
What are the tools I have to write, test and run Python code?
Coding tutorial I: Base Data Types and Structures

What is Python?…

Python is an open-source, general-purpose scripting language.

Open-Source

Built by a community
Maintained by a community
Free to use for all

General-Purpose

If you’re doing it on a computer and there’s some repetitive element, then you can automate it in Python.
Python isn’t limited to Data Science, but it’s very popular with data scientists!

Scripting

No strict definition for what a “script” is.
Series of commands to automate some task.
Like a pipeline: takes some inputs, does some things to these inputs, and gives back some outputs.
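
As a minimal sketch of the pipeline idea: the short script below takes some inputs, does something to them, and gives back some outputs. The function name clean_names and the example data are invented purely for illustration.

```python
def clean_names(raw_names):
    """Strip surrounding whitespace and standardise capitalisation."""
    return [name.strip().title() for name in raw_names]


# Input: hypothetical messy data
raw = ["  alice ", "BOB", "carol  "]

# Output: the cleaned version of the same data
cleaned = clean_names(raw)
print(cleaned)  # ['Alice', 'Bob', 'Carol']
```

The same three steps (read inputs, transform, return outputs) apply whether the input is three names or three million.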

Language

Python is a language, and not an application.
Practical difference for you:
- most applications provide you options to select from.
- languages require you to compose commands from accepted rules.
Upshot is that you can do nearly anything with Python!

and what can I use Python for?

I want to…

Clean up my messy data!
Run analyses with (hundreds of) millions of data points!
- it won’t fit into an Excel spreadsheet!
Automate downloading several decades of newspaper articles!
Create beautiful (interactive) visuals to accompany my analyses!
Uncover hidden structures linking parliamentary committees!
Again: any repetitive task done on a computer can be automated with Python.

Comparison: Python vs `R`

Task	`python`	`R`
General Purpose Programming	Great	Poor
Regression Analysis	OK	Great
Machine Learning	Great	Great
Web Scraping	Great	OK
Natural Language Processing¹	Great	Great
Data Visualisation	Great	Great

Conclusion: … it depends, but ideally you want to learn both!

Tools of the Trade

Development

Writing Code
Executing Code

Jupyter

Interactive code editor.
Multiple options: console, notebook and lab

Google Colab

Google’s cloud-based Jupyter notebooks
Sufficient for this class

Managing Libraries

For managing Python packages, I recommend Anaconda.

Environment and software manager.
Recommended solution for Python installation and management.
Can be used from the command line (CLI) or through a browser-like graphical interface (anaconda-navigator).

Other Options

I use Atom or VIM to write code
PyCharm is popular with developers
If you’ve spent a lot of time with RStudio, you may prefer Spyder.

This Week

Data Types and Structures

Today I introduce two fundamental, but abstract aspects of coding:
- Data Types
- Data Structures

Why Automate?

Advantages of automation: cost, scale, and scope.
To harness computational methods, we need to represent our observations in a way that algorithms and programs can utilise.
The process of quantifying and structuring our observations usually entails the loss of some information.

Bridging the Gap between Qualitative and Quantitative Methods

Choosing a representation of your information that retains relevant properties is key.
To read more about this particular debate, a good starting point is Stevens (1946).

Data Types

Some (statistical) data types:

Logical
Numerical
Categorical
Text
Date and time

Representing Data on a Computer

Good news: Python, like most modern programming languages, has ways to represent each of the data types listed above.
Bad news: At a fundamental level, everything is stored as 0s and 1s.
Take away: Take the time to understand the relationship between:
- your empirical observations,
- the abstracted representation of them in your mathematical model,
- the approximation of this in your computational model.
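
One possible mapping from the statistical data types above to Python objects is sketched below. The variable names and values are hypothetical, and the choices are assumptions rather than the only option: a plain string is used here for the categorical value, although other representations exist.

```python
from datetime import date

voted = True                        # logical            -> bool
age = 42                            # numerical (whole)   -> int
income = 31250.50                   # numerical (decimal) -> float
party = "Green"                     # categorical         -> str label (one option)
comment = "Turnout was low."        # text                -> str
interview_date = date(2021, 1, 18)  # date                -> datetime.date

# Print the Python type chosen for each observation
for value in (voted, age, income, party, comment, interview_date):
    print(type(value).__name__, value)
```

Note that the categorical value and the free text end up as the same Python type. The distinction between them lives in your model, not in the computer's representation.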

Data Structures

Data types are concerned with the representation of individual data points, or observations.
Data structures are concerned with the relations between observations.
- Are the data points members of the same set?
- Are the data points members of the same sequence?
- Are the data points different features of a single empirical unit?
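
The three questions above can be loosely matched to three built-in Python structures. This is a rough sketch with made-up data, not a strict rule about which structure to use.

```python
# Members of the same set: no order, duplicates are dropped
countries = {"France", "Japan", "Kenya", "France"}

# Members of the same sequence: order matters
yearly_turnout = [61.4, 65.1, 66.2]

# Different features of a single empirical unit
respondent = {"id": 17, "age": 42, "voted": True}

print(len(countries))     # 3
print(yearly_turnout[0])  # 61.4
print(respondent["age"])  # 42
```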

Exercise

And now…

Open Browser
Go to colab.research.google.com/
Open pre-existing notebook, or create new one.
Start coding!

Coding Recap

Variable Assignment

Variables can be assigned with =.
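
A short illustration of assignment, using arbitrary names and values:

```python
x = 5       # bind the name x to the integer 5
y = x + 2   # the right-hand side is evaluated first, then assigned to y
x = "five"  # a name can be rebound, even to a value of a different type

print(x, y)  # five 7
```

Note that = assigns, whereas == tests for equality.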

Four Basic Data Types

We covered four basic data types in Python. These are:

String
Integer
Float
Boolean

String

A sequence of characters.
Behaves like a sequence; can be indexed with [index]
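
For example, with an arbitrary string:

```python
word = "Python"

print(word[0])    # P   (indexing starts at 0)
print(word[-1])   # n   (negative indices count from the end)
print(word[0:2])  # Py  (slice: start included, stop excluded)
print(len(word))  # 6
```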

Integer

Whole numbers.
Can be positive or negative.

Float

Decimal numbers.
Behave unexpectedly. Remember: 0.1*3==0.3 returns False.
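
The snippet below reproduces the surprise and shows one common workaround: comparing floats with a tolerance using math.isclose from the standard library.

```python
import math

total = 0.1 * 3

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
```

This happens because 0.1 cannot be represented exactly in binary, so a small rounding error is carried through the multiplication.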

Boolean

True/False
Behaves similarly to integers 0 and 1.
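
A quick demonstration, with a made-up list of responses, of why this is handy for counting:

```python
print(True == 1)              # True
print(True + True)            # 2
print(isinstance(True, int))  # True

# Summing booleans counts how many are True
votes = [True, False, True, True]
print(sum(votes))             # 3
```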

Two basic data structures

We learned about two basic data structures:

Lists
Dictionaries

Lists

Lists are an ordered sequence of values.
Created by writing a sequence of comma-separated values between square brackets:
- e.g. [1, 2, 5, "some string"]
Lists are mutable; values can be changed in place without creating a new variable.
Lists can be indexed the same way as strings:
- [n] returns the (n+1)th element, because indexing starts at 0.
- [m:n] returns the (m+1)th to the nth element; position m is included and position n is excluded.
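
The points above, applied to the example list:

```python
values = [1, 2, 5, "some string"]

print(values[0])    # 1       (first element)
print(values[1:3])  # [2, 5]  (second and third elements)

# Lists are mutable: these change the list in place
values[0] = 100
values.append("new")
print(values)       # [100, 2, 5, 'some string', 'new']
```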

Dictionaries

Mapping of keys to values.
- Cannot be indexed by position. Since Python 3.7, iterating over a dict returns items in insertion order; in older versions the order was not guaranteed.
Created by writing a list of key:value pairs separated by commas between curly braces.
- e.g. {"cat": "meow", "dog": "bork"}
some_dict[some_key] returns the corresponding value for some_key in some_dict
To see all of the keys, use the .keys() method of the dict, i.e. some_dict.keys()
To see all of the values, use the .values() method of the dict, i.e. some_dict.values()
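
The same operations on the example dictionary. The variable name sounds and the extra "cow" entry are added only for illustration.

```python
sounds = {"cat": "meow", "dog": "bork"}

print(sounds["cat"])          # meow
print(list(sounds.keys()))    # ['cat', 'dog']   (insertion order, Python 3.7+)
print(list(sounds.values()))  # ['meow', 'bork']

# Dictionaries are mutable: add a new key:value pair
sounds["cow"] = "moo"

# .get() returns a default instead of raising KeyError for a missing key
print(sounds.get("fish", "unknown"))  # unknown
```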

Next Week: Data Structures and `pandas` I

We learn about:

Tabular data structures
pandas, a key library for working with data
Reading different data formats
Slicing and indexing data

Readings

Automate the Boring Stuff with Python

Chapter 1
Chapter 4 (up to “The in and not in operators”)
Chapter 5 (up to “Checking Whether a Key or Value Exists in a Dictionary”)

Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython, 2nd edition

Useful:

2.2: IPython Basics
3.1: Data Structures and Sequences

Interesting:

1.2 Why Python for Data Analysis?

¹ Python and R both provide extensive and powerful natural language processing libraries, e.g. nltk and gensim in Python; tm and quanteda in R; and spaCy in both. Unfortunately, there are many techniques that are only implemented in one language but not the other.
