Introduction to Python for Social Science

Lecture 1 - Introduction to Python

Musashi Harukawa, DPIR

1st Week Hilary 2021

Course Overview

Schedule

8-week course, delivered weekly on Teams.

  • 1600-1645: Lecture
  • 1645-1730: Coding Tutorial
  • 1730-1745: 15-minute intermission
  • 1745-1830: Workshop

Topics:

  1. Introduction to Python and the Development Environment
  2. Data Structures and pandas I
  3. Data Structures and pandas II
  4. Data Visualisation
  5. Machine Learning with scikit-learn I
  6. Machine Learning with scikit-learn II
  7. Web Scraping with BeautifulSoup and regex
  8. Web Scraping with Selenium

Lecture

  • Lectures will contain a mixture of theory, methodological motivation, and contextual information.
  • Largely social science methodology, with a bit of computer science.
  • The slides will be made available at the following:
    • https://github.com/muhark/dpir-intro-python
    • Canvas
  • Lectures will also be uploaded to Panopto (details to follow).

Coding Tutorial

  • In this section I explain the nitty-gritty of actually implementing these methods in code.
  • Open up the notebooks on your laptop and code along with me while we work through examples!

Workshop

  • In the workshop, you will work through a number of set programming problems and discussion questions.
  • I will be available during this period to answer questions and clarify points.

Office Hours

  • Time TBD, depending on time zones and demand. Currently thinking Tuesday morning.

Feedback

  • This course is very new!
  • Feedback can either be:
    • sent to me at musashi.harukawa@politics.ox.ac.uk
    • written into a questionnaire circulated in Week 3

Week 1: Introduction to Python and the Development Environment

Overview:

This week will cover the following points:

  1. What is Python?…
  2. … and what can I use it for?
  3. What are the tools I have to write, test and run Python code?
  4. Coding tutorial I: Base Data Types and Structures

What is Python?…

Python is an open-source, general-purpose scripting language.

Open-Source

  • Built by a community
  • Maintained by a community
  • Free to use for all

General-Purpose

  • If you’re doing it on a computer and there’s some repetitive element, then you can automate it in Python.
  • Python isn’t limited to Data Science, but it’s very popular with data scientists!

Scripting

  • No strict definition for what a “script” is.
  • Series of commands to automate some task.
  • Like a pipeline: takes some inputs, does some things to these inputs, and gives back some outputs.
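
As a rough illustration of the pipeline idea, here is a minimal sketch of a script that takes some inputs, does something to them, and gives back outputs. The function name clean_names and the sample data are made up for this example.

```python
# Hypothetical example: a tiny "script" in the pipeline sense.

def clean_names(raw_names):
    """Strip stray whitespace and standardise capitalisation."""
    return [name.strip().title() for name in raw_names]

# Input: messy, hand-entered names (made-up data)
raw = ["  alice ", "BOB", "carol  "]

# Output: the cleaned-up names
cleaned = clean_names(raw)
print(cleaned)  # ['Alice', 'Bob', 'Carol']
```

The same three lines of logic work whether the list holds three names or three million, which is where automation starts to pay off.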

Language

  • Python is a language, and not an application.
  • Practical difference for you:
    • most applications provide you options to select from.
    • languages require you to generate commands from accepted rules.
  • Upshot is that you can do nearly anything with Python!

… and what can I use Python for?

I want to…

  • Clean up my messy data!
  • Run analyses with (hundreds of) millions of data points!
    • They won’t fit into an Excel spreadsheet!
  • Automate downloading several decades of newspaper articles!
  • Create beautiful (interactive) visuals to accompany my analyses!
  • Uncover hidden structures linking parliamentary committees!

Again: any repetitive task done on a computer can be automated with Python.

Comparison: Python vs R

Task                             Python   R
General Purpose Programming      Great    Poor
Regression Analysis              OK       Great
Machine Learning                 Great    Great
Web Scraping                     Great    OK
Natural Language Processing [1]  Great    Great
Data Visualisation               Great    Great

Conclusion: … it depends, but ideally you want to learn both!

Tools of the Trade

Development

  • Writing Code
  • Executing Code

Jupyter

  • Interactive code editor.
  • Multiple options: console, notebook and lab

Google Colab

  • Google’s cloud-based Jupyter notebooks
  • Sufficient for this class

Managing Libraries

For managing Python packages, I recommend Anaconda.

  • Environment and software manager.
  • Recommended solution for Python installation and management.
  • Can be used from the command line (CLI) or a browser-like interface (anaconda-navigator).

Other Options

  • I use Atom or Vim to write code.
  • PyCharm is popular with developers
  • If you’ve spent a lot of time with RStudio, you may prefer Spyder.

This Week

Data Types and Structures

  • Today I introduce two fundamental but abstract aspects of coding:
    • Data Types
    • Data Structures

Why Automate?

  • Advantages of automation: cost, scale, and scope.
  • To harness computational methods, we need to represent our observations in a way that algorithms and programs can utilise.
  • The process of quantifying and structuring our observations usually entails the loss of some information.

Bridging the Gap between Qualitative and Quantitative Methods

  • Choosing a representation of your information that retains relevant properties is key.
  • To read more about this particular debate, a good starting point is Stevens (1946).

Data Types

Some (statistical) data types:

  • Logical
  • Numerical
  • Categorical
  • Text
  • Date and time

Representing Data on a Computer

  • Good news: Python, like most modern programming languages, has ways to represent each of the data types listed above.
  • Bad news: At a fundamental level, everything is stored as 0’s and 1’s.
  • Take away: Take the time to understand the relationship between:
    • your empirical observations,
    • the abstracted representation of them in your mathematical model,
    • the approximation of this in your computational model.
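
To make this concrete, the sketch below shows one possible way to represent each of the statistical data types listed earlier in Python. All values are made up, and storing a categorical variable as a plain string label is just one simple choice assumed here.

```python
from datetime import date

# All values below are illustrative, not real data.
is_member = True                   # logical       -> bool
age = 42                           # numerical     -> int (whole number)
income = 31250.50                  # numerical     -> float (decimal number)
party = "Green"                    # categorical   -> stored here as a str label
speech = "Thank you, Mr Speaker."  # text          -> str
some_day = date(2020, 1, 31)       # date and time -> datetime.date

for value in (is_member, age, income, party, speech, some_day):
    print(type(value).__name__, value)
```

Note that the categorical and text variables end up with the same Python type here, so the distinction between them lives in how you treat them, not in how the computer stores them.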

Data Structures

  • Data types are concerned with the representation of individual data points, or observations.
  • Data structures are concerned with the relations between observations.
    • Are the data points members of the same set?
    • Are the data points members of the same sequence?
    • Are the data points different features of a single empirical unit?
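
The three questions above map loosely onto three built-in Python structures. This is a sketch with made-up data; the set type is not covered in this week's tutorial and is included only to illustrate the "same set" case.

```python
# Members of the same set: order and duplicates do not matter.
committee_members = {"Ahmed", "Baker", "Chen"}

# Members of the same sequence: order matters.
yearly_turnout = [65.1, 66.2, 68.8, 67.3]

# Different features of a single empirical unit.
respondent = {"age": 34, "region": "Wales", "voted": True}

print("Baker" in committee_members)  # True
print(yearly_turnout[0])             # 65.1
print(respondent["region"])          # Wales
```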

Exercise

Menu

And now…

  1. Open Browser
  2. Go to colab.research.google.com/
  3. Open pre-existing notebook, or create new one.
  4. Start coding!

Coding Recap

Variable Assignment

Variables can be assigned with =.
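
A minimal sketch of assignment, using made-up variable names:

```python
# The name goes on the left of =, the value on the right.
x = 5
greeting = "hello"

# A variable can be reassigned; the right-hand side is evaluated first.
x = x + 1

print(x)         # 6
print(greeting)  # hello
```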

Four Basic Data Types

There are four basic data types in Python. These are:

  • String
  • Integer
  • Float
  • Boolean
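
If you are unsure which type a value has, the built-in type() function will tell you. A short sketch with illustrative values:

```python
examples = ["text", 3, 3.0, True]

# Collect the name of each value's type.
type_names = [type(value).__name__ for value in examples]

print(type_names)  # ['str', 'int', 'float', 'bool']
```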

String

  • A sequence of characters.
  • Behaves like a sequence; can be indexed with [index]
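
A quick sketch of string indexing, assuming an example string "Python":

```python
word = "Python"

first = word[0]     # indexing starts at 0
last = word[-1]     # negative indices count from the end
middle = word[1:4]  # slice: indices 1, 2 and 3
length = len(word)  # number of characters

print(first, last, middle, length)  # P n yth 6
```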

Integer

  • Whole numbers.
  • Can be positive or negative.
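
A small sketch of integers in use, with made-up numbers. One thing worth noticing is that dividing two integers with / gives back a float, not an integer.

```python
seats = 326
swing = -12           # integers can be negative

total = seats + swing
share = seats / 650   # division with / always returns a float

print(total, type(total).__name__)  # 314 int
print(type(share).__name__)         # float
```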

Float

  • Decimal numbers.
  • Behave unexpectedly. Remember: 0.1*3==0.3 returns False.
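
The sketch below reproduces the surprise and shows one common workaround: comparing floats with a tolerance via math.isclose from the standard library, rather than with ==.

```python
import math

product = 0.1 * 3

print(product)                     # 0.30000000000000004
print(product == 0.3)              # False
print(math.isclose(product, 0.3))  # True
```

The tiny error arises because 0.1 has no exact binary representation, so the stored value is only an approximation.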

Boolean

  • True/False
  • Behaves similarly to integers 0 and 1.
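
Because True and False behave like 1 and 0, you can count the True values in a list by summing it. A sketch with made-up data:

```python
print(True == 1)    # True
print(True + True)  # 2

# Hypothetical record of whether each respondent voted.
voted = [True, False, True, True]

yes_count = sum(voted)            # number of True values
yes_share = yes_count / len(voted)

print(yes_count, yes_share)  # 3 0.75
```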

Two basic data structures

We learned about two basic data structures:

  • Lists
  • Dictionaries

Lists

  • A list is an ordered sequence of values.
  • Created by writing a sequence of comma-separated values between square brackets:
    • e.g. [1, 2, 5, "some string"]
  • Lists are mutable; values can be changed in place without creating a new variable.
  • Lists can be indexed the same way as strings:
    • [n] to get the (n+1)th element, because indexing starts at 0.
    • [m:n] to get the (m+1)th to nth elements, i.e. indices m up to but not including n.
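
A short sketch of list indexing, slicing and mutability, reusing the example list from above:

```python
values = [1, 2, 5, "some string"]

first = values[0]      # the 1st element
middle = values[1:3]   # the 2nd to 3rd elements (indices 1 and 2)

# Lists are mutable: change them in place.
values[2] = 10
values.append("new")

print(first)   # 1
print(middle)  # [2, 5]
print(values)  # [1, 2, 10, 'some string', 'new']
```

Note that middle still holds [2, 5] after the change, because slicing creates a new list.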

Dictionaries

  • Mapping of keys to values.
    • Cannot be indexed by position. Since Python 3.7, iterating over a dictionary returns its items in insertion order; in older versions the order was not guaranteed.
  • Created by writing key:value pairs separated by commas between curly braces.
    • e.g. {"cat": "meow", "dog": "bork"}
  • some_dict[some_key] returns the corresponding value for some_key in some_dict
  • To see all of the keys, use the .keys() method of the dict, i.e. some_dict.keys()
  • To see all of the values, use the .values() method of the dict, i.e. some_dict.values()
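
A brief sketch of these dictionary operations, reusing the example from above. The extra "cow" entry is made up for illustration.

```python
sounds = {"cat": "meow", "dog": "bork"}

# Look up a value by its key.
cat_sound = sounds["cat"]

# Add a new key:value pair in place.
sounds["cow"] = "moo"

# .keys() and .values() return views; sorted() turns them into sorted lists.
all_keys = sorted(sounds.keys())
all_values = sorted(sounds.values())

print(cat_sound)   # meow
print(all_keys)    # ['cat', 'cow', 'dog']
print(all_values)  # ['bork', 'meow', 'moo']
```

Asking for a key that does not exist, such as sounds["fox"], raises a KeyError.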

Next Week: Data Structures and pandas I

We learn about:

  • Tabular data structures
  • pandas, a key library for working with data
  • Reading different data formats
  • Slicing and indexing data

Readings

Automate the Boring Stuff with Python

  • Chapter 1
  • Chapter 4 (up to “The in and not in operators”)
  • Chapter 5 (up to “Checking Whether a Key or Value Exists in a Dictionary”)

Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython, 2nd edition

Useful:

  • 2.2: IPython Basics
  • 3.1: Data Structures and Sequences

Interesting:

  • 1.2 Why Python for Data Analysis?

  1. Python and R both provide extensive and powerful natural language processing libraries, e.g. nltk and gensim in Python; tm and quanteda in R; and spaCy in both. Unfortunately, many techniques are implemented in only one of the two languages.
