Lecture 2 - Data Structures and Pandas I
Musashi Harukawa, DPIR
2nd Week Hilary 2021
This week we will continue talking about data:
pandas for data analysis
We can think of a data point as having two properties:
Three Ways of Structuring Data:
Here’s a challenge:
Extra challenge:
| Format | Structure | Built-in Types | Human-Readable | Compatibility |
|---|---|---|---|---|
| csv | Tab | No | Yes | Any |
| json | Hier/Ord | Yes | Yes | Any |
| xls(x) | Tab/Rel | Yes | No | Excel |
| dta | Tabular | Yes | No | STATA |
| HDF5 | Hier/Tab | Yes | No | C, Java, Python, R |
| Feather | Tab | Yes | No | Python, R |
csv (“comma-separated values”) is an extremely common tabular data storage format.
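The "No" under Built-in Types for csv can be seen in a round trip. The following is a minimal sketch with hypothetical toy data (the column names `respondent_id` and `interview_date` are made up for illustration): a datetime column is written to csv and comes back as plain text unless we tell pandas how to parse it.

```python
import io

import pandas as pd

# Hypothetical toy data: one integer column and one datetime column.
df = pd.DataFrame(
    {
        "respondent_id": [1, 2, 3],
        "interview_date": pd.to_datetime(["2021-01-18", "2021-01-19", "2021-01-20"]),
    }
)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read it back without any hints: the dates return as text, not datetimes.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read it back again, telling pandas which column holds dates.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["interview_date"])

print(df.dtypes)
print(naive.dtypes)
print(parsed.dtypes)
```

Numbers are inferred again on reading, but anything richer (dates, categories) has to be re-declared each time, which is what formats with built-in types avoid.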
json (JavaScript Object Notation) is also extremely common, especially when using web data.
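Because json is hierarchical, records from the web often contain nested fields. As a rough sketch, assuming a made-up payload (the keys `name`, `location` and `followers` are hypothetical), `pandas.json_normalize` is one way to flatten nested records into a table:

```python
import json

import pandas as pd

# Hypothetical JSON payload, e.g. as returned by a web API.
raw = """
[
  {"name": "Ada", "location": {"city": "Oxford", "country": "UK"}, "followers": 120},
  {"name": "Ben", "location": {"city": "Kyoto", "country": "Japan"}, "followers": 45}
]
"""

records = json.loads(raw)  # a list of nested dictionaries

# json_normalize flattens nested keys into columns such as "location.city".
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```

Each nested key becomes its own column, so the result has one row per record and can be analysed like any other data frame.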
json usually requires us to coerce hierarchical data to tabular data.
xls(x) and dta
xls(x) is the format used by Microsoft Excel.
If there is only one sheet, then this is tabular and essentially equivalent to csv.
If there are multiple sheets, pandas can read them as a dictionary of tabular data frames, one per sheet (for example with sheet_name=None in pandas.read_excel).
dta is the format used by STATA.
More common in social sciences, but essentially unheard of in professional contexts.
May be preferred for compatibility with STATA, otherwise would not recommend.
HDF5 and Feather
HDF5 is a commonly-used data storage format in data science, but not in academia.
Has a lot of nice properties, including efficient compression and fast reading.
Feather was created by Hadley Wickham and Wes McKinney as a fast, consistent and convenient data storage format for cross-usage between R and Python.
I highly recommend this if you work a lot with R and Python, or want to use both in the same project.
SQL
SQL-compliant databases are a common type. SQL is a database managing and querying language.
pandas
pandas is a very popular library for working with tabular data structures in Python. Before we start using it, let’s go over some of the ways it can be useful to you as a social science researcher.
pandas comes with functions for reading and writing to all kinds of data formats. A quick list can be viewed using tab completion:
In [1]: import pandas as pd
In [2]: pd.read_<TAB>
read_clipboard() read_hdf() read_sas()
read_csv() read_html() read_sql()
read_excel() read_json() read_sql_query()
read_feather() read_msgpack() read_sql_table()
read_fwf() read_parquet() read_stata()
read_gbq() read_pickle() read_table()
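As an illustration of two of these readers, here is a minimal sketch that loads the same hypothetical data once from csv text and once from an in-memory SQLite database. The table name `elections`, its columns and its values are invented for the example.

```python
import io
import sqlite3

import pandas as pd

# Source 1: csv text held in memory (hypothetical toy data).
csv_text = "country,turnout\nA,0.5\nB,0.75\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: a table in a temporary in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elections (country TEXT, turnout REAL)")
conn.executemany(
    "INSERT INTO elections VALUES (?, ?)",
    [("A", 0.5), ("B", 0.75)],
)
conn.commit()
from_sql = pd.read_sql_query("SELECT country, turnout FROM elections", conn)
conn.close()

# Both readers return a DataFrame, so the same methods apply to either one.
print(type(from_csv).__name__, type(from_sql).__name__)
print(from_csv["turnout"].mean(), from_sql["turnout"].mean())
```

Once the data is in a data frame, the code that follows does not need to know which format it came from.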
Today, we learn about the following in pandas:
DataFrame and Series
An advantage of pandas is that we can analyse tabular data regardless of the source format.
DataFrame and Series
pandas contains two native data containers:
pandas.DataFrame: A two-dimensional* labelled matrix
pandas.Series: A one-dimensional labelled array
*Can be higher-dimensional with the use of hierarchical indices
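A short sketch of the two containers, using hypothetical toy data (the region names and numbers are invented). It builds a Series and a DataFrame, accesses a value by label and by position, and uses a hierarchical index to add a further dimension:

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels (hypothetical data).
seats = pd.Series([10, 25, 7], index=["north", "south", "east"], name="seats")

# A DataFrame: two-dimensional, with labelled rows and labelled columns.
df = pd.DataFrame(
    {"seats": [10, 25, 7], "population": [1000, 3000, 500]},
    index=["north", "south", "east"],
)

# Selecting a single column of a DataFrame returns a Series.
column = df["seats"]
print(type(column).__name__)

# Label-based access with .loc, position-based access with .iloc.
print(df.loc["south", "seats"])
print(df.iloc[1, 0])

# A hierarchical (multi-level) row index lets a two-dimensional
# frame hold a further dimension, here region by year.
panel = pd.DataFrame(
    {"seats": [10, 12, 25, 24]},
    index=pd.MultiIndex.from_product(
        [["north", "south"], [2019, 2020]], names=["region", "year"]
    ),
)
print(panel.loc[("south", 2020), "seats"])
```

Here `.loc` and `.iloc` reach the same cell in two different ways, by name and by position.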
pandas data frames have explicit (named) row and column indices, as well as implicit (positional) indices, because all elements are ordered. We will learn methods for leveraging both.
The following sections of Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd edition are relevant to this lecture:
Useful:
Advanced:
Blog Posts: