Lecture 2 - Data Structures and Pandas I
Musashi Harukawa, DPIR
2nd Week Hilary 2021
This week we will continue talking about data:
pandas for data analysis
We can think of a data point as having two properties:
Three Ways of Structuring Data:
Here’s a challenge:
Extra challenge:
Format | Structure | Built-in Types | Human-Readable | Compatibility |
---|---|---|---|---|
csv | Tab | No | Yes | Any |
json | Hier/Ord | Yes | Yes | Any |
xls(x) | Tab/Rel | Yes | No | Excel |
dta | Tabular | Yes | No | STATA |
HDF5 | Hier/Tab | Yes | No | C, Java, Python, R |
Feather | Tab | Yes | No | Python, R |
csv (“comma-separated values”) is an extremely common tabular data storage format.
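The "Built-in Types: No" entry for csv in the table above can be illustrated with a small sketch. The data and column names below are made up for illustration. Because a csv file stores only text, pandas has to guess each column's type unless we state it:

```python
import io
import pandas as pd

# Hypothetical data: two respondents with an ID, an age and an interview date.
raw_csv = "id,age,interviewed\n001,34,2021-01-25\n002,51,2021-01-26\n"

# With default settings pandas guesses a type for every column.
guessed = pd.read_csv(io.StringIO(raw_csv))

# Stating the types explicitly keeps the leading zeros and parses the dates.
explicit = pd.read_csv(
    io.StringIO(raw_csv),
    dtype={"id": str},
    parse_dates=["interviewed"],
)

print(guessed.dtypes)
print(explicit.dtypes)
```

In the first frame the IDs are read as integers and lose their leading zeros. In the second they stay as text and the dates become proper datetimes.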
json (JavaScript Object Notation) is also extremely common, especially when using web data.
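Web data in json is typically nested, so it has to be flattened before it fits in a table. The following is a minimal sketch of one way to do this with pd.json_normalize. The records and field names are invented for illustration:

```python
import json
import pandas as pd

# Hypothetical web-style records: each user has a nested "location" object.
raw_json = """
[
  {"user": "a", "followers": 10, "location": {"city": "Oxford", "country": "UK"}},
  {"user": "b", "followers": 25, "location": {"city": "Kyoto", "country": "JP"}}
]
"""

records = json.loads(raw_json)  # a list of nested Python dicts

# json_normalize flattens nested keys into columns such as "location.city".
flat = pd.json_normalize(records)
print(flat)
```

Each nested key becomes its own column, so the result is an ordinary two-dimensional table with one row per record.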
json usually requires us to coerce hierarchical data to tabular data.
xls(x) and dta
xls(x) is the format used by Microsoft Excel.
If there is only one sheet, then this is tabular and essentially equivalent to csv.
If there are multiple sheets, pandas can read them into a dict of tabular data frames, one per sheet (by default it reads only the first sheet).
dta is the format used by STATA.
More common in social sciences, but essentially unheard of in professional contexts.
May be preferred for compatibility with STATA, otherwise would not recommend.
HDF5 and Feather
HDF5 is a commonly used data storage format in data science, but not in academia.
Has a lot of nice properties, including efficient compression and fast reading.
Feather was created by Hadley Wickham and Wes McKinney as a fast, consistent and convenient data storage format for cross-usage between R and Python.
I highly recommend this if you work a lot with R and Python, or want to use both in the same project.
SQL
SQL-compliant databases are a common type of relational database. SQL is a language for managing and querying databases.
pandas
pandas is a very popular library for working with tabular data structures in Python. Before we start using it, let’s go over some of the ways it can be useful to you as a social science researcher.
pandas comes with functions for reading from and writing to all kinds of data formats. A quick list can be viewed using tab completion:
In [1]: import pandas as pd
In [2]: pd.read_<TAB>
read_clipboard() read_hdf() read_sas()
read_csv() read_html() read_sql()
read_excel() read_json() read_sql_query()
read_feather() read_msgpack() read_sql_table()
read_fwf() read_parquet() read_stata()
read_gbq() read_pickle() read_table()
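As a rough sketch of how these readers are used, the example below loads the same small table once from csv text and once from a SQL query. The table name, column names and values are made up, and an in-memory SQLite database stands in for a real database server:

```python
import io
import sqlite3
import pandas as pd

# Hypothetical data: the same two rows stored as csv text and in a SQL table.
csv_text = "country,turnout\nUK,60.5\nJP,55.0\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# An in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elections (country TEXT, turnout REAL)")
conn.executemany(
    "INSERT INTO elections VALUES (?, ?)",
    [("UK", 60.5), ("JP", 55.0)],
)
conn.commit()
from_sql = pd.read_sql_query("SELECT country, turnout FROM elections", conn)
conn.close()

# Whatever the source, the result is a DataFrame with the same methods.
print(type(from_csv).__name__, type(from_sql).__name__)
print(from_csv["turnout"].mean(), from_sql["turnout"].mean())
```

Once the data is loaded, the rest of the analysis code does not need to know which format it came from.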
Today, we learn about the following in pandas:
DataFrame and Series
One advantage of pandas is that we can analyse tabular data regardless of the source format.
DataFrame and Series
pandas contains two native data containers:
pandas.DataFrame: A two-dimensional* labelled matrix
pandas.Series: A one-dimensional labelled array
*Can be higher-dimensional with the use of hierarchical indices
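The two containers, and the footnote on hierarchical indices, can be sketched as follows. All names and numbers are invented for illustration. The example also shows access by label (.loc) and by position (.iloc):

```python
import pandas as pd

# Hypothetical data: three made-up constituencies with row labels.
df = pd.DataFrame(
    {"party": ["A", "B", "A"], "votes": [1200, 950, 1430]},
    index=["north", "south", "east"],
)

# Selecting one column gives a Series that shares the row labels.
votes = df["votes"]
print(type(df).__name__, df.shape)
print(type(votes).__name__, votes.shape)

# Label-based access with .loc, position-based access with .iloc.
print(df.loc["south", "votes"])
print(df.iloc[1, 1])

# A hierarchical (multi-level) index packs an extra dimension into the rows.
panel = pd.DataFrame(
    {"votes": [1200, 1250, 950, 990]},
    index=pd.MultiIndex.from_product(
        [["north", "south"], [2017, 2019]], names=["area", "year"]
    ),
)
print(panel.loc[("south", 2019), "votes"])
```

The last frame is still two-dimensional on paper, but its rows are addressed by an (area, year) pair, which is how a data frame can hold panel-style data.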
pandas data frames have explicit (named) row and column indices, as well as implicit (positional) indices, because all elements are ordered. We will learn methods for leveraging both.
The following sections of Python for Data Analysis: Data Wrangling with Pandas, NumPy and IPython, 2nd edition are relevant to this lecture:
Useful:
Advanced:
Blog Posts: