Lecture 3 - Data Structures and Pandas II
Musashi Harukawa, DPIR
3rd Week Hilary 2021
This week we learn more advanced methods for working with data:
- apply and vectorization
- GroupBy: Split-apply-combine

There are several ways to think about applying a function to a vector \(X_i\) of \(i\) values: as a transformation (of which element-wise operations are a special case), cumulatively, as a summary, or by group.

In a transformation, the vector of all values in \(X_i\), \([x_1, x_2, ..., x_i]\), is in the domain of \(f\), and a vector \(Y_i\) of equal length \(i\) is returned.
Examples/Use Cases per Datatype:
Dtype | Example |
---|---|
Numeric | Normalization by Mean |
Date/Time | Conversion to Timedeltas |
Language | Translation |
\[ f(X_{i}) = Y_{i} \]
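As a minimal pandas sketch of a transformation (normalization by the mean, with illustrative values):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
s - s.mean()  # a vector of the same length is returned: -1.5, -0.5, 0.5, 1.5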
Element-wise operations are a special case of transformation. Individual elements of \(X_i\) are in the domain of \(f\), and \(f\) is applied to each element of \(X_i\) to return a vector of length \(i\) where the \(i\)th element is the value of \(f(x_i)\).
Examples/Use Cases per Datatype:
Dtype | Example |
---|---|
Numeric | Inversion (multiply by -1) |
Date/Time | Timezone Conversion |
Language | Stemming, capitalisation |
\[ \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_i) \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \end{bmatrix} \]
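A sketch of an element-wise operation (inversion, as in the table above; the values are illustrative):

import pandas as pd

s = pd.Series([1, -2, 3])
s.apply(lambda x: x * -1)  # f is applied to each element individually: -1, 2, -3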
Cumulative operations apply a function with an expanding scope iteratively over a vector. Thus the first value takes the first element as its argument, the second value takes the first two elements as arguments, the third value takes the first three elements as arguments, and so on.
Dtype | Example |
---|---|
Numeric | Cumulative Sum |
Language | Grammar Parsing¹ |
\[ f_{cum}(X_i) = \begin{bmatrix} f(x_1) \\ f(x_1, x_2) \\ \vdots \\ f(x_1, x_2, ..., x_i) \end{bmatrix} = \begin{bmatrix} f_{1}^{1}(x_i) \\ f_{1}^{2}(x_i) \\ \vdots \\ f_{1}^{i}(x_i) \end{bmatrix} \]
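A sketch of a cumulative operation (the cumulative sum; `cumsum` is covered again later in this lecture):

import pandas as pd

s = pd.Series([1, 2, 3, 4])
s.cumsum()  # each entry is f over an expanding prefix of the vector: 1, 3, 6, 10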
A summary reduces a vector \(X_i\) of length \(i\) to a single value \(\theta\). Thus vector \(X_i\) is within the domain of \(f\), and \(\theta\) is the value of \(f\) applied to \(X_i\).
Examples/Use Cases per Datatype:
Dtype | Example(s) |
---|---|
Numeric | Mean, sum |
Date/Time | Range, total seconds |
Language | Sentiment scoring |
\[ f \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_i \end{bmatrix} \rightarrow \theta \]
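A sketch of a summary (with illustrative values):

import pandas as pd

s = pd.Series([1, 2, 3, 4])
s.mean()  # the whole vector reduces to a single value, theta: 2.5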
Elements of \(X_{i, g}\) are groupable in that each element \(x_i\) is a member of some group \(g \in G\). In a sense, grouped operations sit somewhere between element-wise operations and summaries.
\[ f \odot_g \begin{bmatrix} x_{1, g=1} \\ x_{2, g=1} \\ x_{3, g=2} \\ \vdots \\ x_{i, g} \end{bmatrix} \rightarrow \begin{bmatrix} f(x_{1, g=1}, x_{2, g=1}) \\ f(x_{3, g=2}) \\ \vdots \\ f(X_{i, g=g}) \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_g \end{bmatrix} \]
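A sketch of a grouped summary, previewing the GroupBy syntax covered later (the column names `g` and `x` are illustrative):

import pandas as pd

df = pd.DataFrame({'g': [1, 1, 2], 'x': [10, 20, 30]})
df.groupby('g')['x'].mean()  # one value per group: g=1 -> 15.0, g=2 -> 30.0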
Here are two different ways that two tabular datasets can be combined:
We can concatenate two or more tabular datasets so long as they all share at least one dimension (“height” or “width”).
\[ X = \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} Y = \begin{bmatrix} w & x \\ y & z \end{bmatrix} \]
\[ XY = \begin{bmatrix} a & b \\ c & d \\ e & f \\ w & x \\ y & z \end{bmatrix} \]
The case of adding one dataset onto the “end” of another is typically referred to as “appending”.
Joining/merging is the combination of two datasets on one or more keys. In general, merging is done “horizontally”. There are four kinds of joins: inner, left, right, and (full) outer.
Let’s begin with the simple case of joining on a single key.
A key is a column, present in both \(X\) and \(Y\), whose values usually overlap to some degree. These common elements are used to determine how the merge is conducted.
Suppose we have two datasets. The first, \(X\), details the first names and roles of a number of employees working on some project. The second, \(Y\), details the names and employee ids of all full-time employees at that company.
\[ X = \begin{bmatrix} \textbf{Role} & \textbf{Name} \\ Lead & Bertha \\ Research & David \\ Assistant & Frankie \\ Consultant & Ryan \end{bmatrix} Y=\begin{bmatrix} \textbf{Name} & \textbf{id} \\ Adam & 001 \\ Bertha & 002 \\ Carla & 003 \\ David & 004 \\ Erica & 005 \\ Frankie & 006 \end{bmatrix} \]
An inner join retains only the rows whose keys lie in the intersection of the two key sets. In other words, it only retains rows from either dataframe that have a corresponding key in the other dataframe.
In the above example, the INNER JOIN of the two tables on NAME would result in the following table:
Role | Name | id |
---|---|---|
Lead | Bertha | 002 |
Research | David | 004 |
Assistant | Frankie | 006 |
A left join only retains rows that contain the keys of the left-hand dataset, inserting N/A where the right-hand key does not contain the corresponding left-hand one.
In the above example, the LEFT JOIN of the two tables on NAME would result in the following table. Note how Ryan’s id is N/A.
Role | Name | id |
---|---|---|
Lead | Bertha | 002 |
Research | David | 004 |
Assistant | Frankie | 006 |
Consultant | Ryan | N/A |
A right join retains rows that contain the keys of the right-hand dataset, inserting N/A where the left-hand key does not contain the corresponding right-hand one. In the above example, the RIGHT JOIN of the two tables on NAME would result in the following table. Note how Adam, Carla, and Erica have no roles.
Role | Name | id |
---|---|---|
N/A | Adam | 001 |
Lead | Bertha | 002 |
N/A | Carla | 003 |
Research | David | 004 |
N/A | Erica | 005 |
Assistant | Frankie | 006 |
A (full) outer join retains all rows of both datasets, inserting N/A wherever a key does not exist in the opposing dataset.
Role | Name | id |
---|---|---|
N/A | Adam | 001 |
Lead | Bertha | 002 |
N/A | Carla | 003 |
Research | David | 004 |
N/A | Erica | 005 |
Assistant | Frankie | 006 |
Consultant | Ryan | N/A |
The same data can be stored in either long or wide format. In long format, each row is one constituency-year observation:

Constituency | Year | Party |
---|---|---|
Oxford East | 2010 | Labour |
Oxford West | 2010 | Tory |
Oxford East | 2015 | Labour |
Oxford West | 2015 | Tory |
Oxford East | 2017 | Labour |
Oxford West | 2017 | LibDem |
Oxford East | 2019 | Labour |
Oxford West | 2019 | LibDem |
In wide format, each row is one constituency, and each year becomes a column:

Constituency | Party2010 | Party2015 | Party2017 | Party2019 |
---|---|---|---|---|
Oxford East | Labour | Labour | Labour | Labour |
Oxford West | Tory | Tory | LibDem | LibDem |
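As a sketch, `pd.pivot` (introduced at the end of this lecture) can produce the wide table from the long one; the dataframe name `elections` is illustrative:

import pandas as pd

elections = pd.DataFrame({
    'Constituency': ['Oxford East', 'Oxford West'] * 4,
    'Year': [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    'Party': ['Labour', 'Tory', 'Labour', 'Tory',
              'Labour', 'LibDem', 'Labour', 'LibDem'],
})

# long -> wide: one row per constituency, one column per year
wide = elections.pivot(index='Constituency', columns='Year', values='Party')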
Here’s a simple function that adds 1 to the input:
def add_one(x):
    """
    This function adds 1 to the input.
    """
    y = x + 1
    return y
- `def` followed by a space tells Python that you are defining a function.
- The name of the function follows `def`; in this case `add_one`.
- Arguments are declared inside the parentheses `()`.
- `:` says that the definition line is done. The following line must be indented by four spaces.
- The docstring is what is displayed when you use the `func?` command. Use `"""` to create a multiline string. This is convenient, but not necessary (you can use a simple `"` or `'`).

The following code will result in an error:
def f(x):
    y = 5
    return x + y

print(y)  # NameError: y only exists inside the function's local scope
The following code uses the local value of `y`, thus returning 10.
y = 0

def f(x):
    y = 5
    return x + y

print(f(5))
>> 10
The following code uses `y`, which is defined globally. Therefore it returns 5.
y = 0

def f(x):
    return x + y

print(f(5))
>> 5
Lambda Functions

Python has lambda functions. These are essentially a way to define a function in-line. Below, the function `f` is equivalent to the line `lambda x: x+1`.
def f(x):
    return x+1

lambda x: x+1
Remember that there are four relevant ways in which we might want to apply a function to a `Series` (or a `DataFrame`!).
The most general method for applying a function element-wise is the `apply` function.

- `apply` takes a function as its argument.
- Applied to a `Series`, the function is applied to each element of the vector.
- Applied to a `DataFrame`, the function is applied to each column (`axis=0`) or each row (`axis=1`) of the matrix.
- To apply a function element-wise over an entire `DataFrame`, use `applymap`.

pd.Series.apply with functions

The following example will add every element of a Series to itself. Note that the function is passed as an argument to `apply()` without `()`. This is because we are pointing to the function inside the `apply()`, not calling it.
def add_to_self(x):
    y = x + x
    return y

df['col1'].apply(add_to_self)
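For the `DataFrame` case, a minimal sketch (assuming `df` holds numeric columns; the operations are illustrative):

# element-wise over every cell of a DataFrame
df.applymap(lambda x: x * -1)

# DataFrame.apply receives whole columns (axis=0) or whole rows (axis=1)
df.apply(sum, axis=0)  # one result per column
df.apply(sum, axis=1)  # one result per row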
pd.Series.apply with lambda

I usually use `apply()` with lambda functions. There are two advantages to this approach: the function can be written in-line, and it can be given additional arguments. Because `apply()` takes a function name, and not a call, as its argument, you cannot otherwise pass arguments to the function. By using a lambda function, this is possible:

pd.Series.apply with lambda examples

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'col1':range(5)})
In [4]: z=5
In [5]: df['col1'].apply(lambda x: x+x-z)
Out[5]:
0 -5
1 -3
2 -1
3 1
4 3
In [6]: def foo(x, z):
...: y = x+x-z
...: return y
In [7]: df['col1'].apply(lambda x: foo(x, z))
Out[7]:
0 -5
1 -3
2 -1
3 1
4 3
There are built-in methods for the four standard cumulative operations. These are methods of both `pd.Series` and `pd.DataFrame`:

- cumsum
- cumprod
- cummin
- cummax

For more complex cumulative operations, there is `pd.Series.expanding`, which works similarly to `groupby` (covered in subsequent slides).
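A brief sketch of the built-ins and of `expanding` (the values are illustrative):

import pandas as pd

s = pd.Series([3, 1, 4, 1])

s.cumsum()           # 3, 4, 8, 9
s.cummax()           # 3, 3, 4, 4
s.expanding().mean() # mean over an expanding window: 3.0, 2.0, 2.67, 2.25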
apply versus for-loops

`pandas` and its underlying library, `numpy`, are optimised for vectorized operations. When using `pandas`, you should therefore use `apply` as much as possible when conducting transformations, rather than looping over elements; a comparison is sketched below.
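A minimal sketch of the two styles (the names `looped` and `applied` are illustrative; timings omitted):

import pandas as pd

s = pd.Series(range(1000))

# for-loop: builds the result one element at a time in Python
looped = pd.Series([x + 1 for x in s])

# apply: the idiomatic pandas equivalent
applied = s.apply(lambda x: x + 1)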
Pandas provides an extremely efficient and clean method for doing group summaries, but the syntax can be difficult to understand.
Imagine we have the following table, stored in a dataframe `df` (in the code below, the last two columns are named `Female` and `PM_approval`):
Name | Location | Age | Female? | Likes BoJo (1-10) |
---|---|---|---|---|
Andy | Scotland | 32 | 0 | 1 |
Barbara | Wales | 48 | 1 | 3 |
Chris | Scotland | 65 | 1 | 2 |
Dara | N. Ireland | 55 | 0 | 6 |
Elaine | Wales | 43 | 1 | 4 |
To conduct grouped summaries, we use the following syntax:
df.groupby('group_col')['value_col'].summary_func()
- `group_col` is the column we are grouping over.
- `value_col` is the column that contains the values we will be applying grouped summary functions to.
- `summary_func` is the function that is applied to each group.

We can pass lists in place of the scalar values above:
df.groupby(['group_col1', 'group_col2'])[['value_col1', 'value_col2']].summary_func()
- When a list is passed to the `groupby()` function, the resulting groups are hierarchical, with the order of columns in the list determining the rank of columns within the hierarchy.
- When a list is passed to the `[]` accessor following the `groupby()` call, the grouped summary operation is applied to each of the columns in the list. Order matters less here.

In [3]: df.groupby('Location')['Age'].mean()
Out[3]:
Location
N. Ireland 55.0
Scotland 48.5
Wales 45.5
Name: Age, dtype: float64
- `group_col`: Location
- `value_col`: Age
- `summary_func`: Mean

The result is a `Series`, because the `value_col` argument contained only a single column.

In [4]: df.groupby(['Location', 'Female'])['PM_approval'].mean()
Out[4]:
Location Female
N. Ireland 0 6.0
Scotland 0 1.0
1 2.0
Wales 1 3.5
Name: PM_approval, dtype: float64
In [5]: df.groupby(['Location', 'Female'])[['Age', 'PM_approval']].mean()
Out[5]:
Age PM_approval
Location Female
N. Ireland 0 55.0 6.0
Scotland 0 32.0 1.0
1 65.0 2.0
Wales 1 45.5 3.5
Pandas includes a multitude of functions for combining datasets, as well as an extensive guide with examples.
We cover just two functions:
- `pd.concat()`
- `pd.merge()`
To concatenate two or more dataframes, we use the following syntax:
pd.concat([df1, df2{, ..., dfn}], axis={0, 1})
- `{}` indicates that the argument is optional.
- With `axis=0`, dataframes are stacked “vertically”. Where they have columns in common, the columns will be “stacked”; otherwise `N/A` will be inserted in the cells. The columns can be sorted by passing `sort=True`.
- With `axis=1`, dataframes are stacked “horizontally”. Where the index aligns, rows will be concatenated side-by-side. If there are index values not common between the dataframes, then `N/A` will be inserted in the cells.
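A minimal sketch (the dataframes are illustrative):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

pd.concat([df1, df2], axis=0)  # stacked vertically; 'b' and 'c' gain N/A cells
pd.concat([df1, df2], axis=1)  # stacked horizontally on the shared index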
To merge two dataframes, we use the following syntax:

pd.merge(
    left_df,
    right_df,
    how={'left', 'right', 'outer', 'inner'},
    {on=common_key},
    {left_on=left_key},
    {right_on=right_key},
    {left_index={True, False}},
    {right_index={True, False}}
)
- `left_df` is the left-hand dataframe to be merged.
- `right_df` is the right-hand dataframe to be merged.
- `how` determines what kind of join is used. (See previous slides.)
- `on` can be used when the key is a column with the same name in both dataframes.
- `left_on` should be used in conjunction with `right_on` or `right_index`. Each can be a single column name or a list of column names to be used as merge keys.
- `left_index` and `right_index` take either `True` or `False`. If `True`, then the index of the corresponding dataframe is used as the merge key.
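A sketch reproducing the employee tables from the join slides above:

import pandas as pd

X = pd.DataFrame({'Role': ['Lead', 'Research', 'Assistant', 'Consultant'],
                  'Name': ['Bertha', 'David', 'Frankie', 'Ryan']})
Y = pd.DataFrame({'Name': ['Adam', 'Bertha', 'Carla', 'David', 'Erica', 'Frankie'],
                  'id': ['001', '002', '003', '004', '005', '006']})

pd.merge(X, Y, how='inner', on='Name')  # Bertha, David, Frankie only
pd.merge(X, Y, how='left', on='Name')   # all of X; Ryan's id is N/A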
Pandas provides two functions for changing between long and wide data format. For further details I direct you to the documentation:
- `pd.pivot()`: Converts long to wide.
- `pd.melt()`: Converts wide to long.

¹ One could construct an iterative parser that treats a language as a cumulative set of tokens; whether this parser would be the most effective is a different question.