Lecture 4 - Data Visualization
Musashi Harukawa, DPIR
4th Week Hilary 2021
statistic | x1 | x2 |
---|---|---|
count | 300.000000 | 300.000000 |
mean | 4.048335 | 4.066768 |
std | 4.145384 | 3.908675 |
min | -2.304990 | -6.892820 |
25% | 0.102002 | 1.528010 |
50% | 3.389367 | 4.125494 |
75% | 8.039512 | 6.871554 |
max | 11.230459 | 16.889281 |
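The table above has the layout that pandas produces with DataFrame.describe(). As a minimal sketch, assuming the data sit in a DataFrame with columns x1 and x2 (the synthetic values below are stand-ins, not the lecture data), a summary of the same shape could be generated like this:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the lecture's actual dataset is not shown here.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "x1": rng.normal(loc=4, scale=4, size=300),
    "x2": rng.normal(loc=4, scale=4, size=300),
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = data.describe()
print(summary)
```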
The type and structure of your data tell you what type of figure you need:
Most figures are created on a two-dimensional plane, where the dimensions are usually referred to as X (width) and Y (height).
These axes are the most versatile; they can be used to plot any kind of variable. The only trade-off is that the overall size of the figure is determined by these two dimensions.
Visuals for one-dimensional data tend to be concerned with distributions; i.e. frequencies of values along some dimension.
Useful plots include:
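A histogram is a common choice for showing a distribution. The sketch below uses synthetic values (an assumption, since no dataset is attached to this slide) and matplotlib's hist method, which counts how many values fall into each bin:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=4, scale=4, size=300)  # synthetic one-dimensional data

f, ax = plt.subplots(1, 1, figsize=(8, 4))
# hist returns the bin counts, the bin edges, and the drawn bar objects.
counts, edges, bars = ax.hist(values, bins=20, color="r", alpha=0.7)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
```

The number of bins is a judgment call: too few hides structure, too many produces noise.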
Visuals for two-dimensional data often fulfil one of the following two purposes:
In addition to all of the aforementioned plots, some examples of the latter include:
Colors can show variation along a multitude of data types.
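As an illustration, a third variable can be mapped onto the color of each point in a scatter plot. This is only a sketch: the variables x, y and z are synthetic, and the "viridis" colormap is one arbitrary choice among many.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
z = x + y  # a third, continuous variable to show through color

f, ax = plt.subplots(1, 1, figsize=(8, 4))
# c= takes one value per point; cmap= chooses how values map to colors.
points = ax.scatter(x, y, c=z, cmap="viridis", s=15)
# The colorbar acts as the legend for the color dimension.
f.colorbar(points, ax=ax, label="z")
```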
Panelling is the use of multiple sub-plots within a single figure.
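A minimal sketch of panelling, assuming three hypothetical groups of synthetic values: each group gets its own sub-plot, and the axes are shared so that the panels can be compared directly.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical groups, each with its own synthetic sample.
groups = {
    "A": rng.normal(0, 1, 100),
    "B": rng.normal(2, 1, 100),
    "C": rng.normal(4, 1, 100),
}

# One panel per group; sharex/sharey keep the panels directly comparable.
f, ax = plt.subplots(1, len(groups), figsize=(12, 4), sharex=True, sharey=True)
for panel, (name, values) in zip(ax, groups.items()):
    panel.hist(values, bins=15, color="r")
    panel.set_title(f"Group {name}")
```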
Colors and panelling are not the only means of showing additional dimensions.
When visualising data, ask yourself the following questions, then look through galleries to get an idea of what could work for you.
Are you:
matplotlib is the primary library for building data-based visuals in Python.
seaborn is a more recent library, built on top of matplotlib.
On the back end, all matplotlib-based visuals adhere to a similar tree-like structure. By learning this structure, you can locate and customise any element of a figure.
Here is a truncated version of the matplotlib hierarchy:
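The levels of this hierarchy can also be inspected directly from code. The following sketch walks from a figure down to its subplots, their axes, and the ticks and labels on one axis; it only prints class names and is meant as orientation, not as a complete map of the structure.

```python
import matplotlib.pyplot as plt

f, ax = plt.subplots(1, 2, figsize=(8, 4))

# Figure -> subplots (Axes objects)
print(type(f).__name__, "contains", len(f.axes), "subplots")

# Subplot -> axes (XAxis and YAxis objects)
first = f.axes[0]
print(type(first.xaxis).__name__, type(first.yaxis).__name__)

# Axis -> ticks and label
print(type(first.xaxis.get_major_ticks()[0]).__name__)
print(type(first.xaxis.get_label()).__name__)
```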
The figure is essentially the “canvas” upon which all visuals are made. Some parameters/methods set at this level include:
Subplots are the frames within which individual visuals are contained.
Most drawing methods are called at the subplot level:
matplotlib and seaborn provide an enormous number of plotting functions. The two libraries are called differently:
If you use a matplotlib function, you should call it as a method of the relevant subplot.
If you use a seaborn function, and there is more than one subplot, then you should pass the relevant subplot as a parameter to the function.
Graphical objects take a large number of customisable parameters, such as:
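The difference between the two calling conventions can be sketched side by side. The data here are synthetic, and scatterplot is just one seaborn function chosen for illustration; other seaborn functions that draw onto a single subplot take the same ax parameter.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.5, size=100)

f, ax = plt.subplots(1, 2, figsize=(10, 4))

# matplotlib: the plotting function is a method of the target subplot.
ax[0].scatter(x, y, color="r", s=10)
ax[0].set_title("matplotlib: ax[0].scatter(...)")

# seaborn: the target subplot is passed through the ax= parameter.
sns.scatterplot(x=x, y=y, ax=ax[1], color="r", s=10)
ax[1].set_title("seaborn: sns.scatterplot(..., ax=ax[1])")
```

If ax is omitted, seaborn draws onto whichever subplot is currently active, which is easy to get wrong once a figure has several.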
Subplots have xaxis and yaxis attributes. Call the methods of these objects to customise the following aspects:
import matplotlib.pyplot as plt
f = plt.figure(figsize=(15, 8))
This does not create any visible objects, but it lays down the canvas that other things will go onto.
Note:
We use the pyplot module of matplotlib, imported as plt.
The result of plt.figure has been assigned to a variable, f. This will be our means of accessing the figure and its methods.
figsize=(15, 8) has been passed to plt.figure. figsize is measured in inches, so at matplotlib's default resolution of 100 dots per inch this creates a canvas of 1500x800 pixels.
f, ax = plt.subplots(1, 1, figsize=(15, 8))
f.suptitle("This is a figure with a subplot")
ax.set_title("This is a subplot", color="r")
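The relationship between figsize and the size of the canvas in pixels can be checked directly. In this sketch the dpi is passed explicitly (an addition for illustration; the calls above rely on the default) so that the result does not depend on local settings:

```python
import matplotlib.pyplot as plt

# dpi is set explicitly so the result does not depend on local defaults.
f = plt.figure(figsize=(15, 8), dpi=100)

# figsize is stored in inches; multiplying by dots-per-inch gives pixels.
width_in, height_in = f.get_size_inches()
width_px, height_px = width_in * f.dpi, height_in * f.dpi
print(width_px, height_px)
```

Doubling dpi doubles the pixel dimensions while leaving text and line widths at the same physical size, which is usually what you want for print-quality output.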
f, ax = plt.subplots(1, 2, figsize=(15, 8))
f.suptitle("This is a figure with two subplots")
ax[0].set_title("This is a subplot", color="r")
ax[1].set_title("This is another subplot", color="r")
f, ax = plt.subplots(2, 2, figsize=(15, 8))
f.suptitle("This is a figure with four subplots")
for i in range(2):
    for j in range(2):
        ax[i][j].set_title(f"Subplot [{i}][{j}]", color="r")
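When a grid has more than one row and more than one column, plt.subplots returns the subplots as a two-dimensional NumPy array, which is why two indices are needed above. As an alternative sketch, the array's flat iterator visits every subplot in turn without nested loops:

```python
import matplotlib.pyplot as plt

f, ax = plt.subplots(2, 2, figsize=(15, 8))
f.suptitle("This is a figure with four subplots")

# With a 2x2 grid, ax is a NumPy array of shape (2, 2).
print(ax.shape)

# ax.flat visits every subplot in row-major order, avoiding nested loops.
for k, subplot in enumerate(ax.flat):
    subplot.set_title(f"Subplot {k}", color="r")
```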
f, ax = plt.subplots(1, 1, figsize=(15, 8))
ax.scatter(data['x2'], data['x1'], color='r')
import numpy as np
f, ax = plt.subplots(1, 1, figsize=(15, 8))
# A straight line from (0, 0) to (10, 5), drawn through 100 points.
ax.plot(np.linspace(0, 10, 100), np.linspace(0, 5, 100), color='r')
f, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.scatter(data['x2'], data['x1'], color='r', s=3)
ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150)**2)
ax.axhline(0, color='k', alpha=0.5, ls="--")
ax.axvline(0, color='k', alpha=0.5, ls="--")
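The snippets above rely on a data object that is defined elsewhere. The following self-contained sketch substitutes synthetic values for it (an assumption; only the column names x1 and x2 come from the lecture) and shows that each drawing call adds another layer to the same subplot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the lecture's `data` (assumed columns: x1, x2).
rng = np.random.default_rng(0)
data = {"x1": rng.normal(4, 4, 300), "x2": rng.normal(4, 4, 300)}

f, ax = plt.subplots(1, 1, figsize=(8, 4))
ax.scatter(data["x2"], data["x1"], color="r", s=3)  # layer 1: points
ax.plot(np.linspace(-10, 20, 150), np.linspace(-3.5, 4, 150) ** 2)  # layer 2: curve
ax.axhline(0, color="k", alpha=0.5, ls="--")  # layer 3: horizontal rule
ax.axvline(0, color="k", alpha=0.5, ls="--")  # layer 4: vertical rule

# Each call added an artist to the same subplot.
print(len(ax.collections), "collection(s),", len(ax.lines), "line(s)")
```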
[...]
ax.xaxis.set_label_text("X-Axis Label", color='r')
ax.yaxis.set_label_text("Y-Axis Label", color='r')
[...]
ax.xaxis.set_ticks(range(-10, 40, 10))
ax.yaxis.set_ticks(range(-4, 25, 2))
import matplotlib.ticker  # binds the name used in the two calls below
ax.xaxis.set_major_locator(matplotlib.ticker.MultipleLocator(base=3))
ax.yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(base=2))
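The two approaches to tick placement above differ in kind: set_ticks fixes an explicit list of positions, whereas a locator supplies a rule. A small sketch comparing them on two otherwise identical subplots (the plotted line is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

f, ax = plt.subplots(1, 2, figsize=(10, 3))
for subplot in ax:
    subplot.plot(np.linspace(0, 30, 50), np.linspace(0, 30, 50))

# Fixed ticks: exactly these positions, regardless of the data range.
ax[0].xaxis.set_ticks(range(0, 40, 10))
ax[0].set_title("set_ticks(range(0, 40, 10))")

# Locator: a rule (every multiple of 3) that adapts if the limits change.
ax[1].xaxis.set_major_locator(MultipleLocator(base=3))
ax[1].set_title("MultipleLocator(base=3)")

fixed_ticks = ax[0].get_xticks()
located_ticks = ax[1].get_xticks()
```

A locator is usually the better choice when the axis limits are not known in advance, since the ticks follow the data rather than a hard-coded range.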
import seaborn as sns
f, ax = plt.subplots(1, 1, figsize=(8, 4))
# Recent seaborn versions require x and y to be passed by keyword.
sns.boxenplot(x=bes_df['region'], y=bes_df['Age'], ax=ax)
ax.xaxis.set_ticklabels(ax.xaxis.get_ticklabels(), rotation=30)