# Coding Tutorial 5: Unsupervised Learning

In this coding tutorial, we learn how to do the following for `k-means` clustering and principal components analysis (PCA):

- Import models from `scikit-learn`
- Prepare a pandas dataframe for analysis with `scikit-learn`
- Instantiate and fit a model to data
- Visualise the results of the model

# Importing Models from Scikit-Learn

`scikit-learn` is organised into sub-modules (for example `sklearn.cluster` and `sklearn.decomposition`), so you need to import each model from the sub-module that contains it.

In [None]:
# standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn imports
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [None]:
# import the data
link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'
df = pd.read_feather(link)

# Data Pre-Processing

There are four steps for preparing data for analysis:

1. Feature Selection
2. Accounting for NAs
3. One Hot Encoding
4. Normalization and conversion to a `numpy` ndarray

## Feature Selection

Here we just choose which columns we are going to use. If your data has a lot of NAs, it may be worthwhile to prefer columns with fewer NAs.

In [None]:
features = ['region', 'Age', 'a02', 'a03', 'e01',
            'k01', 'k02', 'k11', 'k13', 'k06', 'k08',
            'y01', 'y03', 'y06', 'y08', 'y09', 'y11', 'y17']

## Accounting for NAs

In [None]:
# Can check for na's with:
# df[features].isna().sum()
df = df[features].dropna()
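
The NA check and the row-wise drop can be tried on a small example first. The sketch below uses a made-up dataframe (the `toy` values are hypothetical and not taken from the BES data) to show what `.isna().sum()` reports and how many rows `.dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey dataframe
toy = pd.DataFrame({
    "Age": [34, np.nan, 58, 41],
    "region": ["London", "Wales", None, "Scotland"],
    "a02": ["Yes", "No", "Yes", "No"],
})

# Count the missing values in each column
na_counts = toy.isna().sum()
print(na_counts)

# Drop every row that has a missing value in any column
toy_complete = toy.dropna()
print(toy.shape, toy_complete.shape)
```

Because `.dropna()` removes a row if any of the selected columns is missing, each extra feature can cost further rows, which is one reason to prefer columns with fewer NAs.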

## One-Hot Encoding

We can one-hot encode the categorical columns using the `pd.get_dummies()` function. Numeric columns are left unchanged.

In [None]:
data = pd.get_dummies(df)
print(df.shape, data.shape)
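
To see what the encoding does to the column layout, here is a minimal sketch on a made-up dataframe (the `toy` values are hypothetical). Each category of a text column becomes its own indicator column, while the numeric column passes through untouched.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "Age": [34, 27, 58],
    "region": ["London", "Wales", "London"],
})

# One indicator column is created per category of `region`
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(toy.shape, encoded.shape)
```

This is why the second shape printed above has more columns than the first: the number of rows is unchanged, but each categorical feature expands into as many columns as it has categories.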

## Normalization and Conversion to `numpy`

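We extract the underlying `numpy` array from the dataframe's `.values` attribute and pass it to the `StandardScaler().fit_transform()` method, which rescales each column to have mean 0 and standard deviation 1.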

In [None]:
X = data.values
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)
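
As a quick sanity check on what the scaler does, the following sketch standardises a small made-up array (`X_toy` is hypothetical) whose two columns are on very different scales, and then inspects the column means and standard deviations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy array: the second column is on a much larger scale
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

X_toy_norm = StandardScaler().fit_transform(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_toy_norm.mean(axis=0))
print(X_toy_norm.std(axis=0))
```

Scaling matters for both `k-means` and PCA because they work with distances and variances, so an unscaled column with a large range would otherwise dominate the result.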

# Instantiating and Fitting `k-means`

We first create an instance of the model, passing parameters such as the number of clusters, and then fit it to the data.

In [None]:
kmeans = KMeans(n_clusters=5, random_state=634)

In [None]:
kmeans.fit(X_norm)

We can extract the cluster labels from the fitted model's `.labels_` attribute, and then assign them to a column.

In [None]:
df['labels_'] = kmeans.labels_
df['labels_'] = df['labels_'].astype(str)
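
Besides `.labels_`, a fitted `KMeans` object exposes a few other attributes that are useful for inspection. The sketch below fits the model to synthetic data (two well-separated blobs generated here purely for illustration, not the BES data) and prints the cluster sizes, the cluster centres, and the within-cluster sum of squares. The explicit `n_init=10` is an assumption made so that the behaviour does not depend on the installed version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (illustration only): two well-separated blobs in 2D
rng = np.random.default_rng(634)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_toy = np.vstack([blob_a, blob_b])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=634).fit(X_toy)

# Number of observations assigned to each cluster
print(np.bincount(toy_kmeans.labels_))
# Coordinates of each cluster centre, one row per cluster
print(toy_kmeans.cluster_centers_)
# Sum of squared distances from each point to its assigned centre
print(toy_kmeans.inertia_)
```

Note that the cluster numbers themselves are arbitrary: which blob is called `0` and which `1` carries no meaning.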

# Visualising the Results

Visualising the clusters across so many variables at once is difficult, so we start by comparing the distribution of age within each cluster.

In [None]:
f, ax = plt.subplots(1, 1, figsize=(15, 8))
sns.histplot(df[['labels_', 'Age']].sort_values('labels_'),
             x='Age', ax=ax, kde=True, hue='labels_');

In [None]:
# We can reuse this helper function to plot counts by group
def grouped_barplot(data, var1, var2):
    """
    Creates a grouped bar plot of the distribution of `var2` within each group of `var1`.
    """
    # Count the rows for each (var1, var2) combination
    temp = data.groupby([var1, var2]).size().reset_index(name='Count')
    # Scale the figure width with the number of bars to draw
    f, ax = plt.subplots(1, 1, figsize=(data[var1].nunique()*data[var2].nunique()/5, 10))
    sns.barplot(data=temp, x=var1, y='Count', hue=var2, ax=ax)
    ax.set_title(f"BES Sample {var2} per {var1}")
    # Rotate the x tick labels so long category names stay readable
    ax.tick_params(axis='x', labelrotation=30)

In [None]:
grouped_barplot(df, 'a02', 'labels_')

In [None]:
grouped_barplot(df, 'region','labels_')

# Instantiating and Fitting PCA

In [None]:
pca = PCA(n_components=2, random_state=634)
# fit_transform fits the model and returns the projected data in one step
reduced_X = pca.fit_transform(X_norm)

In [None]:
sns.scatterplot(x=reduced_X[:, 0], y=reduced_X[:, 1]);
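
When reducing many columns to two components, it helps to know how much of the original variance the two components retain. A fitted `PCA` object reports this in its `explained_variance_ratio_` attribute. The sketch below uses synthetic data generated for illustration (the array `X_toy` and its correlation structure are assumptions, not the BES data).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (illustration only): 200 rows, 5 features,
# where the first two features are strongly correlated
rng = np.random.default_rng(634)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 3)),
])

toy_pca = PCA(n_components=2, random_state=634)
X_toy_reduced = toy_pca.fit_transform(X_toy)

# One row per observation, one column per retained component
print(X_toy_reduced.shape)
# Share of the total variance captured by each component
print(toy_pca.explained_variance_ratio_)
# Share of the total variance captured by both components together
print(toy_pca.explained_variance_ratio_.sum())
```

If the two ratios sum to a small number, the scatterplot shows only a small part of the structure in the data and should be interpreted with care.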

# Combining PCA and `k-means`

We can fit k-means to PCA-reduced data:

In [None]:
pcakmeans = KMeans(n_clusters=5, random_state=634)
pcakmeans.fit(reduced_X)
df['pcakmeans_labels'] = pcakmeans.labels_
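
One way to compare two sets of cluster labels, such as those from the full data and those from the PCA-reduced data, is a cross-tabulation. The sketch below uses two short made-up label lists (both hypothetical) rather than the fitted models.

```python
import pandas as pd

# Hypothetical cluster labels for the same six observations
labels_full = pd.Series([0, 0, 1, 1, 2, 2], name="kmeans")
labels_pca = pd.Series([1, 1, 0, 0, 2, 0], name="pca_kmeans")

# Rows: clusters from the first model; columns: clusters from the second
table = pd.crosstab(labels_full, labels_pca)
print(table)
```

Cluster numbers are arbitrary, so agreement shows up as each row having most of its count in a single column, not as matching label values.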

In [None]:
sns.set_style('darkgrid')
f, ax = plt.subplots(1, 1, figsize=(15, 8))
sns.scatterplot(x=reduced_X[:, 0], y=reduced_X[:, 1],
                hue=pcakmeans.labels_,
                palette=sns.color_palette(palette='colorblind', n_colors=5));

In [None]:
grouped_barplot(df, 'a02', 'pcakmeans_labels')

In [None]:
pd.DataFrame(pca.components_, columns=data.columns)
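
The table of components has one row per principal component and one column per input feature; each entry is the weight (loading) of that feature in the component. With many one-hot columns the full table is hard to read, so a common approach is to sort by absolute weight. The sketch below does this on synthetic data (the feature names `f1` to `f4` and the row labels `PC1`, `PC2` are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data (illustration only) with hypothetical feature names
rng = np.random.default_rng(634)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

toy_pca = PCA(n_components=2).fit(toy.values)

# One row per component, one column per original feature
loadings = pd.DataFrame(toy_pca.components_,
                        columns=toy.columns,
                        index=["PC1", "PC2"])

# Features with the largest absolute weight on the first component
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False).head(3)
print(top_pc1)
```

The sign of a loading only indicates direction along the component, which is why the ranking uses absolute values.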