# Week 2 Lecture Exercises


We'll be working with the BES 2017 face-to-face cross-sectional survey extensively in this course.

- You can download a zip folder containing the data from the website: https://muhark.github.io/dpir-intro-python/Week2/data/data_week2.zip
- For these exercises, you can either use `bes_data_full_week2` or `bes_data_subset_week2`.
- I've included the codebook (`BES-2017-F2F-codebook.pdf`). You'll need this to interpret the columns.

## Step 0: Read in the Data

I've done this first step for you because I'm hosting the data files online. Normally you would instead pass a filepath to the file, written relative to the directory the script is executed from.

In [1]:
import pandas as pd

link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'
bes_df = pd.read_feather(link)
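
If you unzip the data onto your own machine instead, a relative path works the same way as the URL. The following is a minimal sketch, assuming a hypothetical `data/` folder inside the directory you run the script from; adjust the folder name to wherever you actually saved the file.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: a "data" folder inside the directory you run the script from.
data_dir = Path("data")
local_path = data_dir / "bes_data_subset_week2.feather"

if local_path.exists():
    # Same call as with the URL, just pointed at the local copy.
    bes_local = pd.read_feather(local_path)
    print(bes_local.shape)
else:
    # Path.resolve() shows the absolute path that the relative path expands to.
    print(f"No file found at {local_path.resolve()}")
```

Printing the resolved path is a quick way to see which directory Python treats as "here" when a relative path does not behave as expected.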

## Exercise 1: First Look at the Data

_Answer the following questions about the dataset_:

- How many observations are in the dataset?
- How many variables are there?
- How many variables contain numeric values?
- How many variables are open-ended responses?
- How many variables are categorical?

In [2]:
bes_df.info()
# 2194 observations
# 30 columns
# 1 numeric column: Age (finalserialno is stored as int32 but is an ID, so it should not count)
# 1 open-ended response: a01
# 20 categorical variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2194 entries, 0 to 2193
Data columns (total 30 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   finalserialno         2194 non-null   int32   
 1   region                2194 non-null   object  
 2   Constit_Code          2194 non-null   object  
 3   Constit_Name          2194 non-null   object  
 4   Interview_Date        2194 non-null   object  
 5   total_num_dwel        2194 non-null   object  
 6   total_num_hous        2194 non-null   object  
 7   num_elig_people       2194 non-null   object  
 8   turnoutValidationReg  1507 non-null   category
 9   Age                   2175 non-null   float64 
 10  a01                   2194 non-null   object  
 11  a02                   2132 non-null   category
 12  a03                   2194 non-null   category
 13  e01                   2194 non-null   category
 14  k01                   2194 non-null   category
 15  k02 
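
The counts in the comments above can also be computed rather than read off the `info()` printout. Here is a minimal sketch on a made-up four-column frame: the values are invented for illustration, and only the column names and dtypes mirror the output above.

```python
import pandas as pd

# Toy stand-in for bes_df; the values are invented.
toy = pd.DataFrame({
    "finalserialno": pd.Series([101, 102, 103], dtype="int32"),
    "region": ["London", "Wales", "London"],
    "Age": [34.0, None, 58.0],
    "a03": pd.Categorical(["Not very", "Fairly", "Fairly"]),
})

n_rows, n_cols = toy.shape
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
# Everything else (free text, codes stored as strings, dates stored as strings, ...).
other_cols = [c for c in toy.columns if c not in numeric_cols + categorical_cols]

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("other:", other_cols)
```

Note that `finalserialno` is selected as numeric because of its dtype. Deciding that it is really an identifier still takes a look at the codebook.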

In [3]:
{
    'a01': 'top_issue',
    'a02': 'top_issue-best_party',
    'a03': 'politics_interest',
    'e01': 'ideo_LR',
    'k01': 'politics_attention',
    'k02': 'read_pol_news',
    'k03': 'newspaper',
    'k11': 'canvasser_contact',
    'k13': 'party_contact',
    'k06': 'twitter_use',
    'k08': 'facebook_use',
    'y01': 'income_band',
    'y03': 'housing',
    'y06': 'religion',
    'y07': 'religiosity',
    'y08': 'union_member',
    'y09': 'gender',
    'y11': 'ethnicity',
    'y17': 'employment_type',
    'y18': 'has_worked'
}

## Exercise 2: Clean Up Labels

_It's annoying to always have to refer to the codebook. Choose a few sections from the survey (e.g. questions a, questions b, etc.) and give the columns short, meaningful titles._

For instance, `a01` asks: "First, I'd like to ask you a few questions about the issues and problems facing Britain today. As far as you're concerned, what is the single most important issue facing the country at the present time?" I might rename this question `most_important_issue`, or even `top_issue`.

Another example: `y01` could be renamed `income` or `annual_income`.

To keep your code neat, I recommend that you first create a dictionary called something like `col_name_dict`, put the original names and their replacements in it, and then use the `df.rename()` method to substitute the column names.


In [4]:
col_name_dict =  {
    'a01': 'top_issue',
    'a02': 'top_issue-best_party',
    'a03': 'politics_interest',
    'e01': 'ideo_LR',
    'k01': 'politics_attention',
    'k02': 'read_pol_news',
    'k03': 'newspaper',
    'k11': 'canvasser_contact',
    'k13': 'party_contact',
    'k06': 'twitter_use',
    'k08': 'facebook_use',
    'y01': 'income_band',
    'y03': 'housing',
    'y06': 'religion',
    'y07': 'religiosity',
    'y08': 'union_member',
    'y09': 'gender',
    'y11': 'ethnicity',
    'y17': 'employment_type',
    'y18': 'has_worked'
}

In [5]:
bes_df = bes_df.rename(col_name_dict, axis=1)
bes_df.columns

Index(['finalserialno', 'region', 'Constit_Code', 'Constit_Name',
       'Interview_Date', 'total_num_dwel', 'total_num_hous', 'num_elig_people',
       'turnoutValidationReg', 'Age', 'top_issue', 'top_issue-best_party',
       'politics_interest', 'ideo_LR', 'politics_attention', 'read_pol_news',
       'newspaper', 'canvasser_contact', 'party_contact', 'twitter_use',
       'facebook_use', 'income_band', 'housing', 'religion', 'religiosity',
       'union_member', 'gender', 'ethnicity', 'employment_type', 'has_worked'],
      dtype='object')
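
One thing to be aware of: by default, `rename()` silently skips dictionary keys that match no column, so a typo in a key goes unnoticed. Below is a small sketch of how one might check for this, using a made-up frame and a deliberately wrong key (`z99` is hypothetical and not a BES variable).

```python
import pandas as pd

# Toy frame with invented values; only the column codes echo the survey.
toy = pd.DataFrame({"a01": ["issue A", "issue B"], "y01": ["band 1", "band 2"]})
col_name_dict = {"a01": "top_issue", "y01": "income_band", "z99": "not_a_column"}

# Keys that do not match any existing column.
unmatched = [old for old in col_name_dict if old not in toy.columns]
print(unmatched)

# Default behaviour: unmatched keys are ignored without any message.
renamed = toy.rename(columns=col_name_dict)
print(list(renamed.columns))

# errors="raise" turns an unmatched key into a KeyError instead.
try:
    toy.rename(columns=col_name_dict, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Passing the dictionary as `columns=...` is equivalent to passing it with `axis=1`, as in the cell above.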

## Exercise 3: Cursory Statistics

There are a few things you can calculate fairly easily. For instance:

- How many responses per region? Per constituency?
- (If using section y:) Median income bracket? Modal religion? Mean/median age?
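
A median income bracket only makes sense if the brackets are ordered. One possible approach is sketched below, under the assumption that the column is (or can be converted to) an ordered categorical. The three band labels are invented and are not the codebook's.

```python
import pandas as pd

bands = ["low", "middle", "high"]  # hypothetical labels, lowest to highest
income = pd.Series(
    pd.Categorical(
        ["low", "high", "middle", None, "middle", "high"],
        categories=bands,
        ordered=True,
    )
)

# Drop missing answers, then work with each band's position in the ordering.
codes = income.dropna().cat.codes
median_position = int(codes.median())  # truncates to the lower band on a tie
median_band = income.cat.categories[median_position]
print(median_band)
```

With an even number of respondents the median position can fall between two bands. This sketch rounds down, which is a choice you may want to make differently.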

Here you want to be creative. What questions would you ask of your data? What would a reviewer or a client be likely to want to know?

For an additional challenge, calculate each of the statistics per-region, e.g. median income bracket per-region.

In [6]:
bes_df['region'].value_counts()

North West            304
South East            282
West Midlands         227
Eastern               226
London                212
Scotland              191
Yorkshire & Humber    187
South West            167
East Midlands         156
Wales                 129
North East            113
Name: region, dtype: int64

In [7]:
bes_df['Constit_Code'].value_counts()

Birmingham    57
North East    44
Sheffield,    26
Liverpool,    22
Falkirk       21
              ..
Southport      4
Great Grim     3
Perth and      2
Glenrothes     2
Bolton Wes     1
Name: Constit_Code, Length: 214, dtype: int64
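
The per-region challenge from this exercise can be approached with `groupby()`. Here is a minimal sketch on a made-up frame: the rows are invented, and the column names assume you renamed `y06` to `religion` as in Exercise 2.

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "region": ["London", "London", "London", "Wales", "Wales"],
    "Age": [30.0, 40.0, 50.0, 22.0, 60.0],
    "religion": ["No religion", "No religion", "Christian", "Christian", "Christian"],
})

# Median age within each region.
median_age = toy.groupby("region")["Age"].median()
print(median_age)

# Most common religion within each region; mode() can return ties,
# so this keeps only the first value it reports.
modal_religion = toy.groupby("region")["religion"].agg(lambda s: s.mode().iloc[0])
print(modal_religion)
```

The same pattern applies to `bes_df` by swapping in the real frame. Whether taking the first mode is acceptable depends on how often ties occur in your data.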