METCS677 Homework

# Seaborn Exercises

## Imports

Run the cell below to import the libraries.

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

## The Data

The dataset for this homework is also available on Kaggle: https://www.kaggle.com/rikdifos/credit-card-approval-prediction

Brief Description:

Feature Information:

application_record.csv

| Feature name | Explanation | Remarks |
| --- | --- | --- |
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marital status | |
| NAME_HOUSING_TYPE | Way of living | |
| DAYS_BIRTH | Birthday | Count backwards from current day (0); -1 means yesterday |
| DAYS_EMPLOYED | Start date of employment | Count backwards from current day (0). A positive value means the person is currently unemployed. |
| FLAG_MOBIL | Is there a mobile phone | |
| FLAG_WORK_PHONE | Is there a work phone | |
| FLAG_PHONE | Is there a phone | |
| FLAG_EMAIL | Is there an email | |
| OCCUPATION_TYPE | Occupation | |
| CNT_FAM_MEMBERS | Family size | |

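
The day-offset encoding of DAYS_BIRTH and DAYS_EMPLOYED described in the table can be decoded roughly as follows. This is a minimal sketch on a made-up three-row sample, not the homework solution: the third row and the column names `AGE_YEARS`, `IS_UNEMPLOYED`, and `YEARS_EMPLOYED` are hypothetical, and dividing by 365 ignores leap years.

```python
import pandas as pd

# Hypothetical sample in the same encoding as application_record.csv
sample = pd.DataFrame({
    "DAYS_BIRTH": [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, -1134, 365243],
})

# Offsets count backwards from today, so they are negative: flip the sign, then divide.
# 365 days per year is an approximation (assumption: leap years are ignored).
sample["AGE_YEARS"] = -sample["DAYS_BIRTH"] / 365

# A positive DAYS_EMPLOYED marks a currently unemployed person
sample["IS_UNEMPLOYED"] = sample["DAYS_EMPLOYED"] > 0

# Years employed is only meaningful for employed rows; unemployed rows become NaN
sample["YEARS_EMPLOYED"] = (-sample["DAYS_EMPLOYED"] / 365).where(~sample["IS_UNEMPLOYED"])

print(sample.round(1))
```

Keeping the unemployed flag as its own boolean column avoids treating the large positive placeholder as a real duration.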
df = pd.read_csv("application_record.csv")
df.head(3)

| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
| 2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | -21474 | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 |

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438557 entries, 0 to 438556
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ID                   438557 non-null  int64  
 1   CODE_GENDER          438557 non-null  object 
 2   FLAG_OWN_CAR         438557 non-null  object 
 3   FLAG_OWN_REALTY      438557 non-null  object 
 4   CNT_CHILDREN         438557 non-null  int64  
 5   AMT_INCOME_TOTAL     438557 non-null  float64
 6   NAME_INCOME_TYPE     438557 non-null  object 
 7   NAME_EDUCATION_TYPE  438557 non-null  object 
 8   NAME_FAMILY_STATUS   438557 non-null  object 
 9   NAME_HOUSING_TYPE    438557 non-null  object 
 10  DAYS_BIRTH           438557 non-null  int64  
 11  DAYS_EMPLOYED        438557 non-null  int64  
 12  FLAG_MOBIL           438557 non-null  int64  
 13  FLAG_WORK_PHONE      438557 non-null  int64  
 14  FLAG_PHONE           438557 non-null  int64  
 15  FLAG_EMAIL           438557 non-null  int64  
 16  OCCUPATION_TYPE      304354 non-null  object 
 17  CNT_FAM_MEMBERS      438557 non-null  float64
dtypes: float64(2), int64(8), object(8)
memory usage: 60.2+ MB
df.describe()

| | ID | CNT_CHILDREN | AMT_INCOME_TOTAL | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4.385570e+05 | 438557.000000 | 4.385570e+05 | 438557.000000 | 438557.000000 | 438557.0 | 438557.000000 | 438557.000000 | 438557.000000 | 438557.000000 |
| mean | 6.022176e+06 | 0.427390 | 1.875243e+05 | -15997.904649 | 60563.675328 | 1.0 | 0.206133 | 0.287771 | 0.108207 | 2.194465 |
| std | 5.716370e+05 | 0.724882 | 1.100869e+05 | 4185.030007 | 138767.799647 | 0.0 | 0.404527 | 0.452724 | 0.310642 | 0.897207 |
| min | 5.008804e+06 | 0.000000 | 2.610000e+04 | -25201.000000 | -17531.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 5.609375e+06 | 0.000000 | 1.215000e+05 | -19483.000000 | -3103.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.047745e+06 | 0.000000 | 1.607805e+05 | -15630.000000 | -1467.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 6.456971e+06 | 1.000000 | 2.250000e+05 | -12514.000000 | -371.000000 | 1.0 | 0.000000 | 1.000000 | 0.000000 | 3.000000 |
| max | 7.999952e+06 | 19.000000 | 6.750000e+06 | -7489.000000 | 365243.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 20.000000 |

## TASKS

### Recreate the plots shown in the markdown image cells.

Each plot also contains a brief description of what it is trying to convey.

Note, these are meant to be quite challenging. Start by first replicating the most basic form of the plot, then attempt to adjust its styling and parameters to match the given image.

Closer reproductions will receive higher scores.

NOTE: You may need to perform extra calculations on the Pandas DataFrame before calling Seaborn to create the plot.


### TASK: Recreate the Scatter Plot shown below

# CODE HERE TO RECREATE THE PLOT SHOWN ABOVE
# Keep rows with DAYS_EMPLOYED <= 2000. This drops the large positive values
# (the maximum is 365243) that mark currently unemployed clients.
# .copy() makes an independent DataFrame, so the assignments below do not
# trigger SettingWithCopyWarning.
df_employed = df[df['DAYS_EMPLOYED'] <= 2000].copy()

# Transform DAYS_BIRTH and DAYS_EMPLOYED to positive values using abs()
df_employed['DAYS_BIRTH'] = df_employed['DAYS_BIRTH'].abs()  # Make age positive
df_employed['DAYS_EMPLOYED'] = df_employed['DAYS_EMPLOYED'].abs()  # Make days employed positive

# Create the scatter plot
plt.figure(figsize=(7, 7))
sns.scatterplot(data=df_employed, x='DAYS_BIRTH', y='DAYS_EMPLOYED', alpha=.1, linewidth=.2)

# Calculate the IQR for DAYS_BIRTH
Q1_birth = df_employed['DAYS_BIRTH'].quantile(0.25)
Q3_birth = df_employed['DAYS_BIRTH'].quantile(0.75)
IQR_birth = Q3_birth - Q1_birth

# Calculate the IQR for DAYS_EMPLOYED
Q1_employed = df_employed['DAYS_EMPLOYED'].quantile(0.25)
Q3_employed = df_employed['DAYS_EMPLOYED'].quantile(0.75)
IQR_employed = Q3_employed - Q1_employed

# Define bounds for removing outliers
lower_bound_birth = Q1_birth - 1.5 * IQR_birth
upper_bound_birth = Q3_birth + 1.5 * IQR_birth
lower_bound_employed = Q1_employed - 1.5 * IQR_employed
upper_bound_employed = Q3_employed + 1.5 * IQR_employed

# Filter out the outliers.
# Note: df_filtered is not used by the scatter plot above, which is drawn from df_employed.
df_filtered = df_employed[
    (df_employed['DAYS_BIRTH'] >= lower_bound_birth) &
    (df_employed['DAYS_BIRTH'] <= upper_bound_birth) &
    (df_employed['DAYS_EMPLOYED'] >= lower_bound_employed) &
    (df_employed['DAYS_EMPLOYED'] <= upper_bound_employed)
]

# Set axis labels
plt.xlabel('DAYS_BIRTH')
plt.ylabel('DAYS_EMPLOYED')

# Show the plot
plt.show()
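
The 1.5 × IQR bounds computed in the cell above can be checked by hand on a small series. This is only an illustrative sketch on made-up numbers; the variable names are hypothetical and are not part of the homework code.

```python
import pandas as pd

# Made-up data: 100 is the obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)  # 3.0
q3 = values.quantile(0.75)  # 7.0
iqr = q3 - q1               # 4.0

lower = q1 - 1.5 * iqr      # -3.0
upper = q3 + 1.5 * iqr      # 13.0

# between() is inclusive on both ends by default
kept = values[values.between(lower, upper)]
print(kept.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Only values outside the interval from `lower` to `upper` are dropped, so the rule adapts to the spread of each column rather than relying on a fixed cutoff.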


### TASK: Recreate the Distribution Plot shown below:

Note: you will need to figure out how to calculate “Age in Years” from one of the columns in the DataFrame. Think carefully about this.

# CODE HERE TO RECREATE THE PLOT SHOWN ABOVE
# Calculate age in years: DAYS_BIRTH is a negative day offset, so take abs() and divide by 365
df['Age_Years'] = df['DAYS_BIRTH'].abs() / 365

# Create the distribution plot
plt.figure(figsize=(15, 3))
sns.histplot(df['Age_Years'], color='red')

# Set the x-axis label
plt.xlabel('Age in Years')

# Show the plot
plt.show()


### TASK: Recreate the Categorical Plot shown below:

# Code here
# Calculate the median total income
median_income = df['AMT_INCOME_TOTAL'].median()

# Filter the DataFrame for the bottom half of income earners
df_bottom_half = df[df['AMT_INCOME_TOTAL'] <= median_income]
# Order the NAME_FAMILY_STATUS categories by ascending median income within the bottom half
order = df_bottom_half.groupby('NAME_FAMILY_STATUS')['AMT_INCOME_TOTAL'].median().sort_values().index

# Create the boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_bottom_half, x='NAME_FAMILY_STATUS', y='AMT_INCOME_TOTAL', hue='FLAG_OWN_REALTY', order=order)
# Remove labels
plt.xlabel('')
plt.ylabel('')
plt.title('')  # Optionally remove the title as well

# Position the legend on the side
plt.legend(title='FLAG_OWN_REALTY', loc='center left', bbox_to_anchor=(1, 0.5))

# Show the plot
plt.show()
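
The category ordering used in the cell above (group, take the median, sort, read off the index) can be seen on a toy frame. A minimal sketch with made-up values; the column names `status` and `income` are hypothetical stand-ins for NAME_FAMILY_STATUS and AMT_INCOME_TOTAL.

```python
import pandas as pd

# Made-up data: two rows per category
toy = pd.DataFrame({
    "status": ["Married", "Single", "Widow", "Married", "Single", "Widow"],
    "income": [150, 90, 60, 170, 110, 80],
})

# Median income per category, sorted ascending; the index holds the category names
medians = toy.groupby("status")["income"].median().sort_values()
order = medians.index.tolist()

print(medians.to_dict())  # {'Widow': 70.0, 'Single': 100.0, 'Married': 160.0}
print(order)              # ['Widow', 'Single', 'Married']
```

A list like `order` can be passed to the `order` argument of `sns.boxplot`, as the cell above does, so the boxes appear from lowest to highest median.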


# Heatmaps

In Seaborn, heatmaps are used to visualize data in a matrix format, where the individual values in the matrix are represented as colored cells. Heatmaps are especially useful for showing the magnitude of data and patterns across a 2D space.

Each cell in the heatmap is colored based on the value it contains, with a color gradient that reflects the intensity or magnitude of the value.

Seaborn Heatmap documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html

### What Heatmaps Show:

  1. Matrix of values: Heatmaps visualize data that is organized in a grid (e.g., a correlation matrix, confusion matrix, or any two-dimensional data).

  2. Color-coded magnitude: The color of each cell in the heatmap reflects the magnitude of the corresponding data point. Lighter or darker colors represent higher or lower values, depending on the color map.

  3. Patterns and relationships: Heatmaps are excellent for identifying patterns, trends, or correlations between variables by using colors to highlight significant values or clusters in the data.
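
The three points above can be sketched with a small correlation matrix. This example uses synthetic data rather than the homework dataset; the column names `a`, `b`, and `c` and the random seed are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data: b is built to correlate with a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# 1. Matrix of values: a 3x3 correlation matrix
corr = demo.corr()

# 2. Color-coded magnitude: vmin/vmax pin the color scale to the full [-1, 1] range
# 3. Patterns: annot=True prints each value so strong cells are easy to confirm
fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="viridis", vmin=-1, vmax=1, square=True, ax=ax)
plt.show()
```

Fixing `vmin` and `vmax` keeps the colors comparable between plots; without them the scale is fitted to the data in each matrix.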

### Common Use Cases for Heatmaps:

### TASK: Recreate the Heat Map shown below:

<img src="heatmap_solution.png" width="450">

# CODE HERE
# FLAG_MOBIL is constant (std is 0 in df.describe()), so its correlations are undefined
df_hm = df.drop(columns=['FLAG_MOBIL'])

# Select only numeric columns
numeric_columns = df_hm.select_dtypes(include=['number'])

# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()

# Optionally, drop rows and columns that are entirely NaN (how='all');
# a plain dropna() would remove every row if even one column were all NaN
correlation_matrix = correlation_matrix.dropna(how='all').dropna(axis=1, how='all')

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=False, cmap='viridis', square=True, cbar=False)
plt.xticks([])
plt.yticks([])
plt.xlabel('')
plt.ylabel('')
plt.title('')

# Show the plot
plt.show()
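
The heatmap cell above drops FLAG_MOBIL before computing correlations. A plausible reason, given that `df.describe()` reports a standard deviation of 0 for that column, is that a constant column has no defined Pearson correlation, so pandas returns NaN for it. The sketch below shows this on made-up data; the frame and its column names are hypothetical.

```python
import pandas as pd

# Made-up data: "const" never varies, like FLAG_MOBIL in the homework dataset
demo = pd.DataFrame({
    "const": [1, 1, 1, 1],
    "x": [1, 2, 3, 4],
    "y": [2, 4, 5, 9],
})

corr = demo.corr()
print(corr)  # the "const" row and column are NaN

# Dropping only rows/columns that are entirely NaN keeps the valid 2x2 block
cleaned = corr.dropna(axis=0, how="all").dropna(axis=1, how="all")
print(cleaned.columns.tolist())  # ['x', 'y']

# A plain dropna() removes any row containing a NaN, which here is every row
print(corr.dropna().empty)  # True
```

Dropping the constant column up front, or using `how="all"` afterwards, both leave a matrix that `sns.heatmap` can draw without blank cells.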