Exploratory Data Analysis

Dr. Nathaniel Cline

Agenda

Exploring Numerical Data

Exploring Categorical Data

Finding Stories

Review and to do

County Example Data

Code

county |>
  select(state, name, pop_change, pop_change_2levels, median_hh_income) |>
  filter(!is.na(pop_change)) |>
  group_by(pop_change_2levels) |> 
  slice_sample(n = 5) |>
  arrange(pop_change_2levels, state, name) |>
  rename(
    State = state,
    County = name,
    `Population change (%)` = pop_change,
    `Gain / No gain` = pop_change_2levels,
    `Median household income` = median_hh_income
  ) |>
  kbl(linesep = "", booktabs = TRUE, align = "llccc") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = c("striped", "hold_position"), full_width = FALSE) |>
  column_spec(3, width = "8em") |>
  column_spec(4, width = "4em") |>
  column_spec(5, width = "8em")

The median household income from a random sample of 5 counties with population gain between 2010 to 2017 and another random sample of 5 counties with no population gain.
State	County	Population change (%)	Gain / No gain	Median household income
Georgia	Murray County	1.35	gain	41617
Georgia	Rabun County	2.46	gain	39297
Missouri	Audrain County	0.38	gain	44056
Montana	Wheatland County	0.80	gain	37222
Texas	Tyler County	0.20	gain	44396
Arkansas	Pike County	-3.46	no gain	36893
Illinois	Rock Island County	-1.85	no gain	51426
Iowa	Hamilton County	-1.47	no gain	55836
Nebraska	Scotts Bluff County	-1.33	no gain	47975
Oklahoma	Seminole County	-2.21	no gain	37741

The Loans Example Data

Code

glimpse(loans)

Rows: 10,000
Columns: 55
$ emp_title                        <chr> "global config engineer ", "warehouse…
$ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
$ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
$ homeownership                    <fct> mortgage, rent, rent, rent, rent, own…
$ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
$ verified_income                  <fct> Verified, Not Verified, Source Verifi…
$ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
$ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
$ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
$ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
$ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
$ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
$ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
$ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
$ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
$ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
$ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
$ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
$ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
$ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
$ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
$ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
$ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
$ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
$ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
$ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
$ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
$ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
$ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
$ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
$ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ loan_purpose                     <fct> moving, debt_consolidation, other, de…
$ application_type                 <fct> individual, individual, individual, i…
$ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
$ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
$ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
$ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
$ grade                            <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
$ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
$ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
$ loan_status                      <fct> Current, Current, Current, Current, C…
$ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
$ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
$ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
$ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
$ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
$ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
$ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Scaterplots: County Data

Code

ggplot(county, aes(x = poverty/100, y = median_hh_income)) +
  geom_point(alpha = 0.3, fill = "#8BAB82", 
             shape = 21, size = 3) +
  geom_smooth(linetype = "dashed", color = "#DD8C6E", se = FALSE) +
  labs(x = "Poverty rate",y = "Median household income") +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = dollar_format(scale = 0.001, suffix = "K"))

Deviation

We call the distance of an observation from its mean its deviation. Below are the deviations for the \(1^{st},\) \(2^{nd},\) \(3^{rd},\) and \(50^{th}\) observations in the interest_rate variable:

\[ \begin{align} x_1 - \bar{x} &= 10.9 - 11.57 = -0.67 \\ x_2 - \bar{x} &= 9.92 - 11.57 = -1.65 \\ x_3 - \bar{x} &= 26.3 - 11.57 = 14.73 \\ &\vdots \\ x_{50} - \bar{x} &= 6.08 - 11.57 = -5.49 \\ \end{align} \]

Sample variance

If we square these deviations and then take an average, the result is equal to the sample variance, denoted by \(s^2\):

\[ \begin{align} s^2 &= \frac{(-0.67)^2 + (-1.65)^2 + (14.73)^2 + \cdots + (-5.49)^2}{50 - 1} \\ &= \frac{0.45 + 2.72 + \cdots + 30.14}{49} \\ &= 25.52 \end{align} \]

Note(1): We divide by \(n - 1,\) rather than dividing by \(n,\) when computing a sample’s variance.

Note(2): Note that squaring the deviations does two things. First, it makes large values relatively much larger. Second, it gets rid of any negative signs.

Standard Deviation

The sample standard deviation can be calculated as the square root of the sum of the squared distance of each value from the mean divided by the number of observations minus one:

\[s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]

The standard deviation is also defined as the square root of the variance:

\[s = \sqrt{25.52} = 5.05\]

Reminder: Variance and standard deviation

The variance is the average squared distance from the mean.
The standard deviation is the square root of the variance.
The standard deviation is useful when considering how far the data are distributed from the mean.

A Contingency Table

A table that summarizes data for two categorical variables is called a contingency table

Code

result_table <- loans %>%
  count(application_type, homeownership) %>%
  pivot_wider(names_from = homeownership, values_from = n) %>%
  select(application_type, rent, mortgage, own) %>%
  adorn_totals(where = c("row", "col"))

# Print the result_table
styled_table <- kable(result_table, format = "html") %>%
  kable_styling("striped")

print(styled_table)

application_type	rent	mortgage	own	Total
joint	362	950	183	1495
individual	3496	3839	1170	8505
Total	3858	4789	1353	10000

Stacked bar plot

Code

p_stacked <- ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar() +
  labs(x = "Homeownership", y = "Count", fill = "Application type")+
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E"))  # Add colors for each application_type

# Print the plot
print(p_stacked)

Dodged bar plot

We can convey the same info in a dodged bar plot

Code

p_dodged <- ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "dodge") +  # Use dodge position to separate bars by application_type
  labs(x = "Homeownership", y = "Count")+
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E"))  # Add colors for each application_type

# Print the plot
print(p_dodged)

Proportion

Code

p_standardized <- ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E")) +
  labs(x = "Homeownership", y = "Proportion", fill = "Application type") 

# Print the plot
print(p_standardized)

Choosing a bar plot

When is the stacked, dodged, or standardized bar plot the most useful?

Explanatory vs. response?
Space?
Balance?
Raw totals?

Mosaic Plots

Code

p_mosaic <- ggplot(loans) +
  geom_mosaic(aes(x = product(application_type), fill = homeownership)) +
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E")) +
  labs(x = "Application type", y = "Homeownership") +
  guides(fill = FALSE)

print(p_mosaic)

Sleep Mosaic Plot

Pie Charts

Let’s just not

Waffle charts

Code

p_waffle <- loans |>
  count(homeownership) |>
  ggplot(aes(fill = homeownership, values = n)) +
  geom_waffle(colour = "#F8F8F8", flip = TRUE, make_proportional = TRUE, na.rm = TRUE) +
  labs(fill = NULL, title = "Homeownership") +
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E")) +
  coord_equal() +
  theme_enhance_waffle() +
  theme(legend.position = "bottom")

print(p_waffle)

Waffle chart

County Example Data

Code

county |>
  select(state, name, pop_change, pop_change_2levels, median_hh_income) |>
  filter(!is.na(pop_change)) |>
  group_by(pop_change_2levels) |> 
  slice_sample(n = 5) |>
  arrange(pop_change_2levels, state, name) |>
  rename(
    State = state,
    County = name,
    `Population change (%)` = pop_change,
    `Gain / No gain` = pop_change_2levels,
    `Median household income` = median_hh_income
  ) |>
  kbl(linesep = "", booktabs = TRUE, align = "llccc") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), 
                latex_options = c("striped", "hold_position"), full_width = FALSE) |>
  column_spec(3, width = "8em") |>
  column_spec(4, width = "4em") |>
  column_spec(5, width = "8em")

The median household income from a random sample of 5 counties with population gain between 2010 to 2017 and another random sample of 5 counties with no population gain.
State	County	Population change (%)	Gain / No gain	Median household income
Missouri	Dallas County	1.26	gain	41441
Montana	McCone County	0.23	gain	46193
Tennessee	Grainger County	1.79	gain	40088
Texas	Sterling County	5.63	gain	50000
Wyoming	Teton County	4.21	gain	80049
Arkansas	Franklin County	-0.06	no gain	39472
Indiana	Sullivan County	-1.86	no gain	45031
Iowa	Humboldt County	-1.06	no gain	48847
Missouri	Holt County	-3.86	no gain	43981
Virginia	Charles City County	-0.95	no gain	55069

Side by Side Box Plot

Code

p_box <- county |>
  filter(!is.na(pop_change)) |>
  ggplot(aes(x = median_hh_income, y = pop_change_2levels, color = pop_change_2levels)) +
  geom_boxplot() +
  scale_color_manual(values = c("#8BAB82", "#DD8C6E")) +  # Use scale_color_manual for color
  scale_x_continuous(labels = label_dollar()) +
  labs(x = "Median household income", y = NULL, color = "Change\nin population")

print(p_box)

Ridge Plots

Code

p_ridge <- county |>
  filter(!is.na(pop_change) & !is.na(metro) & !is.na(median_edu) & median_edu != "below_hs") |>
  # for better labeling on plot
  rename(
    pop_change_num = pop_change,
    pop_change = pop_change_2levels
    ) |>
  ggplot(aes(x = median_hh_income, y = median_edu, fill = median_edu, color = median_edu)) +
  geom_density_ridges(alpha = 0.5, aes(linetype = median_edu)) +
  scale_fill_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E")) +
  scale_color_manual(values = c("#8BAB82", "#98ACB5", "#DD8C6E")) +
  scale_linetype_manual(values = c("solid", "dashed", "dotted")) +
  scale_x_continuous(labels = label_dollar()) +
  facet_grid(pop_change~metro, labeller = label_both) +
  labs(x = "Median household income", y = NULL) +
  guides(fill = FALSE, color = FALSE, linetype = FALSE)

print(p_ridge)

Ridge Plots

Is inflation returning
to normal?

Ridge Plots

Based on the stacked bar plot, do views on the protests and age appear to be associated? Explain your reasoning.

Side Note: Heatmaps

Some possible stories to look for

While doing exploratory analysis, you might be on the lookout for:

Factoid stories
Interaction stories
Comparison stories
Change stories

Factoid stories

Why does this one data point stand out from the others?

Sometimes in large sets of data you find the most interesting thing is the story of one particular piece of information.
This could be an “outlier”, or it could be the data point that is most common.
A detail about one particular piece of your data can fascinate and surprise people. It can also give them an easier way to start thinking about the whole set of data.

Interaction stories

Why do these do aspects of the data change with each other?

When two aspects of your data seem related (correlated), you can tell a story about how they interact.
In other cases, they might interact as opposites.
You need to be cautious about whether you will guess about reasons for the interaction, but noticing the relationship itself can be a good story that connects things people otherwise don’t think about together.

Comparison Stories

What is the meaningful difference between these parts?

Often one part of your data tells one story, but another part tells a totally different story.
Or maybe there is a smaller portion of your data that serves as an example of an overall pattern.

Change stories

What made this part of the data change from this to that?

People like to think about how things change over time.
Telling a story a story about change over time appeals to people’s interest in understanding what causes change, and they can often remember seeing the differences.

Review

Today we covered some basics about exploring data.
Most of you have had a fair amount of statistics training so hopefully exploring data and descriptive statistics are familiar.
Perhaps what was new was the use of graphical tools to explore data.
If you’d like to think more about using visualization as a tool to explore data, I recommend: Nathan Yau, “Data Points

To-Do

Before next class:

When you are finished with each, there will be an option to generate a hash. Generate the hash, copy, and paste here!

Exploratory Data Analysis

Agenda

College Earnings Review

Numerical Data

County Example Data

The Loans Example Data

Scaterplots: County Data

Deviation

Sample variance

Standard Deviation

Reminder: Variance and standard deviation

Exploring Categorical Data

A Contingency Table

Stacked bar plot

Dodged bar plot

Proportion

Choosing a bar plot

Mosaic Plots

Sleep Mosaic Plot

Pie Charts

Waffle charts

Waffle chart

County Example Data

Side by Side Box Plot

Ridge Plots

Ridge Plots

Ridge Plots

Side Note: Heatmaps

Finding Stories

Some possible stories to look for

Factoid stories

Interaction stories

Comparison Stories

Change stories

Review and To-Do

Review

To-Do