Exploratory Data Analysis

Jan 4, 2018 00:00 · 5090 words · 24 minute read english R EDA

When a dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. Here I’ll see how to use graphical and numerical techniques to begin uncovering the structure of some data. Which variables suggest interesting relationships? Which observations are unusual?

Intro

Libraries used in this post:

library(tidyverse)

Datasets:
- comics
- cars
- immigration
- life expectancy
- names
- us income

Exploring Categorical Data

This section describes how to create graphical and numerical summaries of two categorical variables.

Bar chart expectations

Suppose you’ve asked 30 people, some young, some old, what their preferred flavour of pie is: apple or pumpkin. That data could be summarized in a side-by-side barchart. Here are three possibilities for how it might look.

Which one of the barcharts shows no relationship between age and flavor? In other words, which shows that pie preference is the same for both young and old?
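As a sketch of what such data might look like (the counts below are invented for illustration), here is the "no relationship" case, where both age groups split identically between flavours:

```r
library(ggplot2)

# Invented counts for 30 respondents: both age groups prefer apple 8:7
pie_survey <- data.frame(
  age    = rep(c("young", "old"), each = 15),
  flavor = rep(rep(c("apple", "pumpkin"), times = c(8, 7)), 2)
)

# Identical conditional distributions => matching bar heights in both groups
ggplot(pie_survey, aes(x = age, fill = flavor)) +
  geom_bar(position = "dodge")
```

Because the apple/pumpkin split is the same for young and old, the dodged bars have the same relative heights in both groups.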

Contingency table review

I’ll work with the comics dataset. This is a collection of characteristics of all the superheroes created by Marvel and DC Comics over the last 80 years.

Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.

comics <- read.csv("data/ExplDataAna/comics.csv")
# Print the first rows of the data
head(comics)
##                                    name      id   align        eye
## 1             Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
## 2       Captain America (Steven Rogers)  Public    Good  Blue Eyes
## 3 Wolverine (James \\"Logan\\" Howlett)  Public Neutral  Blue Eyes
## 4   Iron Man (Anthony \\"Tony\\" Stark)  Public    Good  Blue Eyes
## 5                   Thor (Thor Odinson) No Dual    Good  Blue Eyes
## 6            Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##         hair gender  gsm             alive appearances first_appear
## 1 Brown Hair   Male <NA> Living Characters        4043       Aug-62
## 2 White Hair   Male <NA> Living Characters        3360       Mar-41
## 3 Black Hair   Male <NA> Living Characters        3061       Oct-74
## 4 Black Hair   Male <NA> Living Characters        2961       Mar-63
## 5 Blond Hair   Male <NA> Living Characters        2258       Nov-50
## 6    No Hair   Male <NA> Living Characters        2255       Nov-61
##   publisher
## 1    marvel
## 2    marvel
## 3    marvel
## 4    marvel
## 5    marvel
## 6    marvel
# Check levels of align
levels(comics$align)
## [1] "Bad"                "Good"               "Neutral"           
## [4] "Reformed Criminals"
# Check the levels of gender
levels(comics$gender)
## [1] "Female" "Male"   "Other"
# Create a 2-way contingency table
table(comics$align, comics$gender)
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

Dropping levels

The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.

In R, this requires two steps: first filter out any rows containing those levels, then remove the levels from the factor variable with droplevels(). Filtering alone isn’t enough, and neither is droplevels() alone: droplevels() only removes levels with zero observations in the data, so a level with just one or two counts would otherwise be kept.
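A toy example (made up here) shows why the filtering step must come first:

```r
f <- factor(c("Bad", "Good", "Good", "Reformed Criminals"))

# droplevels() alone keeps "Reformed Criminals": it still occurs once
levels(droplevels(f))   # "Bad" "Good" "Reformed Criminals"

# After filtering out the rows, the level is unused and gets dropped
f2 <- f[f != "Reformed Criminals"]
levels(droplevels(f2))  # "Bad" "Good"
```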

tab<- table(comics$align, comics$gender)
# Print tab
tab
##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0
# Remove align level
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()

Side-by-side barcharts

While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.

Here I’ll construct two side-by-side barcharts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.

# Load ggplot2
library(ggplot2)

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) + 
  geom_bar(position= "dodge") +
  theme(axis.text.x = element_text(angle = 90))

Bar chart interpretation

Which of the following interpretations of the bar charts above is not valid?

  • Among characters with “Neutral” alignment, males are the most common.
  • In general, there is an association between gender and alignment.
  • Across all genders, “Bad” is the most common alignment.
  • There are more male characters than female characters in this dataset.

Conditional proportions

The following code generates tables of joint and conditional proportions, respectively:

tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3) # Print fewer digits
prop.table(tab)     # Joint proportions
##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888
prop.table(tab, 1)  # Conditional on rows
##          
##            Female    Male   Other
##   Bad     0.17161 0.82490 0.00349
##   Good    0.34035 0.65733 0.00232
##   Neutral 0.31523 0.67836 0.00641
prop.table(tab, 2)  # Conditional on columns
##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258

Proportions come in handy when dealing with lots of count data.

It’s often easier to draw conclusions from a proportions table than from raw counts, for example:

tab
##          
##           Female Male Other
##   Bad       1573 7561    32
##   Good      2490 4809    17
##   Neutral    836 1799    17
prop.table(tab, 1)  # Conditional on rows
##          
##            Female    Male   Other
##   Bad     0.17161 0.82490 0.00349
##   Good    0.34035 0.65733 0.00232
##   Neutral 0.31523 0.67836 0.00641

Counts vs. proportions (2)

Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, I’ll construct two barcharts: one of counts and one of proportions.

The following graphs differ in a single argument: geom_bar() vs. geom_bar(position = "fill"):

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) + 
  geom_bar(position = "fill") +
  ylab("proportion")

By adding position = "fill" to geom_bar(), I’m saying I want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.

Marginal barchart

If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable.

You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Good" and "Bad", so it makes sense to put that bar in the middle.

# Change the order of the levels in align
comics$align <- factor(comics$align, 
                       levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) + 
  geom_bar()

With ggplot it’s easy to split charts into panels - called facets here:

# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) + geom_bar() + facet_wrap(~gender)

Improve piechart

**no dataset available** The piechart is a very common way to represent the distribution of a single categorical variable, but it can be more difficult to interpret than a barchart.

This is a piechart of a dataset called pies that contains the favourite pie flavours of 98 people. Here’s how to improve the representation of these data by constructing a barchart ordered in descending order of count:

# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry", "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))

Exploring Numerical Data

This section explains how to graphically summarize numerical data.

I’ll be working with the cars dataset, which records characteristics of all the new models of cars for sale in the US in a certain year. I will investigate the distribution of mileage across a categorical variable, but before I get there, it’s essential to get familiar with the dataset.

cars <- read.csv("data/ExplDataAna/cars04.csv")

Faceted histogram

Plot a histogram of city_mpg faceted by suv, a logical variable indicating whether the car is an SUV or not.

# Learn data structure
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~suv)

It’s important to note that I can facet a plot by any categorical variable using facet_wrap().

Boxplots and density plots

The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, I could stick to using histograms, but here I’ll try with two alternatives: the box plot and the density plot.

A quick look at unique(cars$ncyl) shows that there are more possible levels of ncyl than I might think. Here, I’ll restrict my attention to the most common levels (4, 6, and 8).

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4,6,8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)

Which of the following interpretations of the plot is not valid?

  • The highest mileage cars have 4 cylinders.
  • The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
  • Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.
  • The variability in mileage of 8 cylinder cars is similar to the variability in mileage of 4 cylinder cars.

Marginal and conditional histograms

Now, let’s look at horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

I’ll be making two plots using the “data pipeline” paradigm. Piping requires dplyr to be loaded (already taken care of by loading tidyverse at the beginning).

# Create hist of horsepwr
cars %>%
  ggplot(aes(x=horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of cars HorsePower")

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(x=horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of cars HorsePower in cars under $25,000")

Observe the two histograms in the plotting window and decide which of the following is a valid interpretation.

  • Cars with around 300 horsepower are more common than cars with around 200 horsepower.
  • The highest horsepower car in the less expensive range has just under 250 horsepower.
  • Most cars under $25,000 vary from roughly 100 horsepower to roughly 350 horsepower.

Three binwidths

Before we take these plots for granted, it’s a good idea to see how things change when we alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in the data.

# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("Cars horsepower distribution with binwidth = 3")

# Create hist of horsepwr with binwidth of 30

cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Cars horsepower distribution with binwidth = 30")

# Create hist of horsepwr with binwidth of 60

cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("Cars horsepower distribution with binwidth = 60")

What feature is present in Plot A (binwidth = 3) that’s not found in Plot B (binwidth = 30) or Plot C (binwidth = 60)?

  • The most common horsepower is around 200.
  • There is a tendency for cars to have horsepower right at 200 or 300 horsepower.
  • There is a second mode around 300 horsepower.

Plot A is the only histogram that shows the counts for cars with exactly 200 and 300 horsepower.

Boxplots

While dotplots are a zero-information-loss type of plot, boxplots do their best to simplify what might be unreadable in a dotplot.
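For instance, the same city_mpg data can be drawn both ways (a sketch; geom_dotplot()’s default binning and dot size may need tuning by eye):

```r
library(ggplot2)

# Dotplot: one dot per observation, so no information is lost
ggplot(cars, aes(x = city_mpg)) +
  geom_dotplot(dotsize = 0.4)

# Boxplot: the same data reduced to five summary statistics plus outliers
ggplot(cars, aes(x = 1, y = city_mpg)) +
  geom_boxplot()
```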

Box plots for outliers

In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers.

I will apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +  
  geom_boxplot()

Note that the x = 1 placeholder is needed here, since there is no grouping variable.

Exclude the largest 3-5 outliers by filtering the rows to retain only cars priced under $100,000.

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +  
  geom_boxplot()

Plot selection

Consider two other columns in the cars dataset: city_mpg and width.

Which is the most appropriate plot for displaying the important features of their distributions?

Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Create plot of city_mpg
cars %>%
  ggplot(aes(x=city_mpg)) +
  geom_density()

# Create plot of width
cars %>% 
  ggplot(aes(x=width)) +
  geom_density()

Same as above, but based on boxplots:

# Create plot of city_mpg
cars %>%
  ggplot(aes(x=1, y=city_mpg)) +
  geom_boxplot()

# Create plot of width
cars %>% 
  ggplot(aes(x=1, y=width)) +
  geom_boxplot()

Because the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot. Width, on the other hand, looks pretty good as a density plot.

Visualization in higher dimensions

One way to plot three variables at once is with facet_grid():

ggplot(cars, aes(x = msrp)) + geom_density() + 
      facet_grid(pickup ~ rear_wheel)

A very handy feature comes from the labeller argument:

ggplot(cars, aes(x = msrp)) + geom_density() + facet_grid(pickup ~ rear_wheel, labeller = label_both)

An important thing to remember here is that density plots normalize each distribution, so that the area under each curve is one. Before taking these plots for granted we should check the raw counts, to make sure we draw conclusions based on enough data points:

# Table of counts
table(cars$rear_wheel, cars$pickup)
##        
##         FALSE TRUE
##   FALSE   306   12
##   TRUE     98   12
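The normalization claim can also be verified numerically with base R’s density() (a sketch; density() picks its own bandwidth, independent of ggplot2’s):

```r
d <- density(cars$msrp)

# Trapezoid-rule approximation of the area under the density curve;
# it should come out very close to 1
sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
```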

3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, one can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("Mileage by suv and ncyl")

Numerical Summaries

Here we’ll visit some useful statistics for describing distributions of data.

Calculate center measures

I’ll work with data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.

Let’s focus on how the life expectancy differs from continent to continent.

This requires that I conduct analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.

library(gapminder)

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp),
            median(lifeExp))
## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fctr>              <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7
# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Boxplots are generally more informative here, as they show the degree of variability in the data, which the mean alone cannot.

One thing I could do to make the numeric table more informative (without computing a spread measure itself) would be to look at the difference between mean and median. Since the mean is more sensitive to outliers, a larger difference between mean and median can indicate greater variability in the data:

gap2007 %>% 
      group_by(continent) %>% 
      summarize(m1=mean(lifeExp),
                m2=median(lifeExp)) %>%
      mutate(dev=m1-m2) %>%
      arrange(desc(abs(dev)))
## # A tibble: 5 x 4
##   continent    m1    m2    dev
##   <fctr>    <dbl> <dbl>  <dbl>
## 1 Africa     54.8  52.9  1.88 
## 2 Asia       70.7  72.4 -1.67 
## 3 Europe     77.6  78.6 -0.960
## 4 Americas   73.6  72.9  0.709
## 5 Oceania    80.7  80.7  0

Calculate spread measures

Let’s extend the powerful group_by() and summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.

# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp),
            IQR(lifeExp),
            n())
## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fctr>            <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2
# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)

Choose measures for center and spread

Consider the density plots generated above.

What are the most appropriate measures to describe their centers and spreads?

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent=="Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))
## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44
# Compute stats for population
gap2007 %>%
  summarize(median(pop),
            IQR(pop))
## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008

Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
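A quick made-up example illustrates that robustness:

```r
x <- c(70, 72, 73, 74, 75)
mean(x)            # 72.8
median(x)          # 73

x_out <- c(x, 40)  # add a single extreme value
mean(x_out)        # about 67.3: the mean chases the outlier
median(x_out)      # 72.5: the median barely moves
```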

Transformations

Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.

Here I’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) + geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) + geom_density()

Identify outliers

Consider the distribution, shown here, of the life expectancies of the countries in Asia.

gap2007 %>% filter(continent=="Asia") %>% ggplot(aes(x=1, y=lifeExp)) + geom_boxplot()

The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using either min() or filter(), then proceed to building a plot with that country removed.

gap2007 %>% 
      filter(continent=="Asia") %>% 
      arrange(lifeExp) %>% 
      head(5)
## # A tibble: 5 x 7
##   country     continent  year lifeExp      pop gdpPercap log_pop
##   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>   <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923       975    17.3
## 2 Iraq        Asia       2007    59.5 27499638      4471    17.1
## 3 Cambodia    Asia       2007    59.7 14131858      1714    16.5
## 4 Myanmar     Asia       2007    62.1 47761980       944    17.7
## 5 Yemen, Rep. Asia       2007    62.7 22211743      2281    16.9
# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent=="Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()

Case Study

Apply all of the above to explore and summarize a real world dataset in this case study of email spam.

Spam and num_char

Is there an association between spam and the length of an email?

You could imagine a story either way:

  • Spam is more likely to be a short message tempting me to click on a link, or

  • My normal email is likely shorter since I exchange brief emails with my friends all the time.

Here, I’ll use the email dataset from the openintro package to settle that question.

# Load packages
library(openintro)
library(dplyr)
library(ggplot2)

# Load the email dataset
data(email)

A quick look at the email dataset shows that the spam variable is numeric. I’ll convert it to a factor before continuing, so that charts and tables have nice labels:

# Factorize spam
email$spam <- factor(email$spam, labels = c("non-spam", "spam"))
# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarise(mean(num_char),
            median(num_char),
            sd(num_char),
            IQR(num_char))
## # A tibble: 2 x 5
##   spam     `mean(num_char)` `median(num_char)` `sd(num_char)` `IQR(num_ch~
##   <fctr>              <dbl>              <dbl>          <dbl>        <dbl>
## 1 non-spam            11.3                6.83           14.5        13.6 
## 2 spam                 5.44               1.05           14.9         2.82
# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()

This boxplot shows the log of the number of characters, grouped by spam / non-spam emails. From this plot I can see that spam emails generally have fewer characters than non-spam emails (their median is lower). Non-spam emails also show more variability, including quite a few short messages categorized as ham.

Spam and !!!

Let’s look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, I’ll check whether there is a relationship between this variable and whether or not a message is spam.

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarise(median(exclaim_mess),
            IQR(exclaim_mess),
                mean(exclaim_mess),
            sum(exclaim_mess))
## # A tibble: 2 x 5
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)` `mean(exc~ `sum(exc~
##   <fctr>                    <dbl>               <dbl>      <dbl>     <dbl>
## 1 non-spam                   1.00                5.00       6.51     23130
## 2 spam                       0                   1.00       7.32      2687
# Create plot for spam and exclaim_mess

# Density plot
email %>%
   mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess= log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess, fill = spam)) +
  geom_density(alpha=0.3)

Note that log(0) is -Inf in R, which isn’t a very useful value! I can get around this by adding a small number (0.1, as above) to the quantity inside the log() function. This small shift to the right won’t affect the results.

# Boxplot
email %>%
  mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess= log(exclaim_mess)) %>%
  ggplot(aes(y = log_exclaim_mess, x = spam)) +
  geom_boxplot()

# Histogram
email %>%
   mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess= log(exclaim_mess)) %>%
      ggplot(aes(x=log_exclaim_mess)) +geom_histogram() + facet_wrap(~spam)

From the above examples it’s obvious that histograms are the way to go here. I can clearly see that there were very few exclamation marks (in terms of total counts) in spam emails (which feels odd, but it’s true):

# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarise(sum(exclaim_mess))
## # A tibble: 2 x 2
##   spam     `sum(exclaim_mess)`
##   <fctr>                 <dbl>
## 1 non-spam               23130
## 2 spam                    2687

Collapsing levels

If the heavy skew of exclaim_mess was difficult to work with, the number of images attached to each email (image) poses even more of a challenge.

# Get a sense of its distribution
table(email$image)
## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1

This tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, I’ll collapse image into a categorical variable that indicates whether or not the email had at least one image.

# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image!=0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")

From here, I can see that an email without an image is more likely to be non-spam than spam. But the same also holds for emails that did have an image. It’s worth noting, too, that non-spam messages are much more heavily represented:

table(email$spam)
## 
## non-spam     spam 
##     3554      367

Overall, I would say this plot is not providing much info (correct me if I’m wrong).

Data Integrity

In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, the variable num_char contains the number of characters in the email, in thousands, so it could take decimal values, but it certainly shouldn’t take negative values.

I can formulate a test to ensure this variable is behaving as expected:

sum(email$num_char < 0)
## [1] 0

When doing arithmetic on logical values, R treats TRUE as 1 and FALSE as 0. Since the sum over the whole vector is zero, I see that every case in the dataset took a value of FALSE in the test. That is, the num_char column is behaving as we expect and taking only non-negative values.
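This trick works for any logical vector, and mean() extends it to proportions:

```r
v <- c(TRUE, FALSE, TRUE, TRUE)
sum(v)   # 3: each TRUE counts as 1
mean(v)  # 0.75: the proportion of TRUEs
```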

Let’s consider the variables image and attach.

Do attached images count as attached files in this dataset?

Design a simple test to determine if images count as attached files. This involves creating a logical condition to compare the values of the two variables, then using sum() to assess every case in the dataset. Recall that the logical operators are < for less than, <= for less than or equal to, > for greater than, >= for greater than or equal to, and == for equal to.

table(email$image, email$attach)
##     
##         0    1    2    3    4    5    6    7    8    9   10   20   21
##   0  3638   98   61   11    1    2    0    0    0    0    0    0    0
##   1     0   60   16    0    0    0    0    0    0    0    0    0    0
##   2     0    0   13    3    0    0    0    0    0    0    0    0    1
##   3     0    0    0    5    1    0    2    2    1    0    0    0    0
##   4     0    0    0    0    1    0    0    0    0    0    1    0    0
##   5     0    0    0    0    0    2    0    0    0    0    0    0    0
##   9     0    0    0    0    0    0    0    0    0    1    0    0    0
##   20    0    0    0    0    0    0    0    0    0    0    0    1    0

A simple table isn’t very clear to read. I could simplify it by recoding both image and attach to yes/no categories:

email %>% 
      mutate(im=image>0, at=attach>0) %>%
      group_by(im) %>%
      summarise(sum(at))
## # A tibble: 2 x 2
##   im    `sum(at)`
##   <lgl>     <int>
## 1 F           173
## 2 T           110
email %>% 
      mutate(im=image>0, at=attach>0) %>%
      ggplot(aes(x=im, fill=at)) + geom_bar()

The tabular data shows the difference between the green bars: there were more emails with a non-image attachment than emails whose attachment was an actual image. This is not the best way to answer my question.

If I use proportions, I can see that in fact all emails that had an image were classified as emails with an attachment:

email %>% 
      mutate(im=image>0, at=attach>0) %>%
      ggplot(aes(x=im, fill=at)) + geom_bar(position="fill")

I can reverse what’s on the x-axis and show that among emails without an attachment there were no emails with images attached, which makes sense. Also, among emails with an attachment, slightly more attachments were non-graphical:

email %>% 
      mutate(im=image>0, at=attach>0) %>%
      ggplot(aes(x=at, fill=im)) + geom_bar(position="fill")

A yes/no approach to this question could simply count the cases where an email has more images than attachments (i.e. an image that was not counted as an attachment):

sum(email$image > email$attach)
## [1] 0
  • There were no such emails.

Answering questions with chains

Let’s consider the following question:

  • Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?

This can be answered with the following chain:

email %>%
   filter(spam == "non-spam") %>%
   group_by(to_multiple) %>%
   summarize(median(num_char))
## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##         <dbl>              <dbl>
## 1        0                  7.20
## 2        1.00               5.36

The code makes it clear that we are using num_char to measure the length of an email and median() as the measure of what is typical. The answer to the question is “yes”: the typical length of non-spam sent to multiple people is a bit lower than those sent to only one person.

  • For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email?
# Question 1
email %>%
  filter(dollar>0) %>%
  group_by(spam) %>%
  summarize(median(dollar))
## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fctr>              <dbl>
## 1 non-spam             4.00
## 2 spam                 2.00
  • If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not-spam?
email %>%
  filter(dollar>10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()

Explore the association between number variable and spam

# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(x=number)) + geom_bar() + facet_wrap(~spam)