When a dataset is represented as a table or a database, it’s difficult to observe much about it beyond its size and the types of variables it contains. Here I’ll use graphical and numerical techniques to begin uncovering the structure of the data. Which variables suggest interesting relationships? Which observations are unusual?
Libraries used in this post:
Exploring Categorical Data
This section describes how to create graphical and numerical summaries of two categorical variables.
Bar chart expectations
Suppose you’ve asked 30 people, some young, some old, what their preferred flavour of pie is: apple or pumpkin. That data could be summarized in a side-by-side barchart. Here are three possibilities for how it might look.
Which one of the barcharts shows no relationship between age and flavor? In other words, which shows that pie preference is the same for both young and old?
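To make “no relationship” concrete: independence means the conditional distribution of flavor is the same within each age group. A minimal sketch with invented counts (not the actual survey data):

```r
# Hypothetical counts for 30 people; the numbers are made up for illustration.
tab <- matrix(c(9, 6, 9, 6), nrow = 2,
              dimnames = list(flavor = c("apple", "pumpkin"),
                              age    = c("young", "old")))

# Conditional proportions within each age group (columns sum to 1).
# Identical columns mean pie preference doesn't depend on age,
# so the bars for young and old would have the same relative heights.
prop.table(tab, margin = 2)
```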
Contingency table review
I’ll work with the comics dataset. This is a collection of characteristics on all of the superheroes created by Marvel and DC Comics in the last 80 years.
Let’s start by creating a contingency table, which is a useful way to represent the total counts of observations that fall into each combination of the levels of categorical variables.
comics <- read.csv("data/ExplDataAna/comics.csv")
# Print the first rows of the data
head(comics)

##                                  name      id   align        eye
## 1           Spider-Man (Peter Parker)  Secret    Good Hazel Eyes
## 2     Captain America (Steven Rogers)  Public    Good  Blue Eyes
## 3 Wolverine (James \"Logan\" Howlett)  Public Neutral  Blue Eyes
## 4    Iron Man (Anthony \"Tony\" Stark) Public    Good  Blue Eyes
## 5                 Thor (Thor Odinson) No Dual    Good  Blue Eyes
## 6          Benjamin Grimm (Earth-616)  Public    Good  Blue Eyes
##         hair gender  gsm             alive appearances first_appear
## 1 Brown Hair   Male <NA> Living Characters        4043       Aug-62
## 2 White Hair   Male <NA> Living Characters        3360       Mar-41
## 3 Black Hair   Male <NA> Living Characters        3061       Oct-74
## 4 Black Hair   Male <NA> Living Characters        2961       Mar-63
## 5 Blond Hair   Male <NA> Living Characters        2258       Nov-50
## 6    No Hair   Male <NA> Living Characters        2255       Nov-61
##   publisher
## 1    marvel
## 2    marvel
## 3    marvel
## 4    marvel
## 5    marvel
## 6    marvel
# Check levels of align
levels(comics$align)

## [1] "Bad"                "Good"               "Neutral"
## [4] "Reformed Criminals"

# Check the levels of gender
levels(comics$gender)

## [1] "Female" "Male"   "Other"

# Create a 2-way contingency table
table(comics$align, comics$gender)

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0
The contingency table from the last exercise revealed that there are some levels that have very low counts. To simplify the analysis, it often helps to drop such levels.
In R, this requires two steps: first filter out the rows containing those levels, then remove the levels themselves from the factor variable with droplevels(). The second step is necessary because droplevels() only removes levels that no longer appear in the data at all; it would keep a level that still has just 1 or 2 counts.
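A quick toy example (not the comics data) makes the two-step requirement clearer:

```r
# Toy factor with a rarely-used level and a completely unused level.
f <- factor(c("Good", "Good", "Bad", "Rare"),
            levels = c("Good", "Bad", "Rare", "Unused"))

# "Rare" survives: droplevels() only removes levels with zero occurrences.
levels(droplevels(f))

# After filtering first, "Rare" no longer occurs, so droplevels() removes it.
levels(droplevels(f[f != "Rare"]))
```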
tab <- table(comics$align, comics$gender)

# Print tab
tab

##                     
##                      Female Male Other
##   Bad                  1573 7561    32
##   Good                 2490 4809    17
##   Neutral               836 1799    17
##   Reformed Criminals      1    2     0

# Remove align level
comics <- comics %>%
  filter(align != "Reformed Criminals") %>%
  droplevels()
While a contingency table represents the counts numerically, it’s often more useful to represent them graphically.
Here I’ll construct two side-by-side barcharts of the comics data. This shows that there can often be two or more options for presenting the same data. Passing the argument position = "dodge" to geom_bar() says that you want a side-by-side (i.e. not stacked) barchart.
# Load ggplot2
library(ggplot2)

# Create side-by-side barchart of gender by alignment
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "dodge")

# Create side-by-side barchart of alignment by gender
ggplot(comics, aes(x = gender, fill = align)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 90))
Bar chart interpretation
Which of the following interpretations of the bar charts above is not valid?
- Among characters with “Neutral” alignment, males are the most common.
- In general, there is an association between gender and alignment.
- Across all genders, “Bad” is the most common alignment.
- There are more male characters than female characters in this dataset.
The following code generates tables of joint and conditional proportions, respectively:
tab <- table(comics$align, comics$gender)
options(scipen = 999, digits = 3)  # Print fewer digits
prop.table(tab)     # Joint proportions

##          
##             Female     Male    Other
##   Bad     0.082210 0.395160 0.001672
##   Good    0.130135 0.251333 0.000888
##   Neutral 0.043692 0.094021 0.000888

prop.table(tab, 1)  # Conditional on rows

##          
##            Female    Male   Other
##   Bad     0.17161 0.82490 0.00349
##   Good    0.34035 0.65733 0.00232
##   Neutral 0.31523 0.67836 0.00641

prop.table(tab, 2)  # Conditional on columns

##          
##           Female  Male Other
##   Bad      0.321 0.534 0.485
##   Good     0.508 0.339 0.258
##   Neutral  0.171 0.127 0.258
Proportions come in handy when dealing with lots of count data. It’s often easier to draw conclusions from a proportions table than from raw counts. Compare, for example:

##          
##           Female Male Other
##   Bad       1573 7561    32
##   Good      2490 4809    17
##   Neutral    836 1799    17

prop.table(tab, 1)  # Conditional on rows

##          
##            Female    Male   Other
##   Bad     0.17161 0.82490 0.00349
##   Good    0.34035 0.65733 0.00232
##   Neutral 0.31523 0.67836 0.00641
Counts vs. proportions (2)
Bar charts can tell dramatically different stories depending on whether they represent counts or proportions and, if proportions, what the proportions are conditioned on. To demonstrate this difference, I’ll construct two barcharts: one of counts and one of proportions.
The following graphs differ in just one argument, geom_bar(position = "fill"):

# Plot of gender by align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar()

# Plot proportion of gender, conditional on align
ggplot(comics, aes(x = align, fill = gender)) +
  geom_bar(position = "fill") +
  ylab("proportion")
By passing position = "fill" to geom_bar(), I’m saying I want the bars to fill the entire height of the plotting window, thus displaying proportions and not raw counts.
If you are interested in the distribution of alignment of all superheroes, it makes sense to construct a barchart for just that single variable.
You can improve the interpretability of the plot, though, by implementing some sensible ordering. Superheroes that are "Neutral" show an alignment between "Bad" and "Good", so it makes sense to put that bar in the middle.
# Change the order of the levels in align
comics$align <- factor(comics$align, levels = c("Bad", "Neutral", "Good"))

# Create plot of align
ggplot(comics, aes(x = align)) +
  geom_bar()
With ggplot2 it’s easy to split charts into panels, called facets, using facet_wrap():
# Plot of alignment broken down by gender
ggplot(comics, aes(x = align)) +
  geom_bar() +
  facet_wrap(~gender)
**No dataset available.** The piechart is a very common way to represent the distribution of a single categorical variable, but it can be more difficult to interpret than a barchart.
This is a piechart of a dataset called pies that contains the favourite pie flavours of 98 people. Here’s how to improve the representation of these data by constructing a barchart ordered in descending order of count:
# Put levels of flavor in descending order
lev <- c("apple", "key lime", "boston creme", "blueberry", "cherry",
         "pumpkin", "strawberry")
pies$flavor <- factor(pies$flavor, levels = lev)

# Create barchart of flavor
ggplot(pies, aes(x = flavor)) +
  geom_bar(fill = "chartreuse") +
  theme(axis.text.x = element_text(angle = 90))
Exploring Numerical Data
This section explains how to graphically summarize numerical data.
I’ll be working with the cars dataset, which records characteristics of all of the new models of cars for sale in the US in a certain year. I will investigate the distribution of mileage across a categorical variable, but before I get there, it’s essential to familiarize myself with the dataset.
cars <- read.csv("data/ExplDataAna/cars04.csv")
Plot a histogram of city_mpg faceted by suv, a logical variable indicating whether the car is an SUV or not.

# Learn data structure
str(cars)
## 'data.frame':    428 obs. of  19 variables:
##  $ name       : Factor w/ 425 levels "Acura 3.5 RL 4dr",..: 66 67 68 69 70 114 115 133 129 130 ...
##  $ sports_car : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ suv        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wagon      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ minivan    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pickup     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ all_wheel  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ rear_wheel : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ msrp       : int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ dealer_cost: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ eng_size   : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncyl       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepwr   : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ city_mpg   : int  28 28 26 26 26 29 29 26 27 26 ...
##  $ hwy_mpg    : int  34 34 37 37 37 36 36 33 36 33 ...
##  $ weight     : int  2370 2348 2617 2676 2617 2581 2626 2612 2606 2606 ...
##  $ wheel_base : int  98 98 104 104 104 105 105 103 103 103 ...
##  $ length     : int  167 153 183 183 183 174 174 168 168 168 ...
##  $ width      : int  66 66 69 68 69 67 67 67 67 67 ...
# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~suv)

It’s important to note that I can facet a plot by any categorical variable using facet_wrap().
Boxplots and density plots
The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, I could stick to using histograms, but here I’ll try with two alternatives: the box plot and the density plot.
A quick look at unique(cars$ncyl) shows that there are more possible levels of ncyl than I might think. Here, I’ll restrict my attention to the most common levels (4, 6, and 8).

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
Which of the following interpretations of the plot is not valid?
- The highest mileage cars have 4 cylinders.
- The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.
- Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.
- The variability in mileage of 8 cylinder cars is similar to the variability in mileage of 4 cylinder cars.
Marginal and conditional histograms
Now, let’s look at horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.
I’ll be making two plots using the “data pipeline” paradigm. Piping requires dplyr to be loaded (this has been taken care of by loading tidyverse at the beginning).
# Create hist of horsepwr
cars %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of cars' horsepower")

# Create hist of horsepwr for affordable cars
cars %>%
  filter(msrp < 25000) %>%
  ggplot(aes(x = horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower in cars less than $25,000")
Observe the two histograms in the plotting window and decide which of the following is a valid interpretation.
- Cars with around 300 horsepower are more common than cars with around 200 horsepower.
- The highest horsepower car in the less expensive range has just under 250 horsepower.
- Most cars under $25,000 vary from roughly 100 horsepower to roughly 350 horsepower.
Before we take these plots for granted, it’s a good idea to see how things change when we alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It’s good practice to consider several binwidths in order to detect different types of structure in the data.
# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("Cars horsepower distribution with binwidth = 3")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Cars horsepower distribution with binwidth = 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("Cars horsepower distribution with binwidth = 60")
What feature is present in Plot A that’s not found in B or C?
- The most common horsepower is around 200.
- There is a tendency for cars to have horsepower right at 200 or 300 horsepower.
- There is a second mode around 300 horsepower.
Plot A is the only histogram that shows the counts for cars with exactly 200 and 300 horsepower.
While dotplots are a zero-information-loss type of plot, boxplots do their best to simplify what might be unreadable in a dotplot.
Box plots for outliers
In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers.
I will apply this method to the msrp column (manufacturer’s suggested retail price) to detect if there are unusually expensive or cheap cars.

# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

Setting x = 1 is needed here, as there is no grouping variable to map to the x-axis.
Exclude the largest 3-5 outliers by filtering the rows to retain cars priced less than $100,000.

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()
Consider two other columns in the cars dataset: city_mpg and width.
Which is the most appropriate plot for displaying the important features of their distributions?
Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.
# Create plot of city_mpg
cars %>%
  ggplot(aes(x = city_mpg)) +
  geom_density()

# Create plot of width
cars %>%
  ggplot(aes(x = width)) +
  geom_density()

Same as above, but based on boxplots:

# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

# Create plot of width
cars %>%
  ggplot(aes(x = 1, y = width)) +
  geom_boxplot()
Since the city_mpg variable has a much wider range with its outliers, it’s best to display its distribution as a box plot. width, on the other hand, looks pretty good as a density plot.
Visualization in higher dimensions
One way to plot 3 variables at once is by using facet_grid():

ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel)
A very handy feature comes from the labeller = label_both argument, which labels each panel with both the variable name and its value:

ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel, labeller = label_both)
An important thing to remember here is that density plots normalize each distribution, so that area under the plot is one. Before taking these plots for granted we should check raw counts to make sure we draw conclusions based on enough data points:
# Table of counts
table(cars$rear_wheel, cars$pickup)

##        
##         FALSE TRUE
##   FALSE   306   12
##   TRUE     98   12
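As a sanity check on the normalization point above, here is a small sketch (with simulated data, not the cars dataset) confirming that density() scales each curve so its area is approximately 1, regardless of how many observations a group has:

```r
set.seed(42)
small_group <- rnorm(10)    # only 10 observations
large_group <- rnorm(1000)  # 1000 observations

area <- function(x) {
  d <- density(x)
  # Approximate the integral of the curve with a Riemann sum
  # over density()'s evenly spaced grid.
  sum(d$y) * diff(d$x[1:2])
}

area(small_group)  # close to 1
area(large_group)  # also close to 1
```

This is exactly why a tiny group can look just as "heavy" as a large one in an overlaid density plot, and why the raw counts matter.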
3 variable plot
Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, one can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.
# Facet hists using hwy mileage and ncyl
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv, labeller = label_both) +
  ggtitle("Mileage by suv and ncyl")
Here we’ll visit some useful statistics for describing distributions of data.
Calculate center measures
I’ll work with data from gapminder, which tracks demographic data in countries of the world over time. To learn more about it, you can bring up the help file with ?gapminder.
Let’s focus on how the life expectancy differs from continent to continent.
This requires that I conduct analysis not at the country level, but aggregated up to the continent level. This is made possible by the one-two punch of group_by() and summarize(), a very powerful syntax for carrying out the same analysis on different subsets of the full dataset.
library(gapminder)

# Create dataset of 2007 data
gap2007 <- filter(gapminder, year == 2007)

# Compute groupwise mean and median lifeExp
gap2007 %>%
  group_by(continent) %>%
  summarize(mean(lifeExp), median(lifeExp))

## # A tibble: 5 x 3
##   continent `mean(lifeExp)` `median(lifeExp)`
##   <fctr>              <dbl>             <dbl>
## 1 Africa               54.8              52.9
## 2 Americas             73.6              72.9
## 3 Asia                 70.7              72.4
## 4 Europe               77.6              78.6
## 5 Oceania              80.7              80.7

# Generate box plots of lifeExp for each continent
gap2007 %>%
  ggplot(aes(x = continent, y = lifeExp)) +
  geom_boxplot()
Boxplots are generally better, as they allow us to see some degree of variability in the data, which is not the case for the mean itself.
One thing I could do to make a numeric table more informative (without computing variation itself) would be to look at the difference between the mean and the median. Since the mean is more sensitive to outliers, a larger difference between mean and median can be an indication of greater variability in the data:
gap2007 %>%
  group_by(continent) %>%
  summarize(m1 = mean(lifeExp), m2 = median(lifeExp)) %>%
  mutate(dev = m1 - m2) %>%
  arrange(desc(abs(dev)))

## # A tibble: 5 x 4
##   continent    m1    m2    dev
##   <fctr>    <dbl> <dbl>  <dbl>
## 1 Africa     54.8  52.9  1.88
## 2 Asia       70.7  72.4 -1.67
## 3 Europe     77.6  78.6 -0.960
## 4 Americas   73.6  72.9  0.709
## 5 Oceania    80.7  80.7  0
Calculate spread measures
Let’s extend the powerful summarize() syntax to measures of spread. If you’re unsure whether you’re working with symmetric or skewed distributions, it’s a good idea to consider a robust measure like IQR in addition to the usual measures of variance or standard deviation.
# Compute groupwise measures of spread
gap2007 %>%
  group_by(continent) %>%
  summarize(sd(lifeExp), IQR(lifeExp), n())

## # A tibble: 5 x 4
##   continent `sd(lifeExp)` `IQR(lifeExp)` `n()`
##   <fctr>            <dbl>          <dbl> <int>
## 1 Africa            9.63          11.6      52
## 2 Americas          4.44           4.63     25
## 3 Asia              7.96          10.2      33
## 4 Europe            2.98           4.78     30
## 5 Oceania           0.729          0.516     2

# Generate overlaid density plots
gap2007 %>%
  ggplot(aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)
Choose measures for center and spread
Consider the density plots shown here.
What are the most appropriate measures to describe their centers and spreads?
# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp), sd(lifeExp))

## # A tibble: 1 x 2
##   `mean(lifeExp)` `sd(lifeExp)`
##             <dbl>         <dbl>
## 1            73.6          4.44

# Compute stats for population
gap2007 %>%
  summarize(median(pop), IQR(pop))

## # A tibble: 1 x 2
##   `median(pop)` `IQR(pop)`
##           <dbl>      <dbl>
## 1      10517531   26702008
Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust to outliers and non-normal data.
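A tiny sketch with made-up numbers shows why: a single extreme value drags the mean and standard deviation, while the median and IQR barely move.

```r
x     <- c(10, 11, 12, 12, 13)
x_out <- c(x, 300)  # add one extreme outlier

c(mean(x), mean(x_out))      # mean jumps dramatically
c(median(x), median(x_out))  # median is unchanged
c(sd(x), sd(x_out))          # sd explodes
c(IQR(x), IQR(x_out))        # IQR shifts only slightly
```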
Highly skewed distributions can make it very difficult to learn anything from a visualization. Transformations can be helpful in revealing the more subtle structure.
Here I’ll focus on the population variable, which exhibits strong right skew, and transform it with the natural logarithm function (log() in R).

# Create density plot of old variable
gap2007 %>%
  ggplot(aes(x = pop)) +
  geom_density()

# Transform the skewed pop variable
gap2007 <- gap2007 %>%
  mutate(log_pop = log(pop))

# Create density plot of new variable
gap2007 %>%
  ggplot(aes(x = log_pop)) +
  geom_density()
Consider the distribution, shown here, of the life expectancies of the countries in Asia.
gap2007 %>%
  filter(continent == "Asia") %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()
The box plot identifies one clear outlier: a country with a notably low life expectancy. Do you have a guess as to which country this might be? Test your guess in the console using arrange() or filter(), then proceed to building a plot with that country removed.
gap2007 %>%
  filter(continent == "Asia") %>%
  arrange(lifeExp) %>%
  head(5)

## # A tibble: 5 x 7
##   country     continent  year lifeExp      pop gdpPercap log_pop
##   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>   <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923       975    17.3
## 2 Iraq        Asia       2007    59.5 27499638      4471    17.1
## 3 Cambodia    Asia       2007    59.7 14131858      1714    16.5
## 4 Myanmar     Asia       2007    62.1 47761980       944    17.7
## 5 Yemen, Rep. Asia       2007    62.7 22211743      2281    16.9

# Filter for Asia, add column indicating outliers
gap_asia <- gap2007 %>%
  filter(continent == "Asia") %>%
  mutate(is_outlier = lifeExp < 50)

# Remove outliers, create box plot of lifeExp
gap_asia %>%
  filter(!is_outlier) %>%
  ggplot(aes(x = 1, y = lifeExp)) +
  geom_boxplot()
Apply all of the above to explore and summarize a real world dataset in this case study of email spam.
Spam and num_char
Is there an association between spam and the length of an email?
You could imagine a story either way:
- Spam is more likely to be a short message tempting me to click on a link, or
- My normal email is likely shorter since I exchange brief emails with my friends all the time.
Here, I’ll use the openintro package to settle that question.

# Load packages
library(openintro)
library(dplyr)
library(ggplot2)

# Load the email dataset
data(email)
A quick look at the data shows that the spam variable is numeric. I’ll factorize it before continuing so that charts and tables have nice labels.

# Factorize spam
email$spam <- factor(email$spam, labels = c("non-spam", "spam"))
# Compute summary statistics
email %>%
  group_by(spam) %>%
  summarise(mean(num_char), median(num_char), sd(num_char), IQR(num_char))

## # A tibble: 2 x 5
##   spam     `mean(num_char)` `median(num_char)` `sd(num_char)` `IQR(num_ch~
##   <fctr>              <dbl>              <dbl>          <dbl>        <dbl>
## 1 non-spam            11.3                6.83           14.5        13.6
## 2 spam                 5.44               1.05           14.9         2.82

# Create plot
email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = spam, y = log_num_char)) +
  geom_boxplot()
This boxplot shows the log of the number of characters, grouped by spam / non-spam mails. From this plot I can see that spam mails generally have fewer characters than non-spam mails (their median is lower). Non-spam mails also show more variability, including quite a few short mails categorized as ham.
Spam and !!!
Let’s look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, I check if there is a relationship between this variable and whether or not a message is spam.
# Compute center and spread for exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarise(median(exclaim_mess), IQR(exclaim_mess),
            mean(exclaim_mess), sum(exclaim_mess))

## # A tibble: 2 x 5
##   spam     `median(exclaim_mess)` `IQR(exclaim_mess)` `mean(exc~ `sum(exc~
##   <fctr>                    <dbl>               <dbl>      <dbl>     <dbl>
## 1 non-spam                   1.00                5.00       6.51     23130
## 2 spam                       0                   1.00       7.32      2687

# Create density plot for spam and exclaim_mess
email %>%
  mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)
Note that log(0) evaluates to -Inf in R, which isn’t a very useful value!
I can get around this by adding a small number (here 0.1) to the quantity inside the log() function. This small shift to the right won’t affect the results.
# Boxplot
email %>%
  mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = spam, y = log_exclaim_mess)) +
  geom_boxplot()

# Histogram
email %>%
  mutate(exclaim_mess = exclaim_mess + 0.1,
         log_exclaim_mess = log(exclaim_mess)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
  geom_histogram() +
  facet_wrap(~spam)
From the above examples it’s obvious that histograms are the way to go here. I can see clearly that there were very few exclamation marks (in terms of total counts) in spam mails (which feels odd, but it’s true):

# Compute total exclaim_mess by spam
email %>%
  group_by(spam) %>%
  summarise(sum(exclaim_mess))

## # A tibble: 2 x 2
##   spam     `sum(exclaim_mess)`
##   <fctr>                 <dbl>
## 1 non-spam               23130
## 2 spam                    2687
If it was difficult to work with the heavy skew of exclaim_mess, the number of images attached to each email (image) poses even more of a challenge.

# Get a sense of its distribution
table(email$image)

## 
##    0    1    2    3    4    5    9   20 
## 3811   76   17   11    2    2    1    1
This tabulates the number of cases in each category (so there were 3811 emails with 0 images, for example). Given the very low counts at the higher number of images, I’ll collapse image into a categorical variable that indicates whether or not the email had at least one image.
# Create plot of proportion of spam by image
email %>%
  mutate(has_image = image != 0) %>%
  ggplot(aes(x = has_image, fill = spam)) +
  geom_bar(position = "fill")
From here, I can see that an email without an image is more likely to be non-spam than spam. But the same is true also for mails that had an image. It’s also worth noting that non-spam messages are much more heavily represented:

## 
## non-spam     spam 
##     3554      367
Overall, I would say, this plot is not providing much info (correct me if I’m wrong).
In the process of exploring a dataset, you’ll sometimes come across something that will lead you to question how the data were compiled. For example, the variable num_char contains the number of characters in the email, in thousands, so it could take decimal values, but it certainly shouldn’t take negative values.
I can formulate a test to ensure this variable is behaving as expected:
sum(email$num_char < 0)
## [1] 0
When doing arithmetic on logical values, R treats TRUE as 1 and FALSE as 0. Since the sum over the whole vector is zero, I see that every case in the dataset took a value of FALSE in the test. That is, the num_char column is behaving as we expect and taking only non-negative values.
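The coercion rule itself is easy to verify on a toy vector:

```r
# R coerces logicals to integers in arithmetic: TRUE is 1, FALSE is 0.
TRUE + TRUE   # 2

v <- c(-1, 0, 2, 5)
sum(v < 0)    # 1: the number of negative values
mean(v < 0)   # 0.25: the proportion of negative values
```

The same pattern works for proportions: mean() over a logical condition gives the fraction of cases that satisfy it.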
Let’s consider the variables image and attach. Do attached images count as attached files in this dataset?
Design a simple test to determine if images count as attached files. This involves creating a logical condition to compare the values of the two variables, then using sum() to assess every case in the dataset. Recall that the logical operators are < for less than, <= for less than or equal to, > for greater than, >= for greater than or equal to, and == for equal to.
##     
##         0    1    2    3    4    5    6    7    8    9   10   20   21
##   0  3638   98   61   11    1    2    0    0    0    0    0    0    0
##   1     0   60   16    0    0    0    0    0    0    0    0    0    0
##   2     0    0   13    3    0    0    0    0    0    0    0    0    1
##   3     0    0    0    5    1    0    2    2    1    0    0    0    0
##   4     0    0    0    0    1    0    0    0    0    0    1    0    0
##   5     0    0    0    0    0    2    0    0    0    0    0    0    0
##   9     0    0    0    0    0    0    0    0    0    1    0    0    0
##   20    0    0    0    0    0    0    0    0    0    0    0    1    0
A simple table isn’t very clear to read. I could simplify it by recoding both image and attach to yes/no categories:

email %>%
  mutate(im = image > 0, at = attach > 0) %>%
  group_by(im) %>%
  summarise(sum(at))
## # A tibble: 2 x 2
##   im    `sum(at)`
##   <lgl>     <int>
## 1 F           173
## 2 T           110

email %>%
  mutate(im = image > 0, at = attach > 0) %>%
  ggplot(aes(x = im, fill = at)) +
  geom_bar()
The tabular data shows the difference between the green bars: there were more mails with attachments that were not images than mails with attachments that were actual images. It’s not the best way to answer my question.
If I use proportions, I can see that actually all mails that had an image were classified as mails with an attachment:

email %>%
  mutate(im = image > 0, at = attach > 0) %>%
  ggplot(aes(x = im, fill = at)) +
  geom_bar(position = "fill")
I can reverse what’s on the x-axis and show that among mails without an attachment there were no mails with images attached, which makes sense. Also, among mails with an attachment, slightly more attachments were non-graphical:

email %>%
  mutate(im = image > 0, at = attach > 0) %>%
  ggplot(aes(x = at, fill = im)) +
  geom_bar(position = "fill")
A yes/no approach to this question could simply count the number of occurrences where image is greater than attach, i.e. a mail has an image that isn’t counted as an attached file:

sum(email$image > email$attach)

## [1] 0

There were no such mails.
Answering questions with chains
Let’s consider the following question:
- Within non-spam emails, is the typical length of emails shorter for those that were sent to multiple people?
This can be answered with the following chain:
email %>%
  filter(spam == "non-spam") %>%
  group_by(to_multiple) %>%
  summarize(median(num_char))

## # A tibble: 2 x 2
##   to_multiple `median(num_char)`
##         <dbl>              <dbl>
## 1        0                  7.20
## 2        1.00               5.36
The code makes it clear that we are using num_char to measure the length of an email and median() as the measure of what is typical.
The answer to the question is “yes”: the typical length of non-spam sent to multiple people is a bit lower than those sent to only one person.
- For emails containing the word “dollar”, does the typical spam email contain a greater number of occurrences of the word than the typical non-spam email?
# Question 1
email %>%
  filter(dollar > 0) %>%
  group_by(spam) %>%
  summarize(median(dollar))

## # A tibble: 2 x 2
##   spam     `median(dollar)`
##   <fctr>              <dbl>
## 1 non-spam             4.00
## 2 spam                 2.00
- If you encounter an email with greater than 10 occurrences of the word “dollar”, is it more likely to be spam or not-spam?
email %>%
  filter(dollar > 10) %>%
  ggplot(aes(x = spam)) +
  geom_bar()
Explore the association between the number variable and spam:

# Reorder levels
email$number <- factor(email$number, levels = c("none", "small", "big"))

# Construct plot of number
ggplot(email, aes(x = number)) +
  geom_bar() +
  facet_wrap(~spam)