Intro To Data

Dec 28, 2017 00:00 · 5326 words · 26 minute read english R

Intro

Libraries used in this post:

Datasets:
- evals
- ucb_admit
- us_regions

Language of data

## 'data.frame':    50 obs. of  21 variables:
##  $ spam        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int  0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct, format: "2012-01-04 14:19:16" "2012-02-16 21:10:06" ...
##  $ image       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num  0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num  21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int  551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num  1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num  1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num  8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

Type of variables

  • Numerical (quantitative): numerical values
    - Continuous: infinite number of values within a given range, often measured
    - Discrete: specific set of numeric values that can be counted or enumerated, often counted

  • Categorical (qualitative): limited number of distinct categories
    - Ordinal: finite number of categories that have an inherent ordering
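
As a rough illustration of how these types can show up in R, the sketch below builds a tiny data frame from a few values printed in the str() output above. The name emails_toy is made up for this example, and storing number as an ordered factor is my own choice (the original data stores it as a plain factor).

```r
# Toy data frame; values are the first three rows of the str() output above
emails_toy <- data.frame(
  num_char    = c(21.705, 7.011, 0.631),                  # continuous numerical
  line_breaks = c(551L, 183L, 28L),                       # discrete numerical (a count)
  winner      = factor(c("no", "no", "no"),
                       levels = c("no", "yes")),          # categorical, no ordering
  number      = factor(c("small", "big", "none"),
                       levels = c("none", "small", "big"),
                       ordered = TRUE)                    # ordinal (assumption: stored as ordered)
)

# First class of each column
var_classes <- sapply(emails_toy, function(x) class(x)[1])
print(var_classes)
```

The class alone does not settle the statistical type: a 0/1 column such as spam is stored as numeric but is categorical in meaning.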

## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 14:19:16, 2012-02-16 21:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fctr> no, no, no, no, no, no, no, no, no, no, no, no, ...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fctr> small, big, none, small, small, small, small, sm...

Categorical data

  • Often stored as factors in R
    - Important use: statistical modeling
    - Sometimes undesirable, sometimes essential
  • Common in subgroup analysis
    - Only interested in a subset of the data
    - Filter for specific levels of categorical variable

Filtering based on a factor

## Observations: 7
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 21:10:06, 2012-02-05 00:26:09, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fctr> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fctr> big, big, big, big, big, big, big
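
A filter like the one that produced the output above might look as follows. This is a minimal sketch on a made-up five-row stand-in (email_toy) whose values come from the first rows printed earlier; the real post presumably filtered the full email50 data.

```r
library(dplyr)

# Five-row stand-in for email50 (hypothetical name, values from the output above)
email_toy <- data.frame(
  num_char = c(21.705, 7.011, 0.631, 2.454, 41.623),
  number   = factor(c("small", "big", "none", "small", "small"),
                    levels = c("none", "small", "big"))
)

# Keep only the emails whose number level is "big"
email_big <- email_toy %>%
  filter(number == "big")

print(email_big)
```

Note that the filtered factor still carries all three levels, which is what the next section addresses.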

Complete filtering based on a factor

The droplevels() function removes unused levels of factor variables from your dataset.

NOTE: droplevels() drops every unused level automatically; you do not need to name the levels to remove.

## 
##  none small   big 
##     0     0     7
## 
## big 
##   7
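
The two tables above (before and after dropping levels) can be reproduced in miniature with base R. The vector below is a made-up stand-in for the number column, so the counts differ from the output shown.

```r
# Stand-in for email50$number (values from the first rows printed earlier)
number <- factor(c("small", "big", "none", "small", "small"),
                 levels = c("none", "small", "big"))

# Subsetting keeps all three levels, even those with zero observations
big_only <- number[number == "big"]
print(table(big_only))

# droplevels() removes the levels that no longer occur
big_only_clean <- droplevels(big_only)
print(table(big_only_clean))
```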

Discretize a variable

Combining levels of a different factor

Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels “none”, “small”, and “big”, but suppose you’re only interested in whether an email contains a number. In this exercise, you will create a variable containing this information and also visualize it.

  • Create a new variable in email50 called number_yn that is “no” if there is no number in the email and “yes” if there is a small or a big number. The ifelse() function may prove useful here.
  • Run the code provided to visualize the distribution of the number_yn variable.
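
One way to build number_yn with ifelse() is sketched below on a small stand-in vector. In the post this would more likely be done inside dplyr::mutate() on email50; that part is an assumption.

```r
# Stand-in for email50$number (values from the first rows printed earlier)
number <- factor(c("small", "big", "none", "small", "small"),
                 levels = c("none", "small", "big"))

# "no" when the email contains no number, "yes" for a small or big number
number_yn <- ifelse(number == "none", "no", "yes")

print(table(number_yn))
```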

Visualizing numerical data

  • The plots in this section require ggplot2

Visualizing numerical and categorical data

Create a scatterplot of number of exclamation points in the email message (exclaim_mess) vs. number of characters (num_char).

  • Color points by whether or not the email is spam.
  • Note that the spam variable is stored as numerical (0/1) but you want to use it as a categorical variable in this plot. To do this, you need to force R to think of it as such with the factor() function.
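
A possible version of this plot, sketched on five rows taken from the str() output above (email_toy is a made-up name; the post would use the full email50 data):

```r
library(ggplot2)

# First five rows of the relevant columns, as printed in the str() output
email_toy <- data.frame(
  num_char     = c(21.705, 7.011, 0.631, 2.454, 41.623),
  exclaim_mess = c(8, 1, 2, 1, 43),
  spam         = c(0, 0, 1, 0, 0)
)

# factor(spam) makes ggplot2 treat the 0/1 column as categorical,
# so it gets two discrete colors instead of a continuous gradient
p <- ggplot(email_toy, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

p
```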

Study types and cautionary tales

Chapter covers:
- observational studies and experiments,
- scope of inference,
- and Simpson’s paradox.

Observational studies and experiments

  • Observational study:
    - Collect data in a way that does not directly interfere with how the data arise
    - Only correlation can be inferred
  • Experiment:
    - Randomly assign subjects to various treatments
    - Causation can be inferred
  • In experiments, the decision to do something or not is not left up to the participants but is made by the researchers

Identify study type

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared.

What type of study is this?

  • Experiment
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Random sampling and random assignment

Random sampling:
    At selection of subjects from population
    Helps generalizability of results
Random assignment:
    Assignment of subjects to various treatments
    Helps infer causation from results
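
The two steps can be told apart in code. The sketch below uses an invented population of 1000 ids and two invented treatment labels, purely to show where each kind of randomness enters.

```r
set.seed(42)

# Hypothetical population of 1000 subjects
population <- data.frame(id = 1:1000)

# Random sampling: decides WHO is in the study (supports generalizing)
in_study <- population[sample(nrow(population), 20), , drop = FALSE]

# Random assignment: decides WHICH treatment each sampled subject gets
# (supports causal conclusions); 10 subjects per treatment
in_study$treatment <- sample(rep(c("A", "B"), each = 10))

print(table(in_study$treatment))
```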

Scope of inference:

Random sampling or random assignment?

One of the early studies linking smoking and lung cancer compared patients who are already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, proportions of smokers for patients with and without lung cancer were compared.

Does this study employ random sampling and/or random assignment?

Neither random sampling
    The study dealt only with patients who were already hospitalized. It would not be appropriate to apply the findings back to the population as a whole.
nor random assignment
    The conditions are not imposed on the patients by the people conducting the study.
    If the researchers had randomly assigned one group of people to smoke and the other not to, this would be random assignment.

Identify the scope of inference of study

Volunteers were recruited to participate in a study where they were asked to type 40 bits of trivia—for example, “an ostrich’s eye is bigger than its brain”—into a computer. A randomly selected half of these subjects were told the information would be saved in the computer; the other half were told the items they typed would be erased.

Then, the subjects were asked to remember these bits of trivia, and the number of bits of trivia each subject could correctly recall was recorded. It was found that the subjects were significantly more likely to remember information if they thought they would not be able to find it later.

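The results of the study cannot be generalized to all people, because the subjects were volunteers rather than a random sample.
Because the subjects were randomly assigned to the two conditions, a causal link between believing information is stored and memory can be inferred based on these results.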

Simpson’s paradox

  When the relationship between two variables is reversed once a third variable is introduced
  e.g. a grouping variable is added and it's then clear that the trend is negative within both groups, even though the overall trend looked positive.

Number of males and females admitted

In order to calculate the number of males and females admitted, I use: count() from the dplyr package and spread() from the tidyr package.

In one step, count() groups the data by certain variables (in this case, admission status and gender) and then counts the number of observations in each category. These counts are available under a new variable called n.

spread() simply reorganizes the output across columns based on a key-value pair, where a pair contains a key that explains what the information describes and a value that contains the actual information. spread() takes the name of the dataset as its first argument, the name of the key column as its second argument, and the name of the value column as its third argument, all specified without quotation marks.

## # A tibble: 4 x 3
##   Admit    Gender     n
##   <fctr>   <fctr> <int>
## 1 Admitted Male    1198
## 2 Admitted Female   557
## 3 Rejected Male    1493
## 4 Rejected Female  1278
## # A tibble: 2 x 3
##   Gender Admitted Rejected
## * <fctr>    <int>    <int>
## 1 Male       1198     1493
## 2 Female      557     1278

Note: there is no need to call group_by() before count(); passing comma-separated column names to count() does the grouping.
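
The count() and spread() steps described above can be sketched as follows. The post's ucb_admit data frame is not shown, so this example rebuilds a one-row-per-applicant version from base R's UCBAdmissions table; that reconstruction, and the intermediate object names, are assumptions.

```r
library(dplyr)
library(tidyr)

# UCBAdmissions ships with base R as a 3-way table of counts;
# expand it to one row per applicant (assumed shape of ucb_admit)
ucb_freq  <- as.data.frame(UCBAdmissions)
ucb_admit <- ucb_freq[rep(seq_len(nrow(ucb_freq)), ucb_freq$Freq),
                      c("Admit", "Gender", "Dept")]

# count() groups by Admit and Gender and stores the counts in column n
admit_long <- ucb_admit %>%
  count(Admit, Gender)

# spread() turns the Admit levels (key) into columns filled with n (value)
admit_wide <- admit_long %>%
  spread(Admit, n)

print(admit_long)
print(admit_wide)
```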

Proportion of males admitted overall

## # A tibble: 2 x 4
##   Gender Admitted Rejected Perc_Admit
##   <fctr>    <int>    <int>      <dbl>
## 1 Male       1198     1493      0.445
## 2 Female      557     1278      0.304

It looks like 44.5% of males were admitted versus only 30.4% of females.
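
The Perc_Admit column is simply admitted applicants divided by all applicants of that gender. A minimal sketch, typing the counts in from the table above (the object name admit_by_gender is made up):

```r
library(dplyr)

# Counts copied from the wide table shown earlier
admit_by_gender <- data.frame(
  Gender   = c("Male", "Female"),
  Admitted = c(1198, 557),
  Rejected = c(1493, 1278)
)

# Proportion admitted = admitted / (admitted + rejected)
admit_by_gender <- admit_by_gender %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))

print(admit_by_gender)
```

Despite its name, Perc_Admit is a proportion between 0 and 1, not a percentage.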

Proportion of males admitted for each department

Make a table similar to the one constructed earlier, except first group the data by department. Then, use this table to calculate the proportion of males admitted in each department.

## # A tibble: 12 x 4
##    Dept  Gender Admitted Rejected
##  * <chr> <fctr>    <int>    <int>
##  1 A     Male        512      313
##  2 A     Female       89       19
##  3 B     Male        353      207
##  4 B     Female       17        8
##  5 C     Male        120      205
##  6 C     Female      202      391
##  7 D     Male        138      279
##  8 D     Female      131      244
##  9 E     Male         53      138
## 10 E     Female       94      299
## 11 F     Male         22      351
## 12 F     Female       24      317
## # A tibble: 12 x 5
##    Dept  Gender Admitted Rejected Perc_Admit
##    <chr> <fctr>    <int>    <int>      <dbl>
##  1 A     Male        512      313     0.621 
##  2 A     Female       89       19     0.824 
##  3 B     Male        353      207     0.630 
##  4 B     Female       17        8     0.680 
##  5 C     Male        120      205     0.369 
##  6 C     Female      202      391     0.341 
##  7 D     Male        138      279     0.331 
##  8 D     Female      131      244     0.349 
##  9 E     Male         53      138     0.277 
## 10 E     Female       94      299     0.239 
## 11 F     Male         22      351     0.0590
## 12 F     Female       24      317     0.0704
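
A by-department table like the one above could be produced along these lines. As before, ucb_admit is rebuilt from base R's UCBAdmissions table, which is an assumption about the shape of the post's data; female_higher is a made-up helper object.

```r
library(dplyr)
library(tidyr)

# One row per applicant, rebuilt from the base R UCBAdmissions table
ucb_freq  <- as.data.frame(UCBAdmissions)
ucb_admit <- ucb_freq[rep(seq_len(nrow(ucb_freq)), ucb_freq$Freq),
                      c("Admit", "Gender", "Dept")]

# Same recipe as before, with Dept added as the first grouping variable
by_dept <- ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  spread(Admit, n) %>%
  mutate(Perc_Admit = Admitted / (Admitted + Rejected))

# In how many departments is the female admission rate the higher one?
female_higher <- by_dept %>%
  group_by(Dept) %>%
  summarize(female_higher = Perc_Admit[Gender == "Female"] >
                            Perc_Admit[Gender == "Male"])

print(by_dept)
print(sum(female_higher$female_higher))
```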

Contingency table results by group

Which of the following best describes the relationship between admission status and gender?

## # A tibble: 2 x 2
##   Gender `mean(Perc_Admit)`
##   <fctr>              <dbl>
## 1 Male                0.381
## 2 Female              0.417
  • Within most departments, female applicants are more likely to be admitted.

Recap: Simpson’s paradox

Overall: males are more likely to be admitted
But within most departments: females more likely
When controlling for department, relationship between gender and admission status is reversed
Potential reason:
    Women tended to apply to competitive departments with low admission rates
    Men tended to apply to less competitive departments with high admission rates

Sampling strategies and experimental design

Sampling strategies

A census is the procedure of systematically acquiring and recording information about the members of a given population. The term is used mostly in connection with national population and housing censuses; other common censuses include agriculture, business, and traffic censuses.

Why not take a census?
    Conducting a census is very resource intensive
    (Nearly) impossible to collect data from all individuals, hence no guarantee of unbiased results. Some types of people may have more reason to avoid your survey.
    Populations constantly change
Sampling is like tasting your soup as you make it to see if it's salty.
    Stir well and then you can infer the taste of the soup from the small sample.
    But there are many sampling strategies in the real world…

Sample strategies:

Simple Random sample
    each case is equally likely to be selected

Stratified sample

Divide the population into homogeneous groups and then randomly sample from within each group
e.g. using zipcode or income level as a stratum and sampling equal numbers of people from each. 

Cluster sample

Divide the population into clusters, randomly pick a few clusters, then sample every observation within the chosen clusters.
Each cluster is heterogeneous internally, and the clusters are similar to one another, so we can get away with sampling just a few of them.
e.g. cities could be clusters

Multistage sample

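Cluster sampling in more than one stage: randomly pick a few clusters, then randomly sample observations within each chosen cluster
Often used for economic reasons
e.g. divide a city into similar geographical regions and then sample from some of them to avoid having to travel to every region.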
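
The difference between cluster and multistage sampling can be sketched in base R. Everything here is invented for illustration: a population of 200 ids split into 10 equally sized clusters named city_1 to city_10.

```r
set.seed(1)

# Hypothetical population: 10 clusters of 20 subjects each
population <- data.frame(
  id      = 1:200,
  cluster = rep(paste0("city_", 1:10), each = 20)
)

# Stage 1 (both strategies): randomly pick 3 clusters
chosen <- sample(unique(population$cluster), 3)

# Cluster sample: take everyone in the chosen clusters
cluster_sample <- population[population$cluster %in% chosen, ]

# Multistage sample: randomly sample 5 subjects within each chosen cluster
multistage_sample <- do.call(rbind, lapply(chosen, function(cl) {
  members <- population[population$cluster == cl, ]
  members[sample(nrow(members), 5), ]
}))

print(nrow(cluster_sample))
print(nrow(multistage_sample))
```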

Sampling strategies, determine which

A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.

What sampling strategy has this company used?

Stratified sample

A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.

Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.

Which approach would likely be the least effective for selecting the schools where the survey will be conducted?

  Cluster sampling, where each cluster is a neighborhood 

This sampling strategy would be a bad idea because each neighborhood has a unique socioeconomic status. A good study would collect information about every neighborhood.

Sampling in R

simple random sample
    dplyr: sample_n()
stratified sample
    first group by the stratifying variable (here, region), then sample

Simple random sample in R

Collect some data from a sample of eight states. A list of all states and the region they belong to (Northeast, Midwest, South, West) are given in the us_regions data frame.

## # A tibble: 3 x 2
## # Groups: region [3]
##   region        n
##   <fctr>    <int>
## 1 Midwest       3
## 2 Northeast     3
## 3 South         2

Notice that this strategy selects an unequal number of states from each region. In the next exercise, you’ll implement stratified sampling to select an equal number of states from each region.
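
A simple random sample of eight states might be drawn as below. The us_regions data frame is not available here, so the sketch builds a stand-in (us_regions_toy, a made-up name) from base R's state.name and state.region; its region labels and state list may differ from the post's data, and the seed is arbitrary, so the counts will not match the output above.

```r
library(dplyr)

set.seed(2017)

# Stand-in for us_regions, built from datasets shipped with base R
us_regions_toy <- data.frame(state = state.name, region = state.region)

# Simple random sample: every state is equally likely to be picked
states_srs <- us_regions_toy %>%
  sample_n(8)

# Number of sampled states per region (generally unequal)
srs_counts <- states_srs %>%
  count(region)

print(srs_counts)
```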

Stratified sample in R

The goal of stratified sampling is to select an equal number of states from each region

## # A tibble: 4 x 2
## # Groups: region [4]
##   region        n
##   <fctr>    <int>
## 1 Midwest       2
## 2 Northeast     2
## 3 South         2
## 4 West          2

In a stratified sample, each stratum (i.e. Region) is represented equally.
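
A stratified version groups by region before sampling, so sample_n() draws within each stratum. As in the previous sketch, us_regions_toy is a stand-in built from base R data, not the post's us_regions.

```r
library(dplyr)

set.seed(2017)

# Stand-in for us_regions, built from datasets shipped with base R
us_regions_toy <- data.frame(state = state.name, region = state.region)

# Stratified sample: two states drawn at random within each region
states_str <- us_regions_toy %>%
  group_by(region) %>%
  sample_n(2)

# Every region now contributes the same number of states
str_counts <- states_str %>%
  count(region)

print(str_counts)
```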

Compare SRS vs. stratified sample

Which method, simple random sampling or stratified sampling, ensures an equal number of states from each region?

  • stratified sampling

Principles of experimental design

  • Control: compare treatment of interest to a control group
  • Randomize: randomly assign subjects to treatments
  • Replicate: collect a sufficiently large sample within a study, or replicate the entire study
  • Block: account for the potential effect of confounding variables
    - Group subjects into blocks based on these variables
    - Randomize within each block to treatment groups
    - e.g. block on gender (male and female) or on prior programming experience
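
Randomizing within blocks can be sketched in base R. The twelve subjects, the gender block, and the treatment labels below are all invented for illustration.

```r
set.seed(1)

# Hypothetical subjects, blocked on gender (6 per block)
subjects <- data.frame(
  id     = 1:12,
  gender = rep(c("female", "male"), each = 6)
)

# Randomize separately inside each block so that both groups
# end up with the same gender composition
subjects$group <- NA_character_
for (g in unique(subjects$gender)) {
  idx <- which(subjects$gender == g)
  subjects$group[idx] <- sample(rep(c("treatment", "control"),
                                    length.out = length(idx)))
}

print(table(subjects$gender, subjects$group))
```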

Identifying components of a study

A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

What variables are involved in this study?

2 explanatory variables (light and noise)
1 blocking variable (gender)
1 response variable (exam performance)

Experimental design terminology

Explanatory variables are conditions you can impose on the experimental units, while blocking variables are characteristics that the experimental units come with that you would like to control for.

Connect blocking and stratifying

In random sampling, you use stratifying to control for a variable.
In random assignment, you use blocking to achieve the same goal.

Case study

Inspect Data

The purpose of this chapter is to give you an opportunity to apply and practice what you’ve learned on a real world dataset. For this reason, we’ll provide a little less guidance than usual.

## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...

What type of study is this?

  • Observational study

The data from this study were gathered by ___.

  • randomly sampling classes

Identify variable types

It’s always useful to start your exploration of a dataset by identifying variable types.

## Observations: 463
## Variables: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5...
## $ rank          <fctr> tenure track, tenure track, tenure track, tenur...
## $ ethnicity     <fctr> minority, minority, minority, minority, not min...
## $ gender        <fctr> female, female, female, female, male, male, mal...
## $ language      <fctr> english, english, english, english, english, en...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, ...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000...
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24,...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, ...
## $ cls_level     <fctr> upper, upper, upper, upper, upper, upper, upper...
## $ cls_profs     <fctr> single, single, single, single, multiple, multi...
## $ cls_credits   <fctr> multi credit, multi credit, multi credit, multi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, ...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, ...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, ...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000,...
## $ pic_outfit    <fctr> not formal, not formal, not formal, not formal,...
## $ pic_color     <fctr> color, color, color, color, color, color, color...

Recode variables

The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is

"small" (18 students or fewer),
"midsize" (19 - 59 students), or
"large" (60 students or more).

Since you’d like to have three distinct levels (instead of just two), you will need a nested call to ifelse(), which means that you’ll call ifelse() a second time from within your first call to ifelse().

The cls_type variable is a categorical variable, stored as a character vector. You could have made it a factor variable by wrapping the nested ifelse() statements inside factor().
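
The nested ifelse() call might look like this. The vector holds the first cls_students values printed above plus two boundary values (18 and 60) that I added to show where the cut-offs fall; in the post this would more likely sit inside mutate() on evals.

```r
# First printed cls_students values, plus 18 and 60 added as boundary checks
cls_students <- c(43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 18, 60)

# The outer ifelse() handles "small"; the inner one splits the rest
cls_type <- ifelse(cls_students <= 18, "small",
                   ifelse(cls_students <= 59, "midsize", "large"))

print(table(cls_type))

# Optional: store as a factor with the levels in a sensible order
cls_type_f <- factor(cls_type, levels = c("small", "midsize", "large"))
```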

Create a scatterplot

The bty_avg variable shows the average beauty rating of the professor by the six students who were asked to rate the attractiveness of these faculty. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.

Create a scatterplot, with an added layer

Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).
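
A sketch of such a plot, using only the first seven rows visible in the glimpse() output above (evals_toy is a made-up name, and seven rows are far too few to say anything about the real relationship):

```r
library(ggplot2)

# First seven rows of the relevant columns, as printed above
evals_toy <- data.frame(
  score        = c(4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8),
  bty_avg      = c(5, 5, 5, 5, 3, 3, 3),
  cls_students = c(43, 125, 125, 123, 20, 40, 44)
)

# Recode class size as in the previous section
evals_toy$cls_type <- ifelse(evals_toy$cls_students <= 18, "small",
                             ifelse(evals_toy$cls_students <= 59, "midsize", "large"))

# Base scatterplot of score vs. bty_avg, with class type added as a color layer
p <- ggplot(evals_toy, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()

p
```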
