Intro To Tidyverse

Dec 31, 2017 00:00 · 4207 words · 20 minute read english R Data Manipulation

Notes for the DataCamp course Introduction to the Tidyverse

This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

Data wrangling

In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.

Loading the gapminder and dplyr packages

## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
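The listing above has the shape of glimpse() output. The code itself was not kept in these notes, so the following is only a sketch of what probably produced it, assuming the gapminder and dplyr packages are installed:

```r
# Load the dataset package and dplyr (both assumed to be installed).
library(gapminder)
library(dplyr)

# Compact, column-wise overview of the data frame:
# one line per variable, with its type and first few values.
glimpse(gapminder)
```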

Understanding a data frame

Now that you’ve loaded the gapminder dataset, you can start examining and understanding it.

We’ve already loaded the gapminder and dplyr packages. Type gapminder in your R terminal, to the lower right, to display the object.

How many observations (rows) are in the dataset?

  • Observations: 1,704

Filtering for one year

The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.

## # A tibble: 142 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934       821
##  2 Albania     Europe     1957    59.3  1476505      1942
##  3 Algeria     Africa     1957    45.7 10270856      3014
##  4 Angola      Africa     1957    32.0  4561361      3828
##  5 Argentina   Americas   1957    64.4 19610538      6857
##  6 Australia   Oceania    1957    70.3  9712569     10950
##  7 Austria     Europe     1957    67.5  6965860      8843
##  8 Bahrain     Asia       1957    53.8   138655     11636
##  9 Bangladesh  Asia       1957    39.3 51365468       662
## 10 Belgium     Europe     1957    69.2  8989111      9715
## # ... with 132 more rows
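A call along these lines would give the 142-row table above. This is a sketch, not the course's exact code; the name gapminder_1957 is chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Keep only the rows whose year is 1957.
# Note the double == for comparison (a single = would be an argument name).
gapminder_1957 <- gapminder %>%
  filter(year == 1957)

gapminder_1957
```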

Filtering for one country and one year

You can also use the filter() verb to set two conditions, which could retrieve a single observation.

Just like in the last exercise, you can do this in two lines of code, starting with gapminder %>% and having the filter() on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you’ll put the pipe %>% at the end of the first line (like gapminder %>%).

## # A tibble: 1 x 6
##   country continent  year lifeExp        pop gdpPercap
##   <fctr>  <fctr>    <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2002    72.0 1280400000      3119
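The single row above can be reproduced by passing two conditions to filter(), separated by a comma. A minimal sketch (china_2002 is an illustrative name):

```r
library(gapminder)
library(dplyr)

# Conditions separated by commas are combined with AND.
china_2002 <- gapminder %>%
  filter(country == "China", year == 2002)

china_2002
```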

Arranging observations by life expectancy

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.

## # A tibble: 1,704 x 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fctr>       <fctr>    <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203       737
##  2 Afghanistan  Asia       1952    28.8 8425333       779
##  3 Gambia       Africa     1952    30.0  284320       485
##  4 Angola       Africa     1952    30.0 4232095      3521
##  5 Sierra Leone Africa     1952    30.3 2143249       880
##  6 Afghanistan  Asia       1957    30.3 9240934       821
##  7 Cambodia     Asia       1977    31.2 6978607       525
##  8 Mozambique   Africa     1952    31.3 6446316       469
##  9 Sierra Leone Africa     1957    31.6 2295678      1004
## 10 Burkina Faso Africa     1952    32.0 4469979       543
## # ... with 1,694 more rows

## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fctr>           <fctr>    <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972     31656
##  2 Hong Kong, China Asia       2007    82.2   6980412     39725
##  3 Japan            Asia       2002    82.0 127065841     28605
##  4 Iceland          Europe     2007    81.8    301931     36181
##  5 Switzerland      Europe     2007    81.7   7554661     37506
##  6 Hong Kong, China Asia       2002    81.5   6762476     30209
##  7 Australia        Oceania    2007    81.2  20434176     34435
##  8 Spain            Europe     2007    80.9  40448191     28821
##  9 Sweden           Europe     2007    80.9   9031088     33860
## 10 Israel           Asia       2007    80.7   6426679     25523
## # ... with 1,694 more rows
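The two tables above are the ascending and descending sorts. A sketch of the calls that would produce them (the object names are illustrative):

```r
library(gapminder)
library(dplyr)

# Ascending order is the default: lowest life expectancy first.
lowest_first <- gapminder %>%
  arrange(lifeExp)

# Wrap the variable in desc() for descending order.
highest_first <- gapminder %>%
  arrange(desc(lifeExp))

lowest_first
highest_first
```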

Filtering and arranging

You’ll often need to use the pipe operator (%>%) to chain several dplyr verbs. In this case, you’ll combine a filter() with an arrange() to find the most populous countries in a particular year.

## # A tibble: 142 x 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fctr>         <fctr>    <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000       576
##  2 India          Asia       1957    40.2 409000000       590
##  3 United States  Americas   1957    69.5 171984000     14847
##  4 Japan          Asia       1957    65.5  91563009      4318
##  5 Indonesia      Asia       1957    39.9  90124000       859
##  6 Germany        Europe     1957    69.1  71019069     10188
##  7 Brazil         Americas   1957    53.3  65551171      2487
##  8 United Kingdom Europe     1957    70.4  51430000     11283
##  9 Bangladesh     Asia       1957    39.3  51365468       662
## 10 Italy          Europe     1957    67.8  49182000      6249
## # ... with 132 more rows
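Judging by the output, the year here is 1957 and the sort is on pop, descending. A sketch under that reading (top_pop_1957 is an illustrative name):

```r
library(gapminder)
library(dplyr)

# One verb per line: first narrow down to 1957,
# then sort so the largest populations come first.
top_pop_1957 <- gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))

top_pop_1957
```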

Using mutate to change or create a column

Suppose we want life expectancy to be measured in months instead of years: you’d have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that’s calculated this way.

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952     346  8425333       779
##  2 Afghanistan Asia       1957     364  9240934       821
##  3 Afghanistan Asia       1962     384 10267083       853
##  4 Afghanistan Asia       1967     408 11537966       836
##  5 Afghanistan Asia       1972     433 13079460       740
##  6 Afghanistan Asia       1977     461 14880372       786
##  7 Afghanistan Asia       1982     478 12881816       978
##  8 Afghanistan Asia       1987     490 13867957       852
##  9 Afghanistan Asia       1992     500 16317921       649
## 10 Afghanistan Asia       1997     501 22227415       635
## # ... with 1,694 more rows

## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap lifeExpMonths
##    <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>         <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333       779           346
##  2 Afghanistan Asia       1957    30.3  9240934       821           364
##  3 Afghanistan Asia       1962    32.0 10267083       853           384
##  4 Afghanistan Asia       1967    34.0 11537966       836           408
##  5 Afghanistan Asia       1972    36.1 13079460       740           433
##  6 Afghanistan Asia       1977    38.4 14880372       786           461
##  7 Afghanistan Asia       1982    39.9 12881816       978           478
##  8 Afghanistan Asia       1987    40.8 13867957       852           490
##  9 Afghanistan Asia       1992    41.7 16317921       649           500
## 10 Afghanistan Asia       1997    41.8 22227415       635           501
## # ... with 1,694 more rows
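The first table overwrites lifeExp, the second adds a lifeExpMonths column. A sketch of both variants (the object names are illustrative):

```r
library(gapminder)
library(dplyr)

# Reusing an existing column name replaces that column.
replaced <- gapminder %>%
  mutate(lifeExp = 12 * lifeExp)

# Using a new name appends a column and keeps the original.
added <- gapminder %>%
  mutate(lifeExpMonths = 12 * lifeExp)

replaced
added
```

Neither call changes gapminder itself: mutate() returns a new data frame.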

Combining filter, mutate, and arrange

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

## # A tibble: 142 x 7
##    country          continent  year lifeExp       pop gdpPercap lifeExpMo~
##    <fctr>           <fctr>    <int>   <dbl>     <int>     <dbl>      <dbl>
##  1 Japan            Asia       2007    82.6 127467972     31656        991
##  2 Hong Kong, China Asia       2007    82.2   6980412     39725        986
##  3 Iceland          Europe     2007    81.8    301931     36181        981
##  4 Switzerland      Europe     2007    81.7   7554661     37506        980
##  5 Australia        Oceania    2007    81.2  20434176     34435        975
##  6 Spain            Europe     2007    80.9  40448191     28821        971
##  7 Sweden           Europe     2007    80.9   9031088     33860        971
##  8 Israel           Asia       2007    80.7   6426679     25523        969
##  9 France           Europe     2007    80.7  61083916     30470        968
## 10 Canada           Americas   2007    80.7  33390141     36319        968
## # ... with 132 more rows
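The three verbs chain in the order the question is asked. A sketch that should match the table above (longest_lived_2007 is an illustrative name):

```r
library(gapminder)
library(dplyr)

longest_lived_2007 <- gapminder %>%
  filter(year == 2007) %>%                 # keep one year
  mutate(lifeExpMonths = 12 * lifeExp) %>% # convert years to months
  arrange(desc(lifeExpMonths))             # highest first

longest_lived_2007
```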

Data visualization

You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with it only as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

Variable assignment

Throughout the exercises in this chapter, you’ll be visualizing a subset of the gapminder data from the year 1952. First, you’ll have to load the ggplot2 package, and create a gapminder_1952 dataset to visualize.
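A minimal sketch of that setup, assuming ggplot2 is installed alongside gapminder and dplyr:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Save the filtered rows under a name with <- so later plots can reuse them.
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
```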

Comparing population and GDP per capita

In the video you learned to create a scatter plot with GDP per capita on the x-axis and life expectancy on the y-axis (the code for that graph is shown here). When you’re exploring data visually, you’ll often need to try different combinations of variables and aesthetics.
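The plot code did not survive in these notes. Going by the description and the heading, the two plots would look roughly like this (a sketch; the object names video_plot and pop_vs_gdp are made up here):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# The plot described from the video: GDP per capita vs. life expectancy.
video_plot <- ggplot(gapminder_1952, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# The variation the heading suggests: population vs. GDP per capita.
pop_vs_gdp <- ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()

# Type a plot's name at the console (or call print() on it) to draw it.
```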

Comparing population and life expectancy

In this exercise, you’ll use ggplot2 to create a scatter plot from scratch, to compare each country’s population with its life expectancy in the year 1952.
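Written from scratch, a ggplot2 scatter plot needs three parts: the data, an aes() mapping, and a geom layer. A sketch (pop_vs_lifeexp is an illustrative name):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# data + aesthetic mapping + point layer
pop_vs_lifeexp <- ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point()

# Type pop_vs_lifeexp at the console to draw it.
```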

Putting the x-axis on a log scale

You previously created a scatter plot with population on the x-axis and life expectancy on the y-axis. Since population is spread over several orders of magnitude, with some countries having a much higher population than others, it’s a good idea to put the x-axis on a log scale.
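The log scale is added as one more layer with scale_x_log10(). A sketch (pop_log_plot is an illustrative name):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Same scatter plot as before, with the x-axis on a log10 scale,
# so each step along the axis is a tenfold increase in population.
pop_log_plot <- ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
```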

Grouping and summarizing

So far you’ve been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group_by() and summarize() verbs, which collapse large datasets into manageable summaries.

Summarizing the median life expectancy

You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().

By the way, dplyr displays some messages when it’s loaded that we’ve been hiding so far. They’ll show up in red and start with:

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

This will occur in future exercises each time you load dplyr: it lists some built-in functions that dplyr masks with its own versions of the same name. You won’t need to worry about this message within this course.

## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          60.7
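The one-row result above comes from summarizing the whole dataset. A sketch of the call, using the column name shown in the output (overall is an illustrative object name):

```r
library(gapminder)
library(dplyr)

# summarize() collapses all rows into one; the name on the left
# of = becomes the column name in the result.
overall <- gapminder %>%
  summarize(medianLifeExp = median(lifeExp))

overall
```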

Summarizing the median life expectancy in 1957

Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.

## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          48.4
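Filtering first restricts the summary to one year. A sketch (median_1957 is an illustrative name):

```r
library(gapminder)
library(dplyr)

median_1957 <- gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))

median_1957
```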

Summarizing multiple variables in 1957

The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.

## # A tibble: 1 x 2
##   medianLifeExp maxGdpPercap
##           <dbl>        <dbl>
## 1          48.4       113523
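Each extra name = expression pair inside summarize() adds a column to the result. A sketch matching the column names in the output above (summary_1957 is an illustrative name):

```r
library(gapminder)
library(dplyr)

summary_1957 <- gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

summary_1957
```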

Summarizing by year

In a previous exercise, you found the median life expectancy and the maximum GDP per capita in the year 1957. Now, you’ll perform those two summaries within each year in the dataset, using the group_by verb.

## # A tibble: 12 x 3
##     year medianLifeExp maxGdpPercap
##    <int>         <dbl>        <dbl>
##  1  1952          45.1       108382
##  2  1957          48.4       113523
##  3  1962          50.9        95458
##  4  1967          53.8        80895
##  5  1972          56.5       109348
##  6  1977          59.7        59265
##  7  1982          62.4        33693
##  8  1987          65.8        31541
##  9  1992          67.7        34933
## 10  1997          69.4        41283
## 11  2002          70.8        44684
## 12  2007          71.9        49357
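Replacing the filter() with a group_by() gives one summary row per year instead of one row in total. A sketch (summary_by_year is an illustrative name):

```r
library(gapminder)
library(dplyr)

# group_by() marks the groups; summarize() then runs once per group.
summary_by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

summary_by_year
```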

Summarizing by continent

You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.

## # A tibble: 5 x 3
##   continent medianLifeExp maxGdpPercap
##   <fctr>            <dbl>        <dbl>
## 1 Africa             40.6         5487
## 2 Americas           56.1        14847
## 3 Asia               48.3       113523
## 4 Europe             67.6        17909
## 5 Oceania            70.3        12247
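Here the filter and the grouping work together: first one year, then one row per continent. A sketch (by_continent_1957 is an illustrative name):

```r
library(gapminder)
library(dplyr)

by_continent_1957 <- gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

by_continent_1957
```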

Summarizing by continent and year

Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.

## # A tibble: 60 x 4
## # Groups: continent [?]
##    continent  year medianLifeExp maxGdpPercap
##    <fctr>    <int>         <dbl>        <dbl>
##  1 Africa     1952          38.8         4725
##  2 Africa     1957          40.6         5487
##  3 Africa     1962          42.6         6757
##  4 Africa     1967          44.7        18773
##  5 Africa     1972          47.0        21011
##  6 Africa     1977          49.3        21951
##  7 Africa     1982          50.8        17364
##  8 Africa     1987          51.6        11864
##  9 Africa     1992          52.4        13522
## 10 Africa     1997          52.8        14723
## # ... with 50 more rows
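group_by() accepts several variables; the summary then has one row per combination (5 continents x 12 years = 60 rows). A sketch (by_continent_year is an illustrative name):

```r
library(gapminder)
library(dplyr)

by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

by_continent_year
```

The result stays grouped by continent, which is what the "Groups:" line in the output reports.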

Visualizing median life expectancy over time

In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. This code is provided for you, and is saved (with <-) as the by_year dataset.

Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.
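A sketch of how by_year and its plot might look. The column name medianLifeExp follows the earlier outputs; geom_point() and expand_limits(y = 0) are choices made here, not confirmed by the notes.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# One row per year with the median life expectancy.
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp))

# Points over time; expand_limits() makes the y-axis start at zero
# so the change is not visually exaggerated.
life_exp_plot <- ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)
```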

Visualizing median GDP per capita per continent over time

In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.
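To compare continents on one plot, summarize by both variables and map continent to color. A sketch; the names by_year_continent and medianGdpPercap are assumptions, since the original code is not in these notes.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# One color per continent makes the separate trends visible.
gdp_trend_plot <- ggplot(by_year_continent,
                         aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)
```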

Comparing median life expectancy and median GDP per continent in 2007

In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.
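A sketch of such a single-year comparison, with one point per continent. The names by_continent_2007, medianLifeExp and medianGdpPercap are assumptions chosen to match the heading.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            medianGdpPercap = median(gdpPercap))

# Five points, one per continent, colored so they can be told apart.
continent_plot <- ggplot(by_continent_2007,
                         aes(x = medianGdpPercap, y = medianLifeExp,
                             color = continent)) +
  geom_point()
```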

Types of visualizations

You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.

Line plots

Bar plots

Visualizing GDP per capita by country in Oceania

You’ve created a plot where each bar represents one continent, showing the median GDP per capita for each. But the x-axis of the bar plot doesn’t have to be the continent: you can instead create a bar plot where each bar represents a country.

In this exercise, you’ll create a bar plot comparing the GDP per capita between the two countries in the Oceania continent (Australia and New Zealand).
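Bar plots in ggplot2 use geom_col() when the bar heights are already in the data. A sketch; the notes do not say which year was used, so 1952 and the name oceania_1952 are assumptions.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Two rows remain: Australia and New Zealand (year assumed to be 1952).
oceania_1952 <- gapminder %>%
  filter(continent == "Oceania", year == 1952)

# x gives the category for each bar, y gives its height.
oceania_plot <- ggplot(oceania_1952, aes(x = country, y = gdpPercap)) +
  geom_col()
```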

Histograms

Visualizing population

A histogram is useful for examining the distribution of a numeric variable. In this exercise, you’ll create a histogram showing the distribution of country populations in the year 1952.
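A histogram needs only an x aesthetic; ggplot2 counts the observations per bin. A minimal sketch (the bin count of 50 is an arbitrary choice made here):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Only x is mapped; the y-axis (count per bin) is computed by the geom.
pop_hist <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(bins = 50)
```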

Visualizing population with x-axis on a log scale

In the last exercise you created a histogram of populations across countries. You might have noticed that there were several countries with a much higher population than others, which causes the distribution to be very skewed, with most of the distribution crammed into a small part of the graph. (Consider that it’s hard to tell the median or the minimum population from that histogram).

To make the histogram more informative, you can try putting the x-axis on a log scale.
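As with the scatter plot, the log scale is one added layer. A sketch (pop_hist_log is an illustrative name):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Binning happens on the log10 scale, which spreads out
# the many small countries and pulls in the few very large ones.
pop_hist_log <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram() +
  scale_x_log10()
```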

Boxplots

Comparing GDP per capita across continents

A boxplot is useful for comparing a distribution of values across several groups. In this exercise, you’ll examine the distribution of GDP per capita by continent. Since GDP per capita varies across several orders of magnitude, you’ll need to put the y-axis on a log scale.
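A sketch of that boxplot. The notes do not state the year, so reusing the 1952 subset is an assumption.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# x is the grouping variable (one box per continent),
# y is the numeric variable whose distribution is compared.
gdp_boxplot <- ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()
```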

Adding a title to your graph

There are many other options for customizing a ggplot2 graph, which you can learn about in other DataCamp courses. You can also learn about them from online resources; looking things up this way is an important skill to develop.

As the final exercise in this course, you’ll practice looking up ggplot2 instructions by completing a task we haven’t shown you how to do.
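The task, per the heading, is adding a title. One way to do it is ggtitle(), added like any other layer. A sketch building on the continent boxplot; the title text and the 1952 subset are illustrative choices.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# ggtitle() sets the plot title; labs(title = ...) is an equivalent alternative.
titled_boxplot <- ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Comparing GDP per capita across continents")
```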

Conclusion

  • Very easy course if you are familiar with dplyr and ggplot2
  • No other packages are discussed
  • Hoped to learn something new
  • A complete case study with minimal guidance would make a nice 5th chapter.