Scalable Data Processing

Jan 25, 2018 00:00 · 6949 words · 33 minute read english R bigmemory iotools

This is a note from the DataCamp Course on Scalable Data Processing in R.

Working with increasingly large data sets

In this chapter, we cover the reasons you need to apply new techniques when data sets are larger than available RAM. We show that importing and exporting data using the base R functions can be slow, and we present some easy ways to remedy this. Finally, we introduce the bigmemory package.

Why is your code slow?

Reading and writing data to the hard drive takes much longer than reading and writing to RAM. This means that if you need to retrieve data from the hard drive, moving it to the CPU - where it can be processed - takes much longer than moving data from RAM to the CPU. A program’s use of resources like RAM, processors, and the hard drive dictates how quickly your R code runs. You can’t change these resources without physically swapping them out for other hardware. However, you can often use the resources you have more efficiently. In particular, if you have a data set that is about the size of RAM, you might be better off keeping most of it on the disk. By loading only the parts of a data set you need, you free up resources so that each part can be processed more quickly.

How does processing time vary by data size?

If you are processing all elements of two data sets, and one data set is bigger, then the bigger data set will take longer to process. However, it’s important to realize that how much longer it takes is not always directly proportional to how much bigger it is. That is, if you have two data sets and one is two times the size of the other, it is not guaranteed that the larger one will take twice as long to process. It could take 1.5 times longer or even four times longer. It depends on which operations are used to process the data set. In this exercise, you’ll use the microbenchmark package. Note: Numbers are specified using scientific notation \[1e5 = 1 * 10^5 = 100,000\]

Note that the resulting graph shows that the execution time is not the same every time. This is because while the computer was executing your R code, it was also doing other things. As a result, it is good practice to run each operation being benchmarked multiple times and to look at the median execution time when evaluating the performance of R code.
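As a minimal sketch of how microbenchmark() is used (random vectors stand in here for the course’s data), you might time the same operation at two different sizes and compare the medians:

library(microbenchmark)

# Time the same operation on inputs of two different sizes; the cost is
# not necessarily proportional to the size
mb <- microbenchmark(
  "1e5" = sort(rnorm(1e5)),
  "1e6" = sort(rnorm(1e6)),
  times = 10
)
mb         # the median column is the most reliable summary
# plot(mb) visualizes the spread of the timings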

Reading a big.matrix object

In this exercise, you’ll create your first file-backed big.matrix object using the read.big.matrix() function. The function is meant to look similar to read.table(), but in addition it needs to know what type of numeric values you want to read (“char”, “short”, “integer”, “double”), the name of the file that will hold the matrix’s data (the backing file), and the name of the file that will hold information about the matrix (a descriptor file). The result is a file on disk holding the values read in, along with a descriptor file that holds extra information (like the number of columns and rows) about the resulting big.matrix object.
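The call looks something like this (a sketch - the file names follow the course’s mortgage sample data and may differ on your machine):

library(bigmemory)

# Create a file-backed big.matrix from the mortgage sample CSV
mort <- read.big.matrix("mortgage-sample.csv",
                        header = TRUE,
                        sep = ",",
                        type = "integer",
                        backingfile = "mortgage-sample.bin",
                        descriptorfile = "mortgage-sample.desc")
dim(mort)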

## [1] 70000    16

Note that this file isn’t that big but the code will work for files that are much bigger.

Attaching a big.matrix object

Now that the big.matrix object is on the disk, we can use the information stored in the descriptor file to instantly make it available during an R session. This means that you don’t have to reimport the data set, which takes more time for larger files. You can simply point the bigmemory package at the existing structures on the disk and begin accessing data without the wait.
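In a fresh session, re-attaching looks roughly like this:

library(bigmemory)

# Attach the existing backing file using only the descriptor file
mort <- attach.big.matrix("mortgage-sample.desc")
dim(mort)
mort[1:6, ]   # subsetting returns an ordinary R matrix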

## [1] 70000    16
##      enterprise record_number msa perc_minority tract_income_ratio
## [1,]          1           566   1             1                  3
## [2,]          1           116   1             3                  2
## [3,]          1           239   1             2                  2
## [4,]          1            62   1             2                  3
## [5,]          1           106   1             2                  3
## [6,]          1           759   1             3                  3
##      borrower_income_ratio loan_purpose federal_guarantee borrower_race
## [1,]                     1            2                 4             3
## [2,]                     1            2                 4             5
## [3,]                     3            8                 4             5
## [4,]                     3            2                 4             5
## [5,]                     3            2                 4             9
## [6,]                     2            2                 4             9
##      co_borrower_race borrower_gender co_borrower_gender num_units
## [1,]                9               2                  4         1
## [2,]                9               1                  4         1
## [3,]                5               1                  2         1
## [4,]                9               2                  4         1
## [5,]                9               3                  4         1
## [6,]                9               1                  2         2
##      affordability year type
## [1,]             3 2010    1
## [2,]             3 2008    1
## [3,]             4 2014    0
## [4,]             4 2009    1
## [5,]             4 2013    1
## [6,]             4 2010    1

You’ve used your knowledge of base R to get information about a big.matrix object.

Creating tables with big.matrix objects

A final advantage of using big.matrix is that if you know how to use R’s matrices, then you know how to use a big.matrix. You can subset columns and rows just as you would with a regular matrix, using a numeric or character vector, and the object returned is an R matrix. Likewise, assignments work the same as with R matrices, and once made they are stored on disk and can be used in the current and future R sessions.

One thing to remember is that $ is not valid for getting a column of either a matrix or a big.matrix.
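For example, something like:

# Subsetting a big.matrix returns an ordinary R matrix ...
mort[1:3, ]

# ... so base R functions such as table() work on extracted columns
table(mort[, "year"])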

##      enterprise record_number msa perc_minority tract_income_ratio
## [1,]          1           566   1             1                  3
## [2,]          1           116   1             3                  2
## [3,]          1           239   1             2                  2
##      borrower_income_ratio loan_purpose federal_guarantee borrower_race
## [1,]                     1            2                 4             3
## [2,]                     1            2                 4             5
## [3,]                     3            8                 4             5
##      co_borrower_race borrower_gender co_borrower_gender num_units
## [1,]                9               2                  4         1
## [2,]                9               1                  4         1
## [3,]                5               1                  2         1
##      affordability year type
## [1,]             3 2010    1
## [2,]             3 2008    1
## [3,]             4 2014    0
## 
##  2008  2009  2010  2011  2012  2013  2014  2015 
##  8468 11101  8836  7996 10935 10216  5714  6734

Don’t forget that this is only a sample of the entire data set, so the values are proportional to the actual total number of mortgages. Does it seem strange that some years had proportionally more total mortgages?

Data summary using bigsummary

Now that you know how to import and attach a big.matrix object, you can start exploring the data stored in this object. As mentioned before, there is a whole suite of packages designed to explore and analyze data stored as a big.matrix object. In this exercise, you will use the biganalytics package to create summaries.
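A sketch of the calls, assuming mort is the attached big.matrix:

library(biganalytics)

colmean(mort)   # column means, computed without pulling the data into RAM
summary(mort)   # min, max, mean, and NA count for each column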

##            enterprise         record_number                   msa 
##             1.3814571           499.9080571             0.8943571 
##         perc_minority    tract_income_ratio borrower_income_ratio 
##             1.9701857             2.3431571             2.6898857 
##          loan_purpose     federal_guarantee         borrower_race 
##             3.7670143             3.9840857             5.3572429 
##      co_borrower_race       borrower_gender    co_borrower_gender 
##             7.0002714             1.4590714             3.0494857 
##             num_units         affordability                  year 
##             1.0398143             4.2863429          2011.2714714 
##                  type 
##             0.5300429
##                                min          max         mean          NAs
## enterprise               1.0000000    2.0000000    1.3814571    0.0000000
## record_number            0.0000000  999.0000000  499.9080571    0.0000000
## msa                      0.0000000    1.0000000    0.8943571    0.0000000
## perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
## tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
## borrower_income_ratio    1.0000000    9.0000000    2.6898857    0.0000000
## loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
## federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
## borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
## co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
## borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
## co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
## num_units                1.0000000    4.0000000    1.0398143    0.0000000
## affordability            0.0000000    9.0000000    4.2863429    0.0000000
## year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
## type                     0.0000000    1.0000000    0.5300429    0.0000000

In some categorical variables, missing values are already encoded with another value (such as 9), so no NAs are listed. In a few sections, we’ll go through how to fix this.

Copying matrices and big matrices

If you want to copy a big.matrix object, then you need to use the deepcopy() function. This can be useful, especially if you want to create smaller big.matrix objects. In this exercise, you’ll copy a big.matrix object and show the reference behavior for these types of objects.
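A sketch of the reference behavior (the column selection here is just illustrative):

library(bigmemory)

mort_copy <- deepcopy(mort, cols = 1:3)   # an independent big.matrix copy
mort_ref  <- mort_copy                    # NOT a copy: both names refer to
                                          # the same data on disk

mort_ref[1, 1] <- NA   # a change made through one name ...
mort_copy[1, 1]        # ... is visible through the other
mort[1, 1]             # the matrix we deep-copied from is unchanged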

## [1] NA
## [1] 1

You know the basics of loading, attaching, subsetting, and copying big.matrix objects. In the next section we’ll explore and begin analyzing the data set.

Processing and Analyzing Data with bigmemory

Now that you’ve got some experience using bigmemory, we’re going to go through some simple data exploration and analysis techniques. In particular, we’ll see how to create tables and implement the split-apply-combine approach.

Tabulating using bigtable

The bigtabulate package provides optimized routines for creating tables and splitting the rows of big.matrix objects. Let’s say you wanted to see the breakdown by ethnicity of mortgages in the housing data. The documentation from the website provides the mapping from the numerical value to ethnicity. In this exercise, you’ll create a table using the bigtable() function, found in the bigtabulate package.
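Something like the following, where the labels come from the data documentation:

library(bigtabulate)

race_table <- bigtable(mort, "borrower_race")
names(race_table) <- c("Native Am", "Asian", "Black", "Pacific Is", "White",
                       "Two or More", "Hispanic", "Not Avail")
race_table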

##   Native Am       Asian       Black  Pacific Is       White Two or More 
##         143        4438        2020         195       50006         528 
##    Hispanic   Not Avail 
##        4040        8630

Borrower Race and Ethnicity by Year (I)

As a second exercise in creating big tables, suppose you want to see the total count by year, rather than for all years at once. Then you would create a table for each ethnicity for each year.
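A sketch; coercing to a data frame is one way to attach the labels as a column:

library(bigtabulate)

race_year <- bigtable(mort, c("borrower_race", "year"))
rt <- as.data.frame.matrix(race_year)   # keep the wide, year-by-column layout
rt$Race <- c("Native Am", "Asian", "Black", "Pacific Is", "White",
             "Two or More", "Hispanic", "Not Avail")
rt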

##   2008 2009 2010 2011 2012 2013 2014 2015        Race
## 1   11   18   13   16   15   12   29   29   Native Am
## 2  384  583  603  568  770  673  369  488       Asian
## 3  363  320  209  204  258  312  185  169       Black
## 4   33   38   21   13   28   22   17   23  Pacific Is
## 5 5552 7739 6301 5746 8192 7535 4110 4831       White
## 6   43   85   65   58   89   78   46   64 Two or More
## 7  577  563  384  378  574  613  439  512    Hispanic
## 9 1505 1755 1240 1013 1009  971  519  618   Not Avail

Some other examples:

##  2008  2009  2010  2011  2012  2013  2014  2015 
##  8468 11101  8836  7996 10935 10216  5714  6734
##   2008 2009 2010 2011 2012 2013 2014 2015
## 0 1064 1343  998  851 1066 1005  504  564
## 1 7404 9758 7838 7145 9869 9211 5210 6170

Female Proportion Borrowing

In the last exercise, you stratified by year and race (or ethnicity). However, there are lots of other ways you can partition the data. In this exercise and the next, you’ll find the proportion of female borrowers in urban and rural areas by year. This exercise is slightly different from the last one because rather than simply finding counts, you want the proportion of female borrowers conditioned on the year.

In this exercise, we have defined a function that finds the proportion of female borrowers for urban and rural areas: female_residence_prop().
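The course doesn’t show the function’s body; a plausible sketch, assuming gender code 2 means female and msa codes 1 and 0 mean urban and rural, is:

# Hypothetical sketch of female_residence_prop(): the proportion of female
# borrowers among the urban rows and among the rural rows of x[rows, ]
female_residence_prop <- function(x, rows) {
  x <- x[rows, ]
  c(mean(x[x[, "msa"] == 1, "borrower_gender"] == 2),   # urban
    mean(x[x[, "msa"] == 0, "borrower_gender"] == 2))   # rural
}

female_residence_prop(mort, which(mort[, "year"] == 2015))   # e.g., for 2015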

## [1] 0.2737439 0.2304965

… If only you could see the proportions for all years…

Split

To calculate the proportions for all years, you will use the function female_residence_prop() defined in the previous exercise along with three other functions:

  • split(): To “split” the mort data by year
  • Map(): To “apply” the function female_residence_prop() to each of the subsets returned from split()
  • Reduce(): To combine the results obtained from Map()

In this exercise, you will “split” the mort data by year.
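For example:

# Split the row numbers of mort by year
spl <- split(seq_len(nrow(mort)), mort[, "year"])
str(spl)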

## List of 8
##  $ 2008: int [1:8468] 2 8 15 17 18 28 35 40 42 47 ...
##  $ 2009: int [1:11101] 4 13 25 31 43 49 52 56 67 68 ...
##  $ 2010: int [1:8836] 1 6 7 10 21 23 24 27 29 38 ...
##  $ 2011: int [1:7996] 11 20 37 46 53 57 73 83 86 87 ...
##  $ 2012: int [1:10935] 14 16 26 30 32 33 48 69 81 94 ...
##  $ 2013: int [1:10216] 5 9 19 22 36 44 55 58 72 74 ...
##  $ 2014: int [1:5714] 3 12 50 60 64 66 103 114 122 130 ...
##  $ 2015: int [1:6734] 34 41 54 61 62 65 82 91 102 135 ...

Did you notice that the result is a named list of row numbers for each of the years?

Apply

In this exercise, you will “apply” the function female_residence_prop() to obtain the proportion of female borrowers for both urban and rural areas for all years using the Map() function.
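Something like:

# Apply female_residence_prop() to the rows belonging to each year
all_years <- Map(function(rows) female_residence_prop(mort, rows), spl)
str(all_years)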

## List of 8
##  $ 2008: num [1:2] 0.275 0.204
##  $ 2009: num [1:2] 0.244 0.2
##  $ 2010: num [1:2] 0.241 0.201
##  $ 2011: num [1:2] 0.252 0.241
##  $ 2012: num [1:2] 0.244 0.21
##  $ 2013: num [1:2] 0.275 0.257
##  $ 2014: num [1:2] 0.289 0.268
##  $ 2015: num [1:2] 0.274 0.23

Combine

You now know the female proportion borrowing for urban and rural areas for all years. However, the result resides in a list. Converting this list to a matrix or data frame is often convenient if you want to calculate summary statistics or visualize the results. In this exercise, you will combine the results into a matrix.
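A sketch, attaching the row and column labels afterwards:

# Stack the yearly results into a matrix, one row per year
prop_female <- Reduce(rbind, all_years)
rownames(prop_female) <- names(all_years)
colnames(prop_female) <- c("prop_female_urban", "prop_female_rural")
prop_female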

##      prop_female_urban prop_femal_rural
## 2008         0.2748514        0.2039474
## 2009         0.2441074        0.2002978
## 2010         0.2413881        0.2014028
## 2011         0.2520644        0.2408931
## 2012         0.2438950        0.2101313
## 2013         0.2751059        0.2567164
## 2014         0.2886756        0.2678571
## 2015         0.2737439        0.2304965

In the next coding exercise, you will visualize these results using ggplot2!

Visualizing Female Proportion Borrowing

Functions in the bigtabulate and biganalytics packages return base R types that can be used just as they would be in any analysis. This means that we can visualize results using ggplot2.

In this exercise, you will visualize the female proportion borrowing for urban and rural areas across all years.
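A sketch, reshaping the matrix from the previous exercise into long format for plotting:

library(ggplot2)

# Build a long-format data frame from the year-by-residence matrix
plot_df <- data.frame(
  year       = rep(as.numeric(rownames(prop_female)), 2),
  proportion = c(prop_female[, 1], prop_female[, 2]),
  residence  = rep(c("urban", "rural"), each = nrow(prop_female))
)

ggplot(plot_df, aes(x = year, y = proportion, color = residence)) +
  geom_line() +
  geom_point()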

The Borrower Income Ratio

The borrower income ratio is the ratio of the borrower’s (or borrowers’) annual income to the median family income of the area for the reporting year. This is the ratio used to determine whether a borrower’s income qualifies for an income-based housing goal.

In the data set mort, missing values are recoded as 9. In this exercise, we replaced the 9’s in the "borrower_income_ratio" column with NA, so you can create a table of the borrower income ratios.
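A sketch of the recode and the table, assuming bigtable()’s useNA argument behaves like that of base table():

library(biganalytics)
library(bigtabulate)

summary(mort)   # before the recode, the 9s show up as a maximum of 9

# Recode 9 ("not applicable") in borrower_income_ratio as NA
na_rows <- which(mort[, "borrower_income_ratio"] == 9)
mort[na_rows, "borrower_income_ratio"] <- NA

# Tabulate the borrower income ratio by year, keeping the NA category
bigtable(mort, c("borrower_income_ratio", "year"), useNA = "ifany")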

##                                min          max         mean          NAs
## enterprise               1.0000000    2.0000000    1.3814571    0.0000000
## record_number            0.0000000  999.0000000  499.9080571    0.0000000
## msa                      0.0000000    1.0000000    0.8943571    0.0000000
## perc_minority            1.0000000    9.0000000    1.9701857    0.0000000
## tract_income_ratio       1.0000000    9.0000000    2.3431571    0.0000000
## borrower_income_ratio    1.0000000    9.0000000    2.6898857    0.0000000
## loan_purpose             1.0000000    9.0000000    3.7670143    0.0000000
## federal_guarantee        1.0000000    4.0000000    3.9840857    0.0000000
## borrower_race            1.0000000    9.0000000    5.3572429    0.0000000
## co_borrower_race         1.0000000    9.0000000    7.0002714    0.0000000
## borrower_gender          1.0000000    9.0000000    1.4590714    0.0000000
## co_borrower_gender       1.0000000    9.0000000    3.0494857    0.0000000
## num_units                1.0000000    4.0000000    1.0398143    0.0000000
## affordability            0.0000000    9.0000000    4.2863429    0.0000000
## year                  2008.0000000 2015.0000000 2011.2714714    0.0000000
## type                     0.0000000    1.0000000    0.5300429    0.0000000
##   2008 2009 2010 2011 2012 2013 2014 2015       BIR
## 1 1205 1473  600  620  745  725  401  380 >=0,<=50%
## 2 2095 2791 1554 1421 1819 1861 1032 1145 >50,<=80%
## 3 4844 6707 6609 5934 8338 7559 4255 5169      >80%
## 4  324  130   73   21   33   71   26   40        NA

Where can you use bigmemory?

The bigmemory package is useful when your data are represented as a dense, numeric matrix and you can store an entire data set on your hard drive. It is also compatible with optimized, low-level linear algebra libraries written in C, like Intel’s Math Kernel Library. So, you can use bigmemory directly in your C and C++ programs for better performance.

If your data isn’t numeric - if you have string variables - or if you need a greater range of numeric types - like 8-bit integers - then you might consider trying the ff package. It is similar to bigmemory but includes a structure similar to a data.frame.

Working with iotools

In this chapter, we’ll use the iotools package, which can process both numeric and string data, and introduce the concept of chunk-wise processing.

Foldable operations (I)

An operation that gives the same answer whether you apply it to an entire data set, or apply it to chunks of the data set and then to the results from those chunks, is sometimes called foldable. The max() and min() operations are examples of this. Here, we have defined a foldable version of the range() function that takes either a vector or a list of vectors. Verify that the function works by testing it on the mortgage data set.
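The course defines the function for you; a sketch of what it might look like (the name fold_range is mine):

# A foldable range: accepts a vector, or a list of partial results
fold_range <- function(x) {
  if (is.list(x)) x <- unlist(x)
  c(min(x), max(x))
}

fold_range(mort[, "record_number"])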

## [1]   0 999

Foldable operations (II)

Now, you’ll use the function on partitions of the data set. You should realize that by performing this operation in pieces and then aggregating, you don’t need to have all of the data in a variable at once. This point isn’t that important with small data sets, like the mortgage sample data, but it is for large data sets.
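The exact partitions used in the exercise aren’t shown; the idea, with an arbitrary split into two halves, is:

# Compute the range on each partition of the record_number column
parts   <- split(mort[, "record_number"], rep(1:2, length.out = nrow(mort)))
partial <- Map(fold_range, parts)
unlist(partial)   # the per-partition ranges, side by side

# Folding the partial results gives the same answer as the full data:
# fold_range(partial)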

## [1]   1 999   1 999

Notice that while each of the partitions could have been large, we were only processing a smaller portion of the rows at any time.

Compare read.delim() and read.delim.raw()

When processing a sequence of contiguous chunks of data on a hard drive, iotools can turn a raw object into a data.frame or matrix while - at the same time - retrieving the next chunk of data. These optimizations allow iotools to quickly process very large files.
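The comparison below is a sketch; the path to mortgage-sample.csv will differ on your machine:

library(microbenchmark)
library(iotools)

microbenchmark(
  read.delim("mortgage-sample.csv", header = FALSE, sep = ","),
  read.delim.raw("mortgage-sample.csv", header = FALSE, sep = ","),
  times = 5
)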

## Unit: milliseconds
##                                                                                                                           expr
##      read.delim("/OneDrive/R/blog/content/posts/data/ScalableDataProcess/mortgage-sample.csv",      header = FALSE, sep = ",")
##  read.delim.raw("/OneDrive/R/blog/content/posts/data/ScalableDataProcess/mortgage-sample.csv",      header = FALSE, sep = ",")
##        min       lq     mean   median        uq       max neval cld
##  712.27780 866.6569 914.1527 887.2416 1013.4007 1091.1867     5   b
##   87.15821 108.1458 173.2231 141.7283  235.0384  294.0447     5  a

Notice that while iotools is a lot faster, you don’t have to wait very long in either case because the data set is small.

Reading raw data and turning it into a data structure

As mentioned before, part of what makes iotools fast is that it separates reading data from the hard drive from converting the binary data into a data.frame or matrix. Data in their binary format are copied from the hard drive into memory as raw objects. These raw objects are then passed to optimized functions that turn them into data.frame or matrix objects.

In this exercise, you’ll learn how to separate reading data from the disk (using the readAsRaw() function) from converting raw binary data into a matrix or data.frame (using the mstrsplit() and dstrsplit() functions).
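A sketch of the two-step approach (skip = 1 drops the header line before parsing):

library(iotools)

# Step 1: read the file's bytes into a raw vector (no parsing yet)
raw_data <- readAsRaw("mortgage-sample.csv")

# Step 2a: convert the raw bytes into an integer matrix ...
mort_mat <- mstrsplit(raw_data, sep = ",", type = "integer", skip = 1)
head(mort_mat)

# Step 2b: ... or into a data.frame, giving the type of each of the 16 columns
mort_df <- dstrsplit(raw_data, col_types = rep("integer", 16),
                     sep = ",", skip = 1)
head(mort_df)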

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    1  566    1    1    3    1    2    4    3     9     2     4     1
## [2,]    1  116    1    3    2    1    2    4    5     9     1     4     1
## [3,]    1  239    1    2    2    3    8    4    5     5     1     2     1
## [4,]    1   62    1    2    3    3    2    4    5     9     2     4     1
## [5,]    1  106    1    2    3    3    2    4    9     9     3     4     1
## [6,]    1  759    1    3    3    2    2    4    9     9     1     2     2
##      [,14] [,15] [,16]
## [1,]     3  2010     1
## [2,]     3  2008     1
## [3,]     4  2014     0
## [4,]     4  2009     1
## [5,]     4  2013     1
## [6,]     4  2010     1
##   V1  V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14  V15 V16
## 1  1 566  1  1  3  1  2  4  3   9   2   4   1   3 2010   1
## 2  1 116  1  3  2  1  2  4  5   9   1   4   1   3 2008   1
## 3  1 239  1  2  2  3  8  4  5   5   1   2   1   4 2014   0
## 4  1  62  1  2  3  3  2  4  5   9   2   4   1   4 2009   1
## 5  1 106  1  2  3  3  2  4  9   9   3   4   1   4 2013   1
## 6  1 759  1  3  3  2  2  4  9   9   1   2   2   4 2010   1

Reading chunks in as a matrix

In this exercise, you’ll write a scalable table() function counting the number of urban and rural borrowers in the mortgage dataset using chunk.apply(). By default, chunk.apply() aggregates the processed data using the rbind() function. This means that you can create a table from each of the chunks and then add up the rows of the resulting matrix to get the total counts for the table.

We have created a file connection fc to the "mortgage-sample.csv" file and read in the first line to get rid of the header.
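Putting it all together, something along these lines (the chunk size is set small here so that several chunks are produced):

library(iotools)

fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)   # discard the header line

# For each chunk: parse it as an integer matrix and tabulate the msa column
msa_by_chunk <- chunk.apply(fc,
  function(chunk) {
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    table(factor(m[, 3], levels = 0:1))   # column 3 is msa: 0 = rural, 1 = urban
  },
  CH.MAX.SIZE = 1e5)    # roughly 100 kB of the file per chunk
close(fc)

msa_by_chunk            # one row per chunk (rbind is the default merge)
colSums(msa_by_chunk)   # fold the chunk tables into the overall counts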

## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
##         0    1
##  [1,] 309 2401
##  [2,] 289 2422
##  [3,] 266 2444
##  [4,] 300 2410
##  [5,] 279 2431
##  [6,] 310 2400
##  [7,] 274 2436
##  [8,] 283 2428
##  [9,] 259 2452
## [10,] 287 2423
## [11,] 288 2423
## [12,] 283 2428
## [13,] 271 2439
## [14,] 299 2411
## [15,] 294 2416
## [16,] 305 2405
## [17,] 280 2431
## [18,] 275 2435
## [19,] 303 2407
## [20,] 279 2431
## [21,] 296 2414
## [22,] 294 2417
## [23,] 288 2424
## [24,] 264 2446
## [25,] 292 2418
## [26,] 228 2013
##     0     1 
##  7395 62605

Reading chunks in as a data.frame

In the previous example, we read each chunk into the processing function as a matrix using mstrsplit(). This is fine when we are reading rectangular data where the type of element in each column is the same. When it’s not, we might like to read the data in as a data.frame. This can be done either by reading a chunk in as a matrix and then converting it to a data.frame, or by using the dstrsplit() function.
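A sketch using dstrsplit() inside the chunk function:

library(iotools)

fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)   # discard the header line

msa_counts <- chunk.apply(fc,
  function(chunk) {
    d <- dstrsplit(chunk, col_types = rep("integer", 16), sep = ",")
    c(rural = sum(d$V3 == 0), urban = sum(d$V3 == 1))   # V3 is msa
  },
  CH.MAX.SIZE = 1e5)
close(fc)

colSums(msa_counts)   # overall counts of rural and urban borrowers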

## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""
## rural urban 
##  7395 62605

Now you can read large matrices or data.frames in chunks, process them, and aggregate the results.

Parallelizing calls to chunk.apply

The chunk.apply() function can also make use of parallel processes to process data more quickly. When the parallel parameter is set to a value greater than one on Linux and Unix machines (including macOS), multiple processes read and process data at the same time, reducing the execution time. On Windows, the parallel parameter is ignored.
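A sketch of the function being benchmarked below; the chunk size and parsing step follow the earlier examples:

library(microbenchmark)
library(iotools)

# Read and parse the mortgage file chunk-wise, with the degree of
# parallelism as a parameter
iotools_read_fun <- function(parallel) {
  fc <- file("mortgage-sample.csv", "rb")
  on.exit(close(fc))
  readLines(fc, n = 1)   # discard the header line
  chunk.apply(fc,
    function(chunk) mstrsplit(chunk, sep = ",", type = "integer"),
    CH.MAX.SIZE = 1e5,
    parallel = parallel)
}

microbenchmark(iotools_read_fun(1), iotools_read_fun(3), times = 20)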

## Unit: milliseconds
##                 expr      min       lq     mean   median       uq      max
##  iotools_read_fun(1) 102.8816 104.4467 116.1784 109.9024 127.9014 146.5071
##  iotools_read_fun(3) 102.8722 109.5406 131.5791 114.8067 139.5321 268.7813
##  neval cld
##     20   a
##     20   a

You should realize that you only get a speed increase when you have spare processors and when you can read from the hard drive faster than you can process data.

Case Study: A Preliminary Analysis of the Housing Data

In the previous chapters, we’ve introduced the housing data and shown how to compute with data that is about as big as, or bigger than, the amount of RAM on a single machine. In this chapter, we’ll go through a preliminary analysis of the data, comparing various trends over time.

Race and Ethnic Representation in the Mortgage Data

In this exercise, you’ll get the race and ethnic proportions of borrowers in the mortgage data set, adjusted by the total number of borrowers. This will turn the race and ethnicity table you created before into a proportion. Later on, you’ll use these values to adjust for the race and ethnic proportions of the US population.
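A sketch, dropping the “Not Available” code (9) before normalizing:

library(bigtabulate)

race_counts <- bigtable(mort, "borrower_race")[1:7]   # drop code 9
names(race_counts) <- c("Native Am", "Asian", "Black", "Pacific Is",
                        "White", "Two or More", "Hispanic")
race_counts / sum(race_counts)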

##   Native Am       Asian       Black  Pacific Is       White Two or More 
## 0.002330129 0.072315464 0.032915105 0.003177448 0.814828092 0.008603552 
##    Hispanic 
## 0.065830210

Are these results surprising based on what you know about demographic proportions in the U.S.?

Comparing the Borrower Race/Ethnicity and their Proportions

In this exercise, you’ll compare the US race and ethnic proportions to the proportion of total borrowers by race or ethnicity. This will provide an initial check of whether each group is borrowing at a rate comparable to its proportional representation in the United States. The task is similar to the last exercise, but this time you’ll use iotools to accomplish it.
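A sketch; the US population proportions are the values supplied in the course exercise, and column 9 of the file is borrower_race:

library(iotools)

fc <- file("mortgage-sample.csv", "rb")
hdr <- readLines(fc, n = 1)   # discard the header line

# Count borrowers by race chunk-wise, dropping the "Not Available" code 9
race_counts <- chunk.apply(fc,
  function(chunk) {
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    race <- m[, 9]
    tabulate(race[race != 9], nbins = 7)
  },
  CH.MAX.SIZE = 1e5)
close(fc)

borrower_prop <- colSums(race_counts) / sum(race_counts)

# US population proportions given in the course exercise
us_prop <- c(0.009, 0.048, 0.126, 0.002, 0.724, 0.029, 0.163)

comparison <- rbind("Population Proportion" = us_prop,
                    "Borrower Proportion"   = borrower_prop)
colnames(comparison) <- c("Native Am", "Asian", "Black", "Pacific Is",
                          "White", "Two or More", "Hispanic")
comparison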

##                         Native Am      Asian      Black  Pacific Is
## Population Proportion 0.009000000 0.04800000 0.12600000 0.002000000
## Borrower Proportion   0.002330129 0.07231546 0.03291511 0.003177448
##                           White Two or More   Hispanic
## Population Proportion 0.7240000 0.029000000 0.16300000
## Borrower Proportion   0.8148281 0.008603552 0.06583021

Who is borrowing at a rate higher than the population proportion? Who is borrowing at a lower rate? Is this what you expected?

Looking for Predictable Missingness

If data are missing completely at random, then you shouldn’t be able to predict when a variable is missing based on the rest of the data. Therefore, if you can predict missingness then the data are not missing completely at random. So, let’s use the glm() function to fit a logistic regression, looking for missingness based on affordability in the mort variable you created earlier. If you don’t find any structure in the missing data - i.e., the slope variables are not significant - it does not mean that you have proven the data are missing at random, but it is plausible.
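A sketch of the model fit, treating the race code 9 (“Not Available”) as missing and affordability as a factor:

# Indicator for a missing borrower race, and affordability as a factor
borrower_race_ind    <- mort[, "borrower_race"] == 9
affordability_factor <- factor(mort[, "affordability"])

# Logistic regression of missingness on affordability
summary(glm(borrower_race_ind ~ affordability_factor, family = binomial))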

## 
## Call:
## glm(formula = borrower_race_ind ~ affordability_factor, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5969  -0.5016  -0.5016  -0.5016   2.0867  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.7478     0.1376 -12.701   <2e-16 ***
## affordability_factor1  -0.2241     0.1536  -1.459   0.1447    
## affordability_factor2  -0.3090     0.1609  -1.920   0.0548 .  
## affordability_factor3  -0.2094     0.1446  -1.448   0.1476    
## affordability_factor4  -0.2619     0.1383  -1.894   0.0582 .  
## affordability_factor9   0.1131     0.1413   0.800   0.4235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52279  on 69999  degrees of freedom
## Residual deviance: 52166  on 69994  degrees of freedom
## AIC: 52178
## 
## Number of Fisher Scoring iterations: 4

It doesn’t look like there is a relationship between missingness in the borrower’s race and the affordability of the home.

Borrower Race and Ethnicity by Year (II)

In this exercise, you’ll use both iotools and bigtabulate to tabulate borrower race and ethnicity by year.
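The course combines iotools with bigtabulate here; the sketch below uses base table() with fixed factor levels inside the chunk function so that the chunk results line up and can be summed:

library(iotools)

fc <- file("mortgage-sample.csv", "rb")
readLines(fc, n = 1)   # discard the header line

race_year_counts <- chunk.apply(fc,
  function(chunk) {
    m <- mstrsplit(chunk, sep = ",", type = "integer")
    # counts for (race codes 1-7, 9) x (years 2008-2015), flattened
    as.vector(table(factor(m[, 9], levels = c(1:7, 9)),
                    factor(m[, 15], levels = 2008:2015)))
  },
  CH.MAX.SIZE = 1e5)
close(fc)

# Fold the chunk counts and reshape back into a race-by-year table
matrix(colSums(race_year_counts), nrow = 8,
       dimnames = list(race = c(1:7, 9), year = 2008:2015))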

## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""

In the next exercise, you will visualize the adjusted demographics trends!

Relative change in demographic trend

In the last exercise, you looked at the changes in borrowing across demographics over time. In this exercise, you’ll look at the relative change in demographic trend. To do this, you’ll normalize each group’s trend by borrowing in the first year, 2008.
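A sketch, using the race-by-year count table from earlier (without the “Not Avail” row):

# The race-by-year counts, excluding the "Not Avail" row
rt[1:7, ]

# To get the relative change, divide each race's yearly counts by its 2008
# count (the first of the eight year columns), e.g.:
# counts <- as.matrix(rt[1:7, 1:8])
# sweep(counts, 1, counts[, 1], "/")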

##   2008 2009 2010 2011 2012 2013 2014 2015        Race
## 1   11   18   13   16   15   12   29   29   Native Am
## 2  384  583  603  568  770  673  369  488       Asian
## 3  363  320  209  204  258  312  185  169       Black
## 4   33   38   21   13   28   22   17   23  Pacific Is
## 5 5552 7739 6301 5746 8192 7535 4110 4831       White
## 6   43   85   65   58   89   78   46   64 Two or More
## 7  577  563  384  378  574  613  439  512    Hispanic

Which groups have the largest change and when? Why might you see this?

Borrower Region by Year

In this exercise you’ll tabulate the data by year and the msa (city vs rural) variable.
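The course exercise computes this chunk-wise with iotools, much like the previous example; the same table can also be read straight off the attached big.matrix:

library(bigtabulate)

# Counts by residence (msa: 0 = rural, 1 = urban) and year
bigtable(mort, c("msa", "year"))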

## [1] "\"enterprise\",\"record_number\",\"msa\",\"perc_minority\",\"tract_income_ratio\",\"borrower_income_ratio\",\"loan_purpose\",\"federal_guarantee\",\"borrower_race\",\"co_borrower_race\",\"borrower_gender\",\"co_borrower_gender\",\"num_units\",\"affordability\",\"year\",\"type\""

People living in the city received more mortgages. However, it looks like the number of mortgages for both rural and city dwellers has decreased over time.

Who is securing federally guaranteed loans?

The borrower’s income is not in the data set. However, annual income divided by the median income of people in the local area is. This is called the Borrower Income Ratio. Let’s look at the proportion of federally guaranteed loans for each borrower income category.
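A sketch, assuming the documented coding of borrower_income_ratio (1, 2, 3, 9) and of federal_guarantee (1 = FHA/VA, 2 = RHS, 3 = HECM, 4 = no guarantee):

library(bigtabulate)

fg_counts <- bigtable(mort, c("borrower_income_ratio", "federal_guarantee"))
dimnames(fg_counts) <- list(
  c("0 <= 50", "50 < 80", "> 80", "Not Applicable"),
  c("FHA/VA", "RHS", "HECM", "No Guarantee"))

fg_counts / rowSums(fg_counts)   # proportions within each income category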

##                     FHA/VA          RHS         HECM No Guarantee
## 0 <= 50        0.008944544 0.0014636526 0.0443974630    0.9451943
## 50 < 80        0.005977548 0.0024055985 0.0026971862    0.9889197
## > 80           0.001113022 0.0002428412 0.0006475766    0.9979966
## Not Applicable 0.023676880 0.0013927577 0.0487465181    0.9261838

Do you see how a higher proportion of lower-income borrowers have HECM loans? Those are reverse mortgages, where the Department of Housing and Urban Development buys back a house from someone who’s already paid a sizeable portion of it.