United Nations General Assembly

Jan 14, 2018 · tags: english, R, Data Manipulation, Data Visualization

In this post, I’ll explore the historical voting record of the United Nations General Assembly - specifically, I’ll concentrate on analyzing differences in voting between countries, across time, and among international issues.

This case study is based on tidyverse tools - mostly dplyr, ggplot2, purrr, and tidyr, with some use of broom as well.

The dataset comes from: Erik Voeten, “Data and Analyses of Voting in the UN General Assembly”, in Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013). https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379

It can easily be downloaded from a GitHub repository, but I won’t use that version, as it’s already preprocessed and cleaned. Instead, I’ll give the raw format a try, where things like country codes have to be taken care of manually.

In case the above link is broken, use my copy: votes.rdf

Cleaning and manipulating data

Cleaning and summarizing

Filtering rows

The vote column in the dataset has a number that represents that country’s vote:

## [1] 1 9 8 3 2

Where:

1 = Yes
2 = Abstain
3 = No
8 = Not present
9 = Not a member

For this case study I’ll remove all occurrences of “Not present” and “Not a member”. This will reduce my dataset by ~31%:

## # A tibble: 1 x 2
##   `sum(droppedVote)` dropPrcnt
##                <int>     <dbl>
## 1             155382     0.305
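A minimal sketch of that filtering step, assuming the raw data frame is called votes:

library(dplyr)

# keep only Yes (1), Abstain (2) and No (3); drop 8 and 9
votes <- votes %>%
  filter(vote <= 3)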

Adding a year column

The session column represents the UN session number. Since the UN started voting in 1946 and holds one session per year, I can get the year of a UN resolution by adding 1945 to the session number.
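A sketch of that step:

# derive the calendar year from the session number
votes <- votes %>%
  mutate(year = session + 1945)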

Adding a country column

The country codes in the ccode column are what’s called Correlates of War codes. This isn’t ideal for an analysis, since I’d like to work with recognizable country names.

One way to get around this issue is by using the countrycode package to translate ccode into something meaningful.

countrycode() takes three arguments:

  • a vector of codes to translate
  • origin - the original coding scheme (here it’s “cown”)
  • destination - the target coding scheme, for example “country.name”

For example, if I were to check a single country code - say 2 in the “cown” coding scheme - I’d use:
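library(countrycode)

# translate a single Correlates of War code into a country name
countrycode(2, origin = "cown", destination = "country.name")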

## [1] "United States of America"

USA it is… In order to translate the entire ccode column I’ll use:
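# a sketch: add a readable country name for every row
votes <- votes %>%
  mutate(country = countrycode(ccode, origin = "cown", destination = "country.name"))

head(votes)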

## # A tibble: 6 x 6
##    rcid session  vote ccode  year country                 
##   <dbl>   <dbl> <dbl> <int> <dbl> <chr>                   
## 1  46.0    2.00  1.00     2  1947 United States of America
## 2  46.0    2.00  1.00    20  1947 Canada                  
## 3  46.0    2.00  1.00    40  1947 Cuba                    
## 4  46.0    2.00  1.00    41  1947 Haiti                   
## 5  46.0    2.00  1.00    42  1947 Dominican Republic      
## 6  46.0    2.00  1.00    70  1947 Mexico

Percent of yes votes

First I’ll look into the vote column and summarise occurrences of “yes” votes. How often a given country votes “yes” shows how often it agrees with the international consensus; a country that tends to vote “no” goes against that consensus.

In the following analysis, I’m going to focus on “% of votes that are yes” as a metric for the “agreeableness” of countries.
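A sketch of the overall summary (vote == 1 marks a “yes”):

votes %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))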

There are a total of 353547 votes in the database, and 80% are yes votes.

Average “agreeableness” of countries by year

Let’s see how the average “agreeableness” of countries changed from year to year.
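A sketch of the by-year summary:

by_year <- votes %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))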

We can see that average agreeableness kept rising (with some bumps) until the late ’70s/early ’80s, where it settled at its highest level until 1991.

## # A tibble: 11 x 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1975  8987       0.820
##  2  1977 13310       0.874
##  3  1979 17103       0.843
##  4  1981 19236       0.840
##  5  1983 21560       0.849
##  6  1985 22775       0.842
##  7  1987 22067       0.874
##  8  1989 17378       0.883
##  9  1991 11357       0.856
## 10  1993 10272       0.798
## 11  1995 12645       0.809

Average “agreeableness” by country

What about the percentage of “yes” votes by country? Looking at a summary table won’t cut it, as I have 200 countries. I can play with sorting to see the countries with the most “yes” votes as opposed to those with the most “no” votes:
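A sketch of the summary and the sorting:

by_country <- votes %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# most agreeable countries first
by_country %>%
  arrange(desc(percent_yes))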

## # A tibble: 200 x 3
##    country               total percent_yes
##    <chr>                 <int>       <dbl>
##  1 Sao Tome and Principe  1091       0.976
##  2 Seychelles              881       0.975
##  3 Djibouti               1598       0.961
##  4 Guinea Bissau          1538       0.960
##  5 Timor-Leste             326       0.957
##  6 Mauritius              1831       0.950
##  7 Zimbabwe               1361       0.949
##  8 Comoros                1133       0.947
##  9 United Arab Emirates   1934       0.947
## 10 Mozambique             1701       0.947
## # ... with 190 more rows

Interesting - there’s not a single “bigger” country among those with the highest percent_yes.

## # A tibble: 200 x 3
##    country                                              total percent_yes
##    <chr>                                                <int>       <dbl>
##  1 Zanzibar                                                 2       0    
##  2 United States of America                              2568       0.269
##  3 Palau                                                  369       0.339
##  4 Israel                                                2380       0.341
##  5 Federal Republic of Germany                           1075       0.397
##  6 United Kingdom of Great Britain and Northern Ireland  2558       0.417
##  7 France                                                2527       0.427
##  8 Micronesia (Federated States of)                       724       0.442
##  9 Marshall Islands                                       757       0.491
## 10 Belgium                                               2568       0.492
## # ... with 190 more rows

Interesting as well - it looks like the USA is the least agreeable country of all. I disregard Zanzibar, as it has only 2 votes; it’s also worth noting that there are two other countries with very few votes:

## # A tibble: 3 x 3
##   country     total percent_yes
##   <chr>       <int>       <dbl>
## 1 Zanzibar        2       0    
## 2 South Sudan    53       0.642
## 3 Kiribati       72       0.778

Let’s focus on visualizing these data.

Visualizing with ggplot2

Summarizing by year and country

It feels more interesting to investigate voting trends within specific countries than the overall trend. So instead of summarizing just by year, I’ll summarize by both year and country, constructing a dataset that shows what fraction of the time each country votes “yes” in each year.
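A sketch of that summary:

by_year_country <- votes %>%
  group_by(year, country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1)) %>%
  ungroup()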

How did the UK vote over time?
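A sketch of the plot (note that the dataset uses the UK’s full official name):

library(ggplot2)

UK_by_year <- by_year_country %>%
  filter(country == "United Kingdom of Great Britain and Northern Ireland")

ggplot(UK_by_year, aes(x = year, y = percent_yes)) +
  geom_line()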

How about comparing some countries on one plot?

Adding two more to the basket:

All six graphs had the same axis limits. This made the changes over time hard to examine for plots with relatively little change.
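Freeing the y scales in facet_wrap() fixes that. A minimal sketch, picking six example countries:

# example selection - swap in any countries of interest
countries <- c("United States of America", "France", "Brazil", "India", "Japan",
               "United Kingdom of Great Britain and Northern Ireland")

by_year_country %>%
  filter(country %in% countries) %>%
  ggplot(aes(x = year, y = percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")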

Now, let’s see how Europe did:

Among others, it’s interesting to see missing data for Germany. That’s because before 1990 there were two German states:

  • Federal Republic of Germany
  • German Democratic Republic

Let’s compare those:

That’s interesting: the Communists had a higher percentage of agreement than their western neighbors.

Now, I’ll try some linear modelling on this data.

Modelling data with regression

A linear regression is a model that lets us examine how one variable changes with respect to another by fitting a best-fit line. In R it’s done with the lm() function.

Linear regression on the United States
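A sketch of the fit (US_by_year built by filtering by_year_country):

US_by_year <- by_year_country %>%
  filter(country == "United States of America")

US_fit <- lm(percent_yes ~ year, data = US_by_year)
summary(US_fit)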

## 
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222491 -0.080635 -0.008661  0.081948  0.194307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.6641455  1.8379743   6.890 8.48e-08 ***
## year        -0.0062393  0.0009282  -6.722 1.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared:  0.5854, Adjusted R-squared:  0.5724 
## F-statistic: 45.18 on 1 and 32 DF,  p-value: 1.367e-07

From the lm() output I can tell that there’s a negative effect of year, showing a decrease in agreement percentage in the US: \(b = -0.01\), 95% CI \([-0.01, 0.00]\), \(t(32) = -6.72\), \(p < .001\).

Using the broom package I can tidy up lm() results and save them as data frames for easier use:
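library(broom)

# turn the model summary into a data frame
tidy(US_fit)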

##          term     estimate    std.error statistic      p.value
## 1 (Intercept) 12.664145512 1.8379742715  6.890274 8.477089e-08
## 2        year -0.006239305 0.0009282243 -6.721764 1.366904e-07

The biggest advantage of using broom is the ability to combine multiple models into one object (with bind_rows()):
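A sketch, assuming the second model is a UK fit built the same way:

# hypothetical companion model for the UK
UK_fit <- lm(percent_yes ~ year, data = UK_by_year)

bind_rows(tidy(US_fit), tidy(UK_fit))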

##          term     estimate    std.error statistic      p.value
## 1 (Intercept) 12.664145512 1.8379742715  6.890274 8.477089e-08
## 2        year -0.006239305 0.0009282243 -6.721764 1.366904e-07
## 3 (Intercept) -3.266547873 1.9577739504 -1.668501 1.049736e-01
## 4        year  0.001869434 0.0009887262  1.890750 6.774177e-02

Compute linear models for each country

In order to compute linear models for each country I’ll make use of several packages, starting with the tidyr function nest(), which takes an ID column as an argument and nests all other columns as separate data frames within the original one:
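A sketch, using the pre-1.0 tidyr syntax (newer versions spell it nest(data = -country)):

library(tidyr)

nested <- by_year_country %>%
  nest(-country)

nested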

## # A tibble: 200 x 2
##    country                          data             
##    <chr>                            <list>           
##  1 Afghanistan                      <tibble [34 x 3]>
##  2 Argentina                        <tibble [34 x 3]>
##  3 Australia                        <tibble [34 x 3]>
##  4 Belarus                          <tibble [34 x 3]>
##  5 Belgium                          <tibble [34 x 3]>
##  6 Bolivia (Plurinational State of) <tibble [34 x 3]>
##  7 Brazil                           <tibble [34 x 3]>
##  8 Canada                           <tibble [34 x 3]>
##  9 Chile                            <tibble [34 x 3]>
## 10 Colombia                         <tibble [34 x 3]>
## # ... with 190 more rows

Just for reference, if I were to reverse the nested structure to its original state, I’d call nested %>% unnest(data) and save it back to whatever is needed.

Now that I divided the data for each country into a separate dataset in the data column, I need to fit a linear model to each of these datasets.

The map() function from purrr works by applying a formula to each item in a list, where . represents the individual item.

For example, to add one to each element of a vector numbers I’d call:
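library(purrr)

numbers <- 1:5

# ~ . + 1 is shorthand for function(x) x + 1
map(numbers, ~ . + 1)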

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6

This means that to fit a model to each dataset, I’ll:

map(data, ~ lm(percent_yes ~ year, data = .))

where . represents each individual item from the data column in by_year_country.

The following code will do all the needed steps in one pipe (a sketch follows the list):

  • nest by country
  • compute linear models for each country
  • tidy results
  • unnest model results
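A minimal sketch of that pipe (again in pre-1.0 tidyr syntax):

country_coefficients <- by_year_country %>%
  nest(-country) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

country_coefficients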
## # A tibble: 399 x 6
##    country     term         estimate std.error statistic       p.value
##    <chr>       <chr>           <dbl>     <dbl>     <dbl>         <dbl>
##  1 Afghanistan (Intercept) -11.1      1.47         -7.52 0.0000000144 
##  2 Afghanistan year          0.00601  0.000743      8.09 0.00000000306
##  3 Argentina   (Intercept) - 9.46     2.10         -4.50 0.0000832    
##  4 Argentina   year          0.00515  0.00106       4.85 0.0000305    
##  5 Australia   (Intercept) - 4.55     2.15         -2.12 0.0422       
##  6 Australia   year          0.00257  0.00108       2.37 0.0242       
##  7 Belarus     (Intercept) - 7.00     1.50         -4.66 0.0000533    
##  8 Belarus     year          0.00391  0.000759      5.15 0.0000128    
##  9 Belgium     (Intercept) - 5.85     1.52         -3.86 0.000522     
## 10 Belgium     year          0.00320  0.000765      4.19 0.000207     
## # ... with 389 more rows

Not all slopes are significant, and it would be useful to filter out those that are not.

However, when dealing with lots of p-values - like one for each country - we run into the problem of multiple hypothesis testing, where we have to set a stricter threshold. The p.adjust() function is a simple way to correct for this: calling p.adjust(p.value) on a vector of p-values returns adjusted values one can trust (by default using the Holm (1979) correction).

Here I’ll add two steps to process the slope_terms dataset: a mutate() to create the new, adjusted p-value column, and a filter() to keep those below a .05 threshold.
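A sketch, assuming slope_terms holds the term == "year" rows of country_coefficients:

filtered_countries <- slope_terms %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%  # Holm correction by default
  filter(p.adjusted < .05)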

Now that I’ve filtered for countries where the trend is probably not due to chance, I may want to find the countries with the highest and lowest slopes; that is, the estimate column.
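Sorting by estimate both ways:

filtered_countries %>%
  arrange(estimate)

filtered_countries %>%
  arrange(desc(estimate))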

## # A tibble: 61 x 7
##    country                    term  estimate std.er~ stat~ p.value p.adju~
##    <chr>                      <chr>    <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1 Republic of Korea          year  -0.00921 1.55e-3 -5.96 1.39e-4 2.09e-2
##  2 Israel                     year  -0.00685 1.17e-3 -5.85 1.89e-6 3.31e-4
##  3 United States              year  -0.00624 9.28e-4 -6.72 1.37e-7 2.54e-5
##  4 Belgium                    year   0.00320 7.65e-4  4.19 2.07e-4 3.01e-2
##  5 Guinea                     year   0.00362 8.33e-4  4.35 1.87e-4 2.75e-2
##  6 Morocco                    year   0.00380 8.60e-4  4.42 1.46e-4 2.18e-2
##  7 Belarus                    year   0.00391 7.59e-4  5.15 1.28e-5 2.08e-3
##  8 Iran (Islamic Republic of) year   0.00391 8.56e-4  4.57 6.91e-5 1.07e-2
##  9 Congo                      year   0.00397 9.22e-4  4.30 2.27e-4 3.26e-2
## 10 Sudan                      year   0.00399 9.61e-4  4.15 2.98e-4 4.20e-2
## # ... with 51 more rows
## # A tibble: 61 x 7
##    country             term  estimate std.error statistic  p.value p.adju~
##    <chr>               <chr>    <dbl>     <dbl>     <dbl>    <dbl>   <dbl>
##  1 South Africa        year   0.0119   0.00140       8.47 1.60e- 8 3.05e-6
##  2 Kazakhstan          year   0.0110   0.00195       5.62 3.24e- 4 4.51e-2
##  3 Yemen Arab Republic year   0.0109   0.00159       6.84 1.20e- 6 2.11e-4
##  4 Kyrgyzstan          year   0.00973  0.000988      9.84 2.38e- 5 3.78e-3
##  5 Malawi              year   0.00908  0.00181       5.02 4.48e- 5 7.03e-3
##  6 Dominican Republic  year   0.00806  0.000914      8.81 5.96e-10 1.17e-7
##  7 Portugal            year   0.00802  0.00171       4.68 7.13e- 5 1.10e-2
##  8 Honduras            year   0.00772  0.000921      8.38 1.43e- 9 2.81e-7
##  9 Peru                year   0.00730  0.000976      7.48 1.65e- 8 3.12e-6
## 10 Nicaragua           year   0.00708  0.00107       6.60 1.92e- 7 3.55e-5
## # ... with 51 more rows

Joining and tidying

Now that I can look at the % of “yes” votes and their relation to the year of voting, it’s time to add more data to this dataset.

The descriptions dataset keeps information on the topic of each voting session:

The dataset can be obtained here: descriptions.rdf

## # A tibble: 6 x 10
##    rcid session date                unres      me    nu    di    hr    co
##   <dbl>   <dbl> <dttm>              <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  46.0    2.00 1947-09-04 00:00:00 R/2/299     0     0     0  0     0   
## 2  47.0    2.00 1947-10-05 00:00:00 R/2/355     0     0     0  1.00  0   
## 3  48.0    2.00 1947-10-06 00:00:00 R/2/461     0     0     0  0     0   
## 4  49.0    2.00 1947-10-06 00:00:00 R/2/463     0     0     0  0     0   
## 5  50.0    2.00 1947-10-06 00:00:00 R/2/465     0     0     0  0     0   
## 6  51.0    2.00 1947-10-02 00:00:00 R/2/561     0     0     0  0     1.00
## # ... with 1 more variable: ec <dbl>

For example, hr == 1 marks a vote on a Human Rights-related topic.

There are six columns in the descriptions dataset (and therefore in the new joined dataset) that describe the topic of a resolution (the join is sketched after the list):

me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development
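A sketch of the join, matching votes to descriptions on the resolution id and session:

votes_joined <- votes %>%
  inner_join(descriptions, by = c("rcid", "session"))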
## # A tibble: 60,962 x 14
##     rcid sessi~  vote ccode  year country  date                unres    me
##    <dbl>  <dbl> <dbl> <int> <dbl> <chr>    <dttm>              <chr> <dbl>
##  1  51.0   2.00  3.00     2  1947 United ~ 1947-10-02 00:00:00 R/2/~     0
##  2  51.0   2.00  3.00    20  1947 Canada   1947-10-02 00:00:00 R/2/~     0
##  3  51.0   2.00  2.00    40  1947 Cuba     1947-10-02 00:00:00 R/2/~     0
##  4  51.0   2.00  1.00    41  1947 Haiti    1947-10-02 00:00:00 R/2/~     0
##  5  51.0   2.00  3.00    42  1947 Dominic~ 1947-10-02 00:00:00 R/2/~     0
##  6  51.0   2.00  2.00    70  1947 Mexico   1947-10-02 00:00:00 R/2/~     0
##  7  51.0   2.00  2.00    90  1947 Guatema~ 1947-10-02 00:00:00 R/2/~     0
##  8  51.0   2.00  2.00    92  1947 El Salv~ 1947-10-02 00:00:00 R/2/~     0
##  9  51.0   2.00  3.00    93  1947 Nicarag~ 1947-10-02 00:00:00 R/2/~     0
## 10  51.0   2.00  2.00    95  1947 Panama   1947-10-02 00:00:00 R/2/~     0
## # ... with 60,952 more rows, and 5 more variables: nu <dbl>, di <dbl>, hr
## #   <dbl>, co <dbl>, ec <dbl>

Visualizing colonialism votes

Earlier I graphed the percentage of votes each year where the US voted “yes”. Now I’ll create that same graph, but only for votes related to colonialism.
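A sketch, filtering on the colonialism flag:

US_co_by_year <- votes_joined %>%
  filter(country == "United States of America", co == 1) %>%
  group_by(year) %>%
  summarize(percent_yes = mean(vote == 1))

ggplot(US_co_by_year, aes(x = year, y = percent_yes)) +
  geom_line()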

This might be one reason for the US having a low overall average of “yes” votes…

In order to make visualizations easier and more general I need to tidy up the structure of the dataset: gather the topic columns into one column specifying the topic itself and another specifying whether a given vote concerned that topic. I can do that using the gather() function from the tidyr package.
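A sketch of the reshaping:

votes_gathered <- votes_joined %>%
  gather(topic, has_topic, me:ec) %>%  # one row per vote-topic pair
  filter(has_topic == 1)               # keep only topics that apply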

There’s one more step of data cleaning to make this more interpretable. Right now, topics are represented by two-letter codes:

me: Palestinian conflict
nu: Nuclear weapons and nuclear material
di: Arms control and disarmament
hr: Human rights
co: Colonialism
ec: Economic development

I’ll recode the topic column to match a meaningful description and compute a summary dataset with total and percent_yes votes grouped by year, country and topic:
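A sketch of both steps:

votes_tidied <- votes_gathered %>%
  mutate(topic = recode(topic,
                        me = "Palestinian conflict",
                        nu = "Nuclear weapons and nuclear material",
                        di = "Arms control and disarmament",
                        hr = "Human rights",
                        co = "Colonialism",
                        ec = "Economic development"))

by_country_year_topic <- votes_tidied %>%
  group_by(country, year, topic) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1)) %>%
  ungroup()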

Now I can visualize the trends in percentage of “yes” votes over time for all six topics side-by-side. Here, I’ll visualize them just for the United States.
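A sketch of the faceted plot:

by_country_year_topic %>%
  filter(country == "United States of America") %>%
  ggplot(aes(x = year, y = percent_yes)) +
  geom_line() +
  facet_wrap(~ topic)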

Computing linear models for each country and topic

The next logical step is to once again use nest() to group the data by country and by topic, and then use map() to compute a linear model for percent_yes as a function of year.

To do that I need the following:
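A sketch (nesting on both grouping columns this time):

country_topic_coefficients <- by_country_year_topic %>%
  nest(-country, -topic) %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)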

One last step is to filter out models that are not significant, keeping in mind a p-value correction when dealing with multiple comparisons:
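A sketch of the filtering:

country_topic_filtered <- country_topic_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05) %>%
  arrange(estimate)

country_topic_filtered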

## # A tibble: 56 x 8
##    country       topic      term  estimate std.err~ stati~ p.value p.adju~
##    <chr>         <chr>      <chr>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>
##  1 Vanuatu       Palestini~ year  -0.0327   0.00516  -6.33 2.60e-5 2.95e-2
##  2 Vanuatu       Coloniali~ year  -0.0179   0.00271  -6.60 2.53e-5 2.88e-2
##  3 Malta         Nuclear w~ year  -0.0112   0.00137  -8.15 3.14e-8 3.70e-5
##  4 Cyprus        Human rig~ year  -0.0108   0.00196  -5.48 1.22e-5 1.41e-2
##  5 United States Palestini~ year  -0.0107   0.00194  -5.51 6.85e-6 7.93e-3
##  6 Cyprus        Nuclear w~ year  -0.0107   0.00172  -6.20 1.76e-6 2.06e-3
##  7 Israel        Coloniali~ year  -0.00953  0.00177  -5.38 7.19e-6 8.31e-3
##  8 Romania       Human rig~ year  -0.00945  0.00185  -5.11 2.26e-5 2.59e-2
##  9 Malta         Arms cont~ year  -0.00930  0.00109  -8.51 1.46e-8 1.72e-5
## 10 Cyprus        Arms cont~ year  -0.00878  0.00123  -7.13 1.80e-7 2.11e-4
## # ... with 46 more rows

Hmmm… it looks like Vanuatu has strong views on the Palestinian conflict, and on colonialism as well…

And some problems with human rights, too…