All lab materials are available on Canvas in the Modules section. TA Office Hours are also available in the uploaded PPT.
Trial and error really helps. Now is the time to start looking for datasets; if you have found one, working with your data and applying the concepts from class to it will help you master the material. It’s also okay to feel lost and frustrated - research is a non-linear process and there will be ups and downs. Remember to reach out for help!
Any logistical questions?
Before interacting with data files, we must ensure they reside in the working directory, the folder R loads data from and saves data to by default.
# getwd() # check your current working directory
setwd(dir = "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs") # point R to the folder holding your files; adjust this path for your machine
# Each code chunk begins with ```{r} and ends with ```
To create a code chunk: Option + Cmd + I (Mac), Ctrl + Alt + I (Windows). To run the code: Cmd + Shift + Enter (Mac), Ctrl + Shift + Enter (Windows).
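For reference, a blank chunk looks like this, with your R code going between the delimiters:

```{r}
# your R code goes here
```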
Can you create an object (rbsi_class) with the value 14? We also have an observer today. Create an object (observer) and assign it the value 1.
rbsi_class <- 14
observer <- 1
We also want to create a vector with the methods TAs’ names.
ta <- c("Mateo", "Stephanie", "Suhyen")
rbsi_class + observer
## [1] 15
Could we add rbsi_class and ta together?
rbsi_class + ta
## Error in rbsi_class + ta: non-numeric argument to binary operator
And finally, to remove an object:
rm(rbsi_class) # remove item
rm(list=ls()) # everything
When you hit an error like this, try typing the error message into Google.
By the end of lab, you should have…
Before we get to cleaning any data we need to make sure our environment is set up. We can set up our environment by doing a few simple things:
Anyone can contribute packages through the Comprehensive R Archive Network (CRAN, available here: http://cran.r-project.org/). R packages are extensions to R; they usually contain code and data that users can install.
We used a package function to load data (read_dta). Packages are also used to run regressions and plot graphs, as we will explore later.
For this lab, we will need:
haven for loading data
dplyr for tidying and manipulating dataframes
ggplot2 for visualizing data

#install.packages("haven") ## uncomment this line if you haven't downloaded the package
#install.packages("dplyr") ## uncomment this line if you haven't downloaded the package
#install.packages("ggplot2") ## uncomment this line if you haven't downloaded the package
library(haven)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
CCES 2016 Data on Harvard Dataverse - the CCES 2016 data only exist in .tab format. If you aren’t sure of a file’s format, you can try googling it. The format is important because you have to select the file type when loading data, whether using a point-and-click menu or in-line code functions.
Using point-and-click: from the RStudio dropdown menu, select File > Import Dataset > From text (base) (see image below if these instructions are unclear), and then from the pop-up, locate your downloaded csv file.
Using code functions:
there are several ‘read’ functions that load data into R:
library(foreign)
library(haven)
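For instance, a quick sketch of which function matches which format (the file names here are placeholders, not our lab files):

# mydata <- read.csv("somefile.csv")   # comma-separated (.csv), base R
# mydata <- read.delim("somefile.tab") # tab-separated (.tab), base R
# mydata <- read_dta("somefile.dta")   # Stata (.dta), from haven
# mydata <- read.dta("somefile.dta")   # older Stata files (up to version 12), from foreign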
With that, let us load the two datasets we will be working with today. Go to the following websites/Canvas and download the fatal encounters data in .csv format and the CES data (I would recommend using the Stata format and then loading the .dta file in), and load them into R.
CES 2020 Data on Harvard Dataverse - available in both .csv and .dta. Typically, researchers will pick the form that suits their statistical program best.
# library(haven)
ces <- read_dta("CES20_Common_OUTPUT_vv.dta")
fatal <- read.csv("fatal.csv")
If the data is successfully loaded, you will see both datasets in the local environment panel under the object names you gave them.
Today the research question we will be exploring is: are attitudes about law enforcement associated with (affected by) police violence against civilians?
Let’s brainstorm: What is the independent variable? What is the dependent variable? What could be the unit of analysis? How would you measure each of these?
Before we launch into this analysis, we need to understand our data and understand how to unify it into a single dataframe. Let’s start by doing something we should always do: determining the structure and type of our data using the summary() command.
This will tell us some basic information about our dataset – all the variables, the type of the data (e.g. numeric, character, factor) and some distributional information about continuous numeric variables.
summary(fatal)
summary(ces)
# names(fatal)
colnames(fatal)
## [1] "X" "id"
## [3] "Name" "Age"
## [5] "Gender" "Race"
## [7] "Race.with.imputations" "Imputation.probability"
## [9] "date" "address"
## [11] "City" "State"
## [13] "ZIP" "county"
## [15] "FullAddress" "Latitude"
## [17] "Longitude" "Agency"
## [19] "force" "description"
## [21] "Dispositions" "link"
## [23] "Unique.identifier..redundant." "dateformatted"
## [25] "numchar"
We can take a look at the first few rows (observations) of data in each dataframe using the head() command, which will display the first few sets of observations in our dataframes.
head(fatal)
A useful strategy when thinking about manipulating data is to identify what our ideal dataset would look like once we are finished working; I do this in nearly every project myself. What would it look like here? Remember, we are trying to determine whether attitudes about law enforcement are affected by police violence against civilians.
A few things to keep in mind:
The data should be stored in a single dataframe. This would involve merging datasets if there are more than one.
What is the main identifier that we want to use to link the data together? Zipcode is the main geographic identifier.
We can create a cleaner dataframe with only the variables of interest. CES has 716 columns/variables and fatal encounters has 25. We can delete extraneous variables.
What are the variables we are interested in keeping in the dataframes? The best way to get to know a dataset is through its codebook and questionnaire.
The next step in learning how to create one unified dataset is to first select the variables we are interested in, in each of the datasets. This brings us to the next part, subsetting.
One of the most common things you will do in R is subset data. Subsetting involves only keeping part of some larger dataframe based on a set of criteria that we offer the data in R. Put another way, subsetting is one way to manipulate a dataframe to only include those observations you need for analysis.
When subsetting there are generally two approaches:
[c(rows), c(columns)], based on the row and column numbers of the dataframe we want to keep. This approach is less common but may be useful to pull specific variables or observations out of your data. When you use this approach, the square brackets serve as indexers for your dataframe. Data follow the Rows x Columns (RxC) principle; in other words, the first index is for Rows and the second index is for Columns [c(R), c(C)].

# rows 1:10, first 8 columns only
ces[c(1:10), c(1:8)]
## # A tibble: 10 × 8
## caseid commonweight commonpostweight vvweight vvweight_post tookpost CCEStake
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lb> <dbl+lb>
## 1 1.23e9 0.783 0.666 0.851 0.607 2 [Yes] 1 [Yes]
## 2 1.23e9 1.34 1.44 NA NA 2 [Yes] 1 [Yes]
## 3 1.23e9 0.406 0.342 NA NA 2 [Yes] 1 [Yes]
## 4 1.23e9 0.958 0.822 1.04 1.00 2 [Yes] 1 [Yes]
## 5 1.23e9 0.195 0.162 NA NA 2 [Yes] 1 [Yes]
## 6 1.23e9 1.06 0.880 1.15 0.988 2 [Yes] 1 [Yes]
## 7 1.23e9 0.758 0.848 0.824 0.176 2 [Yes] 1 [Yes]
## 8 1.23e9 1.16 1.19 1.26 1.29 2 [Yes] 1 [Yes]
## 9 1.23e9 0.846 0.714 0.920 1.04 2 [Yes] 1 [Yes]
## 10 1.23e9 0.919 0.751 0.999 0.921 2 [Yes] 1 [Yes]
## # ℹ 1 more variable: birthyr <dbl>
Now we want to keep the 8th through 13th columns (variables) and rows 1:10 in our Fatal Encounters dataframe. We would do the following:
# rows 1:10, 8th through 13th columns only
fatal[c(1:10), c(8:13)]
## Imputation.probability date address
## 1 0.986065754 1/12/00 Lexington Avenue and East First Street
## 2 0.98614316 1/21/00 Upper Sumner Hill Road
## 3 Not imputed 1/23/00 2104-1962 MA-138
## 4 Not imputed 1/23/00 2104-1962 MA-138
## 5 Not imputed 1/28/00 270 Valley St,
## 6 Not imputed 1/28/00 Bayview Avenue and Arlington Avenue
## 7 0.992780017 2/11/00 528 Mill Rd
## 8 Not imputed 3/16/00 Mantua Grove Road and Forest Parkway
## 9 Not imputed 3/29/00 West Street and Pine Street
## 10 0.962815715 4/5/00 779-625 Flaghole Rd
## City State ZIP
## 1 Clifton NJ 7011
## 2 Sumner ME 4292
## 3 Raynham MA 2767
## 4 Raynham MA 2767
## 5 Providence RI 2909
## 6 Jersey City NJ 7305
## 7 Irvington NJ 7111
## 8 West Deptford NJ 8066
## 9 Abington MA 2351
## 10 Andover NH 3216
The second approach is the subset() command. We use this command to keep only some subset of the data based on a set of criteria, rather than the location (row/column) of the data. For example, we can use the subset command to keep only the survey respondents on the CES in the state of North Carolina. We do so using the following syntax:
nc <- subset(ces, inputstate == 37)
# subset if inputstate is 37 (North Carolina)
If we want to subset based on MULTIPLE conditions, we can do that easily using & (and) or | (or) in our subset command. For example, say we want to include only CES survey respondents from North Carolina and Georgia; we use the OR (|) to subset on those criteria.
ganc <- subset(ces, inputstate == 37 | inputstate == 13)
# subset if inputstate is 37 (North Carolina) OR 13 (Georgia)
But what if we want to subset to look at, say, individuals who belong to a union (union) and self-report having voted in 2020 (CC20_401)? We need to consult our codebook and be sure to use the & (and) in the subset command.
unionvote <- subset(ces, CC20_401 == 5 & union == 1)
# subset() can also select certain columns (note this keeps all rows)
unionvote_cols <- subset(ces, select = c(caseid, CC20_401, union))
dplyr

dplyr from library(dplyr) is one of the most commonly used packages for manipulating data. We use the pipe operator %>%, which passes the result on its left into the function on its right, to chain dplyr functions like filter() and select(). filter() will subset rows and select() will subset columns.
# subset respondents from NC
nc2 <- ces %>% filter(inputstate == 37)
# subset respondents from NC and GA
ganc2 <- ces %>% filter(inputstate == 37 | inputstate == 13)
# subset respondents NOT from NC (note != means "not equal")
notnc2 <- ces %>% filter(inputstate != 37)
# subset union members who report having voted in 2020
unionvote2 <- ces %>% filter(CC20_401 == 5 & union == 1)
Now we have some rough sense of how the subset command works in R. With that in mind, let’s return to the research question at hand, which is attempting to assess whether attitudes toward the police become less favorable after police violence towards civilians.
Recall that we want the following variables from CES:
#this code keeps all rows and ONLY the column names I call here
# base R
# ces2 <- ces[, c("caseid","inputstate","lookupzip","CC20_307","CC20_334b","CC20_334d", "CC20_334e", "CC20_334f", "CC20_334g", "CC20_334h")]
# dplyr
ces2 <- ces %>% select("caseid","inputstate","lookupzip","CC20_307","CC20_334b","CC20_334d", "CC20_334e", "CC20_334f", "CC20_334g", "CC20_334h")
Now we need to do the same with the fatal encounters dataset, but we need two things: the number of police killings per zipcode, over a certain period of time. Note that fatal.csv runs from 2000 through 2021, while the survey was fielded in 2020.
How do we do this? We need two variables: the zipcode of the killing (ZIP) and the date.
In addition, let’s limit this to killings that occurred in the year before respondents took their survey. Because survey responses were collected September 29 - December 14 of 2020, we will subset our data to only include police killings between September 29, 2019 and December 14, 2020.
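A natural first try (a sketch of the naive approach) is to subset on the raw date column directly:

fatal2 <- subset(fatal, date > "2019-09-29" & date < "2020-12-14") # comparing character strings, not dates!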
Now…look…there are 0 observations for the fatal2 dataset! What happened?
The problem lies with the structure of the date variable in the fatal encounters data. Using the class() command, we can see what type of data the date variable currently is.
summary(fatal$date)
## Length Class Mode
## 30021 character character
class(fatal$date)
## [1] "character"
So instead, we need to convert this to a date object. To do so, we’ll need to use the as.Date() command. The syntax of as.Date() is as follows:
as.Date(data$variable, format = "%y/%m/%d")
where %y stands for year, %m stands for month, %d stands for day, and the / marks represent the dividing characters. Looking at the date variable in the fatal encounters data, it seems the format is actually month/day/year, so we can specify that below.
fatal$dateformat <- as.Date(fatal$date, format = "%m/%d/%y")
It is important to create a new variable rather than overwrite the old variable because it allows us to compare and see if the conversion to a date object worked. If we overwrote the old variable with our new variable we couldn’t validate this.
Let’s compare the results:
head(fatal$dateformat, 10)
## [1] "2000-01-12" "2000-01-21" "2000-01-23" "2000-01-23" "2000-01-28"
## [6] "2000-01-28" "2000-02-11" "2000-03-16" "2000-03-29" "2000-04-05"
head(fatal$date, 10)
## [1] "1/12/00" "1/21/00" "1/23/00" "1/23/00" "1/28/00" "1/28/00" "2/11/00"
## [8] "3/16/00" "3/29/00" "4/5/00"
See! The conversion worked! Now we can run that same code from earlier to subset our data.
fatal2 <- subset(fatal, dateformat > "2019-09-29" & dateformat < "2020-12-14")
# fatal2 <- fatal %>% filter(dateformat > "2019-09-29" & dateformat < "2020-12-14")
We ultimately want to merge the fatal encounters dataset with the CES data, so that we can see if residing in a zipcode with police violence affects attitudes towards police.
Right now the fatal encounters data’s unit of analysis is an individual police killing of a civilian. If we merge this dataset with the CES data as-is, we will duplicate many CES responses, because in any zipcode with multiple police killings of civilians we will end up merging EACH killing with EACH survey respondent. To put it in more concrete terms, if there are 8 police killings in the 27701 zipcode and 1 survey respondent in the 27701 zipcode, a simple merge will duplicate that 1 survey respondent 8 times in our dataframe, as the toy example below illustrates.
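A minimal illustration with made-up dataframes (resp and kills are hypothetical, not part of our datasets):

resp <- data.frame(caseid = 1, zip = 27701)
kills <- data.frame(zip = rep(27701, 8), killing_id = 1:8)
merge(resp, kills, by = "zip") # the single respondent now appears 8 times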
So how should we aggregate the data?
Some options from dplyr to manipulate data (use ?help to check what these functions do, e.g. ?select):

mutate(): create new columns based on info from other columns
group_by() / summarize(): create summary statistics of grouped data

Applying this to our dataset, we want one row per zipcode with the number of killings in that zipcode. The syntax of these aggregations is as follows:
newdf <- oldf %>% group_by(idvar) %>% summarise(obs = n())
fatal_zip <- fatal %>%
group_by(ZIP) %>%
summarise(numberofkillings = n()) # one row per zipcode; swap in fatal2 to count only the 2019-2020 window
head(fatal_zip, 10)
## # A tibble: 10 × 2
## ZIP numberofkillings
## <int> <int>
## 1 1013 1
## 2 1030 1
## 3 1040 1
## 4 1069 1
## 5 1082 1
## 6 1085 2
## 7 1089 2
## 8 1103 1
## 9 1105 2
## 10 1106 2
ces %>%
group_by(inputstate) %>%
summarise(mean = mean(CC20_307, na.rm = TRUE), n = n()) # na.rm = TRUE keeps missing responses from turning the mean into NA
Now we have two dataframes the way we want them. How do we join the information from the CES dataframe with the information from the fatal encounters dataframe? In R, we can use the merge() command to combine data frames. This function tries to identify columns or rows that are common between the two data frames using common identifiers.
There are 4 different kinds of merges (inner, left outer, right outer, and full outer); the type of merge determines which rows are kept in the merged result.
The syntax of a merge is as follows:
newdf <- merge(datasetx, datasety, by.x = c("colnamex"), by.y = c("colnamey"), all = TRUE)
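As a sketch of how the all arguments map onto the four merge types (datasetx, datasety, and the column names are placeholders from the syntax above):

# inner join (the default): keep only rows matched in BOTH dataframes
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey")
# left outer join: keep every row of datasetx
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all.x = TRUE)
# right outer join: keep every row of datasety
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all.y = TRUE)
# full outer join: keep every row of both
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all = TRUE)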
In this case we need to do a left outer join, because there are more zipcodes in the survey data (ces) than in the police killings data (fatal). That means we want to keep all the data on our survey respondents’ ratings of the police, but only the fatal encounters data that merges with our survey responses. That’s because we don’t need data on every police killing of a civilian; we just want to compare how respondents living in zipcodes with a police killing rate their local police, compared to respondents living in zipcodes without a recent police killing.
The key to merging is a common identifier across datasets. In this case, that is zipcode! So in the by.x and by.y arguments, we input the zipcode variable from both datasets.
ces_merge <-merge(ces2, fatal_zip, by.x = c("lookupzip"), by.y = c("ZIP"), all.x = TRUE)
head(ces_merge, 10)
## lookupzip caseid inputstate CC20_307 CC20_334b CC20_334d CC20_334e
## 1 01001 1240642977 25 1 1 2 2
## 2 01001 1235172847 25 1 1 2 2
## 3 01001 1240490335 25 1 1 1 1
## 4 01001 1245486909 25 2 1 2 2
## 5 01002 1232441929 25 2 1 1 1
## 6 01002 1249694189 25 3 1 1 2
## 7 01002 1232540803 25 3 1 1 1
## 8 01002 1243898251 25 3 1 1 1
## 9 01002 1239514751 25 1 1 2 1
## 10 01002 1245173699 25 1 1 2 1
## CC20_334f CC20_334g CC20_334h numberofkillings
## 1 2 1 2 NA
## 2 2 2 1 NA
## 3 1 2 1 NA
## 4 1 2 2 NA
## 5 1 1 1 NA
## 6 1 1 1 NA
## 7 1 1 1 NA
## 8 1 1 1 NA
## 9 2 2 1 NA
## 10 1 1 1 NA
Success!!!
Let’s check out the data now using table(), which will tabulate rows and aggregate them for us. While we’re there, let’s use is.na() or summary() to find out whether we have missing data.
table(ces_merge$numberofkillings)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 11464 8280 5989 4300 3140 2220 1611 1428 770 785 498 355 265
## 14 15 16 17 18 19 20 21 22 23 24 25 28
## 283 175 81 116 86 94 74 17 23 20 28 25 7
## 32 33 34
## 11 16 8
summary(ces_merge$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 3.00 3.83 5.00 34.00 18831
sum(is.na(ces_merge$numberofkillings))
## [1] 18831
We have 18,831 missing values? Why?
The reason is that our fatal encounters dataset records every single police killing of a civilian that occurred, but it does NOT report the number of killings for every zipcode in the US: zipcodes with no recorded killings never appear in it, so they come out of the merge as NA. If we really believe the fatal encounters data, then, we should treat all the missing/NA values as 0’s, meaning no police killings of civilians in that zipcode.
To do so, we will use the square brackets [ ] for indexing, together with the is.na() command.
# dplyr
ces_merge2 <- ces_merge %>%
mutate(killings = ifelse(is.na(numberofkillings) == TRUE, 0, numberofkillings)) ## another way to recode numberofkillings NA's into 0, into the new variable killings
# original
ces_merge$numberofkillings[is.na(ces_merge$numberofkillings)] <- 0
# check results
summary(ces_merge$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 2.648 4.000 34.000
sum(is.na(ces_merge$numberofkillings))
## [1] 0
In addition, we might also want to compare respondents living in zipcodes with no police killings to respondents living in zipcodes with any number (1 or more). This requires us to use the same square brackets [ ] indexing to create a new variable. The syntax to do this is:
ces_merge$binarykilling[ces_merge$numberofkillings == 0] <- 0
ces_merge$binarykilling[ces_merge$numberofkillings > 0] <- 1
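As a quick sanity check, we can tabulate the new variable (the counts will reflect your own merged data):

table(ces_merge$binarykilling) # 0 = no killings in the zipcode, 1 = at least one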
Finally, one last problem: let’s examine the dependent variables. We might want to change the variable names, since the raw question codes are hard to distinguish.
summary(ces_merge$CC20_307)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.857 2.000 4.000 86
names(ces_merge)[names(ces_merge) == "CC20_307"] <- "safe"
names(ces_merge)[names(ces_merge) == "CC20_334d"] <- "decrease_pol"
names(ces_merge)[names(ces_merge) == "CC20_334e"] <- "ban"
names(ces_merge)[names(ces_merge) == "CC20_334f"] <- "natregistry"
names(ces_merge)[names(ces_merge) == "CC20_334g"] <- "weapons"
names(ces_merge)[names(ces_merge) == "CC20_334h"] <- "sue"
summary(ces_merge$safe)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.857 2.000 4.000 86
summary(ces_merge$decrease_pol)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.586 2.000 2.000 26
unique(ces_merge$decrease_pol)
## <labelled<double>[3]>: Policing policies -- Decrease the number of police on the street by 10 percent,
## [1] 2 1 NA
##
## Labels:
## value label
## 1 Support
## 2 Oppose
## 8 skipped
## 9 not asked
We want to recode decrease_pol as a binary variable of 0 and 1.
# original
# ces_merge$decrease_pol[ces_merge$decrease_pol == 1] <- 0
# ces_merge$decrease_pol[ces_merge$decrease_pol == 2] <- 1
# dplyr
# ifelse(condition, value_if_true, value_if_false)
ces_merge <- ces_merge %>% mutate(decrease_pol = ifelse(decrease_pol == 1, 0, 1))
Let’s practice recoding ban, natregistry, weapons, and sue as binary variables of 0 and 1.
ces_merge <- ces_merge %>% mutate(ban = ifelse(ban == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(natregistry = ifelse(natregistry == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(weapons = ifelse(weapons == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(sue = ifelse(sue == 1, 0, 1))
summary(ces_merge$ban)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.2352 0.0000 1.0000 26
Now we want to create a new additive variable that adds up the responses to create a score of negative attitudes.
ces_merge$police_score <- ces_merge$ban + ces_merge$natregistry + ces_merge$weapons + ces_merge$sue
summary(ces_merge$police_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.136 2.000 4.000 63
Though we’ll cover this more in the visualization lab, considering we did all this work, let’s use ggplot to take a look at the distribution of responses!
library(ggplot2)
# histogram of feeling safe, faceted by whether the zipcode had a police killing
p <- ggplot(ces_merge, aes(x=safe)) +
geom_histogram(aes(y = after_stat(width*density))) +
scale_y_continuous(labels = scales::percent) + scale_x_continuous(breaks=c(1,2,3,4),
labels=c("Safe", "", "", "Unsafe")) +
facet_grid(.~ binarykilling ) +
theme_bw() +
labs(title = "Police Killings and Feeling Safe around Police",
caption = "Data source: 2020 CCES (n = 61,000)",
x = "Attitude Towards Local Police",
y = "Percentage of Repondents")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 86 rows containing non-finite values (`stat_bin()`).
library(ggplot2)
# histogram of the negative-attitudes score, faceted by whether the zipcode had a police killing
p2 <- ggplot(ces_merge, aes(x=police_score)) +
geom_histogram(aes(y = after_stat(width*density))) +
scale_y_continuous(labels = scales::percent) + scale_x_continuous(breaks=c(0,1,2,3,4),
labels=c("0", "1", "2","3","4")) +
facet_grid(.~ binarykilling ) +
theme_bw() +
labs(title = "Police Killings and Negative Attitudes toward Police",
caption = "Data source: 2020 CCES (n = 61,000)",
x = "Attitude Towards Local Police",
y = "Percentage of Respondents")
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 63 rows containing non-finite values (`stat_bin()`).
To summarize, we learned about the following functions / commands in R:
read.csv()
read_dta()
summary()
names()
head()
subset()
[r, c]
class()
merge()
filter()
select()
group_by()
summarise()
table()
is.na()
ifelse()
This is obviously a lot to remember and a lot to make sense of in a single lecture. Other key resources for you to use, which I draw on heavily for this guide: