All lab materials are available on Canvas in the Modules section. TA Office Hours are also available in the uploaded PPT.
Trial and error really helps. Now is the time to start looking for datasets; if you have found one, working with your data and applying the concepts from class to it will help you master the material. It’s also okay to feel lost and frustrated - research is a non-linear process and there will be ups and downs. Remember to reach out for help!
Any logistical questions?
Before interacting with data files, we must ensure they reside in the working directory, the folder R loads data from and saves data to by default.
# getwd() # check your current working directory
setwd(dir = "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs") # point R to the folder holding your files; adjust this path for your machine
# Each code chunk begins with ```{r} and ends with ```
To create a code chunk: Option + Cmd + I (Mac), Ctrl + Alt + I (Windows). To run the code: Cmd + Shift + Enter (Mac), Ctrl + Shift + Enter (Windows).
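For reference, a blank chunk looks like this, with your R code going between the delimiters:

```{r}
# your R code goes here
```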
Can you create an object (rbsi_class) with the value 14? We also have an observer today. Create an object (observer) and assign it the value 1.
rbsi_class <- 14
observer <- 1
We also want to create a vector with the methods TAs’ names.
ta <- c("Mateo", "Stephanie", "Suhyen")
rbsi_class + observer
## [1] 15
Could we add rbsi_class and ta together?
rbsi_class + ta
## Error in rbsi_class + ta: non-numeric argument to binary operator
And finally, to remove an object:
rm(rbsi_class) # remove item
rm(list=ls()) # everything
When you hit an error like this, try typing the error message into Google.
By the end of lab, you should have…
Before we get to cleaning any data we need to make sure our environment is set up. We can set up our environment by doing a few simple things:
Anyone can contribute packages through the Comprehensive R Archive Network (CRAN, available here: http://cran.r-project.org/). R packages are extensions to R; they usually contain code and data that users can install.
We used a package function to load data (read_dta). Packages are also used to run regressions and plot graphs, as we will explore later.
For this lab, we will need:
haven for loading data
dplyr for tidying and manipulating dataframes
ggplot2 for visualizing data

#install.packages("haven") ## uncomment this line if you haven't downloaded the package
#install.packages("dplyr") ## uncomment this line if you haven't downloaded the package
#install.packages("ggplot2") ## uncomment this line if you haven't downloaded the package
library(haven)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
CCES 2016 Data on Harvard Dataverse - the CCES 2016 data only exist in .tab format. If you aren’t sure of a file’s format, you can try googling it. The format is important because you have to select the file type when loading data, whether using a point-and-click menu or in-line code functions.
Using point-and-click: from the RStudio dropdown menu, select File > Import Dataset > From text (base) (see image below if these instructions are unclear), and then from the pop-up, locate your downloaded csv file.
Using code functions:
there are several ‘read’ functions that load data into R:
library(foreign)
library(haven)
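For instance, a quick sketch of which function matches which format (the file names here are placeholders, not our lab files):

# mydata <- read.csv("somefile.csv")   # comma-separated (.csv), base R
# mydata <- read.delim("somefile.tab") # tab-separated (.tab), base R
# mydata <- read_dta("somefile.dta")   # Stata (.dta), from haven
# mydata <- read.dta("somefile.dta")   # older Stata files (up to version 12), from foreign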
With that, let us load the two datasets we will be working with today. Go to the following websites/Canvas and download the fatal encounters data in .csv format and the CES data (I would recommend using the Stata format and then loading the .dta file in), and load them into R.
CES 2020 Data on Harvard Dataverse - available in both .csv and .dta. Typically, researchers will pick the form that suits their statistical program best.
# library(haven)
ces <- read_dta("CES20_Common_OUTPUT_vv.dta")
fatal <- read.csv("fatal.csv")
If the data is successfully loaded, you will see both datasets in the local environment panel under the object names you gave them.
Today the research question we will be exploring is: are attitudes about law enforcement associated with (affected by) police violence against civilians?
Let’s brainstorm: What is the independent variable? What is the dependent variable? What could be the unit of analysis? How would you measure each of these?
Before we launch into this analysis, we need to understand our data and understand how to unify it into a single dataframe. Let’s start by doing something we should always do: determining the structure and type of our data using the summary() command.
This will tell us some basic information about our dataset – all the variables, the type of the data (e.g. numeric, character, factor) and some distributional information about continuous numeric variables.
summary(fatal)
summary(ces)
# names(fatal)
colnames(fatal)
## [1] "X" "id"
## [3] "Name" "Age"
## [5] "Gender" "Race"
## [7] "Race.with.imputations" "Imputation.probability"
## [9] "date" "address"
## [11] "City" "State"
## [13] "ZIP" "county"
## [15] "FullAddress" "Latitude"
## [17] "Longitude" "Agency"
## [19] "force" "description"
## [21] "Dispositions" "link"
## [23] "Unique.identifier..redundant." "dateformatted"
## [25] "numchar"
We can take a look at the first few rows (observations) of data in each dataframe using the head() command, which will display the first few sets of observations in our dataframes.
head(fatal)
A useful strategy when thinking about manipulating data is to identify what our ideal dataset would look like once we are finished working; I do this in nearly every project myself. What would it look like here? Remember, we are trying to determine whether attitudes about law enforcement are affected by police violence against civilians.
A few things to keep in mind:
The data should be stored in a single dataframe. This would involve merging datasets if there are more than one.
What is the main identifier that we want to use to link the data together? Zipcode is the main geographic identifier.
We can create a cleaner dataframe with only the variables of interest. CES has 716 columns/variables and fatal encounters has 25. We can delete extraneous variables.
What are the variables we are interested in keeping in the dataframes? The best way to get to know a dataset is through its codebook and questionnaire.
The next step in learning how to create one unified dataset is to first select the variables we are interested in, in each of the datasets. This brings us to the next part, subsetting.
One of the most common things you will do in R is subset data. Subsetting involves only keeping part of some larger dataframe based on a set of criteria that we offer the data in R. Put another way, subsetting is one way to manipulate a dataframe to only include those observations you need for analysis.
When subsetting there are generally two approaches:
[c(rows), c(columns)], based on the row and column numbers of the dataframe we want to keep. This approach is less common but may be useful to pull specific variables or observations out of your data. When you use this approach, the square brackets serve as indexers for your dataframe. Data follow the Rows x Columns (RxC) principle; in other words, the first index is for Rows and the second index is for Columns [c(R), c(C)].

# rows 1:10, first 8 columns only
ces[c(1:10), c(1:8)]
## # A tibble: 10 × 8
## caseid commonweight commonpostweight vvweight vvweight_post tookpost CCEStake
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lb> <dbl+lb>
## 1 1.23e9 0.783 0.666 0.851 0.607 2 [Yes] 1 [Yes]
## 2 1.23e9 1.34 1.44 NA NA 2 [Yes] 1 [Yes]
## 3 1.23e9 0.406 0.342 NA NA 2 [Yes] 1 [Yes]
## 4 1.23e9 0.958 0.822 1.04 1.00 2 [Yes] 1 [Yes]
## 5 1.23e9 0.195 0.162 NA NA 2 [Yes] 1 [Yes]
## 6 1.23e9 1.06 0.880 1.15 0.988 2 [Yes] 1 [Yes]
## 7 1.23e9 0.758 0.848 0.824 0.176 2 [Yes] 1 [Yes]
## 8 1.23e9 1.16 1.19 1.26 1.29 2 [Yes] 1 [Yes]
## 9 1.23e9 0.846 0.714 0.920 1.04 2 [Yes] 1 [Yes]
## 10 1.23e9 0.919 0.751 0.999 0.921 2 [Yes] 1 [Yes]
## # ℹ 1 more variable: birthyr <dbl>
Now we want to keep the 8th through 13th columns (variables) and rows 1:10 in our Fatal Encounters dataframe. We would do the following:
# rows 1:10, 8th through 13th columns only
fatal[c(1:10), c(8:13)]
## Imputation.probability date address
## 1 0.986065754 1/12/00 Lexington Avenue and East First Street
## 2 0.98614316 1/21/00 Upper Sumner Hill Road
## 3 Not imputed 1/23/00 2104-1962 MA-138
## 4 Not imputed 1/23/00 2104-1962 MA-138
## 5 Not imputed 1/28/00 270 Valley St,
## 6 Not imputed 1/28/00 Bayview Avenue and Arlington Avenue
## 7 0.992780017 2/11/00 528 Mill Rd
## 8 Not imputed 3/16/00 Mantua Grove Road and Forest Parkway
## 9 Not imputed 3/29/00 West Street and Pine Street
## 10 0.962815715 4/5/00 779-625 Flaghole Rd
## City State ZIP
## 1 Clifton NJ 7011
## 2 Sumner ME 4292
## 3 Raynham MA 2767
## 4 Raynham MA 2767
## 5 Providence RI 2909
## 6 Jersey City NJ 7305
## 7 Irvington NJ 7111
## 8 West Deptford NJ 8066
## 9 Abington MA 2351
## 10 Andover NH 3216
The second approach is the subset() command. We use this command to keep only some subset of the data based on a set of criteria, rather than the location (row/column) of the data. For example, we can use the subset command to keep only the survey respondents on the CES in the state of North Carolina. We do so using the following syntax:
nc <- subset(ces, inputstate == 37)
# subset if inputstate is 37 (North Carolina)
If we want to subset based on MULTIPLE conditions, we can do that easily using & (and) or | (or) in our subset command. For example, say we want to include only CES survey respondents from North Carolina and Georgia; we use the OR (|) to subset on those criteria.
ganc <- subset(ces, inputstate == 37 | inputstate == 13)
# subset if inputstate is 37 (North Carolina) OR 13 (Georgia)
But what if we want to subset to look at, say, individuals who belong to a union (union) and self-report having voted in 2020 (CC20_401)? We need to consult our codebook and be sure to use the & (and) in the subset command.
unionvote <- subset(ces, CC20_401 == 5 & union == 1)
# subset() can also select certain columns (note this keeps all rows)
unionvote_cols <- subset(ces, select = c(caseid, CC20_401, union))
dplyr

dplyr from library(dplyr) is one of the most commonly used packages for manipulating data. We use the pipe operator %>%, which passes the result on its left into the function on its right, to chain dplyr functions like filter() and select(). filter() will subset rows and select() will subset columns.
# subset respondents from NC
nc2 <- ces %>% filter(inputstate == 37)
# subset respondents from NC and GA
ganc2 <- ces %>% filter(inputstate == 37 | inputstate == 13)
# subset respondents NOT from NC (note != means "not equal")
notnc2 <- ces %>% filter(inputstate != 37)
# subset union members who report having voted in 2020
unionvote2 <- ces %>% filter(CC20_401 == 5 & union == 1)
Now we have some rough sense of how the subset command works in R. With that in mind, let’s return to the research question at hand, which is attempting to assess whether attitudes toward the police become less favorable after police violence towards civilians.
Recall that we want the following variables from CES:
#this code keeps all rows and ONLY the column names I call here
# base R
# ces2 <- ces[, c("caseid","inputstate","lookupzip","CC20_307","CC20_334b","CC20_334d", "CC20_334e", "CC20_334f", "CC20_334g", "CC20_334h")]
# dplyr
ces2 <- ces %>% select("caseid","inputstate","lookupzip","CC20_307","CC20_334b","CC20_334d", "CC20_334e", "CC20_334f", "CC20_334g", "CC20_334h")
Now we need to do the same with the fatal encounters dataset, but we need two things: the number of police killings per zipcode, over a certain period of time. Note that fatal.csv runs from 2000 through 2021, while the survey was fielded in 2020.
How do we do this? We need two variables: the zipcode of the killing (ZIP) and the date.
In addition, let’s limit this to killings that occurred in the year before respondents took their survey. Because survey responses were collected September 29 - December 14 of 2020, we will subset our data to only include police killings between September 29, 2019 and December 14, 2020.
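A natural first try (a sketch of the naive approach) is to subset on the raw date column directly:

fatal2 <- subset(fatal, date > "2019-09-29" & date < "2020-12-14") # comparing character strings, not dates!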
Now…look…there are 0 observations for the fatal2 dataset! What happened?
The problem lies with the structure of the date variable in the fatal encounters data. Using the class() command, we can see what type of data the date variable currently is.
summary(fatal$date)
## Length Class Mode
## 30021 character character
class(fatal$date)
## [1] "character"
So instead, we need to convert this to a date object. To do so, we’ll need to use the as.Date() command. The syntax of as.Date() is as follows:
as.Date(data$variable, format = "%y/%m/%d")
where %y stands for year, %m stands for month, %d stands for day, and the / marks represent the dividing characters. Looking at the date variable in the fatal encounters data, it seems the format is actually month/day/year, so we can specify that below.
fatal$dateformat <- as.Date(fatal$date, format = "%m/%d/%y")
It is important to create a new variable rather than overwrite the old variable because it allows us to compare and see if the conversion to a date object worked. If we overwrote the old variable with our new variable we couldn’t validate this.
Let’s compare the results:
head(fatal$dateformat, 10)
## [1] "2000-01-12" "2000-01-21" "2000-01-23" "2000-01-23" "2000-01-28"
## [6] "2000-01-28" "2000-02-11" "2000-03-16" "2000-03-29" "2000-04-05"
head(fatal$date, 10)
## [1] "1/12/00" "1/21/00" "1/23/00" "1/23/00" "1/28/00" "1/28/00" "2/11/00"
## [8] "3/16/00" "3/29/00" "4/5/00"
See! The conversion worked! Now we can run that same code from earlier to subset our data.
fatal2 <- subset(fatal, dateformat > "2019-09-29" & dateformat < "2020-12-14")
# fatal2 <- fatal %>% filter(dateformat > "2019-09-29" & dateformat < "2020-12-14")
We ultimately want to merge the fatal encounters dataset with the CES data, so that we can see if residing in a zipcode with police violence affects attitudes towards police.
Right now the fatal encounters data’s unit of analysis is an individual police killing of a civilian. If we merge this dataset with the CES data as-is, we will duplicate many CES responses, because in any zipcode with multiple police killings of civilians we will end up merging EACH killing with EACH survey respondent. To put it in more concrete terms, if there are 8 police killings in the 27701 zipcode and 1 survey respondent in the 27701 zipcode, a simple merge will duplicate that 1 survey respondent 8 times in our dataframe, as the toy example below illustrates.
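A minimal illustration with made-up dataframes (resp and kills are hypothetical, not part of our datasets):

resp <- data.frame(caseid = 1, zip = 27701)
kills <- data.frame(zip = rep(27701, 8), killing_id = 1:8)
merge(resp, kills, by = "zip") # the single respondent now appears 8 times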
So how should we aggregate the data?
Some options from dplyr to manipulate data (use ?help to check what these functions do, e.g. ?select):

mutate(): create new columns based on info from other columns
group_by() / summarize(): create summary statistics of grouped data

Applying this to our dataset, we want one row per zipcode with the number of killings in that zipcode. The syntax of these aggregations is as follows:
newdf <- oldf %>% group_by(idvar) %>% summarise(obs = n())
fatal_zip <- fatal %>%
group_by(ZIP) %>%
summarise(numberofkillings = n()) # one row per zipcode; swap in fatal2 to count only the 2019-2020 window
head(fatal_zip, 10)
## # A tibble: 10 × 2
## ZIP numberofkillings
## <int> <int>
## 1 1013 1
## 2 1030 1
## 3 1040 1
## 4 1069 1
## 5 1082 1
## 6 1085 2
## 7 1089 2
## 8 1103 1
## 9 1105 2
## 10 1106 2
ces %>%
group_by(inputstate) %>%
summarise(mean = mean(CC20_307, na.rm = TRUE), n = n()) # na.rm = TRUE keeps missing responses from turning the mean into NA
Now we have two dataframes the way we want them. How do we join the information from the CES dataframe with the information from the fatal encounters dataframe? In R, we can use the merge() command to combine data frames. This function tries to identify columns or rows that are common between the two data frames using common identifiers.
There are 4 different kinds of merges (inner, left outer, right outer, and full outer); the type of merge determines which rows are kept in the merged result.
The syntax of a merge is as follows:
newdf <- merge(datasetx, datasety, by.x = c("colnamex"), by.y = c("colnamey"), all = TRUE)
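As a sketch of how the all arguments map onto the four merge types (datasetx, datasety, and the column names are placeholders from the syntax above):

# inner join (the default): keep only rows matched in BOTH dataframes
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey")
# left outer join: keep every row of datasetx
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all.x = TRUE)
# right outer join: keep every row of datasety
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all.y = TRUE)
# full outer join: keep every row of both
# newdf <- merge(datasetx, datasety, by.x = "colnamex", by.y = "colnamey", all = TRUE)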
In this case we need to do a left outer join, because there are more zipcodes in the survey data (ces) than in the police killings data (fatal). That means we want to keep all the data on our survey respondents’ ratings of the police, but only the fatal encounters data that merges with our survey responses. That’s because we don’t need data on every police killing of a civilian; we just want to compare how respondents living in zipcodes with a police killing rate their local police, compared to respondents living in zipcodes without a recent police killing.
The key to merging is a common identifier across datasets. In this case, that is zipcode! So in the by.x and by.y arguments, we input the zipcode variable from both datasets.
ces_merge <-merge(ces2, fatal_zip, by.x = c("lookupzip"), by.y = c("ZIP"), all.x = TRUE)
head(ces_merge, 10)
## lookupzip caseid inputstate CC20_307 CC20_334b CC20_334d CC20_334e
## 1 01001 1240642977 25 1 1 2 2
## 2 01001 1235172847 25 1 1 2 2
## 3 01001 1240490335 25 1 1 1 1
## 4 01001 1245486909 25 2 1 2 2
## 5 01002 1232441929 25 2 1 1 1
## 6 01002 1249694189 25 3 1 1 2
## 7 01002 1232540803 25 3 1 1 1
## 8 01002 1243898251 25 3 1 1 1
## 9 01002 1239514751 25 1 1 2 1
## 10 01002 1245173699 25 1 1 2 1
## CC20_334f CC20_334g CC20_334h numberofkillings
## 1 2 1 2 NA
## 2 2 2 1 NA
## 3 1 2 1 NA
## 4 1 2 2 NA
## 5 1 1 1 NA
## 6 1 1 1 NA
## 7 1 1 1 NA
## 8 1 1 1 NA
## 9 2 2 1 NA
## 10 1 1 1 NA
Success!!!
Let’s check out the data now using table(), which will tabulate rows and aggregate them for us. While we’re there, let’s use is.na() or summary() to find out whether we have missing data.
table(ces_merge$numberofkillings)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 11464 8280 5989 4300 3140 2220 1611 1428 770 785 498 355 265
## 14 15 16 17 18 19 20 21 22 23 24 25 28
## 283 175 81 116 86 94 74 17 23 20 28 25 7
## 32 33 34
## 11 16 8
summary(ces_merge$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 3.00 3.83 5.00 34.00 18831
sum(is.na(ces_merge$numberofkillings))
## [1] 18831
We have 18,831 missing values? Why?
The reason is that our fatal encounters dataset records every single police killing of a civilian that occurred, but it does NOT report the number of killings for every zipcode in the US: zipcodes with no recorded killings never appear in it, so they come out of the merge as NA. If we really believe the fatal encounters data, then, we should treat all the missing/NA values as 0’s, meaning no police killings of civilians in that zipcode.
To do so, we will use the square brackets [ ] for indexing, together with the is.na() command.
# dplyr
ces_merge2 <- ces_merge %>%
mutate(killings = ifelse(is.na(numberofkillings) == TRUE, 0, numberofkillings)) ## another way to recode numberofkillings NA's into 0, into the new variable killings
# original
ces_merge$numberofkillings[is.na(ces_merge$numberofkillings)] <- 0
# check results
summary(ces_merge$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 2.648 4.000 34.000
sum(is.na(ces_merge$numberofkillings))
## [1] 0
In addition, we might also want to compare respondents living in zipcodes with no police killings to respondents living in zipcodes with any number (1 or more). This requires us to use the same square brackets [ ] indexing to create a new variable. The syntax to do this is:
ces_merge$binarykilling[ces_merge$numberofkillings == 0] <- 0
ces_merge$binarykilling[ces_merge$numberofkillings > 0] <- 1
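As a quick sanity check, we can tabulate the new variable (the counts will reflect your own merged data):

table(ces_merge$binarykilling) # 0 = no killings in the zipcode, 1 = at least one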
Finally, one last problem: let’s examine the dependent variables. We might want to change the variable names, since the raw question codes are hard to distinguish.
summary(ces_merge$CC20_307)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.857 2.000 4.000 86
names(ces_merge)[names(ces_merge) == "CC20_307"] <- "safe"
names(ces_merge)[names(ces_merge) == "CC20_334d"] <- "decrease_pol"
names(ces_merge)[names(ces_merge) == "CC20_334e"] <- "ban"
names(ces_merge)[names(ces_merge) == "CC20_334f"] <- "natregistry"
names(ces_merge)[names(ces_merge) == "CC20_334g"] <- "weapons"
names(ces_merge)[names(ces_merge) == "CC20_334h"] <- "sue"
summary(ces_merge$safe)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.857 2.000 4.000 86
summary(ces_merge$decrease_pol)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.586 2.000 2.000 26
unique(ces_merge$decrease_pol)
## <labelled<double>[3]>: Policing policies -- Decrease the number of police on the street by 10 percent,
## [1] 2 1 NA
##
## Labels:
## value label
## 1 Support
## 2 Oppose
## 8 skipped
## 9 not asked
We want to recode decrease_pol as a binary variable of 0 and 1.
# original
# ces_merge$decrease_pol[ces_merge$decrease_pol == 1] <- 0
# ces_merge$decrease_pol[ces_merge$decrease_pol == 2] <- 1
# dplyr
# ifelse(condition, value_if_true, value_if_false)
ces_merge <- ces_merge %>% mutate(decrease_pol = ifelse(decrease_pol == 1, 0, 1))
Let’s practice recoding ban, natregistry, weapons, and sue as binary variables of 0 and 1.
ces_merge <- ces_merge %>% mutate(ban = ifelse(ban == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(natregistry = ifelse(natregistry == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(weapons = ifelse(weapons == 1, 0, 1))
ces_merge <- ces_merge %>% mutate(sue = ifelse(sue == 1, 0, 1))
summary(ces_merge$ban)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.2352 0.0000 1.0000 26
Now we want to create a new additive variable that adds up the responses to create a score of negative attitudes.
ces_merge$police_score <- ces_merge$ban + ces_merge$natregistry + ces_merge$weapons + ces_merge$sue
summary(ces_merge$police_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.000 1.136 2.000 4.000 63
Though we’ll cover this more in the visualization lab, considering we did all this work, let’s use ggplot to take a look at the distribution of responses!
library(ggplot2)
# histogram of feeling safe, faceted by whether the zipcode had a police killing
p <- ggplot(ces_merge, aes(x=safe)) +
geom_histogram(aes(y = after_stat(width*density))) +
scale_y_continuous(labels = scales::percent) + scale_x_continuous(breaks=c(1,2,3,4),
labels=c("Safe", "", "", "Unsafe")) +
facet_grid(.~ binarykilling ) +
theme_bw() +
labs(title = "Police Killings and Feeling Safe around Police",
caption = "Data source: 2020 CCES (n = 61,000)",
x = "Attitude Towards Local Police",
y = "Percentage of Repondents")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 86 rows containing non-finite values (`stat_bin()`).
library(ggplot2)
# histogram of the negative-attitudes score, faceted by whether the zipcode had a police killing
p2 <- ggplot(ces_merge, aes(x=police_score)) +
geom_histogram(aes(y = after_stat(width*density))) +
scale_y_continuous(labels = scales::percent) + scale_x_continuous(breaks=c(0,1,2,3,4),
labels=c("0", "1", "2","3","4")) +
facet_grid(.~ binarykilling ) +
theme_bw() +
labs(title = "Police Killings and Negative Attitudes toward Police",
caption = "Data source: 2020 CCES (n = 61,000)",
x = "Attitude Towards Local Police",
y = "Percentage of Respondents")
p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 63 rows containing non-finite values (`stat_bin()`).
To summarize, we learned about the following functions / commands in R:
read.csv()
read_dta()
summary()
names()
head()
subset()
[r, c]
class()
merge()
filter()
select()
group_by()
summarise()
table()
is.na()
ifelse()
This is obviously a lot to remember and a lot to make sense of in a single lecture. Other key resources for you to use, which I draw on heavily for this guide: