Course Objectives

The goals of today’s class are to:

Install R and R Studio
Understand working folders and user interface of R Studio
Install an R package (RMarkdown)
Learn programming language concepts: data types (e.g. vectors, lists, dataframes)
Perform basic data manipulation and use statistical functions: transforming a vector into dataframe, recalling items from dataframe, calculating descriptive statistics
Create a descriptive statistics data table

1. Intro to R and R Studio

We will be performing data manipulations and statistical analysis utilizing R in this course. R is an open source statistical programming environment meaning that anyone can download it for free, examine source code, and make their own contributions.

Other options: Python, STATA (Economics, Not free), SAS/SPSS (Psychology, Not free)
RStudio is an open source integrated development environment (IDE) that makes R easier to work with. RStudio only works if R is also installed.

Downloading and Installing R

Go to https://cran.r-project.org/
On the front page, under Download and Install R, select the appropriate download for your operating system.
For Windows users, click Download R for Windows, and then select base, and then Download R 4.4.0 for Windows.
For Mac users, click Download R for Mac OS X, and then select R-4.4.0.pkg. Make sure the downloaded R version is compatible with your macOS system.
Once your file is downloaded, run the installer, which is an .exe file on Windows or a .pkg file on macOS (this will look a little different depending on your computer).

Version Control for R

version # allows us to see current R info on a computer
E.g. I am currently running R version 4.2.2 (2022-10-31) - “Innocent and Trusting”
Why are R version release names unusual? All release names are references to Peanuts the comic.

R nickname

Sometimes the version matters because R is free software with contributions from other programmers, often distributed as packages, as we will show later. Package authors update their packages on their own schedules, and some packages stop being updated. So if you run into problems using a package, a good first step is to make sure your version of R is compatible with what the package requires.
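To update R, download and install the latest version from CRAN as described above. Note that Help > Check for Updates in RStudio checks for a new version of RStudio, not of R.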

Downloading and Installing R Studio

Go to https://www.rstudio.com/
On the front page, click Download RStudio
We want the RStudio Desktop Open Source License, so select Download for that option
Now, you’ll see a list of Installers for Supported Platforms; click the one that’s right for your computer and download it.
Run the file to install RStudio and follow the instructions.

Version Control for RStudio

When you update R this does not update RStudio, and vice versa
To access the R Studio Version, type RStudio.Version() or find it in the About RStudio dropdown menu.

For a much more detailed intro to R, check out: https://cran.r-project.org/doc/manuals/R-intro.pdf

Welcome to RStudio

The rest of this lab will center on running through the basics of using R, using the RStudio interface, and getting familiar with the practice of programming and statistical analysis. It will be a lot of content to cover but working with the material hands on will help too! At the end of this tutorial, there will be an R file with the code used today, as well as an activity we will work through in lab to play around with some data and get familiar with the output in R.

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

2. R Studio User Interface

To begin, open RStudio. You will usually see that the main window is divided into four panels. We will primarily use the two panels on the left: the top one is the script, and the bottom one is the console.

RStudio layout

There are two main ways of interacting with R: by writing code and running it from the script (top panel), or by typing and executing it in the console (bottom panel).

Panel 1. Script: the script is where you write your code and documentation. Think of it as a document where your instructions go. You can run part of your script by selecting the commands you want and pressing Ctrl + Enter on Windows or Cmd + Return on Mac.
Panel 2. Console: type a command and run it directly by pressing Enter. The console is where code is executed, and it logs past commands and outputs.
Panel 3. Local environment: shows the data, lists, and other objects you create.

Working Directory

A working directory is like a home base for all of the R code, R files, and data for a specific project.
Each project should have its own folder with organized sub-folders.
You can only have one working directory active at any given time.
The dataset you are working with should be in the same working directory as your R code file.
To see your current working directory:

getwd()

## [1] "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs"

Before working on an R file, you want to set the working directory first.
To change the working directory manually, click on the relevant folder in the Files section, click the gear icon, and then select Set As Working Directory.
To change working directory with code:

setwd(dir = "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs")

Installing Packages

We will be installing “rmarkdown,” a package that lets you create a notebook with your code, comments, and output all in one file. You can run chunks of code one at a time, or all in one go, with the results appearing beneath the code (a reproducible document).

The code for installing packages is install.packages() and the package name goes within the parentheses.

# install.packages("rmarkdown")
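A common pattern is to install a package only when it is missing. The sketch below is one way to do that; the helper name needs_install is made up for this illustration and is not part of R. It relies on requireNamespace(), which checks whether a package is available without loading it. The install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: TRUE if the package is not installed yet
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so it never needs installing
needs_install("stats")

# Install rmarkdown only if it is missing (uncomment to run)
# if (needs_install("rmarkdown")) install.packages("rmarkdown")

# Installing happens once; loading with library() happens in every session
# library(rmarkdown)
```

Installing a package and loading it are separate steps: install.packages() downloads it to your computer, and library() makes its functions available in the current session.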

To create a new notebook in R Markdown:

1. File menu -> select New File -> R Notebook or R Markdown
2. This will open a .Rmd file that you can save to your computer using File and Save As
3. In the output section of the notebook, writing html_notebook would create an HTML file

To execute code:

1. Create a chunk to write code in using the shortcut Ctrl + Alt + I (Windows) or Cmd + Option + I (Mac).
2. Use the green triangle button on the toolbar of a code chunk that has the tooltip Run Current Chunk, or Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter) to run the current chunk.
3. Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement. Running a single statement is much like running an entire chunk consisting only of that statement.
4. There are other ways to run a batch of chunks if you click the menu Run on the editor toolbar, such as Run All, Run All Chunks Above, and Run All Chunks Below.

R Markdown Cheat Sheet: https://rmarkdown.rstudio.com/lesson-15.html

3. Intro to Programming

Arithmetic Operations

R is like a high powered calculator
Here is an example of a type of output you get in R console:

1 + 2

## [1] 3

16 / 4

## [1] 4

What is the code for computing 5 to the power of 31?

sqrt() is an example of a function.
A function takes inputs and produces outputs.

sqrt(4)

## [1] 2

Syntax: Writing and commenting in-line

You can add in-line comments and section headers when writing code to give it structure and explanation. This makes it easier for you to keep track of the progress you have made, and for a reader to follow your code and what the variables mean. We use # to indicate a comment: when R sees a #, it ignores the rest of that line. One convention for sectioning code is to use #### for headers and major sections, ### for sub-sections, and a single # for minor comments. For example, if you had two variables, x and x2, that referred to different things, it would be helpful to add a comment next to each assignment saying what the variable means.

# provide your comments on this line
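The commenting conventions above might look like this in a script. This is only a sketch: the values and the meanings given to x and x2 are invented for illustration.

```r
#### Section 1: Class size ####

### Sub-section: raw counts
x <- 14      # x: number of students enrolled (hypothetical value)
x2 <- x^2    # x2: x squared, not a second head count

# A comment can also follow code on the same line
x + x2  # everything after the # is ignored by R
```

In RStudio, a comment line that ends in four or more # characters also shows up as a section in the script outline, which makes long scripts easier to navigate.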

Objects

R can store information as an object with a name of our choice. We use objects as shortcuts for recalling pieces of information. Using <- (assignment operator), we store a value to an object.

rbsi_class <- 14

If we want to know how many slices of pizza we need for our class, we can use the object name to perform mathematical operations.

We can perform functions on objects.

sqrt(rbsi_class)

## [1] 3.741657

Let’s estimate that everyone in class is going to eat 3 slices of pizza, and each box of pizza has 8 slices. How many boxes of pizza should we order?
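One possible way to work through the pizza question is sketched below. The object names other than rbsi_class are made up for this example, and ceiling() is used on the assumption that we can only order whole boxes.

```r
rbsi_class <- 14          # number of people in class (from above)
slices_per_person <- 3    # estimate from the question
slices_per_box <- 8

total_slices <- rbsi_class * slices_per_person   # 42 slices
boxes_exact <- total_slices / slices_per_box     # 5.25 boxes

# We cannot order a fraction of a box, so round up
boxes_to_order <- ceiling(boxes_exact)
boxes_to_order
```

Storing each quantity in a named object means that if the class size changes, only the first line has to be edited.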

We can also change the value of an object by reassigning it a new value.

rbsi_class <- 14 + 10

Syntax: Case-sensitivity

Note that R is case-sensitive, so be careful how you name objects! If you save an object as rbsi_class, R will not be able to recall it as Rbsi_class.

rbsi_class

## [1] 24

Rbsi_class

## Error in eval(expr, envir, enclos): object 'Rbsi_class' not found

Download Dataset

Most of the time, we analyze data downloaded from external sources in R.

Download and save the dataset (1976-2020-president.csv) into your working directory. (Source: https://electionlab.mit.edu/data)

The most common data file extensions are .csv, .dta, and .RData. The code to load the data varies with the file type. When creating your own data in another program, such as Excel, remember to save the file as a CSV file.

Load the dataset into R

read.csv() for .csv files
read.dta() from R package ‘foreign’ or read_dta() from R package ‘haven’ for .dta files
load() for .RData files

data <- read.csv("1976-2020-president.csv")

# When performing data manipulations, it is good practice to copy the loaded
# data into another object so you can revert to the original if you make a mistake
pres_data <- data
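If you do not have the election file at hand yet, the same load-then-copy pattern can be tried on a tiny made-up file. The sketch below writes three lines to a temporary CSV; the column names and values are invented and are not taken from the real dataset.

```r
# Write a tiny made-up CSV to a temporary file (illustration only)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("year,state,votes",
             "1976,ALABAMA,100",
             "1980,ALASKA,250"),
           tmp_file)

# Read it back in; each column gets a class based on its contents
toy_data <- read.csv(tmp_file, stringsAsFactors = FALSE)

# Keep an untouched copy before manipulating anything
toy_backup <- toy_data

str(toy_data)    # compact overview: column names, classes, first values
nrow(toy_data)   # number of rows read

unlink(tmp_file) # clean up the temporary file
```

str() is a quick way to confirm that the file was read the way you expected before doing any analysis.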

Now, what do you see in your local environment?

Syntax: How to clear the environment

rm(x) # things can be removed individually
rm(list=ls()) # or all at once

Try to remove rbsi_class from the local environment.

Data Types in R

There are different types of information stored in R as data. These data types are also known as classes. R assigns each object to a class so that it can perform the appropriate operations on it.

class(531)

## [1] "numeric"

class(TRUE)

## [1] "logical"

class("method lab")

## [1] "character"

class(pres_data)

## [1] "data.frame"

Numeric

Represents a number, either whole or real

whole.number <- 6 # a whole number; R stores it as a double unless written as 6L
real.number <- 3.141592 # double/numeric

In the president turnout dataset, which variables are numeric?

class(pres_data$year)

## [1] "integer"

Syntax: Recall a column from a Dataframe

To access a column inside a dataframe, use $. For example, for a makeshift dataframe my.data with two columns, age and name (created in the Dataframe section later in this lab), my.data$age references the age column.
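As a short sketch of the $ syntax, the code below builds the same makeshift my.data dataframe that appears in the Dataframe section and pulls columns out of it.

```r
# Makeshift dataframe with two columns, age and name
my.data <- data.frame(age = c(35, 24, 18, 72),
                      name = c("Oliver", "Meghan", "Cole", "Violet"))

my.data$age         # the whole age column, returned as a vector
my.data$name[2]     # second entry of the name column
mean(my.data$age)   # a column can be passed straight to a function
```

Because my.data$age is just a vector, anything that works on a vector (indexing, arithmetic, summary functions) works on a column too.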

Characters and Strings

Characters are any form of data that contains text, such as letters.
Strings are multiple characters in a row. If there is a mix of letters and numbers inside quotation marks, R treats the whole thing as a string.
The syntax for characters (and strings) is that they have to be enclosed in quotation marks, like "data"
If you reference a word without quotation marks, R will look for an object that has been created with that name, such as an object called data

In the president turnout dataset, which variables are characters?

Create an object named party_name and save one party name as a character

party_name <- "Democrat"

Boolean / Logical

A binary variable that can take only one of two values: TRUE or FALSE
Used very often, either in the TRUE/FALSE form or in a 0/1 form
It is easier to code binary outcomes as logical values because it is less prone to error. Let’s say the data recorded a student’s attendance: whether they are Present/Absent is just as easily represented as TRUE/FALSE

TRUE

## [1] TRUE

FALSE

## [1] FALSE
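The 0/1 form mentioned above is built into R: in arithmetic, TRUE counts as 1 and FALSE as 0. The attendance vector below is a made-up example.

```r
# Hypothetical attendance record: TRUE = present, FALSE = absent
present <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(present)  # converts to the 0/1 form
sum(present)         # counts the TRUE values
mean(present)        # share of TRUE values
```

This is why sum() and mean() are handy for counting cases and computing proportions on logical data.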

Which variable in the turnout dataset is boolean?

Factor

R has a special class of data that is used fairly commonly, called factor data. Factors capture categorical data, where there is a pre-defined set of values. For example, take class standing in a college: four categories are used to organize students (freshman, sophomore, junior, and senior). These categories are used regardless of how many years of study a student has completed, and are often assigned based on the course credits a student has earned. Another often used example of a factor is gender: male, female, and other.

Syntax: Changing Data Types

# as.numeric()
# as.character()
# as.factor()
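A brief sketch of the three conversion functions in use is shown below. The object names and values are chosen only for illustration.

```r
num_from_text <- as.numeric("42")      # character to numeric
text_from_num <- as.character(531)     # numeric to character
grade <- as.factor(c("b", "a", "b"))   # character to factor

class(num_from_text)
class(text_from_num)
levels(grade)

# Text that does not look like a number becomes NA (R also gives a warning)
not_a_number <- suppressWarnings(as.numeric("five"))
not_a_number
```

Checking for unexpected NA values after a conversion is a good habit, since they usually mean some entries were not in the format you assumed.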

Syntax: Checking Data Types

Knowing the different data types is important because some operations only work with specific data types. For example, though you can add multiple numeric objects together using the addition operator +, it is not possible to do the same with character objects, or between character and numeric objects.

5 + 5

## [1] 10

"five" + "five"

## Error in "five" + "five": non-numeric argument to binary operator

5 + "five"

## Error in 5 + "five": non-numeric argument to binary operator

There are a couple of tools available to you for determining and changing the data type of an object.

is.numeric(pres_data$year)

## [1] TRUE

is.numeric(pres_data$totalvotes)

## [1] TRUE

str(pres_data$state)

##  chr [1:4287] "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...

class(pres_data$party_detailed)

## [1] "character"

We can transform a character variable into a factor variable. Practice changing the party_detailed variable into a factor variable, and check that the change was done correctly.

We use the function levels() to find out what the levels in a factor are.
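The same steps can be sketched on a small made-up vector before trying them on the dataset. The party labels below are invented stand-ins, not rows from pres_data.

```r
# Hypothetical party labels stored as text
party <- c("DEMOCRAT", "REPUBLICAN", "OTHER", "DEMOCRAT")
class(party)

# Convert to a factor and confirm the change
party_factor <- as.factor(party)
class(party_factor)

# levels() lists the distinct categories, sorted alphabetically
levels(party_factor)

# table() counts how many observations fall in each level
table(party_factor)
```

Saving the result under a new name keeps the original character vector available in case the conversion does not turn out as intended.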

Vectors and Lists

Vectors and lists are a way to store one or more values or objects, which can be either numbers or characters.

A vector is a one-dimensional array of numeric values, strings, or other information.

the notation for vectors is c()
You can use c() to add more elements to an existing vector

# concatenate
oranges <- c(10, 15, 20, 60, 65)
oranges

## [1] 10 15 20 60 65

apples <- c(2, 3, 4, 5, 5)
apples

## [1] 2 3 4 5 5

# use c() to combine multiple vectors
# order of items in vector matters when combining
vector <- c(oranges, apples)

Indexing is used to access specific elements of a vector. We use square brackets [] for this.

# second item
vector[2]

## [1] 15

# second and third item
vector[c(2,3)]

## [1] 15 20

# arithmetic operation
vector <- vector + 10
print(vector)

##  [1] 20 25 30 70 75 12 13 14 15 15

Lists are similar to vectors in that they are a series of values. The key difference is that a vector must contain the same data type across all elements, while a list can mix data types.

The notation for lists is list()
Note that in the example below, though the input is the same for both vector and list, vectors require all elements to be the same type, so R has interpreted 2 as a character instead of a number. This is called type coercion (or type-casting).
Hence when we apply the function class() to both objects, it tells us what kind of data is in the vector, but reports the list as its own data type, "list".

vector <- c('uber', 2, 'lyft')
vector

## [1] "uber" "2"    "lyft"

class(vector)

## [1] "character"

list <- list('uber', 2, 'lyft')
list

## [[1]]
## [1] "uber"
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] "lyft"

class(list)

## [1] "list"

Using Vectors and Lists

Since vectors and lists are essentially just a group of objects, we can inspect their contents, structure, and also interact with the objects within them as a group.

Note the difference between adding a number to every value in a vector, and adding another number to a vector

length(oranges)

## [1] 5

class(oranges)

## [1] "numeric"

oranges + 30

## [1] 40 45 50 90 95

oranges2 <- c(oranges, 30)
oranges2

## [1] 10 15 20 60 65 30

In this case, oranges is a vector of 5 values. We can extract and replace values at specific locations in the vector, using square brackets [] to select the position of the value we are interested in.

oranges[1]

## [1] 10

oranges[3] <- 5
oranges

## [1] 10 15  5 60 65

Syntax: Indices in R

For those who have some programming experience, note that R indices start at 1. Many other programming languages, such as C++, Java, and Python, start from 0 instead.

Matrices

Similar to vectors, matrices (singular: matrix) are a collection of elements arranged into a fixed number of rows and columns. The function matrix() creates a matrix. All the elements must have the same data type, and all the columns must be the same length. Most often, matrices are numerical.

Below we have a 4x3 matrix, with four rows, three columns. Here we used 1:12 to create a sequence of numbers from 1 to 12. It is the same as using c(1,2,3,4,5,6,7,8,9,10,11,12)

1:12

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

matrix(1:12, nrow = 4, ncol = 3)

##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

You can also create a matrix from a vector. Every column of a matrix must have the same length, so note how oranges, with 5 values, does not fill a matrix with 3 rows evenly (R gives a warning and recycles values from the start of the vector to fill the gap), but oranges2, with 6 values, does.

matrix(oranges, byrow = TRUE, nrow = 3)

## Warning in matrix(oranges, byrow = TRUE, nrow = 3): data length [5] is not a
## sub-multiple or multiple of the number of rows [3]

##      [,1] [,2]
## [1,]   10   15
## [2,]    5   60
## [3,]   65   10

matrix(oranges2, byrow = TRUE, nrow = 3)

##      [,1] [,2]
## [1,]   10   15
## [2,]   20   60
## [3,]   65   30

Dataframe

A dataframe is a general form of data that has columns and rows, like a list or a table. Unlike a matrix, a dataframe usually has multiple different types of information. You might see something similar in a class roster, a list of invitees to a birthday party, or a list of ingredients in a recipe. These are all forms of data we interact with every day.

If you have a matrix or a list you would like to convert into a dataframe, the associated function is data.frame(). Most of the time, though, we will be loading datasets.

# Example vectors
id <- c(1, 2, 3, 4, 5)
name <- c("Alice", "Bob", "Charlie", "David", "Eve")
age <- c(23, 35, 45, 28, 32)
passed <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Create data frame
data <- data.frame(id, name, age, passed)

# Another way
my.data <- data.frame(age = c(35, 24, 18, 72), name = c("Oliver", "Meghan",
                      "Cole", "Violet"))

# Print data frame
print(data)

##   id    name age passed
## 1  1   Alice  23   TRUE
## 2  2     Bob  35  FALSE
## 3  3 Charlie  45   TRUE
## 4  4   David  28   TRUE
## 5  5     Eve  32  FALSE

4. Basic Statistical Functions

Finally, let’s consider what first steps you can take for analyzing a dataset or a series of values.

Dataframe elements

pres_data is a dataframe with ___ observations and ___ variables (or columns). You can see the names of the columns with head(), with View(), and also by calling them directly using colnames(). Note that length() of a dataframe returns the number of columns.

length(pres_data)

## [1] 15

head(pres_data)

##   year   state state_po state_fips state_cen state_ic       office
## 1 1976 ALABAMA       AL          1        63       41 US PRESIDENT
## 2 1976 ALABAMA       AL          1        63       41 US PRESIDENT
## 3 1976 ALABAMA       AL          1        63       41 US PRESIDENT
## 4 1976 ALABAMA       AL          1        63       41 US PRESIDENT
## 5 1976 ALABAMA       AL          1        63       41 US PRESIDENT
## 6 1976 ALABAMA       AL          1        63       41 US PRESIDENT
##                 candidate             party_detailed writein candidatevotes
## 1           CARTER, JIMMY                   DEMOCRAT   FALSE         659170
## 2            FORD, GERALD                 REPUBLICAN   FALSE         504070
## 3          MADDOX, LESTER AMERICAN INDEPENDENT PARTY   FALSE           9198
## 4 BUBAR, BENJAMIN ""BEN""                PROHIBITION   FALSE           6669
## 5               HALL, GUS        COMMUNIST PARTY USE   FALSE           1954
## 6         MACBRIDE, ROGER                LIBERTARIAN   FALSE           1481
##   totalvotes  version notes party_simplified
## 1    1182850 20210113    NA         DEMOCRAT
## 2    1182850 20210113    NA       REPUBLICAN
## 3    1182850 20210113    NA            OTHER
## 4    1182850 20210113    NA            OTHER
## 5    1182850 20210113    NA            OTHER
## 6    1182850 20210113    NA      LIBERTARIAN

colnames(pres_data)

##  [1] "year"             "state"            "state_po"         "state_fips"      
##  [5] "state_cen"        "state_ic"         "office"           "candidate"       
##  [9] "party_detailed"   "writein"          "candidatevotes"   "totalvotes"      
## [13] "version"          "notes"            "party_simplified"

dim(pres_data)

## [1] 4287   15

nrow(pres_data)

## [1] 4287

ncol(pres_data)

## [1] 15

Functions

We can also summarize (summary()) and tabulate (table()) vectors and dataframe variables to determine what the range of values is and whether there are NAs in the data.

The summary tells us the min, max, and mean of each object of interest. You can also find these values by calling them directly using specific functions: min(), max(), range(), mean().
Additionally, you can find the standard deviation of x using sd() and the variance with var().
Some functions need missing data removed before they can calculate a result, so include na.rm = TRUE.

mean(pres_data$candidatevotes)

## [1] 311907.6

mean(pres_data$totalvotes)

## [1] 2366924

# If there were missing data then:
# mean(pres_data$candidatevotes, na.rm = T)
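To see what na.rm does, here is a small sketch on a made-up vector that contains one missing value. The numbers are invented for illustration.

```r
# Hypothetical vote counts with one missing value
votes <- c(100, 250, NA, 400)

mean(votes)                  # NA, because one value is missing
mean(votes, na.rm = TRUE)    # drops the NA first
sd(votes, na.rm = TRUE)      # standard deviation
var(votes, na.rm = TRUE)     # variance (the square of sd)
range(votes, na.rm = TRUE)   # min and max together
```

Writing TRUE in full is a little safer than the shorthand T, because T is an ordinary object that can be overwritten while TRUE cannot.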

summary(pres_data$candidatevotes)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0     1177     7499   311908   199242 11110250

head(table(pres_data$candidatevotes))

## 
##  0  1  2  3  4  5 
##  2 24 14  9  8  9

# table(pres_data$candidatevotes)
table(pres_data$year, pres_data$party_simplified)

##       
##        DEMOCRAT LIBERTARIAN OTHER REPUBLICAN
##   1976       51          27   203         51
##   1980       51          50   212         51
##   1984       51          39   182         51
##   1988       51          43   140         51
##   1992       51          47   221         51
##   1996       51          49   217         51
##   2000       51          48   219         51
##   2004       52          45   169         51
##   2008       51          43   206         51
##   2012       51          46   168         51
##   2016       53          49   191         52
##   2020       51          49   396         51

5. Exercise

Create a Descriptive Statistics Table

Create a descriptive statistics output for (1) the votes received by candidates (candidatevotes) and (2) the total votes (totalvotes). The table needs the number of Observations, Min, Max, Mean, and Standard Deviation. For exercise purposes, create one vector of descriptive statistics for each variable. Then combine the vectors into a dataframe of descriptive statistics, with the variable names as columns.

Stats.	Candidate Votes
Obsv.
Min.
Max.
Mean
SD.

6. Additional Material

Arithmetic Operators

One of the most common uses of R is arithmetic. Here is a list of the most common arithmetic operators.

The order of operations is the same as the convention in math: parentheses, exponents, multiplication and division (from left to right), addition and subtraction (from left to right); also known as PEMDAS.
When in doubt, use parentheses!

Operator	Description
+	addition
-	subtraction
*	multiplication
/	division
^ or **	exponentiation
x %% y	modulus (remainder), e.g. 5 %% 2 is 1
x %/% y	integer division, e.g. 5 %/% 2 is 2
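A quick sketch of the operators in the table, with numbers picked only for illustration:

```r
7 + 2     # addition
7 - 2     # subtraction
7 * 2     # multiplication
7 / 2     # division
2 ^ 3     # exponentiation

remainder <- 7 %% 2    # modulus: what is left over after division
quotient <- 7 %/% 2    # integer division: how many whole times 2 fits in 7
remainder
quotient

# Parentheses change the order of operations
2 + 3 * 4
(2 + 3) * 4
```

The modulus operator is useful for tasks like checking whether a number is even, since an even number has a remainder of 0 when divided by 2.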

Logical Operators

Another common operator type is the logical (comparison) operator, which compares two values and returns TRUE or FALSE.

Operator	Description
<	Less than
>	Greater than
<=	Less than or equal to
>=	Greater than or equal to
==	Equal to each other
!=	Not equal to each other
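The sketch below shows these operators on single values and on a vector. The ages vector is a made-up example.

```r
3 < 5     # is 3 less than 5?
3 >= 5    # is 3 greater than or equal to 5?
3 != 5    # is 3 different from 5?

# Comparisons work element by element on vectors
ages <- c(17, 21, 35)   # hypothetical values
is_adult <- ages >= 18
is_adult

# The TRUE/FALSE result can be used to keep only the matching elements
ages[is_adult]
```

Using a comparison inside square brackets is a common way to filter data: only the positions where the comparison is TRUE are kept.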

Help

For any of the functions used here, such as c(), list(), data.frame(), matrix(), and any future functions, you can type ? followed by the function name (or use help()) to learn more about the function.

As shown below, this help function gives you more detail on the function, as well as some of the many options one can specify - such as for matrix, what are the dimensions of the matrix?

?matrix
?list

Syntax: Assignment vs Equals

As you might have noticed in the above code chunks, in R we use the assignment operator <- to assign value to objects. There are other assignment operators (such as =), but it can be confusing, so stick to using <-.

Value assignment means setting a value to be stored in an object (also called a variable). This may seem inefficient, since we could type 1 + 2 instead of x = 1, y = 2, x + y. However, it becomes increasingly difficult to manually type out the values of interest as the number of values grows, as in a dataset.

x = 2 # Please avoid at all costs
x <- 4 # Better assignment operator
y <- 500

Syntax: Equal-to and not-equal-to

One of the main reasons why = can be confusing is that there is another operator, ==, which is the equal-to operator. It checks whether the value on the left is the same as the value on the right. For example, 1 == 2 is FALSE, because mathematically the two are not equal. The opposite of the equal-to operator is the not-equal-to operator, !=.

x == 4

## [1] TRUE

x == 5

## [1] FALSE

x != 5

## [1] TRUE

This equal-to operator becomes very useful when we want to check if two values are the same. Take, for instance, those forms that ask you to type in your email twice: they probably use an equal-to operator to determine whether you typed the same email both times.

email <- 'john.smith@duke.edu'
email2 <- 'james.smith@duke.edu'
email == email2

## [1] FALSE

Syntax: And vs or

x == 600 & y == 700 # & == 'and this is also true'

## [1] FALSE

x == 1000 | y == 7 # | == 'this, or this, or both'

## [1] FALSE

Concatenation, Seq, Rep

How do we remember multiple numerical values without adding them?
How do we add together non-numerical values?
- We can use lists or vectors (see the Vectors and Lists section)
- the c() function for vectors means concatenate
  - just remember that a vector holds data of one type, so c() converts mixed inputs to a common type
- for strings we also have the option of using the paste() function

'a' + 'b' # Doesn't work

## Error in "a" + "b": non-numeric argument to binary operator

string.one <- "Hello"
string.two <- "World"
c(string.one, string.two)

## [1] "Hello" "World"

c(1,2,3,4,5)

## [1] 1 2 3 4 5

paste(string.one, string.two, sep = " ") # Default sep input

## [1] "Hello World"

paste0(string.one, string.two) # Eliminates spacing

## [1] "HelloWorld"

paste(string.one, whole.number) # Typecasts numeric

## [1] "Hello 6"

Remember that in the Vectors and Lists section we mentioned that vectors have to be the same type? Look at what happens when we try to concatenate characters and numbers:

x <- c(1, "a", 2, "b") # but they do have to be the same type!
# notice that 1 and 2 are of type character, this is called 'type casting'

Seq and Rep

In cases where there is some repetition or sequence in the values, such as an index when we want to number off observations or a repeating sequence to group observations, we can use seq() and rep() to make the job easier. seq() works on numbers, while rep() works on both numeric and character objects.

x <- c(1,2,3,4,5) # the c stands for concatenate
x <- seq(from = 1, to = 5, by = 1) # same as above
x <- 1:5 # same as above
x <- rep(1, times = 10)

x <- factor(LETTERS[1:4]); names(x) <- letters[1:4]
x

## a b c d 
## A B C D 
## Levels: A B C D

rep(x, 2)

## a b c d a b c d 
## A B C D A B C D 
## Levels: A B C D
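As a sketch of the indexing and grouping use case described above, the code below numbers off six observations and assigns them to two alternating groups. The group labels and object names are made up for illustration.

```r
seq(from = 0, to = 10, by = 5)   # a sequence with a step of 5
rep(c("a", "b"), times = 2)      # the whole vector repeated twice
rep(c("a", "b"), each = 2)       # each element repeated in place

# Number off six observations and assign alternating groups
obs_id <- seq_len(6)
group <- rep(c("treatment", "control"), times = 3)
data.frame(obs_id, group)
```

The difference between times and each matters when building group labels: times cycles through the whole vector, while each finishes one label before moving to the next.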

Troubleshooting R and RStudio

The goal behind these labs is to introduce you to some of the syntax and concepts in using R as a way to conduct statistical analysis. Given time constraints, it is often hard to cover everything or cover each concept completely. The best way to learn and become familiar with programming is to work with data a lot and to look up code as you work.