The goal of today’s class is to:
We will perform data manipulation and statistical analysis using R in this course. R is an open source statistical programming environment, meaning that anyone can download it for free, examine the source code, and make their own contributions.
Other options: Python, STATA (Economics, Not free), SAS/SPSS (Psychology, Not free)
RStudio is an open source editor that makes R easier to work with. RStudio only works if R is installed.
Download and Install R
- Go to the R download page (CRAN) and select the appropriate download for your operating system.
- Windows: select Download R for Windows, then select base, and then Download R 4.4.0 for Windows.
- Mac: select Download R for Mac OS X, and then select R-4.4.0.pkg. Make sure the downloaded R version is compatible with your macOS system.
- Running version allows us to see current R info on a computer, including the R nickname.
- In RStudio, go to Help in the top menu bar and select Check for updates.
Download RStudio
- Choose the RStudio version you want and click Download for that option.
- Find Installers for Supported Platforms; click the one that’s right for your computer and download it.
- To check your RStudio version, run RStudio.Version() or find it in the About RStudio dropdown menu.
For a much more detailed intro to R, check out: https://cran.r-project.org/doc/manuals/R-intro.pdf
The rest of this lab will center on running through the basics of using R, using the RStudio interface, and getting familiar with the practice of programming and statistical analysis. There is a lot of content to cover, but working with the material hands-on will help! At the end of this tutorial, there will be an R file with the code used today, as well as an activity we will work through in lab to play around with some data and get familiar with the output in R.
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
To begin, open up RStudio. You will usually see that the main window is divided into four panels. We will primarily be using the two panels on the left: the top is the script, and the bottom is the console.
RStudio layout
There are two main ways of interacting with R: writing code and running it from the script (top panel), or typing it and executing it in the console (bottom panel).
Panel 1. Script: the script is where you will write your code or documentation; think of it as a document where your instructions go. You can run your script by selecting the command you want and using the Ctrl + Enter shortcut on Windows, or Cmd + Return on Macs.
Panel 2. Console: type a command in and run it directly by pressing Enter. The console is where code is executed, and it logs past commands and outputs.
Panel 3. Local environment: shows the data, lists, and other objects you create.
getwd()
## [1] "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs"
In the Files section, click the gear-shaped wheel, then Set As Working Directory
setwd(dir = "/Users/suhyenbae/Dropbox/RBSI 2024/RLabs")
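The working directory can also be changed and restored entirely from code. A minimal sketch, assuming you only want to try the commands out: it switches to R's temporary session folder (`tempdir()`) rather than a real project folder, and `old_dir` is just an illustrative name.

```r
old_dir <- getwd()     # remember where we started

setwd(tempdir())       # tempdir() is a folder R creates for this session
getwd()                # now points at the temporary folder
list.files()           # files visible from the current working directory

setwd(old_dir)         # go back to the original folder
```

Files named in commands such as `read.csv()` are looked up relative to the working directory, so it is worth checking `getwd()` before loading data.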
We will be installing “Rmarkdown,” a package that enables creating a notebook with your code, comments, output all in one file. You can run chunks of code at a time, or all in one go, with the results appearing beneath the code (reproducible document).
The code for installing packages is install.packages(), and the package name goes within the parentheses, in quotes.
# install.packages("rmarkdown")
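It can be handy to check whether a package is already installed before installing it again. A minimal sketch using base R's `requireNamespace()`; the names `pkg` and `is_installed` are illustrative, and the install line is left commented out because it needs an internet connection.

```r
# Name of the package to check (illustrative variable name)
pkg <- "rmarkdown"

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Uncomment the next line to download and install the package
  # install.packages(pkg)
  message(pkg, " is not installed yet")
}
```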
To create a new notebook in R Markdown:
1. File menu -> select New File -> R Notebook or R Markdown
2. This will open a .Rmd file that you can save to your computer using File and Save As
3. In the output section of the notebook, writing html_notebook would create an HTML file
To execute code:
1. Create a chunk to write code in using the shortcut Ctrl + Alt + I (Windows) or Cmd + Option + I (Mac).
2. Use the green triangle button on the toolbar of a code chunk that has the tooltip Run Current Chunk, or Ctrl + Shift + Enter (macOS: Cmd + Shift + Enter) to run the current chunk.
3. Press Ctrl + Enter (macOS: Cmd + Enter) to run just the current statement. Running a single statement is much like running an entire chunk consisting only of that statement.
4. There are other ways to run a batch of chunks if you click the menu Run on the editor toolbar, such as Run All, Run All Chunks Above, and Run All Chunks Below.
R Markdown Cheat Sheet: https://rmarkdown.rstudio.com/lesson-15.html
1 + 2
## [1] 3
16 / 4
## [1] 4
What is the code for computing 5 to the power of 31?
sqrt is an example of a function
sqrt(4)
## [1] 2
Syntax: Writing and commenting in-line
You can also add in-line comments and sections when writing code to introduce structure or explanation. This makes it easier for you to keep track of the progress you have made, and for a reader to follow your code and what the variables mean. We use # to indicate a comment: when R sees a #, it ignores the rest of that line. A common convention for sectioning code is to use #### for headers and major sections, ### for sub-sections, and a single # for minor comments. For example, if you had two variables, x and x2, which referred to different things, it could be helpful to note next to each assignment what the variable means to you.
# provide your comments on this line
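The sectioning convention described above might look like this in a script. This is only a sketch: the values and the meanings given to `x` and `x2` are made up for illustration.

```r
#### Section: class sizes ####

### Sub-section: raw counts
x  <- 14   # number of students enrolled (hypothetical value)
x2 <- 3    # number of instructors (hypothetical value)

# total number of people in the room
x + x2
```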
R can store information as an object with a name of our choice. We use objects as shortcuts for recalling pieces of information. Using <- (the assignment operator), we store a value in an object.
rbsi_class <- 14
If we want to know how many slices of pizza we need for our class, we can use the object name to perform mathematical operations.
We can perform functions on objects.
sqrt(rbsi_class)
## [1] 3.741657
Let’s estimate that everyone in class is going to eat 3 slices of pizza, and each box of pizza has 8 slices. How many boxes of pizza should we order?
We can also change the value of an object by reassigning it a new value.
rbsi_class <- 14 + 10
Syntax: Case-sensitivity
Note that R is case-sensitive, so be careful how you name objects! If you save rbsi_class, R will not be able to recall Rbsi_class.
rbsi_class
## [1] 24
Rbsi_class
## Error in eval(expr, envir, enclos): object 'Rbsi_class' not found
Most of the time, we use data downloaded from external sources to analyze in R.
The most common types of data files have .csv, .dta, and .RData extensions. The code to load the data varies with the type of file. When creating your own data in another program, such as Excel, remember to save the file as a CSV file.
data <- read.csv("1976-2020-president.csv")
# when performing data manipulations
# good practice to save loaded data object into another object so as to be able to revert back to the original loaded data if one makes mistakes
pres_data <- data
Now, what do you see in your local environment?
Syntax: How to clear the environment
rm(x) # things can be removed individually
# rm(list = ls()) # or all at once (this clears every object, including loaded data)
Try to remove rbsi_class from the local environment.
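To confirm that a removal worked, base R's `exists()` reports whether an object with a given name can be found. A small sketch using a throwaway object (`tmp_value`, `before`, and `after` are hypothetical names), so nothing from the lab is deleted:

```r
tmp_value <- 10                # throwaway object created only for this demo

before <- exists("tmp_value")  # TRUE: the object is in the environment
rm(tmp_value)                  # remove just this one object
after <- exists("tmp_value")   # FALSE: it is gone

before
after
```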
There are different types of information stored in R as data. An object's data type is also known as its class. R recognizes types of objects by assigning each object to a class in order to perform appropriate operations.
class(531)
## [1] "numeric"
class(TRUE)
## [1] "logical"
class("method lab")
## [1] "character"
class(pres_data)
## [1] "data.frame"
whole.number <- 6L # integer (the L suffix stores it as an integer; a plain 6 is numeric)
real.number <- 3.141592 # double/numeric
In the president turnout dataset, which variables are numeric?
class(pres_data$year)
## [1] "integer"
Syntax: Recall a column from a Dataframe
To access something inside of a dataframe, $ is often used to refer to a column. For a makeshift dataframe named my.data with two columns, age and name, my.data$age allows us to reference the age column.
Note that "data" (in quotes) is a character string, while data (without quotes) refers to an object.
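Column access with `$` can be sketched on a small makeshift dataframe. The values below mirror the `my.data` example built in the dataframe section of this lab; the `[[ ]]` form is shown as an equivalent way to pull a column.

```r
# Makeshift data frame with two columns
my.data <- data.frame(
  age  = c(35, 24, 18, 72),
  name = c("Oliver", "Meghan", "Cole", "Violet"),
  stringsAsFactors = FALSE
)

my.data$age         # the age column as a numeric vector
mean(my.data$age)   # columns can be passed straight to functions: 37.25
my.data[["name"]]   # double brackets pull out the same kind of column
```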
In the president turnout dataset, which variables are characters?
Create an object named party_name and save one party name as a character
party_name <- "Democrat"
- Logical (boolean) data takes the value TRUE or FALSE
- Binary information can be stored in TRUE/FALSE form, or in a 0/1 form
- Present/Absent is as easily represented as True/False
TRUE
## [1] TRUE
FALSE
## [1] FALSE
T
## [1] TRUE
F
## [1] FALSE
Which variable in the turnout dataset is boolean?
Factor data: factors are used to capture categorical data, where there is a pre-defined set of values. For example, take class standing in a college, where four categories are used to organize students: freshman, sophomore, junior, and senior. These are used regardless of how many years of study a student might have completed, and are often evaluated based on the quality and quantity of classes and course credits a student has completed. Another often-used example for factors is gender: male, female, and other.
Syntax: Changing Data Types
# as.numeric()
# as.character()
# as.factor()
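A short sketch of these conversion functions in use, with made-up values (`num_as_text` is an illustrative name). The last line shows what happens when text cannot be read as a number.

```r
num_as_text <- "42"                          # a number stored as a character
as.numeric(num_as_text) + 1                  # now arithmetic works: 43
as.character(531)                            # numeric to character: "531"
as.factor(c("junior", "senior", "junior"))   # character to factor

# Text that is not a number becomes NA (R also gives a warning)
suppressWarnings(as.numeric("five"))
```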
Syntax: Checking Data Types
Knowing the different data types is important because some operations only work with specific data types. For example, though you can add multiple numeric objects together using the addition operator +, it is not possible to do the same with character objects, or between character and numeric objects.
5 + 5
## [1] 10
"five" + "five"
## Error in "five" + "five": non-numeric argument to binary operator
5 + "five"
## Error in 5 + "five": non-numeric argument to binary operator
There are a couple of tools available to you for determining and changing the data type of an object.
is.numeric(pres_data$year)
## [1] TRUE
is.numeric(pres_data$totalvotes)
## [1] TRUE
str(pres_data$state)
## chr [1:4287] "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
class(pres_data$party_detailed)
## [1] "character"
Practice changing the party_detailed variable into a factor variable, and check to see that the change was done correctly. We can transform a character variable to a factor variable.
We use the function levels() to find out what the levels in a factor are.
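The same steps can be sketched on a toy vector before trying them on the dataset. The vector `party` below is made up and simply stands in for a character column.

```r
# Toy character vector standing in for a party column (made-up values)
party <- c("DEMOCRAT", "REPUBLICAN", "OTHER", "DEMOCRAT")
class(party)                  # "character"

party_factor <- as.factor(party)
class(party_factor)           # "factor"
levels(party_factor)          # the distinct categories
table(party_factor)           # how many observations fall in each level
```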
Vectors and lists are a way to store one or more values or objects, which can be either numbers or characters.
A vector is a one-dimensional array of numeric values, strings, or other information.
- Create a vector with c()
- Use c() to add more elements to an existing vector
# concatenate
oranges <- c(10, 15, 20, 60, 65)
oranges
## [1] 10 15 20 60 65
apples <- c(2, 3, 4, 5, 5)
apples
## [1] 2 3 4 5 5
# use c() to combine multiple vectors
# order of items in vector matters when combining
vector <- c(oranges, apples)
Indexing is used to access specific elements of a vector. We use square brackets [].
# second item
vector[2]
## [1] 15
# second and third item
vector[c(2,3)]
## [1] 15 20
# arithmetic operation
vector <- vector + 10
print(vector)
## [1] 20 25 30 70 75 12 13 14 15 15
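Arithmetic on vectors works element by element. A brief sketch that re-creates the two fruit vectors from above so it can run on its own:

```r
oranges <- c(10, 15, 20, 60, 65)
apples  <- c(2, 3, 4, 5, 5)

oranges + apples   # adds position by position: 12 18 24 65 70
oranges * 2        # the single number is applied to every element
sum(oranges)       # total of all elements: 170
```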
Lists are similar to vectors in that they are a series of values. The key difference is that while all elements of a vector must have the same data type, lists can mix elements of different types.
- Create a list with list()
- In the vector below, R stores 2 as a character instead of a number. This is called type-casting.
- When we apply class() to both the vector and the list, class tells us what kind of data is in the vector, but sees the list as a different data type
vector <- c('uber', 2, 'lyft')
vector
## [1] "uber" "2" "lyft"
class(vector)
## [1] "character"
list <- list('uber', 2, 'lyft')
list
## [[1]]
## [1] "uber"
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] "lyft"
class(list)
## [1] "list"
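Because a list keeps each element's original type, it helps to see how elements are pulled back out. A minimal sketch, using a list with the same contents as above under the hypothetical name `rides`:

```r
rides <- list("uber", 2, "lyft")

rides[[2]]             # double brackets return the element itself: 2
class(rides[[2]])      # "numeric": the number was not converted to text
class(rides[2])        # "list": single brackets return a smaller list
sapply(rides, class)   # class of each element in turn
```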
Since vectors and lists are essentially just a group of objects, we can inspect their contents, structure, and also interact with the objects within them as a group.
length(oranges)
## [1] 5
class(oranges)
## [1] "numeric"
oranges + 30
## [1] 40 45 50 90 95
oranges2 <- c(oranges, 30)
oranges2
## [1] 10 15 20 60 65 30
In this case, oranges is a vector of 5 values. We can extract and replace values at specific locations in the vector, using square brackets [] to select the position of the value we are interested in.
oranges[1]
## [1] 10
oranges[3] <- 5
oranges
## [1] 10 15 5 60 65
Syntax: Indices in R
For those who have some programming experience, note that R indices start at 1. Many other programming languages, such as C++, Java, and Python, start from 0 instead.
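A small sketch of what 1-based indexing means in practice. It re-creates the `oranges` vector with the values it holds at this point in the lab.

```r
oranges <- c(10, 15, 5, 60, 65)

oranges[1]                 # first element: 10
oranges[length(oranges)]   # last element: 65
oranges[2:4]               # second through fourth elements: 15 5 60
oranges[0]                 # position 0 holds nothing: an empty vector
```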
Similar to vectors, matrices (singular: matrix) are a collection of elements arranged into a fixed number of rows and columns. The function matrix() creates a matrix. All the columns must have the same data type and the same length; most often, matrices are numerical.
Below we have a 4x3 matrix, with four rows and three columns. Here we used 1:12 to create a sequence of numbers from 1 to 12. It gives the same values as c(1,2,3,4,5,6,7,8,9,10,11,12).
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
matrix(1:12, nrow = 4, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
You can also create a matrix from a vector. Every column of a matrix must have the same length, so note how oranges, with 5 values, is not the right size for a matrix with 3 rows (R fills the leftover cell by starting again from the first value and gives a warning), but oranges2, with 6 values, is.
matrix(oranges, byrow = TRUE, nrow = 3)
## Warning in matrix(oranges, byrow = TRUE, nrow = 3): data length [5] is not a
## sub-multiple or multiple of the number of rows [3]
## [,1] [,2]
## [1,] 10 15
## [2,] 5 60
## [3,] 65 10
matrix(oranges2, byrow = TRUE, nrow = 3)
## [,1] [,2]
## [1,] 10 15
## [2,] 20 60
## [3,] 65 30
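Matrix elements are indexed with `[row, column]`. A short sketch on the same 4x3 matrix built earlier, stored here under the illustrative name `m`:

```r
m <- matrix(1:12, nrow = 4, ncol = 3)

dim(m)     # number of rows, then number of columns: 4 3
m[2, 3]    # single element in row 2, column 3: 10
m[, 1]     # entire first column: 1 2 3 4
m[4, ]     # entire fourth row: 4 8 12

# byrow = TRUE fills across rows instead of down columns
matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
```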
A dataframe is a general form of data that has columns and rows, like a table. Unlike a matrix, a dataframe can hold a different type of information in each column. You might see something similar in a class roster, a list of invitees to a birthday party, or a list of ingredients in a recipe; these are all forms of data we interact with every day.
If you have a matrix or a list you would like to convert into a dataframe, the associated function is data.frame(). Most of the time, however, we will be loading datasets.
# Example vectors
id <- c(1, 2, 3, 4, 5)
name <- c("Alice", "Bob", "Charlie", "David", "Eve")
age <- c(23, 35, 45, 28, 32)
passed <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
# Create data frame
data <- data.frame(id, name, age, passed)
# Another way
my.data <- data.frame(age = c(35, 24, 18, 72), name = c("Oliver", "Meghan",
"Cole", "Violet"))
# Print data frame
print(data)
## id name age passed
## 1 1 Alice 23 TRUE
## 2 2 Bob 35 FALSE
## 3 3 Charlie 45 TRUE
## 4 4 David 28 TRUE
## 5 5 Eve 32 FALSE
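Once a dataframe exists, rows and columns can be inspected and selected with square brackets. A sketch that rebuilds the same example table under the hypothetical name `roster`, so it does not overwrite any object used elsewhere in the lab:

```r
roster <- data.frame(
  id     = 1:5,
  name   = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age    = c(23, 35, 45, 28, 32),
  passed = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

str(roster)                     # column names, types, and first few values
roster[2, ]                     # second row, all columns
roster[roster$passed, "name"]   # names in the rows where passed is TRUE
sapply(roster, class)           # class of every column at once
```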
Finally, let’s consider what first steps you can take for analyzing a dataset or a series of values.
pres_data is a dataframe with ___ observations and ___ variables (or columns). You can see the names of the columns in head(), in View(), and also by calling them directly using colnames().
length(pres_data)
## [1] 15
head(pres_data)
## year state state_po state_fips state_cen state_ic office
## 1 1976 ALABAMA AL 1 63 41 US PRESIDENT
## 2 1976 ALABAMA AL 1 63 41 US PRESIDENT
## 3 1976 ALABAMA AL 1 63 41 US PRESIDENT
## 4 1976 ALABAMA AL 1 63 41 US PRESIDENT
## 5 1976 ALABAMA AL 1 63 41 US PRESIDENT
## 6 1976 ALABAMA AL 1 63 41 US PRESIDENT
## candidate party_detailed writein candidatevotes
## 1 CARTER, JIMMY DEMOCRAT FALSE 659170
## 2 FORD, GERALD REPUBLICAN FALSE 504070
## 3 MADDOX, LESTER AMERICAN INDEPENDENT PARTY FALSE 9198
## 4 BUBAR, BENJAMIN ""BEN"" PROHIBITION FALSE 6669
## 5 HALL, GUS COMMUNIST PARTY USE FALSE 1954
## 6 MACBRIDE, ROGER LIBERTARIAN FALSE 1481
## totalvotes version notes party_simplified
## 1 1182850 20210113 NA DEMOCRAT
## 2 1182850 20210113 NA REPUBLICAN
## 3 1182850 20210113 NA OTHER
## 4 1182850 20210113 NA OTHER
## 5 1182850 20210113 NA OTHER
## 6 1182850 20210113 NA LIBERTARIAN
colnames(pres_data)
## [1] "year" "state" "state_po" "state_fips"
## [5] "state_cen" "state_ic" "office" "candidate"
## [9] "party_detailed" "writein" "candidatevotes" "totalvotes"
## [13] "version" "notes" "party_simplified"
dim(pres_data)
## [1] 4287 15
nrow(pres_data)
## [1] 4287
ncol(pres_data)
## [1] 15
We can also summarize (summary()) and tabulate (table()) vectors and dataframe variables to determine what the range of values is and whether there are NAs in the data.
- The summary tells us the min, max, and mean of each object of interest. You can also find these values by calling specific functions directly: min(), max(), range(), mean().
- Additionally, you can find the standard deviation of x using sd() and the variance with var().
- Some functions need missing data removed before they can calculate a result, so include na.rm=T.
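The effect of missing values is easiest to see on a tiny vector. A minimal sketch with made-up numbers (`votes` is a hypothetical name, not a column from the election data):

```r
votes <- c(120, 45, NA, 300)   # made-up values with one missing entry

mean(votes)                    # NA: the missing value propagates
mean(votes, na.rm = TRUE)      # 155: computed from the three known values
sum(is.na(votes))              # 1: number of missing values
sd(votes, na.rm = TRUE)        # standard deviation, ignoring the NA
```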
mean(pres_data$candidatevotes)
## [1] 311907.6
mean(pres_data$totalvotes)
## [1] 2366924
# If there were missing data then:
# mean(pres_data$candidatevotes, na.rm = T)
summary(pres_data$candidatevotes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1177 7499 311908 199242 11110250
head(table(pres_data$candidatevotes))
##
## 0 1 2 3 4 5
## 2 24 14 9 8 9
summary(pres_data$candidatevotes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1177 7499 311908 199242 11110250
# table(pres_data$candidatevotes)
table(pres_data$year, pres_data$party_simplified)
##
## DEMOCRAT LIBERTARIAN OTHER REPUBLICAN
## 1976 51 27 203 51
## 1980 51 50 212 51
## 1984 51 39 182 51
## 1988 51 43 140 51
## 1992 51 47 221 51
## 1996 51 49 217 51
## 2000 51 48 219 51
## 2004 52 45 169 51
## 2008 51 43 206 51
## 2012 51 46 168 51
## 2016 53 49 191 52
## 2020 51 49 396 51
Create a descriptive statistics output for the (1) voteshare received by candidates and the (2) total voteshare. The table needs information on number of Observations, Min, Max, Mean, and Standard Deviation. For exercise purposes, create a vector consisting of the descriptive statistics. Combine the vectors together into a dataframe of descriptive statistics, with the variable names as columns.
Stats. | Candidate Votes |
---|---|
Obsv. | |
Min. | |
Max. | |
Mean | |
SD. | |
Arithmetic operators are among the most commonly used features of R. Here is a list of the most common ones.
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
x %% y | modulus (remainder after division): 5 %% 2 is 1 |
x %/% y | integer division: 5 %/% 2 is 2 |
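Each operator in the table can be tried directly in the console. A quick sketch using the same pair of numbers throughout so the results are easy to compare:

```r
7 + 2     # addition: 9
7 - 2     # subtraction: 5
7 * 2     # multiplication: 14
7 / 2     # division: 3.5
7^2       # exponentiation: 49
7 %% 2    # remainder after division: 1
7 %/% 2   # integer division: 3
```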
Another common operator type is logical (comparison) operators, which return TRUE or FALSE.
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to each other |
!= | Not equal to each other |
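Comparisons also work on whole vectors, returning one TRUE or FALSE per element, which makes them useful for picking out values. A sketch with a made-up vector named `ages`:

```r
ages <- c(23, 35, 45, 28, 32)   # made-up ages

ages > 30          # FALSE TRUE TRUE FALSE TRUE
ages[ages > 30]    # keep only the elements where the test is TRUE: 35 45 32
sum(ages > 30)     # TRUE counts as 1, so this counts the matches: 3
ages != 45         # TRUE everywhere except the third element
```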
For any of the functions used here, such as c(), list(), data.frame(), matrix(), and any future functions, you can type ? followed by the function name (or use help()) in R to learn more about a function.
As shown below, this help function gives you more detail on the function, as well as some of the many options one can specify - such as for matrix, what are the dimensions of the matrix?
?matrix
?list
Syntax: Assignment vs Equals
As you might have noticed in the above code chunks, in R we use the assignment operator <- to assign values to objects. There are other assignment operators (such as =), but they can be confusing, so stick to using <-.
Value assignment means setting a value to be stored in an object (also called a variable). This may seem inefficient, since we could type 1 + 2 instead of x = 1, y = 2, x + y. However, it becomes increasingly difficult to manually type out the values of interest as the number of values we are working with grows, as in a dataset.
x = 2 # Please avoid at all costs
x <- 4 # Better assignment operator
y <- 500
Syntax: Equal-to and not-equal-to
One of the main reasons why = can be confusing is that there is another operator, ==, which is the equal-to operator. It checks whether the value on the left is the same as the value on the right. For example, 1 == 2 asks whether 1 equals 2; mathematically it does not, so the result is FALSE. The opposite of the equal-to operator is the not-equal-to operator, !=.
x == 4
## [1] TRUE
x == 5
## [1] FALSE
x != 5
## [1] TRUE
This equal-to operator becomes very useful when we want to check if two values are the same. Take, for instance, those forms that ask you to type in your email twice: they probably use an equal-to operator to determine if you typed the same email both times.
email <- 'john.smith@duke.edu'
email2 <- 'james.smith@duke.edu'
email == email2
## [1] FALSE
Syntax: And vs or
x == 600 & y == 700 # & == 'and this is also true'
## [1] FALSE
x == 1000 | y == 7 # | == 'this, or this, or both'
## [1] FALSE
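Both results above are FALSE, so it may help to see cases where `&` and `|` return TRUE. A small sketch that re-creates `x` and `y` with the values assigned earlier:

```r
x <- 4
y <- 500

x == 4 & y == 500   # TRUE: both sides are true
x == 4 & y == 7     # FALSE: the right side is false
x == 4 | y == 7     # TRUE: at least one side is true
!(x == 4)           # FALSE: ! flips a logical value
```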
- The c() function for vectors means concatenate
- c() can only hold data of the same type (mixed types are converted to one type)
- To join character strings together, use the paste() function
'a' + 'b' # Doesn't work
## Error in "a" + "b": non-numeric argument to binary operator
string.one <- "Hello"
string.two <- "World"
c(string.one, string.two)
## [1] "Hello" "World"
c(1,2,3,4,5)
## [1] 1 2 3 4 5
paste(string.one, string.two, sep = " ") # Default sep input
## [1] "Hello World"
paste0(string.one, string.two) # Eliminates spacing
## [1] "HelloWorld"
paste(string.one, whole.number) # Typecasts numeric
## [1] "Hello 6"
Remember that in the Vectors and Lists section I mentioned that vectors have to be the same type? Look at what happens when we try to concatenate characters and numbers:
x <- c(1, "a", 2, "b") # but they do have to be the same type!
# notice that 1 and 2 are of type character, this is called 'type casting'
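The conversion can be confirmed with `class()`. A brief sketch on the same vector; the last line shows what comes back when the elements are converted to numbers again.

```r
x <- c(1, "a", 2, "b")

class(x)   # "character": every element was converted to text
x[1]       # "1" is now text, not a number

# Converting back recovers the numbers; the letters become NA (with a warning)
suppressWarnings(as.numeric(x))
```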
Seq and Rep
When the values follow a sequence or repeat, such as an index to number off observations or a repeating pattern to group observations, we can use seq() and rep() to make the job easier. rep() works on both numeric and character objects, while seq() generates numeric sequences.
x <- c(1,2,3,4,5) # the c stands for concatenate
x <- seq(from = 1, to = 5, by = 1) # same as above
x <- 1:5 # same as above
x <- rep(1, times = 10)
x <- factor(LETTERS[1:4]); names(x) <- letters[1:4]
x
## a b c d
## A B C D
## Levels: A B C D
rep(x, 2)
## a b c d a b c d
## A B C D A B C D
## Levels: A B C D
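A few more variations, sketched with made-up inputs, show how the `by`, `times`, and `each` arguments change the result:

```r
seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
seq(from = 10, to = 2, by = -2)    # counting down: 10 8 6 4 2
rep(c("a", "b"), times = 3)        # whole vector repeated: a b a b a b
rep(c("a", "b"), each = 3)         # each element repeated: a a a b b b
seq_along(c("x", "y", "z"))        # one index per element: 1 2 3
```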
The ? help operator is incredibly useful for checking the syntax of common R functions: what goes into the function? How can we use the function? Often there are examples at the end of the help page showing how the function is used.