UPDATE: cesR
is now available through the CRAN network and can be installed using install.packages("cesR")
.
This is a bit of a breakdown on my thought process on writing designing the package cesR
for R. The full paper for cesR
is available from SocArXiv Papers and the full code for this project can be found on my Github account.
Introduction
This past summer (2020 to give a date-mark), I had the privilege of doing my co-op with Dr. Rohan Alexander in the Toronto Data Lab. While this wasn’t the project I had set out to complete, nor the one I had proposed, it was what ended up being the most rewarding.
The purpose of cesR
is to make accessing Canadian Election Study (CES) datasets easier for R users. It was inspired by the work being done in the R community through such packages as the opendatatoronto
package and the Lahman
package. Packages such as these are important as they improve the functionality of working within R by minimizing the number of steps required to load data and increasing the availability of data to R users.
cesR
does this through the use of five functions: get_ces()
, get_cescodes()
, get_preview()
, get_question()
, and get_decon()
.
Functions
get_ces()
When called, the get_ces()
function returns a requested CES survey as a data object and prints the associated citation and URL for the survey dataset repository to the console. The function takes one argument, srvy
, as a character string and a second, pos
, numeric argument used to set the environment. srvy
is a vector member that is associated with a CES survey. If the provided character string argument matches a member of the built-in vector, the associated file is downloaded using the download.file()
function from the utils
R package and is stored in a temporary directory. Upon downloading the file, it is read into R using either the read_dta()
or read_sav()
functions from the haven
R package, or the read_tsv()
function from the readr
package. A data frame is then assigned using the assign()
function from the base
R package as a data object in the global environment. The downloaded file and file directory are then removed from the computer using the unlink()
function from base
R. Finally, the recommended citation for the requested survey dataset and URL of the survey data storage location are printed in the console as a message.
If the provided character string argument does not have a match in the built-in vector, then the function process is stopped and a warning message is printed to the console.
Here is an example of the get_ces()
code.
# 'get_ces' function, uses one variable 'srvy'
get_ces <- function(srvy, pos = 1){
# if 'srvy' is in 'ces_codes' vector
if(srvy %in% ces_codes){
# if 'srvy' is equal to 'ces2019_web'
if(srvy == "ces2019_web"){
# create temporary file name holder
hldr <- tempfile(fileext = ".dta")
# if the file does not exist
if(!file.exists(hldr)){
# assign download url
cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
# download the file from the url and assign file name from holder
utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
# assign the data file to a globally available variable
assign("ces2019_web", haven::read_dta(hldr, encoding = "latin1"), envir = as.environment(pos))
# remove the temporary downloaded data file
unlink(hldr, recursive = T)
# print citation and link
message(ref2019web)
}
}
}
else{
# if the provided code is not in the 'ces_codes' vector then stop process and print this message
stop("Incorrect CES dataset code provided. Use function get_cescodes() for a printout of useable code calls.")
}
}
get_cescodes()
I’ve actually found a more efficient way of doing this. Will post that at a later date once the function is updated. It also means adding another dependency to cesR
.
To call a CES survey you require an associated survey code. The get_cescodes()
function provides a user with a print out of these survey codes. This function does not require any arguments passed to it, but does have the option of setting the number of values returned through the head()
function. If passed without any arguments, the function is prints to the console a dataframe that contains all the survey codes and their associated argument calls.
# get_cescodes function
# creates three vectors of the ces survey codes and associated calls
# converts those vectors to data frames with associated index number for call
# merges the three data frames and renames the columns
# removes the data frame items and prints merged results
# can be used to lookup a survey code and the associated calls.
get_cescodes <- function(indx = 22){
ces1 <- (c("ces2019_web", "ces2019_phone", "ces2015_web", "ces2015_phone", "ces2015_combo",
"ces2011", "ces2008", "ces2004", "ces0411", "ces0406", "ces2000", "ces1997", "ces1993",
"ces1988", "ces1984", "ces1974", "ces7480", "ces72_jnjl", "ces72_sep", "ces72_nov",
"ces1968", "ces1965"))
ces2 <- c('"ces2019_web"', '"ces2019_phone"', '"ces2015_web"', '"ces2015_phone"', '"ces2015_combo"',
'"ces2011"', '"ces2008"', '"ces2004"', '"ces0411"', '"ces0406"', '"ces2000"', '"ces1997"', '"ces1993"',
'"ces1988"', '"ces1984"', '"ces1974"', '"ces7480"', '"ces72_jnjl"', '"ces72_sep"', '"ces72_nov"',
'"ces1968"', '"ces1965"')
ces1 <- data.frame(ces1)
ces1$index <- seq.int(nrow(ces1))
ces2 <- data.frame(ces2)
ces2$index <- seq.int(nrow(ces2))
ces_calltable <- merge(ces1, ces2, by = "index")
ces_calltable <- dplyr::rename(ces_calltable, ces_survey_code = ces1, get_ces_call_char = ces2)
rm(ces1)
rm(ces2)
utils::head(ces_calltable, indx)
}
get_preview()
Sometimes it can be helpful to have a truncated preview of a dataset to assist in exploratory analysis. The get_preview()
function provides just this. The function takes two arguments, a character string to call a survey of the same style as used for the get_ces()
function and a numerical value that sets the number of observations returned. If no value is provided for the number of rows, a default of six is returned.
# function to call to create previews of the CES surveys
# code for the first section of the function is commented with how the function works,
# all following sections work in the same manner.
get_preview <- function(srvy, obs = 6, pos = 1){
# if 'srvy' is in 'ces_codese' vector
if(srvy %in% ces_codes){
# if 'srvy' is equal to 'ces2019_web'
if(srvy == "ces2019_web"){
# create temporary file name holder
hldr <- tempfile(fileext = ".dta")
# if the file does not exist
if(!file.exists(hldr)){
# assign download url
cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
# download the file from the url and assign file name from holder
utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
# create a locally available variable
survey_read <- haven::read_dta(hldr, encoding = "latin1")
# assign the data file to a globally available variable
assign("ces2019_web_preview", utils::head(labelled::to_factor(survey_read), obs), envir = as.environment(pos))
# remove the temporary file
unlink(hldr, recursive = T)
# remove the local variable
rm(survey_read)
}
}
}
else{
# if the provided code is not in the 'ces_codes' vector then stop process and print this message
stop("Incorrect CES dataset code provided. Use function get_cescodes() for a printout of useable code calls")
}
}
get_question()
The get_question()
function provides you with the ability to look up a survey question associated with a given column name. The function takes two arguments in the form of character strings, those being the name of a data object and the name of a column in the given data object. The function works such that it checks whether the given data object exists using the exists()
function from the base
R. If the object does not exist, the function will print out a warning in the console stating Warning: Data object does not exist
. If the object does exist, get_question()
will check if the given column name exists in the given data object. This is done using a combination of the hasName()
function from the utils
package and the get()
function from base
R. The hasName()
function checks if the given column name is in the given data object. Because the arguments are given as character strings the get()
function is used to return the actual data object instead of the provided character string. Otherwise, the hasName()
function would only check if the given column name argument occurred in the given character string argument and not the actual data object. If the column does not exist in the data object, a warning is printed in the console stating Warning: Variable is not in dataset
. If the given column exists in the given data object, get_question()
will print the variable label of the given column to the console using a combination of the var_label()
function from the labelled
package and the get()
function from the base
package.
As a side note, I provide a step-by-step breakdown of this function in this post.
# function to produce the column label for requested dataset and variable
# takes two parameters as character strings
# 'do' data object and 'q' question
get_question <- function(do, q){
if(exists(do)){ # if data object exists
if(utils::hasName(get(do), q)){ # if data object has the name of the given question
message(labelled::var_label(get(q, get(do)))) # print out concatenation of the column label
# the get function is required because it
# returns the object from the provided character string
}
else{
message("Warning: Variable is not in dataset") # else, print this warning if question does not exist
# cat is used instead of stop because stop breaks the function
}
}
else{
message("Warning: Data object does not exist") # else, print this warning if data object does not exist
}
}
get_decon()
The last, but not least, function of the cesR
package is get_decon()
. When called, get_decon()
creates a subset of the 2019 CES online survey under the name decon
(demographics and economics). The get_decon()
function requires no arguments passed to it to be called, but does have an argument of pos
used to set the environment in which the data is loaded. This is by default the global environment.
get_decon()
first checks the global environment if an object of the name decon
exists using the exists()
function from base
R . This prevents the decon
dataset from being recreated if the object exists, preventing an accidental overwrite of the dataset. If the get_decon()
function is run when an object with the name decon
already exists a warning will print in the console stating Warning: Dataframe already exists.
If a situation arises in which the decon
dataset needs to be recreated, then the best course of action is to use the rm()
function from base
R to remove the decon
object and then run the get_decon()
function again.
# function to create 'decon' dataset
# does not use any variable calls
get_decon <- function(pos = 1){
# if object does not exist in global environment
if(!exists("decon")){
# assign url to 'cesfile'
cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
# assign temporary file with .dta extension to placeholder variable
hldr <- tempfile(fileext = ".dta")
# download the file from url assigned to 'cesfile' with file extension from the temporary placeholder
utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
# assign data file to temporary data object
ces2019_hldr <- haven::read_dta(hldr, encoding = "latin1")
# create new data object with selected columns from temporary data object
decon <- dplyr::select(ces2019_hldr, c(5:6, 8:10, 20:49, 69,76, 194, 223:227, 245, 250:251, 258, 123:125))
# rename columns in new data object
decon <- dplyr::rename(decon,
citizenship = 1, # rename column 1 to citizenship
yob = 2, # rename column 2 to yob
gender = 3, # rename column 3 to gender
province_territory = 4, # rename column 4 to province_territory
education = 5, # rename column 5 to education
vote_likely = 6, # rename column 6 to vote_likely
vote_likely_ifable = 7, # rename column 7 to vote_likely_ifable
votechoice = 8, # rename column 8 to votechoice
votechoice_text = 9, # rename column 9 to votechoice_text
votechoice_couldvote = 10, # rename column 10 to votechoice_couldvote
votechoice_couldvote_text = 11, # rename column 11 to votechoice_couldvote_text
vote_unlikely = 12, # rename column 12 to vote_unlikely
vote_unlikely_text = 13, # rename column 13 to voter_unlikely_text
vote_unlikely_couldvote = 14, # rename column 14 to vote_unlikely_couldvote
vote_unlikely_couldvote_text = 15, # rename column 14 to vote_unlikely_couldvote_text
vote_advancevote_choice = 16, # rename column 16 to vote_advancevote_choice
vote_advancevote_choice_text = 17, # rename column 17 to vote_advancevote_choice_text
vote_partylean = 18, # rename column 18 to vote_partylean
vote_partylean_text = 19, # rename column 19 to vote_partylean_text
vote_partylean_couldvote = 20, # rename column 20 to vote_partylean_couldvote
vote_partylean_couldvote_text = 21, # rename column 21 to vote_partylean_couldvote_text
votechoice_secondchoice = 22, # rename column 22 to votechoice_secondchoice
votechoice_secondchoice_text = 23, # rename column 23 to votechoice_secondchoice_text
votechoice_couldvote_secondchoice = 24, # rename column 24 to votechoice_couldvote_secondchoice
votechoice_couldvote_secondchoice_text = 25, # rename column 25 to votechoice_couldvote_secondchoice_text
votechoice_partynotvote_1 = 26, # rename column 26 to votechoice_partynotvote_1
votechoice_partynotvote_2 = 27, # rename column 27 to votechoice_partynotvote_2
votechoice_partynotvote_3 = 28, # rename column 28 to votechoice_partynotvote_3
votechoice_partynotvote_4 = 29, # rename column 29 to votechoice_partynotvote_4
votechoice_partynotvote_5 = 30, # rename column 30 to votechoice_partynotvote_5
votechoice_partynotvote_6 = 31, # rename column 31 to votechoice_partynotvote_6
votechoice_partynotvote_7 = 32, # rename column 32 to votechoice_partynotvote_7
votechoice_partynotvote_8 = 33, # rename column 33 to votechoice_partynotvote_8
votechoice_partynotvote_9 = 34, # rename column 34 to votechoice_partynotvote_9
votechoice_partynotvote_text = 35, # rename column 35 to votechoice_partynotvote_text
lr_scale_bef = 36, # rename column 36 to lr_scale_bef
lr_scale_aft = 37, # rename column 37 to lr_scale_aft
religion = 38, # rename column 38 to religion
sexuality_selected = 39, # rename column 39 to sexuality_selected
sexuality_text = 40, # rename column 40 to sexuality_text
language_eng = 41, # rename column 41 to language_eng
language_fr = 42, # rename column 42 to language_fr
language_abgl = 43, # rename column 43 to language_abgl
employment = 44, # rename column 44 to employment
income = 45, # rename column 45 to income
income_cat = 46, # rename column 46 to income_cat
marital = 47, # rename column 47 to marital
econ_retro = 48, # rename column 48 to econ_retro
econ_fed = 49, # rename column 49 to econ_fed
econ_self = 50) # rename column 50 to econ_self
decon <- labelled::to_factor(decon) # convert variables to factors
decon <- dplyr::mutate(decon, lr_scale_bef = as.character(lr_scale_bef)) # reassign values in lr_scale_bef column as characters for uniting
decon <- dplyr::mutate(decon, lr_scale_aft = as.character(lr_scale_aft)) # reassign values in lr_scale_aft column as characters for uniting
decon <- tidyr::unite(decon, "lr_scale", lr_scale_bef:lr_scale_aft, na.rm = T, remove = F) # unite lr_scale_bef and lr_scale_aft columns into new column lr_scale
decon <- dplyr::mutate_if(decon, is.character, list(~dplyr::na_if(., ""))) # replaces empty cells in new lr column with NA
assign("decon", dplyr::mutate(decon, ces_code = "ces2019_web", .before = 1), envir = as.environment(pos))
# remove temporary data object
rm(ces2019_hldr)
# remove the temporary placeholder
unlink(hldr, recursive = T, force = T)
# print out a concatenation of the survey citation
message("TO CITE THIS SURVEY FILE: Stephenson, Laura B; Harell, Allison; Rubenson, Daniel; Loewen, Peter John, 2020, '2019 Canadian Election Study - Online Survey',
https://doi.org/10.7910/DVN/DUS88V, Harvard Dataverse, V1\nLINK: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V")
}
else{
# if the file does exist stop process and print this message
stop("Warning: Dataframe already exists.")
}
}
Takeaways
Here are a few things that I learned or found useful while creating this package.
Using the normal assign function <-
in a function does not assign a globally available object. Instead, you need to use the assign()
function.
If you are going to use functions from another package in your package, use the ::
method of calling the function. This way there will be no confusion between functions of the same name from different packages.
The document
and roxygenise
functions are your friends. I found that a lot of issues I was having when testing the cesR
function was because I had not run either of these functions.
The R Packages book from Hadley Wickham is one of the best resources you can find for creating an R package. I cannot recommend it enough.
Installation
If you would like to use cesR
, you can install it using:
install.packages("ceseR")
or the development version using:
devtools::install_github("hodgettsp/cesR")