UPDATE: cesR is now available through the CRAN network and can be installed using install.packages("cesR").


This is a bit of a breakdown on my thought process on writing designing the package cesR for R. The full paper for cesR is available from SocArXiv Papers and the full code for this project can be found on my Github account.


Introduction

This past summer (2020 to give a date-mark), I had the privilege of doing my co-op with Dr. Rohan Alexander in the Toronto Data Lab. While this wasn’t the project I had set out to complete, nor the one I had proposed, it was what ended up being the most rewarding.

The purpose of cesR is to make accessing Canadian Election Study (CES) datasets easier for R users. It was inspired by the work being done in the R community through such packages as the opendatatoronto package and the Lahman package. Packages such as these are important as they improve the functionality of working within R by minimizing the number of steps required to load data and increasing the availability of data to R users.

cesR does this through the use of five functions: get_ces(), get_cescodes(), get_preview(), get_question(), and get_decon().



Functions

get_ces()

When called, the get_ces() function returns a requested CES survey as a data object and prints the associated citation and URL for the survey dataset repository to the console. The function takes one argument, srvy, as a character string and a second, pos, numeric argument used to set the environment. srvy is a vector member that is associated with a CES survey. If the provided character string argument matches a member of the built-in vector, the associated file is downloaded using the download.file() function from the utils R package and is stored in a temporary directory. Upon downloading the file, it is read into R using either the read_dta() or read_sav() functions from the haven R package, or the read_tsv() function from the readr package. A data frame is then assigned using the assign() function from the base R package as a data object in the global environment. The downloaded file and file directory are then removed from the computer using the unlink() function from base R. Finally, the recommended citation for the requested survey dataset and URL of the survey data storage location are printed in the console as a message.

If the provided character string argument does not have a match in the built-in vector, then the function process is stopped and a warning message is printed to the console.

Here is an example of the get_ces() code.

# 'get_ces' function, uses one variable 'srvy'
get_ces <- function(srvy, pos = 1){
  # if 'srvy' is in 'ces_codes' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # create temporary file name holder
      hldr <- tempfile(fileext = ".dta")
      # if the file does not exist
      if(!file.exists(hldr)){
        # assign download url
        cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
        # download the file from the url and assign file name from holder
        utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
        # assign the data file to a globally available variable
        assign("ces2019_web", haven::read_dta(hldr, encoding = "latin1"), envir = as.environment(pos))
        # remove the temporary downloaded data file
        unlink(hldr, recursive = T)
        # print citation and link
        message(ref2019web)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Incorrect CES dataset code provided. Use function get_cescodes() for a printout of useable code calls.")
  }
}



get_cescodes()

I’ve actually found a more efficient way of doing this. Will post that at a later date once the function is updated. It also means adding another dependency to cesR.

To call a CES survey you require an associated survey code. The get_cescodes() function provides a user with a print out of these survey codes. This function does not require any arguments passed to it, but does have the option of setting the number of values returned through the head() function. If passed without any arguments, the function is prints to the console a dataframe that contains all the survey codes and their associated argument calls.

# get_cescodes function
# creates three vectors of the ces survey codes and associated calls
# converts those vectors to data frames with associated index number for call
# merges the three data frames and renames the columns
# removes the data frame items and prints merged results
# can be used to lookup a survey code and the associated calls.
get_cescodes <- function(indx = 22){
  ces1 <- (c("ces2019_web", "ces2019_phone", "ces2015_web", "ces2015_phone", "ces2015_combo",
                "ces2011", "ces2008", "ces2004", "ces0411", "ces0406", "ces2000", "ces1997", "ces1993",
                "ces1988", "ces1984", "ces1974", "ces7480", "ces72_jnjl", "ces72_sep", "ces72_nov",
                "ces1968", "ces1965"))
  ces2 <- c('"ces2019_web"', '"ces2019_phone"', '"ces2015_web"', '"ces2015_phone"', '"ces2015_combo"',
                '"ces2011"', '"ces2008"', '"ces2004"', '"ces0411"', '"ces0406"', '"ces2000"', '"ces1997"', '"ces1993"',
                '"ces1988"', '"ces1984"', '"ces1974"', '"ces7480"', '"ces72_jnjl"', '"ces72_sep"', '"ces72_nov"',
                '"ces1968"', '"ces1965"')
  ces1 <- data.frame(ces1)
  ces1$index <- seq.int(nrow(ces1))
  ces2 <- data.frame(ces2)
  ces2$index <- seq.int(nrow(ces2))
  ces_calltable <- merge(ces1, ces2, by = "index")
  ces_calltable <- dplyr::rename(ces_calltable, ces_survey_code = ces1, get_ces_call_char = ces2)
  rm(ces1)
  rm(ces2)
  utils::head(ces_calltable, indx)
}



get_preview()

Sometimes it can be helpful to have a truncated preview of a dataset to assist in exploratory analysis. The get_preview() function provides just this. The function takes two arguments, a character string to call a survey of the same style as used for the get_ces() function and a numerical value that sets the number of observations returned. If no value is provided for the number of rows, a default of six is returned.

# function to call to create previews of the CES surveys
# code for the first section of the function is commented with how the function works,
# all following sections work in the same manner.
get_preview <- function(srvy, obs = 6, pos = 1){
  # if 'srvy' is in 'ces_codese' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # create temporary file name holder
      hldr <- tempfile(fileext = ".dta")
      # if the file does not exist
      if(!file.exists(hldr)){
        # assign download url
        cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
        # download the file from the url and assign file name from holder
        utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
        # create a locally available variable
        survey_read <- haven::read_dta(hldr, encoding = "latin1")
        # assign the data file to a globally available variable
        assign("ces2019_web_preview", utils::head(labelled::to_factor(survey_read), obs), envir = as.environment(pos))
        # remove the temporary file
        unlink(hldr, recursive = T)
        # remove the local variable
        rm(survey_read)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Incorrect CES dataset code provided. Use function get_cescodes() for a printout of useable code calls")
  }
}



get_question()

The get_question() function provides you with the ability to look up a survey question associated with a given column name. The function takes two arguments in the form of character strings, those being the name of a data object and the name of a column in the given data object. The function works such that it checks whether the given data object exists using the exists() function from the base R. If the object does not exist, the function will print out a warning in the console stating Warning: Data object does not exist. If the object does exist, get_question() will check if the given column name exists in the given data object. This is done using a combination of the hasName() function from the utils package and the get() function from base R. The hasName() function checks if the given column name is in the given data object. Because the arguments are given as character strings the get() function is used to return the actual data object instead of the provided character string. Otherwise, the hasName() function would only check if the given column name argument occurred in the given character string argument and not the actual data object. If the column does not exist in the data object, a warning is printed in the console stating Warning: Variable is not in dataset. If the given column exists in the given data object, get_question() will print the variable label of the given column to the console using a combination of the var_label() function from the labelled package and the get() function from the base package.

As a side note, I provide a step-by-step breakdown of this function in this post.

# function to produce the column label for requested dataset and variable
# takes two parameters as character strings
# 'do' data object and 'q' question
get_question <- function(do, q){
  if(exists(do)){                                                     # if data object exists
    if(utils::hasName(get(do), q)){                                   # if data object has the name of the given question
      message(labelled::var_label(get(q, get(do))))                       # print out concatenation of the column label
                                                                      # the get function is required because it
                                                                      # returns the object from the provided character string
    }
    else{
      message("Warning: Variable is not in dataset")                      # else, print this warning if question does not exist
                                                                      # cat is used instead of stop because stop breaks the function
    }
  }
  else{
    message("Warning: Data object does not exist")                        # else, print this warning if data object does not exist
  }
}



get_decon()

The last, but not least, function of the cesR package is get_decon(). When called, get_decon() creates a subset of the 2019 CES online survey under the name decon (demographics and economics). The get_decon() function requires no arguments passed to it to be called, but does have an argument of pos used to set the environment in which the data is loaded. This is by default the global environment.

get_decon() first checks the global environment if an object of the name decon exists using the exists() function from base R . This prevents the decon dataset from being recreated if the object exists, preventing an accidental overwrite of the dataset. If the get_decon() function is run when an object with the name decon already exists a warning will print in the console stating Warning: Dataframe already exists. If a situation arises in which the decon dataset needs to be recreated, then the best course of action is to use the rm() function from base R to remove the decon object and then run the get_decon() function again.

# function to create 'decon' dataset
# does not use any variable calls
get_decon <- function(pos = 1){
    # if object does not exist in global environment
    if(!exists("decon")){
       # assign url to 'cesfile'
       cesfile <- "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/DUS88V/RZFNOV"
       # assign temporary file with .dta extension to placeholder variable
       hldr <- tempfile(fileext = ".dta")
       # download the file from url assigned to 'cesfile' with file extension from the temporary placeholder
       utils::download.file(cesfile, hldr, quiet = F, mode = "wb")
       # assign data file to temporary data object
       ces2019_hldr <- haven::read_dta(hldr, encoding = "latin1")
       # create new data object with selected columns from temporary data object
       decon <- dplyr::select(ces2019_hldr, c(5:6, 8:10, 20:49, 69,76, 194, 223:227, 245, 250:251, 258, 123:125))
       # rename columns in new data object
       decon <- dplyr::rename(decon,
                              citizenship = 1,                                                        # rename column 1 to citizenship
                              yob = 2,                                                                # rename column 2 to yob
                              gender = 3,                                                             # rename column 3 to gender
                              province_territory = 4,                                                 # rename column 4 to province_territory
                              education = 5,                                                          # rename column 5 to education
                              vote_likely = 6,                                                        # rename column 6 to vote_likely
                              vote_likely_ifable = 7,                                                 # rename column 7 to vote_likely_ifable
                              votechoice = 8,                                                         # rename column 8 to votechoice
                              votechoice_text = 9,                                                    # rename column 9 to votechoice_text
                              votechoice_couldvote = 10,                                              # rename column 10 to votechoice_couldvote
                              votechoice_couldvote_text = 11,                                         # rename column 11 to votechoice_couldvote_text
                              vote_unlikely = 12,                                                     # rename column 12 to vote_unlikely
                              vote_unlikely_text = 13,                                                # rename column 13 to voter_unlikely_text
                              vote_unlikely_couldvote = 14,                                           # rename column 14 to vote_unlikely_couldvote
                              vote_unlikely_couldvote_text = 15,                                      # rename column 14 to vote_unlikely_couldvote_text
                              vote_advancevote_choice = 16,                                           # rename column 16 to vote_advancevote_choice
                              vote_advancevote_choice_text = 17,                                      # rename column 17 to vote_advancevote_choice_text
                              vote_partylean = 18,                                                    # rename column 18 to vote_partylean
                              vote_partylean_text = 19,                                               # rename column 19 to vote_partylean_text
                              vote_partylean_couldvote = 20,                                          # rename column 20 to vote_partylean_couldvote
                              vote_partylean_couldvote_text = 21,                                     # rename column 21 to vote_partylean_couldvote_text
                              votechoice_secondchoice = 22,                                           # rename column 22 to votechoice_secondchoice
                              votechoice_secondchoice_text = 23,                                      # rename column 23 to votechoice_secondchoice_text
                              votechoice_couldvote_secondchoice = 24,                                 # rename column 24 to votechoice_couldvote_secondchoice
                              votechoice_couldvote_secondchoice_text = 25,                            # rename column 25 to votechoice_couldvote_secondchoice_text
                              votechoice_partynotvote_1 = 26,                                         # rename column 26 to votechoice_partynotvote_1
                              votechoice_partynotvote_2 = 27,                                         # rename column 27 to votechoice_partynotvote_2
                              votechoice_partynotvote_3 = 28,                                         # rename column 28 to votechoice_partynotvote_3
                              votechoice_partynotvote_4 = 29,                                         # rename column 29 to votechoice_partynotvote_4
                              votechoice_partynotvote_5 = 30,                                         # rename column 30 to votechoice_partynotvote_5
                              votechoice_partynotvote_6 = 31,                                         # rename column 31 to votechoice_partynotvote_6
                              votechoice_partynotvote_7 = 32,                                         # rename column 32 to votechoice_partynotvote_7
                              votechoice_partynotvote_8 = 33,                                         # rename column 33 to votechoice_partynotvote_8
                              votechoice_partynotvote_9 = 34,                                         # rename column 34 to votechoice_partynotvote_9
                              votechoice_partynotvote_text = 35,                                      # rename column 35 to votechoice_partynotvote_text
                              lr_scale_bef = 36,                                                      # rename column 36 to lr_scale_bef
                              lr_scale_aft = 37,                                                      # rename column 37 to lr_scale_aft
                              religion = 38,                                                          # rename column 38 to religion
                              sexuality_selected = 39,                                                # rename column 39 to sexuality_selected
                              sexuality_text = 40,                                                    # rename column 40 to sexuality_text
                              language_eng = 41,                                                      # rename column 41 to language_eng
                              language_fr = 42,                                                       # rename column 42 to language_fr
                              language_abgl = 43,                                                     # rename column 43 to language_abgl
                              employment = 44,                                                        # rename column 44 to employment
                              income = 45,                                                            # rename column 45 to income
                              income_cat = 46,                                                        # rename column 46 to income_cat
                              marital = 47,                                                           # rename column 47 to marital
                              econ_retro = 48,                                                        # rename column 48 to econ_retro
                              econ_fed = 49,                                                          # rename column 49 to econ_fed
                              econ_self = 50)                                                         # rename column 50 to econ_self
       decon <- labelled::to_factor(decon)                                                            # convert variables to factors
       decon <- dplyr::mutate(decon, lr_scale_bef = as.character(lr_scale_bef))                       # reassign values in lr_scale_bef column as characters for uniting
       decon <- dplyr::mutate(decon, lr_scale_aft = as.character(lr_scale_aft))                       # reassign values in lr_scale_aft column as characters for uniting
       decon <- tidyr::unite(decon, "lr_scale", lr_scale_bef:lr_scale_aft, na.rm = T, remove = F)     # unite lr_scale_bef and lr_scale_aft columns into new column lr_scale
       decon <- dplyr::mutate_if(decon, is.character, list(~dplyr::na_if(., "")))                     # replaces empty cells in new lr column with NA
       assign("decon", dplyr::mutate(decon, ces_code = "ces2019_web", .before = 1), envir = as.environment(pos))
       # remove temporary data object
       rm(ces2019_hldr)
       # remove the temporary placeholder
       unlink(hldr, recursive = T, force = T)
       # print out a concatenation of the survey citation
       message("TO CITE THIS SURVEY FILE: Stephenson, Laura B; Harell, Allison; Rubenson, Daniel; Loewen, Peter John, 2020, '2019 Canadian Election Study - Online Survey',
           https://doi.org/10.7910/DVN/DUS88V, Harvard Dataverse, V1\nLINK: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V")
    }
    else{
        # if the file does exist stop process and print this message
        stop("Warning: Dataframe already exists.")
    }
}



Takeaways

Here are a few things that I learned or found useful while creating this package.

Using the normal assign function <- in a function does not assign a globally available object. Instead, you need to use the assign() function.

If you are going to use functions from another package in your package, use the :: method of calling the function. This way there will be no confusion between functions of the same name from different packages.

The document and roxygenise functions are your friends. I found that a lot of issues I was having when testing the cesR function was because I had not run either of these functions.

The R Packages book from Hadley Wickham is one of the best resources you can find for creating an R package. I cannot recommend it enough.




Installation

If you would like to use cesR, you can install it using:

install.packages("ceseR")

or the development version using:

devtools::install_github("hodgettsp/cesR")