Introduction

A few months ago we managed to get another package onto CRAN. AustralianElections, similar in concept to cesR, is a data package designed to improve data accessibility and workflow reproducibility. In this case, it provides access to data covering over a century of Australian elections, from 1901 to 2019. As a note, I was not involved in collecting the data, so I cannot speak to those aspects of this project; my main focus here was ensuring access to the data in an efficient way. This post isn’t so much a guide to using the package as a breakdown of the function code.

The package provides this access through three functions:

Functions

get_auselection()

This is the driving function behind AustralianElections. It is what allows you to download and load the collected datasets. The code used in this function is based upon what I wrote for the get_ces() function for cesR, but (!) I think this is much more efficient. To the point that I might have to re-write cesR functions to follow this code. We live and we learn, eh?

Housekeeping

Before moving into the aspects of the function that call specific datasets, a few housekeeping measures are taken. First, the function takes three arguments: dsr (a character string used to perform a dataset request), opr (a character string used to pass a relational operator or set a range of years), and year (a numeric value used to request a single election year or range of years).

Next, the function establishes the accepted operators to be used with the opr argument. These include all relational operators as well as the character string “range”. These are assigned as character strings to a vector contained within the function that is checked against by later statements.

Next, the download links are assigned to a vector. A requested dataset provided through the dsr argument is checked against the vector members and calls a specific url based upon the index location within the vector.

Next, a vector of all election years is assigned. This is used to check against the provided year argument.

Next, a tibble of request codes with matching dataset information is built for each dataset. The information includes the associated dataset, the years covered by the data, number of observations within the dataset, and the associated request codes. Why a tibble, you may ask. Because this provides a nice reference that can be printed to the console if you need to look up a specific request code or want information on a dataset. I explain how to call these codes in the next section.

Note that this all works for this function because there are so few items to establish: only five datasets. As this number increases, so too will the resources used. Keep this in mind if following this style of building a function; it may not always be entirely effective or efficient.

# assign function name `get_auselection`
get_auselection <- function(dsr, opr = "", year = 0){

  # create vector to hold acceptable `opr` values
  relops <- c("==", ">=", "<=", "<", ">", "!=", "", "range")

  # assign download URLs vector, these are used to download the data files in a .csv format
  dwnlds <- c("https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/byelections.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/elections.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/parliaments.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data_with_ids.csv"

  )
  # assign a vector of election years
  elect_years <- c(1901, 1903, 1906, 1910, 1913, 1914, 1917, 1919, 1922, 1925, 1928, 1929, 1931, 1934, 1937, 1940,
                   1943, 1946, 1949, 1951, 1954, 1955, 1958, 1961, 1963, 1966, 1969, 1972, 1974, 1975, 1977, 1980,
                   1983, 1984, 1987, 1990, 1993, 1996, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019)

  # assign a tibble to hold the request codes used as arguments
  codetibble <- tibble::tibble(
    request_code = c("byelections", "elections", "parliaments", "voting_data", "voting_data_with_ids", "codes"),
    dataset = c("Byelection data - 1901-2018 - 158 obs. - 9 var.", "Election data - 1901-2019 - 46 obs. - 4 var.", "Parliament data - 1901-2019 - 75 obs. - 9 var.",
                "Voting data - 1901-2019 - 65337 obs. - 25 var.", "Voting data with IDs - 1901-2019 - 65337 obs. - 22 var.", "Request codes - used to download a dataset.")
  )
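
As an aside, the positional lookup described above could also be keyed by name. What follows is a minimal sketch, not the code used in the package: `dataset_urls` and `lookup_url()` are hypothetical names introduced only for illustration, and the URLs are the same five that appear in the dwnlds vector.

```r
# Hypothetical alternative: key each download URL by its request code
# instead of relying on its position in the vector.
dataset_urls <- c(
  byelections = "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/byelections.csv",
  elections = "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/elections.csv",
  parliaments = "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/parliaments.csv",
  voting_data = "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data.csv",
  voting_data_with_ids = "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data_with_ids.csv"
)

# Hypothetical helper: return the URL for a request code, or stop if unknown.
lookup_url <- function(dsr) {
  if (!is.character(dsr) || length(dsr) != 1 || !(dsr %in% names(dataset_urls))) {
    stop("Provided request code is not associated with a dataset.")
  }
  # `[[` on a named character vector returns the element without its name
  dataset_urls[[dsr]]
}

lookup_url("parliaments")
```

The trade-off is the same as with the indexed vector: it is tidy for five datasets, and reordering the entries can no longer break a lookup.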
  

Errors

Part of what makes a good function is that it stops and tells you if something is wrong. It may mean more initial coding, but ultimately means fewer struggles for the people calling the function. Here, I’ve built in three stop states that will stop the function from running and tell you what went wrong.

First, if the provided dataset request code is not a character string the function will stop and print out, “Provided request code must be a character string.”

Second, if the provided year is not numeric the function will stop and print out the statement, “Year must be a numeric value greater than 0.”

Third, and lastly, if the provided relational operator is not within the built-in vector, then the function will stop and the message, “Provided value for opr is not a valid relational operator. Please use one of ‘==’, ‘>=’, ‘<=’, ‘>’, ‘<’, ‘!=’. Use opr = ‘range’ to return a range of election years” will print to the console.

  # set first error for if dsr is not a character string
  if(!purrr::is_character(dsr)){
    stop("Provided request code must be a character string.")
  }

  # set error for if year is not a numeric value
  else if(!is.numeric(year)){
    stop("Year must be a numeric value greater than 0.")
  }

  # set error for when passed an incorrect relational operator
  else if(!(opr %in% relops)){
    stop("Provided value for `opr` is not a valid relational operator. Please use one of '==', '>=', '<=', '>', '<', '!='. Use opr = 'range' to return a range of election years.")
  }
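
To see how these stop states behave in isolation, here is a small sketch that pulls the three checks into a standalone function. It is an illustration under assumptions rather than the package code: `check_args()` is a hypothetical name, base R's `is.character()` stands in for `purrr::is_character()` so the sketch has no dependencies, and the last message is shortened.

```r
# Accepted values for `opr`, copied from the function above
relops <- c("==", ">=", "<=", "<", ">", "!=", "", "range")

# Hypothetical standalone validator mirroring the three checks
check_args <- function(dsr, opr = "", year = 0) {
  if (!is.character(dsr)) {
    stop("Provided request code must be a character string.")
  }
  if (!is.numeric(year)) {
    stop("Year must be a numeric value greater than 0.")
  }
  if (!(opr %in% relops)) {
    # message shortened for this sketch
    stop("Provided value for `opr` is not a valid relational operator.")
  }
  invisible(TRUE)
}

# tryCatch() turns the stop() into a value, which makes the message easy to inspect
msg <- tryCatch(
  check_args("byelections", opr = "=>", year = 1990),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because each check calls `stop()`, plain `if` statements are enough here; the first failing check ends the function.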

Arguments and Temporary Files

As I mentioned earlier, the reason I put the request codes into a tibble was so that they could be printed alongside some information to the console. The other thing I’ve done is set it so this function can be used to show all the years in which elections were held and that can be used to filter the data.

To show these years all you need to do is set the argument for dsr to “years” as such, get_auselection(dsr = "years").

Likewise, if you want to see what codes to use to request a dataset all you need to do is set dsr to “codes” as such, get_auselection(dsr = "codes").

The last little bit of code in the following chunk doesn’t have to do with codes or years, but is important to the rest of the function. What it does is establish a temporary csv file path that will stand in as the download location. This gives the function somewhere to download and read the datasets without permanently adding the files to your hard drive. Save on resources and all that.

  # show in which years an election was held
  else if(dsr == "years"){
    elect_years
  }
  # else if provided character string is in request_code column values
  else if(dsr %in% codetibble$request_code){
    # assign to temporary file with .csv file type
    tmpdir <- tempfile(fileext = ".csv")

    # return request code arguments if given "codes" as argument
    if(dsr == "codes"){
      # show first five values of codetibble
      utils::head(codetibble, 5)
    }
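
If the temporary file step feels a bit abstract, the sketch below walks through it without touching the network. The toy data frame and the write step are assumptions that stand in for the download; only `tempfile(fileext = ".csv")` is taken from the function itself.

```r
# tempfile() only builds a unique path inside the session's temporary
# directory; nothing exists on disk until something is written to that path.
tmp <- tempfile(fileext = ".csv")
exists_before <- file.exists(tmp)

# Stand-in for the download step: write a tiny made-up CSV to the temporary path
toy <- data.frame(id = 1:2, label = c("a", "b"))
utils::write.csv(toy, tmp, row.names = FALSE)

# Read it back, as the function does after downloading
back <- utils::read.csv(tmp)

# Optional tidy-up; R also removes its per-session temporary directory on exit
unlink(tmp)
exists_after <- file.exists(tmp)

print(back)
```

The function itself does not call `unlink()`; it simply leaves the file in the session's temporary directory.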

Getting the Faux Meat (or Downloading and Loading Datasets)

Now we’re at the faux meat of the function. Here we get the data that is ever so important to the analyses we wish to perform. This works similarly to get_ces() from cesR in that it runs through if else statements and downloads a dataset given a specific set of parameters. In the code chunk below, dsr is given the value “byelections”, meaning the byelections dataset is what will be downloaded. This is written as an else if statement because technically the codes argument is what starts the if statement order.

So, if dsr is equivalent to “byelections” (dsr == "byelections") then the first item in the vector dwnlds (dwnlds[[1]]) is downloaded to the temporary file. Then, if the year is left blank it defaults to a value of 0 (any(year == 0)) and the full dataset is read in. However, if a year is provided (any(year != 0)) then a temporary variable x is created to hold the dataset to be filtered for the provided year(s). This temporary variable is created within the function and is not available in the global environment in which you would normally be working.

    # if `dsr` is equal to "byelections"
    else if(dsr == "byelections"){

      # download from the assigned URL to the temporary directory, showing download progress
      utils::download.file(dwnlds[[1]], tmpdir, quiet = F)


      # if year is left at default value of 0
      if(any(year == 0)){
        # read in complete CSV using `read_csv` from `readr`, do not show column types
        readr::read_csv(tmpdir, show_col_types = F)
      }

      # else if year does not equal default value
      else if(any(year != 0)){

        # read in CSV and filter for the requested year
        x <- readr::read_csv(tmpdir, show_col_types = F)

Relational Operators and Filter Election Years (and a Final Stop Message)

And this is where we filter x for the provided year(s) using the provided relational operator. This operator is provided as a character string, which is compared against each accepted operator in turn. Using the provided operator and year, the data in x is filtered accordingly. The last section of this code does not rely on an operator but instead uses a range of years set by providing the string “range” instead of an operator. This is useful if you are looking for, say, elections from 1970 to 2010. In this case you would use get_auselection(dsr = "byelections", opr = "range", year = c(1970, 2010)).

Finally, the function ends with another stop message: if the provided request code is not in the lookup tibble, the function stops.


        # if opr is set to specific relational operator
        if(opr == "==" | opr == ""){

          # filter dataset using given relational operator
          # the rest of the code follows similarly
          dplyr::filter(x, lubridate::year(date) == year)
        }

        else if(opr == ">="){
          dplyr::filter(x, lubridate::year(date) >= year)
        }

        else if(opr == "<="){
          dplyr::filter(x, lubridate::year(date) <= year)
        }

        else if(opr == "<"){
          dplyr::filter(x, lubridate::year(date) < year)
        }

        else if(opr == ">"){
          dplyr::filter(x, lubridate::year(date) > year)
        }

        else if(opr == "!="){
          dplyr::filter(x, lubridate::year(date) != year)
        }

        else if(opr == "range"){
          dplyr::filter(x, lubridate::year(date) >= min(year), lubridate::year(date) <= max(year))
        }
      }
    }
    # closes the `dsr %in% codetibble$request_code` branch opened earlier
  }
  else{
    # if the character string is not in the tibble values, stop the function and return an error
    stop("Provided request code is not associated with a dataset. Please use get_auselection('codes') to print out a set of useable request codes.")
  }
}
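
For anyone curious about a more compact way to handle the operators, here is a sketch that swaps the if/else chain for `match.fun()`, which turns a string such as ">=" into the corresponding function. This is a different technique from the one used in the package, written in base R only: `filter_by_year()` is a hypothetical helper, `format(date, "%Y")` stands in for `lubridate::year()`, and the toy dates are made up.

```r
# Hypothetical helper: filter rows by year using an operator passed as a string
filter_by_year <- function(x, opr, year) {
  relops <- c("==", ">=", "<=", "<", ">", "!=", "", "range")
  if (!(opr %in% relops)) {
    stop("Provided value for `opr` is not a valid relational operator.")
  }
  # Extract the calendar year from the Date column
  yrs <- as.integer(format(x$date, "%Y"))
  if (opr == "range") {
    keep <- yrs >= min(year) & yrs <= max(year)
  } else {
    if (opr == "") opr <- "=="   # an empty operator is treated as equality
    keep <- match.fun(opr)(yrs, year)
  }
  x[keep, , drop = FALSE]
}

# Toy data with made-up dates, only to exercise the helper
toy <- data.frame(
  id = 1:4,
  date = as.Date(c("1969-06-01", "1975-06-01", "1990-06-01", "2015-06-01"))
)

filter_by_year(toy, opr = "range", year = c(1970, 2010))$id  # rows 2 and 3
filter_by_year(toy, opr = ">=", year = 1990)$id              # rows 3 and 4
```

Checking `opr` against the accepted vector before calling `match.fun()` matters here, since otherwise any function name could be passed in.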

auselect_winners()

Onto the veggie gravy of this faux meat and potatoes meal. This next function, auselect_winners(), downloads a dataset and returns it filtered to the election winners, ready to be assigned to a variable. It uses one of two created datasets, where the data either does or does not include an ID variable.

The function takes only one argument, w_ids, which is a boolean and tells the function whether to download the dataset with IDs or without. By default the setting is TRUE, so the data will download with IDs. Usually more useful to have a column you can remove as opposed to trying to add that column later. It’s like tailoring or carpentry. Easier to remove than add.

The code for this function can be seen below. It’s fairly short so I’ve kept it all in one chunk.

The way this works is that a vector of the download urls is first created. Again, because there are only two urls this is fairly efficient, but as this number increases so too do resource issues.

Next, a temporary directory and file location is established using the tempfile function. This is similar to what is being done in get_auselection().

After that, if the w_ids argument is TRUE or T then the url at the second index of the dwnlds vector is called.

The data at this url is read to a temporary variable x and then filtered for the created winnerDummy variable where 1 means winner.

The next part after that is just for when w_ids is FALSE or F. It works the same way as when w_ids is TRUE but downloads the item at the first index of the dwnlds vector.

And that’s it for auselect_winners(). Quick and simple, yet very useful.

auselect_winners <- function(w_ids = T){
  dwnlds <- c("https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data_with_ids.csv"

  )

  tmpdir <- tempfile(fileext = ".csv")

  if(w_ids == T){
    utils::download.file(dwnlds[[2]], tmpdir, quiet = F)

    x <- readr::read_csv(tmpdir, show_col_types = F)

    dplyr::filter(x, winnerDummy == 1)
  }

  else if(w_ids == F){
    utils::download.file(dwnlds[[1]], tmpdir, quiet = F)

    x <- readr::read_csv(tmpdir, show_col_types = F)

    dplyr::filter(x, winnerDummy == 1)
  }
}
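
Since the two branches differ only in which url they use, the choice could be pulled out on its own. The sketch below is one hedged way of doing that, not the package's code: `pick_voting_url()` is a hypothetical helper, and it spells out TRUE and FALSE because T and F are ordinary variables that can be reassigned, while TRUE and FALSE are reserved words.

```r
# Hypothetical helper: choose the download URL from a logical flag
pick_voting_url <- function(w_ids = TRUE) {
  dwnlds <- c(
    "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data.csv",
    "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data_with_ids.csv"
  )
  # Reject anything that is not a single TRUE or FALSE (including NA)
  if (!is.logical(w_ids) || length(w_ids) != 1 || is.na(w_ids)) {
    stop("`w_ids` must be TRUE or FALSE.")
  }
  if (w_ids) dwnlds[[2]] else dwnlds[[1]]
}

pick_voting_url()       # the file with IDs
pick_voting_url(FALSE)  # the file without IDs
```

The download, read, and filter steps would then only need to be written once, using whichever url comes back.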

auselect_twopp()

And now for some oven roasted veggies. This is the last of the three functions included in the AustralianElections package. What auselect_twopp() does is download another dataset and return a filtered version of it that you can assign to a variable. This is practically the same as auselect_winners(), so my breakdown for that function is applicable here.

The function takes one boolean argument that determines which of two datasets to download.

A vector of urls is created within the function and the provided boolean determines which of these urls to use.

A temporary variable is created within the function to hold the downloaded data and this data is filtered based upon the twoPP dummy variable.

auselect_twopp <- function(w_ids = T){
  dwnlds <- c("https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data.csv",
              "https://raw.github.com/RohanAlexander/australian_federal_elections/master/outputs/voting_data_with_ids.csv"

  )

  tmpdir <- tempfile(fileext = ".csv")

  if(w_ids == T){
    utils::download.file(dwnlds[[2]], tmpdir, quiet = F)

    x <- readr::read_csv(tmpdir, show_col_types = F)

    dplyr::filter(x, twoPP == 1)
  }

  else if(w_ids == F){
    utils::download.file(dwnlds[[1]], tmpdir, quiet = F)

    x <- readr::read_csv(tmpdir, show_col_types = F)

    dplyr::filter(x, twoPP == 1)
  }
}
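
Because auselect_winners() and auselect_twopp() differ only in the dummy column they filter on, that shared step can be sketched as a single helper. This is an illustration under assumptions, written in base R rather than dplyr: `keep_flagged()` is a hypothetical name, and the toy rows are made up, with only the column names winnerDummy and twoPP taken from the functions above.

```r
# Hypothetical helper: keep the rows where a 0/1 dummy column equals 1
keep_flagged <- function(x, flag) {
  if (!(flag %in% names(x))) {
    stop("Column `", flag, "` not found in the data.")
  }
  # Rows with a missing flag are dropped, as dplyr::filter() would do
  x[!is.na(x[[flag]]) & x[[flag]] == 1, , drop = FALSE]
}

# Toy rows with made-up values, only to exercise the helper
toy <- data.frame(
  candidate = c("A", "B", "C"),
  winnerDummy = c(1, 0, 0),
  twoPP = c(1, 1, 0)
)

keep_flagged(toy, "winnerDummy")$candidate  # candidate A
keep_flagged(toy, "twoPP")$candidate        # candidates A and B
```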

Takeaways

Writing this package, I was really able to improve upon my designs for cesR. Instead of having every vector pre-established and then called upon, everything is just created within the function. It really cleans up the code and is constructed in a much more understandable way. I’ll definitely be going back to rework some of cesR’s code.

Installation

If you wish to use AustralianElections you can install it using the following code.

install.packages("AustralianElections")