Data Cleaning and Transformation

Dr. Nathaniel Cline

A Note on Grammars of Wrangling and Graphics

A grammar of data wrangling…

… based on the concept of functions as verbs that manipulate data frames:

  • select: pick columns by name
  • arrange: reorder rows
  • slice: pick rows using index(es)
  • filter: pick rows matching criteria
  • distinct: filter for unique rows
  • mutate: add new variables
  • summarise: reduce variables to values
  • group_by: for grouped operations
  • … (many more)
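
As a minimal sketch of the verbs listed above, the following applies each one to mtcars, a dataset that ships with R. The choice of dataset, the column choices, and the result names (small, sorted, by_cyl, and so on) are assumptions made for illustration only.

```r
library(dplyr)

# mtcars ships with R; here it stands in for "a data frame"
cars <- mtcars

small   <- select(cars, mpg, cyl, hp)             # pick columns by name
sorted  <- arrange(cars, desc(mpg))               # reorder rows, highest mpg first
first3  <- slice(cars, 1:3)                       # pick rows by index
four    <- filter(cars, cyl == 4)                 # pick rows matching a criterion
cyls    <- distinct(cars, cyl)                    # keep the unique values of cyl
ratio   <- mutate(cars, hp_per_wt = hp / wt)      # add a new variable
overall <- summarise(cars, mean_mpg = mean(mpg))  # reduce a variable to one value

# group_by() makes summarise() compute one value per group
by_cyl <- summarise(group_by(cars, cyl), mean_mpg = mean(mpg))
```

Each call takes a data frame and returns a new one, so the results can be inspected or passed to another verb.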

Rules of dplyr functions

  • First argument is always a data frame
  • Subsequent arguments say what to do with that data frame
  • Always return a data frame
  • Don’t modify in place
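
The rules above can be checked directly. This is a small sketch, again assuming mtcars as the example data; the names cars, heavy, and hp_per_wt are illustrative.

```r
library(dplyr)

cars <- mtcars

# First argument: the data frame. Remaining arguments: what to do with it.
heavy <- filter(cars, wt > 4)

# The result is again a data frame
returns_df <- is.data.frame(heavy)

# The input was not modified in place: cars still has all of its rows
unchanged <- nrow(cars) == nrow(mtcars)

# To keep a result, assign it to a name (here we overwrite cars on purpose)
cars <- mutate(cars, hp_per_wt = hp / wt)
```

Because nothing is modified in place, forgetting the assignment in the last line would compute the new column and then discard it.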

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

Think about the following sequence of actions: find keys, unlock car, start car, drive to work, park.

  • Expressed as a set of nested functions in R pseudocode, this would look like:
park(drive(start_car(find("keys")), to = "work"))
  • Writing it out using pipes gives it a more natural (and easier to read) structure:
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
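
The car example is pseudocode, so here is a runnable sketch of the same contrast using real dplyr verbs. The dataset (mtcars) and the particular pipeline are assumptions chosen for illustration; both versions compute the same result.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(mtcars, hp > 100), cyl),
    mean_mpg = mean(mpg)
  ),
  desc(mean_mpg)
)

# Piped: read from top to bottom, one step per line
piped <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

# The two forms should agree
same <- isTRUE(all.equal(nested, piped))
```

The piped form lists the steps in the order they happen, which is the main reason it is easier to read.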

A note on piping and layering

  • %>% is used mainly in dplyr pipelines: we pipe the output of the previous line of code in as the first input of the next line of code

  • + is used in ggplot2 plots for “layering”: we create the plot in layers, separated by +
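
Both operators often appear in the same chunk of code. The sketch below, which assumes mtcars and an arbitrary filter, pipes a data frame into ggplot() and then switches to + for the plot layers.

```r
library(dplyr)
library(ggplot2)

p <- mtcars %>%
  filter(cyl != 6) %>%                      # %>% passes the data frame along
  ggplot(mapping = aes(x = wt, y = mpg)) +  # from here on, + adds layers
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# print(p) draws the plot
```

This works because the piped data frame becomes the first argument of ggplot(), which is its data argument. Using %>% after ggplot() instead of + is a common mistake and results in an error.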

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
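
Filling in the template with concrete values gives something like the following. The dataset (faithful, which ships with R), the geom, and the labels are assumptions picked for illustration.

```r
library(ggplot2)

p <- ggplot(data = faithful,
            mapping = aes(x = eruptions, y = waiting)) +
  geom_point() +                                   # geom_xxx(): draw points
  labs(title = "Old Faithful eruptions",           # other options: labels
       x = "Eruption time (min)",
       y = "Waiting time (min)")

# print(p) draws the plot
```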

Aesthetics options

Commonly used characteristics of plotting characters that can be mapped to a specific variable in the data are:

  • colour
  • shape
  • size
  • alpha (transparency)

Mapping vs. setting

  • Mapping: Determine the size, alpha, etc. of points based on the values of a variable in the data
    • goes into aes()
  • Setting: Determine the size, alpha, etc. of points not based on the values of a variable in the data
    • goes into geom_*()
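
The difference between mapping and setting is easiest to see side by side. In this sketch (mtcars and the specific colour, size, and alpha values are illustrative assumptions), the first plot maps colour and size to variables and the second sets fixed values.

```r
library(ggplot2)

# Mapping: colour and size depend on variables, so they go inside aes()
mapped <- ggplot(mtcars, aes(x = wt, y = mpg,
                             colour = factor(cyl), size = hp)) +
  geom_point()

# Setting: every point gets the same colour, size, and alpha,
# so the values go into geom_point(), outside aes()
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue", size = 3, alpha = 0.5)
```

A frequent slip is writing aes(colour = "steelblue"): that maps colour to a constant string rather than setting it, so the points are not drawn in steel blue and a legend appears.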

Faceting

  • Smaller plots that display different subsets of the data

  • Useful for exploring conditional relationships and large data
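
A minimal faceting sketch, assuming mtcars and cyl as the variable to split on: facet_wrap() draws one small panel per value of cyl.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)   # one panel for each number of cylinders

# print(p) draws the plot
```

Each panel shows the relationship between wt and mpg conditional on cyl, which is the kind of conditional relationship faceting is meant to expose.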