Overview

Raw and processed data

Tidy Data

  1. Raw Data
    • no software processing has been done
    • has not been manipulated, removed, or summarized in any way
  2. Tidy data set
    • end goal of the data cleaning process
    • each variable should be in one column
    • each observation of that variable should be in a different row
    • one table for each kind of variable
      • if there are multiple tables, there should be a column to link them
    • include a row at the top of each file with variable names (variable names should make sense)
    • in general, data should be saved in one file per table
  3. Code book describing each variable and its values in the tidy data set
    • information about the variables (including units) that is NOT contained in the tidy data itself
    • information about the summary choices that were made (e.g. median vs. mean)
    • information about experimental study design (data collection methods)
    • common format for this document = markdown/Word/text
      • “study design” section = thorough description of how data was collected
      • “code book” section = describes each variable and units
  4. Explicit, exact recipe for getting from 1 to 2 and 3 (instruction list)
    • ideally a computer script that needs no parameters
    • output = processed tidy data
    • if the script alone is not enough, also document how to run it and any manual steps, as explicit instructions
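
The tidy data rules and the "script with no parameters" idea above can be sketched as follows. This is only an illustration in Python using the standard library; the raw rows, the column names (subject_id, age_years, weight_kg) and the helper functions are all invented for the example and are not part of these notes.

```python
import csv
import io

# Hypothetical raw measurements in a "wide" layout: one row per subject,
# one weight column per visit. Names and values are invented for illustration.
RAW_ROWS = [
    {"subject_id": "s01", "age_years": "34",
     "weight_kg_visit1": "70.5", "weight_kg_visit2": "69.8"},
    {"subject_id": "s02", "age_years": "41",
     "weight_kg_visit1": "82.0", "weight_kg_visit2": "81.1"},
]


def tidy(raw_rows):
    """Split the raw rows into two tidy tables linked by subject_id."""
    subjects = []      # one row per subject
    measurements = []  # one row per (subject, visit) observation
    for row in raw_rows:
        subjects.append({
            "subject_id": row["subject_id"],
            "age_years": int(row["age_years"]),
        })
        for visit in (1, 2):
            measurements.append({
                "subject_id": row["subject_id"],  # column linking the two tables
                "visit": visit,
                "weight_kg": float(row[f"weight_kg_visit{visit}"]),
            })
    return subjects, measurements


def to_csv(rows):
    """Serialize one table as CSV text with a header row of variable names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The whole recipe runs top to bottom with no parameters.
subjects, measurements = tidy(RAW_ROWS)
subjects_csv = to_csv(subjects)          # would be saved as one file
measurements_csv = to_csv(measurements)  # would be saved as a second file
print(subjects_csv)
print(measurements_csv)
```

Each variable ends up in its own column, each observation in its own row, and the two tables share subject_id so they can be joined again later. In a real project each CSV string would be written to its own file, and the units in the column names would also be recorded in the code book.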

Download files

Reading Excel files

Reading XML

Reading JSON

data.table

Reading from MySQL

HDF5

Web Scraping (tutorial)

Working with API

Reading from Other Sources

dplyr

tidyr

lubridate

Subsetting and Sorting

Summarizing Data

         Admitted  Rejected
Male         1198      1493
Female        557      1278
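
As a rough sketch of how a cross tabulation like the one above can be built, the Python snippet below starts from the same four counts stored in long form (one row per gender/outcome pair) and pivots them into a two-way table with row totals. The `cross_tab` helper and the variable names are assumptions made for this example, not something taken from these notes.

```python
from collections import defaultdict

# The four counts from the table above, in "long" form:
# one row per (gender, outcome) combination.
LONG_COUNTS = [
    ("Male", "Admitted", 1198),
    ("Male", "Rejected", 1493),
    ("Female", "Admitted", 557),
    ("Female", "Rejected", 1278),
]


def cross_tab(rows):
    """Sum counts into a nested dict: table[gender][outcome] -> count."""
    table = defaultdict(lambda: defaultdict(int))
    for gender, outcome, count in rows:
        table[gender][outcome] += count
    # Convert to plain dicts so the result prints and compares cleanly.
    return {gender: dict(outcomes) for gender, outcomes in table.items()}


table = cross_tab(LONG_COUNTS)
row_totals = {gender: sum(outcomes.values()) for gender, outcomes in table.items()}

for gender, outcomes in table.items():
    print(gender, outcomes["Admitted"], outcomes["Rejected"], row_totals[gender])
```

Because the helper adds counts rather than overwriting them, it would also collapse finer-grained rows (for example, several rows per gender) into the same two-way summary.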

Creating New Variables

Reshaping Data

Merging Data

Editing Text Variables

Regular Expressions

Working with Dates

Data Sources