Processing Data

First, we download the training and test datasets and load them with the read.csv function. During exploratory data analysis, I saw that blank values, “NA”, and “#DIV/0!” often appear in the data columns, so I treat all three as NA on import.

# load package
library(caret)
# download data 
if(!file.exists("pml-training.csv")){
    download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", 
        destfile = "pml-training.csv", method = "curl")
}
if(!file.exists("pml-testing.csv")){
    download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", 
        destfile = "pml-testing.csv", method = "curl")
}
# load data
train <- read.csv("pml-training.csv", header = TRUE, na.strings=c("","NA", "#DIV/0!"))
test <- read.csv("pml-testing.csv", header = TRUE, na.strings=c("","NA", "#DIV/0!"))
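
As a quick sanity check (my addition, not part of the original analysis), we can confirm what read.csv loaded; the 19622 rows match the str() output shown later, and the 160 columns match the NA tally in the next step.

# check the raw dimensions of the two data sets
dim(train)
## [1] 19622   160
dim(test)
## [1]    20   160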

The machine learning algorithms we plan to run cannot handle features that contain NA values. To decide which variables/features to use, I calculated the percentage of NAs in each column.

# percentage of NA values in each column
NAPercent <- round(colMeans(is.na(train)), 2)
table(NAPercent)
## NAPercent
##    0 0.98    1 
##   60   94    6

From the table above, we can see that only 60 variables have complete data, so those are the variables we will use to build the prediction model. I also removed the first of these variables because it is the row index from the csv file and not a true variable.

# find index of the complete columns minus the first 
index <- which(NAPercent==0)[-1]
# subset the data
train <- train[, index]
test <- test[, index]
# looking at the structure of the data for the first 10 columns
str(train[, 1:10])
## 'data.frame':    19622 obs. of  10 variables:
##  $ user_name           : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1: int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2: int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp      : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...

From the structure of the data, we can see that the first 6 variables (user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window) are administrative parameters and are unlikely to help predict the activity the subjects are performing, so we leave those 6 columns out before building the models. In addition, to make the columns easier to work with, we convert all feature columns to numeric class.

# subset the data
train <- train[, -(1:6)]
test <- test[, -(1:6)]
# convert all feature columns (every column except the last) to numeric class
for(i in 1:(length(train)-1)){
    train[,i] <- as.numeric(train[,i])
    test[,i] <- as.numeric(test[,i])
}
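
As a quick verification (again my addition), we can confirm the cleaning behaved as intended: the retained columns contain no NAs, and every feature column is now numeric.

# sanity check: no NAs remain and all feature columns are numeric
sum(is.na(train))
## [1] 0
all(sapply(train[, -ncol(train)], is.numeric))
## [1] TRUE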

Cross Validation

For this project, we will focus on two of the most widely used and most accurate prediction algorithms: random forests and generalized boosted regression models (GBM).

We set the test set aside and split the train data into two parts for cross validation: 80% of the observations to train the model and 20% to validate it.

We expect the out-of-bag (OOB) error rate returned by the random forest model to be a good estimate of the out-of-sample error rate. We will also estimate the error rates directly from the accuracy each model achieves on the validation set.

# split train data set
inTrain <- createDataPartition(y = train$classe, p = 0.8, list = FALSE)
trainData <- train[inTrain,]
validation <- train[-inTrain,]
# print out the dimensions of the 3 data sets
rbind(trainData = dim(trainData), validation = dim(validation), test = dim(test))
##             [,1] [,2]
## trainData  15699   53
## validation  3923   53
## test          20   53
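
One caveat: createDataPartition samples at random, so the split above is not reproducible as written. A minimal sketch of a fix, with an arbitrary seed value of my choosing, would set the seed just before partitioning:

# set a fixed seed (value is arbitrary) so the 80/20 split is reproducible
set.seed(12345)
inTrain <- createDataPartition(y = train$classe, p = 0.8, list = FALSE)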

Comparing Models and Results

First, we will use random forest to build a model. Because the algorithm is computationally intensive, we will leverage parallel processing across multiple cores through the doMC package.

# load doMC package 
library(doMC)
# register 8 cores for parallel execution (adjust to your machine)
registerDoMC(cores = 8)
# load randomForest package
library(randomForest)
# run the random forest algorithm on the training data set
rfFit <- randomForest(classe ~ ., data = trainData, proximity = TRUE)
rfFit
## 
## Call:
##  randomForest(formula = classe ~ ., data = trainData, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.43%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   16 3018    4    0    0 0.0065832785
## C    0    8 2725    5    0 0.0047479912
## D    0    0   24 2546    3 0.0104935873
## E    0    0    1    4 2881 0.0017325017
# use model to predict on validation data set
rfPred <- predict(rfFit, validation)
# predicted result
confusionMatrix(rfPred, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    2    0    0    0
##          B    0  756    3    0    0
##          C    0    1  681    6    0
##          D    0    0    0  637    1
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9967          
##                  95% CI : (0.9943, 0.9982)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9958          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9960   0.9956   0.9907   0.9986
## Specificity            0.9993   0.9991   0.9978   0.9997   1.0000
## Pos Pred Value         0.9982   0.9960   0.9898   0.9984   1.0000
## Neg Pred Value         1.0000   0.9991   0.9991   0.9982   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1927   0.1736   0.1624   0.1835
## Detection Prevalence   0.2850   0.1935   0.1754   0.1626   0.1835
## Balanced Accuracy      0.9996   0.9975   0.9967   0.9952   0.9993
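
To state the implied out-of-sample error explicitly, we can extract the accuracy from the confusionMatrix object (a small convenience step not in the original code):

# estimated out-of-sample error rate = 1 - validation accuracy (about 0.0033 here)
rfAccuracy <- unname(confusionMatrix(rfPred, validation$classe)$overall["Accuracy"])
1 - rfAccuracy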

Next, we will try a generalized boosted regression model (GBM), fit through caret's train function.

# run the generalized boosted regression model
gbmFit <- train(classe~., data = trainData, method ="gbm", verbose = FALSE)
gbmFit
## Stochastic Gradient Boosting 
## 
## 15699 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 15699, 15699, 15699, 15699, 15699, 15699, ... 
## 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD   
##   1                   50      0.7481717  0.6807386  0.007517094  0.009517534
##   1                  100      0.8152710  0.7662110  0.004916848  0.006225865
##   1                  150      0.8500706  0.8102767  0.004760952  0.006044945
##   2                   50      0.8519370  0.8124713  0.005877966  0.007440065
##   2                  100      0.9061600  0.8812564  0.004217447  0.005345304
##   2                  150      0.9304904  0.9120490  0.003690489  0.004675461
##   3                   50      0.8956435  0.8678982  0.004276404  0.005406066
##   3                  100      0.9404469  0.9246541  0.004019593  0.005087412
##   3                  150      0.9594901  0.9487596  0.003121579  0.003945901
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3 and shrinkage = 0.1.
# use model to predict on validation data set
gbmPred <- predict(gbmFit, validation)
# predicted result
confusionMatrix(gbmPred, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1095   25    0    3    2
##          B   17  713   24    6    9
##          C    2   21  652   26    6
##          D    2    0    7  606    3
##          E    0    0    1    2  701
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9602          
##                  95% CI : (0.9536, 0.9661)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9497          
##  Mcnemar's Test P-Value : 9.725e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9812   0.9394   0.9532   0.9425   0.9723
## Specificity            0.9893   0.9823   0.9830   0.9963   0.9991
## Pos Pred Value         0.9733   0.9272   0.9222   0.9806   0.9957
## Neg Pred Value         0.9925   0.9854   0.9900   0.9888   0.9938
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2791   0.1817   0.1662   0.1545   0.1787
## Detection Prevalence   0.2868   0.1960   0.1802   0.1575   0.1795
## Balanced Accuracy      0.9852   0.9608   0.9681   0.9694   0.9857

From the above, we can see that randomForest is the better-performing algorithm, with an out-of-bag (OOB) error rate of 0.43%, which is our expectation for the out-of-sample error rate. When applied to the validation set for cross validation, the model achieved an accuracy of 99.7%, implying an actual out-of-sample error rate of about 0.3%, whereas GBM reached an accuracy of 96.0%, an error rate of about 4.0%.
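
The same confusionMatrix extraction gives a compact side-by-side view of the two models (a small convenience addition; the values match the outputs reported above):

# compare validation accuracy and estimated error rate for both models
rfAcc  <- unname(confusionMatrix(rfPred,  validation$classe)$overall["Accuracy"])
gbmAcc <- unname(confusionMatrix(gbmPred, validation$classe)$overall["Accuracy"])
data.frame(model = c("randomForest", "GBM"),
           accuracy = round(c(rfAcc, gbmAcc), 4),
           error = round(c(1 - rfAcc, 1 - gbmAcc), 4))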

Results

We can apply the randomForest model to the 20 given test cases for the final predictions. All 20 predicted results were correct.

# apply random forest model to test set
predict(rfFit, test)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
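
For the course submission, each prediction goes into its own text file. A minimal sketch of one way to do that (the problem_id_N.txt naming is my assumption, not something fixed by the analysis):

# hypothetical helper: write each of the 20 predictions to its own file
answers <- as.character(predict(rfFit, test))
for (i in seq_along(answers)) {
    write.table(answers[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
}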