The goal of this project is to use machine learning to predict the manner in which barbell lifts were performed (correctly or incorrectly, in five different ways ["classe" = A, B, C, D, E]). Data was collected from six participants using accelerometers on the belt, forearm, arm, and dumbbell. Given the nature of the data, two classification methods were applied and tested: a Random Forest (an ensemble of classification trees, pre-processed using Principal Component Analysis) and an SVM (with normalized data). The Random Forest-PCA model yielded a 99.4% accuracy rate when tested on the validation set. The model was then used to predict the manner in which 20 individuals may have performed barbell lifts.
The data source used for this machine learning project is the "Weightlifting Exercise Dataset" (Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013). The original data is available via Groupware@LES and licensed under CC BY-SA.
The following dependencies are required in order to perform data processing, exploration, and modeling:
require("caret")
require("e1071")
require("randomForest")
require("gbm")
require("Hmisc")
require("corrplot")
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
There are 19622 observations and 160 variables in the training set. There are 20 observations and 160 variables in the testing set.
Remove columns that don’t relate to accelerometer measurements. Specifically, the first seven columns are metadata that describe the participants and windows in which measurements were taken (e.g. index, name, timestamps, window).
training <- training[,-c(1:7)]
testing <- testing[, -c(1:7)]
Identify which variables contain "NA" observation values in the training and testing sets.
trainingNA <- colnames(training)[colSums(is.na(training)) > 0]
testingNA <- colnames(testing)[colSums(is.na(testing)) > 0]
There are 67 variables that contain NA values in the training set. There are 100 variables that contain NA values in the testing set.
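The detection above works because `is.na()` returns a logical matrix and `colSums()` counts the `TRUE` values per column. A minimal sketch on a small hypothetical data frame:

```r
# Toy data frame: colSums(is.na(.)) counts NA values per column,
# so comparing against 0 flags any column with missing data.
df <- data.frame(
  x = c(1, 2, NA, 4),    # partially missing
  y = c(NA, NA, NA, NA), # entirely missing
  z = c(5, 6, 7, 8)      # complete
)
colnames(df)[colSums(is.na(df)) > 0]  # "x" and "y" are flagged; "z" is not
```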
Identify which variables contain all “NA” values, if any.
trainingNAall <- colnames(training)[colSums(is.na(training)) == nrow(training)]
testingNAall <- colnames(testing)[colSums(is.na(testing)) == nrow(testing)]
There are no variables that contain all “NA” values in the training set. There are 100 variables that contain all “NA” values in the testing set.
Identify the variables that have all NA values in the testing set and remove them from both the training and testing sets. In addition, remove any observations in the testing and training sets that contain NA values, which may adversely affect modeling.
removeTraining <- names(training) %in% testingNAall
cleanTraining <- training[!removeTraining]
training <- na.omit(cleanTraining)
removeTesting <- names(testing) %in% testingNAall
cleanTesting <- testing[!removeTesting]
testing <- na.omit(cleanTesting)
There are now 19622 complete observations and 53 variables in the training set. There are now 20 complete observations and 53 variables in the testing set.
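The cleaning steps above can be sketched on a small hypothetical data frame: drop the columns flagged as all-NA, then keep only complete rows.

```r
# Toy data frame with one all-NA column and one partially missing column
df <- data.frame(a = 1:4, b = NA_real_, c = c(10, NA, 30, 40))

allNA <- colnames(df)[colSums(is.na(df)) == nrow(df)]  # identifies "b"
df <- df[, !(names(df) %in% allNA), drop = FALSE]      # drop all-NA columns
df <- na.omit(df)                                      # drop rows with remaining NAs
df  # 3 complete rows, columns "a" and "c"
```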
For reproducibility, set a random seed, then split the training set into a training set (75%) and a validation set (25%).
set.seed(12345)
inTrain <- createDataPartition(y=training$classe, p=0.75, list=F)
trainset <- training[inTrain,]
validation <- training[-inTrain,]
training <- trainset
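A minimal base-R sketch of why the seed matters: the same seed produces the same random indices, so the partition is reproducible. (Note that `createDataPartition` additionally stratifies the split by the "classe" levels, which plain `sample()` does not.)

```r
# Same seed, same split: the two index vectors are identical
set.seed(12345)
idx1 <- sample(seq_len(100), size = 75)
set.seed(12345)
idx2 <- sample(seq_len(100), size = 75)
identical(idx1, idx2)  # TRUE
```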
Given the high-dimensional nature of the dataset, create a correlation matrix to identify possibly correlated variables.
corrplot(cor(training[, c(1:52)]), method ="circle", type="lower", tl.cex = 0.5)
The plot of the correlation matrix highlights several strong positive and negative correlations, suggesting that dimension reduction, and in turn model improvement, can be achieved through Principal Component Analysis or Singular Value Decomposition.
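A sketch on simulated data illustrates why correlated predictors make PCA attractive: when columns are strongly correlated, most of the variance concentrates in the first few principal components, so fewer dimensions are needed before model fitting.

```r
# Simulated data: x2 is nearly collinear with x1, x3 is independent
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)
x3 <- rnorm(200)
pca <- prcomp(cbind(x1, x2, x3), center = TRUE, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]
# PC1 carries most of the variance shared by x1 and x2
```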
Given that the predicted variable is a discrete multinomial variable, either a tree-based model (a Classification & Regression Tree ensemble, in this case a Random Forest) or a Support Vector Machine (SVM) would be appropriate.
# randomForest() has no trControl argument, and trainControl's method sets the
# resampling scheme, not pre-processing; PCA pre-processing is specified via
# caret::train() with preProcess = "pca"
modrf <- train(classe ~ ., data = training, method = "rf", preProcess = "pca",
               trControl = trainControl(method = "cv", number = 5))
modsvm <- svm(classe ~ ., data=training)
Now that the models have been fitted, both can be tested on the validation set before being applied to the testing set.
predrf <- predict(modrf, validation)
predsvm <- predict(modsvm, validation)
Confusion matrices can be constructed using the predictions generated by the models on the validation set. The accuracy levels can be extracted and multiplied by 100 to generate a predictive accuracy percentage rate.
confusionMatrix(predrf, validation$classe)$overall[1] * 100
## Accuracy
## 99.42904
confusionMatrix(predsvm, validation$classe)$overall[1] *100
## Accuracy
## 94.39233
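The accuracy figure that `confusionMatrix` reports is simply the proportion of predictions falling on the diagonal of the confusion matrix. A minimal sketch with hypothetical predictions:

```r
# Hypothetical truth/prediction vectors: accuracy is the share of
# diagonal (correct) entries in the confusion table
truth <- factor(c("A", "A", "B", "B", "C"))
pred  <- factor(c("A", "A", "B", "C", "C"), levels = levels(truth))
tab <- table(pred, truth)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy * 100  # 80: four of the five predictions are correct
```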
Based on accuracy rates, both models perform well; however, the accuracy rate of the Random Forest (PCA) model is superior. It can be applied to predict the classes in the testing dataset with a high level of confidence.
The expected out-of-sample error rate can be estimated on the validation dataset by subtracting the accuracy from 1 and multiplying the result by 100.
accuracy <- postResample(predrf, validation$classe)
error <- 1 - as.numeric(confusionMatrix(predrf, validation$classe)$overall[1])
error * 100
## [1] 0.5709625
The expected out-of-sample error rate from the Random Forest (PCA) model is approximately 0.57%.
Using the Random Forest (PCA) model, classes can be predicted for each of the 20 observations in the testing data set.
predict(modrf, testing)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E