
Machine Learning in R Using Decision Trees

____________________________________________

As a graduate student, I feel driven to continually apply my foundation of knowledge to ever more complex tasks. Sometimes, these tasks take the form of learning something new. During the Fall semester of 2018, I was challenged in my Data Mining class to use my basic knowledge of the coding language R to create an example of machine learning.


Machine learning is a field of computer science in which computers learn to predict outcomes from an initial data set. Within machine learning, the initial data set is split into training and test sets, with the training set generally making up the majority of the data. The training set is fed into algorithms that repeatedly pass through the data to discover trends and make predictions about the variable of interest. The remaining test set is then used to measure the accuracy of the algorithm.

 

To apply this exercise to my field of expertise, I chose to analyze a dataset of fetal cardiotocography (heartbeat) data, and I specifically chose decision trees as the analysis method. Decision trees are a simple, visual form of machine learning that use the concept of purity, or homogeneity, to break information down into like classes. Ultimately, the goal of a decision tree is to define and predict a variable of interest: in our case, we want to determine the trends that help us classify NSP. NSP is the fetal heart state variable that describes whether the recorded heart sounds are normal (N), suspect (S), or pathological (P), encoded within the dataset as 1, 2, and 3, respectively.
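
As a rough illustration of that setup, the NSP codes can be mapped to readable labels before any modeling is done. This is only a sketch: the file name "CTG.csv" and the column name "NSP" are assumptions based on the description above.

# Read the cardiotocography data; the file name is an assumption.
ctg <- read.csv("CTG.csv")

# Recode NSP (1/2/3) as a labeled factor so the tree treats it as a
# classification target rather than a numeric outcome.
ctg$NSP <- factor(ctg$NSP, levels = c(1, 2, 3),
                  labels = c("Normal", "Suspect", "Pathological"))

table(ctg$NSP)   # class counts as a quick sanity check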

 

Starting with the whole dataset, a decision tree proceeds in steps, breaking the data down into subsets called "branches." These branches end in "leaves," which are formed when homogeneity is reached, when a class is chosen to represent the subset, or when there is no more data to split on. A number of measures can be used to quantify homogeneity and therefore build a decision tree; most R functions use the Gini index.
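
For reference, the Gini index of a node is 1 minus the sum of the squared class proportions: a perfectly pure node scores 0, and an evenly mixed node scores close to its maximum. A minimal sketch of that calculation in R:

# Gini impurity of a vector of class labels: 1 - sum(p_i^2).
gini <- function(classes) {
  p <- table(classes) / length(classes)   # proportion of each class
  1 - sum(p^2)
}

gini(c("N", "N", "N", "N"))   # 0     -> perfectly pure node
gini(c("N", "N", "S", "P"))   # 0.625 -> highly mixed node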

 

To begin, the cardiotocography data was imported as a .csv file and randomly split into subsets representing 70% and 30% of the data for training and testing, respectively. Numerical variables were then compared against our variable of interest, NSP, in a regression analysis. Following this, an initial decision tree was created with the "rpart" function and visualized with "rpart.plot."
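
A sketch of that workflow, assuming the data frame ctg from the earlier snippet with NSP stored as a factor; the variable names here are illustrative, not the exact ones from the original project.

library(rpart)
library(rpart.plot)

set.seed(42)  # make the random split reproducible

# Randomly assign 70% of the rows to training and 30% to testing.
n         <- nrow(ctg)
train_idx <- sample(n, size = round(0.7 * n))
train     <- ctg[train_idx, ]
test      <- ctg[-train_idx, ]

# Fit a classification tree predicting NSP from all other variables.
fit <- rpart(NSP ~ ., data = train, method = "class")

# Visualize the fitted tree.
rpart.plot(fit)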


After the decision tree was constructed, its accuracy was tested using confusion matrices. A confusion matrix compares predicted classes against actual classes: it counts how often an observation was correctly predicted as a class of NSP and correctly predicted as NOT that class (true positives and true negatives, respectively), and how often the prediction was wrong (false positives and false negatives). Accuracy is then the sum of the true positives and true negatives divided by the total number of predictions: (TP + TN) / (TP + TN + FP + FN).
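
Continuing the sketch above, the confusion matrix and accuracy on the held-out test set could be computed along these lines:

# Predict NSP classes for the test set.
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix: rows = predicted class, columns = actual class.
conf <- table(Predicted = pred, Actual = test$NSP)
conf

# Accuracy: correct predictions (the diagonal) over all predictions,
# i.e. (TP + TN) / (TP + TN + FP + FN) in the two-class case.
accuracy <- sum(diag(conf)) / sum(conf)
accuracy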


Besides testing for accuracy, code was written to examine the decision tree after bagging and pruning. Pruning limits the size of the decision tree, either by setting a "stop" or complexity parameter during construction or by cutting back branches afterwards, in order to prevent overfitting. Bagging, on the other hand, is a sampling strategy that improves the robustness of the decision tree. Short for "bootstrap aggregating," bagging takes repeated random samples of the training data, builds a decision tree on each sample, and combines their predictions, typically by majority vote for classification.
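
A sketch of both steps, building on the fit, train, and test objects from the earlier snippets. The ipred package is one common way to bag rpart trees; it is used here as an assumption, not necessarily the package from the original project.

library(ipred)   # provides a bagging() implementation built on rpart

# Pruning: cut the tree back at the complexity parameter (cp) with the
# lowest cross-validated error to reduce overfitting.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Bagging: fit many trees on bootstrap samples of the training data and
# aggregate their predictions by majority vote.
set.seed(42)
bag_fit  <- bagging(NSP ~ ., data = train, nbagg = 25)
bag_pred <- predict(bag_fit, newdata = test)

# Compare test-set accuracy of the pruned and bagged models.
mean(predict(pruned, newdata = test, type = "class") == test$NSP)
mean(bag_pred == test$NSP)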


Besides creating decision trees in R, this dataset was also used to create decision trees in Python. The construction process in Python is very similar, but it requires different steps and functions because of differences in how the two languages work. Results for the Python decision tree are displayed below, next to the results from R.


Results of the decision trees are organized below. Shown first is the original decision tree constructed in R without bagging or pruning. Under that is a decision tree constructed under the same parameters in Python. Lastly, at the bottom and in the center is a table outlining accuracy, error rate, and precision for the R decision tree with no adjustments, after bagging, and after pruning. Error rate is simply 1 − accuracy, and precision is the number of true positives divided by all positive predictions: TP / (TP + FP).
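
For completeness, here is how those two metrics could be derived from the confusion matrix in the earlier sketch, treating "Pathological" as the positive class purely for illustration:

# Error rate: the share of predictions that were wrong.
error_rate <- 1 - accuracy

# Precision for the "Pathological" class: of everything predicted as
# Pathological, how much actually was.
TP <- conf["Pathological", "Pathological"]   # predicted P, actually P
FP <- sum(conf["Pathological", ]) - TP       # predicted P, actually N or S
precision <- TP / (TP + FP)
precision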


Interestingly, each decision tree shows different results. Although the trees were built under the same parameters, the discrepancy is most likely due to the way each program constructs its decision trees and determines splits for branches. It could also be due to very small differences in the Gini indices of certain variables. This, in combination with the randomized training and test subsets, could account for the differing results seen below.


In all, I thoroughly enjoyed this project. Through my Data Mining class, I was able to apply my knowledge of the R coding language to solve new and interesting problems. This kind of investigation is paramount as our world becomes increasingly shaped by technology, automation, and advances in data science. My experiences as an engineer and a tinkerer give me unique insight into the problem-solving processes required to get the most out of coding languages and out of the trends revealed in the data.


If you want to learn more about my experiences with the statistics-based coding language R, or about how my training as an engineer has helped me to interpret its results, feel free to contact me via email on the home page.

[Image: CTGDecTree.png]
[Image: DecTree1.JPG]
[Image: CTGDecTreeResNew.JPG]