Decision trees are used extensively in machine learning because they are easy to use, easy to interpret, and easy to operationalize. KD Nuggets, one of the most respected sites for data science and machine learning, recently published an article that identified decision trees as a top 10 algorithm for machine learning. (1)
Besides being easy to understand, use and operationalize, most decision tree algorithms contain very usefulfeatures, such as:
If you are new to machine learning, some of these concepts may be unfamiliar. The goal of this blog is to provide you with the basics of decision trees using Talend and Apache Spark.
If you want to learn more about advanced analytics, please see the references section below.(2) Some of the concepts presented here are sourced from that material.
In this post Ill be referring to a few terms that warrant a brief explanation:
Decision treeshave several implementations. The three most common are:
Of the three implementations, CART is the most popular. The name refers to the fact that CART can be used when the target variable is expressed either as a finite list, such as age or hair color (classification), or is continuous, such as annual precipitation (regression). * Note that it is the target variable that determines the type of tree.
Below, Ill demonstrate a simple classification tree using data well known to the machine learning community. The kyphosis data set (3) reports on children who, having had corrective spinal surgery, exhibited kyphosis post-surgery. Below is a sampling of the kyphosis data set which was used to train a decision tree model:
The feature variables are 1. Age, 2. Number, and 3. Start. The target variable is indicated by Kyphosis. It is a simple category that has two levels; absent and present.
Below is a visual depiction of a classification tree trained using the kyphosis data set.
Key components of a decision tree (4):
Decision trees are built using a recursive partitioning algorithm that is easily described using the following pseudo code (2):
The concept of best split is pertinent to how decision trees are built. Best split refers to node purity (P) which is a measure of its homogeneity (information gain) or heterogeneity (entropy).
Using the root node as an example:
Root Node: P (Kyphosis = absent) = 47/60 = 78% pure on the class Kyphosis = absent.
With this information, the decision tree algorithm then iterates through all subgroups to find the next best fit. Here, the feature Start where its value is greater than or equal to 12, was chosen as the best split.
Internal Node 1 Left Branch: P (Kyphosis = absent) = 32/34 = 94% pure on the class Kyphosis = absent. The 57% is the percentage of observations used at this node.
Node pureness is used by the decision tree algorithm as input to much more involved calculations called entropy and information gain.
However, the goal of information gain is to compare the purity of the parent node before the split with the degree of purity after the split. The feature with the greatest information gain is considered the most informative and it is used at the split. (4) In the example above, Start >= 12 is the most informative feature for the first internal node.
As mentioned earlier, decision trees are intuitive and easy to explain. They are also very easy to implement using any programming language that supports conditional statements. Using our decision tree, lets define a few rules for the conditional logic:
Implementing a decision tree model is conceptually straight forward. However, training, testing, and operationalizing a model is not without some challenges.
One such challenge is the maintenance and upkeep of the model. The learning part of machine learning alludes to the fact that models need to evolve with the data. This necessitates model re-training and depending on the business use case can occur quite frequently.
To maximize efficiency, model training, re-training, and operationalization needs to be automated and Talend makes this easy.
Talend provides a comprehensive eco-system of tools and technologies to facilitate the integration of machine learning into data integration work flows in a continuous and automated way. This enables organizations to focus more on business outcomes and realize a faster time to value for their most valuable asset; their data.
But dont just take my word for it. See for yourself!Get Started Right Now with Our Hands-on Tutorial:Machine Learning 101 Decision Trees
As always, I am interested in hearing your thoughts on this topic and encourage your feedback.
 Le, J. (August, 2016). The 10 Algorithms Machine Learning Engineers Need to Know. Accessed from URL
 Abbott, D. (2014). Applied Predictive Analytics. Indianapolis: Wiley
 John M. Chambers and Trevor J. Hastie eds. (1992)Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA 1992. Kyphosis data set
 Diertrich, D. Heller B, Yang, B. (2015). Data Science and Big Data Analytics. Indianapolis: Wiley