What is ID3 in a decision tree

Decision tree algorithm ID3

This article is part 2 of 4 in the series of articles, Machine Learning with Decision Tree Techniques.

Engineers are very familiar with decision trees as a way to break products down hierarchically and to create procedural instructions. Data scientists also want to create procedural instructions, but automatically from the data. Applied in this way, decision trees are a form of machine learning: the machine itself is supposed to find a way to assign an object to a class.

The ID3 algorithm

Understanding the ID3 algorithm is worthwhile because it forms the basis of many other decision tree algorithms. With its iterative, recursive approach it is also quite easy to follow, yet its effect should not be underestimated. The procedure can be broken down into three essential steps, of which the first produces the actual effect (with all its advantages and disadvantages):

  Step 1: Select the attribute with the highest information gain
    Look at all attributes (features) of the data set and determine which attribute classifies the data best.
  Step 2: Create a node for that attribute
    If the results under this node are unambiguous (one unique class), store that class in the node and return.
  Step 3: Continue the process recursively
    Otherwise, split the data into n subsets, one for each value of the attribute, and repeat these steps for each subset.
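The three steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical helper names (`entropy`, `information_gain`, `id3`), not a production implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the whole set minus the weighted entropies per attribute value."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [labels[i] for i, row in enumerate(rows) if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Recursive ID3: returns a nested dict tree or a class-label leaf."""
    # Step 2: if the subset is pure (one unique class), return a leaf
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    # Step 1: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # Step 3: split into one subset per attribute value and recurse
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# Toy data: "outlook" perfectly separates the classes, "windy" does not
rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rain",  "windy": True},
        {"outlook": "rain",  "windy": False}]
labels = ["no", "no", "yes", "yes"]
print(id3(rows, labels, ["outlook", "windy"]))
# a tree with one "outlook" node and one pure branch per value
```

On this toy data the algorithm selects "outlook" in step 1, finds both subsets pure in step 2, and never needs the "windy" attribute.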

Information gain - and how to calculate it

The information gain IG(S, A) of an attribute A in the sense of the ID3 algorithm is the difference between the entropy H(S) (see Part 1 of this series of articles: Entropy, a measure of impurity in data) of the entire data set S and the sum of the weighted entropies of the subsets S_v for each individual value v that occurs in the attribute:

IG(S, A) = H(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

How the calculation of the information gain works will be shown in Part 3 of this series of articles (to be published shortly).
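As a quick preview, the formula can be evaluated directly. The numbers below are an assumed illustration using the classic play-tennis data set (9 positive and 5 negative examples, attribute "wind" with values weak and strong):

```python
import math

def H(pos, neg):
    """Binary entropy of a subset with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            result -= p * math.log2(p)
    return result

# Whole set S: 9 positive, 5 negative examples
h_total = H(9, 5)                              # ~0.940
# Attribute "wind": weak -> 6+/2-, strong -> 3+/3-
gain = h_total - (8/14) * H(6, 2) - (6/14) * H(3, 3)
print(round(gain, 3))                          # ~0.048
```

The gain is small here because splitting on "wind" leaves both subsets nearly as impure as the whole set.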

The advantages of the ID3 algorithm - and the disadvantages

The algorithm is the basis for many other algorithms. In its simplicity it has certain advantages - which probably make it the most popular decision tree algorithm - but it also has a number of disadvantages that should be considered.

  • easy to understand and therefore quick to implement
  • a good basis for random forests
  • all attributes play a role, but the tree tends to stay small because the information gain dictates the order of the splits
  • also works (with adjustments) for multi-class problems
  • the order imposed by the information gain does not necessarily produce the best or smallest tree among all possibilities; ID3 is a greedy algorithm and therefore "short-sighted"
  • the search for decision rules is consequently not complete / exhaustive
  • because ID3 keeps growing the tree until the data is explained as unambiguously as possible, overfitting is practically provoked

Pay attention to and avoid overfitting

Decision trees generated from data tend to overfit. This means that a tree can adapt to the training data so closely that it fits perfectly, but no longer provides a generalizing description. New data, which can be more diverse than the training data, is then no longer classified correctly at a reasonable error rate.

Beware of key columns!

Some attributes force overfitting: for example, if an attribute such as "customer ID" (a unique number per customer) is included, we can expect an entropy of 0 for each individual value of the attribute, because every ID describes a unique case (customer, customer group, etc.). It follows that the information gain for this attribute is maximal. The tree would become enormously wide, which is not helpful, because each value (each ID) would get its own branch in the tree leading to a unique result. The tree cannot be used for new data (new customer numbers) because it no longer represents a general description, but is merely an image of the training data.
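This effect is easy to demonstrate numerically. The sketch below uses made-up data: a unique customer ID column reaches the maximum information gain of H(S), while an ordinary attribute does not:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain of splitting `labels` by the parallel attribute column `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for val, l in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

labels      = ["yes", "no", "yes", "no", "yes", "no"]
customer_id = [1, 2, 3, 4, 5, 6]                               # unique per row
region      = ["north", "north", "south", "south", "north", "south"]

# Each ID isolates one row, so every subset is pure: gain equals H(S) = 1.0
print(information_gain(customer_id, labels))   # 1.0, the maximum
print(information_gain(region, labels))        # clearly smaller
```

A greedy ID3 would therefore always split on the ID column first, producing exactly the useless, maximally wide tree described above.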

Pruning - shortening the tree afterwards

Particularly large trees are not good trees and are a sign of overfitting. One way to reduce their size is to recalculate the information gains and shorten branches (generalization) where the gain is too low. Often it is not the entropy or the Gini impurity but the classification error that is used as the measure of impurity here.
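A minimal sketch of this idea, assuming a hypothetical node layout in which each inner node stores its split gain and the majority class of its subset:

```python
def prune(node, min_gain=0.1):
    """Collapse subtrees whose split gain falls below min_gain.

    A node is either a class label (leaf) or a dict of the assumed form
    {"attr": ..., "gain": ..., "majority": ..., "branches": {value: child}}.
    """
    if not isinstance(node, dict):
        return node                        # already a leaf
    if node["gain"] < min_gain:
        return node["majority"]            # replace weak subtree by its majority class
    node["branches"] = {v: prune(child, min_gain)
                        for v, child in node["branches"].items()}
    return node

# A hypothetical subtree whose split barely separates the classes:
weak = {"attr": "windy", "gain": 0.05, "majority": "yes",
        "branches": {True: "no", False: "yes"}}
print(prune(weak))   # the whole subtree becomes the leaf "yes"
```

The same traversal works with the classification error instead of the information gain as the pruning criterion; only the stored score changes.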

Random forests as an overfitting panacea

Random forests reach a collective decision on class membership based on several decision trees. This kind of "democratic" machine learning is called ensemble learning. If several decision trees with different structures are used for a common classification, the overfitting effect of the individual trees is usually reduced.
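The two core mechanisms, bootstrap sampling and majority voting, can be sketched as follows; the stand-in "trees" are hypothetical placeholders for ID3 trees grown on different bootstrap samples:

```python
import random
from collections import Counter

def bootstrap(rows, labels, rng):
    """Draw a bootstrap sample (with replacement) for training one tree."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def forest_predict(trees, sample):
    """Let every tree vote and return the majority class."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical stand-in "trees" (in a real forest: ID3 trees grown
# on different bootstrap samples, often with random feature subsets):
trees = [lambda s: "yes", lambda s: "no", lambda s: "yes"]
print(forest_predict(trees, {"outlook": "sunny"}))  # yes
```

Because each tree sees slightly different data, the trees make different mistakes, and the majority vote smooths out the overfitting of any single tree.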

Benjamin Aunkofer

Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.

Tags:Data Mining, Data Science, Decision Science, Decision Tree, ID3, Machine Learning