Information Gain, Gain Ratio and Gini Index
Information Gain, Gain Ratio and Gini Index are the three fundamental criteria to measure the quality of a split in a Decision Tree.

In this blog post, we attempt to clarify the above-mentioned terms, understand how they work, and compose a guideline on when to use which.

In fact, these three are closely related to each other. Information Gain, which is also known as Mutual Information, is derived from Entropy, a concept that comes from Information Theory. Gain Ratio complements Information Gain and was born to deal with its predecessor's major weakness. Gini Index, on the other hand, was developed independently: its initial purpose was to assess the income dispersion of countries, but it was later adapted to work as a heuristic for splitting optimization.


Before diving into details, it helps to elaborate on the definition of Entropy.

Information Entropy

Information Entropy, or just Entropy, is a measurement of the uncertainty in data. In the context of Classification Machine Learning, Entropy measures the diversification of the labels.

  • A low Entropy indicates that the data labels are quite uniform.
    E.g. suppose a dataset has 100 samples, of which 1 is labeled Positive and 99 are labeled Negative. In this case, the Entropy is very low.
    In the extreme case, suppose all 100 samples are Positive; then the Entropy is at its minimum, namely zero.
  • A high Entropy means the labels are in chaos.
    E.g. a dataset with 45 Positive samples and 55 Negative samples has a very high Entropy.
    In the extreme case, the Entropy is at its maximum when exactly half of the data belongs to each label.

From another point of view, the Entropy measures how hard it is to guess the label of a sample taken randomly from the dataset. If most of the data share the same label, say Positive, the Entropy is low, and we can bet with confidence that the label of a random sample is Positive. On the flip side, if the Entropy is high, the probabilities of the sample falling into each of the classes are comparable, making it hard to guess.

The formula of Entropy is given by:

H = -(\sum p_i \log_2{p_i})

where p_i is the proportion of class i in the dataset.

For example, a dataset with 30 Positive and 70 Negative samples has its Entropy:

\begin{aligned}H &= -(0.3 * \log_2{(0.3)} + 0.7 * \log_2{(0.7)}) \\    &\approx 0.88 \end{aligned}

A small issue with this formula is that \log{(0)} is undefined. Thus, when a class has no samples at all, we would have trouble computing the Entropy. For this case, we define p_i \log{p_i} = 0. This convention makes sense since \lim_{x \to 0} x \log{(x)} = 0. Also, if a class does not occur, it does not contribute to the disorder of the data, hence does not affect the Entropy.
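To make the definition concrete, here is a minimal sketch of the Entropy computation in Python (the function name entropy is our own choice, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon Entropy (base 2) of a list of class labels."""
    n = len(labels)
    # A class with count 0 never appears in the Counter,
    # which matches the convention p * log2(p) = 0 for p = 0.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# 30 Positive / 70 Negative, as in the example above
data = ["P"] * 30 + ["N"] * 70
print(round(entropy(data), 2))  # 0.88

# A pure dataset has the minimum Entropy of zero
print(entropy(["P"] * 100))
```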

On a side note, it is natural to wonder why the Entropy has this formula. Entropy is a measurement borrowed from Information Theory, or to be more specific, Data Compression. The Entropy of a dataset that contains words indicates the average number of bits needed to encode each word of the document. For example, suppose there is a document formed by 4 distinct words, with their proportions being (0.2, 0.1, 0.4, 0.3). The Entropy of this document is, as calculated by the above formula, H \approx 1.85. If the document's length (the number of words) is n, say 30, then the approximate number of binary bits needed to encode it is n*H \approx 30 * 1.85 \approx 55.5. More on this can be found in Shannon's source coding theorem. As the number of bits needed to encode a word is at most (approximately) \log_2{k}, where k is the number of distinct words, the Entropy is also bounded above by \log_2{k}.

Information Gain

We know what the Entropy is and how to compute it. Now it’s time to move on to the splitting criteria. The first one to be examined is Information Gain (IG).

The idea of IG is simple: the more the Entropy is reduced after splitting (that is, the clearer the dataset becomes after splitting, or in other words, the more information is gained by the split), the higher the Information Gain.

Let’s take an example.

Suppose we have a dataset of our last 100 days which records if we go outside to play or not. Positive (P) means we do go outside, while Negative (N) means we stay at home to study Data Mining.

a dataset with 30 Positive and 70 Negative samples.

The Entropy of our initial dataset is:

\begin{aligned}H &= - (0.3 * \log_2 0.3 + 0.7 * \log_2 0.7) \\    &\approx 0.88 \end{aligned}

We want to make use of this dataset to build a Classification Tree that can predict if we will go out given predictor variables (e.g. the weather, is the day a weekday or weekend, the Air quality index).

To make a split with Information Gain, we need to measure how much information is gained if each of the predictor variables is used for splitting.

Firstly, let’s try using the Weather:

split by Weather, the resulting 2 datasets: one with 25 P and 10 N, the other with 5 P and 60 N

The Entropies of the resulting 2 sub-datasets are:

\begin{aligned}H_{Weather = Sunny} &= - (\frac{25}{35} * \log_2\frac{25}{35} + \frac{10}{35} * \log_2 \frac{10}{35}) \\    &\approx 0.86 \end{aligned}

\begin{aligned}H_{Weather = Rainy} &= - (\frac{5}{65} * \log_2\frac{5}{65} + \frac{60}{65} * \log_2 \frac{60}{65}) \\    &\approx 0.39 \end{aligned}

The Information Gain of a split equals the original Entropy minus the weighted sum of the sub-entropies, with the weights equal to the proportion of data samples being moved to the sub-datasets.

IG_{split} = H - (\sum \frac{|D_j|}{|D|} * H_{j})

where:

  • D is the original dataset.
  • D_j is the j-th sub-dataset after being split.
  • |D| and |D_j| are the numbers of samples belonging to the original dataset and the sub-dataset, respectively.
  • H_{j} is the Entropy of the j-th sub-dataset.

To illustrate, the Information Gain using Weather is:

\begin{aligned}IG_{Weather} &= H - (\sum \frac{|D_j|}{|D|} * H_{j})  \\                          &= H - (\frac{35}{100} H_{Weather = Sunny} + \frac{65}{100} H_{Weather = Rainy}) \\                          &\approx 0.88 - 0.55 \\                          &\approx 0.33\end{aligned}

The splitting Information Gain of every predictor variable is computed in the same way as above; the one with the highest value is then chosen to make the actual split.
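The computation above can be sketched in a few lines of Python; information_gain takes the parent's labels and the label lists of the sub-datasets (the function names are illustrative, not from any library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    # IG = H(D) - sum_j (|D_j| / |D|) * H(D_j)
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Weather example: Sunny -> 25 P / 10 N, Rainy -> 5 P / 60 N
sunny = ["P"] * 25 + ["N"] * 10
rainy = ["P"] * 5 + ["N"] * 60
ig = information_gain(sunny + rainy, [sunny, rainy])
print(round(ig, 3))  # about 0.325; the text's 0.33 comes from rounding intermediate entropies
```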

Although very useful, Information Gain has an undesired characteristic: it favors predictor variables with a large number of values. Such highly branching predictors are likely to split the data into many subsets with low Entropy values. An extreme example is an ID code, which is unique for every sample.

the ID code splits the data into n chunks, with 1 sample in each chunk

The disadvantages of these splits are:

  • Making the model more prone to over-fitting.
  • The number of nodes in the tree may be very large.

To address this issue, an adjusted version of Information Gain, called Gain Ratio, was introduced.

Gain Ratio

Gain Ratio attempts to lessen the bias of Information Gain on highly branched predictors by introducing a normalizing term called the Intrinsic Information.

The Intrinsic Information (II) is defined as the Entropy of the sub-dataset proportions. In other words, it measures how hard it is to guess which branch a randomly selected sample falls into.

The formula of Intrinsic Information is:

II = - (\sum \frac{|D_j|}{|D|} * \log_2\frac{|D_j|}{|D|})

In the above example of splitting by Weather, the Intrinsic Information is:

\begin{aligned}II &= - (\frac{35}{100} * \log_2 \frac{35}{100} + \frac{65}{100} * \log_2 \frac{65}{100}) \\    &\approx 0.93\end{aligned}

The Gain Ratio is:

Gain Ratio = \frac{\text{Information Gain}}{\text{Intrinsic Information}}

Plugging in the example above:

\begin{aligned}\text{Gain Ratio}_\text{ Weather} &\approx \frac{0.33}{0.93} \\                                                             &\approx 0.35\end{aligned}

For all the predictor variables, the one that gives the highest Gain Ratio is chosen for the split.
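The whole Gain Ratio computation can be sketched as follows (helper names are our own, not from any library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, subsets):
    n = len(parent)
    weights = [len(s) / n for s in subsets]
    # Information Gain of the split
    ig = entropy(parent) - sum(w * entropy(s) for w, s in zip(weights, subsets))
    # Intrinsic Information: Entropy of the sub-dataset proportions
    ii = sum(-w * math.log2(w) for w in weights)
    return ig / ii

# Weather example: Sunny -> 25 P / 10 N, Rainy -> 5 P / 60 N
sunny = ["P"] * 25 + ["N"] * 10
rainy = ["P"] * 5 + ["N"] * 60
print(round(gain_ratio(sunny + rainy, [sunny, rainy]), 2))  # 0.35
```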

Gini Index

The last measurement is the Gini Index, which is derived separately from a different discipline. As stated in the opening section of this post, the Gini Index (or Gini Coefficient) was first introduced to measure the wealth distribution of a nation's residents.

The Gini of a dataset is:

\text{Gini} = 1 - (\sum p_i^2)

where p_i is the proportion of label i in the dataset.

The Gini of the above original dataset is:

\begin{aligned}\text{Gini} (D) &= 1 - (0.3^2 + 0.7^2) \\                          &= 0.42\end{aligned}

The Gini of a split is computed by:

\text{Gini}_\text{split} = \sum \frac{|D_j|}{|D|}\text{Gini}_j

where \text{Gini}_j is the Gini of the j-th sub-dataset.

For the above example with Weather:

\begin{aligned}\text{Gini}_\text{split = Weather} &= \frac{35}{100} *  \text{Gini}_\text{Sunny} + \frac{65}{100} *  \text{Gini}_\text{Rainy} \\                                                             &\approx 0.35 * 0.41 + 0.65 * 0.14\\                                                             &\approx 0.2345\end{aligned}

For all the predictors, the one that generates the lowest Gini split is chosen.
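The Gini computation is the simplest of the three; a minimal sketch (with function names of our own choosing):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    # Gini of a split: weighted sum of the sub-dataset Ginis
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)

# Weather example: Sunny -> 25 P / 10 N, Rainy -> 5 P / 60 N
sunny = ["P"] * 25 + ["N"] * 10
rainy = ["P"] * 5 + ["N"] * 60
print(round(gini(sunny + rainy), 2))         # 0.42
print(round(gini_split([sunny, rainy]), 3))  # about 0.235 (0.2345 in the text uses rounded sub-Ginis)
```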

Comparison

In theory:

  • Information Gain is biased toward highly branching features.
  • Gain Ratio, as a result of the Intrinsic Information term, prefers splits in which some partitions are much smaller than the others.
  • Gini Index is balanced around 0.5, while the Entropy penalizes small proportions more than large ones.

Below is a plot from ClementWalter on StackExchange comparing how Information Gain and Gini Index penalize according to proportions. The comparison is based on Binary Classification with values being normalized.

Normalised Gini and Entropy criteria
Gini’s penalty scheme is symmetric around 0.5, while Entropy penalizes small proportions harder.
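The shape of that plot can be checked numerically. Below is a small sketch for the binary case; Gini is divided by its maximum of 0.5 so that both criteria peak at 1 (helper names are our own):

```python
import math

def entropy_binary(p):
    """Entropy of a binary dataset with Positive proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini_binary(p):
    """Gini impurity of a binary dataset with Positive proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

# For small p, the Entropy curve sits above the normalized Gini curve,
# i.e. Entropy penalizes small proportions harder.
for p in (0.1, 0.3, 0.5):
    print(p, round(entropy_binary(p), 3), round(gini_binary(p) / 0.5, 3))
```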

In practice, surprisingly, the split measurements perform quite similarly. As Laura and Kilian pointed out in their paper, Information Gain and Gini Index disagree with each other in only 2% of cases, so it is really hard to say which one is better.

In terms of time efficiency, it is apparent from the formulas that Gini is faster to compute than IG, since it involves no logarithms, and IG in turn is faster than Gain Ratio, which requires an extra entropy-like term. This explains why the Gini Index is usually the default choice in many implementations of the Decision Tree.


References:

  • Wikipedia’s page about Entropy: link
  • Brilliant’s page about Entropy: link
  • Wikipedia’s page about the Shannon source coding theorem: link
  • Notre Dame University’s slides about Information Gain: link
  • Wikipedia’s page about the Gini coefficient: link
  • Hong Kong University’s slides about Gain Ratio: link
  • Theoretical comparison between the Gini Index and Information Gain criteria: link
  • A question on StackExchange about Gini versus IG: link
