Decision Trees, Visually

Forget the formulas for a moment. Watch a handful of real data points act out Gini impurity, entropy, and information gain — and see exactly why a decision tree splits the way it does.

Here are 20 loan applicants. A bank must decide: approve or deny? Each dot is one applicant — hover to see their income, credit score, and employment.

Legend: Approved · Denied


The intuition

What does “impurity” mean?

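Pick an applicant, then guess their decision by drawing an applicant at random from the same group and copying that applicant's outcome. The draw is with replacement, so the same applicant can come up twice. How often are you wrong? The messier (more impure) the group, the more often your guess misses.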


That long-run chance of guessing wrong is the Gini impurity: 1 − Σpᵢ² = 0.48. Run it enough times and the wrong-rate settles right onto 0.48.
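
The guessing game above can be sketched as a short simulation. This is a minimal sketch, not the page's own code: it assumes the 20 applicants split 12 to 8 (which is what a class share of 0.60 implies), uses placeholder labels "A" and "B" because which class is the majority is not stated here, and the function names are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def wrong_rate(labels, trials, seed=0):
    """Play the guessing game `trials` times and return the share of misses."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        target = rng.choice(labels)  # the applicant whose decision we guess
        guess = rng.choice(labels)   # random draw, with replacement
        wrong += target != guess
    return wrong / trials

# Assumed class counts: 12 of one class, 8 of the other (placeholder labels).
labels = ["A"] * 12 + ["B"] * 8

print(round(gini(labels), 2))       # 0.48
print(wrong_rate(labels, 100_000))  # close to 0.48
```

Drawing with replacement matters: if the guess had to come from a different applicant, the long-run miss rate for this group would be slightly higher than 0.48.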

Measuring mess · Gini

Gini impurity, one step at a time

Read off each class’s share of the same 20 applicants, then ask: if you picked two at random, how often would they disagree?

Start with the whole group: 20 applicants, a mix of approvals and denials.
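
As a worked check, the "two at random disagree" reading and the formula give the same number. The shares 0.60 and 0.40 are the ones used elsewhere on this page; two draws disagree when the first lands in one class and the second in the other, in either order:

P(\text{disagree}) = 2 \times 0.60 \times 0.40 = 0.48

G = 1 - (0.60^2 + 0.40^2) = 1 - 0.52 = 0.48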


Measuring mess · Entropy

Entropy: counting bits of surprise

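Think of surprise as the number of yes/no questions you’d need to guess an outcome: an outcome with probability p carries log₂(1/p) bits, so rare outcomes are more surprising. Entropy is the average surprise over all applicants.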

The same 20 applicants, now measured as “surprise”. Pick one at random: how surprised are we by what we observe?
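
The surprise view of entropy can be sketched in a few lines. This is an illustrative sketch assuming class shares of 0.60 and 0.40 and the standard definition of surprise as log2(1/p); the function names are hypothetical.

```python
import math

def surprise_bits(p):
    """Surprise of an outcome with probability p, in bits."""
    return math.log2(1 / p)

def entropy_bits(shares):
    """Entropy: the average surprise, weighted by how often each class occurs."""
    return sum(p * surprise_bits(p) for p in shares if p > 0)

shares = [0.60, 0.40]  # assumed class shares among the 20 applicants

print(round(surprise_bits(0.60), 2))   # 0.74 bits for the common class
print(round(surprise_bits(0.40), 2))   # 1.32 bits for the rarer class
print(round(entropy_bits(shares), 2))  # 0.97 bits on average
```

The rarer class is more surprising each time it appears, but it appears less often, so the average lands just under 1 bit.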


Zooming out · Gini vs Entropy

Zooming out: the same shape across every possible split

Gini and entropy measure the same idea, how mixed a group is, with different math. Here’s how they compare across every possible class proportion p in a two-class group, from 0 (all one class) to 1 (all the other).

At our dataset’s proportion (p = 0.60), Gini reads 0.48 and Entropy reads 0.97 bits.
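
To see the two curves side by side without the chart, here is a small sketch that tabulates both measures for a two-class group. The function names and the sampled values of p are illustrative choices, not taken from the page.

```python
import math

def gini_binary(p):
    """Gini impurity of a two-class group where one class has share p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy_binary(p):
    """Entropy in bits of a two-class group where one class has share p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure group carries no surprise
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in [0.0, 0.1, 0.25, 0.5, 0.6, 0.75, 0.9, 1.0]:
    print(f"p={p:.2f}  gini={gini_binary(p):.2f}  entropy={entropy_binary(p):.2f}")
```

Both measures are zero for a pure group and peak at p = 0.50, where Gini reaches 0.50 and entropy reaches 1 bit; the row for p = 0.60 shows the 0.48 and 0.97 quoted above.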

The split · Information gain

Was the split worth it?

Drag the threshold to slice the applicants in two. Information gain measures how much impurity the split removed — the drop from the grey “before” bar.

Threshold: credit ≤ 664 · Gain: 0.32

Slider range: credit score 560 to 810

Before the split: one mixed group with its own impurity.

G(\text{parent}) = 1 - (0.60^2 + 0.40^2) = 0.48
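
Information gain is the parent's impurity minus the size-weighted average impurity of the two children. The sketch below assumes Gini as the impurity measure, matching the formula above. The child counts are made up for illustration: they are not the page's data and are not meant to reproduce the 0.32 shown for the 664 threshold. Function names are hypothetical.

```python
def gini_from_counts(counts):
    """Gini impurity from per-class counts, e.g. (approved, denied)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini_from_counts(child) for child in children)
    return gini_from_counts(parent) - weighted

parent = (12, 8)   # assumed counts behind the 0.60 / 0.40 shares
left = (1, 7)      # hypothetical: applicants at or below the threshold
right = (11, 1)    # hypothetical: applicants above the threshold

print(round(gini_from_counts(parent), 2))           # 0.48
print(round(gini_gain(parent, [left, right]), 3))   # 0.301
```

A split that separated the classes perfectly would leave both children with zero impurity, so its gain would equal the full parent impurity of 0.48.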

Putting it together

Growing the whole tree

A decision tree is just this idea on repeat: at every impure group, pick the split with the most information gain, then do it again on the children.

How the split is chosen: at each node the algorithm tries every feature and every threshold, scores each by information gain (how much it lowers impurity), and keeps the single best — the cleanest separation of approve vs deny.
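
The search described above can be sketched as a brute-force loop over features and thresholds. This is a simplified illustration, not any particular library's implementation: it scores splits by Gini gain, tries each observed value as a "≤" threshold, and runs on six made-up applicants whose feature values are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, features):
    """Try every feature and threshold; return (gain, feature, threshold)."""
    parent = gini(labels)
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        # Skip the largest value: "<= max" would leave the right side empty.
        for t in values[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

# Hypothetical applicants (income in thousands); not the page's dataset.
rows = [
    {"credit": 580, "income": 40},
    {"credit": 610, "income": 75},
    {"credit": 640, "income": 52},
    {"credit": 690, "income": 48},
    {"credit": 720, "income": 90},
    {"credit": 780, "income": 61},
]
labels = ["deny", "deny", "deny", "approve", "approve", "approve"]

gain, feature, threshold = best_split(rows, labels, ["credit", "income"])
print(feature, threshold, round(gain, 2))  # credit 640 0.5
```

In this toy data, credit score separates the two classes perfectly at 640 while income does not, so the search keeps the credit split. Growing a full tree would repeat the same search on each child until the groups are pure or a stopping rule applies.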

Each node’s ring shows its impurity — watch it shrink toward pure leaves as the tree grows. That shrinking impurity is exactly what information gain buys.