On Simpson's Paradox.


In this short note, we try to discover Simpson's paradox for ourselves, albeit guided by a vague memory of having seen it before. As a byproduct, we land upon a necessary and sufficient condition for Simpson's paradox to manifest itself in actual data.

Target Audience:

Data Science practitioners.



Perhaps the shortest description of the paradox is right there in the Wikipedia page on Simpson's Paradox (also called the Yule-Simpson effect). At this point, it might help to directly quote Pearl for an English description of the paradox: whereby the association between a pair of variables ($X$, $Y$) reverses sign upon conditioning of a third variable, $Z$, regardless of the value taken by $Z$.


This is how we will proceed with this topic.

Start Off:

Herein, let us assume a reader who has seen some form of Simpson's paradox before, and is trying to reconstruct it for herself. She remembers that there were some matrices involved, and so armed with that memory, we start off describing a matrix. But before that, let us decide on some nomenclature. It would help to carry along a concrete use-case/example in mind. We will call this the


Let us imagine that there is a (fictitious) college with just two departments (namely, "Mechanical Engg." and "Civil Engg.") - any student in the college is in exactly one of these two departments. We are considering the male and female populations of the departments and of the college as a whole. Thus, in view of the English description of the paradox above, we will set $X = \textrm{Women}$, $Y = \textrm{Men}$, and $Z$ would be a Boolean corresponding to being a member of the "Mechanical Engg." department (so that $\neg Z$ corresponds to "Civil Engg."). Thus we can encode the information in the story as a $2\times 2$ matrix, where the rows correspond to the departments ($Z$ and $\neg Z$) and the columns to $X$ and $Y$. Here, we present a sample $2\times 2$ matrix $\mathbf{A}$:

The Association:

If we have encountered Simpson's paradox before, we may also vaguely recall that the association we are looking for was a comparison between the columns $X$ and $Y$, for each of the rows.

More importantly, the association/comparison is to be such that we will evaluate it for either of the rows, and also for the sum of the rows together. What will that correspond to? The evaluation of the association for the first row, would correspond to the situation where the conditioning variable $Z$ is present. Evaluating the comparison for the second row would likewise correspond to the situation where the conditioning variable $Z$ is absent. Finally, if we add the two rows, and evaluate the same comparison, this would correspond to the overall situation (where $Z$ may be present or absent - we don't care).

This may seem like a lot of foolhardy work, but now that we have clarified what we are looking for, we can actively hunt for our paradox.

False Start 1:

One easy association we can look for is whether the entry for $\textrm{Women}$ in a row is greater than the entry for $\textrm{Men}$ in the same row. For instance in the example above, for both of the departments, the number of $\textrm{Women}$ is greater than the number of $\textrm{Men}$. So, when we sum the two rows together, thereby removing the conditioning of the variable $Z$, this same association still holds (in this example, the total number of Women across departments is $5 + 3 = 8$ while the total number of Men is $4 + 2 = 6$).
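The check above can be replayed in a few lines of Python. This is only a minimal sketch: the counts are the ones quoted in the text, while the row order (Mechanical Engg. first) and the helper name `women_outnumber_men` are assumptions made for illustration.

```python
# Counts quoted in the text; columns are (Women, Men).
# Assumption: the first row is Mechanical Engg., the second is Civil Engg.
counts = [
    (5, 4),  # Mechanical Engg. (assumed row)
    (3, 2),  # Civil Engg. (assumed row)
]


def women_outnumber_men(women, men):
    """The 'False Start 1' association: more women than men."""
    return women > men


# Evaluate the association for each row (conditioning on Z / not Z).
per_department = [women_outnumber_men(w, m) for w, m in counts]

# Sum the rows, i.e. drop the conditioning on Z, and evaluate again.
total_women = sum(w for w, _ in counts)
total_men = sum(m for _, m in counts)
overall = women_outnumber_men(total_women, total_men)

print(per_department, (total_women, total_men), overall)
```

With plain counts, the per-row comparison and the summed comparison agree, which is why this attempt cannot produce a reversal.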

False Start 2:

False start 1 indicates that the paradox cannot arise from plain counts: adding two inequalities that point the same way preserves their direction, so with numbers/integers we cannot expect a reversal of association. The next thing to try should be ratios!

Note that for this, we will have to adapt the story slightly, in order to accommodate the modification that entries are now ratios instead of plain old integers. Let us describe a sample matrix $\mathbf{B}$:


How do we interpret the ratios in this matrix? Given the setting of genders, and departments, it should be easy to give a new interpretation for the ratio entries of the matrix. The reader is encouraged to come up with her own interpretation. For instance, from the scenario encoded in matrix $\mathbf{B}$, $3$ out of $6$ women in the Civil Engg. department pass their exams.

Again, we choose the association as this ratio being larger for women than for men. For the Mechanical Engg. department, this does hold true, and likewise for the Civil Engg. department. If we sum up the numbers of women and of passing women (and, separately, of men and of passing men), we get that out of $8$ women (across the two departments), $4$ pass, while out of $8$ men, $3$ pass. Hmm, the association - that a larger ratio of women pass than men - still seems to hold.

Let us understand more clearly why the current setup did not lead to a reversal. Consider the matrix $\mathbf{C}$:

$$\mathbf{C} = \begin{pmatrix} a/b & c/d \\ m/n & r/s \end{pmatrix}$$

Here, we are supposing the following:

$$a/b > c/d \quad \textrm{and} \quad m/n > r/s$$

and we are asking about the association (i.e. whether it gets reversed) when the rows are "summed": $$(a + m)/(b + n) \overset{?}{>} (c + r)/(d + s)$$

In this specific example, it turns out that $a/b = m/n$. So if the supposed associations hold, then when the rows are "summed", there will be no association-reversal. How do we show this? To understand this, we will need the following fundamental (and highly useful) inequality:

Inequality 1:

If $a/b \geqslant c/d$ (with $b, d > 0$) then $a/b \geqslant (a + c)/(b + d) \geqslant c/d$. This may also be reformulated as $(a + c)/(b + d) \leqslant \max(a/b, c/d)$.
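One way to see why this inequality holds is sketched below. It assumes positive denominators, $b, d > 0$, so that $a/b \geqslant c/d$ is the same as $ad - bc \geqslant 0$:

$$\frac{a}{b} - \frac{a + c}{b + d} = \frac{ad - bc}{b(b + d)} \geqslant 0, \qquad \frac{a + c}{b + d} - \frac{c}{d} = \frac{ad - bc}{d(b + d)} \geqslant 0.$$

Both differences share the numerator $ad - bc$, so the "summed" ratio always lies between the two original ratios.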

Thus, if $a/b = m/n$, then $(a + m)/(b + n) = a/b = m/n$. Also, since $a/b = m/n > c/d, r/s$ and hence $a/b > \max(c/d, r/s)$, we have that $(a + m)/(b + n) = a/b > \max(c/d, r/s) \geqslant (c + r)/(d + s)$.

Altogether, the main reason why the association did not reverse in this case is that the ratio of women passing is the same for both departments. The reader may convince herself that, analogously, if the ratio of men passing were the same, then we would not have a paradoxical situation.

So if at all a paradox has to emerge, we need the ratios of men (or women) passing (across the departments) to be different!

And finally, the paradox!

This last observation is precisely the source of the paradox. Let us agree to call a matrix that exemplifies the paradox a paradoxical matrix. Without further ado, we provide an example of a paradoxical matrix $\mathbf{D}$:

Note that for the first department the ratio for women is $1/2$, which is greater than $2/5$ (the ratio for men). Similarly, for the second department the same association holds: $1/3 > 1/4$. However, aggregating the rows gives the ratio for women as $11/32 = 0.34375$ whereas for men it is $5/14 = 0.357...$. This is the association reversal that Simpson's paradox consists of! We have found the paradox!!
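The reversal can be checked with exact fractions. The sketch below is hedged: the (passed, total) counts are hypothetical values chosen to reproduce the per-department ratios and the aggregates $11/32$ and $5/14$ quoted above, and the helper names `pass_ratio` and `aggregate` are assumptions for illustration.

```python
from fractions import Fraction

# (passed, total) per department. Hypothetical counts that reproduce the
# ratios and aggregates quoted in the text; not copied from the original matrix.
women = [(1, 2), (10, 30)]  # ratios 1/2 and 1/3
men = [(4, 10), (1, 4)]     # ratios 2/5 and 1/4


def pass_ratio(passed, total):
    """Exact pass ratio for one department."""
    return Fraction(passed, total)


def aggregate(rows):
    """Pool departments: sum numerators and denominators separately."""
    return Fraction(sum(p for p, _ in rows), sum(t for _, t in rows))


# Association within each department: do women pass at a higher ratio?
per_department = [pass_ratio(*w) > pass_ratio(*m) for w, m in zip(women, men)]

# Association after pooling the two departments.
women_overall = aggregate(women)
men_overall = aggregate(men)

print(per_department)
print(women_overall, men_overall)
print(women_overall < men_overall)
```

Women lead in each department, yet the pooled ratio for women falls below the pooled ratio for men, because most of the women sit in the department with the lower pass ratio.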

How did we find that?

Let us think of the problem slightly abstractly, and it will help if the reader can think in terms of limits. So far, we have seen that a necessary condition for a paradoxical matrix is that the ratios for women (across departments) are different. At the end of the current section, we will be able to make a very strong statement: that this condition is also sufficient for deriving a paradoxical matrix!

First, let us see the Inequality 1 above in another avatar:

Inequality 2:

If $a/b \geqslant c/d$ then $a/b \geqslant (ta + c)/(tb + d) \geqslant c/d$ for any $t \in (0, \infty)$.

Also note that when $t \rightarrow 0$, the middle ratio tends to $c/d$. At the other extreme, when $t \rightarrow \infty$, the middle ratio tends to $a/b$. We will use this degree of freedom to play with the ratios so as to achieve our paradox.
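The effect of the scaling factor $t$ can be explored numerically. This is a minimal sketch, assuming positive denominators; the function name `weighted_mediant` and the sample values of $t$ are illustrative choices, and the ratios $1/2$ and $1/3$ are simply convenient examples.

```python
from fractions import Fraction


def weighted_mediant(a, b, c, d, t):
    """Return (t*a + c) / (t*b + d), assuming b, d > 0 and t > 0."""
    return Fraction(t * a + c, t * b + d)


# Example ratios with a/b = 1/2 >= c/d = 1/3.
a, b, c, d = 1, 2, 1, 3
upper = Fraction(a, b)
lower = Fraction(c, d)

# Small t keeps the result near c/d; large t pushes it towards a/b.
ts = [Fraction(1, 1000), Fraction(1), Fraction(1000)]
values = [weighted_mediant(a, b, c, d, t) for t in ts]

for t, v in zip(ts, values):
    print(f"t = {float(t):g}: ratio = {float(v):.5f}")
```

Scaling a row by $t$ leaves its own ratio $ta/tb = a/b$ unchanged but changes its weight in the pooled ratio, which is the degree of freedom the construction relies on.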


Armed with the above, let us consider a special case where the ratios for women are $1/2$ and $1/3$ - the reader is invited to work out any other situation where the ratios are different. Let $\mathbf{D}$ now be:

We start off with this template, where the ratios (across departments) for men are only slightly lower than the corresponding ratios for women. This is signified by the $\epsilon$.

Let us outline what our strategy will be.



  1. Understanding Simpson's Paradox by Judea Pearl.
Created 8 June 2017.