In this short note, we try to rediscover Simpson's paradox for ourselves, guided only by a vague memory of having
seen it before. As a byproduct, we land upon a necessary and sufficient condition for Simpson's paradox to manifest
itself in actual data.
Target Audience:
Data Science practitioners.
Why:
First, the paradox is itself an extremely important facet of statistics and data science, one which may
influence important decisions in real life. To quote
Judea Pearl about the reversal that is implicit
in the paradox: "reversal may lead to difficult choices in critical decision-making situations".
I can safely say that this is the one
phenomenon that I almost always choose when describing some data science magic to friends outside the field,
and more often than not, I have stumbled. To the many friends and family members whom I have befuddled with my messy
efforts at elucidating
this beautiful fact/paradox: many apologies.
More often than not, in order to understand a phenomenon, it is fruitful to understand the simplest examples of
that phenomenon. The writing below is an effort in that direction, vis-a-vis Simpson's Paradox.
What:
Perhaps the shortest description of the paradox is right there in the
Wikipedia page on Simpson's Paradox (also called
the Yule-Simpson effect).
At this point, it might help to directly quote Pearl
for an English description of the paradox: it is a phenomenon "whereby the association
between a pair of variables ($X$, $Y$) reverses sign upon conditioning of a third variable, $Z$, regardless of the
value taken by $Z$".
How:
This is how we will proceed.
We will try to tease out what that English description means; this will lead to a few hiccups.
We will then land on appropriate examples elucidating the paradox.
Finally, we will tease out the main threads that make the paradox work, and leave a detailed discussion to
a Part 2 of this post.
Start Off:
Herein, let us assume a reader who has seen some form of Simpson's paradox before, and is trying to reconstruct
it for herself. She remembers that there were some matrices involved, and so armed with that memory, we start off
describing a matrix. But before that, let us decide on some nomenclature.
It would help to keep a concrete use-case/example in mind. We will call this the
Story
Let us imagine that there is a (fictitious)
college with just two departments (namely, "Mechanical Engg." and "Civil Engg.") - any student in the college
is in exactly one of these two departments. We are considering the male and female populations of the departments
and the college as a whole. Thus, in view of the English description of the paradox above, we will
set $X = \textrm{Women}$, $Y = \textrm{Men}$ and $Z$ would be a Boolean corresponding to being a member of the
"Mechanical Engg." department (so that $\neg Z$ corresponds to "Civil Engg.").
Thus we can encode the information in the story as a $2\times2$ matrix,
where
(Columns) The columns will correspond to $X = \textrm{Women}$
and $Y = \textrm{Men}$.
(Rows) The rows will correspond to the departments, i.e.
"Mechanical Engg." and "Civil Engg." - these will correspond to $Z$ and $\neg Z$ (i.e. presence
or absence of the third conditioning variable).
(Entries) Also, let each entry of the matrix be the number of men/women in the
corresponding department.
Here, we present a sample $2\times 2$ matrix $\mathbf{A}$:
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & 5 & 4 \\
\textrm{Civil} & 3 & 2
\end{array}
$$
The Association:
We may also vaguely recall, in case we have encountered Simpson's paradox before,
that the association we are looking for was a comparison between the columns
$X$ and $Y$, for each of the rows.
More importantly, the association/comparison is to be such that we will evaluate
it for each of the rows, and also for the sum of the rows. What will that correspond to?
The
evaluation of the association for the first row, would correspond to the situation where the conditioning variable
$Z$ is present. Evaluating the comparison for the second row would likewise correspond to the situation where
the conditioning variable $Z$ is absent. Finally, if we add the two rows, and evaluate the same comparison, this
would correspond to the overall situation (where $Z$ may be present or absent - we don't care).
This may seem like a lot of laborious groundwork, but now that we have clarified what we are looking for, we can
go actively hunt for our paradox.
False Start 1:
One easy association we can look for is whether the entry for $\textrm{Women}$ in a row is greater than
the entry for $\textrm{Men}$ in the same row. For instance, in the example above, for both of the departments,
the number of $\textrm{Women}$ is greater than the number of $\textrm{Men}$. But then, when we sum the two rows together,
thereby removing the conditioning on the variable $Z$, this same association must still hold, since adding two inequalities of the same direction preserves the direction (in this example,
the total number of Women across departments is $5 + 3 = 8$ while the total number of Men is $4 + 2 = 6$).
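The check above can be replayed in a few lines of Python. This is only a minimal sketch: the dictionary layout and the helper name `women_outnumber_men` are illustrative choices, not something fixed by the story.

```python
# Matrix A from the text: rows are departments, columns are (women, men) head counts.
A = {"Mech.": (5, 4), "Civil": (3, 2)}

def women_outnumber_men(women, men):
    """The association of False Start 1: more women than men."""
    return women > men

# Association within each department (conditioning on Z).
per_department = {dept: women_outnumber_men(w, m) for dept, (w, m) in A.items()}

# Association after summing the rows (ignoring Z).
total_women = sum(w for w, _ in A.values())
total_men = sum(m for _, m in A.values())
overall = women_outnumber_men(total_women, total_men)

print(per_department)                   # {'Mech.': True, 'Civil': True}
print(total_women, total_men, overall)  # 8 6 True
```

With plain counts, the overall comparison is just the sum of the per-department comparisons, so no choice of numbers can flip it.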
False Start 2:
False Start 1 indicates that the paradox cannot deal with just numbers: in the case of numbers/integers,
we cannot expect a reversal of association. The next thing to try should be ratios!
Note that for this, we will have to adapt the story slightly, in order to accommodate the modification that
entries are now ratios instead of plain ole integers. Let us describe a sample matrix $\mathbf{B}$:
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & \frac{1}{2} & \frac{1}{3} \\
\textrm{Civil} & \frac{3}{6} & \frac{2}{5}
\end{array}
$$
How do we interpret the ratios in this matrix? Given the setting of genders, and departments, it should be
easy to give a new interpretation for the ratio entries of the matrix. The reader is encouraged to come up
with her own interpretation.
(Entries) Let each entry of the matrix be the ratio of women (or men) who
pass their exams in the
corresponding department: the denominator is the total number of
women (or men) in that department and the numerator is the number who actually pass.
For instance, from the scenario encoded in matrix $\mathbf{B}$, $3$ out of $6$ women in the Civil Engg.
department pass their exams.
Again, we choose the association to be that this ratio is larger for women than for men. For
the Mechanical Engg. department, this does hold true. So also for the Civil Engg. department. And if we
sum up the numerators and the denominators separately (for women, and likewise for men), we get that
out of $8$ women (across the two departments), $4$ pass, while
out of $8$ men, $3$ pass. Hmm, the association - that a larger ratio of women pass than men -
still seems to hold.
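Here is the same bookkeeping as a small Python sketch. Each cell is stored as a `(passed, total)` pair so that aggregation adds numerators and denominators separately; the names `pass_rate` and `aggregate` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Matrix B from the text, one (passed, total) pair per cell.
B = {
    "Mech.": {"women": (1, 2), "men": (1, 3)},
    "Civil": {"women": (3, 6), "men": (2, 5)},
}

def pass_rate(passed, total):
    return Fraction(passed, total)

def aggregate(matrix, group):
    """Sum passes and head counts for one group across all departments."""
    passed = sum(row[group][0] for row in matrix.values())
    total = sum(row[group][1] for row in matrix.values())
    return passed, total

# Per-department association: do women pass at a higher rate than men?
for dept, row in B.items():
    print(dept, pass_rate(*row["women"]) > pass_rate(*row["men"]))

women_overall = aggregate(B, "women")
men_overall = aggregate(B, "men")
print(women_overall, men_overall)                            # (4, 8) (3, 8)
print(pass_rate(*women_overall) > pass_rate(*men_overall))   # True
```

Note that $3/6$ is deliberately kept unreduced in the data: reducing it to $1/2$ would change the head counts and hence the aggregate.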
Let us understand more clearly as to why the current setup
did not lead to a reversal. Consider the matrix $\mathbf{C}$:
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & \frac{a}{b} & \frac{c}{d} \\
\textrm{Civil} & \frac{m}{n} & \frac{r}{s}
\end{array}
$$
Here, we are supposing the following:
Association for the first row: $a/b > c/d$
Association for the second row: $m/n > r/s$
and we are asking about the association (i.e. whether it gets reversed)
when the rows are "summed":
$$(a + m)/(b + n) \overset{?}{>} (c + r)/(d + s)$$
In the specific example of matrix $\mathbf{B}$, it turns out that $a/b = m/n$ (both equal $1/2$). So if the supposed associations hold,
then when the rows are "summed", there will be no association-reversal. How do we show this?
To understand this, we will need the following fundamental (and highly useful) inequality:
Inequality 1
If $a/b \geqslant c/d$ (with $b, d > 0$) then $a/b \geqslant (a + c)/(b + d) \geqslant c/d$. In particular,
$(a + c)/(b + d) \leqslant \max(a/b, c/d)$.
Thus, if $a/b = m/n$, then $(a + m)/(b + n) = a/b = m/n$.
Also, since $a/b = m/n$ exceeds both $c/d$ and $r/s$, we have $a/b > \max(c/d, r/s)$, and so
$(a + m)/(b + n) = a/b > \max(c/d, r/s) \geqslant (c + r)/(d + s)$.
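For completeness, here is one way to see why Inequality 1 holds. It is the standard cross-multiplication argument, sketched under the assumption that $b, d > 0$ (automatic here, since the denominators are head counts).

```latex
% Assumes b, d > 0, so multiplying through by denominators preserves direction.
\begin{align*}
\frac{a}{b} \geqslant \frac{c}{d}
  &\iff ad \geqslant bc \\
  &\iff a(b + d) = ab + ad \geqslant ab + bc = b(a + c)
   \iff \frac{a}{b} \geqslant \frac{a + c}{b + d}, \\[4pt]
\frac{a}{b} \geqslant \frac{c}{d}
  &\iff ad \geqslant bc \\
  &\iff d(a + c) = ad + cd \geqslant bc + cd = c(b + d)
   \iff \frac{a + c}{b + d} \geqslant \frac{c}{d}.
\end{align*}
```

The same steps with strict inequalities show that the middle ratio lies strictly between the two when $a/b > c/d$.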
Altogether, the main reason why the association did not reverse in this case is that
the ratio of women passing is the same
for both the departments. The reader may convince herself
that, analogously, if the ratio of men passing were the same, then we would not have a
paradoxical situation.
So if a paradox is to emerge at all, we need the ratios of women passing to differ across the
departments, and likewise the ratios of men passing!
And finally, the paradox!
This last observation is precisely the source of the paradox. Let us agree to call
a matrix that exemplifies the paradox a paradoxical matrix.
Without further ado, we provide an example of a paradoxical matrix $\mathbf{D}$:
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & \frac{1}{2} & \frac{4}{10} \\
\textrm{Civil} & \frac{10}{30} & \frac{1}{4}
\end{array}
$$
Note that for the first department the passing ratio for women is $1/2$, which is greater than $4/10 = 2/5$ (the ratio
for men). Similarly, for the second department the same association holds: $10/30 = 1/3 > 1/4$. However,
aggregating the rows gives the ratio for women as $11/32 = 0.34375$ whereas for men it is
$5/14 = 0.357...$. This is the association reversal that Simpson's paradox consists of!
We have found the paradox!!
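The definition of a paradoxical matrix can be turned into a small checker. The sketch below assumes each cell is a `(passed, total)` pair and that the association of interest is "women pass at a higher rate than men"; the function names are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

def rate(cell):
    passed, total = cell
    return Fraction(passed, total)

def pooled(cell_1, cell_2):
    """Aggregate two (passed, total) cells by adding counts, not ratios."""
    return (cell_1[0] + cell_2[0], cell_1[1] + cell_2[1])

def is_paradoxical(matrix):
    """matrix = ((women_1, men_1), (women_2, men_2)), each cell (passed, total).

    True when women beat men in every row but lose after pooling the rows.
    """
    (w1, m1), (w2, m2) = matrix
    rows_favour_women = rate(w1) > rate(m1) and rate(w2) > rate(m2)
    pooled_favours_men = rate(pooled(w1, w2)) < rate(pooled(m1, m2))
    return rows_favour_women and pooled_favours_men

D = (((1, 2), (4, 10)), ((10, 30), (1, 4)))
B = (((1, 2), (1, 3)), ((3, 6), (2, 5)))

print(is_paradoxical(D))  # True
print(is_paradoxical(B))  # False
print(rate(pooled((1, 2), (10, 30))), rate(pooled((4, 10), (1, 4))))  # 11/32 5/14
```

Using `Fraction` rather than floats keeps every comparison exact, which matters when the two pooled ratios are as close as $11/32$ and $5/14$.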
How did we find that?
Let us think of the problem slightly abstractly; it will help if the reader can think in terms of
limits. So far, we have seen that a necessary condition for a paradoxical matrix is that the ratios for
women (across departments) are different. At the end of the current section, we will be able to
make a very strong statement: this condition is also sufficient for constructing a paradoxical
matrix!
First, let us see Inequality 1 above in another avatar:
Inequality 2:
If $a/b \geqslant c/d$ then $a/b \geqslant (ta + c)/(tb + d) \geqslant c/d$ for any $t \in (0, \infty)$.
This is just Inequality 1 applied to $ta/tb = a/b$ and $c/d$.
Also note that when $t \rightarrow 0$, the middle ratio tends to $c/d$. At the other extreme, when
$t \rightarrow \infty$, the middle ratio tends to $a/b$. We will use this
degree of freedom to play with the ratios so as to achieve our paradox.
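To get a feel for how the weight $t$ slides the pooled ratio between the two extremes, here is a small numerical sketch. The values $a/b = 1/2$ and $c/d = 1/3$ and the list of weights are arbitrary illustrative choices, and `pooled_ratio` is a hypothetical helper.

```python
from fractions import Fraction

def pooled_ratio(a, b, c, d, t):
    """Return (t*a + c) / (t*b + d) exactly, for a weight t > 0."""
    t = Fraction(t)
    return (t * a + c) / (t * b + d)

a, b = 1, 2  # a/b = 1/2, the larger ratio
c, d = 1, 3  # c/d = 1/3, the smaller ratio

weights = [Fraction(1, 100), 1, 10, 1000]
ratios = [pooled_ratio(a, b, c, d, t) for t in weights]

for t, r in zip(weights, ratios):
    # Each pooled ratio stays between c/d and a/b, as Inequality 2 says.
    assert Fraction(c, d) <= r <= Fraction(a, b)
    print(f"t = {float(t):>8.2f}  pooled ratio = {float(r):.4f}")
```

Small weights keep the pooled ratio near $1/3$, large weights push it towards $1/2$, and $t = 1$ gives the plain aggregate $2/5$.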
Dénouement
Armed with the above, let us consider a special case where the ratios for women are $1/2$ and $1/3$ -
the reader is invited to work out any other situation where the ratios are
different. Let $\mathbf{D}$ now be:
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & \frac{1}{2} & \frac{k-\epsilon}{2k} \\
\textrm{Civil} & \frac{1}{3} & \frac{m-\epsilon}{3m}
\end{array}
$$
We start off with this template, where the ratios (across departments) for men are
only slightly lower than the corresponding ratios for women. This is signified by the
$\epsilon$.
Let us outline what our strategy will be.
We will fix the actual counts for women at $1$ out of $2$ and $1$ out of $3$ for the two
departments respectively. So out of a total of $2 + 3 = 5$ women, exactly
$1 + 1 = 2$ pass (across all the departments).
For men, we will keep the passing ratios floating, to leave room for play.
Since $1/2 \geqslant 1/3$, i.e. department 1 has the higher ratio, we will write the men's ratio in department 1 as
$t(k - \epsilon)/(2kt)$, where $t$ is a scaling factor that we are free to choose: it changes the head count but not the ratio.
The men's ratio for department 2 is $(m -\epsilon)/(3m)$ for suitable $m$.
The big idea is the following:
We see that for women (across all departments),
the passing ratio is $2/5 < 1/2$. For men, however, it is
$(ta + c)/(tb + d)$ where $a/b = (k-\epsilon)/(2k)$ and
$c/d = (m -\epsilon)/(3m)$.
The money statement: We can stretch out $t$ (and appeal to Inequality 2) so that
the overall ratio $(ta + c)/(tb + d)$ starts looking more like $a/b$,
which in this case
is pretty close to $1/2$. Indeed, it suffices to stretch out $t$ just enough
that this ratio
is greater than $2/5$ (the overall passing ratio for women). The template below shows how $t$ fits into
our description.
$$
\begin{array}{l|cc}
 & \textrm{Women} & \textrm{Men} \\ \hline
\textrm{Mech.} & \frac{1}{2} & \frac{t(k-\epsilon)}{2kt} \\
\textrm{Civil} & \frac{1}{3} & \frac{m-\epsilon}{3m}
\end{array}
$$
And that's it! This is how we can construct paradoxical matrices!
If the reader feels there are too many free variables floating around, let's try
setting some of these parameters.
Take $\epsilon = 1$ and let $k = 2$, $m = 2$, so that the ratios for men are $a/b = 1/4$ and
$c/d = 1/6$. This will not work: no matter what $t$ is, the ratio
$(ta + c)/(tb + d)$ lies between $1/6$ and $1/4$ (by Inequality 2) and so can never exceed the women's overall ratio of $2/5$.
The easy fix is to make $k$ larger. How large? Large enough that we have
at least a chance of exceeding the overall ratio for women, i.e. $2/5$. So
we want $(k-1)/(2k) > 2/5$, which means $5k - 5 > 4k$, i.e. $k > 5$. Choosing
$k = 6$ gives $(k-1)/(2k) = 5/12$.
How do we choose $t$? Well, this will depend on how low the second
passing ratio for men, $(m -\epsilon)/(3m)$, is. Say $m = 2$ (with $\epsilon = 1$), so that this ratio
is $1/6$. Then we want $t$ such that $(5t + 1)/(12t + 6) > 2/5$.
This means $25t + 5 > 24t + 12$, i.e. $t > 7$.
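The parameter choices above can be checked mechanically. The following sketch plugs in $k = 6$, $m = 2$ and $\epsilon = 1$, then searches for the smallest integer $t$ that produces the reversal; restricting $t$ to integers and the helper names `men_cells` and `pooled_rate` are assumptions made for this illustration.

```python
from fractions import Fraction

women = ((1, 2), (1, 3))   # (passed, total) in Mech. and Civil
k, m, eps = 6, 2, 1        # parameter choices from the text, with epsilon = 1

def men_cells(t):
    """Men's (passed, total) cells: department 1 is scaled by t, department 2 is not."""
    return ((t * (k - eps), t * 2 * k), (m - eps, 3 * m))

def pooled_rate(cells):
    """Overall pass rate: total passes over total head count."""
    return Fraction(sum(p for p, _ in cells), sum(n for _, n in cells))

women_overall = pooled_rate(women)

# Smallest integer t for which the pooled men's rate overtakes the women's.
t = 1
while pooled_rate(men_cells(t)) <= women_overall:
    t += 1

men = men_cells(t)
print(t, men)                                   # 8 ((40, 96), (1, 6))
print(women_overall, pooled_rate(men))          # 2/5 41/102
# Women still lead within each department.
print(all(Fraction(*w) > Fraction(*x) for w, x in zip(women, men)))  # True
```

At $t = 7$ the two overall ratios are exactly equal ($36/90 = 2/5$), which is why the strict inequality $t > 7$ is needed.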
Bottom line: Given the various parameters $k, m, t$, there is enough freedom to
set them so as to yield the paradox. The reader is invited to try some
different setting of passing ratios for women.
Takeaways
This concludes our demonstration of the paradox.
In fact, we ended up
arriving at
a necessary and sufficient condition for the following question.
Suppose we are given a
partially filled matrix, i.e. one in which only the entries of a
column (or a row) are filled in. When can we extend/complete this to a
paradoxical matrix?
We essentially showed that the necessary condition is also sufficient.
It is necessary that the filled-in column has different ratios as its
entries. The long-winded discussion above indicates that this is sufficient too.
This post was about the mechanics of conjuring examples of paradoxical
matrices. How do we really understand/resolve Simpson's paradox?
For now, I refer the reader to Judea Pearl's
excellent article on this topic - I
hope to make that the subject matter of another post.