Generating Combinations.

TL;DR:

In this short note, we discuss and dissect a program for generating all the combinations of a collection/set of $n$ characters, $r$ at a time. For $r$-combinations of $n$ objects, the program will generate all $n \choose r$ combinations.

Target Audience:

Beginners in programming. Prerequisite: working/basic knowledge of Python.

Background:

While I was going through the Python Docs for itertools I came across the following pretty program for generating all the combinations of a set of characters.

Why:

How we will proceed

Analysis:

Decide on an order:

Let us start with the desiderata: we would like to generate all the combinations of a set of $n$ characters, $r$ at a time. What is not specified is the order in which we need to generate these combinations in. That is up to us. It is quite popular (in Computer Science, as also elsewhere) to generate objects in what is called the lexicographical order (also called dictionary order). This is a (natural) order of choice, and we will adopt this for our purposes.

Also observe that given the set of $n$ characters input as an array, we essentially want to consider the indices of the array, while generating the combinations. This makes sense because the actual set of characters can be any arbitrary set, without any specified ordering between them (say for instance, all the punctuation marks of the English language). We will be generating combinations in a lexicographical ordering of these indices (so $023$ comes before $034$ etc.).

Sample lexicographical ordering (for $n=5$, $r = 3$): $012 < 013< 014 < 023 < 024 < 034 < 123 < 124 < 134 < 234$.

Start writing code:

When we try to start writing code, we come across our first problem:

Observe transitions according to the order:

Once we have chosen the order (as above) in which to generate the combinations, it would help to look at a few sample cases to see how the transitions work from one combination to the next.
Looking at the sequence of $5 \choose 3$ combinations above (for the combinations for $n = 5$ and $r = 3$), we notice that the changes seem to be starting at the end of the sequence - or more accurately, in reverse order. Mapping back to arbitrary $n$ and $r$ (from this tiny example), the start of the sequence is $\mathrm{range}(r)$ and the end of the sequence is $\{ n-r, n-r + 1, \cdots, n - 1\}$.

Aside 1:

Here's a Python puzzle: you are given input $\mathrm{range}(r)$ and you have to output $\{ n-r, n-r + 1, \cdots, n - 1\}$. I.e. you have to output the same array, with $n-r$ added to each element of the array.

Aside 2:

We have decided to cycle through the indices in a lexicographical ordering, and that the array $\mathrm{indices}$ will contain these indices. If the original set of characters is $A$, then we will generate the elements of $A$ corresponding to the indices in the array $\mathrm{indices}$. Python code for this?

Answer: use list comprehensions (again!): B = [A[i] for i in indices]. Look at line 9.

Design the invariants:

The principal task is to design some invariants that will guide our algorithm. (By the way, it is always a good idea to reason about algorithms in terms of the invariants that we want. Not only does it make for cleaner thought processes, it also helps in proving program correctness.) This part is the meat of the program (and our discussion). Let's recall the example from above: $012 < 013< 014 < 023 < 024 < 034 < 123 < 124 < 134 < 234$.

Invariant 1: Given the $\mathrm{indices}$ array, at any point in the algorithm, the entries of the array are monotone non-decreasing.

This makes sense since we are searching for combinations and we can actually afford to enforce this property.

In order to introduce Invariant 2, we will introduce some nomenclature. Let us try to understand how the transition is made from one combination to the next. Specifically, given a combination (say, $124$), we will attempt to find out where the next change should occur. As mentioned above, we will have to look at the array in the reverse order. Given $n = 5$ and $4$ is the maximum any element can go up to, in $124$ the last character has hit a wall. So we keep searching for the next element in reverse order (here, a $2$). Has that also hit the wall (i.e. the maximum it can be?). Note that for the last-but-one character, the maximum it can be is not $4$ but $3$.

Aside 3:

For general $n$ and $r$ find out the maximum the $i^{th}$ entry of the array $\mathrm{indices}$ can be. Answer: it is $i + n - r$ (see Aside 1).

Nomenclature:

In order to understand code (or papers, research, or what have you) its good to give meaningful names to modularize and streamline thinking. Let's call an index saturated if that index is already the maximum it can be; otherwise call an index unsaturated. Thus for instance (with $n = 5$, $r = 3$), in the array $\mathrm{indices} = [0, 2, 4]$, the index $2$ is saturated, while indices $0$, $1$ are unsaturated. More specifically, index $i$ is saturated if $\mathrm{indices}[i] == i + n - r$ (and unsaturated if the LHS is $<$ RHS). We are now ready to describe Invariant 2.

Invariant 2: Given the $\mathrm{indices}$ array, if some index $i$ is saturated, then all indices $(i+1), (i+2), \cdots, (r-1)$ are saturated.

This is an easy corollary of the monotonicity enforced by Invariant 1 (why?). This will pave the way for our final idea, that will actually give us the solution.

Main Idea: Given a combination in $\mathrm{indices}$, the next combination in the lexicographical order is obtained by changing the first (from the right, i.e. in reverse order) unsaturated index in the array $\mathrm{indices}$.

Note: such an unsaturated index might not even exist! When can this happen? Well, this means that every index in the array is saturated. But this means that the array $\mathrm{indices} = [ (n-r), (n-r +1), \cdots, n - 1]$. This means in turn, that we have reached the end of our task - we have generated all the combinations possible!

To implement the main idea, we will write code for the following functionality: given an input array $\mathrm{indices}$, (and $n$, $r$), return the (unsaturated) index into the array that will be changed next. What is wrong with this? For the array $\mathrm{indices} = [ 0, 2, 3]$ the output is correct (index $2$ - with indices starting with $0$) is right. However, for the array $\mathrm{indices} = [ 0, 2, 4]$, the answer should be index $1$, but the program bottoms out and returns $-1$. What gives? It is easy to understand a mistake: the first time, we find a saturated index (i.e. $\mathrm{indices}[i]$ equals $i + n - r$) in line 10, the code does not even search for the next $i$ (in the reverse order) but instead aborts it and outputs $-1$.

The correct behavior we desire is that we need to search the entire array, break-ing (or return-ing) only when we find a unsaturated index. This may be implemented by leaving out lines 13 and 14 from the above.
def find_index_to_change(indices,n, r):
    for i in reversed(range(r)):
        if indices[i] < (i + n - r):
                return i
All is good with the above, just that we may fall in the corner case where we zip through the entire array without finding an unsaturated index. This means that the indices array is $[ n - r, n-r+1, \cdots, n-1]$ (i.e. the last member of the sequence of combinations). We would then have finished with generating all the combinations, so we can go home happy.

In this corner case, we would like to output a default value (say $-1$).

The interesting Python construct:

So how do we implement this desired behavior, i.e. of returning a default (say, $-1$) if we are unable to find any unsaturated index? Answer: We would need to add an $\mathrm{else}$ clause to the $\mathrm{for}$ loop above.

Wait a minute, wha...t? An else-clause paired with the for loop? Yes, such a construct indeed exists - and it finds use in precisely such scenarios as this: if we have zipped through a for-loop without break-ing or return-ing from it, then the else conditional is reached.

This gives us the correct implementation of find_index_to_change:
def find_index_to_change(indices,n, r):
    for i in reversed(range(r)):
        if indices[i] < (i + n - r):
            return i
    else:
        return -1
Note that this corresponds to adding back on the bottoming-out lines 13 and 14, suitably un-indented as above; I invite the reader to go back to the trinket above and run this out).

See this SO link for more discussion on this interesting and useful construct!

Finishing up

We just have one last module to write. So we know which index to change in the $\mathrm{indices}$ array, what is the actual change to be done? This functionality will be encapsulated in the function change_indices: it takes as parameters the array $\mathrm{indices}$, and $i$, the first (from right) unsaturated index in the array $\mathrm{indices}$ (i.e. the output of the function find_index_to_change).

The work that needs to be done is actually simple. If $i$ is the index that initiates the change, the actual change consists of resetting the sub-array $\mathrm{indices}[i:]$. The reset consists of setting this part of the array to the lexicographically least admissible value:
def change_indices(indices,i):
    r = len(indices)
    indices[i] += 1
    for j in range(i+1, r):
        indices[j] = indices[j-1] +1
    return indices

Stitching it together

The overall structure of the final program will be as follows: The final trinket is shown below:

Final words

The reader is invited to revisit the original code snippet, for comparing and contrasting. One thing that the original code implements is that it takes in an iterable input, and uses generators to "yield" the combinations. This is in fact the preferred coding style, see (for instance) Jeff Knupp's article, more so when we are generating objects (such as combinations, permutations etc.). At this point I invite the reader to translate the code to one that uses generators/yield-statements.

I also leave it to the reader to check out the section on permutations on the itertools website, and conduct a similar analysis for that code.

The main take-aways are


Created 19 March 2017.