The Boyer-Moore Algorithm
Computing the majority vote with pen and paper
In this article, we discuss a beautiful algorithm of Robert S. Boyer and J Strother Moore, published in 1981, that computes the majority element of a sequence in time linear in the number of elements and using only constant space!
Suppose that we have a very long sequence of integers, appearing one by one. Our goal is to compute the majority element, if such an element exists. A majority element in a sequence of n (not necessarily distinct) elements is an element that appears more than n/2 times in the sequence. More precisely, if n = 2k or n = 2k+1 for some natural number k, then an element is a majority element if and only if it appears at least k+1 times.
Here is an example. Suppose that we have the following sequence of 9 elements:
1, 2, 11, 4, 5, 2, 1, 2, 17.
The most frequent element in the above sequence is the element 2, which appears 3 times. Since a majority element for a sequence of length 9 is an element that appears more than 9/2 times, i.e., at least 5 times, we conclude that there is no majority element in the above sequence.
Let’s look now at a modified sequence:
2, 2, 11, 2, 5, 2, 1, 2, 17.
In this sequence of length 9, the element 2 appears 5 times, and thus, we conclude that it is a majority element.
Observation: A finite sequence can have at most one majority element, since two distinct elements that each appear more than n/2 times would together account for more than n elements.
This problem is a standard one in coding interviews, but its inherent beauty is such that I think it is worth writing about.
First approach
A natural approach is to keep track of all distinct elements that appear in the sequence, along with the frequency of each element. At the end, we can go over all elements and the corresponding frequencies, and decide whether there exists a majority element or not.
The above approach indeed works, and, by using good enough data structures, it allows us to solve the problem fast enough. The main drawback is that, in the worst case, we will have to store most of the elements that we see, and thus we will need space proportional to the length of the sequence. More precisely, in the worst case we will use Θ(n) space when processing a sequence with n elements. As suggested by the title of this article, this is definitely not realistic to do with pen and a single piece of paper, as the sequence might contain millions of elements!
Still, this approach can work if we know, for example, that each element is an integer between 1 and K, for some small integer K. In such a case, we build a table of K entries, one for each element, and simply keep track of the frequency of appearance of each element.
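To make this bounded-domain variant concrete, here is a minimal sketch in C, assuming every element is an integer between 1 and K. The function and parameter names are illustrative, and the array follows the same 1-to-n indexing convention as the sample implementation at the end of this article.

#include <stdlib.h>

/* sketch of the bounded-domain counting approach: every element of x[1..n]
   is assumed to be an integer between 1 and K. Returns 1 and stores the
   majority element in "majority_element" if it exists, otherwise returns 0. */
int majority_by_counting(int n, int K, int *x, int *majority_element)
{
    int *freq = calloc(K + 1, sizeof(int)); /* freq[v] = occurrences of value v */
    int i, found = 0;

    if (freq == NULL)
        return 0;
    for (i = 1; i <= n; i++)
        freq[x[i]] = freq[x[i]] + 1;
    for (i = 1; i <= K; i++)
        if (freq[i] > n / 2) { /* strictly more than half of the elements */
            *majority_element = i;
            found = 1;
            break;
        }
    free(freq);
    return found;
}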
Second approach
Another natural way to go would be to sort the list and then look at the median element; if there exists a majority element, the median element will be equal to it. This is an elegant approach that can be implemented so that it runs in O(n log n) time and constant extra space (see, e.g., heapsort). So, technically, it fulfills the promise of the title, but its running time is not linear, and it is definitely not elementary to implement heapsort and do in-place sorting.
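As an illustration, here is a minimal sketch of the sorting-based approach in C. For brevity it uses the standard library's qsort, which is not guaranteed to be an in-place heapsort, so this particular sketch does not achieve the constant-extra-space bound discussed above. The function names are illustrative, and the array is 0-indexed here.

#include <stdlib.h>

/* comparison function for qsort on ints */
static int compare_ints(const void *a, const void *b)
{
    int u = *(const int *)a, v = *(const int *)b;
    return (u > v) - (u < v);
}

/* sketch of the sorting-based approach: sort the n elements of the 0-indexed
   array x, take the median, and verify that it appears more than n/2 times.
   Returns 1 and stores the majority element in "majority_element" if it
   exists, otherwise returns 0. */
int majority_by_sorting(int n, int *x, int *majority_element)
{
    int candidate, count = 0, i;

    if (n == 0)
        return 0;
    qsort(x, n, sizeof(int), compare_ints);
    candidate = x[n / 2]; /* if a majority element exists, it sits at the median */
    for (i = 0; i < n; i++)
        if (x[i] == candidate)
            count = count + 1;
    if (count > n / 2) {
        *majority_element = candidate;
        return 1;
    }
    return 0;
}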
The Boyer-Moore Algorithm
And now comes the magic. We will compute the majority element, if such an element exists, by simply using 2 memory slots, or two lines on a single piece of paper!
We store two entries:
- m, which represents the candidate “majority” element, and
- count, which keeps track of whether we have an excess of appearances of some element. The variable count will never drop below 0.
We now make things more concrete. Suppose that we have a sequence
x₁, x₂, …, xₙ
of n elements.
We start by setting m := x₁ and count := 1. We continue with the second element x₂. If x₂ = m, then we increase count by 1. If not, then we decrease count by 1, which means that it becomes 0. In general, when processing the i-th element xᵢ, for i > 1, we do the following:
- If count = 0, then we set m := xᵢ and count := 1.
- If count > 0, we further do the following check. If xᵢ = m, then we increase count by 1, otherwise we decrease count by 1.
We are essentially done! In case we are guaranteed to have a majority element in the sequence, we simply return m. If there is no such guarantee, then m is the only candidate majority element, and we have to do a second pass in order to check whether m is indeed a majority element (in fact, this second pass is only needed if the final value of count is strictly positive, as will become clear in the next section). This can be done very easily, by simply counting the occurrences of m.
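To see the algorithm in action, here is how m and count evolve on the earlier example sequence 2, 2, 11, 2, 5, 2, 1, 2, 17 (each column shows the state after reading the corresponding element):

element read:   2   2  11   2   5   2   1   2  17
m:              2   2   2   2   2   2   2   2   2
count:          1   2   1   2   1   2   1   2   1

At the end, count = 1 > 0, so m = 2 is the candidate majority element; the second pass finds 5 occurrences of 2, which is more than 9/2, so 2 is indeed the majority element.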
The whole algorithm runs in O(n) time and uses O(1) extra space, or more precisely, O(1) extra memory slots.
Proof of correctness
The main idea behind the algorithm is the following. At a high level, we “pair” elements that are not equal, from left to right, and we cross them out in pairs. After an iteration, we have a value m and a number count; this means that we have count many appearances of the element m that have not been crossed out yet. Then, if at the next iteration we encounter some element different than m, we pair it with the first (leftmost) appearance of the element m that has not yet been crossed out, and we cross both elements out. This is the reason we decrease the variable count by 1 in this case. At the end, we can be certain that if there is a majority element, then since it appears more than n/2 times, there are not enough elements to cross out all of its occurrences, and thus it must be the element stored at variable m.
We now make things more formal. We use the notion of pairs introduced in the previous paragraph, and we view the algorithm as one that creates pairs of unequal elements. We will show, by using mathematical induction, that for every i,
After iteration i, the only element that has not been paired yet is the one stored at variable “m”, and moreover, we have exactly “count” many unpaired copies of that element.
We are ready to do the induction (in case you need a refresher on mathematical induction, you can have a quick look at a previous article of mine here).
- i = 1 (base case). After processing the first number, we have that m = x₁ and count = 1. It is clear that we have exactly 1 unpaired copy of the element stored at variable m.
- from i-1 to i (induction step). Let i ≥ 2, and suppose that the induction hypothesis holds for all iterations up to i-1. This means that after iteration i-1, the only element that has not been paired yet is the one stored at variable m, and moreover, we have exactly count many unpaired copies of that element. The algorithm now processes element xᵢ. We do some case analysis. If count = 0, then all elements before xᵢ have been paired, and thus the only unpaired element after iteration i is xᵢ. Since the algorithm sets m = xᵢ and count = 1 in this case, the statement indeed holds after iteration i. So, suppose that count > 0 at the beginning of iteration i. In this case, if m is equal to xᵢ, then we have one extra copy of the element stored in m that is unpaired. Since the variable count is increased by 1 in such a case, the statement holds after iteration i. Finally, if m is not equal to xᵢ, then we can pair one copy of the element stored in m with xᵢ, and so the total number of unpaired copies of the element stored in m decreases by 1. Since the variable count is decreased by 1 in such a case, the statement again holds after iteration i. We are done!
The above proof shows that at the end of the algorithm, we have managed to pair up unequal elements, and the only element with unpaired copies, if any, is the one stored at variable m; such unpaired copies exist if and only if count > 0.
To conclude the proof (and see why this pairing of elements was, conceptually, very convenient), we observe that if an element is a majority element in a sequence, then it is impossible to pair all of its occurrences with distinct elements, exactly because it is a majority element (which means that there are not enough other elements to create these pairs). Thus, at the end of the algorithm, the variable m will necessarily contain the majority element!
Sample implementation of the Boyer-Moore algorithm in C
Here, we provide a very straightforward implementation of the Boyer-Moore algorithm in plain C!
/* the function that computes the majority element, if any, and
   stores it in the variable "majority_element".
   the flag "majority_check" is set to 1 if there exists a majority
   element, otherwise it is set to 0. */
void Boyer_Moore(int n, int *x, int *majority_element, int *majority_check)
{
    // n denotes the number of elements
    // x is the array of elements, indexed from 1 to n.
    int m;     // the variable "m"
    int count; // the variable "count"
    int i;

    m = x[1];
    count = 1;
    for (i = 2; i <= n; i++)
        if (count == 0) {
            m = x[i];
            count = 1;
        }
        else {
            if (m == x[i])
                count = count + 1;
            else
                count = count - 1;
        }

    // the element in "m" is now the only candidate majority element
    if (count == 0) {
        *majority_check = 0;
        return;
    }
    count = 0;
    for (i = 1; i <= n; i++)
        if (x[i] == m)
            count = count + 1;
    if (count > n/2) {
        *majority_element = m;
        *majority_check = 1;
    }
    else
        *majority_check = 0;
}
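For completeness, here is a small, hypothetical usage example (to be placed after the function above). To match the 1-to-n indexing convention, the array is allocated with one extra slot so that the elements of the earlier example sequence occupy positions 1 through 9.

#include <stdio.h>

int main(void)
{
    /* the example sequence from earlier, stored at positions 1..9 */
    int x[10] = {0, 2, 2, 11, 2, 5, 2, 1, 2, 17};
    int majority_element, majority_check;

    Boyer_Moore(9, x, &majority_element, &majority_check);
    if (majority_check)
        printf("majority element: %d\n", majority_element);
    else
        printf("no majority element\n");
    return 0;
}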