Research

The Origin of Group Testing

How to test N people with less than N tests

Lorenzo Duso

16 Sep 2020 — 7 min read

The invention of group testing dates back to WWII, as the U.S. army wanted to weed out all syphilitic men called up for enlistment. The presence of syphilis could be identified with a blood test. However, individually testing the blood sample of each new recruit costed lots of time and resources. Thus, the U.S. army had to face the following problem:

Is it possible to test N people with less than N tests?

The statistician Robert Dorfman had a brilliant idea to minimize the number of blood tests required. He proposed that:

The blood samples of k people can be pooled and analyzed together. If the test is negative, this one test suffices for the group of k people. Instead, if the test is positive, each of the k persons must be tested again separately, and in total k+1 tests are required for the group of k people.

The principle of group testing is summarized in the figure below.

Visual comparison between Individual Testing and Group Testing. In Individual Testing (left), each person is tested independently and receives an independent outcome. This strategy requires as many tests as there are people. In Group Testing (right), the samples of a group of people are pooled together and analyzed with one single test. If the group result is negative, all the individuals of the group can be considered negative. Instead, if the group result is positive, an additional test is needed for each person of that group. Group Testing can potentially save lots of tests and time. Image by author.

What’s the performance of group testing? Dorfman showed that it depends on the probability p that a recruit is infected and, importantly, on the choice of the group size k.

Here, we will reproduce Dorfman’s computations in order to evaluate the performance of group testing. The following steps are needed:

Compute the group-outcome probabilities
Calculate the total number of tests required
Optimize the group size k, for a given prevalence rate p

Note: the practical effectiveness of group testing depends also on the sensitivity of the test, which is specific for the considered illness. In the following, we will assume that test outcomes are always reliable.

1. Group Probabilities

Let p denote the probability that a single person is infected. This corresponds to the prevalence rate of the considered disease in the population (it was syphilis for Dofrman’s analysis, or Covid today…). Thus, p is the independent variable of the problem and is assumed to be known. It is convenient to define q=1-p as the probability that a person is healthy.

The pooled test for a group of k people will be negative if all of them are healthy. We find the probability P(-) of this event as:

Probability of a negative outcome for a pooled test of k people. Image by author.

Conversely, the test outcome of the group is positive if at least one person is infected. The probability P(+) of this scenario is complementary to P(-):

Probability of a positive outcome for a pooled test of k people. Image by author.

We can think of the test outcome of each group as a coin toss, which decides whether the group is healthy or needs to undergo the individual test round. More precisely, we can say that the outcome is a Bernoulli random variable with “success probability” equal to P(+).

2. Expected Number of Tests

Given a total of N recruits, we can now compute the number of tests required within the group strategy. This number is random, since it is subject to the probabilistic outcomes of all groups (see footnote¹). Nevertheless, we can study its expected value.

First of all, each of the N/k groups receives one initial test. Next, whenever a group gets a positive outcome, k additional tests have to be performed. Overall, the expected number of tests, E[T], reads

Computation of the expected number of tests within the group testing strategy. Image by author.

We see that the result is proportional to N, the total number of recruits. Thus, the term f(k,p) can be interpreted as the expected number of tests per person: the initial test counts as 1/k because is shared with the group, and then an individual test occurs with probability P(+).

Whenever f(k,p) is less than 1, group testing is more efficient than performing Nindividual tests. The following plot illustrates that, for each prevalence rate p, there is a group size k* for which the number of tests is minimized.

Plot of the expected number of tests per person as a function of the group size k for different prevalence rates p. The black arrows indicate the optimal group size k* that minimizes the average number of tests for each choice of p. Image by author.

By following the trend of the arrows in the plot, we can appreciate that

the rarer the prevalence of the disease, the larger the optimal group size gets.

This makes sense because, with lower p, it will be more likely that larger groups will get a negative result: so, we can save on tests by pooling more people together.

Next, our goal is to derive a formula to determine the optimal group size k*.

3. Optimal Group Size

For a given prevalence rate p, the optimal group size k* is found by minimizing f(k,p), the expected number of tests per person. Therefore, we set the partial derivative of f(k,p) with respect to k equal to zero:

Derivation of the equation for the optimal group size. Image by author.

Unfortunately, this equation does not admit an explicit solution. However, since the value of p is typically small, we can proceed with a Taylor expansion around p=0 as follows

Approximation of the optimal group size k* for small p. Image by author.

The obtained approximation is of great practicality. For small p,

the optimal group size k* is well approximated by the inverse of the square root of p.

For instance, we find k*=5 for prevalence rate 5%, k*=10 for p=1%, and k*=32 for p=0.1%.

4. Employing Optimal Group Testing

We can explicitly check how many tests E[T*] are required on average with the optimal group size by substituting the approximated formula for k* into E[T]. By Taylor expanding again for small p, we get:

Approximation of the optimal expected number of tests. Image by author.

Whenever p<1/4, we see that E[T*]<N and, therefore, group testing is more efficient than individual testing (²). Since 1/4 is a remarkably high prevalence rate, this basically implies that group testing is a better solution in most of the practical situations.

We can appreciate that the efficiency of group testing is higher and higher the smaller the prevalence rate p is. For instance, considering N=1000recruits, the expected number of tests would be

E[T*]=400 for prevalence rate p=5% and k*=5.
E[T*]=200 with p=1% and k*=10.
E[T*]=63 for p=0.1% and k*=32.

All of these cases achieve a much lower number than N=1000 individual tests. Group testing is smart.

We can plot the approximated optimal number of tests per person f(k*(p))=E[T*]/N=2/k* and appreciate how it passes through the minima of the curves f(k,p).

For each prevalence rate p, the optimal group size k* minimizes the expected number of tests per person. The black line passing through the minima shows that the provided approximation works well. Image by author.

Note that f(k*(p)) can be also interpreted as the overall relative testing cost of the group strategy. For example, for p=1% we can save 80% of the tests.

Summarizing, we found that:

1) Group testing is incredibly more efficient than individual testing, especially with small prevalence rate p.

2) The expected number of tests is approximately 2N/k*, with optimal group size k* given by the inverse of the square root of p .

Dorfman’s Legacy: Group Testing

Robert Dorfman published the blood-testing strategy against syphilis in a short research article in 1943, as he was serving as an operational analyst in the U.S. Army Air Forces. His contribution not only saved millions of lab analyses, but also set the start of the field of applied mathematics named group testing.

Group testing is an object of research still today and has found several applications across different fields of science and engineering.

Footnotes

(¹) Technically, the number of groups to be retested follows a Binomial distribution with number parameter N/k and success probability P(+) as defined above. A Binomial random variable simply describes the amount of positive outcomes out of a given number of independent coin tosses. The expected value of a Binomial random variable turns out to be the number of coins (here, of groups) times the probability P(+) of positive outcome.

(²) Please, note that all results obtained after a Taylor expansion are approximated, although accurate for small prevalence rate p. Some of the values reported above might be slightly different if obtained with an exact numerical optimization.