Tuesday, May 06, 2008

Shades of Paulos

There's been a discussion involving statistics over at Patterico's Pontifications.

Here, here, here, and here.

The discussion is prompted by an article in the Los Angeles Times. It seems there was a rape and murder that took place in the 1970s. Some DNA from the crime scene was analyzed, and tested against a database of 338,000 criminal offenders.

There was one match -- to one John Puckett.

Now, to be fair, Puckett had a history of rape. He was working in the victim's neighborhood at the time the rape occurred, and his MO was compatible (at the very least) with the details of the crime.

However, the DNA sample from the crime scene was badly deteriorated, and the match was made based on 5 genetic markers, fewer than half of the usual 13.

Puckett insisted he was innocent, saying that although DNA at the crime scene happened to match his, it belonged to someone else.

At Puckett's trial earlier this year, the prosecutor told the jury that the chance of such a coincidence was 1 in 1.1 million.

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett's case, it was 1 in 3.

The case Puckett's defenders make: If you troll through a database containing 338,000 DNA samples, your chance of matching any one entry in the database is 1 in 1.1 million. But you're making 338,000 attempts, and experts think your odds of getting any match with someone in the database is 338,000 in 1.1 million. 338,000 divided by 1.1 million is 30.7% -- pretty close to 1 in 3.

Whatever anyone may think about the strength of the case against Puckett, the LATimes article does make a valid point. Most people don't have a clue about probability. And this is not just a point about jurors -- it applies to lawyers, judges, and even many expert witnesses.

I happen to have a pretty good grasp of statistics and probability. (When my statistical thermodynamics professor calls my grasp of statistics "strong", I think that qualifies as "bragging rights".) I propose to comment at some length on the probability calculations involved in this case.

First of all, there's the "lottery question". If you have no particular reason to suspect there will be a match, what is the probability of a 1.1 million-to-one match with at least one member of a group of 338 thousand people? The easiest route is to work out the chance of getting no match at all, which is the probability of not matching any one person, multiplied by itself 338,000 times. Subtract that from 1 and you have the chance of at least one match.

An example with smaller numbers: Suppose you roll four dice. What is the probability that at least one of the dice will roll a six?

Each die has a one-in-six chance of coming up "six". That means each die has five chances in six of coming up something else.

Each die is going to do whatever it does, independent of what any other die does.

So if the first die comes up "six", there is one chance in six the next die will come up "six", and five in six that it will do something else.

And if the first die comes up something other than "six", the next die will have the same chances.

For two dice, the first die has five chances in six of coming up something other than "six". This 5/6 chance is then multiplied by the probability of each path the next die can take, and the same thing applies to each subsequent die.

So the odds that none of the dice come up "six" is 5/6 times 5/6 times 5/6 times 5/6. Multiplied together, that's 625/1296, or about 48%.

Now that's the chance that none of the dice will roll "six". That means there's a 52% chance that at least one of the dice will roll "six".
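For anyone who wants to check the dice arithmetic, here is a minimal Python sketch. The function name `prob_at_least_one` is just an illustrative label, and exact fractions are used so that 625/1296 comes out without rounding.

```python
from fractions import Fraction


def prob_at_least_one(p_single, trials):
    """Chance that an event with per-trial probability p_single
    happens at least once in `trials` independent trials."""
    p_none = (1 - p_single) ** trials  # every trial misses
    return 1 - p_none


p_six = Fraction(1, 6)            # one die coming up "six"
p_none = (1 - p_six) ** 4         # none of the four dice roll "six"
p_any = prob_at_least_one(p_six, 4)

print("no sixes:        ", p_none, "=", round(float(p_none), 3))
print("at least one six:", p_any, "=", round(float(p_any), 3))
```

The same function works for any number of independent trials, which is all the database question really needs.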

This includes rolling one "six", two "sixes", three "sixes", and all four rolling "six". Calculating the chance of all four rolling "six" is easy: 1/6 times 1/6 times 1/6 times 1/6. That's 1/1296. Calculating the odds for exactly one, exactly two, and exactly three is a bit harder and I'd rather not deal with it here. Fortunately, someone else has dealt with it elsewhere.
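The "exactly one, two, or three" cases follow the standard binomial formula, and a short Python sketch can do the bookkeeping. The helper name `prob_exactly` is an assumption made for illustration; `math.comb` counts the ways to choose which dice show "six".

```python
from fractions import Fraction
from math import comb


def prob_exactly(k, n, p):
    """Binomial probability of exactly k successes in n independent
    trials, each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_six = Fraction(1, 6)

# Probability of exactly 0, 1, 2, 3, and 4 sixes among four dice.
dist = {k: prob_exactly(k, 4, p_six) for k in range(5)}

for k, prob in dist.items():
    print(k, "sixes:", prob, "=", round(float(prob), 4))

# The five cases cover every outcome, so they should add up to 1.
print("total:", sum(dist.values()))
```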

Now, when you get into very large numbers, this math gets to be a major headache. Fortunately, mathematicians have risen to the occasion, and come up with alternative formulations that are pretty much spot-on. One relies on what is known as the Poisson distribution.

The Poisson distribution is used when you have a small probability of an event happening, and a large number of chances for it to happen. This can include events like getting a million-to-one genetic match with a random person, or a crowd of random people. Mathematically, it takes the average number of times you would expect the event to happen in a particular domain, P, and calculates the probability of that event happening some given number of times in that domain.

In the case of the cold hit on the criminal database, P is 338 thousand, divided by 1.1 million, or about 0.307. Poisson's formula gives a distribution, expressed as

$$1 = e^{-P}\left[1 + P + \frac{1}{2}P^2 + \frac{1}{6}P^3 + \frac{1}{24}P^4 + \cdots\right]$$

or, taking one term at a time,

$$P_i = e^{-P}\,\frac{P^i}{i!}$$

This last equation says that the probability of finding some number of occurrences of the event equal to i in a particular region is equal to Euler's number (e) raised to the power of -P, times P raised to the i power, divided by the factorial of i. (The factorial of a number N is 1 X 2 X 3 X ... X N. Factorials get very large in a hurry.)

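In particular, the chance of having no hits in the database, at random, is e raised to the power of -P, with P = 338,000 / 1.1 million, or about 0.3073. That works out to $e^{-0.3073} \approx 73.5\%$. Subtracting this from 100% gives us the chance of having at least one hit, which is 26.5%.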

That means, if you ran a DNA sample against a database containing 338,000 random people, you have about one chance in four of getting a "hit" at random. So if it turns out the real perpetrator is in some other group, you still have about one chance in four of getting a "hit" on someone who is not the perpetrator, just by chance.
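As a sanity check, the Poisson shortcut can be compared against the brute-force "multiply it out 338,000 times" calculation. This is only a sketch: the constant names and the `poisson_pmf` helper are made up for illustration, and it assumes every database entry is an independent 1-in-1.1-million shot.

```python
import math

DATABASE_SIZE = 338_000      # profiles searched
MATCH_ODDS = 1_100_000       # "1 in 1.1 million" random-match figure

p_single = 1 / MATCH_ODDS
expected_hits = DATABASE_SIZE * p_single   # the Poisson mean, about 0.307


def poisson_pmf(i, mean):
    """Poisson probability of exactly i events when `mean` are expected."""
    return math.exp(-mean) * mean**i / math.factorial(i)


# Brute force: nobody in the database matches.
p_none_exact = (1 - p_single) ** DATABASE_SIZE
# Poisson shortcut: the i = 0 term of the distribution.
p_none_poisson = poisson_pmf(0, expected_hits)

print("expected hits:      ", round(expected_hits, 4))
print("no hit (exact):     ", round(p_none_exact, 4))
print("no hit (Poisson):   ", round(p_none_poisson, 4))
print("at least one hit:   ", round(1 - p_none_poisson, 4))
```

The two "no hit" figures agree to several decimal places, which is why the shortcut is safe to use with numbers this size.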


There was some question as to whether one could address this question using Bayes' Theorem. If we know the chance of some event B, given A, Bayes' Theorem lets us calculate the chance of A, given that we know B applies. I'm pretty sure one could, but I'd rather not.

It's possible to spend quite a bit of time thrashing around with probabilities in Bayes' Theorem if you're not sure which numbers to assign to which probabilities. In this case, I think it's easier if you go to the derivation of the theorem from conditional probabilities. At least it is if you studied sets in grade school. (It seems to have been part of the "new math" fad.) I learned to think about sets by drawing diagrams, most of which wind up looking like the MasterCard logo.

These are two sets, A and B, inside a universe containing all the relevant possibilities. Each set takes up part of the universe, and their areas represent the probability that any member of the set {U} will be found in each set. (Not to scale.) The probability of a member of the universal set being in {A} is simply the number of elements in {A} divided by the number of elements in {U}. If {A} is the set of people with Rh-negative blood, in a universe of 100 people, you’ll find 15 who are in {A}. If {B} is the set of people with blood type B, in that same universe of 100 people, you’d expect (I think) 30 people to be members of {B}.

You’ll notice the area where the sets overlap is a different color. This area is known as the set intersection, and it represents the members of the universe that belong to both {A} and {B}. (I’m writing it here as (A*B).) In the case of the blood types, the intersection of {A} and {B} would contain about 4½ people (15% of the 30 people in {B}, assuming Rh factor and blood type are independent of each other).

We can assess the conditional probability of a member of the universe being in set {A}, if we know it’s in set {B}. That’s given by

P(A|B) = P(A*B) / P(B)

If you have the diagram in front of you, with the corresponding probabilities, you can easily write down the conditional probabilities. Bayes’ Theorem comes into play when you don’t have those numbers, and have to work them out from other probabilities.
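Here is a small Python sketch of reading a conditional probability straight off the sets. To avoid half a person, it assumes a universe of 200 people rather than 100, with the same 15% and 30% proportions; the particular ID ranges are arbitrary and chosen only so that 9 people (4½%) land in both sets.

```python
from fractions import Fraction

universe = set(range(200))        # hypothetical universe {U} of 200 people
rh_negative = set(range(0, 30))   # set {A}: 30 people, 15% of {U}
type_b = set(range(21, 81))       # set {B}: 60 people, 30% of {U}


def prob(members):
    """Probability that a random member of the universe is in `members`."""
    return Fraction(len(members), len(universe))


def conditional(a, b):
    """P(A|B) = P(A*B) / P(B), where A*B is the set intersection."""
    return prob(a & b) / prob(b)


print("P(A)   =", prob(rh_negative))
print("P(B)   =", prob(type_b))
print("P(A*B) =", prob(rh_negative & type_b))
print("P(A|B) =", conditional(rh_negative, type_b))
print("P(B|A) =", conditional(type_b, rh_negative))
```

Because the two traits were set up to be independent here, P(A|B) comes out equal to P(A); with overlapping sets that are not independent, the two would differ.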

Here’s another diagram:

In this case, set {A} is completely inside set {B}. Here, set {B} might represent the set of people who match a particular DNA sample, and {A} represents the set of people who actually committed the crime. This is a no-brainer. We know that not everyone will match the DNA sample. Indeed, the odds are very long, and if we were drawing this diagram to scale, set {B} would be a tiny speck in one corner of {U}. Likewise, we would expect the actual perpetrator’s DNA to match the sample. I’ll even stipulate that there are no false negatives here – the perpetrator will match and that’s that.

The conditional probability changes a bit. Since {A} is now entirely inside {B}, {A*B} is exactly equal to {A}. Every member of {A} is a member of both {A} and {B}, which was the definition of the set intersection in the first place.

P(A|B) = P(A) / P(B).

The probability that a person who matches the DNA sample is the perpetrator, based only on there being a positive match, is equal to the number of members in {A} divided by the number of members in {B}. So, if there are 6000 people in the world who would match a particular DNA sample, the chance that any one of them is the actual perpetrator is one in 6000.
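A quick Python sketch shows why the size of the universe drops out of this calculation. The function name and the population figures are assumptions for illustration only; the 6000 matching people come from the example above.

```python
from fractions import Fraction


def prob_perpetrator_given_match(population, num_matching, num_perpetrators=1):
    """P(A|B) = P(A) / P(B) when set {A} lies entirely inside set {B}."""
    if num_perpetrators > num_matching:
        raise ValueError("{A} must fit inside {B}")
    p_a = Fraction(num_perpetrators, population)  # P(A): is the perpetrator
    p_b = Fraction(num_matching, population)      # P(B): matches the sample
    return p_a / p_b


# Two very different hypothetical universe sizes, same 6000 matching people.
answer_world = prob_perpetrator_given_match(6_000_000_000, 6000)
answer_city = prob_perpetrator_given_match(1_000_000, 6000)

print(answer_world)
print(answer_city)
```

Both calls give the same fraction, since the population appears in P(A) and P(B) alike and cancels.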

We actually don’t need Bayes’ Theorem for this at all, just a little set theory and a little common sense.
