Top notch — one of the most enjoyable books I’ve ever read and all the more enjoyable for having found it by chance in a bookshop.
The enthusiasm and wonder at the nature of Bayesian statistics is contagious. It has just the right amount of history, humour, and science to keep it varied and interesting.
Highlights
When we make decisions about things that are uncertain - which we do all the time - the extent to which we are doing that well is described by Bayes' theorem. Any decision-making process, anything that, however imperfectly, tries to manipulate the world in order to achieve some goal, whether that's a bacterium seeking higher glucose concentrations, genes trying to pass copies of themselves through generations, or governments trying to achieve economic growth; if it's doing a good job, it's being Bayesian.
You know that if a woman has cancer the mammogram will correctly identify it 80 per cent of the time (it's 80 per cent sensitive) and miss it the other 20 per cent. If she doesn't have cancer, it will correctly give the all-clear 90 per cent of the time (it's 90 per cent specific), but give a false positive 10 per cent of the time.
You get the test. It comes back positive. Does that mean there's a 90 per cent chance you've got breast cancer? No. With the information I've given you, you simply don't know enough to say what your chances are.
What you need to know is how likely you thought it was that you had breast cancer before you took the test.
What Bayes' theorem tells you is how much you should change your belief. But in order to do that, you have to have a belief in the first place.
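To make that concrete, here's a quick sketch of my own (not the book's), assuming, purely for illustration, a prior prevalence of 1 per cent - the passage deliberately leaves the prior unstated, which is the whole point:

```python
# Bayes' theorem for the mammogram example, assuming an illustrative
# prior prevalence of 1 per cent (the passage deliberately leaves it unstated).

def posterior(prior, sensitivity, specificity):
    """P(cancer | positive test)."""
    p_pos_given_cancer = sensitivity          # true positive rate
    p_pos_given_healthy = 1 - specificity     # false positive rate
    p_pos = prior * p_pos_given_cancer + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_cancer / p_pos

print(posterior(prior=0.01, sensitivity=0.80, specificity=0.90))
# ~0.075: even after a positive result, the chance of cancer is only about 7.5%
```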
During the trial of O.J. Simpson, the former American football star, for the murder of his wife Nicole Brown Simpson, the prosecution showed that Simpson had been physically abusive. The defence argued that 'an infinitesimal percentage - certainly fewer than 1 in 2,500 - of men who slap or beat their wives go on to murder them' in a given year.
But that was making the opposite mistake to the prosecutor's fallacy. The annual probability that a man who beats his wife will murder her might be 'only' one in 2,500. But that's not what we're asking. We're asking: given that a man beats his wife, and given that the wife has been murdered, what's the probability that it was her husband who did it?
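Here's how I'd sketch the difference between the two questions, with made-up round numbers chosen only so that the defence's '1 in 2,500' figure comes out right - nothing below is from the book or the trial:

```python
# Illustrative numbers only. Suppose that, per 100,000 battered women in a
# year, 40 are murdered by their abuser and 5 are murdered by somebody else.

murdered_by_abuser = 40
murdered_by_other = 5

# The defence's question: P(a batterer goes on to murder his wife this year).
p_defence = murdered_by_abuser / 100_000                      # 0.0004, i.e. 1 in 2,500

# The relevant question: P(the husband did it | she was beaten AND murdered).
p_relevant = murdered_by_abuser / (murdered_by_abuser + murdered_by_other)

print(p_defence, round(p_relevant, 2))                        # 0.0004 versus 0.89
```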
All decision-making under uncertainty is Bayesian - or to put it more accurately, Bayes' theorem represents ideal decision-making, and the extent to which an agent is obeying Bayes is the extent to which it's making good decisions.
One of Bayes' great contributions to probability theory was not mathematical, but philosophical. So far we've been talking about probability as if it's a real thing, out there in the world. [...]
That is, for Bayes, probability is subjective. It's a statement about our ignorance and our best guesses of the truth. It's not a property of the world around us, but of our understanding of the world.
Frequentist statistics do the opposite of what we've been talking about. Where Bayes' theorem takes you from data to hypothesis - How likely is the hypothesis to be true, given the data I've seen? - frequentist statistics take you from hypothesis to data: How likely am I to see this data, assuming a given hypothesis is true?
A more technical objection was that of the mathematician and logician George Boole, who pointed out that there are different kinds of ignorance. A simplified example taken from Clayton: say that you have an urn with two balls in it. You know the balls are either black or white. Do you assume that two black balls, one black ball, and zero black balls are all equally likely outcomes? Or do you assume that each ball is equally likely to be black or white?
This really matters. In the first example, your prior probabilities are 1/3 for each outcome. In the second, you have a binomial distribution: there's only one way to get two black balls or zero black balls, but two ways to get one of each. So your prior probabilities are 1/4 for two blacks, 1/2 for one of each, 1/4 for two whites.
Your two different kinds of ignorance are completely at odds with each other. If you imagine your urn contains not two but 10,000 balls, under the first kind of ignorance, your urn is equally likely to contain one black and 9,999 whites as it is 5,000 of each. But under the second kind of ignorance, that would be like saying you're just as likely to see 9,999 heads out of 10,000 coin-flips as you are 5,000, which is of course not the case. Under that second kind of ignorance, you know you're far more likely to see a roughly 50-50 split than a 90-10 or 100-0 split in a large urn with hundreds or thousands of balls, even though you're supposed to be ignorant.
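A quick sketch of Boole's two kinds of ignorance, in Python (my own illustration, not Clayton's or the book's):

```python
from math import comb

# Two kinds of ignorance about an urn of n balls:
# (a) every possible count of black balls is equally likely;
# (b) each ball is independently black or white with probability 1/2.

def prior_counts_uniform(k, n):
    """(a) Uniform over the number of black balls."""
    return 1 / (n + 1)

def prior_balls_independent(k, n):
    """(b) Binomial: each ball black with probability 1/2."""
    return comb(n, k) / 2**n

# With two balls, (a) gives 1/3, 1/3, 1/3 and (b) gives 1/4, 1/2, 1/4.
for k in range(3):
    print(k, prior_counts_uniform(k, 2), prior_balls_independent(k, 2))

# With a bigger urn the two kinds of ignorance diverge wildly
# (and the gap only widens at 10,000 balls):
n = 100
print(prior_counts_uniform(1, n), prior_balls_independent(1, n))      # ~0.0099 vs ~8e-29
print(prior_counts_uniform(50, n), prior_balls_independent(50, n))    # ~0.0099 vs ~0.08
```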
But the underlying problem of Bayesian priors is a philosophical one: they're subjective. As we said earlier, they're a statement not about the world, but about our own knowledge and ignorance.
What the Bayesian model seems to say is that whether something is true or not depends on how strongly I believed it before. So if we carry out a study on homeopathy or the Higgs boson and find some positive result, then you might think that the result is very likely to be real, and I might not, and we might both be correct to do so - if our prior probabilities were sufficiently different.
Like [Francis] Galton both [Karl] Pearson and [Ronald] Fisher had what we would now consider pretty unpleasant views. Specifically, they were both big fans of eugenics.
Again, I want to be careful here. Many at the time who were very much on the progressive, liberal end of society were also pro-eugenics. Marie Stopes, the campaigner for birth control, abortion and women's rights, was also a major supporter of eugenics. John Maynard Keynes, the great economist and liberal ... was another. Sidney and Beatrice Webb, George Bernard Shaw, Bertrand Russell, all heroes of the socialist and liberal movements, were in favour of selective breeding of humanity in order to create a better, more perfect society.
Bayesianism treats probability as subjective: a statement about our ignorance of the world. Frequentists treat it as objective: a statement about how often some outcome will happen, if you did it a huge number of times.
In his book Thinking, Fast and Slow - also published in 2011 - the Nobel-prize winning psychologist Daniel Kahneman wrote of priming: 'Disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true'.
But then along came Bem.
Bem's study consisted of several experiments, but we'll focus on just one of them: an entirely unremarkable example of priming in all ways except one. The experiment, like the rest of them, gave a prime and saw how it affected behaviour. In this case, the subjects were primed with a positive or negative word ('beautiful', say, or 'ugly') then were shown images and asked to press a button, as quickly as possible, to indicate whether the image shown was pleasant or unpleasant. [...]
The big twist, though, was that in half of the trials, the subjects were given the prime after they had been shown the image. And - this is the important bit - the priming worked. People were quicker to indicate that pleasant images were pleasant when they were given a positive word, even though the word didn't appear until they'd made their choice.
The finding was statistically significant - a p-value of 0.01; enough, by modern standards, to reject the null hypothesis. And Bem suggested that this was evidence for 'psi' - for psychic powers, clairvoyance. The other eight experiments in the study, using subtly different methods but all of them essentially social-psychology staples with the time order reversed, similarly achieved significance.
Most of us would probably agree that psychic powers don't exist. But here was an apparently well-carried-out study, which appeared to find - nine times! - that they do.
The third great blow of 2011 [replication crisis] came in the form of a paper called 'False Positive Psychology' by the psychologists Joseph Simmons, Leif Nelson and Uri Simonsohn. [...]
In the study, they asked twenty undergraduates to listen to a song - either 'When I'm Sixty-Four' by the Beatles, or 'Kalimba' by Mr Scruff. Then they compared the ages of the two groups. It turned out that people who had listened to 'When I'm Sixty-Four' had become nearly eighteen months younger. Again, it was statistically significant: p=0.04.
Once more I think most people would agree that it is unlikely that listening to the Beatles actively makes people younger - not simply makes them feel younger, but makes their birthdays become more recent. The result cannot be real. And yet, once more, the False Positive Psychology paper proved it to be real, to the standards of modern social science, and only used the same statistical methods that other scientists were using every day.
The easiest way to get a p<0.05 result - that is, something that you'd only see by coincidence one time in twenty - is to do twenty experiments, and then publish the one that comes up significant. That's exactly what the False Positive Psychology people did...
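The arithmetic behind that is easy to check:

```python
# The chance of at least one spuriously "significant" result when you run
# twenty independent tests of true null hypotheses at the p < 0.05 threshold.

p_at_least_one_false_positive = 1 - 0.95**20
print(round(p_at_least_one_false_positive, 2))   # 0.64
```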
The aforementioned Daryl Bem, in a 1987 book chapter written as a guide to help students get their research published, wrote that 'there are two articles you can write: the article you planned to write when you designed your study; the article that makes the most sense now that you have seen the results. The correct answer is the second one'.
You might be wondering what all this has to do with Bayes' theorem. [...]
A p-value is not, as we've discussed, a measure of how likely it is that your hypothesis is true, given your data. It's a measure of how likely it is that you would see that data, given a certain hypothesis. But - as Bayes noted, and as Laplace later fleshed out - that's not enough. If you want to measure how likely it is that your hypothesis is true, you simply cannot avoid priors. You need Bayes' theorem.
What's perhaps even more astonishing than that is that another study looked at thirty 'introduction to psychology' textbooks and found that twenty-five of them gave a definition of 'statistical significance', and that twenty-two of those twenty-five were wrong. Again, the most common error was that they assumed a p-value gave the probability that the results were due to chance. This is completely backwards. What p-values tell you is how likely you are to see that data, given a hypothesis.
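To see why the prior is unavoidable, here's a toy two-hypothesis calculation of my own (illustrative numbers, not the book's): the same result, summarised as a likelihood ratio of 10 in favour of the hypothesis, lands in very different places depending on where you started.

```python
# Illustrative numbers only: the same experimental result, summarised as a
# likelihood ratio of 10 in favour of the hypothesis, gives wildly different
# posteriors depending on the prior.

def posterior(prior, likelihood_ratio):
    """Two-hypothesis Bayes via odds: posterior odds = prior odds x likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

for prior in (0.5, 0.01, 1e-6):   # say: a plausible effect, homeopathy, psi
    print(prior, posterior(prior, likelihood_ratio=10))
# 0.5 -> ~0.91, 0.01 -> ~0.09, 1e-6 -> ~0.00001
```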
There are some fun things that fall out of the Bayesian decision system. One is that you can't go looking for new evidence to support your theory; it is impossible, because any evidence you find must (in expectation) be as likely to reduce your belief as increase it, and if you don't find evidence, that is in fact evidence against your hypothesis. [...]
The temptation is to think that if I find something, it will increase my confidence; but that if I don't find anything, it will have no effect. But that's not how it works. If some piece of evidence would shift your belief by some amount, then the absence of that evidence must shift your belief in the opposite direction, and by an amount proportionate to how strongly you expected the evidence. [...]
Again, this is unavoidable. If some evidence is strongly expected, then it can't move your beliefs very much; it's already part of the model of the world that you've built. But if something really unexpected happens - or, in this case, if something expected doesn't happen - it should move your posterior belief significantly.
If your reasoning doesn't work like this - and for a lot of us, it doesn't, especially on political questions, because we're prone to confirmation bias and groupthink - you are simply not making good use of evidence; you are not updating your beliefs in the best possible way.
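That 'conservation of expected evidence' point is easy to verify numerically; here's a sketch with illustrative numbers of my own:

```python
# Conservation of expected evidence, with illustrative numbers: before you look,
# the expected value of your posterior equals your prior, so the hoped-for boost
# from finding the evidence is exactly offset by the hit from not finding it.

prior = 0.3                   # P(hypothesis)
p_e_given_h = 0.9             # P(evidence | hypothesis)
p_e_given_not_h = 0.2         # P(evidence | not hypothesis)

p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
posterior_if_found = prior * p_e_given_h / p_e
posterior_if_absent = prior * (1 - p_e_given_h) / (1 - p_e)

expected_posterior = p_e * posterior_if_found + (1 - p_e) * posterior_if_absent
print(posterior_if_found, posterior_if_absent, expected_posterior)
# ~0.66 if the evidence turns up, ~0.05 if it doesn't, and 0.30 on average: the prior
```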
A famous 1978 study asked sixty medics - twenty medical students, twenty junior doctors and twenty more senior doctors - at Harvard Medical School the following question: 'If a test to detect a disease whose prevalence is 1/1000 has a false-positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person's symptoms or signs? [...]
Of the million, 1,000 will have the disease and 999,000 won't. Of the 999,000, our test will return false positives on 49,950. So assuming it correctly identifies all 1,000 who do, anyone who has a positive test will have a slightly less than 2 per cent chance of having the disease (1,000 / (49,950 + 1,000) ≈ 0.02).
This seems like something important for doctors to be able to work out. But the 1978 study found that only eleven of the sixty medics gave the right answer (and those eleven were evenly spread among the groups: the students did no worse than the senior doctors). Nearly half said 95 per cent: that is, they failed to take base rates into account at all.
In general, it's best to think of human biases as products of our mental heuristics, shortcuts that allow us to do what would otherwise be extraordinarily complex maths. [...]
A straightforward example is catching a thrown ball. Doing the maths [...] would be astonishingly complicated...
But that's not what we're doing. When a cricket fielder sees the ball [...] she's not doing differential calculus. What she's doing is using a simple shortcut called the gaze heuristic. The psychologist Gerd Gigerenzer describes it like this: 'Fix your gaze on the ball and start running, and adjust your running speed so that the angle of gaze remains constant'. No mathematics are involved at all...
Imagine you meet a mathematician, and she tells you that she has two children. You ask if at least one of them is a boy. (It's a strange question, but this problem is incredibly sensitive to tiny changes in wording so I have to be careful). She says, yes, at least one of my children is a boy. What's the chance that she is the mother of two boys?
It obviously should be fifty-fifty. The other child is either a girl or a boy! It doesn't matter what the one you know about is! But ... it's not. The chance is 1/3 again.
As you may be able to tell, this drives me crazy. But it's very much unavoidable. Just as Fermat and Pascal realised nearly 400 years ago, what matters is the number of possible outcomes (assuming that all those outcomes are equally likely). There are four possible pairs of children that a mother of two might have: girl-girl, girl-boy, boy-girl and boy-boy. They're all, to a first approximation, equally probable.
If you know that at least one of the children is a boy, but you don't know which one, then you've ruled out one of those combinations - girl-girl. Girl-boy, boy-girl and boy-boy all remain. You've already got one boy in the bank, as it were. So the other child is either a girl, a girl or a boy. The unknown child is twice as likely to be a girl as a boy. (Which I find weird, because it's like there's some strange quantum effect where knowledge of one child affects the sex of the other.)
Once again, this is sensitive to far more than just the bare fact that there's a boy. If you know that the elder child is a boy, you've ruled out two possibilities - girl-girl and girl-boy. So now there are only two remaining outcomes, boy-girl and boy-boy. The posterior probability of the other child being a boy is 50 per cent.
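If, like me, you don't quite trust the verbal argument, brute-force enumeration settles it (assuming the four birth orders are equally likely):

```python
from itertools import product

# The four equally likely birth orders, as (elder, younger).
families = list(product("BG", repeat=2))

# "At least one is a boy": girl-girl is ruled out, three cases remain.
at_least_one_boy = [f for f in families if "B" in f]
p_both_boys_given_one_boy = at_least_one_boy.count(("B", "B")) / len(at_least_one_boy)

# "The elder is a boy": girl-girl and girl-boy are ruled out, two cases remain.
elder_is_boy = [f for f in families if f[0] == "B"]
p_both_boys_given_elder_boy = elder_is_boy.count(("B", "B")) / len(elder_is_boy)

print(p_both_boys_given_one_boy, p_both_boys_given_elder_boy)   # 1/3 versus 1/2
```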
In fact, everything you perceive about the world is due to Bayes' theorem. Perception and consciousness itself is - in quite a direct sense - Bayesian.
We know that 'we' are brains sitting inside bone cavities, connected to the outside world only by fleshy strings of nerves that are linked to sensory organs. What the Bayesian brain model says is that perception is a two-way street: information travels 'up' from our senses, yes, but it also travels 'down' from our internal model of the universe. Our perception is the commingling of that bottom-up stream with the top-down one. The two constrain each other - if the top-down priors are strong, then it requires precise, strong evidence from the senses to overturn them.
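That 'strong priors need strong evidence' trade-off is just precision weighting. Here's a minimal sketch using a standard Gaussian update - my own illustration, not a model from the book:

```python
# A minimal sketch of precision weighting, using a standard Gaussian update:
# the percept is a precision-weighted average of the top-down prior and the
# bottom-up sensory signal.

def percept(prior_mean, prior_precision, sense_mean, sense_precision):
    """Posterior mean when a Gaussian prior meets a Gaussian sensory likelihood."""
    total_precision = prior_precision + sense_precision
    return (prior_precision * prior_mean + sense_precision * sense_mean) / total_precision

# Strong prior, noisy senses: the percept barely budges from what was expected.
print(percept(prior_mean=0.0, prior_precision=10.0, sense_mean=1.0, sense_precision=1.0))  # ~0.09
# Weak prior, precise senses: the percept follows the data.
print(percept(prior_mean=0.0, prior_precision=1.0, sense_mean=1.0, sense_precision=10.0))  # ~0.91
```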
Immanuel Kant, in the eighteenth century, said that the universe as it truly is must be unknowable, and all we ever know is the world through our senses: he made a clear distinction between phenomena, our perceptions of objects, and noumena, the things-in-themselves. More than that, he foreshadowed the Bayesian model of the brain: he argued that our brains must have pre-embedded conceptual frameworks with which to make sense of the world, or the data coming from our senses would be a meaningless jumble. We must, in modern language, have had priors. We don't just passively perceive the world: we construct it, or a model of it.
This idea was taken further by the nineteenth-century German polymath Hermann von Helmholtz [...] But his great insight was that we cannot perceive the world as it truly is - we're not fast enough.
Our nervous system was known, at the time Helmholtz was working, to be electrical in nature. Electricity was known to travel extremely fast - the speed of light - so it was assumed that nerve signals travelled from our sense-organs to our brains essentially instantaneously. Helmholtz's professor told him not to bother trying to measure it. But Helmholtz did so anyway, and discovered - to everyone's surprise - that nerve signals travel embarrassingly slowly: about fifty metres per second, or 180 kilometres an hour. He also measured the time it took someone to respond to a sensation, such as a touch on the arm, by pressing a button as quickly as possible, and found that the time from sensation to reaction was more than a tenth of a second. This showed, he argued, that it is impossible that our perception of the world is real and immediate. It can't be, for the simple reason that the information in the world can't get to us quickly enough. If perception were direct, then we'd be constantly seeing the world a small but appreciable moment behind events. [...]
Helmholtz argued, therefore, that our apparently effortless, apparently instantaneous perception of the world must be an illusion.
In the 1970s, the British psychologist Richard Gregory built on Helmholtz's work. He suggested that our perceptions are essentially hypotheses - he explicitly drew an analogy with how the scientific process makes hypotheses about the world - and we test those hypotheses with our senses. He used a series of optical illusions to demonstrate his point. Optical illusions, he argued, are not just defects in our perception - they are created by the way our brains manufacture the model of the world.
The neuroscientist Chris Frith says that our perception of reality is a controlled hallucination.
Imagine that I'm looking at a coffee cup on my desk [...] The intuitive model of perception says that I perceive the coffee cup through bottom-up signals - that is, signals come in through my eyes, like a television camera transmitting pixels to a TV screen in our brain: signals of basic features of reality like colour, lines, shape. Lower-level processing in our brain takes those features and builds them into ever more complicated ideas, which are then compared against memories and knowledge of the world and assigned labels like 'mug' and 'coffee'.
That model of bottom-up perception drove a lot of cognitive science for many years. But now the understanding is that it goes something like this.
Instead of our image of the world coming in from our senses, our brains are making it up, constantly. We build a 3D model around ourselves. We're predicting - hallucinating - the world. There's not just a bottom-up stream of information - there is, vitally, a top-down one, as well. Higher-level processing in our brain sends a signal down, towards our nerve receptors, telling them what signals to expect. [...]
They're guesses, predictions, hypotheses, cascading down from the conceptually complex higher levels to the utterly minimalist nerve-signal levels.
At the same time, signals are coming up from those nerves: these nerves are firing this many times [...] The coffee cup is where the coffee cup should be, so there's no need to send any signals further up the chain. The hallucinated scene around me can stay in place. [...]
When the predicted pattern doesn't match the received one, the low-level processor bumps the problem one place higher up the chain. If the slightly-higher-level processor can explain it, then it will do so, and will send new signals back down the chain. If it can't, it sends its signal higher up, until it reaches the high-level areas which can explain it with the conceptually complex understanding that I finished my coffee a quarter of an hour ago...
What's important, then, is not the signals coming up from my nerves, per se, but the difference between those signals and the predictions cascading down from my higher-level brain regions. The crucial phrase is prediction error, that difference between expectations and results. In this framework, your brain is constantly trying to minimise prediction error, making its model as close to reality as possible by updating it as new signals come in.
The key thing is that the brain hates prediction error. It wants to minimise the difference between its predictions and its sense data. It really wants its predictions to be right. [...]
Something grabs your attention when the bottom-up data about it coming from your senses doesn't match the top-down prediction coming from your brain - sending a loud, urgent signal right to the top.
The trouble is that the nerve signals we receive from the world [...] are dependent not just upon changes in the world, but upon changes in our bodies. If a horizontal line of retinal cells fire in sequence, that could be because a bright light has moved from right to left in front of me. Or it could be because I've turned my head [...] So our brains not only need to predict the signals coming from the world - they also need to predict how the signals coming from the world would change, if we performed some action. Then they need to subtract those predictions from the predictions of how the world itself is changing, to give the impression of a stable reality.
But there's more to it than this. The brain wants to reduce prediction error, as we've seen. It can do that by changing its beliefs to match the world [...] But it can also change the world in order to match its beliefs.
We've seen that our experience of the world is actually our prediction of the world - our Bayesian prior - rather than the content of our senses, although it is constrained by the data from our senses. [...] sometimes changes in your sense data will be caused by changes in the outside world, and sometimes they'll be caused by your own movements. You need to be able to tell the two apart, and discount the latter, so that you get a sense of a stable world which you can detect movement in. [...]
'When you move,' Frith says, 'the movements you cause are suppressed, leaving the movements that you have not caused, which are usually more important'.
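A toy version of that subtraction, just to make the idea concrete (my own illustration, not the book's):

```python
# The retinal signal mixes motion in the world with motion I caused by turning
# my head; the brain's estimate of the world is the signal minus its prediction
# of the self-caused part.

retinal_shift = 5.0                 # degrees of image motion actually registered
predicted_from_head_turn = 5.0      # what the brain expected its own head turn to cause

perceived_world_motion = retinal_shift - predicted_from_head_turn
print(perceived_world_motion)       # 0.0: the world looks stable despite the moving image
```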
Here's an interesting thing. People with schizophrenia are less susceptible to many optical illusions than the average person. [...]
What appears to be going on is that schizophrenics have weaker priors than we do. Their predictions of the world are less precise...
Unfortunately, that has other, less beneficial effects. For instance, schizophrenics often report that their body is under the control of some outside force - that when their arm moves, it's not them who's moving it. [...]
The Bayesian explanation is that [the subject]'s predictions of how her arm will move are less precise, so that when she moves her arm, that movement is not 'subtracted' from her experience in the same way it would be for a neurotypical person. She experiences the movement unsuppressed, just as if someone else were to pick her arm up and move it for her.
[...] schizophrenic people might get inappropriately surprised that someone with their first name is mentioned in the newspaper, or that they saw a car with a numberplate that has the number thirteen in it, or whatever. Because it's caused a prediction error, it has to be explained away with hypotheses, and that creates delusions, such as that the TV or newspapers are giving them secret messages.
The Bayesian model of depression is that it is caused by inappropriately strong priors on some negative belief [...]
Evidence that comes in that could prove otherwise [...] gets discounted, because your prior beliefs are so strong [...]
[Karl] Friston takes the Bayesian brain model further [...]. For Friston [...] minimising prediction error isn't just sense-making. It is, in this model, our fundamental motivation. Hunger, sexual desire, boredom - all our wants and needs - can be described in terms of a struggle to reduce the difference between top-down prediction and bottom-up sense data, between your prior and posterior distributions. [...]
More than this: this is, according to Friston, the fundamental driver of all life. A bacterium, a mouse, a whale: they're all trying to, in a mathematical sense, reduce the difference between what they predict and what they experience.
[...] any living or self-organising thing must work to maintain a boundary between itself and the universe. It must maintain the right temperature, the right pressure, the right mix of chemicals on the inside of the boundary. In other words, it must minimise entropy.
A very basic single-celled organism won't make complicated predictions [...] Say, if it's trying to estimate its internal salt concentrations, it might predict [algorithmically] the number of sodium ions passing across its cell membrane per second.
What's crucial, though, is that the organism can only survive if these predictions are correct. It can't update the model and say 'Ah, looks like I'm severely hyponatraemic, better change my predictions [...]'. If it does so, it will rapidly die.
But there are two ways of reducing prediction error. One is changing your prediction, sure. Another is changing the world so it matches your prediction. So the bacterium might metabolise some food, or [...] get moving until it's somewhere with a higher sodium concentration.
In this model, 'desire' and 'prediction' are the same thing. The bacterium wants to reduce its prediction error (or 'free energy'), for whatever it's predicting. [...]
For life-critical predictions, though, its predictions must be fixed. You cannot change your model of what your body temperature is or what your glucose levels are, outside very narrow windows. So the only way of minimising prediction error is by changing the world, or your position in it, so that your predictions are true.
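A toy sketch of those two options, with a made-up 'salt level' as the life-critical variable (my illustration, not Friston's maths):

```python
# The organism "predicts" an internal salt level of 1.0. Because the variable
# is life-critical, it cannot revise the prediction, so it acts instead.

predicted_salt = 1.0      # fixed, deeply wired set point
actual_salt = 0.4         # current state of the world (illustrative)

# Option 1, not available here: update the prediction to match the world.
# predicted_salt = actual_salt

# Option 2: act on the world until it matches the prediction,
# e.g. keep swimming towards saltier water.
while abs(predicted_salt - actual_salt) > 0.01:
    actual_salt += 0.1 * (predicted_salt - actual_salt)

print(round(actual_salt, 2))   # ~0.99: the world has been changed to fit the prediction
```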
What more sophisticated animals, like humans, can do better than bacteria, though, is to manage their surroundings with an eye on the future, to avoid ending up in situations where their predictions of 'having enough oxygen' or 'not being on fire' might stop coming true. In mathematical terms, we want to minimise expected prediction error, or expected surprise.
To reiterate: according to the free energy model [a term borrowed from physics which equates to prediction error here], your brain treats predictions like 'I won't get wet if I go outside' and 'I will not go into hypoglycaemic shock' just the same. [...] It can change the world so its predictions are true, by grabbing an umbrella, or it can change its predictions so that they meet the world, by accepting that you will get wet. It can update its priors.
In the 'hypoglycaemic shock' situation, it can't. You have certain, deeply wired priors about the state of the world that will not change.
Minimising free energy means changing your state to avoid prediction error, but it also means trying to find out as much as you can about the world in order to make better predictions: finding the optimal move to make next to gather information [...]. You can minimise prediction error by generating better models of the world.
So it's all predictions, and the interesting thing is prediction error. A confident, precise prior that is contradicted by precise information coming from the world should result in a radically changed posterior probability. And the degree to which you change your beliefs is dictated by Bayes' theorem.