Lecture 2: Conditioning and Bayes' Rule | Video Lectures | Probabilistic Systems Analysis and Applied Probability | Electrical Engineering and Computer Science

Description: In this lecture, the professor discussed conditional probability, multiplication rule, total probability theorem, and Bayes' rule.

Instructor: John Tsitsiklis

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu

JOHN TSITSIKLIS: So here's the agenda for today. We're going to do a very quick review. And then we're going to introduce some very important concepts. The idea is that information is always partial. And the question is what do we do to probabilities if we have some partial information about the random experiment. We're going to introduce the important concept of conditional probability. And then we will see three very useful ways in which it is used. And these ways basically correspond to divide-and-conquer methods for breaking up problems into simpler pieces. And also one more fundamental tool which allows us to use conditional probabilities to do inference, that is, if we get a little bit of information about some phenomenon, what can we infer about the things that we have not seen?

So our quick review. In setting up a model of a random experiment, the first thing to do is to come up with a list of all the possible outcomes of the experiment. So that list is what we call the sample space. It's a set. And the elements of the sample space are all the possible outcomes. Those possible outcomes must be distinguishable from each other. They're mutually exclusive. Either one happens or the other happens, but not both. And they are collectively exhaustive, that is, no matter what, the outcome of the experiment is going to be an element of the sample space.

And then we discussed last time that there's also an element of art in how to choose your sample space, depending on how much detail you want to capture. This is usually the easy part. Then the more interesting part is to assign probabilities to our model, that is, to make some statements about what we believe to be likely and what we believe to be unlikely. The way we do that is by assigning probabilities to subsets of the sample space. So as we have our sample space here, we may have a subset A. And we assign a number to that subset, P(A), which is the probability that this event happens. Or, when we do the experiment and we get an outcome, it's the probability that the outcome happens to fall inside that event.

We have certain rules that probabilities should satisfy. They're non-negative. The probability of the overall sample space is equal to one, which expresses the fact that we are certain that, no matter what, the outcome is going to be an element of the sample space. Well, if we set things up right so that the sample space exhausts all possibilities, this should be the case.

And then there's another interesting property of probabilities that says that, if we have two events or two subsets that are disjoint, and we're interested in the probability that one or the other happens, that is, the outcome belongs to A or belongs to B, then the total probability of these two, taken together, is just the sum of their individual probabilities. So probabilities behave like masses. The mass of the object consisting of A and B is the sum of the masses of these two objects. Or you can think of probabilities as areas. They have, again, the same property. The area of A together with B is the area of A plus the area of B.

But as we discussed at the end of last lecture, it's useful to have in our hands a more general version of this additivity property, which says the following, if we take a sequence of sets-- A1, A2, A3, A4, and so on. And we put all of those sets together. It's an infinite sequence. And we ask for the probability that the outcome falls somewhere in this infinite union, that is we are asking for the probability that the outcome belongs to one of these sets, and assuming that the sets are disjoint, we can again find the probability for the overall set by adding up the probabilities of the individual sets.
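In symbols, the countable additivity property just described can be written as follows, taking $A_1, A_2, \ldots$ to be the sequence of disjoint events:

```latex
P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i),
\qquad \text{whenever } A_i \cap A_j = \emptyset \text{ for all } i \neq j .
```

The index $i$ runs over the positive integers, which is what it means for the events to form a sequence.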

So this is a nice and simple property. But it's a little more subtle than you might think. And let's see what's going on by considering the following example. We had an example last time where we take our sample space to be the unit square. And we said let's consider a probability law that says that the probability of a subset is just the area of that subset. So let's consider this probability law. OK.

Now the unit square is the set --let me just draw it this way-- the unit square is the union of one-element sets, one for each of its points. So the unit square is made up by the union of the various points inside the square. So union over all x's and y's. OK? So the square is made up out of all the points that it contains.

And now let's do a calculation. One is the probability of our overall sample space, which is the unit square. Now the unit square is the union of these things, which, according to our additivity axiom, is the sum of the probabilities of all of these one element sets. Now what is the probability of a one element set? What is the probability of this one element set? What's the probability that our outcome is exactly that particular point? Well, it's the area of that set, which is zero. So it's just the sum of zeros. And by any reasonable definition the sum of zeros is zero. So we just proved that one is equal to zero.

OK. Either probability theory is dead or there is some mistake in the derivation that I did. OK, the mistake is quite subtle and it comes at this step. We sort of applied the additivity axiom by saying that the unit square is the union of all those sets. Can we really apply our additivity axiom? Here's the catch. The additivity axiom applies to the case where we have a sequence of disjoint events and we take their union. Is this a sequence of sets? Can you make up the whole unit square by taking a sequence of elements inside it and covering the whole unit square? Well, if you try, if you start looking at a sequence of one-element sets, that sequence will never be able to exhaust the whole unit square.

So there's a deeper reason behind that. And the reason is that infinite sets are not all of the same size. The integers are an infinite set. And you can arrange the integers in a sequence. But a continuous set like the unit square is a bigger set. It's so-called uncountable. It has more elements than any sequence could have. So this union here is not of this kind, where we would have a sequence of events. It's a different kind of union. It's a union that involves many, many more sets. So the countable additivity axiom does not apply in this case, because we're not dealing with a sequence of sets. And so this is the incorrect step.

So at some level you might think that this is puzzling and awfully confusing. On the other hand, if you think about areas the way you're used to them from calculus, there's nothing mysterious about it. Every point on the unit square has zero area. When you put all the points together, they make up something that has finite area. So there shouldn't be any mystery behind it.

Now, one interesting thing that this discussion tells us, especially the fact that a single-element set has zero area, is the following-- Individual points have zero probability. After you do the experiment and you observe the outcome, it's going to be an individual point. So what happened in that experiment is something that initially you thought had zero probability of occurring. So if you happen to get some particular numbers and you say, "Well, in the beginning, what did I think about those specific numbers? I thought they had zero probability. But yet those particular numbers did occur."

So one moral from this is that zero probability does not mean impossible. It just means extremely, extremely unlikely by itself. So zero probability things do happen. In such continuous models, actually zero probability outcomes are everything that happens. And the bumper sticker version of this is to always expect the unexpected. Yes?

AUDIENCE: [INAUDIBLE].

JOHN TSITSIKLIS: Well, probability is supposed to be a real number. So it's either zero or it's a positive number. So you can think of the probability of things just close to that point and those probabilities are tiny and close to zero. So that's how we're going to interpret probabilities in continuous models. But this is two chapters ahead. Yeah?

AUDIENCE: How do we interpret probability of zero? If we can use models that way, then how about probability of one? That it's extremely likely but not necessarily certain?

JOHN TSITSIKLIS: That's also the case. For example, in this continuous model, if you ask me for the probability that (x, y) is different from (0, 0), this is the whole square except for one point. So the area of this is going to be one. But this event is not entirely certain because the (0, 0) outcome is also possible. So again, probability of one means essential certainty. But it still allows the possibility that the outcome might be outside that set. So these are some of the weird things that are happening when you have continuous models. And that's why we start this class with discrete models, on which we will be spending the next couple of weeks.

OK. So now once we have set up our probability model and we have a legitimate probability law that has these properties, then the rest is usually simple. Somebody asks you a question of calculating the probability of some event. Well, you were told something about the probability law, such as for example that the probabilities are equal to areas, and then you just need to calculate. In these types of examples somebody would give you a set and you would have to calculate the area of that set. So the rest is just calculation and simple.

Alright, so now it's time to start with our main business for today. And the starting point is the following-- You know something about the world. And based on what you know, you set up a probability model and you write down probabilities for the different outcomes. Then something happens, and somebody tells you a little more about the world, gives you some new information. This new information, in general, should change your beliefs about what happened or what may happen. So whenever we're given new information, some partial information about the outcome of the experiment, we should revise our beliefs. And conditional probabilities are just the probabilities that apply after the revision of our beliefs, when we're given some information.

So let's make this into a numerical example. So inside the sample space, this part of the sample space, let's say, has probability 3/6, this part has 2/6, and that part has 1/6. I guess that means that out here we have zero probability. So these were our initial beliefs about the outcome of the experiment. Suppose now that someone comes and tells you that event B occurred. So they don't tell you the full outcome of the experiment. But they just tell you that the outcome is known to lie inside this set B.

Well then, you should certainly change your beliefs in some way. And your new beliefs about what is likely to occur and what is not are going to be denoted by this notation. This is the conditional probability that the event A is going to occur, the probability that the outcome is going to fall inside the set A given that we are told and we're sure that the outcome lies inside the event B. Now once you're told that the outcome lies inside the event B, then our old sample space in some ways is irrelevant. We then have a new sample space, which is just the set B. We are certain that the outcome is going to be inside B.

For example, what is this conditional probability? It should be one. Given that I told you that B occurred, you're certain that B occurred, so this has unit probability. So here we see an instance of revision of our beliefs. Initially, event B had the probability of (2+1)/6 -- that's 1/2. Initially, we thought B had probability 1/2. Once we're told that B occurred, the new probability of B is equal to one. OK.

How do we revise the probability that A occurs? So we are going to have the outcome of the experiment. We know that it's inside B. So we will either get something here, and A does not occur. Or something inside here, and A does occur. What's the likelihood that, given that we're inside B, the outcome is inside here? Here's how we're going to think about it. This part of this set B, in which A also occurs, in our initial model was twice as likely as that part of B. So outcomes inside here collectively were twice as likely as outcomes out there.

So we're going to keep the same proportions and say that, given that we are inside the set B, we still want outcomes inside here to be twice as likely as outcomes there. So the proportion of the probabilities should be two versus one. And these probabilities should add up to one because together they make the conditional probability of B. So the conditional probabilities should be 2/3 probability of being here and 1/3 probability of being there. That's how we revise our probabilities. That's a reasonable, intuitively reasonable, way of doing this revision. Let's translate what we did into a definition.

The definition says the following, that the conditional probability of A given that B occurred is calculated as follows. We look at the total probability of B. And out of that probability that was inside here, what fraction of that probability is assigned to points for which the event A also occurs? Does it give us the same numbers as we got with this heuristic argument? Well, in this example, the probability of A intersection B is 2/6, divided by the total probability of B, which is 3/6, and so it's 2/3, which agrees with the answer that we got before. So the formula indeed matches what we were trying to do.
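The calculation in this example can be checked with a minimal sketch. The variable and function names below are hypothetical labels introduced for illustration, and the 3/6 region is assumed to be the part of the picture that lies outside B.

```python
from fractions import Fraction

# Probabilities of the three regions in the example.
# Names are hypothetical; the 3/6 region is assumed to lie outside B.
p_outside_b = Fraction(3, 6)   # region outside B
p_a_and_b = Fraction(2, 6)     # part of B where A also occurs
p_b_not_a = Fraction(1, 6)     # part of B where A does not occur

def conditional(p_joint, p_condition):
    """Return P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_condition == 0:
        raise ValueError("conditional probability is undefined when P(B) = 0")
    return p_joint / p_condition

p_b = p_a_and_b + p_b_not_a
p_a_given_b = conditional(p_a_and_b, p_b)
print(p_b, p_a_given_b)  # 1/2 2/3
```

Using exact fractions rather than floats keeps the 2-to-1 proportion inside B visible in the result.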

One little technical detail. If the event B has zero probability, then here we have a ratio that doesn't make sense. So in this case, we say that conditional probabilities are not defined.

Now you can take this definition and unravel it and write it in this form. The probability of A intersection B is the probability of B times the conditional probability of A given B. So this is just a consequence of the definition but it has a nice interpretation. Think of probabilities as frequencies. If I do the experiment over and over, what fraction of the time is it going to be the case that both A and B occur? Well, there's going to be a certain fraction of the time at which B occurs. And out of those times when B occurs, there's going to be a further fraction of the experiments in which A also occurs.

So interpret the conditional probability as follows. You only look at those experiments at which B happens to occur. And look at what fraction of those experiments where B already occurred, event A also occurs. And there's a symmetrical version of this equality. There's symmetry between the events B and A. So you also have this relation that goes the other way.
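Written out, the unraveled definition and its symmetrical version are the following two ways of expressing the same joint probability (each form assumes that the conditioning event has positive probability):

```latex
P(A \cap B) = P(B)\, P(A \mid B) = P(A)\, P(B \mid A).
```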

OK, so what do we use these conditional probabilities for? First, one comment. Conditional probabilities are just like ordinary probabilities. They're the new probabilities that apply in a new universe where event B is known to have occurred. So we had an original probability model. We are told that B occurs. We revise our model. Our new model should still be a legitimate probability model. So it should satisfy all sorts of properties that ordinary probabilities do satisfy.

So for example, if A and B are disjoint events, then we know that the probability of A union B is equal to the probability of A plus the probability of B. And now if I tell you that a certain event C occurred, we're placed in a new universe where event C occurred. We have new probabilities for that universe. These are the conditional probabilities. And conditional probabilities also satisfy this kind of property. So this is just our usual additivity axiom but applied in a new model, in which we were told that event C occurred. So conditional probabilities do not taste or smell any different than ordinary probabilities do. Conditional probabilities, given a specific event B, just form a probability law on our sample space. It's a different probability law but it's still a probability law that has all of the desired properties.
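The additivity property in the conditional universe can be stated as follows, assuming $P(C) > 0$ so that the conditional probabilities are defined:

```latex
P(A \cup B \mid C) = P(A \mid C) + P(B \mid C),
\qquad \text{whenever } A \cap B = \emptyset .
```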

OK, so where do conditional probabilities come up? They do come up in quizzes and they do come up in silly problems. So let's start with this. We have this example from last time. Two rolls of a four-sided die, all possible pairs of rolls are equally likely, so every element in this square has probability of 1/16. So all elements are equally likely. That's our original model. Then somebody comes and tells us that the minimum of the two rolls is equal to two. What's that event? The minimum equal to two can happen in many ways: if we get two two's, or if we get a two and something larger. And so this is our new event B. The red event is the event B.

And now we want to calculate probabilities inside this new universe. For example, you may be interested in questions about the maximum of the two rolls. In the new universe, what's the probability that the maximum is equal to one? The maximum being equal to one is this black event. And given that we're told that B occurred, this black event cannot happen. So this probability is equal to zero. How about the maximum being equal to two, given that event B? OK, we can use the definition here. It's going to be the probability that the maximum is equal to two and B occurs divided by the probability of B.

OK, what's the event that the maximum is equal to two? Let's draw it. This is going to be the blue event. The maximum is equal to two if we get any of those blue points. So the intersection of the two events is the intersection of the red event and the blue event. There's only one point in their intersection. So the probability of that intersection happening is 1/16. That's the numerator. How about the denominator? The event B consists of five elements, each one of which had probability of 1/16. So that's 5/16. And so the answer is 1/5.

Could we have gotten this answer in a faster way? Yes. Here's how it goes. We're trying to find the conditional probability that we get this point, given that B occurred. B consists of five elements. All of those five elements were equally likely when we started, so they remain equally likely afterwards. Because when we define conditional probabilities, we keep the same proportions inside the set. So the five red elements were equally likely. They remain equally likely in the conditional world. So conditional on event B having happened, each one of these five elements has the same probability. So the probability that we actually get this point is going to be 1/5. And so that's the shortcut.

More generally, whenever you have a uniform distribution on your initial sample space, when you condition on an event, your new distribution is still going to be uniform, but on the smaller event that we considered. So we started with a uniform distribution on the big square and we ended up with a uniform distribution just on the red points.
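The die example can be verified by brute-force enumeration. This is a sketch under the assumption that the die has faces 1 through 4 (inferred from the 1/16 probability per pair); the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: two rolls of a four-sided die (faces 1..4 assumed),
# giving 16 equally likely pairs.
outcomes = list(product(range(1, 5), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) under the uniform law."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def cond_prob(event_a, event_b):
    """P(A | B) computed from the definition P(A and B) / P(B)."""
    p_b = prob(event_b)
    if p_b == 0:
        raise ValueError("conditional probability is undefined when P(B) = 0")
    return prob(lambda o: event_a(o) and event_b(o)) / p_b

def min_is_two(o):
    return min(o) == 2

def max_is_one(o):
    return max(o) == 1

def max_is_two(o):
    return max(o) == 2

print(prob(min_is_two))                   # 5/16
print(cond_prob(max_is_one, min_is_two))  # 0
print(cond_prob(max_is_two, min_is_two))  # 1/5
```

The 1/5 also follows from the shortcut: B has five equally likely elements and only the pair (2, 2) has maximum equal to two.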

Now besides silly problems, however, conditional probabilities show up in real and interesting situations. And this example is going to give you some idea of how that happens. OK. Actually, in this example, instead of starting with a probability model in terms of regular probabilities, I'm actually going to define the model in terms of conditional probabilities. And we'll see how this is done. So here's the story. There may be an airplane flying up in the sky, in a particular sector of the sky that you're watching. Sometimes there is one, sometimes there isn't. And from experience you know that when you look up, there's a 5% probability that a plane is flying up there and a 95% probability that there's no plane up there.

So event A is the event that the plane is flying out there. Now you bought this wonderful radar that looks up. And you're told in the manufacturer's specs that, if there is a plane out there, your radar is going to register something, a blip on the screen, with probability 99%. And it will not register anything with probability 1%. So this particular part of the picture is a self-contained probability model of what your radar does in a world where a plane is out there. So I'm telling you that the plane is out there.

So we're now dealing with conditional probabilities because I gave you some particular information. Given this information that the plane is out there, that's how your radar is going to behave: with probability 99% it is going to detect it, with probability 1% it is going to miss it. So this piece of the picture is a self-contained probability model. The probabilities add up to one. But it's a piece of a larger model.

Similarly, there's the other possibility. Maybe a plane is not up there and the manufacturer specs tell you something about false alarms. A false alarm is the situation where the plane is not there, but for some reason your radar picked up some noise or whatever and shows a blip on the screen. And suppose that this happens with probability ten percent. Whereas with probability 90% your radar gives the correct answer.

So this is sort of a model of what's going to happen with respect to both the plane -- we're given probabilities about this -- and we're given probabilities about how the radar behaves. So here I have indirectly specified the probability law in our model by starting with conditional probabilities as opposed to starting with ordinary probabilities. Can we derive ordinary probabilities starting from the conditional ones? Yeah, we certainly can.

Let's look at this event, A intersection B, which is the event up here, that there is a plane and our radar picks it up. How can we calculate this probability? Well we use the definition of conditional probabilities and this is the probability of A times the conditional probability of B given A. So it's 0.05 times 0.99. And the answer, in case you care-- It's 0.0495. OK. So we can calculate the probabilities of final outcomes, which are the leaves of the tree, by using the probabilities that we have along the branches of the tree. So essentially, what we ended up doing was to multiply the probability of this branch times the probability of that branch.

Now, how about the answer to this question. What is the probability that our radar is going to register something? OK, this is an event that can happen in multiple ways. It's the event that consists of this outcome. There is a plane and the radar registers something together with this outcome, there is no plane but the radar still registers something.

So to find the probability of this event, we need the individual probabilities of the two outcomes. For the first outcome, we already calculated it. For the second outcome, the probability that this happens is going to be this probability 95% times 0.10, which is the conditional probability for taking this branch, given that there was no plane out there. So we just add the numbers. 0.05 times 0.99 plus 0.95 times 0.1 and the final answer is 0.1445. OK.
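The two calculations so far (multiplying along branches, then adding the leaves in which the radar registers something) can be sketched as follows. The numbers are the ones given in the lecture; the dictionary layout and names are hypothetical choices made for this illustration.

```python
from fractions import Fraction

# Model specified through conditional probabilities, as in the lecture.
p_plane = Fraction(5, 100)                  # P(A)
p_blip_given_plane = Fraction(99, 100)      # P(B | A)
p_blip_given_no_plane = Fraction(10, 100)   # P(B | A complement), false alarm

# Each leaf of the tree is the product of the probabilities along its branches.
leaves = {
    ("plane", "blip"): p_plane * p_blip_given_plane,
    ("plane", "no blip"): p_plane * (1 - p_blip_given_plane),
    ("no plane", "blip"): (1 - p_plane) * p_blip_given_no_plane,
    ("no plane", "no blip"): (1 - p_plane) * (1 - p_blip_given_no_plane),
}

# P(B): add the leaves in which the radar registers something.
p_blip = sum(p for (_, reading), p in leaves.items() if reading == "blip")

print(float(leaves[("plane", "blip")]))  # 0.0495
print(float(p_blip))                     # 0.1445
```

The four leaf probabilities add up to one, which is a quick sanity check that the conditional pieces fit together into a legitimate probability law.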

And now here's the interesting question. Given that your radar recorded something, how likely is it that there is an airplane up there? Your radar registering something -- that can be caused by two things. Either there's a plane there, and your radar did its job. Or there was nothing, but your radar fired a false alarm. What's the probability that this is the case as opposed to that being the case? OK. The intuitive shortcut would be that it should be the probability-- you look at their relative odds of these two elements and you use them to find out how much more likely it is to be there as opposed to being there.

But instead of doing this, let's just write down the definition and just use it. It's the probability of A and B happening, divided by the probability of B. This is just our definition of conditional probabilities. Now we have already found the numerator. We have already calculated the denominator. So we take the ratio of these two numbers and we find the final answer -- which is 0.34. OK.
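As a minimal sketch of this inference step, the function below computes the conditional probability of a plane given a blip from the three numbers that specify the model. The function name and its parameter names are hypothetical.

```python
from fractions import Fraction

def posterior(prior, p_obs_given_true, p_obs_given_false):
    """P(A | B) = P(A) P(B | A) / [P(A) P(B | A) + P(A^c) P(B | A^c)]."""
    joint_true = prior * p_obs_given_true              # numerator, P(A and B)
    total = joint_true + (1 - prior) * p_obs_given_false  # denominator, P(B)
    if total == 0:
        raise ValueError("conditional probability is undefined when P(B) = 0")
    return joint_true / total

# Radar numbers from the lecture: P(A) = 0.05, P(B | A) = 0.99, P(B | A^c) = 0.10.
p_plane_given_blip = posterior(Fraction(5, 100), Fraction(99, 100), Fraction(10, 100))
print(p_plane_given_blip, round(float(p_plane_given_blip), 4))  # 99/289 0.3426
```

The exact ratio is 0.0495 / 0.1445, which rounds to the 0.34 quoted in the lecture.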

There's this slightly curious thing that's happened in this example. Doesn't this number feel a little too low? So this is a conditional probability, given that my radar said there is something out there, that there is indeed something there. So it's sort of the probability that our radar gave the correct answer. Now, the specs of our radar were pretty good. In this situation, it gives you the correct answer 99% of the time. In this situation, it gives you the correct answer 90% of the time. So you would think that your radar is really reliable.

But yet here the radar recorded something, but the chance that the answer that you get out of this is the right one, given that it recorded something, the chance that there is an airplane out there, is only about 34%. So you cannot really rely on the measurements from your radar, even though the specs of the radar were really good. What's the reason for this? Well, the reason is that false alarms are pretty common.

Most of the time there's nothing. And there's a ten percent probability of false alarms. So there's roughly a ten percent probability that in any given experiment, you have a false alarm. And there is about a five percent probability that something is out there and your radar gets it. So when your radar records something, it's actually more likely to be a false alarm rather than an actual airplane. This has probability roughly ten percent. This has probability roughly five percent.
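The frequency interpretation suggests a simulation: repeat the experiment many times, keep only the trials in which the radar registered something, and see what fraction of those had a plane. The sketch below is one way to do this; the function name, the number of trials, and the seed are arbitrary choices, not part of the lecture.

```python
import random

def simulate_radar(n_trials, seed=0):
    """Estimate P(plane | blip) as a relative frequency over n_trials experiments."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so runs are repeatable
    blips = 0
    blips_with_plane = 0
    for _ in range(n_trials):
        plane = rng.random() < 0.05          # a plane is present 5% of the time
        p_blip = 0.99 if plane else 0.10     # detection vs. false-alarm probability
        if rng.random() < p_blip:
            blips += 1
            if plane:
                blips_with_plane += 1
    if blips == 0:
        raise ValueError("no blips observed; the conditional frequency is undefined")
    return blips_with_plane / blips

estimate = simulate_radar(200_000)
print(round(estimate, 3))  # should land near 0.34; the exact value depends on the seed
```

Among the trials with a blip, false alarms outnumber real detections by roughly two to one, which is why the estimate sits near one third.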

So conditional probabilities are sometimes counter-intuitive in terms of the answers that they get. And you can make similar stories about doctors interpreting the results of tests. So you tested positive for a certain disease. Does it mean that you have the disease necessarily? Well if that disease has been eradicated from the face of the earth, testing positive doesn't mean that you have the disease, even if the test was designed to be a pretty good one. So unfortunately, doctors do get it wrong also sometimes. And the reasoning that comes in such situations is pretty subtle.

Now for the rest of the lecture, what we're going to do is to take this example where we did three things and abstract them. These three trivial calculations that we just did are three very important, very basic tools that you use to solve more general probability problems. So what's the first one? We find the probability of a composite event, two things happening, by multiplying probabilities and conditional probabilities. For a more general version of this, look at any situation, maybe involving lots and lots of events.

So here's a story that event A may happen or may not happen. Given that A occurred, it's possible that B happens or that B does not happen. Given that B also happens, it's possible that the event C also happens or that event C does not happen. And somebody specifies for you a model by giving you all these conditional probabilities along the way. Notice what happens as we move along the branches of the tree. Any point in the tree corresponds to certain events having happened.

And then, given that this has happened, we specify conditional probabilities. Given that this has happened, how likely is it that C also occurs? Given a model of this kind, how do we find the probability of this event? The answer is extremely simple. All that you do is move along the tree and multiply conditional probabilities along the way. So in terms of frequencies, how often do all three things happen, A, B, and C? You first see how often A occurs. Out of the times that A occurs, how often does B occur? And out of the times where both A and B have occurred, how often does C occur? And you can just multiply those three frequencies with each other.

What is the formal proof of this? Well, the only thing we have in our hands is the definition of conditional probabilities. So let's just use this. And-- OK. Now, the definition of conditional probabilities tells us that the probability of two things is the probability of one of them times a conditional probability. Unfortunately, here we have the probability of three things. What can I do? I can put a parenthesis in here and think of this as the probability of this and that and apply our definition of conditional probabilities here. The probability of two things happening is the probability that the first happens times the conditional probability that the second happens given that the first one, A and B, happened.

So this is just the definition of the conditional probability of an event, given another event. That other event is a composite one, but that's not an issue. It's just an event. And then we use the definition of conditional probabilities once more to break this apart and make it P(A), P(B given A) and then finally, the last term. OK.
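The two applications of the definition just described can be written out step by step (assuming $P(A \cap B) > 0$ so that every conditional probability is defined):

```latex
\begin{aligned}
P(A \cap B \cap C) &= P\big((A \cap B) \cap C\big) \\
                   &= P(A \cap B)\, P(C \mid A \cap B) \\
                   &= P(A)\, P(B \mid A)\, P(C \mid A \cap B).
\end{aligned}
```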

So this proves the formula that I have up there on the slides. And if you wish to calculate any other probability in this diagram, for example, if you want to calculate this probability, you would still multiply the conditional probabilities along the different branches of the tree. In particular, here in this branch, you would have the conditional probability of C complement, given A intersection B complement, and so on. So you write down probabilities along all those tree branches and just multiply them as you go.
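Multiplying along a path of the tree is easy to sketch in code. The branch probabilities below are hypothetical numbers chosen only for illustration, since the lecture does not give values for this tree; `path_probability` is likewise a made-up helper name.

```python
import math
from fractions import Fraction

def path_probability(branch_probs):
    """Multiply the (conditional) probabilities along one path of the tree."""
    return math.prod(branch_probs)

# Hypothetical branch probabilities, for illustration only.
p_a = Fraction(1, 2)                 # P(A)
p_b_given_a = Fraction(1, 3)         # P(B | A)
p_c_given_ab = Fraction(1, 4)        # P(C | A and B)
p_c_given_a_not_b = Fraction(1, 5)   # P(C | A and B complement)

# P(A and B and C) = P(A) P(B | A) P(C | A and B)
p_abc = path_probability([p_a, p_b_given_a, p_c_given_ab])

# P(A and B^c and C^c) = P(A) P(B^c | A) P(C^c | A and B^c)
p_a_notb_notc = path_probability([p_a, 1 - p_b_given_a, 1 - p_c_given_a_not_b])

print(p_abc, p_a_notb_notc)  # 1/24 4/15
```

Complement branches use one minus the corresponding conditional probability, because the branches leaving any node of the tree form a self-contained probability model.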

So this was the first skill that we are covering. What was the second one? What we did was to calculate the total probability of a certain event B that consisted of-- was made up from different possibilities, which corresponded to different scenarios. So we wanted to calculate the probability of this event B that consisted of those two elements.

Let's generalize. So we have our big model. And this sample space is partitioned in a number of sets. In our radar example, we had a partition in two sets. Either a plane is there, or a plane is not there. Since we're trying to generalize, now I'm going to give you a picture for the case of three possibilities or three possible scenarios. So whatever happens in the world, there are three possible scenarios, A1, A2, A3. So think of these as there's nothing in the air, there's an airplane in the air, or there's a flock of geese flying in the air. So there's three possible scenarios.

And then there's a certain event B of interest, such as a radar records something or doesn't record something. We specify this model by giving probabilities for the Ai's-- That's the probability of the different scenarios. And somebody also gives us the probabilities that this event B is going to occur, given that the i-th scenario, Ai, has occurred. Think of the Ai's as scenarios.

And we want to calculate the overall probability of the event B. What's happening in this example? Perhaps, instead of this picture, it's easier to visualize if I go back to the picture I was using before. We have three possible scenarios, A1, A2, A3. And under each scenario, B may happen or B may not happen. And so on. So here we have A2 intersection B. And here we have A3 intersection B. In the previous slide, we found how to calculate the probability of any event of this kind, which is done by multiplying probabilities here and conditional probabilities there.

Now we are asked to calculate the total probability of the event B. The event B can happen in three possible ways. It can happen here. It can happen there. And it can happen here. So this is our event B. It consists of three elements. To calculate the total probability of our event B, all we need to do is to add these three probabilities. So B is an event that consists of these three elements. There are three ways that B can happen. Either B happens together with A1, or B happens together with A2, or B happens together with A3.

So we need to add the probabilities of these three contingencies. For each one of those contingencies, we can calculate its probability by using the multiplication rule. So the probability of A1 and B happening is this-- It's the probability of A1 times the conditional probability of B, given that A1 happens. The probability of this contingency is found by taking the probability that A2 happens times the conditional probability of B, given that A2 happened. And similarly for the third one. So this is the general rule that we have here. The rule is written for the case of three scenarios. But obviously, it has a generalization for the case of four or five or more scenarios. It gives you a way of breaking up the calculation of the probability of an event that can happen in multiple ways by considering individual probabilities for the different ways that the event can happen.
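The rule described here, P(B) = P(A1) P(B | A1) + P(A2) P(B | A2) + P(A3) P(B | A3), can be sketched in a few lines of Python. The function name `total_probability` and all of the numbers below are assumptions made for illustration; only the three scenarios (nothing in the air, an airplane, a flock of geese) come from the lecture.

```python
def total_probability(priors, likelihoods):
    """Return P(B) = sum over i of P(A_i) * P(B | A_i) for a partition A_1..A_n."""
    if len(priors) != len(likelihoods):
        raise ValueError("need one conditional probability per scenario")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must add to 1")
    # Multiplication rule for each scenario, then add the pieces.
    return sum(p * l for p, l in zip(priors, likelihoods))


# Hypothetical model: nothing in the air, an airplane, a flock of geese.
priors = [0.90, 0.05, 0.05]        # P(A1), P(A2), P(A3), made up
likelihoods = [0.10, 0.99, 0.50]   # P(B | A1), P(B | A2), P(B | A3), made up

p_b = total_probability(priors, likelihoods)
print(round(p_b, 4))
```

Because the function takes lists, the same sketch covers four, five, or more scenarios without change.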

OK. So-- Yes?

AUDIENCE: Does this have to change for infinite sample space?

JOHN TSITSIKLIS: No. This is true whether your sample space is infinite or finite. What I'm using in this argument is that we have a partition into just three scenarios, three events. So it's a partition into a finite number of events. It's also true if it's a partition into an infinite sequence of events. But that's, I think, one of the theoretical problems at the end of the chapter. You probably may not need it for now.

OK, going back to the story here. There are three possible scenarios about what could happen in the world that are captured here. And under each scenario, event B may or may not happen. And so these probabilities tell us the likelihoods of the different scenarios. These conditional probabilities tell us how likely is it for B to happen under one scenario, or the other scenario, or the other scenario.

The overall probability of B is found by taking some combination of the probabilities of B in the different possible worlds, in the different possible scenarios. Under some scenario, B may be very likely. Under another scenario, it may be very unlikely. We take all of these into account and weigh them according to the likelihood of the scenarios. Now notice that since A1, A2, and A3 form a partition, these three probabilities have what property? Add to what? They add to one. So it's the probability of this branch, plus this branch, plus this branch. So what we have here is a weighted average of the probabilities of B in the different worlds, or in the different scenarios.

Special case. Suppose the three scenarios are equally likely. So P of A1 equals 1/3, equal to P of A2 and P of A3. What are we saying here? In that case of equally likely scenarios, the probability of B is the average of the probabilities of B in the three different worlds, or in the three different scenarios. OK.
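In symbols, the special case can be sketched like this: substituting P(A1) = P(A2) = P(A3) = 1/3 into the total probability formula turns the weighted average into an ordinary average.

```latex
\begin{align*}
P(B) &= P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2) + P(A_3)\,P(B \mid A_3) \\
     &= \tfrac{1}{3}\,\big[\,P(B \mid A_1) + P(B \mid A_2) + P(B \mid A_3)\,\big]
\end{align*}
```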

So finally, the last step. If we go back again two slides, the last thing that we did was to calculate a conditional probability of this kind, probability of A given B, which is a probability associated essentially with an inference problem. Given that our radar recorded something, how likely is it that the plane was up there? So we're trying to infer whether a plane was up there or not, based on the information that we've got.

So let's generalize once more. And we're just going to rewrite what we did in that example, but in terms of general symbols instead of the specific numbers. So once more, the model that we have involves probabilities of the different scenarios. These we call prior probabilities. They are our initial beliefs about how likely each scenario is to occur. We also have a model of our measuring device that tells us, under each scenario, how likely is it that our radar will register something or not. So we're given again these conditional probabilities. We're given the conditional probabilities for these branches.

Then we are told that event B occurred. And on the basis of this new information, we want to form some new beliefs about the relative likelihood of the different scenarios. Going back again to our radar example, an airplane was present with probability 5%. Given that the radar recorded something, we're going to change our beliefs. Now, a plane is present with probability 34%. The radar, since we saw something, we are going to revise our beliefs as to whether the plane is out there or is not there.
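The revision from 5% to 34% can be reproduced with a short Python sketch. The prior of 0.05 is the figure quoted above. The detection probability 0.99 and the false-alarm probability 0.10 are assumed here, chosen because they are consistent with the 34% figure. The variable names are illustrative.

```python
p_plane = 0.05                   # prior P(A): a plane is present (quoted above)
p_detect_given_plane = 0.99      # P(B | A), assumed value
p_detect_given_no_plane = 0.10   # P(B | A complement), assumed value

# Total probability that the radar records something.
p_detect = (p_plane * p_detect_given_plane
            + (1 - p_plane) * p_detect_given_no_plane)

# Revised belief that a plane is present, given that the radar recorded something.
posterior = p_plane * p_detect_given_plane / p_detect
print(round(posterior, 2))  # 0.34
```

Under these assumed numbers, most radar readings still come from false alarms, because a plane is rarely present to begin with. That is why the revised probability stays well below one half.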

And so what we need to do is to calculate the conditional probabilities of the different scenarios, given the information that we got. So initially, we have these probabilities for the different scenarios. Once we get the information, we update them and we calculate our revised probabilities or conditional probabilities given the observation that we made. OK. So what do we do? We just use the definition of conditional probabilities twice. By definition the conditional probability is the probability of two things happening divided by the probability of the conditioning event.

Now, I'm using the definition of conditional probabilities once more, or rather I use the multiplication rule. The probability of two things happening is the probability of the first times the conditional probability of the second, given the first. So these are things that are given to us. They're the probabilities of the different scenarios. And it's the model of our measuring device, which we assume to be available. And how about the denominator? This is the total probability of the event B. But we just found that it's easy to calculate using the formula in the previous slide. To find the overall probability of event B occurring, we look at the probabilities of B occurring under the different scenarios and weigh them according to the probabilities of all the scenarios.
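Putting the pieces together in symbols, the calculation can be sketched as follows. This assumes P(B) > 0, so that conditioning on B is defined; the sum in the denominator runs over all the scenarios in the partition.

```latex
\begin{align*}
P(A_i \mid B) &= \frac{P(A_i \cap B)}{P(B)} \\
              &= \frac{P(A_i)\,P(B \mid A_i)}{P(B)} \\
              &= \frac{P(A_i)\,P(B \mid A_i)}{\sum_j P(A_j)\,P(B \mid A_j)}
\end{align*}
```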

So in the end, we have a formula for the conditional probability of the A's given B, based on the data of the problem, which were probabilities of the different scenarios and conditional probabilities of B, given the A's. So what this calculation does is, basically, it reverses the order of conditioning. We are given conditional probabilities of this kind, where it's B given A, and we produce new conditional probabilities, where things go the other way.
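As a sketch of how this reversal of conditioning looks for any number of scenarios, here is a small Python function. The name `posteriors` and the three-scenario numbers are hypothetical, introduced only for illustration.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: return [P(A_i | B)] from [P(A_i)] and [P(B | A_i)]."""
    # Multiplication rule: P(A_i and B) for each scenario.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Total probability theorem: P(B).
    p_b = sum(joint)
    if p_b == 0:
        raise ValueError("P(B) is zero, so conditioning on B is undefined")
    return [j / p_b for j in joint]


# Hypothetical three-scenario model (made-up numbers).
priors = [0.90, 0.05, 0.05]        # P(A1), P(A2), P(A3)
likelihoods = [0.10, 0.99, 0.50]   # P(B | A1), P(B | A2), P(B | A3)

post = posteriors(priors, likelihoods)
print([round(x, 3) for x in post])
```

The revised probabilities always add to one, since they are the joint probabilities rescaled by their own sum.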

So schematically, what's happening here is that we have a model of cause and effect and-- So a scenario occurs and that may cause B to happen or may not cause it to happen. So this is a cause/effect model. And it's modeled using probabilities, such as probability of B given Ai. And what we want to do is inference where we are told that B occurs, and we want to infer whether Ai also occurred or not. And the appropriate probabilities for that are the conditional probabilities that Ai occurred, given that B occurred.

So we're starting with a causal model of our situation. It models from a given cause how likely is a certain effect to be observed. And then we do inference, which answers the question, given that the effect was observed, how likely is it that the world was in this particular situation or state or scenario.

So the name of the Bayes rule comes from Thomas Bayes, a British theologian back in the 1700s. It actually-- This calculation addresses a basic problem, a basic philosophical problem, how one can learn from experience or from experimental data in some systematic way. So the British at that time were preoccupied with this type of question. Is there a basic theory about how we can incorporate new knowledge into previous knowledge? And this calculation made an argument that, yes, it is possible to do that in a systematic way. So the philosophical underpinnings of this have a very long history and a lot of discussion around them. But for our purposes, it's just an extremely useful tool. And it's the foundation of almost everything that gets done when you try to do inference based on partial observations. Very well. Till next time.
