Lecture 12: Clustering


Description: Prof. Guttag discusses clustering.

Instructor: John Guttag


The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN GUTTAG: I'm a little reluctant to say good afternoon, given the weather, but I'll say it anyway. I guess now we all do know that we live in Boston. And I should say, I hope none of you were affected too much by the fire yesterday in Cambridge, but that seems to have been a pretty disastrous event for some.

Anyway, here's the reading. This is a chapter in the book on clustering, a topic that Professor Grimson introduced last week. And I'm going to try and finish up with respect to this course today, though not with respect to everything there is to know about clustering.

Quickly just reviewing where we were. We're in the unit of the course on machine learning, and we always follow the same paradigm. We observe some set of examples, which we call the training data. We try and infer something about the process that created those examples. And then we use inference techniques, different kinds of techniques, to make predictions about previously unseen data. We call that the test data.

As Professor Grimson said, you can think of two broad classes. Supervised, where we have a set of examples and some label associated with the example-- Democrat, Republican, smart, dumb, whatever you want to associate with them-- and then we try and infer the labels. Or unsupervised, where we're given a set of feature vectors without labels, and then we attempt to group them into natural clusters. That's going to be today's topic, clustering.

So clustering is an optimization problem. As we'll see later, supervised machine learning is also an optimization problem. Clustering's a rather simple one.

We're going to start first with the notion of variability. So this little c is a single cluster, and we're going to talk about the variability of that cluster as the sum, over each example in the cluster, of the squared distance between the mean of the cluster and that example. OK? Pretty straightforward.

For the moment, we can just assume that we're using Euclidean distance as our distance metric. Minkowski with p equals two. So variability should look pretty similar to something we've seen before, right? It's not quite variance, right, but it's very close. In a minute, we'll look at why it's different.

And then we can look at the dissimilarity of a set of clusters, a group of clusters, which I'm writing as capital C, and that's just the sum of all the variabilities. Now, if I had divided variability by the size of the cluster, what would I have? Something we've seen before. What would that be? Somebody? Isn't that just the variance?
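
Written out, the two definitions just described look roughly as follows. The notation is reconstructed from the spoken description and is an assumption, not a copy of the slide: $c$ is one cluster, $C$ is a set of clusters, and $|c|$ is the number of examples in $c$.

```latex
\[
\text{variability}(c) = \sum_{e \in c} \text{distance}(\text{mean}(c),\, e)^2
\]
\[
\text{dissimilarity}(C) = \sum_{c \in C} \text{variability}(c)
\]
\[
\text{variance}(c) = \frac{\text{variability}(c)}{|c|}
\]
```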

So the question is, why am I not doing that? If up til now, we always wanted to talk about variance, why suddenly am I not doing it? Why do I define this notion of variability instead of good old variance? Any thoughts?

What am I accomplishing by not dividing by the size of the cluster? Or what would happen if I did divide by the size of the cluster? Yes.

AUDIENCE: You normalize it?

JOHN GUTTAG: Absolutely. I'd normalize it. That's exactly what it would be doing. And what might be good or bad about normalizing it?

What does it essentially mean to normalize? It means that the penalty for a big cluster with a lot of variance in it is no higher than the penalty of a tiny little cluster with a lot of variance in it. By not normalizing, what I'm saying is I want to penalize big, highly-diverse clusters more than small, highly-diverse clusters. OK? And if you think about it, that probably makes sense. Big and bad is worse than small and bad.
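
A minimal sketch of why leaving out the division matters, assuming Euclidean distance and made-up one-dimensional data. The function names are illustrative and are not taken from the course code.

```python
def mean(cluster):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(cluster[0])
    return [sum(p[i] for p in cluster) / len(cluster) for i in range(dim)]


def variability(cluster):
    """Sum of squared Euclidean distances from each example to the cluster mean."""
    m = mean(cluster)
    return sum(sum((x - y) ** 2 for x, y in zip(p, m)) for p in cluster)


def dissimilarity(clusters):
    """Sum of the variabilities of all clusters."""
    return sum(variability(c) for c in clusters)


# Hypothetical data: both clusters have the same spread per point,
# but `big` has five times as many points as `small`.
small = [[0.0], [2.0]]
big = [[0.0], [2.0]] * 5

# Dividing by the size gives the same value for both clusters,
# while the undivided variability grows with the size of the cluster.
print(variability(small), variability(small) / len(small))
print(variability(big), variability(big) / len(big))
```

The undivided sum is what lets the objective penalize the big cluster more than the small one.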

All right, so now we define the objective function. And can we say that the optimization problem we want to solve by clustering is simply finding a capital C that minimizes dissimilarity? Is that a reasonable definition? Well, hint-- no. What foolish thing could we do that would optimize that objective function? Yeah.

AUDIENCE: You could have the same number of clusters as points?

JOHN GUTTAG: Yeah. I can have the same number of clusters as points, assign each point to its own cluster, whoops. Ooh, almost a relay. The dissimilarity of each cluster would be 0. The variability would be 0, so the dissimilarity would be 0, and I just solved the problem. Well, that's clearly not a very useful thing to do.

So, well, what do you think we do to get around that? Yeah.

AUDIENCE: We apply a constraint?

JOHN GUTTAG: We apply a constraint. Exactly. And so we have to pick some constraint.

What would be a suitable constraint, for example? Well, maybe we'd say, OK, the clusters have to have some minimum distance between them. Or-- and this is the constraint we'll be using today-- we could constrain the number of clusters. Say, all right, I only want to have at most five clusters. Do the best you can to minimize dissimilarity, but you're not allowed to use more than five clusters. That's the most common constraint that gets placed in the problem.

All right, we're going to look at two algorithms. Maybe I should say two methods, because there are multiple implementations of these methods. The first is called hierarchical clustering, and the second is called k-means. There should be an S on the word mean there. Sorry about that.

All right, let's look at hierarchical clustering first. It's a strange algorithm. We start by assigning each item, each example, to its own cluster. So this is the trivial solution we talked about before. So if you have N items, you now have N clusters, each containing just one item.

In the next step, we find the two most similar clusters we have and merge them into a single cluster, so that now instead of N clusters, we have N minus 1 clusters. And we continue this process until all items are clustered into a single cluster of size N.

Now of course, that's kind of silly, because if all I wanted was to put them all in a single cluster, I don't need to iterate. I just go wham, right? But what's interesting about hierarchical clustering is you stop it, typically, somewhere along the way. You produce something called a dendrogram. Let me write that down.

At each step here, it shows you what you've merged thus far. We'll see an example of that shortly. And then you can have some stopping criteria. We'll talk about that. This is called agglomerative hierarchical clustering because we start with a bunch of things and we agglomerate them. That is to say, we put them together.

All right? Make sense? Well, there's a catch. What do we mean by distance? And there are multiple plausible definitions of distance, and you would get a different answer depending upon which measure you used. These are called linkage metrics.

The most common one used is probably single-linkage, and that says the distance between a pair of clusters is equal to the shortest distance from any member of one cluster to any member of the other cluster. So if I have two clusters, here and here, and they have bunches of points in them, single-linkage distance would say, well, let's use these two points which are the closest, and the distance between these two is the distance between the clusters.

You can also use complete-linkage, and that says the distance between any two clusters is equal to the greatest distance from any member to any other member. OK? So if we had the same picture we had before-- probably not the same picture, but it's a picture. Whoops. Then we would say, well, I guess complete-linkage is probably the distance, maybe, between those two.

And finally, not surprisingly, you can take the average distance. These are all plausible metrics. They're all used in practice for different kinds of results depending upon the application of the clustering.
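
The three linkage metrics can be sketched as small functions over two clusters of points. This assumes Euclidean distance; the function names and the sample points are illustrative only.

```python
import math
from itertools import product


def single_linkage(c1, c2):
    """Shortest distance from any member of c1 to any member of c2."""
    return min(math.dist(a, b) for a, b in product(c1, c2))


def complete_linkage(c1, c2):
    """Greatest distance from any member of c1 to any member of c2."""
    return max(math.dist(a, b) for a, b in product(c1, c2))


def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs."""
    dists = [math.dist(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)


# Hypothetical clusters lying on a line.
left = [(0, 0), (1, 0)]
right = [(4, 0), (6, 0)]

print(single_linkage(left, right))    # closest pair: (1, 0) and (4, 0)
print(complete_linkage(left, right))  # farthest pair: (0, 0) and (6, 0)
print(average_linkage(left, right))   # mean of the four pairwise distances
```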

All right, let's look at an example. So what I have here is the air distance between six different cities, Boston, New York, Chicago, Denver, San Francisco, and Seattle. And now let's say we want to cluster these airports just based upon their distance.

So we start. The first piece of our dendrogram says, well, all right, I have six cities, I have six clusters, each containing one city. All right, what happens next? What's the next level going to look like? Yeah?

AUDIENCE: You're going from Boston [INAUDIBLE]

JOHN GUTTAG: I'm going to join Boston and New York, as improbable as that sounds. All right, so that's the next level. And if for some reason I only wanted to have five clusters, well, I could stop here.

Next, what happens? Well, I look at it, I say well, I'll join up Chicago with Boston and New York. All right. What do I get at the next level? Somebody? Yeah.

AUDIENCE: Seattle [INAUDIBLE]

JOHN GUTTAG: Doesn't look like it to me. If you look at San Francisco and Seattle, they are 808 miles, and Denver and San Francisco is 1,235. So I'd end up, in fact, joining San Francisco and Seattle.

AUDIENCE: That's what I said.

JOHN GUTTAG: Well, that explains why I need my hearing fixed.

[LAUGHTER]

All right. So I combine San Francisco and Seattle, and now it gets interesting. I have two choices with Denver. Obviously, there are only two choices, and which I choose depends upon which linkage criterion I use. If I'm using single-linkage, well, then Denver gets joined with Boston, New York, and Chicago, because it's closer to Chicago than it is to either San Francisco or Seattle.

But if I use complete-linkage, it gets joined up with San Francisco and Seattle, because it is further from Boston than it is from, I guess it's San Francisco or Seattle. Whichever it is, right? So this is a place where you see what answer I get depends upon the linkage criteria.

And then if I want, I can continue to the next step and just join them all. All right? That's hierarchical clustering. So it's good because you get this whole history of the dendrogram, and you get to look at it, say, well, all right, that looks pretty good. I'll stick with this clustering.
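
The procedure can be sketched in a few lines. This is a naive version under stated assumptions: the points are made-up positions on a line rather than the city distances from the slide, and the linkage is passed in as `min` (single-linkage) or `max` (complete-linkage). The point at 20 plays the role Denver plays in the example: which side it joins depends on the linkage.

```python
def cluster_distance(c1, c2, linkage):
    """Distance between two clusters of 1-D points.

    `linkage` is min for single-linkage and max for complete-linkage.
    """
    return linkage(abs(a - b) for a in c1 for b in c2)


def agglomerate(points, num_clusters, linkage):
    """Greedy agglomerative clustering that stops at num_clusters clusters."""
    clusters = [[p] for p in points]         # start: one cluster per item
    history = [[list(c) for c in clusters]]  # successive levels of the dendrogram
    while len(clusters) > num_clusters:
        # Find the pair of clusters that is closest under the chosen linkage.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_distance(
            clusters[ij[0]], clusters[ij[1]], linkage))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return clusters, history


points = [0, 2, 9, 20, 32, 38]  # hypothetical data
single, _ = agglomerate(points, 2, min)
complete, _ = agglomerate(points, 2, max)
print(sorted(sorted(c) for c in single))    # 20 ends up with the left group
print(sorted(sorted(c) for c in complete))  # 20 ends up with the right group
```

Rescanning every pair of clusters at every merge is what makes this naive version slow on large inputs.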

It's deterministic. Given a linkage criterion, you always get the same answer. There's nothing random here. Notice, by the way, the answer might not be optimal with regards to that linkage criteria. Why not? What kind of algorithm is this?

AUDIENCE: Greedy.

JOHN GUTTAG: It's a greedy algorithm, exactly. And so I'm making locally optimal decisions at each point which may or may not be globally optimal.

It's flexible. Choosing different linkage criteria, I get different results. But it's also potentially really, really slow. This is not something you want to do on a million examples. The naive algorithm, the one I just sort of showed you, is N cubed. N cubed is typically impractical.

For some linkage criteria, for example, single-linkage, there exists very clever N squared algorithms. For others, you can't beat N cubed. But even N squared is really not very good. Which gets me to a much faster greedy algorithm called k-means.

Now, the k in k-means is the number of clusters you want. So the catch with k-means is if you don't have any idea how many clusters you want, it's problematical, whereas hierarchical, you get to inspect it and see what you're getting. If you know how many you want, it's a good choice because it's much faster.

All right, the algorithm, again, is very simple. This is the one that Professor Grimson briefly discussed. You randomly choose k examples as your initial centroids. Doesn't matter which of the examples you choose. Then you create k clusters by assigning each example to the closest centroid, compute k new centroids by averaging the examples in each cluster.

So in the first iteration, the centroids are all examples that you started with. But after that, they're probably not examples, because you're now taking the average of two examples, which may not correspond to any example you have. Actually the average of N examples.

And then you just keep doing this until the centroids don't move. Right? Once you go through one iteration where they don't move, there's no point in recomputing them again and again and again, so it is converged.
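
The loop just described can be sketched as follows. This is not the course's implementation: the `initial` and `seed` parameters are assumptions added so the run can be made repeatable, and points are plain tuples of floats.

```python
import math
import random


def centroid(cluster):
    """Component-wise average of the examples in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def kmeans(points, k, initial=None, seed=0):
    """Basic k-means; returns (clusters, centroids) once the centroids stop moving.

    If `initial` is None, k of the examples are picked at random as the
    starting centroids.
    """
    rng = random.Random(seed)
    if initial is None:
        centroids = rng.sample(points, k)
    else:
        centroids = [tuple(c) for c in initial]
    while True:
        # Assign each example to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        if any(len(c) == 0 for c in clusters):
            raise ValueError("empty cluster")
        # Recompute the centroids; stop when none of them has moved.
        new_centroids = [centroid(c) for c in clusters]
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids


# Hypothetical data: two well-separated groups of three points each.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
group_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters, centroids = kmeans(group_a + group_b, 2,
                             initial=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

After the first iteration the centroids are averages, so they no longer coincide with any example.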

So let's look at the complexity. Well, at the moment, we can't tell you how many iterations you're going to have, but what's the complexity of one iteration? Well, let's think about what you're doing here. You've got k centroids. Now I have to take each example and compare it-- naively, at least-- to each centroid to see which it's closest to. Right? So that's k comparisons per example. So that's k times n times d, where d is how much time each of these comparisons takes, which is likely to depend upon the dimensionality of the features, right? Just the Euclidean distance, for example.

But this is a way smaller number than N squared, typically. So each iteration is pretty quick, and in practice, as we'll see, this typically converges quite quickly, so you usually need a very small number of iterations. So it is quite efficient, and then there are various ways you can optimize it to make it even more efficient. This is the most commonly-used clustering algorithm because it works really fast.

Let's look at an example. So I've got a bunch of blue points here, and I actually wrote the code to do this. I'm not going to show you the code. And I chose four centroids at random, colored stars. A green one, a fuchsia-colored one, a red one, and a blue one. So maybe they're not the ones you would have chosen, but there they are. And I then, having chosen them, assign each point to one of those centroids, whichever one it's closest to. All right? Step one.

And then I recompute the centroid. So let's go back. So we're here, and these are the initial centroids. Now, when I find the new centroids, if we look at where the red one is, the red one is this point, this point, and this point. Clearly, the new centroid is going to move, right? It's going to move somewhere along in here or something like that, right? So we'll get those new centroids. There it is.

And now we'll re-assign points. And what we'll see is this point is now closer to the red star than it is to the fuchsia star, because we've moved the red star. Whoops. That one. Said the wrong thing. They were red to start with. This one is now suddenly closer to the purple, so-- and to the red. It will get recolored.

We compute the new centroids. We're going to move something again. We continue. Points will move around. This time we move two points. Here we go again. Notice, again, the centroids don't correspond to actual examples. This one is close, but it's not really one of them. Move two more. Recompute centroids, and we're done.

So here we've converged, and I think it was five iterations, and nothing will move again. All right? Does that make sense to everybody? So it's pretty simple.

What are the downsides? Well, choosing k foolishly can lead to strange results. So if I chose k equal to 3, looking at this particular arrangement of points, it's not obvious what "the right answer" is, right? Maybe it's making all of this one cluster. I don't know. But there are weird k's and if you choose a k that is nonsensical with respect to your data, then your clustering will be nonsensical. So that's one problem we have to think about. How do we choose k?

Another problem, and this is one somebody raised last time, is that the results can depend upon the initial centroids. Unlike hierarchical clustering, k-means is non-deterministic. Depending upon what random examples we choose, we can get a different number of iterations. If we choose them poorly, it could take longer to converge.

More worrisome, you get a different answer. You're running this greedy algorithm, and you might actually get to a different place, depending upon which centroids you chose. So these are the two issues we have to think about dealing with.

So let's first think about choosing k. What often happens is people choose k using a priori knowledge about the application. If I'm in medicine, I actually know that there are only five different kinds of bacteria in the world. That's true. I mean, there are subspecies, but five large categories. And if I had a bunch of bacteria I wanted to cluster, I may just set k equal to 5.

Maybe I believe there are only two kinds of people in the world, those who are at MIT and those who are not. And so I'll choose k equal to 2. Often, we know enough about the application, we can choose k. As we'll see later, often we can think we do, and we don't. A better approach is to search for a good k.

So you can try different values of k and evaluate the quality of the result. Assume you have some metric, so you can say, yeah, I like this clustering, I don't like this clustering. And we'll talk about how to do that in detail.

Or you can run hierarchical clustering on a subset of data. I've got a million points. All right, what I'm going to do is take a subset of 1,000 of them or 10,000. Run hierarchical clustering. From that, get a sense of the structure underlying the data. Decide k should be 6, and then run k-means with k equals 6. People often do this. They run hierarchical clustering on a small subset of the data and then choose k.

And we'll look-- but one we're going to look at is that one. What about unlucky centroids? So here I got the same points we started with. Different initial centroids. I've got a fuchsia one, a black one, and then I've got red and blue down here, which I happened to accidentally choose close to one another.

Well, if I start with these centroids, certainly you would expect things to take longer to converge. But in fact, what happens is this-- I get this assignment of blue, this assignment of red, and I'm done. It converges on this, which probably is not what we wanted out of this. Maybe it is, but the fact that I converged on some very different place shows that it's a real weakness of the algorithm, that it's sensitive to the randomly-chosen initial conditions.

Well, couple of things you can do about that. You could be clever and try and select good initial centroids. So people often will do that, and what they'll do is try and just make sure that they're distributed over the space. So they would look at some picture like this and say, well, let's just put my centroids at the corners or something like that so that they're far apart.

Another approach is to try multiple sets of randomly-chosen centroids, and then just select the best results. And that's what this little algorithm on the screen does. So I'll say best is equal to k-means of the points themselves, or something, then for t in range number of trials, I'll say C equals k-means of points, and I'll just keep track and choose the one with the least dissimilarity. The thing I'm trying to minimize. OK?

The first one has got all the points in one cluster. So it's very dissimilar. And then I'll just keep generating for different k's and I'll choose the k that seems to be the best, that does the best job of minimizing my objective function. And this is a very common solution, by the way, for any randomized greedy algorithm. And there are a lot of randomized greedy algorithms that you just choose multiple initial conditions, try them all out and pick the best.
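
The try-several-starts pattern is independent of k-means itself, so here is a generic sketch. The names `best_of_trials` and `random_split` are hypothetical, and `random_split` is only a stand-in for a randomized clustering run.

```python
import random


def best_of_trials(run_once, score, num_trials, seed=0):
    """Run a randomized procedure num_trials times and keep the lowest-scoring result."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(num_trials):
        candidate = run_once(rng)
        s = score(candidate)
        if s < best_score:
            best, best_score = candidate, s
    return best, best_score


def variability(cluster):
    """Sum of squared distances to the mean, for 1-D points."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)


def dissimilarity(clusters):
    return sum(variability(c) for c in clusters)


def random_split(rng, points=(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)):
    """Stand-in for one randomized run: cut the sorted points at a random place."""
    cut = rng.randint(1, len(points) - 1)
    return [list(points[:cut]), list(points[cut:])]


best, best_score = best_of_trials(random_split, dissimilarity, num_trials=20)
print(best, best_score)
```

With k-means, `run_once` would be one run from fresh random centroids and `score` would be the dissimilarity of the resulting clusters.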

All right, now I want to show you a slightly more real example. So this is a file we've got with medical patients, and we're going to try and cluster them and see whether the clusters tell us anything about the probability of them dying of a heart attack in, say, the next year or some period of time. So to simplify things, and this is something I have done with research, but we're looking at only four features here-- the heart rate in beats per minute, the number of previous heart attacks, the age, and something called ST elevation, a binary attribute.

So the first three are obvious. If you take an ECG of somebody's heart, it looks like this. This is a normal one. They have the S, the T, and then there's this region between the S wave and the T wave. And if it's higher, hence elevated, that's a bad thing. And so this is about the first thing that they measure if someone is having cardiac problems. Do they have ST elevation?

And then with each patient, we're going to have an outcome, whether they died, and it's related to the features, but it's probabilistic not deterministic. So for example, an older person with multiple heart attacks is at higher risk than a young person who's never had a heart attack. That doesn't mean, though, that the older person will die first. It's just more probable.

We're going to take this data, we're going to cluster it, and then we're going to look at what's called the purity of the clusters relative to the outcomes. So is the cluster, say, enriched by people who died? If you have one cluster and everyone in it died, then the clustering is clearly finding some structure related to the outcome.
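
One simple way to check for this kind of enrichment is to compare the fraction of positive outcomes in each cluster with the fraction in the whole population. A small sketch with made-up labels (1 = died, 0 = survived); the cluster names and numbers are purely illustrative.

```python
def fraction_positive(labels):
    """Fraction of labels equal to 1 in a non-empty list of 0/1 outcomes."""
    return sum(labels) / len(labels)


# Hypothetical outcome labels grouped by the cluster each patient landed in.
clusters = {
    "A": [1, 1, 1, 0],
    "B": [0, 0, 0, 1, 0, 0, 0, 0],
}

everyone = [y for labels in clusters.values() for y in labels]
overall = fraction_positive(everyone)

for name, labels in clusters.items():
    frac = fraction_positive(labels)
    # A cluster is "enriched" if its positive fraction is well above the overall rate.
    print(name, len(labels), round(frac, 3), "vs overall", round(overall, 3))
```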

So the file is in the zip file I uploaded. It looks more or less like this. Right? So it's very straightforward. The outcomes are binary. 1 is a positive outcome. Strangely enough in the medical jargon, a death is a positive outcome. I guess maybe if you're responsible for the medical bills, it's positive. If you're the patient, it's hard to think of it as a good thing. Nevertheless, that's the way that they talk. And the others are all there, right? Heart rate, other things.

All right, let's look at some code. So I've extracted some code. I'm not going to show you all of it. There's quite a lot of it, as you'll see. So we'll start-- one of the files you've got is called cluster.py. I decided there was enough code, I didn't want to put it all in one file. I was getting confused. So I said, let me create a file that has some of the code and a different file that will then import it and use it. Cluster has things that are pretty much unrelated to this example, but just useful for clustering.

So an example here has name, features, and label. And really, the only interesting thing in it-- and it's not that interesting-- is distance. And the fact that I'm using Minkowski with 2 says we're using Euclidean distance.

Class Cluster. There's a lot more code to that one. So we start with a non-empty list of examples. That's what init does. You can imagine what the code looks like, or you can look at it. Update is interesting in that it takes the cluster and examples and puts them in-- if you think of k-means-- the cluster closest to the previous centroids, and then returns the amount the centroid has changed. So if the centroid has changed by 0, then you don't have anything, right? Creates the new cluster.

And the most interesting thing is computeCentroid. And if you look at this code, you can see that I'm a slightly unreconstructed Python 2 programmer. I just noticed this. I really shouldn't have written 0.0. I should have just written 0, but in Python 2, you had to write that 0.0. Sorry about that. Thought I'd fixed these.

Anyway, so how do we compute the centroid? We start by creating an array of all 0s. The dimensionality is the number of features in the example. It's one of the methods from-- I didn't put up on the PowerPoint. And then for e in examples, I'm going to add to vals e.getFeatures, and then I'm just going to divide vals by the length of self.examples, the number of examples.

So now you see why I made it a pylab array, or a numpy array rather than a list, so I could do nice things like divide the whole thing in one expression. As you do math, any kind of math things, you'll find these arrays are incredibly convenient. Rather than having to write recursive functions or do bunches of iterations, the fact that you can do it in one keystroke is incredibly nice. And then I'm going to return the centroid.
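
A rough sketch of the centroid computation as described, assuming each example is just a list of numeric features. The name `compute_centroid` and the sample values are illustrative and not the course's code.

```python
import numpy as np


def compute_centroid(examples):
    """Average a non-empty list of equal-length feature vectors."""
    vals = np.zeros(len(examples[0]))  # one slot per feature, all 0s
    for features in examples:
        vals += np.array(features, dtype=float)  # element-wise addition
    return vals / len(examples)  # divide the whole array in one expression


# Hypothetical feature vectors: heart rate, previous heart attacks, age, ST elevation.
centroid = compute_centroid([[60, 0, 40, 0], [80, 2, 70, 1]])
print(centroid)
```

With a plain list, the final division would need an explicit loop or a comprehension.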

Variability is exactly what we saw in the formula. And then just for fun, so you could see this, I used an iterator here. I don't know that any of you have used the yield statement in Python. I recommend it. It's very convenient.

One of the nice things about Python is almost anything that's built in, you can make your own version of it. And so once I've done this, if c is a cluster, I can now write something like for c in big C, and this will make it work just like iterating over a list. Right, so this makes it possible to iterate over it. If you haven't read about yield, you should probably read the roughly two paragraphs in the textbook explaining how it works, but it's very convenient. Dissimilarity we've already seen.
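
A minimal illustration of a generator method built with `yield`. The class below is a stripped-down stand-in, not the Cluster class from the course files, and adding `__iter__` is one assumed way to make a plain `for` loop work on the object itself.

```python
class Cluster:
    """Toy container used only to demonstrate yield."""

    def __init__(self, examples):
        self.examples = list(examples)

    def members(self):
        """Generator: hands back the examples one at a time."""
        for e in self.examples:
            yield e

    def __iter__(self):
        # Defining __iter__ lets `for e in cluster` work directly.
        return self.members()


c = Cluster(["p1", "p2", "p3"])
for e in c.members():  # iterate via the generator method
    print(e)
for e in c:            # or iterate over the object itself
    print(e)
```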

All right, now we get to patients. This is in the file lec 12, lecture 12 dot py. In addition to importing the usual suspects of pylab and numpy, and probably it should import random too, it imports cluster, the one we just looked at. And so patient is a sub-type of cluster.Example.

Then I'm going to define this interesting thing called scale attributes. So you might remember, in the last lecture when Professor Grimson was looking at these reptiles, he ran into this problem about alligators looking like chickens because they each have a large number of legs. And he said, well, what can we do to get around this? Well, we can represent the feature as a binary number. Has legs, doesn't have legs. 0 or 1. And the problem he was dealing with is that when you have a feature vector and the dynamic range of some features is much greater than the others, they tend to dominate because the distances just look bigger when you get Euclidean distance.

So for example, if we wanted to cluster the people in this room, and I had one feature that was, say, 1 for male and 0 for female, and another feature that was 1 for wears glasses, 0 for doesn't wear glasses, and then a third feature which was weight, and I clustered them, well, weight would always completely dominate the Euclidean distance, right? Because the dynamic range of the weights in this room is much higher than the dynamic range of 0 to 1.

And so for the reptiles, he said, well, OK, we'll just make it a binary variable. But maybe we don't want to make weight a binary variable, because maybe it is something we want to take into account. So what we do is we scale it. So this is a method called z-scaling. More general than just making things 0 or 1.

It's a simple code. It takes in all of the values of a specific feature and then performs some simple calculations, and when it's done, the resulting array it returns has a known mean and a known standard deviation.

So what's the mean going to be? It's always going to be the same thing, independent of the initial values. Take a look at the code. Try and see if you can figure it out. Anybody want to take a guess at it? 0. Right? So the mean will always be 0. And the standard deviation, a little harder to figure, but it will always be 1.

OK? So it's done this scaling. This is a very common kind of scaling called z-scaling. The other way people scale is interpolate. They take the smallest value and call it 0, the biggest value, they call it 1, and then they do a linear interpolation of all the values between 0 and 1. So the range is 0 to 1. That's also very common.
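
Both kinds of scaling fit in a few lines. This is a sketch with made-up ages, not the scaling function from the course file, and it assumes the values are not all identical (otherwise both functions would divide by zero).

```python
import numpy as np


def z_scale(vals):
    """Shift to mean 0 and rescale to standard deviation 1."""
    vals = np.array(vals, dtype=float)
    return (vals - vals.mean()) / vals.std()


def min_max_scale(vals):
    """Linearly map the smallest value to 0 and the largest to 1."""
    vals = np.array(vals, dtype=float)
    lo, hi = vals.min(), vals.max()
    return (vals - lo) / (hi - lo)


ages = [20, 40, 60, 80]  # hypothetical feature values
z = z_scale(ages)
mm = min_max_scale(ages)
print(z, z.mean(), z.std())
print(mm)
```

Subtracting the mean is what forces the new mean to 0, and dividing by the standard deviation is what forces the new standard deviation to 1.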

So this is a general way to get all of the features sort of in the same ballpark so that we can compare them. And we'll look at what happens when we scale and when we don't scale. And that's why my getData function has this parameter to scale. It either creates a set of examples with the attributes as initially or scaled.

And then there's k-means. It's exactly the algorithm I showed you with one little wrinkle, which is this part. You don't want to end up with empty clusters. If I tell you I want four clusters, I don't mean I want three with examples and one that's empty, right? Because then I really don't have four clusters.

And so this is one of multiple ways to avoid having empty clusters. Basically what I did here is say, well, I'm going to try a lot of different initial conditions. If one of them is so unlucky to give me an empty cluster, I'm just going to skip it and go on to the next one by raising a value error, empty cluster. And if you look at the code, you'll see how this value error is used.
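
The skip-on-empty-cluster idea can be sketched like this. The helper names and the one-dimensional data are hypothetical; only the use of a ValueError to signal an empty cluster comes from the lecture.

```python
def assign(points, centroids):
    """Assign 1-D points to their nearest centroid; fail if a cluster is empty."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    if any(not c for c in clusters):
        raise ValueError("Empty Cluster")
    return clusters


def first_valid(points, centroid_sets):
    """Try each set of starting centroids, skipping any that leaves a cluster empty."""
    for centroids in centroid_sets:
        try:
            return assign(points, centroids)
        except ValueError:
            continue  # unlucky start: go on to the next one
    return None


points = [1, 2, 10, 11]  # hypothetical data
# The first set strands two centroids far from every point, so it is skipped.
print(first_valid(points, [[1.5, 50, 60], [1.5, 10.5]]))
```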

And then try k-means. We'll call k-means numTrial times, each one getting a different set of initial centroids, and return the result with the lowest dissimilarity. Then I have various ways to examine the results. Nothing very interesting, and here's the key place where we're going to run the whole thing.

We'll get the data, initially not scaling it, because remember, it defaults to true. Then initially, I'm only going to try one k. k equals 2. And we'll call testClustering with the patients. The number of clusters, k. I put in seed as a parameter here because I wanted to be able to play with it and make sure I got different things for 0 and 1 and 2 just as a testing thing. And five trials it's defaulting to.

And then we'll look at testClustering is returning the fraction of positive examples for each cluster. OK? So let's see what happens when we run it.

All right. So we got two clusters. Cluster of size 118 with .3305, and a cluster of size 132 with a positive fraction of point quadruple 3. Should we be happy? Does our clustering tell us anything, somehow correspond to the expected outcome for patients here? Probably not, right?

Those numbers are pretty much indistinguishable statistically. And you'd have to guess that the fraction of positives in the whole population is around .33, right? That about a third of these people died of their heart attack. And I might as well have assigned them randomly to the two clusters, right? There's not much difference between this and what you would get with the random result.

Well, why do we think that's true? Because I didn't scale, right? And so one of the issues we had to deal with is, well, age had a big dynamic range, and, say, ST elevation, which I told you was highly diagnostic, was either 0 or 1. And so probably everything is getting swamped by age or something else, right?

All right, so we have an easy way to fix that. We'll just scale the data. Now let's see what we get. All right. That's interesting. With casting rule? Good grief. That caught me by surprise.

Good thing I have the answers in PowerPoint to show you, because the code doesn't seem to be working. Try it once more. No. All right, well, in the interest of getting through this lecture on schedule, we'll go look at the results that we get-- I got last time I ran it.

All right. When I scaled, what we see here is that now there is a pretty dramatic difference, right? One of the clusters has a much higher fraction of positive patients than others, but it's still a bit problematic. So this has pretty good specificity, or positive predictive value, but its sensitivity is lousy.

Remember, a third of our initial population more or less, was positive. 26 is way less than a third, so in fact I've got a class, a cluster, that is strongly enriched, but I'm still lumping most of the positive patients into the other cluster.

And in fact, there are 83 positives. Wrote some code to do that. And so we see that of the 83 positives, this class, which is 70% positive, only has 26 in it to start with. So I'm clearly missing most of the positives.

So why? Well, my hypothesis was that different subgroups of positive patients have different characteristics. And so we could test this by trying other values of k to see with-- we would get more clusters. So here, I said, let's try k equals 2, 4, and 6. And here's what I got when I ran that.

So what you'll notice here, as we get to, say, 4, that I have two clusters, this one and this one, which are heavily enriched with positive patients. 26 as before in the first one, but 76 patients in the third one. So I'm now getting a much higher fraction of patients in one of the "risky" clusters.

And I can continue to do that, but if I look at k equals 6, we now look at the positive clusters. There were three of them significantly positive. But I'm not really getting a lot more patients total, so maybe 4 is the right answer.

So what you see here is that we have at least two parameters to play with, scaling and k. Even though I only wanted a structure that would separate the high-risk patients from the lower-risk, which is why I started with 2, I later discovered that, in fact, there are multiple reasons for being high-risk. And so maybe one of these clusters is heavily enriched by old people. Maybe another one is heavily enriched by people who have had three heart attacks in the past, or ST elevation or some combination. And when I had only two clusters, I couldn't get that fine gradation.

So this is what data scientists spend their time doing when they're doing clustering, is they actually have multiple parameters. They try different things out. They look at the results, and that's why you actually have to think to manipulate data rather than just push a button and wait for the answer.

All right. More of this general topic on Wednesday when we're going to talk about classification. Thank you.
