Friday, September 6, 2013

Another Post About CFB Ratings

I've always wanted to write a bit of a manifesto about college football rankings and ratings systems, but I've never had the impetus to follow through.  However, a series of tweets yesterday from prominent football writer Pat Forde inspired me to put a few of my thoughts into words.  There are a lot of differing opinions on what the rankings should reflect, and there are a lot of misunderstandings about what the various ratings systems represent.  Forde and those who responded to him give me the perfect set-up to organize my thoughts.  Here they are:

The nature of rankings
This was the first tweet from Forde on the matter.  This isn't the most ridiculous thought of the series (partially because it raises a legitimate question about the face validity of the system), but it's a good place to start.  Pundits often complain when an idle team moves up in the human polls as well, so it's a familiar argument that deserves a response.

Perhaps the most fundamental thing to remember about rankings systems is that the ordinal number next to your name has no inherent value by itself.  You earn that ranking simply by being better than all of the teams below you, and worse than all the teams above you.  Thus, moving up in the rankings doesn't necessarily mean you've gotten better; it can just as easily mean the teams around you have gotten worse.  I know that Forde quoted the rating number, and not the ordinal ranking, but the same point still stands.  The rating number only means something relative to the other teams' rating numbers, so the argument is basically the same.

To illustrate this, let's look at the four teams that fell below Stanford (who went from 7th to 3rd) and why they did so.  Stanford leapt a team that lost (Georgia) and a team that was not particularly impressive against a mediocre opponent (Texas A&M vs. Rice).  It makes sense to see them climb over those teams, even though Georgia's loss was a quality one.  You can also see the arguments against the other two teams the Cardinal jumped, as Oregon played a nobody and LSU only won by 10.  Those are less satisfying arguments, but when we're dealing with small margins between the top teams, it doesn't take much to move the needle.  In the end, all of those teams are so close that they will all probably switch spots next week.  Which brings me to my next point:


It's one game
This is probably the most obvious point, but I will still make it: teams have only played one game.  Weird things happen in single-game samples, like Boise losing by 32 to a team that they had just beaten in a bowl last season.  Of course, this doesn't mean that we should ignore what happened last weekend.  Just the opposite: we should weigh it very heavily in our evaluations of teams, since it's likely the most relevant information.  However, it's still just one game with a lot of unknowns.  Perhaps Washington is better than Boise, but given our prior assumption that Boise was better, how much of that are we willing to discount based on a single result?  Some are willing to discard the prior entirely (which anyone is free to do if they want), but that's just not the best way to evaluate the relative quality of football teams.

One other thing about this is that Forde only states the current positions of the teams.  He doesn't show how much the ratings have changed since the preseason.  Since Sagarin uses a prior in his ratings for roughly the first half of the season, it's clear that the preseason weighting will have some effect, but I think that Forde's tweet overstates that effect a bit.  For the two examples he gives:

Georgia - preseason rank of 5 with an 89.55 rating; week 1 rank of 5 with an 89.66 rating*
Clemson - preseason rank of 16 with an 83.55 rating; week 1 rank of 11 with an 85.12 rating

Boise - preseason rank of 15 with an 83.71 rating; week 1 rank of 22 with an 81.31 rating
Washington - preseason rank of 40 with a 76.79 rating; week 1 rank of 26 with an 80.57 rating

*I'm using the composite rating for now, since that is what Forde quoted in his first tweet.  I will talk about why I don't like that later.

In the case of UGA-Clemson, we see a slight bump for Clemson and a standstill for Georgia.  This makes sense, as Georgia lost on the road to a very good team by three points.  I can't see Clemson making up the entire difference of 11 ranking spots because of a one-score game that most would agree was quite even.

In the case of Boise-UW, this exercise shows us the large swing that we would expect from such a lopsided result, as Boise is now 21 spots closer to the Huskies in the rankings.  However, it's not quite enough to overcome the massive gulf between the teams in preseason expectations.  While most human voters (me included) have Washington above Boise, I can see the merit in not reacting too crazily to one result.
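
To make the prior-blending idea concrete, here's a minimal sketch of how a preseason rating might be mixed with early results.  To be clear, Sagarin doesn't publish his formula, so the decay schedule and the "results-only" numbers below are my own assumptions, not his method:

```python
# Minimal sketch of prior-blended ratings -- NOT Sagarin's actual method.
# Idea: early in the season, a team's rating is a weighted average of its
# preseason prior and a rating implied by its results so far, with the
# prior's weight shrinking as games accumulate.

def blended_rating(preseason, results_only, games_played, prior_games=6):
    """Treat the preseason prior as if it were worth `prior_games` games
    of evidence; 6 is an assumption, chosen so the prior fades over
    roughly half of a 12-game season."""
    w_prior = prior_games / (prior_games + games_played)
    return w_prior * preseason + (1 - w_prior) * results_only

# Boise-UW after week 1, using the preseason ratings quoted above.
# The results_only values are invented for illustration.
print(blended_rating(83.71, results_only=69.0, games_played=1))  # Boise: ~81.6
print(blended_rating(76.79, results_only=95.0, games_played=1))  # UW: ~79.4
```

Even when the one observed game implies a 26-point gap in Washington's favor, the blended ratings each move only a couple of points, which is exactly the "big swing, but not enough to flip the order" behavior described above.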


There are going to be outliers

A few people replied to Forde with their own discrepancies.  While I appreciate the usage of the comma before "too" in this tweet, cherry-picking the odd ducks in a set of results is still not the best way to go in life.  There are two reasons for this.  One, advanced models are likely to be considering different factors in different ways, and thus will inherently give you some results you wouldn't have thought of.  Of course, this is a feature and not a bug, because why else would you design such a system if not to further enlighten yourself?  As Bill James famously said*, a good metric will confirm your beliefs 80% of the time, and surprise you 20% of the time.

*This attribution might be an urban legend.  The best I could find in a quick search was this, which is promising since it's from SABR, but not great since it just off-handedly mentions it.  Oh well, it's a good quote anyway.

The second argument against cherry-picking is that everyone, including those who make the models, knows full well that they are not perfect.  While we've gotten really good at determining what leads to winning teams, we still have just a fraction of the data that it would take to definitively assign value judgments to teams.  There is a very good chance that some of these discrepancies are legitimately wrong.  While it would be tempting to go in and "fix" all of these issues, there are a lot of problems with that.  Ken Pomeroy goes into great detail about this in his response to those who decried Wisconsin being #2 a couple of years ago.  The basic summary is that if you went and made one change, then you'd probably have to make another and another, until you got to the point where the final product is worse than what you initially had.  These systems are generally designed to provide the most accurate results for the full population of teams at hand, which means that some individual teams will fall through the cracks.

In the end, it's entirely possible that Louisville is a really great team that all of the computers are underrating.  But, since it's possible that they aren't that good, and since they are just one output out of 125, ratings systems shouldn't change just to smooth out the outliers.


Understand how the ratings work before criticizing them

From his earlier and later tweets, it's clear that Forde understands that there is a preseason weighting that goes into Sagarin's ratings.  He argues against this, but I am afraid that Bayesian statistics are here to stay.  What he doesn't seem to understand (to his detriment) is that he isn't using the best measure available to him from Sagarin's ratings.

As I said a few paragraphs ago, the numbers that Forde quotes come from the left-most composite rating from Sagarin's site.  While this rating system is likely good, I'm not sure what value it provides, if any, over the right-most "Pure Points" predictor.  On his ratings page, Sagarin says:

"In ELO-CHESS, only winning and losing matters; the score margin is of no consequence,
which makes it very "politically correct".  However it is less accurate in its predictions for
upcoming games than is the PURE POINTS, in which the score margin is the only thing that matters....The overall RATING is a synthesis of the two diametrical opposites, ELO-CHESS and PURE POINTS (PREDICTOR)."

The "Elo-Chess" rating that Sagarin mentions is what he submits to the BCS.  As I've mentioned before, this is the neutered ranking that the BCS encouraged in the early 2000s to remove the motivation to run up the score (boy did that not work).  By removing margin of victory, which is the best simple way to evaluate teams, ratings such as the Elo-Chess don't seem to have any real value.  A team can win a bunch of close, flukish games (hello 2012 Florida) and get the same level of reward as those that are more dominant, and thus better.

Anyway, this ties back to Forde because he uses the composite rating, which includes the icky Elo-Chess rating.*  If he would use the more accurate Predictor rating, then a funny thing happens to his Boise-UW example: instead of being four spots behind Boise, Washington is now three spots ahead of them (23rd to 26th).  If Forde would look at the better set of ratings, then he might find them a touch more reasonable than he perceives them to be.

*This is partially Sagarin's fault since the composite rating is the most prominently displayed metric on his ratings page.  I do wonder if perhaps the composite rating is a more complicated synthesis than it seems, and is maybe the best judge of which teams are the best, but Sagarin doesn't seem to write much about his systems, so I am not sure.


Understanding the differing goals of different rating systems and why those goals are important

The first tweet above continues the thought I had a few paragraphs ago.  If you're going to make a good prediction system using limited data, then a Bayesian mindset is pretty much necessary.  If a computer rating didn't have a preseason weighting, then it would look something like the rankings linked in the second tweet.  Those ratings have a five-way tie for first between Arizona, Duke, Tennessee, Georgia Tech, and San Jose State.  Those may all be good teams this season, but they are clearly not anything resembling an actual top five.  They also don't appear to be the teams that have accomplished the most, as we don't see early-season stalwarts such as LSU or Washington in the top five.  This seems to be more of an efficiency metric that likely has little predictive value.
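
You can see why a no-prior system degenerates into ties with a toy example.  One common results-only approach fits ratings by least squares so that rating differences match point margins; with a single game per team and no shared opponents, the data literally cannot tell the winners apart.  The teams and margins below are invented, and this least-squares fit is just a stand-in assumption, not necessarily the method behind the linked rankings:

```python
import numpy as np

# Fit ratings so that (winner_rating - loser_rating) matches each margin.
teams = ["A", "B", "C", "D"]
games = [("A", "B", 20), ("C", "D", 20)]  # (winner, loser, margin)

idx = {t: i for i, t in enumerate(teams)}
X = np.zeros((len(games), len(teams)))
y = np.array([m for _, _, m in games], dtype=float)
for row, (w, l, _) in enumerate(games):
    X[row, idx[w]], X[row, idx[l]] = 1.0, -1.0

# lstsq returns the minimum-norm solution for this rank-deficient system.
ratings, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(teams, ratings.round(1))))
# {'A': 10.0, 'B': -10.0, 'C': 10.0, 'D': -10.0}
```

A and C tie for first no matter how good B and D actually were, because nothing connects the two games yet.  A preseason prior breaks exactly this kind of tie until the schedule links everyone together.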

That said, I do agree with Forde that the final determination of who plays in the title game should rest only on what happened this season.  In a lot of ways, this kind of mimics the basic idea of the American Dream.  That is, no matter what happened before, you still have a chance to be the best if you give your best effort and play well.  Of course, we all know that both life and football are not that simple, and that teams that were terrible last year are unlikely to succeed at the highest level this year.  But still, there's always hope.*

*I might have stolen this thought from "How Football Explains America."

Since we're together on that basic point, this disagreement boils down to something else.  We have the same goal (ranking the best teams in the proper order), but different ways of going about it.  An article I found with the Google (by another author) best sums up what I think Forde is trying to get at:

"BCS berths aren't awarded on the basis of hypothetical future results, or guesses at perceived strengths. They're awarded on the basis of achievement, on wins and losses and conference championships. Including margin-of-victory may make the BCS computer rankings "more accurate" when it comes to selecting which teams are playing the best football, but it would make them less accurate when it comes to answering the question the BCS rankings are trying to answer: which teams are most deserving ."

This quote ties in not only to this post, but also to the post where I decried "deserverism."  In this paragraph, and in Forde's tweets, the writers seem to make a distinction between the best and the most deserving.  Judging by the arguments they make, the reason for this is clear.  Both they and I wish to get the best teams into the championship game.  Writers like them look for the simple answer, whether it be a pair of undefeated teams or a one-loss team from one of the best conferences.  Their definition of "deserve" seems to be something that allows them to sleep easy at night.  In doing this, they shun the complexity inherent to a sport where 125 teams play 12 or 13 games against varying levels of opponents in different conditions with different stakes on the line.

While they seem to dislike complexity, I embrace it.  I want to see the best teams rewarded for being the best, whether or not they have an immaculate record.  Reality is complicated, and there are rarely black and white answers.*  If we want to be honest with ourselves, then we have to face that fact and do the hard work to truly answer the question of who is best.  Thankfully, people like Jeff Sagarin, the folks at Football Outsiders, and others have taken up the monumental task of sorting out the sport that is probably the least sort-out-able.  Such measures help enlighten me to things I may have missed, and with some open-mindedness, I would like to think that mainstream thought will eventually embrace this as well.

*Except for USC vs. Texas in 2005.  That was perfect.

In the meantime, I think a middle ground is more than reasonable.  Human polls are ridiculously flawed, but at the same time there are things that humans can discern that computer models can't easily take in.  Understanding the context behind a game can often give us a much better picture of what happened than a box score.  For example, in my last post I ragged on Louisville for only beating awful Southern Miss by 4 last year.  While that does look terrible on the surface, the fact that it was played in a monsoon could explain some of the Cardinals' perceived underperformance.  In the end, college football has a gloriously short season that makes it nearly impossible to definitively say who is the best.  Because of this, it can sometimes be best to embrace the random and celebrate a miraculously undefeated team even if they aren't the best.*  I fully support this idea as long as it's balanced with reason and an understanding of what we do know about quantitative team performance.  We're clearly not there yet, but with more discussions like this, I think we can achieve a balance that makes college football better than ever.

*No idea who I'm talking about there.  None at all.
