Tuesday 9 November 2021

Why Elo Is A Mistake

I'll fully admit that I forgot that U SPORTS switched women's hockey over to the Elo Ratings System and has been using it to generate their weekly Top Ten rankings. Today, I'm going to show you how flawed this system is. What seems like a good idea on paper has become a bit of a joke around the pressbox because the ideals behind the Elo System are anything but ideal in reality. As Drew Carey used to say on Whose Line Is It Anyway?, let's take a wander down this dark path where everything is made up and the points don't matter.

What Is Elo?

The background for the Elo Ratings System is that it was devised as a chess rating system by Arpad Elo to rank players based on results in chess matches in order to determine the world rankings. The entire premise was based on Elo's assumption that each player's performances will fluctuate randomly from match to match, but how a player plays the game would change slowly over time, thus giving an accurate ranking based upon as much complete match data as was provided. In a statistical sense, Elo thought a player's true skill was best represented as the mean of that player's performance random variable.

Since all players start at the same rating, the overall effect of the Elo Rating System was a net-zero approach to ratings in that wins would earn Player A points by the same amount that Player B's rating would decrease. Think of this like a +1 for a win and a -1 for a loss. As players played more and more chess games, the ratings would show who was winning more often, and that player's probability of winning could be calculated based upon the points earned - aka his rating - compared to his opponent's rating.
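For those who want to see the moving parts, here is a minimal sketch of the textbook Elo calculation. The +1/-1 picture above is a simplification: in the standard formula, the size of the swing depends on how likely the win was. The K-factor of 20 is my assumption for illustration only, not a number U SPORTS has published here, and the function names are made up for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 20.0):
    """Return the new (A, B) ratings after one game.

    k is an assumed K-factor chosen for illustration only.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Whatever A gains, B loses, so the total number of points never changes.
    return rating_a + delta, rating_b - delta


# Two teams starting level: the winner takes exactly half of K from the loser.
new_a, new_b = update(1500.0, 1500.0, a_won=True)
print(new_a, new_b)  # 1510.0 1490.0
```

Two teams with identical ratings are a coin flip as far as the formula is concerned, which is the "net-zero" idea in action: the points one side earns come straight out of the other side's total.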

The Fallacy

The reason the Elo System works in chess so well is that people, like leopards' placement of their spots, don't change all that much over the years, and any changes made to the strategy they use in chess would ultimately bear weight over the course of time. If a player became a better player through experience, his or her ranking would improve with each game he or she wins. If a player was careless and lost games, that would also show through the mean value of the rating he or she earned.

Why this doesn't work in team sports is that there is turnover each and every year on all rosters. Players graduate, players stop playing, players are recruited, and players earn walk-on spots each and every season across this land, so the make-up of every team is changing constantly. As a result, coaches change strategies and systems based on the strengths and weaknesses of their teams from year to year. Normally, those changes are small changes, but, as seen with Calgary, coaching changes will introduce new systems and new structure to teams as well.

It also needs to be said that the Calgary Dinos teams that featured Hayley Wickenheiser look nothing like the Calgary Dinos team of today just as the Alberta Pandas who won an armful of national championship trophies look nothing like the team does today either. Using past performances to predict future success of players who never played in those eras of the past makes little to no sense whatsoever.

Because the make-up of teams can change dramatically from year to year, using a system of probability to predict a top-ten list of teams who will likely win against teams from other conferences seems foolish when some conferences only see their teams meet once per season at the National Championship. The lack of crossover between the conferences makes determining who would beat whom entirely a point of academia, not math, so the algorithm has as much chance of being right as it does being wrong. That's not predicting a trend based on past performances; rather, it's the flip of a coin.

The Flaws - Crossing Over?

Based on U SPORTS' own definition of how they calculate the Elo values for each team, there are a number of flaws in the system they're using. By their own admission, they have no idea how accurate the Elo system is when it comes to predicting wins for teams, but "it does provide a reasonable (and always-evolving) estimate of the relative strengths of teams within a contained league or system." The flaw in this is the qualifier on that statement - "within a contained league or system".

As stated above, there simply isn't enough crossover between the conferences at any point in any season to accurately assess whether Team A from Canada West is stronger than Team B from the OUA because the history between Canada West and OUA teams offers such a small sample size. If we use Manitoba's two visits to the National Championships in 2018 and 2019, Manitoba beat Queen's and Western in 2018, but lost to Guelph before defeating Toronto in 2019. In knowing Manitoba is 3-1 in the last three seasons against the OUA, what does that say about Manitoba's chances of beating Ryerson this season?

If your answer is "absolutely nothing", you're exactly right. The chances of either team winning are 50/50 simply due to the fact that we have zero history between the squads. There are variables that can affect the outcome based on what we know about either team, but none of those factors are counted in the Elo System used by U SPORTS. Things such as injuries, starting netminders, player streaks going into the game, and the strength of each team's schedule aren't factors in the Elo System, and we know that all of these variables affect the outcome of games.

The Flaws - Blowouts Matter?

According to the Elo System, there might be very good teams (Manitoba? UPEI?) that fail to crack the Top Ten ranking while other teams (Concordia? Toronto?) find their way into the Top Ten due to their ability to blow out other teams. In the spirit of fairness, no coach at the U SPORTS level should be emphasizing the embarrassment of any other team, but it seems that the Elo System is all about embarrassing good teams. The explanation reads,
If Elo wants good teams to blow out other good teams, I would suggest that the algorithm that U SPORTS is using has a significant flaw in it already. I can assure everyone that beating Alberta 1-0 in five overtimes is far more exciting than a 6-1 or 7-2 win over Alberta from personal experience. There's no reward for a team hammering another team when one considers that the Elo System is about probability of winning, not predicting results in any game.

Beyond that, the author of the explanation above, Mario Kovacevic, points out that one would need a human perspective to determine which team is better than others when three teams are extremely close in Elo values. If Elo can't determine that on its own, the system is missing the entire point of what it's supposed to do in predicting who would win versus the other teams based on blowouts and scores. A win is a win no matter how you slice it, and reading a result where a team wins by a single goal or by a dozen goals doesn't give you the whole story as to what happened in that game.
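To make the blowout complaint concrete, here is a rough sketch of how a margin-of-victory adjustment can be bolted onto Elo. I don't have the exact formula U SPORTS uses, so the logarithmic multiplier, the K-factor of 20, and the function names below are all my own assumptions for illustration.

```python
import math


def mov_multiplier(goal_diff: int) -> float:
    """Hypothetical multiplier: 1.0 for a one-goal game, growing slowly after that."""
    return math.log(abs(goal_diff) + 1.0) / math.log(2.0)


def rating_change(expected: float, won: bool, goal_diff: int, k: float = 20.0) -> float:
    """Points gained (or lost) by a team given its pre-game win probability.

    k and the multiplier are assumptions, not U SPORTS' published values.
    """
    actual = 1.0 if won else 0.0
    return k * mov_multiplier(goal_diff) * (actual - expected)


# Same opponent, same coin-flip odds, very different rewards.
print(rating_change(0.5, True, goal_diff=1))  # a 1-0 win
print(rating_change(0.5, True, goal_diff=5))  # a 7-2 win
```

Under this made-up multiplier, the 7-2 win is worth more than two and a half times what the 1-0 win earns against the very same opponent. Any system built this way rewards running up the score, whatever the exact numbers happen to be.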

The Flaws - The Past Matters?

I struggle to understand why the author of the paper - again, Mr. Kovacevic - opted to use 2012 as the starting point for the statistical analysis in the algorithm. If we look at Canada West, we know that the Mount Royal Cougars began their play at the U SPORTS level that season after having dominated the ACAC. It would take the Cougars five years to make the Canada West playoffs in 2017, and they would finally reach the .500 mark in a season in 2020. They had no playoff victories until 2019, when they won one of three games against the Huskies, and hadn't won a playoff series until 2020, when they defeated both Regina and Calgary - two teams who had zero playoff wins since 2016 - before falling to Alberta in the Canada West Final in two straight games. They followed that with an overtime win over Toronto at the National Championship.

If all of these games are factored into Mount Royal's total score since 2012, how on earth does Mount Royal have an Elo value of 1643.86 this season that has them ranked third in the nation? What about all of the history that Mount Royal had before 2020 when they went on that amazing run? How does none of that factor in? I find it hard to believe that a team who is 5-9 in the playoffs in Canada West in nine seasons can somehow have more points in the Elo System than all of Saskatchewan (2018 National Championship fourth-place finish), Manitoba (2018 National Champions and 2019 National Championship appearance), and Alberta (2019 and 2020 National Championship appearances).

Based on this illogical logic, shouldn't the Montreal Carabins be the best team in the nation annually since they have two National Championship victories and another U SPORTS National Championship Final appearance to their name since 2012? Or perhaps it should be McGill since they have one National Championship to go along with three silver medals at the National Championship since 2012? Western has captured a National Championship and earned a silver medal, so why aren't they ranked?

It's hard to justify an Elo points system when the points system itself doesn't seem to support its own conclusions. The AUS is 12-25 at National Championships since 2012 - the only time they cross over to play other conferences each season outside of exhibition games - and yet they have three teams in the Top Ten with Saint Mary's (1649.04), StFX (1630.05), and UNB (1605.64) on the list. For the record, StFX is 5-7 with a bronze medal in 2013, Saint Mary's is 4-4 with a bronze medal in 2016, and UNB has never qualified or played at the National Championship, yet UNB has more Elo points than Concordia who won a bronze in 2018?

Moreover, how does a conference with a .324 win percentage at National Championships have three teams with more than 1600 Elo points if they don't win big games? They only win 32.4% of the time against other conferences, and the Elo System wants me to believe there are three teams in the AUS that are better than Manitoba or Saskatchewan or Concordia or Guelph historically since 2012? Wow.

The Flaws - Assume Nothing

The following line, more than anything, is why this system is broken.
The ignorance that U SPORTS has for its own rules is quite astounding, but we do know that a student-athlete has five years of eligibility. Assuming that Laval, the school in the example, has 13 seniors graduating, that's a significant section of the team's roster that will not be playing in the following season that will need to be replaced by younger, less experienced players. The assumption that they'll be good is a fool's bet because there's no telling how well Laval will play with all those new faces in the lineup.

Manitoba is a prime example of this as they won the National Championship in 2018, placed fifth at the 2019 National Championship, and saw a significant chunk of their roster graduate out after those two seasons. With twelve new faces in the lineup in 2019-20, the Bisons missed the playoffs in Canada West entirely as the rebuild had begun.

Historically since 2012, Montreal is the only team to appear in eight of the nine National Championships held. McGill has appeared seven times, Alberta has played five times, and StFX has been there five times as well. The catch in these appearances is that the RSEQ traditionally had five teams in the conference, so a two-in-five shot annually to make the National Championship is pretty good odds. McGill and Montreal took advantage of those odds by being the best teams in the RSEQ for the last decade.

What's interesting to note is that Montreal, who played in every National Championship except 2017, is nowhere to be seen on the rankings despite them clearly being involved in the most National Championships since 2012. If we believe that teams should be ranked higher after a prolonged run of success, where is Montreal on the rankings? How is it that the prolonged success of Montreal is ignored while UNB, who has yet to qualify for any National Championship, is better than Montreal in terms of Elo points?

Again, just when you think you have this system figured out, there's a rather glaring error in the data or the rankings that's hard to overlook based on what we know.

The Flaws - Check Your Math

Again, I am flummoxed by the explanation of the math in this system.
The only time that the advantage is equal to the disadvantage in playing games is if the chances of winning the game are 50/50. If a higher ranked team wins, their point totals go up minimally, but they still increase. When we see Toronto and Mount Royal are currently separated in the rankings by less than three points, playing an extra game in the OUA could affect the rankings in a big way.

On the flip side, because there's a 44-point gap between top-seeded McGill and second-seeded Saint Mary's, a loss to a team like the Carabins won't change McGill's position on the rankings very much if Montreal is lurking just outside the Top Ten. Yet the loss should push McGill from the top spot in the rankings because Montreal is worse than, say, UNB who has defeated Saint Mary's once already this season. But because the point totals appear to be arbitrary when it comes to wins and losses and the scores in those games, it would take McGill being blown out by Montreal to help Saint Mary's into top spot in the rankings.
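Here's a quick sketch of what a 44-point gap means under the textbook Elo formula if the two teams involved were to meet head to head. The K-factor of 20 and the 1600 baseline are assumptions of mine for illustration, and this ignores whatever blowout adjustment U SPORTS layers on top.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


K = 20.0       # assumed K-factor, not a published U SPORTS value
BASE = 1600.0  # arbitrary baseline; only the gap between the teams matters
GAP = 44.0     # the gap between first and second quoted above

favourite_win_prob = expected_score(BASE + GAP, BASE)
gain_if_favourite_wins = K * (1.0 - favourite_win_prob)
loss_if_favourite_loses = K * favourite_win_prob

print(f"favourite win probability: {favourite_win_prob:.3f}")
print(f"points gained with a win:  {gain_if_favourite_wins:.2f}")
print(f"points lost with a loss:   {loss_if_favourite_loses:.2f}")
```

With these assumptions, the favourite is expected to win roughly 56% of the time, gains a little less than half of K for doing what it should, and gives up a little more than half of K in an upset. With a K of 20, a single result of that size comes nowhere near closing a 44-point gap.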

Because the conferences don't cross over, the chances of overtaking a team in the rankings are slim because these teams will never see each other during the season for the most part. Beyond that, the text on the snip above clearly states, "playing more games will certainly make the rankings within that conference more 'accurate,'" but we know that's not true either since UNB already has a better record than StFX, but sits lower than them in the rankings despite being one of the teams to hand Saint Mary's a loss.

Can This Be Fixed?

The short answer is no. The longer answer is that the data sets need to be massaged in order to reflect recency. I say that because the results achieved in 2012 are seemingly weighted the same as results achieved in 2020 when we know that there has been an entire generational turnover for each and every team. If we assigned a reduced weight to anything older than 2018, where the results matter less the further back one goes into the past, this could work more accurately as a ranking system. Right now, it's simply a mess of convoluted math, terrible assumptions, and complete ignorance.
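As a rough idea of what recency weighting could look like, here's a small sketch that discounts each season's results the older they get. The halving-per-season decay, the function names, and the sample records are all invented for illustration; none of it is real U SPORTS data or anything U SPORTS has said it does.

```python
def recency_weight(season: int, current_season: int, decay: float = 0.5) -> float:
    """Weight for a season's results; with decay=0.5 it halves for each year of age."""
    return decay ** (current_season - season)


def weighted_win_pct(results, current_season: int, decay: float = 0.5) -> float:
    """results is a list of (season, wins, losses) tuples."""
    weighted_wins = 0.0
    weighted_games = 0.0
    for season, wins, losses in results:
        weight = recency_weight(season, current_season, decay)
        weighted_wins += weight * wins
        weighted_games += weight * (wins + losses)
    return weighted_wins / weighted_games if weighted_games else 0.0


# Made-up team: dreadful in 2012, strong in 2020.
history = [(2012, 2, 20), (2020, 18, 6)]
print(weighted_win_pct(history, 2021, decay=1.0))  # every season counts equally
print(weighted_win_pct(history, 2021, decay=0.5))  # recent seasons dominate
```

With equal weights, this invented team looks like a sub-.500 club. With the decay applied, the 2012 season barely registers and the number reflects the roster that's actually on the ice now.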

So What Do We Do, Smart Guy?

Admittedly, I am not a data scientist. I don't know how one would set up something better, but I know that this can't continue. I would propose that we use a combination of a mathematical approach and human voting since there are things that people see that computers don't. Allowing the media to vote on the rankings of teams would be ideal because we watch most, if not all, the games from our conferences, and we check things like strength of schedule, future games, injury reports, and eligibility.

Just as we've seen reports about the RPI being flawed in the NCAA, the NCAA finally made the decision to move off of RPI for rankings in favour of "The NET". The only problem is that "The NET" returned rankings entirely similar to the hard-to-take-seriously nature of the Elo System which ultimately caused teams to be furious when it came to rankings for the March Madness Tournament. There are people who have already devised new systems, but it seems that the NCAA is putting all their eggs into "The NET".

At the end of the day, perhaps we're just not very good at building algorithms to tell us who is the best team in the land as both the NCAA and U SPORTS have shown us.

You're No Help, Teebz

I never said I had an answer, but I think there is a solution. It's one that likely won't be as fun as the other options, but it will solve a lot of problems: only rank the final eight teams going into the U SPORTS National Championship.

We know who is leading each conference. That takes a simple search engine entry and a mouse click. We can dig a little further to see who has played whom, who is getting big contributions from players, and the trends each team has shown from scores, gamesheets, and streak information. At the end of the day, though, the eight teams who qualify for the National Championship are really the only teams who need a ranking in order to determine placement and who will play whom. If Manitoba falls in the semi-final despite being ranked #1, #10, or #55 on this Elo System, does it affect anyone else's playoffs in the OUA, RSEQ, or AUS?

The rankings put out by U SPORTS are nothing more than a conversation piece, and should in no way be used as a way to evaluate teams in conferences that never play one another. If they're referenced in any meaningful way by someone as a measure of success, nod and smile and give it no further thought because the flaws in the system make it hard to understand why anyone would trust the system.

Until next time, keep your sticks on the ice!
