Tantrix Elo Ratings - Theory & Method

Reference : "The International Chess Federation Rating System" by Arpad E. Elo (P) 1973


There is no need to read all of this page now unless you're particularly interested or want to calculate your own rating. Comments are welcome, but please read this page before commenting on the system: it explains the reasoning behind the method (even the bits of it that may seem weird!) and the theoretical flaws I am already aware of. I'm not going to make the common mistake of claiming the system is totally theoretically sound. If you want to skip the theory and just see the method, go straight to "Method used for Tantrix Elo ratings" below.

If you are intending to run a Tantrix tournament and want to find out the conditions for it to be eligible to count towards the Elo ratings, please see section 10 below.

Theory:

Under the Elo method, which has been used for international chess ratings since 1970 and is now used for other sports & games as well, a player's performance is assumed to be normally distributed with a mean equal to their rating and an arbitrary standard deviation (s.d.). The s.d. is assumed to be the same for each player and reflects the variability in their performance - small variations occurring more frequently than large ones.

The "Simple interpretation of the ratings" page explains two intuitive ways in which the ratings can be interpreted and forms part of this explanation. In fact, a linear relationship might be sufficient to express the difference in standard between two players. Normal distributions are needed so that the relative ratings of many different players can be expressed in a way which allows consistent interpretation.

The user-defined factors in the method are the average player rating, the variance, the scores to which the method is applied and the method of averaging over games. All of these issues are covered in the "Method" section below.

The normal distribution theory applied and the exact calculation of the Diff. v % TP table are covered on the "How the difference table was calculated" page.
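As a rough illustration of the normal model described above, the sketch below converts a rating difference into an expected score and back again. It assumes the textbook Elo formulation, in which the difference between two independent performances (each with s.d. 400, the Tantrix value from rule 2) is normal with s.d. 400 * sqrt(2). The function names are invented for this example, and the published Diff. v % TP table may have been calculated differently, so the numbers should not be expected to match it exactly.

```python
from math import sqrt
from statistics import NormalDist

SD = 400.0  # Tantrix s.d. from rule 2) (chess uses 200)

# Assumption: the difference of two independent performances, each with
# s.d. SD, is normally distributed with s.d. SD * sqrt(2).
_DIFF_DIST = NormalDist(mu=0.0, sigma=SD * sqrt(2))


def expected_fraction(diff: float) -> float:
    """Expected score (0..1) for a player rated `diff` points above the opponent."""
    return _DIFF_DIST.cdf(diff)


def diff_for_fraction(fraction: float) -> float:
    """Rating difference implied by a score fraction strictly between 0 and 1."""
    return _DIFF_DIST.inv_cdf(fraction)


if __name__ == "__main__":
    for d in (0, 100, 200, 400):
        print(d, round(100 * expected_fraction(d), 1))
```

The two functions are inverses of each other, so a table in either direction can be built from the same distribution object.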

Potential problem:

In reading what follows, you should remember that the theoretical basis for many non-Elo ranking systems is so flawed or even non-existent that it is not even possible to identify the flaws of those systems, let alone quantify them. For example, in most ranking systems, the difference in ranking between two players does not imply anything specific about the probability of either player winning, making such systems impossible to test in any meaningful way. So, the fact that it is possible to identify and understand theoretical flaws in the Elo system (and hence take action to mitigate them) is a strength rather than a weakness, and is part of what makes Elo so much better as a rating system than most alternatives.

A major theoretical objection to this method is that it is unlikely that even chess performance is normally distributed, though it probably approximates to that. For Tantrix, basing the ratings on the tournament point system means that, even before allowing for the element of chance in the game, scores above 80-85 % are almost impossible to achieve over a large number of games, even for a very strong player playing a very weak player.

This means that too many Tantrix performances fall into the centre of the distribution and too few into the tail for the shape to be exactly that of a normal distribution. Using just win percentages instead of tournament point scores would mitigate that, but it would also imply ignoring a huge amount of information relevant to rating, so is not advisable.

A potential result of this is that once a lot of weaker players start to play in tournaments and maybe end up with quite low ratings, those with the highest ratings may find it virtually impossible not to lose rating points when they play them. Whether the effect will be significant over a whole tournament against both stronger and weaker players is hard to tell.

This effect may not happen anyway - the same reasoning may imply that the ratings (at least of those who have played a lot of games) will remain much closer together than in chess, which is fair enough, since the chance factor in Tantrix is a great leveller!

(Jan 2003) In practice, the Tantrix Elo ratings (particularly those that are fully established, which are the ones that matter most in this context) have turned out to be closer together than Elo ratings in chess, so happily the system has largely sorted out the potential problem described above by itself.

Method used for Tantrix Elo ratings:

As a result of wanting to be completely open and accountable about the method used, the steps below include a full history of changes in the early years (much of which is no longer very relevant) and copious explanations, particularly about adjustments to the method which are complex but not that significant overall. Explanations have tended to be added piecemeal over time so they are not in a particularly logical order.

So, despite the favourable comments made about this page from within the Tantrix world and beyond, I am aware that the method is now quite hard for the casual reader to understand. Sometime in 2003, I hope to produce a much shorter summary of the method, aimed mainly at those who want to attempt to calculate their own rating for individual games and matches.

These rules are subject to modification as the method evolves. Since the ratings were introduced, Rules 3) and 6) were changed on 23 August 1999 and cosmetic changes were made to a couple of other rules.

Rule 4) was changed on 29 August 1999 and at the end of 2000 to make the calculation of G for multi-game matches simpler and fit better with the World Championship format. The latter change was flagged before the 2000 World Championship and applied retrospectively to all of the games played in 2000. Rule 6) was changed to increase the G threshold for a rating to be official since the change to rule 4 tends to increase G and because the list is now based on two years' tournaments. Rule 6) was also changed in July 2001 to allow retired players to be removed from the official list at the end of a calendar year where they have not played any tournament games.

Otherwise, the method was largely finalised by the end of 1999, at which time ratings were recalculated. In 2000, apart from the simplification in the calculation of G, there have been just a couple of minor wording changes in May to take account of the 1999 transitional arrangements no longer applying and a change to Rule 8c) in August to help remove potential distortions in single elimination tournaments with a majority of players not having official ratings. In 2001, minor changes were made to rule 6 to drop players who have effectively retired from the top of the lists more quickly.

In 2002, the rule about when deweighting occurs was clarified to reflect what had actually been happening in practice throughout. In addition, "official" and "unofficial" were changed to "established" and "provisional" respectively and the crossover point increased. More concrete provisions for dealing with 'deflation' were also introduced. Finally, the definition of an official tournament game was considerably clarified. These amendments have already been used in calculating interim ratings for the seeding of the World Team Tantrix Championship (WTTC) and they were reflected in the first 2002 published rating list which appeared on this site in August 2002.

1) Scores used: Tournament point (TP) scores are used to provide the percentages.

2) Variance: For chess, the (arbitrary) s.d. is 200, for Tantrix it has been set to 400.

3) Mean: When the ratings were recalculated at the end of 1999, the weighted (by G) average rating for all players was set to 1850. The overall average rating has gone down since because tournaments attract a far broader range of entrants than they used to but the average rating of the top players has remained about the same as intended. This allows a reasonably fair comparison of players' ratings over time to be made.

4) Averaging ratings from different events: The factor G is used as the measure of the number of games played. It is now simply one for every tournament game played, subject to de-weighting over time as described in 5) below. The published rating is an average weighted by the values of G for each tournament.
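The G-weighted average in rule 4) can be sketched as follows. The function name and the (rating, G) pair layout are illustrative choices for this example only.

```python
def published_rating(events):
    """Average of per-tournament ratings, weighted by each tournament's G.

    `events` is a list of (tournament_rating, g) pairs, where g is the
    (possibly de-weighted) number of games counted for that tournament.
    """
    total_g = sum(g for _, g in events)
    if total_g == 0:
        raise ValueError("no rated games to average")
    return sum(rating * g for rating, g in events) / total_g


if __name__ == "__main__":
    # 10 games at a 1900 performance and 30 games at an 1800 performance
    print(published_rating([(1900, 10), (1800, 30)]))
```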

5) Decay: After 12 months, older tournaments will be dropped from the ratings gradually by repeatedly halving their influence after each full 12 month period following the end of the tournament to which they relate. This will be done by halving their contribution to G, which automatically halves the contribution of the corresponding rating performances too. However, if the date of an annual championship (e.g. NZ Championship) is moved from year to year, the de-weighting of results from this championship in previous years will occur when the rating list incorporating the results of the corresponding tournament in the current year is produced.
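A minimal sketch of the halving in rule 5), assuming that "each full 12 month period" is counted from the tournament's end date to the date of the rating list. The helper names are invented here, and the exception for annual championships whose dates move is not modelled.

```python
from datetime import date


def full_years_between(start: date, end: date) -> int:
    """Number of complete 12 month periods from `start` to `end`."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return max(years, 0)


def decayed_g(games: int, tournament_end: date, list_date: date) -> float:
    """G contribution of a tournament, halved once per full year elapsed."""
    return games / 2 ** full_years_between(tournament_end, list_date)


if __name__ == "__main__":
    ended = date(2001, 3, 1)
    for when in (date(2001, 12, 1), date(2002, 3, 1), date(2003, 6, 1)):
        print(when, decayed_g(8, ended, when))
```

Because the published rating is an average weighted by G, halving a tournament's G also halves the weight of the rating performance from that tournament, as the rule says.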

6) "Established" and "provisional" ratings:

a) A published rating will be "established" (and used in full when determining the seeding points used to determine seedings for future tournaments) if the player's G is 32 or greater AND they have played games in at least three tournaments (counting the WTC and Plate together as one tournament), including at least one tournament in or since the most recently completed calendar year.

b) Seedings are based on the ratings at the closing date for entries to the tournament concerned. If G is less than the "established" threshold, the rating is regarded as "provisional" and is recalculated for seeding purposes as if an additional (threshold - G) games had been played at a level 250 Elo points below the player's current rating. In addition, the value of G which can be taken into account in deciding seedings is limited to 50% of the threshold if fewer than two tournaments have been completed and to 75% of the threshold if exactly two have been completed, with a tournament only counted as completed if the player plays six games or more in it. Further reductions are applied of 100 if the player has not played at least one tournament game in or since the most recently completed calendar year, and of 100 * (6 - G) if G is 5 or less.

c) If a player has no Elo rating or if double the player's lobby ranking at the closing date less 250 is greater than the seeding point value based on their Elo rating, then that player's seeding point total is deemed to be 2 x their lobby ranking at the closing date less 250.

d) For players in a team tournament, an overriding minimum seeding point value of 1400 is applied to avoid too much distortion of the team average for a team which is forced to field a very weak player - changed from 1500 to 1400 in 2003 to avoid the converse problem of a lower team's seeding band being decided by just one player's rating or by no ratings at all.
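The seeding rules in 6a) to 6d) can be sketched as below. This is one reading of the wording rather than the actual calculation used: the function and parameter names are invented, the "additional games at 250 points below" provision is interpreted as a G-weighted average, and the cap on G is applied to every rating that is not established.

```python
THRESHOLD = 32            # G needed for an "established" rating, rule 6a)
PROVISIONAL_OFFSET = 250  # level of the notional extra games, rule 6b)
TEAM_FLOOR = 1400         # minimum seeding points in team events, rule 6d)


def elo_seeding_points(rating, g, played, completed, active_recently):
    """Seeding points from an Elo rating alone (rules 6a and 6b).

    played: tournaments with any games; completed: tournaments with 6+ games.
    """
    if g >= THRESHOLD and played >= 3 and active_recently:
        return rating  # established: used in full
    if completed < 2:
        cap = 0.5 * THRESHOLD
    elif completed == 2:
        cap = 0.75 * THRESHOLD
    else:
        cap = THRESHOLD
    g_eff = min(g, cap)
    # Weighted average of g_eff games at `rating` and the rest 250 lower.
    points = rating - PROVISIONAL_OFFSET * (THRESHOLD - g_eff) / THRESHOLD
    if not active_recently:
        points -= 100
    if g <= 5:
        points -= 100 * (6 - g)
    return points


def seeding_points(rating, g, played, completed, active_recently,
                   lobby_ranking=None, team_event=False):
    """Apply the lobby-ranking alternative (6c) and the team floor (6d)."""
    lobby_points = None if lobby_ranking is None else 2 * lobby_ranking - 250
    if rating is None:
        if lobby_points is None:
            raise ValueError("need an Elo rating or a lobby ranking")
        points = lobby_points
    else:
        points = elo_seeding_points(rating, g, played, completed, active_recently)
        if lobby_points is not None and lobby_points > points:
            points = lobby_points
    return max(points, TEAM_FLOOR) if team_event else points


if __name__ == "__main__":
    print(seeding_points(1800, 16, 3, 3, True))
    print(seeding_points(1800, 4, 1, 0, True, team_event=True))
```

In the first call, half of the notional 32 games are treated as played at 1550, so the seeding value falls halfway towards the 250 point penalty.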

7) Calculating the rating implied by a tournament: A player's rating for any given tournament will be the average rating of their opponents PLUS the "Diff." factor corresponding to their % TP score against those players, read off the table of Diff. v % TP.
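Rule 7) amounts to one addition once the Diff. value is known. The sketch below uses a stand-in for the Diff. v % TP table based on the normal model (s.d. 400, with the difference of two performances assumed to have s.d. 400 * sqrt(2)), so its values may differ from the published table. The function names, and the choice of one opponent rating per game played, are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist, mean


def diff_lookup(tp_fraction: float, sd: float = 400.0) -> float:
    """Stand-in for the Diff. v % TP table (assumed normal model)."""
    return NormalDist(0.0, sd * sqrt(2)).inv_cdf(tp_fraction)


def tournament_rating(opponent_ratings, tp_scored, tp_available):
    """Average opponent rating plus the Diff. for the % TP scored.

    opponent_ratings: one entry per game played (an assumption).
    """
    aor = mean(opponent_ratings)
    return aor + diff_lookup(tp_scored / tp_available)


if __name__ == "__main__":
    # 55% of the available TPs against opponents averaging 1850
    print(round(tournament_rating([1800, 1900], 22, 40)))
```

Scoring exactly 50% gives a Diff. of zero, so the tournament rating equals the average opponent rating. A score of 0% or 100% has no finite Diff. under this model, and `inv_cdf` raises an error in that case.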

8) Average opponent rating: The calculation of this depends on the composition of the tournament and is covered in even more detail in books on the Elo method:

a) If all opponents had ratings based on more than 20 tournament games before the tournament, the average of their pre-tournament ratings is used.

b) For the inaugural year where no players had G>=20 ratings beforehand, they were brought in at an estimated average value and the "Diff." values found in 7) above were applied to these values to produce new ratings. These were then used as a new starting point and the process was iterated until the change in each player's rating over two successive iterations was zero - this was easy to do on a spreadsheet. This gave a set of self-consistent ratings reflecting each player's overall performance in tournament games that year. However, if there are any future tournaments where very few players have pre-tournament ratings, it may be necessary to combine that tournament with the next tournament with more previously rated players in it which involves some of the same players in order to provide the necessary fixed point for starting the iteration process.

c) If there is a mix of players with G>=20 and G less than 20 before the tournament, the pre-tournament ratings of the G>=20 players are fixed and iteration as in b) above is used to find the (pre- = post-) ratings of the other players. Once some players have G>=20, the values at which this iteration process is started are irrelevant except that good initial estimates will reduce the number of iterations required. The post-tournament ratings of the first group of players are then calculated as in a) above. The only exception to this is if a player with G < 20 before a tournament loses all of their games in the tournament or only plays against one opponent. That player's games are ignored when rating any of their opponents with G>=20 for that tournament, but one is still added to the opponent's G for each such game played. The ratings of the opponents would otherwise be distorted downwards slightly for mathematical reasons - this can be explained with an example on request.
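The iteration in 8b) and 8c) can be sketched as a simple fixed-point loop. In this illustration the G>=20 players are held at their pre-tournament ratings while the others float until their ratings stop changing. The data layout, the names and the normal-model stand-in for the Diff. table are all assumptions for the example, and each game is assumed to carry the same number of TPs so that the overall % TP is the mean of the per-game fractions.

```python
from math import sqrt
from statistics import NormalDist, mean

# Assumed stand-in for the Diff. v % TP table (s.d. 400 per performance).
_DIFF = NormalDist(0.0, 400.0 * sqrt(2))


def iterate_ratings(fixed, start, results, tol=1e-6, max_iter=1000):
    """Find self-consistent ratings for the floating players.

    fixed:   {player: pre-tournament rating} for players with G >= 20
    start:   {player: initial estimate} for the floating players
    results: {player: [(opponent, tp_fraction), ...]}, one entry per game,
             for every floating player
    """
    current = dict(start)
    for _ in range(max_iter):
        lookup = {**fixed, **current}
        updated = {}
        for player in current:
            games = results[player]
            aor = mean(lookup[opp] for opp, _ in games)
            share = mean(frac for _, frac in games)
            updated[player] = aor + _DIFF.inv_cdf(share)
        if all(abs(updated[p] - current[p]) < tol for p in current):
            return updated
        current = updated
    raise RuntimeError("ratings did not converge")


if __name__ == "__main__":
    fixed = {"A": 1900.0}
    results = {"B": [("A", 0.4), ("C", 0.5)], "C": [("A", 0.3), ("B", 0.5)]}
    print(iterate_ratings(fixed, {"B": 1500.0, "C": 1500.0}, results))
```

Running the example from two very different starting estimates gives the same answer, which illustrates the point in 8c) that the starting values only affect the number of iterations needed.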

9) Deflation control: (this section is very technical and not completely stable yet, so it can safely be ignored if you wish) According to Elo, this is "the key to the successful operation of the system". Deflation measures implicitly assume that there is some kind of fixed 'standard' level on which to base them, and contrary to what Elo appears to assert, there is no such measurable 'standard', so such measures are bound to have to be slightly subjective, even if they can be expressed in mathematical terms so as not to look that way.

It was previously stated that measures to control overall deflation in the ratings would be introduced when we saw how the system evolves. Players with G>=20 sometimes show a step change improvement in their performance from one tournament to the next to an extent that was originally expected only to be achieved by players who had played far fewer tournament games.

So, to avoid the deflationary effect that this was tending to produce when such players sustained this performance over a lot of games during a tournament, some players with G>=20 before the tournament are allowed to float for AOR (average opponent rating) purposes, like players with G below 20, during a second set of iterations. This applies to players who play four or more different opponents and seven or more games (c.f. the ITM norm qualifications) and who outperform their previous overall rating by 100 points or more at the end of the first set of iterations for a single tournament. There is a proviso (2002 WTC onwards) that during the second set of iterations, the rating of such players for AOR purposes is reduced down towards their pre-tournament rating by a factor of (1 / the no. of opponents faced in the tournament) - this provision simply helps to avoid any chance of values spiralling out of control. For the same reason, a player's contribution to AOR (but not their own rating for the tournament) is capped at 2050 (the Elo+G level for a TGM norm) if the only reason it would be above this level is because of the adjustment described earlier in this section.

The requirement for at least four different opponents to be involved is designed to avoid the situation where a player gets a huge rating based on games against one or two strong players who are playing badly but is knocked out quickly afterwards, since in that case it is more likely to be the case that the strong players are playing well below their previous standard rather than that the lowly rated player has made a sudden huge improvement.
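The conditions for floating and the adjustment applied to a floating player's rating can be sketched as follows. This is only one possible reading of the wording: the "factor of (1 / the no. of opponents)" is taken to mean moving the floated rating back towards the pre-tournament rating by that fraction of the gap, and the 2050 cap is applied only when the pre-tournament rating was not already above 2050. All names are invented for this example.

```python
AOR_CAP = 2050  # Elo+G level for a TGM norm, used as the cap on AOR contribution


def may_float(g_before, opponents, games, pre_rating, first_pass_rating):
    """True if an established player qualifies to float in the second pass."""
    return (
        g_before >= 20
        and opponents >= 4
        and games >= 7
        and first_pass_rating - pre_rating >= 100
    )


def aor_rating_when_floating(pre_rating, floated_rating, opponents):
    """Rating used for opponents' AOR while the player floats.

    Assumed reading: pull the floated rating back towards the
    pre-tournament rating by 1/opponents of the gap, then cap at 2050
    unless the pre-tournament rating was already above that level.
    """
    pulled = floated_rating - (floated_rating - pre_rating) / opponents
    if pre_rating <= AOR_CAP:
        return min(pulled, AOR_CAP)
    return pulled


if __name__ == "__main__":
    print(may_float(25, 4, 7, 1800, 1900))
    print(aor_rating_when_floating(1800, 2000, 4))
```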

The limits (4 opponents, 7 games, a tournament-only rating 100 higher than their overall rating) are fairly arbitrary, though having said that the main G=20 limit is somewhat arbitrary too. For 2005, we will look into the possibility of allowing all players to 'float' for AOR purposes when doing a second set of iterations (the first set of iterations having determined the average Elo rating for the tournament as a whole) or doing what some chess federations have recently started doing - this is to do a first set of iterations and then use each player's rating after the first pass as the player's rating for AOR purposes and then do a single extra iteration (with no 'floating') based on the resulting AOR values to calculate each player's final rating for that tournament. However, these methods (while they may seem more scientific than the current one) can give more anomalous results under certain circumstances, so plenty more thinking will be done before a permanent solution is implemented.

A linked issue is that for a single tournament, the average post-tournament rating should in theory be the same as the average pre-tournament rating, measured as the sum over all players who had ratings based on G >= 20 before the tournament of (rating * G) for that tournament. Small approximations in Excel's NORMSINV function may tend to reduce the average rating by a small amount in certain circumstances, so where this phenomenon occurs, the ratings calculated should all be adjusted upwards at the end by the same factor to keep the SUM (rating * G) total the same. However, this is not on its own a solution to the problem of step changes in performance from established players, even though it sounds like it should be!
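The adjustment described above can be sketched as a single multiplicative rescaling. The sketch assumes that the SUM (rating * G) total is taken over the players who had G >= 20 before the tournament and that the same factor is then applied to every calculated rating. The function name and the dictionary layout are invented for this example.

```python
def rescale_to_preserve_total(pre, post, g):
    """Scale all post-tournament ratings by one common factor.

    pre:  {player: pre-tournament rating} for players with G >= 20 beforehand
    post: {player: calculated tournament rating} for all players
    g:    {player: G for this tournament}
    The factor is chosen so that the sum of rating * G over the players
    in `pre` is the same before and after the tournament.
    """
    target = sum(pre[p] * g[p] for p in pre)
    actual = sum(post[p] * g[p] for p in pre)
    factor = target / actual
    return {player: rating * factor for player, rating in post.items()}


if __name__ == "__main__":
    pre = {"A": 1900.0, "B": 1800.0}
    post = {"A": 1895.0, "B": 1797.0, "C": 1600.0}
    g = {"A": 10, "B": 10, "C": 10}
    print(rescale_to_preserve_total(pre, post, g))
```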

None of this should make a huge difference to the overall results of the calculations, but we obviously want to make any anti-deflationary measures as scientific as possible. The effect of existing anti-deflationary measures and the need for further ones will be kept under review.

10) Tournaments and games included in the Elo ratings:

a) Only official tournament games will be counted. An "official tournament" can be played online or offline but needs to meet the following conditions:

i) Games must have time limits per player per game of 15 minutes (online) or 20 minutes (offline), with time limits up to 10% below those values or higher than those values (as long as there is a time limit) still being acceptable. It is usually not too difficult to borrow clocks from a local chess (or other strategy game) club if you want to run an official offline tournament. Tournament games should preferably use the time penalty system rather than instant losses on time - even offline, the number of minutes over time has proved to be easy for a controller to ascertain even if only the most basic chess clock is used.

ii) The tournament must be reasonably open. Tournaments where entry is limited by country or by region, tournaments with age limits and tournaments limited by ranking or another objective qualification such as a high finish in a previous tournament are all acceptable. However, totally unadvertised ad hoc tournaments between groups of friends (while strongly encouraged since they can be a lot of fun) are not classed as official tournaments for this purpose.

iii) The tournament must be advertised. Asking for it to be advertised on the tournament website is easiest; you could also advertise it on the Tantrix website for the relevant country/ies, ask for a news item about it to appear in the lobby or (as far as is practical) email all players who are eligible to play in it. Players should be informed in advance (e.g. on the tournament home page or in an email) that the games are going to be Elo rated.

iv) The tournament point scores in each game must be supplied by the controller after the tournament finishes (or by another player, for example if the controller was playing in the tournament and withholds the results because he/she did badly and no longer wants them to be rated!), and it must be made clear which of the results were defaults, if any.

b) A game which is defaulted before it is started will not be counted for rating purposes at all, even though it counts for the tournament concerned. This is because it is often not the defaulted player's fault (more a function of their Internet connection) and the Elo rating is designed to reflect actual play.

c) If one player has disconnected early and has not returned within the time allowed and the Robot has been allowed to play on for them, or if the game could not be completed by either player for any reason and has been sent for adjudication, the adjudicator of the game will also decide whether the game should be counted in full towards Elo ratings or not counted towards Elo ratings at all. In general, this type of game will not count towards Elo ratings unless one of the following applies: the disconnection occurs close to the end of the game and the likely win/loss/draw outcome can be determined by the adjudicator with a high degree of confidence (even if there is some uncertainty about the margin); the disconnection occurs earlier in the game but it was already obvious who was going to win, e.g. because one player had completed (or was very likely to complete) a large loop; or it appears at all possible that one player may have disconnected because they were likely to lose and did not want the result to affect their rating.

d) Time penalties for hitting or exceeding the standard time limit (15:00 from 2001) will be included in the TP score used to calculate the ratings on the same basis as they are in the tournaments themselves.

11) Final note:

Comments on anything on this page (especially the few issues that are more subjective and hence the hardest to get right) are always welcomed.

Further information:

The following links contain even more information - the less technical notes are starred:

* Simple interpretation of the ratings *

* Why do rating and tournament positions differ? *

How the difference table was calculated

* Tournament categories *

* Tantrix Master titles *

