Are there common features of teams with large Pythagorean variances?
Categories: Goalscoring Models, Soccer Pythagorean: Theory, Team Performance
My last post has sparked a question in my soccermetric mind: Are there common features in the offensive and defensive goal distributions for teams with large Pythagorean variances? The Soccer Pythagorean works well at assessing the level of team performance relative to expectations from their goal statistics. It can even predict point totals within a relatively narrow margin (4-6 points). The estimation falls short with teams that have extremely lopsided offensive goal statistics.
A couple of days ago I looked at Ajax Amsterdam's goalscoring distributions from this season and observed that their offensive distribution is skewed in the opposite direction from that of more typical goal distributions. More importantly, the offensive distribution was skewed in the opposite direction from the defensive goal distribution, which would make the curve fit of the underlying distribution very difficult. To find out if that was also the case for other teams with extremely lopsided goal statistics, I took a look at Barcelona's record in the Spanish Primera last season, when they won The Treble. Below is a histogram and a smoothed probability density of their goal offense (horizontal axis is number of goals, vertical axis is probability from 0 to 1):
(As you can see, I'm starting to get the hang of using R. 🙂 )
Here is the same type of plot with Barcelona's goal defense:
And finally, here's the final league table from the 2008-09 season with Pythagorean estimates:
Team | GP | GF | GA | Pts | Pythag | +/- |
---|---|---|---|---|---|---|
Barcelona | 38 | 105 | 35 | 87 | 76 | +11 |
Real Madrid | 38 | 83 | 52 | 78 | 65 | +13 |
Sevilla | 38 | 54 | 39 | 70 | 62 | +8 |
Atlético Madrid | 38 | 80 | 57 | 67 | 61 | +6 |
Villarreal | 38 | 61 | 54 | 65 | 56 | +9 |
Valencia | 38 | 68 | 54 | 62 | 59 | +3 |
Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6 |
Málaga | 38 | 55 | 59 | 55 | 50 | +5 |
Mallorca | 38 | 53 | 60 | 51 | 48 | +3 |
Espanyol | 38 | 46 | 49 | 47 | 50 | -3 |
Almería | 38 | 45 | 61 | 46 | 42 | +4 |
Racing Santander | 38 | 49 | 48 | 46 | 52 | -6 |
Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1 |
Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8 |
Osasuna | 38 | 41 | 47 | 43 | 47 | -4 |
Valladolid | 38 | 46 | 58 | 43 | 44 | -1 |
Getafe | 38 | 50 | 56 | 42 | 48 | -6 |
Betis | 38 | 51 | 58 | 42 | 48 | -6 |
Numancia | 38 | 38 | 69 | 35 | 33 | +2 |
Recreativo | 38 | 34 | 57 | 33 | 36 | -3 |
Now, Barcelona's goal distributions are different from Ajax's in that they are both skewed in the same direction. This characteristic is typical of most teams. What sets Barcelona's offensive distribution apart is a second peak at six goals, which makes it what statisticians call a bimodal distribution. Last season's team scored six goals in a higher proportion of matches than they scored zero, four, or five. A Weibull curve fit, which has only one peak, would miss about half of those six-goal matches, which could explain the discrepancy in the final Pythagorean estimation.
Let's assume that the current curve fit estimates that Barcelona will score six goals in 5% of its matches, or about two matches (0.05 × 38 = 1.9). If Barcelona scores six goals in a match, the chances of them winning the match are very good, almost 100% in fact, so let's assume they take all three points in those games. The difference between the curve fit and reality is about 7%, or about three matches (0.07 × 38 = 2.66). At three points per win, the failure to pick up the second mode in Barça's goalscoring distribution creates a discrepancy of about nine points, which is just about the entire Pythagorean variation.
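The back-of-the-envelope arithmetic above can be written out as a small Python sketch. The function name `points_gap` and its parameters are made up for illustration. The 7% gap is the rough figure from the text, and treating every missed six-goal match as a full three points is the same simplifying assumption made above, so read the result as an upper-end estimate.

```python
def points_gap(share_gap, matches=38, points_per_win=3):
    """Convert a gap in six-goal match share into league points.

    share_gap: difference between the observed and fitted share of
    six-goal matches (assumed, taken from the rough figures in the text).
    Assumes every missed six-goal match is a win the model gave no
    credit for, as in the text.
    """
    missed_matches = share_gap * matches
    exact_points = missed_matches * points_per_win
    # The text rounds to whole matches before converting to points.
    rounded_points = round(missed_matches) * points_per_win
    return missed_matches, exact_points, rounded_points


missed, exact, rounded = points_gap(0.07)
print(f"missed matches: {missed:.2f}")
print(f"points (unrounded matches): {exact:.2f}")
print(f"points (rounded to whole matches): {rounded}")
```

Rounding to whole matches first gives the nine points quoted above; without rounding the gap is closer to eight.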
So it seems that a change in the distribution skewness doesn't have to be present to produce large changes in the Pythagorean estimate. Bimodal distributions also have the same effect.
UPDATE (9 May): You know, maybe there's not much of a difference. I looked through my code and noticed that the win/draw probability calculations only consider scenarios where a team has scored up to five goals in a game. That's usually sufficient for most leagues, but not in the Spanish league last season. I increased the upper limit to ten and recalculated, and this is what I got:
Team | GP | GF | GA | Pts | Pythag | +/- |
---|---|---|---|---|---|---|
Barcelona | 38 | 105 | 35 | 87 | 87 | 0 |
Real Madrid | 38 | 83 | 52 | 78 | 69 | +9 |
Sevilla | 38 | 54 | 39 | 70 | 62 | +8 |
Atlético Madrid | 38 | 80 | 57 | 67 | 65 | +2 |
Villarreal | 38 | 61 | 54 | 65 | 57 | +8 |
Valencia | 38 | 68 | 54 | 62 | 61 | +1 |
Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6 |
Málaga | 38 | 55 | 59 | 55 | 50 | +5 |
Mallorca | 38 | 53 | 60 | 51 | 48 | +3 |
Espanyol | 38 | 46 | 49 | 47 | 50 | -3 |
Almería | 38 | 45 | 61 | 46 | 42 | +4 |
Racing Santander | 38 | 49 | 48 | 46 | 53 | -7 |
Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1 |
Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8 |
Osasuna | 38 | 41 | 47 | 43 | 47 | -4 |
Valladolid | 38 | 46 | 58 | 43 | 44 | -1 |
Getafe | 38 | 50 | 56 | 42 | 48 | -6 |
Betis | 38 | 51 | 58 | 42 | 48 | -6 |
Numancia | 38 | 38 | 69 | 35 | 33 | +2 |
Recreativo | 38 | 34 | 57 | 33 | 36 | -3 |
Spot. On.
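To see why the upper limit on goals matters, here is a minimal Python sketch of the truncation effect described in the update. It is not the code behind the tables. As a stand-in for the Weibull-based goal model, it assumes independent Poisson goal counts with means GF/GP and GA/GP, and the function names are hypothetical. The point totals it prints will not match the Pythagorean columns. What it shows is how much probability gets dropped when scorelines are capped at five goals.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson model (an assumption)."""
    return exp(-mean) * mean**k / factorial(k)


def truncated_points(goals_for, goals_against, matches, max_goals):
    """Expected points when only scorelines up to max_goals are summed.

    Returns (expected_points, probability_mass_covered).
    """
    mean_for = goals_for / matches
    mean_against = goals_against / matches
    p_win = p_draw = mass = 0.0
    for scored in range(max_goals + 1):
        for conceded in range(max_goals + 1):
            p = poisson_pmf(scored, mean_for) * poisson_pmf(conceded, mean_against)
            mass += p
            if scored > conceded:
                p_win += p
            elif scored == conceded:
                p_draw += p
    # Three points for a win, one for a draw.
    return matches * (3 * p_win + p_draw), mass


# Barcelona 2008-09: 105 goals for, 35 against, 38 matches (from the table).
pts5, mass5 = truncated_points(105, 35, 38, max_goals=5)
pts10, mass10 = truncated_points(105, 35, 38, max_goals=10)
print(f"cap 5:  {pts5:.1f} points, {mass5:.3f} of probability covered")
print(f"cap 10: {pts10:.1f} points, {mass10:.3f} of probability covered")
```

For a team averaging close to three goals a game, a cap of five leaves out a noticeable share of scorelines, and nearly all of them are wins. For teams averaging a goal and a half or less, the cap hardly matters, which fits the fact that most rows in the table did not move.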
What was most incredible about Barcelona's season was that it obscured the fact that Real Madrid, Atlético Madrid, and Sevilla were also playing at a high level.
I still think it might be useful to look at common features of teams with large Pythagorean variances. I just don't think it applies in Barcelona's case.