Further comments on the MCFC Analytics dataset
Categories: Database Development, Match Data Collection
A few hours have passed since my initial post in reaction to the ‘basic’ dataset released by Man City’s performance analysis department, and I have some further comments.
First of all, I hold Manchester City’s Performance Analysis department, and especially Gavin Fleig, in very high regard. I’ve gotten to know Gavin professionally and personally over the last 18 months, and he’s an honest and sincere man who is passionate about analytics in soccer and is one of the leaders in bringing this perspective to the sport. Gavin is also sincere about nurturing and enlarging the soccer analytics community, as am I, and I want to help support it as much as possible.
I also have a very good working relationship with Opta Sports and several of their technical and non-technical personnel. I appreciate their support of this company from the days when it was still a blog and a dream. They also have an interest in seeing the soccer analytics community grow, yet as the owner of this dataset they wish to retain certain rights. It is their data, after all.
I recognize that this initial dataset is one aimed at “basic” users. Such users will not want to write custom preprocessing software or make complicated queries of finely-grained match data. For those wading into soccer analytics, perhaps summary statistics are a good place to start. I have my own opinions on the subject, but fair enough. Again, it’s Opta’s prerogative to only allow summary statistics to be available to the general public, and I can’t get into the head of a “basic” user — it’s possible that to them, this dataset is sufficient and that’s that.
My opinion is that even if all you wanted were new or creative analytics on team and player performance, there are still data points that would be very useful to have. Here are some of them:
(1) Substitution pairs and their timings. There are fields in the dataset that flag whether a player started a match, exited the match or entered it as a substitute. That’s good for establishing who is in the match lineup, but if you’re doing a study of substitution practices or the effect of substitutions on the summary statistics of a match, it would be very useful to know who was subbed for whom and at what time of the match. The kind of study that Bret Myers did on substitution timing is not possible with this basic dataset.
(2) Match referees. Opta tracks the name of the match referee in their detailed feed (there is an ID that is cross-referenced against a record of match referees in the Premier League). Granted, Opta focuses on team and player performance, but with this dataset we know how many cards of either type were given out, how many fouls were called and in which sector of the field, and the number of both events for the same collection of players on the pitch. Do those events happen in a vacuum, independent of the referee? Would it not be useful to add a referee field to whatever study is made?
(3) Goal timings. It’s possible to back out the number of goals scored by either team in a match. We also know how many were scored with which body part and who scored the match-winning goal, but we don’t know when any goal was scored. Metrics like Chris Anderson’s Leverage plots or Ford Bohrmann’s Expected Points Added aren’t possible with this dataset.
So when I proposed my own project, I did it as a way to supplement this “basic” dataset with supporting data. These data are publicly available information — the times of goals, subs, expulsions, and cards are public domain, as well as the name of the referee. A “basic” user could compile the data on his/her own, but I want to enlist the community in this effort. I also recognize that the in-match summary statistics do belong to Opta, and Opta choose to retain their rights to anything derived from them. So whatever is developed should be independent of the large spreadsheet, but there has to be a way to link the match and player IDs of the basic dataset to historical match details. I haven’t thought of a good workaround at this hour.
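One direction worth exploring is a small crosswalk table. The supplementary records are keyed only on public facts (match date, home team, away team), and a separate mapping ties that key to the match IDs in the basic dataset. The Python sketch below is only an illustration under assumptions: the class names, the field names, and the idea that the basic dataset exposes a date and both team names for each match ID are all hypothetical, and the sample values are invented placeholders. Player IDs would need a similar key of their own, which is not attempted here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple

# Natural key built only from public facts: (match date, home team, away team).
MatchKey = Tuple[date, str, str]


@dataclass(frozen=True)
class Goal:
    minute: int
    team: str
    scorer: str


@dataclass(frozen=True)
class Substitution:
    minute: int
    team: str
    player_off: str
    player_on: str


@dataclass(frozen=True)
class MatchSupplement:
    """Publicly available match details, stored without any Opta fields."""
    match_date: date
    home_team: str
    away_team: str
    referee: str
    goals: Tuple[Goal, ...] = ()
    substitutions: Tuple[Substitution, ...] = ()

    @property
    def key(self) -> MatchKey:
        return (self.match_date, self.home_team, self.away_team)


def build_crosswalk(basic_rows, supplements):
    """Link basic-dataset match IDs to supplements via the natural key.

    basic_rows: iterable of dicts with the (assumed) keys
    'match_id', 'date', 'home_team', 'away_team'.
    Returns (linked, unmatched): a dict of match_id -> MatchSupplement,
    and a list of match IDs for which no supplement was found.
    """
    by_key = {s.key: s for s in supplements}
    linked: Dict[int, MatchSupplement] = {}
    unmatched: List[int] = []
    for row in basic_rows:
        key = (row["date"], row["home_team"], row["away_team"])
        if key in by_key:
            linked[row["match_id"]] = by_key[key]
        else:
            unmatched.append(row["match_id"])
    return linked, unmatched


# Invented placeholder data: not real matches, players, referees, or IDs.
supplements = [
    MatchSupplement(
        match_date=date(2000, 1, 1),
        home_team="Team A",
        away_team="Team B",
        referee="Referee X",
        goals=(Goal(minute=57, team="Team A", scorer="Player C"),),
        substitutions=(
            Substitution(minute=70, team="Team B",
                         player_off="Player E", player_on="Player D"),
        ),
    ),
]
basic_rows = [
    {"match_id": 1001, "date": date(2000, 1, 1),
     "home_team": "Team A", "away_team": "Team B"},
    {"match_id": 1002, "date": date(2000, 1, 8),
     "home_team": "Team B", "away_team": "Team A"},
]

linked, unmatched = build_crosswalk(basic_rows, supplements)
print(linked[1001].referee)  # Referee X
print(unmatched)             # [1002]
```

The point of this layout is that the supplement never stores anything derived from the spreadsheet. The only place the two meet is the crosswalk, which holds nothing but an ID and three public facts. Team-name spellings would have to be normalized on both sides before the keys could match reliably.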
Again, maybe I am in a very small minority of the 1500+ who have accessed the MCFC Analytics site over the last 36 hours. Maybe the basic dataset is fine for those getting into this field, and the fields are sufficient to build some clever algorithms and conduct some compelling studies. Maybe it doesn’t need to be “enriched” or spiked or anything else.
I earnestly seek your feedback on this issue. If I’m wrong or out of line, let me know and I’ll suspend this effort.
If you agree that there are data that would make studies with this basic dataset more meaningful, more compelling, and more valuable, join us in this project.