Integrated Data Modeling
The foundation of analytics is data.
It’s been said that 80% of the data analysis process is spent collecting, cleaning, and organizing incoming data. Data analysis in soccer is no exception, and in some respects the situation is even worse. There are many soccer databases, offline and online, that contain large amounts of historical or in-match data. But most of the time these data are not organized in a way that facilitates analysis, and the end user has to devote a significant amount of time to pre-processing them. This work is labor- and time-intensive and can prevent more advanced analysis from taking place. The information embedded in the data stays hidden and is never acted upon.
We experienced this problem when we started our own internal analytics projects, and the existing solutions didn’t meet our requirements. So we created our own.
The Football Match Result Database (FMRD) is a relational database schema that models the match result data needed to support basic and advanced match analytics. Match result data are the basic events that fully describe a soccer match, whether it was played in 2012 or 1870. These data include the following:
- Top-level data on the soccer match, including match date, competition name and phase, phase-specific details, participating teams, venues, and environmental conditions.
- Starting and substitute players for each team in the match.
- All macro-events that occur during the match, such as goals, penalties, disciplinary incidents, and substitutions.
- Biographical and physical data on participating personnel, such as players, managers, and match referees.
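To make the shape of such a model concrete, here is a minimal sketch of a relational layout for match result data, built with Python's standard `sqlite3` module. It is not the actual FMRD schema: every table name, column name, and sample row below is a hypothetical stand-in chosen for illustration, and the real schema covers far more (competition phases, environmental conditions, referees, and so on).

```python
import sqlite3

# Hypothetical, heavily simplified tables; NOT the real FMRD definitions.
SCHEMA = """
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE players (
    player_id  INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL,
    birth_date TEXT
);
CREATE TABLE matches (
    match_id     INTEGER PRIMARY KEY,
    match_date   TEXT NOT NULL,
    competition  TEXT NOT NULL,
    venue        TEXT,
    home_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    away_team_id INTEGER NOT NULL REFERENCES teams(team_id),
    CHECK (home_team_id <> away_team_id)
);
CREATE TABLE lineups (
    lineup_id  INTEGER PRIMARY KEY,
    match_id   INTEGER NOT NULL REFERENCES matches(match_id),
    team_id    INTEGER NOT NULL REFERENCES teams(team_id),
    player_id  INTEGER NOT NULL REFERENCES players(player_id),
    is_starter INTEGER NOT NULL CHECK (is_starter IN (0, 1)),
    UNIQUE (match_id, player_id)
);
CREATE TABLE goals (
    goal_id   INTEGER PRIMARY KEY,
    lineup_id INTEGER NOT NULL REFERENCES lineups(lineup_id),
    minute    INTEGER NOT NULL CHECK (minute >= 1)
);
"""


def build_demo_db():
    """Create an in-memory database and load one made-up match."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO teams VALUES (?, ?)",
                     [(1, "Home FC"), (2, "Away United")])
    conn.execute("INSERT INTO players VALUES (1, 'Player A', NULL)")
    conn.execute("INSERT INTO matches VALUES "
                 "(1, '2012-05-13', 'Example League', 'Example Ground', 1, 2)")
    conn.execute("INSERT INTO lineups VALUES (1, 1, 1, 1, 1)")
    conn.execute("INSERT INTO goals VALUES (1, 1, 57)")
    conn.commit()
    return conn


def goals_with_scorers(conn, match_id):
    """Return (scorer, team, minute) tuples for one match, in time order."""
    query = """
        SELECT p.full_name, t.name, g.minute
        FROM goals AS g
        JOIN lineups AS l ON l.lineup_id = g.lineup_id
        JOIN players AS p ON p.player_id = l.player_id
        JOIN teams   AS t ON t.team_id   = l.team_id
        WHERE l.match_id = ?
        ORDER BY g.minute
    """
    return conn.execute(query, (match_id,)).fetchall()
```

In this sketch a goal points at a lineup entry rather than directly at a player, so a goal can only be credited to someone who actually took part in the match. That is one possible way to enforce relationship integrity at the database level, not necessarily the approach FMRD takes.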
The FMRD is the foundation of Soccermetrics’ analytics infrastructure. We devoted a lot of time to describing a soccer match in its most basic details, and then to developing models that relate those details to each other. We’re obsessive about enforcing data and relationship integrity, and we’ve incorporated database views to make application development easier. We also realize that our end users have different needs, so we open-sourced the schema to let them capture other types of data.
Want to capture touch-by-touch events, such as those created by Opta and Prozone, in a structured format? We’ve developed the Football Match Event Database (FMED), a superset of FMRD that captures in-match events along with the match times and spatial locations at which they occur.
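As a rough illustration of what storing an event with its match time and location can look like, the sketch below adds a single event table to a stub match table using Python's `sqlite3` module. The table and column names, the percentage-based pitch coordinates, and the helper functions are all assumptions made for this example; they are not taken from the FMED schema.

```python
import sqlite3


def build_event_db():
    """Create an in-memory database with a stub match table and an event table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE matches (
            match_id   INTEGER PRIMARY KEY,
            match_date TEXT NOT NULL
        );
        -- Hypothetical event table: when and where each event happened.
        CREATE TABLE match_events (
            event_id        INTEGER PRIMARY KEY,
            match_id        INTEGER NOT NULL REFERENCES matches(match_id),
            event_type      TEXT NOT NULL,
            period          INTEGER NOT NULL CHECK (period >= 1),
            elapsed_seconds REAL NOT NULL CHECK (elapsed_seconds >= 0),
            -- Assumed convention: position as a percentage of pitch length/width.
            x_pct           REAL NOT NULL CHECK (x_pct BETWEEN 0 AND 100),
            y_pct           REAL NOT NULL CHECK (y_pct BETWEEN 0 AND 100)
        );
    """)
    return conn


def record_event(conn, match_id, event_type, period, elapsed_seconds, x_pct, y_pct):
    """Insert one event; the database rejects unknown matches and bad coordinates."""
    conn.execute(
        "INSERT INTO match_events "
        "(match_id, event_type, period, elapsed_seconds, x_pct, y_pct) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (match_id, event_type, period, elapsed_seconds, x_pct, y_pct),
    )


def events_in_x_range(conn, match_id, x_min, x_max):
    """Return events whose x position lies in [x_min, x_max], in match order."""
    query = """
        SELECT event_type, elapsed_seconds, x_pct, y_pct
        FROM match_events
        WHERE match_id = ? AND x_pct BETWEEN ? AND ?
        ORDER BY period, elapsed_seconds
    """
    return conn.execute(query, (match_id, x_min, x_max)).fetchall()
```

Storing time as a period plus seconds elapsed within that period is only one option; it sidesteps ambiguity around stoppage time, but a real event model may well represent time differently.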
Are you interested in “just the facts”? We’ve also created a light version of the Football Match Result Database (FMRD-Light), a subset of FMRD that captures top-level data on a match and the final scores. It’s the perfect data model for those who want to track scorelines or generate league tables with minimal overhead.
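To show why final scores alone are enough for a league table, here is a small Python sketch that builds one from a list of results. The input format, the function name `league_table`, and the ranking rules (points, then goal difference, then goals scored, then team name) are assumptions for this example; competitions differ in their tiebreakers, and many older ones awarded two points for a win, which is why `win_points` is a parameter.

```python
from collections import defaultdict


def league_table(results, win_points=3, draw_points=1):
    """Build a league table from final scores.

    results: iterable of (home_team, away_team, home_goals, away_goals).
    Returns a list of (team, stats) pairs, best-placed team first.
    """
    table = defaultdict(lambda: {
        "played": 0, "won": 0, "drawn": 0, "lost": 0,
        "goals_for": 0, "goals_against": 0, "points": 0,
    })

    for home, away, home_goals, away_goals in results:
        # Score each match twice, once from each team's point of view.
        for team, scored, conceded in ((home, home_goals, away_goals),
                                       (away, away_goals, home_goals)):
            row = table[team]
            row["played"] += 1
            row["goals_for"] += scored
            row["goals_against"] += conceded
            if scored > conceded:
                row["won"] += 1
                row["points"] += win_points
            elif scored == conceded:
                row["drawn"] += 1
                row["points"] += draw_points
            else:
                row["lost"] += 1

    def rank_key(item):
        team, row = item
        goal_difference = row["goals_for"] - row["goals_against"]
        # Assumed tiebreakers: points, goal difference, goals scored, name.
        return (-row["points"], -goal_difference, -row["goals_for"], team)

    return sorted(table.items(), key=rank_key)
```

A call such as `league_table([("A", "B", 2, 0), ("B", "C", 1, 1)])` returns one entry per team, each with its played, won, drawn, lost, goals, and points totals.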
These database schemas are compatible with most of the major database engines, and we’re adding more translations in response to customer demand. We’re also developing applications that let customers load their match data into FMRD-formatted databases, from which further analysis can be performed.