This guest post is from StatDNA Soccer Analytics Research Competition Winner Sarah Rudd. Sarah blogs regularly on her soccer and statistics work on the On Football blog and is an employee of Microsoft.
Charlie Adam is a fantastic player who, for some reason, insists on taking a shot from 40 yards out every game. From a fan perspective, it drives me crazy because in almost every instance, all it accomplishes is giving the ball back to the other team. He never scores and rarely comes close to even troubling the keeper from these long-range shots. From an analytics perspective, it got me thinking: how much of an opportunity is Charlie Adam wasting with these shots? Can we estimate how likely a team is to score from a given game state (position of the ball, defensive pressure and defensive shape)? Given those estimates, what does that tell us about teams’ tendencies and individual performances?
With the ball at midfield, a team is very unlikely to score from a shot, but they could pass it around searching for a better opportunity, and eventually the team will either score or turn the ball over to the other team. My aim was to determine how likely those two outcomes are. I decided to use Markov Chains with absorbing states to model possessions. Drive by Football has a good explanation of Markov Chains if you aren’t familiar with them. Basically they are a way of modeling an outcome based on the probability of transitioning from one state to another. In this example, the states would be a combination of position on the field, defensive pressure and the shape of the defense. The transitions would be actions performed by the players (pass, shoot, dribble, tackle, etc.). One of the keys to Markov Chains is that they require that the next state depends only on the current state and not on the states that came before it. In other words, it doesn’t matter how we got here; every time we are in a given state, the probabilities of what happens next should be the same. This is a big assumption to make in soccer, but given the defensive metadata that StatDNA provides, we are able to group situations that are more similar than if we were just using position (for example, we can isolate situations where the player is 1-on-1 with the keeper in the box, versus only knowing the player was in the box without knowing whether there were several defenders in the way).
The first order of business was determining what my game states were going to be. I wanted to divide the field up into a fine grid, but that meant my transition matrix was going to contain several million elements. Instead I settled on the following grid system based on the different characteristics of events that happen (see diagram below). Most shots occur in Zones 2+5, most goals come from Zone 5, Zones 1+3 are early crosses, etc. Along with a zone, each state also has defensive pressure and defensive shape associated with it. For example, two states could be “Zone 5, behind the defense, no pressure” and “Zone 5, behind the defense, under lots of pressure”.
Additionally I defined states for set pieces because of their unique characteristics in the game: long and short corners, long and short free kicks, deep and shallow throw-ins and penalties. Overall there were 37 different states the ball could be in, plus the two absorbing states: goal and turnover to the other team.
With the states defined, the next step was to calculate the transition probabilities. For each state, I wanted to know how likely the ball was to be moved to each one of the other states. The great thing about Markov Chains is that once we have the transition probabilities, we can calculate the probability of the ball ending up in one of the absorbing states after an infinite number of moves. The states are called absorbing states because once the ball is in one of them it doesn’t leave: the possession is over. By looking at an infinite number of moves, it makes no difference whether the ball ends up in an absorbing state after 1, 5, 10 or 100 transitions. Possessions of arbitrary length are handled nicely because of this trait. We can easily look at all the different possible ways the possession can unfold and calculate how likely a team is to score from a given starting state. I did this not just for the entire league to see general trends, but also for each individual team’s offense and defense.
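To make the calculation concrete, here is a minimal sketch in Python of the two steps just described: turning observed transition counts into probabilities, then finding each state's chance of eventually ending in a goal. Everything in it is an assumption made for illustration. The two states ("midfield" and "box"), the counts, and the function names are hypothetical, not the 37 states or the StatDNA data used in this post. It also approximates the "infinite number of moves" by repeating the one-step update until the values stop changing, which is just one of several ways to solve an absorbing chain.

```python
from collections import defaultdict

# Hypothetical counts of observed (from_state, to_state) transitions.
# "goal" and "turnover" are the two absorbing states.
OBSERVED_COUNTS = {
    ("midfield", "midfield"): 5,
    ("midfield", "box"): 2,
    ("midfield", "turnover"): 3,
    ("box", "midfield"): 1,
    ("box", "goal"): 2,
    ("box", "turnover"): 7,
}


def estimate_transitions(counts):
    """Turn raw transition counts into per-state transition probabilities."""
    totals = defaultdict(int)
    for (src, _dst), n in counts.items():
        totals[src] += n
    probs = defaultdict(dict)
    for (src, dst), n in counts.items():
        probs[src][dst] = n / totals[src]
    return dict(probs)


def scoring_probabilities(transitions, tol=1e-12, max_iter=10_000):
    """Probability of eventually reaching "goal" from each non-absorbing state.

    Repeats value = P(goal next) + sum(P(next state) * value(next state))
    until the largest change between rounds is below tol.
    """
    values = {state: 0.0 for state in transitions}
    for _ in range(max_iter):
        updated = {}
        for state, row in transitions.items():
            total = 0.0
            for nxt, p in row.items():
                if nxt == "goal":
                    total += p                  # a goal is worth 1
                elif nxt in values:
                    total += p * values[nxt]    # keep following the possession
                # any other destination (e.g. "turnover") is worth 0
            updated[state] = total
        change = max(abs(updated[s] - values[s]) for s in values)
        values = updated
        if change < tol:
            break
    return values


transitions = estimate_transitions(OBSERVED_COUNTS)
values = scoring_probabilities(transitions)
for state, value in values.items():
    print(f"{state}: {value:.4f}")
```

With these made-up counts, the possession is worth far more in the box than at midfield, even though the midfield state almost never leads directly to a goal. Its value comes entirely from the chance of working the ball into a better state first.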
Short versus Long Set Pieces
Using Markov Chains to figure out the likelihood of scoring a goal from a given state, we can start to answer questions like: is it better to take a corner long or short? For the given dataset (which is only a sample of matches for each team for the 2010-2011 Premier League season), league-wide the answer is that long corners are more likely to eventually result in a goal than short corners (2.39% for long corners vs. 1.67% for short corners). One thing to note is that I defined a change of possession as a controlled, deliberate action by the opposing team. Clearances were not considered a controlled action, so the possession resulting from a corner includes not just the corner itself, but the ensuing possession by the team until the opposition gains control. Digging down into the individual teams, you can see which teams are the best at taking long corners (Arsenal, Newcastle and Stoke), which teams are best at short corners (Spurs, West Brom and Aston Villa) and which teams aren’t very good at any type of corner (Wigan, Birmingham, and West Ham).
The same technique can be used to examine how teams defend corners. Below is a graph that shows each team’s probability of conceding from both types of corners. Not surprisingly, Arsenal is one of the worst teams at defending long corners. Manchester United is notably worse at defending short corners than they are at defending long corners. These bits of info could be valuable when planning a team’s in-game strategy.
This type of analysis can be done for any of the game states that were defined. It can be used to look at whether a team is good at counter attacking, whether they are better under pressure or need more space to operate, or whether throw-ins are advantageous, for example.
Individual Offensive Contribution
With each state having a value assigned to it (the likelihood of scoring a goal), we can take a look at how much an individual affects a team’s chance of scoring a goal by looking at the difference in value between the state in which the player receives the ball and the state into which the player puts the ball. For example, let’s say a player is in a state with a value of .05 and plays a through ball to their teammate, putting them into a good goal scoring opportunity with a value of .25. The passing player would be credited with creating .2 units of offense. If the receiving player goes on to score a goal, they would be credited with .75 units of offense, and if they miss and the ball is turned over, they would be credited with -.25 units of offense (goals have a value of 1 and turnovers a value of 0). If the shot is deflected for a corner, the value is somewhere in between.
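The bookkeeping in that example can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, under stated assumptions: the state names ("buildup", "chance"), the player labels and the function name are hypothetical, and the state values are the ones from the worked example rather than values from the actual model.

```python
from collections import defaultdict

# State values from the worked example; "buildup" and "chance" are
# hypothetical names for the .05 and .25 states.
STATE_VALUES = {
    "buildup": 0.05,
    "chance": 0.25,
    "goal": 1.0,      # absorbing state: goal scored
    "turnover": 0.0,  # absorbing state: possession lost
}


def offensive_contributions(actions, state_values):
    """Sum, per player, the change in state value caused by each action.

    Each action is (player, state_received_in, state_ball_moved_to).
    """
    totals = defaultdict(float)
    for player, start, end in actions:
        totals[player] += state_values[end] - state_values[start]
    return dict(totals)


# The through ball followed by a goal.
scored = offensive_contributions(
    [("passer", "buildup", "chance"), ("shooter", "chance", "goal")],
    STATE_VALUES,
)

# The same through ball followed by a miss that ends the possession.
missed = offensive_contributions(
    [("passer", "buildup", "chance"), ("shooter", "chance", "turnover")],
    STATE_VALUES,
)

print(scored)
print(missed)
```

Note that the passer's credit is the same in both cases: the value of the through ball does not depend on whether the teammate finishes the chance.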
There are several advantages to this method versus looking at existing metrics like passing percentage and goals. For one, passing percentage treats all passes equally. This system weights each pass with the amount the player helped improve the team’s chance of scoring. When looking at goals, instead of giving full credit to the goalscorer, players who helped move the ball into a good position are rewarded. Those same players are still rewarded even if the chance is not converted.
For the sample dataset provided by StatDNA, the top offensive contributors were Tim Cahill, Yaya Toure and Cesc Fabregas. Liverpool’s new signings, Jordan Henderson and Stewart Downing, are both in the top 25, but Raul Meireles, who recently left Anfield for Stamford Bridge, was #7.
We can also examine who is the most wasteful with the ball by looking at who has the lowest offensive contributions. Goalkeepers are colored in grey in the diagram below. The strong presence of goalkeepers among the worst contributors should be a red flag for most teams, as it possibly indicates significant room for improvement in the keeper’s distribution. Darren Bent is far and away the most wasteful outfield player in the dataset. The sample isn’t representative of his season: he scored 17 goals last year, but only one of those goals was present in the sample set. However, in the set he had 19 opportunities where he received the ball in a state with a probability of scoring greater than 10% (the average probability of these chances was 22%). Darren Bent only converted one of these chances, and his offensive contribution for these high probability chances was -0.263. Imagine how high he’d be ranked if he could have finished some of these chances.
There are loads of additional questions that you can start to try to answer using this framework. The data can be sliced and diced in all sorts of interesting ways. Currently the model doesn’t account for the quality of the opposition, which would be a good next step in developing this framework further.