The following guest post was written by Brian Mills, a Doctoral Candidate in Sports Management and Graduate Student in Statistics at the University of Michigan.
Anyone who has watched or played soccer can understand and appreciate the importance of scoring a goal. Scoring is a relatively rare event compared to other sports like lacrosse, football or even baseball. Therefore, understanding how each event in a game increases the probability of scoring a goal (or the inverse: keeping the opponent from scoring) should prove useful for anyone trying to increase the winning expectancy of their team.
Recently, Ben Alamar explained how StatDNA evaluates passing by players going from one field zone to another. Successfully passing the ball from outside the box to just in front of the goal, of course, increases the probability of a goal by much more than passing across the midfield line. Therefore, getting the ball nearer the opponent’s net is an important part of understanding goal creation—as in the aforementioned blog post—and what the player does with the ball once it is there is also key to getting it in the net. As you can see below, getting the ball near the net is relatively rare when compared to the occurrence of events near midfield (dark red). So not only are goals important, but so are the previous events that allowed that goal to take place.
What I will share today attempts to take this a little further with a (very preliminary) model that evaluates the change in the probability of a win for each event occurring in the game, given where that event took place on the field.

Using this idea, we can do a few interesting things. First, we can create Win Probability Graphs. Those are always fun to look at, but the bigger advantage is in player evaluation. For each event, the probability that one’s team will win goes either up or down. A completed pass is a positive for the team completing it. A save is a positive for the goalie’s team. From here, we can aggregate across every play a player is involved in and get a total (or per-event average) Win Probability Added (“WPA”) measure.
Now, the idea of Win Probability Charts and WPA is not new. Fangraphs does them for baseball and I have seen some of the former for soccer online. However, those for soccer have generally used only goals and the time dependence of being ahead (behind) in the game. But the goal event itself isn’t the only thing to account for. Here, I not only expand on the probability of a Win, Loss or Draw from goals, but also other events and the position on the field in which they take place.
On to the model. For evaluating win probability, I have been working with a vector generalized additive proportional odds model. This allows us to respect the ordering of the three possible outcomes for the game and calculate the probability of each at a given time point: 1) Home Team Win, 2) Draw and 3) Away Team Win. Those familiar with smoothing techniques will note that with a GAM we can calculate not only the probability change for each event (thanks to the fantastic touch-by-touch EPL data from StatDNA) but also the change given the spatial proximity to midfield and the out-of-bounds lines. Using a smoother allows us to control for the two-dimensional (and non-monotonic) changes in probability given the event location and adjust accordingly. After all, a shot on goal from the midfield line likely does not have the same win probability influence as one taken inside the box.
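To make the proportional odds idea concrete, here is a minimal Python sketch. It is not the model fitted for this post: the function names, the cutpoint values and the single linear predictor `eta` (which in a GAM would be a sum of smooth terms for time, score and field location) are all illustrative assumptions. It only shows how two ordered cutpoints turn one predictor into three outcome probabilities.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def outcome_probs(eta: float, cut_home: float, cut_draw: float) -> dict:
    """Cumulative-logit (proportional odds) probabilities.

    Outcomes are ordered Home Win < Draw < Away Win, so
    P(Home Win)         = sigmoid(cut_home - eta)
    P(Home Win or Draw) = sigmoid(cut_draw - eta)
    A larger eta shifts probability toward the away team.
    """
    if cut_home >= cut_draw:
        raise ValueError("cutpoints must be strictly increasing")
    p_home = sigmoid(cut_home - eta)
    p_home_or_draw = sigmoid(cut_draw - eta)
    return {
        "home": p_home,
        "draw": p_home_or_draw - p_home,
        "away": 1.0 - p_home_or_draw,
    }


# Hypothetical cutpoints, chosen only for illustration.
print(outcome_probs(eta=0.0, cut_home=-0.2, cut_draw=1.1))
```

Because the three probabilities are differences of one cumulative curve, they stay positive and sum to one for any value of `eta`.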
While I won’t go into the details of the modeling itself, this model requires that the probabilities of the three possible outcomes always add up to 100% at any given time point. If one team’s probability of a win increases, then the probability of the other team winning (and/or a draw) must decrease. Below, I have a version of the Win Probability Charts for Arsenal vs. Everton and Chelsea vs. Sunderland. These are created by predicting the ordered Win-Draw-Loss probabilities for each event directly from the model, given the event taking place and the previous game state. There are a few logical things to notice here to help validate the model:
1) The closer to the end of the game, the more that a lead-changing goal affects the probability of a win.
2) The home team (RED) begins with a higher expected win probability than the away team (BLUE).
3) A goal is, of course, the most valuable event in the game (more on this later).
4) About 30% of games end in Draws, so the starting point of the Draw (Yellow) line makes sense.


I must note some issues with this preliminary version. First, I do not use any prior knowledge of the teams’ ability. Arsenal is likely at more of an advantage playing against Everton than the graph, which uses the sample average, may imply, given their better record in each of the past few seasons. In general, I’m not totally satisfied with the starting point of the home and away win probabilities shown on the chart. Second, the model sometimes does not drift far enough toward 100% when a team is ahead and nearing the end of the game. Take Arsenal vs. Everton, for example. With nearly 0 seconds left in the game, the probability of an Arsenal win should be essentially 100%. This is likely due to the somewhat small sample of games (fewer than 150) and a possible late goal in one of them being over-weighted. Both issues could be remedied with a larger data set or some Bayesian priors using past games based only on score advantage, team record and time remaining.
For a more comprehensive model, other important variables would include Pass Distance and Player Positioning when receiving that pass. These require further specification, as only Pass Events have a Pass Distance recorded. Finally, goals early on in a blowout are worth more than those later, so if certain players are scoring goals in different situations this could affect the outcome of the WPA measure for players. Players are rewarded extra for scoring or stopping a goal at times when a goal would cause a large swing in win probability (“High Leverage Situations”), so it is important to keep this caveat in mind unless we expect the leverage to “even out”. Since teams try to get their top players the ball in these situations, there is likely some bias.
Assuming all is well and good with the model and data used to construct it, we can easily use these models to estimate each player’s contribution to a win throughout the game or the season as well as get an average impact of each event. To do this, I simply take the first difference in Home win, Draw and Away win probability between the current event and the previous event. This gives the change in win probability for the given team at each event. Depending on how one thinks Draws should be weighted, we can adjust as necessary. From here, it is easy to total or take an average per event for each player given which team he is on (Home or Away).
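The first-difference step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event records, the field names (`player`, `side`, `home`, `draw`, `away`), the pre-game baseline and the `draw_weight` parameter are all hypothetical, and the probabilities are made-up numbers rather than model output.

```python
from collections import defaultdict


def win_prob_added(events, baseline, draw_weight=0.0):
    """Total and per-event WPA for each player.

    events:   list of dicts holding the predicted probabilities AFTER each
              event ("home", "draw", "away"), the acting "player" and his
              "side" ("home" or "away").
    baseline: probabilities before the first event.
    draw_weight: share of the change in draw probability credited to the
              acting team (an assumption; 0.0 ignores draws entirely).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    prev = baseline
    for ev in events:
        side = ev["side"]
        # First difference: state after this event minus state before it.
        delta = (ev[side] - prev[side]) + draw_weight * (ev["draw"] - prev["draw"])
        totals[ev["player"]] += delta
        counts[ev["player"]] += 1
        prev = ev
    return {
        p: {"total": totals[p], "per_event": totals[p] / counts[p]}
        for p in totals
    }


# Made-up example: three events from a single game.
baseline = {"home": 0.45, "draw": 0.30, "away": 0.25}
events = [
    {"player": "A", "side": "home", "home": 0.47, "draw": 0.29, "away": 0.24},
    {"player": "B", "side": "away", "home": 0.44, "draw": 0.30, "away": 0.26},
    {"player": "A", "side": "home", "home": 0.70, "draw": 0.20, "away": 0.10},
]
print(win_prob_added(events, baseline))
```

Setting `draw_weight` to something like 0.5 would credit the acting team with half of any movement in the draw probability, which is one way to handle the weighting question raised above.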
The results from my preliminary model indicate that Goalies have the largest total impact, a logical result given that each time they touch the ball it is in close proximity to the goal. Defenders are next on the list, followed by a mix of Midfielders and Strikers. However, this does not necessarily mean that goalies are more valuable than anyone else! One must be careful to compare Goalies only to other Goalies, and Strikers only to their true positional counterparts.
On the first run (again, without much pass quality information included and only a single season’s worth of data) I find Ali Al Habsi, Ben Foster, Joe Hart, Petr Cech and Robert Green to be the top goalies. In limited action (about half the sample size of the guys mentioned above), the young Tim Krul actually outclasses the entire collection of goalies in the data set. With the little that I know about the EPL, these seem pretty reasonable and Krul looks like he could live up to the high praise he received after filling in this past year.
As for defenders, the model finds the highly regarded Manchester United captain Nemanja Vidic ranked lower than one might expect, with John Terry up near the top. Strikers are led by a familiar bunch with the likes of Rodallega, Odemwingie, Tevez, van Persie, Berbatov and Drogba to name a few; but the popular Wayne Rooney comes in between #25 and #30. While there is plenty of room for improvement, the rankings correlate relatively closely with the EA Sports Index found at the Barclays EPL website.
With respect to importance of events, the model finds Goals to be the biggest game changers, with “Sub-Ins” as the smallest. This makes sense to me. Also keep in mind that the cross-tabulations below are not conditional on field location or game state, which is why we see such low importance of common events like passes (most passes are marginal and near the midfield line). Lastly, I do not indicate directional changes in probability, just the swing from one team to another in absolute value.
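For readers who want to reproduce this kind of unconditional summary, here is a small Python sketch. It assumes the swing is measured as the absolute change in the home team's win probability from one event to the next, which is one possible reading of "swing" and not necessarily the exact definition used for the table; the data are made up.

```python
from collections import defaultdict


def mean_abs_swing(events, baseline_home):
    """Sample size and mean absolute change in home win probability by event type.

    events: list of (event_type, home_win_prob_after_event) tuples in game order.
    baseline_home: home win probability before the first event.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    prev = baseline_home
    for event_type, p_home in events:
        sums[event_type] += abs(p_home - prev)  # direction is discarded
        counts[event_type] += 1
        prev = p_home
    return {t: (counts[t], sums[t] / counts[t]) for t in sums}


# Made-up example: two marginal passes followed by a goal.
sample = [("Pass Ground", 0.46), ("Pass Ground", 0.45), ("Goal", 0.75)]
print(mean_abs_swing(sample, baseline_home=0.45))
```

Because the grouping ignores field location and game state, frequent low-stakes events such as midfield passes pull their own averages down.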
Obviously, the approach would be improved with proper treatment of pass quality and pass difficulty information—which StatDNA does keep track of—and there is still much to account for in my model. Of course, I’d love to hear some feedback on improving things. Overall, I think it’s a pretty good start and I enjoyed getting a chance to work with this data. Thank you to StatDNA for allowing me to share my thoughts here.
| Event Type | Sample Size | Change in Win Prob. |
|---|---|---|
| Goal | 367 | 37.25% |
| Penalty | 33 | 36.67% |
| Save tip | 29 | 12.15% |
| Goalie deflection (non-save) | 132 | 10.22% |
| Red Card | 27 | 7.47% |
| Yellow Card | 395 | 6.76% |
| Goalie Throw | 1019 | 5.58% |
| Aerial Challenge Missed | 269 | 5.06% |
| Dribble Sequence | 1564 | 4.78% |
| Goal Kick | 2410 | 4.69% |
| Goalie Punt | 865 | 4.49% |
| Corner | 1308 | 4.37% |
| Free Kick | 3407 | 4.32% |
| Shot Foot | 2949 | 4.00% |
| Goalie Catch (non-save) | 350 | 4.00% |
| Goalie Punch (non-save) | 186 | 3.95% |
| Offside | 522 | 3.83% |
| Foul | 2941 | 3.83% |
| Lost Possession | 1037 | 3.64% |
| Goalie Possession | 1699 | 3.63% |
| Tackle Won | 6163 | 3.44% |
| Save deflection | 410 | 3.43% |
| Block (non-goalie) | 2776 | 3.39% |
| Save catch | 354 | 3.34% |
| Pass Air | 18017 | 3.32% |
| Clearance | 6183 | 3.22% |
| Shot Head | 562 | 2.86% |
| Deflection | 3073 | 2.69% |
| Throw in | 5760 | 2.66% |
| Failed Control | 2513 | 2.65% |
| Aerial Challenge Lost | 6256 | 2.57% |
| Head Clearance | 4773 | 2.57% |
| Gain Possession | 67873 | 2.38% |
| Cross | 5274 | 2.19% |
| Pass Head | 11097 | 2.12% |
| Pass Ground | 67312 | 1.89% |
| Tackle Lost | 5084 | 1.80% |
| Sub In | 632 | 0.41% |