Expected goals and conversion — a probabilistic approach

In recent years, Expected Goals (xG) has become a widely used metric in football analytics. For this post, we assume that the reader is familiar with the basic concept of Expected Goals (if not, many introductions are available online).

An interesting question arises, especially for strikers: how does the actual output, i.e. the number of goals scored, compare to the expected goals of a player’s chances? If a player consistently scores more goals than expected, we could rate him as a good finisher.
But what does "consistently" mean? Goals involve randomness, and random quantities have variance. Depending on the number and quality of shots that a player takes, the deviation between actual and expected goals may well be within the range that the variance alone explains.

The goal of this blog post is to give a more detailed and quantitative answer to this question. It includes some mathematics, but the only tools we need come from very basic probability theory. We will apply them to a public event data set covering all FC Barcelona matches in the era of Lionel Messi.

The model

Assume a player has taken n shots, and for each shot we are given its xG value and the true outcome (goal or no goal).
Each shot can be seen as a Bernoulli variable (i.e. a {0,1}-variable) with its own probability of success, i.e. of scoring a goal, which is given by the shot’s xG value. Summing up all these individual variables, we get a random variable which we will call G. It describes the total number of successes (e.g. over a season). G takes integer values between 0 and the number of shots taken.

Assuming that each shot is independent of all others, the variable G follows a Poisson binomial distribution.
Note: in reality the independence assumption might not hold because of factors like momentum or confidence. However, estimating the dependence is almost impossible (or at least it seems so to me).

Doing some maths (details are not interesting here), we can calculate the probability of k successes for any number 0≤k≤n, i.e. P(G=k).
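For readers who do want to see the details, here is a minimal sketch of one common way to obtain P(G=k): process the shots one at a time and update the distribution of the running goal count. The function name `poisson_binomial_pmf` and the two example xG values are made up for illustration; this is not necessarily the computation used for the charts in this post.

```python
def poisson_binomial_pmf(xg_values):
    """Return [P(G=0), ..., P(G=n)] for independent shots with the given xG values."""
    pmf = [1.0]  # before any shot is taken, G = 0 with certainty
    for p in xg_values:
        nxt = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            nxt[k] += prob * (1.0 - p)  # shot missed: goal count stays at k
            nxt[k + 1] += prob * p      # shot scored: goal count moves to k + 1
        pmf = nxt
    return pmf


# Hypothetical example: two shots with xG 0.5 and 0.2.
pmf = poisson_binomial_pmf([0.5, 0.2])
for k, prob in enumerate(pmf):
    print(f"P(G={k}) = {prob:.3f}")
```

The update costs on the order of n² operations for n shots, which is negligible for a few hundred shots per season.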

Now that we know the full distribution for the number of goals, we can compute an interval that contains the number of goals scored with high probability, i.e. a range of values whose probabilities sum to at least a chosen level (we refer to this sum as the area under the curve).
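One possible way to build such an interval is sketched below. It assumes an equal-tailed construction, cutting at most half of the leftover probability from each end; the post does not say which construction the charts use, so treat this as an assumption. The function name `central_interval` and the toy distribution are hypothetical.

```python
def central_interval(pmf, coverage=0.9):
    """Return (lo, hi) with P(lo <= G <= hi) >= coverage.

    Assumption: at most (1 - coverage) / 2 of probability is cut from each tail.
    """
    tail = (1.0 - coverage) / 2.0

    lo, cut = 0, 0.0
    while cut + pmf[lo] <= tail:  # drop left bars while the budget allows
        cut += pmf[lo]
        lo += 1

    hi, cut = len(pmf) - 1, 0.0
    while cut + pmf[hi] <= tail:  # drop right bars while the budget allows
        cut += pmf[hi]
        hi -= 1

    return lo, hi


# Toy distribution over 0..5 goals (made-up numbers that sum to 1).
toy_pmf = [0.02, 0.08, 0.30, 0.40, 0.17, 0.03]
lo, hi = central_interval(toy_pmf, coverage=0.9)
print(lo, hi, sum(toy_pmf[lo:hi + 1]))
```

Because G is discrete, the covered probability is usually somewhat larger than the requested level rather than exactly equal to it.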

Data

Let us apply this to real data. StatsBomb, one of the biggest event data providers, released a huge dataset for free last year: it covers all of FC Barcelona’s La Liga matches in the era of Lionel Messi.

Luckily, the StatsBomb data contains the xG value of each shot and whether it resulted in a goal.

This allows us to plug the numbers directly into the procedure described above. The only data preprocessing we do is to filter out all free kicks and penalties.
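The filtering step could look roughly like the sketch below. It assumes the events are already loaded as a list of dictionaries in the layout of the StatsBomb open-data event files, where a shot carries `shot.statsbomb_xg`, `shot.type.name` and `shot.outcome.name`. Those field names are my assumption about the data layout, and the helper `extract_shots` and the sample events are invented for illustration.

```python
EXCLUDED_SHOT_TYPES = {"Free Kick", "Penalty"}


def extract_shots(events, player_name):
    """Return (xg_values, goals) for one player, excluding free kicks and penalties."""
    xg_values, goals = [], 0
    for ev in events:
        if ev.get("type", {}).get("name") != "Shot":
            continue  # not a shot event
        if ev.get("player", {}).get("name") != player_name:
            continue  # shot by someone else
        shot = ev["shot"]
        if shot["type"]["name"] in EXCLUDED_SHOT_TYPES:
            continue  # set-piece shots are filtered out
        xg_values.append(shot["statsbomb_xg"])
        goals += int(shot["outcome"]["name"] == "Goal")
    return xg_values, goals


# Invented sample events, only to show the expected shape of the input.
events = [
    {"type": {"name": "Pass"}, "player": {"name": "Player A"}},
    {"type": {"name": "Shot"}, "player": {"name": "Player A"},
     "shot": {"statsbomb_xg": 0.10, "type": {"name": "Open Play"},
              "outcome": {"name": "Saved"}}},
    {"type": {"name": "Shot"}, "player": {"name": "Player A"},
     "shot": {"statsbomb_xg": 0.76, "type": {"name": "Penalty"},
              "outcome": {"name": "Goal"}}},
    {"type": {"name": "Shot"}, "player": {"name": "Player A"},
     "shot": {"statsbomb_xg": 0.60, "type": {"name": "Open Play"},
              "outcome": {"name": "Goal"}}},
    {"type": {"name": "Shot"}, "player": {"name": "Player B"},
     "shot": {"statsbomb_xg": 0.30, "type": {"name": "Open Play"},
              "outcome": {"name": "Goal"}}},
]

xg_values, goals = extract_shots(events, "Player A")
print(xg_values, goals)
```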

Let us look at all of Lionel Messi’s shots in the 2018/19 season. He took a total of 126 shots, of which 26 resulted in a goal (recall: excluding free kicks and penalties). However, the xG values of his shots add up to only around 17. Calculating the distribution of G, we get the following picture:

Lionel Messi 2018/19

Each bar gives the probability of scoring the number of goals indicated on the x-axis (i.e. P(G=k)). The yellow bar marks the actual number of goals scored by the player. The brighter bars show the interval with 90% area under the curve. This means that, with a probability of at least 90%, the player will have scored a number of goals contained in that interval (the exact bounds are also displayed on the x-axis).

The dashed line is the cumulative probability, i.e. P(G≤k).

This graphic gives multiple insights which are not as obvious if you only look at the accumulated xG and goals numbers.

  1. It shows not only the expected goals compared to the actual outcome but also puts this comparison in the context of the variance of goals scored.
  2. Consequently, we can judge the conversion rate of a player with some probabilistic confidence. If the actual number of goals scored lies outside of the bright bars, this is an event of at most roughly 10% probability. If the leftover probability is split about evenly between both sides, landing to the right of the interval has a probability of roughly 5%.
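Both the dashed line and the probability of landing at or beyond a given goal count follow directly from the bar heights. The sketch below assumes the probabilities P(G=k) are available as a list; the helper names and the toy numbers are hypothetical.

```python
from itertools import accumulate


def cdf_from_pmf(pmf):
    """Cumulative probabilities [P(G<=0), ..., P(G<=n)], i.e. the dashed line."""
    return list(accumulate(pmf))


def prob_at_least(pmf, k):
    """Upper-tail probability P(G >= k) under the model."""
    return sum(pmf[k:])


# Toy distribution over 0..2 goals (made-up numbers).
toy_pmf = [0.4, 0.5, 0.1]
cdf = cdf_from_pmf(toy_pmf)
tail = prob_at_least(toy_pmf, 2)
print(cdf, tail)
```

A small value of `prob_at_least(pmf, actual_goals)` says that a player finishing exactly as the xG values suggest would rarely reach the observed goal tally.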

To make this clearer, there are two relevant quantities that describe G.
On the one hand, the expectation of G is simply the sum of the expectations of the individual variables, and the expectation of each Bernoulli variable is exactly its probability of success p, i.e. the shot’s xG value.
On the other hand, because the shots are assumed to be independent, the variance of G is again the sum of the individual variances. The variance of a single Bernoulli variable is p*(1-p).
In other words, the size of the interval in the above chart depends not only on how many chances a player has in total but also on their quality.
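Written out as formulas, with X_i denoting the outcome of shot i and p_i its xG value (both symbols are notation introduced here for convenience), the two statements read:

```latex
G = \sum_{i=1}^{n} X_i, \qquad X_i \sim \mathrm{Bernoulli}(p_i)

\mathbb{E}[G] = \sum_{i=1}^{n} \mathbb{E}[X_i] = \sum_{i=1}^{n} p_i

\mathrm{Var}(G) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = \sum_{i=1}^{n} p_i \, (1 - p_i)
```

The second equality in the variance line relies on the independence assumption. Since p(1-p) is largest at p = 0.5 and close to zero for p near 0 or 1, shots that are almost certain goals or almost certain misses contribute very little to the variance.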

A simple example: assume two players P1 and P2. Both took 100 shots over the course of a season, adding up to 10 xG each. For P1, the xG was spread evenly, meaning that each shot had an xG of 0.1. P2, on the other hand, had 10 clear chances with an xG of 0.8 each, while the remaining 2 xG are spread evenly over the other 90 shots.

This toy example shows the impact of the individual xG values on the variance of the distribution. P2 had more clear chances and many very improbable shots, so the variance of his distribution becomes smaller. With real data, this impact may not be huge, but we should keep it in mind when evaluating a player’s performance relative to xG.
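The numbers behind the toy example can be checked in a few lines. The sketch uses only the sum formulas for expectation and variance; the helper name `mean_and_variance` is made up.

```python
def mean_and_variance(xg_values):
    """Expectation and variance of the goal count for independent shots."""
    mean = sum(xg_values)
    variance = sum(p * (1.0 - p) for p in xg_values)
    return mean, variance


p1_shots = [0.1] * 100                      # P1: 100 shots with xG 0.1 each
p2_shots = [0.8] * 10 + [2.0 / 90.0] * 90   # P2: 10 big chances, 2 xG over 90 shots

mean1, var1 = mean_and_variance(p1_shots)
mean2, var2 = mean_and_variance(p2_shots)
print(f"P1: mean={mean1:.2f}, variance={var1:.2f}")
print(f"P2: mean={mean2:.2f}, variance={var2:.2f}")
```

Both players have an expectation of 10 goals, but the variance is 9 for P1 and only about 3.56 for P2, which corresponds to a standard deviation of 3 goals versus roughly 1.9 goals.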

So, variance drives the shape of the above chart. Another factor is the number of shots. To get a better feeling for how the interval size relates to the sample size n, we will look at each of Messi’s seasons separately.

All seasons of Lionel Messi.

In many seasons, Messi’s actual number of goals lies to the right of the interval, perhaps most astonishingly in 2012/13. Still, for a single season the interval spans on the order of 10 goals, which is remarkably wide given that Messi takes quite a lot of shots. Drawing conclusions about over- or underperformance from a single season thus seems possible only on a few occasions.

Finally, we also look at other players from Barcelona’s recent history. In this chart we show every player who had at least 100 shots in total.

All Barca players with >100 shots since 2004.

While Messi seems out of this world, many players scored more goals than expected. This is natural, as Barcelona players are above average in general. However, looking at Neymar, we see that his 47 goals were fewer than the expectation of around 51.5 goals.
We can now also state that, given the chances that Neymar had, a player finishing exactly as the xG values suggest would have scored more goals than Neymar in roughly 75% of cases.
Note also how the interval sizes become (relatively) smaller the more shots we record.
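A quick way to see why the relative interval size shrinks is to compare the standard deviation of G with its expectation. The sketch below makes the simplifying assumption that every shot has the same xG (0.1 here, an arbitrary illustrative value); the function name `relative_spread` is hypothetical.

```python
import math


def relative_spread(n_shots, xg_per_shot):
    """Standard deviation of G divided by its expectation, for identical shots."""
    mean = n_shots * xg_per_shot
    sd = math.sqrt(n_shots * xg_per_shot * (1.0 - xg_per_shot))
    return sd / mean


# Quadrupling the number of shots halves the relative spread.
for n in (25, 100, 400):
    print(n, round(relative_spread(n, 0.1), 3))
```

In this simplified setting the ratio equals sqrt((1-p)/(n*p)), so it falls with the square root of the number of shots: the absolute interval gets wider with more shots, but more slowly than the expected goal count grows.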

Conclusion

If we rate the conversion rate of players using xG statistics, we should keep in mind that the natural variance over the course of a season is in general quite big, even for strikers.
Instead of looking only at the accumulated xG and goals, we can use a chart like the ones above to get a more detailed picture.
Over bigger sample sizes, we can use this to give a quantitative answer to the question of a player’s finishing ability.
