By Freddie Wilson – @thewonderofmu
There are many approaches to visualising shot location data. It can be tricky finding the right balance between the amount of information on display and the ease of use or intuitiveness. There is a plethora of categories and subcategories that can be considered such as the player taking the shot, the danger of the shot and whether the shot resulted in a goal.
In this article we will therefore explore how to effectively reduce the amount of data displayed in order to represent more layers of information on the same graphic.
This is where clustering comes into play: it is a method that summarises the data by assigning each of the n data points (in this case shot locations, where n could be as large as 100 over a season for a single player) to one of k clusters (we can choose k to be as small as we please; 2 to 8 should be appropriate).
This is to say that we can categorise each shot location into one of our clusters.
The cluster, or group, is then defined by its members and its cluster centre.
We want each cluster centre to represent its respective cluster as best as it can. In other words, this means we want the cluster centre to be as close to each point in its cluster as possible. This is known as minimising the cost of the clustering.
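The article does not spell the cost out, so as a hedged aside: for k-means the standard cost is the total squared distance from each point to the centre of its cluster. The symbols C_j (the set of shots in cluster j) and \mu_j (its centre) are notation introduced here for illustration; n and k are as above.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

For a fixed assignment of shots to clusters, choosing each \mu_j as the centroid is exactly what makes J as small as possible.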
The k-means clustering algorithm does this very well by taking the centroid (the average position in this case) of each cluster as the centre.
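As a rough illustration of the idea, here is a minimal sketch of k-means in plain Python. It is not the code behind the graphics in this article; the function name `kmeans`, its parameters and the toy coordinates are all assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points. Returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial guesses: k shots picked at random

    def nearest(p):
        # Index of the centre closest to point p (squared Euclidean distance).
        return min(
            range(k),
            key=lambda j: (p[0] - centres[j][0]) ** 2 + (p[1] - centres[j][1]) ** 2,
        )

    for _ in range(iterations):
        # Assignment step: every shot joins its nearest centre.
        labels = [nearest(p) for p in points]
        # Update step: every centre moves to the centroid of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centres.append(centres[j])  # leave an empty cluster's centre alone
        if new_centres == centres:  # nothing moved, so the algorithm has converged
            break
        centres = new_centres
    return centres, [nearest(p) for p in points]


# Made-up shot coordinates forming two obvious groups.
shots = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(shots, k=2)
```

The two steps inside the loop are the whole algorithm: assign each shot to its nearest centre, then move each centre to the average position of its shots, and repeat until nothing changes.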
Let’s see how Sadio Mané’s shot locations from the 2016/2017 season are grouped when we perform the k-means clustering algorithm with k=4.
We see each cluster is represented with a different colour, the cluster centres are marked with an “X” and then the centres are joined up with four lines.
But why do we choose four clusters?
This is slightly arbitrary; however, there are reasons behind the choice. Firstly, we need at least three clusters to be able to make a shape (rather than a line or a single point). However, the use of three cluster centres may be slightly misleading, as the shooting cluster zone could appear similar to an arrow, giving the illusion of a direction to this shape.
Hence we would ideally like a minimum of four clusters and this number is fairly straightforward to work with in order to create a quadrilateral as the shooting cluster zone.
The plot below shows the cost of the k-means clustering of Mané’s 16/17 shots for k from 1 to 9.
The diagram above shows the “scree plot”. This helps us to determine the optimal number of clusters for the k-means algorithm by looking for the “elbow” of the plot: the point where the curve bends most sharply and adding further clusters brings little extra reduction in cost.
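In this article the elbow is read off the plot by eye. For anyone who wants to automate that step, one common heuristic (an assumption here, not the method used for these graphics) is to pick the k where the curve's second difference is largest. The function name `elbow` and the cost values below are made up for illustration.

```python
def elbow(costs):
    """Return the k with the sharpest bend, where costs[i] is the cost for k = i + 1."""
    if len(costs) < 3:
        raise ValueError("need costs for at least three values of k")
    # Second difference at k: cost(k - 1) - 2 * cost(k) + cost(k + 1).
    bends = {
        k: costs[k - 2] - 2 * costs[k - 1] + costs[k]
        for k in range(2, len(costs))
    }
    return max(bends, key=bends.get)


# Illustrative costs for k = 1..9 (not Mané's real figures).
example_costs = [100, 60, 30, 10, 8, 7, 6.5, 6.2, 6.0]
best_k = elbow(example_costs)
```

A heuristic like this can disagree with the eye when the curve has no clear bend, which is the situation described next.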
We see that, in fact, the elbow for Mané’s shots is not clear-cut: the optimal number of clusters could be 4, 5 or 6. However, with six or more clusters, joining up all the centres would not create a convex shape, leaving some centres redundant. This will not be the case for every player, and deciding the optimal k for each player individually would involve a lot of extra computation. So four clusters it is!
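The article does not say how convexity was checked. One possible way, sketched below under that assumption, is to compute the convex hull of the cluster centres: any centre that is not a corner of the hull adds nothing to the outline of the zone. The function names and coordinates are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain. Returns the hull corners in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def redundant_centres(centres):
    """Centres that are not corners of the convex hull of all centres."""
    hull = convex_hull(centres)
    return [c for c in centres if c not in hull]


# Five made-up centres: four corners of a square and one in the middle.
centres = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
inside = redundant_centres(centres)
```

With four centres the hull is usually a quadrilateral using every centre; the more centres there are, the more likely it is that at least one falls inside the shape.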
We then perform the k-means algorithm for each player with at least 40 shots to his name and display the individual shooting cluster zones on the same graphic. Also displayed is the average expected goal value (xG) per shot for each player and it is this value that determines the shading of the zone. This can be interpreted as “the darker the zone, the higher the danger of the player (per shot)”.
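The filtering and shading values can be sketched as follows. The data layout (a list of `(player, xg)` pairs), the function name and the numbers are assumptions for the example; only the 40-shot threshold comes from the article.

```python
from collections import defaultdict


def average_xg_per_shot(shots, min_shots=40):
    """Average xG per shot for every player with at least min_shots shots."""
    totals = defaultdict(lambda: [0.0, 0])  # player -> [sum of xG, number of shots]
    for player, xg in shots:
        totals[player][0] += xg
        totals[player][1] += 1
    return {
        player: total / count
        for player, (total, count) in totals.items()
        if count >= min_shots
    }


# Made-up data: player A reaches the threshold, player B falls one shot short.
shots = [("A", 0.1)] * 40 + [("B", 0.5)] * 39
shading = average_xg_per_shot(shots)
```

Here only player A appears in `shading`, which is the same effect that leaves low-volume shooters off the graphic.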
So what can be deduced?
Finally, below is the diagram of Liverpool’s individual shooting cluster zones for the 16/17 season.
We see that the four Liverpool players with at least 40 shots to their name are located centrally around the D of the penalty box. Sadio Mané leads the way in terms of average xG per shot, and this may be because his shooting cluster zone is the most advanced.
And let’s see how it is shaping up since the start of this season.
As mentioned before, four may not be the best number of clusters for every player, so it will not always be a natural fit; even so, it seems like the best option for the time being. The requirement that a player has taken a certain number of shots is also necessary for the clustering to be worthwhile and functional, even though it means that we miss some players. The most consequential absentees so far this season will be impact subs such as Olivier Giroud and Daniel Sturridge. These are players who may not have much game time in comparison to their team-mates, but who still manage to score a fair few goals.
There is also the danger of misinterpreting this visualisation; these shooting zones will not be the exclusive locations for players’ shots. The graphic is designed to give an indication of the primary shooting areas for certain players within a team and to conclude anything more would be a wrong assumption.
Interpreting the graphics also requires a contextual knowledge of the attackers in question; whilst average xG/shot is displayed, this does not factor in the player’s finishing ability relative to expectation. This is to say that, even though players could be more clinical than average, this is not reflected in the graphic.
Whilst the visual representations of cluster shot zones may need further refining, the methodology is sound and so it can be applied to other aspects of the game. A potential use could be with passes rather than shots. This is because passes will be more evenly spread across the pitch compared to shots which are mainly centred around the penalty box. Similarly, defensive actions for each player are less likely to overlap since defences tend to be more regimented than the fluid and intertwining attacking forces we have seen.
It also ought to be noted that k-means is not fully deterministic: the result depends heavily on the initial guesses of the cluster centres, and these are picked at random at the start of the algorithm. This in turn means that results are not always replicable, although you will obtain similar results over a large number of runs. In addition, k has to be provided up front, which highlights a benefit of hierarchical clustering, where k does not need to be chosen in advance.
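A common way to tame this randomness, sketched here as an assumption rather than as the procedure used for these graphics, is to fix the random seed and keep the best of several random starts. All names and the toy data are hypothetical.

```python
import random


def _sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans_once(points, k, rng, iterations=100):
    """One k-means run from a random start. Returns (cost, centres, labels)."""
    centres = rng.sample(points, k)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: _sq_dist(p, centres[j])) for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centres.append(centres[j])  # leave an empty cluster's centre alone
        if new_centres == centres:
            break
        centres = new_centres
    # Recompute labels against the final centres before measuring the cost.
    labels = [min(range(k), key=lambda j: _sq_dist(p, centres[j])) for p in points]
    cost = sum(_sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return cost, centres, labels


def kmeans_restarts(points, k, restarts=20, seed=0):
    """Run k-means from several random starts and keep the lowest-cost result."""
    rng = random.Random(seed)  # fixed seed: the same call gives the same answer
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])


# Made-up shot coordinates forming two obvious groups.
shots = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
first = kmeans_restarts(shots, k=2)
second = kmeans_restarts(shots, k=2)
```

Because the seed is fixed, `first` and `second` are identical, and taking the lowest-cost run makes it less likely that an unlucky starting guess decides the final zones.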
Please also note that these graphics exclude penalties and own goals, but they do include direct free-kicks.
Inspiration is partly taken from David Sumpter’s Soccermatics, where he talks about “defensive hulls”, and partly from the University of Bath’s Multivariate Data Analysis module. Also, the wonders of R’s ggplot2 package cannot be overstated here.
Despite the limitations mentioned, I hope this contribution will prompt further discussion of the most effective methods of representing shooting data.
This article was written with the aid of StrataData, which is property of Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.
Stay tuned at https://twitter.com/ChanceAnalytics and https://twitter.com/thewonderofmu for updates on this visualisation. It will be interesting to see the sort of teams that it can best represent in addition to the ideal number of players that it can effectively display.
A very big thank you to Stratagem (@Stratabet) for the data!