At a previous job, my colleagues and I would occasionally create football (soccer) quizzes for each other. The game I used to send them would involve me sequentially sending a list of (increasingly helpful) clues, from which they had to guess which player I was thinking of. I have now turned this into a web app using the shiny and shinyjs `R` libraries, and you can play it here. You can find the code on the GitHub link that I have included in the app. I hope you enjoy it!

BibTeX citation:

```
@online{difrancesco2023,
author = {Domenic Di Francesco},
title = {Who {Am} {I?}},
date = {2023-03-21},
url = {https://allyourbayes.com/posts/Who_Am_I},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2023. “Who Am I?” March 21, 2023. https://allyourbayes.com/posts/Who_Am_I.

A recording of my presentation on value of information analysis at the Bayes@Lund2023 conference.

I have been following the Bayes@Lund conference since I started my PhD, and have often found the work presented to be very useful. This year I was able to attend and I presented on the topic of *value of information analysis* (how much should we be willing to pay for data).

Conventional experimental design is used to identify where our next measurement(s) should be obtained on the basis of reducing uncertainty. However, this scale (some measure of information entropy) is not always intuitive, and it won’t tell you the point at which paying for another measurement becomes uneconomical.

Value of information analysis is used to quantify how much we should be willing to pay for data of a specified quality (precision, bias, reliability, completeness, etc.), in the context of helping us make decisions.
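To make the idea concrete, here is a minimal sketch of the simplest version of this calculation: the expected value of *perfect* information for a two-action decision. The decision problem, the costs and the prior probability are all made up for illustration (they are not taken from my talk), and real data will rarely be perfect, so this is an upper bound on what we should pay.

```python
# Hypothetical decision: repair a component now, or do nothing.
# All numbers below are illustrative assumptions.
p_damaged = 0.3        # prior probability that the component is damaged
cost_repair = 20.0     # cost of repairing (paid whatever the true state is)
cost_failure = 100.0   # cost incurred if a damaged component is left alone

ACTIONS = ("repair", "do nothing")


def expected_cost(action: str, p: float) -> float:
    """Expected cost of an action when P(damaged) = p."""
    if action == "repair":
        return cost_repair
    return p * cost_failure  # "do nothing"


# Best we can do using only the prior
prior_cost = min(expected_cost(a, p_damaged) for a in ACTIONS)

# With perfect information we learn the state first, then pick the best action
cost_if_damaged = min(expected_cost(a, 1.0) for a in ACTIONS)
cost_if_intact = min(expected_cost(a, 0.0) for a in ACTIONS)
informed_cost = p_damaged * cost_if_damaged + (1 - p_damaged) * cost_if_intact

# Expected value of perfect information: the most we should pay for the data
evpi = prior_cost - informed_cost
print(f"Expected cost without data: {prior_cost:.1f}")
print(f"Expected cost with perfect data: {informed_cost:.1f}")
print(f"Expected value of perfect information: {evpi:.1f}")
```

If a measurement costs more than this difference, then (under these assumptions) it is not worth buying, no matter how much it reduces our uncertainty.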

Below is the recording of my talk, which briefly introduces the topic and provides a couple of examples.

BibTeX citation:

```
@online{difrancesco2023,
author = {Domenic Di Francesco},
title = {Bayes@Lund2023},
date = {2023-01-23},
url = {https://allyourbayes.com/posts/Bayes@Lund},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2023. “Bayes@Lund2023.” January 23,
2023. https://allyourbayes.com/posts/Bayes@Lund.

When is a player **in form** (over performing, or enjoying a hot streak) and how long does this last? If there is such an effect, I suspect it will be a result of some complicated system of personal circumstances. In this post I suggest a popular statistical model (Gaussian process) for approximating the dependencies (how many games back should we look?) and non-linearities (rise and fall of form) that we need. Again, I am suggesting that we should care about uncertainty when trying to model just about anything in football, and using probability is a helpful way of doing so.

Ellen White’s data from the 2019-20 WSL season (courtesy of StatsBomb) is used as an example.

Similarly to the posts on multi-level models, this will also be split into 2 parts. Part 1 (here) will focus on the features of a Gaussian process that are well suited to approximating player form. Part 2 (in preparation) will include more technical details and more code.

As I alluded to in the TLDR above, I suspect a player’s form is somehow linked to their current mental state. When they are feeling confident they may be less likely to doubt their abilities, and more decisive. This could mean they act quicker and become more difficult to play against.

I will not propose a detailed causal model here, just a statistical proxy. But, I will be assuming that form can rise and decay over time. For some players even a single good or bad performance may be enough to drastically impact their next game, and for others this process may be smoother and less volatile. More on this later.

For the purposes of this post, goalscoring form on a given match day is defined as the number of goals that were scored on that match day, minus the expected number of goals (xG) associated with the opportunities in that game.
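As a quick sketch of this definition, the calculation is just a subtraction per match day. The Everton row uses the values quoted in this post (no goals from a match xG of 0.302); the second row is a hypothetical game that I have added for contrast.

```python
# Form on a match day = goals scored - expected goals (xG) in that game.
matches = [
    {"opponent": "Everton (h)", "goals": 0, "xg": 0.302},      # from this post
    {"opponent": "Hypothetical FC", "goals": 2, "xg": 0.85},   # made-up example
]


def form(goals: int, xg: float) -> float:
    """Positive values mean over-performance relative to xG."""
    return goals - xg


forms = [form(m["goals"], m["xg"]) for m in matches]
for m, f in zip(matches, forms):
    print(f"{m['opponent']}: form = {f:+.3f}")
```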

A nice feature of this is that, in principle, it is invariant to the quality of opposition. A striker may have a higher xG when performing against a weaker team, but will therefore need to score more goals in such a game to be considered in the same form. By the same token, it should also account for the fact that a player will generally get fewer scoring opportunities as a substitute.

…As for the not so nice features, there are plenty! For instance, what good is a measure of form that only considers goals scored? Is xG not also conditional on how well a striker is playing? Would it be more useful to standardise the result?

These are all fair questions, and with a little thought could all be integrated into a more comprehensive characterisation. However, the type of model that I will introduce will be equally compatible with alternative definitions, so let’s imagine we just care about whether a striker is scoring as many goals as they should be, and whether this will continue.

Ellen White is a clinical striker who, at the time of writing this, plays for Manchester City and England. She is a former winner of the Women’s Super League (WSL) golden boot, and is England’s all-time top scorer. So plenty of opportunities to see her distinctive celebration:

StatsBomb have kindly made data from the WSL (2019/20 season) freely available in their R package, and so we will consider this league season of Ellen White’s career here.

Here is a plot of Ellen’s **form** (performance vs. xG) over the 12 league games that she featured in, during that season. Her biggest over performance vs. xG was when she scored in a 4-1 win away at Tottenham despite a cumulative xG of 0.597. Her worst performance by the same measure was the following week, failing to score in the 3-1 win at home to Everton. Although she only played the final minutes of this game, she accrued a match xG of 0.302. The fact that these games were back to back could be tricky for a form model to accommodate!

Essentially, we are looking for some numbers to help us understand the following:

- What form is a player currently in?
- How long will a player remain in good (or bad) form?
- What is the uncertainty in our predictions?

Since players can enter good and bad patches of form over the course of a season, we need a model that is able to twist and turn accordingly. This means we need some *non-linearity*.

We also want future predictions to be based on recent games - if a player has over performed for the last 3 games in a row, then we generally expect them to continue on this path, at least in the short-term. But how far back should we look? Does a single great performance from months ago have any impact on a player’s current form? We need to quantify this *dependency* in our model too.

Finally, a probabilistic model has the benefit of *quantifying uncertainty*. I emphasise the importance of this in the ‘final thoughts’ at the end of this post and for anyone interested, here is more Bayesian statistics propaganda. But, in summary we should not neglect uncertainty in this model because (a) we are not even sure what form is, and (b) we are estimating it from a small number of indirect observations. So let’s not pretend we will end up with a single number. Enough preaching and back to the task at hand….

One solution that checks the above requirements is the Gaussian Process (GP).

So we have this probabilistic model of smooth, non-linear functions. Let’s see what it looks like. In the below plot, the match days are the same as those presented in the above plot, we just have a new y-axis scale, and we have ‘days’ (rather than date) along the x-axis.

There are multiple functions that are consistent with Ellen White’s form in the league that season, so let’s look at one example first:

Where there is a large gap between successive games (such as the 3 weeks between White’s first and second appearances of the season), there is less evidence to guide predictions of form. This is also true for the period around day 60 (late December). Here though, she was on an upward trajectory. In both cases, this lack of data results in higher uncertainty, as is apparent when we look at more samples, which are shown on top of the full predictive distribution below.

Some squiggly lines that approximately go through some points? What is the value of this when you could scribble something similar without knowing anything about statistics?

Well, underlying all of these lines is a model of dependency. We have quantified how similar (correlated) form should be in successive games, and how this correlation will decrease with time. I will talk about the parameters that do this and how they can be interpreted in part 2 (in preparation), but to summarise, the model quantifies how correlation in form decreases as time progresses and this can be seen in the smoothness of the lines.

For example, if form was always shown to be very similar to that of the previous game, then transitioning from good to bad performances would be gradual, and the samples from the associated GP model would be very smooth. Conversely, in the case where performance in subsequent games was completely independent, even if very little time had passed, the GP regression lines would need to be able to change direction very sharply.
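Here is a rough sketch of how a GP encodes this. It assumes a squared exponential kernel, which is one common choice and not necessarily the one used for the plots in this post, and the two length scales are hypothetical values picked to show the contrast between a volatile and a smooth player.

```python
import math


def sq_exp_correlation(days_apart: float, length_scale: float) -> float:
    """Correlation in form between two match days, under an (assumed)
    squared exponential kernel with the given length scale in days."""
    return math.exp(-0.5 * (days_apart / length_scale) ** 2)


# Hypothetical length scales: small = volatile form, large = smooth form
for length_scale in (7.0, 30.0):
    corrs = [sq_exp_correlation(d, length_scale) for d in (0, 7, 14, 28)]
    print(f"length scale {length_scale:>4} days:", [round(c, 3) for c in corrs])
```

With the shorter length scale, games a fortnight apart are almost uncorrelated, so sampled functions can change direction sharply. With the longer one, form in successive games stays similar and the samples are smooth.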

Below are some predictions from the model in the period just after White’s last game of the season (to the right of the final match day on the above plots). She appeared to be on a slight upward trend at this point, over performing in her final game at home to Chelsea. This is shown in the uppermost histogram. As we move away from this game, into the off-season, we see the uncertainty gradually increase in our predictions and the average move towards zero. This is consistent with the considerations discussed above.
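The same behaviour can be reproduced with a heavily simplified sketch, which conditions a zero-mean GP on a single noisy observation (the final game) rather than on the whole season. The kernel, the length scale, the standard deviations and the observed form of 0.4 are all assumptions chosen for illustration.

```python
import math


def predict_after_last_game(days_since, last_form, length_scale=14.0,
                            sd_form=0.5, sd_noise=0.1):
    """Mean and sd of form `days_since` days after one observed game,
    for a zero-mean GP with a squared exponential kernel."""
    # Covariance between the observed game and the prediction day
    k_star = sd_form**2 * math.exp(-0.5 * (days_since / length_scale) ** 2)
    # Variance of the (noisy) observation
    k_obs = sd_form**2 + sd_noise**2
    mean = k_star / k_obs * last_form
    var = sd_form**2 - k_star**2 / k_obs
    return mean, math.sqrt(var)


for days in (1, 14, 60):
    mean, sd = predict_after_last_game(days, last_form=0.4)
    print(f"day +{days}: mean = {mean:.3f}, sd = {sd:.3f}")
```

As the gap grows, the covariance with the last game shrinks, so the mean drifts back to zero and the standard deviation returns to its prior value.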

Any football models we propose will only vaguely resemble the ‘true’ data generating process and though we can incrementally add more parameters we do not automatically find more evidence for them. We can build big datasets by combining observations from multiple players, and leagues, but ignoring possible variation between such data is misleading. If your big football models need big data, why stop there? Feed it some Sunday league football, or some basketball, or some handwritten digits.

Alternatively, we acknowledge that our system of parameters is not perfectly precise, and our predictions will span credible ranges. Quantifying this variability is a strength, not a weakness of our models, and is actually of more direct use in decision support.

What can we do about long periods in time where no competitive games are taking place?

There may be other sources of information that could help, such as performances in other competitions or even in training. Given we are not sure of the extent that these should inform the model, there is an argument to use a multi-level (partial pooling) structure, as was used to improve player-specific xG estimates.

Finally, the other source of information is that contained in the priors, which I have not included here. But don’t panic, some prior predictive sampling is on the way in part 2 (in preparation).

BibTeX citation:

```
@online{difrancesco2022,
author = {Domenic Di Francesco},
title = {Player Form. {Part} 1: {Overview}},
date = {2022-01-19},
url = {https://allyourbayes.com/posts/player_form},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2022. “Player Form. Part 1:
Overview.” January 19, 2022. https://allyourbayes.com/posts/player_form.

This is part 2 of an article on fitting a Bayesian partial pooling model to predict expected goals. It has the benefits of (a) quantifying *aleatory and epistemic* uncertainty, and (b) making both group-level (player-specific) and population-level (team-specific) probabilistic predictions. If you are interested in these ideas but not in statistical language, then you can also check out part 1.

Expected Goals (or *xG*) is a metric that was developed to predict the probability of a football (soccer) player scoring a goal, conditional on some mathematical characterisation of the shooting opportunity. Since we have a binary outcome (he or she will either score or not score) we can use everyone’s favourite GLM - logistic regression.
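As a reminder of what logistic regression does with its linear predictor, here is a minimal sketch of the inverse logit transformation. The intercept and distance co-efficient are hypothetical values, not estimates from any of the models in this post.

```python
import math


def inv_logit(log_odds: float) -> float:
    """Map a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical co-efficients: intercept and effect of (standardised) distance
alpha, beta_dist = -1.5, -0.8

for dist in (-1.0, 0.0, 1.0):  # closer than average, average, further away
    xg = inv_logit(alpha + beta_dist * dist)
    print(f"standardised distance {dist:+.1f}: xG = {xg:.3f}")
```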

Unfortunately this causes some overlap with a previous blog post - ‘*Bayesian Logistic Regression with Stan*’, but don’t worry - the focus here is all about *Partial Pooling*.

First let’s look at a non-Bayesian base case. StatsBomb have kindly made lots of football data freely available in their R package. The below creates a dataframe of the shots taken by Arsenal FC during the `2003`-`04` Premier League winning season.

```
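library(StatsBombR); library(tidyverse)

# Competitions that StatsBomb have made freely available
SB_comps <- FreeCompetitions()

# Matches from the men's Premier League in the free data
Prem_SB_matches <- FreeMatches(Competitions = SB_comps %>%
    dplyr::filter(competition_name == 'Premier League') %>%
    dplyr::filter(competition_gender == 'male'))

# Download the match events, tidy them, and keep Arsenal's shots
Arsenal_0304_shots <- StatsBombFreeEvents(MatchesDF = Prem_SB_matches,
                                          Parallel = TRUE) %>%
  allclean() %>%
  dplyr::filter(type.name == 'Shot') %>%
  dplyr::filter(possession_team.name == 'Arsenal')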
```

Using `R`’s `tidymodels` framework - make sure to have a look at Julia Silge’s tutorials if you are unfamiliar - we can specify and fit a logistic regression. The below compares our results (including confidence intervals) to those from StatsBomb.

If you are interested in creating something similar yourself, this model has standardised inputs for parameters with relatively large values (such as angles and distances) and one hot encoding of categorical inputs (such as whether or not the shot was taken with a player’s weaker foot).
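The two pre-processing steps can be sketched in a few lines. This is a standalone illustration using a handful of hypothetical shots, not the `tidymodels` recipe that was used to fit the model.

```python
from statistics import mean, stdev


def standardise(values):
    """Centre on the mean and scale by the (sample) standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def one_hot(values):
    """One indicator column per category (sorted for a stable column order)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Hypothetical shots: distance from goal (metres) and the foot that was used
distances = [8.0, 12.0, 25.0, 15.0]
foot = ["stronger", "weaker", "stronger", "stronger"]

print(standardise(distances))
print(one_hot(foot))
```

Standardising puts distances and angles on a comparable scale, which makes the co-efficients easier to compare and priors easier to specify.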

Since we have used StatsBomb data (though their model will no doubt be based on a much larger collection) we would expect our results to be similar to theirs, and they are. Considering just the point estimates, the two models appear to broadly agree, especially when both are predicting a very low or a very high xG.

However, some of the confidence intervals on our `tidymodels` predictions are very large. Although we would generally expect these to decrease as we introduced more data, we know that football matches (and especially specific events within football matches) are full of uncertainty. If we want to be able to quantify this uncertainty in a more useful way (we do) - we want a Bayesian model. The below section details the specific type of Bayesian model that I’m proposing for estimating xG.

Hierarchical (or ‘nested’) data contains multiple groups within a population, such as players within a football team. Unfortunately, this information is lost (and bias is introduced) when such data is modelled as a single population. At the other extreme we can assume each group is fully independent, and the difficulty here is that there will be less data available and therefore more variance in our predictions.

Consequently, we want an intermediate solution, acknowledging variation between groups, but allowing for data from one group to inform predictions about others. This is achieved by using a multi-level (or hierarchical) model structure. Such models allow partial sharing (or *pooling*) of information between groups, to the extent that the data indicate is appropriate. This approach results in reduced variance (when compared to a set of corresponding independent models), a shift towards a population mean (known as *shrinkage*), and generally an improved predictive performance.
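Shrinkage is easiest to see in a toy version of the problem. The sketch below uses a simple normal model with *known* standard deviations, which is an assumption made purely for illustration: in the multi-level model described in this post, the equivalent quantities are estimated from the data.

```python
def partially_pooled(player_mean, n_shots, team_mean,
                     sd_within=1.0, sd_between=0.5):
    """Precision-weighted compromise between a player's own average and the
    team average (normal model, standard deviations assumed known)."""
    data_precision = n_shots / sd_within**2
    prior_precision = 1 / sd_between**2
    weight = data_precision / (data_precision + prior_precision)
    return weight * player_mean + (1 - weight) * team_mean


# A hypothetical player whose own data suggests 1.0, in a team averaging 0.0
for n_shots in (2, 20, 200):
    estimate = partially_pooled(1.0, n_shots, team_mean=0.0)
    print(f"{n_shots:>3} shots: estimate = {estimate:.3f}")
```

With only 2 shots the estimate is pulled most of the way to the team mean, and with 200 shots it is dominated by the player’s own data.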

Sounds great, right? So why would anyone ever not use this kind of model? In his excellent blog, Richard McElreath makes the case that multi-level models should be our default approach. His greatest criticism of them is that they require some experience or training to specify and interpret. His book has a dedicated chapter to help with that. Of course, there are many better descriptions of multi-level modelling than you will get from me, but I personally found the examples in Andrew Gelman and Jennifer Hill’s book to be very helpful. Finally, Michael Betancourt has written a much more comprehensive blog post on the topic, which includes a discussion on the underlying assumption of *exchangeability*.

We can create a partial pooling model by re-writing the below:

To look like this:

In this new structure, each parameter will now be a vector of length `n_players` (the number of players being considered). This means there will be a different co-efficient describing how xG varies with distance from goal for each player. This makes sense as we would expect variation between players and we want our model to be able to describe it.

If each of these parameters had their own priors, we would essentially have specified independent models - one for each player. But there is a twist here: each of the vectors of co-efficients share a single prior.

This will pull each of the individual co-efficients towards a shared mean, $\mu$. The variation between the players (for a given parameter) is characterised by $\sigma$. Rather than specify these ourselves, we will also estimate these as part of the model. This means that the extent of the pooling is conditional on the data, which is an extremely useful feature. However, we then need to include priors on these parameters, which are known as *hyperpriors*.

Note that this process has introduced an extra layer (or level) to the model structure. This is why they are known as *multi-level* or *hierarchical* models. The term *partial pooling* is more a description of what they do.
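For reference, the structure described above can be written out for a reduced model with a single co-efficient (distance from goal), matching the reduced `Stan` model in this post. The notation is my own shorthand here: $i$ indexes shots, $j[i]$ is the player who took shot $i$, and the dots stand for hyperprior settings that are supplied as data.

$$
\begin{aligned}
\text{goal}_i &\sim \text{Bernoulli}\left(\text{logit}^{-1}\left(\alpha_{j[i]} + \beta_{j[i]} \cdot \text{dist}_i\right)\right) \\
\alpha_j &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\
\beta_j &\sim \text{Normal}(\mu_\beta, \sigma_\beta) \\
\mu_\alpha &\sim \text{Normal}(\cdot, \cdot), \quad \sigma_\alpha \sim \text{Exponential}(\cdot) \\
\mu_\beta &\sim \text{Normal}(\cdot, \cdot), \quad \sigma_\beta \sim \text{Exponential}(\cdot)
\end{aligned}
$$

The first line is the logistic regression, the next two are the shared priors, and the final two are the hyperpriors.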

We see the greatest benefit of this approach when only limited data is available for one or more groups. If one player took very few shots during a period of data collection, then there will be a lot of uncertainty in their xG predictions ….*unless* we can make use of the data we have for the rest of the team.

The below is a reduced `Stan` model, with just one co-efficient (concerning the distance from goal of the shot). This is not me being secretive, it’s just that the full model is quite large. You can simply add more parameters like a multi-variate linear regression on the log-odds scale, but remember that they will each require priors, hyperpriors, and data.

```
data {
  int<lower = 1> n_shots;                         // number of shots
  array[n_shots] int<lower = 0, upper = 1> goal;  // 1 if the shot was scored
  int<lower = 1> n_players;                       // number of players
  array[n_shots] int<lower = 1> player_id;        // shot taker for each shot
  vector[n_shots] dist_goal;                      // distance from goal
  // Hyperprior settings, passed in as data
  real mu_mu_alpha;
  real<lower = 0> sigma_mu_alpha;
  real<lower = 0> rate_sigma_alpha;
  real mu_mu_beta_dist_goal;
  real<lower = 0> sigma_mu_beta_dist_goal;
  real<lower = 0> rate_sigma_beta_dist_goal;
}
parameters {
  vector[n_players] alpha;           // player-specific intercepts
  vector[n_players] beta_dist_goal;  // player-specific distance co-efficients
  real mu_alpha;
  real<lower = 0> sigma_alpha;
  real mu_beta_dist_goal;
  real<lower = 0> sigma_beta_dist_goal;
}
model {
  // Logistic model
  goal ~ bernoulli_logit(alpha[player_id] + beta_dist_goal[player_id] .* dist_goal);
  // Priors
  alpha ~ normal(mu_alpha, sigma_alpha);
  beta_dist_goal ~ normal(mu_beta_dist_goal, sigma_beta_dist_goal);
  // Hyperpriors
  mu_alpha ~ normal(mu_mu_alpha, sigma_mu_alpha);
  sigma_alpha ~ exponential(rate_sigma_alpha);
  mu_beta_dist_goal ~ normal(mu_mu_beta_dist_goal, sigma_mu_beta_dist_goal);
  sigma_beta_dist_goal ~ exponential(rate_sigma_beta_dist_goal);
}
generated quantities {
  // Population (team) level parameters, for predictions about a generic player
  real alpha_pp = normal_rng(mu_alpha, sigma_alpha);
  real beta_dist_goal_pp = normal_rng(mu_beta_dist_goal, sigma_beta_dist_goal);
}
```

A few things that I’d like to note:

- My input data is of length `n_shots` and my parameters are vectors of length `n_players`.
- I’ve included my hyperpriors (the `mu_mu_...`, `sigma_mu...`, and `rate_sigma...` terms) as data, rather than *hard code* values into the file. This is so I can re-run the model with new hyperpriors without `Stan` needing to recompile.
- Even though I have included the `mu...` and `sigma..` terms as priors in my comment, this is just to help describe the model structure. They are all included in the Parameters block of the model. As discussed above, they are inferred as part of the joint posterior distribution, meaning that we are estimating the extent of the pooling from the data.
- I’m using the generated quantities to produce my population-level parameters, so that I have everything I need to put together probabilistic predictions in either `R` or `Python`.
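As a sketch of that last step, here is how player-specific and team-level xG predictions could be put together in `Python` once the posterior draws have been exported. The draws below are random stand-ins that I have generated so the example runs on its own. In practice they would be read from the fitted model, and paired by iteration so that the correlations in the joint posterior are preserved.

```python
import math
import random

random.seed(1)


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Stand-ins for posterior draws (hypothetical values, not real model output)
n_draws = 1000
alpha_player = [random.gauss(-1.0, 0.2) for _ in range(n_draws)]  # alpha[j]
beta_player = [random.gauss(-0.6, 0.1) for _ in range(n_draws)]   # beta_dist_goal[j]
alpha_pp = [random.gauss(-1.5, 0.5) for _ in range(n_draws)]      # alpha_pp
beta_pp = [random.gauss(-0.8, 0.3) for _ in range(n_draws)]       # beta_dist_goal_pp

dist_goal = 0.5  # standardised distance from goal of a new (hypothetical) shot

# One xG value per posterior draw gives a full predictive distribution
xg_player = [inv_logit(a + b * dist_goal) for a, b in zip(alpha_player, beta_player)]
xg_team = [inv_logit(a + b * dist_goal) for a, b in zip(alpha_pp, beta_pp)]

print(f"player-specific mean xG: {sum(xg_player) / n_draws:.3f}")
print(f"team-level mean xG: {sum(xg_team) / n_draws:.3f}")
```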

#### Model Parameters

The posterior distribution (which `Stan` has sampled from) is a joint probabilistic model of all parameters. Let’s have a look at a few, specifically those corresponding to the effect of distance between the shot taker and goalkeeper. Shown below is the co-efficient for each player (identified by index). We can see that the distance to the keeper is predicted to influence each player differently.

Some of the players will have taken fewer shots and therefore we will have less data to fit their player-specific parameters. The `mu_beta_dist_keeper` and `sigma_beta_dist_keeper` parameters in the above plot are the shared ‘*priors*’ that control how the data from each of the players can be used to inform one another. The `beta_dist_keeper_pp` parameter is specified in the generated quantities block of my `Stan` model. It consists of correlated samples from the distribution characterised by the shared priors. This becomes the population (team) level co-efficient in my predictions.

I’ve included some predictions for some actual shots taken that season in part 1 of this article, but since this is the purpose of the model let’s look at one more.

Here is Robert Pirès’ goal from just outside the box at home to Bolton Wanderers in 2004. It was on his stronger (right) foot and he was not under pressure from any defenders.

As labelled on the above plot, the StatsBomb model only gave Pirès a 5% chance of scoring this opportunity. The below xG predictions are from the Bayesian partial pooling model, both for Robert Pirès (upper) and for the case where any Arsenal player could be shooting (lower). Also shown is the StatsBomb prediction. We see an improvement (since we know this chance was scored) when we make a player-specific prediction.

Our probabilistic predictions contain more information than point estimates, but for the purposes of a simpler comparison we can consider the mean value. The mean value of our team-level prediction is 20%, but conditional on the knowledge that Pirès was shooting, this becomes 33%.

If Arsène Wenger could’ve chosen which of his players was presented with this opportunity, Robert Pirès would’ve been one of his top choices (though possibly behind Thierry Henry). We have an intuitive understanding that such players have the necessary attributes to score from relatively difficult opportunities such as this, and this is accounted for in our model. We have tackled the challenge of greatly reduced (player-specific) datasets, by allowing them to share information on the basis of how similar they are.

Multi-level models capture the multi-level structure of hierarchical (nested) datasets, accounting for both variability and commonality between different groups (in this example: between different players in a team). However, as we can see from the previous plot, by introducing a set of parameters for each group and relating them all in this way, the posterior distribution now has many more dimensions and is more challenging to sample from. If you are using `Stan` you may now see more warning messages regarding *divergent transitions* - a concept that José Mourinho is acting out, below. If you do run into these problems, I would recommend reviewing the guidance in the Stan manual on reparameterisation (writing your same model on a new scale, such that it is easier for the software to work with).
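One common example of this for multi-level models is the *non-centred* parameterisation. I am assuming here that it is the relevant one for this model, so please treat the below as an illustration of the idea rather than a prescription. Instead of sampling each player’s co-efficient directly from the shared prior, we sample a standard normal variable and then shift and scale it. The values of the mean and standard deviation are hypothetical.

```python
import random

random.seed(2)


def non_centred(mu: float, sigma: float, z: list) -> list:
    """Shift and scale standard normal draws: the same distribution as
    Normal(mu, sigma), but written on a scale that is easier to sample."""
    return [mu + sigma * z_j for z_j in z]


mu_alpha, sigma_alpha = -1.5, 0.5  # hypothetical population mean and sd
n_players = 5

# Centred: each player's intercept is sampled directly from the shared prior
alpha_centred = [random.gauss(mu_alpha, sigma_alpha) for _ in range(n_players)]

# Non-centred: sample z ~ Normal(0, 1), then transform
z = [random.gauss(0.0, 1.0) for _ in range(n_players)]
alpha_non_centred = non_centred(mu_alpha, sigma_alpha, z)

print(alpha_centred)
print(alpha_non_centred)
```

Both versions describe the same model. The difference is that the sampler works with `z`, which is not tied to the scale of `sigma_alpha`.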

Finally, I have published a paper demonstrating this modelling approach in an engineering context, which includes additional details for anyone who is interested: ‘Consistent and coherent treatment of uncertainties and dependencies in fatigue crack growth calculations using multi-level Bayesian models’.

BibTeX citation:

```
@online{difrancesco2021,
author = {Domenic Di Francesco},
title = {Uncertainty in {xG.} {Part} 2: {Partial} {Pooling}},
date = {2021-01-07},
url = {https://allyourbayes.com/posts/xg_pt2},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2021. “Uncertainty in xG. Part 2: Partial
Pooling.” January 7, 2021. https://allyourbayes.com/posts/xg_pt2.

The Expected Goals (xG) metric is now widely recognised as a numerical measure of the *quality* of a goal scoring opportunity in a football (soccer) match. In this article we consider how to deal with uncertainty in predicting xG, and how each player’s individual abilities can be accounted for. This is part 1 of the article, which is intended to be free of stats jargon, maths and code. If you are interested in those details, you can also check out part 2.

Opta Sports tell us that the *Expected Goals* (or **xG**) of a shot describes how likely it is to be scored. The cumulative xG over a game will therefore give an indication of how many goals a team would usually score, based on the chances they created.
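For anyone who would like to see the arithmetic, here is a small sketch using made-up xG values for one team’s shots in a single game. The second calculation treats the shots as independent, which is a strong simplifying assumption.

```python
import math

# Hypothetical xG values for one team's shots in a single game
shot_xg = [0.05, 0.12, 0.33, 0.08, 0.61]

# Cumulative xG: the number of goals we would expect from these chances
match_xg = sum(shot_xg)

# Probability of scoring at least once, if shots are treated as independent
p_at_least_one = 1 - math.prod(1 - xg for xg in shot_xg)

print(f"Cumulative xG: {match_xg:.2f}")
print(f"P(at least one goal): {p_at_least_one:.3f}")
```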

Why would anyone be interested in this? Because if the xG model is any good, it can be the basis for an evidence-based style of play. If certain individuals in a team enjoy shooting from long-distance (or any other set of circumstances associated with a low xG), they may be encouraged to keep possession until a more favourable (higher xG) chance arises.

There is no universally accepted way of calculating xG, so there are many competing models around. In this article I will describe a statistical model that cares about who is taking the shot, but does not treat each player as a separate independent case. More on this later…

Once upon a time (in the `2003`-`04` season), Arsenal FC were brilliant. That squad is still referred to as *the Invincibles* after finishing the season without a defeat in the league, scoring the most goals and conceding the fewest. Their top scorer, Thierry Henry, finished 4th in the Ballon d’Or voting this season (having finished 2nd the season before). Unfortunately José Mourinho arrived at Chelsea the following season and Arsenal haven’t won the league since.

I’m using Arsenal’s unbeaten league season as an example because StatsBomb have kindly made all this data freely available in their R package.

Here are their league goal scorers:

And here’s where the goals were scored from:

The above plot shows that many of these goals were scored, even though the (StatsBomb) xG was relatively low. In fact the mean xG of the shots they scored was 0.33. This isn’t necessarily a problem as we do see improbable goals. Below is Giorgian De Arrascaeta’s contender for the 2020 FIFA Puskas award. Was anyone expecting him to score this chance? Could he do it again?

An ideal xG model would correctly predict every goal without error, but the many sources of variability in the game mean that this isn’t happening any time soon. A **Bayesian** model (such as the one I’m proposing) will include uncertainty in its predictions, letting us know when it can narrow down a predicted xG, and when there is a larger range of credible values based on the available information.

Another feature that I’ve introduced to the model is the relationship between the data from different players. I want the model to distinguish between whether a team creates an opportunity for their top scorer, or their full-back who has never scored. One is clearly preferable, and should have a higher xG to reflect this. Why would this matter? Shooting from wide positions may (on average) be unlikely to pay off, but if your team has a winger who is especially adept at it, then it may be a strategy they should pursue.

For instance, Giorgian De Arrascaeta may have had a higher chance of scoring that bicycle kick when you consider that he was also nominated for the 2018 FIFA Puskas award for scoring another acrobatic volley.

The practical issue with considering each player separately is you now have many, smaller datasets. Larger datasets contain more information allowing for model parameters to be estimated more precisely. This sometimes encourages us to throw all our data into a single population and pretend we have a larger dataset. Your software will be happy, since it won’t know the difference, but you will lose the valuable player-specific information.

Bayesian models can do even better than this though. Consider some data that was collected from Arsenal’s defensive midfielder, Gilberto Silva. He scored 3 league goals in their invincible season, but his primary duties were defensive. He had different characteristics than Thierry Henry, but there is some commonality to take advantage of here. If Gilberto Silva scores an opportunity, that gives me *some* information about whether Thierry Henry could have scored it too. How much information? That depends on how similar they are. Unless we tell the model, it will assume we cannot learn anything about one of these players from the other. Both were professional footballers. Neither was a hockey player, or a tree, or a kitten - though a statistical model could not intuit this. If the data did indicate that they were in fact very different players, then the special model structure that we are using would recognise this and not share information between them to the same extent.

If the above concept makes sense to you, then congratulations - you appreciate the utility of multi-level (partial-pooling) Bayesian models. This *sharing of information* is one of many reasons Bayesian methods can perform so well on small (and imperfect) datasets.

We have a model that describes uncertainty (using probability) and makes both team-level and player-specific predictions. Here are some examples:

How about Dennis Bergkamp’s dinked finish when clear through on goal against Birmingham? Remember it? Me neither - here’s where the shot was taken from:

And our predicted xG is shown below, both for Dennis Bergkamp (upper) and for the case where any Arsenal player could be shooting (lower). Here is a great example of being able to make a better prediction conditional on the information of who is taking the shot. The model has identified that Bergkamp was very capable of scoring these kinds of chances and was therefore able to identify a narrow range of very high xG values. However, if we were considering a generic player in the Arsenal team, there is more uncertainty in our prediction.

What about Thierry Henry’s long range goal against Man Utd? (Note that the straight arrow in the below plot does not reflect the true trajectory of his shot).

OK, so I wouldn’t have seen that one coming either ….but I would have given it more of a chance knowing who was shooting.

Here is a final example - a shot from Gilberto Silva, on his stronger foot, which was saved by Neil Sullivan (who I’d completely forgotten had signed for Chelsea that season). I thought this was worth looking at because StatsBomb’s xG suggests this was a very good chance.

Our model did not expect him to score, and also predicted that Freddie Ljungberg would have missed. Henry (unsurprisingly) is expected to have had a better chance, but **interestingly**, our model thinks that Arsenal’s goalscoring winger Robert Pirès would have been most likely to score this opportunity.

What should we make of the above predictions? The single values (*point estimates*) provided by analytics companies may be a bit easier to read, but I’m suggesting that they are not as useful. We should want our models to tell us when they are not sure. There is more information in a probabilistic prediction than a point estimate, which means you can go from the former to the latter, but not vice-versa. The type of model we have discussed in this article has the added benefit of sharing information between different players in a mathematically coherent way (see part 2 for the technical details).
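To illustrate the point about going from a probabilistic prediction to a point estimate, here is a short sketch. The draws are random stand-ins for a predicted xG distribution (they are not output from the model in this article), and the Beta distribution is only a convenient way of generating plausible values between 0 and 1.

```python
import random
from statistics import mean, quantiles

random.seed(3)

# Stand-in for draws from a probabilistic xG prediction (hypothetical)
xg_draws = [random.betavariate(4, 8) for _ in range(2000)]

# Summarising the distribution is easy...
point_estimate = mean(xg_draws)
cuts = quantiles(xg_draws, n=20)  # cut points at 5%, 10%, ..., 95%
lower, upper = cuts[0], cuts[-1]

print(f"mean xG = {point_estimate:.2f}")
print(f"90% interval = ({lower:.2f}, {upper:.2f})")
# ...but the distribution cannot be recovered from `point_estimate` alone.
```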

BibTeX citation:

```
@online{difrancesco2020,
author = {Domenic Di Francesco},
title = {Uncertainty in {xG.} {Part} 1: {Overview}},
date = {20-12-10},
url = {https://allyourbayes.com/posts/xg_pt1},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2020. “Uncertainty in xG. Part 1:
Overview.” December 10, 2020. https://allyourbayes.com/posts/xg_pt1.

This post is intended to be a high-level discussion of the merits and challenges of applied Bayesian statistics. It is intended to help the reader answer: *Is it worth me learning Bayesian statistics?* or *Should I look into using Bayesian statistics in my project?* Maths, code and technical details have all been left out.

Firstly, Bayesian…

- Statistics
- Inference
- Modelling
- Updating
- Data Analysis

…can be considered the same thing (certainly for the purposes of this post): **the application of Bayes theorem to quantify uncertainty**.

So Bayesian statistics may be of interest to you if you are dealing with a problem associated with uncertainty - either due to some underlying variability, or due to limitations of your data.

Bayesian statistics is not the only way to account for uncertainty in calculations. The below points describe what a Bayesian approach offers, that others don’t. Note that I am only really discussing methods involving probability here, though alternative approaches are available.

The outcome of a Bayesian model is a posterior distribution. This describes the joint uncertainty in all the parameters you are trying to estimate. This can be used to describe uncertainty in a prediction for some new input data. By comparison, alternative (frequentist) methods typically describe uncertainty in predictions using confidence intervals, which are widely used but easy to misinterpret.

Confidence intervals are calculated so that they will contain the *true* value of whatever you are trying to predict with some desired frequency. They provide no information (in the absence of additional assumptions) on how credible various possible results are. The Bayesian equivalent (sometimes called credible intervals) can be drawn anywhere on a predictive distribution. In Pratt, Raiffa and Schlaifer’s textbook an example is used to highlight this difference:

*Imagine the plight of the manager who exclaims, ‘I understand [does he?] the meaning that the demand for XYZ will lie in the interval 973 to 1374 with confidence .90. However, I am particularly interested in the interval 1300 to 1500. What confidence can I place on that interval?’* *Unfortunately, this question cannot be answered. Of course, however, it is possible to give a posterior probability to that particular interval - or any other - based on the sample data and on a codification of the manager’s prior judgements.*

And a more succinct description of the same view from Dan Ovando’s fishery statistics blog:

*Bayesian credible intervals mean what we’d like Frequentist confidence intervals to mean.*

Following on from the previous point, an analysis that directly describes the probability of any outcome is fully compatible with a decision analysis. After completing a Bayesian analysis, identifying the optimal strategy implied by your model becomes simpler and more understandable.

As stated in James Berger’s (quite theoretical) book on Bayesian statistics:

*Bayesian analysis and decision theory go rather naturally together, partly because of their common goal of utilizing non-experimental sources of information, and partly because of deep theoretical ties.*

This point is based on one made in Ben Lambert’s book on Bayesian statistics, and it concerns how modern Bayesian statistics is carried out in practice. The computational methods require some effort to pick up, especially if you do not have experience with programming (though Ben Lambert’s book gives a nice introduction to Stan). However, they can be readily extended to larger and more complex models.

So why would anyone ever *not* use Bayesian models when making predictions?

Perhaps the most common criticism of Bayesian statistics is the requirement for prior models. An initial estimate of uncertainty is a term in Bayes’ theorem - but how can you estimate the extent of variability before you see it in your data? This will surely be completely subjective, so the results will vary depending on who is doing the analysis. This, understandably, doesn’t sit right with a lot of casual enquirers.

A common response to this accusation is that subjectivity is not an exclusive feature of Bayesian analysis (how about the whole structure of the model you are trying to fit, regardless of your method?) *…but* at least Bayesians are required to be explicit about it. Priors are part of the model with nowhere to hide (in the code or the reporting) and so they are open to criticism. This point is discussed in **much** more detail in this paper from Columbia University.

Priors can contain as much, or as little, information as desired. However, even in instances where you may feel you don’t have any upfront knowledge of a problem, they represent a valuable opportunity for introducing regularisation (which protects against bad predictions due to overfitting). This idea is discussed in detail in Richard McElreath’s textbook.

In practice, statisticians estimate Bayesian posterior distributions using Markov Chain Monte Carlo (MCMC) sampling algorithms. This approach is slower and more complicated than standard, independent Monte Carlo sampling, and because successive MCMC samples are correlated, each one is less informative. The models that I have worked with during my PhD have taken several hours to finish sampling from, but I have met statisticians whose models run for days or even weeks. Following this, there are checks that need to be completed as there are plenty of things that can go wrong with MCMC.

My background is in mechanical and civil engineering. In discussions with engineering researchers at conferences I have often been told that the errors and complications they encountered when using MCMC methods had made them believe that Bayesian statistics wasn’t for them. These are challenges that I imagine everyone who has attempted modern Bayesian statistics will have encountered, and resolving them can require a deep understanding of your model. Both domain-specific and statistical knowledge are required to help ensure the model you are trying to fit is justified. In addition, some programming *tricks* like reparameterisation can be of great help to your software, which sometimes needs equivalent, but easier-to-interpret, instructions.

With all that in mind, when would this ever be worthwhile?

Regardless of whether you believe we exist in a deterministic universe or not, you will never have a perfect state of knowledge describing your problem: uncertainty exists, so we need a sensible and safe way of accounting for it.

I believe that Bayesian statistics is actually well suited to traditional engineering problems, which are concerned with managing risk when confronted with small, messy datasets and models with plenty of uncertainty. As suggested in the earlier description of confidence intervals, frequentist statistics defines probability based on occurrences of events following a large number of trials or samples. When only limited data is available, Bayesian statistics can shine by comparison.

Very large datasets may contain enough information to precisely estimate parameters in a model using standard machine learning methods, and so it becomes less worthwhile running simulations to characterise variability. But how common are these big data problems in science and engineering? Sometimes large populations of data are better described as multiple smaller constituent groups, after accounting for key differences between them. Bayesian statistics has a very useful way of managing such problems by structuring models hierarchically. This method allows for **partial pooling of information** between groups, so that predictions account for the variability and commonality between groups. I will provide a detailed example of this in a future post.

In conclusion, Bayesian statistics requires (computational and personal) effort to apply. But it provides results that are (usually) more interpretable and closely linked to the questions we want to answer. Whether or not these methods are worth learning of course depends on personal circumstances. I encountered Bayesian statistics during my PhD, and so had plenty of time to read up and I’ve found this to be very rewarding and enjoyable…

BibTeX citation:

```
@online{difrancesco2020,
author = {Domenic Di Francesco},
title = {Why Be {Bayesian?}},
date = {20-03-24},
url = {https://allyourbayes.com/posts/Why_Bayes},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2020. “Why Be Bayesian?” March 24,
2020. https://allyourbayes.com/posts/Why_Bayes.

Logistic regression is a popular machine learning model. One application of it in an engineering context is quantifying the effectiveness of inspection technologies at detecting damage. This post describes the additional information provided by a Bayesian application of logistic regression (and how it can be implemented using the `Stan` probabilistic programming language). Finally, I’ve also included some recommendations for making sense of priors.

So there are a couple of key topics discussed here: Logistic Regression, and Bayesian Statistics. Before jumping straight into the example application, I’ve provided some **very** brief introductions below.

At a very high level, Bayesian models quantify (aleatory and epistemic) uncertainty, so that our predictions and decisions take into account the ways in which our knowledge is limited or imperfect. We specify a statistical model, and identify probabilistic estimates for the parameters using a family of sampling algorithms known as Markov Chain Monte Carlo (MCMC). My preferred software for writing and fitting Bayesian models is `Stan`. If you are not yet familiar with Bayesian statistics, then I imagine you won’t be fully satisfied with that three-sentence summary, so I will put together a separate post on the merits and challenges of applied Bayesian inference, which will include much more detail.

Logistic regression is used to estimate the probability of a binary outcome, such as *Pass* or *Fail* (though it can be extended for `> 2` outcomes). This is achieved by transforming a standard regression using the logit function, shown below. The term in the brackets may be familiar to gamblers as it is how odds are calculated from probabilities. You may see *logit* and *log-odds* used interchangeably for this reason.
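For reference, the standard definition of the logit function is written out here, where the symbol $p$ (an assumed notation) is the probability of the outcome of interest. The ratio inside the brackets is the odds.

```latex
\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right)
```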

Since the logit function transforms data *from* a probability scale, the inverse logit function transforms data *to* a probability scale. Therefore, as shown in the below plot, its values range from `0` to `1`, and this feature is very useful when we are interested in the probability of *Pass*/*Fail* type outcomes.
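The behaviour of the inverse logit can be checked numerically with a short sketch in base `R`. The function name `inv_logit` and the example log-odds values are assumptions chosen for illustration. `plogis()` and `qlogis()` are the built-in equivalents of the inverse logit and logit.

```r
# Inverse logit: maps any real number (log-odds) onto the probability scale
inv_logit <- function(x) {
  1 / (1 + exp(-x))
}

log_odds <- c(-10, -3, 0, 3, 10)
probs <- inv_logit(log_odds)
print(round(probs, digits = 3))

# Base R provides the same pair of functions as plogis() and qlogis()
print(all.equal(probs, plogis(log_odds)))   # inverse logit
print(all.equal(qlogis(probs), log_odds))   # logit undoes the inverse logit
```

However large or small the input, the output stays strictly between `0` and `1`, and a log-odds of `0` corresponds to a probability of `0.5`.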

Before moving on, some terminology that you may find when reading about logistic regression elsewhere:

- When a linear regression is combined with a re-scaling function such as this, it is known as a Generalised Linear Model (**GLM**).
- The re-scaling (in this case, the logit) function is known as a **link function** in this context.
- Logistic regression is a **Bernoulli-Logit GLM**.

You may be familiar with libraries that automate the fitting of logistic regression models, either in `Python` (via `sklearn`):

```
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X = dataset['input_variables'], y = dataset['predictions'])
```

…or in `R`:

```
model_fit <- glm(formula = predictions ~ input_variables,
                 data = dataset, family = binomial(link = 'logit'))
```

To demonstrate how a Bayesian logistic regression model can be fit (and utilised), I’ve included an example from one of my papers. Engineers make use of data from inspections to understand the condition of structures. Modern inspection methods, whether remote, autonomous or manual application of sensor technologies, are very good. They are generally evaluated in terms of the accuracy and reliability with which they size damage. Engineers never receive perfect information from an inspection, such as:

- There is a crack of **exact** length `30 mm` and **exact** depth `5 mm` at this **exact** location, or
- There is **definitely** no damage at this location.

For various reasons, the information we receive from inspections is imperfect and this is something that engineers need to deal with. As a result, providers of inspection services are requested to provide some measure of how good their product is. This typically includes some measure of how accurately damage is sized and how reliable an outcome (detection or no detection) is.

This example will consider trials of an inspection tool looking for damage of varying size, to fit a model that will predict the probability of detection for any size of damage. Since various forms of damage can initiate in structures, each requiring inspection methods that are suitable, let’s avoid ambiguity and imagine we are only looking for cracks.

For the purposes of this example we will simulate some data. Let’s imagine we have introduced some cracks (of known size) into some test specimens and then arranged for some blind trials to test whether an inspection technology is able to detect them.

```
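library(tibble)  # tibble()
library(purrr)   # rbernoulli() (deprecated since purrr 1.0.0, kept so the simulated data are unchanged)

set.seed(1008)

# Number of trials, range of crack depths (mm), and the 'true' parameters
N <- 30; lower <- 0; upper <- 10; alpha_true <- -1; beta_true <- 1

# Crack depths drawn uniformly between lower and upper
depth <- runif(n = N, min = lower, max = upper)

# Probability of detection: inverse logit of a linear model in log(depth)
PoD_1D <- function(depth, alpha_1D, beta_1D){
  PoD <- exp(alpha_1D + beta_1D * log(depth)) / (1 + exp(alpha_1D + beta_1D * log(depth)))
  return (PoD)
}

# Simulate a detection (1) or a miss (0) for each crack
pod_df <- tibble(depth = depth, det = double(length = N))

for (i in seq(from = 1, to = nrow(pod_df), by = 1)) {
  pod_df$det[i] = rbernoulli(n = 1,
                             p = PoD_1D(depth = pod_df$depth[i],
                                        alpha_1D = alpha_true,
                                        beta_1D = beta_true))
}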
```

The above code is used to create 30 crack sizes (depths) between 0 and 10 mm. We then use a log-odds model to back-calculate a probability of detection for each. This is based on some fixed values for $\alpha$ and $\beta$ (`alpha_true` and `beta_true` in the code). In a real trial, these would not be known, but since we are inventing the data we can see how successful our model ends up being in estimating these values.
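As a quick sanity check on this data-generating model, the sketch below re-writes the same log-odds relationship using base `R`’s `plogis()` (the inverse logit) and evaluates it at a few depths. The function name `PoD_true` and the chosen depths are assumptions for illustration. With $\alpha = -1$ and $\beta = 1$ the expression simplifies to depth / (e + depth), so a crack of depth e (about 2.72 mm) has a 50% chance of being detected.

```r
alpha_true <- -1
beta_true <- 1

# Same model as PoD_1D above, written with the built-in inverse logit
PoD_true <- function(depth, alpha = alpha_true, beta = beta_true) {
  plogis(alpha + beta * log(depth))
}

depths <- c(0.5, 1, exp(1), 5, 10)   # crack depths in mm (illustrative)
pod_check <- data.frame(depth = depths,
                        PoD = round(PoD_true(depths), digits = 3))
print(pod_check)
```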

The below plot shows the size of each crack, and whether or not it was detected (in our simulation). The smallest crack that was detected was 2.22 mm deep, and the largest undetected crack was 5.69 mm deep. Even so, it’s already clear that larger cracks are more likely to be detected than smaller cracks, though that’s just about all we can say at this stage.

After fitting our model, we will be able to predict the probability of detection for a crack of any size.

`Stan` is a probabilistic programming language. In a future post I will explain why it has been my preferred software for statistical inference throughout my PhD.

The below is a simple `Stan` program to fit a Bayesian Probability of Detection (PoD) model:

```
library(cmdstanr)
PoD_model <- cmdstan_model(stan_file = "PoD_model.stan")
PoD_model$format()
```

```
data {
  int<lower=0> N; // Defining the number of defects in the test dataset
  array[N] int<lower=0, upper=1> det; // A variable that describes whether each defect was detected [1] or not [0]
  vector<lower=0>[N] depth; // A variable that describes the corresponding depth of each defect
  int<lower=0> K; // Defining the number of probabilistic predictions required from the model
  vector<lower=0>[K] depth_pred; // The crack depths for which predictions are required
}
parameters {
  // The (unobserved) model parameters that we want to recover
  real alpha;
  real beta;
}
model {
  // A logistic regression model relating the defect depth to whether it will be detected
  det ~ bernoulli_logit(alpha + beta * log(depth));
  // Prior models for the unobserved parameters
  alpha ~ normal(0, 1);
  beta ~ normal(1, 1);
}
generated quantities {
  // Using the fitted model for probabilistic prediction.
  // K posterior predictive distributions will be estimated for a corresponding crack depth
  vector[K] postpred_pr;
  for (k in 1 : K) {
    postpred_pr[k] = inv_logit(alpha + beta * log(depth_pred[k]));
  }
}
```

The `generated quantities` block will be used to make predictions for the `K` values of `depth_pred` that we provide.

`K <- 50; depth_pred <- seq(from = lower, to = upper, length.out = K)`

The above code generates 50 evenly spaced values, which we will eventually combine in a plot. In some instances we may have specific values that we want to generate probabilistic predictions for, and this can be achieved in the same way.

Data can be pre-processed in any language for which a `Stan` interface has been developed. This includes `R`, `Python`, and `Julia`. In this example we will use `R` and the accompanying package, `cmdstanr`.

Our `Stan` model is expecting data for five variables: **N**, **det**, **depth**, **K** and **depth_pred**, and `cmdstanr` requires this in the form of a list.

Once we have our data, and are happy with our model, we can set off the Markov chains. There are plenty of opportunities to control the way that the `Stan` algorithm will run, but I won’t cover them here. Instead, we will mostly stick with the default arguments in `cmdstanr`.

```
PoD_fit <- PoD_model$sample(data = list(N = N, det = pod_df$det, depth = pod_df$depth,
K = K, depth_pred = depth_pred), seed = 2408)
```

**Note**: I’ve not included any detail here on the checks we need to do on our samples. There are some common challenges associated with MCMC methods, each with plenty of associated guidance on how to diagnose and resolve them. For now, let’s assume everything has gone to plan.
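To give a flavour of one such check, the sketch below computes the classic Gelman-Rubin potential scale reduction factor (R-hat) by hand for some simulated chains. The chains are stand-ins generated with `rnorm()` rather than draws from `PoD_fit`, and the function name `rhat_basic` is hypothetical. The `posterior` package, which `cmdstanr` uses for its summaries, reports a more refined (rank-normalised, split-chain) version of this statistic.

```r
# Classic (non-split) Gelman-Rubin R-hat for a matrix of draws:
# one row per iteration, one column per chain
rhat_basic <- function(draws) {
  n_iter <- nrow(draws)
  chain_means <- colMeans(draws)
  chain_vars <- apply(draws, 2, var)
  B <- n_iter * var(chain_means)   # between-chain variance
  W <- mean(chain_vars)            # within-chain variance
  var_plus <- (n_iter - 1) / n_iter * W + B / n_iter
  sqrt(var_plus / W)
}

set.seed(2408)
# Four simulated chains that all target the same distribution
mixed <- matrix(rnorm(4 * 1000), ncol = 4)
# Four simulated chains where one has drifted to a different region
stuck <- mixed
stuck[, 4] <- stuck[, 4] + 3

print(rhat_basic(mixed))   # close to 1
print(rhat_basic(stuck))   # clearly greater than 1
```

Values close to 1 suggest that the chains agree with one another. Larger values indicate that at least one chain is exploring a different region, so the samples should not yet be trusted.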

Now, there are a few options for extracting samples from a fitted model object such as `PoD_fit`, including `cmdstanr::as_draws()`. However, these usually require a little post-processing to get them into a tidy format. There is a function in my DomDF R package for this, which we can use to create a tidy output that specifies the iteration, parameter value and chain associated with each data point:

```
library(DomDF)
PoD_samples <- PoD_fit |> DomDF::tidy_mcmc_draws()
head(x = PoD_samples, n = 5)
```

```
# A tibble: 5 × 4
Parameter Chain Iteration value
<chr> <int> <int> <dbl>
1 lp__ 1 1 -15.5
2 lp__ 1 2 -15.2
3 lp__ 1 3 -15.8
4 lp__ 1 4 -16.6
5 lp__ 1 5 -16.1
```

We have sampled from a 2-dimensional posterior distribution of the unobserved parameters in the model: $\alpha$ and $\beta$. Below is a density plot of their corresponding marginal distributions based on the `1000` samples collected from each of the `4` Markov chains that have been run.

So our estimates are beginning to converge on the values that were used to generate the data, but this plot also shows that there is still plenty of uncertainty in the results. Unlike many alternative approaches, Bayesian models account for the statistical uncertainty associated with our limited dataset - remember that we are estimating these values from 30 trials. These results describe the possible values of $\alpha$ and $\beta$ in our model that are consistent with the limited available evidence. If more data were available, we could expect the uncertainty in our results to decrease. I think there are some great reasons to keep track of this statistical (sometimes called *epistemic*) uncertainty - a primary example being that we should be interested in how confident our predictive models are in their own results! …But I’ll leave it at that for now, and try to stay on topic.
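The spread in these marginal distributions can also be summarised numerically. The sketch below does this in base `R` for a data frame with the same column layout as `PoD_samples`. Its values are simulated stand-ins (the means and standard deviations are arbitrary assumptions), so the numbers it prints are not the fitted posterior from this post. The names `fake_samples` and `summarise_param` are hypothetical.

```r
set.seed(1)
# Stand-in for the tidy draws: same columns as PoD_samples above,
# but the values are simulated here and are NOT the fitted posterior
fake_samples <- data.frame(
  Parameter = rep(c("alpha", "beta"), each = 4000),
  Chain = rep(rep(1:4, each = 1000), times = 2),
  Iteration = rep(1:1000, times = 8),
  value = c(rnorm(4000, mean = -1, sd = 0.7), rnorm(4000, mean = 1, sd = 0.5))
)

# Posterior mean and a central 90% credible interval for one parameter
summarise_param <- function(x) {
  c(mean = mean(x),
    q05 = quantile(x, probs = 0.05, names = FALSE),
    q95 = quantile(x, probs = 0.95, names = FALSE))
}

# One row per parameter, one column per summary statistic
post_summary <- t(sapply(split(fake_samples$value, fake_samples$Parameter),
                         summarise_param))
print(round(post_summary, digits = 2))
```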

How do we know what these estimates of $\alpha$ and $\beta$ mean for the PoD (what we are ultimately interested in)? We can check this using the posterior predictive distributions that we have (thanks to the `generated quantities` block of the `Stan` program).

One thing to note from these results is that the model is able to make much more confident predictions for larger crack sizes. The increased uncertainty associated with shallow cracks reflects the lack of data available in this region - this could be useful information for a decision maker!

There are only 3 trials in our dataset considering cracks shallower than 3 mm (and only 1 for crack depths `< 2` mm). If we needed to make predictions for shallow cracks, this analysis could be extended to quantify the value of future tests in this region.

There are many approaches for specifying prior models in Bayesian statistics. *Weakly informative* and *MaxEnt* priors are advocated by various authors. Unfortunately, *Flat Priors* are sometimes proposed too, particularly (but not exclusively) in older books. A flat prior is a wide distribution - in the extreme this would be a uniform distribution across all real numbers, but in practice distribution functions with very large variance parameters are sometimes used. In either case, a very large prior range of credible outcomes for our parameters is introduced to the model. This may sound innocent enough, and in many cases could be harmless.

Flat priors have the appeal of describing a state of complete uncertainty, which we may believe we are in before seeing any data - but is this really the case?

Suppose you are using Bayesian methods to model the speed of some athletes. Even before seeing any data, there is some information that we can build into the model. For instance, we can discount negative speeds. We also wouldn’t need to know anything about the athletes to know that they would not be travelling faster than the speed of light. This may sound facetious, but flat priors are implying that we should treat all outcomes as equally likely. In fact, there are some cases where flat priors cause models to require large amounts of data to make good predictions (meaning we are failing to take advantage of Bayesian statistics’ ability to work with limited data).

In this example, we would probably just want to constrain outcomes to the range of metres per second, but the amount of information we choose to include is ultimately a modelling choice. Another helpful feature of Bayesian models is that the priors are part of the model, and so must be made explicit - fully visible and ready to be scrutinised.

A common challenge, which was evident in the above PoD example, is lacking an intuitive understanding of the meaning of our model parameters. Here $\alpha$ and $\beta$ required prior models, but I don’t think there is an obvious way to relate their values to the result we were interested in. They are linear regression parameters on a log-odds scale, but this is then transformed into a probability scale using the inverse logit function.

This problem can be addressed using a process known as **Prior Predictive Simulation**, which I was first introduced to in Richard McElreath’s fantastic book. This involves evaluating the predictions that our model would make, based only on the information in our priors. Relating our predictions to our parameters provides a clearer understanding of the implications of our priors.

Back to our PoD parameters - both $\alpha$ and $\beta$ can take positive or negative values, but I could not immediately tell you a sensible range for them. Based on our lack of intuition it may be tempting to use a large variance for both, right? Well, before making that decision, we can always simulate some predictions from these priors. The below code is creating a data frame of prior predictions for the PoD (`PoD_pr`) for many possible crack sizes.

*(Thank you to Jiun for your kind message that helped me tidy up the below)*
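A minimal base `R` sketch of this kind of prior predictive simulation is given here. It is not the original code from this post: the function name `prior_pred_PoD`, the number of prior draws, the depth grid, and the standard deviation of 10 for the wide priors are all assumptions chosen for illustration. The narrower priors match those in the `Stan` program above. The sketch evaluates a full PoD curve for each prior draw, so its layout differs from the printed output that follows.

```r
set.seed(1008)

# Draw (alpha, beta) from Normal priors and push each draw through the model
prior_pred_PoD <- function(n_draws, depth, alpha_mu, alpha_sd, beta_mu, beta_sd) {
  alpha <- rnorm(n = n_draws, mean = alpha_mu, sd = alpha_sd)
  beta <- rnorm(n = n_draws, mean = beta_mu, sd = beta_sd)
  # One row per combination of prior draw and crack depth
  grid <- expand.grid(draw = seq_len(n_draws), depth = depth)
  grid$PoD_pr <- plogis(alpha[grid$draw] + beta[grid$draw] * log(grid$depth))
  grid
}

depth_grid <- seq(from = 0.1, to = 10, length.out = 100)   # mm, avoiding log(0)

# Wide, supposedly non-informative priors (sd = 10 is an assumed value)
flat_pp <- prior_pred_PoD(200, depth_grid, alpha_mu = 0, alpha_sd = 10,
                          beta_mu = 0, beta_sd = 10)
# The priors from the Stan program: alpha ~ normal(0, 1), beta ~ normal(1, 1)
weak_pp <- prior_pred_PoD(200, depth_grid, alpha_mu = 0, alpha_sd = 1,
                          beta_mu = 1, beta_sd = 1)

# Compare the spread of prior predictions under each set of priors
print(rbind(flat = quantile(flat_pp$PoD_pr, probs = c(0.25, 0.5, 0.75)),
            weak = quantile(weak_pp$PoD_pr, probs = c(0.25, 0.5, 0.75))))
```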

```
# A tibble: 6 × 2
depth PoD_pr
<dbl> <dbl>
1 0 0
2 0.0100 0.00000780
3 0.0200 0.00366
4 0.0300 0.0452
5 0.0400 0.0413
6 0.0501 0.00619
```

And we can visualise the information contained within our priors for a couple of different cases.

Our wide, supposedly *non*-informative priors result in some pretty useless predictions. I’ve suggested some more sensible priors that suggest that larger cracks are more likely to be detected than small cracks, without overly constraining our outcome (see that there is still prior credibility that very small cracks are detected reliably and that very large cracks are often missed).

Why did our predictions end up looking like this?

Borrowing from McElreath’s explanation, it’s because $\alpha$ and $\beta$ are linear regression parameters on a log-odds (logit) scale. Since we are estimating a PoD we end up transforming our predictions onto a probability scale. Flat priors for our parameters imply that extreme values of log-odds are credible. All that prior credibility of values `< -3` and `> 3` ends up getting concentrated at probabilities near `0` and `1`. I think this is a really good example of flat priors containing a lot more information than they appear to.
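This concentration can be checked with a couple of lines of base `R`. The two prior standard deviations used below (10 and 1.5) are assumed values for illustration and are not taken from the models in this post. The function name `mass_outside` is hypothetical.

```r
# Probabilities that correspond to log-odds of -3 and +3
print(round(plogis(c(-3, 3)), digits = 3))

# Prior mass outside (-3, 3) on the log-odds scale for a Normal(0, sd) prior
mass_outside <- function(sd) {
  2 * pnorm(-3, mean = 0, sd = sd)
}

prior_mass <- c(wide = mass_outside(10), narrow = mass_outside(1.5))
print(round(prior_mass, digits = 3))
```

Log-odds of `-3` and `3` correspond to probabilities of about `0.05` and `0.95`. A Normal(0, 10) prior places roughly three quarters of its mass beyond these limits, compared with under 5% for a Normal(0, 1.5) prior.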

I’ll end by directing you towards some additional (generally non-technical) discussion of choosing priors, written by the `Stan` development team (link). It provides a definition of *weakly informative priors*, some words of warning against *flat priors* and more general detail than this humble footnote.

BibTeX citation:

```
@online{difrancesco2020,
author = {Domenic Di Francesco},
title = {Bayesian {Logistic} {Regression} with {Stan}},
date = {20-02-15},
url = {https://allyourbayes.com/posts/Logistic_Bayes},
langid = {en}
}
```

For attribution, please cite this work as:

Domenic Di Francesco. 2020. “Bayesian Logistic Regression with
Stan.” February 15, 2020. https://allyourbayes.com/posts/Logistic_Bayes.