Abstract
Predicting football results using mathematics has become increasingly popular over the years and many different models have been used to model football data. This is due to a number of reasons, and the application of the models have many real world uses. In my report, I predict football results for the English Premier League using a Poisson regression model. This is done by using historical data from the website http://www.football-data.co.uk/. The Poisson regression model is implemented in R along with the simulations of games, all the steps taken are outlined throughout and the code can be found in the appendix. Some of the analysis is done using Microsoft Excel.
Contents
1 | Introduction | 2 |
Literature Reviews | ||
2 | Poisson Models | |
Chapter 1
Introduction
The English Premier League is the highest tier league in England. Founded in 1992, it is the country’s primary football competition, which twenty clubs compete in yearly. The league operates on a system of promotion and relegation within the English Football League. The season runs between August and May. Each team competing in the league plays against each other twice, home and away; totalling thirty-eight matches each and three hundred and eighty matches overall within a single season. The English Premier League is the most-watched sports league in the world, broadcast in two hundred and twelve territories to six hundred and forty three million homes and a potential TV audience of 4.7 billion people. In the 2014-15 season, the average Premier League match stadium attendance exceeded thirty six thousand spectators (En.wikipedia.org, 2017). It is no wonder the Premier League has a market value of £4.17 billon (transfers et al., 2017).
Predicting the outcome of football matches has become a very lucrative business. Individuals partake in the notion of predicting games for the sole purpose of winning money through various types of betting. “The Football Pools started 90 years ago this year…The growth of the internet and mobile devices with quick access to odds has made betting generally much more accessible,” stated Keogh and Rose (2013). These technological advancements have been instrumental in turning the football betting market into the worldwide phenomenon that it is today. In an interview Darren Small, director of integrity at betting and sports data analyst Sportsradar said “The current estimations, which include both the illegal markets and the legal markets, suggest the sports match-betting industry is worth anywhere between $700 billion and $1trillion (£435 billion to £625 billion) a year.” The United Kingdom Gambling Commission recently released statistics presenting the total annual gross gambling yield for the gambling industry in the United Kingdom up to March 2016; these figures show that the United Kingdom has a gross gambling yield of £13.6 billion with football generating £449.44 million (Gamblingcommission.gov.uk, 2017). Betting is not as straight forward as it seems. 1X2 betting is the most commonly used type of wager in football. To win money you must correctly predict the full time result of a match, whether a given team wins, draws or loses (Sports Central, 2017). This is the most simplistic and basic form of betting, more complicated forms include … However some of the bigger bookmakers for example, Sky Sports and William Hill create their own style of betting. A great example of this is the Sky Super 6. A user must predict the outcome of six football matches in the Barclays Premier League and the minute of the first goal in any of the six games for a chance to £250 000 each week (Super6.skysports.com, 2017). Therefore, it is clear that predicting the outcome of football matches is extremely common around the world. Not only does an individual who is wanting to place a bet on their favourite sports team, (to make things a little more interesting) make predictions; betting companies do this too. They have to set their odds appropriately to ensure they make money overall. One way they achieve a high profit is by luring sports fans into betting on the least likely outcome by offering them higher returns. Predicting football results also has other real world applications other than for gambling such as coaching improvements and in journalism.
Mathematics is constantly evolving; advancements in technology means that it can solve almost anything in today’s world. However, can it predict the outcome of football games? How accurate are the predictions? Various proposals for modelling the outcome of football matches exist, but not all are mathematical. As stated, predictions need not solely be made using mathematical models. Experts in the field can use their extensive self-knowledge and their possible previous practical playing experience to base their predictions on. Mark Lawrenson, a familiar face for football fans, is a good example of an individual that bases his predictions on the above. Lawrenson is a former Liverpool defender amongst others (En.wikipedia.org, 2017), draws on his extensive field experiences and is now an established pundit, both on BBC television and radio. All of this without the use of any mathematical calculations.
The main focus of this report is looking at predicting football results. Firstly, by assessing whether the football data will fit a Poisson distribution, from there Poisson regression allows the simulation of football matches, thus providing a predicted result. Football leagues are full of variables that can affect results, the analysis will look at the randomness of a football league and investigate how the strengths differ between each team and how this may have an impact on the simulation. Using this, the report will look to establish the possible team position within the English Premiere League table at the end of the 2015-2016 season and then compare it to the actual league standings. The Poisson regression model is one of the more commonly used mathematical approaches to predicting results, in this case football results, and this is the method that will be used in order to garner the predictions for the purpose of this report. The predictions made are not expected to be 100% accurate as the sample of data being used will not allow for this. However, the report will look to provide a good indication of the likelihood of a team winning or losing. R will be the main statistical program which will be used in order to simulate the football matches and provide the predictions, and the explanation on how this is done along with the functions used will be provided in the report.
Literature Reviews
The first journal to be reviewed is “Modelling association football scores” This paper introduces the idea that “skill rather than chance dominates the game.” Throughout the history of football there have been many games where the better team has lost a game to an opponent due to a number of reasons for example injury to a player. Maher continues this idea and explains that “teams are not identical; each one has its own inherent quality.” Thus meaning that chance will have less of an effect over several games and that better teams should have a higher probability of winning. To make predictions Maher uses football data from four English Football Leagues from the years 1973, 1974 and 1975. Thus, giving him data from twelve separate leagues for analysis.
Previously Maher looked at journals rejecting the Poisson model as a good fit for the football data. To disprove this he uses the idea that during a game if a team has possession of the ball, then they have an opportunity to attack and score. Obviously, not every attack results in a team scoring a goal, so the probability of a goal being scored each time a team is in possession of the ball is very small, but the amount of possession a team has during a game is large. If the probability is constant and attacks are independent, Maher suggest that the number of goals scored will be Binomial. This suggests the Poisson approximation will be a good fit for the football data, further in the report will be an explanation of why this is the case. Maher also states that, “the mean of this Poisson will vary according to the quality of the team.” Maher models the home and away goals separately to a Poisson distribution.
Therefore, for this model Maher formulates his attack and defence parameters using the mean of each Poisson in the following way; “if team i is playing at home against team j and the observed score is
(xij,yij), we shall assume that
Xijis Poisson with mean
αiβj, that
Yijis also Poisson with mean
yiδj. Then we can think of
αias representing the strength of team i’s when playing at home,
βjthe weakness of team j’s defence when playing away,
yithe weakness of teams i’s defence when playing away,
δjthe strength of team j’s attack when playing away.” Therefore, in a league where there are twenty-two teams there are eighty-eight such parameters, equalling four parameters per team. Since each match is considered as two separate games X and Y are assumed to be independent,
αand
βwill be estimated from X and,
yalong with
δwill be estimated from Y. After this, the log likelihood function is used and the maximum likelihood estimates of the discrete random variable parameters are found by the iterative technique Newton-Raphson. This seems like an excellent method of determining the attack and defence parameters and is something that will be considered for revision of this report.
As it is not necessary to have four parameters for each team, a hierarchical model is fitted to the parameter estimations. Moving up a level in the hierarchical model leads to the introduction of (n-1) further parameters, as under the null hypothesis the extra parameters are not required. Now the parameters
yand
δdo not need to be included. For my analysis of the data I will now be considering using two parameters rather than four. A goodness-of-fit test is used on all twelve datasets to determine whether there is a significant difference between the observed and expected
χ2values at a 5% and 1% level. Similarly, I will be using a chi-squared (
χ2) goodness-of-fit test to determine whether my maximum likelihood estimates for the number of goals scored in a season are significantly different from the expected values. As the goodness-of-fit test yielded results that show that Maher’s independent Poisson model for each match is not the best fit for the data, there are improvements that can be made. The following improvements were made, a bivariate Poisson model approach is taken, which, in turn leads to the addition of another parameter (
ο). When the goodness-of-fit test was applied to the bivariate model it yielded better results, implying that the bivariate model is a better for the data.
Maher’s model shows how the results of football matches can be modelled by the use of an independent Poisson model or, a bivariate Poisson model for the scores of both the home teams and the away teams. The parameters that Maher calculated represent the attacking and defensive strength and qualities of each team. When Maher applied his model to each data set all but five cases gave a good fit for the data at a 5% significance level. A bivariate model was used due to the differences in scores not having a good fit with the independent Poisson model.
The second journal to be reviewed is, “Modelling Association Football Scores and Inefficiencies in the Football Betting Market,” created by Mark J. Dixon and Stuart G. Coles. Their journal shows the development of a statistical model, which was created in order to accurately predict the probabilities of the outcomes of football games. The data they have used to predict the probabilities of football games is from the English League and Cup Football, from the 1992-1993 seasons to 1994-1995 seasons. They also have access to data from the 1995-1996 seasons, and use these as a validation sample to test the accuracy of their model. Each data set consists of the home score and the away score only. They decide not to use other data which is available to them such as newly signed players or the sacking of a manager because the information for this is “less easily formalised and its qualitative value is subjective.” Having access to more up-to date data means that I can predict with a higher degree of accuracy the probability of whether or not a team will win. Similarly, to Dixon and Coles, I will be compiling my data sets, only to contain the home score and away score of a match and disregard other data available in order to keep the results as objective as possible leading to a greater degree of accuracy.
Dixon and Coles aim is to exploit potential inefficiencies in the association of the football betting market by deriving a model, which can allow them to estimate the probabilities of football results. By doing this they have the potential to achieve a positive return when using the probabilities they have calculated as the basis of a betting strategy against bookmakers odds. Their basic model is a simple bivariate Poisson distribution for the number of goals scored by each team, with parameters related to past performance. For their betting strategy to work they had to estimate the probabilities of the outcomes of the football games on a team-specific basis. Maher published a model that assumes independent Poisson distributions for the number of goals scored by each of the home and away teams, with means that are specific to each teams past performance. Having reviewed this, they form the foundation for their modelling approach by assuming independent Poisson distributions for the number of goals scored by each of the home and away teams. In their journal, they clearly outline the features required of a statistical model for football matches that aims to develop a profitable betting strategy. Here are some great examples of the features they use that I will be considering for my model. The model should take into account different abilities if both the teams in a match. A team’s ability is a measure of their ability to attack, or their ability to defend. Another key feature they established is that the teams playing at home have some advantage, online journal.
One of their improvements to Maher’s model is the incorporation of a time perspective, so that matches played a long time ago do not have as much influence of the parameter estimates. This ensures higher reliability and is a factor that could be considered and explored in reference to the model outlined in this report. Dixon and Coles also make amendments to the maximum likelihood equation produced by Maher, so that the parameters used for predictions are locally constant through time. This results in their “pseudo-likelihood” equation having historical data, which is down weighted in the likelihood to a greater or lesser degree. The estimating procedure uses a technique called maximum likelihood, a technique that will also be being used further in this report to ensure good and more reliable predictions.
As their model is simple, there is room for further improvement. As they have explained, one possibility is to refine further the Poisson regression model… They consider stochastically updated parameters, but the detailed implementation is difficult… They also had a look at Smith (1981), and considered a dynamic regression framework for their simple Poisson model, although the generalisation of these ideas to the model is not immediate meaning? Another idea to improve their model is the addition of a Bayesian structure to exploit the addition of covariate information. They established a third possibility for improvement would be to change the betting regime. Having restricted attention so far to fixed odds bets of match outcomes lead to a betting strategy in which relatively few bets are actually placed. As bookmakers offer odds on specific match score, the probabilities of which can be obtained using their model, a betting strategy based on match scores could be developed.
The third journal featured in this report to be reviewed is “Bayesian hierarchical model for the predictions of football results,” created by Gianluca Baio and Marta A. Blangiardo. In their journal they propose a Bayesian hierarchical model football results from the Italian Serie A Championship using historical match data from the 1991-1992 season, and then test the performance of their model on the 2007-2008 Italian Serie A Championship season.
Firstly Baio and Blangiardo model the goals scored by each time as an independent Poisson vector for each count
y=(yg1,
yg2)where the two parameters represent the scoring intensity in the gth game. “We indicate the number of goals scored by the home and by the away teams in the g-th game of the season (g = 1,…, G = 306) as
yg1and
yg2respectively.” The parameters are modelled using a log-linear random effect model, which also incorporates a home team advantage in their parameters. Incorporating this into the data structure gives them the following for their model; the name and code of the teams, and the number of goals scored for each game of the season. As they are going for a Bayesian approach to model the football results, they specify suitable prior distributions for all the random parameters in their parameters, and the team-specific effects are modelled as ex-changeable from a common distribution. They also impose some identifiable constraints on the teams’ specific parameters, the constraints they use sum-to-zero constraints. Finally, they established hyper-priors of the attack and defence effects are modelled independently using flat prior distributions. The objective of their model is twofold. Firstly, they wish to estimate the scoring rates. They do this using the results from games and updating the prior distributions by means of the Bayes’s theorem using an MCMC-base procedure. The second objective of their model is to make predictions. They are able to do this by using the results obtained from the summary statistics for the posterior distributions. Now the results obtained from the predictions are compared to the Bivariate Poisson model and to the actual data. This allows Baio and Blangiardo to see how well the Bayesian hierarchical model fits the football data. A possible drawback to their model is over-shrinkage over shrinkage is… They state that this can be caused as their model assumes that “all the attack and defence properties be drawn by a common process, characterised by the common vector of hyper-parameters.” This will penalise the better teams and overestimate the performance of teams that do not feature so high up on the league table. To reduce over-shrinkage caused by the hierarchical model Baio and Blangiardo introduce another complicated structure for the parameters in the model. This complicated structure limits over-shrinkage by using a non central distribution on v = 4 degrees of freedom. What is the complicated structure briefly explain?
The Bayesian hierarchical model is implemented using standard MCMC algorithms. Overall, the model performs similar to bivariate Poisson model previously discussed. One of the drawbacks to the results Baio and Blangiardo found is that, as they chose a simple method of producing the results, the results come in one batch for the whole season. As a result of this, individual games cannot be predicted. To fix this they establish that they must come up with a more complex approach where the hyper-parameters are time-specific. This will mean that they can account for periods where the form of a team is reduced to due to issues such as injury. I found this journal very interesting and may consider incorporating aspects from Baio and Blangiardo’s Bayesian hierarchical model in my analysis.
Look at over-shrinkage in more detail.
Chapter 2
2.1 Poisson Process
http://faculty.neu.edu.cn/ise/songxiaoshi/book/Stochastic_Process.pdf
The Poisson process is based upon a Poisson distribution
Y ~ Poi(λ)that counts the number of occurrences per unit time, where
λ>0is the average number of occurrences per unit time (Mieghem, 2014). For the following report, an occurrence will be a goal scored during a football match in the English Premier League.
A simple stochastic process used for modelling the times at which arrivals enter a system is the Poisson process. It is similar in many ways the continuous-time version of the Bernoulli process, which can be characterised by a sequence of independent identically distributed (IID) binary random variables (rv s),
Y1, Y2, …,where
Yi=1indicates an arrival at increment
iand
Yi=0indicates that there has been arrival at increment
i. Although with a Poisson process, rather than the arrivals occurring at only positive integer multiples of a given increment size, the arrivals occur at arbitrary positive times. In addition, the probability of an arrival for the Poisson process at any particular instance is 0, (Gallager, 2016).
Therefore describing a Poisson process in terms of the probability of an arrival at any given instance is not ideal. A more clean and convenient description of the Poisson process is to define it in terms of a sequence of inter-arrival times,
X1, X2, …,which are defined to be independent identically distributed (IID), (Gallager, 2016). Before we define the Poisson process, the following definitions of a stochastic process, arrival process, counting process and renewal process are below, in order to gain a better understanding of the Poisson process.
Definition: Stochastic Process
A stochastic process (or random process) is an infinite collection of random variables defined on a common probability model. An integer or real number often interpreted as time usually indexes the random variables in a model, (Gallager, 2016).
Definition: Arrival Process
An arrival process is a sequence of increasing random variables,
0<A1< A2<…,where
Ai<Ai+1means that
Ai+1-Ais a positive random variable, i.e. a random variable X such that
FX0=0. The random variables
A1, A2, …are called arrival epochs and represent the successive times at which some random repeating phenomenon occurs. For an arrival process, the process starts at time 0. Multiple arrivals can occur but not at the same time. In addition, although we refer to these processes as arrival processes, they could equally well model departures from a system, or any sequences of incidents, (Gallager, 2016).
Figure 2.1
Figure 2.1 is a diagram of a sample function of an arrival process and its arrival epochs
(A1, A2, …). The diagram also shows inter-arrival time (
X1, X2, …) and the sample functions counting process (
Nt;t>0). The definition of a counting process is found further along in the report, (Gallager, 2016).
As illustrated in Figure 2.1, any arrival process
Sn;n≥1can be specified by either of two alternative stochastic processes. The first alternative is the sequence of inter-arrival times,
Xi; i ≥ 1where
Xiis a positive random variables defined in terms or the arrival epochs by
X1= S1and
Xi=Si-Si-1for
i>1. Similarly, each arrival epoch
Snis specified by (
Xi;i≥1)as;
Sn=∑i=1nXi
Thus, the joint distribution
X1, X2, …,
Xnfor all
n>1is sufficient to specify the arrival process. Since the inter-arrival time are independently identically distributed in most cases of interest, it is usually simpler to specify the joint distribution of the
Xithan of the
Si.
The second alternative is the counting process (
Nt;t>0), where
t>0the random variable
Ntis the aggregate number of arrivals up to and including time
t. The counting process (
Nt;t>0), is an infinite family of random variables where the counting random variable
Nt, for each
t>0, is the aggregate number of arrivals in the interval (0, t]. The random variable
N0is defined to be 0 with probability 1, which means that we are considering only arrivals at strictly positive times.
Definition: Counting Process
The counting process
(Nt;t>0)for any arrival process has the property that (
Nτ≥Ntfor all
τ≥1>0(i.e.
Nτ-Ntis non-negative random variable for
τ≥t>0). For any given integer
n≥1and time
t>0,the nth arrival epoch,
Snand the counting random variable,
Ntare related by
(Sn≤t=(N(t)≥n)
To see this, note that
Sn≤ tis the event that the nth arrival occurs at some epoch
τ ≤ t. This event implies that
N(τ )= n, and thus that
(N(t) ≥ n). Similarly,
(N(t) ≥ n)for some m ≥ n implies (S_{m}≤ t), and thus that (S_{n} ≤ t).
Definition: Renewal Process
A renewal process is arrival process for which the sequence of inter-arrival times is a sequence of positive are independent identically distributed random variables, (Gallager, 2016).
“The most elegantly simple and most important arrival processes are those for which the inter-arrival times are independently identically distributed. These arrival processes are called renewal processes,” (Gallager, 2016). Poisson processes are very simple therefore; it comes as no surprise that they are considered one of the most widely used class of renewal processes, (Gallager, 2016).
Definition: Poisson Process
A Poisson process is a renewal process in which the in which the inter-arrival times have exponential cumulative distribution function (CDF); i.e. for some real
λ>0, each
Xihas the density
fx(x) = λe-λx for x ≥ 0.
The parameter λ is referred to as the rate of the process, (Gallager, 2016).
The Poisson process is unique amongst all other renewal processes due to the memoryless property of the exponential distribution.
Definition: Memoryless Random Variables
A random variable holds the memoryless property if X is a positive random variable for which the following applies;
Pr{X > t + x} = Pr{X > x} Pr{X > t}
For all x, t ≥ 0, (Gallager, 2016).
For an exponential random variable X, the following applies;
X of rate λ > 0, Pr{X > x} = e-λx
For x ≥ 0, (Gallager, 2016). It can be seen that the equation above satisfies the definition of the memoryless random variable, therefore X is memoryless. Conversely, an arbitrary random variable X is memoryless only if it is exponential. To see this;
let h(x) = ln[Pr{X > x}
In addition, observe that since,
Pr{X > x}is non-increasing in
x, h(x)is also.
In addition,
ht+x=hx+h(t)
For all x, t ≥ 0.
These two statements imply that
hxmust be linear in
x, and thus
Pr{X > x}must be exponential in in
x.
Since a memoryless random variable X must be exponential, there must be some λ > 0 such that
Pr{X > t} = e-λt > 0
For all t ≥ 0.
This means that we can rewrite it as
Pr{X > t + x | X > t} = Pr{X > x}
If X is interpreted as the waiting time until some given arrival, then the equation above states that, given that the arrival has not occurred by time t, the distribution of the remaining waiting time is the same as the original waiting time distribution i.e., the remaining waiting time has no ‘memory’ of previous waiting.
General Assumptions for Poisson Model
We assume that we have an initial score of 0 for each game so that
X0=0.As no team has a handicap advantage. Arrivals occur according to a
Poi(λ)distribution. That is;
- Arrivals occur independently.
- The rate of arrival is constant with the average time between arrivals being
L= 1λ
.
- The probability of exactlykarrivals per unit time is;
PA1=k= λkeλk!
For k 0, 1, 2,… and the average number of arrivals per unit time is
λ, with
λ>0.
- Because the rate of arrival is constant, the average number of arrivals through time t is proportional to the interval length [0,t]. Thus, there are an average of
λtarrivals through time t, and the number of arrivals per time
tis a
Poi(λ t) distribution, (Neal, 2017).
2.2 Range and Probability Distributions
http://people.wku.edu/david.neal/540/Poisson.pdf
Definition: Discrete Data
“Discrete data is information that can be categorized into a classification. Discrete data is based on counts. Only a finite number of values is possible, and the values cannot be subdivided meaningfully.” For example, the number of goals scored in a football game, (Mckenzie, 2017).
In this setting, an occurrence will be an arrival to a group that has an initial amount of J members. The Poisson process
Atcounts the number of arrivals to the group through time t. We also count the number in the group at time t and denote this amount
Xt. We then have,
Xt=J+At>0.
Each process
Atand
Xthas a range consisting its possible values. In this case,
Atcounts the number of arrivals through time t while
Xtcounts the total number in the group at time t. Thus,
Range At=0,1,2,… and Range Xt=J,J+1,J+2,….
The probability distribution function of a discrete-values process Z gives the probability, denoted by
P(Zt=k), of Z equalling k at time t, for all k in
Range Zt.
The number of arrivals through time t is still a Poisson distribution, but the average number of arrivals is
λt. Thus, the probability of exactly k arrivals through time t is,
PAt=k=(λt)ke-λtk!
For k = 0, 1, 2,…
On the other hand, there are k members in the group if there have been
k-jarrivals. So the probability of having exactly k members in the group at time t is given by,
PXt=k=PAt=k-J=(λt)k-Je-λtk-J!
For
k=J, J+1, J+2, …
http://www1.maths.leeds.ac.uk/~voss/projects/2010-sports/JamesGardner.pdf (Page 8)
Therefore, it is clear to see that the probability mass function can be used to calculate the number of times an event will occur. The total probability of the number of events that can occur is 1.
∑x=1∞exp-λλxx!=1
For any value of
λ, where
λis the expected value of a variable. A sum is used here as a Poisson process and distribution uses discrete values, whereas in the continuous case, this would be an integral as the area under the curve would be needed. However, in the discrete Poisson case, it is only required that the number of occurrences at each time is recorded as the data points, which can then be viewed in a bar chart. The graphical representation of this will look different for different values of
λ. The bar charts below show the probability mass shape for a Poisson random variable with several values of
λby processing 20 random Poisson values for each
λ. For a Poisson distribution, the value
λrepresents the expected number of occurrences of an event for the variable in question. A unique property of the Poisson distribution is that both the mean and the variance are equal to the value of
λ, this will be explained further in this chapter.
As seen from the graphs above lower values of
λskew the density to the left as the expected value of the random variable is low, meaning that a higher percentage of low values will occur. Higher values of
λskew the density to the right as the expected value of the random variable is high, meaning that a higher percentage of high values will occur.
2.3 Mean, Variance and Standard Deviation
A Poi(
λ) distribution already has it specified that the average value is
λ. And it is also the case that its standard deviation
λ. Because
At~Poiλt, we then know that the average number of arrivals per time t is
λtwith a standard deviation of
λt. But these facts also can be quickly derived using the Poisson distribution function of
Atalong with the fact that,
∑k=0∞xkk!=ex
For all x.
Therefore, the average number of arrivals per time t (also called the mean or expected value) can be derived by,
EAt=∑k ε Range AtkPAt=k=∑k=0∞(λt)ke-λtk!
=∑k=1∞k(λt)ke-λtk!
=∑k=1∞k(λt)ke-λt(k-1)!= ∑k=0∞(λt)k+1e-λtk!
=λte-λt∑k=0∞(λt)kk!
=(λt)e-λteλt
=λt
The variance of random variable Y is found by
VarY=EY2-(EY)2, and then its standard deviation gives a way of measuring the average spread from the mean.
To find the variance of
At, we first find the average square of
Atby,
EAT2=∑k ε Range Atk2PAt=k=∑k=0∞(λt)ke-λtk!
=∑k=1∞k2(λt)ke-λtk!
=∑k=1∞k(λt)ke-λt(k-1)!= ∑k=0∞(k+1)(λt)k+1e-λtk!
=λt(∑k=1∞k(λt)ke-λtk-1!+∑k=0∞(λt)ke-λtk!)
λtEAt+1=(λt)2+λt
So the variance of
Atis
VarAt=EAT2-(EAt)2=(λt)2+λt-(λt)2=λt, and its standard deviation is
σAt=VarAt=λt.
2.4 Probability Generating Function
For a discrete-valued stochastic process Z, we can define a probability generating function by
Gzs,t=ESZt=∑k ε Range AtskP(Zt=k)
The probability generating function completely determines the process due to the following properties,
2.5 Modelling Football Scores as a Poisson Distribution
Descriptive Stats | |
Mean | 2.565789 |
Standard Error | 0.083633 |
Median | 2 |
Mode | 2 |
Standard Deviation | 1.630316 |
Sample Variance | 2.657929 |
Kurtosis | 0.673811 |
Skewness | 0.67353 |
Range | 9 |
Minimum | 0 |
Maximum | 9 |
Sum | 975 |
Count | 380 |
The descriptive statistics table above was created by inputting the total number of goals per game into Microsoft Excel and by using the Data Analysis function. The table shows statistical analysis of the total number of goals scored in a game from both the home and away teams, from the 2014/2015 Premier League season. This is required in order to see if the football data fits a Poisson distribution.
Probability Mass Function
pXx=exp(-λ)λxx!
Using values from the analysis, we can see that
pXx=exp(-2.565789)-2.565789xx!
Taking the 2014-2015 Premiership football season as an example, where information on all 380 matches is considered, and looking at the total number of goals scored by both teams in each of the 380 games played, it is easy to see that the distribution of the goals scored looks very similar to the shape of a Poisson distribution discussed earlier. The total number of goals in this Premiership season was 975 giving an average of 2.565789 goals per game. Therefore, let X be the set of data of the total number of goals in all 380 matches, the
E(X)≈2.565789and from the theorem above
λ=2.565789. The two bar charts above show the actual data for the total number of goals scored in the 2014/2015 season, and the other bar chart shows the random Poisson variable for the value
λ=2.565789where the probabilities of each number is calculated from
λ=2.565789using the probability mass function.
Comparing the two charts, it is clear to see that the goals scored in the games of 2014-2015 Premiership season do very closely follow a Poisson distribution with a mean of the average number of goals scored in a game in that season. This can be seen due to the fact that scoring a higher number of goals than the mean, which is generally around 2 goals per game, is a lot less likely.
2.6 Test of Distribution
How close the total goals fit the Poisson distribution can be tested using a goodness of fit test. As the Poisson distribution and the football data are discrete, the Chi-squared goodness of test can be used. It test whether the observed data (the 2014-2015 Premiership football season) follows a particular distribution (the Poisson distribution).
Is taken and the test statistic
Xt2is calculated using the formula,
Xt2=∑i=1n(Oi-Ei)2Ei
Where
Oiare the observed values from the data,
Eiare the expected values for the distribution is question and n is the number of observations. The observed values from the 2014-2015 Premiership football season of total number of goals in the 380 games are the frequencies.
The table below shows the observed values for the total number of goals scored in a game for the entire 2014/2015 Premier League Season. The table is graphically displayed as a bar chart above.
Number of Goals | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Frequency | 31 | 77 | 88 | 85 | 56 | 27 | 9 | 3 | 3 | 1 |
The following Poisson probabilities were calculated using the probability mass function using Microsoft Excel.
No of Goals | Poisson Probability |
0 | 0.07685848 |
1 | 0.197202678 |
2 | 0.252990278 |
3 | 0.216373264 |
4 | 0.138792061 |
5 | 0.071222242 |
6 | 0.03045688 |
7 | 0.011163706 |
8 | 0.003580465 |
9 | 0.001020747 |
The table below shows the expected values for the total number of goals scored in a game for the entire 2014/2015 Premier League Season. In order to calculate the expected values of goals scored we use the values obtained from the Poisson formula (probability mass function) and multiply them by the number of observations, which are the total games played in a season (380 matches).
380*PX=x for each value of x.
Number of Goals (x) | Expected Frequency
380*PX=x for each value of x |
0 | 29.2062224 |
1 | 74.93701774 |
2 | 96.13630566 |
3 | 82.22184036 |
4 | 52.74098313 |
5 | 27.06445187 |
6 | 11.57361429 |
7 | 4.242208244 |
8 | 1.360576657 |
9 | 0.387883696 |
The Poisson model now fits the football data. Now we can determine the test statistic as we have the expected values.
Which test statistics do we use? Explain why!!
H0:The total number of goals scored per game follow a Poisson distribution
H1:The total number of goals scored per game does not follow a Poisson distribution
Since the Poisson distribution has one parameter, its mean λ, either a specified value can be included as part of the null and alternative hypotheses or the parameter can be estimated from the sample data.
http://courses.wcupa.edu/rbove/Berenson/10th%20ed%20CD-ROM%20topics/section12_5.pdf
The following equation determines whether the data follows a specific probability distribution (Poisson distribution). We call it the chi-squared test.
Χk-p-12=∑k(f0-fe)2fe
Where
f0=observed frequency
fe=theoretical or expected frequency
k=number of catagories or classes remaining after combining classes
p=number of parameters estimated from the data
There are 10 categories in total as the total number of goals scored per game ranges from 0-9. Since the mean of the Poisson distribution has been estimated from the data, the number of degrees of freedom are;
k-p-1=10-1-1=8 degrees of freedom
Using a 0.05 level of significance, the critical value of
Χ2with 8 degrees of freedom is 15.507.
Reject H0 if X2>15.507;otherwise do not reject H0
No of Goals | f0 | fe | (f0 – fe) | (f0 – fe)^2 | (f0 – fe)^2/fe |
0 | 31 | 29.2062223 | 1.7937777 | 3.217638437 | 0.110169621 |
1 | 77 | 74.93701774 | 2.062982257 | 4.255895792 | 0.05679297 |
2 | 88 | 96.13630566 | -8.136305657 | 66.19946975 | 0.688600101 |
3 | 85 | 82.22184036 | 2.778159635 | 7.718170958 | 0.093870083 |
4 | 56 | 52.74098313 | 3.259016871 | 10.62119097 | 0.201384016 |
5 | 27 | 27.06445187 | -0.064451869 | 0.004154043 | 0.000153487 |
6 | 9 | 11.57361429 | -2.573614286 | 6.623490493 | 0.572292313 |
7 | 3 | 4.242208244 | -1.242208244 | 1.543081321 | 0.363744831 |
8 | 3 | 1.360576657 | 1.639423343 | 2.687708897 | 1.975418939 |
9 | 1 | 0.387883696 | 0.612116304 | 0.374686369 | 0.965976073 |
5.028402433 |
Since
Χ2= 5.028402433 < 15.507, the decision is not to reject
H0. There is insufficient evidence to conclude that the number of goals scored per game does not fit a Poisson distribution.
https://www.uam.es/personal_pdi/ciencias/anabz/Prest/Trabajos/Critical_Values_of_the_Chi-Squared_Distribution.pdf
Using Online Calculator -_- http://www.socscistatistics.com/pvalues/chidistribution.aspx
Χ2
= 5.028402433 = Chi-square score
Df=8
Sig level 0.05
P Value Results
Chi ^{2} =5.028402433 DF=8
The two-tailed P value equals 0.7545
By conventional criteria, this difference is considered to be not statistically significant.
The P-Value is 0.754536. The result is not significant at p > 0.05. Therefore, we accept
H0. Therefore, it is possible to say that there is a 75% chance that the distributions are similar. This is very significant due to the large amounts of data and so it can be said that the null hypothesis can be accepted and the distribution of total goals from the 2014/15 premiership football season does in fact follow a Poisson distribution. Consequently this backs up the idea that the Poisson distribution is the correct distribution for football data and will provide a good model to use.
Chapter 3
3.1 Poisson Regression
The Poisson regression model is used as the “standard model for count data,” (Cameron and Trivedi, 1998) and is “derived from the Poisson distribution by allowing the intensity parameter
μto depend on covariates,” (Cameron and Trivedi, 1998). The parameter
μis represented by
λas explained in the previous section. The Poisson regression model has many applications and is used to model the number of occurrences of an event, such as the number of days a student is absent from university. The model is also used to model the rate of occurrences of an event, such as the rate of insurance claims an insurance company receives. As this report is focussing on modelling football data, the Poisson regression model will be modelling the number of goals scored in the English Premier League. The Poisson regression model is very similar to the normal linear regression model as
y=Xβ+εwhere the scalar variables
yand variable
Xare used to estimate the parameters
β. Although the main assumption of a linear regression model is that, the residual errors follow a normal distribution, as the football data consist of positive discrete count variables the distributions will most likely be positively skewed with many observations in the data set having a value of 0. This means that the distribution cannot be transformed to a normal distribution because of the high number of 0’s. Another reason the Poisson regression model is favoured over the linear regression model is that the linear regression model produces negative predicted values, which are theoretically impossible, (Theanalysisfactor.com, 2017).
Definition: Count Data
The counting process (
Nt;t>0), where
t>0the random variable
Ntis the aggregate number of arrivals up to and including time
t. The counting process (
Nt;t>0), is an infinite family of random variables where the counting random variable
Nt, for each
t>0, is the aggregate number of arrivals in the interval (0, t]. The random variable
N0is defined to be 0 with probability 1, which means that we are considering only arrivals at strictly positive times, (Gallager, 2017).
Caret Package
(Kuhn, 2008)
Kuhn, M. (2008). Building Predictive Models in R Using the. JSS Journal of Statistical Software, 28,(5).
The caret package, short for classification and regression training, contains numerous tools for predictive models using the rich set of models available in the statistical analysis programme R. The package focuses on simplifying model training and tuning across a wide variety of modelling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualisations.
The caret package was built with several goals in mind,
- To eliminate syntactical differences between many of the functions for building and predicting models
- To develop a set of semi-automated, reasonable approaches for optimizing the values of the tuning parameters for many of these models.
- To create a package that can easily be extended to parallel processing systems.
Some of the key features of the caret model are outlined below.
Generalised linear models are used when the variance is not constant and when the errors are not normally distributed, therefore we can consider using a generalised linear model when the response variable is count data that are not proportions (e.g. log-linear models of counts).
If I get time create the graphs shown on page 511.
For count data, where the response variable is an integer and there are often a lot of zeros in the data-frame, the variance may increase linearly with the mean.
A generalised linear model has three important properties;
- The error structure;
- The linear predictor;
- The link function;
These will be explained below.
The error structure for a generalised linear model has non-normal errors this can be due to variety of reasons. For example the data used has errors that cannot lead to negative fitted values (as in counts), like football scores which can only be positive values.
Definition: Error Structure
The error structure is defined by means of the family directive, used as part of the model formula. For example
glm(y ~ z, family=poisson)which means that the response variable has Poisson errors.
The structure of the model relates each observed y value to a predicted value. The predicted value is obtained by transformation of the value emerging from the linear predictor. The linear predictor,
η, is a linear sum of the effects of one or more explanatory variable
xj;
ηi=∑j=1pxijβj
Where the xs are the values of the p different explanatory variables, and the
βsare the unknown parameters to be estimated from the data. The right hand side of the equation above is called the linear structure.
One of the difficult things to grasp about the generalised liner models is the relationship between the values of the response variable and the linear predictor. The thing to remember is that the link function relates the mean value of y to its linear predictor.
η=g(μ)
The code used to implement the caret package can be found in the appendix.
Training the Caret Package
The train function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model and calculates a resampling based performance measure. Ther train function can be used to tune models by picking the complexity parameters that are associated with the optimal resampling statistics. For a particular model, a grid of parameters is created and the model is trained on slightly different data for each candidate combination of tuning parameters. Across each data set, the performance of held-out samples is calculated and the mean and standard deviation is summarised for each combination. The combination with optimal resampling statistic is chosen as the final model and the entire training set is used to fit a final model. Before the dataset can be trained it must be pre-processed. It is important to do this within the sample used to evaluate each model, to ensure that the results account for all the variability in the test.
Maximum Likelihood
Modelling association football scores is a journal, which I have previously reviewed. In this journal, Maher models an independent Poisson model for football scores. Maher also produces a unique set of parameters, which are derived using maximum likelihood estimators.
The probability density function for a random variable, y, conditioned on a set of parameters,
ϑ, is denoted
fy∥θ. This function identifies the data-generating process that underlies an observed sample of data and, at the same time, provides a mathematical description of the data that the process will produce. The joint density n independent and independent identically distributed observations from this process is the product of the individual densities,
fy1,…, y2∥θ=∏i=1nfy∥θ=Lθ∥y.
This joint density is the likelihood function, defined as a function of the unknown parameter vector,
θ, where y is used to indicate the collection of sample data. Note that we write the joint density as a function of the data conditioned on the parameters whereas when we form the likelihood function, we will write this function in reverse, as a function of the parameters, conditioned on the data. Though the two function are the same, it is to be emphasized that the likelihood function is written in this fashion to highlight our interest in the parameters and the information about them that is contained in the observed data. However, it is understood that the likelihood function is not meant to represent a probability density for the parameters. In this classical estimation framework, the parameters are assumed to be fixed constants that we hope to learn about from the data. It is usually simpler to work with the log of the likelihood function,
lnL(θ∥y)=∑i=1nlnfyi∥θ.
Again to emphasize our interest in the parameters, given the observed data, we denote this function
Lθ∥data=L(θ∥y). The likelihood function and its logarithm, evaluated at
θ, are sometimes denoted simply
L(θ)and
lnL(θ), respectively, or, where no ambiguity can arise, just L or ln L.
The principle of maximum likelihood provides a means of choosing an asymptotically efficient estimator for a parameter or set of parameters. The logic of this technique is easily illustrated in the setting of a discrete distribution. Consider a random sample of the following 10 observations from a Poisson distribution: 5, 0, 1, 1, 0, 3, 2, 3, 4 and 1. The density for each observation is,
fy∥θ=e-θθyiyi!.
Because the observations are independent, their joint density, which is the likelihood for this sample is,
y1,…, y10∥θ=∏i=110fy∥θ=e-10θθ20207, 360.
The last result gives the probability of observing this particular sample, assuming that a Poisson distribution with as yet unknown parameter
θgenerated the data. What value of
θwould make this sample most probable?
Consider maximizing
L(θ∥y)with respect
θ. Because the log function is monotonically increasing and easier to work with, we usually maximise the
L(θ∥y)instead; in sampling from a Poisson population,
lnL(θ∥y)=-nθ+lnθ∑i=1nyi-∑i=1nlnyi,
∂lnLθ∥y∂θ=-n+1θ∑i=1nyi=0⇒θML^
For the assumed sample of observations,
lnL(θ∥y)=-10θ+20lnθ-12.242
∂lnLθ∥y∂θ=-10+20θ=0⇒θ^=2
And
∂lnLθ∥y∂θ2=20θ2<0⇒this is a maximum
The reference to the probability of observing the given sample is not exact in a continuous distribution, because a particular sample has probability zero. Nonetheless, the principle is the same. The values of the parameters that maximise
Lθ∥dataor its log are the maximum likelihood estimates, denoted
θ^. The logarithm is a monotonic function, so the values that maximise
Lθ∥dataare the same as those that maximise in
Lθ∥data. The necessary condition for maximising in
Lθ∥datais,
∂lnLθ∥data∂θ=0
This is called the likelihood equation. The general result then is that the maximum likelihood estimate is a root of the likelihood equation. The application to the parameters of the dgp for a discrete random variable are suggestive that maximum likelihood is a “good” use of the data.
Maher Parameters
Maher assumes an independent Poisson model for the football scores. If team i is playing at home against team j and the observed score is
(xij,yij), we shall assume that
Xijis Poisson with mean
αiβj, that
Yijis also Poisson with mean
yiδj. Then we can think of
αias representing the strength of team i’s when playing at home,
βjthe weakness of team j’s defence when playing away,
yithe weakness of teams i’s defence when playing away,
δjthe strength of team j’s attack when playing away.” Therefore, in a league where there are twenty-two teams there are eighty-eight such parameters, equalling four parameters per team. Since each match is considered as two separate games X and Y are assumed to be independent,
αand
βwill be estimated from X and,
yalong with
δwill be estimated from Y. Maher does the following; in order to produce a unique set of parameter the constraints
∑iαi=∑iβi
may be imposed. In the same way the constraint
∑iyi=∑iδi
May be imposed and so only 86independant parameters need to be specified. Since the X and Y are assumed to be independent of each other(representing separate “games” at the two ends of the pitch), the estimation of
αand
βwill be entirely from the x and the estimation of the y and
δby means of the y alone.
For the home teams’ scores, therefore, the log likelihood function is:
logLα,β=∑i∑j≠i(-αiβj+xijlogαiβj-log(xij!))
Therefore,
∂lnL∂αi=∑j≠i(-βi+xijαi)
And so the maximum likelihood estimates
α^,β^satisfy:
αi^=∑j≠ixij∑j≠iβj^
And,
βi^=∑j≠ixij∑j≠iαj^
An iterative technique, such as Newton-Raphson, enables these maximum likelihood estimates to be determined. The steps taken in the Newton-Raphson iterative technique for maximum likelihood will now be outlined.
Newton-Raphson for Maximum Likelihood Estimation
Let
X1,…,Xnbe n independent random variables such that the probability mass function of
Xiis
fi(x;θ)which depends on a vector
θof r parameters. Then the maximum likelihood estimator
θ^of
θis that value that maximises the log likelihood function,
lθ=∑i=1nlogfiXi;θ,
Or equivalently, minimise
Lθ=-lθ.Given an initial guess
θ0for
θ^, Newton-Raphson can be used to find the maximum likelihood estimate. The jth element of the gradient and the (j,k)th element of the Hessian are given by,
gj=∂L∂θj=-∑i=1n∂logfi(xi;θ)∂θj, j=1,…, r,
Hjk= ∂2L∂θj∂θk=-∑i=1n∂2logfi(xi;θ)∂θj∂θk, j,k=1,…., r.
To perform inferences about
θbased on
θ^, we can use the fact that,
θ^~ANθ,Vθ, Vθ=I-1θ,
Where the (j, k)th element of the information matrix
Iθis,
Ijkθ=-E∂2loglθdθjdθk=Ed2logLθdθjdθk=-∑i=1nE(∂2logfi(xi;θ)∂θj∂θk)
If in addition to being independent, the X’s are identically distributed with common density or mass function f, then al of the expectations in this last expression will be the same and we can write,
Ijkθ=nikθ=-nEd2logfX;θdθjdθk
Where the matrix
i(θ)is called the information in a single observation.
It is often difficult to evaluate the expectations in i(
θ)so a natural procedure is to approximate them by their sample averages, which from,
Hjk= ∂2L∂θj∂θk=-∑i=1n∂2logfi(xi;θ)∂θj∂θk, j,k=1,…., r.
Are,
ijk^θ=-1n∑i=1n∂2logfxj;θ∂θj∂θk=1nHjkθ,
And so a popular method of finding standard errors of
θ^is to use covariance matrix
H-1(θ^), that is, the inverse of the Hessian matrix at the last Newton-Raphson iteration.
Results
Using the caret package to predict football results is straight forward.
Conclusion
To make the model more accurate more years of data could be used
Combining both mle and caret model would be benficial mathematically.
Poisson Regression References
Gallager, R. (2017). Stochastic Processes. [online] Available at: http://faculty.neu.edu.cn/ise/songxiaoshi/book/Stochastic_Process.pdf [Accessed 13 Feb. 2017].
Theanalysisfactor.com. (2017). Regression Models for Count Data. [online] Available at: http://www.theanalysisfactor.com/regression-models-for-count-data/ [Accessed 9 Feb. 2017].
Neal, D. (2017). Poisson Arrival Process. [online] Available at: http://people.wku.edu/david.neal/540/Poisson.pdf [Accessed 14 Feb. 2017].
Mckenzie, C. (2017). Discrete Data | iSixSigma. [online] Isixsigma.com. Available at: https://www.isixsigma.com/dictionary/discrete-data/ [Accessed 14 Feb. 2017].