FC7P01NI Dissertation
MSc Project
Topic: Result Prediction of the 2020/21 English Premier League
Season Using Machine Learning Algorithms
FC7P01NI MSc Project
Abstract
This dissertation is written as part of the MSc program in Data Science at London Metropolitan
University. The purpose of this paper is to predict the results of the 2020/21 season of the English
Premier League. Several studies have addressed results prediction for previous English Premier
League seasons. However, no significant work has been done on results prediction for the 2020/21
season. To predict the results, I used logistic regression, support vector machines and k-nearest
neighbors. An exploratory data analysis was also conducted to gain further insight
into the datasets. I also estimated the percentage chance that a match will be won by the home team,
won by the away team, or drawn. My results support the use of logistic regression for predicting
English Premier League matches. My results also indicate that Manchester City have the highest chance of
winning the Premier League title for the 2020/21 season.
Keywords: machine learning, logistic regression, support vector machines, k-nearest
neighbors, multiclass classification, sports results prediction.
Chapter 1: Introduction
1.1 Background
Football, sometimes referred to as soccer, is one of the most popular sports in the world. It is played,
watched, and enjoyed by billions of people worldwide. According to the statistics published by
TopTrend, “there are 3.5 billion football fans across Europe, Asia, Africa and America”
(TopTrendSports, 2014). It has been ranked as the world’s most popular sport.
As of 2018, 80 percent of people in the UAE declared themselves either interested or very
interested in football (Oregonreigns, 2021).
Football has many competitions. One of the most popular is the English Premier
League (Pilger, 2014). The English Premier League (EPL) is the most watched league on the planet, with
one billion fans spread across 188 countries (Leauge, 2018). The league held its first season in 1992-93,
when it was composed of twenty-two clubs (Leauge, 2018). The number of clubs was
reduced to twenty in 1995 (France, 2020), and twenty clubs play in the EPL at present.
With the advancement of technology, analytics has become popular in recent years. As stated in
an article by Fingent, “Analytics have completely disrupted the way organizations go about with their
business by using one commodity that is data.” (Fingent, 2020). Analytics has been used in sports
too. While the theory of sports analytics might have been around since the 1980s, it was hugely
popularized by Billy Beane, the general manager of an American baseball team (Phillips, 2020).
Written by Michael Lewis, ‘Moneyball: The Art of Winning an Unfair Game’ was the first book on
sports analytics (Soccerment Research, 2019). Match analysis in football has increased
over the years. “Match Analysis in football has really come to prominence over the last 5 years. The
primary reason being the accessibility of football data and the growing analytics community behind
it.” (Eccles, 2017).
Football clubs also seek a competitive edge on and off the pitch, and big data allows them
to extract the insights they need (Business, 2018). These insights have helped to improve players’ average
statistics, such as the number of goals they score, the number of fouls they commit, the number of red
and yellow cards they receive, and many more.
According to an article by Intel, “Philippe Coutinho scored a free kick against Barcelona in December
2017 by firing his shot underneath the jumping defensive wall, Jürgen Klopp credited his analytics for
pointing out the opportunity” (BeSoccer, 2018).
In 2014, Bing correctly predicted the outcomes of all fifteen games in the knockout round of the
2014 World Cup (Nisen, 2018), an accuracy of 100 percent for that round. As suggested
by the Guardian, “Manchester United, a team of English Premier League, have won against Swansea,
through short and long passes” (Jackson, 2018).
Kaggle, an online community of data scientists and machine learning practitioners, hosts a yearly competition
called “March Madness”, in which many data scientists gather to predict the winners and losers of games
(Kaggle, 2021).
1.2 Goal of the project
With so much money and emotion invested in the outcomes of professional matches, the English
Premier League attracts its share of urban myths and armchair managers.
Having been a fan of Liverpool for years, I am excited to bring the power and insight of
machine learning to bear on English Premier League matches.
One of my goals in this thesis is to predict the winning team. I will use machine learning
techniques to predict the full-time result (FTR) of each match.
I will predict FTR by evaluating the three possible outcomes: home team win, draw and
away team win. Because the outcome takes one of three discrete values, predicting FTR can be categorized as a
multiclass classification problem.
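The three-class framing above can be illustrated with a minimal sketch. It assumes that FTR is derived from the full-time score and encoded with the labels 'H', 'D' and 'A'; these labels, the function names and the scorelines are illustrative assumptions, not details taken from the datasets used in this project.

```python
from collections import Counter


def full_time_result(home_goals: int, away_goals: int) -> str:
    """Map a full-time score to one of three class labels.

    The codes 'H', 'D' and 'A' are an assumed encoding for illustration.
    """
    if home_goals > away_goals:
        return "H"  # home team win
    if home_goals < away_goals:
        return "A"  # away team win
    return "D"  # draw


def outcome_shares(scores):
    """Return the fraction of matches that fall into each of the three classes."""
    labels = [full_time_result(home, away) for home, away in scores]
    counts = Counter(labels)
    return {label: counts.get(label, 0) / len(labels) for label in ("H", "D", "A")}


# Made-up scorelines (home goals, away goals), for illustration only.
example_scores = [(2, 1), (0, 0), (1, 3), (4, 0)]
shares = outcome_shares(example_scores)
print(shares)
```

Because every match receives exactly one of the three labels, the shares always sum to one, which is the property that makes this a single multiclass target rather than three separate binary ones.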
Since it is a multiclass classification problem, I analyzed different machine learning techniques
such as logistic regression, support vector machines and k-nearest neighbors.
I then divided the data into two parts: the feature set and the target variable. Every column except FTR is
a feature considered for training the model.
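The split into a feature set and a target variable can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used in this project: the column names other than FTR, the match rows and the helper functions are hypothetical, and the k-nearest neighbors classifier is written in plain Python only to keep the example self-contained.

```python
import math
from collections import Counter

# Made-up rows for illustration; every column name except FTR is hypothetical.
matches = [
    {"home_shots": 15, "away_shots": 6, "FTR": "H"},
    {"home_shots": 14, "away_shots": 7, "FTR": "H"},
    {"home_shots": 9, "away_shots": 10, "FTR": "D"},
    {"home_shots": 10, "away_shots": 9, "FTR": "D"},
    {"home_shots": 5, "away_shots": 16, "FTR": "A"},
    {"home_shots": 6, "away_shots": 14, "FTR": "A"},
]


def split_features_target(rows, target="FTR"):
    """Separate every column except the target into X; the target goes into y."""
    feature_names = [name for name in rows[0] if name != target]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target] for row in rows]
    return feature_names, X, y


def knn_predict(X_train, y_train, query, k=3):
    """Predict by majority vote among the k training rows closest to the query."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


feature_names, X, y = split_features_target(matches)
prediction = knn_predict(X, y, [13, 7], k=3)
print(feature_names, prediction)
```

Keeping FTR out of the feature set matters because a model that sees the target among its inputs would simply reproduce it instead of learning from the remaining columns.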
1.3 Motivation
Data science is often used in football to evaluate a team’s performance and to use that
information to predict results (Bouley, 2020). The outcome of a football match is often difficult to
predict, as many factors influence the game (Punter, 2017). The
possible outcomes of a football match are win, lose or draw. It can therefore seem quite
straightforward to predict the outcome of a game. However, from 1992 to 2017, the average number of goals
scored per EPL game was less than 3 (Wright, 2021). With so few goals per game, a single goal can
decide the result, which makes outcomes hard to predict.
A potential solution to this problem is to explore in-game statistics and so dive deeper than the simple
match results. In the last few years, in-depth match statistics have been made available (Herbinet,
2018). Due to this, expected goals metrics have been developed in which the estimate of number of