This article is part of a multipart series. The end goal is to write a football score simulation program.

The first place to start is creating a statistical model. From the statistical model, we can emulate matches.

We're going to weight the football teams using the Elo rating system. The reason is that it's easy to find Elo ratings for real-life football teams on websites like clubelo.com, and there is a similar website which provides ratings for national teams.

The Elo rating system will need a few extensions before we can use it to properly model the outcome of football matches. First of all, Elo only predicts wins and losses, not draws. That makes sense for a general-purpose rating system, as the odds of a draw depend heavily on the game being played.
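For reference, the standard Elo expected score can be sketched in a couple of lines of Haskell. The name expectedScore is my own and isn't part of the code developed later in this article. Note that the expected score counts a draw as half a win, so on its own it can't tell us how likely a draw is.

```haskell
-- Standard Elo expected score for a side rated `ratingA`
-- against an opponent rated `ratingB`.
-- A draw counts as half a win, so this is P(win) + 0.5 * P(draw).
expectedScore :: Double -> Double -> Double
expectedScore ratingA ratingB =
    1 / (1 + 10 ** ((ratingB - ratingA) / 400))
```

Two equally rated sides each get an expected score of 0.5, and the two expected scores of any pairing always sum to 1.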

Another goal of our statistical model will be to predict the scores of football matches. This is another big challenge as it goes well beyond what Elo provides.

Although in later articles I'm going to be using OCaml for the implementation, in this article I'll be using Haskell as a tool to help me quickly evaluate results and R to do the statistical calculations.

Modelling Wins

The first task is to create an expected value function for wins. We're going to base our model on the results from the 2017/18 Premier League season, shown below.

Premier League 2017/18

Team W D L Rating
MCI 32 4 2 1316
MUN 25 6 7 1189
TOT 23 8 7 1168
LIV 21 12 5 1168
CHE 21 7 10 1116
ARS 19 6 13 1063
BUR 14 12 12 1021
EVE 13 10 15 979
LEI 12 11 15 968
NEW 12 8 18 937
CRY 11 11 16 947
BOU 11 11 16 947
WHU 10 12 16 937
WAT 11 8 19 916
BHA 9 13 16 926
HUD 9 10 19 895
SOU 7 15 16 905
SWA 8 9 21 863
STK 7 12 19 874
WBA 6 13 19 863

I'm using this formula to calculate the Elo ratings:

f (wins, draws, losses) = (total_elo + 400 * (wins - losses)) / games

In our case every team plays 38 games and total_elo / games is taken to be 1000, so this reduces to:

f (wins, draws, losses) = 1000 + (400 / 38) * (wins - losses)
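As a quick check of the arithmetic, here is a minimal sketch of both forms of the formula. The names performanceRating and seasonRating are made up for this example, and it assumes total_elo is 38 * 1000, which is what the reduced formula implies.

```haskell
-- General form: (total_elo + 400 * (wins - losses)) / games.
performanceRating :: Double -> Int -> Int -> Int -> Double
performanceRating totalElo wins losses games =
    (totalElo + 400 * fromIntegral (wins - losses)) / fromIntegral games

-- Reduced form for a 38-game season.
-- Assumption: total_elo is 38 * 1000, as implied by the reduced formula.
seasonRating :: Int -> Int -> Int
seasonRating wins losses =
    round (performanceRating (38 * 1000) wins losses 38)
```

For example, seasonRating 32 2 gives 1316 and seasonRating 6 19 gives 863, matching the MCI and WBA rows of the table.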

A single rating per team is no good, though. There is always a home advantage (or an away disadvantage) in football, and we want to take that into account when analysing the results. So we'll build a home table and an away table, and from those calculate each team's home rating and away rating in the same way as before.

Past Matches

Here are all the results for the 2017/18 Premier League season in a format that we can work with in Haskell:

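-- (home team, away team, home goals, away goals, is the first team at home?)
type Match = (String, String, Int, Int, Bool)

matches :: [Match]
matches = [
    ("ARS", "BOU", 3, 0, True),
    ("ARS", "BHA", 2, 0, True),
    ("ARS", "BUR", 5, 0, True),
    ...
    ("WHU", "WBA", 2, 1, True)]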

The full listing can be found in the football.lhs file, along with all code from this article.

Calculating Home And Away Tables

newtype Record = Record (Int, Int, Int) -- (Wins, Draws, Losses)

isWin :: Match -> Bool
isWin (_, _, homeGoals, awayGoals, _) = homeGoals > awayGoals

isDraw :: Match -> Bool
isDraw (_, _, homeGoals, awayGoals, _) = homeGoals == awayGoals

isLoss :: Match -> Bool
isLoss (_, _, homeGoals, awayGoals, _) = homeGoals < awayGoals

getTeam :: Match -> String
getTeam (homeTeam, _, _, _, _) = homeTeam

Now we want to create a home table and an away table. We'll represent each one as an association list of type [(String, Record)]. It's easiest to start by mapping each result to its own (String, Record) pair. That gives an uncompressed list in which a team appears many times, each time with a Record that counts a single result.

matchToRecord :: Match -> (String, Record)
matchToRecord match
    | isWin  match = (getTeam match, Record (1, 0, 0))
    | isDraw match = (getTeam match, Record (0, 1, 0))
    | otherwise    = (getTeam match, Record (0, 0, 1)) -- a loss

We can create a list of records by doing map matchToRecord matches. But the question is, how do we fold this into a table?

We're going to use Haskell's Data.Map.Strict module and treat the table as a value of type Map String Record.

import qualified Data.Map.Strict as Map

To create a home table, it'll be best to start with an empty map Map.empty and fold the records into that. It'll be a map of type Map.Map String Record, as we want the team name (String) to be the key and the team's records (Record) to be the associated value.

We'll create a Monoid and Semigroup instance for records as that'll give us empty records and a nice function to combine them.

instance Monoid Record where

    mempty = Record (0, 0, 0)

instance Semigroup Record where

    (<>)
        (Record (winsA, drawsA, lossesA))
        (Record (winsB, drawsB, lossesB)) =
            Record (winsA + winsB, drawsA + drawsB, lossesA + lossesB)
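To see what the two instances give us, here is a small self-contained sketch that combines a handful of single-result records with mconcat. The sample list is made up, and the deriving clause is added only so that the result can be compared and printed.

```haskell
newtype Record = Record (Int, Int, Int) -- (Wins, Draws, Losses)
    deriving (Eq, Show)

instance Semigroup Record where
    Record (winsA, drawsA, lossesA) <> Record (winsB, drawsB, lossesB) =
        Record (winsA + winsB, drawsA + drawsB, lossesA + lossesB)

instance Monoid Record where
    mempty = Record (0, 0, 0)

-- Hypothetical sample: two wins, a draw and a loss.
sample :: [Record]
sample =
    [Record (1, 0, 0), Record (0, 1, 0), Record (1, 0, 0), Record (0, 0, 1)]

-- mconcat folds the list with (<>), starting from mempty.
total :: Record
total = mconcat sample
```

Here total is Record (2,1,1), and mconcat of an empty list is the empty record.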

Now we can use Map.fromListWith to create our table.

homeTable :: Map.Map String Record
homeTable = Map.fromListWith (<>) (map matchToRecord matches)
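If Map.fromListWith is unfamiliar, this tiny sketch shows how it merges repeated keys. It uses plain Int values and (+) rather than Record and (<>), and the pairs are made up.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical (team, points) pairs; values for repeated keys
-- are combined with (+).
points :: Map.Map String Int
points =
    Map.fromListWith (+) [("ARS", 3), ("BOU", 0), ("ARS", 1), ("BOU", 3)]
```

Map.toList points is [("ARS",4),("BOU",3)]. Our table works the same way, with (<>) adding the records together.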

Here is a more generic function for calculating tables from results:

createTable :: [Match] -> Map.Map String Record
createTable = Map.fromListWith (<>) . map matchToRecord

To get the away table, we'll need to use createTable with the away results.

awayMatch :: Match -> Match
awayMatch (homeTeam, awayTeam, homeGoals, awayGoals, isHome) =
    (awayTeam, homeTeam, awayGoals, homeGoals, not isHome)

Now by using this function, we can calculate the away table:

awayTable :: Map.Map String Record
awayTable = (createTable . map awayMatch) matches

We can even get the full table by first expanding each match into both its home and away versions:

homeAndAway :: Match -> [Match]
homeAndAway match = [match, awayMatch match]

Then flat mapping across the matches and applying it to createTable gives us the full Premier League table:

fullTable :: [Match] -> Map.Map String Record
fullTable = createTable . concatMap homeAndAway
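Here is a condensed, self-contained sketch of the whole pipeline run on three made-up fixtures (the teams AAA, BBB and CCC and their scores are invented, not real results). It folds isWin, isDraw and isLoss into matchToRecord to keep the example short.

```haskell
import qualified Data.Map.Strict as Map

type Match = (String, String, Int, Int, Bool)

newtype Record = Record (Int, Int, Int) deriving (Eq, Show)

instance Semigroup Record where
    Record (a, b, c) <> Record (x, y, z) = Record (a + x, b + y, c + z)

instance Monoid Record where
    mempty = Record (0, 0, 0)

-- The result from the point of view of the first team in the tuple.
matchToRecord :: Match -> (String, Record)
matchToRecord (team, _, goalsFor, goalsAgainst, _)
    | goalsFor > goalsAgainst  = (team, Record (1, 0, 0))
    | goalsFor == goalsAgainst = (team, Record (0, 1, 0))
    | otherwise                = (team, Record (0, 0, 1))

createTable :: [Match] -> Map.Map String Record
createTable = Map.fromListWith (<>) . map matchToRecord

awayMatch :: Match -> Match
awayMatch (homeTeam, awayTeam, homeGoals, awayGoals, isHome) =
    (awayTeam, homeTeam, awayGoals, homeGoals, not isHome)

-- Made-up fixtures, not real results.
fixtures :: [Match]
fixtures =
    [ ("AAA", "BBB", 2, 0, True)
    , ("BBB", "AAA", 1, 1, True)
    , ("AAA", "CCC", 0, 1, True)
    ]

homeT, awayT, fullT :: Map.Map String Record
homeT = createTable fixtures
awayT = createTable (map awayMatch fixtures)
fullT = createTable (concatMap (\m -> [m, awayMatch m]) fixtures)
```

AAA ends up with Record (1,0,1) at home and Record (1,1,1) overall. CCC never plays at home, so it is missing from homeT altogether.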

Tables To Ratings

The goal is now to use our table of records to create a lookup table of ratings.

Using the formula from before:

f (wins, draws, losses) = 1000 + (400 / 38) * (wins - losses)

A Haskell function can be created that works with our Record data type:

type Rating = Int

recordToRating :: Record -> Rating
recordToRating (Record (wins, draws, losses)) = rating where

    rating :: Rating
    rating = 1000 + round ((400 / fromIntegral totalGames) * diff)

    totalGames :: Int
    totalGames = wins + draws + losses

    diff :: Float
    diff = fromIntegral (wins - losses)

Now to map over all the values in our tables to get home and away ratings:

recordToRatingTable :: Map.Map String Record -> Map.Map String Rating
recordToRatingTable = Map.map recordToRating

homeRatings :: Map.Map String Rating
homeRatings = recordToRatingTable homeTable

awayRatings :: Map.Map String Rating
awayRatings = recordToRatingTable awayTable

Perfect! It's now possible to look up Manchester United's home rating with Map.lookup "MUN" homeRatings, which gives Just 1274.

TODO

The first feature that our model will predict is whether the match is a draw or not. The probability of a draw will depend upon the rating difference of the two teams.

If the match isn't a draw, then we will calculate if the match is a win for the home team or not. This is conditional probability: given not a draw, what is the probability of a win?
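Written out, the two-stage idea looks like this. The symbol d for the rating difference is my own notation, not something defined elsewhere in the article.

```latex
\begin{aligned}
P(\text{win} \mid d)  &= \bigl(1 - P(\text{draw} \mid d)\bigr)\, P(\text{win} \mid \text{not draw}, d) \\
P(\text{loss} \mid d) &= \bigl(1 - P(\text{draw} \mid d)\bigr)\, \bigl(1 - P(\text{win} \mid \text{not draw}, d)\bigr)
\end{aligned}
```

Together with P(draw | d), these three probabilities sum to 1, so fitting the draw model and the conditional win model is enough to describe all three outcomes.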

In order to fit our model to draw probability against rating difference, we must first classify each match as a draw or not a draw. We already have the isDraw function, so we just need to write a function which converts a match into a pair containing the rating difference and the feature (whether the match was a draw or not).

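matchesToDrawFeature :: [Match] -> [(Int, Bool)]
matchesToDrawFeature = map f where

    -- Pair the match's rating difference with whether it was a draw.
    f :: Match -> (Int, Bool)
    f match = (matchRatingDiff homeRatings awayRatings match, isDraw match)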

TODO

CODE

  • Rethink a lot of this

TODO

  • When match is not a draw (predicting win or loss)
  • Conditional probability
  • First filtering draws out of training data
  • Export to CSV
  • Performing regression on that data to find relationship between rating difference and chance of winning
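The draw-filtering step in the list above could be as simple as the following sketch. The name decisive and the sample fixtures are made up, and this is only one way of doing it.

```haskell
type Match = (String, String, Int, Int, Bool)

isDraw :: Match -> Bool
isDraw (_, _, homeGoals, awayGoals, _) = homeGoals == awayGoals

-- Keep only the matches that produced a winner, so that a model
-- trained on them estimates P(win | not draw).
decisive :: [Match] -> [Match]
decisive = filter (not . isDraw)

-- Made-up fixtures, not real results.
sampleFixtures :: [Match]
sampleFixtures =
    [ ("AAA", "BBB", 2, 0, True)
    , ("BBB", "AAA", 1, 1, True)
    , ("AAA", "CCC", 0, 1, True)
    ]
```

Applying decisive to sampleFixtures drops the 1-1 draw and keeps the other two matches.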

Match Preparation

Ultimately, we want to train our statistical model on the rating difference, with the match outcome as the random variable, and fit it to a logistic curve like the Elo expected result formula, but one that gives us the correct expected chance of winning for football matches.
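To make the target concrete, here is a sketch of the kind of curve I mean. The intercept and slope parameters are hypothetical placeholders for whatever the regression ends up finding, and winProbability and eloSlope are names invented for this example.

```haskell
-- Logistic curve in the rating difference `d`.
-- `intercept` and `slope` are hypothetical parameters to be fitted.
winProbability :: Double -> Double -> Double -> Double
winProbability intercept slope d =
    1 / (1 + exp (negate (intercept + slope * d)))

-- With an intercept of 0 and this slope, the curve is exactly the
-- standard Elo expected result formula 1 / (1 + 10 ** (-d / 400)).
eloSlope :: Double
eloSlope = log 10 / 400
```

Fitting the model then amounts to finding the intercept and slope that best match the football data, rather than taking the Elo values as given.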

For each match, we'll pair up the rating difference with whether the match was won or not (1 for a win, 0 for anything else).

First we'll need a better way of finding the home team and the away team of a match, since awayMatch swaps the order of the teams:

getHomeTeam :: Match -> String
getHomeTeam (homeTeam, _, _, _, True)  = homeTeam
getHomeTeam (_, homeTeam, _, _, False) = homeTeam

getAwayTeam :: Match -> String
getAwayTeam (_, awayTeam, _, _, True)  = awayTeam
getAwayTeam (awayTeam, _, _, _, False) = awayTeam

And a way of calculating the rating difference of a match:

import qualified Data.Maybe -- for Data.Maybe.fromMaybe

matchRatingDiff ::
    Map.Map String Rating -> Map.Map String Rating -> Match -> Rating
matchRatingDiff homeRatings awayRatings match = ratingDiff match where

    -- The difference is taken from the first team's point of view.
    ratingDiff :: Match -> Rating
    ratingDiff (_, _, _, _, True)  = homeRating - awayRating
    ratingDiff (_, _, _, _, False) = awayRating - homeRating

    -- Teams missing from a table fall back to the base rating of 1000.
    ratingOf :: String -> Map.Map String Rating -> Rating
    ratingOf team = Data.Maybe.fromMaybe 1000 . Map.lookup team

    homeRating :: Rating
    homeRating = ratingOf (getHomeTeam match) homeRatings

    awayRating :: Rating
    awayRating = ratingOf (getAwayTeam match) awayRatings

matchRatingDiffWithWin ::
    Map.Map String Rating -> Map.Map String Rating -> Match -> (Rating, Int)
matchRatingDiffWithWin homeRatings awayRatings match = (ratingDiff, win) where

    ratingDiff :: Rating
    ratingDiff = matchRatingDiff homeRatings awayRatings match

    -- 1 if the first team won, 0 for a draw or a loss.
    win :: Int
    win = if isWin match then 1 else 0

It's possible to map this across all the matches to get what we want:

allMatches :: [Match]
allMatches = concatMap homeAndAway matches

allMatchesDiffWithWin :: [(Rating, Int)]
allMatchesDiffWithWin = map ratingDiffWithWin allMatches where

    ratingDiffWithWin :: Match -> (Rating, Int)
    ratingDiffWithWin = matchRatingDiffWithWin homeRatings awayRatings

Writing To A CSV File

All that needs to be done now is to take each match's rating difference and win variable and convert them to a string that can be written as a CSV file. This is dead easy:

toCSV :: [(Rating, Int)] -> String
toCSV rows = "Rating,Win\n" ++ concatMap row rows where

    -- One CSV line per match.
    row :: (Rating, Int) -> String
    row (rating, win) = show rating ++ "," ++ show win ++ "\n"

To write to a file in Haskell you just have to call writeFile:

> writeFile "matches.csv" (toCSV allMatchesDiffWithWin)
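As a sanity check on the output format, here is a self-contained sketch that runs a version of toCSV on two made-up rows. The values in sampleCSV are invented, not taken from the real data.

```haskell
type Rating = Int

-- A header line followed by one "rating,win" line per row.
toCSV :: [(Rating, Int)] -> String
toCSV rows = "Rating,Win\n" ++ concatMap row rows where
    row (rating, win) = show rating ++ "," ++ show win ++ "\n"

-- Made-up rows: a win at +120 and a non-win at -35.
sampleCSV :: String
sampleCSV = toCSV [(120, 1), (-35, 0)]
```

sampleCSV is the string "Rating,Win\n120,1\n-35,0\n", which is a header line followed by one line per match.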