How our election forecasting model works

Paul Fairie is a political scientist at the University of Calgary, where he studies voter behaviour. He designed an election-forecasting model with The Globe.

This model sets out to aggregate polling numbers from a variety of survey companies and then translate them into outcome probabilities. A poll that tells you the NDP are at 36 per cent with the Conservatives and Liberals tied at 30 per cent is certainly interesting, and it helps give a national picture of the popular vote, but there are other questions we want answered. What is the probability of the Bloc Québécois regaining official party status (12 seats) after the election? What are the chances that the top three parties will finish within 10 seats of one another? What's the probability of the incumbent being re-elected in your riding?

This model works by combining three ideas from election research and statistical analysis: poll aggregation, the concept of uniform swing and Monte Carlo simulations.

Poll aggregation

Aggregating polls provides a way to get a general sense of the picture while also reducing the reliance on a single poll. An individual survey, even with a perfectly random sample, still has error. The larger the sample size, the smaller this error is. Additionally, different polling firms and different polling techniques can be prone to other unintentional biases, depending on exactly how they run their polls. For instance, if they call only during business hours, they might be more likely to oversample retired Canadians.

If these biases don't all point in a single direction (for instance, they don't all overrate the chances of the NDP), then combining polls can wash out some of this effect. Aggregation also increases the sample size, and therefore reduces the sampling error.
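To make the sample-size argument concrete, here is a minimal Python sketch (the model itself is written in R). It assumes polls are pooled by weighting each one by its sample size, and it uses the textbook margin-of-error formula for a simple random sample. The function names and the three polls are made up for illustration; the article does not specify the exact weighting the model uses.

```python
import math

def pool_polls(polls):
    """Pool polls by weighting each one by its sample size.

    `polls` is a list of (sample_size, {party: share_in_per_cent}) pairs.
    Returns (pooled_shares, total_sample_size).
    """
    total_n = sum(n for n, _ in polls)
    parties = polls[0][1].keys()
    pooled = {
        party: sum(n * shares[party] for n, shares in polls) / total_n
        for party in parties
    }
    return pooled, total_n

def margin_of_error(share_pct, n, z=1.96):
    """Margin of error in percentage points, 19 times out of 20 (z = 1.96),
    for a simple random sample of size n."""
    p = share_pct / 100
    return z * math.sqrt(p * (1 - p) / n) * 100

# Made-up polls, for illustration only.
polls = [
    (1000, {"NDP": 36.0, "CPC": 30.0, "LPC": 30.0}),
    (1500, {"NDP": 33.0, "CPC": 32.0, "LPC": 31.0}),
    (2500, {"NDP": 34.0, "CPC": 31.0, "LPC": 31.0}),
]
pooled, total_n = pool_polls(polls)
print(pooled, total_n)

# The pooled sample has a smaller sampling error than any single poll.
print(margin_of_error(pooled["NDP"], 1000))
print(margin_of_error(pooled["NDP"], total_n))
```

The formula covers only sampling error. It says nothing about the firm-specific biases described above, which is why pooling helps most when those biases point in different directions.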

Uniform swing

Uniform swing was initially developed largely to forecast elections in the United Kingdom. The idea is that, roughly speaking, the swing (the change in one party's vote share) for a region or nation will be similar in each of its individual ridings.

Imagine that the region of Fake Province voted 55 per cent Conservative and 45 per cent Liberal in the last election, and the specific riding of Somewhere Centre voted 70 per cent Conservative and 30 per cent Liberal. Looking at current polls in Fake Province, we can see that it's an even 50-50 split in the vote share. Since the Conservatives are down 5 points and the Liberals are up 5 in Fake Province compared to last time, a regional swing model would project that the same thing would happen at the riding level, and we'd predict that the riding of Somewhere Centre would be a 65 to 35 per cent victory for the Conservatives.
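The Fake Province arithmetic can be written as a short Python sketch. The function name is hypothetical, and two details are assumptions rather than anything the article specifies: a projected share is floored at zero, and the shares are rescaled to sum to 100 per cent.

```python
def apply_uniform_swing(riding_last, region_last, region_now):
    """Shift each party's riding share by the regional change, in percentage points."""
    projected = {}
    for party, share in riding_last.items():
        swing = region_now[party] - region_last[party]
        # Assumption: a projected share cannot fall below zero.
        projected[party] = max(share + swing, 0.0)
    # Assumption: rescale so the projected shares sum to 100 per cent.
    total = sum(projected.values())
    return {party: 100 * share / total for party, share in projected.items()}

# The example from the text: Fake Province and the riding of Somewhere Centre.
region_last = {"CPC": 55.0, "LPC": 45.0}
region_now = {"CPC": 50.0, "LPC": 50.0}
somewhere_centre = {"CPC": 70.0, "LPC": 30.0}

projection = apply_uniform_swing(somewhere_centre, region_last, region_now)
print(projection)  # {'CPC': 65.0, 'LPC': 35.0}
```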

In real life this makes a lot of sense. Often swings in one province are found, more or less, province-wide. Think of the last federal election: scores of NDP candidates were elected to the House of Commons, seemingly on a regional swing towards the party, rather than because of individual efforts in their own ridings.

This isn't to say that local candidates aren't important (political science research suggests that the local candidate can sway 5 to 10 per cent of the vote), but just that wider trends are more important. To borrow a phrase from economics: a rising tide lifts all boats.

Monte Carlo simulations

One application of Monte Carlo simulations (named after the casino in Monaco) is to use random numbers to estimate how a system with rules might play out. For elections, we can use polling numbers, other known factors and the rules of the electoral system to estimate the probabilities of various outcomes.

Let's say you have three candidates polling at 34 per cent, 33 per cent and 32 per cent, and that this poll has a margin of error of 3 percentage points, 19 times out of 20. That means that, 19 times out of 20, the true support of a candidate polling at 34 per cent will be between 31 and 37 per cent. What's the probability of each candidate winning? It's not obvious.

We can then use a random number generator with a mean (or average) and an associated error value to generate random numbers for this fictional example.

Here's the result of 20 random simulations normalized to 100 per cent using the overall polling numbers as the mean value, and the error described above.

Simulation #	Candidate 1	Candidate 2	Candidate 3	Winner?
1	32.1	34.2	33.7	Candidate 2
2	33.3	34.7	32.1	Candidate 2
3	35.0	33.0	32.0	Candidate 1
4	32.3	34.4	33.3	Candidate 2
5	35.4	32.1	32.6	Candidate 1
6	34.2	32.5	33.3	Candidate 1
7	34.5	31.7	33.8	Candidate 1
8	34.5	32.9	32.6	Candidate 1
9	34.8	32.2	33.0	Candidate 1
10	35.9	33.9	30.1	Candidate 1
11	36.0	34.2	29.8	Candidate 1
12	34.0	34.4	31.6	Candidate 2
13	36.9	30.4	32.6	Candidate 1
14	34.0	32.2	33.8	Candidate 1
15	33.0	34.4	32.6	Candidate 2
16	33.9	33.0	33.1	Candidate 1
17	33.2	33.1	33.8	Candidate 3
18	33.8	32.6	33.6	Candidate 1
19	35.4	32.1	32.5	Candidate 1
20	32.7	33.5	33.7	Candidate 3

Of the 20 simulations, 13 were won by candidate 1, five were won by candidate 2 and two were won by candidate 3. We could then say that there's a 65 per cent chance of candidate 1 winning the election, a 25 per cent chance of candidate 2 triumphing and a 10 per cent chance of candidate 3 being victorious.
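The same exercise can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model's actual code: each candidate's support is drawn independently from a normal distribution, the margin of error of 3 points is converted to a standard deviation by dividing by 1.96, and each draw is rescaled to 100 per cent. The function name and the seed are hypothetical.

```python
import random

def simulate_winners(means, margin_of_error, n_sims=10_000, seed=1):
    """Estimate each candidate's chance of winning by repeated random draws."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    sd = margin_of_error / 1.96    # "19 times out of 20" is about 1.96 standard deviations
    wins = [0] * len(means)
    for _ in range(n_sims):
        # Draw one simulated vote share per candidate (never below zero).
        draws = [max(rng.gauss(mean, sd), 0.0) for mean in means]
        # Normalize the draws so they sum to 100 per cent.
        total = sum(draws)
        shares = [100 * d / total for d in draws]
        # The candidate with the largest share wins this simulation.
        wins[shares.index(max(shares))] += 1
    return [w / n_sims for w in wins]

probs = simulate_winners([34, 33, 32], margin_of_error=3)
for number, p in enumerate(probs, start=1):
    print(f"Candidate {number}: {p:.1%}")
```

With only 20 runs, as in the table, the estimated probabilities are quite noisy. Running thousands of simulations gives steadier estimates.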

The two key inputs to a Monte Carlo simulation in this style are a mean value and an error value. For this model, we've already covered the mean value: we use the result of a uniform swing. For some countries it is possible to use a uniform national swing. In Canada, however, it's better to use a regional swing, since regionalism plays a very important role in our political culture. Because of this, we divide the country into six regions: B.C.; Alberta; Manitoba and Saskatchewan; Ontario; Quebec; and Atlantic Canada.

The error term is first built from the error associated with the sample size: the bigger the sample, the smaller the error. However, there are other sources of error that we need to accommodate. We know that uniform regional swing doesn't work perfectly: not every riding in a province or region will swing exactly the same way. We therefore build some extra "uniform swing error" into the model. This number is arrived at by looking at how wrong predictions would have been in past elections under the assumption of uniform swing.

There is also polling bias. Sometimes respondents to polls are reluctant to reveal their true preferences, especially for incumbents. At other times, pollsters accidentally reach more of one group than another, or the people who respond to the survey are not representative of the population at large.

There is also error caused by time: the further we are from an election, the less accurate the polls are about the results of the vote. The model already takes some of this into account by weighting newer polls more heavily than older ones, but it also builds in some extra error the further away we are from the vote.
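The article does not say how these error sources are combined, so the following Python sketch shows just one common approach. It assumes the sources are independent, in which case their variances add, and it assumes a simple half-life rule for weighting newer polls more heavily. Every function name, parameter and number here is hypothetical.

```python
import math

def recency_weight(days_old, half_life_days=7.0):
    """Weight for a poll that halves every `half_life_days` (assumed decay rule)."""
    return 0.5 ** (days_old / half_life_days)

def total_error(sampling_sd, swing_sd, bias_sd, days_to_election, time_sd_per_day=0.02):
    """Combine error sources, in percentage points, assuming they are independent.

    Independent errors add in variance, so the total is the square root
    of the sum of squares. The time error is assumed to grow linearly
    with the number of days left before the vote.
    """
    time_sd = time_sd_per_day * days_to_election
    return math.sqrt(sampling_sd**2 + swing_sd**2 + bias_sd**2 + time_sd**2)

# A poll taken today counts fully; a week-old poll counts half as much.
print(recency_weight(0), recency_weight(7))

# Made-up error sizes: the total grows as the election gets further away.
print(total_error(1.5, 4.0, 1.0, days_to_election=0))
print(total_error(1.5, 4.0, 1.0, days_to_election=60))
```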

The model (written in R, a programming language) then applies a uniform regional swing to every riding using the transposed results from 2011 and the latest polling data, estimates the total error in the estimate for each party in each riding as described above, and then simulates 1,000 elections using the predicted vote share and the expected error.
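Those steps can be sketched end to end for a single region. This is a simplified Python illustration, not the model's R code: the ridings other than Somewhere Centre, the single error value shared by every party and riding, and all function and variable names are assumptions made for the example.

```python
import random
from collections import Counter

def simulate_elections(ridings, region_last, region_now, error_sd, n_sims=1000, seed=2015):
    """Simulate `n_sims` elections in one region; return one seat tally per simulation."""
    rng = random.Random(seed)
    # Step 1: the regional swing for each party, in percentage points.
    swing = {party: region_now[party] - region_last[party] for party in region_now}
    results = []
    for _ in range(n_sims):
        seats = Counter()
        for last_result in ridings.values():
            # Step 2: predicted share = last result + swing, plus random error.
            draws = {
                party: rng.gauss(share + swing[party], error_sd)
                for party, share in last_result.items()
            }
            # Step 3: first past the post, so the largest share wins the seat.
            seats[max(draws, key=draws.get)] += 1
        results.append(seats)
    return results

# Made-up previous results for three ridings in Fake Province.
ridings = {
    "Somewhere Centre": {"CPC": 70.0, "LPC": 30.0},
    "Somewhere North": {"CPC": 52.0, "LPC": 48.0},
    "Somewhere South": {"CPC": 40.0, "LPC": 60.0},
}
results = simulate_elections(
    ridings,
    region_last={"CPC": 55.0, "LPC": 45.0},
    region_now={"CPC": 50.0, "LPC": 50.0},
    error_sd=5.0,
)
mean_lpc_seats = sum(seats["LPC"] for seats in results) / len(results)
print(f"Average Liberal seats out of 3: {mean_lpc_seats:.2f}")
```

A national version would repeat the same loop for each region, using that region's own polls and swing.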

What we can do

We can then look at the data from the 1,000 simulated elections and answer more specific questions than we could by just looking at polls. For instance, what are the odds of all the major party leaders winning their seats? What are the chances that the Conservatives will win the most seats, but the NDP and Liberals together will have more than 200? Does the Green Party have a chance to win more than one riding?
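Answering such questions amounts to counting how often a condition holds across the simulated elections. The Python sketch below uses five made-up seat tallies in place of the 1,000 simulations; the helper names and the numbers are purely illustrative and are not output from the model.

```python
def probability(simulations, condition):
    """Share of simulated elections in which `condition` holds."""
    return sum(1 for seats in simulations if condition(seats)) / len(simulations)

# Made-up seat tallies from five simulated elections (illustration only).
simulations = [
    {"CPC": 130, "NDP": 120, "LPC": 75, "BQ": 12, "GPC": 1},
    {"CPC": 125, "NDP": 128, "LPC": 80, "BQ": 4, "GPC": 1},
    {"CPC": 140, "NDP": 105, "LPC": 78, "BQ": 13, "GPC": 2},
    {"CPC": 118, "NDP": 132, "LPC": 85, "BQ": 2, "GPC": 1},
    {"CPC": 128, "NDP": 112, "LPC": 93, "BQ": 4, "GPC": 1},
]

def bloc_regains_status(seats):
    # Official party status requires 12 seats.
    return seats["BQ"] >= 12

def conservatives_first_but_opposition_over_200(seats):
    return seats["CPC"] == max(seats.values()) and seats["NDP"] + seats["LPC"] > 200

def greens_win_more_than_one(seats):
    return seats["GPC"] > 1

p_bloc = probability(simulations, bloc_regains_status)
p_split = probability(simulations, conservatives_first_but_opposition_over_200)
p_green = probability(simulations, greens_win_more_than_one)
print(p_bloc, p_split, p_green)  # 0.4 0.2 0.2
```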

By running these simulations and then analyzing the data, we can get a much more nuanced – and hopefully more useful – view of the race than if we just look at individual polls.
