A good understanding of the ground reality helps, but it is a mix of simple mathematics and statistics that lies at the heart of opinion polls.
How can the opinions of, say, 50,000 voters suffice to predict the electoral outcome in a country with over 710 million voters? Do opinion polls conducted ahead of the polling date have predictive power as far as the final results are concerned? These are questions that need answers.
We will see that simple mathematics and statistics, combined with lots of commonsense and a good understanding of the ground reality, or domain knowledge, can be very effective in predicting election results on the basis of a sample survey.
Brief mathematical background: Suppose an urn contains M balls, identical in all respects except colour, with K of them orange and the rest green. If the balls are mixed and one is drawn without looking, the chance of it being orange is K/M. This rests on the premise that each ball has an equal probability of being drawn; since K of the M balls are orange, the required probability is K/M.
Suppose M is given to be 10,000 and K is either 9,900 or 100 — either 9,900 are orange or 9,900 are green. One ball is drawn (after mixing, without looking) and its colour is found to be green. We are to make a decision about K — choose out of the two possibilities: K equals 100 or K equals 9,900. If K is 100, then the probability of drawing a green ball is 0.99, whereas if K is 9,900, then it is 0.01. Based on this, we can say that 9,900 balls are likely to be green. This is what commonsense tells us and can be justified in various ways. The story would not change if K is either 99,000 or 1,000 and M is 100,000. This is the only idea from probability theory or statistics that is needed to answer most of the questions, as we will see.
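The decision rule in this example is just maximum likelihood: pick whichever value of K makes the observed colour more probable. A minimal sketch, using the numbers from the example above:

```python
# Maximum-likelihood reading of the urn example: M = 10,000 balls, K (the
# number of orange balls) is either 100 or 9,900, and a single draw came
# up green. We pick whichever K makes that observation more likely.

def p_green(K, M):
    """Probability that one random draw from the urn is green."""
    return (M - K) / M

M = 10_000
candidates = [100, 9_900]                      # the two possible values of K
likelihoods = {K: p_green(K, M) for K in candidates}

best_K = max(candidates, key=lambda K: likelihoods[K])
print(best_K, likelihoods[best_K])             # 100 0.99
```

With K = 100 a green draw has probability 0.99, while with K = 9,900 it has probability only 0.01, so the draw points to K = 100, exactly as commonsense suggests.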
Now consider a constituency, say Chennai South. To make matters simple, suppose there are two candidates, Rajesh and Kavita, and that the election is not a close one, that is, the winner gets at least four percentage points more votes than the loser. Let us make lists of 4,001 voters at a time, with each list written on a sheet of paper. We can imagine that each sheet is marked with magic ink in red or blue, depending on whom the voters listed on that sheet prefer: Rajesh or Kavita. Thus, if 2,001 or more voters on a sheet prefer Rajesh, it is marked in red; otherwise in blue. The colour on the sheet is revealed when it is wet. Now it can be shown by counting arguments that over 99 per cent of the sheets are marked in one colour, the colour of the winning candidate. So if Kavita has the support of over 52 per cent of voters in the constituency, over 99 per cent of the sheets would be blue, and if Rajesh is supported by over 52 per cent, then over 99 per cent would be red.
Suppose we mix the sheets well and draw one at random. When we wet it (this is equivalent to visiting each of the 4,001 voters listed on that sheet and getting their opinion to decide the colour of the sheet), we are likely to observe the colour of the winning candidate. So if it is red, we can predict that Rajesh will win and if blue then we can say that Kavita will win. In either case, we will be correct with 99-per-cent probability.
By the way, this calculation did not need the value of M — namely, the total number of voters. It could be 500,000 or 5,000,000, and knowing the opinion of 4,001 voters would suffice to make a prediction with 99-per-cent accuracy (as long as it is not a close election). This seems counter-intuitive, but this is what comes out of simple counting.
Another way to say this is that most samples with size 4,001 are representative of the population and hence, if we select one randomly we are likely to end up with a representative sample.
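The 99-per-cent figure can be checked numerically. A rough sketch, using the normal approximation to the binomial (the 52 per cent support level and the sample size of 4,001 are from the example above):

```python
# Checking the 99-per-cent claim: if the winner has 52% support, how likely
# is a random sample of 4,001 voters to show a majority (2,001 or more) for
# the winner? The normal approximation to the binomial is accurate enough.
from statistics import NormalDist

n, p = 4001, 0.52                   # sample size, winner's true support
mean = n * p
sd = (n * p * (1 - p)) ** 0.5

# P(sample shows a majority for the winner), with a continuity correction
prob = 1 - NormalDist(mean, sd).cdf(2000.5)
print(round(prob, 3))               # about 0.994, i.e. over 99 per cent
```

Note that the total number of voters M never enters the calculation, which is precisely the counter-intuitive point made above.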
In colloquial English, the word random is also used in the sense of arbitrary (as in Random Access Memory, or RAM). So some people think of a random sample as any arbitrary subset. Randomness should be seen as a property of the process that selects the sample and not the sample itself.
Failure to select a random sample can lead to wrong conclusions. In 1948, all opinion polls in the U.S. predicted that Thomas Dewey would defeat Harry Truman in the presidential election. The problem was traced to the sampling methodology: telephone numbers were randomly generated and the subscribers called for their voting intentions. But in 1948 telephone ownership in the U.S. was far from universal, so the weaker sections of society were under-represented in the survey. Today, telephone penetration in the U.S. is almost universal and the method generally works there. It will not work in India even after the unprecedented growth of the telecom sector: a large number of the underprivileged still do not own a telephone, so a telephone survey will not yield a representative sample.
My view is that the statistical guarantee that the sample proportion and the population proportion do not differ significantly does not kick in unless the sample is chosen by randomisation, perhaps after suitable stratification. This costs a lot more than the quota sampling commonly used by market research agencies, but it is a must.
Following statistical methodology, one can get a fairly good estimate of the percentage of votes for the major parties in the country (or in a State), at least at the time the survey is conducted. However, the public interest is in the prediction of the number of seats, not the percentage of votes. It is possible, though extremely unlikely, even in a two-party system for a party ‘A’ with, say, 26 per cent of the votes to win 272 of the 543 seats (a majority) while the other party ‘B’, with 74 per cent, wins only 271: ‘A’ gets just over 50 per cent of the votes in 272 seats, winning them, while ‘B’ gets 100 per cent of the votes in the remaining 271. Thus, a good estimate of vote percentages does not automatically translate into a good estimate of the number of seats for the major parties.
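The arithmetic behind this extreme example can be checked directly, assuming for simplicity that every seat has the same number of voters (a share of just over 50 per cent in 272 of 543 seats comes to roughly a quarter of the national vote):

```python
# The extreme vote-to-seat example, assuming (hypothetically) that every
# seat has the same number of voters: party A takes just over half the
# votes in 272 of 543 seats and none elsewhere; party B takes the rest.
seats, a_wins = 543, 272
a_share = 0.505 * a_wins / seats    # A's national vote share
b_share = 1 - a_share
print(round(a_share * 100, 1), round(b_share * 100, 1))   # 25.3 74.7
```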
So, in order to predict the number of seats for parties, we need to estimate not only the percentage of votes for each party, but also the distribution of each party's votes across constituencies. And here, independents and smaller parties with influence across a few seats make the vote-to-seat translation that much more difficult. If we could get a random sample of size 4,001 in each of the 543 constituencies, we could predict the winner in each of them and would mostly be correct (in constituencies where the contest is not very close). But conducting a survey with more than 21 lakh respondents is very difficult: money, time and reliable trained manpower are all limited.
One way out is to construct a model of voter behaviour. While such a model can be built, estimating its parameters would itself require a very large sample. Another approach is to use past data in conjunction with opinion poll data. To do this, we need a suitable model of voting behaviour: not of individual voters, but of the percentage of votes for a party in a constituency.
To make a model, let us observe some features of Indian democracy. Voting intentions in India are volatile: they can undergo a big change in a matter of months, as Delhi showed across the March 1998 Lok Sabha, November 1998 Vidhan Sabha and October 1999 Lok Sabha elections. This is very different from the situation in the U.K., where voting intentions are stable across decades; methods used in the U.K. therefore cannot be used in India, even though the Indian political system superficially resembles the British one.
This is where domain knowledge plays an important role. A model involving human behaviour that works in the West may not work in the Indian context. Having all the data relating to elections in India since 1952 will not help either: large amounts of data cannot substitute for an understanding of the ground realities.
While the behaviour of voters in a constituency may be correlated with that in adjacent constituencies in the same State, the voting behaviour in one State has no correlation with that in another State. The behaviour is influenced by many local factors.
Socio-economic factors do influence voting patterns significantly. However, incorporating them directly in a model will require too many parameters. It is reasonable to assume that the socio-economic profile of most of the constituencies does not change significantly from one election to the next. So, while the differences in socio-economic profiles between two constituencies are reflected in the differences in voting pattern in a given election, the change from one election to the next in a given constituency does not depend on them.
So we make an assumption that the change in the percentage of votes for a given party from the previous election to the present is constant across a given State. The resulting model is not very accurate if we look at historical data, but is a reasonably good approximation — good enough for the purpose, namely, to predict the number of seats for major parties at the national level. The change in the percentage of votes is called swing. Under this model, all we need to do via sampling is to estimate the swing for each party in each State. Then, using past data we will have an estimate of percentage of votes for each party in each State.
We can refine this a little: divide the big States into regions and postulate that the swing in a seat is a convex combination of the swing across the State and the swing across the region.
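As a rough illustration, the refined swing model can be sketched as follows; the function name, the weight `lam` and all the numbers are hypothetical, chosen only to show the shape of the calculation:

```python
# Hypothetical sketch of the refined swing model: a party's predicted vote
# share in a seat is its share in the previous election plus a swing, where
# the swing mixes the state-wide and region-wide swings. The weight `lam`
# and the numbers below are illustrative, not estimated values.

def predicted_share(past_share, state_swing, region_swing, lam=0.5):
    """Convex combination: lam in [0, 1] weights the state-wide swing."""
    swing = lam * state_swing + (1 - lam) * region_swing
    return past_share + swing

# e.g. 38% last time, +4 points of state swing, +1 point of regional swing
print(predicted_share(38.0, 4.0, 1.0, lam=0.5))   # 40.5
```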
Predicting the winner: Here, one more element comes in. We need to predict the winner in each constituency and then give the number of seats for major parties. If in a constituency our predicted margin for the leading candidate is eight per cent, we will be a lot more confident about the leading candidate winning the seat than the situation where our predicted lead is just one per cent. So we translate this into the probability of victory for the two leading candidates.
The best case for the runner-up is that he actually has a slender lead and yet a sample of the given size shows him trailing by the observed margin. The probability of this event is assigned as the probability of victory for the runner-up, and one minus this as the probability of victory for the leading candidate. Adding up, over all seats, the probability of victory for a given party gives its expected number of seats. This method gives reasonable predictions at the national level.
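This rule can be sketched numerically. The sample size of 1,000, the shares and the normal approximation below are illustrative assumptions, not the pollster's actual procedure:

```python
# Sketch of the victory-probability rule: the runner-up's best case is a
# dead heat (true share 0.5), so we ask how likely a sample of size n is
# to still show the observed lead under that assumption.
from statistics import NormalDist

def victory_probabilities(n, leader_share):
    """leader_share: leader's share of the two-party vote in the sample."""
    sd = (0.25 / n) ** 0.5                  # s.d. of sample proportion at p = 0.5
    z = (leader_share - 0.5) / sd
    p_runner_up = 1 - NormalDist().cdf(z)   # chance of this lead in a tie
    return 1 - p_runner_up, p_runner_up

# An 8-point lead (54% vs 46%) in a sample of 1,000 is near-decisive;
# a party's expected seats is the sum of such probabilities over its seats.
p_lead, p_trail = victory_probabilities(1000, 0.54)
print(round(p_lead, 3), round(p_trail, 3))
```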
The crux of the matter is to get a random sample that is reasonably distributed across the country. The method we generally follow is to select about one-fifth of the constituencies for sampling. In the list of constituencies, contiguous constituencies occur together, and hence systematic or circular sampling is appropriate, as it gives an even spread across the country. We then get a list of polling booths in each chosen constituency and pick, say, four to six polling booths, again by circular random sampling. Finally, we get the voters' lists in these booths and pick 35 to 50 voters in each chosen booth by circular sampling. The enumerators then go door-to-door (three times if necessary) to get the opinion of the chosen persons.
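The multi-stage scheme rests on circular (systematic) sampling, which can be sketched as follows; the one-fifth figure follows the text, while the seat names and helper function are illustrative:

```python
# Minimal sketch of circular (systematic) sampling: take a random starting
# point, then every k-th item, wrapping around the end of the list. With
# contiguous constituencies listed together, this spreads the sample evenly.
import random

def circular_sample(items, sample_size):
    n = len(items)
    step = n // sample_size                  # sampling interval
    start = random.randrange(n)              # random starting point
    return [items[(start + i * step) % n] for i in range(sample_size)]

constituencies = [f"seat-{i}" for i in range(543)]   # placeholder names
chosen = circular_sample(constituencies, 109)        # about one-fifth of 543
print(len(chosen))                                   # 109
```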
If we conduct this opinion poll before the voting begins (so as to publish the results two days before the first phase starts), there is quite a gap between our poll and the actual voting. In India there seems to be a lot of churning of electoral preferences as the voting day comes closer, and this introduces an error into any prediction based on an opinion poll. Some pollsters claim to correct for this effect: they conduct what is called a tracking poll, where polls are run every week for, say, six to eight weeks before the election, and the trend is extrapolated to predict what will happen on election day. However, the churn nearer to the voting day is much greater than in previous weeks, so this method is not very satisfactory. Another problem with polls conducted before the voting day is that the best of methodology can only measure the mood of all the voters, while what matters is the segment that actually goes to the polling booth to vote. These two factors put a big question mark over predictions based on opinion polls conducted before the voting day.
Both these problems are addressed by an exit poll, where we interview respondents as they exit the polling booth. However, in an exit poll we cannot interview respondents from a previously generated list. We can at best choose polling booths via multi-stage circular sampling and then give the enumerator a rule of thumb, such as: interview every 10th voter. This is likely to introduce some bias into the sample.
Multi-phase polls seem to have become the norm in India, and we may publish our findings only at the end of the last phase. So for the last several years we have conducted a proper randomised poll in the constituencies covered by all the phases except the last, and an exit poll in the last phase. This method has given reasonable predictions in several State elections and in the 2009 Lok Sabha elections.
To sum up, the proper use of statistical techniques together with some domain expertise can give remarkable results. However, the media sometimes project the predictions as the truth, the whole truth and nothing but the truth.
It should be remembered that, most of the time, a methodologically proper poll will pick the correct winner, namely the party that wins the largest number of seats. The exact number of seats the various parties will win, however, sometimes eludes us.
Opinion polls also serve a much larger purpose than forecasting the final outcome: they give an insight into why people voted the way they did. If political parties used opinion polls as a feedback mechanism to gauge public opinion and acted accordingly, it would help the country.
(The author is the Director of the Chennai Mathematical Institute.)