Judging the fudging of data

The available technologies for identifying data fudging are still inadequate to address all possible situations

August 08, 2022 12:15 am | Updated 05:39 pm IST

Gregor Mendel | Photo Credit: Special Arrangement

An editorial in Nature Genetics in January, ‘A very Mendelian year’, reminded us of the 200th birth anniversary of Gregor Mendel, the ‘father of modern genetics’, on July 20, 2022. Mendel’s legacy is intriguing. Between 1856 and 1863, he performed controlled crossing experiments on around 29,000 garden pea plants. He recorded observable characteristics, such as the shape and colour of the seeds and the colour of the flower, and formulated two principles of heredity. His seminal paper, ‘Experiments on Plant Hybridization’, was published in the Proceedings of the Brünn Society for Natural Science in 1866. Mendel, however, gained recognition only posthumously, after his paper was rediscovered in 1900 and championed by the British biologist William Bateson.

The issue of falsification

Importantly, in 1936, the eminent British statistician and geneticist Sir Ronald Fisher published a paper titled ‘Has Mendel’s Work Been Rediscovered?’ Reconstructing Mendel’s experiments, Fisher found the ratios of dominant to recessive phenotypes to be implausibly close to the expected ratio of 3:1. He argued that Mendel’s data agreed with the theory better than natural fluctuation would allow. “The data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations,” he concluded. Fisher’s criticism drew wide attention from about 1964, around the time of the centenary of Mendel’s paper, and numerous articles have been published on the Mendel-Fisher controversy since. The 2008 book, Ending the Mendel-Fisher Controversy, by Allan Franklin and others recognised that “the issue of the ‘too good to be true’ aspect of Mendel’s data found by Fisher still stands.” Fisher, of course, attributed the falsification to an unknown assistant of Mendel, and modern researchers also tend to give Mendel the benefit of the doubt.
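To make the ‘too good to be true’ reasoning concrete, here is a minimal sketch in Python. It is not Fisher’s actual computation, and the counts are hypothetical: under a true 3:1 ratio, chance alone should push observed counts some distance from expectation, so a chi-square statistic that is too small is itself suspicious.

```python
# A minimal sketch of the "too good to be true" reasoning, with hypothetical
# counts (not Mendel's actual data, nor Fisher's aggregate computation).
import numpy as np
from scipy.stats import chisquare, chi2

n = 600                                        # hypothetical number of plants
expected = np.array([0.75 * n, 0.25 * n])      # 3:1 dominant : recessive split
observed = np.array([451, 149])                # suspiciously close to 450:150

stat, p_right = chisquare(observed, expected)  # the usual test for a *poor* fit
p_left = chi2.cdf(stat, df=1)                  # chance of a fit this close or closer

print(f"chi-square = {stat:.4f}")
print(f"P(agreement this good by chance) = {p_left:.3f}")
# Fisher combined such statistics across all of Mendel's experiments and found
# the overall agreement with the 3:1 expectation far closer than chance allows.
```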

In fact, the 1982 book Betrayers of the Truth: Fraud and Deceit in the Halls of Science, by William Broad and Nicholas Wade, is a compendium of case histories of malpractice in scientific research. Data fudging in the scientific and social arenas is understandably more likely in today’s data-driven and data-obsessed world, and as a result, data and the conclusions drawn from them often lose their credibility. Data is expanding; so is fudged data.

In a paper published in 2016 in the Statistical Journal of the IAOS, two researchers estimated that about one in five surveys may contain fraudulent data. They presented a statistical test for detecting fabricated survey answers and applied it to more than 1,000 public data sets from international surveys to arrive at this worrying picture.

Also, Benford’s law says that in many real-life numerical data sets, the leading digits follow a fixed distribution: 1 appears as the first digit about 30% of the time, while 9 appears first less than 5% of the time. A data set that does not conform to Benford’s law is an indicator that something may be wrong. The U.S. Internal Revenue Service uses it to sniff out tax cheats, or at least to narrow the field and channel resources better.
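As a rough illustration, a Benford check can be as simple as comparing the observed leading-digit frequencies of a data set with log10(1 + 1/d). The numbers below are simulated stand-ins; a real audit would use actual figures such as reported amounts or expense claims.

```python
# A rough Benford's-law check on simulated data (real audits would use
# actual figures, e.g. reported amounts or expense claims).
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
data = 10 ** rng.uniform(0, 5, size=5000)      # log-uniform values in [1, 100000)

leading = np.array([int(str(x)[0]) for x in data])            # first digit of each value
observed = np.array([(leading == d).sum() for d in range(1, 10)])
expected = len(data) * np.log10(1 + 1 / np.arange(1, 10))     # Benford proportions

stat, p = chisquare(observed, expected)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")  # a tiny p flags departure from Benford
```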

Judging the fudging is not easy, though. The available technologies for identifying data fudging are still inadequate to address all possible situations. Several procedures for testing the randomness of data exist, but at best they can only cast doubt on the data; it is difficult to conclusively establish fudging in most cases. Data may, of course, be non-random because of extreme inclusion criteria or inadequate data cleaning. And remember that a real data set is just a single ‘simulation by nature’: it can take any pattern, however small the likelihood of that pattern might be.
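One of the standard randomness checks alluded to here is the Wald-Wolfowitz runs test. The sketch below uses made-up numbers to show how an overly regular sequence raises a flag, bearing in mind that a small p-value only casts doubt; it proves nothing.

```python
# A bare-bones Wald-Wolfowitz runs test on whether successive values fall
# above or below the median; made-up input, for illustration only.
import numpy as np
from scipy.stats import norm

def runs_test_pvalue(x):
    x = np.asarray(x, dtype=float)
    above = x > np.median(x)
    n1, n2 = int(above.sum()), int((~above).sum())
    runs = 1 + int((above[1:] != above[:-1]).sum())   # observed number of runs
    mean = 2 * n1 * n2 / (n1 + n2) + 1                # expected runs under randomness
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mean) / np.sqrt(var)
    return 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value

# A perfectly alternating sequence is far too regular to look like chance.
print(runs_test_pvalue([1, 9, 2, 8, 3, 7, 4, 6] * 5))   # p close to 0
```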

Still, a capable statistical expert will often be able to identify inconsistencies within the data, as nature induces a kind of inbuilt randomness that fabricated data tends to miss. If raw data is not reported and only brief summary results are given, identifying data fudging becomes much harder. Even then, if the same data has been used to calculate different types of summary measures and some of the measures are fudged, it is quite often possible to find inconsistencies. There is no such thing as ‘perfect fudging of data’.
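One well-known consistency check of this kind, though not named in the article, is the GRIM test: a mean of n whole-number responses, reported to two decimal places, can only take certain values, so a reported mean that no integer total could produce is internally inconsistent. A small sketch with hypothetical figures:

```python
# A GRIM-style consistency check (hypothetical figures): can a reported mean,
# rounded to `decimals` places, arise from n whole-number observations?
import math

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    target = reported_mean * n
    for total in (math.floor(target), math.ceil(target)):   # candidate integer sums
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

print(grim_consistent(3.48, n=25))   # True: 87 / 25 = 3.48 exactly
print(grim_consistent(3.47, n=25))   # False: no integer total of 25 scores gives 3.47
```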

Back to the Mendel-Fisher controversy. In her 1984 review of Betrayers of the Truth, Patricia Woolf noted that Ptolemy, Hipparchus, Galileo, Newton, Bernoulli, Dalton, Darwin, and Mendel have all been alleged to have violated standards of good research practice. “[T]here is scant acknowledgement that scientific standards have changed over the two-thousand-year period from 200 B.C. to the present,” Woolf wrote. The importance of the natural fluctuation of data, for example, was possibly not as clear in Mendel’s era as it is today. It is thus perhaps unfair to put these stalwarts under a scanner built to present-day ethical standards.

Judging the fudging is a continual process, empowered by new technologies, scientific interpretations, and ethical standards. Future generations will keep judging you even if your conclusions are perfect.

Atanu Biswas is Professor of Statistics, Indian Statistical Institute, Kolkata
