Harvard researchers and Google have developed a tool that can identify cultural trends over the past 200 years.

How many words in the English language never make it into dictionaries? How has the nature of fame changed in the past 200 years? How do scientists and actors compare in their impact on popular culture? These are some of the questions that researchers and members of the public can now answer using a new online tool developed by Google with the help of scientists at Harvard University. The massive searchable database is being hailed as the key to a new era of research in the humanities, linguistics and social sciences that has been dubbed “culturomics”.

The database comprises more than five million books — fiction and non-fiction — published between 1800 and 2000, representing about four per cent of all the books ever printed. Dr. Jean-Baptiste Michel and Dr. Erez Lieberman Aiden of Harvard University have developed the search tool, which they say will give researchers the ability to quantify a huge range of cultural trends in history.

“Interest in computational approaches to the humanities and social sciences dates back to the 1950s,” said Dr. Michel, a psychologist in Harvard’s Programme for Evolutionary Dynamics. “But attempts to introduce quantitative methods into the study of culture have been hampered by the lack of suitable data. We now have a massive dataset, available through an interface that is user-friendly and freely available to anyone.” In their initial analysis of the database, the team found that about 8,500 new words enter the English language every year and that the lexicon grew by 70 per cent between 1950 and 2000. Most of these words, however, never appear in dictionaries. “We estimated that 52 per cent of the English lexicon — the majority of words used in English books — consists of lexical ‘dark matter’ undocumented in standard references,” they wrote in the journal Science.
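
The arithmetic behind such claims is straightforward: a word’s raw count in each year is divided by the total number of words printed that year, so that the growth of publishing itself is not mistaken for a cultural trend. The sketch below shows the idea in Python; the function and the toy numbers are invented for illustration and are not the researchers’ actual data or code.

```python
# Minimal sketch: turn raw per-year counts for one word into
# per-million-word rates, so years with more books are comparable.
# All numbers here are invented toy data.

def relative_frequency(word_counts, total_counts):
    """Return occurrences per million words for each year."""
    rates = {}
    for year, count in word_counts.items():
        total = total_counts.get(year)
        if total:  # skip years the corpus does not cover
            rates[year] = 1_000_000 * count / total
    return rates

word_counts = {1900: 120, 1950: 340, 2000: 910}   # one word's raw counts
total_counts = {1900: 2_000_000, 1950: 5_000_000,
                2000: 13_000_000}                 # corpus size per year

for year, rate in sorted(relative_frequency(word_counts, total_counts).items()):
    print(f"{year}: {rate:.1f} per million words")
```

With these toy figures the word climbs from 60 to 70 occurrences per million words even as the corpus itself grows more than sixfold; rates like these, charted over two centuries, are the raw material for every trend discussed below.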

The researchers were also able to trace how words had changed over time, for example a shift that began in the U.S. away from irregular verb forms such as “burnt”, “smelt” and “spilt” towards their regular “-ed” counterparts. “The forms still cling to life in British English. But the -t irregulars may be doomed in England too: each year, a population the size of Cambridge adopts ‘burned’ in lieu of ‘burnt’,” they wrote.

The team also investigated the changing nature of fame over the past two centuries. By tracking the frequency of famous names in print, they showed that celebrities born in the mid-20th century achieved fame at a younger age and became more famous than their 19th-century counterparts, but that their fame lasted for a shorter time.

By 1950, celebrities were achieving fame, on average, when they were 29, compared with 43 for celebrities around 1800. “People are getting more famous than ever before,” wrote the researchers, “but are being forgotten more rapidly.” By the mid-20th century, the most famous actors tended to achieve fame at around 30, while writers had to wait until 40 and, for politicians, fame didn't tend to happen until they reached at least 50.

“Science is a poor route to fame. Physicists and biologists eventually reached a similar level of fame as actors but it took them far longer,” wrote the researchers. “Alas, even at their peak, mathematicians tend not to be appreciated by the public.” The database can also reveal patterns of censorship in individual countries. The Jewish artist Marc Chagall, for example, was mentioned only once in the entire German-language corpus between 1936 and 1944, even though his appearances in English-language books grew roughly fivefold over the same period.
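
The comparison behind that censorship finding can be pictured as counting the same name in two corpora over the same window of years. Below is a minimal sketch of the idea in Python; the counts, the years and the helper function are hypothetical, not the study’s data or method.

```python
# Minimal sketch: compare mentions of one name in two corpora over
# the same window of years. All counts are invented toy data.

def mentions_in_window(yearly_counts, start, end):
    """Sum per-year mention counts over an inclusive year range."""
    return sum(n for year, n in yearly_counts.items() if start <= year <= end)

# Hypothetical yearly mention counts for one name in each corpus.
german = {1934: 3, 1936: 0, 1938: 0, 1940: 1, 1942: 0, 1946: 4}
english = {1934: 11, 1936: 14, 1938: 19, 1940: 25, 1942: 31, 1946: 40}

start, end = 1936, 1944
print(f"German corpus {start}-{end}:  {mentions_in_window(german, start, end)}")
print(f"English corpus {start}-{end}: {mentions_in_window(english, start, end)}")
# A name that keeps rising in one corpus while flatlining in the
# other over the same years is the signature of suppression.
```

With the toy numbers above, the German window sums to a single mention while the English window sums to 89, the same shape of divergence the article describes for Chagall.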

Claire Warwick, director of the centre for digital humanities at University College London, said humanities researchers had been using word-frequency techniques for several decades, but that the sheer size of this dataset marked it out from the usual tools. “What’s different is that this allows people to not just look at several hundred thousand words or several million words but several million books. So the overview is much bigger.”

The database of 500 billion words is thousands of times bigger than any existing corpus of texts, and as a single sequence of letters it is 1,000 times longer than the human genome. The vast majority of the data, around 72 per cent, is in English, with small amounts in French, Spanish, German, Chinese, Russian and Hebrew. To coincide with the release of the Science paper, Google will release a tool allowing members of the public to see how often a word or phrase has appeared in the corpus and how its usage has changed over time.

— © Guardian Newspapers Limited, 2010
