Using Wikipedia to change the language of the Web

Nothing exemplifies the power of Wikis, the open and collaborative platforms for content creation, like the online encyclopaedia, Wikipedia. The site that everybody can freely and collaboratively edit is credited not only with having created a massive repository of knowledge but also with democratising the presentation of content on the Web.

Wikipedia, which turns 10 this January, has over 35 lakh articles. In 2010, several Wikimedia Foundation bigwigs visited India, held public meet-ups with the Wiki community and appointed the first Indian to sit on the board of the Wikimedia Foundation, Bishakha Datta. Apart from the formal Indian Wikipedia chapter, which has been on the anvil for some time now, the Wikimedia Foundation has also chosen India to set up its first offshore office.

Why India?

But why India? The large number of potential Net users here, and the ‘ground support' that exists in the form of a passionate community of Wikipedians, drive these “offshore efforts”. However, the Foundation realises that the ‘Indian Internet' is by no means a homogeneous entity. During recent visits to India, Wikipedia co-founder Jimmy Wales has repeatedly articulated the need to approach Wikipedia growth here from a strictly ‘localised' perspective: by expanding the user base for the local language Wikipedias.

Yes, the Internet, with English as its predominant language, can barely make inroads into vast areas of the country. Indian language content on the Internet is sparse, and is restricted either to niche blogs or news content. Internet firms are also interested in changing this, so that web advertisers can target larger local audiences.

So, how can Wikis help drive this change? It seems natural that a massive task like this one, that of creating and expanding local language content, is best tackled ‘collaboratively'. And that is just what Wiki communities do best. As of today, there are Wikipedias in over 20 Indian languages. While there are 58,000 articles in the Hindi Wikipedia, Telugu and Marathi too have been growing steadily, clocking 47,000 and 32,000 articles respectively. The Tamil Wiki has around 26,000 articles, Bengali 22,000, Malayalam 16,000 and Kannada 9,900. Taken together, these Wikipedias are arguably the single largest source of Indic content online.

Early challenges

Enthusiastic Wikipedians (Wikipedia editors/contributors) will tell you that this growth has been anything but easy. Buggy fonts, the lack of platform-independent fonts and the lack of a common standard for keyboard layouts marred early efforts. Data input, though much improved today, is still a challenge for non-technical folks. Most operating systems, particularly proprietary ones, still do not support Kannada fonts ‘out of the box', points out Hari Prasad Nadig, an active Kannada Wikipedia editor.

“Data input was a huge challenge when we started building the Kannada Wikipedia in 2004. Even the Nudi font for Kannada, declared a standard by the Karnataka Government, worked only on machines running Microsoft Windows. Even there, Windows XP, the most popular OS, still doesn't offer complete support for Unicode Kannada,” he explains. Most Indian languages face similar issues with the rendering of Indic fonts.

Enter Internet firms

Things have come a long way since then. Over the years, technical experts in the Wikipedia community have worked to improve MediaWiki, the software that runs Wikipedia, to offer web-based support for these fonts.

More recently, Internet firms have evinced substantial interest in this area. Firms that were once indifferent to these local needs are today proactive about what is termed localisation. From seeking out Wiki volunteer groups to use their translation tools (by offering free applications such as Google's Translator Toolkit) to ‘Open Sourcing' their software kits (proprietary firm Microsoft's WikiBhasha is released under Apache 2.0, a Free Software licence), firms are going all out to make the Internet “linguistically diverse”. They are investing heavily in tools that can help translate the web and ‘localise' it.

Understandably, Wikipedia, with its existing army of enthusiastic volunteers, is seen as a useful ally. So, traditional offline Wiki meet-ups in recent months have had corporate guests attending to get a first-hand understanding of how Wiki volunteering works, and to encourage volunteers to use Google's Translator Toolkit, a tool that allows translation of webpages, Wikipedia articles and Knol pages. An Official Google Blog post, dated July 14, 2010, states: “Translation is key to our mission of making [information] useful to everyone. The intent is to ... help Wikipedia become more helpful to speakers of smaller languages.”

Google's project began in 2008, when the firm employed translators and volunteers to translate Wikipedia articles into Hindi. Using search data from Google Trends to determine which Wikipedia articles are most read in India, it selected articles for machine translation, and then got teams to do rough edits. Google estimates that Hindi Wikipedia grew by at least 20 per cent as a result of these efforts. The firm is now working with at least five local language Wikis.

While the stated aim is to increase the quantum of Indic content online, there are tangible benefits for Google's Translator Toolkit. The more ‘humans', as opposed to machines, use the toolkit, the more the tool learns and the better it translates. That is, every time you or the company's team edits using the tool, it learns a new word, phrase or syntax. This is stored in the translation memory, a database that stores the source text and the translation coupled together as ‘translation units'.

The software is so designed that once Wikipedia pages are imported, these units are stored in the tool's memory. Given that Wiki articles offer variety in content and style, this is an ideal platform for their tool to ‘learn'.
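To make the idea of a translation memory concrete, here is a minimal sketch in Python. It is not how Google's toolkit is built; the class names, the 0.8 similarity threshold and the sample sentence pair are all assumptions chosen for illustration. It simply stores each source segment with its human-approved translation as a unit, and returns the closest stored unit when a new segment comes along.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Optional


@dataclass(frozen=True)
class TranslationUnit:
    """One source segment coupled with its human-approved translation."""
    source: str
    target: str


class TranslationMemory:
    """Hypothetical in-memory store of translation units (illustrative only)."""

    def __init__(self) -> None:
        self._units: Dict[str, TranslationUnit] = {}

    def add(self, source: str, target: str) -> None:
        # Each human edit adds (or overwrites) one unit.
        self._units[source] = TranslationUnit(source, target)

    def lookup(self, segment: str, threshold: float = 0.8) -> Optional[TranslationUnit]:
        # An exact match is returned straight away.
        if segment in self._units:
            return self._units[segment]
        # Otherwise return the most similar stored source, if it is close enough.
        best, best_score = None, 0.0
        for unit in self._units.values():
            score = SequenceMatcher(None, segment, unit.source).ratio()
            if score > best_score:
                best, best_score = unit, score
        return best if best_score >= threshold else None


tm = TranslationMemory()
tm.add("The river flows east.", "नदी पूर्व की ओर बहती है।")
exact = tm.lookup("The river flows east.")
fuzzy = tm.lookup("The river flows west.")      # near match, offered for a human to correct
missing = tm.lookup("Completely different text")  # nothing similar is stored
```

The more units human editors add, the more often a lookup finds a usable match, which is the sense in which such a tool ‘learns' from every edit.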

This makes perfect business sense: it will not only enrich search options, but also help index content on various software platforms. This could potentially be valuable for Internet firms in emerging technology areas such as Optical Character Recognition.

Google, though the earliest to take this approach, isn't the only one looking to capitalise on the ‘power of the Wikis'. Albeit a late entrant, Microsoft in October 2010 released WikiBhasha, a browser-based tool that works on Wikipedia sites.

It allows users to translate an article, manually make changes to the translated piece and upload it to Wikipedia, in 30 languages. Other machine translation tools include Yahoo's Babel Fish.

What about the Wiki philosophy?

Besides releasing these tools, and lobbying for their adoption, Internet firms have hired teams to translate articles, adding what they call the “human touch” to the machine job. However, the Indian Wikipedia community is divided in its response to this ‘commercial' approach to content creation.

Many argue that the articles generated are ridden with errors and way too literal as translations. Others argue that the whole idea of “developing content collaboratively” is lost.

Typically, more than one volunteer creates a Wikipedia article. An article stub is created, following which a request for contributions is placed on a ‘talk page'.

Using data from multiple sources and citations, the article is expanded, rewritten and reviewed by Wiki editors.

This process — characteristic of a Wiki and one that Wikipedians are proud to be part of — is lost when hired translators go about their job. They seldom take feedback or return to their articles a second time, passionate Wiki editors lament.

The efficiency of these machine translations is around 25 to 30 per cent, admits Nadig. But he adds that in the case of the Kannada Wiki the articles (translations) are of a fairly good quality. There has been resistance in other language Wikis where the “quality factor” has been overlooked.

While most agree that getting more content online is good for the cause of a “richer local language web”, many do not want the quality to suffer. For instance, in the case of the Bengali Wiki, the firm has not approached the community.

Conversations on Wiki mailing lists there reveal that community members are worried about “dump-and-run translators” who do not follow up on their translations, and do not interact or make a second edit to fix issues with the article.

Technical issues

Balasundararaman, who has been editing the Tamil Wiki for over five years now, says that though firms intend to add more content while serving their business interests, quality takes a backseat.

For instance, he points out that tools often do not lend themselves well enough to translate wikitext — text interspersed with wiki syntax.

“Because of this, for example, we faced an issue of spurious red links to non-articles. I wrote a patch to pywikipediabot to fix the mess the tools created due to their business requirement of enriching their ‘translation memory', a machine learning technique. They've built the tools to do more aligned translations i.e. line-by-line. This creates a highly mechanical and unnatural translation unsuitable for the consumption of native speakers.”
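The red-link problem can be illustrated with a short Python sketch. This is not the patch Mr. Balasundararaman wrote, and it does not use pywikipediabot; the function name, the set of existing titles and the sample text are hypothetical. It assumes only that a wikitext link has the form [[Target]] or [[Target|label]], and that a link whose target article does not exist on the local Wiki shows up in red.

```python
import re
from typing import Set

# Matches [[Target]] and [[Target|label]]; nested markup is not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def unlink_missing(wikitext: str, existing_titles: Set[str]) -> str:
    """Turn links to non-existent articles into plain text (illustrative only)."""

    def replace(match: "re.Match[str]") -> str:
        target = match.group(1).strip()
        label = match.group(2) or target
        if target in existing_titles:
            return match.group(0)  # keep the working link untouched
        return label               # drop the link markup, keep the readable text

    return WIKILINK.sub(replace, wikitext)


# Hypothetical list of articles that exist on the local Wiki.
existing = {"Chennai", "Bay of Bengal"}
text = "[[Chennai]] is on the [[Bay of Bengal|coast]] near [[Imaginary Place|a town]]."
cleaned = unlink_missing(text, existing)
```

In this sketch the two links to existing articles survive, while the third is reduced to its visible label. A translation tool that treats link markup as ordinary text has no such check.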

Further, Mr. Balasundararaman points out that the third-party companies that get the contracts to do these translations are incentivised only to dump articles and not use community feedback to improve the articles or future translations as would normally happen with volunteer editors.
