The launch of a new machine translation system – IndicTrans2 – is being hailed as a game-changer for effective communication, social inclusion, equitable access, and national integrity in India. IndicTrans2, a publicly available model, covers all 22 of India’s scheduled languages named in the country’s Constitution. This, along with the training data that underpins it, represents a huge step forward for open, accessible machine translation of Indian native tongues.
English is widely used in India for communication in business, education, and government. To disseminate information more effectively, official documents from any sector must therefore first be translated into regional languages. Furthermore, the availability of high-quality translations of learning materials in students’ native languages ensures that knowledge is better democratized.
The image above illustrates how widely India’s native languages are used. By supporting high-quality translations that everyone has open access to, IndicTrans2 can facilitate real-time information exchange across linguistic borders within India. It also offers quality-of-life improvements for people from outside India, helping travelers and migrants to acclimate more easily.
Bringing together a team from the Nilekani Centre at AI4Bharat, IIT Madras, Microsoft, EkStep Foundation, the National Institute of Information and Communications Technology in Kyoto (Japan), and Singapore’s Institute for Infocomm Research (I2R), the work behind IndicTrans2 addressed three key issues:
The absence of parallel training data covering all 22 scheduled Indian languages for training machine learning models
The lack of robust benchmarks covering these languages and centered around content relevant to India
The absence of translation models supporting India’s 22 scheduled languages
The team addressed these issues by creating a vast training dataset, known as the Bharat Parallel Corpus Collection, which comprises 230 million bitext pairs, 126 million of them newly added (including 644,000 from manual translations). They also created benchmarks for Indian languages, basing them on high-quality human translations of India-specific content covering culture, economics, education, entertainment, geography, government, health, industry, legal matters, news, religion, sports, and tourism. The resulting IN22 benchmark covers all 22 scheduled Indian languages.
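To put those corpus figures in proportion, here is a minimal Python sketch. It uses only the three numbers quoted above; the variable names are illustrative, and it assumes that the pairs which are not newly added came from previously existing sources.

```python
# Figures quoted in the article for the Bharat Parallel Corpus Collection.
TOTAL_PAIRS = 230_000_000   # all bitext pairs
NEW_PAIRS = 126_000_000     # newly added pairs
MANUAL_PAIRS = 644_000      # new pairs that came from manual translation


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Assumption: whatever is not newly added came from existing sources.
existing_pairs = TOTAL_PAIRS - NEW_PAIRS
new_share = share(NEW_PAIRS, TOTAL_PAIRS)
manual_share_of_new = share(MANUAL_PAIRS, NEW_PAIRS)

print(f"Pairs from existing sources: {existing_pairs:,}")
print(f"New pairs as a share of the corpus: {new_share:.1f}%")
print(f"Manual translations as a share of new pairs: {manual_share_of_new:.2f}%")
```

On these figures, a little over half of the corpus is new material, while manually translated pairs make up only a small fraction of that new material.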
Need more technical information? You can access the open-source file here.
Companies exporting goods to India must already meet certain language obligations. Labeling, for example, is preferred in English, with the Indian customs authority ensuring that all labeling obligations are met. However, marketing is a whole other ball game…
Figures from CSA Research show that 76% of consumers prefer to purchase products that have information in their own language, while 40% will never buy from websites that aren’t in their mother tongue. Marketers clearly have plenty to gain by reaching out to consumers in their first language, regardless of the language they are obliged to use to label their products. High-quality, reliable and openly accessible machine translation solutions such as IndicTrans2 can therefore do much for marketers seeking to tap into new audiences and create closer connections with buyers.
Data from PhonePe reveals some interesting insights for businesses looking to connect with Indian consumers. The firm’s analysis of payments made by Indian consumers is detailed in the infographic above. It suggests there is room for businesses to extend their goods and services to the wider Indian population.
The advent of an MT engine that covers all 22 scheduled languages not only supports greater social inclusion and equity within India, but also presents a great opportunity for international commerce. Around 97% of India’s population speaks one of those 22 languages as their mother tongue. This effectively means a potentially better quality of life for these speakers, whether through access to information, goods, or services.
If you’re keen to know more about the potential that machine translation presents, developments in the language industry, or some interesting facts about languages, why not browse the Tomedes hub? You can also discover our post-editing machine translation solution, if you would like to add an expert human touch to your machine translations.