Marathon, not a sprint: Developing authentic ChatGPTs for Indian languages
In a significant stride towards Indianising artificial intelligence (AI), Bengaluru-based Sarvam AI earlier this month unveiled a suite of products aimed at changing the AI landscape in the country. Among them was Sarvam 2B -- an open-source large language model (LLM) proficient in 10 Indian languages.
The intended goal is to democratise AI -- making it accessible to every Indian irrespective of their linguistic and socio-economic background -- and to bridge the digital divide, one of the firm’s founders posted after the launch.
LLMs went viral in the news last year following the launch of American firm OpenAI’s GPT-4 -- an advanced model powering the company’s ChatGPT, purported to be able to comprehend human emotions and respond accordingly. While ChatGPT, at present, does allow interaction in Indian languages, including Tamil, Malayalam, and Hindi, it leaves a lot to be desired -- especially in understanding the nuances, dialects, idioms, and cultural references of these languages.
Asked about this, the chatbot itself says that these would be better represented in an LLM designed from scratch, as opposed to an AI model developed from publicly available data.
No dearth of initiatives
Developing an LLM in languages besides English is difficult, says Neha*, who has worked on LLMs for over four years. “The data and the digital content available in English are plentiful, serving as a base to train the machine. As for Indian languages, the data are very limited. It will take a lot of time, and a lot of training, to get to where English is placed at the moment [with respect to LLMs],” she adds.
However, efforts have been under way for some time by firms such as Sarvam, which released a Hindi language model, ‘OpenHathi’, last year; Central government initiatives such as Bhashini, which provides a range of AI tools enabling access in preferred Indian languages; and AI4Bharat, a venture of the Indian Institute of Technology-Madras, among others.
More recently, Thiagarajar College of Engineering (TCE) in Madurai launched a research centre, Tamarai, for AI in Tamil.
The process is arduous as designing an efficient Indian language LLM requires extensive amounts of accurate, authentic data. “[In most cases], the firms or varsities working towards this reach out to the best universities where the intended language is being taught, get in touch with the faculty, and curate a bit of literature on the language. The assistance of non-governmental organisations (NGOs) is sought in collecting data [on the language] on the field. Agents are deployed in remote areas where the language is still spoken without being influenced by other languages [such as English]. The NGOs will arrange meetings with the residents, ask them to speak on several topics or domains, and record the conversations,” contends Neha.
Transcribing the collected data is quite challenging, says Janki Nawale, a linguist at AI4Bharat, IIT-M, listing the issues faced while designing the dataset ‘IndicVoices’, on which IndicASR -- the first automatic speech recognition model to support all 22 languages [in the Eighth Schedule] -- has been built.
“Projects such as IndicTrans and IndicVoices at AI4Bharat gave opportunities to translators, language experts, native speakers, NGOs and local partners to participate in various linguistic tasks. It is a difficult thing to translate and transcribe data for machines since in most cases, annotation is done on a sentence or utterance level without long semantic context. At times, the translated sentences can be lengthy, making it challenging to translate them due to the limitations of the target language’s syntax. The diversity of Indian languages doesn’t help either, such as the right-to-left writing of Urdu and Kashmiri; aspect markers and Meitei Mayek script of Manipuri; and the standard script’s inability to write colloquially spoken words. Hence, from a scientific perspective, certain annotation rules must be established to maintain consistency in data across languages, while allowing the freedom to capture the authenticity of the language without being constrained by these ‘rules’ for a diverse application,” she says.
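To give a sense of what such utterance-level annotation involves, a single record in a speech dataset might look something like the sketch below. The field names and values are hypothetical, meant only to indicate the kind of metadata (speaker details, domain, transcript, translation) that annotation teams typically standardise; this is not IndicVoices’ actual schema.

```python
# Hypothetical utterance-level annotation record; the fields are illustrative
# of the metadata such projects standardise, not IndicVoices' actual schema.
utterance_record = {
    "utterance_id": "ta_rural_0042",
    "language": "Tamil",
    "script": "Tamil",
    "domain": "agriculture",          # the topic the speaker was asked about
    "speaker": {"age_group": "45-60", "district": "Madurai"},
    "transcript": "...",              # verbatim transcription of the audio
    "english_translation": "...",     # sentence-level translation, without long context
    "annotation_guideline_version": "v1.0",
}
```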
Technical issues pose a challenge too. Graphics processing units (GPUs) are as vital as data for LLMs, to process the huge volumes of information that the machine is trained on. “LLMs deal with billions of parameters, working on petabytes of data. To train them, one requires H100 chips [manufactured by NVIDIA] to crunch large volumes of data, or machine learning models. Besides the expensive rates, there is a need for specialised RAM, power supply, and motherboard, among others, requiring highly technical resources to put the assembly together and use it efficiently to train an LLM,” says Ranjith Melarkode, founder, The Neural.ai.
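A rough, back-of-the-envelope sketch of why such hardware is needed: the estimate below counts only the GPU memory required to hold a 2-billion-parameter model and its optimiser states during training. The figures are the author’s illustrative assumptions, not Sarvam’s or NVIDIA’s numbers.

```python
# Back-of-the-envelope GPU memory estimate for training a 2B-parameter LLM.
# All figures are rough, illustrative assumptions.

params = 2e9                    # 2 billion parameters (a Sarvam-2B-sized model)

bytes_weights   = params * 2    # fp16/bf16 weights: 2 bytes per parameter
bytes_grads     = params * 2    # fp16/bf16 gradients: 2 bytes per parameter
bytes_optimizer = params * 12   # Adam: fp32 master weights + two fp32 moments (4+4+4 bytes)

total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations and batch data")
# => roughly 32 GB -- a large slice of a single 80 GB H100 -- before counting
#    activations, which is why multi-GPU clusters are the norm for training.
```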
Token computing
Generally, an AI model has to break a sentence or word fed to it into ‘tokens’, and it is known to generate fewer tokens for English than for a language like Hindi or Tamil. Mr. Ranjith says, “Higher tokenisation allows the model to capture the finer nuances of language and handle diverse inputs -- which are much needed in Indic languages, where words often share common roots across languages. This fidelity and flexibility often come at the cost of increased computation and resources. It is important to find the right balance between model efficiency and fidelity [and cost].”
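A minimal sketch of the effect described above, using the open-source tiktoken tokeniser. The encoding name and the sample sentences (rough translations of the same English line) are illustrative assumptions; actual token counts vary from model to model.

```python
# Compares how many tokens the same sentence produces in English vs. Hindi/Tamil.
# Uses the open-source tiktoken library; "cl100k_base" is an illustrative encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "The weather is pleasant today.",
    "Hindi":   "आज मौसम सुहावना है।",
    "Tamil":   "இன்று வானிலை இனிமையாக உள்ளது.",
}

for language, text in sentences.items():
    tokens = enc.encode(text)
    # Indian-language sentences typically split into many more tokens,
    # raising the computation needed per sentence processed.
    print(f"{language:8s}: {len(tokens):3d} tokens")
```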
Ms. Nawale says that engaging a huge number of people in detail-oriented tasks is challenging. “To get 20 minutes’ worth of content from a person, one has to work with them for three to four hours, which some of them would not concur with,” she says, while also affirming that many people, as well as organisations, do extend their cooperation upon realising that the efforts are aimed at promoting, digitising and preserving their language, which may otherwise be declining in prominence.
The process is long as well, affirms Sanjay Suryanarayanan, research engineer, AI4Bharat, IIT-M. Various factors and domains (topics) have to be considered before the data is fed to the machine, to make the end product more efficient. For instance, for evaluating translation models (AI models designed to translate text-based content from one language to another), developers look for ‘gold-standard parallel data’ -- content that is translated by humans, not machines -- to train the model. “Translators manually translate a text from English to an Indian language, and feed it to the machine. Once the translated text is generated, the machine is asked to translate it to English again. This process is called back translation,” he says.
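A minimal sketch of that back-translation check, under stated assumptions: the two translate functions are hypothetical placeholders for whichever English–Tamil translation models a team actually uses, and the open-source sacrebleu library scores how closely the round-tripped English matches the human original.

```python
# Sketch of a back-translation check for evaluating a translation model.
# translate_en_to_ta / translate_ta_to_en are hypothetical placeholders for
# whichever English<->Tamil models a team actually uses.
import sacrebleu

def translate_en_to_ta(sentence: str) -> str:
    raise NotImplementedError("plug in an English-to-Tamil translation model here")

def translate_ta_to_en(sentence: str) -> str:
    raise NotImplementedError("plug in a Tamil-to-English translation model here")

def back_translation_score(english_sentences: list[str]) -> float:
    """Round-trip English -> Tamil -> English and score the result with BLEU."""
    round_tripped = [translate_ta_to_en(translate_en_to_ta(s)) for s in english_sentences]
    # Compare the machine's round-tripped English against the human originals.
    return sacrebleu.corpus_bleu(round_tripped, [english_sentences]).score
```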
This is only a cog in the wheel of the larger set-up. Moreover, Mr. Sanjay says, prompt engineering (structuring the instruction so that the AI model is able to fulfil the request made) has to be focused on. “The end goal is to make the AI models as sophisticated as possible,” he adds.
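As a simple illustration of what such prompt structuring can look like, the template below frames a translation instruction explicitly. The wording and fields are the author’s own assumptions, not a format prescribed by AI4Bharat.

```python
# Illustrative prompt template for a translation request; the wording and
# fields are assumptions, not a prescribed AI4Bharat format.
PROMPT_TEMPLATE = (
    "You are a careful English-to-{language} translator.\n"
    "Preserve names, numbers and idiomatic meaning; do not add content.\n\n"
    "English: {source}\n"
    "{language}:"
)

prompt = PROMPT_TEMPLATE.format(language="Tamil", source="The weather is pleasant today.")
print(prompt)  # This structured instruction is what gets sent to the model.
```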
Besides, once the LLMs are built, unless there is the right vision, team and operational experience, there is a risk of inadvertently releasing “biases [arising from within the teams]” into the system, says Mr. Ranjith. “One team may not be aware of the workings of the other. The data team may not be interacting with the user experience or legal compliance team. They might be talking in superficial layers, but not at a deeper level, and there is a constant worry of whether biases are being introduced [into the model being built],” he adds.
Moreover, LLMs have to be constantly trained and tuned; the process is continuous. “Only then would we get the accuracy we aim for,” he concludes.
The benefits
Indianised LLMs, when holistically designed, can have a variety of applications. Neha* opines that everything from interactive learning courses to chatbots to the revitalisation of languages such as Dogri could be made possible through them.
Hari Thiagarajan, chairman, TCE, says: “There is a lot of potential in [Indianising AI] as the country has over 20 languages [in the Eighth Schedule of the Constitution]. [As for Tamarai], Tamil is a classical language, and the Tamil diaspora is spread across the world. Hence, leveraging Tamil would be of great benefit -- something that has not been done before. Also, it serves as a way of promoting the language. Tomorrow, if a Tamil LLM is able to do what the English ChatGPT is doing, the industry would benefit from it, and the language would be preserved.”