Government’s IndiaAI Datasets Platform To Be Operational By January 2025

India will launch an open-source platform for hosting datasets, the IndiaAI Datasets Platform, by January 2025, National eGovernance Division (NeGD) Chief Executive Nand Kumarum announced on October 9, 2024. The platform will be a key component of the government’s ₹10,000 crore IndiaAI Mission and is designed to let developers create, train, and deploy their own AI models using data from central and state governments as well as the private sector.
The initiative is part of India’s larger effort to build a robust AI ecosystem. Last month, Abhishek Singh, Additional Secretary at the Ministry of Electronics and Information Technology and CEO of Digital India Corporation, said the government was establishing Data Management Offices as part of the platform. These offices will ensure that the collected data is AI-compatible. In addition, Data Management Units will be set up within every government department to streamline data collection.
This announcement follows the Ministry of Science and Technology’s launch of BharatGen, a generative AI (GenAI) initiative aimed at creating a multimodal Large Language Model (LLM) for Indian languages, signaling India’s growing focus on AI capabilities.
“The idea primarily is like HuggingFace – you have models, you have datasets, and you have people coming up and using those datasets and building models. We are trying to do something similar,” Kumarum said, as per the report.
Private sector companies are also contributing to the AI landscape. Tech Mahindra’s Project Indus, initially trained on Hindi and 37 dialects, is set to expand its coverage to more Indic languages. Meanwhile, Ola announced its Krutrim LLM in December 2023, trained on Indian languages to serve a “1.4 billion India.” Krutrim, said to be comparable to ChatGPT, can write in 10 Indian languages and understand 20. Although Ola CEO Bhavish Aggarwal claimed Krutrim could outperform GPT-4 in Indic languages, users have reported inaccuracies in some of the model’s responses.