By Onno Kampman & Holy Lovenia, TFGI Insights Contributors
Southeast Asia’s (SEA) digital ambitions are accelerating. Governments across the region are launching national AI strategies, digitalising public services, and investing in infrastructure to drive economic growth and social development. Initiatives like the ATIPAN project in the Philippines and MediBot in Timor-Leste—bringing AI-powered healthcare to remote communities—demonstrate how transformative these technologies can be. Yet amid this momentum lies a quiet but urgent gap: the AI systems shaping SEA’s digital future often fail to represent its languages, cultures, or lived realities.
With over 100 ethnic groups who speak more than a thousand living languages and dialects, SEA is one of the most linguistically diverse regions on Earth. Yet most modern AI systems are trained predominantly on English and a few other global languages, leaving most SEA communities (speakers of Javanese, Tagalog, Burmese and more) underrepresented or invisible in AI development. Why does this matter? This language gap isn’t just technical; it’s a barrier to equitable digital inclusion, as language is deeply tied to identity, trust, and nuance. Studies show that large language models (LLMs) struggle with SEA languages, leading to mistranslations, cultural misinterpretations, and even harmful outputs, particularly in sensitive areas like healthcare. For example, AI might misread a patient saying they feel sapot in Filipino, missing the deeper emotional or psychosocial distress the word expresses. When AI systems misinterpret what users say, or fail to speak in ways that feel natural or respectful, they risk delivering harmful advice, misclassifying inputs, or simply being ignored. As AI becomes more embedded across sectors, building regional language models is essential to ensure SEA’s digital future reflects its people, languages, and lived realities.
Challenges & Barriers
SEA’s push for inclusive AI faces four interconnected challenges: data scarcity, fragmented development, limited market incentives, and gaps in accessibility and trust.
Most SEA languages lack the large, high-quality datasets needed to train robust models. Where data exists, it is often scattered across informal sources and hard to standardise. The problem is worse for languages with strong oral traditions, which may have little or no digital footprint. Building quality datasets requires more than literal translation, which risks producing awkward “translationese”; it demands deep cultural grounding. A 2024 study by SEACrowd showed that popular global models underperform on SEA language tasks, particularly in generating natural-sounding text. Even when a language is technically included, model performance for languages with limited digital presence falls behind, mirroring the hierarchy of data availability. Small language groups, already excluded from services, risk further marginalisation when AI tools bypass them.
National AI strategies often prioritise infrastructure, data governance, and economic competitiveness, sidelining linguistic inclusion. Policy approaches vary widely between countries, and without regional coordination or data-sharing frameworks (e.g., common formats, ethical standards, pooled compute resources), efforts remain siloed. Some promising local initiatives are beginning to emerge. Thailand’s Typhoon model, an accessible Thai-centric LLM, was also trained on informal language to capture stylistic nuances that global models often overlook. Indonesia’s NusaCrowd curated high-quality open datasets for low-resource languages, including widely spoken Javanese and Sundanese, as well as endangered tongues like Lampung and Buginese, capturing the breadth of linguistic diversity and cultural contexts such as code-switching and shifting levels of formality. Yet, without sustained investment and alignment with broader ASEAN strategies, their long-term support and interoperability remain limited. Regional collaboration is especially crucial in Southeast Asia, where many languages—like Malay, Khmer, and Hmong—cross national borders, and individual countries may lack the capacity to build full-stack AI pipelines independently.
Because big tech companies prioritise mainstream languages with existing commercial value, indigenous and low-resource languages are rarely incorporated into their models or business strategies. Meanwhile, local startups, academic labs, and grassroots groups often lack the computing power and funding needed to build language-specific tools. The region also faces a shortage of skilled NLP researchers and data engineers experienced in low-resource AI development, leaving the ecosystem under-resourced.
For perspective, SEA-LION, Southeast Asia’s flagship open-source LLM project, was built by 31 authors, compared to 199 for China’s DeepSeek-R1 and a staggering 3,295 contributors behind Google’s latest Gemini model.
In much of Southeast Asia, AI adoption is constrained by foundational infrastructure challenges: limited connectivity, unreliable power, costly or low-spec devices, and insufficient digital literacy. Even when tools are available, widespread usage is not guaranteed. Poor localisation—beyond mere translation—can result in awkward tone, cultural mismatches, or unfamiliar interfaces. In the region, this may manifest as overly formal language, failure to interpret code-switching (the blending of languages), or disregard for indirect communication norms. When tools feel extractive or culturally alien, they risk eroding user trust.
Opportunities and Solutions: Building Inclusive AI from the Ground Up
Despite the barriers, SEA has a unique opportunity to lead in creating AI that is truly inclusive and culturally grounded. The region can chart its own path—treating linguistic and cultural diversity as assets, not obstacles. With its deep traditions of multilingualism, code-switching, oral storytelling, and cultural hybridity, SEA is well-placed to pioneer flexible, context-aware AI systems that handle code-switching, shifting levels of formality, and socially complex communication. AI attuned to SEA’s complexity could enable trust-sensitive applications, from health promotion in conservative areas to crisis communication across multiple languages and dialects.
1. Local Innovation and Homegrown Solutions
A growing ecosystem of regional initiatives is tackling SEA’s unique linguistic challenges, blending grassroots energy with institutional support. Community-led efforts like SEACrowd are making a significant impact: curating hundreds of corpora covering nearly 1,000 languages, building performance benchmarks in 38 Southeast Asian languages (for comparison, OpenAI’s latest model only benchmarked performance on five SEA languages), and nurturing local AI talent. SEACrowd also collaborates with global open-source initiatives such as MLCommons, Common Crawl, and Masakhane to share lessons and enable the global shift toward community-led, inclusive AI development. The Singapore-based SEA-LION initiative is creating open-source LLMs trained on 11 Southeast Asian languages to capture cultural nuances, while Thailand’s Typhoon model and Indonesia’s NusaWrites are building datasets and models rooted in local context. Together, these efforts offer a powerful alternative to global models that often overlook the region’s linguistic diversity.
Beyond technology, they play a vital preservation role. UNESCO warns that nearly 40% of the world’s languages—many in SEA—are endangered. By creating a digital footprint, these initiatives help safeguard not only languages but also the cultural knowledge embedded within them.
2. Regional Coordination and Shared Infrastructure
To break silos, ASEAN should work alongside universities, community groups, international organisations such as UNESCO, and global open-source initiatives to support interoperable data frameworks and shared standards. Projects like SEA-VL, which pairs over a million culturally relevant images with local-language captions, show both the value and complexity of cross-border collaboration. A Southeast Asian NLP Commons could standardise benchmarks, ethics, and governance, especially for indigenous and low-resource languages. India’s AI4Bharat offers a model, funding open datasets in over 20 Indian languages with government, academic, and civil society support.
3. Enabling Ecosystems through Policy and Incentives
Governments can treat linguistic datasets as public digital goods and fund open-source AI for regional languages. Procurement policies, tax incentives, and grants can spur business investment in inclusion. Policymakers are starting to take notice—ASEAN’s Guide on AI Governance and Ethics and Singapore’s IMDA emphasise inclusive data practices. However, unless language equity becomes a core pillar of digital transformation, SEA risks developing AI that speaks over its people.
4. Trust, Transparency, Inclusion
Language inclusion must be participatory. Co-governance models—where contributors shape data practices and evaluation—build awareness, trust, and ownership. Investing in mentorship, transparency, and shared control ensures SEA’s digital future reflects its full diversity.
Conclusion
Southeast Asia’s digital future depends on closing the AI language gap. The region’s linguistic diversity is a strategic asset, but failing to harness it will leave many behind. Policymakers must treat local language data as critical infrastructure, while industry and communities work together to create AI that reflects SEA voices. Inclusive AI is not optional—it is a strategic imperative. By investing in linguistic inclusion now, SEA can bridge the gap and set a global standard for AI that belongs to everyone.
About the Authors
Onno Kampman is an AI Scientist at Singapore’s MOH Office for Healthcare Transformation (MOHT) and a Visiting Scientist at the University of Cambridge. He leads pioneering projects that apply AI to mental health care transformation, and contributes to SEACrowd’s mission to boost Southeast Asian AI capabilities.
Holy Lovenia is the Lead of SEACrowd. Based in London and affiliated with AI Singapore, she drives SEACrowd’s strategy to unify and scale AI resources across Southeast Asia—most recently through initiatives like SEACrowd’s multilingual benchmarks and the SEA‑VL vision‑language dataset.
About the Organisation
SEACrowd is a research community advancing Southeast Asia-focused AI and empowering the next generation of AI researchers in the region. The organisation envisions a future where Southeast Asia’s AI ecosystem is mature, globally competitive, and grounded in the region’s diverse linguistic and cultural contexts.
SEACrowd’s initiatives include leading data collection and model development efforts tailored to Southeast Asia, building and connecting a regional research network, and supporting early-career talent through mentorship and hands-on experience via the SEACrowd Apprentice Program.
The views and recommendations expressed in this article, published in September 2025, are solely those of the authors and do not necessarily reflect the views and position of the Tech for Good Institute, the MOH Office for Healthcare Transformation (MOHT), or the authors’ respective organisations.