Mind the Language Gap: Building an Inclusive AI Future for Southeast Asia

Southeast Asia’s AI future depends on closing the language gap—ensuring that the region’s rich linguistic and cultural diversity is reflected in the data and models driving technological progress. Grassroots initiatives like SEACrowd show that inclusive, community-driven AI is not only possible but essential for equitable digital development.

By Onno Kampman & Holy Lovenia, TFGI Insights Contributors

Southeast Asia’s (SEA) digital ambitions are accelerating. Governments across the region are launching national AI strategies, digitalising public services, and investing in infrastructure to drive economic growth and social development. Initiatives like the ATIPAN project in the Philippines and MediBot in Timor-Leste—bringing AI-powered healthcare to remote communities—demonstrate how transformative these technologies can be. Yet amid this momentum lies a quiet but urgent gap: the AI systems shaping SEA’s digital future often fail to represent its languages, cultures, or lived realities.

With over 100 ethnic groups who speak over a thousand living languages and dialects, SEA is one of the most linguistically diverse regions on Earth. Yet most modern AI systems are trained predominantly on English and a few other global languages, leaving most SEA communities (speakers of Javanese, Tagalog, Burmese and more) underrepresented or invisible in AI development.

Why does this matter? This language gap isn’t just technical; it’s a barrier to equitable digital inclusion, as language is deeply tied to identity, trust, and nuance. Studies show that large language models (LLMs) struggle with SEA languages, leading to mistranslations, cultural misinterpretations, and even harmful outputs, particularly in sensitive areas like healthcare. For example, an AI system might misread a patient who says they feel sapot in Filipino, missing the emotional and psychosocial distress the word expresses; such errors distort meaning and erode user trust. When AI systems misinterpret what users say, or fail to speak in ways that feel natural or respectful, they risk delivering harmful advice, misclassifying inputs, or simply being ignored. As AI becomes more embedded across sectors, building regional language models is essential to ensure SEA’s digital future reflects its people, languages, and lived realities.


Challenges & Barriers

SEA’s push for inclusive AI faces four interconnected challenges: data scarcity, fragmented development, limited market incentives, and gaps in accessibility and trust.

Most SEA languages lack the large, high-quality datasets needed to train robust models. Where data exists, it is often scattered across informal sources and hard to standardise. The problem is worse for languages with strong oral traditions, which may have little or no digital footprint. Building quality datasets requires more than literal translation, which risks producing awkward “translationese”; it demands deep cultural grounding. A 2024 study by SEACrowd showed that popular global models underperform on SEA language tasks, particularly in generating natural-sounding text. Even when a language is technically included, model performance for languages with limited digital presence falls behind, mirroring the hierarchy of data availability. Small language groups, already excluded from services, risk further marginalisation when AI tools bypass them.

National AI strategies often prioritise infrastructure, data governance, and economic competitiveness, sidelining linguistic inclusion. Policy approaches vary widely between countries, and without regional coordination or data-sharing frameworks (e.g., common formats, ethical standards, pooled compute resources), efforts remain siloed. Some promising local initiatives are beginning to emerge. Thailand’s Typhoon model, an accessible Thai-centric LLM, was also trained on informal language to capture stylistic nuances that global models often overlook. Indonesia’s NusaCrowd curated high-quality open datasets for low-resource languages, including widely spoken Javanese and Sundanese, as well as endangered tongues like Lampung and Buginese, capturing the breadth of linguistic diversity and cultural contexts such as code-switching and shifting levels of formality. Yet, without sustained investment and alignment with broader ASEAN strategies, their long-term support and interoperability remain limited. Regional collaboration is especially crucial in Southeast Asia, where many languages—like Malay, Khmer, and Hmong—cross national borders, and individual countries may lack the capacity to build full-stack AI pipelines independently.

Because big tech companies prioritise mainstream languages with existing commercial value, indigenous and low-resource languages are rarely incorporated into their models or business strategies. Meanwhile, local startups, academic labs, and grassroots groups often lack the computing power and funding needed to build language-specific tools. The region also faces a shortage of skilled NLP researchers and data engineers experienced in low-resource AI development, leaving the ecosystem under-resourced.

For perspective, SEA-LION, Southeast Asia’s flagship open-source LLM project, was built by 31 authors, compared to 199 for China’s DeepSeek-R1 and a staggering 3,295 contributors behind Google’s latest Gemini model.

In much of Southeast Asia, AI adoption is constrained by foundational infrastructure challenges: limited connectivity, unreliable power, costly or low-spec devices, and insufficient digital literacy. Even when tools are available, widespread usage is not guaranteed. Poor localisation—beyond mere translation—can result in awkward tone, cultural mismatches, or unfamiliar interfaces. In the region, this may manifest as overly formal language, failure to interpret code-switching (the blending of languages), or disregard for indirect communication norms. When tools feel extractive or culturally alien, they risk eroding user trust.

Opportunities and Solutions: Building Inclusive AI from the Ground Up

Despite the barriers, SEA has a unique opportunity to lead in creating AI that is truly inclusive and culturally grounded. The region can chart its own path, treating linguistic and cultural diversity as assets, not obstacles. With its deep traditions of multilingualism, code-switching, oral storytelling, and cultural hybridity, SEA is well-placed to pioneer flexible, context-aware AI systems that handle code-switching, shifting levels of formality, and socially complex communication. AI attuned to SEA’s complexity could enable trust-sensitive applications, from health promotion in conservative areas to crisis communication across multiple languages and dialects.

1. Local Innovation and Homegrown Solutions

A growing ecosystem of regional initiatives is tackling SEA’s unique linguistic challenges, blending grassroots energy with institutional support. Community-led efforts like SEACrowd are making a significant impact: curating hundreds of corpora covering nearly 1,000 languages, building performance benchmarks in 38 Southeast Asian languages (for comparison, OpenAI’s latest model only benchmarked performance on five SEA languages), and nurturing local AI talent. SEACrowd also collaborates with global open-source initiatives such as MLCommons, Common Crawl, and Masakhane to share lessons and enable the global shift toward community-led, inclusive AI development. The Singapore-based SEA-LION initiative is creating open-source LLMs trained on 11 Southeast Asian languages to capture cultural nuances, while Thailand’s Typhoon model and Indonesia’s NusaWrites are building datasets and models rooted in local context. Together, these efforts offer a powerful alternative to global models that often overlook the region’s linguistic diversity.

Beyond technology, they play a vital preservation role. UNESCO warns that nearly 40% of the world’s languages—many in SEA—are endangered. By creating a digital footprint, these initiatives help safeguard not only languages but also the cultural knowledge embedded within them.

2. Regional Coordination and Shared Infrastructure

To break silos, ASEAN should work alongside universities, community groups, international organisations such as UNESCO, and global open-source initiatives to support interoperable data frameworks and shared standards. Projects like SEA-VL, which pairs over a million culturally relevant images with local-language captions, show both the value and complexity of cross-border collaboration. A Southeast Asian NLP Commons could standardise benchmarks, ethics, and governance, especially for indigenous and low-resource languages. India’s AI4Bharat offers a model, funding open datasets in over 20 Indian languages with government, academic, and civil society support.

3. Enabling Ecosystems through Policy and Incentives

Governments can treat linguistic datasets as public digital goods and fund open-source AI for regional languages. Procurement policies, tax incentives, and grants can spur business investment in inclusion. Policymakers are starting to take notice—ASEAN’s Guide on AI Governance and Ethics and Singapore’s IMDA emphasise inclusive data practices. However, unless language equity becomes a core pillar of digital transformation, SEA risks developing AI that speaks over its people.

4. Trust, Transparency, Inclusion

Language inclusion must be participatory. Co-governance models—where contributors shape data practices and evaluation—build awareness, trust, and ownership. Investing in mentorship, transparency, and shared control ensures SEA’s digital future reflects its full diversity.

 

Conclusion

Southeast Asia’s digital future depends on closing the AI language gap. The region’s linguistic diversity is a strategic asset, but failing to harness it will leave many behind. Policymakers must treat local language data as critical infrastructure, while industry and communities work together to create AI that reflects SEA voices. Inclusive AI is not optional—it is a strategic imperative. By investing in linguistic inclusion now, SEA can bridge the gap and set a global standard for AI that belongs to everyone.

About the Authors

Onno Kampman is an AI Scientist at Singapore’s MOH Office for Healthcare Transformation (MOHT) and a Visiting Scientist at the University of Cambridge. He leads pioneering projects that apply AI to mental health care transformation, and contributes to SEACrowd’s mission to boost Southeast Asian AI capabilities.

Holy Lovenia is the Lead of SEACrowd. Based in London and affiliated with AI Singapore, she drives SEACrowd’s strategy to unify and scale AI resources across Southeast Asia—most recently through initiatives like SEACrowd’s multilingual benchmarks and the SEA‑VL vision‑language dataset.

 

About the Organisation

SEACrowd is a research community advancing Southeast Asia-focused AI and empowering the next generation of AI researchers in the region. The organisation envisions a future where Southeast Asia’s AI ecosystem is mature, globally competitive, and grounded in the region’s diverse linguistic and cultural contexts.

SEACrowd’s initiatives include leading data collection and model development efforts tailored to Southeast Asia, building and connecting a regional research network, and supporting early-career talent through mentorship and hands-on experience via the SEACrowd Apprentice Program.

 

The views and recommendations expressed in this article, published in September 2025, are solely those of the authors and do not necessarily reflect the views and position of the Tech for Good Institute, the MOH Office for Healthcare Transformation (MOHT), or the authors’ respective organisations.
