Commentary

The Rise of Regional Language Models in Southeast Asia

A paper by the Carnegie Endowment for International Peace examines the development of regional large language models (LLMs), highlighting their potential to serve underrepresented cultural and linguistic groups within Southeast Asia.

By Chloe Tan, Tech For Good Institute

Data from AI Singapore, sourced from Hugging Face, finds that 73% of existing LLMs come from the United States and China. Of these models, 95% are trained primarily on data in English, or on a mix of Arabic, Chinese or Japanese. Research finds that 88% of the world’s languages are underrepresented online, leaving over a billion people unable to use their native language in the digital world. The entry of Southeast Asian models brings underrepresented voices into an ecosystem otherwise dominated by external languages, perspectives and resources.

Rationale for Building Localised Large Language Models

Southeast Asia is home to over 1,200 languages, each reflecting distinct power structures, identities, and cultural values. While global AI models, such as ChatGPT, have expanded their support for regional languages, many still fail to capture the cultural nuances, idiomatic expressions, and historical contexts that define them.

A key limitation of existing multilingual models is their English-centric bias. Research suggests that these models often process non-English inputs by first translating them into English, analysing the content in English, and then translating the output back into the original language. This process risks distorting meaning, particularly in cases where linguistic and cultural frameworks differ significantly from Western conventions.

For example, Balinese historical writing reflects and reinforces social and cultural structures, so what might be seen as mythical tales in Western terms are actually valid historical records in Bali. In Balinese culture, there is no direct equivalent to the Western notion of fiction. Everything is considered true, with varying levels of sacredness depending on the form, language, and narrative. Western-trained LLMs may struggle to accurately convey this perspective, reducing historical texts to mere folklore rather than recognising their integral role in Balinese history. Replicating language is one thing, but accurately conveying the knowledge and subtlety behind it is quite another.

Beyond linguistic concerns, there is also a need to understand the sociopolitical landscape of each country, because language choices and policies create distinctly different challenges for LLM development. To better understand these complexities, let’s explore how language policies and societal dynamics unfold in specific countries.

Singapore

Code-switching or mixing is particularly characteristic of Singlish, a colloquial variety of English in Singapore that blends words, grammar and tones from the multiple languages spoken around it. For decades, the legitimacy of Singlish has been debated, with the central question being whether it threatens the country’s image and economic advancement as a global hub.

Initially, the Singaporean government aimed to minimise, if not eradicate, Singlish, viewing it as a threat to the nation’s global image, economic progress, and ability to function as an international hub. Campaigns like the Speak Good English Movement reflected this position, promoting Standard English as a tool for upward mobility and global relevance.

More recently, however, this stance has moderated: Singlish is now generally celebrated and marketed as uniquely, authentically Singaporean, transcending class relations.

Vietnam

In Vietnam, which is home to more than 100 different languages spoken by 50 ethnic groups, Vietnamese serves as the national and most widely used language. This shared linguistic identity stands in contrast to Vietnam’s early history, when periods of foreign occupation hindered the development of a unified national language.

Following the country’s partition in 1954, the North and South adopted distinct language policies shaped by their respective political ideologies and foreign influences. However, with the onset of Đổi Mới (economic reforms) in the 1980s, Vietnam experienced rapid global integration, during which English rose in prominence, becoming the dominant foreign language by 2005. This evolution in language use reflects the country’s shifting political and economic priorities, which in turn shape the data environment and sociocultural sensitivities relevant to LLM development.

Thailand

Though often perceived as linguistically homogenous, Thailand is home to over seventy indigenous languages. The state has largely succeeded in fostering a unified Thai identity among its diverse ethnic communities, many of whom have migrated or assimilated over time. However, this national integration has not been without friction.

Long-standing tensions between the central government in Bangkok and the southern provinces have shaped language policy in complex ways. In particular, the use and teaching of Pattani-Malay, a language predominantly spoken in the Muslim-majority south, were historically viewed with suspicion. This stemmed from concerns over separatism and resistance to assimilation, highlighting how language in Thailand has been both a tool of unity and a site of political sensitivity.

Against this backdrop of cultural and nation-building complexities in Southeast Asia, the development of localised LLMs becomes all the more important. The region’s leaders in government and the tech community remain fully aware of the importance of preserving local perspectives and cultures as technology grows increasingly interconnected.

Policy Recommendations for Localised Large Language Models

Models can only be effectively trained and deployed for use by society when they draw on data from experiences and interactions among people, and between people and their surroundings. Southeast Asia’s growing self-assurance and the emergence of localised LLMs help the region assert practical agency over long-term technological advancement. The authors of the paper, Elina Noor and Binya Kanitroj, offer several key policy recommendations for harnessing the power of these models in the region, including:

1. Build a larger network of cross-disciplinary researchers

a. Southeast Asia’s technology community presents a unique opportunity for collaboration to advance natural language processing. Experts from fields like history, anthropology, and the social sciences can help uncover overlooked aspects of development that are critical to the region’s context. Thorough analysis and open dialogue among developers and other stakeholders become even more essential when creating models for sensitive applications.

2. Draw lessons from other regions and communities that are on a similar trajectory

a. Researchers exploring the intersection of society and technology can draw inspiration from Africa, where developers are creating LLMs tailored to African languages by collaborating directly with local communities. Initiatives like the Africa-Asia AI Policymakers Network also foster connections between government officials across both continents to promote the development of responsible and relevant AI ecosystems. Similarly, Southeast Asian researchers can explore the potential of Indigenous-led initiatives, such as the Maori data sovereignty movement, to ensure technology is applied in ways that respect and empower local communities.

3. Explore the feasibility of multi-input AI models

a. In Southeast Asia, much of language and communication relies on high-context interactions, where nonverbal cues such as tone and body language are crucial for conveying nuance. This creates a challenge for regional text-based language models, which may struggle to capture the full complexity of communication. While governments and stakeholders are aware of these challenges, they also recognise that transitioning to multimodal models requires significant resources.

Singapore’s National Multimodal LLM Program, which includes SEA-LION, is working to develop techniques that integrate speech data with nonverbal cues like tone and pitch. As other models in the region look to SEA-LION as a foundation, further advancements in multimodal capabilities could drive similar improvements, depending on available resources.

This discussion paper provides invaluable insights into the socio-political challenges involved in developing language models that authentically reflect the diverse worldviews, languages, and values of Southeast Asia. A powerful quote from African scholars Angella Ndaka and Geci Karuri-Sebina highlights the importance of critically examining the assumptions driving technological representation: “When we volunteer our data and ourselves in the name of digital inclusion, where are we being included? Whose agendas dominate in the technology being developed? And whose technologies are being produced anyway?”

These questions urge us to scrutinise the power dynamics shaping technological development. By exploring the deeper implications behind these questions, we can gain a clearer understanding of how digital inclusion can either reinforce or challenge existing inequalities.

Incorporating these perspectives into the development of Southeast Asia’s AI and technology landscape can foster innovations that are not just inclusive but also culturally relevant and equitable. As the region continues to advance technologically, ensuring that these advancements benefit all communities, while respecting their unique cultural identities, should be a central goal. Only through such an approach can we ensure that technological progress genuinely supports the growth of a fairer, more inclusive Southeast Asia.