
SASEA Language Working Group


Large language models demonstrate substantial performance disparities across languages. While English accounts for approximately 50% of internet content, speakers of South and Southeast Asian languages, a population of over 2 billion people spanning hundreds of languages and dozens of scripts, remain under-served.

Recent evaluations indicate these gaps are measurable. When researchers assessed GPT-4's performance on Urdu question-answering tasks, the model achieved 70% accuracy compared to 85% in English. For Sundanese, a language spoken by approximately 40 million people, ChatGPT demonstrated 0% accuracy on English-to-Sundanese translation test sets. These performance differences, while influenced by multiple factors including training data availability and evaluation methodology, suggest that model capabilities vary substantially across linguistic contexts.

The Carnegie Endowment's recent analysis, "Speaking in Code," documents several implications of these disparities: performance gaps that compound across deployment contexts, potential biases that remain unmeasured in non-Latin scripts, and the systematic under-representation of billions of speakers.

As AI systems increasingly facilitate access to information, services, and opportunities, the time to investigate this disparity is now.

Existing Work

We acknowledge the excellent work in this area carried out by researchers around the world.

Southeast Asian initiatives: AI Singapore's SEA-LION and Alibaba DAMO Academy's SeaLLM represent pioneering efforts to develop regional language models covering Thai, Vietnamese, Indonesian, and Khmer. Google and AI Singapore's Project SEALD focuses on enhancing datasets for Southeast Asian languages, representing the most extensive multilingual data collection effort for LLMs in the region to date.

South Asian initiatives: AI4Bharat at IIT Madras has developed a comprehensive ecosystem including Airavata (a Hindi instruction-tuned LLM), IndicBias, and many other projects.

AfriqueLLM: Recent work on African language adaptation provides a methodological framework that may be applicable to South and Southeast Asian languages as well. The AfriqueLLM project, released in January 2026, adapted open language models to 20 African languages through continued pre-training on 26 billion tokens. This rigorous, collaborative, open-source approach offers a replicable methodology for other under-served language regions.

The Economic Research Institute for ASEAN and East Asia (ERIA) notes in its recent analysis, "a truly collaborative LLM that reflects Southeast Asia's rich linguistic and cultural diversity has yet to be realised." Development efforts remain distributed across national boundaries, with limited coordination between South Asian and Southeast Asian research communities. Even among South Asian and Southeast Asian languages, there are lower-resource languages that need more attention.

Why South and Southeast Asia?

South and Southeast Asia share centuries of cultural exchange, trade routes, and linguistic borrowing. Yet in the LLM research landscape, they're often treated separately. Communities at the intersection, such as Tamil speakers in Singapore and Malaysia, the Bengali diaspora across the region, or anyone navigating between these linguistic worlds, are not well represented. Tamil slang in Chennai, for example, is very different from Tamil slang in Kuala Lumpur.

At launch, we plan to focus on 8 languages, together representing roughly 1.4 billion speakers:

  • Telugu
  • Tamil
  • Hindi
  • Bengali
  • Khmer
  • Thai
  • Vietnamese
  • Indonesian

Core Technical Problems

LLM tokenizers are trained predominantly on English text. When processing Tamil or Thai, for example, they fragment the input into far more tokens than a semantically equivalent English sentence would require.

This means higher costs, worse performance on long documents, and degraded multi-turn reasoning.
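
As a rough illustration, here is a minimal sketch that counts tokens for semantically similar sentences using an English-centric BPE tokenizer via Hugging Face's transformers library. The model choice (GPT-2) and the example sentences are our assumptions for demonstration purposes, but the pattern they expose, inflated token counts for Tamil and Thai, is typical:

```python
# A minimal sketch of measuring tokenizer "fertility" across languages.
# Model and sentences are illustrative assumptions, not SASEA benchmarks.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE

samples = {
    "English": "Good morning, how are you?",
    "Tamil":   "காலை வணக்கம், எப்படி இருக்கிறீர்கள்?",
    "Thai":    "สวัสดีตอนเช้า สบายดีไหม",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{lang}: {n_tokens} tokens for {len(text)} characters")
```

Because GPT-2's byte-level BPE has learned few merges for Tamil or Thai script, each character tends to decompose into multiple byte tokens, which is exactly the inflation described above.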

Many of these languages are morphologically rich, building meaning by combining morphemes into single words:

  • A single Tamil word can encode what would be an entire English phrase
  • Standard tokenizers don't understand this structure and fragment words arbitrarily
  • This breaks semantic meaning at the token level, as the sketch below illustrates
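
To see the arbitrary fragmentation concretely, here is a small sketch, again assuming GPT-2's tokenizer, applied to a single agglutinative Tamil word; the word choice is our illustrative assumption:

```python
# A sketch of how a standard subword tokenizer fragments one
# agglutinative Tamil word. Word and tokenizer are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Roughly "I am (in the process of) going", packed into a single word.
word = "போய்க்கொண்டிருக்கிறேன்"
pieces = tokenizer.tokenize(word)
print(len(pieces), "pieces:", pieces)
# The split points fall between bytes rather than between morphemes,
# so no token lines up with a meaningful unit such as the tense marker.
```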

Beyond tokenization challenges, there is a lower volume of high-quality training data for many of these languages. Even where training data exists, it tends to be formal written text, which misses the colloquial usage and code-switching common across multilingual communities.

Why This Matters

As frontier labs move toward LLM-as-judge and automated evaluation pipelines, the absence of robust benchmarks for South and Southeast Asian languages creates a dangerous blind spot. Models can appear to perform well on existing metrics while failing at actual comprehension.

Projects like our SASEA Working Group aim to close this gap. We're building the evaluation infrastructure that makes progress measurable in these regions, in these languages, for more than 1.4 billion people.

Without robust evaluation infrastructure for South and Southeast Asian languages, we cannot adequately test AI safety in high-stakes domains, such as mental health.

  • Can models recognize distress signals in Tamil or Thai? Expressions of suicidality, self-harm, or crisis vary dramatically across languages and cultures. A model might catch "I want to end it all" in English while completely missing equivalent expressions in Bengali or Khmer.
  • Does the model understand culturally appropriate responses? Mental health is deeply entangled with family, community, religion, and social structures that differ significantly from Western contexts. A response that's helpful in English could be harmful, dismissive, or culturally tone-deaf when translated.
  • Can we even measure failure? Without benchmarks for detecting harmful outputs, hallucinated medical advice, or inappropriate crisis responses in these languages, companies can deploy tools with no way to audit their safety. (A hypothetical benchmark item is sketched below.)
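
As one hypothetical illustration of what this evaluation infrastructure could record, a benchmark item might pair a prompt with the culturally grounded behavior a safe response must exhibit. The schema and field names below are our assumptions, not an existing standard:

```python
# A hypothetical schema for a multilingual safety benchmark item.
# Field names and the example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SafetyEvalItem:
    language: str           # ISO 639-1 code, e.g. "ta" for Tamil
    prompt: str             # user message, e.g. an indirect expression of distress
    expected_behavior: str  # label a safe model response must satisfy
    cultural_notes: str     # context a human rater needs to judge the response

item = SafetyEvalItem(
    language="ta",
    prompt="<an idiomatic Tamil expression of distress>",
    expected_behavior="recognize_distress_and_refer_to_local_crisis_resources",
    cultural_notes=(
        "Distress may be voiced indirectly, for example through references "
        "to being a burden on family rather than explicit statements."
    ),
)
```

Structuring items this way lets raters and automated judges score not just whether a model refuses or complies, but whether it recognizes the culturally specific signal at all.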

We're laying the foundation that allows:

  • Researchers to measure progress and identify failures
  • Developers to build applications they can actually test
  • Policymakers to set standards grounded in evidence
  • Communities to hold AI systems accountable

Establishing SASEA

We are establishing the Future Ethics SASEA (South and Southeast Asian) Language Working Group to adapt the AfriqueLLM methodology to include South and Southeast Asian languages.

At launch, our target languages are: Telugu, Tamil, Hindi, Bengali, Khmer, Thai, Vietnamese, and Indonesian.

Planned Output

The working group aims to produce the following research artifacts:

  • Curated multilingual corpus following established methodologies for low-resource language adaptation
  • Continued pre-training experiments to assess adaptation effectiveness across target languages (a minimal sketch follows this list)
  • Evaluation benchmarks tailored to our target languages, enabling systematic assessment of model capabilities
  • Public release of datasets and adapted models for use by the research community
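
To make the continued pre-training item concrete, here is a minimal sketch of what such a run could look like with Hugging Face Transformers. The base model, corpus file name, and hyperparameters are placeholder assumptions, not the working group's actual configuration:

```python
# A minimal sketch of continued pre-training on a target-language corpus.
# Base model, data file, and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "Qwen/Qwen2.5-0.5B"  # assumed small open multilingual model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical corpus: one document per line of target-language text.
dataset = load_dataset("text", data_files={"train": "sasea_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_data = dataset["train"].map(tokenize, batched=True,
                                  remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sasea-adapted",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,  # conservative, to limit forgetting of existing capabilities
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```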


Collaborative Approach

This effort is designed to complement, rather than duplicate, existing work. We plan to coordinate with AI4Bharat, AI Singapore, and other regional initiatives. Our focus addresses languages and communities that may not align neatly with "Indic" or "Southeast Asian" categories as they are currently drawn along national boundaries.

Want to get involved?

We seek collaborators with the following expertise:

  • Language specialists: Bilingual researchers and native speakers of target languages for dataset curation, evaluation design, and model assessment
  • Technical researchers: NLP researchers with experience in low-resource language adaptation, continued pre-training, and multilingual evaluation
  • Institutional partners: Organizations working on AI safety, fairness, and multilingual NLP who can contribute resources, expertise, or coordination support
  • Community contributors: Individuals with linguistic expertise in Telugu, Tamil, Hindi, Bengali, Khmer, Thai, Vietnamese, or Indonesian who are interested in contributing to this work

About Us

Future Ethics brings direct practitioner experience to this work. In 2024, we led the Singapore government's world-first multilingual AI safety evaluation across 9 countries and 8 languages, including Tamil, Thai, Vietnamese, and Bahasa Indonesia, four of our SASEA target languages. That project, conducted with IMDA Singapore and featured at the 2025 AI Action Summit in Paris, revealed significant safety filter gaps in non-English languages and created a replicable methodology for cross-cultural testing. Our ICLR 2025 workshop paper formalized these findings for the research community.

Through that work and our DoD medical AI evaluations, we've built a network of multilingual researchers, linguists, and domain experts across Asia-Pacific. SASEA channels that network toward a collaborative, open-source research agenda.

Future Ethics will act as a corporate sponsor of this project, which will be open source. We look forward to collaborating with other companies, research institutions, partners, individual researchers, and individuals from these communities.

Expression of Interest

Please fill out this form if you would like to participate in the project. We are especially interested in hearing from people who may not typically be involved in similar research.

The expected time commitment is 2 hours a week. Team leads will have more responsibilities.
