Past Work
Here is a selection of past work in AI safety and bias testing and evaluation, as well as original AI/ML/NLP research.
2025
Department of Defense Crowdsourced AI Red-Teaming Program
- Led large-scale medical AI evaluation with Humane Intelligence and DOD's Chief Digital and Artificial Intelligence Office (CDAO) (link)
- Coordinated a red teaming exercise testing Large Language Model (LLM) chatbots for military medicine applications, with over 200 participants, including clinical providers and healthcare analysts from the Defense Health Agency (DHA), the Uniformed Services University of the Health Sciences, and the military services.
Results:
- 800+ findings of potential vulnerabilities and biases
- Compared three popular LLMs across medical use cases
- Created benchmark datasets for evaluating future vendors
- Findings informed DOD policies for responsible Generative AI use
ICLR Workshop: Multicultural and Multilingual AI Systems
- Poster presentation: "Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific" on multilingual red teaming methodologies at an International Conference on Learning Representations (ICLR) workshop. (link)
Expert Speaker at Portland State University
Presented on "Measuring Meaning Through Adversarial Testing" at Portland State University's Natural Language Processing lab, bridging academic NLP research and practical AI safety deployment.
The presentation demonstrated how systematic red teaming reveals what AI systems actually understand versus what they appear to understand. Key topics included multilingual safety gaps (GPT-4 fails safety tests 79% of the time in low-resource languages vs. <1% in English), multi-turn attacks exposing context limitations, and evaluation methodology gaps between benchmark performance and real-world robustness; a short sketch of this kind of per-language measurement follows the research questions below.
Research questions raised:
- Can we build language-agnostic safety filters that understand harmful intent regardless of language?
- How do we systematically evaluate cross-lingual safety across diverse linguistic contexts?
- What feedback loops exist between adversarial testing findings and model training improvements?
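To make the per-language safety gap concrete, here is a minimal sketch of how failure rates might be tallied from red team findings, grouped by prompt language. The schema, field names, and sample data are hypothetical illustrations, not the actual evaluation pipeline used in the exercises above; it only assumes each finding records the prompt language and a reviewer's unsafe/safe judgment.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Finding:
    """One red team finding (hypothetical schema for illustration)."""
    language: str   # language of the adversarial prompt, e.g. "en", "th"
    unsafe: bool    # reviewer judgment: did the model produce unsafe output?

def failure_rates(findings: list[Finding]) -> dict[str, float]:
    """Return the share of findings judged unsafe, per prompt language."""
    attempts, failures = Counter(), Counter()
    for f in findings:
        attempts[f.language] += 1
        if f.unsafe:
            failures[f.language] += 1
    return {lang: failures[lang] / attempts[lang] for lang in attempts}

# Illustrative (made-up) data: the same attack family tried in English and Thai.
sample = [
    Finding("en", False), Finding("en", False), Finding("en", True),
    Finding("th", True), Finding("th", True), Finding("th", False),
]
print(failure_rates(sample))  # e.g. {'en': 0.33..., 'th': 0.66...}
```

Comparing the same attack family across languages in this way is what surfaces the gap between benchmark performance in English and real-world robustness elsewhere.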
2024
Appointed to GSA's Acquisition Policy Federal Advisory Committee (GAP FAC)
Adrianna Tan was appointed to the U.S. General Services Administration's Acquisition Policy Federal Advisory Committee (GAP FAC) in 2024, advising on the integration of artificial intelligence and emerging technologies into federal procurement processes.
The committee brings together leaders from academia, industry, and government to modernize federal acquisition strategies. GAP FAC's focus includes integrating AI, data analytics, cloud computing, and cybersecurity into government procurement to drive innovation and efficiency.
World's first multicultural and multilingual red team
- Led 9-country, 8-language AI safety evaluation with Singapore's Info-communications Media Development Authority (IMDA) and Humane Intelligence (link)
- First-ever multilingual human-led red teaming exercise conducted across the Asia-Pacific region.
- 54 participants from nine countries tested AI systems in Chinese, Malay, Tamil, Bahasa Indonesia, Thai, Vietnamese, and English
- 300+ more participated virtually
Results:
- Published world's first multilingual expert-led red teaming evaluation report
- Demonstrated significant safety filter gaps in non-English languages
- Established baseline for regional AI safety standards across East, Southeast, and South Asia
- Created replicable methodology for multilingual testing that addresses demographic and cultural disparities overlooked in Western-centric AI evaluations
Nominated as one of the Top Women in GovTech

2021
- Invited as a government expert to the IEEE-NYU AI Procurement Primer and roundtable (link)