Leanne Tan

Data Scientist at GovTech's AI Practice (Responsible AI). Learning new things every day, here to share stories and insights! Let's connect :)

The Lab

Building an automated Evals workflow that works (and open-sourcing it)

How we built Kaleidoscope: A structured workflow for realistic, scalable, and human-aligned contextual AI evaluations.

Responsible AI

The Lab

Yes, you’re absolutely right… Right? A mini survey on LLM sycophancy

Ever spoken to an AI and felt like it was responding with insincere praise?

Responsible AI

The Lab

MetaEvaluator: Systematically Evaluate Your LLM Judges

Measure how well your app is performing and, more importantly, where it's failing.

Evals

The Lab

Benchmarking GPT-5 & GPT-OSS: A Responsible AI Approach

Evaluating dimensions often overlooked by traditional benchmarks.

Responsible AI, Evals

The Studio

Introducing LionGuard 2: Multilingual LLM Guardrail for Singapore

We improved its coverage and robustness.

Responsible AI

RabakBench: Multilingual AI Safety Evaluation Made Local

Global safety guardrails are often blind to local dialects and sensitivities.

Responsible AI

Validating Annotation Agreement between Humans and LLMs

Who Judges the Judge? At GovTech’s AI Practice, we’ve been embracing what’s known as “LLM-as-a-judge”, that is, employing LLMs as evaluators across our AI workflows. It has become a powerful part of our evaluation toolkit. We use LLMs extensively across multiple areas: judging other