Leanne Tan

Data Scientist at GovTech's AI Practice (Responsible AI). Learning new things every day, here to share stories and insights! Let's connect :)
Building an automated Evals workflow that works (and open-sourcing it)
The Lab

How we built Kaleidoscope: A structured workflow for realistic, scalable, and human-aligned contextual AI evaluations.
Responsible AI
Yes, you’re absolutely right… Right? A mini survey on LLM sycophancy
The Lab

Ever spoken to an AI and felt like it was responding with insincere praise?
Responsible AI
MetaEvaluator: Systematically Evaluate Your LLM Judges
The Lab

Measure how well your app is performing and more importantly where it's failing.
Evals
Benchmarking GPT-5 & GPT-OSS: A Responsible AI Approach
The Lab

Evaluating dimensions often overlooked by traditional benchmarks.
Responsible AI, Evals

RabakBench: Multilingual AI Safety Evaluation Made Local

Global safety guardrails are often blind to local dialects and sensitivities.
Responsible AI

Validating Annotation Agreement between Humans and LLMs

Who Judges the Judge? At GovTech’s AI Practice, we’ve been embracing what’s known as “LLM-as-a-judge”: essentially, employing LLMs as evaluators across our AI workflows. It has become a powerful part of our evaluation toolkit. We use LLMs extensively across multiple areas: judging other LLM outputs (e.