The Lab

Our experiments and insights from tinkering at the frontier of AI

Building an automated Evals workflow that works (and open-sourcing it)

How we built Kaleidoscope: A structured workflow for realistic, scalable, and human-aligned contextual AI evaluations.
Responsible AI
The Road Under the Harness

Previously I wrote about building a harness for yourself. This one is about the environment you're building in, and why at enterprise scale, if the platform underneath doesn't exist, individual wins have nowhere to accumulate.
Agentic, Infrastructure
Scaling the Pentesting Team with AI

Engineering Multi-Agent Architectures for Autonomous Penetration Testing.
Agentic, Security
Harnessing the harness

On building your own multi-agent orchestrator, and why owning the infrastructure around AI matters.
Agentic
Video Generation Landscape Analysis: The Road to Informative Video

We tested 2026 SOTA models and found a "usability gap".
Multimodal
Yes, you’re absolutely right… Right? A mini survey on LLM sycophancy

Ever spoken to an AI and felt like it was responding with insincere praise?
Responsible AI
MetaEvaluator: Systematically Evaluate Your LLM Judges

Measure how well your app is performing and, more importantly, where it's failing.
Evals
Building for Agentic AI - Agent SDKs & Design Patterns

The true value of AI agents lies in loops and self-correction rather than raw reasoning power.
Agentic
Benchmarking GPT-5 & GPT-OSS: A Responsible AI Approach

Evaluating dimensions often overlooked by traditional benchmarks.
Responsible AI, Evals
(Part 2) LLM Safety Alignment for the Singapore Context using Supervised Fine-tuning and RLHF-based Methods

Safety must be "baked in".
Responsible AI
(Part 1) LLM Safety Alignment for the Singapore Context using Supervised Fine-tuning and RLHF-based Methods

The process of "teaching" models to be safe.
Responsible AI
Eliciting Toxic Singlish from r1

A red-teaming exercise that proves even "reasoning" models can be coaxed.
Responsible AI