Haize Labs with Leonard Tang — Weaviate Podcast #121!
“We’ve got a big opportunity on our hands… We can test not just the Frontier Labs as customers, but basically anybody who is trying to build an AI application.”
Context & Background
As AI systems become increasingly integrated into enterprise environments, the gap between demo-ready AI products and truly enterprise-grade applications remains a significant challenge. In this episode of the Weaviate Podcast, Leonard Tang, co-founder of Haize Labs, discusses how his company is rethinking AI evaluation with new approaches to testing and quality assurance. Coming from a background in adversarial testing and robustness research, Leonard brings a perspective that is particularly relevant in today’s landscape, where AI reliability is paramount. His transition from academic research to founding Haize Labs offers valuable insights for anyone looking to bridge theoretical AI safety concepts with practical commercial applications.
Key Insights
From Academic Research to Founding Haize Labs
Leonard’s journey into AI evaluation began with computational neuroscience research, which later evolved into adversarial attack work. As AI hype accelerated in late 2022 and throughout 2023–2024, he noticed a significant problem: despite the proliferation of demo-ready AI products, few were truly enterprise-grade, robust, and production-ready.
“We sort of felt that it was deeply unsatisfying, that there were so many demo-ready AI products, but not truly enterprise-grade, robust, productized products that were being used and built on top of each other,” Leonard explained.
His academic work on adversarial testing provided the perfect foundation for addressing AI reliability questions in commercial settings. Rather than completing his planned PhD at Stanford, Leonard committed to commercializing his research, spending his final undergraduate semester in New York launching Haize Labs. What started as an intellectual exercise quickly revealed its commercial potential when they began working with frontier AI labs like OpenAI, Anthropic, and AI21 Labs to test and audit their models before release.
Revolutionizing User Experience for Evaluations
Leonard challenges conventional thinking about evaluation interfaces. While most companies focus on presenting evaluation results through tables and charts, Haize Labs has reimagined how subject matter experts interact with evaluation systems.
Instead of simply asking users to rate AI responses and provide explanations, they implement contrastive evaluation methods, presenting users with multiple AI responses to the same input and asking them to explain why one response is better than another. This approach yields more precise feedback:
“Humans are notoriously inconsistent on their own numeric scales, but giving qualitative feedback in this directionally well-scoped way is easier for the human and also much more useful signal for our own judge algorithms.”
This insight reflects a deeper understanding of how humans assess quality, focusing on comparative judgments rather than absolute ratings to produce more consistent and actionable feedback.
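To make the idea concrete, here is a minimal sketch of what collecting contrastive feedback might look like in Python. The record fields and the comparison prompt wording are illustrative assumptions, not Haize Labs’ actual schema or annotation interface.

```python
# Minimal sketch of collecting contrastive (pairwise) feedback instead of
# absolute 1-5 ratings. Field names and prompt wording are illustrative
# assumptions, not Haize Labs' actual schema.
from dataclasses import dataclass

@dataclass
class ContrastiveJudgment:
    prompt: str          # the input both responses answered
    response_a: str
    response_b: str
    preferred: str       # "A" or "B"
    rationale: str       # why the preferred response is better

def render_comparison(prompt: str, response_a: str, response_b: str) -> str:
    """Format a side-by-side comparison for a subject-matter expert."""
    return (
        f"Input:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better, A or B? Explain concretely why the one "
        "you chose is better than the other."
    )

# Example usage: the expert's choice and free-text rationale become
# training signal for a downstream judge model.
record = ContrastiveJudgment(
    prompt="Summarize this support ticket.",
    response_a="...",
    response_b="...",
    preferred="B",
    rationale="B keeps the customer's requested refund amount; A drops it.",
)
```

The expert’s choice plus a free-text rationale is the kind of “directionally well-scoped” signal Leonard describes, and it can later be used to train or calibrate an automated judge.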
Scaling Judge-Time Compute with Verdict
Perhaps the most innovative aspect of Haize Labs’ approach is their concept of “scaling judge-time compute” through their open-source Verdict library. While much of the AI world focuses on scaling compute during model training, Leonard recognized the untapped potential of using more computational resources during evaluation.
The Verdict library implements various “scalable oversight architectures” drawn from AI safety research, including debate, ensembling, and adversarial wargaming. These techniques allow weaker models to effectively evaluate stronger models through clever coordination:
“The core insight is, Verdict scales judge-time compute in a very particular way with a very particular inductive bias — the inductive bias being scalable oversight architectures. And this gets us extremely good performance.”
Remarkably, their experiments show that by stacking multiple instances of relatively inexpensive models like GPT-4o mini in structured evaluation pipelines, they can achieve better evaluation performance than using more advanced models like GPT-4.1 or o3-mini. This represents a significant cost advantage while improving evaluation quality.
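The snippet below is a hand-rolled illustration of the idea, not Verdict’s actual API: several independent calls to an inexpensive judge model are aggregated by majority vote, trading a few extra cheap inference calls for a more reliable verdict. The prompt, the PASS/FAIL rubric, and the choice of gpt-4o-mini are assumptions for the example, and it presumes an OPENAI_API_KEY is configured.

```python
# Hand-rolled illustration of "scaling judge-time compute": several cheap
# gpt-4o-mini judges score the same response independently and their votes
# are aggregated. This is NOT the Verdict API; see the library on GitHub
# for its declarative interface. Prompt and rubric are assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading an AI answer for factual correctness.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with exactly one word: PASS or FAIL."
)

def one_judge_vote(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # sampling diversity across ensemble members
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().upper()

def ensemble_verdict(question: str, answer: str, n_judges: int = 5) -> str:
    """Majority vote over n independent judge calls."""
    votes = Counter(one_judge_vote(question, answer) for _ in range(n_judges))
    return votes.most_common(1)[0][0]
```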
Debate Judges and Declarative Pipelines
Leonard detailed how debate techniques can be implemented in evaluation systems. Rather than using a single judge, multiple judge models can argue for different ratings and perspectives, either in parallel or through round-robin discussions, with meta-judges choosing the most compelling assessment.
To manage these complex evaluation structures, Haize Labs created a declarative framework that allows engineers to compose evaluation pipelines without rewriting code for every scenario:
“We were like, okay, wait, there’s so many different ways to stack the actual debate architecture… It sounds pretty painful to roll code individually for all these different small scenarios. Why don’t we just build a fully declarative, fully composable framework?”
This approach makes sophisticated evaluation techniques accessible to AI developers without requiring them to implement complex architectures from scratch.
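As a rough sketch of what “declarative and composable” can mean here, the example below describes a debate-then-meta-judge pipeline as plain data and runs it with a generic executor, so adding a third debater or swapping the meta-judge is a change to the spec rather than new code. The stage names, prompts, and model choices are invented for illustration and do not mirror Verdict’s real syntax.

```python
# Conceptual sketch of a declarative debate pipeline: stages are declared as
# data and executed by a generic runner. Prompts, stances, and models are
# illustrative assumptions, not Verdict's actual interface.
from openai import OpenAI

client = OpenAI()

def call(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Each debater argues for a different verdict; a meta-judge weighs the arguments.
PIPELINE = {
    "debaters": [
        {"model": "gpt-4o-mini", "stance": "argue the response fully satisfies the rubric"},
        {"model": "gpt-4o-mini", "stance": "argue the response violates the rubric"},
    ],
    "meta_judge": {"model": "gpt-4o-mini"},
}

def run_debate(pipeline: dict, rubric: str, response: str) -> str:
    arguments = []
    for d in pipeline["debaters"]:
        arguments.append(call(
            d["model"],
            f"Rubric: {rubric}\nResponse under review: {response}\n"
            f"Your task: {d['stance']}. Give your strongest concrete argument.",
        ))
    transcript = "\n\n".join(f"Debater {i + 1}: {a}" for i, a in enumerate(arguments))
    return call(
        pipeline["meta_judge"]["model"],
        f"Rubric: {rubric}\nResponse: {response}\n\nDebate transcript:\n{transcript}\n\n"
        "Weigh the arguments and output a final verdict, PASS or FAIL, "
        "followed by one sentence of justification.",
    )
```

Because the pipeline is just a dictionary, round-robin variants or extra debaters can be expressed by editing the spec, which is the pain point the quote above describes.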
Hidden Gems
Custom Reward Models Through Efficient Preference Learning
A fascinating insight from the conversation was how Haize Labs creates customer-specific reward models with minimal preference data. Leonard revealed that they’ve developed techniques to bootstrap judge models using meta-judges that provide rewards, requiring very little customer data to start.
“The question again comes back to: how do we elicit out these preferences from the customer without having this dataset?” Leonard explained. Their solution involves sophisticated reinforcement learning methods that can align judge models with customer preferences even with scant initial data.
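Haize Labs’ exact method isn’t spelled out in the episode, but a standard building block for learning preferences from very few pairs is a Bradley-Terry style fit, sketched below under the assumption that each response is represented by a pre-computed feature vector (for example, an embedding). This stands in for, and is not, their proprietary approach.

```python
# Standard Bradley-Terry preference fit, standing in for the unpublished
# method described in the episode: given a handful of "A preferred over B"
# pairs, learn a scoring function where a higher score means "more preferred".
# Responses are assumed to be represented by pre-computed feature vectors.
import numpy as np

def fit_bradley_terry(chosen: np.ndarray, rejected: np.ndarray,
                      lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """chosen/rejected: (n_pairs, dim) features; returns weights w such that
    w @ x is the learned preference score."""
    w = np.zeros(chosen.shape[1])
    for _ in range(steps):
        margin = (chosen - rejected) @ w          # score(chosen) - score(rejected)
        p = 1.0 / (1.0 + np.exp(-margin))         # P(chosen beats rejected)
        grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad                            # gradient step on the log-likelihood
    return w

# Toy example: four preference pairs in a 3-dimensional feature space.
rng = np.random.default_rng(0)
chosen = rng.normal(size=(4, 3)) + 1.0    # preferred responses skew higher
rejected = rng.normal(size=(4, 3))
w = fit_bradley_terry(chosen, rejected)
print("preferred scores:", chosen @ w)
print("rejected scores: ", rejected @ w)
```

Even a small number of pairs constrains such a model meaningfully, which is consistent with the claim that customer-specific judges can be bootstrapped from scant initial data.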
Mechanistic Interpretability and Its Current Limitations
Leonard offered a refreshingly pragmatic perspective on mechanistic interpretability (MechInterp), suggesting that despite significant research interest, the field is still maturing:
“I think it’s still early before we get truly actionable and truly useful MechInterp tooling for auditing purposes,” he observed, noting that many current approaches rely heavily on correlational methods that can be “fickle in practice.”
He emphasized that ultimately, interpretability tools are only as good as the data used to train them — bringing the conversation full circle back to the fundamental importance of evaluation.
Practical Takeaways
For Beginners: Improve Evaluation Quality with Contrastive Assessment
Instead of asking evaluators to rate individual responses on a scale, implement comparative evaluations where multiple responses are assessed side-by-side. This approach yields more consistent human feedback and provides clearer signals about what makes a good response.
For Intermediate Practitioners: Implement Ensemble Judges
Use multiple judge models in parallel to evaluate your AI systems, then aggregate their assessments. This approach can significantly improve evaluation quality without requiring sophisticated architectures. Libraries like Verdict make this implementation straightforward.
For Advanced Users: Design Custom Judge Pipelines
Create sophisticated evaluation architectures that combine debate, ensembling, and other scalable oversight techniques. Experiment with different configurations to find approaches that best align with your specific evaluation criteria and use cases.
Conclusion
Leonard Tang’s insights reveal that effective AI evaluation requires more than just benchmark comparisons — it demands sophisticated architectures that can scale judge-time compute, elicit precise human preferences, and adapt to specific use cases. As AI systems become more capable, these evaluation techniques will be crucial for ensuring reliability and safety.
Whether you’re building AI applications, evaluating third-party models, or working with Weaviate’s vector search capabilities, implementing these advanced evaluation strategies could be the difference between a demo-ready product and a truly enterprise-grade solution.
Further Resources
- Verdict Library on GitHub
- Awesome LLM Judge Repository
- Scaling Judge-Time Compute Research Paper
- Weaviate Query Agent Documentation
- LM Unit Testing (Stanford)
Weaviate Podcast #121 with Leonard Tang!