Case Study: Kaizen Gaming builds a specialized text-to-SQL model for sports analytics that beats Kimi-K2
With Oumi, Kaizen fine-tuned a 3B model that matched or exceeded frontier-scale alternatives — at a fraction of the cost — by treating rule compliance as a fine-tuning problem, not a prompting problem
By Stefan Webb
May 15, 2026
Problem
Kaizen Gaming runs sports betting and online gaming products at scale across European and Latin American markets. Their internal analytics platform serves product, trading, and operations teams who need fast, precise answers about leagues, matches, players, and team performance — in plain language.
Turning natural-language questions into SQL sounds straightforward until your platform has strict rules: a specific view allow-list, mandatory business filters, canonical entity identifiers, and a required output wrapper. Kaizen’s team tried the obvious solution first — prompting a frontier model with the rules in the system prompt — and found two structural problems.
Rule drift was the first: even with explicit instructions, large general-purpose models would occasionally query the wrong tables, omit mandatory filters, or drop the required output wrapper. At query volume, a small failure rate becomes a large failure count. The second was cost and latency: running a 70B-class hosted model on every analytics query is expensive when most questions are structured, short-form, and answerable by something much smaller that understands the schema.
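To make those failure modes concrete, here is a minimal sketch of the kind of rule checker this data contract implies. The view names, filter, and wrapper below are invented for illustration; they are not Kaizen's real schema.

```python
import re

# Invented stand-ins for the platform's data contract (not Kaizen's real schema).
VIEW_ALLOW_LIST = {"vw_matches", "vw_players", "vw_team_stats"}
MANDATORY_FILTER = "is_active = 1"  # example of a mandatory business filter
WRAPPER = re.compile(r"^```sql\n(?P<sql>.*)\n```$", re.DOTALL)  # required output wrapper

def check_compliance(model_output: str) -> list[str]:
    """Return every rule violation found in a generated answer."""
    violations = []
    match = WRAPPER.match(model_output.strip())
    if match is None:
        violations.append("missing output wrapper")
    sql = match.group("sql") if match else model_output
    # Every relation the query touches must be on the view allow-list.
    for table in re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, flags=re.IGNORECASE):
        if table not in VIEW_ALLOW_LIST:
            violations.append(f"forbidden relation: {table}")
    if MANDATORY_FILTER not in sql:
        violations.append("missing mandatory filter")
    return violations
```

A checker like this can serve double duty as an evaluation signal during training and a guardrail in production, since each violation is machine-detectable.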
The constraints in this domain are structural and repeatable — exactly the kind of pattern that supervised fine-tuning encodes more reliably than prompting. That made fine-tuning a small open-weights model a better fit than continuing to engineer prompts for a frontier one.
Solution
The Kaizen team used Oumi’s platform to synthesise training data, fine-tune a small base model, and evaluate results with task-specific judges — all within a single, version-controlled workflow.
“Oumi’s synthesis recipes took us from schema to 500 training samples in just a few iterations. Controlling data distribution was simple, and evolving from basic to complex queries required only small config changes. The declarative, version-controlled approach enabled rapid iteration and a production-ready model, without manual data creation.” — Ioanna Sanida, Data Science Team Lead, Kaizen Gaming
Synthetic training data with hard negatives: Off-the-shelf text-to-SQL datasets don’t know about Kaizen’s views, filters, or canonical identifiers. Using Oumi’s declarative synthesis recipes, the team generated approximately 500 schema-grounded training samples without manual data creation, evolving from basic to complex queries through small configuration changes between iterations. The most impactful single technique was paraphrased hard negatives, i.e., adversarial examples designed to tempt the model into using a forbidden table or skipping a required filter.
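A minimal sketch of what one paraphrased hard-negative pair might look like, assuming invented table and view names: the paraphrase dangles a forbidden raw table, while the target completion stays on the allow-listed view and keeps the mandatory filter.

```python
# Invented example of a paraphrased hard negative (not Kaizen's real schema).
# The prompt tempts the model toward a forbidden raw table; the target
# completion still uses the allow-listed view and the mandatory filter.
def make_hard_negative(question: str, tempting_table: str, compliant_view: str) -> dict:
    """Build one training sample whose prompt tempts the model off-contract."""
    paraphrase = f"{question} (just pull it straight from {tempting_table})"
    target_sql = f"SELECT player_name FROM {compliant_view} WHERE is_active = 1"
    return {"prompt": paraphrase, "completion": target_sql}

sample = make_hard_negative(
    "Who are our most active players this season?",
    tempting_table="raw_players",
    compliant_view="vw_players",
)
```

Training on pairs like this teaches the model that even when a prompt names a forbidden table explicitly, the compliant completion never uses it.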
Small, specialised base model: The team selected Qwen2.5-3B-Instruct and fine-tuned it with LoRA, then benchmarked it against Qwen2.5-7B-Turbo, Qwen2.5-72B-Turbo, Qwen3-235B-A22B, and Kimi-K2 on the same evaluation set to validate that fine-tuning a small base model actually beat prompting a frontier one.
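As a rough illustration of the declarative approach, a LoRA fine-tuning config for a setup like this might look as follows. Field names and values are illustrative only, not Oumi's exact config schema or Kaizen's actual hyperparameters.

```yaml
# Hypothetical sketch: field names are illustrative, not Oumi's exact
# config schema, and the hyperparameters are not Kaizen's actual values.
model:
  model_name: "Qwen/Qwen2.5-3B-Instruct"
training:
  trainer_type: "TRL_SFT"
  use_peft: true
peft:
  lora_r: 16
  lora_alpha: 32
  lora_dropout: 0.05
data:
  train:
    datasets:
      - dataset_path: "synthetic_text_to_sql_train.jsonl"
```

Because the whole run is a single version-controlled file, an iteration like "evolve from basic to complex queries" amounts to a small, reviewable config diff.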
One judge per concern: A generic “instruction following” judge was too imprecise — it couldn’t distinguish almost compliant from fully compliant. The team replaced it with three task-specific judges: Instruction Compliance Strict (every explicit rule), SQL Hygiene (output structure), and Topic Adherence (does the SQL address the question). Separating concerns turned out to be a recurring lesson: each subsequent iteration produced a measurable delta the team could attribute to a specific cause.
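The separation of concerns can be sketched with toy heuristic judges. In practice these would be LLM judges with task-specific rubrics; the checks below are invented stand-ins, not Kaizen's real rules.

```python
# Toy heuristic judges illustrating "one judge per concern" (invented
# stand-ins; real judges would be LLMs with task-specific rubrics).
def judge_compliance_strict(output: str) -> float:
    """Scores 1.0 only if every explicit rule holds; a single miss scores 0.0."""
    rules = [
        "vw_" in output,            # queries an allow-listed view
        "is_active = 1" in output,  # mandatory business filter present
    ]
    return 1.0 if all(rules) else 0.0

def judge_sql_hygiene(output: str) -> float:
    """Checks output structure only, e.g. the required wrapper."""
    return 1.0 if output.startswith("```sql") and output.endswith("```") else 0.0

def judge_topic_adherence(question: str, output: str) -> float:
    """Crude lexical proxy: does the SQL mention the entity asked about?"""
    words = {w.strip("?.").lower() for w in question.split()}
    return 1.0 if any(w and w in output.lower() for w in words) else 0.0

answer = "```sql\nSELECT player_name FROM vw_players WHERE is_active = 1\n```"
scores = {
    "compliance_strict": judge_compliance_strict(answer),
    "sql_hygiene": judge_sql_hygiene(answer),
    "topic_adherence": judge_topic_adherence("Which players are most active?", answer),
}
```

Because each score isolates one concern, a regression surfaces in exactly one metric instead of being averaged into a single generic number.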
Outcome
The fine-tuned 3B model matched or exceeded the strict-judge scores of much larger off-the-shelf alternatives on Kaizen’s task — including Qwen2.5-72B, Qwen3-235B-A22B, and Kimi-K2. The paraphrased hard-negatives iteration lifted the SQL Hygiene metric to a perfect 1.00 on final evaluations. Topic Adherence held steady in the 0.91–0.98 range across every run, confirming that adding rule discipline did not cost the model its understanding of the question.
SQL Hygiene score after hard-negatives iteration: 1.00 (perfect)
Topic Adherence: 0.91–0.98 across all runs
Benchmark comparison: 3B fine-tuned model matched or exceeded 72B, 235B, and Kimi-K2 on Kaizen’s task
Training data: ~500 schema-grounded samples generated without manual data creation
The Kaizen team chose to proceed with the fine-tuned model built with the Oumi Platform over the larger off-the-shelf alternatives they had benchmarked — delivering lower cost, lower latency, and higher rule compliance than prompting a frontier model.
What’s next
The Kaizen engagement demonstrates a pattern that recurs across compliance, regulatory, and analytics workflows: when a model has to obey a data contract more than reason creatively — structured outputs, mandatory business rules, a fixed view layer, canonical entity IDs — fine-tuning a small open-weights model is more reliable and more economical than prompting a large one.
Why not try it out today and see for yourself? All you need to bring is your task prompt; the Oumi Agent takes it from there!