Google Simula: Synthetic Training Data Without Real Data

Google Simula is the AI research breakthrough most people are sleeping on — synthetic training data that beats real data on some benchmarks. This isn't a product launch but a research framework that solves one of AI's biggest underlying problems, and the implications for specialist AI products over the next few years are huge.

This post covers what Google Simula does, why the data shortage matters, how Simula generates training data from scratch, and the real products it's already powering.

The Problem Simula Solves

Every AI model needs data to learn. For general AI, there's plenty of data online. For specialised AI, the data is locked up: medical files are private, legal cases are sealed, cybersecurity attack data is sensitive, and banking fraud records are confidential.

That's the data war, and it blocks the next wave of specialised AI from happening. Until now.

What Google Simula Is

Simula is a reasoning-first framework that generates synthetic training data from scratch. No seeds, no copies, no scraping — just pure logic and structure. It was built by Google in collaboration with EPFL, and the research is genuinely novel rather than just an iteration on existing approaches.

How Google Simula Generates Data

Three stages do the heavy lifting.

The first stage is global diversification, where Simula maps the entire domain first, like drawing a country before placing cities. It uses a taxonomy, which is a complete menu of every topic and subtopic in the area. For cybersecurity, that means every type of attack, every defender, and every system.

The second stage is local diversification, where it zooms in and creates lots of different examples in each spot on the map. It uses "one of n meta prompting" to generate many versions of each scenario, then runs "complexification" to push simple examples to be harder, trickier, and more nuanced, like levelling up in a video game from easy to medium to hard to boss fight.

The third stage is the dual critic filter, where two critic models check each example and decide whether it's good enough to keep. In the legal dataset test, 61% of generated data was rejected, so only about 39% of candidates survived. That's a serious quality filter.
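To make the three stages concrete, here is a minimal Python sketch of the flow. It is not Simula's actual code: the toy taxonomy, the `Example` class, the stubbed generator, and both critic rules are hypothetical stand-ins for what would be LLM calls in a real system. It only shows how taxonomy coverage, per-topic variants, complexification, and a two-critic filter fit together.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil

# Hypothetical toy taxonomy: each key is one axis of the domain.
TAXONOMY = {
    "attack": ["phishing", "ransomware"],
    "target": ["email server", "laptop"],
}

@dataclass
class Example:
    topic: tuple
    text: str
    level: int = 0  # 0 = simple; each complexification step adds 1

def global_diversification(taxonomy):
    """Stage 1: enumerate every combination so no part of the map is skipped."""
    return list(product(*taxonomy.values()))

def local_diversification(topic, n=3):
    """Stage 2a: create n variants for one topic (stub for an LLM call)."""
    return [Example(topic, f"Variant {i}: {' / '.join(topic)}") for i in range(n)]

def complexify(ex):
    """Stage 2b: push an example one level harder (stubbed as a text suffix)."""
    return Example(ex.topic, ex.text + " (harder)", ex.level + 1)

def critic_on_topic(ex):
    """Critic A (assumed rule): the text must mention every part of its topic."""
    return all(part in ex.text for part in ex.topic)

def critic_hard_enough(ex):
    """Critic B (assumed rule): the example must have been complexified."""
    return ex.level >= 1

def run_pipeline(taxonomy, n=3):
    candidates = []
    for topic in global_diversification(taxonomy):
        for i, ex in enumerate(local_diversification(topic, n)):
            # Complexify every other variant so the pool mixes difficulty levels.
            candidates.append(complexify(ex) if i % 2 == 0 else ex)
    # Stage 3: keep an example only if both critics accept it.
    kept = [ex for ex in candidates if critic_on_topic(ex) and critic_hard_enough(ex)]
    return candidates, kept

def candidates_needed(target_kept, rejection_rate):
    """How many candidates to generate to keep target_kept after filtering."""
    return ceil(target_kept / (1 - rejection_rate))

candidates, kept = run_pipeline(TAXONOMY)
print(len(candidates), len(kept))
print(candidates_needed(1000, 0.61))
```

The last helper shows what a 61% rejection rate means in practice: to end up with 1,000 kept examples you would need to generate roughly 2,565 candidates.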

Why This Matters

Three reasons this research changes what's possible.

The first is that it unlocks specialised AI. Where real data is locked behind privacy, cost, or risk, synthetic data unlocks training. Medical AI, legal AI, and security AI all benefit. The second is better diversity than real data. Real-world data covers what people happen to write online, while Simula data covers the full domain on purpose. In tests, some real reference datasets covered less of a topic than Simula-built ones, so synthetic data can be more complete than real data. The third is that quality, diversity, and complexity become separate knobs you control one by one. Need diverse but simple? Yes. Need complex but narrow? Yes. Off-the-shelf cloud AI subscriptions don't give you this level of control over the training data.
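One way to picture the "separate knobs" idea is a small config object where each knob is its own field. This is only an illustrative sketch: `GenerationConfig` and its three field names are made up for this post and are not taken from Simula.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GenerationConfig:
    # All three fields are hypothetical names chosen for illustration.
    topic_coverage: float    # diversity: share of taxonomy topics to sample (0 to 1)
    complexify_steps: int    # complexity: rounds of complexification per example
    min_critic_score: float  # quality: score each critic must give to keep an example

# "Diverse but simple": cover every topic, skip complexification.
diverse_but_simple = GenerationConfig(
    topic_coverage=1.0, complexify_steps=0, min_critic_score=0.8
)

# "Complex but narrow": change two knobs, leave the quality bar untouched.
complex_but_narrow = replace(
    diverse_but_simple, topic_coverage=0.1, complexify_steps=3
)

print(diverse_but_simple)
print(complex_but_narrow)
```

The point of the frozen dataclass plus `replace` is that turning one knob never silently moves another.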

Where Simula Already Powers Products

This is the part most people don't realise — Google is already using Simula in production.

The first product is AI scam detection on Android calls. The feature warning you when a phone call sounds like a scam? Simula helps train it. Training scam detection on real scam calls is impractical because that data is private, legally restricted, and risky to handle. Simula generates synthetic scam-shaped data so the model learns scam patterns without ever seeing a real victim's message. The second product is spam filtering in Google Messages, which uses the same principle: synthetic spam data trains the filter while real users' messages stay protected by privacy laws.

What The Numbers Show

Google ran experiments with results worth understanding.

The headline is that synthetic Simula data beats real data on some tasks. Specifically, high-complexity Simula data gave a 10% accuracy gain over low-complexity data on a math reasoning test (GSM8K), and real reference datasets covered less of the topic in some cases.

The catch is that high-complexity synthetic data only helps when the teacher model is strong enough. On the legal dataset with a weak teacher (57% accurate), high-complexity data actually hurt performance. The honest finding is that the model labelling the synthetic data must be smart enough to make this approach work.

Why Specialised AI Was Stuck

Until Simula, specialised AI for legal, medical, and financial fields struggled to progress for three reasons: real-world data missed entire parts of each subject, privacy locked up the data that mattered, and the cost of acquiring data was prohibitive.

Simula offers a different path: build the data on purpose rather than trying to scavenge it.

Implications For Solo Operators

Three implications worth thinking about.

The first is that specialist AI tools for niche industries become possible. Tools for lawyers, doctors, and financial advisors that didn't exist because of data scarcity now can. The second is that privacy-friendly AI gains adoption — when models are trained on synthetic data, they don't need real customer data, which is a major unlock for privacy-sensitive operators. The third is that the quality bar rises for AI products, because when everyone can train good models, the differentiator becomes domain knowledge plus use case design. Operators with clear specialist insight win.

What Google Simula Doesn't Do

Let's be honest about the limits.

It doesn't replace primary data collection entirely. It doesn't generate truly novel scenarios, because it's limited by the training distribution of the generator. And it requires a strong teacher model to validate quality.

For most use cases, the wins outweigh the limits.

How This Pairs With Open Source AI

Simula's pattern (mechanism design plus dual critic filter) could be applied to open source AI training. If applied, open source models could train on synthetic data tailored to specialist domains, and the closed-source advantage in specialist verticals shrinks.

This is part of the broader open source vs closed source race we're seeing across the AI ecosystem. I cover the open source side in Hermes Gemma 4 and Kimi 2.6 Benchmark.

Strategic Takeaway

Simula tells us something important about where AI is heading.

The future of AI isn't just bigger models — it's smarter ways of building training data. The leverage shifts from "who has the most data" to "who has the sharpest thinking about their domain." For operators with specialist domain knowledge, that's genuinely good news. You can structure your domain, articulate the rules, and design training data without needing a giant data warehouse.

What This Means For The AI Industry

Four predictions for the next 12-18 months.

The first is that specialist AI will explode as niche industries that lacked data finally have a path. The second is that privacy-first AI will grow because synthetic data sidesteps privacy concerns. The third is that open source will benefit as the same techniques get applied to open models. The fourth is that big tech will consolidate around companies with strong reasoning models (the "teachers"), because those models give them an advantage in the synthetic data game.

Critic Step Generalises

Simula's dual critic approach has a broader lesson worth applying to your own AI workflows.

For any AI system, include a critic step: have a second pair of AI eyes review each output before you accept it. In my experience, quality improves noticeably. This applies to content generation, code generation, customer responses, and any other AI output.

I apply this in my Hermes Agent Swarm workflows — always have a reviewer agent.
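Here is a rough Python sketch of what a critic step can look like in your own workflow. It assumes you supply two callables, one that drafts and one that reviews; the function names and the "empty feedback means approved" convention are my own illustration, not part of Simula or any specific agent framework. The toy functions stand in for real model calls.

```python
from typing import Callable, Optional

def generate_with_critic(
    generate: Callable[[str, str], str],
    critique: Callable[[str], str],
    prompt: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Draft, review, and retry until the critic approves or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(prompt, feedback)
        feedback = critique(draft)
        if not feedback:  # assumed convention: empty feedback means approved
            return draft
    return None  # nothing passed review; better to return nothing than a bad draft

# Toy stand-ins for model calls, so the loop can be run as-is.
def toy_generate(prompt: str, feedback: str) -> str:
    return prompt.upper() if "uppercase" in feedback else prompt

def toy_critique(draft: str) -> str:
    return "" if draft.isupper() else "use uppercase"

print(generate_with_critic(toy_generate, toy_critique, "hello"))
```

In this toy run the first draft is rejected, the feedback is passed back to the generator, and the second draft is approved. Swap the two toy functions for your writer and reviewer agents.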

What Solo Operators Should Do This Quarter

Three actions to take in the next 90 days.

The first is paying attention to how Simula evolves, because this research will likely be productised within 12-18 months and you want to be ready. The second is mapping your specialist domain: if you have niche expertise, document its structure, because that structure is exactly what Simula-style training needs. The third is adding critic steps to your existing AI workflows by applying the dual critic pattern, which tends to lift output quality.

FAQ — Google Simula

What is Google Simula?

A research framework for generating synthetic AI training data without seeds or scraping.

Is Simula a product I can use?

Not directly — it's research. But it's already powering Google products like Android scam detection.

Will Simula's techniques become open source?

Possibly — Google often releases research papers detailing the approach.

Can Simula data really beat real data?

In some tests yes — particularly for diversity coverage.

What's the catch?

Requires a strong teacher model to validate quality.

Where else might Simula be deployed?

Likely fraud detection, content moderation, accessibility tools.

Should solo operators care?

Yes — synthetic data unlocks specialist AI tools.