Google Simula: Synthetic Training Data Without Real Data

Google Simula is the AI research breakthrough most people are sleeping on — synthetic training data that beats real data on some benchmarks.

This isn't a product launch.

It's a research framework that solves one of AI's biggest problems.

This post covers:

The Problem Simula Solves

Every AI model needs data to learn.

For general AI, there's plenty of data online.

For specialised AI, the data is locked up:

That's the data war.

It blocks the next wave of specialised AI.

Until now.

What Google Simula Is

A reasoning-first framework that generates synthetic training data from scratch.

No seeds.

No copies.

No scraping.

Just pure logic and structure.

Built by Google + EPFL.

How Simula Generates Data

Three stages.

Stage 1 — Global diversification

Simula maps the entire domain first.

Like drawing a country before placing cities.

Uses a taxonomy — a complete menu of every topic and subtopic in the area.

For cyber security: every type of attack, every defender, every system.
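To make the "map first" idea concrete, here is a minimal sketch of a taxonomy as a nested dictionary, with a helper that lists every leaf topic. The topic names and the `leaf_paths` helper are made up for illustration. This is not Simula's actual taxonomy or code.

```python
from typing import Iterator, Tuple

# Hypothetical, hand-written taxonomy for illustration only.
TAXONOMY = {
    "attacks": {"phishing": {}, "ransomware": {}},
    "defenders": {"soc_analyst": {}},
    "systems": {"cloud": {}, "mobile": {}},
}


def leaf_paths(tree: dict, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield every root-to-leaf path in a nested-dict taxonomy."""
    for name, subtree in tree.items():
        path = prefix + (name,)
        if subtree:
            # Not a leaf yet: keep walking down this branch.
            yield from leaf_paths(subtree, path)
        else:
            yield path


paths = list(leaf_paths(TAXONOMY))
for p in paths:
    print(" > ".join(p))
```

Each path is one "spot on the map" that later stages would fill with examples.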

Stage 2 — Local diversification

Zooms in.

Creates lots of different examples in each spot on the map.

Uses "one of n meta prompting" — generates many versions of each scenario.

Then runs "complexification" — pushes simple examples to be harder, trickier, more nuanced.

Like levelling up a video game.

Easy → medium → hard → boss fight.
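Here is a rough sketch of how the two steps in this stage could fit together: build several prompts for one topic, then rewrite an example a few times to make it harder. The function names, the prompt wording, and `stub_model` are all assumptions for illustration. The post does not describe Simula's real prompts, and a real setup would call an actual model where the stub is.

```python
from typing import Callable, List


def variant_prompts(topic: str, n: int) -> List[str]:
    """Build n prompts that each ask for a different scenario on one topic."""
    return [
        f"Write scenario {i + 1} of {n} about '{topic}'. "
        f"It must differ from the other {n - 1} scenarios."
        for i in range(n)
    ]


def complexify(example: str, generate: Callable[[str], str], levels: int) -> List[str]:
    """Return the example followed by `levels` progressively harder rewrites."""
    ladder = [example]
    for _ in range(levels):
        prompt = f"Rewrite this example so it is harder and more nuanced:\n{ladder[-1]}"
        ladder.append(generate(prompt))
    return ladder


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: it just wraps the last line of the prompt."""
    return "harder(" + prompt.splitlines()[-1] + ")"


prompts = variant_prompts("phishing", n=3)
ladder = complexify("A user receives a fake invoice email.", stub_model, levels=2)
```

With `levels=2` the ladder holds three versions of the example: the original plus two rewrites.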

Stage 3 — Dual critic filter

Two critic models check each example.

Together they decide whether it's good enough to keep.

In the legal data set test: 61% of generated data was rejected.

That's a serious quality filter.
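A minimal sketch of a two-critic filter, assuming an example is kept only when both critics accept it. The post does not say how Simula's critics reach their decision, so the two rule-based critics and the sample sentences below are toy stand-ins. Real critics would be model calls.

```python
from typing import Callable, Iterable, List, Tuple

Critic = Callable[[str], bool]


def dual_critic_filter(
    examples: Iterable[str], critic_a: Critic, critic_b: Critic
) -> Tuple[List[str], float]:
    """Keep examples both critics accept; also return the rejection rate."""
    kept: List[str] = []
    total = 0
    for example in examples:
        total += 1
        if critic_a(example) and critic_b(example):
            kept.append(example)
    rejection_rate = 1 - len(kept) / total if total else 0.0
    return kept, rejection_rate


# Toy critics (assumptions): one checks length, one checks for a legal party.
def long_enough(text: str) -> bool:
    return len(text.split()) >= 5


def names_a_party(text: str) -> bool:
    return any(word in text.lower() for word in ("tenant", "landlord"))


candidates = [
    "The tenant must give thirty days notice before leaving.",
    "Tenant pays rent.",
    "The weather was pleasant for most of the afternoon.",
    "The landlord may enter the property after written notice.",
]
kept, rejection_rate = dual_critic_filter(candidates, long_enough, names_a_party)
print(f"kept {len(kept)} of {len(candidates)}, rejected {rejection_rate:.0%}")
```

The rejection rate is the same kind of number as the 61% above: the share of generated examples that did not make it through.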

Why This Matters

Three reasons.

1 — Unlocks specialised AI

Where real data is locked behind privacy/cost/risk, synthetic data unlocks training.

Medical AI, legal AI, security AI — all benefit.

2 — Better diversity than real data

Real-world data covers what people happen to write online.

Simula data covers the full domain on purpose.

In tests, real reference data sets covered LESS of a topic than Simula-built ones.

Synthetic can be more complete than real.
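One simple way to think about "coverage" is the share of taxonomy topics that appear at least once in a data set. The sketch below uses that definition as an assumption. The post does not give the metric Google used, and the labels are toy data, not results from the research.

```python
from typing import List, Set


def taxonomy_coverage(leaves: Set[str], example_labels: List[str]) -> float:
    """Fraction of taxonomy leaves hit by at least one example."""
    if not leaves:
        return 0.0
    covered = leaves & set(example_labels)
    return len(covered) / len(leaves)


# Toy topic list and toy labels, invented for illustration.
leaves = {"phishing", "ransomware", "ddos", "insider_threat"}
toy_scraped_labels = ["phishing", "phishing", "ransomware"]
toy_designed_labels = ["phishing", "ransomware", "ddos", "insider_threat"]

scraped_coverage = taxonomy_coverage(leaves, toy_scraped_labels)
designed_coverage = taxonomy_coverage(leaves, toy_designed_labels)
```

In this toy case the scraped set touches half the topics, while the designed set touches all of them, because it was built from the topic list in the first place.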

3 — Quality, diversity, complexity as separate knobs

You control them one by one.

Need diverse but simple? Yes.

Need complex but narrow? Yes.

A standard cloud AI subscription doesn't give you this level of control over training data.
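As a sketch of what "separate knobs" could look like in practice, here is a small settings object with one field per knob. The class name, the field names, and the candidate-count formula are assumptions for illustration, not Simula's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationKnobs:
    variants_per_topic: int     # diversity: scenarios generated per topic
    complexity_levels: int      # complexity: harder rewrites per scenario
    require_both_critics: bool  # quality: keep only if both critics accept

    def candidates(self, n_topics: int) -> int:
        """Examples generated before filtering: each scenario plus its rewrites."""
        return n_topics * self.variants_per_topic * (1 + self.complexity_levels)


diverse_but_simple = GenerationKnobs(
    variants_per_topic=20, complexity_levels=0, require_both_critics=True
)
complex_but_narrow = GenerationKnobs(
    variants_per_topic=2, complexity_levels=4, require_both_critics=True
)

print(diverse_but_simple.candidates(n_topics=5))
print(complex_but_narrow.candidates(n_topics=5))
```

Changing one field leaves the other two alone, which is the point of treating them as independent settings.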


Where Simula Already Powers Products

This is the part most people don't realise.

Google is using Simula in production:

1 — AI scam detection on Android calls

The feature warning you when a phone call sounds like a scam?

Simula helps train it.

You can't easily train scam detection on real scam calls: the data is private, legally sensitive, and risky to handle.

Simula generates synthetic scam-shaped data.

The model learns scam patterns without ever seeing a real victim's message.

2 — Spam filtering in Google Messages

Same principle.

Synthetic spam data trains the filter.

Real messages are covered by privacy laws, so they are hard to use for training.

What The Numbers Show

Google ran experiments.

Headline result: synthetic Simula data beats real data on some tasks.

Specifically:

The catch:

Honest finding: the model that labels the synthetic data must be capable enough to label it correctly.

Why Specialised AI Was Stuck

Until Simula:

Specialised AI for legal, medical, and financial fields struggled to progress.

Simula offers a different path: build the data on purpose.

Implications For Solo Operators

Three.

1 — Specialist AI tools for niche industries become possible

Tools for lawyers, doctors, and financial advisors didn't exist because of data scarcity.

Now they can.

2 — Privacy-friendly AI gains adoption

Trained on synthetic data, AI doesn't need real customer data.

For privacy-sensitive operators, this is a major unlock.

3 — Quality bar rises for AI products

When everyone can train good models, the differentiator becomes domain knowledge + use case design.

Operators with clear specialist insight win.

What Simula Doesn't Do

Be honest.

For most use cases, the wins outweigh the limits.

How This Pairs With Open Source AI

Simula's pattern (mechanism design + dual critic filter) could be applied to open source AI training.

If applied:

This is part of the broader open source vs closed source race we're seeing.

I cover the open source side in Hermes Gemma 4 and Kimi 2.6 Benchmark.

Strategic Takeaway

Simula tells us something important:

The future of AI isn't just bigger models.

It's smarter ways of building training data.

The leverage shifts from "who has the most data" to "who has the sharpest thinking about their domain".

For operators with specialist domain knowledge, that's good news.

You can structure your domain.

You can articulate the rules.

You can design training data.

You don't need a giant data warehouse.

What This Means For The AI Industry

Predictions.

1 — Specialist AI explodes

Niche industries that lacked data now have a path.

2 — Privacy-first AI grows

Synthetic data sidesteps privacy concerns.

3 — Open source benefits

The same techniques can be applied to open source models.

4 — Big tech consolidation

Companies with strong reasoning models (the "teachers") have an advantage.

Critic Step Generalises

Simula's dual critic approach has a broader lesson.

For any AI system:

This applies to:

I apply this in my Hermes Agent Swarm workflows — always have a reviewer agent.

What Solo Operators Should Do This Quarter

Three actions.

1 — Pay attention to Simula evolution

This research will likely productise within 12-18 months.

2 — Map your specialist domain

If you have niche expertise, document the structure.

That structure is exactly what Simula-style training needs.

3 — Add critic steps to your AI workflows

Apply the dual critic pattern to your existing AI use.

Quality jumps.
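A minimal sketch of a critic step in an everyday workflow: a worker produces a draft, a reviewer checks it, and the loop retries until the reviewer accepts or the attempts run out. The function names and the stub worker and reviewer are assumptions for illustration. In a real workflow both would be model or agent calls.

```python
from typing import Callable, Optional


def generate_with_review(
    task: str,
    worker: Callable[[str, int], str],
    reviewer: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first draft the reviewer accepts, or None if none pass."""
    for attempt in range(1, max_attempts + 1):
        draft = worker(task, attempt)
        if reviewer(draft):
            return draft
    return None


# Stubs standing in for real model calls.
def stub_worker(task: str, attempt: int) -> str:
    return f"{task} (draft {attempt})"


def stub_reviewer(draft: str) -> bool:
    # Pretend only the second draft meets the bar.
    return draft.endswith("(draft 2)")


result = generate_with_review("Summarise the contract", stub_worker, stub_reviewer)
print(result)
```

Returning `None` when nothing passes keeps a weak draft from slipping through just because the loop ended.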


FAQ — Google Simula

What is Google Simula?

A research framework for generating synthetic AI training data without seeds or scraping.

Is Simula a product I can use?

Not directly — it's research.

But it's already powering Google products like Android scam detection.

Will Simula's techniques become open source?

Possibly — Google often releases research papers detailing the approach.

Can Simula data really beat real data?

In some tests yes — particularly for diversity coverage.

What's the catch?

Requires a strong teacher model to validate quality.

Where else might Simula be deployed?

Likely fraud detection, content moderation, accessibility tools.

Should solo operators care?

Yes — synthetic data unlocks specialist AI tools.


Google Simula is the AI research breakthrough that's quietly going to enable the next wave of specialist AI products — pay attention now.
