Hermes Gemma 4: Run Hermes Free Forever Locally

Hermes Gemma 4 is the combination I've been waiting years for — a free, local, lightweight model that actually plays nicely with a proper agent.

Let me cut to it.

Gemma 4 is Google's latest open-source lightweight model.

It runs locally.

It's fast.

It's free forever.

And when you wire it up with Hermes, you get an AI agent that costs you nothing to run — no API bills, no token anxiety, no rate limits.

I'll walk you through the exact setup I use.

No fluff.

No 20-minute preamble about "what is AI" like every other tutorial.

Just the command flow that gets you running in about 5 minutes.

Why Hermes Gemma 4 Is Actually A Big Deal

Most people think local models are rubbish.

They were, honestly.

Six months ago, local models felt like a toy compared to the frontier stuff.

Gemma 4 changes the maths.

Here's why:

I've been messing with every local model out there.

Gemma 4 is the first one I'd actually trust to run overnight as a sub-agent without babysitting it.

That's why Hermes Gemma 4 matters.

Now let's set it up.

What You Need Before You Start

Two things.

First, Hermes installed.

If you haven't got Hermes on your machine yet, stop reading and install it.

I covered the full walkthrough in my Ollama + Hermes setup — go grab that first, come back, keep reading.

Second, Ollama installed.

Ollama is the runtime that hosts Gemma 4 locally.

It's the thing that lets your model talk to Hermes.

You'll install it in Step 1 below if you haven't already.

That's the full prereq list.

No Docker.

No Python environments.

No CUDA wrangling.

Hermes and Ollama. That's it.

Step 1: Install Ollama

Head to ollama.com and grab the install command.

Paste it into your terminal.

Run it.

Done.

Ollama runs as a background service on your machine — you don't have to think about it once it's installed.

Check it's alive by running ollama list in your terminal.

If it returns without error, you're good.
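If you'd rather script that check than eyeball it, here's a minimal Python sketch that pings Ollama's local HTTP API (Ollama serves on port 11434 by default; /api/version is its version endpoint):

```python
import urllib.request
import urllib.error

def is_ollama_up(base_url="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timed out -> Ollama isn't running
        return False

if __name__ == "__main__":
    print("Ollama running:", is_ollama_up())
```

Handy later too — Step 3 fails silently-ish if Ollama isn't up, so a one-line check saves you a confusing error.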

Step 2: Install Gemma 4

Go to the Models page on ollama.com.

Search "Gemma 4".

Click it.

Pick your variant.

For most people I'd say start with the smaller one.

You can always pull the bigger one later.

Ollama shows you the exact terminal command to install whichever variant you pick.

Copy it. Paste it. Run it.

It'll download the model weights (takes a minute or two depending on your connection).

Once done, Gemma 4 lives on your machine and responds instantly.
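Want to prove it responds before wiring up Hermes? A quick sketch against Ollama's generate endpoint — the model tag here is the one from the pull command below, so swap in whatever `ollama list` actually shows on your machine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's non-streaming generate endpoint

def build_request(model, prompt):
    """Build a non-streaming generate request for Ollama's HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(model, prompt):
    """Send one prompt and return the model's text reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Swap in your actual tag from `ollama list`.
    print(ask("gemma:4-latest", "Say hi in five words."))
```

If that prints a reply, the model side is done — everything from here is just Hermes config.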

🔥 Want the exact commands + screenshots I used to get this running first try? Inside the AI Profit Boardroom, I've got a full Hermes + Ollama section with step-by-step video tutorials showing you the exact setup, including the custom endpoint dance, API key gotchas, and what to do when things break. Plus weekly coaching calls where you can share your screen and get help with YOUR setup. 2,800+ members are already using this. → Get access to the full training here

Step 3: Wire Hermes Up To Gemma 4

This is where most people get stuck.

It's dead simple once you see it.

Open your terminal.

Start a new Hermes chat.

Run:

hermes model

You'll see a list of different models and endpoints.

Scroll to "Custom Endpoint".

Select it.

Hermes now asks you for a URL.

This is the Ollama local URL — Ollama exposes itself on http://localhost:11434 by default.

Paste that URL into Hermes.

IMPORTANT: Make sure Ollama is actually running in the background at this point.

If it's not, Hermes can't talk to it and you'll get an error.

A quick ollama list in another terminal tab tells you it's alive.

Next, Hermes asks for an API key.

This is a local setup — there's no real key.

Type "Ollama" as a placeholder, or just hit enter to leave it blank.

Hermes pings Ollama and shows you the list of models you've got installed locally.

Gemma 4 will be on that list.

Select "Gemma 4 latest" (or whichever variant you pulled).

Leave the next prompt blank.

Run Hermes.

Boom.

Gemma 4 is now running as the brain behind Hermes.

Free.

Local.

Forever.

The Entire Command Flow In One Block

Because I know some of you just want to copy and go:

# Step 1
Install Ollama from ollama.com

# Step 2
ollama pull gemma:4-latest  # or your chosen variant

# Step 3
hermes model
→ Select Custom Endpoint
→ Paste http://localhost:11434
→ API key: Ollama (or leave blank)
→ Select Gemma 4 latest
→ Leave next prompt blank
→ Run Hermes

That's the whole thing.

Nobody else on the internet explains it this simply.

I know because I had to figure it out the hard way when the docs were missing half the steps.

Why Hermes Gemma 4 Beats Cloud Models For Certain Jobs

Cloud models are better. I'll say it.

Claude, GPT, MiniMax M2.7 — they're smarter than Gemma 4 on the hardest tasks.

But they've got limits.

Token limits.

Rate limits.

Cost limits.

Here's where Hermes Gemma 4 wins:

1. Sub-agents

When I'm running a big Hermes workflow with 10+ sub-agents, I don't want every sub-agent burning through Claude tokens.

Point the sub-agents at Gemma 4.

They're free.

They run forever.

Save the expensive frontier model for the orchestrator.
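The routing logic behind that is dead simple. Here's a toy sketch — the endpoint URLs and model names are placeholders for illustration, not Hermes internals:

```python
# Hypothetical model router: free local model for sub-agents,
# expensive frontier model only for the orchestrator.
LOCAL = {"endpoint": "http://localhost:11434", "model": "gemma:4-latest"}
CLOUD = {"endpoint": "https://api.example.com", "model": "frontier-model"}  # placeholder

def pick_model(role):
    """Sub-agents get the free local brain; only the orchestrator gets the big one."""
    return CLOUD if role == "orchestrator" else LOCAL
```

Twenty sub-agents on LOCAL cost you nothing; one orchestrator on CLOUD costs you one model's worth of tokens instead of twenty-one.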

2. Massive context jobs

The bigger Gemma 4 variants give you 256K of context.

That's bigger than a lot of frontier models, including MiniMax M2.7.

If you're feeding in a huge codebase or a massive document, Gemma 4 can swallow it without breaking.
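Rough rule of thumb for "will it fit": one token is about four characters of English text or code. That ratio is an approximation, not a Gemma spec, but it's good enough for a quick pre-flight check:

```python
CONTEXT_TOKENS = 256_000   # the big Gemma 4 variants' window
CHARS_PER_TOKEN = 4        # rough average for English text / code (approximation)

def fits_in_context(text, reserve=2_000):
    """Estimate whether `text` fits, keeping `reserve` tokens free for the reply."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens <= CONTEXT_TOKENS - reserve
```

Call it on your concatenated codebase before you paste — a megabyte of source is roughly 250K tokens, so even 256K has a ceiling.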

3. Privacy-sensitive work

Client data that can't leave your machine?

Internal docs you can't send to a cloud API?

Local Gemma 4 via Hermes doesn't phone home.

Your data stays on your box.

4. Running tools agentically

A lot of the new Ollama-compatible models — including Gemma 4 — are designed to run tools and behave agentically.

Pair that with Hermes' agent loop and you've got a free local agent that can call tools, write files, run commands.
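The loop underneath is less magic than it sounds. Stripped to the bone, it's something like this — the tool registry and the shape of the tool-call message are simplified stand-ins, not Hermes' actual internals:

```python
# Toy agent-loop fragment: the model proposes a tool call,
# we run it, and the result gets fed back into the conversation.
import datetime

def get_time(_args):
    """Example tool: current timestamp."""
    return datetime.datetime.now().isoformat()

TOOLS = {"get_time": get_time}  # name -> callable registry

def dispatch(tool_call):
    """Run one tool call of the form {'name': ..., 'arguments': {...}}."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return f"unknown tool: {tool_call['name']}"
    return fn(tool_call["arguments"])
```

The real loop just does this on repeat: model output in, tool result out, until the model says it's done.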

I covered the tool-running side more in Hermes vs OpenClaw if you want to see where each one shines.

Bonus: Run MiniMax M2.7 Cloud With Hermes Too

Same endpoint trick works for cloud models.

Run hermes model.

Select custom endpoint.

This time type "2" for MiniMax M2.7 cloud.

Leave blank.

Run Hermes.

Done — MiniMax M2.7 running through the same custom endpoint flow.

Why bother?

Because MiniMax 2.7 is agentic.

Self-improving.

Designed to run tools.

It basically built itself.

You can even run OpenClaw with MiniMax 2.7 through Ollama — I broke that down properly in my OpenClaw Opus 4.7 walkthrough.

Cloud models like MiniMax are better for the hardest tasks.

Local models like Gemma 4 are free and unlimited.

Use both.

Julian's Real Talk On Which To Pick

Here's how I actually use Hermes Gemma 4 day-to-day.

Heavy reasoning, high-stakes task? Claude Opus 4.7 via Hermes. No question. I covered that setup in Claude Opus 4.7 for AI SEO.

Massive context dump? The big Gemma 4 variant via Hermes. 256K of context window and zero cost.

Parallel sub-agents running overnight? Small Gemma 4 via Hermes. I don't care if it's slightly less smart — I'm running 20 of them at once and the cost is zero.

Private client work? Always local Gemma 4. Data stays on my machine.

The combination matters more than any single model.

Hermes is the conductor.

Gemma 4 is a free instrument in your orchestra.

FAQ — Hermes Gemma 4

Is Hermes Gemma 4 really free forever?

Yes.

Gemma 4 is open-source.

Ollama is open-source.

Hermes is the agent that wraps around them.

Running Gemma 4 locally via Ollama doesn't cost you a penny in API fees — you only pay the electricity cost of your own machine.

What's the context window on Hermes Gemma 4?

Depends on the variant.

The smaller Gemma 4 models run at ~128K context.

The larger 18GB and 20GB Gemma 4 variants run at 256K context — bigger than most frontier cloud models, including MiniMax M2.7.

Can Hermes Gemma 4 run tools and sub-agents?

Yes.

Gemma 4 is built for agentic use.

You can fire Hermes sub-agents at it, have it run tools, and orchestrate bigger workflows.

It won't hit the reasoning depth of a frontier cloud model on the hardest tasks, but for sub-agent work it's brilliant.

Do I need a powerful computer to run Hermes Gemma 4?

For the smaller Gemma 4 variants, any modern laptop works fine.

For the 18GB and 20GB variants with 256K context, you'll want a machine with solid RAM and a decent GPU.

If you're on a MacBook Pro with 16GB+ unified memory, you're golden for the smaller variants — the 18GB and 20GB models need more than that, since the weights have to fit in memory.

What if Hermes can't find Gemma 4?

Check Ollama is running in the background.

Run ollama list in a separate terminal to confirm Gemma 4 is installed.

If it's not listed, re-run the install command from ollama.com.

If it is listed but Hermes still can't see it, double-check the URL you pasted into the custom endpoint step — it should be the local Ollama URL.

Should I use Hermes Gemma 4 or a cloud model?

Both.

Cloud models are smarter on the hardest tasks.

Hermes Gemma 4 is free, local, unlimited, and has a bigger context window than many cloud models.

Use cloud for the top-of-the-funnel brain work, and Hermes Gemma 4 for sub-agents, private data, and bulk jobs.

Wrapping Up

Hermes Gemma 4 is the easiest "free forever" AI agent stack on the planet right now.

Install Ollama.

Pull Gemma 4.

Run hermes model, select custom endpoint, paste the URL, API key = "Ollama", select Gemma 4, leave blank, run.

That's the whole thing.

🚀 Ready to go deeper than a blog post? Inside the AI Profit Boardroom, I've got a 2-hour course on exactly how to use Hermes to save time and grow your business. Plus a 6-hour OpenClaw course, daily trainings in the SOP section, and 145 pages of member wins. Weekly coaching calls where you can share your screen. → Join the Boardroom here

Video notes + links to the tools 👉 https://www.skool.com/ai-profit-lab-7462/about

Learn how I make these videos 👉 https://aiprofitboardroom.com/

Get a FREE AI Course + Community + 1,000 AI Agents 👉 https://www.skool.com/ai-seo-with-julian-goldie-1553/about

If you've made it this far, you now know more about Hermes Gemma 4 than 99% of people messing with AI agents today.

Ready to Build AI Agents That Actually Make Money?

Join 2,200+ entrepreneurs inside the AI Profit Boardroom. Get 1,000+ plug-and-play AI agent workflows, daily coaching, and a community that holds you accountable.

Join The AI Agent Community →

7-Day No-Questions Refund • Cancel Anytime