Use Case: AI/ML

Distributed AI Training at Global Scale

Nous Research uses iroh to train foundation LLMs with compute distributed around the world, across AWS, GCP, Azure, and self-hosted infrastructure.

AWS, GCP, Azure, Self-Host, Edge, Hybrid, Private

10x bandwidth reduction (10 Gbps → 1 Gbps)

50% cost reduction ($1M → $500K models)

30-50 nodes in training runs

100% GPU and network utilization

The Problem: Training LLMs is Brutally Expensive

Training requires exchanging a huge amount of data between every GPU, essentially the entire model. Traditional approaches therefore depend on massive data centers with specialized high-bandwidth interconnects. There has been no way to train at this scale without concentrating the infrastructure in one place.

“Doubling the network speed halves our compute budget. That's the difference between a $1M model and a $500K model.”
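To make the arithmetic behind that quote concrete, here is a minimal sketch. It assumes the run is entirely communication-bound, so rented compute time scales with transfer time. The function name, the data volume, and the hourly rate are hypothetical, chosen only so the baseline lands on the $1M figure from the quote.

```python
def training_cost(data_to_move_gbit: float, link_gbps: float, cost_per_hour: float) -> float:
    """Cost of a run whose wall-clock time is dominated by network transfer.

    Assumption (not from Nous): compute is rented for exactly as long as
    the data takes to move, so cost is proportional to 1 / link speed.
    """
    seconds = data_to_move_gbit / link_gbps   # time spent moving data
    hours = seconds / 3600
    return hours * cost_per_hour


# Hypothetical inputs: 3.6 million gigabits moved, $1,000 per cluster-hour.
baseline = training_cost(3.6e6, link_gbps=1.0, cost_per_hour=1000.0)
doubled = training_cost(3.6e6, link_gbps=2.0, cost_per_hour=1000.0)
print(f"1 Gbps: ${baseline:,.0f}  2 Gbps: ${doubled:,.0f}")
```

Under this simplified model, doubling the link speed halves the bill. A real run also spends time on compute, so the saving would be smaller whenever the network is not the bottleneck.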

But what if you could run distributed training over the internet? What if you could use the cheapest compute anywhere in the world and link it all together?

The Solution: Psyche

Psyche is Nous's distributed training framework. It cuts the bandwidth required between machines from 10 Gbps to just 1 Gbps, which makes internet-based distributed training viable.

Data center operators can download a binary and use iroh to connect to every other node in a training run. Each node trains on its own GPUs, coordinates with the others through gossip, and transfers large amounts of data as blobs.

The core question was simple: how do you get something to talk to something else? Iroh solves this. Gossip is especially useful because Psyche is building a swarm, not just a centralized service.
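To illustrate why gossip suits a swarm, the sketch below simulates a generic push-gossip broadcast: each informed node forwards the message to a few random peers per round. This is a textbook simulation, not iroh's gossip protocol or its API. The function name, fanout, and node count are assumptions (the node count is taken from the 30-50 range mentioned above).

```python
import random


def gossip_rounds(num_nodes: int, fanout: int, seed: int = 0) -> int:
    """Count rounds until a message from node 0 reaches every node.

    Generic push gossip (an illustrative assumption, not iroh's protocol):
    in each round, every informed node sends to `fanout` random peers.
    """
    rng = random.Random(seed)          # seeded so runs are repeatable
    informed = {0}                     # node 0 originates the message
    rounds = 0
    while len(informed) < num_nodes:
        newly_informed = set()
        for _node in sorted(informed):
            # A node may pick itself or an already-informed peer; that
            # wasted send is part of what plain push gossip costs.
            newly_informed.update(rng.sample(range(num_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


print(gossip_rounds(num_nodes=40, fanout=3))
```

No node needs a full view of the swarm or a central coordinator. The message spreads roughly exponentially, so a swarm of a few dozen nodes is covered in a handful of rounds.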

Swarm Architecture

Psyche is built as a decentralized swarm, not a centralized service. Gossip enables coordination across all nodes.

100% Utilization

Most training frameworks run a training step and then synchronize. Psyche's asynchronous approach pegs GPUs at 100% and network connections at 100% simultaneously.
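The difference between the two schedules can be sketched with a toy timing model. It assumes each step has a fixed compute time and a fixed communication time, and that the overlapped schedule sends the results of one step while the next step computes. Psyche's actual scheduling is not described here, so the function names and the pipeline shape are illustrative assumptions.

```python
def step_then_sync(compute_s: float, comm_s: float, steps: int) -> dict:
    """Each step computes, then waits for communication to finish."""
    total = steps * (compute_s + comm_s)
    return {
        "total_s": total,
        "gpu_util": steps * compute_s / total,
        "net_util": steps * comm_s / total,
    }


def overlapped(compute_s: float, comm_s: float, steps: int) -> dict:
    """Communication for step i runs while step i+1 computes (assumed pipeline)."""
    # First compute and last send cannot overlap with anything; in between,
    # each stage costs whichever of compute or communication is slower.
    total = compute_s + (steps - 1) * max(compute_s, comm_s) + comm_s
    return {
        "total_s": total,
        "gpu_util": steps * compute_s / total,
        "net_util": steps * comm_s / total,
    }


# Hypothetical workload: 1 s of compute and 1 s of communication per step.
print(step_then_sync(1.0, 1.0, steps=100))
print(overlapped(1.0, 1.0, steps=100))
```

With balanced compute and communication, the step-then-sync schedule leaves both the GPU and the network idle half the time, while the overlapped schedule approaches full utilization of both as the number of steps grows.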

Saturated Connections

The goal is to saturate every connection. From iroh, Nous needs reliable message delivery and file transfer that runs as fast as it possibly can.

Blockchain Coordination

High-level coordination happens through blockchain integration, which is useful when you need to pay someone and you don't know who they are.

Why Relays Matter

Nous runs 5 iroh relays to ensure reliable connectivity across their distributed training network. The key insight: when network conditions deteriorate, training runs can't break.

Iroh automatically establishes direct connections when possible for maximum throughput. When direct connections aren't possible—due to NATs, firewalls, or network conditions—traffic flows through relays. This fallback mechanism means training runs continue even when network conditions change.
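The fallback decision described above can be sketched as a small function. This is a conceptual illustration only, not iroh's API or its actual path-selection logic. The `Path` type, the `choose_path` function, and the example addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    kind: str      # "direct" or "relay"
    address: str   # where traffic is sent


def choose_path(direct_address: Optional[str], direct_reachable: bool, relay_url: str) -> Path:
    """Prefer a direct connection; fall back to the relay otherwise.

    Hypothetical helper: real implementations probe reachability
    continuously and can switch paths while a connection is live.
    """
    if direct_address is not None and direct_reachable:
        return Path("direct", direct_address)
    # NAT, firewall, or degraded network: keep traffic flowing via the relay.
    return Path("relay", relay_url)


# Example addresses are placeholders from documentation-reserved ranges.
print(choose_path("203.0.113.7:4433", True, "https://relay.example"))
print(choose_path("203.0.113.7:4433", False, "https://relay.example"))
```

The point of the design is that the relay branch always yields a usable path, so a failed direct connection costs throughput rather than connectivity.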

Managed Relays & Monitoring

The n0.computer team hosts relays for Nous through the iroh-online service. This provides reliable relay infrastructure without Nous having to manage it themselves.

We're also partnering with Nous to build better monitoring tools for distributed training, so that it is easy to understand what's happening at the network level during training runs.

“iroh does so much low-level networking for us. We don't have to learn about the low-level details of QUIC. When things go wrong, we want to look at the metrics and logs to understand what happened.”

Ready to Connect Your Distributed Infrastructure?

Get started with iroh in minutes. No complex configuration required.