Service Mesh at Scale with William Morgan

William Morgan

CEO of Buoyant, Creator of Linkerd, ex-Twitter Engineer

Señors @ Scale host Neciu Dan sits down with William Morgan, CEO of Buoyant and creator of Linkerd — the world's first service mesh and a graduated CNCF project. William's path runs from teaching himself BASIC on a begged-for DOS PC, through Twitter's painful migration off Ruby on Rails into JVM-based microservices, and into building the proxy that handles retries, mTLS, load balancing, and multi-cluster traffic for thousands of production Kubernetes clusters. From the Scala-to-Rust rewrite to why every sustainable cloud native open source project needs a commercial engine behind it, this is the infrastructure conversation most application developers never get to have.

🎧 New Señors @ Scale Episode

This week, I spoke with William Morgan, CEO of Buoyant and the creator of Linkerd — the world's first service mesh and a graduated CNCF project. William spent nearly four years at Twitter in the Ruby-on-Rails-to-microservices era, building the photo service and the embedded timelines product, and watched first-hand as Twitter invented the patterns that would later become foundational infrastructure for everyone running Kubernetes at scale. He took those lessons, founded Buoyant, and built Linkerd to give the rest of the world the tools Twitter had to invent for itself.

In this episode, we go deep into what a service mesh actually is, why Linkerd rewrote everything from Scala to Rust, how Buoyant funds its maintainers through an open core model, the protocol-detection edge cases that bite you in production, and why every cloud native open source project needs a commercial engine behind it to survive.

⚙️ Main Takeaways

1. Twitter's monolith-to-microservices rewrite was the seed for Linkerd

William joined Twitter in 2010 right as the company decided to rewrite everything off Ruby on Rails onto JVM-based microservices in Scala — and he was against it.

The starting point: Twitter was a popular Ruby on Rails app with severe scaling issues — the fail whale era, asymmetric traffic patterns where a single user with tens of millions of followers could destabilize the system.
The decision: Within his first six months, Twitter committed to decomposing the monolith into microservices on the JVM using Scala, deployed on Mesos, communicating over Thrift.
The personal pivot: William was a Ruby fan and "got bullied into doing it." Watching that transformation — and the new failure modes it introduced — became the basis for everything he built next.

2. The hidden cost of decomposing a monolith is the network

Function calls don't fail. Network calls do — and that single difference changes everything.

The shift: "Function calls are very predictable, and they're very fast, and they basically don't fail. Now you've got network calls and dramatically different semantics."
The new problems: Latency jumps from hundreds of microseconds to hundreds of milliseconds. Calls can fail. Data has to be encoded, sent over the wire, and decoded. Shared in-memory state no longer exists.
The Twitter solution: Libraries that handled retries, timeouts, load balancing, circuit breaking, rate limiting, and service discovery — so the photo service developer could just say "talk to the storage service" and not care about the rest.
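To make the retry part of that list concrete, here is a minimal sketch of what such a library does around a single network call. It is not Twitter's or Linkerd's code: the names `call_with_retries` and `CallFailed`, the attempt count, and the backoff values are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CallFailed(Exception):
    """Raised when every attempt at a remote call has failed (hypothetical name)."""


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying connection errors and timeouts with jittered backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < max_attempts - 1:
                # Wait roughly 1x, 2x, 4x the base delay; jitter avoids retry storms.
                sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0))
    raise CallFailed(f"gave up after {max_attempts} attempts") from last_error
```

The caller passes in "talk to the storage service" as a plain function and never sees the retry loop, which is the separation the takeaway describes.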

3. The first version of Linkerd was a proxy wrapper around Twitter's open source Scala libraries

Linkerd started as a way to get Twitter's networking primitives into the hands of teams not writing Scala.

The constraint: Twitter had open sourced beautiful Scala libraries that handled hard distributed systems problems. But "if you're into that, great" — most companies weren't writing Scala.
The hack: Wrap the Scala libraries in a network proxy. "Let's put a proxy on both sides" of every call. Now language doesn't matter — your service just talks HTTP to the proxy, and the proxy handles load balancing, retries, and the rest.
The evolution: Modern Linkerd has zero Scala. The proxy is in Rust, the control plane in Go, and it's purpose-built for Kubernetes.

4. The Scala-to-Rust rewrite was forced by the JVM's memory floor

The JVM is great at scaling up. It's terrible at scaling down — and the service mesh proxy needs to scale down to nearly nothing.

The number: Linkerd's JVM-based proxy could not get under 150 megs of memory, no matter what they tried — including GraalVM and other tricks.
The pitch problem: "I'm running this Go microservice that takes 50 megs. And you're telling me I need to add 150 megs of proxy on top of that. That's not transparent or lightweight."
The choice: Rust gave them memory safety guarantees that mattered for customers running medical data and financial transactions. C++ was out — "There's no way that a human being can write C++ code without making it insecure." The result is a "micro-proxy" that's roughly a tenth the size of Envoy.

5. Linkerd funded the early Rust networking ecosystem

When Linkerd picked Rust in 2018, the ecosystem barely existed.

The state of play: Rust 1.0 had shipped only a few years earlier, in 2015, and the async networking libraries were still embryonic.
The investment: Buoyant funded a lot of early work on H2, Hyper, and Tokio so the primitives would exist for the proxy.
The artifact: The linkerd2-proxy repo on GitHub is one of the most sophisticated examples of asynchronous user-space networking code in Rust.

6. What a service mesh actually is — and who it's for

The term is meaningless on its own; Buoyant invented it to describe what Linkerd does.

The scope: Application-level networking on top of Kubernetes — retries, timeouts, load balancing, circuit breaking, mTLS, multi-cluster routing, observability. Not L4 plumbing — Kubernetes already gives you TCP between IPs.
The audience: Not developers. Platform owners. "Our goal is to make life easier for the devs. And ideally, they would never know about the service mesh."
The split: Devs own application code. Platform owners own Kubernetes plus the service mesh. The mesh handles the cross-cutting concerns the devs would otherwise have to reimplement everywhere.

7. mTLS, identity, and policy come for free at the platform layer

Asking every developer to implement TLS, rotate certificates every 24 hours, and manage a CA is a recipe for it not happening.

The mechanism: Linkerd issues a cryptographic identity in a TLS certificate to every pod. Every service-to-service call is encrypted and authenticated. Certificates rotate automatically.
The Twitter parallel: "Client identity was a big problem for Twitter because we didn't have that. Suddenly the photo service has 10x of traffic and you're like, where is this coming from?"
The next layer: Once you have identity, you can write policy — service A can call service B's /get but not /delete, gRPC method allow/deny lists, etc.
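The "A can call /get but not /delete" idea can be sketched as a default-deny allow list keyed on the client identity from the verified certificate. This is only an illustration: the `Rule` type, `is_allowed`, and the service names are hypothetical, and Linkerd itself expresses policy as Kubernetes resources rather than Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    client: str   # identity taken from the client's verified mTLS certificate
    method: str   # HTTP method, e.g. "GET"
    path: str     # exact request path, e.g. "/get"


def is_allowed(rules: set[Rule], client: str, method: str, path: str) -> bool:
    """Default deny: a request passes only if an explicit rule matches it."""
    return Rule(client, method.upper(), path) in rules


# Hypothetical policy for service B: service A may read but not delete.
POLICY_FOR_B = {Rule("service-a", "GET", "/get")}
```

Because the check depends on who is calling, it only works once every caller has an authenticated identity, which is why policy comes after mTLS.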

8. Sophisticated load balancing is not optional at scale

When you have 100 pods of service A talking to 1000 pods of service B, naive connection management falls apart.

The connection problem: Every A pod opens connections to every B pod. You hit OS limits. Linkerd upgrades all of those into a single multiplexed HTTP/2 connection per pair and does request-level load balancing.
The latency-aware bit: Linkerd measures response latency per endpoint. If a JVM pod stalls in a 200-300ms GC pause, Linkerd steers traffic away from it until it recovers.
The dev experience: None of this is in your code. It's all happening at the proxy layer.
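With 100 client pods and 1000 server pods there are 100 × 1000 = 100,000 possible pod pairs, which is why balancing per request rather than per connection matters. The sketch below shows one common way to do latency-aware request balancing: a moving average of latency per endpoint combined with "power of two choices". The class and function names, the smoothing factor, and the sample latencies are assumptions, and the real proxy is written in Rust and is considerably more involved.

```python
import random


class Endpoint:
    """One backend pod, tracked by a moving average of its response latency."""

    def __init__(self, name: str, alpha: float = 0.3) -> None:
        self.name = name
        self.alpha = alpha    # weight given to the newest sample
        self.ewma_ms = 0.0    # smoothed latency estimate in milliseconds

    def record(self, latency_ms: float) -> None:
        # Exponentially weighted moving average: recent samples count more.
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms


def pick(endpoints: list[Endpoint], rng: random.Random) -> Endpoint:
    """Power of two choices: sample two endpoints, send to the faster one."""
    a, b = rng.sample(endpoints, 2)
    return a if a.ewma_ms <= b.ewma_ms else b
```

A pod stuck in a long GC pause records a high latency, loses every comparison it takes part in, and stops receiving requests until its average drops again.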

9. Open core is the only sustainable model for cloud native infrastructure projects

William has strong, controversial opinions about open source funding — earned over a decade.

The premise: "The only way you can have a truly sustainable open source project in the Kubernetes ecosystem is by having a commercial engine behind it." Without that, projects die.
The Linkerd model: Buoyant sells Buoyant Enterprise for Linkerd. That money funds the maintainers — every Linkerd maintainer is a full-time Buoyant employee. When someone good shows up writing Linkerd code, they get a job offer.
The contribution reality: Linkerd's audience is SREs and platform engineers, not developers writing Rust. The barrier to contributing async Rust networking code is enormous, so most users "just want it to work."

10. Drawing the line between open source and enterprise features

The crappy way is to put features everyone needs behind the paywall. The good way is to find what businesses care about that individuals don't.

The principle for Linkerd: Anything around security stays in the open source. "We can't have an insecure version."
Compliance is enterprise: SBOMs, compliance attestations, audit features. "The only reason you ever care about that is because you have some annoying compliance team in your company."
HAZL — High Availability Zonal Load Balancing: AWS charges for cross-zone traffic. Kubernetes spreads traffic evenly across zones for reliability. Linkerd Enterprise's HAZL keeps traffic in-zone unless reliability degrades — saving cross-AZ costs at scale. "If you're worried about cost savings on cross-zone traffic, you're a big company and that's something you pay for."
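The "stay in-zone unless reliability degrades" behaviour can be illustrated with a simple endpoint filter. This is not Buoyant's HAZL algorithm, which is not described in the episode notes; the `Backend` type, the `candidates` function, the 0.95 threshold, and the zone names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    success_rate: float  # fraction of recent requests that succeeded, 0.0 to 1.0


def candidates(backends: list[Backend], local_zone: str,
               min_success_rate: float = 0.95) -> list[Backend]:
    """Prefer healthy in-zone backends, then any healthy backend, then anything."""
    healthy = [b for b in backends if b.success_rate >= min_success_rate]
    in_zone = [b for b in healthy if b.zone == local_zone]
    # Stay in-zone (no cross-zone transfer cost) while something healthy is there.
    return in_zone or healthy or backends
```

Cross-zone traffic, and its cost, only appears when the local zone has no healthy backend left.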

11. Protocol detection is the kind of feature that's easy to design and hard to operate

Linkerd auto-detects whether a connection is HTTP/1.x, HTTP/2, or raw TCP, and that decision is full of edge cases.

The happy path: Connection opens, app sends bytes, Linkerd peeks at them, routes accordingly.
The edge case: App opens a connection and sends nothing. Linkerd needs to know the protocol to know where to route — config might say "if HTTP and going to foo, send to bar." So there's a 10-second timeout, after which it falls back to TCP.
The cluster-overload trap: Under heavy CPU pressure, those timeouts start firing for normal traffic. Behavior changes in ways operators can't predict. The fix has been refining behavior across releases — not a big-bang feature, just hard-won operational nuance.
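The happy path and the silent-client edge case can be sketched with a socket peek and a timeout. This is a simplified illustration, not Linkerd's implementation: the function names are assumptions, it only recognises a handful of HTTP/1 methods, and it ignores clients that send their first bytes in several small pieces. The HTTP/2 connection preface is the one defined by the HTTP/2 specification, and the 10-second default mirrors the timeout mentioned above.

```python
import socket

HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
HTTP1_METHODS = (b"GET ", b"POST ", b"PUT ", b"DELETE ", b"HEAD ",
                 b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ")


def classify(first_bytes: bytes) -> str:
    """Guess the protocol from the first bytes a client sent."""
    if first_bytes.startswith(HTTP2_PREFACE):
        return "http2"
    if first_bytes.startswith(HTTP1_METHODS):
        return "http1"
    return "tcp"


def detect(conn: socket.socket, timeout_s: float = 10.0) -> str:
    """Peek at a connection; fall back to opaque TCP if the client stays silent."""
    conn.settimeout(timeout_s)
    try:
        # MSG_PEEK leaves the bytes in the buffer for whoever handles the stream next.
        first_bytes = conn.recv(len(HTTP2_PREFACE), socket.MSG_PEEK)
    except socket.timeout:
        return "tcp"
    return classify(first_bytes)
```

The overload trap is visible here: if a busy node delays the client's first bytes past `timeout_s`, an ordinary HTTP connection is classified as "tcp" and any HTTP-level routing rule silently stops applying to it.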

12. Multi-cluster is built into the proxy layer, not bolted on

Kubernetes itself doesn't have multi-cluster. Every real platform runs multiple clusters anyway.

The reality: HA across regions, separation of concerns, different failure domains — every modern Kubernetes platform has multiple clusters.
The Linkerd answer: Service A talks to service B without knowing whether B is on the same cluster, a different cluster, or whether traffic is being migrated halfway between them.
The gotcha: Multi-cluster requires a shared trust anchor — a root TLS certificate distributed across clusters. "Trust anchor rotation is one of the hardest operational tasks in Linkerd today." If you mess up the certs, Linkerd refuses to talk and everything stops.

13. The proxy was deliberately not Kubernetes-specific

A small architectural decision in the 2.0 rewrite paid off later.

The split: The proxy knows how to talk to the Linkerd control plane. It does not know it's in Kubernetes.
The payoff: Mesh expansion — running the proxy on a VM that you can't migrate into Kubernetes. The same proxy works because it doesn't care about its environment.
The other 2.0 decision he's proud of: Making the control-plane / data-plane distinction explicit. The first version had ad-hoc coordination between independent proxies; 2.0 made it a clean split.

14. Observability for free — but bring your own backend

Linkerd instruments traffic. You decide where the metrics and traces go.

The metrics: Request latency histograms, success/failure rates, throughput per endpoint, exposed in Prometheus format. Datadog, Grafana Cloud, your own Prometheus — all work.
The tracing: Linkerd emits distributed traces. Collection and processing are your problem.
The first-time-on moment: "I didn't know A was talking to B. I thought it was talking to C. And actually 80% of the time it's getting success, but 20% of the time it's failing on the /foo endpoint." First-time mesh installs reveal what's actually happening in your cluster.
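The "80% success, 20% failing on /foo" discovery is, at its core, an aggregation over per-request records. The sketch below assumes each observed request is a `(client, server, route, ok)` tuple; that shape and the `success_rates` name are made up for illustration, since in practice the proxy exports counters and a metrics backend does this arithmetic.

```python
from collections import defaultdict


def success_rates(
    requests: list[tuple[str, str, str, bool]],
) -> dict[tuple[str, str, str], float]:
    """Map each (client, server, route) edge to its fraction of successful requests."""
    totals: dict = defaultdict(int)
    successes: dict = defaultdict(int)
    for client, server, route, ok in requests:
        edge = (client, server, route)
        totals[edge] += 1
        successes[edge] += ok   # True counts as 1, False as 0
    return {edge: successes[edge] / totals[edge] for edge in totals}
```

The keys of the result double as a map of who is actually talking to whom, which is the part that tends to surprise people on a first install.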

15. Paying customers gave Buoyant the best product feedback they've ever had

William expected enterprise customers to be transactional. They turned out to be the opposite.

The surprise: "I thought they'd be like, take my money and go away. I hate you because I have to pay. No, they want us involved, and they want to tell us their problems."
The shape of recent releases: Not big-bang features. Refinements to protocol detection, control-plane components, the things that actually matter when you're in production at scale.
The pricing model: Free for production use at companies with fewer than 50 employees in total. Above that, pricing follows deployment scale (number of clusters, footprint). It is "not $5", but it is aligned with the value the customer is extracting.

🧠 What I Learned

Twitter's painful Ruby-on-Rails-to-Scala-microservices rewrite invented patterns the rest of the industry would later adopt — and Linkerd was built to give those patterns to companies that couldn't invent them from scratch.
The first version of Linkerd was a network proxy wrapping Twitter's open source Scala libraries. The modern version is Rust + Go, purpose-built for Kubernetes, with no Scala in sight.
The JVM couldn't get the proxy under 150 megs of memory; that single fact forced the rewrite to Rust.
Buoyant funded early work on Tokio, Hyper, and H2 because the Rust async networking ecosystem didn't exist when Linkerd needed it in 2018.
A service mesh handles the application-level networking concerns — retries, timeouts, mTLS, multi-cluster routing — that otherwise every developer would have to reimplement in their service.
Linkerd's audience is platform engineers and SREs, not developers. The mesh's success is when devs never have to think about it.
Open core is the only sustainable model for cloud native infrastructure: every Linkerd maintainer is a paid Buoyant employee, funded by enterprise sales.
Drawing the open-source-vs-enterprise line: anything security-related stays free; compliance, cost-saving features like HAZL, and operational tools become enterprise.
Cross-zone traffic on AWS is a major cost line item at scale; HAZL pins traffic in-zone until reliability degrades.
Protocol detection looks easy on paper; it's hard in practice because of timeouts, idle connections, and cluster CPU pressure.
Trust anchor rotation across multiple clusters is one of Linkerd's hardest operational pain points, and one Buoyant is actively simplifying.
Linkerd's proxy is intentionally not Kubernetes-specific, which enabled "mesh expansion" — running the same proxy on legacy VMs.
Paying customers turned out to give the best product feedback — recent Linkerd releases are about refinement, not big-bang features.
Sustainable open source in the cloud native space is no longer a "nights and weekends" model — it requires a commercial engine to keep maintainers paid and projects alive for the long term.

💬 Favorite Quotes

"Function calls are very predictable, and they're very fast, and they basically don't fail. Now you've got network calls and dramatically different semantics."

"The only way you can have a truly sustainable open source project in the Kubernetes ecosystem is by having a commercial engine behind it. When we haven't had that, that's when the projects have vanished."

"Our goal is to make life easier for the devs. And ideally, they would never know about the service mesh."

"There's no way that a human being can write C++ code without making it insecure."

"I thought they'd be like, take my money and go away. I hate you because I have to pay. No, they want us involved, and they want to tell us their problems."

"I think we should be able to get Linkerd to a hundred years old. I'll be long gone, but the project will continue."

🎯 Also in this Episode

William's path from a begged-for DOS PC and BASIC, through high-school Linux floppy-disk installs, to NLP research and infrastructure
The Twitter photo service and embedded timelines — what he actually shipped at Twitter
Why Mesos was Twitter's pre-Kubernetes orchestrator and what it shared with the modern stack
Why Buoyant tried SaaS first, why it failed, and why the enterprise distribution model worked
The CNCF, what it is, and why projects like Linkerd, Prometheus, Kubernetes, and etcd live there
Linkerd vs Istio — William's honest take on when to pick what
GitOps and how SREs configure Linkerd with Kubernetes CRDs, not Rust code
The 50-employee free tier and the philosophy behind it
William's transition from engineer to CEO and what he learned about enterprise sales
Hyperion, Gideon the Ninth, and Gene Wolfe's The Book of the New Sun as William's favorite sci-fi reads

Episode Length: 70 minutes on service mesh internals, the Twitter monolith breakup, the Scala-to-Rust rewrite, and how a graduated CNCF project actually pays its maintainers.

Whether you're a platform engineer evaluating Linkerd vs Istio, a developer who's never thought about what a service mesh does for you, or an open source maintainer trying to figure out a sustainable funding model, this conversation has something immediately useful.

Happy building,
Dan

