Episode 31 70 minutes

Service Mesh at Scale with William Morgan

Key Takeaways from our conversation with William Morgan

William Morgan

CEO of Buoyant, Creator of Linkerd, ex-Twitter Engineer

Señors @ Scale host Neciu Dan sits down with William Morgan, CEO of Buoyant and creator of Linkerd — the world's first service mesh and a graduated CNCF project. William's path runs from teaching himself BASIC on a begged-for DOS PC, through Twitter's painful migration off Ruby on Rails into JVM-based microservices, and into building the proxy that handles retries, mTLS, load balancing, and multi-cluster traffic for thousands of production Kubernetes clusters. From the Scala-to-Rust rewrite to why every sustainable cloud native open source project needs a commercial engine behind it, this is the infrastructure conversation most application developers never get to have.

🎧 New Señors @ Scale Episode

This week, I spoke with William Morgan, CEO of Buoyant and the creator of Linkerd — the world's first service mesh and a graduated CNCF project. William spent nearly four years at Twitter in the Ruby-on-Rails-to-microservices era, building the photo service and the embedded timelines product, and watched first-hand as Twitter invented the patterns that would later become foundational infrastructure for everyone running Kubernetes at scale. He took those lessons, founded Buoyant, and built Linkerd to give the rest of the world the tools Twitter had to invent for itself.

In this episode, we go deep into what a service mesh actually is, why Linkerd rewrote everything from Scala to Rust, how Buoyant funds its maintainers through an open core model, the protocol-detection edge cases that bite you in production, and why every cloud native open source project needs a commercial engine behind it to survive.

⚙️ Main Takeaways

1. Twitter's monolith-to-microservices rewrite was the seed for Linkerd

William joined Twitter in 2010 right as the company decided to rewrite everything off Ruby on Rails onto JVM-based microservices in Scala — and he was against it.

  • The starting point: Twitter was a popular Ruby on Rails app with severe scaling issues — the fail whale era, asymmetric traffic patterns where a single user with tens of millions of followers could destabilize the system.
  • The decision: Within his first six months, Twitter committed to decomposing the monolith into microservices on the JVM using Scala, deployed on Mesos, communicating over Thrift.
  • The personal pivot: William was a Ruby fan and "got bullied into doing it." Watching that transformation — and the new failure modes it introduced — became the basis for everything he built next.

2. The hidden cost of decomposing a monolith is the network

Function calls don't fail. Network calls do — and that single difference changes everything.

  • The shift: "Function calls are very predictable, and they're very fast, and they basically don't fail. Now you've got network calls and dramatically different semantics."
  • The new problems: Latency jumps from hundreds of microseconds to hundreds of milliseconds. Calls can fail. Data has to be encoded, sent over the wire, and decoded. Shared in-memory state no longer exists.
  • The Twitter solution: Libraries that handled retries, timeouts, load balancing, circuit breaking, rate limiting, and service discovery — so the photo service developer could just say "talk to the storage service" and not care about the rest.
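
To make the retry piece concrete, here is a minimal sketch of what such a library does on the caller's behalf. It is not Twitter's or Linkerd's code: `TransientError`, `call_with_retries`, and `flaky_storage_call` are hypothetical names, and the attempt count and backoff values are assumptions chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a failed network call."""


def call_with_retries(call, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Invoke `call`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a fake storage call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_storage_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("storage service unavailable")
    return "photo-bytes"


# The sleep is stubbed out so the example runs instantly.
result = call_with_retries(flaky_storage_call, sleep=lambda _: None)
```

The caller only sees the final result. The two failed attempts are absorbed by the wrapper, which is the same job the mesh proxy later takes over.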

3. The first version of Linkerd was a proxy wrapper around Twitter's open source Scala libraries

Linkerd started as a way to get Twitter's networking primitives into the hands of teams not writing Scala.

  • The constraint: Twitter had open sourced beautiful Scala libraries that handled hard distributed systems problems. But "if you're into that, great" — most companies weren't writing Scala.
  • The hack: Wrap the Scala libraries in a network proxy. "Let's put a proxy on both sides" of every call. Now language doesn't matter — your service just talks HTTP to the proxy, and the proxy handles load balancing, retries, and the rest.
  • The evolution: Modern Linkerd has zero Scala. The proxy is in Rust, the control plane in Go, and it's purpose-built for Kubernetes.

4. The Scala-to-Rust rewrite was forced by the JVM's memory floor

The JVM is great at scaling up. It's terrible at scaling down — and the service mesh proxy needs to scale down to nearly nothing.

  • The number: Linkerd's JVM-based proxy could not get under 150 megs of memory, no matter what they tried — including GraalVM and other tricks.
  • The pitch problem: "I'm running this Go microservice that takes 50 megs. And you're telling me I need to add 150 megs of proxy on top of that. That's not transparent or lightweight."
  • The choice: Rust gave them memory safety guarantees that mattered for customers running medical data and financial transactions. C++ was out — "There's no way that a human being can write C++ code without making it insecure." The result is a "micro-proxy" that's roughly a tenth the size of Envoy.

5. Linkerd funded the early Rust networking ecosystem

When Linkerd picked Rust in 2018, the ecosystem barely existed.

  • The state of play: Rust 1.0 had shipped in 2015, but async/await was not yet stable and the async networking libraries were embryonic.
  • The investment: Buoyant funded a lot of early work on H2, Hyper, and Tokio so the primitives would exist for the proxy.
  • The artifact: The linkerd2-proxy repo on GitHub is one of the most sophisticated examples of asynchronous user-space networking code in Rust.

6. What a service mesh actually is — and who it's for

The term is meaningless on its own; Buoyant invented it to describe what Linkerd does.

  • The scope: Application-level networking on top of Kubernetes — retries, timeouts, load balancing, circuit breaking, mTLS, multi-cluster routing, observability. Not L4 plumbing — Kubernetes already gives you TCP between IPs.
  • The audience: Not developers. Platform owners. "Our goal is to make life easier for the devs. And ideally, they would never know about the service mesh."
  • The split: Devs own application code. Platform owners own Kubernetes plus the service mesh. The mesh handles the cross-cutting concerns the devs would otherwise have to reimplement everywhere.

7. mTLS, identity, and policy come for free at the platform layer

Asking every developer to implement TLS, rotate certificates every 24 hours, and manage a CA is a recipe for it not happening.

  • The mechanism: Linkerd issues a cryptographic identity in a TLS certificate to every pod. Every service-to-service call is encrypted and authenticated. Certificates rotate automatically.
  • The Twitter parallel: "Client identity was a big problem for Twitter because we didn't have that. Suddenly the photo service has 10x of traffic and you're like, where is this coming from?"
  • The next layer: Once you have identity, you can write policy — service A can call service B's /get but not /delete, gRPC method allow/deny lists, etc.
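
The "/get but not /delete" idea boils down to a default-deny lookup keyed on the client's identity. The sketch below is an illustration only, not Linkerd's policy API: `Rule`, `ALLOWED`, and `is_allowed` are hypothetical names, and the identity string is assumed to come from the client's mTLS certificate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    client: str   # identity taken from the client's mTLS certificate (assumed)
    method: str   # HTTP method
    path: str     # request path


# Hypothetical allow-list for service B: anything not listed is denied.
ALLOWED = {
    Rule("service-a", "GET", "/get"),
}


def is_allowed(client: str, method: str, path: str) -> bool:
    """Default-deny check: permit only requests matching an explicit rule."""
    return Rule(client, method, path) in ALLOWED
```

Because the check keys on a cryptographic identity rather than an IP address, it keeps working as pods are rescheduled and their addresses change.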

8. Sophisticated load balancing is not optional at scale

When you have 100 pods of service A talking to 1000 pods of service B, naive connection management falls apart.

  • The connection problem: Every A pod opens connections to every B pod. You hit OS limits. Linkerd upgrades all of those into a single multiplexed HTTP/2 connection per pair and does request-level load balancing.
  • The latency-aware bit: Linkerd tracks response latency per endpoint. If a JVM pod stalls in a 200-300ms GC pause, Linkerd steers traffic away from it until it recovers.
  • The dev experience: None of this is in your code. It's all happening at the proxy layer.
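
One common way to build latency-aware balancing is an exponentially weighted moving average (EWMA) of response latency per endpoint, combined with picking the better of two randomly sampled endpoints. The sketch below shows that general technique under assumptions. The `Endpoint` class, the `alpha` value, and the sample latencies are all hypothetical, and this is not Linkerd's actual implementation.

```python
import random


class Endpoint:
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha      # weight given to the newest sample (assumed value)
        self.ewma_ms = None     # no observations yet

    def observe(self, latency_ms):
        """Fold a new latency sample into the moving average."""
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms


def pick(endpoints, rng=random):
    """Sample two distinct endpoints and return the one with the lower EWMA."""
    a, b = rng.sample(endpoints, 2)

    def score(endpoint):
        # Endpoints with no data yet are treated as fast so they get tried.
        return endpoint.ewma_ms if endpoint.ewma_ms is not None else 0.0

    return a if score(a) <= score(b) else b


# Usage: one healthy pod and one stuck in a GC pause.
healthy = Endpoint("b-1")
stalled = Endpoint("b-2")
healthy.observe(5.0)
stalled.observe(250.0)
chosen = pick([healthy, stalled])
```

With only two endpoints both are always sampled, so `chosen` is the healthy pod. With a thousand endpoints the same comparison runs on a random pair per request, which avoids scanning the whole list.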

9. Open core is the only sustainable model for cloud native infrastructure projects

William has strong, controversial opinions about open source funding — earned over a decade.

  • The premise: "The only way you can have a truly sustainable open source project in the Kubernetes ecosystem is by having a commercial engine behind it." Without that, projects die.
  • The Linkerd model: Buoyant sells Buoyant Enterprise for Linkerd. That money funds the maintainers — every Linkerd maintainer is a full-time Buoyant employee. When someone good shows up writing Linkerd code, they get a job offer.
  • The contribution reality: Linkerd's audience is SREs and platform engineers, not developers writing Rust. The barrier to contributing async Rust networking code is enormous, so most users "just want it to work."

10. Drawing the line between open source and enterprise features

The crappy way is to put features everyone needs behind the paywall. The good way is to find what businesses care about that individuals don't.

  • The principle for Linkerd: Anything around security stays in the open source version. "We can't have an insecure version."
  • Compliance is enterprise: SBOMs, compliance attestations, audit features. "The only reason you ever care about that is because you have some annoying compliance team in your company."
  • HAZL — High Availability Zonal Load Balancing: AWS charges for cross-zone traffic. Kubernetes spreads traffic evenly across zones for reliability. Linkerd Enterprise's HAZL keeps traffic in-zone unless reliability degrades — saving cross-AZ costs at scale. "If you're worried about cost savings on cross-zone traffic, you're a big company and that's something you pay for."
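
The in-zone-unless-degraded behaviour can be pictured as a two-step filter. This is a rough sketch of the idea as described in the episode, not HAZL's real algorithm: `choose_endpoints`, the tuple layout, the zone names, and the 0.95 success-rate threshold are all assumptions made for illustration.

```python
def choose_endpoints(endpoints, local_zone, min_success_rate=0.95):
    """Prefer same-zone endpoints; spill to other zones if local ones look unhealthy.

    `endpoints` is a list of (name, zone, success_rate) tuples.
    """
    local = [e for e in endpoints if e[1] == local_zone]
    healthy_local = [e for e in local if e[2] >= min_success_rate]
    if healthy_local:
        return healthy_local  # stay in-zone: no cross-zone traffic
    # Local zone is empty or degraded: fall back to healthy endpoints anywhere.
    return [e for e in endpoints if e[2] >= min_success_rate]


# Usage: the same two pods, before and after the local one degrades.
normal = [("b-1", "us-east-1a", 0.99), ("b-2", "us-east-1b", 0.99)]
degraded = [("b-1", "us-east-1a", 0.50), ("b-2", "us-east-1b", 0.99)]

in_zone = choose_endpoints(normal, "us-east-1a")
spilled = choose_endpoints(degraded, "us-east-1a")
```

In the normal case only the local pod is returned. Once its success rate drops below the threshold, traffic spills over to the other zone and the cross-zone cost is accepted in exchange for reliability.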

11. Protocol detection is the kind of feature that's easy to design and hard to operate

Linkerd auto-detects whether a connection is HTTP, HTTP/2, or raw TCP — and that decision is full of edge cases.

  • The happy path: Connection opens, app sends bytes, Linkerd peeks at them, routes accordingly.
  • The edge case: App opens a connection and sends nothing. Linkerd needs to know the protocol to know where to route — config might say "if HTTP and going to foo, send to bar." So there's a 10-second timeout, after which it falls back to TCP.
  • The cluster-overload trap: Under heavy CPU pressure, those timeouts start firing for normal traffic. Behavior changes in ways operators can't predict. The fix has been refining behavior across releases — not a big-bang feature, just hard-won operational nuance.
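
The happy path and the timeout fallback can be sketched as a classifier over the first bytes a client sends. This is a simplified illustration, not the linkerd2-proxy logic. `detect_protocol` is a hypothetical function, the caller is assumed to pass an empty byte string when the detection deadline expires with nothing received, and partial reads are ignored. The HTTP/2 client connection preface itself is the one defined by the HTTP/2 specification.

```python
# Client connection preface defined by the HTTP/2 specification.
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

# Request-line prefixes used to recognise HTTP/1.x traffic.
HTTP1_METHODS = (
    b"GET ", b"HEAD ", b"POST ", b"PUT ", b"DELETE ",
    b"OPTIONS ", b"PATCH ", b"CONNECT ", b"TRACE ",
)


def detect_protocol(first_bytes: bytes) -> str:
    """Classify a connection from the first bytes the client sent.

    An empty `first_bytes` models the timeout case: the client opened the
    connection and sent nothing before the detection deadline.
    """
    if not first_bytes:
        return "tcp"  # nothing to inspect: treat as opaque TCP
    if first_bytes.startswith(HTTP2_PREFACE):
        return "http/2"
    if first_bytes.startswith(HTTP1_METHODS):
        return "http/1"
    return "tcp"
```

The overload trap falls out of the first branch: if a busy node delays the client's bytes past the deadline, real HTTP traffic takes the empty-input path and gets handled as plain TCP.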

12. Multi-cluster is built into the proxy layer, not bolted on

Kubernetes itself has no built-in multi-cluster support. Every real platform runs multiple clusters anyway.

  • The reality: HA across regions, separation of concerns, different failure domains — every modern Kubernetes platform has multiple clusters.
  • The Linkerd answer: Service A talks to service B without knowing whether B is on the same cluster, a different cluster, or whether traffic is being migrated halfway between them.
  • The gotcha: Multi-cluster requires a shared trust anchor — a root TLS certificate distributed across clusters. "Trust anchor rotation is one of the hardest operational tasks in Linkerd today." If you mess up the certs, Linkerd refuses to talk and everything stops.

13. The proxy was deliberately not Kubernetes-specific

A small architectural decision in the 2.0 rewrite paid off later.

  • The split: The proxy knows how to talk to the Linkerd control plane. It does not know it's in Kubernetes.
  • The payoff: Mesh expansion — running the proxy on a VM that you can't migrate into Kubernetes. The same proxy works because it doesn't care about its environment.
  • The other 2.0 decision he's proud of: Making the control-plane / data-plane distinction explicit. The first version had ad-hoc coordination between independent proxies; 2.0 made it a clean split.

14. Observability for free — but bring your own backend

Linkerd instruments traffic. You decide where the metrics and traces go.

  • The metrics: Request latency histograms, success/failure rates, throughput per endpoint, exposed in Prometheus format. Datadog, Grafana Cloud, your own Prometheus — all work.
  • The tracing: Linkerd emits distributed traces. Collection and processing are your problem.
  • The first-time-on moment: "I didn't know A was talking to B. I thought it was talking to C. And actually 80% of the time it's getting success, but 20% of the time it's failing on the /foo endpoint." First-time mesh installs reveal what's actually happening in your cluster.
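
The "80% success, 20% failing on /foo" observation is just per-route aggregation over what the proxy sees. A minimal sketch, assuming each request is recorded as a `(source, destination, path, ok)` tuple. The record shape and the `success_rates` helper are hypothetical and are not Linkerd's metrics format.

```python
from collections import defaultdict


def success_rates(requests):
    """Aggregate (source, destination, path, ok) records into per-route success rates."""
    totals = defaultdict(lambda: [0, 0])  # route -> [successes, total]
    for src, dst, path, ok in requests:
        counts = totals[(src, dst, path)]
        counts[0] += 1 if ok else 0
        counts[1] += 1
    return {route: s / n for route, (s, n) in totals.items()}


# Hypothetical sample: A calls B's /foo ten times, and the first two calls fail.
sample = [("A", "B", "/foo", i >= 2) for i in range(10)]
rates = success_rates(sample)
```

Here `rates` maps the route `("A", "B", "/foo")` to 0.8. In a real install the raw counters would be scraped by your metrics backend and the division done in a query there.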

15. Paying customers gave Buoyant the best product feedback they've ever had

William expected enterprise customers to be transactional. They turned out to be the opposite.

  • The surprise: "I thought they'd be like, take my money and go away. I hate you because I have to pay. No, they want us involved, and they want to tell us their problems."
  • The shape of recent releases: Not big-bang features. Refinements to protocol detection, control-plane components, the things that actually matter when you're in production at scale.
  • The pricing model: Free for production use at companies with fewer than 50 employees in total. Above that, pricing is based on deployment scale: number of clusters and overall footprint. "Not $5", but aligned with the value the customer is extracting.

🧠 What I Learned

  • Twitter's painful Ruby-on-Rails-to-Scala-microservices rewrite invented patterns the rest of the industry would later adopt — and Linkerd was built to give those patterns to companies that couldn't invent them from scratch.
  • The first version of Linkerd was a network proxy wrapping Twitter's open source Scala libraries. The modern version is Rust + Go, purpose-built for Kubernetes, with no Scala in sight.
  • The JVM couldn't get the proxy under 150 megs of memory; that single fact forced the rewrite to Rust.
  • Buoyant funded early work on Tokio, Hyper, and H2 because the Rust async networking ecosystem barely existed when Linkerd needed it in 2018.
  • A service mesh handles the application-level networking concerns — retries, timeouts, mTLS, multi-cluster routing — that otherwise every developer would have to reimplement in their service.
  • Linkerd's audience is platform engineers and SREs, not developers. The mesh's success is when devs never have to think about it.
  • Open core is the only sustainable model for cloud native infrastructure: every Linkerd maintainer is a paid Buoyant employee, funded by enterprise sales.
  • Drawing the open-source-vs-enterprise line: anything security-related stays free; compliance, cost-saving features like HAZL, and operational tools become enterprise.
  • Cross-zone traffic on AWS is a major cost line item at scale; HAZL pins traffic in-zone until reliability degrades.
  • Protocol detection looks easy on paper; it's hard in practice because of timeouts, idle connections, and cluster CPU pressure.
  • Trust anchor rotation across multiple clusters is one of Linkerd's hardest operational pain points, and one Buoyant is actively simplifying.
  • Linkerd's proxy is intentionally not Kubernetes-specific, which enabled "mesh expansion" — running the same proxy on legacy VMs.
  • Paying customers turned out to give the best product feedback — recent Linkerd releases are about refinement, not big-bang features.
  • Sustainable open source in the cloud native space is no longer a "nights and weekends" model — it requires a commercial engine to keep maintainers paid and projects alive for the long term.

💬 Favorite Quotes

"Function calls are very predictable, and they're very fast, and they basically don't fail. Now you've got network calls and dramatically different semantics."

"The only way you can have a truly sustainable open source project in the Kubernetes ecosystem is by having a commercial engine behind it. When we haven't had that, that's when the projects have vanished."

"Our goal is to make life easier for the devs. And ideally, they would never know about the service mesh."

"There's no way that a human being can write C++ code without making it insecure."

"I thought they'd be like, take my money and go away. I hate you because I have to pay. No, they want us involved, and they want to tell us their problems."

"I think we should be able to get Linkerd to a hundred years old. I'll be long gone, but the project will continue."

🎯 Also in this Episode

  • William's path from a begged-for DOS PC and BASIC, through high-school Linux floppy-disk installs, to NLP research and infrastructure
  • The Twitter photo service and embedded timelines — what he actually shipped at Twitter
  • Why Mesos was Twitter's pre-Kubernetes orchestrator and what it shared with the modern stack
  • Why Buoyant tried SaaS first, why it failed, and why the enterprise distribution model worked
  • The CNCF, what it is, and why projects like Linkerd, Prometheus, Kubernetes, and etcd live there
  • Linkerd vs Istio — William's honest take on when to pick what
  • GitOps and how SREs configure Linkerd with Kubernetes CRDs, not Rust code
  • The 50-employee free tier and the philosophy behind it
  • William's transition from engineer to CEO and what he learned about enterprise sales
  • Hyperion, Gideon the Ninth, and Gene Wolfe's The Book of the New Sun as William's favorite sci-fi reads

Resources

CNCF & ecosystem:

  • CNCF — Cloud Native Computing Foundation, where Linkerd, Prometheus, etcd, and Kubernetes live

Episode Length: 70 minutes on service mesh internals, the Twitter monolith breakup, the Scala-to-Rust rewrite, and how a graduated CNCF project actually pays its maintainers.

Whether you're a platform engineer evaluating Linkerd vs Istio, a developer who's never thought about what a service mesh does for you, or an open source maintainer trying to figure out a sustainable funding model, this conversation has something immediately useful.

Happy building,
Dan
