Dan Odell
Staff Software Engineer at Canva, ex-Volvo, author of Performance Engineering in Practice
Señors @ Scale host Neciu Dan sits down with Dan Odell, staff software engineer at Canva working on systems that serve over 250 million active users. Dan's path started with electronic engineering in the late 90s, took him through marketing sites for IBM and Johnson & Johnson at AKQA, his own consultancy for clients like UNICEF and MINI, nine years on Volvo's e-commerce and car configurator, and finally to Canva's charts and visualizations team. He's also the author of Performance Engineering in Practice, out now through Manning's Early Access Program, which introduces the Fast by Default framework. From feature-flagged staged rollouts and test parties to operational transforms, the performance decay cycle, shrinking the critical path, and perceived performance, this is the conversation about making software fast — and keeping it fast — at scale.
🎧 New Señors @ Scale Episode
This week, I spoke with Dan Odell, staff software engineer at Canva, where he works on systems serving over 250 million active users. Before Canva, Dan spent nine years at Volvo working on their e-commerce platform across 70-plus markets, and before that he was global head of web development for clients like Nike and MINI. He started out in electronic engineering in the late 90s — tinkering with HTML and JavaScript when it was split between Netscape and Internet Explorer — built marketing sites for IBM and Johnson & Johnson at AKQA, and ran his own consultancy for ten years. He's also a published author, and his latest book, Performance Engineering in Practice, is out now through Manning's Early Access Program.
In this episode, we dig into how Canva ships safely to a quarter of a billion users, the testing and rollout machinery that catches bugs before they scale, the tech stack behind a real-time collaborative editor, and the Fast by Default framework from his book — why performance is everyone's responsibility, how to break the performance decay cycle, and why perceived performance often matters more than raw speed.
⚙️ Main Takeaways
1. At Canva's scale, everything scales — including bugs
Small decisions that work for a marketing site become serious liabilities at 250 million users.
- The factor change: Dan's early AKQA marketing sites were seen by anywhere from a few thousand to 100,000 people, the Nike homepage by a couple million a month, and Volvo's site by a million a month. Canva is "a whole other factor altogether."
- Edge cases stop being edge cases: "An edge case affects like 0.1 % of people. Suddenly you're talking about 100,000 people or a million people are affected by this."
- The consequence: "So everything scales, including bugs." Testing has to be really thorough, and the processes have to be in place for it.
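To make "everything scales" concrete, here is a tiny TypeScript sketch of the blast-radius arithmetic. The audience sizes are the rough figures from the episode, and `affectedUsers` is a hypothetical helper, not anything from Canva's codebase.

```typescript
// Hypothetical helper: how many people a bug that hits a fixed share of users reaches.
function affectedUsers(audience: number, edgeCaseRate: number): number {
  return Math.round(audience * edgeCaseRate);
}

// Rough audience sizes mentioned in the episode.
const audiences: Array<[string, number]> = [
  ["AKQA marketing site", 100_000],
  ["Nike homepage (per month)", 2_000_000],
  ["Canva (active users)", 250_000_000],
];

const EDGE_CASE_RATE = 0.001; // 0.1%

for (const [name, size] of audiences) {
  // Prints 100, 2000 and 250000 people respectively.
  console.log(`${name}: ${affectedUsers(size, EDGE_CASE_RATE)} people`);
}
```

At Canva's scale the same 0.1% bug reaches about 250,000 people, squarely inside the 100,000-to-a-million range from the quote.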
2. Staged, feature-flagged rollouts contain blast radius
Nothing goes to everyone at once — it ripples out through cohorts.
- Everything behind a feature flag: Changes roll out to internal staff first, where 5,000–6,000 people give early feedback and analytics.
- Dogfooding everything: "Everything we build, everything we use internally at Canva is Canva" — docs, presentations, all of it. If something's missing, they build it in Canva and use it.
- The widening cohorts: Staff, then opted-in beta testers, then real users geofenced by region ("like 10% in Latin America, for example"), measuring at every step over days and weeks until the change reaches 100% of users worldwide.
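The cohort ladder above can be sketched as a flag check. This is a hypothetical TypeScript illustration, not Canva's flag service: the `User` and `Rollout` shapes, the region codes, and the djb2-based bucketing are all assumptions chosen to show how staged percentages stay stable per user.

```typescript
// A minimal sketch of a staged, feature-flagged rollout. The User and Rollout
// shapes, cohort names, and hashing scheme are illustrative assumptions,
// not Canva's actual flagging system.

type User = { id: string; isStaff: boolean; isBetaTester: boolean; region: string };

type Rollout = {
  flag: string;
  staff: boolean; // stage 1: internal dogfooding
  beta: boolean; // stage 2: opted-in beta testers
  regionPercent: { [region: string]: number }; // stage 3: e.g. { LATAM: 10 }
  globalPercent: number; // stage 4: ramp everyone up to 100
};

// Deterministic bucket in [0, 100): hashing flag + user ID (djb2) means a user
// keeps the same bucket across sessions and as the percentage ramps up.
function bucket(flag: string, userId: string): number {
  const key = flag + ":" + userId;
  let hash = 5381;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 33 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return hash % 100;
}

function isEnabled(rollout: Rollout, user: User): boolean {
  if (rollout.staff && user.isStaff) return true;
  if (rollout.beta && user.isBetaTester) return true;
  const regional = rollout.regionPercent[user.region] ?? 0;
  return bucket(rollout.flag, user.id) < Math.max(regional, rollout.globalPercent);
}

// Example: staff and beta testers, plus 10% of users in Latin America.
const newCharts: Rollout = {
  flag: "new-charts",
  staff: true,
  beta: true,
  regionPercent: { LATAM: 10 },
  globalPercent: 0,
};
console.log(isEnabled(newCharts, { id: "u-1", isStaff: true, isBetaTester: false, region: "EU" })); // true
```

Because the bucket depends only on the flag and the user ID, widening from 10% in one region to 100% worldwide only ever adds users; nobody who already has the feature loses it mid-rollout.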
3. The testing pyramid plus test parties
Layered automated testing, then humans deliberately trying to break it.
- The layers: Unit tests (logic split from presentation), component/functional tests in isolation, integration tests with other components, and full-flow suites built by test leads.
- Where they run: Unit tests must pass locally before commit; GitHub Actions run component and some integration tests in CI/CD; the heavy full-context suites and visual regression tests run against the whole monorepo at PR merge time rather than on every CI run.
- Test parties: "You get a whole bunch of developers together and the QAs and you test a feature that you're about to release." Around 12 developers, a few QAs, and product all try to break it with weird inputs. Parties run every week or every few weeks, and they're especially useful for AI features, where plain-English prompts mean "anything can happen."
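Splitting logic from presentation is what makes the bottom of the pyramid cheap. A hypothetical example in TypeScript: the percentage calculation a chart might need lives in a plain function (`toPercentages` is an invented name), so it can be unit-tested, including the odd inputs a test party would throw at it, without rendering a component.

```typescript
// Hypothetical chart logic kept out of the React component so it can be
// unit-tested without rendering anything.
function toPercentages(values: number[]): number[] {
  const total = values.reduce((sum, v) => sum + v, 0);
  if (total === 0) return values.map(() => 0); // avoid dividing by zero
  return values.map((v) => Math.round((v / total) * 1000) / 10); // one decimal place
}

// A tiny assertion helper so the example needs no test framework.
function expectEqual(actual: number[], expected: number[]): void {
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error(`expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`);
  }
}

expectEqual(toPercentages([1, 1, 2]), [25, 25, 50]);
expectEqual(toPercentages([1, 2]), [33.3, 66.7]);
expectEqual(toPercentages([0, 0]), [0, 0]); // the kind of odd input a test party tries
expectEqual(toPercentages([]), []);
```

The component that draws the chart then only renders what this returns, so component tests can focus on presentation.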
4. Shift left means catch it on your machine, not in CI
Speeding up the feedback loop by running CI checks locally first.
- The definition: "Running as much as possible on your development machine and running all the things that would run in CI, CD locally first."
- Why it matters: GitHub Actions are great but sometimes slow, and a giant monorepo has a lot to run, so catching errors locally avoids the slow round trip of spotting a failure in CI and going back to fix it.
- The double check: The same tests still run in CI as an extra gate, "just in case that things got missed." Tests for the changed files run as a gate, while large functional and visual regression suites run against the entire monorepo.
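A minimal sketch of the "tests for changed files" gate, assuming a hypothetical convention where `chart.ts` is covered by a `chart.test.ts` sitting next to it. The same function could back a local pre-commit hook and the CI gate, which is the point of shifting left: the same checks, run first on your own machine.

```typescript
// A minimal sketch of selecting tests for changed files. The co-located
// `*.test.ts(x)` naming convention is an assumption for illustration.
function testsForChangedFiles(changed: string[], repoFiles: Set<string>): string[] {
  const selected = new Set<string>();
  for (const file of changed) {
    if (/\.test\.tsx?$/.test(file)) {
      selected.add(file); // an edited test always runs
      continue;
    }
    const match = /^(.*)\.(tsx?)$/.exec(file);
    if (!match) continue; // docs, config, etc. are left to the full suites
    const candidate = `${match[1]}.test.${match[2]}`;
    if (repoFiles.has(candidate)) selected.add(candidate);
  }
  return Array.from(selected).sort();
}

const repo = new Set([
  "src/chart.ts",
  "src/chart.test.ts",
  "src/Legend.tsx",
  "src/Legend.test.tsx",
  "src/format.ts",
]);

// returns ["src/Legend.test.tsx", "src/chart.test.ts"]
console.log(testsForChangedFiles(["src/chart.ts", "src/format.ts", "README.md", "src/Legend.test.tsx"], repo));
```

Files without a co-located test (like `src/format.ts` here) select nothing, which is why the broader suites still need to run against the whole monorepo.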
5. The stack: React, TypeScript, Java, and protobuf-over-RPC
A binary protocol keeps front-end/back-end communication fast.
- Front and back: React and TypeScript on the front end; mostly Java on the back end, with Python for ML.
- Protogen over RPC: The front end and back end communicate with a variant of protobuf they call protogen, over RPC rather than WebSockets, sending data as binary instead of text. "If we tried to do it through JSON or something like that, it would just be too slow."
- The document abstraction: A Canva file is stored in an abstraction called the Canva Document Format (CDF), recognized by both front and back end and used for storage and communication.
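The episode doesn't go into protogen's internals, so this sketch uses the standard protobuf wire format that protogen is described as a variant of. It hand-encodes a hypothetical two-field message (an element ID and a label) to show why binary is compact: field tags and small integers become one- or two-byte varints, and strings are length-prefixed instead of quoted and keyed by name. Size is only part of the story; decoding cost is the other half, which this doesn't measure.

```typescript
// A minimal sketch of the standard protobuf wire format (not Canva's protogen):
// varint and length-delimited fields only.

const VARINT = 0;
const LENGTH_DELIMITED = 2;

// 7 bits per byte, least-significant group first; the high bit means "more bytes follow".
function encodeVarint(value: number, out: number[]): void {
  if (!Number.isSafeInteger(value) || value < 0) throw new RangeError("non-negative safe integers only");
  let v = value;
  while (v > 0x7f) {
    out.push((v & 0x7f) | 0x80);
    v = Math.floor(v / 128);
  }
  out.push(v);
}

function encodeTag(fieldNumber: number, wireType: number, out: number[]): void {
  encodeVarint(fieldNumber * 8 + wireType, out); // (field_number << 3) | wire_type
}

function encodeUint(fieldNumber: number, value: number, out: number[]): void {
  encodeTag(fieldNumber, VARINT, out);
  encodeVarint(value, out);
}

function encodeString(fieldNumber: number, value: string, out: number[]): void {
  const bytes = new TextEncoder().encode(value); // UTF-8
  encodeTag(fieldNumber, LENGTH_DELIMITED, out);
  encodeVarint(bytes.length, out);
  for (let i = 0; i < bytes.length; i++) out.push(bytes[i]);
}

// Hypothetical message: field 1 = elementId (varint), field 2 = label (string).
function encodeElement(elementId: number, label: string): Uint8Array {
  const out: number[] = [];
  encodeUint(1, elementId, out);
  encodeString(2, label, out);
  return Uint8Array.from(out);
}

const binary = encodeElement(150, "Revenue");
const json = new TextEncoder().encode(JSON.stringify({ elementId: 150, label: "Revenue" }));
console.log(binary.length, json.length); // 12 35
```

The first three bytes, `08 96 01`, are the same field-1-equals-150 encoding used as the worked example in the protobuf encoding guide.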
6. Real-time collaboration via operational transforms
Multiple people editing one document without asking the user to resolve conflicts.
- The hard problem: When people manipulate the same element at the same time, who's the source of truth? "We try not to ask the user to make decisions like that."
- The technique: They use operational transforms — "it's kind of what Google Docs uses when you're collaborating there as well" — to establish the source of truth and the right option.
- The plumbing: An internal layer built on top of WebSockets handles conflicts and the many variations coming in from lots of people doing different things very quickly on the same document.
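For a feel of what "transforming" an operation means, here is the textbook case: two people insert text into the same string at the same time. This is a simplified TypeScript sketch (inserts only, with an assumed tie-break on a site ID), not Canva's collaboration layer, which the episode describes only at a high level and which deals with design elements rather than plain text.

```typescript
// A minimal operational-transform sketch for concurrent text inserts only.
// Real OT systems also handle deletes, formatting and, in a design editor,
// element properties; this shows the core idea of transforming an operation.

type Insert = { pos: number; text: string; site: string };

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so it keeps its intent on a document where `other` was applied first.
function transform(op: Insert, other: Insert): Insert {
  const otherIsBefore =
    other.pos < op.pos || (other.pos === op.pos && other.site < op.site); // deterministic tie-break
  return otherIsBefore ? { ...op, pos: op.pos + other.text.length } : op;
}

// Alice and Bob edit "chart" at the same time, then exchange operations.
const original = "chart";
const alice: Insert = { pos: 0, text: "Bar ", site: "alice" };
const bob: Insert = { pos: 5, text: "s", site: "bob" };

const atAlice = applyInsert(applyInsert(original, alice), transform(bob, alice));
const atBob = applyInsert(applyInsert(original, bob), transform(alice, bob));
console.log(atAlice, atBob); // Bar charts Bar charts
```

Both replicas converge without asking either user to pick a winner, which is the property the "we try not to ask the user" goal depends on.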
7. Crazy big goals: architecting for a billion before it happens
Canva isn't satisfied with 250 million.
- Micro frontends on the cards: Not used yet, but the team is talking about it. Teams are already split by product area — design system and core editor, 3D graphics, photos, videos, and Dan's own charts and visualizations team — so there's clear scope for it.
- The mindset: "They have what they call crazy big goals internally at Canva... What would it look like if we were a billion? How would we scale our architecture and stuff for that?"
- Getting ahead: Making decisions early "rather than like reacting to them when they actually happen and getting into a kind of crazy panic about it then."
8. Autonomous product teams and closing the loop
Each team is a self-contained product team, and user feedback actually ships.
- Team shape: Front end, back end, product manager, and a test lead who's usually a designer too — "a whole product team basically in the team." Everyone you need is in one team; no waiting on other teams for design advice.
- Product engineering, not top-down: Most decisions come from the team. Higher-level direction (like going down the AI route with natural-language prompts) comes from on high, but day-to-day it's "proper product engineering."
- Closing the loop: Feedback submitted through the in-app feedback button gets routed to the right team. "If enough people mention the same feature or bug or things, we will absolutely work on it and get it into the pipeline."
9. Globally distributed work runs on async and documents
The sun never sets on the Canva empire.
- The geography: Canva came out of Australia; Dan's in the UK; teams are all over the world. Doing accessibility work on charts, he hands things to the Sydney team async while he sleeps and picks up their work in the morning.
- Why it differs from Volvo: Volvo was mostly Sweden with a US team; Canva is genuinely geographically split, so async is "a really important way of working."
- More documents: "There's a lot more documents that get written at Canva to document decisions" so everybody has what they need without Dan being there to answer at the exact moment.
10. The performance decay cycle is a broken way of working
The motivation behind the whole book.
- The anti-pattern: Teams treat performance as a later-stage concern — build, test, release, then discover from real-user telemetry that people aren't getting the fast experience the team had locally.
- The decay cycle: "You release code. It's not as good as it could be for the users out there. They start complaining. Most of them don't complain. They just leave." By the time you do performance sprints, weeks have passed and you don't know how many users you lost.
- The cost of degraded UX: A slow, degraded experience erodes trust in your brand and product — "that's how you lose people and that's how they don't come back."
11. Fast by Default: performance is a whole-team, full-stack problem
The framework spreads performance across the entire lifecycle instead of cramming it at the end.
- The naming: Modeled on "secure by default" and "accessible by default" ideas — performance deserves its own framework.
- Everyone owns it: Not just an engineering problem — designers, product managers, and testers all have responsibility "to make sure this product is performing from the start to finish."
- Full stack, not siloed: The principles are universal and apply across back end, front end, mobile, and desktop. Focusing on just the front or just the back end means "you're missing part of the picture."
- Not shift-everything-left: "It's not shift left in terms of getting everything done upfront. It's kind of shift everything from the end to spread out throughout the process so that it never gets to the end."
12. Shrink the critical path; offload everything else
Make the end-to-end flow as short as possible and feel fast.
- The principle: Reduce the critical path — "anything that isn't really important in the here and the now needs to be offloaded into something that runs asynchronously."
- The checkout example: Don't block the response while sequentially storing the order, sending the email, and processing payment. Flip a bit to say "we're processing your order," return immediately, then poll for status updates.
- The 100ms math: "There's a 100 millisecond delay here, add to 100 millisecond delay there. Then you've got 200 milliseconds delay" — and then the UI starts to feel slow.
- The tooling in the book: Profiling with flame charts and call stacks, throttling your machine to behave like a real old device instead of an M4 MacBook Pro, reducing JavaScript bundle size, and setting performance goals and budgets.
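The checkout example can be sketched in a few lines of TypeScript. Everything here is hypothetical: an in-memory `Map` stands in for the order database, and `storeOrder`, `processPayment`, and `sendConfirmationEmail` just sleep for 100ms each. Blocking on all three would put roughly 300ms on the critical path; instead the request flips the status to "processing", returns, and the client polls.

```typescript
// A minimal sketch of shrinking the critical path in a checkout. The in-memory
// store, status values, and helper names are hypothetical; a real system would
// use a database and a durable job queue rather than an un-awaited promise.

type OrderStatus = "processing" | "confirmed" | "failed";

const orderStatus = new Map<string, OrderStatus>();
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-ins for slow, non-critical work (about 100ms each).
async function storeOrder(orderId: string): Promise<void> {
  await sleep(100);
}
async function processPayment(orderId: string): Promise<void> {
  await sleep(100);
}
async function sendConfirmationEmail(orderId: string): Promise<void> {
  await sleep(100);
}

// Critical path: flip the status and return immediately.
function placeOrder(orderId: string): { orderId: string; status: OrderStatus } {
  orderStatus.set(orderId, "processing");
  void finishOrder(orderId); // offloaded: the response does not wait for it
  return { orderId, status: "processing" };
}

// Off the critical path: still ~300ms of work, but the user isn't blocked on it.
async function finishOrder(orderId: string): Promise<void> {
  try {
    await storeOrder(orderId);
    await processPayment(orderId);
    orderStatus.set(orderId, "confirmed");
  } catch {
    orderStatus.set(orderId, "failed");
    return;
  }
  // The email isn't part of the order's outcome, so its failure shouldn't fail the order.
  await sendConfirmationEmail(orderId).catch(() => undefined);
}

// Client side: poll for a final status instead of holding the request open.
async function pollStatus(orderId: string, intervalMs = 50, maxAttempts = 20): Promise<OrderStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = orderStatus.get(orderId) ?? "failed"; // unknown order: treat as failed
    if (status !== "processing") return status;
    await sleep(intervalMs);
  }
  return "processing"; // still running; the UI keeps showing "we're processing your order"
}

const receipt = placeOrder("order-42");
console.log(receipt.status); // "processing", returned without waiting for payment or email
pollStatus("order-42").then((status) => console.log(status)); // "confirmed" after ~200ms
```

The background steps still run one after another here; that's acceptable because they're no longer on the path the user waits on.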
13. Perceived performance: give the user something to do
When you genuinely can't make it faster, change how the wait feels.
- The failure he sees over and over: A button press, then just a loading spinner for two seconds or more, telling the user nothing about what's happening.
- The better pattern: Surface tips, interesting facts, or product information during the wait — "it gives them something to do, something to look at, something to read."
- Especially for AI: At Canva they discuss whether to show a skeleton loader or surface tips, and even let users do tangible work (like uploading an image) while the AI runs, then resolve it all into the final design.
- The loading-screen mini-game: Dan compares it to 80s games that shipped a mini Pac-Man to play while the main game loaded off tape — engaging the user "really can make that time feel a lot shorter than it actually is."
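Here is one way to sketch the "give them something to read" pattern around a slow promise, such as an AI generation call. The thresholds and the `withTips` helper are assumptions for illustration: tips only appear if the task is still running after a short delay, so fast results never flash a message, and they rotate until the task settles.

```typescript
// A minimal sketch of surfacing tips during a slow task. The thresholds and
// tip text are illustrative assumptions, not Canva's values.
async function withTips<T>(
  task: Promise<T>,
  tips: string[],
  showTip: (tip: string) => void,
  { showAfterMs = 1000, rotateEveryMs = 3000 } = {},
): Promise<T> {
  if (tips.length === 0) return task;
  let index = 0;
  let rotation: ReturnType<typeof setInterval> | undefined;
  const nextTip = () => showTip(tips[index++ % tips.length]);
  // Wait before showing anything: a fast result shouldn't flash a tip.
  const firstTip = setTimeout(() => {
    nextTip();
    rotation = setInterval(nextTip, rotateEveryMs);
  }, showAfterMs);
  try {
    return await task;
  } finally {
    // Stop showing tips as soon as the task settles, whether it succeeded or failed.
    clearTimeout(firstTip);
    if (rotation !== undefined) clearInterval(rotation);
  }
}

// Usage: a fake 250ms "generation" with placeholder tips.
const generation = new Promise<string>((resolve) => setTimeout(() => resolve("design ready"), 250));
const shown: string[] = [];
withTips(generation, ["Placeholder tip 1", "Placeholder tip 2"], (tip) => shown.push(tip), {
  showAfterMs: 50,
  rotateEveryMs: 100,
}).then((result) => console.log(result, shown));
```

A skeleton loader could be passed in as `showTip` instead without changing the structure, which keeps the skeleton-versus-tips decision a product choice rather than a rewrite.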
🧠 What I Learned
- At Canva's scale everything scales, including bugs — a 0.1% edge case is 100,000 to a million affected people.
- Every change ships behind a feature flag, rolling out from internal staff to beta testers to geofenced regions before hitting 100% worldwide.
- Test parties — a dozen developers, QAs, and product deliberately trying to break a feature — are a great gate before any staged rollout, and increasingly vital for AI features.
- Shift left means running CI checks on your own machine first; CI still re-runs everything as a double-check gate, with heavy functional and visual regression suites running against the whole monorepo at merge.
- Canva's stack: React + TypeScript front end, mostly Java back end (Python for ML), communicating via protogen — a protobuf variant — over RPC, because JSON would be too slow.
- Real-time collaboration uses operational transforms, the same family of techniques Google Docs uses, on an internal layer over WebSockets.
- The performance decay cycle — release, get complaints (most users just leave silently), then sprint to fix — is a broken way of working.
- Fast by Default treats performance as a whole-team, full-stack concern spread across the lifecycle, not crammed in at the end.
- Shrink the critical path: in a checkout, return immediately and poll for payment/email status rather than blocking the response.
- 100ms delays stack up fast, and around 200ms the UI starts to feel slow.
- When you can't make something faster, improve perceived performance — surface tips or let users keep working, the way 80s games shipped a mini-game during the load.
- Canva works async across time zones, writing far more decision documents than Dan did at Volvo.
💬 Favorite Quotes
"So everything scales, including bugs."
"An edge case affects like 0.1 % of people. Suddenly you're talking about 100,000 people or a million people are affected by this."
"Everything we build, everything we use internally at Canva is Canva."
"You release code. It's not as good as it could be for the users out there. They start complaining. Most of them don't complain. They just leave."
"It's not shift left in terms of getting everything done upfront. It's kind of shift everything from the end to spread out throughout the process so that it never gets to the end."
"There's a 100 millisecond delay here, add to 100 millisecond delay there. Then you've got 200 milliseconds delay."
🎯 Also in this Episode
- Dan's path: electronic engineering (and a failed attempt at assembly) in the late 90s, Pascal at university, a work-experience webpage for an audio module company, a notice-board gig building marketing sites, then 13 years at AKQA on clients like IBM, Johnson & Johnson, Nike, Nestlé, Sainsbury's, and Ferrari
- Ten years running his own consultancy on UNICEF, MINI, and the Volvo car configurator — and the nine years he spent building Volvo's e-commerce offer selector and checkout flow
- Volvo's "organized chaos" as the auto industry navigates the shift to electric, and surfacing a matching hybrid option to keep customers from dropping out of the flow
- Why long-lived products beat short-lived marketing sites, and how being embedded in the product team kept Dan at Volvo for nine years
- Accessibility work on Canva's charts product, driven by new laws in the US and Europe, done closely with the Sydney team
- A 50% discount on Performance Engineering in Practice for listeners (link in the episode description)
- Book recommendation: The Product-Minded Engineer by Drew Hoskins (O'Reilly) — and how it knocks down the habit of bending user scenarios to justify the feature you already wanted to build
Resources
More from Dan Odell:
- Performance Engineering in Practice by Dan Odell — out now through Manning's Early Access Program, introducing the Fast by Default framework (50% discount link in the episode description)
Tools & concepts mentioned:
- Canva — the design platform Dan works on, serving 250M+ active users
- Protocol Buffers (protobuf) — the binary serialization format that Canva's "protogen" variant is based on
- Operational transforms — the conflict-resolution approach behind real-time collaborative editing (as used by Google Docs)
Books mentioned:
- The Product-Minded Engineer by Drew Hoskins (O'Reilly)
🎧 Listen Now
🎧 Spotify
📺 YouTube
🍏 Apple Podcasts
Episode Length: 57 minutes on performance engineering, shipping safely to 250 million users, feature-flagged rollouts, test parties, the critical path, and why perceived performance often matters more than raw speed.
Whether you're an engineer who treats performance as an end-of-cycle cleanup or a tech lead trying to make it a whole-team habit, this conversation has something immediately actionable.
Happy optimizing,
Dan