
· javascript · 16 min read

Why are we not using Service Workers?

I feel like Service Workers are an underused technology with a lot of benefits, but they are complex to set up and often misunderstood. Here are some case studies from Slack, Mux, and me on where and how to use Service Workers.

Neciu Dan

Hi there, it's Dan, a technical co-founder of an ed-tech startup, host of Señors at Scale - a podcast for Senior Engineers, Organizer of ReactJS Barcelona meetup, international speaker and Staff Software Engineer, I'm here to share insights on combining technology and education to solve real problems.

Over the past few months at conferences and meetups in Barcelona, Paris, Cluj, London, Coimbra, and Zurich, I surveyed developers from big companies: how does your team use service workers?

I spoke with frontend leads, staff engineers, and app developers who work on apps with millions of users.

The overwhelming answer: we don’t use them.

This bothered me because the API has been available in every major browser since 2018, and I personally used it and saw the benefits firsthand.

Big corps like Google, Microsoft, or Canva use Service Workers heavily, but that is because of the nature of their products.

So I want to figure out why small and medium-sized companies aren’t using Service Workers, especially when their products need them!

My best assumption is that they don’t understand the benefits, so let’s go through what Service Workers are, how they can benefit your company, and how other big companies are using them in production.

What a service worker actually is

A service worker is a JavaScript file that the browser runs on a separate thread, outside your page.

You register it once:

if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js');
}

From that point on, it sits between your app and the network. Every request your page makes (scripts, styles, API calls, images) can pass through its fetch handler, and the handler decides what to respond with: the real network response, a cached copy, or something it fabricated on the spot.

self.addEventListener('fetch', (event) => {
  event.respondWith(
    caches.match(event.request).then((hit) => hit || fetch(event.request))
  );
});

Three properties make it different.

  • It sits on the network path.

Nothing leaves your page without going through it first, which means it can rewrite requests, synthesize responses, add headers, or answer from disk without your application code knowing anything happened.

  • It outlives your page.

The browser can wake it up after the tab is closed, which is why push notifications and background sync are only possible through a service worker. No other browser context gets resurrected after the user is gone.

  • It has no DOM access.

It talks to your app through postMessage, and through the responses it serves. It also has its own lifecycle, separate from your page’s: it installs, waits, activates, and gets terminated whenever the browser feels like it, keeping only what you explicitly persisted in the Cache API or IndexedDB.

Think of it as a proxy with a lifetime longer than your app.

Use case one: boot performance and offline support

I am going to start with a story from Slack.

In 2019, their web client booted in ~5 seconds for users with one or two workspaces.

They profiled it and found that the network was the biggest source of both latency and variability; every boot re-fetched assets, and asset fetch times swung wildly depending on connection quality.

They observed that almost nothing in that asset set changes between boots.

The user who opens Slack on Tuesday morning downloads the same JavaScript they downloaded Monday morning. So they decided to improve it.

On first boot, the client downloads the full asset set (HTML, JavaScript, CSS, fonts, and sounds) and stores it in the service worker’s Cache API. In parallel, a copy of the in-memory Redux store gets persisted to IndexedDB.

On the next boot, the client checks for those caches.

If they exist, it boots entirely from local data: cached HTML, cached bundles, hydrated Redux store. The UI is displayed on screen before a single network request completes, and fresh data is loaded in the background afterward, replacing the cached snapshot.

Slack calls this a warm boot. A cold boot is the first-ever visit, with nothing cached.

The handler Slack published is almost embarrassingly small (they note the production one carries more app-specific logic, but the shape is the same):

self.addEventListener('fetch', (e) => {
  if (assetManifest.includes(e.request.url)) {
    e.respondWith(
      caches
        .open(cacheKey)
        .then((cache) => cache.match(e.request))
        .then((response) => response || fetch(e.request))
    );
  } else {
    e.respondWith(fetch(e.request));
  }
});

But the hard part was the versioning, because Slack deploys multiple times a day, and a worker serving cached assets means users boot on assets from a previous deploy.

Their solution has three layers.

A custom webpack plugin generates a manifest of all asset files, each with a content hash, on every deploy. That manifest is embedded into the service worker file itself, so any change to any asset makes the worker byte-different, and a byte-different worker triggers the browser’s update flow automatically.

Cache buckets are keyed by deploy timestamp. An HTML file from deploy X only ever loads assets from bucket X, whether they come from cache or the network. You can never end up with HTML from deploy X loading JavaScript from deploy Y.

Buckets older than 7 days get deleted on the worker’s activate event.
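The 7-day eviction rule can be sketched as a small pure function plus an activate hook. This is my own sketch, not Slack's code: the bucket naming scheme (`assets-<deploy time in epoch ms>`) is an assumption made for illustration, since the post doesn't spell out how they name their caches.

```javascript
// Hypothetical naming scheme: one Cache API bucket per deploy,
// named "assets-<deploy time in epoch milliseconds>".
const BUCKET_PATTERN = /^assets-(\d+)$/;
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Returns the bucket names whose deploy timestamp is older than maxAgeMs.
// Names that don't match the scheme are left alone.
function staleBuckets(cacheNames, now = Date.now(), maxAgeMs = MAX_AGE_MS) {
  return cacheNames.filter((name) => {
    const match = BUCKET_PATTERN.exec(name);
    if (!match) return false;
    return now - Number(match[1]) > maxAgeMs;
  });
}

// Only attach the listener when this file runs as a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  self.addEventListener('activate', (event) => {
    // waitUntil keeps the worker alive until the deletions finish
    event.waitUntil(
      caches.keys().then((names) =>
        Promise.all(staleBuckets(names).map((name) => caches.delete(name)))
      )
    );
  });
}
```

Keeping the selection logic separate from the Cache API calls means you can unit test the eviction rule without a browser.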

That left one more problem. A warm boot, by definition, serves assets fetched at the previous worker registration. Slack deploys many times a day, and a typical user boots once each morning, so clients risked permanently running a full day behind.

Their fix: while the app is open, re-register the service worker on a jittered interval. Re-registration makes the browser check for a byte-different worker, which prefetches fresh assets for the next boot.
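Here is a minimal sketch of a jittered update check on the page side. Two assumptions to flag: the interval values are made up (Slack's real numbers aren't given here), and I use `registration.update()`, the explicit API for asking the browser to re-fetch the worker script, rather than guessing at exactly how Slack re-registers.

```javascript
// Base interval plus or minus a random spread, so clients don't all check at once.
// `random` is injectable so the helper is deterministic in tests.
function jitteredDelay(baseMs, spreadMs, random = Math.random) {
  return baseMs - spreadMs + Math.floor(random() * 2 * spreadMs);
}

// Hypothetical values, chosen only for illustration.
const BASE_MS = 60 * 60 * 1000; // one hour
const SPREAD_MS = 15 * 60 * 1000; // plus or minus 15 minutes

function scheduleUpdateCheck() {
  setTimeout(async () => {
    try {
      const registration = await navigator.serviceWorker.getRegistration();
      // update() re-fetches the worker script and installs it if it is byte-different
      if (registration) await registration.update();
    } catch (err) {
      // Offline or a failed fetch: ignore and try again after the next delay
    }
    scheduleUpdateCheck();
  }, jitteredDelay(BASE_MS, SPREAD_MS));
}

// Only start the loop in a browser that supports service workers.
if (typeof navigator !== 'undefined' && 'serviceWorker' in navigator) {
  scheduleUpdateCheck();
}
```

The jitter matters at scale: without it, every open client would hit your servers at the same moment after each interval.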

This halved the average age of assets at boot time, but there’s one more trick that I haven’t seen anyone else write about.

Slack ships features together with matching API changes, so a one-version-behind frontend could desync from the backend. To manage this, the worker caches selected API responses (feature flags, experiment assignments) in the same deploy-keyed bucket as the assets.

A warm boot gets a frontend and a flag configuration that were deployed together. Potentially stale, but always internally consistent, which matters far more.

The results: roughly 50% faster boots than the legacy client, warm boots about 25% faster than cold ones, and tens of millions of requests per day flowing through millions of installed workers within a month of release.

And offline support came as an immediate advantage. Once your app can start without needing the network, it can also function without it.

Slack users gained offline reading and the ability to mark items as unread, with sync automatically reestablished on reconnect, delivering a highly requested feature as a natural result of service worker use.

Use case two: the proxy can rewrite anything

Video streaming. If you’ve ever watched something online (think YouTube), you’ve streamed videos, and when your Internet is bad, you get a lower-quality video.

To accomplish this, HLS video streams are described in plain-text manifest files.

The player downloads a multivariant playlist listing every available rendition of the video (resolutions, codecs), picks one based on measured bandwidth, and starts fetching the segments for that rendition.

Mux had a customer streaming screencasts, and the adaptive bitrate kept doing its job too well: on slow connections, it switched viewers down to 240p. This was not ideal, as on-screen text at 240p was unreadable, and users on slow connections would much rather wait for the video to buffer than watch it at 240p.

Mux’s own player has built-in rendition filtering, and all fixes they tried didn’t work: fork the player, run a server-side proxy that rewrites manifests per customer, or tell the customer no.

Instead, they put a service worker in front of the player with one job: intercept requests for the manifest, edit the text before the player sees it.

const MIN_RESOLUTION = 720;

self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (url.hostname === 'stream.mux.com' && url.pathname.endsWith('.m3u8')) {
    event.respondWith(fetchAndFilterPlaylist(event.request));
  }
});

async function fetchAndFilterPlaylist(request) {
  const response = await fetch(request);
  const text = await response.text();
  return new Response(filterPlaylist(text), { headers: response.headers });
}

The filterPlaylist function walks the manifest line by line and drops every rendition below 720p, keeping all other HLS tags intact.

The player receives a playlist where low resolutions simply don’t exist, so it cannot pick one.
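For a feel of what that filtering involves, here is my own sketch of such a function (not Mux's implementation). It relies on one rule of the HLS format: in a multivariant playlist, each `#EXT-X-STREAM-INF` tag describes the URI on the line right after it, so the two lines are dropped together. The `minHeight` parameter name is mine.

```javascript
// Drops every variant whose RESOLUTION height is below minHeight.
// Each #EXT-X-STREAM-INF tag applies to the URI on the following line,
// so the tag and its URI are removed as a pair. Every other line passes through.
function filterPlaylist(text, minHeight = 720) {
  const lines = text.split('\n');
  const kept = [];
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.startsWith('#EXT-X-STREAM-INF:')) {
      // RESOLUTION is written as <width>x<height>; we compare on height
      const match = line.match(/RESOLUTION=\d+x(\d+)/);
      if (match && Number(match[1]) < minHeight) {
        i++; // also skip the URI line that belongs to this tag
        continue;
      }
    }
    kept.push(line);
  }
  return kept.join('\n');
}

const sample = [
  '#EXTM3U',
  '#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=426x240',
  'low/index.m3u8',
  '#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720',
  'hd/index.m3u8',
].join('\n');

console.log(filterPlaylist(sample));
```

Media playlists (the ones listing segments) contain no `#EXT-X-STREAM-INF` tags, so they pass through unchanged. A production version would also need a fallback for the case where every rendition is below the threshold, otherwise the player receives a playlist with nothing to play.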

One detail from their post stuck with me: because edge runtimes like Cloudflare Workers implement the same fetch event API, they deployed the stitching worker to Cloudflare unchanged and got a working URL.

Anything that travels as text over HTTP can be rewritten in flight, by code you control, running on the user’s machine.

My use case: the deploy that breaks every lazy-loaded route

My own service worker story starts with a Vite production incident.

Vite fingerprints every build output with a content hash. Your lazy-loaded route lives in Settings-a3f8b2.js, and after the next deploy, it lives in Settings-c91d44.js, while the old file is gone from the CDN.

Now, picture a user who opened the app before the deploy. Their index.html and main bundle still reference the old hashes.

They work through their morning, and at some point, they click on Settings for the first time that session. The browser requests Settings-a3f8b2.js, the CDN returns 404, and the dynamic import throws an Error (you might see it in Sentry as Failed to fetch dynamically imported module).

Error screen. For a user who did nothing wrong except keep a tab open over lunch.

Our first fix was the obvious one. Vite emits an event when a preload fails, so we caught it and reloaded:

window.addEventListener('vite:preloadError', () => {
  window.location.reload();
});

This worked, but the UX was terrible. Who wants to experience refreshes when navigating to a page?

Plus, if the user’s index.html was cached anywhere along the way (browser cache, a CDN edge that hadn’t purged yet, a misconfigured Cache-Control header, take your pick), the reload fetched the same stale HTML, which referenced the same dead chunk, which threw the same error, which triggered the same reload.

This can cause an infinite refresh loop.

The second fix was a guard, so we’d only force one reload per session:

window.addEventListener('vite:preloadError', () => {
  if (sessionStorage.getItem('chunk-reloaded')) return;
  sessionStorage.setItem('chunk-reloaded', '1');
  window.location.reload();
});

The loop was gone, but everything else about it was still bad.

We were treating the symptom. The problem underneath: the client had no idea a new version existed until something exploded, and old chunks evaporated the moment we deployed.

It took us an embarrassing amount of time to see that these are two problems, not one.

Problem one: users on the old version need the old chunks to keep existing for the duration of their session.

Problem two: users should migrate to the new version soon after it ships, without a hash failure being the messenger.

A service worker solves both, because it’s the only place in the browser that can keep dead files alive and the only background process that can watch for new versions.

Each deploy now writes a version.json next to the bundle, generated by a small Vite plugin reading the build manifest:

{
  "version": "2026.06.04-1412",
  "assets": ["/assets/index-c91d44.js", "/assets/Settings-c91d44.js"]
}

The version is a build timestamp rather than a hash, mostly for debugging.

The worker compares versions with a plain inequality check rather than “newer than,” so a rollback to an older build triggers an update like any other deploy.
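The data transformation behind that file is small. Below is a sketch of it, assuming the standard shape of Vite's build manifest (entries with `file`, and optional `css` and `assets` arrays). The function names are mine, and the wiring into an actual Vite plugin hook is left out.

```javascript
// Formats a Date as "YYYY.MM.DD-HHmm" in UTC, matching the version string shape above.
function formatVersion(date) {
  const pad = (n) => String(n).padStart(2, '0');
  return (
    `${date.getUTCFullYear()}.${pad(date.getUTCMonth() + 1)}.${pad(date.getUTCDate())}` +
    `-${pad(date.getUTCHours())}${pad(date.getUTCMinutes())}`
  );
}

// Collects every emitted file listed in a parsed Vite manifest.json
// and returns the object to serialize as version.json.
function buildVersionFile(manifest, date = new Date(), base = '/') {
  const files = new Set(); // a Set, because chunks can share CSS or assets
  for (const chunk of Object.values(manifest)) {
    files.add(chunk.file);
    for (const f of chunk.css ?? []) files.add(f);
    for (const f of chunk.assets ?? []) files.add(f);
  }
  return {
    version: formatVersion(date),
    assets: [...files].sort().map((f) => base + f),
  };
}
```

Using UTC keeps the version string identical no matter which timezone the build machine runs in.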

The service worker polls that file. Polling a 200-byte JSON with cache: 'no-store' is cheap, and the worker is the right place for it because it’s already sitting on the network path:

let currentVersion = null;

async function checkVersion() {
  const res = await fetch('/version.json', { cache: 'no-store' });
  const { version, assets } = await res.json();

  if (version === currentVersion) return;

  // Precache the entire new build BEFORE telling anyone about it
  const cache = await caches.open(`app-${version}`);
  await cache.addAll(assets);
  currentVersion = version;

  const clients = await self.clients.matchAll();
  for (const client of clients) {
    client.postMessage({ type: 'NEW_VERSION', version });
  }
}

self.addEventListener('message', (event) => {
  // waitUntil keeps the worker alive until the check and precache finish
  if (event.data?.type === 'CHECK_VERSION') event.waitUntil(checkVersion());
});

Service workers are terminated by the browser when they’re idle, so the worker can’t reliably run its own setInterval.

The page drives the polling instead, posting CHECK_VERSION on an interval and on visibilitychange, so a tab that comes back from a weekend in the background checks immediately.
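The page side of that can look roughly like this. It is a sketch with an assumed interval (pick your own number), and the service worker container and document are passed in as parameters so the wiring can be tested with fakes.

```javascript
// Hypothetical interval, chosen only for illustration.
const CHECK_INTERVAL_MS = 5 * 60 * 1000;

// Asks the controlling worker to run its version check.
// Returns false if no worker controls the page yet.
function requestVersionCheck(container) {
  const worker = container.controller;
  if (!worker) return false;
  worker.postMessage({ type: 'CHECK_VERSION' });
  return true;
}

// Wires up both triggers and returns a cleanup function.
function startVersionPolling(container, doc, intervalMs = CHECK_INTERVAL_MS) {
  const timer = setInterval(() => requestVersionCheck(container), intervalMs);
  const onVisibility = () => {
    // Fires when a backgrounded tab becomes visible again
    if (doc.visibilityState === 'visible') requestVersionCheck(container);
  };
  doc.addEventListener('visibilitychange', onVisibility);
  return () => {
    clearInterval(timer);
    doc.removeEventListener('visibilitychange', onVisibility);
  };
}

// Only start polling in a browser that supports service workers.
if (
  typeof document !== 'undefined' &&
  typeof navigator !== 'undefined' &&
  'serviceWorker' in navigator
) {
  startVersionPolling(navigator.serviceWorker, document);
}
```

Browsers throttle timers in background tabs, which is one more reason the visibilitychange trigger is worth having alongside the interval.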

The fetch handler is the part that fixes problem one.

Hashed assets are served cache-first, and old cache buckets stay alive until no client needs them:

self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (url.pathname.startsWith('/assets/')) {
    event.respondWith(
      caches.match(event.request).then((hit) => hit || fetch(event.request))
    );
  }
});

caches.match with no cache name searches every bucket, old and new. So a user mid-session who clicks Settings gets Settings-a3f8b2.js from the worker’s cache even though the CDN deleted it minutes ago.

The 404 that started this whole story can no longer happen, because the worker holds the only surviving copy of the file and serves it without asking the network.

Problem two is the app’s side. It listens for the message and updates in the background:

let updateReady = false;

navigator.serviceWorker.addEventListener('message', (event) => {
  if (event.data.type === 'NEW_VERSION') updateReady = true;
});

// called on every route navigation
function onNavigate() {
  if (updateReady) window.location.reload();
}

The reload happens on a route change, when the user is already expecting the screen to swap and there’s no half-filled form to lose.

It pulls the new index.html, which points at assets the worker has already cached, so the “reload” is served almost entirely from disk.

One requirement for the full effect: index.html itself must be served with Cache-Control: no-cache. The Vite docs say the same thing about the plain reload approach, and our infinite loop earlier was the price of ignoring it.

Even when some CDN edge hands out stale HTML, the user lands back on the old version for one more cycle, and nothing breaks because the old chunks are still sitting in the worker’s cache.

The vite:preloadError handler is still there as a last resort, and it hasn’t fired since.

If this sounds familiar, it should. It’s Slack’s deploy-keyed cache buckets wearing different clothes (that’s where I got the idea from).

Old assets must keep working for old sessions; new sessions must get new assets; clients should converge on the latest version without anything breaking in between.

A service worker is the only place in a browser where you can enforce all three rules, because nothing else sits between your app and the network while also persisting across deploys.

So why is nobody using them?

After the survey responses, I started asking a follow-up question: why not?

The first answer was that they don’t need them (which might be true).

The second answer, from most senior engineers, was that they had trouble with them in the past, and it’s not worth the effort.

Specifically, the lifecycle of a service worker.

By default, a page that registers a service worker isn’t controlled by it until the next navigation, and even clients.claim() can’t intercept the requests the page fired before the worker activated.

Everyone who has touched a service worker has a story about a worker stuck in waiting while they refreshed the page twelve times, or about skipWaiting activating a new worker under a page built for the old one.

Even Mux ran into this in their own demo. A video player starts fetching the moment it mounts, before a same-page worker can take control, so they had to register the worker on an index page and link onward to the player page.

The post mentions a second issue: a registration scope of /resolution-filtering/ works, while /resolution-filtering doesn’t, and nothing explains why.

Everyone knows a cache horror story.

The two people in my survey who “tried one in 2019 and removed it” both told the same story with different details: a service worker with a bad cache strategy served a stale app to users, and the fix required shipping a killswitch worker and waiting days for clients to pick it up, because the broken worker controlled when updates were checked.
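If you have never seen a killswitch worker, it looks roughly like this. It's a sketch of the commonly used self-destroying pattern, not the code either team shipped. You deploy it at the same URL as the broken worker, and the `tearDown` name and the decision to wipe every cache are my choices.

```javascript
// Removes everything the old worker left behind.
// `scope` is the service worker global (self), passed in so it can be faked in tests.
async function tearDown(scope) {
  // Delete every cache bucket for this origin
  const names = await scope.caches.keys();
  await Promise.all(names.map((name) => scope.caches.delete(name)));

  // Remove the registration so future page loads go straight to the network
  await scope.registration.unregister();

  // Reload open tabs so they stop running the stale app
  const clients = await scope.clients.matchAll({ type: 'window' });
  await Promise.all(clients.map((client) => client.navigate(client.url)));

  return names.length;
}

// Only attach the listeners when this file runs as a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined') {
  // Don't sit in "waiting" behind the broken worker
  self.addEventListener('install', () => self.skipWaiting());
  self.addEventListener('activate', (event) => event.waitUntil(tearDown(self)));
}
```

Note what this cannot fix: clients only receive it the next time their browser checks for an updated worker script, which is exactly why the cleanup took days.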

The fear is justified. But you need to invest more into versioning.

Look at the Slack example and see where they spent their effort: the fetch handler is a dozen lines, and the versioning machinery (manifest hashing, deploy-keyed buckets, jittered re-registration, 7-day eviction) is everything else.

Cache invalidation across deploys is the problem, and the teams that got burned are the teams that shipped the dozen lines without the rest.

As the saying goes, there are only two hard things in programming: naming things and cache invalidation.

The product never asked for offline.

Most of the people I surveyed build dashboards, internal tools, and B2B apps.

Nobody writes “works on the metro” into those requirements. Fair enough.

But offline was never the only use case, and most of this article is evidence. My deployment problem had nothing to do with being offline. Mux’s manifest rewriting has nothing to do with offline. Slack’s 50% boot improvement helps users on gigabit fiber.

Dismissing service workers because you don’t need offline is dismissing a proxy because you don’t need one of the things a proxy can do.

And I think you should have offline support. Think of your users first.

And you probably ARE using them.

My favorite counterexample is Partytown, the Builder.io library that moves Google Analytics, Tag Manager, and Facebook Pixel off the main thread.

Third-party scripts wreck your Core Web Vitals by competing with your app for the main thread.

The obvious fix, running them in a web worker, fails for one specific reason. Those scripts constantly read the DOM synchronously (document.title, document.cookie, window.location.href) and expect an immediate return value, while worker-to-page communication is asynchronous.

Partytown’s trick starts with a fact about workers: a web worker has exactly two legal ways to block. Atomics.wait() on a SharedArrayBuffer, and a synchronous XMLHttpRequest, the API we all spent a decade learning to avoid.

When the analytics script (running in the worker, against a proxied DOM) needs a real value, Partytown serializes the request and fires a sync XHR at a fake URL ending in proxytown. The worker thread blocks, waiting for the response.

The production source includes my favorite comment in any open-source codebase:

const xhr = new XMLHttpRequest();
xhr.open('POST', partytownLibUrl('proxytown'), false);
xhr.send(JSON.stringify(accessReq));
// look ma, I'm synchronous (•‿•)
return JSON.parse(xhr.responseText);

That request never reaches any network.

A service worker intercepts it and messages the correct tab’s main thread (since the worker is shared across all tabs of the origin, requests carry a tab ID, pending requests live in a correlation map, and a timeout protects against a dead tab hanging everything).
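The correlation map with a timeout is a general pattern worth knowing on its own. Here is a generic sketch of it (not Partytown's source; the names are mine, and the tab ID routing is left out to keep it short).

```javascript
// Pending requests, keyed by id, waiting for a reply from the other side.
const pending = new Map(); // id -> { resolve, timer }
let nextId = 0;

// Sends a message through `send` and returns a promise for the matching reply.
// If no reply arrives within timeoutMs, the promise rejects and the entry is dropped,
// so a dead receiver cannot leave callers hanging forever.
function request(send, payload, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const id = ++nextId;
    const timer = setTimeout(() => {
      pending.delete(id);
      reject(new Error(`request ${id} timed out`));
    }, timeoutMs);
    pending.set(id, { resolve, timer });
    send({ id, payload });
  });
}

// Call this when a reply message arrives. Unknown or late ids are ignored.
function handleReply({ id, result }) {
  const entry = pending.get(id);
  if (!entry) return false;
  clearTimeout(entry.timer);
  pending.delete(id);
  entry.resolve(result);
  return true;
}
```

In a service worker, `send` would be a `client.postMessage` call to the right tab, and `handleReply` would be called from the worker's message listener.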

If your site runs Partytown, you have a service worker in production bridging threads through fake HTTP, and you probably never thought about it.

Another library that relies heavily on Service Workers is Mock Service Worker (MSW).

It is heavily used in the JS ecosystem for testing: instead of making a real request, you use MSW to intercept it and send back a mocked JSON response. (I think in 2020, everybody was building e2e tests this way. I know we did this at Glovo.)

It’s hard to write Service Workers

Hopefully, by now, you understand the benefits of Service Workers and where to use them:

  • offline support
  • deployment asset strategy
  • performance improvements

But you might still be overwhelmed by the documentation and boilerplate surrounding Service Workers (which is why this is not a how-to guide).

There are lots of complex interactions that are hard to get right when building Service Workers.

  • Network requests
  • Caching strategies
  • Cache management
  • Precaching

That’s where Workbox from Google comes in.

Workbox is a set of modules that simplify common service worker routing and caching. Each module available addresses a specific aspect of Service Worker development, making it easier to create, manage, and work with them.

The important thing to understand is what they do and where they can help you now or in the future.

Good luck.
