Episode 34 · 66 minutes

Browser ML at Scale with Nico Martin

Key Takeaways from our conversation with Nico Martin

Nico Martin

Open Source ML Engineer at Hugging Face, Google Developer Expert in AI and Web Technology, based in Switzerland

Señors @ Scale host Neciu Dan sits down with Nico Martin — open source ML engineer at Hugging Face working on Transformers.js, and Google Developer Expert in AI and web technology — to go deep on running machine learning models directly in the browser. Nico breaks down architectures vs. weights, quantization, tokenizers, ONNX, WebGPU, and why on-device AI is the right answer for a huge class of problems. He also shares the road from ski instructor and self-taught web developer to landing what he calls his dream job at Hugging Face.

🎧 New Señors @ Scale Episode

This week, I spoke with Nico Martin, open source ML engineer at Hugging Face working on Transformers.js, and a Google Developer Expert in AI and web technology. Nico's path is unusual — he studied to become a teacher, worked as a ski instructor and a windsurf instructor, taught himself web development by building sites for friends, and went freelance for years before joining Hugging Face about six months ago. The role pulls together his obsession with progressive web apps, on-device AI, and Transformers.js — a library he was already building demos with before version one even shipped.

In this episode, we cover how Transformers.js actually runs models in the browser, what an ML model really is under the hood, how tokenizers and quantization work, why ONNX matters as a runtime-agnostic standard, where WebGPU fits next to WebAssembly, and what the Prompt API in Chrome means for the future of in-browser LLMs.

⚙️ Main Takeaways

1. From ski instructor to Hugging Face — the non-traditional path

Nico didn't study computer science. He studied to be a teacher, ran a small web agency on the side, and went full-time self-employed once it started paying.

  • The setup: Summer windsurf instructor, winter ski instructor, then websites for friends, friends of friends, and eventually paying clients during teaching school.
  • The bet: He gave himself one year to see if web development could work as a career. It did. He stayed self-employed and freelanced for years.
  • The Hugging Face moment: He had already been building demos with Transformers.js before version one. When the role opened, he saw the posting and thought, "that would be my dream job." Six months in, he still thinks it is.

2. What Hugging Face actually is

People call it the GitHub for machine learning, but the platform has more layers than that.

  • The Hub: The core of Hugging Face is the Hub — huggingface.co — where you host and download ML models and datasets. Quality data is one of the biggest parts of ML, and the Hub is a central place to share it.
  • Spaces: Little hosted environments where you can ship Python apps or static HTML/CSS/JS apps. Hugging Face hosts them for you.
  • The libraries: Transformers (Python) for running and training models, Diffusers for diffusion models, and Transformers.js — which mirrors the Transformers Python API in JavaScript so the same kinds of models can run directly in the browser.

3. The biggest challenge of running ML in the browser is the user's hardware

When the model runs on a server, you control the box. When it runs in the browser, you don't.

  • The constraint: Transformers.js runs on whatever the user has — could be an M4 MacBook Pro, could be an old Lenovo with limited RAM and an outdated CPU. There's no detection mechanism that says "this device can run this model."
  • The strategy: Pick the task and model so it works on almost all devices. For small tasks with small models, that's safe. For big things like LLMs in the browser, you need to be aware that users may have to download multiple gigabytes and have a browser that supports WebGPU.
  • The trade-off: It's a guessing game. You design around what you can expect from your audience.
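
One way to make that guessing game explicit is to write the audience assumptions down as data. The sketch below is hypothetical: the variant names, the sizes, and the `pickVariant` helper are invented for illustration and are not part of Transformers.js. It picks the largest variant that fits a download budget and a WebGPU assumption.

```typescript
// Hypothetical description of one downloadable variant of a model.
interface ModelVariant {
  name: string;         // illustrative label, not a real model ID
  downloadMb: number;   // approximate download size in megabytes
  needsWebGpu: boolean; // true if the variant is impractical on CPU
}

// What we are willing to assume about the audience's devices.
interface AudienceAssumptions {
  maxDownloadMb: number;
  hasWebGpu: boolean;
}

// Return the largest variant that fits the assumptions, or undefined.
function pickVariant(
  variants: ModelVariant[],
  audience: AudienceAssumptions,
): ModelVariant | undefined {
  const usable = variants.filter(
    (v) =>
      v.downloadMb <= audience.maxDownloadMb &&
      (!v.needsWebGpu || audience.hasWebGpu),
  );
  // Sort a copy, largest first, so the input array is left untouched.
  return [...usable].sort((a, b) => b.downloadMb - a.downloadMb)[0];
}

// Made-up numbers, only to exercise the function.
const variants: ModelVariant[] = [
  { name: "tiny-classifier", downloadMb: 30, needsWebGpu: false },
  { name: "small-llm-q4", downloadMb: 1000, needsWebGpu: true },
  { name: "small-llm-fp16", downloadMb: 4000, needsWebGpu: true },
];

const forOldLaptop = pickVariant(variants, { maxDownloadMb: 200, hasWebGpu: false });
const forNewLaptop = pickVariant(variants, { maxDownloadMb: 2000, hasWebGpu: true });
```

With these invented numbers, the conservative audience gets the tiny classifier and the better-equipped one gets the 4-bit LLM variant. A real app would still need a fallback when nothing fits.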

4. A model is an architecture plus weights

The two-part anatomy that makes everything else make sense.

  • The architecture: A description of the network — how many layers, how many neurons per layer, what operations happen between them. Not code. A schema.
  • The weights: Just numbers used in the operations between layers. For a small large language model — one to eight billion parameters — that's one to eight billion little numbers that need to be stored.
  • The size source: Weights are why models are big. Architecture is small. Compress the weights and you compress the model.
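
The size claim is plain arithmetic: parameter count times bits per weight. A minimal sketch follows, ignoring the (small) architecture description and any file-format overhead. The helper name `weightBytes` is made up for this example.

```typescript
// Approximate storage for the weights alone: parameters × bits per weight / 8.
function weightBytes(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8;
}

const GB = 1e9; // decimal gigabytes, to keep the arithmetic readable

const oneBillionFp32 = weightBytes(1e9, 32) / GB;   // 4 GB
const eightBillionFp32 = weightBytes(8e9, 32) / GB; // 32 GB
const eightBillionFp16 = weightBytes(8e9, 16) / GB; // 16 GB
```

Even the "small" end of that range is gigabytes at full precision, which is why the weights, not the architecture, decide whether a model is practical to download.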

5. Quantization is how you fit big models on small machines

Quantization is the main mechanism Transformers.js uses to ship usable models to the browser.

  • The default: Weights are usually stored in 32-bit or 16-bit precision (FP32 / FP16).
  • The compression: You can store them in 4 bits instead (Q4 quantization). The numbers are less precise and so are the calculations, but across billions of operations the result is still usable. Relative to FP32 the size drops by roughly a factor of 8; relative to FP16, roughly a factor of 4.
  • The developer choice: When you call the Transformers.js pipeline function, you specify the task, the model, the device (CPU or GPU), and the quantization level. There's no automatic switching — you can offer multiple quantizations to your users (e.g. "try Q4 for 1GB or full precision for more"), but the precision is chosen at ship time, not run time.
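
To make "less precise but still usable" concrete, here is a deliberately simplified sketch of 4-bit quantization. It assumes a single shared scale for the whole array and a symmetric integer range of -7 to 7. Real quantization schemes are more elaborate (for example, they use separate scales for small blocks of weights), so treat this as an illustration of the idea rather than what any particular library does.

```typescript
// Simplified symmetric quantization: map floats to integers in [-7, 7]
// (which fit in 4 bits) using one shared scale for the whole array.
function quantize4(weights: number[]): { q: number[]; scale: number } {
  const maxAbs = Math.max(0, ...weights.map((w) => Math.abs(w)));
  const scale = maxAbs === 0 ? 1 : maxAbs / 7;
  const q = weights.map((w) => Math.round(w / scale));
  return { q, scale };
}

// Recover approximate floats from the stored integers and the scale.
function dequantize4(q: number[], scale: number): number[] {
  return q.map((v) => v * scale);
}

const original = [0.7, -0.3, 0.12, 0];
const packed = quantize4(original);
const restored = dequantize4(packed.q, packed.scale);

// Largest difference between an original weight and its restored value.
const worstError = Math.max(
  ...original.map((w, i) => Math.abs(w - restored[i])),
);
```

Each weight is now a small integer plus one shared float, and the worst-case error per weight is about half the scale. That is the precision being traded for size.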

6. Open weights vs open source models — datasets are the difference

Both let you run the model. Only one lets you reproduce it.

  • Open weights: Architecture and weights are publicly available. You can use the model. You don't get the training data.
  • Open source: Everything is documented, including pre-training, mid-training, post-training, and all the datasets used. In principle, you could retrain the model and reproduce the result.
  • The runtime perspective: For someone running a model in the browser, the difference doesn't matter much — you still get the architecture and weights and you can do inference.

7. ONNX is the runtime-agnostic standard that makes Transformers.js portable

Think of ONNX as the OpenAPI for ML model architectures.

  • The standard: ONNX (Open Neural Network Exchange) describes a model's architecture — layers, operations, and what flows between them — in a single file.
  • The runtime: ONNX Runtime Web is the WebAssembly runtime that Transformers.js loads to actually execute that architecture in the browser. There are other ONNX runtimes for other environments — Microsoft maintains the web one, others exist for different targets.
  • The decoupling: Export a model to ONNX once, and the same ONNX model runs across multiple environments. You don't re-export per platform. The runtime is decoupled from the model description.
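
The "model description is data, the runtime is code" split can be shown in miniature. The toy format below is invented for this example and is not the ONNX file format. It only demonstrates that a runtime can execute any graph built from operations it knows, without knowing anything about a specific model.

```typescript
// A toy "architecture": plain data listing operations in order.
// This is NOT the ONNX format, only the same idea in miniature.
type Op =
  | { kind: "scale"; weight: string } // multiply every element by a weight
  | { kind: "shift"; weight: string } // add a weight to every element
  | { kind: "relu" };                 // clamp negatives to zero

interface ToyModel {
  graph: Op[];                     // the architecture (a schema, not code)
  weights: Record<string, number>; // the numbers the operations refer to
}

// Execute one operation. Missing weights fall back to a neutral value.
function applyOp(op: Op, w: Record<string, number>, x: number[]): number[] {
  if (op.kind === "scale") return x.map((v) => v * (w[op.weight] ?? 1));
  if (op.kind === "shift") return x.map((v) => v + (w[op.weight] ?? 0));
  return x.map((v) => Math.max(0, v)); // "relu"
}

// A minimal "runtime": it knows the operation kinds, not the model.
function run(model: ToyModel, input: number[]): number[] {
  let x = input;
  for (const op of model.graph) {
    x = applyOp(op, model.weights, x);
  }
  return x;
}

const toyModel: ToyModel = {
  graph: [
    { kind: "scale", weight: "w1" },
    { kind: "shift", weight: "b1" },
    { kind: "relu" },
  ],
  weights: { w1: 2, b1: -3 },
};

const toyOutput = run(toyModel, [1, 2, 3]); // [0, 1, 3]
```

Because `toyModel` is plain data, it could be serialized and handed to a different runtime written in another language, which is the portability argument in a nutshell.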

8. The pre-process / model / post-process pipeline

This is the shape of every Transformers.js task, even when the model itself doesn't have a tokenizer.

  • The frame: The actual model takes tensors in and returns tensors out. A tensor is just an array of numbers, possibly with several dimensions; a vector is the one-dimensional case. Developers don't want to write or read tensors; they want text in and text out.
  • The wrapper: Transformers.js does pre-processing (turn input into tensors), runs the model, then post-processes (turn tensors back into the output shape you want). Different tasks need different pre-processing steps. For LLMs, the big one is the tokenizer.
  • The generality: Not every task uses a tokenizer, but every task has this three-step structure.
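
The three-step shape can be sketched without any ML at all. Everything below is a stand-in: the "model" is a two-line function and the cue words are arbitrary. It shows where the boundaries between the steps sit, not how Transformers.js implements them.

```typescript
type Tensor = number[];

// Pre-process: turn text into numbers.
// Here: counts of two arbitrary cue words.
function preprocess(text: string): Tensor {
  const words = text.toLowerCase().split(/\W+/);
  return [
    words.filter((w) => w === "good").length,
    words.filter((w) => w === "bad").length,
  ];
}

// "Model": tensors in, tensors out. A stand-in for real inference.
function model(input: Tensor): Tensor {
  const [good, bad] = input;
  return [good - bad, bad - good]; // scores for [positive, negative]
}

// Post-process: turn output numbers into a developer-friendly shape.
function postprocess(scores: Tensor): { label: string; score: number } {
  return scores[0] >= scores[1]
    ? { label: "POSITIVE", score: scores[0] }
    : { label: "NEGATIVE", score: scores[1] };
}

// The wrapper the developer actually calls: text in, labelled result out.
function toyPipeline(text: string): { label: string; score: number } {
  return postprocess(model(preprocess(text)));
}

const verdict = toyPipeline("Good, good, bad");
```

Swapping the task means swapping `preprocess` and `postprocess` (image decoding, audio resampling, tokenization) while the middle step stays "tensors in, tensors out".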

9. Tokenizers, and why Hugging Face split them out into their own library

Tokenizing was Nico's first project at Hugging Face — and the reason tokenizers.js now exists.

  • The job: A tokenizer takes text, splits it into chunks (tokens), and maps each chunk to a numerical ID via a vocabulary. ML models work on numbers, not text.
  • Per-model: Each model is trained with a specific tokenizer and a specific vocabulary. "Banana" is one token in the GPT-5 tokenizer but two tokens in GPT-3: different vocabularies, different splits. If you use the wrong tokenizer, the IDs are meaningless to the model.
  • The extraction: People were already using the Transformers.js tokenizer to pre-process text before sending it to other runtimes — because the tokenizer is identical for the same model regardless of where you run it. So Hugging Face pulled tokenization out into its own package, tokenizers.js, that you can use independently of the rest of Transformers.js.
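
A toy tokenizer makes the vocabulary dependence visible. The sketch below uses greedy longest-match over two made-up vocabularies. The IDs and splits are invented, and real tokenizers (BPE and friends) work differently in detail. It is not the tokenizers.js API.

```typescript
type Vocab = Map<string, number>;

// Greedy longest-match tokenizer: at each position, take the longest
// chunk that exists in the vocabulary and emit its ID.
function tokenize(text: string, vocab: Vocab): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let end = text.length;
    // Shrink the candidate chunk until it is in the vocabulary.
    while (end > pos && !vocab.has(text.slice(pos, end))) end--;
    if (end === pos) throw new Error(`No token for "${text[pos]}"`);
    ids.push(vocab.get(text.slice(pos, end))!);
    pos = end;
  }
  return ids;
}

// Two made-up vocabularies; the IDs are illustrative only.
const vocabA: Vocab = new Map([["banana", 17], ["ban", 3], ["ana", 4]]);
const vocabB: Vocab = new Map([["ban", 101], ["ana", 102]]);

const idsA = tokenize("banana", vocabA); // one token
const idsB = tokenize("banana", vocabB); // two tokens
```

Feeding `idsB` to a model trained with `vocabA` would hand it IDs that mean something else entirely (or nothing), which is why the tokenizer has to travel with the model.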

10. How an LLM picks the next token, and where temperature comes in

The mechanic behind every LLM response, told in one example.

  • The output shape: The model takes the previous text as tensors, runs them through the network, and outputs a probability for every token in the vocabulary being the next token.
  • The example: "Roses are red, violets are…" — the model produces a probability distribution where "blue" has the highest probability.
  • The randomness: If you always pick the highest probability, you get boring, predictable text. Sampling adds randomness: instead of always taking the top token, the next token is drawn at random, weighted by probability, often from only the top N candidates. Temperature controls how strong that randomness is. That's why the same prompt can produce different outputs every time, and why a small bit of randomness compounds across hundreds of generated tokens.
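
The next-token step can be sketched with a three-word vocabulary. The scores below are made up, and the code assumes the common formulation where raw scores (logits) are divided by the temperature before a softmax turns them into probabilities. The random source is injectable so the behaviour can be reproduced.

```typescript
// Softmax over logits divided by a temperature. Lower temperature sharpens
// the distribution; higher temperature flattens it.
function softmax(logits: number[], temperature = 1): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Draw an index from a probability distribution.
function sample(probs: number[], random: () => number = Math.random): number {
  const r = random();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

const candidates = ["blue", "purple", "red"];
const logits = [4.0, 2.0, 1.0]; // made-up scores after "violets are"

// Greedy decoding: always take the highest-scoring token.
const greedy = candidates[logits.indexOf(Math.max(...logits))];

// The same scores under two temperatures.
const cool = softmax(logits, 0.5); // strongly favours "blue"
const warm = softmax(logits, 2.0); // gives the others a real chance

const nextToken = candidates[sample(warm)];
```

Greedy decoding always answers "blue". Sampling from `warm` usually does too, but not always, and over a long generation those occasional detours are what make two runs diverge.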

11. LLMs aren't always the best fit for the browser — other tasks matter more

A reality check from someone who builds in-browser AI for a living.

  • The LLM demo: Nico's colleague Joshua released a demo running GPT-OSS, a 20B parameter LLM, fully in the browser. Cool — but it requires downloading roughly 12 gigabytes before you can use it.
  • The better fits: Vision tasks like background removal and image classification. Audio tasks like text-to-speech and speech recognition. Voice activity detection — a tiny model that just tells you "I'm 90% sure someone is talking." That last one is the perfect on-device use case: small, runs on CPU without a GPU, instant.
  • The pattern: The big-model browser demos are impressive. The actual practical wins are in the smaller, specialized tasks where the privacy and latency benefits outweigh the size cost.

12. WebAssembly + WebGPU — they don't replace each other, they cooperate

The runtime story for in-browser ML, clarified.

  • WebAssembly: ONNX Runtime Web is shipped as WebAssembly — that's how the runtime gets into the browser at all. You can't get rid of WebAssembly because it's what runs the inference engine.
  • WebGPU: A browser API that gives you access to GPU operations from the browser. You can run Transformers.js on the CPU, and small models do fine there, but larger models are slow. With WebGPU, the runtime delegates the heavy compute to the GPU.
  • The combo: WebAssembly hosts the runtime; WebGPU does the math. Both are required for serious in-browser ML performance.
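
In practice this usually turns into a small feature check before loading a model. The sketch below assumes that WebGPU support shows up as a `gpu` property on `navigator`, and that the device names an app would pass along are `"webgpu"` and `"wasm"`. The `NavigatorLike` type and `chooseDevice` helper are made up here so the logic can run outside a browser.

```typescript
// Minimal shape of the part of `navigator` this check looks at.
interface NavigatorLike {
  gpu?: unknown; // present when the browser exposes WebGPU
}

type InferenceDevice = "webgpu" | "wasm";

// Prefer the GPU when the browser exposes WebGPU, otherwise stay on
// the WebAssembly (CPU) path.
function chooseDevice(nav: NavigatorLike | undefined): InferenceDevice {
  return nav !== undefined && nav.gpu !== undefined ? "webgpu" : "wasm";
}

// In a browser this would be called as: chooseDevice(navigator)
const withGpu = chooseDevice({ gpu: {} });
const withoutGpu = chooseDevice({});
```

The presence of `navigator.gpu` only says the API exists. Requesting an adapter can still come back empty on some machines, so a production check would go one step further than this.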

13. The Prompt API — a unified browser API for in-browser LLMs

Chrome ships Gemini Nano with the browser. The Prompt API exposes it.

  • What it is: Chrome downloads a version of Gemini Nano once, and the Prompt API exposes a way for any origin to interact with it. The model runs entirely in the browser.
  • The cross-browser ambition: The same API could be implemented in Firefox using a Gemma model, or in Safari delegating to Apple Intelligence. Same JavaScript surface, different models behind the scenes.
  • The skepticism: As a developer, Nico wants to control which model his app is using. If Firefox ships an outdated model under the same API, his app's behavior changes. He'd at least want metadata — "you're talking to model X, version Y" — so apps can degrade gracefully.
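
Graceful degradation could look roughly like the sketch below. It assumes the built-in model is reachable through an object with `availability()` and `create()` methods and a session with `prompt()`, which is how Chrome's Prompt API is documented at the time of writing. The interfaces here are hand-written stand-ins that may drift from the real surface, and `ask` and its `fallback` parameter are hypothetical.

```typescript
// Hand-written stand-ins for the built-in model surface (assumed shape).
interface PromptSession {
  prompt(input: string): Promise<string>;
}
interface LanguageModelLike {
  availability(): Promise<string>;
  create(): Promise<PromptSession>;
}

type Answer = { source: "built-in" | "fallback"; answer: string };

// Use the built-in model only when it reports "available"; otherwise
// hand the question to whatever fallback the app provides.
async function ask(
  question: string,
  builtIn: LanguageModelLike | undefined,
  fallback: (q: string) => Promise<string>,
): Promise<Answer> {
  if (builtIn !== undefined && (await builtIn.availability()) === "available") {
    const session = await builtIn.create();
    return { source: "built-in", answer: await session.prompt(question) };
  }
  return { source: "fallback", answer: await fallback(question) };
}
```

Returning the `source` alongside the answer is a small nod to the metadata concern: the app at least knows which path produced the text, even if the browser does not say which model sits behind it.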

14. Hardware reality: a Pixel phone outruns an old Dell for LLMs

A counter-intuitive observation that says something about chip architecture.

  • The Windows problem: Many Windows laptops with Intel chips have limited GPU RAM and have to swap memory in and out, which kills performance for large models.
  • The unified memory advantage: Apple Silicon and modern phones have unified memory. Once a model is loaded, reading and writing is fast.
  • The result: Running Gemma 3 is more performant on Nico's Pixel phone than on his older Dell laptop. Copilot+ laptops with dedicated AI chips should change this, but right now mobile often wins.

15. We're not cooked — the role just changes

On the AI-replaces-developers question, Nico's take is direct.

  • Not the engineer, the code monkey: The pure code-typing job is going away, but the software engineer role isn't — someone has to understand what's happening, why it was done that way, and whether a change makes sense historically.
  • The bank example: At the bank where Nico freelanced for five years, the product owner explicitly forbade ChatGPT use because of data and privacy concerns. As a customer, you actually want your bank to be conservative with new technology, not the first to wipe everything out by going wild with it.
  • The shift: Soft skills become hard skills — communicating features, writing good specs, knowing what changes need to happen. Maybe we don't need as many developers, but the role itself isn't going away.

16. How to break into ML/AI as a developer today

There's a difference between being a machine learning engineer and being good at using ML.

  • The distinction: A real ML engineer trains models. Most of us are users — and that's fine. Nico still considers himself a frontend developer; even at Hugging Face, his focus is the JavaScript API surface of Transformers.js.
  • The basics worth knowing: What tool calling is, what MCP is, why 50 MCP servers shrink your context window, how an agent goes from "generate the next token" to "do things in the real world." That mental model is what most developers actually need.
  • The realistic ceiling: We don't need millions of ML engineers. We need millions of people who understand how to use these tools well.

🧠 What I Learned

  • A model is an architecture (a schema describing layers and operations) plus weights (the numbers used in those operations) — and weights are why models are big.
  • Quantization compresses models by storing weights in lower precision (e.g. 4 bits instead of 32), trading some accuracy for roughly 8x size reduction.
  • ONNX is a runtime-agnostic standard for model architectures; ONNX Runtime Web is the WebAssembly runtime Transformers.js uses in the browser.
  • Every Transformers.js task is pre-process → model → post-process; only some tasks include a tokenizer step.
  • Tokenizers are tightly coupled to the model that was trained with them — using the wrong tokenizer makes IDs meaningless to the model.
  • Hugging Face split tokenization into tokenizers.js because people were already using it to pre-process text before sending to non-ONNX runtimes.
  • LLMs generate text by producing a probability distribution over the next token; temperature adds controlled randomness to avoid boring, deterministic output.
  • The best in-browser ML use cases are often small models — voice activity detection, background removal, embeddings — not 12GB LLMs.
  • WebAssembly hosts the inference runtime; WebGPU does the actual math. Both are required for performance.
  • The Prompt API in Chrome could become a unified, cross-browser interface for in-browser LLMs — but developers may want metadata to know which model they're actually talking to.
  • Mobile and Apple Silicon devices often outperform older Windows laptops for browser ML because of unified memory.
  • The "AI replaces developers" narrative misses the point — code monkey work is at risk; engineering judgment isn't.
  • Most developers don't need to become ML engineers. They need to understand tokenizers, embeddings, tool calling, MCP, and how an agent actually works.

💬 Favorite Quotes

"The biggest challenge is that Transformers.js actually uses the hardware that your clients come with. So it could be a MacBook Pro with an M4, or it could be an old laptop with not that much RAM. You need to make sure the task you want to use actually works everywhere where it needs to run."

"A machine learning model is basically two things. You have an architecture — how many layers, how many neurons, what operations happen between them. And then you have weights, which are just numbers needed inside those operations to calculate stuff."

"If you always take the token with the highest probability, you end up with very boring, very predictable texts. If you add a little bit of randomness to each step, you get some creativity. That's also why LLMs are unpredictable."

"We don't need millions of machine learning engineers, but we need millions of people who understand how they can use those tools."

"The job as a code monkey is gone. The job of a software engineer is not — you need people who actually understand what's happening."

"I want my bank to be conservative with these technologies and to actually let other people figure out the problems before they adopt it."

🎯 Also in this Episode

  • The journey from teaching school to ski instructor to self-employed web developer to Hugging Face
  • Why Nico joined the Google Developer Experts program while freelancing, working a lot with progressive web apps
  • The five years freelancing at a bank — safe but missing new-technology excitement
  • How tokenizers.js was extracted from Transformers.js as Nico's first Hugging Face project
  • The "banana" walkthrough — one token in GPT-5, two tokens in GPT-3
  • Embeddings explained through the animals analogy — height, weight, legs as features in a vector space
  • Vector databases, similarity search, and why text embedding models are a great browser fit
  • Image background removal and image classification as practical Transformers.js use cases
  • OpenCloud-style takes on local agents, and what's hyped vs. real
  • The biggest agent bottleneck right now — trust and access to user data, not the technology itself
  • Why frontend specialists may need to broaden into systems thinking as AI rewrites code across frameworks
  • Book recommendation: Utopia for Realists by Rutger Bregman

Resources

More from Nico:

  • Transformers.js — Run state-of-the-art ML models directly in the browser
  • Hugging Face — The Hub for models, datasets, and Spaces
  • Diffusers — Hugging Face's library for diffusion models
  • ONNX — Open Neural Network Exchange standard
  • ONNX Runtime Web — The WebAssembly runtime used by Transformers.js
  • Chrome Prompt API — Built-in AI in the browser via Gemini Nano

Book Recommendation:

  • Utopia for Realists: And How We Can Get There by Rutger Bregman

🎧 Listen Now

🎧 Spotify
📺 YouTube
🍏 Apple Podcasts

Episode Length: 66 minutes on Transformers.js, on-device AI, ONNX, WebGPU, tokenizers, and how machine learning models actually work end to end.

Whether you're a frontend developer who's curious about ML, or you've used LLMs but never quite understood what's happening under the hood, this conversation is one of the cleanest top-to-bottom mental models I've heard.

Happy building,
Dan
