Browser ML at Scale with Nico Martin

Nico Martin

Open Source ML Engineer at Hugging Face, Google Developer Expert in AI and Web Technology, based in Switzerland

Señors @ Scale host Neciu Dan sits down with Nico Martin — open source ML engineer at Hugging Face working on Transformers.js, and Google Developer Expert in AI and web technology — to go deep on running machine learning models directly in the browser. Nico breaks down architectures vs. weights, quantization, tokenizers, ONNX, WebGPU, and why on-device AI is the right answer for a huge class of problems. He also shares the road from ski instructor and self-taught web developer to landing what he calls his dream job at Hugging Face.

🎧 New Señors @ Scale Episode

This week, I spoke with Nico Martin, open source ML engineer at Hugging Face working on Transformers.js, and a Google Developer Expert in AI and web technology. Nico's path is unusual — he studied to become a teacher, worked as a ski instructor and a windsurf instructor, taught himself web development by building sites for friends, and went freelance for years before joining Hugging Face about six months ago. The role pulls together his obsession with progressive web apps, on-device AI, and Transformers.js — a library he was already building demos with before version one even shipped.

In this episode, we cover how Transformers.js actually runs models in the browser, what an ML model really is under the hood, how tokenizers and quantization work, why ONNX matters as a runtime-agnostic standard, where WebGPU fits next to WebAssembly, and what the Prompt API in Chrome means for the future of in-browser LLMs.

⚙️ Main Takeaways

1. From ski instructor to Hugging Face — the non-traditional path

Nico didn't study computer science. He studied to be a teacher, ran a small web agency on the side, and went full-time self-employed once it started paying.

The setup: Summer windsurf instructor, winter ski instructor, then websites for friends, friends of friends, and eventually paying clients during teaching school.
The bet: He gave himself one year to see if web development could work as a career. It did. He stayed self-employed and freelanced for years.
The Hugging Face moment: He had already been building demos with Transformers.js before version one. When the role opened, he saw the posting and thought, "that would be my dream job." Six months in, he still thinks it is.

2. What Hugging Face actually is

People call it the GitHub for machine learning, but the platform has more layers than that.

The Hub: The core of Hugging Face is the Hub — huggingface.co — where you host and download ML models and datasets. Quality data is one of the biggest parts of ML, and the Hub is a central place to share it.
Spaces: Little hosted environments where you can ship Python apps or static HTML/CSS/JS apps. Hugging Face hosts them for you.
The libraries: Transformers (Python) for running and training models, Diffusers for diffusion models, and Transformers.js — which mirrors the Transformers Python API in JavaScript so the same kinds of models can run directly in the browser.

3. The biggest challenge of running ML in the browser is the user's hardware

When the model runs on a server, you control the box. When it runs in the browser, you don't.

The constraint: Transformers.js runs on whatever the user has — could be an M4 MacBook Pro, could be an old Lenovo with limited RAM and an outdated CPU. There's no detection mechanism that says "this device can run this model."
The strategy: Pick the task and model so they work on almost all devices. For small tasks with small models, that's safe. For big things like LLMs in the browser, keep in mind that users may have to download multiple gigabytes and will need a browser that supports WebGPU.
The trade-off: It's a guessing game. You design around what you can expect from your audience.

4. A model is an architecture plus weights

The two-part anatomy that makes everything else make sense.

The architecture: A description of the network — how many layers, how many neurons per layer, what operations happen between them. Not code. A schema.
The weights: Just numbers used in the operations between layers. For a small large language model — one to eight billion parameters — that's one to eight billion little numbers that need to be stored.
The size source: Weights are why models are big. Architecture is small. Compress the weights and you compress the model.
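
To make the architecture/weights split concrete, here is a minimal sketch that counts the weights implied by a toy architecture. The `DenseArchitecture` type and the layer sizes are hypothetical and chosen only for illustration; real model architectures describe far more than layer widths.

```typescript
// Hypothetical toy architecture: a fully connected network described only by
// its layer widths. This is an illustration, not how real models are specified.
type DenseArchitecture = { layerSizes: number[] };

// Each pair of adjacent layers needs (inputs * outputs) weights,
// plus one bias value per output neuron.
function countParameters(arch: DenseArchitecture): number {
  let total = 0;
  for (let i = 0; i < arch.layerSizes.length - 1; i++) {
    const inputs = arch.layerSizes[i];
    const outputs = arch.layerSizes[i + 1];
    total += inputs * outputs + outputs;
  }
  return total;
}

const tiny: DenseArchitecture = { layerSizes: [4, 8, 2] };
const params = countParameters(tiny); // (4*8 + 8) + (8*2 + 2) = 58
const bytesAtFp32 = params * 4;       // 32 bits = 4 bytes per weight

console.log(`parameters: ${params}, bytes at FP32: ${bytesAtFp32}`);
```

The description of the architecture fits in one line, while the storage cost grows with every weight. Scale the same arithmetic up to billions of parameters and the weights dominate the download.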

5. Quantization is how you fit big models on small machines

Quantization is the main mechanism Transformers.js uses to ship usable models to the browser.

The default: Weights are usually stored in 32-bit or 16-bit precision (FP32 / FP16).
The compression: You can store them in 4 bits instead (Q4 quantization). The numbers are less precise, so the calculations are less precise, but across billions of operations the result is still usable. Compared with FP32, the size drops by roughly a factor of 8 (about a factor of 4 compared with FP16).
The developer choice: When you call the Transformers.js pipeline function, you specify the task, the model, the device (CPU or GPU), and the quantization level. There's no automatic switching: you can offer your users several quantizations (e.g. "try Q4 for 1GB or full precision for more"), but the library won't pick one for them at run time. The precision is a choice the developer ships.
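
The size arithmetic and the precision loss can both be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, and the quantizer uses one min/max range for the whole array, whereas real Q4 formats typically quantize in small blocks with extra metadata.

```typescript
// Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes).
function estimateSizeGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

type Quantized = { codes: number[]; min: number; scale: number };

// Toy affine quantizer: map each weight onto one of 16 levels (4 bits).
function quantize4bit(weights: number[]): Quantized {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  // 16 levels means 15 steps; fall back to 1 if all weights are equal.
  const scale = (max - min) / 15 || 1;
  const codes = weights.map((w) => Math.round((w - min) / scale));
  return { codes, min, scale };
}

// Reconstruct approximate weights from the 4-bit codes.
function dequantize4bit(q: Quantized): number[] {
  return q.codes.map((c) => q.min + c * q.scale);
}

const fp32GB = estimateSizeGB(1e9, 32); // 1B parameters at 32 bits
const q4GB = estimateSizeGB(1e9, 4);    // the same parameters at 4 bits
console.log(`FP32: ${fp32GB} GB, Q4: ${q4GB} GB, ratio: ${fp32GB / q4GB}`);

const original = [-0.42, 0.07, 0.31, 0.88, -0.15]; // made-up weights
const quantized = quantize4bit(original);
const restored = dequantize4bit(quantized);
console.log(quantized.codes, restored);
```

Each restored weight lands within half a quantization step of the original, which is the "less precise but still usable" trade described above.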

6. Open weights vs open source models — datasets are the difference

Both let you run the model. Only one lets you reproduce it.

Open weights: Architecture and weights are publicly available. You can use the model. You don't get the training data.
Open source: Everything is documented — pre-training, mid-training, post-training, all the datasets used. You could retrain the model and end up with the same result.
The runtime perspective: For someone running a model in the browser, the difference doesn't matter much — you still get the architecture and weights and you can do inference.

7. ONNX is the runtime-agnostic standard that makes Transformers.js portable

Think of ONNX as the OpenAPI for ML model architectures.

The standard: ONNX (Open Neural Network Exchange) describes a model's architecture — layers, operations, and what flows between them — in a single file.
The runtime: ONNX Runtime Web, maintained by Microsoft, is the WebAssembly build of the runtime that Transformers.js loads to actually execute that architecture in the browser. Other ONNX runtimes exist for other environments and targets.
The decoupling: Export a model to ONNX once, and the same ONNX model runs across multiple environments. You don't re-export per platform. The runtime is decoupled from the model description.

8. The pre-process / model / post-process pipeline

This is the shape of every Transformers.js task, even when the model itself doesn't have a tokenizer.

The frame: The actual model takes tensors in and returns tensors out. A tensor is just an array of numbers — same idea as a vector. Developers don't want to write or read tensors; they want text in and text out.
The wrapper: Transformers.js does pre-processing (turn input into tensors), runs the model, then post-processes (turn tensors back into the output shape you want). Different tasks need different pre-processing steps. For LLMs, the big one is the tokenizer.
The generality: Not every task uses a tokenizer, but every task has this three-step structure.
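
The three-step shape can be sketched as a small generic interface. Everything here is hypothetical (the `Task` interface, the `run` helper, and the toy classifier); it is not the Transformers.js internals, only an illustration of the structure described above.

```typescript
// A tensor, simplified here to a flat array of numbers.
type Tensor = number[];

// Hypothetical shape of a task: three steps around a tensor-in, tensor-out model.
interface Task<Input, Output> {
  preprocess(input: Input): Tensor;
  model(input: Tensor): Tensor;
  postprocess(output: Tensor): Output;
}

// Run the three steps in order.
function run<Input, Output>(task: Task<Input, Output>, input: Input): Output {
  return task.postprocess(task.model(task.preprocess(input)));
}

type Classification = { label: string; score: number };

// Toy task: decide whether a text is a question or a statement.
// The "model" is a hard-coded stand-in for a neural network.
const toyClassifier: Task<string, Classification> = {
  // Text in, numbers out: [length, ends-with-question-mark flag].
  preprocess: (text) => [text.length, text.trim().endsWith("?") ? 1 : 0],
  // Numbers in, numbers out: scores for [statement, question].
  model: (features) => (features[1] === 1 ? [0.1, 0.9] : [0.9, 0.1]),
  // Numbers in, developer-friendly object out.
  postprocess: (scores) => {
    const labels = ["statement", "question"];
    const best = scores[0] >= scores[1] ? 0 : 1;
    return { label: labels[best], score: scores[best] };
  },
};

console.log(run(toyClassifier, "Is this running in the browser?"));
console.log(run(toyClassifier, "It runs in the browser."));
```

Note that this toy task has no tokenizer at all, yet it still follows the pre-process, model, post-process structure.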

9. Tokenizers, and why Hugging Face split them out into their own library

Tokenizing was Nico's first project at Hugging Face — and the reason tokenizers.js now exists.

The job: A tokenizer takes text, splits it into chunks (tokens), and maps each chunk to a numerical ID via a vocabulary. ML models work on numbers, not text.
Per-model: Each model is trained with a specific tokenizer and a specific vocabulary. The word "banana" is one token in the GPT-5 tokenizer but two tokens in GPT-3: different vocabularies, different splits. If you use the wrong tokenizer, the IDs are meaningless to the model.
The extraction: People were already using the Transformers.js tokenizer to pre-process text before sending it to other runtimes — because the tokenizer is identical for the same model regardless of where you run it. So Hugging Face pulled tokenization out into its own package, tokenizers.js, that you can use independently of the rest of Transformers.js.
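
A toy tokenizer shows why the vocabulary matters. This sketch assumes a greedy longest-match strategy and two made-up vocabularies; real tokenizers use trained algorithms and much larger vocabularies, and the IDs below are invented for illustration.

```typescript
// A vocabulary maps text chunks (tokens) to numerical IDs.
type Vocabulary = Map<string, number>;

// Greedy longest-match: at each position, take the longest chunk
// that exists in the vocabulary and emit its ID.
function tokenize(text: string, vocab: Vocabulary): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = false;
    for (let end = text.length; end > pos; end--) {
      const id = vocab.get(text.slice(pos, end));
      if (id !== undefined) {
        ids.push(id);
        pos = end;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error(`No token covers the text at position ${pos}`);
    }
  }
  return ids;
}

// Two hypothetical vocabularies with invented IDs.
const vocabA: Vocabulary = new Map([["banana", 7], ["ban", 1], ["ana", 2]]);
const vocabB: Vocabulary = new Map([["ban", 40], ["ana", 41]]);

console.log(tokenize("banana", vocabA)); // one token
console.log(tokenize("banana", vocabB)); // two tokens
```

The same text becomes `[7]` under one vocabulary and `[40, 41]` under the other. Feed either sequence to a model trained with the other vocabulary and the IDs point at the wrong entries.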

10. How an LLM picks the next token, and where temperature comes in

The mechanic behind every LLM response, told in one example.

The output shape: The model takes the previous text as tensors, runs them through the network, and outputs a probability for every token in the vocabulary being the next token.
The example: "Roses are red, violets are…" — the model produces a probability distribution where "blue" has the highest probability.
The randomness: If you always pick the highest probability, you get boring, predictable text. Temperature controls how much randomness goes into the choice: instead of always taking the top token, the next token is drawn from the top N candidates, weighted by their probabilities. That's why the same prompt can produce different outputs each time, and why a small bit of randomness compounds across hundreds of generated tokens.
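
The mechanic can be sketched with a softmax and a weighted draw. The three-word vocabulary and the raw scores (logits) are made up for the "violets are…" example, and the function names are hypothetical; production samplers usually add further filtering such as limiting the draw to the top candidates.

```typescript
// Turn raw model scores (logits) into probabilities.
// Lower temperature sharpens the distribution, higher temperature flattens it.
// Temperature must be greater than 0.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const maxLogit = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Draw one index at random, weighted by the given probabilities.
// The random source is injectable so the draw can be made deterministic.
function sampleIndex(probs: number[], random: () => number = Math.random): number {
  const r = random();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

const vocabulary = ["blue", "red", "purple"]; // hypothetical tiny vocabulary
const logits = [3.0, 1.0, 0.5];               // made-up scores for the next token

const sharp = softmax(logits, 0.5); // "blue" dominates
const flat = softmax(logits, 2.0);  // other tokens get a real chance

console.log("temperature 0.5:", sharp);
console.log("temperature 2.0:", flat);
console.log("sampled:", vocabulary[sampleIndex(flat)]);
```

At the lower temperature almost all of the probability sits on "blue"; at the higher one, "red" and "purple" are picked noticeably more often, and each such pick changes everything generated afterwards.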

11. LLMs aren't always the best fit for the browser — other tasks matter more

A reality check from someone who builds in-browser AI for a living.

The LLM demo: Nico's colleague Joshua released a demo running GPT-OSS, a 20B parameter LLM, fully in the browser. Cool — but it requires downloading roughly 12 gigabytes before you can use it.
The better fits: Vision tasks like background removal and image classification. Audio tasks like text-to-speech and speech recognition. Voice activity detection — a tiny model that just tells you "I'm 90% sure someone is talking." That last one is the perfect on-device use case: small, runs on CPU without a GPU, instant.
The pattern: The big-model browser demos are impressive. The actual practical wins are in the smaller, specialized tasks where the privacy and latency benefits outweigh the size cost.

12. WebAssembly + WebGPU — they don't replace each other, they cooperate

The runtime story for in-browser ML, clarified.

WebAssembly: ONNX Runtime Web is shipped as WebAssembly — that's how the runtime gets into the browser at all. You can't get rid of WebAssembly because it's what runs the inference engine.
WebGPU: A browser API that gives you access to GPU operations from the browser. You can run Transformers.js on the CPU, but for larger models it's slow. With WebGPU, the runtime delegates the heavy compute to the GPU.
The combo: WebAssembly hosts the runtime; WebGPU does the math. Both are required for serious in-browser ML performance.
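
Since the user's hardware is unknown, a page can at least check whether WebGPU is exposed before choosing a device. This is a minimal sketch: `pickDevice` and `NavigatorLike` are hypothetical names, and the `"webgpu"`, `"wasm"`, and `"q4"` strings are assumed to match the `device` and `dtype` options of recent Transformers.js versions, so check them against the version you ship.

```typescript
type Device = "webgpu" | "wasm";

// Structural stand-in for the browser's navigator object, so the sketch
// compiles and runs outside a browser as well.
type NavigatorLike = { gpu?: unknown };

// WebGPU is exposed as navigator.gpu in supporting browsers.
// Its presence does not guarantee a usable adapter, so real code would
// also need to handle a failed adapter request.
function pickDevice(nav: NavigatorLike | undefined): Device {
  return nav !== undefined && nav.gpu !== undefined ? "webgpu" : "wasm";
}

const nav = (globalThis as unknown as { navigator?: NavigatorLike }).navigator;

// Assumed option names; these would be passed as the options argument
// of the Transformers.js pipeline function.
const options = { device: pickDevice(nav), dtype: "q4" };

console.log(options);
```

Either way the WebAssembly runtime is still loaded; the check only decides whether the heavy math is handed to the GPU.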

13. The Prompt API — a unified browser API for in-browser LLMs

Chrome ships Gemini Nano with the browser. The Prompt API exposes it.

What it is: Chrome downloads a version of Gemini Nano once, and the Prompt API exposes a way for any origin to interact with it. The model runs entirely in the browser.
The cross-browser ambition: The same API could be implemented in Firefox using a Gemma model, or in Safari delegating to Apple Intelligence. Same JavaScript surface, different models behind the scenes.
The skepticism: As a developer, Nico wants to control which model his app is using. If Firefox ships an outdated model under the same API, his app's behavior changes. He'd at least want metadata — "you're talking to model X, version Y" — so apps can degrade gracefully.

14. Hardware reality: a Pixel phone outruns an old Dell for LLMs

A counter-intuitive observation that says something about chip architecture.

The Windows problem: Many Windows laptops with Intel chips have limited GPU RAM and have to swap memory in and out, which kills performance for large models.
The unified memory advantage: Apple Silicon and modern phones have unified memory. Once a model is loaded, reading and writing is fast.
The result: Running Gemma 3 is more performant on Nico's Pixel phone than on his older Dell laptop. Copilot Plus laptops with dedicated AI chips should change this — but right now, mobile often wins.

15. We're not cooked — the role just changes

On the AI-replaces-developers question, Nico's take is direct.

Not the engineer, the code monkey: The pure code-typing job is going away, but the software engineer role isn't — someone has to understand what's happening, why it was done that way, and whether a change makes sense historically.
The bank example: At the bank where Nico freelanced for five years, the product owner explicitly forbade ChatGPT use because of data and privacy concerns. As a customer, you actually want your bank to be conservative with new technology rather than the first to break everything by going wild with it.
The shift: Soft skills become hard skills — communicating features, writing good specs, knowing what changes need to happen. Maybe we don't need as many developers, but the role itself isn't going away.

16. How to break into ML/AI as a developer today

There's a difference between being a machine learning engineer and being good at using ML.

The distinction: A real ML engineer trains models. Most of us are users — and that's fine. Nico still considers himself a frontend developer; even at Hugging Face, his focus is the JavaScript API surface of Transformers.js.
The basics worth knowing: What tool calling is, what MCP is, why 50 MCP servers shrink your context window, how an agent goes from "generate the next token" to "do things in the real world." That mental model is what most developers actually need.
The realistic ceiling: We don't need millions of ML engineers. We need millions of people who understand how to use these tools well.

🧠 What I Learned

A model is an architecture (a schema describing layers and operations) plus weights (the numbers used in those operations) — and weights are why models are big.
Quantization compresses models by storing weights in lower precision (e.g. 4 bits instead of 32), trading some accuracy for roughly 8x size reduction.
ONNX is a runtime-agnostic standard for model architectures; ONNX Runtime Web is the WebAssembly runtime Transformers.js uses in the browser.
Every Transformers.js task is pre-process → model → post-process; only some tasks include a tokenizer step.
Tokenizers are tightly coupled to the model that was trained with them — using the wrong tokenizer makes IDs meaningless to the model.
Hugging Face split tokenization into tokenizers.js because people were already using it to pre-process text before sending to non-ONNX runtimes.
LLMs generate text by producing a probability distribution over the next token; temperature adds controlled randomness to avoid boring, deterministic output.
The best in-browser ML use cases are often small models — voice activity detection, background removal, embeddings — not 12GB LLMs.
WebAssembly hosts the inference runtime; WebGPU does the actual math. Both are required for performance.
The Prompt API in Chrome could become a unified, cross-browser interface for in-browser LLMs — but developers may want metadata to know which model they're actually talking to.
Mobile and Apple Silicon devices often outperform older Windows laptops for browser ML because of unified memory.
The "AI replaces developers" narrative misses the point — code monkey work is at risk; engineering judgment isn't.
Most developers don't need to become ML engineers. They need to understand tokenizers, embeddings, tool calling, MCP, and how an agent actually works.

💬 Favorite Quotes

"The biggest challenge is that Transformers.js actually uses the hardware that your clients come with. So it could be a MacBook Pro with an M4, or it could be an old laptop with not that much RAM. You need to make sure the task you want to use actually works everywhere where it needs to run."

"A machine learning model is basically two things. You have an architecture — how many layers, how many neurons, what operations happen between them. And then you have weights, which are just numbers needed inside those operations to calculate stuff."

"If you always take the token with the highest probability, you end up with very boring, very predictable texts. If you add a little bit of randomness to each step, you get some creativity. That's also why LLMs are unpredictable."

"We don't need millions of machine learning engineers, but we need millions of people who understand how they can use those tools."

"The job as a code monkey is gone. The job of a software engineer is not — you need people who actually understand what's happening."

"I want my bank to be conservative with these technologies and to actually let other people figure out the problems before they adopt it."

🎯 Also in this Episode

The journey from teaching school to ski instructor to self-employed web developer to Hugging Face
Why Nico joined the Google Developer Experts program while freelancing, working a lot with progressive web apps
The five years freelancing at a bank — safe but missing new-technology excitement
How tokenizers.js was extracted from Transformers.js as Nico's first Hugging Face project
The "banana" walkthrough — one token in GPT-5, two tokens in GPT-3
Embeddings explained through the animals analogy — height, weight, legs as features in a vector space
Vector databases, similarity search, and why text embedding models are a great browser fit
Image background removal and image classification as practical Transformers.js use cases
OpenCloud / OpenCloud-style takes on local agents and what's hyped vs. real
The biggest agent bottleneck right now — trust and access to user data, not the technology itself
Why frontend specialists may need to broaden into systems thinking as AI rewrites code across frameworks
Book recommendation: Utopia for Realists by Rutger Bregman

Resources

🎧 Listen Now

🎧 Spotify
📺 YouTube
🍏 Apple Podcasts

Episode Length: 66 minutes on Transformers.js, on-device AI, ONNX, WebGPU, tokenizers, and how machine learning models actually work end to end.

Whether you're a frontend developer who's curious about ML, or you've used LLMs but never quite understood what's happening under the hood, this conversation is one of the cleanest top-to-bottom mental models I've heard.

Happy building,
Dan
