Your agent doesn’t need a browser. It needs the API behind the button.
browser-use calls itself “The Way AI uses the web.” It’s an open-source browser automation framework that gives AI agents control of a real Chromium instance — point-and-click, page-by-page, just like a human. Over 25,000 GitHub stars. Backed by serious funding. Trusted, they say, by Fortune 500 companies.
It solves the right problem. Agents need to act on the web — search, fetch, interact, transact. The question is whether a browser is the right tool for that job.
We think it’s the most expensive, least reliable, least token-efficient way to get there.
## How browser automation works
browser-use operates a perception-action loop. The agent captures the current page state — a structured list of interactive elements, indexed for reference. An LLM reads that state, decides what to do next, and executes an action: click a button, fill a field, scroll, navigate. Then it captures the new page state and repeats.
Think → Act → Observe → Reflect. Every interaction is a full loop iteration.
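As a sketch, the loop might look like this in Python. The names `capture_state`, `decide`, and `execute` are stand-ins for the framework's internals, not browser-use's actual API:

```python
# Minimal sketch of a browser-automation perception-action loop.
# All names here are illustrative, not browser-use's real interface.

def run_agent_loop(goal, capture_state, decide, execute, max_steps=10):
    """Think -> Act -> Observe -> Reflect, one full iteration per action."""
    history = []
    for _ in range(max_steps):
        state = capture_state()                 # Observe: serialize page elements
        action = decide(goal, state, history)   # Think: LLM picks the next action
        if action == "done":
            break
        result = execute(action)                # Act: click / fill / navigate
        history.append((action, result))        # Reflect: feed the outcome back in
    return history
```

Note that the full `state` passes through the model on every iteration; that detail is what drives the cost and token numbers later in this piece.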
This is impressive engineering. Controlling a browser programmatically, handling dynamic content, managing page transitions — that’s hard. browser-use also offers stealth browsers (anti-detection, CAPTCHA solving, proxies in 195+ countries), custom models purpose-built for browser automation, and a Skills marketplace where browser interactions can be packaged as reusable API endpoints.
But impressive engineering doesn’t mean it’s the right abstraction.
## What a browser actually is
Every human-facing interface — every web app, every GUI, every dashboard — is a slow API. A button click is an HTTP request with extra steps. A form submission is a POST with a loading spinner. A search bar is a GET with a text field and a magnifying glass icon.
The browser exists because humans need visual context to understand what they’re doing. We need layout, color, typography, and spatial relationships to parse information. We need buttons to know where to click. We need loading spinners to know something is happening.
Agents need none of this.
When an agent searches the web via browser automation, here’s what actually happens: a headless Chromium instance launches. DNS resolves. TCP connects. TLS handshakes. The server sends HTML, CSS, JavaScript, fonts, images, tracking pixels, analytics scripts, cookie consent banners. The browser parses and renders all of it. JavaScript executes. The page stabilizes. The agent extracts the content it actually needed — the search results — from the rendered DOM.
When an agent searches the web via a direct API call, here’s what happens: an HTTP request goes out. JSON comes back. Done.
The browser added a rendering engine, a JavaScript runtime, dozens of network requests, and hundreds of milliseconds of latency — all to extract the same structured data that was available before the page ever rendered.
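The direct path is short enough to show in full. Here is a sketch using only the Python standard library; the endpoint and header follow Brave's published Search API docs, but verify them against the current documentation before relying on this:

```python
# A web search as a single HTTP round-trip: request out, JSON back.
# Endpoint and header are taken from Brave Search API documentation;
# treat them as an assumption and check the docs before use.
import json
import urllib.parse
import urllib.request

def build_search_request(query, api_key):
    url = ("https://api.search.brave.com/res/v1/web/search?"
           + urllib.parse.urlencode({"q": query}))
    return urllib.request.Request(
        url,
        headers={"X-Subscription-Token": api_key,
                 "Accept": "application/json"})

def search(query, api_key):
    req = build_search_request(query, api_key)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)  # structured results, no rendering step
```

No DOM, no JavaScript runtime, no element tree: the structured data arrives in one exchange.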
## The cost math
This isn’t theoretical. The costs are measurable.
Browser automation requires maintaining a browser instance — either locally or in the cloud. browser-use’s cloud offering runs managed Chromium sessions. Each interaction involves the LLM processing the full page state: the element tree, the URL, the visible text, sometimes a screenshot. That state is sent to the model on every loop iteration. A single web interaction might take three to five loop iterations: navigate, wait for load, find the element, interact, confirm the result.
Purpose-built tools skip all of this. A web search is one API call. A page fetch returns clean markdown. No browser instance, no rendering, no multi-step navigation, no page state serialization.
The numbers:
| | Browser automation | Direct API |
|---|---|---|
| Web search | ~$0.05–0.50 per search (browser instance + multi-iteration LLM reasoning over page state) | ~$0.015 per search |
| Page fetch | ~$0.10–1.00 per page (rendering + element extraction + LLM interpretation) | ~$0.03–0.06 per page |
| Speed | 3–15 seconds (page load + render + LLM loop iterations) | 200–500ms (API round-trip) |
| Tokens consumed | 2,000–10,000+ per interaction (full page state each iteration) | 50–500 per interaction (structured response) |
At small scale, the difference is pocket change. At agent scale — thousands of interactions per day across dozens of agents — it’s the difference between a viable operation and an uncontrolled cost center.
An agent doing 100 web searches and 50 page fetches per day costs roughly $10–100 via browser automation. The same workload via direct API calls costs roughly $3–4.50. Scale that to 30 agents over a 30-day month, and you’re comparing $9,000–90,000 to $2,700–4,050.
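The arithmetic behind those figures, using the per-interaction rates from the table:

```python
# Back-of-envelope cost math for one agent, one day,
# using the per-interaction rates quoted in the table above.
searches, fetches = 100, 50

browser_low  = searches * 0.05  + fetches * 0.10   # $10/day
browser_high = searches * 0.50  + fetches * 1.00   # $100/day
api_low      = searches * 0.015 + fetches * 0.03   # $3.00/day
api_high     = searches * 0.015 + fetches * 0.06   # $4.50/day

fleet = 30 * 30  # 30 agents x 30 days
print(browser_low * fleet, browser_high * fleet)   # 9000.0 90000.0
print(api_low * fleet, api_high * fleet)           # 2700.0 4050.0
```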
## Token efficiency is the real constraint
Cost matters, but token efficiency matters more. An agent’s context window is its most precious resource — the finite working memory where reasoning happens. Every token spent on tool overhead is a token not spent on the user’s actual problem.
Browser automation is token-hostile. Each loop iteration dumps the current page state into the agent’s context: element trees, URLs, visible text, interaction history. A single page might serialize to 2,000–10,000 tokens. Over a multi-step interaction, the context fills with page snapshots the agent will never reference again.
The Most MCPs Should Be CLIs article documented a 4–32x token cost difference between heavyweight and lightweight tool interfaces. Browser automation sits at the extreme end of that spectrum. You’re feeding the model a rendered webpage’s worth of tokens when what it needs is three lines of structured data.
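A back-of-envelope version of that gap, using mid-range figures from the ranges quoted above:

```python
# Rough token budget for one web task, using mid-range values
# from the ranges quoted above (assumptions, not measurements).
page_state_tokens = 5_000   # serialized element tree + text, per snapshot
iterations = 4              # navigate, wait, find, interact
browser_tokens = page_state_tokens * iterations  # state resent each loop

api_response_tokens = 300   # structured search results, once
print(browser_tokens // api_response_tokens)     # ~66x more context spent
```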
Direct API calls return exactly what the agent needs: clean, structured output. A web search returns result snippets. A page fetch returns markdown. No navigation chrome, no cookie banners, no JavaScript bundles, no ad slots. Just data.
The less context spent on tool output, the more the agent can spend on the thinking that actually matters.
## Reliability and the brittleness problem
Browser automation is inherently brittle. Websites change their layouts, update their CSS, restructure their DOM, add pop-ups, rotate A/B tests, deploy anti-bot measures. Each change can break an automation that worked yesterday.
browser-use’s answer to this is clever: their Skills feature lets you record a browser interaction and replay it as a deterministic endpoint. Define the goal, demonstrate it once, and the skill becomes reusable. It’s a compelling idea — turning brittle browser sessions into stable API-like interfaces.
But the foundation is still a browser. The skill was recorded against a specific DOM structure. When that structure changes — and it will — the skill breaks. You’re building deterministic endpoints on a shifting substrate.
Direct API calls don’t have this problem. APIs have contracts. Versioned endpoints. Documented schemas. When a provider changes its API, it publishes a migration guide and maintains the old version. When a website changes its CSS, it tells nobody, because it doesn’t know you were scraping it.
browser-use also offers stealth browsers and anti-detection features — proxies, CAPTCHA solving, fingerprint randomization. These exist because websites actively fight automation. It’s an arms race: the website deploys detection, the automation deploys evasion, the website deploys better detection.
Purpose-built tools don’t fight this war. They authenticate via OAuth or API key — the same way the website’s own mobile app does. There’s no detection to evade because there’s no unauthorized access to detect.
## Safety is structural, not bolted on
browser-use offers human-in-the-loop capability: a human can take over a live browser session for sensitive actions. That’s a useful feature. But it’s a feature — something you opt into for specific interactions. The default is autonomous operation.
When an agent controls a browser, it has the full capability surface of a human user. It can click “Delete All” as easily as “Send.” It can navigate to account settings and change passwords. It can access any page, submit any form, click any button. The blast radius of a mistake is whatever the website allows — which is usually everything.
Mechanical Advantage inverts this. Non-destructive design is a structural property, not a permission setting. There are no delete endpoints for emails, calendar events, contacts, or documents. The code path doesn’t exist. Every outbound action — emails, messages, calendar events, posts — queues for human approval before execution. The approval interface requires biometric authentication (WebAuthn/FIDO2 passkeys) that an agent physically cannot perform.
An agent that’s been prompt-injected, confused, or simply wrong gets as far as proposing an action. A human reviews it. If it’s wrong, the human rejects it, and the rejection and feedback flow back to the agent. Over time, the agent learns. The review queue isn’t just a safety gate — it’s a teaching loop.
A browser-controlled agent that’s been prompt-injected can click whatever it can see. The only thing between the agent and a catastrophic action is the hope that the LLM will reason its way out of doing the wrong thing.
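The review-queue pattern described above can be sketched in a few lines. The names and structure here are illustrative only, not Mechanical Advantage's actual implementation:

```python
# Minimal sketch of an approval-gated action queue: the agent can
# propose, only a human review transitions an action out of "pending".
# Purely illustrative; not Mechanical Advantage's real code.
from dataclasses import dataclass

@dataclass
class ProposedAction:
    kind: str                 # e.g. "send_email"
    payload: dict
    status: str = "pending"   # pending -> approved | rejected
    feedback: str = ""        # reviewer notes flow back to the agent

class ReviewQueue:
    def __init__(self):
        self.items = []

    def propose(self, kind, payload):
        action = ProposedAction(kind, payload)
        self.items.append(action)   # the agent's reach ends here
        return action

    def review(self, action, approve, feedback=""):
        action.status = "approved" if approve else "rejected"
        action.feedback = feedback  # teaching signal, not just a gate
        return action
```

The structural point is that execution code only ever consumes approved actions; there is no path from `propose` straight to the outside world.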
## Where browser automation wins
Let’s be honest about this.
There’s a long tail of web interactions that don’t have APIs. Legacy enterprise systems with web-only interfaces. Government portals. Niche SaaS products that never built public APIs. Websites where the only way to get the data is to navigate the pages, because nobody exposed the database behind them.
For these cases, browser automation is the right tool. It’s the only tool. If you need to fill out a form on a county tax assessor’s website from 2004, you need a browser.
browser-use’s Skills concept is also genuinely good thinking. The idea that any browser interaction can become a reusable, parameterized endpoint is powerful. If the underlying substrate were more stable, this would be transformative.
And there are interaction patterns — multi-step workflows across multiple sites, visual verification of rendered content, interactions that depend on JavaScript state — where a browser is legitimately the right level of abstraction.
We’re not arguing that browser automation should never exist. We’re arguing that it shouldn’t be the default.
## The 90% case
Most of what agents do on the web falls into a small set of well-defined actions: search, fetch pages, send emails, manage calendars, look up contacts, send messages. These are the primitives. They compose into everything agents actually spend their time doing — research, scheduling, communication, coordination.
Every one of these has a direct API. Web search has Brave Search API. Page fetching has Firecrawl. Email has IMAP and SMTP. Calendars have CalDAV. Contacts have CardDAV. Messaging platforms have bot APIs.
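To make one of those primitives concrete: sending email needs nothing beyond the Python standard library. The host, port, and credentials below are placeholders:

```python
# Email as a protocol, not a page: compose with the stdlib and hand
# the message to any SMTP server. Host and credentials are placeholders.
import smtplib
from email.message import EmailMessage

def compose(sender, to, subject, body):
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, to, subject
    msg.set_content(body)
    return msg

def send(msg, host="smtp.example.com", port=587, user=None, password=None):
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()               # upgrade to TLS before authenticating
        if user:
            smtp.login(user, password)
        smtp.send_message(msg)        # one protocol exchange, no DOM in sight
```

IMAP, CalDAV, and CardDAV admit equally direct treatments; the shape is always the same, a protocol client instead of a rendered page.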
Browser automation solves the 100% case — any interaction a human could do. But it pays the full cost of a browser for every interaction, including the 90% that didn’t need one. It’s using a forklift to carry a coffee cup. The forklift works. It’s just not the right tool.
Mechanical Advantage provides purpose-built CLI tools for the 90% case. A single command — `ma web search "flights to tokyo"` — returns structured markdown in 200 milliseconds. No browser instance. No page rendering. No multi-step LLM loop. The agent gets exactly what it needs and moves on.
```
$ ma web search "flights to tokyo in march"

# Web Search Results

1. **Tokyo Flight Deals — March 2026**
   Source: kayak.com/flights/tokyo
   Round-trip from SFO starting at $487...
2. **Cheap Flights to Tokyo (NRT)**
   Source: google.com/flights
   Best prices found: $512 JAL, $534 ANA...
```
Compare this to what browser automation produces: a multi-second browser session, several thousand tokens of page state pumped through an LLM, and the same three lines of useful information extracted from the noise.
## Different problems, different tools
browser-use and Mechanical Advantage aren’t really competitors. They’re answers to different questions.
browser-use asks: “How do we make agents navigate the human web?” It’s a browser-first approach — start from what humans do, and automate it.
Mechanical Advantage asks: “What do agents actually need from the web?” It’s an API-first approach — start from what agents need, and build the shortest path to it.
For the long tail — the 10% of interactions that genuinely require a browser — browser automation is the right answer. For the foundational primitives that agents use hundreds of times a day — search, fetch, email, calendar, contacts, messaging, memory — a purpose-built CLI that calls the API directly is faster, cheaper, more reliable, more token-efficient, and structurally safer.
The way AI uses the web shouldn’t look like a human using a browser. It should look like software calling an API. Because that’s what it is.
## Learn more
- Most MCPs Should Be CLIs — Why token-efficient CLI tools outperform heavyweight protocols for agent-tool interaction
- Bigger Cages, Better Tools — Why sandboxing agents isn’t enough, and what the alternative looks like
- Agents of Chaos — What happens when agents use tools designed for humans