63%->94%: Silverstream Hits 2σ Reliability for Web Agents

Our agents now solve 94% (2σ) of ServiceNow public benchmark tasks, a roughly 30‑point jump over the previous state-of-the-art performance (63.8%).

Written by Silverstream AI Team

Published on July 11, 2025

From 63% to 94%: Closing the Web Agent Reliability Gap

In today's enterprise, 40% of workers spend at least a quarter of their workweek on manual, repetitive tasks, such as email, data collection, and data entry. At Silverstream, we build agents that work in the background, freeing users from these tasks so they can focus on strategic, high-value work.

Web agents are large language models equipped with a browser. They can control the mouse and keyboard on a webpage like a human, enabling them to interact with nearly any existing software through a single, universal interface. For example, in an operations team, an agent can log into multiple carrier portals every hour, extract tracking updates, reconcile them with a Transportation Management System, and surface the exceptions.

The First Milestone: Achieving 2σ Accuracy

We recently validated our first large-scale milestone on independent benchmarks. Silverstream agents now reach a 94% top-5 success rate on common ServiceNow tasks. Many enterprises run this software daily, so such benchmarks let us quantify when web agents become economically useful.

In practical terms, a 63% step-wise success rate gives an agent a task half-life of ~1 minute (i.e., half of the time, the agent fails within 1 minute). At 94%, the half-life extends to over 11 minutes, drastically reducing the need for human intervention.
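The half-life figures can be reproduced with a short calculation. This is a minimal sketch, assuming each step succeeds independently with probability p and that one step takes roughly one minute. The post does not state the step duration, so that part is an assumption, and `half_life_steps` is an illustrative name.

```python
import math


def half_life_steps(p: float) -> float:
    """Number of steps n at which p**n drops to 0.5, assuming independent steps."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(0.5) / math.log(p)


for p in (0.63, 0.94):
    # Assumption: about one step per minute, so steps and minutes coincide.
    print(f"p={p:.2f}: half-life ~ {half_life_steps(p):.1f} steps")
```

Under these assumptions, 63% gives a half-life of about 1.5 steps and 94% gives about 11.2 steps, in line with the rounded figures quoted in the text.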

This 2σ milestone brings us closer to our goal of 3σ reliability: a full 8-hour unsupervised half-life, enabling full-workday automation without manual oversight.
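The σ labels appear to borrow from the normal distribution, where kσ denotes the share of outcomes within k standard deviations of the mean. The post does not define the mapping, so the sketch below shows that conventional reading as an assumption. `sigma_coverage` is an illustrative name.

```python
import math


def sigma_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"{k} sigma -> {sigma_coverage(k):.2%}")
```

On this reading, 2σ corresponds to about 95.45% and 3σ to about 99.73%, so the 94% and 99.7% figures in the text are rounded approximations of those levels.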

Additionally, our agents achieved a 91% success rate on the MiniWoB benchmark, significantly outperforming the current best-in-class success rate of 74.9%. All evaluations report top-5 success across randomized task configurations.

Core Architectural Innovations for Robust Browser Automation

Unlike typical SaaS development, building AI agents lacks mature abstractions, which forces deliberate choices about which layers to own and which to abstract. In practice, we own the entire agentic stack: browser environments, memory management, agent orchestration, and user interfaces. We abstract away only the classical IaaS tooling and, in some cases, LLM inference.

An agent is a holistic system: a deterministic software stack with a stochastic LLM core. Unlike "read-only" generative models, agents are read-write decision loops (OODA: observe, orient, decide, act). Outsourcing key layers to third parties too early means inheriting the risks of an immature stack, along with the loss of reliability that comes with it.
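The idea of a deterministic stack wrapped around a stochastic core can be illustrated with a toy loop. This is a sketch under stated assumptions, not Silverstream's implementation: `Page`, `Agent`, and `run` are hypothetical names, the "page" is just a counter, and a seeded random draw stands in for the LLM call.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy stand-in for a browser page: a counter the agent must raise to a target."""
    value: int = 0
    target: int = 3


@dataclass
class Agent:
    rng: random.Random
    history: list = field(default_factory=list)

    def observe(self, page: Page) -> dict:
        return {"value": page.value, "target": page.target}

    def orient(self, obs: dict) -> int:
        return obs["target"] - obs["value"]  # remaining distance to the goal

    def decide(self, remaining: int) -> str:
        # Stochastic core: stands in for an LLM that is right about 90% of the time.
        if self.rng.random() < 0.9:
            return "increment" if remaining > 0 else "stop"
        return "noop"

    def act(self, page: Page, action: str) -> None:
        # Deterministic shell: only known actions ever change the page.
        if action == "increment":
            page.value += 1
        self.history.append(action)


def run(agent: Agent, page: Page, max_steps: int = 50) -> bool:
    """Run the observe-orient-decide-act loop until the agent stops or the budget ends."""
    for _ in range(max_steps):
        action = agent.decide(agent.orient(agent.observe(page)))
        if action == "stop":
            return page.value == page.target
        agent.act(page, action)
    return False


print(run(Agent(random.Random(0)), Page()))
```

Because the loop both reads and writes the page, a bad decision changes the state that later steps observe, which is why the surrounding deterministic layers matter.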

Our goal of real-world reliability drives our decision to run ephemeral, encrypted, isolated, elastic browsers in the cloud. These are:

Scalable: Agents run in headless, ephemeral containers, eliminating local browser instability, ensuring consistent browser configuration, reducing interruptions, improving security, and lowering infrastructure costs. Google Cloud Run maintains low cold-start latency while scaling from zero to thousands of agents with minimal overhead.
Self-hosted: Customers can choose to run agents fully on-prem, keeping their data safe and avoiding regulatory friction.
Encrypted by default: Each agent runs inside a confidential computing enclave, with data protected by Trusted Execution Environments.
Observable & auditable: Every action is optionally recorded and can be replayed. We can trace and report every network request across systems, either for audits or to pinpoint root causes.
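The "observable & auditable" property in the list above can be sketched as an append-only action log with replay. This is a minimal illustration, assuming a JSON Lines style record per action. `ActionRecord`, `ActionLog`, and the field names are hypothetical and not taken from Silverstream's stack.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ActionRecord:
    step: int
    kind: str         # e.g. "click", "type", "navigate"
    target: str       # CSS selector or URL
    payload: str      # text typed, if any
    timestamp: float


class ActionLog:
    """Append-only log; each entry is stored as one JSON string."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled  # recording is optional, as in the text
        self._lines = []

    def record(self, kind: str, target: str, payload: str = "") -> None:
        if not self.enabled:
            return
        rec = ActionRecord(len(self._lines), kind, target, payload, time.time())
        self._lines.append(json.dumps(asdict(rec)))

    def replay(self, executor) -> None:
        """Feed every recorded action, in order, to an executor callable."""
        for line in self._lines:
            executor(ActionRecord(**json.loads(line)))


log = ActionLog()
log.record("navigate", "https://example.com/login")
log.record("type", "#user", "alice")
log.replay(lambda rec: print(rec.step, rec.kind, rec.target))
```

Serializing each record on write keeps the log independent of in-memory objects, so the same lines could be shipped to storage and replayed later.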

Advanced understanding via Multi-Modal Page Representation

Our agents use five context sources to solve tasks: the raw page (DOM), enrichments from our specialized page-parsing tool, screenshots, console information, and network requests. Ablation studies showed the first three modalities contribute roughly equally.

Both textual and visual modalities are essential: from the DOM, we extract semantic structures and interaction cues, and resolve visual ambiguities. Screenshots help the agent to decode the layout and positioning of the elements on the page. Network requests, console logs, and other sources allow us to complete the picture and handle edge cases.
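One way to picture the five context sources is as a single structure handed to the model at each step. The sketch below is an assumption about shape only: `PageContext`, its field names, and the section tags are hypothetical, and the screenshot is assumed to travel separately as an image input.

```python
from dataclasses import dataclass, field


@dataclass
class PageContext:
    dom: str                                               # raw page (DOM)
    enrichments: list = field(default_factory=list)        # page-parser output
    screenshot_png: bytes = b""                            # sent as an image, not text
    console_logs: list = field(default_factory=list)
    network_requests: list = field(default_factory=list)

    def text_prompt(self, max_dom_chars: int = 2000) -> str:
        """Join the textual sources into tagged sections, skipping empty ones."""
        parts = [f"[DOM]\n{self.dom[:max_dom_chars]}"]
        if self.enrichments:
            parts.append("[ENRICHMENTS]\n" + "\n".join(self.enrichments))
        if self.console_logs:
            parts.append("[CONSOLE]\n" + "\n".join(self.console_logs))
        if self.network_requests:
            parts.append("[NETWORK]\n" + "\n".join(self.network_requests))
        return "\n\n".join(parts)


ctx = PageContext(
    dom="<button id='ok'>OK</button>",
    console_logs=["TypeError: x is undefined"],
)
print(ctx.text_prompt())
```

Keeping each source in its own field makes it straightforward to drop one at a time, which is the kind of comparison an ablation study relies on.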

Large-scale deployments and off-road agents

We have run autonomous agents across the open internet for two years now.

Pasta-1T

The result: low-cost, unsupervised exploration yielding high-value datasets, while the infrastructure hardens against millions of real-world edge cases.

Reliability First: Measuring Continuous Incremental Progress and Health State

Consistent measurement is essential for users to trust an independent agent. At Silverstream, every new release is evaluated on 2,630 practical business scenarios, ranging from early 2000s ERP software to modern logistics portals.

Our agents self-heal, and that's a desirable feature, but we need to be very clear with ourselves about when something went wrong. We currently have 372 health checks across the agent loop, with a human fallback for selected customers.
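A health check with a human fallback can be sketched as a list of named probes evaluated on every pass through the agent loop. This is an illustrative sketch only: `HealthCheck`, `run_checks`, `next_step`, and the two example probes are hypothetical and do not describe the actual 372 checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthCheck:
    name: str
    probe: Callable[[dict], bool]  # returns True when the state looks healthy


def run_checks(state: dict, checks: list) -> list:
    """Return the names of failed checks; a probe that raises counts as failed."""
    failed = []
    for check in checks:
        try:
            ok = check.probe(state)
        except Exception:
            ok = False
        if not ok:
            failed.append(check.name)
    return failed


def next_step(state: dict, checks: list, human_fallback: bool = False) -> str:
    """Continue when healthy; otherwise escalate or abort depending on the customer."""
    if not run_checks(state, checks):
        return "continue"
    return "escalate_to_human" if human_fallback else "abort"


CHECKS = [
    HealthCheck("page_loaded", lambda s: s["http_status"] == 200),
    HealthCheck("no_console_errors", lambda s: not s["console_errors"]),
]

print(next_step({"http_status": 200, "console_errors": []}, CHECKS))
print(next_step({"http_status": 500, "console_errors": []}, CHECKS, human_fallback=True))
```

Treating a crashing probe as a failure keeps self-healing from masking problems: the loop only continues when every check positively reports a healthy state.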

Bottom line & the future

At a 94% task success rate on common SaaS software, our agents have crossed the reliability threshold for unattended production use. With this first milestone reached, our next target is 99.7% accuracy (3σ), which would enable full-workday automation without manual oversight.

Unlock reliable automation: Book a 30‑Min Scoping Call with the founders!

Tell us about your enterprise use case. We will follow up with a tailored demo.