In today's enterprise, 40% of workers spend at least a quarter of their workweek on manual, repetitive tasks, such as email, data collection, and data entry. At Silverstream, we build agents that work in the background, freeing users from these tasks so they can focus on strategic, high-value work.
Web agents are large language models equipped with a browser. They control the mouse and keyboard on a webpage as a human would, which lets them interact with nearly any existing software through a single, universal interface. For example, in an operations team, an agent can log into multiple carrier portals every hour, extract tracking updates, reconcile them with a Transportation Management System, and surface the exceptions.
We recently validated our first large-scale milestone on independent benchmarks. Silverstream agents now reach a 94% top-5 success rate on common ServiceNow tasks. Many enterprises run this software daily, so benchmarks like these let us quantify when web agents become economically useful.
In practical terms, a 63% step-wise success rate gives an agent a task half-life of about 1 minute (that is, half of all runs fail within the first minute). At 94%, the half-life extends to over 11 minutes, drastically reducing the need for human intervention.
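The half-life figures can be reproduced with a short calculation. This is a minimal sketch under two assumptions that are not stated above: each step succeeds independently with the same probability, and one step takes roughly one minute. Under those assumptions the chance of surviving n steps is p^n, so the half-life is the n at which p^n = 0.5. The function name is illustrative.

```python
import math


def half_life_steps(step_success_rate: float) -> float:
    """Number of steps after which half of all runs have failed.

    Assumes independent steps with a constant success probability p,
    so survival after n steps is p**n and the half-life solves p**n = 0.5.
    """
    if not 0.0 < step_success_rate < 1.0:
        raise ValueError("step_success_rate must be strictly between 0 and 1")
    return math.log(0.5) / math.log(step_success_rate)


if __name__ == "__main__":
    for p in (0.63, 0.94):
        # With the assumed one-minute step, steps and minutes coincide.
        print(f"p = {p:.2f}: half-life of about {half_life_steps(p):.1f} steps")
```

With one-minute steps this gives roughly 1.5 minutes at 63% and roughly 11.2 minutes at 94%, in line with the figures quoted above.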
This 2σ milestone brings us closer to our goal of 3σ reliability: a full 8-hour unsupervised half-life, enabling full-workday automation without manual oversight.
Additionally, our agents achieved a 91% success rate on the MiniWoB benchmark, significantly outperforming the current best-in-class success rate of 74.9%. All results are top-5 success rates measured across randomized task configurations.
Unlike typical SaaS development, building AI agents lacks mature abstractions, which forces deliberate choices about which layers to own and which to abstract. In practice, we own the entire agentic stack: browser environments, memory management, agent orchestration, and user interfaces. We abstract away only classical IaaS tooling and, in some cases, LLM inference.
An agent is a holistic system: a deterministic software stack with a stochastic LLM core. Unlike "read-only" generative models, agents are read-write decision loops (OODA: observe, orient, decide, act). Outsourcing key layers to third parties too early means inheriting the risks of an immature stack, along with the loss of reliability that comes with it.
Our goal of real-world reliability drives our decision to run ephemeral, encrypted, isolated, elastic browsers in the cloud.
Our agents use five context sources to solve tasks: the raw page (DOM), enrichments from our specialized page-parsing tool, screenshots, console information, and network requests. Ablation studies showed the first three modalities contribute roughly equally.
Both textual and visual modalities are essential. From the DOM, we extract semantic structures and interaction cues, and we use it to resolve visual ambiguities. Screenshots help the agent decode the layout and positioning of elements on the page. Network requests, console logs, and other sources complete the picture and help handle edge cases.
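One way to picture the five context sources is as a single per-step observation record. The sketch below is purely illustrative: the class, field names, and section labels are hypothetical and do not describe Silverstream's actual data model. It only shows how textual sources might be joined into one context while the screenshot travels separately.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One snapshot of the page as seen by the agent (hypothetical field names)."""

    dom: str                                                    # raw page markup
    enrichments: dict[str, str] = field(default_factory=dict)   # page-parsing tool output
    screenshot_png: bytes = b""                                 # rendered page image
    console_logs: list[str] = field(default_factory=list)
    network_requests: list[str] = field(default_factory=list)

    def text_context(self, max_dom_chars: int = 2000) -> str:
        """Join the textual sources into labeled sections; empty sources are skipped."""
        parts = [f"[DOM]\n{self.dom[:max_dom_chars]}"]
        if self.enrichments:
            lines = "\n".join(f"{k}: {v}" for k, v in self.enrichments.items())
            parts.append(f"[ENRICHMENTS]\n{lines}")
        if self.console_logs:
            parts.append("[CONSOLE]\n" + "\n".join(self.console_logs))
        if self.network_requests:
            parts.append("[NETWORK]\n" + "\n".join(self.network_requests))
        return "\n\n".join(parts)


if __name__ == "__main__":
    obs = Observation(
        dom="<button id='submit'>Submit</button>",
        enrichments={"submit": "clickable, primary action"},
        console_logs=["TypeError: x is undefined"],
    )
    print(obs.text_context())
```

Keeping each source in its own field makes it straightforward to drop one at a time, which is the kind of setup an ablation study over modalities would need.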
We have run autonomous agents across the open internet for two years now.
Pasta-1T
The result: low-cost, unsupervised exploration yielding high-value datasets, while the infrastructure hardens against millions of real-world edge cases.
Consistent measurement is essential for users to trust an independent agent. At Silverstream, every new release is evaluated on 2,630 practical business scenarios, ranging from early-2000s ERP software to modern logistics portals.
Our agents self-heal, which is a desirable feature, but we still need to know exactly when something has gone wrong. We currently run 372 health checks across the agent loop, with a human fallback for selected customers.
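As a rough illustration of health checks with a human fallback, the sketch below runs a small registry of named checks against the state of one agent step and decides whether to continue, escalate, or abort. Everything here is an assumption made for the example: the state fields, the three checks, the retry threshold, and the action names are hypothetical and are not taken from Silverstream's system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepState:
    """Minimal state of one agent step (hypothetical fields)."""

    url: str
    dom_chars: int
    consecutive_retries: int


HealthCheck = Callable[[StepState], bool]

# Each check returns True when the step looks healthy.
CHECKS: dict[str, HealthCheck] = {
    "page_loaded": lambda s: s.dom_chars > 0,
    "not_stuck_retrying": lambda s: s.consecutive_retries < 3,
    "secure_origin": lambda s: s.url.startswith("https://"),
}


def failed_checks(state: StepState) -> list[str]:
    """Return the names of all checks that fail, in registry order."""
    return [name for name, check in CHECKS.items() if not check(state)]


def next_action(state: StepState, human_fallback: bool) -> str:
    """Continue when healthy; otherwise escalate if a human fallback is enabled."""
    if not failed_checks(state):
        return "continue"
    return "escalate_to_human" if human_fallback else "abort"


if __name__ == "__main__":
    stuck = StepState(url="https://example.com/orders", dom_chars=0, consecutive_retries=5)
    print(failed_checks(stuck), next_action(stuck, human_fallback=True))
```

Naming each check keeps failures explicit and easy to log, even when the agent would otherwise recover on its own.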
At a 94% task success rate on common SaaS software, our agents have crossed the reliability threshold for unattended production use. The first milestone has now been unlocked, and our next target is 99.7% accuracy, unlocking full workday automation without manual oversight.