Now in private beta for design partners

The eval layer for
agentic workflows.

devtail observes how your teams actually get work done — across apps, terminals, and browsers — and turns those workflows into living runbooks and testable evals for AI agents.

SOC 2 in progress · Self-hosted or managed · macOS, Linux, Windows
From keystrokes to runbooks and evals

Your team, on their devices
    • Mira, Engineering: ship a hotfix · review PR
    • Alex, Operations: triage a Stripe dispute
    • Priya, Sales: run an enterprise demo

devtail
Captures activity, clusters into workflows, grades against your evals.

What you get back
    • Runbooks (output for humans): living docs your team owns & edits.
    • Evals (output for AI agents): tests that update with the workflow.

Captured on device → Workflow modeled → Graded against your agents

Built for teams that ship agents into production

Engineering · Operations · Sales · Infrastructure · Finance & diligence

The problem

Most agent failures aren't model failures. They're context failures — the workflows your team relies on were never written down, so they were never tested.

devtail closes the gap between what your team actually does and what your agents are evaluated against.

How it works

From keystrokes to a tested workflow,
in three steps.

  1. 01 capture

    Capture how work actually flows.

    A lightweight agent runs on your team's devices, recording the shape of work: apps opened, windows focused, commands run, URLs visited. Content is redacted at source.

    • macOS, Linux, Windows
    • Browser & terminal coverage
    • Local-first, encrypted at rest
  2. 02 model

    Model the workflows that recur.

    devtail clusters recurring sequences into named workflows. Each becomes a versioned runbook your team can review, edit, and approve — the source of truth your docs were always meant to be.

    • Auto-named, human-edited
    • Versioned, diff-able
    • Ownership & review built in
  3. 03 evaluate

    Test your agents against the real thing.

    Each workflow becomes an eval your agents are graded against. When the workflow changes, the eval updates. Plug into your CI to gate rollouts, or run on a schedule to detect regressions.

    • CI & cron integrations
    • Pass/fail with rich traces
    • Compare agents, prompts, models
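
The three steps above can be pictured with a small, purely illustrative Python sketch. It is not devtail's implementation or API: the event labels, the function names (`recurring_sequences`, `grade`, `gate`), the fixed sequence length, and the 90% threshold are all assumptions made for this example. It counts recurring runs of already-redacted events, treats one as a workflow, checks whether an agent's trace follows the same steps in order, and turns the pass rate into a CI-style gate.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def recurring_sequences(
    sessions: Sequence[Sequence[str]], length: int = 3, min_count: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Count every run of `length` consecutive events across sessions and
    keep the runs seen at least `min_count` times (hypothetical 'modeling')."""
    counts: Counter = Counter()
    for events in sessions:
        for i in range(len(events) - length + 1):
            counts[tuple(events[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}


def grade(agent_steps: Sequence[str], workflow: Sequence[str]) -> bool:
    """Pass if every workflow step appears in the agent's trace, in order.
    Extra steps in between are allowed (a simple subsequence check)."""
    remaining = iter(agent_steps)
    return all(step in remaining for step in workflow)


def gate(results: List[bool], threshold: float = 0.9) -> bool:
    """Allow a rollout only if the pass rate meets the (assumed) threshold."""
    return bool(results) and sum(results) / len(results) >= threshold


if __name__ == "__main__":
    # Hypothetical redacted events: the shape of work, not its content.
    sessions = [
        ["open:github", "focus:terminal", "run:deploy", "open:dashboard"],
        ["open:slack", "open:github", "focus:terminal", "run:deploy"],
    ]
    workflows = recurring_sequences(sessions)
    workflow = next(iter(workflows))  # the one recurring three-step run

    agent_trace = ["open:github", "open:docs", "focus:terminal", "run:deploy"]
    results = [grade(agent_trace, workflow)]
    print("workflow:", workflow)
    print("rollout allowed:", gate(results))
```

In a CI job, a script like this would typically exit non-zero when `gate` returns `False`, which is one common way to block a rollout on a failed check.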

Who it's for

Teams whose best documentation lives in someone's head.

Knowledge workers across the org. Wherever the way work gets done lives in a senior IC's muscle memory, or has scrolled off the bottom of a Slack channel two months ago.

Engineering

Grade coding agents against your real review, deploy, and on-call flows — not a synthetic benchmark.

Infrastructure

Capture the runbooks that live in your on-call engineers' heads, before those engineers leave the team.

Operations

See the workflows your tooling never modeled, then automate the parts that are worth automating.

Sales

Quantify how your top reps prep, follow up, and close — and teach the agent to do the same.

Finance & diligence

For analysis-heavy teams: capture the patterns behind a thesis, the way an analyst actually builds one.

+ your team

If knowledge lives in your people, devtail probably applies. We'd like to hear about it.

Security & privacy

Your team's behavior is your data. We treat it that way.

devtail runs on your devices and stores observations in your infrastructure. We capture the shape of work — the apps, windows, and command boundaries — never what was typed inside them.

On-device redaction

Screen content, keystroke contents, and clipboard data never leave the device.

Self-hosted data plane

Run the storage layer in your own VPC. We never touch raw events.

Audited & auditable

Per-user pause, export, and delete. Full audit trail of every read.

SOC 2 & SSO

SOC 2 Type II in progress. SAML SSO, SCIM provisioning, and granular roles.

Get started

Bring your team's real workflows into your eval suite.

We're onboarding a small number of design partners. If your team is shipping agents into production — or thinking about it — we'd like to talk.

Or reach us directly at hello@devtail.dev