Evaluation Framework

Quality gates are infrastructure, not afterthoughts.

Why Evaluation Matters

Go beyond observability to pre-deployment quality gates.

Problem Statement

Most automation platforms give you observability—logs, traces, dashboards. But observability tells you what happened after failure. Evaluation tells you what will fail before deployment.

Traditional Observability

  • Logs tell you execution happened
  • Traces show what was called
  • Dashboards visualize metrics
  • Alerts fire after incidents
  • Manual review before deployment
  • → Reactive, post-incident debugging

Evaluation-as-Infrastructure

  • Scoring validates output quality
  • Regression detection catches degradation
  • Flaky analysis identifies non-determinism
  • Policy checks enforce governance
  • Automated gates block bad deployments
  • → Proactive, pre-deployment validation

LangSmith gives you observability. Bot Velocity gives you quality gates.

Six-Dimensional Quality Framework

Each dimension scores independently and produces actionable insights.

Output Quality

Did it produce the right answer?

  • Exact match, fuzzy, and semantic scoring
  • Structured JSON diff for deterministic outputs
  • LLM-as-judge with rubric-based scoring
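
To make these scoring modes concrete, the sketch below shows how exact-match, fuzzy, and structured JSON-diff scoring could be implemented in Python. It is illustrative only: the function names and scoring rules are assumptions, not Bot Velocity's API, and the semantic and LLM-as-judge modes are omitted because they require a model call.

```python
import json
from difflib import SequenceMatcher

def exact_score(expected: str, actual: str) -> float:
    """1.0 only when the output matches the expectation exactly."""
    return 1.0 if expected == actual else 0.0

def fuzzy_score(expected: str, actual: str) -> float:
    """Character-level similarity ratio in [0, 1] for near-miss text."""
    return SequenceMatcher(None, expected, actual).ratio()

def json_diff_score(expected: str, actual: str) -> float:
    """Fraction of expected top-level fields reproduced exactly
    (a simple stand-in for a structured JSON diff)."""
    try:
        exp, act = json.loads(expected), json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    if not exp:
        return 1.0
    return sum(1 for k, v in exp.items() if act.get(k) == v) / len(exp)

# Example: a deterministic output that is close but not identical.
print(exact_score("paid", "paid"))                          # 1.0
print(fuzzy_score("Invoice approved", "invoice approved"))  # ~0.94
print(json_diff_score('{"vendor": "ACME", "total": 100}',
                      '{"vendor": "ACME", "total": 90}'))   # 0.5
```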

Quality Gates in Your Pipeline

Evaluation runs in CI/CD with status endpoints and gates.

1. Developer publishes package

  • Developer publishes the package to the orchestrator.

2. Orchestrator receives package

  • Auto-registers in package registry
  • Auto-creates FolderTool entry

3. Trigger evaluation run

  • Manual run against a dataset
  • Auto-trigger on publish (roadmap)

4. Execute test cases

  • One job per dataset item
  • Collect results and metrics

5. Evaluate results

  • Run all 6 dimensions
  • Compute aggregated score
  • Generate report

6. CI Quality Gates (a minimal check of these criteria is sketched after step 8)

  • Minimum passing score required before promotion
  • Zero regression tolerance for approved baselines
  • No high-severity policy violations permitted
  • Flaky behavior monitored and capped within defined thresholds

7. If gates pass

  • ✅ Deployment approved
  • Status badge updated
  • Trigger downstream deploy (roadmap)

8. If gates fail

  • ❌ Deployment blocked
  • Notify team (Slack/email)
  • Block promotion to production
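
The gate decision in steps 6–8 can be pictured as a single check over the evaluation report. The sketch below assumes hypothetical field names and default thresholds; it is not the actual report schema.

```python
from dataclasses import dataclass

@dataclass
class EvaluationReport:
    # Hypothetical fields summarizing an evaluation run.
    aggregated_score: float          # 0.0 - 1.0 across all dimensions
    regressions: int                 # cases that passed on the baseline but fail now
    high_severity_violations: int    # policy violations classified as high severity
    flaky_rate: float                # share of cases with non-deterministic results

def gates_pass(report: EvaluationReport,
               min_score: float = 0.85,
               max_flaky_rate: float = 0.05) -> bool:
    """Apply the four CI gate criteria from step 6."""
    return (
        report.aggregated_score >= min_score        # minimum passing score
        and report.regressions == 0                 # zero regression tolerance
        and report.high_severity_violations == 0    # no high-severity violations
        and report.flaky_rate <= max_flaky_rate     # flakiness within threshold
    )

report = EvaluationReport(aggregated_score=0.91, regressions=0,
                          high_severity_violations=0, flaky_rate=0.02)
print("approved" if gates_pass(report) else "blocked")  # -> approved
```

When the check fails, the pipeline blocks promotion and notifies the team, as described in step 8.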

Bot Velocity vs. Traditional Automation & Agent Platforms

Built-in evaluation infrastructure compared to reactive monitoring and manual testing approaches.

Evaluation Built-In. Not Bolted On.

Most automation and agent platforms focus on execution and observability. They help teams run workflows and monitor results after deployment.

Bot Velocity takes a different approach.

Evaluation is embedded directly into orchestration — not layered on afterward.

Traditional RPA Platforms

RPA platforms specialize in deterministic automation. While execution is reliable, quality validation is typically external:

  • Testing is environment-specific or manual
  • Regression tracking requires additional tooling
  • CI integration is limited or custom
  • Policy enforcement is handled separately

Automation executes — but governance is not inherent.

Agentic Frameworks

Agent frameworks prioritize flexibility and multi-step reasoning. They enable experimentation and rapid iteration.

However:

  • Evaluation is optional
  • Regression detection is manual
  • Output stability is not enforced
  • Policy controls require custom implementation
  • CI gating is rarely native

These systems are powerful — but production governance depends on additional engineering effort.

Observability Tools

Observability platforms provide trace visibility and performance monitoring.

They answer: “What happened?”

They do not answer: “Should this be deployed?”

Monitoring is reactive. Evaluation must be proactive.

Custom Testing Approaches

Many teams build internal evaluation pipelines:

  • Custom scoring scripts
  • Separate regression tracking
  • Manual CI wiring
  • Isolated policy checks

This increases operational complexity and long-term maintenance burden.

The Bot Velocity Approach

Bot Velocity unifies:

  • Deterministic orchestration
  • Agent execution
  • Structured evaluation
  • Regression detection
  • Stability analysis
  • Policy enforcement
  • CI gating

Evaluation is part of the control plane.

Deployment decisions are based on structured quality signals — not post-release diagnostics.

Observability reacts to failure.

Evaluation prevents instability from reaching production.

Bot Velocity treats evaluation as infrastructure.

Start Evaluating

Create datasets, run evaluations, and enforce CI gates.

1. Create a Test Dataset

Define structured evaluation datasets that include representative inputs and expected outcomes. Organize them by folder or domain to mirror production workflows.
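
For illustration, a small dataset could be expressed as a list of input/expected pairs like the one below; the field names and the invoice scenario are hypothetical, not a documented schema.

```python
# A hypothetical dataset for an invoice-extraction workflow. Each item pairs
# a representative input with the outcome the workflow should produce.
invoice_dataset = [
    {
        "id": "invoice-001",
        "input": {"document": "invoices/acme-2024-03.pdf"},
        "expected": {"vendor": "ACME Corp", "total": 1250.00, "currency": "USD"},
    },
    {
        "id": "invoice-002",
        "input": {"document": "invoices/globex-2024-03.pdf"},
        "expected": {"vendor": "Globex", "total": 980.50, "currency": "EUR"},
    },
]
```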

2. Execute an Evaluation Run

Run the selected workflow version against the dataset. Each execution is measured across defined quality dimensions, including accuracy, policy adherence, performance, and stability.
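
In code, a run amounts to executing every dataset item against the workflow version under test and scoring the results. The sketch below uses a placeholder `run_workflow` function and a single output-quality dimension; a real run covers all six.

```python
def run_workflow(version: str, payload: dict) -> dict:
    """Placeholder: invoke the workflow version under test and return its output."""
    raise NotImplementedError("wire this to your orchestrator")

def evaluate_run(version: str, dataset: list[dict]) -> dict:
    """Execute each dataset item and aggregate a simple output-quality score."""
    results = []
    for item in dataset:
        actual = run_workflow(version, item["input"])
        results.append({"id": item["id"], "passed": actual == item["expected"]})
    score = sum(r["passed"] for r in results) / len(results)
    return {"aggregated_score": score, "results": results}
```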

3. Review Results

Analyze results through the Orchestrator interface. Review overall pass scores, dimension-specific insights, regression indicators, and CI status to determine readiness for promotion.

4. Enforce Quality Gates

Establish release criteria that define acceptable quality thresholds. Prevent regressions, restrict policy violations, and monitor output stability before allowing deployment.
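
Release criteria are easiest to reason about as explicit thresholds checked before promotion. The values below are assumptions for illustration; tune them to your own risk tolerance.

```python
# Hypothetical release criteria consumed by a gate check before promotion.
RELEASE_CRITERIA = {
    "min_aggregated_score": 0.85,        # minimum overall quality score
    "max_regressions": 0,                # zero tolerance against the approved baseline
    "max_high_severity_violations": 0,   # no high-severity policy violations
    "max_flaky_rate": 0.05,              # cap on non-deterministic behavior
}

def meets_release_criteria(report: dict, criteria: dict = RELEASE_CRITERIA) -> bool:
    """Return True only when every threshold is satisfied."""
    return (
        report["aggregated_score"] >= criteria["min_aggregated_score"]
        and report["regressions"] <= criteria["max_regressions"]
        and report["high_severity_violations"] <= criteria["max_high_severity_violations"]
        and report["flaky_rate"] <= criteria["max_flaky_rate"]
    )
```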

5. Integrate with CI/CD Pipelines

Connect evaluation status to your CI/CD process. Automatically trigger evaluations on new releases and block deployment when quality gates are not satisfied.
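
A CI step can then gate on the evaluation status. The sketch below polls a hypothetical status endpoint and exits non-zero when the gates have not passed; the URL and response fields are assumptions, not a documented Bot Velocity endpoint.

```python
# Minimal CI gate step: query the evaluation status and fail the build if the
# quality gates did not pass. URL and response shape are hypothetical.
import json
import sys
import urllib.request

STATUS_URL = "https://orchestrator.example.com/api/evaluations/latest/status"

with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
    status = json.load(resp)

if status.get("gates") != "passed":
    print(f"Quality gates failed: {status.get('reason', 'unknown reason')}")
    sys.exit(1)  # non-zero exit blocks promotion in the pipeline

print("Quality gates passed - deployment can proceed")
```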

Evaluate Your Automation

Put quality gates in front of every deployment.