
AI Agent Monitoring: How We Know Before You Do When Something Breaks

8 min read

Real-world AI agent monitoring and alerting strategies that catch failures before they impact your business. Practical n8n examples included.

The 3am Problem Nobody Talks About

Your AI agent processed 847 customer enquiries yesterday. Today it's processed 12. The agent hasn't crashed—it's running perfectly. The API you're calling changed their rate limits, and now 98% of your requests are silently failing.

Your client finds out at 9am when their team arrives to a flooded inbox.

This happened to us in January 2025. Once. We built monitoring systems that same week. Haven't had a silent failure since.

What Actually Breaks (And Why It Matters)

AI agents fail differently from traditional software. A crashed server is obvious. An AI agent that's running but producing garbage output? That's invisible until someone reads the results.

We monitor 127 production AI agents across 34 clients. Here's what actually breaks:

API failures: 43% of incidents. Rate limits, authentication tokens expiring, endpoint changes.

Data quality issues: 29% of incidents. Input formats changing, missing fields, encoding problems.

LLM response degradation: 18% of incidents. The model returns valid JSON but useless content.

Timeout cascades: 10% of incidents. One slow component makes everything downstream fail.

The median time to detect these without monitoring? 4.7 hours. With proper monitoring? 3 minutes.

The Five Monitoring Layers That Actually Work

Layer 1: Execution Monitoring

Track whether your workflows run at all. Sounds basic. Most people skip it.

In n8n, we add a monitoring node to every single workflow. Here's the structure:

Every workflow gets a final HTTP Request node that pings our monitoring endpoint. If we don't receive that ping within the expected timeframe, something's wrong.

For a workflow that runs every 15 minutes, we expect 96 pings per day. If we receive fewer than 92, we alert. That's our 4% failure tolerance.

The monitoring endpoint is a separate n8n workflow that:

  • Receives the ping with workflow ID and timestamp
  • Stores it in a PostgreSQL table
  • Runs every 30 minutes to check for missing pings
  • Sends Slack alerts for any workflow that hasn't pinged in 2x its normal interval
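The missing-ping check in the last step can be sketched as a plain function. The 2x-interval rule comes from above; the record shape (`workflowId`, `intervalMinutes`, `lastPingAt`) is an assumption about how you'd store pings in PostgreSQL:

```javascript
// Flag any workflow whose most recent ping is older than twice its
// normal run interval. Record shape is illustrative.
function findSilentWorkflows(lastPings, now) {
  // lastPings: [{ workflowId, intervalMinutes, lastPingAt (ms epoch) }]
  return lastPings
    .filter(p => now - p.lastPingAt > 2 * p.intervalMinutes * 60 * 1000)
    .map(p => p.workflowId);
}
```

In our setup this logic lives in the 30-minute checker workflow, running against the latest ping per workflow read from the PostgreSQL table.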

Cost to implement: £0 if you're already using n8n and PostgreSQL. Takes 45 minutes to set up the first time.

Layer 2: Volume Monitoring

Execution monitoring tells you the workflow ran. Volume monitoring tells you if it's actually doing anything useful.

We track three metrics for every workflow:

  • Items processed
  • Items successfully completed
  • Items that errored

A customer service agent that normally processes 200 tickets per day but only processes 15? That's a problem even if the workflow "succeeded."

In n8n, add a Function node before your final monitoring ping:

const items = $input.all();
const itemCount = items.length;
const successCount = items.filter(item => item.json.status === 'success').length;
const errorCount = itemCount - successCount;

// n8n Function/Code nodes must return an array of items
return [{
  json: {
    workflow_id: 'customer_service_agent',
    timestamp: new Date().toISOString(),
    items_processed: itemCount,
    items_success: successCount,
    items_error: errorCount
  }
}];

Our monitoring endpoint calculates a 7-day rolling average. If today's volume drops below 60% of that average, we alert.
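The rolling-average comparison is simple enough to sketch directly. The 60% threshold is the one stated above; the function name and inputs are illustrative:

```javascript
// Compare today's item count against a rolling average of the previous
// days and flag drops below 60% of that baseline.
function volumeAlert(dailyCounts, todayCount, threshold = 0.6) {
  // dailyCounts: item counts for the previous 7 days
  const avg = dailyCounts.reduce((sum, n) => sum + n, 0) / dailyCounts.length;
  return todayCount < threshold * avg;
}
```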

This caught a client's lead qualification agent that was running perfectly but receiving zero leads because their website form had broken. Detection time: 47 minutes instead of discovering it at their weekly review.

Layer 3: Quality Monitoring

The hardest layer. Also the most important.

An AI agent can execute successfully, process the right volume, and still produce terrible output. We learned this when a document analysis agent started hallucinating client names after GPT-4 had a bad day in March 2025.

Quality monitoring requires you to define what "good" looks like for your specific use case. For different agent types, we monitor:

Classification agents: Track the distribution of categories. If a sentiment analyser suddenly marks 94% of messages as negative when it normally marks 35%, something's wrong.

Extraction agents: Track field completion rates. If a CV parser normally extracts phone numbers from 78% of CVs but suddenly only finds them in 23%, the format might have changed.

Generation agents: Track output length and structure. If a summarisation agent normally produces 150-word summaries but starts producing 12-word fragments, investigate.

Decision agents: Track the decision distribution. If an approval agent that normally approves 62% of requests suddenly approves 99%, check your logic.

Implementation in n8n:

After your AI node, add a Function node that validates the output:

const output = $json.ai_response;

// Define your quality checks
const checks = {
  has_required_fields: Boolean(output.name && output.email && output.category),
  output_length_ok: (output.summary || '').length > 50 && (output.summary || '').length < 500,
  confidence_ok: output.confidence > 0.7,
  valid_category: ['sales', 'support', 'billing'].includes(output.category)
};

const quality_score = Object.values(checks).filter(Boolean).length / Object.keys(checks).length;

// n8n Function/Code nodes must return an array of items
return [{
  json: {
    ...output,
    quality_score,
    quality_checks: checks
  }
}];

Track these quality scores over time. If the median quality score drops below 0.8 for more than 10 consecutive executions, alert.
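The median-over-a-window rule above can be sketched as follows (a minimal version, assuming scores arrive as a flat array, oldest first):

```javascript
// Trigger when the median quality score of the last `window` executions
// falls below `floor` (0.8 over 10 runs, per the rule above).
function qualityAlert(recentScores, floor = 0.8, window = 10) {
  if (recentScores.length < window) return false; // not enough data yet
  const last = recentScores.slice(-window).sort((a, b) => a - b);
  const mid = window / 2;
  const median = window % 2 === 0
    ? (last[mid - 1] + last[mid]) / 2
    : last[Math.floor(mid)];
  return median < floor;
}
```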

Layer 4: Cost Monitoring

AI agents can fail by succeeding too well. A runaway loop making OpenAI API calls can cost you £847 before breakfast.

We set spending thresholds:

  • Hourly limits per workflow
  • Daily limits per client
  • Weekly budget alerts

In n8n, track API costs in the same monitoring ping. For OpenAI calls:

const model = 'gpt-4';
const inputTokens = $json.usage.prompt_tokens;
const outputTokens = $json.usage.completion_tokens;

// GPT-4 pricing as of March 2026
const inputCostPer1k = 0.024; // £0.024 per 1k input tokens
const outputCostPer1k = 0.072; // £0.072 per 1k output tokens

const cost = (inputTokens / 1000 * inputCostPer1k) + (outputTokens / 1000 * outputCostPer1k);

// n8n Function/Code nodes must return an array of items
return [{
  json: {
    ...$json,
    model,
    cost_gbp: cost
  }
}];

Sum these costs in your monitoring database. Alert if:

  • Any workflow exceeds £50/hour
  • Any client exceeds their daily budget
  • Total weekly spend is trending to exceed monthly budget by over 20%
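The first two alert conditions can be sketched as a single aggregation pass. The £50/hour limit is from above; the record shape and `dailyBudgets` map are assumptions about how you'd store spend in the monitoring database:

```javascript
// Sum per-workflow hourly spend and per-client daily spend, and return
// the alerts to raise. Record shape is illustrative.
function costAlerts(records, hourlyLimit = 50, dailyBudgets = {}) {
  // records: [{ workflowId, clientId, costGbp, hoursAgo }]
  const alerts = [];
  const hourly = {};
  const daily = {};
  for (const r of records) {
    if (r.hoursAgo < 1) hourly[r.workflowId] = (hourly[r.workflowId] || 0) + r.costGbp;
    if (r.hoursAgo < 24) daily[r.clientId] = (daily[r.clientId] || 0) + r.costGbp;
  }
  for (const [wf, spend] of Object.entries(hourly)) {
    if (spend > hourlyLimit) alerts.push(`workflow ${wf} exceeded £${hourlyLimit}/hour`);
  }
  for (const [client, spend] of Object.entries(daily)) {
    if (dailyBudgets[client] !== undefined && spend > dailyBudgets[client]) {
      alerts.push(`client ${client} exceeded daily budget`);
    }
  }
  return alerts;
}
```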

One client's data enrichment agent had a bug that caused it to re-process the same 1,000 records in a loop. Our cost monitoring caught it at £127 spent. Without monitoring? It would have run all weekend. Estimated cost: £8,400.

Layer 5: Dependency Monitoring

Your AI agent depends on external services. Those services break.

We monitor:

  • API endpoint health
  • Authentication token validity
  • Database connections
  • File storage access

Run a dedicated health check workflow every 10 minutes that:

  • Tests each critical API endpoint with a lightweight request
  • Verifies database queries complete in under 2 seconds
  • Confirms file storage is accessible
  • Checks authentication tokens haven't expired
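Evaluating those check results can be sketched as one filter over the health-check output. The 2-second query budget and token-expiry rule are from the list above; the result shape is an assumption:

```javascript
// Turn raw health-check results into a list of failing dependencies:
// a check fails if it errored, exceeded the 2s latency budget, or its
// auth token has expired. Result shape is illustrative.
function failingChecks(results, now) {
  // results: [{ name, ok, latencyMs?, tokenExpiresAt? (ms epoch) }]
  return results
    .filter(r =>
      !r.ok ||
      (r.latencyMs !== undefined && r.latencyMs > 2000) ||
      (r.tokenExpiresAt !== undefined && r.tokenExpiresAt <= now))
    .map(r => r.name);
}
```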

This proactive monitoring catches problems before your production workflows hit them.

One client integrates with 7 different data sources. We caught 3 outages in February 2026 before their production agents were affected. Each time, we either switched to a backup data source or paused the affected agents until the service recovered.

The Alerting Strategy That Doesn't Create Alert Fatigue

Monitoring without good alerting is just logging. But too many alerts and your team starts ignoring them.

Our alerting tiers:

Critical (immediate phone call): Production workflow stopped, cost threshold exceeded, data loss detected. We wake people up for these. Average per month: 2.3 alerts.

High (Slack with @channel): Volume dropped over 70%, quality score under 0.6 for over 1 hour, dependency down. Requires action within 30 minutes. Average per month: 8.7 alerts.

Medium (Slack without @channel): Volume dropped 40-70%, quality score 0.6-0.8, elevated error rates. Review within 4 hours. Average per month: 23.4 alerts.

Low (daily digest): Minor anomalies, successful auto-recoveries, trending issues. Review during daily standup. Average per day: 12.1 alerts.
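The routing logic in the Switch node follows directly from these tiers. A simplified sketch (field names are assumptions, and it omits duration conditions like "under 0.6 for over 1 hour"):

```javascript
// Map a metric snapshot to an alert tier using the thresholds above.
function alertTier(m) {
  // m: { stopped, costExceeded, dataLoss, volumeDropPct, qualityScore }
  if (m.stopped || m.costExceeded || m.dataLoss) return 'critical';
  if (m.volumeDropPct > 70 || m.qualityScore < 0.6) return 'high';
  if (m.volumeDropPct >= 40 || m.qualityScore < 0.8) return 'medium';
  return 'low';
}
```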

The key is tuning your thresholds based on actual data. We spent 6 weeks in mid-2025 collecting baseline metrics before setting any alerts. Worth it. Our false positive rate is under 5%.

The Monitoring Workflow Template

Here's the basic structure we use for every monitoring endpoint in n8n:

  1. Webhook node: Receives monitoring data from production workflows
  2. PostgreSQL node: Stores the raw monitoring data with timestamp
  3. Function node: Calculates current metrics and compares to baselines
  4. IF node: Checks if any thresholds are breached
  5. Switch node: Routes to appropriate alerting channel based on severity
  6. Slack/Email/Phone nodes: Sends alerts with context and recommended actions
  7. PostgreSQL node: Logs the alert to prevent duplicate notifications

The entire monitoring workflow runs in under 800ms. It needs to be fast—if monitoring is slow, it becomes the bottleneck.
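Step 7's duplicate suppression is worth sketching, since it's the part people forget. A minimal version, assuming you log sent alerts with a workflow ID, severity, and timestamp (the cooldown length is illustrative):

```javascript
// Skip an alert if the same workflow/severity pair was already alerted
// within a cooldown window. Log record shape is illustrative.
function shouldSend(alertLog, alert, now, cooldownMinutes = 60) {
  // alertLog: [{ workflowId, severity, sentAt (ms epoch) }]
  return !alertLog.some(a =>
    a.workflowId === alert.workflowId &&
    a.severity === alert.severity &&
    now - a.sentAt < cooldownMinutes * 60 * 1000);
}
```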

What Happens When You Don't Monitor

February 2026: A prospect came to us after their AI agent "stopped working." It had actually been failing for 11 days. The agent was supposed to process supplier invoices and flag discrepancies.

Over those 11 days:

  • 847 invoices went unprocessed
  • £34,000 in duplicate payments went unnoticed
  • 23 supplier relationships were damaged by payment delays
  • Their finance team spent 67 hours manually fixing the backlog

The agent failure? A supplier changed their invoice PDF format. The extraction logic needed a 15-minute update.

Total cost of no monitoring: £51,000 (duplicate payments + recovery time at £80/hour).

Cost to implement monitoring: £0 in tools, 4 hours of setup time.

Start Monitoring Today

If you're running AI agents without monitoring, you're flying blind. The question isn't whether they'll fail—it's whether you'll know when they do.

Start with execution monitoring today. Add volume monitoring tomorrow. Build up to quality monitoring over the next fortnight.

We've built monitoring systems for 34 clients across industries from legal to e-commerce. The patterns are consistent. The failures are predictable. The solutions are proven.

Need help building a monitoring system that actually works? We'll analyse your current AI agents, identify the failure points, and implement monitoring that catches problems before they cost you money.

Start the conversation at /start-scaling

Your agents are running right now. Are you sure they're working?
