
n8n + Local LLMs: The Stack Behind Our Private AI Agents

11 min read

Deploy private AI agents with n8n and local LLMs. Cut costs by 80%, keep data secure, and automate workflows without cloud dependencies.

Why We Ditched Cloud AI APIs for Local LLMs

We were spending £3,400 monthly on OpenAI API calls for our client automation workflows. Response times averaged 2.3 seconds. And every single data point was leaving our infrastructure.

That model broke when a client in financial services asked: "Where exactly does our data go when your AI processes it?"

We couldn't give them the answer they needed.

So we rebuilt our entire AI agent stack around n8n and locally-hosted LLMs. Our monthly AI costs dropped to £612. Average response time fell to 0.8 seconds. And client data never touches a third-party server.

Here's the complete technical breakdown of how we did it.

The Core Stack: n8n + Ollama + PostgreSQL

Our private AI agent infrastructure has three main components:

n8n for orchestration - Self-hosted workflow automation that connects everything. We run version 1.28.0 on a dedicated Ubuntu 22.04 LTS server with 16GB RAM and 8 CPU cores.

Ollama for LLM hosting - Local LLM runtime that serves models via API. Runs on a separate machine with an NVIDIA RTX 4090 (24GB VRAM) and 64GB system RAM.

PostgreSQL for memory and context - Stores conversation history, agent states, and vector embeddings. Running version 15.3 with pgvector extension for semantic search.

The beauty of this stack is simplicity. Three components. No vendor lock-in. Complete data control.

Setting Up n8n for Local LLM Integration

Installing n8n for AI agent work requires specific configuration. Standard installations don't expose the settings you need for local LLM connections.

We run n8n via Docker Compose with these critical environment variables:

  • EXECUTIONS_DATA_SAVE_ON_SUCCESS=all (saves every execution for debugging)
  • EXECUTIONS_DATA_MAX_AGE=336 (keeps 14 days of execution history)
  • N8N_PAYLOAD_SIZE_MAX=256 (raises the maximum payload size to 256 MB to allow larger LLM responses)
  • N8N_METRICS=true (enables Prometheus monitoring)
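A minimal Docker Compose sketch with those variables might look like the following (the image tag matches the version we run; the port mapping and volume name are illustrative, not our exact configuration):

```yaml
services:
  n8n:
    image: n8nio/n8n:1.28.0
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - EXECUTIONS_DATA_SAVE_ON_SUCCESS=all
      - EXECUTIONS_DATA_MAX_AGE=336
      - N8N_PAYLOAD_SIZE_MAX=256
      - N8N_METRICS=true
    volumes:
      - n8n_data:/home/node/.n8n

volumes:
  n8n_data:
```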

Our n8n instance handles 847 AI agent executions daily across 23 different automation workflows. Peak concurrent execution: 34 workflows running simultaneously at 09:15 GMT on weekdays.

The server barely breaks a sweat. CPU usage peaks at 43% during high-load periods.

Which Local LLMs Actually Work for Production

We tested 16 different open-source LLMs over 89 days. Most were unusable for real client work.

The models that made our production cut:

Llama 3.1 70B - Our primary model for complex reasoning tasks. Handles multi-step workflows, document analysis, and structured data extraction. Needs around 48GB of memory at 4-bit quantisation, which Ollama splits between the RTX 4090's 24GB VRAM and system RAM. Response quality matches GPT-4 for our use cases in 73% of tests.

Mistral 7B - Fast inference for simple classification and routing tasks. Uses only 6GB VRAM. Perfect for deciding which workflow path to take based on input data. Processes requests in 0.3 seconds average.

Mixtral 8x7B - Middle ground between speed and capability. We use it for email drafting, content summarisation, and customer support responses. 32GB memory requirement at 4-bit quantisation, again partially offloaded to system RAM.

DeepSeek Coder 33B - Specialist model for automation code generation. When clients request custom workflow modifications, this model drafts the n8n node configurations. Accuracy rate: 81% for working code on first generation.

We run all models through Ollama with custom temperature settings per use case. Document extraction: 0.1 temperature. Creative content: 0.7 temperature. Code generation: 0.3 temperature.
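Ollama's generate endpoint accepts a per-request temperature override via its options field. A minimal Python sketch of how the mapping above could be wired up (the task names and model tag are illustrative; the temperature values mirror the ones we use):

```python
# Per-use-case temperature settings, matching the values described above.
TEMPERATURES = {
    "document_extraction": 0.1,
    "creative_content": 0.7,
    "code_generation": 0.3,
}

def build_request(task: str, model: str, prompt: str) -> dict:
    """Build the JSON body for a POST to Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": TEMPERATURES[task]},
    }

payload = build_request("document_extraction", "llama3.1:70b",
                        "Extract: supplier, invoice_number, total")
```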

Real n8n Workflows Running on Local LLMs

Here are three production workflows currently processing thousands of requests monthly.

Invoice Processing Agent

This workflow monitors a client's email inbox for supplier invoices, extracts structured data, validates against purchase orders, and routes for approval.

The n8n workflow chain:

  1. Email Trigger (IMAP) monitors finance@clientdomain.com
  2. Extract Attachments node pulls PDF invoices
  3. HTTP Request node sends PDF to our document processing API
  4. Ollama node (Llama 3.1 70B) extracts: supplier name, invoice number, date, line items, totals, VAT amounts
  5. PostgreSQL node checks extracted invoice number against database for duplicates
  6. Code node calculates variance between invoice total and PO amount
  7. IF node routes: under 5% variance goes to automatic approval, over 5% goes to manual review queue
  8. Slack node notifies finance team with extracted data and approval status

Processing time per invoice: 4.7 seconds average. Accuracy rate: 96.3% (we manually checked 500 invoices). Previous manual processing: 8 minutes per invoice.

This single workflow saves the client 94 hours monthly.
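The variance gate in steps 6 and 7 reduces to a few lines of logic. A hedged Python sketch (field names are illustrative; in production this lives in an n8n Code node):

```python
def route_invoice(invoice_total: float, po_amount: float,
                  threshold: float = 0.05) -> str:
    """Route to automatic approval when the invoice deviates from the PO
    amount by less than the threshold (5% by default); otherwise send it
    to the manual review queue."""
    if po_amount == 0:
        return "manual_review"  # guard against malformed or missing POs
    variance = abs(invoice_total - po_amount) / po_amount
    return "auto_approve" if variance < threshold else "manual_review"
```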

Customer Support Email Classifier

A SaaS client receives 340 support emails daily. Their team was drowning.

The workflow:

  1. Email Trigger monitors support@clientdomain.com
  2. Ollama node (Mistral 7B) classifies email into: technical issue, billing query, feature request, bug report, general enquiry
  3. Ollama node (Mixtral 8x7B) generates initial response draft
  4. HTTP Request posts to client's ticketing system with classification tags
  5. PostgreSQL node stores email content and classification for model improvement
  6. IF node checks classification confidence score
  7. High confidence (over 85%): auto-sends response and marks ticket as handled
  8. Low confidence (under 85%): sends to human agent with AI-drafted response as suggestion

Classification accuracy: 91.7%. Auto-resolution rate: 34% of all inbound emails now get handled without human intervention.

Support team response time dropped from 4.3 hours to 47 minutes average.
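Steps 6 to 8 hinge on the classifier's confidence score. A minimal sketch of that gate (the dict shape is an assumption for illustration, not n8n's actual node output):

```python
def route_ticket(classification: dict, threshold: float = 0.85) -> dict:
    """Auto-send the drafted reply when confidence clears the threshold;
    otherwise queue the draft as a suggestion for a human agent."""
    confident = classification["confidence"] > threshold
    return {
        "action": "auto_send" if confident else "human_review",
        "category": classification["category"],
        "draft": classification["draft"],
    }
```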

Content Approval Workflow with Brand Guidelines

Marketing team submits content. AI checks against 47-page brand guidelines document. Flags violations. Suggests corrections.

The workflow structure:

  1. Webhook trigger receives content submission via web form
  2. PostgreSQL node retrieves brand guidelines (stored as vector embeddings)
  3. Ollama node (Llama 3.1 70B) performs semantic similarity search between submitted content and guidelines
  4. Ollama node analyses content for: tone violations, terminology misuse, formatting errors, prohibited phrases
  5. Code node generates detailed feedback report with line-by-line annotations
  6. IF node routes based on violation count: 0 violations = auto-approve, 1-3 violations = minor revisions needed, over 3 violations = major revisions needed
  7. HTTP Request posts to content management system with approval status
  8. Email node sends feedback report to content creator

Review time per piece: 12 seconds. Previous manual review: 35 minutes.

The model catches 23% more guideline violations than human reviewers did (we ran parallel testing for 60 days).

Cost Breakdown: Cloud APIs vs Local Stack

Our previous monthly OpenAI API costs for these three workflows alone: £2,840.

Itemised cloud API costs:

  • Invoice processing: £1,120 (GPT-4 API calls)
  • Email classification: £890 (GPT-3.5-turbo API calls)
  • Content approval: £830 (GPT-4 API calls with large context windows)

Current monthly costs for local stack:

Infrastructure:

  • n8n server (Hetzner AX41): £48/month
  • LLM server (custom build, amortised over 36 months): £127/month
  • PostgreSQL database (Hetzner CX31): £9/month
  • Backup storage (500GB): £5/month
  • Total infrastructure: £189/month

Operational:

  • Electricity for LLM server (measured): £43/month
  • Monitoring (Prometheus + Grafana Cloud): £8/month
  • SSL certificates and domain: £3/month
  • Total operational: £54/month

Personnel:

  • System maintenance (4 hours monthly at £95/hour): £380/month
  • Total personnel: £380/month

Grand total: £623/month

Cost reduction: 78.1% compared to cloud APIs.

Payback period on hardware investment: 7.3 months.
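The arithmetic behind those totals can be checked in a few lines:

```python
# All figures are the monthly costs itemised above, in pounds.
cloud_monthly = 1120 + 890 + 830            # three workflows on cloud APIs: 2,840
infrastructure = 48 + 127 + 9 + 5           # 189
operational = 43 + 8 + 3                    # 54
personnel = 380
local_monthly = infrastructure + operational + personnel  # 623

reduction_pct = (cloud_monthly - local_monthly) / cloud_monthly * 100  # ~78.1
```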

Handling Context and Memory with PostgreSQL

LLMs are stateless. They don't remember previous conversations unless you feed them the history.

Our PostgreSQL database stores three types of AI agent memory:

Short-term memory (conversation context) - Last 10 interactions per user stored in conversations table. n8n workflows query this before sending to LLM, inject relevant history into prompt. Retention: 30 days.

Long-term memory (user preferences and facts) - Extracted entities and preferences stored in user_knowledge table. "Customer prefers invoices in Excel format" or "Contact person is Sarah, not the MD". Retention: indefinite.

Semantic memory (vector embeddings) - Document chunks embedded using all-MiniLM-L6-v2 model (384 dimensions). Stored in documents_embeddings table with pgvector. Enables semantic search across knowledge base. Current database size: 2.4 million embeddings across client documents.

Our n8n workflows use the PostgreSQL Vector Store node to perform similarity searches in under 120ms average.
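Under the hood, a pgvector similarity search ranks stored embeddings by distance to the query embedding. A pure-Python sketch of the cosine distance that pgvector's `<=>` operator computes (toy low-dimension vectors here; real embeddings are 384-dimension outputs of all-MiniLM-L6-v2):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3):
    """Rough equivalent of:
       SELECT chunk FROM documents_embeddings
       ORDER BY embedding <=> $query LIMIT $k;"""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]
```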

Security Configuration for Private AI

Running your own LLM stack means data never leaves your infrastructure. But you still need proper security.

Our security layer:

Network isolation - LLM server on private VLAN, not internet-accessible. Only n8n server can communicate via internal IP. PostgreSQL only accepts connections from n8n server IP.

Encryption - All inter-service communication over TLS 1.3. Database encrypted at rest using LUKS. Backup encrypted with GPG before upload.

Access control - n8n protected by SSO via our identity provider. API endpoints require JWT tokens that expire after 1 hour. No hardcoded credentials anywhere in workflows.

Audit logging - Every LLM request logged to separate audit database with: timestamp, user ID, input (first 100 chars), output (first 100 chars), model used, processing time. Logs retained 12 months.

Data sanitisation - Code nodes strip PII before sending to LLM when not required for task. Email addresses, phone numbers, national insurance numbers automatically redacted based on regex patterns.

This configuration passed SOC 2 Type II audit for three clients who required it.
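The redaction step in the data sanitisation layer is regex-based. A simplified Python sketch (these patterns are deliberately loose and illustrative; production patterns cover many more formats):

```python
import re

# Illustrative patterns only; real PII matching needs broader coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    # UK National Insurance number, e.g. QQ123456C
    "ni_number": re.compile(r"\b[A-Z]{2}\s?\d{6}\s?[A-D]\b"),
    # Simplified UK phone number format
    "phone": re.compile(r"\b(?:\+44\s?|0)\d{4}\s?\d{6}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with labelled placeholders before the text
    is passed to the LLM."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```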

Performance Optimisation: Getting Under 1 Second Response Time

Early testing showed response times of 3.4 seconds average. Unacceptable for real-time workflows.

Optimisations that got us to 0.8 seconds:

Model quantisation - All models run at 4-bit quantisation via GGUF format. Minimal quality loss (2.1% accuracy decrease in our tests) but 4x faster inference and 75% less VRAM usage.

Prompt engineering - Reduced average prompt length from 847 tokens to 284 tokens. Shorter prompts equal faster processing. We stripped all conversational fluff: "Please analyse this document and provide a structured summary" became "Extract: supplier, invoice_number, date, total, line_items".

Response streaming - n8n processes LLM output as it generates rather than waiting for complete response. For long outputs, this cuts perceived latency by 60%.

Model preloading - Ollama keeps our three most-used models loaded in VRAM constantly. Cold start time (4.2 seconds) eliminated for 89% of requests.

Connection pooling - PostgreSQL configured with PgBouncer. n8n maintains persistent connections. Database query time dropped from 180ms to 23ms average.

Caching layer - Redis instance caches identical LLM requests for 6 hours. Cache hit rate: 17% (saves 144 LLM calls daily).
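The cache key is simply a hash of the request, so identical prompts hit the cache. A sketch with a plain dict standing in for Redis (production uses Redis with a 6-hour TTL; the key scheme here is illustrative):

```python
import hashlib
import json
import time

CACHE: dict[str, tuple[float, str]] = {}  # stand-in for Redis
TTL_SECONDS = 6 * 3600

def cache_key(model: str, prompt: str, temperature: float) -> str:
    """Identical model + prompt + settings hash to the same key."""
    raw = json.dumps({"model": model, "prompt": prompt,
                      "temperature": temperature}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_generate(model, prompt, temperature, generate_fn):
    key = cache_key(model, prompt, temperature)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: LLM call skipped
    result = generate_fn(model, prompt, temperature)
    CACHE[key] = (time.time(), result)
    return result
```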

When Local LLMs Don't Work

This stack isn't perfect for everything. Three scenarios where we still use cloud APIs:

Extremely large context windows - Processing 100-page documents in single pass. Local models max out at 32k tokens effectively. We use Claude 3.5 Sonnet API for these (costs £47/month for roughly 80 documents).

Image analysis - Current local multimodal models (LLaVA, BakLLaVA) aren't production-ready for our quality bar. We use GPT-4 Vision API for invoice scanning with tables and images (costs £23/month for approximately 200 images).

Real-time translation - Our Llama models handle English brilliantly but struggle with accurate German/French/Spanish. We use DeepL API for client communications requiring translation (costs £31/month for around 400k characters).

Total cloud API usage: £101/month. Still 97% cheaper than our previous all-cloud setup.

Monitoring and Debugging AI Workflows

You can't fix what you can't measure. Our monitoring dashboard tracks:

  • LLM requests per hour (current: 35.3 average, 89 peak)
  • Average response time by model (targets: under 1s for Mistral, under 2s for Llama, under 1.5s for Mixtral)
  • Error rate percentage (current: 1.7%, target: under 3%)
  • GPU memory usage (current: 67% average, alerts at 85%)
  • CPU usage on n8n server (current: 28% average, alerts at 70%)
  • Database query time (current: 23ms average, alerts at 100ms)
  • Workflow success rate (current: 98.3%, target: over 97%)

We use Prometheus for metrics collection and Grafana for visualisation. Alert thresholds send Slack notifications to our infrastructure channel.
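The alert logic behind those thresholds amounts to simple comparisons. A hedged sketch (metric names are ours for illustration; in production this lives in Prometheus alerting rules, not application code):

```python
# Alert thresholds from the dashboard above; metric names are illustrative.
THRESHOLDS = {
    "gpu_memory_pct": 85,
    "n8n_cpu_pct": 70,
    "db_query_ms": 100,
}

def breached(metrics: dict) -> list[str]:
    """Return the names of metrics currently over their alert threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]
```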

When workflows fail, n8n's execution history shows the exact node where the failure occurred, its input/output data, and the error message. We can replay failed executions after fixing the issue.

Average debugging time per workflow issue: 11 minutes.

Scaling the Stack for More Agents

Our current infrastructure handles 23 active AI agent workflows comfortably. To scale to 50+ workflows, here's our expansion plan:

Phase 1 (workflows 24-40) - Add second NVIDIA RTX 4090, run two Ollama instances behind Nginx load balancer. Estimated cost: £1,890 hardware + £43/month electricity. Capacity: double current throughput.

Phase 2 (workflows 41-60) - Upgrade n8n server to 32GB RAM, add Redis cluster for distributed caching, implement workflow queue prioritisation. Estimated cost: £280 server upgrade + £12/month Redis hosting.

Phase 3 (workflows 61-100) - Migrate to multi-node n8n setup with shared PostgreSQL backend, implement workflow-specific model routing. Estimated cost: £95/month additional n8n instance.

Total scaling cost to 10x our current capacity: approximately £2,225 upfront + £150/month recurring.

Cloud API costs at that scale would be approximately £28,000/month.

Build Your Own Private AI Agent Stack

The n8n local LLM combination gives you three things cloud APIs can't: complete data privacy, predictable costs, and sub-second response times.

Our 78% cost reduction and 2.9x speed improvement came from choosing the right stack and optimising relentlessly.

If you're running business processes that handle sensitive data, require fast response times, or currently burn thousands monthly on AI API calls, this architecture makes sense.

We help businesses design, build, and deploy private AI agent infrastructures that actually work in production.

Book a scaling consultation to explore how n8n and local LLMs could transform your automation workflows whilst keeping your data under your control.

Ready to automate?

Book a free automation audit and we'll map your workflows and show you where to start.
