open source · MIT licensed · local-first routing
Star on GitHub ⭐

Your infrastructure already handles most of your AI. The cloud steps in only when necessary.

Smart routing sends LLM requests to your local machines first, spilling to the cloud only when needed. Instantly cut your AI spend by up to 70%.

$ pip install neuralbroker
broker.live · live demo
62 req/s · POST /v1/chat/completions · model=llama3.1:8b · 924 tokens in
VRAM: 73% · threshold: 80% · mode: cost
provider scoring: health · cost/req · p95 latency
pynvml · 500ms polling
Saved this month: $3,200 vs. cloud-only
Requests routed: 1,400,000
Handled locally: 68% of all requests
Avoidable cloud calls: $0 wasted
The problem

Your team is sending requests to OpenAI that your own hardware could handle for pennies.

Every agentic workflow, every code completion, every document run · if it's going to the cloud by default, you're leaving money on the table every single day.

# typical day without NeuralBroker
Requests sent to OpenAI today:        12,847
Requests your hardware could handle:   8,391
Cloud spend today:                   $312.40
Spend that was avoidable:            $203.18

# with NeuralBroker · day 1
[route] local/mac-studio   8,391 req    $0.18
[route] groq               3,201 req    $0.19
[route] openai             1,255 req   $30.51
Total spend today:     $30.88
Saved vs. cloud-only: $281.52 (↓ 90%)
How it works

One config file. One endpoint. Your GPU, intelligently managed.

01

Point your SDK at localhost:8000

Change one line in your OpenAI client's base_url. Your app keeps calling chat.completions.create(). Streaming, tool calls, everything · preserved.

02

pynvml reads your GPU every 500ms

NeuralBroker reads your machines' live capacity every 500ms · GPU memory, model load state, current utilization. It knows before the request arrives whether your hardware can handle it, so routing decisions are made in under 5ms with no guessing.
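The telemetry read itself is a thin wrapper over NVML. A minimal sketch, assuming pynvml and an NVIDIA GPU (the function names `vram_fraction` and `poll_gpu` are illustrative, not NeuralBroker's actual internals):

```python
import time

def vram_fraction(mem_used, mem_total):
    """Fraction of VRAM in use; compared against vram_threshold (0.80)."""
    return mem_used / mem_total

def poll_gpu(interval_s=0.5, samples=None):
    """Yield VRAM usage every interval_s seconds (requires pynvml + NVIDIA GPU)."""
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        while samples is None or samples > 0:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            yield vram_fraction(mem.used, mem.total)  # consumed by the policy engine
            time.sleep(interval_s)  # 500ms polling interval
            if samples is not None:
                samples -= 1
    finally:
        pynvml.nvmlShutdown()
```

Because the reading is refreshed in the background, the per-request routing decision is a pure lookup against the last sample rather than a blocking NVML call.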

03

Policy engine picks the backend

cost-mode: local if VRAM < threshold, else cheapest cloud. speed-mode: always local. fallback-mode: local by default, cloud on OOM or error. Declared in YAML, enforced per-request, switching in under 5ms.
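The three modes reduce to a small, pure decision function. A hedged sketch under assumed names (`Mode`, `decide` are illustrative, not NeuralBroker's API):

```python
from enum import Enum

class Mode(Enum):
    COST = "cost"
    SPEED = "speed"
    FALLBACK = "fallback"

def decide(mode, vram_fraction, threshold=0.80, local_error=False):
    """Return the backend for one request under the given policy mode."""
    if mode is Mode.SPEED:
        # speed-mode: always local, never cloud
        if local_error:
            raise RuntimeError("speed-mode: local failed, no cloud fallback")
        return "local"
    if mode is Mode.COST:
        # cost-mode: local below the VRAM threshold, cheapest cloud above it
        return "local" if vram_fraction < threshold else "cheapest_cloud"
    # fallback-mode: local by default, cloud only after an OOM or error
    return "cheapest_cloud" if local_error else "local"
```

Since the function takes only the latest telemetry sample and the configured threshold, evaluating it per request is trivially within the stated 5ms budget.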

04

Prometheus + Grafana dashboard

Every request tagged with: backend chosen, VRAM at routing time, realized cost, realized latency, routing reason. Live dashboard shows VRAM %, routing decisions, cost over time. Ships in the Docker container.
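Those per-request tags map naturally onto Prometheus labels. A hypothetical scrape of the metrics endpoint might look like this (metric and label names are illustrative, not NeuralBroker's documented schema):

```
# HELP neuralbroker_requests_total Requests by chosen backend and routing reason
# TYPE neuralbroker_requests_total counter
neuralbroker_requests_total{backend="local",reason="vram_below_threshold"} 8391
neuralbroker_requests_total{backend="groq",reason="vram_above_threshold"} 3201
# HELP neuralbroker_vram_fraction VRAM in use at routing time (0-1)
# TYPE neuralbroker_vram_fraction gauge
neuralbroker_vram_fraction 0.73
```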

your app · openai-sdk · base_url=localhost
  ↓
neuralbroker
  vram_telemetry (500ms) → policy_engine (<5ms) → provider_scorer
  scoring cache (redis, optional)
  python · fastapi · asyncio · pynvml
  ↓
backends: Ollama (73% VRAM) · llama.cpp · Groq · Together · OpenAI

event log · last 60s
[local] ollama/local · 73% VRAM · routing local · $0.00003
[vram]  spike to 88% · spilling to groq
[cloud] groq · vram above threshold · $0.0006
[vram]  recovered 71% · resuming local
[cache] hit · 0ms · saved $0.0006
Routing modes

Three modes. One setting. Works the way your team works.

cost-mode

Route local first

Route local if VRAM < threshold. Spill to cheapest cloud above threshold.

Best for: Dev work alongside other GPU processes
default_mode: cost
speed-mode

Always local

Always route local. Never touch cloud. Error if local fails.

Best for: Latency-sensitive, privacy-required, offline
default_mode: speed
fallback-mode

Local by default

Local by default. Cloud only on OOM or error. Resumes local on recovery.

Best for: Long batch jobs, overnight runs
default_mode: fallback
# config.yaml · declare your infrastructure once
local_nodes:
  - name: local
    runtime: ollama        # or llama_cpp, lm_studio
    host: localhost:11434
    vram_threshold: 0.80   # spill to cloud above 80% VRAM

cloud_providers:
  - { name: groq,     api_key_env: GROQ_KEY }
  - { name: together, api_key_env: TOGETHER_KEY }
  - { name: openai,   api_key_env: OPENAI_KEY }

routing:
  default_mode: cost
  electricity_kwh_price: 0.14
  gpu_tdp_watts: 320
The Edge

Your local GPU runs Llama 3.1 8B for $0.02/M tokens. Groq charges $0.06. OpenAI charges $0.60.

NeuralBroker routes to local first. Cloud is overflow · not the default.
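The $0.02/M figure follows from the electricity parameters in config.yaml. A back-of-envelope check, where the sustained throughput (~620 tok/s, batched) is an assumption rather than a measured number:

```python
# Electricity cost per million tokens from the config.yaml parameters.
ELECTRICITY_KWH_PRICE = 0.14   # $/kWh (config: electricity_kwh_price)
GPU_TDP_WATTS = 320            # config: gpu_tdp_watts
THROUGHPUT_TOK_S = 620         # assumed batched throughput for Llama 3.1 8B

hours_per_million = 1_000_000 / THROUGHPUT_TOK_S / 3600
kwh_per_million = (GPU_TDP_WATTS / 1000) * hours_per_million
cost_per_million = kwh_per_million * ELECTRICITY_KWH_PRICE
print(f"${cost_per_million:.3f}/M tokens")  # → $0.020/M tokens
```

Lower throughput scales the cost linearly, which is why the threshold and TDP are configurable rather than hard-coded.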

Llama 3.1 8B Instruct · $/M tokens
30× spread between local electricity and OpenAI

provider                             $/M tok   p95 TTFT
Local · Ollama (electricity-only)    $0.02     ~210ms
Local · llama.cpp (4-bit quant)      $0.03     ~180ms
Groq (inference API)                 $0.06     90ms
Cerebras (inference API)             $0.10     70ms
Together                             $0.20     320ms
DeepInfra                            $0.23     390ms
OpenRouter (routed default)          $0.55     480ms
OpenAI                               $0.60     510ms
The field

Five alternatives. None of them know your GPU exists.

NeuralBroker vs. OpenRouter, LiteLLM, Portkey, Eden AI, and TrueFoundry:

OpenAI-compatible proxy: yes
Routes across cloud providers: yes
Cost + latency observability: yes
Guardrails & caching layer (prompt shields, semantic cache): roadmap
Local GPU as peer provider: yes
Live VRAM polling (pynvml, every 500ms, real GPU state): yes
Dynamic routing on VRAM pressure: yes
Auto-recover to local on VRAM drop: yes
Electricity-cost calculation: yes
Self-hostable, zero cloud dependency: yes
Network path that can actually reach your local GPU: yes
Detects a pre-loaded model in VRAM (180ms, $0.00002 vs. a cloud roundtrip): yes

Pricing for serious workloads: OpenRouter usage-based · LiteLLM free · Portkey $49/mo+ · Eden AI usage-based · TrueFoundry enterprise · NeuralBroker $29/mo+ (OSS core, self-hosted free tier)
OpenRouter · cloud-only router

The difference: OpenRouter optimizes within the cloud. NeuralBroker makes the cloud the fallback. Your GPU runs first, free. Cloud fills in only when VRAM actually can't handle it.

Why you could save more · full breakdown

Doesn't know you exist · literally. OpenRouter is a hosted service. It has no network path to your desk. It cannot call nvidia-smi on your machine. It cannot know that Llama 3.1 8B is already loaded in VRAM and ready in 180ms for $0.00002. Every request leaves your house, hits their servers, comes back. Your GPU sits idle the entire time.

Charges you for requests your GPU could handle nearly free. Llama 3.1 8B on Groq is $0.06/M tokens; your local GPU runs it for about $0.02/M in electricity. Every dev-work token you send through Groq costs 3× what your own hardware would. NeuralBroker routes those tokens local. OpenRouter cannot.

No concept of "my VRAM is full right now." When local VRAM fills, you need something on your machine to detect pressure and route around it. OpenRouter can't do this · it's not local. The OOM errors, the 3-min model reloads, the manual switching · OpenRouter just charges cloud prices while those problems happen.

Their value prop is the thing NeuralBroker makes irrelevant. OpenRouter finds the cheapest cloud provider across a 7× price spread. Real value · but your local GPU isn't on that spread. It's 3× cheaper than the cheapest cloud option and already in your house. OpenRouter optimizes within the cloud. NeuralBroker makes the cloud the fallback.

The identity angle is real. OpenRouter is a service you pay. NeuralBroker is infrastructure you run. That's a fundamentally different relationship with your inference stack.

Where OpenRouter genuinely wins: No local GPU? OpenRouter wins cleanly · better provider coverage, polished UI, real free tier. NeuralBroker without a local GPU is just a worse OpenRouter. This product is for people with hardware on their desk paying cloud prices anyway.

LiteLLM · closest technical overlap

Can technically point at Ollama. But routing is static config at startup · it doesn't poll GPU state. It can't detect that your VRAM just spiked to 88% and preemptively route to cloud. You'd have to manually restart with different config to change routing behavior.

NeuralBroker edge: Dynamic re-routing every 500ms via pynvml. No restart. No config change. Fully autonomous.
Portkey · cloud-to-cloud router

Genuinely good product · guardrails, caching, observability, fallback logic on top of cloud APIs. But it's a cloud-to-cloud router. Your GPU doesn't exist to it. No concept of VRAM pressure, no pynvml, no local-first routing. Also $49/mo minimum for anything serious.

NeuralBroker edge: Local GPU is a first-class provider. Free OSS core. Cloud fallback only when VRAM actually fills up.
Eden AI · different category

Aggregator API for cloud AI services · vision, NLP, speech, LLMs across AWS Rekognition, Google Vision, Azure Cognitive. No local inference support at all. The comparison is almost a category error. It's included because it's in the same price range, not because it does the same thing.

NeuralBroker edge: Built specifically for LLM routing with a local GPU. Not an enterprise multi-modal aggregator.
TrueFoundry · MLOps platform

Deploys and scales models on cloud infrastructure · Kubernetes, AWS/GCP/Azure cloud GPUs. Built for teams that want managed model serving in the cloud. The opposite of local-first. Your local GPU in your home office is not the use case TrueFoundry was built for.

NeuralBroker edge: Runs on your machine. No Kubernetes. No cloud GPU bill. Zero infra overhead.
Integration

One base_url swap. Your code doesn't change.
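A minimal stdlib sketch of what the swap looks like on the wire; the endpoint path and request body follow the standard OpenAI chat-completions shape, and only the base URL changes (with the official SDK, this is just `OpenAI(base_url="http://localhost:8000/v1")`):

```python
import json
import urllib.request

def build_chat_request(messages, model="llama3.1:8b",
                       base_url="http://localhost:8000/v1"):
    """Build a standard /chat/completions request aimed at NeuralBroker."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request([{"role": "user", "content": "Hello!"}])
# Sending is the usual urllib.request.urlopen(req) · only base_url changed.
```

Streaming, tool calls, and response parsing are untouched because the proxy speaks the same protocol on both sides.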

We were spending $8,400 a month on API calls for our agentic pipeline. NeuralBroker routes about 65% of it through our M3 Max workstations now. Bill dropped to $3,100 in the first month.
Head of AI Infrastructure
12-person startup
Monthly API spend before: $8,400
Monthly API spend after: $3,100
Handled locally: 65%

Across active teams: avg. 61% of requests handled locally · avg. $2,840 saved per month

Your hardware is already paid for. Start using it.

Connect NeuralBroker to your team's machines in under 10 minutes. Most teams see their first routing decision · and first dollar saved · within the hour.
