open source · MIT licensed · local-first routing
Star on GitHub ⭐

Your infrastructure already handles most of your AI. The cloud steps in only when necessary.

Smart routing sends LLM requests to your local machines first, spilling to the cloud only when needed. Instantly cut your AI spend by up to 70%.

$ pip install neuralbroker
broker.live · live demo
62 req/s · POST /v1/chat/completions · model=llama3.1:8b · 924 tokens in
VRAM: 73% · threshold: 80% · mode: cost
provider scoring: health · cost/req · p95 latency
pynvml · 500ms polling
Saved this month: $3,200 vs. cloud-only
Requests routed: 1,400,000
Handled locally: 68% of all requests
Avoidable cloud calls: $0 wasted
The problem

Your team is sending requests to OpenAI that your own hardware could handle for pennies.

Every agentic workflow, every code completion, every document run · if it's going to the cloud by default, you're leaving money on the table every single day.

# typical day without NeuralBroker
Requests sent to OpenAI today:        12,847
Requests your hardware could handle:   8,391
Cloud spend today:                   $312.40
Spend that was avoidable:            $203.18

# with NeuralBroker · day 1
[route] local/mac-studio   8,391 req    $0.18
[route] groq               3,201 req    $0.19
[route] openai             1,255 req   $30.51
Total spend today:     $30.88
Saved vs. cloud-only: $281.52 (↓ 90%)
How it works

One config file. One endpoint. Your GPU, intelligently managed.

01

Point your SDK at localhost:8000

Change one line in your OpenAI client's base_url. Your app keeps calling chat.completions.create(). Streaming, tool calls, everything · preserved.

02

pynvml reads your GPU every 500ms

NeuralBroker reads your machines' live capacity every 500ms · GPU memory, model load state, current utilization. It knows before the request arrives whether your hardware can handle it, so routing decisions are made in under 5ms with no guessing.
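The telemetry read itself is a thin wrapper over NVML. A minimal sketch, assuming pynvml and an NVIDIA GPU (the function names `vram_fraction` and `poll_gpu` are illustrative, not NeuralBroker's actual internals):

```python
import time

def vram_fraction(mem_used, mem_total):
    """Fraction of VRAM in use; compared against vram_threshold (0.80)."""
    return mem_used / mem_total

def poll_gpu(interval_s=0.5, samples=None):
    """Yield VRAM usage every interval_s seconds (requires pynvml + NVIDIA GPU)."""
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        while samples is None or samples > 0:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            yield vram_fraction(mem.used, mem.total)  # consumed by the policy engine
            time.sleep(interval_s)  # 500ms polling interval
            if samples is not None:
                samples -= 1
    finally:
        pynvml.nvmlShutdown()
```

Because the reading is refreshed in the background, the per-request routing decision is a pure lookup against the last sample rather than a blocking NVML call.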

03

Policy engine picks the backend

cost-mode: local if VRAM < threshold, else cheapest cloud. speed-mode: always local. fallback-mode: local by default, cloud on OOM or error. Declared in YAML, enforced per-request, switching in under 5ms.
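The three modes reduce to a small, pure decision function. A hedged sketch under assumed names (`Mode`, `decide` are illustrative, not NeuralBroker's API):

```python
from enum import Enum

class Mode(Enum):
    COST = "cost"
    SPEED = "speed"
    FALLBACK = "fallback"

def decide(mode, vram_fraction, threshold=0.80, local_error=False):
    """Return the backend for one request under the given policy mode."""
    if mode is Mode.SPEED:
        # speed-mode: always local, never cloud
        if local_error:
            raise RuntimeError("speed-mode: local failed, no cloud fallback")
        return "local"
    if mode is Mode.COST:
        # cost-mode: local below the VRAM threshold, cheapest cloud above it
        return "local" if vram_fraction < threshold else "cheapest_cloud"
    # fallback-mode: local by default, cloud only after an OOM or error
    return "cheapest_cloud" if local_error else "local"
```

Since the function takes only the latest telemetry sample and the configured threshold, evaluating it per request is trivially within the stated 5ms budget.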

04

Prometheus + Grafana dashboard

Every request tagged with: backend chosen, VRAM at routing time, realized cost, realized latency, routing reason. Live dashboard shows VRAM %, routing decisions, cost over time. Ships in the Docker container.
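Those per-request tags map naturally onto Prometheus labels. A hypothetical scrape of the metrics endpoint might look like this (metric and label names are illustrative, not NeuralBroker's documented schema):

```
# HELP neuralbroker_requests_total Requests by chosen backend and routing reason
# TYPE neuralbroker_requests_total counter
neuralbroker_requests_total{backend="local",reason="vram_below_threshold"} 8391
neuralbroker_requests_total{backend="groq",reason="vram_above_threshold"} 3201
# HELP neuralbroker_vram_fraction VRAM in use at routing time (0-1)
# TYPE neuralbroker_vram_fraction gauge
neuralbroker_vram_fraction 0.73
```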

your app · openai-sdk · base_url=localhost
  ↓
neuralbroker
  vram_telemetry (500ms) → policy_engine (<5ms) → provider_scorer
  scoring cache (redis, optional)
  python · fastapi · asyncio · pynvml
  ↓
backends: Ollama (73% VRAM) · llama.cpp · Groq · Together · OpenAI

event log · last 60s
[local] ollama/local · 73% VRAM · routing local · $0.00003
[vram]  spike to 88% · spilling to groq
[cloud] groq · vram above threshold · $0.0006
[vram]  recovered 71% · resuming local
[cache] hit · 0ms · saved $0.0006
Routing modes

Three modes. One setting. Works the way your team works.

cost-mode

Route local first

Route local if VRAM < threshold. Spill to cheapest cloud above threshold.

Best for: Dev work alongside other GPU processes
default_mode: cost
speed-mode

Always local

Always route local. Never touch cloud. Error if local fails.

Best for: Latency-sensitive, privacy-required, offline
default_mode: speed
fallback-mode

Local by default

Local by default. Cloud only on OOM or error. Resumes local on recovery.

Best for: Long batch jobs, overnight runs
default_mode: fallback
# config.yaml · declare your infrastructure once
local_nodes:
  - name: local
    runtime: ollama        # or llama_cpp, lm_studio
    host: localhost:11434
    vram_threshold: 0.80   # spill to cloud above 80% VRAM

cloud_providers:
  - { name: groq,     api_key_env: GROQ_KEY }
  - { name: together, api_key_env: TOGETHER_KEY }
  - { name: openai,   api_key_env: OPENAI_KEY }

routing:
  default_mode: cost
  electricity_kwh_price: 0.14
  gpu_tdp_watts: 320
The Edge

Your local GPU runs Llama 3.1 8B for $0.02/M tokens. Groq charges $0.06. OpenAI charges $0.60.

NeuralBroker routes to local first. Cloud is overflow · not the default.
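The $0.02/M figure follows from the electricity parameters in config.yaml. A back-of-envelope check, where the sustained throughput (~620 tok/s, batched) is an assumption rather than a measured number:

```python
# Electricity cost per million tokens from the config.yaml parameters.
ELECTRICITY_KWH_PRICE = 0.14   # $/kWh (config: electricity_kwh_price)
GPU_TDP_WATTS = 320            # config: gpu_tdp_watts
THROUGHPUT_TOK_S = 620         # assumed batched throughput for Llama 3.1 8B

hours_per_million = 1_000_000 / THROUGHPUT_TOK_S / 3600
kwh_per_million = (GPU_TDP_WATTS / 1000) * hours_per_million
cost_per_million = kwh_per_million * ELECTRICITY_KWH_PRICE
print(f"${cost_per_million:.3f}/M tokens")  # → $0.020/M tokens
```

Lower throughput scales the cost linearly, which is why the threshold and TDP are configurable rather than hard-coded.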

Llama 3.1 8B Instruct · $/M tokens
30× spread between local electricity and OpenAI

provider                             $/M tok   p95 TTFT
Local · Ollama (electricity-only)    $0.02     ~210ms
Local · llama.cpp (4-bit quant)      $0.03     ~180ms
Groq (inference API)                 $0.06     90ms
Cerebras (inference API)             $0.10     70ms
Together                             $0.20     320ms
DeepInfra                            $0.23     390ms
OpenRouter (routed default)          $0.55     480ms
OpenAI                               $0.60     510ms
The field

Five alternatives. None of them know your GPU exists.

NeuralBroker vs. OpenRouter, LiteLLM, Portkey, Eden AI, and TrueFoundry:

OpenAI-compatible proxy: yes
Routes across cloud providers: yes
Cost + latency observability: yes
Guardrails & caching layer (prompt shields, semantic cache): roadmap
Local GPU as peer provider: yes
Live VRAM polling (pynvml, every 500ms, real GPU state): yes
Dynamic routing on VRAM pressure: yes
Auto-recover to local on VRAM drop: yes
Electricity-cost calculation: yes
Self-hostable, zero cloud dependency: yes
Network path that can actually reach your local GPU: yes
Detects a pre-loaded model in VRAM (180ms, $0.00002 vs. a cloud roundtrip): yes

Pricing for serious workloads: OpenRouter usage-based · LiteLLM free · Portkey $49/mo+ · Eden AI usage-based · TrueFoundry enterprise · NeuralBroker $29/mo+ (OSS core, self-hosted free tier)
OpenRouter · cloud-only router

The difference: OpenRouter optimizes within the cloud. NeuralBroker makes the cloud the fallback. Your GPU runs first, free. Cloud fills in only when VRAM actually can't handle it.

Why you could save more · full breakdown

Doesn't know you exist · literally. OpenRouter is a hosted service. It has no network path to your desk. It cannot call nvidia-smi on your machine. It cannot know that Llama 3.1 8B is already loaded in VRAM and ready in 180ms for $0.00002. Every request leaves your house, hits their servers, comes back. Your GPU sits idle the entire time.

Charges you for requests your GPU could handle nearly free. Llama 3.1 8B on Groq is $0.06/M tokens; your local GPU runs it for about $0.02/M in electricity. Every dev-work token you send through Groq costs 3× what your own hardware would. NeuralBroker routes those tokens local. OpenRouter cannot.

No concept of "my VRAM is full right now." When local VRAM fills, you need something on your machine to detect pressure and route around it. OpenRouter can't do this · it's not local. The OOM errors, the 3-min model reloads, the manual switching · OpenRouter just charges cloud prices while those problems happen.

Their value prop is the thing NeuralBroker makes irrelevant. OpenRouter finds the cheapest cloud provider across a 7× price spread. Real value · but your local GPU isn't on that spread. It's 3× cheaper than the cheapest cloud option and already in your house. OpenRouter optimizes within the cloud. NeuralBroker makes the cloud the fallback.

The identity angle is real. OpenRouter is a service you pay. NeuralBroker is infrastructure you run. That's a fundamentally different relationship with your inference stack.

Where OpenRouter genuinely wins: No local GPU? OpenRouter wins cleanly · better provider coverage, polished UI, real free tier. NeuralBroker without a local GPU is just a worse OpenRouter. This product is for people with hardware on their desk paying cloud prices anyway.

LiteLLM · closest technical overlap

Can technically point at Ollama. But routing is static config at startup · it doesn't poll GPU state. It can't detect that your VRAM just spiked to 88% and preemptively route to cloud. You'd have to manually restart with different config to change routing behavior.

NeuralBroker edge: Dynamic re-routing every 500ms via pynvml. No restart. No config change. Fully autonomous.
Portkey · cloud-to-cloud router

Genuinely good product · guardrails, caching, observability, fallback logic on top of cloud APIs. But it's a cloud-to-cloud router. Your GPU doesn't exist to it. No concept of VRAM pressure, no pynvml, no local-first routing. Also $49/mo minimum for anything serious.

NeuralBroker edge: Local GPU is a first-class provider. Free OSS core. Cloud fallback only when VRAM actually fills up.
Eden AI · different category

Aggregator API for cloud AI services · vision, NLP, speech, LLMs across AWS Rekognition, Google Vision, Azure Cognitive. No local inference support at all. The comparison is almost a category error. It's included because it's in the same price range, not because it does the same thing.

NeuralBroker edge: Built specifically for LLM routing with a local GPU. Not an enterprise multi-modal aggregator.
TrueFoundry · MLOps platform

Deploys and scales models on cloud infrastructure · Kubernetes, AWS/GCP/Azure cloud GPUs. Built for teams that want managed model serving in the cloud. The opposite of local-first. Your local GPU in your home office is not the use case TrueFoundry was built for.

NeuralBroker edge: Runs on your machine. No Kubernetes. No cloud GPU bill. Zero infra overhead.
Integration

One base_url swap. Your code doesn't change.
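A minimal stdlib sketch of what the swap looks like on the wire; the endpoint path and request body follow the standard OpenAI chat-completions shape, and only the base URL changes (with the official SDK, this is just `OpenAI(base_url="http://localhost:8000/v1")`):

```python
import json
import urllib.request

def build_chat_request(messages, model="llama3.1:8b",
                       base_url="http://localhost:8000/v1"):
    """Build a standard /chat/completions request aimed at NeuralBroker."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request([{"role": "user", "content": "Hello!"}])
# Sending is the usual urllib.request.urlopen(req) · only base_url changed.
```

Streaming, tool calls, and response parsing are untouched because the proxy speaks the same protocol on both sides.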

We were spending $8,400 a month on API calls for our agentic pipeline. NeuralBroker routes about 65% of it through our M3 Max workstations now. Bill dropped to $3,100 in the first month.
Head of AI Infrastructure
12-person startup
Monthly API spend before: $8,400
Monthly API spend after: $3,100
Handled locally: 65%

Across active teams: avg. 61% of requests handled locally · avg. $2,840 saved per month

Your hardware is already paid for. Start using it.

Connect NeuralBroker to your team's machines in under 10 minutes. Most teams see their first routing decision · and first dollar saved · within the hour.
