Smart routing sends LLM requests to your local machines first, spilling to the cloud only when needed. Instantly cut your AI spend by up to 70%.
Every agentic workflow, every code completion, every document run: if it's going to the cloud by default, you're leaving money on the table every single day.
Change one line in your OpenAI client's base_url. Your app keeps calling chat.completions.create(). Streaming, tool calls, everything is preserved.
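A minimal sketch of that one-line change, assuming NeuralBroker exposes an OpenAI-compatible endpoint on localhost:8080 (the port and the API-key handling are illustrative, not confirmed defaults):

```python
from openai import OpenAI

# Point the standard OpenAI client at NeuralBroker instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # the one-line change (port is an assumption)
    api_key="unused-locally",             # placeholder; cloud keys live in NeuralBroker's config
)

# The rest of the app is untouched; streaming works as before.
stream = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Summarize this diff."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```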
NeuralBroker reads your machines' live capacity every 500ms: GPU memory, model load state, current utilization. It knows before a request arrives whether your hardware can handle it, so routing decisions take under 5ms with no guessing.
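The kind of poll involved looks roughly like this; a hedged sketch using pynvml (NVIDIA's Python bindings), with function and field names that are illustrative rather than NeuralBroker's actual internals:

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

def read_capacity() -> dict:
    """Snapshot VRAM pressure and utilization for the router to read."""
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    return {
        "vram_used_frac": mem.used / mem.total,  # e.g. 0.72 = 72% full
        "gpu_util_pct": util.gpu,
    }

while True:
    snapshot = read_capacity()  # routing reads this cached snapshot, never the GPU directly
    time.sleep(0.5)             # the 500ms poll interval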
Three modes: cost routes local if VRAM is under threshold, else to the cheapest cloud; speed always routes local; fallback routes local by default and spills to cloud on OOM or error. Declared in YAML, enforced per-request, switchable in under 5ms. (A code sketch of all three follows the mode descriptions below.)
Every request tagged with: backend chosen, VRAM at routing time, realized cost, realized latency, routing reason. Live dashboard shows VRAM %, routing decisions, cost over time. All of it ships in the Docker container.
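For illustration, one such per-request record might look like this; field names are assumptions, not NeuralBroker's actual schema:

```python
# One routing decision record, as it might appear in the request log.
decision = {
    "backend": "local/ollama",            # backend chosen
    "vram_at_routing": 0.72,              # 72% VRAM used when the decision was made
    "realized_cost_usd": 0.0,             # local inference: electricity only
    "realized_latency_ms": 180,
    "routing_reason": "vram_below_threshold",
}
```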
Route local if VRAM < threshold. Spill to cheapest cloud above threshold.
Always route local. Never touch cloud. Error if local fails.
Local by default. Cloud only on OOM or error. Resumes local on recovery.
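A compressed sketch of how the three modes above translate into a routing decision; the function shape and defaults are assumptions for illustration, not NeuralBroker's source:

```python
def route(mode: str, vram_used_frac: float, local_healthy: bool = True,
          threshold: float = 0.80) -> str:
    """Pick a backend under the three modes described above."""
    if mode == "speed":
        if not local_healthy:
            raise RuntimeError("speed mode never touches cloud; local is down")
        return "local"
    if mode == "cost":
        return "local" if vram_used_frac < threshold else "cheapest_cloud"
    if mode == "fallback":
        # Cloud only after an OOM or error; recovery routes back to local.
        return "local" if local_healthy else "cloud"
    raise ValueError(f"unknown routing mode: {mode!r}")
```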
# config.yaml: declare your infrastructure once
local_nodes:
  - name: local
    runtime: ollama            # or llama_cpp, lm_studio
    host: localhost:11434
    vram_threshold: 0.80       # spill to cloud above 80% VRAM
cloud_providers:
  - { name: groq, api_key_env: GROQ_KEY }
  - { name: together, api_key_env: TOGETHER_KEY }
  - { name: openai, api_key_env: OPENAI_KEY }
routing:
  default_mode: cost
  electricity_kwh_price: 0.14
  gpu_tdp_watts: 320
NeuralBroker routes to local first. Cloud is overflow, not the default.
Doesn't know you exist, literally. OpenRouter is a hosted service. It has no network path to your desk. It cannot call nvidia-smi on your machine. It cannot know that Llama 3.1 8B is already loaded in VRAM and ready in 180ms for $0.00002. Every request leaves your house, hits their servers, and comes back. Your GPU sits idle the entire time.
Charges you for requests your GPU could handle nearly free. Llama 3.1 8B on Groq is $0.06/M tokens; your local GPU costs about $0.02/M in electricity. At agentic-pipeline volumes, say 500M tokens a month, that's $30/mo to Groq vs ~$10 in electricity. NeuralBroker routes those requests local. OpenRouter cannot.
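The ~$0.02/M electricity figure is derivable from the config values above; the 600 tok/s aggregate throughput is an assumption (batched serving on a modern GPU), not a measured number:

```python
gpu_tdp_watts = 320            # from config.yaml
electricity_kwh_price = 0.14   # $/kWh, from config.yaml
tokens_per_sec = 600           # assumed aggregate throughput under batching

cost_per_hour = (gpu_tdp_watts / 1000) * electricity_kwh_price  # $0.0448/hr
tokens_per_hour = tokens_per_sec * 3600                         # 2.16M tokens/hr
cost_per_million = cost_per_hour / tokens_per_hour * 1e6
print(f"${cost_per_million:.3f} per 1M tokens")                 # ≈ $0.021
```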
No concept of "my VRAM is full right now." When local VRAM fills, you need something on your machine to detect the pressure and route around it. OpenRouter can't do this; it's not local. The OOM errors, the 3-minute model reloads, the manual switching: OpenRouter just charges cloud prices while those problems happen.
Their value prop is the thing NeuralBroker makes irrelevant. OpenRouter finds the cheapest cloud provider across a 7x price spread. That's real value, but your local GPU isn't on that spread: it's 3x cheaper than the cheapest cloud option and already in your house. OpenRouter optimizes within the cloud. NeuralBroker makes the cloud the fallback.
The identity angle is real. OpenRouter is a service you pay. NeuralBroker is infrastructure you run. That's a fundamentally different relationship with your inference stack.
Where OpenRouter genuinely wins: no local GPU? Then OpenRouter wins cleanly, with better provider coverage, a polished UI, and a real free tier. NeuralBroker without a local GPU is just a worse OpenRouter. This product is for people with hardware on their desk who are paying cloud prices anyway.
Can technically point at Ollama, but routing is static config read at startup; it doesn't poll GPU state. It can't detect that your VRAM just spiked to 88% and preemptively route to cloud. To change routing behavior, you'd have to manually restart with a different config.
Genuinely good product: guardrails, caching, observability, and fallback logic on top of cloud APIs. But it's a cloud-to-cloud router. Your GPU doesn't exist to it: no concept of VRAM pressure, no pynvml, no local-first routing. And it's $49/mo minimum for anything serious.
An aggregator API for cloud AI services: vision, NLP, speech, and LLMs across AWS Rekognition, Google Vision, and Azure Cognitive Services. No local inference support at all. The comparison is almost a category error; it's included because it's in the same price range, not because it does the same thing.
Deploys and scales models on cloud infrastructure: Kubernetes, cloud GPUs on AWS/GCP/Azure. Built for teams that want managed model serving in the cloud. The opposite of local-first. The GPU in your home office is not the use case TrueFoundry was built for.
We were spending $8,400 a month on API calls for our agentic pipeline. NeuralBroker routes about 65% of it through our M3 Max workstations now. Bill dropped to $3,100 in the first month.
Across active teams: avg. 61% of requests handled locally, avg. $2,840 saved per month.
Connect NeuralBroker to your team's machines in under 10 minutes. Most teams see their first routing decision, and their first dollar saved, within the hour.