§ ONP-00 / MODULE · SERVICES · THE BACKBONE · 02 RACKS LIVE · 1 IN BUILD

On-prem
compute.

SOVEREIGN AI INFRASTRUCTURE / EST. 2026

The plumbing under Little Bear. A small rack in your shop, running open-source models you own, on hardware you own, against data that never leaves the building.

Continuous ontology systems are token-eaters. Renting frontier APIs by the call turns a 5-person operation into a hostage to someone else’s pricing page. We install fixed-cost local stacks instead — quantized specialist models, deterministic tool use, your data on your floor.

FIG. 00 / COST CURVE · FRONTIER API vs ON-PREM · cumulative spend, 36-month horizon ($USD thousands)
FRONTIER · linear @ $6k/mo · ON-PREM · $18k capex + $1.5k/mo · METHOD · heavy agent workload, tokens/mo held constant
BREAK-EVEN AT MONTH 4 ($24k) · 3-YEAR DELTA $144.0k
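The break-even math is simple enough to check by hand. A minimal sketch, using only the illustrative figures from the chart above (flat $6k/mo API spend vs an $18k rack plus $1.5k/mo opex):

```python
# Break-even month for the illustrative FIG. 00 figures:
# frontier API at a flat $6k/mo vs a rack at $18k capex + $1.5k/mo opex.
def break_even_month(api_monthly, capex, opex_monthly):
    """First month where cumulative API spend meets or exceeds on-prem spend."""
    month = 0
    api_total, onprem_total = 0.0, capex
    while api_total < onprem_total:
        month += 1
        api_total += api_monthly
        onprem_total += opex_monthly
    return month

print(break_even_month(6_000, 18_000, 1_500))  # -> 4
```

Cumulative API spend catches the rack at month 4, $24k on both sides, which is the break-even marker on the chart.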
≈ $0/tok
Marginal cost per token
100%
Workload kept on-prem
4-bit
Quantization, ~98.9% retained
1 rack
Your whole operation
§ ONP-00 · WHERE IT FITS / THE SUBSTRATE

On-prem is the substrate Little Bear runs on.

MSP keeps the lights on. FDE earns the context. ONP holds the weights, the graph, and the writeback. Little Bear is the surface the operator touches. Take any one out and the loop stops compounding.

§ ONP-01 · WHY IT HOLDS / THREE PILLARS

Sovereignty.
Fixed cost. Determinism.

On-prem is not an aesthetic preference. It is the only configuration that satisfies all three constraints at once for an operations-heavy mid-market shop.

01 / SOVEREIGNTY
Data stays on the floor.

Job records, invoices, payroll signals, client correspondence — none of it leaves the building. No vendor logs your prompts. No third-party model trains on your operation.

02 / FIXED COST
A line item, not a meter.

Capex once, opex flat. The same agent loop that bankrupts you on per-call APIs runs all night for the cost of the electricity. Budget like grown-ups.

03 / DETERMINISM
The plumbing is not probabilistic.

Models propose. MCP validates. SQL executes. The probabilistic surface is sandboxed inside a typed contract — it cannot send the wrong invoice to the wrong client because the schema will not let it.
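A hypothetical sketch of that contract in plain Python. The tool name, field names, and schema shape here are invented for illustration; the production gateway speaks MCP with real typed schemas, but the gate works the same way: a proposal that does not match the contract never reaches SQL.

```python
# Hypothetical typed tool contract: the model's proposed call is validated
# before anything executes. Field names and types are illustrative only,
# not the real MCP SDK or the real invoice schema.
INVOICE_SCHEMA = {
    "client_id": int,
    "amount_cents": int,
    "due_date": str,
}

def validate_call(args: dict, schema: dict) -> dict:
    """Reject any proposal with missing fields, extra fields, or wrong types."""
    if set(args) != set(schema):
        raise ValueError(f"fields {set(args) ^ set(schema)} do not match the contract")
    for field, expected in schema.items():
        if not isinstance(args[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return args  # only a schema-clean call reaches the SQL layer

# A well-formed proposal passes; a wrong-shaped one raises before execution.
validate_call({"client_id": 7, "amount_cents": 120_00, "due_date": "2026-03-01"}, INVOICE_SCHEMA)
```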

MSP × ON-PREM
COMPLEMENTARITY · FIG. 05

One number
to call.

The rack is not a side project for somebody else to babysit. The same engineers who watch your applications watch the rack — disk health, fan curves, model serving, weight versions, monthly operating review.

01
SAME TEAM
Application support and rack support, one roster of engineers.
02
SAME SLA
Disk fail, fan stall, vLLM crash — alerted, ticketed, fixed.
03
SAME BILL
The on-prem stack is an MSP service; we just rack the box too.
MSP ↔ ONP · ONE THROAT TO CHOKE
§ ONP-02 · THE MATH / TOKENOMICS · WHY ON-PREM

Continuous agents
are token-eaters.

An ontology that mutates on every job, every form, every approval cannot be funded by a per-call API. The first month works. The third month bleeds. The sixth month is a budget crisis dressed up as a roadmap question.

For a 5-person back office, the rack pays for itself before the second quarter. After that, the line is flat. The API line never is.

  • FIXED · Hardware is bought once. Power and swap parts are predictable. No surprise pricing email at 11pm on a Tuesday.
  • SOVEREIGN · Crew schedules, client invoices, equipment ledgers — none of it leaves the yard. Inference happens on a machine you own.
  • ALWAYS-ON · Background agents can run 24/7 without a meter spinning. The MAP phase mutates the graph as often as reality changes.
  • DETERMINISTIC · Tool calls validate against schemas at the MCP layer. The probabilistic part is sandboxed. The writeback is not.
FIG. 01 / CUMULATIVE COST · 24 MO · API rental vs on-prem rack · CROSSOVER · MONTH 3 · CAPEX · ~$9k
ASSUMPTIONS · API: $2.2k/mo, +8% MoM · ON-PREM: $9k capex + $220/mo opex · ILLUSTRATIVE, NOT A QUOTE
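Under the stated assumptions ($2.2k/mo growing 8% month over month vs $9k capex plus $220/mo), the crossover can be recomputed directly, with months indexed from M00 as on the chart axis:

```python
# Crossover month for the FIG. 01 assumptions: API at $2.2k/mo compounding
# 8% MoM vs on-prem at $9k capex + $220/mo opex. Months indexed from M00,
# matching the chart axis.
def crossover_month(api_start, growth, capex, opex):
    api_total, onprem_total, api_bill = 0.0, capex, api_start
    for month in range(120):
        api_total += api_bill
        onprem_total += opex
        if api_total >= onprem_total:
            return month
        api_bill *= 1 + growth
    return None

print(crossover_month(2_200, 0.08, 9_000, 220))  # -> 3, the M03 marker
```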
§ ONP-03 · THE STACK / SIX STRATA · L0 → L5

Six layers,
one closet.

The whole AI plumbing for an entire mid-market operation, stacked top to bottom. Open at every layer. Owned at every layer. Every layer is something we have racked before.

L5 · Little Bear: Operational ontology, decision intelligence, writeback execution. The product the operator actually sees. WEB · NEXT.JS
L4 · MCP Gateway: The USB-C port for AI. Every tool — invoicing, dispatch, document gen — exposed as a typed schema. Probabilistic in, deterministic out. MCP · TYPED
L3 · Graph + Vector Store: Adaptive GraphRAG over Neo4j, with a vector index for free-text recall. Relationships between 23 employees and their equipment stay structurally sound. NEO4J · QDRANT
L2 · Inference Server: vLLM on the workstation route, MLX on Apple silicon. Continuous batching, prefix caching, KV reuse. The static system prompt is paid for once. vLLM · MLX
L1 · Quantized Weights: Open-source model weights, 4-bit (W4A16), pinned to disk. Versioned. Auditable. Yours. GGUF · SAFETENSORS
L0 · Bare Metal: The Mac Studio or the tower. A surge protector. A UPS. A label maker. Locked in a closet that is already locked. STUDIO / TOWER
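Why L1 fits inside L0 is back-of-envelope arithmetic: 4-bit weights cost roughly half a byte per parameter. A rough sketch (weights only; KV cache and activations need additional headroom on top):

```python
# Back-of-envelope memory for 4-bit (W4A16) weights: 4 bits = 0.5 bytes
# per parameter. KV cache and activations are extra and not counted here.
def weight_gb(params_billion, bits=4):
    return params_billion * 1e9 * bits / 8 / 1e9  # GB for the weights alone

for size in (14, 32):
    print(f"{size}B @ 4-bit ≈ {weight_gb(size):.0f} GB of weights")
# 14B ≈ 7 GB, 32B ≈ 16 GB — both within reach of a single workstation GPU
# or a Mac Studio's unified memory, before KV-cache headroom.
```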
§ ONP-04 · HARDWARE / RIGHT-SIZED · TWO ROUTES

One rack, not
a data center.

A 5-person back office does not need a server room. A small, well-chosen box in a locked closet runs the whole stack. We pick the route that fits the shop, rack it, harden it, and cover it under MSP — same engineers, same on-call, one throat to choke.

ROUTE A · QUIET
SHELF · CLOSET · FIG. 03A

The desktop unit.

A single sealed box that sits on a shelf in the IT closet. Pulls less power than a space heater. No fan whine, no heat, no second mortgage on the electric bill. The right answer for most 5-to-50-person shops.

  • FORM · 1U-ish desktop
  • DRAW · Closet-friendly
  • NOISE · Library-quiet
LOW POWER · SINGLE UNIT · DESK FOOTPRINT
01 / 02
ROUTE B · HEADROOM
GPU · FLOOR · UNDER DESK · FIG. 03B

The workstation tower.

A standard tower with a serious GPU, the same shape as the box already under somebody's desk. Headroom for heavier workloads, hot-swappable parts, an upgrade path that doesn't require throwing the rack away.

  • FORM · Mid-tower
  • DRAW · Wall-circuit
  • PATH · Card-by-card upgrades
HEADROOM · UPGRADEABLE · STANDARD PARTS
02 / 02
On spec · FIG. 03 NOTE

Every install is sized to the actual workload — concurrent users, ontology size, agent loop frequency. We share the bill of materials with you. We do not share it on a public services page.

§ ONP-05 · MODELS / SPECIALISTS, NOT GENERALISTS · 04 ROLES

A small team of
specialists.

One enormous frontier model trying to do everything is the wrong shape for an ops backbone. We run a small ensemble of open-source specialists — each chosen for a single role, quantized for the rack, hot-swappable as the open ecosystem moves. The roles are stable. The weights underneath are ours to pick.

01 / CODING + TOOLS
The hands.

Writes the SQL, drafts the invoice, fills the form, calls the next tool. The model that turns an operator's intent into a typed action against your systems. Specialist-tuned for code and structured output.

TOOL USE · JSON · CODE
02 / REASONING + JSON
The planner.

Decomposes a fuzzy ask into a clean sequence of validated steps. Returns schema-compliant tool calls — the part the deterministic plumbing relies on. Best return on inference cost in the open ecosystem.

REASONING · JSON · PLANNING
03 / EDGE + COMMS
The fast one.

Resource-light work that runs constantly — log parsing, dispatch summaries, agent-to-agent messages, micro-classifications. Cheap to keep warm, fast to invoke, sized for always-on background loops.

EDGE · FAST · ALWAYS-ON
04 / EMBEDDINGS
The librarian.

Encodes the entire ontology into a local vector index for semantic recall. Re-embeds on a nightly cron without phoning home. The index is built, owned, and stored on the rack — never shipped to a third party.

RETRIEVAL · INDEX · LOCAL
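The mechanics of local recall reduce to nearest-neighbor search over vectors that never leave the box. A toy stdlib-only sketch, with hand-made three-dimensional vectors standing in for a real embedder and for Qdrant:

```python
import math

# Toy sketch of local semantic recall. The production stack uses a real
# vector store (Qdrant) and a learned embedding model; here, tiny hand-made
# vectors stand in so the mechanics stay visible.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

index = {
    "invoice #4412 overdue":  [0.9, 0.1, 0.0],
    "crew schedule, week 12": [0.1, 0.8, 0.2],
    "excavator service log":  [0.0, 0.2, 0.9],
}

def recall(query_vec, k=1):
    """Top-k entries by cosine similarity — all computed on the rack."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(recall([0.85, 0.15, 0.05]))  # -> ['invoice #4412 overdue']
```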
On weights · FIG. 04 NOTE

We track the open-source frontier weekly and re-quantize new releases into the same role slots. The MCP contract above the models does not change when the weights underneath do — your stack improves quietly, in place.

§ ONP-06 · INSTALL / WHAT FDE RACKS · 06 STEPS · ~5 WEEKS

What gets installed
during the embed.

The on-prem stack is not a separate engagement. It is part of the FDE embed. Our engineers rack the box the same week they start riding along. By the end of the embed, the rack is humming and Little Bear is mutating the graph.

01 · Site survey: Power draw, network closet, cooling, physical security. We map the room before we order the box. WEEK 01
02 · Hardware delivery: Mac Studio or workstation racked and labeled. UPS, surge, KVM, network drop wired in. WEEK 02
03 · Stack provision: OS hardened, vLLM or MLX serving live, weights pinned, MCP gateway configured against your tools. WEEK 03
04 · Ontology bootstrap: Initial graph populated from existing systems. Schedules, employees, equipment, clients connected. WEEK 04
05 · Writeback wired: First two MCP tools — invoice compose, schedule update — go live under team supervision. WEEK 05
06 · MSP handoff: Monitoring on the rack, on-call on the engineers, monthly operating review on the calendar. ONGOING
§ ONP-07 · FAQ / ON-PREM COMPUTE · 6 ANSWERS
Q-01 · What about model updates? Won't we miss the frontier?

You miss the frontier the way a working diesel truck misses a Formula 1 car. For ops work — schedule mutation, invoice compose, document parsing — quantized 14B–32B specialists clear the bar. We track new releases, re-quantize, and hot-swap weights without changing the MCP contract above them.

Q-02 · Who covers the rack when something breaks?

Our MSP does. The same engineers who watch your applications watch the rack. Disk fails, fan stalls, vLLM crashes — alerted, ticketed, fixed. You are not running an in-house data-center team.

Q-03 · Does this mean we never use frontier APIs?

No. For long-tail tasks where Sonnet-class horsepower is genuinely needed, the gateway can fall through to a frontier API under explicit policy. The default is local. The exception is logged.
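A hypothetical sketch of that policy gate. The task names and allow-list here are invented; the shape is the point: local by default, frontier only on an explicit list, every exception leaving a log line.

```python
# Hypothetical routing gate: local is the default, a frontier fallthrough is
# taken only under explicit policy, and every exception is logged. The task
# names and allow-list are illustrative, not a real configuration.
ALLOW_FRONTIER = {"long_context_legal_review"}  # policy list, set per client
audit_log = []

def route(task: str) -> str:
    if task in ALLOW_FRONTIER:
        audit_log.append(f"FRONTIER fallthrough: {task}")
        return "frontier_api"
    return "local_rack"

assert route("invoice_compose") == "local_rack"          # default: stays home
assert route("long_context_legal_review") == "frontier_api"
print(audit_log)  # only the exception leaves a trace; the default never does
```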

Q-04 · Why open-source models specifically?

Three reasons: weights you can pin and audit, no surprise pricing email, no surprise license change. If a vendor pivots, your stack does not.

Q-05 · Can the rack actually keep up with always-on agents?

With prefix caching and continuous batching, yes. Static system prompts and stable ontology context get cached, so the per-mutation token cost approaches the cost of the diff. That is what makes background loops viable.
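The arithmetic behind "the cost of the diff," with illustrative token counts (a 6k-token static prefix, a 150-token mutation per loop iteration):

```python
# Illustrative arithmetic for prefix caching: the static system prompt plus
# stable ontology context is prefilled once and reused, so each background
# mutation only pays for its diff. Token counts are made-up round numbers.
STATIC_PREFIX = 6_000   # tokens: system prompt + ontology context
DIFF = 150              # tokens: the actual change per loop iteration

def tokens_prefilled(calls, cached):
    """Total prompt tokens actually computed across `calls` iterations."""
    if cached:
        return STATIC_PREFIX + calls * DIFF   # prefix paid once
    return calls * (STATIC_PREFIX + DIFF)     # prefix paid on every call

uncached = tokens_prefilled(1_000, cached=False)  # 6,150,000 tokens
cached = tokens_prefilled(1_000, cached=True)     #   156,000 tokens
print(f"~{uncached // cached}x fewer prompt tokens with the prefix cached")
```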

Q-06 · What happens to the rack when we outgrow it?

The MCP contract is the same regardless of what is behind it. Add a second card, a second box, or a second site — the tool layer does not change. The rack scales without rewriting Little Bear.

Your data.
Your weights.
Your floor.

We will spec the rack for your shop, install it under the FDE embed, and hand it to MSP for ongoing coverage. Fixed cost. No vendor lock. Little Bear humming on hardware you own.

Spec a rack