§ ONP-00 / MODULE · SERVICES · THE BACKBONE · 02 RACKS LIVE · 1 IN BUILD

On-prem
compute.

SOVEREIGN AI INFRASTRUCTURE / EST. 2026

The plumbing under Little Bear. A small rack in your shop, running open-source models you own, on hardware you own, against data that never leaves the building.

Continuous ontology systems are token-eaters. Renting frontier APIs by the call turns a 5-person operation into a hostage to someone else’s pricing page. We install fixed-cost local stacks instead — quantized specialist models, deterministic tool use, your data on your floor.

FIG. 00 / COST CURVE · FRONTIER API vs ON-PREM · cumulative spend, 36-month horizon ($USD thousands)
FRONTIER · linear @ $6k/mo · ON-PREM · $18k capex + $1.5k/mo · METHOD · heavy agent workload, tokens/mo held constant
BREAK-EVEN AT MONTH 4 ($24k) · 3-YEAR DELTA $144.0k
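The break-even math is simple enough to check by hand. A minimal sketch, using only the illustrative figures from the chart above (flat $6k/mo API spend vs an $18k rack plus $1.5k/mo opex):

```python
# Break-even month for the illustrative FIG. 00 figures:
# frontier API at a flat $6k/mo vs a rack at $18k capex + $1.5k/mo opex.
def break_even_month(api_monthly, capex, opex_monthly):
    """First month where cumulative API spend meets or exceeds on-prem spend."""
    month = 0
    api_total, onprem_total = 0.0, capex
    while api_total < onprem_total:
        month += 1
        api_total += api_monthly
        onprem_total += opex_monthly
    return month

print(break_even_month(6_000, 18_000, 1_500))  # -> 4
```

Cumulative API spend catches the rack at month 4, $24k on both sides, which is the break-even marker on the chart.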
≈ $0/tok
Marginal cost per token
100%
Workload kept on-prem
4-bit
Quantization, ~98.9% retained
1 rack
Your whole operation
§ ONP-00 · WHERE IT FITS / THE SUBSTRATE

On-prem is the substrate Little Bear runs on.

MSP keeps the lights on. FDE earns the context. ONP holds the weights, the graph, and the writeback. Little Bear is the surface the operator touches. Take any one out and the loop stops compounding.

§ ONP-01 · WHY IT HOLDS / THREE PILLARS

Sovereignty.
Fixed cost. Determinism.

On-prem is not an aesthetic preference. It is the only configuration that satisfies all three constraints at once for an operations-heavy mid-market shop.

01 / SOVEREIGNTY
Data stays on the floor.

Job records, invoices, payroll signals, client correspondence — none of it leaves the building. No vendor logs your prompts. No third-party model trains on your operation.

02 / FIXED COST
A line item, not a meter.

Capex once, opex flat. The same agent loop that bankrupts you on per-call APIs runs all night for the cost of the electricity. Budget like grown-ups.

03 / DETERMINISM
The plumbing is not probabilistic.

Models propose. MCP validates. SQL executes. The probabilistic surface is sandboxed inside a typed contract — it cannot send the wrong invoice to the wrong client because the schema will not let it.
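A hypothetical sketch of that contract in plain Python. The tool name, field names, and schema shape here are invented for illustration; the production gateway speaks MCP with real typed schemas, but the gate works the same way: a proposal that does not match the contract never reaches SQL.

```python
# Hypothetical typed tool contract: the model's proposed call is validated
# before anything executes. Field names and types are illustrative only,
# not the real MCP SDK or the real invoice schema.
INVOICE_SCHEMA = {
    "client_id": int,
    "amount_cents": int,
    "due_date": str,
}

def validate_call(args: dict, schema: dict) -> dict:
    """Reject any proposal with missing fields, extra fields, or wrong types."""
    if set(args) != set(schema):
        raise ValueError(f"fields {set(args) ^ set(schema)} do not match the contract")
    for field, expected in schema.items():
        if not isinstance(args[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return args  # only a schema-clean call reaches the SQL layer

# A well-formed proposal passes; a wrong-shaped one raises before execution.
validate_call({"client_id": 7, "amount_cents": 120_00, "due_date": "2026-03-01"}, INVOICE_SCHEMA)
```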

MSP × ON-PREM
COMPLEMENTARITY · FIG. 05

One number
to call.

The rack is not a side project for somebody else to babysit. The same engineers who watch your applications watch the rack — disk health, fan curves, model serving, weight versions, monthly operating review.

01
SAME TEAM
Application support and rack support, one roster of engineers.
02
SAME SLA
Disk fail, fan stall, vLLM crash — alerted, ticketed, fixed.
03
SAME BILL
The on-prem stack is an MSP service; we just rack the box too.
MSP ↔ ONP · ONE THROAT TO CHOKE
§ ONP-02 · THE MATH / TOKENOMICS · WHY ON-PREM

Continuous agents
are token-eaters.

An ontology that mutates on every job, every form, every approval cannot be funded by a per-call API. The first month works. The third month bleeds. The sixth month is a budget crisis dressed up as a roadmap question.

For a 5-person back office, the rack pays for itself before the second quarter. After that, the line is flat. The API line never is.

  • FIXED · Hardware is bought once. Power and swap parts are predictable. No surprise pricing email at 11pm on a Tuesday.
  • SOVEREIGN · Crew schedules, client invoices, equipment ledgers — none of it leaves the yard. Inference happens on a machine you own.
  • ALWAYS-ON · Background agents can run 24/7 without a meter spinning. The MAP phase mutates the graph as often as reality changes.
  • DETERMINISTIC · Tool calls validate against schemas at the MCP layer. The probabilistic part is sandboxed. The writeback is not.
FIG. 01 / CUMULATIVE COST · 24 MO · API rental vs on-prem rack · CROSSOVER · MONTH 3 · CAPEX · ~$9k
ASSUMPTIONS · API: $2.2k/mo, +8% MoM · ON-PREM: $9k capex + $220/mo opex · ILLUSTRATIVE, NOT A QUOTE
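Under the stated assumptions ($2.2k/mo growing 8% month over month vs $9k capex plus $220/mo), the crossover can be recomputed directly, with months indexed from M00 as on the chart axis:

```python
# Crossover month for the FIG. 01 assumptions: API at $2.2k/mo compounding
# 8% MoM vs on-prem at $9k capex + $220/mo opex. Months indexed from M00,
# matching the chart axis.
def crossover_month(api_start, growth, capex, opex):
    api_total, onprem_total, api_bill = 0.0, capex, api_start
    for month in range(120):
        api_total += api_bill
        onprem_total += opex
        if api_total >= onprem_total:
            return month
        api_bill *= 1 + growth
    return None

print(crossover_month(2_200, 0.08, 9_000, 220))  # -> 3, the M03 marker
```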
§ ONP-03 · THE STACK / SIX STRATA · L0 → L5

Six layers,
one closet.

The whole AI plumbing for an entire mid-market operation, stacked top to bottom. Open at every layer. Owned at every layer. Every layer is something we have racked before.

L5 · Little Bear: Operational ontology, decision intelligence, writeback execution. The product the operator actually sees. WEB · NEXT.JS
L4 · MCP Gateway: The USB-C port for AI. Every tool — invoicing, dispatch, document gen — exposed as a typed schema. Probabilistic in, deterministic out. MCP · TYPED
L3 · Graph + Vector Store: Adaptive GraphRAG over Neo4j, with a vector index for free-text recall. Relationships between 23 employees and their equipment stay structurally sound. NEO4J · QDRANT
L2 · Inference Server: vLLM on the workstation route, MLX on Apple silicon. Continuous batching, prefix caching, KV reuse. The static system prompt is paid for once. vLLM · MLX
L1 · Quantized Weights: Open-source model weights, 4-bit (W4A16), pinned to disk. Versioned. Auditable. Yours. GGUF · SAFETENSORS
L0 · Bare Metal: The Mac Studio or the tower. A surge protector. A UPS. A label maker. Locked in a closet that is already locked. STUDIO / TOWER
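Why L1 fits inside L0 is back-of-envelope arithmetic: 4-bit weights cost roughly half a byte per parameter. A rough sketch (weights only; KV cache and activations need additional headroom on top):

```python
# Back-of-envelope memory for 4-bit (W4A16) weights: 4 bits = 0.5 bytes
# per parameter. KV cache and activations are extra and not counted here.
def weight_gb(params_billion, bits=4):
    return params_billion * 1e9 * bits / 8 / 1e9  # GB for the weights alone

for size in (14, 32):
    print(f"{size}B @ 4-bit ≈ {weight_gb(size):.0f} GB of weights")
# 14B ≈ 7 GB, 32B ≈ 16 GB — both within reach of a single workstation GPU
# or a Mac Studio's unified memory, before KV-cache headroom.
```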
§ ONP-04 · HARDWARE / RIGHT-SIZED · TWO ROUTES

One rack, not
a data center.

A 5-person back office does not need a server room. A small, well-chosen box in a locked closet runs the whole stack. We pick the route that fits the shop, rack it, harden it, and cover it under MSP — same engineers, same on-call, one throat to choke.

ROUTE A · QUIET
SHELF · CLOSET · FIG. 03A

The desktop unit.

A single sealed box that sits on a shelf in the IT closet. Pulls less power than a space heater. No fan whine, no heat, no second mortgage on the electric bill. The right answer for most 5-to-50-person shops.

  • FORM · 1U-ish desktop
  • DRAW · Closet-friendly
  • NOISE · Library-quiet
LOW POWER · SINGLE UNIT · DESK FOOTPRINT
01 / 02
ROUTE B · HEADROOM
GPU · FLOOR · UNDER DESK · FIG. 03B

The workstation tower.

A standard tower with a serious GPU, the same shape as the box already under somebody's desk. Headroom for heavier workloads, hot-swappable parts, an upgrade path that doesn't require throwing the rack away.

  • FORM · Mid-tower
  • DRAW · Wall-circuit
  • PATH · Card-by-card upgrades
HEADROOM · UPGRADEABLE · STANDARD PARTS
02 / 02
On spec · FIG. 03 NOTE

Every install is sized to the actual workload — concurrent users, ontology size, agent loop frequency. We share the bill of materials with you. We do not share it on a public services page.

§ ONP-05 · MODELS / SPECIALISTS, NOT GENERALISTS · 04 ROLES

A small team of
specialists.

One enormous frontier model trying to do everything is the wrong shape for an ops backbone. We run a small ensemble of open-source specialists — each chosen for a single role, quantized for the rack, hot-swappable as the open ecosystem moves. The roles are stable. The weights underneath are ours to pick.

01 / CODING + TOOLS
The hands.

Writes the SQL, drafts the invoice, fills the form, calls the next tool. The model that turns an operator's intent into a typed action against your systems. Specialist-tuned for code and structured output.

TOOL USE · JSON · CODE
02 / REASONING + JSON
The planner.

Decomposes a fuzzy ask into a clean sequence of validated steps. Returns schema-compliant tool calls — the part the deterministic plumbing relies on. Best return on inference cost in the open ecosystem.

REASONING · JSON · PLANNING
03 / EDGE + COMMS
The fast one.

Resource-light work that runs constantly — log parsing, dispatch summaries, agent-to-agent messages, micro-classifications. Cheap to keep warm, fast to invoke, sized for always-on background loops.

EDGE · FAST · ALWAYS-ON
04 / EMBEDDINGS
The librarian.

Encodes the entire ontology into a local vector index for semantic recall. Re-embeds on a nightly cron without phoning home. The index is built, owned, and stored on the rack — never shipped to a third party.

RETRIEVAL · INDEX · LOCAL
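The mechanics of local recall reduce to nearest-neighbor search over vectors that never leave the box. A toy stdlib-only sketch, with hand-made three-dimensional vectors standing in for a real embedder and for Qdrant:

```python
import math

# Toy sketch of local semantic recall. The production stack uses a real
# vector store (Qdrant) and a learned embedding model; here, tiny hand-made
# vectors stand in so the mechanics stay visible.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

index = {
    "invoice #4412 overdue":  [0.9, 0.1, 0.0],
    "crew schedule, week 12": [0.1, 0.8, 0.2],
    "excavator service log":  [0.0, 0.2, 0.9],
}

def recall(query_vec, k=1):
    """Top-k entries by cosine similarity — all computed on the rack."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(recall([0.85, 0.15, 0.05]))  # -> ['invoice #4412 overdue']
```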
On weights · FIG. 04 NOTE

We track the open-source frontier weekly and re-quantize new releases into the same role slots. The MCP contract above the models does not change when the weights underneath do — your stack improves quietly, in place.

§ ONP-06 · INSTALL / WHAT FDE RACKS · 06 STEPS · ~5 WEEKS

What gets installed
during the embed.

The on-prem stack is not a separate engagement. It is part of the FDE embed. Our engineers rack the box the same week they start riding along. By the end of the embed, the rack is humming and Little Bear is mutating the graph.

01 · Site survey: Power draw, network closet, cooling, physical security. We map the room before we order the box. WEEK 01
02 · Hardware delivery: Mac Studio or workstation racked and labeled. UPS, surge, KVM, network drop wired in. WEEK 02
03 · Stack provision: OS hardened, vLLM or MLX serving live, weights pinned, MCP gateway configured against your tools. WEEK 03
04 · Ontology bootstrap: Initial graph populated from existing systems. Schedules, employees, equipment, clients connected. WEEK 04
05 · Writeback wired: First two MCP tools — invoice compose, schedule update — go live under team supervision. WEEK 05
06 · MSP handoff: Monitoring on the rack, on-call on the engineers, monthly operating review on the calendar. ONGOING
§ ONP-07 · FAQ / ON-PREM COMPUTE · 6 ANSWERS
Q-01 · What about model updates? Won't we miss the frontier?

You miss the frontier the way a working diesel truck misses a Formula 1 car. For ops work — schedule mutation, invoice compose, document parsing — quantized 14B–32B specialists clear the bar. We track new releases, re-quantize, and hot-swap weights without changing the MCP contract above them.

Q-02 · Who covers the rack when something breaks?

Our MSP does. The same engineers who watch your applications watch the rack. Disk fails, fan stalls, vLLM crashes — alerted, ticketed, fixed. You are not running an in-house data-center team.

Q-03 · Does this mean we never use frontier APIs?

No. For long-tail tasks where Sonnet-class horsepower is genuinely needed, the gateway can fall through to a frontier API under explicit policy. The default is local. The exception is logged.
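A hypothetical sketch of that policy gate. The task names and allow-list here are invented; the shape is the point: local by default, frontier only on an explicit list, every exception leaving a log line.

```python
# Hypothetical routing gate: local is the default, a frontier fallthrough is
# taken only under explicit policy, and every exception is logged. The task
# names and allow-list are illustrative, not a real configuration.
ALLOW_FRONTIER = {"long_context_legal_review"}  # policy list, set per client
audit_log = []

def route(task: str) -> str:
    if task in ALLOW_FRONTIER:
        audit_log.append(f"FRONTIER fallthrough: {task}")
        return "frontier_api"
    return "local_rack"

assert route("invoice_compose") == "local_rack"          # default: stays home
assert route("long_context_legal_review") == "frontier_api"
print(audit_log)  # only the exception leaves a trace; the default never does
```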

Q-04 · Why open-source models specifically?

Three reasons: weights you can pin and audit, no surprise pricing email, no surprise license change. If a vendor pivots, your stack does not.

Q-05 · Can the rack actually keep up with always-on agents?

With prefix caching and continuous batching, yes. Static system prompts and stable ontology context get cached, so the per-mutation token cost approaches the cost of the diff. That is what makes background loops viable.
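The arithmetic behind "the cost of the diff," with illustrative token counts (a 6k-token static prefix, a 150-token mutation per loop iteration):

```python
# Illustrative arithmetic for prefix caching: the static system prompt plus
# stable ontology context is prefilled once and reused, so each background
# mutation only pays for its diff. Token counts are made-up round numbers.
STATIC_PREFIX = 6_000   # tokens: system prompt + ontology context
DIFF = 150              # tokens: the actual change per loop iteration

def tokens_prefilled(calls, cached):
    """Total prompt tokens actually computed across `calls` iterations."""
    if cached:
        return STATIC_PREFIX + calls * DIFF   # prefix paid once
    return calls * (STATIC_PREFIX + DIFF)     # prefix paid on every call

uncached = tokens_prefilled(1_000, cached=False)  # 6,150,000 tokens
cached = tokens_prefilled(1_000, cached=True)     #   156,000 tokens
print(f"~{uncached // cached}x fewer prompt tokens with the prefix cached")
```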

Q-06 · What happens to the rack when we outgrow it?

The MCP contract is the same regardless of what is behind it. Add a second card, a second box, or a second site — the tool layer does not change. The rack scales without rewriting Little Bear.

Your data.
Your weights.
Your floor.

We will spec the rack for your shop, install it under the FDE embed, and hand it to MSP for ongoing coverage. Fixed cost. No vendor lock. Little Bear humming on hardware you own.

Spec a rack