The right hardware for the right phase
AI inference is not one workload. Prefill demands compute. Decode demands bandwidth. Context demands capacity. Serving demands availability. PAMA™ assigns each phase to its natural hardware — nothing wasted, nothing idle.
Inference is four workloads, not one
A single AI request passes through phases with opposite hardware demands. Homogeneous clusters force every node to compromise.
The dominant model for AI infrastructure treats inference as a monolithic operation — prompt in, tokens out. This abstraction works for API consumers but is catastrophic for hardware architects.
Every node in a homogeneous cluster is provisioned for the most demanding phase. A node optimized for prefill compute wastes its TFLOPS during decode; a node sized for massive memory capacity leaves that capacity idle during prefill.
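The compute-versus-bandwidth split falls out of a simple arithmetic-intensity estimate. A minimal sketch (the model size and prompt length are illustrative, and KV-cache traffic is ignored for simplicity):

```python
def arithmetic_intensity(params_b: float, tokens: int) -> float:
    """FLOPs per byte of weight traffic for one forward pass over `tokens` tokens.

    A dense transformer performs roughly 2 * params FLOPs per token, while the
    fp16 weights (2 bytes/param) are read once per pass no matter how many
    tokens are batched into it.
    """
    flops = 2 * params_b * 1e9 * tokens
    weight_bytes = 2 * params_b * 1e9
    return flops / weight_bytes

# Prefill batches the whole prompt into one pass; decode emits one token per step.
prefill = arithmetic_intensity(params_b=70, tokens=2048)  # 2048 FLOPs/byte: compute-bound
decode = arithmetic_intensity(params_b=70, tokens=1)      # 1 FLOP/byte: bandwidth-bound
```

Modern GPUs sustain hundreds of FLOPs per byte of memory traffic, so prefill saturates compute while decode saturates bandwidth: two workloads, opposite bottlenecks.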
PAMA™ recognizes this and structures the physical topology — its interconnects, memory hierarchies, and node roles — around the phase-specific demands of inference.
Every tier exists for a reason
TRXGX™ handles compute. Mac handles memory. NUCLEUS handles serving. The topology is the memory architecture.
TRXGX™
GPU workstation with RTX PRO 6000 Blackwell. Handles the compute-heavy prefill phase, RAG pipeline execution, embedding, reranking, and model orchestration.
Prefill · RAG Pipeline · Orchestration
Mac Studio
Connects via Thunderbolt 5 and becomes TRXGX™'s memory extension. Holds KV cache, stages model weights, and manages multi-session context — all at local PCIe latency.
KV Cache · Model Staging · Context
NUCLEUS
DGX Spark satellites on the QSFP fabric. Always-on, low-power production endpoints that serve quantized models with predictable latency and dedicated resources.
Production Endpoints · Always-On · RDMA
The cable is the architecture
Each interconnect type exists because a specific phase demands a specific data-movement pattern. Three cables, under ten minutes.
Four levels, mapped to hardware
Each level has decreasing bandwidth but increasing capacity — exactly the access pattern of autoregressive decoding.
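That access pattern can be read as a placement policy: hot KV pages live at the fast, small levels and spill downward as sessions accumulate. A sketch, where the tier names follow this page but the bandwidth and capacity figures are illustrative placeholders, not vendor specs:

```python
from dataclasses import dataclass

@dataclass
class MemoryLevel:
    name: str
    bandwidth_gbs: float  # illustrative, not a measured or vendor figure
    capacity_gb: float    # illustrative

# Bandwidth falls and capacity grows down the hierarchy -- the shape
# autoregressive decode wants: recent KV near compute, cold context far away.
HIERARCHY = [
    MemoryLevel("GPU VRAM (TRXGX)", 1500.0, 96.0),
    MemoryLevel("Host RAM (TRXGX)", 300.0, 256.0),
    MemoryLevel("Mac unified memory (TB5)", 100.0, 192.0),
    MemoryLevel("NUCLEUS satellite (QSFP)", 25.0, 128.0),
]

def place(size_gb: float, used: dict) -> str:
    """Place a KV-cache block at the fastest level with free capacity."""
    for level in HIERARCHY:
        if used.get(level.name, 0.0) + size_gb <= level.capacity_gb:
            return level.name
    raise MemoryError("hierarchy exhausted")
```

For example, a 50 GB cache block lands in VRAM while VRAM has room, and spills to host RAM once it does not.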
Heterogeneous by design
Homogeneous clusters scale one dimension at a time. PAMA™ scales each phase independently.
Homogeneous Cluster
- Every node provisioned for most demanding phase
- Compute sits idle during decode
- Memory underutilized during prefill
- Cannot run RAG alongside serving
- Scaling one dimension wastes budget on others
- Single-tier memory constrains model size
PAMA™ Architecture
- Each tier assigned to its optimal phase
- Compute isolated to prefill engine
- Memory capacity scales independently
- Dedicated RAG + dedicated serving simultaneously
- Upgrade targets specific bottleneck
- Four-level memory hierarchy spans all tiers
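The tier-per-phase assignment above reduces to a dispatch table. A sketch, with tier roles taken from this page's descriptions; a production scheduler would also weigh load and data locality, not phase alone:

```python
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"  # compute-heavy: whole prompt in one batched pass
    DECODE = "decode"    # bandwidth-heavy: one token per step
    CONTEXT = "context"  # capacity-heavy: KV cache across sessions
    SERVE = "serve"      # availability-heavy: always-on endpoints

TIER_FOR_PHASE = {
    Phase.PREFILL: "TRXGX",       # GPU workstation: prefill, RAG, orchestration
    Phase.DECODE: "TRXGX",        # decode on TRXGX, streaming KV over TB5
    Phase.CONTEXT: "Mac Studio",  # TB5 memory extension: KV cache, staging
    Phase.SERVE: "NUCLEUS",       # QSFP satellites: quantized production models
}

def tier_for(phase: Phase) -> str:
    """Route a phase to its assigned tier."""
    return TIER_FOR_PHASE[phase]
```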
Start small, scale the right dimension
PAMA™ is designed for incremental investment. Every dollar carries forward.
NUCLEUS Only
Start with one or two DGX Spark nodes. Serve quantized models locally. Replace cloud API spend with a one-time investment. Prove the concept.
Payback in months · Zero cloud dependency
Add TRXGX™ + Mac
Plug TRXGX™ into existing NUCLEUS via QSFP. Absorb your Mac via TB5. The full PAMA hierarchy activates: prefill compute, memory extension, production serving.
RAG · Fine-tuning · Multi-model routing
Expand the Fleet
Add Mac memory, add NUCLEUS satellites, upgrade TRXGX™ GPUs. Each investment targets a specific phase bottleneck. No forklift upgrades.
Full PAMA hierarchy · Enterprise capacity
PAMA™ serves the data you can't send to the cloud
Privileged documents. Patient records. Financial data. Trade secrets. PAMA™ keeps them local.
Legal
TRXGX runs the RAG pipeline over the firm's document corpus. Mac holds full KV cache for long-context legal review. NUCLEUS serves attorneys 24/7. All data stays on-premises — fully compliant with attorney-client privilege.
Healthcare
TRXGX processes clinical notes and imaging reports. Mac's memory extension holds patient context across multi-turn conversations. NUCLEUS serves providers at point of care. Zero PHI leaves the network.
Financial Services
TRXGX runs real-time document analysis on earnings calls and SEC filings. Mac holds concurrent analyst sessions. NUCLEUS serves the research desk with predictable tail latency. No MNPI exposure.
Manufacturing
TRXGX fine-tunes models on proprietary documentation and CAD metadata. Mac stages multiple domain-specific models. NUCLEUS serves the shop floor assistant. IP never leaves the building.
Ready to build with Phase Aware Memory Architecture?
Tell us about your infrastructure needs. For implementation, configuration, and deployment, our team at Johannes AI™ will design the right solution.