Phase · Aware · Memory · Architecture

The right hardware
for the right phase

AI inference is not one workload. Prefill demands compute. Decode demands bandwidth. Context demands capacity. Serving demands availability. PAMA™ assigns each phase to its natural hardware — nothing wasted, nothing idle.

4 Inference Phases · 3 Hardware Tiers · 1 Architecture

The Problem

Inference is four workloads, not one

A single AI request passes through phases with opposite hardware demands. Homogeneous clusters force every node to compromise.

The dominant model for AI infrastructure treats inference as a monolithic operation — prompt in, tokens out. This abstraction works for API consumers but is catastrophic for hardware architects.

Every node in a homogeneous cluster is provisioned for the most demanding phase. A node optimized for prefill compute wastes its TFLOPS during decode. A node with massive memory capacity underutilizes that capacity during prefill.

PAMA™ recognizes this and structures the physical topology — its interconnects, memory hierarchies, and node roles — around the phase-specific demands of inference.
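The gap between prefill and decode can be illustrated with a simple roofline estimate. This is a minimal sketch, not a measurement: the accelerator figures (`PEAK_TFLOPS`, `BANDWIDTH_TB_S`) and the intensity values are hypothetical and chosen only to show the shape of the problem.

```python
def attainable_tflops(intensity_flops_per_byte, peak_tflops, bandwidth_tb_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_tflops, intensity_flops_per_byte * bandwidth_tb_s)

# Hypothetical accelerator, numbers chosen only for illustration.
PEAK_TFLOPS = 100.0     # dense compute ceiling
BANDWIDTH_TB_S = 1.0    # memory bandwidth

# With 16-bit weights, each parameter is 2 bytes and costs about 2 FLOPs per
# token, so intensity is roughly the number of tokens served per weight read.
prefill_intensity = 2048.0  # 2048 prompt tokens share one pass over the weights
decode_intensity = 1.0      # one new token per pass at batch size 1

prefill = attainable_tflops(prefill_intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
decode = attainable_tflops(decode_intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)

print(f"prefill: {prefill:.0f} TFLOPS ({prefill / PEAK_TFLOPS:.0%} of peak)")
print(f"decode:  {decode:.0f} TFLOPS ({decode / PEAK_TFLOPS:.0%} of peak)")
```

Under these assumptions prefill reaches the compute ceiling while single-stream decode is limited by memory bandwidth, which is the mismatch a homogeneous node has to absorb.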

Phase          Bottleneck               Key Metric
Prefill        Compute (TFLOPS)         Time to First Token
Decode         Memory Bandwidth         Tokens / Second
Context Mgmt   Memory Capacity          Concurrent Sessions
Serving        Availability / Latency   P99 Latency, Uptime

Three Tiers

Every tier exists for a reason

TRXGX™ handles compute. Mac handles memory. NUCLEUS handles serving. The topology is the memory architecture.
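The phase-to-tier assignment can be written down as a small lookup table. This is a sketch for illustration only: the tier names and bottlenecks come from the text, while the `Phase` enum, the `PHASE_TABLE` layout, and the `tier_for` helper are hypothetical names, not part of any PAMA™ software.

```python
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"
    CONTEXT = "context management"
    SERVING = "serving"

# Bottlenecks and tiers follow the phase assignments described in the text;
# the dictionary structure itself is an assumption made for this sketch.
PHASE_TABLE = {
    Phase.PREFILL: {"bottleneck": "compute", "tier": "TRXGX"},
    Phase.DECODE: {"bottleneck": "memory bandwidth", "tier": "Mac Studio"},
    Phase.CONTEXT: {"bottleneck": "memory capacity", "tier": "Mac Studio"},
    Phase.SERVING: {"bottleneck": "availability", "tier": "NUCLEUS"},
}

def tier_for(phase):
    """Return the hardware tier assigned to a given inference phase."""
    return PHASE_TABLE[phase]["tier"]

for phase in Phase:
    print(f"{phase.value:20s} -> {tier_for(phase)}")
```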

Tier 1 — Compute

TRXGX™

GPU workstation with RTX PRO 6000 Blackwell. Handles the compute-heavy prefill phase, RAG pipeline execution, embedding, reranking, and model orchestration.

Phase Assignment
Prefill processes the full input prompt in parallel. It is compute-bound and demands maximum TFLOPS. TRXGX™'s multi-GPU configuration supplies that compute, with VRAM bandwidth to keep parallel token processing fed and dedicated GPUs for the RAG sub-pipeline.
Prefill · RAG Pipeline · Orchestration
Tier 2 — Memory

Mac Studio

Connects via Thunderbolt 5 and acts as TRXGX™'s memory extension. Holds KV cache, stages model weights, and manages multi-session context over a direct point-to-point link.

Phase Assignment
Decode and context management are bandwidth-and-capacity bound. Mac's unified memory pool absorbs KV cache overflow, enables fast model switching, and holds concurrent session state — freeing VRAM entirely for active attention and weights.
KV Cache · Model Staging · Context
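To see why KV cache becomes a capacity problem, it helps to size it. The sketch below uses the standard per-token formula for a transformer KV cache; the model shape (80 layers, 8 KV heads of dimension 128, 16-bit values) is an illustrative assumption, not a model named in the text.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2, sessions=1):
    """Keys and values (the factor 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * sessions

GIB = 1024 ** 3

# Illustrative model shape (assumed): 80 layers, 8 KV heads, head dimension 128.
per_session = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=32_768)
fleet = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=32_768, sessions=16)

print(f"{per_session / GIB:.1f} GiB for one 32k-token session")
print(f"{fleet / GIB:.1f} GiB for 16 concurrent sessions")
```

With this assumed shape, one long session needs about 10 GiB and sixteen need about 160 GiB, which is the kind of growth a separate capacity tier is meant to absorb.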
Tier 3 — Serving

NUCLEUS

DGX Spark satellites on the QSFP fabric. Always-on, low-power production endpoints that serve quantized models with predictable latency and dedicated resources.

Phase Assignment
Production serving demands availability and tail-latency predictability, not peak compute. NUCLEUS nodes are stateless from TRXGX™'s perspective — add or remove them without disrupting the memory hierarchy. New models deploy in seconds over the 100G fabric.
Production Endpoints · Always-On · RDMA
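The "deploys in seconds" claim can be sanity-checked with a transfer-time estimate. This is a rough sketch: the 20 GB model size and the 80% link efficiency are assumptions for illustration, and real deployments also spend time loading weights into memory.

```python
def transfer_seconds(size_gb, link_gbit_s, efficiency=0.8):
    """Seconds to move size_gb gigabytes over a link at a usable fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_gb * 8 / (link_gbit_s * efficiency)

# Hypothetical quantized model of 20 GB over the 100G fabric.
seconds = transfer_seconds(size_gb=20, link_gbit_s=100)
print(f"{seconds:.1f} s to push a 20 GB model")
```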
Interconnect Topology

The cable is the architecture

Each interconnect exists because a specific phase demands a specific data-movement pattern. Three cables, connected in under ten minutes.

[Diagram: Mac (memory) linked to TRXGX (compute) over TB5; TRXGX linked to two NUCLEUS nodes over QSFP56.]

Memory Hierarchy

Four levels, mapped to hardware

Each level down trades bandwidth for capacity. That suits autoregressive decoding, where a small working set is read on every token and a much larger body of state is touched only occasionally.

L1 — VRAM

TRXGX™ GPU VRAM

Active weights and attention. Fastest path in the cluster.

GDDR7 · Peak Bandwidth · PCIe Gen5 Internal
L2 — Unified

Mac Unified Memory

KV cache overflow and model staging. Appears to TRXGX™ as a local extension via TB5.

Unified Memory · High Bandwidth · Thunderbolt 5
L3 — System

TRXGX™ System RAM

Vector database, orchestration state, and cold session storage. NVMe-backed restore on demand.

DDR5 ECC · Moderate Bandwidth · Local
L4 — Serving

NUCLEUS Unified Memory

Serving-model weights for production endpoints. Stateless, independently addressable.

Coherent Memory · RDMA · 100GbE QSFP56
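One way to picture the hierarchy is as a greedy placement problem: the hottest data goes to the fastest level that still has room. The sketch below is an assumption-laden illustration, not PAMA™ software: the `Level` class, the `place` function, the capacities, and the heat scores are all hypothetical, and L4 is left out because the text reserves it for serving endpoints.

```python
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    capacity_gb: float
    used_gb: float = 0.0

    def fits(self, size_gb):
        return self.used_gb + size_gb <= self.capacity_gb

def place(items, levels):
    """Assign (name, size_gb, heat) items to levels, hottest first, fastest level first."""
    placement = {}
    for name, size_gb, _heat in sorted(items, key=lambda item: item[2], reverse=True):
        for level in levels:  # levels are ordered fastest to slowest
            if level.fits(size_gb):
                level.used_gb += size_gb
                placement[name] = level.name
                break
        else:
            placement[name] = None  # no level has room
    return placement

# Hypothetical capacities and workload, for illustration only.
levels = [Level("L1 VRAM", 96), Level("L2 Mac unified", 256), Level("L3 system RAM", 512)]
items = [
    ("weights", 70, 1.0),
    ("active_kv", 20, 0.9),
    ("kv_overflow", 120, 0.5),
    ("vector_db", 200, 0.2),
]
placement = place(items, levels)
for name, level_name in placement.items():
    print(f"{name:12s} -> {level_name}")
```

With these assumed sizes, weights and active KV land in VRAM, KV overflow lands in Mac unified memory, and the vector database lands in system RAM, matching the roles described above.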
Why PAMA™

Heterogeneous by design

Homogeneous clusters scale every dimension together. PAMA™ scales each phase independently.

Homogeneous Cluster

  • Every node is provisioned for the most demanding phase
  • Compute sits idle during decode
  • Memory capacity is underutilized during prefill
  • Cannot run RAG alongside serving
  • Scaling one dimension wastes budget on the others
  • Single-tier memory constrains model size

PAMA™ Architecture

  • Each tier assigned to its optimal phase
  • Compute isolated to prefill engine
  • Memory capacity scales independently
  • Dedicated RAG + dedicated serving simultaneously
  • Upgrade targets specific bottleneck
  • Four-level memory hierarchy spans all tiers
Customer Journey

Start small, scale the right dimension

PAMA™ is designed for incremental investment. Every dollar carries forward.

Phase 1 — Prove

NUCLEUS Only

Start with one or two DGX Spark nodes. Serve quantized models locally. Replace cloud API spend with a one-time investment. Prove the concept.

Payback in months · Zero cloud dependency
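The payback period is simple arithmetic once the inputs are known. The sketch below is only a template: the function name `payback_months` and every figure in the example are hypothetical, and actual costs depend on the deployment.

```python
def payback_months(hardware_cost, monthly_cloud_spend, monthly_running_cost=0.0):
    """Months until a one-time hardware purchase is offset by avoided cloud spend."""
    monthly_saving = monthly_cloud_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None  # the hardware never pays for itself on these numbers
    return hardware_cost / monthly_saving

# Hypothetical figures in a single currency unit, for illustration only.
months = payback_months(hardware_cost=8000, monthly_cloud_spend=1100, monthly_running_cost=100)
print(f"payback in {months:.1f} months")
```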
Phase 2 — Platform

Add TRXGX™ + Mac

Connect TRXGX™ to your existing NUCLEUS nodes via QSFP. Attach your Mac via TB5. The full PAMA™ hierarchy activates: prefill compute, memory extension, production serving.

RAG · Fine-tuning · Multi-model routing
Phase 3 — Scale

Expand the Fleet

Add Mac memory, add NUCLEUS satellites, upgrade TRXGX™ GPUs. Each investment targets a specific phase bottleneck. No forklift upgrades.

Full PAMA hierarchy · Enterprise capacity
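Targeting an upgrade at a specific bottleneck can be sketched as a lookup from the most pressured phase to the matching investment. The upgrade options are the ones named above; the `next_upgrade` function, the pressure ratios, and the example readings are hypothetical.

```python
# Upgrade paths named in the text, keyed by the phase they relieve.
UPGRADE_FOR_PHASE = {
    "prefill": "upgrade TRXGX GPUs",
    "decode": "add Mac memory",
    "context": "add Mac memory",
    "serving": "add NUCLEUS satellites",
}

def next_upgrade(pressure):
    """pressure maps phase -> ratio normalized so that above 1.0 means a missed target."""
    phase = max(pressure, key=pressure.get)
    if pressure[phase] <= 1.0:
        return None  # every phase is meeting its target
    return phase, UPGRADE_FOR_PHASE[phase]

# Hypothetical readings: context management is the furthest over its target.
readings = {"prefill": 0.7, "decode": 0.9, "context": 1.4, "serving": 1.1}
print(next_upgrade(readings))
```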
Verticals

PAMA™ serves the data you can't send to the cloud

Privileged documents. Patient records. Financial data. Trade secrets. PAMA™ keeps them local.

⚖️

Legal

TRXGX runs the RAG pipeline over the firm's document corpus. Mac holds full KV cache for long-context legal review. NUCLEUS serves attorneys 24/7. All data stays on-premises, which helps preserve attorney-client privilege.

🏥

Healthcare

TRXGX processes clinical notes and imaging reports. Mac's memory extension holds patient context across multi-turn conversations. NUCLEUS serves providers at point of care. Zero PHI leaves the network.

📈

Financial Services

TRXGX runs real-time document analysis on earnings calls and SEC filings. Mac holds concurrent analyst sessions. NUCLEUS serves the research desk with predictable tail latency. No MNPI exposure.

🏭

Manufacturing

TRXGX fine-tunes models on proprietary documentation and CAD metadata. Mac stages multiple domain-specific models. NUCLEUS serves the shop floor assistant. IP never leaves the building.

Contact

Ready to build with
Phase Aware Memory Architecture?

Tell us about your infrastructure needs. For implementation, configuration, and deployment, the Johannes AI™ team will design the right solution. For mobile deployment, visit Phantorex™.

PAMA™ is the foundation of Johannes AI™ infrastructure · Mobile deployment via Phantorex™ · A Telio™ technology