Phase · Aware · Memory · Architecture

The right hardware
for the right phase

AI inference is not one workload. Prefill demands compute. Decode demands bandwidth. Context demands capacity. Serving demands availability. PAMA™ assigns each phase to its natural hardware — nothing wasted, nothing idle.


The Problem

Inference is four workloads, not one

A single AI request passes through phases with opposite hardware demands. Homogeneous clusters force every node to compromise.

The dominant model for AI infrastructure treats inference as a monolithic operation — prompt in, tokens out. This abstraction works for API consumers but is catastrophic for hardware architects.

Every node in a homogeneous cluster is provisioned for the most demanding phase. A node optimized for prefill compute wastes its TFLOPS during decode. A node with massive memory capacity underutilizes that capacity during prefill.

PAMA™ recognizes this and structures the physical topology — its interconnects, memory hierarchies, and node roles — around the phase-specific demands of inference.

Phase | Bottleneck | Key Metric

Prefill | Compute (TFLOPS) | Time to First Token

Decode | Memory Bandwidth | Tokens / Second

Context Mgmt | Memory Capacity | Concurrent Sessions

Serving | Availability / Latency | P99 Latency, Uptime
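The split between a compute-bound prefill and a bandwidth-bound decode can be illustrated with a back-of-the-envelope estimate. The sketch below is not a PAMA™ tool. It assumes the common approximations of roughly 2 FLOPs per parameter per prompt token for prefill, and one full read of the weights per generated token for decode. The `Node` type and every number in it are hypothetical placeholders, not product specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node description; the figures used below are placeholders."""
    peak_tflops: float        # dense compute throughput, in TFLOPS
    mem_bandwidth_gbs: float  # memory bandwidth, in GB/s


def prefill_seconds(node: Node, params_billion: float, prompt_tokens: int) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * params_billion * 1e9 * prompt_tokens
    return flops / (node.peak_tflops * 1e12)


def decode_tokens_per_second(node: Node, params_billion: float,
                             bytes_per_param: float) -> float:
    """Bandwidth-bound estimate: each generated token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return node.mem_bandwidth_gbs * 1e9 / weight_bytes


# Assumed example: 10B-parameter model, 16-bit weights, 1000-token prompt.
node = Node(peak_tflops=100.0, mem_bandwidth_gbs=1000.0)
ttft = prefill_seconds(node, params_billion=10.0, prompt_tokens=1000)
tps = decode_tokens_per_second(node, params_billion=10.0, bytes_per_param=2.0)
print(f"prefill ~{ttft:.2f} s, decode ~{tps:.0f} tokens/s")
```

In this model, doubling `peak_tflops` halves the prefill estimate but leaves the decode rate unchanged, and doubling `mem_bandwidth_gbs` does the reverse. Real systems also depend on batching, kernel efficiency, and KV cache traffic, which this estimate ignores.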

Three Tiers

Every tier exists for a reason

TRXGX™ handles compute. Mac handles memory. NUCLEUS handles serving. The topology is the memory architecture.

Tier 1 — Compute

TRXGX™

GPU workstation with RTX PRO 6000 Blackwell. Handles the compute-heavy prefill phase, RAG pipeline execution, embedding, reranking, and model orchestration.

Phase Assignment

Prefill processes the full input prompt in parallel, so it is compute-bound and demands maximum TFLOPS. TRXGX™'s GPUs supply that compute, their VRAM delivers the highest bandwidth in the cluster for parallel token processing, and dedicated GPUs run the RAG sub-pipeline.

Prefill · RAG Pipeline · Orchestration

Tier 2 — Memory

Mac Studio

Connects via Thunderbolt 5 and becomes TRXGX™'s memory extension. Holds KV cache, stages model weights, and manages multi-session context — all at local PCIe latency.

Phase Assignment

Decode and context management are bandwidth-and-capacity bound. Mac's unified memory pool absorbs KV cache overflow, enables fast model switching, and holds concurrent session state — freeing VRAM entirely for active attention and weights.

KV Cache · Model Staging · Context
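To see why KV cache capacity becomes the limit on concurrent sessions, it helps to size the cache for one session. The sketch below uses the standard formula for a transformer KV cache (keys plus values, for every layer, KV head, and token). The model shape and the 64 GiB memory budget are hypothetical values chosen for illustration and do not describe any specific model or Mac configuration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def sessions_that_fit(capacity_gib: float, per_session_bytes: int) -> int:
    """Whole sessions that fit in a memory budget given in GiB."""
    return int(capacity_gib * 1024**3 // per_session_bytes)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
# 16-bit cache entries, 32,768-token context.
per_session = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768)
budget_gib = 64.0  # assumed share of unified memory reserved for KV cache

print(f"{per_session / 1024**3:.1f} GiB per session")
print(f"{sessions_that_fit(budget_gib, per_session)} sessions fit in {budget_gib:.0f} GiB")
```

Under these assumptions a single long-context session needs 4 GiB of cache, so session count scales directly with the capacity of the memory tier rather than with GPU compute.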

Tier 3 — Serving

NUCLEUS

DGX Spark satellites on the QSFP fabric. Always-on, low-power production endpoints that serve quantized models with predictable latency and dedicated resources.

Phase Assignment

Production serving demands availability and tail-latency predictability, not peak compute. NUCLEUS nodes are stateless from TRXGX™'s perspective — add or remove them without disrupting the memory hierarchy. New models deploy in seconds over the 100G fabric.

Production Endpoints · Always-On · RDMA
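The deploy-in-seconds claim follows from simple arithmetic on the 100G link. The sketch below computes the best-case transfer time for a set of model weights. The 20 GB model size and the `efficiency` parameter are assumptions for illustration; real transfers also pay for protocol overhead, storage reads, and model loading, none of which are modeled here.

```python
def transfer_seconds(model_gigabytes: float, link_gbit_per_s: float = 100.0,
                     efficiency: float = 1.0) -> float:
    """Time to move a model over a link; efficiency=1.0 means full line rate."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * efficiency
    return model_gigabytes * 1e9 / bytes_per_second


# Hypothetical 20 GB quantized model over a 100 Gbit/s link.
best_case = transfer_seconds(20.0)
derated = transfer_seconds(20.0, efficiency=0.8)  # assumed 80% of line rate
print(f"line rate: {best_case:.1f} s, at 80% efficiency: {derated:.1f} s")
```

At line rate, 100 Gbit/s is 12.5 GB/s, so a model in the tens of gigabytes moves in a few seconds under these assumptions.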

Interconnect Topology

The cable is the architecture

Each interconnect type exists because a specific phase demands a specific data-movement pattern. Three cables connect the tiers, and cabling takes under ten minutes.

Memory Hierarchy

Four levels, mapped to hardware

Each level has decreasing bandwidth but increasing capacity — exactly the access pattern of autoregressive decoding.

L1 — VRAM

TRXGX™ GPU VRAM

Active weights and attention. Fastest path in the cluster.

GDDR7 · Peak Bandwidth · PCIe Gen5 Internal

L2 — Unified

Mac Unified Memory

KV cache overflow and model staging. Appears to TRXGX™ as a local extension via TB5.

Unified Memory · High Bandwidth · Thunderbolt 5

L3 — System

TRXGX™ System RAM

Vector database, orchestration state, and cold session storage. NVMe-backed restore on demand.

DDR5 ECC · Moderate Bandwidth · Local

L4 — Serving

NUCLEUS Unified Memory

Serving-model weights for production endpoints. Stateless, independently addressable.

Coherent Memory · RDMA · 100GbE QSFP56
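One way to picture how data spills down the hierarchy is a fastest-first placement rule: put each block in the fastest level that still has room. The sketch below is an illustration only, not the PAMA™ scheduler. The capacities and block sizes are hypothetical, and L4 is left out on the assumption that serving memory is addressed independently rather than used as overflow.

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str
    capacity_gb: float  # hypothetical capacity, not a product specification
    used_gb: float = 0.0

    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb


def place(levels: list[Level], size_gb: float) -> str:
    """Put a block in the first (fastest) level that still has room for it."""
    for level in levels:
        if level.free_gb() >= size_gb:
            level.used_gb += size_gb
            return level.name
    raise MemoryError(f"no level can hold {size_gb} GB")


# Ordered fastest to slowest, following L1 to L3 above.
hierarchy = [
    Level("L1 VRAM", capacity_gb=96),
    Level("L2 Mac unified memory", capacity_gb=256),
    Level("L3 system RAM", capacity_gb=512),
]

# Hypothetical blocks (weights, KV cache, cold sessions), sizes in GB.
placements = [place(hierarchy, size) for size in (60, 30, 40, 200, 100)]
for name in placements:
    print(name)
```

With these numbers the first two blocks land in VRAM, the next two overflow to the Mac's unified memory, and the last falls through to system RAM. A production policy would also weigh access frequency and eviction, which this sketch omits.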

Why PAMA™

Heterogeneous by design

Homogeneous clusters scale one dimension at a time. PAMA™ scales each phase independently.

Homogeneous Cluster

Every node is provisioned for the most demanding phase
Compute sits idle during decode
Memory is underutilized during prefill
RAG and serving cannot run side by side
Scaling one dimension wastes budget on the others
Single-tier memory constrains model size

PAMA™ Architecture

Each tier is assigned to its optimal phase
Compute is isolated to the prefill engine
Memory capacity scales independently
Dedicated RAG and dedicated serving run simultaneously
Each upgrade targets a specific bottleneck
Four-level memory hierarchy spans all tiers

Customer Journey

Start small, scale the right dimension

PAMA™ is designed for incremental investment. Every dollar carries forward.

Phase 1 — Prove

NUCLEUS Only

Start with one or two DGX Spark nodes. Serve quantized models locally. Replace cloud API spend with a one-time investment. Prove the concept.

Payback in months · Zero cloud dependency

Phase 2 — Platform

Add TRXGX™ + Mac

Plug TRXGX™ into your existing NUCLEUS nodes via QSFP. Attach your Mac via Thunderbolt 5. The full PAMA™ hierarchy activates: prefill compute, memory extension, production serving.

RAG · Fine-tuning · Multi-model routing

Phase 3 — Scale

Expand the Fleet

Add Mac memory, add NUCLEUS satellites, upgrade TRXGX™ GPUs. Each investment targets a specific phase bottleneck. No forklift upgrades.

Full PAMA hierarchy · Enterprise capacity

Verticals

PAMA™ serves the data you can't send to the cloud

Privileged documents. Patient records. Financial data. Trade secrets. PAMA™ keeps them local.

⚖️

Legal

TRXGX™ runs the RAG pipeline over the firm's document corpus. Mac holds the full KV cache for long-context legal review. NUCLEUS serves attorneys 24/7. All data stays on-premises, which helps preserve attorney-client privilege.

🏥

Healthcare

TRXGX processes clinical notes and imaging reports. Mac's memory extension holds patient context across multi-turn conversations. NUCLEUS serves providers at point of care. Zero PHI leaves the network.

📈

Financial Services

TRXGX runs real-time document analysis on earnings calls and SEC filings. Mac holds concurrent analyst sessions. NUCLEUS serves the research desk with predictable tail latency. No MNPI exposure.

🏭

Manufacturing

TRXGX fine-tunes models on proprietary documentation and CAD metadata. Mac stages multiple domain-specific models. NUCLEUS serves the shop floor assistant. IP never leaves the building.

Contact

Ready to build with
Phase Aware Memory Architecture?

Tell us about your infrastructure needs. For implementation, configuration, and deployment, our team at Johannes AI™ will design the right solution.

PAMA™ is the foundation of Johannes AI™ infrastructure · Mobile deployment via Phantorex™ · A Telio™ technology
