Description
This one is in collaboration with Gemini Pro - for throughput planning.
see ObrienlabsDev/blog#153
https://gemini.google.com/app/ffe7490def8a744e
"
how do I directly program the neural processing units in apple silicon on for example the M4max or M3Ultra or the tensor cores in a GB10 on an nvidia dgx spark
"
The short answer is that while you can program NVIDIA Tensor Cores explicitly and down to the assembly level, you strictly cannot program Apple's Neural Engine (ANE) directly in the same sense. Apple's NPU is a "black box" designed to be accessed only through high-level compiler layers.
Here is the breakdown of how to get as close to the metal as possible on both architectures.
1. Apple Silicon (M3 Ultra / M4 Max)
Status: Closed / Undocumented ISA
You cannot write C++ or Assembly code that runs directly on the NPU (Neural Engine). Apple treats the ANE as an implementation detail of the OS. There is no public instruction set architecture (ISA) and no direct driver interface.
The "Official" Lowest Level: MPSGraph
The closest you can get to "metal" programming on Apple Silicon for AI is Metal Performance Shaders Graph (MPSGraph). This bypasses Core ML's high-level abstractions and lets you build the compute graph manually.
- How it works: You define a graph of mathematical operations (matrix multiplications, convolutions). The OS compiler then decides at runtime whether to run this on the GPU, the NPU, or the AMX (Apple's matrix coprocessor on the CPU).
- Constraint: You cannot force execution on the NPU; you can only "hint" it. The ANE mostly accelerates a narrow set of low-precision operations (Int8/FP16 convolutions).
```swift
// Example: Creating a graph in Swift (closest you get to "direct")
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let a = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "A")
let b = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")
// The OS driver compiles this graph and dispatches it to the GPU or NPU
```
The "Hacker" Level: Reverse Engineering
There have been community efforts (such as the geohot/tinygrad project or anemll) to reverse-engineer the ANE ISA.
- They found the ANE is effectively a massive, statically scheduled spatial dataflow machine.
- Current State: It is extremely brittle. Apple changes the ANE ISA with every chip generation (M1 → M2 → M3), breaking these tools immediately. It is not viable for production.
2. NVIDIA (Blackwell GB10 / GB200)
Status: Open / Fully Programmable
Note: The "GB10" is the Grace Blackwell superchip inside the NVIDIA DGX Spark desktop system; its Tensor Cores share the Blackwell architecture with the datacenter-class GB200.
On NVIDIA hardware, you have full access to the Tensor Cores. You can bypass high-level frameworks (PyTorch/TensorFlow) and program them using CUDA C++ or inline PTX assembly.
Level 1: CUDA WMMA API (Warp Matrix Multiply Accumulate)
This is the standard "direct" way to program Tensor Cores without writing assembly. You use the nvcuda::wmma namespace to load data into special fragments and trigger the hardware matrix multiply.
```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A x B
__global__ void wmma_kernel(half *a, half *b, float *c) {
  // Declare fragments (registers specifically for Tensor Cores)
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  // Zero the accumulator before the multiply-accumulate
  wmma::fill_fragment(c_frag, 0.0f);

  // Load data from memory into Tensor Core fragments (leading dimension 16)
  wmma::load_matrix_sync(a_frag, a, 16);
  wmma::load_matrix_sync(b_frag, b, 16);

  // Execute the hardware instruction
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

  // Store result
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```
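For context, a minimal host-side launch of the kernel above could look like the sketch below (it assumes the kernel is in the same file; a WMMA operation is executed cooperatively by a full warp, so one block of 32 threads covers this single tile; error checking and data initialization are omitted):

```cpp
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
  half *a, *b;
  float *c;
  cudaMalloc(&a, 16 * 16 * sizeof(half));
  cudaMalloc(&b, 16 * 16 * sizeof(half));
  cudaMalloc(&c, 16 * 16 * sizeof(float));
  // ... fill a and b with input data (e.g. cudaMemcpy from the host) ...

  // One block of 32 threads: wmma::mma_sync is a warp-wide operation
  wmma_kernel<<<1, 32>>>(a, b, c);
  cudaDeviceSynchronize();

  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```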
Level 2: Inline PTX (Assembly)
If you are on the Blackwell architecture (Compute Capability 10.0+), you can use the newest mma instructions, or use the TMA (Tensor Memory Accelerator) to asynchronously copy tiles of data from global memory into shared memory for the Tensor Cores to consume.
You can inject assembly directly into your C++ code to use instructions that might not yet be exposed in the CUDA C++ API:
```cpp
// d0..d3 and c0..c3 are per-thread fp32 accumulators; a0..a3 and b0..b1 are
// 32-bit registers, each holding two packed fp16 values of the A/B fragments.
asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
    : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1),
      "f"(c0), "f"(c1), "f"(c2), "f"(c3));
```
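Wrapped into a callable device helper, the operand packing becomes explicit. This is a sketch assuming the standard per-thread fragment layout for the m16n8k16 shape; the function name is illustrative:

```cpp
#include <cuda_fp16.h>

// One warp computes D = A * B + C for a 16x8x16 tile.
// Each thread supplies 4 packed-fp16 registers of A, 2 of B,
// and 4 fp32 accumulators of C; it receives 4 fp32 results in D.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```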
Level 3: CUTLASS & CuTe
Because programming Tensor Cores directly involves complex shared-memory swizzling to avoid bank conflicts, NVIDIA engineers themselves use a template library called CUTLASS (and its core layout library CuTe).
- This is "direct programming" but with C++ templates handling the register allocation math.
- On Blackwell (GB200/GB10), this is critical because you need to coordinate the TMA (async copy engine) with the warpgroup-level matrix-multiply instructions (WGMMA on Hopper and their Blackwell successors).
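As an illustration, a minimal Tensor Core GEMM through the CUTLASS 2.x device-level API looks roughly like the sketch below. The Sm80 arch tag, layouts, and leading dimensions here are assumptions, and Blackwell-generation kernels are normally built with the newer CUTLASS 3.x CollectiveBuilder interface instead:

```cpp
#include <cutlass/gemm/device/gemm.h>

// D = alpha * A * B + beta * C, fp16 inputs with fp32 accumulation on Tensor Cores
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    float,           cutlass::layout::RowMajor,     // C / D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores, not SIMT
    cutlass::arch::Sm80>;                           // arch tag (assumption; adjust per GPU)

cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const *A, cutlass::half_t const *B,
                         float *C) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},      // problem size
                       {A, K},         // A and its leading dimension (row-major MxK)
                       {B, K},         // B and its leading dimension (column-major KxN)
                       {C, N},         // C (source)
                       {C, N},         // D (destination)
                       {1.0f, 0.0f});  // alpha, beta
  return gemm_op(args);                // launches the kernel on the current stream
}
```

The template resolves the tile shapes, swizzled shared-memory layouts, and epilogue for you, which is exactly the bookkeeping that is tedious and error-prone to write by hand in raw WMMA or PTX.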
Summary of Differences
| Feature | Apple Neural Engine (M3/M4) | NVIDIA Tensor Core (Blackwell) |
|---|---|---|
| Programmability | Indirect Only (Compiler driven) | Direct (C++ or Assembly) |
| Lowest Level API | MPSGraph / Core ML | CUDA WMMA / PTX Assembly |
| Instruction Set | Private (Undocumented) | Public (PTX / SASS) |
| Data Types | Mostly Int8 / FP16 | Int4, Int8, FP8, BF16, FP16, TF32, FP64 |
| Memory Access | Managed by OS Cache | Explicit (User controls Shared Mem/TMA) |
Recommendation:
- For Apple: Use Core ML or MPSGraph. Do not attempt to program the ANE directly unless you are researching security/reverse-engineering; it is a moving target that will break your code in the next macOS update.
- For NVIDIA: If you want max performance, use CUTLASS. It generates the highly optimized pipeline (loading data -> Tensor Core math -> storing data) that is tedious to write manually in raw CUDA.