
NVIDIA development directly on Tensor cores - specifically the GB10 on the DGX Spark #47


Description

@obriensystems

This one is in collaboration with Gemini Pro - for throughput planning.
See ObrienlabsDev/blog#153

https://gemini.google.com/app/ffe7490def8a744e

"how do I directly program the neural processing units in apple silicon on for example the M4max or M3Ultra or the tensor cores in a GB10 on an nvidia dgx spark"


The short answer is that while you can program NVIDIA Tensor Cores explicitly, down to the assembly level, you cannot program Apple's Neural Engine (ANE) directly in the same sense. Apple's NPU is a "black box" designed to be accessed only through high-level compiler layers.

Here is the breakdown of how to get as close to the metal as possible on both architectures.


1. Apple Silicon (M3 Ultra / M4 Max)

Status: Closed / Undocumented ISA

You cannot write C++ or assembly code that runs directly on the NPU (Neural Engine). Apple treats the ANE as an implementation detail of the OS. There is no public instruction set architecture (ISA) and no direct driver interface.



The "Official" Lowest Level: MPSGraph

The closest you can get to "metal" programming on Apple Silicon for AI is Metal Performance Shaders Graph (MPSGraph). This bypasses Core ML's high-level abstractions and lets you build the compute graph manually.



  • How it works: You define a graph of mathematical operations (matrix multiplications, convolutions). The OS compiler then decides at runtime whether to run this on the GPU, the NPU, or the AMX (Apple Matrix Coprocessor on the CPU).

  • Constraint: You cannot force execution on the NPU. You can only "hint" at it. The NPU mostly accelerates very specific quantized operations (Int8/FP16 convolutions).

Swift

// Example: Creating a graph in Swift (closest you get to "direct")
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let a = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "A")
let b = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

// The OS driver compiles this graph and dispatches it to the GPU or NPU

The "Hacker" Level: Reverse Engineering

There have been community efforts (such as the geohot/tinygrad project or anemll) to reverse-engineer the ANE ISA.

  • They found the ANE is effectively a massive, statically scheduled spatial dataflow machine.
  • Current State: It is extremely brittle. Apple changes the ANE ISA with every chip generation (M1 → M2 → M3), breaking these tools immediately. It is not viable for production.


2. NVIDIA (Blackwell GB10 / GB200)

Status: Open / Fully Programmable

Note: The GB10 is the Grace Blackwell superchip inside the NVIDIA DGX Spark, a compact desktop system. Its Tensor Cores are programmed the same way as on the data-center Blackwell parts such as the GB200, so everything below applies to both.

On NVIDIA hardware, you have full access to the Tensor Cores. You can bypass high-level frameworks (PyTorch/TensorFlow) and program them using CUDA C++ or inline PTX assembly.



Level 1: CUDA WMMA API (Warp Matrix Multiply Accumulate)

This is the standard "direct" way to program Tensor Cores without writing assembly. You use the nvcuda::wmma namespace to load data into special fragments and trigger the hardware matrix multiply.



C++

#include <mma.h>
#include <cuda_fp16.h>   // defines the half type used below

using namespace nvcuda;

__global__ void wmma_kernel(half *a, half *b, float *c) {
   // Declare fragments (registers specifically for Tensor Cores)
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Zero the accumulator before accumulating into it
   wmma::fill_fragment(c_frag, 0.0f);

   // Load data from memory to Tensor Core registers
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Execute the hardware instruction: c_frag = a_frag * b_frag + c_frag
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store result
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
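
For context, here is a minimal host-side launch for that kernel. This is an added sketch, not part of the original answer: it assumes the code sits in the same .cu file as wmma_kernel, uses a single 16x16x16 tile filled with ones (so every output element should come back as 16.0), and omits error checking.

C++

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
   const int N = 16;                                  // one 16x16x16 WMMA tile
   std::vector<half> h_a(N * N), h_b(N * N);
   std::vector<float> h_c(N * N);
   for (int i = 0; i < N * N; ++i) {
      h_a[i] = __float2half(1.0f);                    // A and B filled with ones,
      h_b[i] = __float2half(1.0f);                    // so each C element should be 16.0
   }

   half *d_a, *d_b;
   float *d_c;
   cudaMalloc((void **)&d_a, N * N * sizeof(half));
   cudaMalloc((void **)&d_b, N * N * sizeof(half));
   cudaMalloc((void **)&d_c, N * N * sizeof(float));
   cudaMemcpy(d_a, h_a.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, h_b.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);

   // WMMA is a warp-level operation: one warp of 32 threads owns the whole tile.
   wmma_kernel<<<1, 32>>>(d_a, d_b, d_c);
   cudaMemcpy(h_c.data(), d_c, N * N * sizeof(float), cudaMemcpyDeviceToHost);

   cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
   return 0;
}

Compile with something like nvcc -arch=sm_70 or newer (for the GB10 you would target the corresponding Blackwell sm_ value).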

Level 2: Inline PTX (Assembly)

If you are on the Blackwell architecture (Compute Capability 10.0+), you can use the newest mma instruction variants together with TMA (the Tensor Memory Accelerator, introduced with Hopper and carried forward to Blackwell) to asynchronously copy tiles from global memory into the shared memory that feeds the Tensor Cores.



You can inject assembly directly into your C++ code to use instructions that might not yet be exposed in the CUDA C++ API:

C++

asm volatile(
   "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
   "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
   : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
   : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1), "f"(c0), "f"(c1), "f"(c2), "f"(c3)
);
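
To make the operand constraints concrete, here is an illustrative wrapper around that same instruction (an added sketch; the function name and array-based fragment layout are my assumptions, not an NVIDIA API). Each "r" operand is a 32-bit register carrying two packed f16 values, which is how the m16n8k16 shape distributes the A and B fragments across the 32 threads of a warp; the four f32 values are each thread's slice of the C/D tile.

C++

// Requires sm_80 or newer. The caller must already hold the warp-distributed
// fragments in registers (typically loaded from shared memory with ldmatrix).
__device__ void mma_m16n8k16_f16_f32(const unsigned (&a)[4],  // A: 4 x b32 = 8 halves per thread
                                     const unsigned (&b)[2],  // B: 2 x b32 = 4 halves per thread
                                     const float (&c)[4],     // C: 4 x f32 per thread
                                     float (&d)[4]) {         // D: 4 x f32 per thread
   asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}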

Level 3: CUTLASS & CuTe

Because programming Tensor Cores directly involves complex shared-memory swizzling to avoid bank conflicts, NVIDIA engineers themselves use a template library called CUTLASS (and its core layout library CuTe).

  • This is "direct programming" but with C++ templates handling the register allocation math.
  • On Blackwell (GB200), this is critical because you need to coordinate the TMA (async copy engines) with the WGMMA (Warpgroup Matrix Multiply) instructions.
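
For a rough sense of the abstraction level, here is a minimal sketch using the classic cutlass::gemm::device::Gemm interface for an FP16-input / FP32-accumulate GEMM. This is an added illustration, not code from the thread: the run_gemm helper and the Sm80 arch tag are assumptions, and on Hopper/Blackwell the newer CUTLASS 3.x CollectiveBuilder interface is the preferred path.

C++

#include <cutlass/cutlass.h>
#include <cutlass/gemm/device/gemm.h>

// FP16 inputs, FP32 accumulation, dispatched to Tensor Core kernels.
// CUTLASS picks threadblock/warp tile shapes and handles the shared-memory
// swizzling and software pipelining internally.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    float,           cutlass::layout::RowMajor,     // C / D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores, not SIMT cores
    cutlass::arch::Sm80>;                           // illustrative arch tag

cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const *A, int lda,
                         cutlass::half_t const *B, int ldb,
                         float *C, int ldc) {
   Gemm gemm_op;
   Gemm::Arguments args({M, N, K},     // problem size
                        {A, lda},      // TensorRef for A
                        {B, ldb},      // TensorRef for B
                        {C, ldc},      // source C
                        {C, ldc},      // destination D (written in place)
                        {1.0f, 0.0f}); // epilogue: alpha, beta
   return gemm_op(args);               // validates the configuration and launches the kernel
}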

Summary of Differences

| Feature | Apple Neural Engine (M3/M4) | NVIDIA Tensor Core (GB10/GB200) |
| --- | --- | --- |
| Programmability | Indirect only (compiler driven) | Direct (C++ or assembly) |
| Lowest-level API | MPSGraph / Core ML | CUDA WMMA / PTX assembly |
| Instruction set | Private (undocumented) | Public (PTX / SASS) |
| Data types | Mostly Int8 / FP16 | Int4, Int8, FP8, BF16, FP16, TF32, FP64 |
| Memory access | Managed by OS cache | Explicit (user controls shared memory / TMA) |

Recommendation:

  • For Apple: Use Core ML or MPSGraph. Do not attempt to program the ANE directly unless you are researching security/reverse-engineering; it is a moving target that will break your code in the next macOS update.
  • For NVIDIA: If you want max performance, use CUTLASS. It generates the highly optimized pipeline (loading data -> Tensor Core math -> storing data) that is tedious to write manually in raw CUDA.
