
NVIDIA development directly on Tensor cores - specifically the GB10 on the DGX Spark #47


Description

@obriensystems

This one is in collaboration with Gemini Pro - for throughput planning.
See ObrienlabsDev/blog#153

https://gemini.google.com/app/ffe7490def8a744e

"how do I directly program the neural processing units in apple silicon on for example the M4max or M3Ultra or the tensor cores in a GB10 on an nvidia dgx spark"


The short answer is that while you can program NVIDIA Tensor Cores explicitly, down to the assembly level, you cannot program Apple's Neural Engine (ANE) directly in the same sense. Apple's NPU is a "black box" designed to be accessed only through high-level compiler layers.

Here is the breakdown of how to get as close to the metal as possible on both architectures.


1. Apple Silicon (M3 Ultra / M4 Max)

Status: Closed / Undocumented ISA

You cannot write C++ or assembly code that runs directly on the NPU (Neural Engine). Apple treats the ANE as an implementation detail of the OS. There is no public instruction set architecture (ISA) and no direct driver interface.



The "Official" Lowest Level: MPSGraph

The closest you can get to "metal" programming on Apple Silicon for AI is Metal Performance Shaders Graph (MPSGraph). This bypasses Core ML's high-level abstractions and lets you build the compute graph manually.



  • How it works: You define a graph of mathematical operations (matrix multiplications, convolutions). The OS compiler then decides at runtime whether to run this on the GPU, the NPU, or the AMX (Apple Matrix Coprocessor on the CPU).

  • Constraint: You cannot force execution on the NPU. You can only "hint" at it. The NPU mostly accelerates very specific quantized operations (Int8/FP16 convolutions).

Swift

// Example: Creating a graph in Swift (closest you get to "direct")
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let a = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "A")
let b = graph.placeholder(shape: [1024, 1024], dataType: .float32, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

// The OS driver compiles this graph and dispatches it to the GPU or NPU

The "Hacker" Level: Reverse Engineering

There have been community efforts (such as the geohot/tinygrad project or anemll) to reverse-engineer the ANE ISA.

  • They found the ANE is effectively a massive, statically scheduled spatial dataflow machine.
  • Current State: It is extremely brittle. Apple changes the ANE ISA with every chip generation (M1 → M2 → M3), breaking these tools immediately. It is not viable for production.


2. NVIDIA (Blackwell GB10 / GB200)

Status: Open / Fully Programmable

Note: The GB10 is the Grace Blackwell superchip inside the NVIDIA DGX Spark, a compact desktop system. Its Tensor Cores are programmed the same way as on the data-center Blackwell parts such as the GB200, so everything below applies to both.

On NVIDIA hardware, you have full access to the Tensor Cores. You can bypass high-level frameworks (PyTorch/TensorFlow) and program them using CUDA C++ or inline PTX assembly.



Level 1: CUDA WMMA API (Warp Matrix Multiply Accumulate)

This is the standard "direct" way to program Tensor Cores without writing assembly. You use the nvcuda::wmma namespace to load data into special fragments and trigger the hardware matrix multiply.



C++

#include <mma.h>
#include <cuda_fp16.h>   // defines the half type used below

using namespace nvcuda;

__global__ void wmma_kernel(half *a, half *b, float *c) {
   // Declare fragments (registers specifically for Tensor Cores)
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Zero the accumulator before accumulating into it
   wmma::fill_fragment(c_frag, 0.0f);

   // Load data from memory to Tensor Core registers
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Execute the hardware instruction: c_frag = a_frag * b_frag + c_frag
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store result
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
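
For context, here is a minimal host-side launch for that kernel. This is an added sketch, not part of the original answer: it assumes the code sits in the same .cu file as wmma_kernel, uses a single 16x16x16 tile filled with ones (so every output element should come back as 16.0), and omits error checking.

C++

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
   const int N = 16;                                  // one 16x16x16 WMMA tile
   std::vector<half> h_a(N * N), h_b(N * N);
   std::vector<float> h_c(N * N);
   for (int i = 0; i < N * N; ++i) {
      h_a[i] = __float2half(1.0f);                    // A and B filled with ones,
      h_b[i] = __float2half(1.0f);                    // so each C element should be 16.0
   }

   half *d_a, *d_b;
   float *d_c;
   cudaMalloc((void **)&d_a, N * N * sizeof(half));
   cudaMalloc((void **)&d_b, N * N * sizeof(half));
   cudaMalloc((void **)&d_c, N * N * sizeof(float));
   cudaMemcpy(d_a, h_a.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, h_b.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);

   // WMMA is a warp-level operation: one warp of 32 threads owns the whole tile.
   wmma_kernel<<<1, 32>>>(d_a, d_b, d_c);
   cudaMemcpy(h_c.data(), d_c, N * N * sizeof(float), cudaMemcpyDeviceToHost);

   cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
   return 0;
}

Compile with something like nvcc -arch=sm_70 or newer (for the GB10 you would target the corresponding Blackwell sm_ value).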

Level 2: Inline PTX (Assembly)

If you are on the Blackwell architecture (Compute Capability 10.0+), you can use the newest mma instruction variants together with TMA (the Tensor Memory Accelerator, introduced with Hopper and carried forward to Blackwell) to asynchronously copy tiles from global memory into the shared memory that feeds the Tensor Cores.



You can inject assembly directly into your C++ code to use instructions that might not yet be exposed in the CUDA C++ API:

C++

asm volatile(
   "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
   "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
   : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
   : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1), "f"(c0), "f"(c1), "f"(c2), "f"(c3)
);
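
To make the operand constraints concrete, here is an illustrative wrapper around that same instruction (an added sketch; the function name and array-based fragment layout are my assumptions, not an NVIDIA API). Each "r" operand is a 32-bit register carrying two packed f16 values, which is how the m16n8k16 shape distributes the A and B fragments across the 32 threads of a warp; the four f32 values are each thread's slice of the C/D tile.

C++

// Requires sm_80 or newer. The caller must already hold the warp-distributed
// fragments in registers (typically loaded from shared memory with ldmatrix).
__device__ void mma_m16n8k16_f16_f32(const unsigned (&a)[4],  // A: 4 x b32 = 8 halves per thread
                                     const unsigned (&b)[2],  // B: 2 x b32 = 4 halves per thread
                                     const float (&c)[4],     // C: 4 x f32 per thread
                                     float (&d)[4]) {         // D: 4 x f32 per thread
   asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}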

Level 3: CUTLASS & CuTe

Because programming Tensor Cores directly involves complex shared-memory swizzling to avoid bank conflicts, NVIDIA engineers themselves use a template library called CUTLASS (and its core layout library CuTe).

  • This is "direct programming" but with C++ templates handling the register allocation math.
  • On Blackwell (GB200), this is critical because you need to coordinate the TMA (async copy engines) with the WGMMA (Warpgroup Matrix Multiply) instructions.
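
For a rough sense of the abstraction level, here is a minimal sketch using the classic cutlass::gemm::device::Gemm interface for an FP16-input / FP32-accumulate GEMM. This is an added illustration, not code from the thread: the run_gemm helper and the Sm80 arch tag are assumptions, and on Hopper/Blackwell the newer CUTLASS 3.x CollectiveBuilder interface is the preferred path.

C++

#include <cutlass/cutlass.h>
#include <cutlass/gemm/device/gemm.h>

// FP16 inputs, FP32 accumulation, dispatched to Tensor Core kernels.
// CUTLASS picks threadblock/warp tile shapes and handles the shared-memory
// swizzling and software pipelining internally.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    float,           cutlass::layout::RowMajor,     // C / D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores, not SIMT cores
    cutlass::arch::Sm80>;                           // illustrative arch tag

cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const *A, int lda,
                         cutlass::half_t const *B, int ldb,
                         float *C, int ldc) {
   Gemm gemm_op;
   Gemm::Arguments args({M, N, K},     // problem size
                        {A, lda},      // TensorRef for A
                        {B, ldb},      // TensorRef for B
                        {C, ldc},      // source C
                        {C, ldc},      // destination D (written in place)
                        {1.0f, 0.0f}); // epilogue: alpha, beta
   return gemm_op(args);               // validates the configuration and launches the kernel
}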

Summary of Differences

| Feature | Apple Neural Engine (M3/M4) | NVIDIA Tensor Core (GB10/GB200) |
| --- | --- | --- |
| Programmability | Indirect only (compiler driven) | Direct (C++ or assembly) |
| Lowest-level API | MPSGraph / Core ML | CUDA WMMA / PTX assembly |
| Instruction set | Private (undocumented) | Public (PTX / SASS) |
| Data types | Mostly Int8 / FP16 | Int4, Int8, FP8, BF16, FP16, TF32, FP64 |
| Memory access | Managed by OS cache | Explicit (user controls shared memory / TMA) |

Recommendation:

  • For Apple: Use Core ML or MPSGraph. Do not attempt to program the ANE directly unless you are researching security/reverse-engineering; it is a moving target that will break your code in the next macOS update.
  • For NVIDIA: If you want max performance, use CUTLASS. It generates the highly optimized pipeline (loading data -> Tensor Core math -> storing data) that is tedious to write manually in raw CUDA.
