@glitzflitz glitzflitz commented Dec 31, 2025

Background

As per #695, propolis currently relies on the edk2-stable202105 version of EDK2 OVMF to provide the ACPI tables to the guest, as it was the last version that included static tables.
Another limitation is that the guest only sees whatever OVMF decided to generate rather than what the hypervisor knows about the virtual/emulated hardware.

In newer versions, OVMF expects the VMM to generate a set of ACPI tables and expose them via the fw_cfg table-loader interface. Being able to generate ACPI tables also unlocks other features, such as choosing which tables and control methods to expose, PCIe host bridge and switch emulation, and native PCIe hotplug.

This PR addresses that limitation and adds a mechanism that lets propolis generate its own ACPI tables.

Implementation

Overview

The series starts by implementing fw_cfg's table-loader mechanism to enable passing static tables to the guest firmware (OVMF). Then the basic static tables such as the RSDT, XSDT and RSDP are added.
After that we reach the second milestone: generating the AML bytecode. This is where some technical decisions need to be made, after evaluating the different options, their tradeoffs and the use case, about how to generate bytecode without introducing too much complexity.
At the end everything is wired up to switch to the propolis-generated tables.

Details

The fw_cfg Interface

QEMU's fw_cfg interface provides a mechanism for the hypervisor to expose files to guest firmware. Propolis already had basic fw_cfg support for the e820 memory map and bootrom. The ACPI implementation builds on that foundation.

OVMF expects three specific fw_cfg files for ACPI tables:

etc/acpi/tables    Combined ACPI table data
etc/acpi/rsdp      Root System Description Pointer
etc/table-loader   Linker/loader commands

The table-loader file contains a sequence of fixed-size commands that instruct OVMF to allocate memory, patch pointer fields and compute checksums. This is necessary because the tables contain absolute addresses that are only known after OVMF allocates memory for them.

In the proposed implementation in "Add fw_cfg table-loader helpers for ACPI table generation", TableLoader generates three command types:

ALLOCATE - reserves memory for a fw_cfg file with specified alignment in a given zone

ADD_POINTER - patches an address field in one file to point at another file's allocated location. The command specifies source file, destination file, offset within source and pointer size

ADD_CHECKSUM - computes a checksum over a byte range and writes it to a specified offset. ACPI tables use a simple byte sum that must equal zero.

The commands are used in "Prepare the ACPI tables for generation".
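As a rough illustration of the three commands described above, here is a minimal sketch of how they could be serialized. It assumes the layout used by QEMU's bios-linker-loader (128-byte entries, 56-byte NUL-padded file names, little-endian fields, command values 1 to 3). The function names are hypothetical and are not the TableLoader API from this series.

```rust
// Assumed constants, following QEMU's bios-linker-loader layout.
const ENTRY_SIZE: usize = 128;
const FILE_NAME_SIZE: usize = 56;
const CMD_ALLOCATE: u32 = 1;
const CMD_ADD_POINTER: u32 = 2;
const CMD_ADD_CHECKSUM: u32 = 3;
const ZONE_HIGH: u8 = 1;

/// Append a fw_cfg file name as a fixed-size, NUL-padded field.
fn put_name(entry: &mut Vec<u8>, name: &str) {
    assert!(name.len() < FILE_NAME_SIZE, "name must leave room for a NUL");
    let mut field = [0u8; FILE_NAME_SIZE];
    field[..name.len()].copy_from_slice(name.as_bytes());
    entry.extend_from_slice(&field);
}

/// Zero-pad a command to the fixed entry size.
fn finish_entry(mut entry: Vec<u8>) -> Vec<u8> {
    assert!(entry.len() <= ENTRY_SIZE);
    entry.resize(ENTRY_SIZE, 0);
    entry
}

/// ALLOCATE: reserve memory for `file` with the given alignment and zone.
fn allocate(file: &str, align: u32, zone: u8) -> Vec<u8> {
    let mut e = CMD_ALLOCATE.to_le_bytes().to_vec();
    put_name(&mut e, file);
    e.extend_from_slice(&align.to_le_bytes());
    e.push(zone);
    finish_entry(e)
}

/// ADD_POINTER: patch the `size`-byte field at `offset` inside
/// `pointer_file` so that it points into `pointee_file`.
/// (QEMU calls these two fields dest_file and src_file.)
fn add_pointer(pointer_file: &str, pointee_file: &str, offset: u32, size: u8) -> Vec<u8> {
    let mut e = CMD_ADD_POINTER.to_le_bytes().to_vec();
    put_name(&mut e, pointer_file);
    put_name(&mut e, pointee_file);
    e.extend_from_slice(&offset.to_le_bytes());
    e.push(size);
    finish_entry(e)
}

/// ADD_CHECKSUM: checksum `length` bytes starting at `start` in `file`
/// and store the result at `result_offset`.
fn add_checksum(file: &str, result_offset: u32, start: u32, length: u32) -> Vec<u8> {
    let mut e = CMD_ADD_CHECKSUM.to_le_bytes().to_vec();
    put_name(&mut e, file);
    e.extend_from_slice(&result_offset.to_le_bytes());
    e.extend_from_slice(&start.to_le_bytes());
    e.extend_from_slice(&length.to_le_bytes());
    finish_entry(e)
}

/// The value that makes the byte sum of a table wrap to zero,
/// computed over a range whose checksum field is still zero.
fn acpi_checksum(bytes: &[u8]) -> u8 {
    let sum = bytes.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
    0u8.wrapping_sub(sum)
}
```

A typical sequence would be one ALLOCATE per file, then an ADD_POINTER for every address field, and finally an ADD_CHECKSUM per table, since the checksum has to cover the patched pointers.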

Static Table Generation

The simpler static tables that don't require AML bytecode are implemented first.

Since propolis does not have hotplug support yet, an SSDT is not required for now.

AML generation and usage

The DSDT contains AML bytecode for describing devices, methods and resources. AML has a hierarchical structure with scopes containing devices which contain named objects and methods. The encoding uses variable length packages.

Possible approaches

QEMU uses a C-based approach with GArray buffers. Each AML construct is a function returning an Aml pointer that must be explicitly appended to its parent. The design is flexible but has caveats: for example, forgetting a manual aml_append call silently drops content, and there is no type safety around what can be nested. Since we are not bound by the limitations of C and have the borrow checker on our side, we can do better.

crosvm defines a single Aml trait with many implementing types. Each construct is a separate struct collecting children in a Vec. The usage pattern is usually a macro followed by to_aml_bytes(), which recursively serializes the tree. Although this provides strong typing, it is a bit more complex and requires constructing the entire tree in memory before serialization. Package lengths use a two-pass approach of first measuring, then writing.

Firecracker follows the same pattern as crosvm, with trait methods along with some additional error handling.

The acpi_tables crate used by cloud-hypervisor uses a dual-trait design that splits the problem in two: Aml for things that can be serialized and AmlSink as the destination. The sink abstraction is nice because the same tree can write to a Vec or feed a checksum calculator without changing the serialization code. It is structurally similar to crosvm and uses the same two-pass length encoding, which gets a bit complex when building nested hierarchies.

Approach in this series

"Introduce AML bytecode generation" adds RAII guards that automatically finalize package lengths when dropped.
The core abstraction is an AmlBuilder that owns a single byte buffer, plus guard types for Scope, Device and Method. Each guard holds a mutable borrow on the builder, so we get compile-time scope safety through the borrow checker. This way it is impossible to forget to close a scope.
Writing into the single buffer owned by AmlBuilder also avoids the overhead of dynamic dispatch found in the crosvm and acpi_tables approaches.

Guards borrow the builder mutably and write content directly to its buffer. When a guard is created it writes the opcode, reserves 4 bytes for the package length (the maximum encoding size) and writes the name. When the guard drops it calculates the actual package length, encodes it in 1-4 bytes and splices out the unused reserved bytes.

Usage looks like

let mut builder = AmlBuilder::new();
{
    let mut sb = builder.scope("\\_SB_");
    {
        let mut pci0 = sb.device("PCI0");
        pci0.name("_HID", &EisaId::from_str("PNP0A08"));
    }  // DeviceGuard drops
}  // ScopeGuard drops
let aml = builder.finish();

which looks structurally similar to the ASL code that is compiled to AML bytecode.
Thanks to the RAII guards, conditional content is simply an if statement, which avoids the Option wrappers needed in the other approaches mentioned above. The limitation of this design is that it is less composable: there is no easy way to return a "partial device tree" from a function or store AML fragments for later use.

Note about Package Length Encoding

Section 20.2.4 of the ACPI specification defines a variable-length encoding for package sizes. A package length includes itself in the count, which creates a circular dependency: the length must be known to encode it, but the encoding affects the length. That is why the other implementations often use a two-pass approach.

The implementation in Introduce AML byte generation simply reserves the maximum of 4 bytes when opening any scope and splices in the actual encoded length when the scope closes. This produces minimal output with a single pass through the data.
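The reserve-and-splice idea can be sketched in a few lines. This is a simplified illustration, not the code from the series: the Guard type and the open, raw and encode_pkg_length names are hypothetical, names are passed as pre-encoded bytes, and only the PkgLength handling is modelled.

```rust
const RESERVED: usize = 4; // maximum PkgLength width

/// Encode a PkgLength for `content_len` bytes of package content.
/// The encoded value also counts the PkgLength bytes themselves, so
/// pick the smallest width whose range can hold content + width.
fn encode_pkg_length(content_len: usize) -> Vec<u8> {
    let (width, total) = if content_len + 1 <= 0x3f {
        (1, content_len + 1)
    } else if content_len + 2 <= 0xfff {
        (2, content_len + 2)
    } else if content_len + 3 <= 0xf_ffff {
        (3, content_len + 3)
    } else {
        assert!(content_len + 4 <= 0xfff_ffff, "package too large");
        (4, content_len + 4)
    };
    if width == 1 {
        return vec![total as u8];
    }
    // Lead byte: follow-byte count in bits 7-6, low nibble of the length.
    let mut out = vec![(((width - 1) as u8) << 6) | (total & 0x0f) as u8];
    let mut rest = total >> 4;
    for _ in 1..width {
        out.push((rest & 0xff) as u8);
        rest >>= 8;
    }
    out
}

struct AmlBuilder {
    buf: Vec<u8>,
}

/// Holds the builder mutably until the package is closed on drop.
struct Guard<'a> {
    builder: &'a mut AmlBuilder,
    len_pos: usize, // offset of the reserved PkgLength bytes
}

impl AmlBuilder {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Write the opcode, reserve the PkgLength and write the name.
    fn open(&mut self, opcode: &[u8], name: &[u8]) -> Guard<'_> {
        self.buf.extend_from_slice(opcode);
        let len_pos = self.buf.len();
        self.buf.extend_from_slice(&[0u8; RESERVED]);
        self.buf.extend_from_slice(name);
        Guard { builder: self, len_pos }
    }

    fn finish(self) -> Vec<u8> {
        self.buf
    }
}

impl Guard<'_> {
    /// Open a nested package; the parent is borrowed until it drops.
    fn open(&mut self, opcode: &[u8], name: &[u8]) -> Guard<'_> {
        self.builder.open(opcode, name)
    }

    /// Append already-encoded AML bytes to this package.
    fn raw(&mut self, bytes: &[u8]) {
        self.builder.buf.extend_from_slice(bytes);
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        let buf = &mut self.builder.buf;
        let content_len = buf.len() - self.len_pos - RESERVED;
        let encoded = encode_pkg_length(content_len);
        // Replace the 4 reserved bytes with the 1-4 byte encoding.
        drop(buf.splice(self.len_pos..self.len_pos + RESERVED, encoded));
    }
}
```

Because a child guard always drops before its parent, the parent's reserved bytes sit at a lower offset than anything the child splices, so the stored len_pos stays valid even though the buffer shrinks.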

I'd be open to new ideas or going with another approach mentioned above as well :)

Wiring up new tables

  • Wire up ACPI table generation via fw_cfg
    The new table generation is controlled by a native_acpi_tables flag in the Board spec. Newly launched VMs have this set to true and get the newly generated tables via fw_cfg. VMs migrating from older propolis versions won't have this field in their spec, so it defaults to false and they keep using the OVMF tables.
    This way existing VMs can safely migrate to newer propolis versions without any guest-visible changes to their ACPI tables. Only VMs launched with the new version of propolis will use the new tables.

Testing

This is the dmesg output of Linux when using the new tables. The standard OVMF bootrom can now be used.

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
...
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] APIC: Static calls initialized
[    0.000000] efi: EFI v2.7 by EDK II
[    0.000000] efi: SMBIOS=0xbf9a9000 ACPI=0xbfb7e000 ACPI 2.0=0xbfb7e014 MEMATTR=0xbe3f5298 INITRD=0xbe564f98
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Oxide OxVM, BIOS v0.0.1-alpha 1 Bureaucracy 41, 3186 YOLD
...
[    0.000000] ACPI: RSDP 0x00000000BFB7E014 000024 (v02 OXIDE )
[    0.000000] ACPI: XSDT 0x00000000BFB7D0E8 00004C (v01 OXIDE  PROPOLIS 00000001      01000013)
[    0.000000] ACPI: FACP 0x00000000BFB7B000 000114 (v06 OXIDE  PROPOLIS 00000001 OXDE 00000001)
[    0.000000] ACPI: DSDT 0x00000000BFB7C000 000833 (v02 OXIDE  PROPOLIS 00000001 OXDE 00000001)
[    0.000000] ACPI: FACS 0x00000000BFBFC000 000040
[    0.000000] ACPI: APIC 0x00000000BFB7A000 000072 (v05 OXIDE  PROPOLIS 00000001 OXDE 00000001)
[    0.000000] ACPI: MCFG 0x00000000BFB79000 00003C (v01 OXIDE  PROPOLIS 00000001 OXDE 00000001)
[    0.000000] ACPI: HPET 0x00000000BFB78000 000038 (v01 OXIDE  PROPOLIS 00000001 OXDE 00000001)
[    0.000000] ACPI: BGRT 0x00000000BFB77000 000038 (v01 INTEL  EDK2     00000002      01000013)
[    0.000000] ACPI: Reserving FACP table memory at [mem 0xbfb7b000-0xbfb7b113]
[    0.000000] ACPI: Reserving DSDT table memory at [mem 0xbfb7c000-0xbfb7c832]
[    0.000000] ACPI: Reserving FACS table memory at [mem 0xbfbfc000-0xbfbfc03f]
[    0.000000] ACPI: Reserving APIC table memory at [mem 0xbfb7a000-0xbfb7a071]
[    0.000000] ACPI: Reserving MCFG table memory at [mem 0xbfb79000-0xbfb7903b]
[    0.000000] ACPI: Reserving HPET table memory at [mem 0xbfb78000-0xbfb78037]
[    0.000000] ACPI: Reserving BGRT table memory at [mem 0xbfb77000-0xbfb77037]
  • Now we can easily use the HPET in the guest. There are TSC calibration failures, which were also present when using the OVMF tables
[    0.000000] ACPI: Core revision 20250404
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 113919999973 ns
[    0.000000] APIC: Switch to symmetric I/O mode setup
[    0.000000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.008000] tsc: Unable to calibrate against PIT
[    0.008000] tsc: HPET/PMTIMER calibration failed
[    0.008000] tsc: Marking TSC unstable due to could not calculate TSC khz
  • We also have support for PCIe ECAM, which is needed to use PCIe host bridges
[    0.313000] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.316000] PCI: ECAM [mem 0xe0000000-0xefffffff] (base 0xe0000000) for domain 0000 [bus 00-ff]
[    0.316000] PCI: ECAM [mem 0xe0000000-0xefffffff] reserved as E820 entry
[    0.316000] PCI: Using configuration type 1 for base access
  • The GlobalLock is not supported by propolis yet so the warning appears with OVMF tables as well
[    0.337000] ACPI: Added _OSI(Module Device)
[    0.337000] ACPI: Added _OSI(Processor Device)
[    0.337000] ACPI: Added _OSI(Processor Aggregator Device)
[    0.338000] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.340000] ACPI Error: Could not enable GlobalLock event (20250404/evxfevnt-182)
[    0.341000] fbcon: Taking over console
[    0.341000] ACPI Warning: Could not enable fixed event - GlobalLock (1) (20250404/evxface-618)
[    0.341000] ACPI Error: No response from Global Lock hardware, disabling lock (20250404/evglock-63)
[    0.342000] ACPI: Interpreter enabled
  • Support for power state
[    0.342000] ACPI: PM: (supports S0 S3 S4 S5)
[    0.342000] ACPI: Using IOAPIC for interrupt routing
  • Support for _OSC, which enables the use of PCIe host bridges and, in future, native PCIe hotplug
[    0.342000] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.342000] PCI: Using E820 reservations for host bridge windows
[    0.343000] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.343000] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
[    0.343000] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug AER]
[    0.343000] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR DPC]
[    0.343000] acpi PNP0A08:00: host bridge window expanded to [mem 0xc0000000-0xfbffffff]; [mem 0xc0000000-0xfbffffff window] ignored
[    0.343000] acpi PNP0A08:00: host bridge window [mem 0x240000000-0xfffffffffff window] ([0x10000000000-0xfffffffffff] ignored, not CPU addressable)
[    0.343000] PCI host bridge to bus 0000:00
[    0.343000] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfbffffff]
[    0.343000] pci_bus 0000:00: root bus resource [mem 0x240000000-0xffffffffff window]
[    0.343000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    0.343000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.343000] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.344000] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000 conventional PCI endpoint
[    0.350000] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100 conventional PCI endpoint
[    0.357000] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000 conventional PCI endpoint
[    0.362000] pci 0000:00:01.3: quirk: [io  0xb000-0xb03f] claimed by PIIX4 ACPI
[    0.362000] pci 0000:00:01.3: quirk: [io  0xb100-0xb10f] claimed by PIIX4 SMB
  • The HPET from bhyve can be used
[    2.588000] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[    2.588000] hpet0: 8 comparators, 32-bit 16.777216 MHz counter
[    2.591000] clocksource: Switched to clocksource hpet
  • The PS/2 keyboard is also supported. Note that support can easily be extended to the PS/2 mouse and AUX port, but to maintain the same behaviour as the OVMF tables only the keyboard is exposed for now
[    3.047234] i8042: PNP: PS/2 Controller [PNP0303:KBD] at 0x60,0x64 irq 1
[    3.047234] i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please boot with i8042.nopnp
[    3.050217] serio: i8042 KBD port at 0x60,0x64 irq 1
[    3.063171] ACPI Error: Could not enable RealTimeClock event (20250404/evxfevnt-182)
[    3.075329] ACPI Warning: Could not enable fixed event - RealTimeClock (4) (20250404/evxface-618)

TODO:

  • Test *bsd/illumos
  • Test Windows
  • Compare dump of actual tables from OVMF and new tables extracted from the guest?

Add a TableLoader builder that can be used to generate the
etc/table-loader file to be passed to guest firmware via fw_cfg.

The etc/table-loader file in fw_cfg contains the sequence of fixed-size
linker/loader commands that instruct the guest to allocate
memory for a set of fw_cfg files (e.g. ACPI tables), link allocated memory
by patching pointers and calculate the ACPI checksum.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
@glitzflitz glitzflitz marked this pull request as ready for review December 31, 2025 04:33
@glitzflitz glitzflitz marked this pull request as draft December 31, 2025 04:54
Add builders to generate basic ACPI tables
RSDP(ACPI 2.0+) that points to XSDT, XSDT with 64-bit table pointers and
RSDT with 32-bit table pointers that would work with the table-loader
mechanism in fw_cfg.

These tables are used to describe the ACPI table hierarchy to guest
firmware. The builders produce raw table data bytes with placeholder
addresses and checksums that are fixed up by firmware using table-loader
commands.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
FADT describes fixed hardware features and points to the DSDT. The
builder supports both standard and HW-reduced ACPI modes.

DSDT contains AML bytecode describing system hardware. The builder
provides methods to append AML data, which could be populated by an
AML generation mechanism in the future.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Add a builder for the Multiple APIC Description Table (MADT) that
describes the system's interrupt controllers.

Supports adding local APIC, I/O APIC and interrupt source overrides for
describing processor and interrupt controller topology.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Add builder for the HPET table that describes the HPET hardware to the
guest. The table uses the bhyve HPET hardware ID (0x8086a201) and maps
to the standard HPET MMIO address at 0xfed00000.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Add the FACS table that provides a memory region for firmware/OS
handshaking. The table includes the GlobalLock field used for mutual
exclusion between the OS and firmware during ACPI operations. We
don't have support for handling GBL_EN yet[1], but the table is exposed
to match the behaviour of OVMF.

[1]: oxidecomputer#837

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Define bytecode opcodes for AML generation per ACPI Specification
Chapter 20 [1]. Includes namespace modifiers, named objects, data object
prefixes, name path prefixes, local/argument references, control flow
and logical/arithmetic operators.

These constants will be used in the next commits to generate AML bytecode,
which will enable us to generate ACPI tables ourselves.

[1]: https://uefi.org/specs/ACPI/6.5/20_AML_Specification.html

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Implement NameSeg and NameString encoding per ACPI Specification Section
20.2.2 [1]. Single segments encode as 4 bytes padded with underscores,
dual segments use DualNamePrefix and three or more use MultiNamePrefix
with a count byte.
Also implement EISA ID compression for hardware identification strings
like "PNP0A08".

[1]: https://uefi.org/specs/ACPI/6.4_A/20_AML_Specification.html#name-objects-encoding

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
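The NameSeg padding and EISA ID compression described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions: the name_seg and eisa_id functions are hypothetical stand-ins, not the EisaId type from the series, and NameSeg character validation is omitted.

```rust
/// Pad a 1-4 character name segment to 4 bytes with underscores,
/// e.g. "_SB" becomes "_SB_". Character validation is omitted here.
fn name_seg(s: &str) -> Option<[u8; 4]> {
    let b = s.as_bytes();
    if b.is_empty() || b.len() > 4 {
        return None;
    }
    let mut seg = [b'_'; 4];
    seg[..b.len()].copy_from_slice(b);
    Some(seg)
}

/// Compress a 7-character EISA ID (three uppercase letters followed
/// by four hex digits) into the DWord value stored in AML.
fn eisa_id(id: &str) -> Option<u32> {
    let b = id.as_bytes();
    if b.len() != 7 {
        return None;
    }
    // Each letter is stored as a 5-bit value: 'A' = 1 ... 'Z' = 26.
    let mut letters = [0u32; 3];
    for (i, &c) in b[..3].iter().enumerate() {
        if !c.is_ascii_uppercase() {
            return None;
        }
        letters[i] = (c - 0x40) as u32;
    }
    // The four hex digits form a 16-bit product number.
    let mut product: u32 = 0;
    for &c in &b[3..] {
        product = (product << 4) | (c as char).to_digit(16)?;
    }
    // Byte order as it appears in the AML stream.
    let bytes = [
        ((letters[0] << 2) | (letters[1] >> 3)) as u8,
        (((letters[1] & 0x7) << 5) | letters[2]) as u8,
        (product >> 8) as u8,
        (product & 0xff) as u8,
    ];
    Some(u32::from_le_bytes(bytes))
}
```

For "PNP0A08" the bytes in the AML stream are 41 D0 0A 08, which reads back as the little-endian DWord 0x080AD041.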
Add AML bytecode generation to mainly support dynamically generating
ACPI tables and control methods.
The bytecode is built in a single pass by directly writing to the
output buffer. AML scopes encode their length in a 1-4 byte PkgLength
field at the start[1]. Since we don't know the final size until the scope's
content is fully written, reserve 4 bytes when opening a scope upfront
and splice in the actual encoded length when the scope closes.
This avoids the complexity of having to build an in-memory tree and then
walk it twice to measure and serialize.

The RAII guards automatically close scopes and finalize the PkgLength on
drop. Those guards hold a mutable borrow on the builder so the borrow checker
won't let us close a parent while a child scope is still open. The
limitation of this approach is that the content has to be written in
output order but that is not a big issue for the use case of VM device
descriptions.

[1]: ACPI Specification Section 20.2.4
 https://uefi.org/specs/ACPI/6.4_A/20_AML_Specification.html#package-length-encoding

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Implement ResourceTemplateBuilder for constructing resource descriptors
used in methods like _CRS. Supports QWord/DWord memory and I/O ranges, Word
bus numbers and IRQ descriptors per ACPI Specification Section 6.4 [1].

[1]: https://uefi.org/specs/ACPI/6.4_A/06_Device_Configuration.html#resource-data-types-for-acpi

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Export the public API for AML generation: AmlBuilder, the AmlWriter trait,
guard types (ScopeGuard, DeviceGuard, MethodGuard), EisaId and
ResourceTemplateBuilder.
This enables generating the dynamic bytecode used in tables like the
DSDT.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Now that we have support for generating AML bytecode, add DSDT generation
that provides the guest OS with device information via AML. The DSDT
contains _SB.PCI0 describing the PCIe host bridge with ECAM
configuration space and bus number resources, plus COM1-COM4 serial port
devices with their IO ports and IRQs.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Add AT keyboard controller resources to allow the guest to enumerate the
i8042 controller. Only the keyboard is added for now, to match OVMF's
existing behaviour.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
The OS calls _OSC on the PCIe host bridge to negotiate control of native
PCIe features like hotplug, AER and PME. Without _OSC, Linux logs a
warning about missing capability negotiation (_OSC: platform retains
control of PCIe features (AE_NOT_FOUND)) and, as per [1], Windows also
won't enable any of the advanced PCI Express features through PCI
Express Native Control.

Control of AER seems to be optional, so it's not handed over to the
guest for now.

Also, to simplify the AML generation of _OSC itself, introduce some
high-level wrappers around AML generation.

[1]: https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/enabling-pci-express-native-control

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Combine all ACPI tables into the format expected by the firmware (OVMF) by
using fw_cfg's table-loader commands for address patching and checksum
computation.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
@glitzflitz glitzflitz force-pushed the acpi_fwcfg_reord branch 2 times, most recently from 29dbb4f to a568c53 Compare December 31, 2025 17:58
Integrate the new ACPI table generation into propolis-standalone and
propolis-server. Also replace hardcoded memory region addresses with
constants that align with ACPI table definitions.

The PCIe ECAM base is kept the same as before at 0xe000_0000 (3.5 GiB) to
match the existing i440fx chipset ECAM placement.

Guest physical memory map:
0x0000_0000 - 0xbfff_ffff    Low RAM (up to 3 GiB)
0xc000_0000 - 0xffff_ffff    PCI hole (1 GiB MMIO region)
  0xc000_0000 - 0xdfff_ffff    32-bit PCI MMIO
  0xe000_0000 - 0xefff_ffff    PCIe ECAM (256 MiB, 256 buses)
  0xfec0_0000                  IOAPIC
  0xfed0_0000                  HPET
  0xffe0_0000 - 0xffff_ffff    Bootrom (2 MiB)
0x1_0000_0000+               High RAM + 64-bit PCI MMIO

e820 map as seen by guest:
0x0000_0000 - 0x0009_ffff    Usable (640 KiB low memory)
0x0010_0000 - 0xbeaf_ffff    Usable (~3 GiB main RAM)
0xbeb0_0000 - 0xbfb6_cfff    Reserved (UEFI runtime/data)
0xbfb6_d000 - 0xbfbf_efff    ACPI Tables + NVS
0xbfbf_f000 - 0xbffd_ffff    Usable (top of low memory)
0xbffe_0000 - 0xffff_ffff    Reserved (PCI hole)
0x1_0000_0000 - highmem      Usable (high RAM above 4 GiB)

To stay on the safe side, only enable the new ACPI tables for newly
launched VMs. Old VMs using OVMF tables keep using the same OVMF
tables throughout multiple migrations. To verify this, add phd tests
covering three cases: a new VM launched with native tables, native tables
preserved through migration, and a VM launched from an old propolis without
native tables staying with OVMF tables through multiple future migrations.

Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>