RFC: Lossless fixed-width integer bit-packing codec for ADC-style data #813

@sakshi2433

Hi! This is an exploratory design discussion based on a working prototype and benchmarks.
I’d really appreciate feedback on API shape and integration direction.

Motivation

Many scientific datasets store integer signals in which only a subset of the bits is meaningful
(e.g. 10–12 bit ADC data stored in uint16). Zarr/numcodecs currently rely on byte-level
compression for such data, which does not explicitly remove the unused bits.

Observations

In an ADC-style benchmark (uint16, effective 12 bits, 10M samples), default Zarr compression
reduced storage from ~19 MB to ~7 MB, but further gains plateaued. Existing bit-level tools
in numcodecs (e.g. PackBits, which packs boolean arrays into bits) do not support lossless
packing of integers to an arbitrary bit width.

Proposal

Introduce a lossless integer bit-packing codec/filter that:

  • Packs fixed-width integer values using exactly N bits per value
  • Operates per chunk
  • Is fully reversible
  • Can be composed with existing compression
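
To make the proposal concrete, here is a minimal pure-Python sketch of the pack/unpack step. It is not the prototype's code: the function names `pack_bits` and `unpack_bits`, the LSB-first bit order, and passing the sample count explicitly are all assumptions made for illustration.

```python
def pack_bits(values, nbits):
    """Pack non-negative integers into bytes using exactly `nbits` per value.

    Assumption: values are laid out LSB-first, with no header or padding
    other than zero bits in the final byte.
    """
    if not 1 <= nbits <= 64:
        raise ValueError("nbits must be between 1 and 64")
    limit = 1 << nbits
    out = bytearray()
    acc = 0      # bit accumulator; the lowest bits are the oldest
    filled = 0   # number of valid bits currently held in acc
    for v in values:
        if not 0 <= v < limit:
            raise ValueError(f"value {v} does not fit in {nbits} bits")
        acc |= v << filled
        filled += nbits
        while filled >= 8:          # flush every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            filled -= 8
    if filled:                      # flush the final partial byte
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_bits(data, nbits, count):
    """Inverse of pack_bits: recover `count` values of `nbits` bits each."""
    if len(data) * 8 < count * nbits:
        raise ValueError("buffer too short for the requested count")
    mask = (1 << nbits) - 1
    values = []
    acc = 0
    filled = 0
    source = iter(data)
    for _ in range(count):
        while filled < nbits:       # refill until one full value is available
            acc |= next(source) << filled
            filled += 8
        values.append(acc & mask)
        acc >>= nbits
        filled -= nbits
    return values


samples = [0, 1, 4095, 2048, 1234]       # 12-bit ADC-style values
packed = pack_bits(samples, 12)
assert unpack_bits(packed, 12, len(samples)) == samples
```

Because the packed stream is an ordinary byte string, it can be handed to any byte-level compressor afterwards, which is what makes the step composable.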

Prototype

I implemented a pure-Python prototype to validate feasibility:

  • Correct round-trip verified
  • Storage reduction proportional to effective bit-width
  • Zarr v2 compatible (self-describing stream)
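
One way a stream could be made self-describing is a small fixed header in front of the packed payload. The layout below is a hypothetical illustration, not the prototype's actual format: the magic bytes `BPK1`, the field order, and the helper names `frame` and `parse` are all assumptions.

```python
import struct

# Hypothetical header (an assumption, not the prototype's format):
# 4-byte magic, 1-byte bit width, 8-byte little-endian sample count.
MAGIC = b"BPK1"
HEADER = struct.Struct("<4sBQ")   # "<" means no padding, so 13 bytes in total


def expected_nbytes(count, nbits):
    """Payload size in bytes for `count` values of `nbits` bits each."""
    return (count * nbits + 7) // 8


def frame(payload, nbits, count):
    """Prefix a packed payload with the header describing it."""
    if not 1 <= nbits <= 64:
        raise ValueError("nbits must be between 1 and 64")
    if len(payload) != expected_nbytes(count, nbits):
        raise ValueError("payload length does not match count and nbits")
    return HEADER.pack(MAGIC, nbits, count) + payload


def parse(stream):
    """Split a framed stream back into (nbits, count, payload)."""
    if len(stream) < HEADER.size:
        raise ValueError("stream shorter than header")
    magic, nbits, count = HEADER.unpack_from(stream)
    if magic != MAGIC:
        raise ValueError("unrecognized magic bytes")
    payload = stream[HEADER.size:]
    if len(payload) != expected_nbytes(count, nbits):
        raise ValueError("payload length does not match header")
    return nbits, count, payload


payload = bytes(3)                      # two 12-bit samples occupy 3 bytes
stream = frame(payload, 12, 2)
assert parse(stream) == (12, 2, payload)
```

A header like this lets a decoder work without any external configuration, at a fixed cost of 13 bytes per chunk under this layout.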

Results (summary)

  • Bit-packing alone reduces storage predictably (e.g. 12 of 16 bits → ~0.75× the raw size)
  • Bit-packing + Blosc/Zstd yields a size close to that of default compression
  • The pure-Python implementation is CPU-heavy, so an optimized backend is likely needed
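
The ~0.75× figure follows directly from the bit widths. A short sketch of the arithmetic for the benchmark shape above (10M uint16 samples, 12 effective bits), ignoring any header overhead; `packed_nbytes` is a hypothetical helper name:

```python
def packed_nbytes(count, nbits):
    """Bytes needed to hold `count` values at `nbits` bits each (rounded up)."""
    return (count * nbits + 7) // 8


count = 10_000_000
raw = count * 2                      # uint16 stores 2 bytes per sample
packed = packed_nbytes(count, 12)    # 12 effective bits per sample

print(raw, packed, packed / raw)     # 20000000 15000000 0.75
```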

Questions

  • Should this live as a codec or filter in numcodecs?
  • How should bit-width metadata be handled (header vs external)?
  • Is limiting initial scope to uint16 reasonable?
  • Any guidance on aligning this with Zarr v3’s codec pipeline?
