Description
Hi! This is an exploratory design discussion based on a working prototype and benchmarks.
I’d really appreciate feedback on API shape and integration direction.
Motivation
Many scientific datasets store integer signals in which only a subset of the bits is meaningful
(e.g. 10–12 bit ADC data stored in uint16). Zarr/numcodecs currently rely on byte-level
compression for such data, which does not explicitly remove the unused bits.
Observations
In an ADC-style benchmark (uint16, effective 12 bits, 10M samples), default Zarr compression
reduced storage from ~19 MB to ~7 MB, but further gains plateaued. Existing bit-level tools
in numcodecs (e.g. PackBits, which packs boolean arrays at one bit per element) do not support
lossless packing of integers at an arbitrary bit width.
Proposal
Introduce a lossless integer bit-packing codec/filter that:
- Packs fixed-width integer values using exactly N bits per value
- Operates per chunk
- Is fully reversible
- Can be composed with existing compression
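To make the proposal concrete, here is a minimal pure-Python sketch of the pack/unpack step. It is not the prototype's code: the names `pack_bits` and `unpack_bits`, the little-endian bit order, and the 1..64 range check are assumptions chosen for illustration only.

```python
def pack_bits(values, bits):
    """Pack non-negative ints into bytes, using exactly `bits` bits per value.

    Hypothetical helper; bit order (LSB first) is an assumption.
    """
    if not 1 <= bits <= 64:
        raise ValueError("bits must be in 1..64")
    limit = 1 << bits
    out = bytearray()
    acc = 0   # bit accumulator
    nacc = 0  # number of valid bits currently held in acc
    for v in values:
        if not 0 <= v < limit:
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << nacc
        nacc += bits
        # Flush every complete byte.
        while nacc >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nacc -= 8
    if nacc:
        # Final partial byte, zero-padded in the high bits.
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_bits(data, bits, count):
    """Inverse of pack_bits: recover `count` values of `bits` bits each."""
    needed = (count * bits + 7) // 8
    if len(data) < needed:
        raise ValueError("not enough data for the requested count")
    mask = (1 << bits) - 1
    out = []
    acc = 0
    nacc = 0
    pos = 0
    for _ in range(count):
        # Refill the accumulator until it holds at least one full value.
        while nacc < bits:
            acc |= data[pos] << nacc
            pos += 1
            nacc += 8
        out.append(acc & mask)
        acc >>= bits
        nacc -= bits
    return out


samples = [0, 1, 4095, 2048, 1234]      # all fit in 12 bits
packed = pack_bits(samples, 12)          # 60 bits, stored in 8 bytes
restored = unpack_bits(packed, 12, len(samples))
```

Note that the element count has to be known at decode time, because the zero padding in the last byte is otherwise indistinguishable from data. The output is plain bytes, so it can be handed to any downstream compressor.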
Prototype
I implemented a pure-Python prototype to validate feasibility:
- Correct round-trip verified
- Storage reduction proportional to effective bit-width
- Zarr v2 compatible (self-describing stream)
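For discussion, a self-describing stream could look roughly like the sketch below. The layout (a 4-byte magic, a 1-byte bit width, an 8-byte element count) and the names `MAGIC`, `frame`, and `parse` are hypothetical and do not describe the prototype's actual format.

```python
import struct

MAGIC = b"BPK1"                   # hypothetical format tag
HEADER = struct.Struct("<4sBQ")   # magic, bit width, element count (13 bytes)


def frame(payload, bits, count):
    """Prefix an already bit-packed payload with a small header."""
    return HEADER.pack(MAGIC, bits, count) + payload


def parse(stream):
    """Split a framed stream into (bits, count, payload), with sanity checks."""
    if len(stream) < HEADER.size:
        raise ValueError("stream shorter than header")
    magic, bits, count = HEADER.unpack_from(stream, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic bytes")
    payload = stream[HEADER.size:]
    expected = (count * bits + 7) // 8
    if len(payload) != expected:
        raise ValueError("payload length does not match header")
    return bits, count, payload


# Two 12-bit values occupy 24 bits, i.e. exactly 3 payload bytes.
stream = frame(b"\x00\x00\x00", bits=12, count=2)
bits, count, payload = parse(stream)
```

An in-stream header like this keeps each chunk decodable on its own, at the cost of a few bytes per chunk. The alternative is to keep the bit width in the codec configuration stored with the array metadata.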
Results (summary)
- Bit-packing alone reduces storage predictably (e.g. 12/16 → ~0.75×)
- Bit-packing + Blosc/Zstd achieves a size close to that of default compression
- Python implementation is CPU-heavy → optimized backend likely needed
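The predictable part of the saving is simple arithmetic. The snippet below works it through for the benchmark parameters quoted earlier (10M uint16 samples, 12 effective bits); `packed_nbytes` is a hypothetical helper and any header overhead is ignored.

```python
def packed_nbytes(count, bits):
    """Bytes needed to store `count` values at `bits` bits each, rounded up."""
    return (count * bits + 7) // 8


count = 10_000_000
raw_bytes = count * 2                       # uint16: 20,000,000 bytes
packed_bytes = packed_nbytes(count, 12)     # 15,000,000 bytes
ratio = packed_bytes / raw_bytes            # 0.75
raw_mib = raw_bytes / 2**20                 # about 19.07 MiB
packed_mib = packed_bytes / 2**20           # about 14.31 MiB
```

The raw size of about 19.07 MiB is consistent with the "~19 MB" figure above if that figure is read as MiB.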
Questions
- Should this live as a codec or filter in numcodecs?
- How should bit-width metadata be handled (header vs external)?
- Is limiting initial scope to uint16 reasonable?
- Any guidance on aligning this with Zarr v3’s codec pipeline?