
Conversation

@giovp
Member

@giovp giovp commented Dec 18, 2025

Potential fix for #91

This PR introduces a new API called AnnDataField for parsing multiple anndata slots (layers, obsm, obs).

The way it works is that instead of specifying obs_key, the argument adata_fields accepts a mapping of key:AnnDataField(...), which returns the value of that field for the given anndata slice. It also provides a convert_fn that can be used for e.g. a label encoder or any kind of on-the-fly transformation.

If this is a desirable API, I think the current API for chunks_converted should also be handled inside AnnDataField, and the Loader should effectively only return a dictionary mapping each AnnDataField key to its value, such as {"counts": sp.csr_matrix, "label": np.ndarray, ...}.
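
A rough sketch of the intended usage (the attr/key argument names are illustrative and may not match the exact signature in this PR):

import numpy as np
from annbatch import AnnDataField, Loader  # AnnDataField is what this PR adds

adata_fields = {
    # pull the raw counts layer through unchanged
    "counts": AnnDataField(attr="layers", key="counts"),
    # pull a label column from obs and encode it on the fly via convert_fn
    "label": AnnDataField(
        attr="obs",
        key="cell_type",
        convert_fn=lambda col: np.unique(np.asarray(col), return_inverse=True)[1],
    ),
}
loader = Loader(..., adata_fields=adata_fields)  # or wherever adata_fields ends up being accepted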

@codecov

codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.93%. Comparing base (7eb124a) to head (eff85d6).

Files with missing lines   Patch %   Lines
src/annbatch/loader.py     88.88%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   90.90%   90.93%   +0.02%     
==========================================
  Files           5        6       +1     
  Lines         594      629      +35     
==========================================
+ Hits          540      572      +32     
- Misses         54       57       +3     
Files with missing lines   Coverage Δ
src/annbatch/__init__.py   100.00% <100.00%> (ø)
src/annbatch/fields.py     100.00% <100.00%> (ø)
src/annbatch/loader.py     92.46% <88.88%> (-0.63%) ⬇️

@ilan-gold
Collaborator

ilan-gold commented Dec 18, 2025

@giovp Can you comment on #99 as an API? My thought was that #94 would enable something like this paradigm:

shuffler = Shuffler('/path/to/shuffled/store')
shuffler.shuffle(my_adatas)
Loader(**settings).add_anndatas(shuffler.to_adata(lambda x: ad.AnnData(X=x.X, obs=x.obs[column_subset].to_memory(), var=pd.DataFrame(index=x.var.index.to_memory()))))

In other words, just use the anndata API - everything that is in obs will be iterated on, we can return the var column (as per your suggestion), and whatever is in X is the data of interest.

This would somewhat obviate the need for a schema because everything comes from the shuffler, which guarantees uniform processing as a property. We could also do:

Loader(**settings).add_shuffled(shuffler, lambda x: ad.AnnData(X=x.X, obs=x.obs[column_subset].to_memory(), var=pd.DataFrame(index=x.var.index.to_memory())))

which would really enforce the "schema" concept because a shuffler object represents a "uniformly processed" on-disk dataset.

For something like https://github.com/scverse/annbatch/pull/102/changes#diff-a828467ae280cb3ae472f4c3fa7fafe22f9e672829114f1ecccc14435fa205f8R51-R52 we could either just use the anndata API (anything found in X, layers or obsm is loaded) or use "linked" loaders, i.e., from my comment on #91:

Expose a RNG argument so two loaders can use RNGs seeded identically. This approach would force the user to create two objects, but would grossly simplify our internal code. I lean towards this because we can always create new APIs that compose two ZarrSparseDataset objects, for example, and handle this internally.

@ilan-gold
Collaborator

ilan-gold commented Dec 19, 2025

@giovp I completely forgot about this PR when I merged the other one, but I'm still not convinced of creating a new API for subsetting AnnData fields. I think we can just tell users to put the things in their AnnData objects that they want yielded. We could loosen #99 so that it doesn't look at X alone - it can just fetch everything in layers and obsm as well. WDYT? It would need a few more for-loops but seems doable. We could then yield AnnData objects instead of LoaderOutput. This is a bit problematic though because:

  1. AnnData doesn't (IMO rightly) support torch.Tensor (see Python Array API Compatibility Tracker pytorch/pytorch#58743 for why not)
  2. Returning an AnnData object or subsetting one may not be as performant as what we have here right now, and is arguably not super useful in my understanding (I'd be happy to be convinced otherwise - do people really train on both PCA and counts simultaneously, for example?). So I'd want to benchmark this before putting it in the codebase.

Not yielding AnnData objects is problematic though because, if we support both obsm and layers simultaneously, we end up in a situation where we need to track key provenance. So my main question is whether or not you really need to yield obsm and layers (for example) simultaneously. If not, then I think the other field stuff (like things in obs) is already covered in #99.

EDIT: I answered my own question: spliced + unspliced is a good example. So I think this becomes a question of return type. I don't think we need a fields API since we can just "use" whatever is in the input AnnData. But what we return becomes a more serious question.

@giovp
Member Author

giovp commented Dec 19, 2025

But what we return becomes a more serious question.

My answer to this is that it should return a mapping {key: torch.Tensor | np.ndarray}. It should also provide a way to transform the data arbitrarily on the fly, for instance to one-hot encode classes, normalize and transform data, convert types, etc.
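
For example, something like this (made-up data; the key names and the one_hot helper are only illustrative):

import numpy as np
import torch

def one_hot(labels: np.ndarray, n_classes: int) -> torch.Tensor:
    # on-the-fly transform: integer-encoded labels -> one-hot tensor
    return torch.nn.functional.one_hot(torch.as_tensor(labels, dtype=torch.long), n_classes)

# what a yielded batch could look like
batch: dict[str, torch.Tensor | np.ndarray] = {
    "counts": torch.randn(64, 2000),                      # e.g. a densified X chunk
    "label": one_hot(np.random.randint(0, 10, size=64), 10),
}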

Not yielding AnnData objects is problematic though because if we support both obsm and layers simultaneously, we end up in a situation where we need to track key provenance.

would it be a problem to track key provenance? Also, I don't think I understand why that is needed, since it can just happen at the iteration stage?

I share your two concerns: I don't think returning anndatas makes sense, and it's not going to be performant.

fields API since we can just "use" whatever is in the input AnnData

Yeah, I personally just think the Fields API is a clean way to implement that, as it's generic enough (it can take any combination of attr/key) and provides a way for on-the-fly transformations.

@selmanozleyen
Member

selmanozleyen commented Dec 20, 2025

If this is a desirable API, I think the current API for chunks_converted should also be handled inside AnnDataField, and the Loader should effectively only return a dictionary mapping each AnnDataField key to its value, such as {"counts": sp.csr_matrix, "label": np.ndarray, ...}.

I also think the current for loop is a bit too raw and can be generalized, and there is no reason we can't also extract from .obsm, .obsp, and layers. This reminded me of my graph ML pipeline project, so I will have a look at this for further extensibility.

@ilan-gold
Collaborator

So the reason I don't like fields is that anndata objects for use in the loader are going to have to be constructed manually - there's no way to open an anndata object with only zarr arrays/sparse dataset objects in one line (as of the latest anndata; that will change in the next release: scverse/anndata#2147). Setting that aside, I still think it's much simpler to have people just put their desired output keys in the input anndata objects (see code snippet #102 (comment)). If you want to return a column called categorical_1, put it in the input object, and if you don't, then don't put it there. We could start validating internally that all columns match in dtype etc. as anndata objects are added.

You were also mentioning a more complex API where the AnnDataField would have a transformation capability, but I don't see that here - that's a big enough design decision that I would want to see how it's implemented. In that case we would start to need a concept of a schema, because everything would have to conform to AnnDataField, which would need to be stored on the object and then interact with the outputs as well. I'm not opposed to that, but I'm not sure what it adds over having a validated-as-shuffled on-disk store (which will be a collection of anndata objects, all of which are guaranteed to look the same, thus obviating the need for a schema).

I think LoaderOutput could have some helper methods like LoaderOutput.one_hot_encode("my_categorical_column"). This doesn't really solve the issue of customizing these transformations, but I'd like to tackle that later perhaps.
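Roughly like this, assuming LoaderOutput behaves more or less like a mapping of keys to numpy arrays (a sketch of the helper, not actual LoaderOutput code):

import numpy as np

def one_hot_encode(output: dict[str, np.ndarray], key: str) -> np.ndarray:
    # turn an integer/categorical column of the batch into a one-hot matrix
    categories, codes = np.unique(output[key], return_inverse=True)
    encoded = np.zeros((codes.size, categories.size), dtype=np.float32)
    encoded[np.arange(codes.size), codes] = 1.0
    return encoded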

@selmanozleyen
Member

I still think it's much simpler to have people just put their desired output keys in the input anndata objects

This I agree with. I think we should come up with a standard way of describing a location, though, like "obsm/Xpca[:,50]". This could be represented as a class to be serialized, but the user wouldn't have to know about it. And there should also be a one-to-one conversion between that string and that object.
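
Something along these lines (a rough sketch; the exact string format would need to be agreed on - here I use "attr/key[:,:N]" for an optional column prefix so the string<->object roundtrip stays unambiguous):

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    attr: str                  # e.g. "obsm" or "layers"
    key: str                   # e.g. "X_pca"
    n_cols: int | None = None  # optional column limit, e.g. 50

    def __str__(self) -> str:
        s = f"{self.attr}/{self.key}"
        return s if self.n_cols is None else f"{s}[:,:{self.n_cols}]"

    @classmethod
    def parse(cls, spec: str) -> "Location":
        m = re.fullmatch(r"(\w+)/(\w+)(?:\[:,\s*:(\d+)\])?", spec)
        if m is None:
            raise ValueError(f"Cannot parse location {spec!r}")
        attr, key, n = m.groups()
        return cls(attr, key, int(n) if n else None)

# one-to-one conversion between the string and the object
assert str(Location.parse("obsm/X_pca[:,:50]")) == "obsm/X_pca[:,:50]"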

@ilan-gold
Collaborator

This I agree with. I think we should come up with a standard way of describing a location, though, like "obsm/Xpca[:,50]". This could be represented as a class to be serialized, but the user wouldn't have to know about it. And there should also be a one-to-one conversion between that string and that object.

Yup, internally we could do this so that AnnData -> fetched data from disk -> LoaderOutput is smooth and lossless

@giovp
Member Author

giovp commented Dec 22, 2025

You were also mentioning a more complex API where the AnnDataField would have a transformation capability but I don't see that here - that's a big enough design decision I would want to see how it's implemented.

mmh it is implemented as the lambda call in the AnnDataField definition

In that case we would start to need a concept of a schema because everything would have to conform to AnnDataField, which would need to be stored on the object and then interact with the outputs as well. I'm not opposed to that, but I'm not sure what it adds over having a validated-as-shuffled on-disk store

I don't think I understand this - why is it a problem that the AnnDataField should be stored in the object? It's effectively just a way to map the original obs_column to a new key that is going to be yielded alongside its value by the Loader class, as well as providing a lightweight way to perform a transformation on the tensor before yielding it (e.g. densification, mapping to categorical, etc.). This is quite standard in DL data pipelines.

I also don't think you need a schema for that. If this is true

(which will be a collection of anndata objects, all of which are guaranteed to look the same thus obviating the need for a schema).

then the AnnDataField API could just sit on top, and indeed no validation is needed.

If you want to return a column called categorical_1, put it in the input object, and if you don't, then don't put it there.

mmh sure, but it would require constantly writing to disk based on which categories you want to encode (or not encode). IMO, enabling a transformation on the fly makes it more convenient and flexible, since that transformation should be ephemeral and only dependent on the training pipeline, not necessarily part of the data written to disk.

As for the question on why we would want to remap key names, here's an example: let's say I have an anndata with layers: 'sum' and obs: 'sum'. If I don't get to decide how to map the anndata attr/key to the yielded dictionary, then there would be a conflict here.

Collaborator

@ilan-gold ilan-gold left a comment


I don't think I understand this - why is it a problem that the AnnDataField should be stored in the object? It's effectively just a way to map the original obs_column to a new key that is going to be yielded alongside its value by the Loader class, as well as providing a lightweight way to perform a transformation on the tensor before yielding it (e.g. densification, mapping to categorical, etc.). This is quite standard in DL data pipelines.

Been doing some thinking about this a bit more. I think there are two issues here - transformations and arbitrary key handling, as you point out. I think we can be even less prescriptive than fields - at the end of the day, we still want an add_dataset{s,} API to be independent of AnnData, at least until the issue of var spaces is resolved (since solving that will probably mean the internal state will hold a reference to the anndata object, at which point I would remove add_dataset{s,} - don't mind this).

So, we have a few things to handle:

  1. "Current behavior" where you add an anndata with X and obs and you get back a dict-per-batch. This should be the "default" (modulo other keys, see below)
  2. Future feature of multiple keys: when a user provides an AnnData with different keys in obs, we want to yield all or some of them. In any case, this point is just that we need the internal code to loop over fetching data from multiple on-disk stores per batch, setting aside the problem of how those stores are specified. By default, once this is supported, if you provide an AnnData with obsm["X_pca"] and X, you get both back.
  3. Building off of 2, you may want to be able to distinguish between different keys with custom assignment - do we return a dict with obsm/X_pca and X, just X_pca and X, etc.? What about labels, i.e., obs?
  4. Custom transformations: one-hot encoding, tokenization, etc.

To this end I would propose the following as the most specialized use-case (with all others reducing to this internally):

def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({"pca": adata.obsm["X_pca"], "counts": adata.X, "labels": adata.obs[["column_1", "column_2"]]})
    store.set_output_transform("pca", lambda x: x + 2)
    return store

for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch

The TrainingStore dict-like class returned from extract_obsm_X could then have the transformations on it. This TrainingStore's transforms could then be transferred to values of LoaderOutput internally. Something like:

# in_mem_data has all the customized keys specified in extract_obsm_X because _index_datasets would access self._store to do the fetching:
in_mem_data: LoaderOutput = zsync.sync(self._index_datasets(dataset_index_to_slices))
...
# Do transformations i.e., `transform` will just loop over the keys and apply the functions
in_mem_data = self._store.transform(in_mem_data)
...
yield in_mem_data

To get back to the point about add_dataset{,s}, we could then have

Loader(...).add_dataset(key, transform, value)

which internally updates the self._store: TrainingStore.
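
For concreteness, a minimal sketch of what such a dict-like TrainingStore could look like, based only on the methods used in the snippets above (not an actual implementation):

from collections.abc import Callable, Mapping
from typing import Any

class TrainingStore(Mapping):
    def __init__(self, datasets: dict[str, Any]):
        self._datasets = datasets                   # key -> backed/on-disk array
        self._transforms: dict[str, Callable] = {}  # key -> output transform

    def set_output_transform(self, key: str, fn: Callable) -> None:
        self._transforms[key] = fn

    def transform(self, batch: dict[str, Any]) -> dict[str, Any]:
        # apply the registered per-key transforms to an in-memory batch
        return {k: self._transforms.get(k, lambda x: x)(v) for k, v in batch.items()}

    # Mapping protocol so the store behaves like a dict of datasets
    def __getitem__(self, key): return self._datasets[key]
    def __iter__(self): return iter(self._datasets)
    def __len__(self): return len(self._datasets)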

List of :class:`zarr.Array` or :class:`anndata.abc.CSRDataset` objects, generally from :attr:`anndata.AnnData.X`.
They must all be of the same type and match that of any already added datasets.
obs
List of :class:`numpy.ndarray` labels, generally from :attr:`anndata.AnnData.obs`.
Collaborator


Out of date now - this is now a pandas dictionary on main

dataset = adata.X if layer_key is None else adata.layers[layer_key]
if not isinstance(dataset, BackingArray_T.__value__):
    raise TypeError(f"Found {type(dataset)} but only {BackingArray_T.__value__} are usable")
obs = adata.obs[obs_key].to_numpy() if obs_key is not None else None
Collaborator


Not sure this is what you want, though. We probably want to validate everything from obs as being categorical or numeric (i.e., something convertible to a DL-usable type). There is no validation on main, but that doesn't mean there shouldn't be!
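
For example, something like this (just a sketch of the kind of validation I mean):

import pandas as pd
from pandas.api.types import is_numeric_dtype

def validate_obs(obs: pd.DataFrame) -> None:
    # only accept columns that are categorical or numeric, i.e. convertible
    # to something a DL framework can consume
    for col in obs.columns:
        dtype = obs[col].dtype
        if not (isinstance(dtype, pd.CategoricalDtype) or is_numeric_dtype(dtype)):
            raise TypeError(f"obs column {col!r} has dtype {dtype}, expected categorical or numeric")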

@giovp
Member Author

giovp commented Jan 2, 2026

Ok, I think I get this implementation from the code snippet

def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({"pca": adata.obsm["X_pca"], "counts": adata.X, "labels": adata.obs[["column_1", "column_2"]]})
    store.set_output_transform("pca", lambda x: x + 2)
    return store

for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch

So then my understanding is that the TrainingStore should be implemented on top of the loader, correct? Shall I go ahead and give it a try?

@ilan-gold
Collaborator

ilan-gold commented Jan 2, 2026

@giovp Not on top but within! And I think we should go with your anndata fields idea on top of that (this way we get both add_dataset compat and a nice selector API), with a slight variation, instead of this functional API of having a second argument to add_anndatas that does the extraction into a dict directly. I think I had mentioned to you IRL that Phil and I were thinking about a schema-like API for anndata based on accessors from hv-anndata, so I'd like the selector API to look like that (this will also give us a chance to workshop that a bit here). So concretely, I would proceed in two PRs (you can close this one or not):

  1. Implement internal usage of TrainingStore on the Loader class which will enable AnnData objects with more than one "data store." Nothing would change in the current public API except LoaderOutput's keys. But the behavior of the Loader will change so that something like
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1":...}))
)

will no longer ignore obsm but instead internally initialize a TrainingStore with three keys, X, obs, and obsm/X_pca (I'm also not opposed to making every column of the dataframe its own key in the store). Then a further call like

loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1":...}))
)

will add X, obs, and obsm/X_pca internally from this different AnnData object (but one that has the same "schema") to the store, but a further call like

loader.add_anndata(
    AnnData(layers={"spliced": ...}, obsm={"X_scvi": ...}, obs=pd.DataFrame({"not_category_1": ...}))
)

will error out because the TrainingStore has been initialized with X, X_pca, and category_1, so any further deviation from that is not allowed (i.e., a different column name, or X_scvi instead of X_pca). The final LoaderOutput would have keys X, X_pca, labels, and indices (and in the default case it will still be data, labels, and indices, as is currently the case on main, i.e., only an AnnData with X and obs has been given to the Loader).

  2. In a follow-up PR, you would implement the selector mechanism. I am definitely on board with it now! Provided we follow the direction that such an API will probably take in AnnData. So instead of an AnnDataField class, we would just have a class A (see the hv-anndata repo) that does selection + transform. The API would look like:
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., layers={"normalized": ...}, obs=pd.DataFrame({"category_1":...})),
    selector={ "counts": A.X.with_transform(tokenize), "normalized": A.layers["normalized"].with_transform(tokenize), "category": A.obs["category_1"] }
)

but then also allows

loader = Loader(...)
loader.add_anndata(
    AnnData(layers={"Normalized": ..., "Counts": ...}, obs=pd.DataFrame({"category_1_but_different_name":...})),
    selector={ "counts": A.layers["Counts"].with_transform(tokenize), "normalized": A.layers["Normalized"].with_transform(tokenize), "category": A.obs["category_1_but_different_name"] }
)

which gives the selector-transform mechanism.

Each key in the selector dict argument is what the TrainingStore (from the first PR!) uses internally, and the data for that key is fetched from whatever the corresponding value of the selector dict points to. So "counts": A.layers["Counts"].with_transform(tokenize) means that the Loader will load the layer "Counts" from the input AnnData, apply the transform tokenize to the output data for each batch at that key, and put that output under the key counts in LoaderOutput.
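
To make that concrete, here is a rough sketch of how the selector dict could be resolved into the internal TrainingStore from the first PR (the accessor attributes attr/key/transform are assumptions about what the A objects would expose):

def build_store(adata, selector: dict) -> "TrainingStore":
    datasets, transforms = {}, {}
    for out_key, accessor in selector.items():
        container = getattr(adata, accessor.attr)  # e.g. adata.layers or adata.obs
        datasets[out_key] = container if accessor.key is None else container[accessor.key]
        if accessor.transform is not None:
            transforms[out_key] = accessor.transform
    store = TrainingStore(datasets)
    for key, fn in transforms.items():
        store.set_output_transform(key, fn)
    return store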

Does that make sense? Sorry it took so long, but I just needed some time to get my head around everything fully :)

Very open to feedback!
