
Conversation

@giovp
Member

@giovp giovp commented Dec 18, 2025

Potential fix for #91

This PR introduces a new API called AnnDataField for parsing multiple anndata slots (layers, obsm, obs).

The way it works is that instead of specifying obs_key, the argument adata_fields accepts a mapping of key:AnnDataField(...), which returns the value of that field for the given anndata slice. It also provides a convert_fn that can be used for e.g. a label encoder or any kind of on-the-fly transformation.

If this is a desirable API, I think the current API for chunks_converted should also be handled inside AnnDataField, and the Loader should effectively only return a dictionary mapping each AnnDataField key to its value, such as {"counts": sp.csr_matrix, "label": np.ndarray, ...}.
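
A rough sketch of the intended usage (the attr/key argument names are illustrative and may not match the exact signature in this PR):

import numpy as np
from annbatch import AnnDataField, Loader  # AnnDataField is what this PR adds

adata_fields = {
    # pull the raw counts layer through unchanged
    "counts": AnnDataField(attr="layers", key="counts"),
    # pull a label column from obs and encode it on the fly via convert_fn
    "label": AnnDataField(
        attr="obs",
        key="cell_type",
        convert_fn=lambda col: np.unique(np.asarray(col), return_inverse=True)[1],
    ),
}
loader = Loader(..., adata_fields=adata_fields)  # or wherever adata_fields ends up being accepted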

@codecov

codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.93%. Comparing base (7eb124a) to head (eff85d6).

Files with missing lines   Patch %   Lines
src/annbatch/loader.py     88.88%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   90.90%   90.93%   +0.02%     
==========================================
  Files           5        6       +1     
  Lines         594      629      +35     
==========================================
+ Hits          540      572      +32     
- Misses         54       57       +3     
Files with missing lines   Coverage Δ
src/annbatch/__init__.py   100.00% <100.00%> (ø)
src/annbatch/fields.py     100.00% <100.00%> (ø)
src/annbatch/loader.py     92.46% <88.88%> (-0.63%) ⬇️

@ilan-gold
Collaborator

ilan-gold commented Dec 18, 2025

@giovp Can you comment on #99 as an API? My thought was that #94 would enable something like this paradigm:

shuffler = Shuffler('/path/to/shuffled/store')
shuffler.shuffle(my_adatas)
Loader(**settings).add_anndatas(shuffler.to_adata(lambda x: ad.AnnData(X=x.X, obs=x.obs[column_subset].to_memory(), var=pd.DataFrame(index=x.var.index.to_memory()))))

In other words, just use the anndata API - everything that is in obs will be iterated on, we can return the var column (as per your suggestion), and whatever is in X is the data of interest.

This would somewhat obviate the need for a schema because everything comes from the shuffler, which guarantees uniform processing as a property. We could also do:

Loader(**settings).add_shuffled(shuffler, lambda x: ad.AnnData(X=x.X, obs=x.obs[column_subset].to_memory(), var=pd.DataFrame(index=x.var.index.to_memory())))

which would really enforce the "schema" concept because a shuffler object represents a "uniformly processed" on-disk dataset.

For something like https://github.com/scverse/annbatch/pull/102/changes#diff-a828467ae280cb3ae472f4c3fa7fafe22f9e672829114f1ecccc14435fa205f8R51-R52 we could either just use the anndata API (anything found in X, layers or obsm is loaded) or use "linked" loaders, i.e., from my comment on #91:

Expose a RNG argument so two loaders can use RNGs seeded identically. This approach would force the user to create two objects, but would grossly simplify our internal code. I lean towards this because we can always create new APIs that compose two ZarrSparseDataset objects, for example, and handle this internally.

@ilan-gold
Collaborator

ilan-gold commented Dec 19, 2025

@giovp I completely forgot about this PR when I merged the other one, but I'm still not convinced of creating a new API for subsetting AnnData fields. I think we can just tell users to put the things in their AnnData objects that they want yielded. We could loosen #99 so that it doesn't look at X alone - it can just fetch everything in layers and obsm as well. WDYT? It would need a few more for-loops but seems doable. We could then yield AnnData objects instead of LoaderOutput. This is a bit problematic though because:

  1. AnnData doesn't (IMO rightly) support torch.Tensor (see Python Array API Compatibility Tracker pytorch/pytorch#58743 for why not)
  2. Returning an AnnData object or subsetting one may not be as performant as what we have here right now, and is arguably not super useful in my understanding (I'd be happy to be convinced otherwise - do people really train on both PCA and counts simultaneously, for example?). So I'd want to benchmark this before putting it in the codebase.

Not yielding AnnData objects is problematic though because, if we support both obsm and layers simultaneously, we end up in a situation where we need to track key provenance. So my main question is whether or not you really need to yield obsm and layers (for example) simultaneously. If not, then I think the other field stuff (like things in obs) is already covered in #99.

EDIT: I answered my own question: spliced + unspliced is a good example. So I think this becomes a question of return type. I don't think we need a fields API since we can just "use" whatever is in the input AnnData. But what we return becomes a more serious question.

@giovp
Member Author

giovp commented Dec 19, 2025

But what we return becomes a more serious question.

My answer to this is that it should return a mapping {key: torch.Tensor | np.ndarray}. It should also provide a way to transform the data arbitrarily on the fly, for instance to one-hot encode classes, normalize and transform data, convert types, etc.
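
For example, something like this (made-up data; the key names and the one_hot helper are only illustrative):

import numpy as np
import torch

def one_hot(labels: np.ndarray, n_classes: int) -> torch.Tensor:
    # on-the-fly transform: integer-encoded labels -> one-hot tensor
    return torch.nn.functional.one_hot(torch.as_tensor(labels, dtype=torch.long), n_classes)

# what a yielded batch could look like
batch: dict[str, torch.Tensor | np.ndarray] = {
    "counts": torch.randn(64, 2000),                      # e.g. a densified X chunk
    "label": one_hot(np.random.randint(0, 10, size=64), 10),
}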

Not yielding AnnData objects is problematic though because if we support both obsm and layers simultaneously, we end up in a situation where we need to track key provenance.

would it be a problem to track key provenance? Also, I don't think I understand why that is needed, since it can just happen at the iteration stage?

I share your two concerns: I don't think returning anndatas makes sense, and it's not going to be performant.

fields API since we can just "use" whatever is in the input AnnData

Yeah, I personally just think the Fields API is a clean way to implement that, as it's generic enough (it can take any combination of attr/key) and provides a way for on-the-fly transformations.

@selmanozleyen
Member

selmanozleyen commented Dec 20, 2025

If this is a desirable API, I think the current API for chunks_converted should also be handled inside AnnDataField, and the Loader should effectively only return a dictionary mapping each AnnDataField key to its value, such as {"counts": sp.csr_matrix, "label": np.ndarray, ...}.

I also think the current for loop is a bit too raw and can be generalized, and there is no reason we can't also extract from .obsm, .obsp, and layers. This reminded me of my graph ML pipeline project, so I will have a look at this for further extensibility.

@ilan-gold
Collaborator

So the reason I don't like fields is that anndata objects for use in the loader are going to have to be constructed manually - there's no way to open an anndata object with only zarr arrays/sparse dataset objects in one line (as of the latest anndata; that will change in the next release: scverse/anndata#2147). Setting that aside, I still think it's much simpler to have people just put their desired output keys in the input anndata objects (see code snippet #102 (comment)). If you want to return a column called categorical_1, put it in the input object, and if you don't, then don't put it there. We could start validating internally that all columns match in dtype etc. as anndata objects are added.

You were also mentioning a more complex API where the AnnDataField would have a transformation capability, but I don't see that here - that's a big enough design decision that I would want to see how it's implemented. In that case we would start to need a concept of a schema, because everything would have to conform to AnnDataField, which would need to be stored on the object and then interact with the outputs as well. I'm not opposed to that, but I'm not sure what it adds over having a validated-as-shuffled on-disk store (which will be a collection of anndata objects, all of which are guaranteed to look the same, thus obviating the need for a schema).

I think LoaderOutput could have some helper methods like LoaderOutput.one_hot_encode("my_categorical_column"). This doesn't really solve the issue of customizing these transformations, but I'd like to tackle that later perhaps.
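Roughly like this, assuming LoaderOutput behaves more or less like a mapping of keys to numpy arrays (a sketch of the helper, not actual LoaderOutput code):

import numpy as np

def one_hot_encode(output: dict[str, np.ndarray], key: str) -> np.ndarray:
    # turn an integer/categorical column of the batch into a one-hot matrix
    categories, codes = np.unique(output[key], return_inverse=True)
    encoded = np.zeros((codes.size, categories.size), dtype=np.float32)
    encoded[np.arange(codes.size), codes] = 1.0
    return encoded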

@selmanozleyen
Member

I still think it's much simpler to have people just put their desired output keys in the input anndata objects

This I agree with. I think we should come up with a standard way of describing a location, though, like "obsm/Xpca[:,50]". This could be represented as a class to be serialized, but the user wouldn't have to know about it. And there should also be a one-to-one conversion between that string and that object.
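
Something along these lines (a rough sketch; the exact string format would need to be agreed on - here I use "attr/key[:,:N]" for an optional column prefix so the string<->object roundtrip stays unambiguous):

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    attr: str                  # e.g. "obsm" or "layers"
    key: str                   # e.g. "X_pca"
    n_cols: int | None = None  # optional column limit, e.g. 50

    def __str__(self) -> str:
        s = f"{self.attr}/{self.key}"
        return s if self.n_cols is None else f"{s}[:,:{self.n_cols}]"

    @classmethod
    def parse(cls, spec: str) -> "Location":
        m = re.fullmatch(r"(\w+)/(\w+)(?:\[:,\s*:(\d+)\])?", spec)
        if m is None:
            raise ValueError(f"Cannot parse location {spec!r}")
        attr, key, n = m.groups()
        return cls(attr, key, int(n) if n else None)

# one-to-one conversion between the string and the object
assert str(Location.parse("obsm/X_pca[:,:50]")) == "obsm/X_pca[:,:50]"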

@ilan-gold
Collaborator

This I agree with. I think we should come up with a standard way of describing a location, though, like "obsm/Xpca[:,50]". This could be represented as a class to be serialized, but the user wouldn't have to know about it. And there should also be a one-to-one conversion between that string and that object.

Yup, internally we could do this so that AnnData -> fetched data from disk -> LoaderOutput is smooth and lossless

@giovp
Member Author

giovp commented Dec 22, 2025

You were also mentioning a more complex API where the AnnDataField would have a transformation capability but I don't see that here - that's a big enough design decision I would want to see how it's implemented.

mmh it is implemented as the lambda call in the AnnDataField definition

In that case we would start to need a concept of a schema because everything would have to conform to AnnDataField, which would need to be stored on the object and then interact with the outputs as well. I'm not opposed to that, but I'm not sure what it adds over having a validated-as-shuffled on-disk store

I don't think I understand this - why is it a problem that the AnnDataField should be stored in the object? It's effectively just a way to map the original obs_column to a new key that is going to be yielded alongside its value by the Loader class, as well as providing a lightweight way to perform a transformation on the tensor before yielding it (e.g. densification, mapping to categorical, etc.). This is quite standard in DL data pipelines.

I also don't think you need a schema for that. If this is true

(which will be a collection of anndata objects, all of which are guaranteed to look the same thus obviating the need for a schema).

then the AnnDataField API could just sit on top, and indeed no validation is needed.

If you want to return a column called categorical_1, put it in the input object, and if you don't, then don't put it there.

mmh sure, but it would require constantly writing to disk based on which categories you want to encode (or not encode). IMO, enabling a transformation on the fly makes it more convenient and flexible, since that transformation should be ephemeral and only dependent on the training pipeline, not necessarily part of the data written to disk.

As for the question on why we would want to remap key names, here's an example: let's say I have an anndata with layers: 'sum' and obs: 'sum'. If I don't get to decide how to map the anndata attr/key to the yielded dictionary, then there would be a conflict here.

Collaborator

@ilan-gold ilan-gold left a comment


I don't think I understand this - why is it a problem that the AnnDataField should be stored in the object? It's effectively just a way to map the original obs_column to a new key that is going to be yielded alongside its value by the Loader class, as well as providing a lightweight way to perform a transformation on the tensor before yielding it (e.g. densification, mapping to categorical, etc.). This is quite standard in DL data pipelines.

Been doing some thinking about this a bit more. I think there are two issues here - transformations and arbitrary key handling, as you point out. I think we can be even less prescriptive than fields - at the end of the day, we still want an add_dataset{s,} API to be independent of AnnData, at least until the issue of var spaces is resolved (since solving that will probably mean the internal state will hold a reference to the anndata object, at which point I would remove add_dataset{s,} - don't mind this).

So, we have a few things to handle:

  1. "Current behavior" where you add an anndata with X and obs and you get back a dict-per-batch. This should be the "default" (modulo other keys, see below)
  2. Future feature of multiple keys: when a user provides an AnnData with different keys in obs, we want to yield all or some of them. In any case, this point is just that we need the internal code to loop over fetching data from multiple on-disk stores per batch, setting aside the problem of how those stores are specified. By default, once this is supported, if you provide an AnnData with obsm["X_pca"] and X, you get both back.
  3. Building off of 2, you may want to be able to distinguish between different keys with custom assignment - do we return a dict with obsm/X_pca and X, just X_pca and X, etc.? What about labels, i.e., obs?
  4. Custom transformations: one-hot encoding, tokenization, etc.

To this end I would propose the following as the most specialized use-case (with all others reducing to this internally):

def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({"pca": adata.obsm["X_pca"], "counts": adata.X, "labels": adata.obs[["column_1", "column_2"]]})
    store.set_output_transform("pca", lambda x: x + 2)
    return store

for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch

The TrainingStore dict-like class returned from extract_obsm_X could then have the transformations on it. This TrainingStore's transforms could then be transferred to values of LoaderOutput internally. Something like:

# in_mem_data has all the customized keys specified in extract_obsm_X because _index_datasets would access self._store to do the fetching:
in_mem_data: LoaderOutput = zsync.sync(self._index_datasets(dataset_index_to_slices))
...
# Do transformations i.e., `transform` will just loop over the keys and apply the functions
in_mem_data = self._store.transform(in_mem_data)
...
yield in_mem_data

To get back to the point about add_dataset{,s}, we could then have

Loader(...).add_dataset(key, transform, value)

which internally updates the self._store: TrainingStore.
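
For concreteness, a minimal sketch of what such a dict-like TrainingStore could look like, based only on the methods used in the snippets above (not an actual implementation):

from collections.abc import Callable, Mapping
from typing import Any

class TrainingStore(Mapping):
    def __init__(self, datasets: dict[str, Any]):
        self._datasets = datasets                   # key -> backed/on-disk array
        self._transforms: dict[str, Callable] = {}  # key -> output transform

    def set_output_transform(self, key: str, fn: Callable) -> None:
        self._transforms[key] = fn

    def transform(self, batch: dict[str, Any]) -> dict[str, Any]:
        # apply the registered per-key transforms to an in-memory batch
        return {k: self._transforms.get(k, lambda x: x)(v) for k, v in batch.items()}

    # Mapping protocol so the store behaves like a dict of datasets
    def __getitem__(self, key): return self._datasets[key]
    def __iter__(self): return iter(self._datasets)
    def __len__(self): return len(self._datasets)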

List of :class:`zarr.Array` or :class:`anndata.abc.CSRDataset` objects, generally from :attr:`anndata.AnnData.X`.
They must all be of the same type and match that of any already added datasets.
obs
List of :class:`numpy.ndarray` labels, generally from :attr:`anndata.AnnData.obs`.
Collaborator


Out of date now - this is now a pandas dictionary on main

dataset = adata.X if layer_key is None else adata.layers[layer_key]
if not isinstance(dataset, BackingArray_T.__value__):
    raise TypeError(f"Found {type(dataset)} but only {BackingArray_T.__value__} are usable")
obs = adata.obs[obs_key].to_numpy() if obs_key is not None else None
Collaborator


Not sure this is what you want, though. We probably want to validate everything from obs as being categorical or numeric (i.e., something convertible to a DL-usable type). There is no validation on main, but that doesn't mean there shouldn't be!
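
For example, something like this (just a sketch of the kind of validation I mean):

import pandas as pd
from pandas.api.types import is_numeric_dtype

def validate_obs(obs: pd.DataFrame) -> None:
    # only accept columns that are categorical or numeric, i.e. convertible
    # to something a DL framework can consume
    for col in obs.columns:
        dtype = obs[col].dtype
        if not (isinstance(dtype, pd.CategoricalDtype) or is_numeric_dtype(dtype)):
            raise TypeError(f"obs column {col!r} has dtype {dtype}, expected categorical or numeric")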

@giovp
Member Author

giovp commented Jan 2, 2026

Ok, I think I get this implementation from the code snippet

def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({"pca": adata.obsm["X_pca"], "counts": adata.X, "labels": adata.obs[["column_1", "column_2"]]})
    store.set_output_transform("pca", lambda x: x + 2)
    return store

for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch

So then my understanding is that the TrainingStore should be implemented on top of the loader, correct? Shall I go ahead and give it a try?

@ilan-gold
Collaborator

ilan-gold commented Jan 2, 2026

@giovp Not on top but within! And I think we should go with your anndata fields idea on top of that (this way we get both add_dataset compat and a nice selector API), with a slight variation, instead of this functional API of having a second argument to add_anndatas that does the extraction into a dict directly. I think I had mentioned to you IRL that Phil and I were thinking about a schema-like API for anndata based on accessors from hv-anndata, so I'd like the selector API to look like that (this will also give us a chance to workshop that a bit here). So concretely, I would proceed in two PRs (you can close this one or not):

  1. Implement internal usage of TrainingStore on the Loader class which will enable AnnData objects with more than one "data store." Nothing would change in the current public API except LoaderOutput's keys. But the behavior of the Loader will change so that something like
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1":...}))
)

will no longer ignore obsm but instead internally initialize a TrainingStore with three keys, X, obs, and obsm/X_pca (I'm also not opposed to making every column of the dataframe its own key in the store). Then a further call like

loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1":...}))
)

will add X, obs, and obsm/X_pca internally from this different AnnData object (but one that has the same "schema") to the store, but a further call like

loader.add_anndata(
    AnnData(layers={"spliced": ...}, obsm={"X_scvi": ...}, obs=pd.DataFrame({"not_category_1": ...}))
)

will error out because the TrainingStore has been initialized with X, X_pca, and category_1, so any further deviation from that is not allowed (i.e., a different column name, or X_scvi instead of X_pca). The final LoaderOutput would have keys X, X_pca, labels, and indices (and in the default case it will still be data, labels, and indices, as is currently the case on main, i.e., only an AnnData with X and obs has been given to the Loader).

  2. In a follow-up PR, you would implement the selector mechanism. I am definitely on board with it now! Provided we follow the direction that such an API will probably take in AnnData. So instead of an AnnDataField class, we would just have a class A (see the hv-anndata repo) that does selection + transform. The API would look like:
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., layers={"normalized": ...}, obs=pd.DataFrame({"category_1":...})),
    selector={ "counts": A.X.with_transform(tokenize), "normalized": A.layers["normalized"].with_transform(tokenize), "category": A.obs["category_1"] }
)

but then also allows

loader = Loader(...)
loader.add_anndata(
    AnnData(layers={"Normalized": ..., "Counts": ...}, obs=pd.DataFrame({"category_1_but_different_name":...})),
    selector={ "counts": A.layers["Counts"].with_transform(tokenize), "normalized": A.layers["Normalized"].with_transform(tokenize), "category": A.obs["category_1_but_different_name"] }
)

which gives the selector-transform mechanism.

Each key in the selector dict argument is what the TrainingStore (from the first PR!) uses internally, and the data for that key is fetched from whatever the corresponding value of the selector dict points to. So "counts": A.layers["Counts"].with_transform(tokenize) means that the Loader will load the layer "Counts" from the input AnnData, apply the transform tokenize to the output data for each batch at that key, and put that output under the key counts in LoaderOutput.
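
To make that concrete, here is a rough sketch of how the selector dict could be resolved into the internal TrainingStore from the first PR (the accessor attributes attr/key/transform are assumptions about what the A objects would expose):

def build_store(adata, selector: dict) -> "TrainingStore":
    datasets, transforms = {}, {}
    for out_key, accessor in selector.items():
        container = getattr(adata, accessor.attr)  # e.g. adata.layers or adata.obs
        datasets[out_key] = container if accessor.key is None else container[accessor.key]
        if accessor.transform is not None:
            transforms[out_key] = accessor.transform
    store = TrainingStore(datasets)
    for key, fn in transforms.items():
        store.set_output_transform(key, fn)
    return store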

Does that make sense? Sorry it took so long, but I just needed some time to get my head around everything fully :)

Very open to feedback!
