new API for parsing attrs and keys from anndata #102
base: main
Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
+ Coverage   90.90%   90.93%   +0.02%
==========================================
  Files           5        6       +1
  Lines         594      629      +35
==========================================
+ Hits          540      572      +32
- Misses         54       57       +3
==========================================
```
@giovp Can you comment on #99 as an API? My thought was that #94 would enable something like this paradigm:

```python
shuffler = Shuffler('/path/to/shuffled/store')
shuffler.shuffle(my_adatas)
Loader(**settings).add_anndatas(
    shuffler.to_adata(
        lambda x: ad.AnnData(
            X=x.X,
            obs=x.obs[column_subset].to_memory(),
            var=pd.DataFrame(index=x.var.index.to_memory()),
        )
    )
)
```

In other words, just use the anndata API - everything that is in `obs` will be iterated on and returned. This would somewhat obviate the need for a schema because everything comes from the input anndata:

```python
Loader(**settings).add_shuffled(
    shuffler,
    lambda x: ad.AnnData(
        X=x.X,
        obs=x.obs[column_subset].to_memory(),
        var=pd.DataFrame(index=x.var.index.to_memory()),
    ),
)
```

which would really enforce the "schema" concept. For something like https://github.com/scverse/annbatch/pull/102/changes#diff-a828467ae280cb3ae472f4c3fa7fafe22f9e672829114f1ecccc14435fa205f8R51-R52 we could just use the anndata API (anything found in the anndata).
@giovp I completely forgot about this PR when I merged the other one, but I'm still not convinced of creating a new API for subsetting. Not yielding …?

EDIT: I answered my own question: spliced + unspliced is a good example. So I think this becomes a question of return type. I don't think we need a fields API since we can just "use" whatever is in the input anndata.
My answer to this is that it should return a mapping
Would it be a problem to track key provenance? Also, I don't think I understand why that is needed, since it can just happen at the iter stage? I share your two concerns: I don't think returning anndatas makes sense, and it's not going to be performant.
Yeah, I personally just think the Fields API is a clean way to implement that, as it's generic enough (it can take any combination of attr/key) and provides a way for on-the-fly transformations.
I also think the current for loop is a bit too raw and can be generalized, and there is no reason we can't also extract from other slots such as `layers` or `obsm`.
So the reason I don't like fields is that anndata objects for use in the loader are going to have to be constructed manually - there's no way to open an anndata object with only zarr arrays/sparse dataset objects in one line (as of the latest anndata; that will change in the next release: scverse/anndata#2147). Setting that aside, I still think it's much simpler to have people just put their desired output keys in the input anndata objects (see the code snippet in #102 (comment)). If you want to return a column called …, just put it in the input anndata under that name. You were also mentioning a more complex API where the …; I think …
This I agree with. I think we should come up with a standard way of describing a location though, like "obsm/X_pca[:,50]". This could be represented as a class to be serialized, but the user wouldn't have to know about it. And there should also be a one-to-one conversion between that string and that object.
Yup, internally we could do this so that …
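As a hypothetical sketch of that location-descriptor idea (the class name `FieldLocation` and the `attr/key` string grammar are my assumptions, and slicing such as `[:,50]` is left out for brevity):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldLocation:
    """Hypothetical serializable pointer to an anndata slot, e.g. "obsm/X_pca"."""

    attr: str  # "X", "obs", "obsm", "layers", ...
    key: str | None = None  # e.g. "X_pca"; None for plain X

    def to_string(self) -> str:
        return self.attr if self.key is None else f"{self.attr}/{self.key}"

    @classmethod
    def from_string(cls, s: str) -> "FieldLocation":
        attr, _, key = s.partition("/")
        return cls(attr=attr, key=key or None)

    def resolve(self, adata):
        # Fetch the referenced slot from an AnnData-like object.
        obj = getattr(adata, self.attr)
        return obj if self.key is None else obj[self.key]


# one-to-one round trip between the string and the object:
assert FieldLocation.from_string("obsm/X_pca").to_string() == "obsm/X_pca"
```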
Mmh, it is implemented as the lambda call in the `AnnDataField` definition.
I don't think I understand this - why is it a problem that the `AnnDataField` should be stored in the object? It's effectively just a way to map the original obs column to a new key that is going to be yielded alongside its value by the `Loader` class. I also don't think you need a schema for that. If this is true, then the `AnnDataField` API could just sit on top, and indeed no validation is needed.
Mmh, sure, but it would require writing to disk constantly based on which categories you want to encode (or not encode). IMO enabling a transformation on the fly makes it more convenient and flexible, since that transformation should be ephemeral and only dependent on the training pipelines, not necessarily part of the data written to disk. As for the question of why we would want to remap key names, here's an example: let's say I have an anndata with …
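For illustration only, an ephemeral remap-and-encode transform along these lines might look like the following; the helper name `encode_obs_column` and the key names are hypothetical, and nothing is written back to disk:

```python
import numpy as np
import pandas as pd


def encode_obs_column(adata, obs_key: str, out_key: str) -> dict[str, np.ndarray]:
    """Hypothetical on-the-fly encoding: yield an obs column under a new key,
    mapping its categories to integer codes without modifying the stored anndata."""
    col = pd.Categorical(adata.obs[obs_key])
    return {out_key: col.codes.astype(np.int64)}


# e.g. yield adata.obs["cell_type"] to the model under the key "labels":
# batch.update(encode_obs_column(adata, "cell_type", "labels"))
```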
ilan-gold
left a comment
> I don't think I understand this, why is it a problem that the AnnDataField should be stored in the object? It's effectively just a way to map the original obs_column to a new key that is going to be yielded alongside its value, by the Loader class, as well as providing a lightweight way to perform a transformation on the tensor before yielding it (e.g. densification, map to categorical etc.). This is quite standard in DL data pipelines.
Been doing some thinking about this a bit more. I think there are two issues here - transformations and arbitrary key handling, as you point out. I think we can be even less prescriptive than fields - at the end of the day, we still want an add_dataset{s,} API to be independent of AnnData, at least until the issue of var spaces is resolved (since solving that will probably mean the internal state will hold a reference to the anndata object, at which point I would remove add_dataset{s,} - don't mind this).

So, we have a few things to handle:
- "Current behavior" where you add an anndata with
Xandobsand you get back a dict-per-batch. This should be the "default" (modulo other keys, see below) - Future feature of multiple keys When a user provides an
AnnDatawith different keys inobswe want to yield all or some of them. In any case, this point is just that we need the internal code to loop over fetching data from multiple on-disk stores per batch setting aside the problem of how those stores are specificied. By default, once this is supported, if you provide anAnnDatawithobsm["X_pca"]andX, you get both back. - Building off of 2, you may want to be able to distinguish between different keys with custom assignment - do we return a dict with
obsm/X_pcaandX, justX_pcaandXetc? What about labels i.e.,obs? - Custom transformations one hot encoding, tokenization etc.
To this end I would propose the following as the most specialized use-case (with all others reducing to this internally):
```python
def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({
        "pca": adata.obsm["X_pca"],
        "counts": adata.X,
        "labels": adata.obs[["column_1", "column_2"]],
    })
    store.set_output_transform("pca", lambda x: x + 2)
    return store


for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch
```

The `TrainingStore` dict-like class returned from `extract_obsm_X` could then have the transformations on it. This `TrainingStore`'s transforms could then be transferred to values of `LoaderOutput` internally. Something like:

```python
# in_mem_data has all the customized keys specified in extract_obsm_X because
# _index_datasets would access self._store to do the fetching:
in_mem_data: LoaderOutput = zsync.sync(self._index_datasets(dataset_index_to_slices))
...
# Do transformations, i.e. `transform` will just loop over the keys and apply the functions
in_mem_data = self._store.transform(in_mem_data)
...
yield in_mem_data
```

To get back to the point about add_dataset{,s}, we could then have `Loader(...).add_dataset(key, transform, value)`, which internally updates the `self._store: TrainingStore`.
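As a rough, hypothetical sketch of the dict-like `TrainingStore` described above (the class and its methods are only named in this discussion, not implemented in the codebase):

```python
from collections.abc import Callable, Mapping
from typing import Any


class TrainingStore(dict):
    """Maps output keys to backing arrays/columns and holds optional per-key transforms."""

    def __init__(self, data: Mapping[str, Any]):
        super().__init__(data)
        self._transforms: dict[str, Callable[[Any], Any]] = {}

    def set_output_transform(self, key: str, fn: Callable[[Any], Any]) -> None:
        # Register a transform applied to this key's batch value before yielding.
        self._transforms[key] = fn

    def transform(self, batch: Mapping[str, Any]) -> dict[str, Any]:
        # Apply each registered transform to the corresponding fetched batch value.
        return {k: self._transforms.get(k, lambda x: x)(v) for k, v in batch.items()}
```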
```
List of :class:`zarr.Array` or :class:`anndata.abc.CSRDataset` objects, generally from :attr:`anndata.AnnData.X`.
They must all be of the same type and match that of any already added datasets.
obs
List of :class:`numpy.ndarray` labels, generally from :attr:`anndata.AnnData.obs`.
```
Out of date now - this is now a pandas dictionary on main.
```python
dataset = adata.X if layer_key is None else adata.layers[layer_key]
if not isinstance(dataset, BackingArray_T.__value__):
    raise TypeError(f"Found {type(dataset)} but only {BackingArray_T.__value__} are usable")
obs = adata.obs[obs_key].to_numpy() if obs_key is not None else None
```
Not sure this is what you want, though. We probably want to validate everything from `obs` as being categorical or numeric (i.e., something convertible to a DL-usable type). There is no validation on main, but that doesn't mean there shouldn't be!
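As a hedged illustration (not code from this PR), validation along those lines could look roughly like this, assuming only numeric or categorical obs columns are accepted:

```python
import numpy as np
import pandas as pd


def validate_obs_column(col: pd.Series) -> np.ndarray:
    """Check that an obs column is numeric or categorical and return a DL-usable array."""
    if isinstance(col.dtype, pd.CategoricalDtype):
        # Integer codes are directly usable as labels.
        return col.cat.codes.to_numpy()
    if pd.api.types.is_numeric_dtype(col):
        return col.to_numpy()
    raise TypeError(
        f"obs column {col.name!r} has dtype {col.dtype}, "
        "but only numeric or categorical columns can be used"
    )
```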
Ok, I think I get this implementation from the code snippet:

```python
def extract_obsm_X(adata: AnnData) -> TrainingStore:
    store = TrainingStore({
        "pca": adata.obsm["X_pca"],
        "counts": adata.X,
        "labels": adata.obs[["column_1", "column_2"]],
    })
    store.set_output_transform("pca", lambda x: x + 2)
    return store


for batch in Loader(...).add_anndatas(adatas, extract_obsm_X):
    for k in ["pca", "counts", "labels"]:  # or "column_1" and "column_2" instead of "labels"
        assert k in batch
```

so then my understanding is that the `TrainingStore` should be implemented on top of the loader, correct? Shall I go ahead and give it a try?
@giovp Not on top but within! And I think we should go with your anndata fields idea on top of that (this way we get both the simple default behavior and full customization). So

```python
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1": ...}))
)
```

will no longer ignore `obsm` and `obs`,

```python
loader.add_anndata(
    AnnData(X=..., obsm={"X_pca": ...}, obs=pd.DataFrame({"category_1": ...}))
)
```

will add the new data to the already registered keys, and

```python
loader.add_anndata(
    AnnData(layers={"spliced": ...}, obsm={"X_scvi": ...}, obs=pd.DataFrame({"not_category_1": ...}))
)
```

will error out because the keys don't match what was already added.

Your fields idea then sits on top of this default as a `selector`:

```python
loader = Loader(...)
loader.add_anndata(
    AnnData(X=..., layers={"normalized": ...}, obs=pd.DataFrame({"category_1": ...})),
    selector={
        "counts": A.X.with_transform(tokenize),
        "normalized": A.layers["normalized"].with_transform(tokenize),
        "category": A.obs["category_1"],
    },
)
```

but then also allows

```python
loader = Loader(...)
loader.add_anndata(
    AnnData(layers={"Normalized": ..., "Counts": ...}, obs=pd.DataFrame({"category_1_but_different_name": ...})),
    selector={
        "counts": A.layers["Counts"].with_transform(tokenize),
        "normalized": A.layers["Normalized"].with_transform(tokenize),
        "category": A.obs["category_1_but_different_name"],
    },
)
```

which gives the selector-transform mechanism. Each key in the `selector` becomes a key in the yielded batch, regardless of where the data lives in the source anndata.

Does that make sense? Sorry it took so long, but I just needed some time to get my head around everything fully :) Very open to feedback!
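For concreteness, here is a minimal sketch (my own, not from this PR) of what the `A` accessor used in the `selector` snippets above could look like; the names `Field`, `_Accessor`, and `_Keyed` are hypothetical:

```python
from collections.abc import Callable
from typing import Any


class Field:
    """Points at one anndata slot and optionally applies a transform when yielding."""

    def __init__(self, attr: str, key: str | None = None,
                 transform: Callable[[Any], Any] | None = None):
        self.attr, self.key, self.transform = attr, key, transform

    def with_transform(self, fn: Callable[[Any], Any]) -> "Field":
        return Field(self.attr, self.key, fn)

    def __call__(self, adata) -> Any:
        value = getattr(adata, self.attr)
        if self.key is not None:
            value = value[self.key]
        return self.transform(value) if self.transform is not None else value


class _Keyed:
    """Sugar for keyed slots so A.layers["normalized"] yields a Field."""

    def __init__(self, attr: str):
        self._attr = attr

    def __getitem__(self, key: str) -> Field:
        return Field(self._attr, key)


class _Accessor:
    """Sugar so users can write A.X, A.layers["normalized"], A.obs["category_1"]."""

    @property
    def X(self) -> Field:
        return Field("X")

    @property
    def layers(self) -> _Keyed:
        return _Keyed("layers")

    @property
    def obs(self) -> _Keyed:
        return _Keyed("obs")


A = _Accessor()
# selector = {"counts": A.X.with_transform(tokenize), "category": A.obs["category_1"]}
```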
Potential fix to #91
This PR introduces a new API called `AnnDataField` for parsing multiple anndata slots (`layers`, `obsm`, `obs`).

The way it works is that instead of specifying `obs_key`, the argument `adata_fields` accepts a mapping of `key: AnnDataField(...)` which returns the value of that field for the given anndata slice. It also provides a `convert_fn` which can be used for e.g. a label encoder or any kind of on-the-fly transformation.

If this is a desirable API, I think the current API for `chunks_converted` should also be handled inside `AnnDataField`, and effectively the `Loader` should only return a dictionary of `{key: AnnDataField}` outputs, such as `{"counts": sp.csr_matrix, "label": np.ndarray, ...}`.
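For readers skimming the thread, a condensed, hypothetical sketch of the kind of field object described above (the actual `AnnDataField` implemented in this PR may differ in signature and details):

```python
from collections.abc import Callable
from typing import Any


class AnnDataField:
    """Extracts one slot (X, layers, obsm or obs) from an anndata slice and
    optionally applies convert_fn on the fly (e.g. a label encoder)."""

    def __init__(self, attr: str, key: str | None = None,
                 convert_fn: Callable[[Any], Any] | None = None):
        self.attr = attr
        self.key = key
        self.convert_fn = convert_fn

    def __call__(self, adata) -> Any:
        value = getattr(adata, self.attr)
        if self.key is not None:
            value = value[self.key]
        return self.convert_fn(value) if self.convert_fn is not None else value


# Possible usage of the adata_fields mapping (keys and convert_fn are illustrative):
# adata_fields = {
#     "counts": AnnDataField("X"),
#     "label": AnnDataField("obs", "cell_type", convert_fn=label_encoder),
# }
```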