distributed dataloader #100
base: main
Conversation
Codecov Report
❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   90.90%   90.07%    -0.84%
==========================================
  Files           5        6        +1
  Lines         594      675       +81
==========================================
+ Hits          540      608       +68
- Misses         54       67       +13
```
Also, in case you have time, could you look at the "cache efficient strategy" in the cellarium dataloader here: https://github.com/cellarium-ai/cellarium-ml/blob/58bc81b1e4ff51ceef51664bd99aed8af229b412/cellarium/ml/data/dadc_dataset.py#L193 and check whether you think it would make sense for the annbatch implementation? My understanding is that maybe it doesn't, since annbatch uses a different logic to yield the batches.
Hi @giovp, thanks for your input! I have a few questions. As far as I understand, you're sharding the input dataset across several loaders, right? @ilan-gold already implemented a similar strategy to support multi-process data loading for the torch DataLoader here. Does it make sense to somehow try to merge this? Continuing on this thought, @selmanozleyen is currently working on a sampler API. Maybe the best approach is to merge those efforts, e.g. use Selman's sampler API and then build a class on top of it? Let me know what you think!
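For reference, worker-level splitting for the torch DataLoader usually follows the get_worker_info pattern inside IterableDataset.__iter__; this is only a minimal sketch of that pattern (the class name and `chunks` attribute are made up), not the actual implementation linked above:

```python
from torch.utils.data import IterableDataset, get_worker_info

class ShardedStream(IterableDataset):
    """Minimal sketch of worker-level sharding; the class name and the
    `chunks` attribute are hypothetical, not the actual arrayloaders code."""

    def __init__(self, chunks):
        self.chunks = chunks  # e.g. a list of on-disk chunks of observations

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            my_chunks = self.chunks  # single-process data loading
        else:
            # each DataLoader worker takes an interleaved slice of the chunks
            my_chunks = self.chunks[info.id :: info.num_workers]
        for chunk in my_chunks:
            yield from chunk
```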
We also spoke about maybe just having a shared utility for this.
Thanks both for the comments!
This is at the worker level though, whereas this PR is concerned with rank-level distribution. By merging, do you mean refactoring? I'd be happy to take a look. But otherwise, they are two different levels of distributed streaming.
Mmh, you can't pass a DistributedSampler (in fact, you can't pass any Sampler to iterable-style datasets), and my understanding is that the …
Other frameworks would definitely be cool, but then you get into the rabbit hole of which types of parallelization to support, and in which framework. Interesting, but also possibly a fair amount of work.
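For what it's worth, the sampler restriction mentioned above is enforced by the torch DataLoader itself, which is why the rank split has to live inside the dataset or Loader. A small reproduction (the `Stream` class is just a stand-in):

```python
from torch.utils.data import DataLoader, IterableDataset
from torch.utils.data.distributed import DistributedSampler

class Stream(IterableDataset):
    def __iter__(self):
        yield from range(10)

# DistributedSampler only needs something with a length, plus an explicit
# rank/num_replicas (so no process group is required for this demo)...
sampler = DistributedSampler(list(range(10)), num_replicas=2, rank=0)

# ...but the DataLoader rejects any sampler for iterable-style datasets.
try:
    DataLoader(Stream(), sampler=sampler)
except ValueError as exc:
    print(exc)
```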
Yes, I think Felix's point is that "mask out some portion of the dataset" is a common operation, so there's no sense duplicating it - let's try to share a utility function or something.
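As a strawman for what such a shared utility could look like (the name and signature below are hypothetical, not existing code):

```python
import math
from typing import Sequence

def shard_indices(indices: Sequence[int], shard_id: int, num_shards: int) -> list[int]:
    """Hypothetical helper: the contiguous slice of `indices` that shard
    `shard_id` out of `num_shards` should read.

    The same helper could serve DataLoader workers and distributed ranks alike.
    """
    per_shard = math.ceil(len(indices) / num_shards)
    start = shard_id * per_shard
    return list(indices[start : start + per_shard])

# shard_indices(range(10), shard_id=1, num_shards=3)  -> [4, 5, 6, 7]
```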
Right, that's what Selman is doing. So here you would reimplement what you have as a sampler.
Right, that's why I'd like to understand the design space here a bit better. What is the minimal set of requirements we could build (either in a base sampler class or in this PR) to enable different kinds of distributed sampling? If this PR only implements one form, we should make that clear. Surely there must be some reasonably clear documentation on this stuff somewhere?
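For example, the minimal contract might be as small as the following; this is purely a sketch, not an existing or planned API:

```python
from typing import Iterator, Protocol

class DistributedSamplingStrategy(Protocol):
    """Hypothetical minimal interface for 'one form of distributed sampling'."""

    def set_epoch(self, epoch: int) -> None:
        """Reseed shuffling so all ranks draw consistent permutations per epoch."""
        ...

    def __iter__(self) -> Iterator[int]:
        """Yield the observation indices this rank should load in this epoch."""
        ...
```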
This is an attempt to wrap the distributed dataloader handling directly into the new `Loader` class. It does two things:

- `drop_last_indices`
- `pad_indices`: in this case, it pads the indices for uneven chunks with observations from the "start" of the dataset, which introduces duplicates over the epoch (see the sketch below).

Not sure if this is the type of ergonomics you are interested in supporting, but I personally find it useful. I'd be interested to hear whether you have other approaches for distributed training.
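For concreteness, the two behaviours amount to roughly the following; the function below is illustrative only (the option names come from the description above, but this is not the PR's code):

```python
import math

def rank_indices(n_obs: int, rank: int, world_size: int, *, pad: bool) -> list[int]:
    """Illustrative only: how the two options split an uneven dataset.

    pad=False ("drop_last_indices"): truncate so every rank sees the same count.
    pad=True  ("pad_indices"): wrap around to the start of the dataset, which
    duplicates a few observations within the epoch.
    """
    indices = list(range(n_obs))
    if pad:
        per_rank = math.ceil(n_obs / world_size)
        indices += indices[: per_rank * world_size - n_obs]  # pad from the start
    else:
        per_rank = n_obs // world_size
        indices = indices[: per_rank * world_size]  # drop the trailing remainder
    return indices[rank::world_size]

# 10 observations on 3 ranks:
#   pad=False -> 3 indices per rank (one observation dropped this epoch)
#   pad=True  -> 4 indices per rank (observations 0 and 1 appear twice)
```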
Maybe working towards #56