Might want to explore the scverse package "Shadows" #889
-
https://github.com/scverse/shadows

It is currently experimental, but my hope is that we could use it in situations where reading the entire dataset just to run a simple function is a bottleneck. One idea would be to get the dataset's shape (genes x cells) simply to report that information. We may also have applications for it in projections and plots, since it does not read the data until you actually access it. However, I won't know whether this saves time until I test it.
-
Timings with a 29 GB dataset in nemo-prod:

```python
import time

import anndata
from shadows import AnnDataShadow

testfile = "./fd179c25-bf62-4d90-8158-b8ac6da0b8da.h5ad"

def time_adata():
    # Full in-memory read
    start = time.time()
    adata = anndata.read_h5ad(testfile)
    print(adata.shape)
    end = time.time()
    print(end - start)

def time_adata_backed():
    # File-backed read: X stays on disk
    start = time.time()
    adata = anndata.read_h5ad(testfile, backed="r")
    print(adata.shape)
    end = time.time()
    print(end - start)

def time_shadow():
    # Shadow object: nothing is read until accessed
    start = time.time()
    adata = AnnDataShadow(testfile)
    print(adata.shape)
    end = time.time()
    print(end - start)

time_adata()         # 17.94s
time_adata_backed()  # 0.26s
time_shadow()        # 0.00096s
```

It seems there is a massive speed increase when switching from in-memory to file-backed, and another when switching from file-backed to the AnnDataShadow object. When I tested going from in-memory to file-backed in #885, I did not see a noticeable improvement in the API call time, but seeing how fast the "shadow" performed, I'd be intrigued to edit get_dataset_info.cgi again to use that.
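For get_dataset_info.cgi, the shape-only portion could look something like the following minimal sketch; the function name, parameter, and returned keys are hypothetical, not the current handler code:

```python
import json

from shadows import AnnDataShadow

def get_dataset_shape_info(h5ad_path):
    """Hypothetical helper: report dataset dimensions without loading X.

    shadow.shape follows the AnnData convention of (n_obs, n_vars).
    """
    shadow = AnnDataShadow(h5ad_path)
    n_obs, n_vars = shadow.shape
    return json.dumps({"num_obs": n_obs, "num_genes": n_vars})
```

The only AnnDataShadow call assumed here is `.shape`, which the timing test above already exercises.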
-
Timings with the dataset from #885 (28692 genes x 6794880 obs, but only 33M in file size).

The memory and file-backed methods taking roughly the same time makes sense, I think, since the backed setting only affects adata.X... This may also be a sparse dataset, whereas I believe the previous one was dense (see https://anndata.readthedocs.io/en/latest/generated/anndata.read_h5ad.html). It also looks like "shadows" may be using pyarrow under the hood (judging from the pip install), which is an engine Pandas can use in place of numpy if you specify it.
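As a sanity check on what backed mode actually defers, here is a minimal sketch (the file path is a placeholder for the #885 dataset):

```python
import anndata
from scipy import sparse

# Placeholder path; substitute the dataset from #885.
adata = anndata.read_h5ad("dataset.h5ad", backed="r")

print(adata.isbacked)    # True: X stays on disk until it is sliced
print(type(adata.X))     # a backed dataset wrapper, not an in-memory array
print(adata.obs.shape)   # obs and var are ordinary in-memory DataFrames either way

# Slicing pulls only that block off disk; whether the block comes back sparse
# or dense depends on how X was stored in the file.
block = adata.X[:100, :100]
print(sparse.issparse(block))
```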
-
Changing to Shadows improved the fetch time from 16.9s to 1.8s on the API call.
-
I'd love to play around and see what other things this could improve.
-
I read through the documentation, and I think there are some valid use cases. It lazy-loads parts of the AnnData structure as they are accessed, so if we only want to look at the observations, for example, using AnnDataShadow could save memory and time since we never bring up the X matrix. I'd be curious to see whether this could speed up the plotting functions or projectR, as I am unsure if
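A minimal sketch of that obs-only pattern, assuming the shadow mirrors the usual AnnData attribute names as its documentation describes (the "cluster" column is a hypothetical example):

```python
from shadows import AnnDataShadow

shadow = AnnDataShadow("./fd179c25-bf62-4d90-8158-b8ac6da0b8da.h5ad")

# Only the obs component gets read from the file; X is never touched.
obs = shadow.obs
print(obs.shape)
print(obs.columns.tolist())

# Hypothetical downstream use: per-cluster cell counts for a plot, still without X.
if "cluster" in obs.columns:
    print(obs["cluster"].value_counts())
```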