Might want to explore the scverse package "Shadows" #889
-
https://github.com/scverse/shadows

It is currently experimental, but my hope is that we could use it in situations where reading the entire dataset just to run a simple function is a bottleneck. One idea would be to get the dataset's shape (genes x cells) simply to report that information. We may also have applications for it in projections and plots, since it does not read the data until you actually access it. However, I won't know whether this saves time until I test it.
-
Timings with a 29 GB dataset in nemo-prod:

```python
import time

import anndata
from shadows import AnnDataShadow

testfile = "./fd179c25-bf62-4d90-8158-b8ac6da0b8da.h5ad"

def time_adata():
    # Full in-memory read
    start = time.time()
    adata = anndata.read_h5ad(testfile)
    print(adata.shape)
    end = time.time()
    print(end - start)

def time_adata_backed():
    # File-backed read: X stays on disk
    start = time.time()
    adata = anndata.read_h5ad(testfile, backed="r")
    print(adata.shape)
    end = time.time()
    print(end - start)

def time_shadow():
    # Shadow object: nothing is read until accessed
    start = time.time()
    adata = AnnDataShadow(testfile)
    print(adata.shape)
    end = time.time()
    print(end - start)

time_adata()         # 17.94s
time_adata_backed()  # 0.26s
time_shadow()        # 0.00096s
```

It seems there is a massive speed increase when switching from in-memory to file-backed, and another when switching from file-backed to the AnnDataShadow object. When I tested going from in-memory to file-backed in #885, I did not see a noticeable improvement in the API call time, but seeing how fast the "shadow" performed, I'd be intrigued to edit get_dataset_info.cgi again to use that.
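For get_dataset_info.cgi, the shape-only portion could look something like the following minimal sketch; the function name, parameter, and returned keys are hypothetical, not the current handler code:

```python
import json

from shadows import AnnDataShadow

def get_dataset_shape_info(h5ad_path):
    """Hypothetical helper: report dataset dimensions without loading X.

    shadow.shape follows the AnnData convention of (n_obs, n_vars).
    """
    shadow = AnnDataShadow(h5ad_path)
    n_obs, n_vars = shadow.shape
    return json.dumps({"num_obs": n_obs, "num_genes": n_vars})
```

The only AnnDataShadow call assumed here is `.shape`, which the timing test above already exercises.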
-
Timings with the dataset from #885 (28692 genes x 6794880 obs, but only 33M in file size).

The memory and file-backed methods taking roughly the same time makes sense, I think, since the backed setting only affects adata.X... This may also be a sparse dataset, whereas I believe the previous one was dense (see https://anndata.readthedocs.io/en/latest/generated/anndata.read_h5ad.html). It also looks like "shadows" may be using pyarrow under the hood (judging from the pip install), which is an engine Pandas can use in place of numpy if you specify it.
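As a sanity check on what backed mode actually defers, here is a minimal sketch (the file path is a placeholder for the #885 dataset):

```python
import anndata
from scipy import sparse

# Placeholder path; substitute the dataset from #885.
adata = anndata.read_h5ad("dataset.h5ad", backed="r")

print(adata.isbacked)    # True: X stays on disk until it is sliced
print(type(adata.X))     # a backed dataset wrapper, not an in-memory array
print(adata.obs.shape)   # obs and var are ordinary in-memory DataFrames either way

# Slicing pulls only that block off disk; whether the block comes back sparse
# or dense depends on how X was stored in the file.
block = adata.X[:100, :100]
print(sparse.issparse(block))
```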
-
Changing to Shadows improved the fetch time from 16.9s to 1.8s on the API call.
-
I'd love to play around and see what other things this could improve.
-
I read through the documentation, and I think there are some valid use cases. It lazy-loads parts of the AnnData structure as they are accessed, so if we only want to look at the observations, for example, using AnnDataShadow could save memory and time since we never bring up the X matrix. I'd be curious to see whether this could speed up the plotting functions or projectR, as I am unsure if
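A minimal sketch of that obs-only pattern, assuming the shadow mirrors the usual AnnData attribute names as its documentation describes (the "cluster" column is a hypothetical example):

```python
from shadows import AnnDataShadow

shadow = AnnDataShadow("./fd179c25-bf62-4d90-8158-b8ac6da0b8da.h5ad")

# Only the obs component gets read from the file; X is never touched.
obs = shadow.obs
print(obs.shape)
print(obs.columns.tolist())

# Hypothetical downstream use: per-cluster cell counts for a plot, still without X.
if "cluster" in obs.columns:
    print(obs["cluster"].value_counts())
```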