S3 Scan Source partition supplier creates partitions in memory and a failure causes no partitions to be created #4608

Open
graytaylor0 opened this issue Jun 6, 2024 · 3 comments · May be fixed by #5039
Labels: backlog, bug

Comments

@graytaylor0
Member

Is your feature request related to a problem? Please describe.
As a user of S3 scan, I have a bucket with 100 million objects. The current S3 scan source cannot handle this many objects because the partition supplier returns every object as a single in-memory list of partitions, which can lead to out-of-memory errors. Additionally, if the S3 scan supplier fails at any point, no partitions are created at all, because the supplier must return the full list before any partition is created in the coordination store.

Describe the solution you'd like
I would like the PartitionSupplier functions to be able to pass partitions back to the source coordinator for creation as they are discovered. Instead of holding all objects in memory during a scan, the call to create each partition would be made right after the corresponding object is found.
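
To make the difference concrete, here is a minimal, language-neutral sketch in Python. It is not Data Prepper's Java API: `InMemoryCoordinationStore`, `scan_objects`, `supply_as_list`, and `supply_streaming` are all hypothetical names invented for illustration, and the scan failure is simulated. It only shows that a supplier that returns a full list creates nothing when the scan fails partway, while a supplier that hands each partition to a callback keeps everything found before the failure.

```python
from typing import Callable, Iterable, Iterator, List


class InMemoryCoordinationStore:
    """Hypothetical stand-in for the coordination store."""

    def __init__(self) -> None:
        self.partitions: List[str] = []

    def create_partition(self, key: str) -> None:
        self.partitions.append(key)


def scan_objects(keys: Iterable[str], fail_at: int = -1) -> Iterator[str]:
    """Yield object keys one at a time; raise at index `fail_at` to simulate a scan error."""
    for index, key in enumerate(keys):
        if index == fail_at:
            raise RuntimeError(f"simulated scan failure at object {index}")
        yield key


def supply_as_list(keys: Iterable[str], fail_at: int = -1) -> List[str]:
    """Batch style: the whole list is built in memory before anything is created."""
    return list(scan_objects(keys, fail_at))


def supply_streaming(
    keys: Iterable[str],
    create_partition: Callable[[str], None],
    fail_at: int = -1,
) -> int:
    """Streaming style: each partition is handed to the callback as soon as it is found."""
    created = 0
    for key in scan_objects(keys, fail_at):
        create_partition(key)
        created += 1
    return created


keys = [f"logs/object-{i}" for i in range(5)]

# Batch style: the failure happens before the list is returned, so nothing is created.
batch_store = InMemoryCoordinationStore()
try:
    for key in supply_as_list(keys, fail_at=3):
        batch_store.create_partition(key)
except RuntimeError:
    pass

# Streaming style: the partitions found before the failure are already created.
streaming_store = InMemoryCoordinationStore()
try:
    supply_streaming(keys, streaming_store.create_partition, fail_at=3)
except RuntimeError:
    pass

print(len(batch_store.partitions), len(streaming_store.partitions))
```

The streaming variant also holds only one object key at a time, which is the property that matters for a bucket with 100 million objects. A real change would additionally need to decide how a restarted scan avoids recreating partitions that already exist; the sketch does not address that.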

@graytaylor0 changed the title from "S3 Scan Source partition supplier creates partitions in memory" to "S3 Scan Source partition supplier creates partitions in memory and a failure causes no partitions to be created" on Jun 6, 2024
@dlvenable added the bug label and removed the untriaged label on Jun 12, 2024
@dlvenable
Member

@graytaylor0, are you planning on working on this?

@graytaylor0
Member Author

@dlvenable I am not planning on working on this right now.

@dayandersen

I encountered what I think is this issue. Would there be anything in the CloudWatch logs that would let me verify whether I'm hitting this situation?

3 participants