feat: filemanager partitions #615
Conversation
```sql
select
    s3_object_id,
    -- This finds the first value in the set which represents the most up-to-date state.
    -- If ordered by the sequencer, the first row is the one that needs to have `is_current_state`
    -- set to 'true' only for `Created` events, as `Deleted` events are always non-current state.
    case when row_number() over (order by s3_object.sequencer desc) = 1 then
        event_type = 'Created'
    -- Set `is_current_state` to 'false' for all other rows, as this is now historical data.
    else
        false
    end as updated_state
from s3_object
where
    -- This should be fairly efficient as it's only targeting objects where `is_current_state` is true,
    -- or objects with the highest sequencer values (in case of an out-of-order event). This means that
    -- although there is a performance impact for running this on ingestion, it should be minimal with
    -- the right indexes.
    input.bucket = s3_object.bucket and
    input.key = s3_object.key and
    input.version_id = s3_object.version_id and
    -- This is an optimization which prevents querying against all objects in the set.
    (
        -- Only need to update current objects
        s3_object.is_current_state = true or
        -- Or objects where there is a newer sequencer than the one being processed.
        -- This is required in case an out-of-order event is encountered. This always
        -- includes the object being processed as it's required for the above row-logic.
        s3_object.sequencer >= input.sequencer
    )
```
I hope this makes sense! Let me know if I can improve the explanation or if it would be good to reduce some of the complexity. For example, that last nested `or` condition probably isn't necessary for correctness, but I think it helps with performance.
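To make the full statement concrete, here is a rough sketch of how this select could be wired into the update, where `input` is a CTE carrying the event being ingested (the placeholder values and the exact wrapping are illustrative, not the real query):

```sql
with input as (
    -- The event currently being ingested (placeholder values).
    select
        'bucket-1' as bucket,
        'key-1' as key,
        'version-1' as version_id,
        '0055AED6DCD90281E5' as sequencer
),
objects_to_update as (
    select
        s3_object_id,
        case when row_number() over (order by s3_object.sequencer desc) = 1 then
            event_type = 'Created'
        else
            false
        end as updated_state
    from s3_object, input
    where
        input.bucket = s3_object.bucket and
        input.key = s3_object.key and
        input.version_id = s3_object.version_id and
        (
            s3_object.is_current_state = true or
            s3_object.sequencer >= input.sequencer
        )
)
update s3_object
set is_current_state = objects_to_update.updated_state
from objects_to_update
where s3_object.s3_object_id = objects_to_update.s3_object_id;
```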
the performance impact of determining this is too great on every API call. This does incur a performance penalty for ingestion. However, it should be minimal as only other current records need to be considered. This is because records only need to transition from current to historical (and not the other way around). Records which are current represent a smaller subset of all records, meaning that only a smaller part of the dataset needs to be queried when ingesting.
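Since only the current subset needs to be touched at ingestion, the indexes hinted at above could be something like the following (index names and exact column choices are assumptions, not part of this PR):

```sql
-- Hypothetical indexes for the ingestion update: one covering the bucket/key/version
-- lookup (with the sequencer for the out-of-order case), and a small partial index
-- that only contains current records.
create index if not exists s3_object_bucket_key_version_idx
    on s3_object (bucket, key, version_id, sequencer);

create index if not exists s3_object_current_state_idx
    on s3_object (bucket, key, version_id)
    where is_current_state;
```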
Not sure I completely follow here....
Why do other records need to be queried when a new one is ingested?
from current to historical (and not the other way around)
How would objects restored from the archive bucket be handled?
Also, perhaps for future consideration, if handling the data within one table gets tricky, consider moving records between tables? E.g. a current_records vs deleted_records table? (I think the Django history functionality works like that)
Why do other records need to be queried when a new one is ingested?
A record starts as current, and then some logic needs to make it historical. For example, say a `Created` event "A" comes in for a given key/bucket. At that point in time, "A" has `is_current_state` set to true on ingestion. Then, another `Created` event "B" comes in for the same key/bucket, which represents overwriting that object. Now, "B" has `is_current_state` set to true given that it's the new current state. Something needs to flip `is_current_state` to false on "A", as it's no longer current. This requires querying the database, which is what this query does on ingestion.
This doesn't have to happen on ingestion; it could happen in another process some time afterwards. However, I think the cost of doing it at ingestion isn't too high, so I've added it to keep the state of the records accurate as soon as possible.
Does that make sense (I'll add this in the docs for the next PR)?
Also, perhaps for future consideration, if handling the data within one table gets tricky, consider moving records between tables?

Yeah, agreed, that's the idea with task 2 on #614. I'll do this by partitioning the table, which more or less creates two separate tables but allows queries to treat it like one table. When this is set up in postgres, rows are routed to the right partition automatically based on the partition key. This is why I'm adding the `is_current_state` column here, because it allows the table to be partitioned on that column.
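As a rough sketch (table and partition names here are hypothetical, and the columns are trimmed to the ones in the query above), the partitioning could look like:

```sql
-- Hypothetical, trimmed-down table definition partitioned on `is_current_state`.
create table s3_object (
    s3_object_id uuid not null,
    bucket text not null,
    key text not null,
    version_id text not null,
    sequencer text,
    event_type text not null,
    is_current_state boolean not null
) partition by list (is_current_state);

-- Postgres routes new rows to the matching partition on insert, and moves a row
-- to the other partition when `is_current_state` is updated.
create table s3_object_current partition of s3_object for values in (true);
create table s3_object_historical partition of s3_object for values in (false);
```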
Ah, yes. I forgot we are dealing with object versions....
Querying with bucket + key + version should be quick and efficient... all good!
Related to #614, finishes first task.

Changes

- Adds an `is_current_state` column to the database for later partitioning.
- Updates the ingester to set `is_current_state` when events are ingested.

I know this adds some overhead to the ingester, but it is performing pretty well in testing (< a few ms per insert), and I think it's worth it because it reduces the performance issues on the API side. If this ever becomes a problem for the ingester to handle, then it can always be run as a separate asynchronous process.