-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add public API definitions for surfacing data age #13138
base: main
Are you sure you want to change the base?
Conversation
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
||
// Information for the age of a collection of data. The unit for all the age | ||
// stats are in seconds. Data's age is defined as the time passed since the data | ||
// was initially written into the DB. Check `DataCollectionIsEmpty` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the structure becomes less accurate over time, since data with age 1000 seconds needs to be 2000 seconds in another 1000 seconds? Any reason not to use unix times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as unix time. This is what I plan to implement it:
At property recording time: get data's initial write time by converting its sequence number with the SeqnoToTimeMapping mapping, and store clock->current() - initial write time as its age at file creation time.
At age query time: since we also store another property file_creation_time
in table properties. I will use it to do something like data_current_age = clock->current() - file_creation_time + data_age_at_file_creation
I figured this is a more intuitive definition of data's age, if we directly use unix time, I'm not sure the definition of averaging this unix time age is as meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My worry is that the struct makes it a little too easy to "do the wrong thing" such as saving the information for minutes or hours or days and forgetting that it is out of date. I would be satisfied if the struct contains a field for the unix time that the ages are relative to. Or maybe it's better for the struct to store the data as unix times and expose the ages through public functions, perhaps relative to a specified "current time."
I see this as somewhat analogous to my argument for using UTC time in daily_offpeak_time_utc
. The simplest, most "convenient" API is usually best but not if it invites bugs. Prompting the user to think about the potential bug-prone piece (what is the reference point for these times?) is the best approach IMHO.
I figured this is a more intuitive definition of data's age, if we directly use unix time, I'm not sure the definition of averaging this unix time age is as meaningful.
(Assuming exclusion of "infinitely old"...) Yeah, it may not be obvious to general users that average/min/max absolute times is equivalent to average/min/max relative time (even though these are robust under linear transformations and then the reverse transformation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very detailed context! Yeah, unix time is really standard practices. The seemingly convenient age
API I was trying to provide is really unnecessary and dangerous design as you pointed out. Users can easily do a (now() - write_time) query to get age themselves. I should have been a non-general user myself to understand that absolute time and relative time work the same for these statistics, thanks for calling that out!
include/rocksdb/db.h
Outdated
// The former is enabled with `preserve_internal_time_seconds` and/or | ||
// `preclude_last_level_data_seconds`. The latter is enabled via | ||
// `CompactForTieringCollectorFactory`. | ||
virtual Status GetDataCollectionAgeInfoForLevels( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little niche to go in DB API. If we had a version of GetPropertiesOfAllTables that divided the data by level (rather than cross-refing with GetLiveFilesMetaData) then collating the data could be a utility function outside of DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I had a similar version that divide the functionality into a table property collecting API and a util API to convert that into data age. I was thinking not many other usage expect table properties per level. But that is a more general API. I think Zippy folks should be fine with it too. Thanks for the suggestion!
bcabcdd
to
0881f29
Compare
@jowlyzhang has updated the pull request. You must reimport the pull request before landing. |
0881f29
to
7738423
Compare
@jowlyzhang has updated the pull request. You must reimport the pull request before landing. |
@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
This PR adds the definition for the public APIs for surfacing data write time info. It only contains minimum implementation. The implementations will be in follow ups. I need to sync with customers if these public APIs meet their requirements and are easy to use. And make modifications accordingly before proceeding with implementations.
struct DataCollectionUnixWriteTimeInfo
is a struct for the unix write time info for a collection of dataDB::GetPropertiesOfTablesForLevels
returns table properties collection per levelGetDataCollectionUnixWriteTimeInfoForFile
returns the data write time info for a file.GetDataCollectionUnixWriteTimeInfoForLevels
returns the data write time info for levels.Follow ups:
Test plan:
No functional change, also follow ups should have tests covering the minimum implementation added in this PR.