Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add public API definitions for surfacing data age #13138

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jowlyzhang
Copy link
Contributor

@jowlyzhang jowlyzhang commented Nov 14, 2024

This PR adds the definition for the public APIs for surfacing data write time info. It only contains minimum implementation. The implementations will be in follow ups. I need to sync with customers if these public APIs meet their requirements and are easy to use. And make modifications accordingly before proceeding with implementations.

  • struct DataCollectionUnixWriteTimeInfo is a struct for the unix write time info for a collection of data
  • DB::GetPropertiesOfTablesForLevels returns table properties collection per level
  • GetDataCollectionUnixWriteTimeInfoForFile returns the data write time info for a file.
  • GetDataCollectionUnixWriteTimeInfoForLevels returns the data write time info for levels.
  • The user property names for recording write time stats in the user collected properties are defined.

Follow ups:

  • Implement collecting the write time related user table properties
  • Use the data write time info recorded in the table properties to implement these APIs

Test plan:
No functional change, also follow ups should have tests covering the minimum implementation added in this PR.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


// Information for the age of a collection of data. The unit for all the age
// stats are in seconds. Data's age is defined as the time passed since the data
// was initially written into the DB. Check `DataCollectionIsEmpty` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the structure becomes less accurate over time, since data with age 1000 seconds needs to be 2000 seconds in another 1000 seconds? Any reason not to use unix times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as unix time. This is what I plan to implement it:

At property recording time: get data's initial write time by converting its sequence number with the SeqnoToTimeMapping mapping, and store clock->current() - initial write time as its age at file creation time.

At age query time: since we also store another property file_creation_time in table properties. I will use it to do something like data_current_age = clock->current() - file_creation_time + data_age_at_file_creation

I figured this is a more intuitive definition of data's age, if we directly use unix time, I'm not sure the definition of averaging this unix time age is as meaningful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry is that the struct makes it a little too easy to "do the wrong thing" such as saving the information for minutes or hours or days and forgetting that it is out of date. I would be satisfied if the struct contains a field for the unix time that the ages are relative to. Or maybe it's better for the struct to store the data as unix times and expose the ages through public functions, perhaps relative to a specified "current time."

I see this as somewhat analogous to my argument for using UTC time in daily_offpeak_time_utc. The simplest, most "convenient" API is usually best but not if it invites bugs. Prompting the user to think about the potential bug-prone piece (what is the reference point for these times?) is the best approach IMHO.

I figured this is a more intuitive definition of data's age, if we directly use unix time, I'm not sure the definition of averaging this unix time age is as meaningful.

(Assuming exclusion of "infinitely old"...) Yeah, it may not be obvious to general users that average/min/max absolute times is equivalent to average/min/max relative time (even though these are robust under linear transformations and then the reverse transformation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very detailed context! Yeah, unix time is really standard practices. The seemingly convenient age API I was trying to provide is really unnecessary and dangerous design as you pointed out. Users can easily do a (now() - write_time) query to get age themselves. I should have been a non-general user myself to understand that absolute time and relative time work the same for these statistics, thanks for calling that out!

// The former is enabled with `preserve_internal_time_seconds` and/or
// `preclude_last_level_data_seconds`. The latter is enabled via
// `CompactForTieringCollectorFactory`.
virtual Status GetDataCollectionAgeInfoForLevels(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little niche to go in DB API. If we had a version of GetPropertiesOfAllTables that divided the data by level (rather than cross-refing with GetLiveFilesMetaData) then collating the data could be a utility function outside of DB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I had a similar version that divide the functionality into a table property collecting API and a util API to convert that into data age. I was thinking not many other usage expect table properties per level. But that is a more general API. I think Zippy folks should be fine with it too. Thanks for the suggestion!

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants