caching/multiple step pinned columns #305

Open
1 task done
paddymul opened this issue Oct 16, 2024 · 0 comments
Labels
enhancement New feature or request

Checks

  • I have checked that this enhancement has not already been requested

How would you categorize this request? You can select multiple if not sure.

Summary stats

Enhancement Description

When Buckaroo is used for exploratory data analysis, it is best to think of the different steps as a pipeline.

It is very useful to be able to compare different steps of the pipeline, particularly their summary stats, and to show those comparisons to the user. Some of these steps don't have an explicit representation in Buckaroo state.

Generally the flow goes
raw_df -> cleaned_df -> filtered_df -> lowcode_transformed_df -> transformed_df -> summary_stats

The cleaned_df -> filtered_df -> lowcode_transformed_df steps are all jumbled together, but by accident or by user-interaction convention, the user flow generally goes

raw_df -> summary_stats
raw_df -> cleaned_df -> summary_stats

Then
raw_df -> cleaned_df -> filtered_df -> summary_stats
or
raw_df -> cleaned_df -> transformed_df -> summary_stats

Finally, since it requires the most user interaction
raw_df -> cleaned_df -> lowcode_transformed_df -> summary_stats

I would like to be able to show the following types of pinned_rows, if they exist

"dtype" for all states; raw:dtype and cleaned:dtype are frequently different
"null_count" for all states, for the same reason

Histograms are very likely to change between the cleaned_df and filtered_df states. These should all be visible in the UI at once.
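
A minimal sketch of what per-state pinned rows could look like, using plain Python instead of Buckaroo or pandas. Everything here is an assumption made for illustration: the list-of-row-dicts stand-in for a dataframe, the toy `clean` step, and the `pinned_rows` helper are hypothetical names, not existing Buckaroo API.

```python
from collections import Counter

# Hypothetical stand-in for a dataframe: a list of row dicts.
raw_df = [
    {"age": "31", "city": "Oslo"},
    {"age": "n/a", "city": None},
    {"age": "45", "city": "Lima"},
]

def clean(rows):
    """Toy cleaning step: 'n/a' becomes None, digit strings become ints."""
    def fix(value):
        if value is None or value == "n/a":
            return None
        if isinstance(value, str) and value.isdigit():
            return int(value)
        return value
    return [{col: fix(value) for col, value in row.items()} for row in rows]

def dtype(values):
    """Most common Python type name among the non-null values."""
    kinds = Counter(type(v).__name__ for v in values if v is not None)
    return kinds.most_common(1)[0][0] if kinds else "empty"

def null_count(values):
    return sum(v is None for v in values)

STATS = {"dtype": dtype, "null_count": null_count}

def pinned_rows(states, stat_names):
    """Build one row per (state, stat) pair, labelled like 'raw:dtype'."""
    rows = {}
    for state_name, df in states.items():
        columns = list(df[0].keys()) if df else []
        for stat in stat_names:
            rows[f"{state_name}:{stat}"] = {
                col: STATS[stat]([row[col] for row in df]) for col in columns
            }
    return rows

rows = pinned_rows(
    {"raw": raw_df, "cleaned": clean(raw_df)},
    ["dtype", "null_count"],
)
```

With this toy data, the "age" column reads as `str` with no nulls in the raw state and as `int` with one null after cleaning, which is the kind of difference the pinned rows are meant to surface side by side.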

Thinking of this in terms of pure functions

Each state can be treated as a pure function of its inputs:

cleaned_df is a function of raw_df and cleaning_method
filtered_df is a function of raw_df, cleaning_method, and filter_args
transformed_df is a function of raw_df, cleaning_method, filter_args, and transform_method
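
The three definitions above can be written out directly. This is only a sketch under stated assumptions: a "dataframe" is a list of row dicts, `filter_args` is assumed to map a column name to a predicate, and the `strip_ints` and `double` helpers are made up for the example.

```python
# Hypothetical stand-in: a "dataframe" is a list of row dicts.

def cleaned_df(raw_df, cleaning_method):
    return cleaning_method(raw_df)

def filtered_df(raw_df, cleaning_method, filter_args):
    cleaned = cleaned_df(raw_df, cleaning_method)
    # Assumption: filter_args maps a column name to a predicate
    # that every kept row must satisfy.
    return [
        row for row in cleaned
        if all(pred(row[col]) for col, pred in filter_args.items())
    ]

def transformed_df(raw_df, cleaning_method, filter_args, transform_method):
    return transform_method(filtered_df(raw_df, cleaning_method, filter_args))

def strip_ints(rows):
    """Toy cleaning method: strip whitespace and parse ints."""
    return [{"n": int(row["n"].strip())} for row in rows]

def double(rows):
    """Toy transform method: double every value."""
    return [{"n": row["n"] * 2} for row in rows]

raw = [{"n": " 1 "}, {"n": "2"}, {"n": " 3"}]
at_least_two = {"n": lambda v: v >= 2}

cleaned = cleaned_df(raw, strip_ints)
filtered = filtered_df(raw, strip_ints, at_least_two)
transformed = transformed_df(raw, strip_ints, at_least_two, double)
```

Because none of the functions mutate `raw`, every intermediate state can be recomputed (or cached) from the same inputs.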

For configuration, we don't want to name pinned_rows by the full args of cleaning_method and filter_args.

Instead, we want to be able to add rows as

summary_stats[('cleaned', current)] or summary_stats[('cleaned', current), ('filtered', current)]

All of this dovetails quite nicely with a caching/memoization mechanism.
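
One way the symbolic keys and the cache could fit together, as a rough sketch rather than a proposed design. The `SummaryStats` class, the `CURRENT` sentinel, and the toy steps are all hypothetical; the "dataframe" is a plain list, and step arguments are assumed to be hashable so they can serve as cache keys.

```python
CURRENT = object()  # hypothetical sentinel: "whatever is selected right now"

class SummaryStats:
    """Memoizes each pipeline state under its fully resolved arguments."""

    def __init__(self, raw_df, steps, stat_func):
        self.raw_df = raw_df
        self.steps = steps          # step name -> function(df, arg) -> df
        self.stat_func = stat_func  # function(df) -> summary stats
        self.current = {}           # step name -> currently selected arg
        self._dfs = {}              # resolved key -> cached dataframe
        self.computations = 0       # cache misses, counted for illustration

    def _resolve(self, key):
        """Swap each CURRENT marker for the concrete selected argument."""
        return tuple(
            (step, self.current[step] if arg is CURRENT else arg)
            for step, arg in key
        )

    def _df_for(self, key):
        if not key:
            return self.raw_df
        if key not in self._dfs:
            step, arg = key[-1]
            # Upstream states are looked up (or computed) recursively.
            self._dfs[key] = self.steps[step](self._df_for(key[:-1]), arg)
            self.computations += 1
        return self._dfs[key]

    def __getitem__(self, key):
        # stats[('cleaned', CURRENT)] arrives as a single pair; wrap it so
        # it looks like the multi-step form.
        if key and isinstance(key[0], str):
            key = (key,)
        return self.stat_func(self._df_for(self._resolve(key)))

def clean(df, method):
    return [v for v in df if v is not None] if method == "drop_nulls" else df

def keep_at_least(df, minimum):
    return [v for v in df if v >= minimum]

stats = SummaryStats(
    [3, None, 8, 1],
    steps={"cleaned": clean, "filtered": keep_at_least},
    stat_func=len,
)
stats.current.update(cleaned="drop_nulls", filtered=2)

cleaned_len = stats[("cleaned", CURRENT)]
filtered_len = stats[("cleaned", CURRENT), ("filtered", CURRENT)]
```

The configuration only ever mentions `('cleaned', CURRENT)`, while the cache is keyed on the resolved arguments, so changing the current filter adds a new cache entry without invalidating the cleaned state.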

Pseudo Code Implementation

Uhm

Prior Art

?

@paddymul paddymul added the enhancement New feature or request label Oct 16, 2024