Child plugin wastes space? #62

p-zach · 2022-04-24T19:40:18Z

Using the setup here, a parent DB that has a lot of data would have that data duplicated in the child DB, taking up a lot of unnecessary storage, right? Is there any way to avoid this? Or do I have the wrong idea of how that data would be stored?

Answered by bmeares

bmeares · 2022-04-24T21:57:39Z

bmeares
Apr 24, 2022
Maintainer

@p-zach yes, each pipe has an underlying SQL table, so in your example the child would duplicate the parent because it only adds rows to the parent table.

One way to save space is to create a view that selects from the parent to act as the child. A view is basically an alias for a SELECT statement that behaves like a virtual table. For example, a view like this would mimic the child data from your example plugin:

CREATE VIEW plugin_test_derivative_a_deriv_1 AS
SELECT
    timestamp, random1, random2,
    (random1 * 2) AS deriv_random1,
    (random2 + 0.5) AS deriv_random2
FROM plugin_test_derivative_a;

You could even drop the table for the child plugin, remove its .sync(), and use the view as if it were the child's table. The tradeoff is that a view is not cached, so its query is recomputed on every read. For complex transformations, that gets expensive quickly.
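To make the view tradeoff concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than a Meerschaum instance. The table and column names come from the example above; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the parent pipe's table (sample values are invented).
conn.execute(
    "CREATE TABLE plugin_test_derivative_a "
    "(timestamp TEXT, random1 REAL, random2 REAL)"
)
conn.executemany(
    "INSERT INTO plugin_test_derivative_a VALUES (?, ?, ?)",
    [("2022-04-24 00:00:00", 1.0, 2.0), ("2022-04-24 01:00:00", 3.0, 4.0)],
)

# The view stores no rows of its own; its SELECT runs on every read.
conn.execute(
    """
    CREATE VIEW plugin_test_derivative_a_deriv_1 AS
    SELECT
        timestamp, random1, random2,
        (random1 * 2) AS deriv_random1,
        (random2 + 0.5) AS deriv_random2
    FROM plugin_test_derivative_a
    """
)
view_rows = conn.execute(
    "SELECT * FROM plugin_test_derivative_a_deriv_1 ORDER BY timestamp"
).fetchall()

# A new parent row is visible through the view right away, with no sync step.
conn.execute(
    "INSERT INTO plugin_test_derivative_a VALUES ('2022-04-24 02:00:00', 5.0, 6.0)"
)
view_count = conn.execute(
    "SELECT COUNT(*) FROM plugin_test_derivative_a_deriv_1"
).fetchone()[0]
```

Because nothing is copied, the derived columns cost no extra storage, but every read pays for the transformation again.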

In fact, if you prefer to write your transformations in SQL, you can register a pipe with sql:main as its connector and paste the above query (without the CREATE VIEW clause), essentially creating a view that automatically caches its results into a table. This has the benefit of automatically taking advantage of the datetime column, which will improve performance for complex queries. See the syncing reference page for more details, as well as my thesis research on syncing strategies.
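The idea of caching a query into a table and using the datetime column to fetch only new rows can be sketched with plain sqlite3. This is an illustration of the general technique, not Meerschaum's actual sync algorithm; the `deriv_cache` table and the `sync_cache` helper are hypothetical names.

```python
import sqlite3

# The view's query without the CREATE VIEW clause.
QUERY = """
SELECT
    timestamp, random1, random2,
    (random1 * 2) AS deriv_random1,
    (random2 + 0.5) AS deriv_random2
FROM plugin_test_derivative_a
"""

def cache_size(conn):
    return conn.execute("SELECT COUNT(*) FROM deriv_cache").fetchone()[0]

def sync_cache(conn):
    """Append rows newer than the newest cached timestamp; return rows added."""
    before = cache_size(conn)
    last = conn.execute("SELECT MAX(timestamp) FROM deriv_cache").fetchone()[0]
    if last is None:
        # Empty cache: run the full query once.
        conn.execute(f"INSERT INTO deriv_cache {QUERY}")
    else:
        # Only transform rows that arrived after the last sync.
        conn.execute(
            f"INSERT INTO deriv_cache SELECT * FROM ({QUERY}) WHERE timestamp > ?",
            (last,),
        )
    return cache_size(conn) - before

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plugin_test_derivative_a "
    "(timestamp TEXT, random1 REAL, random2 REAL)"
)
conn.execute(
    "CREATE TABLE deriv_cache (timestamp TEXT, random1 REAL, random2 REAL, "
    "deriv_random1 REAL, deriv_random2 REAL)"
)
conn.executemany(
    "INSERT INTO plugin_test_derivative_a VALUES (?, ?, ?)",
    [("2022-04-24 00:00:00", 1.0, 2.0), ("2022-04-24 01:00:00", 3.0, 4.0)],
)

first = sync_cache(conn)   # initial load of the whole parent table
second = sync_cache(conn)  # nothing new since the last sync
conn.execute(
    "INSERT INTO plugin_test_derivative_a VALUES ('2022-04-24 02:00:00', 5.0, 6.0)"
)
third = sync_cache(conn)   # only the newly arrived row
```

The cached table does use storage, unlike a plain view, but reads no longer recompute the transformation and each sync touches only rows past the last cached timestamp.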

Another space-saving pattern I have seen used in production with Meerschaum is to create a temporary Pipe (e.g. with sql:local as its instance) to act as a collection bucket and drop old rows after the derivative pipe has been synced. This risks losing data, but you control how long you hang onto the raw data. Here's a quick example of that:

from meerschaum.utils.typing import SuccessTuple
import meerschaum as mrsm
import datetime

def fetch(pipe: mrsm.Pipe, **kw) -> 'pd.DataFrame':
    ...

def process_raw_data(raw_df: 'pd.DataFrame') -> 'pd.DataFrame':
    ...

def sync(pipe: mrsm.Pipe, **kw) -> SuccessTuple:
    """
    Store the raw data in a temporary pipe.
    """
    ### You can use any string as the connector
    ### if the pipe will only be updated via `.sync(df)`.
    bucket = mrsm.Pipe('foo', 'bar', instance='sql:local')

    ### Grab the last datetime of the bucket.
    ### Will be `None` if no results are found.
    ### You can also use `params` if you wanted to filter by columns,
    ### and `newest=False` to get the oldest datetime value.
    last_sync_time = bucket.get_sync_time()

    raw_df = fetch(bucket, **kw)
    bucket.sync(raw_df)
    new_sync_time = bucket.get_sync_time()

    ### If you mutate your data, you might need to call `pipe.clear()`.
    ### Depending on your situation, you might be able to just use `raw_df`.
    processed_df = process_raw_data(
        bucket.get_data(begin=last_sync_time, end=new_sync_time)
    )
    
    success, msg = pipe.sync(processed_df)
    if not success:
        return success, msg
    
    ### You can use any bounds you like.
    ### Here we're only deleting rows older than 10 days.
    if new_sync_time is None:
        return True, "No data in the bucket to clear."
    return bucket.clear(end=new_sync_time - datetime.timedelta(days=10))
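The retention window described in the final comment (keep the last 10 days of raw data, drop anything older) comes down to one date subtraction. Here is a small standalone sketch of that arithmetic; `split_by_retention` is a hypothetical helper, the sample dates are invented, and treating the cutoff itself as kept is an assumption about how the bound is applied.

```python
import datetime

def split_by_retention(rows, newest, keep_days=10):
    """Split rows into (kept, dropped) around a cutoff of `newest - keep_days`."""
    cutoff = newest - datetime.timedelta(days=keep_days)
    kept = [row for row in rows if row["timestamp"] >= cutoff]
    dropped = [row for row in rows if row["timestamp"] < cutoff]
    return kept, dropped

# Newest raw row is from April 24, so the cutoff lands on April 14.
newest = datetime.datetime(2022, 4, 24)
rows = [{"timestamp": datetime.datetime(2022, 4, day)} for day in (1, 14, 20, 24)]
kept, dropped = split_by_retention(rows, newest)
```

Widening `keep_days` trades storage for a longer window in which raw data can still be reprocessed.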

I hope this helps! Meerschaum is flexible enough to give you a lot of different options for balancing between space and speed (trust me, I wrote a whole thesis on it!).

1 reply

p-zach Apr 25, 2022
Author

Awesome! Thanks for all the advice.
