Paging (aka chunking, batch) tools #29
Replies: 3 comments
-
Here's a method that can be useful to make a "chunked reads" version of an Azure-blob-backed store:

```python
def get_chunked_data(self, k, chunk_size=8192):
    blob_client = self._container_client.get_blob_client(blob=self._id_of_key(k))
    # Download the blob and return a StorageStreamDownloader
    stream_downloader = blob_client.download_blob()
    # Return an iterator that yields chunks of the data
    return stream_downloader.chunks(chunk_size)
```

The question is now: What is the general pattern here? We would like to have some general tools to handle these chunked-read situations. For example: if you have a new system, you need to specify how to get a chunk reader, and that reader is used in the …
-
Appendable

This is a very common and general use case: we want to append some streaming bytes to an open file, or upload a big file to some remote blob storage service (e.g. s3 or azure), etc. We'd like to be able to do things like:

```python
for chk in stream:
    s[k] += chk

# or, if some open/close management is needed:
with s[k]:
    for chk in stream:
        s[k].extend(chk)

# or
consume(map(s[k].append, chks))  # where `consume = lambda it: [_ for _ in it]`
```

It should be obvious from the code above that this ability to append to …

I'm thinking more along the lines of an …

Most probably this code will go in the already existing dol.appendable module.

Now to the question of … One thing is sure: we want to make it as close to the builtin types as possible. That is, we shouldn't align with file and DB operations, but with simple builtin types. Here we're mutating the …

The strange world of …
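To make the desired interface concrete, here's a minimal, hypothetical sketch (plain Python, not dol code) of a file-backed store whose values support `+=`, `.extend`/`.append`, and `with`-managed appends. All the names (`AppendableFile`, `AppendableFiles`) are made up for illustration, and bytes values plus a local directory are assumed as the backend.

```python
import os
import tempfile


class AppendableFile:
    """A value wrapper that appends bytes to the file at `path`."""

    def __init__(self, path):
        self.path = path
        self._fp = None  # an open file handle, if one is being managed

    def __enter__(self):
        # keep one handle open so many appends don't re-open the file each time
        self._fp = open(self.path, 'ab')
        return self

    def __exit__(self, *exc):
        self._fp.close()
        self._fp = None

    def extend(self, chk: bytes):
        if self._fp is not None:
            self._fp.write(chk)
        else:  # no managed handle: open, append, close
            with open(self.path, 'ab') as fp:
                fp.write(chk)

    append = extend  # for bytes, appending a chunk and extending coincide

    def __iadd__(self, chk: bytes):
        self.extend(chk)
        return self


class AppendableFiles:
    """A store whose values are AppendableFile objects rooted at `rootdir`."""

    def __init__(self, rootdir):
        self.rootdir = rootdir

    def __getitem__(self, k):
        return AppendableFile(os.path.join(self.rootdir, k))

    def __setitem__(self, k, v):
        # `s[k] += chk` ends with a set; the append already happened in __iadd__,
        # so writing an AppendableFile back is a no-op
        if not isinstance(v, AppendableFile):
            with open(os.path.join(self.rootdir, k), 'wb') as fp:
                fp.write(v)


s = AppendableFiles(tempfile.mkdtemp())
stream = [b'hello ', b'world']

for chk in stream:
    s['k'] += chk  # appends each chunk to the file behind 'k'

v = s['k']
with v:  # open once, append many times
    v.extend(b'!')

with open(os.path.join(s.rootdir, 'k'), 'rb') as fp:
    assert fp.read() == b'hello world!'
```

Note the wrinkle that `s[k] += chk` ends with a `__setitem__` call, which the store has to treat as a no-op; that's part of the strange world of augmented assignment on store values.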
-
Here's a proposal, which I'll push (or PR) tomorrow:

```python
from collections.abc import MutableMapping
from functools import partial
from operator import add


def read_add_write(store, key, iterable, add_iterables=add):
    """Retrieve the value stored at `key` (if any), add `iterable` to it, and write it back."""
    if key in store:
        store[key] = add_iterables(store[key], iterable)
    else:
        store[key] = iterable


class Extender:
    def __init__(
        self,
        store: MutableMapping,
        key,
        *,
        extend_store_value=read_add_write,
        append_method=None,
    ):
        self.store = store
        self.key = key
        self.extend_store_value = extend_store_value
        if append_method is not None:
            self.append = partial(append_method, self)

    def extend(self, iterable):
        """Extend the value stored at `self.key` with `iterable`."""
        return self.extend_store_value(self.store, self.key, iterable)

    def __iadd__(self, iterable):
        self.extend(iterable)
        return self  # return self so `x += ...` keeps `x` bound to the Extender


store = {'a': 'pple'}
# test normal extend
a_extender = Extender(store, 'a')
a_extender.extend('sauce')
assert store == {'a': 'pplesauce'}
# test creation (when key is not in store)
b_extender = Extender(store, 'b')
b_extender.extend('anana')
assert store == {'a': 'pplesauce', 'b': 'anana'}
# you can use the += operator too
b_extender += ' split'
assert store == {'a': 'pplesauce', 'b': 'anana split'}
# test append
# Need to define an append method that makes sense.
# Here, with strings, we can just call extend.
b_bis_extender = Extender(store, 'b', append_method=lambda self, obj: self.extend(obj))
b_bis_extender.append('s')
assert store == {'a': 'pplesauce', 'b': 'anana splits'}
# But if our "extend" values were lists, we'd need to have a different append method,
# one that puts the single object into a list, so that its sum with the existing list
# is a list.
store = {'c': [1,2,3]}
c_extender = Extender(store, 'c', append_method=lambda self, obj: self.extend([obj]))
c_extender.append(4)
assert store == {'c': [1,2,3,4]}
# And if the values were tuples, we'd have to put the single object into a tuple.
store = {'d': (1,2,3)}
d_extender = Extender(store, 'd', append_method=lambda self, obj: self.extend((obj,)))
d_extender.append(4)
assert store == {'d': (1,2,3,4)}
# Now, the default extend method is `read_add_write`, which retrieves the existing
# value, sums it to the new value, and writes it back to the store.
# If the values of your store have a sum defined (i.e. an `__add__` method),
# **and** that sum method does what you want, then you can use the default
# `extend_store_value` function.
# O ye numpy users, beware! The sum of numpy arrays is an elementwise sum,
# not a concatenation (you'd have to use `np.concatenate` for that).
import numpy as np
store = {'e': np.array([1,2,3])}
e_extender = Extender(store, 'e')
e_extender.extend(np.array([4,5,6]))
assert all(store['e'] == np.array([5,7,9]))
# This is what the `extend_store_value` function is for: you can pass it a function
# that does what you want.
store = {'f': np.array([1,2,3])}
def extend_store_value_for_numpy(store, key, iterable):
    store[key] = np.concatenate([store[key], iterable])
f_extender = Extender(store, 'f', extend_store_value=extend_store_value_for_numpy)
f_extender.extend(np.array([4,5,6]))
assert all(store['f'] == np.array([1,2,3,4,5,6]))
# WARNING: Note that the `extend_store_value` defined here doesn't accommodate
# the case where the key is not in the store. It is the user's responsibility to
# handle that aspect in the `extend_store_value` they provide.
# For your convenience, the default `read_add_write` (which **does** handle the
# non-existing key case, by simply writing the value to the store) has an
# `add_iterables` argument that can be set to whatever makes sense for your use case.
from functools import partial
store = {'g': np.array([1,2,3])}
extend_store_value_for_numpy = partial(
read_add_write, add_iterables=lambda x, y: np.concatenate([x, y])
)
g_extender = Extender(store, 'g', extend_store_value=extend_store_value_for_numpy)
g_extender.extend(np.array([4,5,6]))
assert all(store['g'] == np.array([1,2,3,4,5,6]))
# TODO: Continue this: Make a store that returns file objects, and make an Extender
# that appends to those files.
# TODO: Make this an actual class of dol.filesys
def append_to_file(store, key, iterable):
    with store[key] as f:
        for item in iterable:
            f.write(item)
from dol import temp_dir
# TODO: Show how to use `Extender` with `wrap_kvs` to make stores that return Extenders,
# thereby enabling efficient `store[key] += iterable` syntax,
# and explore how `store[key]` might be made to conserve its original type and behavior
# (discussing the pros and cons of each approach).
```
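Picking up the file-appending TODO above, here's a minimal sketch of a store whose values are file objects opened in append mode, used with `Extender` through `append_to_file`. `OpenFileStore` is a hypothetical helper (not a dol class), and a `tempfile` directory stands in for `dol.temp_dir`, whose exact behavior isn't assumed here.

```python
import os
import tempfile


class OpenFileStore:
    """Hypothetical store whose values are file objects opened in append mode."""

    def __init__(self, rootdir):
        self.rootdir = rootdir

    def __getitem__(self, k):
        # the open file is itself a context manager, which is all `append_to_file` needs
        return open(os.path.join(self.rootdir, k), 'a')

    def __contains__(self, k):
        return os.path.isfile(os.path.join(self.rootdir, k))


rootdir = tempfile.mkdtemp()
files = OpenFileStore(rootdir)

log_extender = Extender(files, 'log.txt', extend_store_value=append_to_file)
log_extender.extend(['hello ', 'world'])
log_extender.extend(['!'])  # each extend opens, appends, and closes the file

with open(os.path.join(rootdir, 'log.txt')) as f:
    assert f.read() == 'hello world!'
```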
-
The need for tools to carry out chunked iteration (or paging, or batch operations), for both reads and writes, comes up regularly. It shows up not only when we have big files, but in general whenever there's a lot of data and/or IO is expensive (for example, remote DBs). So our tooling needs to get something reusable for this soon.
Here's the discussion for that.
Some related use cases
Links
The `Stream` class was added to dol (well, py2store) to provide more fine-tuned control over iteration. Its original purpose was precisely to read large files. Namely, a store would return a `Stream` instance that could then be consumed, automatically taking care of any paging/chunking concerns, but also preparing, filtering... This `Stream` then grew up to be called `Creek`, in a namesake creek package that accumulated many other streaming tools. Other dol use cases here.