GitHub - ikegami-yukino/madoka-python: Memory-efficient Count-Min Sketch Counter (based on Madoka C++ library)

madoka

Madoka is an implementation of a Count-Min sketch data structure for summarizing data streams.

String-int pairs in a Madoka-Sketch may take less memory than in a standard Python dict, Counter, Redis.

Counting error rate is about 0.0911 %

More details are described in Benchmark.ipynb

This module is based on madoka C++ library.

NOTE: Madoka-Sketch does not have index of keys. so Madoka-Sketch can not dump all keys such as Python dict's dict.keys(). However, when set k parameter to costructer, most_common method (returns key and value as many as k) is available.

Contributions are welcome!

Installation

$ pip install madoka

Class

Madoka has some classes having same interface. These classes are vary in value data type. So you can choose for your purpose.

For example, if you wants to count float data, it's preferable to choose CroquisFloat class or CroquisDouble class.

Sketch - storing unsigned long long (64bit) and fast implementation
CroquisFloat - storing float (32bit)
CroquisDouble - storing double (64bit)
CroquisUint8 - storing unsigned char (8bit)
CroquisUint16 - storing unsigned short (16bit)
CroquisUint32 - storing unsigned int (32bit)
CroquisUint64 - storing unsigned long long (64bit)

Usage

From here, I will describe about Sketch class. But, Croquis classes have also same interfaces mostly. So you can use other classes by the same way as Sketch class. In that case, you should replace to intended class from "Sketch".

Create a new sketch

>>> import madoka
>>> sketch = madoka.Sketch()

Sketch madoka.Sketch([width=1048576, max_value=35184372088831, path='', flags=0, seed=0, k=5])
- width is a size of register. If you are worrying about gap, you should increase width value. The larger width is, the fewer mistakes madoka makes in estimating value. But, the larger width is, the larger memory consumption is.
- Permission of path should be 644
- k means Top-K used by most_common method. if you don't want to use most_common method, then I recommend to set k=0 so it is slightly fast.
- madoka.Sketch() calls madoka.Sketch.create(), so you don't have to explicitly call create() in initialization

Increment a key value

>>> sketch['mami'] += 1

or

>>> sketch.inc('mami')

int inc(key[, key_length=0])
- Note that key_length is automatically determined when not giving key_length. Thus, the order of parameters differs from original madoka C++ library.

Add a value to the current key value

>>> sketch['mami'] += 6

or

>>> sketch.add('mami', 6)

int add(key, value[, key_length=0])
- Note that key_length is automatically determined when not giving key_length. Thus, the order of parameters differs from original madoka C++ library.

Update a key value

>>> sketch['mami'] = 6

or

>>> sketch.set('mami', 6)

void set(key, value[, key_length=0])
- Note that set() does nothing when the given value is not greater than the current key value.
- Also note that the new value is saturated when the given value is greater than the upper limit.
- Additionally note that key_length is automatically determined when not giving key_length. Thus, the order of parameters differs from original madoka C++ library.

Get a key value

>>> sketch['mami']

or

>>> sketch.get('mami')

int get(key[, key_length=0])
- Note that key_length is automatically determined when not giving key_length. Thus, the order of parameters differs from original madoka C++ library.

Get all values

>>> sketch.values()

generator<int> values()
- Note that processing time increases according to sketch's width. But this method may be slow, so I recommend setting width to less than 1000000 when creating sketch.

Save a sketch to a file

>>> sketch.save('example.madoka')

void save(path)
- Permission of path should be 644

Load a sketch from a file

>>> sketch.load('example.madoka')

void load(path)
- Permission of path should be 644

Clear a sketch

>>> sketch.clear()

void clear()
- Delete all key-value pairs. It differs from create() in maintaining current settings.

Initialize a sketch with settings change

>>> sketch.create()

void create([width=0, max_value=0, path=NULL, flags=0, seed=0])
- Permission of file given to path should be 644

Copy a sketch

>>> sketch.copy(othersketch)

void copy(Sketch)

Merge two sketches

>>> sketch += other_sketch

or

>>> sketch.merge(othersketch)

void merge(Sketch[, lhs_filter=None, rhs_filter=None])
- lhs_filter is applied for self.sketch, rhs_filter is applied for given sketch

Shrink a sketch

>>> sketch.shrink(sketch, width=1000)

void shrink(Sketch[, width=0, max_value=0, filter=None, path=None, flags=0])
- When width > 0, width must be less than source sketch
- Permission of path should be 644

Get summed sketch

>>> summed_sketch = sketch + other_sketch

Create summed sketch, So it does not break original sketches

Get summed sketch by dict

>>> summed_sketch = sketch + {'mami': 1, 'kyoko': 2}

Create summed sketch, So it does not break original sketches

Check whether sketch contains key value

>>> 'mami' in sketch

Get inner product of two sketches

>>> sketch.inner_product(other_sketch)

list<float> inner_product(Sketch)
- Returns [inner product, square length of left hands sketch (float), square length of right hands sketch (float)]

Get median value

>>> sketch['madoka'] = 1
>>> sketch['mami'] = 2
>>> sketch['sayaka'] = 3
>>> sketch['kyouko'] = 4
>>> sketch['homura'] = 5
>>> sketch.median()  # => 3

int or float median()

Apply filter into all values

>>> sketch.filter(lambda x: x + 1)

void filter(Callable[, apply_zerovalue=False])
- If apply_zerovalue = True, filter_method is applied also 0 values (It may be slow) (from version 0.6 or later)
- Note that processing time increases according to sketch's width. If you feel this method is slow, I recommend setting width to less than 1000000 when creating sketch

Set values from dict

>>> sketch.fromdict({'mami': 14, 'madoka': 13})

or

>>> sketch += {'mami': 14, 'madoka': 13}

void fromdict(dict)

Get most common keys

>>> sketch.most_common()

generator most_common([k=5])
- returns key-value pair as many as k
- Note that this method is required to set k parameter in constructer.

License

Wrapper code is licensed under New BSD License.
Bundled madoka C++ library is licensed under the Simplified BSD License.

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
madoka		madoka
src		src
.coveragerc		.coveragerc
.gitignore		.gitignore
.travis.yml		.travis.yml
Benchmark.ipynb		Benchmark.ipynb
CHANGES.rst		CHANGES.rst
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
madoka.i		madoka.i
madoka_wrap.cxx		madoka_wrap.cxx
setup.py		setup.py
test_madoka.py		test_madoka.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

madoka

Installation

Class

Usage

Create a new sketch

Increment a key value

Add a value to the current key value

Update a key value

Get a key value

Get all values

Save a sketch to a file

Load a sketch from a file

Clear a sketch

Initialize a sketch with settings change

Copy a sketch

Merge two sketches

Shrink a sketch

Get summed sketch

Get summed sketch by dict

Check whether sketch contains key value

Get inner product of two sketches

Get median value

Apply filter into all values

Set values from dict

Get most common keys

License

About

Releases 3

Packages

Contributors 2

Languages

License

ikegami-yukino/madoka-python

Folders and files

Latest commit

History

Repository files navigation

madoka

Installation

Class

Usage

Create a new sketch

Increment a key value

Add a value to the current key value

Update a key value

Get a key value

Get all values

Save a sketch to a file

Load a sketch from a file

Clear a sketch

Initialize a sketch with settings change

Copy a sketch

Merge two sketches

Shrink a sketch

Get summed sketch

Get summed sketch by dict

Check whether sketch contains key value

Get inner product of two sketches

Get median value

Apply filter into all values

Set values from dict

Get most common keys

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Languages

Packages