Skip to content

Latest commit

 

History

History
354 lines (256 loc) · 13.5 KB

README.mkd

File metadata and controls

354 lines (256 loc) · 13.5 KB

Note: If you are using the 0.6.x series of Cassandra then get Pycassa 0.3 from the Downloads section and read the documentation contained within. This README applies to the current state of Pycassa which tracks Cassandra's development (work in-progress toward Cassandra 0.7).

pycassa

pycassa is a Cassandra library with the following features:

  1. Auto-failover single or thread-local connections and connection pooling
  2. A simplified version of the thrift interface
  3. A method to map an existing class to a Cassandra ColumnFamily.
  4. Support for SuperColumns

Documentation

While this readme includes a lot of information, the official and more thorough documentation can be found here:

http://pycassa.github.com/pycassa/

IRC

If you have any Cassandra questions, try the IRC channel #cassandra on irc.freenode.net

Requirements

thrift: http://incubator.apache.org/thrift/
Cassandra: http://cassandra.apache.org

To install thrift's python bindings:

easy_install thrift

pycassa comes with the Cassandra python files for convenience, but you can replace them with your own.

Installation

The simplest way to get started is to copy the pycassa directories to your program. If you want to install, run setup.py as a superuser.

python setup.py install

Connecting

All functions are documented with docstrings. To read usage documentation:

>>> import pycassa
>>> help(pycassa.ColumnFamily.get)

For a single connection (which is not thread-safe), pass a list of servers.

>>> client = pycassa.connect('Keyspace1') # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect('Keyspace1', ['localhost:9160'])

Framed transport is the default in Cassandra 0.7 and pycassa. You may disable it by passing framed_transport=False.

>>> client = pycassa.connect('Keyspace1', framed_transport=False)

Thread-local connections opens a connection for every thread that calls a Cassandra function. It also automatically balances the number of connections between servers, unless round_robin=False.

>>> client = pycassa.connect_thread_local('Keyspace1') # Defaults to connecting to the server at 'localhost:9160'
>>> client = pycassa.connect_thread_local('Keyspace1', ['localhost:9160', 'other_server:9160']) # Round robin connections
>>> client = pycassa.connect_thread_local('Keyspace1', ['localhost:9160', 'other_server:9160'], round_robin=False) # Connect in list order

Connections are robust to server failures. Upon a disconnection, it will attempt to connect to each server in the list in turn. If no server is available, it will raise a NoServerAvailable exception.

Timeouts are also supported and should be used in production to prevent a thread from freezing while waiting for Cassandra to return.

>>> client = pycassa.connect('Keyspace1', timeout=3.5) # 3.5 second timeout
(Make some pycassa calls and the connection to the server suddenly becomes unresponsive.)

Traceback (most recent call last):
...
pycassa.connection.NoServerAvailable

Note that this only handles socket timeouts. The TimedOutException from Cassandra may still be raised.

If simple authentication is in use, you can pass in a dict of credentials using the credentials keyword.

>>> credentials = {'username': 'jsmith', 'password': 'havebadpass'}
>>> client = pycassa.connect('Keyspace1', credentials=credentials)

Basic Usage

To use the standard interface, create a ColumnFamily instance.

>>> cf = pycassa.ColumnFamily(client, 'Test ColumnFamily')

The value returned by an insert is the timestamp used for insertion, or int(time.time() * 1e6). You may replace this function with your own (see Extra Documentation).

>>> cf.insert('foo', {'column1': 'val1'})
1261349837816957
>>> cf.get('foo')
{'column1': 'val1'}

Insert also acts to update values.

>>> cf.insert('foo', {'column1': 'val2'})
1261349910511572
>>> cf.get('foo')
{'column1': 'val2'}

You may insert multiple columns at once.

>>> cf.insert('bar', {'column1': 'val3', 'column2': 'val4'})
1261350013606860
>>> cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
>>> cf.get_count('bar')
2

get_range() returns an iterable. Call it with list() to convert it to a list.

>>> list(cf.get_range())
[('bar', {'column1': 'val3', 'column2': 'val4'}), ('foo', {'column1': 'val2'})]
>>> list(cf.get_range(row_count=1))
[('bar', {'column1': 'val3', 'column2': 'val4'})]

You can remove entire keys or just a certain column.

>>> cf.remove('bar', columns=['column1'])
1261350220106863
>>> cf.get('bar')
{'column2': 'val4'}
>>> cf.remove('bar')
1261350226926859
>>> cf.get('bar')
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()

pycassa retains the behavior of Cassandra in that get_range() may return removed keys for a while. Cassandra will eventually delete them, so that they disappear.

>>> cf.remove('foo')
>>> cf.remove('bar')
>>> list(cf.get_range())
[('bar', {}), ('foo', {})]

... After some amount of time

>>> list(cf.get_range())
[]

Class Mapping

You can also map existing classes using ColumnFamilyMap.

>>> class Test(object):
...     string_column       = pycassa.String(default='Your Default')
...     int_str_column      = pycassa.IntString(default=5)
...     float_str_column    = pycassa.FloatString(default=8.0)
...     float_column        = pycassa.Float64(default=0.0)
...     datetime_str_column = pycassa.DateTimeString() # default=None

The defaults will be filled in whenever you retrieve instances from the Cassandra server and the column doesn't exist. If, for example, you add columns in the future, you simply add the relevant column and the default will be there when you get old instances.

IntString, FloatString, and DateTimeString all use string representations for storage. Float64 is stored as a double and is native-endian. Be aware of any endian issues if you use it on different architectures, or perhaps make your own column type.

>>> Test.objects = pycassa.ColumnFamilyMap(Test, cf)

All the functions are exactly the same, except that they return instances of the supplied class when possible.

>>> t = Test()
>>> t.key = 'maptest'
>>> t.string_column = 'string test'
>>> t.int_str_column = 18
>>> t.float_column = t.float_str_column = 35.8
>>> from datetime import datetime
>>> t.datetime_str_column = datetime.now()
>>> Test.objects.insert(t)
1261395560186855

>>> Test.objects.get(t.key).string_column
'string test'
>>> Test.objects.get(t.key).int_str_column
18
>>> Test.objects.get(t.key).float_column
35.799999999999997
>>> Test.objects.get(t.key).datetime_str_column
datetime.datetime(2009, 12, 23, 17, 6, 3)

>>> Test.objects.multiget([t.key])
{'maptest': <__main__.Test object at 0x7f8ddde0b9d0>}
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7f8ddde0b710>]
>>> Test.objects.get_count(t.key)
7

>>> Test.objects.remove(t)
1261395603906864
>>> Test.objects.get(t.key)
Traceback (most recent call last):
...
cassandra.ttypes.NotFoundException: NotFoundException()

Note that, as mentioned previously, get_range() may continue to return removed rows for some time:

>>> Test.objects.remove(t)
1261395603756875
>>> list(Test.objects.get_range())
[<__main__.Test object at 0x7fac9c85ea90>]
>>> list(Test.objects.get_range())[0].string_column
'Your Default'

SuperColumns

To use SuperColumns, pass super=True to the ColumnFamily constructor.

>>> cf = pycassa.ColumnFamily(client, 'Test SuperColumnFamily', super=True)
>>> cf.insert('key1', {'1': {'sub1': 'val1', 'sub2': 'val2'}, '2': {'sub3': 'val3', 'sub4': 'val4'}})
1261490144457132
>>> cf.get('key1')
{'1': {'sub2': 'val2', 'sub1': 'val1'}, '2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.remove('key1', super_column='1')
1261490176976864
>>> cf.get('key1')
{'2': {'sub4': 'val4', 'sub3': 'val3'}}
>>> cf.get('key1', super_column='2')
{'sub3': 'val3', 'sub4': 'val4'}
>>> cf.multiget(['key1'], super_column='2')
{'key1': {'sub3': 'val3', 'sub4': 'val4'}}
>>> list(cf.get_range(super_column='2'))
[('key1', {'sub3': 'val3', 'sub4': 'val4'})]

You may also use a ColumnFamilyMap with SuperColumns:

>>> Test.objects = pycassa.ColumnFamilyMap(Test, cf)
>>> t = Test()
>>> t.key = 'key1'
>>> t.super_column = 'super1'
>>> t.string_column = 'foobar'
>>> t.int_str_column = 5
>>> t.float_column = t.float_str_column = 35.8
>>> t.datetime_str_column = datetime.now()
>>> Test.objects.insert(t)
>>> Test.objects.get(t.key)
{'super1': <__main__.Test object at 0x20ab350>}
>>> Test.objects.multiget([t.key])
{'key1': {'super1': <__main__.Test object at 0x20ab550>}}

These output values retain the same format as given by the Cassandra thrift interface.

Batch Mutations

The batch interface allows insert/update/remove operations to be performed in batches. This allows a convenient mechanism for streaming updates or doing a large number of operations while reducing number of RPC roundtrips.

Batch mutator objects are synchronized and can be safely passed around threads.

>>> b = cf.batch(queue_size=10)
>>> b.insert('key1', {'col1':'value11', 'col2':'value21'})
>>> b.insert('key2', {'col1':'value12', 'col2':'value22'}, ttl=15)
>>> b.remove('key1', ['col2'])
>>> b.remove('key2')
>>> b.send()

One can use the queue_size argument to control how many mutations will be queued before an automatic send is performed. This allows simple streaming of updates. If set to None, automatic checkpoints are disabled. Default is 100.

Supercolumns are supported:

>>> b = scf.batch()
>>> b.insert('key1', {'supercol1': {'colA':'value1a', 'colB':'value1b'}
                     {'supercol2': {'colA':'value2a', 'colB':'value2b'}})
>>> b.remove('key1', ['colA'], 'supercol1')
>>> b.send()

You may also create a batch mutator from a client instance, allowing operations on multiple column families:

>>> b = client.batch()
>>> b.insert(cf, 'key1', {'col1':'value1', 'col2':'value2'})
>>> b.insert(supercf, 'key1', {'subkey1': {'col1':'value1', 'col2':'value2'}})
>>> b.send()

Note: This interface does not implement atomic operations across coLUMN families. All the limitations of the batch_mutate Thrift API call applies. Remember, a mutation in Cassandra is always atomic per key per column family only.

Note: If a single operation in a batch fails, the whole batch fails.

In Python >= 2.5, mutators can be used as context managers, where an implicit send will be called upon exit.

>>> with cf.batch() as b:
>>>     b.insert('key1', {'col1':'value11', 'col2':'value21'})
>>>     b.insert('key2', {'col1':'value12', 'col2':'value22'})

Calls to insert and remove can also be chained:

>>> cf.batch().remove('foo').remove('bar').send()

Connection Pooling

pycassa offers several types of connection pools for different usage scenarios. These include:

  • QueuePool - typical connection pool that maintains a queue of open connections
  • SingletonThreadPool - one connection per thread
  • StaticPool - a single connection used for all operations
  • NullPool - no pooling is performed, but failover is supported
  • AssertionPool - asserts that at most one connection is open at a time; useful for debugging

To create a pool and use a connection:

>>> pool = pycassa.QueuePool(keyspace='Keyspace1')
>>> connection = pool.get()
>>> cf = pycassa.ColumnFamily(connection, 'Standard1')
>>> cf.insert('key', {'col': 'val'})
>>> connection.return_to_pool()

Automatic retries (or failover) are supported with all types of pools except for StaticPools. This means that if any operation fails, it will be transparently retried on other servers until it succeeds or a maximum number of failures is reached.

Advanced

pycassa currently returns Cassandra Columns and SuperColumns as python dictionaries. Sometimes, though, you care about the order of elements. If you have access to an ordered dictionary class (such as collections.OrderedDict in python 2.7), then you may pass it to the constructor. All returned values will be of that class.

>>> cf = pycassa.ColumnFamily(client, 'Test ColumnFamily',
                              dict_class=collections.OrderedDict)

You may also define your own Column types for the mapper. For example, the IntString may be defined as:

>>> class IntString(pycassa.Column):
...     def pack(self, val):
...         return str(val)
...     def unpack(self, val):
...         return int(val)
... 

Meta API

All of the underlying Cassandra interface functions are available through the connection.

>>> client = pycassa.connect()
>>> client.describe_version()
'8.1.0'
>>> client.describe_keyspaces()
['Test Keyspace', 'system']
>>> client.describe_keyspace('system')
{'LocationInfo': {'Type': 'Standard', 'CompareWith': 'org.apache.cassandra.db.marshal.UTF8Type', 'Desc': 'persistent metadata for the local node'}, 'HintsColumnFamily': {'CompareSubcolumnsWith': 'org.apache.cassandra.db.marshal.BytesType', 'Type': 'Super', 'CompareWith': 'org.apache.cassandra.db.marshal.UTF8Type', 'Desc': 'hinted handoff data'}}