Best method for using multiple columns? #3

Open · jeffb4 opened this issue Jan 17, 2013 · 3 comments

jeffb4 commented Jan 17, 2013

I'm writing Apache webserver logs to Cassandra using Flume and this Sink, and I would like to break log entries into various fields/columns (I already break them into fields with an interceptor).

Would the best/canonical method of doing this be to extend flume-ng-cassandra-sink with a serializer config directive, default said directive to the existing serializer, and then (for my needs) create a custom serializer that takes desired fields as a configuration option, and stuffs them into Cassandra as columns?
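For illustration, the contract I have in mind would be roughly this (a hypothetical sketch; this interface does not exist in the sink today, and the Hector Mutator hand-off is just my assumption about how the sink would delegate writes):

package com.btoddb.flume.sinks.cassandra;

import java.nio.ByteBuffer;

import org.apache.flume.Context;
import org.apache.flume.Event;

import me.prettyprint.hector.api.mutation.Mutator;

// Hypothetical pluggable serializer contract for the sink.
public interface CassandraEventSerializer {

    // Receives the sink's "serializer.*" sub-context at startup.
    void configure(Context context);

    // Turns one Flume event into the column writes for a single row.
    void serialize(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey);
}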


btoddb commented Jan 17, 2013

If you already have the fields broken out using an interceptor, I think you have what you need. Does the interceptor put the fields back into the payload of the LogEvent, or does it use attributes on the log event?

The sink is fairly basic, so it does need some work to be used in production. My thinking on this has always been to use an interceptor that takes the payload and converts it into standard JSON (or some other common format) so that any sink can interpret it as needed. For the Cassandra sink, that would mean parsing the JSON and storing each JSON property as a column.
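Roughly, such an interceptor would look like the sketch below (the class name and the naive whitespace parse are placeholders; a real one would parse per the source format and use a proper JSON library):

package com.example.flume;   // hypothetical package

import java.nio.charset.Charset;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Sketch: normalize any payload into a JSON object so downstream
// sinks only ever have to understand one format.
public class JsonPayloadInterceptor implements Interceptor {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // Split the raw payload into fields (whitespace split is only
        // a stand-in for real parsing) and re-emit it as a JSON object.
        String[] parts = new String(event.getBody(), UTF8).split("\\s+");
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) json.append(',');
            json.append("\"field").append(i).append("\":\"")
                .append(parts[i].replace("\"", "\\\"")).append('"');
        }
        event.setBody(json.append('}').toString().getBytes(UTF8));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors via a Builder named in config.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonPayloadInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}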

If you have an interceptor that does this, I'll take a pull request and include it in the code. I will probably be working on this again soon, so your question is timely :)

thx!



jeffb4 commented Jan 17, 2013

The interceptor (the default regex_extractor that ships with Flume) serializes the parsed-out data into event headers; I'm not familiar enough with Flume terminology to say whether that counts as LogEvent payload or attributes.
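For reference, my interceptor config is along these lines (the regex and header names here are illustrative, not my exact setup):

host1.sources.src1.interceptors = i1
host1.sources.src1.interceptors.i1.type = regex_extractor
# first few fields of a common-format Apache log line
host1.sources.src1.interceptors.i1.regex = ^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\]
host1.sources.src1.interceptors.i1.serializers = s1 s2 s3 s4
host1.sources.src1.interceptors.i1.serializers.s1.name = client_ip
host1.sources.src1.interceptors.i1.serializers.s2.name = ident
host1.sources.src1.interceptors.i1.serializers.s3.name = user
host1.sources.src1.interceptors.i1.serializers.s4.name = timestamp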

My thought (instead of converting to JSON and then back out of it) was something like:

host1.sinks.sink1.type = com.btoddb.flume.sinks.cassandra.CassandraSink
# default Cassandra Serializer (no Flume header magic)
# host1.sinks.sink1.serializer = com.btoddb.flume.sinks.cassandra.SimpleCassandraEventSerializer
# custom Cassandra Serializer (insert Flume header fields as columns)
host1.sinks.sink1.serializer = com.blah.ComplexCassandraEventSerializer

# map Flume headers "foo", "bar", "alpha", and "beta" to Cassandra columns of the same name
host1.sinks.sink1.serializer.fieldcolumns = foo bar alpha beta

As far as your plugin goes, the big difference would be the addition of the .serializer config option (defaulting to your current use of the ByteBufferSerializer out of Hector).
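The serializer itself would then be small; a sketch (com.blah.* and the CassandraEventSerializer contract are hypothetical, as above, and the "records" column family name is a placeholder):

package com.blah;

import java.nio.ByteBuffer;

import org.apache.flume.Context;
import org.apache.flume.Event;

import com.btoddb.flume.sinks.cassandra.CassandraEventSerializer;

import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Copies the configured Flume headers into same-named Cassandra columns.
public class ComplexCassandraEventSerializer implements CassandraEventSerializer {

    private String[] fieldColumns;

    @Override
    public void configure(Context context) {
        // "fieldcolumns" is the space-separated list from the config above
        fieldColumns = context.getString("fieldcolumns", "").split("\\s+");
    }

    @Override
    public void serialize(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey) {
        for (String name : fieldColumns) {
            String value = event.getHeaders().get(name);
            if (value != null) {
                mutator.addInsertion(rowKey, "records",
                        HFactory.createStringColumn(name, value));
            }
        }
    }
}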

If JSON/BSON were being written to more than MongoDB, or if Flume event headers weren't capable of storing columns, I could see a more generic JSON solution for in-flight data.


btoddb commented Jan 21, 2013

Thanks for pinging me. I had some other things ahead of this and wanted to understand a bit more (it's been a while since I've hit the code).

Yes, I think you're on to something, but instead of using the regex interceptor, maybe just supply a serializer that applies the regex directly to produce Cassandra columns? This would essentially mean that anyone could create a serializer to parse the Flume event into columns.

Taking it one step further, how about defining the conversion in configuration, as JSON or XML? The flow would be roughly (sketched in code after the list):

1 - Read the "conversion definition" key based on something in the Flume headers (source id, app id, hostname, etc.)
2 - Retrieve the conversion definition from a data store (maybe caching it)
3 - Execute the conversion, creating a single batch mutation
4 - Save the batch mutation to Cassandra
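In code, those four steps would be something like this (very rough; ConversionDefinition and the cache are made-up stand-ins for the "conversion definition" idea):

package com.btoddb.flume.sinks.cassandra;   // hypothetical placement

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.flume.Event;

import me.prettyprint.cassandra.serializers.ByteBufferSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class ConversionDrivenWriter {

    // Cached definitions, keyed by whatever header drives the lookup.
    private Map<String, ConversionDefinition> definitionCache;

    public void write(Event event, Keyspace keyspace, ByteBuffer rowKey) {
        // 1 - read the definition key from the flume headers
        String defKey = event.getHeaders().get("app-id");

        // 2 - retrieve the conversion definition (cached to avoid a
        //     data-store round trip per event)
        ConversionDefinition def = definitionCache.get(defKey);

        // 3 - execute the conversion, building a single batch mutation
        Mutator<ByteBuffer> mutator =
                HFactory.createMutator(keyspace, ByteBufferSerializer.get());
        def.apply(event, mutator, rowKey);

        // 4 - save the batch mutation to cassandra
        mutator.execute();
    }

    // Stand-in for the JSON/XML-driven conversion described above.
    public interface ConversionDefinition {
        void apply(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey);
    }
}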
