Best method for using multiple columns? #3

Open · jeffb4 opened this issue Jan 17, 2013 · 3 comments

jeffb4 commented Jan 17, 2013

I'm writing Apache webserver logs to Cassandra using Flume and this Sink, and I would like to break log entries into various fields/columns (I already break them into fields with an interceptor).

Would the best/canonical method of doing this be to extend flume-ng-cassandra-sink with a serializer config directive, default said directive to the existing serializer, and then (for my needs) create a custom serializer that takes desired fields as a configuration option, and stuffs them into Cassandra as columns?
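For illustration, the contract I have in mind would be roughly this (a hypothetical sketch; this interface does not exist in the sink today, and the Hector Mutator hand-off is just my assumption about how the sink would delegate writes):

package com.btoddb.flume.sinks.cassandra;

import java.nio.ByteBuffer;

import org.apache.flume.Context;
import org.apache.flume.Event;

import me.prettyprint.hector.api.mutation.Mutator;

// Hypothetical pluggable serializer contract for the sink.
public interface CassandraEventSerializer {

    // Receives the sink's "serializer.*" sub-context at startup.
    void configure(Context context);

    // Turns one Flume event into the column writes for a single row.
    void serialize(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey);
}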


btoddb commented Jan 17, 2013

If you already have the fields broken out using an interceptor, I think you have what you need. Does the interceptor put the fields back into the payload of the LogEvent, or does it use attributes on the log event?

The sink is fairly basic, so it does need some work to be used in production. My thinking on this has always been to use an interceptor that takes the payload and converts it into standard JSON (or some other common format) so that any sink can interpret it as needed. For the Cassandra sink, that would mean parsing the JSON and storing each JSON property as a column.
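Roughly, such an interceptor would look like the sketch below (the class name and the naive whitespace parse are placeholders; a real one would parse per the source format and use a proper JSON library):

package com.example.flume;   // hypothetical package

import java.nio.charset.Charset;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Sketch: normalize any payload into a JSON object so downstream
// sinks only ever have to understand one format.
public class JsonPayloadInterceptor implements Interceptor {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // Split the raw payload into fields (whitespace split is only
        // a stand-in for real parsing) and re-emit it as a JSON object.
        String[] parts = new String(event.getBody(), UTF8).split("\\s+");
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) json.append(',');
            json.append("\"field").append(i).append("\":\"")
                .append(parts[i].replace("\"", "\\\"")).append('"');
        }
        event.setBody(json.append('}').toString().getBytes(UTF8));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors via a Builder named in config.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new JsonPayloadInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}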

If you have an interceptor that does this, I'll take a pull request and include it in the code. I will probably be working on this again soon, so your question is timely :)

thx!



jeffb4 commented Jan 17, 2013

The interceptor (the default regex_extractor that ships with Flume) serializes the parsed-out data into event headers; I'm not familiar enough with Flume terminology to say whether that counts as LogEvent payload or attributes.
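For reference, my interceptor config is along these lines (the regex and header names here are illustrative, not my exact setup):

host1.sources.src1.interceptors = i1
host1.sources.src1.interceptors.i1.type = regex_extractor
# first few fields of a common-format Apache log line
host1.sources.src1.interceptors.i1.regex = ^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\]
host1.sources.src1.interceptors.i1.serializers = s1 s2 s3 s4
host1.sources.src1.interceptors.i1.serializers.s1.name = client_ip
host1.sources.src1.interceptors.i1.serializers.s2.name = ident
host1.sources.src1.interceptors.i1.serializers.s3.name = user
host1.sources.src1.interceptors.i1.serializers.s4.name = timestamp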

My thought (instead of converting to JSON and then back out of it) was something like:

host1.sinks.sink1.type = com.btoddb.flume.sinks.cassandra.CassandraSink
# default Cassandra Serializer (no Flume header magic)
# host1.sinks.sink1.serializer = com.btoddb.flume.sinks.cassandra.SimpleCassandraEventSerializer
# custom Cassandra Serializer (insert Flume header fields as columns)
host1.sinks.sink1.serializer = com.blah.ComplexCassandraEventSerializer

# map Flume headers "foo", "bar", "alpha", and "beta" to Cassandra columns of the same name
host1.sinks.sink1.serializer.fieldcolumns = foo bar alpha beta

As far as your plugin goes, the big difference would be the addition of the .serializer config option (defaulting to your current use of the ByteBufferSerializer out of Hector).
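The serializer itself would then be small; a sketch (com.blah.* and the CassandraEventSerializer contract are hypothetical, as above, and the "records" column family name is a placeholder):

package com.blah;

import java.nio.ByteBuffer;

import org.apache.flume.Context;
import org.apache.flume.Event;

import com.btoddb.flume.sinks.cassandra.CassandraEventSerializer;

import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Copies the configured Flume headers into same-named Cassandra columns.
public class ComplexCassandraEventSerializer implements CassandraEventSerializer {

    private String[] fieldColumns;

    @Override
    public void configure(Context context) {
        // "fieldcolumns" is the space-separated list from the config above
        fieldColumns = context.getString("fieldcolumns", "").split("\\s+");
    }

    @Override
    public void serialize(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey) {
        for (String name : fieldColumns) {
            String value = event.getHeaders().get(name);
            if (value != null) {
                mutator.addInsertion(rowKey, "records",
                        HFactory.createStringColumn(name, value));
            }
        }
    }
}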

If JSON/BSON were being written to more than MongoDB, or if Flume event headers weren't capable of storing columns, I could see a more generic JSON solution for in-flight data.


btoddb commented Jan 21, 2013

Thanks for pinging me. I had some other things ahead of this and wanted to understand a bit more (it's been a while since I've hit the code).

Yes, I think you're on to something, but instead of using the regex interceptor, maybe just supply a serializer that applies the regex directly to produce Cassandra columns? This would essentially mean that anyone could create a serializer to parse the Flume event into columns.

Taking it one step further, how about defining the conversion in configuration, as JSON or XML? The flow would be roughly (sketched in code after the list):

1 - Read the "conversion definition" key based on something in the Flume headers (source id, app id, hostname, etc.)
2 - Retrieve the conversion definition from a data store (maybe caching it)
3 - Execute the conversion, creating a single batch mutation
4 - Save the batch mutation to Cassandra
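In code, those four steps would be something like this (very rough; ConversionDefinition and the cache are made-up stand-ins for the "conversion definition" idea):

package com.btoddb.flume.sinks.cassandra;   // hypothetical placement

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.flume.Event;

import me.prettyprint.cassandra.serializers.ByteBufferSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class ConversionDrivenWriter {

    // Cached definitions, keyed by whatever header drives the lookup.
    private Map<String, ConversionDefinition> definitionCache;

    public void write(Event event, Keyspace keyspace, ByteBuffer rowKey) {
        // 1 - read the definition key from the flume headers
        String defKey = event.getHeaders().get("app-id");

        // 2 - retrieve the conversion definition (cached to avoid a
        //     data-store round trip per event)
        ConversionDefinition def = definitionCache.get(defKey);

        // 3 - execute the conversion, building a single batch mutation
        Mutator<ByteBuffer> mutator =
                HFactory.createMutator(keyspace, ByteBufferSerializer.get());
        def.apply(event, mutator, rowKey);

        // 4 - save the batch mutation to cassandra
        mutator.execute();
    }

    // Stand-in for the JSON/XML-driven conversion described above.
    public interface ConversionDefinition {
        void apply(Event event, Mutator<ByteBuffer> mutator, ByteBuffer rowKey);
    }
}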
