This project provides a relay that accepts metrics in various formats (initially the Graphite Line Protocol) and forwards them over various transports.
Early proof of concept. Compiles, but has not been tested extensively.
General:
- Don't store metrics forever in queues in case destination is unavailable
- Offload queue overflows to disk
- Internal stats
- Extended stats
Benchmark:
- Provide simple but configurable load generator
- Load generator should be fast (at least 10M lines/sec on an E5-2620 over loopback)
- Delay measurements
- Extrapolate speed based on when data arrives
Config:
- Override key for Transport distribution
Calculator:
- Calculate real metric frequency
- Detect semi-frequent metrics
Input:
- TCP
- UDP
- Unix Socket
- TLS
- Configurable encoding
Input Encoders:
- Graphite Line Protocol
- Graphite Line Protocol with tags
- Metrics 2.0
- InfluxDB Line Protocol
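
As an illustration of what an input encoder has to handle, here is a minimal sketch of parsing one Graphite line, with or without tags. It is not taken from this project's code: the `Point` type and `ParseGraphiteLine` function are hypothetical names, and the sketch assumes integer Unix timestamps and the `name;tag=value` tag syntax.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Point is a hypothetical in-memory representation of one metric line.
type Point struct {
	Name      string
	Tags      map[string]string // nil when the line carries no tags
	Value     float64
	Timestamp int64
}

var errMalformed = errors.New("malformed graphite line")

// ParseGraphiteLine parses "name[;tag=value...] value timestamp".
// Assumption: the timestamp is an integer number of seconds.
func ParseGraphiteLine(line string) (Point, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return Point{}, errMalformed
	}
	value, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return Point{}, errMalformed
	}
	ts, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return Point{}, errMalformed
	}
	// The first ";"-separated part is the metric name, the rest are tags.
	parts := strings.Split(fields[0], ";")
	if parts[0] == "" {
		return Point{}, errMalformed
	}
	p := Point{Name: parts[0], Value: value, Timestamp: ts}
	for _, kv := range parts[1:] {
		k, v, ok := strings.Cut(kv, "=")
		if !ok || k == "" || v == "" {
			return Point{}, errMalformed
		}
		if p.Tags == nil {
			p.Tags = make(map[string]string)
		}
		p.Tags[k] = v
	}
	return p, nil
}

func main() {
	p, err := ParseGraphiteLine("servers.web01.cpu;dc=ams 0.75 1500000000")
	fmt.Println(p.Name, p.Tags, p.Value, p.Timestamp, err)
}
```

Rejecting malformed lines at parse time keeps bad input out of the downstream queues.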
Output Encoders:
- Graphite Line Protocol
- Graphite Line Protocol with tags
- JSON
- Protobuf
- kafkamdm
Output:
- Kafka
- TCP
- UDP
- Unix Socket
Routing:
- Regexp matching (RE2-based)
- Rewrites
- Prefix Matching
- Blackhole sender
- Log on receive
- PCRE Regexp Matching
- Separate tool to show where a metric will land
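
The routing features above can be sketched as a first-match rule list. This is only an illustration, not the relay's actual implementation: the `Rule` type, the `Route` function, the `exampleRules` helper and all destination names are hypothetical. It relies on Go's standard `regexp` package, which uses RE2 syntax, and checks the cheap prefix before running the regexp.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Rule is a hypothetical routing rule. An empty Prefix and a nil Pattern
// both match every metric name.
type Rule struct {
	Prefix      string
	Pattern     *regexp.Regexp
	Destination string // "blackhole" is used here to mean "drop the metric"
}

// Route returns the destination of the first matching rule.
func Route(rules []Rule, name string) (string, bool) {
	for _, r := range rules {
		// Prefix check first: it is much cheaper than a regexp match.
		if !strings.HasPrefix(name, r.Prefix) {
			continue
		}
		if r.Pattern != nil && !r.Pattern.MatchString(name) {
			continue
		}
		return r.Destination, true
	}
	return "", false
}

// exampleRules builds an illustrative rule set; all names are made up.
func exampleRules() []Rule {
	return []Rule{
		{Prefix: "debug.", Destination: "blackhole"},
		{Prefix: "servers.", Pattern: regexp.MustCompile(`\.cpu\.[a-z]+$`), Destination: "cpu-cluster"},
		{Prefix: "servers.", Destination: "servers-cluster"},
	}
}

func main() {
	// Acts like a tiny "where will this metric land" tool.
	rules := exampleRules()
	for _, name := range []string{"debug.trace", "servers.web01.cpu.user", "apps.api.latency"} {
		dest, ok := Route(rules, name)
		fmt.Printf("%-26s -> %q (matched: %v)\n", name, dest, ok)
	}
}
```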
LoadBalancing:
- fnv1a
- jump hash fnv1a
- round robin with sticking
- graphite consistent hash
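
To make the first two strategies concrete, here is a rough sketch of FNV-1a modulo distribution and of jump consistent hashing keyed by FNV-1a. It assumes the 64-bit FNV-1a variant from Go's standard `hash/fnv` package; whether the relay uses the 32-bit or 64-bit variant is not stated here. The function names and destination addresses are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// FNV1a hashes a metric name with 64-bit FNV-1a.
func FNV1a(name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// JumpHash maps key to a bucket in [0, numBuckets) using the jump
// consistent hash algorithm. It returns -1 when numBuckets <= 0.
func JumpHash(key uint64, numBuckets int) int {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}

func main() {
	// Hypothetical destinations, for illustration only.
	destinations := []string{"carbon-a:2003", "carbon-b:2003", "carbon-c:2003", "carbon-d:2003"}
	for _, name := range []string{"servers.web01.cpu", "servers.web02.cpu", "apps.api.latency"} {
		key := FNV1a(name)
		simple := int(key % uint64(len(destinations)))
		jump := JumpHash(key, len(destinations))
		fmt.Printf("%s: fnv1a -> %s, jump -> %s\n", name, destinations[simple], destinations[jump])
	}
}
```

With plain modulo distribution, adding a destination remaps most metrics. With jump hash, a metric either stays where it was or moves to the newly added destination.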
Documentation:
- At least some docs
- Design documentation
- Extended docs
Internal benchmarks show that the current version of the relay can do simple routing (StatsWith: "" + send to 4 destinations) of 2M lines/sec on a 2x E5-2620v3 machine with 128GB RAM. CPU consumption averages 6 (out of 24) cores, with spikes up to 18 cores. Memory consumption is far from optimal: 60GB of RAM (6x overhead). These performance levels can't be considered acceptable for sustained load.
With more complex rules, relay performance decreases dramatically (a 10-20x slowdown and 10x more memory consumption). This needs to be investigated and fixed.
Performance with tags is mostly untested.
- Some internal queues (if you can call them queues) have no limit, so malformed or unthrottled input might lead to OOM issues
- If a backend goes down, the first point in the queue will be lost
- Config format is far from perfect (readability, ease of modification, ease of generation)
- Unstable config format
- Delays are untested
- Might contain memory leaks
- Has no statistics
- Has no documentation, except for comments in the config file
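
One possible way to address the unbounded-queue issue listed above is a fixed-capacity queue that drops new lines when full and counts the drops. The sketch below is an assumption about how that could look, not the relay's current code: `BoundedQueue` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// BoundedQueue is a hypothetical fixed-capacity queue of metric lines.
type BoundedQueue struct {
	dropped uint64 // accessed atomically; kept first for 64-bit alignment
	ch      chan string
}

// NewBoundedQueue creates a queue that holds at most capacity lines.
func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{ch: make(chan string, capacity)}
}

// Push enqueues a line without blocking. It returns false and counts a
// drop when the queue is full, so memory use stays bounded.
func (q *BoundedQueue) Push(line string) bool {
	select {
	case q.ch <- line:
		return true
	default:
		atomic.AddUint64(&q.dropped, 1)
		return false
	}
}

// Pop dequeues the oldest line without blocking.
func (q *BoundedQueue) Pop() (string, bool) {
	select {
	case line := <-q.ch:
		return line, true
	default:
		return "", false
	}
}

// Dropped reports how many lines were rejected because the queue was full.
func (q *BoundedQueue) Dropped() uint64 {
	return atomic.LoadUint64(&q.dropped)
}

func main() {
	q := NewBoundedQueue(2)
	for _, line := range []string{"a 1 10", "b 2 10", "c 3 10"} {
		q.Push(line)
	}
	fmt.Println("dropped:", q.Dropped())
}
```

The drop counter could also feed the planned internal stats. Dropping the newest line is only one policy; dropping the oldest or spilling to disk are alternatives.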
This program was originally developed for Booking.com. With approval from Booking.com, the code was generalised and published as Open Source on GitHub, for which the author would like to express his gratitude.
This code is licensed under the Apache License 2.0.