Home
Netflix uses a service-oriented architecture (SOA) to deliver a compelling experience to its members. Each service in the architecture can fail independently, but because of complicated interactions between services, a failure in one service often affects many others. To help diagnose these issues, teams at Netflix use a variety of tools. One of them is a real-time dashboard, which was cited in the blog post Making The API More Resilient.
Here is a snapshot of the dashboard being used to monitor several systems across the company.
The metrics visible in the dashboard are sourced from Hystrix, which enables systems to achieve fault tolerance at large scale.
Turbine is a low-latency, high-throughput stream processing engine that powers the backend for these metrics. It is one of the key systems used at Netflix to gain real-time insight into multiple distributed systems comprising thousands of servers. Users of Turbine can get real-time data about system events within seconds of their occurrence.
- Netflix uses a single deployment of Turbine to power many teams' dashboards.
- Turbine is explicitly configured to monitor Hystrix metrics across several key systems. Configuration is expressed in the form of clusters. A cluster is Turbine's way of describing a logical group of servers to be monitored for the same set of metrics.
- Turbine automatically discovers and connects to server instances using Netflix's Eureka service.
- Data is streamed from each server instance to Turbine over persistent HTTP connections. Note that the protocol here is not request-response oriented: the data is sent continuously over the same connection, which is never closed.
- Turbine accommodates servers coming and going. It quickly discovers new instances, opens connections to them, and also tears down connections to instances that have gone away. In adverse network conditions, Turbine repeatedly tries to access the instances (with backoff) unless told otherwise.
- Turbine has an aggregator that collects metrics sent from individual machines to give users a global view of the system.
- Turbine uses the cluster as a natural grouping criterion in order to aggregate metrics from the same group. Hence Turbine can be used to monitor multiple clusters or system deployments simultaneously.
- Engineers, dashboards, alerting systems and data analytics systems can connect to Turbine to get the real-time feed of data for the entire system.
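The cluster-based aggregation described in the list above can be illustrated with a minimal Python sketch. This is not Turbine's actual implementation (Turbine is a Java system), and the sample layout, the field names (`cluster`, `instance`, `name`, `values`), and the function `aggregate_by_cluster` are all hypothetical, chosen only to show the idea of combining per-instance metrics into one view per cluster.

```python
from collections import defaultdict


def aggregate_by_cluster(samples):
    """Sum numeric metrics from per-instance samples, grouped by (cluster, metric name).

    Each sample is a dict with hypothetical keys:
      cluster  - logical group the instance belongs to
      instance - host that reported the sample
      name     - name of the metric being reported
      values   - dict of numeric fields to be summed
    """
    totals = defaultdict(lambda: defaultdict(float))
    hosts = defaultdict(set)

    for sample in samples:
        key = (sample["cluster"], sample["name"])
        hosts[key].add(sample["instance"])
        for field, value in sample["values"].items():
            totals[key][field] += value

    # One aggregated record per (cluster, metric), plus the number of
    # distinct instances that contributed to it.
    return {
        key: {"reportingHosts": len(hosts[key]), **fields}
        for key, fields in totals.items()
    }


samples = [
    {"cluster": "api", "instance": "host-1", "name": "GetUser",
     "values": {"requestCount": 10, "errorCount": 1}},
    {"cluster": "api", "instance": "host-2", "name": "GetUser",
     "values": {"requestCount": 20, "errorCount": 0}},
    {"cluster": "billing", "instance": "host-9", "name": "Charge",
     "values": {"requestCount": 5, "errorCount": 2}},
]
result = aggregate_by_cluster(samples)
```

Because samples are keyed by cluster first, instances from different clusters never mix, which is what lets a single deployment serve several teams at once.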
Turbine has no data persistence. Data in Turbine is ephemeral and is not persisted anywhere within the system. Turbine maintains a sliding window over the data and aggregates data over this (configurable) time window. This statelessness is one of the key tenets that enable Turbine to be a low-latency system that functions well at scale, monitoring thousands of machines and providing insight into problems in a matter of seconds. Netflix uses other complementary systems and tools that deal well with persistent data.
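To make the sliding-window idea concrete, here is a minimal Python sketch of an in-memory windowed sum. It is an illustration only, not Turbine's code: the class name `SlidingWindowSum`, its methods, and the assumption that timestamps arrive in non-decreasing order are all hypothetical. Nothing is written to disk, so data older than the window is simply discarded.

```python
from collections import deque


class SlidingWindowSum:
    """Keep a running sum of values seen within the last `window_seconds`.

    Assumes timestamps passed to add() and total() never go backwards.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0

    def add(self, timestamp, value):
        """Record a value observed at `timestamp`."""
        self._events.append((timestamp, value))
        self._total += value
        self._evict(timestamp)

    def total(self, now):
        """Return the sum of values still inside the window at time `now`."""
        self._evict(now)
        return self._total

    def _evict(self, now):
        # Drop every event that is `window_seconds` old or older.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value


window = SlidingWindowSum(window_seconds=10)
window.add(timestamp=0, value=5)
window.add(timestamp=4, value=3)
```

Memory use is bounded by the number of events inside the window, which is one reason an ephemeral design can stay fast at scale.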
[End-to-End Examples](https://github.com/Netflix/Turbine/wiki/End-to-End Examples-(1.x))