-
Notifications
You must be signed in to change notification settings - Fork 254
Home
Netflix uses a SOA to deliver a compelling experience to its members. Each service in the architecture can fail independently, but due to complicated interactions between services, a failure in one service often times affects many. To help diagnose these issues, teams at Netflix use a variety of tools, one of them being a realtime dashboard, which was cited in the blog post Making The API More Resilient.
Here is a snapshot of the dashboard being used to monitor several systems across the company.
The metrics visible in this dashboard are vended by our recently open sourced infrastructure Hystrix that enables systems to gain fault tolerance at large scale.
Turbine is a low latency high throughput stream processing engine that powers the backend for these metrics. It is one of the key systems used at Netflix to gain real time insight into multiple distributed systems comprising of thousands of servers. Users of Turbine can get real time data about system events within seconds of their occurrence.
- Netflix uses a single deployment of the Turbine system for the entire company.
- Turbine is explicitly configured to monitor Hystrix metrics across several key systems. Configuration is expressed in the form of clusters, which is Turbine's way of understanding a logical group of servers to be monitored for the same set of metrics.
- Turbine automatically discovers the server instances using Netflix's Eureka service and connects to these instances.
- Data is sent constantly over these persistent connections to Turbine over http. Note that the protocol here is not request-response oriented, the data is constantly sent over the same connection without ever closing it.
- Turbine accommodates for instances terminating or launching due to autoscale issues, fleet deployments, AWS problems or just bad network connectivity. It quickly discovers new instances and opens connections to them, and also tears down connections to instances that have gone away. In adverse network conditions, Turbine repeatedly tries to access the instances (with backoff) unless told otherwise.
- Turbine has an embedded aggregator that aggregates metrics sent from individual machines to give users a global view of the system.
- Turbine uses the cluster as a natural grouping criteria in order to aggregate metrics from the same group. Hence Turbine can be use to simultaneously monitor multiple clusters or system deployments.
- Engineers, dashboards, alerting systems and data analytic systems can connect to Turbine to get the real time feed of data for the entire system.
There is none! Data in Turbine is ephemeral and is not persisted anywhere within the system. Turbine maintains a constant sliding window over the data and aggregates data over this time window which is configurable. The statelessness is one of the key tenants that enable Turbine to be super low latency system that functions well at massive scale where it monitors thousands of machines and can give real time insight into real problems in a matter of seconds. Netflix uses other complimentary systems and tools that deal well with persistent data.
-
[End-to-End Examples](https://github.com/Netflix/Turbine/wiki/End-to-End Examples-(1.x))