0x6875790d0a edited this page Oct 1, 2019 · 5 revisions

Introduction and Scope

At LINE, we have a few thousand engineers on hundreds of teams managing thousands of services. About 1.5 engineers work on distributed tracing. Almost all services are written in Java; some are in Erlang.

We spend <5% of our infrastructure cost and <1% of our human cost on maintaining our observability stack, which includes:

  • Metrics (~200 million metrics per minute), stored in an in-house TSDB (previously OpenTSDB/MySQL)
  • Logging (built in-house on Elasticsearch)
  • Tracing (Using openzipkin)

The primary integration point with Zipkin is Armeria, which has native support for instrumenting server and client requests.

  • Some users use Spring Cloud Sleuth, and some use Envoy to send spans

System Overview

Instrumentation

  • Use Brave for instrumentation.
  • Most instrumentation happens within armeria, which instruments server and client requests.
  • Other standard brave adapters used include mysql.
  • Custom brave adapters have been implemented for data stores like redis, mongo.
  • Custom reporter that uses armeria RPC-over-HTTP2 (Thrift) to send spans
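The page does not describe the internals of the custom reporter. As a rough illustration only, a best-effort reporter can be sketched as a bounded in-memory buffer that hands batches to a pluggable sender and drops spans rather than blocking the application. The class name `BestEffortReporter`, its parameters, and the dict-based span shape are all hypothetical; the real reporter is built on Brave and Armeria in Java, and this sketch is not thread-safe.

```python
from collections import deque
from typing import Callable, Deque, List


class BestEffortReporter:
    """Hypothetical reporter: buffers spans and drops new ones when the buffer is full."""

    def __init__(self, sender: Callable[[List[dict]], None],
                 max_queued: int = 1000, batch_size: int = 100) -> None:
        self._sender = sender            # e.g. a function that performs the RPC call
        self._queue: Deque[dict] = deque()
        self._max_queued = max_queued
        self._batch_size = batch_size
        self.dropped = 0                 # spans lost due to overflow or send failure

    def report(self, span: dict) -> None:
        """Queue a span, or count it as dropped if the buffer is full."""
        if len(self._queue) >= self._max_queued:
            self.dropped += 1
            return
        self._queue.append(span)

    def flush(self) -> int:
        """Send at most one batch; return how many spans reached the sender."""
        count = min(self._batch_size, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        if not batch:
            return 0
        try:
            self._sender(batch)
        except Exception:
            self.dropped += len(batch)   # best effort: no retry, no spill to disk
            return 0
        return len(batch)
```

Dropping on overflow keeps tracing from ever slowing down request handling, at the cost of occasionally incomplete traces.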

Data ingestion

The primary transport is plain HTTP/2, as used by Armeria for RPC.

A Monitoring API server sits between all instrumented servers and the data store.

  • Implements zipkin-api as well.
  • Exposes zipkin-ui as well
  • Fully asynchronous, non-blocking

The API is a simple Thrift service whose request is a union of two fields:

  • binary encoded_spans - Spans that have already been serialized as a list of Zipkin Thrift spans. This would be the result of using zipkin-java’s ThriftCodec. This field is preferred for languages that already have good Zipkin implementations.
  • <zipkin2.Span> spans - Spans that can be filled in by filling in generated language code. This should be useful for languages that don’t have implementations of zipkin as they can just fill in the generated code without worrying about duplicating models + business logic. Not used yet, but probably would be used from erlang.
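To illustrate the "union of two fields" idea, here is a minimal sketch of a payload type that accepts exactly one of the two fields. The field names `encoded_spans` and `spans` come from the description above; the class name `SpanPayload`, the `kind()` helper, and the use of plain dicts in place of generated span types are assumptions made for this example, not the actual Thrift definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpanPayload:
    """Hypothetical stand-in for the Thrift union: exactly one field must be set."""

    encoded_spans: Optional[bytes] = None  # already-serialized list of Zipkin Thrift spans
    spans: Optional[List[dict]] = None     # structured spans (dicts stand in for generated types)

    def __post_init__(self) -> None:
        fields_set = sum(v is not None for v in (self.encoded_spans, self.spans))
        if fields_set != 1:
            raise ValueError("exactly one of encoded_spans or spans must be set")

    def kind(self) -> str:
        """Name of the field that is populated."""
        return "encoded_spans" if self.encoded_spans is not None else "spans"
```

A client with a mature Zipkin library would populate `encoded_spans`, while a client without one would fill in `spans` field by field.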

Data store and aggregation

Elasticsearch cluster - 10 nodes on physical machines, each with Xeon CPUs, 64GB RAM, and a large NVMe SSD

  • Use elasticsearch’s Curator tool with cron to clean up indexes that are more than a week old
  • Monitoring API server uses zipkin’s elasticsearch-storage to write spans into elasticsearch with no extra processing.
  • Best effort - spans that fail to be written are simply lost (there is no separate buffering layer such as Kafka)
    • In practice, don’t see many failures
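The one-week retention rule can be sketched as a date comparison over daily index names. This is not Curator's implementation, only an illustration of the selection logic. The index prefix `zipkin-span-` with a `yyyy-MM-dd` suffix is an assumption about the naming and should be adjusted to whatever the cluster actually uses; the function name `expired_indices` is hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, List

INDEX_PREFIX = "zipkin-span-"  # assumed daily index naming; adjust to the real pattern


def expired_indices(names: Iterable[str], today: date, keep_days: int = 7) -> List[str]:
    """Return daily indices whose date suffix is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(INDEX_PREFIX):
            continue  # not a span index; leave it alone
        try:
            day = date.fromisoformat(name[len(INDEX_PREFIX):])
        except ValueError:
            continue  # suffix is not a yyyy-MM-dd date
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

An index that is exactly `keep_days` old is kept; only strictly older ones are selected for deletion.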

UI: we created https://github.com/line/zipkin-lens to fit our usage and then moved it to openzipkin.

Goals

Best effort - as long as latency investigation can happen, occasional broken traces aren’t a big deal.

Eventually we want to instrument all servers - currently we only instrument one team’s servers, which comprise several services, each with dozens of serving machines.

Erlang instrumentation will be needed.

Current Status (10-2019)

  • Ingest rate is around 30,000 spans per second
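As a back-of-the-envelope check, the ingest rate can be combined with the one-week retention and the 10-node cluster size mentioned earlier to estimate how many spans are held at once. This assumes a steady rate, an even spread across nodes, and ignores replicas; no span size is given on this page, so the sketch deliberately stops at span counts rather than bytes.

```python
SPANS_PER_SEC = 30_000   # ingest rate from the status above
RETENTION_DAYS = 7       # indices older than a week are cleaned up
ES_NODES = 10            # size of the Elasticsearch cluster

spans_per_day = SPANS_PER_SEC * 60 * 60 * 24
spans_retained = spans_per_day * RETENTION_DAYS
spans_per_node = spans_retained // ES_NODES

print(f"{spans_per_day:,} spans/day")        # 2,592,000,000 spans/day
print(f"{spans_retained:,} spans retained")  # 18,144,000,000 spans retained
print(f"{spans_per_node:,} spans per node")  # 1,814,400,000 spans per node
```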

Service name

At LINE, users are free to choose their own service names. Most users pick a name that describes the purpose of their cluster, such as "bot-frontend-service" or "shop-ownership-service".

Site-specific tags

The following are span tags we frequently use in indexing or aggregation

Tag             Description                                  Usage
instance-id     Our company-wide naming for a project
instance-phase  Our company-wide naming for an environment
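To show where these tags live, here is a sketch that builds a span in the Zipkin v2 JSON shape with both site-specific tags attached. The tag keys come from the table above; the function name `make_span` and every example value (service name, span name, project, phase) are made up for illustration.

```python
import json
import secrets
import time


def make_span(service_name: str, name: str, instance_id: str,
              instance_phase: str, duration_us: int) -> dict:
    """Build a Zipkin v2-style span dict carrying the site-specific tags."""
    return {
        "traceId": secrets.token_hex(16),            # 32 lower-hex characters
        "id": secrets.token_hex(8),                  # 16 lower-hex characters
        "name": name,
        "timestamp": int(time.time() * 1_000_000),   # epoch microseconds
        "duration": duration_us,                     # microseconds
        "localEndpoint": {"serviceName": service_name},
        "tags": {                                    # tag values are always strings
            "instance-id": instance_id,
            "instance-phase": instance_phase,
        },
    }


# Hypothetical values, for illustration only.
span = make_span("bot-frontend-service", "get /messages", "example-project", "beta", 1500)
print(json.dumps([span], indent=2))
```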