Cold starts double to quadruple when layer is included #228

ianwremmel · 2022-03-01T05:08:52Z

Is your feature request related to a problem? Please describe.
I'm just getting started with OpenTelemetry and just the act of adding the ADOT NodeJS layer to my functions makes the simplest ping function from 1-2 to 4-5 seconds and my remix app's cold start goes from 8 seconds to more than 30 (I've got the function set to timeout at 30).

Is there any configuration that I can do to improve that or is that just the way it has to be?

Describe the solution you'd like
Minimal impact on cold start times.

Describe alternatives you've considered

use standard xray libs
use otel with non-AWS providers and send directly from application code

NathanielRN · 2022-03-01T16:27:28Z

Hi @ianwremmel, thanks for raising this issue!

We've actually gotten issues on things like this before. A user on the OpenTelemetry CNCF Slack channel did a useful deep dive into investigating cold start impacts of Lambda with the ADOT NodeJS Lambda Layer.

Unfortunately it seems that the best solution is to bump up the memory capacity of your Lambda which does improve Cold Start time. If your Lambda is being used a lot this should prove to be a good idea.

Let me know if that helps!

ianwremmel · 2022-03-01T21:03:57Z

Gotcha, thanks!

Could you share that post somehow? I don't seem to be able to access that Slack.

ianwremmel · 2022-03-02T05:15:51Z

After a bit of experimenting (and seeing a serendipitous tweet this morning), I've determined that my cold start times without the layer were bad due to running the the --source-map-support flag. Once that was removed, my ping function cold start time was reduced to about 700ms with a MemorySize of 128MB, which seems pretty tolerable to me.

With the unexpected overhead of --source-map-support removed and the layer added, I still see ping taking 4 to 5 seconds and my remix app (MemorySize of 1024MB) still times out. If I up MemorySize to 8192 for all functions, I end up with ping at 3.6 seconds and the remix app at 11 seconds.

So, unless you've got other suggestions, it sounds like the AWS otel lambda support isn't viable for my project.

NathanielRN · 2022-03-03T19:30:50Z

Could you share that post somehow? I don't seem to be able to access that Slack.

Here is the Discussion on OTel JS that the user linked.

ianwremmel · 2022-03-03T21:01:30Z

Thanks! yea, it looks I'm in kind of the same boat. I didn't see the kind of speedup he did by increasing memory, but even if I had, that would be way too expensive. I don't think there's any way I can use AWS otel, given these numbers, but maybe there's still a chance to use otel instrumentation and ship directly to third-party receivers (e.g., honeycomb).

NathanielRN · 2022-03-07T22:19:01Z

One improvement you might try to reduce the cold start time is to use ESBuild instead of CommonJS build for your Node JS app on Lambda. ESBuild is nicer because it does operations like tree-shaking to seriously reduce the size of imported libraries which reduces memory and improves performance.

There are discussions upstream on OpenTelemetry JS Contrib to improve support for ESBuild because it requires more careful configuration. You might find the comment thread at open-telemetry/opentelemetry-js-contrib#647 (comment) helpful towards this goal and improve your Node App's memory impact in a way that would make ADOT Lambda layers usable for your purpose 🙂

The solution utilizes esbuild's "external" option and a custom plugin. "external" is an array of packages you tell esbuild to not bundle, and the plugin picks up those external packages and installs them into a raw node_modules directory when building. OpenTelemetry needs to monkey patch methods, so you can't bundle the libraries it needs to instrument (e.g. http, express, graphql, etc.), but you can instrument everything else.

ianwremmel · 2022-03-07T22:31:04Z

I'm already using rollup to do pretty aggressive treeshaking and I've tried putting all the otel packages on my dependencies layer. Unfortunately, as soon as I include the AWS Open Telemetry layer, cold boot times skyrocket.

I've made a few attempts at switch my lambdas over to mjs, but things go wrong with dependencies every time.

Note that the ping lambda I've been talking about above is about 40k after being processed by rollup.

ianwremmel · 2022-04-02T22:14:48Z

came back to poke at this a little bit more and it seems like most of the overhead comes from adding http: to protocols in opentelemetry.yml. If I stick to only grpc:, then cold starts are only about 10% worse rather than 400% worse.

ianwremmel · 2022-04-03T05:52:01Z

Another interesting detail. This line from the docs site, when used with a remix app, seems to mostly prevent the function from returning. From what I can tell, initialization takes more than 30 seconds. (For reasons not entirely clear to me, I couldn't get the lambda timeout to be more than 30 seconds, but that might be a sleep-deprivation issue on my part).

const tracerProvider = new NodeTracerProvider(tracerConfig);

ianwremmel · 2022-04-03T06:03:48Z

Actually, that's the wrong line. I think it's this line:

import {NodeTracerProvider} from '@opentelemetry/node';

Apparently there's some side effect somewhere in @opentelemetry/node that completely tanks lambda initialization when Remix is involved.

ianwremmel · 2022-04-03T06:35:02Z

For reasons not entirely clear to me, I couldn't get the lambda timeout to be more than 30 seconds, but that might be a sleep-deprivation issue on my part

It's the API gateway timeout that has hard limit at 30 seconds. Looking at the corresponding xray trace, it looks like the lambda takes 1.2 minutes to initialize when both remix and @opentelemetry/node are present.

Admittedly, remix is only the most obvious package in this lambda. There are any number of frontend packages that I could imagine might not play well with otel; the bundle file is 3MB.

ianwremmel · 2022-04-05T03:04:22Z

After a bunch of refactoring to get AWSXRayIdGenerator and AWSXRayPropagator working, both of the problems I mentioned above seem to have gone away: the http receiver and @opentelementr/node + remix are no longer making initialization times skyrocket. I don't really know what meaningfully changed.

willarmiros · 2022-04-12T13:51:27Z

After a bunch of refactoring to get AWSXRayIdGenerator and AWSXRayPropagator working

@ianwremmel why did you need to refactor to get these components to work? They should work out of the box with the Lambda layer instrumentation

ianwremmel · 2022-04-12T15:29:53Z

@willarmiros I was having a whole lot of trouble getting the layer to actually collect my spans and send them anywhere.

First, the layer doesn't configure a few things that one would otherwise expect it to configure:

it only configures otel to talk to the collector via http, not grpc
it does not configure a ContextProvider

a side effect is that following, say, the aws otel javascript tracing guide has no effect, because there's already been a tracer configured by the layer. Since otel doesn't provide a way to get access to the original tracerProvider, there's no way to add grpc after the fact. (it took me a while, but once I actually found the aws and otel repos for the layer, I was able to configure things using the hooks provided by the upstream otel layer to address the issues I mentioned above.)

Second, again because there's no way to get the tracerProvider after initialization, I had to piggyback on one of the config hooks so I could attach the traceProvider to global so that I could call global.traceProvider.foceFlush() at the end of each of my lambda functions.

Finally, there have been a lot of API changes the majority of aws-otel.github.io was written, so they're all slightly incorrect.

In any case, after working through those issues, my original coldstart issue and the super high duration on my Remix lambdas went away.

This is what my final set of config overrides looks like

// Copied from
// https://github.com/open-telemetry/opentelemetry-lambda/blob/0a83149fe2f23a7dab64df6108cfa35f18cc2ae5/nodejs/packages/layer/src/wrapper.ts#L42-L50
declare global {
  // in case of downstream configuring span processors etc
  function configureTracerProvider(tracerProvider: NodeTracerProvider): void;

  function configureTracer(defaultConfig: NodeTracerConfig): NodeTracerConfig;

  function configureSdkRegistration(
    defaultSdkRegistration: SDKRegistrationConfig
  ): SDKRegistrationConfig;

  function configureLambdaInstrumentation(
    config: AwsLambdaInstrumentationConfig
  ): AwsLambdaInstrumentationConfig;
}

// And this needs to be defined so that we can flush at the end of each
// invocation.
declare global {
  // var seems to be the only typescript keyword that works here.
  // eslint-disable-next-line no-var
  var tracerProvider: NodeTracerProvider;
}

// Remove some noise from the service name
const serviceName = process.env.AWS_LAMBDA_FUNCTION_NAME?.replace(
  my-stack-name',
  ''
)?.split('-')?.[0];

// This produces the config passed to new NodeTracerProvider()
global.configureTracer = ({resource, ...config}) => {
  diag.debug('Telemetry: generating tracer config');

  if (serviceName) {
    const r = new Resource({
      [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
        process.env.STAGE_NAME ?? 'development',
      [SemanticResourceAttributes.FAAS_INSTANCE]:
        process.env.AWS_LAMBDA_LOG_STREAM_NAME ?? '',
      [SemanticResourceAttributes.FAAS_MAX_MEMORY]:
        process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE ?? '',
      [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    });
    resource = resource ? resource.merge(r) : r;
  }

  return {
    ...config,
    idGenerator: new AWSXRayIdGenerator(),
    resource,
  };
};

// This gets applied to the tracerProvider
global.configureTracerProvider = (tracerProvider) => {
  diag.debug('Telemetry: configuring tracer provider');
  const exporter = new OTLPTraceExporter({
    url: 'grpc://localhost:4317',
  });

  tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));

  global.tracerProvider = tracerProvider;
};

// This produces the config passed to tracerProvider.register();
global.configureSdkRegistration = (config) => {
  diag.debug('Telemetry: generating SDK registration config');
  const contextManager = new AsyncHooksContextManager();
  contextManager.enable();
  return {
    ...config,
    contextManager
  };
};

ianwremmel · 2022-04-12T15:32:30Z

So, it wasn't really AWSXRayIdGenerator and AWSXRayPropagator that weren't working, it was just that the layer doesn't quite work out of the box. When I wrote the original comment, I thought the issue was with the AWS pieces, but it mostly the fact that the layer doesn't flush on its own.

kmihaltsov · 2022-07-07T02:53:58Z

Having similar performance issue. Nodejs Lambda with 1536mb: cold start goes from 500ms with x-ray sdk to 1700ms with this layer and default configuration from aws docs.

What makes me crazy is that aws documentation suggests to use this OTEL layer over x-ray while still having such issues.

a-h · 2022-07-24T00:26:02Z

Similar situation with Go.

I was testing out following the advice to switch to Open Telemetry, spent a few hours on troubleshooting etc., only to find that the Lambda layer adds 200ms(!) to cold start times compared to using the X-Ray SDK.

Disappointed to see the advice to look at Open Telemetry, without any warning of the performance penalty.

The only reason I did was because I thought I could cut out some of the cold start time (the Go AWS X-Ray has about 6MB of dependency bloat for me).

With 1GB of RAM, I'm seeing init duration go from around 100ms, to 300ms.

I don't see how this is workable. 🤷🏻

I have a test Lambda function. Used CloudWatch Log Insights to compare before and after.

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration. (?<@init>\d+\.\d+) ms/
| sort @timestamp desc
| limit 20

(The first 50ms coming off the cold start time in the data is due to other optimisations).

RichiCoder1 · 2022-07-24T00:43:47Z

@a-h It might be worth cross-posting that to https://github.com/open-telemetry/opentelemetry-lambda/ (which is the underlying repo) as that means the collector itself is taking that long to boot.

github-actions · 2022-10-23T20:00:30Z

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

RichiCoder1 · 2022-10-23T20:11:08Z

Still a very valid issue.

ianwremmel · 2022-10-23T20:16:45Z

In fact, I think it’s gotten worse with the latest releases.

adambiggs · 2023-02-16T07:52:42Z

Any updates? Is this being looked into?

mattfysh · 2023-03-25T18:50:28Z

Achievement unlocked: It's not everyday you stumble upon a Github thread with screenshots of your Slack conversation!

I've been looking into this problem again and it does seem as though ADOT performance on cold starts is still rather poor, but my guess is that it's being caused by the underlying repo https://github.com/open-telemetry/opentelemetry-lambda

If anyone is interested, I'm putting together some automated benchmarks to find the optimal memory setting and implementation when tracing in lambda: https://github.com/mattfysh/lambda-trace-perf

I do find it ironic that we're seeing these bottlenecks. Tracing instrumentations are typically used to measure and monitor application performance, so it does give pause when I see vendors are not fine-tuning their own shared libraries and runtimes required by developers.

ianwremmel · 2023-03-25T19:12:31Z

I do find it ironic that we're seeing these bottlenecks.

Agreed. Honestly, I think the real solution here is "stop running the collector in a Lambda layer". I haven't looked into it yet, but I've been noodling on how to run the collector in ECS. I did some manual benchmarking a few weeks ago (I don't have the numbers anymore) and just adding the ADOT layer without even setting the env vars to induce instrumentation adds nearly 800ms.

Somewhere, I heard that AWS was planning to make xray more OTel native, so hopefully someday it'll be possible to just let the XRay infrastructure do the collection.

RichiCoder1 · 2023-03-25T19:17:12Z

I think it's still worth have an "agent" in a layer to take advantage of things like the Telemetry API and handling batch. That said, that agent doesn't necessarily need to be the Otel collector. Maybe something more purpose built would make more sense like a very thin rust agent to collect logs, traces, and batch and forward otel data?

(I promise I'm not mentioning rust cuz it's the cool kid, but rather because real world perf in lambda: https://maxday.github.io/lambda-perf/).

Edit: A native, AWS/Lambda level support for Otel would def be the best solution so there's no (user visible) overhead whatsoever.

ianwremmel · 2023-03-25T19:27:11Z

yea, there are certainly advantages, but at least for my purposes, the overhead of using any layer seems to be too much of a performance impact to justify using that layer. I've jumped through a lot of goofy hoops to remove multiple layers. From what I can tell, layers only have "good" performance when you hit a scale where coldstarts are an adequately small portion of your traffic that they average their way off of your dashboards.

github-actions · 2023-06-25T20:00:50Z

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

RichiCoder1 · 2023-06-25T21:53:45Z

Bump. As far as I'm aware, this issue hasn't been resolved even with the introduction of the new lambda repo.

mattfysh · 2023-07-07T01:26:27Z

Performance of OTel in Lambda is receiving attention again. If anyone would like to benchmark the current solutions I put together a rudimentary repo here: https://github.com/mattfysh/lambda-trace-perf

I've not looked at it in a while (I've since moved away from serverless because of various cold boot issues, including this one) but if anyone wants to work together on it, I'd be happy to jump back into it

deki · 2023-09-06T16:13:56Z

See also: open-telemetry/opentelemetry-lambda#727

adambiggs · 2023-10-31T17:06:43Z

Any updates?

jnicholls · 2024-01-23T13:43:24Z

I'm going to dig into this a little bit this week to try to understand where the startup time is rooted in ADOT's setup. I'll get a baseline with zero components registered and introduce them one by one.

If this isn't resolved by the team, I'm going to resort to running a central collector to which my lambdas export, and avoid a collector as a layer altogether. Until the Lambda runtime makes an OTel collector a first-class citizen like the X-Ray daemon is, I suspect an OTel collector is simply too heavyweight to be a layer.

Unless of course, someone decides to write an OTel collector in Rust. Which would be substantially better imho.

github-actions · 2024-04-28T20:00:38Z

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

RichiCoder1 · 2024-04-29T00:14:40Z

This is still a concern as far as I'm aware.

stevemao · 2024-05-21T01:13:46Z

This is a blocking issue for us to use opentelemetry in lambda

felipenervo · 2024-06-18T12:00:28Z

Any updates on this issue? We had to remove the layer from our lambdas due to these issues.

jnicholls · 2024-06-18T12:34:30Z

I too removed the layer. It was a nonstarter. An external collector makes more sense for Lambda at this time.

ianwremmel · 2024-06-18T14:43:08Z

@jnicholls would you be able to share anything about your external collector setup? I've been interested in doing something like that, but the pricing has always looked prohibitive.

github-actions · 2024-09-22T20:00:45Z

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

ianwremmel · 2024-09-22T20:51:14Z

This is still a major issue and it's causing me to run new workloads on fly.io instead of lambda.

alolita assigned NathanielRN Mar 14, 2022

alolita added the lambda-perf label Mar 14, 2022

a-h mentioned this issue Jul 24, 2022

Consider not recommending AWS Distro for OpenTelemetry (ADOT)? aws/aws-xray-sdk-go#372

Closed

a-h mentioned this issue Jul 24, 2022

Slow start-up? open-telemetry/opentelemetry-lambda#263

Open

github-actions bot added the stale label Oct 23, 2022

bryan-aguilar removed the stale label Oct 25, 2022

adambiggs mentioned this issue Dec 15, 2022

Improve support for AWS SDK v3 aws/aws-xray-sdk-node#547

Closed

vsakaram unassigned NathanielRN Jan 23, 2023

github-actions bot added the stale label Jun 25, 2023

bilalq mentioned this issue Jul 1, 2023

captureAWSv3Client does not support SQS or SSM clients aws/aws-xray-sdk-node#439

Closed

github-actions bot removed the stale label Jul 2, 2023

bilalq mentioned this issue Jul 7, 2023

Expose more data in middleware to improve X-Ray and OTel traces aws/aws-sdk-js-v3#4902

Closed

2 tasks

github-actions bot added the stale label Apr 28, 2024

github-actions bot removed the stale label May 5, 2024

github-actions bot added the stale label Sep 22, 2024

github-actions bot removed the stale label Sep 29, 2024

Cold starts double to quadruple when layer is included #228

Cold starts double to quadruple when layer is included #228

Comments

ianwremmel commented Mar 1, 2022

NathanielRN commented Mar 1, 2022

ianwremmel commented Mar 1, 2022

ianwremmel commented Mar 2, 2022

NathanielRN commented Mar 3, 2022

ianwremmel commented Mar 3, 2022

NathanielRN commented Mar 7, 2022

ianwremmel commented Mar 7, 2022

ianwremmel commented Apr 2, 2022

ianwremmel commented Apr 3, 2022

ianwremmel commented Apr 3, 2022

ianwremmel commented Apr 3, 2022

ianwremmel commented Apr 5, 2022

willarmiros commented Apr 12, 2022

ianwremmel commented Apr 12, 2022 • edited Loading

ianwremmel commented Apr 12, 2022

kmihaltsov commented Jul 7, 2022

a-h commented Jul 24, 2022

RichiCoder1 commented Jul 24, 2022

github-actions bot commented Oct 23, 2022

RichiCoder1 commented Oct 23, 2022

ianwremmel commented Oct 23, 2022

adambiggs commented Feb 16, 2023

mattfysh commented Mar 25, 2023

ianwremmel commented Mar 25, 2023

RichiCoder1 commented Mar 25, 2023 • edited Loading

ianwremmel commented Mar 25, 2023

github-actions bot commented Jun 25, 2023

RichiCoder1 commented Jun 25, 2023

mattfysh commented Jul 7, 2023

deki commented Sep 6, 2023

adambiggs commented Oct 31, 2023

jnicholls commented Jan 23, 2024 • edited Loading

github-actions bot commented Apr 28, 2024

RichiCoder1 commented Apr 29, 2024

stevemao commented May 21, 2024

felipenervo commented Jun 18, 2024

jnicholls commented Jun 18, 2024

ianwremmel commented Jun 18, 2024

github-actions bot commented Sep 22, 2024

ianwremmel commented Sep 22, 2024

ianwremmel commented Apr 12, 2022 •

edited

Loading

RichiCoder1 commented Mar 25, 2023 •

edited

Loading

jnicholls commented Jan 23, 2024 •

edited

Loading