
Redirect logs from ECS Fargate to fluentd #869

Open
oneS5 opened this issue Oct 22, 2024 · 1 comment

oneS5 commented Oct 22, 2024

I have an ECS Fargate task with the following containers: an application, the OpenTelemetry Collector, and Fluent Bit. I'm having an issue with the application logs, which are written to stdout. Previously I used Datadog and the logs were pushed correctly. Now I want to change the configuration so that the logs go to the OpenTelemetry Collector and from there to ClickHouse. However, the logs do not seem to reach Fluent Bit. Could you help me identify any potential mistakes in the configuration? It looks like a problem with passing logs into Fluent Bit itself, because nothing reaches the stdout output directly either. I'm using the 'latest' image. I also tried the debug image, but it didn't produce anything useful in the logs.

Configuration

My fluent-bit.conf

[SERVICE]
    Flush               5
    Log_Level           info
    Daemon              off

[INPUT]
    Name                forward
    Listen              0.0.0.0
    Port                24224
    Buffer_Chunk_Size   1M
    Buffer_Max_Size     10M


[FILTER]
    Name         parser
    Match        *
    Key_Name     log
    Parser       json
    Reserve_Data True

[OUTPUT]
    Name          forward
    Match         *
    Host          localhost
    # the OTel Collector task's fluentforward receiver
    Port          24284

[OUTPUT]
    Name  stdout
    Match *
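For context, this is roughly what the parser filter above is expected to do to each forwarded record (an illustrative sketch only, not Fluent Bit's actual implementation; with `Reserve_Data True` the remaining keys are kept alongside the parsed ones):

```python
import json

def apply_json_parser(record, key_name="log", reserve_data=True):
    """Rough illustration of Fluent Bit's parser filter with a json parser:
    parse record[key_name] and lift its keys into the record."""
    raw = record.get(key_name)
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return record  # unparsable: the record passes through unchanged
    merged = dict(record) if reserve_data else {}
    merged.pop(key_name, None)  # the parsed keys replace the raw field
    merged.update(parsed)
    return merged
```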

I also tried with

[INPUT]
    Name   dummy
    Dummy {"message": "custom dummy"}

and it worked.
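Since the dummy input reaches stdout but forwarded records don't, one thing worth ruling out is plain TCP reachability between the sidecars. A minimal helper (my own, not part of any of the tools above) that can be run from inside the app container; with awsvpc networking all containers in a task share localhost:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Both sidecar ports should be reachable from the app container:
print(port_open("localhost", 24224))  # Fluent Bit forward input
print(port_open("localhost", 24284))  # Collector fluentforward receiver
```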

My otel config

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: '${JOB_NAME}'
          static_configs:
            - targets: ['localhost:9394']
          relabel_configs:
            - source_labels: [__address__]
              regex: (.*)
              target_label: instance_ip
              replacement: "$1"
            - source_labels: [job]
              target_label: job_name
              replacement: '${JOB_NAME}'

        - job_name: opentelemetry-collector
          scrape_interval: 30s
          static_configs:
          - targets:
            - 127.0.0.1:8888
                  
  fluentforward/tcp:
    endpoint: 0.0.0.0:24284
  awsecscontainermetrics:
    collection_interval: 10s
    
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"
      grpc:
        endpoint: "0.0.0.0:4317"
        
processors:
  transform/firelens:
    log_statements:
      - context: log
        statements:
          # parse json logs
          - merge_maps(cache, ParseJSON(body), "insert") where IsMatch(body, "^\\{")
          # set message
          - set(body, cache["message"]) where cache["message"] != nil

          # set trace/span id
          - set(trace_id.string, cache["trace_id"]) where cache["trace_id"] != nil
          - set(span_id.string, cache["span_id"]) where cache["span_id"] != nil

          # set severity when available
          - set(severity_number, SEVERITY_NUMBER_INFO) where IsMatch(cache["level"], "(?i)info")
          - set(severity_number, SEVERITY_NUMBER_WARN) where IsMatch(cache["level"], "(?i)warn")
          - set(severity_number, SEVERITY_NUMBER_ERROR) where IsMatch(cache["level"], "(?i)err")
          - set(severity_number, SEVERITY_NUMBER_DEBUG) where IsMatch(cache["level"], "(?i)debug")
          - set(severity_number, SEVERITY_NUMBER_TRACE) where IsMatch(cache["level"], "(?i)trace")
          - set(severity_number, cache["severity_number"]) where cache["severity_number"] != nil

          # move log_record attributes to resource
          - set(resource.attributes["container_name"], attributes["container_name"])
          - set(resource.attributes["container_id"], attributes["container_id"])
          - delete_key(attributes, "container_id")
          - delete_key(attributes, "container_name")

          - delete_matching_keys(cache, "^(message|trace_id|span_id|severity_number)$")

          - merge_maps(attributes, cache, "insert")

  batch:
    timeout: 5s
    send_batch_size: 100000

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_ENDPOINT}"
    headers:
      Authorization: "${PROMETHEUS_PASSWORD}"
    remote_write_queue:
      enabled: true
    resource_to_telemetry_conversion:
      enabled: true
    timeout: 15s
    tls:
      insecure: true
      
  clickhouse:
    endpoint: "${CLICKHOUSE_ENDPOINT}"
    database: "otel_metrics"
    username: "default"
    password: "${CLICKHOUSE_PASSWORD}"
    timeout: 10s
    create_schema: true
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    metrics_table_name: otel_metrics
    
service:
  telemetry:
    logs:
      level: "debug"
      
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, clickhouse]
    logs:
      receivers: [otlp, fluentforward/tcp]
      processors: [batch, transform/firelens]
      exporters: [clickhouse]
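For reference, here is a plain-Python approximation of what the transform/firelens statements are meant to do to a single record. The severity values are the standard OTel log severity numbers; the function itself is just an illustration of the intent, not how the collector executes OTTL:

```python
import json
import re

# Standard OTel log severity numbers, matching the OTTL statements above.
SEVERITY_NUMBERS = {"trace": 1, "debug": 5, "info": 9, "warn": 13, "err": 17}

def transform_firelens(body, attributes):
    """Illustrative re-implementation of the transform/firelens processor:
    parse a JSON body, promote message/trace_id/span_id/severity, and move
    container metadata from log attributes to resource attributes."""
    cache = {}
    if body.startswith("{"):
        try:
            cache = json.loads(body)
        except ValueError:
            cache = {}
    attributes = dict(attributes)
    record = {
        "body": cache.get("message", body),
        "trace_id": cache.get("trace_id"),
        "span_id": cache.get("span_id"),
        "severity_number": None,
        "resource": {},
    }
    level = str(cache.get("level", ""))
    for needle, number in SEVERITY_NUMBERS.items():
        if re.search(needle, level, re.IGNORECASE):
            record["severity_number"] = number
    if cache.get("severity_number") is not None:
        record["severity_number"] = cache["severity_number"]
    # move container metadata to resource attributes
    for key in ("container_name", "container_id"):
        if key in attributes:
            record["resource"][key] = attributes.pop(key)
    # drop the promoted keys, then merge the rest ("insert" = keep existing)
    for key in ("message", "trace_id", "span_id", "severity_number"):
        cache.pop(key, None)
    for key, value in cache.items():
        attributes.setdefault(key, value)
    record["attributes"] = attributes
    return record
```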

and my task definition

{
  "family": "ENVIRONMENT_NAME-web",
  "taskRoleArn": "TASK_ROLE_ARN",
  "executionRoleArn": "EXECUTION_ROLE_ARN",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "ENVIRONMENT_NAME-web",
      "portMappings": [
        {
          "containerPort": 3000,
          "hostPort": 3000,
          "protocol": "tcp"
        },
        {
          "containerPort": 9394,
          "hostPort": 9394,
          "protocol": "tcp"
        }
      ],
      "image": "",
      "essential": true,
      "command": [
        "bundle",
        "exec",
        "rails",
        "server",
        "-b",
        "0.0.0.0",
        "-p",
        "3000"
      ],
      "environment": [
        {
          "name": "ECS_SERVICE_NAME",
          "value": "web"
        }
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "forward",
          "Host": "localhost",
          "Port": "24224"
        }
      }
    },
    {
      "name": "otel-collector",
      "essential": true,
      "image": "xxx/prod-opentelemetry-collector:latest",
      "portMappings": [
        {
          "containerPort": 5568,
          "hostPort": 5568,
          "protocol": "tcp"
        },
        {
          "containerPort": 24225,
          "hostPort": 24225,
          "protocol": "tcp"
        },
        {
          "containerPort": 4318,
          "hostPort": 4318,
          "protocol": "tcp"
        },
        {
          "containerPort": 4317,
          "hostPort": 4317,
          "protocol": "tcp"
        },
        {
          "containerPort": 24284,
          "hostPort": 24284,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "JOB_NAME",
          "value": "web"
        }
      ],
      "user": "0",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ENVIRONMENT_NAME/web-logs",
          "awslogs-region": "REGION_NAME",
          "awslogs-stream-prefix": "ecs"
        }
      }
    },
    {
      "name": "fluent-bit",
      "image": "xxxx/prod-fluent-bit:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 24224,
          "hostPort": 24224,
          "protocol": "tcp"
        }
      ],
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true"
        }
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ENVIRONMENT_NAME/web-logs",
          "awslogs-region": "REGION_NAME",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
### Fluent Bit Log Output 
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2024/10/17 14:26:18] [ info] [fluent bit] version=1.9.10, commit=eba89f4660, pid=1
[2024/10/17 14:26:18] [ info] [storage] version=1.4.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2024/10/17 14:26:18] [ info] [cmetrics] version=0.3.7
[2024/10/17 14:26:18] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:8877
[2024/10/17 14:26:18] [ info] [input:forward:forward.1] listening on unix:///var/run/fluent.sock
[2024/10/17 14:26:18] [ info] [input:forward:forward.2] listening on 127.0.0.1:24224
[2024/10/17 14:26:18] [ info] [sp] stream processor started
[2024/10/17 14:26:18] [ info] [output:null:null.0] worker #0 started
[2024/10/17 14:26:18] [ info] [output:forward:forward.1] worker #0 started
[2024/10/17 14:26:18] [ info] [output:forward:forward.1] worker #1 started

Currently, I see the following in the logs:

[2024/10/17 14:34:41] [ warn] [input:forward:forward.2] fd=66 incoming data exceed limit (6144000 bytes)
[2024/10/17 14:34:41] [error] [output:forward:forward.1] cannot get ack
[2024/10/17 14:34:41] [ warn] [engine] failed to flush chunk '1-1729175681.8912538.flb', retry in 8 seconds: task_id=0, input=forward.2 > output=forward.1 (out_id=1)

ishworg commented Nov 29, 2024

We are also facing a similar issue while shipping logs to Grafana Cloud Loki (FireLens > Fluent Bit > Loki).

AWS team, can you kindly look into this issue? Thanks
