
Implement Grok Rules in logstack application #4234

Closed · 6 tasks
nickumia-reisys opened this issue Mar 13, 2023 · 19 comments
Labels
O&M Operations and maintenance tasks for the Data.gov platform

Comments

@nickumia-reisys (Contributor) commented Mar 13, 2023

User Story

In order to have better control of log streams in New Relic (NR), the Data.gov Response Team wants to implement application-specific routing rules in our datagov-logstack app.

Acceptance Criteria

  • GIVEN grok rules have been implemented in logstack app
    WHEN I check NR
    THEN I see different streams/parsing rules applied to each type of app logs

Background

Security Considerations (required)

This is supposed to help with log review processes.

Sketch

  • Unarchive https://github.com/GSA/datagov-logstack
  • Create a test case that represents different input log types
  • Implement Grok Rules
  • See if log outputs are changing based on Grok rules.
  • Create specific Grok rules for cloud.gov applications.
@nickumia-reisys nickumia-reisys added the O&M Operations and maintenance tasks for the Data.gov platform label Mar 15, 2023
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 16, 2023
@hkdctol hkdctol moved this from 📔 Product Backlog to New Dev in data.gov team board Sep 29, 2023
@hkdctol hkdctol moved this from New Dev to 📟 Sprint Backlog [7] in data.gov team board Sep 29, 2023
@FuhuXia FuhuXia self-assigned this Oct 5, 2023
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Oct 5, 2023
@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Oct 20, 2023
@Jin-Sun-tts Jin-Sun-tts self-assigned this Nov 15, 2023
@Jin-Sun-tts Jin-Sun-tts moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Nov 15, 2023
@Jin-Sun-tts (Contributor) commented:

Both prod and staging use the logstack-shipper app in the management space; setting up one for staging only, in the management-staging space, for testing.

The cf-drain-cli plugin was deprecated. To install the drains plugin, download it from https://github.com/cloudfoundry/cf-drain-cli/releases/tag/v2.0.0 and install it from the binary.
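
A minimal sketch of that install-from-binary flow, assuming a Linux amd64 release asset (the exact asset filename on the v2.0.0 release page may differ):

# Download the drains plugin binary and install it into the cf CLI.
# Asset filename is an assumption; check the v2.0.0 release page for the real one.
curl -LO https://github.com/cloudfoundry/cf-drain-cli/releases/download/v2.0.0/drains-cli-linux
chmod +x drains-cli-linux
cf install-plugin ./drains-cli-linux -f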

@Jin-Sun-tts (Contributor) commented Nov 17, 2023

Pushed to management-staging and messages reached logstack-shipper, but with the following error:
x_cf_routererror:"endpoint_failure (tls: failed to verify certificate: x509: certificate is not valid for any names

@Jin-Sun-tts (Contributor) commented:

By setting up the space drain in development, all app logs from development can now be sent to New Relic.
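
For reference, a hedged sketch of that space-drain setup with the drains plugin (the drain URL and drain name are hypothetical, and the drain-space command's flags may vary by plugin version):

cf target -s development
# Create a space drain so every app in the targeted space forwards logs to the shipper.
# (Command shape per cf-drain-cli docs; exact flags may differ by version.)
cf drain-space https://logstash-dev.example.app.cloud.gov --drain-name dev-space-drain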

Dev/Staging/Prod currently use one logstash shipper from the management space.
Ideally we would like to use a separate shipper in each environment, but we got a setup error like the one below:

OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
ERR ERROR: File not found for: file:///home/vcap/app/plugins.zip, message: Can't file local file /home/vcap/app/plugins.zip
OUT Installing Cloud Foundry root CA certificate...
OUT Installing certificates: /etc/cf-system-certificates/*
keytool error: **java.io.FileNotFoundException: /etc/cf-system-certificates/*** (No such file or directory)
OUT Invoking start command.
OUT Using bundled JDK: /home/vcap/app/logstash-7.16.3/jdk
ERR OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.

ERR warning: thread "Converge PipelineAction::Create main " terminated with exception (report_on_exception is true):
ERR LogStash::Error: **Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`**
ERR create at org/logstash/execution/ConvergeResultExt.java:135
ERR add at org/logstash/execution/ConvergeResultExt.java:60
ERR converge_state at **/home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb**:396
[ERROR][logstash.agent           ] An exception happened when converging configuration {:exception=>LogStash::Error, :message=>"Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`"}
OUT [2023-11-22T17:59:27,912][FATAL][logstash.runner          ] An unexpected error occurred! {:error=>#<LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<main>`>, :backtrace=>["org/logstash/execution/ConvergeResultExt.java:135:in `create'", "org/logstash/execution/ConvergeResultExt.java:60:in `add'", "/home/vcap/app/logstash-7.16.3/logstash-core/lib/logstash/agent.rb:396:in `block in converge_state'"]}
OUT [2023-11-22T17:59:27,918][FATAL][org.logstash.Logstash    ] Logstash stopped processing because of an error: (SystemExit) exit
OUT org.jruby.exceptions.SystemExit: (SystemExit) exit
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:747) ~[jruby-complete-9.2.20.1.jar:?]
OUT at org.jruby.RubyKernel.exit(org/jruby/RubyKernel.java:710) ~[jruby-complete-9.2.20.1.jar:?]
OUT at home.vcap.app.logstash_minus_7_dot_16_dot_3.lib.bootstrap.environment.<main>(/home/vcap/app/logstash-7.16.3/lib/bootstrap/environment.rb:94) ~[?:?]
OUT Exit status 1

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 📟 Sprint Backlog [7] in data.gov team board Dec 1, 2023
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Dec 12, 2023
@Jin-Sun-tts (Contributor) commented:

After fixing the plugins, New Relic key, and related issues, separate instances of the logstack-shipper app are now set up in each environment (development-ssb, management-staging, management). Currently testing a Grok rule in the development environment.

@Jin-Sun-tts (Contributor) commented:

The original message, before any New Relic parsing rules:
<14>1 2023-12-14T20:06:39.954251+00:00 gsa-datagov.development.catalog-proxy 0a6741c4-167a-43e3-bc3a-fda36e6a1bef [APP/PROC/WEB/0] - [tags@47450 app_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_name="catalog-proxy" deployment="cf-production" index="e9142d48-5d18-4f21-a8ee-91d48a62cc84" instance_id="0" ip="10.10.1.9" job="diego-cell" organization_id="90047c5d-337f-4802-bd48-2149a4265040" organization_name="gsa-datagov" origin="rep" process_id="1913b784-daf0-4e83-9d33-e89b4c7b70c8" process_instance_id="d809bf52-9ed4-497d-5d0b-361d" process_type="web" source_id="0a6741c4-167a-43e3-bc3a-fda36e6a1bef" source_type="APP/PROC/WEB" space_id="eab3d327-7d9f-423b-9838-753c26fdb5a0" space_name="development"] NginxLog "POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1" 500 141

Grok rule:
<%{INT:num}>%{POSINT:ver} %{TIMESTAMP_ISO8601:timestamp} %{DATA:host} %{UUID:proc_id} \[%{DATA:instance_info}\] - \[tags@%{INT:tag_id} app_id="%{UUID:app_id}" app_name="%{DATA:app_name}" deployment="%{DATA:deployment}" index="%{DATA:index}" instance_id="%{INT:instance_id}" ip="%{IP:ip}" job="%{DATA:job}" organization_id="%{UUID:organization_id}" organization_name="%{DATA:organization_name}" origin="%{DATA:origin}" process_id="%{UUID:process_id}" process_instance_id="%{DATA:process_instance_id}" process_type="%{DATA:process_type}" source_id="%{UUID:source_id}" source_type="%{DATA:source_type}" space_id="%{UUID:space_id}" space_name="%{DATA:space_name}"\] %{GREEDYDATA:log_data}
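
Whether this runs as a New Relic parsing rule or inside the logstash shipper, the pattern is the same; a minimal Logstash filter sketch (pattern abbreviated here, full rule above):

filter {
  grok {
    # Abbreviated: the tags@ block is lumped into one field here;
    # the full rule above captures each tag attribute separately.
    match => { "message" => "<%{INT:num}>%{POSINT:ver} %{TIMESTAMP_ISO8601:timestamp} %{DATA:host} %{UUID:proc_id} \[%{DATA:instance_info}\] - \[tags@%{INT:tag_id} %{DATA:tags}\] %{GREEDYDATA:log_data}" }
  }
}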

Grok output:

{
  "num": 14,
  "ver": 1,
  "timestamp": "2023-12-14T20:06:39.954251+00:00",
  "host": "gsa-datagov.development.catalog-proxy",
  "proc_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
  "instance_info": "APP/PROC/WEB/0",
  "temp": 47450,
  "app_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
  "app_name": "catalog-proxy",
  "deployment": "cf-production",
  "index": "e9142d48-5d18-4f21-a8ee-91d48a62cc84",
  "instance_id": 0,
  "ip": "10.10.1.9",
  "job": "diego-cell",
  "organization_id": "90047c5d-337f-4802-bd48-2149a4265040",
  "organization_name": "gsa-datagov",
  "origin": "rep",
  "process_id": "1913b784-daf0-4e83-9d33-e89b4c7b70c8",
  "process_instance_id": "d809bf52-9ed4-497d-5d0b-361d",
  "process_type": "web",
  "source_id": "0a6741c4-167a-43e3-bc3a-fda36e6a1bef",
  "source_type": "APP/PROC/WEB",
  "space_id": "eab3d327-7d9f-423b-9838-753c26fdb5a0",
  "space_name": "development",
  "log_data": "NginxLog \"POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1\" 500 141"
}

Required fields:

{
    "timestamp": "2023-12-14T20:03:41.65677+00:00",
    "app_name": "catalog-proxy",
    "ip": "10.10.1.9",
    "space_name": "development",
    "instance_id": "APP/PROC/WEB/0",
    "log_data": "NginxLog \"POST /'https://catalog.data.gov'/%3Cz3 HTTP/1.1\" 500 141"
}

@Jin-Sun-tts (Contributor) commented:

Discussed with Fuhu: for CKAN log messages, there are more required fields that need to be extracted:

Format one:
catalog-dev.data.gov - [2023-12-15T21:11:49.508628938Z] "GET /0000000 HTTP/1.1" 404 0 21445 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36" "127.0.0.1:47846" "10.10.2.10:61092" x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1" x_forwarded_proto:"https" vcap_request_id:"f31245e2-7980-4a82-77e9-3c4ca21703ae" response_time:0.037609 gorouter_time:0.000121 app_id:"0a6741c4-167a-43e3-bc3a-fda36e6a1bef" app_index:"0" instance_id:"f410370c-fc91-424b-451b-e47e" x_cf_routererror:"-" x_b3_traceid:"f31245e279804a8277e93c4ca21703ae" x_b3_spanid:"77e93c4ca21703ae" x_b3_parentspanid:"-" b3:"f31245e279804a8277e93c4ca21703ae-77e93c4ca21703ae"

**Fields to be extracted:**

catalog-dev.data.gov
[2023-12-15T21:11:49.508628938Z]
"GET /0000000 HTTP/1.1" 404 0 21445 
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
x_forwarded_for:"108.28.249.216, 64.252.66.176, 127.0.0.1" 
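
A hedged grok sketch for format one (field names follow the renames adopted later in this thread; the byte-counter names and the trailing catch-all are assumptions):

%{HOSTNAME:host} - \[%{TIMESTAMP_ISO8601:timestamp}\] "%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER:http_version}" %{INT:status} %{INT:bytes_received} %{INT:bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}" "%{DATA:remote_addr}" "%{DATA:backend_addr}" x_forwarded_for:"%{DATA:x_forwarded_for}" %{GREEDYDATA:rest}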

Format two:
2023-12-14 20:31:53,839 INFO [ckan.config.middleware.flask_app] 404 /dataset/aaaabbbb render time 0.023 seconds

**Fields to be extracted:**

2023-12-14 20:31:53,839 
INFO  
[ckan.config.middleware.flask_app]  
404 
/dataset/aaaabbbb 
render time 0.023 seconds
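
And a hedged sketch for format two (field names are my choice; TIMESTAMP_ISO8601 accepts the space separator and comma fraction in the CKAN timestamp):

%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:logger}\] %{INT:status} %{URIPATH:request} render time %{NUMBER:render_time} seconds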

@Jin-Sun-tts (Contributor) commented:

[Screenshot 2023-12-21 at 12:42:21 PM]

[Screenshot 2023-12-21 at 12:50:56 PM]

Those extra fields show up in the development space. Will review with @FuhuXia to see if we need more.

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Dec 21, 2023
@Jin-Sun-tts Jin-Sun-tts moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Dec 27, 2023
@Jin-Sun-tts (Contributor) commented:

@FuhuXia I added a grok rule to separate the request; please check the development space:

[Image]

@Jin-Sun-tts (Contributor) commented:

Filtered the data based on HTTP status codes and generated a pie chart:

[Image]
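
For reference, a chart like this can be built with a hedged NRQL query along these lines (attribute names assume the grok fields above):

SELECT count(*) FROM Log WHERE space_name = 'development' FACET status SINCE 1 day ago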

@Jin-Sun-tts (Contributor) commented:

Double-checked on development; those 404 errors are all irrelevant traffic.
[Screenshot 2023-12-28 at 10:48:28 AM]

@Jin-Sun-tts (Contributor) commented:

Discussed with Fuhu; we will implement the following changes (a sketch of the drop rules follows this list):

  • Disregard any log messages related to NginxLog, considering them duplicates.
  • Also drop messages associated with logstack-shipper that have a 200 response code. This will significantly reduce the volume of logs forwarded to New Relic.
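
A minimal Logstash sketch of those two drop rules, assuming the field names from the grok output above (grok captures are strings unless typed, hence the "200" comparison):

filter {
  # Drop NginxLog lines; they duplicate the gorouter access log.
  if [log_data] =~ /^NginxLog / {
    drop { }
  }
  # Drop logstack-shipper's own 200s; they are white noise.
  if [app_name] == "logstack-shipper" and [status] == "200" {
    drop { }
  }
}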

@Jin-Sun-tts (Contributor) commented:

The NginxLog messages were excluded and are no longer present in the New Relic logs. Additionally, logstack-shipper output now only contains logs with non-200 response codes.
PR: GSA/datagov-logstack#45

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Jan 4, 2024
@FuhuXia (Member) commented Jan 4, 2024

We can deploy the changes on prod and find the new crawler agent(s) responsible for the recent increase in tracking traffic. Previously this could only be done via lengthy CloudFront log processing.

@Jin-Sun-tts (Contributor) commented Jan 4, 2024

A couple of changes need to be made to the grok rules (see the sketch after the reference link below):

  • Extract two separate fields: http_referer, http_user_agent

  • Rename some fields:
    http_status -> status
    request_uri -> request
    request_size -> bytes_sent
    response_size -> bytes_received

  • Separate the forwarded IPs into two fields: real_ip, forward_ips

Also, keep the fields raw_message_content and log_data only in the development space, for debugging purposes.

Reference for the field name: https://djangocas.dev/blog/nginx/nginx-access-log-with-real-x-forwarded-for-ip-instead-of-proxy-ip/
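
A hedged Logstash sketch of those adjustments (field and space names as discussed above; the x_forwarded_for split assumes the first entry is the client IP, per the linked reference):

filter {
  mutate {
    rename => {
      "http_status"   => "status"
      "request_uri"   => "request"
      "request_size"  => "bytes_sent"
      "response_size" => "bytes_received"
    }
  }
  # First x_forwarded_for entry is the real client IP; the rest are proxies.
  grok { match => { "x_forwarded_for" => "^%{IP:real_ip},?\s*%{GREEDYDATA:forward_ips}" } }
  # Keep debug fields only in development.
  if [space_name] != "development" {
    mutate { remove_field => ["raw_message_content", "log_data"] }
  }
}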

@Jin-Sun-tts Jin-Sun-tts moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Jan 4, 2024
@FuhuXia (Member) commented Jan 8, 2024

Another big benefit of this grok change is that we can set grok rules to ignore certain white-noise logs. As of now, 78% of all logs (based on December 2023 data, 958,729,889 of 1,216,483,493) are logstack-shipper POST/200 logs that offer zero value to us. NR log queries will be faster without them.

logstash-prod-datagov.app.cloud.gov - [2023-12-01T15:12:55.421915605Z] 
"POST /?drain-type=all HTTP/1.1" 200 761 2 "-" "fasthttp" 
...

@Jin-Sun-tts (Contributor) commented:

All modified fields are in development now:

[Screenshot 2024-01-09 at 12:38:29 PM]

Also added logic to remove the fields raw_message_content and log_data in non-development environments; they only show up in development, for debugging purposes.

@Jin-Sun-tts Jin-Sun-tts moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Jan 10, 2024
@Jin-Sun-tts (Contributor) commented:

Fixed the deploy-related issues; PR: GSA/datagov-logstack#48. @FuhuXia

@Jin-Sun-tts (Contributor) commented:

Confirmed in New Relic: all changes are in prod and the other environments now.

@github-project-automation github-project-automation bot moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Jan 16, 2024
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 18, 2024