file_matches - why so much traffic #4

indera-shsp · 2019-09-03T14:27:26Z

We use file_matches as described here https://www.elastic.co/guide/en/logstash/current/plugins-inputs-google_cloud_storage.html
to determine which files need processing.

We are observing excessive traffic initiated by logstash - is it downloading ALL the files from the bucket every 60 seconds?

I expected it to be smart and only download file names which should not add to hundreds of megabytes every hour.

The text was updated successfully, but these errors were encountered:

josephlewis42 · 2019-09-03T15:13:13Z

Hi @indera-shsp, the plugin should be listing the bucket every interval seconds and filtering objects based on the names before attempting to download the contents of the objects.

That could show up as several GET requests to the https://www.googleapis.com/storage/v1/b/bucket/o endpoint.

If you see Logstash downloading the contents of the file every 60 seconds that's probably a bug. The plugin should keep a locacl cache of which objects it's already processed or mark them with a label in GCS. The label is preferred (by default it's x-goog-meta-ls-gcs-input) because it's guaranteed to persist across multiple Logstash workers and sessions.

Could you expound on your use case a little bit more (average size of file, average size of name, count of objects per bucket)?

indera-shsp · 2019-09-03T17:24:13Z

The bucket contains about 20 small json files and we check every 30 seconds

tmegow · 2019-09-03T21:33:50Z

    input {
      google_cloud_storage {
        interval => 30
        bucket_id => "${GOOGLE_BUCKET_NAME}"
        json_key_file => "/sd/creds/gcp_service_account.json"
        file_matches => "transcriptions/.*json"
        prefix => "implementHere"
        codec => "json"
        metadata_key => "x-goog-meta-logstash-blocks"
      }
    }

https://github.com/logstash-plugins/logstash-input-google_cloud_storage/blob/master/lib/logstash/inputs/cloud_storage/client.rb#L22

The list function here can optionally accept an option object containing a property named "prefix" https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-storage/src/main/java/com/google/cloud/storage/Storage.java#L973.

I suspect that if Logstash allowed providing a "prefix" as a parameter (similar to how one provides file_matches) would allow Google to do pre-filtering of large lists of files and may be cheaper computationally and use less network bandwidth each list cycle.

tmegow · 2019-09-03T21:50:58Z

We have a bucket with 28716 objects (and growing). To retrieve the list of these objects plus their metadata, the resulting file is 38MB. Since our logstash interval is set to 30s, we calculate approximately 36GB every 8h. This results in significant usage increase in our ingress bandwidth, it seems like the option to provide a pre-filter pattern would eliminate some of this overhead.

indera-shsp · 2019-09-04T09:19:08Z

From reading the ruby code it looks like filtering objects is done after the code downloads the list of objects and their metadata, which is already too late to save bandwidth in our case.

josephlewis42 · 2019-09-04T14:42:01Z

It sounds like even with a prefix, we might end up back here soon if the data is going to continue growing.

It seems like the pipeline is trying to index something that's near-real time. Would one of the following approaches help?

Create a second bucket for files pending processing. When a file changes in the first, use a Cloud Function to copy it over to the second and attach Logstash to the second bucket making it delete objects once it's done processing them.
Read file change events off a Pub/Sub queue and only process the interesting ones with something like logstash-input-google_pubsub.

If those are too much, I'm happy to look at just adding the prefix for now if you're willing to test it so we can get a Logstash maintainer to approve the PR (they like to see at least one one real user testing it before approving a merge/release).

indera-shsp · 2019-09-04T19:28:37Z

@josephlewis42 Thank you for the prompt responses. We will evaluate the options you mentioned, but adding support for the prefix can benefit other users too :)

tmegow · 2019-09-04T21:08:44Z

@josephlewis42 if there is a PR enabling the use of a prefix pre-filter, we will happily test it

indera-shsp · 2019-10-10T14:11:52Z

@josephlewis42 how difficult would be to fix this issue for somebody not familiar with the code base?

josephlewis42 · 2019-10-10T15:31:09Z

@indera-shsp I'm taking a stab at it right now. The codebase is a bit hairy because it's Java mixed with Ruby. Our hope was full Java because then type checks and the like are easy but I think those plans have been stalled upstream.

josephlewis42 · 2019-10-10T16:04:04Z

@indera-shsp or @tmegow I built a version with the fix and have it published here: https://storage.googleapis.com/logstash-prereleases/logstash-input-google_cloud_storage-0.12.0-java.gem for testing. If things look good, would you mind leaving your remarks in #7 ?

Here are the docs for the new field:

[id="plugins-{type}s-{plugin}-file_prefix"]
===== `file_prefix`

added[0.12.0]

  * Value type is <<string,string>>
  * Default is: ``

A prefix filter applied server-side. Only files starting with this prefix will
be fetched from Cloud Storage. This can be useful if all the files you want to
process are in a particular folder and want to reduce network traffic.

josephlewis42 added a commit that referenced this issue Oct 10, 2019

first pass at fixing #4

1bb7517

josephlewis42 mentioned this issue Oct 10, 2019

Fix/4 #7

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

file_matches - why so much traffic #4

file_matches - why so much traffic #4

indera-shsp commented Sep 3, 2019

josephlewis42 commented Sep 3, 2019

indera-shsp commented Sep 3, 2019

tmegow commented Sep 3, 2019 •

edited

Loading

tmegow commented Sep 3, 2019

indera-shsp commented Sep 4, 2019 •

edited

Loading

josephlewis42 commented Sep 4, 2019

indera-shsp commented Sep 4, 2019

tmegow commented Sep 4, 2019

indera-shsp commented Oct 10, 2019

josephlewis42 commented Oct 10, 2019

josephlewis42 commented Oct 10, 2019

file_matches - why so much traffic #4

file_matches - why so much traffic #4

Comments

indera-shsp commented Sep 3, 2019

josephlewis42 commented Sep 3, 2019

indera-shsp commented Sep 3, 2019

tmegow commented Sep 3, 2019 • edited Loading

tmegow commented Sep 3, 2019

indera-shsp commented Sep 4, 2019 • edited Loading

josephlewis42 commented Sep 4, 2019

indera-shsp commented Sep 4, 2019

tmegow commented Sep 4, 2019

indera-shsp commented Oct 10, 2019

josephlewis42 commented Oct 10, 2019

josephlewis42 commented Oct 10, 2019

tmegow commented Sep 3, 2019 •

edited

Loading

indera-shsp commented Sep 4, 2019 •

edited

Loading