
file_matches - why so much traffic #4

Open · indera-shsp opened this issue Sep 3, 2019 · 11 comments

@indera-shsp

We use file_matches as described here https://www.elastic.co/guide/en/logstash/current/plugins-inputs-google_cloud_storage.html
to determine which files need processing.

We are observing excessive traffic initiated by Logstash - is it downloading ALL the files from the bucket every 60 seconds?

I expected it to be smart and only download file names, which should not add up to hundreds of megabytes every hour.

@josephlewis42 (Contributor)

Hi @indera-shsp, the plugin should be listing the bucket once every interval seconds and filtering objects by name before attempting to download the objects' contents.

That could show up as several GET requests to the https://www.googleapis.com/storage/v1/b/bucket/o endpoint.

If you see Logstash downloading the contents of the files every 60 seconds, that's probably a bug. The plugin should keep a local cache of which objects it's already processed or mark them with a label in GCS. The label is preferred (by default it's x-goog-meta-ls-gcs-input) because it's guaranteed to persist across multiple Logstash workers and sessions.
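
For reference, a minimal Java sketch of what that per-interval cycle amounts to, using the google-cloud-java Storage client. The bucket name, regex, and metadata handling here are illustrative rather than the plugin's actual code, which lives in client.rb and its Java helpers:

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    import java.util.Map;
    import java.util.regex.Pattern;

    public class ListAndFilter {
      public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // One LIST call per interval; the response carries the names AND
        // metadata of every object, which is where the traffic comes from.
        Pattern fileMatches = Pattern.compile(".*\\.log(\\.gz)?"); // example file_matches pattern
        for (Blob blob : storage.list("my-bucket").iterateAll()) {
          if (!fileMatches.matcher(blob.getName()).matches()) {
            continue; // name filtering happens client-side, after the listing
          }
          // Skip objects already marked as processed. The key shown is the
          // default from the docs; its exact form in object metadata may differ.
          Map<String, String> metadata = blob.getMetadata();
          if (metadata != null && metadata.containsKey("x-goog-meta-ls-gcs-input")) {
            continue;
          }
          // ...only now would the object's contents be downloaded...
        }
      }
    }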

Could you expound on your use case a little bit more (average size of file, average size of name, count of objects per bucket)?

@indera-shsp (Author)

The bucket contains about 20 small JSON files, and we check every 30 seconds.

@tmegow commented Sep 3, 2019

    input {
      google_cloud_storage {
        interval => 30
        bucket_id => "${GOOGLE_BUCKET_NAME}"
        json_key_file => "/sd/creds/gcp_service_account.json"
        file_matches => "transcriptions/.*json"
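        # "prefix" below is the server-side filter proposed later in this
        # thread; the plugin did not support it at the time of this comment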
        prefix => "implementHere"
        codec => "json"
        metadata_key => "x-goog-meta-logstash-blocks"
      }
    }

https://github.com/logstash-plugins/logstash-input-google_cloud_storage/blob/master/lib/logstash/inputs/cloud_storage/client.rb#L22

The list function here can optionally accept an options object containing a property named "prefix": https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-storage/src/main/java/com/google/cloud/storage/Storage.java#L973.

I suspect that if Logstash allowed providing a "prefix" parameter (similar to how one provides file_matches), Google could pre-filter large lists of files server-side, which would be cheaper computationally and use less network bandwidth each list cycle.
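
To make the contrast concrete, here's a sketch with the same Java client (the bucket and prefix values are placeholders): the prefix travels with the LIST request, so non-matching objects never appear in the response at all.

    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class ListWithPrefix {
      public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // GCS applies the prefix server-side, so the listing response only
        // contains matching objects in the first place.
        Iterable<Blob> blobs = storage
            .list("my-bucket", Storage.BlobListOption.prefix("transcriptions/"))
            .iterateAll();
        for (Blob blob : blobs) {
          System.out.println(blob.getName());
        }
      }
    }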

@tmegow commented Sep 3, 2019

We have a bucket with 28716 objects (and growing). Retrieving the list of these objects plus their metadata yields a 38MB response. Since our Logstash interval is set to 30s, that's 960 list requests every 8h, or roughly 36GB (960 × 38MB). This is a significant increase in our ingress bandwidth usage; the option to provide a pre-filter pattern would eliminate much of this overhead.

@indera-shsp (Author) commented Sep 4, 2019

From reading the Ruby code, it looks like filtering objects is done after the code downloads the list of objects and their metadata, which is already too late to save bandwidth in our case.

@josephlewis42 (Contributor)

It sounds like even with a prefix, we might end up back here soon if the data is going to continue growing.

It seems like the pipeline is trying to index something that's near-real time. Would one of the following approaches help?

  1. Create a second bucket for files pending processing. When a file changes in the first, use a Cloud Function to copy it to the second, then attach Logstash to the second bucket and have it delete objects once it's done processing them.
  2. Read file change events off a Pub/Sub queue and only process the interesting ones with something like logstash-input-google_pubsub (see the sketch after this list).
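
To illustrate option 2: GCS can publish object-change notifications to a Pub/Sub topic, and each message carries the object name and event type in its attributes, so a consumer can decide what to process without ever listing the bucket. A rough Java sketch, assuming a notification-backed subscription named gcs-changes-sub (in practice logstash-input-google_pubsub would play the consumer role):

    import com.google.cloud.pubsub.v1.AckReplyConsumer;
    import com.google.cloud.pubsub.v1.MessageReceiver;
    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import com.google.pubsub.v1.PubsubMessage;

    public class GcsChangeConsumer {
      public static void main(String[] args) {
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "gcs-changes-sub");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
          // GCS notifications put the object name and event type in attributes.
          String objectId = message.getAttributesMap().get("objectId");
          String eventType = message.getAttributesMap().get("eventType");
          if ("OBJECT_FINALIZE".equals(eventType) && objectId != null
              && objectId.startsWith("transcriptions/")) {
            // ...fetch and process just this one object...
          }
          consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
      }
    }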

If those are too much, I'm happy to look at just adding the prefix for now if you're willing to test it, so we can get a Logstash maintainer to approve the PR (they like to see at least one real user testing it before approving a merge/release).

@indera-shsp (Author)

@josephlewis42 Thank you for the prompt responses. We will evaluate the options you mentioned, but adding support for the prefix can benefit other users too :)

@tmegow commented Sep 4, 2019

@josephlewis42 if there is a PR enabling the use of a prefix pre-filter, we will happily test it.

@indera-shsp (Author)

@josephlewis42 how difficult would it be to fix this issue for somebody not familiar with the codebase?

josephlewis42 added a commit that referenced this issue Oct 10, 2019
@josephlewis42 (Contributor)

@indera-shsp I'm taking a stab at it right now. The codebase is a bit hairy because it's Java mixed with Ruby. Our hope was to go full Java, because then type checks and the like are easy, but I think those plans have stalled upstream.

josephlewis42 mentioned this issue Oct 10, 2019
@josephlewis42 (Contributor)

@indera-shsp or @tmegow I built a version with the fix and published it here for testing: https://storage.googleapis.com/logstash-prereleases/logstash-input-google_cloud_storage-0.12.0-java.gem. If things look good, would you mind leaving your remarks in #7?

Here are the docs for the new field:

[id="plugins-{type}s-{plugin}-file_prefix"]
===== `file_prefix`

added[0.12.0]

  * Value type is <<string,string>>
  * Default is: ``

A prefix filter applied server-side. Only files starting with this prefix will
be fetched from Cloud Storage. This can be useful if all the files you want to
process are in a particular folder and you want to reduce network traffic.
