Support for partition table #42

Open
alepuccetti opened this issue Aug 29, 2018 · 8 comments

@alepuccetti

alepuccetti commented Aug 29, 2018

BigQuery can partition tables by time, which saves cost and improves performance when querying smaller time windows.

Enable creating tables partitioned by ingestion time or by a date/timestamp field in the schema.

Maybe there is already a way to do this, but it is not documented.

@josephlewis42
Contributor

Hi @alepuccetti,

You might be able to set a custom date pattern and follow the guide for converting date sharded tables into a single date partitioned table, even if it's not ideal.

To add support for this feature, it looks like we'll need a few more settings (from the docs on table creation):

| setting | type | description |
| --- | --- | --- |
| clustering | nested object | [Beta] Clustering specification for the table. Must be specified with time-based partitioning, data in the table will be first partitioned and subsequently clustered. |
| clustering.fields[] | list | [Repeated] One or more fields on which data should be clustered. Only top-level, non-repeated, simple-type fields are supported. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data. |
| timePartitioning | nested object | Time-based partitioning specification for this table. |
| timePartitioning.expirationMs | long | [Optional] Number of milliseconds for which to keep the storage for partitions in the table. The storage in a partition will have an expiration time of its partition time plus this value. |
| timePartitioning.field | string | [Beta] [Optional] If not set, the table is partitioned by pseudo column, referenced via either '_PARTITIONTIME' as TIMESTAMP type, or '_PARTITIONDATE' as DATE type. If field is specified, the table is instead partitioned by this field. The field must be a top-level TIMESTAMP or DATE field. Its mode must be NULLABLE or REQUIRED. |
| timePartitioning.requirePartitionFilter | boolean | [Beta] [Optional] If set to true, queries over this table require a partition filter that can be used for partition elimination to be specified. |
| timePartitioning.type | string | [Required] The only type supported is DAY, which will generate one partition per day. |
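
For reference, here is a minimal sketch of how those fields would sit in a tables.insert request body (the field names come from the table above; the project, dataset, table, and column names are made-up placeholders):

  {
    "tableReference": {
      "projectId": "my-project",
      "datasetId": "my_dataset",
      "tableId": "my_table"
    },
    "timePartitioning": {
      "type": "DAY",
      "field": "timestamp",
      "expirationMs": "5184000000",
      "requirePartitionFilter": false
    },
    "clustering": {
      "fields": ["host", "service"]
    }
  }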

Translated to a LogStash config, that would probably look something like this:

  config :cluster_fields, validate: :string, list: true, default: []
  config :enable_time_partitioning, validate: :boolean, default: false
  config :time_partition_field, validate: :string, default: ""
  config :time_partition_require_filter, validate: :boolean, default: false
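
For illustration, used in a pipeline that might look something like this (hypothetical: the last four options are the proposed ones and don't exist in the released plugin; the others are existing settings):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "my_dataset"
      table_prefix => "logstash"
      json_key_file => "/path/to/key.json"
      # proposed options from above (not yet implemented)
      enable_time_partitioning => true
      time_partition_field => "timestamp"
      time_partition_require_filter => false
      cluster_fields => ["host", "service"]
    }
  }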

Do you think something like that could fit your needs?

@alepuccetti
Author

Hi @josephlewis42,

> You might be able to set a custom date pattern and follow the guide for converting date sharded tables into a single date partitioned table, even if it's not ideal.

Creating monthly (or otherwise time-sharded) tables is not feasible for my use case. Writing queries that touch multiple tables is very ugly, and if you want to build a dashboard with Data Studio, I am not even sure that is possible. Converting sharded tables to partitioned ones is also not feasible, because I want to query the data in near real time.

I think that being able to leverage partitioned tables would be great for everyone. I am not familiar with the internals of streaming ingestion, but I think it is just a table definition issue, isn't it?
Also, clustering is not required to use partitioning, so it could be separate work (but clustering does require partitioned tables).

I am trying a number of different things now (I started using this plugin today):

  • Sending data to an already existing partitioned/non-partitioned table.
  • Inserting into a dynamically generated table.

In my case, I would rather have one table with all the data and drop rows with a SELECT statement. Currently, I can do this by setting date_pattern => "" but the table name will be <table_prefix>+<table_separator>. I will open a separate issue about this.
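
For reference, that workaround looks roughly like this (project, dataset, and key path are placeholders):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "my_dataset"
      json_key_file => "/path/to/key.json"
      table_prefix => "mytable"
      date_pattern => ""
      # resulting table name: "mytable_" (prefix plus the default "_" separator, since the date part is empty)
    }
  }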

@alepuccetti
Author

alepuccetti commented Aug 29, 2018

Update:

I successfully inserted into an existing partitioned table, so this could be a quick workaround.

However, I noticed that if one record in a batch fails, the whole batch fails.
If I set batch_size to 1, I insert more records than if I set it to 128 or more. Is this a known problem? With batch_size = 1, I get all the records except the failing ones; with batch_size = 128, the total number of inserted records is a multiple of 128. The input dataset is a bit more than a thousand records.

Note that setting batch_size to 1 makes the insert very slow.
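
For context, the option being varied is the plugin's batch_size setting, e.g.:

  google_bigquery {
    # connection and table settings omitted
    batch_size => 1    # one record per insert: slow, but only the failing records are lost
    # with the default of 128, one bad record appears to fail the whole batch
  }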

@MPTG94

MPTG94 commented May 13, 2021

@alepuccetti can you please show how you managed to insert data to an existing partitioned table?

I am trying to do the same thing right now, but I'm getting constant Null Pointer Exceptions in the logstash logfile (the config I am using worked without any issue before I made the table a partitioned one)

@alepuccetti
Author

@MPTG94 It has been a long time, so I don't remember for sure whether I had to do anything special, but I don't remember having this issue. Maybe something has changed since I did it.

@simon-verzijl

> @alepuccetti can you please show how you managed to insert data to an existing partitioned table?
>
> I am trying to do the same thing right now, but I'm getting constant Null Pointer Exceptions in the logstash logfile (the config I am using worked without any issue before I made the table a partitioned one)

I was just trying to do this myself, since I want the data loaded into a partitioned table with the partition expiry set to, for example, 60 days.
When I create the table myself, with the same schema that the BQ plugin created automatically, it works fine for me.

For example, in my case:

bq mk -t \
--schema 'timestamp:TIMESTAMP,host:STRING,<all other fields>' \
--time_partitioning_field timestamp \
--time_partitioning_type DAY \
--time_partitioning_expiration 5184000  \
dataset.mytable

(and then just set table_prefix to mytable and table_separator + date_pattern empty)
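
In pipeline terms that corresponds to something like this (project, dataset, and key path are placeholders):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "dataset"
      json_key_file => "/path/to/key.json"
      table_prefix => "mytable"
      table_separator => ""
      date_pattern => ""    # empty separator and pattern, so all rows go to the pre-created "mytable"
    }
  }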

I am seeing a lot of these errors, but the data is actually being loaded successfully:

[2021-07-08T17:31:26,013][ERROR][logstash.outputs.googlebigquery] Error creating table. {:exception=>java.lang.NullPointerException}
[2021-07-08T17:31:26,024][ERROR][logstash.outputs.googlebigquery] Error creating table. {:exception=>java.lang.NullPointerException}
[2021-07-08T17:31:26,041][INFO ][logstash.outputs.googlebigquery] Publishing 128 messages to mytable
[2021-07-08T17:31:26,073][INFO ][logstash.outputs.googlebigquery] Publishing 128 messages to mytable

@MPTG94

MPTG94 commented Jul 8, 2021

@simon-verzijl thanks for the reply, I ended up doing the same as you (just creating the partitioned table from the GCP web interface rather than the command line).

I am also seeing these weird errors, but data seems to flow to the tables just fine.

@zachaller

zachaller commented Dec 3, 2021

I don't really think this is in a state to submit upstream, because I don't know Ruby or Java very well and had to slowly work my way through getting this set up, but it works and allows table partitioning and clustering to be configured. I have it in a fork here: https://github.com/zachaller/logstash-output-google_bigquery. Below is an example config that creates one HOUR-partitioned table per month (BigQuery only supports 4,000 partitions per table, and a month of hourly partitions is at most 24 × 31 = 744, comfortably under that limit) and enables clustering on the table as well:

date_pattern => "%Y-%m"
table_separator => "_"
table_prefix => "logstash"
insert_partition_by => "HOUR"
cluster_fields => "kubernetesNamespace,containerName"

You should be able to build an image via:

docker build --rm -t logstash -f Dockerfile.build .
