Support for partition table #42

Open
alepuccetti opened this issue Aug 29, 2018 · 8 comments

@alepuccetti

alepuccetti commented Aug 29, 2018

BigQuery can partition tables by time, which saves cost and improves performance when querying smaller time windows.

Enable creating tables partitioned by ingestion time or by a date/timestamp field in the schema.

Maybe there is already a way to do this, but it is not documented.

@josephlewis42
Contributor

Hi @alepuccetti,

You might be able to set a custom date pattern and follow the guide for converting date sharded tables into a single date partitioned table, even if it's not ideal.

To add support for this feature, it looks like we'll need a few more settings (from the docs on table creation):

| setting | type | description |
| --- | --- | --- |
| clustering | nested object | [Beta] Clustering specification for the table. Must be specified with time-based partitioning, data in the table will be first partitioned and subsequently clustered. |
| clustering.fields[] | list | [Repeated] One or more fields on which data should be clustered. Only top-level, non-repeated, simple-type fields are supported. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data. |
| timePartitioning | nested object | Time-based partitioning specification for this table. |
| timePartitioning.expirationMs | long | [Optional] Number of milliseconds for which to keep the storage for partitions in the table. The storage in a partition will have an expiration time of its partition time plus this value. |
| timePartitioning.field | string | [Beta] [Optional] If not set, the table is partitioned by pseudo column, referenced via either '_PARTITIONTIME' as TIMESTAMP type, or '_PARTITIONDATE' as DATE type. If field is specified, the table is instead partitioned by this field. The field must be a top-level TIMESTAMP or DATE field. Its mode must be NULLABLE or REQUIRED. |
| timePartitioning.requirePartitionFilter | boolean | [Beta] [Optional] If set to true, queries over this table require a partition filter that can be used for partition elimination to be specified. |
| timePartitioning.type | string | [Required] The only type supported is DAY, which will generate one partition per day. |
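
For reference, here is a minimal sketch of how those fields would sit in a tables.insert request body (the field names come from the table above; the project, dataset, table, and column names are made-up placeholders):

  {
    "tableReference": {
      "projectId": "my-project",
      "datasetId": "my_dataset",
      "tableId": "my_table"
    },
    "timePartitioning": {
      "type": "DAY",
      "field": "timestamp",
      "expirationMs": "5184000000",
      "requirePartitionFilter": false
    },
    "clustering": {
      "fields": ["host", "service"]
    }
  }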

Translated to a LogStash config, that would probably look something like this:

  config :cluster_fields, validate: :string, list: true, default: []
  config :enable_time_partitioning, validate: :boolean, default: false
  config :time_partition_field, validate: :string, default: ""
  config :time_partition_require_filter, validate: :boolean, default: false
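
For illustration, used in a pipeline that might look something like this (hypothetical: the last four options are the proposed ones and don't exist in the released plugin; the others are existing settings):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "my_dataset"
      table_prefix => "logstash"
      json_key_file => "/path/to/key.json"
      # proposed options from above (not yet implemented)
      enable_time_partitioning => true
      time_partition_field => "timestamp"
      time_partition_require_filter => false
      cluster_fields => ["host", "service"]
    }
  }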

Do you think something like that could fit your needs?

@alepuccetti
Author

Hi @josephlewis42,

> You might be able to set a custom date pattern and follow the guide for converting date sharded tables into a single date partitioned table, even if it's not ideal.

Creating monthly (or otherwise time-sharded) tables is not feasible for my use case. Writing queries that touch multiple tables is very ugly, and if you want to build a dashboard with Data Studio, I am not even sure that is possible. Converting sharded tables to partitioned ones is also not feasible, because I want to query the data in near real time.

I think that being able to leverage partitioned tables would be great for everyone. I am not familiar with the internals of streaming ingestion, but I think it is just a table definition issue, isn't it?
Also, clustering is not required to use partitioning, so it could be separate work (but clustering does require partitioned tables).

I am trying a number of different things now (I started using this plugin today):

  • Sending data to an already existing partitioned/non-partitioned table.
  • Inserting into a dynamically generated table.

In my case, I would rather have one table with all the data and drop rows with a SELECT statement. Currently, I can do this by setting date_pattern => "" but the table name will be <table_prefix>+<table_separator>. I will open a separate issue about this.
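
For reference, that workaround looks roughly like this (project, dataset, and key path are placeholders):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "my_dataset"
      json_key_file => "/path/to/key.json"
      table_prefix => "mytable"
      date_pattern => ""
      # resulting table name: "mytable_" (prefix plus the default "_" separator, since the date part is empty)
    }
  }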

@alepuccetti
Author

alepuccetti commented Aug 29, 2018

Update:

I successfully inserted into an existing partitioned table, so this could be a quick workaround.

However, I noticed that if one record in a batch fails, the whole batch fails.
If I set batch_size to 1, I insert more records than if I set it to 128 or more. Is this a known problem? With batch_size = 1, I get all the records except the failing ones; with batch_size = 128, the total number of inserted records is a multiple of 128. The input dataset is a bit more than a thousand records.

Note that setting batch_size to 1 makes the insert very slow.
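
For context, the option being varied is the plugin's batch_size setting, e.g.:

  google_bigquery {
    # connection and table settings omitted
    batch_size => 1    # one record per insert: slow, but only the failing records are lost
    # with the default of 128, one bad record appears to fail the whole batch
  }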

@MPTG94

MPTG94 commented May 13, 2021

@alepuccetti can you please show how you managed to insert data to an existing partitioned table?

I am trying to do the same thing right now, but I'm getting constant Null Pointer Exceptions in the logstash logfile (the config I am using worked without any issue before I made the table a partitioned one)

@alepuccetti
Author

@MPTG94 It has been a long time, so I don't remember for sure whether I had to do anything special, but I don't remember having this issue. Maybe something has changed since I did it.

@simon-verzijl

> @alepuccetti can you please show how you managed to insert data to an existing partitioned table?
>
> I am trying to do the same thing right now, but I'm getting constant Null Pointer Exceptions in the logstash logfile (the config I am using worked without any issue before I made the table a partitioned one)

I was just trying to do this myself, since I want the data loaded into a partitioned table with the partition expiry set to, for example, 60 days.
When I create the table myself, with the same schema that the BQ plugin created automatically, it works fine for me.

For example, in my case:

bq mk -t \
--schema 'timestamp:TIMESTAMP,host:STRING,<all other fields>' \
--time_partitioning_field timestamp \
--time_partitioning_type DAY \
--time_partitioning_expiration 5184000  \
dataset.mytable

(and then just set table_prefix to mytable and table_separator + date_pattern empty)
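
In pipeline terms that corresponds to something like this (project, dataset, and key path are placeholders):

  output {
    google_bigquery {
      project_id => "my-project"
      dataset => "dataset"
      json_key_file => "/path/to/key.json"
      table_prefix => "mytable"
      table_separator => ""
      date_pattern => ""    # empty separator and pattern, so all rows go to the pre-created "mytable"
    }
  }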

I am seeing a lot of these errors, but the data is actually being loaded successfully:

[2021-07-08T17:31:26,013][ERROR][logstash.outputs.googlebigquery] Error creating table. {:exception=>java.lang.NullPointerException}
[2021-07-08T17:31:26,024][ERROR][logstash.outputs.googlebigquery] Error creating table. {:exception=>java.lang.NullPointerException}
[2021-07-08T17:31:26,041][INFO ][logstash.outputs.googlebigquery] Publishing 128 messages to mytable
[2021-07-08T17:31:26,073][INFO ][logstash.outputs.googlebigquery] Publishing 128 messages to mytable

@MPTG94

MPTG94 commented Jul 8, 2021

@simon-verzijl thanks for the reply, I ended up doing the same as you (just creating the partitioned table from the GCP web interface rather than the command line).

I am also seeing these weird errors, but data seems to flow to the tables just fine.

@zachaller

zachaller commented Dec 3, 2021

I don't really think this is in a state to submit upstream, because I don't know Ruby or Java very well and had to slowly work my way through getting this set up, but it works and allows table partitioning and clustering to be configured. I have it in a fork here: https://github.com/zachaller/logstash-output-google_bigquery. Below is an example config that creates one HOUR-partitioned table per month (BigQuery only supports 4,000 partitions per table, and a month of hourly partitions is at most 24 × 31 = 744, comfortably under that limit) and enables clustering on the table as well:

date_pattern => "%Y-%m"
table_separator => "_"
table_prefix => "logstash"
insert_partition_by => "HOUR"
cluster_fields => "kubernetesNamespace,containerName"

You should be able to build an image via:

docker build --rm -t logstash -f Dockerfile.build .
