Dropping last message or two before new parquet writer is created #1712
Comments
Do you see a consumer group rebalance during those times? There should be some log messages indicating that a rebalance was happening; that window is usually where edge-case bugs show up.
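For reference, this is roughly where rebalance log lines come from when a consumer uses the Kafka consumer API's `ConsumerRebalanceListener`. It is only a sketch of the mechanism, not Secor's own code; Secor's consumer implementation and log wording may differ, so grep the Secor logs for its own equivalent around the timestamps of the dropped offsets.

```java
import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical listener showing the kind of log lines a group rebalance produces.
public class LoggingRebalanceListener implements ConsumerRebalanceListener {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingRebalanceListener.class);

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Partitions are handed off here; records buffered for them but not yet
        // finalized/uploaded are exactly the kind of edge case mentioned above.
        LOG.info("Rebalance: partitions revoked: {}", partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        LOG.info("Rebalance: partitions assigned: {}", partitions);
    }
}
```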
Also, can you use the SequenceFileReaderWriterFactory while debugging? It's much easier to debug by looking at the sequence files, since they hold the records in the order they came in.
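For what it's worth, switching the writer factory is a config change (the property should be along the lines of `secor.file.reader.writer.factory=com.pinterest.secor.io.impl.SequenceFileReaderWriterFactory`; check the exact key and class name for your build), and the resulting files can be dumped with a few lines of Hadoop code. This is only a sketch; the `LongWritable` offset key and `BytesWritable` message value layout is an assumption about how Secor writes its sequence files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Print one line per record so gaps in the Kafka offsets are easy to spot.
public class SequenceFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path(args[0])));
        try {
            LongWritable offset = new LongWritable();    // assumed key type
            BytesWritable message = new BytesWritable(); // assumed value type
            while (reader.next(offset, message)) {
                System.out.println(offset.get() + "\t" + message.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
```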
The other possibility is that the parquet file is not flushed to disk before the S3 or HDFS upload starts; take a look at the AvroParquetFileReaderWriterFactory class to see whether close() and flush() are called on all edge paths.
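A minimal sketch of the invariant being described, assuming the standard parquet-avro API (this is not Secor's actual upload code, and the helper names are hypothetical): `ParquetWriter` buffers rows in memory and only writes the final row group and the file footer inside `close()`, so an upload that starts before `close()` has returned on some code path ships a truncated file and silently drops the buffered tail records.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetCloseBeforeUpload {

    static void writeAndUpload(Schema schema, Iterable<GenericRecord> records,
                               String localPath) throws Exception {
        ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(localPath))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.GZIP)
                .build();
        try {
            for (GenericRecord record : records) {
                writer.write(record);   // rows sit in the writer's in-memory buffer
            }
        } finally {
            writer.close();             // must run on every path, or the tail rows are lost
        }
        upload(localPath);              // only safe once close() has returned
    }

    // Placeholder for whatever copies the finished file to S3/HDFS.
    static void upload(String localPath) { /* ... */ }
}
```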
Thanks for the tips on how to troubleshoot. I'll let you know what I find. And if there is an apparent fix I'll send a PR your way.
We are using the AvroMessageParser and AvroParquetFileReaderWriterFactory and have noticed that a very small number of messages are being dropped. On further investigation, the sequence numbers of the dropped messages correspond to the offset right before (or sometimes two before) the first offset of one of the files that was written to S3.
For example: if one of the files on S3 is named 1_1_00000000002329440769.gz.parquet (which I take to mean that the first piece of data in that file was from partition 1 with offset 2329440769), then the data that was dropped was at offset 2329440768.
The previous file, which I would have expected that data to land in, is well under our max file size parameter, so I think it is being finalized/written because it reached the max file age.
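To make the arithmetic above concrete, here is a small hypothetical check based on my reading of the file name layout (`<generation>_<kafkaPartition>_<firstOffset>.gz.parquet`); if that reading is right, the previous file should end exactly at the dropped offset.

```java
// Hypothetical sanity check, assuming the name is <generation>_<partition>_<firstOffset>.
public class OffsetCheck {
    public static void main(String[] args) {
        String name = "1_1_00000000002329440769.gz.parquet";
        String[] parts = name.split("\\.")[0].split("_");
        long firstOffset = Long.parseLong(parts[2]);     // 2329440769
        long lastOffsetOfPreviousFile = firstOffset - 1; // 2329440768, the dropped offset
        System.out.println("first offset in this file:         " + firstOffset);
        System.out.println("expected end of the previous file:  " + lastOffsetOfPreviousFile);
    }
}
```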
I will try to investigate more, see if I can write a unit test, and figure out what is going on. If it turns out this is somehow related to our setup/config, I'll add more detail here.
We are running a fairly recent version we built off master: 359c8b8
Thanks,
Jeremy