Add estimated row batch size in bytes to state #230
@@ -142,7 +141,6 @@ func (c *Cursor) Each(f func(*RowBatch) error) error {
tx.Rollback()

c.lastSuccessfulPaginationKey = paginationKeypos
c.rowsExamined += uint64(batch.Size())
oh yeah this wasn't used... :whistling:..
test/go/row_batch_test.go
s := batch.EstimateByteSize()

fmt.Printf("%d", s)
Maybe we should actually assert something?
Removed since I'm already testing in the callback.
func (e *RowBatch) EstimateByteSize() uint64 {
	var total int
	for _, v := range e.values {
		size, err := json.Marshal(v)
I'm worried about the overhead of the json marshaling here. Have you run any benchmarks to see how much additional CPU this will take?
Also, instead of the json marshaling, have we considered unsafe.Sizeof() or reflect.Type.Size()? I'm not familiar with the risks of using the unsafe package, however. Something to consider.
unsafe.Sizeof (and reflect.Type.Size, which reports the same thing) gives you the size of the value's fixed-size header, not the data it points to. So regardless of string length, the size is the same (16 bytes for a string on a 64-bit platform). Same with uint64, it's always gonna be 8. So it's not quite the same since we want to know the byte length of the data itself.
And with some benchmarking, the json.Marshal seems harmless 👇
BenchmarkSize-12 1000000000 0.000454 ns/op 0 allocs/op #2.8MB file
BenchmarkJSON-12 1000000000 0.00634 ns/op 0 allocs/op #2.8MB file
BenchmarkSizeSmall-12 1000000000 0.000077 ns/op 0 allocs/op #200KB file
BenchmarkJSONSmall-12 1000000000 0.000429 ns/op 0 allocs/op #200KB file
Yeah you'd have to traverse into the pointer and things get ugly quickly (although json.Marshal is technically doing it).
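For illustration only, traversing the values by hand could look roughly like the sketch below. This is not what the PR does. estimateValueSize and estimateRowSize are hypothetical helpers, the set of handled types is an assumption about what a row might contain, and the sizes for int and uint assume a 64-bit platform.

```go
package main

import "fmt"

// estimateValueSize (hypothetical) returns a rough byte size for one column
// value by switching on its dynamic type instead of marshaling it.
func estimateValueSize(v interface{}) uint64 {
	switch x := v.(type) {
	case nil:
		return 0
	case string:
		return uint64(len(x))
	case []byte:
		return uint64(len(x))
	case int8, uint8, bool:
		return 1
	case int16, uint16:
		return 2
	case int32, uint32, float32:
		return 4
	case int64, uint64, float64, int, uint:
		// int and uint are assumed to be 8 bytes (64-bit platform).
		return 8
	default:
		// Unknown types fall back to the length of their printed form.
		return uint64(len(fmt.Sprint(x)))
	}
}

// estimateRowSize (hypothetical) sums the per-value estimates for one row.
func estimateRowSize(row []interface{}) uint64 {
	var total uint64
	for _, v := range row {
		total += estimateValueSize(v)
	}
	return total
}

func main() {
	row := []interface{}{int64(1), "abc", nil, []byte{1, 2}}
	fmt.Println(estimateRowSize(row))
}
```

Every new column type needs its own case, which is the "gets ugly quickly" part. json.Marshal hides that traversal behind one call.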
Thinking out loud here, but we could technically, with minimal performance hit, wrap mysql.writeExecutePacket and len the return data after filtering for INSERTS.
Or no we can't, it doesn't return the data, just an error.
There are some performance issues with the estimation of the row batch size in bytes, noted in #240. I have added a flag, which can be passed in the ghostferry config, to turn the row batch size estimation on or off.
	bytesWrittenForThisBatch = batch.EstimateByteSize()
}
w.StateTracker.UpdateLastSuccessfulPaginationKey(batch.TableSchema().String(), endPaginationKeypos,
	RowStats{NumBytes: bytesWrittenForThisBatch, NumRows: uint64(batch.Size())})
This behaviour seems a bit sketchy. If someone has turned EnableRowBatchSize off, they'll see NumBytes set to 0, which would create confusion. I don't think there's much we can do here other than documenting this behaviour.
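The behaviour being discussed can be sketched in isolation as follows. RowStats copies the two fields visible in the diff. statsForBatch and its parameters are hypothetical names for illustration, not the actual writer code.

```go
package main

import "fmt"

// RowStats mirrors the two fields that appear in the diff under review.
type RowStats struct {
	NumBytes uint64
	NumRows  uint64
}

// statsForBatch (hypothetical) builds the stats for one batch. The estimate
// function is only called when the flag is on. With the flag off, NumBytes
// stays 0, which is indistinguishable from a batch that really wrote 0 bytes.
func statsForBatch(enableRowBatchSize bool, numRows uint64, estimate func() uint64) RowStats {
	var numBytes uint64
	if enableRowBatchSize {
		numBytes = estimate()
	}
	return RowStats{NumBytes: numBytes, NumRows: numRows}
}

func main() {
	estimate := func() uint64 { return 42 }

	fmt.Printf("%+v\n", statsForBatch(true, 3, estimate))
	fmt.Printf("%+v\n", statsForBatch(false, 3, estimate))
}
```

A consumer of the progress callback cannot tell "estimation disabled" from "zero bytes" with this shape, which is why documenting the flag matters.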
changes in config debug tests fix tests added go tests modifications
Related: #226
Similar PR: #228
Add an estimate of the number of bytes for each row batch and send it with the progress callback.
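As a rough, self-contained sketch of the estimation idea (not the code merged in this PR): the RowBatch type below is a simplified stand-in with only a values field, and skipping rows that fail to marshal is an assumption, since the reviewed snippet does not show its error handling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RowBatch is a simplified stand-in for the real type: each element of
// values is one row, and each row is a slice of column values.
type RowBatch struct {
	values [][]interface{}
}

// EstimateByteSize sums the length of the JSON encoding of every row.
// Rows that cannot be marshaled are skipped (an assumption of this sketch).
func (e *RowBatch) EstimateByteSize() uint64 {
	var total uint64
	for _, v := range e.values {
		b, err := json.Marshal(v)
		if err != nil {
			continue
		}
		total += uint64(len(b))
	}
	return total
}

func main() {
	batch := &RowBatch{values: [][]interface{}{
		{1, "abc"},
		{2, "de"},
	}}
	fmt.Println(batch.EstimateByteSize())
}
```

The result counts JSON punctuation as well as the data, so it is an estimate rather than the exact number of bytes written.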