feat: SSAPI Checkpointing (BPS-277) #1969
Conversation
	ssapir.logger.Info("no checkpoint found")
	return
}
if err = json.Unmarshal(marshalBytes, ssapir.checkpointRecord); err != nil {
shouldn't we bomb out if we fail to unmarshal a checkpoint?
We shouldn't Fatal because that would stop the collector process entirely. I'd prefer seeing the error propagated back up.
We do this so that other components can have clean Shutdown calls.
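For illustration, here's a minimal sketch of what propagating that error up could look like; the function, package, and parameter names are assumptions based on the snippet above, not the PR's actual code. The caller (e.g. Start) can then decide how to handle the failure.

```go
package checkpointexample

import (
	"context"
	"encoding/json"
	"fmt"

	"go.opentelemetry.io/collector/extension/experimental/storage"
	"go.uber.org/zap"
)

// EventRecord mirrors the checkpoint struct shown in the diff below.
type EventRecord struct {
	Offset int `json:"offset"`
}

// loadCheckpoint returns the stored checkpoint, or nil if none exists.
// Unmarshal errors are returned to the caller instead of logged fatally,
// so other collector components can still shut down cleanly.
func loadCheckpoint(ctx context.Context, client storage.Client, logger *zap.Logger, key string) (*EventRecord, error) {
	marshalBytes, err := client.Get(ctx, key)
	if err != nil {
		return nil, fmt.Errorf("read checkpoint from storage: %w", err)
	}
	if marshalBytes == nil {
		logger.Info("no checkpoint found")
		return nil, nil
	}
	record := &EventRecord{}
	if err := json.Unmarshal(marshalBytes, record); err != nil {
		return nil, fmt.Errorf("unmarshal checkpoint: %w", err)
	}
	return record, nil
}
```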
// EventRecord struct stores the offset of the last event exported successfully
type EventRecord struct {
	Offset int `json:"offset"`
Curious: the last time I used a storage extension, we used a timestamp to manage the last processed event time... Do you think that'd be more bulletproof than using the paginated offset?
Using the offset instead of the timestamp was a piece of feedback from Dan on the spec. Looking back at his comment, he thought it would be easier to use the offset. Whether it's as bulletproof is a good question, but I assumed he wouldn't sacrifice the reliability of the receiver for ease. Curious to hear more of your perspective.
Yeah, I'm thinking there are a couple of scenarios we need to think about:
At least those are the two main things I'm curious whether this solution fully solves at the moment.
@@ -60,6 +62,10 @@ func (cfg *Config) Validate() error {
		return errors.New("at least one search must be provided")
	}

	if cfg.StorageID == nil {
		return errors.New("storage configuration must be provided")
Suggested change:
-	return errors.New("storage configuration must be provided")
+	return errors.New("storage configuration is required for this receiver")
How would you test this?
My recommendation would be to use https://pkg.go.dev/net/http/httptest and simulate one in code as a test; otherwise you can use a non-routable IP address: https://stackoverflow.com/questions/100841/artificially-create-a-connection-timeout-error
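To make that concrete, here's a rough sketch (the test name and timeout values are made up) of using httptest to simulate a stalled endpoint so the client times out without any real network dependency:

```go
package checkpointexample

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

// TestRequestTimeout stands up a local server that stalls longer than the
// client's timeout, simulating an unreachable or slow Splunk endpoint.
func TestRequestTimeout(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // stall past the client timeout
	}))
	defer server.Close()

	client := &http.Client{Timeout: 50 * time.Millisecond}
	if _, err := client.Get(server.URL); err == nil {
		t.Fatal("expected a timeout error, got nil")
	}
}
```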
There's one clarification I want to understand: what happens when the ConsumeLogs call breaks?
	ssapir.logger.Error("error consuming logs", zap.Error(err))
}
// last batch of logs has been successfully exported
exportedEvents += logs.ResourceLogs().Len()
If there was an error in the ConsumeLogs call, I believe exportedEvents would be an incorrect value, right?
At least since we're not stopping if there was an error from the consumer. I think we should consider stopping whenever a ConsumeLogs error occurs.
And a checkpoint should still be written in that case
I agree that the process should break if ConsumeLogs errors, but I don't think the checkpoint should be updated.
In my head, the flow is like this:
- Previous batch consumed successfully (new offset written, which is the offset of the last log exported +1)
- Next batch obtained from Splunk using the offset
- Batch fails to export; ConsumeLogs errors and the receiver process exits (no partial writes to GCP by default, no checkpoint written)
- User addresses some issue with their GCP instance
- User restarts the receiver, which grabs the last written offset, essentially restarting from step 2
Feel free to add on if I'm missing something.
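To make that flow concrete, here's a rough sketch (function and parameter names are illustrative, not the receiver's actual code) of exporting one batch and only persisting the new offset after ConsumeLogs succeeds:

```go
package checkpointexample

import (
	"context"
	"encoding/json"
	"fmt"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/extension/experimental/storage"
	"go.opentelemetry.io/collector/pdata/plog"
)

// EventRecord mirrors the checkpoint struct from the diff above
// (redeclared here to keep the sketch self-contained).
type EventRecord struct {
	Offset int `json:"offset"`
}

// exportBatch forwards one batch to the next consumer and, only on success,
// persists the new offset. On a ConsumeLogs error it returns the old offset
// untouched, so a restart resumes from the last good checkpoint.
func exportBatch(ctx context.Context, next consumer.Logs, client storage.Client, key string, logs plog.Logs, offset int) (int, error) {
	if err := next.ConsumeLogs(ctx, logs); err != nil {
		return offset, fmt.Errorf("consume logs: %w", err)
	}
	newOffset := offset + logs.ResourceLogs().Len()
	checkpoint, err := json.Marshal(EventRecord{Offset: newOffset})
	if err != nil {
		return offset, fmt.Errorf("marshal checkpoint: %w", err)
	}
	if err := client.Set(ctx, key, checkpoint); err != nil {
		return offset, fmt.Errorf("write checkpoint: %w", err)
	}
	return newOffset, nil
}
```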
Yeah, I think my brain was on the thread of: what if some of the batch gets processed but the rest gets dropped? I think that gets hairy for checkpointing, for what is likely a pretty big edge case...
Makes sense. It was my impression that the GCP exporter wouldn't partially write a batch, but I might be off on that. I don't know how to handle that case right now, but that's something I wanted to revisit when we get a chance to test. It depends on how much information the error returns. If there's no info on which log(s) failed, then we're kinda hosed.
go.opentelemetry.io/collector/consumer v0.113.0
go.opentelemetry.io/collector/consumer/consumertest v0.113.0
go.opentelemetry.io/collector/extension/experimental/storage v0.113.0
go.opentelemetry.io/collector/filter v0.114.0
is this supposed to be v0.114.0? I'd imagine a merge from main might smooth out some of these dependency conflicts?
I'm not seeing any conflicts personally. Running go mod tidy doesn't update these deps. Should I rebase the feature branch to the latest release?
Perhaps @dpaasman00 would know better, but I'd imagine we probably want go.opentelemetry.io/collector/filter v0.113.0 rather than v0.114.0. Since this isn't being merged into main for a little while, though, I'd imagine it's not worth worrying about too much until we want to start entertaining a main merge.
I think it's good for the feature branch; it may need some further testing.
Nice work 🚀
Proposed Change
Add checkpointing capabilities to the SSAPI receiver.
The receiver now requires a storage extension to be configured in order to run.
Store the offset of the last successfully exported log as the checkpoint value.
Upon starting the receiver, load a checkpoint from the storage extension in case a previous run failed.
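For reference, a minimal sketch of how the receiver could resolve the configured storage extension at startup, assuming the standard storage-extension lookup pattern; the names here are illustrative, not the exact PR code:

```go
package checkpointexample

import (
	"context"
	"fmt"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/extension/experimental/storage"
)

// getStorageClient looks up the configured storage extension on the host and
// opens a client scoped to this receiver. Failing when the extension is
// missing is what makes the storage configuration effectively required.
func getStorageClient(ctx context.Context, host component.Host, storageID, receiverID component.ID) (storage.Client, error) {
	ext, ok := host.GetExtensions()[storageID]
	if !ok {
		return nil, fmt.Errorf("storage extension %q not found", storageID)
	}
	storageExt, ok := ext.(storage.Extension)
	if !ok {
		return nil, fmt.Errorf("extension %q is not a storage extension", storageID)
	}
	return storageExt.GetClient(ctx, component.KindReceiver, receiverID, "")
}
```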
1st run (Empty storage extn):
2nd run (Storage extension not cleared out to simulate an existing offset checkpoint):
Checklist