
fix(operator): Operator panics when reconnect fails after max retries #1692

Open

wants to merge 4 commits into base: staging
Conversation

PatStiles
Contributor

Fix: Operator panics when reconnect fails after max retries

Description

Closes #960. When SubscribeToNewTasksV2Retryable and SubscribeToNewTasksV3Retryable exhaust the maximum number of retries, the respective subscribe functions panic due to dereferencing a nil pointer within the generated contract bindings. This PR addresses the issue by:
1.) Adding a deferred recover to the goroutine that converts the panic into an error before the operator process exits.
2.) Changing the retry parameters for SubscribeToNewTasksV2Retryable and SubscribeToNewTasksV3Retryable to retry indefinitely (with retry intervals capped at 60 seconds).

To Test:

  • Install nginx to use as a proxy for testing the connection: `brew install nginx`
  • Create a directory `nginx` containing a file `nginx.conf` with the following content:
events { }

http {
    server {
        listen 8082;

        location /main/ {
            proxy_pass http://host.docker.internal:8545/;  # Note the trailing slashes
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_connect_timeout 1d;
            proxy_send_timeout 1d;
            proxy_read_timeout 1d;
        }
    }
}
  • Create a Docker Compose file named `nginx-docker-compose.yaml` with the following contents:
version: '3'
services:
  nginx:
    image: nginx:alpine
    container_name: nginx-anvil-proxy
    volumes:
      - ./nginx:/etc/nginx
    ports:
      - "8082:8082"
  • Change the respective URLs in the config-files within the aligned-layer repo to use the `:8082/main/` port and path.
  • Start a local anvil node: `make anvil_start_with_block_time`
  • Start nginx in a Docker container: `docker compose -f nginx-docker-compose.yaml up -d`
  • Observe that proof verification succeeds.
  • Kill the connection to anvil: `docker compose -f nginx-docker-compose.yaml stop`
  • Observe that the operator blocks with the following message:

[Screenshot 2024-12-30 at 17:08:42]

  • Start the nginx proxy again (`docker compose -f nginx-docker-compose.yaml up -d`) and restart the aggregator: `make aggregator_start`
  • Observe that the operator reconnects and starts verifying batches.

[Screenshot 2024-12-30 at 17:10:41]

To reproduce the panic:

  • Change line 49 in `core/retry.go` to `SubscribeToNewTasksNumRetries = 1`
  • Start the local network with the nginx proxy as described above and kill the proxy.
  • Observe that the operator errors and shuts down with the panic wrapped in an error.

[Screenshot 2024-12-30 at 16:42:17]

Type of change

Please delete options that are not relevant.

  • Bug fix

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for the operator for both versions and does not change the batcher/docs/examples
    • This PR updates the batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

@PatStiles PatStiles self-assigned this Dec 30, 2024
core/retry.go Outdated
RespondToTaskV2MaxElapsedTime = 0 // Maximum time all retries may take. `0` corresponds to no limit on the time of the retries.
RespondToTaskV2NumRetries = 0 // Total number of retries attempted. If 0, retries indefinitely until maxElapsedTime is reached.

// Retry Parameters for SubscribeToNewTasksV3
Collaborator
Suggested change
// Retry Parameters for SubscribeToNewTasksV3
// Retry Parameters for SubscribeToNewTasks

Contributor Author

addressed!

Collaborator

@avilagaston9 avilagaston9 left a comment

Works in my machine! Maybe we should add logs for each retry to warn that we are trying to reconnect. Otherwise, it might seem like nothing is happening.

@PatStiles
Copy link
Contributor Author

Works in my machine! Maybe we should add logs for each retry to warn that we are trying to reconnect. Otherwise, it might seem like nothing is happening.

Agreed! @JuArce, should we do this in a separate PR?
