Deadlock Detected in Evolution API with Prisma Client #1082

Open
costaeder opened this issue Nov 22, 2024 · 0 comments
Labels
bug Something isn't working

Welcome!

  • Yes, I have searched for similar issues on GitHub and found none.

What did you do?

We are encountering a deadlock error in the Evolution API during query execution. The issue arises due to concurrent updates on certain database records, as shown in the error logs below:

ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 4756 waits for ShareLock on transaction 1029970; blocked by process 4745. Process 4745 waits for ShareLock on transaction 1029972; blocked by process 4744. Process 4744 waits for ShareLock on transaction 1029973; blocked by process 4756."), column: None, hint: Some("See server log for query details.") }), transient: false })

The issue occurs when multiple processes attempt to acquire locks on the same transactions, creating a circular wait condition.

Temporary Solution
To mitigate the issue, we implemented a trigger at the database level to reduce concurrent updates. The trigger restricts frequent updates to certain records based on the updatedAt field, allowing updates only if:

- Fields other than updatedAt are modified, or
- The last update occurred more than 5 minutes ago.
Here is the trigger function applied:

CREATE OR REPLACE FUNCTION public.prevent_frequent_updates()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF
AS $BODY$
BEGIN
    -- Verifies if ONLY the "updatedAt" field is being updated
    IF (NEW."remoteJid" IS DISTINCT FROM OLD."remoteJid"
        OR NEW."labels" IS DISTINCT FROM OLD."labels"
        OR NEW."name" IS DISTINCT FROM OLD."name"
        OR NEW."unreadMessages" IS DISTINCT FROM OLD."unreadMessages"
        OR NEW."instanceId" IS DISTINCT FROM OLD."instanceId") THEN
        -- Allows the update if any field other than "updatedAt" is modified
        RETURN NEW;
    END IF;

    -- Checks if the record was updated within the last 5 minutes
    IF (NEW."updatedAt" <= OLD."updatedAt" + INTERVAL '5 minutes') THEN
        -- Returns the current state without applying the update
        RETURN OLD;
    END IF;

    -- Allows the update
    RETURN NEW;
END;
$BODY$;

ALTER FUNCTION public.prevent_frequent_updates()
    OWNER TO postgres;

What did you expect?

Suggested Solution
A more robust solution would involve enqueuing updates targeting the same remoteJid to prevent concurrent updates on the same row in the chat table. This approach would serialize updates to a specific remoteJid, avoiding transaction contention entirely.
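
For illustration, a minimal sketch of that per-remoteJid serialization in TypeScript. It assumes a Node.js service using the Prisma client; the enqueue helper and the prisma.chat model/field names are illustrative assumptions, not existing Evolution API code.

// Hypothetical per-key queue: updates that share a remoteJid run strictly one
// after another, so two transactions never contend for the same chat row.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Tail of the pending-work chain for each remoteJid.
const chains = new Map<string, Promise<unknown>>();

function enqueue<T>(key: string, task: () => Promise<T>): Promise<T> {
  const previous = chains.get(key) ?? Promise.resolve();
  // Chain the new task behind whatever is already queued for this key.
  const next = previous.catch(() => undefined).then(task);
  chains.set(key, next);
  // Forget the chain once it has drained, so the map does not grow forever.
  next.catch(() => undefined).finally(() => {
    if (chains.get(key) === next) chains.delete(key);
  });
  return next;
}

// All chat updates for one remoteJid go through the queue, so they are applied
// sequentially even when many webhook events arrive at the same time.
// (prisma.chat and its field names are assumptions, not the real schema.)
function updateChat(instanceId: string, remoteJid: string, unreadMessages: number) {
  return enqueue(remoteJid, () =>
    prisma.chat.updateMany({
      where: { instanceId, remoteJid },
      data: { unreadMessages, updatedAt: new Date() },
    }),
  );
}

Note that an in-process map only serializes updates within a single replica; with several API replicas, the same idea would need a shared queue (for example the RabbitMQ instance already present in the stack below).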

Steps to Reproduce
1. Execute multiple concurrent updates on records with overlapping lock dependencies.
2. Observe deadlock errors in the API logs.
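
One classic way to provoke a 40P01 cycle like the one in the log is to run batch transactions that touch the same rows in different orders. The sketch below is illustrative only; prisma.chat and its columns are assumptions, and real traffic reaches the same pattern through concurrent webhook events rather than a test script.

// Hypothetical reproduction: two batch transactions update the same chat rows
// in opposite orders, the textbook recipe for a PostgreSQL 40P01 deadlock.
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

async function provokeDeadlock(remoteJids: string[]): Promise<void> {
  const touch = (remoteJid: string) =>
    prisma.chat.updateMany({
      where: { remoteJid },
      data: { updatedAt: new Date() },
    });

  // Each $transaction acquires row locks one by one; opposite orders create
  // the circular wait seen in the error detail above.
  const results = await Promise.allSettled([
    prisma.$transaction(remoteJids.map(touch)),
    prisma.$transaction([...remoteJids].reverse().map(touch)),
  ]);

  for (const result of results) {
    if (result.status === 'rejected') console.error(result.reason);
  }
}
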
Expected Behavior
The system should handle concurrent updates without causing deadlocks, potentially using queueing or transaction management strategies.

Actual Behavior
Deadlock errors prevent some updates from completing successfully.

Possible Solutions
1. Enqueue updates targeting the same remoteJid, serializing them to ensure only one transaction modifies the row at a time.
2. Investigate and optimize the Prisma query execution strategy to minimize transaction lock contention.
3. Implement application-level queuing for updates to avoid simultaneous operations on the same records.
4. Consider using row-level locks with SELECT ... FOR UPDATE to prevent deadlocks during concurrent updates.
5. Introduce retry logic in the application to handle transient deadlock errors gracefully (a sketch follows this list).
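
As a sketch of the last point, a small retry wrapper could catch the transient deadlock and replay the operation. withDeadlockRetry is an illustrative helper, not an existing Evolution API function; the 40P01 / "deadlock detected" checks mirror the error in the log above, and P2034 is Prisma's error code for write-conflict/deadlock failures.

// Hypothetical retry wrapper for transient deadlocks. Illustrative only.
import { Prisma } from '@prisma/client';

function isDeadlock(error: unknown): boolean {
  if (error instanceof Prisma.PrismaClientKnownRequestError) {
    // P2034: transaction failed due to a write conflict or a deadlock.
    return error.code === 'P2034';
  }
  // Fall back to matching the raw PostgreSQL error surfaced by the connector.
  return error instanceof Error &&
    (error.message.includes('40P01') || error.message.includes('deadlock detected'));
}

async function withDeadlockRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (!isDeadlock(error) || attempt >= maxAttempts) throw error;
      // Brief backoff with jitter so the competing transactions drift apart.
      await new Promise((resolve) => setTimeout(resolve, 50 * attempt + Math.random() * 100));
    }
  }
}

// Usage sketch (model name assumed):
// await withDeadlockRetry(() =>
//   prisma.chat.updateMany({ where: { remoteJid }, data: { unreadMessages: 0 } }),
// );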

What did you observe instead of what you expected?

Expected:
We expected the updates to execute successfully without locking or waiting issues, even with multiple concurrent requests targeting the chat table.

Observed:
Instead, a deadlock occurs during concurrent updates, causing transactions to fail with the "deadlock detected" (40P01) error shown above.

Screenshots/Videos

version: "3.7"

services:
evolution_s03:
image: atendai/evolution-api:v2.2.0

volumes:
  - evolution_instances:/evolution/instances
networks:
  - evolution
environment:
  - SERVER_URL=https://chat-whatsapi-s03.example.com
  - LOG_LEVEL=ERROR
  - LOG_BAILEYS=error
  - DEL_INSTANCE=false
  - DATABASE_PROVIDER=postgresql
  - DATABASE_CONNECTION_URI=${DATABASE_CONNECTION_URI}
  - DATABASE_SAVE_DATA_INSTANCE=true
  - DATABASE_SAVE_DATA_NEW_MESSAGE=true
  - DATABASE_SAVE_MESSAGE_UPDATE=true
  - DATABASE_SAVE_DATA_CONTACTS=true
  - DATABASE_SAVE_DATA_CHATS=true
  - DATABASE_SAVE_DATA_LABELS=true
  - DATABASE_SAVE_DATA_HISTORIC=true
  - DATABASE_CONNECTION_CLIENT_NAME=evolution_s03
  - RABBITMQ_ENABLED=true
  - RABBITMQ_URI=${RABBITMQ_URI}
  - RABBITMQ_EXCHANGE_NAME=evolution_s03
  - RABBITMQ_GLOBAL_ENABLED=false
  - RABBITMQ_EVENTS_APPLICATION_STARTUP=false
  - RABBITMQ_EVENTS_INSTANCE_CREATE=false
  - RABBITMQ_EVENTS_INSTANCE_DELETE=false
  - RABBITMQ_EVENTS_QRCODE_UPDATED=false
  - RABBITMQ_EVENTS_MESSAGES_SET=false
  - RABBITMQ_EVENTS_MESSAGES_UPSERT=true
  - RABBITMQ_EVENTS_MESSAGES_EDITED=false
  - RABBITMQ_EVENTS_MESSAGES_UPDATE=false
  - RABBITMQ_EVENTS_MESSAGES_DELETE=false
  - RABBITMQ_EVENTS_SEND_MESSAGE=false
  - RABBITMQ_EVENTS_CONTACTS_SET=false
  - RABBITMQ_EVENTS_CONTACTS_UPSERT=false
  - RABBITMQ_EVENTS_CONTACTS_UPDATE=false
  - RABBITMQ_EVENTS_PRESENCE_UPDATE=false
  - RABBITMQ_EVENTS_CHATS_SET=false
  - RABBITMQ_EVENTS_CHATS_UPSERT=false
  - RABBITMQ_EVENTS_CHATS_UPDATE=false
  - RABBITMQ_EVENTS_CHATS_DELETE=false
  - RABBITMQ_EVENTS_GROUPS_UPSERT=false
  - RABBITMQ_EVENTS_GROUP_UPDATE=false
  - RABBITMQ_EVENTS_GROUP_PARTICIPANTS_UPDATE=false
  - RABBITMQ_EVENTS_CONNECTION_UPDATE=true
  - RABBITMQ_EVENTS_CALL=false
  - RABBITMQ_EVENTS_TYPEBOT_START=false
  - RABBITMQ_EVENTS_TYPEBOT_CHANGE_STATUS=false
  - SQS_ENABLED=false
  - SQS_ACCESS_KEY_ID=
  - SQS_SECRET_ACCESS_KEY=
  - SQS_ACCOUNT_ID=
  - SQS_REGION=
  - WEBSOCKET_ENABLED=false
  - WEBSOCKET_GLOBAL_EVENTS=false
  - WA_BUSINESS_TOKEN_WEBHOOK=evolution
  - WA_BUSINESS_URL=https://graph.facebook.com
  - WA_BUSINESS_VERSION=v20.0
  - WA_BUSINESS_LANGUAGE=pt_BR
  - WEBHOOK_GLOBAL_URL=''
  - WEBHOOK_GLOBAL_ENABLED=false
  - WEBHOOK_GLOBAL_WEBHOOK_BY_EVENTS=false
  - WEBHOOK_EVENTS_APPLICATION_STARTUP=false
  - WEBHOOK_EVENTS_QRCODE_UPDATED=true
  - WEBHOOK_EVENTS_MESSAGES_SET=true
  - WEBHOOK_EVENTS_MESSAGES_UPSERT=true
  - WEBHOOK_EVENTS_MESSAGES_EDITED=true
  - WEBHOOK_EVENTS_MESSAGES_UPDATE=true
  - WEBHOOK_EVENTS_MESSAGES_DELETE=true
  - WEBHOOK_EVENTS_SEND_MESSAGE=true
  - WEBHOOK_EVENTS_CONTACTS_SET=true
  - WEBHOOK_EVENTS_CONTACTS_UPSERT=true
  - WEBHOOK_EVENTS_CONTACTS_UPDATE=true
  - WEBHOOK_EVENTS_PRESENCE_UPDATE=true
  - WEBHOOK_EVENTS_CHATS_SET=true
  - WEBHOOK_EVENTS_CHATS_UPSERT=true
  - WEBHOOK_EVENTS_CHATS_UPDATE=true
  - WEBHOOK_EVENTS_CHATS_DELETE=true
  - WEBHOOK_EVENTS_GROUPS_UPSERT=true
  - WEBHOOK_EVENTS_GROUPS_UPDATE=true
  - WEBHOOK_EVENTS_GROUP_PARTICIPANTS_UPDATE=true
  - WEBHOOK_EVENTS_CONNECTION_UPDATE=true
  - WEBHOOK_EVENTS_LABELS_EDIT=true
  - WEBHOOK_EVENTS_LABELS_ASSOCIATION=true
  - WEBHOOK_EVENTS_CALL=true
  - WEBHOOK_EVENTS_TYPEBOT_START=false
  - WEBHOOK_EVENTS_TYPEBOT_CHANGE_STATUS=false
  - WEBHOOK_EVENTS_ERRORS=false
  - WEBHOOK_EVENTS_ERRORS_WEBHOOK=
  - CONFIG_SESSION_PHONE_CLIENT=Evo S03
  - CONFIG_SESSION_PHONE_NAME=Chrome
  - CONFIG_SESSION_PHONE_VERSION=2.3000.1015901307
  - QRCODE_LIMIT=30
  - OPENAI_ENABLED=true
  - DIFY_ENABLED=false
  - TYPEBOT_ENABLED=false
  - TYPEBOT_API_VERSION=latest
  - CHATWOOT_ENABLED=true
  - CHATWOOT_MESSAGE_READ=true
  - CHATWOOT_MESSAGE_DELETE=true
  - CHATWOOT_IMPORT_DATABASE_CONNECTION_URI=${CHATWOOT_DATABASE_CONNECTION_URI}
  - CHATWOOT_IMPORT_PLACEHOLDER_MEDIA_MESSAGE=true
  - CACHE_REDIS_ENABLED=true
  - CACHE_REDIS_URI=${CACHE_REDIS_URI}
  - CACHE_REDIS_PREFIX_KEY=evolution_s03
  - CACHE_REDIS_SAVE_INSTANCES=false
  - CACHE_LOCAL_ENABLED=false
  - S3_ENABLED=true
  - S3_ACCESS_KEY=${S3_ACCESS_KEY}
  - S3_SECRET_KEY=${S3_SECRET_KEY}
  - S3_BUCKET=${S3_BUCKET}
  - S3_PORT=443
  - S3_ENDPOINT=s3.us-east-1.amazonaws.com
  - S3_REGION=us-east-1
  - S3_USE_SSL=true
  - AUTHENTICATION_API_KEY=${AUTHENTICATION_API_KEY}
  - AUTHENTICATION_EXPOSE_IN_FETCH_INSTANCES=true
  - LANGUAGE=en
  - SENTRY_DSN=${SENTRY_DSN}
deploy:
  mode: replicated
  resources:
    limits:
      cpus: "1"
      memory: 2G
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  labels:
    - traefik.enable=true
    - traefik.http.routers.evolution_s03.rule=Host(`chat-whatsapi-s03.example.com`)
    - traefik.http.routers.evolution_s03.entrypoints=websecure
    - traefik.http.routers.evolution_s03.tls.certresolver=letsencryptresolver
    - traefik.http.routers.evolution_s03.priority=1
    - traefik.http.routers.evolution_s03.service=evolution_s03
    - traefik.http.services.evolution_s03.loadbalancer.server.port=8080
    - traefik.http.services.evolution_s03.loadbalancer.passHostHeader=true

evolution_s03_postgres:
image: postgres:14
environment:
  - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
  - TZ=America/Sao_Paulo
  - PGTZ=America/Sao_Paulo
networks:
  - evolution
ports:
  - 12013:5432
shm_size: 512mb

volumes:
  - postgres_data:/var/lib/postgresql/data
deploy:
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  resources:
    limits:
      cpus: "2"
      memory: 4G

evolution_s03_redis:
image: redis:latest
command: [
  "redis-server",
  "--appendonly",
  "yes",
  "--port",
  "6379"
]
volumes:
  - redis_data:/data
networks:
  - evolution
deploy:
  placement:
    constraints:
      - node.role == manager
  resources:
    limits:
      cpus: "1"
      memory: 512M

volumes:
  evolution_instances:
    external: true
    name: evolution_s03_data
  postgres_data:
    external: true
    name: evolution_s03_database
  redis_data:
    external: true
    name: evolution_s03_cache

networks:
  evolution:
    external: true
    name: evolution

Which version of the API are you using?

2.2.0

What is your environment?

Docker

Other environment specifications

DOCKER:
Server Environment:
OS: Ubuntu
CPU: 8 vCPUs
RAM: 32GB

If applicable, paste the log output

No response

Additional Notes

While investigating the deadlock issue, we observed several logs indicating potential connection resets from the database client. This might be contributing to or exacerbating the problem. Below are the relevant logs:

2024-11-22 20:20:23.876 UTC [2019] LOG: could not receive data from client: Connection reset by peer
2024-11-22 20:20:56.644 UTC [2053] LOG: could not receive data from client: Connection reset by peer
2024-11-22 20:20:56.644 UTC [2051] LOG: could not receive data from client: Connection reset by peer
2024-11-22 20:20:56.644 UTC [1962] LOG: could not receive data from client: Connection reset by peer
2024-11-22 20:28:51.780 UTC [2014] LOG: could not receive data from client: Connection reset by peer
