
ML PI-26: timeout handler possible deadlock issue #4021

Open
elnyry-sam-k opened this issue Dec 3, 2024 · 1 comment
Labels
  • bug: Something isn't working or it has wrong behavior on a Mojaloop Core service
  • oss-core: This is an issue - story or epic related to a feature on a Mojaloop core service or related to it

Comments

@elnyry-sam-k
Member

elnyry-sam-k commented Dec 3, 2024

Summary:
During performance testing by one of the community members, a deadlock in the timeout handler was encountered.

Severity:
Low

Priority:
Critical

Expected Behavior
No deadlocks in the timeout handler: transfers are processed as expected, and the service degrades gracefully when the load is higher than the switch's settings and resources can accept.

Steps to Reproduce

  1. Set up a switch with Mojaloop v16
  2. Apply a consistent load of about 200 TPS, with simulators and handlers scaled accordingly
  3. Sample logs below:

2024-11-29 11:17:30 2024-11-29T09:17:30.361Z - error: "uncaughtException: insert into transferTimeout (transferId, expirationDate) select t.transferId, t.expirationDate from transfer as t inner join (select transferId, max(transferStateChangeId) as maxTransferStateChangeId from transferStateChange where transferStateChangeId > 508300 and transferStateChangeId <= 508776 group by transferId) as ts on ts.transferId = t.transferId inner join transferStateChange as tsc on tsc.transferStateChangeId = ts.maxTransferStateChangeId left join transferTimeout as tt on tt.transferId = t.transferId where tt.transferId is null and tsc.transferStateId in ('RECEIVED_PREPARE', 'RESERVED') - ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
FSPIOPError: insert into transferTimeout (transferId, expirationDate) select t.transferId, t.expirationDate from transfer as t inner join (select transferId, max(transferStateChangeId) as maxTransferStateChangeId from transferStateChange where transferStateChangeId > 508300 and transferStateChangeId <= 508776 group by transferId) as ts on ts.transferId = t.transferId inner join transferStateChange as tsc on tsc.transferStateChangeId = ts.maxTransferStateChangeId left join transferTimeout as tt on tt.transferId = t.transferId where tt.transferId is null and tsc.transferStateId in ('RECEIVED_PREPARE', 'RESERVED') - ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
    at createFSPIOPError (/opt/app/node_modules/@mojaloop/central-services-error-handling/src/factory.js:198:12)
    at Object.reformatFSPIOPError (/opt/app/node_modules/@mojaloop/central-services-error-handling/src/factory.js:333:12)
    at /opt/app/src/models/transfer/facade.js:781:36
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Error: ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
    at Sequence._packetToError (/opt/app/node_modules/mysql/lib/protocol/sequences/Sequence.js:47:14)
    at Query.ErrorPacket (/opt/app/node_modules/mysql/lib/protocol/sequences/Query.js:79:18)
    at Protocol._parsePacket (/opt/app/node_modules/mysql/lib/protocol/Protocol.js:291:23)
    at Parser._parsePacket (/opt/app/node_modules/mysql/lib/protocol/Parser.js:433:10)
    at Parser.write (/opt/app/node_modules/mysql/lib/protocol/Parser.js:43:10)
    at Protocol.write (/opt/app/node_modules/mysql/lib/protocol/Protocol.js:38:16)
    at Socket.<anonymous> (/opt/app/node_modules/mysql/lib/Connection.js:88:28)
    at Socket.<anonymous> (/opt/app/node_modules/mysql/lib/Connection.js:526:10)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:324:12)
    --------------------
    at Protocol._enqueue (/opt/app/node_modules/mysql/lib/protocol/Protocol.js:144:48)
    at Connection.query (/opt/app/node_modules/mysql/lib/Connection.js:198:25)
    at /opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/dialects/mysql/index.js:132:18
    at new Promise (<anonymous>)
    at Client_MySQL._query (/opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/dialects/mysql/index.js:126:12)
    at executeQuery (/opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/execution/internal/query-executioner.js:37:17)
    at Client_MySQL.query (/opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/client.js:154:12)
    at /opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/execution/transaction.js:371:24
    at new Promise (<anonymous>)
    at trxClient.query (/opt/app/node_modules/@mojaloop/database-lib/node_modules/knex/lib/execution/transaction.js:366:12)"
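For context, the error itself points at the generic mitigation ("try restarting transaction"): retry the transaction a bounded number of times when MySQL reports ER_LOCK_DEADLOCK. The following is a sketch only, not the central-ledger code; the retry count and backoff values are illustrative:

    // Sketch only: retry a knex transaction when MySQL reports a lock deadlock.
    // `work` is the transaction callback; retry count and backoff are arbitrary.
    async function withDeadlockRetry (knex, work, retries = 3) {
      for (let attempt = 1; ; attempt++) {
        try {
          return await knex.transaction(work)
        } catch (err) {
          if (err.code !== 'ER_LOCK_DEADLOCK' || attempt >= retries) throw err
          await new Promise(resolve => setTimeout(resolve, 50 * attempt)) // brief backoff before retrying
        }
      }
    }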

Specifications

  • Component (if known): Central ledger
  • Version: v16 of Mojaloop
  • Platform: on-prem
  • Subsystem: timeout handler
  • Type of testing: Performance
  • Bug found/raised by: Bonaparte and team, @elnyry-sam-k

Notes:

  • Severity when opened: Low
  • Priority when opened: Critical
@kalinkrustev

Here are some related observations from running the performance tests:

  • The timeout handler is called by a module named cron, with a default schedule of every 15 seconds. cron does not appear to be aware that this handler is an asynchronous function, and it does not await it. The example below illustrates this: running it logs every 15 seconds, even though the promise never resolves:

    // 'tick' is still logged every 15 seconds, even though the promise returned by onTick never resolves.
    require('cron').CronJob.from({
      cronTime: '*/15 * * * * *',
      async onTick () { console.log('tick'); await new Promise(() => {}) },
      start: true
    })

    Also, if the onTick handler returns a rejected promise, the app will terminate, as tested with the following script:

    // The rejection from onTick is not handled by cron, so the process terminates with an unhandled rejection.
    require('cron').CronJob.from({
      cronTime: '*/15 * * * * *',
      async onTick () { await Promise.reject(new Error('error')) },
      start: true
    })

    All this means that the timeout handler may be invoked again before the previous invocation has completed. So if the database is experiencing higher load and cannot complete all the queries included in the handler, it receives additional load every 15 seconds. The handler runs at least 4 queries that can potentially be slow; they are very similar and take a similar time to complete, so roughly 4 seconds of execution time per query is enough for a run to overshoot the 15-second interval and overlap with the next one. The 4 seconds can be hit if the DB is already loaded by something else. It is also likely that the deadlock is caused by invoking the same handler several times without waiting for it to complete. A re-entrancy guard around the handler would avoid the overlap (see the sketch below).
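
    A minimal sketch of such a guard, assuming an illustrative timeoutHandler placeholder for the actual handler function:

    // Sketch only: skip a tick while the previous run is still in progress, and
    // catch rejections so they cannot become unhandled and terminate the process.
    let running = false
    require('cron').CronJob.from({
      cronTime: '*/15 * * * * *',
      async onTick () {
        if (running) return // previous run not finished yet, skip this tick
        running = true
        try {
          await timeoutHandler() // placeholder for the actual timeout handler
        } catch (err) {
          console.error('timeout handler failed', err)
        } finally {
          running = false
        }
      },
      start: true
    })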

  • The slowness of the queries is probably related to the large number of rows in the transferTimeout table. This is likely related to the fact that I am running tests only for transfer prepare, so presumably all transfers time out, accumulate in that table, and increase the likelihood of slow queries and a deadlock. Something similar could probably happen in production if one or several participants have issues fulfilling transfers. A quick way to check the accumulation is sketched below.
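
    The following is a sketch only; knex here stands for an existing connection to the central-ledger database:

    // Sketch: count how many rows have accumulated in transferTimeout
    async function countTransferTimeouts (knex) {
      const { n } = await knex('transferTimeout').count({ n: '*' }).first()
      console.log('transferTimeout rows:', n)
      return n
    }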

  • When running the performance tests for transfer prepare, the deadlocks occur very close to periods where transfer processing appears to freeze or become very slow.
