When using the CPU provider, it is possible to run multiple inferences in parallel within the same session by calling the `Run` method concurrently from different threads. However, when attempting the same with the CUDA provider, performance degrades severely, and it appears that ONNX Runtime serializes the simultaneous calls to `Run`. Is this the case? If so, what can I do to run multiple concurrent queries efficiently in a single session with the CUDA backend? My workload involves running many inferences in parallel with tiny batch sizes and small models.
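
For reference, here is a minimal sketch of the pattern I am describing: one shared `Ort::Session` with several threads calling `Run` concurrently. The model path `model.onnx` and the tensor names `input`/`output` are placeholders for my actual model; the shapes are illustrative only.

```cpp
// Sketch: concurrent Run() calls on a single shared session.
// With the CPU provider this scales across threads; with the CUDA
// provider (uncomment below) the same code appears to serialize.
#include <onnxruntime_cxx_api.h>
#include <array>
#include <thread>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "concurrent-run");
    Ort::SessionOptions opts;
    // To reproduce the CUDA case:
    // OrtCUDAProviderOptions cuda_opts{};
    // opts.AppendExecutionProvider_CUDA(cuda_opts);
    Ort::Session session(env, "model.onnx", opts);  // placeholder model path

    auto worker = [&session]() {
        Ort::MemoryInfo mem =
            Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
        std::array<float, 4> input_data{};            // tiny batch, small model
        std::array<int64_t, 2> shape{1, 4};           // illustrative shape
        Ort::Value input = Ort::Value::CreateTensor<float>(
            mem, input_data.data(), input_data.size(),
            shape.data(), shape.size());
        const char* input_names[]  = {"input"};       // placeholder names
        const char* output_names[] = {"output"};
        for (int i = 0; i < 100; ++i) {
            // Run() is documented as thread-safe on a shared session.
            auto outputs = session.Run(Ort::RunOptions{nullptr},
                                       input_names, &input, 1,
                                       output_names, 1);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```

With the CPU provider, the four workers above run their inferences in parallel; with the CUDA provider enabled, throughput is roughly what a single thread achieves, which is what suggests serialization to me.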