-
If you are looking for high-throughput multi-threaded IO, especially with an NVMe drive, you want to use […]. Also, if you are allocating new buffers, you are likely limited by allocation throughput, in which case you want to use Server GC, although allocating per read is still extremely inefficient.
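To make the allocation point concrete, here is a minimal sketch (my own illustration, not part of the reply above) of renting read buffers from `ArrayPool<byte>` rather than allocating a new array per read. Server GC itself is enabled with `<ServerGarbageCollection>true</ServerGarbageCollection>` in the project file or `"System.GC.Server": true` in runtimeconfig.json; the path and chunk size below are placeholders.

```csharp
using System;
using System.Buffers;
using System.IO;

static class PooledReadExample
{
    // Sketch: read a file in fixed-size chunks without allocating a new
    // byte[] per read, so allocation throughput stops being the bottleneck.
    public static long ReadWithPooledBuffer(string path, int chunkSize = 1 << 20)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(chunkSize);
        long total = 0;
        try
        {
            using FileStream stream = File.OpenRead(path);
            int read;
            while ((read = stream.Read(buffer, 0, chunkSize)) > 0)
            {
                // ... consume buffer[0..read] here ...
                total += read;
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
        return total;
    }
}
```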
-
Think I've got this mostly sorted. Based on buying an additional PCIe 4.0 x4 NVMe to test with and several days' further investigation, I'm currently getting sustained rates of 7.0 GB/s with peaks of 7.2. While this is a few hundred MB/s under the drive's benchmarks and PCIe link capabilities, I'm using application code paths which, unlike benchmarks, need to do things with the data read from the drive and have to generate data to write rather than preloading a buffer of random bytes. Unsurprisingly, there are multiple contributing factors. The ones which seem of interest here are:
Thanks to @teo-tsirpanis for pointing out Win32 last errors have moved to thread local! I'm hesitant to mark this as an answer just yet as much of the information above is provisional and, being centered around just a few test cases I've been focusing on, lacks workload diversity. It also looks like there's probably a bit of thermal throttling influencing the results even with cooldown intervals between tests. The NVMe's thermal contact to the motherboard armor is good, but 7 GB/s is fast enough, and the real-world datasets for large sequential testing large enough, that a typical armor plate's thermal mass starts to saturate. I've ordered alternate heatsinks but they'll take some time to arrive and test.
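For anyone else following along, here's a minimal sketch of what the thread-local last-error behavior looks like in practice. The P/Invoke declaration and path are just for illustration; the point is that with `SetLastError = true` the runtime captures the Win32 error into a per-thread slot right after the call, so reading it isn't subject to interference from other threads doing their own IO.

```csharp
using System;
using System.Runtime.InteropServices;

static class LastErrorDemo
{
    // SetLastError = true tells the runtime to capture GetLastError()
    // immediately after the call and store it per thread.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern IntPtr CreateFileW(
        string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    public static void Main()
    {
        // Deliberately open a file that doesn't exist (placeholder path).
        IntPtr handle = CreateFileW(@"C:\does\not\exist.bin",
            0x80000000 /* GENERIC_READ */, 0, IntPtr.Zero,
            3 /* OPEN_EXISTING */, 0, IntPtr.Zero);

        if (handle == new IntPtr(-1) /* INVALID_HANDLE_VALUE */)
        {
            // Reads the error stashed for *this* thread.
            int error = Marshal.GetLastPInvokeError(); // 2 = ERROR_FILE_NOT_FOUND
            Console.WriteLine($"CreateFileW failed with Win32 error {error}");
        }
    }
}
```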
-
Hi, I'm trying to sort out a code design for performant utilization of NVMe drives and it appears a process using .NET 8 (or maybe an app domain) is limited to somewhere in the range of 2.2–2.7 GB/s, depending on buffer size choices, the details of how `ReadExact()`, `ReadExactAsync()`, or such are called, and what exactly needs to be done with the data that's read. If different implementation choices are made it's easy to drop to 1.0–1.3 GB/s per process, but so far I've only been able to get over 3 GB/s by using multiple processes.

For example, some basic profiling of reads from a current gen PCIe 4.0 x4 DRAMless NVMe capable of 5.2 GB/s yields the table below. This is on Win 10 22H2 with a 16 core processor and DDR4-3200. No hardware constraint is apparent: DRAM bandwidth is ~50 GB/s and never measures above ~26 GB/s, CPU utilization stays under 27%, and the cores are not thermally limited and still have a couple hundred MHz of boost available. Each thread or process is reading different files and thus has exclusive ownership of the file handle, attached `FileStream`, and associated buffers. Visual Studio's performance profiler shows a series of clean reads to each file and the backtraces just show ntdll.dll reading whole files as expected. I'm using object pooling, as seems to be routine to avoid overloading the GC at faster IO rates, and DDR utilization runs steady at a few GB per process.

What I'd hoped to see here is that, given code which does 2.7 GB/s single threaded, putting it on two threads would saturate the drive at 5.3 GB/s. If I replicate CrystalDiskMark/DiskSpd benchmark-type IO where data's only read from the drive without doing anything with it, a single .NET 8 thread highmarks around 4.6 GB/s at the drive's current level of fullness. DiskSpd manages 4.3 GB/s single threaded, so .NET's clearly competent at synthetics, but it'd be pretty useful if one could also, say, copy the data being read into an array at that speed.
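To make the test setup concrete, the per-thread read loop is essentially the following. This is a simplified sketch: file paths, sizes, thread count, and the step that consumes the data are all placeholders for the real workload.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class ThroughputSketch
{
    // Each worker owns its own FileStream and pooled buffer and reads a
    // different file end to end; aggregate GB/s is reported at the end.
    public static void Measure(IReadOnlyList<string> files, int threads, int readSize = 8 * 1024)
    {
        long totalBytes = 0;
        var stopwatch = Stopwatch.StartNew();

        Parallel.ForEach(files,
            new ParallelOptions { MaxDegreeOfParallelism = threads },
            path =>
            {
                byte[] buffer = ArrayPool<byte>.Shared.Rent(readSize);
                try
                {
                    using var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                        FileShare.Read, bufferSize: 2 * 1024 * 1024, FileOptions.SequentialScan);
                    long local = 0;
                    int read;
                    while ((read = stream.Read(buffer, 0, readSize)) > 0)
                    {
                        // Placeholder for actually doing something with the
                        // data, e.g. copying it into a destination array.
                        local += read;
                    }
                    Interlocked.Add(ref totalBytes, local);
                }
                finally
                {
                    ArrayPool<byte>.Shared.Return(buffer);
                }
            });

        stopwatch.Stop();
        Console.WriteLine($"{totalBytes / stopwatch.Elapsed.TotalSeconds / 1e9:F2} GB/s aggregate");
    }
}
```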
Testing shows a second layer of buffering is helpful, as being chatty with `FileStream.Read*()` both drops maximum throughput and causes performance degradation in multithreaded cases (a simplified sketch of this double buffering is at the end of this post). For example, if one thread calls `ReadExact()` to get a few hundred bytes at a time with 2.2 GB/s total throughput, then two threads doing that can drop total throughput to 1.1 GB/s. I've found there's little to no performance gain in giving each `FileStream` more than a 2 MB buffer or in asking for more than about 8 kB per synchronous `Read*()` call. Pushing the queue depth nets maybe another 100 MB/s with asynchronous reads in the range of 256 kB to 2 MB but does not change the basic behavior.

So it looks to me like there's some sort of process-level IO contention among `Read*()` calls that's limiting practical (as opposed to benchmark) throughput, either just in Win 10's IO components, just in .NET 8, or in their interaction. Considering all the IO work that went into .NET 6, I thought I'd inquire here to see if anyone has ideas as to what the constraint might be and how to code around it within a single process. Insights appreciated, as I'm not having luck with further searches and the next obvious thing to try appears to be reimplementing `FileStream`.
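For concreteness, here's roughly what I mean by the second layer of buffering: a staging block filled with large `FileStream.Read()` calls and then handed out in small slices, so the underlying stream is only touched every couple of MB. This is a simplified, synchronous sketch; the class name and block size are just placeholders.

```csharp
using System;
using System.Buffers;
using System.IO;

// Second-layer buffering sketch: large reads from the FileStream, small
// slices handed to the caller from the staged block.
public sealed class BlockBufferedReader : IDisposable
{
    private readonly FileStream _stream;
    private readonly byte[] _block;
    private int _filled;
    private int _position;

    public BlockBufferedReader(FileStream stream, int blockSize = 2 * 1024 * 1024)
    {
        _stream = stream;
        _block = ArrayPool<byte>.Shared.Rent(blockSize);
    }

    // Copies up to destination.Length bytes into destination, refilling the
    // staged block from the underlying stream as needed. Returns bytes copied.
    public int Read(Span<byte> destination)
    {
        int copied = 0;
        while (copied < destination.Length)
        {
            if (_position == _filled)
            {
                _filled = _stream.Read(_block, 0, _block.Length);
                _position = 0;
                if (_filled == 0) break; // end of file
            }

            int chunk = Math.Min(destination.Length - copied, _filled - _position);
            _block.AsSpan(_position, chunk).CopyTo(destination.Slice(copied));
            _position += chunk;
            copied += chunk;
        }
        return copied;
    }

    public void Dispose() => ArrayPool<byte>.Shared.Return(_block);
}
```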