how to scale FileStream reads above 2 GB/s outside of benchmarks? #104563

twest820 · 2024-07-08T17:31:07Z

twest820
Jul 8, 2024

Hi, I'm trying to sort out a code design for performant utilization of NVMe drives and it appears there's a process using .NET 8 (or maybe an app domain) is limited to somewhere in the range of 2.2–2.7 GB/s depending on buffer size choices, details of ReadExact(), ReadExactAsync(), or such are called, and what exactly needs to be done with the data that's read. If different implementation choices are made it's easy to drop to 1.0–1.3 GB/s per process but so far I've only been able to get over 3 GB/s by using multiple processes.

For example, some basic profiling of reads from a current gen PCIe 4.0 x4 DRAMless NVMe capable of 5.2 GB/s yields the table below. This is on Win 10 22H2 with a 16 core processor and DDR4-3200. No hardware constraint is apparent. DRAM bandwidth's ~50 GB/s and never measures above ~26 GB/s. CPU utilization stays under 27%, the cores are not thermally limited, and have a couple hundred MHz of boost still available. Each thread or process is reading different files and thus has exclusive ownership of the file handle, attached FileStream, and associated buffers. Visual Studio's performance profiler shows a series of clean reads to each file and the backtraces just show ntdll.dll reading whole files as expected. I'm using object pooling, as seems to be routine to avoid overloading the GC at faster IO rates, and DDR utilization runs steady at a few GB per process.

processes	threads per process	sync or async?	FileStream buffer	throughput, GB/s
1	1	sync	FILE_FLAG_NO_BUFFERING	1.8
1	1	either	4 MB	2.7
1	2	either	4 MB	2.7
1	3	either	4 MB	2.7
1	4	either	4 MB	2.6
2	1	either	4 MB	3.4
3	1	either	4 MB	4.3
4	1	either	4 MB	4.5

What I'd hoped to see here is, given code that does 2.7 GB/s single threaded, is that putting on two threads would saturate the drive at 5.3 GB/s. If I replicate CrystalDiskMark/DiskSpd benchmark type IO where data's only read from the drive without doing anything with it, a single .NET 8 thread highmarks around 4.6 GB/s at the drive's current level of fullness. DiskSpd manages 4.3 GB/s single threaded, so .NET's clearly competent at synthetics, but it'd be pretty useful if one could also, say, copy the data being read into an array at that speed.

Testing shows a second layer of buffering is helpful as being chatty with FileStream.Read*() both drops maximum throughput and also causes performance degradation in multithreaded cases. For example, if one thread calls ReadExact() to get a few hundred bytes at time with 2.2 GB/s total throughput, then two threads doing that can drop total throughput to 1.1 GB/s. I've found there's little to no performance gain in giving each FileStream more than a 2 MB buffer and asking for more than about 8 kB per synchronous Read*() call. Pushing the queue depth nets maybe another 100 MB/s with asynchronous reads in the range of 256 kB to 2 MB but does not change the basic behavior.

So it looks to me like there's some sort of process-level IO contention among Read*() calls that's limiting practical (as opposed to benchmark) throughput, either just in Win 10's IO components, just in .NET 8, or in their interaction. Considering all the IO work that went into .NET 6, I thought I'd inquire here to see if anyone has ideas as to what the constraint might be and how to code around it within a single process. Insights appreciated as I'm not having luck with further searches and the next obvious thing to try appears to be reimplementing FileStream.

neon-sunset · 2024-07-08T22:17:39Z

neon-sunset
Jul 8, 2024

If you are looking for high throughput multi-threaded IO, especially with an NVME drive, you want to use RandomAccess and SafeFileHandle directly instead. They are, generally, easy to use. There is quite a bit of overhead to FileStream in this scenario (and it's not designed to be used from multiple threads at the same time), because it is optimized around the common use case which isn't reading data at 4GiB/s. Also note that NVME can be scaling with request queue depth, so there are multiple potential bottlenecks at play.

Also, if you are allocating new buffers, you are likely limited by allocation throughput, in which case, although that's still extremely inefficient, you want to use Server GC.

8 replies

twest820 Jul 9, 2024
Author

Do you have any strong evidence about it?

Callers must honor the number of bytes RandomAccess.ReadAsync() reads in order to process file contents correctly. As it's a reasonable assumption ReadAsync() doesn't overrun, it's likely calling code will raise ArgumentOutOfRangeException trying to access the bytes returned past the end of the buffer. When the debugger breaks on that you look at the Task<int>'s result and the Memory<byte>'s and byte[]'s lengths.

Haven't had time to code review but, as it sounds like you're not seeing anything, it might be worth considering if dynamic PGO's confusing the byte[] with the length of the Memory<byte> mapped onto it.

huoyaoyuan Jul 9, 2024
Collaborator

it's likely calling code will raise ArgumentOutOfRangeException trying to access the bytes returned past the end of the buffer. When the debugger breaks on that you look at the Task<int>'s result and the Memory<byte>'s and byte[]'s lengths.

It's very serious accuse about reliability and security. Do you have full code snippet to reproduce this? I'm never aware of how this can happen.

Also: FileStream calls RandomAccess internally.. It can never be safer than RandomAccess.

teo-tsirpanis Jul 9, 2024
Collaborator

In earlier .NET versions I've seen p/invoke put a lock around calls marked SetLastError = true, such as the underlying ReadFile()s here, presumably to ensure integrity of the Win32 error code.

There is no need for such locking, the Win32 last error is documented to be thread-local.

twest820 Jul 14, 2024
Author

I'm never aware of how this can happen.

The setup's been detailed. So you should have everything needed to investigate if you feel a process reading data from a file it has permissions to into a buffer its allocated is a security issue. Based on my experience as a Microsoft contracted .NET pen tester it wouldn't classify, though we'd routinely bug similar collateral issues. Ones like this would tend to get follow up as they'd reasonably make people nervous.

huoyaoyuan Jul 15, 2024
Collaborator

The setup's been detailed.

Is there any code we can directly execute and test on our machine?

Reliability issue are often related to the exact code shape. Misuses can often cause false positives of problem reports. That's why we want exact code.

In case there are misuses in user code, we are also happy to point out and correct them so that users can get more reliability.

twest820 · 2024-07-14T18:01:03Z

twest820
Jul 14, 2024
Author

Think I've got this mostly sorted. Based on buying an additional PCIe 4.0 x4 NVMe to test with and several days' further investigation I'm currently getting sustained rates of 7.0 GB/s with peaks of 7.2. While this is a few hundred MB/s under the drive's benchmarks and PCIe link abilities I'm using application code paths which, unlike benchmarks, need to do things with the data read from the drive and have to generate data to write rather than preloading a buffer of random bytes.

Unsurprisingly, there's multiple contributing factors. The ones which seem of interest here are

Measurements in the OP are low because the code I was working with had like five, well, not really bugs but interacting limitations which resulted in fully multithreaded operation not being as concurrent as it could be. Fixing those provided a moderate unlock, with doubling threads increasing transfer rates by 1.9–1.0x (so potentially no improvement), declining with overall demand. For the 2.7 GB/s example mentioned in the OP I'm not seeing more than ~1.6x, meaning doubling threads goes to 4.3 GB/s instead of the hoped for 5.4 GB/s.
I found some low level optimizations in object pooling and code gen. Don't think there's useful general guidance here beyond the usual need for good understanding of implementation details and review of disassembly. Observed increments have been modest but picking up like 100 MB/s around one line of code and maybe 200 MB/s from another does add up to a useful boost.
Different NVMes show dramatically different demands on DDR bandwidth. With only two drives to test with there's not a way I can make strong statements. But it appears plausible that, while quite performant, current gen HMB (DRAMless) drives do not scale as well in application use as they do in benchmarks. On drive DRAM may thus offer substantial benefits at higher transfer rates, which might be partially or mostly an artifact of Windows device driver, file system, or cache implementations rather than actual hardware differences. It'd take a lot of money and time I don't currently have to dig into this and really understand what's going on. Plus there'd likely be little user actionability of the findings.
Much information on IO pattern performance seems to lack context and thus contain undocumented assumptions. While that's probably inevitable to some extent, what I've found most useful is to set up multiple implementations of workload kernels and then profile across all of them. What I'm seeing for the large sequential transfers, predominantly reads, under test here is asynchronous buffered is slowest (queue depth 2) with synchronous buffered and then synchronous unbuffered ultimately fastest. This ranking matches the three IO patterns' DDR bandwidth requirements. Synchronous is viable for the app conditions under test as the number of threads doesn't exceed the number of processor cores, making the cost of blocking a thread on IO unimportant in the particular set of use cases under test. I think it's unlikely this would hold for higher latency transfers (network operations, maybe larger hard drive RAIDs) or large numbers of small transfers (e.g. SQL IOPS style scenarios).
Currently what I'm getting is synchronous buffered can be faster than unbuffered at lower transfer rates, transitioning to unbuffered being faster at higher rates. It looks like the tradeoff is motivated by DDR pressure, with synchronous' expenditure of additional DDR bandwidth increasing IO throughput when DDR contention is low but becoming disadvantageous as total DDR demand approaches memory channel capability. Where the transition occurs likely thus depends on hardware capabilities, application access patterns, buffer sizes, read sizes, and the extent to which application buffers are used to amortize the cost of calls into FileStream or related classes. FWIW, I'm seeing changeover around 3.5 GB/s.

Thanks to @teo-tsirpanis for pointing out Win32 last errors have moved to thread local!

I'm hesitant to mark this as an answer just yet as much of the information above is provisional and, being centered around just a few test cases I've been focusing on, lacks workload diversity. Also it looks like there's probably a bit of thermal throttling influencing the results even with cooldown intervals between tests. NVMe thermal contact to the motherboard armor is good but 7 GB/s is fast enough the real world datasets for large sequential testing are large enough typical armor plates' thermal mass starts to saturate. I've ordered alternate heatsinks but they'll also be some time to arrive and test.

2 replies

neon-sunset Jul 14, 2024

Note very different performance characteristics and edge cases between Windows and POSIX IO stacks. Specifically previously mentioned SafeHandle, depending on the instantiation flags will either perform synchronous IO calls or overlapped IO on windows, with one pattern of interaction with threadpool, while on Linux and macOS this performs in a very different manner especially large chunks of data vs many small files.

In any case, thank you for posting your notes, they were interesting to read! Consider making it into a blog post, I'm sure many would be interested in looking at pretty graphs and perhaps some OEM will get well-deserved slack for their HMB implementation :)

p.s.: please, do not use FileStream for this haha, your choices are still either RandomAccess, manual IO calls with P/Invoke (it's relatively minor difference), I believe NVME drives have also a flavour of direct IO API which enables full userspace IO access bypassing the kernel and its IO stack, and filesystem overhead.

twest820 Jul 15, 2024
Author

Excellent point. The operating system agnostic paths here use buffered IO, whether sync or async, and least on Windows the most I'm seeing on those code paths is 4.6 GB/s due to additional DDR demand of about 12 GB/s. A general implementation would, I suspect, want to look at the planned IO and make operating system aware choices of threading and IO strategy. Not set up for Mac testing, though, and Linux is currently low priority.

As already noted, RandomAccess.ReadAsync() profiles slower than FileStream.ReadAsync(). This also holds for RandomAccess.Read() versus FileStream.Read() and ReadExact(). DirectStorage addresses similar DDR bandwidth concerns but, as a GPU's not a destination of interest here, doesn't appear obviously relevant. P/invokes I've not tested but, as you point out, opportunity appears slight as FileStream's thin and well amortized here.

Don't have a strong answer at the moment as to why FileStream shows measurable advantage over RandomAccess. Best guess from perf monitoring is PGO finds slightly better (1–1.5%) optimizations, which may well be more than p/invoke can recover.

Also don't have anything which suggests against HMB below 2–3 GB/s. Frequent faster use is pretty niche. But I wouldn't be doing this exercise if I wasn't writing for that niche.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how to scale FileStream reads above 2 GB/s outside of benchmarks? #104563

{{title}}

Replies: 2 comments 10 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

how to scale FileStream reads above 2 GB/s outside of benchmarks? #104563

twest820 Jul 8, 2024

Replies: 2 comments · 10 replies

neon-sunset Jul 8, 2024

twest820 Jul 9, 2024 Author

huoyaoyuan Jul 9, 2024 Collaborator

teo-tsirpanis Jul 9, 2024 Collaborator

twest820 Jul 14, 2024 Author

huoyaoyuan Jul 15, 2024 Collaborator

twest820 Jul 14, 2024 Author

neon-sunset Jul 14, 2024

twest820 Jul 15, 2024 Author

twest820
Jul 8, 2024

Replies: 2 comments 10 replies

neon-sunset
Jul 8, 2024

twest820 Jul 9, 2024
Author

huoyaoyuan Jul 9, 2024
Collaborator

teo-tsirpanis Jul 9, 2024
Collaborator

twest820 Jul 14, 2024
Author

huoyaoyuan Jul 15, 2024
Collaborator

twest820
Jul 14, 2024
Author

twest820 Jul 15, 2024
Author