Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance/scalability issue in Folder::searchByMime (et al) #11244

Closed
paulijar opened this issue Sep 16, 2018 · 5 comments
Closed

Performance/scalability issue in Folder::searchByMime (et al) #11244

paulijar opened this issue Sep 16, 2018 · 5 comments
Labels

Comments

@paulijar
Copy link
Contributor

The execution time of the function Folder::searchByMime() depends on the number of matching files in the whole user storage. That is, the matching files don't have to be within the targeted folder, just having them anywhere in the user storage slows down the search in all folders.

I ran into this when investigating the issue owncloud/music#664. There, the user had some 1.5 million image files (including the generated thumbnails) in the storage and running Folder::searchByMime('image') took 10 minutes even on empty folder!

The reason for this behavior is in the function Folder::searchCommon() in https://github.com/nextcloud/server/blob/master/lib/private/Files/Node/Folder.php#L245. There, the function first fetches all the matching files in the whole storage and only then filters the result set according the target path. So in our example case, 1.5 million rows were retrieved from the DB table oc_filecache, a new CacheEntry object was instantiated for each row, and then all the 1.5 million objects were discarded.

As I see it, the filtering by path could as well happen already when making the SQL query in https://github.com/nextcloud/server/blob/master/lib/private/Files/Cache/Cache.php#L658. That should be at least an order of magnitude faster than filtering afterwards in PHP foreach loop.

This performance issue is in code which is identical in ownCloud and Nextcloud. I have reported the same issue for ownCloud in owncloud/core#32720.

@nextcloud-bot
Copy link
Member

GitMate.io thinks possibly related issues are #9524 (Huge security issue when sharing folder), #4407 (Performance issues e.g. with favourites), #553 (Issues after renaming an already shared folder...), #3021 (performance issue when deleting folder of user with many files), and #3613 (Shared folder download issues).

@MorrisJobke
Copy link
Member

cc @icewind1991 @nextcloud/sharing

@skjnldsv skjnldsv added the 0. Needs triage Pending check for reproducibility or if it fits our roadmap label Jun 12, 2019
@juliusknorr juliusknorr added 1. to develop Accepted and waiting to be taken care of feature: files feature: search and removed 0. Needs triage Pending check for reproducibility or if it fits our roadmap labels Sep 30, 2020
@PVince81
Copy link
Member

PVince81 commented Nov 3, 2022

the search backend has been changed in Nextcloud 23, are you still seeing this issue there @paulijar ?

@paulijar
Copy link
Contributor Author

paulijar commented Nov 13, 2022

It took some time to setup tests for this but now I have some results. The TLDR is that there is indeed significant improvement on this since I reported the issue but the improvement has happened already before NC23.

My test setup was:

  • RaspberryPi 3
  • PHP 7.4
  • SQLite db
  • 32768 text files in various (nested) subfolders, all under one parent folder
  • Nextcloud versions 18.0.1, 22RC1, and 25.0.1

With this setup, I measured the time taken to execute Folder::searchByMime('text') on various sub-folders using microtime function. I also measured the memory allocations made by the function using the memory_get_usage. The whole testing code was executed within an occ command and looked like this:

	protected function execute(InputInterface $input, OutputInterface $output) {
		$homeFolder = $this->rootFolder->getUserFolder('test_user');
		$targetFolder = $homeFolder->get($input->getArgument('folder'));
		$mime = $input->getOption('mime');

		$memBefore = memory_get_usage(true);
		$startTime = microtime(true);
		$files = $targetFolder->searchByMime($mime);
		$endTime = microtime(true);
		$memAfter = memory_get_usage(true);

		$memDelta = $memAfter - $memBefore;
		$fmtMemDelta = Util::formatFileSize($memDelta);

		$count = count($files);
		$elapsed = $endTime - $startTime;
		$output->writeln("Found $count $mime files in $elapsed seconds, memory allocated $fmtMemDelta");
		return 0;
	}

On NC18, I got:

Matches in folder Execution time Memory allocated
0 3.27 s 88 MB
1024 3.27 s 88 MB
16384 4.02 s 94 MB
32768 4.45 s 100 MB

On NC22, I got:

Matches in folder Execution time Memory allocated
0 0.34 s 0 B
1024 0.49 s 4 MB
16384 3.11 s 66 MB
32768 5.44 s 134 MB

On NC25, I got:

Matches in folder Execution time Memory allocated
0 0.22 s 0 B
1024 0.38 s 4 MB
16384 2.70 s 60 MB
32768 5.78 s 122 MB

I also tested Folder::searchByMime('audio') on the same folders where there was no audio files and the storage in general didn't contain more than a few dozens of audio files. Regardless of the Nextcloud version used and the target folder selected, this took around 0.005 seconds.

Conclusions:
The scalability of the searchByMime has got a lot saner somewhere between NC18 and NC22. My best guess is NC20 and commit 671b992. There is still a measurable but not significant improvement in performance from NC22 to NC25.

Still in NC25, the searchByMime is not completely immune to the amount of matching files outside the targeted folder. This can be seen in how in my tests searchByMime('text') took 50 times longer than searchByMime('audio') in case both of them were run on a folder where there were no matches. This effect is not linear, though, as there was around 500 times more text files than audio files in the storage. I'd say that there is a good chance that the execution time now remains tolerable even when there are millions of matching files somewhere within the file storage.

@szaimen
Copy link
Contributor

szaimen commented Jan 9, 2023

Hi, please update to 24.0.8 or better 25.0.2 and report back if it fixes the issue. Thank you!

@szaimen szaimen added needs info 0. Needs triage Pending check for reproducibility or if it fits our roadmap and removed 1. to develop Accepted and waiting to be taken care of labels Jan 9, 2023
@szaimen szaimen closed this as completed Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants