traverse folders in php to search in shared files #10048

butonic · 2014-07-30T11:54:21Z

makeshift fix for #10009

@karlitschek @DeepDiver1975 @craigpg needed for 7.0.1

Why makeshift, you ask? The solution uses the same directory traversing as searchByMime ... which is a performance nightmare and ugly. But it works.

@icewind1991 @PVince81 @schiesbn We definitely need to come up with a better solution... I still vote for a n:m relation table between share and files. Yes it takes longer to share hundreds of files ... but it is a much cleaner solution and allows us to use joins and let the db do it's magic.

karlitschek · 2014-07-30T11:58:17Z

@schiesbn What is your opinion?

schiessle · 2014-07-30T12:58:55Z

@butonic do you mean that a share table entry should not only point to the shared folder but to every file below the folder? I would really avoid to go into this direction. This would not only make the initial share operation more expensive but the share table would also need to track all changes to the file system. This would duplicate (large) parts of the file cache and would add additional costs to every file system operation which adds/removes a file.

Currently the share table gives you the starting point at the file cache table and from there you can operate on the file cache table which provides every information needed and keeps track of all changes. I don't see the need to duplicate it.

ghost · 2014-07-30T13:02:59Z

🚀 Test Passed. 🚀
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6477/

schiessle · 2014-07-30T13:04:22Z

Regarding your changes, the code looks good to me. Although I never used the search() method by myself. I have no idea how such a "pattern" can look like and whether your changes (for example trim($pattern,'%');) works for all pattern provided to this function.

Would be great if you could add some unit tests to it, thanks!

butonic · 2014-07-30T13:17:11Z

The pattern is another thing that is broken. Currently OC\Files\View encloses the search term in % to make the pattern usable in a SQL LIKE statement. But not all storage backends might use SQL to search for files. Files_external could query amazon s3 or swift and won't need that kind of pattern.

The search method itself is only directly called from the tests and OC\Files\View https://github.com/owncloud/core/blob/master/lib/private/files/view.php#L1102 / https://github.com/owncloud/core/blob/master/lib/private/files/view.php#L1119

schiessle · 2014-07-30T13:26:05Z

The pattern is another thing that is broken. Currently OC\Files\View encloses the search term in % to make the pattern usable in a SQL LIKE statement. But not all storage backends might use SQL to search for files. Files_external could query amazon s3 or swift and won't need that kind of pattern.

I think that's just a syntax question right? We need to come up with a syntax for this pattern. Ideally this syntax works for all storage back-ends so that the pattern can be used right away. I believe this was true before we supported object storage. If such a syntax exists we can use it. If not we need to make sure that we translate it, if we call a a storage which expects a different syntax.

The search method itself is only directly called from the tests

But this tests will never call the search method implemented by the Shared_Cache right? Would be good to have a unit test for it too.

butonic · 2014-07-30T19:47:37Z

The current implementation does not scale because searchByMime() as well as search() uses dozens of sql queries when walking the directory tree: getFolderContents first looks up the source cache via the shares table and that then looks up the files in the filecache. That is at least two queries per folder from what I can tell. Have a look at the following query log: http://pastebin.com/6V0M2NAg

And I already stripped it of completely unneeded duplicate queries for user group and what not. But that is a different issue.

What I mean is having the current filecache and share tables and an additional relation table called 'oc_shares_files' having the fileid and the shareid as foreign keys:

CREATE TABLE `oc_shares_files` (`shareid` int NOT NULL , `fileid` int NOT NULL , PRIMARY KEY (`shareid`, `fileid`));

-- We can then join the share table via the relations to search files
SELECT "sf"."shareid", "fc"."fileid"
FROM "oc_filecache" "fc"
JOIN "oc_share_files" "sf" ON "sf"."fileid"="fc"."fileid"
JOIN "oc_share" "s" ON "sf"."shareid"="s"."id"
 WHERE "fc"."name" LIKE ?
AND "s"."share_with" = ?

pass in the pattern and the user searching through shared files. The query could be optimized to also show files owned by the user.

Maintaining the relations is not that hard:

-- add all when a new share is created:
INSERT INTO  `oc_shares_files` (`shareid`,`fileid`) SELECT ?, `fileid` FROM `oc_filecache` WHERE `path` LIKE ?

with the new share id and the file path appended with '%' to get all children of a folder. All this happens inside the database. A quick test on an oracle db that had to insert 7736 rows took less than a second. Honestly, this is what RDBMS are made for. On my productivy mysql machine I did:

SELECT count(*) FROM `oc_filecache` WHERE `path` LIKE 'files%';
+----------+
| count(*) |
+----------+
|    21750 |
+----------+
1 row in set (0.19 sec)

INSERT INTO  `oc_shares_files` (`shareid`,`fileid`) SELECT 1, `fileid` FROM `oc_filecache` WHERE `path` LIKE 'files%';
Query OK, 21750 rows affected (0.07 sec)
Records: 21750  Duplicates: 0  Warnings: 0

so ... IMHO performance is clearly not an issue here.

Other operations can now also be offloaded to the database:

-- on a create hook add the file to the shares_files relation by looking up the parent in the share_files table
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT `sf`.`shareid`, `fc`.`fileid` FROM `oc_filecache` `fc` JOIN `oc_share_files` `sf` ON `sf`.`fileid`=`fc`.`parent` WHERE `fc`.`fileid` = ?

-- on a delete hook remove the file or all files of a share from the shares_files relation
-- when deleting a file (or folder)
DELETE FROM `oc_shares_files` WHERE `fileid` IN (SELECT `fileid` FROM `oc_filecache` WHERE `path` LIKE ? )
-- when deleting a share
DELETE FROM `oc_shares_files` WHERE `shareid` = ?

If we were using foreign keys we could even use the cascade option to remove entries from the relation table.

Migration to a relations table is easy because it only adds the relation table.

It is basically a speed tradeoff where we maintain the relations of the shared files in the database to speed up all queries related to file shares, because we can let the database handle permissions and don't have to go through php and the connection.

Looking at [nested foreach loops that are needed to update the etags](https://github.com/owncloud/core/blob/master/apps/files_sharing/lib/updater.php#L72 for all users with access to a shared file) (which kills sync performance btw) my experience tells me that this can also be improved. Yes we have to recursively walk up the tree as well but we can do that for all users in parallel.

schiessle · 2014-07-31T08:15:37Z

@butonic thanks for the detailed explanation. I'm not sure if it is that easy in all circumstances, e.g. your initial insert if a share was created only looks for the path, but this way you would also insert files from different users who might have a similar path. Also we could not just add the storage ID as a second parameter because the storage ID could change if the shared folder contains a sub-folder which is a external mount. This could probably be solved, but I assume that some extra queries would be involved.

I didn't thought about it in detail, this are just some thoughts coming to my mind while reading it. We would also have to check the other queries if they work in all circumstances. I also wonder if this is a sharing-only issue. Does the searchByMime() method for the local storage is more efficient? I would prefer to optimize this method, if needed. This way all other storage could re-use it. As said, as soon as we know the real path and user of a shared folder, all operation can happen on the file cache without any special code regarding sharing. So if the file cache operation is fast then the same operation for shared files should be fast too.

For now, let's fix it they way you did it. I would only like to have some unit tests for it. 😉
Maybe you can copy&paste your last post to a separate issue so that we can re-consider it during oc8 development where we will add some changes/improvements to sharing anyway.

butonic · 2014-07-31T09:06:23Z

@schiesbn the SQL was meant to show that inserting the relations is done with one query, limiting the query to only one user (AND user = ?) ony shrinks the result set and has a neglectable performance impact.

A move operation will become more complicated because we need to update the relation as well as the etags. But that is only one query in addition to all the etag updates (which have to traverse up the tree for source and target).

searchByMime() uses the same traversal algorithm, which makes it equally slow. Yet another reason why the gallery will take some time to show images. Both search methods need to walk the tree or we need to come up with a smarter algorithm.

I'll write the tests and create a new issue for this.

schiessle · 2014-07-31T09:23:22Z

the SQL was meant to show that inserting the relations is done with one query, limiting the query to only one user (AND user = ?) ony shrinks the result set and has a neglectable performance impact.

Did you had a look at the file cache, table? There is no 'user'. Unfortunately it is not that easy to say which file cache entry belongs to which user. The only reference is the storage ID, resolving this to the user seems to be not that easy for this use case. First you would need to find out which storage IDs belongs to which user. Another problem could be that the path is relative to the storage. So even if you know all storage IDs of a given user it could happen that you get false-positives. All solvable, but probably not as easy as it seems at the first moment and it is not sure how much queries we would really need at the end. But let's discuss it in the new issue.

butonic · 2014-07-31T10:07:57Z

@schiesbn thx for making me write unit tests, found and fixed two bugs.

ghost · 2014-07-31T10:40:19Z

💣 Test Failed. 💣
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6494/

icewind1991 · 2014-07-31T12:12:07Z

Instead of mapping shareid -> fileids it would make more sense to keep a fileid -> fileid mapping of the three, something like

CREATE TABLE oc_files_three (parentint NOT NULL ,child int NOT NULL, , PRIMARY KEY (parent, child));

Then for searching the shared cache can map the shareid to the parent and create a join with the filecache from there.

Additionally this table can be used for moving and deleting within the cache and searching inside a folder

butonic · 2014-07-31T14:39:48Z

@icewind1991 can you elaborate on that in #10077

butonic · 2014-07-31T14:57:51Z

@icewind1991 @schiesbn can I get a 👍 for this to make search in shared files work at least until we come up with a better solution in #10077 ?

schiessle · 2014-07-31T15:00:46Z

👍 under the assumption that Jenkins will finish successfully

scrutinizer-notifier · 2014-07-31T15:05:33Z

The inspection completed: 4 new issues, 3 updated code elements

ghost · 2014-07-31T15:38:00Z

🚀 Test Passed. 🚀
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6500/

icewind1991 · 2014-07-31T16:10:19Z

👍

traverse folders in php to search in shared files

butonic · 2014-07-31T16:28:25Z

@karlitschek @craigpg backport to 7.0.1? is easy and contained.

karlitschek · 2014-07-31T16:30:39Z

yes please. but not before monday when 7.0.1 is out

butonic added this to the ownCloud 7.0.1 milestone Jul 30, 2014

butonic added 5 - To review labels Jul 30, 2014

traverse folders in php to search in shared files

a7b4cd7

add test for search() in shared cache

1af436c

butonic added a commit that referenced this pull request Jul 31, 2014

Merge pull request #10048 from owncloud/fix_search_in_shared_files

dc61804

traverse folders in php to search in shared files

butonic merged commit dc61804 into master Jul 31, 2014

butonic deleted the fix_search_in_shared_files branch July 31, 2014 16:27

butonic added 7 - To release and removed 5 - To review labels Aug 1, 2014

butonic modified the milestones: ownCloud 7.0.1, ownCloud 7.0.2 Aug 1, 2014

butonic self-assigned this Aug 1, 2014

butonic mentioned this pull request Aug 7, 2014

Search results duplicates first folder #10243

Closed

butonic modified the milestones: ownCloud 7.0.2, 2014-sprint-01-current Aug 7, 2014

butonic mentioned this pull request Aug 8, 2014

Fix search in shared files stable7 #10287

Merged

jnfrmarks added the sev3-medium label Sep 4, 2014

lock bot locked as resolved and limited conversation to collaborators Aug 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

traverse folders in php to search in shared files #10048

traverse folders in php to search in shared files #10048

butonic commented Jul 30, 2014

karlitschek commented Jul 30, 2014

schiessle commented Jul 30, 2014

ghost commented Jul 30, 2014

schiessle commented Jul 30, 2014

butonic commented Jul 30, 2014

schiessle commented Jul 30, 2014

butonic commented Jul 30, 2014

schiessle commented Jul 31, 2014

butonic commented Jul 31, 2014

schiessle commented Jul 31, 2014

butonic commented Jul 31, 2014

ghost commented Jul 31, 2014

icewind1991 commented Jul 31, 2014

butonic commented Jul 31, 2014

butonic commented Jul 31, 2014

schiessle commented Jul 31, 2014

scrutinizer-notifier commented Jul 31, 2014

ghost commented Jul 31, 2014

icewind1991 commented Jul 31, 2014

butonic commented Jul 31, 2014

karlitschek commented Jul 31, 2014

traverse folders in php to search in shared files #10048

traverse folders in php to search in shared files #10048

Conversation

butonic commented Jul 30, 2014

karlitschek commented Jul 30, 2014

schiessle commented Jul 30, 2014

ghost commented Jul 30, 2014

schiessle commented Jul 30, 2014

butonic commented Jul 30, 2014

schiessle commented Jul 30, 2014

butonic commented Jul 30, 2014

schiessle commented Jul 31, 2014

butonic commented Jul 31, 2014

schiessle commented Jul 31, 2014

butonic commented Jul 31, 2014

ghost commented Jul 31, 2014

icewind1991 commented Jul 31, 2014

butonic commented Jul 31, 2014

butonic commented Jul 31, 2014

schiessle commented Jul 31, 2014

scrutinizer-notifier commented Jul 31, 2014

ghost commented Jul 31, 2014

icewind1991 commented Jul 31, 2014

butonic commented Jul 31, 2014

karlitschek commented Jul 31, 2014