-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
traverse folders in php to search in shared files #10048
Conversation
@schiesbn What is your opinion? |
@butonic do you mean that a share table entry should not only point to the shared folder but to every file below the folder? I would really avoid to go into this direction. This would not only make the initial share operation more expensive but the share table would also need to track all changes to the file system. This would duplicate (large) parts of the file cache and would add additional costs to every file system operation which adds/removes a file. Currently the share table gives you the starting point at the file cache table and from there you can operate on the file cache table which provides every information needed and keeps track of all changes. I don't see the need to duplicate it. |
🚀 Test Passed. 🚀 |
Regarding your changes, the code looks good to me. Although I never used the search() method by myself. I have no idea how such a "pattern" can look like and whether your changes (for example Would be great if you could add some unit tests to it, thanks! |
The pattern is another thing that is broken. Currently The search method itself is only directly called from the tests and |
I think that's just a syntax question right? We need to come up with a syntax for this pattern. Ideally this syntax works for all storage back-ends so that the pattern can be used right away. I believe this was true before we supported object storage. If such a syntax exists we can use it. If not we need to make sure that we translate it, if we call a a storage which expects a different syntax.
But this tests will never call the search method implemented by the Shared_Cache right? Would be good to have a unit test for it too. |
The current implementation does not scale because And I already stripped it of completely unneeded duplicate queries for user group and what not. But that is a different issue. What I mean is having the current filecache and share tables and an additional relation table called 'oc_shares_files' having the fileid and the shareid as foreign keys: CREATE TABLE `oc_shares_files` (`shareid` int NOT NULL , `fileid` int NOT NULL , PRIMARY KEY (`shareid`, `fileid`));
-- We can then join the share table via the relations to search files
SELECT "sf"."shareid", "fc"."fileid"
FROM "oc_filecache" "fc"
JOIN "oc_share_files" "sf" ON "sf"."fileid"="fc"."fileid"
JOIN "oc_share" "s" ON "sf"."shareid"="s"."id"
WHERE "fc"."name" LIKE ?
AND "s"."share_with" = ? pass in the pattern and the user searching through shared files. The query could be optimized to also show files owned by the user. Maintaining the relations is not that hard: -- add all when a new share is created:
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT ?, `fileid` FROM `oc_filecache` WHERE `path` LIKE ? with the new share id and the file path appended with '%' to get all children of a folder. All this happens inside the database. A quick test on an oracle db that had to insert 7736 rows took less than a second. Honestly, this is what RDBMS are made for. On my productivy mysql machine I did: SELECT count(*) FROM `oc_filecache` WHERE `path` LIKE 'files%';
+----------+
| count(*) |
+----------+
| 21750 |
+----------+
1 row in set (0.19 sec)
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT 1, `fileid` FROM `oc_filecache` WHERE `path` LIKE 'files%';
Query OK, 21750 rows affected (0.07 sec)
Records: 21750 Duplicates: 0 Warnings: 0 so ... IMHO performance is clearly not an issue here. Other operations can now also be offloaded to the database: -- on a create hook add the file to the shares_files relation by looking up the parent in the share_files table
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT `sf`.`shareid`, `fc`.`fileid` FROM `oc_filecache` `fc` JOIN `oc_share_files` `sf` ON `sf`.`fileid`=`fc`.`parent` WHERE `fc`.`fileid` = ?
-- on a delete hook remove the file or all files of a share from the shares_files relation
-- when deleting a file (or folder)
DELETE FROM `oc_shares_files` WHERE `fileid` IN (SELECT `fileid` FROM `oc_filecache` WHERE `path` LIKE ? )
-- when deleting a share
DELETE FROM `oc_shares_files` WHERE `shareid` = ? If we were using foreign keys we could even use the cascade option to remove entries from the relation table. Migration to a relations table is easy because it only adds the relation table. It is basically a speed tradeoff where we maintain the relations of the shared files in the database to speed up all queries related to file shares, because we can let the database handle permissions and don't have to go through php and the connection. Looking at [nested foreach loops that are needed to update the etags](https://github.com/owncloud/core/blob/master/apps/files_sharing/lib/updater.php#L72 for all users with access to a shared file) (which kills sync performance btw) my experience tells me that this can also be improved. Yes we have to recursively walk up the tree as well but we can do that for all users in parallel. |
@butonic thanks for the detailed explanation. I'm not sure if it is that easy in all circumstances, e.g. your initial insert if a share was created only looks for the path, but this way you would also insert files from different users who might have a similar path. Also we could not just add the storage ID as a second parameter because the storage ID could change if the shared folder contains a sub-folder which is a external mount. This could probably be solved, but I assume that some extra queries would be involved. I didn't thought about it in detail, this are just some thoughts coming to my mind while reading it. We would also have to check the other queries if they work in all circumstances. I also wonder if this is a sharing-only issue. Does the searchByMime() method for the local storage is more efficient? I would prefer to optimize this method, if needed. This way all other storage could re-use it. As said, as soon as we know the real path and user of a shared folder, all operation can happen on the file cache without any special code regarding sharing. So if the file cache operation is fast then the same operation for shared files should be fast too. For now, let's fix it they way you did it. I would only like to have some unit tests for it. 😉 |
@schiesbn the SQL was meant to show that inserting the relations is done with one query, limiting the query to only one user ( A move operation will become more complicated because we need to update the relation as well as the etags. But that is only one query in addition to all the etag updates (which have to traverse up the tree for source and target). searchByMime() uses the same traversal algorithm, which makes it equally slow. Yet another reason why the gallery will take some time to show images. Both search methods need to walk the tree or we need to come up with a smarter algorithm. I'll write the tests and create a new issue for this. |
Did you had a look at the file cache, table? There is no 'user'. Unfortunately it is not that easy to say which file cache entry belongs to which user. The only reference is the storage ID, resolving this to the user seems to be not that easy for this use case. First you would need to find out which storage IDs belongs to which user. Another problem could be that the path is relative to the storage. So even if you know all storage IDs of a given user it could happen that you get false-positives. All solvable, but probably not as easy as it seems at the first moment and it is not sure how much queries we would really need at the end. But let's discuss it in the new issue. |
@schiesbn thx for making me write unit tests, found and fixed two bugs. |
💣 Test Failed. 💣 |
Instead of mapping shareid -> fileids it would make more sense to keep a fileid -> fileid mapping of the three, something like
Then for searching the shared cache can map the shareid to the parent and create a join with the filecache from there. Additionally this table can be used for moving and deleting within the cache and searching inside a folder |
@icewind1991 can you elaborate on that in #10077 |
@icewind1991 @schiesbn can I get a 👍 for this to make search in shared files work at least until we come up with a better solution in #10077 ? |
👍 under the assumption that Jenkins will finish successfully |
The inspection completed: 4 new issues, 3 updated code elements |
🚀 Test Passed. 🚀 |
👍 |
traverse folders in php to search in shared files
@karlitschek @craigpg backport to 7.0.1? is easy and contained. |
yes please. but not before monday when 7.0.1 is out |
makeshift fix for #10009
@karlitschek @DeepDiver1975 @craigpg needed for 7.0.1
Why makeshift, you ask? The solution uses the same directory traversing as searchByMime ... which is a performance nightmare and ugly. But it works.
@icewind1991 @PVince81 @schiesbn We definitely need to come up with a better solution... I still vote for a n:m relation table between share and files. Yes it takes longer to share hundreds of files ... but it is a much cleaner solution and allows us to use joins and let the db do it's magic.