Save image based on path to image #171

Open
dyanakiev opened this issue Sep 26, 2017 · 16 comments

@dyanakiev

dyanakiev commented Sep 26, 2017

Hello, is it possible to save the images based on the requested path to them?
My cache folder started to grow really big.

Example:
https://example.site/img.php?src=uploads/avatars/234234234/avatar22.png
Cache path should be:
/images/cache/uploads/avatars/234234234/avatar22-cached-picture-something-234234.png

@mosbth mosbth self-assigned this Sep 26, 2017
@mosbth
Owner

mosbth commented Sep 26, 2017

Is this related to the cache dir growing?

A crontab script monitoring and removing old files is one way to deal with it.
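
A minimal sketch of such a cleanup script, which could be run from cron. The function name, the age threshold and the decision to look only at files directly inside the cache dir are assumptions for illustration, not part of cimage.

```python
import os
import time
from typing import Optional


def remove_old_files(cache_dir: str, max_age_days: float,
                     now: Optional[float] = None) -> int:
    """Delete regular files directly in cache_dir whose modification time
    is older than max_age_days. Returns the number of files removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = 0
    # os.scandir streams directory entries instead of building one big list,
    # which matters when the directory holds a very large number of files.
    with os.scandir(cache_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                os.remove(entry.path)
                removed += 1
    return removed
```

Calling `remove_old_files("/images/cache", 365)` would drop files not modified for a year; the path here is only an example.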

Another option: one can (pre)process and save images as an "alias" image and use that instead of direct access to img.php.
https://cimage.se/doc/config-file#alias

Could you elaborate a bit more on your perceived problem?

@dyanakiev
Author

dyanakiev commented Sep 26, 2017

Yes. I don't have a problem with disk space, but the folder itself holds such a large number of files that I can't even run "ls" in it. I don't need to delete old files; I would like the cached image to be stored in the cache folder with the same directory structure as in the URL. I will try to explain again.

Cache folder: /images/resized_cache/

Request uri: https://example.site/img.php?src=uploads/avatars/234234234/avatar22.png

Currently it is saved like this:
/images/resized_cache/avatar22-something-2342343.png

I would like it to be saved like this:
/images/resized_cache/uploads/avatars/234234234/avatar22-something-2342343.png
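
A minimal sketch of the requested mapping, assuming the cache file name is built from the source file name plus some option suffix. The function name and the `suffix` parameter are hypothetical; how cimage actually names cache files is not shown in this thread.

```python
from pathlib import PurePosixPath


def mirrored_cache_path(cache_root: str, src: str, suffix: str) -> str:
    """Build a cache path that keeps the directory part of `src`.

    `suffix` stands in for whatever option string or hash the cache
    appends to the file name (hypothetical, e.g. "something-2342343").
    """
    src_path = PurePosixPath(src)
    # Reject absolute paths and ".." so the result stays inside cache_root.
    if src_path.is_absolute() or ".." in src_path.parts:
        raise ValueError(f"unsafe src path: {src!r}")
    filename = f"{src_path.stem}-{suffix}{src_path.suffix}"
    return str(PurePosixPath(cache_root) / src_path.parent / filename)


print(mirrored_cache_path(
    "/images/resized_cache",
    "uploads/avatars/234234234/avatar22.png",
    "something-2342343",
))
# /images/resized_cache/uploads/avatars/234234234/avatar22-something-2342343.png
```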

@mosbth
Owner

mosbth commented Sep 26, 2017

Ok, so you would like to configure how the structure is created in the cache dir?
Instead of saving all files under cache/ you would prefer it used a subdirectory structure, the same way your original files are stored?
And this could/should be an option one could configure in the configfile?

That could be done. I had some thoughts on this earlier but decided to stick with a flat file structure (easier to implement).

Out of interest, what would you gain from having this structure? Is it a nice-to-have or a real benefit?

@mosbth mosbth added the feature label Sep 26, 2017
@flobox

flobox commented Sep 26, 2017

Big +1!

We also have a huge cache directory (in number of files). On some servers, a huge number of files in one single directory could slow things down and cause problems.

@mosbth
Owner

mosbth commented Oct 3, 2017

Will a directory structure be enough in the long term, or should one look into a solution where many images go into one file (plus one index file)?
Like this:
https://code.facebook.com/posts/685565858139515/needle-in-a-haystack-efficient-storage-of-billions-of-photos/
Perhaps both: start with a directory structure and see how long that is enough.

One could also consider using SQLite for smaller images:
https://www.sqlite.org/fasterthanfs.html
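
A minimal sketch of what keeping small cached images as BLOBs in one SQLite file could look like. The table layout and function names are assumptions made for this illustration; nothing like this exists in cimage as far as this thread shows.

```python
import sqlite3
from typing import Optional


def open_image_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a single-file store for small cached images."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "cache_key TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put_image(conn: sqlite3.Connection, cache_key: str, data: bytes) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO images (cache_key, data) VALUES (?, ?)",
            (cache_key, data),
        )


def get_image(conn: sqlite3.Connection, cache_key: str) -> Optional[bytes]:
    row = conn.execute(
        "SELECT data FROM images WHERE cache_key = ?", (cache_key,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```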

Still, I would like a straightforward solution, without too much hassle.

@flobox

flobox commented Oct 3, 2017

I think a directory structure would be the most pragmatic solution. For my needs, it would solve the problem!

@dyanakiev
Author

I think that the directory structure will work best.

@flobox

flobox commented Jun 19, 2018

Any progress so far regarding this issue? :)

My cache directory holds 900k files so far. I have a cronjob deleting cache files older than 365 days, but the directory is still growing. Splitting up the cache directory, as mentioned above, would be super helpful!

@mosbth
Owner

mosbth commented Jun 25, 2018

Not much progress, no. I checked my own cache dir and the largest only contains 5k files, not much in comparison to 900k. Nice to know we have some heavy users/usage out there.

I refreshed my memory and gave it some new thought, though. I have no definitive answer for now. I need to sleep on it.

@Surf-N-Code

@mosbth We are also very much in need of such a feature, since our cache directory contains more than a million files by now.

So, huge +1 from us as well.

Do you have any plans on integrating the suggested feature? :)

@mosbth
Owner

mosbth commented Feb 26, 2019

Plans exist, yes. Then it's that thing with time... and being able to prioritize among all the other stuff one has on his magic agenda to conquer the world.

The plan is to build an alternative cache structure that mirrors the directory structure of the img/ folder. This alternative structure can be turned on in the config file; it is off by default (to start with).

The directory structure will look something like this:

Source image:

img/image.png

Cache structure:

cache/image.png/h700-w300-cf.png
cache/image.png/h700-w300-cf-a=0,0,50,0.png

So, each image will have its own cache directory where all its cached versions are saved.

I am not sure how well this scales with a million images, but it might make it a tad easier to keep track of files in the cache directory and to get a visual overview of its content.

This should also make it possible to find stray images in the cache, that is, images that were removed/moved in the img/ folder but still remain in the cache/ folder.
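
A minimal sketch of how stray entries could be found under the proposed layout, assuming each cache directory that holds cached files is named after its source image's path relative to img/ (as in cache/image.png/ above). The function name is hypothetical, and files lying directly in the cache root (the old flat layout) as well as cache/fasttrack are skipped.

```python
import os
from typing import List


def find_stray_cache_dirs(img_root: str, cache_root: str) -> List[str]:
    """Return cache subdirectories (relative to cache_root) that hold cached
    files but whose source image no longer exists under img_root."""
    stray = []
    for dirpath, dirnames, filenames in os.walk(cache_root):
        if dirpath == cache_root:
            # Do not descend into the fasttrack directory and ignore
            # files stored directly in the cache root.
            dirnames[:] = [d for d in dirnames if d != "fasttrack"]
            continue
        if not filenames:
            continue  # a plain intermediate directory, e.g. cache/sub/
        rel = os.path.relpath(dirpath, cache_root)
        if not os.path.isfile(os.path.join(img_root, rel)):
            stray.append(rel)
    return sorted(stray)
```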

There is also the existing cache/fasttrack, where a hit goes straight to the cached image, that should, most likely, go into the new directory structure.

For those having a cache with a million files, one could guess that some kind of transfer process is needed, from old cache to new cache.

So that is the current plan, waiting to be implemented.

For stats, I looked in my largest website using cimage.

$ du -sk htdocs/img
442296  htdocs/img
$ du -sk cache/cimage/
586516  cache/cimage/
$ ls -R1 htdocs/img/ | wc -l
3059
$ ls -1 cache/cimage/ | wc -l
6237
$ ls -1 cache/cimage/fasttrack/ | wc -l 
7050

As you see, I have a pretty small cache compared to a million cached files.

Anyway, I guess I should be really happy to see that some of you are using cimage on sites with a million cached files - that is really nice to know. Real nice. :-)

@flobox

flobox commented Feb 26, 2019

@mosbth ah, great to see this issue getting picked up!

Right now we have almost 2 (!) million files in the cache directory. It is starting to get nasty :)
I don't think the solution you describe is the best fit for cimage applications with a lot of single files (as in our case), since it would also create a LOT of subfolders in the /cache/ directory. I would prefer a "time based" directory structure in the cache directory.

cache/2018/11/
cache/2019/01/
cache/2019/02/
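
A minimal sketch of the time based layout, assuming the subdirectory is derived from the moment the cache file is created. The function name is hypothetical. Note that a later lookup cannot recompute this path from the request alone, so the full path would have to be stored somewhere (for example in the fasttrack entry).

```python
from datetime import datetime
from pathlib import PurePosixPath


def time_based_cache_path(cache_root: str, cache_filename: str,
                          created: datetime) -> str:
    """Place a cache file under cache_root/YYYY/MM/ based on creation time."""
    return str(
        PurePosixPath(cache_root)
        / f"{created:%Y}"
        / f"{created:%m}"
        / cache_filename
    )


print(time_based_cache_path("cache", "avatar22-something-2342343.png",
                            datetime(2019, 2, 26)))
# cache/2019/02/avatar22-something-2342343.png
```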

Again: we are super happy with cimage serving so many image files to our users! Thank you for your great work here @mosbth !

@mosbth
Owner

mosbth commented Feb 26, 2019

@flobox @Surf-N-Code A quick question, when you say you have millions of files in the cache, how many files do you have in the source img/?

@flobox Yes, a time based structure could be an alternative. I would prefer to avoid the need to create too many subdirectories.

@mosbth
Owner

mosbth commented Feb 26, 2019

This is how the cache currently works, a bit simplified and excluding the usage of HTTP cache settings which further decreases the need to actually run cimage to process the request.

  1. Image url is incoming, /img/image.png?w=700&h=300&cf.
  2. Create an MD5 key of the string img/image.png?w=700&h=300&cf.
  3. Check the cache/fasttrack/${key}.json.
  4. If a hit, load the json file, get the path to the cache file and serve it. Done.
  5. No hit in the fasttrack, process the request through cimage.
  6. Create a new cache file for the request.
  7. Save a new entry to the fasttrack.
  8. Serve the cached image.
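
The steps above can be sketched roughly as follows. This is not cimage's code: the JSON field name `cache_file`, the function names and the `process` callback (standing in for the actual image processing) are assumptions for illustration.

```python
import hashlib
import json
import os
from typing import Callable


def fasttrack_key(request: str) -> str:
    """Step 2: MD5 key of the request string."""
    return hashlib.md5(request.encode("utf-8")).hexdigest()


def serve_request(request: str, cache_dir: str,
                  process: Callable[[str], str]) -> str:
    """Return the path of the cached image for `request`.

    `process(request)` is a hypothetical stand-in for running cimage;
    it must create the cache file and return its path.
    """
    fasttrack_dir = os.path.join(cache_dir, "fasttrack")
    entry = os.path.join(fasttrack_dir, fasttrack_key(request) + ".json")
    try:
        # Steps 3-4: a fasttrack hit points straight at the cache file.
        with open(entry, encoding="utf-8") as fh:
            return json.load(fh)["cache_file"]
    except FileNotFoundError:
        pass
    # Steps 5-6: no hit, so process the request and create the cache file.
    cache_file = process(request)
    # Step 7: save a new fasttrack entry for the next request.
    os.makedirs(fasttrack_dir, exist_ok=True)
    with open(entry, "w", encoding="utf-8") as fh:
        json.dump({"cache_file": cache_file}, fh)
    # Step 8: the caller serves this file.
    return cache_file
```

With this shape, the second identical request never reaches `process`, which is the point of the fasttrack.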

The obvious improvement I see is to reduce the number of files needed in the fasttrack directory. This can be reduced to 1 json file per source image, instead of 1 file per "request string" as it is now. These files are small.
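
A minimal sketch of the "1 json file per source image" idea, assuming the file is keyed by the MD5 of the source path and maps each option string to its cache file. The names and the JSON layout are hypothetical, and the read-modify-write in `remember_variant` is not safe against concurrent requests as written.

```python
import hashlib
import json
import os
from typing import Optional


def _entry_path(fasttrack_dir: str, source: str) -> str:
    key = hashlib.md5(source.encode("utf-8")).hexdigest()
    return os.path.join(fasttrack_dir, key + ".json")


def lookup_variant(fasttrack_dir: str, request: str) -> Optional[str]:
    """Return the cache file for a request such as
    "img/image.png?w=700&h=300&cf", or None when it is not known."""
    source, _, options = request.partition("?")
    try:
        with open(_entry_path(fasttrack_dir, source), encoding="utf-8") as fh:
            return json.load(fh).get(options)
    except FileNotFoundError:
        return None


def remember_variant(fasttrack_dir: str, request: str, cache_file: str) -> None:
    """Add one option string to the source image's single json file."""
    source, _, options = request.partition("?")
    path = _entry_path(fasttrack_dir, source)
    os.makedirs(fasttrack_dir, exist_ok=True)
    try:
        with open(path, encoding="utf-8") as fh:
            variants = json.load(fh)
    except FileNotFoundError:
        variants = {}
    variants[options] = cache_file
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(variants, fh)
```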

The fasttrack could be implemented as a SQLite database; this reduces the fasttrack files to 1 but likely adds some lookup time.
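
A minimal sketch of a SQLite-backed fasttrack, assuming one row per request key that points at the cache file. The table and function names are made up for this illustration; whether the lookup is slower than reading a small json file would need measuring.

```python
import hashlib
import sqlite3
from typing import Optional


def open_fasttrack(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the single fasttrack database file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fasttrack ("
        "request_key TEXT PRIMARY KEY, cache_file TEXT NOT NULL)"
    )
    return conn


def _request_key(request: str) -> str:
    return hashlib.md5(request.encode("utf-8")).hexdigest()


def fasttrack_put(conn: sqlite3.Connection, request: str, cache_file: str) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO fasttrack (request_key, cache_file) "
            "VALUES (?, ?)",
            (_request_key(request), cache_file),
        )


def fasttrack_get(conn: sqlite3.Connection, request: str) -> Optional[str]:
    row = conn.execute(
        "SELECT cache_file FROM fasttrack WHERE request_key = ?",
        (_request_key(request),),
    ).fetchone()
    return row[0] if row else None
```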

The number of cached image files could perhaps be limited through rules restricting how cimage may be accessed.

Maybe there is some limited opportunity to reduce the number of files, when one image is the exact copy of another image, but the request url is different. This implies some more processing in cimage, or perhaps some improvement to the code. Anyway, the improvement is most likely not much.

@mosbth
Owner

mosbth commented Feb 26, 2019

For general reference, I assume there is no actual hard limit that we are close to reaching, even with millions of cache files, on how many files we can have in a single directory (using ext4) (source).

Another conclusion from the same source is that there is no notable difference in performance between a directory with millions of files and one with 10 files. At least, not in the way cimage is using the files in the cache dir.

I'm trying to wrap my head around "why do we (really) want this", sort of asking "5 Whys" to get to the root cause of it. That feels like a good exercise before coding away...

So, this far I have:

  1. Physical limit on the filesystem (NO)
  2. Performance improvement related to number of files/directories or its structure (NO)
  3. General improvements in how many files exist in the cache (YES, mainly related to cache/fasttrack which can be reduced to 1 json-file per source image, pointing out the actual cache images).
  4. Replace cache/fasttrack with a SQLite database (NO, reduces the files but most likely adds lookup time).
  5. A more user friendly cache for visual inspection (YES, through directory structure mirroring img/)
  6. A time based directory structure (YES, for visual inspection, ease of cleanup and perhaps backup)
  7. Reduce the number of files, as a way to reduce the amount of data stored (NO, a good intention but not really an issue).
  8. General cleanup and monitoring issues for the cache, keeping track of old files or not used files (NO, would be nice but not yet pointed out as an issue).
  9. Working with the cache through ls, find (NO, not pointed out as an issue).

Anything to add to the list?

@flobox

flobox commented Feb 26, 2019

> @flobox @Surf-N-Code A quick question, when you say you have millions of files in the cache, how many files do you have in the source img/?

@mosbth in the img/ we also have more than 1 million files, BUT they are structured in subfolders!

> For general reference, I do assume that there is no actual hard limit, that we are close to reaching, even with millions of cache files, on how many files we can have in a single directory (using ext4) (source).

You are absolutely right on this. It is just getting a little bit unwieldy with that many files in one single cache directory.

> Anything to add to the list?

Great list! Nope. Nothing to add from my side.
