
Errors with php occ recognize:classify and no tags created #1074

Closed
t0mcat1337 opened this issue Jan 7, 2024 · 12 comments
t0mcat1337 commented Jan 7, 2024

Which version of recognize are you using?

6.0.1

Enabled Modes

Object recognition

TensorFlow mode

Normal mode

Downstream App

Files App

Which Nextcloud version do you have installed?

28.0.1

Which Operating system do you have installed?

Kubernetes @ Ubuntu 22.04.3 LTS VM

Which database are you running Nextcloud on?

mariadb:10.7

Which Docker container are you using to run Nextcloud? (if applicable)

nextcloud:28.0.1-fpm

How much RAM does your server have?

12GiB in the K8s VM

What processor Architecture does your CPU have?

x86_64

Describe the Bug

I'm running the Nextcloud Docker image (NOT the AIO one!) in a K8s cluster and installed the recognize app there.
When running

php occ recognize:classify -vvv

for the initial classification of already existing files, no tags are created. This is the output:

Processing storage 15 with root ID 187614
generating preview of 187617 with dimension 1024 using nextcloud preview manager
generating preview of 187618 with dimension 1024 using nextcloud preview manager
generating preview of 187619 with dimension 1024 using nextcloud preview manager
generating preview of 187620 with dimension 1024 using nextcloud preview manager
generating preview of 187621 with dimension 1024 using nextcloud preview manager
generating preview of 187622 with dimension 1024 using nextcloud preview manager
[...]
Classifying array (
  0 => '/tmp/oc_tmp_MVG76Z-.jpg',
  1 => '/tmp/oc_tmp_eDX00R-.jpg',
  2 => '/tmp/oc_tmp_GPuild-.jpg',
  3 => '/tmp/oc_tmp_Ihmvzw-.jpg',
  4 => '/tmp/oc_tmp_jjH8Y8-.jpg',
  5 => '/tmp/oc_tmp_wJp7RT-.jpg',
  6 => '/tmp/oc_tmp_p003vM-.jpg',
  7 => '/tmp/oc_tmp_BcBImY-.jpg',
[...]
  23 => '/tmp/oc_tmp_xIOIK4-.jpg',
)
Running array (
  0 => '/usr/bin/nice',
  1 => '-0',
  2 => '/var/www/html/apps/recognize/bin/node',
  3 => '/var/www/html/apps/recognize/src/classifier_imagenet.js',
  4 => '-',
)
Classifier process output: 
Error while running imagenetclassifier
No files left to classify
face classifier end
No files left to classify
No files left to classify
No files left to classify
No files left to classify
face classifier end
No files left to classify
No files left to classify
Processing storage 4 with root ID 3135066
generating preview of 3135070 with dimension 1024 using nextcloud preview manager
[...]

What I'm wondering is: what exactly does "Error while running imagenetclassifier" mean? Where can I find more details about what is going wrong here?

These entries fill nextcloud.log while the occ command is running:
[screenshots: nextcloud.log entries logged during the occ run]

In addition: when I upload a new image to Nextcloud, it gets tagged by recognize after a short time without problems:

[screenshot: newly uploaded image tagged by recognize]

So I assume recognize is working in general. This can also be seen on its admin page:

[screenshots: recognize admin page showing classification status]
(BTW: there also seems to be some issue with landmark recognition, but that's a separate topic)

Expected Behavior

php occ recognize:classify should not run into errors (or should at least give more detail about what is going wrong)
tags should be created

To Reproduce

Run the NC 28.0.1 FPM Docker image (in K8s), install recognize, run the occ... command

Debug log

No response

t0mcat1337 added the bug (Something isn't working) label on Jan 7, 2024

github-actions bot commented Jan 7, 2024

Hello 👋

Thank you for taking the time to open this issue with recognize. I know it's frustrating when software
causes problems. You have made the right choice to come here and open an issue to make sure your problem gets looked at
and if possible solved.
I try to answer all issues and if possible fix all bugs here, but it sometimes takes a while until I get to it.
Until then, please be patient.
Note also that GitHub is a place where people meet to make software better together. Nobody here is under any obligation
to help you, solve your problems or deliver on any expectations or demands you may have, but if enough people come together we can
collaborate to make this software better. For everyone.
Thus, if you can, you could also look at other issues to see whether you can help other people with your knowledge
and experience. If you have coding experience it would also be awesome if you could step up to dive into the code and
try to fix the odd bug yourself. Everyone will be thankful for extra helping hands!
One last word: If you feel, at any point, like you need to vent, this is not the place for it; you can go to the forum,
to twitter or somewhere else. But this is a technical issue tracker, so please make sure to
focus on the tech and keep your opinions to yourself. (Also see our Code of Conduct. Really.)

I look forward to working with you on this issue
Cheers 💙

@marcelklehr (Member)

Mmh, that's a weird one. The code that runs the model is the same for both on-demand classification and the classify command :/

Could it be that the container runs out of memory and kills the node process?

@marcelklehr (Member)

Also, are you running the Nextcloud cron jobs in a different container, by chance?

@t0mcat1337 (Author)

Mmh, that's a weird one. The code that runs the model is the same for both on-demand classification and the classify command :/

Could it be that the container runs out of memory and kills the node process?

Can't imagine it is an out-of-memory issue, as the mentioned error message appears immediately at the very first storage / root ID run. The log output I pasted is right from the beginning.

And this is the memory the Pod (container) sees (also, no limits are set in K8s for this Pod):
[screenshot: free output inside the pod]

@t0mcat1337 (Author)

Also, are you running the Nextcloud cron jobs in a different container, by chance?

Yes, the cron jobs are running in a sidecar container in the same pod as the Nextcloud app, so the cron container sees the same resources as the Nextcloud container. This sidecar uses the same image as the Nextcloud container but - of course - overrides the default entrypoint with Nextcloud's "cron.sh" script to start a cron daemon.

@marcelklehr (Member)

Can you try running the classifier command manually inside both containers?

$ /var/www/html/apps/recognize/bin/node /var/www/html/apps/recognize/src/classifier_imagenet.js some/image.jpg

Maybe it's something to do with the files being stored in /tmp. Some people have had issues with /tmp in docker: https://github.com/nextcloud/recognize#tmp
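
If /tmp turns out to be the problem, here is a minimal sketch of how Nextcloud could be pointed at a different temp directory (the path /mnt/nc-tmp is only an example; tempdirectory is Nextcloud's standard config.php setting, set here via occ):

$ php occ config:system:get tempdirectory        # shows the current value, if one is set
$ mkdir -p /mnt/nc-tmp && chown www-data:www-data /mnt/nc-tmp
$ php occ config:system:set tempdirectory --value="/mnt/nc-tmp"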

@t0mcat1337 (Author)

Sure... in Nextcloud's main app container this is the output:

root@wolke-app-6c47c4f7f5-s6kj2:/var/www/html# /var/www/html/apps/recognize/bin/node /var/www/html/apps/recognize/src/classifier_imagenet.js /mnt/data/wolke/myUserName/files/20240105_160525.jpg
Killed

and this in the cron container:

root@wolke-app-6c47c4f7f5-s6kj2:/var/www/html# /var/www/html/apps/recognize/bin/node /var/www/html/apps/recognize/src/classifier_imagenet.js /mnt/data/wolke/myUserName/files/20240105_160525.jpg
2024-01-10 17:16:16.669353: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 36578304 exceeds 10% of free system memory.
2024-01-10 17:16:16.779000: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 146313216 exceeds 10% of free system memory.
2024-01-10 17:16:16.884561: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 146313216 exceeds 10% of free system memory.
2024-01-10 17:16:17.015482: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 146313216 exceeds 10% of free system memory.
2024-01-10 17:16:17.137038: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 146313216 exceeds 10% of free system memory.
{
  className: 'ballpoint',
  probability: 0.7784780859947205,
  rule: { label: 'office', threshold: 0.1 }
}
{
  className: 'rubber eraser',
  probability: 0.0868125855922699,
  rule: { threshold: 0.1, priority: -2 }
}
{
  className: 'plunger',
  probability: 0.008453534916043282,
  rule: { threshold: 0.1, priority: -2 }
}
{
  className: 'nail',
  probability: 0.0049564167857170105,
  rule: { label: 'portrait', threshold: 0.08, categories: [ 'people' ] }
}
{
  className: 'pencil box',
  probability: 0.004913938231766224,
  rule: { label: 'office', threshold: 0.1 }
}
{
  className: 'screwdriver',
  probability: 0.0029951154720038176,
  rule: { label: 'tool', threshold: 0.4, priority: -1 }
}
{
  className: 'projectile',
  probability: 0.002502560382708907,
  rule: { threshold: 0.1, priority: -2 }
}
["Office"]

@marcelklehr (Member)

Nice, so something in the main container kills the process. Am I correct that your main container has 1.7GB free memory in the above screenshot? You may need more than that.
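
One way to confirm that the kernel's OOM killer is what terminates the node process would be to check the kernel log right after a "Killed" run (a sketch; inside a container, dmesg shows the host kernel's messages and may need extra privileges):

$ dmesg | grep -i -E 'oom|killed process'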

@t0mcat1337 (Author)

No, you are not correct. When reading the output of the "free" command you have to look at the "available" column (which is around 7GB in the above screenshot), not the "free" one. Linux tries to use free memory for caching (which can be seen in the "buff/cache" column), so no memory is wasted, but this memory is nevertheless available to other processes.
In addition, both containers are running in the same Kubernetes Pod, so on the same server, seeing the same resources (CPU/memory).
To verify this, I made another test: I ran "free" immediately followed by your "classifier_imagenet.js" command at nearly the same time in both containers.
These are the results:

nextcloud app container:

root@wolke-app-6c47c4f7f5-s6kj2:/var/www/html# free; /var/www/html/apps/recognize/bin/node /var/www/html/apps/recognize/src/classifier_imagenet.js /mnt/data/wolke/MyUserName/files/20240105_160525.jpg
               total        used        free      shared  buff/cache   available
Mem:        12244484     5616256     3901268      137260     3194588     6628228
Swap:              0           0           0
Killed

cron container:

root@wolke-app-6c47c4f7f5-s6kj2:/var/www/html# free; /var/www/html/apps/recognize/bin/node /var/www/html/apps/recognize/src/classifier_imagenet.js /mnt/data/wolke/MyUserName/files/20240105_160525.jpg
               total        used        free      shared  buff/cache   available
Mem:        12244484     5612700     3908816      137260     3190400     6631784
Swap:              0           0           0
{
  className: 'ballpoint',
  probability: 0.7784780859947205,
  rule: { label: 'office', threshold: 0.1 }
}
{
  className: 'rubber eraser',
  probability: 0.0868125855922699,
  rule: { threshold: 0.1, priority: -2 }
}
{
  className: 'plunger',
  probability: 0.008453534916043282,
  rule: { threshold: 0.1, priority: -2 }
}
{
  className: 'nail',
  probability: 0.0049564167857170105,
  rule: { label: 'portrait', threshold: 0.08, categories: [ 'people' ] }
}
{
  className: 'pencil box',
  probability: 0.004913938231766224,
  rule: { label: 'office', threshold: 0.1 }
}
{
  className: 'screwdriver',
  probability: 0.0029951154720038176,
  rule: { label: 'tool', threshold: 0.4, priority: -1 }
}
{
  className: 'projectile',
  probability: 0.002502560382708907,
  rule: { threshold: 0.1, priority: -2 }
}
["Office"]

--> Notice the nearly identical free/available memory columns in both containers.

Meanwhile, thinking about this, I assume something in the entrypoint script (or whatever else) in the Nextcloud app container affects memory handling, as both containers are based on the same image.
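
A way to compare what the two containers are actually allowed to use (a sketch: free reports the node's memory, while any Kubernetes limit is enforced per container via cgroups; the first path is cgroup v2, the second the v1 fallback):

$ cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# "max" (v2) or a very large number (v1) means no limit; a value like 2147483648 would be a 2 GiB cap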

@marcelklehr (Member)

I'm not sure why the process is getting killed, but that's definitely the problem you'd need to solve. If I can help more somehow, let me know.

@t0mcat1337 (Author)

So in the end I found the root cause... despite my assumption, there WERE indeed memory limits set for the Nextcloud container (2GB). After removing them, the occ command works without problems. Sorry for the confusion, shame on me :S
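
For anyone running into the same thing, a minimal sketch of how such a limit could be raised (or removed); the deployment and container names are only placeholders for your own setup:

$ kubectl set resources deployment nextcloud -c=nextcloud --limits=memory=4Gi
# or edit the Deployment and drop the resources.limits.memory entry entirely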

@marcelklehr (Member)

No worries! I'm glad we found the root cause :)
