ERROR: LXC container name not set! #1857

iglov · 2023-03-30T09:00:31Z

OS: Debian 10/11/12
Kernel: 5.10.0-15-amd64 - 6.1.0-18-amd64
Env (depends on the Debian version):

resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2
resource-agents 1:4.12.0-2, pacemaker 2.1.5-1+deb12u1, corosync 3.1.7-1, lxc 1:5.0.2-1+deb12u2

I'm just trying to add a new resource:

lxc-start -n front-2.fr
pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr

After ~5 min I want to remove it:
pcs resource remove front-2.fr --force
I get an error and the cluster starts to migrate:
Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!

As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error is raised when the agent can't get the OCF_RESKEY_container variable.
This bug only shows up on clusters that have been running for a long time without a reboot. For example, after fencing I can add/remove lxc resources and everything is fine for a while.

The question is: why? And how to debug it?


oalbrigt · 2023-03-30T10:07:03Z

This might be due to the probe-action.

You can try changing LXC_validate (resource-agents/heartbeat/lxc.in, line 343 in fe1a2f8) to ocf_is_probe || LXC_validate.

oalbrigt · 2023-03-30T10:13:05Z

Seems like the agent already takes care of probe-actions, so I'll have to investigate further what might cause it.

iglov · 2023-03-30T10:23:45Z

Hey @oalbrigt, thanks for the reply!

to ocf_is_probe || LXC_validate.

Yep, of course I can try, but what's the point if, as we can see, the OCF_RESKEY_container variable doesn't exist or the agent just doesn't know anything about it? Even if I try it, the agent won't be able to stop the container here, for the same reason:

LXC_stop() { (resource-agents/heartbeat/lxc.in, line 184 in fe1a2f8)

oalbrigt · 2023-03-31T07:40:39Z

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

kgaillot · 2023-04-03T18:18:31Z

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

No, that's odd. Was the command tried without --force first? It shouldn't normally be necessary, so if it was, that might point to an issue.

iglov · 2023-04-03T18:21:27Z

Hey @kgaillot, thanks for the reply!
Nope, without --force the result is the same.

kgaillot · 2023-04-03T18:54:22Z

@iglov @oalbrigt , can one of you try dumping the environment to a file from within the stop command? Are no OCF variables set, or is just that one missing?

iglov · 2023-04-03T18:57:31Z

Well, I can try, if you tell me how to do that and if I find a cluster in the same state.

kgaillot · 2023-04-03T19:37:55Z

Something like env > /run/lxc.env in the agent's stop action

iglov · 2023-04-03T19:47:23Z

Oh, you mean I should place env > /run/lxc.env somewhere in /usr/lib/ocf/resource.d/heartbeat/lxc, in LXC_stop() { ... }? But that won't work because: 1. The agent dies before LXC_stop(), in LXC_validate(); 2. After fencing the node reboots and /run is unmounted, so the file would be lost. So I think it would be better to put env > /root/lxc.env in LXC_validate().
If that's all correct, I will try it when I find a cluster with this bug.
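
The environment dump discussed above could be sketched as a small helper. This is only a sketch, not part of the upstream agent: the function name dump_ocf_env and its two arguments are hypothetical, and /root/lxc.env is simply the path proposed in this comment. It appends instead of overwriting, so a later healthy monitor call does not erase the dump from a failing stop.

```shell
# Hypothetical debugging helper; NOT part of the upstream lxc agent.
# Appends a timestamped, sorted copy of the environment to a file.
dump_ocf_env() {
    out="${1:-/root/lxc.env}"   # target file (path suggested in the thread)
    label="${2:-unknown}"       # free-form label, e.g. the action name
    {
        # Header line so that successive dumps can be told apart
        printf '### %s action=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label"
        env | sort
    } >> "$out"
}
```

It could be called as the first line of LXC_validate(), for example dump_ocf_env /root/lxc.env validate. Only exported variables show up in the dump, which is what matters here, since the OCF_RESKEY_ parameters reach the agent through the environment.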

kgaillot · 2023-04-03T21:15:44Z

That sounds right

iglov · 2024-02-06T10:02:42Z

Hey guys! I caught it. I tried to stop the container nsa-1.ny with pcs resource remove nsa-1.ny --force and got this debug output:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=5d3831d43d924a08a3dad6f49613e661
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
PCMK_quorum_type=corosync
SHLVL=1
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:36160
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

And this is what it should look like:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=b062591edd5142bd952b5ecc4f86b493
OCF_RESKEY_CRM_meta_interval=30000
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config
PCMK_quorum_type=corosync
OCF_RESKEY_CRM_meta_name=monitor
SHLVL=1
OCF_RESKEY_container=nsa-1.ny
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:44603
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

As you can see, some variables are missing, such as OCF_RESKEY_container and OCF_RESKEY_config.

Any ideas? ^_^
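
Comparing two such dumps by eye is error-prone, so the comparison can be sketched as a small function. This is a minimal sketch under stated assumptions: the name missing_vars is hypothetical, both files are assumed to be in env output format (NAME=value, one per line, no multi-line values), and lines starting with # are treated as comments.

```shell
# Hypothetical helper: print the names of variables that are present in
# the first dump but absent from the second.
# $1: dump from a healthy call, $2: dump from the failing call
missing_vars() {
    good_names=$(mktemp) || return 1
    bad_names=$(mktemp) || return 1
    # Keep only the variable names; sort in the C locale so comm agrees
    grep -v '^#' "$1" | cut -d= -f1 | LC_ALL=C sort -u > "$good_names"
    grep -v '^#' "$2" | cut -d= -f1 | LC_ALL=C sort -u > "$bad_names"
    # Column 1 of comm: lines found only in the first file
    LC_ALL=C comm -23 "$good_names" "$bad_names"
    rm -f "$good_names" "$bad_names"
}
```

For the two dumps above, the names that differ are OCF_RESKEY_container, OCF_RESKEY_config, OCF_RESKEY_CRM_meta_name and OCF_RESKEY_CRM_meta_interval. Note that the healthy dump was taken from a monitor call (OCF_RESKEY_CRM_meta_name=monitor), so it is not a like-for-like comparison with a stop.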

oalbrigt · 2024-02-06T12:01:06Z

That's strange. Did you create it without specifying container=<container name> and using -f to force it? What does your pcs resource config output say?

iglov · 2024-02-06T12:32:45Z

Yes, it's very, VERY strange. I create resources with pcs resource create test ocf:heartbeat:lxc container=test config=/mnt/cluster_volumes/lxc1/test/config (you can see it in the first post), BUT it doesn't matter, because as I said earlier:

This bug is only on clusters who work without reboot a long time. For example after fencing i can add/remove lxc resources and everything will be fine for a while.

As you can see, almost a year passed before the bug appeared. This means I can create a resource with ANY method and it WILL work correctly until... something goes wrong.
pcs resource config shows everything is fine:

  Resource: nsa-1.ny (class=ocf provider=heartbeat type=lxc)
   Attributes: config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config container=nsa-1.ny
   Operations: monitor interval=30s timeout=20s (nsa-1.ny-monitor-interval-30s)
               start interval=0s timeout=60s (nsa-1.ny-start-interval-0s)
               stop interval=0s timeout=60s (nsa-1.ny-stop-interval-0s)

So-o-o, I have no idea how to debug it further :(

oalbrigt · 2024-02-06T12:46:27Z

Can you add the output from rpm -qa | grep pacemaker? So I can have our Pacemaker devs see if this is a known issue.

iglov · 2024-02-06T12:50:29Z

Yep, sure, but I'm on Debian:

# dpkg -l | grep pacemaker
ii  pacemaker                            2.0.1-5                      amd64        cluster resource manager
ii  pacemaker-cli-utils                  2.0.1-5                      amd64        cluster resource manager command line utilities
ii  pacemaker-common                     2.0.1-5                      all          cluster resource manager common files
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents

# dpkg -l | grep corosync
ii  corosync                             3.0.1-2+deb10u1              amd64        cluster engine daemon and utilities
ii  corosync-qdevice                     3.0.0-4+deb10u1              amd64        cluster engine quorum device daemon
ii  libcorosync-common4:amd64            3.0.1-2+deb10u1              amd64        cluster engine common library

# dpkg -l | grep resource-agents
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents
ii  resource-agents                      1:4.7.0-1~bpo10+1            amd64        Cluster Resource Agents

# dpkg -l | grep lxc
ii  liblxc1                              1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools (library)
ii  lxc                                  1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools
ii  lxc-templates                        3.0.4-0+deb10u1              amd64        Linux Containers userspace tools (templates)
ii  lxcfs                                3.0.3-2                      amd64        FUSE based filesystem for LXC

kgaillot · 2024-02-06T15:04:31Z

@iglov That is extremely odd. If you still have the logs from when that occurred, can you open a bug at bugs.clusterlabs.org and attach the output of crm_report -S --from="YYYY-M-D H:M:S" --to="YYYY-M-D H:M:S" from each node, covering the half hour or so around when the failed stop happened?

iglov · 2024-02-06T20:31:45Z

I would like to, but I can't, because there is a lot of business-sensitive information in there, like hostnames, common logs, the process list, even DRBD passwords :(

kgaillot · 2024-02-07T17:11:32Z

I would like to, but i can't, cuz there is a lot of business sensitive information like hostnames, common logs, processlist, even drbd passwords :(

It would be helpful to at least get the scheduler input that led to the problem. At the time the problem occurred, one of the nodes was the designated controller (DC). It will have a log message like "Calculated transition ... saving inputs in ...". The last message before the problem occurred is the interesting one, and the file name is the input. You can uncompress it and edit out any sensitive information, then email it to [email protected].

kgaillot · 2024-02-07T17:15:15Z

Alternatively you can investigate the file yourself. I'd start with checking the resource configuration and make sure the resource parameters are set correctly there. If they're not, someone or something likely modified the configuration. If they are, the next thing I'd try is crm_simulate -Sx $FILENAME -G graph.xml. The command output should show a stop scheduled on the old node and a start scheduled on the new node (if not, you probably have the wrong input). The graph.xml file should have <rsc_op> entries for the stop and start with all the parameters that will be passed to the agent.
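
Once crm_simulate has written graph.xml, the check described above can be sketched with standard tools. This is a rough sketch, not a Pacemaker utility: the function names show_op_attributes and op_has_param are hypothetical, and the awk script assumes the graph is laid out with one XML element per line.

```shell
# Hypothetical helpers for reading a transition graph file.
# Assumes one XML element per line.

# Print the <attributes .../> line of the <rsc_op> with the given key.
# $1: graph file, $2: operation key, e.g. nsa-1.ny_stop_0
show_op_attributes() {
    awk -v key="$2" '
        # Opening (not self-closing) <rsc_op> tag for the requested key;
        # self-closing ones are only references inside <trigger>.
        /<rsc_op / && index($0, "operation_key=\"" key "\"") && !/\/>[ \t]*$/ {
            in_op = 1
            next
        }
        in_op && /<attributes / { print }
        /<\/rsc_op>/ { in_op = 0 }
    ' "$1"
}

# Succeed if the operation carries the given resource parameter.
# $3: parameter name, e.g. container
op_has_param() {
    show_op_attributes "$1" "$2" | grep -q "[[:space:]]$3=\""
}
```

For example, op_has_param graph.xml nsa-1.ny_stop_0 container || echo "container is not passed to stop" would flag a stop that is scheduled without the container parameter.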

iglov · 2024-02-07T21:25:34Z

Hey @kgaillot! Thanks for the explanations and your time!
Well, this is roughly what I have there:

# 0-5 synapses about stonith

<synapse id="6">
  <action_set>
    <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs/>
</synapse>
<synapse id="7">
  <action_set>
    <rsc_op id="33" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="8">
  <action_set>
    <rsc_op id="31" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs3.ny.local.priv" on_node_uuid="1">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs3.ny.local.priv" CRM_meta_on_node_uuid="1" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="9">
  <action_set>
    <crm_event id="26" operation="clear_failcount" operation_key="nsa-1.ny_clear_failcount_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_op_no_wait="true" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </crm_event>
  </action_set>
  <inputs/>
</synapse>

Looks good, doesn't it? I don't see anything wrong here. But if you still want, I can try to send you these pe-input files.

kgaillot · 2024-02-08T15:59:41Z

No, something's wrong. The resource parameters should be listed in <attributes> after the meta-attributes (like config="/mnt/cluster_volumes/lxc2/nsa-1.ny/config" container="nsa-1.ny"). Check the corresponding pe-input to see if those are properly listed under the relevant <primitive>.

iglov · 2024-02-08T16:10:35Z

Yep, sorry, you're right, my bad. I tried to find the resource nsa-1.ny in pe-input-250 (the last one before the failure), and that primitive isn't there at all. But it is in pe-input-249. Poof, it just disappeared...
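
The lookup described here, finding the first scheduler input in which the primitive no longer exists, can be sketched as follows. This is a minimal sketch: the function name pe_input_has_primitive is hypothetical, and the .bz2 suffix handling is an assumption for inputs that are still compressed.

```shell
# Hypothetical helper: succeed if a scheduler input file defines the
# primitive with the given id. Reads plain XML, or .bz2 via bzcat.
# $1: pe-input file, $2: resource id
# Note: the id is used inside a regular expression, so a "." in it
# matches any character; good enough for a quick check.
pe_input_has_primitive() {
    case "$1" in
        *.bz2) bzcat "$1" ;;
        *)     cat "$1" ;;
    esac | grep -q "<primitive[^>]* id=\"$2\""
}

# Example (file names are placeholders):
#   for f in pe-input-249 pe-input-250; do
#       if pe_input_has_primitive "$f" nsa-1.ny; then
#           echo "$f: present"
#       else
#           echo "$f: missing"
#       fi
#   done
```

Matching on <primitive rather than on the bare id avoids false positives from other elements in the same file that mention the resource id.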
