task/internal/syslog: Add capability to ignore kernel failures #1666

Open · wants to merge 1 commit into main

Conversation

@kotreshhr commented Aug 5, 2021

Adds the capability for jobs to ignore kernel failures. This is done by
adding a 'syslog' dict to the config dictionary, which holds the 'ignorelist'
of kernel failures.

Also removes old kernel failures from the exclude list.

Fixes: https://tracker.ceph.com/issues/50150
Signed-off-by: Kotresh HR [email protected]
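
For illustration, here is a minimal sketch (not the actual patch) of how a job-provided `syslog: ignorelist:` might be merged with the built-in exclusions used when scanning kern.log. The helper name, the default list, and the YAML fragment are assumptions, not code from this PR.

```python
def build_exclude_patterns(config):
    """Combine built-in exclusions with a job-provided ignorelist (sketch)."""
    # Existing-style built-in exclusion, as seen elsewhere in this task:
    default_exclude = [r'\btcmu-runner\b.*\bINFO\b']
    syslog_conf = config.get('syslog') or {}
    return default_exclude + list(syslog_conf.get('ignorelist') or [])

# A job YAML fragment would then look roughly like:
#
#   syslog:
#     ignorelist:
#       - 'WARNING: .* at fs/ceph/mds_client\.c'
```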

teuthology/run.py (outdated review thread, resolved)
teuthology/task/internal/syslog.py (outdated review thread, resolved)
run.Raw('|'),
'egrep', '-v', '\\btcmu-runner\\b.*\\bINFO\\b',
run.Raw('|'),
'head', '-n', '1',
Member

Which of these need to be moved to the qa suite? Please post a ceph PR. It needs to be backported to octopus/pacific before this can be merged.

Author

I think most of this is stale. We will have to run all relevant suites and find out.

Contributor

We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

@kotreshhr (Author) commented Aug 6, 2021

@batrick What's the best filter to exercise only kernel code? Which of the following covers all?

1. `-s fs --filter mount/kclient`: this listed more than 600 jobs
2. `-s fs --filter kernel`: this listed around 54 jobs

Member

We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

In the past, this has been confusing to newcomers. There's all sorts of magic defaults in teuthology (e.g. the log ignorelist). Best to move these to the ceph.git qa/ when possible.

Member

@batrick What's the best filter to exercise only kernel code? Which of the following covers all?

1. `-s fs --filter mount/kclient`: this listed more than 600 jobs

This will have the best coverage. Add --subset x/16.

@idryomov (Contributor) commented Aug 6, 2021

Since this started out as "begin grepping kernel logs for kclient warnings/failures", is there a suspicion that the existing greps don't work (or perhaps work but not on all distros)? Have you attempted to trigger a WARNING or BUG and observe the test failing?

@batrick (Member) commented Aug 6, 2021

Since this started out as "begin grepping kernel logs for kclient warnings/failures", is there a suspicion that the existing greps don't work (or perhaps work but not on all distros)? Have you attempted to trigger a WARNING or BUG and observe the test failing?

Honestly, we didn't know about this syslog grep at the start. I was concerned there was no detection of kernel faults at all since I would occasionally see warnings that should fail tests in a syslog.

So I guess we're both concerned that this grep wasn't catching genuine faults/warnings, that it's not easily configurable, and that it's not easily discovered.

@idryomov (Contributor) commented Aug 6, 2021

I remember it breaking a few years ago -- something to do with traditional syslog vs systemd journal interaction but it was fixed then. It might be the case that it broke again, which is why I asked. The issue is not with the grep invocation but rather with what is being grepped (i.e. does that kern.log file actually contain any kernel log messages at the time grep is invoked, etc).

Removing outdated exclude items is obviously lower priority than resurrecting the core functionality in case it is broken.

@kotreshhr changed the title from "task/internal/syslog: Remove old kernel failures from exclude list" to "task/internal/syslog: Add capability to ignore kernel failures @kotreshhr" on Aug 9, 2021
@kotreshhr changed the title from "task/internal/syslog: Add capability to ignore kernel failures @kotreshhr" to "task/internal/syslog: Add capability to ignore kernel failures" on Aug 9, 2021
@batrick (Member) commented Jan 10, 2022

@kotreshhr what's the status on this PR?

@kotreshhr (Author)

@kotreshhr what's the status on this PR?

@batrick I think it's good to be merged. Below is the run I triggered and Jeff's comments on it. Sorry, I should have discussed it here; it was on the cephfs-team mailing list. Copying it here for reference.


Hi Jeff,

I was working on removing old kernel failures which are masked by the
present teuthology code. I ran teuthology on my PR [1] with the filter
"-s fs --filter mount/kclient" and found the following unique failures
in syslog. I just wanted to check with you whether these are real
failures that need attention.

The complete run can be found at [2].

Many thanks for doing this. Better testing with teuthology is definitely
an area where I could use a lot of help. FWIW, the pulpito links below
only work if you're on the sepia lab VPN.

The kernel is built from commit 7c1f4b5e3842, so this is missing some of
the more recent patches in testing branch.

'2021-08-14T12:28:01.665935+00:00 smithi071 kernel: WARNING: CPU: 3 PID:
39974 at fs/ceph/mds_client.c:4495 check_session_state+0x55/0x60 [ceph] '
in syslog
http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327912

I suspect this is fixed with a recent patch from Xiubo (1a45b99820e7).
Do we know whether this test does a 'umount -f' ?

'2021-08-14T13:05:50.330033+00:00 smithi041 kernel: INFO: task
fsstress:47378 blocked for more than 122 seconds. ' in syslog
http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327931

Doesn't look familiar. It looks like the client is just waiting on an
OSD reply (probably for a write since this is in sync()).

'2021-08-14T14:06:15.476389+00:00 smithi062 kernel: INFO: task
kcompactd0:59 blocked for more than 122 seconds. ' in syslog
http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327954

Also doesn't look familiar. The task is hung waiting for dirty pages to
be written out. We don't know to which filesystem (but it seems likely
that it was cephfs). Possibly related to #2 above.

'2021-08-14T13:37:56.676192+00:00 smithi052 kernel: WARNING: CPU: 3 PID:
101 at crypto/testmgr.c:5653 alg_test+0x245/0x450 ' in syslog

http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327978

Looks unrelated to ceph at all. This one happened at boot time. Probably
some bug in the crypto code. This is a -rc4 based kernel after all, so
there are bugs in there...

'2021-08-14T17:11:12.918237+00:00 smithi027 kernel: [ 969.167486] INFO:
task ffsb:47395 blocked for more than 120 seconds. ' in syslog

http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6328144

Not much to go on here. There are no stack traces, AFAICT, and it's
unclear why. Maybe we need to do something in Ubuntu distros to turn
them on?

'2021-08-15T06:15:50.378319+00:00 smithi019 kernel: INFO: task iozone:67133
blocked for more than 120 seconds. ' in syslog

http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6328732

Hung (briefly) while waiting on ceph_fsync. It looks like this one might
have eventually recovered as we don't see any more messages about
blocked tasks even after another 120s. This may just be due to slow OSD
responses, and could also be related to #2 and/or #3.

[1] #1666
[2]
http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/


@batrick (Member) left a comment

Otherwise LGTM; please post a teuthology run with the ceph PR requested for qa/cephfs/begin.yaml.

teuthology/suite/placeholder.py (outdated review thread, resolved)
'egrep', '--binary-files=text',
'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b',
'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b|\\bOops\\b|\\bWARNING\\b|\\bKASAN\\b',
Member

@jtlayton are we missing anything here or LGTY?
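
To make the keyword discussion concrete, here is a rough sketch of how the scan pipeline might be assembled once the keyword set is expanded and the exclude list becomes configurable. The function name, the kern.log path, and the one-`egrep -v`-per-pattern structure are assumptions for illustration, not this PR's actual code.

```python
from teuthology.orchestra import run


def kern_log_scan_args(exclude_patterns, kern_log='/var/log/kern.log'):
    """Build the argument list for a kern.log scan pipeline (sketch)."""
    args = [
        'egrep', '--binary-files=text',
        r'\bBUG\b|\bINFO\b|\bDEADLOCK\b|\bOops\b|\bWARNING\b|\bKASAN\b',
        kern_log,
    ]
    for pattern in exclude_patterns:
        # one `egrep -v` per exclusion, mirroring the existing style
        args.extend([run.Raw('|'), 'egrep', '-v', pattern])
    args.extend([run.Raw('|'), 'head', '-n', '1'])
    return args
```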

@batrick mentioned this pull request on Jan 13, 2022
@batrick (Member) commented Jan 13, 2022

Teuthology run triggered: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-13_16:30:18-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

I killed this, @kotreshhr. It has 1000+ jobs. Please use a subset.

@kotreshhr (Author)

Teuthology run triggered: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-13_16:30:18-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

I killed this, @kotreshhr. It has 1000+ jobs. Please use a subset.

@batrick I am not sure I understand the --subset option correctly. I did give 3/16. Any suggestions?

@batrick (Member) commented Jan 14, 2022

Here is what I usually run and it gets about 300 jobs:

teuthology-suite --machine-type smithi --email [email protected] -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch

@kotreshhr (Author)

Here is what I usually run and it gets about 300 jobs:

teuthology-suite --machine-type smithi --email [email protected] -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch

Thanks @batrick, here is the teuthology run:
http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

@batrick (Member) commented Jan 22, 2022

Here is what I usually run and it gets about 300 jobs:
teuthology-suite --machine-type smithi --email [email protected] -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch

Thanks @batrick, here is the teuthology run: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

There are some ... weird failures in that run. Please do a --rerun of it to see if they're transient.

@kotreshhr (Author)

Here is what I usually run and it gets about 300 jobs:
teuthology-suite --machine-type smithi --email [email protected] -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch

Thanks @batrick, here is the teuthology run: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

There are some ... weird failures in that run. Please do a --rerun of it to see if they're transient.

Rerun QA link:
http://pulpito.front.sepia.ceph.com/khiremat-2022-01-22_07:42:15-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

@batrick previously approved these changes Jan 24, 2022
@batrick (Member) commented Jan 24, 2022

Good to merge IMO from the CephFS side. @idryomov do you want to have another look? Is an rbd run needed?

@idryomov (Contributor) left a comment

I'd like to see confirmation that the whole grep invocation is actually working. Could you please build a kernel that emits a custom WARNING or ERROR on e.g. CephFS mount and run this against that kernel (making sure that all supported distros are covered by that run, because there used to be annoying differences in how dmesg was persisted between distros)? Then do another run with that custom log message added to the ignorelist, again on all distros. Or maybe you already did something like that?

teuthology/suite/placeholder.py (outdated review thread, resolved)
teuthology/task/internal/syslog.py (outdated review thread, resolved)
@batrick (Member) commented Jan 26, 2022

I'd like to see a confirmation that the whole grep invocation is actually working. Could you please build a kernel that would emit a custom WARNING or ERROR

Would it be crazy to add a patch to the upstream kernel where a mount option triggers a warning/error so we can continuously verify this works?

@idryomov (Contributor)

Definitely crazy IMO. But we could find a way to trigger something with one of the "bad" keywords in it on vanilla kernels. For example, take all OSDs out against some in-flight I/O -- that should trigger "INFO: task ... blocked for more than %ld seconds".

@idryomov (Contributor)

... or we could carry the kind of patch you envision in our testing branch.

@jtlayton (Contributor)

Would it be crazy to add a patch to the upstream kernel where a mount option triggers a warning/error so we can continuously verify this works?

I wouldn't use a mount option. This sounds more like a job for debugfs or maybe consider the fault injection framework documented here:

https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html

@batrick (Member) commented Jan 27, 2022

Hmm, I think we can just use network namespaces to partition the kernel mount from the network (suspend_netns in mount.py), wait a reasonable amount of time for complaints like the osd timeout, then exit. We just need to grep for the expected failure message (or fail) and make sure the ignorelist works by ignoring the message. We don't currently grep for the expected failure message; perhaps that should go in this PR.
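
A very rough sketch of that validation flow, assuming a kernel-mount object exposing suspend_netns()/resume_netns() as in ceph.git's qa/tasks/cephfs/mount.py and a read_kern_log() helper; both names and the 300-second wait are assumptions, not part of this PR.

```python
import time

EXPECTED = 'blocked for more than'


def validate_kernel_fault_detection(kmount, read_kern_log):
    """Partition the kernel client, wait for a hung-task warning, check the log (sketch)."""
    # An I/O workload (e.g. writing a large file) must already be running
    # against kmount for a hung-task warning to be reported.
    kmount.suspend_netns()       # cut the kernel client off from the cluster
    try:
        time.sleep(300)          # long enough for hung-task warnings to fire
    finally:
        kmount.resume_netns()
    assert EXPECTED in read_kern_log(), \
        'expected hung-task warning was not emitted'
    # A second run would add EXPECTED to syslog.ignorelist and confirm the
    # job no longer fails on it.
```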

@djgalloway changed the base branch from master to main on June 1, 2022 at 17:03
@vshankar (Contributor)

I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

@batrick (Member) commented Mar 15, 2024

I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...

Adds the capability for jobs to ignore kernel failures. This is done by
adding a 'syslog' dict to the config dictionary, which holds the 'ignorelist'
of kernel failures.

Also removes old kernel failures from the exclude list.

Fixes: https://tracker.ceph.com/issues/50150
Signed-off-by: Kotresh HR <[email protected]>
@kotreshhr (Author)

I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...

@batrick Does that mean I write a teuthology test that kills OSDs while writing a large file?

@vshankar (Contributor)

I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...

@batrick Does that mean I write a teuthology test that kills OSDs while writing a large file?

As discussed during standup, that's a good start to validate the filter.
