Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Tests fail in Gentoo CI - Why? #502

Closed
freijon opened this issue Dec 17, 2024 · 13 comments
Closed

[Question] Tests fail in Gentoo CI - Why? #502

freijon opened this issue Dec 17, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@freijon
Copy link

freijon commented Dec 17, 2024

I've packaged limbo for Gentoo GURU.
On my system the test phase runs fine. However, on the CI machine the tests fail with the following output:

running 6 tests
test tests::test_sequential_write ... ignored
test tests::test_wal_checkpoint ... ignored
test tests::test_last_insert_rowid_basic ... FAILED
test tests::test_sequential_overflow_page ... FAILED
test tests::test_wal_restart ... FAILED
test tests::test_simple_overflow_page ... FAILED

failures:

---- tests::test_last_insert_rowid_basic stdout ----
thread 'tests::test_last_insert_rowid_basic' panicked at test/src/lib.rs:25:82:
called `Result::unwrap()` on an `Err` value: IOError(Os { code: 38, kind: Unsupported, message: "Function not implemented" })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::test_sequential_overflow_page stdout ----
thread 'tests::test_sequential_overflow_page' panicked at test/src/lib.rs:25:82:
called `Result::unwrap()` on an `Err` value: IOError(Os { code: 38, kind: Unsupported, message: "Function not implemented" })

---- tests::test_wal_restart stdout ----
thread 'tests::test_wal_restart' panicked at test/src/lib.rs:25:82:
called `Result::unwrap()` on an `Err` value: IOError(Os { code: 38, kind: Unsupported, message: "Function not implemented" })

---- tests::test_simple_overflow_page stdout ----
thread 'tests::test_simple_overflow_page' panicked at test/src/lib.rs:25:82:
called `Result::unwrap()` on an `Err` value: IOError(Os { code: 38, kind: Unsupported, message: "Function not implemented" })


failures:
    tests::test_last_insert_rowid_basic
    tests::test_sequential_overflow_page
    tests::test_simple_overflow_page
    tests::test_wal_restart

test result: FAILED. 0 passed; 4 failed; 2 ignored; 0 measured; 0 filtered out; finished in 0.00s

error: test failed, to rerun pass `-p core_tester --lib`

The only reason I can think of is that I'm missing some dependencies for the package that are installed on my machine but not on the CI machine. Any ideas from your side what the reason could be?

@LtdJorge
Copy link

Might be an IO_URING problem? Error 38 is due to a missing syscall, seems like your TinderBox is running kernel 6.1, so it might need a newer one. What kernel are you running in your personal machine?

Can you add RUST_BACKTRACE=full to the ebuild to see where exactly it is failing? There aren't many ? operators in those tests, but it's not clear.

@freijon
Copy link
Author

freijon commented Dec 17, 2024

I'm using 6.12.5
I could check for / require kernel option IO_URING=y, would that do the trick?
I've added RUST_BACKTRACE=full to the ebuild, might take a while until I get the new CI report though... I'll keep you posted.

@LtdJorge
Copy link

I'm using 6.12.5 I could check for / require kernel option IO_URING=y, would that do the trick? I've added RUST_BACKTRACE=full to the ebuild, might take a while until I get the new CI report though... I'll keep you posted.

I don’t think that kernel feature does much. IO_URING is enabled by default, just some kernels come with new features. It could also be an unrelated syscall. Let me check what the io_uring crate is using and see if 6.1 is ok.

@LtdJorge
Copy link

Seems like the only IO_URING opcodes being submitted are readv, writev and fsync. I hadn't thought about RUST_LOG before, could have added that to the ebuild. I'm using RUST_LOG=trace, but since I'm on 6.12.1, there's no problem here. I'm building a VM with a 6.1 kernel and I will let you know when I have more info.

@LtdJorge
Copy link

Well, Ubuntu even on Kernel 5.15.0 passes every test. For reference, this is the Kernel config it comes with:
config-5.15.0-127-generic.txt

@freijon
Copy link
Author

freijon commented Dec 18, 2024

The new CI report is here:

@penberg penberg added the bug Something isn't working label Dec 18, 2024
@penberg
Copy link
Collaborator

penberg commented Dec 18, 2024

I don't think we're using anything fancy in io_uring that would require a brand new kernel... Can this be a filesystem related limitation?

@LtdJorge
Copy link

Hi Pekka. Sorry, forgot to press Comment, this is what I found:

I had traced it to TempDatabase::new(), seems like limbo_core::PlatformIO::new().unwrap() was where the panic occurred, and that is where the io_uring ring gets built:
image
The new log mirrors this.

I emailed Agostino and he confirmed that CONFIG_IO_URING is not set in the kernel for tinderboxes. I'll ask him if it will be enabled. If it can't, can this make it to GURU without closing the bug?

On limbo's side, there is only the io_uring backend on Linux. I wonder, is the team considering adding a different backend just as a fallback? While io_uring is enabled by default, there are platforms limbo will be run on where it won't be enabled. #494 seems to me like exactly the same issue, Google is known to disable io_uring on Android due to potential security issues found early on, so they might be doing the same for Colab machines.

Of course, adding a fallback to io_uring would kind of defeat the purpose of testing and the tinderbox CI too.

@freijon
Copy link
Author

freijon commented Dec 18, 2024

If it can't, can this make it to GURU without closing the bug?

I'm fine with leaving the bug open indefinitely (this will automatically skip further tests). I'll document our findings and link to this bug.

I agree that some kind of fallback solution would be nice, but I understand if it's not a priority right now. And therefore the problem seems to be on the tinderbox and not with limbo.

But if limbo is unusable with unset CONFIG_IO_URING (for now), wouldn't it make sense to require it in the ebuild?

@penberg
Copy link
Collaborator

penberg commented Dec 18, 2024

@LtdJorge The core/io/darwin.rs backend should actually work on Linux just fine (maybe with minor modifications). If someone wants to rename core/io/linux.rs to core/io/io_uring.rs and core/io/darwin.rs to core/io/unix.rs and make backend configurable on Linux, I am totally fine with that.

@LtdJorge
Copy link

@freijon my experience with ebuilds is limited to a couple patches I had to make on my system. Does setting the value force the tinderbox to use a kernel with io_uring enabled? I was under the impression that every tinderbox was equal, but if not, then we would only need that setting to pass the tests. Enable it, and let it have a go again. Worst case, nothing changes.

@penberg I'll get on it. For making it configurable, though, what should we use? I suppose the easiest is a Cargo feature since I'm not seeing configurability in the code. I'm not sure how much you know about Gentoo and Portage, but it basically comes with a very similar system to Cargo features, USE flags. For example, an io_uring feature could be mapped to an io_uring USE flag since installation in Gentoo means building the source by default.

However, if we have an io backend unix and a backend io_uring, simply enabling them with a feature doesn't select one of them. Kind of the same happens with USE flags, although those can be made mutually exclusive. So we may need another way to configure the backend or provide a default (which is how it works now).

@LtdJorge
Copy link

However, if we have an io backend unix and a backend io_uring, simply enabling them with a feature doesn't select one of them. Kind of the same happens with USE flags, although those can be made mutually exclusive. So we may need another way to configure the backend or provide a default (which is how it works now).

I think the best solution is to create a disable-io-uring Cargo feature just for the environments that need it, and to add a #[cfg(target_family = “unix”)] to the unix backend and a #[cfg(all(target_os = “linux”, not(feature = “disable-io-uring”)))] to the io_uring backend.

@penberg let me know what you think. Also, should I open a new issue for this? I don’t want to pollute this one more.

@freijon
Copy link
Author

freijon commented Dec 20, 2024

Does setting the value force the tinderbox to use a kernel with io_uring enabled?

Unfortunately not. Also, the check at build time to require IO_URING is not ideal because this check happens before the build and there is no guarantee that the user will use limbo with the same kernel configuration as during build time.

I ended up with showing a warning if IO_URING is not available. I thought about disabling the tests entirely, but I think tests are a good thing to have. So for now we'll just leave the gentoo-bug open, this will skip further tests on the tinderbox. And when the fallback is ready I'll implement a USE flag.

Thank you for your support!

@freijon freijon closed this as completed Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants