pageserver: failed gc-compaction runs in staging #9552

Closed
skyzh opened this issue Oct 28, 2024 · 3 comments · Fixed by #9765
skyzh (Member) commented Oct 28, 2024

Steps to reproduce

2024-10-21T20:06:35.796499Z ERROR request{method=PUT path=/v1/tenant/0d3d204617651d10e4e1b6adac554419-0408/timeline/436d7c005a9d2f3e24c2494c392d4c9a/compact request_id=d6779577-c753-4246-8cb8-d39e917fccf6}: Error processing HTTP request: InternalServerError(replay_history: key=000000067F00008000005F200C0000104827 d@17D1/F3C75F20 d@17D1/F3C75FD0
full_history: key=000000067F00008000005F200C0000104827 d@17D1/F3C75F20 d@17D1/F3C75FD0
when processing: [] horizon=17D4/2D4399F0
Caused by:
    invalid history, no base image
Stack backtrace:
   0: <T as core::convert::Into<U>>::into
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/convert/mod.rs:759:9
      pageserver::http::routes::timeline_compact_handler::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/http/routes.rs:1780:58
      core::result::Result<T,E>::map_err
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/result.rs:854:27
      pageserver::http::routes::timeline_compact_handler::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/http/routes.rs:1777:9
      <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
      pageserver::http::routes::timeline_compact_handler::{{closure}}
             at /home/nonroot/pageserver/src/http/routes.rs:1789:6
      pageserver::http::routes::api_handler::{{closure}}::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/http/routes.rs:2852:48
      <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
   1: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:328:17
      tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/loom/std/unsafe_cell.rs:16:9
      tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/core.rs:317:30
   2: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:485:19
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panic/unwind_safe.rs:272:9
      std::panicking::try::do_call
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:554:40
      std::panicking::try
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:518:19
      std::panic::catch_unwind
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panic.rs:345:14
      tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:473:18
      tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:208:27
      tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.1/src/runtime/task/harness.rs:153:15
   3: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   4: tokio::runtime::scheduler::multi_thread::worker::Context::run
   5: tokio::runtime::context::scoped::Scoped<T>::set
   6: tokio::runtime::context::runtime::enter_runtime
   7: tokio::runtime::scheduler::multi_thread::worker::run
   8: tokio::runtime::task::core::Core<T,S>::poll
   9: tokio::runtime::task::harness::Harness<T,S>::poll
  10: tokio::runtime::blocking::pool::Inner::run
  11: std::sys::backtrace::__rust_begin_short_backtrace
  12: core::ops::function::FnOnce::call_once{{vtable.shim}}
  13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2231:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2231:9
      std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/pal/unix/thread.rs:105:17
  14: <unknown>
  15: <unknown>)

2024-10-21T20:02:26.703182Z ERROR request{method=PUT path=/v1/tenant/087ac06fbe210d0183b7fc682536a3a8-0408/timeline/e1752fdcecc235afa3d385712ef767c1/compact request_id=5354b6d5-3d75-4b3a-a65d-ae86b25b3e11}: Error processing HTTP request: InternalServerError(replay_history: key=000000067F00004005000060080000007E82 d@4/570B37A8
full_history: key=000000067F00004005000060080000007E82 d@4/570B37A8
when processing: [] horizon=A/2A74C970
Caused by:
    invalid history, no base image
Stack backtrace:
   [frames 0 to 15 are identical to the backtrace in the first error above])

We need to dump the layers to see why the history is missing. We can then either fix it, or skip compaction for this specific key and retain its full history.
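To make the failure mode concrete, here is a minimal sketch of why a history made only of deltas (the `d@...` entries in the logs) cannot be replayed. It is not the pageserver's actual code: `Record`, `parse_lsn`, and `replay` are hypothetical names, and the "redo" step is a toy that just appends bytes. The only things taken from the logs are the LSN text format and the error message.

```rust
/// A record in one key's history. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Record {
    /// Full page image: a valid starting point for replay.
    Image(Vec<u8>),
    /// WAL delta: only meaningful on top of an earlier image.
    Delta(Vec<u8>),
}

/// Parse an LSN such as "17D1/F3C75F20" into a u64
/// (high 32 bits before the slash, low 32 bits after it, both hex).
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Replay a history sorted by ascending LSN.
/// Toy redo: applying a delta appends its bytes to the current page.
fn replay(history: &[(u64, Record)]) -> Result<Vec<u8>, String> {
    let mut page: Option<Vec<u8>> = None;
    for (_lsn, record) in history {
        match record {
            // An image resets the page; anything before it is irrelevant.
            Record::Image(img) => page = Some(img.clone()),
            // A delta needs an existing page to apply to.
            Record::Delta(d) => match page.as_mut() {
                Some(p) => p.extend_from_slice(d),
                None => return Err("invalid history, no base image".to_string()),
            },
        }
    }
    page.ok_or_else(|| "invalid history, no base image".to_string())
}
```

In the first log, both deltas (`17D1/F3C75F20` and `17D1/F3C75FD0`) sort below the horizon `17D4/2D4399F0`, and there is no image underneath them, which is the shape this sketch rejects.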

@skyzh skyzh added the t/bug Issue Type: Bug label Oct 28, 2024
@skyzh skyzh self-assigned this Oct 28, 2024
skyzh (Member, Author) commented Oct 28, 2024

part of #9114

skyzh (Member, Author) commented Nov 5, 2024

Looking at tenant 320dd426b7b72c4e091155482a8d8c93:

curl http://localhost:9898/v1/tenant/320dd426b7b72c4e091155482a8d8c93/timeline/1d3555cc3a5e87f58cfc8c50f729e92b/getpage\?key\=000000067F00004005000060270000000000\&lsn\=4/B0092040

This returns the correct result at that LSN, so the problem is likely in gc-compaction's read path.

skyzh (Member, Author) commented Nov 6, 2024

The missing key does not belong to the corresponding shard, so shard compaction might have removed some of its history. gc-compaction needs to either implement a shard filter or keep the history.
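The shard-filter option could look roughly like the sketch below: before building per-key histories, drop every key the current shard does not own, so that truncated histories are never replayed. All names here (`ShardIdentity`, `owns`, `filter_local_keys`) are hypothetical, and the modulo-based ownership rule is a placeholder assumption, not the pageserver's real key-to-shard mapping.

```rust
/// Hypothetical shard identity: which shard this is, out of how many.
#[derive(Debug, Clone, Copy)]
struct ShardIdentity {
    number: u32,
    count: u32,
}

impl ShardIdentity {
    /// Placeholder ownership rule (modulo of a precomputed key hash).
    /// The real mapping differs; only the filtering pattern matters here.
    fn owns(&self, key_hash: u64) -> bool {
        self.count <= 1 || key_hash % u64::from(self.count) == u64::from(self.number)
    }
}

/// Keep only the entries whose key belongs to `shard`.
/// Returns the kept entries and the number of entries dropped.
fn filter_local_keys<V>(
    shard: ShardIdentity,
    entries: Vec<(u64, V)>,
) -> (Vec<(u64, V)>, usize) {
    let total = entries.len();
    let kept: Vec<(u64, V)> = entries
        .into_iter()
        .filter(|(key_hash, _)| shard.owns(*key_hash))
        .collect();
    let dropped = total - kept.len();
    (kept, dropped)
}
```

Filtering up front keeps the replay logic unchanged. The alternative mentioned above, keeping the history of foreign keys untouched, would avoid dropping data but leaves those keys uncompacted.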

skyzh added a commit that referenced this issue Nov 18, 2024
close #9552, close #8920, part of #9114

## Summary of changes

* Drop keys not belonging to this shard during gc-compaction to avoid
constructing history that might have been truncated during shard
compaction.
* Run gc-compaction at the end of shard compaction test.

---------

Signed-off-by: Alex Chi Z <[email protected]>