safekeeper: lift decoding and interpretation of WAL to the safekeeper (#9746)

## Problem

For any given tenant shard, the pageserver receives all of the tenant's WAL from the safekeeper, not just the records relevant to that shard.
This soft-blocks us from using larger shard counts because of the bandwidth cost and the CPU overhead of filtering out the records that belong to other shards.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper.

A customised PG replication protocol is used: instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so on the pageserver side the new protocol is gated by a new pageserver config option: `wal_receiver_protocol`.

In more detail, the changes are:
1. Optionally inject the protocol and shard identity into the arguments used for starting replication.
2. On the safekeeper side, implement a new WAL sending primitive which decodes and interprets records before sending them over.
3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw WAL (minus decoding and interpreting).
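
To make the intent of step 2 concrete, here is a minimal sketch of sender-side shard filtering. Everything in it is illustrative: `Record`, `ShardIdentity::owns` and `filter_for_shard` are hypothetical names, and the striping rule is a placeholder, not Neon's real key-to-shard mapping.

```rust
/// Hypothetical, simplified stand-in for a decoded WAL record.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    /// Key the record touches (illustrative; real keys are richer).
    key: u64,
    payload: Vec<u8>,
}

/// Identity of the shard that opened the replication connection.
#[derive(Debug, Clone, Copy)]
struct ShardIdentity {
    number: u8,
    count: u8,
    stripe_size: u32,
}

impl ShardIdentity {
    /// Placeholder striping rule. This is an assumption for the sketch,
    /// NOT the mapping used by the real code.
    fn owns(&self, key: u64) -> bool {
        let stripe = key / u64::from(self.stripe_size);
        stripe % u64::from(self.count) == u64::from(self.number)
    }
}

/// Sender-side filtering: only records owned by `shard` go over the wire.
fn filter_for_shard(records: &[Record], shard: ShardIdentity) -> Vec<Record> {
    records
        .iter()
        .filter(|r| shard.owns(r.key))
        .cloned()
        .collect()
}

fn main() {
    let records: Vec<Record> = (0..8)
        .map(|key| Record { key, payload: vec![key as u8] })
        .collect();
    let shard = ShardIdentity { number: 1, count: 4, stripe_size: 1 };

    let sent = filter_for_shard(&records, shard);
    let bytes: usize = sent.iter().map(|r| r.payload.len()).sum();
    println!("sent {} of {} records ({} bytes)", sent.len(), records.len(), bytes);
}
```

With the placeholder rule, a four-shard tenant sends each receiver roughly a quarter of the records instead of all of them, which is the bandwidth and CPU saving the PR is after.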

## Notes

* This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support), which includes the deserialization logic for the new replication message type. The PR for that is open [here](neondatabase/rust-postgres#32).
* This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases.
* CI tests are running on #9747.

## Links

Related: #9336
Epic: #9329
VladLazar authored Nov 25, 2024
1 parent 5c23569 commit 7a2f0ed
Showing 26 changed files with 870 additions and 86 deletions.
11 changes: 7 additions & 4 deletions Cargo.lock


7 changes: 6 additions & 1 deletion libs/pageserver_api/src/config.rs
@@ -18,7 +18,7 @@ use std::{
     str::FromStr,
     time::Duration,
 };
-use utils::logging::LogFormat;
+use utils::{logging::LogFormat, postgres_client::PostgresClientProtocol};

 use crate::models::ImageCompressionAlgorithm;
 use crate::models::LsnLease;
@@ -120,6 +120,7 @@ pub struct ConfigToml {
     pub no_sync: Option<bool>,
     #[serde(with = "humantime_serde")]
     pub server_side_batch_timeout: Option<Duration>,
+    pub wal_receiver_protocol: PostgresClientProtocol,
 }

 #[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
@@ -330,6 +331,9 @@ pub mod defaults {
     pub const DEFAULT_IO_BUFFER_ALIGNMENT: usize = 512;

     pub const DEFAULT_SERVER_SIDE_BATCH_TIMEOUT: Option<&str> = None;
+
+    pub const DEFAULT_WAL_RECEIVER_PROTOCOL: utils::postgres_client::PostgresClientProtocol =
+        utils::postgres_client::PostgresClientProtocol::Vanilla;
 }

 impl Default for ConfigToml {
@@ -418,6 +422,7 @@ impl Default for ConfigToml {
                 .map(|duration| humantime::parse_duration(duration).unwrap()),
             tenant_config: TenantConfigToml::default(),
             no_sync: None,
+            wal_receiver_protocol: DEFAULT_WAL_RECEIVER_PROTOCOL,
         }
     }
 }
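
For readers who want to see how the new setting resolves, the following is a std-only sketch. It assumes the config strings are `vanilla` and `interpreted` (which is what the kebab-case strum attribute on the enum would produce), and `resolve_protocol` is a hypothetical helper, not a function from this PR.

```rust
use std::fmt;
use std::str::FromStr;

/// Std-only mirror of the enum added in `libs/utils/src/postgres_client.rs`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PostgresClientProtocol {
    Vanilla,
    Interpreted,
}

const DEFAULT_WAL_RECEIVER_PROTOCOL: PostgresClientProtocol = PostgresClientProtocol::Vanilla;

impl FromStr for PostgresClientProtocol {
    type Err = String;

    // Hand-written stand-in for the derived kebab-case string parsing.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "vanilla" => Ok(PostgresClientProtocol::Vanilla),
            "interpreted" => Ok(PostgresClientProtocol::Interpreted),
            other => Err(format!("unknown wal_receiver_protocol: {}", other)),
        }
    }
}

impl fmt::Display for PostgresClientProtocol {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str(match self {
            PostgresClientProtocol::Vanilla => "vanilla",
            PostgresClientProtocol::Interpreted => "interpreted",
        })
    }
}

/// Hypothetical helper: use the configured value if present,
/// otherwise fall back to the default (the raw-WAL protocol).
fn resolve_protocol(configured: Option<&str>) -> Result<PostgresClientProtocol, String> {
    match configured {
        Some(s) => s.parse(),
        None => Ok(DEFAULT_WAL_RECEIVER_PROTOCOL),
    }
}

fn main() {
    println!("{}", resolve_protocol(None).unwrap());
    println!("{}", resolve_protocol(Some("interpreted")).unwrap());
    println!("{:?}", resolve_protocol(Some("Interpreted")));
}
```

Leaving the option out keeps existing pageservers on raw WAL, which is why the PR can be merged before the new protocol is enabled anywhere.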
36 changes: 36 additions & 0 deletions libs/pq_proto/src/lib.rs
@@ -562,6 +562,9 @@ pub enum BeMessage<'a> {
         options: &'a [&'a str],
     },
     KeepAlive(WalSndKeepAlive),
+    /// Batch of interpreted, shard filtered WAL records,
+    /// ready for the pageserver to ingest
+    InterpretedWalRecords(InterpretedWalRecordsBody<'a>),
 }

 /// Common shorthands.
@@ -672,6 +675,25 @@ pub struct WalSndKeepAlive {
     pub request_reply: bool,
 }

+/// Batch of interpreted WAL records used in the interpreted
+/// safekeeper to pageserver protocol.
+///
+/// Note that the pageserver uses the RawInterpretedWalRecordsBody
+/// counterpart of this from the neondatabase/rust-postgres repo.
+/// If you're changing this struct, you likely need to change its
+/// twin as well.
+#[derive(Debug)]
+pub struct InterpretedWalRecordsBody<'a> {
+    /// End of raw WAL in [`Self::data`]
+    pub streaming_lsn: u64,
+    /// Current end of WAL on the server
+    pub commit_lsn: u64,
+    /// Start LSN of the next record in PG WAL.
+    /// Is 0 if the portion of PG WAL did not contain any records.
+    pub next_record_lsn: u64,
+    pub data: &'a [u8],
+}
+
 pub static HELLO_WORLD_ROW: BeMessage = BeMessage::DataRow(&[Some(b"hello world")]);

 // single text column
@@ -996,6 +1018,20 @@ impl BeMessage<'_> {
                     Ok(())
                 })?
             }
+
+            BeMessage::InterpretedWalRecords(rec) => {
+                // We use the COPY_DATA_TAG for our custom message
+                // since this tag is interpreted as raw bytes.
+                buf.put_u8(b'd');
+                write_body(buf, |buf| {
+                    buf.put_u8(b'0'); // matches INTERPRETED_WAL_RECORD_TAG in postgres-protocol
+                                      // dependency
+                    buf.put_u64(rec.streaming_lsn);
+                    buf.put_u64(rec.commit_lsn);
+                    buf.put_u64(rec.next_record_lsn);
+                    buf.put_slice(rec.data);
+                });
+            }
         }
         Ok(())
     }
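
The serialization arm above implies a simple wire layout. The sketch below encodes and decodes it with the standard library only. It assumes the usual Postgres framing (a 4-byte big-endian length that counts itself) and big-endian integers; `InterpretedBody`, `encode` and `decode` are illustrative names, and the real decoder lives in the rust-postgres branch mentioned in the notes.

```rust
const COPY_DATA_TAG: u8 = b'd';
const INTERPRETED_WAL_RECORD_TAG: u8 = b'0';
/// tag (1) + length (4) + sub-tag (1) + three u64 LSNs (24)
const HEADER_LEN: usize = 30;

/// Owned, simplified mirror of `InterpretedWalRecordsBody`.
#[derive(Debug, PartialEq)]
struct InterpretedBody {
    streaming_lsn: u64,
    commit_lsn: u64,
    next_record_lsn: u64,
    data: Vec<u8>,
}

fn encode(body: &InterpretedBody) -> Vec<u8> {
    let mut payload = vec![INTERPRETED_WAL_RECORD_TAG];
    payload.extend_from_slice(&body.streaming_lsn.to_be_bytes());
    payload.extend_from_slice(&body.commit_lsn.to_be_bytes());
    payload.extend_from_slice(&body.next_record_lsn.to_be_bytes());
    payload.extend_from_slice(&body.data);

    let mut out = vec![COPY_DATA_TAG];
    // Assumption: the length field counts itself but not the leading tag.
    let len = (payload.len() + 4) as u32;
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(&payload);
    out
}

/// Reads a big-endian u64 from the first 8 bytes of `bytes`.
fn read_u64(bytes: &[u8]) -> u64 {
    let mut arr = [0u8; 8];
    arr.copy_from_slice(&bytes[..8]);
    u64::from_be_bytes(arr)
}

fn decode(msg: &[u8]) -> Option<InterpretedBody> {
    if msg.len() < HEADER_LEN
        || msg[0] != COPY_DATA_TAG
        || msg[5] != INTERPRETED_WAL_RECORD_TAG
    {
        return None;
    }
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&msg[1..5]);
    // The declared length must cover everything after the leading tag.
    if u32::from_be_bytes(len_bytes) as usize + 1 != msg.len() {
        return None;
    }
    Some(InterpretedBody {
        streaming_lsn: read_u64(&msg[6..]),
        commit_lsn: read_u64(&msg[14..]),
        next_record_lsn: read_u64(&msg[22..]),
        data: msg[HEADER_LEN..].to_vec(),
    })
}

fn main() {
    let body = InterpretedBody {
        streaming_lsn: 0x1000,
        commit_lsn: 0x2000,
        next_record_lsn: 0,
        data: vec![1, 2, 3],
    };
    let wire = encode(&body);
    println!("{} bytes on the wire", wire.len());
    println!("round trip ok: {}", decode(&wire) == Some(body));
}
```

Reusing the CopyData tag keeps the outer framing compatible with clients that treat such messages as opaque bytes; only the sub-tag byte tells the pageserver that the payload is a batch of interpreted records.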
1 change: 1 addition & 0 deletions libs/utils/Cargo.toml
@@ -33,6 +33,7 @@ pprof.workspace = true
 regex.workspace = true
 routerify.workspace = true
 serde.workspace = true
+serde_with.workspace = true
 serde_json.workspace = true
 signal-hook.workspace = true
 thiserror.workspace = true
95 changes: 80 additions & 15 deletions libs/utils/src/postgres_client.rs
@@ -7,29 +7,94 @@ use postgres_connection::{parse_host_port, PgConnectionConfig};

 use crate::id::TenantTimelineId;

+/// Postgres client protocol types
+#[derive(
+    Copy,
+    Clone,
+    PartialEq,
+    Eq,
+    strum_macros::EnumString,
+    strum_macros::Display,
+    serde_with::DeserializeFromStr,
+    serde_with::SerializeDisplay,
+    Debug,
+)]
+#[strum(serialize_all = "kebab-case")]
+#[repr(u8)]
+pub enum PostgresClientProtocol {
+    /// Usual Postgres replication protocol
+    Vanilla,
+    /// Custom shard-aware protocol that replicates interpreted records.
+    /// Used to send wal from safekeeper to pageserver.
+    Interpreted,
+}
+
+impl TryFrom<u8> for PostgresClientProtocol {
+    type Error = u8;
+
+    fn try_from(value: u8) -> Result<Self, Self::Error> {
+        Ok(match value {
+            v if v == (PostgresClientProtocol::Vanilla as u8) => PostgresClientProtocol::Vanilla,
+            v if v == (PostgresClientProtocol::Interpreted as u8) => {
+                PostgresClientProtocol::Interpreted
+            }
+            x => return Err(x),
+        })
+    }
+}
+
+pub struct ConnectionConfigArgs<'a> {
+    pub protocol: PostgresClientProtocol,
+
+    pub ttid: TenantTimelineId,
+    pub shard_number: Option<u8>,
+    pub shard_count: Option<u8>,
+    pub shard_stripe_size: Option<u32>,
+
+    pub listen_pg_addr_str: &'a str,
+
+    pub auth_token: Option<&'a str>,
+    pub availability_zone: Option<&'a str>,
+}
+
+impl<'a> ConnectionConfigArgs<'a> {
+    fn options(&'a self) -> Vec<String> {
+        let mut options = vec![
+            "-c".to_owned(),
+            format!("timeline_id={}", self.ttid.timeline_id),
+            format!("tenant_id={}", self.ttid.tenant_id),
+            format!("protocol={}", self.protocol as u8),
+        ];
+
+        if self.shard_number.is_some() {
+            assert!(self.shard_count.is_some());
+            assert!(self.shard_stripe_size.is_some());
+
+            options.push(format!("shard_count={}", self.shard_count.unwrap()));
+            options.push(format!("shard_number={}", self.shard_number.unwrap()));
+            options.push(format!(
+                "shard_stripe_size={}",
+                self.shard_stripe_size.unwrap()
+            ));
+        }
+
+        options
+    }
+}
+
 /// Create client config for fetching WAL from safekeeper on particular timeline.
 /// listen_pg_addr_str is in form host:\[port\].
 pub fn wal_stream_connection_config(
-    TenantTimelineId {
-        tenant_id,
-        timeline_id,
-    }: TenantTimelineId,
-    listen_pg_addr_str: &str,
-    auth_token: Option<&str>,
-    availability_zone: Option<&str>,
+    args: ConnectionConfigArgs,
 ) -> anyhow::Result<PgConnectionConfig> {
     let (host, port) =
-        parse_host_port(listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
+        parse_host_port(args.listen_pg_addr_str).context("Unable to parse listen_pg_addr_str")?;
     let port = port.unwrap_or(5432);
     let mut connstr = PgConnectionConfig::new_host_port(host, port)
-        .extend_options([
-            "-c".to_owned(),
-            format!("timeline_id={}", timeline_id),
-            format!("tenant_id={}", tenant_id),
-        ])
-        .set_password(auth_token.map(|s| s.to_owned()));
+        .extend_options(args.options())
+        .set_password(args.auth_token.map(|s| s.to_owned()));

-    if let Some(availability_zone) = availability_zone {
+    if let Some(availability_zone) = args.availability_zone {
         connstr = connstr.extend_options([format!("availability_zone={}", availability_zone)]);
     }

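
As an illustration of how the protocol and shard identity travel in the connection options, here is a std-only sketch. `ShardParams`, `build_options` and `parse_protocol` are hypothetical names: grouping the three shard fields into one struct is an alternative to the asserts in the diff, not what the PR does, and the parsing side is only a guess at what a receiver of these options might look like.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum PostgresClientProtocol {
    Vanilla,     // discriminant 0
    Interpreted, // discriminant 1
}

impl TryFrom<u8> for PostgresClientProtocol {
    type Error = u8;

    fn try_from(value: u8) -> Result<Self, u8> {
        match value {
            0 => Ok(PostgresClientProtocol::Vanilla),
            1 => Ok(PostgresClientProtocol::Interpreted),
            other => Err(other),
        }
    }
}

/// Hypothetical grouping: all three shard parameters or none of them.
#[derive(Debug, Clone, Copy)]
struct ShardParams {
    number: u8,
    count: u8,
    stripe_size: u32,
}

fn build_options(
    timeline_id: &str,
    tenant_id: &str,
    protocol: PostgresClientProtocol,
    shard: Option<ShardParams>,
) -> Vec<String> {
    let mut options = vec![
        "-c".to_owned(),
        format!("timeline_id={}", timeline_id),
        format!("tenant_id={}", tenant_id),
        format!("protocol={}", protocol as u8),
    ];
    if let Some(s) = shard {
        options.push(format!("shard_count={}", s.count));
        options.push(format!("shard_number={}", s.number));
        options.push(format!("shard_stripe_size={}", s.stripe_size));
    }
    options
}

/// Hypothetical receiving side: recover the protocol from the options.
fn parse_protocol(options: &[String]) -> Option<PostgresClientProtocol> {
    options
        .iter()
        .filter_map(|opt| opt.strip_prefix("protocol="))
        .filter_map(|v| v.parse::<u8>().ok())
        .filter_map(|v| PostgresClientProtocol::try_from(v).ok())
        .next()
}

fn main() {
    let shard = ShardParams { number: 2, count: 8, stripe_size: 32768 };
    let options = build_options("tl", "tn", PostgresClientProtocol::Interpreted, Some(shard));
    println!("{}", options.join(" "));
    println!("{:?}", parse_protocol(&options));
}
```

Encoding the protocol as its `u8` discriminant keeps the option value compact, and the `TryFrom<u8>` impl gives the other side a way to reject values it does not know.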
12 changes: 12 additions & 0 deletions libs/wal_decoder/src/models.rs
@@ -65,6 +65,18 @@ pub struct InterpretedWalRecord {
     pub xid: TransactionId,
 }

+impl InterpretedWalRecord {
+    /// Checks if the WAL record is empty
+    ///
+    /// An empty interpreted WAL record has no data or metadata and does not have to be sent to the
+    /// pageserver.
+    pub fn is_empty(&self) -> bool {
+        self.batch.is_empty()
+            && self.metadata_record.is_none()
+            && matches!(self.flush_uncommitted, FlushUncommittedRecords::No)
+    }
+}
+
 /// The interpreted part of the Postgres WAL record which requires metadata
 /// writes to the underlying storage engine.
 #[derive(Serialize, Deserialize)]
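
A small sketch of how a sender might use `is_empty` to skip records. The types are simplified stand-ins (a byte vector for the batch, a string for the metadata record), the `Yes` variant is assumed, and `records_to_send` is a hypothetical helper rather than code from this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FlushUncommittedRecords {
    Yes, // assumed counterpart of the `No` variant seen in the diff
    No,
}

/// Simplified stand-in for `InterpretedWalRecord`.
#[derive(Debug)]
struct Record {
    batch: Vec<u8>,                  // stands in for the serialized value batch
    metadata_record: Option<String>, // stands in for the metadata record
    flush_uncommitted: FlushUncommittedRecords,
}

impl Record {
    /// Same three-part condition as in the diff above.
    fn is_empty(&self) -> bool {
        self.batch.is_empty()
            && self.metadata_record.is_none()
            && matches!(self.flush_uncommitted, FlushUncommittedRecords::No)
    }
}

/// Hypothetical sender-side step: drop records the receiver has no use for.
fn records_to_send(records: Vec<Record>) -> Vec<Record> {
    records.into_iter().filter(|r| !r.is_empty()).collect()
}

fn main() {
    let records = vec![
        Record { batch: vec![], metadata_record: None, flush_uncommitted: FlushUncommittedRecords::No },
        Record { batch: vec![], metadata_record: None, flush_uncommitted: FlushUncommittedRecords::Yes },
        Record { batch: vec![1], metadata_record: None, flush_uncommitted: FlushUncommittedRecords::No },
    ];
    let kept = records_to_send(records);
    println!("kept {} of 3 records: {:?}", kept.len(), kept);
}
```

Note that a record with no data and no metadata is still kept when it asks for uncommitted records to be flushed, because the receiver has to act on that request.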
15 changes: 10 additions & 5 deletions libs/wal_decoder/src/serialized_batch.rs
@@ -496,11 +496,16 @@ impl SerializedValueBatch {
         }
     }

-    /// Checks if the batch is empty
-    ///
-    /// A batch is empty when it contains no serialized values.
-    /// Note that it may still contain observed values.
+    /// Checks if the batch contains any serialized or observed values
     pub fn is_empty(&self) -> bool {
+        !self.has_data() && self.metadata.is_empty()
+    }
+
+    /// Checks if the batch contains data
+    ///
+    /// Note that if this returns false, it may still contain observed values or
+    /// a metadata record.
+    pub fn has_data(&self) -> bool {
         let empty = self.raw.is_empty();

         if cfg!(debug_assertions) && empty {
@@ -510,7 +515,7 @@ impl SerializedValueBatch {
                 .all(|meta| matches!(meta, ValueMeta::Observed(_))));
         }

-        empty
+        !empty
     }

     /// Returns the number of values serialized in the batch
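
The split between `is_empty` and `has_data` is easiest to see on a batch that carries only observed values. The sketch below uses a simplified stand-in for `SerializedValueBatch`; the payloads of the `ValueMeta` variants are placeholders.

```rust
/// Simplified stand-in for the batch metadata entries.
#[allow(dead_code)]
#[derive(Debug)]
enum ValueMeta {
    Serialized(usize), // placeholder payload
    Observed(u64),     // placeholder payload
}

/// Simplified stand-in for `SerializedValueBatch`.
#[derive(Debug, Default)]
struct Batch {
    raw: Vec<u8>,
    metadata: Vec<ValueMeta>,
}

impl Batch {
    /// True when serialized bytes are present.
    fn has_data(&self) -> bool {
        !self.raw.is_empty()
    }

    /// True only when there are neither bytes nor metadata entries.
    fn is_empty(&self) -> bool {
        !self.has_data() && self.metadata.is_empty()
    }
}

fn main() {
    let nothing = Batch::default();
    let observed_only = Batch { raw: vec![], metadata: vec![ValueMeta::Observed(1)] };
    let with_data = Batch { raw: vec![0xAB], metadata: vec![ValueMeta::Serialized(1)] };

    for (name, b) in [("nothing", &nothing), ("observed_only", &observed_only), ("with_data", &with_data)] {
        println!("{}: has_data={} is_empty={}", name, b.has_data(), b.is_empty());
    }
}
```

An observed-only batch has no bytes to write, so `has_data` is false, but it is not empty either. That is the case the old single `is_empty` could not express and the reason the call sites in the pageserver switch to `has_data`.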
1 change: 1 addition & 0 deletions pageserver/src/bin/pageserver.rs
@@ -126,6 +126,7 @@ fn main() -> anyhow::Result<()> {
     // after setting up logging, log the effective IO engine choice and read path implementations
     info!(?conf.virtual_file_io_engine, "starting with virtual_file IO engine");
     info!(?conf.virtual_file_io_mode, "starting with virtual_file IO mode");
+    info!(?conf.wal_receiver_protocol, "starting with WAL receiver protocol");

     // The tenants directory contains all the pageserver local disk state.
     // Create if not exists and make sure all the contents are durable before proceeding.
5 changes: 5 additions & 0 deletions pageserver/src/config.rs
@@ -14,6 +14,7 @@ use remote_storage::{RemotePath, RemoteStorageConfig};
 use std::env;
 use storage_broker::Uri;
 use utils::logging::SecretString;
+use utils::postgres_client::PostgresClientProtocol;

 use once_cell::sync::OnceCell;
 use reqwest::Url;
@@ -190,6 +191,8 @@ pub struct PageServerConf {
     /// Maximum amount of time for which a get page request request
     /// might be held up for request merging.
     pub server_side_batch_timeout: Option<Duration>,
+
+    pub wal_receiver_protocol: PostgresClientProtocol,
 }

 /// Token for authentication to safekeepers
@@ -350,6 +353,7 @@ impl PageServerConf {
             server_side_batch_timeout,
             tenant_config,
             no_sync,
+            wal_receiver_protocol,
         } = config_toml;

         let mut conf = PageServerConf {
@@ -393,6 +397,7 @@ impl PageServerConf {
             import_pgdata_upcall_api,
             import_pgdata_upcall_api_token: import_pgdata_upcall_api_token.map(SecretString::from),
             import_pgdata_aws_endpoint_url,
+            wal_receiver_protocol,

             // ------------------------------------------------------------
             // fields that require additional validation or custom handling
7 changes: 3 additions & 4 deletions pageserver/src/pgdatadir_mapping.rs
@@ -1229,10 +1229,9 @@ impl<'a> DatadirModification<'a> {
     }

     pub(crate) fn has_dirty_data(&self) -> bool {
-        !self
-            .pending_data_batch
+        self.pending_data_batch
             .as_ref()
-            .map_or(true, |b| b.is_empty())
+            .map_or(false, |b| b.has_data())
     }

     /// Set the current lsn
@@ -1408,7 +1407,7 @@ impl<'a> DatadirModification<'a> {
             Some(pending_batch) => {
                 pending_batch.extend(batch);
             }
-            None if !batch.is_empty() => {
+            None if batch.has_data() => {
                 self.pending_data_batch = Some(batch);
             }
             None => {
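
The `has_dirty_data` rewrite looks like a logic change but should preserve behaviour, since the old `is_empty` only looked at the raw bytes. The following sketch checks the two forms against each other on a simplified stand-in; `old_is_empty` models the pre-PR method, and both free functions are illustrative.

```rust
/// Simplified stand-in for the pending batch.
struct Batch {
    raw: Vec<u8>,
}

impl Batch {
    /// Models the pre-PR `is_empty`, which only checked the raw bytes.
    fn old_is_empty(&self) -> bool {
        self.raw.is_empty()
    }

    /// Models the new `has_data`.
    fn has_data(&self) -> bool {
        !self.raw.is_empty()
    }
}

/// Pre-PR form: negate "no batch, or batch is empty".
fn has_dirty_data_old(pending: Option<&Batch>) -> bool {
    !pending.map_or(true, |b| b.old_is_empty())
}

/// Post-PR form: "a batch exists and it has data".
fn has_dirty_data_new(pending: Option<&Batch>) -> bool {
    pending.map_or(false, |b| b.has_data())
}

fn main() {
    let empty = Batch { raw: vec![] };
    let full = Batch { raw: vec![1, 2, 3] };

    for pending in [None, Some(&empty), Some(&full)] {
        assert_eq!(has_dirty_data_old(pending), has_dirty_data_new(pending));
    }
    println!("old and new forms agree on all three cases");
}
```

The new form drops the double negation, so the `None` case reads directly as "nothing pending, nothing dirty".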
3 changes: 2 additions & 1 deletion pageserver/src/tenant/timeline.rs
@@ -2470,6 +2470,7 @@ impl Timeline {
         *guard = Some(WalReceiver::start(
             Arc::clone(self),
             WalReceiverConf {
+                protocol: self.conf.wal_receiver_protocol,
                 wal_connect_timeout,
                 lagging_wal_timeout,
                 max_lsn_wal_lag,
@@ -5896,7 +5897,7 @@ impl<'a> TimelineWriter<'a> {
         batch: SerializedValueBatch,
         ctx: &RequestContext,
     ) -> anyhow::Result<()> {
-        if batch.is_empty() {
+        if !batch.has_data() {
             return Ok(());
         }


1 comment on commit 7a2f0ed

@github-actions

7050 tests run: 6717 passed, 0 failed, 333 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 30.9% (7975 of 25846 functions)
  • lines: 48.6% (63300 of 130158 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
7a2f0ed at 2024-11-25T20:45:19.502Z :recycle:
