
pageserver: improve flush upload queue parallelism #10096

Open
Tracked by #9624
erikgrinaker opened this issue Dec 11, 2024 · 3 comments · May be fixed by #10144

Currently, when flushing delta layers, we upload the layer and then update the index. But index updates act as an upload queue parallelism barrier. This means that we're effectively uploading layers one at a time.

Instead, we should flush a batch of layer files (something like 100-1000 MB) for each index update, allowing us to upload these in parallel.
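A rough sketch of that batching, with hypothetical names (`schedule_batched`, the simplified `UploadOp`) and a made-up 256 MB threshold standing in for the 100-1000 MB range above; the real pageserver types differ:

```rust
// Minimal sketch: emit one metadata (index) update per size-bounded batch of
// layers instead of one per layer, so the layer uploads within a batch can
// run in parallel. All names and the threshold are illustrative.

const BATCH_BYTES: u64 = 256 * 1024 * 1024; // placeholder threshold

#[derive(Debug)]
enum UploadOp {
    UploadLayer { size_bytes: u64 },
    UploadMetadata, // index update: acts as a parallelism barrier in the queue
}

/// Turn a sequence of flushed layer sizes into queue operations, with one
/// metadata barrier per batch rather than per layer.
fn schedule_batched(layer_sizes: &[u64]) -> Vec<UploadOp> {
    let mut ops = Vec::new();
    let mut batch_bytes = 0u64;
    for &size_bytes in layer_sizes {
        ops.push(UploadOp::UploadLayer { size_bytes });
        batch_bytes += size_bytes;
        if batch_bytes >= BATCH_BYTES {
            ops.push(UploadOp::UploadMetadata);
            batch_bytes = 0;
        }
    }
    if batch_bytes > 0 {
        ops.push(UploadOp::UploadMetadata); // cover any trailing partial batch
    }
    ops
}
```

With four 128 MB layers this yields two index updates instead of four, and each pair of layer uploads can proceed concurrently.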

Note that the flush loop backpressure from #8550 will also prevent parallelism; we'll need to remove that first, see #10095.

@erikgrinaker erikgrinaker added a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver labels Dec 11, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 11, 2024
skyzh (Member) commented Dec 11, 2024

In theory, if we put multiple upload tasks into the queue, they will be fired in parallel.

```rust
let can_run_now = match next_op {
    UploadOp::UploadLayer(..) => {
        // Can always be scheduled.
        true
    }
    UploadOp::UploadMetadata { .. } => {
        // These can only be performed after all the preceding operations
        // have finished.
        upload_queue.inprogress_tasks.is_empty()
    }
    UploadOp::Delete(..) => {
        // Wait for preceding uploads to finish. Concurrent deletions are OK, though.
        upload_queue.num_inprogress_deletions == upload_queue.inprogress_tasks.len()
    }
    UploadOp::Barrier(_) | UploadOp::Shutdown => {
        upload_queue.inprogress_tasks.is_empty()
    }
};
```
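To illustrate how this rule serializes uploads, here is a tiny executable model of it. The `Op` enum and `launchable` function are simplified stand-ins, not the pageserver's actual code:

```rust
// Simplified model of the scheduling rule above: layer uploads can always
// start, a metadata upload must wait until nothing is in flight, and nothing
// behind a blocked frontmost op is considered.

#[derive(Debug, Clone)]
enum Op {
    UploadLayer(&'static str),
    UploadMetadata,
}

/// Start as many ops from the front of the queue as the rules allow, given
/// `in_flight` tasks already running. Returns the ops that can start now.
fn launchable(queue: &[Op], in_flight: usize) -> Vec<Op> {
    let mut started = Vec::new();
    let mut running = in_flight;
    for op in queue {
        let can_run_now = match op {
            Op::UploadLayer(_) => true,       // can always be scheduled
            Op::UploadMetadata => running == 0, // barrier
        };
        if !can_run_now {
            break; // a blocked frontmost op stalls the whole queue
        }
        started.push(op.clone());
        running += 1;
    }
    started
}
```

With a metadata op between every pair of layers, only one layer upload can ever start at a time; with the metadata op scheduled after a batch, all layers in the batch start together.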

I think compaction already utilizes this code path?

erikgrinaker (Contributor, Author) commented Dec 11, 2024

Yes. The problem is that when we flush a delta layer, we schedule both an UploadLayer and an UploadMetadata for every layer sequentially. As you can see in the code you quoted, UploadMetadata acts as a barrier for parallel uploads, since it waits for all preceding layers to finish uploading.

We want to flush a batch of layers and schedule UploadLayer for them all, and then schedule an UploadMetadata barrier for the entire batch.

```rust
for layer in layers_to_upload {
    self.remote_client.schedule_layer_file_upload(layer)?;
}
self.remote_client
    .schedule_index_upload_for_metadata_update(&update)?;
```

```rust
// Schedule remote uploads that will reflect our new disk_consistent_lsn
self.schedule_uploads(disk_consistent_lsn, layers_to_upload)
    .map_err(|e| FlushLayerError::from_anyhow(self, e))?;
```

```rust
// Normal case, write out a L0 delta layer file.
// `create_delta_layer` will not modify the layer map.
// We will remove frozen layer and add delta layer in one atomic operation later.
let Some(layer) = self
    .create_delta_layer(&frozen_layer, None, ctx)
    .await
    .map_err(|e| FlushLayerError::from_anyhow(self, e))?
else {
    panic!("delta layer cannot be empty if no filter is applied");
};
(
    // FIXME: even though we have a single image and single delta layer assumption
    // we push them to vec
    vec![layer.clone()],
    Some(layer),
)
```

@erikgrinaker erikgrinaker linked a pull request Dec 13, 2024 that will close this issue
erikgrinaker (Contributor, Author) commented:

There's a prototype PR for deferred index uploads during flush in #10144. However, this approach has a fair number of issues: we have to wait for "some time" in case further layers are flushed, which breaks callers' expectation that index uploads are scheduled immediately, and it can still cause index barriers if the previous index isn't uploaded before the next layer comes in.

Instead, let's try to reorder layer uploads ahead of index uploads in the upload queue:

```rust
// If we cannot launch this task, don't look any further.
//
// In some cases, we could let some non-frontmost tasks "jump the queue" and
// launch them now, but we don't currently try to do that. For example, if the
// frontmost task is an index-file upload that cannot proceed until preceding
// uploads have finished, we could still start layer uploads that were
// scheduled later.
if !can_run_now {
    break;
}
```

A few things to keep in mind:

  • We want to upload indexes as soon as the necessary layers are uploaded.
  • We want to coalesce indexes where we can.
  • We don't want to reorder operations on the same layer filename.
  • We don't want to reorder operations with respect to deletes.
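Under those constraints, the reordering could be sketched roughly as follows (simplified hypothetical types and logic; the real implementation would also need to handle barriers, shutdown, and index coalescing). A layer upload may bypass a blocked index upload as long as it doesn't cross a delete or an earlier queued op on the same layer file name:

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
enum Op {
    UploadLayer(&'static str),
    UploadIndex,
    Delete(&'static str),
}

/// Decide which queued ops may start now, letting layer uploads jump past a
/// blocked index upload, but never reordering across a delete or across an
/// earlier still-queued op on the same layer file name.
fn launchable(queue: &[Op], in_flight: usize) -> Vec<Op> {
    let mut started = Vec::new();
    let mut blocked_names: HashSet<&'static str> = HashSet::new();
    let mut saw_delete = false;
    for op in queue {
        match op {
            Op::UploadIndex => {
                if in_flight == 0 && started.is_empty() && !saw_delete {
                    started.push(op.clone()); // nothing ahead of it: run it
                }
                // Otherwise blocked, but keep scanning for layer uploads
                // that may bypass it.
            }
            Op::Delete(name) => {
                // Deletes are an ordering fence for everything behind them.
                saw_delete = true;
                blocked_names.insert(*name);
            }
            Op::UploadLayer(name) => {
                if !saw_delete && !blocked_names.contains(name) {
                    started.push(op.clone());
                }
                // Any later op on this file must wait for this one either way.
                blocked_names.insert(*name);
            }
        }
    }
    started
}
```

For a queue like `[UploadLayer("a"), UploadIndex, UploadLayer("b"), Delete("c"), UploadLayer("d")]`, layers "a" and "b" start (with "b" bypassing the blocked index), while nothing crosses the delete.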
