Potential race condition in aio_command_finish? #687

Open

geordieintheshellcode opened this issue Oct 3, 2022 · 4 comments

I'd like to get a second opinion on whether there is a potential race condition in aio_command_finish.

What does aio_command_finish do?

On entry into the function we call tcmur_tcmulib_cmd_complete which locks the device's cmds_list_lock and then calls tcmulib_command_complete to update the command ring tail. We then unlock the cmds_list_lock and call tcmulib_processing_complete to notify the kernel that there are commands to reap.
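
In rough pseudo-C, the path looks something like this (a simplified sketch of the sequence just described, not the exact tcmu-runner source; timeout handling and the aio wake-up bookkeeping are omitted):

static void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
                               int rc)
{
        struct tcmur_device *rdev = tcmu_dev_get_private(dev);

        /* tcmur_tcmulib_cmd_complete(dev, cmd, rc): */
        pthread_mutex_lock(&rdev->cmds_list_lock);
        tcmulib_command_complete(dev, cmd, rc);  /* fills in the ring entry, bumps the tail */
        pthread_mutex_unlock(&rdev->cmds_list_lock);

        /* notify the kernel that there are completions to reap */
        tcmulib_processing_complete(dev);
}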

The Problem

We can have multiple threads calling into aio_command_finish concurrently. This means we can have one thread calling tcmulib_command_complete (updating the ring buffer tail) while another thread is calling tcmulib_processing_complete. tcmulib_processing_complete subsequently calls into tcmu_handle_completions on the kernel side (see target_core_user.c).

As far as I can tell there is insufficient synchronisation between the user-space thread which is updating the command ring tail pointer and the kernel thread which is reading the command ring tail pointer. Our tcmulib_command_complete thread is essentially doing:

tcmulib_command_complete()
    entry = list->tail
    entry->cmd_id = 42
    list->tail++

while our kernel thread is doing:

tcmu_handle_completion()
    while (last != READ_ONCE(list->tail))
       entry = last;
       if !find_cmd(entry->cmd_id)
          print_error("cmd_id %u not found, ring is broken\n")
       last++

Now, unless I'm misunderstanding something, I believe the CPU is entitled to re-order the stores in tcmulib_command_complete at runtime such that the increment of list->tail happens BEFORE the assignment to entry->cmd_id. In pseudo-code, something like:

 entry = list->tail
 list->tail++
 entry->cmd_id = 42

Whether this happens, and how likely it is, probably depends on the CPU architecture. If this were to occur it would be possible for tcmu_handle_completions to access an entry via list->tail before the store to cmd_id has been made visible. This would result in a bogus value for cmd_id being read, and the call to find_cmd could fail, or worse, the wrong command could be processed. The READ_ONCE on the kernel side ensures that list->tail is always re-read from memory (rather than from a stale value cached in a register), but it DOES NOT ensure that the user-space stores to the entry become visible before the updated tail pointer does. As far as I understand, the store/load of list->tail should use the relevant atomics or a memory barrier to ensure the correct synchronisation?
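
If that is what's happening, the usual fix on the producer side is to publish the tail with release semantics, so that the stores filling in the entry are ordered before the store to list->tail. A minimal sketch using the GCC/Clang __atomic builtins against a hypothetical ring (the names mirror the pseudo-code above, not the real tcmu mailbox layout):

#include <stdint.h>

/* Hypothetical ring mirroring the pseudo-code above. 'tail' is written by
 * user space and read by the kernel; 'head' is written by the kernel and
 * read by user space. */
struct ring_entry {
        uint16_t cmd_id;
        /* ... status/response fields ... */
};

struct ring {
        struct ring_entry entries[128];
        uint32_t head;
        uint32_t tail;
};

static void complete_command(struct ring *list, uint16_t cmd_id)
{
        struct ring_entry *entry = &list->entries[list->tail % 128];

        entry->cmd_id = cmd_id;   /* plain stores filling in the entry */

        /* Release store: the stores above must be visible before the new
         * tail value is. A full barrier (__sync_synchronize()) before a
         * plain store would achieve the same thing. */
        __atomic_store_n(&list->tail, list->tail + 1, __ATOMIC_RELEASE);
}

(C11 atomic_store_explicit with memory_order_release would be the equivalent if the field were declared _Atomic.)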

I'd appreciate it if someone were to review my logic! I definitely feel like I could have missed something. I'll see if I can repro it with tcmu-runner.

geordieintheshellcode (Author) commented Oct 4, 2022

I've managed to repro the problem on a couple of aarch64 machines (Fedora 36 (aarch64), 5.19.12-200.fc36.aarch64 and Rocky Linux release 8.4 (Green Obsidian), 4.18.0-305.3.1.el8_4.aarch64). I can't repro the problem on x86.

Repro steps:

  1. Check out this branch https://github.com/geordieintheshellcode/tcmu-runner/tree/tcmu-race-condition. It's master with some extra scripts and a modification to the CMakeLists.txt to install the file_example handler.
  2. Build and install. Run the tcmu-runner binary.
  3. Run ./create.sh to create a 1 GiB file-backed TCMU backstore and hook it up to a SCSI HBA/LUN.
  4. Find the TCMU SCSI device and run fio against it: sudo FILE=/dev/<dev_name> QD=128 BS=4k fio randwrite.fio

After a while (almost instantly on one of my test machines, on the other it takes a few minutes), you'll see a bunch of write commands failing in dmesg and the target_core_user module reports that the ring is corrupted:

[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#339 CDB: Write(10) 2a 00 00 1c c9 f8 00 00 08 00
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#276 CDB: Write(10) 2a 00 00 1d cd b8 00 00 08 00
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:50 2022] sd 0:0:1:0: [sda] tag#451 CDB: Write(10) 2a 00 00 15 b1 78 00 00 08 00
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#704 CDB: Write(10) 2a 00 00 16 64 78 00 00 08 00
[Mon Oct  3 17:23:53 2022] cmd_id 6 not found, ring is broken
[Mon Oct  3 17:23:53 2022] sd 0:0:1:0: [sda] tag#901 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[Mon Oct  3 17:23:53 2022] ring broken, not handling completions

What's a bit confusing is that even after applying the following patch, which ensures only a single thread at a time calls tcmur_tcmulib_cmd_complete and tcmulib_processing_complete, I still see failed write commands, though no ring corruption this time:

diff --git a/tcmur_cmd_handler.c b/tcmur_cmd_handler.c
index 01d6a10..20266e0 100644
--- a/tcmur_cmd_handler.c
+++ b/tcmur_cmd_handler.c
@@ -64,18 +64,24 @@ void tcmur_tcmulib_cmd_complete(struct tcmu_device *dev,
        pthread_cleanup_pop(0);
 }

+static pthread_mutex_t mega_lock = PTHREAD_MUTEX_INITIALIZER;
+
 static void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
                               int rc)
 {
        struct tcmur_device *rdev = tcmu_dev_get_private(dev);
        int wake_up;

+       pthread_mutex_lock(&mega_lock);
+
        tcmur_tcmulib_cmd_complete(dev, cmd, rc);
        track_aio_request_finish(rdev, &wake_up);
        while (wake_up) {
                tcmulib_processing_complete(dev);
                track_aio_wakeup_finish(rdev, &wake_up);
        }
+
+       pthread_mutex_unlock(&mega_lock);
 }

dmesg:

[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Sense Key : Hardware Error [current]
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 Add. Sense: Internal target failure
[Tue Oct  4 11:47:03 2022] sd 0:0:1:0: [sda] tag#245 CDB: Write(10) 2a 00 00 1c 1a 78 00 00 08 00
[Tue Oct  4 11:47:03 2022] blk_update_request: critical target error, dev sda, sector 1841784 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0

and tcmu-runner.log:

2022-10-04 11:47:03.880 66678 [ERROR] check_iovec_length:143 file/file: iov len mismatch: iov len 4096, xfer len 40704, block size 512

This is quite odd, as fio is only writing 4096-byte blocks! I wonder if we have a similar problem with the tcmur_cmdproc_thread thread calling tcmulib_get_next_command and clashing with the kernel thread which is updating the ring buffer. I've not had time to investigate this.

geordieintheshellcode (Author) commented Oct 4, 2022

I think tcmulib_get_next_command is also broken on aarch64. The problem seems to be that we load the cmd_head index without the necessary memory barriers. Atomically loading the cmd_head index in device_cmd_head() seemingly fixes the issue. More investigation is required though. I will break this out into a separate issue (#688).
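
For what it's worth, the kind of change meant here on the consumer side would look roughly like the following, reusing the hypothetical struct ring from the earlier sketch (device_cmd_head() in libtcmu reads the real mailbox's cmd_head; this is only an illustration of the acquire load, not the actual code):

/* Read the kernel-updated head with acquire semantics so the reads of the
 * newly published entries that follow it cannot observe stale contents.
 * Pairs with a release store (or barrier) on the kernel/producer side. */
static uint32_t device_cmd_head(struct ring *list)
{
        return __atomic_load_n(&list->head, __ATOMIC_ACQUIRE);
}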

lxbsz (Collaborator) commented Oct 9, 2022

@geordieintheshellcode This sounds reasonable and could happen IMO.

Could you raise a PR to fix this? Thanks!

geordieintheshellcode (Author) commented

Sorry I've been quite busy recently. I will open a PR as soon as I can. Cheers
