Overlap GCM and AES operations (IDFGH-13563) #14452

Closed
bryghtlabs-richard opened this issue Aug 27, 2024 · 9 comments
Assignees
Labels
Resolution: Won't Do This will not be worked on Status: Done Issue is done internally Type: Feature Request Feature request for IDF

Comments

@bryghtlabs-richard
Contributor

bryghtlabs-richard commented Aug 27, 2024

Is your feature request related to a problem?

I'd like my TLS transfers to be faster. My system spends a lot of time doing GCM or AES operations after the session is established.

Describe the solution you'd like.

It may be worth looking into overlapping AES and GCM operations on ESP32 systems. Some terms:

  • tAES = time to encrypt/decrypt 128-bit block
  • tGCM = time to GCM 128-bit block
  • N = number of 128-bit blocks in a packet, maybe 90 per packet

Currently, the two stages run one after the other, so total time = N * tAES + N * tGCM. Pseudocode:

for each block:
    GCM block
send all blocks to AES unit
Wait for AES unit done-signal.

Instead, on other systems I have been able to overlap AES and GCM block operations, so total time = tAES + (N-1)*max(tAES, tGCM) + tGCM. This would require the CPU to feed the AES unit periodically, which adds some overhead compared to batching AES operations. It's best if the AES unit can be configured to take its input from the CPU but write its output via DMA. Then the CPU waits for the input-ready signal before each block, and waits for the done signal only at the end of the transfer. Pseudocode:

for each block:
    GCM block
    wait for AES input-ready-signal
    send block to AES unit
Wait for AES unit done-signal
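
The two timing formulas above can be compared with a short sketch. The function names and the time units are illustrative assumptions, not measurements from any ESP32 target, and the model ignores the per-block CPU overhead of feeding the AES unit.

```c
#include <assert.h>
#include <stdint.h>

/* Time units are arbitrary (for example CPU cycles); all values are illustrative. */

/* Current behaviour: one stage runs over all N blocks, then the other. */
static uint32_t sequential_time(uint32_t n, uint32_t t_aes, uint32_t t_gcm)
{
    return n * t_aes + n * t_gcm;
}

/* Two-stage pipeline: fill with the first block, then each remaining block
 * costs as much as the slower stage, then drain the last block. */
static uint32_t pipelined_time(uint32_t n, uint32_t t_aes, uint32_t t_gcm)
{
    assert(n > 0);
    uint32_t slower = (t_aes > t_gcm) ? t_aes : t_gcm;
    return t_aes + (n - 1) * slower + t_gcm;
}
```

For example, with N = 90 and both stages costing 10 units per block, `sequential_time(90, 10, 10)` gives 1800 and `pipelined_time(90, 10, 10)` gives 910, so the best case approaches a 2x speedup only when the two stages take similar time.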

Describe alternatives you've considered.

No response

Additional context.

I've tuned the GCM loop for xtensa, but if it works this could save quite a bit more time.

@bryghtlabs-richard bryghtlabs-richard added the Type: Feature Request Feature request for IDF label Aug 27, 2024
@github-actions github-actions bot changed the title Overlap GCM and AES operations Overlap GCM and AES operations (IDFGH-13563) Aug 27, 2024
@espressif-bot espressif-bot added the Status: Opened Issue is new label Aug 27, 2024
@Harshal5
Collaborator

Hi @bryghtlabs-richard,

Thanks for the in-depth description. I have a few questions:

  1. Just to confirm: which ESP target are you referring to in the proposal? ESP32-S2 supports fully hardware-accelerated AES-GCM operations, whereas the other targets support hardware-accelerated AES operations and software GCM operations.
    (For the next question, I assume you mean the software GCM workflow of targets other than ESP32-S2.)

  2. As you mentioned, we currently perform the GCM operation on all the input blocks first and then the AES operation. I think this happens only for decryption, not for encryption.

In the case of decryption, we perform the GCM operation on the input buffer first, followed by AES decryption. Are you asking whether we could use the time taken by decryption to run the GCM operation in parallel?

In the case of encryption, we first perform the AES operation, and the result is then used for the GCM operation. In this case, do you propose that, rather than encrypting all blocks first, we encrypt block by block and use the encryption time of the (n+1)th block for the GCM operation on the nth block?

Is this what you proposed or did I miss out on something?
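
The block-wise ordering described for encryption can be sketched as follows. This is only an illustration of the schedule: `toy_cipher_block` and `toy_auth_block` are hypothetical stand-ins (not real AES or GHASH, and not ESP-IDF or mbedTLS functions), and on a host everything runs serially, so the sketch shows that the interleaved order produces the same ciphertext and tag as the batch order, not any actual speedup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16u

/* Stand-in for the AES peripheral: NOT real AES, just a reversible XOR. */
static void toy_cipher_block(const uint8_t *in, uint8_t *out, size_t idx)
{
    for (size_t i = 0; i < BLOCK; i++) {
        out[i] = in[i] ^ (uint8_t)(0xA5u + idx + i);
    }
}

/* Stand-in for GHASH: NOT real GF(2^128) math, just an order-sensitive mix. */
static void toy_auth_block(uint8_t acc[BLOCK], const uint8_t *ct)
{
    for (size_t i = 0; i < BLOCK; i++) {
        acc[i] = (uint8_t)((acc[i] ^ ct[i]) * 3u + 1u);
    }
}

/* Batch order: encrypt every block, then authenticate every ciphertext block. */
static void encrypt_batch(const uint8_t *pt, uint8_t *ct, size_t n, uint8_t acc[BLOCK])
{
    for (size_t b = 0; b < n; b++) toy_cipher_block(pt + b * BLOCK, ct + b * BLOCK, b);
    for (size_t b = 0; b < n; b++) toy_auth_block(acc, ct + b * BLOCK);
}

/* Interleaved order: block b is authenticated while block b+1 is "in flight". */
static void encrypt_interleaved(const uint8_t *pt, uint8_t *ct, size_t n, uint8_t acc[BLOCK])
{
    assert(n > 0);
    toy_cipher_block(pt, ct, 0);
    for (size_t b = 0; b + 1 < n; b++) {
        /* On real hardware this would only start the cipher for block b+1 ... */
        toy_cipher_block(pt + (b + 1) * BLOCK, ct + (b + 1) * BLOCK, b + 1);
        /* ... and the CPU would authenticate block b while the peripheral runs. */
        toy_auth_block(acc, ct + b * BLOCK);
    }
    toy_auth_block(acc, ct + (n - 1) * BLOCK);
}

/* Self-check: both schedules must yield identical ciphertext and tag. */
static bool schedules_agree(void)
{
    enum { N = 4 };
    uint8_t pt[N * BLOCK], ct_a[N * BLOCK], ct_b[N * BLOCK];
    uint8_t tag_a[BLOCK] = {0}, tag_b[BLOCK] = {0};
    for (size_t i = 0; i < sizeof pt; i++) pt[i] = (uint8_t)(i * 7u);
    encrypt_batch(pt, ct_a, N, tag_a);
    encrypt_interleaved(pt, ct_b, N, tag_b);
    return memcmp(ct_a, ct_b, sizeof ct_a) == 0 && memcmp(tag_a, tag_b, BLOCK) == 0;
}
```

The only data dependency is that a block must be encrypted before it is authenticated, which is why the authentication of block n can overlap the encryption of block n+1.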

@bryghtlabs-richard
Contributor Author

Hi @Harshal5, I'm currently using ESP32-S3, with hardware AES and software GCM. I'm mostly interested in decryption speed, but both operations could benefit from pipelining the AES and GCM stages. Your understanding of the pipeline timing is correct.

@Harshal5
Collaborator

@bryghtlabs-richard Thanks for confirming, got it!
I will take some measurements of the performance impact of a few combinations.

@Harshal5
Collaborator

Hi @bryghtlabs-richard,

I think the proposed max(tAES, tGCM) time, where the AES operation is performed on one core and the GCM operation on the other in parallel, could only be achieved on multicore systems. Also, creating a new task on the other core for the GCM operation using xTaskCreate() would add overhead in both time (which overshadows any improvement we would have achieved) and memory (which is already constrained during TLS operations, so setting up a new task from the TLS task would be expensive).
0001-feat-aes-gcm-Parallelize-GCM-calculations.patch

Another approach could be to reclaim some of the "waiting" time during the AES operation: we start the AES operation, run the GCM operation just before checking esp_aes_wait_dma_done(), and then resume the AES operation. This shows only a slight (4%) improvement in performance, and only when the data length is > 2K bytes. The approach also involves some module intermixing in the code (GCM references in the AES module), which I think would not be good practice.
0001-feat-aes-gcm-Tweak-pipelining-of-software-GCM-operat.patch

I am not sure if these approaches would help us achieve considerable performance improvements.

@bryghtlabs-richard
Contributor Author

Sorry, I did not mean to propose a multithreading solution, but rather pipelining the hardware AES with the software GCM, like your second approach. However, I don't think we would want to wait on esp_aes_wait_dma_done(); we only want to wait until the AES unit is ready to accept new input (if the ESP AES unit is able to accept new input before it has finished outputting). It will take me a few days to review your patch.

@Harshal5
Collaborator

Currently, the API mbedtls_gcm_update() performs the GCM operation followed by the AES operation over the whole "length" bytes. So, when the API returns, we expect both operations (GCM and AES) over the whole "length" bytes to be complete. The next AES operation, or the next data input to the AES peripheral, would take place only in the next mbedtls_gcm_update() call.
I don't think it is feasible to change the behaviour of the API.

During the AES operation we generate a DMA descriptor list that covers the complete "length" bytes of data instead of chunking the data into small blocks. So, once we start the AES operation, the AES peripheral completes its operation over all "length" bytes and only then is ready to accept new input data. Chunking into small blocks adds the overhead of recreating the DMA descriptor list again and again.
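
The overhead argument can be illustrated with a rough sketch. The `toy_desc_t` type, the function names, and the 4095-byte per-descriptor limit are assumptions made for illustration; they are not the ESP-IDF descriptor structures, and the real limit should be taken from the target's technical reference manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DESC_MAX_BYTES 4095u /* assumed per-descriptor limit, illustration only */

/* Hypothetical linked DMA descriptor; NOT the ESP-IDF descriptor type. */
typedef struct toy_desc {
    const uint8_t *buf;
    size_t len;
    struct toy_desc *next;
} toy_desc_t;

/* Link descriptors over [data, data + len).
 * Returns the number used, or 0 if len is 0 or the pool is too small. */
static size_t toy_build_list(toy_desc_t *pool, size_t pool_size,
                             const uint8_t *data, size_t len)
{
    assert(pool != NULL && data != NULL);
    size_t used = 0;
    while (len > 0) {
        if (used == pool_size) return 0;
        size_t chunk = len < TOY_DESC_MAX_BYTES ? len : TOY_DESC_MAX_BYTES;
        pool[used].buf = data;
        pool[used].len = chunk;
        pool[used].next = NULL;
        if (used > 0) pool[used - 1].next = &pool[used];
        data += chunk;
        len -= chunk;
        used++;
    }
    return used;
}

/* How many times a list must be built (and the peripheral started)
 * if the data is submitted in pieces of chunk_len bytes. */
static size_t toy_list_builds(size_t total_len, size_t chunk_len)
{
    assert(chunk_len > 0);
    return (total_len + chunk_len - 1) / chunk_len;
}

/* Example: one list over a 10000-byte buffer. */
static size_t toy_demo_descs_for_10000_bytes(void)
{
    static uint8_t data[10000];
    toy_desc_t pool[4];
    return toy_build_list(pool, 4, data, sizeof data);
}
```

Under these assumptions a 1440-byte packet (90 blocks of 16 bytes) needs one list build when submitted whole, but `toy_list_builds(1440, 16)` gives 90 builds when submitted block by block, which is the repeated setup cost described above.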

@bryghtlabs-richard
Contributor Author

> Chunking into small blocks adds the overhead of recreating the DMA descriptor list again and again.

We wouldn't want DMA descriptors per AES block. I was hoping that the AES core's input and output could be configured separately (DMA vs. CPU access), but I found the AES core documentation, and DMA is supported only for both input and output together, or not at all (CPU input and output). So, if it's possible to gain time by overlapping, it would need to be done without DMA (which may be slower overall).

> The next AES operation or the next data input to the AES peripheral would take place only in the next mbedtls_gcm_update() API call. I don't think it is feasible to change the behaviour of the API.

I agree we should not change the behaviour of the API; the overlapping would occur only within mbedtls_gcm_update() calls.

@Harshal5
Collaborator

Harshal5 commented Sep 20, 2024

Yes, as of now the only option I can think of that could increase the performance of AES-GCM operations is the second approach mentioned above:

> Another approach could be saving some "waiting" time during the AES operation, wherein we start AES operation and start the GCM operation just before we check for AES esp_aes_wait_dma_done(), carry out the GCM operation and then again resume the AES operation. This shows only the slightest (4%) improvement in the performance but that too when data length > 2K bytes and this approach involves some module intermixing in the code (GCM references in AES module) that I think would not be a good practice.

But I am not sure this approach is desirable, given that it involves module intermixing in the code (GCM references in the AES module) and gives only a slight improvement in performance.

edit: Also, on revisiting the above approach: when the input and output buffers are the same, the AES and GCM operations are not independent of each other, so I think the approach could fail.
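
The in-place hazard can be shown with a small host-side sketch of the decrypt path, where the tag is computed over the input (ciphertext) buffer. All `fake_` functions are hypothetical stand-ins, not ESP-IDF APIs; the fake DMA completes immediately, which models the worst case in which the peripheral has already overwritten the buffer before the CPU hashes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for starting a DMA-driven AES pass: NOT real AES. */
static void fake_aes_dma_start(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) out[i] = in[i] ^ 0x5A;
}

/* Hypothetical stand-in for waiting on DMA completion; nothing to wait for here. */
static void fake_aes_dma_wait_done(void) {}

/* Stand-in for GHASH over the ciphertext: NOT real GHASH. */
static uint32_t fake_ghash(uint32_t acc, const uint8_t *ct, size_t len)
{
    for (size_t i = 0; i < len; i++) acc = acc * 31u + ct[i];
    return acc;
}

/* Decrypt schedule with a guard for in-place buffers.
 * Note: in == out only detects exact aliasing, not partial overlap. */
static uint32_t decrypt_overlapped(const uint8_t *in, uint8_t *out, size_t len)
{
    assert(in != NULL && out != NULL);
    uint32_t tag = 0;
    if (in == out) {
        /* In place: hash first, because decryption overwrites the ciphertext. */
        tag = fake_ghash(tag, in, len);
        fake_aes_dma_start(in, out, len);
        fake_aes_dma_wait_done();
    } else {
        fake_aes_dma_start(in, out, len);
        tag = fake_ghash(tag, in, len); /* CPU work while the peripheral runs */
        fake_aes_dma_wait_done();
    }
    return tag;
}

/* Self-check: in-place and separate-buffer calls must agree on tag and output. */
static bool in_place_matches_separate(void)
{
    uint8_t ct[32], work[32], out[32];
    for (size_t i = 0; i < sizeof ct; i++) ct[i] = (uint8_t)i;
    memcpy(work, ct, sizeof ct);
    uint32_t t1 = decrypt_overlapped(ct, out, sizeof ct);
    uint32_t t2 = decrypt_overlapped(work, work, sizeof work);
    return t1 == t2 && memcmp(out, work, sizeof out) == 0;
}
```

With the guard removed, the in-place call in this model would hash plaintext instead of ciphertext and produce a different tag. The guard avoids that, but the in-place case then runs sequentially and gains nothing from the overlap.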

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Nov 12, 2024
@Harshal5
Collaborator

Closing this issue for now; please feel free to reopen it if needed.

@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Won't Do This will not be worked on and removed Status: In Progress Work is in progress labels Nov 21, 2024