[HOLD for payment 2023-07-03] Investigate why some changes didn't make it to staging as expected #10214
Comments
cc @Expensify/mobile-deployers – @johnmlee101, @rafecolton, @yuwenmemon, and I spent a few hours trying to figure this out this morning. We made some good progress but didn't quite get it worked out. It seems like a pretty extreme edge case, but the main moral for now is be careful with your cherry-picks, particularly with reverts. Hopefully we'll have more tangible guidance or improvements here soon. |
cc @kidroca in case you're interested in investigating |
Found another problem scenario in this workflow run:
|
I'm able to reproduce the initial problem in a fresh repo. Added unit tests covering this here: #10316 Now need to figure out how to fix this |
I believe this article basically explains exactly what is happening to us |
This (https://devblogs.microsoft.com/oldnewthing/20180314-00/?p=98235) is the next article in the series where the author describes how to fix the problem. I'm not immediately sure how to do that in code in our automation, but given that they are describing the exact problem we are experiencing, I think this is the solution. |
Oh man, that article ends with more questions than it begins with!
|
It's really hard to create a TL;DR of this problem because it's pretty convoluted, but these diagrams are pretty good. Note that in our example, our equivalent of
Before (with cherry-pick)
After (by merging into a "helper" branch)
So I pushed a number of changes to this PR to:
If you check out that PR, you can run the tests locally with: If you comment out the assertions in the tests that look like this: assert_equal "$output" "[ '7', '5', '1' ]" Then you'll see the solution outlined in the article indeed fixes the scenario where a revert PR is persisted on |
Okay, I believe I know how to fix this. I included some diagrams to explain the fix in the draft PR. I also got tests passing but need to implement the changes in the actual GitHub Workflows. Also I just want to carry this out with more examples to try and see if we can find any other cases where it doesn't work. For example, I haven't yet explicitly tested this slightly different scenario that's currently blocking deploys. I'm also not sure if there's anything we'll need to do to get The TL;DR of this solution is that we actually merge the temporary CP branch into both main and staging, instead of just staging. That way, it becomes a more recent common ancestor of both those branches and fixes the later merge that happens when main is merged back into staging. One sort of annoying side-effect of this is that it will mean one more |
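For anyone following along, here is a minimal sketch of that TL;DR in a throwaway repo. It is not the actual GitHub Workflow code: the branch name cherry-pick-staging, the file names, and the commit messages are all made up for illustration. The only point is that merging the temporary CP branch into both staging and main lets the later main-into-staging merge keep the re-added change.

```shell
#!/bin/bash
# Sketch only: a throwaway repo, hypothetical branch and file names.
set -e

REPO=$(mktemp -d)
cd "$REPO"
git init -q -b main
git config user.email "sketch@example.com"
git config user.name "Sketch"

CONTENT_WITH_CHANGE=$'Prepended content\nsome content\nAppended content'

# Initial commit, then create staging at the same commit
echo "some content" > myFile.txt
git add myFile.txt
git commit -q -m "Initial commit"
git branch staging

# Initial PR on main, deployed to staging with a fast-forward merge
echo "$CONTENT_WITH_CHANGE" > myFile.txt
git commit -q -am "Append and prepend content"
git checkout -q staging
git merge -q main

# Unrelated change on main, followed by the revert
git checkout -q main
echo "some content" > anotherFile.txt
git add anotherFile.txt
git commit -q -m "Create another file"
echo "some content" > myFile.txt
git commit -q -am "Revert append and prepend"
REVERT_COMMIT=$(git rev-parse HEAD)

# Cherry-pick the revert onto a temporary branch created from staging
git checkout -q -b cherry-pick-staging staging
git cherry-pick "$REVERT_COMMIT" > /dev/null

# Merge the temporary branch into staging AND into main
git checkout -q staging
git merge -q --no-edit cherry-pick-staging
git checkout -q main
git merge -q --no-edit cherry-pick-staging

# Re-apply the previously reverted change on main
echo "$CONTENT_WITH_CHANGE" > myFile.txt
git commit -q -am "Append and prepend content again"

# Next staging deploy: merge main into staging
git checkout -q staging
git merge -q --no-edit main
STAGING_CONTENT=$(cat myFile.txt)
echo "myFile.txt on staging:"
echo "$STAGING_CONTENT"

# Clean up the throwaway repo
cd - > /dev/null
rm -rf "$REPO"
```

Because the cherry-picked commit is now an ancestor of both branches, the final merge is a plain fast-forward, so staging ends up with exactly what is on main.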
My concern with this is that we're digging so deep into the weeds that we're ("we" as in, anyone that isn't Rory 😁 ) never going to be able to maintain this or fix it if it breaks. Thinking back to web-e deploying, we've never run into this problem and that's a deploy that's been pretty stable for many years (with a CP process to both staging and production branches). So, at some point... do we sit back and ask ourselves the question: Is this still the right direction to be pushing ourselves in? GitHub actions work great for simple repos (@AndrewGable loved implementing them on comp, k2, etc.), but when it comes to very complex deploy flows like this, I think that's when the complexity begins to be more than we have the capacity to manage. I know that doesn't really help to solve this particular issue, and I'm still very lost on it, but I can't escape the feeling that we've dug ourselves into a hole and we just keep digging deeper :( |
@tgolen Feel free to correct me if I'm misinterpreting, but I think that there's an underlying assumption in your concern – namely, some combination of the following:
But I don't think that's true. As a matter of fact, the Git logic in the NewDot deploy process mirrors the Git logic in the Web-E deploy process pretty closely. To demonstrate that I believe this same problem can occur in Web-Expensify, I wrote a little script. You'll notice that it doesn't use any flags to automatically resolve conflicts, and when you run it you'll see that there are no manual conflicts to resolve either:
#!/bin/bash
# Fail immediately if there is an error thrown
set -e
BLUE=$'\e[1;34m'
RED=$'\e[1;31m'
RESET=$'\e[0m'
function info {
echo "$BLUE$1$RESET"
}
function error {
echo "💥 $RED$1$RESET"
}
info "Creating repo"
mkdir ~/DumDumRepo
cd ~/DumDumRepo
git init -b main
echo "some content" >> myFile.txt
git add myFile.txt
git commit -m "Initial commit"
info
info "Creating staging branch"
git checkout -b staging
git checkout main
info
info "Appending and prepending content (initial PR)"
printf "Before:\n\n$(cat myFile.txt)"
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
printf "\n\nAfter:\n\n$(cat myFile.txt)\n"
info
info "Committing change to main"
git add myFile.txt
git commit -m "Append and prepend content"
info
info "Merging main into staging"
git checkout staging
git merge main
info
info "Making an unrelated change on main"
git checkout main
printf "File list before:\n"
ls
echo "some content" >> anotherFile.txt
git add anotherFile.txt
git commit -m "Create another file"
printf "File list after:\n"
ls
info
info "Reverting the append + prepend on main"
printf "Before:\n\n$(cat myFile.txt)\n"
echo "some content" > myFile.txt
printf "\nAfter:\n\n$(cat myFile.txt)\n"
git add myFile.txt
git commit -m "Revert append and prepend"
info
info "Cherry-picking the revert to staging"
REVERT_COMMIT=$(git rev-parse HEAD)
git checkout staging
git cherry-pick "$REVERT_COMMIT"
info
info "Verifying that the revert is present on staging, but the unrelated change is not"
printf "myFile.txt on staging:\n\n$(cat myFile.txt)"
printf "\n\nFile list on staging:\n\n"
ls
if [[ "$(cat myFile.txt)" != "some content" ]]; then
error "Revert did not make it to staging"
fi
if [[ -f anotherFile.txt ]]; then
error "Unrelated change made it to staging"
fi
info
info "Repeating previously reverted append + prepend on main"
git checkout main
printf "Before:\n\n$(cat myFile.txt)\n\n"
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
git add myFile.txt
git commit -m "Append and prepend content again"
printf "\n\nAfter:\n\n$(cat myFile.txt)\n"
info
info "Merging main into staging"
git checkout staging
git merge main
info
info "Verifying that the append + prepend (just added back on main) are now present on staging"
printf "myFile.txt on staging:\n\n$(cat myFile.txt)"
if [[ "$(cat myFile.txt)" != *"Prepended"* ]]; then
error "Prepended content not present on staging"
fi
if [[ "$(cat myFile.txt)" != *"Appended"* ]]; then
error "Appended content not present on staging"
fi
info
info "Verifying that previously unrelated change is now present on staging"
printf "File list on staging:\n\n"
ls
if [[ ! -f anotherFile.txt ]]; then
error "Other file is not present on staging"
fi
info
info "Cleaning up..."
cd ..
rm -rf ~/DumDumRepo
While this script is kind of long, you'll see that it demonstrates this problem:
I encourage creating the script locally and running it to see the output. Unless I'm missing something, all this same Git logic is present in Web-Expensify, and the fact that a person is there to resolve conflicts won't really make a difference, because there are no conflicts. EDIT: In case anyone prefers an abbreviated version of the above script without so many logs, here ya go:
#!/bin/bash
mkdir ~/DumDumRepo
cd ~/DumDumRepo
git init -b main
echo "some content" >> myFile.txt
git add myFile.txt
git commit -m "Initial commit"
git checkout -b staging
git checkout main
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
git add myFile.txt
git commit -m "Append and prepend content"
git checkout staging
git merge main
git checkout main
echo "some content" >> anotherFile.txt
git add anotherFile.txt
git commit -m "Create another file"
echo "some content" > myFile.txt
git add myFile.txt
git commit -m "Revert append and prepend"
REVERT_COMMIT=$(git rev-parse HEAD)
git checkout staging
git cherry-pick "$REVERT_COMMIT"
if [[ "$(cat myFile.txt)" != "some content" ]]; then
echo "Error: Revert did not make it to staging"
fi
if [[ -f anotherFile.txt ]]; then
echo "Error: Unrelated change made it to staging"
fi
git checkout main
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
git add myFile.txt
git commit -m "Append and prepend content again"
git checkout staging
git merge main
if [[ "$(cat myFile.txt)" != *"Prepended"* ]]; then
echo "Error: Prepended content not present on staging"
fi
if [[ "$(cat myFile.txt)" != *"Appended"* ]]; then
echo "Error: Appended content not present on staging"
fi
if [[ ! -f anotherFile.txt ]]; then
echo "Error: Other file is not present on staging"
fi
cd ..
rm -rf ~/DumDumRepo
|
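One way to see why there are no conflicts is to look at the merge base right before the final merge. The sketch below rebuilds the same scenario in a temporary directory (same file names as the scripts above; the variable names are just for illustration) and compares myFile.txt on each branch against the merge base. Git's three-way merge keeps whichever side differs from the base, so when main looks unchanged and staging looks changed, staging's reverted version wins silently.

```shell
#!/bin/bash
# Sketch only: rebuilds the scenario in a temp dir and inspects the merge base.
set -e

REPO=$(mktemp -d)
cd "$REPO"
git init -q -b main
git config user.email "sketch@example.com"
git config user.name "Sketch"

CONTENT_WITH_CHANGE=$'Prepended content\nsome content\nAppended content'

echo "some content" > myFile.txt
git add myFile.txt
git commit -q -m "Initial commit"
git branch staging

echo "$CONTENT_WITH_CHANGE" > myFile.txt
git commit -q -am "Append and prepend content"
git checkout -q staging
git merge -q main

git checkout -q main
echo "some content" > anotherFile.txt
git add anotherFile.txt
git commit -q -m "Create another file"
echo "some content" > myFile.txt
git commit -q -am "Revert append and prepend"
REVERT_COMMIT=$(git rev-parse HEAD)

git checkout -q staging
git cherry-pick "$REVERT_COMMIT" > /dev/null

git checkout -q main
echo "$CONTENT_WITH_CHANGE" > myFile.txt
git commit -q -am "Append and prepend content again"

# The merge base is the commit where staging was last fast-forwarded to main
BASE=$(git merge-base main staging)

# git diff --quiet exits 0 when there is no difference
if git diff --quiet "$BASE" main -- myFile.txt; then MAIN_CHANGED=no; else MAIN_CHANGED=yes; fi
if git diff --quiet "$BASE" staging -- myFile.txt; then STAGING_CHANGED=no; else STAGING_CHANGED=yes; fi
echo "myFile.txt changed on main since merge base: $MAIN_CHANGED"
echo "myFile.txt changed on staging since merge base: $STAGING_CHANGED"

# The three-way merge therefore keeps staging's (reverted) version
git checkout -q staging
git merge -q --no-edit main
STAGING_CONTENT=$(cat myFile.txt)
echo "myFile.txt on staging after merge: $STAGING_CONTENT"

cd - > /dev/null
rm -rf "$REPO"
```

From Git's point of view, main re-adding the content just puts the file back to what it was at the merge base, so only the cherry-picked revert on staging counts as a change.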
Thanks, Rory! I totally understand that the problem you're running into can be reproduced on ANY repository. It's not a problem that's specific to a repo. So, while I believe you that the script reproduces the bug, I don't think it's an accurate test. Can you reproduce the bug using the web-e deploy process? |
Primarily, I think one of the disconnects between the bug and the web-e deploy process might be your step 7:
We don't merge main into staging during the deploy process. We create a new copy of staging based on main (and destroy the previous version of staging). Typically referred to as the "branch dance". |
Very astute observation! This does in fact prevent the issue we're seeing here! You can see that by changing the above script(s) to replace:
- git checkout staging
- git merge main
+ git branch -D staging
+ git checkout -b staging
And looking at the Git chart it makes sense why:
So my next question is: How do we manage that "branch dance" in Web-Expensify? Does it require input from someone in Ring 0 to delete the (presumably protected?) staging branch? If we can use the same process in NewDot it would be a valuable simplification and solve this problem to boot |
Ah, cool! I believe the way it works is that the deployer group is given permissions on the repo to force push to staging and production. Then the script does a forced push of the current The force push is applied here: https://github.com/Expensify/PHP-Libs/blob/main/src/Deploy/Deployify.php#L199 I think the force parameter is passed from here: https://github.com/Expensify/Web-Expensify/blob/20ba8c1e41eeae72ba158c7fd8173a09cd777475/_tools/deploy/lib.sh#L120-L121 |
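To make the "branch dance" concrete, here is a rough sketch against a local bare repo standing in for the remote. This is an assumption-laden illustration, not what Deployify.php or lib.sh actually does: the directory layout, file names, and commit messages are invented, and the only real pieces are the standard git branch -D, git checkout -b, and git push --force commands.

```shell
#!/bin/bash
# Sketch only: a local bare repo plays the role of the remote.
set -e

WORK=$(mktemp -d)
git init -q --bare "$WORK/remote.git"
git init -q -b main "$WORK/local"
cd "$WORK/local"
git config user.email "sketch@example.com"
git config user.name "Sketch"
git remote add origin "$WORK/remote.git"

# main and staging start from the same commit
echo "some content" > myFile.txt
git add myFile.txt
git commit -q -m "Initial commit"
git push -q origin main

# staging picks up a commit that main does not have (e.g. a cherry-pick)
git checkout -q -b staging
echo "staging only" > cherryPicked.txt
git add cherryPicked.txt
git commit -q -m "Cherry-picked change on staging"
git push -q origin staging

# main moves on
git checkout -q main
echo "more content" >> myFile.txt
git commit -q -am "New work on main"

# The branch dance: throw away staging and recreate it from main
git branch -q -D staging
git checkout -q -b staging

# The recreated branch does not descend from the old remote staging,
# so the push has to be forced
git push -q --force origin staging

LOCAL_MAIN=$(git rev-parse main)
REMOTE_STAGING=$(git --git-dir="$WORK/remote.git" rev-parse refs/heads/staging)
echo "main:           $LOCAL_MAIN"
echo "remote staging: $REMOTE_STAGING"

cd - > /dev/null
rm -rf "$WORK"
```

Since staging is rebuilt from main on every deploy, there is never a merge between the two branches, so the stale merge base problem cannot come up. The trade-off is that anything that existed only on the old staging branch is discarded.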
@roryabraham I haven't dug that deep into the problem, but could a thing like enforcing PRs to be rebased onto I've been asked to rebase my PR when working on other open source projects Rebase keeps a linear history and this should help with something being cherry picked after merge, since it would be picking from a "straight line". The branch to be merged will be properly up to date with any changes in The problem with this strategy is that merging something to |
After discussing in slack we are going to:
|
This is an important fix we need to make, but we're currently prioritizing WAQ and product bugs. |
Taking this off HOLD and treating it as my top priority |
Triggered auto assignment to @nathanmetcalf ( |
@nathanmetcalf the GitHub settings changes we need were discussed here. |
@nathanmetcalf friendly bump so we can hopefully roll out the fix for this tomorrow 🙂 |
Updated the rules as per Rory's New. messages. |
Seems my instructions caused some problems. Taking this over to https://github.com/Expensify/Expensify/issues/294557 for auditing purposes. |
This is fixed. Most of the testing + context is in this slack thread |
The solution for this issue has been 🚀 deployed to production 🚀 in version 1.3.32-5 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue: If no regressions arise, payment will be issued on 2023-07-03. 🎊 After the hold period is over and BZ checklist items are completed, please complete any of the applicable payments for this issue, and check them off once done.
As a reminder, here are the bonuses/penalties that should be applied for any External issue:
|
Context
Problem
There is a very obscure edge case with our git CI that caused some changes on main to not be correctly propagated to staging during a staging deploy. There were a few affected files, but the one we've focused on as our case study is src/pages/settings/Payments/PaymentsPage/BasePaymentsPage.js. Here's my best attempt at summarizing the history of this file.
OfflineIndicator was added in cc59b0a (#9589).
1.1.86-0
BasePaymentsPage.
1.1.86-2 in ffd8581 (#10115)
OfflineIndicator was added back in f9cf909 (#10135)
At this point, main contained the OfflineIndicator in BasePaymentsPage, but staging did not. However, when main was merged into staging, the usage of OfflineIndicator was carried over to staging but the import was not.
In order to reproduce this locally and see it in action:
Then look at the diff between update-fake-staging-from-fake-main:
You would think both of these changes would be carried over, but then run:
And you'll see that the import is missing.
Solution
TBD, but we have some additional observations/hypotheses:
staging, we somehow muddled the history such that git thinks that change (ffd8581) is newer than the change on main that added it back (f9cf909).
main the commits from the revert PR are part of a linear history, while on staging the merge commit is included and is treated as a separate branch?
cherry-pick command that's used to see if we can keep these histories in line and prevent this problem.