Test results depend on Mac device #109
Comments
Same happens to me. I recorded all reference images on Catalina, but when running the same tests in a Travis environment with macOS 10.14, they fail on the diffs, which look like shadows with different contrast.
Same issue here. Any idea about the reason?
It seems to me the reason is that in Xcode 11.x Apple migrated Simulator rendering from the CPU to the GPU. GPU calculations are faster but less exact, and their results may depend on the hardware GPU vendor and drivers. I spent a lot of time on this, but the only workaround that works for me is using a non-zero tolerance. See also this pull request: #108
@navartis Why are you using the
We didn't find a good solution on our side yet, but we noticed one thing: the random diff results seem to come from views containing transparent assets (images with an alpha channel not set to 1).
@luiz-brasil Unfortunately, '6/256' is a magic number. I hate magic numbers in code, but in this case I have no better ideas.
We're also seeing this issue. Interestingly, it seems to only occur when using `usesDrawViewHierarchyInRect` or when snapshotting during a UI test (not a unit test), so it could be related to being rendered in a UIWindow. I also got the same results between two Mac Pros (i.e. identical machines produce the same result). Could this be graphics-card related, and is it time to raise a radar?
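For reference, a minimal sketch of the configuration being discussed, assuming the `FBSnapshotTestCase` base class from iOSSnapshotTestCase and its `usesDrawViewHierarchyInRect` property; the test class and view below are purely illustrative:

```swift
import FBSnapshotTestCase
import UIKit

// Sketch: enabling usesDrawViewHierarchyInRect makes the snapshot render via
// drawHierarchy(in:afterScreenUpdates:) instead of layer rendering, which is
// the code path some commenters associate with the nondeterministic diffs.
final class WindowRenderingSnapshotTests: FBSnapshotTestCase { // hypothetical test class
    override func setUp() {
        super.setUp()
        recordMode = false // flip to true once to (re)record reference images
        usesDrawViewHierarchyInRect = true
    }

    func testSimpleView() {
        let view = UIView(frame: CGRect(x: 0, y: 0, width: 100, height: 100))
        view.backgroundColor = .systemBlue
        FBSnapshotVerifyView(view)
    }
}
```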
I'm so glad that I stumbled across this thread, because I think I'm seeing the same thing. About a week ago, I started noticing failures in some of my iOS snapshot tests. These tests passed locally, both when run via Xcode and via Fastlane, but failed when run on CI. I'm still trying to figure out which OS our build machine is using (though I suspect it might be 10.14.x), but it just so happens that I updated my machine to Catalina last week, and now I'm getting these failures. I was able to capture the diff screenshots after running the tests on CI, and they're gray. I asked a designer to take one and bump up the contrast, and only with a massive increase in contrast were we able to see the slightest traces of some screen elements. (And I really had to squint to see them, even after the adjustment.) Seems like the workaround at the moment is to bump up the tolerance. Has anyone found a different solution?
Just to confirm - is everyone else seeing that this happens not only between 10.14 and 10.15 (which is normal for any upgrade) but also between machines that are running the same version of macOS Catalina (10.15)? Finally, has anyone got a simple test case that could be sent to Apple?
@samskiter Unfortunately I don't have a specific test case, primarily because these errors occur randomly. I can confirm that the error occurs between different machines running the same version of macOS Catalina.
@samskiter Results seem to be GPU dependent. In the simulator you can get different results by using the integrated or discrete GPU. One of the Apple engineers who works on the simulator said the way forward is to increase the pixel-matching tolerance. Conversation on Twitter. Hope that helps.
😢
Wow, Apple just broke an entire tool. Thanks, Apple.
Because of this issue, what values are teams using for `perPixelTolerance` and `overallTolerance`?
I'm running a little experiment on a few tests. I'll have to follow up tomorrow to let you know if it works for all of my tests, but so far I'm having luck with it. For almost all screens, a per-pixel tolerance fixed most of my failures, but screens that use transparent elements still fail. For screens with transparent elements (where alpha < 1), adding an overall tolerance is actually passing for me. Again, these successes are based on a very small sampling of tests that are currently passing. Nearly all screens are being verified with the per-pixel value, as we don't use a whole lot of transparency. I'll update this post tomorrow when I know whether or not it's a success.
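To make the approach above concrete, here is a hedged sketch of what such calls can look like with the Swift helper in iOSSnapshotTestCase 6.2+, which exposes `perPixelTolerance` and `overallTolerance` on `FBSnapshotVerifyView`; the test class, views, and exact values below are illustrative assumptions, not recommendations:

```swift
import FBSnapshotTestCase
import UIKit

final class ToleranceSnapshotTests: FBSnapshotTestCase { // hypothetical test class
    // Mostly opaque screens: a per-pixel tolerance absorbs small per-channel
    // color differences caused by GPU rendering on different machines.
    func testOpaqueView() {
        let view = UIView(frame: CGRect(x: 0, y: 0, width: 375, height: 200))
        view.backgroundColor = .white
        FBSnapshotVerifyView(view, perPixelTolerance: 6.0 / 256.0)
    }

    // Screens containing transparency (alpha < 1): additionally allow a small
    // fraction of pixels to differ outright via an overall tolerance.
    func testTranslucentView() {
        let view = UIView(frame: CGRect(x: 0, y: 0, width: 375, height: 200))
        view.backgroundColor = UIColor.black.withAlphaComponent(0.5)
        FBSnapshotVerifyView(view, perPixelTolerance: 6.0 / 256.0, overallTolerance: 0.02)
    }
}
```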
@JustinDSN Unfortunately, I'm going to have to spend more time on this until I know exactly which values are going to work for us. The per-pixel value of 6/256 is working for most screens, but the overall tolerance is giving me problems. It seems like screens that incorporate transparency or use the fullscreen modal style that was new to iOS 13 must be compared using an overall tolerance. I've seen the tests pass locally with an overall tolerance value as low as 0.02, but when I run the tests on CI, even 0.25 isn't enough for one particular screen. (For the others, it's fine.) I'm trying a value of 0.30 on CI right now to see what happens. Even if that passes, it's way too high a threshold to give me any sort of confidence that this is an effective and worthwhile comparison. (Update: it failed at 0.30 too, just for that particular screen.) To make things a little more fun, I do have a screen that definitely uses transparency and passes perfectly with a per-pixel comparison. It's nearly identical to the one screen that I can't get a passing comparison on using either per-pixel or overall tolerance. (Mentioned in the previous paragraph.) Our CI build machine uses an older version of Xcode than I do, which isn't ideal, of course, so maybe that's part of it? I'm still working with the same small sampling of tests. I'll scale this up and update this thread again as soon as I figure out the floor of this overall tolerance threshold and apply it to more tests.
Out of interest, does going up from 6/256 also fix your issues without having to use an overall tolerance?
@samskiter It fixed the problem in some places, but not all. Also, my results are based on images that I took from my machine as the reference images; I don't know if the result would change if a colleague had recently uploaded a reference image. Despite some successes with the per-pixel comparison, I'm just not confident in these tests anymore. I've already spent too much time trying to salvage them, and it's just not sustainable. This is a huge letdown because they were really valuable to our organization. In fact, we caught a bug yesterday morning in a screen where the comparison is still working. Short answer to your question: no. At the very least, screens where some form of transparency is used are definitely going to give you problems with a per-pixel tolerance measurement.
That's frustrating. We also depend heavily on these!
Were folks able to find any solution or workaround for this?
No solution yet to this problem? Just add
This adds variants of the `SnapshotVerify*(...)` methods that allow for imprecise comparisons, i.e. using the `perPixelTolerance` and `overallTolerance` parameters.

## Why is this necessary?

Adding tolerances has been a highly requested feature (see #63) to work around some simulator changes introduced in iOS 13. Historically the simulator supported CPU-based rendering, giving us very stable image representations of views that we can compare pixel by pixel. Unfortunately, with iOS 13, Apple changed the simulator to use exclusively GPU-based rendering, which means that the resulting snapshots may differ slightly across machines (see uber/ios-snapshot-test-case#109). The negative effects of this were mitigated in iOSSnapshotTestCase by adding two tolerances to snapshot comparisons: a **per-pixel tolerance** that controls how close in color two pixels need to be to count as unchanged, and an **overall tolerance** that controls what portion of pixels between two images need to be the same (based on the per-pixel calculation) for the images to be considered unchanged. Setting these tolerances to non-zero values enables engineers to record tests on one machine and run them on another (e.g. record new reference images on their laptop and then run tests on CI) without worrying about the tests failing due to differences in GPU rendering.

This is great in theory, but in our testing we've found that even the lowest tolerance values that consistently handle GPU differences between machine types let through a significant number of visual regressions. In other words, there is no magic tolerance threshold that absorbs differences caused by GPU rendering while still catching minor visual regressions.

This is especially true for accessibility snapshots. To start, tolerances seem to be more reliable when applied to relatively small snapshot images, but accessibility snapshots tend to be fairly large since they include both the view and the legend. Additionally, the text in the legend can change meaningfully while reflecting only a small number of pixel changes. For example, I ran a test of a full-screen snapshot on an iPhone 12 Pro with two columns of legend. Even an overall tolerance of only `0.0001` (0.01%) was enough to let through a regression where one of the elements lost its `.link` trait (represented by the text "Link." appended to the element's description in the snapshot). But this low a tolerance _wasn't_ enough to handle the GPU rendering differences between a MacBook Pro and a Mac Mini. This is a simplified example since it only uses `overallTolerance`, not `perPixelTolerance`, but we've found many similar situations arise even with the combination.

Some teams have developed infrastructure to run snapshots on the same hardware consistently and have built a developer process around that infrastructure, but many others have accepted tolerances as a necessity today.

## Why create separate "imprecise" variants?

The simplest approach to adding tolerances would be adding the `perPixelTolerance` and `overallTolerance` parameters to the existing snapshot methods; however, I feel adding separate methods with an "imprecise" prefix is better in the long run. The naming is motivated by the idea that **it needs to be very obvious when what you're doing might result in unexpected/undesirable behavior**. In other words, when using one of the core snapshot methods, you should have extremely high confidence that a passing test means there are no regressions.

When you use an "imprecise" variant, it's up to you to set your confidence level according to your chosen tolerances. This is similar to the "unsafe" terminology around memory in the Swift API: you should generally feel very confident in the memory safety of your code, but any time you see "unsafe" it's a sign to be extra careful and not gather unwarranted confidence from the compiler.

Longer term, I'm hopeful we can find alternative comparison algorithms that allow for GPU rendering differences without opening the door to regressions. We can integrate these into the core snapshot methods as long as they do not introduce opportunities for regressions, or add additional comparison variants to iterate on different approaches.
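As an illustration of the split being proposed (not the actual implementation in this PR), here is a minimal sketch of how an "imprecise" variant could wrap the tolerance parameters while leaving the precise methods untouched; the method name `impreciseSnapshotVerifyView` and its defaults are hypothetical, and the sketch assumes the `FBSnapshotVerifyView` Swift helper from iOSSnapshotTestCase 6.2+:

```swift
import FBSnapshotTestCase
import UIKit

extension FBSnapshotTestCase {
    /// Hypothetical "imprecise" variant: tolerances are only exposed here, so any
    /// call site that accepts rendering differences is explicitly marked as such.
    func impreciseSnapshotVerifyView(
        _ view: UIView,
        identifier: String = "",
        perPixelTolerance: CGFloat = 0,
        overallTolerance: CGFloat = 0,
        file: StaticString = #file,
        line: UInt = #line
    ) {
        // Delegate to the underlying comparison with the requested tolerances.
        FBSnapshotVerifyView(
            view,
            identifier: identifier,
            perPixelTolerance: perPixelTolerance,
            overallTolerance: overallTolerance,
            file: file,
            line: line
        )
    }
}
```

With this shape, a precise call site can never compile with a tolerance argument, which is the "obvious at the call site" property the description above argues for.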
After updating macOS to Catalina 10.15 and Xcode to 11.1 (11A1027),
I found that tests run on other Mac devices fail against reference images saved on my Mac.
The latest stable version of the library, 6.2.0, is being used.
I spent two days trying to find the reason and fix it, implemented a "Highlight Different Pixels" feature, and created pull request #108, which can help investigate such situations.
But at this time I have no idea how to fix this issue.
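For context, here is a minimal sketch (not the actual implementation in #108) of the kind of "Highlight Different Pixels" comparison described above: it compares two same-sized images byte by byte and returns a copy of the reference image with differing pixels drawn in red. The function name and rendering choices are illustrative.

```swift
import UIKit

/// Sketch: returns the reference image with every differing pixel painted red,
/// or nil if the images cannot be compared (different sizes, no CGImage).
func highlightDifferentPixels(reference: UIImage, test: UIImage) -> UIImage? {
    guard let refCG = reference.cgImage, let testCG = test.cgImage,
          refCG.width == testCG.width, refCG.height == testCG.height else {
        return nil
    }

    let width = refCG.width
    let height = refCG.height
    let bytesPerPixel = 4
    let bytesPerRow = width * bytesPerPixel
    let colorSpace = CGColorSpaceCreateDeviceRGB()
    let bitmapInfo = CGImageAlphaInfo.premultipliedLast.rawValue

    // Render an image into a raw RGBA buffer so it can be compared byte by byte.
    func rgbaBytes(of image: CGImage) -> [UInt8] {
        var buffer = [UInt8](repeating: 0, count: height * bytesPerRow)
        buffer.withUnsafeMutableBytes { ptr in
            if let context = CGContext(data: ptr.baseAddress, width: width, height: height,
                                       bitsPerComponent: 8, bytesPerRow: bytesPerRow,
                                       space: colorSpace, bitmapInfo: bitmapInfo) {
                context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
            }
        }
        return buffer
    }

    let refPixels = rgbaBytes(of: refCG)
    let testPixels = rgbaBytes(of: testCG)

    // Paint any pixel whose RGBA bytes differ as opaque red.
    var highlighted = refPixels
    for i in stride(from: 0, to: refPixels.count, by: bytesPerPixel) {
        let differs = (0..<bytesPerPixel).contains { refPixels[i + $0] != testPixels[i + $0] }
        if differs {
            highlighted[i] = 255     // R
            highlighted[i + 1] = 0   // G
            highlighted[i + 2] = 0   // B
            highlighted[i + 3] = 255 // A
        }
    }

    let outputCG = highlighted.withUnsafeMutableBytes { ptr -> CGImage? in
        CGContext(data: ptr.baseAddress, width: width, height: height,
                  bitsPerComponent: 8, bytesPerRow: bytesPerRow,
                  space: colorSpace, bitmapInfo: bitmapInfo)?.makeImage()
    }
    return outputCG.map { UIImage(cgImage: $0) }
}
```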