VA: Add a method for performing MPIC compliant challenge validation #7794

Draft
wants to merge 15 commits into
base: main
Conversation

beautifulentropy
Member

@beautifulentropy beautifulentropy commented Nov 8, 2024

Add VA.ValidateChallenge, a new MPIC compliant gRPC method that will replace VA.PerformValidation for the validation of ACME challenges. A follow-up will add VA.CheckCAA and an RA feature flag that enables their use.

Part of #7615
Part of #7614
Part of #7616


Ballot Summary for Reviewers

You can read the full ballot contents here. I have pulled together a summary below:

3.2.2.9 Multi-Perspective Issuance Corroboration

... Furthermore, for any pair of DNS resolvers used on a Multi-Perspective Issuance Corroboration attempt, the straight-line distance between the two States, Provinces, or Countries the DNS resolvers reside in MUST be at least 500 km. The location of a DNS resolver is determined by the point where unencapsulated outbound DNS queries are typically first handed off to the network infrastructure providing Internet connectivity to that DNS resolver.

This PR does not attempt to satisfy the aforementioned distance requirement. This will need to be satisfied as part of the datacenter selection process for perspectives.

Table: Quorum Requirements

| # of Distinct Remote Network Perspectives Used | # of Allowed non-Corroborations |
| --- | --- |
| 2-5 | 1 |
| 6+ | 2 |
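
The table reduces to a small lookup. A minimal sketch follows, assuming a hypothetical helper name `allowedNonCorroborations` (not taken from this PR). The table defines no quorum for fewer than two remote perspectives, so the sketch reports that case as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// allowedNonCorroborations is a hypothetical helper mirroring the Quorum
// Requirements table: 2-5 remote perspectives allow 1 non-corroboration,
// 6 or more allow 2. The table does not cover fewer than 2 perspectives.
func allowedNonCorroborations(remotePerspectives int) (int, error) {
	switch {
	case remotePerspectives < 2:
		return 0, errors.New("quorum table requires at least 2 remote perspectives")
	case remotePerspectives <= 5:
		return 1, nil
	default:
		return 2, nil
	}
}

func main() {
	for _, n := range []int{1, 2, 5, 6} {
		allowed, err := allowedNonCorroborations(n)
		fmt.Println(n, allowed, err)
	}
}
```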

...

Phased Implementation Timeline

  • Effective March 15, 2025, the CA MUST implement Multi-Perspective Issuance Corroboration using at least two (2) remote Network Perspectives. The CA MAY proceed with certificate issuance if the number of remote Network Perspectives that do not corroborate the determinations made by the Primary Network Perspective ("non-corroborations") is greater than allowed in the Quorum Requirements table.
  • Effective September 15, 2025, the CA MUST implement Multi-Perspective Issuance Corroboration using at least two (2) remote Network Perspectives. The CA MUST NOT proceed with certificate issuance if the number of non-corroborations is greater than allowed in the Quorum Requirements table.
  • Effective March 15, 2026, the CA MUST implement Multi-Perspective Issuance Corroboration using at least three (3) remote Network Perspectives. The CA MUST NOT proceed with certificate issuance if the number of non-corroborations is greater than allowed in the Quorum Requirements table and if the remote Network Perspectives that do corroborate the determinations made by the Primary Network Perspective do not fall within the service regions of at least two (2) distinct Regional Internet Registries.

These requirements are satisfied by this PR.

  • Effective June 15, 2026, the CA MUST implement Multi-Perspective Issuance Corroboration using at least four (4) remote Network Perspectives. The CA MUST NOT proceed with certificate issuance if the number of non-corroborations is greater than allowed in the Quorum Requirements table and if the remote Network Perspectives that do corroborate the determinations made by the Primary Network Perspective do not fall within the service regions of at least two (2) distinct Regional Internet Registries.
  • Effective December 15, 2026, the CA MUST implement Multi-Perspective Issuance Corroboration using at least five (5) remote Network Perspectives. The CA MUST NOT proceed with certificate issuance if the number of non-corroborations is greater than allowed in the Quorum Requirements table and if the remote Network Perspectives that do corroborate the determinations made by the Primary Network Perspective do not fall within the service regions of at least two (2) distinct Regional Internet Registries.

These requirements are not satisfied by this PR. The following code will need to be updated to reject validation requests when fewer than 4 (later 5) remote VAs are configured.

func (va *ValidationAuthorityImpl) remoteValidateChallenge(ctx context.Context, req *vapb.ValidationRequest) (mpicSummary, *probs.ProblemDetails) {
	remoteVACount := len(va.remoteVAs)
	if remoteVACount < 3 {
		return mpicSummary{}, probs.ServerInternal("Insufficient remote perspectives: need at least 3")
	}

5.4.1 Types of events recorded

  1. Multi-Perspective Issuance Corroboration attempts from each Network Perspective, minimally recording the following information:
    - a. an identifier that uniquely identifies the Network Perspective used;
    - b. the attempted domain name and/or IP address; and
    - c. the result of the attempt (e.g., "domain validation pass/fail", "CAA permission/prohibition").
  2. Multi-Perspective Issuance Corroboration quorum results for each attempted domain name or IP address represented in a Certificate request (i.e., "3/4" which should be interpreted as "Three (3) out of four (4) attempted Network Perspectives corroborated the determinations made by the Primary Network Perspective").

These requirements are satisfied by this PR.
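
To make the two logging items concrete, here is a rough sketch of what a per-perspective record and a quorum string could look like. The type `perspectiveAttempt`, its field names, and `quorumResult` are hypothetical and are not the structures this PR logs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// perspectiveAttempt is a hypothetical record of one attempt, covering the
// three minimum fields listed in 5.4.1 item 1.
type perspectiveAttempt struct {
	Perspective string `json:"perspective"` // unique Network Perspective identifier
	Identifier  string `json:"identifier"`  // attempted domain name or IP address
	Result      string `json:"result"`      // e.g. "domain validation pass"
}

// quorumResult formats the quorum as "corroborated/attempted" (item 2).
func quorumResult(corroborated, attempted int) string {
	return fmt.Sprintf("%d/%d", corroborated, attempted)
}

func main() {
	attempt := perspectiveAttempt{
		Perspective: "example-perspective",
		Identifier:  "example.com",
		Result:      "domain validation pass",
	}
	line, err := json.Marshal(attempt)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(line))
	fmt.Println(quorumResult(3, 4))
}
```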

@beautifulentropy beautifulentropy force-pushed the mpic-part-two branch 2 times, most recently from d77a42b to 14a8ac5 Compare November 8, 2024 22:28
@beautifulentropy beautifulentropy changed the title WIP VA: Add a method for performing MPIC compliant challenge validation Nov 8, 2024
Contributor

@jsha jsha left a comment

This change does two things:

  1. Add logging and threshold checking required by MPIC
  2. Reimplement VA.PerformValidation as VA.ValidateChallenge, with the important distinction that VA.ValidateChallenge does not check CAA.

I remember we discussed during standup last week some challenges around multi-perspective CAA rechecking that led to (2), but I think we forgot to write down the details. I did my best to write down what I remember of it: #7808.

Looking at (2) in this PR, I'm concerned about how much near-duplication of code it results in. I'd rather do some refactoring to implement (1) in the existing code, and treat (2) as separate followup change - or possibly take a different approach entirely, like the alternatives mentioned in #7808.

If we do pursue (2) as a followup change, my goal would be to reuse as much code as possible between the two RPC methods, so we have less code, and simpler code, to reason about.

va/vampic.go Outdated
Comment on lines 39 to 47
func (va *ValidationAuthorityImpl) observeLatency(op, perspective, challType, probType, result string, latency time.Duration) {
	labels := prometheus.Labels{
		"operation":      op,
		"perspective":    perspective,
		"challenge_type": challType,
		"problem_type":   probType,
		"result":         result,
	}
	va.metrics.validationLatency.With(labels).Observe(latency.Seconds())
Contributor

This function doesn't add much vs calling Observe() directly, other than moving from named fields (across multiple lines) to positional fields in the function call. While this allows the call sites to use a single line, it makes it less obvious at the call site that the parameters are correct. The same result could also be achieved by using .WithLabelValues(op, perspective, challType, ...).

The two call sites look like this:

		va.observeLatency(challenge, va.perspective, string(chall.Type), probType, outcome, localLatency)
		if va.isPrimaryVA() {
			// Observe total validation latency (primary+remote).
			va.observeLatency(challenge, all, string(chall.Type), probType, outcome, va.clk.Since(start))

Since several of those are the same, let's use the cool .CurryWith() method to reduce duplication while still keeping the clarity of naming the labels inline:

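// CurryWith returns (ObserverVec, error); MustCurryWith panics on a bad
// label set instead, which keeps this to a single assignment.
hist := va.metrics.validationLatency.MustCurryWith(prometheus.Labels{
	"operation":      challenge,
	"challenge_type": string(chall.Type),
	"problem_type":   probType,
	"result":         outcome,
})
hist.With(prometheus.Labels{"perspective": va.perspective}).Observe(localLatency.Seconds())
if va.isPrimaryVA() {
	// Observe total validation latency (primary+remote).
	hist.With(prometheus.Labels{"perspective": all}).Observe(va.clk.Since(start).Seconds())
}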

Member Author

@beautifulentropy beautifulentropy Nov 14, 2024

#7799 adds two more call sites. This helper keeps the label boilerplate out of the way. It is less likely to result in mislabeling or missing a label altogether than .WithLabelValues(). The .CurryWith() suggestion is a nice trick though! I had forgotten that existed.

@jsha
Contributor

jsha commented Nov 14, 2024

We talked about ways to extract out some of the non-core parts of this change into their own changes so the important stuff is more readily visible. Some possibilities:

In the current code we have:

	type rvaResult struct {
		hostname string
		response *vapb.ValidationResult
		err      error
	}

	results := make(chan *rvaResult, len(va.remoteVAs))

In the new code we have:

	type response struct {
		addr   string
		result *vapb.ValidationResult
		err    error
	}

	responses := make(chan *response, remoteVACount)

That is, renaming rvaResult to response, hostname to addr, and results to responses. Those renamings seem unobjectionable to me, but we can make them in the original code first, so there are fewer diffs when comparing with the new code.

The current code counts good and bad results in integers. This PR instead accumulates them as slices; that's great! We can backport that code into PerformValidation.

The old code has:

				currProb = probs.ServerInternal("Remote PerformValidation RPC failed")

The new code has:

				currProb = probs.ServerInternal("Secondary domain validation RPC failed")

Again, a fine change but one we can make in the existing PerformValidation code.

In va_test.go there is a rename of cancelledVA to canceledVA that touches a lot of lines. I'm all for consistency of spelling, but that should be its own PR. That reduces diffs, and it also allows us to better focus on whether we hit all the spots we intend to hit. For instance there's a variable name on line 398 that needs updating to be consistent.

This PR introduces the new validation_latency metric that is similar to validation_time except that it renames the type label to challenge_type and adds operation and perspective. We can add code to PerformValidation that increments this (with operation being challenge+caa). Also, FWIW, we should put the initialization of this metric next to the initialization for validation_time and explain the difference.

Comment on lines +140 to +142
	if remoteVACount < 3 {
		return mpicSummary{}, probs.ServerInternal("Insufficient remote perspectives: need at least 3")
	}
Contributor

Since the number of remoteVAs is determined at construction time, it seems better to put this check in NewValidationAuthorityImpl.

Member Author

@beautifulentropy beautifulentropy Nov 14, 2024

Agreed, it should eventually move there. Today that change would make remote VAs a requirement to run Boulder starting in the next release.

jsha pushed a commit that referenced this pull request Nov 15, 2024
Bring this code more in line with `VA.remoteDoDCV` in #7794. This should
make these two easier to diff in review.
@jsha
Contributor

jsha commented Nov 15, 2024

Note: not a complete review, I just wanted to get some notes down before I go get the kid.

--- /tmp/old.txt	2024-11-15 14:57:20.746779025 -0800
+++ /tmp/new.txt	2024-11-15 14:58:03.728594914 -0800
@@ -1,12 +1,11 @@
-func (va *ValidationAuthorityImpl) performRemoteValidation(
-	ctx context.Context,
-	req *vapb.PerformValidationRequest,
-) *probs.ProblemDetails {
+func (va *ValidationAuthorityImpl) remoteDoDCV(ctx context.Context, req *vapb.DCVRequest) (mpicSummary, *probs.ProblemDetails) {
+	// Mar 15, 2026: MUST implement using at least 3 perspectives
+	// Jun 15, 2026: MUST implement using at least 4 perspectives
+	// Dec 15, 2026: MUST implement using at least 5 perspectives
 	remoteVACount := len(va.remoteVAs)
-	if remoteVACount == 0 {
-		return nil
+	if remoteVACount < 3 {
+		return mpicSummary{}, probs.ServerInternal("Insufficient remote perspectives: need at least 3")
 	}
-
 	type response struct {
 		addr   string
 		result *vapb.ValidationResult
@@ -15,34 +14,31 @@
 
 	responses := make(chan *response, remoteVACount)
 	for _, i := range rand.Perm(remoteVACount) {
-		go func(rva RemoteVA, out chan<- *response) {
-			res, err := rva.PerformValidation(ctx, req)
-			out <- &response{
-				addr:   rva.Address,
-				result: res,
-				err:    err,
-			}
-		}(va.remoteVAs[i], responses)
+		go func(rva RemoteVA) {
+			res, err := rva.DoDCV(ctx, req)
+			responses <- &response{rva.Address, res, err}
+		}(va.remoteVAs[i])
 	}
 
-	required := remoteVACount - va.maxRemoteFailures
 	var passed []string
 	var failed []string
+	passedRIRs := make(map[string]struct{})
+
 	var firstProb *probs.ProblemDetails
+	for i := 0; i < remoteVACount; i++ {
+		resp := <-responses
 
-	for resp := range responses {
 		var currProb *probs.ProblemDetails
-
 		if resp.err != nil {
 			// Failed to communicate with the remote VA.
 			failed = append(failed, resp.addr)
-
-			if canceled.Is(resp.err) {
-				currProb = probs.ServerInternal("Remote PerformValidation RPC canceled")
+			if errors.Is(resp.err, context.Canceled) {
+				currProb = probs.ServerInternal("Secondary domain validation RPC canceled")
 			} else {
-				va.log.Errf("Remote VA %q.PerformValidation failed: %s", resp.addr, resp.err)
-				currProb = probs.ServerInternal("Remote PerformValidation RPC failed")
+				va.log.Errf("Remote VA %q.ValidateChallenge failed: %s", resp.addr, resp.err)
+				currProb = probs.ServerInternal("Secondary domain validation RPC failed")
 			}
+
 		} else if resp.result.Problems != nil {
 			// The remote VA returned a problem.
 			failed = append(failed, resp.result.Perspective)
@@ -50,37 +46,53 @@
 			var err error
 			currProb, err = bgrpc.PBToProblemDetails(resp.result.Problems)
 			if err != nil {
-				va.log.Errf("Remote VA %q.PerformValidation returned malformed problem: %s", resp.addr, err)
-				currProb = probs.ServerInternal("Remote PerformValidation RPC returned malformed result")
+				va.log.Errf("Remote VA %q.ValidateChallenge returned a malformed problem: %s", resp.addr, err)
+				currProb = probs.ServerInternal("Secondary domain validation RPC returned malformed result")
 			}
+
 		} else {
 			// The remote VA returned a successful result.
 			passed = append(passed, resp.result.Perspective)
+			passedRIRs[resp.result.Rir] = struct{}{}
 		}
 
 		if firstProb == nil && currProb != nil {
 			// A problem was encountered for the first time.
 			firstProb = currProb
 		}
+	}
+
+	// Prepare the summary, this MUST be returned even if the validation failed.
+	summary := prepareSummary(passed, failed, passedRIRs, remoteVACount)
 
-		if len(passed) >= required {
-			// Enough successful responses to reach quorum.
-			return nil
+	maxRemoteFailures := maxValidationFailures(remoteVACount)
+	if len(failed) > maxRemoteFailures {
+		// Too many failures to reach quorum.
+		if firstProb != nil {
+			firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+			return summary, firstProb
 		}
-		if len(failed) > va.maxRemoteFailures {
-			// Too many failed responses to reach quorum.
-			firstProb.Detail = fmt.Sprintf("During secondary validation: %s", firstProb.Detail)
-			return firstProb
+		return summary, probs.ServerInternal("Secondary domain validation failed due to too many failures")
+	}
+
+	if len(passed) < (remoteVACount - maxRemoteFailures) {
+		// Too few successful responses to reach quorum.
+		if firstProb != nil {
+			firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+			return summary, firstProb
 		}
+		return summary, probs.ServerInternal("Secondary domain validation failed due to insufficient successful responses")
+	}
 
-		// If we somehow haven't returned early, we need to break the loop once all
-		// of the VAs have returned a result.
-		if len(passed)+len(failed) >= remoteVACount {
-			break
+	if len(passedRIRs) < 2 {
+		// Too few successful responses from distinct RIRs to reach quorum.
+		if firstProb != nil {
+			firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+			return summary, firstProb
 		}
+		return summary, probs.Unauthorized("Secondary domain validation failed to receive enough corroborations from distinct RIRs")
 	}
 
-	// This condition should not occur - it indicates the passed/failed counts
-	// neither met the required threshold nor the maxRemoteFailures threshold.
-	return probs.ServerInternal("Too few remote PerformValidation RPC results")
+	// Enough successful responses from distinct RIRs to reach quorum.
+	return summary, nil
 }

Now that some of the formatting and naming changes have merged to main, I pulled performRemoteValidation from main and did a diff against remoteDoDCV in this branch. There are still a few formatting and naming diffs that make it hard to zero in on the functionality diffs. Can we eliminate those? For instance:

			out <- &response{
				addr:   rva.Address,
				result: res,
				err:    err,
			}

Became:

			responses <- &response{rva.Address, res, err}

Poking around it looks like the latter is the more common style we use for this pattern, so let's merge it upstream as well.

Also, the "Remote PerformValidation RPC" to "Secondary domain validation" error message changes didn't wind up making it into the backports.

Similarly:

  for resp := range responses {

Became:

  for i := 0; i < remoteVACount; i++ {

I think the former is more appropriate in this case.
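
If the range form is kept while still waiting for every remote VA, the channel has to be closed at some point, otherwise the loop needs an explicit break. Here is a minimal sketch of one way to do that with a sync.WaitGroup. The names `collectAll`, `response`, and `call` are hypothetical stand-ins, not code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// response is a simplified stand-in for the per-remote result.
type response struct {
	addr string
	err  error
}

// collectAll starts one goroutine per address and ranges over the channel
// until it is closed, which happens only after every sender has finished.
func collectAll(addrs []string, call func(addr string) error) []response {
	responses := make(chan response, len(addrs))
	var wg sync.WaitGroup
	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			responses <- response{addr: addr, err: call(addr)}
		}(addr)
	}
	// The channel is buffered for every sender, so all sends complete
	// without a receiver and waiting before close cannot deadlock.
	wg.Wait()
	close(responses)

	var out []response
	for resp := range responses {
		out = append(out, resp)
	}
	return out
}

func main() {
	got := collectAll([]string{"rva1", "rva2", "rva3"}, func(string) error { return nil })
	fmt.Println(len(got))
}
```

The trade-off in this sketch is that nothing is processed until all senders finish; closing the channel from a separate goroutine would let the loop handle responses as they arrive.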

if canceled.Is(resp.err) {

Became:

if errors.Is(resp.err, context.Canceled) {

The documentation for canceled.Is() says it checks for both context.Canceled and grpc/codes.Canceled. Thinking about it, I can't think of a reason we would receive grpc/codes.Canceled here. The canceled package was introduced to solve a particular CT related logging problem (#3447), and also predated errors.Is (I think). So the switch here to errors.Is seems correct to me. Let's backport it. I actually wonder now if we need this canceled check at all. For one thing, in the new code we never return early and so never cancel anything. However, even in the old code, we would cancel after leaving the loop so would never expect to get a canceled response. Gonna take a look at the git history for this check.

Other than that the diff here seems to do what I expect. Could you update the PR description to describe how DoDCV behaves differently than PerformValidation? Here's what I've come to understand from factoring out some of the components:

  • Doesn't do CAA checking
  • Waits for all backends to return, doesn't try to return early if it gets too many failures, or enough successes.
  • Enforces that there are 2+ distinct RIRs among the successful responses.

That last one is probably implicit in the ballot summary text but I think it's useful to list it in the executive summary up top.
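
The last two behaviors can be sketched as a single decision over the full set of results. This is a simplified illustration, not the PR's remoteDoDCV: `remoteResult` and `corroborated` are hypothetical names, the allowed failure count is passed in rather than computed, and the RIR strings are placeholder values.

```go
package main

import "fmt"

// remoteResult is a simplified stand-in for one remote VA's outcome.
type remoteResult struct {
	perspective string
	rir         string
	passed      bool
}

// corroborated expects a result from every remote perspective (no early
// return), allows at most maxFailures failures, and requires passing
// results from at least two distinct RIRs.
func corroborated(results []remoteResult, maxFailures int) bool {
	failed := 0
	passedRIRs := make(map[string]struct{})
	for _, r := range results {
		if r.passed {
			passedRIRs[r.rir] = struct{}{}
		} else {
			failed++
		}
	}
	return failed <= maxFailures && len(passedRIRs) >= 2
}

func main() {
	results := []remoteResult{
		{perspective: "p1", rir: "ARIN", passed: true},
		{perspective: "p2", rir: "RIPE", passed: true},
		{perspective: "p3", rir: "ARIN", passed: false},
	}
	fmt.Println(corroborated(results, 1))
}
```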

Contributor

@jsha jsha left a comment

Since this is the PR where we're adding enforcement of MPIC-related constraints, there are some useful checks to add:

  • There are no duplicate perspectives.
  • If a VA considers itself remote (i.e. no backends), its perspective and RIR are non-empty. I'm sure these can also be expressed in the config validation language, but IMO it doesn't hurt to check them in the constructor as well.
  • If a VA considers itself remote, its Perspective is not PrimaryPerspective. Note: this will run into some trouble in prod because we are using boulder va, not yet boulder-remoteva.
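
A rough sketch of what those constructor checks might look like. Everything here is hypothetical: the `vaConfig` type, its field names, `validateConfig`, and the value of `primaryPerspective` are illustrative assumptions rather than Boulder's actual types or constants.

```go
package main

import (
	"errors"
	"fmt"
)

// primaryPerspective is an assumed placeholder value, for illustration only.
const primaryPerspective = "Primary"

// vaConfig is a hypothetical, simplified view of what a VA constructor sees.
type vaConfig struct {
	perspective        string
	rir                string
	remotePerspectives []string // empty means this VA considers itself remote
}

// validateConfig applies the three checks listed above.
func validateConfig(c vaConfig) error {
	seen := make(map[string]struct{})
	for _, p := range c.remotePerspectives {
		if _, dup := seen[p]; dup {
			return fmt.Errorf("duplicate remote perspective %q", p)
		}
		seen[p] = struct{}{}
	}
	if len(c.remotePerspectives) == 0 {
		if c.perspective == "" || c.rir == "" {
			return errors.New("remote VA must set both perspective and RIR")
		}
		if c.perspective == primaryPerspective {
			return errors.New("remote VA must not use the primary perspective")
		}
	}
	return nil
}

func main() {
	fmt.Println(validateConfig(vaConfig{perspective: "p1", rir: "ARIN"}))
	fmt.Println(validateConfig(vaConfig{
		perspective:        primaryPerspective,
		remotePerspectives: []string{"p1", "p1"},
	}))
}
```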

va/va_test.go Outdated Show resolved Hide resolved
Comment on lines +790 to +793
// ua if set to pass, the remote VA will always pass validation. If set to
// fail, the remote VA will always fail validation with probs.Unauthorized.
// This is set to pass by default.
ua string
Contributor

I think this comment is not quite correct. ua is set to "" by default, by the normal zero value rules.

Looking in setupRVAs, if it receives an empty ua it will default to "user agent 1.0".

Suggested change
-// ua if set to pass, the remote VA will always pass validation. If set to
-// fail, the remote VA will always fail validation with probs.Unauthorized.
-// This is set to pass by default.
-ua string
+// ua the user agent to be used by this remote VA. Different test cases use the
+// user agent string to accept or reject VAs selectively to create conditions for
+// testing. For instance, `TestDoDCVMPIC` accepts all requests with a user agent
+// of "pass".
+ua string

jsha added a commit that referenced this pull request Nov 18, 2024
Previously this was a configuration field.

Ports `maxAllowedFailures()` from `determineMaxAllowedFailures()` in
#7794.

Test updates:
 
Remove the `maxRemoteFailures` param from `setup` in all VA tests.

Some tests were depending on setting this param directly to provoke
failures.

For example, `TestMultiVAEarlyReturn` previously relied on "zero allowed
failures". Since the number of allowed failures is now 1 for the number
of remote VAs we were testing (2), the VA wasn't returning early with an
error; it was succeeding! To fix that, make sure there are two failures.
Since two failures from two RVAs wouldn't exercise the right situation,
add a third RVA, so we get two failures from three RVAs.

Similarly, TestMultiCAARechecking had several test cases that omitted
this field, effectively setting it to zero allowed failures. I updated
the "1 RVA failure" test case to expect overall success and added a "2
RVA failures" test case to expect overall failure (we previously
expected overall failure from a single RVA failing).

In TestMultiVA I had to change a test for `len(lines) != 1` to
`len(lines) == 0`, because with more backends we were now logging more
errors, and finding e.g. `len(lines)` to be 2.
@beautifulentropy beautifulentropy marked this pull request as draft November 25, 2024 19:47