ovn-tester: Most timers are not useful #85
I don't really agree with this statement. ovn-nbctl commands (on the client side) wait until the nbctl daemon (server side) has processed the unixctl command.
IMO it's still useful to know that something happened around the time we were doing that nbctl command. That's compared to just getting a measurement for the complete iteration, in which case we'd really have no clue where the problem is.
Again, I'm not sure I understand why it's not useful to know that "something" happened at this point.
I agree, an audit is needed to remove some of the measurements (and maybe add others).
We have most of these already, and for some of the others there's WIP:
This gets logged in the nbctl daemon logs. If we move to IDL however, we'd need a way to log it there.
https://patchwork.ozlabs.org/project/openvswitch/list/?series=262657&state=*
How is this different from the above? Or do you mean in general, e.g., for ACL related flows?
Indeed I'm afraid the only option right now is to scrape logs from all components. However, if we make sure we have logs for all of the points above, another option would be to configure logs to be sent to a centralized syslog server (e.g., on the ovn tester node). This would make data collection/analysis easier.
Agreed.
I don't agree with the second part of this statement. We actually are interested in "how long it takes" to enforce a network policy (i.e., the time it takes for traffic to be allowed by an ACL).
Thanks for the reply, Dumitru. I've got comments inline below.
Fair point. I guess it's not correct to say that we're only measuring the SSH roundtrip time, since there could also be some issue with the nbctl daemon.
I think I need to clarify what I mean when I say the timers are not "useful", since in retrospect I was vague (and also partially incorrect). Think about it from the perspective of us trying to quantify how long it takes for OVN to perform operations. In retrospect, I shouldn't have implied that timing the nbctl commands tells us nothing at all.
I would advise that this audit be carried out from a fresh perspective. In other words, if ovn-tester didn't exist yet and we wanted to start performance testing OVN, what measurements would we want? We can then compare that list of measurements to what we have and see how we can improve.
Thanks for providing links to the relevant commits/patches. This goes to show that we at least partially have the necessary tools to measure times in a more fine-grained manner. The problem is that before we can know that we have all the tools we need, we need to do that first step of ensuring we know everything that we want to measure :) .
Hopefully there is no difference. I don't think that for the case of adding a logical switch port this should actually be necessary, which is why I put the "(Possibly)" there. The reason this went through my head was just because I was trying to think of every timestamp we might possibly want to get throughout adding the logical switch port. And if there is some discrepancy between when ovn-installed is set and when flows are installed, this reporting would tell us. But generally, you're also correct that other operations may (currently) only be able to use the presence or absence of flows in the flow table as a method to know the command has been processed. As you pointed out, ACLs are one of those situations. We don't have anything we can monitor in OVS except for flow installation.
Yes that is a potential option. Another option I thought of is to install agent programs on all nodes to monitor for the creation of records and then report timestamps for those events back to ovn-tester. I'm honestly not sure which of those ideas is actually going to be easier to implement.
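For what it's worth, a rough sketch of the agent idea, assuming each agent can run ovn-sbctl locally and that ovn-tester listens on some well-known address for the reports (the hostname, port, and message format below are made up for illustration):

```python
import json
import socket
import subprocess
import time


def wait_for_chassis(hostname, poll=0.2):
    """Poll the SB DB until a Chassis record for this hostname shows up,
    and return the wall-clock time at which it was first observed."""
    while True:
        out = subprocess.run(
            ["ovn-sbctl", "--bare", "--columns=_uuid", "find", "Chassis",
             f"hostname={hostname}"],
            capture_output=True, text=True).stdout.strip()
        if out:
            return time.time()
        time.sleep(poll)


def report(tester_addr, event, timestamp):
    """Send the observed event and its timestamp back to the ovn-tester node."""
    with socket.create_connection(tester_addr) as sock:
        sock.sendall(json.dumps({"event": event, "ts": timestamp}).encode())


# Example usage (addresses are placeholders):
# ts = wait_for_chassis("ovn-scale-1")
# report(("ovn-tester.example", 9999), "chassis-registered", ts)
```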
If we're interested in how long it takes to enforce a network policy, then that's fine. But I still think that pings are not the best tool for it. We should be able to tell that an ACL is "activated" through other means, such as checking for flows to be installed (as you hinted at previously), by scraping logs, or through some new means. The ping is a good follow-up to ensure that the ACL is actually being enforced.
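To make the flow-checking idea concrete, here is a minimal sketch, assuming the check runs on (or is proxied to) the worker node and that we know some substring (a cookie, a conjunction id, ...) that identifies the ACL's flows; both of those are assumptions, not something ovn-tester does today:

```python
import subprocess
import time


def wait_for_acl_flows(match, bridge="br-int", timeout=60.0, poll=0.2):
    """Poll the OpenFlow tables until flows containing `match` appear, as a
    stand-in for "the ACL has been translated and installed".

    What to look for (a cookie, a conjunction id, ...) depends on the ACL
    and is left open here.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        flows = subprocess.run(["ovs-ofctl", "dump-flows", bridge],
                               capture_output=True, text=True).stdout
        if match in flows:
            return time.time()
        time.sleep(poll)
    raise TimeoutError(f"no flows matching {match!r} found on {bridge}")
```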
I recently did an audit of all places where the `@ovn_stats.timeit` decorator is used. What I found was that the only truly useful places are in:

- `WorkerNode.wait()`: After provisioning a worker node, this waits for the corresponding `Chassis` record to appear in the SBDB.
- `WorkerNode.ping_port()`: This determines how long it takes before pings are successful [1].

The rest of the timed operations essentially call `ovn-nbctl` a bunch of times. Since we use the `ovn-nbctl` daemon on the central nodes, any `ovn-nbctl` command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the roundtrip time for SSH command execution, not OVN. As an example, look at the `Namespace.add_ports` graph in the comment below this one. Aside from an oddity on iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here is a handful of ovn-nbctl calls that add addresses to an address set and add ports to a port group.

The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: they do a poor job of pinpointing the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? It could be a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the ovn-nbctl daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.
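To illustrate why these numbers lump everything together, here is a minimal sketch of a wall-clock timing decorator in the spirit of `ovn_stats.timeit` (the real implementation in ovn-tester may differ); everything that happens inside the call, SSH roundtrips and client-side Python included, ends up in one recorded duration:

```python
import functools
import time


def timeit(func):
    """Hypothetical stand-in for ovn_stats.timeit: records a single
    wall-clock duration spanning the whole call, so SSH roundtrips,
    client-side work, and actual OVN processing are indistinguishable
    in the result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__qualname__}: {elapsed:.3f}s")
    return wrapper
```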
This problem also extends to the "useful" timers I mentioned above. When we time `WorkerNode.ping_port()`, the measurement includes the SSH connection overhead, plus Python client code execution (such as multiple `datetime.now()` calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.

How can we fix this? There are a few steps to take:
- (Possibly) Note when the logical switch port gets `external_ids:ovn-installed` set.

Right now, there is nothing that can monitor the databases in real time and note the necessary timestamps.
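As a sketch of the kind of per-port monitoring that is missing, assuming the command runs on (or is proxied to) the chassis hosting the port and that the OVS interface name is known (both are assumptions for illustration), one could poll the local Interface record and note when ovn-installed appears:

```python
import subprocess
import time


def wait_for_ovn_installed(iface, timeout=60.0, poll=0.1):
    """Poll the local OVS Interface record until external_ids:ovn-installed
    is set, and return the wall-clock time at which it was first seen."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            ["ovs-vsctl", "--if-exists", "get", "Interface", iface,
             "external_ids:ovn-installed"],
            capture_output=True, text=True).stdout.strip().strip('"')
        if out == "true":
            return time.time()
        time.sleep(poll)
    raise TimeoutError(f"ovn-installed never appeared on {iface}")
```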
[1] The usefulness of timing pings may disappear when the asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up; instead, it uses `ovn-installed-ts`. The remaining use case for pings is testing ACLs, and in that case, the fact that the pings succeed is what's important, not how long it takes for them to start succeeding.