Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flapping detection algorithm to link traps #244

Conversation

lunkwill42
Copy link
Member

@lunkwill42 lunkwill42 commented Jun 6, 2024

Incomplete implementation of flapping detection algorithms for link traps.

This closes #144 (and completes #232)

This builds on #241 and is dependent on that being merged first, keeping in Draft mode until then.

@lunkwill42 lunkwill42 self-assigned this Jun 6, 2024
@lunkwill42
Copy link
Member Author

The original code has an impressive amount of redundancies when it comes to event generation. I hope to factor those out once I'm closing in on feature parity.

Copy link

codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 98.02956% with 4 lines in your changes missing coverage. Please review.

Project coverage is 98.92%. Comparing base (ef688f8) to head (eed192d).

Files Patch % Lines
src/zino/trapobservers/link_traps.py 92.98% 4 Missing ⚠️
Additional details and impacted files
@@                        Coverage Diff                        @@
##           feature/link-traps-reschedule     #244      +/-   ##
=================================================================
+ Coverage                          98.88%   98.92%   +0.04%     
=================================================================
  Files                                 51       52       +1     
  Lines                               6882     7056     +174     
=================================================================
+ Hits                                6805     6980     +175     
+ Misses                                77       76       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from d2bf37d to 687c565 Compare June 7, 2024 14:05
Copy link

sonarcloud bot commented Jun 7, 2024

Quality Gate Passed Quality Gate passed

Issues
298 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

This is an initial reinterpretation of what Zino 1 is doing to track
flapping states, and is still subject to change.  This suggests
keeping a separate model for flapping state tracking, so that algorithm
methods can be concentrated in that model.  Flapping states could
also potentially be tracked directly on Port model objects, but we'll
try this version first.
These have been implemented in the FlappingStates class
instead.
Zino 1 uses it, so we need it too.
For quick value lookups where there may be no pre-existing data (in
which case we return 0 values)
The logic is more or less kept from Zino 1, although I don't yet
understand the concept where an interface can be considered flapping
even if we still have no flap tracking data for it...
When flapping stops, the final flap count must be recorded in the event.
This is used by Zino 1, but was not yet allowed in the Zino 2 models
@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from 405339a to f047b38 Compare July 1, 2024 09:47
This roughly follows the Zino 1 implementation, but is not yet
feature complete.
If there is no flapping state tracked for an interface, it is definitely
not considered to be flapping yet.  Let other parts of the code
initialize flap tracking data for an interface once traps start
coming in.

This came after discussions with Håvard E about the original Tcl code,
which works on a confusing set of global variables.
After talking to Håvard E to understand the intent and flow of the
original code better, I felt the need to add some comments and to
simplify some logic.
Imports that are only necessary for type annotations should be guarded.
The original Zino code will sometimes alter already closed events.
Apparently, the reasoning goes that operators sometimes prematurely
close events for flapping ports (which Håvard E argues that we should
maybe disallow in the future).  The original Zino therefore takes
care to update both a closed flapping event and the currently open
flapping event whenever the link trap code handles a flapping state.

A limitation I immediately see is that multiple closed events could
exist for a flapping state (until they are expired/scavenged), but we
can only index one of them.  As with the original code, there is a
bias towards only indexing the latest event.

As a side note: Zino 1 used the terms "pre-closed" and "closed" for
events.  The former refers to events that have been given a state of
CLOSED by an operator.  The latter refers to events that have been
evicted completely from the running state.
Resolved after a discussion with Håvard.  When in the middle of a flap
storm, we should not commit the event on every update of the flap
counter, as that would also create a storm of update notifications to
connected clients.

This just modifies the existing event object in-place.
This was resolved with Håvard.  ifDescr or ifAlias is not normally sent
with a trap, and Håvard can think of no good reason for needing to keep
a trap-specific mapping of ifindexes to ifdescrs or ifaliases.

Which means we can keep using our existing mappings and ignore these
comments.
We could sometimes get a trap from a device that has been identified in
the running state, but that has been removed from the polldevs config.
There is no point to scheduling a poll from such a device.
Tuples as dict keys cannot be serialized to JSON by Pydantic without
custom logic.  I.e. it serializes to a string representation, with
tuple elements separated by commas, but Pydantic is unable to
deserialize this representation as a tuple, causing Zino to crash
when loading stored state.

This annotates the PortIndex type with modifiers that let's Pydantic
successfully serialize and deserialize PortIndex dict keys to/from JSON
Flapping state stats were constantly being reset every time an event
was updated, due to a bug in the implementation.

This renames `FlappingState.flapping` in an attempt to clarify that
this is a flag used to indicate that a flapping state has been declared
(i.e. an event has likely been created), while the `is_flapping()`
method is only used to verify that the current stats qualify as flapping
and that a flapping state should/could be established.
Updates/creates docstrings and cleans up a few small things after
increased understanding of what the original flapping code is actually
tracking.
@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from f047b38 to 1ac3b25 Compare July 1, 2024 13:36
All existing flap stats should be aged periodically, to ensure flapping
states are cleared after traps stop coming in.
The loop may delete FlappingState objects from the iterated dict,
which would raise an Exception.
@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from 47bcece to 50340a1 Compare July 2, 2024 13:40
@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from 50340a1 to 241b45e Compare July 2, 2024 13:59
We will need it on other test modules that are upcoming.
This factors out `stabilize_flapping_state()` from
`age_single_interface_flapping_state()` in order to separate concerns
and make things more testable.
The submitted interface argument may not have any flap stats, but
there is no need for the implementation to crash if it doesn't.  It's
safe to quietly ignore it, as this means the interface is already
"unflapped".
@lunkwill42 lunkwill42 force-pushed the feature/link-traps-flapping branch from ea6926e to a3132e1 Compare July 3, 2024 10:22
This belongs in the `FlapState` class itself.
It may be handy to let the caller access the created/updated FlapState
object as soon as it has updated it.
Now that the logic is better understood, this factors out the link
transition code that updates events for interfaces in either a
flapping state or a non-flapping state.

There were some redundancies that were copied directly from Zino 1,
these have been factored out.  Also, the code was made slightly more
readable by fetching the associated `FlapState` objects if they exist.

Additionally, this fixes a discovered problem in the first version of
the code, in where a transition out of a flapping state wasn't handled
properly, because the port was looked up incorrectly in the index of
flapping states.
Tests more of the behavioral aspects of the link transition handler,
focusing on how it handles events and flapping states.
@lunkwill42 lunkwill42 deleted the branch Uninett:feature/link-traps-reschedule July 3, 2024 14:17
@lunkwill42 lunkwill42 closed this Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant