Skip to content

Latest commit

 

History

History
140 lines (117 loc) · 6.28 KB

README.org

File metadata and controls

140 lines (117 loc) · 6.28 KB

bgworker - a Python background worker for Cisco NSO

bgworker is a library for running background worker processes in Cisco NSO. It is packaged as an NSO package so that other NSO packages can easily depend on it.

This repository follows the standard form for NID (NSO in Docker) package repositories and thus includes CI testing and a test package to verify that the bgworker library works correctly. You can look at test-packages/tbgw for an example of how to use the bgworker library.

bgworker handles a number of things that aren’t otherwise entirely intuitive how to handle, such as:

  • NCS package events like reload and redeploy
  • background worker process dying (will restart it)
  • configuration changes - enabling/disabling of the background worker
  • HA events - only run background process when in specified HA mode
    • per default runs when HA is disabled or HA mode = primary (legacy master)
    • can also be set to run when in HA mode none or secondary (legacy slave)
    • or set to always to always run regardless of current HA mode

The example code in this package does not demonstrate all the abilities of the process supervisor. It does have a random condition to die every now in a while and the supervisor will then restart the process. This is evident from the log where you can see “Bad dice value” followed by the supervisor saying it is starting the process again. You can test the package redeploy by issuing `request packages package bgworker redeploy` and see how fast it is, it should be near instantaneous, showing that we correctly react to the python vm stop request from NCS. The reaction to configuration changes can be tested by using the ncs_cli and disabling the bgworker by going into `configure` mode and doing `set bgworker disabled` followed by `commit`. Finally, HA can be tested by loading whatever HAFW package you want, enabling HA in NCS and using the HAFW package functionality to switch to primary/master mode or away from primary/master mode, which should then lead to starting or stopping the background worker process respectively.

This code is written for Python 3. Python 2 is dead and you should stop using it. NSO 5.3 deprecates support for Python 2. It is weird that NSO doesn’t ship with using Python 3 per default, so you have to enable this yourself (it is documented in the user guide).

HA enabled

Reading the status of HA, whether HA is enabled or disabled, is only performed at startup. It is currently not possible to change enable or disable HA in NCS by only issuing a reload. If this changes in a newer version of NCS, then we would need to react to this too.

Recommended use / design pattern

A note on error handling and anti-patterns

It appears the following is a rather intuitive code pattern:

def bg_function():
    with ncs.maapi.Maapi() as m:
        with ncs.maapi.Session(m, 'something', 'system'):
            while True:
                try:
                    # do work here

                except:
                    # handle faults

You should not write code like that!

The problem is with the placement of the while-True-try-except loop in relation to the maapi context managers. An exception occurring in this loop will be caught and then retried, which is the sought after behaviour. It will however be retried within the same maapi context manager and if the exception was related to the maapi session, like a maapi connection failure, then it will be retried with the same maapi session which is still broken. The maapi session context manager does not automatically reconnect. MAAPI related exceptions are somewhat rare and won’t show up in most code but the above pattern exacerbates the problem as it continuously retries with the same broken maapi session.

This should be obvious but as 100% of the beta developers that tried using background_process implemented this exact anti-pattern, it was obvious that a note of warning was needed.

There are multiple solutions. The while-try loop can be moved to encapsulate the context managers like so:

def bg_function():
    while True:
        try:
            with ncs.maapi.Maapi() as m:
                with ncs.maapi.Session(m, 'something', 'system'):
                    # do work here

    except:
        # handle faults

It is relatively expensive setting up a Maapi session, so keeping them somewhat long lived is desirable. It naturally depends on the type of work you are carrying out of the overhead of opening new sessions is relevant or not. If each cycle takes minute, the overhead is likely negligible while if a cycle is on the order of a few seconds then you should probably consider a pattern where you try to persist the session over multiple work cycles.

A surprisingly simple way of dealing with this is to remove the exception handler:

def bg_function():
    with ncs.maapi.Maapi() as m:
        with ncs.maapi.Session(m, 'something', 'system'):
            while True:
                # do work here

The child process will die but as the background_process supervisor monitors the child it will be restarted in about a second.

The exception handler can be made more specific to not catch the maapi exceptions (instead letting the child die for that case and rely on the supervisor for restart). This is probably a better approach as soon as your worker function reaches some level of complexity in which case more complete exception handling is necessary.

Alternatively an outer while True loop is added, in which case we probably should break up the code into multiple functions since being 5-6 levels of nesting deep before you start writing your actual application code is pretty appalling.

BUGS

  • [ ] logging levels can’t seem to be reconfigured. Have to redeploy package to use new level.

To Do

  • [ ] describe the design
    • [ ] why multiprocessing?
    • [ ] why threads?
      • [ ] why so many?
    • [ ] why multiprocessing AND threads?
    • [ ] what’s up with the logger stuff?
  • [ ] write a more complete example showing how we can subscribe to config changes in worker process