Implement testing for Prefect workflow #17

Open
7 tasks
Tracked by #14
amcnicho opened this issue Dec 10, 2024 · 0 comments

amcnicho commented Dec 10, 2024

Objective

Define, measure, and improve the reliability and fault tolerance of an example workflow based on Prefect.

Requirements

  • Incorporate entities that facilitate timely and accurate failure detection.
  • An ideal rollback recovery approach would not require source code modifications, source code recompilation, or relinking support binaries.
  • Rollback recovery should include robust failure detection that activates without user intervention.
  • The time to create checkpoints should be significantly shorter than the application runtime, and the checkpoint size should be small.
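
As a starting point for the checkpoint requirement, the two quantities it names (creation time and size) can be measured directly. The following is a minimal sketch, assuming the workflow state is a picklable Python object; `measure_checkpoint`, `checkpoint_overhead`, and the example `state` are hypothetical names for illustration and are not part of Prefect.

```python
import pickle
import time


def measure_checkpoint(state):
    """Serialize `state` and return (elapsed_seconds, size_in_bytes)."""
    start = time.perf_counter()
    blob = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)
    elapsed = time.perf_counter() - start
    return elapsed, len(blob)


def checkpoint_overhead(checkpoint_seconds, runtime_seconds):
    """Fraction of the application runtime spent creating checkpoints."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return checkpoint_seconds / runtime_seconds


# Hypothetical workflow state: results of the steps completed so far.
state = {"completed_steps": ["extract", "transform"], "rows": list(range(1000))}
elapsed, size = measure_checkpoint(state)
print(f"checkpoint took {elapsed:.6f} s and occupies {size} bytes")
```

A test could then assert that `checkpoint_overhead(elapsed, runtime)` stays below a threshold the team agrees on; the threshold itself is one of the open decisions in this issue.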

Prerequisite

Note: This is essentially a sub-task of #8


Definition of Done

  • The team has implemented tests that quantify the reliability and fault tolerance of the example Prefect workflow
  • The team has simulated failures in the operation of the example Prefect workflow to demonstrate the usefulness of the tests

Key Decision Points

  1. How to measure reliability and fault tolerance?
  2. What is the appropriate resolution for the measured quantities?
  3. What are realistic simulations of failure?
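
One possible answer to the first and third questions is to inject faults into a mock task with a known probability and report the fraction of runs that still succeed. The sketch below is plain Python and does not call the Prefect API; `make_flaky_task`, `run_with_retries`, `success_rate`, and `InjectedFailure` are hypothetical helpers, and an independent per-attempt failure probability is an assumption that may not match real incidents.

```python
import random


class InjectedFailure(RuntimeError):
    """Raised by the mock task to simulate a fault."""


def make_flaky_task(failure_probability, rng):
    """Build a mock task that fails with the given probability on each call."""
    def task():
        if rng.random() < failure_probability:
            raise InjectedFailure("simulated fault")
        return "ok"
    return task


def run_with_retries(task, max_retries):
    """Return True if the task succeeds within 1 + max_retries attempts."""
    for _ in range(max_retries + 1):
        try:
            task()
            return True
        except InjectedFailure:
            continue
    return False


def success_rate(failure_probability, max_retries, runs, seed=0):
    """Fraction of runs that succeed; seeded so the result is reproducible."""
    rng = random.Random(seed)
    task = make_flaky_task(failure_probability, rng)
    successes = sum(run_with_retries(task, max_retries) for _ in range(runs))
    return successes / runs


rate = success_rate(failure_probability=0.2, max_retries=2, runs=1000, seed=42)
print(f"observed success rate: {rate:.3f}")
```

Seeding the random generator keeps the measurement repeatable in CI, which makes it usable as a regression test while the definition of reliability is still being settled.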

Artifacts

  • Initial definitions of reliability and fault tolerance against which to implement tests for monitoring.
  • Test system and associated CI capabilities
  • Passing tests that function as the basis for a monitoring system of workflows based on Prefect.

Success Criteria

There are established definitions and initial measurements that quantify the reliability and fault tolerance of workflows based on Prefect.

Potential Challenges

  • Commonly used monitoring signals (latency, traffic, errors, saturation, time-to-recovery) might be difficult to quantify using workflows that only mock the behavior of domain applications, i.e., sleep functions instead of actual workloads.
  • The appropriate measurement resolution remains undefined without knowing the details of integration with other services (such as user interfaces, resource pools, and user demand).
  • Without a well-understood model of the real incidents that might occur in a future working system, simulated failures might impose unrealistic constraints on the development of example workflows.
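
Even with mocked workloads, two of the signals listed above (errors and time-to-recovery) can be derived from a timestamped record of run outcomes. The following is a minimal sketch, assuming such a record is available as `(timestamp_seconds, status)` pairs sorted by time; the `"ok"` and `"failed"` labels and the example `events` are hypothetical and are not Prefect state names.

```python
import statistics


def recovery_times(events):
    """Return the duration of each outage: first failure to next success."""
    durations = []
    failed_at = None
    for timestamp, status in events:
        if status == "failed" and failed_at is None:
            failed_at = timestamp  # outage begins
        elif status == "ok" and failed_at is not None:
            durations.append(timestamp - failed_at)  # outage ends
            failed_at = None
    return durations


def error_rate(events):
    """Fraction of recorded runs that failed."""
    if not events:
        return 0.0
    return sum(1 for _, status in events if status == "failed") / len(events)


# Hypothetical run log: two outages, lasting 15 s and 5 s.
events = [
    (0, "ok"), (10, "failed"), (12, "failed"), (25, "ok"),
    (30, "ok"), (40, "failed"), (45, "ok"),
]
durations = recovery_times(events)
print("recovery times:", durations)
print("mean time to recovery:", statistics.mean(durations))
print("error rate:", error_rate(events))
```

The resolution of these numbers is bounded by how often outcomes are recorded, so the sampling interval would need to be chosen together with the measurement resolution discussed above.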