
Dataflow

Summary

Dataflow is a Kubernetes-native platform for executing large parallel data-processing pipelines.

Each pipeline is specified as a Kubernetes custom resource consisting of one or more steps, each of which sources and sinks messages from systems such as Kafka, NATS Streaming, or HTTP services.
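
To make this concrete, here is a minimal sketch of such a custom resource: a one-step pipeline that copies messages from a Kafka topic to the log. The apiVersion and field names follow the v1alpha1 examples in this repository, but the topic name is an assumption chosen for illustration:

apiVersion: dataflow.argoproj.io/v1alpha1
kind: Pipeline
metadata:
  name: kafka-to-log
spec:
  steps:
    - name: main
      cat: {}                    # pass each message through unchanged
      sources:
        - kafka:
            topic: input-topic   # assumed topic name, for illustration
      sinks:
        - log: {}                # write each message to the step's log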

Each step runs zero or more pods, and can scale horizontally using HPA or, based on queue length, using built-in scaling rules. Steps can be scaled to zero; in that case they periodically and briefly scale to one to measure queue length, so that they can scale back up when messages are pending.
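
The built-in rules are configured per step. As a rough sketch, a step might declare its scaling behaviour with a stanza like the one below; the scale.desiredReplicas expression field is an assumption based on the v1alpha1 API and may differ between versions:

spec:
  steps:
    - name: main
      cat: {}
      scale:
        # Hypothetical expression: run 2 pods when more than 1000 messages
        # are pending, otherwise 1. Exact fields vary by Dataflow version.
        desiredReplicas: "pending > 1000 ? 2 : 1"
      sources:
        - kafka:
            topic: input-topic
      sinks:
        - log: {}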

Learn more about features.

Introduction to Dataflow

Use Cases

  • Real-time "click" analytics
  • Anomaly detection
  • Fraud detection
  • Operational (including IoT) analytics

Screenshot

[screenshot image]

Example

Install the Python DSL:

pip install git+https://github.com/argoproj-labs/argo-dataflow#subdirectory=dsls/python

Then define and run a pipeline:

from argo_dataflow import cron, pipeline

if __name__ == '__main__':
    # A pipeline named "hello" with a single step that sources a message
    # from a cron schedule every three seconds, passes it through
    # unchanged ("cat"), and sinks it to the log.
    (pipeline('hello')
     .namespace('argo-dataflow-system')
     .step(
         cron('*/3 * * * * *')
         .cat()
         .log()
     )
     .run())
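
Running the script creates the Pipeline custom resource in the argo-dataflow-system namespace, and the controller starts a pod for the step. Because a pipeline is an ordinary Kubernetes resource, it can be inspected with standard tooling, for example kubectl -n argo-dataflow-system get pipelines (a generic kubectl query, assuming you have access to the cluster).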

Documentation

Read in order:

Beginner:

Intermediate:

Advanced:

Architecture Diagram

[architecture diagram image]
