diff --git a/README.md b/README.md
index 5d2a54f..e46e8d3 100644
--- a/README.md
+++ b/README.md
@@ -27,9 +27,9 @@ This centering of evented communication makes Flowmetal ideal for **coordination
 
 **Scripting**
 
 - Durability and distribution of execution come at coordination costs which make Flowmetal well suited for coordination tasks, but not for heavy processing.
-- For a problem statement, see [Call/CC Airflow](doc/call_cc_airflow.md).
+- For a problem statement, see [Call/CC Airflow](doc/what_problem.md).
 - For an architecture overview, see [Architecture](doc/architecture.md).
-- For example doodles, see [examples](examples).
+- For example doodles, see [examples](examples) and [NOTES](doc/NOTES.md).
 
 ## License
diff --git a/doc/call_cc_airflow.md b/doc/call_cc_airflow.md
deleted file mode 100644
index 48476e2..0000000
--- a/doc/call_cc_airflow.md
+++ /dev/null
@@ -1,156 +0,0 @@
-# What problem are you trying to solve?
-
-In building, operating and maintaining distributed systems (many computers in concert) engineers face a tooling gap.
-
-Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.) and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
-This is an enormous tooling win, as it allows small tasks to be automated at least for a time with a minimum of effort and with tools close to hand.
-
-In interacting with networks, communicating between computers is difficult with traditional tools and communication failure becomes an ever-present concern.
-Traditional automation tools such as shells struggle with this task because they make it difficult to implement features such as error handling.
-
-Furthermore, in a distributed environment it cannot be assumed that a single machine can remain available to execute automation.
-Automation authors are thus confronted not just with the need to use "real" programming languages but the need to implement a database backed state machine which can "checkpoint" the job.
-
-Taking a step back, this is an enormous failure of the languages we have available to describe workflow tasks.
-That users need to write state machines that define state machines that actually perform the desired task shows that the available tools operate at the wrong level.
-
-Airflow for instance succeeds at providing a "workflow" control flow graph abstraction which frees users of the concerns of implementing their own resumable state machines.
-
-Consider this example from the Airflow documentation -
-
-```python
-from __future__ import annotations
-
-import pendulum
-
-from airflow import DAG
-from airflow.operators.empty import EmptyOperator
-from airflow.utils.edgemodifier import Label
-
-with DAG(
-    "example_branch_labels",
-    schedule="@daily",
-    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
-    catchup=False,
-) as dag:
-    ingest = EmptyOperator(task_id="ingest")
-    analyse = EmptyOperator(task_id="analyze")
-    check = EmptyOperator(task_id="check_integrity")
-    describe = EmptyOperator(task_id="describe_integrity")
-    error = EmptyOperator(task_id="email_error")
-    save = EmptyOperator(task_id="save")
-    report = EmptyOperator(task_id="report")
-
-    ingest >> analyse >> check
-    check >> Label("No errors") >> save >> report
-    check >> Label("Errors found") >> describe >> error >> report
-```
-
-Compared to handwriting out all the nodes in the control flow graph and the requisite checkpointing to the database on state transitions, this is a considerable improvement.
-But if you stop and look at this control flow graph, it's expressing a simple branching operation for which we have much more conventional notation.
-
-Consider the same workflow using "normal" Python syntax rather than the embedded workflow syntax -
-
-```python
-def noop():
-    pass
-
-def check(result):
-    return True
-
-ingest = noop
-analyse = noop
-describe = noop
-error = noop
-save = noop
-report = noop
-
-def main():
-    ingest()
-    result = analyse()
-    if check(result):
-        save(result)
-        report(result)
-    else:
-        describe()
-        error()
-        report()
-```
-
-The program a developer authors is already a control flow graph.
-We have a name for "operators" or state nodes - they're just functions.
-We have a name and notation for transitions in the state chart - they're just sequential statements.
-
-Anything we need developers to write above that baseline represents friction specific to a workflow task which we should seek to minimize.
-
-Temporal does better than Airflow and reflects this understanding.
-Using insights from Azure Durable Functions, their SDK leverages the details of Python's `async` and `await` operators to hijack program control flow and implement workflow features under the covers.
-
-Consider this example from the Temporal documentation -
-
-```python
-@activity.defn
-async def cancellable_activity(input: ComposeArgsInput) -> NoReturn:
-    try:
-        while True:
-            print("Heartbeating cancel activity")
-            await asyncio.sleep(0.5)
-            activity.heartbeat("some details")
-    except asyncio.CancelledError:
-        print("Activity cancelled")
-        raise
-
-
-@workflow.defn
-class GreetingWorkflow:
-    @workflow.run
-    async def run(self, input: ComposeArgsInput) -> None:
-        activity_handle = workflow.start_activity(
-            cancel_activity,
-            ComposeArgsInput(input.arg1, input.arg2),
-            start_to_close_timeout=timedelta(minutes=5),
-            heartbeat_timeout=timedelta(seconds=30),
-        )
-
-        await asyncio.sleep(3)
-        activity_handle.cancel()
-```
-
-This is really good compared to an equivalent Airflow graph!
-All the details are "normal" Python, and the SDK fits "natively" into how Python execution occurs.
-But it's still laden with syntax such as the `async` function coloring and decorators which serve only to support the workflow SDK.
-
-In comparison were this workflow a "simple" Python script it would only need to be written
-
-```python
-# https://pypi.org/project/timeoutcontext/
-from timeoutcontext import task_with_timeout, TimeoutError
-
-
-def cancellable_activity():
-    try:
-        while True:
-            print("Heartbeating cancellable activity")
-            sleep(0.5)
-    except TimeoutError:
-        print("Activity cancelled")
-        raise
-
-
-def main():
-    task = task_with_timeout(lambda: cancellable_activity(),
-                             timeout=timedelta(minutes=5))
-    sleep(3)
-    task.cancel()
-```
-
-As with Airflow, the Temporal SDK effectively requires that the programmer learn not just a set of libraries but the Python `async` features because the implementation of the workflow engine is leaked to the users.
-The problem isn't just the excessive syntax, it's that as with Airflow user workflows are no longer "normal" programs.
-There is in effect an entire Temporal interpreter stacked inbetween the Python runtime with which users are familiar and the user's program.
-It is in effect a new language with none of the notational advantages of being one.
-
-The flipside is that there is an enormous advantage in tooling to be had by leveraging an existing language - or something that looks enough like one - rather than inventing a new notation.
-This is the cardinal sin of workflow tools like kubeflow and various CI "workflow" formats - they adopt unique YAML or XML based notations which have no tooling support.
-For instance by being "normal" (ish) Python, the Temporal SDK benefits from access to editor autocompletion, the MyPy typechecker and all manner of other tools.
-
-In short, the goal of this project is to provide a field-leading implementation of an [AMBROSIA](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/) like durable functions platform with "plain" Python as the user-facing API.
diff --git a/doc/what_problem.md b/doc/what_problem.md
index 8b5bc3f..48476e2 100644
--- a/doc/what_problem.md
+++ b/doc/what_problem.md
@@ -1,47 +1,156 @@
-# An Asynchronous, Distributed Task Engine
-
-This document presents a design without reference implementation for a distributed programming system;
-sometimes called a workflow engine.
-It is intended to provide architectural level clarity allowing for the development of alternative designs or implementations as may suit.
-
-## Problem Statement
+# What problem are you trying to solve?
 
 In building, operating and maintaining distributed systems (many computers in concert) engineers face a tooling gap.
 
-Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.)
-and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
+Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.) and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
 This is an enormous tooling win, as it allows small tasks to be automated at least for a time with a minimum of effort and with tools close to hand.
 
 In interacting with networks, communicating between computers is difficult with traditional tools and communication failure becomes an ever-present concern.
-Traditional automation tools such as shells are inadequate for this environment because achieving network communication is excessively difficult.
+Traditional automation tools such as shells struggle with this task because they make it difficult to implement features such as error handling.
 
-In a distributed environment it cannot be assumed that a single machine can remain available to execute automation;
-This requires an approach to automation which allows for the incremental execution of single tasks at a time with provisions for relocation and recovery should failure occur.
+Furthermore, in a distributed environment it cannot be assumed that a single machine can remain available to execute automation.
+Automation authors are thus confronted not just with the need to use "real" programming languages, but also with the need to implement a database-backed state machine which can "checkpoint" the job.
 
-It also cannot be assumed that a single machine is sufficiently available to receive and process incoming events such as callbacks.
-A distributed system is needed to wrangle distributed systems.
+Taking a step back, this is an enormous failure of the languages we have available to describe workflow tasks.
+That users need to write state machines that define state machines that actually perform the desired task shows that the available tools operate at the wrong level.
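+
+To make that friction concrete, here is a minimal sketch of the kind of hand-rolled, database-backed checkpoint machinery described above.
+The table layout and step functions are hypothetical stand-ins; the point is that two lines of actual work end up buried in persistence scaffolding.
+
+```python
+# A hand-rolled "workflow engine": every step's result is checkpointed to
+# SQLite so that a crashed or relocated run can resume where it left off.
+import json
+import sqlite3
+
+def ingest():
+    return {"rows": 42}
+
+def analyse(data):
+    return {"ok": data["rows"] > 0}
+
+STEPS = [("ingest", lambda prev: ingest()),
+         ("analyse", lambda prev: analyse(prev))]
+
+def run(job_id, db):
+    db.execute("CREATE TABLE IF NOT EXISTS checkpoints "
+               "(job TEXT, step TEXT, result TEXT, PRIMARY KEY (job, step))")
+    prev = None
+    for name, fn in STEPS:
+        row = db.execute("SELECT result FROM checkpoints WHERE job = ? AND step = ?",
+                         (job_id, name)).fetchone()
+        if row:                # step already ran in a previous execution; skip it
+            prev = json.loads(row[0])
+            continue
+        prev = fn(prev)        # the actual work...
+        db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)",
+                   (job_id, name, json.dumps(prev)))
+        db.commit()            # ...checkpointed on every state transition
+
+run("job-1", sqlite3.connect("checkpoints.db"))
+```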
-
-## Design Considerations
+Airflow, for instance, succeeds at providing a "workflow" control flow graph abstraction which frees users from implementing their own resumable state machines.
+
+Consider this example from the Airflow documentation -
+
-- Timeouts are everywhere
-- Sub-Turing/boundable
--
+```python
+from __future__ import annotations
+
-
-## Architectural Overview
+import pendulum
+
-
-### Events
-Things that will happen, or time out.
+from airflow import DAG
+from airflow.operators.empty import EmptyOperator
+from airflow.utils.edgemodifier import Label
+
-
-### Actions
-Things the workflow will do, or time out.
+with DAG(
+    "example_branch_labels",
+    schedule="@daily",
+    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
+    catchup=False,
+) as dag:
-
-### Bindings
-Data the workflow either was given or computed.
+    ingest = EmptyOperator(task_id="ingest")
+    analyse = EmptyOperator(task_id="analyze")
+    check = EmptyOperator(task_id="check_integrity")
+    describe = EmptyOperator(task_id="describe_integrity")
+    error = EmptyOperator(task_id="email_error")
+    save = EmptyOperator(task_id="save")
+    report = EmptyOperator(task_id="report")
+
-
-### Conditionals
-Decisions the workflow may make.
+    ingest >> analyse >> check
+    check >> Label("No errors") >> save >> report
+    check >> Label("Errors found") >> describe >> error >> report
+```
+
-
-### Functions
-A convenient way to talk about fragments of control flow graph.
+Compared to writing out all the nodes in the control flow graph by hand, along with the requisite checkpointing to the database on state transitions, this is a considerable improvement.
+But if you stop and look at this control flow graph, it's expressing a simple branching operation for which we have much more conventional notation.
+
-
-### Tracing & Reporting
+Consider the same workflow using "normal" Python syntax rather than the embedded workflow syntax -
+
+```python
+def noop():
+    pass
+
+def check(result):
+    return True
+
+ingest = noop
+analyse = noop
+describe = noop
+error = noop
+save = noop
+report = noop
+
+def main():
+    ingest()
+    result = analyse()
+    if check(result):
+        save(result)
+        report(result)
+    else:
+        describe()
+        error()
+        report()
+```
+
+The program a developer authors is already a control flow graph.
+We have a name for "operators" or state nodes - they're just functions.
+We have a name and notation for transitions in the state chart - they're just sequential statements.
+
+Anything we need developers to write above that baseline represents friction specific to a workflow task which we should seek to minimize.
+
+Temporal does better than Airflow and reflects this understanding.
+Using insights from Azure Durable Functions, their SDK leverages the details of Python's `async` and `await` operators to hijack program control flow and implement workflow features under the covers.
+
+Consider this example from the Temporal documentation -
+
+```python
+# Imports and the ComposeArgsInput dataclass are not shown in the original
+# sample; they are filled in here so the excerpt stands alone.
+import asyncio
+from dataclasses import dataclass
+from datetime import timedelta
+from typing import NoReturn
+
+from temporalio import activity, workflow
+
+
+@dataclass
+class ComposeArgsInput:
+    arg1: str
+    arg2: str
+
+
+@activity.defn
+async def cancellable_activity(input: ComposeArgsInput) -> NoReturn:
+    try:
+        while True:
+            print("Heartbeating cancel activity")
+            await asyncio.sleep(0.5)
+            activity.heartbeat("some details")
+    except asyncio.CancelledError:
+        print("Activity cancelled")
+        raise
+
+
+@workflow.defn
+class GreetingWorkflow:
+    @workflow.run
+    async def run(self, input: ComposeArgsInput) -> None:
+        activity_handle = workflow.start_activity(
+            cancellable_activity,
+            ComposeArgsInput(input.arg1, input.arg2),
+            start_to_close_timeout=timedelta(minutes=5),
+            heartbeat_timeout=timedelta(seconds=30),
+        )
+
+        await asyncio.sleep(3)
+        activity_handle.cancel()
+```
+
+This is really good compared to an equivalent Airflow graph!
+All the details are "normal" Python, and the SDK fits "natively" into how Python execution occurs.
+But it's still laden with syntax such as the `async` function coloring and decorators which serve only to support the workflow SDK.
+
+By comparison, were this workflow a "simple" Python script, it would only need to be written -
+
+```python
+# https://pypi.org/project/timeoutcontext/
+from datetime import timedelta
+from time import sleep
+
+from timeoutcontext import task_with_timeout, TimeoutError
+
+
+def cancellable_activity():
+    try:
+        while True:
+            print("Heartbeating cancellable activity")
+            sleep(0.5)
+    except TimeoutError:
+        print("Activity cancelled")
+        raise
+
+
+def main():
+    task = task_with_timeout(lambda: cancellable_activity(),
+                             timeout=timedelta(minutes=5))
+    sleep(3)
+    task.cancel()
+```
+
+As with Airflow, the Temporal SDK effectively requires that the programmer learn not just a set of libraries but also Python's `async` features, because the implementation of the workflow engine leaks through to its users.
+The problem isn't just the excessive syntax; it's that, as with Airflow, user workflows are no longer "normal" programs.
+There is in effect an entire Temporal interpreter stacked in between the user's program and the Python runtime with which users are familiar.
+It is in effect a new language with none of the notational advantages of being one.
+
+The flip side is that there is an enormous advantage in tooling to be had by leveraging an existing language - or something that looks enough like one - rather than inventing a new notation.
+This is the cardinal sin of workflow tools like Kubeflow and various CI "workflow" formats - they adopt unique YAML- or XML-based notations which have no tooling support.
+For instance, by being "normal" (ish) Python, the Temporal SDK benefits from access to editor autocompletion, the MyPy typechecker and all manner of other tools.
+
+In short, the goal of this project is to provide a field-leading implementation of an [AMBROSIA](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/)-like durable functions platform with "plain" Python as the user-facing API.
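+
+To sketch the mechanism such a platform rests on - simplified here to a local JSON journal, where AMBROSIA journals to replicated logs - the runtime records the result of each effect and replays it on re-execution, so an interrupted `main()` can simply be run again from the top without redoing completed work.
+This illustrates the technique only; it is not Flowmetal's actual API.
+
+```python
+# Minimal sketch of effect journaling, the trick behind durable re-execution.
+# A real platform would journal to replicated storage; this toy version uses
+# a local JSON file. Illustrative only - not a real Flowmetal interface.
+import json
+import os
+
+class Journal:
+    def __init__(self, path):
+        self.path = path
+        self.log = json.load(open(path)) if os.path.exists(path) else []
+        self.cursor = 0
+
+    def effect(self, fn):
+        """Replay a recorded result if present, else run fn and record it."""
+        if self.cursor < len(self.log):
+            result = self.log[self.cursor]   # replayed; fn is not re-run
+        else:
+            result = fn()                    # first execution: do the work
+            self.log.append(result)
+            with open(self.path, "w") as f:
+                json.dump(self.log, f)
+        self.cursor += 1
+        return result
+
+journal = Journal("journal.json")
+
+def main():
+    data = journal.effect(lambda: {"rows": 42})    # stands in for a network call
+    journal.effect(lambda: print("saving", data))  # stands in for a write
+
+main()
+```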