Consolidate docs

Reid 'arrdem' McKenzie 2023-03-07 19:09:34 -07:00
parent 6c5217850c
commit b1b00ee384
3 changed files with 141 additions and 188 deletions


@@ -27,9 +27,9 @@ This centering of evented communication makes Flowmetal ideal for **coordination
**Scripting** - Durability and distribution of execution come at coordination costs which make Flowmetal well suited for coordination tasks, but not for heavy processing.
-- For a problem statement, see [Call/CC Airflow](doc/call_cc_airflow.md).
+- For a problem statement, see [Call/CC Airflow](doc/what_problem.md).
- For an architecture overview, see [Architecture](doc/architecture.md).
-- For example doodles, see [examples](examples).
+- For example doodles, see [examples](examples) and [NOTEs](doc/NOTES.md).
## License


@@ -1,156 +0,0 @@
# What problem are you trying to solve?
In building, operating and maintaining distributed systems (many computers in concert) engineers face a tooling gap.
Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.) and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
This is an enormous tooling win, as it allows small tasks to be automated at least for a time with a minimum of effort and with tools close to hand.
When interacting with networks, communicating between computers is difficult with traditional tools, and communication failure becomes an ever-present concern.
Traditional automation tools such as shells struggle with this task because they make it difficult to implement features such as error handling.
Furthermore, in a distributed environment it cannot be assumed that a single machine can remain available to execute automation.
Automation authors are thus confronted not just with the need to use "real" programming languages but also with the need to implement a database-backed state machine which can "checkpoint" the job.
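To make the shape of that burden concrete, here is a minimal sketch of the hand-rolled pattern; the schema, the notion of a job as a list of steps, and all names here are illustrative assumptions, not taken from any particular system -
```python
# A hypothetical sketch of hand-rolled, database-backed checkpointing;
# the schema and the job-as-list-of-steps structure are assumptions.
import sqlite3


def run_job(job_id: str, steps: list) -> None:
    db = sqlite3.connect("jobs.sqlite")
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (job_id TEXT PRIMARY KEY, step INTEGER)"
    )
    row = db.execute(
        "SELECT step FROM checkpoints WHERE job_id = ?", (job_id,)
    ).fetchone()
    completed = row[0] if row else 0
    for i, step in enumerate(steps):
        if i < completed:
            continue  # this step finished before a previous crash or handoff
        step()
        # Record progress so another machine can resume from here.
        db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)", (job_id, i + 1)
        )
        db.commit()
```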
Taking a step back, this is an enormous failure of the languages we have available to describe workflow tasks.
That users need to write state machines that define state machines that actually perform the desired task shows that the available tools operate at the wrong level.
Airflow, for instance, succeeds at providing a "workflow" control flow graph abstraction which frees users from the concern of implementing their own resumable state machines.
Consider this example from the Airflow documentation -
```python
from __future__ import annotations

import pendulum

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.edgemodifier import Label

with DAG(
    "example_branch_labels",
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest")
    analyse = EmptyOperator(task_id="analyze")
    check = EmptyOperator(task_id="check_integrity")
    describe = EmptyOperator(task_id="describe_integrity")
    error = EmptyOperator(task_id="email_error")
    save = EmptyOperator(task_id="save")
    report = EmptyOperator(task_id="report")

    ingest >> analyse >> check
    check >> Label("No errors") >> save >> report
    check >> Label("Errors found") >> describe >> error >> report
```
Compared to writing out by hand all the nodes in the control flow graph and the requisite checkpointing to the database on state transitions, this is a considerable improvement.
But if you stop and look at this control flow graph, it's expressing a simple branching operation for which we have much more conventional notation.
Consider the same workflow using "normal" Python syntax rather than the embedded workflow syntax -
```python
def noop(*args):  # placeholder step; accepts and ignores any arguments
    pass


def check(result):
    return True


ingest = noop
analyse = noop
describe = noop
error = noop
save = noop
report = noop


def main():
    ingest()
    result = analyse()
    if check(result):
        save(result)
        report(result)
    else:
        describe()
        error()
        report()
```
The program a developer authors is already a control flow graph.
We have a name for "operators" or state nodes - they're just functions.
We have a name and notation for transitions in the state chart - they're just sequential statements.
Anything we need developers to write above that baseline represents friction specific to a workflow task which we should seek to minimize.
Temporal does better than Airflow and reflects this understanding.
Using insights from Azure Durable Functions, their SDK leverages the details of Python's `async` and `await` operators to hijack program control flow and implement workflow features under the covers.
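The mechanism can be sketched in miniature: a coroutine suspends at every `await`, so a driver loop underneath it can record the result of each suspension and later replay the same results deterministically. The toy below illustrates the idea; none of it is Temporal's actual internals, and the journaling is a stand-in for durable persistence -
```python
# Toy record/replay driver: not Temporal's implementation, just the trick
# of intercepting control flow at each await.
class Command:
    def __init__(self, name):
        self.name = name

    def __await__(self):
        return (yield self)  # suspend, handing this command to the driver


async def workflow():
    a = await Command("ingest")
    b = await Command("analyse")
    return (a, b)


def drive(coro, journal):
    """Step the coroutine, answering each command from the journal (replay)
    or by 'executing' it and appending the result (record)."""
    result = None
    step = 0
    try:
        while True:
            cmd = coro.send(result)
            if step < len(journal):
                result = journal[step]            # replay a recorded result
            else:
                result = f"result-of-{cmd.name}"  # pretend to execute
                journal.append(result)
            step += 1
    except StopIteration as stop:
        return stop.value


journal = []
print(drive(workflow(), journal))  # first run: records each result
print(drive(workflow(), journal))  # second run: replays deterministically
```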
Consider this example from the Temporal documentation -
```python
import asyncio
from dataclasses import dataclass
from datetime import timedelta
from typing import NoReturn

from temporalio import activity, workflow


@dataclass
class ComposeArgsInput:  # assumed payload type; the docs define it elsewhere
    arg1: str
    arg2: str


@activity.defn
async def cancellable_activity(input: ComposeArgsInput) -> NoReturn:
    try:
        while True:
            print("Heartbeating cancel activity")
            await asyncio.sleep(0.5)
            activity.heartbeat("some details")
    except asyncio.CancelledError:
        print("Activity cancelled")
        raise


@workflow.defn
class GreetingWorkflow:
    @workflow.run
    async def run(self, input: ComposeArgsInput) -> None:
        activity_handle = workflow.start_activity(
            cancellable_activity,
            ComposeArgsInput(input.arg1, input.arg2),
            start_to_close_timeout=timedelta(minutes=5),
            heartbeat_timeout=timedelta(seconds=30),
        )
        await asyncio.sleep(3)
        activity_handle.cancel()
```
This is really good compared to an equivalent Airflow graph!
All the details are "normal" Python, and the SDK fits "natively" into how Python execution occurs.
But it's still laden with syntax, such as the `async` function coloring and decorators, which serves only to support the workflow SDK.
In comparison, were this workflow a "simple" Python script, it would need to be written only as -
```python
# https://pypi.org/project/timeoutcontext/
from datetime import timedelta
from time import sleep

from timeoutcontext import task_with_timeout, TimeoutError


def cancellable_activity():
    try:
        while True:
            print("Heartbeating cancellable activity")
            sleep(0.5)
    except TimeoutError:
        print("Activity cancelled")
        raise


def main():
    task = task_with_timeout(lambda: cancellable_activity(),
                             timeout=timedelta(minutes=5))
    sleep(3)
    task.cancel()
```
As with Airflow, the Temporal SDK effectively requires that the programmer learn not just a set of libraries but the Python `async` features because the implementation of the workflow engine is leaked to the users.
The problem isn't just the excessive syntax; it's that, as with Airflow, user workflows are no longer "normal" programs.
There is in effect an entire Temporal interpreter stacked in between the Python runtime with which users are familiar and the user's program.
The result is a new language with none of the notational advantages of being one.
The flip side is that there is an enormous advantage in tooling to be had by leveraging an existing language - or something that looks enough like one - rather than inventing a new notation.
This is the cardinal sin of workflow tools like Kubeflow and various CI "workflow" formats - they adopt unique YAML- or XML-based notations which have no tooling support.
For instance, by being "normal"(ish) Python, the Temporal SDK benefits from access to editor autocompletion, the MyPy typechecker and all manner of other tools.
In short, the goal of this project is to provide a field-leading implementation of an [AMBROSIA](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/)-like durable functions platform with "plain" Python as the user-facing API.
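If that goal is met, the user-facing experience could be nothing more than the "normal" Python from earlier, with durability supplied behind a single entry point. The `durable` decorator below is purely hypothetical - it names the checkpoint-and-resume machinery the platform would provide, not an existing API -
```python
# Hypothetical sketch: `durable` stands in for the platform's planned
# checkpoint/replay machinery; it is not an existing Flowmetal API.
def durable(fn):
    return fn  # the real platform would persist progress and resume after failure


def noop(*args):  # placeholder step; accepts and ignores any arguments
    pass


ingest = analyse = describe = error = save = report = noop


def check(result):
    return True


@durable
def main():
    ingest()
    result = analyse()
    if check(result):
        save(result)
        report(result)
    else:
        describe()
        error()
        report()
```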


@@ -1,47 +1,156 @@
# An Asynchronous, Distributed Task Engine
This document presents a design, without a reference implementation, for a distributed programming system;
sometimes called a workflow engine.
It is intended to provide architecture-level clarity, allowing for the development of alternative designs or implementations as may suit.
## Problem Statement
In building, operating and maintaining distributed systems (many computers in concert) engineers face a tooling gap.
Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.)
and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
This is an enormous tooling win, as it allows small tasks to be automated at least for a time with a minimum of effort and with tools close to hand.
In interacting with networks, communicating between computers is difficult with traditional tools and communication failure becomes an ever-present concern.
Traditional automation tools such as shells are inadequate for this environment because achieving network communication is excessively difficult.
In a distributed environment it cannot be assumed that a single machine can remain available to execute automation;
this requires an approach to automation which allows for the incremental execution of single tasks at a time, with provisions for relocation and recovery should failure occur.
It also cannot be assumed that a single machine is sufficiently available to receive and process incoming events such as callbacks.
A distributed system is needed to wrangle distributed systems.
## Design Considerations
- Timeouts are everywhere
- Sub-Turing/boundable
## Architectural Overview
### Events
Things that will happen, or time out.
### Actions
Things the workflow will do, or time out.
### Bindings
Data the workflow either was given or computed.
### Conditionals
Decisions the workflow may make.
### Functions
A convenient way to talk about fragments of control flow graph.
### Tracing & Reporting
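Taken together, the primitives named in this overview might be modeled as the following sketch; every class and field here is an illustrative assumption, not part of the design -
```python
# Hypothetical sketch of the primitives named in the overview above;
# all names and fields are assumptions made for illustration.
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Callable, Optional


@dataclass
class Event:
    """Something that will happen, or time out."""
    name: str
    timeout: Optional[timedelta] = None


@dataclass
class Action:
    """Something the workflow will do, or time out."""
    name: str
    run: Callable[..., Any]
    timeout: Optional[timedelta] = None


@dataclass
class Binding:
    """Data the workflow either was given or computed."""
    name: str
    value: Any


@dataclass
class Conditional:
    """A decision the workflow may make."""
    predicate: Callable[..., bool]
```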
The new contents of this file are the "What problem are you trying to solve?" document reproduced in full above.