Consolidate docs
This commit is contained in:
parent
6c5217850c
commit
b1b00ee384
3 changed files with 141 additions and 188 deletions
@@ -27,9 +27,9 @@ This centering of evented communication makes Flowmetal ideal for **coordination

**Scripting** - Durability and distribution of execution come at coordination costs which make Flowmetal well suited for coordination tasks, but not for heavy processing.

- For a problem statement, see [Call/CC Airflow](doc/call_cc_airflow.md).
- For a problem statement, see [Call/CC Airflow](doc/what_problem.md).
- For an architecture overview, see [Architecture](doc/architecture.md).
- For example doodles, see [examples](examples).
- For example doodles, see [examples](examples) and [NOTEs](doc/NOTES.md).

## License
@@ -1,156 +0,0 @@
# What problem are you trying to solve?

In building, operating and maintaining distributed systems (many computers in concert), engineers face a tooling gap.

Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.) and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
This is an enormous tooling win, as it allows small tasks to be automated, at least for a time, with a minimum of effort and with tools close to hand.

In interacting with networks, communicating between computers is difficult with traditional tools and communication failure becomes an ever-present concern.
Traditional automation tools such as shells struggle with this task because they make it difficult to implement features such as error handling.

Furthermore, in a distributed environment it cannot be assumed that a single machine can remain available to execute automation.
Automation authors are thus confronted not just with the need to use "real" programming languages but with the need to implement a database-backed state machine which can "checkpoint" the job.

Taking a step back, this is an enormous failure of the languages we have available to describe workflow tasks.
That users need to write state machines that define the state machines that actually perform the desired task shows that the available tools operate at the wrong level.

Airflow, for instance, succeeds at providing a "workflow" control flow graph abstraction which frees users from the concern of implementing their own resumable state machines.

Consider this example from the Airflow documentation -

```python
from __future__ import annotations

import pendulum

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.edgemodifier import Label

with DAG(
    "example_branch_labels",
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest")
    analyse = EmptyOperator(task_id="analyze")
    check = EmptyOperator(task_id="check_integrity")
    describe = EmptyOperator(task_id="describe_integrity")
    error = EmptyOperator(task_id="email_error")
    save = EmptyOperator(task_id="save")
    report = EmptyOperator(task_id="report")

    ingest >> analyse >> check
    check >> Label("No errors") >> save >> report
    check >> Label("Errors found") >> describe >> error >> report
```

Compared to writing out by hand all the nodes in the control flow graph and the requisite checkpointing to the database on state transitions, this is a considerable improvement.
But if you stop and look at this control flow graph, it's expressing a simple branching operation for which we have much more conventional notation.

Consider the same workflow using "normal" Python syntax rather than the embedded workflow syntax -

```python
def noop():
    pass

def check(result):
    return True

ingest = noop
analyse = noop
describe = noop
error = noop
save = noop
report = noop

def main():
    ingest()
    result = analyse()
    if check(result):
        save(result)
        report(result)
    else:
        describe()
        error()
        report()
```

The program a developer authors is already a control flow graph.
We have a name for "operators" or state nodes - they're just functions.
We have a name and notation for transitions in the state chart - they're just sequential statements.

Anything we need developers to write above that baseline represents friction specific to a workflow task which we should seek to minimize.

Temporal does better than Airflow and reflects this understanding.
Using insights from Azure Durable Functions, their SDK leverages the details of Python's `async` and `await` operators to hijack program control flow and implement workflow features under the covers.

Consider this example from the Temporal documentation -

```python
@activity.defn
async def cancellable_activity(input: ComposeArgsInput) -> NoReturn:
    try:
        while True:
            print("Heartbeating cancel activity")
            await asyncio.sleep(0.5)
            activity.heartbeat("some details")
    except asyncio.CancelledError:
        print("Activity cancelled")
        raise


@workflow.defn
class GreetingWorkflow:
    @workflow.run
    async def run(self, input: ComposeArgsInput) -> None:
        activity_handle = workflow.start_activity(
            cancellable_activity,
            ComposeArgsInput(input.arg1, input.arg2),
            start_to_close_timeout=timedelta(minutes=5),
            heartbeat_timeout=timedelta(seconds=30),
        )

        await asyncio.sleep(3)
        activity_handle.cancel()
```

This is really good compared to an equivalent Airflow graph!
All the details are "normal" Python, and the SDK fits "natively" into how Python execution occurs.
But it's still laden with syntax such as the `async` function coloring and decorators which serve only to support the workflow SDK.

In comparison, were this workflow a "simple" Python script, it would only need to be written

```python
# https://pypi.org/project/timeoutcontext/
from timeoutcontext import task_with_timeout, TimeoutError
from datetime import timedelta
from time import sleep


def cancellable_activity():
    try:
        while True:
            print("Heartbeating cancellable activity")
            sleep(0.5)
    except TimeoutError:
        print("Activity cancelled")
        raise


def main():
    task = task_with_timeout(lambda: cancellable_activity(),
                             timeout=timedelta(minutes=5))
    sleep(3)
    task.cancel()
```

As with Airflow, the Temporal SDK effectively requires that the programmer learn not just a set of libraries but also Python's `async` features, because the implementation of the workflow engine is leaked to its users.
The problem isn't just the excessive syntax; it's that, as with Airflow, user workflows are no longer "normal" programs.
There is in effect an entire Temporal interpreter stacked in between the Python runtime with which users are familiar and the user's program.
It is in effect a new language with none of the notational advantages of being one.

The flipside is that there is an enormous advantage in tooling to be had by leveraging an existing language - or something that looks enough like one - rather than inventing a new notation.
This is the cardinal sin of workflow tools like Kubeflow and various CI "workflow" formats - they adopt unique YAML- or XML-based notations which have no tooling support.
For instance, by being "normal" (ish) Python, the Temporal SDK benefits from access to editor autocompletion, the MyPy typechecker and all manner of other tools.

In short, the goal of this project is to provide a field-leading implementation of an [AMBROSIA](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/)-like durable functions platform with "plain" Python as the user-facing API.

@@ -1,47 +1,156 @@
# An Asynchronous, Distributed Task Engine

This document presents a design, without a reference implementation, for a distributed programming system;
sometimes called a workflow engine.
It is intended to provide architectural-level clarity allowing for the development of alternative designs or implementations as may suit.

## Problem Statement
# What problem are you trying to solve?

In building, operating and maintaining distributed systems (many computers in concert), engineers face a tooling gap.

Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.)
and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
Within the confines of a single computer, we have shells (`bash`, `csh`, `zsh`, `oil` etc.) and a suite of small programs which mesh together well enough for the completion of small tasks with ad-hoc automation.
This is an enormous tooling win, as it allows small tasks to be automated, at least for a time, with a minimum of effort and with tools close to hand.

In interacting with networks, communicating between computers is difficult with traditional tools and communication failure becomes an ever-present concern.
Traditional automation tools such as shells are inadequate for this environment because achieving network communication is excessively difficult.
Traditional automation tools such as shells struggle with this task because they make it difficult to implement features such as error handling.

In a distributed environment it cannot be assumed that a single machine can remain available to execute automation;
this requires an approach to automation which allows for the incremental execution of single tasks at a time, with provisions for relocation and recovery should failure occur.
Furthermore, in a distributed environment it cannot be assumed that a single machine can remain available to execute automation.
Automation authors are thus confronted not just with the need to use "real" programming languages but with the need to implement a database-backed state machine which can "checkpoint" the job.
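
That scaffolding can be sketched in a few lines of plain Python. This is an illustrative sketch only - the file name, step list and helper functions are all invented here, with a JSON file standing in for the database:

```python
# Hand-rolled checkpointing: a JSON file stands in for the database, and
# every name here (STATE_FILE, STEPS, run, ...) is invented for illustration.
import json
import os

STATE_FILE = "job_state.json"

def ingest():
    print("ingest")

def analyse():
    print("analyse")

def report():
    print("report")

# The "state machine": an ordered list of named steps.
STEPS = [("ingest", ingest), ("analyse", analyse), ("report", report)]

def load_state():
    # Recover progress left behind by a previous (possibly crashed) run.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"done": []}

def checkpoint(state):
    # Persist progress so another machine can pick the job back up.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def run():
    state = load_state()
    for name, step in STEPS:
        if name in state["done"]:
            continue  # step already completed before a restart
        step()
        state["done"].append(name)
        checkpoint(state)
```

Every line of actual work needs this scaffolding wrapped around it, which is exactly the friction described above.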

It also cannot be assumed that a single machine is sufficiently available to receive and process incoming events such as callbacks.
A distributed system is needed to wrangle distributed systems.
Taking a step back, this is an enormous failure of the languages we have available to describe workflow tasks.
That users need to write state machines that define the state machines that actually perform the desired task shows that the available tools operate at the wrong level.

## Design Considerations

- Timeouts are everywhere
- Sub-Turing/boundable
-

## Architectural Overview

### Events
Things that will happen, or time out.

### Actions
Things the workflow will do, or time out.

### Bindings
Data the workflow either was given or computed.

### Conditionals
Decisions the workflow may make.

### Functions
A convenient way to talk about fragments of control flow graph.

### Tracing & Reporting

Airflow, for instance, succeeds at providing a "workflow" control flow graph abstraction which frees users from the concern of implementing their own resumable state machines.

Consider this example from the Airflow documentation -

```python
from __future__ import annotations

import pendulum

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.edgemodifier import Label

with DAG(
    "example_branch_labels",
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest")
    analyse = EmptyOperator(task_id="analyze")
    check = EmptyOperator(task_id="check_integrity")
    describe = EmptyOperator(task_id="describe_integrity")
    error = EmptyOperator(task_id="email_error")
    save = EmptyOperator(task_id="save")
    report = EmptyOperator(task_id="report")

    ingest >> analyse >> check
    check >> Label("No errors") >> save >> report
    check >> Label("Errors found") >> describe >> error >> report
```

Compared to writing out by hand all the nodes in the control flow graph and the requisite checkpointing to the database on state transitions, this is a considerable improvement.
But if you stop and look at this control flow graph, it's expressing a simple branching operation for which we have much more conventional notation.

Consider the same workflow using "normal" Python syntax rather than the embedded workflow syntax -

```python
def noop():
    pass

def check(result):
    return True

ingest = noop
analyse = noop
describe = noop
error = noop
save = noop
report = noop

def main():
    ingest()
    result = analyse()
    if check(result):
        save(result)
        report(result)
    else:
        describe()
        error()
        report()
```

The program a developer authors is already a control flow graph.
We have a name for "operators" or state nodes - they're just functions.
We have a name and notation for transitions in the state chart - they're just sequential statements.

Anything we need developers to write above that baseline represents friction specific to a workflow task which we should seek to minimize.
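
As one hedged sketch of what that minimum could look like - the `@step` decorator and its journal below are invented for this document, not an existing library - durability can be layered onto ordinary sequential code by a decorator that records each completed step:

```python
import functools

# Invented for illustration: an in-memory journal of completed steps.
# A durable system would keep this in a database, keyed per workflow run.
journal = {}

def step(fn):
    # Hypothetical decorator: record each step's result; on replay after a
    # failure, completed steps return their journaled result instead of
    # rerunning. (Keyed by name only - a real system would key by call.)
    @functools.wraps(fn)
    def wrapper(*args):
        if fn.__name__ in journal:
            return journal[fn.__name__]
        result = fn(*args)
        journal[fn.__name__] = result
        return result
    return wrapper

@step
def ingest():
    return "raw"

@step
def analyse():
    return "clean"

def main():
    ingest()
    return analyse()
```

The control flow stays plain sequential Python; durability becomes a property of how the functions are run, not of how the program is written.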

Temporal does better than Airflow and reflects this understanding.
Using insights from Azure Durable Functions, their SDK leverages the details of Python's `async` and `await` operators to hijack program control flow and implement workflow features under the covers.
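
That hijacking mechanism can be illustrated without any SDK. The sketch below is invented for this document (it is not Temporal's implementation): by driving the coroutine itself, a would-be engine observes every `await` and can journal each suspension point:

```python
import types

@types.coroutine
def suspend(label):
    # Awaiting this hands control to whatever code is driving the coroutine.
    result = yield label
    return result

async def workflow():
    a = await suspend("step-1")
    b = await suspend("step-2")
    return [a, b]

def drive(coro):
    # A stand-in for a workflow engine: it sees every await, and could
    # persist a journal entry before resuming with a (replayable) result.
    journal = []
    value = None
    while True:
        try:
            label = coro.send(value)
        except StopIteration as stop:
            return stop.value, journal
        journal.append(label)
        value = f"result of {label}"
```

This interception point is what lets an engine checkpoint and replay, but it is also why user code must be written as `async` functions - the engine only sees what is awaited.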

Consider this example from the Temporal documentation -

```python
@activity.defn
async def cancellable_activity(input: ComposeArgsInput) -> NoReturn:
    try:
        while True:
            print("Heartbeating cancel activity")
            await asyncio.sleep(0.5)
            activity.heartbeat("some details")
    except asyncio.CancelledError:
        print("Activity cancelled")
        raise


@workflow.defn
class GreetingWorkflow:
    @workflow.run
    async def run(self, input: ComposeArgsInput) -> None:
        activity_handle = workflow.start_activity(
            cancellable_activity,
            ComposeArgsInput(input.arg1, input.arg2),
            start_to_close_timeout=timedelta(minutes=5),
            heartbeat_timeout=timedelta(seconds=30),
        )

        await asyncio.sleep(3)
        activity_handle.cancel()
```

This is really good compared to an equivalent Airflow graph!
All the details are "normal" Python, and the SDK fits "natively" into how Python execution occurs.
But it's still laden with syntax such as the `async` function coloring and decorators which serve only to support the workflow SDK.

In comparison, were this workflow a "simple" Python script, it would only need to be written

```python
# https://pypi.org/project/timeoutcontext/
from timeoutcontext import task_with_timeout, TimeoutError
from datetime import timedelta
from time import sleep


def cancellable_activity():
    try:
        while True:
            print("Heartbeating cancellable activity")
            sleep(0.5)
    except TimeoutError:
        print("Activity cancelled")
        raise


def main():
    task = task_with_timeout(lambda: cancellable_activity(),
                             timeout=timedelta(minutes=5))
    sleep(3)
    task.cancel()
```

As with Airflow, the Temporal SDK effectively requires that the programmer learn not just a set of libraries but also Python's `async` features, because the implementation of the workflow engine is leaked to its users.
The problem isn't just the excessive syntax; it's that, as with Airflow, user workflows are no longer "normal" programs.
There is in effect an entire Temporal interpreter stacked in between the Python runtime with which users are familiar and the user's program.
It is in effect a new language with none of the notational advantages of being one.

The flipside is that there is an enormous advantage in tooling to be had by leveraging an existing language - or something that looks enough like one - rather than inventing a new notation.
This is the cardinal sin of workflow tools like Kubeflow and various CI "workflow" formats - they adopt unique YAML- or XML-based notations which have no tooling support.
For instance, by being "normal" (ish) Python, the Temporal SDK benefits from access to editor autocompletion, the MyPy typechecker and all manner of other tools.

In short, the goal of this project is to provide a field-leading implementation of an [AMBROSIA](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/)-like durable functions platform with "plain" Python as the user-facing API.