Some tapping

2021-04-11 01:01:25 -06:00 · 2021-04-11 01:01:25 -06:00 · 98ffc020f9
commit 98ffc020f9
parent d3192702fc
4 changed files with 150 additions and 51 deletions
--- a/projects/calf/README.md
+++ b/projects/calf/README.md
@@ -10,45 +10,6 @@ I found the JVM environment burdensome, difficult to maintain velocity in, and m
 Calf is a testbed.
 It's supposed to be a lightweight, unstable, easy for me to hack on substrate for exploring those old ideas and some new ones.

-Particularly I'm interested in:
-- compilers-as-databases (or using databases)
-- stream processing and process models of computation more akin to Erlang
-- reliability sensitive programming models (failure, recovery, process supervision)
-
-I previously [blogged a bit](https://www.arrdem.com/2019/04/01/the_silver_tower/) about some ideas for what this could look like.
-I'm convinced that a programming environment based around [virtual resiliency](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/) is a worthwhile goal (having independently invented it) and worth trying to bring to a mainstream general purpose platform like Python.
-
-## Manifesto
-
-In the last decade, immutability has been affirmed in the programming mainstream as an effective tool for making programs and state more manageable, and one which has been repeatedly implemented at acceptable performance costs.
-Especially in messaging based rather than state sharing environments, immutability and "data" oriented programming is becoming more and more common.
-
-It also seems that much of the industry is moving towards message based reactive or network based connective systems.
-Microservices seem to have won, and functions-as-a-service seem to be a rising trend reflecting a desire to offload or avoid deployment management rather than wrangle stateful services.
-
-In these environments, programs begin to consist entirely of messaging with other programs over shared channels such as traditional HTTP or other RPC tools or message buses such as Kafka, gRPC, ThriftMux and soforth.
-
-Key challenges with these connective services are:
-- How they handle failure
-- How they achieve reliability
-- The ergonomic difficulties of building and deploying connective programs
-- The operational difficulties of managing N-many 'reliable' services
-
-Tools like Argo, Airflow and the like begin to talk about such networked or evented programs as DAGs; providing schedulers for sequencing actions and executors for performing actions.
-
-Airflow provides a programmable Python scheduler environment, but fails to provide an execution isolation boundary (such as a container or other subprocess/`fork()` boundary) allowing users to bring their own dependencies.
-Instead Airflow users must build custom Airflow packagings which bundle dependencies into the Airflow instance.
-This means that Airflow deployments can only be centralized with difficulty due to shared dependencies and disparate dependency lifecycles and limits the return on investment of the platform by increasing operational burden.
-
-Argo ducks this mistake, providing a robust scheduler and leveraging k8s for its executor.
-This allows Argo to be managed independently of any of the workloads it manages - a huge step forwards over Airflow - but this comes at considerable ergonomic costs for trivial tasks and provides a more limited scheduler.
-
-Previously I developed a system which provided a much stronger DSL than Airflow's, but made the same key mistake of not decoupling execution from the scheduler/coordinator.
-Calf is a sketch of a programming language and system with a nearly fully featured DSL, and decoupling between scheduling (control flow of programs) and execution of "terminal" actions.
-
-In short, think a Py-Lisp where instead of doing FFI directly to the parent Python instance you do FFI by enqueuing a (potentially retryable!) request onto a shared cluster message bus, from which subscriber worker processes elsewhere provide request/response handling.
-One could reasonably accuse this project of being an attempt to unify Erlang and a hosted Python to build a "BASH for distsys" tool while providing a multi-tenant execution platform that can be centrally managed.
-
 ## License

 Copyright Reid 'arrdem' McKenzie, 3/5/2017.
--- a/projects/flowmetal/NOTES.org
+++ b/projects/flowmetal/NOTES.org
@@ -1,4 +1,4 @@
-# Notes
+#+TITLE: Notes

 https://github.com/Pyrlang/Pyrlang
 https://en.wikipedia.org/wiki/Single_system_image
--- a/projects/flowmetal/README.md
+++ b/projects/flowmetal/README.md
@@ -30,14 +30,23 @@ While Flowmetal foreign functions could be fast, Flowmetal's interpreter isn't d
 It's designed for eventing and ensuring durability.
 This makes Flowmetal suitable for interacting with and coordinating other systems, but it's not gonna win any benchmark games.

-## Wait what?
+## An overview

-Okay.
-In simpler words, Flowmetal is an interpreted lisp which can use a datastore of your choice for durability.
+In the systems world we have SH, Bourne SH, BASH, ZSH and friends, which provide a common interface for connecting processes together.
+However, in the distributed systems world we don't have a good parallel for connecting microservices, especially where complex failure handling is required.
+
+I previously [blogged a bit](https://www.arrdem.com/2019/04/01/the_silver_tower/) about some ideas for what this could look like.
+I'm convinced that a programming environment based around [virtual resiliency](https://www.microsoft.com/en-us/research/publication/a-m-b-r-o-s-i-a-providing-performant-virtual-resiliency-for-distributed-applications/) is a worthwhile goal (having independently invented it) and worth trying to bring to a mainstream general purpose platform like Python.
+
+Flowmetal is an interpreted language backed by a durable event store.
+The execution history of a program is persisted to the durable store as execution proceeds.
+If an interpretation step fails to persist, it can't have external effects and can be retried or recovered.
+The event store also provides Flowmetal's only interface for communicating with external systems.
 Other systems can attach to Flowmetal's datastore and send events to and receive them from Flowmetal.
 For instance Flowmetal contains a reference implementation of a HTTP callback connector and of a HTTP request connector.
+This allows Flowmetal programs to request that HTTP requests be sent on their behalf, consume the result, and wait for callbacks.
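
The persist-then-proceed model described above can be sketched in plain Python. This is only an illustration under loud assumptions: SQLite stands in for the durable store, and `EventStore` and `run` are hypothetical names, not Flowmetal's actual interface. The point it shows is that a step only "happened" once its event is committed, so a crashed run can resume from the log without redoing persisted steps.

```python
import json
import sqlite3


class EventStore:
    """A toy append-only event log (hypothetical, not Flowmetal's API)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, program TEXT, body TEXT)"
        )

    def append(self, program, event):
        # The connection context manager commits on success and rolls back
        # on error, so a failed append leaves no trace in the log.
        with self.db:
            self.db.execute(
                "INSERT INTO events (program, body) VALUES (?, ?)",
                (program, json.dumps(event)),
            )

    def history(self, program):
        rows = self.db.execute(
            "SELECT body FROM events WHERE program = ? ORDER BY seq", (program,)
        )
        return [json.loads(body) for (body,) in rows]


def run(store, program, steps):
    """Run `steps` in order, skipping any step already recorded in the log."""
    done = store.history(program)
    state = done[-1]["state"] if done else 0
    for index, step in enumerate(steps):
        if index < len(done):
            continue  # Persisted by an earlier attempt; don't redo it.
        state = step(state)
        store.append(program, {"step": index, "state": state})
    return state
```

Calling `run` again with the same `store` and program name after a crash picks up at the first step that has no persisted event.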

-A possible Flowmetal setup looks something like this -
+A Flowmetal setup would look something like this -

 ```
                      +----------------------------+
@@ -65,19 +74,118 @@ A possible Flowmetal setup looks something like this -
                +--------------------------+
 ```

-In this setup, the Flowmetal interpreters are able to interact with an external HTTP service; sending and receiving webhooks with Flowmetal programs waiting for those external events to arrive.
+A Flowmetal program could look something like this -

-For instance this program would use the external connector stubs to build up interaction(s) with an external system.
+``` python
+#!/usr/bin/env flowmetal

-```lisp
- 
+from flow.http import callback
+from flow.http import request
+from flow.time import Timeout, forever, sleep


+# A common pattern is to make a HTTP request to some service which will do some
+# processing and attempt to deliver a callback.
+def simple_remote_job(make_request):
+  # Make a HTTP callback.
+  # HTTP callbacks have a URL to which a result may be delivered at most once,
+  # and define an "event" which can be waited for.
+  cb = callback.new()
+  # Use user-defined logic to construct a job request.
+  # When the job completes, it should make a request to the callback URL.
+  req = make_request(cb.url)
+  # We can now start the job
+  resp = request.execute(req)
+  # And now we can await a response which will populate the callback's event.
+  return await cb.event
+
+
+# But there are a couple things which can go wrong here. The initial request
+# could fail, the remote process could fail and the callback delivery could fail
+# to name a few. We can provide general support for handling this by using the
+# same control inversion pattern we used for building the request.
+def reliable_remote_job(make_request,
+                        job_from_response,
+                        get_job,
+                        callback_timeout=None,
+                        job_completed=None,
+                        poll_sleep=None,
+                        poll_timeout=None):
+  # The structure here is much the same, except we need to handle some extra cases.
+  # First, we're gonna do the same callback dance but potentially with a timeout.
+  cb = callback.new()
+  req = make_request(cb.url)
+  resp = request.execute(req)
+  resp.raise_for_status()
+  job = job_from_response(resp)
+
+  # If the user gave us a circuit breaker, use that to bound our waiting.
+  with callback_timeout or forever():
+    try:
+      await cb.event
+      return get_job(job)
+    except Timeout:
+      pass
+
+  # The user hasn't given us enough info to do busy waiting, so we timed out.
+  if not (job_from_response and job_completed and get_job):
+    raise Timeout
+
+  # If we failed to wait for the callback, let's try polling the job.
+  # We'll let the user put a total bound on this too.
+  with poll_timeout or forever():
+    # We use user-defined logic to wait for the job to complete.
+    # Note that we don't conflate get_job and job_completed or assume they
+    # compose so that users can consume status endpoints without fetches.
+    while not job_completed(job):
+      # The user can define how we back off too.
+      # A stateful function could count successive calls and change behavior.
+      # For instance implementing constant, fibonacci or exponential backoff.
+      sleep(poll_sleep() if poll_sleep else 1)
+
+    # When the job is "complete", we let the user control fetching its status
+    return get_job(job)
+
+
+# Let's do a quick example of consuming something like this.
+# Say we have a system - let's call it wilson - that lets us request jobs
+# for doing bare metal management. Drains, reboots, undrains and the like.
+def create_job_request(host, stages, callbacks):
+  """Forge but don't execute a job creation request."""
+  return request.new("POST", f"http://wilson.local/api/v3/host/{host}",
+                     json={"stages": stages, "callbacks": callbacks or []})
+
+
+def job_from_response(create_resp):
+  """Handle the job creation response, returning the ID of the created job."""
+  return create_resp.json().get("job_id")
+
+
+def get_job(job_id):
+  """Fetch a job."""
+  return request.execute(request.new("GET", f"http://wilson.local/api/v3/job/{job_id}")).json()
+
+
+def job_completed(job_id):
+  """Decide if a job has completed."""
+  return (
+    request.execute(request.new("GET", f"http://wilson.local/api/v3/job/{job_id}/status"))
+    .json()
+    .get("status", "PENDING")
+  ) in ["SUCCESS", "FAILURE"]
+
+
+# These tools in hand, we can quickly express a variety of reliable jobs.
+def reboot(host):
+  """Reboot a host, attempting callback waiting but falling back to polling."""
+  return reliable_remote_job(
+    lambda url: create_job_request(host, ["drain", "reboot", "undrain"], [url]),
+    job_from_response,
+    get_job,
+    job_completed=job_completed,
+  )
 ```

-
-Comparisons to Apache Airflow are at least in this setup pretty apt, although Flowmetal's durable execution model makes it much more suitable for providing reliable workflows and its DSL is more approachable.
-
 ## License

 Mirrored from https://git.arrdem.com/arrdem/flowmetal
--- a/projects/flowmetal/doc/manifesto.md
+++ b/projects/flowmetal/doc/manifesto.md
@@ -0,0 +1,30 @@
+# A manifesto
+
+In the last decade, immutability has been affirmed in the programming mainstream as an effective tool for making programs and state more manageable, and one which has been repeatedly implemented at acceptable performance costs.
+Especially in messaging based rather than state sharing environments, immutability and "data" oriented programming are becoming more and more common.
+
+It also seems that much of the industry is moving towards message based reactive or network based connective systems.
+Microservices seem to have won, and functions-as-a-service seem to be a rising trend reflecting a desire to offload or avoid deployment management rather than wrangle stateful services.
+
+In these environments, programs begin to consist entirely of messaging with other programs over shared channels such as traditional HTTP or other RPC tools or message buses such as Kafka, gRPC, ThriftMux and so forth.
+
+Key challenges with these connective services are:
+- How they handle failure
+- How they achieve reliability
+- The ergonomic difficulties of building and deploying connective programs
+- The operational difficulties of managing N-many 'reliable' services
+
+Tools like Argo, Airflow and the like begin to talk about such networked or evented programs as DAGs; providing schedulers for sequencing actions and executors for performing actions.
+
+Airflow provides a programmable Python scheduler environment, but fails to provide an execution isolation boundary (such as a container or other subprocess/`fork()` boundary) that would allow users to bring their own dependencies.
+Instead, Airflow users must build custom Airflow packagings which bundle their dependencies into the Airflow instance.
+Shared dependencies with disparate lifecycles make Airflow deployments difficult to centralize, which increases operational burden and limits the platform's return on investment.
+
+Argo ducks this mistake, providing a robust scheduler and leveraging k8s for its executor.
+This allows Argo to be managed independently of any of the workloads it manages - a huge step forwards over Airflow - but this comes at considerable ergonomic costs for trivial tasks and provides a more limited scheduler.
+
+Previously I developed a system which provided a much stronger DSL than Airflow's, but made the same key mistake of not decoupling execution from the scheduler/coordinator.
+Calf is a sketch of a programming language and system with a nearly fully featured DSL, and decoupling between scheduling (control flow of programs) and execution of "terminal" actions.
+
+In short, think a Py-Lisp where instead of doing FFI directly to the parent Python instance you do FFI by enqueuing a (potentially retryable!) request onto a shared cluster message bus, from which subscriber worker processes elsewhere provide request/response handling.
+One could reasonably accuse this project of being an attempt to unify Erlang and a hosted Python to build a "BASH for distsys" tool while providing a multi-tenant execution platform that can be centrally managed.
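
The "FFI by enqueuing a request onto a message bus" idea can be sketched in a few lines of Python. This is a toy under stated assumptions: in-process `queue.Queue` objects stand in for the shared cluster bus, a thread stands in for a subscriber worker process, and `bus`, `replies`, `worker`, `start_worker` and `remote_call` are all hypothetical names rather than anything Calf or Flowmetal defines.

```python
import queue
import threading
import uuid

# Stand-ins for the shared cluster message bus (hypothetical; a real
# deployment would use an actual broker instead of in-process queues).
bus = queue.Queue()
replies = {}  # request id -> queue.Queue on which the caller waits


def worker(handlers):
    """Subscriber: take requests off the bus and publish a reply for each."""
    while True:
        msg = bus.get()
        if msg is None:  # Shutdown sentinel.
            return
        try:
            result = ("ok", handlers[msg["fn"]](*msg["args"]))
        except Exception as exc:
            result = ("error", repr(exc))
        replies[msg["id"]].put(result)


def start_worker(handlers):
    """Run `worker` on a daemon thread and return the thread."""
    thread = threading.Thread(target=worker, args=(handlers,), daemon=True)
    thread.start()
    return thread


def remote_call(fn, *args, retries=3, timeout=1.0):
    """Enqueue a request and wait for its reply, retrying on error or timeout.

    Retrying is only safe when the remote function is idempotent.
    """
    for _ in range(retries):
        msg_id = uuid.uuid4().hex
        replies[msg_id] = queue.Queue()
        bus.put({"id": msg_id, "fn": fn, "args": args})
        try:
            status, value = replies[msg_id].get(timeout=timeout)
        except queue.Empty:
            continue  # No reply in time; enqueue a fresh attempt.
        if status == "ok":
            return value
    raise RuntimeError(f"{fn} failed after {retries} attempts")
```

The caller never touches the handler directly. It only sees the bus, which is what lets the worker live in another process or on another machine. The sketch never cleans up `replies`, which a real system would need to do.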