# Servitor: the precursor
The year was 2017 and Twitter infrastructure had a problem.
In order to manage large-scale stateful storage systems, it needed a variety of workflows which could sequence long-running tasks.
A common example: draining traffic from a service instance, ensuring that local data was replicated, performing destructive upgrades, and bringing the instance back into the topology.
Another was coordinating workflows made out of unreliable tasks which required extensive error handling and retrying.
Twitter had, over the years, deployed a variety of in-house services which implemented specific workflows as domain-specific dedicated applications.
These applications struggled because each one contained both an implementation of the core workflow and all the requisite state management machinery.
Several of these systems had long-standing core bugs leading to lost tasks and high rates of manual intervention in supposedly fully automated processes.
Twitter had previously developed an in-house workflow engine for performing remediations called Mechanic.
Mechanic was tightly integrated with the observability stack, and enabled users to automate invoking workflows when metrics thresholds were crossed.
However, Mechanic's workflows were linear sequences of actions.
It had no capability for error detection, retrying, or any particularly involved programming.
Various teams at Twitter had experimented with adopting Apache Airflow as a workflow platform.
While Airflow addressed the limitations of Mechanic's workflows by supporting more fully featured DAGs, the operational experience with it was poor.
The multi-tenant SRE Airflow cluster was a source of pain, and individual groups had a hard time managing Airflow when they needed many DAGs.
Enter Servitor.
Servitor was originally conceived of as "call/cc airflow" by a particularly Lisp-minded fresh grad.
The insight behind the system was that any given "task" which a workflow should perform is basically just a function call.
The only real difference is that the workflow engine must commit the state of having started an action, and the task must then await the state produced by that action.
Centering asynchronous waiting and timeouts made unreliable actions, asynchronous actions and synchronous actions easy to represent.
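In rough Python, the idea looks something like this. This is a minimal sketch of the insight, not Servitor's actual API; the `store` and `action` interfaces are invented here for illustration.

```python
import time
import uuid


def run_action(store, action, plan_id, timeout_s=3600):
    """One workflow 'task' as a durable function call: commit that we
    started, fire the side effect, then await the resulting state."""
    token = str(uuid.uuid4())

    # Durably record the start *before* acting, so a restarted engine
    # resumes waiting instead of re-firing the side effect.
    store.commit(plan_id, {"state": "started", "action": action.name, "token": token})

    # Kick off the external work, handing it a token it can use to
    # report back (e.g. via webhook).
    action.start(callback_token=token)

    # Await the state produced by the action, bounded by a timeout.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        outcome = store.get_outcome(token)  # set by a webhook receiver or poller
        if outcome is not None:
            store.commit(plan_id, {"state": "finished", "outcome": outcome})
            return outcome
        time.sleep(1.0)

    return "Failure"  # a timeout is an ordinary, retryable failure
```

Under this framing an unreliable action is just one that may never resolve, an asynchronous action resolves later, and a synchronous action resolves immediately; all three fall out of the same await-with-timeout loop.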
Servitor differentiated itself from Airflow by designing not for polling tasks but for webhook based tasks with polling as an optional fallback.
While this design choice had theoretical limitations, it proved easy to integrate into the existing ecosystem.
Webhooks fit naturally into asynchronous waiting and timeout models, and their poorer delivery semantics compared to polling actually became a feature rather than a bug.
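The receiving side of that design is small. Here is a sketch, again with invented names, of why lost webhook deliveries were tolerable: a missed delivery simply degrades to the polling fallback, and a duplicate delivery is harmless.

```python
# Pending waits, keyed by the callback token handed to the external system.
PENDING: dict[str, str | None] = {}


def register_wait(token: str) -> None:
    """Called by the engine when an action starts."""
    PENDING[token] = None


def on_webhook(token: str, outcome: str) -> None:
    """HTTP handler body: an external system reports an Outcome.
    Duplicate deliveries are harmless because this write is idempotent;
    lost deliveries are covered by the polling fallback below."""
    if token in PENDING:
        PENDING[token] = outcome


def poll_fallback(token: str, check_remote) -> None:
    """Optional fallback: ask the external system directly if no
    webhook has arrived yet. `check_remote` is an assumed callable."""
    if PENDING.get(token) is None:
        PENDING[token] = check_remote(token)
```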
Rather than requiring users to write a graph DSL in Python, Servitor presented a purpose-built state chart language with sub-Turing semantics.
Every Plan in Servitor consisted of parameters, a control flow graph, an entry point, a history, and a latch for a final "outcome" result.
Every node in the control flow graph represented an Action which would produce an Outcome, paired with a pattern-matching system for choosing the next node in the graph based on that Outcome.
Outcomes came in three kinds: `Success`, `Failure`, and `Unrecoverable`.
Successful outcomes represented roughly normal control flow.
Failure outcomes represented errors in Action implementations, timeouts waiting for external results, and so forth.
The Servitor language featured an explicit `retry` word for branching backwards and, well, retrying after a failure occurred.
However, failures were not infinitely retryable.
Every node in the graph had a "retry limit" counter, which placed an upper bound on the number of times an action could be re-attempted.
Eventually, any retry loop would exceed its retry limit and produce an Unrecoverable outcome for having done so.
Unrecoverable outcomes were unique in that they prohibited retrying.
Upon encountering an Unrecoverable outcome, a Plan would have to go try something else - or give up.
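A rough Python reconstruction of those semantics might look like the following; this is my own model of the system, not Servitor's actual syntax or data structures.

```python
from dataclasses import dataclass, field

SUCCESS, FAILURE, UNRECOVERABLE = "Success", "Failure", "Unrecoverable"


@dataclass
class Node:
    action: object          # callable returning one of the three Outcomes
    on_outcome: dict        # Outcome -> next node name, or "retry"
    retry_limit: int = 3


@dataclass
class Plan:
    nodes: dict
    entry: str
    history: list = field(default_factory=list)
    outcome: str | None = None      # the final "outcome" latch


def run(plan: Plan) -> str:
    retries: dict = {}
    name = plan.entry
    while name is not None:
        node = plan.nodes[name]
        outcome = node.action()
        # Exceeding the retry limit converts a Failure into Unrecoverable.
        if outcome == FAILURE and retries.get(name, 0) >= node.retry_limit:
            outcome = UNRECOVERABLE
        plan.history.append((name, outcome))
        nxt = node.on_outcome.get(outcome)  # pattern match on the Outcome
        if nxt == "retry" and outcome != UNRECOVERABLE:
            retries[name] = retries.get(name, 0) + 1
            continue                        # branch backwards: re-run this node
        # Unrecoverable prohibits retrying: go somewhere else, or give up.
        name = None if nxt == "retry" else nxt
    plan.outcome = plan.history[-1][1]      # latch the final result
    return plan.outcome


# e.g. drain, retry replication up to the limit, else page a human
plan = Plan(
    nodes={
        "drain": Node(lambda: SUCCESS, {SUCCESS: "replicate"}),
        "replicate": Node(lambda: FAILURE,
                          {SUCCESS: None, FAILURE: "retry", UNRECOVERABLE: "page"}),
        "page": Node(lambda: SUCCESS, {SUCCESS: None}),
    },
    entry="drain",
)
```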
That excitable fresh grad has since had a long time to reflect on both the successes and the failures of Servitor.
While Servitor was an undeniable operational success, well loved by its users for its scalability and reliability, in many ways it failed to live up to its full potential.
First, it presented users with unfamiliar notation and unusual state-chart-oriented operational semantics.
This was a far cry from the traditional scripting tools with which its ultimate users were familiar.
Second, while it succeeded admirably at modeling and handling failures, that same model made retrofitting data processing features difficult.
This meant that even workflows which carried simple information or branched on data were impossible to implement, or required external coordination.
A successor system must look like a "normal" language, providing familiar constructs such as sequential statements, and must support data processing.
Indeed, there is good reason to think an entirely conventional scripting language could live within the persisted structure we usually associate with a "workflow" engine.
Statements and function calls are already a graph, as much as Servitor's or Airflow's ever was.
Wouldn't a durable Python implementation obviate the need for a distinct "workflow" tool as such?
Perhaps it's not that easy, but the result would be a much more generally useful starting place than a state chart engine.
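For what it's worth, here is a sketch of what "durable Python" could mean, using nothing but the standard library. The decorator and journal format are inventions of this sketch, not any existing tool.

```python
import json
import os


def durable(journal_path):
    """Replay completed calls from an append-only journal, so a
    restarted script resumes where it left off instead of redoing work."""
    done = {}
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            for line in f:
                entry = json.loads(line)
                done[entry["key"]] = entry["result"]

    def wrap(fn):
        def inner(*args):
            key = json.dumps([fn.__name__, args])
            if key in done:
                return done[key]    # already ran: replay the recorded result
            result = fn(*args)      # first run: perform the effect for real
            with open(journal_path, "a") as f:
                f.write(json.dumps({"key": key, "result": result}) + "\n")
            done[key] = result
            return result
        return inner
    return wrap
```

A script whose effectful steps are wrapped this way is already a workflow: its call sequence is the control flow graph, and the journal is the history.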