From d62ef16f5bf491932e06806d8dfb74eccf84759c Mon Sep 17 00:00:00 2001
From: Reid 'arrdem' McKenzie
Date: Wed, 8 Mar 2023 10:16:31 -0700
Subject: [PATCH] Write up where this came from

---
 doc/servitor.md | 64 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 doc/servitor.md

diff --git a/doc/servitor.md b/doc/servitor.md
new file mode 100644
index 0000000..91acb64
--- /dev/null
+++ b/doc/servitor.md
@@ -0,0 +1,64 @@
+# Servitor: the precursor
+
+The year was 2017 and Twitter infrastructure had a problem.
+
+In order to manage large-scale stateful storage systems, it needed a variety of workflows which could sequence long-running tasks.
+A common example was draining traffic from a service instance, ensuring that local data was replicated, performing a destructive upgrade, and bringing the instance back into the topology.
+Another was coordinating workflows made up of unreliable tasks which required extensive error handling and retrying.
+
+Twitter had, over the years, deployed a variety of in-house services which implemented specific workflows as dedicated, domain-specific applications.
+These applications struggled because each one contained both an implementation of its core workflow and all the requisite state management machinery.
+Several of these systems had long-standing core bugs leading to lost tasks and high rates of manual intervention in supposedly fully automated processes.
+
+Twitter had previously developed an in-house workflow engine for performing remediations, called Mechanic.
+Mechanic was tightly integrated with the observability stack, and enabled users to automatically invoke workflows when metrics thresholds were crossed.
+However, Mechanic's workflows were linear sequences of actions.
+It had no capability for error detection, retrying, or particularly involved programming.
+
+Various teams at Twitter had experimented with adopting Apache Airflow as a workflow platform.
+While Airflow addressed the limitations of Mechanic's workflows by supporting more fully featured DAGs, the operational experience with it was poor.
+The multi-tenant SRE Airflow cluster was a source of pain, and individual groups had a hard time managing Airflow themselves when they needed many DAGs.
+
+Enter Servitor.
+
+Servitor was originally conceived of as "call/cc Airflow" by a particularly Lisp-minded fresh grad.
+The insight behind the system was that any given "task" which a workflow should perform is basically just a function call.
+The only real difference is that the workflow engine must commit the state of having started an action, and then the task must await the state produced by that action.
+Centering asynchronous waiting and timeouts made unreliable actions, asynchronous actions, and synchronous actions all easy to represent.
+
+Servitor differentiated itself from Airflow by designing not for polling tasks but for webhook-based tasks, with polling as an optional fallback.
+While this design choice has theoretical limitations, it proved easy to integrate into the existing ecosystem.
+Webhooks fit naturally into the asynchronous waiting and timeout model, and their poorer delivery semantics compared to polling actually became a feature rather than a bug.
+
+Rather than requiring users to write a graph DSL in Python, Servitor presented a purpose-built state chart language with sub-Turing semantics.
+
+Every Plan in Servitor consisted of parameters, a control flow graph, an entry point, a history, and a latch for a final "outcome" result.
+Every node in the control flow graph represented an Action which would produce an Outcome, and provided a pattern-matching system for choosing the next node in the graph based on that Outcome.
+Outcomes were of three kinds: `Success`, `Failure` and `Unrecoverable`.
+
+Successful outcomes represented roughly normal control flow.
+
+Failure outcomes represented errors in Action implementations, timeouts waiting for external results, and so forth.
+The Servitor language featured an explicit `retry` word for branching backwards and, well, retrying after a failure occurred.
+However, failures were not infinitely retryable.
+Every node in the DAG had a "retry limit" counter, which placed an upper bound on the number of times an action could be re-attempted.
+Eventually, any retry loop would exceed its retry limit and encounter an Unrecoverable outcome for having done so.
+
+Unrecoverable outcomes were unique in that they prohibited retrying.
+Upon encountering an Unrecoverable outcome, a Plan would have to go try something else - or give up.
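+
+To make that shape concrete, here is a rough Python sketch of the model.
+Servitor was neither implemented in Python nor written this way; every name below is invented for illustration, and the sketch only mirrors the pieces described above: Actions producing Outcomes, Outcome-directed edges, per-node retry limits, and a final outcome latch.
+
+```python
+# Illustrative only: a rough model of a Servitor Plan, not Servitor's actual
+# implementation or surface syntax.  All names here are made up for the sketch.
+from dataclasses import dataclass, field
+from typing import Callable, Optional, Union
+
+
+@dataclass
+class Success:
+    value: object = None
+
+
+@dataclass
+class Failure:
+    reason: str = ""
+
+
+@dataclass
+class Unrecoverable:
+    reason: str = ""
+
+
+Outcome = Union[Success, Failure, Unrecoverable]
+
+
+@dataclass
+class Node:
+    action: Callable[[dict], Outcome]   # the side-effecting task itself
+    on_success: Optional[str] = None    # next node; None means the Plan is done
+    on_failure: Optional[str] = None    # a `retry` edge, usually pointing backwards
+    retry_limit: int = 3                # upper bound on re-attempts of this node
+
+
+@dataclass
+class Plan:
+    params: dict
+    nodes: dict                         # node name -> Node
+    entry: str
+    history: list = field(default_factory=list)
+    outcome: Optional[Outcome] = None   # the final "outcome" latch
+
+    def run(self) -> Outcome:
+        current, failures = self.entry, {}
+        result = Success()
+        while current is not None:
+            node = self.nodes[current]
+            result = node.action(self.params)
+            self.history.append((current, result))
+            if isinstance(result, Failure):
+                failures[current] = failures.get(current, 0) + 1
+                if failures[current] > node.retry_limit:
+                    # Exceeding the retry limit turns the failure unrecoverable.
+                    result = Unrecoverable(f"retry limit exceeded at {current}")
+            if isinstance(result, Unrecoverable):
+                break                   # retrying is prohibited past this point
+            current = node.on_success if isinstance(result, Success) else node.on_failure
+        self.outcome = result
+        return result
+
+
+plan = Plan(
+    params={"host": "db-1"},
+    entry="drain",
+    nodes={
+        "drain": Node(action=lambda p: Success(f"drained {p['host']}"),
+                      on_success="upgrade"),
+        "upgrade": Node(action=lambda p: Success(f"upgraded {p['host']}"),
+                        on_failure="drain"),   # a `retry` edge branching backwards
+    },
+)
+print(plan.run())  # -> Success(value='upgraded db-1')
+```
+
+A sketch like this leaves out the part that mattered most operationally: committing the history and the latch so that asynchronous waits and timeouts survive engine restarts.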
+
+That excitable fresh grad has since had a long time to reflect on both the successes and the failures of Servitor.
+While it was an undeniable operational success, well loved by its users for its scalability and reliability, in many ways it failed to live up to its full potential.
+
+First, it presented users with unfamiliar notation and unusual state-chart-oriented operational semantics.
+This was a far cry from the traditional scripting tools with which its ultimate users were familiar.
+
+Second, while it succeeded admirably at modeling and handling failures, that same model made retrofitting data processing features difficult.
+This meant that even workflows which carried simple state or branched on data were impossible to implement, or required external coordination.
+
+A successor system must look like a "normal" language, providing familiar constructs such as sequential statements, and it must support data processing.
+Indeed, there is good reason to think an entirely conventional scripting language could live within the persisted structure we usually associate with a "workflow" engine.
+Statements and function calls already form a graph, as much of one as Servitor's or Airflow's ever was.
+
+Wouldn't a durable Python implementation obviate the need for a distinct "workflow" tool as such?
+Perhaps it's not that easy, but the result would be a much more generally useful starting place than a state chart engine.
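+
+As a thought experiment, here is a deliberately naive sketch of that idea: journal the result of each effectful step so that re-running the same ordinary function replays completed work instead of redoing it.
+Nothing here is an existing library; the decorator, the journal format, and the `step` helper are all inventions for the example.
+
+```python
+# A toy illustration of "durable Python", not a real library: each step's result
+# is journaled to disk, so re-running the function after a crash replays the
+# steps that already finished instead of executing them again.
+import json
+import os
+
+
+def durable(journal_path):
+    """Wrap an ordinary function so its named steps are checkpointed to disk."""
+    def wrap(fn):
+        def runner(*args):
+            log = []
+            if os.path.exists(journal_path):
+                with open(journal_path) as f:
+                    log = json.load(f)
+
+            def step(name, thunk):
+                for done_name, done_value in log:
+                    if done_name == name:
+                        return done_value          # already ran: replay the result
+                value = thunk()
+                log.append([name, value])
+                with open(journal_path, "w") as f:
+                    json.dump(log, f)              # commit before moving on
+                return value
+
+            return fn(step, *args)
+        return runner
+    return wrap
+
+
+@durable("upgrade-db-1.journal")
+def upgrade_instance(step, host):
+    step("drain", lambda: f"drained {host}")
+    step("check-replication", lambda: "replicas healthy")
+    step("upgrade", lambda: f"upgraded {host}")
+    return step("restore", lambda: f"{host} back in topology")
+
+
+# upgrade_instance("db-1") runs each step at most once, even across re-runs.
+```
+
+A real implementation would have to handle argument capture, concurrency, and the webhook-style waits described above, but the skeleton of a workflow engine is recognizably just a persistent call log.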