# Servitor: the precursor
The year was 2017 and Twitter infrastructure had a problem.
To manage its large-scale stateful storage systems, it needed a variety of workflows that could sequence long-running tasks.
A common example was draining traffic from a service instance, ensuring its local data was replicated, performing a destructive upgrade, and bringing the instance back into the topology.
Another was coordinating workflows composed of unreliable tasks that required extensive error handling and retrying.
Twitter had, over the years, deployed a variety of in-house services that implemented specific workflows as dedicated, domain-specific applications.
These applications struggled because each one contained both an implementation of its core workflow and all the requisite state management machinery.
Several of these systems had long-standing core bugs leading to lost tasks and high rates of manual intervention in supposedly fully automated processes.
Twitter had previously developed an in-house workflow engine for performing remediations called Mechanic.
Mechanic was tightly integrated with the observability stack, and enabled users to automate invoking workflows when metrics thresholds were crossed.
However, Mechanic's workflows were linear sequences of actions.
It had no support for error detection, retries, or any particularly involved programming.
Various teams at Twitter had experimented with adopting Apache Airflow as a workflow platform.
While Airflow addressed the limitations of Mechanic's workflows by supporting fully featured DAGs, the operational experience with it was poor.
The multi-tenant SRE Airflow cluster was a source of pain, and individual groups had a hard time managing Airflow when they needed many DAGs.
Enter Servitor.
Servitor was originally conceived of as "call/cc airflow" by a particularly Lisp-minded fresh grad.
The insight behind the system was that any given "task" which a workflow should perform is basically just a function call.
The only real difference is that the workflow engine must commit the fact that it has started an action, and then the task must await the state produced by that action.
Centering asynchronous waiting and timeouts made unreliable actions, asynchronous actions, and synchronous actions all easy to represent.
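As a rough illustration of the idea (nothing here is Servitor's actual API; the store and its methods are hypothetical), the "task is a function call" pattern looks something like this in Python:

```python
# A minimal sketch, assuming a hypothetical durable store that can journal
# started actions and block until an outcome is recorded for them.
import uuid


class Engine:
    def __init__(self, store):
        self.store = store  # durable journal of started actions and outcomes

    def run_action(self, action_name, **params):
        action_id = str(uuid.uuid4())
        # Commit the fact that the action has started *before* performing it,
        # so a restarted engine can recover and resume waiting.
        self.store.journal_started(action_id, action_name, params)
        self.store.dispatch(action_id, action_name, params)
        # Suspend until an external event (webhook, poll, or timeout) records
        # an outcome, then return it just like a function's return value.
        return self.store.await_outcome(action_id)
```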
Servitor differentiated itself from Airflow by being designed not around polling tasks but around webhook-based tasks, with polling as an optional fallback.
While this design choice has theoretical limitations, it proved easy to integrate into the existing ecosystem.
Webhooks fit naturally into asynchronous waiting and timeout models, and their weaker delivery semantics compared to polling actually became a feature rather than a bug.
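To make that concrete, here is a hypothetical sketch (not Servitor's implementation) of webhook-driven completion with a periodic timeout sweep as the fallback:

```python
# A hypothetical sketch: pending waits are resolved by incoming webhooks, and
# a periodic sweep converts anything past its deadline into a failure.
import time


class PendingWaits:
    def __init__(self):
        self.waits = {}  # action_id -> (deadline, on_outcome callback)

    def register(self, action_id, timeout_s, on_outcome):
        self.waits[action_id] = (time.monotonic() + timeout_s, on_outcome)

    def on_webhook(self, action_id, outcome):
        # Late or duplicate deliveries are simply ignored, which is why weak
        # webhook delivery semantics are tolerable here.
        entry = self.waits.pop(action_id, None)
        if entry is not None:
            _, on_outcome = entry
            on_outcome(outcome)

    def expire(self):
        now = time.monotonic()
        for action_id, (deadline, on_outcome) in list(self.waits.items()):
            if now >= deadline:
                del self.waits[action_id]
                on_outcome("failure: timed out")
```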
Rather than requiring users to write a graph DSL in Python, Servitor presented a purpose-built state chart language with sub-Turing semantics.
Every Plan in Servitor consisted of parameters, a control flow graph, an entry point, a history, and a latch for a final "outcome" result.
Every node in the control flow graph represented an Action that would produce an Outcome, together with a pattern-matching rule for choosing the next node in the graph based on that Outcome.
Outcomes came in three kinds: `Success`, `Failure`, and `Unrecoverable`.
`Success` outcomes represented roughly normal control flow.
`Failure` outcomes represented errors in Action implementations, timeouts while waiting for external results, and so forth.
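In Python-flavored pseudocode (Servitor expressed this in its own language; this is only to fix the idea), the taxonomy might be written as:

```python
from enum import Enum


class OutcomeKind(Enum):
    SUCCESS = "success"              # normal control flow
    FAILURE = "failure"              # retryable: bad run, timeout, and so on
    UNRECOVERABLE = "unrecoverable"  # terminal: retrying is prohibited
```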
The Servitor language featured an explicit `retry` word for branching backwards and, well, retrying after a failure occurred.
However, failures were not infinitely retryable.
Every node in the control flow graph had a "retry limit" counter, which placed an upper bound on the number of times its action could be re-attempted.
Eventually, any retry loop would exceed its retry limit and produce an `Unrecoverable` outcome for having done so.
`Unrecoverable` outcomes were unique in that they prohibited retrying.
Upon encountering an `Unrecoverable` outcome, a Plan had to go try something else entirely, or give up.
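Continuing the sketch above (again hypothetical; Servitor encoded this in its state chart language rather than in Python), converting an exhausted retry budget into a terminal outcome is just:

```python
def classify(outcome: OutcomeKind, attempts: int, retry_limit: int) -> OutcomeKind:
    # Once the retry budget is spent, a Failure becomes terminal and the
    # Plan must branch to a different node or give up.
    if outcome is OutcomeKind.FAILURE and attempts >= retry_limit:
        return OutcomeKind.UNRECOVERABLE
    return outcome
```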
That excitable fresh grad has since had a long time to reflect on both the successes and the failures of Servitor.
While an undeniable operational success, well loved by its users for its scalability and reliability, in many ways it failed to live up to its full potential.
First, it presented users with unfamiliar notation and unusual state-chart-oriented operational semantics.
This was a far cry from the traditional scripting tools its ultimate users were familiar with.
Second, while it succeeded admirably at modeling and handling failures, that same model made retrofitting data processing features difficult.
This meant that workflows needing to carry even simple values, or to branch on data, were impossible to implement directly or required external coordination.
A successor system must look like a "normal" language, providing familiar constructs such as sequential statements, and it must support data processing.
Indeed, there is good reason to think an entirely conventional scripting language could live within the persisted structure we usually associate with a "workflow" engine.
Statements and function calls already form a graph, as much as Servitor's or Airflow's graphs ever did.
Wouldn't a durable Python implementation obviate the need for a distinct "workflow" tool as such?
Perhaps it's not that easy, but the result would be a much more generally useful starting place than a state chart engine.
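As a purely hypothetical illustration of that starting place (none of these names are an existing API), a durable interpreter might journal each effectful call and replay the journal after a crash, so an ordinary-looking script picks up where it left off:

```python
# A sketch of "durable Python": effectful steps are journaled, and on restart
# previously recorded results are replayed instead of re-running the effects.
import json


class Durable:
    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.journal = self._load()
        self.step = 0

    def _load(self):
        try:
            with open(self.journal_path) as f:
                return [json.loads(line) for line in f]
        except FileNotFoundError:
            return []

    def effect(self, fn, *args):
        # If this step already ran in a prior execution, return the recorded
        # result rather than repeating the side effect.
        if self.step < len(self.journal):
            result = self.journal[self.step]
        else:
            result = fn(*args)
            with open(self.journal_path, "a") as f:
                f.write(json.dumps(result) + "\n")
        self.step += 1
        return result


# Usage sketch: the helpers below are hypothetical stand-ins for real actions.
# d = Durable("drain-host-1234.journal")
# ticket = d.effect(open_drain_ticket, "host-1234")
# d.effect(wait_for_replication, ticket)
# d.effect(destructive_upgrade, "host-1234")
```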