# Servitor: the precursor

The year was 2017 and Twitter infrastructure had a problem.

In order to manage large-scale stateful storage systems, it needed a variety of workflows which could sequence long-running tasks.
Draining traffic from a service instance, ensuring that local data was replicated, performing destructive upgrades, and bringing the instance back into the topology was a common example.
Another was coordinating workflows made out of unreliable tasks which required extensive error handling and retrying.

Twitter had, over the years, deployed a variety of in-house services which implemented specific workflows as domain-specific dedicated applications.
These applications struggled because each one contained both an implementation of the core workflow and all the requisite state management machinery.
Several of these systems had long-standing core bugs leading to lost tasks and high rates of manual intervention in supposedly fully automated processes.

Twitter had previously developed an in-house workflow engine for performing remediations called Mechanic.
Mechanic was tightly integrated with the observability stack, and enabled users to automatically invoke workflows when metrics thresholds were crossed.
However, Mechanic's workflows were linear sequences of actions.
It had no capabilities for error detection, retrying, or anything resembling involved programming.

Various teams at Twitter had experimented with adopting Apache Airflow as a workflow platform.
While Airflow addressed the limitations of Mechanic's workflows by supporting more fully featured DAGs, the operational experience was poor.
The multi-tenant SRE Airflow cluster was a source of pain, and individual groups had a hard time managing Airflow when they needed many DAGs.

Enter Servitor.

Servitor was originally conceived of as "call/cc airflow" by a particularly Lisp-minded fresh grad.
The insight behind the system was that any given "task" which a workflow should perform is basically just a function call.
The only real difference is that the workflow engine must commit the state of having started an action, and the task must then await the state produced by that action.
Centering asynchronous waiting and timeouts made unreliable actions, asynchronous actions, and synchronous actions easy to represent.
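
In rough Python terms (the names and journal shape here are illustrative, not Servitor's actual API), the idea looks something like this: the engine durably records that an action has started, then awaits whatever state the action eventually produces, under a timeout.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Journal:
    """Stand-in for a durable log of workflow state transitions."""
    entries: list = field(default_factory=list)

    def commit(self, event: str, **data) -> None:
        # A real engine would write this to replicated storage,
        # not an in-memory list.
        self.entries.append((event, data))


async def run_action(journal: Journal, name: str, action, *, timeout_s: float):
    """Commit that the action started, then await whatever it produces."""
    journal.commit("action_started", name=name)
    try:
        # `action` is an async callable; synchronous or unreliable work can be
        # wrapped (e.g. with asyncio.to_thread) to fit the same shape.
        result = await asyncio.wait_for(action(), timeout=timeout_s)
    except asyncio.TimeoutError:
        journal.commit("action_timed_out", name=name)
        raise
    journal.commit("action_completed", name=name, result=result)
    return result
```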

Servitor differentiated itself from Airflow by designing not for polling tasks but for webhook-based tasks, with polling as an optional fallback.
While this design choice has theoretical limitations, it proved easy to integrate into the existing ecosystem.
Webhooks fit naturally into asynchronous waiting and timeout models, and their poorer delivery semantics compared to polling actually became a feature rather than a bug.
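
A sketch of how the two can combine (again illustrative Python, not Servitor's implementation): a parked wait is normally resumed by a webhook delivery, and if the webhook is lost or late, the engine falls back to polling the external system.

```python
import asyncio

# Parked waits keyed by a correlation token handed to the external system.
_waiters: dict[str, asyncio.Future] = {}

def on_webhook(token: str, payload: dict) -> None:
    """Called by the HTTP layer when an external system posts a completion."""
    waiter = _waiters.get(token)
    if waiter is not None and not waiter.done():
        waiter.set_result(payload)

async def await_completion(token: str, poll, *, webhook_timeout_s: float):
    """Wait for a webhook; if it never arrives in time, fall back to polling."""
    waiter = asyncio.get_running_loop().create_future()
    _waiters[token] = waiter
    try:
        return await asyncio.wait_for(waiter, timeout=webhook_timeout_s)
    except asyncio.TimeoutError:
        # Missed or dropped webhook: ask the external system directly.
        return await poll(token)
    finally:
        _waiters.pop(token, None)
```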

Rather than requiring users to write a graph DSL in Python, Servitor presented a purpose-built state chart language with sub-Turing semantics.

Every Plan in Servitor consisted of parameters, a control flow graph, an entry point, a history, and a latch for a final "outcome" result.
Every node in the control flow graph represented an Action which would produce an Outcome, along with a pattern-matching system for choosing the next node in the graph based on that Outcome.
Outcomes were of three kinds: `Success`, `Failure` and `Unrecoverable`.
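
Roughly, and with names and fields invented here rather than taken from Servitor's actual schema, the shapes involved look like this:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class OutcomeKind(Enum):
    SUCCESS = auto()
    FAILURE = auto()
    UNRECOVERABLE = auto()

@dataclass
class Outcome:
    kind: OutcomeKind
    detail: dict = field(default_factory=dict)

@dataclass
class Node:
    """One Action, plus transitions chosen by matching on its Outcome."""
    action: str                           # which Action to invoke
    # Simplified here to a lookup by outcome kind; the real language
    # pattern-matched on the whole Outcome.
    transitions: dict[OutcomeKind, str]   # outcome kind -> next node name
    retry_limit: int = 3

@dataclass
class Plan:
    parameters: dict
    graph: dict[str, Node]                # the control flow graph, by node name
    entry_point: str
    history: list[tuple[str, Outcome]] = field(default_factory=list)
    outcome: Outcome | None = None        # latch for the final "outcome" result
```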

Successful outcomes represented roughly normal control flow.

Failure outcomes represented errors in Action implementations, timeouts waiting for an external result, and so forth.
The Servitor language featured an explicit `retry` word for branching backwards and, well, retrying after a failure occurred.
However, failures were not infinitely retryable.
Every node in the graph had a "retry limit" counter, which placed an upper bound on the number of times an action could be re-attempted.
Eventually, any retry loop would exceed its retry limit and encounter an Unrecoverable outcome for having done so.
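
Continuing the illustrative sketch from above (the interpreter here is hypothetical, reusing the Plan, Node and Outcome shapes sketched earlier), the retry bookkeeping amounts to counting attempts per node and escalating a Failure to Unrecoverable once the node's retry limit is exhausted:

```python
from collections import defaultdict

def execute(plan: Plan, run_action) -> Outcome:
    """Tiny interpreter sketch: run_action(action, params) -> Outcome."""
    attempts: dict[str, int] = defaultdict(int)
    current = plan.entry_point
    outcome = None
    while current is not None:
        node = plan.graph[current]
        attempts[current] += 1
        outcome = run_action(node.action, plan.parameters)
        # A `retry` edge may branch backwards on Failure, but only until the
        # node's retry limit is exhausted; after that the outcome escalates.
        if (outcome.kind is OutcomeKind.FAILURE
                and attempts[current] > node.retry_limit):
            outcome = Outcome(OutcomeKind.UNRECOVERABLE, outcome.detail)
        plan.history.append((current, outcome))
        # Match the outcome to pick the next node; no matching edge ends the Plan.
        current = node.transitions.get(outcome.kind)
    plan.outcome = outcome   # latch the final result
    return outcome
```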

Unrecoverable outcomes were unique in that they prohibited retrying.
Upon encountering an Unrecoverable outcome, a Plan would have to go try something else, or give up.

That excitable fresh grad has since had a long time to reflect on both the successes and the failures of Servitor.
While an undeniable operational success and well loved by its users for its scalability and reliability, in many ways Servitor failed to live up to its full potential.

First, it presented users with unfamiliar notation and unusual state-chart oriented operational semantics.
This was a far cry from the traditional scripting tools with which its ultimate users were familiar.

Second, while it succeeded admirably at modeling and handling failures, that same model made retrofitting data processing features difficult.
This meant that even workflows which carried simple state or branched on data were impossible to implement, or required external coordination.

A successor system must look like a "normal" language, providing familiar constructs such as sequential statements, and it must support data processing.
Indeed, there is good reason to think an entirely conventional scripting language could live within the persisted structure we usually associate with a "workflow" engine.
Statements and function calls are already a graph, as much as Servitor's or Airflow's graphs ever were.
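
As a toy illustration of that claim (a sketch of the general idea only, not any particular durable-execution system), an ordinary sequential function becomes a resumable workflow the moment each of its steps checkpoints its result to a persisted journal:

```python
import json
from functools import wraps
from pathlib import Path

# Hypothetical journal location; a real system would use replicated storage.
JOURNAL = Path("workflow_journal.json")

def _load() -> dict:
    return json.loads(JOURNAL.read_text()) if JOURNAL.exists() else {}

def _store(journal: dict) -> None:
    JOURNAL.write_text(json.dumps(journal))

def durable_step(fn):
    """Replay a step's recorded result if it already ran; otherwise run and persist it."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        journal = _load()
        key = f"{fn.__name__}:{json.dumps([args, kwargs], sort_keys=True, default=str)}"
        if key in journal:               # already completed: replay the result
            return journal[key]
        result = fn(*args, **kwargs)     # first run: execute and checkpoint
        journal[key] = result
        _store(journal)
        return result
    return wrapper

@durable_step
def drain_traffic(instance: str) -> str:
    return f"{instance}: drained"

@durable_step
def upgrade(instance: str) -> str:
    return f"{instance}: upgraded"

def workflow(instance: str) -> None:
    # Ordinary sequential statements; the persisted journal is what makes
    # this a resumable "workflow" rather than just a script.
    drain_traffic(instance)
    upgrade(instance)
```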

Wouldn't a durable Python implementation obviate the need for a distinct "workflow" tool as such?
Perhaps it's not that easy, but the result would be a much more generally useful starting place than a state chart engine.