Import datalog-shell

Reid 'arrdem' McKenzie 2021-05-14 23:47:05 -06:00
parent 318f7caa6a
commit 633060910c
6 changed files with 522 additions and 0 deletions

View file

@@ -0,0 +1,9 @@
py_binary(
    name = "datalog-shell",
    main = "__main__.py",
    deps = [
        "//projects/datalog",
        py_requirement("prompt_toolkit"),
        py_requirement("yaspin"),
    ],
)

View file

@@ -0,0 +1,18 @@
.PHONY: deploy test

deploy: .dev
	source .dev/bin/activate; pip install twine; rm -r dist; python setup.py sdist; twine upload dist/*;

.dev:
	virtualenv --python=`which python3` .dev
	source .dev/bin/activate; pip install pytest; python setup.py develop

node_modules/canopy:
	npm install canopy

src/datalog/parser.py: node_modules/canopy src/datalog.peg
	node_modules/canopy/bin/canopy --lang=python src/datalog.peg
	mv src/datalog.py src/datalog/parser.py

test: .dev $(wildcard src/**/*) $(wildcard test/**/*)
	source .dev/bin/activate; PYTHONPATH=".:src/" pytest -vv

View file

@@ -0,0 +1,179 @@
# Datalog.Shell

A shell for my Datalog engine.

## What is Datalog?

[Datalog](https://en.wikipedia.org/wiki/Datalog) is a fully
declarative language for expressing relational data and queries,
typically written using a syntactic subset of Prolog. Its most
interesting feature compared to other relational languages such as
SQL is that it features production rules.

Briefly, a Datalog database consists of rules and tuples. Tuples are
written `a(b, "c", 126, ...).`, require no declaration (e.g. of a
table), and may be of arbitrary, even varying, length. The elements
of a tuple are strings, which may be written as bare words or quoted.

In the interpreter (or a file), we could define a small graph as such -

```
$ datalog
>>> edge(a, b).
⇒ edge('a', 'b')
>>> edge(b, c).
⇒ edge('b', 'c')
>>> edge(c, d).
⇒ edge('c', 'd')
```

But how can we query this? We can issue queries by entering a tuple
terminated with `?` instead of `.`.

For instance, we could query whether some tuples exist in the database -

```
>>> edge(a, b)?
⇒ edge('a', 'b')
>>> edge(d, f)?
⇒ Ø
>>>
```

We did define `edge(a, b).`, so our query returns that tuple. However,
the tuple `edge(d, f).` was never defined, so our query produces no
results. Rather than printing nothing, the `Ø` symbol, which denotes
the empty set, is printed for clarity.

This is correct, but uninteresting. How can we find, say, all the
edges from `a`? We don't have a construct like wildcards with which to
match anything - yet.

Enter logic variables. Logic variables are capitalized words, `X`,
`Foo` and the like, which are interpreted as wildcards by the query
engine. Capitalized words are always understood as logic variables.

```
>>> edge(a, X)?
⇒ edge('a', 'b')
```

However, unlike wildcards which simply match anything, logic variables
are unified within a query. Were we to write `edge(X, X)?`, we would
be asking for the set of tuples such that both elements of the `edge`
tuple are equal.

```
>>> edge(X, X)?
⇒ Ø
```

Of which we have none.

But what if we wanted to find paths between edges? Say, to check
whether a path exists from `a` to `d`. We'd need a way to unify many
logic variables together - and so far we've only seen queries of a
single tuple.

Enter rules. We can define productions by which the Datalog engine
can derive new tuples. A rule is written as a tuple "pattern", which
may contain constants or logic variables, followed by the `:-`
operator and a comma-separated sequence of body "clauses".

Rules are perhaps best understood as subqueries. A rule defines an
indefinite set of tuples such that, over that set, the body clauses
are simultaneously satisfied. This is how we achieve complex queries.

There is no alternation ("or") operator within a rule's body.
However, several rules can share the same tuple "pattern", which
achieves the same effect.

So if we wanted to find paths between edges in our database, we could
do so using two rules: one which defines a "simple" path, and one
which defines a path from `X` to `Y` recursively, by querying for an
edge from `X` to an unconstrained `Z` and then unifying that with
`path(Z, Y)`.

```
>>> path(X, Y) :- edge(X, Y).
⇒ path('X', 'Y') :- edge('X', 'Y').
>>> path(X, Y) :- edge(X, Z), path(Z, Y).
⇒ path('X', 'Y') :- edge('X', 'Z'), path('Z', 'Y').
>>> path(a, X)?
⇒ path('a', 'b')
⇒ path('a', 'c')
⇒ path('a', 'd')
```

We could also ask for all paths -

```
>>> path(X, Y)?
⇒ path('b', 'c')
⇒ path('a', 'b')
⇒ path('c', 'd')
⇒ path('b', 'd')
⇒ path('a', 'c')
⇒ path('a', 'd')
```

Datalog also supports negation. Within a rule, a tuple prefixed with
`~` becomes a negative clause. This allows us to express "does not
exist" relations, or antijoins. Note that this is only possible by
making the [closed world assumption](https://en.wikipedia.org/wiki/Closed-world_assumption).
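
For instance, with the `edge` tuples above, we could ask for edges
whose reverse edge is absent. This is an illustrative session - the
`oneway` relation is invented here, and the echo format is assumed to
follow the earlier examples -

```
>>> oneway(X, Y) :- edge(X, Y), ~edge(Y, X).
⇒ oneway('X', 'Y') :- edge('X', 'Y'), ~edge('Y', 'X').
>>> oneway(a, X)?
⇒ oneway('a', 'b')
```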

Datalog also supports binary equality as a special relation. `=(X, Y)?`
is a nonsense query on its own, because the space of `X` and `Y` is
undefined. However, within a rule body, equality (and negated
equality statements!) can be quite useful.
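
As an illustrative sketch - a hypothetical `selfloop` rule which would
find tuples whose two elements are equal, of which our edge database
has none -

```
>>> selfloop(X) :- edge(X, Y), =(X, Y).
⇒ selfloop('X') :- edge('X', 'Y'), =('X', 'Y').
>>> selfloop(X)?
⇒ Ø
```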

For convenience, the Datalog interpreter supports "retracting"
(deleting) tuples and rules. `edge(a, b)!` would retract that
constant tuple, but we cannot retract `path(a, b)!`, as that tuple is
generated by a rule. We can however retract the rules themselves -
`path(X, Y)!` would remove both `path` production rules from the
database.
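
An illustrative session - the first retraction succeeds and echoes
the tuple, the second finds nothing left to delete -

```
>>> edge(a, b)!
⇒ edge('a', 'b')
>>> edge(a, b)!
⇒ Ø
```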

The Datalog interpreter also supports reading tuples (and rules) from
one or more files, each specified with a `--load-db <filename>`
command line argument.
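
For instance (hypothetical file names) -

```
$ datalog --load-db graph.dtl --load-db social.dtl
Loaded graph.dtl ...
Loaded social.dtl ...
>>>
```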

## Usage

`pip install --user arrdem.datalog.shell`

This will install the `datalog` interpreter into your user-local
Python `bin` directory, and pull down the core `arrdem.datalog`
engine as well.

## Status

This is, to my knowledge, a complete implementation of a traditional
Datalog.

Support is included for binary `=` as a builtin relation, and for
negated terms in rules (prefixed with `~`).

Rules, and the recursive evaluation of rules, are supported with some
guards to prevent infinite recursion.

The interactive interpreter supports definitions (terms ending in
`.`), retractions (terms ending in `!`) and queries (terms ending in
`?`); see the interpreter's `help` response for more details.

### Limitations

Recursion may have some completeness bugs. I have not yet encountered
any, but I also don't have a strong proof of correctness for the
recursive evaluation of rules yet.

The current implementation of negated clauses CANNOT propagate
positive information. This means that negated clauses can only be
used in conjunction with positive clauses. It's not clear whether
this is an essential limitation.

There is as yet no query planner - not even segmenting rules and
tuples by relation to restrict evaluation. This means that the
complexity of a query is `O(dataset * term count)`, which is clearly
less than ideal.

## License

Mirrored from https://git.arrdem.com/arrdem/datalog-py

Published under the MIT license. See [LICENSE.md](LICENSE.md).

View file

@@ -0,0 +1,263 @@
#!/usr/bin/env python3

__doc__ = f"""
Datalog (py)
============

An interactive datalog interpreter with commands and persistence

Commands
~~~~~~~~

  .help (this message)
  .all  display all tuples
  .quit to exit the REPL

  To exit, use control-c or control-d

The interpreter
~~~~~~~~~~~~~~~

The interpreter reads one line at a time from stdin.

Lines are either

 - definitions (ending in .),
 - queries (ending in ?)
 - retractions (ending in !)

A definition may contain arbitrarily many datalog tuples and rules.

  edge(a, b). edge(b, c).  % A pair of definitions
  edge(a, b).              % The REPL's response that it has been committed
  edge(b, c).

A query may contain definitions, but they exist only for the duration
of the query.

  edge(X, Y)?  % A query which will enumerate all 2-edges
  edge(a, b).
  edge(b, c).

  edge(c, d). edge(X, Y)?  % A query with a local tuple
  edge(a, b).
  edge(b, c).
  edge(c, d).

A retraction may contain only one tuple or clause, which will be
expunged.

  edge(a, b)!  % This tuple is in our dataset
  edge(a, b)   % So deletion succeeds

  edge(a, b)!  % This tuple is no longer in our dataset
  Ø            % So deletion fails
"""

import argparse
import logging
import sys

from datalog.debris import Timing
from datalog.evaluator import select
from datalog.reader import pr_str, read_command, read_dataset
from datalog.types import (
    CachedDataset,
    Constant,
    Dataset,
    LVar,
    PartlyIndexedDataset,
    Rule,
    TableIndexedDataset,
)

from prompt_toolkit import print_formatted_text, prompt, PromptSession
from prompt_toolkit.formatted_text import FormattedText
from prompt_toolkit.history import FileHistory
from prompt_toolkit.styles import Style
from yaspin import Spinner, yaspin


STYLE = Style.from_dict({
    # User input (default text).
    "": "",
    "prompt": "ansigreen",
    "time": "ansiyellow",
})

SPINNER = Spinner(["|", "/", "-", "\\"], 200)


class InterpreterInterrupt(Exception):
    """An exception used to break the prompt or evaluation."""


def print_(fmt, **kwargs):
    print_formatted_text(FormattedText(fmt), **kwargs)


def print_db(db):
    """Render a database for debugging."""

    # Echo tuples and rules in the `⇒` response format the README documents.
    for e in db.tuples():
        print(f"⇒ {pr_str(e)}")

    for r in db.rules():
        print(f"⇒ {pr_str(r)}")


def main(args):
    """REPL entry point."""

    if args.db_cls == "simple":
        db_cls = Dataset
    elif args.db_cls == "cached":
        db_cls = CachedDataset
    elif args.db_cls == "table":
        db_cls = TableIndexedDataset
    elif args.db_cls == "partly":
        db_cls = PartlyIndexedDataset

    print(f"Using dataset type {db_cls}")

    session = PromptSession(history=FileHistory(".datalog.history"))
    db = db_cls([], [])

    if args.dbs:
        for db_file in args.dbs:
            try:
                with open(db_file, "r") as f:
                    db = db.merge(read_dataset(f.read()))
                    print(f"Loaded {db_file} ...")
            except Exception as e:
                print(f"Internal error - {e}")
                print(f"Unable to load db {db_file}, skipping")

    while True:
        try:
            line = session.prompt([("class:prompt", ">>> ")], style=STYLE)
        except (InterpreterInterrupt, KeyboardInterrupt):
            continue
        except EOFError:
            break

        if line == ".all":
            op = ".all"
        elif line == ".dbg":
            op = ".dbg"
        elif line == ".quit":
            break
        elif line in {".help", "help", "?", "??", "???"}:
            print(__doc__)
            continue
        elif line.split(" ")[0] == ".log":
            op = ".log"
        else:
            try:
                op, val = read_command(line)
            except Exception as e:
                print("Got an unknown command or syntax error, can't tell which")
                continue

        # Definition merges on the DB
        if op == ".all":
            print_db(db)

        # .dbg drops to a debugger shell so you can poke at the instance objects (database)
        elif op == ".dbg":
            import pdb
            pdb.set_trace()

        # .log sets the log level - badly
        elif op == ".log":
            level = line.split(" ")[1].upper()
            try:
                ch.setLevel(getattr(logging, level))
            except BaseException:
                print(f"Unknown log level {level}")

        elif op == ".":
            # FIXME (arrdem 2019-06-15):
            #   Syntax rules the parser doesn't impose...
            try:
                for rule in val.rules():
                    assert not rule.free_vars, f"Rule contains free variables {rule.free_vars!r}"

                for tuple in val.tuples():
                    assert not any(isinstance(e, LVar) for e in tuple), f"Tuples cannot contain lvars - {tuple!r}"
            except BaseException as e:
                print(f"Error: {e}")
                continue

            db = db.merge(val)
            print_db(val)

        # Queries execute - note that rules as queries have to be temporarily merged.
        elif op == "?":
            # In order to support ad-hoc rules (joins), we have to generate a transient
            # "query" database by bolting the rule on as an overlay to the existing
            # database. If of course we have a join.
            #
            # `val` was previously assumed to be the query pattern. Introduce `qdb`, now
            # used as the database to query, and "fix" `val` to be the temporary rule's
            # pattern.
            #
            # We use a new db local so that the ephemeral rule doesn't persist unless the
            # user later `.` defines it.
            #
            # Unfortunately doing this merge does nuke caches.
            qdb = db
            if isinstance(val, Rule):
                qdb = db.merge(db_cls([], [val]))
                val = val.pattern

            with yaspin(SPINNER) as spinner:
                with Timing() as t:
                    try:
                        results = list(select(qdb, val))
                    except KeyboardInterrupt:
                        print(f"Evaluation aborted after {t}")
                        continue

            # It's kinda bogus to move sorting out but oh well
            results = sorted(results)

            for _results, _bindings in results:
                _result = _results[0]  # select only selects one tuple at a time
                print(f"⇒ {pr_str(_result)}")

            # So we can report empty sets explicitly.
            if not results:
                print("⇒ Ø")

            print_([("class:time", f"Elapsed time - {t}")], style=STYLE)

        # Retractions try to delete, but may fail.
        elif op == "!":
            if val in db.tuples() or val in [r.pattern for r in db.rules()]:
                db = db_cls([u for u in db.tuples() if u != val],
                            [r for r in db.rules() if r.pattern != val])
                print(f"⇒ {pr_str(val)}")
            else:
                print("⇒ Ø")


parser = argparse.ArgumentParser()

# Select which dataset type to use
parser.add_argument("--db-type",
                    choices=["simple", "cached", "table", "partly"],
                    help="Choose which DB to use (default partly)",
                    dest="db_cls",
                    default="partly")

parser.add_argument("--load-db", dest="dbs", action="append",
                    help="Datalog files to load first.")


if __name__ == "__main__":
    args = parser.parse_args(sys.argv[1:])

    logger = logging.getLogger("arrdem.datalog")

    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)

    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    ch.setFormatter(formatter)
    logger.addHandler(ch)

    main(args)

View file

@@ -0,0 +1,35 @@
from setuptools import setup

setup(
    name="arrdem.datalog.shell",
    # Package metadata
    version="0.0.2",
    license="MIT",
    description="A shell for my datalog engine",
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    author="Reid 'arrdem' McKenzie",
    author_email="me@arrdem.com",
    url="https://git.arrdem.com/arrdem/datalog-shell",
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Topic :: Database",
        "Topic :: Database :: Database Engines/Servers",
        "Topic :: Database :: Front-Ends",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
    ],
    scripts=[
        "bin/datalog",
    ],
    install_requires=[
        "arrdem.datalog~=2.0.0",
        "prompt_toolkit==2.0.9",
        "yaspin==0.14.3",
    ],
)

View file

@@ -8,20 +8,27 @@ autoflake==1.4
 Babel==2.9.0
 beautifulsoup4==4.9.3
 black==20.8b1
+bleach==3.3.0
 certifi==2020.12.5
+cffi==1.14.5
 chardet==4.0.0
 click==7.1.2
+colorama==0.4.4
 commonmark==0.9.1
 coverage==5.5
+cryptography==3.4.7
 docutils==0.17
 idna==2.10
 imagesize==1.2.0
+importlib-metadata==4.0.1
 iniconfig==1.1.1
 isodate==0.6.0
 isort==5.8.0
 jedi==0.18.0
+jeepney==0.6.0
 Jinja2==2.11.3
 jsonschema==3.2.0
+keyring==23.0.1
 livereload==2.6.3
 lxml==4.6.3
 m2r==0.2.1
@@ -35,10 +42,12 @@ openapi-spec-validator==0.3.0
 packaging==20.9
 parso==0.8.2
 pathspec==0.8.1
+pkginfo==1.7.0
 pluggy==0.13.1
 prompt-toolkit==3.0.18
 pudb==2020.1
 py==1.10.0
+pycparser==2.20
 pyflakes==2.3.1
 Pygments==2.8.1
 pyparsing==2.4.7
@@ -48,10 +57,14 @@ pytest-cov==2.11.1
 pytest-pudb==0.7.0
 pytz==2021.1
 PyYAML==5.4.1
+readme-renderer==29.0
 recommonmark==0.7.1
 redis==3.5.3
 regex==2021.4.4
 requests==2.25.1
+requests-toolbelt==0.9.1
+rfc3986==1.5.0
+SecretStorage==3.3.1
 six==1.15.0
 snowballstemmer==2.1.0
 soupsieve==2.2.1
@@ -67,6 +80,8 @@ sphinxcontrib-qthelp==1.0.3
 sphinxcontrib-serializinghtml==1.1.4
 toml==0.10.2
 tornado==6.1
+tqdm==4.60.0
+twine==3.4.1
 typed-ast==1.4.2
 typing-extensions==3.7.4.3
 unify==0.5
@@ -74,5 +89,8 @@ untokenize==0.1.1
 urllib3==1.26.4
 urwid==2.1.2
 wcwidth==0.2.5
+webencodings==0.5.1
 yamllint==1.26.1
 yarl==1.6.3
+yaspin==1.5.0
+zipp==3.4.1