diff --git a/projects/datalog-shell/BUILD b/projects/datalog-shell/BUILD
new file mode 100644
index 0000000..66a9af8
--- /dev/null
+++ b/projects/datalog-shell/BUILD
@@ -0,0 +1,9 @@
+py_binary(
+    name = "datalog-shell",
+    main = "__main__.py",
+    deps = [
+        "//projects/datalog",
+        py_requirement("prompt_toolkit"),
+        py_requirement("yaspin"),
+    ]
+)
diff --git a/projects/datalog-shell/Makefile b/projects/datalog-shell/Makefile
new file mode 100644
index 0000000..d76435d
--- /dev/null
+++ b/projects/datalog-shell/Makefile
@@ -0,0 +1,18 @@
+.PHONY: deploy test
+
+deploy: .dev
+	source .dev/bin/activate; pip install twine; rm -r dist; python setup.py sdist; twine upload dist/*;
+
+.dev:
+	virtualenv --python=`which python3` .dev
+	source .dev/bin/activate; pip install pytest; python setup.py develop
+
+node_modules/canopy:
+	npm install canopy
+
+src/datalog/parser.py: node_modules/canopy src/datalog.peg
+	node_modules/canopy/bin/canopy --lang=python src/datalog.peg
+	mv src/datalog.py src/datalog/parser.py
+
+test: .dev $(wildcard src/**/*) $(wildcard test/**/*)
+	source .dev/bin/activate; PYTHONPATH=".:src/" pytest -vv
diff --git a/projects/datalog-shell/README.md b/projects/datalog-shell/README.md
new file mode 100644
index 0000000..0c2c84c
--- /dev/null
+++ b/projects/datalog-shell/README.md
@@ -0,0 +1,179 @@
+# Datalog.Shell
+
+A shell for my Datalog engine.
+
+## What is Datalog?
+
+[Datalog](https://en.wikipedia.org/wiki/Datalog) is a fully
+declarative language for expressing relational data and queries,
+typically written using a syntactic subset of Prolog. Its most
+interesting feature compared to other relational languages such as SQL
+is its support for production rules.
+
+Briefly, a datalog database consists of rules and tuples. Tuples are
+written `a(b, "c", 126, ...).`; they require no declaration (e.g. of
+a table) and may be of arbitrary, even varying, length. The elements
+of a tuple are strings, which may be written as bare words or quoted.
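The schema-free tuple model described above is easy to picture in plain Python. As an illustration only (this is not the engine's actual representation), a fact store can be modeled as a set of string tuples:

```python
# A datalog fact store needs no declarations: every fact is a named
# tuple of strings, and facts sharing a name may differ in arity.
facts = set()

def assert_fact(name, *elements):
    """Record a fact such as edge(a, b)."""
    facts.add((name,) + elements)

assert_fact("edge", "a", "b")
assert_fact("edge", "b", "c")
assert_fact("triple", "a", "b", "c")  # varying arity is fine

# Checking whether a constant tuple was asserted is a set lookup.
print(("edge", "a", "b") in facts)  # True
print(("edge", "d", "f") in facts)  # False
```

Since arity is part of a fact's identity, `edge(a, b)` and a hypothetical `edge(a, b, c)` could coexist without any table definitions.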
+
+In the interpreter (or a file), we could define a small graph as such -
+
+```
+$ datalog
+>>> edge(a, b).
+⇒ edge('a', 'b')
+>>> edge(b, c).
+⇒ edge('b', 'c')
+>>> edge(c, d).
+⇒ edge('c', 'd')
+```
+
+But how can we query this? We can issue queries by entering a tuple
+terminated with `?` instead of `.`.
+
+For instance, we could query whether some tuples exist in the database -
+
+```
+>>> edge(a, b)?
+⇒ edge('a', 'b')
+>>> edge(d, f)?
+⇒ Ø
+>>>
+```
+
+We did define `edge(a, b).` so our query returns that tuple. However,
+the tuple `edge(d, f).` was not defined, so our query produces no
+results. Rather than printing nothing, the `Ø` symbol, which denotes
+the empty set, is printed for clarity.
+
+This is correct, but uninteresting. How can we find, say, all the
+edges from `a`? We don't have a construct like wildcards with which to
+match anything - yet.
+
+Enter logic variables. Logic variables are capitalized words, `X`,
+`Foo` and the like, which are interpreted as wildcards by the query
+engine. Capitalized words are always understood as logic variables.
+
+```
+>>> edge(a, X)?
+⇒ edge('a', 'b')
+```
+
+However, unlike wildcards, which simply match anything, logic
+variables are unified within a query. Were we to write `edge(X, X)?`
+we would be asking for the set of tuples such that both elements of
+the `edge` tuple are equal.
+
+```
+>>> edge(X, X)?
+⇒ Ø
+```
+
+Of which we have none.
+
+But what if we wanted to find paths between edges? Say, to check
+whether a path exists from `a` to `d`. We'd need a way to unify many
+logic variables together - and so far we've only seen queries of a
+single tuple.
+
+Enter rules. We can define productions by which the Datalog engine
+can produce new tuples. Rules are written as a tuple "pattern", which
+may contain constants or logic variables, joined by the `:-` operator
+to a comma-separated sequence of body "clauses".
+
+Rules are perhaps best understood as subqueries. 
A rule defines an
+indefinite set of tuples such that, over that set, the query clauses
+are simultaneously satisfied. This is how we achieve complex queries.
+
+There is no alternation - or - operator within a rule's body. However,
+rules can share the same tuple "pattern".
+
+So if we wanted to, say, find paths between edges in our database, we
+could do so using two rules. One which defines a "simple" path, and
+one which defines a path from `X` to `Y` recursively by querying for
+an edge from `X` to an unconstrained `Z`, and then unifying that with
+`path(Z, Y)`.
+
+```
+>>> path(X, Y) :- edge(X, Y).
+⇒ path('X', 'Y') :- edge('X', 'Y').
+>>> path(X, Y) :- edge(X, Z), path(Z, Y).
+⇒ path('X', 'Y') :- edge('X', 'Z'), path('Z', 'Y').
+>>> path(a, X)?
+⇒ path('a', 'b')
+⇒ path('a', 'c')
+⇒ path('a', 'd')
+```
+
+We could also ask for all paths -
+
+```
+>>> path(X, Y)?
+⇒ path('b', 'c')
+⇒ path('a', 'b')
+⇒ path('c', 'd')
+⇒ path('b', 'd')
+⇒ path('a', 'c')
+⇒ path('a', 'd')
+```
+
+Datalog also supports negation. Within a rule, a tuple prefixed with
+`~` becomes a negative statement. This allows us to express "does not
+exist" relations, or antijoins. Note that this is only possible by
+making the [closed world assumption](https://en.wikipedia.org/wiki/Closed-world_assumption).
+
+Datalog also supports binary equality as a special relation. `=(X,Y)?`
+is a nonsense query on its own because the spaces of `X` and `Y` are
+undefined. However, within a rule body, equality (and negated
+equality statements!) can be quite useful.
+
+For convenience, the Datalog interpreter supports "retracting"
+(deletion) of tuples and rules. `edge(a, b)!` would retract that
+constant tuple, but we cannot retract `path(a, b)!`, as that tuple is
+generated by a rule. We can, however, retract the rules - `path(X, Y)!`
+would remove both path production rules from the database.
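The retraction behavior just described can be sketched in plain Python after the fashion of the shell's `!` handler elsewhere in this diff: constant tuples are removed by equality, and a rule is removed when its head pattern equals the retracted term. This is a simplified, hypothetical model in which a rule is just a `(head, body)` pair:

```python
# Simplified model: a fact is a tuple, a rule is a (head, body) pair.
tuples = [("edge", "a", "b"), ("edge", "b", "c")]
rules = [(("path", "X", "Y"), [("edge", "X", "Y")]),
         (("path", "X", "Y"), [("edge", "X", "Z"), ("path", "Z", "Y")])]

def retract(term, tuples, rules):
    """Remove the constant tuple equal to `term`, and any rule whose
    head pattern equals `term`. Derived tuples cannot be retracted."""
    return ([t for t in tuples if t != term],
            [(head, body) for (head, body) in rules if head != term])

# edge(a, b)! deletes the stored tuple but leaves both rules alone.
tuples, rules = retract(("edge", "a", "b"), tuples, rules)
# path(X, Y)! matches both rule heads, removing both productions.
tuples, rules = retract(("path", "X", "Y"), tuples, rules)
print(len(tuples), len(rules))  # 1 0
```

Note that retracting a derived tuple such as `path(a, b)` is a no-op here, matching the shell's behavior: only stored tuples and rule heads are candidates for deletion.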
+
+The Datalog interpreter also supports reading tuples (and rules) from
+one or more files, each specified with the `--load-db` command line
+argument.
+
+## Usage
+
+`pip install --user arrdem.datalog.shell`
+
+This will install the `datalog` interpreter into your user-local
+python `bin` directory, and pull down the core `arrdem.datalog` engine
+as well.
+
+## Status
+
+This is, to my knowledge, a complete implementation of a traditional datalog.
+
+Support is included for binary `=` as a builtin relation, and for negated terms
+in rules (prefixed with `~`).
+
+Rules, and the recursive evaluation of rules, are supported with some guards to
+prevent infinite recursion.
+
+The interactive interpreter supports definitions (terms ending in `.`),
+retractions (terms ending in `!`) and queries (terms ending in `?`); see the
+interpreter's `help` response for more details.
+
+### Limitations
+
+Recursion may have some completeness bugs. I have not yet encountered any, but I
+also don't have a strong proof of correctness for the recursive evaluation of
+rules.
+
+The current implementation of negated clauses CANNOT propagate positive
+information. This means that negated clauses can only be used in conjunction
+with positive clauses. It's not clear if this is an essential limitation.
+
+There is as yet no query planner - not even segmenting rules and tuples by
+relation to restrict evaluation. This means that the complexity of a query is
+`O(dataset * term count)`, which is clearly less than ideal.
+
+## License
+
+Mirrored from https://git.arrdem.com/arrdem/datalog-py
+
+Published under the MIT license. 
See [LICENSE.md](LICENSE.md) diff --git a/projects/datalog-shell/__main__.py b/projects/datalog-shell/__main__.py new file mode 100755 index 0000000..7d7e453 --- /dev/null +++ b/projects/datalog-shell/__main__.py @@ -0,0 +1,263 @@ +#!/usr/bin/env python3 + +__doc__ = f""" +Datalog (py) +============ + +An interactive datalog interpreter with commands and persistence + +Commands +~~~~~~~~ + .help (this message) + .all display all tuples + .quit to exit the REPL + +To exit, use control-c or control-d + +The interpreter +~~~~~~~~~~~~~~~ + +The interpreter reads one line at a time from stdin. +Lines are either + - definitions (ending in .), + - queries (ending in ?) + - retractions (ending in !) + +A definition may contain arbitrarily many datalog tuples and rules. + + edge(a, b). edge(b, c). % A pair of definitions + ⇒ edge(a, b). % The REPL's response that it has been committed + ⇒ edge(b, c). + +A query may contain definitions, but they exist only for the duration of the query. + + edge(X, Y)? % A query which will enumerate all 2-edges + ⇒ edge(a, b). + ⇒ edge(b, c). + + edge(c, d). edge(X, Y)? % A query with a local tuple + ⇒ edge(a, b). + ⇒ edge(b, c). + ⇒ edge(c, d). + +A retraction may contain only one tuple or clause, which will be expunged. + + edge(a, b)! % This tuple is in our dataset + ⇒ edge(a, b) % So deletion succeeds + + edge(a, b)! 
% This tuple is no longer in our dataset
+  ⇒ Ø         % So deletion fails
+
+"""
+
+import argparse
+import logging
+import sys
+
+from datalog.debris import Timing
+from datalog.evaluator import select
+from datalog.reader import pr_str, read_command, read_dataset
+from datalog.types import (
+    CachedDataset,
+    Constant,
+    Dataset,
+    LVar,
+    PartlyIndexedDataset,
+    Rule,
+    TableIndexedDataset
+)
+
+from prompt_toolkit import print_formatted_text, prompt, PromptSession
+from prompt_toolkit.formatted_text import FormattedText
+from prompt_toolkit.history import FileHistory
+from prompt_toolkit.styles import Style
+from yaspin import Spinner, yaspin
+
+
+STYLE = Style.from_dict({
+    # User input (default text).
+    "": "",
+    "prompt": "ansigreen",
+    "time": "ansiyellow"
+})
+
+SPINNER = Spinner(["|", "/", "-", "\\"], 200)
+
+
+class InterpreterInterrupt(Exception):
+    """An exception used to break the prompt or evaluation."""
+
+
+def print_(fmt, **kwargs):
+    print_formatted_text(FormattedText(fmt), **kwargs)
+
+
+def print_db(db):
+    """Render a database for debugging."""
+
+    for e in db.tuples():
+        print(f"⇒ {pr_str(e)}")
+
+    for r in db.rules():
+        print(f"⇒ {pr_str(r)}")
+
+
+def main(args):
+    """REPL entry point."""
+
+    if args.db_cls == "simple":
+        db_cls = Dataset
+    elif args.db_cls == "cached":
+        db_cls = CachedDataset
+    elif args.db_cls == "table":
+        db_cls = TableIndexedDataset
+    elif args.db_cls == "partly":
+        db_cls = PartlyIndexedDataset
+
+    print(f"Using dataset type {db_cls}")
+
+    session = PromptSession(history=FileHistory(".datalog.history"))
+    db = db_cls([], [])
+
+    if args.dbs:
+        for db_file in args.dbs:
+            try:
+                with open(db_file, "r") as f:
+                    db = db.merge(read_dataset(f.read()))
+                    print(f"Loaded {db_file} ...")
+            except Exception as e:
+                print(f"Internal error - {e}")
+                print(f"Unable to load db {db_file}, skipping")
+
+    while True:
+        try:
+            line = session.prompt([("class:prompt", ">>> ")], style=STYLE)
+        except (InterpreterInterrupt, KeyboardInterrupt):
+            
continue + except EOFError: + break + + if line == ".all": + op = ".all" + elif line == ".dbg": + op = ".dbg" + elif line == ".quit": + break + + elif line in {".help", "help", "?", "??", "???"}: + print(__doc__) + continue + + elif line.split(" ")[0] == ".log": + op = ".log" + + else: + try: + op, val = read_command(line) + except Exception as e: + print(f"Got an unknown command or syntax error, can't tell which") + continue + + # Definition merges on the DB + if op == ".all": + print_db(db) + + # .dbg drops to a debugger shell so you can poke at the instance objects (database) + elif op == ".dbg": + import pdb + pdb.set_trace() + + # .log sets the log level - badly + elif op == ".log": + level = line.split(" ")[1].upper() + try: + ch.setLevel(getattr(logging, level)) + except BaseException: + print(f"Unknown log level {level}") + + elif op == ".": + # FIXME (arrdem 2019-06-15): + # Syntax rules the parser doesn't impose... + try: + for rule in val.rules(): + assert not rule.free_vars, f"Rule contains free variables {rule.free_vars!r}" + + for tuple in val.tuples(): + assert not any(isinstance(e, LVar) for e in tuple), f"Tuples cannot contain lvars - {tuple!r}" + + except BaseException as e: + print(f"Error: {e}") + continue + + db = db.merge(val) + print_db(val) + + # Queries execute - note that rules as queries have to be temporarily merged. + elif op == "?": + # In order to support ad-hoc rules (joins), we have to generate a transient "query" database + # by bolting the rule on as an overlay to the existing database. If of course we have a join. + # + # `val` was previously assumed to be the query pattern. Introduce `qdb`, now used as the + # database to query and "fix" `val` to be the temporary rule's pattern. + # + # We use a new db and db local so that the ephemeral rule doesn't persist unless the user + # later `.` defines it. + # + # Unfortunately doing this merge does nuke caches. 
qdb = db
+            if isinstance(val, Rule):
+                qdb = db.merge(db_cls([], [val]))
+                val = val.pattern
+
+            with yaspin(SPINNER) as spinner:
+                with Timing() as t:
+                    try:
+                        results = list(select(qdb, val))
+                    except KeyboardInterrupt:
+                        print(f"Evaluation aborted after {t}")
+                        continue
+
+            # It's kinda bogus to move sorting out but oh well
+            results.sort()
+
+            for _results, _bindings in results:
+                _result = _results[0]  # select only selects one tuple at a time
+                print(f"⇒ {pr_str(_result)}")
+
+            # So we can report empty sets explicitly.
+            if not results:
+                print("⇒ Ø")
+
+            print_([("class:time", f"Elapsed time - {t}")], style=STYLE)
+
+        # Retractions try to delete, but may fail.
+        elif op == "!":
+            if val in db.tuples() or val in [r.pattern for r in db.rules()]:
+                db = db_cls([u for u in db.tuples() if u != val],
+                            [r for r in db.rules() if r.pattern != val])
+                print(f"⇒ {pr_str(val)}")
+            else:
+                print("⇒ Ø")
+
+
+parser = argparse.ArgumentParser()
+
+# Select which dataset type to use
+parser.add_argument("--db-type",
+                    choices=["simple", "cached", "table", "partly"],
+                    help="Choose which DB to use (default partly)",
+                    dest="db_cls",
+                    default="partly")
+
+parser.add_argument("--load-db", dest="dbs", action="append",
+                    help="Datalog files to load first.")
+
+if __name__ == "__main__":
+    args = parser.parse_args(sys.argv[1:])
+    logger = logging.getLogger("arrdem.datalog")
+    ch = logging.StreamHandler()
+    ch.setLevel(logging.INFO)
+    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
+    ch.setFormatter(formatter)
+    logger.addHandler(ch)
+    main(args)
diff --git a/projects/datalog-shell/setup.py b/projects/datalog-shell/setup.py
new file mode 100644
index 0000000..2ef284a
--- /dev/null
+++ b/projects/datalog-shell/setup.py
@@ -0,0 +1,35 @@
+from setuptools import setup
+
+
+setup(
+    name="arrdem.datalog.shell",
+    # Package metadata
+    version="0.0.2",
+    license="MIT",
+    description="A shell for my datalog engine",
+    
long_description=open("README.md").read(), + long_description_content_type="text/markdown", + author="Reid 'arrdem' McKenzie", + author_email="me@arrdem.com", + url="https://git.arrdem.com/arrdem/datalog-shell", + classifiers=[ + "License :: OSI Approved :: MIT License", + "Development Status :: 3 - Alpha", + "Intended Audience :: Developers", + "Topic :: Database", + "Topic :: Database :: Database Engines/Servers", + "Topic :: Database :: Front-Ends", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.6", + "Programming Language :: Python :: 3.7", + ], + + scripts=[ + "bin/datalog" + ], + install_requires=[ + "arrdem.datalog~=2.0.0", + "prompt_toolkit==2.0.9", + "yaspin==0.14.3", + ], +) diff --git a/tools/python/requirements.txt b/tools/python/requirements.txt index 41b862a..04401b0 100644 --- a/tools/python/requirements.txt +++ b/tools/python/requirements.txt @@ -8,20 +8,27 @@ autoflake==1.4 Babel==2.9.0 beautifulsoup4==4.9.3 black==20.8b1 +bleach==3.3.0 certifi==2020.12.5 +cffi==1.14.5 chardet==4.0.0 click==7.1.2 +colorama==0.4.4 commonmark==0.9.1 coverage==5.5 +cryptography==3.4.7 docutils==0.17 idna==2.10 imagesize==1.2.0 +importlib-metadata==4.0.1 iniconfig==1.1.1 isodate==0.6.0 isort==5.8.0 jedi==0.18.0 +jeepney==0.6.0 Jinja2==2.11.3 jsonschema==3.2.0 +keyring==23.0.1 livereload==2.6.3 lxml==4.6.3 m2r==0.2.1 @@ -35,10 +42,12 @@ openapi-spec-validator==0.3.0 packaging==20.9 parso==0.8.2 pathspec==0.8.1 +pkginfo==1.7.0 pluggy==0.13.1 prompt-toolkit==3.0.18 pudb==2020.1 py==1.10.0 +pycparser==2.20 pyflakes==2.3.1 Pygments==2.8.1 pyparsing==2.4.7 @@ -48,10 +57,14 @@ pytest-cov==2.11.1 pytest-pudb==0.7.0 pytz==2021.1 PyYAML==5.4.1 +readme-renderer==29.0 recommonmark==0.7.1 redis==3.5.3 regex==2021.4.4 requests==2.25.1 +requests-toolbelt==0.9.1 +rfc3986==1.5.0 +SecretStorage==3.3.1 six==1.15.0 snowballstemmer==2.1.0 soupsieve==2.2.1 @@ -67,6 +80,8 @@ sphinxcontrib-qthelp==1.0.3 sphinxcontrib-serializinghtml==1.1.4 toml==0.10.2 
tornado==6.1 +tqdm==4.60.0 +twine==3.4.1 typed-ast==1.4.2 typing-extensions==3.7.4.3 unify==0.5 @@ -74,5 +89,8 @@ untokenize==0.1.1 urllib3==1.26.4 urwid==2.1.2 wcwidth==0.2.5 +webencodings==0.5.1 yamllint==1.26.1 yarl==1.6.3 +yaspin==1.5.0 +zipp==3.4.1