Import datalog-shell

Reid 'arrdem' McKenzie 2021-05-14 23:47:05 -06:00
parent 318f7caa6a
commit 633060910c
6 changed files with 522 additions and 0 deletions

View file

@@ -0,0 +1,9 @@
py_binary(
    name = "datalog-shell",
    main = "__main__.py",
    deps = [
        "//projects/datalog",
        py_requirement("prompt_toolkit"),
        py_requirement("yaspin"),
    ],
)

View file

@@ -0,0 +1,18 @@
.PHONY: deploy test

deploy: .dev
	source .dev/bin/activate; pip install twine; rm -r dist; python setup.py sdist; twine upload dist/*;

.dev:
	virtualenv --python=`which python3` .dev
	source .dev/bin/activate; pip install pytest; python setup.py develop

node_modules/canopy:
	npm install canopy

src/datalog/parser.py: node_modules/canopy src/datalog.peg
	node_modules/canopy/bin/canopy --lang=python src/datalog.peg
	mv src/datalog.py src/datalog/parser.py

test: .dev $(wildcard src/**/*) $(wildcard test/**/*)
	source .dev/bin/activate; PYTHONPATH=".:src/" pytest -vv

View file

@@ -0,0 +1,179 @@
# Datalog.Shell
A shell for my Datalog engine.
## What is Datalog?
[Datalog](https://en.wikipedia.org/wiki/Datalog) is a fully
declarative language for expressing relational data and queries,
typically written using a syntactic subset of Prolog. Its most
interesting feature, compared to other relational languages such as
SQL, is its support for production rules.

Briefly, a Datalog database consists of rules and tuples. Tuples are
written `a(b, "c", 126, ...).`; they require no declaration (e.g. of a
table) and may be of arbitrary, even varying, length. The elements of
a tuple are strings, which may be written as bare words or quoted.

In the interpreter (or a file), we could define a small graph as
follows -
```
$ datalog
>>> edge(a, b).
⇒ edge('a', 'b')
>>> edge(b, c).
⇒ edge('b', 'c')
>>> edge(c, d).
⇒ edge('c', 'd')
```
But how can we query this? We can issue queries by entering a tuple
terminated with `?` instead of `.`.

For instance, we could query whether some tuples exist in the database -
```
>>> edge(a, b)?
⇒ edge('a', 'b')
>>> edge(d, f)?
⇒ Ø
>>>
```
We did define `edge(a, b).`, so our query returns that tuple. However,
the tuple `edge(d, f).` was never defined, so our query produces no
results. Rather than printing nothing, the `Ø` symbol, which denotes
the empty set, is printed for clarity.

This is correct, but uninteresting. How can we find, say, all the
edges from `a`? We don't have a construct like wildcards with which to
match anything - yet.

Enter logic variables. Logic variables are capitalized words - `X`,
`Foo` and the like - which are interpreted as wildcards by the query
engine. Capitalized words are always understood as logic variables.
```
>>> edge(a, X)?
⇒ edge('a', 'b')
```
However, unlike wildcards, which simply match anything, logic
variables are unified within a query. Were we to write `edge(X, X)?`,
we would be asking for the set of `edge` tuples whose two elements are
equal.
```
>>> edge(X, X)?
⇒ Ø
```
Of which we have none.
But what if we wanted to find paths through our graph? Say, to check
whether a path exists from `a` to `d`. We'd need a way to unify many
logic variables together - and so far we've only seen queries of a
single tuple.

Enter rules. Rules are productions by which the Datalog engine can
derive new tuples. A rule is written as a tuple "pattern" (its head),
which may contain constants or logic variables, followed by the `:-`
operator and a body of "clauses" separated by commas.

Rules are perhaps best understood as subqueries. A rule defines an
indefinite set of tuples such that, over that set, the body's clauses
are simultaneously satisfied. This is how we achieve complex queries.

There is no alternation ("or") operator within a rule's body; however,
several rules may share the same tuple "pattern", which has the same
effect as alternation.

So if we wanted to, say, find paths through our graph, we could do so
using two rules: one which defines a "simple" one-edge path, and one
which defines a path from `X` to `Y` recursively, by querying for an
edge from `X` to an unconstrained `Z` and then unifying that with
`path(Z, Y)`.
```
>>> path(X, Y) :- edge(X, Y).
⇒ path('X', 'Y') :- edge('X', 'Y').
>>> path(X, Y) :- edge(X, Z), path(Z, Y).
⇒ path('X', 'Y') :- edge('X', 'Z'), path('Z', 'Y').
>>> path(a, X)?
⇒ path('a', 'b')
⇒ path('a', 'c')
⇒ path('a', 'd')
```
We could also ask for all paths -
```
>>> path(X, Y)?
⇒ path('b', 'c')
⇒ path('a', 'b')
⇒ path('c', 'd')
⇒ path('b', 'd')
⇒ path('a', 'c')
⇒ path('a', 'd')
```
Datalog also supports negation. Within a rule, a tuple prefixed with
`~` becomes a negative statement. This allows us to express "does not
exist" relations, or antijoins. Note that this is only possible by
making the [closed world assumption](https://en.wikipedia.org/wiki/Closed-world_assumption).
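For example - a sketch, assuming we also record a `node` tuple for
each vertex (the `node` and `unlinked` relations here are
illustrative, and the echoed output is approximate) -

```
>>> node(a).
⇒ node('a')
>>> node(b).
⇒ node('b')
>>> unlinked(X, Y) :- node(X), node(Y), ~edge(X, Y).
⇒ unlinked('X', 'Y') :- node('X'), node('Y'), ~edge('X', 'Y').
>>> unlinked(b, a)?
⇒ unlinked('b', 'a')
```

Because `edge(b, a).` was never defined, the closed world assumption
lets the negated clause succeed for that pair.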
Datalog also supports binary equality as a special relation. `=(X, Y)?`
is a nonsense query on its own, because the spaces of `X` and `Y` are
undefined. However, within a rule body, equality (and negated
equality!) clauses can be quite useful.
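For instance - a sketch, assuming negated equality is written by
prefixing the `=` tuple with `~` like any other clause - we could
relate distinct children of a common parent:

```
>>> sibling(X, Y) :- edge(P, X), edge(P, Y), ~=(X, Y).
```

The positive `edge` clauses bind `X` and `Y`, and the negated equality
then discards the trivial pairs where `X` and `Y` are equal.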
For convenience, the Datalog interpreter supports "retracting"
(deleting) tuples and rules. `edge(a, b)!` would retract that constant
tuple, but we cannot retract `path(a, b)!`, as that tuple is generated
by a rule. We can, however, retract the rule itself - `path(X, Y)!` -
which would remove both path production rules from the database.
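For example, with the database above (output approximate) -

```
>>> edge(a, b)!
⇒ edge('a', 'b')
>>> edge(a, b)!
⇒ Ø
```

The first retraction succeeds and echoes the removed tuple; the second
finds nothing left to remove, so it reports the empty set.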
The Datalog interpreter also supports reading tuples (and rules) from
one or more files, each specified by the `--db <filename>` command
line argument.
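For example, if the `edge` tuples above were saved to a file (the
filename here is illustrative), we could start the shell with that
database preloaded -

```
$ datalog --db graph.dtl
>>> edge(a, X)?
⇒ edge('a', 'b')
```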
## Usage
`pip install --user arrdem.datalog.shell`
This will install the `datalog` interpreter into your user-local
Python `bin` directory, and pull down the core `arrdem.datalog` engine
as well.
## Status
This is, to my knowledge, a complete implementation of a traditional
Datalog.

Support is included for binary `=` as a builtin relation, and for
negated terms in rules (prefixed with `~`).

Rules, and the recursive evaluation of rules, are supported, with some
guards to prevent infinite recursion.

The interactive interpreter supports definitions (terms ending in `.`),
retractions (terms ending in `!`) and queries (terms ending in `?`);
see the interpreter's `help` response for more details.
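For example, the `.all` command dumps the current database - a sketch,
with output approximate -

```
>>> .all
edge('a', 'b')
path('X', 'Y') :- edge('X', 'Y').
```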
### Limitations
Recursion may have some completeness bugs. I have not yet encountered
any, but I also don't yet have a strong proof of correctness for the
recursive evaluation of rules.
The current implementation of negated clauses CANNOT propagate positive
information. This means that negated clauses can only be used in conjunction
with positive clauses. It's not clear if this is an essential limitation.
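For example, a rule like `lonely(X) :- ~edge(a, X).` (illustrative)
cannot be evaluated on its own, because no positive clause produces
bindings for `X`; pairing the negated clause with a positive one such
as `node(X)` gives the engine a set of candidates to filter.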
There is as yet no query planner - not even segmentation of rules and
tuples by relation to restrict evaluation. This means that the
complexity of a query is `O(dataset size * term count)`, which is
clearly less than ideal.
## License
Mirrored from https://git.arrdem.com/arrdem/datalog-py
Published under the MIT license. See [LICENSE.md](LICENSE.md)

View file

@@ -0,0 +1,263 @@
#!/usr/bin/env python3
__doc__ = f"""
Datalog (py)
============
An interactive datalog interpreter with commands and persistence
Commands
~~~~~~~~
.help (this message)
.all display all tuples
.quit to exit the REPL
To exit, use control-c or control-d
The interpreter
~~~~~~~~~~~~~~~
The interpreter reads one line at a time from stdin.
Lines are either
- definitions (ending in .),
- queries (ending in ?)
- retractions (ending in !)
A definition may contain arbitrarily many datalog tuples and rules.
edge(a, b). edge(b, c). % A pair of definitions
edge(a, b). % The REPL's response that it has been committed
edge(b, c).
A query may contain definitions, but they exist only for the duration of the query.
edge(X, Y)? % A query which will enumerate all 2-edges
edge(a, b).
edge(b, c).
edge(c, d). edge(X, Y)? % A query with a local tuple
edge(a, b).
edge(b, c).
edge(c, d).
A retraction may contain only one tuple or clause, which will be expunged.
edge(a, b)! % This tuple is in our dataset
edge(a, b) % So deletion succeeds
edge(a, b)! % This tuple is no longer in our dataset
Ø % So deletion fails
"""
import argparse
import logging
import sys

from datalog.debris import Timing
from datalog.evaluator import select
from datalog.reader import pr_str, read_command, read_dataset
from datalog.types import (
    CachedDataset,
    Constant,
    Dataset,
    LVar,
    PartlyIndexedDataset,
    Rule,
    TableIndexedDataset,
)

from prompt_toolkit import print_formatted_text, prompt, PromptSession
from prompt_toolkit.formatted_text import FormattedText
from prompt_toolkit.history import FileHistory
from prompt_toolkit.styles import Style

from yaspin import Spinner, yaspin


STYLE = Style.from_dict({
    # User input (default text).
    "": "",
    "prompt": "ansigreen",
    "time": "ansiyellow",
})

SPINNER = Spinner(["|", "/", "-", "\\"], 200)
class InterpreterInterrupt(Exception):
    """An exception used to break the prompt or evaluation."""


def print_(fmt, **kwargs):
    print_formatted_text(FormattedText(fmt), **kwargs)


def print_db(db):
    """Render a database for debugging."""

    for e in db.tuples():
        print(f"{pr_str(e)}")

    for r in db.rules():
        print(f"{pr_str(r)}")
def main(args):
    """REPL entry point."""

    if args.db_cls == "simple":
        db_cls = Dataset
    elif args.db_cls == "cached":
        db_cls = CachedDataset
    elif args.db_cls == "table":
        db_cls = TableIndexedDataset
    elif args.db_cls == "partly":
        db_cls = PartlyIndexedDataset

    print(f"Using dataset type {db_cls}")

    session = PromptSession(history=FileHistory(".datalog.history"))
    db = db_cls([], [])

    if args.dbs:
        for db_file in args.dbs:
            try:
                with open(db_file, "r") as f:
                    db = db.merge(read_dataset(f.read()))
                print(f"Loaded {db_file} ...")
            except Exception as e:
                print(f"Internal error - {e}")
                print(f"Unable to load db {db_file}, skipping")
    while True:
        try:
            line = session.prompt([("class:prompt", ">>> ")], style=STYLE)
        except (InterpreterInterrupt, KeyboardInterrupt):
            continue
        except EOFError:
            break

        # Dispatch REPL meta-commands before trying to parse datalog.
        if line == ".all":
            op = ".all"
        elif line == ".dbg":
            op = ".dbg"
        elif line == ".quit":
            break
        elif line in {".help", "help", "?", "??", "???"}:
            print(__doc__)
            continue
        elif line.split(" ")[0] == ".log":
            op = ".log"
        else:
            try:
                op, val = read_command(line)
            except Exception as e:
                print("Got an unknown command or syntax error, can't tell which")
                continue
        # Definition merges on the DB
        if op == ".all":
            print_db(db)

        # .dbg drops to a debugger shell so you can poke at the instance objects (database)
        elif op == ".dbg":
            import pdb
            pdb.set_trace()

        # .log sets the log level - badly
        elif op == ".log":
            level = line.split(" ")[1].upper()
            try:
                ch.setLevel(getattr(logging, level))
            except BaseException:
                print(f"Unknown log level {level}")

        elif op == ".":
            # FIXME (arrdem 2019-06-15):
            #   Syntax rules the parser doesn't impose...
            try:
                for rule in val.rules():
                    assert not rule.free_vars, f"Rule contains free variables {rule.free_vars!r}"

                for tuple in val.tuples():
                    assert not any(isinstance(e, LVar) for e in tuple), f"Tuples cannot contain lvars - {tuple!r}"
            except BaseException as e:
                print(f"Error: {e}")
                continue

            db = db.merge(val)
            print_db(val)
        # Queries execute - note that rules as queries have to be temporarily merged.
        elif op == "?":
            # In order to support ad-hoc rules (joins), we have to generate a transient "query" database
            # by bolting the rule on as an overlay to the existing database. If of course we have a join.
            #
            # `val` was previously assumed to be the query pattern. Introduce `qdb`, now used as the
            # database to query, and "fix" `val` to be the temporary rule's pattern.
            #
            # We use a new, local db so that the ephemeral rule doesn't persist unless the user
            # later `.` defines it.
            #
            # Unfortunately doing this merge does nuke caches.
            qdb = db
            if isinstance(val, Rule):
                qdb = db.merge(db_cls([], [val]))
                val = val.pattern

            with yaspin(SPINNER) as spinner:
                with Timing() as t:
                    try:
                        results = list(select(qdb, val))
                    except KeyboardInterrupt:
                        print(f"Evaluation aborted after {t}")
                        continue

            # It's kinda bogus to move sorting out but oh well
            results.sort()

            for _results, _bindings in results:
                _result = _results[0]  # select only selects one tuple at a time
                print(f"{pr_str(_result)}")

            # So we can report empty sets explicitly.
            if not results:
                print("⇒ Ø")

            print_([("class:time", f"Elapsed time - {t}")], style=STYLE)
        # Retractions try to delete, but may fail.
        elif op == "!":
            if val in db.tuples() or val in [r.pattern for r in db.rules()]:
                db = db_cls([u for u in db.tuples() if u != val],
                            [r for r in db.rules() if r.pattern != val])
                print(f"{pr_str(val)}")
            else:
                print("⇒ Ø")
parser = argparse.ArgumentParser()

# Select which dataset type to use
parser.add_argument("--db-type",
                    choices=["simple", "cached", "table", "partly"],
                    help="Choose which DB to use (default partly)",
                    dest="db_cls",
                    default="partly")

parser.add_argument("--load-db", dest="dbs", action="append",
                    help="Datalog files to load first.")


if __name__ == "__main__":
    args = parser.parse_args(sys.argv[1:])

    logger = logging.getLogger("arrdem.datalog")

    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)

    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    ch.setFormatter(formatter)

    logger.addHandler(ch)

    main(args)

View file

@@ -0,0 +1,35 @@
from setuptools import setup

setup(
    name="arrdem.datalog.shell",
    # Package metadata
    version="0.0.2",
    license="MIT",
    description="A shell for my datalog engine",
    long_description=open("README.md").read(),
    long_description_content_type="text/markdown",
    author="Reid 'arrdem' McKenzie",
    author_email="me@arrdem.com",
    url="https://git.arrdem.com/arrdem/datalog-shell",
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Topic :: Database",
        "Topic :: Database :: Database Engines/Servers",
        "Topic :: Database :: Front-Ends",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
    ],
    scripts=[
        "bin/datalog",
    ],
    install_requires=[
        "arrdem.datalog~=2.0.0",
        "prompt_toolkit==2.0.9",
        "yaspin==0.14.3",
    ],
)

View file

@@ -8,20 +8,27 @@ autoflake==1.4
Babel==2.9.0
beautifulsoup4==4.9.3
black==20.8b1
bleach==3.3.0
certifi==2020.12.5
cffi==1.14.5
chardet==4.0.0
click==7.1.2
colorama==0.4.4
commonmark==0.9.1
coverage==5.5
cryptography==3.4.7
docutils==0.17
idna==2.10
imagesize==1.2.0
importlib-metadata==4.0.1
iniconfig==1.1.1
isodate==0.6.0
isort==5.8.0
jedi==0.18.0
jeepney==0.6.0
Jinja2==2.11.3
jsonschema==3.2.0
keyring==23.0.1
livereload==2.6.3
lxml==4.6.3
m2r==0.2.1
@@ -35,10 +42,12 @@ openapi-spec-validator==0.3.0
packaging==20.9
parso==0.8.2
pathspec==0.8.1
pkginfo==1.7.0
pluggy==0.13.1
prompt-toolkit==3.0.18
pudb==2020.1
py==1.10.0
pycparser==2.20
pyflakes==2.3.1
Pygments==2.8.1
pyparsing==2.4.7
@@ -48,10 +57,14 @@ pytest-cov==2.11.1
pytest-pudb==0.7.0
pytz==2021.1
PyYAML==5.4.1
readme-renderer==29.0
recommonmark==0.7.1
redis==3.5.3
regex==2021.4.4
requests==2.25.1
requests-toolbelt==0.9.1
rfc3986==1.5.0
SecretStorage==3.3.1
six==1.15.0
snowballstemmer==2.1.0
soupsieve==2.2.1
@@ -67,6 +80,8 @@ sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
toml==0.10.2
tornado==6.1
tqdm==4.60.0
twine==3.4.1
typed-ast==1.4.2
typing-extensions==3.7.4.3
unify==0.5
@@ -74,5 +89,8 @@ untokenize==0.1.1
urllib3==1.26.4
urwid==2.1.2
wcwidth==0.2.5
webencodings==0.5.1
yamllint==1.26.1
yarl==1.6.3
yaspin==1.5.0
zipp==3.4.1