living with-spec
TRANSCRIPT
Parsing with Derivatives: A Functional Pearl
Matthew Might, David Darais (University of Utah)
Daniel Spiewak (University of Wisconsin, Milwaukee)
Abstract
We present a functional approach to parsing unrestricted context-free grammars based on Brzozowski’s derivative of regular expressions. If we consider context-free grammars as recursive regular expressions, Brzozowski’s equational theory extends without modification to context-free grammars (and it generalizes to parser combinators). The supporting actors in this story are three concepts familiar to functional programmers—laziness, memoization and fixed points; these allow Brzozowski’s original equations to be transliterated into purely functional code in about 30 lines spread over three functions.
Yet, this almost impossibly brief implementation has a drawback: its performance is sour—in both theory and practice. The culprit? Each derivative can double the size of a grammar, and with it, the cost of the next derivative.
Fortunately, much of the new structure inflicted by the derivative is either dead on arrival, or it dies after the very next derivative. To eliminate it, we once again exploit laziness and memoization to transliterate an equational theory that prunes such debris into working code. Thanks to this compaction, parsing times become reasonable in practice.
We equip the functional programmer with two equational theories that, when combined, make for an abbreviated understanding and implementation of a system for parsing context-free languages.
Categories and Subject Descriptors: F.4.3 [Formal Languages]: Operations on languages
General Terms: Algorithms, Languages, Theory
Keywords: formal languages, parsing, derivative, regular expressions, context-free grammar, parser combinator
1. Introduction
It is easy to lose sight of the essence of parsing in the minutiae of forbidden grammars, shift-reduce conflicts and opaque action tables. To the extent that understanding in computer science comes from implementation, a deeper appreciation of parsing often seems out of reach. Brzozowski’s derivative upsets this calculus of effort and understanding to make the construction of parsing systems accessible to the common functional programmer.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICFP’11, September 19–21, 2011, Tokyo, Japan.
Copyright © 2011 ACM 978-1-4503-0865-6/11/09...$10.00
The derivative of regular expressions [1], if gently tempered with laziness, memoization and fixed points, acts immediately as a pure, functional technique for generating parse forests from arbitrary context-free grammars. Despite—even because of—its simplicity, the derivative transparently handles ambiguity, left-recursion, right-recursion, ill-founded recursion or any combination thereof.
1.1 Outline
• After a review of formal languages, we introduce Brzozowski’s derivative for regular languages. A brief implementation highlights its rugged elegance.
• As our implementation of the derivative engages context-free languages, non-termination emerges as a problem.
• Three small, surgical modifications to the implementation (but not the theory)—laziness, memoization and fixed points—guarantee termination. Termination means the derivative can recognize arbitrary context-free languages.
• We generalize the derivative to parsers and parser combinators through an equational theory for generating parse forests.
• We find poor performance in both theory and practice. The root cause is vestigial structure left in the grammar by earlier derivatives; this structure is malignant: though it no longer serves a purpose, it still grows in size with each derivative.
• We develop an optimization—compaction—that collapses grammars by excising this mass. Compaction, like the derivative, comes from a clean, equational theory that exploits laziness and memoization in its transliteration to working code.
In this article, we provide code in Racket, but it should adapt readily to any Lisp. All code and test cases within or referenced from this article (plus additional implementations in Haskell and Scala) are available from:
http://www.ucombinator.org/projects/parsing/
2. Preliminary: Formal languages
A language L is a set of strings. A string w is a sequence of characters from an alphabet A. (From the parser’s perspective, a “character” might be a token/terminal.)
Two atomic languages arise often in formal languages: the empty language and the null (or empty-string) language:
• The empty language ∅ contains no strings at all:
∅ = {}.
• The null language ε contains only the length-zero “null” string:
ε = {w} where length(w) = 0.
blog.klipse.tech/clojure/2016/10/02/parsing-with-derivatives-regular.html
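The paper excerpt above mentions Brzozowski's derivative for regular languages and the two atomic languages (empty and null). As a rough illustration only, here is a minimal Python sketch of a derivative-based recognizer for regular expressions. It is not the paper's Racket code, and every name in it (`Empty`, `Eps`, `Char`, `Alt`, `Cat`, `Star`, `nullable`, `derive`, `matches`) is an assumption chosen for this example. It covers regular languages only; the laziness, memoization and fixed points needed for context-free grammars are left out.

```python
from dataclasses import dataclass


class Lang:
    """Base class for regular-language expressions (hypothetical names)."""


@dataclass(frozen=True)
class Empty(Lang):
    """The empty language: contains no strings at all."""


@dataclass(frozen=True)
class Eps(Lang):
    """The null language: contains only the length-zero string."""


@dataclass(frozen=True)
class Char(Lang):
    c: str


@dataclass(frozen=True)
class Alt(Lang):
    left: Lang
    right: Lang


@dataclass(frozen=True)
class Cat(Lang):
    first: Lang
    second: Lang


@dataclass(frozen=True)
class Star(Lang):
    inner: Lang


def nullable(l: Lang) -> bool:
    """True if the language accepts the empty string."""
    if isinstance(l, (Eps, Star)):
        return True
    if isinstance(l, (Empty, Char)):
        return False
    if isinstance(l, Alt):
        return nullable(l.left) or nullable(l.right)
    if isinstance(l, Cat):
        return nullable(l.first) and nullable(l.second)
    raise TypeError(f"unknown language node: {l!r}")


def derive(l: Lang, c: str) -> Lang:
    """Derivative of l with respect to character c:
    the strings w such that c + w is in l."""
    if isinstance(l, (Empty, Eps)):
        return Empty()
    if isinstance(l, Char):
        return Eps() if l.c == c else Empty()
    if isinstance(l, Alt):
        return Alt(derive(l.left, c), derive(l.right, c))
    if isinstance(l, Cat):
        d = Cat(derive(l.first, c), l.second)
        # If the first part can match "", c may also start the second part.
        return Alt(d, derive(l.second, c)) if nullable(l.first) else d
    if isinstance(l, Star):
        return Cat(derive(l.inner, c), l)
    raise TypeError(f"unknown language node: {l!r}")


def matches(l: Lang, s: str) -> bool:
    """Derive by each character in turn, then ask whether "" is accepted."""
    for c in s:
        l = derive(l, c)
    return nullable(l)


# Usage: the language (ab)*
ab_star = Star(Cat(Char("a"), Char("b")))
print(matches(ab_star, "abab"), matches(ab_star, "aba"))
```

This sketch performs no simplification, so the expression grows with each derivative. That is the kind of leftover structure the abstract says compaction is meant to prune.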
Writing a spec should enable automatic:
• Validation
• Error reporting
• Destructuring
• Instrumentation
• Test-data generation
• Generative test generation
*http://clojure.org/about/spec
“Composition is about decomposing.”
— E. Normand
The code base
• ~15k loc
• ETL
• Risk-hedging
• Demand-responsive pricing
• Packing & routing
• Internal BI tools
• …
Validation
• API boundaries
• Structured errors
• Fail fast = more context
• nil punning
• Multiple arities
Friendlier error messages
• Pluggable explanations via s/*explain-out*
• Capture common mistakes
• Provide hints
• Limitations
• One explainer function for all
• Don’t know which spec (only what went wrong)
Destructuring
• Pull apart and name
• Patterns
• case (dispatch on tag)
• core.match
• Separate data description and transformation
Data macros
• Recursive transformations into canonical form
• s/conformer
• Do more without code macros
* juxt.pro/blog/posts/data-macros.html
Generative testing
• Limitations
• sequences with internal structure (time series etc.)
• generic higher-order functions (e.g. map)
• Uncovers numerical instabilities
• s/exercise for mocking
Queryable data descriptions:
• s/registry, s/form
• Build a graph
• No inline docs :(
Interact with your type system!
Case study: autogenerating materialised views
[Diagram labels: Kafka, Materialised views, Events, External data]
Automatic view generation
• Event & attribute ontology
• Manual (via spec)
• Inferred
• Statistical analysis (seasonality detection, outlier removal, …)
[Diagram labels: Onyx (three nodes)]
Simple first, then Easy
Going forward
My favourites
• Data macros
• Destructuring
• Queryable data descriptions
• Structured errors
clojure.org/about/spec
matt.might.net/papers/might2011derivatives.pdf
infoq.com/presentations/Mixin-based-Inheritance
realworldclojure.com/the-system-paradigm
juxt.pro/blog/posts/data-macros.html
blog.klipse.tech/clojure/2016/10/02/parsing-with-derivatives-regular.html
indiegogo.com/projects/typed-clojure-clojure-spec-auto-annotations#/
github.com/arohner/spectrum