9/10/2015 1 Text Processing With Boost Or, "Anything you can do, I can do better"

Download 9/10/2015 1 Text Processing With Boost Or,

Post on 27-Dec-2015

215 views

Category:

Documents

0 download

TRANSCRIPT

Slide 1 9/10/2015 1 Text Processing With Boost Or, "Anything you can do, I can do better" Slide 2 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 2 Talk Overview Goal: Become productive C++ string manipulators with the help of Boost. 1. The Simple Stuff Boost.Lexical_cast Boost.String_algo 2. The Interesting Stuff Boost.Regex Boost.Spirit Boost.Xpressive Slide 3 9/10/2015 3 Part 1: The Simple Stuff Utilities for Ad Hoc Text Manipulation Slide 4 >> int('123') 123 >>> str(123) '123' C++: int i = atoi("123");"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 4 A Legacy of Inadequacy Python: >>> int('123') 123 >>> str(123) '123' C++: int i = atoi("123"); char buff[10]; itoa(123, buff, 10); No error handling! Complicated interface Not actually standard! Slide 5 > i; // OK, i == 789 }"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 5 Stringstream: A Better atoi() { std::stringstream sout; std::string str; sout > str; // OK, str == "123" } { std::stringstream sout; int i; sout > i; // OK, i == 789 } Slide 6 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 6 Boost.Lexical_cast // Approximate implementation... template Target lexical_cast(Source const & arg) { std::stringstream sout; Target result; if(!(sout > result)) throw bad_lexical_cast( typeid(Source), typeid(Target)); return result; } Kevlin Henney Slide 7 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 7 Boost.Lexical_cast int i = lexical_cast ( "123" ); std::string str = lexical_cast ( 789 ); Ugly name Sub-par performance No i18n Clean Interface Error Reporting, Yay! Extensible Slide 8 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 8 Boost.String_algo Extension to std:: algorithms Includes algorithms for: trimming case-conversions find/replace utilities ... and much more! Pavol Droba Slide 9 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 28 Iterator: regex_token_iterator Tokenizes input according to pattern Example: scan HTML for email addresses std::string html = ; regex mailto(" ", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; std::ostream_iterator out(std::cout, "\n"); std::copy(begin, end, out); // write all email addresses to std::cout std::string html = ; regex mailto(" ", regex_constants::icase); sregex_token_iterator begin(html.begin(), html.end(), mailto, 1), end; using namespace boost::lambda; std::for_each(begin, end, std::cout > expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term));"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 34 group ::= '(' expr ')' fact ::= integer | group; term ::= fact (('*' fact) | ('/' fact))* expr ::= term (('+' term) | ('-' term))* Infix Calculator Grammar In Boost.Spirit spirit::rule group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term)); Slide 35 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 35 Balanced, Nested Braces In Extended Backus-Naur Form: In Boost.Spirit: braces ::= '{' ( not_brace+ | braces )* '}' not_brace ::= not '{' or '}' rule braces, not_brace; braces = '{' >> *( +not_brace | braces ) >> '}'; not_brace = ~chset_p("{}"); Slide 36 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 36 Spirit Parser Primitives SyntaxMeaning ch_p('X')Match literal character 'X' range_p('a','z')Match characters in the range 'a' through 'z' str_p("hello")Match the literal string "hello" chseq_p("ABCD")Like str_p, but ignores whitespace anychar_pMatches any single character chset_p("1234")Matches any of '1', '2', '3', or '4' eol_pMatches end-of-line (CR/LF and combinations) end_pMatches end of input nothing_pMatches nothing, always fails Slide 37 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 37 Spirit Parser Operations SyntaxMeaning x >> yMatch x followed by y x | yMatch x or y ~xMatch any char not x (x is a single-char parser) x - yDifference: match x but not y *xMatch x zero or more times +xMatch x one or more times !xx is optional x[ f ]Semantic action: invoke f when x matches Slide 38 r2 = ch_p('a') >> 'b';// OK! rule r3 = + str_p("hello");// OK!"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 38 Static DSEL Gotcha! Operator overloading woes At least one operand must be a parser! rule r1 = anychar_p >> 'b'; // OK rule r2 = 'a' >> 'b';// Oops! rule r3 = + "hello";// Oops! rule r2 = ch_p('a') >> 'b';// OK! rule r3 = + str_p("hello");// OK! Slide 39 storable = not_brace.copy();"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 39 Storing rules spirit::rule polymorphic rule holder Caveat: spirit::rule can be tricky Assignment operator has unusual semantics. Cannot put them in std containers! spirit::stored_rule for value semantics spirit::rule not_brace = ~chset_p("{}"); spirit::stored_rule storable = not_brace.copy(); Slide 40 struct."> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 40 Algorithm: spirit::parse #include using namespace boost; int main() { spirit::rule group, fact, term, expr; group = '(' >> expr >> ')'; fact = spirit::int_p | group; term = fact >> *(('*' >> fact) | ('/' >> fact)); expr = term >> *(('+' >> term) | ('-' >> term)); assert( spirit::parse("2*(3+4)", expr).full ); assert( ! spirit::parse("2*(3+4", expr).full ); } Parse strings as an expr (start symbol = expr ). spirit::parse returns a spirit::parse_info struct. Slide 41 struct."> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 41 Algorithm: spirit::parse spirit::rule group, fact, term, expr; //... expr = term >> *(('+' >> term) | ('-' >> term)); spirit::parse_info info = spirit::parse("1+3", expr); if(info.full) { std::cout copyright 2006 David Abrahams, Eric Niebler 9/10/2015 42 Algorithm: spirit::parse spirit::rule group, fact, term, expr; //... spirit::parse_info info = spirit::parse("1 + 3", expr, spirit::space_p); if(info.full) { std::cout copyright 2006 David Abrahams, Eric Niebler 9/10/2015 43 Spirit Grammars Group rules together Decouple tokenizing/lexing/scanning struct calculator : spirit::grammar { template struct definition { spirit::rule group, fact, term, expr; definition(calculator const & self) { //... expr = term >> *(('+' >> term) | ('-' >> term)); } rule const & start() const { return expr; } }; Slide 44 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 44 Algorithm: spirit::parse struct calculator : spirit::grammar { //... }; // This works! spirit::parse("1+3", calculator()); // This works, too! spirit::parse("1 +3", calculator(), spirit::space_p); Infix Calculator, revolutions Slide 45 > (*alpha_p)[&write] >> '}'); Match alphabetic characters, call write() with range of characters that matched."> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 45 Semantic Actions Action to take when part of your grammar succeeds void write(char const *begin, char const *end) { std::cout.write(begin, end begin); } // This prints "hi" to std::cout spirit::parse("{hi}", '{' >> (*alpha_p)[&write] >> '}'); Match alphabetic characters, call write() with range of characters that matched. Slide 46 > int_p[&write] >> ')'); int_p "returns" an int. using namespace lambda; // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[cout > ')'); We can use a Boost.Lambda expression as a semantic action!"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 46 Semantic Actions A few parsers process input first void write(int d) { std::cout > int_p[&write] >> ')'); int_p "returns" an int. using namespace lambda; // This prints "42" to std::cout spirit::parse("(42)", '(' >> int_p[cout > ')'); We can use a Boost.Lambda expression as a semantic action! Slide 47 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 47 Closures Data associated with a rule. struct calc_closure : spirit::closure { member1 val; }; rule expr; A calc_closure stack frame contains a double named val. When parsing rule expr, its calc_closure stack frame is accessed as members of expr. Slide 48 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 48 Closures and Phoenix Phoenix: the next version of Lambda struct calculator : spirit::grammar { template struct definition { spirit::rule group, fact, term, expr; spirit::rule top; definition(calculator const & self) { using namespace phoenix; top = expr[ self.val = arg1 ]; //... expr = term[ expr.val = arg1 ] >> *(('+' >> term[ expr.val += arg1 ]) | ('-' >> term[ expr.val -= arg1 ])); } spirit::rule const & start() const { return top; } }; struct calculator : spirit::grammar { template struct definition { spirit::rule group, fact, term, expr; spirit::rule top; definition(calculator const & self) { using namespace phoenix; top = expr[ self.val = arg1 ]; //... expr = term[ expr.val = arg1 ] >> *(('+' >> term[ expr.val += arg1 ]) | ('-' >> term[ expr.val -= arg1 ])); } spirit::rule const & start() const { return top; } }; Slide 49 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 49 A Calculator that calculates! using namespace phoenix; calculator calc; std::string str; while (std::getline(std::cin, str)) { double n = 0; if (spirit::parse(str.c_str(), calc[ var(n) = arg1 ]).full) { std::cout > pat >> eow; sregex braces, not_brace; not_brace = ~(set= '{', '}'); braces = '{' >> *(+not_brace | by_ref(braces)) >> '}'; Embed regexes by reference, too"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 52 Xpressive: A Mixed-Mode DSEL Mix-n-match static and dynamic regex // Get a pattern from the user at runtime: std::string str = get_pattern(); sregex pat = sregex::compile( str ); // Wrap the regex in begin- and end-word assertions: pat = bow >> pat >> eow; sregex braces, not_brace; not_brace = ~(set= '{', '}'); braces = '{' >> *(+not_brace | by_ref(braces)) >> '}'; Embed regexes by reference, too Slide 53 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 53 Sizing It All Up RegexSpiritXpr Ad-hoc pattern matching, regular languages Structured parsing, context-free grammars Manipulating text Semantic actions, manipulating program state Dynamic; new statements at runtime Static; no new statements at runtime Exhaustive backtracking semantics Blessed by TR1 Slide 54 9/10/2015 54 Appendix: Boost and Unicode Future Directions Slide 55 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 55 Wouldn't it be nice... Hmm... where, oh where, is Boost.Unicode? Slide 56 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 56 UTF-8 Conversion Facet Converts UTF-8 input to UCS-4 For use with std::locale Implementation detail! But useful nonetheless Slide 57 > str; assert( str == L"" ); // OOPS! :-( std::wifstream good("C:\\utf8.txt"); good.imbue(std::locale(std::locale(), new utf8_codecvt_facet)); good >> str; assert( str == L"" ); // SUCCESS!! :-) }"> copyright 2006 David Abrahams, Eric Niebler 9/10/2015 57 UTF-8 Conversion Facet #define BOOST_UTF8_BEGIN_NAMESPACE #define BOOST_UTF8_END_NAMESPACE #define BOOST_UTF8_DECL #include int main() { std::wstring str; std::wifstream bad("C:\\utf8.txt"); bad >> str; assert( str == L"" ); // OOPS! :-( std::wifstream good("C:\\utf8.txt"); good.imbue(std::locale(std::locale(), new utf8_codecvt_facet)); good >> str; assert( str == L"" ); // SUCCESS!! :-) } Slide 58 copyright 2006 David Abrahams, Eric Niebler 9/10/2015 58 Thanks! Boost: http://boost.orghttp://boost.org BoostCon: http://www.boostcon.orghttp://www.boostcon.org Questions?