STRINGS AND AUTOMATA MODULO THEORIES
Margus Veanes
SMT'15, San Francisco, July 18, 2015
MOTIVATION
• Symbolic execution
  – Path feasibility analysis involving string constraints
  – Regular expression matching
• Security vulnerabilities ([OWASP] top 1, 3 culprits)
  – SQL injection attacks
  – XSS attacks
  – DoS attacks
    • e.g. regex injection
  – Directory traversal attacks, e.g.
    http://foo.bar.system/scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir+c:\
  – …
• Data processing
  – Parallelization
  – Deforestation
• Malware detection
“EARLY” WORK RELATED TO STRING ANALYSIS
• Tools
  – Mona: Henriksen-Jensen-Jørgensen-Klarlund-Paige-Rauhe-Sandholm, TACAS’95
    • Built on BRICS automata library
  – JSA: Christensen-Møller-Schwartzbach, SAS’03 (uses BRICS)
  – Haderach: Shannon-Hajra-Lee-Zhan-Khurshid, MUTATION’07 (uses BRICS)
• Theory
  – Bjørner, PhD Thesis’98 (decision procedure for queues)
  – Blumensath-Grädel, LICS’00 (automatic structures)
  – Benedikt-Libkin-Schwentick-Segoufin, LICS’01 (regular string relations)
  – Khoussainov-Nies-Rubin-Stephan, LICS’04 (automatic Boolean algebras)
  – Bala, STACS’04 (regular term matching)
  – Kunc, DLT’07 (complexity of language equations)
THE RISE OF THE STRING ANALYZERS
• String theory encodings in SMT:
  – Pex-LL: Bjørner-Tillmann-Voronkov, TACAS’09 (strings + SMT)
  – Reggae: Li-Xie-Tillmann-deHalleux-Schulte, ASE’09 (symbolic exploration of regex code)
  – Z3-str: Zheng-Zhang-Ganesh, ESEC/FSE 2013 (plugin to Z3)
  – CVC4-str: Liang-Reynolds-Tinelli-Barrett-Deters, CAV’14 (DPLL(TSLRp))
  – S3: Trinh-Chu-Jaffar, CCS’14 (uses Z3-str-star)
• Automata related:
  – Stranger: Yu-Alkhalaf-Bultan-Ibarra-Cova, SPIN’08, TACAS’09, TACAS’10 (automata based)
  – DPRLE: Hooimeijer-Weimer, PLDI’09 (subset checking)
  – Hampi: Kiezun-Ganesh-Guo-Hooimeijer-Ernst, ISSTA’09 (best paper award) (reduction to BV)
  – Kaluza (in Kudzu): Saxena-Akhawe-Hanna-Mao-McCamant-Song, Oakland’10 (Hampi + mult. var.)
  – Rex: Veanes-deHalleux-Tillmann-Bjørner-deMoura, ICST’10, LPAR’10 (language acceptors)
  – Bek: Hooimeijer-Livshits-Molnar-Saxena-Veanes-Bjørner, USENIX Security’11, POPL’12 (transducers)
  – Bex: D’Antoni-Veanes, VMCAI’13, CAV’13 (lookahead)
  – PASS: Li-Ghosh, HVC 2013 (best paper award) (array based)
  – SMC: Luu-Shinde-Saxena-Demsky, PLDI’14 (model counting)
• CAV’15:
  – ABC: Aydin-Bang-Bultan (automata based counting, using Stranger and BRICS)
  – NORN: Abdulla-Atig-Chen-Holik-Rezine-Rümmer-Stenman, also CAV’14 (Horn clauses, BRICS)
  – Z3-str+: Zheng-Ganesh-Subramanian-Tripp-Dolby-Zhang (string + regex + length)
TWO QUESTIONS
• What are characters?
• What are strings?
smileycipher(“hello world”) = “😧😤😫😫😮 😶😮😱😫😣”
Is this a string function?
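One way to read the example is as a character-by-character map on code points. The sketch below is a guess reconstructed from the single input/output pair on the slide: it assumes ASCII letters are shifted by a fixed offset (so that 'h' lands on U+1F627) and that all other characters are left unchanged. The name `smiley_cipher` and the offset are assumptions, not the definition used in the talk.

```python
# Hypothetical reconstruction of smileycipher from the slide's one example.
OFFSET = 0x1F627 - ord("h")  # assumed constant shift into the emoticon block

def smiley_cipher(s: str) -> str:
    # Shift ASCII letters by OFFSET; leave other characters (e.g. space) as is.
    return "".join(
        chr(ord(c) + OFFSET) if c.isascii() and c.isalpha() else c for c in s
    )

cipher = smiley_cipher("hello world")
print(cipher)
# Over 16-bit characters every emoticon needs two code units (a surrogate pair).
print(len(cipher), len(cipher.encode("utf-16-le")) // 2)
```

Whether this counts as a string function depends on the character type: over code points it maps one character to one character, over 16-bit units it maps one unit to two.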
WHAT ARE CHARACTERS?
1. Elements of a finite alphabet Σ?
  – Only primitive operation is =: Σ × Σ → Bool
  – What about Unicode, e.g., 😀 😁 (http://unicode.org/charts/PDF/U1F600.pdf)
    • |Σ| = 1,112,064
  – For succinctness allow total order ≺: Σ × Σ → Bool and ranges [a-b] (denotes {x | a ≼ x ≼ b})
    • This affects the notion of automaton over Σ!
    • Why not other operations as well?
2. Bit-vectors, say char (BV16)?
  – With primitive operations like &: char × char → char
  – “😀” = “\uD83D\uDE00” (UTF-16 surrogate pair)
    • has its own theory, namely bv theory!
3. Integers (code points)?
  – 😀 = 0x1F600 = 128512
  – e.g. 😀 + 1 = 😁 = 0x1F601
    • has its own theory, namely int theory!
…
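The three views of a character can be checked directly. The snippet below is only an illustration using Python's built-in Unicode support; the variable names are made up for this sketch.

```python
# Three views of the same character U+1F600 (grinning face).
ch = "\U0001F600"

# 1. Element of a finite alphabet: Unicode scalar values exclude the surrogates.
alphabet_size = 0x110000 - (0xDFFF - 0xD800 + 1)

# 2. Sequence of 16-bit bit-vectors: the UTF-16 surrogate pair.
raw = ch.encode("utf-16-be")
units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# 3. Integer code point, with ordinary integer arithmetic.
code_point = ord(ch)
successor = chr(code_point + 1)

print(alphabet_size)                  # 1112064
print([hex(u) for u in units])        # ['0xd83d', '0xde00']
print(code_point, successor == "\U0001F601")
```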
WHAT ARE STRINGS?
• Finite sequences of characters (char)
  – CVC4-str
  – Singleton string = char
• Restricted arrays of int to char
  – Pex-LL, PASS
  – array<int,char> ≠ char, singleton string ≠ char
• Finite lists of characters
  – Pex-Rex
  – list<char> ≠ char, singleton string ≠ char
• Finite queues
  – transducers
The answer depends on the context and the required operations.
  – First, Last, Rest, Append, Substring, Length, …
ANALYSIS TASKS
• Consider character type C, string type S<C>, and regular expression type R<C>.
  – When is DPLL(T_C, T_S<C>, T_R<C>) possible/feasible?
• What about (finite state) transducers?
  – Regular transformations of type S<Tin> → S<Tout>
  – Typically Tin = Tout = bit-vectors
  – Many string transformations are such:
    • sanitizers, encoders
HTML ENCODER
[Figure: HTML encoder source code, annotated “Arithmetic operations on characters”]
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
• only has the features required by the task
• is simple to use
• enables automatic reasoning about what the programs do
• compiles into efficient code
THE REST OF THE TALK
• Symbolic Automata and Transducers
• BEK and string sanitizers
• BEX and string encoders
• Data parallel BEK/BEX for string processing
SYMBOLIC FINITE AUTOMATA
SYMBOLIC FINITE AUTOMATON (SFA)
• Labels are predicates
One symbolic transition:
  p --[λx. 'a' ≤ x ≤ 'd']--> q
denotes many concrete transitions:
  p --['a']--> q, p --['b']--> q, p --['c']--> q, p --['d']--> q
  (one for each x ∈ 〚'a' ≤ x ≤ 'd'〛)
SFA EXECUTION EXAMPLE
Transitions:
  p --[λx. x mod 2 = 1]--> p
  p --[λx. x mod 2 = 0]--> q
  q --[λx. x mod 2 = 0]--> q
  q --[λx. x mod 2 = 1]--> p
Input:  1 2 5 3
States: p p q p p
p is final ⇒ accept the input
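The run above can be replayed with a few lines of Python. This is a minimal sketch, assuming the reading of the figure that is consistent with the state sequence p p q p p: p loops on odd inputs and moves to q on even ones, q loops on even inputs and returns to p on odd ones, and p is both initial and final. Guards are plain Python functions here, so the sketch only executes the automaton and does not reason about satisfiability.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pred = Callable[[int], bool]

# Transitions of the example SFA: state -> list of (guard, target).
DELTA: Dict[str, List[Tuple[Pred, str]]] = {
    "p": [(lambda x: x % 2 == 1, "p"), (lambda x: x % 2 == 0, "q")],
    "q": [(lambda x: x % 2 == 0, "q"), (lambda x: x % 2 == 1, "p")],
}
INITIAL, FINAL = "p", {"p"}

def run(word: Iterable[int]) -> List[str]:
    """Return the visited states; assumes the SFA is deterministic and total."""
    state, trace = INITIAL, [INITIAL]
    for x in word:
        state = next(target for guard, target in DELTA[state] if guard(x))
        trace.append(state)
    return trace

def accepts(word: Iterable[int]) -> bool:
    return run(word)[-1] in FINAL

print(run([1, 2, 5, 3]))      # ['p', 'p', 'q', 'p', 'p']
print(accepts([1, 2, 5, 3]))  # True
```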
SYMBOLIC FINITE AUTOMATA
What is the alphabet?
ALPHABET IS AN EFFECTIVE BOOLEAN ALGEBRA
(D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬)
• D: domain
• P: predicates
• 〚_〛: P → 2^D
ALPHABET EXAMPLE
2^{a,b} = (D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬) with
• D = {a,b}
• P = {∅,{a},{b},{a,b}}
• 〚_〛 = id
• ⊥ = ∅, ⊤ = {a,b}, ∨ = ∪, ∧ = ∩, ¬ = complement (c)
regex: a*b(a|b)*
SFA over 2^{a,b}:
  p --[{a}]--> p
  p --[{b}]--> q
  q --[{a,b}]--> q
ALPHABET EXAMPLE: 2^BVk
• D = {n | 0 ≤ n < 2^k}
• P = BDDs of depth k
• Boolean operations are BDD operations
Below, the BDD for bit i denotes {n ∈ D | i'th bit of n is 1}
[Figure: the BDD for bit i; it has fixed size independent of i]
ALPHABET EXAMPLE: SMTINT
• D = Integers
• P = integer linear arithmetic formulas (with one fixed free variable)
• 〚φ ∧ ψ〛 = 〚φ〛 ∩ 〚ψ〛
• 〚⊥〛 = ∅, 〚¬φ〛 = D \ 〚φ〛
• Satisfiability: 〚φ〛 ≠ ∅
BOOLEAN ALGEBRA INTERFACE IN C#
public interface IBoolAlg<P>
{
    P Top { get; }
    P Bot { get; }
    P Not(P pred);
    P Or(P pred1, P pred2);
    P And(P pred1, P pred2);
    bool IsSat(P predicate);
}
public interface IBoolAlgExt<P, D> : IBoolAlg<P>
{
    IEnumerable<D> Den(P pred);
    P One(D elem);
}
UNIT ALPHABET EXAMPLE IN C#
class A1 : IBoolAlg<bool>
{
    public bool Top { get { return true; } }
    public bool Bot { get { return false; } }
    public bool Not(bool pred) { return !pred; }
    public bool Or(bool pred1, bool pred2) { return pred1 || pred2; }
    public bool And(bool pred1, bool pred2) { return pred1 && pred2; }
    public bool IsSat(bool pred) { return pred; }
}
One-letter alphabet
ANOTHER ALPHABET EXAMPLE IN C#
class A16 : IBoolAlg<UInt16>
{
    public UInt16 Top { get { return 0xFFFF; } }
    public UInt16 Bot { get { return 0; } }
    // The bitwise operators promote UInt16 to int, so the results are cast back.
    public UInt16 Not(UInt16 pred) { return (UInt16)~pred; }
    public UInt16 Or(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 | pred2); }
    public UInt16 And(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 & pred2); }
    public bool IsSat(UInt16 pred) { return pred != 0; }
}
16-letter alphabet
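The 16-letter alphabet treats a 16-bit word as the set of letters whose bit is set. The Python sketch below mirrors that idea and also shows one possible reading of the `Den` and `One` operations of `IBoolAlgExt` (enumerate the denoted letters, and build the predicate for a single letter). The class name `BitmaskAlg` and the method names are illustrative, not part of the talk's library.

```python
from typing import Iterator

class BitmaskAlg:
    """Predicates are 16-bit masks; bit i set means letter i satisfies the predicate."""
    WIDTH = 16
    top = 0xFFFF
    bot = 0

    def not_(self, p: int) -> int:
        return ~p & self.top          # mask keeps the result within 16 bits

    def or_(self, p: int, q: int) -> int:
        return p | q

    def and_(self, p: int, q: int) -> int:
        return p & q

    def is_sat(self, p: int) -> bool:
        return p != 0

    def den(self, p: int) -> Iterator[int]:
        # Enumerate the letters (bit positions) denoted by p.
        return (i for i in range(self.WIDTH) if (p >> i) & 1)

    def one(self, letter: int) -> int:
        # Predicate satisfied by exactly one letter.
        return 1 << letter

alg = BitmaskAlg()
pred = alg.or_(alg.one(0), alg.one(4))
print(list(alg.den(pred)))                         # [0, 4]
print(alg.is_sat(alg.and_(pred, alg.not_(pred))))  # False
```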
ALPHABET TRANSFORMATIONS
• Effective Boolean algebras can be extended
  – e.g. disjoint union
• Effective Boolean algebras can be restricted
  – e.g. restriction wrt. a given predicate
DISJOINT UNION OF ALPHABETS IN C#
public class PairAlg<S, T> : IBoolAlg<Pair<S, T>>
{
    IBoolAlg<S> A;
    IBoolAlg<T> B;
    public Pair<S, T> Bot { get { return new Pair<S, T>(A.Bot, B.Bot); } }
    …
    public Pair<S, T> Or(Pair<S, T> a, Pair<S, T> b)
    {
        return new Pair<S, T>(A.Or(a[0], b[0]), B.Or(a[1], b[1]));
    }
    public bool IsSat(Pair<S, T> p)
    {
        return A.IsSat(p[0]) || B.IsSat(p[1]);
    }
}
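The same construction can be sketched in Python: a predicate over the disjoint union is a pair of predicates, all operations work component-wise, and a pair is satisfiable as soon as one component is. The classes `UnitAlg` and `PairAlg` below are illustrative stand-ins with made-up method names; the elided members of the C# class are filled in here by analogy with `Or`, which is an assumption.

```python
class UnitAlg:
    """One-letter alphabet: a predicate is a bool."""
    top, bot = True, False
    def not_(self, p): return not p
    def or_(self, p, q): return p or q
    def and_(self, p, q): return p and q
    def is_sat(self, p): return p

class PairAlg:
    """Disjoint union of alphabets A and B: a predicate is a pair (pA, pB)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.top = (a.top, b.top)
        self.bot = (a.bot, b.bot)
    def not_(self, p):
        return (self.a.not_(p[0]), self.b.not_(p[1]))
    def or_(self, p, q):
        return (self.a.or_(p[0], q[0]), self.b.or_(p[1], q[1]))
    def and_(self, p, q):
        return (self.a.and_(p[0], q[0]), self.b.and_(p[1], q[1]))
    def is_sat(self, p):
        # Satisfiable if either component has a witness.
        return self.a.is_sat(p[0]) or self.b.is_sat(p[1])

two = PairAlg(UnitAlg(), UnitAlg())      # a two-letter alphabet
left = (True, False)                     # the letter from the first copy
print(two.is_sat(left))                              # True
print(two.is_sat(two.and_(left, two.not_(left))))    # False
print(two.not_(two.bot) == two.top)                  # True
```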
SFA VS. CLASSICAL AUTOMATA?
• SFAs can support infinite alphabets
• For some cases SFAs are exponentially more succinct than NFAs
Example (recall the BDDs for bit i from before):
[Figure: SFA whose guards are the bit-i BDDs]
Equivalent NFA requires 2^k transitions.
SYMBOLIC FINITE AUTOMATA
Algorithms over SFAs.
ALGORITHMS OVER SFAS
• Language intersection
  – Uses product of automata
• Language complementation
  – Requires determinization
• Minimization
  – Extensions of Moore/Hopcroft [POPL’14]
• Regex → SFA construction
  – Uses BDDs to represent Unicode character sets
  – Requires BDD/interval-set conversions
    • May cause exponential blowup: recall the BDDs for bit i
LANGUAGE INTERSECTION
• Uses DFS and product of transitions
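[Figure: a transition p1 → q1 of A and a transition p2 → q2 of B yield the product transition (p1,p2) → (q1,q2) of A×B, guarded by the conjunction of the two guards; the product transition is deleted when that conjunction is unsat]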
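A minimal Python sketch of the product construction follows. It assumes a stand-in Boolean algebra in which a predicate is a finite set of integers, so conjunction is set intersection and satisfiability is non-emptiness; an SMT- or BDD-backed algebra would take the same place. The automata `A` and `B` at the bottom are made-up examples, not the ones from the slides.

```python
from typing import Dict, FrozenSet, List, Tuple

Pred = FrozenSet[int]                       # stand-in algebra: finite sets
Trans = Dict[str, List[Tuple[Pred, str]]]   # state -> [(guard, target)]

def product(da: Trans, db: Trans, start: Tuple[str, str]):
    """DFS over pairs of states; keep only transitions with a satisfiable guard."""
    delta: Dict[Tuple[str, str], List[Tuple[Pred, Tuple[str, str]]]] = {}
    stack, seen = [start], {start}
    while stack:
        p1, p2 = state = stack.pop()
        delta[state] = []
        for g1, q1 in da.get(p1, []):
            for g2, q2 in db.get(p2, []):
                guard = g1 & g2                 # conjunction of the guards
                if not guard:                   # unsat: delete the transition
                    continue
                delta[state].append((guard, (q1, q2)))
                if (q1, q2) not in seen:
                    seen.add((q1, q2))
                    stack.append((q1, q2))
    return delta

DOMAIN = range(12)

def div(k: int) -> Pred:
    return frozenset(x for x in DOMAIN if x % k == 0)

ODD = frozenset(x for x in DOMAIN if x % 2 == 1)
A = {"a1": [(div(2), "a2")], "a2": [(ODD, "a1")]}
B = {"b1": [(div(3), "b2")], "b2": [(div(2), "b1")]}
AB = product(A, B, ("a1", "b1"))
print(AB[("a1", "b1")])   # one transition, guard = multiples of 6
print(AB[("a2", "b2")])   # empty: odd ∧ even is unsat
```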
INTERSECTION EXAMPLE
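[Figure: SFAs A (states a1, a2) and B (states b1, b2) with guards φ2, φ3 and φ6, and their product A×B over states (a1,b1), (a2,b2), (a1,b2) with conjoined guards such as φ2∧φ3 and φ6∧φ3, where φk(x) holds iff (x mod k) = 0; one product transition is marked X]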
LANGUAGE COMPLEMENTATION
First determinize, then swap final and nonfinal states.
[Figure: determinizing an SFA with states p, q, r yields subset states {p}, {q}, {q,r}, {r}; unsat guards are deleted]
MINIMIZATION (SYMBOLIC MOORE)
D := (F × (Q\F)) ∪ ((Q\F) × F)
foreach (p’,q’) ∈ D, (p,q) ∉ D
  if (IsSat(guard(p,p’) ∧ guard(q,q’)))
    add (p,q) to D
[Figure: transitions p --φ--> p’ and q --ψ--> q’ with (p’,q’) distinguishable; then (p,q) is distinguishable if IsSat(φ ∧ ψ)]
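The marking rule can be run to a fixpoint as follows. This is a sketch under stated assumptions: the SFA is deterministic and total, predicates are finite sets of integers (so IsSat is a non-emptiness test), and the loop is the naive quadratic iteration rather than the optimized algorithms of the POPL’14 paper. The four-state automaton at the bottom is a made-up example.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Pred = FrozenSet[int]  # stand-in algebra: finite sets, IsSat = non-empty

def distinguishable(states: List[str], final: Set[str],
                    delta: Dict[str, List[Tuple[Pred, str]]]) -> Set[Tuple[str, str]]:
    """Fixpoint of the marking rule on a deterministic, total SFA."""
    D = {(p, q) for p, q in product(states, states) if (p in final) != (q in final)}
    changed = True
    while changed:
        changed = False
        for p, q in product(states, states):
            if (p, q) in D:
                continue
            # (p, q) is distinguishable if some common input leads into D.
            if any((p2, q2) in D and (phi & psi)
                   for phi, p2 in delta[p] for psi, q2 in delta[q]):
                D.add((p, q))
                changed = True
    return D

ALL = frozenset(range(4))
EVEN, ODD = frozenset({0, 2}), frozenset({1, 3})
delta = {
    "a": [(EVEN, "b"), (ODD, "c")],
    "b": [(ALL, "f")],
    "c": [(ALL, "f")],
    "f": [(ALL, "f")],
}
D = distinguishable(["a", "b", "c", "f"], {"f"}, delta)
print(("b", "c") in D)   # False: b and c can be merged
print(("a", "b") in D)   # True
```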
REGEX → SFA
• Classical algorithm extended to work with predicates
  – First produces εSFA (SFA with ε-moves)
  – Then ε-moves are eliminated using the standard ε-elimination algorithm
  – Requires interval-set/BDD conversion algorithm for converting character classes
Example: [\0x0-\0xFF] = BDD whose bits in pos. > 7 are 0
ONLINE SFA ALGORITHM EXAMPLES
• http://www.rise4fun.com/Bex/zE
SYMBOLIC FINITE TRANSDUCERS
SYMBOLIC FINITE TRANSDUCER (SFT)
• Labels are guarded transformation functions
Symbolic transition:
  p --[λx. 0x80 ≤ x ≤ 0x7FF / [0xC0 | x[10:6], 0x80 | x[5:0]]]--> q
  (guard: 0x80 ≤ x ≤ 0x7FF; outputs: bitvector operations)
Concrete transitions (1920 transitions):
  p --[‘\x80’/“\xC2\x80”]--> q
  …
  p --[‘\x7FF’/“\xDF\xBF”]--> q
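The symbolic transition is the two-byte case of UTF-8 encoding, which makes it easy to check. The sketch below expands the guard and output functions into the concrete transitions and compares them with Python's built-in encoder; the names `guard` and `output` are illustrative.

```python
def guard(x: int) -> bool:
    return 0x80 <= x <= 0x7FF

def output(x: int) -> bytes:
    # [0xC0 | x[10:6], 0x80 | x[5:0]]
    return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])

# Expand the single symbolic transition over 16-bit characters.
concrete = {x: output(x) for x in range(0x10000) if guard(x)}

print(len(concrete))                                 # 1920
print(concrete[0x80].hex(), concrete[0x7FF].hex())   # c280 dfbf
print(all(chr(x).encode("utf-8") == out for x, out in concrete.items()))  # True
```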
SFT EXECUTION EXAMPLE
Transitions:
  p --[x mod 2 = 1 / [x-1]]--> p
  p --[x mod 2 = 0 / [x, x]]--> q
  q --[x mod 2 = 0 / []]--> q
  q --[x mod 2 = 1 / [x-1]]--> p
Input tape:  1 2 5 3
States:      p p q p p
Output tape: 0 2 2 4 2
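The run can be replayed in Python. The sketch assumes the reading of the figure that reproduces the tapes shown: at p, odd inputs output [x-1] and stay in p while even inputs output [x, x] and move to q; at q, even inputs output nothing and stay in q while odd inputs output [x-1] and return to p. Guards and output functions are plain Python functions in this illustration.

```python
from typing import Callable, Dict, List, Tuple

Guard = Callable[[int], bool]
Out = Callable[[int], List[int]]

def even(x: int) -> bool:
    return x % 2 == 0

def odd(x: int) -> bool:
    return x % 2 == 1

# state -> [(guard, output function, target)]
DELTA: Dict[str, List[Tuple[Guard, Out, str]]] = {
    "p": [(odd, lambda x: [x - 1], "p"), (even, lambda x: [x, x], "q")],
    "q": [(even, lambda x: [], "q"), (odd, lambda x: [x - 1], "p")],
}

def transduce(word: List[int], state: str = "p") -> Tuple[List[str], List[int]]:
    """Return (visited states, output tape); assumes a deterministic, total SFT."""
    trace, out = [state], []
    for x in word:
        _, f, state = next(t for t in DELTA[state] if t[0](x))
        out.extend(f(x))
        trace.append(state)
    return trace, out

print(transduce([1, 2, 5, 3]))
# (['p', 'p', 'q', 'p', 'p'], [0, 2, 2, 4, 2])
```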
SYMBOLIC FINITE TRANSDUCERS
Properties and algorithms
WHY SFTS?
• They have good algebraic properties (POPL'12)
  – SFTs are closed under composition
  – Equivalence is decidable in the single-valued case
  – Domain of an SFT is an SFA
    • SFAs are closed under Boolean operations
• Useful for various analysis tasks
SFT COMPOSITION
A∘B = λx. B(A(x))
A:   a1 --[x>0 / [x+1, x+2]]--> a2
B:   b1 --[x<5 / []]--> b2 --[x<4 / [x, x]]--> b3
A∘B: (a1,b1) --[x>0 ∧ x+1<5 ∧ x+2<4 / [x+2, x+2]]--> (a2,b3)
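The composed transition in the example comes from feeding A's outputs x+1 and x+2 through B's two transitions: the guards become x+1<5 and x+2<4, and B's output [x, x] on its second input becomes [x+2, x+2]. The Python sketch below checks this by brute force on a small range of integers. The functions model only the single path shown and return None when a guard fails, which is a simplification made for this illustration.

```python
def A(x):
    """a1 --[x > 0 / [x+1, x+2]]--> a2; None if the guard fails."""
    return [x + 1, x + 2] if x > 0 else None

def B(ys):
    """b1 --[x < 5 / []]--> b2 --[x < 4 / [x, x]]--> b3 on a two-symbol input."""
    if ys is None or len(ys) != 2:
        return None
    y1, y2 = ys
    return [y2, y2] if y1 < 5 and y2 < 4 else None

def AB(x):
    """Composed transition: A's outputs substituted into B's guards and outputs."""
    return [x + 2, x + 2] if x > 0 and x + 1 < 5 and x + 2 < 4 else None

print(all(AB(x) == B(A(x)) for x in range(-10, 11)))  # True
print(AB(1))                                          # [3, 3]
```

In this example the composed guard is satisfiable only for x = 1 over the integers, which is the kind of fact the satisfiability check on character formulas establishes.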
SFT ALGORITHMS
• Composition: SFT A followed by SFT B (in → A → B → out) is a single SFT A∘B
• Equiv. checking for single-valued SFTs (undecidable in general): given SFT A and SFT B, a witness “input string” shows that A and B are not equivalent
Algorithms use SMT for satisfiability checking of character formulas
PROPERTY ANALYSIS (USENIX SEC'11)
• Does it matter if a sanitizer is applied twice? Idempotence: compare A∘A with A
  – a witness “input string” shows that A is not idempotent
• Does order of sanitizers matter? Commutativity: compare A∘B with B∘A
  – a witness “input string” shows that A and B are not commutative
APPLICATIONS
APPLICATIONS OF SFAS/SFTS
• SFAs:
  – Regex support in parameterized unit testing
  – Fuzz testing of regexes
  – Password generation
• SFTs:
  – Analysis of string encoders/decoders
  – Security analysis of sanitizers
APPLICATION 1: REGEXES IN PARAMETERIZED UNIT TESTING
• Rex component in Pex
• Generate values for s that reach the return branches
  – s is a string of Unicode characters (16-bit bit-vectors)
bool IsValidEmail(string s) {
  string r1 = @"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$";
  string r2 = @"^\d.*$";
  if (System.Text.RegularExpressions.Regex.IsMatch(s, r1))
    if (System.Text.RegularExpressions.Regex.IsMatch(s, r2))
      return false; //branch 1
    else
      return true; //branch 2
  else
    return false; //branch 3
}
Branch 1: solve s ∈ L(r1) ∩ L(r2) [eg. s = “[email protected]”]
Branch 2: solve s ∈ L(r1) \ L(r2) [eg. s = “[email protected]”]
Branch 3: solve s ∉ L(r1) [eg. s = “[email protected]”]
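The three membership constraints can be checked on concrete strings with Python's `re` module, which handles these two patterns the same way for the inputs used here. The witness strings below are hypothetical examples chosen for this sketch, not the ones produced by Rex; the sketch only tests candidates and does not solve for them.

```python
import re

R1 = re.compile(r"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$")
R2 = re.compile(r"^\d.*$")

def branch(s: str) -> int:
    """Return the number of the return branch that IsValidEmail(s) would reach."""
    if R1.search(s):
        return 1 if R2.search(s) else 2
    return 3

# Hypothetical witnesses, one per branch.
for s in ["1ab@example.com", "ab@example.com", "not-an-email"]:
    print(s, "-> branch", branch(s))
```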
APPLICATION 2: PASSWORD GENERATION
Given constraints:
• Length is k: "^[\x21-\x7E]{k}$"
• Contains 2 capital letters: "[A-Z].*[A-Z]"
• Contains a digit: "\d"
• Contains a non-word character: "\W"
Generate random instances with uniform distribution that match all the above conditions.
k=4: http://www.rise4fun.com/Rex/4nE
http://www.rise4fun.com/Bek/c3j
APPLICATION 3: SAFETY ANALYSIS
Example: suppose good output = “NoEars”
NoEars = [^\uDE38-\uDE40]*
bad output: WithEars = Complement(NoEars)
Does there exist an input x that causes “ears” in the output?
∃x (smileycipher(x) ∈ WithEars) ?
{x | smileycipher(x) ∈ WithEars}
http://www.rise4fun.com/Bek/5sHO
EXTENSIONS
EXTENSIONS OF SFAS AND SFTS
• ESFT
  – SFA/SFTs with look-ahead [CAV'13]
  – BEX language
• STT
  – Symbolic automata/transducers over trees
  – FAST language [PLDI’14]
• k-SFT
  – SFT with lookback [POPL’15]
ESFAS AND ESFTS
• Unlike in the classical case, look-ahead breaks many properties
  – e.g. equivalence of ESFAs is undecidable
ESFT with a single state q (base64 encoder); the transition reads 3 and writes 4 symbols:
  x1≤0xFF ∧ x2≤0xFF ∧ x3≤0xFF / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
Example: M a n M a n ↦ T W F u T W F u
http://www.rise4fun.com/Bex/tutorial/guide
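The output expressions of the ESFT transition can be evaluated directly. The sketch below applies them to blocks of three bytes and, for display only, maps the resulting 6-bit digits to the standard base64 alphabet; that final mapping and the function names `step` and `encode` are additions made for this illustration. Padding for inputs whose length is not a multiple of 3 is not handled.

```python
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def step(x1: int, x2: int, x3: int):
    """One ESFT transition: read three bytes, write four 6-bit digits."""
    assert x1 <= 0xFF and x2 <= 0xFF and x3 <= 0xFF        # the guard
    return [x1 >> 2,
            ((x1 & 3) << 4) | (x2 >> 4),
            ((x2 & 0xF) << 2) | (x3 >> 6),
            x3 & 0x3F]

def encode(data: bytes) -> str:
    # Only handles inputs whose length is a multiple of 3 (no padding).
    digits = []
    for i in range(0, len(data), 3):
        digits += step(*data[i:i + 3])
    return "".join(B64[d] for d in digits)

print(step(*b"Man"))                                             # [19, 22, 5, 46]
print(encode(b"ManMan"))                                         # TWFuTWFu
print(encode(b"ManMan") == base64.b64encode(b"ManMan").decode()) # True
```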
FAST (TREE TRANSDUCERS)
• Trees are common input/output data structures
  – XML query, type-checking, etc.
  – Natural language translators (from parse tree to parse tree)
  – Compilers/optimizers (from parse tree to parse tree)
  – Tree manipulating programs: data structures algorithms, ontologies, etc.
  – Augmented reality
  – http://www.rise4fun.com/Fast/tutorial/guide
OUR RECIPE FOR EACH TASK
[Figure: a DSL program is translated into a transducer model; transformation analysis (“Does it do the right thing?”) answers analysis questions using Z3 and Automata-.NET; code generation targets C#, JavaScript and C]
DSL example:
s := iter(c in t)[b := false;] {
  case (!b && c in "[\"\\]"):
    b := false; yield('\\', c);
  case (c == '\\'):
    b := !b; yield(c);
  case (true):
    b := false; yield(c);
};
Automata-.NET will be open source on GitHub under MIT license
Some references:
BEK
• Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11
• Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjørner, POPL12
BEX
• Static analysis of string encoders and decoders. D’Antoni, Veanes, VMCAI13
• Equivalence of extended symbolic finite transducers. D’Antoni, Veanes, CAV13
• Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits, POPL15
QUESTIONS?
Links to related online tutorials:
– Bek: http://rise4fun.com/Bek/tutorial
– Bex: http://rise4fun.com/Bex/tutorial
– Rex: http://rise4fun.com/rex/
– Fast: http://rise4fun.com/Fast/tutorial