
Post on 20-Dec-2015


ISBN 0-321-19362-8

Chapter 6

Data Types
• Character Strings
• Pattern Matching


Copyright © 2004 Pearson Addison-Wesley. All rights reserved. 6-2

Character String Types

• Values are sequences of characters
• Design issues:

1. Is it a primitive type or just a special kind of array?

2. How is it stored in memory?

3. Is the length of a string variable static or dynamic?


Character String Operations

• Assignment
• Comparison (==, >, etc.)
• Catenation
– Sometimes an operator is provided (+ in Java, . in Perl)
– Some languages have a repetition operator (x in Perl, * in Python and Ruby)
• Substring reference
• Pattern matching
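As a rough illustration of the operations listed above, the following Python sketch shows assignment, comparison, catenation, repetition, and substring reference. Python is used here only as a convenient example language, and the variable names are invented for the illustration.

```python
# Basic string operations, shown in Python for illustration only.
first = "data"                 # assignment
second = "types"

is_equal = first == second     # comparison for equality
is_before = first < second     # lexicographic comparison

joined = first + " " + second  # catenation with the + operator
repeated = "ab" * 3            # repetition with the * operator
middle = joined[1:4]           # substring reference: characters at index 1, 2, 3

print(is_equal, is_before)
print(joined)
print(repeated)
print(middle)
```

Other languages spell these operations differently (for example . and x in Perl), but the set of operations is the same.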


Actual String Implementations

• Pascal
– Not primitive; assignment and comparison only (of packed arrays)
• Ada, FORTRAN 90, and BASIC
– Somewhat primitive
– Assignment, comparison, catenation, substring reference
– FORTRAN has an intrinsic for pattern matching
– Ada code
N := N1 & N2 (catenation)
N(2..4) (substring reference)


Character String Implementations

• C and C++
– Not primitive - implemented as arrays of characters terminated by a null character
– Use char arrays and a library of functions that provide operations
• SNOBOL4 (a string manipulation language)
– Primitive
– Many operations, including elaborate pattern matching


Character String Implementations

• Java - String class (not arrays of char)
– Objects cannot be changed (immutable)
– StringBuffer is a class for changeable string objects
• JavaScript, Ruby
– Strings are objects with many operations
• Perl, PHP
– Strings are primitive
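The split between immutable strings and a separate changeable buffer can be sketched in Python, whose str type is also immutable. This is an analogy rather than Java code: the list-of-pieces buffer below is an assumption of the sketch, standing in for the role that StringBuffer plays in Java.

```python
# Python strings, like Java String objects, are immutable.
s = "string"
try:
    s[0] = "S"                 # item assignment is not allowed on str
    changed_in_place = True
except TypeError:
    changed_in_place = False

# A mutable buffer collects pieces and builds the final string once.
# (Using a list joined at the end is this sketch's choice, not the slides'.)
pieces = []
for word in ("changeable", "string", "objects"):
    pieces.append(word)
result = " ".join(pieces)

print(changed_in_place)
print(result)
```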


String Length Options

1. Static - FORTRAN 77, Ada, COBOL
e.g. (FORTRAN 90)
CHARACTER (LEN = 15) NAME;
2. Limited Dynamic Length - C and C++
actual length is indicated by a null character

3. Dynamic - SNOBOL4, Perl, JavaScript
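To make the limited dynamic option concrete, here is a small Python simulation of a C-style string: a fixed-capacity buffer whose current length is found by scanning for the null byte. The capacity value and the helper names make_buffer and c_strlen are hypothetical, chosen for this sketch.

```python
# A C-style "limited dynamic length" string: fixed capacity, with the
# current length marked by a terminating null byte.
CAPACITY = 16  # hypothetical buffer size, including the terminator

def make_buffer(text: bytes) -> bytearray:
    """Copy text into a fixed-size buffer, leaving at least one null byte."""
    if len(text) >= CAPACITY:
        raise ValueError("text does not fit in the buffer")
    buf = bytearray(CAPACITY)      # all bytes start as 0
    buf[:len(text)] = text         # same-length slice assignment keeps the size
    return buf

def c_strlen(buf: bytearray) -> int:
    """Scan for the first null byte, as C's strlen does."""
    length = 0
    while buf[length] != 0:
        length += 1
    return length

name = make_buffer(b"Ada")
print(len(name), c_strlen(name))
```

The storage size stays fixed while the logical length varies, which is why no separate length field is needed in this scheme.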


Evaluation of String Types

• Aid to writability
• As a primitive type with static length, they are inexpensive to provide--why not have them?
• Dynamic length is nice, but is it worth the expense?


Character String Implementation

• Static length - compile-time descriptor
• Limited dynamic length - may need a run-time descriptor for length (but not in C and C++)
• Dynamic length - need run-time descriptor; allocation/deallocation is the biggest implementation problem


Character String Descriptors

Compile-time descriptor for static strings

Run-time descriptor for limited dynamic strings
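The descriptor figures are not reproduced in this transcript. As a sketch only, the two descriptors might be modeled as the records below. The field names (length, address, max_length, current_length) are assumptions made for illustration, not a layout taken from the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StaticStringDescriptor:
    """Needed only at compile time: the length never changes."""
    length: int
    address: int

@dataclass
class LimitedDynamicStringDescriptor:
    """Kept at run time: the current length may vary up to a fixed maximum."""
    max_length: int
    current_length: int
    address: int

    def set_length(self, new_length: int) -> None:
        # The run-time check that a compile-time descriptor cannot provide.
        if not 0 <= new_length <= self.max_length:
            raise ValueError("length exceeds the declared maximum")
        self.current_length = new_length

# Hypothetical addresses and sizes, for illustration only.
static_desc = StaticStringDescriptor(length=15, address=0x1000)
dynamic_desc = LimitedDynamicStringDescriptor(max_length=15, current_length=3, address=0x2000)
dynamic_desc.set_length(10)
print(static_desc)
print(dynamic_desc)
```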


Character String Interpolation

• Two literal representations of strings in many scripting languages
– Single quoted strings are literals
  • Every character inside is stored as written. (In some languages, a few characters may be treated specially.)
  • These are like the double quoted strings in Java
– Double quoted strings are interpolated
  • Special characters have their regular meaning unless they have a backslash in front of them.
  • Variable names are expanded, replaced by the value of the variable.
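Python does not distinguish single from double quotes, so it cannot show this rule directly. The sketch below uses what are probably its closest analogues, which is an assumption of this example: a raw string for "stored as written" and an f-string for interpolation.

```python
# Closest Python analogues to literal vs. interpolated strings.
name = "world"

literal = r'Hello, {name}\n'        # raw, no f prefix: nothing is expanded
interpolated = f"Hello, {name}\n"   # variable expanded, \n becomes a newline
escaped = f"Hello, {{name}}\\n"     # doubled braces and backslash suppress both

print(literal)
print(interpolated, end="")
print(escaped)
```

In Perl or Ruby the same contrast is obtained simply by switching between single and double quotes.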


Pattern Matching

• A useful operation for strings
• Usually based on regular expressions
• Some languages have pattern matching built into the language (Perl, Python, Ruby, …)
• Some languages implement pattern matching via external libraries or classes
– Java has Pattern and Matcher classes


Recursive Definition of Regular Expressions

• Individual terminals are regular expressions
• If a and b are regular expressions, so are

– a | b choice

– ab sequence

– (a) grouping

– a* zero or more repetitions

• Nothing else is a regular expression


Examples

• Identifiers
– letter(letter | digit)*
• Binary strings
– (0 | 1)(0 | 1)*
• Binary strings divisible by 2
– (0 | 1)*0
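The three example expressions can be checked with Python's re module. In this sketch, "letter" and "digit" are assumed to mean ASCII letters and decimal digits, and the helper name matches is invented for the illustration.

```python
import re

LETTER = "[A-Za-z]"   # assumed meaning of "letter"
DIGIT = "[0-9]"       # assumed meaning of "digit"

identifier = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")
binary = re.compile("(0|1)(0|1)*")
even_binary = re.compile("(0|1)*0")

def matches(pattern: re.Pattern, text: str) -> bool:
    """True when the whole of text is described by pattern."""
    return pattern.fullmatch(text) is not None

for text in ("x1", "1x", "1010", "101", ""):
    print(repr(text),
          matches(identifier, text),
          matches(binary, text),
          matches(even_binary, text))
```

fullmatch is used because a regular expression in the formal sense describes the entire string, not a substring of it.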


Regular Expressions and Pattern Matching in Perl

• The operators =~ and !~ check for match and no match respectively.
• A pattern is enclosed between slashes as in /pattern/
• If a pattern appears by itself, the variable $_ is checked for a match
• ^ at the beginning of the pattern means it must start at the beginning of the string
• $ at the end of a pattern means it must end at the end of the string


Example

• The following Perl script checks for the regular expression a+b+

#!/usr/bin/perl
# This is a Perl script
$_ = <STDIN>;          # Read a line into $_
if (/^a+b+$/) {        # Match $_ against a+b+
    print "yes\n";
} else {
    print "no\n";
}
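For comparison, a rough Python counterpart of the same check is sketched below. It is not a translation endorsed by the slides, and the function name check is hypothetical.

```python
import re

def check(line: str) -> str:
    """Return "yes" if line is one or more a's followed by one or more b's."""
    # As in Perl, $ also matches just before a trailing newline, so a line
    # read from standard input does not need its newline removed first.
    return "yes" if re.search(r"^a+b+$", line) else "no"

for sample in ("aabbb\n", "ab", "ba\n", "aabba\n"):
    print(repr(sample), check(sample))
```

Both anchors are needed here: without ^ and $ the pattern would accept any line that merely contains an a followed by a b.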


Pattern symbols

Symbol     Meaning
.          Any single character (except '\n')
*          0 or more occurrences
+          1 or more occurrences
?          0 or 1 occurrences of previous character
[abc]      One of enclosed characters
[^abc]     None of enclosed characters
{i,j}      Between i and j occurrences
|          Choice
()         Grouping
/i         Case insensitive (a modifier written after the closing slash, as in /pattern/i)
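Each symbol in the table can be tried out with a one-line test. The sketch below uses Python's re module, whose syntax for these symbols agrees with Perl's. Case-insensitive matching is requested with a flag rather than inside the pattern, and the helper name found is invented for the example.

```python
import re

def found(pattern: str, text: str, flags: int = 0) -> bool:
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text, flags) is not None

print(found(r"c.t", "cat"))            # . stands for any single character
print(found(r"^ab*$", "a"))            # * allows zero occurrences of b
print(found(r"^ab+$", "a"))            # + requires at least one b
print(found(r"^colou?r$", "color"))    # ? makes the u optional
print(found(r"^[abc]$", "b"))          # one of the enclosed characters
print(found(r"^[^abc]$", "b"))         # none of the enclosed characters
print(found(r"^a{2,3}$", "aaaa"))      # between 2 and 3 occurrences of a
print(found(r"^(ab|cd)+$", "abcd"))    # grouping combined with choice
print(found(r"^perl$", "PERL", re.IGNORECASE))  # case-insensitive matching
```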


Character Classes

• There are several classes of characters that have special names

Meaning                             Match    Exclude
Any digit                           \d       \D
Any letter, digit, or underscore    \w       \W
Any whitespace                      \s       \S
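A short Python sketch shows the named classes at work on one sample string. The sample text is made up for the illustration.

```python
import re

text = "Room 42, floor_3"

digits = re.findall(r"\d+", text)       # runs of digits
non_digits = re.findall(r"\D+", text)   # runs of anything but digits
words = re.findall(r"\w+", text)        # runs of letters, digits, or underscore
spaces = re.findall(r"\s", text)        # individual whitespace characters

print(digits)
print(non_digits)
print(words)
print(spaces)
```

Note that the underscore counts as a word character, so floor_3 is found as a single token.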


Anchors

• Used to specify position within a string

Symbol    Position
^         Beginning of string
$         End of string
\b        Word boundary
\B        Not at word boundary

• \bpattern\b matches the word pattern but not patterned
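The word-boundary example can be checked directly. This Python sketch also adds a \B case; the second pattern is an invented example, not one from the slides.

```python
import re

word = re.compile(r"\bpattern\b")
inside = re.compile(r"\Bern\B")   # "ern" with word characters on both sides

print(bool(word.search("a pattern here")))      # whole word present
print(bool(word.search("a patterned cloth")))   # only as part of a longer word
print(bool(inside.search("patterned")))         # "ern" is interior to the word
print(bool(inside.search("pattern")))           # "ern" ends the word
```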


String Manipulation

• split( /pattern/, string) splits a string into tokens using the pattern as the delimiter
split( /, /, $fullName)
• tr/a-z/A-Z/ transliterates characters
• s/pattern/replacement/ replaces the first occurrence of pattern in a string with replacement. A g after the last slash replaces all occurrences.
– Use this same syntax for substitution in the vi editor
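For readers who want to experiment outside Perl, the three operations have approximate Python counterparts, sketched below. The sample values are hypothetical. The mapping to Perl in the comments is approximate, and the transliteration comment assumes Perl's hyphenated range form.

```python
import re
import string

full_name = "Hopper, Grace"   # hypothetical value for $fullName
tokens = re.split(r", ", full_name)                # like split(/, /, $fullName)

to_upper = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
shouted = "pattern matching".translate(to_upper)   # like tr/a-z/A-Z/

first_only = re.sub(r"a", "A", "banana", count=1)  # like s/a/A/
all_of_them = re.sub(r"a", "A", "banana")          # like s/a/A/g

print(tokens)
print(shouted)
print(first_only, all_of_them)
```

One difference worth noting is the default: Python's re.sub replaces every occurrence unless count is given, whereas Perl's s/// replaces only the first unless g is added.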


Pattern Memory

• Any part of a pattern that has parentheses around it will cause the matching text to be stored in pattern memory

• Within a pattern, you can use \1, \2, \3 to refer back to an earlier part of the pattern. This is called a back reference.

• After a match has completed, the variables $1, $2, … $9 contain the pieces from the last match.
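Both uses of pattern memory can be sketched in Python, where \1 works the same way inside a pattern and the remembered pieces are read from the match object instead of $1, $2, and so on. The sample strings are invented for the illustration.

```python
import re

# \1 inside the pattern refers back to whatever group 1 matched,
# so this finds a word that is immediately repeated.
doubled = re.compile(r"\b(\w+) \1\b")
m = doubled.search("this is is a test")
repeated_word = m.group(1) if m else None

# After a match, the groups hold the remembered pieces.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Posted 2015-12-20")
year, month, day = date.groups() if date else (None, None, None)

print(repeated_word)
print(year, month, day)
```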