unit 1-strings,patterns and regular expressions

STRINGS,PATTERNS AND REGULAR EXPRESSIONS BY SANA MATEEN

Upload: sana-mateen

Post on 08-Apr-2017



INTRODUCTION TO REGULAR EXPRESSIONS

A regular expression is a way of defining patterns: a notation for describing the set of strings that the expression matches. Among the first applications of regular expressions in computer systems were the text editors ed and sed in the UNIX system. Perl provides very powerful and dynamic string manipulation based on regular expressions.

Pattern match: searching for a specified pattern within a string. For example: finding a sequence motif, finding the accession number of a sequence, parsing HTML, validating user input.

Regular expression (regex): the notation that says how to make a pattern match.


HOW REGEX WORKS

[Diagram: regex code -> Perl compiler -> regex engine; input data (e.g. sequence file) -> regex engine -> output]


SIMPLE PATTERNS

Place the regex between a pair of forward slashes ( / / ). Try:

#!/usr/bin/perl
# Read lines from standard input and report those containing 'abc'.
while (<STDIN>) {
    if (/abc/) {
        print ">> found 'abc' in $_\n";
    }
}

Save, then run the program. Type something on the terminal, then press Return. Press Ctrl+C to exit the script.

If you type anything containing 'abc', the print statement is executed.


STAGES

1. The characters

   \ | ( ) [ { ^ $ * + ? .

   are metacharacters with special meanings in regular expressions. To use a metacharacter in a regular expression without its special meaning, escape it with a backslash. ] and } are also metacharacters in some circumstances.

2. Apart from metacharacters, any single character in a regular expression matches itself, so /cat/ matches the string cat.

3. The metacharacters ^ and $ act as anchors:
   ^ : matches the start of the line
   $ : matches the end of the line
   So the regex /^cat/ matches the string cat only if it appears at the start of the line, and /cat$/ matches only at the end of the line. /^cat$/ matches a line that contains only the string cat, and /^$/ matches an empty line.

4. The metacharacter dot (.) matches any single character except newline, so /c.t/ matches cat, cot, cut, etc.
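The anchor and dot rules above can be tried out with a small script. This is only a sketch: the helper name matching_patterns and the sample words are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report which of the anchor/dot patterns from the list above match a line.
# (matching_patterns is a hypothetical helper name used only in this sketch.)
sub matching_patterns {
    my ($line) = @_;
    my @hits;
    push @hits, '/cat/'   if $line =~ /cat/;     # 'cat' anywhere in the line
    push @hits, '/^cat/'  if $line =~ /^cat/;    # 'cat' at the start
    push @hits, '/cat$/'  if $line =~ /cat$/;    # 'cat' at the end
    push @hits, '/^cat$/' if $line =~ /^cat$/;   # the line is exactly 'cat'
    push @hits, '/^$/'    if $line =~ /^$/;      # empty line
    push @hits, '/c.t/'   if $line =~ /c.t/;     # 'c', any one character, 't'
    return @hits;
}

for my $line ('cat', 'catalog', 'tomcat', 'cot', '') {
    printf "%-10s %s\n", "'$line'", join(' ', matching_patterns($line));
}
```

Each output line lists the patterns that matched, so 'cot' is reported only for /c.t/ and the empty string only for /^$/.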


STAGES (continued)

5. A character class is a set of characters enclosed in square brackets. It matches any single character from those listed. So
   /[aeiou]/ : matches any vowel
   /[0123456789]/ : matches any digit, or equivalently /[0-9]/

6. A character class of the form /[^....]/ matches any single character except those listed, so /[^0-9]/ matches any non-digit.

7. To remove the special meaning of minus inside a character class, escape it with a backslash. For example, a regular expression to match the arithmetic operators is
   /[+\-*\/]/
   (the slash is escaped as well, because it is the pattern delimiter).

8. Repetition of characters in a regular expression can be specified by the quantifiers:
   * : zero or more occurrences
   + : one or more occurrences
   ? : zero or one occurrence

9. Thus /[0-9]+/ matches an unsigned decimal number, and /a.*b/ matches a substring starting with 'a' and ending with 'b', with an indefinite number of other characters in between.
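As a rough illustration of character classes and quantifiers, the sketch below extracts a number and counts operators. The helper names first_number and count_operators, and the sample strings, are assumptions for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first unsigned decimal number in a string, or undef if none.
sub first_number {
    my ($text) = @_;
    return $text =~ /([0-9]+)/ ? $1 : undef;
}

# Count the arithmetic operators in an expression. The slash is escaped
# because it is also the pattern delimiter; the minus is escaped so that
# it is not read as a range.
sub count_operators {
    my ($expr) = @_;
    my $count = () = $expr =~ /[+\-*\/]/g;
    return $count;
}

print first_number('order 42 of 100'), "\n";         # 42
print count_operators('3+4*2-1/5'), "\n";            # 4
print "colou?r matches color\n" if 'color' =~ /colou?r/;      # ? makes 'u' optional
print "a.*b matches\n"          if 'xxaYYYbzz' =~ /a.*b/;
print "non-digit found\n"       if '2017a' =~ /[^0-9]/;
```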


FACILITIES

1. Alternation: |
   If RE1, RE2 and RE3 are regular expressions, RE1|RE2|RE3 will match any one of the components.

2. Grouping: ( )
   Round brackets can be used to group items:
   /pitt the (elder|younger)/

3. Repetition counts
   Explicit repetition counts can be added to a component of a regular expression:
   /(wet[ ]){2}wet/ matches 'wet wet wet'
   The full list of possible count modifiers is:
   {n} : must occur exactly n times
   {n,} : must occur at least n times
   {n,m} : must occur at least n times but no more than m times

4. Regular expression example
   A simple regex to check for an IP address:
   ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$
   This checks only the shape (four dot-separated groups of one to three digits); it does not restrict each group to the range 0-255.
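The facilities above can be combined in a short script. This is a sketch under stated assumptions: looks_like_ip is a hypothetical helper name, and the candidate addresses are made-up test values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shape check only: four dot-separated groups of one to three digits.
# It does not verify that each group is in the range 0-255.
sub looks_like_ip {
    my ($text) = @_;
    return $text =~ /^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$/ ? 1 : 0;
}

# Alternation inside a group.
print "matched: $1\n" if 'pitt the younger' =~ /pitt the (elder|younger)/;

# An explicit repetition count applied to a group.
print "three wets\n" if 'wet wet wet' =~ /(wet[ ]){2}wet/;

for my $candidate ('192.168.0.1', '10.0.0', '999.999.999.999') {
    printf "%-16s %s\n", $candidate, looks_like_ip($candidate) ? 'ok' : 'rejected';
}
```

The last candidate is accepted even though 999 is not a valid octet, which shows why this pattern is only a first filter.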


FACILITIES (continued)

5. Non-greedy matching
   A pattern including .* matches the longest string it can find. The pattern .*? can be used when the shortest match is required.
   ? : placed after a quantifier, requests the shortest match

6. Shorthand
   This notation is provided for frequently occurring character classes:
   \d : matches a digit
   \w : matches a word character (letter, digit or underscore)
   \s : matches a whitespace character
   \D : matches any non-digit character
   Capitalization of the notation reverses the sense.

7. Anchors
   \b : word boundary
   \B : not a word boundary
   /\bJohn/ matches in both of the target strings John and Johnathan.

8. Back references
   Round brackets define a series of partial matches that are remembered for use in subsequent processing (as $1, $2, ...) or in the regex itself (as \1, \2, ...).

9. The match operator
   The match operator, m//, is used to match a string or statement against a regular expression. For example, to match the character sequence "foo" against the scalar $bar, you might use a statement like this:
   if ($bar =~ /foo/)
   Note that the entire match expression, that is, the expression on the left of =~ or !~ together with the match operator, returns true (in a scalar context) if the expression matches. Therefore the statement
   $true = ($foo =~ m/foo/);
   sets $true to 1 if $foo matches the regex, and to a false value if the match fails.
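A minimal sketch of greedy versus non-greedy matching, shorthand classes and back references follows. The sample strings and the helper name has_doubled_word are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

my ($greedy) = $html =~ /<(.*)>/;    # longest match between '<' and '>'
my ($lazy)   = $html =~ /<(.*?)>/;   # shortest match
print "greedy: $greedy\n";           # b>bold</b> and <i>italic</i
print "lazy:   $lazy\n";             # b

# \1 refers back, inside the regex, to whatever the first group matched.
# Returns the repeated word, or an empty string if there is none.
sub has_doubled_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1\b/ ? $1 : '';
}
print "doubled: ", has_doubled_word('it was the the best'), "\n";   # the

# After a successful match, $1 and $2 hold the text captured by the groups.
if ('2017-04-08' =~ /^(\d+)-(\d+)-\d+$/) {
    print "year $1, month $2\n";
}
```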


BINDING OPERATOR

The earlier /abc/ example matched against $_. To match against a scalar variable instead, use the binding operator "=~", which matches the pattern on the right against the string on the left.

The m operator is usually added for clarity of code:

$string =~ m/pattern/
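A short sketch of the binding operators in use, assuming a made-up DNA string as sample data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = 'ATGCGTTAA';   # hypothetical sample data

print "has start codon\n" if $sequence =~ m/^ATG/;   # =~ is true when the pattern matches
print "no ambiguous N\n"  if $sequence !~ m/N/;      # !~ is true when it does not

# Without a binding operator the pattern is applied to $_.
for ('TTAA', 'GGCC') {
    print "$_ contains TAA\n" if m/TAA/;
}
```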


MATCHING ONLY ONCE

There is also a simpler version of the match operator: the ?PATTERN? operator. This is basically identical to the m// operator except that it only matches once within the string you are searching between each call to reset. Since Perl 5.22 it must be written with the leading m, as m?PATTERN?. To remember which portion of the string matched, we use $1, $2, $3, etc. For example, you can use this to get the first and last matching elements within a list:

#!/usr/bin/perl
@list = qw/food foosball subeo footnote terfoot canic footbridge/;
foreach (@list) {
    $first = $1 if m?(foo.*)?;   # succeeds only once, on the first match
    $last  = $1 if /(foo.*)/;    # overwritten on every match
}
print "First: $first, Last: $last\n";

This will produce the following result:

First: food, Last: footbridge


THE SUBSTITUTION OPERATOR

The substitution operator, s///, is really just an extension of the match operator that allows you to replace the text matched with some new text. The basic form of the operator is:

s/PATTERN/REPLACEMENT/;

The PATTERN is the regular expression for the text that we are looking for. The REPLACEMENT is a specification for the text or expression that we want to use to replace the found text. For example, we can replace the first occurrence of "dog" with "cat" using:

$string =~ s/dog/cat/;

To replace all occurrences, add the g modifier: s/dog/cat/g.

Another example:

#!/usr/bin/perl
$string = 'The cat sat on the mat';
$string =~ s/cat/dog/;
print "Final Result is $string\n";

This will produce the following result:

Final Result is The dog sat on the mat
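As a further sketch of the s/// operator described above, the script below contrasts a single substitution with a global one and uses captured groups in the replacement. The variable names and sample strings are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first_only = 'dog eat dog';
(my $all = $first_only) =~ s/dog/cat/g;   # copy, then replace every occurrence
$first_only =~ s/dog/cat/;                # without /g only the first is replaced

print "$first_only\n";   # cat eat dog
print "$all\n";          # cat eat cat

# The replacement may use $1, $2, ... captured by the pattern.
my $name = 'Mateen, Sana';
$name =~ s/^(\w+),\s*(\w+)$/$2 $1/;
print "$name\n";         # Sana Mateen
```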


PATTERN MATCHING MODIFIERS

m//i : ignore case when pattern matching.

m//g : match globally, i.e. find every occurrence rather than just the first. This helps to count all occurrences of a substring:

$count = 0;
while ($target =~ m/$substring/g) { $count++ }

m//m : treat a target string containing newline characters as multiple lines, i.e. ^ and $ match at the start and end of each line.

m//s : treat a target string containing newline characters as a single string, i.e. dot matches any character including newline.

m//x : ignore whitespace characters in the regular expression unless they are escaped or occur in a character class.

m//o : compile the regular expression once only.
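The effect of the i, g, m, s and x modifiers can be seen in a small sketch like the following. The sample text is an assumption chosen so that each modifier changes the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Perl is fun\nperl is terse\nI like PERL";

# /i ignores case; /g in list context returns every match.
my @perls = $text =~ /perl/ig;
print scalar(@perls), " matches\n";      # 3

# /m lets ^ and $ match at the start and end of each line.
my @line_starts = $text =~ /^(\w+)/mg;
print "@line_starts\n";                  # Perl perl I

# /s lets dot match a newline, so .* can run across lines.
print "with /s: spans lines\n"  if     $text =~ /fun.*terse/s;
print "without /s: no match\n"  unless $text =~ /fun.*terse/;

# /x allows whitespace and comments inside the pattern.
my ($year) = 'Post on 08-Apr-2017' =~ /
    (\d{4})     # four digits
    $           # at the end of the string
/x;
print "$year\n";                         # 2017
```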


THE TRANSLATION OPERATOR

Translation is similar, but not identical, to substitution: unlike substitution, translation (or transliteration) does not use regular expressions for its search or replacement values. The translation operators are:

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

The translation replaces all occurrences of the characters in SEARCHLIST with the corresponding characters in REPLACEMENTLIST.

For example, using the "The cat sat on the mat." string:

#!/usr/bin/perl
$string = 'The cat sat on the mat.';
$string =~ tr/a/o/;
print "$string\n";

When the above program is executed, it produces the following result:

The cot sot on the mot.


TRANSLATION OPERATOR MODIFIERS

Standard Perl ranges can also be used, allowing you to specify ranges of characters either by letter or numerical value. To change the case of the string, you might use the following syntax in place of the uc function:

$string =~ tr/a-z/A-Z/;

Following is the list of modifiers related to translation:

Modifier   Description
c          Complements SEARCHLIST
d          Deletes found but unreplaced characters
s          Squashes duplicate replaced characters
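The modifiers in the table can be sketched as follows. The sample string and variable names are assumptions; each line works on its own copy so the effects can be compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'hello   world 2017';

(my $upper    = $text) =~ tr/a-z/A-Z/;     # range to range: upper-case
(my $squashed = $text) =~ tr/ //s;         # s: squash runs of spaces into one
(my $letters  = $text) =~ tr/a-zA-Z//cd;   # c + d: delete everything but letters
my $vowels = ($text =~ tr/aeiou//);        # empty replacement, no d: count only

print "$upper\n";      # HELLO   WORLD 2017
print "$squashed\n";   # hello world 2017
print "$letters\n";    # helloworld
print "$vowels\n";     # 3
```

The last line relies on tr returning the number of characters it matched, which makes it a convenient way to count characters without changing the string.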


SPLIT

Syntax of split:

split REGEX, STRING
    Splits the STRING at every match of the REGEX.

split REGEX, STRING, LIMIT
    LIMIT is a positive number. Splits the STRING at every match of the REGEX, but stops after it has found LIMIT-1 matches, so the number of elements it returns will be LIMIT or less.

split REGEX
    If STRING is not given, splits the content of $_, the default variable of Perl, at every match of the REGEX.

split
    Without any parameter, splits the content of $_ on whitespace, as if using /\s+/ as the REGEX (leading whitespace is skipped).

Simple cases

split returns a list of strings:

use Data::Dumper qw(Dumper);   # used to dump out the contents of any variable during the running of a program

my $str = "ab cd ef gh ij";
my @words = split / /, $str;
print Dumper \@words;

The output is:

$VAR1 = [
          'ab',
          'cd',
          'ef',
          'gh',
          'ij'
        ];
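The LIMIT argument and a regex separator can be illustrated with a short sketch. The record format and variable names below are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $record = 'name:Sana:role:author';   # hypothetical sample record

my @all   = split /:/, $record;      # every field
my @first = split /:/, $record, 2;   # LIMIT 2: at most two elements

print scalar(@all), " fields\n";     # 4
print "$first[1]\n";                 # Sana:role:author

# A regex separator: commas with optional surrounding whitespace.
my @items = split /\s*,\s*/, 'ab , cd,ef ,  gh';
print join('|', @items), "\n";       # ab|cd|ef|gh

# With no arguments, split breaks $_ on whitespace, skipping leading blanks.
for ('  ab cd   ef') {
    my @words = split;
    print scalar(@words), " words\n";   # 3
}
```

With LIMIT 2 the remainder of the string is kept intact in the last element, which is useful when only the first field needs to be separated.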
