Regular Expressions (RegEx) for SEO
DESCRIPTION
Regular Expressions are highly technical. This training covers the basics of RegEx and also gives examples of how to use it. Take some time to go through each example and try to figure it out on your own.
TRANSCRIPT
Regular Expressions for SEO The Coolest Pattern Matching Search Language...
Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant
For Powered by Search Internal | October 2013
Some of our clients...
We’re in business because we believe that great brands need both voice and visibility in order to connect people with what matters. A boutique, full-service digital marketing agency in Toronto, Powered by Search is a PROFIT HOT 50-ranked agency that delivers search engine optimization, pay per click advertising, local search, social media marketing, and online reputation management services.
Featured in...
RegEx Basics
Practical SEO Uses
RegEx Puzzles for Homework
Regular Expressions for SEO
http://xkcd.com/
RegEx Basics
RegEx Basics Use Sublime Text
This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too. It’s the text editor you’ll fall in love with.
RegEx Basics Literal Matching
Text I want to match this.
RegEx match this
RegEx matches literal strings. This is like running a normal search in Word. Pretty cool, huh?
RegEx Basics Anchors
Text I want this, I want that, I want I want I want
RegEx ^I want
There are a couple of special characters called “Anchors.” The caret (^) represents the beginning of a line. The dollar sign ($) represents the end of a line. You see these a lot in .htaccess files.
Text I want this, I want that, I want I want I want
RegEx I want$
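The two anchor slides can be tried outside the editor too. This is a minimal sketch that assumes Python's built-in re module instead of Sublime Text; the variable names are only for illustration.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the start of the line, so only the first "I want" qualifies.
at_start = re.search(r"^I want", text)

# $ pins the match to the end of the line, so only the last "I want" qualifies.
at_end = re.search(r"I want$", text)

# With no anchor at all, every occurrence is found.
anywhere = re.findall(r"I want", text)

print(at_start.start())            # position where the anchored-start match begins
print(at_end.end() == len(text))   # the anchored-end match finishes at the end of the line
print(len(anywhere))               # number of unanchored matches
```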
RegEx Basics Special Characters
There are also a series of other special characters. These are:
• [ - Starts a Character Class (More Later)
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.
To match any of these as a literal character, put a backslash (\) in front of it. This also applies to ? + * ^ and $, which we’ve talked about or will get to later.
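To see why escaping matters, here is a small sketch in Python (an assumption; the deck itself works in a text editor). The brand string and the test strings are made up for illustration, and re.escape is Python's helper for adding the backslashes for you.

```python
import re

brand = "lett.me"

loose = re.compile(brand)              # the dot is still a wildcard here
strict = re.compile(re.escape(brand))  # re.escape turns the pattern into lett\.me

print(bool(loose.search("lettXme")))   # True: "." happily matches the "X"
print(bool(strict.search("lettXme")))  # False: only a real dot will do
print(bool(strict.search("lett.me")))  # True
```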
RegEx Basics Quantifiers
A quantifier tells the expression how many times to match the expression before it.
• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times
Text Ahhhhhhhhhhh. A spider.
RegEx A[h]+
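A quick way to compare the quantifiers is to run them all against the spider sentence. A minimal sketch, assuming Python's re module; only A[h]+ comes from the slide, the other three patterns are added here for comparison.

```python
import re

text = "Ahhhhhhhhhhh. A spider."

# + needs at least one h, so only the scream matches.
one_or_more = re.findall(r"A[h]+", text)

# * accepts zero h's as well, so the lone "A" before "spider" matches too.
zero_or_more = re.findall(r"Ah*", text)

# {3} stops after exactly three h's.
exactly_three = re.search(r"Ah{3}", text).group()

# ? takes at most one h.
at_most_one = re.search(r"Ah?", text).group()

print(one_or_more, zero_or_more, exactly_three, at_most_one)
```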
RegEx Basics Greedy vs. Non-Greedy
Quantifiers are greedy by default. A greedy quantifier repeats the expression before it as many times as possible before giving up and continuing with the rest of the RegEx. This leads to unexpected matches. To make the quantifiers *, + and {} non-greedy (lazy), just add a question mark after them.
Text <p>test</p>
RegEx (Greedy) <.+>
Text <p>test</p>
RegEx (Lazy) <.+?>
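The difference is easiest to see by listing what each version actually grabs. A minimal sketch, assuming Python's re module:

```python
import re

text = "<p>test</p>"

# Greedy: .+ runs to the last ">" it can find, swallowing the whole line.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the first ">" it meets, so each tag is matched on its own.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<p>test</p>']
print(lazy)    # ['<p>', '</p>']
```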
RegEx Basics Variations / Character Classes []
A variation (character class) is a set of literal characters, any one of which can fill a single space. For example, th[ea]n matches both “then” and “than.” The characters in the variation aren’t a GROUP. What the RegEx [then|than] is telling the computer is, “Find any of: a t, an h, an e, an n, a pipe, a t, an h, an a, or an n.” That’s not what we want.
Text Well then I’m better than you.
RegEx th[ea]n
Text Well then I’m better than you.
RegEx [then|than]
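Running both patterns side by side shows the problem. A minimal sketch, assuming Python's re module and a straight apostrophe in the sample sentence:

```python
import re

text = "Well then I'm better than you."

# One slot that may hold "e" or "a": matches the two whole words.
words = re.findall(r"th[ea]n", text)

# A class is always ONE character wide, so this finds single letters only.
single_chars = re.findall(r"[then|than]", text)

print(words)         # ['then', 'than']
print(single_chars)  # a list of one-letter matches such as 'e', 't', 'h', ...
```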
RegEx Basics Groups ()
In the case above, we could also use a group to solve our problem, although a group isn’t the best answer there. Groups are for alternation and/or quantification.
Text Well then I’m better than you.
RegEx (then|than)
Text I like redredred apples.
RegEx (blue|green|red)+
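Both group examples can be checked in a few lines. A minimal sketch, assuming Python's re module; note that when a group is repeated, the group itself only remembers its last repetition.

```python
import re

# Alternation: the group matches either whole word.
choices = re.findall(r"(then|than)", "Well then I'm better than you.")

# Quantification: the + applies to the whole group, not just the last letter.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.")

print(choices)            # ['then', 'than']
print(repeated.group(0))  # 'redredred' (the full match)
print(repeated.group(1))  # 'red' (the last repetition of the group)
```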
RegEx Basics Variables / Captured Groups $1
When you use a group, it captures the information in a numbered variable. They count up from $1. You can use the variable when doing a find-replace.
Text https://www.searchersforbeerfridges.com/?vote_number=9001
RegEx Find .+?//(.*?)/.*
RegEx Replace $1
New Text www.searchersforbeerfridges.com
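The same find-replace can be sketched in Python, with one caveat: Python's re.sub writes the captured group as \1, not $1 as in the editor. Everything else below is taken from the slide.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 captures everything between "//" and the next "/".
pattern = r".+?//(.*?)/.*"

# Replace the whole match with just the captured group (\1 here, $1 in the editor).
new_text = re.sub(pattern, r"\1", url)

print(new_text)  # www.searchersforbeerfridges.com
```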
Practical SEO Uses
Practical SEO Uses Google Analytics – Branded Organic
In Analytics I often want to find branded organic search traffic. Let’s look at the GWT data in Analytics for our fictional client, Lett.Me.
Lett.Me has a ton of common mistypings and variations. They get traffic from lm, lm.com, let me, lettme.com, letme.com, let.me, and lett.me. What’s the regular expression that captures all of that?
Practical SEO Uses Google Analytics – Branded Organic
Here’s the regular expression I came up with (the first RegEx below). It matches some funky cases like “let me.com,” but that’s fine. You can also remove the square brackets, but I feel like it’s easier to read with them in. Without them it looks like the second RegEx below. Now just save this RegEx in your reporting document and you’ll never have to type out the whole thing again. Imagine what this could do for reporting on keyword groups!
RegEx Find (lm|let[t]?[ ]?[\.]?me)(\.com)?
RegEx Find (lm|let{1,2} ?\.?me)(\.com)?
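Before saving a pattern like this, it is worth checking it against every variation on the list. A minimal sketch, assuming Python's re module; fullmatch is used here so that each keyword has to match from start to finish, which is stricter than a partial-match filter in a reporting tool. The non-brand keyword "letter" is made up for the check.

```python
import re

with_brackets = re.compile(r"(lm|let[t]?[ ]?[\.]?me)(\.com)?")
without_brackets = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

branded = ["lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me"]

# Every branded variation should match both spellings of the pattern.
for keyword in branded:
    print(keyword,
          bool(with_brackets.fullmatch(keyword)),
          bool(without_brackets.fullmatch(keyword)))

# The funky case from the slide matches too; an unrelated word does not.
funky_matches = bool(with_brackets.fullmatch("let me.com"))
letter_matches = bool(with_brackets.fullmatch("letter"))
print(funky_matches, letter_matches)
```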
Practical SEO Uses Trim To Root
Trim to Root using Find Replace. Here’s the list:
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2
What’s the RegEx?
RegEx Find ^.*?//(.*?)/.*
RegEx Replace $1
Practical SEO Uses Fixing HTML – Nested Tags
I commonly get improperly formatted HTML. Here’s an example:
<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>
I want to remove all of the empty tags. What’s the RegEx?
RegEx Find <[a-z0-9]{1,6}></[a-z0-9]{1,6}>
RegEx Replace (nothing; leave the replace field empty)
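One thing to watch with nested tags: removing an inner empty tag can leave its parent empty, so the find-replace may need more than one pass. The sketch below assumes Python's re module and wraps the slide's pattern in a loop; the helper name strip_empty_tags and the nested test string are hypothetical. The pattern does not check that the opening and closing tag names agree, which is fine for cleanup work like this.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")

def strip_empty_tags(html: str) -> str:
    """Repeat the find-replace until a pass removes nothing."""
    while True:
        html, removed = EMPTY_TAG.subn("", html)
        if removed == 0:
            return html

sample = "\n".join([
    "<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>",
    "<h2></h2>",
    "<p>This is a great image!</p>",
    '<p><img src="http://site.com/sampleimage.png" /></p>',
])

cleaned = strip_empty_tags(sample)
print(cleaned)

# A nested case needs two passes: <b></b> goes first, then the now-empty <p></p>.
nested = strip_empty_tags("<p><b></b></p>")
```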
Practical SEO Uses Top Level Domains
Find only .bs and .spam top level domains. Here’s the list:
http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/
What’s the RegEx?
RegEx Find .*//(.*?)\.(bs|spam)/.*
RegEx Replace $1
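To confirm the pattern ignores "bs" and "spam" when they show up as a domain name, sub-domain, or path, the list can be filtered in a few lines. A minimal sketch, assuming Python's re module:

```python
import re

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# The TLD has to sit directly between a dot and a slash: ".bs/" or ".spam/".
TLD = re.compile(r".*//(.*?)\.(bs|spam)/.*")

flagged = [url for url in urls if TLD.match(url)]
print(flagged)
```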
Practical SEO Uses Finding Substrings in Domains
Does the domain contain the words “directory” or “article”? The list:
http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com
What’s the RegEx? (If you can match bonus articles without the trailing slash, I salute you!)
RegEx Find ^.*?//.*(directory|article).*?(/|\..{2,3}$).*
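Here is a quick check of that pattern against the list, sketched in Python's re module (an assumption; the deck uses an editor). It also shows one limit worth knowing about: the pattern only asks for a slash somewhere after the keyword, so a keyword in a path folder slips through. The example.com URL is hypothetical.

```python
import re

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

IN_DOMAIN = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

hits = [url for url in urls if IN_DOMAIN.match(url)]
print(hits)

# Caveat: a keyword followed by a later slash in the PATH also matches.
path_false_positive = bool(IN_DOMAIN.match("http://example.com/directory/"))
print(path_false_positive)
```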
Practical SEO Uses Merging Lists
Does the list of URLs contain domains we’ve already disavowed? Say we’re doing a reconsideration request and we don’t want to consider any of the links we’ve already disavowed. So, we have List A, new links with some old links mixed in, that we want cleansed of any of the domains in List B. It’s a whole process. What do you think it is?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/
List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
Practical SEO Uses Merging Lists
First I’d use one of the tricks we learned already to format List B in an easier to manipulate way. I’ve bolded it below. What do you think the RegEx F/R is to get that?
List A http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/
List B http://directorylinks.com/article http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/
RegEx Find ^.*?//(.*?)/.*
RegEx Replace $1
Practical SEO Uses Merging Lists
Great. Now, we’ve learned how to search for substrings (string is a substring of substrings, if that isn’t confusing). How might we turn List B into a set of variations of substrings that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx?
List A http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/
List B directorylinks.com spam.com mafia-wars.com 192.233.111
RegEx Find \n
RegEx Replace |
Practical SEO Uses Merging Lists
If you did it right, you should have what I’ve currently listed under List B. What’s the final step we need to be able to search List A with the substrings in List B?
List A http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/
List B directorylinks.com|spam.com|mafia-wars.com|192.233.111
RegEx Find .*(directorylinks.com|spam.com|mafia-wars.com|192.233.111).*
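The whole Merging Lists process can be strung together as a rough sketch in Python (an assumption; the slides do each step by hand in the editor). It follows the three steps as described: trim List B to root, swap newlines for pipes, then wrap the result in .*( ).* and use it on List A. Python writes the captured group as \1 where the editor uses $1, and needs the MULTILINE flag so that ^ works on every line. The dots in the domains are left unescaped, as on the slide.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = "\n".join([
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
])

# Step 1: trim every line of List B to its root domain.
roots = re.sub(r"^.*?//(.*?)/.*", r"\1", list_b, flags=re.MULTILINE)

# Step 2: replace each newline with a pipe to get one long alternation.
alternation = re.sub(r"\n", "|", roots)

# Step 3: wrap it in a group with wildcards and keep only the non-matching links.
disavowed = re.compile(".*(" + alternation + ").*")
clean_list_a = [url for url in list_a if not disavowed.match(url)]

print(alternation)
print(clean_list_a)
```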
Practical SEO Uses Finding Client Anchor in HTML
Screaming Frog lets you use Regular Expressions in your searches. One use of this feature is finding out whether someone is actually linking to your website, because all legitimate anchors share the same format:
<a (any or no attributes) href="any variation of your URL" (any or no attributes)>(possible other tags)anchor text(possible other tags)</a>
In the attached HTML document, find all 3 links to Mooz.com. Bonus: Find only the 2 links to Mooz.com that contain the anchor text “Cow Melk” or “Milk.”
RegEx Find <a.{0,100}href=.{0,100}mooz\.com
RegEx Find (Bonus) <a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
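The attached HTML document isn't reproduced here, so the sketch below uses a made-up stand-in with the same shape: three links to mooz.com, two of them with the target anchor text, plus a plain mention and a link to another site. It assumes Python's re module and searches one line at a time; the patterns are the two from the slide.

```python
import re

# Hypothetical stand-in for the attached sample HTML.
sample_html = [
    '<a href="http://www.mooz.com/">Milk</a>',
    '<a class="ext" href="https://mooz.com/cows"><b>Cow Melk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<p>Visit mooz.com for Milk</p>',
    '<a href="http://example.com/">Milk</a>',
]

ANY_LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")
ANCHOR_LINK = re.compile(r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)")

# All real links to mooz.com; the plain-text mention is ignored.
links = [line for line in sample_html if ANY_LINK.search(line)]

# Only the links whose anchor text is "Cow Melk" or "Milk".
anchor_links = [line for line in sample_html if ANCHOR_LINK.search(line)]

print(len(links), len(anchor_links))
```

The {0,100} limits keep a match from running far past the link it started in, which matters more when a whole page is searched as a single string.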
RegEx Puzzles for Homework
RegEx Puzzles for Homework Resources
Sample HTML https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing Sample URLs https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing
RegEx Puzzles for Homework Puzzles
Some Puzzles:
• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.
RegEx Puzzles for Homework No Sub-Domains
Show only the domain, no sub-domain, with a find-replace.
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/
http://screw.you.regex.net/
What’s the RegEx?
RegEx Find ^.*?//(.*\.)*(.*)\.(.{2,3})/.*
RegEx Replace $2.$3
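Here is the no-sub-domain puzzle as a sketch in Python's re module (an assumption; \2 and \3 stand in for the editor's $2 and $3, and MULTILINE makes ^ work per line). It also shows a limit of the {2,3} top-level-domain rule: a two-part ending such as .co.uk gets treated as domain plus TLD. The example.co.uk URL is hypothetical.

```python
import re

urls = "\n".join([
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
])

# Group 1 soaks up any sub-domains, group 2 is the domain, group 3 the TLD.
NO_SUBDOMAIN = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*", re.MULTILINE)

domains = NO_SUBDOMAIN.sub(r"\2.\3", urls).splitlines()
print(domains)

# Caveat: two-part endings are not understood.
two_part = NO_SUBDOMAIN.sub(r"\2.\3", "http://www.example.co.uk/")
print(two_part)
```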
RegEx Puzzles for Homework Blog or RSS
In the attached sample-urls.txt, find all links that are obviously from a blog or RSS feed. What’s the RegEx?
RegEx Find .*(/blog|/article|feed\.|/feed).*
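The attached sample-urls.txt isn't included here, so this sketch runs the pattern over a handful of made-up URLs instead. It assumes Python's re module; every URL below is hypothetical.

```python
import re

# Hypothetical stand-ins for the attached sample-urls.txt.
sample_urls = [
    "http://example.com/blog/seo-tips",
    "http://feed.example.org/latest",
    "http://example.net/feed",
    "http://example.com/article/42",
    "http://example.com/products",
    "http://example.com/about",
]

BLOG_OR_RSS = re.compile(r".*(/blog|/article|feed\.|/feed).*")

blog_like = [url for url in sample_urls if BLOG_OR_RSS.match(url)]
print(blog_like)
```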
RegEx Puzzles for Homework Comma Separated Domains
Format a list of links as domains in a comma separated list. The links:
http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/
Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx?
RegEx Find (|\n).*//(.*?)/.*
RegEx Replace $2, 
Then delete the trailing comma.
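A sketch of the comma-separated puzzle in Python's re module (an assumption; \2 stands in for the editor's $2). The domain group here is lazy, (.*?), so that links with several slashes in the path give up only the domain. The (|\n) at the front is what eats the line breaks between links.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# Each match swallows an optional leading newline plus one whole link,
# and is replaced with the domain followed by a comma and a space.
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma (and the space after it).
comma_separated = joined.rstrip(", ")
print(comma_separated)
```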
http://www.smbc-comics.com/
Questions?
Thanks for Hanging Out
Stay in Touch
Twitter: @troyfawkes
Google+: http://gplus.to/TroyFawkes
Email: [email protected]
www.poweredbysearch.com
www.troyfawkes.com