regular expressions (regex) for seo

Regular Expressions for SEO The Coolest Pattern Matching Search Language...

Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant

For Powered by Search Internal | October 2013

We’re in business because we believe that great brands need both voice and visibility in order to connect people with what matters. A boutique, full-service digital marketing agency in Toronto, Powered by Search is a PROFIT HOT 50-ranked agency that delivers search engine optimization, pay per click advertising, local search, social media marketing, and online reputation management services.

RegEx Basics

Practical SEO Uses

RegEx Puzzles for Homework

Regular Expressions for SEO

http://xkcd.com/

RegEx Basics

RegEx Basics Use Sublime Text

This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too. It’s the text editor you’ll fall in love with.

RegEx Basics Literal Matching

Text I want to match this.

RegEx match this

RegEx matches literal strings. This is like running a normal search in Word. Pretty cool, huh?
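To try the slide's example outside a text editor, here is a minimal sketch in Python. Python's re module is an assumption on my part (the deck itself uses Sublime Text), but literal matching works the same way.

```python
import re

text = "I want to match this."

# A pattern with no special characters matches itself, character for character.
match = re.search(r"match this", text)

print(match.group(0))  # the text that was matched
print(match.start())   # the position in the string where the match begins
```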

RegEx Basics Anchors

Text I want this, I want that, I want I want I want

RegEx ^I want

There are a couple of special characters called “Anchors.” The caret (^) represents the beginning of a line. The dollar sign ($) represents the end of a line. You see these a lot in .htaccess files.

Text I want this, I want that, I want I want I want

RegEx I want$
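A quick way to see what the anchors do is to count matches with and without them. This is a sketch in Python (my choice of language, not the deck's) using the slide's sample text.

```python
import re

text = "I want this, I want that, I want I want I want"

# Without an anchor, every occurrence matches.
anywhere = [m.start() for m in re.finditer(r"I want", text)]

# ^ only allows a match at the beginning of the line.
at_start = [m.start() for m in re.finditer(r"^I want", text)]

# $ only allows a match that finishes at the end of the line.
at_end = [m.start() for m in re.finditer(r"I want$", text)]

print(len(anywhere), at_start, at_end)
```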

RegEx Basics Special Characters

There are also a series of other special characters. These are:

• [ - Starts a Character Class (More Later)
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.

To match any of these characters literally, put a backslash (\) in front of it. This also applies to ? + * ^ $, which we’ve talked about or will get to later.
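Escaping is easiest to see with the wildcard. The sketch below is in Python (an assumption, the deck uses Sublime Text) and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text: one real domain and one look-alike string.
text = "Visit lett.me or lettXme today"

# Unescaped, the dot is a wildcard and matches any character.
unescaped = re.findall(r"lett.me", text)

# Escaped, the dot only matches a literal dot.
escaped = re.findall(r"lett\.me", text)

# re.escape adds the backslashes for you when the text comes from elsewhere.
auto_escaped = re.escape("lett.me")

print(unescaped, escaped, auto_escaped)
```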

RegEx Basics Quantifiers

A quantifier tells the RegEx how many times to match the character, character class, or group right before it.

• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times

Text Ahhhhhhhhhhh. A spider.

RegEx A[h]+
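Here is a rough side-by-side of the quantifiers on the slide's spider text, sketched in Python (my choice of language). The patterns other than A[h]+ are my own variations for comparison.

```python
import re

text = "Ahhhhhhhhhhh. A spider."

plus = re.findall(r"A[h]+", text)      # one or more h: skips the lone "A"
star = re.findall(r"Ah*", text)        # zero or more h: also finds the lone "A"
optional = re.findall(r"Ah?", text)    # zero or one h
exact = re.findall(r"Ah{3}", text)     # exactly three h
ranged = re.findall(r"Ah{2,5}", text)  # between two and five h, as many as possible

for name, result in [("+", plus), ("*", star), ("?", optional),
                     ("{3}", exact), ("{2,5}", ranged)]:
    print(name, result)
```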

RegEx Basics Greedy vs. Non-Greedy

Quantifiers are greedy by default: they repeat the expression before them as many times as possible before giving up and continuing with the rest of the RegEx. This leads to unexpected matches. To make the quantifiers *, +, and {} non-greedy (lazy), just add a question mark after them.

Text <p>test</p>

RegEx (Greedy) <.+>

Text <p>test</p>

RegEx (Lazy) <.+?>
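The difference shows up as soon as you list the matches. A small Python sketch (Python is my assumption here) with the slide's two patterns:

```python
import re

text = "<p>test</p>"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
greedy = re.findall(r"<.+>", text)

# Lazy: .+? grabs as little as it can, so each match stops at the FIRST ">".
lazy = re.findall(r"<.+?>", text)

print(greedy)
print(lazy)
```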

RegEx Basics Variations / Character Classes []

A variation is a set of literal characters, any one of which can fill a single space. For example:

Text Well then I’m better than you.

RegEx th[ea]n

The characters in the variation aren’t a GROUP. What the following RegEx is telling the computer is, “Find any of: a t, an h, an e, an n, a pipe, a t, an h, an a, or an n.” That’s not what we want.

RegEx [then|than]
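To see why the second pattern is a mistake, compare what each one actually returns. This is a sketch in Python (my assumption, not the deck's tool), with a plain apostrophe in the sample text.

```python
import re

text = "Well then I'm better than you."

# One slot that may hold an e or an a: matches whole words.
right = re.findall(r"th[ea]n", text)

# One slot that may hold t, h, e, n, a, or a pipe: matches single characters.
wrong = re.findall(r"[then|than]", text)

print(right)
print(wrong)
```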

RegEx Basics Groups ()

In the case above, we could also use a group to solve our problem. A group isn’t the best answer there (the variation th[ea]n is simpler). Groups are for alternation and/or quantification.


RegEx (then|than)

Text I like redredred apples.

RegEx (blue|green|red)+
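Both uses of a group, alternation and quantification, can be sketched in a few lines of Python (an assumption on my part, any RegEx tool would do).

```python
import re

# Alternation: the group matches either whole word.
alternation = re.findall(r"(then|than)", "Well then I'm better than you.")

# Quantification: the + applies to the whole group, not just one character.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.")

print(alternation)
print(repeated.group(0))  # everything the repeated group matched
print(repeated.group(1))  # the group itself only remembers its last repetition
```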

RegEx Basics Variables / Captured Groups $1

When you use a group, it captures whatever it matched in a numbered variable. The variables count up from $1, in the order the groups open. You can use the variables when doing a find-replace.

Text https://www.searchersforbeerfridges.com/?vote_number=9001

RegEx Find .+?//(.*?)/.*

RegEx Replace $1

New Text www.searchersforbeerfridges.com
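The same find-replace can be sketched in Python. One assumption to flag: Python writes the captured group as \1 in the replacement where Sublime Text uses $1.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 lazily captures everything between "//" and the next "/".
# The rest of the pattern swallows the remainder so it gets replaced too.
new_text = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(new_text)
```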

Practical SEO Uses

Practical SEO Uses Google Analytics – Branded Organic

In Analytics I often want to find branded organic search traffic. Let’s look at the GWT (Google Webmaster Tools) data in Analytics for our fictional client, Lett.Me.

Lett.Me has a ton of common mistypings and variations. They get traffic from lm, lm.com, let me, lettme.com, letme.com, let.me, and lett.me. What’s the regular expression that captures all of that?

Here’s the regular expression I came up with. It matches some funky cases like let me.com but that’s fine:

RegEx Find (lm|let[t]?[ ]?[\.]?me)(\.com)?

You can also remove the square brackets, but I feel like it’s easier to read with them in. Without them it looks like this:

RegEx Find (lm|let{1,2} ?\.?me)(\.com)?

Now just save this RegEx in your reporting document and you’ll never have to type out the whole thing again. Imagine what this could do for reporting on keyword groups!
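Before pasting either version into Analytics, it's worth checking that both catch every variation listed. A rough check in Python (my choice of tool), where the unbranded keywords are made-up examples:

```python
import re

with_brackets = re.compile(r"(lm|let[t]?[ ]?[\.]?me)(\.com)?")
without_brackets = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

# The variations listed on the slide.
branded = ["lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me"]

# Hypothetical non-branded keywords, invented for this check.
unbranded = ["apartments for rent", "toronto condos"]

def is_branded(keyword, pattern):
    """Partial match anywhere in the keyword."""
    return pattern.search(keyword) is not None

for keyword in branded + unbranded:
    print(keyword, is_branded(keyword, with_brackets), is_branded(keyword, without_brackets))
```

Because the match is partial, a keyword like "film" would also count as branded (it contains lm), so skim the results before trusting the numbers.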

Practical SEO Uses Trim To Root

Trim to Root using Find Replace. Here’s the list:

http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2

What’s the RegEx?

RegEx Find ^.*?//(.*?)/.*

RegEx Replace $1
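For a list that's too long for a text editor, the same find-replace can be run in a script. A minimal sketch, assuming Python, where \1 plays the role of $1 and trim_to_root is just a name I picked:

```python
import re

ROOT = re.compile(r"^.*?//(.*?)/.*")

def trim_to_root(url):
    """Replace the whole URL with group 1, the part between '//' and the next '/'."""
    return ROOT.sub(r"\1", url)

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/?pg=2",
]

roots = [trim_to_root(u) for u in urls]
print(roots)
```

Note that a URL with no slash after the domain won't match, so it comes back unchanged.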

Practical SEO Uses Fixing HTML – Nested Tags

I commonly get improperly formatted HTML. Here’s an example:

<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>

I want to remove all of the empty tags. What’s the RegEx?

RegEx Find <[a-z0-9]{1,6}></[a-z0-9]{1,6}>

RegEx Replace (nothing, leave the field empty)
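One thing a single pass can't do is catch a tag that only becomes empty once the empty tags inside it are gone. The sketch below, in Python (my assumption) with a helper name I made up, repeats the replace until nothing changes. Keep in mind the pattern doesn't check that the opening and closing tag names agree.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")

def remove_empty_tags(html):
    """Delete empty tag pairs, repeating so newly emptied parents go too."""
    while True:
        html, count = EMPTY_TAG.subn("", html)
        if count == 0:
            return html

html = (
    "<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>\n"
    "<h2></h2>\n"
    "<p>This is a great image!</p>\n"
    '<p><img src="http://site.com/sampleimage.png" /></p>'
)

print(remove_empty_tags(html))
```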

Practical SEO Uses Top Level Domains

Find only .bs and .spam top level domains. Here’s the list:

http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/

What’s the RegEx?

RegEx Find .*//(.*?)\.(bs|spam)/.*

RegEx Replace $1
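To double-check that the pattern ignores the decoys (bs and spam as a domain name, subdomain, or folder), here's a rough filter in Python, which is my choice of tool rather than the deck's.

```python
import re

# Same pattern as the slide: the TLD must be followed directly by a slash.
TLD_PATTERN = re.compile(r".*//(.*?)\.(bs|spam)/.*")

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

flagged = [u for u in urls if TLD_PATTERN.match(u)]
print(flagged)
```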

Practical SEO Uses Finding Substrings in Domains

Does the domain contain the words “directory” or “article”? The list:

http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com

What’s the RegEx? (If you can match bonus articles without the trailing slash, I salute you!)

RegEx Find ^.*?//.*(directory|article).*?(/|\..{2,3}$).*
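Here's a quick check of that pattern in Python (my assumption for tooling). The slide's RegEx handles the five URLs in the list, but it would also accept a folder named directory if a slash follows it. HOST_ONLY is a stricter variant of my own, not from the slides, that never looks past the first slash after the domain. The last URL is a made-up example to show the difference.

```python
import re

SLIDE_PATTERN = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

# Assumption: my own stricter variant. [^/]* cannot cross into the path.
HOST_ONLY = re.compile(r"^.*?//[^/]*(directory|article)")

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

slide_hits = [u for u in urls if SLIDE_PATTERN.search(u)]
host_hits = [u for u in urls if HOST_ONLY.search(u)]

# Hypothetical URL where the word sits in a folder, not in the domain.
tricky = "http://example.com/directory/page"

print(slide_hits)
print(host_hits)
print(bool(SLIDE_PATTERN.search(tricky)), bool(HOST_ONLY.search(tricky)))
```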

Practical SEO Uses Merging Lists

Does the list of URLs contain domains we’ve already disavowed? Say we’re doing a reconsideration request and we don’t want to consider any of the links we’ve already disavowed. So, we have List A, new links with some old links mixed in, that we want cleansed of any of the domains in List B. It’s a whole process. What do you think it is?

List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/


First I’d use one of the tricks we learned already to trim List B down to bare domains, which are easier to manipulate. What do you think the RegEx F/R is to get that?

RegEx Find ^.*?//(.*?)/.*

RegEx Replace $1


Great. Now, we’ve learned how to search for substrings. How might we turn List B into a single set of alternatives, like (this|that|the other), that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx?

List B
directorylinks.com
spam.com
mafia-wars.com
192.233.111

RegEx Find \n

RegEx Replace |


If you did it right, you should have what I’ve currently listed under List B. What’s the final step we need to be able to search List A with the substrings in List B?


List B directorylinks.com|spam.com|mafia-wars.com|192.233.111
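The slides leave the final step as a question, so treat this as one possible way to finish the job rather than the official answer. The sketch runs the whole process in Python (my assumption): trim List B to domains, escape the dots, join with pipes, then keep only the List A links that don't match.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]

list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim List B to root domains (the Trim To Root find-replace).
domains = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in list_b]

# Step 2: join with | and escape each domain so "." only matches a literal dot.
disavowed = re.compile("|".join(re.escape(d) for d in domains))

# Step 3: keep the List A links that contain none of the disavowed domains.
cleaned = [url for url in list_a if not disavowed.search(url)]

print(disavowed.pattern)
print(cleaned)
```

This is a plain substring search, so a disavowed spam.com would also knock out a link from notspam.com. For a reconsideration request that's probably worth a manual look.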

Practical SEO Uses Finding Client Anchor in HTML

Screaming Frog lets you use Regular Expressions in your searches. One use of this feature is finding out whether someone is actually linking to your website, because all legitimate anchors share the same format:

<a (any or no attributes) href="any variation of your URL" (any or no attributes)>(possible other tags)anchor text(possible other tags)</a>

In the attached HTML document, find all 3 links to Mooz.com. Bonus: Find only the 2 links to Mooz.com that contain the anchor text “Cow Melk” or “Milk.”

RegEx Find <a.{0,100}href=.{0,100}mooz\.com

RegEx Find (Bonus) <a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
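The attached HTML document isn't reproduced here, so the sketch below uses a few made-up lines of HTML to exercise both patterns. It's in Python (my assumption), and every sample link is hypothetical.

```python
import re

ANY_LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")
BONUS_LINK = re.compile(r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)")

# Hypothetical sample lines, invented for this sketch.
html_lines = [
    '<a href="http://www.mooz.com/">Cow Melk</a>',
    '<a class="nav" href="https://mooz.com/products"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<a href="http://example.com/">Milk</a>',
    "<p>mooz.com gets a mention here, but no link.</p>",
]

all_links = [line for line in html_lines if ANY_LINK.search(line)]
bonus_links = [line for line in html_lines if BONUS_LINK.search(line)]

print(len(all_links), len(bonus_links))
```

The {0,100} limits keep the match from wandering off into the next link on a long line, at the cost of missing anchors with very long attributes.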

RegEx Puzzles for Homework

RegEx Puzzles for Homework Resources

Sample HTML https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing

Sample URLs https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing

RegEx Puzzles for Homework Puzzles

Some Puzzles:

• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.

RegEx Puzzles for Homework No Sub-Domains

Show only the domain, no sub-domain, with a find-replace.

http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/
http://screw.you.regex.net/

What’s the RegEx?

RegEx Find ^.*?//(.*\.)*(.*)\.(.{2,3})/.*

RegEx Replace $2.$3
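A small sketch of this find-replace in Python (my assumption for tooling, with \2.\3 standing in for $2.$3). It works on the four URLs in the puzzle. As far as I can tell it would trip on TLDs longer than three letters, two-part endings like .co.uk, and paths that contain a dot followed by a slash, so test it on your own list first.

```python
import re

NO_SUBDOMAIN = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

def strip_subdomain(url):
    """Keep group 2 (the domain name) and group 3 (the TLD), drop everything else."""
    return NO_SUBDOMAIN.sub(r"\2.\3", url)

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

domains = [strip_subdomain(u) for u in urls]
print(domains)
```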

RegEx Puzzles for Homework Blog or RSS

In the attached sample-urls.txt, find all links that are obviously from a blog or RSS feed. What’s the RegEx?

RegEx Find .*(/blog|/article|feed\.|/feed).*
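Since sample-urls.txt isn't included here, this sketch borrows four links from the comma separated domains puzzle and adds two made-up feed URLs. It's in Python, which is my choice rather than the deck's.

```python
import re

BLOG_OR_FEED = re.compile(r".*(/blog|/article|feed\.|/feed).*")

urls = [
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
    "http://feed.example.com/latest",  # hypothetical feed sub-domain
    "http://example.com/feed",         # hypothetical feed folder
]

blog_like = [u for u in urls if BLOG_OR_FEED.match(u)]
print(blog_like)
```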

RegEx Puzzles for Homework Comma Separated Domains

Format a list of links as domains in a comma separated list. The links:

http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/

Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx?

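RegEx Find (|\n).*//(.*?)/.*

RegEx Replace $2, (that’s $2, a comma, and a space)

Then delete the trailing comma. The domain group is lazy, (.*?), so it stops at the first slash. With a greedy (.*) a link like http://www.cio.com/article/738249/ would keep part of its path.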
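Here's the same puzzle sketched in Python (my assumption for tooling), where \2 stands in for $2. It uses a lazy (.*?) for the domain group so the capture stops at the first slash after the domain, and it trims the trailing comma in code instead of by hand.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# (|\n) swallows the newline before each link, which joins the lines.
# Group 2 lazily captures the domain between "//" and the next "/".
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Every replacement ends in ", " so strip the last one.
comma_separated = joined.rstrip(", ")

print(comma_separated)
```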

http://www.smbc-comics.com/




Questions?

Thanks for Hanging Out

Stay in Touch

Twitter: @troyfawkes
Google+: http://gplus.to/TroyFawkes
Email: [email protected]

www.poweredbysearch.com

www.troyfawkes.com
