Learning to read urls

Finding the word boundaries in multi-word domain names with python and sklearn.

Calvin Giles

Posted 12-Jul-2015

Page 1: Learning to read urls

Learning to read urls

Finding the word boundaries in multi-word domain names with python and sklearn.

Calvin Giles

Page 2: Learning to read urls

Who am I?

Data Scientist at Adthena
PyData Co-Organiser
Physicist
Like to solve problems pragmatically

Page 3: Learning to read urls

The Problem

Given a domain name:

'powerwasherchicago.com' 'catholiccommentaryonsacredscripture.com'

Find the concatenated sentence:

'power washer chicago (.com)' 'catholic commentary on sacred scripture (.com)'

Page 4: Learning to read urls

Why is this useful?

How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'?

How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'?

Domains resolved into words can be compared on a semantic level, not simply as strings.
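The comparison above can be sketched with a simple set-based similarity. This is an illustration only (the tokenised words and the second domain are invented for the example, not taken from the talk):

```python
def jaccard(a, b):
    """Jaccard similarity of two word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# As raw strings, 'powerwasherchicago' and 'chicagopowertools' look unrelated;
# as resolved word sets, the overlap in meaning becomes measurable.
d1 = 'power washer chicago'.split()
d2 = 'chicago power tools'.split()  # hypothetical second domain
similarity = jaccard(d1, d2)  # 2 shared words out of 4 distinct -> 0.5
```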

Page 5: Learning to read urls

Primary use case

Given 500 domains in a market, what are the themes?

Page 6: Learning to read urls

Scope of project

As part of our internal idea incubation, Adthena Labs, this approach was developed during a one-day hack to determine whether it could be useful to the business.

Page 7: Learning to read urls

Adthena's Data

> 10 million unique domains
> 50 million unique search terms

3rd Party Data

Project Gutenberg (https://www.gutenberg.org/)
Google ngram viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

Page 8: Learning to read urls

Process

1. Learn some words

2. Find where words occur in a domain name

3. Choose the most likely set of words

Page 9: Learning to read urls

1. Learn some words

Build a dictionary using suitable documents.

Documents: search terms

In [2]: import pandas, os
search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]

Out[2]: 1000000    new 2014 mercedes benz b200 cdi
3000000    weight watchers in glynneath
5000000    property for rent in batlow nsw
7000000    us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object

In [125]: from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # Require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())

Page 10: Learning to read urls

In [126]: st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

Out[126]:

21.4k words found

['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']

Page 11: Learning to read urls

We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:

In [127]: dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary

Page 12: Learning to read urls

Actually, scrap that...

...and use the google ngram viewer datasets:

Page 13: Learning to read urls

In [212]: dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"

Page 14: Learning to read urls

That takes us to ~1M words!

We even get some good two-letter words to work with:

In [130]: print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))

142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he', 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']

Page 15: Learning to read urls

In [144]: from numpy.random import choice  # import not shown in the original notebook
choice(list(dictionary), size=40)

Out[144]: array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley', 'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin', 'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey', 'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer', 'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra', 'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts', 'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu', 'conoció'], dtype='<U81')

Page 16: Learning to read urls

2. Find where words occur in a domain name

Find all substrings of a domain that are in our dictionary, along with their start and end indices.

Page 17: Learning to read urls

In [149]: def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indices = ((start, start + length)
                        for start in range(len(string))
                        for length in range(1, longest_word + 1))
    for start, end in substring_indices:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end
            yield substring, start, start + len(substring)

Page 18: Learning to read urls

In [234]: domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag', 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica', 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher', 'was', 'wash', 'washe', 'washer', 'we', 'wer']

Page 19: Learning to read urls

In [235]: domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat', 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm', 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred', 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta', 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men', 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta', 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu', 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr', 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar', 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']

Page 20: Learning to read urls

3. Choose the most likely set of words

Simple approach to do this:

1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping
3. Decide how likely the domain is given a particular subset
4. Decide how likely it is that the subset would occur overall
5. Determine the best subset

i.e. find argmax_s P(s|d) = argmax_s P(d|s)P(s)

Page 21: Learning to read urls

We need some domain name data for the next part...

In [153]: domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace("\.com$", "")
domains = domains.str.replace("^https?\:\/\/", "")
domains = domains.str.replace("^www\d?\.", "")
num_fmt(len(domains))

Out[153]: '3.8M'

Page 22: Learning to read urls

In [224]: choice(domains, size=20)

Out[224]: array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart', 'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain', 'miniot-online', 'chelseabarracksbarracks', 'zeroeasy', 'newlookupholstery', 'radcliffehealth', 'embracingthemundane', 'immunityassist', 'simplynostretchmarks', 'teachmetoswim', 'thetford-europe', 'charlesallenford', 'china-chargermanufacturer', 'coolbabykid'], dtype=object)

Page 23: Learning to read urls

1. Find all subsets of the set of words found

There are 2^n different sentences that can be constructed from n substrings, including the empty sentence. We can get an idea how bad that will be with a sample of the data.

Page 24: Learning to read urls

In [53]: longest_word = max(len(word) for word in dictionary)  # speeds up search

def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))

In [56]: import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)

Out[56]: count    1k
mean    28.3
std     15.8
min     1
25%     17
50%     26
75%     38
max     93
Name: Domain, dtype: object

In [227]: num_fmt(2**28), 2**93

Out[227]: ('268M', 9903520314283042199192993792)

Page 25: Learning to read urls

So the worst case in a sample of 1000 domains is 2^93 permutations to test!

Page 26: Learning to read urls

Combine steps 1 and 2

1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping

becomes:

1. Find all subsets with non-overlapping words2. Do nothing :-)
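The combined step can be sketched as a small recursive builder. This is a simplified illustration of the idea only (the talk's actual implementation is the find_sentences function, which also scores and prunes): because the words are sorted by start index, each recursive call only considers words that begin after the previous word ended, so overlapping subsets are never generated in the first place.

```python
def non_overlapping(words, current_end=0):
    """Enumerate every subset of non-overlapping (word, start, end) triples.
    `words` must be sorted by start index."""
    results = [[]]  # the empty sentence is always valid
    for i, (word, start, end) in enumerate(words):
        if start < current_end:
            continue  # this word would overlap the one before it
        # extend with every valid continuation that starts at or after `end`
        for tail in non_overlapping(words[i + 1:], end):
            results.append([word] + tail)
    return results

# (word, start, end) triples as find_words_in_string would yield them:
words = [('power', 0, 5), ('was', 5, 8), ('washer', 5, 11), ('her', 8, 11)]
subsets = non_overlapping(words)
# ['power', 'washer'] is generated; ['was', 'washer'] (overlapping) is not.
```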

Page 27: Learning to read urls

3.1 Find all subsets with non-overlapping words

Build a tree of subsets of non-overlapping words by sorting the words by their start index.

...and only return the "best" few cases anyway

It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can achieve some significant savings if we only consider sentences at least half as long as the best match.

In practice, this does not appear to have any impact on the results, but it prevents an explosion of sentences for particularly long domains.

Page 28: Learning to read urls

A little more code...

In [147]: def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.
    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if (current_idx >= len(string)
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(
                string=string, words=words[i+1:],
                part_sentence=new_part_sentence, sentences=sentences,
                threshold=threshold, current_idx=end_idx,
                current_score=new_score, best_score=best_score)
    return sentences, best_score

Page 29: Learning to read urls

Add a wrapper

In [148]: def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]

Page 30: Learning to read urls

In [64]: sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)

Out[64]:

245

array([['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'ashe', 'chica', 'go'], ['power', 'was', 'her', 'chica', 'go'], ['power', 'was', 'he', 'rch', 'cago'], ['power', 'was', 'her', 'chicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'ash', 'erc', 'hicago'], ['ower', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'was', 'her', 'chi', 'cago'], ['power', 'was', 'her', 'chic', 'ago'], ['power', 'as', 'he', 'rch', 'ica', 'go'], ['ower', 'washer', 'chicago'], ['owe', 'rwas', 'he', 'rch', 'ica', 'go'], ['power', 'washer', 'chic', 'go']], dtype=object)

Page 31: Learning to read urls

In [65]: sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)

Out[65]:

540428

array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'], ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'], ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'], ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'], ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'], ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'], ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'], ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'], ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'], ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'], ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'], ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'], ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'], ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'], ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']], dtype=object)

Page 32: Learning to read urls

In [71]: tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)

In [155]: tail_sentences.describe().apply(int).apply(num_fmt)

Out[155]: count    1k
mean    1.18k
std     10.7k
min     1
25%     12
50%     39
75%     145
max     280k
Name: Domain, dtype: object

Page 33: Learning to read urls

In [73]: domains.tail(1000)[tail_sentences <= 1].values

Out[73]: array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa', 'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic', 'mrojp', 'irozho', 'bienve'], dtype=object)

Page 34: Learning to read urls

In [74]: domains.tail(1000)[tail_sentences > 10000].values

Out[74]: array(['studentdebtreductioncenter', 'inspiredholisticwellness', 'forensicaccountingexpert', 'medicalintuitivetraining', 'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply', 'commercialrefrigerationrepairmiami', 'athenatrainingacademy', 'business-leadership-qualities', 'casaquetzalsanmigueldeallende', 'landscapedesignimagingsoftware', 'southcaliforniauniversity', 'replacementtractorpartsforsale', 'reinventinghealthcareinfo', 'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy', 'californiaconstructionjobs', 'margaritavilleislandhotel', 'whatstoressellgarciniacambogia'], dtype=object)

In [75]: [' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale ')[:10]]

Out[75]: ['replacement tractor parts forsale', 'replacement tractor parts forsa', 'replacement tractor parts forsa le', 'replacement tractor parts fors ale', 'replacement tractor parts fors al', 'replacement tractor parts fors le', 'replacement tractor parts for sale', 'replacement tractor parts for sal', 'replacement tractor parts for ale', 'replacement tractor parts for al']

Page 35: Learning to read urls

3.2 Decide how likely the domain is given a particular subset

A first approach would be to say that the probability decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length.

To sort by this probability, P(d|s), we can therefore use the following:

In [77]: def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))

Page 36: Learning to read urls

In [78]: domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]

Out[78]: [['power', 'washer', 'chicago'], ['pow', 'erw', 'asher', 'chicago'], ['powe', 'rwa', 'sher', 'chicago'], ['powe', 'rwas', 'her', 'chicago'], ['powe', 'rwas', 'herc', 'hicago'], ['power', 'was', 'her', 'chicago'], ['power', 'was', 'herc', 'hicago'], ['power', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'washe', 'rch', 'icago'], ['power', 'washer', 'chi', 'cago'], ['power', 'washer', 'chic', 'ago'], ['power', 'washer', 'chica', 'go'], ['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'as', 'herc', 'hicago']]

Page 37: Learning to read urls

In [79]: domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]

Out[79]: [['catholic', 'commenta', 'ryon', 'sacred', 'scripture'], ['catholic', 'commenta', 'ryons', 'acred', 'scripture'], ['catholic', 'commentar', 'yon', 'sacred', 'scripture'], ['catholic', 'commentar', 'yons', 'acred', 'scripture'], ['catholic', 'commentary', 'on', 'sacred', 'scripture'], ['catholic', 'commentary', 'ons', 'acred', 'scripture'], ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'], ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'], ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'], ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'], ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'], ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]

Page 38: Learning to read urls

Let's see the top guesses for a selection of domains:

Page 39: Learning to read urls

In [105]: import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence

Page 40: Learning to read urls

In [ ]: def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)

Page 41: Learning to read urls

In [238]: subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # Correct guess for first 10 domains or -1
df[['correct'] + list(range(3))]

Out[238]:
                               correct  0                          1                          2
hedgefundupdate.com                  0  hedge fundupdate           hedge fundupdate           he dge fundupdate
traveldailynews.com                  3  traveldailynews            tra veldailynews           trav eldailynews
miriamkhalladi.com                  -1  miria mkhalladi            miriam khalladi            mir iam khalladi
poolheatpumpstore.com                0  pool heatpump store        pool heatpumps tore        poo lhe at pumpstore
blogorganization.com                 0  blogorganization           blo gorganization          blo gorganization
smallcapvoice.com                    2  smallcap voice             smal lcap voice            small cap voice
cefcorp.com                          0  cef corp                   c efc orp                  cef c orp
lightandmotionphotography.com        3  lightandmotionphotography  lightandmotionphotography  ligh tandmotionphotography
uggbootrepairs.com                   0  ugg bootrepairs            ugg boo trepairs           ugg boo trepairs
abundancesecrets.com                 0  abundancesecrets           abun dancesecrets          abund ancesecrets

Page 42: Learning to read urls

In [239]: %matplotlib inline
import matplotlib.pyplot as plt
import seaborn

correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1, 1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]

Page 43: Learning to read urls

In [240]: pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');

Page 44: Learning to read urls

In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 was correct 87 times.

Page 45: Learning to read urls

Is this good enough?

Primary use case: given 500 domains in a market, what are the themes?

Expect ~325 domains in theme clusters and ~175 distributed randomly.

This will probably still require human sanity checks.

Page 46: Learning to read urls

What can be done?

So far, we only consider the likelihood of a domain given a sentence.

But how likely is the sentence?

The next hack day is to develop a model for the sentence likelihood P(s).
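One possible shape for such a model is a smoothed unigram estimate, scoring a sentence as the product of its word probabilities. This is a sketch of one option, not what was built; the counts below are invented for illustration (in practice, the match_count column from the google ngram data could supply them):

```python
import math

# Invented word frequencies, for illustration only.
counts = {'power': 500, 'washer': 40, 'chicago': 300,
          'pow': 5, 'erw': 1, 'asher': 8}
total = sum(counts.values())

def log_p_sentence(sentence, alpha=1.0):
    """log P(s) under a unigram model with add-alpha smoothing, so words
    outside the counts table are penalised rather than zeroing P(s)."""
    vocab = len(counts)
    return sum(math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab))
               for w in sentence)

# A sentence of common words should outscore one built from rare fragments:
log_p_sentence(['power', 'washer', 'chicago']) > log_p_sentence(['pow', 'erw', 'asher', 'chicago'])
```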

Page 47: Learning to read urls

Determine the best sentence

argmax_s P(s|d)

From Bayes:

P(s|d) = P(d|s)P(s) / P(d)

Since P(d) is the same for all sentences, this can be ignored when finding the argmax:

argmax_s P(s|d) = argmax_s P(d|s)P(s)

Page 48: Learning to read urls

What was done

Trained dictionary using google ngram viewer data
Found word substrings in domains
Built sentences from words with crude cuts applied
Ordered predictions based on crude score function
Measured performance on 100 labelled domains

Page 49: Learning to read urls

What I used

Inspiration:

Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)

Libraries:

pandas, numpy, re
sklearn.feature_extraction.text.CountVectorizer

Functions:

build_dictionary(corpus, min_df=0)
find_words_in_string(string, dictionary, longest_word=None)
find_sentences(string, words, part_sentence, sentences, threshold=0.0)
get_sentences(domain, thresh=0.95)
score_d_given_s(sentence, domain)
guess(d, n_guesses=25)

Page 50: Learning to read urls

After training, it can be used like this:

In [211]: guess('powerwasherchicago')[0]

Out[211]: 'power washer chicago'

Page 51: Learning to read urls

What still needs to be done for performance

Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits.

Sentences need to be compared based on the likelihood of that sentence construction, i.e. P(s).

Additional words need to be incorporated into the dictionary.

Threshold hyper-parameters need tuning.

Page 52: Learning to read urls

...and to make it usable

Replace custom code with library functions where possible
Extend remaining code to support array and dataframe inputs
Make compatible with sklearn pipeline
Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
Optimise substring search

Page 53: Learning to read urls

Think you can do better?Get in touch:

[email protected]
@calvingiles

Page 54: Learning to read urls

In [122]: import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
        'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)