COMSM1402 Advanced Algorithms 2010

Raphael Clifford

November 5, 2010

Due: The coursework should be handed in at the start of the lecture on Friday, 17 December. This is both the normal and late deadline. Online submissions can be made up to midnight. Your marks will be based on the best five answers out of the first six questions plus your mark for question seven.

Problems:

1. The purpose of this question is to show a weakly universal class of hash functions H for which E[M] = ⌊√n − 1⌋. M is the maximum load, assuming n items are hashed into n slots using a universal family of hash functions. For positive n, we use the notation [n] := {0, ..., n − 1}.

Define a family H of hash functions from [n] to [n] as follows. Let ℓ be an integer, with 1 ≤ ℓ ≤ n. For each V ⊂ [n] of cardinality ℓ, we define the hash function h_V : [n] → [n] by the following property: h_V maps each element of V onto 0, and h_V maps [n] \ V injectively into [n] \ {0}. Note that h_V is not uniquely determined by this property, but we can always choose one h_V satisfying this property (verify). Define

H := {h_V : V ⊂ [n], |V| = ℓ}

Argue that H is weakly universal if ℓ ≤ √n − 1. Note that the maximum load always equals ℓ.

[10 points]
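To get a feel for the family in question 1 before arguing about it, the following is a minimal Python sketch, not part of the original handout. It builds one valid h_V for each V and computes, by brute-force enumeration, the exact fraction of the family in which two given keys collide. The function names `make_h_V` and `collision_probability` and the parameter name `ell` (standing for ℓ) are illustrative choices. The sketch is only practical for small n, and a numerical check does not replace the argument the question asks for.

```python
from fractions import Fraction
from itertools import combinations


def make_h_V(n, V):
    """Return one valid h_V as a dict: V -> 0, the rest injectively into 1..n-1."""
    V = set(V)
    h = {}
    next_slot = 1  # slots 1, 2, ... are handed out in order to keys outside V
    for x in range(n):
        if x in V:
            h[x] = 0
        else:
            h[x] = next_slot
            next_slot += 1
    return h


def collision_probability(n, ell, x, y):
    """Exact fraction of functions h_V (|V| = ell) with h_V(x) == h_V(y)."""
    family = [make_h_V(n, V) for V in combinations(range(n), ell)]
    collisions = sum(1 for h in family if h[x] == h[y])
    return Fraction(collisions, len(family))
```

For example, `collision_probability(10, 3, 0, 1)` can be compared against 1/n = 1/10, the threshold in the definition of weak universality.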

2. The following approach is useful in streaming algorithms; you should think about why this might be. Suppose that we have a sequence of items, passing by one at a time. We want to maintain a sample of one item that has the property that it is uniformly distributed over all the items that we have seen at each step. Moreover, we want to accomplish this without knowing the total number of items in advance or storing all of the items that we see. Consider the following algorithm, which stores just one item in memory at all times. When the first item appears, it is stored in the memory. When the kth item appears, it replaces the item in memory with probability 1/k. Explain why this algorithm solves the problem.
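The single-item algorithm just described can be written down directly. The following is a minimal Python sketch, not part of the original handout; the function name `sample_one` and the optional `rng` parameter are illustrative choices. It covers only the one-item case.

```python
import random


def sample_one(stream, rng=random):
    """Keep one item in memory; the k-th item replaces it with probability 1/k."""
    kept = None
    for k, item in enumerate(stream, start=1):
        # rng.random() lies in [0, 1), so the first item (k = 1) is always stored.
        if rng.random() < 1.0 / k:
            kept = item
    return kept
```

Calling `sample_one` many times on the same short sequence and counting how often each item is returned gives an empirical check of the uniformity claim.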

Now suppose instead we want a sample of s items instead of just one, without replacement. That is, we don’t want to get the same item multiple times in our sample. If this weren’t an issue, we could get a sample of s items with replacement just by running s independent copies of the above. Generalize the above process to that case. (Hint: start by taking the first s items and storing them as your sample. With what probability should each new item come into the sample?) [10 points]

3. The simplest variant of cuckoo hashing is as follows. There is a table with m cells. Each element x can hash into exactly two locations, given by hash functions h1(x) and h2(x). When an item is placed into the hash table, if at least one of these two locations is free, the item is placed in the free location. If neither location is free, x is placed in one of the two locations, and kicks out the element y that is in that location. Then y is placed in its alternative location. If that location is free, then all is well, and y is placed there. Otherwise, y must kick out the element in that location, and this new element must try to move to its alternative location, and so on.

It is possible that, at some point, the process will loop. The loop can either be found explicitly, or a limit on the number of times elements can be kicked out can be enforced and the whole dataset rehashed if this limit is ever reached.
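The two-location scheme with a kick limit can be sketched as follows. This is a minimal Python illustration, not part of the original handout. The class name `CuckooTable`, the `max_kicks` parameter and, in particular, the hash functions are assumptions: the two functions here simply mix a random 64-bit salt with the key using Python's built-in `hash`, which is a placeholder and not a family with any proven independence properties. Keys are assumed to be integers.

```python
import random


class CuckooTable:
    def __init__(self, m, max_kicks=20, seed=None):
        self.m = m
        self.max_kicks = max_kicks
        self.cells = [None] * m
        self.rng = random.Random(seed)
        # Placeholder hash functions (an assumption): one random salt per function.
        self.salts = [self.rng.getrandbits(64) for _ in range(2)]

    def _h(self, i, x):
        return hash((self.salts[i], x)) % self.m

    def lookup(self, x):
        return any(self.cells[self._h(i, x)] == x for i in range(2))

    def insert(self, x):
        """Return True on success, False once the kick limit is reached.

        After a False result one element has been displaced from the table,
        so the caller is expected to rehash the whole dataset.
        """
        spots = [self._h(0, x), self._h(1, x)]
        for pos in spots:
            if self.cells[pos] is None:
                self.cells[pos] = x
                return True
        pos = self.rng.choice(spots)  # both locations taken: pick one to evict from
        for _ in range(self.max_kicks):
            # x takes the cell; the previous occupant becomes the new x.
            x, self.cells[pos] = self.cells[pos], x
            a, b = self._h(0, x), self._h(1, x)
            pos = b if pos == a else a  # the evicted element's alternative location
            if self.cells[pos] is None:
                self.cells[pos] = x
                return True
        return False
```

A typical use is `t = CuckooTable(8192, seed=1)` followed by repeated `t.insert(key)` calls until the first `False`.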

One way to generalise this is to use more than two hash functions so that each element has more than two alternatives for which element to kick out randomly at each step. The task is to implement a generalised variant of cuckoo hashing. You should make a choice about how you will create the hash functions and explain it clearly in terms of the randomness and independence you are using. You could, for example, simply toss some coins if you only need a small number of random bits to start off. Feel free to try different hash function families and report on what effect, if any, this has. You may also want to experiment with creating random numbers using methods described in the lectures or otherwise. In your experiments, use a table of size 8192, and add elements until the first time you cannot add an element. (For convenience, you may assume an element cannot be added if, after repeating the kick out step 20 times, you are not done.) Using 2 hash functions and then 3 hash functions, and running the experiment 1000 times, examine how full the hash table can be before problems start to occur. Compare your results with the bounds from the theory and discuss what you find. For this problem, please submit your code.

You can choose any programming language you like, but please include clear instructions on how to run your code on a lab machine in a file called readme.txt that is included with your submission.

[10 points]

4. This question has two parts. A naive implementation of a van Emde Boas tree uses O(|U|) space, where |U| is the universe size. Explain in detail how this can be reduced to O(n) space (where n is the number of elements to be stored). What are the complexities of the different operations in your reduced space data structure?

The van Emde Boas tree layout can be used to implement a number of other data structures and to speed up important applications. Find an example from the literature and explain in detail how the van Emde Boas tree improves the time complexities of the relevant operations. Your explanation should give suitable citations and ideally provide proofs of any results you report.

[10 points]

5. Consider the following pattern matching problem involving wildcard symbols. A single character wildcard is said to match any other symbol in the input alphabet.

INPUT: Text T = t_1 ... t_n, pattern P = p_1 ... p_m. At most ℓ of the pattern characters p_i are non-wildcards (i.e. normal characters) and the rest are single character wildcards.

OUTPUT: The Hamming distance between P and every substring of T of length m.

Example: let p = ab?ab, text t = b?bbabba and ℓ = 4. The output is 3, 0, 2, 4.
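The example can be reproduced with a direct character-by-character comparison. The following is a minimal Python sketch, not part of the original handout; the function name `wildcard_hamming` is an illustrative choice. It assumes, as the example text suggests, that a `?` in either string matches any symbol. It runs in Θ(nm) time, so it only pins down the expected output and is not a solution to the question.

```python
def wildcard_hamming(text, pattern, wildcard="?"):
    """Hamming distance between pattern and every length-m substring of text."""
    n, m = len(text), len(pattern)
    distances = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            a, b = text[i + j], pattern[j]
            # A wildcard on either side never counts as a mismatch.
            if a != wildcard and b != wildcard and a != b:
                mismatches += 1
        distances.append(mismatches)
    return distances


print(wildcard_hamming("b?bbabba", "ab?ab"))  # [3, 0, 2, 4]
```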

(a) Give an algorithm that solves this problem.

(b) What is the asymptotic time complexity of your algorithm? Make sure to explain your working carefully.

The better the time complexity, the more marks will be awarded. In particular, extra marks will be given for fast solutions whose running time is parameterised by ℓ as well as n and m. A Θ(nm) time solution will gain no marks.

You can assume it takes no more than log₂ n bits (i.e. a single word of memory) to represent any of the input symbols and that simple arithmetic operations on the input symbols, including addition and multiplication, take constant time.

[10 points]

6. (a) The recurrence for the running time of the algorithm for computing a suffix array presented in lectures is T(n) = T(2n/3) + O(n). Show how to modify the algorithm to give one whose recurrence is T(n) = T(3n/7) + O(n). Is 3/7 the best possible, or can you do better?

(b) Suppose we have a pattern p and a text t and we want to find, for every position in t, the longest substring of p that matches there exactly. Give a fast algorithm to solve this problem together with its analysis. The better the time complexity, the more marks will be awarded.

[10 points]

7. For this question you are asked to write a two page summary of a research paper. I would like you to choose a highly cited paper from one of the leading algorithms conferences to write about. Luckily there is already a website (http://www.cs.utah.edu/~suresh/citations/) that has been through the papers written from 1997–2006 for FOCS, STOC and SODA (look up what these stand for) and counted the citation numbers for you, although these numbers are now underestimates in most cases. Alternatively, you may choose a paper from any of the conferences listed at http://www.cs.tau.ac.il/~iftgam/eventlist.htm. You should check on http://scholar.google.com that any paper you choose has a current citation count of at least one hundred.

Please post the title of the paper, its authors, the conference name and the number of citations on the unit forum as soon as you have made your choice. You may not, of course, choose the same paper as someone else.

Your two page review should include:

• A short one or two paragraph summary of the paper.

• A deeper, more extensive outline of the main points of the paper, including for example assumptions made, arguments presented, data analyzed, and conclusions drawn.

• Any limitations or extensions you see for the ideas in the paper.

• Your opinion of the paper; primarily, the quality of the ideas and its real or potential impact.

[30 points]

Academic Integrity: All the work you hand in should be your own. If you work with other students, you should list them on your coursework along with a brief explanation of which topics you discussed. In general, any source other than the lectures should be explicitly cited at the point where it is used.
