Post on 23-Dec-2015


Aparna Kulkarni

Nachal Ramasamy

Rashmi Havaldar

N-grams to Process Hindi Queries

Topics
•Introduction
•N-gram strategies
•Transliteration
•Methodology
•Document Preparation
•Hand Transliteration
•Number of N-grams & Choice of N
•Ranking & Retrieval
•Summary

Abstract

•Retrieval systems based on N-grams have been used as alternatives to word-based systems.
•A query that contains misspellings or differences in transliteration can defeat word-based systems. N-gram systems are more resistant to these problems.
•N-grams offer a language-independent technique.
•We present a retrieval system based on N-grams that uses a collection of Hindi songs. Within this retrieval system, we study the effect of varying N on retrievability.
•We rank the songs retrieved for different values of N and select the 10 top-ranked songs.

Introduction
•N-grams are consecutive overlapping N-character sequences formed from an input stream.
•We illustrate N-gram extraction with an example using N = 3 and the string “salt in the coffee”.
•There are three N-gram strategies.
•Only method (a) is truly language-independent because it avoids the concept of “words”. Hence we conducted our experiments using (a).
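
The extraction example can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `char_ngrams` is hypothetical, and it assumes the text is treated as one character stream with spaces included, which is one way to avoid the concept of "words".

```python
def char_ngrams(text, n):
    """Return the consecutive overlapping n-character sequences of text.

    Assumption: spaces are ordinary characters, so N-grams may span
    word boundaries.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("salt in the coffee", 3)
print(len(trigrams))   # 16 trigrams for this 18-character string
print(trigrams[:5])    # ['sal', 'alt', 'lt ', 't i', ' in']
```

Because no tokenization is involved, the same function applies unchanged to transliterated Hindi text.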

Transliteration
•Transliteration is a process in which an input string in one alphabet is converted to a string in another alphabet based on the phonemes in the string, in contrast to translation.
•Because Hindi words can be transliterated in many different ways, the query-to-index mapping cannot always be accomplished readily and automatically.
•The Devanagari alphabet used in Hindi can be transliterated into Roman script in multiple ways because there is no direct correspondence between the phonemes of the two alphabets.
•There is no single accepted system of transliteration, and users may not be consistent in the transliterations they use.
•Despite the diversity of transliteration schemes, the transliterations of a single word somewhat resemble each other.
•Some transliterations of the Hindi word for “law” are: “kaanoon”, “kanoon”, “kaanun” and “kanun”.

 

Methodology
•The techniques we studied in our project are the various N-gram extraction techniques and the choice of N for the N-gram.
•The vector space model for document retrieval represents documents as vectors of (term, weight) tuples.
•Here each “term” is an N-gram of the text in the collection.
•Stopword removal and stemming need not be applied to the terms, since N-grams are a language-independent technique.
•The weight for each term was calculated from ‘tfij’, ‘n’ and ‘dfi’, where ‘tfij’ is the term frequency of term ‘i’ in document ‘j’ (the number of times term ‘i’ occurs in document ‘j’), ‘n’ is the total number of documents and ‘dfi’ is the number of documents that contain term ‘i’.
•We chose to use multiple values of N (say 3, 4 and 5) to create the terms that compose a single vector space.
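
The weighting step can be illustrated as follows. The slide does not reproduce the weight formula itself, so this sketch assumes the common TF-IDF form, weight = tfij * log(n / dfi), built from the three quantities defined above; the authors' exact formula may differ. The names `char_ngrams`, `tfidf_vectors` and `n_values` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Consecutive overlapping n-character sequences of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_values=(3, 4, 5)):
    """Build one {term: weight} vector per document.

    Assumed weight: tf_ij * log(n / df_i), the standard TF-IDF form.
    """
    # tf_ij: how often N-gram i occurs in document j
    tf = [Counter(g for n in n_values for g in char_ngrams(doc, n))
          for doc in docs]
    # df_i: number of documents that contain N-gram i
    df = Counter(term for counts in tf for term in counts)
    total = len(docs)
    return [{term: freq * math.log(total / df[term])
             for term, freq in counts.items()}
            for counts in tf]


# Three transliterations of the same word, used as tiny "documents".
docs = ["kaanoon", "kanoon", "kaanun"]
vectors = tfidf_vectors(docs, n_values=(3,))
print(sorted(vectors[2]))  # ['aan', 'anu', 'kaa', 'nun']
```

Under this assumed formula, an N-gram that occurs in every document gets weight zero, while an N-gram unique to one document (such as "anu" here) gets the largest weight.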

Document Preparation
The document collection consists of the titles of 152 Hindi film songs.
The “titles” of the songs are indexed into our song database.
The vector-space model is used to represent the song title documents in our collection for our experiments.
Each document (song) in our collection has multiple vector representations.
For each song, we generated N-grams for N = 3, 4, 5. Treating these N-grams as vector terms, we built a separate vector representation of each song for each value of N.
We also built a retrieval system based on N-grams for our particular collection.
Users may enter a few words of a desired song as a query.
The system responds with a number of songs sorted in increasing order of rank.
The N-gram strategy can be varied easily in order to compare the effect of changing N.
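
A retrieval step of this kind might look like the sketch below. It is illustrative only: it assumes cosine similarity over raw N-gram counts (the slides mention similarity but do not name the measure), the function names are hypothetical, and the two song titles are made-up placeholders rather than entries from the actual collection.

```python
import math
from collections import Counter


def ngram_vector(text, n):
    """Raw term-frequency vector of character N-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def search(query, titles, n):
    """Return song ids with non-zero similarity, best match first."""
    q = ngram_vector(query.lower(), n)
    scored = [(cosine(q, ngram_vector(title.lower(), n)), song_id)
              for song_id, title in titles.items()]
    hits = [(score, song_id) for score, song_id in scored if score > 0]
    return [song_id for score, song_id in sorted(hits, key=lambda p: -p[0])]


# Placeholder titles, invented for this example only.
titles = {1: "kanoon ka raaj", 2: "dil ki baat"}
for n in (3, 4, 5):
    print(n, search("kaanoon", titles, n))  # song 1 matches despite the spelling
```

Passing N as a parameter mirrors the point that the N-gram strategy can be varied easily to compare results.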

Database Schema
songs: song_id, song_path, song_name, movie_name, music_director, lyricist, singer
gram3: song_id, term_id, term, tf
gram4: song_id, term_id, term, tf
gram5: song_id, term_id, term, tf
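
As a rough illustration, the schema could be created in SQLite as below. The table and column names come from the slide; the column types, the primary key, the foreign key and the use of SQLite are all assumptions, since the slide lists names only.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE songs (
        song_id        INTEGER PRIMARY KEY,  -- assumed key
        song_path      TEXT,
        song_name      TEXT,
        movie_name     TEXT,
        music_director TEXT,
        lyricist       TEXT,
        singer         TEXT
    )
""")

# One term table per value of N, each holding (song, term, tf) rows.
for n in (3, 4, 5):
    conn.execute(f"""
        CREATE TABLE gram{n} (
            song_id INTEGER REFERENCES songs(song_id),
            term_id INTEGER,
            term    TEXT,
            tf      INTEGER   -- term frequency of this N-gram in the song
        )
    """)

tables = sorted(row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['gram3', 'gram4', 'gram5', 'songs']
```

Keeping a separate table per N matches the separate vector representations built for each value of N.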

Soul Music!

Hand-transliteration
Our system performs very well for hand-transliterated queries.
Most queries have a non-zero similarity with the target song for all values of N.
Results are very similar for N = 3, 4, 5.
Example of a hand-transliterated query handled by the system: the result is still accurate even with a garbled query.

Choice of N

Our experiments suggest that N = 3 is an acceptable value of N.

Two further considerations influence the choice of N: the size of the document vectors and the time taken to process a query.

We found that the size in bytes of the document vectors fell as N went from 3 to 4 to 5.

Query processing time decreased as N was increased. As N increases, fewer songs return a non-zero similarity with the query. Therefore, the number of responses to sort and rank was reduced.
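
The shrinking overlap can be seen on a single pair of spellings. The sketch below counts the distinct N-grams shared by two transliterations of the word for "law" taken from the Transliteration slide; the function name `shared_ngrams` is hypothetical, and the example illustrates the trend only, not the measured vector sizes or timings.

```python
def shared_ngrams(a, b, n):
    """Number of distinct character N-grams that a and b have in common."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(grams_a & grams_b)


for n in (3, 4, 5):
    print(n, shared_ngrams("kaanoon", "kanoon", n))
# 3 3
# 4 2
# 5 1
```

With fewer shared terms at larger N, a differently spelled query is more likely to have zero similarity with a given title, which leaves fewer candidates to sort and rank.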

Ranking and Retrieval
After retrieving the set of songs closely matching the query for each value of N (3, 4 and 5), we report the song ID and rank for each retrieval.

To calculate the rank of a particular song ID, we formulated the following equation:

Rank = {[0.5*Rank(3-gram)] + [0.2*Rank(4-gram)] + [0.3*Rank(5-gram)]} / 3

•We take a song ID and look up the rank of the song in the 3-gram, 4-gram and 5-gram retrievals respectively, for at least 10 songs in each retrieval.
•If all three ranks are the same for a particular song, we calculate the rank using the above equation.
•If a song returned by the 3-gram retrieval was not returned by the 4-gram and 5-gram retrievals, we assign that song ID a rank of 11 in the 4-gram and 5-gram lists, treating it as falling just outside the top 10 of those retrievals.

Ranking and Retrieval (Cont’d)
Our refined evaluation records the rank at which the correct song was found, if the correct song has a non-zero similarity with the query.
The final list holds 30 song IDs and their corresponding ranks.
We then sort and merge the list and take the top 10 songs with the highest rank.
The highest rank is 1.0 and the lowest is 10.0 in the list of 10 songs.
The trigrams have a higher weight because they give more accurate results, specifically for the Hindi language.
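
The rank-combination step can be sketched as follows. This is one reading of the rank equation, with the weighted sum divided by 3 and a rank of 11 assigned to any song missing from a retrieval; a smaller combined value is treated as a better rank. The function name and the song IDs 101 to 103 are hypothetical.

```python
WEIGHTS = {3: 0.5, 4: 0.2, 5: 0.3}  # weights from the rank equation
MISSING_RANK = 11                   # rank given to a song absent from a retrieval


def combined_ranks(ranked_lists, top=10):
    """Merge per-N ranked lists of song ids into one list of (id, rank).

    ranked_lists maps N to song ids ordered best first; only the first
    `top` entries of each list are considered.
    """
    positions = {n: {song_id: i + 1
                     for i, song_id in enumerate(ranked_lists.get(n, [])[:top])}
                 for n in WEIGHTS}
    all_ids = set().union(*(p.keys() for p in positions.values()))
    scores = {song_id: sum(WEIGHTS[n] * positions[n].get(song_id, MISSING_RANK)
                           for n in WEIGHTS) / 3
              for song_id in all_ids}
    # Smaller combined value = better rank; ties broken by song id.
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))[:top]


results = combined_ranks({3: [101, 102, 103], 4: [101, 103], 5: [102, 101]})
print([song_id for song_id, rank in results])  # [101, 102, 103]
```

Song 101 comes first here because it ranks near the top in all three retrievals, while 102 and 103 are each penalized with rank 11 for the retrieval that missed them.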

Conclusion
•An N-gram retrieval system over a Hindi song database is novel.
•N-gram searches retrieved documents fairly accurately despite garbled queries, compared with word-based searches.
•N-gram techniques are language-independent. Therefore, they are well suited to collections containing documents in different languages or multilingual documents.
•Based on our completed studies, we recommend N-grams as a strong alternative to word-based search techniques.

Scope of the Project
Here, we used only the first lines of every song to keep the experiment within manageable bounds.
Our system can be extended easily on procurement of a full-text collection of songs.
Also, the canonical responses from our system may be used as indices into databases of complete songs that use the same transliteration for all song titles.

Questions??

Thank You!!