Oktavia Search Engine - PyConJP 2014
![Page 1: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/1.jpg)
DeNA Co., Ltd. Yoshiki Shibukawa
9/14/2014 PyConJP
![Page 2: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/2.jpg)
• Yoshiki Shibukawa
  • Works for DeNA Co., Ltd.
  • @shibu_jp (Twitter)
  • yoshiki.shibukawa (Facebook)
  • [email protected] (mail)
• Languages
  • C/C++, Python, JavaScript
• Founder of sphinx-users.jp
• San Francisco -> Tokyo
![Page 3: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/3.jpg)
• The basics of existing search engines
• The structure of Oktavia
• Oktavia API examples
![Page 4: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/4.jpg)
• In some cases, an inverted index does not work well for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
  • It includes only the essential parts of Oktavia.
  • I will add more features.
![Page 5: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/5.jpg)
![Page 6: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/6.jpg)
AM.txt (0)
• Good morning
• Hi
PM.txt (1)
• Good afternoon
• Good evening
• Hi
![Page 7: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/7.jpg)
| Word | Document ID |
|---|---|
| Good | 0, 1 |
| Morning | 0 |
| Afternoon | 1 |
| Evening | 1 |
| Hi | 0, 1 |
• Word -> Document
• Split the query string into words, look up each word in the table, and show the documents that contain all of them.
Good Morning → (0, 1) and (0,) → (0,)
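The lookup above can be sketched in a few lines of Python. This is a minimal illustration of the inverted index idea, not code from Oktavia; the function names `build_inverted_index` and `search` are made up for this example, and words are lowercased so that "Morning" matches "morning".

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], ()))
    for word in words[1:]:
        result &= set(index.get(word, ()))
    return sorted(result)


documents = [
    "Good morning Hi",                 # AM.txt (0)
    "Good afternoon Good evening Hi",  # PM.txt (1)
]
index = build_inverted_index(documents)
print(search(index, "Good Morning"))
```

Splitting on whitespace is the whole trick here, which is why the approach depends on the language having spaces between words.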
![Page 8: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/8.jpg)
• It is nice weather to go out to PyConJP. (English)
• 这是不错的天气出去PyConJP (Chinese)
• 今日はPyConJPに出かけるにはいい天気ですね (Japanese)
• 그것은 PyConJP 에 외출 좋은 날씨 입니다 (Korean ※)
※ Korean puts spaces between groups of words, but not between every word.
![Page 9: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/9.jpg)
今日はPyConJPに出かけるにはいい天気ですね
今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね
• Split the text into words with a natural language processor such as ChaSen, MeCab, or Kuromoji.
• This needs deep knowledge of each language and a big dictionary.
![Page 10: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/10.jpg)
| Word | Doc ID |
|---|---|
| 今日 | 0 |
| は | 0, 0 |
| PyConJP | 0 |
| に | 0, 0 |
| 出かける | 0 |
| いい | 0 |
| 天気 | 0 |
| です | 0 |
| ね | 0 |
• Once the document is split into words, it can use the same inverted index backend.
• The same word splitter is needed both when creating the index and when searching.
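To make the "same splitter on both sides" point concrete, here is a toy sketch. It replaces a real analyzer such as MeCab with a hypothetical greedy longest-match splitter over a tiny hand-written dictionary, so `split_words`, `DICTIONARY`, `build_index`, and `search` are all assumptions for illustration only.

```python
def split_words(text, dictionary):
    """Greedy longest-match splitter; unknown characters become one-character words."""
    words = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            candidate = text[pos:end]
            if candidate in dictionary or end == pos + 1:
                words.append(candidate)
                pos = end
                break
    return words


# Tiny stand-in for the large dictionary a real analyzer ships with.
DICTIONARY = {"今日", "は", "PyConJP", "に", "出かける", "いい", "天気", "です", "ね"}


def build_index(documents):
    """Build word -> set of document IDs using the splitter above."""
    index = {}
    for doc_id, text in enumerate(documents):
        for word in split_words(text, DICTIONARY):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query):
    """Split the query with the same splitter, then intersect the postings."""
    hits = [index.get(word, set()) for word in split_words(query, DICTIONARY)]
    return sorted(set.intersection(*hits)) if hits else []


index = build_index(["今日はPyConJPに出かけるにはいい天気ですね"])
print(search(index, "いい天気"))
```

A query for 天 alone finds nothing in this sketch, because the index only knows the whole word 天気. That is the kind of limitation a word-based index carries.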
![Page 11: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/11.jpg)
• 2-gram: こんにちは → こん|んに|にち|ちは
• 3-gram: こんにちは → こんに|んにち|にちは
• Split a query word into fixed-length strings, then search for each chunk.
• Use each chunk as a word.
![Page 12: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/12.jpg)
| Word | (Doc ID, Position) |
|---|---|
| こん | (0, 0) |
| んに | (0, 1) |
| にち | (0, 2) |
| ちは | (0, 3) |
• It can still use an inverted index algorithm.
• The index file becomes big.
• It can't handle words shorter than the chunk size.
こんにちは → こん / んに / にち / ちは → (0, 0) / (0, 1) / (0, 2) / (0, 3) → (0, 0)
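The n-gram lookup can be sketched as follows. This is an illustrative 2-gram positional index, not Oktavia code; `ngrams`, `build_ngram_index`, and `search` are names invented for this example. A match is reported only when every chunk of the query appears at consecutive positions in the same document.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return every overlapping chunk of length n, in order of position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(documents, n=2):
    """Map each n-gram to the set of (doc_id, position) pairs where it starts."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for pos, chunk in enumerate(ngrams(text, n)):
            index[chunk].add((doc_id, pos))
    return index


def search(index, query, n=2):
    """Return (doc_id, position) pairs where the whole query occurs."""
    chunks = ngrams(query, n)
    if not chunks:
        return []  # a query shorter than the chunk size cannot be looked up
    # Candidates are the start positions of the first chunk; every later
    # chunk must appear at the matching offset in the same document.
    matches = set(index.get(chunks[0], ()))
    for offset, chunk in enumerate(chunks[1:], start=1):
        postings = index.get(chunk, ())
        matches = {(d, p) for (d, p) in matches if (d, p + offset) in postings}
    return sorted(matches)


index = build_ngram_index(["こんにちは"])
print(search(index, "こんにちは"))
```

The empty result for a one-character query shows the "shorter than chunk size" limitation directly, and storing one posting per position of every document is why the index grows quickly.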
![Page 13: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/13.jpg)
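Inverted Index

| Language | Method | Pros | Cons |
|---|---|---|---|
| Languages with spaces | Split document by space | Simple | Spaces are needed |
| East Asian languages | N-gram | Still simple | Index becomes huge |
| East Asian languages | NLP | Works perfectly with Asian languages | NLP processor and dictionary are needed |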
![Page 14: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/14.jpg)
![Page 15: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/15.jpg)
• It provides a search engine for the browser.
  • Inverted index
• It didn't support Japanese.
  • I sent some patches.
  • But they were not enough…
![Page 16: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/16.jpg)
![Page 17: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/17.jpg)
• Developed by
  • Paolo Ferragina
  • Giovanni Manzini
• FM-index is not popular in western countries.
  • It is completely different from existing algorithms.
  • Existing algorithms are enough for western languages.
  • It is popular in genome analysis.
• I made a new search engine using this algorithm.
![Page 18: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/18.jpg)
Estimated Time: 15min
![Page 19: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/19.jpg)
• A search engine that works in the web browser.
• Written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/).
• It uses FM-index as its backend search algorithm.
![Page 20: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/20.jpg)
• It is similar to ActionScript 3
  • Class statement (no prototype!)
  • Strict type checking
  • No "this" hell
  • Performance optimization
![Page 21: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/21.jpg)
![Page 22: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/22.jpg)
• Among algorithms that search a compressed index file, FM-index is the fastest.
• FM-index doesn't need word splitting.
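To show why no word splitting is needed, here is a toy sketch of the FM-index idea: a Burrows-Wheeler transform plus backward search. It is not Oktavia's implementation. The suffix array is built by naive sorting and `occ` simply counts characters, whereas a real FM-index uses compressed rank structures. The names `build_fm_index`, `occ`, and `locate` are invented for this example, and the text is assumed not to contain the `"\0"` terminator.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, BWT string, and the C table."""
    text += "\0"  # unique terminator that sorts before every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of characters in the text smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return suffix_array, bwt, c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i] (a real index uses a rank structure)."""
    return bwt.count(ch, 0, i)


def locate(index, pattern):
    """Return the sorted start positions of pattern, using backward search."""
    suffix_array, bwt, c_table = index
    if not pattern:
        return []
    lo, hi = 0, len(bwt)  # current suffix-array range [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("今日はPyConJPに出かけるにはいい天気ですね")
print(locate(index, "天気"))
```

The search walks the query one character at a time from the end, so any substring can be found, including a single character such as 天 that a word-based index would miss.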
![Page 23: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/23.jpg)
• Oktavia adds extra information.
  • It adds region information to the source text.
• You can add as much metadata as you want.
  • Section (documents and sections)
  • Block (code blocks and so on)
  • Splitter (word splitter)
  • Table (rows and columns)

Source text: Use the Force, Luke. No, I am your father.
Regions: Ep4.txt (Use the Force, Luke.), Ep5.txt (No, I am your father.)
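One way to picture region metadata is a sorted list of start offsets over the concatenated text, so that a hit position can be mapped back to the file it came from. The sketch below is an assumption about the idea only. The `Sections` class is hypothetical and is not Oktavia's `Section` metadata class, whose API is not shown in these slides.

```python
from bisect import bisect_right


class Sections:
    """Hypothetical helper: concatenated text plus named region boundaries."""

    def __init__(self):
        self.text = ""
        self.starts = []  # start offset of each region, in increasing order
        self.names = []   # region name for each start offset

    def add(self, name, content):
        """Append content to the text and remember where the region begins."""
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def name_at(self, position):
        """Return the name of the region that contains the given offset."""
        return self.names[bisect_right(self.starts, position) - 1]


sections = Sections()
sections.add("Ep4.txt", "Use the Force, Luke. ")
sections.add("Ep5.txt", "No, I am your father.")

hit = sections.text.find("father")
print(sections.name_at(hit))
```

Because the search algorithm only returns positions in one long string, a lookup of this kind is what turns a raw position into "which document" or "which section".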
![Page 24: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/24.jpg)
[Diagram. Steps: Read Source → Generate Index → File API → Read Index → File API → Search Result. Components: CLI tool, Browser search program.]
![Page 25: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/25.jpg)
[Diagram: same flow as the previous slide.]
• I published the PyPI module yesterday.
• It supports Python 2.6, 2.7, 3.3, and 3.4.
![Page 26: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/26.jpg)
• Use the Oktavia API to implement a search feature in your application.
![Page 27: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/27.jpg)
• Build JSX version
• web/bin/oktavia-jquery-ui.js, web/bin/oktavia-web-runtime.js are important.
$ git clone [email protected]:shibukawa/oktavia.git
$ cd oktavia
$ npm install
$ ./node_modules/.bin/grunt build
![Page 28: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/28.jpg)
• Creating the index
  • Dump the index file in base64 encoding and create a file in the following style.
  • Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js).
• Add web/bin/oktavia-jquery-ui.js to your website.
  • It loads the index and runtime in a WebWorker, sends requests, and shows the results.
var searchIndex = 'aGVsbG8gd29ybGQ…..=';
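The dump step could look roughly like this in Python. This is a sketch under assumptions: the index is taken to be available as raw bytes, and `dump_index_as_js` and `build_search_script` are hypothetical helpers, not part of the Oktavia package or its CLI. The placeholder bytes stand in for a real index.

```python
import base64


def dump_index_as_js(index_bytes, variable_name="searchIndex"):
    """Return a JavaScript statement assigning the base64-encoded index."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var %s = '%s';" % (variable_name, encoded)


def build_search_script(index_bytes, runtime_source):
    """Concatenate the index statement with the web runtime source text."""
    return dump_index_as_js(index_bytes) + "\n" + runtime_source


# Placeholder bytes; a real index would be the binary file produced earlier.
print(dump_index_as_js(b"hello world"))
```

Base64 output contains only letters, digits, `+`, `/`, and `=`, so it is safe inside a single-quoted JavaScript string without escaping.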
![Page 29: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/29.jpg)
Estimated Time: 23min
![Page 30: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/30.jpg)
• Oktavia provides APIs for building a better search engine of your own.
• The most important part of the user experience is tuning the scoring (sorting and filtering).
• In some cases "not available" is important information for the user; in other cases it is just noise.
![Page 31: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/31.jpg)
• I want to buy a bottle of wine as a gift!
Cabernet Sauvignon [Sold Out] • From France
Pinot noir [Sold Out] • From Chile
Zinfandel [Sold Out] • From USA
Photo by Josh Kenzer under CC-NC-SA
![Page 32: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/32.jpg)
• I want to buy "My Little Pony DVD"!
Season One $32
Season Two $32
Season Three [Sold out]
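The two shopping examples suggest that how "sold out" is handled should be a tunable policy rather than a fixed rule. The sketch below illustrates one way to express that. The `rank` function, the policy names, and the relevance numbers are all made up for this example and are not part of Oktavia.

```python
def rank(items, sold_out_policy="demote"):
    """Order (name, relevance, sold_out) items according to a sold-out policy.

    'keep'   ranks by relevance only (sold out is useful information),
    'demote' moves sold-out items below available ones,
    'hide'   drops sold-out items entirely (sold out is just noise).
    """
    if sold_out_policy not in ("keep", "demote", "hide"):
        raise ValueError("unknown policy: %r" % (sold_out_policy,))
    if sold_out_policy == "hide":
        items = [item for item in items if not item[2]]

    def sort_key(item):
        _, relevance, sold_out = item
        penalty = 1 if (sold_out and sold_out_policy == "demote") else 0
        return (penalty, -relevance)

    return [name for name, _, _ in sorted(items, key=sort_key)]


# Relevance values are invented for illustration.
dvds = [
    ("Season One", 0.5, False),
    ("Season Two", 0.5, False),
    ("Season Three", 0.9, True),
]
print(rank(dvds, "keep"))
print(rank(dvds, "hide"))
```

For the gift shopper a "hide" or "demote" policy removes the noise, while a collector looking for a specific season probably wants "keep" so the sold-out title still shows up.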
![Page 33: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/33.jpg)
• Oktavia class (oktavia.py)
  • Main entry point for creating and searching an index.
• Metadata classes (metadata.py)
  • Section
  • Block
  • Splitter
  • Table
• Query, Result classes (TBD)
![Page 34: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/34.jpg)
![Page 35: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/35.jpg)
• Sorry, I am still working on this… In the future, the following code will work:
![Page 36: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/36.jpg)
• In some cases, an inverted index does not work well for East Asian languages.
• FM-index is a completely different search algorithm.
• I published a new PyPI module yesterday.
  • It includes only the essential parts of Oktavia.
  • I will add more features.
![Page 37: Oktavia Search Engine - pyconjp2014](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c37d594a7959dc288b4591/html5/thumbnails/37.jpg)
• Office Hour
  • 13:40-14:10
• Message
  • Facebook (yoshiki.shibukawa)
  • Twitter (@shibu_jp, @shibukawa)