November 16-18, 2016 • Seville, Spain
Building a Search Engine for the Cuban Web
Jorge Luis Betancourt
Search/Crawl Engineer
Who am I
Jorge Luis Betancourt González
Search/Crawl Engineer
Apache Nutch Committer & PMC
Apache Solr/ES enthusiast
Agenda
• Introduction & motivation
• Technologies used
• Customizations
• Conclusions and future work
Introduction / Motivation
Cuba
Internet Intranet
Global search engines can’t access documents hosted on the Cuban Intranet
Writing your own web search engine
from scratch?
or …
Common search engine features
Web search: HTML & documents (PDF, DOC)
• highlighting
• filters (facets)
• suggestions
• autocorrection
Image search (size, format, color, objects)
• thumbnails
• filters (facets)
• show metadata
• match text with images
News search (alerting, notifications)
• near real time
• email, push, SMS
How to fulfill these requirements?
store / query
At its core, a search engine stores some information and retrieves that information when a question is received.
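The store/query idea can be illustrated with a toy inverted index. This is only a sketch for intuition: the class name `TinyIndex` and its methods are hypothetical, and it is not how Solr or Lucene are implemented (they add analysis, ranking, persistence and much more).

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.documents = {}               # document id -> original text

    def store(self, doc_id, text):
        """Store a document and register each of its lowercased terms."""
        self.documents[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of documents that contain every term of the question."""
        terms = question.lower().split()
        if not terms:
            return []
        matches = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            matches &= self.postings.get(term, set())
        return sorted(matches)
```

Storing two documents and calling `query("search")` returns the ids of both if both contain that word; adding a second term narrows the result to documents that contain both terms.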
Open Source to the rescue …
• Index server
• Crawler
• Web interface
Apache Nutch
“Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”
Apache Nutch
• Highly scalable
• Highly extensible
• Pluggable parsing, protocols, storage, indexing, scoring
• Active community
• Apache License
Apache Solr
Total downloads: 8M+
Monthly downloads: 250,000+
• Apache License
• Highly modular
• Based on Lucene
• Great community
• Stability / Scalability
• Battle tested
Back to the list of features
Image search and thumbnails
• Custom parser & indexer to store the image thumbnail
• Custom parser, indexer & scoring to identify and store the text related to an image
[Diagram: a page's img, p and h1 elements]
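One way to picture "text related to an image" is to pair every img with its alt text, the most recent h1 and the next paragraph. The sketch below uses Python's standard `html.parser` and is an assumption for illustration only: the slides do not spell out the actual heuristics, and the real implementation lives in custom Nutch plugins. The names `ImageTextExtractor` and `related_text` are hypothetical.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    """Collects, for each <img>, its alt text, the latest <h1> and the next <p>."""

    def __init__(self):
        super().__init__()
        self.images = []      # one dict per <img> found
        self._open = None     # tag whose text is being collected ("h1" or "p")
        self._buffer = []     # text chunks of the currently open tag
        self._heading = ""    # text of the most recent <h1>

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._open = tag
            self._buffer = []
        elif tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src") or "",
                "alt": attributes.get("alt") or "",
                "heading": self._heading,
                "paragraph": None,  # filled in when the next <p> closes
            })

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "h1":
            self._heading = text
        else:
            # Attach this paragraph to every image still waiting for one.
            for image in self.images:
                if image["paragraph"] is None:
                    image["paragraph"] = text
        self._open = None


def related_text(html):
    """Return the list of image records extracted from an HTML string."""
    parser = ImageTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.images
```

The resulting records could then be indexed as extra fields next to the thumbnail, so that a text query can match images through their surrounding text.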
How does it work?
[Diagram: a page with an h1, a p and several img elements, with numbered steps 1 to 3]
News search (NRT & alerting)
Nutch is not really suited for this task: the batch nature of Hadoop jobs doesn’t fit this scenario well.
Our topology
• Pipeline: http://news-site.com → RSS → fetch → parse → index
• RSS: parses the RSS feed and outputs the news links to be processed according to the SC protocol (https://github.com/commoncrawl/news-crawl)
• Monitor: flaxsearch/luwak
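The two ends of this topology can be sketched in a few lines: extracting news links from an RSS feed, and checking a news text against stored queries for alerting. This is a simplified illustration under stated assumptions, not the news-crawl or Luwak code. Luwak matches incoming documents against registered queries; the toy matcher here only imitates that idea with plain term lookups and does not use Luwak's API. The function names `news_links` and `matching_alerts` are hypothetical.

```python
import xml.etree.ElementTree as ET


def news_links(rss_xml):
    """Return (title, link) pairs for every <item> of an RSS 2.0 feed that has a link."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append((title, link))
    return items


def matching_alerts(text, stored_queries):
    """Return the names of stored queries whose terms all occur in the text.

    stored_queries maps an alert name to a list of required terms.
    """
    words = set(text.lower().split())
    return sorted(
        name
        for name, terms in stored_queries.items()
        if all(term.lower() in words for term in terms)
    )
```

Each link returned by `news_links` would be handed to the fetch and parse steps; once the article text is available, `matching_alerts` decides which subscribers (email, push, SMS) should be notified.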
Querying the data
Apache Solr
• Solr has full support for highlighting (3 implementations)
• powerful faceting capabilities (even more in recent releases)
• autocorrection support based on the index content
• awesome scalability (SolrCloud, classic master-slave replication)
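A single Solr request can ask for highlighting, facets and spelling suggestions at once. The sketch below only builds the request URL with standard Solr parameters (`hl`, `hl.fl`, `facet`, `facet.field`, `spellcheck`, `spellcheck.collate`). The collection name, the `content` highlight field and the facet fields are assumptions, not taken from the talk, and spellchecking only works if the request handler has the spellcheck component configured.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, user_query, facet_fields):
    """Build a Solr /select URL requesting highlighting, facets and spellcheck."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),          # enable highlighting
        ("hl.fl", "content"),    # assumed name of the field to highlight
        ("facet", "true"),       # enable faceting
    ]
    # One facet.field parameter per field used as a filter in the UI.
    params += [("facet.field", field) for field in facet_fields]
    # Ask for a corrected ("did you mean") version of the query.
    params += [("spellcheck", "true"), ("spellcheck.collate", "true")]
    return f"{base_url}/{collection}/select?{urlencode(params)}"
```

For example, `build_search_url("http://localhost:8983/solr", "web", "cuban music", ["host", "type"])` yields a URL that a web front end could request and render as results, filters and a suggestion.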
The features, once again
Other features - monitoring
We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …
(and facets) analytical tool
(and logs)
(and metrics) time series store
(and logs) parsing & aggregation
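As a rough picture of what "parsing & aggregation" of logs means before the data reaches a time series store, the sketch below counts log events per minute and level. The log line format and the function name `count_by_minute_and_level` are hypothetical; the slides do not show the actual log format or name the tool that does this step.

```python
import re
from collections import Counter

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (?P<level>[A-Z]+) "
)


def count_by_minute_and_level(lines):
    """Count log events per (minute, level); lines that don't match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match["minute"], match["level"])] += 1
    return counts
```

Aggregating locally like this keeps the volume small, which matters when, as noted above, gigabytes of raw logs can't be shipped to a cloud service.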
Banana (Kibana port) for visualizations
Infrastructure
[Diagram: web, Nutch crawlers, Solr master and Solr replicas ("SolrReplicador"), connected by links labeled HTTP and JAVABIN]
Some usage stats
• less than 10 000 visits
• around 600 unique visitors
Future work
• Apply deep learning techniques to process the raw images and combine the results with the current approach
• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)