
Page 1: Real World Analytics with Solr Cloud and Spark

Real-World Analytics with Solr Cloud and Spark

Solving Analytic Problems for Billions of Records Within Seconds

Vancouver, May 2016 | Johannes Weigend | QAware GmbH

Johannes Weigend Apache Big Data North America 2016 May 2016

Page 2: Real World Analytics with Solr Cloud and Spark

Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH

Any questions? Ask, or tweet with the hashtag #cloudnativenerd

Page 3: Real World Analytics with Solr Cloud and Spark


The Problem We Want to Solve

■ Interactive applications with response times below one second!
■ Processing of billions of records (> 10^9 rows / records)
■ Continuous data import (near real time)
■ Applications built on the Reactive Manifesto

Page 4: Real World Analytics with Solr Cloud and Spark


Page 5: Real World Analytics with Solr Cloud and Spark


Page 6: Real World Analytics with Solr Cloud and Spark


Horizontal Scalability can be difficult!

■ Horizontal scalability of functions
  ■ Trivial: load balancing of (stateless) services (macro- / microservices); more users → more machines
  ■ Not trivial: more machines → faster response times

■ Horizontal scalability of data
  ■ Trivial: linear distribution of data across multiple machines; more machines → more data
  ■ Not trivial: constant response times with growing datasets

Page 7: Real World Analytics with Solr Cloud and Spark


Hadoop Gives Answers for Horizontal Scalability of Data and Functions

Page 8: Real World Analytics with Solr Cloud and Spark


Page 9: Real World Analytics with Solr Cloud and Spark


The Processing of Distributed Data can be Quite Slow!

[Diagram: data flow Read → Filter → Map → Reduce, with Read, Filter and Map running on three nodes over HDFS / NFS / NoSQL; foreach() takes minutes to hours]

Page 10: Real World Analytics with Solr Cloud and Spark


With Prior Indexing and Searching, Less Data Has to Be Read and Filtered.

[Diagram: data flow over Search / NoSQL with Filter, Search, Map and Reduce, where Search and Map run on three nodes; foreach() takes seconds to minutes]

Page 11: Real World Analytics with Solr Cloud and Spark

[Diagram: architecture layers: Frontend, Business Layer, Cluster Processing (Spark: Map, Map, Map, Reduce), Distributed Data (Search, Search, Search)]

Page 12: Real World Analytics with Solr Cloud and Spark


DEMO

Page 13: Real World Analytics with Solr Cloud and Spark

1. Solr Cloud for Analytics

[Diagram: data flow with Spark (Filter, Search, Map, Reduce) on top of Search / NoSQL]

Page 14: Real World Analytics with Solr Cloud and Spark


■ Document-based NoSQL database with outstanding search capabilities
  ■ A document is a collection of fields (string, number, date, …)
  ■ Single-valued and multi-valued fields (fields can be arrays)
  ■ Nested documents
  ■ Static and dynamic schema
  ■ Powerful query language (Lucene)

■ Horizontally scalable with Solr Cloud
  ■ Distributed data in separate shards
  ■ Resilience through the combination of Zookeeper and replication

■ Powerful aggregations (aka facets)
■ Stable → V 6.0


Page 15: Real World Analytics with Solr Cloud and Spark


The Architecture of Solr Cloud

[Diagram: a Solr Cloud collection is split into shards (Shard1 to Shard9), each with replicas (Replica1 to Replica9) and a leader, hosted on several Solr servers; a Zookeeper cluster of three Zookeeper instances sits next to the Solr servers; adding Solr servers scales the collection out]

Page 16: Real World Analytics with Solr Cloud and Spark


Solr Stores Everything in a Single "Table" (BigTable). Searching is Extremely Fast and Powerful.*

[Diagram: entities Customer (Name, Address) and Order (Amount, Product), connected by a relationship labelled * and 1]

Each row is one SolrDocument:

Type      ID  Name  Address  Amount  Product  K2B
Customer  1   K 1   A 1      -       -        [3,5]
Customer  2   K 2   A 2      -       -        [4]
Order     3   -     -        Z 1     P 1      [1]
Order     4   -     -        Z 2     P 2      [2]
...

(*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms

Page 17: Real World Analytics with Solr Cloud and Spark


A Solr Cloud can be Started in Seconds.

■ Create a schema by reusing an existing set of Solr config files
  ■ There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified

    cp -r $SOLR_HOME/server/solr/configsets/basic_configs \
          $SOLR_HOME/server/solr/configsets/bigdata2016

■ Start Solr
  ■ When the wizard asks for a collection name, use "bigdata2016" (see above)

    $SOLR_HOME/bin/solr start -e cloud

■ Make a first test

    curl "localhost:8983/solr/bigdata2016/query?q=*:*"

Page 18: Real World Analytics with Solr Cloud and Spark


With the Solr Cloud Collection API, Shards can be Created, Changed or Deleted.

■ Create a collection

  <<SOLR URL>>/solr/admin/collections?action=CREATE&name=<<name of collection>>&numShards=16&replicationFactor=2&maxShardsPerNode=8&collection.configName=<<name of uploaded zookeeper configuration>>

■ Delete a collection

  <<SOLR URL>>/solr/admin/collections?action=DELETE&name=<<name of collection>>

https://cwiki.apache.org/confluence/display/solr/Collections+API
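The two Collections API calls on this slide are plain HTTP URLs, so they can be assembled with any HTTP client. A minimal Python sketch of the URL construction follows; the host `http://localhost:8983` and the config name `bigdata2016` are assumptions for illustration, not values from the slide.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, params):
    """Build a Collections API URL: <base>/solr/admin/collections?action=...&..."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/solr/admin/collections?{query}"


SOLR_URL = "http://localhost:8983"  # assumed address of one Solr node

create_url = collections_api_url(SOLR_URL, "CREATE", {
    "name": "bigdata2016",
    "numShards": 16,
    "replicationFactor": 2,
    "maxShardsPerNode": 8,
    "collection.configName": "bigdata2016",  # assumed name of the uploaded config
})
delete_url = collections_api_url(SOLR_URL, "DELETE", {"name": "bigdata2016"})

print(create_url)
print(delete_url)
```

Opening either URL (for example with `urllib.request.urlopen`) against a running Solr Cloud node would execute the action.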

Page 19: Real World Analytics with Solr Cloud and Spark


Zookeeper has to be Started First and the Solr Configuration must be Uploaded to Use a Solr Cloud.

1. Start Zookeeper on 2n+1 nodes (odd number)

   $ZOO_HOME/bin/zkServer.sh start

2. Upload the Solr configuration into Zookeeper

   $SOLR_HOME/server/scripts/cloud-scripts$ ./zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf

3. Start Solr on n nodes connected to the Zookeeper cluster

   $SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102

4. Create a collection with a number of shards and replicas
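Why an odd number of Zookeeper nodes? A Zookeeper ensemble only stays available while a majority of its nodes is up, so an even node count adds cost without adding fault tolerance. The small sketch below works through that arithmetic; the function names are made up for this example.

```python
def quorum_size(ensemble_size):
    """Smallest majority of an ensemble with the given number of nodes."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes may fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for nodes in (1, 3, 4, 5):
    print(nodes, "nodes: quorum", quorum_size(nodes),
          "tolerates", tolerated_failures(nodes), "failure(s)")
```

With 2n+1 nodes the ensemble tolerates n failures; going from 3 to 4 nodes still tolerates only one.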

Page 20: Real World Analytics with Solr Cloud and Spark


Example: Solr Cloud for Analytics of Insurance Data

■ Insurance sample data with the following fields: Education, Income, Gender, ...

Page 21: Real World Analytics with Solr Cloud and Spark


DEMO

Page 22: Real World Analytics with Solr Cloud and Spark


Solr Supports JSON Queries via HTTP POST
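The request shown on this slide is not part of the transcript. As a rough illustration of what a JSON query looks like, the following Python sketch builds (but does not send) a POST request for Solr's JSON Request API. The host and the collection name `bigdata2016` are assumptions.

```python
import json
from urllib.request import Request


def json_query_request(solr_url, collection, query, limit=10):
    """Build a POST request carrying a JSON query body for /solr/<collection>/query."""
    body = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return Request(
        f"{solr_url}/solr/{collection}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed host and collection; urllib.request.urlopen(req) would send the request.
req = json_query_request("http://localhost:8983", "bigdata2016", "*:*")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```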

Page 23: Real World Analytics with Solr Cloud and Spark


Term Facets Group and Count a Single Field.
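The slide's example is missing from the transcript. A term facet in Solr's JSON Facet API could look roughly like the request body built below; the field name `gender` is borrowed from the insurance example, and the facet label `by_gender` is an arbitrary choice.

```python
import json


def terms_facet_request(field, limit=10):
    """JSON request body that groups all documents by `field` and counts each group."""
    return {
        "query": "*:*",
        "limit": 0,  # return facet counts only, no documents
        "facet": {
            f"by_{field}": {"type": "terms", "field": field, "limit": limit},
        },
    }


print(json.dumps(terms_facet_request("gender"), indent=2))
```

Posted to the collection's `/query` endpoint, this would return one bucket per distinct value of the field together with its document count.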


Page 24: Real World Analytics with Solr Cloud and Spark


Function Facets Aggregate Fields.


http://yonik.com/solr-facet-functions/
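As a sketch of the idea (the slide's own example is not in the transcript), function facets in the JSON Facet API are written as strings such as `avg(income)`. The helper below builds such a request body; the field name `income` is an assumption taken from the insurance example.

```python
import json


def stats_facet_request(field, functions=("avg", "min", "max", "sum")):
    """JSON request body that applies aggregation functions to one numeric field."""
    return {
        "query": "*:*",
        "limit": 0,  # aggregates only, no documents
        "facet": {f"{fn}_{field}": f"{fn}({field})" for fn in functions},
    }


print(json.dumps(stats_facet_request("income"), indent=2))
```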

Page 25: Real World Analytics with Solr Cloud and Spark


Pivot Facets Compose Facets into Hierarchies.
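The slide's request is not preserved here. One way to get such a hierarchy is to nest terms facets in the JSON Facet API (the classic query parameter for this is `facet.pivot`). The sketch below builds a nested request; the field names `gender` and `education` are assumptions from the insurance example.

```python
import json


def nested_terms_facet(fields):
    """Nest one terms facet per field, outermost field first."""
    facet = None
    for field in reversed(fields):
        node = {"type": "terms", "field": field}
        if facet is not None:
            node["facet"] = facet  # inner level becomes a sub-facet of this bucket
        facet = {f"by_{field}": node}
    return {"query": "*:*", "limit": 0, "facet": facet}


print(json.dumps(nested_terms_facet(["gender", "education"]), indent=2))
```

Each `gender` bucket in the response would then contain its own `by_education` buckets.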


Page 26: Real World Analytics with Solr Cloud and Spark


Solr 6 Supports SQL

■ Solr 6 supports distributed SQL
■ The JDBC driver is part of the SolrJ client library
■ A collection is currently mapped as a single table
  ■ Collection → Table
  ■ SolrDocument → Row
  ■ Field → Column
■ SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions
  ■ No database metadata, no prepared statements, no mapping to tables per type field
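Besides JDBC, Solr 6 also accepts SQL over HTTP through the `/sql` request handler, with the statement passed in the `stmt` parameter. The sketch below builds such a request without sending it; the host, the collection `bigdata2016` and the field `gender` are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sql_request(solr_url, collection, stmt, aggregation_mode="facet"):
    """Build a POST request for the /sql handler of a collection."""
    params = urlencode({"aggregationMode": aggregation_mode})
    url = f"{solr_url}/solr/{collection}/sql?{params}"
    body = urlencode({"stmt": stmt}).encode("utf-8")  # form-encoded statement
    return Request(url, data=body, method="POST")


# The collection name doubles as the table name (Collection -> Table).
stmt = "SELECT gender, count(*) FROM bigdata2016 GROUP BY gender"
sql_req = sql_request("http://localhost:8983", "bigdata2016", stmt)
print(sql_req.full_url)
print(sql_req.data.decode("utf-8"))
```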

Page 27: Real World Analytics with Solr Cloud and Spark


Resilience

■ The number of replicas per shard is configurable (replication factor)
■ This number determines how many nodes can fail silently
■ Zookeeper is the single point of failure, but can be made failsafe by running multiple instances
■ Solr knows all Zookeeper instances and can silently switch over to the next available one if the Zookeeper it was last connected to crashes

Page 28: Real World Analytics with Solr Cloud and Spark


You've Got Everything You Need! Or Not?

■ Client-side processing of Solr documents does not scale
■ No possibility to run parallel business logic inside Solr
■ The Solr index is not a general-purpose store for huge data
  ■ Images
  ■ Videos
  ■ Binaries / large text documents
■ No interface to machine learning or typical statistics libraries (R) ...

Page 29: Real World Analytics with Solr Cloud and Spark

Distributed In-Memory Computing with Apache Spark

[Diagram: data flow with Spark (Filter, Search, Map, Reduce) on top of Search / NoSQL]

Page 30: Real World Analytics with Solr Cloud and Spark


■ Distributed computing (up to 100x faster than Hadoop MapReduce)
■ Distributed Map/Reduce on distributed data can be done in memory
■ Written in Scala (JVM)
■ Java/Scala/Python APIs
■ Processes data from distributed and non-distributed sources
  ■ Text files (accessible from all nodes)
  ■ Hadoop File System (HDFS)
  ■ Databases (JDBC)
  ■ Solr via the Lucidworks API
  ■ ...

READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 31: Real World Analytics with Solr Cloud and Spark


[Diagram: Spark cluster architecture. On the driver node, the driver application creates a Spark context (configured with a master URL) and uses resilient distributed datasets (RDDs), which are split into partitions. The master (standalone master, YARN or Mesos) runs in a JVM on the master host. Each slave host in the cluster runs a worker JVM that starts an executor JVM, and the executors run tasks on the partitions.]

Page 32: Real World Analytics with Solr Cloud and Spark


A Very First Spark Application

Page 33: Real World Analytics with Solr Cloud and Spark


Spark Pattern 1: Distributed Task with Params

Page 34: Real World Analytics with Solr Cloud and Spark


Spark Pattern 2: Distributed Read from External Sources

Page 35: Real World Analytics with Solr Cloud and Spark


Spark Pattern 3: Caching and Further Processing with RDDs

Page 36: Real World Analytics with Solr Cloud and Spark


DEMO

Page 37: Real World Analytics with Solr Cloud and Spark

Putting It All Together: Solr & Spark in Action

[Diagram: data flow with Spark (Filter, Search, Map, Reduce) on top of Search / NoSQL]

Page 38: Real World Analytics with Solr Cloud and Spark


How to implement readFromShard()?

■ Several possibilities:
  ■ SolrJ: SolrStream
    ■ The /export handler can stream bulk data from Solr
    ■ Supports JSON export only (no binary format!)
  ■ Or: SolrJ cursor marks
  ■ Or: custom export handler

http://localhost:8983/solr/jax2016/export?q=*:*&sort=id%20asc&fl=id&wt=xml
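To illustrate the cursor-mark option, here is a language-neutral sketch of the paging loop, written in Python rather than SolrJ. Solr's cursor paging starts with `cursorMark=*`, requires a sort that includes the unique key, and is finished when `nextCursorMark` equals the cursor that was sent. The `fetch_page` callable and the in-memory `make_fake_fetch` stand-in are hypothetical helpers for this example; a real implementation would issue an HTTP request against a shard.

```python
def iter_docs_with_cursor(fetch_page, query="*:*", sort="id asc", rows=1000):
    """Yield all documents matching `query`, one cursor page at a time."""
    cursor = "*"  # initial cursor mark
    while True:
        page = fetch_page({"q": query, "sort": sort, "rows": rows, "cursorMark": cursor})
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: no more results
            break
        cursor = next_cursor


def make_fake_fetch(ids):
    """In-memory stand-in for a Solr shard (hypothetical, for demonstration only)."""
    ordered = sorted(ids)

    def fetch(params):
        cursor = params["cursorMark"]
        remaining = ordered if cursor == "*" else [i for i in ordered if i > cursor]
        docs = remaining[: params["rows"]]
        next_cursor = docs[-1] if docs else cursor
        return {"response": {"docs": [{"id": i} for i in docs]},
                "nextCursorMark": next_cursor}

    return fetch


fetch = make_fake_fetch(["c", "a", "e", "b", "d"])
print([doc["id"] for doc in iter_docs_with_cursor(fetch, rows=2)])
```

Because the generator yields documents lazily, a readFromShard() built this way never has to hold a whole shard in memory.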

Page 39: Real World Analytics with Solr Cloud and Spark


Lucidworks has released a Spark/Solr integration library.

https://github.com/lucidworks/spark-solr

Page 40: Real World Analytics with Solr Cloud and Spark


Lucidworks Solr-Spark Adapter V 2.1

Page 41: Real World Analytics with Solr Cloud and Spark


Logfile Analytics with Solr and Spark

■ Histogram of all exceptions from hosts A, B, C during time interval D
■ Step 1: Search with Solr
  ■ Solr query: q=*Exception AND (server:A OR server:B OR server:C) AND timestamp between [1.1.2015, 31.12.2015]
■ Step 2: Create a map with key = << exception name >>, value = count
  ■ Group with Spark
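The two steps can be sketched locally without a cluster. The code below is not Spark code: it uses plain Python as a stand-in for the map and reduce stages, with each inner list playing the role of one partition. The document field `exception` and the ISO timestamps are assumptions; `server` and `timestamp` come from the slide's query, with the time interval written in Lucene range syntax.

```python
from collections import Counter
from functools import reduce


def build_query(servers, start, end):
    """Step 1: Solr query for exceptions on the given servers within [start, end]."""
    server_clause = " OR ".join(f"server:{s}" for s in servers)
    return f"*Exception AND ({server_clause}) AND timestamp:[{start} TO {end}]"


def count_partition(docs):
    """Map stage: count exception names within one partition."""
    return Counter(doc["exception"] for doc in docs)


def histogram(partitions):
    """Reduce stage: merge the per-partition counts into one histogram."""
    return reduce(lambda a, b: a + b, map(count_partition, partitions), Counter())


query = build_query(["A", "B", "C"], "2015-01-01T00:00:00Z", "2015-12-31T23:59:59Z")
print(query)

# Hypothetical search results, already split into two partitions.
partitions = [
    [{"exception": "NullPointerException"}, {"exception": "IOException"}],
    [{"exception": "NullPointerException"}],
]
print(dict(histogram(partitions)))
```

In Spark, the same shape would typically be a map to (exception name, 1) pairs followed by a reduce by key.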

Page 42: Real World Analytics with Solr Cloud and Spark


Page 43: Real World Analytics with Solr Cloud and Spark


DEMO


Page 44: Real World Analytics with Solr Cloud and Spark


Specifications – Intel NUC6i5SYK

CPU:   6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz, up to 2.8 GHz Turbo, dual core, 4 MB cache, 15 W TDP)
RAM:   32 GB dual-channel DDR4 SODIMMs, 1.2 V, 2133 MHz
Disk:  256 GB Samsung M.2 internal SSD
Total: 8 cores, 16 HT units, 128 GB RAM, 1 TB disk

→ This case is as powerful as four notebooks

Page 45: Real World Analytics with Solr Cloud and Spark


Technical Cluster Architecture

[Diagram: four Ubuntu Linux nodes, numbered 1 to 4. Each node runs Solr Cloud and Spark (master JVM, slave JVM, executor JVM). Zookeeper instances #1 to #3 run on three of the nodes, Zeppelin on two of them. The collection has 16 shards (s1 to s16), shown in four groups of four. HDFS is also part of the picture.]

Page 46: Real World Analytics with Solr Cloud and Spark


You Can Even Run Solr Cloud and Spark on $70 Odroid XU4 ARM Computers

■ 8 cores
■ Approx. 1/10 of the CPU performance of the Intel NUC 6 / Core i5

Page 47: Real World Analytics with Solr Cloud and Spark

[Diagram: cluster of five Odroid XU4 boards (each 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70). Four boards run a Spark worker and Solr 5.3; the fifth runs the Spark master, a Spark worker, Solr 5.3 and Zookeeper. Total: 40 cores, 10 GB RAM, 320 GB eMMC disk.]

Page 48: Real World Analytics with Solr Cloud and Spark


Page 49: Real World Analytics with Solr Cloud and Spark


Summary

■ Solr Cloud and Spark are a powerful combination for interactive analytics and data-intensive applications
■ Writing distributed software remains hard. Only distribute if you have to.
■ 100% open source
■ A simple integration of Solr and Spark is easy. For high-performance applications, things can get more complicated.
■ If professional product support is needed, customers can switch to Lucidworks Fusion to get a pre-integrated and supported Solr/Spark platform

Page 50: Real World Analytics with Solr Cloud and Spark


@JohannesWeigend, @qaware

slideshare.net/qaware

blog.qaware.de

Page 51: Real World Analytics with Solr Cloud and Spark
