
  • Real-World Analytics with Solr Cloud and Spark: Solving Analytic Problems for Billions of Records Within Seconds

    Vancouver, May 2016 | Johannes Weigend | QAware GmbH

  • Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | QAware GmbH

    Any questions? Ask, or tweet with the hashtag #cloudnativenerd

  • The Problem We Want to Solve

    Interactive applications with response times below one second

    Processing of billions of records (> 10^9 rows)

    Continuous data import (near real time)

    Applications built on the principles of the Reactive Manifesto

  • Horizontal Scalability Can Be Difficult!

    Horizontal scalability of functions

    Trivial: load balancing of (stateless) services (macro- / microservices). More users → more machines

    Not trivial: more machines → faster response times

    Horizontal scalability of data

    Trivial: linear distribution of data across multiple machines. More machines → more data

    Not trivial: constant response times with growing datasets

  • Hadoop Gives Answers for Horizontal Scalability of Data and Functions

  • The Processing of Distributed Data Can Be Quite Slow!

    [Diagram: data flow. Every node reads its data from HDFS / NFS / NoSQL, then filters and maps it; the partial results are reduced and consumed with foreach(). Runtime: minutes to hours.]

  • With Prior Indexing and Searching, Less Data Has to Be Read and Filtered.

    [Diagram: data flow. Every node searches an index (Search / NoSQL) instead of reading all data; the hits are filtered, mapped, reduced, and consumed with foreach(). Runtime: seconds to minutes.]
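The difference between the two data flows can be illustrated with a small, self-contained sketch. It is neither Solr nor Spark code; the records and the field names `type` and `amount` are made up for illustration. The first function reads and filters every record, while the second uses a prebuilt inverted index so that only matching records are touched.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "type": "order", "amount": 10},
    {"id": 2, "type": "customer", "amount": 0},
    {"id": 3, "type": "order", "amount": 32},
    {"id": 4, "type": "order", "amount": 5},
]


def total_by_scan(records, wanted_type):
    """Read every record, filter it, map it to its amount, reduce with a sum."""
    touched = 0
    total = 0
    for record in records:                   # read
        touched += 1
        if record["type"] == wanted_type:    # filter
            total += record["amount"]        # map + reduce
    return total, touched


def build_index(records, field):
    """Build an inverted index once: field value -> positions of the records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


def total_by_index(records, index, wanted_type):
    """Look up the matching positions in the index, then map and reduce."""
    positions = index.get(wanted_type, [])
    total = sum(records[p]["amount"] for p in positions)
    return total, len(positions)
```

Both functions return the same total, but the second element of the result (the number of records touched) is smaller for the indexed variant. The one-time cost of `build_index` is paid at import time rather than at query time.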

  • [Diagram: layered architecture. Distributed data (Search) at the bottom, cluster processing with Spark (Map, Reduce) above it, then the business layer and the frontend.]

  • DEMO

  • 1. Solr Cloud for Analytics

    [Diagram: the data flow from before (Search, Filter, Map, Reduce) with the Search / NoSQL layer highlighted below Spark.]

  • Solr Cloud

    Document-based NoSQL database with outstanding search capabilities

    A document is a collection of fields (string, number, date, ...)

    Single-valued and multi-valued fields (fields can be arrays)

    Nested documents

    Static and dynamic schema

    Powerful query language (Lucene)

    Horizontally scalable with Solr Cloud: data is distributed across separate shards

    Resilience through the combination of ZooKeeper and replication

    Powerful aggregations (aka facets); stable > V 6.0

  • The Architecture of Solr Cloud

    [Diagram: a ZooKeeper cluster of three ZooKeeper nodes coordinates several Solr servers (Solr Cloud). A collection is split into shards (Shard1 to Shard9); each shard has a leader and replicas (Replica1 to Replica9). Adding Solr servers scales the cluster out.]

  • Solr Stores Everything in a Single Table (BigTable). Searching Is Extremely Fast and Powerful.*

    [Diagram: a Customer (Name, Address) has a 1-to-many relation to Order (Amount, Product).]

    Type      ID  Name  Address  Amount  Product  K2B
    Customer  1   K1    A1       -       -        [3,5]
    Customer  2   K2    A2       -       -        [4]
    Order     3   -     -        Z1      P1       [1]
    Order     4   -     -        Z2      P2       [2]
    ...

    Each row is stored as a SolrDocument.

    (*) With 100 million documents per shard, runtimes of queries and aggregations are normally less than 100 ms.
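A minimal sketch of the single-table idea, in plain Python rather than Solr: customers and orders live in one flat list of documents and are told apart by a `type` field. The lowercase key names are illustrative, and reading the K2B column as a list of IDs of related documents is an assumption; the slide does not explain the column.

```python
# Hypothetical documents mirroring the table above. Customers and orders
# share one collection; "k2b" is assumed to hold the IDs of related documents.
DOCS = [
    {"type": "Customer", "id": 1, "name": "K1", "address": "A1", "k2b": [3, 5]},
    {"type": "Customer", "id": 2, "name": "K2", "address": "A2", "k2b": [4]},
    {"type": "Order", "id": 3, "amount": "Z1", "product": "P1", "k2b": [1]},
    {"type": "Order", "id": 4, "amount": "Z2", "product": "P2", "k2b": [2]},
]


def orders_of(docs, customer_id):
    """Resolve a customer's references to the order documents that exist."""
    by_id = {doc["id"]: doc for doc in docs}
    customer = by_id[customer_id]
    return [
        by_id[ref]
        for ref in customer["k2b"]
        if ref in by_id and by_id[ref]["type"] == "Order"
    ]
```

Customer 1 references orders 3 and 5, but order 5 is not among the rows shown, so `orders_of(DOCS, 1)` only yields order 3.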

  • A Solr Cloud Can Be Started in Seconds.

    Create a schema by reusing an existing set of Solr config files. There are examples in the installation directory $SOLR_HOME/server/solr/configsets which can be copied and modified:

        cp -r $SOLR_HOME/server/solr/configsets/basic_configs \
            $SOLR_HOME/server/solr/configsets/bigdata2016

    Start Solr. When the wizard asks for a collection name, use bigdata2016:

        $SOLR_HOME/bin/solr start -e cloud

    Make a first test:

        curl 'localhost:8983/solr/bigdata2016/query?q=*:*'

  • With the Solr Cloud Collections API, Shards Can Be Created, Changed, or Deleted.

    Create a collection:

        /solr/admin/collections?action=CREATE&name=&numShards=16&replicationFactor=2&maxShardsPerNode=8&collection.configName=

    Delete a collection:

        /solr/admin/collections?action=DELETE&name=

    https://cwiki.apache.org/confluence/display/solr/Collections+API
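The Collections API calls are plain HTTP requests, so the URLs can be assembled with the standard library. The sketch below only builds the URLs and sends nothing. The host `http://localhost:8983` and the collection and config name `bigdata2016` are assumptions for illustration; the slide leaves both name values empty.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/solr/admin/collections?{query}"


# Assumed host and names; the shard and replica numbers are those from the slide.
CREATE_URL = collections_api_url(
    "http://localhost:8983",
    "CREATE",
    {
        "name": "bigdata2016",
        "numShards": 16,
        "replicationFactor": 2,
        "maxShardsPerNode": 8,
        "collection.configName": "bigdata2016",
    },
)

DELETE_URL = collections_api_url(
    "http://localhost:8983", "DELETE", {"name": "bigdata2016"}
)
```

The resulting strings can be passed to `curl` or `urllib.request.urlopen` against a running cluster.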

  • ZooKeeper Has to Be Started First and the Solr Configuration Must Be Uploaded to Use a Solr Cloud.

    1. Start ZooKeeper on 2n+1 nodes (odd number):

        $ZOO_HOME/bin/zkServer.sh start

    2. Upload the Solr configuration into ZooKeeper:

        $SOLR_HOME/server/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102 -confname ekgdata -solrhome /opt/solr/server/solr -confdir /opt/solr/server/solr/configsets/ekgdata_configs/conf

    3. Start Solr on n nodes connected to the ZooKeeper cluster:

        $SOLR_HOME/bin/solr start -c -z 192.168.1.100:2181,192.168.1.101:2181,192.168.1.102

    4. Create a collection with a number of shards and replicas
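The reason for running ZooKeeper on 2n+1 nodes is its majority quorum: the ensemble stays available only while more than half of its nodes are up. A short sketch of that arithmetic (the function names are illustrative):

```python
def zookeeper_quorum(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Number of nodes that may fail while a majority is still alive."""
    return ensemble_size - zookeeper_quorum(ensemble_size)
```

With 3 nodes one failure is tolerated, and with 4 nodes still only one, so an even ensemble size adds a machine without adding fault tolerance. An ensemble of 2n+1 nodes tolerates n failures.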

  • Example: Solr Cloud for Analytics of Insurance Data

    Insurance sample data with the following fields: Education, Income, Gender, ...

  • DEMO

  • Solr Supports JSON Queries via HTTP POST
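The slide's own example is not preserved in the transcript. As a hedged sketch, a JSON query is a JSON body posted to the collection's `/query` endpoint. The code below only constructs the request object and does not send it; the host and the collection name `bigdata2016` are assumptions.

```python
import json
from urllib.request import Request


def json_query_request(base_url, collection, body):
    """Build (but do not send) an HTTP POST carrying a JSON query body."""
    data = json.dumps(body).encode("utf-8")
    return Request(
        f"{base_url}/solr/{collection}/query",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Match all documents and return the first ten.
MATCH_ALL = {"query": "*:*", "limit": 10}
REQUEST = json_query_request("http://localhost:8983", "bigdata2016", MATCH_ALL)
```

Against a running cluster, `urllib.request.urlopen(REQUEST)` would execute the query and return the JSON response.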

  • Term Facets Group and Count a Single Field.
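The slide's example is not preserved, so the following is a sketch under assumptions. `TERM_FACET_REQUEST` is a JSON Facet API body that groups by the field `Education` (the field name is taken from the insurance sample data; the facet label `by_education` is made up). The function `term_facet` emulates locally what such a facet computes, using invented sample documents.

```python
from collections import Counter

# JSON request body: no documents returned (limit 0), only the facet buckets.
TERM_FACET_REQUEST = {
    "query": "*:*",
    "limit": 0,
    "facet": {"by_education": {"type": "terms", "field": "Education"}},
}

# Invented sample documents for the local emulation.
SAMPLE_DOCS = [
    {"Education": "Bachelor"},
    {"Education": "Master"},
    {"Education": "Bachelor"},
]


def term_facet(docs, field):
    """Group documents by one field and count each value, largest bucket first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"val": value, "count": count} for value, count in counts.most_common()]
```

The bucket shape (`val`, `count`) follows the layout of JSON facet responses; everything else in the emulation is simplified.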

  • Function Facets Aggregate Fields.

    http://yonik.com/solr-facet-functions/
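As a hedged illustration (the slide's example is not preserved), function facets are written as expression strings such as `avg(Income)` in the `facet` section of a JSON request. The field name `Income` comes from the insurance sample data; the labels and the sample values are invented. `function_facets` emulates the two aggregations locally.

```python
# JSON request body with two aggregation functions over the Income field.
FUNCTION_FACET_REQUEST = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "avg_income": "avg(Income)",
        "max_income": "max(Income)",
    },
}

# Invented sample documents for the local emulation.
SAMPLE_DOCS = [
    {"Income": 30000},
    {"Income": 50000},
]


def function_facets(docs, field):
    """Compute the average and maximum of a numeric field over all documents."""
    values = [doc[field] for doc in docs if field in doc]
    if not values:
        return {}
    return {
        f"avg_{field.lower()}": sum(values) / len(values),
        f"max_{field.lower()}": max(values),
    }
```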

  • Pivot Facets Compose Facets into Hierarchies.
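The slide's example is not preserved. One way to express such a hierarchy, shown here as an assumption rather than as the speaker's query, is to nest one terms facet inside another in a JSON request. The field names `Education` and `Gender` come from the insurance sample data; the labels and sample values are invented. `pivot_counts` emulates the two-level grouping locally.

```python
from collections import defaultdict

# JSON request body: group by Education, then by Gender within each bucket.
PIVOT_FACET_REQUEST = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "by_education": {
            "type": "terms",
            "field": "Education",
            "facet": {"by_gender": {"type": "terms", "field": "Gender"}},
        }
    },
}

# Invented sample documents for the local emulation.
SAMPLE_DOCS = [
    {"Education": "Bachelor", "Gender": "F"},
    {"Education": "Bachelor", "Gender": "M"},
    {"Education": "Bachelor", "Gender": "F"},
    {"Education": "Master", "Gender": "M"},
]


def pivot_counts(docs, outer_field, inner_field):
    """Count documents per outer value and, within it, per inner value."""
    result = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        result[doc[outer_field]][doc[inner_field]] += 1
    return {outer: dict(inner) for outer, inner in result.items()}
```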

  • Solr 6 Supports SQL

    Solr 6 supports distributed SQL

    The JDBC driver is part of the SolrJ client library

    A collection is currently mapped as a single table:

    Collection -> Table

    SolrDocument -> Row

    Field -> Column

    SQL support in Solr 6.0 is limited, but more functionality is expected in upcoming versions

    No database metadata, no prepared statements, no mapping to tables per type field
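Besides JDBC, a SQL statement can also be posted over HTTP to the collection's `/sql` handler in the `stmt` parameter. The sketch below only builds the request and does not send it. The host, the collection name `bigdata2016` (which doubles as the table name under the mapping above), and the statement itself are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Collection -> table, field -> column: group the documents by Education.
STATEMENT = "SELECT Education, count(*) FROM bigdata2016 GROUP BY Education"


def sql_request(base_url, collection, statement):
    """Build (but do not send) a form-encoded POST to the /sql handler."""
    data = urlencode({"stmt": statement}).encode("ascii")
    return Request(f"{base_url}/solr/{collection}/sql", data=data, method="POST")


REQUEST = sql_request("http://localhost:8983", "bigdata2016", STATEMENT)
```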

  • Resilience

    The number of replicas per shard is configurable (replication factor)

    This number determines how many nodes can fail silently

    ZooKeeper is the single point of failure, but it can be made fail-safe by running multiple instances

    Solr knows all ZooKeeper instances and can silently switch over to the next available one if the instance it was last connected to crashes
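How the replication factor relates to node failures can be sketched with a toy model. It assumes that a shard stays reachable as long as at least one node hosting one of its replicas is alive; the placement and node names are invented, and real Solr Cloud behavior (leader election, recovery) is not modeled.

```python
# Hypothetical placement with replication factor 2: shard -> hosting nodes.
PLACEMENT = {
    "shard1": ["node1", "node2"],
    "shard2": ["node2", "node3"],
}


def available_shards(placement, failed_nodes):
    """Map each shard to True if at least one of its replica nodes survives."""
    failed = set(failed_nodes)
    return {shard: bool(set(nodes) - failed) for shard, nodes in placement.items()}
```

In this model every shard survives the loss of any single node. Losing both `node1` and `node2` takes `shard1` offline while `shard2` remains available on `node3`.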

  • You Got Everything You Need! Or Not?

    Client side proc