big data open source software and projects abds in summary i: layers 1 to 2 data science curriculum...

10
Big Data Open Source Software and Projects ABDS in Summary I: Layers 1 to 2 Data Science Curriculum March 1 2015 Geoffrey Fox [email protected] http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington

Upload: simon-williamson

Post on 01-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Big Data Open Source Software and Projects

ABDS in Summary I: Layers 1 to 2Data Science Curriculum

March 1 2015

Geoffrey Fox [email protected]             http://www.infomall.org

School of Informatics and ComputingDigital Science Center

Indiana University Bloomington

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies Cross-Cutting

Functions

1) Message and Data Protocols: Avro, Thrift, Protobuf

2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB

11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Yarcdata, AllegroGraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage

7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker, Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive 5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds, Networking: Google Cloud DNS, Amazon Route 53

 

21 layers Over 300 Software Packages April 3 2015

Functionality of 21 HPC-ABDS Layers1) Message Protocols:2) Distributed Coordination:3) Security & Privacy:4) Monitoring: 5) IaaS Management from HPC to hypervisors:6) DevOps: 7) Interoperability:8) File systems: 9) Cluster Resource Management: 10) Data Transport: 11) A) File management

B) NoSQLC) SQL 

12) In-memory databases&caches / Object-relational mapping / Extraction Tools13) Inter process communication Collectives, point-to-point, publish-subscribe, MPI:14) A) Basic Programming model and runtime, SPMD, MapReduce:

B) Streaming:15) A) High level Programming: 

B) Application Hosting Frameworks16) Application and Analytics: 17)Workflow-Orchestration:

Here are 21 functionalities. (including 11, 14, 15 subparts)

4 Cross cutting at top17 in order of layered diagram starting at bottom

Functionality of 21 HPC-ABDS Layers1) Message Protocols:2) Distributed Coordination:3) Security & Privacy:4) Monitoring: 5) IaaS Management from HPC to hypervisors:6) DevOps: 7) Interoperability:8) File systems: 9) Cluster Resource Management: 10) Data Transport: 11) A) File management

B) NoSQLC) SQL 

12) In-memory databases&caches / Object-relational mapping / Extraction Tools13) Inter process communication Collectives, point-to-point, publish-subscribe, MPI:14) A) Basic Programming model and runtime, SPMD, MapReduce:

B) Streaming:15) A) High level Programming: 

B) Application Hosting Frameworks16) Application and Analytics: 17)Workflow-Orchestration:

Here are 21 functionalities. (including 11, 14, 15 subparts)

4 Cross cutting at top17 in order of layered diagram starting at bottom

Apache Thrift• http://en.wikipedia.org/wiki/Apache_Thrift• Thrift is an interface definition language and 

binary communication protocol that is used to define and create services for numerous languages. 

• It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development". 

• It combines a software stack with a code generation engine to build services that work efficiently to a varying degree and seamlessly between C#, C++ (on POSIX-compliant systems), Cappuccino, Cocoa, Delphi, Erlang, Go, Haskell, Java, Node.js, OCaml, Perl, PHP, Python, Ruby and Smalltalk

• Note this type of capability augmented by serializers such as Java Kyro

Google Protobuf (Protocol Buffers)

• http://en.wikipedia.org/wiki/Protocol_Buffers• Protocol Buffers are a way of encoding structured data

 in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.

• Protocol Buffers are a method of serializing structured data. As such, they are useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates from that description source code in various programming languages for generating or parsing a stream of bytes that represents the structured data.

• Protocol Buffers are serialized into a binary wire format which is compact, forwards-compatible, and backwards-compatible, but not self-describing (that is, there is no way to tell the names, meaning, or full datatypes of fields without an external specification). 

• C++, Java, Python• Protocol Buffers are very similar to the Apache Thrift protocol (used by Facebook 

for example), except that the public Protocol Buffers implementation does not include a concrete RPC protocol stack to use for defined services.

Apache Avro• http://avro.apache.org/docs/current/  • Apache Avro relies on schemas defined with Json. When Avro data is read, the schema used 

when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

• When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

• When Avro is used in RPC, the client and server exchange schemas in the connection handshake.

• Avro differs from Thrift and Protocol Buffers in these ways– Dynamic typing: Avro does not require that code be generated. Data is always 

accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.

– Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.

– No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.

Functionality of 21 HPC-ABDS Layers1) Message Protocols:2) Distributed Coordination:3) Security & Privacy:4) Monitoring: 5) IaaS Management from HPC to hypervisors:6) DevOps: 7) Interoperability:8) File systems: 9) Cluster Resource Management: 10) Data Transport: 11) A) File management

B) NoSQLC) SQL 

12) In-memory databases&caches / Object-relational mapping / Extraction Tools13) Inter process communication Collectives, point-to-point, publish-subscribe, MPI:14) A) Basic Programming model and runtime, SPMD, MapReduce:

B) Streaming:15) A) High level Programming: 

B) Application Hosting Frameworks16) Application and Analytics: 17)Workflow-Orchestration:

Here are 21 functionalities. (including 11, 14, 15 subparts)

4 Cross cutting at top17 in order of layered diagram starting at bottom

Apache Zookeeper & Google Chubby• http://en.wikipedia.org/wiki/Apache_ZooKeeper• Important technology to provide reliable control

 metadata in distributed scalable systems• Zookeeper is a distributed configuration service, synchronization 

service, and naming registry for large distributed systems.• ZooKeeper was a sub project of Hadoop but is now a top-level project 

in its own right. Based  on Google Chubby• ZooKeeper's architecture supports high availability through redundant 

services. The clients can thus ask another ZooKeeper master if the first fails to answer. ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a trie (digital tree) datastructure. 

• Clients can read and write from/to the nodes and in this way have a shared configuration service. Updates are totally ordered.

• ZooKeeper is used by companies including Rackspace, Yahoo and eBay as well as open source enterprise search systems like Solr and Storm.

• See improved technology Giraffe http://grid.hust.edu.cn/xhshi/projects/giraffe.htm 

JGroups• http://en.wikipedia.org/wiki/JGroups• JGroups is a reliable multicast system written in the 

Java language and Open Source under LGPL• JGroups adds a "grouping" layer over a transport protocol, 

internally keeping a list of participants. This list is used to:– Make the application aware of the listeners– Make some or all transmissions reliable– Allow totally ordered transmissions

• JGroups is a toolkit for reliable multicast communication. It can be used to create groups of processes whose members can send messages to each other. JGroups enables developers to create reliable multipoint (multicast) applications where reliability is a deployment issue. JGroups also relieves the application developer from implementing this logic themselves. This saves significant development time and allows for the application to be deployed in different environments without having to change code

• The most powerful feature of JGroups is its flexible protocol stack, which allows developers to adapt it to exactly match their application requirements and network characteristics. The benefit of this is that you only pay for what you use. By mixing and matching protocols, various differing application requirements can be satisfied. JGroups comes with a number of protocols UDP (IP Multicast), TCP, JMS (but anyone can write their own).