

    A HORTONWORKS WHITE PAPER

FEBRUARY 2016

HORTONWORKS DATAFLOW: Accelerating Big Data Collection and DataFlow Management


Contents

What is Hortonworks DataFlow?

Benefits of Hortonworks DataFlow
• Leverage Operational Efficiency
• Make Better Business Decisions
• Increase Data Security

Features of Hortonworks DataFlow
• Data Collection
• Real-Time Decisions
• Operational Efficiency
• Security and Provenance
• Bi-directional DataFlow

Common Applications of Hortonworks DataFlow
• Accelerated Data Collection and Operational Effectiveness
• Increased Security and Unprecedented Chain of Custody
• The Internet of Any Thing with Hortonworks DataFlow
• Adaptive to Resource Constraints
• Secure Data Collection
• Prioritized Data Transfer and Bi-directional Feedback Loop

Why Hortonworks for Apache™ Hadoop®?

About Hortonworks


    What is Hortonworks DataFlow?

Hortonworks DataFlow (HDF), powered by Apache™ NiFi, is the first integrated platform that solves the complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.

Hortonworks DataFlow is a single combined platform for data acquisition, simple event processing, transport and delivery, designed to accommodate the highly diverse and complicated dataflows generated by a world of connected people, systems and things.

An ideal solution for the Internet of Any Thing (IoAT), HDF enables simple, fast data acquisition, secure data transport, prioritized dataflow and clear traceability of data from the very edge of your network all the way to the core data center. Through a combination of an intuitive visual interface, a high-fidelity access and authorization mechanism and an "always on" chain of custody (data provenance) framework, HDF is the perfect complement to HDP to bring together historical and perishable insights for your business.

A single integrated platform for data acquisition, simple event processing, transport and delivery from source to storage.

Hortonworks DataFlow is based on Apache NiFi, technology originally created by the NSA (National Security Agency) in 2006 to address the challenge of automating the flow of data between systems of all types, the very same problem that enterprises are encountering today. After eight years of development and use at scale, the NSA Technology Transfer Program released NiFi to the Apache Software Foundation in the fall of 2014.

Hortonworks Data Platform can be used to enrich content and support changes to real-time dataflows. Hortonworks DataFlow enables the real-time collection and processing of perishable insights.

Figure 1: Hortonworks DataFlow is designed to securely collect and transport data from highly diverse data sources, be they big or small, fast or slow, always connected or intermittently available.


Benefits of Hortonworks DataFlow

Hortonworks DataFlow was designed from the ground up to meet the practical challenges of collecting data from a wide range of disparate data sources: securely, efficiently and over a geographically dispersed and possibly fragmented network. Because the NSA encountered many of the issues enterprises are facing now, the technology has been field-proven, with built-in capabilities for security, scalability, integration, reliability and extensibility, and it has a proven track record for operational usability and deployment.

HORTONWORKS DATAFLOW ENABLES ENTERPRISES TO:

Leverage Operational Efficiency

• Accelerate big data ROI via simplified data collection and a visually intuitive dataflow management interface
• Significantly reduce the cost and complexity of managing, maintaining and evolving dataflows
• Trace and verify the value of data sources for future investments
• Quickly adapt to new data sources through an extremely scalable, extensible platform

Make Better Business Decisions

• Make better business decisions with highly granular data sharing policies
• Focus on innovation by automating dataflow routing, management and troubleshooting without the need for coding
• Enable on-time, immediate decision making by leveraging real-time, bi-directional dataflows
• Increase business agility with prioritized data collection policies

Increase Data Security

• Support unprecedented yet simple-to-implement data security from source to storage
• Improve compliance and reduce risk through highly granular data access, data sharing and data usage policies
• Create a secure dataflow ecosystem with the ability to run the same security and encryption on small-scale, JVM-capable data sources as well as enterprise-class datacenters

At a glance:

• Accelerate big data ROI through a single, data-source agnostic collection platform
• Make better business decisions with highly granular data sharing policies
• Reduce cost and complexity through an intuitive, real-time visual user interface
• React in real time by leveraging bi-directional dataflows and prioritized data feeds
• Unprecedented yet simple-to-implement data security from source to storage
• Adapt to new data sources through an extremely scalable, extensible platform


    Features of Hortonworks DataFlow

DATA COLLECTION
Integrated collection from dynamic, disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes, such as machines, geo-location devices, click streams, files, social feeds, log files and videos.

REAL-TIME DECISIONS
Real-time evaluation of perishable insights at the edge, determining whether data is pertinent or not and executing upon the consequent decisions to send, drop or locally store data as needed.

OPERATIONAL EFFICIENCY
A fast, effective drag-and-drop interface for creation, management, tuning and troubleshooting of dataflows, enabling coding-free creation and adjustment of dataflows in five minutes or less.

SECURITY AND PROVENANCE
Secure end-to-end routing from source to destination, with discrete user authorization and detailed, real-time visual chain of custody and metadata (data provenance).

BI-DIRECTIONAL DATAFLOW
Reliable prioritization and transport of data in real time, leveraging bi-directional dataflows to dynamically adapt to fluctuations in data volume, network connectivity, and source and endpoint capacity.

COMMAND AND CONTROL
The immediate ability to create, change, tune, view, start, stop, trace, parse, filter, join, merge, transform, fork, clone or replay dataflows through a visual user interface with real-time operational visibility and feedback.

    Figure 2: Apache NiFi Real-Time Visual User Interface


    Common Applications of Hortonworks DataFlow

Hortonworks DataFlow accelerates time to insight by securely enabling off-the-shelf, flow-based programming for big data infrastructure and by simplifying the current complexity of secure data acquisition, ingestion and real-time analysis of distributed, disparate data sources.

An ideal framework for the collection of data and the management of dataflows, Hortonworks DataFlow is most commonly used for simplified, streamlined big data ingest; for increased security in the collection and sharing of data with high-fidelity chain-of-custody metadata; and as the underlying infrastructure for the Internet of Things.

CASE 1: ACCELERATED DATA COLLECTION AND OPERATIONAL EFFECTIVENESS

    Streamlined Big Data Ingestion

Hortonworks DataFlow accelerates big data pipeline ingest through a single, integrated and easily extensible visual interface for acquiring and ingesting data from different, disparate, distributed data sources in real time. The simplified and integrated creation, control and analysis of dataflows results in faster ROI on big data projects and increased operational effectiveness.

WHAT IS A COMMAND AND CONTROL INTERFACE?

A command and control interface provides the ability to manipulate the dataflow in real time, such that current contextual data can be fed back to the system to immediately change its output. This is in contrast to a design-and-deploy approach, which involves statically programming a dataflow system before the data is flowed through it, and then returning to a static programming phase to make adjustments and restarting the dataflow again. An analogy would be the difference between 3D printing, which requires pre-planning before execution, and molding clay, which provides immediate feedback and adjustment of the end product in real time.
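To make the contrast concrete, the sketch below models a flow whose routing rule can be swapped while records keep moving, versus a design-and-deploy system that would have to be stopped and reprogrammed. It is purely illustrative Python, not the NiFi API; the class and method names (LiveFlow, set_route_predicate) are hypothetical.

```python
# Illustrative sketch of "command and control" vs. "design and deploy".
# Hypothetical names; this is not the NiFi API.
from typing import Callable, Dict, List

class LiveFlow:
    """A toy dataflow whose routing rule can be changed while it runs."""

    def __init__(self, predicate: Callable[[Dict], bool]):
        self._predicate = predicate          # current routing decision
        self.sent: List[Dict] = []           # records routed downstream
        self.held: List[Dict] = []           # records kept back

    def set_route_predicate(self, predicate: Callable[[Dict], bool]) -> None:
        # Command and control: adjust behaviour in real time, no restart.
        self._predicate = predicate

    def process(self, record: Dict) -> None:
        (self.sent if self._predicate(record) else self.held).append(record)

flow = LiveFlow(lambda r: r["priority"] >= 5)   # initial rule
flow.process({"id": 1, "priority": 7})          # routed downstream
# Operator observes live feedback and tightens the rule on the fly:
flow.set_route_predicate(lambda r: r["priority"] >= 9)
flow.process({"id": 2, "priority": 7})          # now held instead of sent
```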

To learn more about Hortonworks DataFlow, go to http://hortonworks.com/hdf.

    Figure 2: An integrated, data source agnostic collection platform



    Why aren’t current data collection systems ideal?

Current big data collection and ingest tools are purpose-built and over-engineered, simply because they were not originally designed with universally applicable, operationally efficient design principles in mind. This creates a complex architecture of disparate acquisition, messaging and often customized transformation tools that make big data ingest complex, time consuming and expensive from both a deployment and a maintenance perspective. Further, the time lag associated with command-line and coding-dependent tools fetters access to data and prevents the on-time, operational decision making required of today's business environment.

Figure 3: Current big data ingest solutions are complex and operationally inefficient


CASE 2: INCREASED SECURITY AND UNPRECEDENTED CHAIN OF CUSTODY

Increased security and provenance with Hortonworks DataFlow

Data security is growing ever more important in a world of ever-connected devices, and the need to adhere to compliance and data security regulations is currently difficult, complex and costly. Verification of data access and usage is difficult and time consuming, and often involves a manual process of piecing together different systems and reports to verify where data is sourced from, how it is used, who has used it and how often.

The tools used for transporting electronic data today are not designed for the security requirements expected of tomorrow. It is difficult, if not almost impossible, for current tools to share discrete bits of data, much less to do so dynamically, a problem that had to be addressed in the environment of Apache NiFi as a dataflow platform used in governmental agencies.

Hortonworks DataFlow addresses the security and data provenance needs of an electronic world of distributed, real-time big data flow management. Hortonworks DataFlow augments existing systems with a secure, reliable, simplified and integrated big data ingestion platform that ensures data security from all sources, be they centrally located, high-volume data centers or remotely distributed data sources over geographically dispersed communication links. As part of its security features, HDF inherently provides end-to-end data provenance, a chain of custody for data. Beyond the ability to meet compliance regulations, provenance provides a method for tracing data from its point of origin, or from any point in the dataflow, in order to determine which data sources are most used and most valuable.

WHAT IS DATA PROVENANCE?

Provenance is defined as the place of origin or earliest known history of some thing. In the context of a dataflow, data provenance is the ability to trace the path of a piece of data within a dataflow from its place of creation through to its final destination. Within Hortonworks DataFlow, data provenance provides the ability to visually verify where data came from, how it was used, who viewed it, and whether it was sent, copied, transformed or received. Any system or person who came in contact with a specific piece of data is captured in its completeness in terms of time, date, action, precedents and dependents, for a complete picture of the chain of custody for that precise piece of data within a dataflow. This provenance metadata is used to support data-sharing compliance requirements as well as for dataflow troubleshooting and optimization.
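The sidebar above describes what a chain-of-custody record captures. A minimal sketch of such a record, with illustrative field names (this is not HDF's internal schema), might look like the following.

```python
# Minimal sketch of a chain-of-custody (provenance) record as described above.
# Field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class ProvenanceEvent:
    data_id: str                      # the specific piece of data
    action: str                       # e.g. CREATE, TRANSFORM, SEND, RECEIVE, CLONE
    actor: str                        # system or person that touched the data
    timestamp: datetime
    precedents: List[str] = field(default_factory=list)   # upstream data ids
    dependents: List[str] = field(default_factory=list)   # downstream data ids

# Tracing a record from origin to destination is then a walk over its events:
events = [
    ProvenanceEvent("rec-42", "CREATE",    "edge-sensor-7", datetime.now(timezone.utc)),
    ProvenanceEvent("rec-42", "TRANSFORM", "enrich-flow",   datetime.now(timezone.utc),
                    precedents=["rec-42"], dependents=["rec-42.enriched"]),
    ProvenanceEvent("rec-42", "SEND",      "core-cluster",  datetime.now(timezone.utc)),
]
custody_chain = [e for e in events if e.data_id == "rec-42"]  # full history of one record
```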

Figure 4: Secure from source to storage with high-fidelity data provenance


    Data democratization with unprecedented security

Hortonworks DataFlow opens new doors to business insights by enabling secure access without hindering analysis, because very specific data can be shared or not shared. For example, Mary could be given access to discrete pieces of data tagged with the term "finance" within a dataflow, while Jan could be given access to the same dataflow but only to data tagged with "2015" and "revenue". This removes the disadvantages of role-based data access, which can inadvertently create security risks, while still enabling democratization of data for comprehensive analysis and decision making.
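A minimal sketch of the tag-based sharing policy from the Mary/Jan example, assuming a hypothetical helper function rather than any HDF API:

```python
# Sketch of tag-based data sharing: each record carries tags, and a user
# sees only records containing every tag they have been granted.
# Hypothetical helper, not an HDF API.
from typing import Dict, List, Set

def visible_records(records: List[Dict], granted_tags: Set[str]) -> List[Dict]:
    """Return only records whose tags include every tag granted to the user."""
    return [r for r in records if granted_tags.issubset(r["tags"])]

records = [
    {"id": 1, "tags": {"finance", "2015", "revenue"}},
    {"id": 2, "tags": {"finance", "2014", "forecast"}},
    {"id": 3, "tags": {"operations", "2015", "revenue"}},
]

mary_view = visible_records(records, {"finance"})          # records 1 and 2
jan_view  = visible_records(records, {"2015", "revenue"})  # records 1 and 3
```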

Hortonworks DataFlow, with its inherent ability to support fine-grained provenance data and metadata throughout the collection, transport and ingest process, provides the comprehensive and detailed information needed for audit and remediation, unmatched by any existing data ingest system in place today.

    Figure 5: Fine-grained data access and control


CASE 3: THE INTERNET OF ANY THING (IOAT)

The Internet of Any Thing with Hortonworks DataFlow

Designed in the field, where resources such as power, connectivity and bandwidth are scarce, Hortonworks DataFlow is a scalable, proven platform for the acquisition and ingestion of data from the Internet of Things (IoT) or, even more broadly, the Internet of Any Thing (IoAT).

    Adaptive to Resource Constraints

There are many challenges in enabling an ever-connected yet physically dispersed Internet of Things. Data sources may often be remote, physical footprints may be limited, and power and bandwidth are likely to be both variable and constrained. Unreliable connectivity disrupts communication and causes data loss, while the lack of security on most of the world's deployed sensors puts businesses and safety at risk.

At the same time, devices are producing more data than ever before. Much of the data being produced is data-in-motion, and unlocking the business value of this data is crucial to the business transformations of the modern economy.

Yet business transformation relies on accurate, secure access to data from the source through to storage. Hortonworks DataFlow was designed with all of these real-world constraints in mind: power limitations, connectivity fluctuations, data security and traceability, and data source diversity and geographical distribution, all in support of accurate, time-sensitive decision making.

    Figure 6: A proven platform for the Internet of Things


    Secure Data Collection

Hortonworks DataFlow addresses the security needs of IoT with a secure, reliable, simplified and integrated big data collection platform that ensures data security from distributed data sources over geographically dispersed communication links. The security features of HDF include end-to-end data provenance, a chain of custody for data. This enables IoT systems to verify the origins of a dataflow, to troubleshoot problems from the point of origin through to the destination, and to determine which data sources are most frequently used and most valuable.

Hortonworks DataFlow is able to run security and encryption on small-scale, JVM-capable data sources as well as in enterprise-class datacenters. This provides the Internet of Things with a reliable, secure, common data collection and transport platform, with a real-time feedback loop to continually and immediately improve algorithms and analysis for accurate, informed, on-time decision making.
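As a generic illustration of encrypted, mutually authenticated transport from a small edge process to a central collector, the standard-library Python sketch below sets up two-way TLS. The host name and certificate paths are placeholders, and this is not HDF or NiFi configuration, just the general pattern of securing data from source to storage.

```python
# Generic sketch of encrypted, mutually authenticated edge-to-core transport.
# Placeholders throughout; not HDF/NiFi configuration.
import socket
import ssl

def edge_connection(host: str = "collector.example.com", port: int = 9443) -> ssl.SSLSocket:
    # Trust only the organization's CA when verifying the central collector.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="ca.pem")
    # Present a client certificate so the collector can authenticate the edge
    # device as well (two-way TLS), mirroring "security from source to storage".
    context.load_cert_chain(certfile="edge-device.pem", keyfile="edge-device.key")
    raw = socket.create_connection((host, port))
    return context.wrap_socket(raw, server_hostname=host)

# Example usage against a real collector endpoint:
# with edge_connection() as conn:
#     conn.sendall(b'{"sensor": "pump-14", "value": 0.82}\n')
```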

    Prioritized Data Transfer and Bi-directional Feedback Loop

Because connectivity and available bandwidth may fluctuate, and the volume of data being produced at the source may exceed what can be accepted by the destination, Hortonworks DataFlow supports the prioritization of data within a dataflow. This means that should there be resource constraints, the data source can be instructed to automatically promote the more important pieces of information to be sent first, while holding less important data for future windows of transmission opportunity, or possibly not sending it at all. For example, should an outage of a remote device occur, it is critical to send the "most important" data from that device first as the outage is repaired and communication is re-established. Once this critical "most important" data has been sent, it can then be followed by the backlog of lower-priority data that is vital to historical analysis but less critical to immediate decision making.

Hortonworks DataFlow enables the decision of whether to send, drop or locally store data to be made at the edge, as needed and as conditions change. Additionally, with a fine-grained command and control interface, data queues can be slowed down or accelerated to balance the demands of the situation at hand with the current availability and cost of resources.
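A small sketch of the edge-side logic described above, with hypothetical names and thresholds (not HDF's actual scheduler): the most important records are sent in the current transmission window, lower-priority records are held locally for a later window, and records below a floor priority are dropped.

```python
# Sketch of edge-side prioritization under a constrained transmission window.
# Hypothetical logic and names; not HDF's scheduler.
import heapq
from typing import Dict, List, Tuple

def plan_window(records: List[Dict], budget: int, drop_below: int = 1
                ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split records into (send_now, hold_locally, drop) for one window.

    budget     -- how many records fit in the current transmission window
    drop_below -- records with priority under this value are not worth keeping
    """
    keep = [r for r in records if r["priority"] >= drop_below]
    drop = [r for r in records if r["priority"] < drop_below]
    # heapq is a min-heap, so negate priority to pop the most important first.
    heap = [(-r["priority"], i, r) for i, r in enumerate(keep)]
    heapq.heapify(heap)
    send_now = [heapq.heappop(heap)[2] for _ in range(min(budget, len(heap)))]
    hold = [item[2] for item in heap]
    return send_now, hold, drop

outage_backlog = [
    {"id": "alarm-1",  "priority": 9},
    {"id": "metric-7", "priority": 3},
    {"id": "debug-2",  "priority": 0},
]
send_now, hold, drop = plan_window(outage_backlog, budget=1)
# send_now == [alarm-1]; metric-7 is held for the next window; debug-2 is dropped.
```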

With the ability to seamlessly adapt to resource constraints in real time, ensure secure data collection and prioritize data transfer, Hortonworks DataFlow is a proven platform ideal for the Internet of Things.


Why Hortonworks for Apache Hadoop?

Founded in 2011 by 24 engineers from the original Yahoo! Apache Hadoop development and operations team, Hortonworks has amassed more Apache Hadoop experience under one roof than any other organization. Our team members are active participants and leaders in Apache Hadoop, developing, designing, building and testing the core of the Apache Hadoop platform. We have years of experience in Apache Hadoop operations and are best suited to support your mission-critical Apache Hadoop project.

For an independent analysis of Hortonworks Data Platform and its leadership among Apache Hadoop vendors, you can download the Forrester Wave™: Big Data Apache Hadoop Solutions, Q1 2014 report from Forrester Research at http://info.hortonworks.com/ForresterWave_Hadoop.html.

About Hortonworks

Hortonworks is a leading innovator at creating, distributing and supporting enterprise-ready open data platforms. Our mission is to manage the world's data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi and Spark. Our open Connected Data Platforms power Modern Data Applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest. Along with our 1600+ partners, we provide the expertise, training and services that allow our customers to unlock the transformational value of data across any line of business. We are Powering the Future of Data™.

Contact
+1 408 675-0983
+1 855 8-HORTON
INTL: +44 (0) 20 3826 1405

For more information, visit www.hortonworks.com.

© 2011-2016 Hortonworks Inc. All Rights Reserved