5 pitfalls to avoid with hadoop

Upload: karthikeyan-balasubramaniam

Post on 03-Jun-2018


Intro: Maximizing the fourth V of Big Data
Pitfall #1: Hadoop is not a data integration tool
Pitfall #2: MapReduce programmers are hard to find
Pitfall #3: Most data integration tools don't run natively within Hadoop
Pitfall #4: Hadoop may cost more than you think
Pitfall #5: Elephants don't thrive in isolation
Benchmark
Conclusion

Intro: Maximizing the fourth V of Big Data

Traditional business intelligence architectures are struggling to efficiently process Big Data sets, particularly massive semi-structured and unstructured data, so it has been difficult to realize the full potential of Big Data. Hadoop allows organizations to overcome these architectural limitations in managing Big Data, but care needs to be taken to make the most of what Hadoop has to offer.

Big Data is commonly characterized with respect to the three Vs: high-volume, high-velocity, and high-variety data assets. But what really matters is the fourth V: value. Value is the positive impact on the business in terms of gaining actionable insight from massive amounts of data. Big Data can uncover significant value for organizations, for example: new revenue streams, new customer insights, improved decision making, better quality products, improved customer experience, and so on.

Hadoop has emerged as the de facto Big Data analytics operating system, helping to deal with the avalanche of data coming from logs, email, sensor devices, mobile devices, social media, and more. While business intelligence systems are typically the last stop in extracting value from Big Data, the first stop is commonly manipulation of the data in a process called Extract, Transform, Load (ETL). ETL is the process by which data is moved from source systems, manipulated into a consumable format, and loaded into a target system for advanced analytics, analysis, and reporting. In fact, industry analyst firm Gartner recognizes that most organizations will adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.
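As a concrete (and deliberately tiny) illustration of the extract-transform-load steps just described, the following Python sketch moves records out of a source, reshapes them into a consumable format, and loads them into a stand-in target; all names and the record layout are invented for illustration and are not any vendor's API:

```python
# Minimal ETL sketch: extract raw records, transform them into a
# consumable format, and load them into a target store (here an
# in-memory dict standing in for a data warehouse table).

def extract(source_lines):
    """Extract: parse raw delimited records from a source system."""
    return [line.strip().split(",") for line in source_lines if line.strip()]

def transform(rows):
    """Transform: cleanse, reformat, and convert types."""
    out = []
    for name, amount in rows:
        out.append({"customer": name.strip().title(),
                    "amount": round(float(amount), 2)})
    return out

def load(records, target):
    """Load: aggregate into the target table for analysis and reporting."""
    for rec in records:
        target[rec["customer"]] = target.get(rec["customer"], 0) + rec["amount"]
    return target

warehouse = {}
raw = ["alice,10.5", " bob , 3.25", "alice,2.0"]
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'Alice': 12.5, 'Bob': 3.25}
```

Real ETL pipelines differ mainly in scale and connectivity, not in this basic shape.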

However, as organizations begin to deploy this new framework, there are pitfalls to avoid in successfully performing ETL with Hadoop. Businesses first need to know the pitfalls, and then how to overcome the challenges. We will offer some guiding principles to address these challenges, as well as specific details on how to leverage Syncsort's data integration tool for Hadoop, DMX-h, to drive sustainable success with your Hadoop deployment.

Pitfall #1: Hadoop is not a data integration tool

A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain ETL jobs. Typical capabilities of a data integration tool include: an intuitive graphical interface; pre-built data transformation functions (aggregations, joins, change data capture [CDC], cleansing, filtering, reformatting, lookups, data type conversions, and so on); metadata management to enable re-use and data lineage; powerful connectivity to source and target systems; and advanced features to make data integration easily accessible to data analysts.

Although the primary use case of Hadoop is ETL, Hadoop is not a data integration tool itself. Rather, Hadoop is a reliable, scale-out parallel processing framework, meaning servers (nodes) can be easily added as workloads increase. It frees the programmer from concerns about how to physically manage large data sets when spreading processing across multiple nodes. There is a rich ecosystem of Hadoop utilities that can be used to create ETL jobs, but they are all separately evolving projects that require specific, new skills. For example, Sqoop development (moving data into and out of HDFS from relational databases) requires programmers skilled in the Sqoop command-line syntax. Flume is used for moving data from a variety of systems into Hadoop; Oozie helps with workflows; and Pig is a scripting platform for more easily creating Hadoop jobs. However, they all require substantial hand-coding, as well as specialized skills and knowledge of Hadoop and MapReduce.

Finally, basic ETL operations such as data transformations are easy within a mature data integration tool. However, trying to accomplish the same task with Hadoop alone can quickly become complex and take considerable expertise and effort. For example, building a simple CDC process can easily translate into hundreds of lines of code that not only take several days to develop, but also require resources to maintain and tune as needs evolve. A preferred approach is to use a data integration tool that makes it easy to create and maintain Hadoop ETL jobs.
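To see why a hand-coded CDC job balloons in size, here is the core comparison logic as a hedged Python sketch: two snapshots are matched by primary key, and each record is classified as an insert, update, or delete. In MapReduce the same logic must be expressed as a co-grouped join across both datasets plus all the I/O and routing code around it; the keys and values below are invented for illustration:

```python
def capture_changes(old_snapshot, new_snapshot):
    """Classify records as inserts, updates, or deletes by primary key.

    Snapshots are dicts mapping primary key -> record. A MapReduce
    implementation would instead co-group both datasets on the key.
    """
    inserts = {k: v for k, v in new_snapshot.items() if k not in old_snapshot}
    deletes = {k: v for k, v in old_snapshot.items() if k not in new_snapshot}
    updates = {k: v for k, v in new_snapshot.items()
               if k in old_snapshot and old_snapshot[k] != v}
    return inserts, updates, deletes

old = {1: "alice@a.com", 2: "bob@b.com", 3: "carol@c.com"}
new = {1: "alice@a.com", 2: "bob@new.com", 4: "dave@d.com"}
ins, upd, dels = capture_changes(old, new)
print(ins, upd, dels)
# {4: 'dave@d.com'} {2: 'bob@new.com'} {3: 'carol@c.com'}
```

The ten lines above assume both snapshots fit in memory; distributing the comparison across a cluster is exactly the part that turns into hundreds of lines by hand.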


ETL is emerging as the key use case for Hadoop implementations. However, Hadoop alone lacks many attributes needed for successful ETL deployments. Therefore, it's important to choose a data integration tool that can fill the ETL gaps.

Choose a user-friendly graphical interface to easily build ETL jobs without writing MapReduce code.

Ensure that the solution has a large library of pre-built data integration functions that can be easily reused.

Include a metadata repository to enable re-use of developments, as well as data lineage tracking.

Select a tool with a wide variety of connectors to source and target systems.

Syncsort DMX-h is high-performance data integration software that provides a smarter approach to Hadoop ETL, including: an intuitive graphical interface for easily creating and maintaining jobs, a wide range of productivity features, metadata facilities for development re-use and data lineage, high-performance connectivity capabilities, and the ability to run natively, avoiding code generation.

Pitfall #2: MapReduce programmers are hard to find

Programming with the MapReduce processing paradigm in Hadoop requires not only Java programming skills, but also a deep understanding of how to develop the appropriate Mappers, Reducers, Partitioners, Combiners, and so on. A typical Hadoop task often has multiple steps (as shown in the figure below), and a typical application can have multiple tasks. Most of these steps need to be coded by a Java developer (or using Pig script). With hand-coding, these steps can quickly become unwieldy to create and maintain.

Even with expert MapReduce programmers building jobs successfully, MapReduce code has limited metadata associated with it. This makes impact analysis and data lineage difficult to perform, and thus creates an overall lack of transparency into the ETL execution flow. Ultimately, thousands of lines of Java code with no metadata and limited documentation produce major risks for organizations, specifically hindering business agility, complicating data governance, and jeopardizing regulatory compliance.

Not only does MapReduce programming require specialized skills that are hard to find and expensive, but hand-coding also does not scale well in terms of job creation productivity, job re-use, and job maintenance. That's where data integration tools excel, with intuitive graphical interfaces, pre-built functions, and facilities to easily create, re-use, and maintain ETL jobs. With data integration tools, business analysts can graphically create, maintain, and re-use jobs in minutes or hours; the same work would otherwise take days or weeks with a developer writing thousands of lines of code. Easy job creation and maintenance are critical in preventing bottlenecks that reduce an organization's ability to extract the full value of Big Data.
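The multiple steps a hand-coded job must wire together (map, partition, optional combine, sort, reduce) can be illustrated with a single-process Python simulation of word count. This is a teaching sketch of the phases, not Hadoop's actual implementation, and every name in it is invented:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer, combiner=None, num_partitions=2):
    """Single-process simulation of the MapReduce phases a hand-coded
    Hadoop job must wire together: map -> partition -> (combine) ->
    sort -> reduce."""
    # Map phase: emit (key, value) pairs, routed by a hash partitioner.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        for key, value in mapper(record):
            partitions[hash(key) % num_partitions].append((key, value))

    results = {}
    for part in partitions:
        part.sort(key=itemgetter(0))  # sort phase (runs per node)
        if combiner:                  # optional combiner: pre-aggregate
            part = [(k, combiner(v for _, v in grp))
                    for k, grp in groupby(part, key=itemgetter(0))]
        for key, grp in groupby(part, key=itemgetter(0)):  # reduce phase
            results[key] = reducer(v for _, v in grp)
    return results

lines = ["big data", "big hadoop data", "data"]
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=sum,
    combiner=sum,
)
print(sorted(counts.items()))  # [('big', 2), ('data', 3), ('hadoop', 1)]
```

In a real Hadoop job, each of these phases is a separate Java class (plus input and output formatters), which is where the maintenance burden described above comes from.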


Hadoop ETL requires organizations to acquire a completely new set of advanced programming skills that are expensive and difficult to find. To overcome this pitfall, it's critical to choose a data integration tool that complements Hadoop while leveraging skills organizations already have.

Select a tool with a graphical user interface (GUI) that abstracts the complexities of MapReduce programming.

Look for pre-built templates specifically designed to create MapReduce jobs without manually writing code.

Insist on the ability to re-use previously created MapReduce flows as a means to increase developer productivity.

Avoid code generation, since it frequently requires tuning and maintenance.

Visually track data flows with metadata and lineage.

[Figure: a typical multi-step MapReduce task flow. Each MAP step reads records through an Input Formatter, applies an optional Partitioner and an optional Combiner, and SORTs spilled output to local disk; the REDUCE step merges the sorted map outputs and writes results through an Output Formatter to HDFS.]

Using DMX-h reduces or eliminates the need for costly, hard-to-find MapReduce programmers. With DMX-h, Mappers and Reducers are all built through an easy-to-use graphical development environment, eliminating the need to write any code. DMX-h provides powerful and highly efficient out-of-the-box capabilities for all key ETL functions and transformations. DMX-h Mapper and Reducer steps can optionally perform processing that eliminates the need for other steps in the MapReduce processing flow (including the InputFormatter, Partitioner, Combiner, and OutputFormatter) simply by checking options in the DMX-h graphical user interface.

There are a number of other benefits inherent in DMX-h as a powerful data integration tool that make MapReduce programming more efficient. First, it's easy to develop ETL jobs that execute within MapReduce by using pre-defined templates and accelerators for common transformations such as CDC, joins, and more. Second, jobs can be easily re-used to create new data flows in less time, improving developer productivity. Additionally, built-in metadata capabilities enable greater transparency into impact analysis, data lineage, and execution flow, thereby facilitating data governance and regulatory compliance. No code generation means there is no code to maintain or tune. As a result, organizations can minimize or even eliminate the need to find and acquire new MapReduce skills. Instead, they can leverage the ETL expertise of their existing staff to quickly learn and implement ETL processes in Hadoop using DMX-h.

Pitfall #3: Most data integration tools don't run natively within Hadoop

Most data integration solutions offered for Hadoop do not run natively and generate hundreds of lines of code to accomplish even simple tasks. This can have a significant impact on the overall time it takes to load and process data. That's why it's critical to choose a data integration tool that is tightly integrated within Hadoop and can run natively within the MapReduce framework. Moreover, it's important to consider not only the horizontal scalability inherent to Hadoop, but also the vertical scalability within each node; vertical scalability is about the processing efficiency of each node. A good example of vertical scalability is sorting, a key component of every MapReduce process (equally important is connectivity efficiency, covered in Pitfall #5). The most efficient vertical scalability also delivers the fastest job processing time, thereby reducing overall time to value.

Unfortunately, many data integration tools add a layer of overhead that hurts performance. Most data integration tools are peripheral to Hadoop: they simply interact with Hadoop from the outside, treating it as just another target engine to which processing is pushed. They take the same approach as with relational databases, the so-called push-down optimizations. This means they generate code, in most cases Java, Pig, or HiveQL, which then needs to be compiled before it is executed in Hadoop. Generating optimal code is not trivial, and most of these tools can end up generating very inefficient code that developers then need to understand, fine-tune, and maintain. Instead, it is better to run natively within Hadoop with no need to pre-compile, which is both easier to maintain and more efficient, eliminating processing overhead.


Most data integration tools are simply code generators that add extra overhead to the Hadoop framework. A smarter approach must fully integrate with Hadoop and provide a means to seamlessly optimize performance without adding complexity.

Understand how different solutions specifically interact with Hadoop and the amount of code that they generate.

Choose solutions with the ability to run natively within each Hadoop node without generating code.

Run performance benchmarks and study which tools deliver the best combination of price and performance for your most common use cases.

Select an approach with built-in optimizations to maximize Hadoop's vertical scalability.

DMX-h provides a truly integrated approach to Hadoop ETL. DMX-h is not a code generator. Instead, Hadoop automatically invokes the highly efficient DMX-h runtime engine, which executes on all nodes as an integral part of the Hadoop framework. DMX-h automatically optimizes the resource utilization (e.g., CPU, memory, and I/O) on each node to deliver the highest levels of performance, scalability, and throughput, with no manual tuning needed. Compared with Java or Pig, DMX-h execution is typically 2 to 3x faster, which means it can process more data in the same amount of time without the need for additional nodes.

DMX-h has a very small footprint, with no dependencies on third-party systems like a relational database, compiler, or application server for design or runtime. As a result, DMX-h can be easily installed and deployed on every data node in a Hadoop cluster or on virtualized environments in the cloud.

Syncsort achieves these performance differentiators by leveraging a number of contributions the company has made to the Apache Hadoop open source community, including a new feature that allows for an external sort implementation within the MapReduce framework (MAPREDUCE-2454). Organizations using Hadoop therefore no longer have to rely on the standard Hadoop sort, but can plug in their own sort as well.
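An external sort of the kind that the pluggable-sort feature allows to be substituted is, at its core, a run-generation-plus-merge algorithm. The sketch below shows the general technique in Python (sort memory-sized runs, then k-way merge them); it is illustrative only and is not Syncsort's implementation:

```python
import heapq

def external_sort(records, memory_limit=4):
    """External merge sort: the class of algorithm a pluggable sort
    implementation can substitute for Hadoop's built-in sort.

    Records are sorted in memory-sized runs, then the sorted runs are
    k-way merged. (Illustrative: a real engine spills each run to
    local disk instead of keeping it in a Python list.)"""
    runs = []
    for start in range(0, len(records), memory_limit):
        # Sort one chunk that fits in "memory".
        runs.append(sorted(records[start:start + memory_limit]))
    # k-way merge of the sorted runs; heapq.merge streams the runs
    # rather than materializing the whole merged result up front.
    return list(heapq.merge(*runs))

data = [9, 1, 7, 3, 8, 2, 6, 4, 5]
print(external_sort(data, memory_limit=3))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The performance of exactly this run-and-merge stage is what a more efficient sort engine improves on each node.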


The pluggable sort option also enables development of MapReduce jobs within the DMX-h graphical interface. Additionally, it allows the DMX-h engine to run natively within the Hadoop cluster nodes. This approach makes it much easier to implement common tasks that are difficult to execute in Hadoop (e.g., joins). For all Hadoop users, this new feature enables more sophisticated manipulation of data within Hadoop, such as hash aggregations, hash joins, sampling N matches, or even a no-sort option (i.e., the ability to bypass the sort when it is not needed or is redundant).
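Of the operations listed above, a hash join is a good example of why bypassing the sort matters: it needs no sorted input at all. The following Python sketch shows the general build-and-probe technique with invented table and column names; it is not DMX-h's implementation:

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Hash join: build a hash table on the smaller input, then probe
    it with the larger one. Neither input is sorted, which is why a
    no-sort option matters for this class of operation."""
    table = defaultdict(list)
    for row in left:                      # build phase (smaller side)
        table[row[left_key]].append(row)
    joined = []
    for row in right:                     # probe phase (larger side)
        for match in table.get(row[right_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"cid": 1, "name": "Acme"}, {"cid": 2, "name": "Globex"}]
orders = [{"cid": 2, "total": 50}, {"cid": 1, "total": 75}, {"cid": 2, "total": 20}]
rows = hash_join(customers, orders, "cid", "cid")
print(len(rows))  # 3 joined rows, with no sorting performed
```

A sort-merge join of the same data would first have to sort both inputs on `cid`; hashing trades that cost for memory on the build side.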

Pitfall #4: Hadoop may cost more than you think

Hadoop is significantly disrupting the cost structure of processing data at scale. However, deploying Hadoop is not free, and significant costs can add up. Vladimir Boroditsky, a director of software engineering at Google's Motorola Mobility Holdings Inc., recognized in a Wall Street Journal article that there is a very substantial cost to "free" software, noting that Hadoop comes with the additional costs of hiring in-house expertise and consultants. In all, the primary costs to consider for a complete enterprise data integration solution powered by Hadoop include: software, technical support, skills, hardware, and time-to-value.

The first three factors (software, support, and skills) should be considered together. While the Hadoop software itself is open source and free, it is typically desirable to purchase a support subscription with an enterprise service level agreement (SLA). Likewise, it's important to consider the software and subscription costs as a whole when evaluating the data integration tool that will work in tandem with Hadoop. In terms of skills, the Wall Street Journal reports that a Hadoop programmer, sometimes also referred to as a data scientist, can easily command at least $300,000 per year. Although a data integration tool may add costs on the software and support side, using the right tool can reduce the overall costs of development and maintenance by dramatically reducing the time to build and manage Hadoop jobs. Finally, data integration tool skills are much more broadly available and much less expensive than specialized Hadoop MapReduce developer skills.

While Hadoop leverages commodity hardware, the associated costs can still be significant. When dealing with dozens of nodes over months and years, hardware costs add up, commodity or not. Therefore, it is still important to use hardware in the most efficient manner. Unfortunately, Hadoop's core mechanics of MapReduce are inefficient with respect to processing data on each individual node. The strategy with Hadoop is to spread the processing and data across many nodes so that inefficiencies such as sorting are minimized. However, the inefficiencies are


still there, and they add up as the number of nodes grows. Vertical scalability is critical to contain the costs associated with growing Hadoop clusters. Therefore, it's important to consider data integration tools that can complement Hadoop with the ability to maximize processing efficiency on each node, for example by enabling Hadoop to call more efficient sort algorithms and seamlessly optimize MapReduce operations.

Time-to-value is the difference between the time needed to create and deploy jobs and the time when an organization can start extracting value from Big Data. This dimension is another benefit of using a data integration tool with a graphical interface to speed development and maintenance. The time to create ETL jobs and deploy them into production is dramatically lower when using the right data integration tool than when using Hadoop utilities such as Pig, Hive, and Sqoop.

Hadoop provides virtually unlimited horizontal scalability. However, hardware and development costs can quickly hinder sustainable growth. Therefore, it's important to maximize developer productivity and per-node efficiency to contain costs.

Choose cost-effective software and support, including both the Hadoop distribution and the data integration tool.

Ensure tools include features to reduce the development and maintenance effort of MapReduce jobs.

Look for optimizations that enhance Hadoop's vertical scalability to reduce hardware requirements.


    DMX-h dramatically reduces costs of leveraging Hadoop in a number of ways. First, DMX-h

    reduces time-to-value by making the development of Hadoop jobs much faster and easier than

    manual coding. With DMX-h, there is no need to hire additional programmers to implement

    Hadoop ETL. For the most part, you can leverage existing skills within the organization or

    more easily find data integration tool developers at a more reasonable cost.

    In terms of hardware, a rule-of-thumb cost for one Hadoop node is about $5,000.

    However, when adding the operating system (for example, a support subscription), cooling,

    maintenance, power, rack space, etc., the total cost can grow to $12,000. And that does

    not include administration costs. DMX-h enables Hadoop clusters to scale more efficiently

    and cost-effectively by maximizing vertical scalability of each individual node. With more

    efficient hardware utilization, organizations can reduce capital and operational expenses by

    eliminating the need for additional compute nodes on the cluster.
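The cost arithmetic above can be sketched as a back-of-the-envelope calculation. The $12,000 fully loaded per-node figure is the rule of thumb quoted in the text; the workload sizes and the per-node efficiency gain are hypothetical illustrations, not measured results:

```python
def cluster_cost(nodes, loaded_cost_per_node=12_000):
    """Rule-of-thumb fully loaded node cost: ~$5,000 hardware plus OS
    subscription, cooling, power, rack space, and maintenance ~= $12,000."""
    return nodes * loaded_cost_per_node

def nodes_needed(workload_units, units_per_node):
    """Nodes required to handle a workload, rounding up."""
    return -(-workload_units // units_per_node)  # ceiling division

# Hypothetical example: a workload of 100 units.
baseline = nodes_needed(100, units_per_node=10)   # 10 nodes
tuned = nodes_needed(100, units_per_node=13)      # better per-node efficiency
savings = cluster_cost(baseline) - cluster_cost(tuned)
print(f"{baseline} vs {tuned} nodes; savings ${savings:,}")
```

The design point is simply that squeezing more work out of each node compounds into fewer nodes, and every node avoided saves the full loaded cost, not just the hardware price.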


    One of Hadoop's hallmark strengths is its ability to process massive data volumes of nearly any type. But that

    strength cannot be fully utilized unless the Hadoop cluster is adequately connected to all available data sources

    and targets, including relational databases, files, CRM systems, social media, mainframes and so on. However,

    moving data in and out of Hadoop is not trivial. Moreover, with the birth of new categories of data management

    technologies, broadly generalized as NoSQL and NewSQL, mission-critical systems like mainframes can all

    too often be neglected. The fact is that at least 70% of the world's transactional production applications run on

    mainframe platforms. The ability to process and analyze mainframe data with Hadoop could open up a wealth of

    opportunities by delivering deeper analytics, at lower cost, for many organizations.

    Shortening the time it takes to get data into the Hadoop Distributed File System (HDFS) can be critical for many

    companies, such as those that must load billions of records each day. Reducing load times can also be important

    for organizations that plan to increase the amount and types of data they will need to load into Hadoop, as their

    application or business grows. Finally, pre-processing data before loading into Hadoop is vital in order to filter out

    the noise of irrelevant data, achieve significant storage space savings, and optimize performance.
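That pre-processing step can be sketched as a simple filter-and-project pass run before the HDFS load. The record layout, field names, and filter rule below are hypothetical, chosen only to illustrate the idea:

```python
def preprocess(records, wanted_fields=("id", "amount"), min_amount=0):
    """Filter noise and project only needed fields before an HDFS load:
    dropping malformed or irrelevant records and shrinking each row cuts
    both load time and storage consumed on the cluster."""
    for rec in records:
        if not isinstance(rec.get("amount"), (int, float)):
            continue                    # drop malformed records
        if rec["amount"] <= min_amount:
            continue                    # drop irrelevant noise
        yield {k: rec[k] for k in wanted_fields if k in rec}

raw = [
    {"id": 1, "amount": 25.0, "debug": "x" * 100},  # bulky field projected away
    {"id": 2, "amount": 0},                          # noise: zero-value record
    {"id": 3, "amount": "bad"},                      # malformed record
]
clean = list(preprocess(raw))
print(clean)   # only record 1 survives, without the debug payload
```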


    Without the right connectivity, Hadoop risks becoming another data silo within the

    enterprise. Tools to get the needed data in and out of Hadoop at the right time

    are critical to maximize the value of Big Data.

    • Select tools with a wide range of native connectors, particularly for popular relational databases, appliances, files and systems.

    • Don't forget to include mainframe data in your Hadoop and Big Data strategies.

    • Make sure connectivity is provided not only from a stand-alone data integration server to Hadoop, but also directly from the Hadoop cluster itself to a variety of sources and targets.

    • Look for connectors that don't require writing additional code.

    • Ensure high-performance connectivity in both loading and extracting data from various sources and targets.

    DMX-h offers a range of high-performance connectors for every major RDBMS, appliances,

    XML, flat files, legacy sources and even mainframes.

    DMX-h writes data directly to HDFS using native Hadoop interfaces. DMX-h can partition

    the data and parallelize the loading processes to load multiple streams simultaneously into

    HDFS, reducing the time to load data into HDFS by up to 6x.
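The partition-and-parallelize idea can be sketched as follows. Local files stand in for HDFS write streams, the partition count is arbitrary, and the toy does not demonstrate the 6x claim, only the load pattern of splitting data into buckets and writing them concurrently:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Hash-partition records into n_parts buckets, as a loader might
    before opening one write stream per bucket."""
    buckets = [[] for _ in range(n_parts)]
    for rec in records:
        buckets[hash(rec) % n_parts].append(rec)
    return buckets

def write_partition(args):
    """Write one bucket to its own part file (stand-in for an HDFS stream)."""
    path, bucket = args
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in bucket)
    return len(bucket)

def parallel_load(records, out_dir, n_parts=4):
    """Partition the input, then write all partitions concurrently."""
    buckets = partition(records, n_parts)
    paths = [os.path.join(out_dir, f"part-{i:05d}") for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        counts = list(pool.map(write_partition, zip(paths, buckets)))
    return sum(counts)

with tempfile.TemporaryDirectory() as d:
    total = parallel_load([f"record-{i}" for i in range(100)], d)
    print(f"loaded {total} records across 4 streams")
```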


    [Figure: DMX-h connectivity to HDFS. Sources and targets include file-based sources (flat files, XML, mainframe),
    RDBMS (Oracle, DB2, SQL Server, Teradata, Sybase, ODBC), appliances (Netezza, Greenplum, Vertica),
    and other systems (MQ, Salesforce.com, legacy sources).]


    DMX-h can also connect directly from each data node in the cluster, to virtually any source

    and target for even greater efficiency and faster data movement.

    Finally, Syncsort is commonly used to pre-process data prior to loading it into Hadoop. By

    first integrating and structuring the data with Syncsort prior to loading to HDFS, downstream load times

    are reduced, MapReduce tasks execute faster and more efficiently, and storage requirements on the cluster are reduced.


    A leading global financial services organization with trillions of dollars in assets is looking to improve performance of its Hadoop ETL jobs.


    As the de facto standard for Big Data processing and analytics, Hadoop represents a tremendous vehicle to extract value

    from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop in order to achieve

    a complete ETL solution can hinder the overall potential value of Big Data. Syncsort DMX-h provides a smarter approach,

    making Hadoop a more mature environment for enterprise ETL. Development and maintenance are eased, overall costs are

    dramatically reduced, performance is multiplied, opportunities to leverage every data source are guaranteed, and

    time-to-value is minimized.

    As a high-performance leader in the data integration space, Syncsort has worked with early adopter Hadoop customers to

    identify and solve the most common pitfalls organizations are facing. Regardless of the approach you take, it's important to

    recognize and address these pitfalls prior to deploying ETL on Hadoop:

    #1 Hadoop is not a data integration tool

    Select a data integration tool that can dramatically speed development and maintenance efforts

    by providing all the capabilities to make Hadoop ETL-ready, including connectivity, breadth of

    transformations and data processing functions, metadata, reusability and ease-of-use.

    #2 MapReduce programmers are hard to find

    Make sure your data integration tool includes specialized facilities to ease MapReduce job

    development. Also minimize the need to acquire MapReduce programming skills by selecting a tool

    that allows you to leverage the same data integration expertise your organization already has to

    develop MapReduce jobs without hand-coding.

    #3 Most data integration tools don't run natively within Hadoop

    Choose a data integration tool that runs natively within the Hadoop framework to minimize data

    movement and maximize data processing performance within each node. Avoid code generators

    altogether, as their code output frequently requires tedious tuning and maintenance.

    #4 Hadoop may cost more than you think

    Do not underestimate the cost of using Hadoop, including software, support, hardware, and skills.

    Choose a data integration tool that complements Hadoop's horizontal scalability with greater

    performance and efficiency on each node to minimize hardware costs.

    #5 Elephants don't thrive in isolation

    Unleash Hadoop's potential by making sure your data integration tool provides high-performance

    connectivity to move data into and out of Hadoop from virtually any system, particularly major

    relational databases, appliances, files and mainframes.


    Simplifying and accelerating ETL use cases with Hadoop

    Hadoop MapReduce: To Sort or Not to Sort

    2013: The Year Big Data Gets Bigger

    Syncsort provides data-intensive organizations across the big data continuum with a smarter

    way to collect and process the ever-expanding data avalanche. With thousands of deployments

    across all major platforms, including mainframe, Syncsort helps customers around the world

    to overcome the architectural limits of today's ETL and Hadoop environments, empowering

    their organizations to drive better business outcomes in less time, with fewer resources and

    lower TCO. For more information visit www.syncsort.com.

    © 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names may be trademarks of their respective companies.
