5 pitfalls to avoid with hadoop

Upload: karthikeyan-balasubramaniam

Post on 03-Jun-2018


Intro: Maximizing the fourth V of Big Data
Pitfall #1: Hadoop is not a data integration tool
Pitfall #2: MapReduce programmers are hard to find
Pitfall #3: Most data integration tools don't run natively within Hadoop
Pitfall #4: Hadoop may cost more than you think
Pitfall #5: Elephants don't thrive in isolation
Benchmark
Conclusion

Intro: Maximizing the fourth V of Big Data

Traditional business intelligence architectures are struggling to efficiently process Big Data sets, particularly massive semi-structured and unstructured data, so it has been difficult to realize the full potential of Big Data. Hadoop allows organizations to overcome these architectural limitations in managing Big Data, but care needs to be taken to make the most of what Hadoop has to offer.

Big Data is commonly characterized with respect to the three Vs: high-volume, high-velocity, and high-variety data assets. But what really matters is the fourth V: value. Value is the positive impact on the business in terms of gaining actionable insight from massive amounts of data. Big Data can uncover significant value for organizations, for example: new revenue streams, new customer insights, improved decision making, better quality products, improved customer experience, and so on.

Hadoop has emerged as the de facto Big Data analytics operating system, helping to deal with the avalanche of data coming from logs, email, sensor devices, mobile devices, social media, and more. While business intelligence systems are typically the last stop in extracting value from Big Data, the first stop is commonly manipulation of the data in a process called Extract, Transform, Load (ETL). ETL is the process by which data is moved from source systems, manipulated into a consumable format, and loaded into a target system for advanced analytics, analysis, and reporting. In fact, industry analyst firm Gartner recognizes that most organizations will adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.
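As a concrete (and deliberately tiny) illustration of the extract-transform-load steps just described, the following Python sketch moves records out of a source, reshapes them into a consumable format, and loads them into a stand-in target; all names and the record layout are invented for illustration and are not any vendor's API:

```python
# Minimal ETL sketch: extract raw records, transform them into a
# consumable format, and load them into a target store (here an
# in-memory dict standing in for a data warehouse table).

def extract(source_lines):
    """Extract: parse raw delimited records from a source system."""
    return [line.strip().split(",") for line in source_lines if line.strip()]

def transform(rows):
    """Transform: cleanse, reformat, and convert types."""
    out = []
    for name, amount in rows:
        out.append({"customer": name.strip().title(),
                    "amount": round(float(amount), 2)})
    return out

def load(records, target):
    """Load: aggregate into the target table for analysis and reporting."""
    for rec in records:
        target[rec["customer"]] = target.get(rec["customer"], 0) + rec["amount"]
    return target

warehouse = {}
raw = ["alice,10.5", " bob , 3.25", "alice,2.0"]
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'Alice': 12.5, 'Bob': 3.25}
```

Real ETL pipelines differ mainly in scale and connectivity, not in this basic shape.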

However, as organizations begin to deploy this new framework, there are pitfalls to avoid in successfully performing ETL with Hadoop. Businesses first need to know the pitfalls, and then how to overcome the challenges. We will offer some guiding principles to address these challenges, as well as specific details on how to leverage Syncsort's data integration tool for Hadoop, DMX-h, to drive sustainable success with your Hadoop deployment.

Pitfall #1: Hadoop is not a data integration tool

A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain ETL jobs. Typical capabilities of a data integration tool include: an intuitive graphical interface; pre-built data transformation functions (aggregations, joins, change data capture [CDC], cleansing, filtering, reformatting, lookups, data type conversions, and so on); metadata management to enable re-use and data lineage; powerful connectivity to source and target systems; and advanced features to make data integration easily accessible to data analysts.

Although the primary use case of Hadoop is ETL, Hadoop is not a data integration tool itself. Rather, Hadoop is a reliable, scale-out parallel processing framework, meaning servers (nodes) can be easily added as workloads increase. It frees the programmer from concerns about how to physically manage large data sets when spreading processing across multiple nodes. There is a rich ecosystem of Hadoop utilities that can be used to create ETL jobs, but they are all separately evolving projects that require specific, new skills. For example, Sqoop development (moving data into and out of HDFS from relational databases) requires programmers skilled in the Sqoop command-line syntax. Flume is used for moving data from a variety of systems into Hadoop; Oozie helps with workflows; and Pig is a scripting platform for more easily creating Hadoop jobs. However, they all require substantial hand-coding, as well as specialized skills and knowledge of Hadoop and MapReduce.

Finally, basic ETL operations such as data transformations are easy within a mature data integration tool. However, trying to accomplish the same task with Hadoop alone can quickly become complex and take considerable expertise and effort. For example, building a simple CDC process can easily translate into hundreds of lines of code that not only take several days to develop, but also require resources to maintain and tune as needs evolve. A preferred approach is to use a data integration tool that makes it easy to create and maintain Hadoop ETL jobs.
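To see why a hand-coded CDC job balloons in size, here is the core comparison logic as a hedged Python sketch: two snapshots are matched by primary key, and each record is classified as an insert, update, or delete. In MapReduce the same logic must be expressed as a co-grouped join across both datasets plus all the I/O and routing code around it; the keys and values below are invented for illustration:

```python
def capture_changes(old_snapshot, new_snapshot):
    """Classify records as inserts, updates, or deletes by primary key.

    Snapshots are dicts mapping primary key -> record. A MapReduce
    implementation would instead co-group both datasets on the key.
    """
    inserts = {k: v for k, v in new_snapshot.items() if k not in old_snapshot}
    deletes = {k: v for k, v in old_snapshot.items() if k not in new_snapshot}
    updates = {k: v for k, v in new_snapshot.items()
               if k in old_snapshot and old_snapshot[k] != v}
    return inserts, updates, deletes

old = {1: "alice@a.com", 2: "bob@b.com", 3: "carol@c.com"}
new = {1: "alice@a.com", 2: "bob@new.com", 4: "dave@d.com"}
ins, upd, dels = capture_changes(old, new)
print(ins, upd, dels)
# {4: 'dave@d.com'} {2: 'bob@new.com'} {3: 'carol@c.com'}
```

The ten lines above assume both snapshots fit in memory; distributing the comparison across a cluster is exactly the part that turns into hundreds of lines by hand.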


ETL is emerging as the key use case for Hadoop implementations. However, Hadoop alone lacks many attributes needed for successful ETL deployments. Therefore, it's important to choose a data integration tool that can fill the ETL gaps.

Choose a user-friendly graphical interface to easily build ETL jobs without writing MapReduce code.

Ensure that the solution has a large library of pre-built data integration functions that can be easily reused.

Include a metadata repository to enable re-use of developments, as well as data lineage tracking.

Select a tool with a wide variety of connectors to source and target systems.

Syncsort DMX-h is high-performance data integration software that provides a smarter approach to Hadoop ETL, including: an intuitive graphical interface for easily creating and maintaining jobs, a wide range of productivity features, metadata facilities for development re-use and data lineage, high-performance connectivity capabilities, and the ability to run natively, avoiding code generation.

Pitfall #2: MapReduce programmers are hard to find

Programming with the MapReduce processing paradigm in Hadoop requires not only Java programming skills, but also a deep understanding of how to develop the appropriate Mappers, Reducers, Partitioners, Combiners, and so on. A typical Hadoop task often has multiple steps (as shown in the figure below), and a typical application can have multiple tasks. Most of these steps need to be coded by a Java developer (or using Pig script). With hand-coding, these steps can quickly become unwieldy to create and maintain.

Even with expert MapReduce programmers building jobs successfully, MapReduce code has limited metadata associated with it. This makes impact analysis and data lineage difficult to perform, and thus creates an overall lack of transparency into the ETL execution flow. Ultimately, thousands of lines of Java code with no metadata and limited documentation produce major risks for organizations, specifically hindering business agility, complicating data governance, and jeopardizing regulatory compliance.

Not only does MapReduce programming require specialized skills that are hard to find and expensive, but hand-coding also does not scale well in terms of job creation productivity, job re-use, and job maintenance. That's where data integration tools excel, with intuitive graphical interfaces, pre-built functions, and facilities to easily create, re-use, and maintain ETL jobs. With data integration tools, business analysts can graphically create, maintain, and re-use jobs in minutes or hours; the same work would otherwise take days or weeks with a developer writing thousands of lines of code. Easy job creation and maintenance are critical in preventing bottlenecks that reduce an organization's ability to extract the full value of Big Data.
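The multiple steps a hand-coded job must wire together (map, partition, optional combine, sort, reduce) can be illustrated with a single-process Python simulation of word count. This is a teaching sketch of the phases, not Hadoop's actual implementation, and every name in it is invented:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer, combiner=None, num_partitions=2):
    """Single-process simulation of the MapReduce phases a hand-coded
    Hadoop job must wire together: map -> partition -> (combine) ->
    sort -> reduce."""
    # Map phase: emit (key, value) pairs, routed by a hash partitioner.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        for key, value in mapper(record):
            partitions[hash(key) % num_partitions].append((key, value))

    results = {}
    for part in partitions:
        part.sort(key=itemgetter(0))  # sort phase (runs per node)
        if combiner:                  # optional combiner: pre-aggregate
            part = [(k, combiner(v for _, v in grp))
                    for k, grp in groupby(part, key=itemgetter(0))]
        for key, grp in groupby(part, key=itemgetter(0)):  # reduce phase
            results[key] = reducer(v for _, v in grp)
    return results

lines = ["big data", "big hadoop data", "data"]
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=sum,
    combiner=sum,
)
print(sorted(counts.items()))  # [('big', 2), ('data', 3), ('hadoop', 1)]
```

In a real Hadoop job, each of these phases is a separate Java class (plus input and output formatters), which is where the maintenance burden described above comes from.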


Hadoop ETL requires organizations to acquire a completely new set of advanced programming skills that are expensive and difficult to find. To overcome this pitfall, it's critical to choose a data integration tool that complements Hadoop while leveraging skills organizations already have.

Select a tool with a graphical user interface (GUI) that abstracts the complexities of MapReduce programming.

Look for pre-built templates specifically designed to create MapReduce jobs without manually writing code.

Insist on the ability to re-use previously created MapReduce flows as a means to increase developer productivity.

Avoid code generation, since it frequently requires tuning and maintenance.

Visually track data flows with metadata and lineage.

[Figure: a typical multi-step MapReduce task flow. Each MAP step reads records through an Input Formatter, applies an optional Partitioner and an optional Combiner, and SORTs spilled output to local disk; the REDUCE step merges the sorted map outputs and writes results through an Output Formatter to HDFS.]

Using DMX-h reduces or eliminates the need for costly, hard-to-find MapReduce programmers. With DMX-h, Mappers and Reducers are all built through an easy-to-use graphical development environment, eliminating the need to write any code. DMX-h provides powerful and highly efficient out-of-the-box capabilities for all key ETL functions and transformations. DMX-h Mapper and Reducer steps can optionally perform processing that eliminates the need for other steps in the MapReduce processing flow (including the InputFormatter, Partitioner, Combiner, and OutputFormatter) simply by checking options in the DMX-h graphical user interface.

There are a number of other benefits inherent in DMX-h as a powerful data integration tool that make MapReduce programming more efficient. First, it's easy to develop ETL jobs that execute within MapReduce by using pre-defined templates and accelerators for common transformations such as CDC, joins, and more. Second, jobs can be easily re-used to create new data flows in less time, improving developer productivity. Additionally, built-in metadata capabilities enable greater transparency into impact analysis, data lineage, and execution flow, thereby facilitating data governance and regulatory compliance. No code generation means there is no code to maintain or tune. As a result, organizations can minimize or even eliminate the need to find and acquire new MapReduce skills. Instead, they can leverage the ETL expertise of their existing staff to quickly learn and implement ETL processes in Hadoop using DMX-h.

Pitfall #3: Most data integration tools don't run natively within Hadoop

Most data integration solutions offered for Hadoop do not run natively and generate hundreds of lines of code to accomplish even simple tasks. This can have a significant impact on the overall time it takes to load and process data. That's why it's critical to choose a data integration tool that is tightly integrated within Hadoop and can run natively within the MapReduce framework. Moreover, it's important to consider not only the horizontal scalability inherent to Hadoop, but also the vertical scalability within each node; vertical scalability is about the processing efficiency of each node. A good example of vertical scalability is sorting, a key component of every MapReduce process (equally important is connectivity efficiency, covered in Pitfall #5). The most efficient vertical scalability also delivers the fastest job processing time, thereby reducing overall time to value.

Unfortunately, many data integration tools add a layer of overhead that hurts performance. Most data integration tools are peripheral to Hadoop: they simply interact with Hadoop from the outside, treating it as just another target engine to which processing is pushed. They take the same approach as with relational databases, the so-called push-down optimizations. This means they generate code, in most cases Java, Pig, or HiveQL, which then needs to be compiled before it is executed in Hadoop. Generating optimal code is not trivial, and most of these tools can end up generating very inefficient code that developers then need to understand, fine-tune, and maintain. Instead, it is better to run natively within Hadoop with no need to pre-compile, which is both easier to maintain and more efficient, eliminating processing overhead.


Most data integration tools are simply code generators that add extra overhead to the Hadoop framework. A smarter approach must fully integrate with Hadoop and provide a means to seamlessly optimize performance without adding complexity.

Understand how different solutions specifically interact with Hadoop and the amount of code that they generate.

Choose solutions with the ability to run natively within each Hadoop node without generating code.

Run performance benchmarks and study which tools deliver the best combination of price and performance for your most common use cases.

Select an approach with built-in optimizations to maximize Hadoop's vertical scalability.

DMX-h provides a truly integrated approach to Hadoop ETL. DMX-h is not a code generator. Instead, Hadoop automatically invokes the highly efficient DMX-h runtime engine, which executes on all nodes as an integral part of the Hadoop framework. DMX-h automatically optimizes the resource utilization (e.g., CPU, memory, and I/O) on each node to deliver the highest levels of performance, scalability, and throughput, with no manual tuning needed. Compared with Java or Pig, DMX-h execution is typically 2 to 3x faster, which means it can process more data in the same amount of time without the need for additional nodes.

DMX-h has a very small footprint, with no dependencies on third-party systems like a relational database, compiler, or application server for design or runtime. As a result, DMX-h can be easily installed and deployed on every data node in a Hadoop cluster or on virtualized environments in the cloud.

Syncsort achieves these performance differentiators by leveraging a number of contributions the company has made to the Apache Hadoop open source community, including a new feature that allows for an external sort implementation within the MapReduce framework (MAPREDUCE-2454). Organizations using Hadoop therefore no longer have to rely on the standard Hadoop sort, but can plug in their own sort as well.
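An external sort of the kind that the pluggable-sort feature allows to be substituted is, at its core, a run-generation-plus-merge algorithm. The sketch below shows the general technique in Python (sort memory-sized runs, then k-way merge them); it is illustrative only and is not Syncsort's implementation:

```python
import heapq

def external_sort(records, memory_limit=4):
    """External merge sort: the class of algorithm a pluggable sort
    implementation can substitute for Hadoop's built-in sort.

    Records are sorted in memory-sized runs, then the sorted runs are
    k-way merged. (Illustrative: a real engine spills each run to
    local disk instead of keeping it in a Python list.)"""
    runs = []
    for start in range(0, len(records), memory_limit):
        # Sort one chunk that fits in "memory".
        runs.append(sorted(records[start:start + memory_limit]))
    # k-way merge of the sorted runs; heapq.merge streams the runs
    # rather than materializing the whole merged result up front.
    return list(heapq.merge(*runs))

data = [9, 1, 7, 3, 8, 2, 6, 4, 5]
print(external_sort(data, memory_limit=3))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The performance of exactly this run-and-merge stage is what a more efficient sort engine improves on each node.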


The pluggable sort option also enables development of MapReduce jobs within the DMX-h graphical interface. Additionally, it allows the DMX-h engine to run natively within the Hadoop cluster nodes. This approach makes it much easier to implement common tasks that are difficult to execute in Hadoop (e.g., joins). For all Hadoop users, this new feature enables more sophisticated manipulation of data within Hadoop, such as hash aggregations, hash joins, sampling N matches, or even a no-sort option (i.e., the ability to bypass the sort when it is not needed or is redundant).
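Of the operations listed above, a hash join is a good example of why bypassing the sort matters: it needs no sorted input at all. The following Python sketch shows the general build-and-probe technique with invented table and column names; it is not DMX-h's implementation:

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Hash join: build a hash table on the smaller input, then probe
    it with the larger one. Neither input is sorted, which is why a
    no-sort option matters for this class of operation."""
    table = defaultdict(list)
    for row in left:                      # build phase (smaller side)
        table[row[left_key]].append(row)
    joined = []
    for row in right:                     # probe phase (larger side)
        for match in table.get(row[right_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"cid": 1, "name": "Acme"}, {"cid": 2, "name": "Globex"}]
orders = [{"cid": 2, "total": 50}, {"cid": 1, "total": 75}, {"cid": 2, "total": 20}]
rows = hash_join(customers, orders, "cid", "cid")
print(len(rows))  # 3 joined rows, with no sorting performed
```

A sort-merge join of the same data would first have to sort both inputs on `cid`; hashing trades that cost for memory on the build side.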

Pitfall #4: Hadoop may cost more than you think

Hadoop is significantly disrupting the cost structure of processing data at scale. However, deploying Hadoop is not free, and significant costs can add up. Vladimir Boroditsky, a director of software engineering at Google's Motorola Mobility Holdings Inc., recognized in a Wall Street Journal article that there is a very substantial cost to "free" software, noting that Hadoop comes with the additional costs of hiring in-house expertise and consultants. In all, the primary costs to consider for a complete enterprise data integration solution powered by Hadoop include: software, technical support, skills, hardware, and time-to-value.

The first three factors (software, support, and skills) should be considered together. While the Hadoop software itself is open source and free, it is typically desirable to purchase a support subscription with an enterprise service level agreement (SLA). Likewise, it's important to consider the software and subscription costs as a whole when evaluating the data integration tool that will work in tandem with Hadoop. In terms of skills, the Wall Street Journal reports that a Hadoop programmer, sometimes also referred to as a data scientist, can easily command at least $300,000 per year. Although a data integration tool may add costs on the software and support side, using the right tool can reduce the overall costs of development and maintenance by dramatically reducing the time to build and manage Hadoop jobs. Finally, data integration tool skills are much more broadly available and much less expensive than specialized Hadoop MapReduce developer skills.

While Hadoop leverages commodity hardware, the associated costs can still be significant. When dealing with dozens of nodes over months and years, hardware costs add up, commodity or not. Therefore, it is still important to use hardware in the most efficient manner. Unfortunately, Hadoop's core mechanics of MapReduce are inefficient with respect to processing data on each individual node. The strategy with Hadoop is to spread the processing and data across many nodes so that inefficiencies such as sorting are minimized. However, the inefficiencies are


still there, and they add up as the number of nodes grows. Vertical scalability is critical to contain the costs associated with growing Hadoop clusters. Therefore, it's important to consider data integration tools that can complement Hadoop with the ability to maximize processing efficiency on each node, for example by enabling Hadoop to call more efficient sort algorithms and seamlessly optimize MapReduce operations.

Time-to-value is the difference between the time needed to create and deploy jobs and the time when an organization can start extracting value from Big Data. This dimension is another benefit of using a data integration tool with a graphical interface to speed development and maintenance. The time to create ETL jobs and deploy them into production is dramatically lower when using the right data integration tool than when using Hadoop utilities such as Pig, Hive, and Sqoop.

Hadoop provides virtually unlimited horizontal scalability. However, hardware and development costs can quickly hinder sustainable growth. Therefore, it's important to maximize developer productivity and per-node efficiency to contain costs.

Choose cost-effective software and support, including both the Hadoop distribution and the data integration tool.

Ensure tools include features to reduce the development and maintenance effort of MapReduce jobs.

Look for optimizations that enhance Hadoop's vertical scalability to reduce hardware requirements.


    DMX-h dramatically reduces costs of leveraging Hadoop in a number of ways. First, DMX-h

    reduces time-to-value by making the development of Hadoop jobs much faster and easier than

    manual coding. With DMX-h, there is no need to hire additional programmers to implement

    Hadoop ETL. For the most part, you can leverage existing skills within the organization or

    more easily find data integration tool developers at a more reasonable cost.

    In terms of hardware, a rule-of-thumb cost for one Hadoop node is about $5,000.

    However, when adding the operating system (for example, a support subscription), cooling,

    maintenance, power, rack space, etc., the total cost can grow to $12,000. And that does

    not include administration costs. DMX-h enables Hadoop clusters to scale more efficiently

    and cost-effectively by maximizing vertical scalability of each individual node. With more

    efficient hardware utilization, organizations can reduce capital and operational expenses by

    eliminating the need for additional compute nodes on the cluster.
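The cost arithmetic above can be sketched as a back-of-the-envelope calculation. The $12,000 fully loaded per-node figure is the rule of thumb quoted in the text; the workload sizes and the per-node efficiency gain are hypothetical illustrations, not measured results:

```python
def cluster_cost(nodes, loaded_cost_per_node=12_000):
    """Rule-of-thumb fully loaded node cost: ~$5,000 hardware plus OS
    subscription, cooling, power, rack space, and maintenance ~= $12,000."""
    return nodes * loaded_cost_per_node

def nodes_needed(workload_units, units_per_node):
    """Nodes required to handle a workload, rounding up."""
    return -(-workload_units // units_per_node)  # ceiling division

# Hypothetical example: a workload of 100 units.
baseline = nodes_needed(100, units_per_node=10)   # 10 nodes
tuned = nodes_needed(100, units_per_node=13)      # better per-node efficiency
savings = cluster_cost(baseline) - cluster_cost(tuned)
print(f"{baseline} vs {tuned} nodes; savings ${savings:,}")
```

The design point is simply that squeezing more work out of each node compounds into fewer nodes, and every node avoided saves the full loaded cost, not just the hardware price.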


    One of Hadoop's hallmark strengths is its ability to process massive data volumes of nearly any type. But that

    strength cannot be fully utilized unless the Hadoop cluster is adequately connected to all available data sources

    and targets, including relational databases, files, CRM systems, social media, mainframes and so on. However,

    moving data in and out of Hadoop is not trivial. Moreover, with the birth of new categories of data management

    technologies, broadly generalized as NoSQL and NewSQL, mission-critical systems like mainframes can all

    too often be neglected. The fact is that at least 70% of the world's transactional production applications run on

    mainframe platforms. The ability to process and analyze mainframe data with Hadoop could open up a wealth of

    opportunities by delivering deeper analytics, at lower cost, for many organizations.

    Shortening the time it takes to get data into the Hadoop Distributed File System (HDFS) can be critical for many

    companies, such as those that must load billions of records each day. Reducing load times can also be important

    for organizations that plan to increase the amount and types of data they will need to load into Hadoop, as their

    application or business grows. Finally, pre-processing data before loading into Hadoop is vital in order to filter out

    the noise of irrelevant data, achieve significant storage space savings, and optimize performance.
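That pre-processing step can be sketched as a simple filter-and-project pass run before the HDFS load. The record layout, field names, and filter rule below are hypothetical, chosen only to illustrate the idea:

```python
def preprocess(records, wanted_fields=("id", "amount"), min_amount=0):
    """Filter noise and project only needed fields before an HDFS load:
    dropping malformed or irrelevant records and shrinking each row cuts
    both load time and storage consumed on the cluster."""
    for rec in records:
        if not isinstance(rec.get("amount"), (int, float)):
            continue                    # drop malformed records
        if rec["amount"] <= min_amount:
            continue                    # drop irrelevant noise
        yield {k: rec[k] for k in wanted_fields if k in rec}

raw = [
    {"id": 1, "amount": 25.0, "debug": "x" * 100},  # bulky field projected away
    {"id": 2, "amount": 0},                          # noise: zero-value record
    {"id": 3, "amount": "bad"},                      # malformed record
]
clean = list(preprocess(raw))
print(clean)   # only record 1 survives, without the debug payload
```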


    Without the right connectivity, Hadoop risks becoming another data silo within the

    enterprise. Tools to get the needed data in and out of Hadoop at the right time

    are critical to maximize the value of Big Data.

    • Select tools with a wide range of native connectors, particularly for popular relational databases, appliances, files and systems.

    • Don't forget to include mainframe data in your Hadoop and Big Data strategies.

    • Make sure connectivity is provided not only from a stand-alone data integration server to Hadoop, but also directly from the Hadoop cluster itself to a variety of sources and targets.

    • Look for connectors that don't require writing additional code.

    • Ensure high-performance connectivity in both loading and extracting data from various sources and targets.

    DMX-h offers a range of high-performance connectors for every major RDBMS, appliances,

    XML, flat files, legacy sources and even mainframes.

    DMX-h writes data directly to HDFS using native Hadoop interfaces. DMX-h can partition

    the data and parallelize the loading processes to load multiple streams simultaneously into

    HDFS, reducing the time to load data into HDFS by up to 6x.
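The partition-and-parallelize idea can be sketched as follows. Local files stand in for HDFS write streams, the partition count is arbitrary, and the toy does not demonstrate the 6x claim, only the load pattern of splitting data into buckets and writing them concurrently:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Hash-partition records into n_parts buckets, as a loader might
    before opening one write stream per bucket."""
    buckets = [[] for _ in range(n_parts)]
    for rec in records:
        buckets[hash(rec) % n_parts].append(rec)
    return buckets

def write_partition(args):
    """Write one bucket to its own part file (stand-in for an HDFS stream)."""
    path, bucket = args
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in bucket)
    return len(bucket)

def parallel_load(records, out_dir, n_parts=4):
    """Partition the input, then write all partitions concurrently."""
    buckets = partition(records, n_parts)
    paths = [os.path.join(out_dir, f"part-{i:05d}") for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        counts = list(pool.map(write_partition, zip(paths, buckets)))
    return sum(counts)

with tempfile.TemporaryDirectory() as d:
    total = parallel_load([f"record-{i}" for i in range(100)], d)
    print(f"loaded {total} records across 4 streams")
```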


    [Figure: DMX-h connectivity to HDFS. Sources and targets include file-based sources (flat files, XML, mainframe),
    RDBMS (Oracle, DB2, SQL Server, Teradata, Sybase, ODBC), appliances (Netezza, Greenplum, Vertica),
    and other systems (MQ, Salesforce.com, legacy sources).]


    DMX-h can also connect directly from each data node in the cluster, to virtually any source

    and target for even greater efficiency and faster data movement.

    Finally, Syncsort is commonly used to pre-process data prior to loading it into Hadoop. By

    first integrating and structuring the data with Syncsort prior to loading to HDFS, downstream load times

    are reduced, MapReduce tasks execute faster and more efficiently, and storage requirements on the cluster are reduced.


    A leading global financial services organization with trillions of dollars in assets is looking to improve performance of its Hadoop ETL jobs.


    As the de facto standard for Big Data processing and analytics, Hadoop represents a tremendous vehicle to extract value

    from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop in order to achieve

    a complete ETL solution can hinder the overall potential value of Big Data. Syncsort DMX-h provides a smarter approach,

    making Hadoop a more mature environment for enterprise ETL. Development and maintenance are eased, overall costs are

    dramatically reduced, performance is multiplied, opportunities to leverage every data source are guaranteed, and

    time-to-value is minimized.

    As a high-performance leader in the data integration space, Syncsort has worked with early adopter Hadoop customers to

    identify and solve the most common pitfalls organizations are facing. Regardless of the approach you take, it's important to

    recognize and address these pitfalls prior to deploying ETL on Hadoop:

    #1 Hadoop is not a data integration tool

    Select a data integration tool that can dramatically speed development and maintenance efforts

    by providing all the capabilities to make Hadoop ETL-ready, including connectivity, breadth of

    transformations and data processing functions, metadata, reusability and ease-of-use.

    #2 MapReduce programmers are hard to find

    Make sure your data integration tool includes specialized facilities to ease MapReduce job

    development. Also minimize the need to acquire MapReduce programming skills by selecting a tool

    that allows you to leverage the same data integration expertise your organization already has to

    develop MapReduce jobs without hand-coding.

    #3 Most data integration tools don't run natively within Hadoop

    Choose a data integration tool that runs natively within the Hadoop framework to minimize data

    movement and maximize data processing performance within each node. Avoid code generators

    altogether, as their code output frequently requires tedious tuning and maintenance.

    #4 Hadoop may cost more than you think

    Do not underestimate the cost of using Hadoop, including software, support, hardware, and skills.

    Choose a data integration tool that complements Hadoop's horizontal scalability with greater

    performance and efficiency on each node to minimize hardware costs.

    #5 Elephants don't thrive in isolation

    Unleash Hadoop's potential by making sure your data integration tool provides high-performance

    connectivity to move data into and out of Hadoop from virtually any system, particularly major

    relational databases, appliances, files and mainframes.


    Simplifying and accelerating ETL use cases with Hadoop

    Hadoop MapReduce: To Sort or Not to Sort

    2013: The Year Big Data Gets Bigger

    Syncsort provides data-intensive organizations across the big data continuum with a smarter

    way to collect and process the ever-expanding data avalanche. With thousands of deployments

    across all major platforms, including mainframe, Syncsort helps customers around the world

    to overcome the architectural limits of today's ETL and Hadoop environments, empowering

    their organizations to drive better business outcomes in less time, with fewer resources and

    lower TCO. For more information visit www.syncsort.com.

    © 2013 Syncsort Incorporated. All rights reserved. DMExpress is a trademark of Syncsort Incorporated. All other company and product names may be trademarks of their respective companies.
