ast-0073561_big_data_ibm_v2

Upload: harshad-patil

Post on 03-Apr-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    1/8

    SPECIAL REPORT

    Deriving Meaning

    From theData Explosion

    Copyright InfoWorld Media Group. All rights reserved.

    Big DataAnalyticsDeep

    Dive

    Sponsored by

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    2/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    2Big Data AnalyticsDeep Divei

    Making sense o big dataNew analysis tools and abundant processing power unlock critical insights fromunfathomable volumes of corporate and external data

    i By David S. Linthicum

    THE ABILITY TO DERIVE MEANING quickly rom

    huge quantities o structured and unstructured data has

    been an objective o enterprise IT since the inception o

    databases. In the past, this has been a distant dream or

    at least a cost-prohibitive one.Today we have a growing number o technologies that

    make that dream a reality. Cloud computing provides

    per-drink access to thousands o processor cores and

    massive amounts o on-demand data storage. Emerging

    technologies apply a divide-and-conquer approach to

    big data computing problems, using distributed process-

    ing to return results almost instantaneously rom huge

    data sets.

    New analytics tools take advantage o all this horse-

    power. Advanced data visualization technology makes

    large and complex data sets understandable and enables

    domain experts to spot underlying trends and patterns.

    Some tools even recognize patterns and alert users about

    issues that need attention.

    Big data is riding a sharp growth curve. IDC predicts

    big data technology and services will grow worldwide

    rom $3.2 billion in 2010 to $16.9 billion in 2015. This

    represents a compound annual growth rate o 40 per-

    cent about seven times that o the overall inormation

    and communications technology market.

    MAKING THE BUSINESS CASEThe business benets are clear. On the one hand, we

    can derive meaning rom data that once merely took up

    space website clickstream data, system event logs, and

    so on and use that inormation to improve a broad

    range o systems. A whole new world o vertical appli-

    cations opens up.

    What sort o applications? How about sensor data

    collected on jet engine perormance that could, when

    analyzed over time, reduce the incidence o ailure. Data

    on the vital signs o inants in hospital nurseries could be

    plumbed or patterns that reveal hazards or new oppor-

    tunities or progress. The use cases are limited only by

    the imagination:

    A bank visualizes patterns within data and docu-ments that determine the likelihood o raud and

    can take corrective action beore the business is

    impacted.

    Doctors determine patterns o treatment o dis-

    eases that provide the most desirable outcomes

    using 20 years o historical data rom more than

    30 dierent data sources.

    Auto manufacturers see the core cause o

    production delays, and perhaps attach these

    analytics to business processes to automatically

    take corrective action, such as leveraging dierent

    logistical approaches.

    The utility industry creates predictive models

    o consumer markets by deploying technologies

    such as a smart grid.

    With the proper data visualization on top, big data can

    shortcut the usual process o business analytics, where

    stakeholders pass requirements to analytics proession-

    als who create static reports and hand them back over

    the transom. Big data visualization even a simple tag

    cloud enables those with domain expertise to develop

    a more direct relationship with the data and spot pat-terns that an analytics proessional might miss.

    Despite all the promise, big data technology and

    applications are still very much at an early stage. Hadoop

    has become the preerred platorm or processing large

    quantities o unstructured data, but its a distributed pro-

    cessing ramework and development environment that

    typically requires specialized application development

    skills to use eectively. And thats just or unstructured

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    3/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    3Big Data AnalyticsDeep Divei

    or semi-structured data. Big data can also reer to quan-

    tities o structured data so large they cant be processed

    in a reasonable length o time.The larger question is how big data analytics t into

    companies that already have a business analytics inra-

    structure. In many cases, exploring large unstructured

    data sets is a precursor to drilling deep with more con-

    ventional tools.

    BULKING UP BUSINESS INTELLIGENCE

    Gartner reports that 45 percent o sales management

    teams identiy sales analytics as a priority to help them

    understand sales perormance, market conditions, and

    opportunities. In short, the better that inormation is

    and the more routinely it is put in the hands o thosewho need it the greater likelihood productivity and

    revenue will increase.

    A ew key trends within BI relate to the emerging use

    o big data. They include:

    Theabilitytoleveragebothstructuredand

    unstructured data, and visualize that data together

    as a single, logical set o inormation.

    Theabilitytoimplystructureatthetimeofanaly-

    sis, and thus provide fexibility by decoupling the

    underlying structure rom structured or unstruc-

    tured physical data.

    Theabilitytoleveragecurrent,ornear-real-time

    data, allowing critical applications, business pro-

    cesses, and humans to view the up-to-the-minute

    state o the data.

    Theabilitytoaccessoutsidedatasourcesinthe

    cloud, thus allowing BI analysts to mash up data

    rom outside o the enterprise to enhance or

    rene the data analysis process.

    Theabilitytobinddataanalyticstobusinesspro -

    cesses and applications, and thus allow them to

    automatically deal with issues and opportunitieswithout human intervention.

    Traditional BI is based on structured data, but these

    insights oten all short in predictive and indicative analyt-

    ics, due to data sets that are too old or limited in scope.

    Structured data is only a small portion o stored business

    data. Indeed, many analysts estimate that structured data

    accounts or only 5 percent o total enterprise data.

    Big data analytics enables better analytical insights by

    integrating massive amounts o data o varying complex-

    ity, ormats, and timeliness into one structured output.Combining text, voice, streaming data, and unstructured

    data analytics into one structure allows businesses to

    harness the dierent views o related inormation into

    dynamic analytical models. These models support mul-

    tidimensional metrics that can be leveraged with tradi-

    tional analytics.

    BI tools are still evolving to support this approach to

    big data analytics. They provide better data visualization

    capabilities that take into account the use o near-real-

    time inormation and the widening range o structured

    and unstructured data. To put it simply, i it exists in any

    electronic ormat, chances are it can be analyzed.

    REACHING OUT TO THE CLOUD

    One o the more exciting areas o big data involves data

    sources that are related to your business but were not

    collected or stored by the business itsel. On a simple

    level, this might entail mashing up your existing sales

    trends with key economic data or, as is in vogue now,

    data about trending topics on social networking sites.

    Indeed, social content is the astest-growing category o

    new content in the enterprise and will eventually attain 20

    percent market penetration. Gartner denes social con-

    tent as unstructured data created, edited, and published

    on corporate blogs and communication and collaboration

    platorms, along with such commercial platorms as Face-

    book, LinkedIn, Twitter, YouTube, and more.

    Companies now specialize in providing this data on

    demand or use within enterprise BI, including the larger

    IaaS and PaaS cloud computing providers such as Google

    and Amazon Web Services. Indeed, the growth o data

    providers oering big data analytics services should only

    increase as BI proessionals and their stakeholders under-

    stand the value it oers.

    Finally, theres the ability to close the loop. Moreand more automated systems will be able to bind big

    data analytics to business processes, allowing opera-

    tional systems to react to various thresholds in near-real

    time. This is known as embedded analytics, and may be

    created programmatically, or congured and exposed

    rom within tools that support such services. Business

    examples include analyzing real-time delivery metrics

    and rerouting orders to suppliers that have better track

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    4/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    4Big Data AnalyticsDeep Divei

    records, or automatically adjusting production schedules

    based upon predictive sales trends that leverage known

    correlations with key predictive data.

    MORE DATA SOURCES,

    MORE POSSIBILITIES

    A key challenge or big data analytics is the sheer pro-

    lieration o data sources that may or may not have

    structure. We aggregate these data sources around an

    implied structure created or the purpose o the data

    query and then present this structure to either a data

    analytics API or service, or to a BI tool or the purposes

    o data visualization or other types o interactive analyt-

    ics (see Figure 1).

    Remember that the prevailing trend is to study a mix

    o structured and unstructured data. The unstructured

    data may come rom a variety o sources, including:

    Webpages

    Videoandsoundles

    Documents

    SocialmediaAPIsorservicesthatprovide

    trending data

    Externaldatasources,suchasdataservicespro-

    vided by IaaS and PaaS cloud computing players

    Legacyunstructureddata,suchasolder

    text-based databases

    So how does this work? The unstructured and struc-

    tured data is gathered into a le system (e.g., Hadoop

    Distributed File System, or HDFS). The data is stored in

    blocks on the various nodes in the Hadoop cluster (see

    Figure 1). The le system creates many replications o

    the data blocks, distributing them in the cluster in reliable

    ways that can be retrieved aster. Block sizes may vary,

    but a typical HDFS block size is 128MB, and is replicated

    to multiple nodes across the cluster.

    We deal only with fles, which means the content

    does not adhere to a structure beore it exists in the le

    system. Data maps are then applied over the unstruc-

    tured content to dene the core metadata or that con-

    tent. They may be mapped and remapped any number

    o times to support the changing metadata requirements

    o the analytical tools or others who leverage the data.

    In some instances, Hadoop Hive is employed. Hive is

    a data warehouse system that provides data summariza-

    tion, ad-hoc queries, and analysis o large datasets stored

    in the Hadoop cluster. Hive provides a mechanism to

    project structure onto this data and to query the data

    using a SQL-like language called HiveQL. The interace

    depends on your requirements and the data integration

    FIG. 1. MAKING SENSE OF DATA

    Big data analytics uses both structured and unstructured data and returns results to the B I tool in a matter o seconds. You can employ BI tools

    to analyze the data visually or as embedded analytics in an enterprise application or business process using a data analytics API or service.

    SECURITY

    HUMAN USER

    BI TOOLS

    AGGREGATED DATA

    HADOOP CLUSTER

    ENTERPRISE APPLICATIONS &

    BUSINESS PROCESSES

    DATA ANALYTICS API

    DATA

    GOVERN-

    ANCE

    STRUCTURED DATA UNSTRUCTURED DATA

    OPERATIONAL LEGACY

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    5/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    5Big Data AnalyticsDeep Divei

    capabilities o your BI tool.

    Another option is Apache Pig. Pig is a high-level

    platorm or creating MapReduce programs used withHadoop. It abstracts the programming rom the MapRe-

    duce engine. Like Hive, Pig uses its own language to

    interact with the data.

    Generally speaking, when you execute a query rom

    a BI tool, the ollowing occurs:

    1. The BI tool will reach out to the cluster to get

    the le metadata inormation. Typically, the

    BI tool will deal directly with a data structure

    that may exist specically or the analytics

    use case or model (see Figure 2). You should

    consider this structure an abstract representa-

    tion o the underlying structured or unstruc-

    tured physical data.

    2. From there, the system will reach out to the

    data nodes to get the real data blocks and

    bring those back into the structure. There

    may be any number o physical and logical

    nodes, depending upon the requirements o

    the system and how its architected.

    3. The MapReduce parallel programming

    model gathers the data rom the Hadoop

    cluster. This system deals with the operational

    details, managing the processing load across

    available server resources.

    4. The requested result set is returned to the BItool or visualization or other types o process-

    ing, typically bound to specic data structures.

    5. The BI tool can layer this data into dened

    models, including loading the data rom the

    result sets directly into a dimensional model

    or complex analytical processing or into

    graphic representations o the data.

    6. The data can be rereshed at any increment

    by repeating this process.

    This is a very general scenario, and the BI tools you

    select may have a dierent approach. Many BI tools use

    mappers that make the data appear as i it were stored in

    a traditional relational database. Others take advantage

    o the native eatures o big data technology, including

    the ability to treat structured and unstructured data di-

    erently within the analytical models.

    Some BI tools load the summarized or aggregated data

    into a temporary, multidimensional cube structure (see

    Figure 3). This allows the analyst to visualize the data com-

    ing rom the big data systems in ways that are most useul.

    Whats dierent about this model is that both struc-

    tured and unstructured data can now be visualized. Also,

    FIG. 2. STRUCTURE ON-THE-FLY

    The BI tools use a structure that can be created just or the purpose o data analytics. The inormation exists in the le system cluster and meta-

    data is mapped onto the content as required to support the use case. This provides a more dynamic and fexible way to approach BI.

    BI TOOL

    TEMPORARY STRUCTURE

    HADOOP CLUSTER

    STRUCTURED DATA UNSTRUCTURED DATA

    OPERATIONAL LEGACY

    AGGREGATED DATA WITH METADATA

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    6/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    6Big Data AnalyticsDeep Divei

    new and expanded analysis may be creatively derived

    rom the availability o this data, such as:

    Reportingordescriptiveanalytics

    Modelingorpredictiveanalytics

    Clustering

    Afnitygrouping

    Whats most important about big data analytics is the

    new mindset that is emerging. We now consider most

    data as something to be explored by anyone who needs

    to explore it. Were moving away rom a limited view o

    our own business data. Moreover, our analytical mod-

    els, such as predictive modeling, provide better results

    because the data is more complete.

    BIG DATA VISUALIZATION AND

    ANALYSIS USE CASES

    Within vertical industries, interest in big data is high, but the

    degree o knowledge and investment varies widely (see Figure

    4). Health care, transportation, and education lead the pack.

    USE CASE: BUSINESS PROCESS IMPROVEMENT

    Big data analytics enable companies to examine the state

    o their business with greater detail and accuracy, includ-

    ing the productivity o their business processes. Analytics

    coupled with data visualization shine a bright light onareas where business processes all short.

    For example, using data visualization, a business can

    zoom in on the process o recording a sale and process-

    ing it or shipping and explore the relationship between

    that business process and customer satisaction. Optimiz-

    ing that process seldom ails to increase repeat business.

    USE CASE: BUSINESS-CRITICAL

    APPLICATION AUGMENTATION

    Embedded big data analytics coupled with operational enter-

    prise applications can deliver enormous business value. For

    example, a company could augment a shipping application

    with analytical inormation about on-time delivery records

    culled rom perhaps terabytes o PDF shipping records gath-

    ered over the years. That data might also be mashed up

    with inormation rom outside sources such as complaints

    recorded in social media or blogs.

    USE CASE: IMPROVING HEALTH CARE

    TREATMENT AND OUTCOMES

    There is much low-hanging ruit when it comes to

    the business value o big data analytics in health care;

    FIG. 3. DIMENSIONS

    OF DATA

    BI tools employ a range o analytical

    models and structures to analyze

    big data. In this case the data is

    loaded into a multidimensional

    temporary model where it may be

    visualized in any number o ways.

    BI TOOL

    HADOOP CLUSTER

    STRUCTURED DATA UNSTRUCTURED DATA

    OPERATIONAL

    PRODUCTS

    MINIVAN

    COUPE

    SEDAN

    TWITTER

    FACEBOOK

    YOUTUBE

    20s

    30s

    40s

    SOCIAL DATA CUSTOMER AGES

    MULTI-

    DIMENSIONAL

    DATA

    DATA VISUALIZATION

    LEGACY

    AGGREGATED DATA

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    7/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    7Big Data AnalyticsDeep Divei

    however, the best use case is in patient diagnosis and

    treatment. The health care system stores our inorma-

    tion in dierent places using dierent ormats, which has

    made it dicult or impossible to analyze this data as a

    single cluster o inormation.

    With big data analytics, we now can gather all struc-

    tured and unstructured health care data and place this

    content in a single cluster or analysis by a BI tool. This

    improves health care proessionals ability to determine

    patterns o treatment success by analyzing holistic patient

    data and treatments that lead to the desired outcome.

    USE CASE: IMPROVING

    RETAIL BUSINESS PERFORMANCERetail businesses depend on an in-depth understand-

    ing o specifc markets and customers to gain com-

    petitive advantage. Here, big data analytics has huge

    potential value, allowing those who drive the BI tools to

    create models that determine the success o a product

    based upon predictive data points gathered rom huge

    amounts o unstructured data.

    This data may include demographics inormation

    around the existing customer base, as compared to the

    number o times the product is mentioned within social

    media systems, as compared to patterns around the

    success o products sold in the past, as compared to

    weather patterns that may aect the use o the product

    (say, a down coat in a very cold winter). The idea is to

    provide those who make critical decisions in the retail

    space with the means to slice and dice all the relevant

    data to determine which products to advertise, discount,

    or display adjacent to other products.

    USE CASE: IMPROVING

    TRANSPORTATION SYSTEMS

    Transportation systems depend on efciencies. For

    example, airlines need to select the best and most pro-

    itable routes. Using big data analytics, they can predictthe protability o those routes by leveraging historical

    data visualized with key predictive metrics applied to

    data gathered rom external sources.

    Big data analytics allows airlines to gather years and

    years o fight data rom the government, including point

    o origin, number o passengers, on-time records, etc..

    They can then look at this data with known pricing

    inormation rom other carriers. Add in predictive data,

    the number o times the destination appeared in Web

    searches over the past several years, and the number o

    FIG. 4. Has your organization already invested in technology specifcally designedto address the big data challenge?

    According to Gartner, almost all verticals are investing in big data analytics, with transportation, education, and health care leading the way.

    Education

    Retail

    Transportation

    Health care

    Communications, Media, Services

    Insurance

    Energy, Utilities

    Banking

    Manuacturing

    Government

    Yes No, but plan to within a year No, but plan to within two years No plans Dont know

    39%

    29%

    36%

    36%

    25%

    21%

    31%

    22%

    23%

    23%

    0%

    12%

    4%

    9%

    10%

    5%

    14%

    18%

    13%

    16%

    39%

    29%

    36%

    36%

    25%

    21%

    31%

    22%

    23%

    23%

    17%

    29%

    21%

    15%

    15%

    21%

    11%

    18%

    9%

    18%

    17%

    12%

    11%

    15%

    20%

    18%

    17%

    12%

    18%

    8%

  • 7/29/2019 AST-0073561_big_data_ibm_v2

    8/8

    I N F O W O R L D . C O M D E E P D I V E S E R I E S

    8Big Data AnalyticsDeep Divei

    times its been mentioned on social networking sites as

    well. By modeling this data within a BI tool, the airline

    would have a good idea as to the past use and prot-ability o the route, as well as the uture possibility that

    tickets would sell, and at what price.

    MAPPING A PATH FOR YOUR BUSINESS

    To ully capitalize on big data analytics, you need to ree

    yoursel rom the limitations o traditional BI. Unortu-

    nately, those who created BI oten attempt to orce tra-

    ditional BI approaches (square peg) into the new world

    o big data (round hole). As a result, they miss the ull

    potential o this technology or ail altogether.

    Heres a rough outline o how to implement big data

    analytics in your own business:

    1. Understand your core business requirements

    or this technology and create a business case.

    2. Deine your data sources. Where are they?

    What are they? Whats the best way to reach

    and replicate them when required?.

    3. Deine known use cases, including the

    analytical models you will need to understand

    aspects o your data.

    4. Create a proo-o-concept to both learn about the

    technology and better understand the complexities

    o bringing the technology into the enterprise.

    5. Consider perormance, security, and data

    governance. These issues are oten overlooked,

    but ultimately are required or a successul

    implementation.

    6. Spend the time and money to evaluate BI technol-ogy in terms o eatures and unctions. BI and data

    visualization provide your window into the data,

    and any limitations will limit the value o the data.

    7. Deine the metrics o success. Evaluate whats

    working and whats not ater a year o living

    with this technology. Adjustments can be made

    with very little disruption.

    8. Finally, make sure to create a road map or

    this technology. Include how it will be

    leveraged today, and into short- and long-term

    business planning.

    The value o this technology is very clear to those in

    business. The ability to work with good inormation has

    always been the big challenge or IT. Now we have the tools

    to address this issue, and its up to you to make it work.i

    David S. Linthicum is the CTO and founder of Blue Mountain

    Labs and an internationally recognized industry expert and

    thought leader. Dave has authored 13 books on computing,

    the latest of which is Cloud Computing and SOA Convergence

    in Your Enterprise, a Step-by-Step Approach. Daves indus-

    try experience includes positions as CTO and CEO of several

    successful software companies, and upper-level management

    positions in Fortune 100 companies. He keynotes leading tech-

    nology conferences on cloud computing, SOA, enterprise appli-

    cation integration, and enterprise architecture.