
For information on obtaining additional copies, reprinting or translating articles, and all other correspondence,

please contact:

Email: [email protected]

© Infosys Limited, 2013

Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue of Infosys Labs Briefings. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained in this document or to any derived results obtained by the recipient from the use of the information in the document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.

BIG DATA: CHALLENGES AND OPPORTUNITIES

Subu Goparaju
Senior Vice President and Head of Infosys Labs

"At Infosys Labs, we constantly look for opportunities to leverage technology while creating and implementing innovative business solutions for our clients. As part of this quest, we develop engineering methodologies that help Infosys implement these solutions right, first time and every time."



AADITYA PRAKASH is a Senior Systems Engineer with the FNSP unit of Infosys. He can be reached at [email protected].

ABHISHEK KUMAR SINHA is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected].

AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [email protected].

ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected].

BILL PEER is a Principal Technology Architect with Infosys Labs. He can be reached at [email protected].

GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [email protected].

KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [email protected].

MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [email protected].

NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [email protected].

NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [email protected].

NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [email protected].

PERUMAL BABU is a Senior Technology Architect with the RCL business unit of Infosys. He can be reached at [email protected].

PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [email protected].

PRASANNA RAJARAMAN is a Senior Project Manager with the RCL business unit of Infosys. He can be reached at [email protected].

SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys’ Retail & Logistics Consulting Group. He can be contacted at [email protected].

SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [email protected].

SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [email protected].

ZHONG LI, PhD, is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [email protected].

Big data was the watchword of 2012. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each page trying to either educate or build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is 'the' next big thing after cloud. Big data is of late being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As citizens of this digital world, we generate more than 200 exabytes of information each year. This is equivalent to 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges, and more than 2 million search queries fired. At the scale at which data is getting churned, processing it is beyond human capability, and hence there is a need for machine processing of information. There is no dearth of data for today's enterprises. On the contrary, they are mired in data, and quite deeply at that. Today, therefore, the focus is on discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge.

Because Big data systems are expected to help analyze both structured and unstructured data, they are drawing huge investments. Analysts have estimated enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large-scale storage and search technologies.

Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In the first issue of 2013, we bring you papers that discuss how Big data analytics can make a significant impact on several industry verticals such as medical, retail and IT, and how enterprises can harness the value of Big data.

As always, do let us know your feedback about the issue.

Happy Reading,

Yogesh Dandawate
Deputy Editor
[email protected]

Authors featured in this issue

Infosys Labs Briefings Advisory Board

Anindya Sircar PhD
Associate Vice President & Head - IP Cell

Gaurav Rastogi
Vice President, Head - Learning Services

Kochikar V P PhD
Associate Vice President, Education & Research Unit

Raj Joshi
Managing Director, Infosys Consulting Inc.

Ranganath M
Vice President & Chief Risk Officer

Simon Towers PhD
Associate Vice President and Head - Center of Innovation for Tomorrow's Enterprise, Infosys Labs

Subu Goparaju
Senior Vice President & Head - Infosys Labs

Big Data: Countering Tomorrow’s Challenges

Index

Opinion: Metadata Management in Big Data
By Gautham Vemuganti
Any enterprise that is in the process of or considering Big data applications deployment has to address the metadata management problem. The author proposes a metadata management framework to realize Big data analytics.

Trend: Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj
The paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

Discussion: Retail Industry – Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu
Big data analysis of customers' preferences can help retailers gain a significant competitive advantage, suggest the authors.

Perspective: Harness Big Data Value and Empower Customer Experience Transformation
By Zhong Li PhD
Always-on digital customers continuously create more data of various types. Enterprises are analyzing this heterogeneous data to understand customer behavior, spend and social media patterns.

Framework: Liquidity Risk Management and Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha
Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). The author proposes an iterative framework for effective liquidity risk management.

Model: Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi
In this paper the authors describe how Big data analytics can play a significant role in the early detection and diagnosis of fatal diseases, reduction in health care costs and improvement in the quality of health care administration.

Approach: Big Data Powered Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu
With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. This paper talks about the need for an Extreme Content Hub to tame the Big data explosion.

Insight: Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Complex Event Processing along with in-memory data grid technologies can help in pattern detection, matching, analysis, processing and split-second decision making in Big data scenarios, opine the authors.

Practitioners' Perspective: Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
This paper suggests the need for a robust testing approach to validate Big data systems and identify possible defects early in the implementation life cycle.

Research: Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash
Classical visualization methods are falling short in accurately representing multidimensional and ever-growing Big data. Taking inspiration from nature, the author proposes a spider-cobweb visualization technique for Big data.


"A robust testing approach needs to be defined for validating structured and unstructured data to identify possible defects early in the implementation life cycle."
Naju D. Mohan, Delivery Manager, RCL Business Unit, Infosys Ltd.

"Big Data augmented with Complex Event Processing capabilities can provide solutions utilizing in-memory data grids for analyzing trends, patterns and events in real time."
Bill Peer, Principal Technology Architect, Infosys Labs, Infosys Ltd.


Metadata Management in Big Data
By Gautham Vemuganti

Big data, true to its name, deals with large volumes of data characterized by volume, variety and velocity. Any enterprise that is in the process of, or considering, a Big data application deployment has to address the metadata management problem. Traditionally, much of the data that business users use is structured. This, however, is changing with the exponential growth of data, or Big data.

Metadata defining this data, however, is spread across the enterprise in spreadsheets, databases, applications and even in people’s minds (the so-called “tribal knowledge”). Most enterprises do not have a formal metadata management process in place because of the misconception that it is an Information Technology (IT) imperative and it does not have an impact on the business.

However, the converse is true. It has been proven that a robust metadata management process is not only necessary but critical for successful information management. Big data introduces large volumes of unstructured data for analysis. This data could be in the form of a text file or any multimedia file (e.g., audio, video). To bring this data into the fold of an information management solution, its metadata should be correctly defined.

Metadata management solutions provided by various vendors usually have a narrow focus. An ETL vendor will capture metadata for the ETL process. A BI vendor will provide metadata management capabilities for its BI solution. The siloed nature of metadata does not give business users an opportunity to have a say and actively engage in metadata management. A good metadata management solution must provide visibility across multiple solutions and bring business users into the fold for a collaborative, active metadata management process.

METADATA MANAGEMENT CHALLENGES
Metadata, simply defined, is data about data. In the context of analytics, some common examples of metadata are report definitions, table definitions, the meaning of a particular master data entity (sold-to customer, for example), ETL mappings, and formulas and computations. The importance of metadata cannot be overstated. Metadata drives the accuracy of reports, validates data transformations, ensures accuracy of calculations and enforces consistent definitions of business terms across multiple business users.

Big data analytics must reckon with the importance and criticality of metadata

In a typical large enterprise which has grown by mergers, acquisitions and divestitures, metadata is scattered across the enterprise in various forms as noted in the introduction.

In large enterprises, there is wide acknowledgement that metadata management is critical, but most of the time there is no enterprise-level sponsorship of a metadata management initiative. Even if there is, it is focused only on one specific project sponsored by one specific business.

The impact of good metadata management practices is not consistently understood across the various levels of the enterprise. Conversely, the impact of poorly managed metadata comes to light only after the fact, i.e., after a certain transformation happens, a report or a calculation is run, or two divisional data sources are merged.

Metadata is typically viewed as the exclusive responsibility of the IT organization with business having little or no input or say in its management. The primary reason is that there are multiple layers of organization between IT and business. This introduces communication barriers between IT and business.

Finally, metadata is not viewed as a very exciting area of opportunity. It is only addressed as an afterthought.

DIFFERENCES BETWEEN TRADITIONAL AND BIG DATA ANALYTICS
In traditional analytics implementations, data is typically stored in a data warehouse. The data warehouse is modeled using one of several techniques, is developed over time and is a constantly evolving entity. Analytics applications developed using the data in a data warehouse are also long-lived. Data governance in traditional analytics is a centralized process. Metadata is managed as part of the data governance process.

[Figure 1: Data Governance Shift with Big Data Analytics. Source: Infosys Research. The figure contrasts a single monolithic governance process (people, rules, metrics, process) with multiple simultaneous governance processes.]

In traditional analytics, data is discovered, collected, governed, stored and distributed.

Big data introduces large volumes of unstructured data. This data is highly dynamic and therefore needs to be ingested quickly for analysis.

Big data analytics applications, however, are characterized by short-lived, quick implementations focused on solving a specific business problem. The emphasis of Big data analytics applications is more on experimentation and speed as opposed to long-drawn-out modeling exercises.

The need to experiment and derive insights quickly using data changes the way data is governed. In traditional analytics there is usually one central governance team focused on governing the way data is used and distributed in the enterprise. In Big data analytics, there are multiple governance processes in play simultaneously, each geared towards answering a specific business question. Figure 1 illustrates this.

Most of the metadata management challenges we referred to in the previous section alluded to typical enterprise data that is highly structured. To analyze unstructured data, additional metadata definitions are necessary.

To illustrate the need to enhance metadata to support Big data analytics, consider sentiment analysis using social media conversations as an example. Say someone posts a message on Facebook: "I do not like my cell-phone reception. My wireless carrier promised wide cell coverage but it is spotty at best. I think I will switch carriers." To infer the intent of this customer, the inference engine has to rely on metadata as well as the supporting domain ontology. The metadata will define "Wireless Carrier", "Customer", "Sentiment" and "Intent". The inference engine will leverage the ontology dependent on this metadata to infer that this customer wants to switch cell phone carriers.
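To make this concrete, here is a minimal, hypothetical sketch of how such metadata definitions and a small domain ontology could drive intent inference over a post like the one above. The keyword lists, the ONTOLOGY table and the infer_intent helper are illustrative assumptions, not part of the framework proposed in this paper.

    # Hypothetical metadata definitions for the wireless-carrier example.
    # Each entry maps a metadata entity to the surface cues that signal it.
    METADATA = {
        "Wireless Carrier": ["carrier", "reception", "coverage"],
        "Customer": ["my", "i "],
        "Sentiment": {"negative": ["do not like", "spotty"], "positive": ["love", "great"]},
        "Intent": {"churn": ["switch carriers", "cancel"]},
    }

    # A toy ontology: which conclusions follow from which entity/sentiment/intent triples.
    ONTOLOGY = {("Wireless Carrier", "negative", "churn"): "customer likely to switch carriers"}

    def infer_intent(post: str) -> str:
        """Return an inferred intent for a social post using the metadata above."""
        text = post.lower()
        entity = "Wireless Carrier" if any(k in text for k in METADATA["Wireless Carrier"]) else None
        sentiment = "negative" if any(k in text for k in METADATA["Sentiment"]["negative"]) else "neutral"
        intent = "churn" if any(k in text for k in METADATA["Intent"]["churn"]) else "none"
        return ONTOLOGY.get((entity, sentiment, intent), "no actionable intent detected")

    post = ("I do not like my cell-phone reception. My wireless carrier promised wide "
            "cell coverage but it is spotty at best. I think I will switch carriers")
    print(infer_intent(post))   # -> customer likely to switch carriers

In practice the cues and the ontology would themselves be governed metadata rather than hard-coded dictionaries.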

Big data is not just restricted to text. It could also contain images, videos and voice files. Understanding, categorizing and creating metadata to analyze this kind of non-traditional content is critical.

It is evident that Big data introduces additional challenges in metadata management. It is clear that there is a need for a robust metadata management process that governs metadata with the same rigor as data, if enterprises are to be successful with Big data analytics.

To summarize, a metadata management process specific to Big data should incorporate the context and intent of data, support non-traditional sources of data and be robust to handle the velocity of Big data.

ILLUSTRATIVE EXAMPLE
Consider an existing master data management system in a large enterprise. This master data system has been developed over time. It has specific master data entities like product, customer, vendor, employee, etc. The master data system is tightly governed and data is processed (cleansed, enriched and augmented) before it is loaded into the master data repository.

This specific enterprise is considering bringing in social media data for enhanced customer analytics. This social media data is to be sourced from multiple sources and incorporated into the master data management system.

As noted earlier, social media conversations have context, intent and sentiment. The context refers to the situation in which a customer was mentioned, the intent refers to the action that an individual is likely to take and the sentiment refers to the "state of being" of the individual.

For example, if an individual sends a tweet or starts a Facebook conversation about a retailer from a football game, the context would be a sports venue. If the tweet or conversation consisted of positive comments about the retailer, then the sentiment would be determined as positive. If the update consisted of highlighting a promotion by the retailer, then the intent would be to collaborate or share with the individual's network.

If such social media updates have to be incorporated into any solution within the enterprise, then the master data management solution has to be enhanced with metadata about "Context", "Sentiment" and "Intent". Static lookup information will need to be generated and stored so that an inference engine can leverage this information to provide inputs for analysis. This will also necessitate a change in the back-end: the ETL processes that are responsible for this master data will now have to incorporate the social media data as well. Furthermore, the customer information extracted from these feeds needs to be standardized before being loaded into any transaction system (a minimal sketch of such an enrichment step follows).
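As a rough illustration of the enrichment and standardization step described above, the sketch below attaches the three new metadata attributes to an existing customer master record. The field names and the standardize_customer helper are hypothetical; a real ETL job would pull these values from the inference layer discussed earlier rather than from a literal dictionary.

    # Hypothetical enrichment of a customer master record with social media metadata.
    # Context, Sentiment and Intent are the new attributes discussed above.
    def standardize_customer(handle: str) -> str:
        """Toy standardization: trim whitespace, strip a leading '@' and lower-case."""
        return handle.strip().lstrip("@").lower()

    def enrich_master_record(master_record: dict, social_update: dict) -> dict:
        """Attach social-media-derived metadata to an existing master data record."""
        enriched = dict(master_record)
        enriched["social_handle"] = standardize_customer(social_update["handle"])
        enriched["context"] = social_update.get("context", "unknown")      # e.g., sports venue
        enriched["sentiment"] = social_update.get("sentiment", "neutral")  # e.g., positive
        enriched["intent"] = social_update.get("intent", "none")           # e.g., share promotion
        return enriched

    record = {"customer_id": "C-1001", "segment": "retail"}
    update = {"handle": "@JaneDoe ", "context": "sports venue",
              "sentiment": "positive", "intent": "share promotion"}
    print(enrich_master_record(record, update))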

FRAMEWORK FOR METADATA MANAGEMENT IN BIG DATA ANALYTICS
We propose that metadata be managed using the five components shown in Figure 2; a minimal code sketch follows the component descriptions below.

[Figure 2: Metadata Management Framework for Big Data Analytics. Source: Infosys Research. The framework comprises metadata discovery, collection, governance, storage and distribution.]

Metadata Discovery – Discovering metadata is critical in Big data for the reasons of context and intent noted in the prior section. Social data is typically sourced from multiple sources, and all these sources will have different formats. Once metadata for a certain entity is discovered for one source, it needs to be harmonized across all sources of interest. This process for Big data will need to be formalized using metadata governance.

Metadata Collection – A metadata collection mechanism should be implemented. A robust collection mechanism should aim to minimize or eliminate metadata silos. Once again, a technology or a process for metadata collection should be implemented.

Metadata Governance – Metadata creation and maintenance needs to be governed. Governance should include resources from both the business and IT teams. A collaborative framework between business and IT should be established to provide this governance. Appropriate processes (manual or technical) should be utilized for this purpose. For example, on-boarding a new Big data source should be a collaborative effort between business users and IT. IT will provide the technology to enable business users to discover metadata.


Metadata Storage – Multiple models for enterprise metadata storage exist. The Common Warehouse Meta-model (CWM) is one example. A similar model, or an extension thereof, can be utilized for this purpose. If one such model will not fit the requirements of an enterprise, then suitable custom models can be developed.

Metadata Distribution – This is the final component. Metadata, once stored, will need to be distributed to consuming applications. A formal distribution model should be put into place to enable this distribution. For example, some applications can directly integrate with the metadata storage layer, while others will need specialized interfaces to be able to leverage this metadata.
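The sketch below strings the five components together as a toy discover-collect-govern-store-distribute pipeline over simple in-memory dictionaries. The function bodies, source names and the steward label are placeholder assumptions, not a description of any particular metadata product.

    # Minimal sketch of the five-component metadata pipeline (illustrative only).
    from typing import Dict, List

    def discover(sources: List[str]) -> List[Dict]:
        # Inspect each source and emit raw metadata records (entity name plus origin).
        return [{"entity": f"{s}.customer", "source": s} for s in sources]

    def collect(records: List[Dict]) -> List[Dict]:
        # Pull discovered records into one place to avoid metadata silos.
        return list(records)

    def govern(records: List[Dict]) -> List[Dict]:
        # Apply naming and ownership rules agreed between business and IT.
        for r in records:
            r["owner"] = "customer-data-steward"
            r["approved"] = True
        return records

    def store(records: List[Dict]) -> Dict[str, Dict]:
        # Persist into a simple keyed repository (a CWM-style model in practice).
        return {r["entity"]: r for r in records}

    def distribute(repository: Dict[str, Dict], consumer: str) -> List[Dict]:
        # Hand approved metadata to a consuming application, tagged with its name.
        return [{"consumer": consumer, **r} for r in repository.values() if r["approved"]]

    repo = store(govern(collect(discover(["crm", "twitter_feed"]))))
    print(distribute(repo, consumer="analytics_app"))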

We note that in traditional analytics implementations, a framework similar to the one we propose exists, but for data.

The metadata management framework should be implemented alongside a data management framework to realize Big data analytics.

[Figure 3: Equal Importance of Metadata and Data Processing for Big Data Analytics. Source: Infosys Research. The figure shows parallel data and metadata pipelines, each with discovery, collection, governance, storage and distribution stages.]

THE PARADIGM SHIFT
The discussion in this paper brings to light the importance of metadata and the impact it has not only on Big data analytics but on traditional analytics as well. We are of the opinion that if enterprises want to get value out of their data assets and leverage the Big data tidal wave, then the time is right to shift the paradigm from data governance to metadata governance and make data management part of the metadata governance process.

A framework is only as good as how it is viewed and implemented within the enterprise. The metadata management framework is successful if there is sponsorship for this effort from the highest levels of management. This includes both business and IT leadership within the enterprise. The framework can be viewed as being very generic. Change is a constant in any enterprise, and the framework can be made flexible to adapt to the changing needs and requirements of the business.

All the participants and personas engaged in the data management function within an enterprise should participate in the process. This will promote and foster collaboration between business and IT. This should be made sustainable and followed diligently by all the participants, so that the framework is used to on-board not only new data sources but also new participants in the process.

Metadata and its management is an oft-ignored area in enterprises, with multiple consequences. The absence of robust metadata management processes leads to erroneous results, project delays and multiple interpretations of business data entities. These are all avoidable with a good metadata management framework.

The consequences affect the entire enterprise either directly or indirectly. From the lowest-level employee to the senior-most executive, incorrect or poorly managed metadata not only affects operations but also directly impacts the top-line growth and bottom-line profitability of an enterprise. Big data is viewed as the most important innovation that brings tremendous value to enterprises. Without a proper metadata management framework, this value might not be realized.

CONCLUSION
Big data has created quite a bit of buzz in the marketplace. Pioneers like Yahoo and Google created the foundations of what is today called Hadoop. There are multiple players in the Big data market today, ranging from those developing technology to manage Big data and applications to analyze it, to companies engaged in Big data analysis and selling that content.

In the midst of all the innovation in the Big data space, metadata is often forgotten. It is important for us to recognize and realize the importance of metadata management and the critical impact it has on enterprises.

If enterprises wish to remain competitive, they have to embark on Big data analytics initiatives. In this journey, enterprises cannot afford to ignore the metadata management problem.



Optimization Model for Improving Supply Chain Visibility

By Saravanan Balaraj

In today's competitive 'lead or leave' marketplace, Big data is seen as an oxymoron that offers both challenge and opportunity. Effective and efficient strategies to acquire, manage and analyze data lead to better decision making and competitive advantage. Unlocking potential business value out of this diverse and multi-structured dataset beyond organizational boundaries is a mammoth task.

We have stepped into an interconnected and intelligent digital world where the convergence of new technologies is happening fast all around us. In this process the underlying data set is growing not only in volume but also in velocity and variety. The resulting data explosion created by a combination of mobile devices, tweets, social media, blogs, sensors and emails demands a new kind of data intelligence.

Big data has started creating a lot of buzz across verticals, and Big data in the supply chain is no different. Supply chain is one of the key focus areas that have undergone transformational changes in the recent past. Traditional supply chain applications leverage only transactional data to solve operational problems and improve efficiency. Having stepped into the Big data world, existing supply chain applications have become obsolete as they are unable to cope with tremendously increasing data volumes cutting across multiple sources, the speed with which data is generated and the unprecedented growth in new data forms.

Enterprises are under tremendous pressure to solve new problems emerging out of new forms of data. Handling large volumes of data across multiple sources and deriving value from this massive chunk for strategy execution is the biggest challenge that enterprises face in today's competitive landscape. Careful analysis and appropriate usage of these data would result in cost reduction and better operational performance. Competitive pressures and customers' 'more for less' attitudes have left enterprises with no choice other than to rethink their supply chain strategies and create a differentiation.

Enterprises need to adopt different Big data analytic tools and technologies to improve their supply chains

Enterprises need to adopt appropriate Big data techniques and technologies and build suitable models to derive value out of this unstructured data and thereby plan, schedule and route in a cost-effective manner. This paper explores the challenges that dot Big data adoption in the supply chain and proposes a value model for Big data optimization.

BIG DATA WAVE
International Data Corporation (IDC) has predicted that the Big data market will grow from $3.2 billion in 2010 to $16.9 billion by 2015 at a compound annual growth rate of 40% [2]. This shows tremendous traction towards Big data tools, technologies and platforms among enterprises. A lot of research and investment is going into how to fully tap the potential benefits hidden in Big data and derive financial value out of it. Value derived out of Big data enables enterprises to achieve differentiation by reducing cost, planning efficiently and thereby improving process efficiency.

Big data is an important asset in the supply chain that enterprises are looking to capitalize upon. They adopt different Big data analytic tools and technologies to improve their supply chain, production and customer engagement processes. The path towards operational excellence is facilitated through efficient planning and scheduling of production and logistics processes.

Though supply chain data is really huge, it brings about the biggest opportunity for enterprises to reduce cost and improve their operational performance. The areas in supply chain planning where Big data can create an impact are: demand forecasting, inventory management, production planning, vendor management and logistics optimization. Big data can improve the supply chain planning process if appropriate business models are identified, designed, built and then executed. Some of its key benefits are: short time-to-market, improved operational excellence, cost reduction and increased profit margins.

CHALLENGES WITH SUPPLY CHAIN PLANNING
The success of the supply chain planning process depends on how closely demands are forecasted, inventories are managed and logistics are planned. Supply chain is the heart of any industry vertical and, if managed efficiently, drives positive business outcomes and enables sustainable advantage. With the emergence of Big data, optimizing supply chain processes has become more complicated than ever before. Handling Big data challenges in the supply chain and transforming them into opportunities is the key to corporate success. The key challenges are:

■ Volume - According to a McKinsey report, the number of RFID tags sold globally is projected to increase from 12 million in 2011 to 209 billion in 2021 [3]. Along with this phenomenal increase in the usage of temperature sensors, QR codes and GPS devices, the underlying supply chain data generated has multiplied manifold beyond our expectations. Data is flowing across multiple systems and sources, and hence it is likely to be error-prone and incomplete. Handling such huge data volumes is a challenge.


■ Velocity - Business has become highly dynamic and volatile. The changes arising due to unexpected events must be handled in a timely manner in order to avoid losing out in business. Enterprises are finding it extremely difficult to cope with this data velocity. Optimal decisions must be made quickly, and shorter processing times are the key to successful operational execution, which is lacking in traditional data management systems.

■ Variety - In supply chain, data has emerged in different forms which don't fit traditional applications and models. Structured (transactional) data, unstructured (social) data and sensor data (temperature and RFID), along with new data types (video, voice and digital images), have created nightmares for enterprises trying to handle such diverse and heterogeneous data sets.

In today's data explosion in terms of volume, variety and velocity, handling the data alone doesn't suffice. Value creation by analyzing such massive data sets and extraction of data intelligence for successful strategy execution is the key.

[Figure 1: Optimization Model for Improving Supply Chain Visibility - I. Source: Infosys Research. The figure shows the Acquire stage: data sourcing, data extraction & cleansing and data representation applied to structured, unstructured and new data types drawn from transactional and Big data systems (e.g., HDFS, MapReduce, Hive, Pig, Cascading, NoSQL).]

BIG DATA IN DEMAND FORECASTING & SUPPLY CHAIN PLANNING
Enterprises use forecasting to determine how much of each product type to produce, and when and where to ship it, thereby improving supply chain visibility. Inaccurate forecasts have a detrimental effect on the supply chain. Over-forecasting results in inventory pile-ups and working capital locks. Under-forecasting leads to failure in meeting demand, resulting in loss of customers and sales. Hence, in today's volatile market of unpredictable shifts in customer demand, improving forecast accuracy is of paramount importance.

Data in supply chain planning has mushroomed in terms of volume, velocity and variety. Tesco, for instance, generates more than 1.5 billion new data items every month. Wal-Mart's warehouse handles some 2.5 petabytes of information, which is roughly equivalent to half of all the letters delivered by the US Postal Service in 2010. According to the McKinsey Global Institute report [3], leveraging Big data in demand forecasting and supply chain planning could increase profit margins by 2-3% in the Fast Moving Consumer Goods (FMCG) manufacturing value chain. This unearths a tremendous opportunity in forecasting and supply chain planning for enterprises to capitalize on this Big data deluge.

MISSING LINKS IN TRADITIONAL APPROACHES
Enterprises have started realizing the importance of Big data in forecasting and have begun investing in Big data forecasting tools and technologies to improve their supply chain, production and manufacturing planning processes. Traditional forecasting tools are not adequate for handling huge data volumes, variety and velocity. Moreover, they are missing out on the following key aspects which improve forecast accuracy:

■ Social Media Data As An Input: Social media is a platform that enables enterprises to collect information about potential and prospective customers. Technological advancements have made tracking customer data easier: companies can now track every visit a customer makes to their websites, every email exchanged and comments logged across social media websites. Social media data helps analyze the customer pulse and gain insights for forecasting, planning and scheduling of supply chains and inventories. Buzz in social networks can be used as an input for demand forecasting, with numerous benefits (a minimal sketch after this list illustrates the idea). One such use case: an enterprise can launch a new product to online fans to sense customer acceptance, and based on the response, inventories and the supply chain can be planned to direct stocks to high-buzz locations during the launch phase.

■ Predict And Respond Approach: Traditional forecasting is done by analyzing historical patterns and considering sales inputs and promotional plans to forecast demand and plan the supply chain. It focuses on 'what happened' and works on a 'sense and respond' strategy. 'History repeats itself' is no longer apt in today's competitive marketplace. Enterprises need to focus on 'what will happen' and require a 'predict and respond' strategy to stay alive in business. This calls for models and systems capable of capturing, handling and analyzing huge volumes of real-time data generated from unexpected competitive events, weather patterns, point-of-sale systems and natural disasters (volcanoes, floods, etc.) and converting them into actionable information for forecasting plans on production, inventory holdings and supply chain distribution.

■ Optimized Decisions with Simulations: Traditional decision support systems lack the flexibility to meet changing data requirements. In real-world scenarios, supply chain delivery plans change unexpectedly due to various reasons like demand changes, revised sales forecasts, etc. The model and system should have the ability to factor this in and respond quickly to such unplanned events. A decision should be taken only after careful analysis of the unplanned event's impact on other elements of the supply chain. Traditional approaches lack this capability, and this necessitates a model for performing what-if analysis on all possible decisions and selecting the optimal one in the Big data context.
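As a simple illustration of the first point above (social media data as a forecasting input), the following sketch folds an assumed weekly social buzz signal into an ordinary least-squares demand forecast using NumPy. The sales history, the buzz feature and the launch-phase buzz level are invented for the example and are not drawn from the paper.

    import numpy as np

    # Hypothetical weekly history: units sold and a social "buzz" score (mentions).
    sales = np.array([100, 120, 130, 160, 180, 210], dtype=float)
    buzz = np.array([20, 35, 40, 70, 90, 120], dtype=float)
    weeks = np.arange(len(sales), dtype=float)

    # Fit demand as a linear function of time and buzz: sales ~ b0 + b1*week + b2*buzz.
    X = np.column_stack([np.ones_like(weeks), weeks, buzz])
    coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

    # Forecast next week under an assumed launch-phase buzz level of 150 mentions.
    next_week = np.array([1.0, len(sales), 150.0])
    print("forecast units:", float(next_week @ coef))

The same idea extends to the other inputs discussed here, such as weather patterns or point-of-sale events, by adding further columns to the regression.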

IMPROVING SUPPLY CHAIN VISIBILITY USING BIG DATA
Supply chain doesn't lack data; what's missing is a suitable model to convert this huge, diverse raw data into actionable information so that enterprises can make critical business decisions for efficient supply chain planning. A 3-stage optimized value model helps to overcome the challenges posed by Big data in supply chain planning and demand forecasting. It bridges the existing gaps in traditional Big data approaches and offers a perspective to unlock the value from the growing Big data torrent. Designing and building an optimized Big data model for supply chain planning is a complex task, but successful execution leads to significant financial benefits. Let's take a deep dive into each stage of this model and analyze the value each adds to an enterprise's supply chain planning process.

Acquire Data: The biggest driver of supply chain planning is data. Acquiring all the relevant data for supply chain planning is the first step in this optimized model. It involves three steps, namely data sourcing, data extraction and cleansing, and data representation, which make data ready for further analysis (a minimal sketch follows this list).

■ Data Sourcing - Data is available in different forms across multiple sources, systems and geographies. It contains extensive details of historical demand and other relevant information. For further analysis it is therefore necessary to source the required data. In addition to transactional data, the data to be sourced for improving forecast accuracy includes:

■ Product Promotion data - items, prices, sales

■ Launch data - items to be ramped up or down

■ Inventory data - stock in warehouse

■ Customer data - purchase history, social media data

■ Transportation data - GPS and logistics data.

Enterprises should adopt appropriate Big data systems that are capable of handling such huge data volumes, variety and velocity.


■ Data Extraction and Cleansing - Data sources come in different forms, from structured (transactional data) to unstructured (social media, images, sensor data, etc.), and they are not in analysis-friendly formats. Also, due to the large volume of heterogeneous data, there is a high probability of inconsistencies and data errors while sourcing. The sourced data should be expressed in structured form for supply chain planning. Moreover, analyzing inaccurate and untimely data leads to erroneous, non-optimal results. High-quality, comprehensive data is a valuable asset, and appropriate data cleansing mechanisms should be in place for maintaining the quality of Big data. The choice of Big data tools for data cleansing and enrichment plays a crucial role in supply chain planning.

■ Data Representation – Database design for such huge data volumes is a herculean task and poses serious performance issues if not executed properly. Data representation plays a key role in Big data analysis. There are numerous ways to store data, and each design has its own set of advantages and drawbacks. Selecting an appropriate database design that favors the business objectives reduces the effort in reaping benefits out of Big data analysis in supply chain planning.
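Assuming the pandas library and some invented column names, the sketch below shows how the three acquire steps might look in code: source a few heterogeneous feeds, cleanse obviously bad rows, and represent the result as one structured table ready for analysis. It is a toy under those assumptions, not the data pipeline proposed in the paper.

    import pandas as pd

    # Hypothetical raw feeds standing in for transactional, inventory and social sources.
    transactions = pd.DataFrame({"sku": ["A1", "A2"], "units_sold": [120, -5], "week": [1, 1]})
    inventory = pd.DataFrame({"sku": ["A1", "A2"], "stock": [300, 150]})
    social_buzz = pd.DataFrame({"sku": ["A1"], "mentions": [450]})

    def cleanse(df: pd.DataFrame) -> pd.DataFrame:
        """Drop obviously bad rows, e.g. negative sales counts caused by feed errors."""
        numeric = df.select_dtypes("number")
        return df[(numeric >= 0).all(axis=1)].copy()

    def represent(frames: list) -> pd.DataFrame:
        """Join the cleansed feeds on SKU into one analysis-ready table."""
        merged = frames[0]
        for f in frames[1:]:
            merged = merged.merge(f, on="sku", how="left")
        return merged.fillna(0)

    analysis_table = represent([cleanse(transactions), cleanse(inventory), cleanse(social_buzz)])
    print(analysis_table)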

Analyze Data: The next stage is analyzing the cleansed data and capturing value for forecasting and supply chain planning. There is a plethora of Big data techniques available in the market for forecasting and supply chain planning. The selection of a Big data technique depends on the business scenario and enterprise objectives. Incompatible data formats make value creation from Big data a complex task, and this calls for innovation in techniques to unlock business value out of the growing Big data torrent. The proposed model adopts an optimization technique to generate insights out of this voluminous and diverse Big dataset.

■ Optimization in Big data analysis - Manufacturers have started synchronizing forecasting with production cycles, so the accuracy of forecasting plays a crucial role in their success. Adoption of an optimization technique in Big data analysis creates a new perspective and helps in improving the accuracy of demand forecasting and supply chain planning. Analyzing the impact of promotions on one specific product for demand forecasting appears to be an easy task. But real-life scenarios comprise a huge array of products, with the factors affecting demand varying for every product and location, making it difficult for traditional techniques of data analysis.

Optimization has several capabilities which make it an ideal choice for data analysis in such scenarios. Firstly, the technique is designed for analyzing and drawing insights from highly complex systems with huge data volumes and multiple constraints and factors to be accounted for. Secondly, supply chain planning has a number of enterprise objectives associated with it, like cost reduction, demand fulfillment, etc. The impact of each of these objective measures on enterprise profitability can be easily analyzed using the optimization technique. The flexibility of the optimization technique is another benefit that makes it suitable for Big data analysis, to uncover new data connections and turn them into insights.

The optimization model comprises four components, viz., (i) input – consistent, real-time, quality data which is sourced, cleansed and integrated becomes the input of the optimization model; (ii) goals – the model should take into consideration all the goals pertaining to forecasting and supply chain planning, like minimizing cost, maximizing demand coverage, maximizing profits, etc.; (iii) constraints – the model should incorporate all the constraints specific to supply chain planning, such as minimum inventory in the warehouse, capacity constraints, route constraints, demand coverage constraints, etc.; and (iv) output – results based on the input, goals and constraints defined in the model that can be used for strategy execution. The result can be a demand plan, inventory plan, production plan, logistics plan, etc. (a minimal optimization sketch follows this list).

■ Choice of Algorithm: One of the key differentiators in supply chain planning is the algorithm used in modeling. Optimization problems have numerous possible solutions, and the algorithm should have the capability to fine-tune itself to achieve optimal solutions.

[Figure 2: Optimization Model for Improving Supply Chain Visibility - II. Source: Infosys Research. The figure shows the three stages: Acquire (data sourcing, data extraction & cleansing, data representation), Analyze (the optimization technique with its inputs, goals, constraints and output plans) and Achieve (scenario management, multi-user collaboration and performance trackers/KPI dashboards).]
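To make the input-goals-constraints-output structure concrete, here is a minimal linear programming sketch using scipy.optimize.linprog. The products, costs, demand and capacity figures are invented; the model described in this paper would carry far more constraints and would be fed by the sourced and cleansed Big data rather than hard-coded numbers.

    from scipy.optimize import linprog

    # Decision variables: units of products P1 and P2 to stock at a warehouse.
    # Goal: minimize total holding-plus-transport cost (input data is hypothetical).
    cost = [4.0, 6.0]                     # cost per unit of P1, P2

    # Constraints (illustrative): meet forecast demand and respect capacity.
    # linprog expects A_ub @ x <= b_ub, so ">=" demand constraints are negated.
    A_ub = [[-1.0, 0.0],                  # -x1 <= -80  -> x1 >= 80 (P1 demand)
            [0.0, -1.0],                  # -x2 <= -60  -> x2 >= 60 (P2 demand)
            [1.0, 1.0]]                   # x1 + x2 <= 200 (warehouse capacity)
    b_ub = [-80.0, -60.0, 200.0]

    result = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

    # Output: an inventory plan that satisfies demand at minimum cost.
    print("optimal units (P1, P2):", result.x)
    print("minimum total cost:", result.fun)

Swapping the goal (e.g., maximizing demand coverage) or adding route and inventory constraints only changes the cost vector and the constraint rows, which is what makes the technique flexible for what-if analysis.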

Achieve Business Objective: The final stage in this model is achieving business objectives through demand forecasting and supply chain planning. It involves three steps which facilitate enterprise supply chain decisions.

■ Scenario Management – Business events are difficult to predict and most of the time deviate from their standard paths, resulting in unexpected behaviors and events. This makes planning and optimizing during uncertain times difficult. Scenario management is the approach to overcome such uncertain situations. It facilitates creating business scenarios, comparing multiple scenarios, and analyzing and assessing their impact before making decisions. This capability helps to balance conflicting KPIs and arrive at an optimal solution matching business needs.

■ Multi User Collaboration – An optimization model in a real business case comprises highly complex data sets and models, which require support from an army of analysts to determine their effects on enterprise goals. A combination of technical and domain experts is required to obtain optimal results. To achieve near-accurate forecasts and supply chain optimization, the model should support multi-user collaboration so that multiple users can collaboratively produce optimal plans and schedules and re-optimize as and when the business changes. This model builds a collaborative system with the capability of supporting inputs from multiple users and incorporating them in its decision-making process.

■ Performance Tracker – Demand forecasting and supply chain planning do not follow a build-model-execute approach; they require significant continuous effort. Frequent changes in the inputs and business rules necessitate monitoring of data, model and algorithm performance. Actual and planned results are to be compared regularly, and steps taken to minimize deviations in accuracy. KPIs are to be defined, and dashboards should be constantly monitored for model performance.

KEY BENEFITS
Enterprises can accrue a lot of benefits by adopting this 3-stage model for Big data analysis. Some of them are detailed below:

Improves Accuracy of Forecast: One of the key objectives of forecasting is profit maximization. This model adopts effective data sourcing, cleansing and integration systems and makes data ready for forecasting. Inclusion of social media data, promotional data, weather predictions and seasonality, in addition to historical demand and sales histories, adds value and improves forecasting accuracy. Moreover, the optimization technique for Big data analysis reduces forecasting errors to a great extent.

Continuous Improvement: The Acquire-Analyze-Achieve model is not a hard-wired model. It allows flexibility to fine-tune and supports what-if analysis. Multiple scenarios can be created, compared and simulated to identify the impact of a change on the supply chain and demand forecast prior to making any decisions. It also enables enterprises to define, track and monitor KPIs from time to time, resulting in continuous process improvements.

Better Inventory Management: Inventory data, along with weather predictions, sales history and seasonality, is considered as an input to the model for forecasting and planning the supply chain. This approach minimizes incidents of out-of-stocks or over-stocks across different warehouses. An optimal plan for inventory movement is forecasted and appropriate stocks are maintained at each warehouse to meet upcoming demand. To a great extent this will reduce loss of sales and business due to stock-outs and lead to better inventory management.

Logistics Optimization: Constant sourcing and continuous analysis of transportation data (GPS and other logistics data), and using them for demand forecasting and supply chain planning through optimization techniques, helps in improving distribution management. Moreover, optimization of logistics improves fuel efficiency and enables efficient routing of vehicles, resulting in operational excellence and better supply chain visibility.

CONCLUSIONS
As the rapid penetration of information technology in supply chain planning continues, the amount of data that can be captured, stored and analyzed has increased manifold. The challenge is to derive value out of these large volumes of data by unlocking financial benefits congruent with the enterprise's business objectives.

Competitive pressures and customers' 'more for less' attitude have left enterprises with no option other than reducing cost in their operational executions. Adopting effective and efficient supply chain planning and optimization techniques to match customer expectations with their offerings is the key to corporate success. To attain operational excellence and sustainable advantage, it is necessary for the enterprise to build innovative models and frameworks leveraging the power of Big data.

The optimized value model on Big data offers a unique way of demand forecasting and supply chain optimization through collaboration, scenario management and performance management. Built on continuous improvement, this model opens up doors to big opportunities for the next generation of demand forecasting and supply chain optimization.

REFERENCES
1. IDC Press Release (2012), IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy. Available at http://www.idc.com/getdoc.jsp?containerId=prUS23355112.
2. McKinsey Global Institute (2011), Big data: The next frontier for innovation, competition, and productivity. Available at http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx.
3. Furio, S., Andres, C., Lozano, S., Adenso-Diaz, B. (2009), Mathematical model to optimize land empty container movements. Available at http://www.fundacion.valenciaport.com/Articles/doc/presentations/HMS2009_Paperid_27_Furio.aspx.
4. Stojković, G., Soumis, F., Desrosiers, J., Solomon, M. (2001), An optimization model for a real-time flight scheduling problem. Available at http://www.sciencedirect.com/science/article/pii/S0965856401000398.
5. Beck, M., Moore, T., Plank, J., Swany, M. (2000), Logistical Networking. Available at http://loci.cs.utk.edu/ibp/files/pdf/LogisticalNetworking.pdf.
6. Lasschuit, W., Thijssen, N. (2004), Supporting supply chain planning and scheduling decisions in the oil and chemical industry, Computers and Chemical Engineering, issue 28, pp. 863-870. Available at http://www.aimms.com/aimms/download/case_studies/shell_elsevier_article.pdf.


Retail Industry – Moving to Feedback Economy

By Prasanna Rajaraman and Perumal Babu

The retail industry is going through a major paradigm shift. The past decade has seen unprecedented churn in the retail industry, virtually changing the landscape. Erstwhile marquee brands on the traditional retailing side have ceded space to start-ups and new business models.

The key driver of this change is a confluence of technological, sociological and customer behavioral trends creating a strategic inflection point in the retailing ecology. Trends like the emergence of the internet as a major retailing channel, social platforms going mainstream, pervasive retailing and the emergence of the digital customer have presented a major challenge to traditional retailers and retailing models.

On the other hand, these trends have also enabled opportunities for retailers to better understand customer dynamics. For the first time, retailers have access to an unprecedented amount of publicly available information on customer behavior and trends, voluntarily shared by customers. The more effective retailers can tap into these behavioral and social reservoirs of data to model the purchasing behaviors and trends of their current and prospective customers. Such data can also provide retailers with predictive intelligence which, if leveraged effectively, can create enough mindshare that the sale is completed even before the conscious decision to purchase is taken.

This move to a feedback economy, where retailers can have a 360-degree view of the customer thought process across the selling cycle, is a paradigm shift for the retail industry: from the retailer driving sales to the retailer engaging the customer across the sales and support cycle. Every aspect of retailing, from assortment/allocation planning and marketing/promotions to customer interactions, has to take the evolving consumer trends into consideration.

Gain better insight into customer dynamics through Big Data analytics

The implication from a business perspective is that retailers have to better understand customer dynamics and align business processes effectively with these trends. In addition, this implies that cycle times will be shorter and businesses have to be more tactical in their promotions and offerings. Retailers who can ride this wave will be better able to address demand and command higher margins for their products and services. Failing this, retailers will be left with the low-margin pricing/commodity space.

From an information technology perspective, the key challenge is that the nature of this information, with respect to lifecycle, velocity, heterogeneity of sources and volume, is radically different from what traditional systems handle. Also, there are overarching concerns like data privacy, compliance and regulatory changes that need to be internalized within internal processes. The key is to manage the lifecycle of this Big data, effectively integrate it with organizational systems and derive actionable information.

TOWARDS A FEEDBACK ECONOMY
Customer dynamics refers to customer-business relationships that describe the ongoing interchange of information and transactions between customers and organizations that goes beyond the transactional nature of the interaction to look at emotions, intent and desires. Retailers can create significant competitive differentiation by understanding the customer's true intent in a way that also supports the business' intents [1, 2, 3, 4].

John Boyd, a colonel and military strategist in the US Air Force, developed the OODA loop (Observe, Orient, Decide and Act), which he used for combat operations. Today's business environment is no different: retailers are battling to get customers into their shops (physical or net-front) and convert their visits into sales. Understanding customer dynamics plays a key role in this effort. The OODA loop explains the crux of the feedback economy.

[Figure 1: The OODA loop (Observe, Orient, Decide, Act), with feed-forward, feedback and implicit guidance and control paths. Source: Reference [5]]


In a feedback economy, there is constant feedback to the system from every phase of its execution. Along with this, the organization should observe the external environment, unfolding circumstances and customer interactions. These inputs are analyzed and action is taken based on these inputs. This cycle of adaptation and optimization makes the organization more efficient and effective on an ongoing basis.

Leveraging this feedback loop is pivotal to a proper understanding of customer needs and wants and of evolving trends. In today's environment, this means acquiring data from heterogeneous sources, viz., in-store transaction history, web analytics, etc. This creates a huge volume of data that has to be analyzed to extract the required actionable insights.

BIG DATA LIFECYCLE: ACQUIRE-ANALYZE-ACTIONIZE

The lifecycle of Big data can be visualized as a three-phased approach resulting in continuous optimization. The first step in moving towards a feedback economy is to acquire data. Here, the retailer should look into macro and micro environment trends and consumer behavior – likes, emotions, etc. Data from electronic channels like blogs, social networking sites and Twitter will give the retailer an enormous amount of data about the consumer. These feeds help the retailer understand consumer dynamics and give more insight into her buying patterns.

The key advantage of plugging into these disparate sources is the sheer volume of information one can gather about the customer – both individually and in aggregate. On the other hand, Big data is materially different from the data retailers are used to handling. Most of it is unstructured (from blogs, Twitter feeds, etc.) and cannot be directly integrated with traditional analytics tools, leading to challenges in how the data can be assimilated with backend decision-making systems and analyzed.

In the assimilate/analyze phase, the retailer must decide which data is of use and define rules for filtering out unwanted data. Filtering should be done with utmost care, as there are cases where indirect inferences are possible. The data available to the retailer after the acquisition phase will be in multiple formats and has to be cleaned and harmonized with the backend platforms.

Cleaned-up data is then mined for actionable insights. Actionize is the phase where the insights gathered from the analyze phase are converted into actionable business decisions by the retailer.

The response, i.e., the business outcome, is fed back to the system so that it can self-tune on an ongoing basis, resulting in a self-adaptive system that leverages Big data and feedback loops to offer business insight more customized than would traditionally be possible. It is imperative to understand that this feedback cycle is an ongoing process and not a one-stop solution for the analytics needs of a retailer.
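As a minimal, self-contained sketch of the acquire-analyze-actionize feedback loop just described (not part of the original article), the snippet below uses toy feeds, invented signal weights and simulated business outcomes; every name in it is a placeholder.

```python
"""Minimal sketch of the acquire-analyze-actionize feedback loop described above.
Feeds, signal weights and observed outcomes are toy stand-ins, not the article's
actual implementation."""

import random

def acquire(feeds):
    # Pull raw records from heterogeneous (toy) sources: POS, web, social.
    return [record for feed in feeds for record in feed()]

def analyze(records, weights):
    # Filter irrelevant signals and score demand per product.
    scores = {}
    for product, signal, value in records:
        if signal in weights:                      # keep only useful signals
            scores[product] = scores.get(product, 0.0) + weights[signal] * value
    return scores

def actionize(scores, threshold=1.0):
    # Convert insights into decisions: promote products with strong signals.
    return {product: "promote" for product, score in scores.items() if score >= threshold}

# Three passes of the loop, with simulated business outcomes fed back
# to tune the signal weights (the self-adaptive part).
feeds = [lambda: [("tablet", "rating", 4.5), ("tablet", "tweet", 0.8),
                  ("jacket", "rating", 2.0)]]
weights = {"rating": 0.2, "tweet": 0.5}
for cycle in range(3):
    decisions = actionize(analyze(acquire(feeds), weights))
    outcomes = {p: random.uniform(-0.1, 0.3) for p in decisions}   # observed uplift
    for product, uplift in outcomes.items():
        weights["rating"] = max(0.0, weights["rating"] + 0.1 * uplift)
    print(f"cycle {cycle}: decisions={decisions}, weights={weights}")
```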

ACQUIRE: FOLLOWING CUSTOMER FOOTPRINTS

To understand the customer, retailers have to leverage every interaction with the customer and tap into every source of customer insight. Traditionally, retailers have relied primarily on in-store customer interactions and the associated transaction data, along with specialized campaigns like opinion polls, to gain better insight into customer dynamics. While this interaction looks limited, a recent incident shows how powerfully customer sales history can be leveraged to gain predictive intelligence on customer needs.


“A father of a teenage girl called a major North American retailer to complain that the retailer had mailed coupons for childcare products addressed to his underage daughter. A few days later, the same father called back and apologized: his daughter was indeed pregnant and he had not been aware of it earlier” [6].

Surprisingly, by all indications, only in-store purchase data was mined by the retailer in this scenario to identify the customer need – in this case, childcare products.

To exploit the power of the next generation of analytics, retailers must plug into data from non-traditional sources like social sites, Twitter feeds, environmental sensor networks, etc., to gain better insight into customer needs. Most major retailers now have multiple channels – brick-and-mortar stores, online stores, mobile apps, etc. Each of these touch points not only acts as a sales channel but can also generate data on customer needs and wants. Coupling this information with other repositories like Facebook posts, Twitter feeds (i.e., sentiment analysis) and web analytics, retailers have the opportunity to track customer footprints both inside and outside the store and to customize their offerings and interactions with the customer.
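As a toy illustration of the sentiment analysis mentioned above (a production system would use a trained model, and the sample posts below are invented), a keyword-based scorer might look like this:

```python
"""Toy keyword-based sentiment scoring of social posts. Illustrative only;
the word lists and sample posts are invented."""

import re

POSITIVE = {"love", "great", "awesome", "recommend"}
NEGATIVE = {"hate", "terrible", "broken", "refund"}

def sentiment(post: str) -> int:
    words = set(re.findall(r"[a-z']+", post.lower()))
    return len(words & POSITIVE) - len(words & NEGATIVE)

posts = [
    "Love the new tablet, great battery life",
    "Screen arrived broken, want a refund",
]
for post in posts:
    score = sentiment(post)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(f"{label:8s} | {post}")
```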

Traditionally, retailers have dealt with voluminous data. Wal-Mart, for example, handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes – the equivalent of 167 times the books in the US Library of Congress [7].

However, the nature of Big data is materially different from traditional transaction data, and this must be considered during data planning. Further, while data is readily available, the legality and compliance aspects of gathering and using it are additional considerations. Integrating information from multiple sources can generate data beyond what the user originally consented to, potentially resulting in liability for the retailer. Given that most of this information is accessible globally, retailers should ensure compliance with local regulations (EU data/privacy protection regulations, HIPAA for US medical data, etc.) wherever they operate.

ANALYZE – INSIGHTS (LEADS) TO INNOVATION

Analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) [9].

The key to acquiring Big data is to handle these dimensions while assimilating the aforementioned external sources of data. To understand how Big data analytics can enrich a typical retail process – allocation planning – let us look at an allocation planning case study for a major North American apparel retailer.

The forecasting engine used for the planning process uses statistical algorithms to determine allocation quantities. Key inputs to the forecasting engine are sales history and the current performance of the store. In addition, adjustments are made based on parameters like promotional events (including markdowns), current stock levels and back orders to determine the inventory that needs to be shipped to a particular store.

While this is fairly in line with the industry standard for allocation forecasting, Big data can enrich the process by including additional parameters that can impact demand. For example, a news piece on a town's go-green initiative or no-plastic day can be taken as an additional adjustment parameter for non-green items in that area. Similarly, a weather forecast of a warm front in an area can automatically trigger a reduction in stocks of warm clothing for stores there.
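A minimal sketch of how such external signals could adjust a baseline allocation forecast is shown below; the baseline quantities, signal names and adjustment factors are invented for illustration and are not taken from the case study.

```python
"""Sketch of external Big data signals adjusting a baseline allocation forecast,
as in the go-green and warm-front examples above. All figures are invented."""

baseline_allocation = {            # units per store from the statistical engine
    ("store_042", "warm_jacket"): 120,
    ("store_042", "plastic_bottle"): 300,
}

# External signals detected for a store's area, e.g., from news and weather feeds.
signals = {"store_042": ["warm_front", "go_green_initiative"]}

# Hypothetical adjustment factors per (signal, product) pair.
adjustments = {
    ("warm_front", "warm_jacket"): 0.6,               # cut warm clothing by 40%
    ("go_green_initiative", "plastic_bottle"): 0.8,   # trim non-green items by 20%
}

adjusted = {}
for (store, product), quantity in baseline_allocation.items():
    factor = 1.0
    for signal in signals.get(store, []):
        factor *= adjustments.get((signal, product), 1.0)
    adjusted[(store, product)] = round(quantity * factor)

print(adjusted)   # {('store_042', 'warm_jacket'): 72, ('store_042', 'plastic_bottle'): 240}
```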

A high-level logical view of a Big data implementation is described below to further the understanding of how Big data can be assimilated with traditional data sources. The data feeds for the implementation come from structured sources like forums, feedback forms and rating sites, unstructured sources like the social web, and semi-structured data from emails, Word documents, etc. Compared to traditional systems this is a veritable data feast, but it is important to diet on such data and use only those feeds that create optimum value. This is done through a synergy of business knowledge and processes specific to the retailer and the industry segment it operates in, together with a set of tools specialized in analyzing huge volumes of data at rapid speed. Once the data is massaged for downstream systems, Big analytics tools are used to analyze it. Based on business needs, real-time or offline data processing/analytics can be used; in real-life scenarios, both approaches are used depending on the situation and need.

Proper analysis needs data not just from consumer insight sources but also from transactional data history and consumer profiles.

ACTIONIZE – BIG DATA TO BIG IDEAS

This is the key part of the Big data cycle: even the best data is no substitute for timely action. The technology and functional stack will help the retailer gain proper insight into key customer purchase intent – what, where, why and at what price. Knowing this, the retailer can customize the 4Ps (product, pricing, promotions and place) to create enough mindshare from the customer's perspective that the sale becomes inevitable [10].

For example, a cursory look at a random product category (tablets) on an online retailer's site shows the strong correlation between customer ratings and sales: 4 out of 6 of the best user-rated products are in the top five in sales – a roughly 60% correlation even when other parameters like brand, price and release date are not taken into consideration [Fig. 2] [12]. A retailer that knows the customer ratings can offer promotions that tip the balance between a sale and a lost opportunity. While this example may not be the rule, the key to analyzing and actionizing the data is to correlate user feedback data with the concomitant sales.

Figure 2: Correlation between Customer Ratings and Sales – Amazon's top five Best Sellers in Tablet PCs shown alongside its top five Most Wished For in Tablet PCs (largely Kindle Fire HD and Samsung Galaxy Tab 2 models). Source: Reference [12]
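A toy version of this overlap analysis is sketched below; the product names are abbreviated placeholders, not the exact lists shown in the figure.

```python
"""Toy illustration of the ratings-versus-sales overlap analysis described above.
Product names are abbreviated placeholders, not the figure's exact data."""

best_sellers = ["Kindle Fire HD 7", "Kindle Fire HD 8.9 16GB", "Galaxy Tab 2 7",
                "Galaxy Tab 2 10.1", "Kindle Fire HD 8.9 32GB"]
top_rated    = ["Kindle Fire HD 8.9 LTE", "Kindle Fire HD 8.9 32GB", "Kindle Fire 7",
                "Kindle Fire HD 7", "Galaxy Tab 2 7"]

overlap = set(best_sellers) & set(top_rated)
print(f"{len(overlap)} of the {len(top_rated)} top-rated products also appear "
      f"among the top {len(best_sellers)} sellers: {sorted(overlap)}")
```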

BIG DATA OPPORTUNITIES

The implications of Big data analytics for the major retailing processes will be along the following areas.

■ Identifying the Product Mix: The assortment and allocation will need to take into consideration the evolving user trends identified from Big data analytics to ensure the offering matches the market needs. Allocation planning especially has to be tactical with shorter lead times.

■ Promotions and Pricing: Retailers have to move from generic pricing strategies to customized, user-specific ones.

■ Communication with Customer: Advertising will move from mass media to personalized communication, and from one-way to two-way communication. Retailers will gain more from viral marketing [13] than from traditional advertising channels.

■ Compliance: Meeting governmental regulations and compliance requirements is mandatory to avoid liability, as co-mingling data from disparate sources can generate personal data beyond the scope of the user's original consent. While data is available globally, its use has to comply with the local law of the land and be done keeping the customer's sensibilities in mind.

■ People, Process and Organizational Dynamics: The move to a feedback economy requires a different organizational mindset and different processes. Decision making will need to be more bottom-up and collaborative, and retailers need to engage customers to ensure the feedback loop is in place. Further, Big data, being cross-functional, needs active participation and coordination between various departments in the organization; hence managing organizational dynamics is a key consideration.

■ Better Customer Experience: Organizations can improve the overall customer experience by providing update services and thereby eliminating surprises. For instance, Big data solutions can be used to proactively inform customers of expected shipment delays based on traffic data, climate and other external factors.

BIG DATA ADOPTION STRATEGY

Presented below is a perspective on how to adopt a Big data solution within the enterprise.


Define Requirements, Scope and Mandate:

Define the mandate and objective in terms of what is required from the Big data solution. A guiding factor in identifying the requirements would be the prioritized list of business strategies. As part of initiation, it is also important to identify the goals and KPIs that justify the usage of Big data.

Key Player: Business

Choosing the Right Data Sources:

Once the requirements and scope are defined, the IT department has to identify the various feeds that will fetch the relevant data. These feeds may be structured, semi-structured or unstructured, and the sources could be internal or external. For internal sources, policies and processes should be defined to enable a frictionless flow of data.

Key Players: IT and Business

Choosing the Required Tools and Technologies:

After deciding upon the sources of data that will feed the system, the right tools and technologies should be identified and aligned with business needs. Key areas are capturing the data, tools and rules to clean the data, tools for real-time and offline analytics, and storage and other infrastructure needs.

Key Player: IT

Creating Inferences from Insights:

One of the key factors in a successful Big data implementation is having a pool of talented data analysts who can create proper inferences from the insights and facilitate the building and definition of new analytical models. These models help in probing the data and understanding the insights.

Key Player: Data Analyst

Strategy to Actionize the Insights:

The business should create processes that take these inferences as inputs to decision making. Stakeholders in decision making should be identified, and actionable inferences have to be communicated to them at the right time. Speed is critical to the success of Big data.

Key Player: Business

Measuring the Business Benefits:

The success of the Big data initiative depends on the value it creates for the organization and its decision-making body. It should also be noted that, unlike other initiatives, Big data initiatives are usually a continuous process in search of the best results, and organizations should be attuned to this to derive the best outcomes. However, it is important that goals are set and measured to track the initiative and ensure it is moving in the right direction.

Key Players: IT and Business

CONCLUSION

The move to a feedback economy presents an inevitable paradigm shift for the retail industry, and Big data as the enabling technology will play a key role in this transformation. As ever, business needs will continue to drive technology processes and solutions. However, given the criticality of Big data, organizations will need to treat it as an existential strategy and make the right investments to ensure they can ride the wave.

REFERENCES
1. Customer dynamics. Available at http://en.wikipedia.org/wiki/Customer_dynamics.
2. Davenport, T. and Harris, J. G. (2007), Competing on Analytics, Harvard Business School Publishing.
3. DeBorde, M. (2006), Do Your Organizational Dynamics Determine Your Operational Success?, The O and P Edge.
4. Lemon, K. N., White, T. B. and Winer, R. S., Dynamic Customer Relationship Management: Incorporating Future Considerations into the Service Retention Decision, Journal of Marketing.
5. Boyd, J. (September 3, 1976). OODA loop, in Destruction and Creation. Available at http://en.wikipedia.org/wiki/OODA_loop.
6. Doyne, S. (2012), Should Companies Collect Information About You?, NY Times. Available at http://learning.blogs.nytimes.com/2012/02/21/should-companies-collect-information-about-you/.
7. Data, data everywhere (2010), The Economist. Available at http://www.economist.com/node/15557443.
8. IDC Digital Universe (2011). Available at http://chucksblog.emc.com/chucks_blog/2011/06/2011-idc-digital-universe-study-big-data-is-here-now-what.html.
9. Gartner Says Solving 'Big data' Challenge Involves More Than Just Managing Volumes of Data (2011). Available at http://www.gartner.com/it/page.jsp?id=1731916.
10. Gens, F. (2012), IDC Predictions 2012: Competing for 2020. Available at http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf.
11. Bhasin, H., 4Ps of marketing. Available at http://www.marketing91.com/marketing-mix-4-ps-marketing/.
12. Amazon US site / tablets category (2012). Available at http://www.amazon.com/gp/top-rated/electronics/3063224011/ref=zg_bs_tab_t_tr?pf_rd_p=1374969722&pf_rd_s=right-8&pf_rd_t=2101&pf_rd_i=list&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=14YWR6HBVR6XAS7WD2GG.
13. Godin, S. (2008), Viral marketing. Available at http://sethgodin.typepad.com/seths_blog/2008/12/what-is-viral-m.html.
14. Wang, R. (2012), Monday's Musings: Beyond The Three V's of Big data – Viscosity and Virality. Available at http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/.


Harness Big Data Value and Empower Customer Experience Transformation

By Zhong Li, PhD

In today's hyper-competitive experience economy, communication service providers (CSPs) recognize that product and price alone will not differentiate their business and brand. Since brand loyalty, retention and long-term profitability are now so closely aligned with customer experience, the ability to understand customers, spot changes in their behavior and adapt quickly to new consumer needs is fundamental to the success of the consumer-driven communication service industry.

Increasingly sophisticated digital consumers demand more personalized services through the channel of their choice. In fact, the internet, mobile and, particularly, the rise of social media in the past five years have empowered consumers more than ever before. There is a growing challenge for CSPs contending with an increasingly scattered relationship with customers, who can now choose from multiple channels to conduct business interactions. Recent industry research indicates that some 90% of today's consumers in the US and Western Europe interact across multiple channels, representing a moving target that makes achieving a full view of the customer that much more challenging.

To compound this trend, always-on digital customers continuously create more data of various types, from many more touch points with more interaction options. CSPs encounter the "Big data phenomenon" as they accumulate significant amounts of customer-related information such as purchase patterns and activities on the website, from mobile, social media or interactions with the network and call centre.

This Big data phenomenon presents CSPs with challenges along the 3V dimensions (Fig. 1), viz.:



■ Large Volume: Recent industry research shows that the amount of consumer transaction and interaction data a CSP has to manage has doubled in the past three years, and its growth is accelerating so that it will double again in the next two years, much of it coming from new sources including blogs, social media, internet search and networks [7].

■ Broad Variety: Data is created in a broad variety of types, forms and formats, and from multiple channels such as online, the call centre, stores and social media platforms including Facebook and Twitter. It comprises structured data from transactions, semi-structured data from call records and unstructured, multi-media data from social interactions.

■ Rapidly Changing Velocity: Always-on digital consumers change the dynamics of data at the speed of light, and they equally demand a fast response from CSPs to satisfy their personalized needs in real time.

CSPs of all sizes have learned the hard way that it is very difficult to take full advantage of all the customer interactions in Big data if they do not know what their customers are demanding or what their relative value to the business is. Even some CSPs that do segment their customers with the assistance of a customer relationship management (CRM) system struggle to take complete advantage of that segmentation in developing a real-time value strategy. Across the hyper-sophisticated interaction patterns of the customer journey spanning marketing, research, order, service and retention, Big data shines a light on treasured customer intelligence along the 4I dimensions, viz., interest, insight, interaction and intelligence.

■ Interest and Insight: Customers offer their attention out of interest and share their insights. They visit a website, make a call, visit a retail store or share views on social media because they want something from the CSP at that moment – information about a product or help with a problem. These interactions present an opportunity for the CSP to communicate with a customer who is engaged by choice and ready to share information regarding her personalized wants and needs.

■ Interaction and Intelligence: It is typically crucial for CSPs to target offerings to particular customer segments based on the intelligence drawn from customer data. The success of these real-time interactions – whether through online, mobile, social media or other channels – depends to a great extent on the CSP's understanding of the customer's wants and needs at the time of the interaction.

[Figure 1: Big Data in 3Vs (volume, variety, velocity) is accumulated from multiple channels – web, mobile, store, call centre and social – and converted into value. Source: Infosys Research]


Therefore, alongside managing and securing Big data along the 3V dimensions, CSPs face a fundamental challenge: how to explore and harness Big data Value (BDV).

A HOLISTIC 5C PROCESS TO HARNESS BDV

Rising to these challenges and leveraging the opportunity in Big data, CSPs need to harness BDV with predictive models that provide deeper insight into customer intelligence – profiles, behaviours and preferences hidden in Big data of vast volume and broad variety – and deliver a superior personalized experience with fast velocity, in real time, throughout the entire customer journey.

In the past decade, most CSPs have invested significant effort in implementing complex CRM systems to manage customer experience. While those CRM systems bring efficiency in helping CSPs deliver on "what" to do in managing historical transactions, they lack the crucial capability of defining "how" to act in time, with the most relevant interaction, to maximize value for the customer.

CSPs now need to look beyond what CRM has to offer and dive deeper into "how" to do things right for the customer: capturing the customer's subjective sentiment in a particular interaction, deriving insight that predicts what the customer demands from the CSP, and triggering proactive action to satisfy those needs – which is more likely to lead to customer delight and, ultimately, revenue.

To do so, CSPs need to execute a holistic 5C process, i.e., collect, converge, correlate, collaborate and control, in extracting BDV (Fig. 2).

The holistic 5C process will help CSPs aggregate the whole interaction with a customer across time and channels, supported by a large volume and broad variety of data covering promotions, products, orders and services, and align interactions with the customer's preferences. The context of the customer's relationship with the CSP, and the actual and potential value she derives from it, in particular determine the likelihood that she will take particular actions based on real-time intelligence. Big data can help the CSP correlate the customer's needs with product, promotion, order and service, and deliver the right offer at the right time, in the appropriate context, that she is most likely to respond to.

AN OVERARCHING 3M FRAMEWORK TO EXTRACT BDV

To execute a holistic 5C process for Big data, CSPs need to implement an overarching framework that integrates the various pools of customer-related data residing in the CSP's enterprise systems, creates an actionable customer profile, delivers insight based on that profile during real-time customer interaction events, and effectively matches sales and service resources to take proactive actions, so as to monetize the ultimate value on the fly.

[Figure 2: Harness BDV with a holistic 5C process – collect, converge, correlate, collaborate and control – applied across customer, product, promotion, order and service data. Source: Infosys Research]


The overarching framework needs to incorporate three modules (3M), i.e., Model, Monitor and Mobilize:

■ Model Profile: This module models the customer profile based on all transactions, helping CSPs gain insight at the individual-customer level. Such a profile requires not only integration of all customer-facing and enterprise systems, but also integration of all customer interactions – email, mobile, online and social – held in enterprise systems such as OMS, CMS, IMS and ERP in parallel with the CRM paradigm, modelling an actionable customer profile so that resources can be deployed effectively for a distinct customer experience.

■ Monitor Pattern: This module monitors customer interaction events from multiple touch points in real time, dynamically senses and matches patterns of events against defined policies and models, and makes suitable recommendations and offers at the right time through an appropriate channel. It enables CSPs to respond quickly to changes in the marketplace – a seasonal change in demand, for example – and bundle offerings that will appeal to a particular customer, across a particular channel, at a particular time.

■ Mobilize Process: This module mobilizes a set of automations that lets customers enjoy a personalized, engaging journey in real time, spanning outbound and inbound communications, sales, orders, service and help intervention, and fulfils the customer's next immediate demand.

The 3M framework needs to be based on an event-driven architecture (EDA) incorporating an Enterprise Service Bus (ESB) and Business Process Management (BPM), and should be application- and technology-agnostic. It needs to interact with multiple channels using events; match patterns in sets of events against pre-defined policies, rules and analytical models; and deliver a set of automations to fulfil a personalized experience spanning the complete customer lifecycle.
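As a toy sketch of the event-pattern matching this implies (not the framework's actual implementation – the event types, policies and actions below are invented), a rule that maps observed events to a next-best-action could look like this:

```python
"""Toy event-pattern matcher: incoming channel events are matched against simple
policies to trigger a next-best-action. Event fields, policies and actions are
invented examples, not part of the article."""

from collections import defaultdict

policies = [
    # (set of trigger event types, recommended next-best-action)
    ({"dropped_call", "bill_shock"}, "offer_service_credit"),
    ({"browsed_upgrade", "contract_expiring"}, "offer_handset_upgrade"),
]

def next_best_action(customer_events):
    """Return the first action whose trigger pattern is fully observed."""
    observed = {event["type"] for event in customer_events}
    for pattern, action in policies:
        if pattern <= observed:          # all trigger events seen
            return action
    return "no_action"

events_by_customer = defaultdict(list)
stream = [
    {"customer": "C1", "type": "dropped_call", "channel": "network"},
    {"customer": "C1", "type": "bill_shock", "channel": "billing"},
    {"customer": "C2", "type": "browsed_upgrade", "channel": "web"},
]
for event in stream:
    events_by_customer[event["customer"]].append(event)
    print(event["customer"], "->", next_best_action(events_by_customer[event["customer"]]))
```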

Furthermore, the 3M framework needs to be supported with key high-level functional components, which include:

■ Customer Intelligence from Big data: A typical implementation of customer intelligence from Big data combines a data warehouse with real-time customer intelligence analytics. It requires aggregation of customer and product data from the CSP's various data sources in BSS/OSS, leveraging the CSP's existing investments in data models, workflows, decision tables, user interfaces, etc. It also integrates with the key modules in the CSP's enterprise landscape, covering:

■ Customer Management: A complete customer relationship management solution combines a 360-degree view of the customer with intelligent guidance and seamless back-office integration to increase first-contact resolution and operational efficiency.

■ Offer Management: CSP-specific specialization and re-use capabilities that define new services, products, bundles, fulfilment processes and dependencies, and rapidly capitalize on new market opportunities to improve customer experience.

■ Order Management: Configurable best practices for creating and maintaining a holistic order journey, which is critical to the success of such product-intensive functions as account opening, quote generation, ordering, contract generation, product fulfilment and service delivery.

■ Service Management: Case-based work automation and a complete view of each case enable effective management of every case throughout its lifecycle.

■ Event Driven Process Automation: A dynamic process automation engine empowered with EDA leverages the context of the interaction to orchestrate the flow of activities, guiding customer service representatives (CSRs) and self-service customers through every step in their inbound and outbound interactions, in particular for Campaign Management and Retention Management.

■ Campaign Management: Outbound interactions are typically used to target products and services to particular customer segments, based on analysis of customer data, through appropriate channels. This component uncovers relevant, timely and actionable consumer and network insights to enable intelligently driven marketing campaigns that develop, define and refine marketing messages, target customers with a more effective plan, and meet customers at the touch points of their choosing through optimized display and search results, while generating demand via automated email creation, delivery and results tracking.

■ Retention Management: Customers offer their attention, either intrusively or non-intrusively, to look for the products and services that meet their needs through the channel of their choice. This component dynamically captures consumer data from highly active and relevant outlets such as social media, websites and other social sources, enabling CSPs to respond quickly to customer needs and proactively deliver relevant offers for upgrades and product bundles that take into account each customer's personal preferences.

■ Experience Personalization: This component provides the customer with a personalized, relevant experience, enabled by business process automation that connects people, processes and systems in real time and eliminates product, process and channel silos. It helps CSPs extend predictive targeting beyond basic cross-sells to automate more of their cross-channel strategies and gain valuable insights from hidden consumption and interaction patterns.


Overall, the 3M framework will empower a BDV solution for the CSP to execute real-time decisions that align individual needs with business objectives and dynamically fulfil the next best action or offer, increasing the value of each personalized interaction.

BDV IN ACTION – CUSTOMER EXPERIENCE OPTIMIZATION

By implementing the proposed BDV solution, CSPs can optimize the customer experience, delivering the right interaction with each customer at the right time so as to build strong relationships, reduce churn and increase customer value to the business.

■ From the Customer Experience Perspective: The solution provides the CSP with real-time, end-to-end visibility into all the customer interaction events taking place across multiple channels; by correlating and analyzing these events using a set of business rules, it automatically takes proactive actions that ultimately lead to customer experience optimization. It helps CSPs turn their multi-channel contacts with customers into cohesive, integrated interaction patterns, allowing them to better segment their customers and ultimately take full advantage of that segmentation, delivering personalized experiences that are dynamically tailored to each customer while dramatically improving interaction effectiveness and efficiency.

■ From the CSP's Perspective: The solution helps CSPs quickly weed out underperforming campaigns and learn more about their customers and their needs. From the retail store to the contact centre to the Web to social media, it helps CSPs deliver a new standard of branded, consistent customer experiences that build deeper, more profitable and lasting relationships. It enables CSPs to maximize productivity by handling customer interactions as fast as possible in the most profitable channel.

At every point in the customer lifecycle, from marketing campaigns, offers and orders to servicing and retention efforts, BDV helps inform interactions with the customer's preferences, the context of her relationship with the business, and her actual and potential value, enabling CSPs to focus on creating personalized experiences that balance the customer's needs with business value.

■ Campaign Management: BDV delivers campaigns focused on the customer, with predictive modelling and cost-effective campaign automation that consistently distinguish the brand and support personalized communications with prospects and customers.

■ Offer Management: BDV dynamically generates offers that account for such factors as the current interaction with the customer, the individual’s total value across product lines, past interactions, and likelihood of defecting. It helps deliver optimal value and increases the effectiveness of propositions with next-best-action recommendations tailored to the individual customer.

■ Order Management: BDV enables unified process automation applicable to multiple product lines, with agile and flexible workflow, rules and process orchestration that account for individual needs in product pricing, configuration, processing, payment scheduling and delivery.

■ Service Management: BDV empowers customer service representatives to act based on the unique needs and behaviours of each customer using real-time intelligence combined with holistic customer content and context.

■ Retention Management: BDV helps CSPs retain more high-value customers with targeted next-best-action dialogues. It consistently turns customer interactions into sales opportunities by automatically prompting customer service representatives to proactively deliver relevant offers that satisfy each customer's unique needs.

CONCLUSION

Today's increasingly sophisticated digital consumers expect CSPs to deliver product, service and interaction experiences designed "just for me at this moment." To take on this challenge, CSPs need to deliver customer experience optimization powered by BDV in real time.

By implementing an overarching 3M BDV framework to execute a holistic 5C process, new products can be brought to market with greater velocity, with the ability to easily adapt common services to accommodate unique customer and channel needs.

Suffice it to say that BDV will enable CSPs to deliver a customer-focused experience that matches responses to specific individual demands; provide real-time intelligent guidance that streamlines complex interactions; and automate interactions end-to-end. The result is an optimized customer experience that helps CSPs substantially increase customer satisfaction, retention and profitability, and consequently empowers CSPs to evolve into the experience-centric Tomorrow's Enterprise.

REFERENCES
1. IBM Big data solutions deliver insight and relevance for digital media – Solution Brief, June 2012. Available at www-05.ibm.com/fr/events/netezzaDM.../Solutions_Big_Data.pdf.
2. Oracle Big data Premier – Presentation (May 2012). Available at http://premiere.digitalmedianet.com/articles/viewarticle.jsp?id=1962030.
3. SAP HANA™ for Next-Generation Business Applications and Real-Time Analytics (July 2012). Available at http://www.saphana.com/docs/DOC-1507.
4. SAS® High-Performance Analytics (June 2012). Available at http://www.sas.com/reg/gen/uk/hpa?gclid=CJKpvvCJiLQCFbMbtAodpj4Aaw.
5. Transform the Customer Experience with Pega-CRM (2012). Available at http://www.pega.com/sites/default/files/private/Transform-Customer-Experience-with-Pega-CRM-WP-Apr2012.pdf.
6. The Forrester Wave™: Enterprise Hadoop Solutions for Big data, Feb 2012. Available at http://center.uoregon.edu/AIM/uploads/INFOTEC2012/HANDOUTS/KEY_2413506/Infotec2012BigDataPresentationFinal.pdf.
7. Shah, S. (2012), Top 5 Reasons Communications Service Providers Need Operational Intelligence. Available at http://blog.vitria.com/bid/88402/Top-5-Reasons-Communications-Service-Providers-Need-Operational-Intelligence.
8. Connolly, S. and Wooledge, S. (2012), Harnessing the Value of Big data Analytics. Available at http://www.asterdata.com/wc-0217-harnessing-value-bigdata/.


Liquidity Risk Management and Big Data: A New Challenge for Banks

By Abhishek Kumar Sinha

During the 2008 financial crisis, banks faced an enormous challenge in managing liquidity and remaining solvent. As many financial institutions failed, those who survived the crisis fully understood the importance of liquidity risk management. Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). Banks must have reliable data on daily positions and other liquidity measures, which have to be monitored continuously. At signs of stress, like changes in the liquidity of various asset classes and unfavorable market conditions, banks need to react to these changes in order to remain credible in the market. In banking, liquidity risk and reputation are so heavily linked that even a single liquidity event can lead to catastrophic funding problems for a bank.

MISMANAGEMENT OF LIQUIDITY RISK: SOME EXAMPLES OF FAILURES

Northern Rock was a star-performing UK bank until the 2007 crisis struck. Its funding came mostly from wholesale and capital market sources. Hence, in the 2008 crisis, when these funding avenues dried up across the globe, it was unable to fund its operations. During the crisis, the bank's stock fell 32%, accompanied by a depositors' run on the bank. The central bank had to intervene and support the bank in the form of deposit protection and money market operations; later the Government took the ultimate step of nationalizing the bank.

Lehman Brothers had $600 billion in assets before its eventual collapse. The bank's stress testing omitted its riskiest asset – the commercial real estate portfolio – which in turn led to misleading stress test results. The bank's liquidity was very low compared to its balance sheet size and the risks it had taken, and it had used deposits with clearing banks as assets in its liquidity buffer, which was not in compliance with regulatory guidelines. The bank lost 73% of its share price during the first half of 2008 and filed for bankruptcy in September 2008.



The 2008 financial crisis showed that the current liquidity risk management (LRM) approach is highly unreliable in a changing and difficult macroeconomic environment. The need of the hour is to improve operational liquidity management on a priority basis.

THE CURRENT LRM APPROACH AND ITS PAIN POINTS

Compliance/Regulation

Across global regulators, LRM principles have become stricter and more complex in nature. The regulatory focus is mainly on areas like risk governance, measurement, monitoring and disclosure. Hence, the biggest challenge for financial institutions worldwide is to react to these regulatory measures in an appropriate and timely manner. Current systems are not equipped to handle these changes. For example, LRM protocols for stress testing and contingency funding planning (CFP) focus more on the inputs to scenario analysis and on new stress testing scenarios; these complex inputs need to be selected very carefully, and hence pose a great challenge for the financial institution.

Siloed Approach to Data Management

Many banks use a spreadsheet-based LRM approach that gets data from different sources which are neither uniform nor comparable. This leads to a great amount of risk in manual processes and to data quality issues. In such a scenario, it becomes impossible to collate an enterprise-wide liquidity position and the risk remains undetected.

Lack of Robust LRM Infrastructure

There is a clear lack of a robust system that can incorporate real-time data and generate the necessary actions in time. The liquidity parameters in question include changing funding costs, counterparty risks, balance sheet obligations and the quality of liquidity in capital markets.

THE NEED FOR A READY-MADE SOLUTION

In a recent SWIFT survey, 91% of respondents indicated that there is a lack of ready-made liquidity risk analytics and business intelligence applications to complement risk integration processes. Since regulation around the globe, in the form of Basel III, Solvency II, CRD IV, etc., is taking shape, there is an opportunity to standardize the liquidity reporting process. A solution that can do this would be of great help to banks, as it would save them both effort and time and increase the efficiency of reporting. Banks could then focus on the more complex aspects, like inputs to the stress testing process, and on the business and strategy needed to control liquidity risk. Even though banks may differ in their approaches to managing liquidity, these differences can be incorporated into the solution as per requirements.

CHALLENGES/SCOPE OF REQUIREMENTS FOR LRM

The scope of requirements for LRM ranges from concentration analysis of liquidity exposures, calculation of the average daily peak of liquidity usage, historical and future views of liquidity flows (both contractual and behavioral in nature), collateral management, stress testing and scenario analysis, generation of regulatory reports, liquidity gaps across buckets, contingency fund planning, net interest income analysis and fund transfer pricing, to capital allocation. All these liquidity measures are monitored, and alerts are generated in case thresholds are breached.


Concentration analysis of liquidity exposures shows whether the assets or liabilities of the institution are dependent on a certain customer or on a product such as asset-backed or mortgage-backed securities. It also examines whether the concentration is region-wise, country-wise, or by any other parameter that can be used to detect a concentration risk in the overall funding and liquidity situation.

Calculation of the average daily peak of liquidity usage gives a fair idea of the maximum intraday liquidity demand, so the firm can take the necessary steps to manage liquidity in an ideal way. The idea is to detect patterns and, in high, low or medium liquidity scenarios, utilize the available liquidity buffer in the most optimized way.

Collateral management is very important, as the need for collateral and its value after applying the required haircuts have to be monitored on a daily basis. In case of unfavorable margin calls, the amount of collateral needs to be adjusted to avoid default on various outstanding positions.

Stress testing and scenario analysis are like a self-evaluation for banks, in which they need to see how bad things can get in case of high-stress events. Internal stress testing is very important for estimating the amount of loss in case of unfavorable events. For systemically important institutions, regulators have devised stress scenarios based on past crisis events; these scenarios need to be given as inputs to the stress tests and the results reported to the regulators. Proper stress testing ensures that the institution is aware of what risk it is taking and what the consequences can be.

Regulatory liquidity reports cover Basel III liquidity ratios like the liquidity coverage ratio (LCR) and net stable funding ratio (NSFR), FSA and Fed 4G guidelines, early warning indicators, funding concentration, liquid assets/collateral, and stress testing analysis. Timely completion of these reports in the prescribed format is important for financial institutions to remain compliant with the norms.
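For reference, the two Basel III ratios named above have the following standard definitions in simplified form (the 100% floors are the fully phased-in minimums; the detailed haircuts and run-off factors are specified in the Basel III rules text):

\[
\text{LCR} = \frac{\text{Stock of high-quality liquid assets (HQLA)}}{\text{Total net cash outflows over the next 30 calendar days}} \ge 100\%
\]
\[
\text{NSFR} = \frac{\text{Available stable funding (ASF)}}{\text{Required stable funding (RSF)}} \ge 100\%
\]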

Net interest income analysis (NIIA), fund transfer pricing (FTP) and capital allocation are performance indicators for an institution that raises money from deposits or other avenues and lends it to customers, or invests it to achieve a rate of return. The NII is the difference between the interest earned by lending or investing the funds and the cost of those funds. The implementation of FTP links liquidity risk and market risk to the performance management of the business units. NII analysis helps in predicting the future state of the bank's P&L statement and balance sheet.
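A toy numerical illustration of the NII and FTP mechanics just described follows; the balances, rates and the single transfer rate are invented, and a real FTP curve would be far more granular (by tenor, currency and liquidity premium).

```python
"""Toy illustration of NII and FTP mechanics. Balances and rates are invented."""

loan_balance, loan_rate = 1_000_000, 0.065        # asset side (lending unit)
deposit_balance, deposit_rate = 1_000_000, 0.020  # liability side (deposit unit)
ftp_rate = 0.040                                  # internal transfer rate for this tenor

nii = loan_balance * loan_rate - deposit_balance * deposit_rate
lending_margin = loan_balance * (loan_rate - ftp_rate)        # business/credit spread
deposit_margin = deposit_balance * (ftp_rate - deposit_rate)  # funding spread

print(f"NII            : {nii:,.0f}")              # 45,000
print(f"Lending margin : {lending_margin:,.0f}")   # 25,000
print(f"Deposit margin : {deposit_margin:,.0f}")   # 20,000
# Lending margin + deposit margin = NII; the FTP rate allocates it between units.
```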

Contingency fund planning consists of wholesale, retail and other funding reports covering both secured and unsecured funds, so that if these funding avenues dry up banks can look for alternatives. It states the reserve funding avenues, like the use of credit lines, repo transactions, unsecured loans, etc., that can be accessed in a timely manner and at a reasonable cost in a liquidity crisis situation.

Intra-group borrowing and lending reports show the liquidity position across group companies. Derivatives reports related to market value, collateral and cash flows are very important to efficient derivatives portfolio management. Bucket-wise and cumulative liquidity gaps under business-as-usual and stress scenarios give a fair idea of varying liquidity across time buckets. Both contractual and behavioral cash flows are tracked to get the final inflow and outflow picture. This is done over different time horizons, from 30 days to 3 years, to get both a short-term and a long-term view of liquidity. Historic cash flows are tracked as they help in modeling future behavioral cash flows, and historical assumptions plus current market scenarios are very important in the dynamic analysis of behavioral cash flows. Other important reports relate to the available pool of unencumbered assets and to non-marketable assets.
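A minimal sketch of the bucket-wise and cumulative liquidity gap calculation described above is given below; the buckets and cash-flow figures are invented for illustration.

```python
"""Sketch of a bucket-wise and cumulative liquidity gap calculation.
Buckets and cash flows are invented illustrative figures."""

from itertools import accumulate

buckets  = ["0-30d", "31-90d", "91-180d", "181d-1y", "1y-3y"]
inflows  = [120, 90, 150, 200, 400]   # expected contractual + behavioral inflows
outflows = [150, 80, 130, 260, 300]   # expected contractual + behavioral outflows

gaps = [i - o for i, o in zip(inflows, outflows)]
cumulative = list(accumulate(gaps))

print(f"{'Bucket':10s} {'Gap':>6s} {'Cumulative':>11s}")
for bucket, gap, cum in zip(buckets, gaps, cumulative):
    flag = "  (funding needed)" if cum < 0 else ""
    print(f"{bucket:10s} {gap:6d} {cum:11d}{flag}")
```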

All the scoped requirements can only be satisfied when the firm has a framework in place to take the necessary decisions related to liquidity risk. Hence, we next look at an LRM framework as well as a data governance framework for managing liquidity risk data.

LRM FRAMEWORK

A separate group for LRM, constituted of members from the asset liability committee, the risk committee and top management, needs to be formed. This group must function independently of the other groups in the firm and must have the autonomy to take liquidity decisions. Strategic-level planning helps in defining the liquidity risk policy clearly, in relation to the overall business strategy of the firm.

The risk appetite of the firm needs to be stated in measurable terms and communicated to all stakeholders in the firm. Liquidity risks across the business need to be identified, and the key risk indicators and metrics decided. Risk indicators are to be monitored on a regular basis so that, in the case of an upcoming stress scenario, preemptive steps can be taken. Monitoring and reporting are to be done for internal control as well as for regulatory compliance.

Finally, there has to be a periodic analysis of the whole system in order to identify possible gaps in it; the review has to take place at least once a year, and more frequently in case of extreme market scenarios.

[Figure 1: Iterative framework for effective liquidity risk management – corporate governance, strategic level planning, identify and assess liquidity risk, monitor and report, periodic analysis for possible gaps, and take corrective measures. Source: Infosys Research]

To satisfy the scoped-out requirements, data from various sources is used to form a liquidity data warehouse and data mart, which act as inputs to the analytical engines.

The engines contain business rules and logic based on which the key liquidity parameters are calculated. All the analysis is presented in the form of reports and dashboards for regulatory compliance, internal risk management and decision-making purposes.

[Figure 2: LRM data governance framework for analytics and BI with Big data capabilities – data sources (market data, reference data, external data, general ledger, and systems of record for collateral, deposits, loans, securities and product/LOB) feed, via ETL and a Big data staging/data-quality layer, a data store, data warehouse and data mart; analytical engines (asset liability management, fund transfer pricing, liquidity risk and capital calculation) then drive regulatory reports (Basel-related ratios, NSFR and LCR, Fed 4G, FSA reports, stress testing, regulatory capital allocation) and internal liquidity reports (net interest income analysis, ALM reports, FTP and liquidity costs, funding concentration, liquid assets, capital allocation and planning, internal stress tests, key risk indicators). Source: Infosys Research]

Some Uses of Big data Applications in LRM

1. Staging Area Creation for the Data Warehouse: A Big data application can store huge volumes of data and perform some analysis on it, along with aggregating data for further analysis. Due to its fast processing of large amounts of data, it can be used as a loader to load data into the data warehouse and to facilitate the extract-transform-load (ETL) processes.

2. Preliminary Data Analysis: Data can be moved in from various sources and a visual analytics tool then used to create a picture of what data is available and how it can be used.

3. Making Full Enterprise Data Available for High-Performance Analytics: Analytics at large firms was often limited to a sample set of records on which the analytical engines would run and produce results; because a Big data application provides distributed parallel processing capacity, this limitation on the number of records no longer exists. Billions of records can now be processed at remarkable speed, as the sketch below illustrates.
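The hedged sketch below uses PySpark as one possible distributed engine (the article does not prescribe a specific tool); the input path and column names are assumptions for illustration.

```python
"""Hedged sketch of distributed processing of liquidity cash flows with PySpark.
The file path and column names are assumptions, not from the article."""

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("liquidity-gap").getOrCreate()

# Assume cash-flow records with columns: as_of_date, time_bucket, direction, amount.
flows = spark.read.parquet("hdfs:///lrm/cashflows/")   # hypothetical location

gaps = (
    flows
    .withColumn("signed", F.when(F.col("direction") == "inflow", F.col("amount"))
                           .otherwise(-F.col("amount")))
    .groupBy("as_of_date", "time_bucket")
    .agg(F.sum("signed").alias("net_gap"))
)

gaps.write.mode("overwrite").parquet("hdfs:///lrm/gaps/")  # feeds reports and dashboards
```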

HOW BIG DATA CAN HELP IN LRM ANALYTICS AND BI

■ Operational efficiency and swiftness: high-performance analytics can help achieve faster decision making because all the required analysis is obtained much sooner.

■ Liquidity risk is a killer in today's financial world and is the most difficult to track, as large banks have diverse instruments and a large number of scenarios need to be analyzed – changes in interest rates, exchange rates, and liquidity and depth in markets worldwide – and for such dynamic analysis Big data analytics is a must.

■ Stress testing and scenario analysis both require intensive computing, as a lot of data is involved; faster scenario analysis means quicker action in case of stressed market conditions. With Big data capabilities, scenarios that would otherwise take hours to run can now be run in minutes, aiding quick decision making and action.

■ Efficient product pricing can be achieved by implementing a real-time fund transfer pricing system and profitability calculations. This ensures the best possible pricing of market risks, along with adjustments like a liquidity premium, across the business units.

CONCLUSION

The LRM system is key for a financial institution to survive in competitive and highly unpredictable financial markets. The whole idea of managing liquidity risk is to know the truth and be ready for the worst market scenarios. This preparedness is what is needed, and it can save a bank in times like the 2008 crisis. Even at the business level, a proper LRM system can help in better product pricing using FTP, so that pricing is logical and transparent.

Traditionally, data has been a headache for banks and is seen more as a compliance and regulatory requirement, but going forward there are going to be even more stringent regulations and reporting standards across the globe. After the crisis of 2008, new Basel III liquidity reporting standards and newer scenarios for stress testing have been issued that require extensive data analysis, which is only possible in a timely manner with Big data applications. Everyone in the banking industry knows that the future is uncertain and high margins will always be a challenge, so efficient data management along with Big data capabilities needs to be in place. This will add value to a bank's profile through a clear focus on new opportunities and bring predictability to its overall business.

Successful banks in the future will be the ones that take LRM initiatives seriously and implement the system successfully. Banks with an efficient LRM system will build a strong brand and reputation in the eyes of investors, customers and regulators around the world.

REFERENCES
1. Banking on Analytics: How High-Performance Analytics Tackle Big data Challenges in Banking (2012), SAS white paper. Available at http://www.sas.com/resources/whitepaper/wp_42594.pdf.
2. New regime, rules and requirements – welcome to the new liquidity. Basel III: implementing liquidity requirements, Ernst & Young (2011).
3. Leveraging Technology to Shape the Future of Liquidity Risk Management, Sybase/Aite Group study, July 2010.
4. Managing liquidity risk: Collaborative solutions to improve position management and analytics (2011), SWIFT white paper.
5. Principles for Sound Liquidity Risk Management and Supervision, BIS document (2008).
6. Technology Economics: The Cost of Data, Howard Rubin, Wall Street and Technology website. Available at http://www.wallstreetandtech.com/data-management/231500503.


Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor

By Anil Radhakrishnan and Kiran Kalmadi

Imagine a world where the day-to-day data about an individual's health is tracked, transmitted, stored and analyzed on a real-time basis; where diseases worldwide are diagnosed at an early stage without the need to visit a doctor; and, lastly, where every individual has a 'life certificate' that contains all their health information, updated in real time. This is the world to which Big data can lead us.

Given the amount of data generated every day in the human body – body vitals, blood samples, etc. – it is a haven for generating Big data, and analyzing this Big data in healthcare is of prime importance. Big data analytics can play a significant role in the early detection and advanced diagnosis of fatal diseases, which can reduce healthcare cost and improve quality.

Hospitals, medical universities, researchers and insurers will all be positively impacted by applying analytics to this Big data. However, the principal beneficiaries of analyzing it will be governments, patients and therapeutic companies.

RAMPANT HEALTHCARE COSTS

A look at the healthcare expenditure of countries like the US and UK explains the burden that healthcare places on the economy. As per data released by the Centers for Medicare and Medicaid Services, health expenditure in the US is estimated to have reached $2.7 trillion, or over $8,000 per person [1]. By 2020, this is expected to balloon to $4.5 trillion [2]. These costs will have a huge bearing on an economy that is struggling to get back on its feet, having just come out of a recession.

According to the Office for National Statistics, healthcare expenditure in the UK amounted to £140.8 billion in 2010, up from £136.6 billion in 2009 [3]. With rising healthcare costs, countries like Spain have already pledged to save €7 billion by slashing health spending while also charging more for drugs [5]; middle-income earners will now have to pay more for them.

This increase in healthcare costs is not isolated to a few countries alone. According to World Health Organization statistics released in 2011, per capita total expenditure on health jumped from US$566 to US$899 between 2000 and 2008, an alarming increase of 58% [4]. This huge increase is testimony to the fact that, far from increasing steadily, healthcare costs have been increasing exponentially.

While healthcare costs have been increasing, the data generated through body vitals, lab reports, prescriptions, etc., has also been increasing significantly. Analysis of this data will lead to better and more advanced diagnosis, early detection and more effective drugs, which in turn will result in a significant reduction in healthcare costs.

HOW BIG DATA ANALYTICS CAN HELP REDUCE HEALTHCARE COSTS

Analysis of the Big data generated from various real-time patient records has a lot of potential for creating quality healthcare at reduced costs. Real time here refers to data like body temperature, blood pressure, pulse/heart rate and respiratory rate that can be generated every 2-3 minutes. This data, collected across individuals, provides volume at high velocity, while also providing the required variety since it is obtained across geographies. The analysis of this data can help in reducing costs by enabling real-time diagnosis, analysis and medication, which offers:

■ Improved insights into drug effectiveness
■ Insights for early detection of diseases
■ Improved insights into the origins of various diseases
■ Insights to create personalized drugs.

These insights that Big data analytics provides are unparalleled and go a long way in reducing the cost of healthcare.

USING BIG DATA ANALYTICS FOR PERSONALIZING DRUGS

The patents of many high-profile drugs are expiring by 2014. Hence, therapeutic companies need to examine how patients respond to these drugs in order to create personalized drugs, i.e., drugs tailored to an individual patient. Real-time data collected from various patients will help generate Big data, the analysis of which will identify how individual patients reacted to the drugs administered to them. Using this analysis, therapeutic companies will be able to create personalized drugs custom-made for an individual.

Personalized drugs are one of the important solutions that Big data analytics has the power to offer. Imagine a situation where analytics determines the exact amount and type of medicine an individual requires, without them even having to visit a doctor. That is the direction in which Big data analytics in healthcare has to move. In addition, analyzing this data can also significantly reduce healthcare costs that run into billions of dollars every year.

BIG DATA ANALYTICS FOR REAL TIME DIAGNOSIS USING BIG DATA MEDICAL ENGINE IN THE CLOUD (BDMEiC)

Big data analytics for real-time diagnosis is characterized by real-time Big data analytics systems. These systems contain a closed-loop feedback mechanism, where insights from applying the solution serve as feedback for further analysis (refer Figure 1).

Access to real-time data provides a quick way to accumulate and create Big data. The closed-loop feedback system is important because it helps the system build its intelligence. These systems can not only help to monitor patients in real time but can also be used to provide diagnosis, detect diseases early and deliver medication in real time.

This can be achieved through a Big data Medical Engine in the Cloud (BDMEiC) [Fig. 2].

This solution would consist of:
■ Two medical patches (arm and thigh)
■ Analytics engine
■ Smartphone
■ Data center.

As depicted in Figure 2, the BDMEiC solution consists of the following:

1. Arm and thigh based electronic medical patches

An arm-based electronic medical patch (these patches are thin, lightweight, elastic and have embedded sensors) is strapped to the arm of an individual. It reads vitals like body temperature, blood pressure, pulse/heart rate and respiratory rate to monitor brain, heart and muscle activity, etc. The patch then transmits this real-time data to the individual's smartphone, which is synced with the patch. The data is extracted at regular intervals (every 2-3 minutes), and the smartphone transmits the real-time data to the data center in the medical engine. The thigh-based electronic medical patch is used for providing medication. It comes with a drug cartridge (pre-loaded drugs) that can be inserted into a slot in the patch. When it receives data from the smartphone, the device can administer the required medication to the patient through auto-injectors that are part of the drug cartridge.

2. Data Center

The data center is the Big data cloud storage that receives real-time data from the medical patch and stores it. This data center will be a repository of real-time data received from different individuals across geographies. This data is then transmitted to the Big data analytics engine.

3. Big Data Analytics Engine

The Big data analytics engine performs three major functions: analyzing data, sharing analyzed data with organizations and transmitting medication instructions back to the smartphone.

• Analyzing Data: It analyzes the data (body temperature, blood pressure, pulse/heart rate, respiratory rate, etc.) received from the data center using its inbuilt medical intelligence, across individuals. As the system keeps analyzing this data it also keeps building on its intelligence.

[Figure 1: Real Time Big Data Analytics System. Real-time medical data feeds a real-time Big data analytics system; analysis of the real-time data yields new solutions, and newer insights from those solutions are fed back into the system. Source: Infosys Research]


Figure 2: Big Data Medical Engine in the Cloud (BDMEiC) Source: Infosys Research

• Sharing Analyzed Data: The analytics engine also transmits its analysis to various universities, medical centers, therapeutic companies and other related organizations for further research.

• Transmitting Medication Instructions: The analytics engine can also transmit medication instructions to an individual's smartphone, which in turn transmits the data to the thigh patch whenever medication has to be provided.

The BDMEiC solution can act as a real time doctor that diagnoses, analyzes, and provides personalized medication to individuals. Such a solution that harnesses the potential of Big data provides manifold benefits to various beneficiaries.
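To make the closed loop concrete, the minimal Java sketch below models one pass of it under stated assumptions: a vitals reading arrives from the data center, a stand-in for the engine's medical intelligence checks it against illustrative thresholds, and any resulting instruction would be relayed to the thigh patch. The class names, thresholds and drug names are hypothetical and are not part of the BDMEiC design described above.

// Illustrative sketch only: a minimal, self-contained Java model of the BDMEiC
// closed loop. All class names, thresholds and drug names are hypothetical; a real
// engine would use clinically validated rules and its learned intelligence.
import java.util.ArrayList;
import java.util.List;

public class BdmeicLoopSketch {

    // A single reading transmitted by the arm patch via the smartphone.
    record VitalsReading(String patientId, double systolicBp, double pulse, double tempC) {}

    // A medication instruction sent back to the smartphone and on to the thigh patch.
    record MedicationInstruction(String patientId, String drug, double doseMg) {}

    // Hypothetical threshold check standing in for the engine's "inbuilt medical intelligence".
    static List<MedicationInstruction> analyze(VitalsReading r) {
        List<MedicationInstruction> out = new ArrayList<>();
        if (r.systolicBp() > 160) {                       // blood pressure beyond permissible limits
            out.add(new MedicationInstruction(r.patientId(), "antihypertensive", 5.0));
        }
        if (r.tempC() > 39.0) {                           // high fever
            out.add(new MedicationInstruction(r.patientId(), "antipyretic", 500.0));
        }
        return out;                                       // empty list = no medication needed
    }

    public static void main(String[] args) {
        // One 2-3 minute reading arriving from the data center.
        VitalsReading reading = new VitalsReading("patient-001", 172.0, 88.0, 37.2);
        // The resulting instructions would be transmitted to the thigh patch's auto-injector.
        analyze(reading).forEach(mi ->
                System.out.println("Instruct " + mi.patientId() + ": " + mi.drug() + " " + mi.doseMg() + " mg"));
    }
}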

BENEFITS AND BENEFICIARIES OF BDMEiC

The BDMEiC solution, if adopted on a large scale, can offer a multitude of benefits, a few of which are listed below.

Real Time Medication

With the analytics engine monitoring patient data in real time, diagnosis and treatment of patients in real time become possible. With the data being shared with top research facilities and medical institutions in the world, diagnosis and treatment become more effective and accurate.

Specific Instances: Blood pressure data can be monitored in real time and stored in the data center. The analysis of this data by the analytics engine can keep the patient as well as the doctor updated in real time if the blood pressure moves beyond permissible limits.

Beneficiaries: Patients, medical institutions and research facilities.

Convenience

The BDMEiC solution offers convenience to patients who may not always be in a position to visit a doctor.

Specific Instances: Body vitals can be measured and analyzed with the patient being at home. This especially helps in the case of senior citizens and busy executives who can now be diagnosed and treated right at home or while on the move.

Beneficiaries: Patients.

Insights into Drug Effectiveness

The system allows doctors, researchers and therapeutic companies to understand the impact of their drugs in real time. This helps them create better drugs in the future.

Specific Instances: The patents of many high-profile drugs are expiring by 2014. Therapeutic companies can use BDMEiC to perform real-time Big data analysis to understand their existing drugs better, so that they can create better drugs in the future.

Beneficiaries: Doctors, researchers and therapeutic companies.

Early Detection of Diseases

As BDMEiC monitors, stores and analyzes data in real time, it allows medical researchers, doctors and medical labs to detect diseases at an early stage and provide an early cure.

Specific Instances: Early detection of diseases like cancer, childhood pneumonia, etc., using BDMEiC can help provide medication at an early stage thereby increasing the survival rate.

Beneficiaries: Researchers, medical labs and patients.

Improved Insights into Origins of Various Diseases

With BDMEiC storing and analyzing real-time data, researchers get to know the causes and symptoms of a disease much better and at an early stage.

Specific Instances: Newer strains of viruses can be monitored and researched in real time.

Beneficiaries: Researchers and medical labs.

Insights to Create Personalized Drugs

Real-time data collected from BDMEiC will help doctors administer the right dose of drugs to patients.

Specific Instances: Instead of a standard pill, patients can be given the right amount of drugs, customized according to their needs.

Beneficiaries: Patients and doctors.

Reduced Costs

Real-time data collected from BDMEiC assists in the early detection of diseases, thereby reducing the cost of treatment.

Specific Instances: Early detection of cancer and other life threatening diseases can lead to lesser spending on healthcare.

Beneficiaries: Government and patients.

CONCLUSION

The present state of the healthcare system leaves a lot to be desired. Healthcare costs are spiraling and forecasts suggest that they are not poised to come down any time soon. In such a situation, organizations the world over, including governments, should look to harness the potential of real-time Big data analytics to provide high-quality and cost-effective healthcare. The solution proposed in this paper tries to utilize this potential to bridge the gap between medical research and the final delivery of medicine.

REFERENCES

1. US Food and Drug Administration, 2012.
2. National Health Expenditure Projections 2011-2021 (January 2012), Centers for Medicare & Medicaid Services, Office of the Actuary. Available at http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/Proj2011PDF.pdf.
3. Jurd, A. (2012), Expenditure on healthcare in the UK 1997-2010, Office for National Statistics. Available at http://www.ons.gov.uk/ons/dcp171766_264293.pdf.
4. World Health Statistics 2011, World Health Organization. Available at http://www.who.int/whosis/whostat/EN_WHS2011_Full.pdf.
5. The Ministry of Health, Social Policy and Equality, Spain. Available at http://www.msssi.gob.es/ssi/violenciaGenero/publicaciones/comic/docs/PilladaIngles.pdf.


Big Data Powered Extreme Content Hub

By Sudheeshchandran Narayanan and Ajay Sadhu

Content is getting bigger by the minute and smarter by the second [5]. As content grows in size and becomes more varied in structure, discovery of valuable and relevant content becomes a challenge. Existing Enterprise Content Management (ECM) products are constrained by scalability, limited support for variety, rigid schemas, and limited indexing and processing capability.

Content enrichment is often an external activity and is rarely deployed. The content manager is more like a content repository, used primarily for search and retrieval of published content. Existing content management solutions can handle only a few data formats and provide very limited capability with respect to content discovery and enrichment.

With the arrival of Big Content, the need to extract, enrich, organize and manage semi-structured and unstructured content and media is increasing. As the next generation of users will rely heavily on new modes of interacting with content, e.g., mobile devices and tablets, there is a need to re-look at traditional content management strategies. Artificial intelligence will now play a key role in information retrieval, information classification and usage for these sophisticated users. To facilitate the use of artificial intelligence on this Big Content, knowledge about entities, domains, etc., needs to be captured, processed, reused and interpreted by the computer. This has resulted in the formal specification and capture of the structure of the domain, called ontologies; the classification of entities within the domain into predefined categories, called taxonomy; and the inter-relating of these entities to create the semantic web (web of data).

The new breed of content management solutions needs to bring in elastic indexing, distributed content storage and low latency to address these changes. But the story does not end there. The ease of deploying technologies like natural language text analytics and machine learning now takes this new breed of content management to the next level of maturity. Time is of the essence for everyone today, and contextual filtering of content based on relevance is an immediate need. There is a need to organize content, create new taxonomies, and create new links and relationships beyond what is specified. The next generation of content management solutions should leverage ontologies, the semantic web and linked data to derive the context of the content and enrich the content metadata with this context. Leveraging this context, the system should then provide real-time alerts as the content arrives.
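As a rough illustration of such contextual filtering, the sketch below matches the tags derived for a newly ingested piece of content against subscribers' interest terms and raises an alert for those who match. The tag and subscriber names are hypothetical; an actual ECH would derive this context from ontologies and semantic enrichment rather than simple keyword sets.

// Illustrative sketch only: a minimal alerting rule for contextual filtering.
// The ontology terms, subscriber names and content fields are hypothetical.
import java.util.Map;
import java.util.Set;

public class ContentAlertSketch {

    // Metadata extracted and enriched for a newly ingested piece of content.
    record EnrichedContent(String contentId, Set<String> derivedTags) {}

    // Returns the subscribers whose registered interest terms overlap the content's tags.
    static Set<String> matchSubscribers(EnrichedContent content, Map<String, Set<String>> interestsBySubscriber) {
        return interestsBySubscriber.entrySet().stream()
                .filter(e -> e.getValue().stream().anyMatch(content.derivedTags()::contains))
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toSet());
    }

    public static void main(String[] args) {
        EnrichedContent doc = new EnrichedContent("doc-42", Set.of("retail-banking", "fraud", "regulation"));
        Map<String, Set<String>> interests = Map.of(
                "analyst-a", Set.of("fraud", "aml"),
                "analyst-b", Set.of("mobile-payments"));
        // Real-time alert as the content arrives: only analyst-a is notified.
        System.out.println("Alert: " + matchSubscribers(doc, interests));
    }
}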

In this paper, we discuss the details of the extreme content hub and its implementation semantics, technology viewpoint and use cases.

THE BIG CONTENT PROBLEM IN TODAY'S ENTERPRISES

Legacy Content Management Systems (CMS) have focused on addressing the fundamental problems in content management, i.e., content organization, indexing and searching. With the evolution of the internet, these CMS added Content Publishing Lifecycle Management (CPLM) and workflow capabilities to the overall offering. The focus of these ECM products was on providing a solution for enterprise customers to easily store and retrieve various documents and a simplified search interface. Some of these solutions evolved to address the web publishing problem. These existing content management solutions have constantly shown performance and scalability concerns. Enterprises have invested in high-end servers and hired performance engineering experts to address this. But will this last long?

[Figure 1: Augmented Capabilities of Extreme Content Hub Manager. Core features (indexing, search, workflow, metadata repository, content versioning) are augmented with heterogeneous content ingestion, automated content discovery, content enrichment, and unified intelligent content access and insights, on a highly available, elastic and scalable system. Source: Infosys Research]


With the arrival of Big data (volume, variety and velocity), these problems have been amplified further and the need for next-generation content management capabilities has grown.

Requirements and demands have gone beyond the storing, searching and indexing of traditional documents. Enterprises need to store a wide variety of content ranging from documents, videos, social media feeds, blog posts, podcasts, images, etc. Extraction, enrichment, organization and management of semi-structured, unstructured and multi-structured content and media are a big challenge today. Enterprises are under tremendous competitive pressure to derive meaningful insights from these piles of information assets and to derive business value from this Big data. They are looking for contextual and relevant information at lightning speed. The ECM solution must address all of the above technical and business requirements.

EXTREME CONTENT HUB: KEY CAPABILITIES

Key capabilities required for the Extreme Content Hub (ECH), apart from the traditional indexing, storage and search capabilities, can be classified along the following five dimensions (Fig. 2):

Heterogeneous Content Ingestion provides input adapters to bring a wide variety of content (documents, videos, images, blogs, feeds, etc.) into the content hub seamlessly. The next generation of content management systems needs to support real-time content ingestion for RSS feeds, news feeds, etc., and support streams of events being ingested as one of the key capabilities for content ingestion.

Automated Content Discovery extracts the metadata and classifies the incoming content seamlessly against pre-defined ontologies and taxonomies.

A Scalable, Fault-Tolerant, Elastic System can seamlessly expand to meet the demands of growth in the volume, velocity and variety of the content.

Content Enrichment services leverage machine learning and text analytics technologies to enrich the context of the incoming content.

Unified Intelligent Content Access provides a set of content access services that are context-aware and based on information relevance through user modeling and personalization.

To realize ECH, there is a need to augment the existing search and indexing technologies with the next generation of machine learning and text analytics to bring in a cohesive platform. The existing content management solution still provides quite a good list of features that cannot be ignored.

BIG DATA TECHNOLOGIES: RELEVANCE FOR THE CONTENT HUB

With the advent of Big data, the technology landscape has made a significant shift. Distributed computing has now become a key enabler for large-scale data processing and, with open source contributions, has received a significant boost in recent years. The year 2012 was the year of large-scale Big data technology adoption.

The other significant advancement has been in NoSQL (Not Only SQL) technology, which complements existing RDBMS systems for scalability and flexibility. The scalable near real-time access provided by these systems has boosted the adoption of distributed computing for real-time data storage and indexing needs.

Scalable and elastic deployments enabled by advancements in private and public clouds have accelerated the adoption of distributed computing in enterprises. Overall, there is a significant change from our earlier approach of solving the ever-increasing data and performance problem by throwing more hardware at it. Today, deploying a scalable distributed computing infrastructure that not only addresses the velocity, variety and volume problem but also does so as a cost-effective alternative using open source technologies provides the business case for building the ECH. The solution is to augment the existing content management solution with the processing capabilities of Big data technologies to create a comprehensive platform that brings in the best of both worlds.

REALIZATION OF THE ECH

ECH requires a scalable, fault-tolerant, elastic system that provides scalability across storage, compute and network infrastructure. Distributed processing technologies like Hadoop provide the foundation platform for this. A private cloud based deployment model will provide the on-demand elasticity and scale required to set up such a platform.

A metadata model driven ingestion framework could ingest a wide variety of feeds into the hub seamlessly. Content ingestion could deploy content security tagging during the ingestion process to ensure that the content stored inside the hub is secured and access to it is authorized.

NoSQL technologies like HBase and MongoDB could meet the scalable metadata repository needs of the system.
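As one possible shape of such a metadata repository, the sketch below writes a content item's extracted metadata, including a security tag applied at ingestion, into an assumed HBase table using the standard HBase client API. The table name, column family and qualifiers are illustrative choices, not a prescribed schema.

// Illustrative sketch only: storing extracted content metadata in HBase, one of the
// NoSQL options mentioned above. Table name, column family and qualifiers are
// hypothetical; cluster configuration is assumed to come from hbase-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetadataRepositorySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("content_metadata"))) {

            // Row key: content identifier assigned during ingestion.
            Put put = new Put(Bytes.toBytes("doc-42"));
            byte[] meta = Bytes.toBytes("meta");
            put.addColumn(meta, Bytes.toBytes("title"), Bytes.toBytes("Q3 compliance briefing"));
            put.addColumn(meta, Bytes.toBytes("mimeType"), Bytes.toBytes("application/pdf"));
            put.addColumn(meta, Bytes.toBytes("securityTag"), Bytes.toBytes("internal-only"));
            put.addColumn(meta, Bytes.toBytes("taxonomy"), Bytes.toBytes("finance/regulatory"));

            table.put(put);   // metadata row now available to search and enrichment services
        }
    }
}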

[Figure 2: Extreme Content Hub. Existing enterprise content, social feeds, log feeds and real-time news/alerts/RSS feeds are ingested through unstructured content and metadata extractors into a distributed file system (Hadoop); a metadata driven augmented CM processing framework with content processing workflows, link and index storage (HBase), a rule engine, machine learning algorithms, auto-classification and recommendation supports content services, search services, content classification services, alerts and content API services, dashboards and unified enterprise content access, with knowledge feeds to existing systems including the existing enterprise CM. Source: Reference [12]]


Search and indexing technologies have matured to the next level after the advent of Web 2.0, and deploying a scalable indexing service like Solr, Elasticsearch, etc., provides the much-needed scalable indexing and search capability required for the platform.

Deploying machine learning algorithms leveraging Mahout and R on this platform can bring in auto-discovery of content metadata and auto-classification for content enrichment. De-duplication and other value-added services can be seamlessly deployed as batch jobs on the Hadoop infrastructure to bring value-added context to the content.
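The sketch below illustrates one way such a batch de-duplication service could run on Hadoop: documents are keyed by a pre-computed content hash so that identical items meet in the same reducer, where all but the first are flagged as duplicates. The tab-separated input format, paths and output labels are assumptions for illustration.

// Illustrative sketch only: de-duplication as a Hadoop batch job. Input is assumed
// to be one "contentHash <TAB> docId" pair per line; paths come from the command line.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeduplicationJob {

    // Map: key records by content hash so identical documents meet in the same reducer.
    public static class HashMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1])); // (hash, docId)
            }
        }
    }

    // Reduce: the first docId per hash is kept; the rest are flagged as duplicates.
    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text hash, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            boolean first = true;
            for (Text docId : docIds) {
                context.write(docId, new Text(first ? "KEEP" : "DUPLICATE_OF_" + hash));
                first = false;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "content de-duplication");
        job.setJarByClass(DeduplicationJob.class);
        job.setMapperClass(HashMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}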

Machine learning and text analytics technologies can be further leveraged for recommendation and contextualization of user interactions, providing unified context-aware services.

BENEFITS OF ECH

ECH is at the center of enterprise knowledge management and innovation. Serving contextual and relevant information to users will be one of its fundamental uses.

Auto-indexing will help discover multiple facets of the content and uncover new patterns and relationships between the various entities that would have gone unnoticed in the legacy world. The integrated metadata view of the content will help in building a 360-degree view of a particular domain or entity from the various sources.

ECH could enable discovery of user tastes and likings based on the content searched and viewed. This could serve real-time recommendations to users through content hub services and could help the enterprise in specific user behavior modeling. Emerging trends in various domains can be discovered as content gets ingested into the hub.

ECH could be extended into an analytics platform for video and text analytics. Real-time information discovery can be facilitated using pre-defined alerts/rules that get triggered as new content arrives in the hub.

The derived metadata and context could be pushed to the existing content management solution to leverage the investments already made in existing products and platforms, while augmenting their processing and analytics capabilities with new technologies.

ECH will thus be able to handle large volumes and a wide variety of content formats and bring in deep insights leveraging the power of machine learning. These solutions will be very cost-effective and will also leverage existing investments in the current CMS.

CONCLUSION

The need is to take a platform-centric approach to this Big Content problem rather than a standalone content management solution, to look at it strategically and to adopt a scalable architecture platform to address it. However, such an initiative does not need to replace existing content management solutions; rather, it should augment their capabilities to fill in the required white spaces. The approach discussed in this paper provides one such implementation of the augmented content hub, leveraging current advancements in Big data technologies. Such an approach will provide the enterprise with a competitive edge in the years to come.

REFERENCES

1. Agichtein, E., Brill, E. and Dumais, S. (2006), Improving web search ranking by incorporating user behavior. Available at http://research.microsoft.com/en-us/um/people/sdumais/.
2. Dumais, S. (2011), Temporal Dynamics and Information Retrieval. Available at http://research.microsoft.com/en-us/um/people/sdumais/.
3. Reamy, T. (2012), Taxonomy and Enterprise Content Management. Available at http://www.kapsgroup.com/presentations.shtml.
4. Reamy, T. (2012), Enterprise Content Categorization – How to Successfully Choose, Develop and Implement a Semantic Strategy. Available at http://www.kapsgroup.com/presentations/ContentCategorization-Development.pdf.
5. Barroca, E. (2012), Big data's Big Challenges for Content Management, TechNewsWorld. Available at http://www.technewsworld.com/story/74243.html.


Complex Events Processing: Unburdening Big Data Complexities

By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur

A study by The Economist revealed that 1.27 zettabytes was the amount of information in existence in 2010 as household data [1]. The Wall Street Journal reported Big data as the new boss in all key sectors such as education, retail and finance. On the other side, an average Fortune 500 enterprise is estimated to hold around 10 years' worth of customer data, with more than two-thirds of it being unusable. How can enterprises make such an explosion of data usable and relevant? Not trillions but quadrillions of data points await analysis overall, a number that is expected to increase exponentially and that evidently impacts businesses worldwide. Additionally, there is the problem of providing speedier results, which will only get slower with more data to analyze unless technologies innovate at the same pace.

Any function or business, whether it is road traffic control, high frequency trading, auto adjudication of insurance claims or controlling the supply chain logistics of electronics manufacturing, requires huge data sets to be analyzed along with timely processing and decision making. Any delay, even of seconds or milliseconds, affects the outcome. Significantly, technology should be capable of interpreting historical patterns, applying them to current situations and taking accurate decisions with minimal human interference.

Big data is about the strategy to deal with vast chunks of incomprehensible data sets. There is now awareness across industries that traditional data stores and processing approaches like databases, files, mainframes or even mundane caching cannot serve as a solution for Big data. Still, the existing models do not address the capabilities of processing, analyzing data, integrating with events and real-time analytics, all within split-second intervals.

On the other hand, Complex Event Processing (CEP) has evolved to provide solutions that utilize in-memory data grids for analyzing trends, patterns and events in real time, with assessments in a matter of milliseconds. However, Event Clouds, a byproduct of using CEP techniques, can be further leveraged to monitor for unforeseen conditions as they emerge, or even the emergence of an unknown-unknown, creating early awareness and a potential first mover advantage for the savvy organization.

To set the context of the paper, we first highlight how CEP with in-memory data grid technologies helps in pattern detection, matching, analysis, processing and decision making in split seconds with the use of Big data. This model should serve any industry function where time is of the essence, Big data is at the core and CEP acts as the mantle. Later, we propose treating an Event Cloud as more than just an event collection bucket used for event pattern matching, or as simply the immediate memory store of an exo-cortex for machine learning; an Event Cloud is also a robust corpus with its own intrinsic characteristics that can be measured, quantified and leveraged for advantage. For example, by automating the detection of a shift away from an Event Cloud's steady state, the emergence of a previously unconsidered situation may be observed. It is this application, programmatically discerning the shift away from an Event Cloud's normative state, that is explored in this paper.

CEP AS A REAL TIME MODEL FOR BIG DATA: SOME RELEVANT CASES

Today, traffic updates are integrated with city traffic control systems as well as with the global positioning system (GPS) receivers commonly used by drivers. These receivers automatically adjust and reroute in case the normal route is traffic-ridden. This helps, but the solution is reactionary. Many technology companies are investing in pursuit of the holy grail: a solution that detects and predicts traffic blockages and takes proactive action to control the traffic itself and even avoid mishaps. For this, there is a need to analyze traffic data over different parameters such as rush hour, accidents, seasonal impacts of snow, thunderstorms, etc., and come up with predictable patterns over years and decades. The second need is the application of these patterns to input conditions. All this requires huge data crunching and analyses and, on top of it, real-time application such as CEP.

Big data has already assumed importance in the financial markets, particularly in high frequency trading. Since the 2008 economic downturn and its rippling effects on the stock market, trade volumes have come down at all the top exchanges such as New York, London, Singapore, Hong Kong and Mumbai. The contrasting factor is the rise in High Frequency Trading (HFT). It is claimed that around 70% of all equity trades were accounted for by HFT in 2010, versus 10% in 2000. HFT is 100% dependent on technology and its trading strategies are developed out of complex algorithms. Only those traders who have developed a better strategy and can crunch more data faster will have a better win ratio. This is where CEP could be useful.

The healthcare industry in the USA is set to undergo rapid change with the Affordable Care Act. Healthcare insurers are expected to see an increase in costs due to the increased risk of covering more individuals, and they legally cannot deny insurance for pre-existing conditions. Hospitals are expected to see more patient data, which means increased analyses, and pharmaceutical companies need better integration with insurers and consumers for speedier and more accurate settlements. Even though most of these transactions can be performed on a non-real-time basis, the technology still needs both Big data and complex processing for a scalable solution.

In India, the outstanding cases in various judicial courts touch 32 million. In the USA, family based cases and immigration related ones are piling up waiting for a hearing. Judicial pendency has left no country untouched. Scanning through various federal, state and local law points, past rulings, class suits, individual profiles, evidence details, etc., is required to put forward the cases for the parties involved, and the winner is the one who is able to present a better analysis of the available facts. Can technology help in addressing such problems across nations?

All of these cases across such diverse industries showcase the importance of processing gigantic amounts of data and also the need to have the relevant information churned out at the right time.

WHY AND WHERE BIG DATA

Big data has evolved due to the limitations of current technologies. A two-tier or multi-tier architecture, even with a high-performing database at one end, is not enough to analyze and crunch such colossal information in the desired time frames. The fastest databases today are benchmarked at terabytes of information, as noted by the Transaction Processing Council; volumes of exabytes and zettabytes of data need a different technology. Analysis of unstructured data is another driver of the evolution of Big data. Information available as part of health records, geo maps and multimedia (audio, video and pictures) is essential for many businesses, and mining such unstructured sets requires storage power as well as transaction processing power. Add to this the variety of sources such as social media, legacy systems, vendor systems, localized data, and mechanical and sensor data. Finally, there is the critical component of speed to get the data through the steps of Unstructured → Structured → Storage → Mine → Analyze → Process → Crunch → Customize → Present.

BIG DATA METHODOLOGIES: SOME EXAMPLES

The Apache™ Hadoop™ project [2] and its relatives such as Avro™, ZooKeeper™, Cassandra™ and Pig™ provided the non-database form of technology as the way to solve problems with massive data. It used distributed architecture as the foundation to remove the constraints of traditional constructs.

Both data (storage, transportation) and processing (analysis, conversion, formatting) are distributed in this architecture. Figure 1 and Figure 2 compare the traditional and the distributed architectures.

[Figure 1: Conventional Multi-Tier Architecture. A client tier and a middle/server tier performing validation, enrichment, transformation, standardization, routing and operations. Source: Infosys Research]

[Figure 2: Distributed Multi-Nodal Architecture. A client tier connected to distributed nodes comprising data nodes and processing nodes. Source: Infosys Research]


A key advantage of distributed architecture is scalability. Nodes can be added without affecting the design of the underlying data structures and processing units.

IBM has even gone a step further with Watson [5], the famous artificially intelligent computer that can learn as it gets more information and patterns for decision making.

Similarly, IBM [6], Oracle [7], Teradata [8] and many other leading software providers have created Big data methodologies as an impetus to help enterprise information management.

VELOCITY PROBLEM IN BIG DATA

Even though we clearly see the benefits of Big data and its architecture is easily applicable to any industry, there are some limitations that are not easily perceivable. A few pointers:

■ Can Big data help a trader to give the best win scenarios based on millions and even billions of computations of multiple trading parameters in real time?

■ Can Big data forecast traffic scenarios based on sensor data, vehicle data, seasonal change, major public events and provide alternate path to drivers through their GPS devices in real time helping both city officials as well as drivers to save time?

■ Can Big data detect fraud scenarios by running through multiple shopping patterns of a user in historical data and matching them with the current transaction in real time?

■ Can Big data provide real time analytical solutions out of the box and support predictive analytics?

There are multiple business scenarios in which data has to be analyzed in real time. This data is created, updated and transferred because of real-time business or system-level events. Since the data is in the form of real-time events, a paradigm shift is required in the way data is viewed and analyzed. Real-time data analysis in such cases means that data has to be analyzed before it hits the disk. The difference between 'event' and 'data' just vanishes.

In such cases across industries, Big data is unequivocally needed to manage the data; but to use this data effectively, integrate it with real-time events and provide the business with express results, a complementary technology is required, and that is where CEP fits in.

VELOCITY PROBLEM: CEP AS A SOLUTION

The need here is to analyze data arriving in the form of real-time event streams and identify patterns or trends based on vast historical data. Other real-time events add to the complexity.

The vastness is solved by Big data, while real-time analysis of multiple events, pattern detection and appropriate matching and crunching are solved by CEP.

Real-time event analysis avoids duplicates and synchronization issues, as the data is still in flight and storage is still a step away. Similarly, it facilitates predictive analysis of data by means of pattern matching and trending. This enables the enterprise to provide early warning signals and take corrective measures in real time.

Reference architecture of traditional CEP is shown in Figure 3.

CEP’s original objective was to provide processing capability similar to Big data with

Page 59: Bigdata Challenges Opportunities

57

distributed architecture and in memory grid computing. The difference was CEP was to handle multiple events seemingly unrelated and correlate them to provide a desired and meaningful output. The backbone of CEP though can be the traditional architectures such multi-tier technologies with CEP usually in the middle tier.

Figure 4 shows how the CEP on Big data solves the velocity problem with Big data and complements the overall information management strategy for any enterprise that aims to use Big data. CEP can utilize Big data particularly by highly scalable in-memory data grids to store the raw feeds, events of interests and detected events and analyze this data in real time by correlating with other in flight events. Fraud detection is a very apt example where historic data of the customer’s transaction, his usage profile, location, etc., is stored in

the in memory data grid and every new event (transactions) from the customer is analyzed by CEP engine by correlating and applying patterns on the event data with the historic data stored in the memory grid.
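A minimal sketch of that correlation step is shown below, with a plain Java map standing in for the in-memory data grid that holds the customer's historic profile. The profile fields and the amount/location rules are hypothetical; a CEP engine would express such patterns declaratively and at much larger scale.

// Illustrative sketch only: correlating an in-flight transaction event against a
// customer profile held in an in-memory map (a stand-in for the data grid fed by
// Big data). Field names and the simple rules below are hypothetical.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FraudCorrelationSketch {

    record TxnEvent(String cardId, double amount, String country) {}
    record CustomerProfile(double avgAmount, String homeCountry) {}

    // In-memory grid stand-in: historic profiles keyed by card.
    static final Map<String, CustomerProfile> PROFILES = new ConcurrentHashMap<>();

    // Correlate the new event with the stored profile and score it.
    static boolean isSuspicious(TxnEvent e) {
        CustomerProfile p = PROFILES.get(e.cardId());
        if (p == null) return false;                            // no history yet
        boolean unusualAmount = e.amount() > 5 * p.avgAmount(); // far above usual spend
        boolean unusualPlace  = !e.country().equals(p.homeCountry());
        return unusualAmount && unusualPlace;                   // pattern of interest
    }

    public static void main(String[] args) {
        PROFILES.put("card-123", new CustomerProfile(80.0, "IN"));
        TxnEvent event = new TxnEvent("card-123", 900.0, "BR");
        System.out.println("Flag for review: " + isSuspicious(event));
    }
}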

There are multiple scenarios, some of them outlined in this paper, where CEP complements Big data and other offline analytical approaches to accomplish an active and dynamic event analytics solution.

EVENT CLOUDS AND DETECTION TECHNIQUES

CEP and Event Clouds

A linearly ordered sequence of events is called an event stream [9]. An event stream may contain many different types of events, but there must be some aspect of the events in the stream that allows for a specific ordering. This is typically an ordering via timestamp.

[Figure 3: Complex Events Processing Reference Architecture. Event generation and capture, event pre-filtering and event streams feed an event processing engine (CEP languages, patterns, domain-specific algorithms) and event handlers (pre-processing, aggregation and correlation, refinement, visualization) that produce actions and patterns for event consumers; supporting elements include event modeling and management (event access, attributes, relationships, persistence models, storage options, security, search, scalability), metadata, domain, object and event catalogs, developer and business user tools, and monitoring and administration tools. Source: Infosys Research]


By watching an event stream for event patterns of interest, such as multiple usages of the same credit card at a gas station within a 10-minute window, systems can respond with predefined business-driven behaviors, such as placing a fraud alert on the suspect credit card.
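The sketch below shows, under simplified assumptions, how such a time-window pattern could be evaluated: swipes per card are kept in a 10-minute sliding window and a flag is raised once a threshold count is reached. The threshold and window handling are illustrative; a production CEP engine would state this pattern in its own rule or query language.

// Illustrative sketch only: the "same card used N times within a 10-minute window"
// pattern with a simple in-memory sliding window. Threshold and window are hypothetical.
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class CardWindowSketch {

    private static final Duration WINDOW = Duration.ofMinutes(10);
    private static final int THRESHOLD = 3;

    // Per-card timestamps of recent gas station swipes.
    private final Map<String, Deque<Instant>> swipes = new HashMap<>();

    // Returns true when the incoming swipe completes the pattern of interest.
    boolean onSwipe(String cardId, Instant at) {
        Deque<Instant> window = swipes.computeIfAbsent(cardId, k -> new ArrayDeque<>());
        window.addLast(at);
        // Evict swipes older than the 10-minute window.
        while (!window.isEmpty() && window.peekFirst().isBefore(at.minus(WINDOW))) {
            window.removeFirst();
        }
        return window.size() >= THRESHOLD;   // e.g., raise a fraud alert on the card
    }

    public static void main(String[] args) {
        CardWindowSketch detector = new CardWindowSketch();
        Instant t0 = Instant.parse("2013-01-01T10:00:00Z");
        System.out.println(detector.onSwipe("card-123", t0));                  // false
        System.out.println(detector.onSwipe("card-123", t0.plusSeconds(120))); // false
        System.out.println(detector.onSwipe("card-123", t0.plusSeconds(480))); // true
    }
}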

An Event Cloud is “a partially ordered set of events (POSET), either bounded or unbounded, where the partial orders are imposed by the causal, timing and other relationships between events” [10]. As such, it is a collection of events within which the ordering of events may not be possible. Further, there may or may not be an affinity of the events within a given Event Cloud. If there is an affinity, it may be as broad as “all events of interest to our company” or as specific as “all events from the emitters located at the back of the building.”

Event Clouds and event streams may contain events from sources outside of an organization, such as stock market trades or tweets from a particular Twitter user. Event Clouds and event streams may have business events, operational events, or both. Strictly speaking, an event stream is an Event Cloud, but an Event Cloud may or may not be an event stream, as dictated by the ordering requirement.

Typically, a landscape with CEP capabilities will include three logical units: (i) emitters that serve as sources of events, (ii) a CEP engine, and (iii) targets to be notified under certain event conditions. Sources can be anything from an application to a sensor to even the CEP engine itself. CEP engines, which are the heart of the system, are implemented in one of two fundamental ways. Some follow the rules-based paradigm, matching on explicitly stated event patterns using algorithms like Rete, while other CEP engines use the more sophisticated event analytics approach, looking for probabilities of event patterns emerging using techniques like Bayesian classifiers [11]. In either case of rules or analytics, some consideration of what is of interest must be identified up front. Targets can be anything from dashboards to applications to the CEP engine itself.

[Figure 4: CEP on Big Data. The CEP reference architecture augmented with a query agent and write connector that link the event processing engine to an in-memory DB or data grid backed by Big data, with dashboards among the event consumers. Source: Infosys Research]

Users of the system, using the tools provided by the CEP provider, articulate events and patterns of events that they are interested in exploring, observing, and/or responding to. For example, a business user may indicate to the system that for every sequence wherein a customer asks about a product three times but does not invoke an action that results in a buy, the system is then to provide some promotional material to the customer in real-time. As another example, a technical operations department may issue event queries to the CEP engine, in real time, asking about the number of server instances being brought online and the probability that there may be a deficit in persistence storage to support the servers.

Focusing on events, while extraordinarily powerful, biases what can be cognized. That is, what you can think of, you can explore. What you can think of, you can respond to.

However, by adding the Event Cloud, or event stream, to the pool of elements being observed, emergent patterns not previously considered can be brought to light. This is the crux of this paper, using the Event Cloud as a porthole into unconsidered situations emerging.

EVENT CLOUDS HAVE FORM

As represented in Figure 5, there is a point wherein events flowing through a CEP engine are unprocessed. This point is an Event Cloud, which may or may not be physically located within a CEP engine's memory space. This Event Cloud has events entering its logical space and leaving it. The only bias to the events travelling through the CEP engine's Event Cloud is based on which event sources are serving as inputs to the particular CEP engine. For environments wherein all events, regardless of source, are sent to a common CEP engine, there is no bias of events within the Event Cloud.

[Figure 5: CEP Engine Components. Input adapters feed an event ingress bus and the Event Cloud, where rules are applied and events are filtered, unioned, correlated and matched before flowing through an output bus to output adapters. Source: Infosys Research]

There are a number of attributes of the Event Cloud that can be captured, depending upon a particular CEP implementation. For example, if an Event Cloud is managed in memory and is based on a time window, e.g., events of interest only stay within the engine's consideration for a period of time, then the number of events contained within an Event Cloud can be counted. If the structure holding an Event Cloud expands and contracts with the events it is funneling, then the memory footprint of the Event Cloud can be measured. In addition to the number of events and the memory size of the containing unit, the counts of the event types themselves that happen to be present at a particular time within the Event Cloud become a measurable characteristic. These properties, viz., memory size, event counts and event types, can serve as measurable characteristics describing an Event Cloud, giving it a size and shape (Figure 6).
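A small sketch of how these characteristics could be captured is given below: a point-in-time snapshot records the total event count, the counts per event type and an approximate memory footprint. The size estimate carried on each event is an assumption standing in for real instrumentation.

// Illustrative sketch only: measuring the properties that give an Event Cloud its
// shape. The per-event size estimate is a hypothetical stand-in for instrumentation.
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class EventCloudShape {

    record Event(String type, long timestampMillis, int approxSizeBytes) {}

    record Snapshot(int totalEvents, Map<String, Integer> countsByType, long approxMemoryBytes) {}

    // Take a point-in-time measurement of the events currently traversing the cloud.
    static Snapshot measure(Collection<Event> eventsInCloud) {
        Map<String, Integer> byType = new HashMap<>();
        long bytes = 0;
        for (Event e : eventsInCloud) {
            byType.merge(e.type(), 1, Integer::sum);
            bytes += e.approxSizeBytes();
        }
        return new Snapshot(eventsInCloud.size(), byType, bytes);
    }

    public static void main(String[] args) {
        var events = java.util.List.of(
                new Event("Ask", 1L, 128), new Event("Ask", 2L, 128), new Event("Buy", 3L, 256));
        System.out.println(measure(events));  // e.g., totalEvents=3, countsByType={Ask=2, Buy=1}
    }
}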

EVENT CLOUD STEADY STATE

The properties of an Event Cloud that give it form can be used to measure its state. By collecting its state over time, a normative operating behavior can be identified and its steady state can be determined. This steady state is critical when watching for unpredicted patterns. When a new flow pattern of events causes an Event Cloud's shape to shift away from its steady state, a situation change has occurred (Figure 7). When these steady state deviations happen, and if no new matching patterns or rules are being invoked, then an unknown-unknown may have emerged. That is, something significant enough to adjust your system's operating characteristics has occurred yet is not being acknowledged in some way. Either it has been predicted but determined not to be important, or it was simply not considered.

ANOMALY DETECTION APPLIED TO EVENT CLOUD STEADY STATE SHIFTS

Finding patterns in data that do not match a baseline pattern is the realm of anomaly detection. As such, by using the steady state of an Event Cloud as the baseline, we can apply anomaly detection techniques to discern a shift.

[Figure 6: Event Cloud. The events traversing an Event Cloud at any particular moment give it shape and size. Source: Infosys Research]

[Figure 7: Event Cloud Shift. The Event Cloud's steady state form shifts to a new form as new patterns of Ask, Buy and Look events occur. Source: Infosys Research]

Table 1 presents a catalog of various anomaly detection techniques that are applicable to Event Cloud shift discernment. This list is not meant to serve as an exhaustive compilation, but rather to showcase the variety of possibilities. Each algorithm has its own set of strengths, such as simplicity, speed of computation and certainty scores. Each algorithm, likewise, has weaknesses, including computational demands, blind spots in data deviations and difficulty in establishing a baseline for comparison. All of these factors must be considered when selecting an appropriate algorithm.

Using the three properties defined for an Event Cloud's shape (i.e., event counts, event types and Event Cloud size) combined with time properties, we have a multivariate data instance with three continuous types, viz., counts, sizes and time, and one categorical type, viz., event types. These four dimensions, and their characteristics, become a constraint on which anomaly detection algorithms can be applied [13].

The anomaly type being detected is also a constraint. In this case, the Event Cloud deviations are classified as a collective anomaly. It is a collective anomaly, as opposed to a point anomaly or a contextual anomaly, because we are comparing a collection of data instances that form the Event Cloud shape with the broader set of all data instances that formed the Event Cloud steady state shape.

Statistical algorithms lend themselves well to anomaly detection when analyzing continuous and categorical data instances.

Further, knowing an Event Cloud’s steady state shape a priori isn’t assumed, so the use of a non-parametric statistical model is appropriate [13]. Therefore, the technique of statistical profiling using histograms is explored as an example implementation approach for catching a steady state shift.

One basic approach to trap the moment of an Event Cloud's steady state shift is to leverage a histogram for each event type, with the number of times a particular count of an event type shows up in a given Event Cloud instance becoming a basis for comparison. The histogram generated over time would then serve as the baseline steady state picture of normative behavior. Individual instances of an Event Cloud's shape could then be compared to the Event Cloud's steady state histogram to discern whether a deviation has occurred; that is, does the particular Event Cloud instance contain counts of events that have rarely, or never, appeared in the Event Cloud's history?

Table 1: Applicability of Anomaly Detection Techniques to Event Cloud Steady State Shifts (Source: Derived from Anomaly Detection: A Survey [12])

Technique Classification | Example Constituent Techniques | Event Cloud Shift Applicability Challenges
Classification Based | Neural Networks, Bayesian Networks, Support Vector Machines, Rule Based | Accurately labeled training data for the classifiers is difficult to obtain
Nearest Neighbour Based / Clustering Based | Distance to kth Nearest Neighbour, Relative Density | Defining meaningful distance measures is difficult
Statistical | Parametric, Non-Parametric | Histogram approaches miss unique combinations
Spectral | Low Variance PCA, Eigenspace Based | High computational complexity

[Figure 8: Event Cloud Histogram and Instance Comparison. A steady state histogram of how often particular counts of Ask, Buy and Look events occurred is compared with an Event Cloud comparison instance. Source: Infosys Research]

Figure 8 represents the case with a steady state histogram on the left and the Event Cloud comparison instance on the right. In this depiction the histogram shows, as an example, that three Ask Events were contained within an Event Cloud instance exactly once in the history of this Event Cloud. The Event Cloud instance on the right, which will be compared against it, has six Ask Events in its snapshot state.

An anomaly score for each event type is calculated by comparing each Event Cloud instance's event type count to the event type quantity occurrence bins within the Event Cloud steady state histogram, and then these individual scores are combined into an aggregate score [13]. This aggregate score then becomes the basis upon which a judgment is made regarding whether a deviation has occurred or not.
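The sketch below gives one simplified realization of this scoring: the baseline histogram records how often each count of each event type appeared across historical Event Cloud instances, and an instance's aggregate score grows with counts that were rare or never seen. The scoring formula and any threshold applied to the aggregate are assumptions, not a method prescribed by the cited work.

// Illustrative sketch only: histogram-based anomaly scoring for Event Cloud instances.
// The scoring formula (1 - observed frequency per event type) is a hypothetical choice.
import java.util.HashMap;
import java.util.Map;

public class SteadyStateHistogramSketch {

    // histogram.get(type).get(count) = number of historical Event Cloud instances
    // in which exactly `count` events of `type` were present.
    private final Map<String, Map<Integer, Integer>> histogram = new HashMap<>();
    private int observedInstances = 0;

    // Build the baseline from each historical Event Cloud snapshot.
    void recordInstance(Map<String, Integer> countsByType) {
        observedInstances++;
        countsByType.forEach((type, count) ->
                histogram.computeIfAbsent(type, t -> new HashMap<>()).merge(count, 1, Integer::sum));
    }

    // Higher aggregate score = the instance looks less like the steady state.
    double anomalyScore(Map<String, Integer> instanceCounts) {
        double aggregate = 0;
        for (Map.Entry<String, Integer> e : instanceCounts.entrySet()) {
            int seen = histogram.getOrDefault(e.getKey(), Map.of()).getOrDefault(e.getValue(), 0);
            double frequency = observedInstances == 0 ? 0 : (double) seen / observedInstances;
            aggregate += 1.0 - frequency;   // rare or never-seen counts contribute most
        }
        return aggregate;
    }

    public static void main(String[] args) {
        SteadyStateHistogramSketch baseline = new SteadyStateHistogramSketch();
        baseline.recordInstance(Map.of("Ask", 3, "Buy", 4));
        baseline.recordInstance(Map.of("Ask", 2, "Buy", 4));
        // Six Ask events never appeared in the history, so the score is high.
        System.out.println(baseline.anomalyScore(Map.of("Ask", 6, "Buy", 4)));
    }
}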

While simple to implement, the histogram-based approach has the primary weakness that a rare combination of events in an Event Cloud would not be detected if the quantities of the individual events present were at their normal or frequent levels.

LIMITATIONS OF EVENT CLOUD SHIFTS

Anomaly detection algorithms have blind spots, or situations where they cannot discern an Event Cloud shift. This implies that it is possible for an Event Cloud to shift undetected under just the right circumstances. However, following the lead suggested by Okamoto and Ishida with immunity-based anomaly detection systems [13], rather than having a single observer detecting when an Event Cloud deviates from steady state, a system could have multiple observers, each applying its own techniques and approaches. Their individual results could then be aggregated, with varying weights applied to each technique, to render a composite Event Cloud steady state shift score. This helps reduce the chances of missing a state change shift.
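A compact sketch of such a weighted composite is shown below: each observer contributes a normalized score, and the weighted average is compared against an alert threshold. The observer names, weights and threshold are illustrative only.

// Illustrative sketch only: combining several detector verdicts into a weighted
// composite score. Observer names, weights and the alert threshold are hypothetical.
import java.util.List;
import java.util.function.ToDoubleFunction;

public class CompositeShiftScoreSketch {

    record Observer<T>(String name, double weight, ToDoubleFunction<T> score) {}

    // Weighted average of each observer's normalized (0..1) score.
    static <T> double compositeScore(List<Observer<T>> observers, T cloudInstance) {
        double weighted = 0, totalWeight = 0;
        for (Observer<T> o : observers) {
            weighted += o.weight() * o.score().applyAsDouble(cloudInstance);
            totalWeight += o.weight();
        }
        return totalWeight == 0 ? 0 : weighted / totalWeight;
    }

    public static void main(String[] args) {
        List<Observer<double[]>> observers = List.of(
                new Observer<>("histogram", 0.5, snapshot -> snapshot[0]),
                new Observer<>("nearest-neighbour", 0.3, snapshot -> snapshot[1]),
                new Observer<>("clustering", 0.2, snapshot -> snapshot[2]));
        double[] normalizedScores = {0.9, 0.4, 0.7};     // per-technique scores for one instance
        double composite = compositeScore(observers, normalizedScores);
        System.out.println(composite > 0.6 ? "steady state shift suspected" : "steady state");
    }
}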

With the approach outlined by this paper, the scope of indicators is such that you get an early indicator that something new is emerging and nothing more. Noticing an Event Cloud shift only indicates that a situational change has occurred; it does not identify or highlight what the root cause of the change is, nor does it fully explain what is happening. Analysis is still required to determine what initiated the shift along with what opportunities for exploitation may be present.

FURTHER RESEARCH

Many enterprise CEP implementations are architected in layers, wherein event abstraction hierarchies, event pattern maps and event processing networks are used in concert to increase the visibility aspects of the system [14] as well as to help with overall performance by allowing for the segmenting of event flows. In general, each layer going up the hierarchy is an aggregation of multiple events from its immediate child layer. With the lowest layer containing the finest grained events and the highest layer containing the coarsest grained events, the Event Clouds that manifest at each layer are likewise of varying granularity (Figure 9). Therefore a noted Event Cloud steady state shift at the lowest layer represents the finest granularity shift that can be observed, while an Event Cloud's steady state shifts at the highest layer represent the coarsest steady state shifts that can be observed. Techniques for interleaving individual layer Event Cloud steady state shifts, along with the opportunities and consequences of their mixed granularity, can be explored.

[Figure 9: Event Hierarchies. CEP in layers, with fine-grained events at the lowest layer aggregated into coarser-grained Event Clouds at higher layers. Source: Infosys Research]

The technique presented in this paper is designed to capture the beginnings of a situational change not explicitly coded for. With the recognition of a new situation emerging, the immediate task is to discern what is happening and why, while it is unfolding. Further research can be done to discern which elements available from the automated steady state shift analysis would be of value in helping an analyst, business or technical, unravel the genesis of the situation change. By discovering what change information is of value, not only can an automated alert be sent to interested parties, but it can also contain helpful clues on where to start their analysis.

CONCLUSION

It would be an understatement to say that without the right set of systems, methodologies, controls, and checks and balances on data, no enterprise can survive. Big data solves the problem of the vastness and multiplicity of ever-rising information in this information age. What Big data does not address is the complexity associated with real-time data analysis. CEP, though designed purely for events, complements the Big data strategy of any enterprise.

The Event Cloud, a constituent component of CEP, can be used for more than its typical application. By treating it as a first-class citizen of indicators, and not just a collection-point computing construct, a company can gain insight into the early emergence of something new, something previously not considered and potentially the birth of an unknown-unknown.

With organizations growing in their usage of Big data, and with the desire to move closer to real-time response, companies will inevitably leverage the CEP paradigm. The question will be: do they use it as everyone else does, triggering off conceived patterns, or will they exploit it for unforeseen situation emergence? When the situation changes, the capability is present and the data is present, but are you?

REFERENCES

1. WSJ article on Big data. Available at http://online.wsj.com/article/SB10000872396390443890304578006252019616768.html.
2. Transaction Processing Council benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
3. Transaction Processing Council benchmark comparison of leading databases. Available at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp.
4. Apache Hadoop project site. Available at http://hadoop.apache.org/.
5. IBM Watson, artificially intelligent supercomputer home page. Available at http://www-03.ibm.com/innovation/us/watson/.
6. IBM's Big data initiative. Available at http://www-01.ibm.com/software/data/bigdata/.
7. Oracle's Big data initiative. Available at http://www.oracle.com/us/technologies/big-data/index.html.
8. Teradata Big data Analytics offerings. Available at http://www.teradata.com/business-needs/Big-Data-Analytics/.
9. Luckham, D. and Schulte, R. (2011), Event Processing Glossary – Version 2.0. Available at http://www.complexevents.com/2011/08/23/event-processing-glossary-version-2-0/.
10. Bass, T. (2007), What is Complex Event Processing?, TIBCO Software Inc.
11. Bass, T. (2010), Orwellian Event Processing. Available at http://www.thecepblog.com/2010/02/28/orwellian-event-processing/.
12. Chandola, V., Banerjee, A. and Kumar, V. (2009), Anomaly Detection: A Survey, ACM Computing Surveys.
13. Okamoto, T. and Ishida, Y. (2009), An Immunity-Based Anomaly Detection System with Sensor Agents, Sensors, ISSN 1424-8220.
14. Luckham, D. (2002), The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems, Addison Wesley, Boston.
15. Vincent, P. (2011), ACM Overview of BI Technology misleads on CEP. Available at http://www.thetibcoblog.com/2011/07/28/acm-overview-of-bi-technology-misleads-on-cep/.
16. About Esper and NEsper FAQ. Available at http://esper.codehaus.org/tutorials/faq_esper/faq.html#what-algorithms.
17. Ide, T. and Kashima, H. (2004), Eigenspace-based Anomaly Detection in Computer Systems, Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, pp. 22-25.


VOL 11 NO 1 2013

Big Data: Testing Approach to Overcome Quality Challenges

By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja

Testing Big data is one of the biggest challenges faced by organizations because of the lack of knowledge on what to test and how much data to test. Organizations have been facing challenges in defining test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases and performing non-functional testing. These challenges result in poor quality of data in production, delayed implementation and increased cost. A robust testing approach needs to be defined for validating structured and unstructured data, and testing should start early to identify possible defects early in the implementation life cycle and to reduce the overall cost and time to market.

Different testing types like functional and non-functional testing are required, along with strong test data and test environment management, to ensure that the data from varied sources is processed error free and is of good enough quality to perform analysis. Functional testing activities like validation of the MapReduce process, structured and unstructured data validation, and data storage validation are important to ensure that the data is correct and of good quality. Apart from functional validations, non-functional testing like performance and failover testing plays a key role in ensuring the whole process is scalable and happens within the specified SLA.

A Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop MapReduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
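To make the division into map and reduce fragments concrete, here is a minimal, hedged sketch of a word-count job written for Hadoop Streaming, which allows mappers and reducers to be plain scripts reading standard input and writing standard output. The script name and the word-count use case are illustrative assumptions, not details taken from the implementation discussed in this paper.

#!/usr/bin/env python
# wc_streaming.py - a minimal word-count mapper/reducer pair for Hadoop Streaming.
# Run as "python wc_streaming.py map" for the mapper and
# "python wc_streaming.py reduce" for the reducer; the file name is hypothetical.
import sys

def mapper():
    # Each mapper receives one input split on stdin and emits key-value pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts and groups mapper output by key, so counts for a word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()

In a run of this kind, many copies of the mapper would each receive one input split, and the framework may re-execute either fragment on another node, which is why such scripts are kept stateless with respect to the split they are given.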

Figure 1 shows the step-by-step process of how Big data is processed using the Hadoop ecosystem.

Validate data quality by employing a structured testing technique


The first step, loading source data into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data or tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing MapReduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves extracting the output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse for generating analytical reports or any of the transactional systems for further processing.

BIG DATA TESTING APPROACH
As we are dealing with huge data and executing on multiple nodes, there is a high chance of bad data and data quality issues at each stage of the process. Data functional testing is performed to identify data issues caused by coding errors or node configuration errors.

Testing should be performed at each of the three phases of Big data processing to ensure that data is getting processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of Hadoop MapReduce process data output; and (iii) validation of data extract and load into the EDW. Apart from these functional validations, non-functional testing including performance testing and failover testing needs to be performed.

Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.

Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social network sites, call logs, transactional data, etc., is extracted based on the requirements and loaded into HDFS before processing it further.

Issues: Some of the issues we face during this phase, as the data moves from source systems to Hadoop, are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.

Figure 1: Big Data Testing Focus Areas (1. Loading source data files into HDFS; 2. Performing MapReduce operations; 3. Extracting the output results from HDFS). Source: Infosys Research

Validations: Some high level scenarios that need to be validated during this phase include:

1. Comparing input data file against source systems data to ensure the data is extracted correctly

2. Validating the data requirements and ensuring the right data is extracted,

3. Validating that the files are loaded into HDFS correctly, and

4. Validating the input files are split, moved and replicated in different data nodes.

Validation of Hadoop Map Reduce Process

Once the data is loaded into HDFS, the Hadoop MapReduce process is run to process the data coming from different sources.

Issues: Some issues that we face during this phase of data processing are coding issues in MapReduce jobs, jobs working correctly when run on a standalone node but incorrectly when run on multiple nodes, incorrect aggregations, node configuration issues, and incorrect output format.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that data processing is completed and the output file is generated

Figure 2: Big Data Architecture and Testing Focus Areas (web logs, streaming data, social data and transactional RDBMS data are loaded, e.g., using Sqoop, into HDFS; MapReduce, Pig, Hive and HBase process the data; an ETL process moves the processed data into the enterprise data warehouse for reporting using BI tools; testing focus areas: 1. pre-Hadoop process validation, 2. Map-Reduce process validation, 3. ETL process validation, 4. non-functional testing (performance, failover), and reports testing). Source: Infosys Research


2. Validating the business logic on a standalone node and then validating it after running against multiple nodes

3. Validating the map reduce process to verify that key value pairs are generated correctly

4. Validating the aggregation and consolidation of data after the reduce process (a comparison sketch follows this list)

5. Validating the output data against the source files and ensuring the data processing is completed correctly

6. Validating the output data file format and ensuring that the format is per the requirement.
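As a hedged illustration of scenarios 4 and 5 above, the sketch below recomputes a per-key count directly from the source extract and compares it with the corresponding MapReduce output. The tab-separated layout, the key column position and the file names are assumptions made for this example, not details from the paper.

import csv
from collections import Counter

def aggregate_source(source_path, key_col=0):
    """Recompute the expected count per key from the raw source extract."""
    counts = Counter()
    with open(source_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            counts[row[key_col]] += 1
    return counts

def load_job_output(output_path):
    """Read MapReduce output assumed to be 'key<TAB>count' lines."""
    counts = {}
    with open(output_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            counts[key] = int(value)
    return counts

def diff_aggregates(expected, actual):
    """Return keys whose counts disagree, including keys missing on either side."""
    mismatches = {}
    for key in set(expected) | set(actual):
        if expected.get(key, 0) != actual.get(key, 0):
            mismatches[key] = (expected.get(key, 0), actual.get(key, 0))
    return mismatches

if __name__ == "__main__":
    # File names are placeholders for the source extract and one reducer output file.
    issues = diff_aggregates(aggregate_source("source_extract.tsv"),
                             load_job_output("part-r-00000.tsv"))
    for key, (exp, act) in sorted(issues.items()):
        print(f"MISMATCH {key}: expected {exp}, got {act}")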

Validation of Data Extract and Load into EDW
Once the MapReduce process is completed and the data output files are generated, this processed data is moved to the enterprise data warehouse or any other transactional systems depending on the requirement.

Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into EDW and incomplete data extract from Hadoop HDFS.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that transformation rules are applied correctly

2. Validating that there is no data corruption by comparing target table data against HDFS files data

3. Validating the data load in target system

4. Validating the aggregation of data

5. Validating the data integrity in the target system.

Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from the EDW or running queries on Hive.

Issues: Some of the issues faced while generating reports are report definition not set as per the requirement, report data issues, layout and format issues.

Validations: Some high level validations performed during this phase include:

Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.


VOLUME, VARIETY AND VELOCITY: HOW TO TEST?
In the earlier sections we have seen step-by-step details of what needs to be tested at each phase of Big data processing. During these phases the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.

Volume: The amount of data created both inside and outside corporations via the web, mobile devices, IT infrastructure and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation, it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data set is a tedious task, so compare scripts should be used to validate the data. As data in HDFS is stored in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools, a 100% data comparison takes a lot of time to execute. To reduce the execution time we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed by the Hadoop MapReduce process, or sample the data while ensuring maximum scenario coverage.
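One possible shape for such compare scripts, sketched below purely as an illustration, is to hash each record and diff the two sets of hashes. Sampling is done deterministically from the record content so that the same records are selected on both sides, and the same function can be pointed at individual file splits on different nodes and the results merged. The file names are placeholders.

import hashlib

def record_hashes(path, sample_every=1):
    """Hash each record (line); keep a deterministic sample so that the same
    records are selected on both the expected and the actual side."""
    hashes = set()
    with open(path, "rb") as f:
        for line in f:
            digest = hashlib.md5(line.rstrip(b"\n")).hexdigest()
            if int(digest, 16) % sample_every == 0:
                hashes.add(digest)
    return hashes

def compare_files(expected_path, actual_path, sample_every=1):
    """Return (missing, unexpected) record hashes; two empty sets mean a match."""
    expected = record_hashes(expected_path, sample_every)
    actual = record_hashes(actual_path, sample_every)
    return expected - actual, actual - expected

if __name__ == "__main__":
    # The two file names are placeholders; each call can also be run per file
    # split on a different node and the results merged afterwards.
    missing, unexpected = compare_files("expected_results.txt", "actual_results.txt")
    print(f"{len(missing)} records missing, {len(unexpected)} unexpected records")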

Figure 3 shows the approach for comparing voluminous amounts of data. Data is converted into the expected result format and then compared against the actual data using compare tools. This is a faster approach but involves initial scripting time; it also reduces subsequent regression testing cycle time. When there is not enough time to validate the complete data, sampling can be done for validation.

Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.

Structured data is data in a defined format, coming from RDBMS tables or from structured files. Data of a transactional nature can be handled in files or tables for validation purposes.

Figure 3: Approach for High Volume Data Validation (MapReduce jobs run in the test environment to generate the output data files, i.e., the actual results; custom scripts convert unstructured or raw data into a structured, expected-results format; testing scripts then perform a file-by-file comparison of actual against expected results in HDFS and produce a discrepancy report). Source: Infosys Research


Semi-structured data does not have a defined format, but structure can be derived from the recurring patterns in the data; an example is data extracted by crawling different websites for analysis purposes. For validation, the data first needs to be transformed into a structured format using custom built scripts. First the pattern needs to be identified, then a copy book or pattern outline is prepared, and later this copy book is used in scripts to convert the incoming data into a structured format, after which validations are performed using compare tools.
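A minimal sketch of that conversion step follows, with a regular expression playing the role of the copy book or pattern outline. The log format, field names and file names are invented for this example and would differ for real feeds.

import csv
import re

# Assumed pattern outline ("copy book") for one kind of semi-structured record,
# e.g. a crawled web-access line: IP, timestamp, URL, status code.
RECORD_PATTERN = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\[(?P<timestamp>[^\]]+)\]\s+"(?P<url>[^"]+)"\s+(?P<status>\d{3})'
)

def to_structured(in_path, out_path):
    """Convert matching lines into a structured, tab-separated file for compare tools."""
    fields = ["ip", "timestamp", "url", "status"]
    rejected = 0
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        writer.writerow(fields)
        for line in src:
            match = RECORD_PATTERN.search(line)
            if match:
                writer.writerow(match.group(f) for f in fields)
            else:
                rejected += 1  # lines that do not fit the pattern are counted for review
    return rejected

if __name__ == "__main__":
    # File names are placeholders for the crawled input and the structured output.
    bad = to_structured("crawled_data.log", "structured_data.tsv")
    print(f"{bad} lines did not match the expected pattern")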

Unstructured data is data that does not have any format and is stored in documents, web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting, such as Pig scripting, as shown in Figure 3. But the overall coverage using automation will be low because of the unpredictable behavior of the data; input data can be in any form and changes every time a new test is performed. We therefore need to deploy a business scenario validation strategy for unstructured data. In this strategy we identify the different scenarios that can occur in day to day unstructured data analysis, and test data is set up based on these scenarios and executed.

Velocity: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and verifying that the system can handle high velocity streaming data.

NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; these tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.

Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance degrades. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence, performance testing plays a key role in any Big data project due to the huge volume of data and the complex architecture.

Some of the areas where performance issues can occur are imbalanced input splits, redundant shuffles and sorts, and performing in the reduce process aggregation computations that could be done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and doing performance tests to identify the bottlenecks.

Performance testing is conducted by setting up a huge volume of data and an infrastructure similar to production. Utilities like Hadoop performance monitoring tools can be used to capture the performance metrics and identify the issues. Performance metrics like job completion time and throughput, and system level metrics like memory utilization, are captured as part of performance testing.
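The job-level metrics are straightforward to capture from the test harness once the job boundaries are known; the sketch below times a submitted job and derives throughput. The job command and the record count are placeholders, and real runs would additionally pull memory and cluster-level metrics from the monitoring utilities mentioned above.

import subprocess
import time

def run_and_measure(job_command, records_processed):
    """Run a job submission command and report completion time and throughput.

    job_command and records_processed are placeholders supplied by the test harness;
    this sketch does not depend on any particular Hadoop monitoring API."""
    start = time.monotonic()
    result = subprocess.run(job_command, shell=True)
    elapsed = time.monotonic() - start
    throughput = records_processed / elapsed if elapsed > 0 else 0.0
    return {
        "return_code": result.returncode,
        "completion_time_sec": round(elapsed, 2),
        "throughput_records_per_sec": round(throughput, 2),
    }

if __name__ == "__main__":
    # Hypothetical invocation; replace with the actual job submission command.
    print(run_and_measure("echo simulated-job", records_processed=1_000_000))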

Failover Testing: The Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, each of them connected. There are chances of node failure, in which some of the HDFS components become non-functional. Failures can include name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover to proceed with the processing.

Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing happens seamlessly when processing is switched to other data nodes.

Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and the FsImage of the name node happen at defined intervals; validating recovery of the edit logs and FsImage files of the name node; verifying there is no data corruption because of name node failure; verifying data recovery when a data node fails; and validating that replication is initiated when a data node fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.

TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud also helps in optimizing the infrastructure and achieving faster time to market.

Key steps involved in setting up the environment on the cloud are [6]:

A. Big data Test infrastructure requirement assessment

1. Assess the Big data processing requirements

2. Evaluate the number of data nodes required in QA environment

3. Understand the data privacy requirements to evaluate private or public cloud

4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.).

B. Big data Test infrastructure design

1. Document the high level cloud test infrastructure design (Disk space, RAM required for each node, etc.)

2. Identify the cloud infrastructure service provider

3. Document the SLAs, communication plan, maintenance plan, environment refresh plan

4. Document the data security plan

5. Document the high level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third party tools required.


C. Big data Test Infrastructure Implementation and Maintenance

■ Create a cloud instance of Big data test environment

■ Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design

■ Perform a smoke test on the environment by processing sample MapReduce and Pig/Hive jobs

■ Deploy the code to perform testing.

BEST PRACTICES
Data Quality: It is very important to establish the data quality requirements for different forms of data like traditional data sources, data from social media, data from sensors, etc. If the data quality is ascertained, the transformation logic alone can be tested by executing tests against all possible data sets.

Data Sampling: Data sampling gains significance in Big data implementations, and it becomes the tester's job to identify suitable sampling techniques that include all critical business scenarios and the right test data set.

Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated, so an automated regression test suite should be built and run after each release. This will save a lot of time during Big data validations.
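As a small, hedged illustration of such an automated regression suite, the checks below are written as plain pytest test functions so they can be re-run unattended after each release; the file names and the row-count and content checks are assumptions made for this example.

# test_bigdata_regression.py -- run with `pytest` after each release.
# File names below are illustrative assumptions, not from the paper.
from collections import Counter
import hashlib

EXPECTED = "expected_results.tsv"   # produced by the test scripts
ACTUAL = "hdfs_extract.tsv"         # extracted from HDFS / the target system

def record_counter(path):
    """Multiset of per-record hashes, so that record order is ignored."""
    counts = Counter()
    with open(path, "rb") as f:
        for line in f:
            counts[hashlib.md5(line.rstrip(b"\n")).hexdigest()] += 1
    return counts

def test_row_counts_match():
    assert sum(record_counter(EXPECTED).values()) == sum(record_counter(ACTUAL).values())

def test_record_content_matches():
    assert record_counter(EXPECTED) == record_counter(ACTUAL)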

CONCLUSION
Data quality challenges can be overcome by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. Organizations need to invest in building skill sets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skill set including coding, white-box testing and data analysis skills in order to do a better job of identifying quality issues in the data.

REFERENCES
1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/Big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big Data: Hadoop, Business Analytics and Beyond, A Big Data Manifesto from the Wikibon Community, Mar 2012. Available at http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 Annual Technical Conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/IT-services/independent-validation-testing-services/Pages/cloud-based-QA-environments.aspx.


VOL 11 NO 1 2013

Nature Inspired Visualization of Unstructured Big Data

By Aaditya Prakash

The exponential growth of data capturing devices has led to an explosion of available data. Unfortunately, not all available data is in a database friendly format. Data that cannot be easily categorized, classified or imported into a database is termed unstructured data. Unstructured data is ubiquitous and is assumed to be around 80% of all data generated [1]. While tremendous advancements have taken place in analyzing, mining and visualizing structured data, the field of unstructured data, especially unstructured Big data, is still in a nascent stage.

Lack of recognizable structure and huge size makes it very challenging to work with unstructured large datasets. Classical visualization methods limit the amount of information presented and are asymptotically slow with rising dimensions of the data. We present here a model to mitigate these problems and allow efficient and vast visualization of large unstructured datasets.

A novel approach in unsupervised machine learning is the Self-Organizing Map (SOM). Along with classification, SOMs have the added benefit of dimensionality reduction. SOMs are also used for visualizing multidimensional data as a 2D planar diffusion map. This achieves data reduction, thus enabling visualization of large datasets. Present models used to visualize SOM maps lack any deductive ability, which may be defeating the power of SOM. We introduce a better restructuring of SOM trained data for more meaningful interpretation of very large data sets.

Taking inspiration from nature, we model the large unstructured dataset as spider cobweb type graphs. This has the benefit of allowing multivariate analysis, as different variables can be presented in one spider graph and their inter-variable relations can be projected, which cannot be done with classical SOM maps.

Reconstruct self-organizing maps as spider graphs for better visual interpretation of large unstructured datasets


UNSTRUCTURED DATA
Unstructured data comes in different formats and sizes. Broadly, textual data, sound, video, images, webpages, logs, emails, etc., are categorized as unstructured data. In some cases even a bundle of numeric data could be collectively unstructured, e.g., the health records of a patient. While a table of the cholesterol levels of all patients is more structured, all the biostats of a single patient are largely unstructured.

Unstructured data could be of any form and could contain any number of independent variables. Labeling, as is done in machine learning, is only possible with data where information about the variables, such as size, length, dependency, precision, etc., is known. Even extraction of the underlying information in a cluster of unstructured data is very challenging because it is not known what is to be extracted [2].

The potential of hidden analytics within unstructured large datasets could be a valuable asset to any business or research entity. Consider the case of the Enron emails (collected and prepared by the CALO project). Emails are primarily unstructured, mostly because people often reply above the last email even when the new email's content and purpose might be different. Therefore most organizations do not analyze emails or logs, but several researchers analyzed the Enron emails and their results show that a lot of predictive and analytical information could be obtained from them [3, 4, 5].

SELF ORGANIZING MAPS
The ability to harness increased computing power has been a great boon to business. From traditional business analytics to machine learning, the knowledge we get from data is invaluable. With computing forecast to get faster, maybe quantum computing someday, data promises to play an even greater role. While there has been a lot of effort to bring some structure into unstructured data [6], the cost of doing so has been the hindrance. With larger datasets it is an even greater problem, as they entail more randomness and unpredictability in the data.

Self-Organizing Maps (SOM) are a class of artificial neural networks proposed by Teuvo Kohonen [7] that transform the input dataset into a two dimensional lattice, also called a Kohonen Map.

Structure
All the points of the input layer are mapped onto a two dimensional lattice, called the Kohonen Network. Each point in the Kohonen Network is potentially a neuron.

Figure 1: Kohonen Network. Source: Infosys Research

Competition of Neurons
Once the Kohonen Network is complete, the neurons of the network compete according to the weights assigned from the input layer. The function used to declare the winning neuron is the simple Euclidean distance between the input point and the corresponding weight of each neuron. This function, called the discriminant function, is represented as

d_j(x) = Σ_i (x_i - w_ji)^2

where x = a point on the input layer, w_ji = the weight connecting input point i to neuron j, i = all the input points, j = all the neurons on the lattice, and d_j = the (squared) Euclidean distance between the input and the weight vector of neuron j.

Simply put, the winning neuron is the one whose weight vector is closest, in this distance, to the input point. This process effectively discretizes the output layer.

Cooperation of Neighboring Neurons
Once the winning neuron is found, the topological structure can be determined. Similar to the behavior of human brain cells (neurons), the winning neuron also excites its neighbors. Thus the topological structure is determined by the cooperative weights of the winning neuron and its neighbors.

Self-Organization
The process of selecting winning neurons and forming the topological structure is adaptive. The process runs multiple times to converge on the best mapping of the given input layer. SOM is better than other clustering algorithms in that it requires very few repetitions to get to a stable structure.
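A compact sketch of one SOM training iteration, written here in Python with NumPy rather than the R packages used later in the paper, may help make the competition, cooperation and adaptation steps concrete; the lattice size, learning rate, neighbourhood radius and decay schedule are illustrative choices, not the settings used for the figures.

import numpy as np

def train_som(data, rows=10, cols=10, iterations=1000, lr=0.5, radius=3.0, seed=0):
    """Minimal SOM training loop: find the winning neuron (smallest squared Euclidean
    distance to the input) and pull it and its lattice neighbours towards the input."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # Lattice coordinates of every neuron, used for neighbourhood distances.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

    for t in range(iterations):
        x = data[rng.integers(len(data))]
        # Competition: discriminant function d_j(x) = sum_i (x_i - w_ji)^2.
        d = ((weights - x) ** 2).sum(axis=2)
        winner = np.unravel_index(np.argmin(d), d.shape)
        # Cooperation: neighbourhood influence shrinks with lattice distance to the winner.
        lattice_dist2 = ((grid - np.array(winner)) ** 2).sum(axis=2)
        decay = 1.0 - t / iterations
        h = np.exp(-lattice_dist2 / (2 * (radius * decay + 1e-9) ** 2))
        # Adaptation: move weights towards the input, strongest for the winner.
        weights += (lr * decay) * h[..., None] * (x - weights)
    return weights

if __name__ == "__main__":
    sample = np.random.default_rng(1).random((500, 4))  # e.g. 4 word-frequency features
    trained = train_som(sample)
    print(trained.shape)  # (10, 10, 4) lattice of weight vectors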

Parallel SOM for Large Datasets
Among all classifying machine learning algorithms, the convergence speed of the SOM has been found to be the fastest [8]. This implies that for large data sets SOM is the most viable model.

Since the formation of the topological structure is independent of the input points, it can easily be parallelized. Carpenter et al. have demonstrated the ability of SOM to work under massively parallel processing [9]. Kohonen himself has shown that even where the input data may not be in vector form, as found in some unstructured data, large scale SOM can be run nonetheless [10].

SOM PLOTS
SOM plots are a two dimensional representation of the topological structure obtained after training the neural nets for a given number of repetitions and with a given radius. The SOM can be visualized as a complete 2-D topological structure [Fig. 2].

Figure 2: SOM Visualization using Rapidminer (AGPL Open Source) Source: Infosys Research

Figure 2 shows the overall topological structure obtained after dimensionality reduction of a multivariate dataset. While the graph above may be useful for outlier detection or general categorization, it is not very useful for analysis of individual variables.

Another option for visualizing SOM is to plot different variables in a grid format. One can use the R programming language (GNU Open Source) to plot the SOM results.


Note on running example
All the plots presented henceforth have been obtained using the R programming language. The dataset used is the SPAM Email Database, which is in the public domain and freely available for research at the UCI Machine Learning Repository. It contains 266,858 word instances from 4,601 spam emails. Emails are a good example of unstructured data.

Using the public packages in R, we obtain the SOM plots.

Figure 3 is the plot of the SOM trained result using the package 'Kohonen' [11]. This plot gives inter-variable analysis, the variables in this case being four of the most used words in the SPAM database, viz. 'order', 'credit', 'free' and 'money'. While this plot is better than the topological plot given in Figure 2, it is still difficult to interpret the result in a canonical sense.

Figure 4 is again the SOM plot of the above four most common words in the SPAM database, but this one uses the package called 'SOM' [12]. While this plot is numerical and gives the strength of the inter-variable relationship, it does not give us the analytical picture. The information obtained is not actionable.

SPIDER PLOTS OF SOM
As we have seen in Figures 2, 3 and 4, the current visualizations of SOM output could be improved for more analytical ability. We introduce a new method to plot SOM output, especially designed for large datasets.

Algorithm (a code sketch follows this list):
1. Filter the results of SOM.
2. Make a polygon with as many sides as the variables in the input.
3. Make the radius of the polygon the maximum of the values in the dataset.
4. Draw the grid for the polygon.
5. Make a segment inside the polygon if the strength of the two variables inside the segment is greater than the specified threshold.
6. Loop Step 5 for every variable against every other variable.
7. Color the segments based on the frequency of the variable.
8. Color the line segments based on the threshold of each variable pair plotted.
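The following is a minimal sketch of the algorithm above using Python and matplotlib, rather than the R code used to produce the figures; the variable names, the strength matrix and the threshold are invented purely to illustrate steps 2 to 8.

import numpy as np
import matplotlib.pyplot as plt

def spider_plot(variables, strength, threshold=0.5):
    """Draw a polygon with one vertex per variable and connect variable pairs
    whose (assumed, pre-computed) SOM-derived strength exceeds the threshold."""
    n = len(variables)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    radius = strength.max()                      # step 3: polygon radius = max value
    x, y = radius * np.cos(angles), radius * np.sin(angles)

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.plot(np.append(x, x[0]), np.append(y, y[0]), color="grey")  # steps 2 and 4
    for i, name in enumerate(variables):
        ax.text(x[i] * 1.1, y[i] * 1.1, name, ha="center", va="center")

    for i in range(n):                           # steps 5 to 8: segments above threshold
        for j in range(i + 1, n):
            if strength[i, j] > threshold:
                ax.plot([x[i], x[j]], [y[i], y[j]],
                        linewidth=1 + 3 * strength[i, j],          # thicker = stronger
                        color=plt.cm.viridis(strength[i, j]))      # colour by strength
    ax.set_aspect("equal")
    ax.axis("off")
    return fig

if __name__ == "__main__":
    words = ["order", "credit", "free", "money"]
    rng = np.random.default_rng(0)
    s = rng.random((4, 4))
    s = (s + s.T) / 2                            # symmetric, made-up strength matrix
    spider_plot(words, s).savefig("spider_plot.png")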

Figure 3: SOM Visualization in R using the package 'Kohonen'. Source: Infosys Research
Figure 4: SOM Visualization in R using the package 'SOM'. Source: Infosys Research


Plots
As we can see, the plot in Figure 5 is more meaningful than the SOM visualization plots obtained before. From the figure we can easily deduce that the words 'free' and 'order' do not have a relation similar to that of 'credit' and 'money'. Understandably so, because if a spam email is selling something it will probably contain the word 'order', and conversely, if it is advertising a product or software for 'free' download, it would not contain the word 'order'. The high relationship between 'credit' and 'money' signifies spam emails advertising better 'credit score' programs and other marketing traps.

Figure 6 shows the relationship of each variable, in this case four popular recurring words in the Spam database. The number of threads from one variable to another shows the probability of the second variable given the first. The several threads between 'free' and 'credit' suggest that spam emails offering 'free credit' (disguised in other forms by fees or deferred interest) are among the most popular.

Using these Spider plots we can analyze several variables at once. This may cause the graph to be messy but sometimes we need to see the complete picture in order to make canonical decisions about the dataset.

From Figure 7 we can see that even though the figure shows 25 variables it is not as cluttered as a Scatter Plot or Bar chart would be if plotted with 25 variables.

Figure 5: SOM Visualization in R using the above algorithm, showing segments (inter-variable dependency). Source: Infosys Research
Figure 6: SOM Visualization in R using the above algorithm, showing threads (inter-variable strength). Source: Infosys Research
Figure 7: Spider Plot showing 25 sampled words from the Spam database. Source: Infosys Research
Figure 8: Uncolored representation of threads in six variables. Source: Infosys Research

Page 80: Bigdata Challenges Opportunities

78

Figure 8 shows the different levels of strength between different variables. The 'contact' variable is strong with 'need' but not strong enough with 'help', and it is no surprise that 'you' and 'need' are strong. Here the idea was only to present the visualization technique and not an analysis of the Spam dataset; for more on spam filtering and spam analysis one may refer to several independent works on the same [13, 14].

ADVANTAGES
There are several visual and non-visual advantages of using this new plot over the existing plots. This plot has been designed to handle Big data. Most of the existing plots mentioned above are limited in their capacity to scale; principally, if the range of the data is large then most existing plots tend to get skewed and important information is lost. By normalizing the data, this new plot prevents this issue. Allowing multiple dimensions to be incorporated also enables recognition of indirect relationships.

CONCLUSION
While unstructured data is abundant, free and hidden with information, the tools for analyzing it are still nascent and the cost of converting it to structured form is very high. Machine learning is used to classify unstructured data but comes with speed and space constraints. SOMs are among the fastest machine learning algorithms, but their visualization powers are limited. We have presented a naturally intuitive method to visualize SOM outputs which facilitates multi-variable analysis and is also highly scalable.

REFERENCES
1. Grimes, S., Unstructured data and the 80 percent rule. Retrieved from http://clarabridge.com/default.aspx?tabid=137.
2. Doan, A., Naughton, J. F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F. and Vuong, B. Q. (2009), Information extraction challenges in managing unstructured data, ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20.
3. Diesner, J., Frantz, T. L. and Carley, K. M. (2005), Communication networks from the Enron email corpus "It's always about the people. Enron is no different", Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 201-228.
4. Chapanond, A., Krishnamoorthy, M. S. and Yener, B. (2005), Graph theoretic and spectral analysis of Enron email data, Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 265-281.
5. Peterson, K., Hohensee, M. and Xia, F. (2011), Email formality in the workplace: A case study on the Enron corpus, Proceedings of the Workshop on Languages in Social Media, pp. 86-95, Association for Computational Linguistics.
6. Buneman, P., Davidson, S., Fernandez, M. and Suciu, D. (1997), Adding structure to unstructured data, Database Theory - ICDT'97, pp. 336-350.
7. Kohonen, T. (1990), The self-organizing map, Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480.
8. Waller, N. G., Kaiser, H. A., Illian, J. B. and Manry, M. (1998), A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms, Psychometrika, vol. 63, no. 1, pp. 5-22.
9. Carpenter, G. A. and Grossberg, S. (1987), A massively parallel architecture for a self-organizing neural pattern recognition machine, Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54-115.
10. Kohonen, T. and Somervuo, P. (2002), How to make large self-organizing maps for non-vectorial data, Neural Networks, vol. 15, no. 8, pp. 945-952.
11. Wehrens, R. and Buydens, L. M. C. (2007), Self- and Super-organizing Maps in R: The Kohonen Package, Journal of Statistical Software, vol. 21, no. 5, pp. 1-19.
12. Yan, J. (2012), Self-Organizing Map (with application in gene clustering) in R. Available at http://cran.r-project.org/web/packages/som/som.pdf.
13. Dasgupta, A., Gurevich, M. and Punera, K. (2011), Enhanced email spam filtering through combining similarity graphs, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 785-794.
14. Cormack, G. V. (2007), Email spam filtering: A systematic review, Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455.





BUSINESS INNOVATION through TECHNOLOGY

Editorial Office: Infosys Labs Briefings, B-19, Infosys Ltd., Electronics City, Hosur Road, Bangalore 560100, India

Email: [email protected] http://www.infosys.com/infosyslabsbriefings

© Infosys Limited, 2013 Infosys acknowledges the proprietary rights of the trademarks and product names of the other companies mentioned in this issue. The information provided in this document is intended for the sole use of the recipient and for educational purposes only. Infosys makes no express or implied warranties relating to the information contained herein or to any derived results obtained by the recipient from the use of the information in this document. Infosys further does not guarantee the sequence, timeliness, accuracy or completeness of the information and will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of, any of the information or in the transmission thereof, or for any damages arising therefrom. Opinions and forecasts constitute our judgment at the time of release and are subject to change without notice. This document does not contain information provided to us in confidence by our clients.

Editor: Praveen B Malla PhD
Deputy Editor: Yogesh Dandawate
Graphics & Web Editors: Rakesh Subramanian, Chethana M G, Vivek Karkera
IP Manager: K V R S Sarma
Marketing Manager: Gayatri Hazarika
Online Marketing: Sanjay Sahay
Production Manager: Sudarshan Kumar V S
Database Manager: Ramesh Ramachandran
Distribution Managers: Santhosh Shenoy, Suresh Kumar V H

How to Reach Us: Email:

[email protected]

Phone: +91 40 44290563

Post: Infosys Labs Briefings,

B-19, Infosys Ltd. Electronics City, Hosur Road,

Bangalore 560100, India

Subscription: [email protected]

Rights, Permission, Licensing and Reprints:

[email protected]

Infosys Labs Briefings is a journal published by Infosys Labs with the objective of offering fresh perspectives on boardroom business technology. The publication aims at becoming the most sought after source for thought leading, strategic and experiential insights on business technology management.

Infosys Labs is an important part of Infosys’ commitment to leadership in innovation using technology. Infosys Labs anticipates and assesses the evolution of technology and its impact on businesses and enables Infosys to constantly synthesize what it learns and catalyze technology enabled business transformation and thus assume leadership in providing best of breed solutions to clients across the globe. This is achieved through research supported by state-of-the-art labs and collaboration with industry leaders.

About Infosys
Many of the world's most successful organizations rely on Infosys to deliver measurable business value. Infosys provides business consulting, technology, engineering and outsourcing services to help clients in over 32 countries build tomorrow's enterprise.

For more information about Infosys (NASDAQ:INFY), visit www.infosys.com




AADITYA PRAKASH is a Senior Systems Engineer with the FNSP unit of Infosys. He can be reached at [email protected].

ABHISHEK KUMAR SINHA is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected].

AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys. He can be contacted at [email protected].

ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys. He can be reached at [email protected].

BILL PEER is a Principal Technology Architect with the Infosys Labs. He can be reached at [email protected].

GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at [email protected].

KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted at [email protected].

MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached at [email protected].

NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted at [email protected].

NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting and Systems Integration wing of the FSI business unit of Infosys. He can be reached at [email protected].

NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be contacted at [email protected].

PERUMAL BABU is a Senior Technology Architect with RCL business unit of Infosys. He can be reached at [email protected].

PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be contacted at [email protected].

PRASANNA RAJARAMAN is a Senior Project Manager with RCL business unit of Infosys. He can be reached at [email protected].

SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys’ Retail & Logistics Consulting Group. He can be contacted at [email protected].

SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be contacted at [email protected].

SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice under the Cloud Unit of Infosys. He can be reached at [email protected].

ZHONG LI PhD. is a Principal Architect with the Consulting and System Integration Unit of Infosys. He can be contacted at [email protected].

Big data was the watchword of year 2012. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each page trying to either educate or build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is ‘the’ next big thing after cloud. Big data is of late being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As citizens of this digital world we generate more than 200 exabytes of information each year, equivalent to 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges, and more than 2 million search queries fired. Looking at the scale at which data is churned, it is beyond a human's capability to process it, and hence there is a need for machine processing of information. There is no dearth of data for today's enterprises. On the contrary, they are mired in data, and quite deeply at that. Today, therefore, the focus is on discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge.

Big data systems are expected to help analyze structured and unstructured data and hence are drawing huge investments. Analysts have estimated enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, large storage and search technologies.

Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In the first issue of 2013, we bring to you papers that discuss how Big data analytics can make a significant impact on several industry verticals like medical, retail, IT and how enterprises can harness the value of Big data.

Like always do let us know your feedback about the issue.

Happy Reading,

Yogesh Dandawate Deputy Editor [email protected]

Authors featured in this issue

Infosys Labs Briefings Advisory Board

Anindya Sircar PhD, Associate Vice President & Head - IP Cell
Gaurav Rastogi, Vice President, Head - Learning Services
Kochikar V P PhD, Associate Vice President, Education & Research Unit
Raj Joshi, Managing Director, Infosys Consulting Inc.
Ranganath M, Vice President & Chief Risk Officer
Simon Towers PhD, Associate Vice President and Head - Center of Innovation for Tomorrow's Enterprise, Infosys Labs
Subu Goparaju, Senior Vice President & Head - Infosys Labs

Big Data: Countering Tomorrow’s Challenges
