Page 1: Real Time Analytics

Framework for Real Time Analytics

By Mohsin Hakim

Real Time Analytics

Page 2: Real Time Analytics

Index

Introduction

Evolving BI and Analytics for Big Data

Impacts to Traditional BI Databases

Challenges

MongoDB with Hadoop

Case Studies

Current Scenario

Page 3: Real Time Analytics

Introduction

Analytics falls along a spectrum. On one end of the spectrum sit batch analytical applications, which are used for complex, long-running analyses. They tend to have slower response times (up to minutes, hours, or days) and lower requirements for availability. Examples of batch analytics include Hadoop-based workloads.

On the other end of the spectrum sit real-time analytical applications, which provide lighter-weight analytics very quickly. Latency is low (sub-second) and availability requirements are high (e.g., 99.99%). MongoDB is typically used for real-time analytics. Example applications include:

Business Intelligence (BI) and analytics provide an essential set of technologies and processes that organizations have relied upon over many years to guide strategic business decisions.

Page 4: Real Time Analytics

Introduction

1. Predictable Frequency. Data is extracted from source systems at regular intervals, typically measured in days, months and quarters.
2. Static Sources. Data is sourced from controlled, internal systems supporting established and well-defined back-office processes.
3. Fixed Models. Data structures are known and modeled in advance of analysis. This enables the development of a single schema to accommodate data from all of the source systems, but adds significant time to the upfront design.
4. Defined Queries. Questions to be asked of the data (i.e., the reporting queries) are pre-defined. If not all of the query requirements are known upfront, or requirements change, then the schema has to be modified to accommodate changes.
5. Slow-changing Requirements. Rigorous change-control is enforced before the introduction of new data sources or reporting requirements.
6. Limited Users. The consumers of BI reports are typically business managers and senior executives.

Page 5: Real Time Analytics

Evolving BI and Analytics for Big Data

Higher Uptime Requirements

The immediacy of real-time analytics accessed from multiple fixed and mobile devices places additional demands on the continuous availability of BI systems. Batch-based systems can often tolerate a certain level of downtime, for example for scheduled maintenance. Online systems, on the other hand, need to maintain operations during both failures and planned upgrades.
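To make the uptime requirement concrete, the sketch below converts an availability target, such as the 99.99% figure quoted in the introduction, into a yearly downtime budget. The function name and the non-leap-year assumption are illustrative choices, not part of the original slides.

```python
# Minimal sketch: translate an availability target into allowed downtime.
# Assumes a 365-day year; the helper name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99% the budget is under an hour per year, which leaves little room for scheduled maintenance windows of the kind batch systems can tolerate.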

The Need for Speed & Scale

Time to value is everything. For example, having access to real-time customer sentiment or logistics tracking is of little benefit unless the data can be analyzed and reported in real-time. As a consequence, the frequency of data acquisition, integration and analysis must increase from days to minutes or less, placing significant operational overhead on BI systems.

Agile Analytics and Reporting

With such a diversity of new data sources, business analysts cannot know all of the questions they need to ask in advance. Therefore an essential requirement is that the data can be stored before knowing how it will be processed and queried.

The Changing Face of Data

Data generated by such workloads as social, mobile, sensor and logging is much more complex and variably structured than traditional transaction data from back-office systems such as ERP, CRM, PoS (Point of Sale) and Accounts Receivable.

Taking BI to the Cloud

The drive to embrace cloud computing to reduce costs and improve agility means BI components that have traditionally relied on databases deployed on monolithic, scale-up systems have to be re-designed for the elastic scale-out, service-oriented architectures of the cloud.

Page 6: Real Time Analytics

Impacts to Traditional BI Databases

The relational databases underpinning many of today’s traditional BI platforms are not well suited to the requirements of big data:

• Semi-structured and unstructured data typical in mobile, social and sensor-driven applications cannot be efficiently represented as rows and columns in a relational database table

• Rapid evolution of database schema to support new data sources and rapidly changing data structures is not practical in relational databases, which rely on costly ALTER TABLE operations to add or modify table attributes

• Performance overhead of JOINs and transaction semantics prevents relational databases from keeping pace with the ingestion of high-velocity data sources

• Quickly growing data volumes require scaling databases out across commodity hardware, rather than the scale-up approach typical of most relational databases

Relational databases’ inability to handle the speed, size and diversity of rapidly changing data generated by modern applications is already driving the enterprise adoption of NoSQL and Big Data technologies in both operational and analytical roles.
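The schema point above can be sketched with Python's standard library alone. The snippet is a minimal illustration, not a benchmark: it uses an in-memory SQLite table to stand in for a relational store and plain dictionaries to stand in for documents, and all table, column and field names are hypothetical.

```python
import sqlite3

# Relational side: the table shape is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO events (username) VALUES (?)", ("alice",))

# A new data source adds a 'sentiment' attribute, so the schema must be
# altered before any row can carry it.
conn.execute("ALTER TABLE events ADD COLUMN sentiment REAL")
conn.execute("INSERT INTO events (username, sentiment) VALUES (?, ?)", ("bob", 0.8))
rows = conn.execute("SELECT username, sentiment FROM events ORDER BY id").fetchall()

# Document side: each record carries its own fields, including nested ones,
# so new attributes appear without a separate schema change step.
documents = [
    {"username": "alice"},
    {"username": "bob", "sentiment": 0.8},
    {"username": "carol", "device": {"os": "android", "battery": 0.42}},
]


def fields_seen(docs):
    """Collect the top-level field names used across variably structured documents."""
    names = set()
    for doc in docs:
        names.update(doc)
    return sorted(names)


print(rows)                    # existing rows get NULL for the new column
print(fields_seen(documents))  # fields discovered after the fact
```

On a small SQLite table the ALTER TABLE step is cheap; the cost the slides refer to arises on large production tables, which this sketch does not attempt to reproduce.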

Page 7: Real Time Analytics

The purpose

• Flume with Hadoop for batch processing, which keeps the data relevant time-wise; it can also serve near-real-time use because the data is fresh, arriving from several minutes to as little as a second late.

• A Flume engine, used server-side to make decisions about the current state of affairs.

• Decisions are made on whatever data is received about customers’ current condition, without the full history in their user profiles, which would enable a much more informed decision.

• State-of-the-art auto-updating charts and report creation with a dashboard UI.

Increase the scalability and performance of organizations using a real-time analysis platform, with a focus on storing, processing and analyzing exponentially growing data using big data technologies.

Page 8: Real Time Analytics

Challenges

1. Getting data metrics to the right people. Often, social media is treated like the ugly stepchild within the marketing department, and real-time social media analytics are either absent or ignored.
2. Visualization. Visualizing real-time social media analytics is another key element involved in developing insights that matter. Simply displaying values graphically helps in making the kinds of fast interpretations necessary for making decisions with real-time data, but adding more complex algorithms and using models provides deeper insights, especially when visualized.
3. Unstructured data is challenging. Unlike the survey data firms are used to dealing with, most social media data (IBM estimates 80%) is unstructured, meaning it consists of words rather than numbers. Text analytics also lags seriously behind numeric analysis.
4. Increasing signal to noise. Social media data is inherently noisy. Reducing noise to even detect signal is challenging, especially in real time. Given enough time, new analytics tools can ferret out the few meaningful comments across various social networks, but few can handle this in real time.
5. A wait-and-see attitude. Again, businesses are used to a certain operational model that makes real-time social media analytics challenging. For instance, we listened to a presentation by an analyst from NPR.

Page 9: Real Time Analytics

Top 10 Priorities

1. Enable new fast-paced business practices
2. Don’t expect the new stuff to replace the old stuff
3. Do not assume that all the data needs to be in real time, all the time
4. Correlate real-time data with data from other sources and latencies
5. Start with a proof of value with measurable outcomes
6. As a safe starter project, accelerate successful latent processes into near real time
7. Think about operationalizing analytics
8. Think about the skills you need
9. Examine application business rules to ensure they are ready for real-time data flows
10. Evaluate technology platforms and expertise for availability and reliability

Page 10: Real Time Analytics

Challenges

Real-Time Analytics is Hard

Can’t Stay Ahead. You need to account for many types of data, including unstructured and semi-structured data. And new sources present themselves unpredictably. Relational databases aren’t capable of handling this, which leaves you hamstrung.

Can’t Scale. You need to analyze terabytes or petabytes of data. You need sub-second response times. That’s a lot more than a single server can handle. Relational databases weren’t designed for this.

Batch. Batch processes are the right approach for some jobs. But in many cases, you need to analyze rapidly changing, multi-structured data in real time. You don’t have the luxury of lengthy ETL processes to cleanse data for later.

MongoDB Makes it Easy

Do the Impossible. MongoDB can incorporate any kind of data, of any structure, any format and any source, no matter how often it changes. Your analytical engines can be comprehensive and real-time.

Scale Big. MongoDB is built to scale out on commodity hardware, in your data center or in the cloud. And without complex hardware or extra software. This shouldn’t be hard, and with MongoDB, it isn’t.

Real Time. MongoDB can analyze data of any structure directly within the database, giving you results in real time, and without expensive data warehouse loads.

Page 11: Real Time Analytics

Why Other Databases Fall Short and MongoDB

Most databases make you choose between a flexible data model, low latency at scale, and powerful access. But increasingly you need all three at the same time.

Rigid Schemas. You should be able to analyze unstructured, semi-structured, and polymorphic data. And it should be easy to add new data. But this data doesn’t belong in relational rows and columns. Plus, relational schemas are hard to change incrementally, especially without impacting performance or taking the database offline.

Scaling Problems. Relational databases were designed for single-server configurations, not for horizontal scale-out. They were meant to serve 100s of ops per second, not 100,000s of ops per second. Even with a lot of engineering hours, custom sharding layers, and caches, scaling an RDBMS is hard at best and impossible at worst.

Takes Too Long. Analyzing data in real time requires a break from the familiar ETL and data warehouse approach. You don’t have time for lengthy load schedules, or to build new query models. You need to run aggregation queries against variably structured data. And you should be able to do so in place, in real time.
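As a rough illustration of running an aggregation against variably structured data in place, the sketch below shows the shape of a MongoDB aggregation pipeline built from the `$match` and `$group` stages, next to a pure-Python function that computes the same summary so the example runs without a database server. The event documents, field names and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event documents; note that their fields differ.
events = [
    {"type": "click", "page": "/home", "ms": 120},
    {"type": "click", "page": "/pricing"},
    {"type": "crash", "device": {"os": "ios"}, "ms": 900},
    {"type": "click", "page": "/home", "ms": 80},
]

# Shape of an equivalent aggregation pipeline (shown for reference only,
# it is not executed in this sketch).
PIPELINE = [
    {"$match": {"ms": {"$exists": True}}},
    {"$group": {"_id": "$type", "count": {"$sum": 1}, "avg_ms": {"$avg": "$ms"}}},
]


def summarize(docs):
    """Group documents that have an 'ms' field by 'type'; return count and mean."""
    totals = defaultdict(lambda: {"count": 0, "sum_ms": 0})
    for doc in docs:
        if "ms" not in doc:  # mirrors the $match stage
            continue
        bucket = totals[doc["type"]]  # mirrors grouping on "$type"
        bucket["count"] += 1
        bucket["sum_ms"] += doc["ms"]
    return {
        key: {"count": b["count"], "avg_ms": b["sum_ms"] / b["count"]}
        for key, b in totals.items()
    }


print(summarize(events))
```

Documents missing the measured field are simply skipped, so no cleansing or reshaping pass is needed before the query runs.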

Organizations are using MongoDB for analytics because it lets them store any kind of data, analyze it in real time, and change the schema as they go.

New Data. MongoDB’s document model enables you to store and process data of any structure: events, time series data, geospatial coordinates, text and binary data, and anything else. You can adapt the structure of a document’s schema just by adding new fields, making it simple to bring in new data as it becomes available.

Horizontal Scalability. MongoDB’s automatic sharding distributes data across fleets of commodity servers, with complete application transparency. With multiple options for scaling, including range-based, hash-based and location-aware sharding, MongoDB can support thousands of nodes, petabytes of data, and hundreds of thousands of ops per second without requiring you to build custom partitioning and caching layers.

Powerful Analytics, In Place, In Real Time. With rich index and query support, including secondary, geospatial and text search indexes, as well as the aggregation framework and native MapReduce, MongoDB can run complex ad-hoc analytics and reporting in place.
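The difference between range-based and hash-based sharding can be sketched as two routing functions. This is a simplified illustration of the idea only, not MongoDB's actual implementation: the modulo-on-hash scheme, the boundary values and the function names are assumptions made for the example.

```python
import bisect
import hashlib


def hash_shard(key: str, shard_count: int) -> int:
    """Route a key by hashing it, which spreads adjacent keys across shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Hypothetical range boundaries: shard 0 holds keys < "h",
# shard 1 holds keys < "p", shard 2 holds everything else.
RANGE_UPPER_BOUNDS = ["h", "p"]


def range_shard(key: str) -> int:
    """Route a key by its position among sorted boundaries, keeping neighbors together."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, key)


for user in ("alice", "mallory", "zed"):
    print(user, "range ->", range_shard(user), "hash ->", hash_shard(user, 3))
```

Range routing keeps neighboring keys on the same shard, which suits range scans; hash routing trades that locality for a more even spread of writes.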

Page 12: Real Time Analytics

MongoDB with Hadoop

The following table provides examples of customers using MongoDB together with Hadoop to power big data applications.

eBay
MongoDB: User data and metadata management for product catalog
Hadoop: User analysis for personalized search & recommendations

Orbitz
MongoDB: Management of hotel data and pricing
Hadoop: Hotel segmentation to support building search facets

Pearson
MongoDB: Student identity and access control. Content management of course materials
Hadoop: Student analytics to create adaptive learning programs

Foursquare
MongoDB: User data, check-ins, reviews, venue content management
Hadoop: User analysis, segmentation and personalization

Tier 1 Investment Bank
MongoDB: Tick data, quants analysis, reference data distribution
Hadoop: Risk modeling, security and fraud detection

Industrial Machinery Manufacturer
MongoDB: Storage and real-time analytics of sensor data collected from connected vehicles
Hadoop: Preventive maintenance programs for fleet optimization. In-field monitoring of vehicle components for design enhancements

SFR
MongoDB: Customer service applications accessed via online portals and call centers
Hadoop: Analysis of customer usage, devices & pricing to optimize plans

Whether improving customer service, supporting cross-sell and upsell, enhancing business efficiency or reducing risk, MongoDB and Hadoop provide the foundation to operationalize big data.

Page 13: Real Time Analytics

Future Trends in Real-Time Data, BI, and Analytics

Data types handled in real time today. Numerous TDWI surveys have shown that structured data (which includes relational data) is by far the most common class of data types handled for BI and analytic purposes, as well as many operational and transactional ones. It’s no surprise that structured data bubbled to the top of Figure 16. Other data types and sources commonly handled in real time today include application logs (33%), event data (26%), semi-structured data (26%), and hierarchical and raw data (24% each).

Data types to be handled in real time within three years. Looking ahead, a number of data types are poised for greater real-time usage. Some are in limited use today but will experience aggressive adoption within three years, namely social media data (38%), Web logs and clickstreams (34%), and unstructured data (34%). Others are handled in real time today and will become even more so, namely event (36%), semi-structured (33%), structured (31%), and hierarchical (30%) data.

Page 14: Real Time Analytics

Case Studies

Page 15: Real Time Analytics

MongoDB Integration with BI and Analytics Tools

To make online big data actionable through dashboards, reports, visualizations and integration with other data sources, it must be accessible to established BI and analytics tools. MongoDB offers integration with more of the leading BI tools than any other NoSQL or online big data technology, including:

Actuate, Alteryx, Informatica, Jaspersoft, Logi Analytics, MicroStrategy, Pentaho, Qliktech, SAP Lumira

Page 16: Real Time Analytics

WindyGrid

One person, one laptop, and MongoDB’s technology jumpstarted a project that, with other people joining in, went from prototype to one of the nation’s pioneering projects to analyze and act on municipal data in real time, in just four months.

WindyGrid put Chicago on the path of revolutionizing how it operates not by replacing the administrative systems already in place, but by using MongoDB to bring that data together into a new application. With MongoDB’s flexible data model, WindyGrid doesn’t have to go back and redo the schema for each new piece of data. Instead, it can evolve schemas in real time, which is crucial as WindyGrid expands and adds predictive analytics, growing by millions of pieces of structured and unstructured data each day.

Page 17: Real Time Analytics

Crittercism is a Mobile Pioneer

Crittercism doesn’t just monitor apps or gather information. Using MongoDB’s powerful built-in query functions, it analyzes avalanches of unstructured and non-uniform data in real time. It recognizes patterns, identifies trends, and diagnoses problems. That means that Crittercism’s customers immediately understand the root cause of problems and the impact they’re having on business. So they know how to prioritize and correct the problems they’re facing and improve performance.

The kind of real time analysis that Crittercism provides customers would also be impossible with traditional databases. Crittercism is using MongoDB’s powerful query functions to analyze the broad variety of data it collects, in real time, within the database. A more traditional data warehouse approach, with ETLs and long loading times, can’t match this type of speed. At the same time, MongoDB lets Crittercism efficiently handle the tons of data it’s collecting. During the past two years, the number of requests that Crittercism gathers and analyzes has jumped from 700 to 45,000 per second. Relational databases have a hard time scaling to meet these kinds of demands, typically requiring expensive add-on software, or additional layers of proprietary code, to keep up. With MongoDB, horizontal scalability across multiple data centers is a native function.
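The request-rate figures above (700 to 45,000 per second over two years) imply a steep growth curve. The short calculation below derives the overall growth factor and, under the added assumption of steady compounding, the equivalent yearly multiplier; the helper name is hypothetical.

```python
def growth_summary(start_rate: float, end_rate: float, years: float):
    """Return (overall growth factor, equivalent constant yearly multiplier)."""
    factor = end_rate / start_rate
    yearly = factor ** (1.0 / years)  # assumes steady compounding growth
    return factor, yearly


factor, yearly = growth_summary(700, 45_000, 2)
print(f"overall growth: about {factor:.1f}x, or roughly {yearly:.1f}x per year")
```

Growth of roughly 64x in two years is the kind of load increase that the slides argue calls for horizontal scale-out rather than a larger single server.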

Page 18: Real Time Analytics

McAfee - Global Cybersecurity

GTI analyzes cyberthreats from all angles, identifying threat relationships, such as malware used in network intrusions, websites hosting malware, botnet associations, and more. Threat information is extremely time sensitive; knowing about a threat from weeks ago is useless.

In order to provide up-to-date, comprehensive threat information, McAfee needs to quickly process terabytes of different data types (such as IP address or domain) into meaningful relationships, e.g. Is this web site good or bad? What other sites have been interacting with it? The success of the cloud-based system also depends on a bidirectional data flow: GTI gathers data from millions of client sensors and provides real-time intelligence back to these end products, at a rate of 100 billion queries per month.

McAfee was unable to address these needs and effectively scale out to millions of records with its existing solutions. For example, the HBase / Hadoop setup made it difficult to run interesting, complex queries, and the team experienced bugs with the Java garbage collector running out of memory. Another issue was with sharding and syncing: Lucene was able to index in interesting ways, but required too much customization.

The team compensated for all the rebuilding and redeploying of Katta shards with “the usual scripting duct tape,” but what they really needed was a solution that could seamlessly handle the sharding and updating on its own.

McAfee selected MongoDB, which had excellent documentation and a growing community that was “on fire.”
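For a sense of scale, the 100 billion queries per month quoted above can be converted into an average per-second rate. The sketch assumes a 30-day month and perfectly even traffic, so real peak load would be higher; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_month: float, days_in_month: int = 30) -> float:
    """Average queries per second, assuming traffic is spread evenly over the month."""
    return queries_per_month / (days_in_month * SECONDS_PER_DAY)


rate = average_qps(100_000_000_000)
print(f"about {rate:,.0f} queries per second on average")
```

That works out to tens of thousands of queries every second, sustained around the clock.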

Page 19: Real Time Analytics

Power Journalism

BuzzFeed, the social news and entertainment company, relies on MongoDB to analyze all performance data for its content across the social web. A core part of BuzzFeed’s publishing platform, MongoDB exposes metrics to editors and writers in real time, to help them understand how its content is performing and to optimize for the social web. The company has been using MongoDB since 2010. Here’s why.

1. Analytics provide more insight, more quickly. BuzzFeed relies on MongoDB for its strategic analytics platform. With apps and dashboards built on MongoDB, BuzzFeed can pinpoint when content is viewed and how it is shared. With this approach, BuzzFeed is able to quickly gain insight into how its content performs, nimbly optimize the user’s experience for the posts that are performing best, and deliver critical feedback to its writers and editors.
2. BuzzFeed is data-driven. At BuzzFeed, data drives decision-making and powers the company. MongoDB enables BuzzFeed to effectively analyze, track and expose a range of metrics to writers and employees. This includes: the number of clicks; how often and where posts are being shared; which views on different social media properties lead to the most shares; and how views differ across mobile and desktop.
3. Successful web journalism demands scale. BuzzFeed processes large volumes of data, and this is increasing each year as the site’s traffic continues to grow. Originally built on a relational data store, BuzzFeed decided to use MongoDB, a more scalable solution, to collect and track the data it needs with richer functionality than a standard key-value store.
4. Editors gain an edge with access to data in minutes. Fast, easy access to data is critical to helping editors determine what content will be most shareable in the social media world. With MongoDB, BuzzFeed is able to expose performance data shortly after publication, enabling editors to quickly respond by tweaking headlines and determining the best way to promote.
5. Setting the infrastructure for new applications. As BuzzFeed continues its efforts to leverage stats and optimization, MongoDB will feature prominently in the new infrastructure. MongoDB makes it easy to build apps quickly, a requirement as BuzzFeed rolls out additional products.

Page 20: Real Time Analytics

Current Scenario

Page 21: Real Time Analytics

Current Offerings

