real time analytics

Download Real Time Analytics

Post on 15-Apr-2017




0 download

Embed Size (px)


Twitter Analytics

Framework for Real time AnalyticsBy Mohsin Hakim

Real Time Analytics


IntroductionAnalytics falls along a spectrum. On one end of the spectrum sit batch analytical applications, which are used for complex, long-running analyses. They tend to have slower response times (up to minutes, hours, or days) and lower requirements for availability. Examples of batch analytics include Hadoop-based workloads On the other end of the spectrum sit real-time analytical applications, which provide lighter-weight analytics very quickly. Latency is low (sub-second) and availability requirements are high (e.g., 99.99%). MongoDB is typically used for real-time analytics. Example applications include: Business Intelligence (BI) and analytics provides an essential set of technologies and processes that organizations have relied upon over many years to guide strategic business decisions.

Introduction1. Predictable Frequency. Data is extracted from source systems at regular intervals - typically measured in days, months and quarters2. Static Sources. Data is sourced from controlled, internal systems supporting established and well-defined back-office processes3. Fixed Models. Data structures are known and modeled in advance of analysis. This enables the development of a single schema to accommodate data from all of the source systems, but adds significant time to the upfront design4. Defined Queries. Questions to be asked of the data (i.e., the reporting queries) are pre-defined. If not all of the query requirements are known upfront, or requirements change, then the schema has to be modified to accommodate changes5. Slow-changing requirements. Rigorous change-control is enforced before the introduction of new data sources or reporting requirements 6. Limited users. The consumers of BI reports are typically business managers and senior executives

Evolving BI and Analytics for Big DataHigher Uptime RequirementsThe immediacy of real-time analytics accessed from multiple fixed and mobile devices places additional demands on the continuous availability of BI systems.Batch-based systems can often tolerate a certain level of downtime, for example for scheduled maintenance. Online systems on the other hand need to maintain operations during both failures and planned upgrades.The Need for Speed & ScaleTime to value is everything. For example, having access to real-time customer sentiment or logistics tracking is of little benefit unless the data can be analyzed and reported in real-time. As a consequence, the frequency of data acquisition, integration and analysis must increase from days to minutes or less, placing significant operational overhead on BI systems. Agile Analytics and ReportingWith such a diversity of new data sources, business analysts can not know all of the questions they need to ask in advance. Therefore an essential requirement is that the data can be stored before knowing how it will be processed and queried.The Changing Face of DataData generated by such workloads as social, mobile, sensor and logging, is much more complex and variably structured than traditional transaction data from back-office systems such as ERP, CRM, PoS (Point of Sale) and Accounts Receivable.Taking BI to the CloudThe drive to embrace cloud computing to reduce costs and improve agility means BI components that have traditionally relied on databases deployed on monolithic, scale-up systems have to be re-designed for the elastic scale-out, service-oriented architectures of cloud.

Impacts to Traditional BI DatabasesThe relational databases underpinning many of todays traditional BI platforms are not well suited to the requirements of big data: Semi-structured and unstructured data typical in mobile, social and sensor-driven applications cannot be efficiently represented as rows and columns in a relational database table Rapid evolution of database schema to support new data sources and rapidly changing data structures is notpossible in relational databases, which rely on costly ALTER TABLE operations to add or modify table attributes Performance overhead of JOINs and transaction semantics prevents relational databases from keeping pace with the ingestion of high-velocity data sources Quickly growing data volumes require scaling databases out across commodity hardware, rather than the scale-up approach typical of most relational databasesRelational databases inability to handle the speed, size and diversity of rapidly changing data generated by modern applications is already driving the enterprise adoption of NoSQL and Big Data technologies in both operational and analytical roles.

The purposeFlume in Hadoop, for batch processing, which make the data relevant time-wise; it can be used for real time because it would be too fresh, only from several min to even a second late. Flume engine, using server side in order to make decisions regarding the current state of affairs. Decisions Making are made based on whatever data is received from customers current condition without all of the history in their user profiles, which would enable a much more informed decision.State of Art Auto updating charting and report creation with Dashboard UI.Increase scalability and performance of Organizations using Real Time Analysis platform with a focus on storing, processing and analyzing the exponentially growing data using big data technologies.

Challenges1. Getting data metrics to the right peopleOften, social media is treated like the ugly stepchild within the marketing department and real-time social media analytics are either absent or ignored.2. VisualizationVisualizing real-time social media analytics is another key element involved in developing insights that matter.Simply displaying values graphically helps in making the kinds of fast interpretations necessary for making decisions with real-time data, but adding more complex algorithms and using models provides deeper insights, especially when visualized.3. Unstructured data is challengingUnlike the survey data firms are used to dealing with, most (IBM estimates 80%) is unstructured meaning it consists of words rather than numbers. And, text analytics lags seriously behind numeric analysis.4. Increasing signal to noiseSocial media data is inherently noisy. Reducing noise to even detect signal is challenging especially in real time. Sure, with enough time, new analytics tools can ferret out the few meaningful comments across various social networks, but few can handle this in real-time.5. A wait and see attitudeAgain, businesses are used to a certain operational model that makes real-time social media analytics challenging. For instance, we listed to a presentation by an analyst from NPR.

Top 10 Priorities1 Enable new fast-paced business practices2 Dont expect the new stuff to replace the old stuff3 Do not assume that all the data needs to be in real time, all the time4 Correlate real-time data with data from other sources and latencies5 Start with a proof of value with measurable outcomes6 As a safe starter project, accelerate successful latent processes into near real time7 Think about operationalizing analytics8 Think about the skills you need9 Examine application business rules to ensure they are ready for real-time data flows10 Evaluate technology platforms and expertise for availability and reliability

ChallengesReal-Time Analytics is Hard Cant Stay Ahead. You need to account for many types of data, including unstructured and semi-structured data. And new sources present themselves unpredictably. Relational databases arent capable of handling this, which leaves you hamstrung. Cant Scale. You need to analyze terabytes or petabytes of data. You need sub-second response times. Thats a lot more than a single server can handle. Relational databases werent designed for thisBatch. Batch processes are the right approach for some jobs. But in many cases, you need to analyze rapidly changing, multi-structured data in real time. You dont have the luxury of lengthy ETL processes to cleanse data for later. MongoDB Makes it Easy

Do the Impossible. MongoDB can incorporate any kind of data any structure, any format, any source no matter how often it changes. Your analytical engines can be comprehensive and real-time. Scale Big. MongoDB is built to scale out on commodity hardware, in your data center or in the cloud. And without complex hardware or extra software. This shouldnt be hard, and with MongoDB, it isnt.Real Time. MongoDB can analyze data of any structure directly within the database, giving you results in real time, and without expensive data warehouse loads.

Why Other Databases Fall Short and MangoDBMost databases make you chose between a flexible data model, low latency at scale, and powerful access. But increasingly you need all three at the same time.

Rigid Schemas. You should be able to analyze unstructured, semi-structured, and polymorphic data. And it should be easy to add new data. But this data doesnt belong in relational rows and columns. Plus, relational schemas are hard to change incrementally, especially without impacting performance or taking the database offline. Scaling Problems. Relational databases were designed for single-server configurations, not for horizontal scale-out. They were meant to serve 100s of ops per second, not 100,000s of ops per second. Even with a lot of engineering hours, custom sharding layers, and caches, scaling an RDBMS is hard at best and impossible at worst. Takes Too Long. Analyzing data in real time requires a break from the familiar ETL and data warehouse approach. You dont have time for lengthy load schedules, or to build new query models. You need to run aggregation queries against variably structured data. And you should be able to do so in place, in real time.

Organizations are using MongoDB for analytic