bi(g) data: opportunities for bi professionals

BI(G) DATA Opportunities for BI professionals in the Netherlands Most companies mentioned are Dutch

Upload: albert-besselse

Post on 26-Jan-2015


Category:

Technology



DESCRIPTION

Presentation given to a group of freelance BI professionals in October 2013. A description of big data from different views.

TRANSCRIPT

Page 1: Bi(G) data: opportunities for BI Professionals

BI(G) DATA Opportunities for BI professionals

in the Netherlands

Most companies mentioned are Dutch

Page 2: Bi(G) data: opportunities for BI Professionals

Our fantasy...

At Last: an IT job is sexy

Page 3: Bi(G) data: opportunities for BI Professionals

Agenda

● Big Data views
  ○ Scientific Method
  ○ Data Characteristics
  ○ New Technology
  ○ Business Opportunities
  ○ Culture

● Opportunities for BI professionals

Page 4: Bi(G) data: opportunities for BI Professionals

Google Trends

The famous McKinsey Report: Big data: The next frontier for innovation, competition, and productivity

Big data became a trending term because of the McKinsey report. Now it’s correlated with Hadoop.

Page 5: Bi(G) data: opportunities for BI Professionals

Wikipedia: Big Data

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.[19]

Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

The target moves due to constant improvement in traditional DBMS technology as well as new databases like NoSQL and their ability to handle larger amounts of data.[20]

With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.

Focus on volume… instead of other V’s

Page 6: Bi(G) data: opportunities for BI Professionals

BIG Data

The scientific method is changing

Page 8: Bi(G) data: opportunities for BI Professionals

The Fourth Paradigm: Data-Intensive Scientific Discovery

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

Implicit in the idea of a fourth paradigm is the ability, and the need, to share data. In sciences like physics and astronomy, the instruments are so expensive that data must be shared

Data analysis is the new microscope

Human Genome, Large Hadron Collider

Page 9: Bi(G) data: opportunities for BI Professionals

Jim Gray

● Thousand years ago: science was empirical, describing natural phenomena
● Last few hundred years: a theoretical branch using models, generalizations
● Last few decades: a computational branch simulating complex phenomena
● Today: data exploration (eScience), unifying theory, experiment, and simulation
  ○ Data captured by instruments or generated by simulator
  ○ Processed by software
  ○ Information/knowledge stored in computer
  ○ Scientist analyzes database / files using data management and statistics

On Sunday, January 28, 2007, during a short solo sailing trip to the Farallon Islands near San Francisco to scatter his mother's ashes, Gray and his 40-foot yacht, Tenacious, were reported missing by his wife, Donna Carnes. The Coast Guard searched for four days using a C-130 plane, helicopters, and patrol boats but found no sign of the vessel.[10][11][12][13]

Gray's boat was equipped with an automatically deployable EPIRB (Emergency Position-Indicating Radio Beacon), which should have deployed and begun transmitting the instant his vessel sank. The area around the Farallon Islands where Gray was sailing is well north of the East-West ship channel used by freighters entering and leaving San Francisco Bay. The weather was clear that day and no ships reported striking his boat, nor were any distress radio transmissions reported.

On February 1, 2007, the DigitalGlobe satellite did a scan of the area, generating thousands of images.[14] The images were posted to Amazon Mechanical Turk in order to distribute the work of searching through them, in hopes of spotting his boat.

In the immediate aftermath of the disappearance, many theories were put forward on how Gray disappeared.[15]

On February 16, 2007, the family and Friends of Jim Gray Group suspended their search,[16]

Page 10: Bi(G) data: opportunities for BI Professionals

but continue to follow any important leads. The family ended its underwater search May 31, 2007. Despite much effort and use of high-tech equipment above and below water, searches did not reveal any new clues.[17][18][19][20][21][22]

Personal life

While at Berkeley, Gray and his first wife Loretta had a daughter; the couple later divorced.[2] He is survived by his wife, Donna Carnes, his daughter, three grandchildren, and his sister Gail.

The University of California, Berkeley and Gray's family hosted a tribute to him on May 31, 2008. The conference included sessions delivered by Richard Rashid and David Vaskevitch.[23] Microsoft's WorldWide Telescope software is dedicated to Gray. In 2008, Microsoft opened a research center in Madison, Wisconsin, named after Jim Gray.[24]

Having been missing for five years as of May 16, 2012, Gray is legally assumed to have died at sea.[4][25]

Jim Gray Award

Each year, Microsoft Research presents the Jim Gray eScience Award[26] to a researcher who has made an outstanding contribution to the field of data-intensive computing. Award recipients are selected for their ground-breaking, fundamental contributions to the field of eScience. Previous award winners include Alex Szalay (2007), Carole Goble (2008), Jeff Dozier (2009), Phil Bourne (2010), Mark Abbott (2011) and Antony John Williams (2012).

Books

● Transaction Processing: Concepts and Techniques (with Andreas Reuter) (1993). ISBN 1-55860-190-2.

● The Benchmark Handbook: For Database and Transaction Processing Systems (1991). Morgan Kaufmann. ISBN 978-1-55860-159-8.


Page 11: Bi(G) data: opportunities for BI Professionals

esciencecenter

Projects

Page 12: Bi(G) data: opportunities for BI Professionals

Chris Anderson

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The End of Theory: Edge, Wired

Page 13: Bi(G) data: opportunities for BI Professionals

Cukier and Mayer-Schönberger

Shift 1: End of samples
Shift 2: End of exactitude
Shift 3: End of causality

Patterns & correlations: if you know that your customers are going to buy more products by analyzing a data set or correlation, then the “why” doesn’t matter; you should try to exploit that.

The technical equivalent in big data is the ability to survey a whole population instead of just sampling random portions of it. “With less error from sampling we can accept more measurement error”. According to the authors, science is obsessed with sampling and measurement error as a consequence of coping in a ‘small data’ world.

The third and most radical shift implies “we won’t have to be fixated on causality [...] the idea of understanding the reasons behind all that happens.” This is a straw

Page 14: Bi(G) data: opportunities for BI Professionals

Nate Silver

“We're not that much smarter than we used to be, even though we have much more information - and that means the real skill now is learning how to pick out the useful information from all this noise.”

“I came to realize that prediction in the era of Big Data was not going very well.”

“If the quantity of information is increasing [exponentially]… Most of it is just noise.”

“… numbers have no way of speaking for themselves. We speak for them.”

Nate Silver has lived a preposterously interesting life. In 2002, while toiling away as a lowly consultant for the accounting firm KPMG, he hatched a revolutionary method for predicting the performance of baseball players, which the Web site Baseball Prospectus subsequently acquired. The following year, he took up poker in his spare time and quit his job after winning $15,000 in six months. (His annual poker winnings soon ran into the six-figures.)

Page 15: Bi(G) data: opportunities for BI Professionals

Nassim Taleb

Big Data is bullshit

This is the tragedy of big data: The more variables, the more correlations that can show significance. Falsity also grows faster than information; it is nonlinear (convex) with respect to data.

I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack.
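Taleb's point about variables and correlations can be illustrated with a small simulation. This is a minimal sketch, not taken from the presentation: the data are pure random noise, and the sample size, seed, and the 0.3 cut-off for "looks correlated" are arbitrary assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_noise(num_variables, num_samples, seed=42):
    """Independent random columns: by construction nothing is truly related."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(num_samples)]
            for _ in range(num_variables)]


def count_spurious_pairs(columns, threshold=0.3):
    """Count variable pairs whose |correlation| exceeds the threshold."""
    hits = 0
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(pearson(columns[i], columns[j])) > threshold:
                hits += 1
    return hits


# Assumed setup: 50 noise variables with 30 observations each.
noise = make_noise(num_variables=50, num_samples=30)
for k in (5, 10, 25, 50):
    pairs = k * (k - 1) // 2
    print(k, "variables,", pairs, "pairs,",
          count_spurious_pairs(noise[:k]), "look correlated")
```

The number of pairs grows quadratically with the number of variables, so the count of chance "findings" grows with it even though the amount of real information stays at zero.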

1. It is an outlier, as it lies outside the realm of regular expectations, because nothing in the past can convincingly point to its possibility.
2. It carries an extreme 'impact'.
3. In spite of its outlier status, human nature makes us concoct explanations for its occurrence after the fact, making it explainable and predictable.

A small number of Black Swans explains almost everything in our world, from the success of ideas and religions, to the dynamics of historical events, to elements of our own personal lives.

Page 16: Bi(G) data: opportunities for BI Professionals

Ludic Fallacy

The discovery of the Higgs particle was a disappointment for some physicists because now they know what they don’t know: no big things to discover

The ludic fallacy is a term coined by Nassim Nicholas Taleb in his 2007 book The Black Swan. "Ludic" is from the Latin ludus, meaning "play, game, sport, pastime."[1] It is summarized as "the misuse of games to model real-life situations."[2] Taleb explains the fallacy as "basing studies of chance on the narrow world of games and dice."[3]

It is a central argument in the book and a rebuttal of the predictive mathematical models used to predict the future – as well as an attack on the idea of applying naïve and simplified statistical models in complex domains. According to Taleb, statistics works only in some domains like casinos in which the odds are visible and defined. Taleb's argument centers on the idea that predictive models are based on platonified forms, gravitating towards mathematical purity and failing to take some key ideas into account:

● It is impossible to be in possession of all the information.
● Very small unknown variations in the data could have a huge impact. Taleb does differentiate his idea from that of mathematical notions in chaos theory, e.g. the butterfly effect.
● Theories/models based on empirical data are flawed, as they cannot predict events that have never happened before but have tremendous impact, e.g. the 9/11 terrorist attacks, invention of the automobile, etc.

Page 17: Bi(G) data: opportunities for BI Professionals
Page 18: Bi(G) data: opportunities for BI Professionals

Discover what you (don’t) know you don’t know?

Page 19: Bi(G) data: opportunities for BI Professionals

BIG Data

Data characteristics are changing

Page 20: Bi(G) data: opportunities for BI Professionals

BI community

● Data integration is already 20+ years old
● Just another source
● We do not have much data
● Small or big data: it has to be managed
● Big data = business analytics
● One-off projects (data is too varied)
● We know what data is all about. Nobody has to tell us what you can do with data.

Colleagues...

Page 21: Bi(G) data: opportunities for BI Professionals

Gartner’s definition (2001)

Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.

● Volume: relative size of data sources
● Velocity: speed at which data refresh is handled
● Variety: handling various data formats
● (Validity, Veracity (accuracy, correctness, applicability), Value, and Visibility)

Page 22: Bi(G) data: opportunities for BI Professionals

Variety

source: Hortonworks

Page 23: Bi(G) data: opportunities for BI Professionals

Velocity

Keeping history for click paths isn’t interesting if the site keeps changing through the years.

Page 24: Bi(G) data: opportunities for BI Professionals

Volume

Page 25: Bi(G) data: opportunities for BI Professionals

“Information was a pond and has become a river”

Peter Hinssen

Fantastic, entertaining speaker at the SAS Forum. Good presentation: filtering is becoming (or already is) very important.

Page 26: Bi(G) data: opportunities for BI Professionals

Liquid Data

To keep data actionable, you have to react instantly. Fishing in a lake versus fishing in a river: so much water that flows by quickly.

Page 27: Bi(G) data: opportunities for BI Professionals

Barry Devlin

The true godfather of data warehousing.

● Human-sourced information
  ○ is now largely digitized and electronically stored everywhere, from tweets to movies
● Process-mediated data
  ○ This data includes transactions, reference tables and relationships, as well as the metadata that sets its context, all in a highly structured form.
● Machine-generated data
  ○ from simple sensor records to complex computer logs

Page 28: Bi(G) data: opportunities for BI Professionals

Impact on the DWH

● The central core business data pillar is the consistent, quality-assured data found in EDW and MDM systems
● Deep analytic information requires highly flexible, large-scale processing such as statistical analysis and text mining
● Fast analytic data requires such high-speed analytic processing that it must be done on data in flight
● Specialty analytic data uses specialized processing such as NoSQL, XML, graph and other databases and data stores
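The idea of processing fast analytic data in flight can be sketched as a rolling aggregate that sees each reading once and never stores the full history. This is a minimal illustration, not a description of any product on the slide; the sensor readings and the window size are made up.

```python
from collections import deque


def rolling_average(stream, window_size):
    """Yield the mean of the last `window_size` readings as each one arrives.

    Only the current window is kept in memory; older readings are dropped,
    which is the point of processing data in flight instead of storing it.
    """
    window = deque(maxlen=window_size)
    total = 0.0
    for reading in stream:
        if len(window) == window.maxlen:
            total -= window[0]  # the oldest reading is about to be evicted
        window.append(reading)
        total += reading
        yield total / len(window)


# Hypothetical sensor readings arriving one at a time.
sensor_readings = iter([10, 12, 11, 50, 13])
for average in rolling_average(sensor_readings, window_size=3):
    print(round(average, 2))
```

A generator is used so that the consumer can react to every new value immediately, for example by raising an alert when the average jumps.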

Inmon now focuses on deep analytic information with his text mining.

Page 29: Bi(G) data: opportunities for BI Professionals

BIG Data New Tools

Page 30: Bi(G) data: opportunities for BI Professionals
Page 31: Bi(G) data: opportunities for BI Professionals

Other BIG data related trends

● Elastic cloud
● NoSQL
● Data visualization

Page 33: Bi(G) data: opportunities for BI Professionals

NoSQL: MongoDB

● How and Why Leading Investment Organizations are Migrating to MongoDB
● Real World MongoDB: Use Cases from Financial Services
● How Financial Firms Create Single Customer Views Using MongoDB
● How Banks Use MongoDB to Manage Risk
● How Banks Manage Reference Data with MongoDB
● How Banks Use MongoDB as a Tick Database
● Position and Trade Management with MongoDB

Page 34: Bi(G) data: opportunities for BI Professionals

NoSQL: Neo4j

Graph database

● Nodes represent entities
● Properties are pertinent information that relates to nodes
● Edges are the lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two
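The node/property/edge model can be sketched in a few lines of plain Python. This is an in-memory illustration only and does not use Neo4j or its API; the class names, the relationship names, and the sample data are hypothetical, and properties are stored directly on nodes as key-value pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity, with its properties stored as key-value pairs."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A named, directed relationship between two nodes."""
    source: Node
    relationship: str
    target: Node


class Graph:
    def __init__(self):
        self.edges = []

    def relate(self, source, relationship, target):
        self.edges.append(Edge(source, relationship, target))

    def neighbours(self, node, relationship):
        """Follow edges of one relationship type starting at `node`."""
        return [e.target for e in self.edges
                if e.source is node and e.relationship == relationship]


# Hypothetical data: a person, a company, and a product.
alice = Node("Person", {"name": "Alice"})
acme = Node("Company", {"name": "Acme"})
widget = Node("Product", {"name": "Widget"})

graph = Graph()
graph.relate(alice, "WORKS_AT", acme)
graph.relate(alice, "BOUGHT", widget)

print([n.properties["name"] for n in graph.neighbours(alice, "BOUGHT")])
```

Queries in this model are traversals along edges rather than joins between tables, which is why relationship-heavy questions tend to suit graph databases.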

Page 35: Bi(G) data: opportunities for BI Professionals

dataviz: synerscope

Ooh/aah strategy: first be amazed then understand

Page 36: Bi(G) data: opportunities for BI Professionals

Local intelligence: ORTEC/TSS

Ortec Team Support Systems (ORTEC TSS) develops decision support and information ICT systems to analyze sport performances. These software systems are employed before, during and after sport matches. During a match, they are used to measure teams’ and players’ performances.

The way clubs, teams, sponsors, unions and the public follow top athletes and talents has been brought to a whole new dimension because of these systems.

Page 38: Bi(G) data: opportunities for BI Professionals

Elastic cloud: Amazon Redshift $999 per TB per year
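To put the quoted price in perspective, here is a small worked calculation. It assumes the slide's figure of $999 per TB per year and a cost that scales linearly with volume; actual pricing depends on node type, region, and commitment, none of which the slide specifies.

```python
PRICE_PER_TB_PER_YEAR = 999  # US dollars, the figure quoted on the slide


def yearly_cost(terabytes):
    """Yearly cost, assuming the price scales linearly with volume."""
    return terabytes * PRICE_PER_TB_PER_YEAR


def monthly_cost(terabytes):
    """The same cost spread evenly over twelve months."""
    return yearly_cost(terabytes) / 12


for size_tb in (1, 10, 100):
    print(f"{size_tb} TB: ${yearly_cost(size_tb):,} per year, "
          f"${monthly_cost(size_tb):,.2f} per month")
```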


Page 39: Bi(G) data: opportunities for BI Professionals

Hadoop...

● The ecosystem isn’t stable. A lot of configurations are possible.
● Hadoop is complex and requires Java expertise.
● Apache Hadoop: Open source Hadoop framework in Java. Consists of Hadoop Common Package (filesystem and OS abstractions), a MapReduce engine (MapReduce or YARN), and Hadoop Distributed File System (HDFS)

● Apache Mahout : Machine learning algorithms for collaborative filtering, clustering, and classification using Hadoop

● Apache Hive: Data warehouse infrastructure for Hadoop. Provides data summarization, query, and analysis using a SQL-like language called HiveQL. By default, stores its metadata in an embedded Apache Derby database.

● Apache Pig: Platform for creating MapReduce programs using a high-level “Pig Latin” language. Makes MapReduce programming similar to SQL. Can be extended by user defined functions written in Java, Python, etc

● Apache Avro: Data serialization system. Avro IDL is the interface description language syntax for Avro.

Page 40: Bi(G) data: opportunities for BI Professionals

● Apache HBase: Non-relational DBMS part of the Hadoop project. Designed for large quantities of sparse data (like BigTable). Provides a Java API for map reduce jobs to access the data. Used by Facebook.

● Apache ZooKeeper : Distributed configuration service, synchronization service, and naming registry for large distributed systems like Hadoop.

● Apache Cassandra: Distributed database management system. Highly scalable.

● Apache Ambari: A web-based tool for provisioning, managing and monitoring Apache Hadoop clusters

● Apache Chukwa: A data collection system for managing large distributed systems

● Apache Sqoop: Tool for transferring bulk data between structured databases and Hadoop

● Apache Oozie: A workflow scheduler system to manage Apache Hadoop jobs
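The MapReduce model that underlies several of the tools listed above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's Java API; the function names and the input documents are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input records.
documents = ["big data is big", "data is the new microscope"]
print(word_count(documents))
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the division of work is the same.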


Page 41: Bi(G) data: opportunities for BI Professionals

Hadoop jobs

Page 42: Bi(G) data: opportunities for BI Professionals

From a single solution to an Ecosystem

Page 43: Bi(G) data: opportunities for BI Professionals

BIG Data

Business opportunities

Page 44: Bi(G) data: opportunities for BI Professionals
Page 45: Bi(G) data: opportunities for BI Professionals

McKinsey’s big data report

Page 46: Bi(G) data: opportunities for BI Professionals

"For big data, 2013 is the year of experimentation and early deployment," said Frank Buytendijk, research vice president at the research firm. "Adoption is still at the early stages with less than 8 percent of all respondents indicating their organization has deployed big data solutions. [Across the board], 20 percent are piloting and experimenting, 18 percent are developing a strategy, 19 percent are knowledge gathering, while the remainder has no plans or don't know."

Page 47: Bi(G) data: opportunities for BI Professionals
Page 48: Bi(G) data: opportunities for BI Professionals
Page 49: Bi(G) data: opportunities for BI Professionals
Page 50: Bi(G) data: opportunities for BI Professionals
Page 51: Bi(G) data: opportunities for BI Professionals

Has "Big Data" significantly changed Data Science principles and practice?

KDnuggets poll (Oct 29, 2013)

Page 52: Bi(G) data: opportunities for BI Professionals

Analytics is BIG

Analytics is hotter. The green line is Google Analytics; the blue line should be corrected for that.

Page 53: Bi(G) data: opportunities for BI Professionals

Kaggle

● Platform for predictive analytics competitions
● Business hands over part of the data and keeps part of the data sets
● Contenders build models based on the available data
● Contenders predict the values of the kept data sets
● Best prediction wins the competition
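The competition mechanics described above can be sketched as a hold-out evaluation. This is a minimal illustration, not Kaggle's actual scoring code: the data set, the two entries, the 30% hold-out fraction, and the choice of mean absolute error as the metric are all assumptions.

```python
import random


def split_competition_data(rows, holdout_fraction=0.3, seed=0):
    """The business publishes one part and keeps the rest to score entries."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(model, holdout):
    """Score a model on rows whose true values the contenders never saw."""
    return sum(abs(model(x) - y) for x, y in holdout) / len(holdout)


def pick_winner(entries, holdout):
    """Return the name of the entry with the lowest error on the hold-out."""
    return min(entries,
               key=lambda name: mean_absolute_error(entries[name], holdout))


# Hypothetical data set in which y is exactly 2 * x + 1.
rows = [(x, 2 * x + 1) for x in range(20)]
public, holdout = split_competition_data(rows)

# Two hypothetical entries; a real contender would fit a model on `public`.
entries = {
    "team_linear": lambda x: 2 * x + 1,
    "team_constant": lambda x: 20,
}
print(pick_winner(entries, holdout))
```

Keeping the hold-out values private is what prevents contenders from simply memorizing the answers.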

Page 54: Bi(G) data: opportunities for BI Professionals

Algoritmica

Page 56: Bi(G) data: opportunities for BI Professionals

Ewatercycle

A global hydrological model will provide the international community with the best possible estimates of the state of water resources in the world.

Assimilation of remotely sensed and in situ data will be a major mathematical and computational challenge.

A successful implementation of the project will lead to a community model for hydrologists across the globe.

See more at: http://esciencecenter.nl/projects/project-portfolio/water-management/#sthash.Pj7kDbBI.dpuf

Page 57: Bi(G) data: opportunities for BI Professionals

BIG Data

Cultural shift in using data

Page 58: Bi(G) data: opportunities for BI Professionals

“Perhaps the most important cultural trend today: The explosion of data about every aspect of our world and the rise of applied math gurus who know how to use it.”

Chris Anderson

Page 59: Bi(G) data: opportunities for BI Professionals

Sharing: Silk

Since Silk first came out of stealth mode in 2011, there have been 300,000 interactive pages created on its cloud-based, web data-crunching platform designed for non-technical “knowledge workers.” Taking less easy-to-read data sets and making them more digestible, results have ranged from the Guardian newspaper in the UK creating graphics of which countries have the most asylum seekers, through to charting what products Google has killed and a dad mapping out the best playgrounds for his kid in Amsterdam (where Silk also happens to be founded). It’s been a popular, and free, tool, with pages created by some 16,000 people growing by 20 percent each month. Now, Silk is moving on to its next phase: its first paid product, Silk for Teams, aimed at groups of enterprise users who want to use the platform to produce cleaner internal data sets, and eventually to create data visualizations that work with paywalls.

Page 61: Bi(G) data: opportunities for BI Professionals

“Our research suggests that seven sectors alone could generate more than $3 trillion a year in additional value as a result of open data…”

McKinsey

Page 62: Bi(G) data: opportunities for BI Professionals

Open Data

Open data: Unlocking innovation and performance with liquid information

A new McKinsey report says that open data can help create $3 trillion a year of economic value across seven sectors. In a related podcast, the McKinsey Global Institute’s Michael Chui discusses the economic

Page 63: Bi(G) data: opportunities for BI Professionals

Data.Overheid.nl

Page 64: Bi(G) data: opportunities for BI Professionals

Cap Gemini

Page 65: Bi(G) data: opportunities for BI Professionals

Data Journalism

New York Times, Guardian, Sargasso, NU.nl

Page 66: Bi(G) data: opportunities for BI Professionals

Quantified Self

Page 67: Bi(G) data: opportunities for BI Professionals

Quantified Self

Page 68: Bi(G) data: opportunities for BI Professionals

Quantified Self

Page 69: Bi(G) data: opportunities for BI Professionals

Quantified Self

Combining all the sources of this and the previous three slides and finding correlations is the essence of (big) data analytics. Example: combining sunpower with sleepcycle, fitness and diet.
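Combining personal data sources and looking for correlations can be sketched as a join on date followed by a correlation coefficient. The two daily logs below are invented for illustration, and the helper names are hypothetical; a real analysis would need far more days before any correlation means much.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def join_on_date(*sources):
    """Keep only the days present in every source, in date order."""
    shared_days = sorted(set.intersection(*(set(s) for s in sources)))
    return [[source[day] for day in shared_days] for source in sources]


# Hypothetical daily logs from two different apps, keyed by ISO date.
sleep_hours = {"2013-10-01": 6.0, "2013-10-02": 7.5,
               "2013-10-03": 8.0, "2013-10-04": 5.5}
steps = {"2013-10-02": 9000, "2013-10-03": 11000,
         "2013-10-04": 4000, "2013-10-05": 7000}

sleep, activity = join_on_date(sleep_hours, steps)
print(round(pearson(sleep, activity), 2))
```

The join step matters as much as the statistics: each source covers different days, so only the overlapping days can be compared.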

Page 70: Bi(G) data: opportunities for BI Professionals

BIG Data

Opportunities for BI professionals

Page 71: Bi(G) data: opportunities for BI Professionals

“The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that’s going to be a hugely important skill in the next decades.”

Hal Varian

Google guru

Page 72: Bi(G) data: opportunities for BI Professionals

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, relearn”

Alvin Toffler

Page 73: Bi(G) data: opportunities for BI Professionals

McKinsey report highlights

A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data… Furthermore, this type of talent is difficult to produce, taking years of training in the case of someone with intrinsic mathematical abilities. (p.10)

Page 74: Bi(G) data: opportunities for BI Professionals

Data Scientist

● Association rule learning

● Classification

● Cluster Analysis

● Crowd Sourcing

● Data Fusion and Integration

● Ensemble Learning

● Genetic Algorithms

● Machine Learning

● Natural Language Processing

● Neural Networks

● Pattern Recognition

● Predictive Modelling

● Regression

● Sentiment Analysis

● Signal Processing

● Supervised and Unsupervised Learning

● Simulation

● Time Series Analysis

● Visualization

Applying varying degrees of statistics, data visualizations, computer programming, data mining, machine learning, and database engineering to solve complex data problems.

Page 75: Bi(G) data: opportunities for BI Professionals

Typical Big Data Job is not a BI Job

JOB OPENING: BIG DATA ARCHITECT

We are looking to expand our core product team with a Senior Java Developer/Architect that will contribute in the product design and development and take pride in the delivery of kick-a** products.

Knowledge, Skills and Experience

● Relevant HBO/University education or experience
● Minimum 4 years Java experience
● Experience with NoSQL Databases, preferably MongoDB (MapReduce, Sharding)
● Experience with Cloud-based infrastructure, esp. AWS
● Expertise with Hadoop eco-system is a plus (examples: Flume, Zookeeper, Ganglia, etc)
● Experience with Web services (REST/SOAP)
● Obsession with performance and big data
● Passion for elegant technical design and good programming practices (TDD, CI)
● Energetic “self-starter”, have the will to take ownership, and be accountable for deliverables
● A true defender of quality and (light-weight) documentation of the designs
● Sense of humor is essential

● Not typical BI
● Hardcore tech...

Page 76: Bi(G) data: opportunities for BI Professionals

Personal Strategies

● Do nothing
  ○ Just sell your personal data
  ○ Wait until the big DM companies incorporate the Hadoop ecosystem
● Hadoop expert
  ○ Learn Java and the Hadoop ecosystem
● Data scientist
  ○ Learn Python/R
  ○ Learn statistics and all kinds of algorithms (especially Bayes)
● Data architect/manager
  ○ Learn the principles of Hadoop/NoSQL
  ○ Learn how to integrate (big) data in the enterprise DWH
  ○ Data governance / data stewardship / DQ / metadata
● BI(g) tool specialist
  ○ Adopt a big data dataviz or reporting tool (Splunk, Platfora)
  ○ Adopt a platform (Cloudera, Hortonworks, MapR, Azure, Google, Amazon)
● Data artist
  ○ Data visualization tools, design infographics
● Data storyteller
  ○ Data journalism course

Page 77: Bi(G) data: opportunities for BI Professionals

Group Activities

● Expert groups
  ○ Explore platforms
  ○ Explore tools
● Open data for personal and group branding
  ○ Start a project
  ○ Join open data sites
● Data journalism
  ○ Start a blog/join a blog
  ○ Make news with data
● Business cases
  ○ Scanning business cases
  ○ Almere Datacapital

Group Activities BI United

Page 78: Bi(G) data: opportunities for BI Professionals

Living in a big-data-augmented world