Big Data. New Physics. (GigaOM Structure conference, NY, March 2011)
DESCRIPTION
Opening keynote @ Structure Big Data 2011 conference.
TRANSCRIPT
© 2011 IBM Corporation
Big Data. New Physics. And Why Geospatial Data is Analytic SuperFood
Jeff Jonas, IBM Distinguished Engineer
Chief Scientist, IBM Entity Analytics
March 23rd, 2011
Big Data. New Physics.
More data: the better the predictions
– Lower false positives
– Lower false negatives
More data: faster
– The compute required decreases as the database gets bigger
Bonus: bad data … good
– Suddenly glad your data is not perfect
Background
Early 80’s: Founded Systems Research & Development
1989 – 2003: Built numerous systems for Las Vegas, including NORA
Designed and deployed roughly 100 systems, at least 5 of them containing multiple billions of records and hundreds of millions of entities
2005: IBM acquires SRD
Today: Focus on ‘sensemaking on streams’ with special attention towards privacy and civil liberties protections
[Figure. Axes: Time; Computing Power Growth. Labels: Sensemaking Algorithms; Available Observation Space; Context]
Trend: Organizations Are Getting Dumber
Enterprise Amnesia
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
~ Eric Schmidt, CEO Google
Trend: Organizations Are Getting Dumber
WHY?
Algorithms at Dead End.
You Can’t Squeeze Knowledge Out of a Pixel.
Context, definition
Better understanding something by taking into account the things around it.
Information in Context … and Accumulating
Top 200 Customer
Job Applicant
Identity Thief
Criminal Investigation
From Pixels to Pictures to Insight
Observations
Contextualization
Information in Context
Relevance
Consumer (an analyst, a system, the sensor itself, etc.)
The Puzzle Metaphor
Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors
What it represents is unknown (there is no picture on hand)
Is it one puzzle, 15 puzzles, or 1,500 different puzzles?
Some pieces are duplicates, missing, incomplete, low quality, or have been misinterpreted
Some pieces may even be professionally fabricated lies
Point being: Until you take the pieces to the table and attempt assembly, you don’t know what you are dealing with
How Context Accumulates
With each new observation … one of three assertions is made: 1) unassociated; 2) placed near like neighbors; or 3) connected
Must favor the false negative
New observations sometimes reverse earlier assertions
As the working space expands, computational effort increases
Given sufficient observations, there can come a tipping point … thereafter, confidence improves while computational effort decreases!
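The three-way assertion above can be sketched roughly as follows. This is only an illustration under stated assumptions: observations are modeled as dictionaries of key features, the function names are hypothetical, and the `connect_at` threshold (two shared feature values before connecting) is an assumed way of favoring the false negative, not the speaker's actual method. The `id_number` field name is also an assumption.

```python
from typing import Dict, List, Tuple

Observation = Dict[str, str]


def shared_features(a: Observation, b: Observation) -> int:
    """Count the features that both observations carry with equal values."""
    return sum(1 for key, value in a.items() if b.get(key) == value)


def assert_observation(new: Observation,
                       known: List[Observation],
                       connect_at: int = 2) -> Tuple[str, int]:
    """Return (assertion, index of the best match, or -1 if none).

    connect_at is an assumed threshold: demanding two or more shared
    features before connecting favors the false negative.
    """
    best_index, best_score = -1, 0
    for index, candidate in enumerate(known):
        score = shared_features(new, candidate)
        if score > best_score:
            best_index, best_score = index, score
    if best_score >= connect_at:
        return "connected", best_index
    if best_score > 0:
        return "near", best_index
    return "unassociated", -1


known = [
    {"name": "Mark Smith", "dob": "6/12/1978", "id_number": "443-43-0000"},
]
results = [
    assert_observation({"name": "Ann Lee", "phone": "555-0100"}, known),
    assert_observation({"name": "Mark Smith", "phone": "555-0101"}, known),
    assert_observation({"name": "Mark Smith", "id_number": "443-43-0000"}, known),
]
print(results)
```

A fuller system would also revisit earlier assertions when a new observation arrives; this sketch only classifies the newcomer.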
[Figure. Axes: Observations; Unique Identities. Labels: True Population; Overstated Population]
Counting Is Difficult
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
[Figure. Axes: Observations; Unique Identities. Label: True Population]
The Bigger, The More Accurate, The Faster
Data Triangulation
New Record: Mark Randy Smith, 443-43-0000, DL: 00001234
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
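The triangulation on this slide can be illustrated with a small union-find: File 1 and File 2 share no identifier, so they count as two entities until the New Record arrives carrying one identifier from each. This is a minimal sketch, not the speaker's engine. The field names (`id_number`, `dl`, `phone`), the `resolve` function, and the rule that a single shared identifier is enough to merge are all assumptions made for brevity.

```python
from typing import Dict, List, Tuple


def resolve(records: List[Dict[str, str]], link_on: List[str]) -> List[int]:
    """Assign an entity id to each record.

    Records that share a value for any feature in link_on are merged
    (transitively) into one entity using a small union-find.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen: Dict[Tuple[str, str], int] = {}
    for index, record in enumerate(records):
        for feature in link_on:
            value = record.get(feature)
            if value is None:
                continue
            key = (feature, value)
            if key in first_seen:
                # Same identifier seen before: merge the two entities.
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index
    return [find(i) for i in range(len(records))]


file_1 = {"name": "Mark Smith", "dob": "6/12/1978", "id_number": "443-43-0000"}
file_2 = {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"}
new_record = {"name": "Mark Randy Smith", "id_number": "443-43-0000", "dl": "00001234"}

link_on = ["id_number", "dl", "phone"]
before = len(set(resolve([file_1, file_2], link_on)))
after = len(set(resolve([file_1, file_2, new_record], link_on)))
print(before, after)  # 2 1
```

Adding a record lowers the entity count from two to one, which is the sense in which more data can make the count more accurate.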
Big Data … pile of … Big Data … in context
One Form of Context is “Expert Counting”
Is it 5 people each with 1 account … or is it 1 person with 5 accounts?
Is it 20 cases of H1N1 in 20 cities … or one case reported 20 times?
If one cannot count … one cannot estimate vector or velocity (direction and speed).
Without vector and velocity … prediction is nearly impossible.
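To make the counting point concrete, here is a small sketch that compares a naive report count with a count of distinct cases, then derives a simple velocity (new distinct cases per day). The report layout, the `case_id` field, and the sample data are hypothetical; in practice the case identity would itself come from resolving key features.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical report layout: (day, city, case_id).
Report = Tuple[int, str, str]


def distinct_cases_by_day(reports: List[Report]) -> Dict[int, int]:
    """Cumulative number of distinct cases seen up to and including each day."""
    cases_on_day: Dict[int, Set[str]] = defaultdict(set)
    for day, _city, case_id in reports:
        cases_on_day[day].add(case_id)
    seen: Set[str] = set()
    cumulative: Dict[int, int] = {}
    for day in sorted(cases_on_day):
        seen |= cases_on_day[day]
        cumulative[day] = len(seen)
    return cumulative


# One case reported 20 times from 20 cities on day 1,
# then two genuinely new cases on day 2.
reports = [(1, f"city-{n}", "case-A") for n in range(20)]
reports += [(2, "city-0", "case-B"), (2, "city-1", "case-C")]

naive_count = len(reports)         # counts reports, not cases
counts = distinct_cases_by_day(reports)
velocity = counts[2] - counts[1]   # new distinct cases between day 1 and day 2
print(naive_count, counts, velocity)
```

The naive count suggests 22 cases; the distinct count shows 3, with a velocity of 2 new cases per day.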
“Key Features” Enable Expert Counting
People          Cars                Router
Name            Make                Device ID
Address         Model               Make
Date of Birth   Year                Model
Phone           License Plate No.   Firmware Vers.
Passport        VIN                 Asset ID
Nationality     Owner               Etc.
Biometric       Etc.
Etc.
Consider Lying Identical Twins
PASSPORT: #123, Sue, 3/3/84, Uberstan, Exp 2011
PASSPORT: #123, Sue, 3/3/84, Uberstan, Exp 2011
Fingerprint
DNA
Most Trusted Authority: “Same person – trust me.”
Most Trusted Authority
The same thing cannot be in two places … at the same time.
Two different things cannot occupy the same space … at the same time.
Space & Time Enable Absolute Disambiguation
People          Cars                Router
Name            Make                Device ID
Address         Model               Make
Date of Birth   Year                Model
Phone           License Plate No.   Firmware Vers.
Passport        VIN                 Asset ID
Nationality     Owner               Etc.
Biometric       Etc.
Etc.
When            When                When
Where           Where               Where
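A rough sketch of how "when" and "where" can separate records whose other features are identical, as with the twins sharing one passport. The `Sighting` type, the `must_be_different` function, and the sample places and timestamps are hypothetical. The check only compares sightings with exactly equal timestamps, which is a simplifying assumption; a real system would also reason about travel time between places.

```python
from typing import List, NamedTuple


class Sighting(NamedTuple):
    label: str   # which record or document the sighting came from
    when: str    # timestamp, e.g. "2011-03-23T09:00"
    where: str   # location identifier, e.g. a city or grid cell


def must_be_different(a: List[Sighting], b: List[Sighting]) -> bool:
    """True if the two sighting lists put their subject in two places at once."""
    where_by_time = {s.when: s.where for s in a}
    return any(
        s.when in where_by_time and where_by_time[s.when] != s.where
        for s in b
    )


# Identical key features (same passport), but two places at the same moment.
twin_1 = [Sighting("passport #123", "2011-03-23T09:00", "New York")]
twin_2 = [Sighting("passport #123", "2011-03-23T09:00", "London")]
later = [Sighting("passport #123", "2011-03-24T09:00", "London")]

print(must_be_different(twin_1, twin_2))  # True
print(must_be_different(twin_1, later))   # False
```

A `False` result does not prove the records describe one entity; it only means space and time have not ruled it out.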
“Life Arcs” Are Also Telling
Bill Smith, 4/13/67, Salem, Oregon
Bill Smith, 4/13/67, Seattle, Washington
Address History
Tampa, FL 2008-2008
Biloxi, MS 2005-2008
NY, NY 1996-2005
Tampa, FL 1984-1996
Address History
San Diego, CA 2005-2009
San Fran, CA 2005-2005
Phoenix, AZ 1990-2005
San Jose, CA 1982-1990
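One way to read the two address histories is to look for periods in which they place a person in different cities. The sketch below uses the years from the slide. The function name and the rule of ignoring overlaps that consist of a single shared boundary year (a move can happen mid-year) are assumptions, and nothing here says which history belongs to which Bill Smith.

```python
from typing import List, Tuple

Residence = Tuple[str, int, int]          # (place, first_year, last_year)
Conflict = Tuple[str, str, int, int]      # (place_a, place_b, from_year, to_year)


def conflicting_periods(a: List[Residence], b: List[Residence]) -> List[Conflict]:
    """Periods in which the two histories name different places.

    Overlaps of a single shared year are ignored (assumed to be moves),
    so only overlaps with start < end are reported.
    """
    conflicts: List[Conflict] = []
    for place_a, start_a, end_a in a:
        for place_b, start_b, end_b in b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end and place_a != place_b:
                conflicts.append((place_a, place_b, start, end))
    return conflicts


history_1 = [
    ("Tampa, FL", 2008, 2008),
    ("Biloxi, MS", 2005, 2008),
    ("NY, NY", 1996, 2005),
    ("Tampa, FL", 1984, 1996),
]
history_2 = [
    ("San Diego, CA", 2005, 2009),
    ("San Fran, CA", 2005, 2005),
    ("Phoenix, AZ", 1990, 2005),
    ("San Jose, CA", 1982, 1990),
]

conflicts = conflicting_periods(history_1, history_2)
for conflict in conflicts:
    print(conflict)
```

With this data the histories conflict across most of 1984 to 2008, which suggests two different people despite the matching name and date of birth.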
OMG
Space-Time-Travel
Cell phones are generating a staggering amount of geo-locational data – 600B transactions per day being created in the US alone
This data is being “de-identified” and shared with third parties – in volume and in real-time
Your movement quickly reveals where you spend your time (e.g., evenings vs. working hours) and who you spend your time with
Re-identification (figuring out who is who) is somewhat trivial
Space-Time-Travel is Prediction Super-Food
Prediction, with 87% certainty, of where you will be next Thursday at 5:35pm
Names of the top 10 people you co-locate with, not at home and not at work
The Uberstan intelligence service preempts the next mass protest in real-time
A political opponent is crushed and resigns two days after announcing their candidacy
Consequences
Space-time-travel data is the ultimate biometric
It will enable enormous opportunity
It will unravel one’s secrets
It will challenge existing notions of privacy
And it’s here now, with more to come
Surveillance society is irresistible.
And you are doing it.
Location-based services (GPS), free email, Facebook, etc.
2 Big Data Trends
Trend: Time Is Of The Essence
The better the predictions … the faster they will be wanted.
“Why did we have to wait until the end of the day for the smart answer?”
[Figure. Axes: Willingness to Wait (Day, Hour, 200ms; Batch to Real-Time); Relevance (Iffy to Totally)]
Trend: Growing Tolerance for Non-Repeatability
It appears the market is becoming more tolerant of one-time results that cannot be easily repeated or substantiated
[Figure. Axis: Accountable and Repeatable. Labels: Yesterday; Now; Going Forward; Payroll]
Trend: Be Careful What You Wish For
6:34pm Recommendation: Shoot it
6:35pm Action Taken: Bang. Dead
6:36pm Recommendation: Oops. Send Flowers
[Figure. Axis: Accountable and Repeatable. Labels: Yesterday; Now; Going Forward]
Closing Thoughts
[Figure. Axes: Time; Computing Power Growth. Labels: Sensemaking Algorithms; Available Observation Space; Context]
Wish This On The Adversary
Context Accumulation: The Way Forward
[Figure. Axes: Time; Computing Power Growth. Labels: Sensemaking Algorithms; Available Observation Space; Context; Context Accumulation]
Related Blog Posts
Big Data. New Physics.
Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
Puzzling: How Observations Are Accumulated Into Context
Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems
Your Movements Speak for Themselves: Space-Time Travel Data is Analytic Super-Food!
Data Finds Data
General Purpose Sensemaking Systems and Information Colocation
Sensemaking on Streams – My G2 Skunk Works Project: Privacy by Design (PbD)
“G2”: My R&D Skunk Works Project
© 2011 IBM Corporation39
My G2 Goals
General purpose, real-time, sensemaking engine
Performs ‘information colocation’ over diverse data types, e.g., structured, unstructured, social, geospatial, queries, hypotheses, anonymized data and more
Exploiting the big data, new physics phenomenon
Delivers “data finds data, relevance finds you”
Engineered for grid compute for massive scalability
– Dreaming about: 1T rows for breakfast – then sustaining 1M context accumulating observations per second
– While new observations reverse earlier assertions
Privacy by Design (PbD) – a number of exciting privacy and civil liberties enhancing features baked-in, by design