jeff jonas big data new physics
DESCRIPTION
Big Data 12.3.14
TRANSCRIPT
Big Data. New Physics. And Geospatial “Superfood”
© 2014 IBM Corporation
Jeff Jonas, IBM Fellow
Chief Scientist, Context Computing
Email: [email protected]
Blog: www.jeffjonas.typepad.com
Twitter: http://www.twitter.com/jeffjonas
About the Speaker
Jeff Jonas
• IBM Fellow, Chief Scientist for Context Computing
• Founder and Chief Scientist of Systems Research & Development (SRD), acquired by IBM in 2005
• Has been designing, building and deploying entity resolution systems for three decades
• This technology is used today by defense & intelligence, financial institutions, humanitarian efforts and more
• Today: Primarily focused on ‘sensemaking on streams’ with special attention towards privacy and civil liberties protections
“The data must find the data and the relevance must find the user.”
Trend: Organizations Are Getting Dumber
[Chart: Computing Power Growth over Time. Labels: Available Observation Space, Sensemaking Algorithms, Context, Enterprise Amnesia.]
WHY?
Algorithms at Dead End.
You Can’t Squeeze Knowledge Out of a Pixel.
Context, definition
Better understanding something by taking into account the things around it.
I ducked as the bat flew my way.
Another exciting baseball game …
Information in Context … and Accumulating
[Diagram labels: Top 200 Customer; Job Applicant; Twitter Influencer; AML Investigation; LinkedIn Career History]
The Puzzle Metaphor
• Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors
• What it represents is unknown – there is no picture on hand
• Is it one puzzle, 15 puzzles, or 1,500 different puzzles?
• Some pieces are duplicates, missing, incomplete, low quality, or have been misinterpreted
• Some pieces may even be professionally fabricated lies
• Until you take the pieces to the table and attempt assembly, you don’t know what you are dealing with
270 pieces (90%)
200 pieces (66%)
150 pieces (50%)
6 pieces (2%)
30 pieces (10%, duplicates)
Puzzling Images: Courtesy Ravensburger © 2011
First Discovery
More Data Finds Data
Duplicates in Front Of Your Eyes
First Duplicate Found Here
Incremental Context – Incremental Discovery
6:40pm START
22min “Hey, this one is a duplicate!”
35min “I think some pieces are missing.”
37min “Looks like a bunch of hillbillies on a porch.”
44min “Hillbillies, playing guitars, sitting on a porch, near a barber sign … and a banjo!”
(150 pieces, 50%)
47min “We should take the sky and grass off the table.”
2hr “Let’s switch sides, and see if we can make sense of this from different perspectives.”
2hr10m “Wait, there are three … no, four puzzles.”
2hr17m “We need a bigger table.”
2hr18m “I think you threw in a few random pieces.”
How Context Accumulates
• With each new observation … one of three assertions is made: 1) un-associated; 2) placed near like neighbors; or 3) connected
• Must favor the false negative
• New observations sometimes reverse earlier assertions
• Some observations produce novel discovery
• The emerging picture helps focus collection interests
• As the working space expands, computational effort increases
• Given sufficient observations, there can come a tipping point
• Thereafter, confidence improves while computational effort decreases!
[Chart: Unique Identities versus Observations. Labels: Overstated Population, True Population.]
Counting Is Difficult
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
The Rise and Fall of a Population
[Chart: Unique Identities versus Observations. Label: True Population.]
Data Triangulation
File 1: Mark Smith, 6/12/1978, 443-43-0000
File 2: Mark R Smith, (707) 433-0000, DL: 00001234
New Record: Mark Randy Smith, 443-43-0000, DL: 00001234
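The triangulation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any IBM product: it assumes two records belong to the same identity whenever they share an exact SSN or driver's license value, and the field names (`ssn`, `dl`) and the `resolve` helper are hypothetical.

```python
def resolve(records, keys=("ssn", "dl")):
    """Group records that share any exact identifier value.

    Returns a list of sets, one set of record labels per resolved identity.
    """
    entities = []  # each entity: {"names": set of labels, "ids": set of (key, value)}
    for label, rec in records:
        ids = {(k, rec[k]) for k in keys if rec.get(k)}
        # Every existing entity that shares an identifier with the new record.
        hits = [e for e in entities if e["ids"] & ids]
        merged = {"names": {label}, "ids": set(ids)}
        for e in hits:
            merged["names"] |= e["names"]
            merged["ids"] |= e["ids"]
            entities.remove(e)
        entities.append(merged)
    return [e["names"] for e in entities]


file_1 = ("File 1", {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"})
file_2 = ("File 2", {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"})
new_record = ("New Record", {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"})

before = resolve([file_1, file_2])              # no shared identifier yet
after = resolve([file_1, file_2, new_record])   # the new record bridges both files
print(len(before), len(after))
```

With only the two files the count is two identities; once the third record arrives the count falls to one, which is the rise and fall of a population in miniature.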
Big Data [in context]. New Physics.
• More data: better predictions
– Lower false positives
– Lower false negatives
• More data: bad data good
– Suddenly glad your data is not perfect
• More data: less compute
Big Data
Pile of ____ Information In Context
One Form of Context: “Expert Counting”
• Is it 5 people each with 1 account … or is it 1 person with 5 accounts?
• Is it 20 cases of H1N1 in 20 cities … or one case reported 20 times?
• If one cannot count … one cannot estimate vector or velocity (direction and speed).
• Without vector and velocity … prediction is nearly impossible.
Entity Resolution Demonstration
DECEASED PERSON: George Balston, YOB: 1951, SSN: 5598, DOD: 1995
VOTER: George F Balston, YOB: 1951, D/L: 4801, 13070 SW Karen Blvd Apt 7, Beaverton, OR 97005, Last voted: 2008
When it comes to best practices in voter matching, a match on only name and year of birth is insufficient proof of a match. Many different people in the U.S. share a name and year of birth.
Human review is required.
Unfortunately, there can be many thousands of cases just like this, and state election offices don’t have the staff or budget to manually review them all.
Now Consider This Tertiary DMV Record
DECEASED PERSON: George Balston, YOB: 1951, SSN: 5598, DOD: 1995
VOTER: George F Balston, YOB: 1951, D/L: 4801, 13070 SW Karen Blvd Apt 7, Beaverton, OR 97005, Last voted: 2008
DMV: George F Balston, YOB: 1951, SSN: 5598, D/L: 4801, 3043 SW Clementine Blvd Apt 210, Beaverton, OR 97005
The DMV record contains enough features to match the voter (name, year of birth and driver’s license) and/or the deceased person’s record (name, year of birth and SSN). For the sake of argument, let’s say it matches the voter best.
Features Accumulate
DECEASED PERSON: George Balston, YOB: 1951, SSN: 5598, DOD: 1995
VOTER: George F Balston, YOB: 1951, D/L: 4801, 13070 SW Karen Blvd Apt 7, Beaverton, OR 97005, Last voted: 2008
DMV: George F Balston, YOB: 1951, SSN: 5598, D/L: 4801, 3043 SW Clementine Blvd Apt 210, Beaverton, OR 97005
The voter/DMV record now shares a name, year of birth and SSN with the deceased person. In voter matching best practices, this evidence would be sufficient to determine that this voter is likely deceased. This case no longer needs human review.
Useful Insight Revealed!
As features accumulate, it becomes possible to resolve previously unresolvable identity records.
As events and transactions accumulate – detection of relevance improves.
Here we can see that George, who died in 1995, voted in 2008.
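The demonstration above can be mirrored in a short Python sketch. It is a simplified illustration, not the actual matching logic behind the demo: the rule "name + year of birth needs review, name + year of birth + a shared identifier is a match" is taken from the slides, while the helpers `same_name`, `evidence` and `merge`, the field names, and the loose name comparison are assumptions made for the example.

```python
def same_name(a, b):
    """Loose name check: first and last tokens agree, ignoring case and middle initials."""
    ta, tb = a.lower().split(), b.lower().split()
    return ta[0] == tb[0] and ta[-1] == tb[-1]


def evidence(a, b):
    """Classify a candidate pair as 'match', 'review' or 'no match'."""
    if not (same_name(a["name"], b["name"]) and a.get("yob") == b.get("yob")):
        return "no match"
    # Name + year of birth alone is not enough; a shared identifier is also required.
    for key in ("ssn", "dl"):
        if a.get(key) and a.get(key) == b.get(key):
            return "match"
    return "review"


def merge(a, b):
    """Accumulate features: the merged entity keeps every feature seen on either record."""
    merged = dict(b)
    merged.update({k: v for k, v in a.items() if v})
    return merged


deceased = {"name": "George Balston", "yob": 1951, "ssn": "5598", "dod": 1995}
voter = {"name": "George F Balston", "yob": 1951, "dl": "4801", "last_voted": 2008}
dmv = {"name": "George F Balston", "yob": 1951, "ssn": "5598", "dl": "4801"}

first_pass = evidence(voter, deceased)        # name + YOB only
voter_dmv = merge(voter, dmv) if evidence(voter, dmv) == "match" else voter
second_pass = evidence(voter_dmv, deceased)   # now name + YOB + SSN
print(first_pass, second_pass)
```

The voter record alone lands in the human review pile; after it absorbs the DMV record's SSN, the same comparison against the deceased person resolves without review.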
Expert Counting: Degrees of Difficulty
Exactly Same: Bob Jones 123455 / Bob Jones 123455
Fuzzy: Bob Jones 123455 / Robert T Jonnes 000123455
Incompatible Features: Bob Jones 123455 / bjones@hotmail
Deceit: Bob Jones 123455 / Ken Wells 550119
Deceit Detection Using Context Accumulation
Deceit: Bob Jones 123455 / Ken Wells 550119
Feature Accumulation
Robert Jones, 123455, POB 13452, DOB 03/12/73
Ken Wells550119POB 999911DOB 03/12/[email protected]
[email protected] 03/12/73Robert Jones123455Ken Wells550119
Resolved!
DOB 03/12/73
Bob JonesPOB [email protected]
Skilled adversaries use “channel separation” to avoid detection.
Cell Phone #1: Unknown
Cell Phone #2: Unknown
Passport #1: William A.
Bank Acct #1: Billy K.
Detection requires “channel consolidation.”
William A. aka Billy K.
• Cell Phone #1
• Cell Phone #2
• Bank Acct #1
• Passport #1
Take Note
To catch clever criminals, one must ...
1) Collect observations the adversary doesn’t know you have
2) Or, be able to perform compute over your observations in a manner the adversary cannot fathom
InfoSphere Identity Insight v8
New Think About Expert Counting
Exactly Same: Bob Jones 123455 / Bob Jones 123455
Fuzzy: Bob Jones 123455 / Robert T Jonnes 000123455
Incompatible Features: Bob Jones 123455 / bjones@hotmail
Deceit: Bob Jones 123455 / Ken Wells 550119
Key Features Enable Expert Counting
People          Cars                 Router
Name            License Plate No.    Serial Number
Address         VIN                  MAC Address
Date of Birth   Make                 IP Address
Phone           Model                Make
Passport        Year                 Model
Nationality     Color                Firmware Version
Biometric       Etc.                 Etc.
Etc.
Consider Lying Identical Twins
PASSPORT: #123, Sue, 3/3/84, Uberstan, Exp 2011
PASSPORT: #123, Sue, 3/3/84, Uberstan, Exp 2011
Fingerprint
DNA
Most Trusted Authority: “Same person – trust me.”
• The same thing cannot be in two places … at the same time.
• Two different things cannot occupy the same space … at the same time.
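These two rules translate directly into a comparison function. The Python sketch below is a minimal illustration, assuming each sighting carries an exact timestamp and a location precise enough that only one thing fits in it; the `Sighting` type, the `compare` helper, and the sample gates and times are hypothetical.

```python
from collections import namedtuple

# A sighting of something: what was presented, when, and where.
Sighting = namedtuple("Sighting", "label when where")


def compare(a, b):
    """Apply the two space-time rules to a pair of sightings.

    Returns 'different', 'same' or 'unknown'.
    """
    if a.when != b.when:
        return "unknown"      # the rules say nothing about different moments
    if a.where != b.where:
        return "different"    # one thing cannot be in two places at once
    return "same"             # two things cannot occupy one space at once


# Hypothetical example: the same passport presented at two gates at the same moment.
a = Sighting("passport #123", "2011-05-01T09:00", "gate A")
b = Sighting("passport #123", "2011-05-01T09:00", "gate B")
print(compare(a, b))
```

Identical documents and even identical biometrics cannot override the first rule: two simultaneous sightings in different places must be two different people.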
Space & Time Enables Absolute Disambiguation
People          Cars                 Router
When            When                 When
Where           Where                Where
Name            License Plate No.    Serial Number
Address         VIN                  MAC Address
Date of Birth   Make                 IP Address
Phone           Model                Make
Passport        Year                 Model
Nationality     Color                Firmware Version
Biometric       Etc.                 Etc.
Etc.
“Life Arcs” Are Also Telling
Bill Smith, 4/13/67, Salem, Oregon
Address History
Tampa, FL 2008-2008
Biloxi, MS 2005-2008
NY, NY 1996-2005
Tampa, FL 1984-1996

Bill Smith, 4/13/67, Seattle, Washington
Address History
San Diego, CA 2005-2009
San Fran, CA 2005-2005
Phoenix, AZ 1990-2005
San Jose, CA 1982-1990
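A rough way to compare two life arcs is to count the years in which the two address histories never name the same city. The Python sketch below uses the histories from this slide at year granularity; the helpers `cities_by_year` and `conflicting_years` are hypothetical names, and a real system would work with finer-grained dates.

```python
def cities_by_year(history):
    """Map each calendar year to the set of cities listed for it."""
    years = {}
    for city, start, end in history:
        for year in range(start, end + 1):
            years.setdefault(year, set()).add(city)
    return years


def conflicting_years(history_a, history_b):
    """Years covered by both histories in which they name no city in common."""
    a, b = cities_by_year(history_a), cities_by_year(history_b)
    return sorted(y for y in a.keys() & b.keys() if not (a[y] & b[y]))


first = [("Tampa, FL", 2008, 2008), ("Biloxi, MS", 2005, 2008),
         ("NY, NY", 1996, 2005), ("Tampa, FL", 1984, 1996)]
second = [("San Diego, CA", 2005, 2009), ("San Fran, CA", 2005, 2005),
          ("Phoenix, AZ", 1990, 2005), ("San Jose, CA", 1982, 1990)]

conflicts = conflicting_years(first, second)
print(len(conflicts), conflicts[0], conflicts[-1])  # 25 1984 2008
```

Same name, same date of birth, yet every one of the 25 shared years places the two histories in different cities, which is strong evidence of two different people.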
OMG
Space-Time-Travel
• Cell phones are generating a staggering amount of geo-locational data – 600B transactions per day being created in the US alone
• This data is being “de-identified” and shared with third parties – in volume and in real-time
• Your movement quickly reveals where you spend your time (e.g., evenings vs. working hours)
• Re-identification (figuring out who is who) is somewhat trivial
• And, oh so powerful predictions …
The 10 People I Spend the Most Time With (Not at Home and Not at Work)
1. Michelle J
2. Renee M
3. Peggy M
4. Erin E
5. Joshua J
6. Ivan X
7. Bob Y
8. Amanda H
9. Dane J
10. Wesley R
He must be following me!
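A list like this one falls out of simple co-location counting. The Python sketch below is illustrative only: the ping format `(person, hour, cell)`, the `top_companions` helper, and all sample data are invented for the example, and real geo-locational data would need proper time windows and distance thresholds.

```python
from collections import Counter, defaultdict


def top_companions(pings, me, exclude_cells, n=10):
    """Rank the people who most often share a (time, place) cell with `me`."""
    present = defaultdict(set)  # (hour, cell) -> people seen there
    for person, hour, cell in pings:
        present[(hour, cell)].add(person)
    counts = Counter()
    for (hour, cell), people in present.items():
        # Skip home and work, and any cell where `me` was not present.
        if me in people and cell not in exclude_cells:
            counts.update(people - {me})
    return counts.most_common(n)


# Invented sample pings: (person, hour of day, location cell)
pings = [
    ("me", 8, "home"), ("A", 8, "home"),
    ("me", 12, "cafe"), ("B", 12, "cafe"),
    ("me", 18, "gym"), ("B", 18, "gym"), ("C", 18, "gym"),
    ("me", 19, "park"), ("B", 19, "park"),
]
ranking = top_companions(pings, "me", exclude_cells={"home", "work"})
print(ranking)  # [('B', 3), ('C', 1)]
```

Someone who keeps appearing in the same cells at the same times, away from home and work, rises to the top of the ranking whether or not the two people know each other.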
Consequences
• Space-time-travel data is the ultimate biometric
• It will enable enormous opportunity
• It will unravel one’s secrets
• It will challenge existing notions of privacy
• Adoption is now accelerating at a blistering pace
[Theatrical Pause]
The G2 | Sensemaking Project
The G2 Vision
1) Evaluate each new observation against previous observations.
2) Determine if what is being observed is relevant.
3) Deliver this actionable insight to its consumer … fast enough to do something about it while it is still happening.
4) Do this with sufficient accuracy and scale to really matter.
Uniquely G2
• Real “Context Computing”
– Complete Context: Contextualize diverse observations, each observation benefiting from others
– Current Context: Real-time, incremental integration
– Conflicting Context: High tolerance for disagreement, confusion and uncertainty
– Self-Correcting Context: New observations able to reverse earlier assertions
• Engineered ground-up for cloud compute … in support of hemisphere-scale data
• Introduce new data sources (e.g., geospatial), new entity types (e.g., vessels), new features (e.g., MAC addresses) … without schema change/re-engineering
• From sense to respond in sub-200ms – fast enough to do something about the transaction while it is still happening
• Unprecedented number of Privacy by Design (PbD) features baked-in
Privacy by Design (PbD)
1. Full Attribution
2. Tamper Resistant Audit Log
3. Information Transfer Accounting
4. Data Tethering
5. False Negative Favoring
6. Self-Correcting False Positives
7. Analytics on Anonymized Data
http://jeffjonas.typepad.com/jeff_jonas/2012/06/privacy-by-design-in-the-era-of-big-data.html
Example: Self-Correcting False Positive
1: John T Smith Jr, 123 Main Street, 703 111-2000, DOB: 03/12/1984
2: John T Smith, 123 Main Street, 703 111-2000, DL: 009900991
A plausible claim these two people are the same
3: John T Smith Sr, 123 Main Street, 703 111-2000, DL: 009900991
Until this record comes into view
Which reveals this is a FALSE POSITIVE
Example: Self-Correcting False Positive
1: John T Smith Jr, 123 Main Street, 703 111-2000, DOB: 03/12/1984
3: John T Smith Sr, 123 Main Street, 703 111-2000, DL: 009900991
2: John T Smith, 123 Main Street, 703 111-2000, DL: 009900991
New Best Practice: FIXED IN REAL-TIME (not end of month)
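The reversal in this example can be sketched by re-scoring an earlier record whenever a new one arrives. The Python below is a toy illustration, not how G2 works internally: the scoring rule (count of shared features, with disagreeing generational suffixes treated as a conflict), the threshold of two shared features, and the helpers `score` and `best_partner` are all assumptions made for the sketch.

```python
FEATURES = ("address", "phone", "dl", "dob")


def score(a, b):
    """Count shared feature values, or return -1 when generational suffixes disagree."""
    if a.get("suffix") and b.get("suffix") and a["suffix"] != b["suffix"]:
        return -1
    return sum(1 for f in FEATURES if a.get(f) and a.get(f) == b.get(f))


def best_partner(rec_id, records, threshold=2):
    """Return the id of the best-scoring other record, or None if none reaches the threshold."""
    best, best_score = None, threshold - 1
    for other_id, other in records.items():
        if other_id == rec_id:
            continue
        s = score(records[rec_id], other)
        if s > best_score:
            best, best_score = other_id, s
    return best


records = {
    1: {"name": "John T Smith", "suffix": "Jr", "address": "123 Main Street",
        "phone": "703 111-2000", "dob": "03/12/1984"},
    2: {"name": "John T Smith", "address": "123 Main Street",
        "phone": "703 111-2000", "dl": "009900991"},
}
before = best_partner(2, records)   # record 2 plausibly goes with record 1

# Record 3 comes into view and shares a driver's license with record 2.
records[3] = {"name": "John T Smith", "suffix": "Sr", "address": "123 Main Street",
              "phone": "703 111-2000", "dl": "009900991"}
after = best_partner(2, records)    # record 2 now goes with record 3
print(before, after)
```

Because the placement of record 2 is re-evaluated the moment record 3 arrives, the earlier assertion is reversed immediately rather than waiting for a periodic batch cleanup.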
Use Cases
• Maritime Domain Awareness
New system lets authorities track suspicious ships
http://www.asiaone.com/print/News/Latest%2BNews/Science%2Band%2BTech/Story/A1Story20130703-434337.html
• Voter Registration Modernization
David Becker (Pew Charitable Trusts) and Jeff Jonas (IBM) Discuss How G2 Has Helped Modernize Voter Registration in America
http://ibmreferencehub.com/STG/ibm_executive_edge_2013/#gensession_daytwo_jonasbecker
Closing Thoughts
Wish This on the Adversary
[Chart: Computing Power Growth over Time. Labels: Available Observation Space, Sensemaking Algorithms, Context, Enterprise Amnesia.]
Wish This for Yourself: Better Sensemaking Skills
[Chart: Computing Power Growth over Time. Labels: Available Observation Space, Sensemaking Algorithms, Context.]
State of the Union: Isolated Analytics
[Diagram labels: Observation Space; Structured Data Analytics; Unstructured Data Analytics; Social Network Analytics; Action]
The Future: General Purpose Context Accumulation
Data Finds Data
Relevance Finds You
This is G2
[Diagram labels: Observation Space; Information In Context; Consumer (An analyst, a system, the sensor itself, etc.)]
The most competitive organizations are going to make sense of what they are observing fast enough to do something about it while they are observing it.
Related Blog Posts
Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
Puzzling: How Observations Are Accumulated Into Context
Big Data. New Physics.
On A Smarter Planet … Some Organizations Will Be Smarter-er Than Others
Your Movements Speak for Themselves: Space-Time Travel Data is Analytic Super-Food!
When Federated Search Bites
Data Finds Data
Structuring Unstructured Data
Fantasy Analytics
Questions?
Email: [email protected]
Blog: www.jeffjonas.typepad.com
Twitter: http://www.twitter.com/jeffjonas