Big Data and Future Networks: A Perspective from the United States
Hisashi Kobayashi (小林久志)
Princeton University and
National Institute for Information and Communications Technology
2 Big Data and Future Network Design Hisashi Kobayashi
Acknowledgments
Prof. Tadao Saito, Toyota Info Technology Center
Dr. Nozomu Nishinaga, Mr. Masahiro Kiyokawa, and Mr. Hiroaki Yano, NICT
Prof. Mung Chiang, Princeton University
Dr. Evangelos Eleftheriou, IBM Zurich Research Lab
Mr. Kaiser Fung, author of “Numbers Rule Your World”
Dr. Kazuo Iwano, Mitsubishi Corporation
Prof. Brian L. Mark, George Mason University
Prof. Dipankar Raychaudhuri, Rutgers University
Prof. Phuoc Tran-Gia, University of Würzburg
Prof. Howard Wactlar, CMU and NSF CISE Directorate
Prof. Philip Yu, University of Illinois at Chicago
Outline
• How Much Information? How Big is Data?
• President Obama’s Open Government Initiative
• President Obama’s Big Data Initiative
• Big Data in Science and Technology Research
  - NITRD Program, NSF, DARPA, DOE
• Big Data in Enterprises
• Call for Data Science and Data Scientists
• Big Data and Networks
• References
Source: The World of Data (by IBM): http://adamov.net.ru/images/share/the-world-of-large-scale-data-processing.jpg
How Much Data was Out There? [Kobayashi et al. 2005]
Online (disk drives, file systems): 300 Petabytes
Offline (magnetic tape, CDs): 8 Exabytes
Analog data (paper, film, videotape): 200 Exabytes
Petabyte: 1,000,000,000,000,000 bytes, or 10^15 bytes
Exabyte: 1,000,000,000,000,000,000 bytes, or 10^18 bytes
Source: http://www.sims.berkeley.edu/research/projects/how-much-info-2003
cf. 2003 report by a U.C. Berkeley research group.
Some Big Numbers
kilo   10^3
Mega   10^6
Giga   10^9
Tera   10^12
Peta   10^15
Exa    10^18
Zetta  10^21
Yotta  10^24
• 0.43 x 10^18 seconds: The age of the universe (13.77 billion years).
• 5 Exabytes: All words ever spoken by human beings (in text). Roy Williams (Caltech, 1993)
• 21 Exabytes/month: Global Internet traffic in 2007. Padmasree Warrior (Cisco, March 2010)
• 160 Exabytes: Digital information created, captured, and replicated worldwide in 2007 (International Data Corporation, 2007)
• 42 Zettabytes: All words ever spoken by human beings (if digitized as 16 kHz 16-bit audio). Mark Liberman (U. Penn, 2003)
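The prefixes and the first figure above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the helper names (`to_bytes`, `humanize`) are hypothetical, and the length of a year (365.25 days) is an assumption.

```python
# Decimal (SI) prefixes from the table above, as powers of ten.
PREFIX_EXPONENTS = {
    "kilo": 3, "Mega": 6, "Giga": 9, "Tera": 12,
    "Peta": 15, "Exa": 18, "Zetta": 21, "Yotta": 24,
}


def to_bytes(value, prefix):
    """Convert e.g. (2.5, "Exa") into a plain number of bytes."""
    return value * 10 ** PREFIX_EXPONENTS[prefix]


def humanize(n_bytes):
    """Return (value, prefix) using the largest prefix that keeps value >= 1."""
    best = (n_bytes, "")
    # Dict order is ascending by exponent, so the last match is the largest.
    for prefix, exponent in PREFIX_EXPONENTS.items():
        if n_bytes >= 10 ** exponent:
            best = (n_bytes / 10 ** exponent, prefix)
    return best


# Age of the universe in seconds, assuming a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
age_of_universe_s = 13.77e9 * SECONDS_PER_YEAR

if __name__ == "__main__":
    print(f"{age_of_universe_s / 1e18:.2f} x 10^18 seconds")
    print(humanize(to_bytes(300, "Peta")))
```

The same helpers can be applied to the other quantities on the slide, e.g. `to_bytes(42, "Zetta")`.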
Source: Asigra Info Graphic: http://thumbnails.visually.netdna-cdn.com/big-data-infographic_504f4d2f5bd2f.jpg
Source: The Retailer's Guide: http://venturebeat.files.wordpress.com/2012/11/retailersbigdata_final.png
Source: http://www.weforum.org/reports/personal-data-emergence-new-asset-class (January 2011, Davos, Switzerland)
• Raw data has little value by itself.
• We must process data and extract information in a usable form.
  - Big Data tools, e.g., Apache Hadoop, MapReduce
  - “Data Science” (data mining, machine learning)
  - Need for advancing statistical analysis techniques that are “scalable.”
• We must then turn the information into valuable action, e.g., Amazon.com, a better government.
Every day, we create 2.5 quintillion (10^18) bytes (i.e., 2.5 Exabytes) of data, so much that 90% of the data in the world today has been created in the last two years alone. [IBM]
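One way to read the IBM figure: if 90% of all data was created in the last two years, then the total two years ago was 10% of today's total, i.e., a tenfold increase in two years. The sketch below works out the implied annual growth factor. It assumes a constant growth rate over the two years, which is a simplification not stated in the source, and the function name is hypothetical.

```python
import math


def implied_annual_growth(recent_share, years):
    """Growth factor per year implied by `recent_share` of all data
    having been created within the last `years` years.

    Assumes constant exponential growth (a simplifying assumption).
    """
    if not 0 <= recent_share < 1:
        raise ValueError("recent_share must be in [0, 1)")
    # Ratio of today's total to the total `years` ago.
    total_growth = 1.0 / (1.0 - recent_share)
    return total_growth ** (1.0 / years)


# 90% created in the last 2 years -> 10x in 2 years -> about 3.16x per year.
factor = implied_annual_growth(0.90, 2)
doubling_time_years = math.log(2) / math.log(factor)

if __name__ == "__main__":
    print(f"annual growth factor: {factor:.2f}")
    print(f"doubling time: {doubling_time_years:.2f} years")
```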
Open Government Initiative
My administration is committed to creating an unprecedented level of openness in Government. We will work together to ensure the public trust and establish a system of transparency, public participation, and collaboration. Openness will strengthen our democracy and promote efficiency and effectiveness in Government.
---- President BARACK OBAMA, 01/21/09
• Government should be transparent
  - Transparency promotes accountability and provides information to citizens.
• Government should be participatory
  - Knowledge is widely dispersed in society, and public officials benefit from having access to that knowledge.
• Government should be collaborative
  - We should use innovative tools, methods and systems to cooperate with nonprofit organizations, businesses, and individuals in the private sector.
Open Government Directive
1. Publish Government Information Online
2. Improve the Quality of Government Information
3. Create and Institutionalize a Culture of Open Government
4. Create an Enabling Policy Framework for Open Government
-- Peter R. Orszag, Director, Office of Management and Budget, 12/8/09
http://www.whitehouse.gov/sites/default/files/omb/assets/memoranda_2010/m10-06.pdf
Source: Howard Wactlar, NSF CISE Directorate, at NIST Big Data Meeting, June 2012
President Obama’s “Big Data Initiative”
• To advance state-of-the-art technologies to collect, store, preserve, manage, analyze and share Big Data.
• To accelerate the pace of discovery in science and engineering, strengthen national security, and transform teaching and learning.
• To expand the work force needed to develop and use Big Data technologies.
More than $200 million in new commitments through six Federal departments and agencies.
  - Announced by the Office of Science and Technology Policy (OSTP) on March 29, 2012
NITRD (Networking and Information Technology Research and Development) Program
• Provides a framework in which many Federal agencies coordinate their R&D efforts on networking and IT.
• Operates under the aegis of the NITRD Subcommittee of the National Science and Technology Council (NSTC)’s Committee on Technology.
• The National Coordination Office (NCO) supports the NITRD Program by providing technical expertise, planning and coordination, and by serving as the Program’s central point of contact.
The NITRD Program’s focus:
• Big Data (BD)
• Cyber Physical Systems (CPS)
• Cyber Security and Information Assurance (CSIA)
• Health Information Technology R&D (Health IT R&D)
• Human Computer Interaction and Information Management (HCI&IM)
• Etc.
Source: Howard Wactlar, NSF CISE Directorate, at NIST Big Data Meeting, June 2012
XDATA Program
• Invest $25 million/year
• Develop computational techniques and software tools for both semi-structured (e.g., tabular, relational, categorical, meta-data) and unstructured (e.g., text documents, message traffic) data.
  - Scalable algorithms for processing imperfect data in distributed data stores
  - Effective human-computer interaction tools for rapidly customizable visual reasoning
DOE’s Scalable Data Management, Analysis and Visualization (SDAV) Institute ($25 million over 5 years)
Project Leader: Dr. Arie Shoshani, Lawrence Berkeley National Laboratory
http://sourcedigit.com/700-big-data-market-size-forecasts-2012-17/
The Big Data market will exceed $50B worldwide by 2017.
The Big Data Market: IDC Japan’s Forecast
2011: 14.25 billion yen; 2012: 19.7 billion yen; 2016: 76.5 billion yen
The current Big Data market is only slightly more than 0.1% of the overall IT market of 13 trillion yen.
Source: http://www.microsoft.com/ja-jp/sqlserver/2012/big-data/default.aspx
Another Forecast is much Bigger (by an order of magnitude)
Big Data: The Management Revolution
• Success story of Amazon.com: 30-40% annual growth in 2008-2012 [HBR]
• Data Analytics (DA) will replace the HiPPO (HiPPO = Highest Paid Person’s Opinion) [HBR]
• Data analysts (or data scientists) are in short supply.
[HBR]: Harvard Business Review, October 2012: http://hbr.org/archive-toc/BR1210; Japanese edition: Diamond Harvard Business Review, “The First Year of Big Data Competition,” February 2013
Big Data in Enterprises (cont’d)
• Big Data exceeds the processing capacity of conventional relational database systems.
• Big Data primarily addresses the database (DB)/data warehousing (DWH) aspect of data analysis.
• “Apache Hadoop” is the first technology for Big Data.
  - Distributed data storage
  - Analysis algorithms for parallel data
• A distributed computational framework that can process a wide range of datasets.
• High-performance parallel data processing using MapReduce.
• Reliable data storage using the Hadoop Distributed File System (HDFS).
  - Querying follows the NoSQL (“Not only SQL”) approach.
• Typical users seem obsessed with the “quantity,” not the “quality,” of data. More thought should be given to how to collect and select data [Kaiser Fung].
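The MapReduce model named above can be illustrated with the classic word-count example. This is a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real Hadoop job would run the map and reduce steps in parallel on data stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big networks"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be distributed across many machines, which is the property the framework exploits.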
How to handle the 3 Vs [IBM]
1. Volume:
  - Massively parallel processing (e.g., Greenplum data computing)
  - Distributed computing platform (e.g., Apache Hadoop)
2. Velocity:
  - Processing of “streaming data” to keep storage requirements practical (e.g., Large Hadron Collider at CERN)
  - Instantaneous response in some applications (e.g., financial trading)
3. Variety:
  - Need to deal with diverse data types and sources (e.g., text from SNS, data from sensors, image data, GPS data from mobile phones, etc.)
[IBM] http://www-01.ibm.com/software/data/bigdata
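The "Velocity" point, processing streaming data so that storage stays practical, can be illustrated with a one-pass statistic that never stores the stream itself. The sketch below uses Welford's online algorithm for the running mean and variance; it is an illustrative choice, not a technique named in the source, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm).

    Memory use is constant regardless of how many values are seen.
    """

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator); 0.0 if n < 2."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [2, 4, 4, 4, 5, 5, 7, 9]:   # stand-in for a sensor stream
        stats.update(reading)
    print(stats.n, stats.mean, stats.variance)
```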
Big Data Platform
• Data Warehousing (DWH): Store large volumes of information from multiple sources.
• Hadoop-based Analytics: Reduce the cost of analyzing massive data.
• Unstructured Database (as well as RDB) and NoSQL
• Stream Computing: Continuously analyze data to take action in real time.
• Text Analytics (or Text Mining): Analyze the textual content of unstructured information, using information retrieval, data mining, machine learning, statistics and computational linguistics.
• Data Visualization Tools (or Infographics): Real-time processing and dashboard presentation, e.g., Tableau [http://www.tableausoftware.com/], Spotfire [http://spotfire.tibco.jp/], etc.
Some Vendors of Big Data Tools
• Greenplum: http://en.wikipedia.org/wiki/Greenplum
  - Founded in 2003
  - Acquired by EMC in 2010
• Netezza: http://en.wikipedia.org/wiki/Netezza
  - Founded in 2000
  - Acquired by IBM in 2010 for $1.7B
• SPSS: http://ja.wikipedia.org/wiki/SPSS
  - Founded in 1968
  - Acquired by IBM in 2009 for $1.2B
• Vertica (acquired by HP)
• Oracle, SAP and Microsoft also provide Big Data tools
For Japan, see Nikkei Computer, January 10, 2013 issue.
Call for Better DATA SCIENCE and More DATA SCIENTISTS
• Try to gain “insights” from data, instead of presenting all collected data.
• Study and extend “classical statistical techniques”:
  - Exploratory Data Analysis (EDA)
  - Time Series Analysis
  - Hidden Markov Models (HMMs)
  - Bayesian Statistics and MCMC
  - etc.
• “Scalable” algorithms and analytics, e.g., the PageRank algorithm (an efficient algorithm to compute eigenvectors of a Markov transition matrix)
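The PageRank remark above can be made concrete with power iteration, which repeatedly applies the Markov transition matrix to a probability vector until it settles on the stationary (dominant) eigenvector. The sketch below is a small illustrative version in plain Python, not a production implementation: the damping factor 0.85, the tolerance, and the uniform handling of pages without out-links are assumptions, not details from the source.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on the 'random surfer' Markov chain.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            for v in outs:
                new_rank[v] += damping * rank[u] / len(outs)
        # Stop when the vector no longer changes (L1 distance below tol).
        if sum(abs(new_rank[u] - rank[u]) for u in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(web).items()):
        print(page, round(score, 4))
```

Each iteration touches every link once, so the cost per step grows linearly with the number of links, which is what makes the method usable on very large graphs.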
Important Subfields of Data Mining
• Data stream mining [Aggarwal]
  - Computer network traffic
  - Web searches
  - Sensor data
• Graph mining [Aggarwal]
  - Web data
  - Social network analysis
  - Bioinformatics
C. C. Aggarwal (Ed.), Data Streams: Models and Algorithms, Kluwer Academic Publishers
C. C. Aggarwal and H. Wang (Eds.), Managing and Mining Graph Data, Springer
Source: http://blogs.itmedia.co.jp/business20/2012/10/post-2438.html
Japan’s Serious Shortage of Data Scientists
Number of new graduates with knowledge of data analysis (statistics, machine learning, etc.) (2008):
United States 24,730; China 17,410; India 13,270; Japan 3,400. (Annual change: +10.4% in China, -5.3% in Japan.)
Number of SAS (Statistical Analysis System) certified professionals:
United States 10,544; India 5,907; South Korea 1,381; United Kingdom 1,242; Japan 800.
Number of SAS certified professionals per GDP (United States = 100):
United States 100; India 458; South Korea 177; United Kingdom 73; Japan 20.
Source: Diamond Harvard Business Review, Feb. 2013
[McKinsey] “Big data: The next frontier for innovation, competition and productivity,” McKinsey & Co., May 2011
Source: What happens in an Internet Minute? (by Intel): http://www.intel.com/content/dam/www/public/us/en/images/illustrations/embedded-infographic-600-logo.jpg
Big Data vs. Networks
• Networks to cope with Big Data.
  - Sufficient storage, bandwidth and processing
• Big Data to help design and manage networks.
  - Better performance, reliability and security
• Big Data and networks for a better world.
  - Transparent government, law enforcement
  - Risk management
  - Innovative applications for value creation, e.g., user behavior tracking and marketing (privacy and security are critical)
“Cloud Computing & Networking”: A Platform for Big Data
• Cloud computing offers on-demand access to a shared pool of configurable resources.
• Big Data requires a novel approach to meet its storage and processing requirements.
• The Cloud can make big data (analytics) accessible to those who could not otherwise use it.
• Disk storage performance can be a problem when storage is shared by various users.
OpenFlow and FLARE § will help Data Centers handle Big Data
• Help control the connectivity of Data Centers for big data analytics via virtualization.
• Especially useful in a “multi-tenant Data Center” environment.
• Facilitate load balancing among Data Centers.
§ FLARE: Deeply Programmable Network (DPN) architecture by Aki Nakao
“ID/Locator Separation” and “Context-oriented Service” for Big Data
• Here “contexts” means “data attributes,” e.g., identity, group association, time, location, etc.
• “Data Centric Networking” (also called “Named Data Networking” or NDN) appears to be a suitable approach to Big Data.
• But its performance implications are unclear.
• The GUID (Globally Unique ID) of MobilityFirst also facilitates context-oriented service.
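The idea of ID/locator separation with context-oriented lookup can be sketched as a toy name-resolution table: the identifier stays fixed while the locator (network address) may change, and objects can also be found by their attributes. This is a conceptual illustration only; it is not the MobilityFirst or NDN API, and the class `NameResolver`, its methods, and the sample identifiers and addresses are all hypothetical.

```python
class NameResolver:
    """Toy mapping from a stable identifier (GUID) to a changeable locator."""

    def __init__(self):
        self._locators = {}   # guid -> current network address (locator)
        self._contexts = {}   # guid -> dict of context attributes

    def register(self, guid, locator, **context):
        """Bind an identifier to a locator and record its context attributes."""
        self._locators[guid] = locator
        self._contexts[guid] = dict(context)

    def move(self, guid, new_locator):
        """The object changes location; its identity stays the same."""
        if guid not in self._locators:
            raise KeyError(guid)
        self._locators[guid] = new_locator

    def resolve(self, guid):
        """Return the current locator for an identifier."""
        return self._locators[guid]

    def find(self, **context):
        """Return GUIDs whose attributes match every given key/value pair."""
        return sorted(
            guid for guid, attrs in self._contexts.items()
            if all(attrs.get(k) == v for k, v in context.items())
        )


if __name__ == "__main__":
    resolver = NameResolver()
    resolver.register("guid-1", "10.0.0.5", group="sensors", location="tokyo")
    resolver.register("guid-2", "10.0.1.9", group="sensors", location="osaka")
    resolver.move("guid-1", "192.168.7.2")   # same identity, new locator
    print(resolver.resolve("guid-1"))
    print(resolver.find(group="sensors"))
```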
Optical Technologies: Fast Transport and Processing of Big Data
• Integrated optical paths and optical packets of the AKARI Architecture.
• Silicon nanophotonics technology
  - Integrates optical and electrical circuits on a single silicon chip, using a 90 nm CMOS fabrication line.
cf. IBM press release, Dec 10, 2012: http://www-03.ibm.com/press/us/en/pressrelease/39641.wss
Additional Issues that Future Network Architectures should Address:
• Interface to databases
  - Data is increasingly unstructured and heterogeneous
  - Requires fast processing and transportation
• The database community and the networking community should interact.
  - No FIA project addresses database issues
• Service layer for Big Data applications
References
[Kobayashi et al. 2005] H. Kobayashi, Francois Dolivo, E. Eleftheriou, “35 Years of Progress in Digital Magnetic Recording,” 2005 Eduard Rhein Technology Award Lecture.
[IBM] http://www-01.ibm.com/software/data/bigdata
[UCB] http://www.sims.berkeley.edu/research/projects/how-much-info-2003
[McKinsey] “Big data: The next frontier for innovation, competition and productivity,” McKinsey & Co., May 2011, http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
[IBM] “IBM Lights Up Silicon Chips to Tackle Big Data,” press release, Dec 10, 2012, http://www-03.ibm.com/press/us/en/pressrelease/39641.wss
Appendix
• Big Data across the Federal Government (4)
• NITRD’s Focus (2)
• NSF-NIH Initiative (2)
• McKinsey Global Institute’s Report (2)
• 2012 Summer Olympic Games’ Big Numbers
• Data Never Sleeps (Fortune Magazine, 7/2012)
• Twitter 2012
• Big Data for Healthcare
Big Data Across the Federal Government (March 29, 2012)
• Department of Defense (DOD)
  Defense Advanced Research Projects Agency (DARPA)
  - Anomaly Detection at Multiple Scales (ADAMS) program
  - Cyber-Insider Threat (CINDER) program
• Department of Homeland Security (DHS)
  - Center of Excellence on Visualization and Data Analytics
• Department of Energy (DOE)
  - Advanced Scientific Computing Research (ASCR)
  - High Performance Storage System (HPSS)
• Department of Veterans Affairs (VA)
  - Consortium for Healthcare Informatics Research (CHIR)
  - Corporate Data Warehouse (CDW)
  - Genomic Information System for Integrated Science (GenISIS)
• Department of Health and Human Services (HHS)
  Centers for Disease Control & Prevention (CDC)
  - BioSense 2.0 program
  Centers for Medicare & Medicaid Services (CMS)
  - A data warehouse based on Hadoop is being developed.
  - Use of XML database technologies is being evaluated.
  Food & Drug Administration (FDA)
  - Virtual Laboratory Environment (VLE)
• National Archives & Records Administration (NARA)
  - Cyberinfrastructure for a Billion Electronic Records (CI-BER)
• National Aeronautics & Space Administration (NASA)
  - Earth Science Data and Information System (ESDIS)
  - Global Earth Observation System of Systems (GEOSS)
  - Planetary Data System (PDS)
  - Multimission Archive at Space Telescope Science Institute (MAST)
• National Endowment for the Humanities (NEH)
  - Digging into Data Challenge
• National Institutes of Health (NIH)
  - The Cancer Imaging Archive (TCIA)
  - Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)
  - Neuroscience Information Framework (NIF)
  - Structural Genomics Initiative
  - Worldwide Protein Data Bank (wwPDB)
  - Biomedical Informatics Research Network (BIRN)
  - Collaborative Research in Computational Neuroscience (CRCNS)
• National Science Foundation (NSF)
  - Core Techniques and Technologies for Advancing Big Data Science & Engineering
  - Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)
  - Data and Software Preservation for Open Science (DASPOS)
  - Computational and Data-enabled Science and Engineering (CDS&E) in Mathematical and Statistical Science (CDS&E-MSS)
  - Open Science Grid (OSG)
  - Theoretical and Computational Astrophysics Networks (TCAN)
• National Security Agency (NSA)
  - Vigilant Net: A Competition to Foster and Test Cyber Defense Situational Awareness at Scale
  - NSA/CSS Commercial Solutions Center (NCSC)
• United States Geological Survey (USGS)
  - John Wesley Powell Center for Analysis and Synthesis
The NITRD Program’s focus:
• Big Data (BD)
• Cyber Physical Systems (CPS)
• Cyber Security and Information Assurance (CSIA)
• Health Information Technology R&D (Health IT R&D)
• Human Computer Interaction and Information Management (HCI&IM)
The NITRD Program’s focus – cont’d:
• High Confidence Software and Systems (HCSS)
• High End Computing (HEC)
• Large Scale Networking (LSN)
• Software Design and Productivity (SDP)
• Social, Economic, and Workforce Implications of IT and IT Workforce Development (SEW)
• Wireless Spectrum Research and Development (WSRD)
NSF-NIH Big Data Initiative
• Eight (8) fundamental research projects on Big Data were announced on October 3, 2012.
• Typically, one to three investigators per project.
• Total of $15 million, i.e., about $1.9 million per project on average.
1. “Eliminating the Data Ingestion Bottleneck in Big-Data Application,” M. Farach-Colton (Rutgers) and M. Bender (Stony Brook)
2. “DataBridge - A Sociometric System for Long-Tail Science Data Collection,” A. Rajasekar (Univ. of N.C.), G. King (Harvard) and J. Zhan (NC Agricultural & Technical State Univ.)
3. “A Formal Foundation for Big Data Management,” D. Suciu (Univ. of Washington)
4. “Analytical Approaches to Massive Data Computation with Applications to Genomics,” E. Upfal (Brown)
5. “Distribution-based Machine Learning for High-dimensional Datasets,” A. Singh (CMU)
6. “Genomes Galore - Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing,” S. Aluru (Iowa State), O. Olukotun (Stanford) and W. Feng (Virginia Tech)
7. “Big Tensor Mining: Theory, Scalable Algorithms and Applications,” C. Faloutsos (CMU) and N. Sidiropoulos (U. of Minnesota)
8. “Discovery and Social Analytics for Large-Scale Scientific Literature,” P. Kantor (Rutgers), T. Joachims (Cornell) and D. Blei (Princeton)
Source: Big Data at London Summer Games 2012: http://www.cloudtweaks.com/web/content//big-data-infographic1.jpg
Source: How much data is generated every minute: http://blogs-images.forbes.com/davefeinleib/files/2012/07/Big-Data-Infographic.jpg
Source: Facts about Twitter: http://blog.sironaconsulting.com/.a/6a00d8341c761a53ef016767bafa2c970b-pi