Data Science, Big Data and Analytics: Present and Future
Perspectives from Academia, Industry and Consulting
Zoran Bursac, PhD, MPH
Josh Callaway, MS, MPH
Fernando Lopez, MS, PhD Candidate
![Page 3: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/3.jpg)
Data
Health care, business, technology -> data
Big data -> voluminous data sets (structured or unstructured)
Produced every day all around us
Analytics -> examining data to detect patterns
![Page 4: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/4.jpg)
Big Data Analytics and Data Science
Different sources, different sizes
High variety, volume, velocity
Online networks, web pages, audio/video, social media, logs
Techniques -> machine learning, data mining, natural language processing, statistics
Extraction, preparation, storage/warehousing, blending, analytics
![Page 5: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/5.jpg)
Real-time Benefits
Healthcare
Banking
Energy
Tech
Consumer
Education
![Page 6: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/6.jpg)
Current Trends and Common Data Science Tools
Process, Perform and Visualize
Data sourcing -> Hadoop HDFS
Data storing -> Oracle, MySQL
Data conversion -> Sqoop
Data transformation -> Hive
Exploratory analysis -> Elasticsearch, KNIME
Model building -> R, SAS, Python, Julia
Visualization -> Tableau, ggplot2
Model execution -> Hadoop, Java, Spark, C#
Version control -> Git, Bitbucket
IDE -> RStudio
Text for coding -> Jupyter Notebook, R Shiny
![Page 7: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/7.jpg)
Free Open Source
Hadoop -> distributed processing of large data across clusters
Hive -> warehouse to manage large data in distributed SQL storage
Kafka -> real-time pipeline of streaming data
Pig -> large data analytics
R, RStudio, ggplot2 -> analytics and data visualization
Python, Julia -> high-level programming with efficient algorithms and speed for large data processing
Jupyter Notebook -> manage and share documents that combine code and explanatory text
RapidMiner -> data preparation, machine learning and model deployment
![Page 8: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/8.jpg)
Do you need to know them all? NO
Hadoop R SQL Python Hive Pig etc…
![Page 9: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/9.jpg)
Big data analytics for personalized medicine and pharmacogenomics
• Cirillo and Valencia (2019). Machine learning algorithms for multi-view data analysis. Data from multiple sources (genomic, proteomic, metabolomic) are used to identify associations within and between multiple sets of patients, and to generate models for patient clustering.
![Page 10: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/10.jpg)
The Latest Buzzwords
Data science
Artificial intelligence
Machine learning
Data mining
Big data
Data warehouse
Data lake
Cloud computing
Hadoop
Internet of things
![Page 11: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/11.jpg)
Local Environment
CSV files: diagnosis.csv, demographic.csv, lab.csv, medication.csv, procedure.csv
RStudio Local: insertDB.R
AWS RDS MySQL
Tables: demographic, diagnosis, lab, procedure, medication
AWS EC2 Instance: Shiny Server
AWS EC2 Instance: Microsoft R Open
Online Published Web Apps with URLs
Model Results: results.csv, insertDB.R
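The slide's own script, insertDB.R, is not reproduced in the transcript. As a rough illustration of the same idea (loading local CSV files into database tables and then querying them with SQL), here is a minimal Python sketch. It makes several assumptions: an in-memory `sqlite3` database stands in for AWS RDS MySQL, the sample rows and column names (`patient_id`, `age`, `sex`, `diagnosis`) are hypothetical, and the helper `load_csv_into_table` is invented for this example.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for demographic.csv and diagnosis.csv;
# the column names are assumptions, not taken from the slides.
DEMOGRAPHIC_CSV = """patient_id,age,sex
1,54,F
2,61,M
3,47,F
"""

DIAGNOSIS_CSV = """patient_id,diagnosis
1,diabetes
2,hypertension
3,diabetes
"""


def load_csv_into_table(conn, table, csv_text):
    """Create `table` from the CSV header, insert every row, return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Assumption: sqlite3 in memory is a local stand-in for the remote MySQL database.
conn = sqlite3.connect(":memory:")
n_demo = load_csv_into_table(conn, "demographic", DEMOGRAPHIC_CSV)
n_diag = load_csv_into_table(conn, "diagnosis", DIAGNOSIS_CSV)

# Once loaded, the tables can be joined with ordinary SQL.
diabetes_patients = conn.execute(
    """
    SELECT d.patient_id, d.age
    FROM demographic AS d
    JOIN diagnosis AS x ON x.patient_id = d.patient_id
    WHERE x.diagnosis = 'diabetes'
    ORDER BY d.patient_id
    """
).fetchall()
print(n_demo, n_diag, diabetes_patients)
```

All columns are stored as text here to keep the loader generic; a real schema would declare proper types and keys, and a remote database would need a network driver and credentials in place of `sqlite3.connect`.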
![Page 12: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/12.jpg)
www.Kaggle.com
![Page 13: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/13.jpg)
Diabetes
![Page 14: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/14.jpg)
Step 1. Local
Environment
![Page 15: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/15.jpg)
Step 2. Insert into
Remote Database
![Page 16: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/16.jpg)
Step 3. SQL Database
![Page 17: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/17.jpg)
Step 4. R Shiny Web App
![Page 18: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/18.jpg)
Step 4. R Shiny Web App
![Page 19: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/19.jpg)
Local Storage
Text Files (.txt)
Comma Separated Value Files (.csv)
Excel Database (.xlsx)
Microsoft Access Database (.accdb)
![Page 20: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/20.jpg)
Remote Storage
Dropbox
Google Sheets
AWS S3 Bucket
Cloud Database
• MySQL
• MongoDB
• PostgreSQL
• NoSQL
• Oracle
![Page 21: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/21.jpg)
Cloud Platforms
![Page 23: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/23.jpg)
Microsoft Azure
![Page 25: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/25.jpg)
DigitalOcean
![Page 26: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/26.jpg)
Distributed Computing Technology
Processing jobs (scripts in R, Python, Scala, etc.) are divided among all available processing units (computers, cores, threads, etc.)
Hadoop
Spark
Microsoft R Open
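The idea of dividing one job among several processing units can be sketched in a few lines of Python. This is only an illustration, not how Hadoop, Spark or Microsoft R Open work internally: it uses threads from the standard library as the "processing units", and the function names and the sum-of-squares workload are hypothetical choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(chunk):
    """The unit of work each worker performs on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Divide the job among workers, then combine the partial results."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


data = list(range(1, 1001))
total = parallel_sum_of_squares(data, n_workers=4)
print(total)
```

The split, work and combine steps are the part that carries over to cluster frameworks. For CPU-bound work in Python, separate processes (for example `ProcessPoolExecutor`) would usually be a better fit than threads.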
![Page 27: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/27.jpg)
Microsoft R Open
Multithreaded Performance
![Page 28: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/28.jpg)
Machine Learning
![Page 29: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/29.jpg)
Microsoft Azure Power BI
![Page 30: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/30.jpg)
Microsoft Azure Power BI
![Page 31: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/31.jpg)
Microsoft Azure Power BI
![Page 32: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/32.jpg)
Solution Architectures at Sanitas
Analytics on big data
Data warehouse
Real-time analytics
Population Health Management for Healthcare
![Page 33: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/33.jpg)
Local Storage
BUSINESS INTELLIGENCE STRATEGY
Sanitas USA
Data Sources
IT Alignment
SANITAS USA – Strategic plan
Data Governance
- Standards and Policies
- Data Quality (DQ)
- Data Security and Privacy
- Architecture/Integration
- DW and Business Intelligence (BI)
- Self-service Architectures
Data Types
ETL/ELT workflows
Cloud data warehouse
Ensure Data Quality
Dashboards, KPIs and Metrics
eCW Report Delivery
Reports
Security, Roles and Audit
Visualize and Report
Operations
Information Analysis for Business Units
Clinical, Financial, Operational
Visualized Analytics
Training and Predictive Experimentation
Hadoop, Spark, Hive, LLAP, Kafka, Storm, R
Apache Spark-based analytics
![Page 34: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/34.jpg)
ML Studio: Experiment
![Page 35: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/35.jpg)
ML Studio: Dataset
![Page 36: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/36.jpg)
ML Studio: Model Train
![Page 37: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/37.jpg)
ML Studio: Model Score
![Page 38: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/38.jpg)
ML Studio: Model Evaluation
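The ML Studio slides above step through a dataset, a train step, a score step and an evaluation step. The screenshots are not available in the transcript, so the following is only a generic sketch of that four-stage workflow in plain Python. Everything in it is an assumption made for illustration: the synthetic two-feature dataset, the 70/30 split and the nearest-centroid classifier are not taken from the presentation.

```python
import random

# Hypothetical synthetic dataset: two features per record, binary label.
# Nothing here comes from the slides' actual data.
random.seed(0)


def make_dataset(n_per_class=100):
    """Dataset step: draw points around one centre per class."""
    data = []
    for label, center in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            point = (random.gauss(center[0], 1.0), random.gauss(center[1], 1.0))
            data.append((point, label))
    random.shuffle(data)
    return data


def train(rows):
    """Train step: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for (x, y), label in rows:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (sx / counts[label], sy / counts[label])
        for label, (sx, sy) in sums.items()
    }


def score(model, rows):
    """Score step: predict the class whose centroid is nearest to each point."""
    predictions = []
    for (x, y), _ in rows:
        nearest = min(
            model,
            key=lambda c: (x - model[c][0]) ** 2 + (y - model[c][1]) ** 2,
        )
        predictions.append(nearest)
    return predictions


def evaluate(rows, predictions):
    """Evaluate step: fraction of held-out rows predicted correctly."""
    correct = sum(1 for (_, label), pred in zip(rows, predictions) if pred == label)
    return correct / len(rows)


dataset = make_dataset()
split = int(0.7 * len(dataset))
train_rows, test_rows = dataset[:split], dataset[split:]
model = train(train_rows)
accuracy = evaluate(test_rows, score(model, test_rows))
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the four stages as separate functions mirrors the separate modules of a visual experiment: the model is fitted on one part of the data and evaluated only on rows it has not seen.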
![Page 39: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/39.jpg)
Power BI
![Page 40: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/40.jpg)
Rejoinder
Data -> our biggest asset
Emphasis on “good” data
Processing streams -> wrangling, carpentry
Storage -> lakes, warehouses, databases
Analytics -> mining, machine learning
Output -> new knowledge, information, inferences
Feedback to the users -> gather more data
![Page 41: Data Science, Big Data and Analytics: Present and Future€¦ · data. Pig->large data analytics. R, Rstudio, ggplot-> analytics and data visualization. Python, Julia -> high level](https://reader034.vdocuments.site/reader034/viewer/2022050105/5f43ed35462f437b1d1236e5/html5/thumbnails/41.jpg)
Questions for the global audience
• What are your needs, and where are you currently with respect to data collection, quality, storage, and analytical and computing power?
• Where is your data coming from: a single source or multiple sources?
• Who is maintaining data quality and fidelity?
• Do you have adequate storage with proper security, and are you planning for the future?
• Are you investing in resources and trained personnel with data science skills?