The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity
TRANSCRIPT
T13 Big Data 10/6/16 13:30
The Four V's of Big Data Testing: Variety, Volume, Velocity, and Veracity
Presented by:
Jaya Bhagavathi Bhallamudi
Tata Consultancy Services
Brought to you by:
350 Corporate Way, Suite 400, Orange Park, FL 32073 · 888-268-8770 · 904-278-0524 · [email protected] · http://www.starwest.techwell.com/
Jaya Bhagavathi Bhallamudi
A senior consultant in the assurance services unit of Tata Consultancy Services, Jaya Bhagavathi Bhallamudi heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development on Big Data technologies. Jaya has been in the test automation, testing services, and solutions innovation space for fifteen of her seventeen years in IT. She enjoys building test automation frameworks and accelerators for various testing services. Contact Jaya at [email protected] or on LinkedIn.
1 | Copyright © 2016 Tata Consultancy Services Limited
The Four V’s of Big Data Testing: Variety, Volume, Velocity & Veracity
October 6, 2016 TCS Confidential | Copyright © 2016 Tata Consultancy Services Limited
Jayabhagavathi Bhallamudi – Head, Big Data COE, TCS
With you today…
• Jaya is a Senior Consultant at TCS and currently heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development
• Jaya has 18+ years of experience in the IT industry, with 15+ years in test automation and testing services and solutions innovation
• Jaya holds a Master’s degree in Computer Applications from Osmania University, Hyderabad, India
Jayabhagavathi Bhallamudi, Head – Big Data Testing COE, Assurance Services, TCS
TCS Confidential Information – Not to be shared

Today we will cover…
1. Need for Big Data Assurance
2. Tester’s Dilemma
3. Framework to tackle the problem

Big Data Analytics
Non-traditional internal data & uncontrolled external data
Complex non-traditional analytical models
[Diagram: INPUT → OUTPUT]

Garbage in equals Garbage out
IN = OUT
Increased Risk

How this impacts your business
Bad Data → Incorrect Processing → Wrong Insights → Business / Brand Image Losses

Appropriate Big Data Assurance ensures
Good Data → Reliable Processing → Relevant Actionable Insights → Business Growth
Scope in terms of data flow
Ingestion
Integration
Migration
Homogenization
Standardization
Storage
Analytics
Apps Insights
Raw Data
Transformed Data

Focus in terms of V’s (Big Data)
• VOLUME: TBs
• VARIETY: RDBMS, txt, xml, json, bson, orc, rc…
• VELOCITY: Performance
• VARIABILITY: Inconsistency
• VERACITY: Reliability
• VALUE: Relevancy
Ingestion
Integration
Migration
Homogenization
Standardization
Storage
Analytics
Apps Insights
When to focus on which ‘V’?
Or should we focus on all V’s all the time?

Step 1: Understand the architecture of the integrated data enterprise

Step 1: Understand the architecture
[Architecture diagram]
Non-Hadoop: Databases, Files, Near real-time data streams
Hadoop:
• HDFS (Raw data)
• HIVE / HBASE (Standardized data)
• HIVE / HBASE (Data for creating analytical models)
• HIVE / HBASE (Data for applying analytical models)
• Analytics
Downstream: DWHs, Apps, Analytics

Step 2: Identify testing interfaces
[Architecture diagram from Step 1, annotated with testing interfaces labeled a through n]

Step 3: Identify testing type relevant to the interface

Step 3: Identify testing type
Interface a: Databases → HDFS (Raw data)
Testing types @ a:
• Data ingestion testing
• Data migration testing
• Data integration testing

Step 3: Identify testing type
Interface b: Files → HDFS (Raw data)
Testing types @ b:
• Data ingestion testing
• Data migration testing
• Data integration testing

Step 3: Identify testing type
Interface c: Near real-time data streams → HDFS (Raw data)
Testing types @ c:
• Data ingestion testing
• Data integration testing

Step 3: Identify testing type
Interface d: HDFS (Raw data) → HIVE / HBASE (Standardized data)
Testing types @ d:
• Data homogenization testing

Step 3: Identify testing type
Interface e: HIVE / HBASE (Standardized data)
Testing types @ e:
• Data standardization testing

Step 3: Identify testing type
Interface f: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for creating analytical models)
Testing types @ f:
• Data migration testing
• Data integration testing

Step 3: Identify testing type
Interface g: HIVE / HBASE (Data for creating analytical models)
Testing types @ g:
• Analytical model validation

Step 3: Identify testing type
Interface h: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for applying analytical models)
Testing types @ h:
• Data migration testing
• Data integration testing

Step 3: Identify testing type
Interface i: HIVE / HBASE (Data for applying analytical models)
Testing types @ i:
• Analytical model effectiveness testing

Step 3: Identify testing type
Interfaces j, k, l: Hadoop HIVE / HBASE (Data for applying analytical models) → Analytics (j), Apps (k), Analytics (l)
Testing types @ j, k, l:
• Data provision testing

Step 3: Identify testing type
Interface k: Hadoop HIVE / HBASE (Data for applying analytical models) → DWHs
Testing types @ k:
• Data migration testing
• Data ingestion testing
• Data integration

Step 3: Identify testing type
Interfaces n, o: DWHs, Apps, Analytics
Testing types @ n:
• Data Provisioning Testing
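
The Step 3 slides amount to a lookup from testing interface to testing types. A minimal sketch of that lookup, assuming a plain dictionary is an acceptable representation: the dictionary layout, the shortened type names, and the helper `interfaces_needing` are illustrative and not part of the talk. Interfaces j through o are left out because their labels are ambiguous in the slides.

```python
# Interface labels and testing types as listed on the Step 3 slides.
# The dictionary layout and the helper below are illustrative assumptions.
TESTING_TYPES_BY_INTERFACE = {
    "a": ["ingestion", "migration", "integration"],  # Databases -> HDFS (raw data)
    "b": ["ingestion", "migration", "integration"],  # Files -> HDFS (raw data)
    "c": ["ingestion", "integration"],               # Near real-time streams -> HDFS
    "d": ["homogenization"],                         # HDFS -> HIVE / HBASE (standardized)
    "e": ["standardization"],                        # HIVE / HBASE (standardized)
    "f": ["migration", "integration"],               # Standardized -> model-creation data
    "g": ["analytical model validation"],            # Model-creation data
    "h": ["migration", "integration"],               # Standardized -> model-application data
    "i": ["analytical model effectiveness"],         # Model-application data
}


def interfaces_needing(testing_type):
    """Return the interface labels where the given testing type applies."""
    return sorted(
        label
        for label, types in TESTING_TYPES_BY_INTERFACE.items()
        if testing_type in types
    )
```

For example, `interfaces_needing("ingestion")` lists the three interfaces that feed raw data into HDFS, which gives a test lead a quick view of where one kind of test suite has to run.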

Step 4: Identify the V to be prioritized for the testing type

Step 4: Prioritize V’s
Data Ingestion Testing
• VARIETY: High priority for file-based data ingestions
• VELOCITY: High priority for real-time data ingestions
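
As an illustration of the two ingestion priorities, the sketch below pairs a variety check (the same records arriving as CSV and as JSON should parse to the same shape) with a velocity check (an ingest routine should sustain a minimum event rate). The function names, the two supported formats, and the throughput threshold are assumptions made for the example, not something the talk prescribes.

```python
import csv
import io
import json
import time


def parse_records(payload, fmt):
    """Parse a file payload into a list of dicts (two example formats only)."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "json":
        return [{k: str(v) for k, v in row.items()} for row in json.loads(payload)]
    raise ValueError(f"unsupported format: {fmt}")


def variety_check(payloads):
    """Variety: every format must yield the same record count and field names."""
    shapes = set()
    for fmt, payload in payloads.items():
        rows = parse_records(payload, fmt)
        fields = tuple(sorted(rows[0])) if rows else ()
        shapes.add((len(rows), fields))
    return len(shapes) == 1


def velocity_check(ingest, events, min_events_per_sec):
    """Velocity: time a hypothetical ingest callable against a throughput floor."""
    start = time.perf_counter()
    for event in events:
        ingest(event)
    elapsed = time.perf_counter() - start
    return len(events) / max(elapsed, 1e-9) >= min_events_per_sec
```

In a real pipeline the `ingest` callable would wrap the actual ingestion client, and the threshold would come from the performance requirements of the stream.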

Step 4: Prioritize V’s
Data Migration Testing
• VOLUME: High priority for historical data migrations
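
For historical migrations, a volume-focused test usually has to compare complete data sets rather than samples. One possible approach, sketched here under the assumption that rows can be streamed from both sides as tuples, is to reconcile row counts and an order-independent checksum. The helper names are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of rows.

    The checksum sums per-row hashes modulo 2**64, so it does not depend on
    row order and rows can be streamed without holding them in memory.
    """
    count = 0
    checksum = 0
    for row in rows:
        encoded = "|".join(map(str, row)).encode("utf-8")
        row_hash = int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")
        checksum = (checksum + row_hash) % 2**64
        count += 1
    return count, checksum


def migration_reconciles(source_rows, target_rows):
    """True when source and target agree on row count and checksum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

A mismatch only says that the two sides differ; locating the offending rows would need a follow-up comparison on partitions or keys.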

Step 4: Prioritize V’s
Data Integration Testing
• VARIABILITY: Inconsistency / non-compliance checks
  High priority for data acquired from multiple sources to a single target
  High priority for data acquired from external sources like social media
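
A minimal sketch of what inconsistency and non-compliance checks could look like when several sources feed one target. The field names `customer_id` and `country`, the required-field rule, and the source names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical compliance rule: both fields must be present and non-empty.
REQUIRED_FIELDS = ("customer_id", "country")


def find_issues(sources):
    """Check records from several sources before they land in one target.

    sources maps a source name to a list of record dicts. Returns a pair:
    (non_compliant records with their source, customer ids whose country
    differs between records).
    """
    non_compliant = []
    countries_by_customer = defaultdict(set)
    for name, records in sources.items():
        for record in records:
            if any(not record.get(field) for field in REQUIRED_FIELDS):
                non_compliant.append((name, record))
                continue
            countries_by_customer[record["customer_id"]].add(record["country"])
    conflicts = sorted(
        cid for cid, countries in countries_by_customer.items() if len(countries) > 1
    )
    return non_compliant, conflicts
```

External feeds such as social media tend to fail the compliance rule more often, which is one reason the slide gives them high priority.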

Step 4: Prioritize V’s
Data Homogenization Testing
• VARIETY: High priority for unstructured or semi-structured to structured data format conversions
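
To make the semi-structured to structured case concrete, the sketch below flattens nested JSON documents into dotted column names and flags documents that do not map exactly onto an expected column set. The flattening convention and the helper names are assumptions for illustration; an actual Hive or HBase schema would dictate the real mapping.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, using dotted column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def count_leaves(obj):
    """Count non-dict values in a nested dict."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    return 1


def homogenization_failures(raw_json_lines, expected_columns):
    """Return 1-based line numbers whose flattened form loses a value or
    does not match the expected structured columns."""
    failures = []
    for number, line in enumerate(raw_json_lines, start=1):
        doc = json.loads(line)
        row = flatten(doc)
        if set(row) != set(expected_columns) or len(row) != count_leaves(doc):
            failures.append(number)
    return failures
```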

Step 4: Prioritize V’s
Data Standardization Testing
• VOLUME: High priority for any pre-existing data to be checked for conformance to data standards & industry compliances
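
Checking pre-existing data for conformance means scanning every row rather than a sample. A minimal sketch, assuming the standards can be expressed as per-column regular expressions: the column names and patterns below are hypothetical.

```python
import re

# Hypothetical data standards: ISO-style dates and three-letter currency codes.
STANDARDS = {
    "order_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "currency": re.compile(r"[A-Z]{3}"),
}


def conformance_report(rows):
    """Scan all rows and count, per column, values that break the standard.

    rows can be any iterable of dicts, so a large table can be streamed.
    A missing column counts as a violation.
    """
    total = 0
    violations = {column: 0 for column in STANDARDS}
    for row in rows:
        total += 1
        for column, pattern in STANDARDS.items():
            if not pattern.fullmatch(str(row.get(column, ""))):
                violations[column] += 1
    return total, violations
```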

Step 4: Prioritize V’s
Analytical Model Validation (analytical models based on historical data)
• VOLUME: The entire historical data set is to be considered for testing, to identify data patterns that were not considered in the development of the model
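
One way to read "patterns not considered in development" is as feature-value combinations that occur in the full history but never appeared in the data the model was built on. The sketch below counts such combinations; this interpretation, the function name, and the feature names in the usage note are assumptions, not the speaker's stated method.

```python
from collections import Counter


def unseen_patterns(development_rows, historical_rows, features):
    """Count feature-value combinations that occur in the full historical
    data but are absent from the development sample."""
    known = {tuple(row[f] for f in features) for row in development_rows}
    unseen = Counter()
    for row in historical_rows:
        pattern = tuple(row[f] for f in features)
        if pattern not in known:
            unseen[pattern] += 1
    return unseen
```

Called with, say, `features=["channel", "segment"]`, the result shows which combinations the model has never seen and how often they occur in the history.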

Step 4: Prioritize V’s
Analytical Model Validation (analytical models not based on historical data)
• VERACITY: High priority to identify the data patterns that are not relevant for the business
• VALUE: High priority to identify the data patterns that do not bring any value to the business

Step 4: Prioritize V’s
Analytical Model Effectiveness Testing (if the actual data on which the model needs to be run is available)
• VOLUME: High priority to identify wrong predictions and unidentified data patterns
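
When actual outcomes are available, effectiveness testing can be sketched as running the model over all of the actual data and collecting the wrong predictions. The example below assumes the model is any callable that takes a feature dict and returns a label; the function name and that calling convention are hypothetical.

```python
def effectiveness(model, actual_rows, label_key):
    """Run a model over the actual data and collect wrong predictions.

    model is assumed to be a callable taking a feature dict and returning a
    label. Returns (accuracy, wrong), where wrong holds tuples of
    (features, predicted, actual).
    """
    wrong = []
    for row in actual_rows:
        features = {k: v for k, v in row.items() if k != label_key}
        predicted = model(features)
        if predicted != row[label_key]:
            wrong.append((features, predicted, row[label_key]))
    total = len(actual_rows)
    accuracy = (total - len(wrong)) / total if total else 0.0
    return accuracy, wrong
```

The list of wrong predictions is the useful output for a tester: grouping it by feature values points at the data patterns the model does not yet handle.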