The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

T13 Big Data 10/6/16 13:30

The Four V's of Big Data Testing: Variety, Volume, Velocity, and Veracity

Presented by: Jaya Bhagavathi Bhallamudi, Tata Consultancy Services

Brought to you by: 350 Corporate Way, Suite 400, Orange Park, FL 32073 · 888-268-8770 · 904-278-0524 · [email protected] · http://www.starwest.techwell.com/

Upload: techwell

Post on 22-Jan-2018



Jaya Bhagavathi Bhallamudi

A senior consultant in the assurance services unit of Tata Consultancy Services, Jaya Bhagavathi Bhallamudi heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development on Big Data technologies. Jaya has been in the test automation, testing services, and solutions innovation space for fifteen of her seventeen years in IT. She enjoys building test automation frameworks and accelerators for various testing services. Contact Jaya at [email protected] or on LinkedIn.

1 | Copyright © 2016 Tata Consultancy Services Limited

The Four V’s of Big Data Testing: Variety, Volume, Velocity & Veracity

October 6, 2016

Jayabhagavathi Bhallamudi – Head, Big Data COE, TCS

2

With you today…

• Jaya is a Senior Consultant at TCS and currently heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development

• Jaya has 18+ years of experience in the IT industry, with 15+ years in test automation and testing services & solutions innovation

• Jaya holds a Master's degree in Computer Applications from Osmania University, Hyderabad, India

Jayabhagavathi Bhallamudi, Head – Big Data Testing COE, Assurance Services, TCS

TCS Confidential Information – Not to be shared

3

Today we will cover…

1. Need for Big Data Assurance

2. Tester’s Dilemma

3. Framework to tackle the problem

4

BIG DATA, BIGGER DILEMMA

NEED FOR BIG DATA ASSURANCE

5

Big Data Analytics

Non-traditional internal data & uncontrolled external data

Complex non-traditional analytical models

[Diagram labels: INPUT, OUTPUT]

6

Garbage in equals Garbage out

[Diagram: IN, OUT = Increased Risk]

7

How this impacts your business

Bad Data

Wrong Insights

Business / Brand Image Losses

Incorrect Processing

8

Appropriate Big Data Assurance ensures

Good Data

Relevant Actionable Insights

Business Growth

Reliable Processing

9

BIG DATA, BIGGER DILEMMA

TESTER’S DILEMMA

10

Scope in terms of data flow

Ingestion

Integration

Migration

Homogenization

Standardization

Storage

Analytics

Apps

Insights

Raw Data

Transformed Data

11

Focus in terms of V’s

BIG DATA: VERACITY, VALUE, VELOCITY, VOLUME, VARIETY, VARIABILITY

Diagram labels: TBs; RDBMS, txt, xml, json, bson, orc, rc…; Inconsistency; Reliability; Relevancy; Performance

12

Ingestion

Integration

Migration

Homogenization

Standardization

Storage

Analytics

Apps

Insights

When should we focus on which ‘V’?

Or should we focus on all the V’s all the time?

13

A FRAMEWORK TO TACKLE THE PROBLEM

14

1. Understand the architecture of the integrated data enterprise

15

Step 1: Understand the architecture

[Architecture diagram, with zones labelled Hadoop and Non-Hadoop]

Sources: Databases, Files, Near real-time data streams

Data stores:

• HDFS (Raw data)
• HIVE / HBASE (Standardized data)
• HIVE / HBASE (Data for creating analytical models)
• HIVE / HBASE (Data for applying analytical models)

Consumers: DWHs, Apps, Analytics

16

2. Identify the testing interfaces

17

Step 2: Identify testing interfaces

[Architecture diagram, with zones labelled Hadoop and Non-Hadoop]

Sources: Databases, Files, Near real-time data streams

Data stores:

• HDFS (Raw data)
• HIVE / HBASE (Standardized data)
• HIVE / HBASE (Data for creating analytical models)
• HIVE / HBASE (Data for applying analytical models)

Consumers: DWHs, Apps, Analytics

Interface labels on the diagram: a, b, c, d, e, f, g, h, i, j, k, l, m, n
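One way to make Step 2 concrete is to write the architecture down as a list of data flows and treat each flow as a candidate testing interface. The sketch below is a minimal illustration, not the presenter's tooling: the flows are an interpretation of the flattened diagram, and the node names (`hdfs_raw`, `hive_standardized`, and so on) are labels invented for this example. Interfaces that sit on a single store rather than between two are left out.

```python
from collections import defaultdict

# Flows as read from the slides; node names are hypothetical short labels.
FLOWS = [
    ("databases", "hdfs_raw"),
    ("files", "hdfs_raw"),
    ("streams", "hdfs_raw"),
    ("hdfs_raw", "hive_standardized"),
    ("hive_standardized", "hive_model_creation"),
    ("hive_standardized", "hive_model_application"),
    ("hive_model_application", "analytics"),
    ("hive_model_application", "apps"),
    ("hive_model_application", "dwhs"),
    ("dwhs", "apps"),
    ("dwhs", "analytics"),
]


def interfaces_by_target(flows):
    """Group the upstream systems under the store or consumer they feed."""
    grouped = defaultdict(list)
    for source, target in flows:
        grouped[target].append(source)
    return dict(grouped)


if __name__ == "__main__":
    for target, sources in interfaces_by_target(FLOWS).items():
        print(f"{target} <- {', '.join(sources)}")
```

Grouping by target makes it easy to see which stores receive data from several upstream systems and therefore need the most interface coverage.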

18

3. Identify the testing type relevant to the interface

19

Step 3: Identify testing type

Interface a: Databases → HDFS (Raw data)

Testing types @ a:

• Data ingestion testing
• Data migration testing
• Data integration testing

20

Step 3: Identify testing type

Interface b: Files → HDFS (Raw data)

Testing types @ b:

• Data ingestion testing
• Data migration testing
• Data integration testing

21

Step 3: Identify testing type

Interface c: Near real-time data streams → HDFS (Raw data)

Testing types @ c:

• Data ingestion testing
• Data integration testing

22

Step 3: Identify testing type

Interface d: HDFS (Raw data) → HIVE / HBASE (Standardized data)

Testing types @ d:

• Data homogenization testing

23

Step 3: Identify testing type

Interface e: HIVE / HBASE (Standardized data)

Testing types @ e:

• Data standardization testing

24

Step 3: Identify testing type

Interface f: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for creating analytical models)

Testing types @ f:

• Data migration testing
• Data integration testing

25

Step 3: Identify testing type

Interface g: HIVE / HBASE (Data for creating analytical models)

Testing types @ g:

• Analytical model validation

26

Step 3: Identify testing type

Interface h: HIVE / HBASE (Standardized data) → HIVE / HBASE (Data for applying analytical models)

Testing types @ h:

• Data migration testing
• Data integration testing

27

Step 3: Identify testing type

Interface i: HIVE / HBASE (Data for applying analytical models)

Testing types @ i:

• Analytical model effectiveness testing

28

Step 3: Identify testing type

Interfaces j, k, l: Hadoop, HIVE / HBASE (Data for applying analytical models) → Analytics, Apps, Analytics

Testing types @ j, k, l:

• Data provision testing

29

Step 3: Identify testing type

Interface k: Hadoop, HIVE / HBASE (Data for applying analytical models) → DWHs

Testing types @ k:

• Data migration testing
• Data ingestion testing
• Data integration testing

30

Step 3: Identify testing type

Interfaces n, o: DWHs → Apps, Analytics

Testing types @ n:

• Data provisioning testing

31

4. Identify the V to be prioritized for the testing type

32

Step 4: Prioritize V’s

Data Ingestion Testing

VARIETY: High priority for file-based data ingestions

VELOCITY: High priority for real-time data ingestions
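As an illustration of what a variety-focused ingestion check might look like, the sketch below parses the same two records from CSV, JSON, and XML using only the Python standard library and reports any format whose parsed output differs from the expected records. The sample payloads, field names, and function names are assumptions made for this example; velocity checks for real-time ingestion are not covered here.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_records(fmt, text):
    """Parse one source payload into a list of {field: string} records."""
    if fmt == "csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if fmt == "json":
        return [{k: str(v) for k, v in rec.items()} for rec in json.loads(text)]
    if fmt == "xml":
        return [{child.tag: child.text for child in rec} for rec in ET.fromstring(text)]
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample payloads carrying the same two records.
SAMPLES = {
    "csv": "id,city\n1,Pune\n2,Delhi\n",
    "json": '[{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]',
    "xml": "<rows><row><id>1</id><city>Pune</city></row>"
           "<row><id>2</id><city>Delhi</city></row></rows>",
}
EXPECTED = [{"id": "1", "city": "Pune"}, {"id": "2", "city": "Delhi"}]


def check_variety(samples, expected):
    """Return the formats whose parsed records differ from the expected ones."""
    return [fmt for fmt, text in samples.items()
            if parse_records(fmt, text) != expected]


if __name__ == "__main__":
    print(check_variety(SAMPLES, EXPECTED))
```

Normalizing every value to a string before comparing keeps the check about content rather than about each format's native types.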

33

Step 4: Prioritize V’s

Data Migration Testing

VOLUME: High priority for historical data migrations
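For volume-heavy migrations, comparing source and target row by row is often impractical, so one common approach is to reconcile a row count and an order-independent checksum instead. The sketch below assumes rows arrive as tuples and that `|` does not occur inside field values; it is an illustration of the idea, not a method described in the slides.

```python
import hashlib


def fingerprint(rows):
    """Return (row count, order-independent checksum) for an iterable of rows."""
    count = 0
    combined = 0
    for row in rows:
        # Assumes "|" never appears inside a field value.
        digest = hashlib.sha256("|".join(map(str, row)).encode("utf-8")).digest()
        # XOR makes the result independent of row order. Note that a row
        # duplicated an even number of times cancels out of the checksum,
        # which is why the count is compared as well.
        combined ^= int.from_bytes(digest[:8], "big")
        count += 1
    return count, combined


def reconcile(source_rows, target_rows):
    """True when source and target agree on both count and checksum."""
    return fingerprint(source_rows) == fingerprint(target_rows)


if __name__ == "__main__":
    source = [(1, "a"), (2, "b"), (3, "c")]
    print(reconcile(source, list(reversed(source))))
```

Because `fingerprint` consumes rows one at a time, it can run over a cursor or file iterator without loading the full historical data set into memory.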

34

Step 4: Prioritize V’s

Data Integration Testing

VARIABILITY: Inconsistency / non-compliance checks

• High priority for data acquired from multiple sources to a single target
• High priority for data acquired from external sources like social media
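An inconsistency check for integrated data can be as simple as grouping records by a shared key and flagging keys for which the sources disagree. The sketch below assumes records are dictionaries; the source names, the `customer_id` key, and the `country` field are invented for this example.

```python
from collections import defaultdict


def find_inconsistencies(records, key, field):
    """Return {key value: sorted distinct field values} where sources disagree."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[key]].add(rec[field])
    return {k: sorted(values) for k, values in seen.items() if len(values) > 1}


# Hypothetical records merged from three sources into a single target.
MERGED = [
    {"source": "crm", "customer_id": 7, "country": "IN"},
    {"source": "social", "customer_id": 7, "country": "India"},
    {"source": "crm", "customer_id": 9, "country": "US"},
    {"source": "billing", "customer_id": 9, "country": "US"},
]

if __name__ == "__main__":
    print(find_inconsistencies(MERGED, "customer_id", "country"))
```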

35

Step 4: Prioritize V’s

Data Homogenization Testing

VARIETY: High priority for unstructured or semi-structured to structured data format conversions
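A homogenization check can verify that nothing is lost when a semi-structured document is converted into a structured row. The sketch below flattens a nested dictionary into dotted column names and lists the columns that are missing or changed in the target row. The dotted naming convention and the sample document are assumptions for illustration, and lists inside the document are treated as plain values.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def missing_columns(raw_doc, structured_row):
    """Columns of the raw document that are absent or changed in the target row."""
    expected = flatten(raw_doc)
    return sorted(
        col for col, val in expected.items()
        if col not in structured_row or structured_row[col] != val
    )


if __name__ == "__main__":
    raw = {"id": 1, "user": {"name": "A", "geo": {"city": "Pune"}}}
    print(missing_columns(raw, {"id": 1, "user.name": "A"}))
```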

36

Step 4: Prioritize V’s

Data Standardization Testing

VOLUME: High priority for any pre-existing data to be checked for conformance to data standards & industry compliance requirements
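Conformance of pre-existing data to a standard can be sketched as a set of per-column rules applied to every row. The rules below (a two-letter country code and an ISO-style date) are purely illustrative; real rules would come from the project's own data standards and compliance requirements.

```python
import re

# Illustrative rules only, not taken from the slides.
RULES = {
    "country": re.compile(r"[A-Z]{2}"),
    "signup_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def non_conforming(rows, rules):
    """Yield (row index, column, value) for every value that breaks its rule."""
    for index, row in enumerate(rows):
        for column, pattern in rules.items():
            value = str(row.get(column, ""))
            if not pattern.fullmatch(value):
                yield index, column, value


if __name__ == "__main__":
    rows = [
        {"country": "IN", "signup_date": "2016-10-06"},
        {"country": "India", "signup_date": "06/10/2016"},
    ]
    for violation in non_conforming(rows, RULES):
        print(violation)
```

Writing the check as a generator lets it stream over large volumes of existing data and report violations as they are found.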

37

Step 4: Prioritize V’s

Analytical Model Validation (analytical models based on historical data)

VOLUME: To identify data patterns that were not considered in the development of the model; the entire historical data set is to be considered for testing
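One simple reading of "data patterns not considered in development" is a combination of field values that occurs in the full history but never in the sample the model was built on. The sketch below counts such combinations. The definition of a pattern, the field names, and the `pattern_of` helper are all assumptions made for this example.

```python
from collections import Counter


def unseen_patterns(development_sample, full_history, pattern_of):
    """Count patterns that occur in the full history but not in the sample."""
    known = {pattern_of(rec) for rec in development_sample}
    return Counter(p for p in map(pattern_of, full_history) if p not in known)


def channel_and_plan(rec):
    """Hypothetical pattern definition: the (channel, plan) combination."""
    return rec["channel"], rec["plan"]


if __name__ == "__main__":
    sample = [{"channel": "web", "plan": "basic"}]
    history = sample + [
        {"channel": "mobile", "plan": "basic"},
        {"channel": "mobile", "plan": "basic"},
    ]
    print(unseen_patterns(sample, history, channel_and_plan))
```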

38

Step 4: Prioritize V’s

Analytical Model Validation (analytical models not based on historical data)

VERACITY: High priority to identify the data patterns that are not relevant for the business

VALUE: High priority to identify the data patterns that do not bring any value to the business

39

Step 4: Prioritize V’s

Analytical Model Effectiveness Testing

VOLUME: High priority to identify wrong predictions and unidentified data patterns, if the actual data on which the model needs to be run is available
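Taken together, the Step 4 slides can be read as a lookup from testing type and condition to the V's that deserve priority. The sketch below transcribes that reading into a dictionary; the key strings are paraphrased from the slides and the function name is hypothetical.

```python
# (testing type, condition) -> prioritized V's, transcribed from the Step 4 slides.
PRIORITIES = {
    ("ingestion", "file-based"): ["variety"],
    ("ingestion", "real-time"): ["velocity"],
    ("migration", "historical data"): ["volume"],
    ("integration", "multiple sources to a single target"): ["variability"],
    ("integration", "external sources"): ["variability"],
    ("homogenization", "unstructured or semi-structured to structured"): ["variety"],
    ("standardization", "pre-existing data"): ["volume"],
    ("model validation", "model based on historical data"): ["volume"],
    ("model validation", "model not based on historical data"): ["veracity", "value"],
    ("model effectiveness", "actual data available"): ["volume"],
}


def prioritized_vs(testing_type):
    """Return the de-duplicated V's listed for a testing type, in slide order."""
    result = []
    for (name, _condition), vs in PRIORITIES.items():
        if name != testing_type:
            continue
        for v in vs:
            if v not in result:
                result.append(v)
    return result


if __name__ == "__main__":
    for testing_type in ("ingestion", "integration", "model validation"):
        print(testing_type, prioritized_vs(testing_type))
```

This speaks to the question raised earlier in the deck: under this reading, each testing type concentrates on one or two V's rather than on all of them at once.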

40

Thank you!

For more information, please write to me at [email protected]

Visit TCS at booth # 1