Test Resource Management Center Big Data Analytics / Knowledge Management Architecture Framework Overview Ed Powell Test Resource Management Center Presented at the ITEA 34th International Test and Evaluation Symposium (2017) Oct 4, 2017, Reston, VA


Test Resource Management Center
Big Data Analytics / Knowledge Management
Architecture Framework Overview

Ed Powell
Test Resource Management Center

Presented at the ITEA 34th International Test and Evaluation Symposium (2017)

Oct 4, 2017, Reston, VA


T&E Need for Big Data Analytics/KM Capability

• Most investment has been on test needs, but this is an evaluation need.
– Need an evaluation strategy that can analyze not only today's questions but also future questions, even after fielding.

• T&E quality is inadequate for our needs
– More data is being collected than can be properly analyzed
– Only a tiny fraction of data is looked at
– Only simplistic analysis is being done on a small fraction of data
– No global view of the collected data is ever produced
– No systematic anomaly detection, trend analysis, regression analysis, causality analysis, pattern recognition, simulation/test comparison, or perceived truth/ground truth comparison is being done

• T&E timeliness is inadequate for our needs
– Analyst retrieval of test data in some cases takes more than a week
– Sometimes it’s easier (though not cheaper) to re-run a test than to find old data that may answer your question
– Long data ingest times prevent proper debriefing of test participants after a test is over, since their statements cannot be correlated with data in real time

• T&E dollars are being spent unnecessarily
– More tests than necessary are being done, sometimes at enormous expense
– No cross-program lessons learned are being captured, except anecdotally

A systematic approach to Big Data Analytics and Knowledge Management (an architecture) is required to address these three serious issues.


Architecture Framework Purpose

• Understand the domain of Big Data Analytics and Knowledge Management as it relates to Test and Evaluation needs
– Identify deficiencies in current T&E data analysis and knowledge management practices
– Identify commercial and open source software and hardware that could address these deficiencies

• Create a roadmap for investment in and deployment of these technologies
– Clearly identify the end state we are looking to achieve
– Identify clear benefits to acquisition programs
– Identify necessary S&T and development activities to get us to the end state
– Identify timeline for technology integration and deployment
– Socialize the architecture framework and the roadmap throughout the T&E community and receive feedback. Adjust as necessary to gain the support of the bulk of the T&E community.

• Create a set of implementation guidelines that will allow multiple independent developers to create valuable elements that integrate seamlessly
– Baseline for future coordinated infrastructure investments


What is Big Data Analytics?

• The use of advanced statistical analytic techniques in a parallel processing high-performance computing environment against very large diverse data sets that include different types of data

• Allows analysts to make better and faster decisions using data that was previously inaccessible or unusable

• Previously under-utilized data sources can be analyzed to gain new insights resulting in significantly better and faster decisions

• Instead of analyzing small chunks of data, Big Data Analytics can give the analyst a broad view of the system, allowing the discovery of “unknown unknowns.”

• Most important (and relevant to T&E) big data analytics techniques:
– Anomaly Detection – Did something go wrong?
– Causality Detection – What contributed to it?
– Trend Analysis – What’s happening over time?
– Predicting Equipment Function and Failure – When will something go wrong?
– Regression Analysis – How is today’s data different from the past?
– Data Set Comparison – Is the test repeatable? Is the simulation the same as the test? Is the perceived truth the same as the ground truth?
– Pattern Recognition – Are there hidden relationships in the data set?
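As a rough illustration of two of the techniques listed above, the sketch below flags anomalies with a simple z-score test and estimates a trend with a least-squares slope. The slides do not prescribe any particular algorithm; the function names, the threshold of 3 standard deviations, and the telemetry values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def find_anomalies(samples, threshold=3.0):
    """Return indices of samples more than `threshold` standard deviations from the mean."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return []  # constant signal: nothing stands out
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > threshold]


def trend_slope(samples):
    """Least-squares slope of the samples against their index (units per sample)."""
    n = len(samples)
    x_bar = (n - 1) / 2
    y_bar = mean(samples)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(samples))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Hypothetical telemetry channel: a slow upward drift with one spike at index 30.
readings = [10.0 + 0.1 * i for i in range(50)]
readings[30] = 40.0

anomalies = find_anomalies(readings)  # "Did something go wrong?"
slope = trend_slope(readings)         # "What's happening over time?"
```

A z-score test is only one of many possible detectors; it is used here because it needs nothing beyond the standard library and makes the question behind each technique concrete.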


Long-term DoD T&E Big Data and Knowledge Management Vision


Result: T&E data used more effectively & efficiently during acquisition

• The primary product of T&E is data & knowledge

• Embrace KM & Big Data Analytics to efficiently handle & securely share T&E data

• Organize T&E data to build knowledge across all DoD acquisitions

• Federate distributed data repositories to enable execution & automated search scenarios that cannot occur today

• Use modern mechanisms to enable collaboration between SMEs in government and industry

Fundamental Functions Performed by KM and BDA
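To make the idea of federating distributed repositories for automated search more concrete, here is a minimal sketch in which one query is fanned out to several repository catalogs and the hits are merged. The `Repository` class, the tag-based metadata, and the range and test identifiers are hypothetical stand-ins, not part of the architecture described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    repository: str  # which repository holds the data
    test_id: str
    tags: frozenset


class Repository:
    """Hypothetical stand-in for one site's metadata catalog."""

    def __init__(self, name, records):
        self.name = name
        self._records = [Record(name, tid, frozenset(tags)) for tid, tags in records]

    def search(self, required_tags):
        wanted = frozenset(required_tags)
        # A record matches when it carries every requested tag.
        return [r for r in self._records if wanted <= r.tags]


def federated_search(repositories, required_tags):
    """Send the same query to every repository and merge the hits."""
    hits = []
    for repo in repositories:
        hits.extend(repo.search(required_tags))
    return sorted(hits, key=lambda r: (r.repository, r.test_id))


range_a = Repository("range_a", [("T-001", {"radar", "flight"}), ("T-002", {"video"})])
range_b = Repository("range_b", [("T-101", {"radar", "ground"})])

results = federated_search([range_a, range_b], {"radar"})
```

The data stays where it is; only the query and the matching metadata move, which is the property that lets an analyst find old data without knowing in advance which range holds it.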

Big Data Analytics Overview (OV-1)

• Integrated
• Scalable
• Cost-Effective
• State-of-the-Art

[Figure: OV-1 overview diagram. Labeled elements:
– Individual Range: Current Range Infrastructure (Existing Tools, Existing Storage, Existing Ingest Capabilities) and Range Augmentation (Virtualized Big Data Tools, Some Processing, Some Tiered Storage, MILS Security, Enhance Ingest)
– Cloud-Based Big Data Analytics and Knowledge Management System: several Regional Analytics Capabilities, each labeled with Virtualized Big Data Tools, Processing, Tiered Storage, MILS Security, and Data Scientists
– Data items: Working Files, Quick-Look, Schedule Info, Application Repository, Reports, Data, Video, Audio, Imagery
– Legend: New, Existing]

Big Data and TENA Relationship: The Big Data Analytics Architecture is an Extension of TENA Into the Analytic World – Seamless Integration

Event Data Is Ingested into Big Data Enterprise System

[Figure: an Individual Range. Labeled elements: Current Range Infrastructure (Existing Tools, Existing Storage, Existing Ingest Capabilities), Range Augmentation (Virtualized Big Data Tools, Some Processing, Some Tiered Storage, MLS Security, Enhance Ingest), Working Files, Quick-Look]

Big Data Software Architecture Overview

[Figure: layered software architecture diagram. Labeled elements, grouped by apparent layer:
– Data Sources: Existing Range Databases, Flat Files, Raw Files, Streams; Structured, Unstructured, Audio/Video
– Extract-Transform-Load: Micro-batch, Mega-batch, Parallel, Streaming, Scripting; Verify, Transform, Add Metadata, Index, Warehouse; Pipeline, Workflow, Range Protocols
– Setup, Configure, and Manage: Policies, Security, Define Metadata, Prioritization, Configuration
– Databases: Structured Database with Structured Data Engine; Unstructured/Semi-Structured Database (Hadoop) with Unstructured Data Engine; New Databases (Key-Value Store, Distributed File System, Graph-Based, Schema); Working Sets, Tables
– Query Engine – Federated access for both Structured and Unstructured Data: Build Queries, SQL Services, Filter, Sort, Summarize, Parallelize, Optimize
– Data Services: Organization, Core Operations, Share, Serve, Messaging, Metadata, Store, Retrieve, Versioning, Tagging, Publish/Subscribe, Crawl/Index, Transfer, Transform, Catalog, Search, Verify, Sync Data/Video, Spatio-temporal, Ontologies, C/R/U/D, Consistency, Metadata Replication, Remote Data Replication; Administrative (COO/DR, Enforce Policies, Archive Tools, DB Admin, Config Mgmt)
– Analytic Services: Data Analysis Packages, User-Defined Analytic Plugins, Statistics, Machine Learning, Data Mining, AI Tools, Simulation, Analysis Tools, Audio/Video Analysis, Generate Reports, Alerting, Scheduling/Automation, Create Automated Products, Legacy Tools, MPP Programming and Execution Engine; T&E Specific Custom BDA Services (Anomaly Detection, Trend Analysis, Causality Detection, Regression Analysis, Ground Truth Comparison, Pattern Recognition)
– Big Data Visualization: Quick-Look, Real-Time, Continuous, 2D/3D/Anim, Display Reports, Design Reports, Customized Displays, Display Alerts, User Interface, Customized UIs
– Security (UC, S, TS, SAP, SAR): Authenticate, Authorize, Access Control, Enforce Policies, Enforce Workflow, Threat Detection, Intrusion Detection, Active Defenses, Encryption, Audit, Alerts
– Abstraction Layer (Virtualization): Hypervisor, Virtualized Legacy Tools, Virtualized New Tools, Infrastructure as a Service, Platform as a Service, Software as a Service, Simulation as a Service, Provisioning, Resource Mgmt, VM Library, Cloud, License, Customization, Applications, Computing Resources
– Infrastructure: Massively Parallel Tiered Computing, Storage, and Network Infrastructure at Multiple Independent Levels of Security; MILS Secure Cloud; Load Balancing; Fault/Recovery; Existing Range Computing and Storage; Existing Computers
– Software creation: TENA Data Lifecycle, Workflow, Create Software, IDE, SDK
– Legend: COTS/GOTS Software, New Hardware/Network, TRMC-Developed Software, Existing Range HW/SW]
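The Extract-Transform-Load stages named on this slide (Verify, Transform, Add Metadata, Index, Warehouse) could be chained roughly as in the sketch below. This is an assumption-heavy illustration: the "time,value" CSV payload, the SHA-256 checksum check, the field names, and the in-memory list standing in for the warehouse are all hypothetical and not taken from the architecture.

```python
import hashlib
from datetime import datetime, timezone


def verify(raw: bytes, expected_sha256: str) -> bytes:
    """Verify: reject a file whose checksum does not match the one recorded at the source."""
    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    return raw


def transform(raw: bytes) -> list:
    """Transform: parse hypothetical 'time,value' CSV lines into dict rows."""
    rows = []
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue
        t, v = line.split(",")
        rows.append({"t": float(t), "value": float(v)})
    return rows


def add_metadata(rows, test_id, source):
    """Add Metadata: tag every row with where it came from and when it was ingested."""
    ingested = datetime.now(timezone.utc).isoformat()
    return [dict(row, test_id=test_id, source=source, ingested_at=ingested) for row in rows]


def build_index(rows):
    """Index: map each test_id to the positions of its rows."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row["test_id"], []).append(pos)
    return index


# Hypothetical payload from one instrument; the list below stands in for the warehouse.
payload = b"0.0,1.5\n0.5,1.7\n1.0,1.6\n"
checksum = hashlib.sha256(payload).hexdigest()

warehouse = add_metadata(transform(verify(payload, checksum)), "T-001", "radar_a")
index = build_index(warehouse)
```

Keeping each stage a separate function mirrors the way the diagram separates the steps, and would let a stage be swapped for a streaming or parallel variant without touching the others.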

Big Data Hardware

• “Hardware” means a complementary and integrated combination of both processing and storage to perform the required analytic and knowledge management functions

• For the Regional Analytics Capabilities, hardware will be deployed focused on long-term storage and broad and deep multi-program analytics

• At each range, there are four basic options for hardware
1. Rely Primarily on New Hardware
2. Hybrid of Existing Hardware and New Hardware
3. Rely Primarily on Existing Range Hardware
− Where investments have already been made
4. Minimal Hardware Deployment and Integration
− Where we only track what data is where and do not provide deep analytic capabilities

Since the amount of T&E data is increasing exponentially, purchasing and deploying hardware must be a continuous process that never ends. This fact needs to be impressed upon our Service partners and Congress.
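A small worked calculation shows why exponential growth turns hardware purchasing into a continuous process. The slides give no growth figures, so the starting volume (100 TB), the doubling period (2 years), and the purchased capacity (400 TB) below are purely illustrative assumptions.

```python
import math


def projected_volume(current_tb, doubling_years, years_ahead):
    """Data volume after `years_ahead` years if it doubles every `doubling_years` years."""
    return current_tb * 2 ** (years_ahead / doubling_years)


def years_until_full(current_tb, capacity_tb, doubling_years):
    """Years until exponentially growing data fills a fixed storage capacity."""
    return doubling_years * math.log2(capacity_tb / current_tb)


# Illustrative assumptions only: 100 TB today, doubling every 2 years.
volume_in_6_years = projected_volume(100, 2, 6)  # three doublings
headroom_years = years_until_full(100, 400, 2)   # buying 4x today's volume
```

Under these assumed numbers, buying four times today's capacity gives only two doubling periods of headroom, so any one-time purchase is exhausted on a fixed schedule regardless of its size.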

Big Data Hardware/Software Architecture – Configuration 1 – Rely on New Hardware

[Figure: hardware/software stack diagram with two sites, Local to a Range and Regional Analytics Capability. Labeled elements:
– Existing Computing and Storage Infrastructure: Range Instrumentation, Data Ingest & Processing, Existing Storage (Databases, Raw Files, Flat Files), Processors, Existing Tools, Existing Visualization, User Tools on Existing Workstations
– New Hardware: MPP MILS Tiered Computing, Storage, and Network Infrastructure at each site, with Processors attached to High-Speed Storage, Medium-Speed Storage, Long-term Storage, and Tape over a high-speed data interconnect; High-Speed Network
– Software: Data Services, Analysis, Virtualization, Visualization, ETL, and Security (ETL appears in only one of the two software stacks)
– Security at each site]

Big Data Hardware/Software Architecture – Configuration 2 – Hybrid Existing And New Hardware

[Figure: hardware/software stack diagram with two sites, Local to a Range and Regional Analytics Capability. Labeled elements:
– Existing: Range Instrumentation, Data Ingest & Processing, Existing Tools, Existing Visualization, Existing Computing and Storage Infrastructure (Existing Storage, Existing Databases, Raw Files, Flat Files, Processors)
– New Hardware: MPP MILS Tiered Computing, Storage, and Network Infrastructure, with Processors attached to High-Speed Storage, Medium-Speed Storage, Long-term Storage, and Tape over a high-speed data interconnect; New Databases; High-Speed Network
– Software: Data Services, Analysis, Virtualization, Visualization, ETL, and Security (ETL appears in only one of the two software stacks)
– Security at each site]

Big Data Hardware/Software Architecture – Configuration 3 – Rely Primarily on Existing Range Hardware

[Figure: hardware/software stack diagram with two sites, Local to a Range and Regional Analytics Capability. Labeled elements:
– Existing: Range Instrumentation, Data Ingest & Processing, Existing Tools, Existing Visualization, Existing Computing and Storage Infrastructure (Existing Storage, Existing Databases, Raw Files, Flat Files, Processors)
– New Hardware: Processors with Working Set Storage; MPP MILS Tiered Computing, Storage, and Network Infrastructure, with Processors attached to High-Speed Storage, Medium-Speed Storage, Long-term Storage, and Tape over a high-speed data interconnect; New Databases
– Networks: High-Speed Network, JMETC Network
– Software: Data Services, Analysis, Virtualization, Management/Visualization, ETL, and Security in one stack; Data Services, Analysis, Virtualization, Visualization, and Security in the other
– Security at each site]

Big Data Hardware/Software Architecture – Configuration 4 – Index Local Data Only

[Figure: hardware/software stack diagram with two sites, Local to a Range and Regional Analytics Capability. Labeled elements:
– Existing: Range Instrumentation, Data Ingest & Processing, Existing Tools, Existing Visualization, Existing Computing and Storage Infrastructure (Existing Storage, Existing Databases, Raw Files, Flat Files, Processors)
– New Hardware: Processors with Working Set Storage; MPP MILS Tiered Computing, Storage, and Network Infrastructure, with Processors attached to High-Speed Storage, Medium-Speed Storage, Long-term Storage, and Tape over a high-speed data interconnect
– Networks: High-Speed Network, JMETC Network
– Software: Data Services, Analysis, Virtualization, Management/Visualization, ETL, and Security in one stack; Data Services, Analysis, Virtualization, Visualization, and Security in the other
– Security at each site]

Federated Software / Storage Architecture

[Figure: federated architecture diagram spanning Ranges and a Regional Analytics Capability. Labeled elements:
– Hardware: MPP MILS Tiered Computing, Storage, and Network Infrastructure, with Processors attached to High-Speed Storage, Medium-Speed Storage, Long-term Storage, and Tape over a high-speed data interconnect
– Software: Data Services, Analysis, Virtualization, Visualization, Security
– Security]

Security Architecture: Notional MILS and CDS

[Figure: security architecture diagram for a Regional Analytics Capability. Labeled elements:
– Classification A, Classification B, and Classification C, each with its own High-Speed, Med-Speed, and Long-Term Storage
– MLS Database, also with High-Speed, Med-Speed, and Long-Term Storage
– MILS-CDS
– Enterprise Big Data Analysis]
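One way to picture what a cross-domain query over separate classification enclaves has to enforce is the minimal sketch below. It is not the MILS or CDS design from the slide: the strict A < B < C ordering, the `can_read` rule, and the record names are assumptions made for illustration, and real systems also handle compartments that do not fit a simple ordering.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed ordering for illustration only; the slide does not rank A, B, and C.
    A = 1
    B = 2
    C = 3


def can_read(clearance: Level, data_level: Level) -> bool:
    """A user may read data at or below their own level."""
    return clearance >= data_level


def cross_domain_query(clearance, enclaves):
    """Return records from every enclave the user's clearance covers."""
    visible = []
    for level, records in enclaves.items():
        if can_read(clearance, level):
            visible.extend(records)
    return visible


# Hypothetical enclaves, each holding its own records.
enclaves = {
    Level.A: ["a1", "a2"],
    Level.B: ["b1"],
    Level.C: ["c1"],
}

visible_to_b = cross_domain_query(Level.B, enclaves)
```

The point of the sketch is that each enclave stays physically separate and the access decision is made per enclave, which is the separation that a multiple-independent-levels design relies on.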

Big Data Analytics Maturity Model

• Similar to the software maturity model (the Capability Maturity Model Integration or CMMI), big data analytics capability in any organization can be evaluated on a scale from no capability to a capability that is fully integrated and tailored to an organization’s needs.

Level 0
– No big data analytics capabilities

Level 1
– Isolated big data analytics use
– Unsophisticated tools and practices predominate

Level 2
– Some predictive analytics usage is part of mission critical applications only
– Full benefits are not understood by a majority in the agency

Level 3
– Big data analytics usage consists primarily of tactical and ad hoc approaches
– Big data analytics development and deployment is constrained, yet departments have their own experts and/or initiatives

Level 4
– Big data analytics talent is centralized into larger groups
– Management understands and supports big data analytics for strategic value, thus bringing units into alignment

Level 5
– Agency is committed to big data analytics as part of its future growth plan
– Big data analytics software framework supports rapid response
– Big data analytical output integrated seamlessly into user applications and workflow

Stage labels on the figure: NONEXISTENT, IMMATURE, INFORMED, EMPOWERED, INNOVATIVE, AWARE

Adapted from Etches et al., “Analytic Technology Industry Roundtable Study: Analytics and Use Cases,” published November 2016 by The Mitre Corp.

Summary

• An architecture specifies a technical plan for solving complex problems
– Document requirements and design constraints
– Identify sub-systems that need to interoperate
– Determine areas where standardization is needed
– Understand impact of current capability gaps & limitations
– Inform investment priorities

• The Big Data / Knowledge Management architecture framework provides context for the Big Data investment roadmap. Needs:
– Integrated local data
– Cloud analytics capability
– Big Data Tools
– Trained data scientist workforce

• Architecture standardization and buy-in will ensure collaboration and investments align to solve this common problem

Our goal is to advance the T&E Community’s Big Data Analytics Maturity Level