TRANSCRIPT
Test Resource Management Center
Big Data Analytics / Knowledge Management Architecture Framework Overview
Ed Powell, Test Resource Management Center
Presented at the ITEA 34th International Test and Evaluation Symposium (2017)
Oct 4, 2017, Reston, VA
T&E Need for Big Data Analytics/KM Capability

• Most investment has been on test needs, but this is an evaluation need.
  – We need an evaluation strategy that can analyze not only today's questions but future questions, even after fielding.
• T&E quality is inadequate for our needs.
  – More data is being collected than can be properly analyzed.
  – Only a tiny fraction of the data is ever looked at, and only simplistic analysis is done on that fraction.
  – No global view of the collected data is ever taken.
  – No systematic anomaly detection, trend analysis, regression analysis, causality analysis, pattern recognition, simulation/test comparison, or perceived-truth/ground-truth comparison is being done.
• T&E timeliness is inadequate for our needs.
  – Analyst retrieval of test data in some cases takes more than a week.
  – Sometimes it is easier (though not cheaper) to re-run a test than to find old data that may answer the question.
  – Long data ingest times prevent proper debriefing of test participants after a test is over, since their statements cannot be correlated with data in real time.
• T&E dollars are being spent unnecessarily.
  – More tests than necessary are being run, sometimes at enormous expense.
  – No cross-program lessons learned are captured, except anecdotally.

A systematic approach to Big Data Analytics and Knowledge Management (an architecture) is required to address these three serious issues.
Architecture Framework Purpose

• Understand the domain of Big Data Analytics and Knowledge Management as it relates to Test and Evaluation needs.
  – Identify deficiencies in current T&E data analysis and knowledge management practices.
  – Identify commercial and open-source software and hardware that could address these deficiencies.
• Create a roadmap for investment in and deployment of these technologies.
  – Clearly identify the end state we are looking to achieve.
  – Identify clear benefits to acquisition programs.
  – Identify the S&T and development activities necessary to reach the end state.
  – Identify a timeline for technology integration and deployment.
  – Socialize the architecture framework and the roadmap throughout the T&E community and receive feedback; adjust as necessary to gain the support of the bulk of the T&E community.
• Create a set of implementation guidelines that will allow multiple independent developers to create valuable elements that integrate seamlessly.
  – Provide a baseline for future coordinated infrastructure investments.
What is Big Data Analytics?

• The use of advanced statistical analytic techniques, in a parallel-processing high-performance computing environment, against very large, diverse data sets that include different types of data.
• Allows analysts to make better and faster decisions using data that was previously inaccessible or unusable.
• Previously under-utilized data sources can be analyzed to gain new insights, resulting in significantly better and faster decisions.
• Instead of analyzing small chunks of data, Big Data Analytics can give the analyst a broad view of the system, allowing the discovery of "unknown unknowns."
• The big data analytics techniques most important (and relevant) to T&E:
  – Anomaly Detection – Did something go wrong?
  – Causality Detection – What contributed to it?
  – Trend Analysis – What's happening over time?
  – Predicting Equipment Function and Failure – When will something go wrong?
  – Regression Analysis – How is today's data different from the past?
  – Data Set Comparison – Is the test repeatable? Is the simulation the same as the test? Is the perceived truth the same as the ground truth?
  – Pattern Recognition – Are there hidden relationships in the data set?
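To make the first technique on the list concrete, the following sketch flags outliers in a telemetry channel with a rolling z-score. This is purely illustrative: the function, window size, threshold, and telemetry values are invented for this example and are not part of the TRMC architecture.

```python
# Illustrative sketch: flag anomalies in a telemetry channel with a
# rolling z-score. Names and thresholds are hypothetical.
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Return indices where a sample deviates more than `threshold`
    standard deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady sensor readings with one injected spike at index 30.
telemetry = [10.0 + 0.1 * (i % 5) for i in range(60)]
telemetry[30] = 25.0
print(detect_anomalies(telemetry))  # → [30]
```

A production service would of course use distributed implementations of such tests over the full archive rather than a single in-memory list, but the statistical idea is the same.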
Long-term DoD T&E Big Data and Knowledge Management Vision

Result: T&E data used more effectively and efficiently during acquisition.

• The primary product of T&E is data and knowledge.
• Embrace KM and Big Data Analytics to efficiently handle and securely share T&E data.
• Organize T&E data to build knowledge across all DoD acquisitions.
• Federate distributed data repositories to enable execution and automated-search scenarios that cannot occur today.
• Use modern mechanisms to enable collaboration between SMEs in government and industry.
Fundamental Functions Performed by KM and BDA
Big Data Analytics Overview (OV-1)

[Diagram: a cloud-based Big Data Analytics and Knowledge Management system — integrated, scalable, cost-effective, and state-of-the-art — connects individual ranges to Regional Analytics Capabilities. Each range retains its current infrastructure (existing tools, existing storage, existing ingest capabilities) plus range augmentation: virtualized big data tools, some processing, some tiered storage, MILS security, and enhanced ingest. Each Regional Analytics Capability provides virtualized big data tools, processing, tiered storage, MILS security, and data scientists. Data flowing through the system includes working files, quick-look products, schedule information, reports, video, audio, and imagery, backed by an application repository. New and existing elements are distinguished.]

Big Data and TENA Relationship: the Big Data Analytics Architecture is an extension of TENA into the analytic world – seamless integration.
Event Data Is Ingested into the Big Data Enterprise System

[Diagram: event data and working files flow from the current range infrastructure (existing tools, existing storage, existing ingest capabilities) through the range augmentation (virtualized big data tools, some processing, some tiered storage, MILS security, enhanced ingest) at the individual range, producing quick-look products.]
Big Data Software Architecture Overview

[Diagram: the software architecture spans existing range computing and storage plus a massively parallel tiered computing, storage, and network infrastructure at multiple independent levels of security (a MILS secure cloud with load balancing, fault/recovery, encryption, audit, and alerts). Its major layers:

• Data Sources – existing range databases, flat files, raw files, working sets, and tables.
• Extract-Transform-Load – stream, micro-batch, mega-batch, and parallel ingest; verify, transform, add metadata, and index incoming data; setup, configuration, and management of policies, security, metadata definitions, and prioritization.
• Storage and Data Engines – a structured database with a structured data engine; an unstructured/semi-structured database (Hadoop) with an unstructured data engine; key-value store; distributed file system; graph-based stores; warehouse; schema; metadata replication.
• Query Engine – federated access to both structured and unstructured data: build queries; filter, sort, summarize, parallelize, and optimize.
• Analytic Services – data analysis packages and user-defined analytic plugins; statistics, machine learning, data mining, AI tools, simulation, legacy tools, alerting, scheduling/automation, report generation, and SQL services; T&E-specific custom BDA services (anomaly detection, causality detection, trend analysis, regression analysis, ground-truth comparison, pattern recognition); audio/video analysis, data/video synchronization, spatio-temporal analysis, and ontologies; an MPP programming and execution engine with streaming and scripting.
• Big Data Visualization and User Interface – quick-look, real-time, and continuous displays; 2D/3D/animation; design, display, and generation of reports; customized displays and UIs; alert display; automated product creation.
• Data Services – core operations (store, retrieve, transfer, transform, C/R/U/D, consistency), organization (catalog, search, verify, versioning, tagging, metadata, schema), sharing and serving (publish/subscribe, crawl/index, messaging, remote data replication), and administration (COOP/DR, policy enforcement, archive tools, DB administration, configuration management).
• Security – at all levels (U, C, S, TS, SAP, SAR): authenticate, authorize, access control, policy and workflow enforcement, threat detection, intrusion detection, and active defenses.
• Abstraction Layer (Virtualization) – hypervisor; virtualized legacy and new tools; Infrastructure, Platform, Software, and Simulation as a Service; VM library; provisioning; resource management; licensing and customization; IDE and SDK for creating software and workflows.

Components are color-coded as COTS/GOTS software, new hardware/network, TRMC-developed software, and existing range hardware/software, tied together by TENA, the data lifecycle, pipelines/workflows, and range protocols.]
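The query engine's federated access over structured and unstructured data can be sketched in miniature. This toy example uses SQLite to stand in for a structured range database and a keyword scan over free-text reports for the unstructured side; the schema, field names, and sample data are all invented for illustration, not taken from the architecture.

```python
# Toy sketch of a federated query: one search term is dispatched to both a
# structured store (SQLite stands in for the range database) and an
# unstructured store (free-text test reports), and the results are merged.
# Schema and field names are invented for illustration.
import sqlite3

def build_structured_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE runs (id INTEGER, system TEXT, result TEXT)")
    db.executemany("INSERT INTO runs VALUES (?, ?, ?)",
                   [(1, "radar-A", "pass"), (2, "radar-A", "anomaly"),
                    (3, "datalink-B", "pass")])
    return db

REPORTS = [  # unstructured side: free-text test reports
    "Run 2: radar-A showed an anomaly in track correlation.",
    "Run 3: datalink-B nominal throughout the event.",
]

def federated_query(db, term):
    structured = db.execute(
        "SELECT id, system, result FROM runs WHERE result = ?", (term,)
    ).fetchall()
    unstructured = [r for r in REPORTS if term in r.lower()]
    return {"structured": structured, "unstructured": unstructured}

db = build_structured_store()
hits = federated_query(db, "anomaly")
print(hits["structured"])         # → [(2, 'radar-A', 'anomaly')]
print(len(hits["unstructured"]))  # → 1
```

A real federated engine would push predicates down to each backend and unify results under common metadata, but the dispatch-and-merge pattern is the same.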
Big Data Hardware

• "Hardware" means a complementary and integrated combination of both processing and storage to perform the required analytic and knowledge management functions.
• For the Regional Analytics Capabilities, hardware will be deployed focused on long-term storage and on broad, deep, multi-program analytics.
• At each range, there are four basic options for hardware:
  1. Rely primarily on new hardware.
  2. Hybrid of existing hardware and new hardware.
  3. Rely primarily on existing range hardware – where investments have already been made.
  4. Minimal hardware deployment and integration – where we only track what data is where and do not provide deep analytic capabilities.
Since the amount of T&E data is increasing exponentially, purchasing and deploying hardware must be a continuous process that never ends. This fact needs to be impressed upon our Service partners and Congress.
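The point about never-ending procurement can be made concrete with a back-of-the-envelope projection. The starting volume and growth rate below are purely hypothetical, chosen only to show the shape of exponential growth.

```python
# Back-of-the-envelope sketch: if stored T&E data grows ~40% per year
# (a hypothetical rate), storage bought today is quickly dwarfed.
def projected_volume(initial_pb, annual_growth, years):
    """Compound the initial volume (in petabytes) over the given years."""
    return initial_pb * (1 + annual_growth) ** years

start = 1.0   # petabytes today (hypothetical)
rate = 0.40   # 40% growth per year (hypothetical)
for year in (1, 5, 10):
    print(f"year {year:2d}: {projected_volume(start, rate, year):6.1f} PB")
# At 40%/yr the archive roughly doubles every two years, so capacity
# purchases must recur rather than be one-time buys.
```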
Big Data Hardware/Software Architecture – Configuration 1: Rely on New Hardware

[Diagram: at the range, instrumentation feeds data ingest and processing into new MPP MILS tiered computing, storage, and network infrastructure (processors with high-speed, medium-speed, and long-term storage plus tape, on a high-speed data interconnect), running new software for data services, analysis, virtualization, visualization, ETL, and security. The existing computing and storage infrastructure (existing databases, raw files, flat files), existing tools, and existing visualization remain available to user tools on existing workstations over a high-speed network. The Regional Analytics Capability mirrors the new hardware and software stack (data services, analysis, virtualization, visualization, security), with security enforced on both sides.]
Big Data Hardware/Software Architecture – Configuration 2: Hybrid of Existing and New Hardware

[Diagram: as in Configuration 1, but at the range the new MPP MILS tiered infrastructure and new databases are integrated with the existing computing and storage infrastructure (existing databases, raw files, flat files) over a high-speed network, and the software layer (data services, analysis, virtualization, visualization, ETL, security) runs across the hybrid. The Regional Analytics Capability is unchanged: new hardware and new databases with data services, analysis, virtualization, visualization, and security.]
Big Data Hardware/Software Architecture – Configuration 3: Rely Primarily on Existing Range Hardware

[Diagram: at the range, only minimal new hardware is added (working-set storage and processors); the existing computing and storage infrastructure carries the load, with software for data services, analysis, virtualization, management/visualization, ETL, and security. The range connects over the JMETC network to the Regional Analytics Capability, which provides the full new MPP MILS tiered infrastructure, new databases, and software for data services, analysis, virtualization, visualization, and security.]
Big Data Hardware/Software Architecture – Configuration 4: Index Local Data Only

[Diagram: as in Configuration 3, but the range's existing storage (raw files, flat files, existing databases) is merely indexed rather than analyzed in place; minimal new hardware (working-set storage and processors) supports data services, analysis, virtualization, management/visualization, ETL, and security. Deep analytics occur only at the Regional Analytics Capability, reached over the JMETC network.]
Federated Software / Storage Architecture

[Diagram: multiple ranges federate with a Regional Analytics Capability; each side runs the common software stack (data services, analysis, virtualization, visualization, security) over MPP MILS tiered computing, storage, and network infrastructure (processors with high-speed, medium-speed, and long-term storage plus tape, on a high-speed data interconnect), with security enforced at every boundary.]
Security Architecture: Notional MILS and CDS

[Diagram: within the Regional Analytics Capability, each classification level (A, B, C) has its own high-speed, medium-speed, and long-term storage tiers. A MILS cross-domain solution (MILS-CDS) connects these independent levels to an MLS database supporting enterprise big data analysis.]
Big Data Analytics Maturity Model

• Similar to the software maturity model (the Capability Maturity Model Integration, or CMMI), the big data analytics capability of any organization can be evaluated on a scale from no capability to a capability that is fully integrated and tailored to the organization's needs.

– Level 0 (Nonexistent): No big data analytics capabilities.
– Level 1 (Immature): Isolated big data analytics use; unsophisticated tools and practices predominate.
– Level 2 (Aware): Some predictive analytics usage as part of mission-critical applications only; the full benefits are not understood by a majority in the agency.
– Level 3 (Informed): Big data analytics usage consists primarily of tactical and ad hoc approaches; development and deployment are constrained, yet departments have their own experts and/or initiatives.
– Level 4 (Empowered): Big data analytics talent is centralized into larger groups; management understands and supports big data analytics for strategic value, thus bringing units into alignment.
– Level 5 (Innovative): The agency is committed to big data analytics as part of its future growth plan; the big data analytics software framework supports rapid response; big data analytical output is integrated seamlessly into user applications and workflows.

Adapted from Etches et al., "Analytic Technology Industry Roundtable Study: Analytics and Use Cases," published November 2016 by The Mitre Corp.
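One way to make a maturity scale like this operational is a simple checklist score. The checklist items below and their mapping to levels are invented for illustration; they are not taken from the Mitre roundtable study.

```python
# Illustrative sketch: score an organization against a 0-5 maturity scale
# from an ordered checklist. The checklist items and the mapping are
# invented for illustration, not taken from the Mitre roundtable study.
LEVEL_NAMES = ["Nonexistent", "Immature", "Aware",
               "Informed", "Empowered", "Innovative"]

CHECKLIST = [  # ordered roughly from basic to advanced practice
    "any_analytics_use",
    "predictive_analytics_in_mission_apps",
    "departmental_experts_or_initiatives",
    "centralized_talent_with_mgmt_support",
    "analytics_embedded_in_user_workflows",
]

def maturity_level(practices):
    """Level = number of consecutive checklist items satisfied, from the start."""
    level = 0
    for item in CHECKLIST:
        if practices.get(item):
            level += 1
        else:
            break
    return level, LEVEL_NAMES[level]

org = {"any_analytics_use": True,
       "predictive_analytics_in_mission_apps": True}
print(maturity_level(org))  # → (2, 'Aware')
```

A real assessment would weigh many more indicators per level, but a checklist of this shape gives organizations a repeatable way to track progress up the scale.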
Summary

• An architecture specifies a technical plan for solving complex problems:
  – Document requirements and design constraints.
  – Identify sub-systems that need to interoperate.
  – Determine areas where standardization is needed.
  – Understand the impact of current capability gaps and limitations.
  – Inform investment priorities.
• The Big Data / Knowledge Management architecture framework provides context for the Big Data investment roadmap. Needs:
  – Integrated local data.
  – Cloud analytics capability.
  – Big data tools.
  – A trained data-scientist workforce.
• Architecture standardization and buy-in will ensure that collaboration and investments align to solve this common problem.

Our goal is to advance the T&E Community's Big Data Analytics Maturity Level.