UPPSALA DATABASE LABORATORY
Managing Scientific Queries over Distributed Data in a Grid Environment
Ruslan Fomkin
January 20, 2006 NGN workshop Uppsala
UU- IT - UDBL Ruslan Fomkin

Uppsala DataBase Laboratory (UDBL)
Supervisor
  • Prof. T. Risch
Database research
  • How to make extensible middleware query processing that allows scalable and application-oriented search over different kinds of wrapped information sources
http://www.it.uu.se/research/group/udbl/

AMOS II
[Figure: AMOS II as a virtual mediator database. Applications: simulation, visualization, analysis, patient monitoring. Data sources: Grid, historical measurements, relational databases. Other labels: queries, continuous queries, queries and views, wrappers, plug-ins.]

Ongoing Research at UDBL
Stream Queries on BlueGene: Erik Zeitler, MSc
FEM Databases: Kjell Orsborn, PhD
Mediating Web Services: Manivasakan Sabesan, BSc
Semantic Web Queries to Hidden Web: Johan Petrini, MSc
Stream Data Manager: Milena Ivanova, PhD
Expensive GRID Queries: Ruslan Fomkin, MSc

Outline
Introduction
The project
Test application
Developed framework
Conclusion
Future work

Scientific Applications, Grid and Databases
A lot of scientific data
  • Complex structure
  • Stored in files distributed in Grid
Scientific analyses can be represented as declarative queries
  • Complex queries with numerical computations
  • Long running or batch queries
Utilization of computational resources of Grid

Parallel Object Query System for Expensive Computations (POQSEC)
Query processor for scientific applications
  • high-level interface to specify the analyses
  • automatically generates execution plans and evaluates them
Requirements
  • Scalable, efficient, flexible, transparent
Properties
  • Distributed and parallel

Layered Architecture of the System
POQSEC provides
  • scientific query management
Grid provides
  • computation management
  • file management
  • in this work: NorduGrid middleware
Application area provides
  • computational libraries
  • data management libraries
  • in this work: ROOT library
[Figure: layers labeled User, POQSEC, Application libraries (ROOT), Grid (NorduGrid), Data, Clusters.]

Our Test Application
From particle physics
Analysis of collision events for the presence of Higgs particles
Data produced by ATLAS simulation software
  • stored in files
  • distributed in the Grid (e.g. NorduGrid)
  • managed by the ROOT library
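
Object-Relational Schema of the Application Data
[Figure: schema diagram. Types: Event, Particle, Lepton, Muon, Electron, Jet, connected by inheritance links. Relationship: particles, between Event and Particle (1 to n). Attributes: PxMiss, PyMiss, Px, Py, Pz, Kf, Ee.]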

General Query of the Analysis
Selection of those events that satisfy predicates containing numerical operations

    SELECT ev
    FROM Event ev
    WHERE jetvetocut(ev) AND zvetocut(ev) AND topcut(ev)
      AND misseecuts(ev) AND leptoncuts(ev) AND threeleptoncut(ev);

Each predicate is called a cut in the application area
Predicates are defined as queries
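
The conjunction of cuts can be pictured as a list of boolean predicates that every selected event must satisfy. The Python sketch below is an illustration only, not POQSEC code: the `Event` fields, the two toy cuts, and the 10.0 threshold are hypothetical stand-ins for the real cut definitions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Event:
    # Hypothetical, simplified event record (not the real schema).
    event_id: int
    px_miss: float
    py_miss: float
    n_leptons: int


Cut = Callable[[Event], bool]


def select_events(events: Iterable[Event], cuts: List[Cut]) -> List[Event]:
    """Keep the events for which every cut holds (the AND of all predicates)."""
    return [ev for ev in events if all(cut(ev) for cut in cuts)]


# Toy cuts, assumed for illustration only.
def three_lepton_cut(ev: Event) -> bool:
    return ev.n_leptons == 3


def missing_momentum_cut(ev: Event) -> bool:
    # Transverse missing momentum above a made-up threshold.
    return (ev.px_miss ** 2 + ev.py_miss ** 2) ** 0.5 > 10.0


events = [
    Event(1, 30.0, 40.0, 3),
    Event(2, 1.0, 1.0, 3),
    Event(3, 30.0, 40.0, 2),
]
selected = select_events(events, [three_lepton_cut, missing_momentum_cut])
```

Because each cut is an independent predicate over a single event, the same selection can be applied to disjoint subsets of the events, which is what makes the query easy to parallelize.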

Example of a predicate: Z-veto cut
Either the event does not have a pair of oppositely charged leptons, or the invariant mass of the pair is not close to the mass of a Z particle

    CREATE FUNCTION zvetocut(Event ev) -> Event AS
      SELECT ev
      WHERE NOTANY(oppositeLeptons(ev))
         OR abs(invMass(oppositeLeptons(ev)) - zMass) >= minZMass;

    CREATE FUNCTION oppositeLeptons(Event ev) -> bag of <Lepton, Lepton> AS
      SELECT l1, l2
      FROM Lepton l1, Lepton l2
      WHERE l1 = particles(ev) AND l2 = particles(ev)
        AND Kf(l1) = -Kf(l2);
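
As a rough illustration of what the Z-veto cut computes, here is a Python sketch under several assumptions that are not stated on the slide: `Kf` is treated as a signed particle code, `Ee` as the particle energy, the invariant mass is the standard two-particle formula m² = (E1 + E2)² − |p1 + p2|², `Z_MASS` is an approximate value in GeV, and `MIN_Z_MASS` is a made-up threshold. The second condition is read as "at least one opposite pair lies outside the Z mass window", which is one possible reading of the query.

```python
from dataclasses import dataclass
from itertools import permutations
from math import sqrt
from typing import List, Tuple

Z_MASS = 91.19      # GeV, approximate Z boson mass
MIN_Z_MASS = 10.0   # GeV, hypothetical window half-width


@dataclass(frozen=True)
class Lepton:
    px: float
    py: float
    pz: float
    ee: float  # energy (assumed meaning of the Ee attribute)
    kf: int    # signed particle code (assumed meaning of Kf)


def inv_mass(l1: Lepton, l2: Lepton) -> float:
    """Invariant mass of a two-particle system from energies and momenta."""
    e = l1.ee + l2.ee
    px, py, pz = l1.px + l2.px, l1.py + l2.py, l1.pz + l2.pz
    m2 = e * e - (px * px + py * py + pz * pz)
    return sqrt(max(m2, 0.0))  # clamp tiny negative values from rounding


def opposite_leptons(leptons: List[Lepton]) -> List[Tuple[Lepton, Lepton]]:
    """All ordered pairs whose codes are negatives of each other."""
    return [(a, b) for a, b in permutations(leptons, 2) if a.kf == -b.kf]


def z_veto_cut(leptons: List[Lepton]) -> bool:
    pairs = opposite_leptons(leptons)
    if not pairs:
        return True
    return any(abs(inv_mass(a, b) - Z_MASS) >= MIN_Z_MASS for a, b in pairs)


# Two back-to-back leptons with total energy near the Z mass: vetoed.
near_z = [Lepton(45.6, 0.0, 0.0, 45.6, 11), Lepton(-45.6, 0.0, 0.0, 45.6, -11)]
# Same configuration at lower energy: passes the cut.
far_from_z = [Lepton(20.0, 0.0, 0.0, 20.0, 11), Lepton(-20.0, 0.0, 0.0, 20.0, -11)]
```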

Current Framework
Basic tool for utilizing NorduGrid through the Advanced Resource Connector (ARC)
Submission mechanism
  • submit query
  • parallelize query into several subqueries
  • generate job scripts (one per subquery)
Babysitter functionality
Data exchange mechanism through files

Client and Coordinator Part
POQSEC client
  • personal database with application schema
  • ROOT wrapper
Coordinator server
  • receives queries
  • creates jobs
Grid Meta-Database
  • computational resources
  • data files
Submission Database
  • received submissions
  • created jobs
Babysitter
  • interactions with ARC
[Figure: architecture diagram with components labeled Grid client node, POQSEC client, coordinator server, query coordinator, Grid Meta-Database, Submission Database, job queue, babysitter, local storage, ARC client.]

Query Submission
Query submission
  • query file name
  • selection
  • degree of parallelism
  • CPU time for each job
Coordinator server creates jobs
  • same query
  • partitions of data with equal size
  • same CPU time provided by user
  • corresponding job script files
Submission and its jobs are saved in the Submission Database
Created jobs are added to the job queue
Script files are saved to Local Storage
[Figure: client and coordinator architecture diagram, repeated.]
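
The job creation step can be sketched as splitting the input data into as many near-equal partitions as the requested degree of parallelism and attaching the same query and CPU time to each. The Python sketch below is an assumption-laden illustration: it measures partition size in number of files (the slide does not say how size is measured), and the `Job` record, file names, and script naming scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    query_file: str
    data_files: List[str]
    cpu_minutes: int
    script_name: str  # hypothetical naming scheme


def partition(items: List[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def create_jobs(query_file: str, data_files: List[str],
                degree: int, cpu_minutes: int) -> List[Job]:
    """One job per partition: same query, same CPU time, own script file."""
    return [
        Job(query_file, chunk, cpu_minutes, f"job_{i}.sh")
        for i, chunk in enumerate(partition(data_files, degree))
    ]


files = [f"events_{i}.root" for i in range(10)]
jobs = create_jobs("query.amosql", files, degree=4, cpu_minutes=20)
```

Contiguous chunks keep the partitioning deterministic, so a failed job can be recreated with exactly the same input files.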

Job Submission
Babysitter
  • takes jobs from the job queue
  • submits each job to the ARC client
  • changes the status of submitted jobs in the Submission DB
ARC client
  • finds a Computing Element (CE)
  • submits the job to the corresponding ARC Grid Manager
[Figure: client and coordinator architecture diagram, repeated, with two additional CEs, each with an ARC Grid Manager.]

Job Execution
ARC Grid Manager
  • downloads input files
  • submits job to the Local Batch System (LBS)
After some delay the LBS starts the Executor on an allocated CE node
Executor during execution
  • executes the given subquery
  • accesses data through the ROOT wrapper
  • saves result to files on CE Storage
[Figure: diagram with components labeled CE, ARC Grid Manager, LBS queue, CE node, Executor, wrapper, CE Storage, and two SEs.]

Downloading Result
Babysitter
  • polls the ARC client for job statuses
  • requests download of results for finished jobs
Results are downloaded to Local Storage
User can retrieve the result when all jobs are ready
[Figure: client and coordinator architecture diagram, repeated, with two CEs, each with an ARC Grid Manager and CE Storage.]
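
The babysitter behaviour (take jobs from the queue, submit them, poll for status, download finished results) can be sketched as a simple loop. The Python code below is a minimal sketch, not the POQSEC implementation: `FakeArcClient`, its `submit`/`status`/`download` methods, and the status strings are hypothetical and do not correspond to the real ARC client interface. A real babysitter would also wait between polls; the sketch omits that so it runs instantly.

```python
from collections import deque


class FakeArcClient:
    """Stand-in for an ARC client; this interface is hypothetical."""

    def __init__(self, polls_until_done):
        self._polls_until_done = polls_until_done
        self._remaining = {}
        self.downloaded = []

    def submit(self, job_name):
        job_id = f"id-{job_name}"
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id):
        self._remaining[job_id] -= 1
        return "FINISHED" if self._remaining[job_id] <= 0 else "RUNNING"

    def download(self, job_id):
        self.downloaded.append(job_id)


def babysit(job_queue, client, submission_db, max_polls=100):
    """Submit all queued jobs, then poll until every result is downloaded."""
    # Submission phase: empty the job queue and record each job's status.
    while job_queue:
        job_id = client.submit(job_queue.popleft())
        submission_db[job_id] = "SUBMITTED"

    # Polling phase: download results of finished jobs.
    for _ in range(max_polls):
        pending = [j for j, s in submission_db.items() if s != "DOWNLOADED"]
        if not pending:
            return True
        for job_id in pending:
            if client.status(job_id) == "FINISHED":
                client.download(job_id)
                submission_db[job_id] = "DOWNLOADED"
    return all(s == "DOWNLOADED" for s in submission_db.values())


client = FakeArcClient(polls_until_done=2)
db = {}
all_ready = babysit(deque(["a", "b"]), client, db)
```

The return value mirrors the slide's rule that the user can retrieve the result only when all jobs are ready.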

Conclusion
We provide
  • declarative query interface for representing scientific queries
  • parallel query execution in Grid (generating scripts)
  • babysitter to keep track of job execution
Query parallelization is important

                        Standalone desktop   Grid, one job   Grid, four jobs
    Response time       190 min              225 min         24 min
    Requested CPU time  -                    200 min         20 min
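
For a quick sense of scale, the reported response times can be turned into ratios. The short Python sketch below only divides the numbers from the measurements above; the dictionary keys are hypothetical labels, and the sketch does not attempt to explain the ratios.

```python
# Response times in minutes, as reported in the measurements above.
response_min = {"standalone": 190, "grid_1_job": 225, "grid_4_jobs": 24}


def speedup(baseline_min, parallel_min):
    """Ratio of a baseline response time to the parallel response time."""
    return baseline_min / parallel_min


vs_grid_one_job = speedup(response_min["grid_1_job"], response_min["grid_4_jobs"])
vs_standalone = speedup(response_min["standalone"], response_min["grid_4_jobs"])
print(f"{vs_grid_one_job:.1f}x vs one Grid job, {vs_standalone:.1f}x vs standalone")
```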

Future work
Estimating query execution time
Dealing with underestimation of execution time
Automatic decisions on degree of parallelism and resource brokering
  • adaptive
  • based on current load and job statistics
Dealing with failures in Grid
POOL wrapper