optimized data migration within a system of linked medical research databases

25
Optimized Data Migration within a System of Linked Medical Research Databases By Jared Christopherso n U. of Connecticut

Upload: fayola

Post on 24-Feb-2016

34 views

Category:

Documents


0 download

DESCRIPTION

Optimized Data Migration within a System of Linked Medical Research Databases. By Jared Christopherson U. of Connecticut. Problem to Address. Medical research requires connections to multiple hospitals, institutions, or online databases Data must be compiled manually - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Optimized Data Migration within a System of Linked Medical Research Databases

Optimized Data Migration within a System of Linked Medical Research

Databases By Jared

ChristophersonU. of

Connecticut

Page 2: Optimized Data Migration within a System of Linked Medical Research Databases

Problem to AddressMedical research

requires connections to multiple hospitals, institutions, or online databases

Data must be compiled manually

Process is time-consuming

Page 3: Optimized Data Migration within a System of Linked Medical Research Databases

General Project Goal

Page 4: Optimized Data Migration within a System of Linked Medical Research Databases

Project Goals:

Present data as though from a single sourceGive the researcher flexibility with viewing

the dataOptimize data flow with site caching

Page 5: Optimized Data Migration within a System of Linked Medical Research Databases

Real World Issue: Data Formatting

Page 6: Optimized Data Migration within a System of Linked Medical Research Databases

Real World Issue: Data FormattingTABLE NAME: hiv_resistance_1

[gene] [drug_class] [compound] [aa_mutation] [codon_mutation] [cite]

TABLE NAME: db_data[gene_name] [drug] [aa_info] [codon_info] [source]

TABLE NAME: ResearchDataHIV[geneName] [drugClass] [compoundInfo] [aaMutation] [codonMutation]

Page 7: Optimized Data Migration within a System of Linked Medical Research Databases

Solution: Master Template hiv_resistance_1: gene

db_data: gene_name

ResearchDataHIV: geneName

Gene

hiv_resistance_1: cite

db_data: source Source

hiv_resistance_1: codon_mutation

db_data: codon_info

ResearchDataHIV: codonMutation

Codon Mutation

Master Template: [Gene] [Source] [Codon Mutation]

Page 8: Optimized Data Migration within a System of Linked Medical Research Databases

Master Templates and Display Templates

Database Store

ID: 1Stanford Cancer Database

64.434.343.99Admin/password

ID: 2UConn Database

44.254.292.34Admin/password

ID: 3HIV Research Database

23.32.232.19Admin/password

Master Template - Cancer

Master Template – HIV Resistance

[Age] [Cancer Type] [Drug Class]

ID -> Table Name -> Field Name

[Compound] [Codon] [Result]

ID -> Table Name -> Field Name

Display Template 1

[Age] [Cancer Type] [Drug Class]

Display Template 2

[Cancer Type] [Drug Class]

Display Template 1

[Compound] [Result]

Display Template 2

[Compound] [Codon] [Result]

Page 9: Optimized Data Migration within a System of Linked Medical Research Databases

Basic FunctionalityProvides a simple search for usersResearchers have the option of selecting a pre-set

Display Template to only display data relevant to their needs

Queries each database individually according to the Master Template

Returns results and (optionally) compiles them into a single list

Use AJAX to return results for each database

Page 10: Optimized Data Migration within a System of Linked Medical Research Databases
Page 11: Optimized Data Migration within a System of Linked Medical Research Databases

Caching and OptimizationGoal: researchers should have fastest access

possible to the info they seekX-RAY or MRI images could be 2-5Mb in size

eachWhat if researchers in the US consistently

need access to data on a server in Asia?Local access would be fastest

Page 12: Optimized Data Migration within a System of Linked Medical Research Databases

USA

USAJAPAN

Caching Process

Caching Goal

Page 13: Optimized Data Migration within a System of Linked Medical Research Databases

Caching and Optimization: Possible SolutionsMove everything to a central serverMove records around as they are accessedCache everythingCache databases based on usage

Page 14: Optimized Data Migration within a System of Linked Medical Research Databases

Query and Result Set – No Caching

JAPAN

USA

SPAIN

Page 15: Optimized Data Migration within a System of Linked Medical Research Databases

Caching Process

JAPAN

USA

SPAIN

Page 16: Optimized Data Migration within a System of Linked Medical Research Databases

Caching Complete

JAPAN

USA

SPAIN

Page 17: Optimized Data Migration within a System of Linked Medical Research Databases

Region Caching

Page 18: Optimized Data Migration within a System of Linked Medical Research Databases

Database Caching Queue – What to cache?For each region, determines the top external

servers used based on a percentage of queries

Page 19: Optimized Data Migration within a System of Linked Medical Research Databases

Database Caching Queue

Need a method to determine the most heavily requested external databases for each region

Track statistics:Convert IP address -> region whenever a user

performs a searchIncrement result count for the record that

keeps track of the region ID and database ID

Page 20: Optimized Data Migration within a System of Linked Medical Research Databases

DB Queue to CacheUSA

DB1 (Japan) – 345 resultsDB7 (Spain) – 793 resultsDB16 (USA) – 539 resultsDB3 (Japan) – 491 results

JAPAN

DB1 (Japan) – 212 resultsDB7 (Spain) – 343 resultsDB16 (USA) – 112 resultsDB3 (Japan) – 312 results

USA

DB7 (Spain) – 49%DB3 (Japan) – 30%DB1 (Japan) – 21%

JAPAN

DB7 (Spain) – 75%DB16 (USA) – 25%

Page 21: Optimized Data Migration within a System of Linked Medical Research Databases

Caching and Optimization: Where to cache dataReal-world constraints

allow_cachesupersite

bandwidth cache_size

Page 22: Optimized Data Migration within a System of Linked Medical Research Databases

Iterate through regions

Get a list of servers by region

Filter by allow_cache Filter by supersite Sort by bandwidth

Are there more servers?

Iterate through server list

Sort by bandwidth Filter by non-supersite

Compare cache size needed by DB in queue to space available on server

Are there more DBs in the

queue?

Is space available?Cache DB

Iterate to next DB in the queue

Drop DB from the queue

YES

YES

Cache complete

NO

YES

NO

NO

Are there more servers?

Are there more DBs in the

queue?

Cache DB

Drop DB from the queue

YES

Cache complete

NO

Iterate through server list

Compare cache size needed by DB in queue to space available on server

Is space available?

YES

Do any servers in the list have

space?

NO

NO

YES

YES

Page 23: Optimized Data Migration within a System of Linked Medical Research Databases

Caching and Optimization: Script ProcessRuns at frequency set by adminThis process continues for each region with

the program assigning data to servers with progressively lower bandwidth and cache_size scores until all the server space from that region is exhausted

Page 24: Optimized Data Migration within a System of Linked Medical Research Databases

Caching and Optimization: Script ProcessAt the end, each region should have as many

local copies of the most frequently requested databases as possible

Cached copies are read-only

Page 25: Optimized Data Migration within a System of Linked Medical Research Databases

Further Work and ImprovementsAllow different types of databases (DAL)Remove overlapping dataScript to determine when individual caches

need to be updated