LabKey Server ETL Workshop LabKey Software Friday September 20, 2013 1


Page 1: LabKey Server ETL Workshop

LabKey Software
Friday September 20, 2013

Page 2: Objectives

Understand basic workings of LabKey Server
  Administrator & developer views
Know how to use LabKey’s Query capability
Build a module to extend LabKey
  Update data model with incremental scripts
  Expose data & metadata to LabKey Server
Learn ETL Options
  Run ETLs
  Create Simple ETLs

Page 3: Course format

Alternate talking & doing
Using Amazon-hosted VMs running LabKey Server + SQL Server
  Run via Remote Desktop
  Everyone has VM with full admin rights
  Everyone has own SQL Server instance
Workshop, not one-way training

Page 4: Caveats

Never done this before
  Probably “bugs” in course material
The code is fresh
  Code from LabKey “trunk”
  Basic ETL Services in Place
  Extending over next few months
Keeping fingers crossed for reliable wifi

Page 5: Agenda

About LabKey Server
Getting Connected
LabKey Folder Setup
Data in LabKey
LabKey SQL
Database & Module Architecture
Building a Module
ETL in Modules
Q & A
https://hosted.labkey.com/project/ETLTraining/begin.view?

Page 6: LabKey Server

[Architecture diagram. Labels: LabKey Server (modular, Java-based web app); LabKey Database (PostgreSQL, MS SQL); LabKey Schemas; More Schemas; Oracle, MS SQL and MySQL databases; File System; File System 2; SAS Share; Data 1; Data 2]

Nelson et al., LabKey Server: An open source platform for scientific data integration, analysis and collaboration

Page 7: Getting Connected

See instructions on getting to your server at Amazon
Should connect via Remote Desktop
You can use SQL Management Studio to get direct access to database
Full admin gives you power to break anything
  Won’t be true in FHCRC environment

Page 8: Starting The Server

Start server with icon on desktop
  Production installs use a Windows Service
Use web browser on remote desktop machine
  You’ll connect to http://localhost:8080/labkey
Set up a site administrator password
Server will “upgrade itself”
  Run SQL Scripts to initialize modules
  We’ll go over this process later when you build your own modules

Page 9: Basic Organization and Security

Site is a server administration level
  Connectivity to resources, site-wide groups
Projects are top-level folders
  Add groups, customized interfaces
Subfolders secure subsets of data
Physically each container is a row in a database with a GUID
  Other tables often have “container” column
Try the tutorial

Page 10: Data Connectivity in LabKey

A relational data store designed for scientists
  Built on a robust SQL database
  Property and vocabulary service
  Secure SQL query service
  Data grid for exploring data
  File sharing and linking

[Diagram labels: UI, ETL or Custom Application; SQL Query or Table + Column List; API Layer; LabKey Query Service; Translated SQL; Relational DB; LabKey Server]

Page 11: LabKey data model terminology

Tabular data: data in the form of rows and columns
Schema: a named collection of related tables and queries
Metadata: information about the data contained in a tabular data set, including field names, types, formats, links
Query: a named, saved SQL SELECT statement written in LabKey SQL, can be parameterized
Custom grid view
  Subset of query functionality (field list, sort, filter)
  Intended for UI definition (not defined in SQL)
  Can do implicit joins via lookups

Page 12: Tutorial: Data Analysis

Import a spreadsheet into a list
Explore the data grid view of the list
  Sort
  Filter
  Paging
Create a scatter plot of the data
View the plot over subsets of the data
Change the ARVRegimen field to be a lookup

Page 13: Lookups in LabKey Server

“Lookup” is a special field type
  A field in one table whose values consist of key values from another table
  Target: the table whose key values are kept in the lookup
  Title field: attribute of the target, specifies the field of the target that will be displayed in place of the key values contained in the lookup
In SQL terms, known as a single-column FOREIGN KEY
Always many-to-one or one-to-one from lookup field table to target


Page 14: Uses of lookups

Display more meaningful data values
Allow users to explore data without writing SQL
Constrain user input to a fixed set of choices
Allow updating display values in one place
Add expression columns to base data sets
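
The lookup idea from the two slides above can be sketched outside LabKey in a few lines of Python. This is only a conceptual illustration, not LabKey code: the `arv_regimens` target table, its `Description` title field, and the sample values are hypothetical, and `display_rows` is a helper made up for this sketch.

```python
# Conceptual sketch of a lookup; not LabKey code. All names and values are hypothetical.

# Target table, keyed by its primary key. "Description" plays the role of the title field.
arv_regimens = {
    1: {"Description": "Regimen A"},
    2: {"Description": "Regimen B"},
}

# Table with a lookup field: ARVRegimen stores key values from the target table.
patients = [
    {"PatientId": "P1", "ARVRegimen": 1},
    {"PatientId": "P2", "ARVRegimen": 2},
    {"PatientId": "P3", "ARVRegimen": 1},  # many-to-one: several rows share one target row
]


def display_rows(rows, lookup_field, target, title_field):
    """Return copies of rows with the lookup key replaced by the target's title field."""
    shown = []
    for row in rows:
        key = row[lookup_field]
        target_row = target.get(key)
        # Fall back to the raw key when the target has no matching row.
        label = target_row[title_field] if target_row else key
        shown.append({**row, lookup_field: label})
    return shown


displayed = display_rows(patients, "ARVRegimen", arv_regimens, "Description")
for row in displayed:
    print(row["PatientId"], row["ARVRegimen"])
```

Because the stored value is still the key, changing a `Description` in the target table changes what every referring row displays, which is the "update display values in one place" use listed above.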

Page 15: Configuring fields

The Field Editor is the main UI for configuring field-level properties
For developer-defined tables, data is supplied in XML

Page 16: SQL In LabKey

LabKey allows folks to write SQL
  But they don’t get access to the underlying database
Within any folder, the available schemas can be browsed
Create new Queries
  Equivalent to database views

Page 17: Query Schema Browser

Page 18: New Query

Page 19: Query Web Part

Page 20: LabKey SQL vs MS SQL

Full SELECT Syntax
  Update/Insert/Merge accessible via ETL pipeline, APIs, UI
Easy lookup syntax replaces JOIN in many cases
Use || for string concat (like Oracle, PostgreSQL)
PIVOT Queries
GROUP_CONCAT
PARAMETERS

Page 21: Queries to Try

Joins
Group_Concat – All visits for a patient
PIVOT – one column for each visit
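
To get a feel for what these queries return before trying them on the server, here is a small sketch using Python's built-in `sqlite3` module. It runs SQLite SQL, not LabKey SQL: SQLite happens to share `||` and a `group_concat` function, but it has no PIVOT clause, so the pivot is written as conditional aggregation. The `visits` table and its values are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical visits table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient TEXT, visit INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("P1", 1, 70.0), ("P1", 2, 71.5), ("P2", 1, 82.0), ("P2", 2, 80.5)],
)

# String concatenation with ||.
labels = [
    row[0]
    for row in conn.execute(
        "SELECT patient || '-' || visit FROM visits ORDER BY patient, visit"
    )
]

# group_concat: all visits for a patient collapsed into one row.
# SQLite does not guarantee the order inside the concatenated string, so sort after splitting.
visits_per_patient = {
    patient: sorted(int(v) for v in joined.split(","))
    for patient, joined in conn.execute(
        "SELECT patient, group_concat(visit) FROM visits GROUP BY patient"
    )
}

# Pivot as conditional aggregation: one output column for each visit.
pivoted = conn.execute(
    """
    SELECT patient,
           MAX(CASE WHEN visit = 1 THEN weight END) AS visit1,
           MAX(CASE WHEN visit = 2 THEN weight END) AS visit2
    FROM visits
    GROUP BY patient
    ORDER BY patient
    """
).fetchall()

print(labels)
print(visits_per_patient)
print(pivoted)
```

The conditional-aggregation form needs the visit values to be known in advance; a PIVOT query avoids having to list one column expression per visit.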

Page 22: LabKey Modules

LabKey Server is Based on Modules
  Look in Admin->Folder Management->FolderType
Each module can provide
  HTML Views
  Javascript/CSS
  LabKey SQL Queries
    Enables easy movement of sets of queries between servers
  ETL Definitions
  Reports in R and JavaScript
  Database level schema definition
    Only run at restart so DBAs can approve
  XML to add metadata to database schema
  Java code

Page 23: Building first Module

See tutorial

Page 24: ETLs: Why In LabKey

Provenance
  For every row in HIDRA_Prime, know when & how it got there
Auditing
  For every row that leaves HIDRA_Prime, know when & how it left
  Down to individual patient info
History of all runs
Clear packaging & deployment
Re-invent the axle, but not the wheel…
  Use Stored Procs (coming soon)
  Wrap existing ETL Frameworks

Page 25: LabKey ETL Infrastructure

Still under development
Basic functionality is in place
  Query based ETLs
  Checkers (identify whether work is to be done)
  Scheduling
  Logging all output

Page 26: Still Not Done

User Interface
  Management User Interface
    Scheduling
    Lists of Transform Runs
    Detail views
ETL Creation
  Stored Procedure-based ETLs
  Support for external ETL packages (SSIS, Kettle)

Page 27: ETL Steps (from Design Spec)

Change identification
Initiation
Query
Transformation
Staging
Load/Merge
Finalize

Page 28: ETL Basics

ETLs are defined in etls directory of a module
  Each ETL is an XML file
Each ETL consists of a set of Transform Steps
Key Components of a Transform
  Source Query (LabKey SQL for now)
  Destination Table
    May be in unrelated database
  Filter Strategy
    Identifies rows to transform & if there is work to do
  Schedule

Page 29: Filter Strategies

Choose which rows to move to target table
SelectAllFilterStrategy
  Just get all the data, every time
ModifiedSinceFilterStrategy
  Rows with a DateTime column newer than last run
  Records most recent value
RunFilterStrategy
  Based on Incrementing Integer Value (e.g. Run ID)
  Any rows with higher value than last time are transferred
  Useful for rows written by previous ETLs
But can “forget” previous runs and re-run from scratch
  “Reset State” in the UI
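
The row-selection logic behind these strategies can be sketched in plain Python. This is a conceptual model under simplifying assumptions, not LabKey's implementation: the `state` dictionary, the function names, and the sample rows are all hypothetical. In this sketch the ModifiedSince and Run strategies reduce to the same rule ("value greater than the one remembered from last time"), applied to a timestamp column or an integer column respectively.

```python
from datetime import datetime


def select_all(rows, state):
    """SelectAll idea: return every row, every time; no state is kept."""
    return list(rows), state


def newer_than_last(rows, state, column):
    """ModifiedSince / Run idea: rows whose column value exceeds the remembered one.

    Returns the selected rows and a new state holding the most recent value seen.
    """
    last = state.get(column)
    picked = [r for r in rows if last is None or r[column] > last]
    new_state = dict(state)
    if picked:
        new_state[column] = max(r[column] for r in picked)
    return picked, new_state


source = [
    {"id": 1, "modified": datetime(2013, 9, 1), "run_id": 10},
    {"id": 2, "modified": datetime(2013, 9, 10), "run_id": 11},
    {"id": 3, "modified": datetime(2013, 9, 20), "run_id": 12},
]

state = {}
first, state = newer_than_last(source, state, "modified")    # first run: everything
second, state = newer_than_last(source, state, "modified")   # nothing changed: no work

source[0]["modified"] = datetime(2013, 9, 21)                # touch one row
third, state = newer_than_last(source, state, "modified")    # only the touched row

by_run, _ = newer_than_last(source, {"run_id": 11}, "run_id")  # integer-based variant

state = {}                                                   # "Reset State"
after_reset, state = newer_than_last(source, state, "modified")

print(len(first), len(second), [r["id"] for r in third], len(after_reset))
```

The second call returning no rows is the "checker" behaviour: with an incremental filter, an ETL has no work to do until a source row gets a newer value in the tracked column.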

Page 30: Target Options

How to add data to target table
truncate
  Delete all rows and add the selected ones
append
  Add new rows to the target table
  Will fail if duplicate primary keys
merge
  Update or Insert
  Matches Primary Keys
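
The three target options can be tried out against an in-memory SQLite table. This is a sketch of the semantics described above, not of how LabKey performs the load: the `load` and `snapshot` helpers and the two-column `etl_target` layout are assumptions made for illustration.

```python
import sqlite3


def load(conn, selected, target_option):
    """Write the selected (id, name) rows into etl_target using the given option."""
    insert = "INSERT INTO etl_target (id, name) VALUES (?, ?)"
    if target_option == "truncate":
        # Delete all rows, then add the selected ones.
        conn.execute("DELETE FROM etl_target")
        conn.executemany(insert, selected)
    elif target_option == "append":
        # Plain insert: raises sqlite3.IntegrityError on a duplicate primary key.
        conn.executemany(insert, selected)
    elif target_option == "merge":
        # Update rows whose primary key already exists, insert the rest.
        for row_id, name in selected:
            cur = conn.execute(
                "UPDATE etl_target SET name = ? WHERE id = ?", (name, row_id)
            )
            if cur.rowcount == 0:
                conn.execute(insert, (row_id, name))
    else:
        raise ValueError(f"unknown target option: {target_option}")
    conn.commit()


def snapshot(conn):
    return conn.execute("SELECT id, name FROM etl_target ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_target (id INTEGER PRIMARY KEY, name TEXT)")

load(conn, [(1, "a"), (2, "b")], "append")
load(conn, [(2, "B"), (3, "c")], "merge")
after_merge = snapshot(conn)

try:
    load(conn, [(3, "dup")], "append")
    append_failed = False
except sqlite3.IntegrityError:
    conn.rollback()
    append_failed = True

load(conn, [(9, "z")], "truncate")
after_truncate = snapshot(conn)

print(after_merge, append_failed, after_truncate)
```

The merge branch uses update-then-insert rather than a database-specific upsert statement so that the matching on primary keys stays visible in the code.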

Page 31: Overwrite Full Table Every Hour

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="truncate"/>
    </transform>
  </transforms>
  <incrementalFilter className="SelectAllFilterStrategy" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>

Page 32: Merge Changed Rows

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>
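
When writing ETL files by hand, a quick well-formedness check can catch problems such as curly quotes around attribute values before the module is deployed. The sketch below parses the merge definition above with Python's standard `xml.etree.ElementTree` and summarizes it. It only reads the elements and attributes shown on the slide; it is not a LabKey tool and does not validate against the ETL schema.

```python
import xml.etree.ElementTree as ET

# The merge ETL definition from the slide, as bytes so the encoding declaration is honoured.
ETL_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>"""

# All elements live in the ETL namespace declared on the root element.
NS = {"etl": "http://labkey.org/etl/xml"}

root = ET.fromstring(ETL_XML)  # raises ET.ParseError if the XML is not well-formed

transforms = []
for transform in root.findall("etl:transforms/etl:transform", NS):
    source = transform.find("etl:source", NS)
    destination = transform.find("etl:destination", NS)
    transforms.append(
        (
            transform.get("id"),
            f'{source.get("schemaName")}.{source.get("queryName")}',
            f'{destination.get("schemaName")}.{destination.get("queryName")}',
            destination.get("targetOption"),
        )
    )

summary = {
    "name": root.findtext("etl:name", namespaces=NS),
    "transforms": transforms,
    "filter": root.find("etl:incrementalFilter", NS).get("className"),
    "poll": root.find("etl:schedule/etl:poll", NS).get("interval"),
}
print(summary)
```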

Page 33: Storing ETL Information

A couple of key tables in the dataintegration schema
TransformConfiguration
  One row for each ETL
  Controls whether ETL is active
  Quick access to state of last run
TransformRun
  Stores information about every transform
  Success or Failure
  Total # of rows transferred
Pipeline
  Detailed log of steps

Page 34: Try an Early HIDRA ETL

Page 35: Enable hidra and hidra_uw_intake

Page 36: Amalga_Import has some Data

Page 37: Let’s Try a Transform

Page 38: A look inside

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Amalga to hidraPrime - Patients</name>
  <description>Move uw_patient, uw_patientidentifier, uw_encounter from Amalga to hidraPrime</description>
  <transforms>
    <transform id="patient">
      <source schemaName="AmalgaImport_queries" queryName="uw_patient" timestampColumnName="updtDtTm" />
      <destination schemaName="hidraPrime" queryName="Patient" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_mrn">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_mrn" timestampColumnName="lastUpdateTime"/>
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_epi">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_epi" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="encounter">
      <source schemaName="AmalgaImport_queries" queryName="uw_encounter" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="Encounter" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="lastUpdateTime" />
</etl>

Files in: C:\LabKey\modules\hidra_uw_intake

Page 39: Patient Query

SELECT
  (SELECT OID FROM AmalgaImport_azAEID.AEID204 WHERE AEID204.EIDForOID=UW_PID601.EIDForOID) AS GPID,
  LName AS LastName,
  FName AS FirstName,
  MName AS MiddleName,
  MotherMaidenName AS MaidenNameMother,
  DOB,
  Sex AS Gender,
  Language AS PrimaryLanguage,
  PatientAlias,
  Race,
  Street1 AS AddressLine1,
  Street2 AS AddressLine2,…
FROM AmalgaImport_azADT.UW_PID601

Page 40: Run Again

Nothing Happens
Change some Data in Amalga_Import.azADT.UW_PID601
  Remember to update updtDtTm field
Now try again


Page 42: Data in external databases

Researchers often have data in existing relational databases
  LIMS systems
  Clinical data
  Locally-developed applications
LabKey Server offers two mechanisms to incorporate this data
  Define an external schema connection (link)
  Use Extract, Transform and Load support (copy)

Page 43: Database tables and LabKey Server modules

LabKey Server consists of many separate modules
Server modules usually contain SQL scripts to create the database objects used by the module
  CREATE or ALTER, TABLES and VIEWs in native syntax
  Schema usually specific to a module
  Supported DBs: PostgreSQL and Microsoft SQL Server
  Script runner figures out which scripts are needed for upgrade
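
One way to picture how a script runner might decide which incremental scripts to apply is sketched below. Everything here is an assumption made for illustration: the `schema-from-to.sql` file naming, the version format, and the "follow the chain from the installed version" rule are a simplified model, and LabKey's actual script runner may work differently.

```python
import re

# Assumed naming convention for this sketch: <schema>-<fromVersion>-<toVersion>.sql
SCRIPT_NAME = re.compile(r"^(\w+)-(\d+)\.(\d+)-(\d+)\.(\d+)\.sql$")


def parse_version(text):
    """Turn '13.10' into (13, 10) so versions compare numerically."""
    major, minor = text.split(".")
    return int(major), int(minor)


def scripts_to_run(filenames, schema, installed_version):
    """Return the scripts that chain from the installed version to the newest one."""
    candidates = []
    for name in filenames:
        match = SCRIPT_NAME.match(name)
        if match and match.group(1) == schema:
            start = (int(match.group(2)), int(match.group(3)))
            end = (int(match.group(4)), int(match.group(5)))
            candidates.append((start, end, name))

    plan = []
    current = parse_version(installed_version)
    while True:
        # Scripts that begin exactly at the current version and move it forward.
        steps = [c for c in candidates if c[0] == current and c[1] > current]
        if not steps:
            return plan
        # Prefer the script that jumps furthest ahead.
        _, end, name = max(steps, key=lambda c: c[1])
        plan.append(name)
        current = end


files = [
    "hidra-0.00-13.10.sql",
    "hidra-13.10-13.20.sql",
    "hidra-13.20-13.21.sql",
    "other-0.00-13.10.sql",
]

fresh_install = scripts_to_run(files, "hidra", "0.00")
upgrade = scripts_to_run(files, "hidra", "13.10")
up_to_date = scripts_to_run(files, "hidra", "13.21")
print(fresh_install, upgrade, up_to_date)
```

In this model a fresh install runs every script for the schema in order, an upgrade runs only the scripts past the installed version, and an up-to-date schema runs nothing.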

Page 44

After install or upgrade, the SQL sent to the database is:
  Mostly SELECTs and 1-row UPDATE/INSERT/DELETE
  SELECTs can be issued by a user or an application in LabKey SQL
  LabKey translates into the back-end database dialect

Page 45: External schemas and data sources

Provides a way to link from LabKey Server to another data source so that LabKey’s functions and Client API work directly on the external data
LabKey translates its own SQL into the dialect of the external schema
Supported databases include Oracle, SAS, and MySQL in addition to Postgres and SQL Server
Options:
  Expose only some tables to LabKey
  Read only or read/write
  Implement folder-based security if a containerId is included
  Add additional metadata (for example, field display properties) via an XML file

Page 46: Folders, files and tabular data

[Diagram labels: Files, Proteomics, Flow, Folder 1, Folder 2]

Tabular data rows and files are visible in folders