LabKey Server ETL Workshop

LabKey Software, Friday, September 20, 2013

Objectives

- Understand basic workings of LabKey Server
  - Administrator & developer views
- Know how to use LabKey’s Query capability
- Build a module to extend LabKey
  - Update data model with incremental scripts
  - Expose data & metadata to LabKey Server
- Learn ETL options
  - Run ETLs
  - Create simple ETLs

Course format

- Alternate talking & doing
- Using Amazon-hosted VMs running LabKey Server + SQL Server
  - Run via Remote Desktop
  - Everyone has a VM with full admin rights
  - Everyone has their own SQL Server instance
- A workshop, not one-way training

Caveats

- Never done this before
  - Probably “bugs” in course material
- The code is fresh
  - Code from LabKey “trunk”
  - Basic ETL services in place
  - Extending over next few months
- Keeping fingers crossed for reliable wifi

Agenda

- About LabKey Server
- Getting Connected
- LabKey Folder Setup
- Data in LabKey
- LabKey SQL
- Database & Module Architecture
- Building a Module
- ETL in Modules
- Q & A

https://hosted.labkey.com/project/ETLTraining/begin.view?

LabKey Server

[Architecture diagram: LabKey Server, a modular, Java-based web app, backed by the LabKey database (PostgreSQL, MS SQL) holding the LabKey schemas and more schemas, and connected to file systems, a SAS share, and Oracle, MS SQL, and MySQL databases.]

Nelson et al., LabKey Server: An open source platform for scientific data integration, analysis and collaboration

Getting Connected

- See instructions on getting to your server at Amazon
- Should connect via Remote Desktop
- You can use SQL Management Studio to get direct access to the database
- Full admin gives you power to break anything
  - Won’t be true in FHCRC environment

Starting The Server

- Start server with icon on desktop
  - Production installs use a Windows Service
- Use web browser on remote desktop machine
  - You’ll connect to http://localhost:8080/labkey
- Set up a site administrator password
- Server will “upgrade itself”
  - Run SQL scripts to initialize modules
  - We’ll go over this process later when you build your own modules

Basic Organization and Security

- Site is the server administration level
  - Connectivity to resources, site-wide groups
- Projects are top-level folders
  - Add groups, customized interfaces
- Subfolders secure subsets of data
- Physically, each container is a row in a database with a GUID
  - Other tables often have a “container” column
- Try the tutorial
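The container idea can be sketched with a small in-memory database. This is only an illustration: the table names, column names, and folder names below are hypothetical and are not LabKey's actual core schema. It shows a folder stored as a row keyed by a GUID, and a data table scoped to a folder through a "container" column.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE containers (entityid TEXT PRIMARY KEY, name TEXT, parent TEXT);
    CREATE TABLE samples (rowid INTEGER PRIMARY KEY, container TEXT, label TEXT);
""")

# Hypothetical project and subfolder, each identified by a GUID.
project_id = str(uuid.uuid4())
subfolder_id = str(uuid.uuid4())
conn.executemany(
    "INSERT INTO containers VALUES (?, ?, ?)",
    [(project_id, "ETLTraining", None), (subfolder_id, "Clinic A", project_id)],
)
conn.executemany(
    "INSERT INTO samples (container, label) VALUES (?, ?)",
    [(project_id, "S-1"), (subfolder_id, "S-2"), (subfolder_id, "S-3")],
)


def rows_in_folder(container_id):
    """Return only the rows whose container column matches the folder's GUID."""
    cur = conn.execute(
        "SELECT label FROM samples WHERE container = ? ORDER BY label",
        (container_id,),
    )
    return [row[0] for row in cur]


subfolder_rows = rows_in_folder(subfolder_id)
```

Filtering every query on the container column is what lets one physical table hold data for many folders while each folder shows only its own rows.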

Data Connectivity in LabKey

- A relational data store designed for scientists
- Built on a robust SQL database
- Property and vocabulary service
- Secure SQL query service
- Data grid for exploring data
- File sharing and linking

[Diagram: a UI, ETL, or custom application sends a SQL query, or a table + column list, through the API layer to the LabKey Query Service inside LabKey Server, which passes translated SQL to the relational DB.]

LabKey data model terminology

- Tabular data: data in the form of rows and columns
- Schema: a named collection of related tables and queries
- Metadata: information about the data contained in a tabular data set, including field names, types, formats, links
- Query: a named, saved SQL SELECT statement written in LabKey SQL, can be parameterized
- Custom grid view
  - Subset of query functionality (field list, sort, filter)
  - Intended for UI definition (not defined in SQL)
  - Can do implicit joins via lookups

Tutorial: Data Analysis

- Import a spreadsheet into a list
- Explore the data grid view of the list
  - Sort
  - Filter
  - Paging
- Create a scatter plot of the data
- View the plot over subsets of the data
- Change the ARVRegimen field to be a lookup

Lookups in LabKey Server

- “Lookup” is a special field type
- A field in one table whose values consist of key values from another table
- Target: the table whose key values are kept in the lookup
- Title field: attribute of the target, specifies the field of the target that will be displayed in place of the key values contained in the lookup
- In SQL terms, known as a single-column FOREIGN KEY
- Always many-to-one or one-to-one from lookup field table to target
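A lookup can be pictured as an ordinary single-column foreign key plus a display rule. The sketch below uses SQLite only for illustration; the table and column names are made up for the example, and the join stands in for what a lookup does implicitly when it shows the target's title field in place of the stored key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Target table: 'title' plays the role of the title field.
    CREATE TABLE regimens (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Lookup field: arv_regimen holds key values from regimens.
    CREATE TABLE patients (
        patient_id TEXT PRIMARY KEY,
        arv_regimen INTEGER REFERENCES regimens(id)
    );
    INSERT INTO regimens VALUES (1, 'Regimen A'), (2, 'Regimen B');
    INSERT INTO patients VALUES ('P1', 1), ('P2', 2), ('P3', 1);
""")


def display_rows():
    """Show the target's title in place of the stored key (many-to-one)."""
    return conn.execute("""
        SELECT p.patient_id, r.title
        FROM patients p LEFT JOIN regimens r ON r.id = p.arv_regimen
        ORDER BY p.patient_id
    """).fetchall()


before = display_rows()
# Changing the display value in one place updates every row that points at it.
conn.execute("UPDATE regimens SET title = 'Regimen A (revised)' WHERE id = 1")
after = display_rows()
```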

Uses of lookups

- Display more meaningful data values
- Allow users to explore data without writing SQL
- Constrain user input to a fixed set of choices
- Allow updating display values in one place
- Add expression columns to base data sets

Configuring fields

- The Field Editor is the main UI for configuring field-level properties
- For developer-defined tables, data is supplied in XML

SQL In LabKey

- LabKey allows folks to write SQL
  - But they don’t get access to the underlying database
- Within any folder, the available schemas can be browsed
- Create new queries
  - Equivalent to database views

Query Schema Browser

New Query

Query Web Part

LabKey SQL vs MS SQL

- Full SELECT syntax
  - Update/Insert/Merge accessible via ETL pipeline, APIs, UI
- Easy lookup syntax replaces JOIN in many cases
- Use || for string concat (like Oracle, PostgreSQL)
- PIVOT queries
- GROUP_CONCAT
- PARAMETERS

Queries to Try

- Joins
- Group_Concat: all visits for a patient
- PIVOT: one column for each visit
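To make the expected result shapes concrete, here is a plain-Python illustration of what the GROUP_CONCAT and PIVOT exercises compute. It is not LabKey SQL, and the patient, visit, and value data are invented for the example.

```python
from collections import defaultdict

# Hypothetical (patient, visit, value) rows.
rows = [
    ("P1", "V1", 5.0),
    ("P1", "V2", 6.5),
    ("P2", "V1", 4.2),
]

# GROUP_CONCAT-style result: one row per patient, all visits in one string.
visits_by_patient = defaultdict(list)
for patient, visit, _value in rows:
    visits_by_patient[patient].append(visit)
concat = {p: ",".join(sorted(v)) for p, v in visits_by_patient.items()}

# PIVOT-style result: one row per patient, one column per distinct visit.
visit_names = sorted({visit for _patient, visit, _value in rows})
pivot = {}
for patient, visit, value in rows:
    # Start each patient with an empty cell (None) for every visit column.
    pivot.setdefault(patient, {name: None for name in visit_names})[visit] = value
```

Here `concat` maps each patient to a comma-separated visit list, and `pivot` holds one entry per patient with a cell for every visit, left as `None` where the patient has no value.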

LabKey Modules

- LabKey Server is based on modules
  - Look in Admin->Folder Management->FolderType
- Each module can provide
  - HTML views
  - JavaScript/CSS
  - LabKey SQL queries
    - Enables easy movement of sets of queries between servers
  - ETL definitions
  - Reports in R and JavaScript
  - Database-level schema definition
    - Only run at restart so DBAs can approve
  - XML to add metadata to database schema
  - Java code

Building first Module

- See tutorial

ETLs: Why In LabKey

- Provenance
  - For every row in HIDRA_Prime, know when & how it got there
- Auditing
  - For every row that leaves HIDRA_Prime, know when & how it left
  - Down to individual patient info
- History of all runs
- Clear packaging & deployment
- Re-invent the axle, but not the wheel…
  - Use stored procs (coming soon)
  - Wrap existing ETL frameworks

LabKey ETL Infrastructure

- Still under development
- Basic functionality is in place
  - Query-based ETLs
  - Checkers (identify whether work is to be done)
  - Scheduling
  - Logging all output

Still Not Done

- User interface
  - Management user interface
    - Scheduling
    - Lists of transform runs
    - Detail views
  - ETL creation
- Stored procedure-based ETLs
- Support for external ETL packages (SSIS, Kettle)

ETL Steps (from Design Spec)

- Change identification
- Initiation
- Query
- Transformation
- Staging
- Load/Merge
- Finalize

ETL Basics

- ETLs are defined in the etls directory of a module
  - Each ETL is an XML file
- Each ETL consists of a set of transform steps
- Key components of a transform
  - Source query (LabKey SQL for now)
  - Destination table
    - May be in unrelated database
  - Filter strategy
    - Identifies rows to transform & if there is work to do
  - Schedule

Filter Strategies

- Choose which rows to move to target table
- SelectAllFilterStrategy
  - Just get all the data, every time
- ModifiedSinceFilterStrategy
  - Rows with a DateTime column newer than last run
  - Records most recent value
- RunFilterStrategy
  - Based on incrementing integer value (e.g. Run ID)
  - Any rows with higher value than last time are transferred
  - Useful for rows written by previous ETLs
- But can “forget” previous runs and re-run from scratch
  - “Reset State” in the UI
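The three strategies can be sketched in a few lines of Python. This is a toy model of the behavior described on the slide, not LabKey's implementation: the function names, the `state` dictionary, and the row layout are all assumptions made for the example.

```python
from datetime import datetime


def select_all(rows, state):
    """SelectAll-style: take every row, every time."""
    return list(rows)


def modified_since(rows, state, column="modified"):
    """ModifiedSince-style: rows whose timestamp is newer than the last one seen."""
    last = state.get("last_timestamp")
    picked = [r for r in rows if last is None or r[column] > last]
    if picked:
        # Record the most recent value so the next run starts from here.
        state["last_timestamp"] = max(r[column] for r in picked)
    return picked


def run_filter(rows, state, column="run_id"):
    """Run-style: rows whose incrementing integer is higher than last time."""
    last = state.get("last_run_id")
    picked = [r for r in rows if last is None or r[column] > last]
    if picked:
        state["last_run_id"] = max(r[column] for r in picked)
    return picked


def reset_state(state):
    """Forget previous runs so the next run starts from scratch."""
    state.clear()


source = [
    {"id": 1, "modified": datetime(2013, 9, 1), "run_id": 10},
    {"id": 2, "modified": datetime(2013, 9, 15), "run_id": 11},
]
state = {}
first = modified_since(source, state)    # both rows: no previous run
second = modified_since(source, state)   # no rows: nothing changed
source[0]["modified"] = datetime(2013, 9, 20)
third = modified_since(source, state)    # only the row that was touched
```

An empty result is also the "is there work to do" signal: when a strategy selects no rows, the transform has nothing to move.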

Target Options

- How to add data to target table
- truncate
  - Delete all rows and add the selected ones
- append
  - Add new rows to the target table
  - Will fail if duplicate primary keys
- merge
  - Update or insert
  - Matches primary keys
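As a rough model of the three target options, the sketch below treats the target table as a Python dict keyed by primary key. The function name `load_rows` and the `id` key column are hypothetical; a real load runs against database tables inside a transaction.

```python
def load_rows(target, rows, option, key="id"):
    """Apply rows to target (a dict keyed by primary key) using a target option."""
    if option == "truncate":
        # Delete all existing rows, then add the selected ones.
        target.clear()
        for row in rows:
            target[row[key]] = dict(row)
    elif option == "append":
        # Fail before writing anything if a primary key already exists.
        seen = set(target)
        for row in rows:
            if row[key] in seen:
                raise ValueError(f"duplicate primary key: {row[key]!r}")
            seen.add(row[key])
        for row in rows:
            target[row[key]] = dict(row)
    elif option == "merge":
        # Update rows whose primary key matches, insert the rest.
        for row in rows:
            target[row[key]] = {**target.get(row[key], {}), **row}
    else:
        raise ValueError(f"unknown target option: {option!r}")
    return target


target = {1: {"id": 1, "name": "old"}}
incoming = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]

merged = load_rows(dict(target), incoming, "merge")
truncated = load_rows(dict(target), [{"id": 3, "name": "only"}], "truncate")
try:
    load_rows(dict(target), incoming, "append")
    append_failed = False
except ValueError:
    append_failed = True  # id 1 already exists in the target
```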

Overwrite Full Table Every Hour

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="truncate"/>
    </transform>
  </transforms>
  <incrementalFilter className="SelectAllFilterStrategy" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>

Merge Changed Rows

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>
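One way to check an ETL definition before deploying it is to parse it and list its parts. The sketch below reads the merge example with Python's standard library. The `summarize` helper is a hypothetical name, and the script only inspects the XML; it does not validate it against LabKey's schema.

```python
import xml.etree.ElementTree as ET

ETL_XML = """<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="merge" />
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>"""

# Every element lives in the ETL namespace, so lookups need a prefix map.
NS = {"etl": "http://labkey.org/etl/xml"}


def summarize(xml_text):
    """Return the name, transforms, filter strategy, and schedule of an ETL file."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    transforms = []
    for transform in root.findall("etl:transforms/etl:transform", NS):
        source = transform.find("etl:source", NS)
        destination = transform.find("etl:destination", NS)
        transforms.append({
            "id": transform.get("id"),
            "source": f'{source.get("schemaName")}.{source.get("queryName")}',
            "destination": f'{destination.get("schemaName")}.{destination.get("queryName")}',
            "targetOption": destination.get("targetOption"),
        })
    return {
        "name": root.findtext("etl:name", namespaces=NS),
        "transforms": transforms,
        "filter": root.find("etl:incrementalFilter", NS).get("className"),
        "interval": root.find("etl:schedule/etl:poll", NS).get("interval"),
    }


summary = summarize(ETL_XML)
```

A definition pasted with typographic quotes around an attribute value (” instead of ") is not well-formed XML, so `ET.fromstring` would raise a `ParseError` on it; running a check like this catches that before the server does.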

Storing ETL Information

- Couple of key tables in the dataintegration schema
- TransformConfiguration
  - One row for each ETL
  - Controls whether ETL is active
  - Quick access to state of last run
- TransformRun
  - Stores information about every transform
  - Success or failure
  - Total # of rows transferred
- Pipeline
  - Detailed log of steps

Try an Early HIDRA ETL

Enable hidra and hidra_uw_intake

Amalga_Import has some Data

Let’s Try a Transform

A look inside

Files in: C:\LabKey\modules\hidra_uw_intake

<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Amalga to hidraPrime - Patients</name>
  <description>Move uw_patient, uw_patientidentifier, uw_encounter from Amalga to hidraPrime</description>
  <transforms>
    <transform id="patient">
      <source schemaName="AmalgaImport_queries" queryName="uw_patient" timestampColumnName="updtDtTm" />
      <destination schemaName="hidraPrime" queryName="Patient" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_mrn">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_mrn" timestampColumnName="lastUpdateTime"/>
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_epi">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_epi" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="encounter">
      <source schemaName="AmalgaImport_queries" queryName="uw_encounter" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="Encounter" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="lastUpdateTime" />
</etl>

Patient Query

SELECT
  (SELECT OID FROM AmalgaImport_azAEID.AEID204
   WHERE AEID204.EIDForOID = UW_PID601.EIDForOID) AS GPID,
  LName AS LastName,
  FName AS FirstName,
  MName AS MiddleName,
  MotherMaidenName AS MaidenNameMother,
  DOB,
  Sex AS Gender,
  Language AS PrimaryLanguage,
  PatientAlias,
  Race,
  Street1 AS AddressLine1,
  Street2 AS AddressLine2,
  …
FROM AmalgaImport_azADT.UW_PID601

Run Again

- Nothing happens
- Change some data in Amalga_Import.azADT.UW_PID601
  - Remember to update the updtDtTm field
- Now try again

Data in external databases

- Researchers often have data in existing relational databases
  - LIMS systems
  - Clinical data
  - Locally-developed applications
- LabKey Server offers two mechanisms to incorporate this data
  - Define an external schema connection (link)
  - Use Extract, Transform and Load support (copy)

Database tables and LabKey Server modules

- LabKey Server consists of many separate modules
- Server modules usually contain SQL scripts to create the database objects used by the module
  - CREATE or ALTER, TABLEs and VIEWs in native syntax
  - Schema usually specific to a module
  - Supported DBs: PostgreSQL and Microsoft SQL Server
  - Script runner figures out which scripts are needed for upgrade
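The script runner's job can be pictured as choosing a chain of incremental scripts that leads from the installed schema version to the newest one. The sketch below is a simplified model, not LabKey's code. It assumes, purely for illustration, script files named `schema-fromVersion-toVersion.sql`; the `hidra` file names and version numbers are invented.

```python
def parse_script_name(name):
    """Split a hypothetical 'schema-from-to.sql' name into its three parts."""
    stem = name[: -len(".sql")]
    schema, from_version, to_version = stem.rsplit("-", 2)
    return schema, float(from_version), float(to_version)


def scripts_to_run(names, schema, installed_version):
    """Pick the scripts that upgrade one schema from its installed version."""
    candidates = []
    for name in names:
        script_schema, from_version, to_version = parse_script_name(name)
        if script_schema == schema:
            candidates.append((from_version, to_version, name))

    plan = []
    current = installed_version
    while True:
        # Scripts that start exactly where the schema currently is.
        steps = [c for c in candidates if c[0] == current and c[1] > current]
        if not steps:
            break
        # Prefer the script that advances the furthest.
        _from_version, to_version, name = max(steps, key=lambda c: c[1])
        plan.append(name)
        current = to_version
    return plan


available = [
    "hidra-0.00-13.10.sql",
    "hidra-13.10-13.11.sql",
    "hidra-13.11-13.12.sql",
    "other-0.00-13.10.sql",
]
fresh_install = scripts_to_run(available, "hidra", 0.0)
upgrade = scripts_to_run(available, "hidra", 13.10)
```

With this model a fresh install runs all three `hidra` scripts in order, while a server already at version 13.10 runs only the last two.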

- After install or upgrade, the SQL sent to the database is:
  - Mostly SELECTs and 1-row UPDATE/INSERT/DELETE
- SELECTs can be issued by a user or an application in LabKey SQL
  - LabKey translates into the back-end database dialect

External schemas and data sources

- Provides a way to link from LabKey Server to another data source, so that LabKey’s functions and Client API work directly on the external data
- LabKey translates its own SQL into the dialect of the external schema
- Supported databases include Oracle, SAS, and MySQL in addition to Postgres and SQL Server
- Options:
  - Make only some tables exposed to LabKey
  - Read only or read/write
  - Implement folder-based security if a containerId is included
  - Add additional metadata (for example, field display properties) via an XML file

Folders, files and tabular data

[Diagram: Folder 1 and Folder 2, each holding files, proteomics, and flow data.]

Tabular data rows and files are visible in folders
