Services for Sensitive Research Data
Gard Thomassen, PhD
Head of Research Support Services Group
Leader of the “Services for Sensitive Data” project
University Center for Information Technology (USIT)
University of Oslo
Outline
• What is sensitive data?
• Who has sensitive data?
• Project background
• Collaborators and reference group
• System requirements
• System outline
• Technical and security details
• Maintenance
• Advantages and current status
• International collaborations
Gard Thomassen,TSD 2.0
Who has sensitive data?
• Faculty of Medicine / Oslo University Hospital
• Faculty of Theology
• Faculty of Educational Sciences
• Faculty of Social Sciences
• And so the list continues, also outside UiO
Project background
• UiO has an open network structure, but still with a high level of security
• Most of the UiO data is open
• Various UiO/OUS researchers approached USIT asking for an eInfrastructure for sensitive data (the majority was MR images and NGS data)
• The pilot project TSD 1.0 was run
Lessons learned
• The need for our services far exceeded the scalability of our system
• Too much hands-on maintaining and manual setup of new projects and new users
• There is a need for a High Performance Computing (HPC) resource within a secure environment
• Not very user friendly (both ends)
Main collaborators on TSD 2.0
Collaborators
• Norwegian Storage Infrastructure (NorStore)
• Norwegian Genetics Analysis Platform (GenAP)
• Norwegian Dietary Registry (Faculty of Medicine)
• Institute of Psychology (Faculty of Social Sciences)
• Norwegian Cancer Sequencing Consortium (NCGC)
Reference group
Oslo University Hospital, NorStore, Regional Ethical Committee, National Institute of Public Health, Norwegian Cancer Registry, Research Network at OUS, Elixir Norway, NCGC, GenAP and Institute of Psychology, UiO.
System requirements
• Security, isolation and access control as given by law
• Large storage capacity
• Multiple users
• High performance computing resource
• High bandwidth
• Easy to maintain
• Easy to use (including audio and video)
• Some freedom within user space
• Accessible from anywhere through authentication
• A variety of software and public DBs must be available
• Windows and Linux support (OS X if possible)
• Data collection service
• Data sharing service
• National scope (so far)
Solution outline
System outline
[Diagram. Components shown: Internet, Gateway, VM server, HPC (Colossus), Storage, and a secure encrypted network to special high-volume data production sites. Cardinality labels: 1 (project), 1 (storage area), n, 1.]
Using TSD 2.0 for analysis
[Diagram. Elements shown: User B1 P1 and User B2 P1, the gateway (GW), VM B1 P1 and VM B2 P1, the P1 area on the TSD disk, the P1 DB, the Colossus front end, Colossus, and the Colossus disk, all within TSD 2.0.]
Data import and export using TSD 2.0
[Diagram with numbered steps 1 to 4. Elements shown: “Sluice-server”, virtual “sluice-server”, virtual project-server, “Sluice HD”, Project HD, and an NFS mount, within TSD 2.0.]
Note on the diagram: data is copied to the sluice by ssh+scp or web-drive (2-factor authentication); the data is encrypted if sensitive.
Data collection using TSD 2.0
[Diagram. Elements shown: “Nettskjema”, minID, encrypted XML (PGP), the import mechanism, the project VM and the project disk, within TSD 2.0.]
Data-import for NGS-centers and other large scale data producers
[Diagram. Elements shown: HiSEQ, a TSD-controlled box on-site, /tmp/storage, an encrypted connection, the closed network at USIT, the gateway (GW), the project VM and the project disk, within TSD 2.0.]
Technical outline
[Diagram. Labeled blocks:]
• Admin services: provisioning system, AD, surveillance, software repo, Cfengine, Vcenter, backup, antivirus, log service
• Storage / DBs: PostgreSQL, archiving, compartmentalized disk
• HPC resource
• Management: mgmt of storage, mgmt of network, mgmt of hardware, mgmt of VMs
• Clients (2-factor login): remote desktop clients, thin clients on a dedicated network, special network for large-scale data production centers
• Publicly available network segment through “minID”
• Web questionnaire, web portal, electronic consent
• Clinical health data projects
• Other sensitive data projects
• Access network: National Health network, terminal servers, thin client servers, VPN
Technical details
• KVM for virtualization (Red Hat Linux)
• Cerebrum as provisioning (a USIT application)
• AD system administration guided by the provisioning system (duplicated)
• FreeBSD firewall and gateway (duplicated)
• Integration with IDporten (Norwegian governmental eID system) for www-enquiries and applications
• Storage with separation between projects (Hitachi disk system and encrypted backup to tape)
• IPv6 on the inside (… and private IPv4)
HPC resource – Colossus
• At present about 500 cores
• No project users are to log in on any nodes
• One global job daemon to control data integrity (to ensure project data separation)
• /tmp/ and /work/ will be per project and cleaned after each job finishes
• As similar to Abel as possible
• Separate disk and more nodes will come soon
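The per-project scratch areas that are cleaned after each job could be sketched roughly as follows. This is only an illustration under assumptions: the function name `project_scratch`, the directory layout and the project and job identifiers are hypothetical, and the slides do not describe how Colossus actually implements this.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def project_scratch(base: Path, project: str, job_id: str):
    """Create a scratch directory for one job of one project and remove it afterwards.

    The layout base/<project>/<job_id> is an assumption made for this sketch.
    """
    scratch = base / project / job_id
    # mode=0o700 applies to the job directory itself; parents get default permissions.
    scratch.mkdir(parents=True, mode=0o700)
    try:
        yield scratch
    finally:
        # Runs whether the job succeeded or failed, so no project data is left behind.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        with project_scratch(Path(base), "p01", "job-1") as scratch:
            (scratch / "partial.out").write_text("intermediate result")
            print("scratch exists during job:", scratch.exists())
        print("scratch exists after job:", scratch.exists())
```

Tying the cleanup to the end of the job, rather than to a periodic sweep, means one project's intermediate files are gone before another project's job can be scheduled on the same node.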
Security details
• OATH TOTP 2-factor authentication
– Smart phones or programmable hardware tokens
• Special roles for those allowed to export data
• Import/export is under strict control
• No open connection to the internet
• Strong separation between projects (VLAN)
• Special security measures with remote desktops
• Extremely hardened FreeBSD gateway and firewall
• Encrypted backup, one key per project
• Sys admins are single users (traceability)
• Sys admins have to use the same authentication process
• Most hardware is physically separated from other UiO hardware
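For readers unfamiliar with OATH TOTP, the one-time code can be computed as in the minimal sketch below, which follows RFC 4226 (HOTP) and RFC 6238 (TOTP) with HMAC-SHA1. The 30-second step, 6 digits and the acceptance window of one step are common defaults assumed here, not settings taken from TSD; the function names are illustrative.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP value (RFC 4226) for a shared secret and a counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """TOTP value (RFC 6238): HOTP with the counter derived from Unix time."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, candidate: str, now: Optional[float] = None,
           step: int = 30, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps either side (clock drift)."""
    if now is None:
        now = time.time()
    counter = int(now // step)
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), candidate)
        for drift in range(-window, window + 1)
        if counter + drift >= 0
    )


if __name__ == "__main__":
    shared_secret = b"12345678901234567890"  # test secret from the RFCs, not a real key
    print(totp(shared_secret, now=59))
```

The smartphone app or hardware token and the server share the secret; both derive the same short-lived code from the clock, so a stolen password alone is not enough to log in.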
Maintenance
• Reuse as much as possible from the USIT eInfrastructure
• Virtualize as much as possible
• Management/surveillance data can be pushed, but not pulled (Nagios, Collectd)
• Surveillance based on existing systems
• Sys admins have different access levels
Opportunities enabled by TSD 2.0
• NGS research on humans is possible
• Large scale imaging studies possible
• “HUNT-like” studies online for the respondents and the scientists
• Off-site analysis of sensitive data
• Secure storage for verification of published research
• Electronic consent
• Possible work-area for making exams?
• TSD to host all human NGS research data from UiO/OUS?
Nordic collaboration opportunities
• Laws are fairly similar (Norway very strict)
• Difficult to exchange data for research
• We should learn from each other, as these systems demand very specialized IT knowledge
• System development and system administration are non-sensitive and may be shared
• Building TSD addresses many novel security questions in a university setting, which others can learn from
• Large DBs of health data may enable very interesting research in the future (NeGI)
• NeIC has shown interest in TSD 2.0
• TSD collaborates with CSC in Finland and with BILS / Elixir Sweden. BBMRI is interested
Current status
• Pilot project data is being transferred now
• System is being finalized for setting up new projects and going into production
• Storage is up
• Secure Nettskjema is up
• Working on risk evaluation
• Project registration when risk evaluation is finished
• HPC resource 4th quarter 2013
• Video and sound will be the main target during further work
• System whitepaper (v1.0) written
People involved
Project group / developers
• Dag-Erling Smørgrav
• Petter Reinholtsen
• Elisabeth Ytterdal
• Tor Fuglerud
• DBA (PostgreSQL team)
• Cerebrum team
• Morten Werner Forsbring
• Espen Grøndahl
• HPC – Colossus team
• Gard Thomassen
Administration / associated
• IT-dir Lars Oftedal
• Hans A. Eide
• Märtha Felton
Cost per project
• First year establishment price (per project)
• Regular yearly project fee
• License cost (licensed software usage)
• Storage cost for storage exceeding basic allocation
• Cost of DB administration (if DB needed)
• Cost of CPU hours on Colossus
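How the cost components listed above might add up for one project-year can be sketched as below. All names (`PriceList`, `yearly_cost`) and every number in the usage example are made-up placeholders for illustration; the slides give no actual prices or pricing formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceList:
    """Hypothetical price list; the fields mirror the cost items on the slide."""
    establishment: float        # one-off, charged in the first year only
    yearly_fee: float           # regular yearly project fee
    included_storage_tb: float  # basic storage allocation covered by the fee
    storage_per_tb: float       # price per TB above the basic allocation
    db_admin: float             # charged only if the project needs a DB
    cpu_hour: float             # price per CPU hour on the HPC resource


def yearly_cost(prices: PriceList, *, first_year: bool, license_cost: float,
                storage_tb: float, needs_db: bool, cpu_hours: float) -> float:
    """Sum the cost items for one project and one year."""
    extra_tb = max(0.0, storage_tb - prices.included_storage_tb)
    total = prices.yearly_fee + license_cost
    total += extra_tb * prices.storage_per_tb
    total += cpu_hours * prices.cpu_hour
    if first_year:
        total += prices.establishment
    if needs_db:
        total += prices.db_admin
    return total


if __name__ == "__main__":
    # Placeholder numbers only, in an unspecified currency.
    prices = PriceList(establishment=1000, yearly_fee=500, included_storage_tb=1,
                       storage_per_tb=100, db_admin=200, cpu_hour=1)
    print(yearly_cost(prices, first_year=True, license_cost=50,
                      storage_tb=3, needs_db=True, cpu_hours=100))
```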
Project administration in TSD 2.0 - technical
• Application through the National ID-portal + Nettskjema
• The project is created in Cerebrum with role-categories
• The project is connected to resources (VM + disk + VLAN + DB + HPC)
• Users are created and given their roles
• Username, password and one-time passwords are distributed
• Accounts are kept on storage, HPC CPU time and additional VMs to enable control and book-keeping
• NorStore may offer “free” storage within TSD (there might be a small security management overhead cost)
• In the future there will be some level of self-service through a web portal within TSD
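The project setup described above (a project with role categories, connected resources, and users holding roles) can be modeled roughly as in this sketch. It is not Cerebrum's data model: the class names, the three role names and the `may_export` check are hypothetical, chosen only to show how a special export role could gate data export.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Role(Enum):
    # Hypothetical role categories; the slides only state that roles exist
    # and that exporting data requires a special role.
    MEMBER = "member"
    ADMIN = "admin"
    EXPORTER = "exporter"


@dataclass
class Project:
    name: str
    resources: Set[str] = field(default_factory=set)
    users: Dict[str, Set[Role]] = field(default_factory=dict)

    def connect(self, *resources: str) -> None:
        """Attach resources (e.g. VM, disk, VLAN, DB, HPC) to the project."""
        self.resources.update(resources)

    def add_user(self, username: str, *roles: Role) -> None:
        """Create the user in the project if needed and grant the given roles."""
        self.users.setdefault(username, set()).update(roles)

    def may_export(self, username: str) -> bool:
        """Only users holding the export role may take data out of the project."""
        return Role.EXPORTER in self.users.get(username, set())


if __name__ == "__main__":
    project = Project("p01")
    project.connect("VM", "disk", "VLAN", "DB", "HPC")
    project.add_user("alice", Role.MEMBER)
    project.add_user("bob", Role.MEMBER, Role.EXPORTER)
    print(project.may_export("alice"), project.may_export("bob"))
```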
Conclusion
• It is very hard to make something secure and user-friendly at the same time
– Researchers want the freedom of using the internet while doing research on sensitive data…
• A thorough risk assessment must be made during and after the planning and implementation phase to make the best choices
• What you cannot avoid should at least be detected by some surveillance mechanism.
• More (inter)national / local cooperation wanted
Pilot project (TSD 1.0)
• Secure storage for large amounts of NGS data and MR-images (>100TB)
• Secure Windows “research server” enabling usage of MS Office, STATA, SPSS etc. on sensitive data
• Research server is based on an isolated system using VMware ESX
• Two-factor login system
• Encrypted backup
“The ultimate goal is…
…to be able to provide the same services that are available for researchers working with non-sensitive data, with the necessary security, with minimum impact on the user experience, and minimum extra overhead and cost.”
Hans Eide, 2012 (my boss)