The technology of the Human Protein Reference Database (draft, 2003)
Between 2002 and 2004, I managed the technology team that built the Human Protein Reference Database (http://hprd.org) at the Institute of Bioinformatics in Bangalore and Johns Hopkins University in Baltimore. These are my notes on the tech from sometime in 2003, rediscovered in 2014 when I was looking through old files.
Human Protein Reference Database
An analysis of the technology powering the database and website,
and how it was developed.
Kiran Jonnalagadda
Facts About HPRD
• HPRD is a database of all disease-causing proteins in the human body.
• It is the most comprehensive database of its kind in the world today.
• Unlike most other biological databases, HPRD is protein-centric, not gene-centric.
Factors Leading to Choice of DB
• The biologists hadn’t settled on what information was to be stored and therefore the data type definitions changed often.
• Several data types were fairly similar to others but not the same.
• Future extensions had to be built by tech-savvy biologists with minimal assistance from programmers.
What We Used
• The Zope application server, comprising:
– The Web publishing object framework.
– ZODB, the object database storage system.
– ZCatalog, the indexing and search system.
– ZEO, the stand-alone database server for multiple front-end Web servers.
Why an RDBMS Was Not Suited
• Data type definitions changed frequently. In an RDBMS, this would have meant redefining tables every week.
• The code currently has about forty data classes. Imagine having that many data tables, plus tables for relationships between them, all under frequent revision.
How Zope Handled These Issues
• Zope is built on Python, which offers dynamic data structures.
• ZODB uses this ability to make the entire database look like one large data structure, transparently swapping unused parts to disk and recovering them as needed.
• ZCatalog indexes data for searching.
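The "one large data structure backed by disk" idea can be loosely illustrated with Python's standard library. This sketch does not use ZODB; it swaps in the shelve module as a much simpler stand-in, and the key and record fields are invented for illustration.

```python
import os
import shelve
import tempfile

# Stand-in only: shelve is NOT ZODB. It merely shows a dictionary-like
# structure whose contents live on disk and are loaded on access.
path = os.path.join(tempfile.mkdtemp(), "proteins")

with shelve.open(path) as db:
    # Hypothetical record; the key and fields are assumptions.
    db["P1"] = {"name": "example protein", "interactions": ["P2"]}

# Reopen the store: the record is read back from disk when it is accessed.
with shelve.open(path) as db:
    record = db["P1"]
    restored_name = record["name"]
```

Unlike this stand-in, ZODB stores whole object graphs transactionally, so treat the sketch as an analogy for the access pattern only.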
At Zope’s Core is Python
• Python is a dynamic language.
• When I say dynamic, I mean everything is dynamic!
• Code, variables, classes, modules, everything can be modified at run-time.
• Most of Zope is built around this ability. Zope could not have been implemented in another language.
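A minimal sketch of what "modified at run-time" means in practice. The Protein class and its attributes are hypothetical and not taken from the HPRD code.

```python
class Protein:
    """Hypothetical data class used only for this illustration."""

    def __init__(self, name):
        self.name = name


p = Protein("example")

# Add a new attribute to the class at run-time; existing instances see it.
Protein.localization = "unknown"


# Define a function and attach it to the class as a method at run-time.
def describe(self):
    return f"{self.name} ({self.localization})"


Protein.describe = describe

description = p.describe()
```

The instance p was created before either change, yet it picks up both the new attribute and the new method.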
Data Storage in Zope
• In Zope, data is stored in instances of a data class.
• The data class has variables, which are like fields, and methods, which manipulate data.
• Instances of a data class (objects) are stored in the ZODB, making the database.
• Objects can contain other objects, forming hierarchies.
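The points above can be sketched in plain Python. The class and field names here are hypothetical. In ZODB, such classes would typically derive from a persistent base class, which is left out so the sketch needs nothing beyond the standard library.

```python
class Interaction:
    """Hypothetical contained object: one interaction of a protein."""

    def __init__(self, partner, kind):
        self.partner = partner   # variable, comparable to a field
        self.kind = kind


class Protein:
    """Hypothetical data class holding fields and contained objects."""

    def __init__(self, name):
        self.name = name
        self.interactions = []   # contained objects form a hierarchy

    def add_interaction(self, partner, kind):
        # Method that manipulates the object's own data.
        self.interactions.append(Interaction(partner, kind))

    def partners(self):
        return [i.partner for i in self.interactions]


protein = Protein("A")
protein.add_interaction("B", "binding")
```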
Components of Zope
• ZServer (formerly Medusa)
– Handles incoming requests.
– Does HTTP, FTP, WebDAV, XML-RPC; soon SOAP.
• ZPublisher
– Maps URLs to objects and handles security.
• ZODB (Zope Object DataBase)
– Stores objects on disk in a transactional DB.
• ZEO (Zope Enterprise Objects)
– ZODB server for multiple Zope front-end servers.
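Mapping URLs to objects can be pictured as walking the object hierarchy one path segment at a time. The following is a simplified sketch of that idea, not ZPublisher's actual code. Folder, Page and traverse are hypothetical names, and security checks are omitted.

```python
class Folder:
    """Hypothetical container object holding named children."""

    def __init__(self, **children):
        self.children = dict(children)


class Page:
    """Hypothetical leaf object with some content."""

    def __init__(self, body):
        self.body = body


def traverse(root, url_path):
    """Follow each URL path segment down the object hierarchy."""
    obj = root
    for segment in url_path.strip("/").split("/"):
        if not segment:
            continue  # an empty path resolves to the root itself
        if not isinstance(obj, Folder) or segment not in obj.children:
            raise KeyError(f"no object at {segment!r}")
        obj = obj.children[segment]
    return obj


root = Folder(proteins=Folder(example=Page("Example protein page")))
found = traverse(root, "/proteins/example")
```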
Security in Zope
• Security is fine grained.
• Security is defined around four concepts:
– Users, Roles, Permissions and Hierarchies.
• A user is assigned one or more roles.
• A role is assigned a set of permissions.
• This set can be reassigned at different positions in the hierarchy.
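The four concepts can be modelled in a few lines. This is a simplified sketch and not Zope's security API. Node, permissions_for, allowed and the role names are assumptions made for illustration.

```python
class Node:
    """Hypothetical position in the object hierarchy."""

    def __init__(self, name, parent=None, role_permissions=None):
        self.name = name
        self.parent = parent
        # Role -> set of permissions, defined or redefined at this position.
        self.role_permissions = role_permissions or {}


def permissions_for(node, role):
    """Walk up the hierarchy to the nearest position that defines the role."""
    while node is not None:
        if role in node.role_permissions:
            return node.role_permissions[role]
        node = node.parent
    return set()


def allowed(roles, node, permission):
    """A user is allowed if any of their roles grants the permission here."""
    return any(permission in permissions_for(node, r) for r in roles)


root = Node("root", role_permissions={
    "Curator": {"view", "edit"},
    "Visitor": {"view"},
})
# The Curator role is reassigned a narrower set lower in the hierarchy.
archive = Node("archive", parent=root, role_permissions={"Curator": {"view"}})

user_roles = {"alice": ["Curator"], "bob": ["Visitor"]}
```

Here alice can edit at the root but not in the archive, because the Curator role's permission set is redefined at that position.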
Security Outside Zope
• Zope’s security mechanism is limited to the Web front.
• It is applied only to objects that directly interface with the end-user.
• Code written in a module in the filesystem has no security restrictions. It can do anything.
Limitations in Zope
• The API for creating extensions (called Products) is complicated and poorly documented.
• The Property Manager interface is too primitive. It only handles the very basic data types such as strings, integers, boolean fields, selection lists and multi-line text.
Our Extensions to Zope
• A framework for separating Zope specifics from our data types, making it much simpler to add new data types.
• An extended property management system that could handle changes in data type definitions and automatically migrate data.
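One way such automatic migration could work is to describe each data type as a schema and rebuild stored records against the current schema. This is only a sketch of the general idea under that assumption. It is not the HPRD implementation, and the schema and field names are invented.

```python
# Hypothetical current schema: field name -> type.
# Assume an earlier version stored "weight" as an int and had no "aliases".
SCHEMA_V2 = {"name": str, "weight": float, "aliases": list}


def migrate(record, schema):
    """Rebuild a record so that it matches the given schema."""
    migrated = {}
    for field, field_type in schema.items():
        if field in record:
            # Existing value: coerce it to the field's current type.
            migrated[field] = field_type(record[field])
        else:
            # New field: start from the type's empty default.
            migrated[field] = field_type()
    # Fields no longer in the schema are dropped.
    return migrated


old = {"name": "example", "weight": 42, "obsolete": "x"}
new = migrate(old, SCHEMA_V2)
```

A real system would also need per-field conversion rules for changes that a plain type coercion cannot express.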
Part II
User Interface
The rationale behind decisions affecting how a user experiences the database.
User Interface Design
• We started by exposing Zope’s hierarchy as the public user interface.
• But there were some elements such as the category browser and the
Templates for the Web UI
• Choice of DTML and ZPT for templates.
• ZPT for templating system.
Part III
Project Management Lessons
What we learnt about managing a project across continents and distant time zones.
Project Management Issues 1
• We learnt the hard way that a project manager’s place is with his team, not with the client.
• Productivity suffers in the absence of an effective collaboration tool.
• E-mail and instant messengers are not effective collaboration tools.
Project Management Issues 2
• Collaboration over e-mail imposes the burden of articulation on the communicator, which many dislike and therefore avoid.
• Instant messaging prevents collecting thoughts before presenting them and is therefore a poor planning tool.
Collaboration Tools
• We experimented with several collaboration systems, with varying effectiveness:
– Phone calls.
– Instant messengers.
– Wikis.
– Issue tracking software.
– Mailing lists.
Phone Calls
• Next best thing to face-to-face discussions.
• But only connect two people unless non-standard equipment is used.
• International calls are usually too expensive for the resulting gain.
Instant Messengers
• Provide critical communication between geographically distributed team members.
• But the pressure of maintaining continuity in a conversation hinders pausing to gather thoughts.
• Typing is much slower than talking. Therefore little else gets done alongside.
Wikis
• The easy hyperlinking system of a wiki combined with structured text makes presenting information a snap.
• With a little code thrown in, Wikis could make a wonderful project management tool.
• A changed page notification system is needed or changes go unnoticed.
Issue Tracking Software
• We use BugZilla to track issues.
• But in eight months of use, only 30 issues have been reported through it.
• The other few hundred were reported over e-mail, instant messengers and in person.
• Clearly, the problem is with BugZilla’s usability. The search for a new system is on.
Mailing Lists
• E-mail is push media: the latest is always on top of your inbox.
• E-mail makes an effective to-do list in an interface the user is comfortable with.
• Mailing lists are e-mail in broadcast mode.
• Mailing lists have been the most effective collaboration tool we’ve used so far.
Issues With Programmers
• Programmer skill levels and attitudes vary.
• C programmers tend to write C code in Python.
• PHP programmers tend to write PHP code in Python.
• Learning Python is easy but thinking in Python takes a long time.
Programming Tools We Used
• CVS for source control.
• ViewCVS for a Web front-end to CVS.
• Vim in GUI mode for source editing (preferred editor of everyone in the team).
• The print statement for debugging.
Tools We Should Have Used
• WingIDE is a $35 piece of software that provides an interactive Python debugger usable with Zope. A few minutes of use would have more than paid for it, given the hours of programmer time we instead spent debugging with the print statement.
Part IV
Things Needing Fixing
Mistakes we made during development, how they affect things now, and how they can be fixed.
Naming Conventions
• We started by assuming HPRD was gene-centric and named several things GeneSomething.
• In code, this can be considered just an identifier.
• But in a URL, such a name can confuse users and needs renaming.
Reusable Modules
• All of the code currently sits in one directory.
• Several important pieces have nothing to do with how they are being used.
• These modules could be separated and contributed independently to the open source code pool.
Data in Code
• There are bits of implementation-specific data embedded in code in some places, particularly related to graph generation.
• These were introduced as quick patches for a temporary problem but have remained in place for months now.
• These need to be taken out so that the code is truly reusable.
Documentation
• DocStrings needed in code.
• Consistent language in DocStrings.
• HTML documentation files to be distributed with code.
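As an illustration of the kind of docstring meant here, the sketch below documents a function in a consistent layout and reads the text back with the standard inspect module. The function itself is hypothetical and is not part of the HPRD code.

```python
import inspect


def total_mass(residue_masses):
    """Return the total mass of a sequence of residue masses.

    Args:
        residue_masses: iterable of numbers, one per residue.

    Returns:
        The sum of the masses as a float.
    """
    return float(sum(residue_masses))


# The first docstring line doubles as a one-line summary for tools.
summary = inspect.getdoc(total_mass).splitlines()[0]
```

Python's bundled pydoc tool can turn such docstrings into HTML pages (pydoc -w followed by a module name), which is one possible route to the HTML documentation mentioned above.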