
04/18/23 EPFL7B - Gio spring 2000 1

7. Bioinformatics

based on NRC* Bioinformatics workshop on Data Integration, Washington DC, February 2000

To be published as …..

Gio Wiederhold

EPFL, April-June 2000, at 14:15 - 15:15, room INJ 211

Intelligent Information Systems

*NRC = National Research Council, analysis and publication arm of the U.S. National Academy of Sciences


Schedule

Presentations in English -- but I'll try to manage discussions in French and/or German.

• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.

1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR.

2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).

3. 4/5 Digital libraries, information resources. Value of services, copyright.

4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.

5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.

6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]

7. 31/5 Application to Bioinformatics.

8. 15/6 Educational challenges. Expected changes in teaching and learning.

9. 22/6 Privacy protection and security. Security mediation.

10. 29/6 Summary and projection for the future.

• Feedback and comments are appreciated.


Bio-Information

• to learn about ourselves,
– our origins, our place in the world
• Primates, Mice, Zebrafish, Fruit Flies* (Drosophila), Roundworms* (C. elegans), viruses such as HIV*, as well as Yeast* and plants
– modesty, seeing how much we share with all organisms
– not just of philosophical interest, but also

• to help humanity to lead healthy lives
– to create new scientific methods
– to create new diagnostics
– to create new therapeutics

* substantially/completely sequenced; also a bacterium* (Haemophilus influenzae)


Bioinformatics

Information systems applied to biology and healthcare

• Biomedical statistics, …, …
• Genomics - a subset of major interest, dealing with information related to gene-derived data
• boundary often unclear: nature versus nurture & lifestyle
– A person’s genomic make-up has a major effect on susceptibility to diseases: positive and negative
– Major genomic errors prevent birth, hence
– we deal with differences that are relatively minor

289 / ~10 000 genes suspected/identified

– complexity: most health effects are also combinatorial

multiple genes, promoters, inhibitors, metabolic cross-roads

smoking, exposure to smoke & lung cancer


Quantities

The human genome: ~ 3 200 000 000 base pairs

Genes, and gene abnormalities

1 human

~10 000 proteins

? diseases

Everybody’s genes

6 000 000 000 humans

Small organic molecules - affect proteins - suitable for drugs

~2 000 000 molecules

Metabolic pathways

<1000 systems

Progress


Relationships

• Base pairs: certain pairs of the 4 nucleotide bases: ACGT
• adenine, cytosine, guanine, thymine, combine in a double helix
• 3 base pairs define 1 of 20 amino acids (fewer than 4^3 = 64 possible triplets)
• Proteins:
– determined by certain sequences of amino acids: genes
– assembled by the ribosome according to an RNA template
– coded in ~3% of the genome -- but where?
– 97% is miscellaneous: historical junk / promoters / inhibitors
– multiple genes for many proteins
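The triplet arithmetic on this slide can be sketched in a few lines. This is an illustration only: the codon table below is a deliberately partial excerpt of the standard genetic code (written in the DNA alphabet), and the function name `translate` is an assumption made for this example.

```python
from itertools import product

BASES = "ACGT"

# Every possible triplet of the four bases: 4**3 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Partial excerpt of the standard genetic code, for illustration only.
# The full table maps all 64 codons to 20 amino acids plus "stop".
PARTIAL_CODE = {
    "ATG": "M",  # methionine, also the usual start codon
    "TGG": "W",  # tryptophan
    "TTT": "F",  # phenylalanine
    "TTC": "F",  # second codon for the same amino acid
    "AAA": "K",  # lysine
    "GGC": "G",  # glycine
    "TAA": "*",  # stop
    "TAG": "*",  # stop
    "TGA": "*",  # stop
}


def translate(dna: str) -> str:
    """Read codon by codon until a stop codon; unknown codons become '?'."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = PARTIAL_CODE.get(dna[i:i + 3], "?")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

Because 64 codons map onto only 20 amino acids, several codons share an amino acid, as TTT and TTC do here.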


Players

• Human Genome Project (NIH-NCGI & Wellcome Trust) $250M, 1988-2005, but likely to be roughly completed in 2000/2001?
– work at universities, related research labs, split per 24 chromosomes
– collected in public databases www.ncbi.nlm.nih.gov/genome/seq

• Technology and strategies caused exponential rates of improvement
• 100 M in 1998 (well annotated, with paper publishing)
• 2 100 M by March 2000. ~12,000 base pairs per day in 1999.
– PCR
– automation [Perkin-Elmer Biosystems, Affymetrix…]
– piece-wise (100-1 000) analysis and subsequent assembly versus walking the gene
– pieces overlap, software to match

• Private enterprises at various levels
– not-for-profit [The Institute for Genomic Research (TIGR), dir. Craig Venter]
– for profit [Celera Genomics (Venter), Incyte] sell leads to pharmaceutical companies
– Early discovery pharmaceuticals [HGS Inc, Millennium Ph.]
– Established pharmaceutical companies in-house [all now], support drug development, trials on animals, humans, {toxicity, then benefit} trials, marketing.
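The "pieces overlap, software to match" step in the technology list above can be illustrated with a toy greedy assembler. This is a minimal sketch, not the method of any sequencing center: the fragments are made up, the function names `overlap` and `assemble` are assumptions, and real assemblers must also handle read errors, repeats, and reverse strands.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments: list, min_len: int = 3) -> list:
    """Greedily merge the pair with the largest overlap until no overlaps remain.

    Returns the remaining contiguous pieces (one piece if everything joined).
    """
    pieces = list(fragments)
    while len(pieces) > 1:
        size, i, j = max(
            (overlap(pieces[i], pieces[j], min_len), i, j)
            for i in range(len(pieces))
            for j in range(len(pieces))
            if i != j
        )
        if size == 0:
            break  # nothing left to join
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)] + [merged]
    return pieces
```

For example, `assemble(["ACGTTG", "TTGCAA", "CAAGGT"])` joins the three made-up fragments into the single piece `ACGTTGCAAGGT`.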


Heterogeneity inhibits Integration

• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms

• A feature of technological process
– standards require stability
– yesterday’s innovations are today’s infrastructure

• Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
– [doubletwist.com]


Integrating knowledge

Bring together biologists and computer scientists from academia, industry and government to discuss salient issues in biological computing.

The following topics will be covered:
• the generation and integration of biologic databases;
• interoperability of heterogeneous databases;
• integrity of databases;
• modeling and simulation;
• data mining; and
• visualization of "model fit" to data.

The format of this workshop is designed to facilitate lively interaction between speakers and audience participants.


Electronic Publication

The Signal Transduction Knowledge Environment: www.stke.org
Brian Ray, American Assoc. for the Advancement of Science

STKE: Virtual journal, developed jointly with HighWire Press: using the web for summarizing relevant articles from other (electronic) journals.

A prototype for a future publication model: all academic papers are placed into a pile, classified into one or more discipline categories, and aggregated and retrieved by secondary specialists - a new role for editors, requiring scientific competence and authority. Maintains a pathway map for attaching. Has a controlled vocabulary. Does caching of retrieved referenced Medline articles.


Generating and Integrating Biological Data

a. Methods for data collection: Virtual Cell Project

Dong-Guk Shin, Univ. Connecticut [email protected], also available without DB support, from www.nrcam.uchc.edu

NIH supported: physiology modeling. NSF: computational modeling approach. Bottom-up approach to cell modeling. Cross checking of models and HXs. Geometry from segmented images, 2-D visualization of specified reactions: channels, pumps, for extra, intra (cytosol), ef core cellular compartments. Generates equations for simulation. Result is a DB publication cycle, supporting model copying and adaptation. Access to remote DBs will need more than a browser: also a query system, with join over association. DBs need APIs <and mediation for scalability and mismatch>.

b. Data characteristics

Stephen Koslow, Office on Neuroinformatics, NIMH

c. Data integration

Jim Garrels, Proteome, Inc.

Moderated Discussion: By Susan Davidson, Univ. Pennsylvania


Generating and Integrating Biological Data

a. Methods for data collection

Dong-Guk Shin, Univ. Conn.

b. Data characteristics: need interoperation

Stephen Koslow, Office on Neuroinformatics, NIMH www.nimh.nih.gov/neuroinformatics/index.cfg

The human brain has 100 billion (10^11) neural cells, 10^15 connections. Uses 15 Watts. Neuroscience is a growing field, includes neuroinformatics. Initial, broad journals; reductionist journals. Numerical, symbolic, literature and image data. Volume of publication only for serotonin, discovered in 1948, now 70 000 papers, is becoming impossible to follow. Dozens of cell types. Voluminous 3-D MRI data. UCLA brain mapping. Basis for localization of diagnostic EEG, MEG observations.

c. Data integration

Jim Garrels, Proteome, Inc.

Moderated Discussion: By Susan Davidson, Univ. Pennsylvania


Generating and Integrating Biological Data

a. Methods for data collection

Dong-Guk Shin, Univ. Conn.

b. Data characteristics

Stephen Koslow, Neuroinform., NIMH

c. Data integration

Jim Garrels, Proteome, Inc. www.proteome.com - free

Literature: 50 billion bytes of text covering the 5 billion bytes in Genbank.

BioKnowledge Library. Pages {title with brief functional description, family, properties (mutant phenotype), sequence annotations, related proteins: orthologs and interologs (in different species) [Marc Vidal, MGH], classification following [Ashburner?]} curated by experts. Integrated from cDNA microarrays and chips, systematic 2-hybrids, … .

Model organisms: started with Yeast, now worms [Stuart Kim, Stanford], Pombe. Several 1000 physical associations and interactions. Authors should not publish experimental data directly into a DB and curate their own papers, but submit their results, publish expression studies, and update their own results.

Need portal sites as well as content sites.

Moderated Discussion: By Susan Davidson, Univ. Pennsylvania


Matching of sequences

• Difficult because of
– errors in amino-acid sequence
– missing subsequences, extra strands
– meaningful variation: HIV reverse transcriptase (RT) & protease is characterized by many mutations [http://hivdb.stanford.edu]
– loops and repeats in sequences

• Several tools: BLAST, GRAIL
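To make the effect of substitutions and missing subsequences concrete, here is a minimal global-alignment scorer using the textbook Needleman-Wunsch dynamic program. It is not how BLAST or GRAIL work internally (BLAST relies on heuristics for speed); the scoring values and the function name are assumptions chosen for this sketch.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning all of `a` against all of `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(
                diagonal,                 # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # a[i-1] aligned with a gap
                score[i][j - 1] + gap,    # b[j-1] aligned with a gap
            )
    return score[-1][-1]
```

A missing base lowers the score by the gap penalty while the remaining bases still line up, which is why gapped alignment tolerates the "missing subsequences" named above.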


2D to 3D conversion

Protein folding

• Strand of DNA, snipped off by …, assumes a tight, 3-D shape
• The shape determines the attachment points to cells, ...
– nature does it in a few nanoseconds
– computation based on finding minimum energy conformations would take many years
– current research tries to break down the computation by recognizing common substructure types: alpha-helices, beta sheets, ...


Interoperability of Databases

a. Design features of interoperable databases

Daniel Gardner, Cornell University. cortex.med.cornell.edu

Interoperability in a 4.5D space:

1. User - platforms, software, open to new data: model journal to define scope and views, but include data - reanalyzable. Data quality is domain-dependent. Data sets presented via a virtual oscilloscope.

2. Common data model (XML based, with capability for interdomain queries) for neuroscience. Hierarchical with a controlled vocabulary, for selected granularity. Much metadata (physiological site, data, reference, method and model elements), used in query terms as well. Data compatibility - federated, and evolving.

3. Temporal - legacy, current, future (IBM card -- XML)

4. Technical - proprietary versus open (as PNAS papers)

4.5 Domain versus interdisciplinary. Just interfaces.

XML BDML for brains. Will be longer lived than CORBA. <<The problem of interoperation is not the syntax of XML, but the semantics of the DTD tags. Scalability beyond neurosciences. Federation versus articulation.>>


Interoperability of Databases

b. Information retrieval and complex queries

Peter Karp, SRI Int., Bioinformatics Res. Group [email protected]

Databases are supplanting journals. They are re-analyzable. Results published in journals are not. Estimate now about 500 public databases for bioinformatics. Not all even have APIs. Want seamless interoperation. Differing models, units of measurement, leading to semantic problems. Progress in interoperation: www.ai.sri.com/pkarp/mmdb/94/

Follow-up includes K2 at UPenn, OPM at Gene Logic, hyperlinking at SB-Glaxo. Warehouse (SRS, 130 sources) versus multi-databases. Text (SRS) vs. structured. 150 metabolic pathways known in E. coli.

Sources lack DBMS, ontologies, no formal model, irregular flat files, inconsistent semantics (example even in Genbank entries), no web APIs. Proposed XOL = ontology exchange language. <<CS545>>

Databases often don’t have the right fields (SwissProt: inferred versus being observed). Maintenance over time. <<Need mediating help>>
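The point about differing models and units of measurement can be made concrete with a tiny mediator. Everything here is hypothetical: the two source layouts, the field names (`mw`, `mass_kDa`, `protein_id`), and the choice of kilodaltons as the common unit are assumptions for illustration, not the schema of any real database.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    protein: str
    mass_kda: float  # common unit chosen for the integrated view


def from_source_a(record: dict) -> Measurement:
    """Hypothetical source A: lower-case names, mass in daltons under 'mw'."""
    return Measurement(record["name"].upper(), record["mw"] / 1000.0)


def from_source_b(record: dict) -> Measurement:
    """Hypothetical source B: already in kilodaltons, but different keys."""
    return Measurement(record["protein_id"].upper(), record["mass_kDa"])


def mediate(source_a: list, source_b: list) -> dict:
    """Merge both sources into one view keyed by normalized protein name."""
    view = {}
    converted = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
    for m in converted:
        view.setdefault(m.protein, []).append(m.mass_kda)
    return view
```

The syntax of each record is trivial to parse; the work is in knowing that `mw` means daltons and `mass_kDa` means kilodaltons, which is the semantic knowledge a mediator has to carry.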


Interoperability of Databases

c. Definition of data elements and database structure

William Gelbart, Harvard University, Flybase

Moving from being hunter-gatherers in science to harvesters, moving to an agronomical society <<new laws>>

Phenome <--> complexome <--> Genome <--> transcriptome <--> Proteome.

Classical genomics is being superseded by expression and interaction of gene products and gene perturbation <--> phenotypes.

How do we organize DBs for those objectives? Things {biological objects, relationships among the objects -- with sources} -> robust object classifiers with controlled vocabularies. <<by guilds>> Many sorting methods.

<<Moving from agronomic to the medieval guilds, the predecessors of professional societies - sitting around the market square, where the farmers deliver their source, as wholesalers and intermediaries. Well maintained derived databases also have value - added value by expertise focused on some objective.>>

Flybase collects more, such as exons and their mutations. Transposon insertion sites.

Foundation DBs vs derived DBs -- define ownership of foundation sources. Histories must be maintained. Version tracking. Presentation standards.


Interoperability of Databases

d. Novel approaches to achieving interoperability

not (Jaron Lanier), National Tele-Immersion Initiative

actually James Bower, California Institute of Technology

Historians are important, past models for the future.

In biology individuals cannot provide contributions,

Organizations can first be folkloric - can ignore data -- all non-quantified diagrams. Commonly used by biologists. Need commitment to quantification. Aristotelian - model starts with some data, but models are then worked out independent of all the data.

Timeline of experimental data (1904 ..) versus (lagging) structural theories for Purkinje cells (1958 ..). In 1918 bad model used in DB [Holmes]. Later experimental data is ignored.


Database Integrity

a. Curation and quality control (SGD database)

Michael Cherry, Stanford Univ. genome-www.stanford.edu

Curation is the act of establishing and maintaining a database, here the xxxx. Similar task to what a journal editor does; also functions as an educator, ontologist [Yahoo].

Learn what aids the community needs, and build the museum to satisfy those needs [John Cotton Dana, 1850]

Set limits according to what you can do and obtain.

Find missing details in literature.

GO [Michael Ashburner] for fly, mouse, and yeast (Saccharomyces): Gene Ontology for molecular function, cellular location (abs or rel), . Format: DAGs. Used for annotating microarrays. Includes summary paragraphs.
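Because the ontology is formatted as a DAG rather than a tree, a term can have several parents, and annotating a gene with a term implies all of that term's ancestors. The sketch below shows the traversal on a toy graph; the terms and their parent links are invented for illustration and are not actual Gene Ontology content.

```python
# Toy ontology (invented for this example): each term lists its parent terms.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "ATP binding": ["binding"],
    "kinase activity": ["catalytic activity", "ATP binding"],  # two parents: a DAG
}


def ancestors(term: str, parents: dict = PARENTS) -> set:
    """All terms reachable by following parent links upward from `term`."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found
```

A microarray spot annotated with the toy term "kinase activity" would thus also be retrievable under "binding" and "molecular function".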


Database Integrity

b. Error detection protocols

Chris Overton, Univ. of Pennsylvania

Works in genome annotation, to predict and archive landmarks. Want links to data, to encoded proteins.

Errors come from experimental data, manual curation from the literature, computational predictions. Errors are propagated in computation and integration.

K2 (GAIA DB) uses GenBank, SwissProt, TRRD, GERD, TRANSFAC, MEDLINE. Some have moderately or highly restrictive licenses.

Look for syntax errors: matching introns and exons (implied in GD), also actual coding regions. Spelling problems are propagated. The majority of Genbank entries have annotation ambiguity. PDB does not list all binding sites found in proteins - lack of motivation [Weissig, Bioinformatics 99]. Predictions [GRAIL] get propagated. Poor advice of changes other than to sequence.
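A syntactic check of the kind described (do the annotated exons fit together and fit inside the gene?) might look like the following. This is a sketch under assumed conventions: exons are inclusive `(start, end)` coordinate pairs, and the function name and messages are hypothetical rather than taken from K2 or any annotation pipeline.

```python
def exon_errors(gene_start: int, gene_end: int, exons: list) -> list:
    """Return a list of problems found in an exon annotation (empty if clean)."""
    problems = []
    previous_end = None
    for start, end in sorted(exons):
        if start > end:
            problems.append(f"exon {start}-{end}: start after end")
        if start < gene_start or end > gene_end:
            problems.append(f"exon {start}-{end}: outside gene {gene_start}-{gene_end}")
        if previous_end is not None and start <= previous_end:
            problems.append(f"exon {start}-{end}: overlaps previous exon")
        # Track the furthest position covered so far.
        previous_end = end if previous_end is None else max(previous_end, end)
    return problems
```

Checks like this catch only syntactic inconsistencies; an exon that is well-formed but wrongly predicted passes, which is how predictions get propagated.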


Database Correctness

c. Methods for correcting errors

Bill Anderson (Knowledge Bus Inc, Hanover MD; all their work (Data Alive) is based on an ontology for biochemical databases) for EML (European Media Lab.) and EMBL, Heidelberg:

`Debabelization’

Syntactic errors: formats

Semantic errors: interpretation of relations -- ontology

Pragmatic errors - true data differences (experiment, transcription)

biochemical ontology --> microanatomy --> {spatial, events}, chemistry --> {spatial, events} --> (several 100 axioms as constraint rules)

(Either the database or the constraining ontology is wrong.)

When a fault occurs go back to pragmatics, no automatic curation.


Modeling & Simulation

a. Modeling and simulation

James Bower, California Institute of Technology

www.whyville.net/index.html - kids learning relationships, including .

Web site, Purkinje Park, allows ongoing collaboration with students.

Purkinje cell (6 M in human), 100 micrometers, has 250 000 inputs, 10-12 distinct conductances, modelled by Eric Schoeter [now Belgium]. Tested with electrical probes. Found differences with published information: here the dendrite is a current sink. Rethinking of the cerebellum: it is a sensory device, not a motor control device. Shown by experiments on motor and sensing, and observing brain activity. Still, linking images and actual activity of neurons in that area is hard.

Levels: cognitive - system - network - cellular - subcellular - molecular - atomic.

Corresponding simulators:

ACT, SOAR (connects 2 levels) - GENESIS (4 levels) - NEURON (2) - MCELL/VCELL (2) / RASMO/WebLab - GEPASI/GAMESS/Psl.


Analytical Approaches

b. Data Mining, Douglas Brutlag, Stanford University

Many types of relevant DB. Sequence, sequence variation. Now also relationship DBs (phylogenetic, gene fusion [Eisenberg], pathways, gene expression, protein-ligand, signal transduction).

Challenge: finding them, syntax, semantics (MESH inadequate).

Doubletwist [Pangea] - an agent-based specific journal - summaries and notifications of subsequent published findings.


Identification

• Match patterns of two samples
– label amino acids with fluorescent markers
– does not require functional genomic knowledge
– PCR multiplies sample size
– Fluorescence-activated cell sorters can separate cells. Ex.: separate embryo cells from mother’s blood by labeling with father’s genes and matching
– Familial ties, human migrations, ... child that died in French prison was Louis XVII, by tissue comparison with current relatives
– Ancestry of species by creating hierarchical difference trees; uses “junk portions” of genome - functions no longer needed
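A hierarchical difference tree can be sketched by repeatedly merging the two closest groups of sequences. The version below uses Hamming distance and average linkage on equal-length strings; it is a simplification (real phylogenetic methods use alignment and evolutionary models), and the sequences in the usage note are made up.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def leaves(tree) -> list:
    """Names at the tips of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def difference_tree(sequences: dict):
    """Merge the two closest clusters until one tree remains.

    `sequences` maps a name to its sequence; the result is a nested tuple of names.
    """
    def distance(c1, c2) -> float:
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(hamming(sequences[a], sequences[b]) for a, b in pairs) / len(pairs)

    clusters = list(sequences)
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [(c1, c2)]
    return clusters[0]
```

With the made-up input `{"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACTTACCA", "fly": "TGTTCCCA"}`, the two most similar sequences are grouped first and the most different one joins last.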


Clinical: Diagnosis

Diagnosis is more advanced than treatment

• Match patient tissue sample pattern to rich pattern
– VLSI technology used to place 10 000 known genes on a chip surface
– look for matches of expressed genes vs expectations in cells from diseased tissue (skin for melanoma, …)
– can distinguish, say, cancers that require specific treatment, but are indistinguishable by pathologists

• Follow with
– traditional treatments, if any
– but earlier / more aggressive / more specific
– being careful
– haemophilia
– being emotionally more prepared


Clinical Treatment

Only few choices now, take many years to develop, test

Two ways to get good genes to work

• in vivo -- problem: rejection
  • put virus (can penetrate cells) with repaired gene into cells
  • those cells now generate proper protein
  • expect cells to replicate, and create more protein

• in vitro -- problem: getting protein to right places
  • use bacteria to replicate gene
  • let them manufacture needed proteins
  • inject proteins


Clinical Treatments 2

Or, block bad genes,

all in vivo -- problem: knowledge, getting there

• flood area with decoy promoters
– fool the ribosome, prevent transcription from DNA to RNA

• block RNA from being a model for more DNA
– use anti-sense molecules to create wrong double helix segments

• stifle cells by synthetic antibodies (for cancers)
– block growth factor attachment for its proteins, by providing fakes


Visualization

c. Visualization of model fit to data

John Mazziotta, Univ.of California Los Angeles

Human brain atlas


Data and Models to represent understanding of data

• Sharing and Publishing electronically at two levels

1. Sources, i.e.: data -- with provenance - incl. predictions, fixes.

• recognize owners’ objectives - they may not be your objectives (PDB does not list all binding sites found - lack of motivation)

2. Models, incorporating knowledge, with means to populate the model

3. Added value by secondary processing - shared ownership (c)

• Expanding on Prof. Gelbart’s example by moving from agronomic to the medieval guilds -- the predecessors of professional societies -- sitting around the market square, where the farmers deliver their source, as wholesalers and intermediaries. Well maintained derived databases also have value - added value by expertise focused on some objective.


Integration

• A focus of Knowledge generation is integration of data

• The problem of interoperation is not the syntax of XML, but the semantics of the DTD tags. Scalability beyond neurosciences. Federation versus articulation. XML debabelizer.

• Yes, keep the fundamental sources, but get added value in derived data (as Swiss Prot):
– error correction for a specific objective (U Penn. work),
– adding entries
– Does not require federation and terminological alignment of all sources.

• Rules and ontologies provide incremental help: they help much but don’t solve problems of semantic errors


The People Problem

The demand for people in bioinformatics is high, at all levels

• Critical is a lack of
– training opportunities - programs and teachers
– available trainees

• Being in a multi-disciplinary field is scary
– tenure for faculty
– load for students
– salary and growth differentials in biology and CS

• Some institutions [Caltech, U Penn] are moving aggressively
– must compete with World-Wide Web visions


Privacy requires Ethics

Knowledge carries responsibilities.

Also, always some error rates.

How will people feel about your knowledge about them? Their genetic make-up, physical & psychological propensities.

Privacy is hard to formalize, but that does not mean it is not real to people.

Perceptions count.

(There is also real stuff - insurance scams - personal relations )

Diagnostics without therapies.


Securing Collaboration

[Figure labels: Collaborator; Security Filter; Private Patient Data; certified query; source query; unfiltered result; certified result; Logs]

Gio Wiederhold TIHI Oct96
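The labels in the figure suggest a filter that sits between the collaborator and the private data, certifying queries on the way in and results on the way out, and logging both. The sketch below is one possible reading, not the TIHI implementation: the policy (an allowed-field list and a minimum result size), the class name, and the `ask` helper are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityFilter:
    allowed_fields: set            # hypothetical release policy
    min_group_size: int = 3        # withhold results that isolate very few patients
    log: list = field(default_factory=list)

    def certify_query(self, requested: list) -> list:
        """Keep only fields the policy releases; log the request and the outcome."""
        certified = [f for f in requested if f in self.allowed_fields]
        self.log.append(("query", tuple(requested), tuple(certified)))
        return certified

    def certify_result(self, rows: list) -> list:
        """Release the rows only if enough patients are covered; log the decision."""
        released = rows if len(rows) >= self.min_group_size else []
        self.log.append(("result", len(rows), len(released)))
        return released


def ask(security: SecurityFilter, patients: list, requested: list) -> list:
    fields = security.certify_query(requested)                   # certified query
    unfiltered = [{f: p[f] for f in fields} for p in patients]   # unfiltered result
    return security.certify_result(unfiltered)                   # certified result
```

In this reading the collaborator never sees the source data or the unfiltered result, and the log gives the data owner a record of what was asked and what was released.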


Summary

To sustain the trend

1. The value of the results has to keep increasing: precision, relevance, not volume

2. Value is provided by experts, encoded as models of diverse resources, customers

Problems to be addressed: mismatches, quality, temporal extensions, maintenance } Clear models