Navigating the Neuroscience Data Landscape
Maryann Martone, Ph.D.
University of California, San Diego
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits--their spatial organization, local and long-distance connections, their temporal orchestration, and their dynamic features. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
NIF is an initiative of the NIH Blueprint consortium of institutes
• What types of resources (data, tools, materials, services) are available to the neuroscience community?
• How many are there?
• What domains do they cover? What domains do they not cover?
• Where are they?
  • Web sites
  • Databases
  • Literature
  • Supplementary material
• Who uses them?
• Who creates them?
• How can we find them?
• How can we make them better in the future?
http://neuinfo.org
• PDF files
• Desk drawers
How many resources are there?
• NIF Registry: A catalog of neuroscience-relevant resources
  • > 4800 currently listed
  • > 2000 databases
• And we are finding more every day
The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
http://neuinfo.org
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Supported by NIH Blueprint
Literature
Database Federation
Registry
What are the connections of the hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: Synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using the full resource when getting there from NIF
Link back to record in original source
Results are organized within a common framework
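The query expansion shown for the hippocampus search can be sketched in a few lines. This is only an illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` helper are hypothetical names, and a real system would draw its synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table; a real system would load these from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: dict = SYNONYMS) -> str:
    """Return the term OR-ed together with any known synonyms."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


print(expand_query("Hippocampus"))
# Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is passed through unchanged, so the expansion never narrows the original query.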
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn
The scourge of neuroanatomical nomenclature
• NIF Connectivity: 6 databases containing connectivity primary data or claims
  • Brain Architecture Management System (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of partonomy matches: 385
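A rough sketch of how exact and synonym matches across vocabularies might be counted is given below. The database names, term lists and synonym table are invented for illustration and are not the contents of the six connectivity databases; treating "exact" as a case-insensitive comparison is also an assumption.

```python
from collections import defaultdict

# Toy vocabularies, invented for illustration only.
DATABASE_TERMS = {
    "db_a": {"Hippocampus", "Cerebellum", "Striatum"},
    "db_b": {"hippocampus", "Ammon's horn", "Neostriatum"},
    "db_c": {"Cerebellum", "Olfactory bulb"},
}

# Hypothetical synonym table: variant label -> preferred label.
PREFERRED = {"ammon's horn": "hippocampus", "neostriatum": "striatum"}


def shared_terms(db_terms, normalize):
    """Return normalized terms that appear in more than one database."""
    seen_in = defaultdict(set)
    for db, terms in db_terms.items():
        for term in terms:
            seen_in[normalize(term)].add(db)
    return {term for term, dbs in seen_in.items() if len(dbs) > 1}


def exact(term):
    """Exact matching, assumed here to ignore case."""
    return term.lower()


def with_synonyms(term):
    """Map a term to its preferred label before comparing."""
    key = term.lower()
    return PREFERRED.get(key, key)


exact_matches = shared_terms(DATABASE_TERMS, exact)
# Matches that only appear once synonyms are taken into account.
synonym_matches = shared_terms(DATABASE_TERMS, with_synonyms) - exact_matches

print(sorted(exact_matches), sorted(synonym_matches))
```

Partonomy matches would need a part-of hierarchy in addition to the synonym table, which this sketch leaves out.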
The INCF is working with NIF to develop semantic and spatial strategies for translating anatomy across information systems
What is an ontology?
Brain has a Cerebellum
Cerebellum has a Purkinje Cell Layer
Purkinje Cell Layer has a Purkinje cell
Purkinje cell is a neuron
Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies
• Provide universals for navigating across different data sources
• Semantic “index”
• Provide the basis for concept-based queries to probe and mine data
• Perform reasoning
• Link data through relationships, not just one-to-one mappings
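To make "perform reasoning" and "link data through relationships" concrete, here is a minimal sketch that stores the Brain / Cerebellum / Purkinje cell example as triples and infers a fact that is never stated directly. The triple layout and the function names are assumptions made for this illustration, and the edges are one plausible reading of the slide's diagram.

```python
# (subject, relation, object) triples, read from the diagram on this slide.
TRIPLES = [
    ("Brain", "has a", "Cerebellum"),
    ("Cerebellum", "has a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has a", "Purkinje cell"),
    ("Purkinje cell", "is a", "neuron"),
]


def parts_of(structure, triples=TRIPLES):
    """Return everything reachable from `structure` by following 'has a' links."""
    found, stack = [], [structure]
    while stack:
        current = stack.pop()
        for subj, rel, obj in triples:
            if subj == current and rel == "has a" and obj not in found:
                found.append(obj)
                stack.append(obj)
    return found


def types_of(entity, triples=TRIPLES):
    """Return the classes that `entity` directly belongs to via 'is a' links."""
    return [obj for subj, rel, obj in triples if subj == entity and rel == "is a"]


def contains_type(structure, wanted_type):
    """Infer whether a structure contains any part of the given type."""
    return any(wanted_type in types_of(part) for part in parts_of(structure))


print(parts_of("Brain"))
print(contains_type("Brain", "neuron"))
```

No triple says that the brain contains neurons; the answer follows from chaining three "has a" links and one "is a" link, which is the kind of inference a one-to-one term mapping cannot provide.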
PONS program
Structural Lexicon Taskforce
• Concentrate on Human, Non-human Primate, Rat and Mouse
• Define structural concepts from level of organ to macromolecular complexes
• Provide a set of criteria by which structures can be identified
Neuronal Registry Taskforce
• Establish conventions for naming new types of neurons
• Establish a standard set of properties to define neurons
• Create a Neuron Registry for registering new types of neurons
Deployment and representation (Alan Ruttenberg)
• Brought together ontologists working across scales
Courtesy of Chris Mungall, Lawrence Berkeley Labs
***Not about imposing a single view of anatomy; about making concepts computable and being able to translate among views
NeuroLex Wiki
http://neurolex.org Stephen Larson
• Provide a simple framework for defining the concepts required
  • Cell, Part of brain, subcellular structure, molecule
• Community based:
  • Avian neuroanatomy
  • Fly neurons (England)
  • Neuroimaging terms
  • Brain regions identified by text mining
• Creating a computable index for neuroscience data
• INCF working to coordinate Wiki efforts underway at Allen Institute, Blue Brain and Neurolex
Demo D03
Comparison of traffic to NIF Portal vs Neurolex
5000 hits 15000 hits
Wiki is readily indexed by search engines
Neurons in Neurolex
INCF building a knowledge base of neurons and their properties via the Neurolex Wiki
Led by Dr. Gordon Shepherd
Consistent and parseable naming scheme
Knowledge is readily accessible, editable and computable
Stephen Larson
NIF data federation
Images
Drugs
Antibodies
Grants
Pathways
Animals
Percentage of data records per data type
connectivity
Brain activation foci
Microarray 98%
Primary data, secondary data, claims, repositories
Recently added: BioNOT literature mining tool; Retraction Watch blog
What do you mean by data?
Databases come in many shapes and sizes
• Primary data: Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data: Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: Metadata; pointers to data sets or materials stored elsewhere
• Data aggregators: Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: Data acquired within a single context, e.g., Allen Brain Atlas
Figure: Brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) vs. Data source
Vadim Astakhov, Kepler Workflow Engine
NIF landscape analysis
How much of the landscape do we have?
Query for “reference” brain structures and their parts in NIF Connectivity database
NIF Reports: Male vs Female
Gender bias
NIF can start to answer interesting questions about neuroscience research, not just about neuroscience
Embracing duplication: Data Mash ups
• ~300 PMIDs were common between Brede and SUMSdb
• Same information; value added
Same data; different aspects
Same data: different analysis
Chronic vs acute morphine in striatum
Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article
Gemma: Reanalyzed microarray results from GEO using different algorithms
Both provide results of increased or decreased expression as a function of experimental paradigm
• 4 strains of mice
• 3 conditions: chronic morphine, acute morphine, saline
Mined NIF for all references to GEO IDs: found a small number where the same dataset was represented in two or more databases
http://www.chibi.ubc.ca/Gemma/home.html
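Finding datasets that are represented in two or more databases amounts to grouping records by their GEO ID. The sketch below assumes the mined references are available as (database, GEO ID) pairs; the record list and the accession numbers are placeholders, not real results.

```python
from collections import defaultdict

# Hypothetical (database, GEO accession) pairs; the accessions are placeholders.
RECORDS = [
    ("Gemma", "GSE0001"),
    ("DRG", "GSE0001"),
    ("Gemma", "GSE0002"),
    ("DRG", "GSE0003"),
    ("Gemma", "GSE0001"),  # repeated mention within the same database
]


def datasets_in_multiple_databases(records):
    """Return GEO IDs referenced by at least two distinct databases."""
    by_id = defaultdict(set)
    for database, geo_id in records:
        by_id[geo_id].add(database)
    return {geo_id: sorted(dbs) for geo_id, dbs in by_id.items() if len(dbs) >= 2}


print(datasets_in_multiple_databases(RECORDS))
```

Collecting databases into a set means repeated mentions inside one database do not count as duplication across databases.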
How easy was it to compare?
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma: Increased expression/decreased expression
• DRG: Increased expression/decreased expression
• But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
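The baseline problem described above can be illustrated with a small sketch that re-expresses every result relative to saline before comparing. It assumes a two-condition comparison, so a change reported against chronic morphine is simply the mirror image of one reported against saline, and it ignores the separate problem of matching gene identifiers between the two databases. The gene labels and directions are hypothetical.

```python
# Flip a direction when it was reported against the opposite baseline.
FLIP = {"increased": "decreased", "decreased": "increased"}


def to_saline_baseline(direction, baseline):
    """Express a change relative to saline (two-condition comparison assumed)."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline}")


def compare(gemma, drg):
    """Count agreement after normalization.

    `gemma` maps gene -> direction relative to chronic morphine;
    `drg` maps gene -> direction relative to saline.
    """
    counts = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, direction in gemma.items():
        if gene not in drg:
            counts["missing"] += 1
        elif to_saline_baseline(direction, "chronic morphine") == drg[gene]:
            counts["consistent"] += 1
        else:
            counts["opposite"] += 1
    return counts


# Hypothetical gene labels and directions, for illustration only.
gemma = {"geneA": "decreased", "geneB": "increased", "geneC": "increased"}
drg = {"geneA": "increased", "geneB": "increased"}
print(compare(gemma, drg))

# Share of the 1370 Gemma statements reported as consistent with DRG.
consistent_fraction = 617 / 1370
print(f"{consistent_fraction:.1%}")
```

The last two lines work out the figure behind "over half of the claims": 617 of 1370 is about 45%, so roughly 55% of the statements were not confirmed.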
NIF annotation standard
Grabbing the long tail of small data
Analysis of NIF shows multiple databases with similar scope and content
Many contain partially overlapping data
Data “flows” from one resource to the next
Data is reinterpreted, reanalyzed or added to
When does it become something else?
Is duplication good or bad?
Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
  • NIF Registry vs NIF data federation
  • Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
  • Effective search across semantically diverse sources
  • NIFSTD ontologies
• 2009-2011: Strategy for data integration
  • Unified views across common sources
  • Mapping of content to NIF vocabularies
• 2011-present: Data analytics
  • Uniform external data references
Data, not just stories about them!
47/50 major preclinical published cancer studies could not be replicated
“The scientific community assumes that the claims in a preclinical study can be taken at face value-that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
“There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
A global view of data
• You (and the machine) have to be able to find it
  • Accessible through the web
  • Annotations
• You have to be able to use it
  • Data type specified and in a usable form
• You have to know what the data mean
  • Some semantics
  • Context: Experimental metadata
  • Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
Concept-based search: search by meaning
Search Google: GABAergic neuron
Search NIF: GABAergic neuron
NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
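The difference between searching by string and searching by meaning can be sketched as follows. The `IS_A` table is a hypothetical fragment of a neuron hierarchy and the documents are made up; this is not how NIF is implemented, only an illustration of expanding a concept to its subtypes before matching.

```python
# Hypothetical fragment of a neuron class hierarchy: child -> parent.
IS_A = {
    "Purkinje cell": "GABAergic neuron",
    "Cerebellar Golgi cell": "GABAergic neuron",
    "Striatal medium spiny neuron": "GABAergic neuron",
    "GABAergic neuron": "neuron",
}


def descendants(concept, is_a=IS_A):
    """Return all concepts that are directly or indirectly a kind of `concept`."""
    found, frontier = [], [concept]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in is_a.items():
            if child_parent == parent:
                found.append(child)
                frontier.append(child)
    return found


def string_search(query, documents):
    """Plain text matching: only documents containing the literal query."""
    return [doc for doc in documents if query.lower() in doc.lower()]


def concept_search(concept, documents):
    """Match the concept itself or any of its subtypes."""
    terms = [t.lower() for t in [concept] + descendants(concept)]
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


# Made-up document titles for illustration.
DOCUMENTS = [
    "Dendritic spines of the Purkinje cell",
    "GABAergic neuron markers",
    "Astrocyte morphology in cortex",
]

print(string_search("GABAergic neuron", DOCUMENTS))
print(concept_search("GABAergic neuron", DOCUMENTS))
```

The Purkinje cell document never mentions the phrase "GABAergic neuron", so only the concept-based search returns it.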