Web 2.0 and Grids for Scholarly Research
Peking University, July 27 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
[email protected] http://www.infomall.org
Application Drivers
• Science Informatics for document analysis, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
  • Suggesting how to tag scientific documents, either when writing them or after the fact
• Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
• Conference support tools, which can benefit from features needed by journals
• This gives document-enhanced Cyberinfrastructure (CI)
Community Tools
• E-mail and list-serves are oldest and best used
• Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration – text, audio-video conferencing, files
• del.icio.us, Connotea, CiteULike, Bibsonomy, Biolicious manage shared bookmarks
• MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
  • http://en.wikipedia.org/wiki/List_of_social_networking_websites
• Writely, Wikis and Blogs are powerful specialized shared document systems
• ConferenceXP and WebEx share general applications
• Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors
  • Windows Live Academic Search has similar goals
• Note sharing resources creates (implicit) communities
  • Social network tools study graphs to both define communities and extract their properties
How to use Web 2.0 Community tools in CI
• Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
  • Need to integrate these
• P2P File Sharing: Maybe this is useful for sharing files in research groups (virtual organizations)
  • Will modify Maze http://maze.pku.edu.cn – popular Chinese social P2P system with 2.5 million users
• BitTorrent: more popular than FTP – why not use for higher performance fault tolerant cached file sharing?
• MySpace etc.: Could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
  • Could include uploaded material in workflows
• Social Bookmarking and linking: discuss later
  • http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
[Figure: Document-enhanced Cyberinfrastructure. Diagram labels, in extraction order: Existing User Interface; Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.; Existing Document-based Research Tools; Web service Wrappers; New Document-enhanced Research Tools; Integration/Enhancement User Interface; Community Tools; Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; CiteULike, Connotea, Del.icio.us, Bibsonomy, Biolicious; PubChem, PubMed; Traditional Cyberinfrastructure]
Strategy
• Doesn’t seem useful to build the 251st community tool
• In fact a major barrier to use of existing tools is
  • What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)?
• So assume we use existing tools but wrap them all as web services, so we can transfer information to new tools and integrate information between tools
  • Need some “glue” logic, a “unification” database and a minimal user interface
• Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
• Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
• Journals: Manuscript Central
• Conferences: CMT from Microsoft or ?
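The wrap-and-unify strategy can be sketched roughly as follows. This is a minimal illustration, not the project's actual design: the `BookmarkService` interface, the `InMemoryService` stand-in and the `unify` function are all hypothetical names, and a real wrapper would call each tool's own web interface instead of holding canned data.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Bookmark:
    url: str
    title: str
    tags: frozenset[str] = frozenset()


class BookmarkService(Protocol):
    """Hypothetical common interface that every wrapped tool would expose."""

    name: str

    def fetch(self, user: str) -> list[Bookmark]: ...


@dataclass
class InMemoryService:
    """Stand-in for a real wrapper (del.icio.us, Connotea, ...); holds canned data."""

    name: str
    data: dict[str, list[Bookmark]] = field(default_factory=dict)

    def fetch(self, user: str) -> list[Bookmark]:
        return list(self.data.get(user, []))


def unify(services: list[BookmarkService], user: str) -> dict[str, Bookmark]:
    """Glue logic: merge bookmarks from all services, keyed by URL, unioning tags."""
    merged: dict[str, Bookmark] = {}
    for service in services:
        for bm in service.fetch(user):
            seen = merged.get(bm.url)
            if seen is None:
                merged[bm.url] = bm
            else:
                # Same URL seen in another tool: keep the first title, combine tags.
                merged[bm.url] = Bookmark(bm.url, seen.title, seen.tags | bm.tags)
    return merged
```

Because the glue code only depends on the shared interface, a tool that disappears can be dropped from the list and a new one added without touching the unified store.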
Delicious Semantic Web/Grid
• http://del.icio.us purchased by Yahoo for ~$30M
• http://www.CiteULike.org
• http://www.connotea.org (Nature)
• Associate metadata with Bookmarks specified by URLs, DOIs (Digital Object Identifiers)
• Users add comments and keywords (called tags)
• Users are linked together into groups (communities)
• Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
• Bibtex-like additional information in CiteULike
• This is perhaps de facto Semantic Web – remarkable for its simplicity
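To make the bookmark metadata model concrete, here is a small sketch of a tagged post and two lookups built on it. The `Post` record and both helper functions are assumptions made for illustration; none of the tools above is claimed to store data this way.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    user: str
    resource: str               # a URL or a DOI string
    tags: tuple[str, ...] = ()  # flat keywords, no hierarchy
    comment: str = ""


def by_tag(posts):
    """Map each (lower-cased) tag to the set of resources carrying it."""
    index = defaultdict(set)
    for post in posts:
        for tag in post.tags:
            index[tag.lower()].add(post.resource)
    return dict(index)


def co_bookmarkers(posts):
    """Map each resource to the users who bookmarked it (an implicit community)."""
    users = defaultdict(set)
    for post in posts:
        users[post.resource].add(post.user)
    return dict(users)
```

The whole model is a list of flat records, which is the simplicity the slide remarks on: tag search and community discovery both reduce to grouping those records by one field.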
Connotea
Connotea queried by SERVOGrid
Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
• Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
  • Title, author and institution of documents
  • Citations with their own metadata allowing one to match to other documents
• Science.gov extracts metadata from lots of US Government databases
• These capabilities are sure to become more powerful and to be extended
  • Give “Citation Index” in real time
  • Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world so don’t go too far in link analysis)
  • Tell you all citations of all papers in a workshop
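The "authors of all papers that cite a paper that cites you" query is a two-hop walk over the citation graph. The sketch below assumes the graph is already available as plain dictionaries (`cites` mapping a paper id to the ids it cites, `authors` mapping a paper id to its author names); the function names are illustrative only.

```python
def citing(cites, target):
    """Return the papers whose reference list contains `target`."""
    return {paper for paper, refs in cites.items() if target in refs}


def authors_citing_my_citers(cites, authors, my_paper):
    """Authors of every paper that cites a paper that cites `my_paper`."""
    first_hop = citing(cites, my_paper)
    # Union of the citers of each first-hop paper; empty if nobody cites us.
    second_hop = set().union(*(citing(cites, p) for p in first_hop))
    return {name for paper in second_hop for name in authors.get(paper, ())}
```

Each extra hop multiplies the number of papers reached, which is the small-world caveat in the bullet above: after a few hops the result approaches the whole literature.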
Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
• It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
  • As just submitted to a conference perhaps
• These tools can help form useful lists such as authors of all cited or submitted papers to a journal
• OSCAR2/3 (from Peter Murray-Rust’s group at Cambridge) augment the application independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms
  • This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
  • Other fields have natural application specific metadata and OSCAR like tools can be developed for them
• Such high value tools could appear on “publisher” sites of future (or else publishers will disappear)
OSCAR3 Service from Cambridge UK
• Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
• It identifies (or attempts to identify):
  • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
  • Other entities: Things like N(5)-C(3) and so on.
• Uses SMILES, InChI and CML
• There is a larger effort, SciBorg, in this area
  • http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
• http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
OSCAR2 Chemistry Document analysis
• It detects “magic” chemical strings in text and then
  • Stores them as metadata associated with document
  • Queries ChemInformatics repositories to tell you lots of information about identified compounds
  • Tells you which other documents have this compound
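The detect-store-lookup flow above can be illustrated with a toy example. This is not how OSCAR works: it swaps in a naive regular expression over a tiny hand-picked element list, purely to show chemical strings becoming document metadata and then an index answering "which other documents have this compound". All names here are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative subset of element symbols; a real tool needs far more.
ELEMENTS = {"H", "C", "N", "O", "Na", "Cl", "S", "K"}
# A candidate formula: two or more element-like units, each optionally counted.
TOKEN = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
UNIT = re.compile(r"([A-Z][a-z]?)\d*")


def find_formulae(text):
    """Return candidate formula strings whose symbols are all known elements."""
    found = set()
    for match in TOKEN.finditer(text):
        symbols = UNIT.findall(match.group())
        if all(symbol in ELEMENTS for symbol in symbols):
            found.add(match.group())
    return found


def build_index(docs):
    """Map each detected formula to the ids of the documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for formula in find_formulae(text):
            index[formula].add(doc_id)
    return dict(index)
```

Looking up a formula in the index returned by `build_index` gives the set of documents that mention it; the element check is what keeps acronyms such as DNA from being stored as compounds.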
Clustering Documents from chemical properties
Provenance and Delicious CI
• We can use del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.)
  • All data should be labeled by a URI to enable this
  • One has in addition Citeseer/OSCAR metadata
• Current major tagging systems support flat list of tags without name=value (RDF triple) or schema organization
  • Tradeoff between features and pervasive deployment
• Some extra features are easy to add as a custom service
• Features not supported by del.icio.us can be uploaded as comments
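One way to carry name=value provenance through a system that only offers flat tags and a free-text comment is to serialize the pairs into the comment field. The `@name=value` line convention below is an assumption invented for this sketch, not a feature of del.icio.us or any other tool.

```python
def encode_comment(free_text, properties):
    """Append name=value pairs to a comment, one '@name=value' line each."""
    lines = [free_text] if free_text else []
    lines += [f"@{name}={value}" for name, value in sorted(properties.items())]
    return "\n".join(lines)


def decode_comment(comment):
    """Split a comment back into (free text, dict of name=value pairs)."""
    text_lines, properties = [], {}
    for line in comment.splitlines():
        if line.startswith("@") and "=" in line:
            name, value = line[1:].split("=", 1)
            properties[name] = value
        else:
            text_lines.append(line)
    return "\n".join(text_lines), properties
```

The encoding round-trips as long as values contain no newlines. The tagging service itself never needs to understand the pairs, which keeps the pervasive-deployment side of the tradeoff intact.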
Current Status
• Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
• Debugging on 500 presentations and papers from my CGL research group
• Experiment with GGF Presentations, Broad collection of Chemical Informatics resources (explore science document CI link) and Concurrency&Computation: Practice&Experience Web site (?business model for journals)