

Web 2.0 and Grids for Scholarly Research

Peking University, July 27 2006

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

[email protected]

http://www.infomall.org


Application Drivers
Science Informatics for document analysis, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
  • Suggesting how to tag scientific documents, either when writing them or after the fact
Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
Conference support tools, which can benefit from features needed by journals
This gives document-enhanced Cyberinfrastructure (CI)

Community Tools
e-mail and list-serves are oldest and best used
Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P collaboration: text, audio-video conferencing, files
del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks
MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
  • http://en.wikipedia.org/wiki/List_of_social_networking_websites
Writely, Wikis and Blogs are powerful specialized shared document systems
ConferenceXP and WebEx share general applications
Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors
  • Windows Live Academic Search has similar goals
Note that sharing resources creates (implicit) communities
  • Social network tools study graphs to both define communities and extract their properties
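The point about implicit communities can be illustrated with a minimal sketch. It assumes a made-up table of which users bookmarked which resources (all user names and URLs below are hypothetical). Two users are linked if they share at least one resource, and the connected components of that graph are treated as communities. Real social network tools use much richer graph analysis; this shows only the simplest case.

```python
from collections import defaultdict


def implicit_communities(bookmarks):
    """Group users into communities: users are linked when they share
    at least one resource; communities are the connected components."""
    # Invert user -> resources into resource -> users.
    users_by_resource = defaultdict(set)
    for user, resources in bookmarks.items():
        for resource in resources:
            users_by_resource[resource].add(user)

    # Union-find over users.
    parent = {user: user for user in bookmarks}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Merge all users that share a given resource.
    for users in users_by_resource.values():
        ordered = sorted(users)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    groups = defaultdict(set)
    for user in bookmarks:
        groups[find(user)].add(user)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical bookmark data, for illustration only.
bookmarks = {
    "alice": {"http://example.org/paper1", "http://example.org/paper2"},
    "bob": {"http://example.org/paper2"},
    "carol": {"http://example.org/paper3"},
    "dave": {"http://example.org/paper3", "http://example.org/paper4"},
}
communities = implicit_communities(bookmarks)
print(communities)
```

Here alice and bob end up in one community and carol and dave in another, although nobody declared a group explicitly.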

How to use Web 2.0 Community tools in CI
Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
  • Need to integrate these
P2P File Sharing: Maybe this is useful for sharing files in research groups (virtual organizations)
  • Will modify Maze (http://maze.pku.edu.cn), a popular Chinese social P2P system with 2.5 million users
BitTorrent: more popular than FTP, so why not use it for higher-performance, fault-tolerant cached file sharing?
MySpace etc.: Could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
  • Could include uploaded material in workflows
Social Bookmarking and linking: discuss later
  • http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/

[Figure: Document-enhanced Cyberinfrastructure. Labels in the diagram: Existing Document-based Research Tools (Google Scholar, Manuscript Central, Science.gov, Windows Live Academic Search, Citeseer, CMT Conference Management, etc.) with Existing User Interface; Web service Wrappers; New Document-enhanced Research Tools with Integration/Enhancement User Interface; Community Tools (CiteULike, Connotea, Del.icio.us, Bibsonomy, Biolicious); Generic Document Tools; MyResearch Database; Bibliographic Database; Export: RSS, Bibtex, Endnote etc.; PubChem; PubMed; Traditional Cyberinfrastructure.]

Strategy
Doesn’t seem useful to build the 251st community tool
In fact a major barrier to use of existing tools is
  • What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)
So assume we use existing tools but wrap them all as web services, so we can transfer information to new tools and integrate information between tools
  • Need some “glue” logic, a “unification” database and minimal user interface
Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
Journals: Manuscript Central
Conferences: CMT from Microsoft or ?
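The wrapping strategy can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the wrapper classes, record field names and `UnificationDB` are all hypothetical, and the raw records are canned in-memory data rather than calls to any real tool. Each wrapper converts its tool's record shape into one common `Bookmark` type, and the "unification" database merges records that refer to the same URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bookmark:
    """Common record shape shared by all wrappers (hypothetical)."""
    url: str
    title: str
    tags: frozenset


class ToolAWrapper:
    """Stand-in for a tool whose records carry space-separated tags.
    The field names are invented for this sketch."""
    name = "tool_a"

    def __init__(self, raw_records):
        self._raw = raw_records

    def export(self):
        return [Bookmark(r["href"], r["description"],
                         frozenset(r["tag"].split()))
                for r in self._raw]


class ToolBWrapper:
    """Stand-in for a tool whose records carry a list of tags."""
    name = "tool_b"

    def __init__(self, raw_records):
        self._raw = raw_records

    def export(self):
        return [Bookmark(r["uri"], r["title"], frozenset(r["tags"]))
                for r in self._raw]


class UnificationDB:
    """Glue logic: merge bookmarks from every wrapper, keyed by URL."""

    def __init__(self):
        self.records = {}

    def ingest(self, wrapper):
        for b in wrapper.export():
            rec = self.records.setdefault(
                b.url, {"title": b.title, "tags": set(), "sources": set()})
            rec["tags"] |= b.tags
            rec["sources"].add(wrapper.name)


tool_a = ToolAWrapper([{"href": "http://example.org/p1",
                        "description": "Grid paper",
                        "tag": "grid web2.0"}])
tool_b = ToolBWrapper([{"uri": "http://example.org/p1",
                        "title": "Grid paper",
                        "tags": ["grid", "semantic"]}])

db = UnificationDB()
for wrapper in (tool_a, tool_b):
    db.ingest(wrapper)

merged = db.records["http://example.org/p1"]
print(sorted(merged["tags"]), sorted(merged["sources"]))
```

If a tool disappears or a better one comes along, only its wrapper changes; the merged records survive in the unification database.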


Delicious Semantic Web/Grid
http://del.icio.us purchased by Yahoo for ~$30M
http://www.CiteULike.org
http://www.connotea.org (Nature)
Associate metadata with Bookmarks specified by URLs, DOIs (Digital Object Identifiers)
Users add comments and keywords (called tags)
Users are linked together into groups (communities)
Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
Bibtex-like additional information in CiteULike
This is perhaps de facto Semantic Web, remarkable for its simplicity
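The bookmark model described on this slide is simple enough to sketch in a few lines. The `Post` record, the group table and all names, URLs and the DOI below are hypothetical; none of this is the data model of del.icio.us, CiteULike or Connotea. The sketch only shows how a resource identifier, free tags, a comment and group membership combine into a useful query.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """One user's bookmark of one resource (hypothetical record shape)."""
    user: str
    resource: str                      # a URL or a DOI
    tags: set = field(default_factory=set)
    comment: str = ""


def resources_for_group(posts, groups, group, tag):
    """Resources that any member of `group` has labeled with `tag`."""
    members = groups[group]
    return sorted({p.resource for p in posts
                   if p.user in members and tag in p.tags})


# Hypothetical posts and group membership, for illustration only.
posts = [
    Post("alice", "http://example.org/paper1", {"grid", "workflow"},
         "good overview"),
    Post("bob", "doi:10.9999/example.1", {"grid"}),
    Post("carol", "http://example.org/paper2", {"grid"}),
]
groups = {"lab": {"alice", "bob"}}

found = resources_for_group(posts, groups, "lab", "grid")
print(found)
```

carol's bookmark is excluded because she is not in the group, even though she used the same tag.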


Connotea


Connotea queried by SERVOGrid


Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
  • Title, author and institution of documents
  • Citations with their own metadata allowing one to match to other documents
Science.gov extracts metadata from lots of US Government databases
These capabilities are sure to become more powerful and to be extended
  • Give “Citation Index” in real time
  • Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world so don’t go too far in link analysis)
  • Tell you all citations of all papers in a workshop
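The "authors of all papers that cite a paper that cites you" query is a fixed-depth walk over the citation graph. The sketch below assumes a hypothetical in-memory `cited_by` map (paper to the set of papers citing it) and an `authors` map; the paper ids and author names are invented. The `hops` parameter makes the small-world caveat concrete: each extra hop can multiply the result set.

```python
def citing_papers(paper, cited_by, hops):
    """Papers reached by following 'is cited by' links exactly `hops` times."""
    frontier = {paper}
    for _ in range(hops):
        frontier = {citing for p in frontier
                    for citing in cited_by.get(p, ())}
    return frontier


def authors_of(papers, authors):
    """Union of the author lists of the given papers."""
    return {name for p in papers for name in authors.get(p, ())}


# Hypothetical citation data: P1 and P2 cite P0; P3 cites P1 and P2; P4 cites P2.
cited_by = {
    "P0": {"P1", "P2"},
    "P1": {"P3"},
    "P2": {"P3", "P4"},
}
authors = {
    "P1": ["Wang"],
    "P2": ["Smith"],
    "P3": ["Li", "Chen"],
    "P4": ["Garcia"],
}

second_order = citing_papers("P0", cited_by, hops=2)
names = sorted(authors_of(second_order, authors))
print(sorted(second_order), names)
```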


Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
  • As just submitted to a conference perhaps
These tools can help form useful lists such as the authors of all papers cited by or submitted to a journal
OSCAR2/3 (from Peter Murray-Rust’s group at Cambridge) augments the application-independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms
  • This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
  • Other fields have natural application-specific metadata and OSCAR-like tools can be developed for them
Such high-value tools could appear on “publisher” sites of the future (or else publishers will disappear)


OSCAR3 Service from Cambridge UK
Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
It identifies (or attempts to identify):
  • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
  • Other entities: Things like N(5)-C(3) and so on.
Uses SMILES, InChI and CML
There is a larger effort, SciBorg, in this area
http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3


OSCAR2 Chemistry Document analysis
It detects “magic” chemical strings in text and then
  • Stores them as metadata associated with document
  • Queries ChemInformatics repositories to tell you lots of information about identified compounds
  • Tells you which other documents have this compound
Clustering Documents from chemical properties
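Once compound names are stored as document metadata, "which other documents have this compound" is an inverted-index lookup, and a first step toward clustering is a similarity score between documents' compound sets. The sketch below assumes the compounds have already been extracted (the document ids and compound lists are hypothetical, not OSCAR output) and uses Jaccard similarity as one possible measure; the slides do not say which measure was used.

```python
from collections import defaultdict


def build_compound_index(doc_compounds):
    """Map each compound name to the set of documents that mention it."""
    index = defaultdict(set)
    for doc, compounds in doc_compounds.items():
        for compound in compounds:
            index[compound].add(doc)
    return index


def related_documents(doc, doc_compounds, index):
    """Other documents sharing at least one compound with `doc`."""
    related = set()
    for compound in doc_compounds[doc]:
        related |= index[compound]
    related.discard(doc)
    return sorted(related)


def jaccard(a, b):
    """Overlap of two compound sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical extracted metadata, for illustration only.
doc_compounds = {
    "doc1": {"benzene", "toluene"},
    "doc2": {"toluene", "ethanol"},
    "doc3": {"caffeine"},
}
index = build_compound_index(doc_compounds)
related = related_documents("doc1", doc_compounds, index)
similarity = jaccard(doc_compounds["doc1"], doc_compounds["doc2"])
print(related, round(similarity, 3))
```

A clustering step could then group documents whose pairwise similarity exceeds a chosen threshold.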


Provenance and Delicious CI
We can use a del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.)
  • All data should be labeled by a URI to enable this
  • One has in addition Citeseer/OSCAR metadata
Current major tagging systems support a flat list of tags without name=value (RDF triple) or schema organization
  • Tradeoff between features and pervasive deployment
Some extra features are easy to add as a custom service
Features not supported by del.icio.us can be uploaded as comments
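One example of an extra feature that a custom service could layer on top of flat tags is name=value annotation. The sketch below assumes a convention, invented here, in which a tag containing `=` is read as a name/value pair and any other tag is kept as a plain keyword; the data URI and tags are hypothetical. The output is a list of (subject, name, value) tuples in the spirit of RDF triples, not real RDF.

```python
def tags_to_triples(uri, tags):
    """Turn a flat tag list into (uri, name, value) triples.

    Tags of the form 'name=value' are split; any other tag is stored
    under the generic name 'tag'. This convention is an assumption of
    the sketch, not a feature of any tagging service.
    """
    triples = []
    for tag in tags:
        name, sep, value = tag.partition("=")
        if sep and name and value:
            triples.append((uri, name, value))
        else:
            triples.append((uri, "tag", tag))
    return triples


# Hypothetical annotated data item.
data_uri = "http://example.org/data/run42"
triples = tags_to_triples(data_uri, ["quality=good", "earthquake", "instrument=gps"])
for triple in triples:
    print(triple)
```

The underlying tagging system still sees only flat strings, which keeps the pervasive deployment side of the tradeoff intact.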


Current Status
Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
Debugging on 500 presentations and papers from my CGL research group
Experiment with GGF Presentations, a broad collection of Chemical Informatics resources (explore science document CI link) and the Concurrency&Computation: Practice&Experience Web site (?business model for journals)