Beautifying Data in the Real World
Group 5: Toan Do - An Du
Vinh Nguyen - Tan Tran
Instructor: Professor Lothar Piepmeyer
How big is the data on the Internet?
2004: The Internet exceeded 1 EB for the first time
2005: Eric Schmidt estimated its size at 5 million terabytes (~5 EB)
Cisco forecasts that in 2015, the size of the Internet will reach nearly 1,000 EB
How big is it?
Source:
http://www.wisegeek.com/how-big-is-the-internet.htm
http://techland.time.com/
Content
Introduction
Open Notebook Science approach
Curating and presenting the data
Beautifying the data
Data visualization & building a portal from open data and free services
Demonstration
Problems of data in real world (Scientific)
Noisy sources of data
The barrier of data presentation:
  OCR version
  Text version
  Human-readable
  Machine-readable
  …
How to verify the data?
Open Notebook Science
Purpose: record the full raw data of scientific research and make it available online
Benefits:
  obtain detailed descriptions of procedures
  improve the communication of science
  increase progress
  reduce time lost due to the repetition of failed experiments
  …
Validating crowdsourced data
Under ONS, all detailed data are recorded
Doubtful data are also kept and marked for
Unique Identifiers for Chemical Entity
Standardize data
Facilitate integration with other data sets
Consider 3 possibilities:
  CAS Registry Number
  InChI
  SMILES
CAS Registry Number
Proprietary
Cannot be converted to a chemical structure
Depends on an external organization to issue numbers
For example, the CAS number of water is 7732-18-5: the checksum 5 is calculated as (8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6) = 105; 105 mod 10 = 5
http://en.wikipedia.org/wiki/CAS_registry_number
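The checksum rule in the water example can be sketched in a few lines of Python. This is a minimal illustration, not an official validator; the function names `cas_check_digit` and `is_valid_cas` are made up for this sketch.

```python
def cas_check_digit(body: str) -> int:
    """Check digit for the part of a CAS number before the final hyphen."""
    digits = [int(ch) for ch in body if ch.isdigit()]
    # The rightmost digit is weighted by 1, the next by 2, and so on.
    total = sum(w * d for w, d in enumerate(reversed(digits), start=1))
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """True if the string looks like N...-NN-C and the check digit matches."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    return cas_check_digit(parts[0] + parts[1]) == int(parts[2])


print(cas_check_digit("7732-18"))  # 5, as in the water example
print(is_valid_cas("7732-18-5"))   # True
print(is_valid_cas("7732-18-4"))   # False
```

A check like this only catches typing errors; it says nothing about whether the number was actually issued.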
InChI
IUPAC International Chemical Identifier
Freely usable and non-proprietary
Does not have to be assigned by an organization
Can be computed from structural information
Human readable (with practice)
http://en.wikipedia.org/wiki/Inchi
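To show why InChI is readable "with practice", the sketch below splits an identifier into its slash-separated layers. It assumes a simple standard InChI with a formula layer and at most one layer per prefix letter; the helper name `split_inchi` is hypothetical, and the ethanol string is used only as sample input.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed layers."""
    prefix, _, rest = inchi.partition("=")
    if prefix != "InChI" or not rest:
        raise ValueError("not an InChI string")
    # Assumption: the version and formula layers are both present.
    version, formula, *layers = rest.split("/")
    result = {"version": version, "formula": formula}
    for layer in layers:
        if layer:
            # Each later layer starts with a one-letter prefix,
            # e.g. "c" for connectivity and "h" for hydrogens.
            result[layer[0]] = layer[1:]
    return result


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])  # C2H6O
print(ethanol["c"])        # 1-2-3
print(ethanol["h"])        # 3H,2H2,1H3
```

Computing an InChI from a structure is a separate task that needs a chemistry toolkit; this sketch only reads an existing string.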
SMILES
Simplified molecular-input line-entry system
More human-readable than InChI
Can be converted to InChI
http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
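As a rough illustration of how directly a SMILES string maps to atoms, the sketch below counts non-hydrogen atoms. It is an assumption-laden toy: it covers only the common organic subset plus simple bracket atoms, ignores two-letter aromatic symbols, and does not validate the string. The name `heavy_atom_counts` is hypothetical.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then one-letter atoms.
_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z])")


def heavy_atom_counts(smiles: str) -> dict:
    """Count non-hydrogen atoms per element in a simple SMILES string."""
    counts = Counter()
    for token in _TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_SYMBOL.match(token)
            if match is None:
                continue
            symbol = match.group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic "c" becomes "C"
        if symbol != "H":
            counts[symbol] += 1
    return dict(counts)


print(heavy_atom_counts("CCO"))          # ethanol
print(heavy_atom_counts("c1ccccc1O"))    # phenol
print(heavy_atom_counts("[Na+].[Cl-]"))  # sodium chloride
```

Bond symbols, ring-closure digits, and branches are simply skipped here, which is enough for counting but not for rebuilding the structure.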
Analysis Options
Access to live data
Get summary
Complex statistical representations of models
Mark the doubtful data for later consideration
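One way to mark doubtful values without discarding them is sketched below. The slides do not say how doubtful data were identified; the median-absolute-deviation rule, the `tolerance` parameter, and the function name `mark_doubtful` are assumptions made purely for illustration.

```python
from statistics import median


def mark_doubtful(values, tolerance=3.0):
    """Keep every value, flagging those far from the median.

    A value is flagged when its distance from the median exceeds
    `tolerance` times the median absolute deviation (MAD).
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    marked = []
    for v in values:
        doubtful = abs(v - center) > tolerance * mad
        marked.append({"value": v, "doubtful": doubtful})
    return marked


# Hypothetical repeated measurements of the same quantity.
for record in mark_doubtful([10.0, 10.2, 9.9, 10.1, 25.0]):
    print(record)
```

Nothing is deleted: every record survives, and the flag lets a reviewer return to the suspicious entries later.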
Google Docs API
Allows developers to create, retrieve, update, and delete Google Docs files and collections
Also provides some advanced features like resource archives, Optical Character
Recognition, translation, and revision history.
Useful to store data in the cloud, perform resource management, convert document formats
https://developers.google.com/google-apps/documents-list/
Google Visualization API
Chart Library: JavaScript classes
Data Table: JavaScript DataTable class
Data Source: Chart Tools Datasource protocol
https://developers.google.com/chart/interactive/docs/index
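The DataTable class accepts a JSON literal with `cols` and `rows` keys. A minimal sketch of producing that shape on the server side follows; the helper `to_datatable`, the column ids, and the sample rows are assumptions for illustration, and the result would still need to be handed to the JavaScript chart library in a page.

```python
import json


def to_datatable(columns, rows):
    """Build a dict in the DataTable JSON literal shape.

    columns: list of (label, type) pairs, e.g. ("Compound", "string")
    rows:    list of tuples, one value per column
    """
    return {
        "cols": [
            {"id": f"col{i}", "label": label, "type": col_type}
            for i, (label, col_type) in enumerate(columns)
        ],
        # Each row is {"c": [...]} with one {"v": value} cell per column.
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


table = to_datatable(
    [("Compound", "string"), ("Molar mass (g/mol)", "number")],
    [("water", 18.015), ("ethanol", 46.07)],  # sample data
)
print(json.dumps(table, indent=2))
```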
RESTful Web Service
Representational State Transfer (REST): a simpler alternative to SOAP- and Web Services Description Language (WSDL)-based Web services
Principles:
  Use HTTP methods explicitly.
  Be stateless.
  Expose directory structure-like URIs.
  Transfer XML, JavaScript Object Notation (JSON), or both.
http://www.ibm.com/developerworks/webservices/library/ws-restful/
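The four principles can be sketched without any web framework. The dispatcher below is a toy, not a server: the `/compounds` URI layout, the `handle` function, and the in-memory store are all assumptions made for illustration.

```python
import json

# In-memory store standing in for a database (hypothetical sample data).
COMPOUNDS = {"water": {"cas": "7732-18-5"}}


def handle(method: str, path: str, body: str = ""):
    """Dispatch one request and return (status, JSON text).

    No session is kept between calls: each request carries everything
    needed to serve it (stateless), and the HTTP method alone decides
    the action (explicit methods).
    """
    parts = [p for p in path.split("/") if p]
    # Directory-like URIs: /compounds and /compounds/<name>
    if not parts or parts[0] != "compounds" or len(parts) > 2:
        return 404, json.dumps({"error": "not found"})
    if len(parts) == 1:
        if method == "GET":
            return 200, json.dumps(sorted(COMPOUNDS))
        return 405, json.dumps({"error": "method not allowed"})
    name = parts[1]
    if method == "PUT":
        COMPOUNDS[name] = json.loads(body)
        return 200, json.dumps(COMPOUNDS[name])
    if name not in COMPOUNDS:
        return 404, json.dumps({"error": "not found"})
    if method == "GET":
        return 200, json.dumps(COMPOUNDS[name])
    if method == "DELETE":
        del COMPOUNDS[name]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})


print(handle("GET", "/compounds"))
print(handle("GET", "/compounds/water"))
```

In a real service the same routing would sit behind an HTTP server; the point here is only that method plus URI fully identify the operation.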
Compare REST and SOAP
Who's using REST?
All of Yahoo's web services use REST, including Flickr. The del.icio.us API uses it, as do pubsub, bloglines, and technorati. Both eBay and Amazon have web services for both REST and SOAP.
Who's using SOAP?
Google seems to be consistent in implementing its web services with SOAP, with the exception of Blogger, which uses XML-RPC. You will find SOAP web services in lots of enterprise software as well.
http://www.petefreitag.com/item/431.cfm
Compare REST and SOAP
REST
  Lightweight - not a lot of extra XML markup
  Human-readable results
  Easy to build - no toolkits required
SOAP
  Easy to consume - sometimes
  Rigid - type checking, adheres to a contract
  Development tools
An Effort to Aggregate Data from Multiple Sources
Introducing ChemSpider
An online lookup engine for chemists
http://www.chemspider.com
40 million substances
Multiple data sources
A "link farm" to other sources
Semantic Web
Describing things in a way that computer applications can understand.
"The Beatles was a band from Liverpool"
Describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price)
"... will make all the data in the world look like one huge database" - Tim Berners-Lee
http://www.w3schools.com/web/web_semantic.asp
Resource Description Framework
Is a language to describe resources on the web
Component of the Semantic Web
Data is self-describing
Triples: "subject", "predicate" and "value"
URIs are used to denote resources
RDF Example
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>
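To make the triple structure of the example visible, the sketch below reads the same RDF/XML with Python's standard XML parser and lists (subject, predicate, value) tuples. It handles only this flat form, where each `rdf:Description` has literal-valued child elements; it is not a general RDF parser, and the function name `rdf_triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>"""


def rdf_triples(xml_text: str) -> list:
    """Return (subject, predicate, value) tuples from flat RDF/XML."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores qualified names as "{namespace}local";
            # joining the two parts gives the predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, prop.text))
    return triples


for triple in rdf_triples(RDF_XML):
    print(triple)
```

Each child element of the description becomes one triple, so the five properties of the CD yield five triples with the same subject.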
Semantic Web Example: DBPedia
"Old school" Wikipedia: http://en.wikipedia.org/wiki/Porsche_Panamera
DBpedia entries:
http://dbpedia.org/page/Porsche_Panamera
http://dbpedia.org/page/Chromium_carbide
Query Language: SPARQL (sparkle)
Query language for RDF
Graph traversal
Matching the triples
Example
Data:
  <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> "SPARQL Tutorial" .
Query:
  SELECT ?title
  WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }
Query result:
  title
  "SPARQL Tutorial"
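The idea of "matching the triples" can be imitated in plain Python. The sketch below matches a single triple pattern against a list of triples, treating terms that start with `?` as variables. It stands in for only the smallest core of SPARQL and is not a SPARQL engine; the function name `match_pattern` is hypothetical.

```python
def match_pattern(triples, pattern):
    """Return one variable-binding dict per triple matching the pattern."""
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; a repeated variable
                # must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(bindings)
    return results


DATA = [
    (
        "http://example.org/book/book1",
        "http://purl.org/dc/elements/1.1/title",
        "SPARQL Tutorial",
    ),
]

QUERY = (
    "http://example.org/book/book1",
    "http://purl.org/dc/elements/1.1/title",
    "?title",
)

print(match_pattern(DATA, QUERY))  # [{'?title': 'SPARQL Tutorial'}]
```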
To Infinity and Beyond
• DB2 and Oracle are ready for this train
• Object databases: Versant OODBMS, anybody?
• Machine-readable data: will it become self-aware?
[Diagram: LÂM has LÂM’s iPhone; BẢO has BẢO’s SS Galaxy; TheGioiDiDong.com sold both phones.]
Connection detected!
- Bao could have met Lam at Thegioididong?
- They could have discussed their world domination scheme during the meeting there?
- ???
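The inference in this diagram can be sketched as a small graph query: two people are linked when the same seller sold items that each of them owns. The triples restate the diagram; the predicate spellings and the function name `shared_sellers` are assumptions for illustration.

```python
from itertools import combinations

TRIPLES = [
    ("LÂM", "has", "LÂM's iPhone"),
    ("BẢO", "has", "BẢO's SS Galaxy"),
    ("TheGioiDiDong.com", "sold", "LÂM's iPhone"),
    ("TheGioiDiDong.com", "sold", "BẢO's SS Galaxy"),
]


def shared_sellers(triples):
    """Map each seller to the pairs of owners it connects."""
    owner_of = {item: person for person, pred, item in triples if pred == "has"}
    customers = {}
    for seller, pred, item in triples:
        if pred == "sold" and item in owner_of:
            customers.setdefault(seller, set()).add(owner_of[item])
    return {
        seller: list(combinations(sorted(people), 2))
        for seller, people in customers.items()
        if len(people) > 1
    }


print(shared_sellers(TRIPLES))
```

The query only shows that a path exists in the graph. Whether the two people ever met is exactly the kind of conclusion the data does not support on its own.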
[Diagram: the same graph of LÂM, BẢO, their phones, and TheGioiDiDong.com, annotated "(Does not exist)".]
Visualization of Data
Source: http://nmap.org/favicon/
A scan of the top million web sites (per Alexa traffic data), performed in early 2010
Second Life
Second Life is a 3D world where everyone you see is a real person and every place you visit is built by people just like you.
SL - The Opportunity for "Edutainment"
Drexel Island on Second Life
iSchool Teaching: Quizzes and Lectures
Classrooms with PowerPoint
Research Center
3-D Environments
http://3rdrockgrid.com/
http://www.osgrid.org/
http://www.craft-world.org
http://www.secondlife.com/
http://youralternativelife.com//
Building A Portal From Open Data And Free Services
Freely hosted wiki service
Google Spreadsheet
Google Docs API / JavaScript
Visualization services / analysis services (2D, 3D)
RDF / Semantic Web / Web services
Cost: free or fit to the purpose
References
O'Reilly, Beautiful Data, Chapter 16: Beautifying Data in the Real World
http://techland.time.com/2011/06/01/how-big-is-the-internet-spoiler-not-as-big-as-itll-be-in-2015/
http://drexelisland.wikispaces.com/
SMILE to 3D - Second Life,
http://www.youtube.com/watch?v=tOfhuoRbnCg&feature=player_embedded