Academic Torrents
Scalable Distribution for Science
Joseph Paul Cohen (NSF Graduate Fellow) and Henry Z Lo
UMass Boston Computer Science Ph.D. Candidates
Entire Presentation
Datasets
- Searchable central index
- Dynamic hosting locations
- Ability to cache on campuses
- Long term persistence
- Aggregate sources
Publications
- Long term persistence
- New publication model: distributed publishing
- Library Smart Nodes
NSF Data Sharing Policy
“Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.”
See Award & Administration Guide (AAG) Chapter VI.D.4.
NIH Data Sharing Policies
“Expects investigators seeking more than $500K in direct support in any given year to submit a data sharing plan with their application or to indicate why data sharing is not possible.”
“Requires data for all NIDA-funded human genetics studies to be available for sharing, independent of direct costs, membership in the NIDA Genetics Consortium, or the type of genetics data generated.”
http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html
We need to share!
Stick figures taken from xkcd
Sharing is Hard
Considerations:
● Maintenance - how much work?
● Bandwidth - how scalable?
● Speed - how fast are downloads?
● Robustness - susceptible to failure?
● Cost - how much will it be?
Single Server Model
One machine hosts a file from one location
● Benefits
○ Simple (relatively)
● Pains
○ Single point of failure (hard drive/network/power outage)
○ Limited bandwidth (one machine serving the world)
Apache Mirroring
Multiple machines host copies of a file
A central point sends the file to each mirror node (via scp, rsync)
A central index publishes hash of file to verify correctness
● Benefits
○ Solves the single point of failure
○ Might be faster if you download from a closer node
● Pains
○ Each mirror must have high bandwidth
○ Verification of each file is responsibility of the users
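In the mirroring model the user has to check a downloaded file against the hash published by the central index. A minimal sketch of that check in Python follows. The choice of SHA-256 and the function names are assumptions for illustration; a real index may publish a different digest type.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, published_hex, algorithm="sha256"):
    """Return True when the local file matches the published hex digest."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

A user would call verify_download with the path of the mirrored copy and the digest string copied from the index, and discard the file if it returns False.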
Academic Torrents
Maintains list of data locations dynamically (via API)
Supports HTTP, FTP, and BitTorrent mirrors
● Benefits
○ Long term preservation of data
○ Automatic verification of data to ensure consistency
○ Can extend existing data dissemination systems
○ Download from multiple mirrors at once (on campus CDN!)
● Pains
○ Clients are not designed for research (until now)
○ Network firewalls may block BitTorrent (HTTP and FTP not blocked)
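The two benefits "automatic verification" and "download from multiple mirrors" can be sketched together: each piece of a file is requested from one of several mirrors and accepted only if its hash matches. This is a simplified illustration, not the client's actual code. The mirrors here are hypothetical callables that take a piece index and return bytes, and pieces are fetched one after another, whereas a real client requests them concurrently.

```python
import hashlib


def fetch_verified(piece_hashes, mirrors):
    """Assemble a file from several mirrors, verifying every piece.

    piece_hashes: list of expected SHA-1 digests, one per piece.
    mirrors: list of callables (hypothetical stand-ins for HTTP, FTP,
             or BitTorrent sources) mapping a piece index to bytes.
    """
    pieces = []
    for index, expected in enumerate(piece_hashes):
        # Start each piece at a different mirror to spread the load,
        # then fall back to the remaining mirrors if verification fails.
        start = index % len(mirrors)
        for mirror in mirrors[start:] + mirrors[:start]:
            data = mirror(index)
            if hashlib.sha1(data).digest() == expected:
                pieces.append(data)
                break
        else:
            raise IOError(f"no mirror returned a valid copy of piece {index}")
    return b"".join(pieces)
```

Because every piece is checked independently, a corrupt or stale mirror only costs a retry for the affected piece rather than a full re-download.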
Method Comparison
Method                    Maintenance  Bandwidth limits  Speed     Robustness  Cost
Single Server             Moderate     Somewhat          Slow      No          Moderate
Multiple Servers          High         Somewhat          Moderate  Somewhat    Moderate
Mailing Disks             High         No                High      No          Low
Free Repositories         Low          Yes               Moderate  Somewhat    Free
Proprietary Repositories  Low          Moderate          Moderate  Somewhat    High
Academic Torrents         Moderate     No                High      Yes         Low
Academic Torrents
1. Create torrent from data
2. Upload torrent to Academic Torrents
3. Peers get torrent from AT
4. Share data with peers
[Screenshot: Transmission torrent client]
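The "create torrent from data" step can be sketched with the standard BitTorrent metainfo layout: the data is split into fixed-size pieces, each piece is hashed with SHA-1, and the result is bencoded. This is a minimal single-file illustration that keeps the data in memory. The tracker URL in the usage note is a placeholder, not a real Academic Torrents endpoint, and in practice a torrent client such as Transmission produces this file.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes, str, lists, and dicts in bencode format."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def make_torrent(name, data, tracker_url, piece_length=256 * 1024):
    """Return (metainfo_bytes, infohash_hex) for a single in-memory file."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    metainfo = {"announce": tracker_url, "info": info}
    # The infohash identifies the torrent to trackers and peers.
    infohash = hashlib.sha1(bencode(info)).hexdigest()
    return bencode(metainfo), infohash
```

For example, make_torrent("data.bin", payload, "http://tracker.example.org/announce") returns the bytes to save as a .torrent file together with the infohash that peers use to find each other.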
Academic Torrents Portal
Each entry contains:
Bibtex Metadata (keys->values)
File listing with hashes (verify authenticity)
Listing of hosting locations (global mirror locations)
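One way to picture a portal entry is as a small record holding the three parts listed above. The sketch below is only an illustration: the class and field names are hypothetical and do not reflect the portal's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One file in the listing, with the hash used to verify it."""
    path: str
    size_bytes: int
    hash_hex: str


@dataclass
class PortalEntry:
    """Hypothetical model of an entry: metadata, files, and mirrors."""
    bibtex_type: str
    cite_key: str
    metadata: dict                                # BibTeX keys -> values
    files: list = field(default_factory=list)     # FileRecord items
    mirrors: list = field(default_factory=list)   # hosting location URLs

    def to_bibtex(self):
        """Render the metadata as a BibTeX record for citation."""
        fields = ",\n".join(
            f"  {key} = {{{value}}}" for key, value in self.metadata.items()
        )
        return f"@{self.bibtex_type}{{{self.cite_key},\n{fields}\n}}"
```

Keeping the citation metadata next to the file hashes and mirror list means a dataset can be cited, verified, and located from a single record.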
Curated collections
Each collection is:
Curated by a user (allows trust)
An updatable folder of entries (modifiable)
Accessible via APIs (RSS, CSV, RESTful)
Command Line Interface (atdown)
https://github.com/AcademicTorrents/AcademicTorrents-Downloader
Use Case: Wikipedia XML Offline Version
10GB of Data
Community Hosted
766 Downloads in 2014 (7.66TB!)
~15 Persistent mirror locations
[Figure: Wikipedia data (10GB), global mirror locations]
Speeds vary
[Figure: download speed in bytes, measured at UMass Boston Campus, Boston, MA and at XSEDE14, Atlanta, GA]
Different Mirror Access: mirrors have different speeds
Use Case: Direct Numerical Simulation of Turbulent Flows
5TB of Data in 63 files
Able to use AT infrastructure as management tool.
[Figure: Direct Numerical Simulation of Turbulent Flows, 250GB/5TB in 2 Locations]
Questions
Why can you expect papers to be accessible?
What is the cost of a research paper?
[Figure: publishing models arranged by cost (Reader/Library Pays vs. Author Pays) and by who can access (Subscribers vs. Everyone)]
Current Publishing Model: Elsevier, IEEE/ACM Journal
Current Open Access Model: PLOS, F1000 Journals
Distributed publishing model: Academic Torrents Library Smart Node
:( IEEE/ACM Conference
Library Smart Node Overview
[Diagram: Student -> Library Database, OpenURL, or AtoZ Server -> content sources]
Publishers: Elsevier, Springer, IEEE Xplore ($$$$$$$$$$)
Academic Torrents Curated SmartNode: ScholarWorks, PLOS, JMLR
Smart Node
Management software for dealing with data
Deals with:
Bandwidth Limits
Space Limits
Content (subscriptions)
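To make the space limit and subscription ideas concrete, here is a small sketch of how a node might pick what to host. It is an assumption for illustration, not the Smart Node's actual algorithm: all names are hypothetical, the policy is a simple smallest-first greedy fill, and bandwidth limits are left out.

```python
from dataclasses import dataclass


@dataclass
class CandidateEntry:
    """Hypothetical description of an entry a node could host."""
    infohash: str
    size_bytes: int
    collection: str


def choose_entries(entries, subscriptions, space_limit_bytes):
    """Pick subscribed entries, smallest first, until the disk budget is used."""
    chosen, used = [], 0
    subscribed = (e for e in entries if e.collection in subscriptions)
    for entry in sorted(subscribed, key=lambda e: e.size_bytes):
        if used + entry.size_bytes <= space_limit_bytes:
            chosen.append(entry)
            used += entry.size_bytes
    return chosen
```

Smallest-first maximizes the number of entries a node mirrors. A library might instead prefer entries with the fewest existing mirrors, which would only change the sort key.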
Smart Node V1
CS410 - Software Design
Team of Undergraduates
GPL/C++
V2 will be in Java
https://github.com/AcademicTorrents/AcademicTorrents-SmartNodeV1
Open Journal Systems Integration
Simon Fraser University Library
Academic Torrents
Is this my dissertation topic? No.
-> Object detection in remote sensed imagery using machine learning
+ Ad-Hoc pervasive mobile networks
+ Semi-structured information extraction
+ CS and Cyber Security Education
[Figure labels: blucat, Throw Platform, Feature Selection, Building Detection, Crater Detection]