Little eScience
DESCRIPTION
Presentation for the myGrid team at the University of Manchester, putting the practice of eScience into the context of little science.
TRANSCRIPT
Little eScience
Andrea Wiggins
June 18, 2009
Overview
• Background
• Exposition: Sociology of Science
• Broad generalizations about science
• Example: FLOSS Research
• Little science context for eScience research
• Expectations: What next?
http://www.flickr.com/photos/pmtorrone/304696349/
• BA: Maths with economics
• Nonprofit & IT industry work
• Adult literacy, nonprofit management support, professional theatre
• Web analytics
• MSI: Human-computer interaction, complex systems & network science
• PhD: Information science & technology
My Background
Science
• Systematic investigation for the production of knowledge
• Scientific method emphasizes reproducibility
• Not all phenomena are reproducible...
• Many categories
• Experimental, applied, social, etc.
• Categories are not mutually exclusive
http://www.flickr.com/photos/radiorover/419414206/
• Kuhn - Laws, theories, applications & instrumentation that create coherent traditions of scientific research
• Paradigms help us direct our research, but limit our view of the world
• New technologies can lead to scientific revolutions by revealing anomalies
Paradigms & Revolutions
http://www.flickr.com/photos/weichbrodt/644302381/
Normal Science
• Kuhn - “normal science” is research based on broadly accepted scientific paradigms
• Shared paradigms are based on rules and standards for scientific practice
• Key requirement: agreement on focus and conduct of research
• Ǝ(Grand Challenges)|Discipline
http://www.flickr.com/photos/themadlolscientist/2421152973/
Big Science
• de Solla Price - “Big Science” is...
• Inherently paradigmatic
• Always normal science
• Produces detailed insights into the minutiae of phenomena studied in the paradigm
http://www.flickr.com/photos/31333486@N00/1883498062/
• Paradigms require agreement on...
• Epistemology
• Ontology
• Methodology
• Most social sciences are pre-paradigmatic
• Primarily exploratory research
• Very little replication
Pre-paradigmatic Science
http://www.flickr.com/photos/askpang/327577395/
Little Science
• de Solla Price - “Little Science” is a romanticized precursor to Big Science, featuring lone, long-haired geniuses misunderstood by society, etc.
• If it’s not Big Science, it’s Little Science
• Pre-paradigmatic and fraught with ambiguity
• Often fundamentally exploratory
• Epistemological/theoretical/methodological divergence among researchers
http://www.flickr.com/photos/mrjoax/2548045246/
Social Science
• Social science is real science: the goal is systematic knowledge production
• Focuses on the study of the social life of human groups and individuals
• IMHO, fundamentally more difficult than “hard” sciences due to infinite complexity of social phenomena
• Replicability is a major challenge with respect to scientific method
• Not all social science can or should aspire to replicability
http://www.flickr.com/photos/smiteme/2379629501/
Normalizing Science
• Becoming a normal science requires community and convergence
• Ǝ(community) != Ǝ(agreement)
• Establishing grand challenges and methods is a primary task of normalizing
• Resistance to change is pervasive
http://www.flickr.com/photos/9036026@N08/2949211479/
Scientific Collaboration
• Collaboration requires common focus, if not also epistemology and ontology
• Challenging enough in normal sciences
• Harder in pre-paradigmatic research
• Economics: systemic disincentives to collaborate, versus potential benefits and ideals of science
http://www.flickr.com/photos/richardsummers/542738965/
• LHC, CERN, etc.
• Thousands of collaborators
• Complex but coordinated, at least somewhat centralized
• Requires shared goals and resources, plus (lots of) communication
• Only happens in normal sciences
Big Science Collaboration
http://www.flickr.com/photos/8767020@N08/531355152/
• A Professor & a grad student, give or take
• Localized goals and resources
• -> localized research practices
• Small research teams
• Fundamentally difficult to achieve consensus that allows larger groups
• Restricts the ability to obtain funding and undertake ambitious projects
Little Science Collaboration
http://www.flickr.com/photos/lamazone/2735939345/
Scientific Collaboration Requirements
• Shared goals
• Establishes focus of research
• Shared research resources
• Both social and artifactual
• Social aspects include training and community socialization
http://www.flickr.com/photos/ryanr/142455033/
we can has share?
• Letters, Books, Journals, Lectures
• Also technologies: methods, instrumentation
• Sharing?
• Recordkeeping is not always a researcher’s main priority
• Without records, there’s not much to share except the research outputs
Historical Research Artifacts
http://www.flickr.com/photos/smailtronic/1535870363/
Today’s Research Artifacts
• Large scale datasets, scripts, software, workflows, papers, images, video, audio, annotations, ephemera, web sites...
• “Research objects” - bundling all the pieces together
• Hybrids of boundary objects and touchstones
• Technologies -> scientific revolution!
• Open science
http://www.flickr.com/photos/smiteme/2379630899/
Example: FLOSS Research
• Phenomenological & interdisciplinary
• Software engineering, Information Systems, Anthropology, Sociology, CSCW, etc...
• Ethos
• (Idealistic) combination of open source values and scientific values
http://www.flickr.com/photos/themadlolscientist/2542236565/
FLOSS Phenomenon
• Free/Libre Open Source Software
• “Free as in speech, free as in beer” - liberty versus cost
• Distributed collaboration to develop software
• Volunteers and sponsored developers
• Community-based model of development
http://www.flickr.com/photos/prawnwarp/541526661/
Typical FLOSS Research Topics
• Coordination and collaboration
• Growth and evolution (social and code)
• Code quality
• Business models and firm involvement
• Motivation, leadership, success
• Culture and community
• Intellectual property and copyright
http://www.flickr.com/photos/eean/519258881/
What we study @ SU
• Social aspects of FLOSS
• What practices make some distributed work teams more effective than others?
• How are these practices developed?
• What are the dynamics through which self-organizing distributed teams develop and work?
Sharing FLOSS Research Artifacts
• Community: Small but growing, maybe around 400 researchers worldwide, with lively face-to-face interaction but relatively low listserv activity
• Data: Lots of it, and readily available, though often difficult to use for several reasons
• Analyses and tools: Not quite as easy to get, but there if you can find them
• Papers: Repositories are as yet underdeveloped, but efforts are underway
http://www.flickr.com/photos/12698507@N08/2762563631/
FLOSS Research Community
• Handful of small research groups, mostly in UK & Europe
• Most often found in Software Engineering departments
• International conferences targeted to academics, developers, or both
• OSS, ICSE, FOSDEM, etc.
• IFIP WG 2.13
http://www.flickr.com/photos/steevithak/2883218362/
FLOSS Research Data
• Data sources include interviews, surveys, and ethnographic fieldwork
• Digital “trace” data: archival, secondary, by-product of work, easy but hard
• Repositories
• Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
• RoRs: Repositories of Repositories
• Data sources for research
We Built It...
• Motivations
• Stop hammering forge servers, getting entire campus IPs blocked...
• Stop reinventing the wheel!
• Adoption
• Shared data sources seeing increasing use
• Next step is harder: sharing tools and workflows
http://www.flickr.com/photos/circulating/997909242/
RoRs: FLOSSmole
• Multiple PIs @ Syracuse, Elon, & Carnegie Mellon
• One grad student @ SU (me), a couple of undergrads @ Elon
• Public access to 300+ GB of data on
• 300K+ projects from 8 repositories
• Flat files & SQL datamarts
• Released via SF & GC
• 5 TB allotment on TeraGrid @ SDSC
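As a rough illustration of what working with a datamart-style RoR looks like, here is a minimal sketch using an in-memory SQLite database; the real FLOSSmole releases are flat files and SQL dumps with their own schemas, so the table and column names below are invented for this example.

```python
import sqlite3

# Toy stand-in for a repository-of-repositories datamart.
# Schema and values are hypothetical, not FLOSSmole's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projects
                (project_name TEXT, forge TEXT, downloads INTEGER)""")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                 [("bibdesk", "sourceforge", 5000),
                  ("taverna", "sourceforge", 3200),
                  ("gimp",    "sourceforge", 90000)])

# Typical RoR-style query: rank a forge's projects by a success proxy
rows = conn.execute("""SELECT project_name, downloads FROM projects
                       WHERE forge = 'sourceforge'
                       ORDER BY downloads DESC""").fetchall()
```

The point of the shared datamart is exactly this: one query against a curated copy instead of every research group scraping the forges independently.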
RoRs: FLOSSmetrics
• Produced by LibreSoft with academic and corporate partners
• Public access to data for 2800+ projects
• Analyzed & raw data from CVS, email, trackers
• Tools for:
• calculating code metrics
• parsing trackers
• parsing email lists
RoRs: SRDA
• SourceForge Research Data Archive
• One PI @ the University of Notre Dame
• One massive 300 GB+ SQL db of monthly dumps from SourceForge
• Original obtuse structure, regular table deprecation, some documentation
• Gated access: researchers only, a condition of data release from SF
RoRs: Emerging Sources
• Ultimate Debian Database (UDD)
• 300 MB compressed Postgres DB, produced by Debian community
• Planning to add to FLOSSmole
• When available...
• Bespoke Scripts
• Taverna workflows
FLOSS Research Analyses
FLOSS Research Papers
• First, there was opensource.mit.edu
• They no longer maintain it, and gave us the data
• Work-in-progress working papers repository at FLOSSpapers.org
• The essential viability problem is that repositories require long-term stewardship...
• ...which requires long-term commitments of funding and personnel, not just volunteers
FLOSS Research Collaboration
• Multiple partners involved in producing FLOSSmole & FLOSSmetrics
• Federated data sources by choice, starting to develop ontologies
• As yet, a Little Science domain
• Cross-institutional collaboration poses many challenges
• Usual difficulties magnified by general lack of resources, both financial and human
Latest Initiatives
• Resource-oriented
• Expanding resources: data, research artifacts, and pedagogical materials
• DOIs: 10.4118/*
• Semantic data interoperability
• Community-oriented
• FLOSShub.org
Evangelizing eScience
• Made presentations at OSS conferences: well received, but hard to make converts for several reasons
• Tried to get other research group members to use Taverna: learning overhead is too high for most
• Submitted a paper on eScience to an IS conference: rejected because reviewers were unable to adequately evaluate eScience as a topic, as it’s too unfamiliar
• Currently just doing our work this way, as an exemplar
http://www.flickr.com/photos/naezmi/2418745377/
Barriers to Uptake
• Lack of agreement in research focus, theory, methods; researcher isolation
• Bimodal distribution of requisite skills
• “I can’t possibly do that! I can’t code!”
• “Why bother? I can code my own. You should too; just use Python.”
“Overheard” on Twitter:
Friend #1: i HATE that openoffice automatically took over my "open with..." defaults.
Friend #2: @Friend #1 <opensourcedeveloper> If you don't like it, then why don't you submit code to change the behavior!? </opensourcedeveloper>
http://www.flickr.com/photos/noner/1739876378/
What I had to learn to get this far
• Taverna
• A lot more Unix terminal & XML
• Relational DB management & SQL
• More R, plus packages and dependency management
• Java & Eclipse - just enough to write my own Beanshells
• SVN & SSH
• A little bit of OWL, RDF, & SPARQL
• I would not have taken this on if I had known what was in store, but once I got started, I was hooked
http://www.flickr.com/photos/sashala/292868436/
Sociotechnical Engineering
• Tools are part of the solution, thanks to brilliant CS and SE people
• Social elements are the true barrier
• Awareness of methods and benefits
• Incentive systems
• Resistance to change (paradigms again)
• Proof of concept is difficult
http://www.flickr.com/photos/pinprick/3117108495/
Using Taverna for Little eScience
• Implementing analysis is usually easy
• Data handling is almost always hard
• All data are in SQL databases, with consistent IDs
• Lots of data manipulation is required
• Avoiding web services as much as possible
• Infrastructure and resources are limited
• Benefit is truly questionable: AFAIK, I am 50% of the user base...
• Estimating user base and potential user interest in FLOSS projects
• Based on common release-and-download patterns
• Proxy for project success, a common dependent variable
Example: Our Recent Research
[Chart: downloads over time across Version 0.5, Version 0.6, and Version 0.7; the area under the curve represents active users updating, annotated with active user base growth and potential user experimentation growth (good publicity?)]
“Normal” Download-Release Patterns
BibDesk
[Chart: BibDesk downloads, Oct 2005 to Apr 2007, roughly 1000 to 5000 per period, with measures for user_base and baseline]
Taverna’s Download-Release Patterns
[Chart: Taverna downloads showing external effects: spikes around release 1.3.2-RC1, two presentations, release 1.5.0, and unexplained (?) events]
Taverna’s Estimated Baseline & User Base
[Chart: 14-day baseline & drop-off]
Taverna’s Estimated Baseline & User Base
[Chart: 7-day baseline & drop-off]
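The estimation the slides describe treats downloads above a pre-release baseline as updates by existing users, so the area under the post-release spike approximates the active user base. A rough sketch of that idea follows; the function name, the window handling, and the synthetic data are all invented for illustration, not the actual analysis code.

```python
def estimate_user_base(daily_downloads, release_day, window=14):
    """Estimate a pre-release baseline from the `window` days before a
    release, then count downloads above that baseline in the `window`
    days after it as updates by active users (area under the spike)."""
    before = daily_downloads[max(0, release_day - window):release_day]
    baseline = sum(before) / len(before)  # mean pre-release download rate
    after = daily_downloads[release_day:release_day + window]
    excess = sum(max(0.0, d - baseline) for d in after)
    return baseline, excess

# Synthetic series: steady 100 downloads/day, plus a release-day spike
# of 500 extra downloads spread over five days of drop-off.
downloads = [100] * 30
for i, bump in enumerate([250, 120, 70, 40, 20]):
    downloads[15 + i] += bump

base, users = estimate_user_base(downloads, release_day=15)
```

A "normal" project like BibDesk fits this model well; the Taverna slides show why it breaks down when presentations and publicity, rather than releases, drive the spikes.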
Interpretation
• Taverna is not a “normal” open source project
• Speaking tours, tutorials, articles, and other events influence downloads
• What this demonstrates...
• Care is needed with quantitative measures
• Not all open source projects are the same
• Taverna users are just as reactive as any
http://www.flickr.com/photos/pagedooley/2121472112/
Where next?
• Adoption is a long-term agenda, as changing social practices doesn’t happen overnight
• For FLOSS research and our disciplinary communities
• We will keep doing our work this way, and hope to draw in others
“Won’t you come out and play?”
http://www.flickr.com/photos/atiq/2658884520/
Thanks!
• Credits where they are due
• Kevin Crowston, my advisor
• James Howison, my collaborator
• Everett Wiggins, my husband