Little eScience
DESCRIPTION
Presentation for the myGrid team at the University of Manchester, putting the practice of eScience into the context of little science.
TRANSCRIPT
Little eScience
Andrea Wiggins
June 18, 2009
Overview
• Background
• Exposition: Sociology of Science
• Broad generalizations about science
• Example: FLOSS Research
• Little science context for eScience research
• Expectations: What next?
http://www.flickr.com/photos/pmtorrone/304696349/
• BA: Maths with economics
• Nonprofit & IT industry work
• Adult literacy, nonprofit management support, professional theatre
• Web analytics
• MSI: Human-computer interaction, complex systems & network science
• PhD: Information science & technology
My Background
Science
• Systematic investigation for the production of knowledge
• Scientific method emphasizes reproducibility
• Not all phenomena are reproducible...
• Many categories
• Experimental, applied, social, etc.
• Categories are not mutually exclusive
http://www.flickr.com/photos/radiorover/419414206/
• Kuhn - Laws, theories, applications & instrumentation that create coherent traditions of scientific research
• Paradigms help us direct our research, but limit our view of the world
• New technologies can lead to scientific revolutions by revealing anomalies
Paradigms & Revolutions
http://www.flickr.com/photos/weichbrodt/644302381/
Normal Science
• Kuhn - “normal science” is research based on broadly accepted scientific paradigms
• Shared paradigms are based on rules and standards for scientific practice
• Key requirement: agreement on focus and conduct of research
• Ǝ(Grand Challenges)|Discipline
http://www.flickr.com/photos/themadlolscientist/2421152973/
Big Science
• de Solla Price - “Big Science” is...
• Inherently paradigmatic
• Always normal science
• Produces detailed insights into the minutiae of phenomena studied in the paradigm
http://www.flickr.com/photos/31333486@N00/1883498062/
• Paradigms require agreement on...
• Epistemology
• Ontology
• Methodology
• Most social sciences are pre-paradigmatic
• Primarily exploratory research
• Very little replication
Pre-paradigmatic Science
http://www.flickr.com/photos/askpang/327577395/
Little Science
• de Solla Price - “Little Science” is a romanticized precursor to Big Science, featuring lone, long-haired geniuses misunderstood by society, etc.
• If it’s not Big Science, it’s Little Science
• Pre-paradigmatic and fraught with ambiguity
• Often fundamentally exploratory
• Epistemological/theoretical/methodological divergence among researchers
http://www.flickr.com/photos/mrjoax/2548045246/
Social Science
• Social science is real science: the goal is systematic knowledge production
• Focuses on the study of the social life of human groups and individuals
• IMHO, fundamentally more difficult than “hard” sciences due to infinite complexity of social phenomena
• Replicability is a major challenge with respect to scientific method
• Not all social science can or should aspire to replicability
http://www.flickr.com/photos/smiteme/2379629501/
Normalizing Science
• Becoming a normal science requires community and convergence
• Ǝ(community) != Ǝ(agreement)
• Establishing grand challenges and methods is a primary task of normalizing
• Resistance to change is pervasive
http://www.flickr.com/photos/9036026@N08/2949211479/
Scientific Collaboration
• Collaboration requires common focus, if not also epistemology and ontology
• Challenging enough in normal sciences
• Harder in pre-paradigmatic research
• Economics: systemic disincentives to collaborate, versus potential benefits and ideals of science
http://www.flickr.com/photos/richardsummers/542738965/
• LHC, CERN, etc.
• Thousands of collaborators
• Complex but coordinated, at least somewhat centralized
• Requires shared goals and resources, plus (lots of) communication
• Only happens in normal sciences
Big Science Collaboration
http://www.flickr.com/photos/8767020@N08/531355152/
• A Professor & a grad student, give or take
• Localized goals and resources
• -> localized research practices
• Small research teams
• Fundamentally difficult to achieve consensus that allows larger groups
• Restricts the ability to obtain funding and undertake ambitious projects
Little Science Collaboration
http://www.flickr.com/photos/lamazone/2735939345/
Scientific Collaboration Requirements
• Shared goals
• Establishes focus of research
• Shared research resources
• Both social and artifactual
• Social aspects include training and community socialization
http://www.flickr.com/photos/ryanr/142455033/
we can has share?
• Letters, Books, Journals, Lectures
• Also technologies: methods, instrumentation
• Sharing?
• Recordkeeping is not always a researcher’s main priority
• Without records, there’s not much to share except the research outputs
Historical Research Artifacts
http://www.flickr.com/photos/smailtronic/1535870363/
Today’s Research Artifacts
• Large scale datasets, scripts, software, workflows, papers, images, video, audio, annotations, ephemera, web sites...
• “Research objects” - bundling all the pieces together
• Hybrids of boundary objects and touchstones
• Technologies -> scientific revolution!
• Open science
http://www.flickr.com/photos/smiteme/2379630899/
Example: FLOSS Research
• Phenomenological & interdisciplinary
• Software engineering, Information Systems, Anthropology, Sociology, CSCW, etc...
• Ethos
• (Idealistic) combination of open source values and scientific values
http://www.flickr.com/photos/themadlolscientist/2542236565/
FLOSS Phenomenon
• Free/Libre Open Source Software
• “Free as in speech, free as in beer” - liberty versus cost
• Distributed collaboration to develop software
• Volunteers and sponsored developers
• Community-based model of development
http://www.flickr.com/photos/prawnwarp/541526661/
Typical FLOSS Research Topics
• Coordination and collaboration
• Growth and evolution (social and code)
• Code quality
• Business models and firm involvement
• Motivation, leadership, success
• Culture and community
• Intellectual property and copyright
http://www.flickr.com/photos/eean/519258881/
What we study @ SU
• Social aspects of FLOSS
• What practices make some distributed work teams more effective than others?
• How are these practices developed?
• What are the dynamics through which self-organizing distributed teams develop and work?
Sharing FLOSS Research Artifacts
• Community: Small but growing, maybe around 400 researchers worldwide, with lively face-to-face interaction but relatively low listserv activity
• Data: Lots of it, and readily available, though often difficult to use for several reasons
• Analyses and tools: Not quite as easy to get, but there if you can find them
• Papers: Repositories are as yet underdeveloped, but efforts are underway
http://www.flickr.com/photos/12698507@N08/2762563631/
FLOSS Research Community
• Handful of small research groups, mostly in UK & Europe
• Most often found in Software Engineering departments
• International conferences targeted to academics, developers, or both
• OSS, ICSE, FOSDEM, etc.
• IFIP WG 2.13
http://www.flickr.com/photos/steevithak/2883218362/
FLOSS Research Data
• Data sources include interviews, surveys, and ethnographic fieldwork
• Digital “trace” data: archival, secondary, by-product of work, easy but hard
• Repositories
• Hosting “forges” like SourceForge, FreshMeat, RubyForge, etc.
• RoRs: Repositories of Repositories
• Data sources for research
We Built It...
• Motivations
• Stop hammering forge servers, getting entire campus IPs blocked...
• Stop reinventing the wheel!
• Adoption
• Shared data sources seeing increasing use
• Next step is harder: sharing tools and workflows
http://www.flickr.com/photos/circulating/997909242/
RoRs: FLOSSmole
• Multiple PIs @ Syracuse, Elon, & Carnegie Mellon
• One grad student @ SU (me), a couple of undergrads @ Elon
• Public access to 300+ GB of data on
• 300K+ projects from 8 repositories
• Flat files & SQL datamarts
• Released via SF & GC
• 5 TB allotment on TeraGrid @ SDSC
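As a rough illustration of what working with a datamart-style RoR looks like, here is a minimal sketch using an in-memory SQLite database; the real FLOSSmole releases are flat files and SQL dumps with their own schemas, so the table and column names below are invented for this example.

```python
import sqlite3

# Toy stand-in for a repository-of-repositories datamart.
# Schema and values are hypothetical, not FLOSSmole's actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE projects
                (project_name TEXT, forge TEXT, downloads INTEGER)""")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?)",
                 [("bibdesk", "sourceforge", 5000),
                  ("taverna", "sourceforge", 3200),
                  ("gimp",    "sourceforge", 90000)])

# Typical RoR-style query: rank a forge's projects by a success proxy
rows = conn.execute("""SELECT project_name, downloads FROM projects
                       WHERE forge = 'sourceforge'
                       ORDER BY downloads DESC""").fetchall()
```

The point of the shared datamart is exactly this: one query against a curated copy instead of every research group scraping the forges independently.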
RoRs: FLOSSmetrics
• Produced by LibreSoft with academic and corporate partners
• Public access to data for 2800+ projects
• Analyzed & raw data from CVS, email, trackers
• Tools for:
• calculating code metrics
• parsing trackers
• parsing email lists
RoRs: SRDA
• SourceForge Research Data Archive
• One PI @ the University of Notre Dame
• One massive 300 GB+ SQL db of monthly dumps from SourceForge
• Original obtuse structure, regular table deprecation, some documentation
• Gated access: researchers only, a condition of data release from SF
RoRs: Emerging Sources
• Ultimate Debian Database (UDD)
• 300 MB compressed Postgres DB, produced by Debian community
• Planning to add to FLOSSmole
• When available...
• Bespoke Scripts
• Taverna workflows
FLOSS Research Analyses
FLOSS Research Papers
• First, there was opensource.mit.edu
• They no longer maintain it, and gave us the data
• Work-in-progress working papers repository at FLOSSpapers.org
• The essential viability problem is that repositories require long-term stewardship...
• ...which requires long-term commitments of funding and personnel, not just volunteers
FLOSS Research Collaboration
• Multiple partners involved in producing FLOSSmole & FLOSSmetrics
• Federated data sources by choice, starting to develop ontologies
• As yet, a Little Science domain
• Cross-institutional collaboration poses many challenges
• Usual difficulties magnified by general lack of resources, both financial and human
Latest Initiatives
• Resource-oriented
• Expanding resources: data, research artifacts, and pedagogical materials
• DOIs: 10.4118/*
• Semantic data interoperability
• Community-oriented
• FLOSShub.org
Evangelizing eScience
• Made presentations at OSS conferences: well received, but hard to make converts for several reasons
• Tried to get other research group members to use Taverna: learning overhead is too high for most
• Submitted a paper on eScience to an IS conference: rejected because reviewers were unable to adequately evaluate eScience as a topic, as it’s too unfamiliar
• Currently just doing our work this way, as an exemplar
http://www.flickr.com/photos/naezmi/2418745377/
Barriers to Uptake
• Lack of agreement in research focus, theory, methods; researcher isolation
• Bimodal distribution of requisite skills
• “I can’t possibly do that! I can’t code!”
• “Why bother? I can code my own. You should too; just use Python.”
“Overheard” on Twitter:
Friend #1: i HATE that openoffice automatically took over my "open with..." defaults.
Friend #2: @Friend #1 <opensourcedeveloper> If you don't like it, then why don't you submit code to change the behavior!? </opensourcedeveloper>
http://www.flickr.com/photos/noner/1739876378/
What I had to learn to get this far
• Taverna
• A lot more Unix terminal & XML
• Relational DB management & SQL
• More R, plus packages and dependency management
• Java & Eclipse - just enough to write my own Beanshells
• SVN & SSH
• A little bit of OWL, RDF, & SPARQL
• I would not have taken this on if I had known what was in store, but once I got started, I was hooked
http://www.flickr.com/photos/sashala/292868436/
Sociotechnical Engineering
• Tools are part of the solution, thanks to brilliant CS and SE people
• Social elements are the true barrier
• Awareness of methods and benefits
• Incentive systems
• Resistance to change (paradigms again)
• Proof of concept is difficult
http://www.flickr.com/photos/pinprick/3117108495/
Using Taverna for Little eScience
• Implementing analysis is usually easy
• Data handling is almost always hard
• All data are in SQL databases, with consistent IDs
• Lots of data manipulation is required
• Avoiding web services as much as possible
• Infrastructure and resources are limited
• Benefit is truly questionable: AFAIK, I am 50% of the user base...
• Estimating user base and potential user interest in FLOSS projects
• Based on common release-and-download patterns
• Proxy for project success, a common dependent variable
Example: Our Recent Research
[Chart: downloads over time across Version 0.5, Version 0.6, and Version 0.7; the area under the curve represents active users updating, annotated with active user base growth and potential user experimentation growth (good publicity?)]
“Normal” Download-Release Patterns
BibDesk
[Chart: BibDesk downloads, Oct 2005 to Apr 2007, roughly 1000 to 5000 per period, with measures for user_base and baseline]
Taverna’s Download-Release Patterns
[Chart: Taverna downloads showing external effects: spikes around release 1.3.2-RC1, two presentations, release 1.5.0, and unexplained (?) events]
Taverna’s Estimated Baseline & User Base
[Chart: 14-day baseline & drop-off]
Taverna’s Estimated Baseline & User Base
[Chart: 7-day baseline & drop-off]
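The estimation the slides describe treats downloads above a pre-release baseline as updates by existing users, so the area under the post-release spike approximates the active user base. A rough sketch of that idea follows; the function name, the window handling, and the synthetic data are all invented for illustration, not the actual analysis code.

```python
def estimate_user_base(daily_downloads, release_day, window=14):
    """Estimate a pre-release baseline from the `window` days before a
    release, then count downloads above that baseline in the `window`
    days after it as updates by active users (area under the spike)."""
    before = daily_downloads[max(0, release_day - window):release_day]
    baseline = sum(before) / len(before)  # mean pre-release download rate
    after = daily_downloads[release_day:release_day + window]
    excess = sum(max(0.0, d - baseline) for d in after)
    return baseline, excess

# Synthetic series: steady 100 downloads/day, plus a release-day spike
# of 500 extra downloads spread over five days of drop-off.
downloads = [100] * 30
for i, bump in enumerate([250, 120, 70, 40, 20]):
    downloads[15 + i] += bump

base, users = estimate_user_base(downloads, release_day=15)
```

A "normal" project like BibDesk fits this model well; the Taverna slides show why it breaks down when presentations and publicity, rather than releases, drive the spikes.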
Interpretation
• Taverna is not a “normal” open source project
• Speaking tours, tutorials, articles, and other events influence downloads
• What this demonstrates...
• Care is needed with quantitative measures
• Not all open source projects are the same
• Taverna users are just as reactive as any
http://www.flickr.com/photos/pagedooley/2121472112/
Where next?
• Adoption is a long-term agenda, as changing social practices doesn’t happen overnight
• For FLOSS research and our disciplinary communities
• We will keep doing our work this way, and hope to draw in others
“Won’t you come out and play?”
http://www.flickr.com/photos/atiq/2658884520/
Thanks!
• Credits where they are due
• Kevin Crowston, my advisor
• James Howison, my collaborator
• Everett Wiggins, my husband