A Data Model and Development Environment to Help End-User Programmers Validate and Reuse Data

Christopher Scaffidi

Thesis Proposal, May 8, 2007

Committee

Mary Shaw (chair) Institute for Software Research, Carnegie Mellon University

Sebastian Elbaum Computer Science & Engineering, University of Nebraska-Lincoln

Jim Herbsleb Institute for Software Research, Carnegie Mellon University

Brad Myers Human-Computer Interaction Institute, Carnegie Mellon University

Target audience

• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.

• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). [5]

• Both EUs and EUPs will benefit from the proposed research, though the proposed research is primarily aimed at EUPs (including EUs who become EUPs because of the research).

introduction ● related work ● studies ● prototype ● proposed work ● evaluation ● summary

Contextual inquiry: What are the problems of EUs and EUPs?

• Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each)

How do you validate web forms if you do not know JavaScript?

Is the input valid? “EDSH 225”

Is the input nearly valid? “EDXH 225”

Does it just need reformatting? “Smith 225”

Or is it obviously badly invalid? “Robotics Institute”

Other tasks, other data, other problems

• When building a staff roster by merging data sources into a single spreadsheet, one of the EUs:
– Had to manually transform data to a consistent format (e.g.: put person names in Lastname, Firstname format)
– Had to scrutinize data to identify questionable values that deserved double-checking (e.g.: a first name with 15 characters might be right)
– Had to manually check for (near-) duplicates (e.g.: “Scaffidi, Christopher” and “Scaffidi, Chris”)

• We and research collaborators identified many additional data validation and data reuse tasks that were poorly supported by existing tools. [3][7][9]

Underlying problem: abstraction mismatch

• Tools support strings, integers, floats, sometimes dates.

• Problem domain involves higher-level categories of data:

– University names “Carnegie Mellon”, “CMU”

– Person names “Scaffidi, Christopher”, “Chris Scaffidi”

– CMU phone numbers “8-1234”, “x8-1234”

– CMU room numbers “WeH 4623”, “Wean 4623”

• These data categories are:

– Human-readable

– Short (~ 1 input field)

– Multi-format

– Sometimes ambiguous / fuzzy (non-binary scale of validity)

– Often particular to certain groups of people

A New Direction: Create a new abstraction for each category of data

• Like software “libraries,” implementations of these abstractions could be reused in many programs.

• Abstractions would need to include functionality for:
– Recognizing instances of the category (for automating data validation)
– Transforming instances among various formats (for automating data reformatting)
– Testing instances for equality (for automating removal of duplicates)

A New Direction: Other requirements for abstractions

• EUPs over a range of programming expertise must be able to create custom new abstractions.

• Flexibility:
– Abstractions must capture fuzziness when recognizing instances of the category and when testing equivalence.
– EUPs must have the option of configuring abstractions to learn exceptional cases.

• Sharability:
– EUPs must still be able to share and find useful abstractions even as the number of abstractions grows.
– Latency and throughput of operations must not become burdensome as EUPs share numerous abstractions.

Thesis

The proposed data model and development environment will enable end-user programmers to implement and share custom abstractions for flexibly recognizing, transforming and equivalence-testing values in categories of short, human-readable data.

The model and environment will help end-user programmers to more quickly and correctly validate and reuse data than is possible through currently practiced methods.

Topes

• Tope = an abstraction implementation for a data category
– Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

• Topes in practice:

1. EUPs create new topes by using the basic tope editor (or by writing topes in another language, such as JavaScript)

2. EUPs publish topes on repositories.

3. Other EUs & EUPs download topes to their local cache.

4. Tool plug-ins let EUs & EUPs browse their local cache and associate topes with variables and input fields.

5. Plug-ins get topes from local cache and use them to recognize, transform, and equivalence-test data.

Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule

Existing approaches lack an easy way for EUPs to create flexible, sharable abstractions for data categories

Existing programming tools for EUPs (eg: Excel, Visual Studio Express, Robofox)

• Limited support for a closed set of data categories:
– Spreadsheets (like Excel) allow EUs to associate certain formats with cells, but these do not actually validate data

– Web application design tools (like Visual Studio) allow EUPs to apply certain limited constraints to validate input

– Web macro tools (like Robofox) allow EUPs to store certain personal data (eg: phone #) and reuse it

• No straightforward mechanisms for EUPs to create new abstractions for unsupported categories of data

User-definable data formats (eg: SWYN, Grammex, Lapis, Data Detectors)

• EUPs struggle to understand and create regexps/CFGs

• These formats are binary (non-fuzzy) recognizers

• Formats alone do not transform or equivalence-test data

• Only Apple Data Detectors offers sharing mechanisms

Lapis example:
@DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing
@ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing
@ShortYear is Number equal to /\d\d/ ignoring nothing
Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation

Formal and OO types (eg: ML, Java, C#)

• Type systems are inflexible:

– A value is or is not a valid instance of a type (non-fuzzy)

– If a value is invalid at compile-time, it cannot become valid at runtime

• Typed languages are probably difficult for EUPs who are uncomfortable with untyped scripting languages.

Format-inference and constraint-enforcing (eg: info. extraction, Lapis, Cues, Slate)

• Various approaches:

– Many algorithms infer an abstract model, CFG-like grammar, or other format with very low editability.

– Other algorithms enforce constraints (either inferred or specified by EUPs) that cannot handle string-like data

• Formats, grammars, and constraints are not able to transform or equivalence-test data.

Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule

Tasks commonly involve recognizing, transforming, and equivalence-testing values in categories of short, human-readable data.

Survey of EUPs: Better data-manipulation features needed

• Asked 831 information workers about use of 23 features in 5 tools (eg: creating spreadsheet macros, database stored procedures, and web forms) [4][9]

• The most widely used features were related to manipulating linked structures of data (eg: database tables) rather than imperative or macro programming

• Yet respondents complained about these features:

– “Not always easy to move sturctured [sic] data or text”

– “Not always integrated a lot of data manipulation redundant”

– “Information entered inconsistently into database fields by different people leaves a lot of database cleaning”

Contextual inquiry of EUs and EUPs: Specific data-manipulation features needed

• Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each) [3][9]

• They needed better support for automatically:

– Transforming data values among different formats within the same category of data (eg: ST to State)

– Identifying questionable data values that could be acceptable for a task but deserve double-checking

– Identifying duplicate values, including values that were probably equivalent

Interviews of web site creators: Confirmation of specific features needed

• Interviewed 6 people involved in creating “person locator” web sites after Hurricane Katrina [7][9]

• Many omitted data validation on web forms
– Hard to detect that “12 Years old” is an invalid street address (what would the regexp look like?)

• “Aggregator” sites were built to scrape and consolidate data from numerous person locator sites.

– Hard to transform data into a single consistent format

– Hard to identify probable duplicates in the merged data set

Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule

How could flexible formats be expressed?

Prototype
Task flow diagram [1][6]

[Figure: task flow. The user highlights spreadsheet cells. Then an algorithm infers a format from the cell values, or the user creates a format from scratch, or the user loads an existing format from a file. The user reviews and customizes the format. The plug-in flags cells that don’t match the format.]

Sample task: validating a spreadsheet with the prototype we have built

• The second column is “supposed” to contain first names, but some initials have snuck in.

Sample task: validating a spreadsheet
Customizing an inferred format

• User can specify meaningful names for parts

Sample task: validating a spreadsheet
Customizing constraints in our prototype

• User can add/edit constraints

Sample task: validating a spreadsheet
Flagging potential errors

• A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message

Sample task: web form validation
The painful old way

• Drag widgets and validator onto page, select a regexp, customize if desired.

Sample task: web form validation
Results of the painful old way

• Invalid inputs cause a hard-coded message to appear.

Oops, forgot to enter a message at design-time.

• For valid inputs, no error message appears.

Hm, didn’t realize the area code was optional.

What if I want to allow campus phone numbers?

Sample task: web form validation
The wonderful new way

• Drag widgets and validator onto page, select a format, customize if desired.

Sample task: web form validation
Creating this format took 55 seconds

Sample task: web form validation
Results of the new way

• Invalid inputs cause a targeted message to appear.

• Inputs that violate an always or never constraint cannot be submitted to the server.

• Inputs that violate an often constraint cause a warning, which the application user can override.
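The graduated response described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's code: the `Constraint` class, the `check` function, and the example phone-number rules in `PHONE` are all hypothetical names and rules made up for the sketch.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    kind: str                    # "always", "never", or "often"
    test: Callable[[str], bool]  # predicate over the input string
    message: str                 # targeted message shown when violated


def check(value: str, constraints: List[Constraint]) -> Tuple[str, List[str]]:
    """Return ("reject" | "warn" | "accept", targeted messages) for one input."""
    outcome, messages = "accept", []
    for c in constraints:
        holds = c.test(value)
        hard_violation = (c.kind == "always" and not holds) or (c.kind == "never" and holds)
        if hard_violation:
            # always/never violations block submission
            outcome = "reject"
            messages.append(c.message)
        elif c.kind == "often" and not holds:
            # often violations only warn; the user may override
            if outcome != "reject":
                outcome = "warn"
            messages.append(c.message)
    return outcome, messages


# Hypothetical rules for a phone number field (assumptions for illustration).
PHONE = [
    Constraint("always", lambda s: re.fullmatch(r"[\d\- ()x]+", s) is not None,
               "Phone numbers contain only digits and punctuation."),
    Constraint("never", lambda s: s.startswith("-"),
               "Phone numbers never start with a hyphen."),
    Constraint("often", lambda s: re.fullmatch(r"\d{3}-\d{3}-\d{4}", s) is not None,
               "Phone numbers usually look like 412-555-1212."),
]
```

With these rules, `check("8-1234", PHONE)` yields a warning rather than a rejection, which is the behavior a campus phone number would need.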

Prototype implementation
System block diagram

[Figure: system block diagram. Labeled components: Spreadsheet, Microsoft Excel, Plug-in; Microsoft Visual Studio.NET, Plug-in; Format editor; Parser; Web application, Validator.]

Benefits of the format editor

• Exotic regexp notation is replaced with sentence-like screen prompts.

• Soft constraints (“often”) are supported.

• Negation constraints (“never”) are supported.

• In terms of expressiveness, augmented context-free grammars > context-free grammars > regexps

But is the expressiveness adequate for common data?

Expressiveness evaluation

• Four administrative assistants’ use of a web browser was logged for three weeks, resulting in nearly 6000 sample data values that they typed into web forms.

• Not logged verbatim: characters were generalized
– Eg: [email protected] → Aa{7}0@a{5}.a{3}

• We manually grouped values into 19 semantic families (eg: email address) based on each widget’s HTML name and the words visually near the widget

• Created and tested formats for 14 families (4250 values)
– Omitted: username/passwords and long blocks of “text”
– Inference & testing features were not used during format creation
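The character generalization used for logging can be illustrated with a short sketch. The exact rule is not given on the slide; the version below assumes uppercase letters map to `A`, lowercase letters to `a`, digits to `0`, other characters stay as typed, and runs longer than one are written with a `{n}` count. The function name `generalize` is hypothetical.

```python
from itertools import groupby


def generalize(value: str) -> str:
    """Replace each character by its class and compress runs, e.g. 'abc' -> 'a{3}'."""
    def char_class(ch: str) -> str:
        if ch.isupper():
            return "A"
        if ch.islower():
            return "a"
        if ch.isdigit():
            return "0"
        return ch  # punctuation and spaces are kept verbatim

    pieces = []
    for symbol, run in groupby(char_class(ch) for ch in value):
        count = len(list(run))
        if symbol in ("A", "a", "0") and count > 1:
            pieces.append(f"{symbol}{{{count}}}")
        else:
            pieces.append(symbol * count)
    return "".join(pieces)
```

Under these assumptions, a made-up address such as `Abcdefgh1@gmail.com` generalizes to `Aa{7}0@a{5}.a{3}`, the same shape as the example on the slide.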

Expressiveness evaluation results

• 9 families needed 1 format each; 5 needed 2 formats each

• Easy to quickly express a reasonably correct format?
– 11 families took < 1 minute each; others 3, 5, 7 minutes
– No errors found in formats for 9 families; 5 had errors

• Most errors: forgetting to mark a part as optional

• Testing feature was added after this evaluation

• The only error attributable to editor expressiveness:
– 1 of the 4250 test values had a trailing period on a street type (in an address line)
– This particular version of the editor had no way to say that a part could contain a period but only at the end

[6]

Extension and further evaluation needed

• The editor evaluation again highlighted the need for supporting multiple formats within each data category.

• The proposed work will add this support.

• Then, usability of the editor as a whole will be evaluated.

Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule

Generalizing the prototype: a lightweight data model + a development environment to help EUPs create, share and use topes

Proposed data model

• 1 tope implementation contains executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format
– 0 or 1 eqc: string × string → [0,1] function per format, for testing equivalence of two values in a format (default is a binary test for being exactly identical)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another

• A lightweight data model…
– Only contains 3 kinds of functions (isa/eqc/trf)
– These correspond to the operations that people had to keep performing manually in our studies.
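One way to picture the isa/eqc/trf model is as a small data structure. The sketch below is an assumption-laden illustration rather than the proposed implementation: the `Format` and `Tope` classes, the `BUILDINGS` lookup table, and the scores 0.5 and 0.3 for "nearly valid" values are all invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Format:
    name: str
    isa: Callable[[str], float]                      # string -> [0,1]
    eqc: Optional[Callable[[str, str], float]] = None  # string x string -> [0,1]

    def equivalent(self, a: str, b: str) -> float:
        # Default eqc: binary test for being exactly identical.
        if self.eqc is not None:
            return self.eqc(a, b)
        return 1.0 if a == b else 0.0


@dataclass
class Tope:
    name: str
    formats: Dict[str, Format]
    # trf functions keyed by (source format, target format)
    trf: Dict[Tuple[str, str], Callable[[str], str]] = field(default_factory=dict)

    def recognize(self, value: str) -> float:
        return max(f.isa(value) for f in self.formats.values())

    def transform(self, value: str, source: str, target: str) -> str:
        return self.trf[(source, target)](value)


BUILDINGS = {"Smith": "EDSH", "Wean": "WeH"}  # hypothetical lookup table


def isa_abbrev(s: str) -> float:
    m = re.fullmatch(r"([A-Za-z]{2,4}) (\d{3,4})", s)
    if m is None:
        return 0.0
    return 1.0 if m.group(1) in BUILDINGS.values() else 0.5  # unknown code: questionable


def isa_colloquial(s: str) -> float:
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", s)
    if m is None:
        return 0.0
    return 1.0 if m.group(1) in BUILDINGS else 0.3


def colloquial_to_abbrev(s: str) -> str:
    building, room = s.split(" ")
    return f"{BUILDINGS[building]} {room}"


room_tope = Tope(
    name="CMU room number",
    formats={"abbrev": Format("abbrev", isa_abbrev),
             "colloquial": Format("colloquial", isa_colloquial)},
    trf={("colloquial", "abbrev"): colloquial_to_abbrev},
)
```

In this sketch `room_tope.recognize("EDXH 225")` returns an intermediate score, and `room_tope.transform("Smith 225", "colloquial", "abbrev")` reformats the colloquial name into the abbreviation format.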

Example tope
Notional representation

• An example tope for CMU room numbers
– 3 isa functions, up to 3 eqc functions, 4 trf functions
– A tope’s eqc and trf functions can be omitted if desired

[Figure: the tope’s three formats, each with an example value]
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Building abbreviation & room number: EDSH 225
– Colloquial building name & room number: Smith 225

Proposed development environment
Functional decomposition diagram

[Figure: the development environment comprises the Basic Topes Editor, Repository Software, Publishing Tools, Search Tools, Plug-Ins, and Normalization.]

EUPs implement topes in the basic topes editor (or JavaScript), then publish them in repositories. Other EUs and EUPs search for topes, download them, then use them through plug-ins.

Proposed development environment
Enhanced basic topes editor

Proposed work
Enhancing the basic topes editor

• Extend isa support
– Improve error message generation

• Add trf support
– EUPs will specify a series of steps:
• Select a part, select an operator
• Operators: permutation, lookup, arithmetic, capitalization
– Add (regression) testing features to facilitate consistency

• Add eqc support
– For each part, EUPs will specify a comparison operator, returning value in [0,1], and these will be multiplied.
• Operators: exactly identical, case-insensitive comparison, ~arithmetic distance, ~edit distance
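The eqc scheme above (one comparison operator per part, scores multiplied) might look like the following sketch. The names `make_eqc`, `split_name`, and `edit_similarity` are hypothetical, and turning edit distance into a [0,1] score by dividing by the longer length is an assumption, since the slide only says "~edit distance".

```python
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def case_insensitive(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0


def edit_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def make_eqc(split: Callable[[str], Sequence[str]],
             operators: List[Callable[[str, str], float]]) -> Callable[[str, str], float]:
    """Build an eqc function that multiplies per-part comparison scores."""
    def eqc(x: str, y: str) -> float:
        score = 1.0
        for op, part_x, part_y in zip(operators, split(x), split(y)):
            score *= op(part_x, part_y)
        return score
    return eqc


def split_name(s: str) -> List[str]:
    # Assumes the "Lastname, Firstname" format.
    last, first = s.split(", ")
    return [last, first]


name_eqc = make_eqc(split_name, [case_insensitive, edit_similarity])
```

For the pair “Scaffidi, Christopher” and “Scaffidi, Chris”, the last names match exactly and the first names differ by six edits out of eleven characters, so `name_eqc` gives a score between 0 and 1 rather than a binary answer.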

Proposed development environment
Repository software

Proposed work
Repository software

• Clients will have a list of “known” repository servers

– Generally pre-configured to include a global server at CMU

– Organizations will configure clients to include the organizational server

– EUs and EUPs will be able to add new servers to their list

• To support publishing/searching, the repository will house meta-information about topes.

• (EUPs can also simply email topes to EUs and other EUPs, bypassing the repository system.)

Proposed development environment
Publishing tools

Proposed work
Publishing topes

• Publishing a tope on a repository
– Anonymously, or authenticated
– EUPs can gather into groups, publish group-private topes
– Each tope can have a non-unique name & description
– Internally, each tope will have a globally unique id (guid)
• For published tope, guid = URL of the master copy
• (For emailed tope, guid based on sender’s email address)

• Tope aliases
– EUPs can publish tope aliases
– Alias has no implementation; just points to another tope
– Alias can have its own name, description

Proposed development environment
Search tools

Proposed work
Searching for relevant topes

• Search by keyword:
– Search tope name and description
– And match based on words that are visually near to topes

• Search by groups of people:
– Within an organization, or by author’s email domain
– Within spaces that are “group-private”

• Search by groups of topes:
– “If you liked this tope, you may also like XYZ”
– Similar to Amazon.com’s product recommendations

• Search by example:
– “Find me a tope that recognizes 412-555-1212”
– For efficiency, filter based on “signature” (\d{3}-\d{3}-\d{4})
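Search by example with a signature filter could be sketched as follows. Everything here is an assumption for illustration: the `Entry` record, the idea that each repository entry stores the signatures of a few example values, the `[a-z]` symbol used for letters, and the two sample entries. The signature is compared as a plain string key; it is not compiled as a regexp.

```python
import re
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, FrozenSet, List


def signature(value: str) -> str:
    """Digits become \\d, letters become [a-z]; runs are counted, e.g. \\d{3}."""
    def symbol(ch: str) -> str:
        if ch.isdigit():
            return r"\d"
        if ch.isalpha():
            return "[a-z]"
        return ch

    parts = []
    for sym, run in groupby(symbol(ch) for ch in value):
        n = len(list(run))
        if sym in (r"\d", "[a-z]") and n > 1:
            parts.append(f"{sym}{{{n}}}")
        else:
            parts.append(sym * n)
    return "".join(parts)


@dataclass
class Entry:
    name: str
    signatures: FrozenSet[str]   # signatures of example values stored as meta-information
    isa: Callable[[str], float]


def search_by_example(example: str, repository: List[Entry]) -> List[str]:
    sig = signature(example)
    hits = []
    for entry in repository:
        if sig not in entry.signatures:
            continue  # cheap filter: skip this tope without running its isa function
        score = entry.isa(example)
        if score > 0:
            hits.append((score, entry.name))
    return [name for score, name in sorted(hits, reverse=True)]


# Hypothetical repository contents.
REPOSITORY = [
    Entry("US phone number", frozenset({signature("412-555-1212")}),
          lambda s: 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", s) else 0.0),
    Entry("CMU phone number", frozenset({signature("8-1234"), signature("x8-1234")}),
          lambda s: 1.0 if re.fullmatch(r"x?\d-\d{4}", s) else 0.0),
]
```

The design point is that the string comparison on signatures is much cheaper than running every tope's isa function, so only plausible candidates are executed.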

Proposed work
Searching for trustworthy topes

Evidence [8] that leads EUs and EUPs to trust topes, with the related search features:

• Explicit formal roles: Created by their organization’s system administrators. (Search by tope author)

• Prior performance: From people who have previously supplied good topes.

• Model of motivation: From vendors that care about brand image.

• Group membership: From people who are known to have a similar background.

• Reputation: That earned anonymous votes of confidence. (Search by tope ratings, either anonymous or not)

• References: That present a list of high-profile people who like the topes.

• Certification: That are inspected and certified by a third party.

• Social context: That are actively maintained, that is, for which improved versions are regularly available. That are implemented in a familiar language/platform. (Search by tope publication date and execution platform)

Proposed development environment
Enhanced plug-ins

Proposed work
Enhancing plug-ins

• Microsoft Excel
– Outlier finding: infer format on selected cells, run isa
– Assertions: run isa on selected cells
– Transformation: run trf on selected cells
– De-duplication: run eqc on selected cells, cluster the cells

• Microsoft Visual Studio.NET
– Input validation: run isa on form widget, show error message
– Input consistency: run trf on value if in wrong format

• Robofox
– Assertions: run isa on selected variable
– Transformation: run trf on selected variable

• In each, support basic editor topes & JavaScript topes

Proposed development environment
Normalization (“the tope who cried wolf”)

Proposed work
Normalization: Recognizing exceptions

• Tope creators might overlook values.

• From the standpoint of a tope format, these “normal” values are exceptional cases that need to be tolerated.

• Simple approach: Record a whitelist of exceptions

• More sophisticated: For each format, record exceptions, infer a format (new isa function), and average this function’s score with the raw function’s score

• Exceptional values can be incorporated into the tope in the local cache and/or, at EUP’s discretion, propagated to the repository of the tope’s master copy
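The two normalization approaches can be sketched as wrappers around a raw isa function. This is a rough illustration only: `infer_isa` is a trivial stand-in for a real format-inference algorithm (it just remembers the letter/digit shape of each recorded exception), and all function names and the sample `cmu_phone_isa` rule are hypothetical.

```python
import re
from typing import Callable, Iterable

Isa = Callable[[str], float]


def shape(value: str) -> str:
    """Coarse shape of a value: letters -> 'a', digits -> '0'."""
    return re.sub(r"\d", "0", re.sub(r"[A-Za-z]", "a", value))


def with_whitelist(raw_isa: Isa, exceptions: Iterable[str]) -> Isa:
    """Simple approach: recorded exceptions are always accepted."""
    accepted = set(exceptions)
    return lambda value: 1.0 if value in accepted else raw_isa(value)


def infer_isa(exceptions: Iterable[str]) -> Isa:
    """Stand-in for format inference: accept values shaped like any exception."""
    shapes = {shape(v) for v in exceptions}
    return lambda value: 1.0 if shape(value) in shapes else 0.0


def with_inferred_format(raw_isa: Isa, exceptions: Iterable[str]) -> Isa:
    """More sophisticated approach: average the inferred and raw scores."""
    inferred = infer_isa(exceptions)
    return lambda value: (raw_isa(value) + inferred(value)) / 2.0


def cmu_phone_isa(value: str) -> float:
    # Hypothetical raw format that overlooks the "x8-1234" style.
    return 1.0 if re.fullmatch(r"\d-\d{4}", value) else 0.0
```

Note that a literal average also lowers the score of values that only the raw function accepts, so this sketch is one reading of the averaging idea rather than a settled design.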

Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule

Expressiveness: evaluation on examples

Use by EUPs: evaluation in controlled experiments

Flexibility: evaluation through analyses

Sharability: field testing + analyses

Thesis

The proposed data model and development environment will enable end-user programmers to implement and share custom abstractions for flexibly recognizing, transforming and equivalence-testing values in categories of short, human-readable data.

The model and environment will help end-user programmers to more quickly and correctly validate and reuse data than is possible through currently practiced methods.

Expressiveness is needed

• Claim: End users’ tasks commonly involve categories of short, human-readable data that appear in multiple formats, and that users recognize and test for equivalence in a fuzzy manner.

• Using contextual inquiry and interview data, identify and characterize examples of these data categories.

Expressiveness is provided

• Claim: The operators and constructs supported by the basic editor are expressive enough for creating topes for data categories in common tasks.

– We’ll create topes for data categories in four tasks similar to those that we saw in our prior studies:

– 1 “graduated response” validation task in web application

– 1 web macro task

– 1 outlier finding task in spreadsheet

– 1 data de-duplication task in spreadsheet

EUPs can create topes

• Claim: Given a suitable development environment, EUPs can create custom software abstractions for recognizing, transforming and equivalence-testing values in commonly occurring data categories.

• Evaluate with controlled experiment (with CMU staff):

– Create topes for data categories in sample tasks

– Within-subjects, we may have subjects use a comparison method

• Eg: Lapis for isa, manual for trf, Excel formulas for eqc

• Measure time-on-task and error rates

EUPs can benefit from using topes

• Claim: Extending existing programming tools with these abstractions enables EUs and EUPs to more quickly and correctly validate and reuse data than is possible through currently practiced methods.

• Evaluate with controlled experiment (with CMU staff):

– Provide subjects with appropriate topes

– Have them perform the sample tasks, using plug-ins

– Within-subjects, we may have subjects use a comparison method

• Eg: JavaScript, manual performance, Lapis, Excel formulas

• Measure time-on-task and error rates

Recognition, equivalence-testing, and exception-handling are flexible

• Claim: The abstractions created by EUPs flexibly capture the fuzziness of data recognition and equivalence-testing, and flexibly adapt at runtime when validating exceptional inputs.

• Evaluate with analyses:

– Take topes created by EUPs in experiments

– Run them on test data from EUSES spreadsheet corpus

– Based on manual annotation of test data, score the topes

– Evaluate the normalization algorithms: which works best?

• Measure topes’ precision/recall, compare to Lapis scores
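Scoring a tope against manually annotated data might be computed as in the sketch below. Because isa returns a value in [0,1], some cutoff is needed to decide what counts as "accepted"; the `threshold` parameter (default 0.5) and the sample phone-number isa function are assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Tuple


def precision_recall(isa: Callable[[str], float],
                     labeled_values: Iterable[Tuple[str, bool]],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Compare isa decisions (score >= threshold) with manual annotations."""
    tp = fp = fn = 0
    for value, is_valid in labeled_values:
        accepted = isa(value) >= threshold
        if accepted and is_valid:
            tp += 1
        elif accepted:
            fp += 1
        elif is_valid:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def phone_isa(value: str) -> float:
    # Hypothetical binary isa function for one phone-number format.
    return 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", value) else 0.0


# Hypothetical annotated sample: (value, annotator says it is a valid phone number).
SAMPLE = [
    ("412-555-1212", True),
    ("555-1212", True),
    ("412-555-121A", False),
    ("000-000-0000", False),
]
```

Sweeping the threshold over several values would show how the fuzzy scores trade precision against recall.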

EUPs can share topes

• Claim: Given a suitable development environment operating on meta-information about these abstractions, EUPs can share abstractions with one another.

• Evaluate through field testing
– Create an installer for plug-ins and basic topes editor
– Recruit CMU grad students and staff to use it for 3 months
– Log user actions (eg: published topes, queries, downloads)
– Record (and answer) frequently asked questions
– Periodic surveys
• Which features do EUPs consider helpful (or need work)?
• Which sources of “trust” evidence are actually helpful?

introduction ● related work ● studies ● prototype ● proposed work ● evaluation ● summary

6161

Performance is scalable

• Claim: The latency and throughput of operations do not become burdensome as EUPs share numerous abstractions with one another.

• Evaluate with analyses:

– Logs provide sample queries

– Measure execution time of queries on sample tope sets

– Perform algorithmic analysis of the search algorithms

• Combining execution time with algorithmic analysis yields a rough estimate of scalability
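One way to combine measured execution times with algorithmic analysis is to time the same query on tope sets of increasing size and fit a growth exponent on a log-log scale. The sketch below assumes a hypothetical linear-scan `linear_search` as the search routine under test; the real repository search is not specified here.

```python
import math
import time

def growth_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests roughly linear growth, near 2 quadratic, etc.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def time_query(search, tope_set, query, repeats=5):
    """Best-of-N wall-clock time for one query, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(tope_set, query)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-9)  # avoid log(0) on very fast runs

def linear_search(tope_set, query):
    """Hypothetical search routine: substring scan over tope names."""
    return [name for name in tope_set if query in name]

# Time one sample query on synthetic tope sets of growing size.
sizes = [1_000, 10_000, 100_000]
times = [
    time_query(linear_search, [f"tope{i}" for i in range(n)], "tope42")
    for n in sizes
]
print("estimated growth exponent:", growth_exponent(sizes, times))
```

The fitted exponent is only a rough estimate: it should be read alongside the algorithmic analysis rather than in place of it, since timings at small sizes are noisy.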


Outline

• Introduction

• Related work

• Exploratory studies

• Prototype

• Proposed work

• Evaluation

• Summary and schedule


3 knowledge contributions

5 technical contributions

20 months


Knowledge contributions

• Characterization of the fuzzy, multi-format categories of data commonly involved in end-user programming

• Lightweight data model (isa/trf/eqc) for representing these data categories

• A list of sources of evidence that help EUPs share abstractions


Primary technical contributions

• Algorithms

– For validating, transforming, and equivalence-testing data based on formats implemented by EUPs

– For generating targeted error messages

– For search-by-example

– For collecting and searching on context words

– For normalization and format inference


Green = implementation Blue = evaluation Purple = dissertation

Intended schedule: 20 months


Editor and plug-in support for trf and eqc (3 mo)

Evaluate with examples, experiments, analyses (3 mo)

Addl. editor and plug-in enhancements (3 mo)

Implement repository (5 mo)

Evaluate sharability & scalability (3 mo)

Dissertation (3 mo)


Referenced papers

Conference papers

[1] C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems (ICEIS'07), 2007, to appear.

[2] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, C. Jin. Red Opal: Product-Feature Scoring from Reviews. Proceedings of 8th ACM Conference on Electronic Commerce (ACMEC'07), 2007, to appear.

[3] C. Scaffidi, A. Cypher, S. Elbaum, A. Koesnandar, and B. Myers. Scenario-Based Requirements for Web Macro Tools. Submitted for publication, 2007.

[4] C. Scaffidi, A. Ko, B. Myers, M. Shaw. Dimensions Characterizing Programming Feature Usage by Information Workers. VL/HCC'06: Proceedings of the 2006 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 59-62, 2006.

[5] C. Scaffidi, M. Shaw, and B. Myers. Estimating the Numbers of End Users and End User Programmers. VL/HCC'05: Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 207-214, 2005.

Other papers

[6] C. Scaffidi, B. Myers, M. Shaw. The Topes Format Editor and Parser, Technical Report CMU-ISRI-07-104, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2007.

[7] C. Scaffidi, B. Myers, and M. Shaw. Trial By Water: Creating Hurricane Katrina "Person Locator" Web Sites. In Leadership at a Distance: Research in Technologically-Supported Work (S. Weisband, ed), Lawrence Erlbaum, pp. 209-222, 2007.

[8] C. Scaffidi, M. Shaw. Toward a Calculus of Confidence. First International Workshop on the Economics of Software and Computation, co-located with ICSE'07, 2007, to appear.

[9] C. Scaffidi, M. Shaw, B. Myers. Games Programs Play: Obstacles to Data Reuse, 2nd Workshop on End User Software Engineering (WEUSE), 2006.


Thank You…

• …to many people for helpful suggestions

• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)

• …to my wife, and to the Lord, for emotional support


Marwan Abi-Antoun Margaret Burnett Martin Erwig Andy Ko Mary Beth Rosson

Robin Abraham Owen Cheng George Fairbanks Thomas LaToza Mary Shaw

Matt Bass Ciera Christopher Thomas Green Alon Lavie Jeff Stylos

Nels Beckman Michael Coblenz Josh Gross Henry Lieberman Dean Sutherland

Kevin Bierhoff Allen Cypher Greg Hartman Larry Maccherone Steve Tanimoto

Alan Blackwell Uri Dekel Jim Herbsleb Brad Myers Susan Wiedenbeck

Barry Boehm Sebastian Elbaum John Hosking John Pane


Contextual inquiry: What are the problems of EUs and EUPs?

• Admin assistants and managers performed tasks in browsers and/or spreadsheets for the entire observation.

• Tasks required copying data among web forms and/or spreadsheets.

– E.g.: using a government web site to look up an appropriate per diem rate based on a locality (City, ST) and a date (MM/DD/YYYY) in an expense report


We considered helping them automate their tasks by creating web macro programs.

But existing tools cannot perform needed data transformations


E.g.: Selecting the year based on the date (MM/DD/YYYY) and selecting the state based on the locality (City, ST)
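As an illustration of the two transformations named above, the following sketch extracts the year from a date in MM/DD/YYYY form and the state from a locality in "City, ST" form. The function names `year_of` and `state_of` are hypothetical, and the sketch handles only these two exact formats; it is not how any particular web macro tool works.

```python
import re

# Assumed input formats, taken from the slide: MM/DD/YYYY and "City, ST".
DATE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")
LOCALITY = re.compile(r"\s*(.+?),\s*([A-Z]{2})\s*")

def year_of(date_text):
    """Return the four-digit year of an MM/DD/YYYY date, or None if malformed."""
    match = DATE.fullmatch(date_text.strip())
    return match.group(3) if match else None

def state_of(locality_text):
    """Return the two-letter state of a 'City, ST' locality, or None if malformed."""
    match = LOCALITY.fullmatch(locality_text)
    return match.group(2) if match else None

year = year_of("05/08/2007")
state = state_of("Pittsburgh, PA")
```

Inputs in any other format (for example "May 8, 2007" or "Pittsburgh") return None here, which is the kind of brittleness that motivates multi-format abstractions.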


Proposed work: Searching for topes – by example

• Overview

– Required meta-information:

• Published topes can include positive/negative examples (e.g.: “EDSH 225” matches this format)

• Tope users can also post examples, with ratings & comments

– Generalize these examples to a format signature

• Required algorithm is similar to existing format inference but slightly more coarse (e.g.: “[a-z]{2-5} [0-9]{2-4}”)

• To search by example:

1. Specify some examples of the desired tope

2. Repository generalizes these examples to a signature

3. Repository returns topes with a similar signature
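The three search steps can be sketched as follows. This is a simplified stand-in, not the proposed inference algorithm: it assumes a signature is a sequence of character-class runs (letters, digits, or a literal character) with minimum and maximum run lengths, and treats two signatures as similar when the classes agree and the length ranges overlap. All names and the sample repository are hypothetical.

```python
from itertools import groupby

def char_class(ch):
    """Coarse class of a character: letter, digit, or the literal itself."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "9"
    return ch

def shape(example):
    """Run-length encode an example, e.g. 'EDSH 225' -> [('a',4), (' ',1), ('9',3)]."""
    return [(cls, len(list(run))) for cls, run in groupby(example, key=char_class)]

def signature(examples):
    """Generalize examples to (class, min_len, max_len) runs; None if they disagree."""
    if not examples:
        return None
    shapes = [shape(e) for e in examples]
    classes = [cls for cls, _ in shapes[0]]
    if any([cls for cls, _ in s] != classes for s in shapes):
        return None
    return tuple(
        (cls, min(s[i][1] for s in shapes), max(s[i][1] for s in shapes))
        for i, cls in enumerate(classes)
    )

def similar(sig_a, sig_b):
    """Signatures are similar if classes match and every length range overlaps."""
    if sig_a is None or sig_b is None or len(sig_a) != len(sig_b):
        return False
    return all(
        ca == cb and lo_a <= hi_b and lo_b <= hi_a
        for (ca, lo_a, hi_a), (cb, lo_b, hi_b) in zip(sig_a, sig_b)
    )

def search_by_example(repository, examples):
    """Return names of topes whose published examples generalize similarly."""
    query = signature(examples)
    return [
        name for name, published in repository.items()
        if similar(query, signature(published))
    ]

# Hypothetical repository: tope name -> published positive examples.
repository = {
    "course number": ["EDSH 225", "BIO 1010"],
    "zip code": ["15213", "90210"],
}
matches = search_by_example(repository, ["MATH 221", "CS 100"])
```

Because letter case is ignored and only run lengths are kept, the signature is deliberately coarse, so it tolerates examples the searcher phrases slightly differently from the publisher.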


Proposed work: Searching for topes – in groups of topes

• Overview

– People with one tope in common probably have other topes in common (e.g.: medical staff, CMU students, etc.)

– Approach: cluster topes based on who creates/uses them

– Many algorithms exist for this kind of problem (e.g.: hierarchical agglomerative clustering, HAC)

• Searching by tope group:

1. The person searching has already used a few topes

2. Return topes that are in the same clusters
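A minimal sketch of this approach, assuming each tope is described by the set of people who create or use it: topes are merged bottom-up (single-linkage agglomerative clustering on Jaccard similarity of user sets) until no pair of clusters is similar enough, and a search returns the other topes in the clusters the person already uses. The threshold, function names, and sample data are hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Overlap of two user sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_topes(users_by_tope, threshold=0.5):
    """Single-linkage agglomerative clustering of topes by shared users."""
    clusters = [{tope} for tope in users_by_tope]

    def linkage(c1, c2):
        # Single linkage: similarity of the most similar cross-cluster pair.
        return max(
            jaccard(users_by_tope[t1], users_by_tope[t2])
            for t1 in c1 for t2 in c2
        )

    while len(clusters) > 1:
        best_score, (i, j) = max(
            (
                (linkage(clusters[i], clusters[j]), (i, j))
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda item: item[0],
        )
        if best_score < threshold:
            break                      # no remaining pair is similar enough
        clusters[i] |= clusters[j]     # merge the closest pair (j > i)
        del clusters[j]
    return clusters

def recommend(users_by_tope, already_used, threshold=0.5):
    """Return topes sharing a cluster with any tope the person already used."""
    suggestions = set()
    for cluster in cluster_topes(users_by_tope, threshold):
        if cluster & already_used:
            suggestions |= cluster - already_used
    return suggestions

# Hypothetical usage data: tope name -> set of people who create/use it.
users_by_tope = {
    "patient id": {"ann", "bob"},
    "drug dosage": {"ann", "bob", "cy"},
    "course number": {"dee", "eli"},
    "andrew id": {"dee", "eli"},
}
suggested = recommend(users_by_tope, {"patient id"})
```

This brute-force version recomputes all pairwise linkages on every merge, which is acceptable for a sketch but would need a more efficient clustering routine for a large repository.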


Proposed work: Searching for topes – by keyword

• Keywords can occur in tope name or description

• Keywords can occur contextually:

1. EUP identifies the field where the tope will be used

– E.g.: a spreadsheet cell, or a web form widget

2. The programming tool plug-in looks for nearby words

– E.g.: top of spreadsheet column, left end of spreadsheet row, labels above form widget, or form widget’s HTML name

3. With user’s permission, these are sent to repository

– As meta-information, when publishing

– As a query, when searching

• Adapt algorithm for finding products based on features? [2]
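For the spreadsheet case, collecting context words and using them as a query might look like the sketch below. It assumes the spreadsheet is a list of rows, takes the text at the top of the selected cell's column and at the left end of its row, and ranks topes by how many context words appear as substrings of their name or description. The functions and the sample repository are hypothetical; this is not the product-feature algorithm of [2].

```python
import re

def context_words(grid, row, col):
    """Collect lowercase words from the column header and row label of a cell."""
    candidates = [grid[0][col], grid[row][0]]   # top of column, left end of row
    words = []
    for text in candidates:
        for word in re.findall(r"[A-Za-z]+", str(text)):
            word = word.lower()
            if word not in words:               # keep first occurrence only
                words.append(word)
    return words

def keyword_search(repository, words):
    """Rank topes by number of context words found in name or description."""
    scored = []
    for name, description in repository.items():
        text = (name + " " + description).lower()
        hits = sum(1 for word in words if word in text)   # substring match
        if hits:
            scored.append((hits, name))
    return [name for hits, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical expense-report spreadsheet and tope repository.
grid = [
    ["", "Date", "Locality"],
    ["Trip 1", "05/08/2007", "Pittsburgh, PA"],
]
repository = {
    "US locality": "city and state, such as Pittsburgh, PA",
    "calendar date": "month, day, and year",
}
words = context_words(grid, 1, 2)
results = keyword_search(repository, words)
```

In a real plug-in the collected words would be sent to the repository only with the user's permission, as the slide states; the sketch keeps everything local.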
