Page 1: From Words to Queries: The Right Tool at the Right Time Stephanie W. Haas December 13, 2001 stephani@ils.unc.edu

From Words to Queries:

The Right Tool at the Right Time

Stephanie W. Haas

December 13, 2001

stephani@ils.unc.edu

What’s wrong with these queries?

• What is the average income of police officers?

• What is the employment rate of 34-year-old webmasters?

• Where can I find statistics broken out by industry?

Nothing, but...

What’s wrong with these queries?

1. What is the average income of police officers?

2. What is the employment rate of 34-year-old webmasters?

3. Where can I find statistics broken out by industry?

It’s a big jump from asking the question to finding the answer.

user questions

LABSTAT?

Do users recognize when there is a gap?

Do they know how to fill it?

How can we help?

Ambiguity meets technical distinctions: income

• What do we mean by income?
  – all money an individual acquires
  – compensation associated with a job
  – wage or salary

• The BLS distinguishes among these meanings.
  – 3 different questions
  – 3 different answers

Right concept, wrong value: the age of webmasters

An appropriate question for the BLS,

AGE + OCCUPATION → EMPLOYMENT RATE

but the values don’t directly correspond to the available choices.
  – age range instead of specific age
  – job title isn’t used
  – occupation is too specific, doesn’t correspond to SOC
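The first mismatch (a specific age where only age ranges are offered) can be sketched as a simple lookup. This is a minimal illustration, not the BLS's method: the bracket boundaries and the function name `age_to_range` are hypothetical, since the slides do not list the actual age choices.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the age brackets. The real choices depend
# on the survey/series and are not given in the slides.
BRACKET_STARTS = [16, 20, 25, 35, 45, 55, 65]


def age_to_range(age):
    """Return the label of the (hypothetical) bracket that contains `age`."""
    if age < BRACKET_STARTS[0]:
        raise ValueError("age is below the youngest bracket")
    i = bisect_right(BRACKET_STARTS, age) - 1
    low = BRACKET_STARTS[i]
    if i + 1 == len(BRACKET_STARTS):
        return f"{low} and over"
    high = BRACKET_STARTS[i + 1] - 1
    return f"{low}-{high}"


print(age_to_range(34))  # prints 25-34
```

An aid built this way can tell the user which available range covers the value they asked about, rather than silently rejecting "34".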

Too many choices, how do I decide? industry

• 18 surveys/series that break out information by industry on Selective Access.
  – which one is best for your question?
  – once you’ve found it, which variable(s) concern industry?
  – once you’ve found it (or them), which value(s) should you select?
  – do your choices interact?

Basic model of information seeking/retrieval

[Diagram: question → (express) → user vocabulary → (map to) → BLS concepts → (appear in) → survey/series → (formulate) → query → (return) → result]

[Diagram: question → (express) → user vocabulary → (map to) → BLS concepts → (appear in) → survey/series → (formulate) → query → (return) → result]

Feedback based on result -- User can change decisions.

Progression Model

[Diagram: user vocabulary → (map to) → BLS concepts → (appear in) → survey/series → (formulate) → query. Word - concept aids sit at the first step, concept - survey/series aids at the second, and survey/series - query aids at the third.]

What kind of aids can we provide at each decision point?

Components of the Model

• Definition, role of each.

• Examples based on the industry concept.

• Note: additional concepts in question may complicate provision of aid, e.g., in selecting survey/series.

• Although model shows the aids deployed at specific points in the process, they should probably be available for consultation at all times.

User Vocabulary

• Where do people learn the words they use in their queries?
  – media, job, school, friends...

• Haas & Hert (2001) Finding information at the BLS: Overcoming the barriers of scope, concept, and language mismatch.

• General words for industry, e.g., business, corporation, company, employer

• Specific industry types, e.g., restaurants, insurance companies, textile mills

• Company names

BLS Concepts

• Defined by actions, processes, formal definitions.
  – data collection, statistical manipulation, publication

• May or may not correspond with “general” definition of concept, e.g., full-time.

• Related concepts may be needed to bridge between user question and BLS information.

• industry

• See also division, ownership, establishment, establishment size

Survey/Series

• Where should the user “aim” the query?

• Examples focus on LABSTAT, but can be expanded to include all intermediate and final products, publications.

• 18 survey/series include industry-related information.

• Best choice depends on purpose of question, other information needed, e.g., industry and unemployment, or industry and occupational injuries.

Word - Concept Aids

Help users select BLS concepts that best correspond to concepts in initial question.

• General definition of concept in non-technical terms

• Ambiguity resolution
  – plumbing - industry or occupation?
  – sector - industry group or ownership?

• Synonyms and near-synonyms
  – sector, product group
  – variable names from surveys/series
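The ambiguity-resolution aid above can be sketched as a table of candidate meanings. This is only an illustration under stated assumptions: the two entries come from the slide, while the dictionary layout and the function names are hypothetical.

```python
# Candidate BLS concepts for ambiguous user words, taken from the two
# examples on the slide. A real table would be much larger.
AMBIGUOUS_TERMS = {
    "plumbing": ["industry", "occupation"],
    "sector": ["industry group", "ownership"],
}


def candidate_concepts(word):
    """Return the concepts a word might refer to; empty if not listed."""
    return AMBIGUOUS_TERMS.get(word.lower(), [])


def needs_clarification(word):
    """True when the user should be asked which meaning is intended."""
    return len(candidate_concepts(word)) > 1


for w in ("plumbing", "sector", "wages"):
    print(w, candidate_concepts(w), needs_clarification(w))
```

When `needs_clarification` is true, an interface could show the candidates and let the user pick, which matches the slide's framing of the aid as a question ("industry or occupation?").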

• Examples of queries that can be asked involving the concept (“Ask Jeeves” or FAQ)
  – How many lost workdays due to injury were there in the construction industry last year?

• Examples showing the range of possible values, or to demonstrate a classification

Division G: Retail Trade

Major Group 56: Apparel and Accessory Stores

Industry Group 566: Shoe Stores

5661 Shoe Stores
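The classification example above follows the SIC convention that the first two digits of a 4-digit code give the major group and the first three give the industry group. A minimal sketch of that decomposition follows; the function name `sic_levels` is hypothetical, and the division table is deliberately partial, listing only Division G (major groups 52 to 59) from the slide's example.

```python
# Only the division from the slide's example is listed; a complete table
# would cover every SIC division.
DIVISIONS = {"G": ("Retail Trade", range(52, 60))}


def sic_levels(code):
    """Split a 4-digit SIC code into the levels shown on the slide."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected a 4-digit SIC code")
    major = int(code[:2])
    division = next(
        (letter for letter, (_, groups) in DIVISIONS.items() if major in groups),
        None,  # division is not in this partial table
    )
    return {
        "division": division,
        "major_group": code[:2],
        "industry_group": code[:3],
        "industry": code,
    }


print(sic_levels("5661"))
```

For "5661" this yields division G, major group 56, industry group 566, and industry 5661, the same path the slide shows for Shoe Stores.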

• Automatic parsing of questions to identify concepts

SIZE + ESTABLISHMENT → ESTABLISHMENT SIZE

SIZE {size, small, midsize, medium, large, giant...}

ESTABLISHMENT {establishment, company, firm, business, organization, employer...}

• Thesaurus browsing to view related concepts, relationships
  – industry -- division, sector, product
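The parsing idea on this slide can be sketched with plain set matching, reading the SIZE + ESTABLISHMENT line as a rule that combines two detected concepts into ESTABLISHMENT SIZE. This is an assumption-laden sketch, not the parser the talk refers to: the word sets are the partial lists from the slide, the function name `find_concepts` is hypothetical, and matching is on exact lowercase words with no stemming (so "firms" would not match "firm").

```python
import re

# Word sets from the slide. The slide's lists end in "...", so these are
# partial by design.
CONCEPT_WORDS = {
    "SIZE": {"size", "small", "midsize", "medium", "large", "giant"},
    "ESTABLISHMENT": {"establishment", "company", "firm", "business",
                      "organization", "employer"},
}

# Combination rule as read from the slide.
COMBINATIONS = {frozenset({"SIZE", "ESTABLISHMENT"}): "ESTABLISHMENT SIZE"}


def find_concepts(question):
    """Return the set of concepts suggested by the words in `question`."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    found = {name for name, words in CONCEPT_WORDS.items() if tokens & words}
    for parts, combined in COMBINATIONS.items():
        if parts <= found:
            found = (found - parts) | {combined}
    return found


print(find_concepts("How many workers does a large company employ?"))
```

Keeping the word sets and the combination rules as data means the lists can grow as new user vocabulary is observed, without changing the matching code.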

• Links to references
  – Authoritative classifications, e.g., SIC, NAICS
  – Browsable tree diagrams of classification structures
  – Crosswalks linking familiar categorizations to authoritative ones, e.g., Yahoo B2B directory or Consumer Yellow Pages

• Scope notes explicitly describe meaning or limitations of coverage in BLS domain

• Links to BLS documentation (Warn user if highly technical in nature.)

Concept - Survey/Series Aids

Help users identify survey/series that are relevant to the concept(s) in their questions.

• Selection of survey/series that contain the concept

• Summary of information available in each survey/series, i.e., what kinds of questions can it answer

• List of variables available in each, to help user understand the context in which concepts are presented

• Highlight variables related to the concept

• Scope notes describing restrictions on concept as represented in each survey/series

• Annotated links to survey/series-specific documentation

• Links to other sources of related information, e.g., other federal agencies

Labor Force Statistics from the Current Population Survey

Summary. This series presents information about the size and makeup of the U.S. labor force, which can be broken out by several demographic variables.

Definition and Scope. Industry covers all industries, including private households, at the division, division combination, and 1-, 2-, and 3-digit SIC level.

Variables. Seasonal Adjustment, Age, Sex, Race, Ethnicity, Occupation,

class of worker - (sector in which individual works),

status - (portion of labor force included),

industry - (division or industry in which individual works).

Other Sources. Current Population Survey, Bureau of the Census.

Survey/Series - Query Aids

Help users formulate query once they have chosen a survey/series.

• Lists of relevant variables and their values

• Constraints or interactions between variable-value choices

• Notes on query results, e.g., preliminary data

• Links to references or documentation

• Query construction interfaces

Labor Force Statistics from the Current Population Survey

Industry-related variables and values.

class of worker: N/A, Wage and Salary Workers, Private Wage and Salary Workers, Government Wage and Salary Workers, Self-employed Workers, ...

status: Civilian Labor Force, Total Labor Force (Includes Total Armed Forces), Full-time Labor Force, Part-time Labor Force, Armed Forces, ...

industry: Private households, Nonagriculture goods producing industries, Service producing industries, Nonagricultural industries, Construction, Manufacturing, ... Also 3-digit SIC

Constraints or Interactions. To get choices of detailed SICs, status = civilian labor force, other variables = null
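A constraint like the one above could be checked before the interface offers detailed SIC choices. The sketch below is hypothetical in its details: the dictionary layout (variable name mapped to chosen value, with `None` for "no value chosen") and the function name are assumptions, and it treats `industry` itself as exempt from the "other variables = null" condition.

```python
def detailed_sic_offered(selection):
    """Apply the slide's rule: status must be civilian labor force and
    every other variable (apart from industry itself) must be unset."""
    status = selection.get("status") or ""
    if status.lower() != "civilian labor force":
        return False
    return all(
        value is None
        for name, value in selection.items()
        if name not in ("status", "industry")
    )


ok = {"status": "Civilian Labor Force", "class of worker": None}
blocked = {"status": "Civilian Labor Force",
           "class of worker": "Self-employed Workers"}
print(detailed_sic_offered(ok), detailed_sic_offered(blocked))
```

Surfacing the failed condition to the user, rather than just hiding the detailed choices, would be one way to explain why a selection is unavailable.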

Organizing the Aid Information

• LABSTAT Crosswalk (LSC) (Haas 2000)
  – link user vocabulary with BLS concepts, terms and resources

• Matrix Model
  – link BLS concepts to surveys/series and variables

• Primarily intended as back-end resources, but could be adapted for direct interaction.

LABSTAT Crosswalk

4-column table, organized by concept

• Column 1 - general language words and phrases

• Column 2 - corresponding BLS terms

• Column 3 - concept associated with terms

• Column 4 - resources in which terms or concepts may be found.

LSC Example

General language words and phrases | BLS terms | Concept | Resources
industry, business, company... | industry | general industry | survey/series, SIC, NAICS, definitions as used in BLS
retail establishments, restaurants, textile mills... | sector, industry | types of industry | survey/series, SIC, NAICS
company size, small businesses, retail giant | size, establishment size | establishment size | survey/series, definitions
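As a back-end resource, the crosswalk could be held as one record per row with a lookup on column 1. The sketch below uses the three example rows from this slide; the record type `CrosswalkRow` and the function `look_up` are hypothetical names, and the lookup is an exact, case-insensitive match only.

```python
from typing import NamedTuple


class CrosswalkRow(NamedTuple):
    words: tuple      # column 1: general language words and phrases
    bls_terms: tuple  # column 2: corresponding BLS terms
    concept: str      # column 3: concept associated with the terms
    resources: tuple  # column 4: where the terms or concepts may be found


ROWS = [
    CrosswalkRow(("industry", "business", "company"),
                 ("industry",), "general industry",
                 ("survey/series", "SIC", "NAICS",
                  "definitions as used in BLS")),
    CrosswalkRow(("retail establishments", "restaurants", "textile mills"),
                 ("sector", "industry"), "types of industry",
                 ("survey/series", "SIC", "NAICS")),
    CrosswalkRow(("company size", "small businesses", "retail giant"),
                 ("size", "establishment size"), "establishment size",
                 ("survey/series", "definitions")),
]


def look_up(phrase):
    """Return every crosswalk row whose column 1 contains `phrase`."""
    key = phrase.strip().lower()
    return [row for row in ROWS if key in row.words]


for row in look_up("Small businesses"):
    print(row.concept, row.bls_terms, row.resources)
```

Returning whole rows lets a front end show the BLS terms, the concept, and the resources together for whichever word the user typed.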

• Column 1 vocabulary gathered from variety of sources.
  – BLS experts and advisors
  – End user queries and email
  – Published and broadcast information sources

• Requires domain expertise to fill in columns 2-4.

• Major concept groupings harmonize with Liddy & Liddy (2001) query grammar.

Matrix Model

• Identifies surveys/series in which concept-related information can be found.

• Rows list concepts, e.g., race, gender, occupation, industry.

• Columns list surveys/series.

• Function is similar to Wages by Area and Occupation (http://bls.gov/blswage.htm).

Partial Matrix

Columns: Mass Layoff Statistics; Covered Employment and Wages; National Employment, Hours, and Earnings

Sex X X

Race X

Geographical area X X

Industry X X X

Occupation X

When the user selects Covered Employment, irrelevant rows and columns disappear. Ownership and size rows appear; these concepts are related to industry.

Concept | Covered Employment and Wages
Geographical area | X
Industry | X
Ownership | X
Size | X
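The filtering behaviour described here can be sketched by storing the matrix as a mapping from concept to the surveys/series that carry it. Only cells that can be read unambiguously from the slides are included (the Covered Employment and Wages column, plus Industry in all three surveys); the names `MATRIX`, `rows_for`, and `surveys_for` are hypothetical.

```python
MLS = "Mass Layoff Statistics"
CEW = "Covered Employment and Wages"
NEHE = "National Employment, Hours, and Earnings"

# Concept -> surveys/series in which it appears. Cells whose column could
# not be recovered from the slides are left out rather than guessed.
MATRIX = {
    "Geographical area": {CEW},
    "Industry": {MLS, CEW, NEHE},
    "Ownership": {CEW},
    "Size": {CEW},
}


def rows_for(survey):
    """Return the concept rows that remain once `survey` is selected."""
    return [concept for concept, surveys in MATRIX.items() if survey in surveys]


def surveys_for(concept):
    """Return the surveys/series that contain `concept`, sorted by name."""
    return sorted(MATRIX.get(concept, set()))


print(rows_for(CEW))
print(surveys_for("Industry"))
```

The same structure answers both directions of the question: which concepts a survey covers, and which surveys cover a concept.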

Matrix shows variable names and definitions for each row.

Concept | Covered Employment and Wages | Variable and definition
Geographical area | X | Area: Indicates the area for which data were reported.
Industry | X | Industry: Indicates the industry for which the data were reported.
Ownership | X | Ownership: Indicates the ownership sector for which the data were reported.
Size | X | Size: Indicates the establishment size for which data were reported.

Matrix lists or describes values for each variable.

Concept | Covered Employment and Wages | Values
Geographical area | X | States, Counties, Metropolitan areas.
Industry | X | Divisions, 1- or 4-digit SIC code, International establishments, Unclassifiable establishments
Ownership | X | Total covered, Federal government, State government, Local government, Private
Size | X | All establishment sizes

Development Issues

• LSC
  – need for monitoring user vocabulary, “labor world”
  – distinguish “fads” from real changes

• Matrix
  – primarily 1-time development effort
  – track changes in surveys/series, variables, etc.

• Can supply information for many aids.

When to develop, when to reuse?

What resources already exist?

• Need to recognize potential sources among BLS documents.

• Point to existing sources (e.g., HOM definitions) from LSC, Matrix.

• May need to edit or polish existing sources for end user audience.

• Need to incorporate disparate sources into coherent structure of user aids.

Incremental Development/Adaptation

• Start with what already exists.

• Focus on most common user problems or needs, e.g., Wage by Area & Occupation.

• Focus on “big” concepts, e.g., industry, occupation
  – found in many places, therefore more decisions to make
  – appear in many user questions

Maintenance Issues

• It is crucial that user aids be accurate and current. Faulty help is worse than none at all.

• Incorporate maintenance into regular job responsibilities -- can’t be considered “extra”.

• Coordinate maintenance of shared resources.

Low Maintenance Aids

• Closely associated with changes to survey/series and related products.
Example: change in ethnicity categories eventually changes values for ethnicity variables.
  – concept definitions, synonyms
  – example queries, authoritative classifications
  – survey/series descriptions, scope notes
  – lists of variables and values
  – links to BLS documentation

High Maintenance Aids

• Require constant monitoring of environment, even though actual changes may be few.
  – ambiguity resolution
  – automatic parsing
  – links to outside references such as Yahoo directories
  – links to sources of information outside BLS scope

Effort

Development

Adaptation

Maintenance

Benefits

Better information services

Improved user interaction

User education

Summary

• Where are users’ decision points? That’s where they need information and guidance.

• What do they need to know to make good decisions?

• Frameworks for organizing various kinds of aids.

• What already exists? What can you adapt?

• Importance of coherent presentation of aids.

• Don’t underestimate maintenance.

Haas (2001). From Words to Concepts to Queries: Helping Users Find Series and Variables to Satisfy their Information Needs. http://ils.unc.edu/~stephani/bls/fin-rept-01.pdf
Contains extended examples of aids for 4 concepts: industry, ownership, occupation, white/blue collar.

Haas (2000). A Terminology Crosswalk for LABSTAT: Mapping General Language Words and Phrases to BLS Terms. http://ils.unc.edu/~stephani/bls/fin-rept-00.pdf

Haas & Hert (to appear) Finding information at the U. S. Bureau of Labor Statistics: Overcoming the barriers of scope, concept, and language mismatch. Terminology.

Liddy, E. & Liddy, J. (2001). An NLP Approach for improving Access to Statistical Information for the Masses. FCSM 2001 Research Conference.