From Words to Queries:
The Right Tool at the Right Time
Stephanie W. Haas
December 13, 2001
12/13/01 Haas, Words to Queries 2
What’s wrong with these queries?
• What is the average income of police officers?
• What is the employment rate of 34-year old webmasters?
• Where can I find statistics broken out by industry?
Nothing, but...
It’s a big jump from asking the question to finding the answer.
[Diagram: user questions and LABSTAT, separated by a gap]
Do users recognize when there is a gap?
Do they know how to fill it?
How can we help?
Ambiguity meets technical distinctions: income
• What do we mean by income?
  – all money an individual acquires
  – compensation associated with a job
  – wage or salary
• The BLS distinguishes among these meanings.
  – 3 different questions
  – 3 different answers
Right concept, wrong value: the age of webmasters
An appropriate question for the BLS,
AGE + OCCUPATION → EMPLOYMENT RATE
but the values don’t directly correspond to the available choices.
  – age range instead of specific age
  – job title isn’t used
  – occupation is too specific, doesn’t correspond to SOC
Too many choices, how do I decide? Industry
• 18 surveys/series that break out information by industry on Selective Access.
  – which one is best for your question?
  – once you’ve found it, which variable(s) concern industry?
  – once you’ve found it (or them), which value(s) should you select?
  – do your choices interact?
Basic model of information seeking/retrieval
[Diagram: question →(express)→ user vocabulary →(map to)→ BLS concepts →(appear in)→ survey/series →(formulate)→ query →(return)→ result]
[Diagram: the same model, with feedback from the result]
Feedback based on result: user can change decisions.
Progression Model
What kind of aids can we provide at each decision point?
[Diagram: user vocabulary →(map to)→ BLS concepts →(appear in)→ survey/series →(formulate)→ query]
• map to: word - concept aids
• appear in: concept - survey/series aids
• formulate: survey/series - query aids
Components of the Model
• Definition, role of each.
• Examples based on the industry concept.
• Note: additional concepts in question may complicate provision of aid, e.g., in selecting survey/series.
• Although model shows the aids deployed at specific points in the process, they should probably be available for consultation at all times.
User Vocabulary
• Where do people learn the words they use in their queries?
  – media, job, school, friends...
• Haas & Hert (2001) Finding information at the BLS: Overcoming the barriers of scope, concept, and language mismatch.
• General words for industry, e.g., business, corporation, company, employer
• Specific industry types, e.g., restaurants, insurance companies, textile mills
• Company names
BLS Concepts
• Defined by actions, processes, formal definitions.
  – data collection, statistical manipulation, publication
• May or may not correspond with “general” definition of concept, e.g., full-time.
• Related concepts may be needed to bridge between user question and BLS information.
• industry
• See also division, ownership, establishment, establishment size
Survey/Series
• Where should the user “aim” the query?
• Examples focus on LABSTAT, but can be expanded to include all intermediate and final products, publications.
• 18 survey/series include industry-related information.
• Best choice depends on purpose of question, other information needed, e.g., industry and unemployment, or industry and occupational injuries.
Word - Concept Aids
Help users select BLS concepts that best correspond to concepts in initial question.
• General definition of concept in non-technical terms
• Ambiguity resolution
  – plumbing - industry or occupation?
  – sector - industry group or ownership?
• Synonyms and near-synonyms
  – sector, product group
  – variable names from surveys/series
• Examples of queries that can be asked involving the concept (“Ask Jeeves” or FAQ)
  – How many lost workdays due to injury were there in the construction industry last year?
• Examples showing the range of possible values, or to demonstrate a classification
  Division G: Retail Trade
    Major Group 56: Apparel and Accessory Stores
      Industry Group 566: Shoe Stores
        5661 Shoe Stores
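The classification example above can be walked programmatically. This is a minimal sketch, assuming the major group and industry group are the 2- and 3-digit prefixes of the 4-digit code and that the division is looked up from the major group. The names `SIC_LABELS`, `DIVISION_OF_MAJOR_GROUP`, and `sic_path` are hypothetical, and the tables hold only the entries shown on the slide.

```python
# Hypothetical lookup table holding only the labels from the slide's example.
SIC_LABELS = {
    "G": "Retail Trade",
    "56": "Apparel and Accessory Stores",
    "566": "Shoe Stores",
    "5661": "Shoe Stores",
}

# Assumed mapping from 2-digit major group to division letter (one entry only).
DIVISION_OF_MAJOR_GROUP = {"56": "G"}


def sic_path(code: str) -> list[str]:
    """Return the classification path for a 4-digit SIC code, broadest level first."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a 4-digit SIC code, got {code!r}")
    major, group = code[:2], code[:3]
    division = DIVISION_OF_MAJOR_GROUP[major]
    return [
        f"Division {division}: {SIC_LABELS[division]}",
        f"Major Group {major}: {SIC_LABELS[major]}",
        f"Industry Group {group}: {SIC_LABELS[group]}",
        f"{code} {SIC_LABELS[code]}",
    ]


# Print the path as an indented tree, one level per line.
for depth, line in enumerate(sic_path("5661")):
    print("  " * depth + line)
```

A browsable aid would need the full classification loaded into these tables; codes missing from them raise a `KeyError` here rather than returning a guess.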
• Automatic parsing of questions to identify concepts
  SIZE + ESTABLISHMENT → ESTABLISHMENT SIZE
  SIZE → {size, small, midsize, medium, large, giant...}
  ESTABLISHMENT → {establishment, company, firm, business, organization, employer...}
• Thesaurus browsing to view related concepts, relationships
  – industry -- division, sector, product
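The parsing idea on this slide might be sketched as follows. The term sets and the SIZE plus ESTABLISHMENT rule come from the slide. The tokenizing, the rough plural handling, and the names `TERM_SETS`, `COMBINATIONS`, and `identify_concepts` are assumptions made for illustration, not a description of an actual BLS parser.

```python
import re

# Term sets from the slide; the trailing "..." there means they are not exhaustive.
TERM_SETS = {
    "SIZE": {"size", "small", "midsize", "medium", "large", "giant"},
    "ESTABLISHMENT": {"establishment", "company", "firm", "business",
                      "organization", "employer"},
}

# Combination rule from the slide: SIZE plus ESTABLISHMENT gives ESTABLISHMENT SIZE.
COMBINATIONS = {frozenset({"SIZE", "ESTABLISHMENT"}): "ESTABLISHMENT SIZE"}


def candidates(word: str):
    """Yield the word and rough singular forms (an assumed, simplistic stemmer)."""
    yield word
    if word.endswith("ies"):
        yield word[:-3] + "y"
    if word.endswith("es"):
        yield word[:-2]
    if word.endswith("s"):
        yield word[:-1]


def identify_concepts(question: str) -> set[str]:
    """Return the concepts whose terms appear in the question."""
    found = set()
    for word in re.findall(r"[a-z]+", question.lower()):
        for form in candidates(word):
            for concept, terms in TERM_SETS.items():
                if form in terms:
                    found.add(concept)
    # Replace co-occurring concepts with their combined concept.
    for parts, combined in COMBINATIONS.items():
        if parts <= found:
            found = (found - parts) | {combined}
    return found


print(identify_concepts("How many people work for small businesses?"))
```

Simple term matching like this cannot resolve ambiguous words (plumbing as industry or occupation), which is why it sits alongside the other word - concept aids rather than replacing them.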
• Links to references
  – Authoritative classifications, e.g., SIC, NAICS
  – Browsable tree diagrams of classification structures
  – Crosswalks linking familiar categorizations to authoritative ones, e.g., Yahoo B2B directory or Consumer Yellow Pages
• Scope notes explicitly describe meaning or limitations of coverage in BLS domain
• Links to BLS documentation (Warn user if highly technical in nature.)
Concept - Survey/Series Aids
Help users identify survey/series that are relevant to the concept(s) in their questions.
• Selection of survey/series that contain the concept
• Summary of information available in each survey/series, i.e., what kinds of questions can it answer
• List of variables available in each, to help user understand the context in which concepts are presented
• Highlight variables related to the concept
• Scope notes describing restrictions on concept as represented in each survey/series
• Annotated links to survey/series-specific documentation
• Links to other sources of related information, e.g., other federal agencies
Labor Force Statistics from the Current Population Survey
Summary. This series presents information about the size and makeup of the U.S. labor force, which can be broken out by several demographic variables.
Definition and Scope. Industry covers all industries, including private households, at the division, division combination, and 1-, 2-, and 3-digit SIC level.
Variables. Seasonal Adjustment, Age, Sex, Race, Ethnicity, Occupation,
class of worker - (sector in which individual works),
status - (portion of labor force included),
industry - (division or industry in which individual works).
Other Sources. Current Population Survey, Bureau of the Census.
Survey/Series - Query Aids
Help users formulate query once they have chosen a survey/series.
• Lists of relevant variables and their values
• Constraints or interactions between variable-value choices
• Notes on query results, e.g., preliminary data
• Links to references or documentation
• Query construction interfaces
Labor Force Statistics from the Current Population Survey
Industry-related variables and values.
• class of worker: N/A, Wage and Salary Workers, Private Wage and Salary Workers, Government Wage and Salary Workers, Self-employed Workers, ...
• status: Civilian Labor Force, Total Labor Force (Includes Total Armed Forces), Full-time Labor Force, Part-time Labor Force, Armed Forces, ...
• industry: Private households, Nonagriculture goods producing industries, Service producing industries, Nonagricultural industries, Construction, Manufacturing, ... Also 3-digit SIC
Constraints or Interactions. To get choices of detailed SICs, status = civilian labor force, other variables = null.
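A query aid could check this kind of constraint before the user submits a query. The sketch below assumes one reading of the note: status must be the civilian labor force, and every variable other than status and industry must be left unset. The function name `detailed_sic_available` and the dictionary representation of a selection are hypothetical.

```python
from typing import Optional

# Status value required by the slide's constraint (compared case-insensitively).
STATUS_FOR_DETAILED_SIC = "Civilian Labor Force"


def detailed_sic_available(selection: dict[str, Optional[str]]) -> bool:
    """Return True if the selection allows detailed SIC choices.

    Assumed reading of the constraint: status is the civilian labor force,
    and all variables other than status and industry are None (null).
    """
    status = selection.get("status")
    if status is None or status.lower() != STATUS_FOR_DETAILED_SIC.lower():
        return False
    return all(
        value is None
        for name, value in selection.items()
        if name not in ("status", "industry")
    )


allowed = {"status": "Civilian Labor Force", "class of worker": None, "industry": None}
blocked = {"status": "Civilian Labor Force", "class of worker": "Self-employed Workers"}
print(detailed_sic_available(allowed))
print(detailed_sic_available(blocked))
```

An interface could use the result to enable or grey out the detailed SIC list and to explain which earlier choice is blocking it.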
Organizing the Aid Information
• LABSTAT Crosswalk (LSC) (Haas 2000)
  – link user vocabulary with BLS concepts, terms and resources
• Matrix Model
  – link BLS concepts to surveys/series and variables
• Primarily intended as back-end resources, but could be adapted for direct interaction.
LABSTAT Crosswalk
4-column table, organized by concept
• Column 1 - general language words and phrases
• Column 2 - corresponding BLS terms
• Column 3 - concept associated with terms
• Column 4 - resources in which terms or concepts may be found.
LSC Example
• Row 1
  – Column 1: industry, business, company...
  – Column 2: industry
  – Column 3: general industry
  – Column 4: survey/series, SIC, NAICS, definitions as used in BLS
• Row 2
  – Column 1: retail establishments, restaurants, textile mills...
  – Column 2: sector, industry
  – Column 3: types of industry
  – Column 4: survey/series, SIC, NAICS
• Row 3
  – Column 1: company size, small businesses, retail giant
  – Column 2: size, establishment size
  – Column 3: establishment size
  – Column 4: survey/series, definitions
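As a back-end resource, the crosswalk could be stored as rows and searched by the user's phrase. This is a minimal sketch using the three example rows. The `CrosswalkRow` type and `lookup` function are hypothetical, the assignment of terms to columns follows one reading of the flattened table, and exact phrase matching is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrosswalkRow:
    general_terms: tuple[str, ...]  # Column 1: general language words and phrases
    bls_terms: tuple[str, ...]      # Column 2: corresponding BLS terms
    concept: str                    # Column 3: concept associated with terms
    resources: tuple[str, ...]      # Column 4: where terms or concepts may be found


# The three example rows from the slide (column 1 lists are not exhaustive).
LSC = [
    CrosswalkRow(("industry", "business", "company"),
                 ("industry",),
                 "general industry",
                 ("survey/series", "SIC", "NAICS", "definitions as used in BLS")),
    CrosswalkRow(("retail establishments", "restaurants", "textile mills"),
                 ("sector", "industry"),
                 "types of industry",
                 ("survey/series", "SIC", "NAICS")),
    CrosswalkRow(("company size", "small businesses", "retail giant"),
                 ("size", "establishment size"),
                 "establishment size",
                 ("survey/series", "definitions")),
]


def lookup(phrase: str) -> list[CrosswalkRow]:
    """Return every row whose column 1 contains the phrase (case-insensitive)."""
    key = phrase.strip().lower()
    return [row for row in LSC if key in row.general_terms]


for row in lookup("Restaurants"):
    print(row.concept, "->", ", ".join(row.bls_terms))
```

Returning a list rather than a single row leaves room for ambiguous phrases that belong to more than one concept.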
• Column 1 vocabulary gathered from a variety of sources.
  – BLS experts and advisors
  – End user queries and email
  – Published and broadcast information sources
• Requires domain expertise to fill in columns 2-4.
• Major concept groupings harmonize with Liddy & Liddy (2001) query grammar.
Matrix Model
• Identifies surveys/series in which concept-related information can be found.
• Rows list concepts, e.g., race, gender, occupation, industry.
• Columns list surveys/series.
• Function is similar to Wages by Area and Occupation (http://bls.gov/blswage.htm).
Partial Matrix
Columns: Mass Layoff Statistics | Covered Employment and Wages | National Employment, Hours, and Earnings
Sex X X
Race X
Geographical area X X
Industry X X X
Occupation X
When the user selects Covered Employment and Wages, irrelevant rows and columns disappear. Ownership and size rows appear; these concepts are related to industry.
Column: Covered Employment and Wages
Geographical area X
Industry X
Ownership X
Size X
Matrix shows variable names and definitions for each row.
Column: Covered Employment and Wages
Geographical area X
  Area: Indicates the area for which data were reported.
Industry X
  Industry: Indicates the industry for which the data were reported.
Ownership X
  Ownership: Indicates the ownership sector for which the data were reported.
Size X
  Size: Indicates the establishment size for which data were reported.
Matrix lists or describes values for each variable.
Column: Covered Employment and Wages
Geographical area X
  States, Counties, Metropolitan areas.
Industry X
  Divisions, 1- or 4-digit SIC code, International establishments, Unclassifiable establishments
Ownership X
  Total covered, Federal government, State government, Local government, Private
Size X
  All establishment sizes
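The drill-down shown on the last few slides might be backed by a structure like the one below. This is a sketch under stated assumptions: only the cells the slides make unambiguous are filled in (the Industry row and the Covered Employment and Wages column), the remaining X marks are omitted rather than guessed, and the names `MATRIX`, `DETAILS`, `surveys_for`, and `select_survey` are hypothetical.

```python
MLS = "Mass Layoff Statistics"
CEW = "Covered Employment and Wages"
NEHE = "National Employment, Hours, and Earnings"

# Rows are concepts; each maps to the surveys/series (columns) marked with an X.
# Incomplete by design: cells the slides leave ambiguous are not included.
MATRIX = {
    "Industry": {MLS, CEW, NEHE},
    "Geographical area": {CEW},
    "Ownership": {CEW},
    "Size": {CEW},
}

# Per-survey detail for each concept: (variable name, values), as listed for CEW.
DETAILS = {
    CEW: {
        "Geographical area": ("Area", "States, Counties, Metropolitan areas"),
        "Industry": ("Industry", "Divisions, 1- or 4-digit SIC code, "
                                 "International establishments, "
                                 "Unclassifiable establishments"),
        "Ownership": ("Ownership", "Total covered, Federal government, "
                                   "State government, Local government, Private"),
        "Size": ("Size", "All establishment sizes"),
    },
}


def surveys_for(concept: str) -> list[str]:
    """List the surveys/series in which a concept appears, in alphabetical order."""
    return sorted(MATRIX.get(concept, set()))


def select_survey(survey: str) -> dict:
    """Keep only the rows relevant to one survey, with variable detail if known."""
    detail = DETAILS.get(survey, {})
    return {
        concept: detail.get(concept)
        for concept, surveys in MATRIX.items()
        if survey in surveys
    }


print(surveys_for("Industry"))
for concept, info in select_survey(CEW).items():
    print(concept, info)
```

Keeping the X marks and the per-survey detail in separate tables mirrors the note that the matrix is mostly a one-time effort while variables and values need tracking as surveys/series change.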
Development Issues
• LSC
  – need for monitoring user vocabulary, “labor world”
  – distinguish “fads” from real changes
• Matrix
  – primarily 1-time development effort
  – track changes in surveys/series, variables, etc.
• Can supply information for many aids.
When to develop, when to reuse?
What resources already exist?
• Need to recognize potential sources among BLS documents.
• Point to existing sources (e.g., HOM definitions) from LSC, Matrix.
• May need to edit or polish existing sources for end user audience.
• Need to incorporate disparate sources into coherent structure of user aids.
Incremental Development/Adaptation
• Start with what already exists.
• Focus on most common user problems or needs, e.g., Wage by Area & Occupation.
• Focus on “big” concepts, e.g., industry, occupation
  – found in many places, therefore more decisions to make
  – appear in many user questions
Maintenance Issues
• It is crucial that user aids be accurate and current. Faulty help is worse than none at all.
• Incorporate maintenance into regular job responsibilities -- can’t be considered “extra”.
• Coordinate maintenance of shared resources.
Low Maintenance Aids
• Closely associated with changes to survey/series and related products.
  Example: change in ethnicity categories eventually changes values for ethnicity variables.
  – concept definitions, synonyms
  – example queries, authoritative classifications
  – survey/series descriptions, scope notes
  – lists of variables and values
  – links to BLS documentation
High Maintenance Aids
• Require constant monitoring of environment, even though actual changes may be few.
  – ambiguity resolution
  – automatic parsing
  – links to outside references such as Yahoo directories
  – links to sources of information outside BLS scope
Effort
• Development
• Adaptation
• Maintenance
Benefits
• Better information services
• Improved user interaction
• User education
Summary
• Where are users’ decision points? That’s where they need information and guidance.
• What do they need to know to make good decisions?
• Frameworks for organizing various kinds of aids.
• What already exists? What can you adapt?
• Importance of coherent presentation of aids.
• Don’t underestimate maintenance.
Haas (2001). From Words to Concepts to Queries: Helping Users Find Series and Variables to Satisfy their Information Needs. http://ils.unc.edu/~stephani/bls/fin-rept-01.pdf
  Contains extended examples of aids for 4 concepts: industry, ownership, occupation, white/blue collar.
Haas (2000). A Terminology Crosswalk for LABSTAT: Mapping General Language Words and Phrases to BLS Terms. http://ils.unc.edu/~stephani/bls/fin-rept-00.pdf
Haas & Hert (to appear). Finding information at the U.S. Bureau of Labor Statistics: Overcoming the barriers of scope, concept, and language mismatch. Terminology.
Liddy, E. & Liddy, J. (2001). An NLP Approach for improving Access to Statistical Information for the Masses. FCSM 2001 Research Conference.