craig a. knoblock university of southern californiapmodi/asamas/knoblock.pdf · craig knoblock...

74
Craig Knoblock University of Southern California 1 Information Agents Information Agents (Based on an invited talk (Based on an invited talk at IJCAI at IJCAI 03) 03) Craig A. Knoblock Craig A. Knoblock University of Southern California University of Southern California

Upload: truonglien

Post on 27-Apr-2018

222 views

Category:

Documents


1 download

TRANSCRIPT

Craig Knoblock University of Southern California 1

Information AgentsInformation Agents

(Based on an invited talk (Based on an invited talk at IJCAIat IJCAI’’03)03)

Craig A. KnoblockCraig A. Knoblock

University of Southern CaliforniaUniversity of Southern California

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 22

IntroductionIntroductionThe Web is a tremendous resource, but The Web is a tremendous resource, but designed for browsingdesigned for browsing•• Sites provide limited capabilities for Sites provide limited capabilities for

personalizationpersonalization•• Few sites are designed to be integrated with Few sites are designed to be integrated with

othersothers

Goal: Develop technology to rapidly Goal: Develop technology to rapidly construct personal software agentsconstruct personal software agents•• Build agents that can perform retrieval, Build agents that can perform retrieval,

integration, and monitoring tasks on any integration, and monitoring tasks on any online sourceonline source

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 33

OutlineOutline

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 44

OutlineOutline

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 55

Electric Elves ProjectElectric Elves Project[[ChalupskyChalupsky et al, 2001]et al, 2001]

Elves project goal: Apply agent technology to Elves project goal: Apply agent technology to support human organizationssupport human organizations• Develop software agents that automate routine tasks • Enable software agents and humans to work together• Support coordination of tasks

Applications: Office Elves and Travel ElvesApplications: Office Elves and Travel Elves

W W WA g e n t P r o x i e sF o r P e o p l e

I n f o r m a t i o nA g e n t s

O n t o l o g y - b a s e dM a t c h m a k e r s

GRID

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 66

Agents for Monitoring TravelAgents for Monitoring Travel[Ambite et al, 2002][Ambite et al, 2002]

•• Office Elves created as an application of the Office Elves created as an application of the Electric Elves Electric Elves

•• Given travel itinerary, generates set of agents for Given travel itinerary, generates set of agents for anticipating travelanticipating travel--related failures and related failures and opportunities:opportunities:•• Price changesPrice changes•• Schedule changesSchedule changes•• Flight delays & cancellationsFlight delays & cancellations•• Earlier and close connectionsEarlier and close connections•• Finding the closest restaurant given GPS coordinatesFinding the closest restaurant given GPS coordinates

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 77

Travel AssistantTravel Assistant

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 88

Monitoring Travel Plans Monitoring Travel Plans

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 99

Agents Deployed to Agents Deployed to Monitor Travel ItineraryMonitor Travel Itinerary

TravelItinerary W W W

A g e n t P r o x i e sF o r P e o p l e

I n f o r m a t i o nA g e n t s

O n t o l o g y - b a s e dM a t c h m a k e r s

GRID

Flight Prices &Schedules

WeatherFlight Status

Restaurants

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1010

Monitoring AgentsMonitoring AgentsFlightFlight--Status Agent: Status Agent: •• Flight delayed message:Flight delayed message:

Your United Airlines flight 190 has been delayed. Your United Airlines flight 190 has been delayed.

It was originally scheduled to depart at 11:45 AM It was originally scheduled to depart at 11:45 AM and is now scheduled to depart at 12:30 PM. and is now scheduled to depart at 12:30 PM.

The new arrival time is 7:59 PM.The new arrival time is 7:59 PM.

•• Flight cancelled message:Flight cancelled message:Your Delta Air Lines flight 200 has been cancelled.Your Delta Air Lines flight 200 has been cancelled.

•• Fax to hotel message:Fax to hotel message:Attention: Registration Desk Attention: Registration Desk

I am sending this message on behalf of David I am sending this message on behalf of David PynadathPynadath, who has a reservation at your hotel. David , who has a reservation at your hotel. David PynadathPynadath is on United Airlines 190, which is now is on United Airlines 190, which is now scheduled to arrive at IAD at 7:59 PM. Since the scheduled to arrive at IAD at 7:59 PM. Since the flight will be arriving late, I would like to request flight will be arriving late, I would like to request that you indicate this in the reservation so that the that you indicate this in the reservation so that the room is not given away. room is not given away.

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1111

Monitoring AgentsMonitoring AgentsAirfare Agent: Airfare dropped messageAirfare Agent: Airfare dropped messageThe airfare for your American Airlines itineraryThe airfare for your American Airlines itinerary

(IAD (IAD -- LAX) dropped to $281.LAX) dropped to $281.

EarlierEarlier--Flight Agent: Earlier flights messageFlight Agent: Earlier flights messageThe status of your currently scheduled flight is:The status of your currently scheduled flight is:

# 190 LAX (11:45 AM) # 190 LAX (11:45 AM) -- IAD (7:29 PM) 45 minutes Late IAD (7:29 PM) 45 minutes Late

If you would like to return earlier, the followingIf you would like to return earlier, the following

United Airlines flights will arrive earlier than your United Airlines flights will arrive earlier than your

scheduled flights:scheduled flights:

# 946 LAX (8:31 AM) # 946 LAX (8:31 AM) -- IAD (3:35 PM) 11 minutes LateIAD (3:35 PM) 11 minutes Late

----------------

# 388 LAX (9:25 AM) # 388 LAX (9:25 AM) -- DEN (12:25 PM) 10 minutes Late DEN (12:25 PM) 10 minutes Late

# 1534 DEN (1:20 PM) # 1534 DEN (1:20 PM) -- IAD (6:06 PM) On TimeIAD (6:06 PM) On Time

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1212

250

750

1250

1750

2250

12/8/2002 12/13/2002 12/18/2002 12/23/2002 12/28/2002 1/2/2003 1/7/2003Date

Pric

e

American Airlines flights192 & 223, LAX-BOS, departing on Jan. 2 & 9

Learning to Make Predictions:Learning to Make Predictions:To Buy or Not To BuyTo Buy or Not To Buy

Agents can go beyond gathering and monitoring Agents can go beyond gathering and monitoring online sourcesonline sources•• They can help make decisions by exploiting the wealth They can help make decisions by exploiting the wealth

of online informationof online information

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1313

Learning to Make Predictions:Learning to Make Predictions:To Buy or Not To BuyTo Buy or Not To Buy

Agents can go beyond gathering and monitoring Agents can go beyond gathering and monitoring online sourcesonline sources•• They can help make decisions by exploiting the wealth They can help make decisions by exploiting the wealth

of online informationof online information

Developed a learning system, Hamlet, to predict Developed a learning system, Hamlet, to predict whether it is better to wait or buy whether it is better to wait or buy [[EtzioniEtzioni, Knoblock, , Knoblock, TuchindaTuchinda, Yates, KDD, Yates, KDD’’03]03]

Collected data on airline prices over several Collected data on airline prices over several monthsmonthsLearned a model of the pricingLearned a model of the pricingIn our simulation on collected data, Hamlet In our simulation on collected data, Hamlet saved $198,074 out of a possible $320,572 saved $198,074 out of a possible $320,572 (61.8% of optimal)(61.8% of optimal)

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1414

Related Agent SystemsRelated Agent Systems

Some notable deployed systemsSome notable deployed systems•• Internet Internet SoftbotSoftbot [[EtzioniEtzioni & Weld, & Weld, ’’94]94]•• BargainFinderBargainFinder [[KrulwichKrulwich, , ’’96]96]•• ShopBotShopBot [[PerkowitzPerkowitz et al. et al. ’’96]96]•• Warren [Decker et al., Warren [Decker et al., ’’97]97]•• Electric Elves [Electric Elves [ChalupskyChalupsky et al., et al., ’’01]01]•• and many othersand many others……

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1515

OutlineOutline

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1616

Wrappers for Live Access Wrappers for Live Access to Online Sourcesto Online Sources

HTML sources turned into agentHTML sources turned into agent--friendly friendly sourcessources

<YAHOO_WEATHER>- <ROW>

<TEMP>25</TEMP> <OUTLOOK>Sunny</OUTLOOK> <HI>32</HI> <LO>19</LO> <APPARTEMP>25</ APPARTEMP > <HUMIDITY>35%</HUMIDITY> <WIND>E/10 km/h</WIND> <VISIBILITY>20 km</VISIBILITY> <DEWPOINT>9</DEWPOINT> <BAROMETER>959 mb</BAROMETER> </ROW>

</YAHOO_WEATHER>

Wrapper

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1717

Extraction RulesExtraction RulesWrapper defined by a set of extraction Wrapper defined by a set of extraction rulesrulesExtraction rule: sequence of landmarksExtraction rule: sequence of landmarks•• Define both beginning and end of required Define both beginning and end of required

information on the pageinformation on the page

Name: Joel’s <p> Phone: <i> (310) 777-1111 </i><p> Review: …

SkipTo(Phone) SkipTo(<i>) SkipTo(</i>)

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1818

Learning the Extraction RulesLearning the Extraction Rules[[MusleaMuslea, Minton, & Knoblock, 01], Minton, & Knoblock, 01]

Hierarchical wrapper inductionHierarchical wrapper induction•• Decomposes a hard problem into several easier Decomposes a hard problem into several easier

onesones•• Extracts items independently of each otherExtracts items independently of each other

InductiveLearningSystem

ExtractionRules

EC Tree

Labeled Pages

GUI

InductiveLearningSystem

ExtractionRules

EC TreeEC Tree

Labeled Pages

GUI

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 1919

Search Space: SkipTo( ( )

… SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...

SkipTo( <b> ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo(()

Training Examples:Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...

Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …

Example of Rule InductionExample of Rule Induction

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2020

Active Learning Active Learning

Problem: May require large number of Problem: May require large number of examples to achieve high accuracyexamples to achieve high accuracyExploit active learningExploit active learning•• System selects most informative examples to System selects most informative examples to

labellabel•• Want to achieve 100% accuracy with as few Want to achieve 100% accuracy with as few

examples as possibleexamples as possible

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2121

Name: Joel’s <p> Phone: (310) 777-1111 <p>Review: The chef…

SkipTo( Phone: )

Name: Kim’s <p> Phone: (213) 757-1111 <p>Review: Korean …

Name: Chez Jean <p> Phone: (310) 666-1111 <p> Review: …Name: Burger King <p> Phone:(818) 789-1211 <p> Review: ...Name: Café del Rey <p> Phone: (310) 111-1111 <p> Review: ...Name: KFC <p> Phone:<b> (800) 111-7171 </b> <p> Review:...

Unlabeled Examples

Training Examples

Which Example to Label NextWhich Example to Label Next

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2222

MultiMulti--view Learningview Learning[[MusleaMuslea, Minton, Knoblock , Minton, Knoblock ’’00]00]

Two ways to find start of the phone number:

Name: KFC <p> Phone: (310) 111-1111 <p> Review: Fried chicken …

SkipTo( Phone: ) BackTo( ( Nmb ) )

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2323

++-

-+-

RULE 1 RULE 2

Unlabeled data

+-

Labeled data

MultiMulti--view Learning: Coview Learning: Co--TestingTesting

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2424

Name: Joel’s <p> Phone: (310) 777-1111 <p>Review: ...SkipTo( Phone: )

Name: Kim’s <p> Phone: (213) 757-1111 <p>Review: ...

BackTo( (Nmb) )

CoCo--Testing for Wrapper InductionTesting for Wrapper Induction

Name: Chez Jean <p> Phone: (310) 666-1111 <p> Review: ...

Name: KFC <p> Phone:<b> (800) 111-7171 </b> <p> Review:...

Name: Burger King <p> Phone: (818) 789-1211 <p> Review: ...

Name: Café del Rey <p> Phone: (310) 111-1111 <p> Review: ...

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2525

SkipTo(Phone:) BackTo( (Nmb) )

… Phone: (800) 171-1771 <p> Fax: (111) 111-1111 <p> Review: …

… Phone:<i> (800) 555-5555 </i><p> Review: A century ago (1891) …

… Phone: (800) 171-1771 <p> Fax: (111) 111-1111 <p> Review: …

Not All Queries are Not All Queries are Equally InformativeEqually Informative

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2626

Learn Learn ““content descriptioncontent description”” for item to be extractedfor item to be extracted

•• Too general for extractionToo general for extraction( ( NmbNmb ) ) NmbNmb –– NmbNmb cancan’’t tell a t tell a phone numberphone number from a from a fax fax numbernumber

•• Useful at Useful at discriminatingdiscriminating among among query candidatesquery candidates

•• Learned content descriptionsLearned content descriptionsStarts with: Starts with: (( NmbNmb ))Ends with: Ends with: NmbNmb –– NmbNmbContains: Contains: NmbNmb PunctPunctLength: Length: [6,6][6,6]

Weak ViewsWeak Views[[MusleaMuslea, Minton, Knoblock , Minton, Knoblock ’’03]03]

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2727

NaNaïïve & Aggressive Cove & Aggressive Co--TestingTesting

NaNaïïve Cove Co--Testing:Testing:•• Query: randomly chosen contention pointQuery: randomly chosen contention point•• Output: rule with fewest mistakes on queriesOutput: rule with fewest mistakes on queries

Aggressive CoAggressive Co--Testing:Testing:•• Query: contention point that most violates Query: contention point that most violates

weak viewweak view•• Output: committee vote (2 rules + weak view)Output: committee vote (2 rules + weak view)

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2828

Results for Random SamplingResults for Random Sampling33 most difficult of the 140 tasks from [33 most difficult of the 140 tasks from [KushmerickKushmerick ’’97]97]

0

5

10

15

20

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20+

Examples to 100% accuracy

Random samplingExtraction

Tasks

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 2929

Results for Active LearningResults for Active Learning

0

5

10

15

20

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20+

Examples to 100% accuracy

Naïve Co-Testing Random samplingExtraction

Tasks

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3030

Results for Active Learning Results for Active Learning with Weak Viewswith Weak Views

0

5

10

15

20

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20+

Examples to 100% accuracy

Aggressive Co-Testing Naïve Co-Testing Random sampling

ExtractionTasks

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3131

OutlineOutline

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3232

Record Linkage Record Linkage (Object Consolidation)(Object Consolidation)

Name Street Phone

Art’s Deli 12224 Ventura Boulevard 818-756-4124

Teresa's 80 Montague St. 718-520-2910

Steakhouse The 128 Fremont St. 702-382-1600

Les Celebrites 155 W. 58th St. 212-484-5113

Name Street Phone

Art’s Delicatessen 12224 Ventura Blvd. 818/755-4100

Teresa's 103 1st Ave. between 6th and 7th Sts. 212/228-0604

Binion's Coffee Shop 128 Fremont St. 702/382-1600

Les Celebrites 160 Central Park S 212/484-5113

Zagat’s Restaurants Dept. of Health

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3333

Active Learning to Active Learning to Determine Matched RecordsDetermine Matched Records

[[TejadaTejada, Knoblock, Minton , Knoblock, Minton ’’01,01,’’02]02]

Learn importance of attributes for matching recordsLearn importance of attributes for matching records

Zagat’s

Dept of Health

Art’s Deli 12224 Ventura Boulevard 818-756-4124

Art’s Delicatessen 12224 Ventura Blvd. 818/755-4100

Name Street Phone

Mapping rules:

Name > .9 & Street > .87 => mapped

Name > .95 & Phone > .96 => mapped

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3434

Mapping Rule LearnerMapping Rule Learner

Set of MappedObjects

Choose initial examples

Generate committee of learners

Learn Rules

ClassifyExamples

Votes Votes Votes

Choose Example

USERLearn Rules

ClassifyExamples

Learn Rules

ClassifyExamples

Label

Label

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3535

Committee DisagreementCommittee DisagreementChooses an example based on the Chooses an example based on the disagreement of the query committeedisagreement of the query committee

CPK, California Pizza Kitchen is the most CPK, California Pizza Kitchen is the most informative exampleinformative example

Art’s Deli, Art’s DelicatessenCPK, California Pizza KitchenCa’Brea, La Brea Bakery

Yes Yes YesYes No Yes No No No

Examples M1 M2 M3Committee

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3636

Exploiting Secondary Exploiting Secondary Sources for Record LinkageSources for Record Linkage[[MichalowskiMichalowski, Thakkar, Knoblock , Thakkar, Knoblock ’’03]03]

Primary data source may be insufficient to Primary data source may be insufficient to determine mappingsdetermine mappingsSecondary sources can help reduce the Secondary sources can help reduce the uncertaintyuncertaintyExamples of secondary sourcesExamples of secondary sources•• GeocoderGeocoder

Maps street addresses into lat/long coordinatesMaps street addresses into lat/long coordinates•• Business directories Business directories

Provide company officers and locationsProvide company officers and locations•• Area code updatesArea code updates

Provide changes in area codes over timeProvide changes in area codes over time

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3737

Missing MatchesMissing Matches

Record Linkage

Matched Records

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3838

Exploiting a Exploiting a GeocoderGeocoder

Record LinkageMatched Records

Secondary Source

26 Beach Cafe26 Washington St.Venice, CA

26 Beach Cafe26 Washington Boulevard Marina Del Rey, Calif

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 3939

Preliminary Results:Preliminary Results:Secondary SourcesSecondary Sources

Secondary source reduces the depth of the decision tree that needs to be learned

# Labeled Examples

Total Correct Matches

Precision RecallAverage DT Depth Precision Recall

Average DT Depth

25 109 51% 33% 5 66% 51% 135 109 73% 57% 8 81% 68% 350 109 83% 81% 10 85% 85% 3

Without Secondary Source With Secondary Source

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4040

OutlineOutline

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4141

Efficiently Executing Efficiently Executing Agent PlansAgent Plans

ProblemProblem•• Information gathering may involve accessing Information gathering may involve accessing

and integrating data from many sourcesand integrating data from many sources•• Total time to execute these plans may be large Total time to execute these plans may be large

Why?Why?•• Slow remote sourcesSlow remote sources•• Unpredictable network latenciesUnpredictable network latencies•• Binding patterns Binding patterns

Source cannot be queried until a previous query has Source cannot be queried until a previous query has been answeredbeen answered

•• Result: execution is often I/OResult: execution is often I/O--boundbound

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4242

Theseus Agent Execution SystemTheseus Agent Execution System[[BarishBarish & Knoblock, JAIR& Knoblock, JAIR’’05]05]

Plan languagePlan language and and execution systemexecution system for Webfor Web--based information integrationbased information integration•• Expressive enough for monitoring a variety of sourcesExpressive enough for monitoring a variety of sources•• Efficient enough for realEfficient enough for real--time monitoringtime monitoring

TheseusExecutor

PLAN myplan {INPUT: xOUTPUT: y

BODY {Op (x : y)

}}

010101010101100001110110101111010101010101

PlanInput Data

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4343

Streaming DataflowStreaming DataflowPlans consist of a network of operatorsPlans consist of a network of operators•• ExamplesExamples: : WrapperWrapper, , SelectSelect, etc., etc.•• Operators produce and consume dataOperators produce and consume data•• Operators Operators ““firefire”” upon any input dataupon any input data

Data passed as Data passed as tuplestuples of a relationof a relation

Wrapper

Select

Join

WrapperAddress

100 Main St., Santa Monica, 90292520 4th St. Santa Monica, 90292

2 Ocean Blvd, Venice, 90292

City State Max PriceSanta Monica CA 200000

Input relation Output relationPlan

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4444

MUL

MUL

ADD

a b c d

Parallelism in Streaming DataflowParallelism in Streaming DataflowDataflowDataflow•• Operations scheduled by data Operations scheduled by data

availabilityavailabilityIndependent operations execute in parallelIndependent operations execute in parallelMaximizes horizontal parallelismMaximizes horizontal parallelism

•• ExampleExample: computing : computing (a*b) + (c*d)

StreamingStreaming•• Operations emit data as soon as Operations emit data as soon as

possiblepossibleIndependent data processed in parallelIndependent data processed in parallelMaximizes vertical parallelismMaximizes vertical parallelism

Producer

Consumer

MUL MUL

ADD

a b c d

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4545

CarInfoCarInfo AgentAgent

Agent for recommending used cars:Agent for recommending used cars:•• Combine information fromCombine information from

Prices of used carsPrices of used carsSafety ratingsSafety ratingsReviews Reviews

•• Example:Example:2002 Midsize coupe/hatchback2002 Midsize coupe/hatchback$4K$4K--$12K, $12K, No No OldsmobilesOldsmobiles

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4646

The The CarInfoCarInfo agentagent1. Locate cars that

meet criteria- Edmunds.com

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4747

The The CarInfoCarInfo agentagent1. Locate cars that

meet criteria- Edmunds.com

2. Filter outOldsmobiles

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4848

The The CarInfoCarInfo agentagent1. Locate cars that

meet criteria- Edmunds.com

2. Filter outOldsmobiles

3. Gather safetyreviews for each

- NHSTA.gov

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 4949

The The CarInfoCarInfo agentagent1. Locate cars that

meet criteria- Edmunds.com

2. Filter outOldsmobiles

3. Gather safetyreviews for each

- NHSTA.gov

4. Gather detailedreviews of each

- ConsumerGuide.com

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5050

ConsumerGuideConsumerGuide NavigationNavigationRequires navigating through multiple pagesRequires navigating through multiple pages

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5151

DataflowDataflow--style style CarInfoCarInfo agent planagent plan

WRAPPERConsumerGuide

Search

(Midsize coupe/hatchback,$4000 to $12000,2002)

((http://cg.com/summ/20812.htm),other summary review URLs)

((http://cg.com/full/20812.htm),other full review URLs)

search criteria

WRAPPERConsumerGuide

Summary

WRAPPERConsumerGuide

Full Review

(car reviews)WRAPPER

EdmundsSearch

((Oldsmobile Alero), (Dodge Stratus), (Pontiac Grand Am), (Mercury Cougar))

JOIN

SELECTmaker !=

"Oldsmobile"

WRAPPERNHTSASearch

(safety reports)

((Dodge Stratus), (Pontiac Grand Am), (Mercury Cougar))

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5252

Speculative ExecutionSpeculative Execution[[BarishBarish & Knoblock & Knoblock ’’02, 02, ’’03]03]

Basic ideaBasic idea•• Exploit idle resources to execute future Exploit idle resources to execute future

instructions in advance of when they are instructions in advance of when they are normally issuednormally issued

ChallengesChallenges•• How to augment plans for speculationHow to augment plans for speculation•• How to ensure correctness and fairnessHow to ensure correctness and fairness•• How to decide what to speculate onHow to decide what to speculate on

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5353

How to speculate?How to speculate?General problemGeneral problem•• Means for issuing and confirming predictionsMeans for issuing and confirming predictions

Two new operatorsTwo new operators•• SpeculateSpeculate: Makes predictions based on "hints": Makes predictions based on "hints"

•• ConfirmConfirm: Prevents errant results from exiting plan: Prevents errant results from exiting plan

Speculateanswers

hints

confirmations

predictions/additions

Confirm confirmations

probable resultsactual results

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5454

JSW

WWW

W

BEFORE

Example: Example: CarInfoCarInfo•• Predict cars based on search criteriaPredict cars based on search criteria•• Makes practical sense: Makes practical sense:

Same criteria yields same carsSame criteria yields same cars

How to speculate?How to speculate?

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5555

AFTER

How to speculate?How to speculate?Example: Example: CarInfoCarInfo•• Predict cars based on search criteriaPredict cars based on search criteria•• Makes practical sense: Makes practical sense:

Same criteria yields same carsSame criteria yields same cars

JSW

WSpeculate

hintspredictions/additions

confirmationsanswers

W

Confirm

W

W

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5656

Detailed exampleDetailed example

JSW

WSpeculate

W

Confirm

W

W

2002Midsize coupe$4000-$12000

Time = 0.0 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5757

Issuing predictionsIssuing predictions

JSW

WSpeculate

W

Confirm

W

W

Oldsmobile Alero T1Dodge Stratus T2Pontiac Grand Am T3Mercury Cougar T4

Time = 0.1 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5858

Speculative parallelismSpeculative parallelism

JSW

WSpeculate

W

Confirm

W

W

Dodge Stratus T2Pontiac Grand Am T3Mercury Cougar T4

Time = 0.2 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 5959

Answers to hintsAnswers to hints

JSW

WSpeculate

W

Confirm

W

W

Oldsmobile AleroDodge StratusPontiac Grand AmMercury Cougar Time = 1.0 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6060

Continued processingContinued processing

JSW

WSpeculate

W

Confirm

W

W

T1T2T3T4

Time = 1.1 sec

Additions (corrections), if any

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6161

Generation of final resultsGeneration of final results

JSW

WSpeculate

W

Confirm

W

W

Dodge Stratus (safety) (review) T2Pontiac Grand Am (safety) (review) T3Mercury Cougar (safety) (review) T4

Time = 3.2 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6262

Confirmation of resultsConfirmation of results

JSW

WSpeculate

W

Confirm

W

W

Dodge Stratus (safety) (review)Pontiac Grand Am (safety) (review)Mercury Cougar (safety) (review)

Time = 3.3 sec

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6363

Safety and fairnessSafety and fairnessSafetySafety•• ConfirmConfirm blocks predictions (and results of) from blocks predictions (and results of) from

exiting plan before verificationexiting plan before verification

FairnessFairness•• CPUCPU

Speculative operations use "speculative threads"Speculative operations use "speculative threads"•• Lower priority threadsLower priority threads

•• Memory and bandwidthMemory and bandwidthSpeculative operations allocate "speculative resources"Speculative operations allocate "speculative resources"

•• Drawn from "speculative pool" of memory / objectsDrawn from "speculative pool" of memory / objects

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6464

Cascading SpeculationCascading SpeculationUse predicted cars to speculate about the Use predicted cars to speculate about the ConsumerGuideConsumerGuide summary and full URLssummary and full URLs

Optimistic performanceOptimistic performance•• Execution time:Execution time: maxmax {{1.2, 1.4, 1.5, 1.61.2, 1.4, 1.5, 1.6}} == 1.6 sec1.6 sec•• Speedup over streaming dataflow:Speedup over streaming dataflow: (4.2/1.6)(4.2/1.6) == 2.632.63

W

J

SW

W

SPECCONF

SPEC

W

WSPEC

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6565

Automatic plan transformationAutomatic plan transformationAgent plans are automatically modified for Agent plans are automatically modified for speculative executionspeculative execution•• Successive runs of the plan benefit Successive runs of the plan benefit

Even with different input dataEven with different input data

Leverage Amdahl's Law: Leverage Amdahl's Law: •• Consider optimizing only the most expensive Consider optimizing only the most expensive

path (path (MEPMEP))

Algorithm continually refines MEP Algorithm continually refines MEP •• Until overhead of further optimization Until overhead of further optimization

outweighs benefitsoutweighs benefits

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6666

Learning for Speculative ExecutionLearning for Speculative ExecutionCachingCaching•• Associate a hint with a predicted valueAssociate a hint with a predicted value

2002 Midsize coupe 4K2002 Midsize coupe 4K--12K12KOlds Olds AleroAlero, Dodge Stratus, Pontiac Grand Am, Mercury Cougar, Dodge Stratus, Pontiac Grand Am, Mercury Cougar

ClassificationClassification•• Use features of a hint to predict valueUse features of a hint to predict value

EXAMPLEEXAMPLE:: Predicting car list from EdmundsPredicting car list from Edmunds

type = SUV : (Nissan Pathfinder, Ford Explorer)type = Midsize : :...min <= 10000 : (Olds Alero, Dodge Stratus)

min > 10000 : (Honda Accord, Toyota Camry)

Cache

Decision list

Year Type Min Max Car list2002 Midsize 8000 15000 (Oldmobile Alero, Dodge Stratus)2002 Midsize 7500 14500 (Oldmobile Alero, Dodge Stratus)2002 SUV 14000 20000 (Nissan Pathfinder, Ford Explorer)2001 Midsize 11000 18000 (Honda Accord, Toyota Camry)2002 SUV 18000 22000 (Nissan Pathfinder, Ford Explorer)

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6767

1"http://cg.com/summary/" :

ε : COPY

3"." :

2ε : COPY

Learning for Speculative ExecutionLearning for Speculative ExecutionTransductionTransduction•• Transducers are FSM that translate hints into predictionsTransducers are FSM that translate hints into predictions

http://cg.com/summary/20812.htm

http://cg.com/full/20812.htm

To create full review URL:

1. Insert "http://cg.com/full/"

2. Extract & insert the dynamic part of the summary URL (e.g., 20812)

3. Insert ".htm"

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6868

Speculation Results: Last TupleSpeculation Results: Last Tuple

0

1000

2000

3000

4000

5000

6000

7000

8000

CarInfo RepInfo TheaterLoc FlightStatus StockInfo

Tim

e to

last

tupl

e (m

s)

No speculation50% correct100% correct

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 6969

Related WorkRelated WorkWrapper learningWrapper learning•• Supervised [Supervised [KushmerickKushmerick ’’97, Hsu & Dung 97, Hsu & Dung ’’98]98]•• Unsupervised [Unsupervised [LermanLerman et al. et al. ’’01, 01, CrescenziCrescenzi ’’01]01]

Record linkageRecord linkage•• Learning [Cohen Learning [Cohen ’’00, 00, SarawagiSarawagi & & BhamidipatyBhamidipaty ’’02]02]•• Statistics [Winkler Statistics [Winkler ’’98]98]•• Name matching [Name matching [BilenkoBilenko et al. et al. ’’03, Cohen et al. 03, Cohen et al. ’’03]03]

Efficient plan executionEfficient plan execution•• Network query engines [Ives et al. 1999, Network query engines [Ives et al. 1999, NaughtonNaughton et et

al. 2000, al. 2000, HellersteinHellerstein et al. 2001]et al. 2001]•• Agent execution systems [Agent execution systems [FirbyFirby ’’94, Myers et al. 1996]94, Myers et al. 1996]

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 7070

Outline of talkOutline of talk

The Electric Elves: Information The Electric Elves: Information agents for monitoring travelagents for monitoring travelWrapping online sourcesWrapping online sourcesLinking records across sourcesLinking records across sourcesEfficiently executing agent plansEfficiently executing agent plansConclusionsConclusions

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 7171

ConclusionsConclusionsWeb provides the ideal environment for Web provides the ideal environment for developing and testing software agentsdeveloping and testing software agents•• Noted by Noted by EtzioniEtzioni, AAAI, AAAI’’96 in his talk on 96 in his talk on SoftbotsSoftbots

Yet few have seized this opportunityYet few have seized this opportunity……why?why?Like robotics, wide variety of hard technical Like robotics, wide variety of hard technical problemsproblemsWith Web Services, the Semantic Web, etc. the With Web Services, the Semantic Web, etc. the infrastructure is improvinginfrastructure is improvingGreat opportunity for AIGreat opportunity for AI•• Ability to demonstrate and test technologies in a realAbility to demonstrate and test technologies in a real--

world settingworld setting•• Opportunity to apply technologies to make a difference Opportunity to apply technologies to make a difference

in peoplein people’’s livess lives

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 7272

Conclusions (cont.)Conclusions (cont.)Many interesting technical challenges for building Many interesting technical challenges for building software agents:software agents:•• Wrapping online sourcesWrapping online sources•• Linking records across sitesLinking records across sites•• Efficiently executing agent plansEfficiently executing agent plans•• Extraction from text documentsExtraction from text documents•• Aligning Aligning ontologiesontologies across sourcesacross sources•• Planning to integrate data sourcesPlanning to integrate data sources•• Learning to improve performance and capabilitiesLearning to improve performance and capabilities•• Integrating these capabilities in a robust architecture that Integrating these capabilities in a robust architecture that

can:can:Respond to failuresRespond to failuresExplain its behaviorExplain its behaviorCommunicate appropriatelyCommunicate appropriately

Craig KnoblockCraig Knoblock University of Southern CaliforniaUniversity of Southern California 7373

Want to Learn More?Want to Learn More?

I teach a course on Information I teach a course on Information Integration on the WebIntegration on the WebSyllabus, readings, and slides Syllabus, readings, and slides available from:available from:•• http://www.isi.edu/infohttp://www.isi.edu/info--

agents/courses/csci548/agents/courses/csci548/

Craig Knoblock University of Southern California 74

The EndThe End