Why Geolocating Written Routes is Harder than it Looks Ian Turton Department of Geography Penn State University, University Park, PA [email protected]


Post on 02-Jun-2015


Page 1: Why Geolocating Written Routes Is Harder Than It Looks

Why Geolocating Written Routes is Harder than it Looks

Ian Turton

Department of Geography, Penn State University, University Park, PA

[email protected]

Page 2: Why Geolocating Written Routes Is Harder Than It Looks

Acknowledgements

Research for this paper was funded by the National Geospatial-Intelligence Agency/NGA through the NGA University Research Initiative Program/NURI program. The views, opinions, and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the National Geospatial-Intelligence Agency or the U.S. Government.

Page 3: Why Geolocating Written Routes Is Harder Than It Looks

Summary

Route mapping
Why?
How?
Problems
Fixes?

Page 4: Why Geolocating Written Routes Is Harder Than It Looks

Aims

The GeoCAM project aims to take written route descriptions and convert them to maps.

Page 5: Why Geolocating Written Routes Is Harder Than It Looks

Extracting and Mapping Geographic Entities

A simple natural language processing task

All you need to do is:

Find the named entities
Determine if they are geographic features
Join them up to form a route
Then draw a map with them on

Page 6: Why Geolocating Written Routes Is Harder Than It Looks

BUT

Page 7: Why Geolocating Written Routes Is Harder Than It Looks

Multiple Places

Berlin – any reasonable system chooses Germany (40 worldwide, 28 in US)

Springfield occurs in many states and many times per state

Maitri (flickr) CC-NC-SA

Page 8: Why Geolocating Written Routes Is Harder Than It Looks

Strange Place Names

Look out for North East (both MD and PA).

Or North (VA, SC).

Or South (AL, KY).

Not to be confused with NE Irving St or Nebraska.

Page 9: Why Geolocating Written Routes Is Harder Than It Looks

But…

If you thought towns were hard, wait until you try streets.

Page 10: Why Geolocating Written Routes Is Harder Than It Looks

213,142 Towns, 13,543,533 Streets in the USA

Place name                Count    Street name      Count
Township of Union           248    (no name)      6210946
Township of Washington      246    Main St          12849
Midway                      214    2nd St            8977
Fairview                    207    1st St            8093
Township of Jackson         204    3rd St            8027
Oak Grove                   164    4th St            6945
Township of Liberty         160    Oak St            6612
Five Points                 149    Elm St            6104
Township of Jefferson       147    Pine St           6069
Township of Lincoln         147    5th St            5677

Springfield: 66, Berlin: 28
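Counts like these are plain name frequencies over a gazetteer. A minimal sketch of such a tally, assuming the gazetteer is available as (name, state) pairs; the sample rows and the `name_counts` helper are invented for illustration and are not the project's data or code:

```python
from collections import Counter

# Hypothetical gazetteer rows: (place name, state). Illustrative only.
gazetteer = [
    ("Springfield", "IL"),
    ("Springfield", "MA"),
    ("Springfield", "MO"),
    ("Berlin", "NH"),
    ("Berlin", "WI"),
    ("Gibsonia", "PA"),
]

def name_counts(rows):
    """Count how many gazetteer entries share each name."""
    return Counter(name for name, _state in rows)

counts = name_counts(gazetteer)

# Names with more than one entry cannot be located from the name alone.
ambiguous = {name: n for name, n in counts.items() if n > 1}

for name, n in counts.most_common():
    print(f"{name:<15}{n:>5}")
```

The higher the count, the less a bare name tells you about where the writer actually is.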

Page 11: Why Geolocating Written Routes Is Harder Than It Looks

Common Nouns as Street Names

Street names that are common nouns are hard to distinguish from ordinary nouns capitalized at the start of a sentence.

Page 12: Why Geolocating Written Routes Is Harder Than It Looks

Ambiguous Road Names

Turn from Independence into Washington.

Washington fought for independence.

Plus there are 246 places called Township of Washington.

Page 13: Why Geolocating Written Routes Is Harder Than It Looks

Interesting Road Names

Colorado has many “interesting” road names.

Consider also Street Rd, which occurs in 9 different states.

Page 14: Why Geolocating Written Routes Is Harder Than It Looks

Really Interesting Names!

Slworking2 Flickr – CC-NC-SA

Page 15: Why Geolocating Written Routes Is Harder Than It Looks

Roads with Multiple Names

Many highways and interstates have two or more names.

I-99/US 220/Bud Shuster Hwy

Streets are no better

Photograph by Joe Mabel, licensed under GFDL

Page 17: Why Geolocating Written Routes Is Harder Than It Looks

Numbered Streets

Is this 39th St? Or Thirty Ninth St? Or 39 St?

All appear to be equally acceptable when writing directions.

Bitchcakesny (Flickr) CC-NC-SA
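One way to treat the three spellings as the same street is to reduce each to its number. The sketch below is an assumption about how that could be done, not the project's code; `street_number` and the word tables are hypothetical and cover 1 to 99 only:

```python
import re

UNITS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5,
         "sixth": 6, "seventh": 7, "eighth": 8, "ninth": 9}
TEENS = {"tenth": 10, "eleventh": 11, "twelfth": 12, "thirteenth": 13,
         "fourteenth": 14, "fifteenth": 15, "sixteenth": 16,
         "seventeenth": 17, "eighteenth": 18, "nineteenth": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
TENS_ORDINALS = {"twentieth": 20, "thirtieth": 30, "fortieth": 40,
                 "fiftieth": 50, "sixtieth": 60, "seventieth": 70,
                 "eightieth": 80, "ninetieth": 90}
ORDINALS = {**UNITS, **TEENS, **TENS_ORDINALS}

def street_number(name):
    """Return the number encoded in a street name, or None if there is none."""
    # Split into runs of letters and runs of digits, so "39th" -> "39", "th".
    tokens = re.findall(r"[a-z]+|\d+", name.lower())
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            return int(tok)
        # "thirty ninth" -> 30 + 9
        if tok in TENS and i + 1 < len(tokens) and tokens[i + 1] in UNITS:
            return TENS[tok] + UNITS[tokens[i + 1]]
        if tok in ORDINALS:
            return ORDINALS[tok]
    return None

for written in ("39th St", "Thirty Ninth St", "39 St", "Main St"):
    print(written, "->", street_number(written))
```

Any name without a number, such as Main St, falls through to `None` and has to be handled by some other rule.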

Page 18: Why Geolocating Written Routes Is Harder Than It Looks

Directions as seen in the wild

Page 19: Why Geolocating Written Routes Is Harder Than It Looks

Directions to the State College Farmers’ Market

FROM  EAST or WEST : Exit  Rt 322 at College Ave.(also named  RT 26; Benner Pike). Turn west onto College Ave and proceed  2 mi. to Locust Lane (2nd street to left after campus intersection of College Ave. & Shortlidge Rd (Garner St).

FROM  NORTH or SOUTH (within town): Follow Atherton Street (business Rt. 322) to Beaver Ave. light. Turn east on Beaver Ave. and go thru 4 lights (6 blocks) to Locust lane.

Page 20: Why Geolocating Written Routes Is Harder Than It Looks

Downtown State College, PA

[Map with street labels: N Atherton St, S Atherton St, E College Av, W College Av, PA 26]

Page 21: Why Geolocating Written Routes Is Harder Than It Looks

Issues

College Av (Rt. 26 or Benner Pike)
  In the database as West (or East) College Av
  Alternate name is PA 26
  Benner Pike is actually PA 150.

Atherton Street (business Rt. 322)
  In the database as North (or South) Atherton St
  Mostly referred to as Atherton by locals

Page 22: Why Geolocating Written Routes Is Harder Than It Looks

Directions to The Callan Theatre

From west (Adams Morgan, Georgetown)

At the intersection of 16th Street and Irving Street, two blocks east of the heart of Adams Morgan, go east on Irving (right if you are coming from downtown). Cross Georgia Avenue, pass the Washington Hospital Center, go under North Capitol Street. Irving dead-ends on Michigan Avenue; turn left onto Michigan. The next light, which comes up quickly, is Harewood Road; turn left onto Harewood. The theater is half a mile up Harewood on the right.

Page 23: Why Geolocating Written Routes Is Harder Than It Looks

Catholic University Neighbourhood

[Map with street labels: Irving St NW, Irving St NE, Michigan Av NW, Michigan Av NE]

Page 24: Why Geolocating Written Routes Is Harder Than It Looks

Issues

All references are missing the vital NE/NW

Need to distinguish between Michigan (Avenue) and the State of Michigan.

Page 25: Why Geolocating Written Routes Is Harder Than It Looks

How to handle Street names?

We need to be able to look up shortened (and possibly partly wrong) street names in our database.

Could use SQL LIKE query
  Slow, difficult

Programmatic adjustment of names encountered (add N/S/E/W etc.)
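The two options can be compared on a toy table. This is a sketch using Python's built-in `sqlite3` module; the `streets` table, its rows, and the helper functions are assumptions made up for the example, not the project's schema:

```python
import sqlite3

# Hypothetical in-memory streets table; names follow the State College example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (name TEXT)")
conn.executemany(
    "INSERT INTO streets VALUES (?)",
    [("N Atherton St",), ("S Atherton St",),
     ("E College Av",), ("W College Av",), ("Locust Ln",)],
)

def like_lookup(fragment):
    """Option 1: substring match with LIKE (leading wildcard)."""
    rows = conn.execute(
        "SELECT name FROM streets WHERE name LIKE ? ORDER BY name",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]

def directional_variants(name):
    """Option 2: generate the N/S/E/W forms the database might hold."""
    return [f"{d} {name}" for d in ("N", "S", "E", "W")]

def exact_lookup(name):
    """Exact match against the written name and its directional variants."""
    candidates = [name] + directional_variants(name)
    placeholders = ",".join("?" * len(candidates))
    rows = conn.execute(
        f"SELECT name FROM streets WHERE name IN ({placeholders}) ORDER BY name",
        candidates,
    )
    return [row[0] for row in rows]

print(like_lookup("Atherton"))
print(exact_lookup("Atherton St"))
```

A pattern with a leading wildcard generally cannot use an ordinary index, which is one reason the LIKE route gets slow on millions of rows. The variant route keeps exact matches but only works when the written name already has the right street type.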

Page 26: Why Geolocating Written Routes Is Harder Than It Looks

Stemming

In linguistic morphology, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

Page 27: Why Geolocating Written Routes Is Harder Than It Looks

Stemming Examples

So Dog and dogs (and doggedly) stem to dog.

Cat, cats and cattily stem to cat.

Fishing, fished, fish, and fisher stem to fish.

This allows a search engine to return pages about dogs when you search for dog.

Page 28: Why Geolocating Written Routes Is Harder Than It Looks

So Stemming for Street Names…

Washington Pl, Washington Ave and Washington St all become Washington.

North Atherton St, South Atherton St and Atherton Av become Atherton.

1st St, First Ave and 1 St N become 1.

Interstate 80, I-80 and I 80w become 80.

State Rd 26, PA 26 and county road 26 become 26.
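A rough sketch of a stemmer that reproduces the examples on this slide. The word lists and the `stem_street_name` function are assumptions for illustration (a real system would need every state abbreviation and street type, and this one does not handle spelled-out compounds such as Thirty Ninth); it is not the project's implementation:

```python
import re

# Deliberately small, hypothetical word lists.
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw",
              "north", "south", "east", "west"}
STREET_TYPES = {"st", "street", "ave", "av", "avenue", "rd", "road",
                "pl", "place", "ln", "lane", "dr", "drive", "ct", "court",
                "hwy", "highway", "pike"}
ROUTE_PREFIXES = {"i", "interstate", "us", "state", "county",
                  "route", "rt", "pa"}
ORDINAL_WORDS = {"first": "1", "second": "2", "third": "3",
                 "fourth": "4", "fifth": "5"}

def stem_street_name(name):
    """Reduce a written street name to a lookup key."""
    # Tokens are digit runs with any trailing letters ("1st", "80w") or words.
    tokens = re.findall(r"\d+[a-z]*|[a-z]+", name.lower())
    normalized = []
    for tok in tokens:
        if tok[0].isdigit():
            tok = re.match(r"\d+", tok).group()  # "1st" -> "1", "80w" -> "80"
        normalized.append(ORDINAL_WORDS.get(tok, tok))
    # Route prefixes are dropped only for numbered roads, so "State St"
    # still stems to "state".
    has_number = any(tok.isdigit() for tok in normalized)
    kept = []
    for tok in normalized:
        if tok in DIRECTIONS or tok in STREET_TYPES:
            continue
        if has_number and tok in ROUTE_PREFIXES:
            continue
        kept.append(tok)
    # Names made only of droppable words ("Street Rd") are kept whole.
    return " ".join(kept) if kept else " ".join(normalized)

for written in ("North Atherton St", "1St St", "I 80w", "county road 26"):
    print(written, "->", stem_street_name(written))
```

The fallback in the last line matters for names like Street Rd or North East, which would otherwise stem to nothing.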

Page 29: Why Geolocating Written Routes Is Harder Than It Looks

Does this help?

This allows us to take (possibly wrong) and often shortened street names and look them up in the database.

(after we have worked through the 7.3 million named street segments in the USA)

(Caching is obviously our friend here)
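One possible shape for that caching: stem every database name once into an index, then memoize lookups. The sketch assumes a much-reduced stand-in stemmer (`simple_stem`) and a made-up street list; none of these names come from the project:

```python
from collections import defaultdict
from functools import lru_cache

# Words dropped by the stand-in stemmer (assumption, far from complete).
DROP = {"n", "s", "e", "w", "north", "south", "east", "west",
        "st", "street", "av", "ave", "avenue", "rd", "road"}

def simple_stem(name):
    """Very reduced stand-in for a street-name stemmer."""
    words = name.lower().replace(".", "").split()
    return " ".join(w for w in words if w not in DROP)

def build_index(street_names):
    """Stem each database name once; map stem -> set of full names."""
    index = defaultdict(set)
    for name in street_names:
        index[simple_stem(name)].add(name)
    return index

# Hypothetical database contents.
STREETS = ["N Atherton St", "S Atherton St",
           "E College Av", "W College Av", "E Beaver Av"]
INDEX = build_index(STREETS)

@lru_cache(maxsize=None)
def lookup(written_name):
    """Return the database names sharing a stem with the written name."""
    return tuple(sorted(INDEX.get(simple_stem(written_name), ())))

print(lookup("Atherton Street"))
print(lookup("College Ave."))
print(lookup("Atherton Street"))  # second call is served from the cache
print(lookup.cache_info())
```

The expensive pass over the full street table happens once in `build_index`; repeated names within a set of directions then cost a dictionary hit.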

Page 30: Why Geolocating Written Routes Is Harder Than It Looks

Street name        Count    Stemmed name    Count
(no name)        6210946    2               31097
Main St            12849    main            30003
2nd St              8977    3               28614
1st St              8093    1               27938
3rd St              8027    4               25726
4th St              6945    5               23359
Oak St              6612    6               20369
Elm St              6104    park            19280
Pine St             6069    oak             17706
5th St              5677    7               17666
Church St           5662    8               16039
Maple St            5527    maple           15877
Walnut St           5041    pine            14703
6th St              4698    10              14260
Washington St       4104    9               14193
7th St              3927    elm             13163
N Main St           3535    washington      12290
Center St           3502    11              12219
River Rd            3493    cedar           12176
High St             3452    walnut          12107

Page 31: Why Geolocating Written Routes Is Harder Than It Looks

Route Detection Algorithm

Find numeric strings in text
  Look up zip codes and telephone numbers

Select Noun Phrases (NP)
  JTextPro

For each NP:
  Is it a state?
  Is it a town (populated place)?
  Is it a point of interest (POI)?
  Is it a street name (after stemming)?
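The checks above can be sketched in a few lines. This assumes the noun phrases have already been extracted (the slides use JTextPro for that step) and uses tiny made-up lookup sets in place of real gazetteers; the regular expressions, `find_numeric`, `crude_stem` and `classify` are all hypothetical:

```python
import re

# US-style zip codes and phone numbers (simplified patterns, an assumption).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")

# Tiny stand-ins for the real state, town, POI and street tables.
STATES = {"michigan", "pennsylvania", "pa"}
TOWNS = {"state college", "north east"}
POIS = {"washington hospital center"}
STREET_STEMS = {"michigan", "atherton", "irving"}
STREET_WORDS = {"st", "street", "av", "ave", "avenue", "rd", "road",
                "n", "s", "e", "w"}

def find_numeric(text):
    """Pull out phone numbers first, then zip codes from what is left."""
    phones = PHONE_RE.findall(text)
    remainder = PHONE_RE.sub(" ", text)
    return {"phone": phones, "zip": ZIP_RE.findall(remainder)}

def crude_stem(phrase):
    """Stand-in stemmer: drop street types and single-letter directions."""
    return " ".join(w for w in phrase.lower().split() if w not in STREET_WORDS)

def classify(noun_phrase):
    """Return every category the phrase could belong to, in slide order."""
    key = noun_phrase.lower()
    labels = []
    if key in STATES:
        labels.append("state")
    if key in TOWNS:
        labels.append("town")
    if key in POIS:
        labels.append("poi")
    if crude_stem(noun_phrase) in STREET_STEMS:
        labels.append("street")
    return labels

print(find_numeric("Call 814-555-0100 or write to State College, PA 16801."))
for np in ("Michigan", "Michigan Avenue", "State College"):
    print(np, "->", classify(np))
```

Returning every matching label, rather than the first, keeps the Michigan (state) versus Michigan (Avenue) ambiguity visible for the later disambiguation step.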

Page 32: Why Geolocating Written Routes Is Harder Than It Looks

Algorithm (continued)

Select any NP that is unambiguous (Zzyzx Rd, Autumn Crocus Ct, Abell City)

Define a minimum bounding box (or polygon) based on the unambiguous points.

Sort the ambiguous NPs by number of matches (so prefer Gibsonia (3) over Midway (214)), then check whether only one candidate falls inside the polygon; if so, select that one.
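A sketch of this step using a plain bounding box. All coordinates are invented for the example, and the `margin` parameter is an added assumption so that a single unambiguous point still yields a usable box; the function names are hypothetical:

```python
def bounding_box(points, margin=0.0):
    """Smallest lat/lon box around the points, padded by `margin` degrees."""
    lats = [lat for lat, _lon in points]
    lons = [lon for _lat, lon in points]
    return (min(lats) - margin, min(lons) - margin,
            max(lats) + margin, max(lons) + margin)

def inside(box, point):
    min_lat, min_lon, max_lat, max_lon = box
    lat, lon = point
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def disambiguate(candidates_by_name, margin=0.5):
    """candidates_by_name maps each NP to its list of (lat, lon) matches."""
    # Names with exactly one match anchor the box.
    resolved = {name: pts[0]
                for name, pts in candidates_by_name.items() if len(pts) == 1}
    if not resolved:
        return resolved
    box = bounding_box(list(resolved.values()), margin)
    # Fewest matches first, as on the slide. With a fixed box the order does
    # not change the result; it would if the box were updated as names resolve.
    ambiguous = sorted(
        (name for name, pts in candidates_by_name.items() if len(pts) > 1),
        key=lambda name: len(candidates_by_name[name]),
    )
    for name in ambiguous:
        hits = [p for p in candidates_by_name[name] if inside(box, p)]
        if len(hits) == 1:  # only one candidate falls in the box
            resolved[name] = hits[0]
    return resolved

# Invented coordinates, for illustration only.
candidates = {
    "Zzyzx Rd": [(40.0, -80.0)],
    "Abell City": [(41.0, -79.0)],
    "Gibsonia": [(40.6, -79.9), (30.0, -85.0), (28.0, -81.0)],
    "Midway": [(40.3, -79.5), (40.8, -79.2), (33.0, -84.0)],
}
print(disambiguate(candidates))
```

In this made-up data Gibsonia resolves because one candidate lies in the box, while Midway stays unresolved because two do.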

Page 33: Why Geolocating Written Routes Is Harder Than It Looks

Route Formation

Once you have a beginning and an end for the route, attempt to determine the road segments that form the route.

Note: not all road segments are named, nor do they all join correctly.

Where possible, determine turns from one road to another and use these to truncate the highlighted sections of road.

Pass the details to the mapping server and display
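One possible way to pick out the segments, shown here as a breadth-first search that must follow the named streets in the order the directions mention them. This is an illustrative technique, not necessarily what the project does; the segment list is a schematic of the Callan Theatre example rather than real map data, and `follow_streets` is a hypothetical helper:

```python
from collections import deque

def follow_streets(segments, start, end, streets):
    """Find a node path from start to end that uses `streets` in order.

    segments: iterable of (node_a, node_b, stemmed_street_name)
    streets:  stemmed street names in the order the directions give them
    """
    graph = {}
    for a, b, name in segments:
        graph.setdefault(a, []).append((b, name))
        graph.setdefault(b, []).append((a, name))

    # Search state is (node, index of the street currently being followed).
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    while queue:
        node, idx, path = queue.popleft()
        if node == end and idx == len(streets) - 1:
            return path
        for nxt, name in graph.get(node, []):
            if name == streets[idx]:
                new_idx = idx                      # stay on the same street
            elif idx + 1 < len(streets) and name == streets[idx + 1]:
                new_idx = idx + 1                  # turn onto the next street
            else:
                continue                           # segment not in the directions
            if (nxt, new_idx) not in seen:
                seen.add((nxt, new_idx))
                queue.append((nxt, new_idx, path + [nxt]))
    return None

# Schematic, hypothetical segments (node names are made up).
SEGMENTS = [
    ("16th&Irving", "Georgia&Irving", "irving"),
    ("Georgia&Irving", "Irving&Michigan", "irving"),
    ("Irving&Michigan", "Michigan&Harewood", "michigan"),
    ("Irving&Michigan", "Elsewhere", "michigan"),
    ("Michigan&Harewood", "Theatre", "harewood"),
]

print(follow_streets(SEGMENTS, "16th&Irving", "Theatre",
                     ["irving", "michigan", "harewood"]))
```

The turn points in the returned path are where one street's highlighting would stop and the next begin. Unnamed or badly joined segments, noted above, would break this simple version and need extra handling.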

Page 34: Why Geolocating Written Routes Is Harder Than It Looks

Further Work

Improve detection of streets, POI and places (linguistics).
  Take the X exit (X is probably a place)
  Turn left at X (X is probably a POI)
  Turn left on X (X is probably a street)

Improve route determination
  1. Fix up streets database: naming and joins
  2. Imprecise routing
     Given a set of POI, which roads pass nearby?
     Probably a graph problem?
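The three cue phrases proposed for better detection could start out as simple patterns. A sketch with regular expressions; the patterns, labels and `guess_types` are assumptions for illustration, and real directions would need many more cues:

```python
import re

# (pattern, guessed type of X), following the three cues listed above.
CUES = [
    (re.compile(r"\btake the (?P<x>.+?) exit\b", re.I), "place"),
    (re.compile(r"\bturn (?:left|right) at (?P<x>[^,.;]+)", re.I), "poi"),
    (re.compile(r"\bturn (?:left|right) (?:on|onto) (?P<x>[^,.;]+)", re.I),
     "street"),
]

def guess_types(sentence):
    """Return (X, guessed type) for every cue phrase found in the sentence."""
    return [
        (match.group("x").strip(), label)
        for pattern, label in CUES
        for match in pattern.finditer(sentence)
    ]

print(guess_types("Take the Bellefonte exit."))
print(guess_types("Turn left at the gas station, then turn right onto Harewood."))
print(guess_types("Irving dead-ends on Michigan Avenue; turn left onto Michigan."))
```

In the last sentence the cue marks Michigan as a street, which is the hint needed to avoid matching the State of Michigan.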