progress report 0729
Progress Report
2010/7/29
Shu-Ying Li
Associated information
There are two possible cases:
1. Two different layouts in a web page.
2. No obvious boundary among extracted addresses.
We apply a parser and Tidy to the web pages and consider the following two fields:
1. Path based on number
2. Terminal value
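The two fields can be illustrated with a minimal sketch. It assumes the page has already been cleaned into well-formed XHTML (for example by Tidy), and it assumes that a "path based on number" is the sequence of 1-based child positions from the root down to a node. The function name `numbered_paths` and the sample markup are hypothetical, and text that follows a child element (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

def numbered_paths(xhtml):
    """Return (path, terminal_value) pairs for every element with text.

    Assumption: a path is the chain of 1-based child positions from the
    root, joined with backslashes, e.g. 1\\1\\1\\2.
    """
    root = ET.fromstring(xhtml)
    results = []

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            results.append(("\\".join(map(str, path)), text))
        # Number the children of this node starting from 1.
        for index, child in enumerate(node, start=1):
            walk(child, path + [index])

    walk(root, [1])
    return results

# Hypothetical, already tidied page fragment.
sample = (
    "<html><body><div>"
    "<p>1410 Pines Road, Oregon, IL, 61061, USA</p>"
    "<p>Information1</p>"
    "</div></body></html>"
)
pairs = numbered_paths(sample)
for path, value in pairs:
    print(path, value)
```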
Method
Find the position of address a1 and record its path pa.
For each path_i:
    If path_i matches pa:
        extract the corresponding information
    If path_i matches pa and length(path_i) > length(pa):
        extract the corresponding information
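The method can be sketched in a few lines of Python. This is only one reading of the steps, under two assumptions: in the first rule, "match" means the path is identical to pa, and in the second rule it means the path starts with pa and is longer. The helper names `parse_path` and `extract_associated`, the pairing of paths with values, and the unrelated fourth record are hypothetical.

```python
def parse_path(text):
    """Turn a path such as 1\\2\\8 into a list of integers."""
    return [int(part) for part in text.split("\\")]

def extract_associated(records, address):
    """records: list of (path, terminal_value) pairs.

    Returns the terminal values associated with the given address.
    """
    # Step 1: find address a1 and record its path pa.
    pa = None
    for path, value in records:
        if value == address:
            pa = path
            break
    if pa is None:
        return []

    extracted = []
    for path, value in records:
        if value == address:
            continue  # skip the address itself
        # Rule 1 (assumed): path_i is identical to pa.
        same_path = path == pa
        # Rule 2 (assumed): path_i starts with pa and is longer than pa.
        longer_with_prefix = len(path) > len(pa) and path[:len(pa)] == pa
        if same_path or longer_with_prefix:
            extracted.append(value)
    return extracted

address = "1410 Pines Road, Oregon, IL, 61061, USA"
records = [
    (parse_path(r"1\2\8\1\1\1\1\1\4\3\2"), "Information1"),
    (parse_path(r"1\2\8\1\1\1\1\1\4\3\2\4"), "Information2"),
    (parse_path(r"1\2\8\1\1\1\1\1\4\3\2"), address),
    (parse_path(r"1\2\9\1"), "Unrelated footer text"),  # hypothetical
]
result = extract_associated(records, address)
print(result)
```

Under these assumptions, a node that shares the address path or hangs below it is treated as associated information, while a node elsewhere in the tree is skipped.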
Path based on number
1\2\8\1\1\1\1\1\4\3\2
1\2\8\1\1\1\1\1\4\3\2\4
1\2\8\1\1\1\1\1\4\3\2
Terminal value
Information1
Information2
1410 Pines Road, Oregon, IL, 61061, USA (address a1, path pa)
Case 1: No obvious boundary among extracted addresses (1/2)
Case 1: No obvious boundary among extracted addresses (2/2)
Case 2: Two different layouts in a web page (1/2)
Case 2: Two different layouts in a web page (2/2)
[Figure labels: miss, miss]
Case 3:
[Figure label: miss]
Discussion: Using ANNIE annotation
Annotation types: Organization, Person, Date, Address, Location
Discussion: Using ANNIE annotation
[Figure labels: Organization, Address, Location]
Discussion: Using ANNIE annotation
[Figure labels: Date, Organization, Person, Location]
Discussion: Using ANNIE annotation
[Figure labels: Organization, Date, Location, Person]