REGNET: A Comparative Analysis Framework for Semi-Structured Documents, with Applications to Government Regulations

Gloria Lau, Engineering Informatics Group, Stanford University, May 14th, 2004


Page 1: REGNET

REGNET

A Comparative Analysis Framework for Semi-Structured Documents, with Applications to Government Regulations

Gloria Lau
Engineering Informatics Group, Stanford University
May 14th, 2004

Page 2: REGNET

Motivation

Multiple sources of regulations
  Multiple jurisdictions: federal, state, local, etc.
  Different formats, terminologies, contexts
  Amending rules, conflicting ideas

[Figure: screenshots of ADAAG in HTML, UK DDA in HTML and IBC in PDF]

Page 3: REGNET

Motivation

Multiple sources of regulations
  Multiple jurisdictions: federal, state, local, etc.
  Different formats, terminologies, contexts
  Amending rules, conflicting ideas

Need for a repository
  Locate relevant information
  E.g., small business: penalty fees for violations

Need for analysis tool
  Complexity of regulations
  Multiple jurisdictions
  Understanding of regulations & their relationships

Page 4: REGNET

Example 1: Related Provisions

ADAAG Appendix 4.6.3

… Such a curb ramp opening must be located within the access aisle boundaries, not within the parking space boundaries.

CBC 1129B.4.3

… Ramps shall not encroach into any parking space.

Exception: 1. Ramps located at the front of accessible parking spaces may encroach into the length of such spaces …

CBC allows curb ramps encroaching into accessible parking stall access aisles, while ADA disallows encroachment into any portion of the stall.

Page 5: REGNET

Example 2: Related but Conflicting Provisions

ADAAG 4.7.2 Slope. … Transitions from ramps to walks, gutters, or streets shall be flush and free of abrupt changes …

CBC 1127B.5.5 Beveled lip. The lower end of each curb ramp shall have a ½ inch (13 mm) lip beveled at 45 degrees as a detectable way-finding edge for persons with visual impairments.

ADAAG focuses on wheelchair traversal; CBC focuses on the visually impaired when using a cane.

Page 6: REGNET

Scope

Repository development
Relatedness analysis
Performance evaluation, results and applications

[Figure: system overview in two parts. Repository development: regulations in HTML, PDF, plain text, etc.; shallow parser; XML regulations; feature extractor (Semio, ontology, domain expert); generic and domain-specific features (concepts, measurements, exceptions, definitions, author-prescribed indices, glossary terms, chemicals, effective dates); refined XML regulations. Relatedness analysis: Similarity Analysis Core fed by the refined XML regulations and by domain knowledge (author-prescribed indices, ontology with synonymic information, domain-specific scoring algorithm, ...); feature matching (concepts, measurements, effective dates, drinking water contaminants) gives the base score; score refinements (neighbor inclusion, reference distribution) give the refined score; below-threshold pairs are discarded, leaving the related pairs.]

Page 7: REGNET

Repository development

[Figure: repository development part of the system overview: regulations in HTML, PDF, plain text, etc.; shallow parser; XML regulations; feature extractor (Semio, ontology, domain expert); generic and domain-specific features (concepts, measurements, exceptions, definitions, author-prescribed indices, glossary terms, chemicals, effective dates); refined XML regulations]

Page 8: REGNET

Sources of data

Accessibility standards
  Americans with Disabilities Act Accessibility Guidelines (ADAAG)
    Drafted chapter for rights-of-way access
    Associated public comments
  Uniform Federal Accessibility Standards (UFAS)
  British Standard BS 8300
  Scottish Technical Standards, Part S
  International Building Code (IBC), Chapter 11

Drinking water standards
  Code of Federal Regulations, Title 40 (40 CFR)
  California Code of Regulations, Title 22 (22 CCR)

Fire code
  International Building Code (IBC), Chapter 9

Page 9: REGNET

Computational properties of regulations

Hierarchical tree structure
Referential structure
Discipline-centered, e.g., ADAAG for accessibility
Shallow parser to capture computational properties

[Figure: ADAAG drawn as a tree with an unbounded number of descendants and unbounded tree depth. Section 4 has child nodes 4.1, ..., 4.5 "Ground and Floor Surfaces", ..., 4.7, ..., 4.9. Section 4.7.4 "Surface. Slopes of curb ramps shall comply with 4.5." is a child node under 4.7 and points to 4.5 as a reference node.]

Page 10: REGNET

Digital publication of regulations

Current standard: HTML, PDF, plain text...
Our system standard: XML
  Recreate regulatory structure
  Unit of extraction: section/provision
  Extract references
  Extract features

<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies"> …
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" /> …
      <regText>Assembly areas with fixed seating shall comply … </regText>
      <regElement id="ibc.1107.2.1" name="services">...</regElement>
      <regElement id="ibc.1107.2.2" name="wheelchair …">...</regElement>
    </regElement>
  </regElement>
</regulation>
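As a rough illustration of how such an XML regulation could be consumed, the sketch below walks the nested regElement tree and collects the concepts and references attached to each provision. The sample string is a hypothetical, well-formed fragment modelled on the IBC example (the elided parts are simply left out), and the function name `walk` is an assumption, not part of the REGNET system.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the IBC example above.
SAMPLE = """
<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies">
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" />
      <regText>Assembly areas with fixed seating shall comply ...</regText>
      <regElement id="ibc.1107.2.1" name="services" />
      <regElement id="ibc.1107.2.2" name="wheelchair ..." />
    </regElement>
  </regElement>
</regulation>
"""


def walk(element, depth=0, rows=None):
    """Collect (depth, id, concepts, references) for every nested regElement."""
    if rows is None:
        rows = []
    for child in element.findall("regElement"):
        # Concept name -> frequency, as stored in the "times" attribute.
        concepts = {c.get("name"): int(c.get("times")) for c in child.findall("concept")}
        # Ids of the provisions this provision refers to.
        references = [r.get("id") for r in child.findall("reference")]
        rows.append((depth, child.get("id"), concepts, references))
        walk(child, depth + 1, rows)
    return rows


root = ET.fromstring(SAMPLE)
provisions = walk(root)
for depth, provision_id, concepts, references in provisions:
    print("  " * depth + provision_id, concepts, references)
```

The depth-first order mirrors the section numbering, so each provision is visited right after its parent.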

Page 11: REGNET

Shallow parser: feature extraction

Combination of handcrafted rules and software tools

Generic features
  Concepts - noun phrases
  Exceptions - negated provisions
  Definitions - terminologies defined in regulations

Domain-specific features
  Non-structural characteristics specific to a corpus
  To aid user retrieval of relevant materials
  For analysis purpose: domain knowledge
  Glossary terms - definitions from reference guides
  Author-prescribed indices - concepts from field handbooks
  Measurements - e.g., 2 inches max, 4 ppm
  Chemicals - list of drinking water contaminants from EPA
  Effective dates - provision updates

[Figure: repository development part of the system overview, repeated]
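A handcrafted rule for the measurement feature might look like the sketch below. It is only an illustration: the regular expression, the unit table and the function name `extract_measurements` are assumptions, and the actual REGNET rules are not shown on the slides.

```python
import re

# Assumed normalisation table for a handful of units.
UNITS = {"in": "inch", "inch": "inch", "inches": "inch", "mm": "mm", "ppm": "ppm", "ft": "ft"}

# Assumed mapping from quantifier wording to "min" / "max".
QUANTIFIERS = {
    "at least": "min", "minimum": "min", "min": "min",
    "at most": "max", "maximum": "max", "max": "max",
}

# Optional leading quantifier, a number, a unit, and an optional trailing quantifier.
# Note: "in" is also a preposition, so a real rule set would need extra context checks.
PATTERN = re.compile(
    r"(?:(?P<lead>at least|at most)\s+)?"
    r"(?P<magnitude>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>inches|inch|in|mm|ppm|ft)\b"
    r"(?:\s+(?P<trail>maximum|minimum|max|min)\b)?",
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return a list of {magnitude, unit, quantifier} dicts found in text."""
    found = []
    for match in PATTERN.finditer(text):
        wording = (match.group("lead") or match.group("trail") or "").lower()
        found.append({
            "magnitude": float(match.group("magnitude")),
            "unit": UNITS[match.group("unit").lower()],
            "quantifier": QUANTIFIERS.get(wording),  # None when no quantifier is present
        })
    return found


print(extract_measurements("2 inches max"))
print(extract_measurements("4 ppm"))
print(extract_measurements("shall be at least 96 in wide"))
```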

Page 12: REGNET

Example of indexTerm, concept, measurement & exception features

Original Section 4.6.3 from the UFAS

4.6.3* PARKING SPACES. Parking spaces for disabled people shall be at least 96 in (2440 mm) wide and shall have an adjacent access aisle 60 in (1525 mm) wide minimum (see Fig. 9). Parking access aisles shall be part of ...

EXCEPTION: … an adjacent access aisle at least 96 in (2440 mm) wide complying with 4.5...

Refined Section 4.6.3 in XML format

<regElement name="ufas.4.6.3" title="parking spaces" asterisk="1">
  <concept name="access aisl" num="3" /> …
  <indexTerm name="park space" num="4" />
  <measurement unit="inch" magnitude="96" quantifier="min" />
  <ref name="ufas.4.5" num="1" />
  <regText> Parking spaces for disabled people shall ... </regText>
  <exception> If accessible parking spaces for ... </exception>
</regElement>

Page 13: REGNET

Scope

Repository development
Relatedness analysis
Performance evaluation, results and applications

[Figure: system overview (repository development and relatedness analysis), repeated from the first Scope slide]

Page 14: REGNET

Relatedness analysis

Related elements: door and entrance

ADAAG 4.1.6(3)(d) Doors
(i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection ...

UFAS 4.14.1 Minimum Number
Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with ...

[Figure: Similarity Analysis Core: refined XML regulations and domain knowledge (author-prescribed indices, ontology with synonymic information, domain-specific scoring algorithm, ...) feed feature matching (concepts, measurements, effective dates, drinking water contaminants) to give the base score; score refinements (neighbor inclusion, reference distribution) give the refined score; below-threshold pairs are discarded, leaving related pairs]

Page 15: REGNET

Relatedness analysis

To utilize the computational properties of regulations for a complete comparison

Measure
  Degree of relatedness: similarity score $f(A, U) \in (0, 1)$
  Nodes A and U are provisions from two different regulation trees

[Figure: node A in the ADAAG tree and node U in the UFAS tree are the nodes in comparison, linked by $f \in (0, 1)$. The parent, sibling and child nodes of each form psc(A) and psc(U); ref(U) marks the reference nodes of U. Legend: child node, reference node, nodes in comparison.]

Page 16: REGNET

Base score $f_0$ computation

Linear combination of feature matching
  $f_0(A, U) = \sum_{i=1}^{N} \beta_i \, F(A, U, i)$
  $F(A, U, i)$ = similarity score between Sections (A, U) based on feature $i$
  $N$ = total number of features
  $\beta_i$ = weighting coefficient, with $\sum_{i=1}^{N} \beta_i = 1$

Feature matching
  Based on the vector model, using cosine similarity as the distance between feature vectors
  Similarity between two documents M and N = $\dfrac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|}$
  $\vec{d}_M$ and $\vec{d}_N$ are document vectors
  E.g., $i$ = concept feature
    Concept vectors are formed per provision based on concept frequency in each provision
    F(provision M, provision N, i = concept) = cosine between the 2 concept vectors

[Figure: Similarity Analysis Core diagram, repeated]
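The base score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: feature vectors are stored as sparse dictionaries, every feature is matched with a plain cosine, and the example frequencies and weights are hypothetical rather than taken from the REGNET corpus.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors given as {axis: weight} dicts."""
    dot = sum(weight * v.get(axis, 0.0) for axis, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty feature vector matches nothing
    return dot / (norm_u * norm_v)


def base_score(features_a, features_u, weights):
    """Weighted sum over features of the per-feature cosine score F(A, U, i)."""
    return sum(
        weight * cosine(features_a.get(name, {}), features_u.get(name, {}))
        for name, weight in weights.items()
    )


# Hypothetical per-provision feature vectors (frequency counts).
section_a = {"concept": {"access aisl": 3, "park space": 4}, "measurement": {"96 inch min": 1}}
section_u = {"concept": {"access aisl": 1, "curb ramp": 2}, "measurement": {"96 inch min": 1}}
weights = {"concept": 0.5, "measurement": 0.5}  # assumed weights, summing to 1

f0 = base_score(section_a, section_u, weights)
print(round(f0, 4))
```

Because the weights sum to 1 and each cosine lies in [0, 1] for non-negative frequencies, the base score stays in [0, 1] as well.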

Page 17: REGNET

Axis dependency: non-Boolean matching

Vector model assumes mutual independence between axes
Domain experts do not necessarily agree
  A measurement of “2 inches max” can be a 70% match to “2 inches”
  Synonyms exist, e.g., ontology defined for chemicals
Limitation observed
  Need flexibility to model domain knowledge, such as a 0, 50%, 75% and 100% measurement match

[Figure: measurement match scores among “2 ppm”, “2 ppm min” and “2 ppm max”; each measurement matches itself with score 1, and the pairwise scores shown are 0.75, 0.75 and 0.5]

Page 18: REGNET

Proposed non-Boolean matching model

Define a feature matching matrix $E$
  $E_{ij}$ = % match between features $i$ and $j$
  E.g., a 3-dimensional vector space using “2 ppm”, “2 ppm max” and “2 ft” as the first, second and third measurement axes:
  $E = \begin{pmatrix} 1 & 0.75 & 0 \\ 0.75 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Vector space transformation before cosine computation
  Map feature vectors onto an alternate space to form consolidated frequency vectors
  E.g., based on measurement features, with measurement vectors $m_A$ and $m_U$:
  Cosine similarity = $\dfrac{m_A^T E \, m_U}{\sqrt{m_A^T E \, m_A} \, \sqrt{m_U^T E \, m_U}}$
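The effect of the feature matching matrix can be checked numerically. The sketch below uses the 3-axis example from the slide and compares a provision mentioning only “2 ppm” with one mentioning only “2 ppm max”; the function names and the two test vectors are assumptions made for illustration.

```python
import math

# Axes: "2 ppm", "2 ppm max", "2 ft" (the example from the slide).
E = [
    [1.0, 0.75, 0.0],
    [0.75, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]


def bilinear(x, matrix, y):
    """Return x^T * matrix * y for plain Python lists."""
    return sum(x[i] * matrix[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def matched_cosine(m_a, m_u, matrix):
    """Cosine similarity with a feature matching matrix between the axes."""
    denominator = math.sqrt(bilinear(m_a, matrix, m_a) * bilinear(m_u, matrix, m_u))
    return bilinear(m_a, matrix, m_u) / denominator if denominator else 0.0


m_a = [1.0, 0.0, 0.0]  # hypothetical provision A: one mention of "2 ppm"
m_u = [0.0, 1.0, 0.0]  # hypothetical provision U: one mention of "2 ppm max"

plain = matched_cosine(m_a, m_u, IDENTITY)  # Boolean matching: axes treated as independent
partial = matched_cosine(m_a, m_u, E)       # non-Boolean matching through E
print(plain, partial)
```

With the identity matrix the two provisions share nothing, while the matrix $E$ credits them with the 75% match between “2 ppm” and “2 ppm max”.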

Page 19: REGNET

Score refinements based on regulation structure

Neighbor inclusion
  Diffusion of similarity between clusters of nodes in the tree
  Self vs. parent-sibling-child (psc), $f_{s\text{-}psc}$
  psc vs. psc, $f_{psc\text{-}psc}$

[Figure: nodes A (ADAAG) and U (UFAS) with their parent, sibling and child nodes forming psc(A) and psc(U); the comparisons are labeled $f_0$, s-psc and psc-psc]

[Figure: Similarity Analysis Core diagram, repeated]

Page 20: REGNET

Neighbor inclusion: psc vs. psc

Take a linear combination of neighboring pair scores
  Formulate a neighbor structure matrix $N$
  Define score matrix $\Phi$
  We have $\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T$

[Figure: ADAAG nodes A1, A2 and UFAS nodes U1, U2 with their neighbor sets psc(A1), psc(A2), psc(U1), psc(U2). Legend: child node, nodes in comparison, spread of similarity; red: similar nodes; blue: dissimilar nodes.]

Page 21: REGNET

Neighbor inclusion: self vs. psc

Take a linear combination of neighbor vs. self scores
  Formulate a neighbor structure matrix $N$
  Define score matrix $\Phi$
  We have $\Phi_{s\text{-}psc} = \frac{1}{2}(\Phi_0 N_U^T + N_A \Phi_0)$

[Figure: ADAAG nodes A1, A2 and UFAS nodes U1, U2 with neighbor sets psc(U1) and psc(A2). Legend: child node, nodes in comparison, spread of similarity; red: similar nodes; blue: dissimilar nodes.]

Page 22: REGNET

Score refinements based on regulation structure

Reference distribution
  Diffusion of similarity between referencing nodes and referenced nodes in the tree
  E.g., f(A5.3, U6.4(a)) updates f(A2.1, U3.3)

[Figure: ADAAG with Sections 2.1 and 5.3 and UFAS with Sections 3.3 and 6.4(a). Labels: "reference" within each regulation, "no cross reference" between the regulations, and "similar sections: $f_0 \neq 0$".]

Page 23: REGNET

Reference distribution: s-ref and ref-ref

Take a linear combination of reference vs. self and reference vs. reference scores
  Formulate a reference structure matrix $R$
  Define score matrix $\Phi$
  We have $\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T$ and $\Phi_{s\text{-}ref} = \frac{1}{2}(\Phi_0 R_U^T + R_A \Phi_0)$

[Figure: two panels, each showing ADAAG sections A1 to A3 and UFAS sections U1 to U3 with reference arrows: "s-ref comparisons of Sections A2, U2" and "ref-ref comparisons of Sections A2, U2"]

Page 24: REGNET

Final score: linear combination of $\Phi$'s

$\Phi_{final} = \alpha_0 \Phi_0 + \alpha_{s\text{-}psc} \Phi_{s\text{-}psc} + \alpha_{psc\text{-}psc} \Phi_{psc\text{-}psc} + \alpha_{s\text{-}ref} \Phi_{s\text{-}ref} + \alpha_{ref\text{-}ref} \Phi_{ref\text{-}ref}$

where
  $\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T$
  $\Phi_{s\text{-}psc} = \frac{1}{2}(\Phi_0 N_U^T + N_A \Phi_0)$
  $\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T$
  $\Phi_{s\text{-}ref} = \frac{1}{2}(\Phi_0 R_U^T + R_A \Phi_0)$

  $\alpha_0 > \alpha_{s\text{-}psc} > \alpha_{psc\text{-}psc} > 0$
  $\alpha_0 > \alpha_{s\text{-}ref} > \alpha_{ref\text{-}ref} > 0$
  $\alpha_0 + \alpha_{s\text{-}psc} + \alpha_{psc\text{-}psc} + \alpha_{s\text{-}ref} + \alpha_{ref\text{-}ref} = 1$
  $\alpha$ = structural weighting coefficient

[Figure: Similarity Analysis Core diagram, repeated]

Page 25: REGNET

Scope

Repository development
Relatedness analysis
Performance evaluation, results and applications

[Figure: system overview (repository development and relatedness analysis), repeated from the first Scope slide]

Page 26: REGNET

Performance evaluation

Conduct a user survey of rankings of similarity
  10 randomly chosen sections from the ADAAG and UFAS
  Ranks 1 to 100 in the order of relevance

Root mean square error (RMSE)
  $\vec{r}_h$ = user-generated ranking vector
  $\vec{r}_m$ = machine-predicted ranking vector
  $RMSE = \dfrac{|\vec{r}_h - \vec{r}_m|}{\sqrt{\text{length of } \vec{r}_h}} = \sqrt{\dfrac{(r_{h1} - r_{m1})^2 + \dots + (r_{hN} - r_{mN})^2}{N}}$
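For concreteness, the RMSE between a user-generated and a machine-predicted ranking vector can be computed as in the sketch below. The function name and the two short ranking vectors are hypothetical; the survey itself ranks 100 section pairs.

```python
import math


def rmse(user_ranks, machine_ranks):
    """Root mean square error between two ranking vectors of equal length."""
    if len(user_ranks) != len(machine_ranks):
        raise ValueError("ranking vectors must have the same length")
    squared_error = sum((h - m) ** 2 for h, m in zip(user_ranks, machine_ranks))
    return math.sqrt(squared_error / len(user_ranks))


r_h = [1, 2, 3, 4]  # hypothetical user-generated ranks
r_m = [2, 1, 3, 4]  # hypothetical machine-predicted ranks
print(rmse(r_h, r_m))
```

A smaller RMSE means the machine ranking is closer to the human ranking, which is how the comparison with LSI is scored.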

Page 27: REGNET

Survey results - Tabulated RMSEs

Compared our analysis to Latent Semantic Indexing (LSI)
  $\alpha$ = structural weighting coefficient
  $\beta$ = feature weighting coefficient
  Average RMSE smaller than LSI
  Measurement feature performs best
  No improvement in result observed for structural comparison

Page 28: REGNET

Results of comparisons: ADAAG vs. UFAS

Related accessible elements: door and entrance
  No ontological information
  Neighbor inclusion reveals higher similarity
  Content of neighbors implies similarity between Section 4.1.6(3)(d) in ADAAG and Section 4.14.1 in UFAS

ADA Accessibility Guidelines 4.1.6(3)(d) Doors
(i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection of 5/8 in maximum will be permitted for the latch side stop.
(ii) If existing thresholds are 3/4 in high or less, and have (or are modified to have) a beveled edge on each side, they may remain.

Uniform Federal Accessibility Standards 4.14.1 Minimum Number
4.14 Entrances
4.14.1 Minimum Number
Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with 4.3. Such entrances shall be connected by an accessible route to public transportation stops, to accessible parking and passenger loading zones, and to public streets or sidewalks if available (see 4.3.2(1)). They shall also be connected by an accessible route to all accessible spaces or elements within the building or facility.

Page 29: REGNET

Results of comparisons: UFAS vs. BS8300

Terminological differences - revealed through neighbor inclusion

[Figure: UFAS 4.13 Doors (parent) with sibling sections 4.13.1, 4.13.2, 4.13.3, 4.13.9 Door Hardware and 4.13.12, compared with BS8300 12.5.4 Doors (parent) with sections 12.5.4.1 and 12.5.4.2 Door Furniture]

Uniform Federal Accessibility Standards 4.13.9 Door Hardware
4.13 Doors
4.13.1 General ...
4.13.9 Door Hardware
Handles, pulls, latches, locks, and other operating devices on accessible doors shall have a shape that is easy to grasp with one hand and does not require tight grasping ...
...
4.13.12 Door Opening Force

British Standard 8300 12.5.4.2 Door Furniture
12.5.4 Doors
12.5.4.1 Clear Widths of Door Openings
12.5.4.2 Door Furniture
Door handles on hinged and sliding doors in accessible bedrooms should be easy to grip and operate by a wheelchair user or ambulant disabled person ...

Page 30: REGNET

Results of comparisons: 40CFRdw vs. 22CCRdw

Top ranked: Almost identical provisions, change of enforcing agency

Code of Federal Regulations Title 40 141.32.e.16 Barium The United States Environmental Protection Agency (EPA) sets drinking water standards and has determined that barium ... In humans, EPA believes that effects from barium on blood pressure should not occur below 2 parts per million (ppm) in drinking water. EPA has set the drinking water standard for barium at 2 parts per million (ppm) to protect against the risk of these adverse health effects. Drinking water that meets the EPA standard is associated with little to none of this risk and is considered safe with respect to barium.

California Code of Regulations Title 22 64468.1(c) Barium The California Department of Health Services (DHS) sets drinking water standards and has determined that barium ... In humans, DHS believes that effects from barium on blood pressure should not occur below 2 parts per million (ppm) in drinking water. DHS has set the drinking water standard for barium at 1 part per million (ppm) to protect against the risk of these adverse health effects. Drinking water that meets the DHS standard is associated with little to none of this risk and is considered safe with respect to barium.

Page 31: REGNET

Results of comparisons: 40CFRdw vs. 22CCRdw

Use of ontological information
  40 CFR uses chemical acronyms, e.g., TTHM
  22 CCR spells out “total trihalomethanes”

Code of Federal Regulations Title 40 141.132.a.2 [No Title; under Monitoring Requirements] Systems may consider multiple wells drawing water from a single aquifer as one treatment plant for determining the minimum number of TTHM and HAA5 samples required, with State approval in accordance with criteria developed under §142.16(h)(5) of this chapter.

California Code of Regulations Title 22 64823(e) [No Title; under Field of Testing] Field of Testing 5 consists of those methods whose purpose is to detect the presence of trace organics in the determination of drinking water quality and do not require the use of a gas chromatographic/mass spectrophotometric device and encompasses the following Subgroups: ... EPA method 501.2 for trihalomethanes; EPA method 510 for total trihalomethanes; EPA method 508 for chlorinated pesticides; ... EPA method 552 for haloacetic acids.

Page 32: REGNET

Application: e-rulemaking

Application domain: e-rulemaking
  Comparison between draft of rules and the associated public comments

ADAAG Chapter 11, rights-of-way draft
  Less than 15 pages
  Over 1400 public comments received within 4 months
  Comments ~ 10 MB in size; most are several pages long
  New regulation draft can easily generate a huge amount of data that needs to be reviewed and analyzed

Parsing of the draft and comments
  From HTML to XML
  Recreate structure of the draft using our shallow parser
  Extract features from the draft and comments
  Treat individual comments as provisions

Page 33: REGNET

E-rulemaking

Drafted regulations compared with public comments

[Figure: screenshot showing the content of Section 1105.4 together with its 6 related public comments, listed as "1105.4 [6]"]

Page 34: REGNET

Results from e-rulemaking application

Related section in draft and public comment

ADAAG Chapter 11 Rights-of-way Draft 1105.4.1 Length Where signal timing is inadequate for full crossing of all traffic lanes or where the crossing is not signalized, cut-through medians and pedestrian refuge islands shall be 72 inches (1830 mm) minimum in length in the direction of pedestrian travel.

Public Comment Deborah Wood, October 29, 2002 I am a member of The American Council of the Blind. I am writing to express my desire for the use of audible pedestrian traffic signals to become common practice. Traffic is becoming more and more complex, and many traffic signals are set up for the benefit of drivers rather than of pedestrians. This often means walk lights that are so short in duration that by the time a person who is blind realizes they have the light, the light has changed or is about to change, and they must wait for the next walk light. ...

Page 35: REGNET

Results from e-rulemaking application

No related provisions identified Concern not addressed in the draft

ADAAG Chapter 11 Rights-of-way Draft [None Retrieved] No relevant provision identified

Public Comment Donna Ring, September 6, 2002 If you become blind, no amount of electronics on your body or in the environment will make you safe and give back to you your freedom of movement. You have to learn modern blindness skills from a good teacher. You have to practice your new skills. Poor teaching cannot be solved by adding beeping lights to every big Street corner! I am blind myself. I travel to work in downtown Baltimore and back home every workday by myself. I go to meetings and musical events around town. I use the city bus and I walk, sometimes I take a cab or a friend drives me. Some of the blind people who work where I do are so poor at travel they can only use that lousy “mobility service” or pay a cab. Noisy street corners won’t help them ...

Page 36: REGNET

Contributions

A framework for regulatory repository
  Structure of regulations recreated in XML
  Feature extractions

Prototype for similarity comparisons
  Contextual comparisons
  Domain knowledge
  Structural comparisons

Performance Evaluation, Results and Applications
  User survey and comparisons with LSI
  Observations of comparisons between Federal, State, non-profit organization mandated codes and European standards
    Accessibility
    Drinking water control
  Application on e-rulemaking

Page 37: REGNET

Future research directions

In the legal domain
  Regulatory competition
    Cross border data transfer laws
    Especially in the polyglot countries in EU
  Regulatory updates
    Track changes in updates
    Track cross references between regulations

Extension of application to other domains of semi-structured documents
  Software specifications
  User manuals

Similarity/relatedness is settled - how about differences and conflicts?
  Drinking water example of almost identical provisions

Page 38: REGNET

Acknowledgments

Committee members
  Prof. Kincho Law
  Prof. Gio Wiederhold
  Prof. Hans Bjornsson
  Prof. Cary Coglianese
  Prof. Hector Garcia-Molina, defense chair

Family, friends and everyone in the Engineering Informatics Group
  Especially REGNET/REGBASE project members

This research is sponsored by the National Science Foundation

Page 39: REGNET

Thank You!

Page 40: REGNET

Backup Slides

Page 41: REGNET

Natural tree hierarchy rendered by SpaceTree

Page 42: REGNET

Concept ontology

Page 43: REGNET

Semantics of relatedness/similarity

Similar: having characteristics in common; strictly comparable; alike in substance or essentials; not differing in shape but only in size or position.

Related: connected by reason of an established or discoverable relation.

Similarity is not static; it can depend on one’s viewpoint and desired outcome.

“Related” provisions are more interesting, e.g., the conflicting cases.

Traditionally, it is called a “similarity score”.

Page 44: REGNET

Cosine similarity

A document is represented as an n-entry vector $\vec{d}_M = (w_{1,M}, w_{2,M}, \dots, w_{n,M})$, where $n$ is the total number of index terms in the corpus.

Similarity between two documents = $\dfrac{\sum_{i=1}^{n} w_{i,M} \, w_{i,N}}{\sqrt{\sum_{i=1}^{n} w_{i,M}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,N}^2}}$

E.g., we take the frequency count of concept $i$ as the concept weight $w_{i,M}$ in $\vec{d}_M = (w_{1,M}, w_{2,M}, \dots, w_{n,M})$.

Page 45: REGNET

Example of feature vectors

Traditional term match
  Each index term $i$ is assigned a positive and non-binary weight $w_{i,M}$ in each document vector $\vec{d}_M$

Weight selection
  Frequency of term, or
  tf × idf model
    tf = term frequency; term density
    idf = inverse document frequency = $\log(n / n_i)$; term rarity
  Excluding stopwords
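A tf × idf weighting of this kind can be sketched as follows. The slide does not define $n$ and $n_i$ here, so the sketch assumes the usual reading: $n$ is the number of documents (provisions) and $n_i$ the number of documents containing term $i$. The stopword list, tokenizer and sample sentences are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list would be much longer).
STOPWORDS = {"the", "a", "of", "shall", "be", "and", "to"}


def tf_idf_vectors(documents):
    """Return one {term: tf * idf} dict per document, excluding stopwords."""
    tokenized = [[t for t in doc.lower().split() if t not in STOPWORDS] for doc in documents]
    n = len(tokenized)
    # n_i: number of documents in which term i appears at least once.
    document_frequency = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_frequency = Counter(tokens)
        vectors.append({
            term: count * math.log(n / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return vectors


docs = [
    "ramps shall not encroach into parking space",
    "parking space shall be accessible",
    "doors shall be accessible",
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

Rare terms such as "ramps" receive a larger weight than terms shared by several provisions, which is the term rarity effect named on the slide.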

Page 46: REGNET

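Vector space transformation

Define $D$ such that $E = D^T D$ is fulfilled
Cosine between the consolidated frequency vectors $m'_A = D m_A$ and $m'_U = D m_U$:

$= \dfrac{m'_A \cdot m'_U}{|m'_A| \, |m'_U|}$

$= \dfrac{D m_A \cdot D m_U}{|D m_A| \, |D m_U|}$

$= \dfrac{(D m_A)^T (D m_U)}{\sqrt{(D m_A)^T (D m_A)} \, \sqrt{(D m_U)^T (D m_U)}}$

$= \dfrac{m_A^T D^T D \, m_U}{\sqrt{m_A^T D^T D \, m_A} \, \sqrt{m_U^T D^T D \, m_U}}$

$= \dfrac{m_A^T E \, m_U}{\sqrt{m_A^T E \, m_A} \, \sqrt{m_U^T E \, m_U}}$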

Page 47: REGNET

Boundary case: reduced space

Measurements $i$ and $j$ are synonyms
The following vectors should return the same answer

$m_A = (\dots, w_{i,A}, \dots, w_{j,A}, \dots)^T$, $m_U = (\dots, w_{i,U}, \dots, w_{j,U}, \dots)^T$

$m_{A,reduced} = (\dots, w_{i,A} + w_{j,A}, \dots, w_{j-1,A}, w_{j+1,A}, \dots)^T$, $m_{U,reduced} = (\dots, w_{i,U} + w_{j,U}, \dots, w_{j-1,U}, w_{j+1,U}, \dots)^T$

Page 48: REGNET

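$m_{A,reduced}^T \, E_{reduced} \, m_{U,reduced}$

$= \sum_{p \neq i,j} w_{p,U} \Big( \sum_{k \neq i,j} w_{k,A} E_{kp} + (w_{i,A} + w_{j,A}) E_{ip} \Big) + (w_{i,U} + w_{j,U}) \Big( \sum_{k \neq i,j} w_{k,A} E_{ki} + (w_{i,A} + w_{j,A}) E_{ii} \Big)$

$= \sum_{p=1}^{n} w_{p,U} \Big( \sum_{k \neq i,j} w_{k,A} E_{kp} + (w_{i,A} + w_{j,A}) E_{ip} \Big)$, since $E_{ki} = E_{kj} \ \forall k, 1 \le k \le n$, and $E_{ii} = E_{ij}$

$= \sum_{p=1}^{n} \sum_{k=1}^{n} w_{p,U} \, w_{k,A} \, E_{kp}$, since $E_{ip} = E_{jp} \ \forall p, 1 \le p \le n$

$= m_A^T E \, m_U$

Here $E_{reduced}$ is the symmetric $(n-1) \times (n-1)$ matching matrix of the reduced space, with unit diagonal.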

Page 49: REGNET

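Neighbor inclusion

Neighbor structure matrix formulation $N$
  Each Section $i$ corresponds to row $i$ and column $i$ of $N$
  Entry $N_{ij}$ is 0 if $i \notin psc(j)$
  For $j \in psc(i)$, entry $N_{ij}$ is $1/k$ where $k$ is the total number of neighbors of $i$

Example:

(a) Example Tree of Regulation A: A1 has children A2 and A3; A2 has children A4 and A5

(b) A Neighbor Structure Matrix $N_A$ (rows and columns in the order A1 to A5):

$N_A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 1/4 & 0 & 1/4 & 1/4 & 1/4 \\ 1/2 & 1/2 & 0 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 1/2 \\ 0 & 1/2 & 0 & 1/2 & 0 \end{pmatrix}$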
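The example matrix can be reproduced from the tree with a short sketch. The tree shape used here (A1 with children A2 and A3, A2 with children A4 and A5) is read off the example neighbor structure matrix, and the helper names `psc` and `neighbor_matrix` are assumptions made for illustration.

```python
from fractions import Fraction

# Example tree of Regulation A: parent -> list of children.
children = {"A1": ["A2", "A3"], "A2": ["A4", "A5"], "A3": [], "A4": [], "A5": []}
order = ["A1", "A2", "A3", "A4", "A5"]


def psc(node, children):
    """Return the parent, siblings and children of a node."""
    parent = next((p for p, kids in children.items() if node in kids), None)
    siblings = [s for s in children[parent] if s != node] if parent else []
    return ([parent] if parent else []) + siblings + children[node]


def neighbor_matrix(order, children):
    """Entry (i, j) is 1/k when j is one of the k neighbors of i, else 0."""
    rows = []
    for i in order:
        neighbors = psc(i, children)
        k = len(neighbors)
        rows.append([Fraction(1, k) if j in neighbors else Fraction(0) for j in order])
    return rows


N_A = neighbor_matrix(order, children)
for row in N_A:
    print([str(entry) for entry in row])
```

Each row sums to 1, so multiplying by the matrix averages the scores over a section's neighbors.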

Page 50: REGNET

Matrix representation

Take the average scores of the neighboring pairs
Define
  $\Phi$ = similarity scores between two regulations M and N
  $\Phi_{ij}$ = similarity score between Section $i$ from regulation M and Section $j$ from regulation N
We have
  $\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T$
  and $\Phi_{s\text{-}psc} = \frac{1}{2}(\Phi_0 N_U^T + N_A \Phi_0)$
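The two matrix products can be tried out on a toy case. In the sketch below each regulation has just two sibling sections, so each section's only neighbor is the other one; the base scores, the helper functions and the name `neighbor_scores` are hypothetical.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*y)] for row in x]


def neighbor_scores(phi_0, n_a, n_u):
    """Return (s-psc, psc-psc) score matrices from the base scores."""
    n_u_t = transpose(n_u)
    left = matmul(n_a, phi_0)     # neighbors of A sections vs. U sections
    right = matmul(phi_0, n_u_t)  # A sections vs. neighbors of U sections
    phi_psc_psc = matmul(left, n_u_t)
    phi_s_psc = [[0.5 * (r + l) for r, l in zip(r_row, l_row)] for r_row, l_row in zip(right, left)]
    return phi_s_psc, phi_psc_psc


# Hypothetical regulations with two sibling sections each.
N_A = [[0.0, 1.0], [1.0, 0.0]]
N_U = [[0.0, 1.0], [1.0, 0.0]]
PHI_0 = [[0.8, 0.0], [0.0, 0.6]]  # assumed base scores: rows are A sections, columns are U sections

phi_s_psc, phi_psc_psc = neighbor_scores(PHI_0, N_A, N_U)
print(phi_s_psc)
print(phi_psc_psc)
```

In this toy case the psc-psc score of the first pair equals the base score of the second pair, which is the diffusion of similarity between neighbors that the refinement is meant to capture.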

Page 51: REGNET

Let $a_i$ = Section $i$ in regulation A
    $u_i$ = Section $i$ in regulation U

$[N_A]_{ij} = 0$ if $a_j \notin psc(a_i)$, and $\dfrac{1}{\text{sizeof}(psc(a_i))}$ otherwise

$[N_U]_{ij} = 0$ if $u_j \notin psc(u_i)$, and $\dfrac{1}{\text{sizeof}(psc(u_i))}$ otherwise

$ij$-th entry of $N_A \Phi_0 N_U^T$

$= \sum_{l} \sum_{k} [N_A]_{i,k} \, [\Phi_0]_{k,l} \, [N_U]_{j,l}$

$= \sum_{l} \sum_{k} [N_A]_{i,k} \, [N_U]_{j,l} \, f_0(a_k, u_l)$

$= \dfrac{1}{\text{sizeof}(psc(a_i)) \, \text{sizeof}(psc(u_j))} \sum_{a_p \in psc(a_i)} \sum_{u_q \in psc(u_j)} f_0(a_p, u_q)$

$= f_{psc\text{-}psc}(a_i, u_j)$

$= ij$-th entry of $\Phi_{psc\text{-}psc}$

Page 52: REGNET

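Reference distribution

Reference structure matrix formulation $R$
  Each Section $i$ corresponds to row $i$ and column $i$ of $R$
  Entry $R_{ij}$ is 0 if $j \notin ref(i)$
  For $j \in ref(i)$, entry $R_{ij}$ is $n/k$ where $n$ is the number of citations from $i$ to $j$, $k$ is the total number of references from $i$

Example:

(a) Example Tree of Regulation A: sections A1 to A5, with reference arrows labeled by citation counts (1, 3, 1, 2, 2, 5)

(b) A Reference Structure Matrix $R_A$ (rows and columns in the order A1 to A5):

$R_A = \begin{pmatrix} 0 & 1/9 & 5/9 & 3/9 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 2/4 & 0 & 2/4 & 0 \end{pmatrix}$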

Page 53: REGNET

Matrix Representation

Take the average scores of the referenced pairs
By replacing neighbor structure matrix $N$ with reference structure matrix $R$, we have
  $\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T$
  and $\Phi_{s\text{-}ref} = \frac{1}{2}(\Phi_0 R_U^T + R_A \Phi_0)$

Page 54: REGNET

Reference distribution discussion

Referencing is directional, unlike an immediate neighboring relationship, which leads us to further investigate the semantics of a reciprocal referential relationship

Define
  In-reference of i: from other nodes to i
  Out-reference of i: from i to others

Our matrix formulation can be easily modified to include all combinations
  In-in
  In-out
  Out-out

But should we?

Page 55: REGNET

In-in, in-out, out-out reference comparisons

[Figure: three panels, each showing ADAAG sections A1 to A3 and UFAS sections U1 to U3 with arrows marked "in reference" or "out reference", illustrating the in-in, in-out and out-out comparisons; each panel is labeled with the corresponding product of reference structure matrices]

Page 56: REGNET

5656

User SurveyUser Survey

Design of the survey

final rankingindividual worksheets

Rank UFAS1 72 93 104 55 26 17 48 89 3

10 6

ADAAG: 1Rank UFAS

1 52 103 14 45 86 67 38 29 910 7

ADAAG: 2Rank ADAAG UFAS

1 1 72 7 103 4 14 4 45 2 56 2 107 3 38 4 29 9 9

… … …

tied forrank 7.5

Page 57: REGNET

Latent Semantic Indexing (LSI)

Term-document matrix $K$
Singular Value Decomposition (SVD) on $K$:
  $K = P Q R^T$
Zero out insignificant singular values:
  $K_s = P_s Q_s R_s^T$
Document-document similarity matrix:
  $K_s^T K_s = (P_s Q_s R_s^T)^T (P_s Q_s R_s^T)$
  $= R_s Q_s^T P_s^T P_s Q_s R_s^T$
  $= R_s Q_s^2 R_s^T$, since $P_s^T P_s = I$ and $Q_s^T = Q_s$

Page 58: REGNET

Example of DWC ontology

!Disinfectants and Disinfection-byproducts
  !Disinfectants
    ...
    !Chlorine
      +chlorine
      +cl2
      +hypochlorite
      +hypochlorous acid
  !Disinfection Byproducts
    +d/dbp
    +d/dbps
    +dbp
    +dbps
    ...
    !Total Trihalomethanes
      +trihalomethane
      +tthm
      +tthms
    ...
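One way such an ontology could supply synonymic information is sketched below. The reading of the notation is an assumption: lines starting with "!" are taken to be concept nodes and lines starting with "+" to be synonym terms of the nearest preceding concept, one entry per line. The function names and the trimmed sample are hypothetical.

```python
# Trimmed sample in the assumed one-entry-per-line layout.
SAMPLE = """\
!Disinfectants
!Chlorine
+chlorine
+cl2
+hypochlorite
+hypochlorous acid
!Total Trihalomethanes
+trihalomethane
+tthm
+tthms
"""


def synonym_table(text):
    """Map every '+' term to the nearest preceding '!' concept."""
    table = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("!"):
            current = line[1:].strip()
        elif line.startswith("+") and current is not None:
            table[line[1:].strip().lower()] = current
    return table


def canonical(term, table):
    """Return the concept a term belongs to, or the term itself if unknown."""
    return table.get(term.lower(), term)


table = synonym_table(SAMPLE)
print(canonical("TTHM", table))
print(canonical("trihalomethane", table))
```

Mapping both the acronym and the spelled-out form to the same concept lets provisions that use different terminology share a feature axis.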

Page 59: REGNET

Results of different regulation comparisons

5 groups of comparisons, clustered according to domain
  Accessibility standards: Groups 1, 2 and 3
  Drinking water standards: Group 4
  Cross domain comparisons (drinking water standards vs. fire code): Group 5

Page 60: REGNET

Observations based on results

Similarities between regulations from accessibility > drinking water > cross-domain
  Drinking water regulations
    Much more voluminous (2600 provisions each)
    Accessibility ~ 500 provisions each
  Diversity of coverage
    National primary, national secondary, customer confidence reports
    Accessibility regulations: focused on disabled access
  (Almost) identical provisions in Groups 1, 2 and 4

Different features
  Term-based more important than non term-based
  Ontology is important
  Measurements, effective dates: stringent scoring schemes