REGNET
A Comparative Analysis Framework for Semi-Structured Documents, with Applications to Government Regulations
Gloria Lau
Engineering Informatics Group, Stanford University
May 14th, 2004
Motivation
Multiple sources of regulations
  Multiple jurisdictions: federal, state, local, etc.
  Different formats, terminologies, contexts
  Amending rules, conflicting ideas
[Figures: UK DDA in HTML; ADAAG in HTML; IBC in PDF]
Need for a repository
  Locate relevant information
  E.g., small business: penalty fees for violations
Need for analysis tool
  Complexity of regulations
  Multiple jurisdictions
  Understanding of regulations and their relationships
Example 1: Related Provisions
ADAAG Appendix 4.6.3
… Such a curb ramp opening must be located within the access aisle boundaries, not within the parking space boundaries.
CBC 1129B.4.3
… Ramps shall not encroach into any parking space.
Exception: 1. Ramps located at the front of accessible parking spaces may encroach into the length of such spaces …
CBC allows curb ramps encroaching into accessible parking stall access aisles, while ADA disallows encroachment into any portion of the stall.
Example 2: Related but Conflicting Provisions
ADAAG 4.7.2 Slope. … Transitions from ramps to walks, gutters, or streets shall be flush and free of abrupt changes …
CBC 1127B.5.5 Beveled lip. The lower end of each curb ramp shall have a ½ inch (13 mm) lip beveled at 45 degrees as a detectable way-finding edge for persons with visual impairments.
ADAAG focuses on wheelchair traversal; CBC focuses on the visually impaired when using a cane.
Scope
Repository development
Relatedness analysis
Performance evaluation, results and applications
[Figure: system overview. Repository development: regulations in HTML, PDF, plain text, etc. go through a shallow parser and a feature extractor to give XML regulations and then refined XML regulations. Generic features: concepts (Semio), measurements, exceptions, definitions. Domain-specific features: author-prescribed indices, glossary terms, chemicals, effective dates (ontology, domain expert). Relatedness analysis: the Similarity Analysis Core applies feature matching (concepts, author-prescribed indices, measurements, effective dates, drinking water contaminants, ...) together with domain knowledge (ontology with synonymic information, domain-specific scoring algorithm) to obtain a base score; score refinements (neighbor inclusion, reference distribution) give a refined score; pairs below the threshold are discarded, leaving the related pairs.]
Repository development
[Figure: repository development part of the system overview: a shallow parser and a feature extractor turn regulations in HTML, PDF, plain text, etc. into refined XML regulations carrying generic and domain-specific features]
Sources of data
Accessibility standards
  Americans with Disabilities Act Accessibility Guidelines (ADAAG)
    Drafted chapter for rights-of-way access
    Associated public comments
  Uniform Federal Accessibility Standards (UFAS)
  British Standard BS 8300
  Scottish Technical Standards, Part S
  International Building Code (IBC), Chapter 11
Drinking water standards
  Code of Federal Regulations, Title 40 (40 CFR)
  California Code of Regulations, Title 22 (22 CCR)
Fire code
  International Building Code (IBC), Chapter 9
Computational properties of regulations
[Figure: fragment of the ADAAG tree. Section 4 has children including 4.1, 4.5 "Ground and Floor Surfaces", 4.7 and 4.9. Section 4.7.4 "Surface. Slopes of curb ramps shall comply with 4.5." is a child node under 4.7 and points to the reference node 4.5. Annotations: unbounded number of descendants, unbounded tree depth.]
Hierarchical tree structure
Referential structure
Discipline-centered, e.g., ADAAG for accessibility
Shallow parser to capture computational properties
Digital publication of regulations
Current standard: HTML, PDF, plain text...
Our system standard: XML
  Recreate regulatory structure
  Unit of extraction: section/provision
  Extract references
  Extract features
<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies"> …
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" /> …
      <regText>Assembly areas with fixed seating shall comply … </regText>
      <regElement id="ibc.1107.2.1" name="services">...</regElement>
      <regElement id="ibc.1107.2.2" name="wheelchair …">...</regElement>
    </regElement>
  </regElement>
</regulation>
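A minimal sketch of how such a file could be consumed, assuming an abridged, ellipsis-free copy of the snippet above. The function name `index_provisions` and the shape of the returned dictionary are illustrative assumptions, not part of the REGNET system; only Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's XML, with the elided parts dropped (assumption).
SAMPLE = """<regulation id="ibc" name="international building code" type="private">
  <regElement id="ibc.1107" name="special occupancies">
    <regElement id="ibc.1107.2" name="assembly area seating">
      <reference id="ibc.1107.2.4.1" times="1" />
      <concept name="assembl area" times="1" />
      <regText>Assembly areas with fixed seating shall comply</regText>
      <regElement id="ibc.1107.2.1" name="services"></regElement>
    </regElement>
  </regElement>
</regulation>"""


def index_provisions(xml_text):
    """Map each regElement id to its parent id, direct references and concepts."""
    root = ET.fromstring(xml_text)
    index = {}

    def visit(element, parent_id):
        # findall with a bare tag name returns direct children only,
        # so the recursion mirrors the regulation hierarchy.
        for child in element.findall("regElement"):
            index[child.get("id")] = {
                "parent": parent_id,
                "references": {r.get("id"): int(r.get("times"))
                               for r in child.findall("reference")},
                "concepts": {c.get("name"): int(c.get("times"))
                             for c in child.findall("concept")},
            }
            visit(child, child.get("id"))

    visit(root, None)
    return index


provisions = index_provisions(SAMPLE)
```

Keeping the parent id and the reference counts per provision is what later allows neighbor and reference relationships to be looked up without re-parsing the document.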
Shallow parser: feature extraction
Combination of handcrafted rules and software tools
Generic features
  Concepts - noun phrases
  Exceptions - negated provisions
  Definitions - terminologies defined in regulations
Domain-specific features
  Glossary terms - definitions from reference guides
  Author-prescribed indices - concepts from field handbooks
  Measurements - e.g., 2 inches max, 4 ppm
  Chemicals - list of drinking water contaminants from EPA
  Effective dates - provision updates
  Non-structural characteristics specific to a corpus
  To aid user retrieval of relevant materials
  For analysis purpose: domain knowledge
Example of indexTerm, concept, measurement & exception features
Original Section 4.6.3 from the UFAS
4.6.3* PARKING SPACES. Parking spaces for disabled people shall be at least 96 in (2440 mm) wide and shall have an adjacent access aisle 60 in (1525 mm) wide minimum (see Fig. 9). Parking access aisles shall be part of ...
EXCEPTION: … an adjacent access aisle at least 96 in (2440 mm) wide complying with 4.5...
Refined Section 4.6.3 in XML format
<regElement name="ufas.4.6.3" title="parking spaces" asterisk="1">
  <concept name="access aisl" num="3" /> …
  <indexTerm name="park space" num="4" />
  <measurement unit="inch" magnitude="96" quantifier="min" />
  <ref name="ufas.4.5" num="1" />
  <regText> Parking spaces for disabled people shall ... </regText>
  <exception> If accessible parking spaces for ... </exception>
</regElement>
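The measurement feature above (unit, magnitude, quantifier) can be illustrated with a small handcrafted rule. This is only a sketch under assumptions: the regular expression, the unit and quantifier tables, and the function name `extract_measurements` are hypothetical and far simpler than a production shallow parser; for instance, a phrase such as "Section 2 in the code" would be picked up as a false positive.

```python
import re

# Hypothetical normalisation tables (assumptions for illustration).
UNITS = {"in": "inch", "inch": "inch", "inches": "inch",
         "mm": "mm", "ppm": "ppm", "ft": "ft"}
QUANTIFIERS = {"at least": "min", "min": "min", "minimum": "min",
               "at most": "max", "max": "max", "maximum": "max"}

MEASUREMENT = re.compile(
    r"(?:(?P<pre>at least|at most)\s+)?"      # optional leading quantifier
    r"(?P<magnitude>\d+(?:\.\d+)?)\s*"        # number
    r"(?P<unit>inches|inch|in|mm|ppm|ft)\b"   # unit
    r"(?:\s+(?P<post>minimum|maximum|min|max)\b)?",  # optional trailing quantifier
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return (magnitude, unit, quantifier) tuples found in the text."""
    results = []
    for match in MEASUREMENT.finditer(text):
        quantifier = match.group("pre") or match.group("post")
        results.append((
            float(match.group("magnitude")),
            UNITS[match.group("unit").lower()],
            QUANTIFIERS[quantifier.lower()] if quantifier else None,
        ))
    return results


found = extract_measurements("shall be at least 96 in (2440 mm) wide")
```

Each tuple corresponds to one `<measurement>` element; a quantifier of `None` means the text gave a bare value such as "4 ppm".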
Relatedness analysis
Related elements: door and entrance
ADAAG 4.1.6(3)(d) Doors
(i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection ...
UFAS 4.14.1 Minimum Number
Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with ...
Relatedness analysis
To utilize the computational properties of regulations for a complete comparison
Measure
  Degree of relatedness: similarity score $f(A, U) \in (0, 1)$
  Nodes A and U are provisions from two different regulation trees
[Figure: nodes in comparison, A in the ADAAG tree and U in the UFAS tree, with $f \in (0, 1)$. psc(A) and psc(U) mark the parent, sibling and child nodes of A and U; ref(U) marks the reference nodes of U. Legend: child node, reference node, nodes in comparison.]
Base score $f_0$ computation
Linear combination of feature matching
$$f_0(A, U) = \sum_{i=1}^{N} \beta_i \, F(A, U, i), \qquad \sum_{i=1}^{N} \beta_i = 1$$
  $F(A, U, i)$ = similarity score between Sections (A, U) based on feature i
  N = total number of features
  $\beta_i$ = weighting coefficient
Feature matching
  Based on the vector model, using cosine similarity as the distance between feature vectors
  Similarity between two documents M and N =
$$\frac{\vec{d}_M \cdot \vec{d}_N}{|\vec{d}_M| \, |\vec{d}_N|}$$
  $\vec{d}_M$ and $\vec{d}_N$ are document vectors
  i = concept feature
    Concept vectors are formed per provision based on concept frequency in each provision
    F(provision M, provision N, i = concept) = cosine between the 2 concept vectors
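The base score can be sketched in a few lines: one cosine per feature type, then a weighted sum whose weights add up to 1. The function names, the sparse dictionary representation and the sample frequencies below are illustrative assumptions rather than the system's actual data structures.

```python
import math


def cosine(vec_m, vec_n):
    """Cosine between two sparse frequency vectors stored as dicts."""
    dot = sum(weight * vec_n.get(term, 0) for term, weight in vec_m.items())
    norm_m = math.sqrt(sum(w * w for w in vec_m.values()))
    norm_n = math.sqrt(sum(w * w for w in vec_n.values()))
    if norm_m == 0 or norm_n == 0:
        return 0.0  # a provision without this feature matches nothing
    return dot / (norm_m * norm_n)


def base_score(features_a, features_u, weights):
    """Weighted sum of per-feature cosine scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("feature weights must sum to 1")
    return sum(
        weight * cosine(features_a.get(name, {}), features_u.get(name, {}))
        for name, weight in weights.items()
    )


# Hypothetical feature counts for two provisions.
section_a = {"concept": {"access aisl": 3, "park space": 1},
             "measurement": {"96 inch min": 1}}
section_u = {"concept": {"access aisl": 1, "curb ramp": 2},
             "measurement": {"96 inch min": 1}}
weights = {"concept": 0.5, "measurement": 0.5}

score = base_score(section_a, section_u, weights)
```

Here the concept cosine is 3 / sqrt(10 * 5) and the measurement cosine is 1, so the base score is their equally weighted average.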
Axis dependency: non-Boolean matching
Vector model assumes mutual independence between axes
  Domain experts do not necessarily agree
  A measurement of "2 inches max" can be a 70% match to "2 inches"
  Synonyms exist, e.g., ontology defined for chemicals
Limitation observed
  Need flexibility to model domain knowledge, such as a 0, 50%, 75% and 100% measurement match
[Figure: measurement scores among "2 ppm", "2 ppm min" and "2 ppm max", with match values of 1, 0.75 and 0.5]
Proposed non-Boolean matching model
Define a feature matching matrix E
  $E_{ij}$ = % match between features i and j
  E.g., a 3-dimensional vector space using "2 ppm", "2 ppm max" and "2 ft" as the first, second and third measurement axes:
$$E = \begin{bmatrix} 1 & 0.75 & 0 \\ 0.75 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Vector space transformation before cosine computation
  Map feature vectors onto an alternate space to form consolidated frequency vectors
  E.g., based on measurement features $m_A$ and $m_U$:
$$\text{Cosine similarity} = \frac{m_A^T E \, m_U}{\sqrt{m_A^T E \, m_A} \, \sqrt{m_U^T E \, m_U}}$$
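As a worked example of the matching model, the sketch below evaluates the E-weighted cosine for the three measurement axes "2 ppm", "2 ppm max" and "2 ft", with a 0.75 match between the first two. The helper names `bilinear` and `matched_cosine` are assumptions made for illustration; plain Python lists stand in for vectors and matrices.

```python
import math


def bilinear(x, E, y):
    """Compute x^T E y for plain Python lists."""
    return sum(x[i] * E[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def matched_cosine(m_a, m_u, E):
    """Cosine similarity after weighting axes by the feature matching matrix E."""
    denominator = (math.sqrt(bilinear(m_a, E, m_a))
                   * math.sqrt(bilinear(m_u, E, m_u)))
    return bilinear(m_a, E, m_u) / denominator if denominator else 0.0


# Axes: "2 ppm", "2 ppm max", "2 ft".
E = [[1.0, 0.75, 0.0],
     [0.75, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

m_a = [1, 0, 0]  # provision A mentions "2 ppm" once
m_u = [0, 1, 0]  # provision U mentions "2 ppm max" once

partial = matched_cosine(m_a, m_u, E)         # partial credit from E
boolean = matched_cosine(m_a, m_u, IDENTITY)  # classic vector model
```

With the identity matrix the two provisions share no axis and score 0; with E they receive the 0.75 partial match that a domain expert would expect.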
Score refinements based on regulation structure
Neighbor inclusion
  Diffusion of similarity between clusters of nodes in the tree
  Self vs. parent-sibling-child (psc), $f_{s\text{-}psc}$
  psc vs. psc, $f_{psc\text{-}psc}$
[Figure: node A in the ADAAG tree and node U in the UFAS tree with their parent, sibling and child nodes psc(A) and psc(U). $f_0$ compares A with U, s-psc compares each node with the other node's psc set, and psc-psc compares the two psc sets.]
Neighbor inclusion: psc vs. psc
[Figure: node pairs (A1, U1) and (A2, U2) in the ADAAG and UFAS trees with their neighbor sets psc(A1), psc(U1), psc(A2), psc(U2). Legend: child node, nodes in comparison, spread of similarity; red: similar nodes, blue: dissimilar nodes.]
Take a linear combination of neighboring pair scores
  Formulate a neighbor structure matrix N
  Define score matrix $\Phi$
  We have $\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T$
Neighbor inclusion: self vs. psc
[Figure: node pairs (A1, U1) and (A2, U2) in the ADAAG and UFAS trees, each node compared with the other node's neighbor set, e.g., psc(U1) and psc(A2). Legend: child node, nodes in comparison, spread of similarity; red: similar nodes, blue: dissimilar nodes.]
Take a linear combination of neighbor vs. self scores
  Formulate a neighbor structure matrix N
  Define score matrix $\Phi$
  We have $\Phi_{s\text{-}psc} = \tfrac{1}{2} (\Phi_0 N_U^T + N_A \Phi_0)$
Score refinements based on regulation structure
Reference distribution
  Diffusion of similarity between referencing nodes and referenced nodes in the tree
  E.g., f(A5.3, U6.4(a)) updates f(A2.1, U3.3)
[Figure: ADAAG Section 2.1 references Section 5.3, and UFAS Section 3.3 references Section 6.4(a). The referenced sections are similar ($f_0 \neq 0$) and there is no cross reference between the two regulations.]
Reference distribution: s-ref and ref-ref
[Figure: two panels showing ADAAG Sections A1, A2, A3 and UFAS Sections U1, U2, U3 with reference arrows: s-ref comparisons of Sections A2, U2 and ref-ref comparisons of Sections A2, U2]
Take a linear combination of reference vs. self and reference vs. reference scores
  Formulate a reference structure matrix R
  Define score matrix $\Phi$
  We have $\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T$ and $\Phi_{s\text{-}ref} = \tfrac{1}{2} (\Phi_0 R_U^T + R_A \Phi_0)$
Final score: linear combination of $\Phi$'s
$$\Phi_{final} = \alpha_0 \Phi_0 + \alpha_{s\text{-}psc} \Phi_{s\text{-}psc} + \alpha_{psc\text{-}psc} \Phi_{psc\text{-}psc} + \alpha_{s\text{-}ref} \Phi_{s\text{-}ref} + \alpha_{ref\text{-}ref} \Phi_{ref\text{-}ref}$$
where
$$\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T, \qquad \Phi_{s\text{-}psc} = \tfrac{1}{2} (\Phi_0 N_U^T + N_A \Phi_0)$$
$$\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T, \qquad \Phi_{s\text{-}ref} = \tfrac{1}{2} (\Phi_0 R_U^T + R_A \Phi_0)$$
$$\alpha_0 > \alpha_{s\text{-}psc} > \alpha_{psc\text{-}psc} > 0$$
$$\alpha_0 > \alpha_{s\text{-}ref} > \alpha_{ref\text{-}ref} > 0$$
$$\alpha_0 + \alpha_{s\text{-}psc} + \alpha_{psc\text{-}psc} + \alpha_{s\text{-}ref} + \alpha_{ref\text{-}ref} = 1$$
$\alpha$ = structural weighting coefficient
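The final score can be sketched as five matrices combined entry by entry: the base score matrix, the two neighbor-based refinements (built from the neighbor structure matrices of the two regulations) and the two reference-based refinements (built from the reference structure matrices). Everything below is an illustrative assumption: the helper names, the tiny two-section regulations and the weight values, which are chosen only to satisfy the ordering and sum-to-one constraints.

```python
def matmul(X, Y):
    """Plain-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def combine(weights, matrices):
    """Entry-wise weighted sum of same-shaped matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices))
             for j in range(cols)] for i in range(rows)]


def final_scores(phi0, N_A, N_U, R_A, R_U, alpha):
    """alpha is ordered as (base, s-psc, psc-psc, s-ref, ref-ref)."""
    NUt, RUt = transpose(N_U), transpose(R_U)
    psc_psc = matmul(matmul(N_A, phi0), NUt)
    s_psc = combine([0.5, 0.5], [matmul(phi0, NUt), matmul(N_A, phi0)])
    ref_ref = matmul(matmul(R_A, phi0), RUt)
    s_ref = combine([0.5, 0.5], [matmul(phi0, RUt), matmul(R_A, phi0)])
    return combine(alpha, [phi0, s_psc, psc_psc, s_ref, ref_ref])


# Two sections per regulation; each section's only neighbor is the other one.
N = [[0.0, 1.0], [1.0, 0.0]]
R = [[0.0, 0.0], [0.0, 0.0]]          # no references in this toy case
phi0 = [[1.0, 0.0], [0.0, 0.0]]       # only the pair (A1, U1) matches directly
alpha = [0.6, 0.2, 0.1, 0.07, 0.03]   # hypothetical weights summing to 1

final = final_scores(phi0, N, N, R, R, alpha)
```

The pair (A2, U2) has a base score of 0, yet it ends up with a refined score of 0.1 because the neighbors A1 and U1 are similar, which is the diffusion effect that neighbor inclusion is meant to capture.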
Performance evaluation
Conduct a user survey of rankings of similarity
  10 randomly chosen sections from the ADAAG and UFAS
  Ranks 1 to 100 in the order of relevance
Root mean square error (RMSE)
  $\vec{r}_h$ = user-generated ranking vector
  $\vec{r}_m$ = machine-predicted ranking vector
$$\mathrm{RMSE} = \frac{|\vec{r}_h - \vec{r}_m|}{\sqrt{\text{length of } \vec{r}_h}} = \sqrt{\frac{(r_{h1} - r_{m1})^2 + \cdots + (r_{hN} - r_{mN})^2}{N}}$$
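The error measure is the standard root mean square error between the user-generated ranking vector and the machine-predicted ranking vector. A minimal sketch follows; the function name `rmse` and the five-item rankings are made-up illustrations, not survey data.

```python
import math


def rmse(human_ranks, machine_ranks):
    """Root mean square error between two equally long ranking vectors."""
    if len(human_ranks) != len(machine_ranks):
        raise ValueError("ranking vectors must have the same length")
    squared = sum((h - m) ** 2 for h, m in zip(human_ranks, machine_ranks))
    return math.sqrt(squared / len(human_ranks))


# Hypothetical rankings of five section pairs.
human = [1, 2, 3, 4, 5]
machine = [2, 1, 3, 5, 4]
error = rmse(human, machine)
```

Four of the five ranks are off by one, so the squared errors sum to 4 and the RMSE is sqrt(4 / 5).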
Survey results - Tabulated RMSE's
Compared our analysis to Latent Semantic Indexing (LSI)
[Table: RMSE values tabulated against the structural weighting coefficient and the feature weighting coefficient]
Average RMSE smaller than LSI
Measurement feature performs best
No improvement in result observed for structural comparison
Results of comparisons: ADAAG vs. UFAS
Related accessible elements: door and entrance
  No ontological information
  Neighbor inclusion reveals higher similarity
  Content of neighbors implies similarity between Section 4.1.6(3)(d) in ADAAG and Section 4.14.1 in UFAS
ADA Accessibility Guidelines 4.1.6(3)(d) Doors
(i) Where it is technically infeasible to comply with clear opening width requirements of 4.13.5, a projection of 5/8 in maximum will be permitted for the latch side stop. (ii) If existing thresholds are 3/4 in high or less, and have (or are modified to have) a beveled edge on each side, they may remain.
Uniform Federal Accessibility Standards 4.14.1 Minimum Number
4.14 Entrances
4.14.1 Minimum Number
Entrances required to be accessible by 4.1 shall be part of an accessible route and shall comply with 4.3. Such entrances shall be connected by an accessible route to public transportation stops, to accessible parking and passenger loading zones, and to public streets or sidewalks if available (see 4.3.2(1)). They shall also be connected by an accessible route to all accessible spaces or elements within the building or facility.
Results of comparisons: UFAS vs. BS8300
Terminological differences - revealed through neighbor inclusion
[Figure: UFAS 4.13.9 Door Hardware, with parent 4.13 Doors and siblings 4.13.1, 4.13.2, 4.13.3, ..., 4.13.12, compared with BS8300 12.5.4.2 Door Furniture, with parent 12.5.4 Doors and sibling 12.5.4.1]
Uniform Federal Accessibility Standards 4.13.9 Door Hardware
4.13 Doors
4.13.1 General
...
4.13.9 Door Hardware
Handles, pulls, latches, locks, and other operating devices on accessible doors shall have a shape that is easy to grasp with one hand and does not require tight grasping ...
...
4.13.12 Door Opening Force
British Standard 8300 12.5.4.2 Door Furniture
12.5.4 Doors
12.5.4.1 Clear Widths of Door Openings
12.5.4.2 Door Furniture
Door handles on hinged and sliding doors in accessible bedrooms should be easy to grip and operate by a wheelchair user or ambulant disabled person ...
Results of comparisons: 40CFRdw vs. 22CCRdw
Top ranked: almost identical provisions, change of enforcing agency
Code of Federal Regulations Title 40 141.32.e.16 Barium The United States Environmental Protection Agency (EPA) sets drinking water standards and has determined that barium ... In humans, EPA believes that effects from barium on blood pressure should not occur below 2 parts per million (ppm) in drinking water. EPA has set the drinking water standard for barium at 2 parts per million (ppm) to protect against the risk of these adverse health effects. Drinking water that meets the EPA standard is associated with little to none of this risk and is considered safe with respect to barium.
California Code of Regulations Title 22 64468.1(c) Barium The California Department of Health Services (DHS) sets drinking water standards and has determined that barium ... In humans, DHS believes that effects from barium on blood pressure should not occur below 2 parts per million (ppm) in drinking water. DHS has set the drinking water standard for barium at 1 part per million (ppm) to protect against the risk of these adverse health effects. Drinking water that meets the DHS standard is associated with little to none of this risk and is considered safe with respect to barium.
Results of comparisons: 40CFRdw vs. 22CCRdw
Use of ontological information
  40 CFR uses chemical acronyms, e.g., TTHM
  22 CCR spells out "total trihalomethanes"
Code of Federal Regulations Title 40 141.132.a.2 [No Title; under Monitoring Requirements] Systems may consider multiple wells drawing water from a single aquifer as one treatment plant for determining the minimum number of TTHM and HAA5 samples required, with State approval in accordance with criteria developed under §142.16(h)(5) of this chapter.
California Code of Regulations Title 22 64823(e) [No Title; under Field of Testing] Field of Testing 5 consists of those methods whose purpose is to detect the presence of trace organics in the determination of drinking water quality and do not require the use of a gas chromatographic/mass spectrophotometric device and encompasses the following Subgroups: ... EPA method 501.2 for trihalomethanes; EPA method 510 for total trihalomethanes; EPA method 508 for chlorinated pesticides; ... EPA method 552 for haloacetic acids.
Application: e-rulemaking
Application domain: e-rulemaking
  Comparison between draft of rules and the associated public comments
ADAAG Chapter 11, rights-of-way draft
  Less than 15 pages
  Over 1400 public comments received within 4 months
  Comments ~ 10MB in size; most are several pages long
  New regulation draft can easily generate a huge amount of data that needs to be reviewed and analyzed
Parsing of the draft and comments
  From HTML to XML
  Recreate structure of the draft using our shallow parser
  Extract features from the draft and comments
  Treat individual comments as provisions
E-rulemaking
Drafted regulations compared with public comments
[Figure: screenshot showing the content of Section 1105.4 and its 6 related public comments, listed as "1105.4 [6]"]
Results from e-rulemaking application
Related section in draft and public comment
ADAAG Chapter 11 Rights-of-way Draft 1105.4.1 Length Where signal timing is inadequate for full crossing of all traffic lanes or where the crossing is not signalized, cut-through medians and pedestrian refuge islands shall be 72 inches (1830 mm) minimum in length in the direction of pedestrian travel.
Public Comment Deborah Wood, October 29, 2002 I am a member of The American Council of the Blind. I am writing to express my desire for the use of audible pedestrian traffic signals to become common practice. Traffic is becoming more and more complex, and many traffic signals are set up for the benefit of drivers rather than of pedestrians. This often means walk lights that are so short in duration that by the time a person who is blind realizes they have the light, the light has changed or is about to change, and they must wait for the next walk light. ...
Results from e-rulemaking application
No related provisions identified
  Concern not addressed in the draft
ADAAG Chapter 11 Rights-of-way Draft [None Retrieved] No relevant provision identified
Public Comment Donna Ring, September 6, 2002 If you become blind, no amount of electronics on your body or in the environment will make you safe and give back to you your freedom of movement. You have to learn modern blindness skills from a good teacher. You have to practice your new skills. Poor teaching cannot be solved by adding beeping lights to every big Street corner! I am blind myself. I travel to work in downtown Baltimore and back home every workday by myself. I go to meetings and musical events around town. I use the city bus and I walk, sometimes I take a cab or a friend drives me. Some of the blind people who work where I do are so poor at travel they can only use that lousy “mobility service” or pay a cab. Noisy street corners won’t help them ...
Contributions
A framework for regulatory repository
  Structure of regulations recreated in XML
  Feature extractions
Prototype for similarity comparisons
  Contextual comparisons
  Domain knowledge
  Structural comparisons
Performance evaluation, results and applications
  User survey and comparisons with LSI
  Observations of comparisons between Federal, State, non-profit organization mandated codes and European standards
    Accessibility
    Drinking water control
  Application on e-rulemaking
Future research directions
In the legal domain
  Regulatory competition
    Cross border data transfer laws
    Especially in the polyglot countries in EU
  Regulatory updates
    Track changes in updates
    Track cross references between regulations
Extension of application to other domains of semi-structured documents
  Software specifications
  User manuals
Similarity/relatedness is settled - how about differences and conflicts?
  Drinking water example of almost identical provisions
Acknowledgments
Committee members
  Prof. Kincho Law
  Prof. Gio Wiederhold
  Prof. Hans Bjornsson
  Prof. Cary Coglianese
  Prof. Hector Garcia-Molina, defense chair
Family, friends and everyone in the Engineering Informatics Group
  Especially REGNET/REGBASE project members
This research is sponsored by the National Science Foundation
Thank You!
Backup Slides
Natural tree hierarchy rendered by SpaceTree
Concept ontology
Semantics of relatedness/similarity
Similar: having characteristics in common; strictly comparable; alike in substance or essentials; not differing in shape but only in size or position.
Related: connected by reason of an established or discoverable relation.
Similarity is not static; it can depend on one’s viewpoint and desired outcome.
"Related" provisions are more interesting, e.g., the conflicting cases
Traditionally, it is called a “similarity score”.
Cosine similarity
A document is represented as an n-entry vector $\vec{d}_M = (w_{1,M}, w_{2,M}, \ldots, w_{n,M})$, where n is the total number of index terms in the corpus.
Similarity between two documents =
$$\frac{\sum_{i=1}^{n} w_{i,M} \, w_{i,N}}{\sqrt{\sum_{i=1}^{n} w_{i,M}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,N}^2}}$$
E.g., we take the frequency count of concept i as the concept weight $w_{i,M}$ in $\vec{d}_M = (w_{1,M}, w_{2,M}, \ldots, w_{n,M})$.
Example of feature vectors
Traditional term match
  Each index term i is assigned a positive and non-binary weight $w_{i,M}$ in each document vector $\vec{d}_M$
Weight selection
  Frequency of term, or
  tf × idf model
    tf = term frequency; term density
    idf = inverse document frequency = $\log(n / n_i)$; term rarity
Excluding stopwords
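A short sketch of the tf × idf weighting, under the assumption that n is the number of documents and n_i is the number of documents containing term i (the usual reading of the idf formula). The function name `tf_idf_vectors` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """documents: list of token lists with stopwords already removed."""
    n = len(documents)
    # n_i: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({term: count * math.log(n / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


docs = [
    ["curb", "ramp", "slope"],
    ["curb", "ramp", "lip"],
    ["parking", "space", "ramp"],
]
vectors = tf_idf_vectors(docs)
```

A term that occurs in every document, such as "ramp" here, receives a weight of 0, which is how the idf factor rewards term rarity.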
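Vector space transformation
Define D such that $E = D^T D$ is fulfilled
Cosine between the consolidated frequency vectors $m'_A = D m_A$ and $m'_U = D m_U$:
$$\begin{aligned}
\frac{m'_A \cdot m'_U}{|m'_A| \, |m'_U|}
&= \frac{D m_A \cdot D m_U}{|D m_A| \, |D m_U|} \\
&= \frac{(D m_A)^T (D m_U)}{\sqrt{(D m_A)^T (D m_A)} \, \sqrt{(D m_U)^T (D m_U)}} \\
&= \frac{m_A^T D^T D \, m_U}{\sqrt{m_A^T D^T D \, m_A} \, \sqrt{m_U^T D^T D \, m_U}} \\
&= \frac{m_A^T E \, m_U}{\sqrt{m_A^T E \, m_A} \, \sqrt{m_U^T E \, m_U}}
\end{aligned}$$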
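Boundary case: reduced space
Measurements i and j are synonyms
The following vectors should return the same answer
$$m_A = \begin{bmatrix} \vdots \\ w_{i,A} \\ \vdots \\ w_{j,A} \\ \vdots \end{bmatrix}, \qquad m_U = \begin{bmatrix} \vdots \\ w_{i,U} \\ \vdots \\ w_{j,U} \\ \vdots \end{bmatrix}$$
$$m_{A,reduced} = \begin{bmatrix} \vdots \\ w_{i,A} + w_{j,A} \\ \vdots \\ w_{j-1,A} \\ w_{j+1,A} \\ \vdots \end{bmatrix}, \qquad m_{U,reduced} = \begin{bmatrix} \vdots \\ w_{i,U} + w_{j,U} \\ \vdots \\ w_{j-1,U} \\ w_{j+1,U} \\ \vdots \end{bmatrix}$$
With $E_{reduced}$ the symmetric matrix with unit diagonal obtained from E by dropping row and column j:
$$\begin{aligned}
m_{A,reduced}^T E_{reduced} \, m_{U,reduced}
&= \Big[ \; \cdots \;\; \sum_{k \neq i,j} w_{k,A} E_{kp} + (w_{i,A} + w_{j,A}) E_{ip} \;\; \cdots \; \Big]_{n-1 \text{ entities}} \; m_{U,reduced} \\
&= \sum_{p \neq i,j} w_{p,U} \Big( \sum_{k \neq i,j} w_{k,A} E_{kp} + (w_{i,A} + w_{j,A}) E_{ip} \Big) + (w_{i,U} + w_{j,U}) \Big( \sum_{k \neq i,j} w_{k,A} E_{ki} + (w_{i,A} + w_{j,A}) E_{ii} \Big) \\
&= \sum_{p=1}^{n} w_{p,U} \Big( \sum_{k \neq i,j} w_{k,A} E_{kp} + (w_{i,A} + w_{j,A}) E_{ip} \Big), \quad \text{since } E_{ki} = E_{kj} \; \forall k = 1, \ldots, n \text{ and } E_{ii} = E_{ij} \\
&= \sum_{p=1}^{n} \sum_{k=1}^{n} w_{p,U} \, w_{k,A} \, E_{kp}, \quad \text{since } E_{ip} = E_{jp} \; \forall p = 1, \ldots, n \\
&= m_A^T E \, m_U
\end{aligned}$$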
Neighbor inclusion
Neighbor structure matrix formulation N
  Each Section i corresponds to row i and column i of N
  Entry $N_{ij}$ is 0 if $i \notin psc(j)$
  For $j \in psc(i)$, entry $N_{ij}$ is 1/k where k is the total number of neighbors of i
Example:
[Figure (a): Example Tree of Regulation A, with nodes A1, A2, A3, A4 and A5]
$$N_A = \begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 \\ 1/4 & 0 & 1/4 & 1/4 & 1/4 \\ 1/2 & 1/2 & 0 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 1/2 \\ 0 & 1/2 & 0 & 1/2 & 0 \end{bmatrix}$$
(a) Example Tree of Regulation A (b) A Neighbor Structure Matrix $N_A$
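The formulation can be checked with a small sketch that builds a neighbor structure matrix from a parent map. The tree used here is an assumption inferred from the example matrix: A1 is the root with children A2 and A3, and A4 and A5 are children of A2. The function name `neighbor_matrix` and the parent-map representation are illustrative choices; exact fractions are used so that entries print as 1/2 and 1/4.

```python
from fractions import Fraction


def neighbor_matrix(parent_of, order):
    """N[i][j] = 1/k if j is a parent, sibling or child of i (k = number of such neighbors), else 0."""
    def psc(node):
        neighbors = set()
        parent = parent_of.get(node)
        if parent is not None:
            neighbors.add(parent)
            # Siblings share the same parent.
            neighbors.update(n for n in order
                             if parent_of.get(n) == parent and n != node)
        # Children name this node as their parent.
        neighbors.update(n for n in order if parent_of.get(n) == node)
        return neighbors

    matrix = []
    for i in order:
        neighbors = psc(i)
        k = len(neighbors)
        matrix.append([Fraction(1, k) if j in neighbors else Fraction(0)
                       for j in order])
    return matrix


# Assumed example tree: A1 -> (A2, A3), A2 -> (A4, A5).
parent_of = {"A1": None, "A2": "A1", "A3": "A1", "A4": "A2", "A5": "A2"}
order = ["A1", "A2", "A3", "A4", "A5"]
N_A = neighbor_matrix(parent_of, order)
```

Every row with at least one neighbor sums to 1, so multiplying a score matrix by N averages the scores over each section's parent, siblings and children.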
Matrix representation
Take the average scores of the neighboring pairs
Define
  $\Phi$ = similarity scores between two regulations M and N
  $\Phi_{ij}$ = similarity score between Section i from regulation M and Section j from regulation N
We have $\Phi_{psc\text{-}psc} = N_A \Phi_0 N_U^T$
and $\Phi_{s\text{-}psc} = \tfrac{1}{2} (\Phi_0 N_U^T + N_A \Phi_0)$
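Let $a_i$ = Section i in regulation A
    $u_i$ = Section i in regulation U
$$(N_A)_{ij} = \begin{cases} 0 & \text{if } a_j \notin psc(a_i) \\ \dfrac{1}{\text{sizeof}(psc(a_i))} & \text{otherwise} \end{cases}
\qquad
(N_U)_{ij} = \begin{cases} 0 & \text{if } u_j \notin psc(u_i) \\ \dfrac{1}{\text{sizeof}(psc(u_i))} & \text{otherwise} \end{cases}$$
$$\begin{aligned}
ij\text{th entry of } N_A \Phi_0 N_U^T
&= \sum_{l} \sum_{k} (N_A)_{ik} \, (\Phi_0)_{kl} \, (N_U)_{jl} \\
&= \sum_{l} \sum_{k} (N_A)_{ik} \, f_0(a_k, u_l) \, (N_U)_{jl} \\
&= \sum_{l} \sum_{k} (N_A)_{ik} \, (N_U)_{jl} \, f_0(a_k, u_l) \\
&= \frac{1}{\text{sizeof}(psc(a_i)) \; \text{sizeof}(psc(u_j))} \sum_{a_p \in psc(a_i)} \; \sum_{u_q \in psc(u_j)} f_0(a_p, u_q) \\
&= f_{psc\text{-}psc}(a_i, u_j) \\
&= ij\text{th entry of } \Phi_{psc\text{-}psc}
\end{aligned}$$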
Reference distribution
Reference structure matrix formulation R
  Each Section i corresponds to row i and column i of R
  Entry $R_{ij}$ is 0 if $j \notin ref(i)$
  For $j \in ref(i)$, entry $R_{ij}$ is n/k where n is the number of citations from i to j, k is the total number of references from i
Example:
[Figure (a): Example Tree of Regulation A, with nodes A1 to A5 and reference arrows labeled with citation counts 1, 3, 1, 2, 2 and 5]
$$R_A = \begin{bmatrix} 0 & 1/9 & 5/9 & 3/9 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 2/4 & 0 & 2/4 & 0 \end{bmatrix}$$
(a) Example Tree of Regulation A (b) A Reference Structure Matrix $R_A$
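A reference structure matrix can be sketched the same way from citation counts: each row holds the share of a section's citations that go to each other section. The citation counts below are hypothetical values chosen for illustration, and the function name `reference_matrix` is an assumption.

```python
from fractions import Fraction


def reference_matrix(citations, order):
    """R[i][j] = (citations from i to j) / (total citations made by i), 0 if i cites nothing."""
    matrix = []
    for i in order:
        outgoing = citations.get(i, {})
        total = sum(outgoing.values())
        matrix.append([Fraction(outgoing.get(j, 0), total) if total else Fraction(0)
                       for j in order])
    return matrix


# Hypothetical citation counts: A1 cites A2 once, A3 five times, A4 three times;
# A5 cites A2 and A4 twice each.
citations = {
    "A1": {"A2": 1, "A3": 5, "A4": 3},
    "A5": {"A2": 2, "A4": 2},
}
order = ["A1", "A2", "A3", "A4", "A5"]
R_A = reference_matrix(citations, order)
```

Unlike the neighbor structure matrix, this matrix is not built from a symmetric relationship: a section that cites nothing keeps an all-zero row even if other sections cite it.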
Matrix representation
Take the average scores of the referenced pairs
By replacing neighbor structure matrix N with reference structure matrix R, we have
$\Phi_{ref\text{-}ref} = R_A \Phi_0 R_U^T$
and $\Phi_{s\text{-}ref} = \tfrac{1}{2} (\Phi_0 R_U^T + R_A \Phi_0)$
Reference distribution discussion
Referencing is directional, unlike an immediate neighboring relationship, which leads us to further investigate the semantics of a reciprocal referential relationship
Define
  In-reference of i: from other nodes to i
  Out-reference of i: from i to others
Our matrix formulation can be easily modified to include all combinations
  In-in
  In-out
  Out-out
But should we?
In-in, in-out, out-out reference comparisons
[Figure: three panels showing ADAAG Sections A1, A2, A3 and UFAS Sections U1, U2, U3 connected by in-reference and out-reference arrows, annotated with products of the reference structure matrices $R_A$, $R_U$ and their transposes]
User Survey
Design of the survey
Individual worksheets

ADAAG: 1
Rank  UFAS
1     7
2     9
3     10
4     5
5     2
6     1
7     4
8     8
9     3
10    6

ADAAG: 2
Rank  UFAS
1     5
2     10
3     1
4     4
5     8
6     6
7     3
8     2
9     9
10    7

Final ranking
Rank  ADAAG  UFAS
1     1      7
2     7      10
3     4      1
4     4      4
5     2      5
6     2      10
7     3      3     (tied for rank 7.5)
8     4      2     (tied for rank 7.5)
9     9      9
…     …      …
Latent Semantic Indexing (LSI)
Term-document matrix K
Singular Value Decomposition (SVD) on K:
$$K = P Q R^T$$
Zero out insignificant singular values:
$$K_s = P_s Q_s R_s^T$$
Document-document similarity matrix:
$$\begin{aligned}
K_s^T K_s &= (P_s Q_s R_s^T)^T (P_s Q_s R_s^T) \\
&= R_s Q_s^T P_s^T P_s Q_s R_s^T \\
&= R_s Q_s^2 R_s^T, \quad \text{since } P_s^T P_s = I \text{ and } Q_s^T = Q_s
\end{aligned}$$
Example of DWC ontology
!Disinfectants and Disinfection-byproducts
  !Disinfectants
    ...
    !Chlorine
      +chlorine
      +cl2
      +hypochlorite
      +hypochlorous acid
  !Disinfection Byproducts
    +d/dbp
    +d/dbps
    +dbp
    +dbps
    ...
    !Total Trihalomethanes
      +trihalomethane
      +tthm
      +tthms
  ...
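One way such an ontology could feed synonym matching is sketched below. It assumes that a line starting with "!" names a concept and a line starting with "+" lists a synonym of the most recent concept; that reading of the markers, the function names and the abridged ontology text are all assumptions made for illustration.

```python
def parse_ontology(text):
    """Return {synonym: concept}; '!' opens a concept, '+' adds a synonym to it."""
    synonyms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("!"):
            current = line[1:].strip().lower()
            synonyms[current] = current  # a concept matches its own name
        elif line.startswith("+") and current is not None:
            synonyms[line[1:].strip().lower()] = current
    return synonyms


# Abridged copy of the slide's ontology fragment.
ONTOLOGY = """\
!Chlorine
+chlorine
+cl2
+hypochlorite
+hypochlorous acid
!Total Trihalomethanes
+trihalomethane
+tthm
+tthms
"""

lookup = parse_ontology(ONTOLOGY)


def canonical(term):
    """Map a term to its concept name, or leave it unchanged if unknown."""
    return lookup.get(term.lower(), term.lower())
```

With this lookup, the acronym "TTHM" used in 40 CFR and the spelled-out "total trihalomethanes" used in 22 CCR land on the same feature axis before any cosine is computed.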
Results of different regulation comparisons
5 groups of comparisons, clustered according to domain
  Accessibility standards: Groups 1, 2 and 3
  Drinking water standards: Group 4
  Cross domain comparisons (drinking water standards vs. fire code): Group 5
Observations based on results
Similarities between regulations from accessibility > drinking water > cross-domain
  Drinking water regulations
    Much more voluminous (2600 provisions each)
    Accessibility ~ 500 provisions each
  Diversity of coverage
    National primary, national secondary, customer confidence reports
    Accessibility regulations: focused on disabled access
(Almost) identical provisions in Groups 1, 2 and 4
Different features
  Term-based more important than non term-based
  Ontology is important
  Measurements, effective dates: stringent scoring schemes