TRANS: Transportation Research Analysis using NLP Techniques
Hyoungtae Cho, Melissa Egan, Ferhan Ture
Final Presentation
December 9, 2009
Project Sponsor
Michael Pack
Director, Center for Advanced Transportation Technology Laboratory (CATT Lab)
University of Maryland
Outline
o Motivation
o Goals
o Data
o Methods
  • Clustering
  • Pairwise similarity
o TRANS Demo
o Future work
o Conclusions
Project motivation
• The project was inspired by issues in the transportation research community.
• First issue: Researchers in the field, including Michael Pack, are concerned that funds are used inefficiently because of repetitive research.
  – Many research ideas and projects are published repeatedly with only slight repackaging.
  – Ideally, such projects would be detected at the time of their proposal.
Project motivation, continued
• Second issue: Categorization of research projects within the field.
  – Useful for:
    • Tracking the amount of research done in each sub-field.
    • Understanding research trends within the community.
    • Bringing researchers with similar interests together.
  – At the moment, these tasks are partially managed by the Transportation Research Board (TRB), but this is costly and not always effective.
  – Performing the tasks automatically will produce fast, cheap, and objective results.
  – Visualizing the results will make interpretation and analysis easier, and will communicate them to a larger portion of the community.
Project goals
• First goal: Use natural language processing (NLP) techniques to analyze the research statements from past years.
  – Build a system that can
    1. detect statements that are very similar, and
    2. classify each statement with a topic/category.
  – Create visualizations to highlight interesting results.
    • E.g., trends in transportation research over the years
Project goals, continued
• Second goal: Create a web site to collect and analyze research ideas in the field.
  – Web site should:
    • Allow users to submit research needs statements or ideas.
    • Allow other users to vote on these ideas.
    • Generate appropriate visualizations to summarize research needs and interests.
Data
Preprocessing
• Extract research needs statements and paper abstracts
Research needs statements
Paper abstracts
Clustering
An algorithm that groups similar data points together.
In our work:
– Categories of statements and papers are not available.
– IDEA:
  1. Use clustering to group similar statements.
  2. Assign a category to each cluster.
Features
As a global recession of unprecedented scale threatens to engulf much of the United States economy, congress and federal policy-makers have assembled a large package of government stimulus spending that can reverse job losses and revive consumer demand. Economists identify road construction as a good way to create jobs in the short-term and to boost economic productivity in the long-term by lowering transportation costs. As a result, highways feature prominently in the proposed Congressional economic stimulus bill and about $30 billion in new federal money for pavements, bridges, and tunnels is likely to flow to state departments of transportation (DOTs) in 2009 and 2010
global recess unpreced scale threaten engulf unit state economi congress feder polici maker assembl larg packag govern stimulu spend can revers job loss reviv consum demand economist identifi road construct wai creat job short term boost econom product long term lower transport cost result highwai featur promin propos congression econom stimulu bill 30 billion new feder monei pavement bridg tunnel like flow state depart transport dot 2009 2010
global = 0.03 recess = 0.7 unpreced = 0.41 …
Each document is represented by a vector of feature weights.
weight = (frequency in this document) / (frequency in all documents)
Tokenization removes stop words and truncates words.
A weight is computed for each term.
Only unigrams are used.
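The feature pipeline described on this slide can be sketched roughly as follows. This is a minimal illustration, not the TRANS implementation: the stop-word list, the six-character truncation (standing in for a real stemmer), and the names `tokenize` and `feature_weights` are all assumptions made for the example. The weight follows the ratio stated on the slide.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the slides do not say
# which list was actually used.
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "of", "that", "the", "to"}


def tokenize(text):
    """Lowercase, split into unigrams, drop stop words, truncate words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: keeping the first 6 characters stands in for a stemmer.
    return [w[:6] for w in words if w not in STOP_WORDS]


def feature_weights(docs):
    """Return one {term: weight} dict per document.

    weight = frequency in this document / frequency in all documents
    """
    counts = [Counter(tokenize(d)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    return [{t: n / total[t] for t, n in c.items()} for c in counts]
```

A term that appears in only one document gets weight 1.0 there, while a term spread evenly across many documents gets a small weight in each.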
k-means Clustering
• User chooses the number of clusters k (e.g., k=3).
• k=3 documents are randomly selected as ‘centers’.
• For each document that has not been assigned to a cluster, find the distance from this document to each center and assign it to the nearest one.
• Do the same for each unassigned document.
• The center of cluster 1 is adjusted.
[Animation: documents are labeled C1, C2, or C3 according to their assigned cluster.]
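The procedure above can be sketched as a standard batch k-means loop. This is an illustrative sketch under assumptions the slides do not confirm: Euclidean distance, dense feature vectors, recomputing every center after each full assignment pass, and the function name `kmeans` with its `iterations` and `seed` parameters.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster equal-length numeric vectors; return (labels, centers)."""
    rng = random.Random(seed)
    # k documents are randomly selected as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assign every document to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Adjust each center to the mean of the documents assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centers
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` puts the first two points in one cluster and the last two in the other.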
Since clusters are not labeled by the algorithm, we look at the most frequent terms and manually decide on names:
• Highways
• Construction
• Administration
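Listing the most frequent terms per cluster, as a starting point for choosing names by hand, might look like the sketch below. The function name `top_terms` and its inputs (one token list per document plus one cluster label per document) are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def top_terms(token_lists, labels, n=3):
    """Return the n most frequent terms for each cluster label."""
    per_cluster = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        per_cluster[label].update(tokens)
    return {
        label: [term for term, _ in counts.most_common(n)]
        for label, counts in per_cluster.items()
    }
```

A person would then read each list and pick a name such as "Highways" for a cluster dominated by terms like `highwai` and `pavement`.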
Pairwise Similarity
Given two documents, compute a similarity score:
Scale: 0.0 = no similarity, 1.0 = exactly the same (example scores: 0.28, 0.51, 0.83)
– Can be used to detect duplicate work and generate "more like this" lists
– Uses the same features as clustering
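The slides do not name the similarity measure. One common choice that yields scores between 0.0 and 1.0 for non-negative feature weights is cosine similarity, so the sketch below assumes it. The names `cosine_similarity` and `more_like_this`, and the `{term: weight}` dict representation, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def more_like_this(index, docs, n=3):
    """Indices of the n documents most similar to docs[index]."""
    scores = [
        (cosine_similarity(docs[index], d), j)
        for j, d in enumerate(docs)
        if j != index
    ]
    scores.sort(reverse=True)
    return [j for _, j in scores[:n]]
```

Pairs whose score exceeds a chosen threshold could be flagged as possible duplicate work; the threshold itself would need tuning on real statements.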
Demo
– TRANS Java Applet
– TRANS Web Application
Future Work
o Better features
  • Using N-gram features
  • Transportation ontology
o LDA topic presentation
o Visualization for sub-categorization
o Citation network analysis
Conclusion
o Implemented TRANS, a transportation research visualization tool
  • TRANS tool
  • TRANS web site
o Extend to another academic field