2004 LTI Student Research Symposium
PREFACE
Welcome to the Language Technologies Institute's second annual Student Research
Symposium! Our first symposium, held last year, was very successful, with a collection
of high-quality student presentations spanning the wide spectrum of research problems in
language technology related areas. We think the program selected for this year's
symposium is equally exciting and high in quality, and we hope you will agree, once you
hear the talks!
The program this year was once again selected in a competitive process. We received a
total of 15 submissions, out of which nine abstracts were selected for presentation by the
selection committee. The committee consisted of four faculty members (Alan Black,
Robert Frederking, Alon Lavie, and Alex Rudnicky) and one graduate student (Benjamin
Han). We believe the selected program is a true reflection of the diverse high-quality
research in which the graduate students of the LTI are engaged.
The observant among you may note that we have added a few new features to the SRS
program this year. We have invited Betty Cheng, the prize winner of last year's
symposium, to give a keynote presentation, where we hope to hear about how her
interesting research on using language modeling tools for understanding the structure and
interactions of biological proteins has progressed. Another new feature is the printed
program, which you are holding in your hands. We have also added two honorable
mention cash prizes, in addition to the best presentation award. The awards will be
presented at a brief ceremony at the end of the day, so be sure to stick around!
We wish to thank the faculty and students who have volunteered to serve on the panel
that will select the winning presentations. We also thank Chris Koch for helping with the
logistics of the symposium. Special thanks to Catherine Copetas for her key role in
producing the programs, posters and publicity for the SRS.
We trust you will enjoy the presentations of the second LTI SRS, which we hope will
become an annual tradition in the years to come.
Benjamin Han and Vitor Carvalho, SRS Student Organizers
Alon Lavie, Faculty Advisor
PROGRAM
Time   Event                 Speaker               Title
8:30   Breakfast (Provided)
9:00   Talk 1                Jonathan Brown        Retrieval of Authentic Documents for Reader-Specific Lexical Practice
9:30   Talk 2                Wen Wu                Incremental Detection of Text on Road Signs from Video
10:00  Coffee Break (Provided)
10:30  Talk 3                Kenji Sagae           Using Dependencies for Easy, Fast and Accurate Grammatical/Functional Analysis
11:00  Talk 4                Guy Lebanon           Hyperplane Margin Classifiers on the Multinomial Manifold
11:30  Talk 5                Antoine Raux          Maximum Likelihood Adaptation of Semi-Continuous HMMs by Latent Variable Decomposition of State Distributions
12:00  Lunch (on your own)
13:30  Talk 6 (KEYNOTE)      Betty Yee-Man Cheng   Language Technologist's Approach to Understanding G-Protein-GPCR Interaction
14:00  Talk 7                John Kominek          On the Road to High Quality Universal Speech Synthesis
14:30  Talk 8                Nikesh Garera         Towards a Personal Briefing Assistant
15:00  Coffee Break (Provided)
15:30  Talk 9                Luo Si                Federated Search in Uncooperative Environments
16:00  Talk 10               Satanjeev Banerjee    Automatically Detecting the Structure of Human Meetings
16:30  Break
16:45  Best Presentation Award and Closing Ceremony
• SRS web site: http://www.cs.cmu.edu/~vitor/srs
Jonathan C. Brown [email protected]
Retrieval of Authentic Documents for Reader-Specific Lexical Practice
When a teacher gives a reading assignment in today’s language learning classrooms, all
of the students are almost always reading the same text. Although students have different
reading levels, it is impractical for a single teacher to seek out unique texts matched to
each student’s abilities. In this presentation, I describe REAP, a system designed to
assign each student individualized readings by combining new techniques in reading
difficulty estimation [1] and detailed student and curriculum modeling [2] with the large
amount of authentic materials on the Web. REAP is designed to be used as an additional
resource in teacher-led classes, as well as to be used by reading comprehension
researchers for testing hypotheses on how to improve reading skills for L1 as well as L2
learners. I describe how researchers can use this tool to get fine-grained control over
selection of reading materials, so that they can more easily test these new learning
hypotheses.
Vocabulary acquisition is the primary factor we use in matching texts to a student’s
abilities. These abilities are modeled as a histogram of words. We also model each
desired curriculum level as a histogram of words, learned from a corpus of texts that the
students would normally read. Differences between the student model and that of the
next desired skill level indicate where the student needs to focus. The system can also
prioritize different criteria during the search. For instance, the system can retrieve
documents based solely on the vocabulary terms needed to progress toward the next
level, thereby focusing on curriculum. REAP can also take into account other goals, such
as student interests, special topics, or an upcoming test, all represented as word
histograms. This allows teachers and researchers to decide what they want the students to
focus on for each session.
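The histogram matching described above can be illustrated with a small sketch. This is a minimal illustration, not the actual REAP implementation: the word-frequency representation, the deficit-based scoring, and all function names are assumptions made for this example.

```python
from collections import Counter

def word_histogram(text):
    """Represent a text (or a student/curriculum model) as a word-frequency histogram."""
    return Counter(text.lower().split())

def vocabulary_deficit(student_hist, level_hist):
    """Words that the next curriculum level emphasizes but the student model lacks."""
    return {w: c for w, c in level_hist.items() if student_hist.get(w, 0) < c}

def score_document(doc_text, deficit, interest_hist=None, w_deficit=1.0, w_interest=0.5):
    """Score a candidate Web document by how many 'needed' words it practices,
    optionally blended with a student-interest histogram."""
    doc_hist = word_histogram(doc_text)
    score = w_deficit * sum(min(doc_hist[w], c) for w, c in deficit.items())
    if interest_hist:
        score += w_interest * sum(min(doc_hist[w], c) for w, c in interest_hist.items())
    return score

# Usage sketch: rank retrieved documents for one student.
# deficit = vocabulary_deficit(student_model, next_level_model)
# ranked = sorted(candidate_docs, key=lambda d: score_document(d, deficit), reverse=True)
```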
1. K. Collins-Thompson and J. Callan. (2004.) "A language modeling approach to predicting
reading difficulty." In Proceedings of the HLT/NAACL 2004 Conference. Boston.
2. J. Brown and M. Eskenazi. (2004.) "Retrieval of Authentic Documents for Reader-Specific
Lexical Practice." In Proceedings of InSTIL/ICALL Symposium 2004. Venice, Italy.
Wen Wu [email protected]
Incremental Detection of Text on Road Signs from Video
Automatic detection of text from video is an essential task for video indexing and
understanding. In this talk, we focus on the task of automatically detecting text on road
signs from video. Text on road signs carries useful information necessary for safe driving
and efficient navigation. Automatically detecting text on road signs can help keep a driver
aware of the traffic situation and the surrounding environment. Such a multimedia system
can reduce the driver's cognitive load and enhance driving safety, which is especially
useful for elderly drivers with reduced visual acuity.
In this talk, I will present a fast and robust framework for incrementally detecting text on
road signs from natural scene video. The new framework makes two main contributions.
First, the framework applies a Divide-and-Conquer strategy to decompose the original
task into two sub-tasks: localization of road signs and detection of text.
Corresponding algorithms for the two sub-tasks are proposed and smoothly
incorporated into a unified framework through a real-time feature tracking algorithm.
Second, the framework provides a novel way for text detection from video by integrating
2D features in each video frame (e.g., color, edges, texture) with 3D information
available in a video sequence (e.g., object structure). The feasibility of the proposed
framework has been evaluated on 22 video sequences captured from a moving vehicle.
The new framework achieves an overall text detection rate of 88.9% and a false hit rate of
9.2%, making it suitable for use in a driving assistant system and other
video text detection tasks.
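As a rough, hedged illustration of the divide-and-conquer pipeline described above (sign localization, real-time feature tracking, then text detection on stable regions), here is a schematic sketch. The three callables are placeholders, not the authors' actual algorithms, and the persistence threshold is an arbitrary assumption.

```python
def detect_text_on_road_signs(frames, localize_signs, track_regions, detect_text):
    """Schematic incremental pipeline over a video sequence.
    localize_signs(frame)                -> candidate sign regions from 2D cues (color, edges, texture)
    track_regions(frame, regions, cands) -> regions linked across frames (adds structural evidence)
    detect_text(frame, region)           -> text bounding boxes inside one region"""
    tracked = []        # candidate sign regions carried across frames, each with an 'age' counter
    detections = []
    for t, frame in enumerate(frames):
        candidates = localize_signs(frame)                 # sub-task 1: road sign localization
        tracked = track_regions(frame, tracked, candidates)
        for region in tracked:
            if region["age"] >= 3:                         # only regions stable across several frames
                boxes = detect_text(frame, region)         # sub-task 2: text detection
                if boxes:
                    detections.append((t, region, boxes))
    return detections
```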
1. W. Wu, X. Chen and J. Yang. Incremental Detection of Text on Road Signs from Video with
Application to a Driving Assistant System. To appear in ACM Multimedia, New York, USA,
2004. (Oral Presentation).
Kenji Sagae [email protected]
Using Dependencies for Easy, Fast and Accurate Grammatical/Functional Analysis
Modern statistical syntactic parsers have achieved very high levels of accuracy over the
past ten years, and we have begun to see their impact on several areas of language
technologies, such as question answering, machine translation, and semantic-role
labeling. Because the Penn Treebank (PTB) is widely used for training of such parsers, it
is common to associate PTB-style constituent trees with statistical parsing. However,
there are instances where other syntactic representations would be easier to use, and just
as useful (if not more). One such instance is the assignment of grammatical relations (or
even PTB function tags) to words. In this case, dependencies are not only easier to
understand and faster to annotate, but also easier to process and largely just as effective.
I will discuss a simple representation based on lexical dependencies, which I have been
using in the syntactic analysis of parent-child dialogs. I will present a simple
deterministic algorithm for dependency parsing, and show that the accuracy of the
dependencies it produces is very close to the accuracy of current PTB constituent
statistical parsers (91% vs. 93%). Although PTB constituent parsers have a slight edge,
they are quite complex. I will show that a dependency parser that performs almost as
well can be surprisingly simple and fast.
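To make the idea of a simple, fast deterministic dependency parser concrete, here is a generic arc-standard shift-reduce sketch. This is an illustration of the general technique, not the specific algorithm presented in the talk; the action classifier is assumed to have been trained separately.

```python
def parse(words, choose_action):
    """Deterministic (single-pass) shift-reduce dependency parsing, arc-standard style.
    choose_action(stack, buffer, words) is a trained classifier returning
    'SHIFT', 'LEFT-ARC', or 'RIGHT-ARC' for the current configuration."""
    stack, buffer = [], list(range(len(words)))
    arcs = []                                   # (head_index, dependent_index) pairs
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer, words)
        if action == "LEFT-ARC" and len(stack) >= 2:
            dep = stack.pop(-2)                 # second item becomes dependent of the top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            dep = stack.pop()                   # top becomes dependent of the item below it
            arcs.append((stack[-1], dep))
        elif buffer:                            # SHIFT (also the fallback to guarantee progress)
            stack.append(buffer.pop(0))
        else:
            break
    return arcs
```

Each arc produced this way can then be labeled with a grammatical relation by a separate classifier, as discussed below.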
I will also discuss how these dependencies can be used to determine PTB function tags
(such as subject, predicate, temporal, beneficiary, locative, etc). The current state-of-the-
art in assigning function tags to text is the work of Blaheta (2000, 2003), which uses
(among other features) PTB parse tree nodes. I will present very similar results
using no constituent information, only dependencies. Both methods achieve an overall
accuracy of about 87% in function tagging (not counting NULL tags). Blaheta's method
is slightly better on tags classified as "grammatical" (subject, predicate, etc.), while the
dependency approach is slightly better on "form/function" tags (temporal, locative,
manner, etc.).
This approach to function tagging can also be used to label all dependency arcs, when
training data is available. In fact, a relatively small training corpus (less than 10,000
words) can be used to produce a system that assigns a grammatical relation label to every
dependency arc with an accuracy of about 90% in a corpus of parent-child dialogs.
Guy Lebanon [email protected]
Hyperplane Margin Classifiers on the Multinomial Manifold
Linear classifiers are a mainstay of machine learning algorithms, forming the basis for
techniques such as the perceptron, logistic regression, boosting, and support vector
machines. A linear classifier, parameterized by a vector $\theta \in \mathbb{R}^n$, classifies examples
according to the decision rule
$$\hat{y}(x) = \operatorname{sign}\,\langle \theta, x \rangle,$$
following the common practice of identifying $x$ with the feature vector $\Phi(x)$. The
differences between different linear classifiers lie in the criteria and algorithms used for
selecting the parameter vector $\theta$ based on a training set.
Geometrically, the decision surface of a linear classifier is formed by a hyperplane or
linear subspace in n-dimensional Euclidean space,
$$H_\theta = \{\, x \in \mathbb{R}^n : \langle \theta, x \rangle = 0 \,\},$$
where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. (In both the algebraic and geometric
formulations, a bias term is sometimes added; we prefer to absorb the bias into the
notation given by the inner product, by setting $x_n = 1$ for all $x$.) The linearity assumption
made by such classifiers may be justified as a solution to the fundamental learning
tradeoff between model complexity and restricted expressiveness.
However, we show that implicit in this argument is the presence of Euclidean geometry.
If the data is not well described by Euclidean geometry, the main motivation for linear
classifiers fails, and a generalization of linear classifiers, adapted to the geometry at hand,
is expected to perform better. In this work, we generalize the notion of linear hyperplane
and margin to arbitrary Riemannian geometries. The natural generalization of logistic
regression is then defined and its properties are examined. We focus our attention on the
Fisher geometry of the multinomial manifold that forms a natural geometric space for text
documents. The resulting generalization of logistic regression is shown to outperform its
Euclidean counterpart on several standard text classification tasks.
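For readers unfamiliar with the geometry involved, the following standard information-geometry facts (not restated in the abstract, but standard in this line of work) show what replaces the Euclidean inner product and distance on the multinomial manifold:

```latex
% The multinomial manifold: all positive probability distributions over n+1 outcomes
% (e.g., normalized term-frequency vectors of text documents).
\mathbb{P}_n = \Big\{\, \theta \in \mathbb{R}^{n+1} \;:\; \theta_i > 0,\ \textstyle\sum_i \theta_i = 1 \,\Big\}

% Fisher information metric at \theta, for tangent vectors u, v:
g_\theta(u, v) = \sum_{i=1}^{n+1} \frac{u_i \, v_i}{\theta_i}

% Under the isometry \theta \mapsto 2\sqrt{\theta} onto (the positive orthant of) a sphere
% of radius 2, the geodesic distance between two documents \theta and \theta' is
d(\theta, \theta') = 2 \arccos\!\Big( \sum_{i=1}^{n+1} \sqrt{\theta_i \, \theta'_i} \Big)
```

Roughly speaking, the generalized hyperplanes and margins are defined with respect to this geodesic distance rather than the Euclidean one.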
Antoine Raux [email protected]
Maximum Likelihood Adaptation of Semi-Continuous HMMs by Latent Variable Decomposition of State Distributions
Hidden Markov Models, the most widely used method for speech recognition, involve two
types of parameters: transition probabilities, which model the temporal aspect of speech,
and output distribution parameters (usually means, variances and weights of Gaussian
mixtures), which capture the spectral properties of sub-phonemic units, each unit being
equivalent to a state in the model. In Continuous Density HMMs (CDHMMs), each state
has its own output distribution, independent of that of other states. While this makes for
powerful models, it implies the use of a large number of Gaussians, since there are
typically on the order of several thousand states and tens or hundreds of Gaussians per
mixture. This requires a large amount of training data and makes the use of such models
computationally expensive. On the other hand, in Semi-Continuous HMMs (SCHMMs),
all the states share a single set of Gaussians and only the mixture weights depend on the
state. Compared to CDHMMs, SCHMMs are more compact in size, require less data to
train well and result in comparable recognition performance with much faster decoding
speeds. Nevertheless, the use of SCHMMs in large vocabulary speech recognition
systems has declined considerably in recent years. A significant factor that has
contributed to this is that systems that use SCHMMs cannot be easily adapted to new
acoustic (environmental or speaker) conditions. While maximum likelihood (ML)
adaptation techniques have been very successful for CDHMMs, these have not worked to
a usable degree for SCHMMs. In this talk, I will present a new framework for supervised
ML adaptation of SCHMMs, built upon the paradigm of Probabilistic Latent Semantic
Analysis (PLSA). We use PLSA to decompose the probability distribution of each
Gaussian given the state (i.e. the mixture weights) according to a latent variable. The
decomposition is performed using a variant of the Expectation Maximization algorithm. I
will show how our approach is equivalent to smoothing the mixture weight matrix
obtained by retraining the original model on a small amount of adaptation data.
Experiments on non-native speech recognition in the framework of the Let’s Go spoken
dialogue system demonstrate the effectiveness of this method.
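A minimal notational sketch of the latent-variable decomposition described above (the notation and the exact form of the adaptation step are my assumptions; the abstract does not spell them out):

```latex
% Semi-continuous HMM: all states share one codebook of Gaussians;
% only the mixture weights P(g | s) depend on the state s.
p(o \mid s) = \sum_{g} P(g \mid s)\, \mathcal{N}(o \mid \mu_g, \Sigma_g)

% PLSA-style decomposition of the mixture-weight matrix through a latent variable z:
P(g \mid s) = \sum_{z} P(g \mid z)\, P(z \mid s)

% P(g | z) and P(z | s) are estimated with an EM variant on the adaptation data;
% because z takes far fewer values than there are states or Gaussians, re-estimating
% only these factors acts as a smoothed update of the full weight matrix.
```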
Betty Yee-Man Cheng [email protected]
KEYNOTE: Language Technologist’s Approach to Understanding G-Protein-GPCR Interaction
String alignments and n-grams are commonly used in language technology applications,
such as machine translation, information retrieval, speech recognition and synthesis. In
machine translation, alignment can yield high accuracy if the source and target languages
have similar word order. However, if the two languages have very different word order,
getting a correct alignment can be difficult and an n-gram based MT system may perform
better. Likewise, a correct alignment of protein sequences can yield high accuracy in
prediction problems. But segments or “words” in the protein sequence can shuffle in
their linear order while preserving their orientation in 3D space and therefore the
protein’s function or “meaning” as well.
The superfamily of proteins in this study, the G-protein coupled receptors (GPCRs), is
important in pharmacological research as its members are the target of approximately 60% of
current drugs on the market (Muller, 2000). Coupling with G-proteins, these receptors
regulate much of the cell’s reactions to external stimuli. Abnormalities in this regulation
can lead to cancer, Alzheimer’s, Parkinson’s and other diseases. Identification of the
type of G-proteins that can bind to a particular GPCR can provide information on the
causes and symptoms of the disease the receptor is involved in.
Previous studies on predicting the family of G-proteins that can couple to a given GPCR
sequence have focused on the intracellular domains of the receptor sequence, either using
alignment-based features (Cao et al., 2003; Qian et al., 2003), n-gram features (Moller et
al., 2001) or physiochemical properties of the amino acids (Henriksson, 2003). From the
roles of alignments and n-grams in MT and their analogy to the protein language, we
have chosen to combine alignment and n-gram information in a hybrid prediction method
using a k-nearest neighbours (k-NN) classifier on sequence alignment similarity and a k-
NN classifier on Euclidean distance of n-gram counts. Our method outperforms the
current state-of-the-art in precision, recall and F1. Systematic experiments with our
prediction method were able to validate the biologists' hypothesis that most of the coupling
specificity information resides in the 2nd and 3rd intracellular loops of the receptor, while
providing evidence for a new hypothesis that the information is more localized to the
beginning of the 2nd intracellular loop.
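A toy sketch of the n-gram half of such a hybrid classifier: k-NN over Euclidean distance between amino-acid n-gram count vectors. The alignment-similarity k-NN, the combination of the two classifiers, and all names below are illustrative assumptions, not the method's actual code.

```python
from collections import Counter
from math import sqrt

def ngram_counts(sequence, n=2):
    """Count overlapping amino-acid n-grams in a protein sequence string."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def euclidean(c1, c2):
    keys = set(c1) | set(c2)
    return sqrt(sum((c1.get(k, 0) - c2.get(k, 0)) ** 2 for k in keys))

def knn_predict(query_seq, labeled_seqs, k=3, n=2):
    """Predict the G-protein coupling family of a GPCR by majority vote among the
    k training sequences nearest in n-gram space.
    labeled_seqs: list of (sequence, family_label) pairs."""
    q = ngram_counts(query_seq, n)
    nearest = sorted(labeled_seqs,
                     key=lambda item: euclidean(q, ngram_counts(item[0], n)))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```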
1. Cao, J., R. Panetta, et al. (2003). "A naive Bayes model to predict coupling between seven
transmembrane domain receptors and G-proteins." Bioinformatics 19(2): 234-40.
2. Henriksson, A. (2003). Prediction of G-protein Coupling of GPCRs - A Chemometric
Approach. Engineering Biology. Linkoping, Linkoping University: 79.
3. Moller, S., J. Vilo, et al. (2001). "Prediction of the coupling specificity of G protein coupled
receptors to their G proteins." Bioinformatics 17 Suppl 1: S174-81.
4. Muller, G. (2000). "Towards 3D structures of G protein-coupled receptors: a multidisciplinary
approach." Curr Med Chem 7(9): 861-88.
5. Qian, B., O. S. Soyer, et al. (2003). "Depicting a protein's two faces: GPCR classification by
phylogenetic tree-based HMMs." FEBS Lett 554(1-2): 95-9.
John Kominek [email protected]
On the Road to High Quality Universal Speech Synthesis
Machine Translation has the Vauquois Triangle -- a famous high-level perspective that
delineates the major approaches to MT, as well as their limitations. You can have either
universality (through an Interlingua) or high quality (Direct translation), but not both. In
between, trying to find a happy medium, reside Transfer techniques.
The field of Speech Synthesis also has such a triangle, with similarly frustrating trade-
offs: either high quality or full flexibility, but not both. In this talk I begin by drawing the
corresponding parallels, explaining where the three major approaches fit in, and their
historical development. These three are unit-selection, spectrogram-based, and
articulatory synthesis.
By directly employing segments of recorded speech, unit-selection synthesis can achieve
excellent voice quality, but at the expense of flexibility. A universal synthesizer, ideally,
can mimic any person in any language, in a full range of styles. Achieving this, though,
demands precise modeling of the human vocal tract and articulators -- as yet an unsolved
problem. In between, spectrogram-based synthesizers offer good controllability, but do
not sound as natural as unit-selection techniques.
Two paths can thus be taken on the road to high quality universal synthesis. One can start
with a flexible synthesizer and attempt to make it sound better. Or one can start with a
good sounding synthesizer and try to make it more flexible. This talk will follow the
second path.
To illustrate, we tackle the problem of "accent transformation" -- changing the accent of
one person to sound more like that of another. This is made possible using CMU's
recently created "Arctic Speech Databases," a parallel corpus of carefully spoken English
sentences. Editions exist for American, Canadian, Scottish, Indian, and Japanese accented
English. Grafting a new accent onto an existing voice is desirable for localizing a
synthesizer to match a target region; moving in the opposite direction, it can make a
native voice sound foreign, hence "exotic".
Nikesh Garera [email protected]
Towards a Personal Briefing Assistant
The preparation of summary reports from raw information is a common task in research
projects. A tool that highlights useful items for a summary would allow report writers to
be more productive, by reducing the time needed to assess individual items. It has further
potential benefit in that it can be used to create user-specific or audience-specific digests.
In the latter case, multiple tailored reports could in principle be generated from the same
input information. With this motivation, we present a design of an adaptive system that
learns to extract important items from weekly interviews by observing the behavior of
human summary authors.
Our application scenario involves a report writer producing digests on a week-to-week
basis and our goal is to make this person more efficient over time. We propose to do this
by presenting the writer with successively better ordered lists of items (such that digest-
worthy items appear at the top of the ordered list).
We identified salient features used for learning in this new domain by studying the corpus
of project interviews. This corpus consisted of weekly progress interviews of project
members collected over a period of 4 months. The features were then annotated in the
corpus and were used as parameters in a regression model. This model is incrementally
trained from user input and is used to reorder items in successive weeks. We measure the
user effort in terms of how far down the user has to go in the list in order to select all
important items in a weekly set.
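A rough sketch of the reorder-and-update loop described above. The choice of scikit-learn's SGDRegressor, the 0/1 targets, and the feature extraction are assumptions for illustration, not the system's actual implementation.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()                 # incrementally trainable regression model

def rank_items(items, extract_features):
    """Order this week's interview items so that likely digest-worthy ones come first."""
    X = np.array([extract_features(it) for it in items])
    try:
        scores = model.predict(X)
    except Exception:                  # week 1: model not yet fitted, keep original order
        scores = np.zeros(len(items))
    return [items[i] for i in np.argsort(-scores)]

def update_from_selection(items, selected, extract_features):
    """After the writer picks items for the digest, treat selections as target 1.0
    and the rest as 0.0, and update the model incrementally."""
    X = np.array([extract_features(it) for it in items])
    y = np.array([1.0 if it in selected else 0.0 for it in items])
    model.partial_fit(X, y)
```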
In our evaluation study, 7 expert subjects (project members, managers) were asked to
create 5-item summaries for 12 successive weeks, using a selection interface. With the
assistance of our system, average precision improves by a factor of more than 2.21 by the
end of the learning period, compared to a no-learning baseline.
Other evaluation metrics also show significant improvement. A low inter-rater
agreement (Kappa=0.26) indicates that the subjects are selecting different items and the
learned models are individual. Moreover, the different feature weights in the regression
models for each subject identify their summarization differences. We also report our
ongoing work of automatic feature extraction to make this approach domain independent.
The talk will include a short demonstration of our system showing how the learned
models can be used to populate a template for a standard quarterly report.
Luo Si [email protected]
Federated Search in Uncooperative Environments
Conventional search engines such as Google or AltaVista are effective when an
information source allows its contents to be crawled and indexed in a centralized
database. However, a large amount of information cannot be crawled and searched by
conventional search engines, either due to intellectual property protection or frequent
information updates. This type of information is valuable. For example, hidden Web
content that cannot be searched by conventional search engines has been estimated to
be 2-50 times larger than the visible Web and is often created and maintained by
professionals.
Federated search provides a solution for searching the information that
cannot be searched by conventional search engines. It includes three sub-problems: i)
acquiring information about the contents of each information source (resource
representation), ii) ranking the sources and selecting a small number of them for a given
query (resource ranking), and iii) merging the results returned from the selected sources
into a single ranked list (result-merging).
This work addresses federated search problems in uncooperative environments such as
the Web, where information sources cannot be assumed to share their contents or use the
same type of search engine. Empirically effective solutions have been proposed for the full
range of federated search sub-problems such as new algorithms for information source
estimation, resource selection and results merging.
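To give a flavor of one of these sub-problems, here is a hedged sketch of regression-based result merging: each selected source's scores are mapped onto a common scale using documents that also appear in a small centralized sample index, then a single ranked list is produced. The linear mapping and all names are illustrative assumptions, not the exact published algorithms.

```python
import numpy as np

def fit_score_mapping(source_scores, central_scores):
    """Least-squares fit of a linear map a*s + b from one source's scores to the
    centralized-sample scale, using documents scored by both (the 'overlap' documents)."""
    A = np.vstack([source_scores, np.ones(len(source_scores))]).T
    a, b = np.linalg.lstsq(A, np.asarray(central_scores, dtype=float), rcond=None)[0]
    return lambda s: a * s + b

def merge_results(per_source_results, mappings, top_k=20):
    """per_source_results: {source: [(doc_id, source_score), ...]}
    mappings: {source: mapping fitted by fit_score_mapping}.
    Returns one globally ranked result list."""
    merged = [(doc_id, mappings[src](score))
              for src, results in per_source_results.items()
              for doc_id, score in results]
    merged.sort(key=lambda x: -x[1])
    return merged[:top_k]
```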
Furthermore, a unified utility maximization framework is proposed to combine the
separate solutions together to construct effective systems of different federated search
applications. This is the first probabilistic framework for integrating the different
components of a federated search system. This unified view of the federated search task
provides a new opportunity to utilize available information. It enables us to configure
individual components globally to obtain the desired overall results for different applications,
which is superior to the previous practice of simply combining individually effective
solutions.
This work advances the state of the art in federated search. Its stronger theoretical
foundation, better empirical results, and better modeling of real-world applications make
the new research a bridge that turns federated search from an appealing research topic
into a much more practical tool.
1. Si, L. & Callan, J. (2002a). Using sampled data and regression to merge search engine results.
   In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and
   Development in Information Retrieval.
2. Si, L. & Callan, J. (2003a). Relevant document distribution estimation method for resource
   selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on
   Research and Development in Information Retrieval.
3. Si, L. & Callan, J. (2003b). A semi-supervised learning method to merge search engine
   results. ACM Transactions on Information Systems, 21(4).
4. Si, L. & Callan, J. (2004). The effect of database size distribution on resource selection
   algorithms. In Distributed Multimedia Information Retrieval, LNCS 2924, Springer.
5. Si, L. & Callan, J. (2004). Unified utility maximization for distributed information retrieval
   in uncooperative environments. In Proceedings of the 13th International Conference on
   Information and Knowledge Management, ACM.
Satanjeev Banerjee [email protected]
Automatically Detecting the Structure of Human Meetings
We are interested in automatically extracting the structure of meetings between humans.
Such structure includes the state of a meeting (presentation, discussion, etc), the roles of
each meeting participant (presenter, discussion participant, observer, etc.), the
onset/offset boundaries of agenda items, and the onset/offset boundaries of regions of
decisions (such as action items). In this talk we will describe our current research on
detecting these various aspects of human meetings.
In particular, we will present a simple taxonomy of meeting states and participant roles.
We trained a decision tree classifier that learns to detect these states and roles from
simple speech-based features such as the number of speakers and the lengths of
utterances and speech-overlaps. This classifier detects meeting states 18% absolute more
accurately than a random classifier, and detects participant roles 10% absolute more
accurately than a majority classifier. We will then report on the effect of adding more
advanced features such as the words in the utterances as output by an automatic speech
recognizer, as well as features drawn from other modalities such as the body positions
and face directions of the various participants relative to each other as output by a
camera-image processor.
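A minimal sketch of a classifier of the kind described above, trained on simple speech-based features. The feature set, the tiny toy data, and the use of scikit-learn are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One row per short window of the meeting: number of active speakers,
# mean utterance length (seconds), and fraction of time with overlapping speech.
X = np.array([
    [1, 22.0, 0.02],    # one long speaker, little overlap
    [3,  4.5, 0.30],    # several speakers, short turns, frequent overlap
    [2,  8.0, 0.10],
    [1, 30.0, 0.01],
])
y = np.array(["presentation", "discussion", "discussion", "presentation"])

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Predict the meeting state for a new window of audio features.
print(clf.predict(np.array([[4, 3.0, 0.25]])))    # likely "discussion"
```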
Finally we will present initial research on agenda item and decision region boundary
detection. Unlike meeting state and participant role detection, the problem of detecting
agenda items and decision regions does not easily lend itself to a typical machine learning
approach, since there are no clear pre-defined classes. However, preliminary observations
of recorded meeting data suggest that different agenda items usually differ substantially both in
the patterns of words used to discuss them and in the identities of the
participants involved in those discussions. We will report on our ongoing research
where we draw upon ideas from the realm of topic tracking and leverage the above
characteristics to perform agenda item/decision region detection.