State Key Lab of Intelligent Technology & Systems
Natural Language Processing Group
Tsinghua University
http://nlp.csai.tsinghua.edu.cn
THUNLP Leader: Prof. Maosong Sun
Professor Sun is the Chairman of the Department of Computer Science and Technology, Tsinghua University. His research
interests are computational linguistics and statistical and corpus-based natural language processing (NLP), including Chinese
language computing (computational morphology, bilingual terminology extraction), information retrieval (Chinese text
categorization, graphical-model-based keyword extraction), collective intelligence (tag generation, Web trend analysis), and
social computing (query log analysis, community discovery). He has participated as project leader or principal researcher in
over 20 projects funded by the National Natural Science Foundation of China, the National Social Science Foundation of
China, the 863 National High-Tech R&D Program of China, and the 973 National Basic Research Program of China, as well as in
projects funded by a number of international IT companies. Professor Sun has published, together with his students, about 130
papers in academic journals and international conferences in the above fields. These papers have roughly 1,400 citations in
total on Google Scholar. He has served as a program committee member for numerous national and international
conferences, and many times as conference chair or program committee chair.
About THUNLP
The Natural Language Processing Group at the Department of Computer Science and Technology, Tsinghua University
(THUNLP), also a part of the National Lab for Information Science and Technology and the State Key Lab of Intelligent
Technology and Systems, works on methodologies and algorithms for computer processing and understanding of human
languages, with an emphasis on Chinese. We focus on basic research in language computation as well as
application-oriented NLP technologies. In recent years we have published a number of papers in top conferences and journals
in the field, such as ACL, COLING, EMNLP, IJCAI, VLDB, Computational Linguistics, Journal of Quantitative Linguistics, and
IEEE Intelligent Systems.
Recently, Professor Sun proposed a point of view in NLP: NLP based on huge-scale naturally annotated corpora. The basic idea is that, with Web-scale corpora, natural
annotation may help machines perform some NLP tasks better. There are two types of natural annotation: explicit, such as punctuation, anchor text, query logs, Wikipedia, and blog
tags; and implicit, such as language usage patterns. He further puts forward a fundamental problem: if we could systematically integrate all the information provided by naturally
annotated corpora from different perspectives, could machines achieve some degree of deep understanding of language? Preliminary work by him
and his student, published in Computational Linguistics in 2009, showed the usefulness of punctuation in Chinese word segmentation, suggesting that this idea deserves further study.
He is the Vice President of the Chinese Information Processing Society, a council member of the China Computer Federation, a council member of the Chinese Association for
Artificial Intelligence, an officer of ACL SIGHAN, a member-at-large of the ACM China Council, the vice chairman of the Expert Committee of the Language Commission of Beijing
Municipal Government, a member of the Expert Committee of the National Language Resource Surveillance and Research Center, the Editor-in-Chief of the Journal of Chinese
Information Processing, and an editorial board member of many journals, including the Communications of CCF, the Journal of Computer Science and Technology, the Journal
of Chinese Language and Computing, Applied Linguistics, and Nankai Linguistics.
Research Interests
Our research covers a range of topics in natural language processing, including:
NLP Based on Huge-scale Naturally Annotated Corpora
  Word segmentation using punctuation in huge-scale web articles
  New word detection and related word retrieval from user logs of Chinese input method
  Chinese abbreviation extraction from anchor texts in web pages
  New word detection from user logs of search engine

Social Tagging and Keyword Extraction
  Tag disambiguation
  Tag suggestions using topic models
  Tag suggestions via Latent Reason Identification
  Exploring subsumption relations in social tags
  Keyword extraction by clustering to find exemplar terms
  Keyword extraction via topic decomposition

Text Classification
  Feature selection for Chinese text classification
  Scalable term selection for text classification
  Efficient text classification using term projection
  Transfer learning and self training for text classification
  Text classification-based image classification

Multilingual Analysis
  Fast and robust sentence alignment algorithm
  Bilingual terminology extraction system
  Statistical method for Uyghur tokenization
  Uyghur morpheme analysis
  "Female Script" pinyin input method