Study in Spatial Distribution Analysis of Science Research Activities based on Toponym Resolution in Text

Jianxia Ma 1, Guodong Cheng 2, Shaoxiong Liu 1, Hanqing Ma 1, Jinhui Ma 3, Na Li 1

1. The Lanzhou Branch of the National Science Library, Chinese Academy of Sciences, Lanzhou 730000, China
2. Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences, Lanzhou 730000, China
3. College of Earth and Environmental Science, Lanzhou University, Lanzhou 730000, China

Collnet 2012, Seoul, Korea



Outline

Background
Introduction to Related Study
Framework of the Analysis Tool
Spatial Analysis of Research Activity in Sporopollen in China
Conclusion

Background

Recently, many scholars and applications have begun to visualize the results of analyzing scientific papers by combining them with GIS.

Most of these studies rely on author addresses supplied directly by the authors.

There are few reports on analyzing the distribution of research areas through text mining of research papers, especially papers written in Chinese.

Katy Börner, Shashikant Penumarthy, Mark Meiss, et al. Mapping the Diffusion of Information Among Major U.S. Research Institutions. Scientometrics, 2006, 68(3): 415-426.

Xuemei Wang, Mingguo Ma. Spatial information mining and visualization for Qinghai-Tibet Plateau's literature based on GIS [A]. In: Yaolin Liu, Xinming Tang. International Symposium on Spatial Analysis, Spatial-Temporal Data Mining [C]. Wuhan, Proc. of SPIE, 2009, 1-8.

Lutz Bornmann, Ludo Waltman. The detection of "hot regions" in the geography of science – A visualization approach by using density maps. arXiv:1102.3862v2.

Lutz Bornmann, Loet Leydesdorff, Christiane Walch-Solimena, Christoph Ettl. Mapping excellence in the geography of science: An approach based on Scopus data.

Background

In earth science and in resources- and environment-related fields, research is closely tied to specific locations.

Reading articles one by one and annotating their research areas by hand is an inefficient way to understand how research areas are distributed. Working this way, it is hard to grasp where the research blanks and hot spots are.

Background

By automatically recognizing and marking the geographical names mentioned in research papers, we can analyze the spatial distribution of research activities in a field and identify its hot areas and blank areas.

This helps decision makers and researchers adjust research strategy and optimize the allocation of research resources. It also adds a new spatial dimension to traditional information analysis.

Background

Possibility: Can we mine hidden geographical knowledge from large-scale collections of research papers to support spatial analysis of research activity?

How? How can we analyze geographical features in massive text collections and mine the hidden knowledge efficiently?

Key: Toponym resolution in research articles, which comprises two tasks, namely geo-parsing and geo-coding.

Introduction to Related Study

Geo-parsing

Geo-parsing consists of detecting and extracting the geographic names mentioned in the unstructured text of an article or a Web page, using Named Entity Recognition (NER) techniques.

Gazetteer-based extraction
Simple and allows efficient implementations, at the cost of some precision in toponym extraction.
Building a gazetteer with full coverage is tedious.

Natural language processing, generally based on statistical models
Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and Support Vector Machines (SVMs) have been widely discussed in the literature for extracting geographic names.
These models require a lot of training data and are corpus dependent.
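The gazetteer-based approach can be sketched as a longest-match dictionary lookup. This is a minimal illustration only, not the authors' implementation: the tiny `GAZETTEER` set and the function name `extract_toponyms` are hypothetical. Matching on raw characters also shows where the loss of precision comes from, since any substring that happens to equal a gazetteer entry is reported.

```python
# Hypothetical toy gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {"Lanzhou", "Qinghai Lake", "Qinghai", "Junggar Basin", "Beijing"}


def extract_toponyms(text, gazetteer=GAZETTEER):
    """Return (start, end, name) for each longest gazetteer match in text."""
    max_len = max(len(name) for name in gazetteer)
    matches = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest candidate first so "Qinghai Lake" beats "Qinghai".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in gazetteer:
                found = candidate
                break
        if found:
            matches.append((i, i + len(found), found))
            i += len(found)  # skip past the match to avoid overlaps
        else:
            i += 1
    return matches


if __name__ == "__main__":
    sample = "Pollen records from Qinghai Lake and the Junggar Basin"
    for start, end, name in extract_toponyms(sample):
        print(start, end, name)
```

Because the lookup works character by character, the same sketch applies to Chinese text, where there are no spaces between words.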

Geo-coding

Geo-coding is the key step in correlating textual information with maps.
A gazetteer, or geographical knowledge base, is the key component.

A well-designed digital gazetteer can support geo-entity identification, toponym disambiguation, and geo-coding.

Well-known digital gazetteers include the ADL Gazetteer, Getty TGN, and GeoNames.

Some digital map services, including Google Maps, Microsoft Bing Maps, Yahoo PlaceFinder, and Baidu Maps, provide APIs for geo-coding.
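As a rough sketch of what geo-coding against a gazetteer involves, the snippet below looks up extracted toponyms in a small in-memory table and reports the share that could be located. The table, the function names, and the rounded coordinates are illustrative assumptions; a real system would query a full gazetteer or one of the map service APIs instead.

```python
# Hypothetical mini-gazetteer: name -> approximate (latitude, longitude).
GAZETTEER = {
    "Beijing": (39.90, 116.40),
    "Lanzhou": (36.06, 103.83),
    "Shanghai": (31.23, 121.47),
}


def geocode(toponyms, gazetteer=GAZETTEER):
    """Split toponyms into located {name: (lat, lon)} and unresolved names."""
    located, unresolved = {}, []
    for name in toponyms:
        if name in gazetteer:
            located[name] = gazetteer[name]
        else:
            unresolved.append(name)
    return located, unresolved


def geocoding_rate(toponyms, gazetteer=GAZETTEER):
    """Fraction of the input toponyms that the gazetteer can locate."""
    if not toponyms:
        return 0.0
    hits = sum(1 for name in toponyms if name in gazetteer)
    return hits / len(toponyms)


if __name__ == "__main__":
    names = ["Beijing", "Lanzhou", "Atlantis", "Shanghai"]
    located, unresolved = geocode(names)
    print(located)
    print(unresolved)
    print(geocoding_rate(names))
```

Names left in `unresolved` are the ones that would need a larger gazetteer, alias handling, or a fallback to an external service.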

Chinese Toponym Extraction

Unlike English, Chinese text has no spaces to mark word boundaries.

Previous research focused on syntax rules and word segmentation. Statistical models have been used to identify unknown geographical names in Chinese text.

This research was carried out mainly on web pages and news; little of it addressed research papers.
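Because Chinese has no word boundaries, statistical taggers are often trained on character-level labels. The sketch below turns hand-annotated toponym spans into B/I/O tags and writes them one character per line with the label in the last column and a blank line closing the sentence, which is the general column layout CRF++ expects for training data. The `B-LOC`/`I-LOC`/`O` tag set and both function names are assumptions made for illustration; the talk does not describe the authors' actual tag set or feature template.

```python
def bio_encode(sentence, spans):
    """Label each character of sentence with B-LOC, I-LOC, or O.

    spans is a list of (start, end) character offsets, end exclusive,
    marking the toponyms annotated in the sentence.
    """
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B-LOC"            # first character of a toponym
        for i in range(start + 1, end):
            tags[i] = "I-LOC"            # remaining characters of the toponym
    return list(zip(sentence, tags))


def to_column_format(pairs):
    """Render (char, tag) pairs as tab-separated lines plus a blank line."""
    return "\n".join(f"{ch}\t{tag}" for ch, tag in pairs) + "\n\n"


if __name__ == "__main__":
    # "Lanzhou is located in Gansu": toponyms at offsets 0-2 and 4-6.
    pairs = bio_encode("兰州位于甘肃", [(0, 2), (4, 6)])
    print(to_column_format(pairs), end="")
```

Whitespace characters inside a sentence would break the column layout, so real data would need them removed or replaced before export.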


Framework of the Analysis Tool

Documentary database preparation
Geo-parsing in text
    Geo-extraction from authors' affiliation and address fields
    Geo-recognition from unstructured text
        CRF++-based toponym identification
        Geographical knowledge base with semantic relationships supporting toponym disambiguation
        GeoFocus
Geo-coding
Spatial analysis of research activity based on toponym resolution from documents, with ArcGIS


Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation

4. Experimental Prototype Design

Geographical KB
Abbreviation and alias to formal toponym transformation
Toponym to footprint/coordinate mapping
Combined with toponym rules to support toponym annotation.
Combines the administrative relationships, spatial relationships, and feature types of geo-entities to support disambiguation.

Geo-coding
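One way to picture how such a knowledge base could support disambiguation is sketched below: aliases are first mapped to formal names, and an ambiguous toponym is then resolved to the candidate whose administrative parent also appears in the surrounding text. The data structures, the fallback rule, and the function names are hypothetical; the slides do not give the authors' schema or rules.

```python
# Hypothetical alias table: alias or older spelling -> formal toponym.
ALIASES = {"Peking": "Beijing", "Tibet": "Xizang"}

# Hypothetical candidates for an ambiguous name, each with its
# administrative parent and feature type.
CANDIDATES = {
    "Chaoyang": [
        {"name": "Chaoyang District", "parent": "Beijing", "type": "district"},
        {"name": "Chaoyang City", "parent": "Liaoning", "type": "city"},
    ],
}


def normalize(name):
    """Map an alias to its formal toponym; leave other names unchanged."""
    return ALIASES.get(name, name)


def disambiguate(name, context):
    """Pick the candidate whose parent region is mentioned in context.

    context is a list of other toponyms found in the same document.
    Falls back to the first listed candidate when nothing matches,
    and returns None for names the knowledge base does not know.
    """
    candidates = CANDIDATES.get(normalize(name))
    if not candidates:
        return None
    context_names = {normalize(c) for c in context}
    for candidate in candidates:
        if candidate["parent"] in context_names:
            return candidate
    return candidates[0]


if __name__ == "__main__":
    print(disambiguate("Chaoyang", ["Liaoning"]))
    print(disambiguate("Chaoyang", ["Peking"]))
```

Spatial relationships and feature types could be added as further tie-breakers in the same loop, for example by preferring the candidate closest to other resolved places in the document.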

Spatial Analysis of Research Activity in Sporopollen in China

Distribution of authors
CNKI: 1490 papers (2000-2010).
1402 items have clear author affiliations and addresses.
97.08% of the author affiliations and addresses were identified.
In combination with Google Earth and Google Maps, the geo-coding rate reached 96.9%.

As the figure shows, most palynology authors come from Beijing, Jiangsu, and Shanghai, followed by Gansu and Shanxi. Few authors are from Xizang and Ningxia.

Distribution of the research area in sporopollen in China

According to manual annotation of the abstracts, 1112 papers refer to geographical names.

Distribution of research area in sporopollen

The hottest research areas for sporopollen in China are the Yangtze River estuary, the Shandong inland area, Beijing, the Qinling mountain area, and the Junggar Basin.

Sampling points are sparse south of the Changjiang (Yangtze River), along the mountainous border of Heilongjiang, Jilin, and Inner Mongolia, in the northwest desert, and in the southwest tropical regions.

These places deserve more attention in the future: in addition to research significance, geographic representativeness and the filling of blank research areas are also worth considering.
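The idea of hot areas and blank areas can be illustrated with a simple grid count over geo-coded sampling points. This is only a stand-in for the ArcGIS analysis used in the study: the cell size, the `hot_threshold` parameter, the function names, and the sample points are all assumptions for illustration.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=1.0):
    """Return the (row, col) index of the grid cell containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def density(points, cell_deg=1.0):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


def hot_and_blank(points, study_cells, cell_deg=1.0, hot_threshold=2):
    """Return (hot, blank) sets of cells.

    hot: cells with at least hot_threshold points.
    blank: cells of the study area that contain no points at all.
    """
    counts = density(points, cell_deg)
    hot = {cell for cell, n in counts.items() if n >= hot_threshold}
    blank = {cell for cell in study_cells if counts[cell] == 0}
    return hot, blank


if __name__ == "__main__":
    # Hypothetical geo-coded sampling points (lat, lon).
    points = [(39.9, 116.4), (39.5, 116.8), (31.2, 121.5)]
    study_cells = {(39, 116), (31, 121), (40, 90)}
    hot, blank = hot_and_blank(points, study_cells)
    print(hot)
    print(blank)
```

A fixed grid is the simplest choice; kernel density maps, as in the Bornmann and Waltman reference, smooth the same counts into a continuous surface.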

Conclusion

The experiment shows that it is possible to analyze the distribution of research activities by automatically identifying and annotating geo-entities in large-scale text collections.

The method can help science decision makers allocate research resources.

Further research

Further research and experimentation are needed, and are in fact ongoing, to improve the geo-parsing and geo-coding rates.

We need a much larger training corpus, and we need to adjust the feature template to improve efficiency.

We also need to consider other heuristics to improve toponym resolution.

A systematic evaluation of the method should be carried out as well.


Thanks for your attention!