Language-based Information and Knowledge Analysis. Professor Khurshid Ahmad, Department of Computing, School of Electronics and Physical Sciences, University of Surrey. e-Science day at the Surrey Research Park, 2 December 2002.


Page 1

Language-based Information and Knowledge Analysis

Professor Khurshid Ahmad
Department of Computing
School of Electronics and Physical Sciences
University of Surrey

e-Science day at the Surrey Research Park, 2 December 2002

Page 2

Talk Outline

• Computing Intelligently
• The complexity of science
• The triumvirate of understanding
• Dealing with information deluge
• The Missing Link: Images and Text
• Need for/of the Grid
• Afterword



Page 5

Computing Intelligently?

Knowledge; Intelligence; Cognition

Language; Images; Symbols; Planning; Learning; Thinking; Creativity

Artificially intelligent computing systems attempt to solve problems based on an interpretation of work in psychology, neurobiology, linguistics, mathematics and philosophy.

Page 6

The triumvirate of understanding

• Knowledge-based INFORMATION EXTRACTION
• Cognition-based INFORMATION VISUALIZATION
• Intelligence based on SCIENTOMETRICS/BIBLIOMETRICS
• Intelligent INFORMATION RETRIEVAL (IR)

Page 7

The triumvirate of understanding

Major text data bases are online:

• MEDLINE (11 million papers);
• Physical Review Online Archive (c. 1890 to date);
• US Patent Office (all patents from 1900 onwards);
• Genome data bases.

Page 8

The triumvirate of understanding

Major text and image data bases are online:

• Reuters News (c. 3000 stories per day);
• Spectroscopy and analytical data (NIS data bases);
• Chemical Abstracts, where currently structure diagrams are ignored;
• Crime-related images with annotated information.

Page 9

The triumvirate of understanding

Recently, studies of how science and technology evolve have been related to issues of business management, particularly the emergence of competition, disruptive technologies, and opportunities for collaboration across disciplines.

Page 10

The triumvirate of understanding

Such methods are used essentially with structured spatial and temporal data.

Abstract non-spatial and atemporal data, for example free text as found in journal papers, in various abstracts data bases (cf. MEDLINE), in electronic mail comprising user-to-expert communication, or in web-access patterns, are typically visualised using so-called thematic landscapes. This would need the GRID.


Page 12

The triumvirate of understanding: Need for/of the Grid

Coordinating data sets based on common sets of metadata: need for standards beyond those for architecture of the Grid (OGSA)

Grid-enabling text analysis systems would enable processing of large volumes of distributed data

Grids provide the infrastructure for development of generic computing applications capable of dealing with and combining results of analysis of various types of data – language, images, graphs.

Page 13

Computing Intelligently?

A knowledge-based system can be programmed to reason over a set of facts, propositions, rules and rules of thumb and, sometimes, the system may come to the same conclusion as a human being.
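As a rough illustration of reasoning over facts and rules, the sketch below uses simple forward chaining. This is only one of several ways a knowledge-based system can reason, and the facts, rule names and the `forward_chain` helper are hypothetical, invented for this example rather than taken from the talk.

```python
def forward_chain(facts, rules):
    """Apply rules (premises -> conclusion) until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical facts and rules of thumb, for illustration only.
facts = {"markets_falling", "bad_news_rising"}
rules = [
    (("markets_falling", "bad_news_rising"), "sentiment_negative"),
    (("sentiment_negative",), "advise_caution"),
]
derived = forward_chain(facts, rules)
```

The second rule fires only because the first has already added `sentiment_negative`, which is the chaining step a human expert would perform implicitly.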

Page 14

Computing Intelligently – with rules of thumb about images?

Recognising and reasoning about the visual environment is something that people do extraordinarily well;

In these abilities, an average three-year-old makes the most sophisticated computer vision system look embarrassingly inept.

Page 15

Computing Intelligently – with rules of thumb about images?

The Vision Problem?

Three-dimensional physical structure in the scene, containing pictures of objects related to other (probably) known objects, which projects into two-dimensional structure in the image.

Page 16

Computing Intelligently – with rules of thumb about words?

Natural Language. A person’s native tongue; organic, ambiguous, creative, wilful.

Natural Language Processing. Processing of natural language (e.g., English) by a computer to facilitate communication with the computer or for other purposes, such as word processors, computer-based dictionaries and thesauri, summarizers, machine translators, text filters, grammar checkers, and so on.


Page 18

Complexity of science

[Figure: lexical difficulty (vertical axis, -10 to 35) against year of publication between 1930 and 1990 (horizontal axis), for Nature, Science and Scientific American.]

Page 19

Complexity of science

Lexical processes used by scientists involve:

• repetition of lexical items comprising the specific vocabulary of a subject domain;
• inventing new words;
• borrowing words from other domains;
• re-defining words or terms.

Such processes contribute significantly to the organisation and communication of tacit and explicit knowledge.

Page 20

Complexity of science

We have developed a computer-based method that compares the relative occurrence of single words in an English scientific paper (or a collection, or corpus, of papers) with the occurrence of those words in a representative sample of contemporary English.

The British National Corpus (BNC) is a 100-million-word digital collection of written (and spoken) English produced during 1975-1993. Three-quarters of the text is drawn from (A-level+) natural, social and applied sciences, from arts and culture, and from commerce and finance. The other quarter includes works of fiction and popular science.

BNC-type corpora are used extensively in producing dictionaries for general use.
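The comparison of relative occurrence described here can be sketched as a ratio of relative frequencies. This is a minimal illustration under stated assumptions, not the Surrey implementation: the function names, the simple tokeniser and the two tiny sample texts are all hypothetical, and a real comparison would use a full reference corpus such as the BNC.

```python
from collections import Counter
import re


def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def frequency_ratio(special_text, general_text):
    """Ratio of a word's relative frequency in the specialist text to its
    relative frequency in the general text; infinite if the word never
    occurs in the general text."""
    special = relative_frequencies(special_text)
    general = relative_frequencies(general_text)
    return {
        word: (rel / general[word]) if word in general else float("inf")
        for word, rel in special.items()
    }


# Hypothetical miniature "corpora", for illustration only.
specialist = "tunnelling current in the tunnelling diode"
reference = "the train went in the tunnel and the tunnelling stopped"
ratios = frequency_ratio(specialist, reference)
```

A ratio well above 1 marks a word used far more often in the specialist text than in general language; a word absent from the reference sample gets an infinite ratio.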

Page 21

Complexity of science

Leo Esaki discovered a new semiconductor device, the tunnel diode, in 1957. The super-fast, current-switching device earned Esaki a Nobel Prize, and yet technological obstacles hindered its widespread use in conventional, silicon-based circuits.

Recent developments in tunnel diodes could help chip-makers boost silicon's speed while further shrinking chips.

• We have developed a text corpus comprising 100-odd journal papers on the topic of tunnel diodes or, more precisely, resonant tunnel diodes, published between 1980 and 2000 and containing over 430,000 words.

Page 22

Complexity of science

A lexico-morphological signature of discovery?

Weird/excessive use of tunnel: frequency relative to BNC

Word         Surrey Corpus (a)   British National Corpus (b)
tunnel       50                  1
tunnels      3                   2
tunnelled    70                  1
tunnelling   685                 1

Magnetotunneling does not exist in the British National Corpus

Page 23

Complexity of science

A lexico-morphological signature of the discovery of tunnel diodes?

Lexical ‘productivity’ of tunnel & resonant: frequently used compound words

Compound word                            Frequency
resonant tunneling                       172
resonant tunneling diodes                25
resonant tunneling diode                 19
resonant magnetotunneling                16
resonant tunneling structures            8
resonant tunneling peak                  8
barrier resonant tunneling structure     6
resonant tunneling structure             6
resonant tunneling spectroscopy          6
resonant tunneling processes             4
resonant tunneling system                4
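Counts of compound words like these can be approximated by tallying word sequences that start with a chosen head word. The sketch below is an assumption-laden illustration: `compound_counts` and the sample sentence are hypothetical, and it counts raw word sequences without any filtering, so a real term extractor would also need to discard sequences that are not genuine terms.

```python
from collections import Counter
import re


def compound_counts(text, head, max_len=3):
    """Count word sequences of length 2..max_len that begin with `head`."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != head:
            continue
        for n in range(2, max_len + 1):
            if i + n <= len(tokens):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = (
    "Resonant tunneling diodes are fast. "
    "A resonant tunneling diode uses resonant tunneling."
)
counts = compound_counts(sample, head="resonant")
```

Here `counts.most_common()` would rank `resonant tunneling` first, mirroring how the shorter compound dominates its longer derivatives in the list above.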

Page 24

Complexity of science

Lexicomorphological signature: Compound Words

• unipolar resonant tunneling diode
• bipolar light-emitting resonant tunneling diode
• resonant interband tunneling diode (RITD)
• interband resonant tunneling diode
• delta doped resonant tunneling diode
• quantum well resonant tunneling diode
• resonant tunneling diode
• double-barrier resonant tunneling diode
• interband double barrier tunneling diode
• tunneling diode

Same thing?

Page 25

Complexity of science

Semiconductor Devices (5)
Tunnel Devices (1)
Tunnel Diode, Leo Esaki

1980

Semiconductor Devices (2)
Heterojunction Devices (2)
Tunnel Devices (3)
Memory Devices (9)

L. L. Chang, L. Esaki, W. E. Howard, R. Ludeke and N. Schul, MBE in GaAs and AlAs, J. Vac. Sci. Technol. 10, 655 (1973)

H. Sakaki, L. L. Chang, R. Ludeke, C. A. Chang, G. A. Sai-Halasz and L. Esaki, Molecular Beam Epitaxy, Appl. Phys. Lett. 31, 211 (1977)

C. A. Chang, R. Ludeke, L. Chang and L. Esaki, MBE of InGaAs and GaSbAs, Appl. Phys. Lett. 31, 759 (1977)

Information from journals is passed into patents.

Page 26

Complexity of science

Visualising fashions in science and technology: The movement of iconic terms.


Page 28

The triumvirate of understanding

Knowledge; Intelligence; Cognition

Page 29

The triumvirate of understanding, with apologies to Plato

KNOWLEDGE: Knowledge about, knowledge by description: knowledge of a person, thing, or perception gained through information or facts about it rather than by direct experience.

INTELLIGENCE: An impersonation of intelligence; an intelligent or rational being; esp. applied to one that is or may be incorporeal; a spirit.

COGNITION: The action or faculty of knowing taken in its widest sense, including sensation, perception, conception, etc., as distinguished from feeling and volition.

Language; Images; Symbols; Planning; Learning; Thinking; Creativity

Page 30

The triumvirate of understanding, with apologies to Aristotle

KNOWLEDGE: Knowledge of a person, thing, or other entity (e.g. sense-datum, universal) by direct experience of it, as opposed to knowing facts about it. So knowledge of, by, acquaintance.

INTELLIGENCE: Knowledge as to events, communicated by or obtained from another; information, news, tidings.

COGNITION: A product of such an action: a sensation, perception, notion, or higher intuition.

Language; Images; Symbols; Planning; Learning; Thinking; Creativity



Page 33

Dealing with information deluge

• There are over 2,000 news wires produced by Reuters Financial, together with on-line reports from banks, brokerage houses and regulatory bodies. Filtering the relevant from the not-so-relevant is a major problem.

• All major journals in science and technology, together with pre-prints, textbooks, conference proceedings, technical reports, research road-maps and (US) patent documents, are available (almost) freely. Extracting relevant documents from this intellectual deluge is challenging the limits of documentation and has a serious impact on innovation and technology transfer.

Page 34

Dealing with information deluge

• The news report is one of the most commonly occurring linguistic expressions.

• Despite being a good example of open-world data, a news report is a contrived artefact:

  • each report has a potentially attention-grabbing headline;

  • the opening few sentences generally comprise a good summary of the contents of the report;

  • there are slots for the date of origin and slots for photographs and other graphic material.
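The regular structure of a news report listed above can be pictured as a simple record with a lead-based summary. This is a sketch only: the `NewsReport` class, its field names and the sample report are hypothetical and are not part of any system described in the talk.

```python
from dataclasses import dataclass, field
from datetime import date
import re


@dataclass
class NewsReport:
    headline: str
    date_of_origin: date
    body: str
    graphics: list = field(default_factory=list)  # photographs, charts, etc.

    def lead_summary(self, n_sentences=2):
        """Use the opening sentences of the body as a rough summary."""
        sentences = re.split(r"(?<=[.!?])\s+", self.body.strip())
        return " ".join(sentences[:n_sentences])


# Hypothetical report, for illustration only.
report = NewsReport(
    headline="Shares rise",
    date_of_origin=date(2001, 11, 15),
    body="Shares rose today. Banks led the gains. Analysts expect more.",
)
summary = report.lead_summary(2)
```

Taking the first sentences as the summary leans directly on the observation that the opening of a report usually summarises its contents.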

Page 35

Dealing with information deluge

[Diagram: Event; News; Market (Price); Information.]

The relationship between Events, News and Markets (price) through Information.

Page 36

Dealing with information deluge

Movement from Feb 2001 to Jan 2002. Note the dip on and around Sep 11th 2001, although all markets were falling before this.

[Figure: four index charts, each marked at Sep 11, 2001: Germany DAX (PERF); Nasdaq Composite Index; Japan NIKKEI AVERAGE INDEX (225); Dow Jones Industrial Average.]

Page 37

Dealing with information deluge

• Francis Knowles has written about the health metaphors used in financial news reports:

• markets are full of vigour and are strong, or the markets are anaemic or are weak (1996);

• most newspapers also use animal metaphors: there are bull markets and bear markets. The former refer to expansion, and indirectly to fertility, and the latter to shy, retiring and grizzly behaviour, much like that reported about bears in the popular press and in literature for children.

Page 38

Dealing with information deluge

Mainly Good News Stories

• ‘Naval shipbuilder and military contractor Vosper Thornycroft has boosted its civil arm by buying facilities manager Merlin Communications’ (Nov 14, 2001).

• ‘The FTSE 100 stock index looks set to open stronger today after Wall Street added to gains seen at the London close and with U.S. stock index futures boosted by rumours that Osama bin Laden had been captured.’ (Nov 15, 2001).

Rather Bad News Stories

• ‘Heavyweight banking and oil stocks have dropped up the leading share index as investors bet on fresh interest rate cuts.’ (Nov 21, 2001).

• ‘The European Commission has slashed its official growth forecasts for the euro zone [..], predicting the most serious slowdown since the 1990s recession, with lower growth in 2002 than this year.’ (Nov 21, 2001).

Page 39

Dealing with information deluge

We created a corpus of 1,539 English financial texts from one source (Reuters) on the World Wide Web, published during a 3 month period (Oct 2001-January 2002) and comprising over 310,000 tokens. The corpus comprised a blend of short news stories and financial reports. Most of the news is business news from Britain; thirty percent of the news is from Europe and from the United States.

Week (5 day week)   Good Word Frequency   Bad Word Frequency
1                   58                    40
2                   71                    75
3                   77                    66
4                   73                    59
5                   72                    28
Total               351                   268

Frequency of Good and Bad words in Nov 2001. The minimum frequencies are 58 (good words, week 1) and 28 (bad words, week 5); the maximum frequencies are 77 (good words, week 3) and 75 (bad words, week 2).
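A weekly tally of this kind can be sketched as a count of lexicon hits per week. The word lists, the week assignments and the `sentiment_counts` helper below are hypothetical assumptions made for illustration; the talk does not list the actual good and bad words used.

```python
from collections import Counter
import re

# Hypothetical mini-lexicons; the real word lists are not given in the talk.
GOOD_WORDS = {"boosted", "stronger", "gains"}
BAD_WORDS = {"slashed", "slowdown", "recession"}


def sentiment_counts(stories):
    """stories: iterable of (week, text). Returns {week: Counter} with
    'good' and 'bad' totals for each week."""
    weekly = {}
    for week, text in stories:
        tokens = re.findall(r"[a-z]+", text.lower())
        tally = weekly.setdefault(week, Counter())
        tally["good"] += sum(token in GOOD_WORDS for token in tokens)
        tally["bad"] += sum(token in BAD_WORDS for token in tokens)
    return weekly


# Fragments of the Reuters stories quoted earlier; week numbers are invented.
stories = [
    (1, "Vosper Thornycroft has boosted its civil arm"),
    (1, "The FTSE 100 looks set to open stronger after Wall Street added to gains"),
    (2, "The European Commission has slashed its growth forecasts, predicting "
        "the most serious slowdown since the 1990s recession"),
]
weekly = sentiment_counts(stories)
```

Summing the per-week counters would give totals comparable in form to the last row of the table, though not its actual values.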

Page 40

Dealing with information deluge

Market correlation between ‘good’ word frequency and FTSE index.

[Figure: ratio (0 to 1.2) against date (1 to 30) for good words and the FTSE 100.]

Page 41

Dealing with information deluge

Good and bad word frequency correlated with FTSE 100.

[Figure: ratio (0 to 1.2) against date (1 to 30) for good words, bad words and the FTSE 100.]
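One plausible way to put word frequencies and an index on a common 0-to-1 scale and measure how they move together is sketched below. It assumes that the plotted ratio is each series divided by its own maximum and that the correlation is the Pearson coefficient; neither is stated in the talk. The good-word counts are the weekly figures for Nov 2001 given earlier, while the index values are invented placeholders.

```python
from math import sqrt


def scale_to_max(series):
    """Divide every value by the series maximum, so the peak becomes 1."""
    peak = max(series)
    return [value / peak for value in series]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


good_words = [58, 71, 77, 73, 72]          # weekly good-word counts, Nov 2001
index_level = [5100, 5200, 5300, 5250, 5230]  # hypothetical index values

r = pearson(scale_to_max(good_words), scale_to_max(index_level))
```

Scaling each series by its maximum changes only the plot, not the coefficient, since Pearson correlation is unaffected by multiplying a series by a positive constant.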

Page 42

Dealing with information deluge

[Diagram labels: Reuters News Feed; SYSTEM QUIRK; Up; Down; Time Series of Up and Down; FTSE 100 INDEX; Generate Signal (Buy / Sell). Two plots of ratio (0 to 1.2) against date (1 to 30) for good words and the FTSE 100.]

Partners: Ibermatica, Madrid; Finsoft, London; JRC GmbH, Berlin

This work is being carried out under the auspices of the EU-IST sponsored GIDA project. The project aims to create a novel service type in the financial investment business. Its novelty lies in the integration of financial analysis with news analysis.

Page 43

Dealing with information deluge

FTSE 100 plotted against ‘bad news’; 20 February 2002 was one of the lowest days. The SATISFI system keeps track of news reports with bad (and good) news.

Page 44

Dealing with information deluge

• SATISFI (Sentiment and Time Series: Financial analysis System) is being developed at the University of Surrey for the EU-IST GIDA Project.

[Figure: good news plotted against the FTSE 100.]

SATISFI is based on our existing text analysis system, System Quirk, together with programs for time series analysis, text summarisation and organising large text collections, and programs for creating thesauri and term bases. Systems for learning the behaviour of the markets are also being developed.

Page 45

Profiting from information deluge?

See also: http://www.vicefund.com/

Page 46

Dealing with information deluge

We have used a neural computing system that creates its own categories given a class of computational objects, say a digitised, computer-understandable version of a set of news stories: a set of keywords representing the whole set. Any given keyword will be present in some stories and absent from others.

The system has to be trained on a set of keywords and creates categories.

Then the system will categorise unseen stories into the categories it has already created.
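The keyword representation described here, where each keyword is either present in a story or absent from it, can be sketched as a binary vector per story. The `keyword_vector` helper and the two sample stories are hypothetical; the five keywords are taken from the list of salient words reported later in the talk.

```python
import re


def keyword_vector(story, keywords):
    """Return a 0/1 vector marking which keywords occur in the story."""
    tokens = set(re.findall(r"[a-z]+", story.lower()))
    return [1 if keyword in tokens else 0 for keyword in keywords]


keywords = ["tax", "pollution", "fuel", "drug", "cars"]

# Hypothetical stories, for illustration only.
story_a = "Congress debates a new fuel tax"
story_b = "Smog and pollution from cars"

vector_a = keyword_vector(story_a, keywords)
vector_b = keyword_vector(story_b, keywords)
```

Vectors of this kind are what a neural categoriser would be trained on, and an unseen story is converted in the same way before it is assigned to a category.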

Page 47

Dealing with information deluge

Automatic Categorization of Texts Based on Keywords Using a neural computing system

Our text corpus consisted of 100 Associated Press (AP) news wires selected from 10 pre-classified news categories shown together with their icons. The average length of the articles was 622 words.

Page 48

Dealing with information deluge

Text Categories

1   Bioconversion         6   Exportation of Industry
2   Pollution Recovery    7   Foreign Trade
3   Alternative Fuels     8   Int. Drug Enforcement
4   Fossil Fuels          9   Foreign Car Makers
5   Rain Forests          10  Worldwide Tax Sources

These text categories were used in the TIPSTER-SUMMARY program but were not known to our system.

Page 49

Dealing with information deluge

1   percent          15  mexico        29  mazda        43  enforcement
2   tax              16  emissions     30  gases        44  warming
3   billion          17  drugs         31  shale        45  smog
4   drug             18  fuels         32  deficit      46  ozone
5   reagan           19  senate        33  export       47  massachusetts
6   cars             20  auto          34  recycling    48  imports
7   taxes            21  proposal      35  epa          49  automobile
8   environmental    22  gasoline      36  honda        50  trafficking
9   pollution        23  exports       37  methanol
10  fuel             24  vehicles      38  automakers
11  federal          25  ohio          39  panama
12  dukakis          26  greenhouse    40  corp
13  bush             27  dioxide       41  forests
14  congress         28  marine        42  cocaine

Salient single words identified automatically by System Quirk

Page 50

Dealing with information deluge

Results of a Full Text Map trained using exponentially decreased neighbourhood and learning rate.
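The mention of a map trained with an exponentially decreased neighbourhood and learning rate suggests a Kohonen-style self-organising map, although the talk does not name the algorithm. The sketch below assumes that reading; the grid size, decay schedule, default parameters and input vectors are all hypothetical.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vector)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a small map; learning rate and neighbourhood radius both
    shrink exponentially with the epoch number (illustrative schedule)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        decay = math.exp(-epoch / epochs)
        lr = lr0 * decay
        radius = radius0 * decay
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Nodes near the winner on the grid are pulled harder.
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
    return weights


# Hypothetical keyword-presence vectors for four stories.
stories = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
som = train_som(stories, rows=2, cols=2)
category_of_unseen = best_matching_unit(som, [1, 1, 0, 0])
```

An unseen story is categorised by the grid position of its best matching unit, which corresponds to placing it in a category the map has already formed.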


Page 52

Dealing with information deluge

TEXT SUMMARISATION: Surrey’s Program Telepattern

The programs were evaluated by the US DoD’s TREC and TIPSTER programmes.

Evaluation of summary accuracy of 30 texts by 4 defence intelligence assessors

30 full-text documents and the corresponding summaries were given to 4 assessors, who decided whether each summary was acceptable. Results per participant are below.

Participant                    YES   NO    “Yes” %
British Telecom                85    34    71
Univ. of Surrey                72    48    60
IBM                            71    49    59
SRA                            67    52    56
Centre for InfoRes (Russia)    61    59    51
New Mexico SU                  54    66    45
Univ. of Pennsylvania          51    69    42
National Taiwan Univ.          50    70    42
CGI/Carnegie-Mellon Uni.       39    80    32
Lexis-Nexis                    35    85    29
GE                             31    88    26
Cornell/SabIR                  25    95    14
Intelligent Algorithms         23    96    14
USCalifornia-ISI               14    106   11
Total                          678   997   40


Page 54

The Missing Link: Images and Text

The administration of justice requires systematic prosecution of the perpetrators of crime. One key element in this system is the collection, analysis and dissemination of information collected safely and securely from the scene where the crime was committed. The information comprises images of the scene and the descriptions and interpretations of these images. In a murder case there may be over 2000 scene-of-crime images, and the case can take up to two years to come to court. It is important for these images to be indexed appropriately and retrieved efficiently.

Scene-of-crime officers (SoCOs) play a key role in the collection of this vital multi-modal information; they describe the images and the context in which the images were collected. The police officers involved in the administration of justice provide the interpretation.

Page 55

The Missing Link: Images and Text

Collateral texts are written texts or speech (fragments) closely or loosely related to an image or to objects within the image.

CLOSELY COLLATERAL TEXTS: CAPTION; CRIME SCENE REPORT

BROADLY COLLATERAL TEXTS: NEWSPAPER ARTICLE; DICTIONARY DEFINITION

The collateral texts are special language texts and comprise keywords that may help in indexing and retrieving the images.

Page 56

The Missing Link: Images and Text

The EPSRC-sponsored SoCIS project, involving the Universities of Surrey and Sheffield, is developing methods and techniques for automatically indexing images with the descriptions provided by Scene of Crime Officers.

Typical Scene of Crime Images

• 9 mm browning high power pistol
• Footwear impression in blood
• Body on floor showing adjacent table
• Fingerprints showing ridges

The SoCIS project is investigating how the results of the project can be generalised such that the methods and techniques can be applied to an arbitrary domain.

Page 57

The Missing Link: Images and Text

What do SoCOs do now? Forms, forms and more forms

The Missing Link: Images and Text

The SoCIS project is developing methods and techniques for automatically indexing images taken at a crime scene with the descriptions provided by scene of crime officers. Five UK Police Forces are working closely with our project: they provide knowledge of their subject domain, test our system and advise us generally.

Hampshire Constabulary

Metropolitan Police

Surrey Police

South Yorkshire Police

Kent Constabulary

The Missing Link: Images and Text

The SoCIS project is developing methods and techniques for automatically indexing images with the descriptions provided by scene of crime officers.

[Screenshot; labelled controls: Edit Button, Shape Buttons, Save Button, Select Button, Delete Button, Show All Hotspots Button]

DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK?

SOCIS: A prototype image and text storage and retrieval system.

Automatic Labelling (or INDEXING) of images by keywords in the descriptions provided by the SOCOs.

Automatic Extraction of terms and their relationship to other terms (ontology) from the descriptions and other texts.

EVIDENCE
  TRACE EVIDENCE
    FIBRE
      INORGANIC FIBRE
      MANUFACTURED POLYMERIC FIBRE
      DYE FIBRE
    BLOOD
    DNA

The above hierarchy tree is based on our 0.7 million word forensic science text corpus
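
One common heuristic for deriving such a term hierarchy from compound terms is to treat the longest trailing part of a multi-word term that is itself a term as its parent (so "trace evidence" is a kind of "evidence"). The sketch below assumes that heuristic; the talk does not say this is how the SoCIS tree was built, and the function names are illustrative only. Links that are not visible in the term's own wording (for example, from "blood" to a broader term) cannot be found this way and would need other evidence from the corpus.

```python
def suffix_parent(term, vocabulary):
    """Return the longest proper trailing sub-term that is in the vocabulary, or None."""
    words = term.split()
    for start in range(1, len(words)):
        candidate = " ".join(words[start:])
        if candidate in vocabulary:
            return candidate
    return None

def build_hierarchy(terms):
    """Map each term to its parent under the trailing-sub-term heuristic."""
    vocabulary = {t.lower() for t in terms}
    return {t: suffix_parent(t, vocabulary) for t in sorted(vocabulary)}

# Terms taken from the hierarchy shown on the slide.
terms = [
    "evidence", "trace evidence", "fibre", "blood", "DNA",
    "inorganic fibre", "manufactured polymeric fibre", "dye fibre",
]
hierarchy = build_hierarchy(terms)
```

Here `hierarchy["inorganic fibre"]` is `"fibre"`, while single-word terms such as `"blood"` map to `None` because the heuristic has nothing to work with.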

DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK?

SANNC: A neural computing system that learns how to relate textual descriptions with images.

Automatic Clustering of similar images in an image collection.

Automatic Identification of the position of objects in an image or image.

[Diagram labels: IMAGE; TEXT; SELF ORGANISING MAP (twice); HEBBIAN NETWORK. Example description: "Nine millimetre browning high power self-loaded pistol"]
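
The idea of linking two maps through Hebbian connections can be sketched as follows. This is a toy illustration, not the SANNC system: it assumes that an image map and a text map have already been trained (their prototype vectors are simply given here), and it applies the plain Hebbian rule, strengthening the link between the winning image unit and the winning text unit each time an image and its description are presented together. All names and numbers are made up for the example.

```python
def nearest_unit(prototypes, vector):
    """Index of the map unit whose prototype is closest (squared Euclidean) to the vector."""
    def distance(p):
        return sum((a - b) ** 2 for a, b in zip(p, vector))
    return min(range(len(prototypes)), key=lambda i: distance(prototypes[i]))

def train_hebbian_links(winner_pairs, n_image_units, n_text_units, rate=1.0):
    """Strengthen the link between co-active winners: links[i][j] += rate."""
    links = [[0.0] * n_text_units for _ in range(n_image_units)]
    for image_unit, text_unit in winner_pairs:
        links[image_unit][text_unit] += rate
    return links

def recall_text_unit(links, image_unit):
    """Text-map unit most strongly linked to the given image-map unit."""
    row = links[image_unit]
    return max(range(len(row)), key=lambda j: row[j])

# Hypothetical, already-trained map prototypes (toy 2-D feature vectors).
image_prototypes = [[0.0, 0.0], [1.0, 1.0]]
text_prototypes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Toy training data: (image features, text features) presented together.
training = [
    ([0.1, 0.0], [0.9, 0.1]),
    ([0.0, 0.2], [1.0, 0.0]),
    ([0.9, 1.0], [0.1, 0.9]),
]
winner_pairs = [
    (nearest_unit(image_prototypes, img), nearest_unit(text_prototypes, txt))
    for img, txt in training
]
links = train_hebbian_links(winner_pairs, len(image_prototypes), len(text_prototypes))
```

After training, `recall_text_unit(links, nearest_unit(image_prototypes, new_image))` gives the text-map unit most often seen together with images like the new one, which is the sense in which the link network relates images to descriptions.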

DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK?

IDENTIFICATION: [1] Close up view of exhibit ABC/3 [.]

ELABORATION: [2] Red and silver knife handle.

LOCATION: On alleyway floor; adjacent to building and metal gate

[SOCO 1 – spontaneous free text:] Close up view of exhibit ABC/3 red and silver knife handle on alleyway floor adjacent to building and metal gate.

Indexer Variability: Given that the image descriptions are free text, does each SOCO give a different description of the image?

DESCRIBING IMAGES – THE LINK BETWEEN IMAGES AND TEXT, THE MISSING LINK?

SOCO 5 Close up item 3.

SOCO 7 Close up of item 3 -

SOCO 1 Close up of knife.

SOCO 8 Close up view item 3 -

SOCO 2 Close up view of ex 3

SOCO 4 Close up view of exhibit 3

SOCO 3 Close up view of exhibit ABC/3

SOCO 6 Close view of marker 3

Indexer Variability: Given that the image descriptions are free text, does each SOCO give a different description of the image?

Not really: there are three ‘structures’ – identification, location and elaboration. The linguistic description shows little or no variation. Research continues.

Variation amongst SOCOs?

SOCO 2 a red handled lock knife

SOCO 6 against red handled knife.

SOCO 5 Knife handle.

SOCO 3 red and silver knife handle

SOCO 4 red handled flick knife

SOCO 8 red handled flick knife.

SOCO 7 red penknife.

SOCO 1 Red sides. Metal ends.

Indexer Variability: Given that the image descriptions are free text, does each SOCO give a different description of the image?

Not really: there are three ‘structures’ – identification, location and elaboration. The linguistic description shows little or no variation. Research continues.
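
One simple way to put a number on indexer variability is to compare the word sets of the descriptions pairwise. The sketch below uses the Jaccard coefficient (shared words divided by all words used) on the eight descriptions listed above. This measure is an assumption chosen for illustration; the talk does not state how variability was assessed in the project.

```python
import re
from itertools import combinations

# The eight SOCO descriptions of the knife, as listed on the slide.
descriptions = {
    "SOCO 1": "Red sides. Metal ends.",
    "SOCO 2": "a red handled lock knife",
    "SOCO 3": "red and silver knife handle",
    "SOCO 4": "red handled flick knife",
    "SOCO 5": "Knife handle.",
    "SOCO 6": "against red handled knife.",
    "SOCO 7": "red penknife.",
    "SOCO 8": "red handled flick knife.",
}

def tokens(text):
    """Lowercased word set of a description, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard coefficient of two descriptions: |shared words| / |all words|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def pairwise_overlap(descs):
    """Jaccard coefficient for every unordered pair of describers."""
    return {
        (p, q): jaccard(descs[p], descs[q])
        for p, q in combinations(sorted(descs), 2)
    }

overlap = pairwise_overlap(descriptions)
```

Identical wording scores 1.0 and descriptions with no words in common score 0.0. A word-level measure like this treats "penknife" and "knife" as unrelated, so it understates agreement that a terminology-aware comparison would detect.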

Talk Outline

Computing Intelligently
The complexity of science
The triumvirate of understanding
Dealing with information deluge
The Missing Link: Images and Text
Need for/of the Grid
Afterword

Need for/of the Grid

Data Grids: management of large volumes of text, images, financial data, ….

Computational Grids: processing of large volumes of such data

Collaborative Grids: activities in research – virtual crime investigation

Need for/of the Grid

Coordinating data sets based on common sets of metadata: need for standards beyond those for the architecture of the Grid (OGSA)

Grid-enabling System Quirk would enable processing of large volumes of distributed data

Grids provide the infrastructure for development of generic computing applications capable of dealing with and combining results of analysis of various types of data

Talk Outline

Computing Intelligently
The complexity of science
The triumvirate of understanding
Dealing with information deluge
The Missing Link: Images and Text
Need for/of the Grid
Afterword

Afterword: The Department of Computing

Software Engineering
Theoretical Computing
Knowledge Management
Neural Computing
Information Extraction and Multi-media Group

A research-active Department

• Applied to EPSRC to be involved with the e-Science Programme

• Looking to develop industrial collaborations for ALL research activities

Afterword: The Department of Computing

A Department that has, or is looking forward to, active collaboration within the University:

• Computer Vision (CVSSP – the new JIF Lab)

• Satellite Engineering (SSTL – Best Practice)

• Linguistics & Dance

A Department that is looking forward to active collaboration outside the University with:

Universities of Sheffield and Southampton, Metropolitan Police College, Queen Mary London

A Department that is looking forward to exploiting its software systems, especially financial prediction systems and language engineering systems.