Christoph F. Eick’s Areas of Interest


Page 1: Christoph F. Eick’s Areas of Interest

Data Mining for the Health Sciences, Seattle, 4/27/99.

Christoph F. Eick’s Areas of Interest

Knowledge Discovery in Data and Data Mining (KDD)

– Expertise in developing and using data mining techniques and tools --- mostly for structured data collections (also started some work concerning images)

– Database Clustering / Generalizing Data Mining Techniques for Databases

– Preprocessing in KDD

– Constructive Induction, Symbolic Regression, and Genetic Programming

Agent-based Technologies

– Ontologies and Semantic Brokering

– The InfoSleuth Information Gathering System

– Integration of Agent-based Technologies and Knowledge Discovery/Data Mining

Knowledge-based Systems, Expert Systems, and [Knowledge Acquisition]

Using Bayesian Technology to Assist Decision Making (in Medicine and other domains)

Computerization of Medical Practice Guidelines

Genetic Programming and Evolutionary Techniques

Sound background in Data Models, Databases, and AI

Page 2: Christoph F. Eick’s Areas of Interest

Data Mining for the Health Sciences

Christoph F. Eick

www.cs.uh.edu/~ceick/eick-uw.html

[email protected]

University of Houston

Organization

1. Health Care and Computer Science

2. Promising Technologies

2.1 KDD / Data Mining

2.2 Agent-based Systems

2.3 Shared Ontologies and Knowledge Brokering

3. Summary and Conclusion

Page 3: Christoph F. Eick’s Areas of Interest

1. Health Care and Computer Science

Not too long ago (e.g. 1989):

– Offline data / missing data / handwritten reports

– Computers that cannot talk to each other

– Lack of standardization (Tower of Babel, too many languages…)

– Human is frequently the “gold standard”

Today: faster computers, cheaper computers, better computer networks, electronic scanners, better connectivity, the internet, ...

– We have a lot of computerized knowledge on almost any aspect of human health (a well of knowledge)

– We have much more computing power to conduct complex data analysis tasks

New Problems:

– How can we find anything?

– How do we gather information that is distributed over various computer systems and represented using different formats?

– If we find something, how do we know that it is complete?

– How can this large amount of information be analyzed?

– What information can we trust?

Page 4: Christoph F. Eick’s Areas of Interest

Promising “Newer” Technologies to Cope with the Information Flood

– Knowledge Discovery and Data Mining (KDD)

– Agent-based Technologies

– Shared Ontologies and Knowledge Brokering

– Non-traditional data analysis techniques

– Structural Search and Indexing Techniques

Page 5: Christoph F. Eick’s Areas of Interest


Knowledge Discovery in Data [and Data Mining] (KDD)

Let us find something interesting!

Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

Frequently, the term data mining is used to refer to KDD.

Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html).

The field is dominated more by industry than by research institutions.

Page 6: Christoph F. Eick’s Areas of Interest

What is KDD?

Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)

The identified knowledge is used to

– make predictions

– classify new examples

– summarize the content of data collections and documents to facilitate understanding and decision making, and to support search and indexing

– support graphical visualization to aid humans in discovering deeper patterns

Example applications:

– learn to classify brain tissue from examples

– predict a patient’s life expectancy from his medical history

– summarize/cluster/mine clinical trial reports

Page 7: Christoph F. Eick’s Areas of Interest

General KDD Steps

[Figure: processing steps Select/preprocess → Transform → Data mine → Interpret/Evaluate/Assimilate, with a “Data preparation” label on the early steps. Intermediate products: Data sources → Selected/Preprocessed data → Transformed data → Extracted information → Knowledge.]

Page 8: Christoph F. Eick’s Areas of Interest

KDD and Classical Data Analysis

KDD is less focused than data analysis in that it looks for interesting patterns in data; classical data analysis centers on analyzing particular relationships in data. The notion of interestingness is a key concept in KDD. Classical data analysis centers more on generating and testing pre-structured hypotheses with respect to a given sample set.

KDD is more centered on analyzing large volumes of data (many fields, many tuples, many tables, …).

In a nutshell, the KDD process consists of preprocessing (generating a target data set), data mining (finding something interesting in the data set), and post-processing (representing the found patterns in understandable form and evaluating their usefulness in a particular domain); classical data analysis is less concerned with the preprocessing step.

KDD involves collaboration between multiple disciplines: namely, statistics, AI, visualization, and databases.

KDD employs non-traditional data analysis techniques (neural networks, decision trees, fuzzy logic, evolutionary computing,…).
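The three-stage view of the KDD process described above (preprocessing, data mining, post-processing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up toy data, the function names are hypothetical, and "interestingness" is approximated by a simple support threshold on co-occurring field values, which is one of many possible measures and not something the slides prescribe.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; not taken from any real data collection.
RECORDS = [
    {"smoker": "yes", "diagnosis": "bronchitis", "age": 61},
    {"smoker": "yes", "diagnosis": "bronchitis", "age": None},
    {"smoker": "no", "diagnosis": "healthy", "age": 34},
    {"smoker": "yes", "diagnosis": "bronchitis", "age": 58},
    {"smoker": "no", "diagnosis": None, "age": 45},
]

def preprocess(records, fields):
    """Generate the target data set: keep only `fields`, drop incomplete records."""
    return [{f: r[f] for f in fields} for r in records
            if all(r.get(f) is not None for f in fields)]

def mine(rows, min_support):
    """Find pairs of (field, value) items that co-occur in at least
    `min_support` of the rows (assumed stand-in for 'interesting')."""
    counts = Counter()
    for row in rows:
        counts.update(combinations(sorted(row.items()), 2))
    return {pair: n / len(rows) for pair, n in counts.items()
            if n / len(rows) >= min_support}

def postprocess(patterns):
    """Represent the found patterns in readable form, most frequent first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], repr(kv[0])))
    return [f"{a[0]}={a[1]} together with {b[0]}={b[1]} in {s:.0%} of records"
            for (a, b), s in ordered]

rows = preprocess(RECORDS, ["smoker", "diagnosis"])
report = postprocess(mine(rows, min_support=0.5))
print(report)
```

Keeping the three stages as separate functions mirrors the slide's point that preprocessing is a first-class step in KDD rather than an afterthought.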

Page 9: Christoph F. Eick’s Areas of Interest

Key Ideas of Agent-based Technologies

“Agents operate independently and anticipate user needs” (P. Maes)

“Agents help users suffering from information overload” (O. Etzioni), rather than mimicking human intelligence

“Agents are important because they allow users to interoperate with modern applications such as electronic commerce and information retrieval. Most of these applications assume that components are added dynamically and that they will be autonomous (serve different users and providers to fill different goals) and heterogeneous.” (M. Singh)

“Essentially, agent-based architectures are characterized by three key features: autonomy, adaptation, and cooperation. Agent-based systems are computational systems in which several agents interact for their own good and for the good of the overall system.”

“In an agent-based architecture services are provided in the context of a community of loosely coupled agents of various types in a distributed environment.”

“Agents are aware of their environment and capable of communicating with other agents that belong to the same agent community.”

Page 10: Christoph F. Eick’s Areas of Interest

Simplified View of Agent-based Systems

[Figure: three kinds of agents, drawn together with a Conversation Layer and a Message Layer]

– End User Agents: agents that act on behalf of end users that look for services

– Mediator Agents: agents that act as a matchmaker between service providers and end users

– Service Provider Agents: agents that act on behalf of service providers

Page 11: Christoph F. Eick’s Areas of Interest

A few more things on Agents

Why do agent-based systems show promise for health care?

– Scalability

– Tasks to be solved involve the collaboration between different groups

– Well suited for the world-wide web

– Health care is a dynamically changing environment

– Establish standards (as a by-product)

Third International Conference on AUTONOMOUS AGENTS (Agents '99), Seattle, Washington, May 1-5, 1999 (http://www.cs.washington.edu/research/agents99/)

Page 12: Christoph F. Eick’s Areas of Interest

Generating Models

The goal of model generation (sometimes also called predictive data mining) is the creation, evaluation, and use of models to make predictions and to understand the relationships between various variables that are described in a data collection. Typical example applications include:

– generate a model that predicts a student’s academic performance based on the applicant’s data such as past grades, test scores, past degree,…

– generate a model that predicts (based on economic data) which stocks to sell, hold, and buy.

– generate a model to predict if a patient suffers from a particular disease based on the patient’s medical and other data.

Neural networks, decision trees, naïve Bayesian classifiers and networks, many other statistical techniques, fuzzy logic and neuro-fuzzy systems are the most popular model generation tools in the KDD area.

All model generation tools and environments employ the basic train/evaluate/predict cycle.
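The train/evaluate/predict cycle mentioned above can be illustrated with a deliberately tiny model. The sketch below uses a one-level decision tree (a "decision stump") on a single numeric feature; the choice of model, the function names, and the toy data are all assumptions made for illustration, and the stump assumes the positive class lies above the threshold.

```python
# Hypothetical toy data: (feature value, label) pairs, not real measurements.
TRAIN = [(1.0, False), (2.0, False), (3.0, False),
         (6.0, True), (7.0, True), (8.0, True)]
TEST = [(2.5, False), (6.5, True), (4.0, False), (9.0, True)]

def train(examples):
    """Train: pick the threshold (among observed values) that
    misclassifies the fewest training examples."""
    best_threshold, best_errors = None, None
    for threshold, _ in examples:
        errors = sum((x >= threshold) != label for x, label in examples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(model, x):
    """Predict: an example is positive if it reaches the learned threshold."""
    return x >= model

def evaluate(model, examples):
    """Evaluate: fraction of held-out examples classified correctly."""
    correct = sum(predict(model, x) == label for x, label in examples)
    return correct / len(examples)

model = train(TRAIN)
accuracy = evaluate(model, TEST)
print(model, accuracy, predict(model, 5.0))
```

Evaluating on examples that were not used for training is what makes the accuracy figure meaningful; the same three-function shape applies to the larger model families listed on the slide.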

Page 13: Christoph F. Eick’s Areas of Interest

Participants in an Agent-based Data Analysis / KDD Society

– Data Analysts

– Tool Builders

– Data Collection Providers

– End Users (Managers, Doctors, Decision Makers, Gamblers,...)

Page 14: Christoph F. Eick’s Areas of Interest

Problems of Model Generation

– It is difficult to find appropriate data collections.

– Sharing of models is not supported.

– Model generation is mostly performed in a centralized environment, not taking advantage of distributed computing technology.

– Degree of tool standardization is low, which makes it more difficult to use different tools for the same data analysis problems.

– Evaluation of claims with respect to the performance of models is very difficult. Problem: the model itself, as well as the tools and data collections that were used to generate the model, are not accessible online.

Page 15: Christoph F. Eick’s Areas of Interest

Agent-based Model Generation

Model generation services are provided in the context of a community of loosely coupled agents of various types in a distributed environment.

Model generation tools are accessed using a unified interface.

Tool providers and data collection providers offer their services to data analysts and end-users via the internet.

New forms of collaboration can easily be supported in this environment:

– data analysts no longer run the tools in their own computing environment

– brokering techniques can be used to find interesting data collections, suitable tools, useful models, and available ontologies.

– tool developers offer tool services on the internet, charging a one-time tool-use fee.

Page 16: Christoph F. Eick’s Areas of Interest

Model Generation Agent Communities

[Figure: Agent-based Model Generation Community. Brokers: Model Broker, Data Collection Broker, Tool Broker. Resource Agents giving access to Data Collections, Models, and Model Generation Tools. Participants and their tools: Data Analyst (Model Generation Browser), End User (Model Generation Browser), Data Collection Provider (Resource Generation Tool), Tool Developer (Tool Integration Tool).]

Page 17: Christoph F. Eick’s Areas of Interest

Shared Ontologies

“Ontologies are content theories about sorts of objects, properties of objects, and relationship between objects that are possible in a specified domain of knowledge” (Chandrasekaran)

“We consider ontologies to be domain theories that specify a domain-specific vocabulary of entities, classes, properties, predicates, and functions, and a set of relationships that necessarily hold among those vocabulary items” (Fikes)

“Shared ontologies form the basis for domain specific knowledge representation languages” (Chandrasekaran)

“If we could develop ontologies that could be used as the basis of multiple systems, they would share a common terminology that would facilitate sharing and reuse” (W. Swartout)

“Ontologies play an important role for the standardization of terminology in medicine (e.g. UMLS) and other domains”

“Ontologies can serve as the glue between knowledge that is represented at different, usually heterogeneous information sources.”

Page 18: Christoph F. Eick’s Areas of Interest

What are Ontologies good for?

As a shared conceptual model of a particular application domain that describes the semantics of the objects that are part of the domain, and captures knowledge that is inherent to the particular domain --- idea: knowledge base

Ontologies provide a vocabulary for representing knowledge about a domain and for describing specific situations in a domain (tool for defining and describing domain-specific vocabularies) --- idea: language for communication

For data/knowledge translation and transformation (provide a solution to the translation problem between different terminologies); for fusion and refinement of existing knowledge --- idea: interoperation

For matchmaking between users, agents, and information resources in agent-based systems --- idea: collaboration, brokering (focus of the next slides)

As reusable building blocks to build systems that solve particular problems in the application domain --- idea: model reuse

Summary: “Ontologies can be used as building block components of knowledge bases, object schema for object-oriented systems, conceptual schema for data bases, structured glossaries for human collaborations, vocabularies for communication between agents, class definitions for conventional software system, etc.” (Fikes)

Page 19: Christoph F. Eick’s Areas of Interest

Ontologies and Brokering

Service providers describe their capabilities in terms of a domain (or task) ontology.

Agents that seek services describe their needs in terms of a domain (or task) ontology.

Broker agents serve as matchmakers between service providers and service seekers by finding suitable agents and by evaluating the extent to which they can provide those services, relying on a semantic brokering approach.

Various languages have been advocated in recent years to specify ontologies: OKBC, CKML/OML, ONTOLINGUA, XML, UMLS,...
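As a rough illustration of the matchmaking idea above, the sketch below lets providers advertise ontology terms and lets a broker rank them by how many of a seeker's needs they cover. Everything here is assumed for illustration: the tiny is-a hierarchy, the agent names, and the scoring rule (fraction of needs covered, where an advertised term covers a need if it is that term or a specialization of it). Real brokering systems would use a proper ontology language and inference engine.

```python
# Hypothetical is-a hierarchy (child -> parent); assumed to be acyclic.
IS_A = {
    "Intensive-Care-Patient": "Patient",
    "Clinical-Trial-Report": "Report",
}

# Hypothetical advertisements: agent name -> ontology terms it can serve.
ADVERTS = {
    "icu-db": {"Intensive-Care-Patient"},
    "trial-archive": {"Clinical-Trial-Report", "Patient"},
    "weather": {"Forecast"},
}

def ancestors(term, is_a):
    """Return the term followed by all of its generalizations."""
    chain = [term]
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain

def satisfies(offered, needed, is_a):
    """An offered term covers a need if it equals or specializes it."""
    return needed in ancestors(offered, is_a)

def match(needs, adverts, is_a):
    """Rank agents by the fraction of needs they cover; drop non-matches."""
    ranked = []
    for agent, offered_terms in adverts.items():
        covered = [n for n in needs
                   if any(satisfies(o, n, is_a) for o in offered_terms)]
        if covered:
            ranked.append((len(covered) / len(needs), agent))
    return sorted(ranked, key=lambda pair: (-pair[0], pair[1]))

ranking = match(["Patient", "Report"], ADVERTS, IS_A)
print(ranking)
```

The point of going through the hierarchy rather than comparing strings is that a provider of intensive-care patient data is still found by a seeker who only asked for patient data.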

Page 20: Christoph F. Eick’s Areas of Interest

A “Traditional” Approach

[Figure: End User Agents specify keywords with respect to the documents they are looking for; a Search Engine connects them to Service Provider Agents. Items shown: Clinical Trial Report, Abstract Clinical Trial Report, Summary.]

Semantic Brokering Approach

[Figure: End User Agents specify a subset of an ontology; Semantic Brokering (:= matchmaking) connects them to Service Provider Agents. Items shown: Clinical Trial Report, Subset of an Ontology, Summary.]

Page 21: Christoph F. Eick’s Areas of Interest

Example Semantic Brokering

Data Analyst’s Information Requirement: Patient / Intensive-Care-Patient with Age>40, weight, Hours-in-intensive-care

Data Collection1: Patient / Intensive-Care-Patient with Age<15, Hours-in-intensive-care

Data Collection2: Patient / Intensive-Care-Patient with age, weight, Hours-in-intensive-care

Data Collection3: Patient / Intensive-Care-Patient with Age>60, Weight>300, Hours-in-intensive-care

Result Semantic Brokering: ((DataCollection1 nil ((missing slot weight) (contradictory (< age 15) (> age 40)))) (DataCollection2 t) (DataCollection3 t ((> age 60) (> weight 300))))
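The brokering result on this slide can be approximated with a short Python sketch. It is not the system behind the slide: the data layout, function names, and the restriction to simple `<` / `>` constraints are assumptions, and the assignment of slots to the three data collections is read off the slide's result. A collection is rejected if it lacks a required slot or if its constraint on a slot cannot overlap the analyst's constraint; accepted collections report the extra restrictions they impose.

```python
import math

# Slot -> constraint; None means the slot is present but unrestricted.
REQUIREMENT = {"age": (">", 40), "weight": None, "hours-in-intensive-care": None}

COLLECTIONS = {
    "DataCollection1": {"age": ("<", 15), "hours-in-intensive-care": None},
    "DataCollection2": {"age": None, "weight": None,
                        "hours-in-intensive-care": None},
    "DataCollection3": {"age": (">", 60), "weight": (">", 300),
                        "hours-in-intensive-care": None},
}

def interval(constraint):
    """Turn ('>', 40) or ('<', 15) into an open interval (low, high)."""
    if constraint is None:
        return (-math.inf, math.inf)
    op, bound = constraint
    return (bound, math.inf) if op == ">" else (-math.inf, bound)

def broker(requirement, collections):
    """Return (name, accepted, notes) for every data collection."""
    results = []
    for name, slots in collections.items():
        problems, restrictions = [], []
        for slot, wanted in requirement.items():
            if slot not in slots:
                problems.append(("missing slot", slot))
                continue
            offered = slots[slot]
            low = max(interval(wanted)[0], interval(offered)[0])
            high = min(interval(wanted)[1], interval(offered)[1])
            if low >= high:  # empty intersection: constraints cannot both hold
                problems.append(("contradictory",
                                 (slot, *offered), (slot, *wanted)))
            elif offered is not None:
                restrictions.append((slot, *offered))
        if problems:
            results.append((name, False, problems))
        else:
            results.append((name, True, restrictions))
    return results

results = broker(REQUIREMENT, COLLECTIONS)
for entry in results:
    print(entry)
```

As on the slide, DataCollection1 is rejected (missing weight slot, contradictory age constraints), DataCollection2 is accepted without notes, and DataCollection3 is accepted with its age and weight restrictions listed so the analyst knows the data is narrower than requested.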

Page 22: Christoph F. Eick’s Areas of Interest

Critical Problems with Respect to Shared Ontologies

Scientific communities have to agree on ontologies; otherwise, the whole approach is flawed.

Development of ontologies for a particular domain is a difficult task (see Digital Anatomist project at UW, development of UMLS). The development of user-friendly and intelligent knowledge acquisition tools is very important for the successful development of shared ontologies.

Expressiveness of languages that are used to define ontologies limits what can be done with domain ontologies.

Reasoning capabilities are important for systems that use shared ontologies (we need a language to specify ontologies and an inference engine that can reason with the given ontologies):

– finding inconsistencies in knowledge bases, finding errors at data entry

– semantic brokering

– more intelligent mappings between terms

– ...

Page 23: Christoph F. Eick’s Areas of Interest

Promising Technologies to Use the Flood of Data for Providing Better Health Care

[Figure: technologies arranged around “The Well of Knowledge”: Agent-based Systems, Structural Indexing Techniques, KDD, Visualization, Traditional Data Analysis Techniques, Shared Ontologies, Semantic Brokering, Software Development Environments, Knowledge Acquisition Tools, Database Technology.]

Page 24: Christoph F. Eick’s Areas of Interest

References

WWW-Links:

– http://www.nlm.nih.gov/pubs/cbm/umlscbm.html (UMLS)

– http://ksl-web.stanford.edu/Reusable-ontol/P001.html (Richard Fikes’ (Stanford University) Slide Show on “Reusable Ontologies”)

– http://www.kdnuggets.com/index.html (KDD Nuggets Directory: Data Mining and Knowledge Discovery Resources)

– http://www.mcc.com/projects/infosleuth/ (InfoSleuth (MCC) --- an Agent-based System for Information Gathering)

– http://www.cs.cmu.edu/~softagents/ (CMU Intelligent Software Agents Page)

Papers:

– Special Issue IEEE Intelligent Systems on “Coming to Terms with Ontologies”, Jan./Feb. 1999.

– Special Issue IEEE Intelligent Systems on “Unmasking Intelligent Agents”, March/April 1999.

– Special Issue Communications of the ACM on “Data Mining”, vol. 39, no. 11, November 1996.