information filtering
Modern Information Retrieval Course, Semantic Web Research Laboratory
Information Filtering
Outline
Introduction
Information Filtering concept
Previous work
Filtering general features
Filtering rules and attributes
Type of filters
Profiling and Filtering Technologies
User-modeling techniques
Conclusion
Introduction
The Internet and information overload
A vast amount of information of varying quality is disseminated.
There are lots of interesting things, but also lots of trash.
Filtering tools help people find the most valuable information.
Introduction
The goal of an information filtering system is to sort through large volumes of dynamically generated information and present to the user those which are likely to satisfy his or her information requirement.
Introduction
In order to identify information that satisfies a user's information requirement or interest, an IF system needs to acquire an information filter that, when applied to an information item, evaluates whether the item is of interest or not.
An information filter represents the user's interests, identifying only those pieces of information that the user would find interesting.
The key question in designing an IF system is how to acquire such an information filter.
Information Filtering concept
Filtering information is not a new concept, nor is it one that is limited to electronic documents.
Information filtering also occurs when we read standard paper texts.
We buy only certain magazines, since other magazines may contain information that is redundant with or irrelevant to our interests.
With the increasing availability of information in electronic form, it becomes more important and feasible to have automatic methods to filter information.
Information Filtering concept
An information filtering system can be described as an automatic mechanism that monitors a continuous flow of documents and selects documents according to their relevance to the needs of a particular user or group of users.
Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests.
Information Filtering concept
These needs are represented through a profile of interests associated with the user or user group.
The ability to select relevant documents relies on information retrieval mechanisms that calculate the similarity between the documents in the collection and the profiles.
Documents that are highly similar to the profile are considered important for the user or user group.
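The similarity calculation described above can be sketched as follows. This is only a minimal illustration, assuming a bag-of-words representation and cosine similarity as the measure; the function names, the sample texts and the 0.3 threshold are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a text into a bag-of-words vector (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_stream(profile, documents, threshold=0.3):
    """Keep the documents whose similarity to the profile reaches the threshold."""
    return [d for d in documents
            if cosine_similarity(profile, term_vector(d)) >= threshold]


# Hypothetical profile and incoming documents.
profile = term_vector("information filtering user profile")
stream = [
    "a user profile drives information filtering",
    "football results from the weekend",
]
selected = filter_stream(profile, stream)
```

In this sketch only the first document shares terms with the profile, so it is the only one selected. A real system would normally weight terms rather than use raw counts.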
Information Filtering concept
For personal or professional reasons, a user’s interests may shift or change.
These changes may happen in a relatively short duration of time or over a long period of time.
The shifts can affect the user’s interests partially or fully.
To cope with this problem, it should be possible to reformulate the user’s profile.
This update is driven by feedback sent to the system about the relevance of the documents received.
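One common way to turn relevance feedback into a profile update is a Rocchio-style formula. The slides do not name a specific method, so the sketch below is an assumption used purely for illustration; the weights `alpha`, `beta` and `gamma` and the sample data are hypothetical.

```python
from collections import Counter


def update_profile(profile, relevant_docs, irrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update of a term-weight profile.

    The profile moves toward the average of the documents judged relevant
    and away from the average of the documents judged irrelevant.
    """
    def centroid(docs):
        total = Counter()
        for doc in docs:
            total.update(doc)
        return {t: c / len(docs) for t, c in total.items()} if docs else {}

    pos = centroid(relevant_docs)
    neg = centroid(irrelevant_docs)
    updated = {}
    for t in set(profile) | set(pos) | set(neg):
        w = (alpha * profile.get(t, 0.0)
             + beta * pos.get(t, 0.0)
             - gamma * neg.get(t, 0.0))
        if w > 0:  # non-positive weights are dropped, a common simplification
            updated[t] = w
    return updated


# Hypothetical feedback: one document judged relevant, one judged irrelevant.
profile = {"filtering": 1.0, "sports": 1.0}
relevant = [Counter({"filtering": 2, "profile": 1})]
irrelevant = [Counter({"sports": 4})]
new_profile = update_profile(profile, relevant, irrelevant)
```

After this update the weight of "filtering" grows, "profile" enters the profile, and "sports" drops out, which mirrors a shift in the user's interests.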
Information Filtering concept
One of the simplest methods of determining whether information matches a user's interests is through keyword matching.
If a user's interests are described by certain words, then information containing those words should be relevant.
This straightforward keyword matching often fails, however.
Inappropriate matches can arise because:
The words people use do not unambiguously reflect the topic or content.
A single word can have more than one meaning (e.g., chip).
The same concept can be described by surprisingly many different words (e.g., human factors, ergonomics).
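Both failure modes can be reproduced with a few lines of code. The sketch below is a deliberately naive keyword matcher; the function name, the profile and the sample texts are hypothetical.

```python
def keyword_match(profile_keywords, text):
    """Return True if the text contains any profile keyword as a whole word."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    return any(k in words for k in profile_keywords)


# A user interested in computer hardware and in ergonomics.
profile = {"chip", "ergonomics"}

# Ambiguity: "chip" also matches a text about food (a false positive).
false_positive = keyword_match(profile, "How to fry the perfect potato chip")

# Synonymy: a relevant text that says "human factors" is missed (a false negative).
false_negative = not keyword_match(profile, "A survey of human factors in cockpit design")
```

Both flags come out true: the matcher accepts an irrelevant text and rejects a relevant one.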
Information Filtering concept
Furnas showed that two people use the same main word to describe an object only 10 to 20 percent of the time.
Bates has reported comparably poor agreement in the generation of search terms by trained intermediaries.
Previous work
Conventional information retrieval (IR) is very closely related to information filtering (IF).
Both have the goal of retrieving information relevant to what a user wants while minimizing the amount of irrelevant information retrieved.
Previous work
One of the earliest forms of electronic information filtering came from work on Selective Dissemination of Information (SDI).
SDI was designed as an automatic way of keeping scientists informed of new documents published in their areas of specialization.
The scientist could create and modify a user profile of keywords that described his or her interests.
SDI used the profile to match the keywords against new articles in order to predict which new articles would be most relevant to the scientist's interests.
Previous work
Allen conducted a series of experiments to explore user models in predicting preferences for news articles.
He predicted which articles a person would read based on previous articles read using a measure of overlap of nouns between the new and old articles.
While the predictions were better than chance, the average correlation between the predicted articles and the subjects' ratings of the articles was fairly low (r=0.44).
Previous work
The models were more successful at predicting user preferences for general categories of articles than for specific articles.
Predicting what news articles a person will read may be an especially difficult task.
News topics vary from day to day, making it difficult to get stable estimates of interest. In addition, external sources of news probably influenced what people read in the experiment.
We believe that users' interests for technical literature will be more stable over time.
Previous work
In Allen's research, the subject's past preferences were used to construct an implicit model for retrieving relevant articles.
A different approach is to let the user explicitly structure the information.
For example, the Information Lens system allows users to create rules that filter mail messages based on keyword matches in the mail fields.
Mail messages have some structure (e.g., sender, subject).
These rules can take advantage of this structure to perform user-specified actions on the messages.
Previous work
While a variety of information systems have been developed, there has been little systematic evaluation of what features are most effective for filtering.
This leaves many unanswered questions, such as:
What are the most effective methods for matching a user's interests to information available?
How should a user's interests be described?
How will the performance of filtering methods vary in different domains?
Filtering general features
An information filtering system is an information system designed for unstructured or semi-structured data.
This contrasts with a typical database application that involves very structured data, such as employee records.
The notion of structure being used here is not only that the data conforms to a format such as a record type description, but also that the fields of the records consist of simple data types with well-defined meanings.
Email messages are an example of semi-structured data in that they have well-defined header fields and an unstructured text body.
Filtering general features
Information filtering systems deal primarily with textual information.
Unstructured data is often used as a synonym for textual data.
It is, however, more general than that and should include other types of data such as images, voice, and video that are part of multimedia information systems.
None of these data types are handled well by conventional database systems, and all have meanings that are difficult to represent.
Filtering general features
Filtering systems involve large amounts of data.
Typical applications would deal with gigabytes of text, or much larger amounts of other media.
Filtering applications typically involve streams of incoming data, either being broadcast by remote sources (such as newswire services), or sent directly by other sources (email).
Filtering has also been used to describe the process of accessing and retrieving information from remote databases, in which case the incoming data is the result of the database searches.
Filtering general features
Filtering is based on descriptions of individual or group information preferences, often called profiles. Such profiles typically represent long-term interests.
Filtering is often meant to imply the removal of data from an incoming stream, rather than finding data in that stream.
In the first case, the users of the system see what is left after the data is removed.
In the latter case, they see the data that is extracted.
A common example of the first approach is an email filter designed to remove junk mail.
Profiles may express not only what people want, but also what they do not want.
Filtering general features
Many of these features are virtually the same as those found in a variety of other text-based information systems.
Text routing, for example, involves sending relevant incoming data to individuals or groups. This process is essentially identical to filtering.
Categorization systems are designed to attach one or more predefined categories to incoming objects (this is done by newswire services, for example).
The major difference from filtering in this case is the static nature of the categories, when compared to profiles.
IF vs. IR
The entities and processes relevant to IF are almost identical to those that are relevant to IR.
The major differences appear to be:
IR is typically concerned with single uses of the system, by a person with a one-time goal and one-time query.
IF is concerned with repeated uses of the system, by a person or persons with long-term goals or interests.
IF vs. IR
IR recognizes inherent problems in the adequacy of queries as representations of information needs.
IF assumes that profiles can be correct specifications of information interests.
IR is concerned with the collection and organization of texts.
IF is concerned with the distribution of texts to groups or individuals.
IF vs. IR
IR is typically concerned with the selection of texts from a relatively static database.
IF is mainly concerned with selection or elimination of texts from a dynamic data stream.
IR is concerned with responding to the user’s interaction with texts within a single information-seeking episode.
IF is concerned with long-term changes over a series of information-seeking episodes.
IF vs. IR
In addition to these distinctions based on the models of IR and IF, there seem to be some other, contextual differences that might also be relevant to research interests.
These arise from differences in the social and/or practical situations with which IR and IF have been concerned.
These differences can be grouped into those associated with the texts, the users, and the general environment of concern to each.
IF vs. IR
Text-related issues.
For IF, the timeliness of a text is often of overriding significance. For IR, this has typically not been the case.
User-related issues.
IR has, by and large, studied well-defined user groups in well-defined, specific domains, largely in science and technology.
IF, however, is often concerned with far less well-defined user communities.
Environmental issues.
IF is highly concerned, in many situations, with issues of privacy.
IR, for a variety of reasons, has paid almost no attention to this kind of problem.
Filtering using IR
In general, the idea for filtering is to create a space of documents, some of which have previously been judged by a user to be relevant to his or her interests.
If a new document is close to relevant documents in the space, then it would be considered likely to be interesting to the user.
In the comparisons that follow, the only difference between Latent Semantic Indexing (LSI) and the keyword matching methods is that LSI represents terms and documents in a reduced-dimensional space of derived indexing dimensions.
Filtering using IR
Foltz compared LSI and keyword vector matching for filtering of Netnews articles.
In an experiment, subjects rated Netnews articles as either relevant or not relevant to their interests.
The ratings from the initial 80% of the articles they read were used to predict the relevance of the remaining 20% of the articles for each person.
Foltz found that LSI filtering improved prediction performance over the keyword matching method by an average of 13% and showed a 26% improvement in precision.
Automatic vs. Social filtering
Automatic filtering is where the computer evaluates what is of value for you.
Social filtering (collaborative filtering) is where other people help you evaluate what is of most value to read, just as publishers and organizations did in society before the Internet.
Social filtering
Social filtering means that ratings of some kind are assigned to documents.
The ratings can be compared to the stars (***) which newspapers often assign to films, books and other consumer products.
But the ratings can also include categorization into subject areas or according to particular scales.
Social filtering has some similarities to the filtering done by editors, journalists and publishers, since in both cases humans select the filtering attributes.
Social filtering
Why use social filtering?
It is difficult to design automatic or intelligent filtering algorithms that can really evaluate the content of a document and judge its value. Humans are better at deciding the value of a document.
Who makes the ratings?
Ratings for use in social filtering can be provided by:
Social filtering
Editors: special people with the task of doing such rating. An example is the people selecting which messages to put into services like Yahoo.
Readers: ordinary readers might input ratings on what they read, and these ratings might be collected and put into databases to help other people.
Authors: can provide certain kinds of ratings themselves.
Social filtering
The most successful social filtering system is Yahoo.
Yahoo employs humans to evaluate documents and puts the documents that are interesting into its structured information database.
This is very similar to what the publishers, editors, journalists and organizations did in the world before the Internet.
Social filtering
The simplest and most common filtering is by organizing discussions into groups (newsgroups, mailing lists, forums, etc.).
Each group has a topic, and wants only contributions within that topic.
Sometimes the right to submit contributions is restricted:
Only members can submit.
Competence control is done before accepting a new member.
Special moderators must approve contributions before distribution.
A recipient's choice of which groups to subscribe to can thus be seen as an act of setting a personal filter.
Thread filtering
Another simple and common filtering method is to filter by thread.
A thread is a set of messages that directly or indirectly refer to each other.
People can use threads for filtering by specifying that they want to skip existing and future contributions in certain threads.
In Usenet News, this functionality is known as a kill file.
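Skipping whole threads can be sketched as follows. The sketch assumes that each message carries the identifiers of the earlier messages it refers to, in the manner of a References header; the dictionary layout and the function names are hypothetical.

```python
def in_killed_thread(message, killed_ids):
    """A message is skipped if it, or any message it refers to, has been killed."""
    return (message["id"] in killed_ids
            or any(ref in killed_ids for ref in message["references"]))


def apply_kill_list(messages, killed_ids):
    """Return the messages that survive the kill list, in their original order."""
    return [m for m in messages if not in_killed_thread(m, killed_ids)]


messages = [
    {"id": "m1", "references": []},            # root of thread A
    {"id": "m2", "references": ["m1"]},        # reply in thread A
    {"id": "m3", "references": []},            # root of thread B
    {"id": "m4", "references": ["m1", "m2"]},  # later reply in thread A
]
# The user asks to skip thread A by killing its root message.
kept = apply_kill_list(messages, killed_ids={"m1"})
kept_ids = [m["id"] for m in kept]
```

Because the test looks at references rather than at a fixed list of messages, contributions that arrive later in the killed thread are skipped as well.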
Thread filtering
In discussion groups, messages often belong to threads.
It may then not be possible to understand a single message without seeing other messages in the same thread.
A filter or search facility that selects only certain individual messages out of threads might then not satisfy its users.
The filter must either select several items in the thread, or at least make it very easy for users, when reading one selected message, to traverse the tree up and down from this message.
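Traversing the tree up and down from a selected message can be sketched as below, assuming each message records at most one parent and the thread contains no cycles. The `parent_of` mapping and the function name are hypothetical.

```python
def thread_context(selected_id, parent_of):
    """Return the ancestors and descendants of a message in its thread tree.

    parent_of maps each message id to its parent id (None for a thread root).
    """
    # Walk up the tree to collect ancestors, nearest first.
    ancestors = []
    current = parent_of.get(selected_id)
    while current is not None:
        ancestors.append(current)
        current = parent_of.get(current)

    # Invert the parent map, then walk down to collect all descendants.
    children = {}
    for msg, parent in parent_of.items():
        children.setdefault(parent, []).append(msg)
    descendants = []
    stack = list(children.get(selected_id, []))
    while stack:
        msg = stack.pop()
        descendants.append(msg)
        stack.extend(children.get(msg, []))
    return ancestors, sorted(descendants)


# Thread: a is the root, b replies to a, c and d reply to b, e replies to c.
parent_of = {"a": None, "b": "a", "c": "b", "d": "b", "e": "c"}
ancestors, descendants = thread_context("b", parent_of)
```

A filter that selects message b could then offer a, c, d and e alongside it, so that the selected message can be read in context.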
Filtering rules and attributes
Filtering is done by applying filtering rules to attributes of the documents to be filtered.
Filtering rules are often Boolean conditions. They are usually put in an ordered list, which is scanned for each item to be filtered.
The attributes of documents to be used in filtering are:
Words in the titles, abstracts or the whole document
Automatic measurements of stylistic and language quality
Name of author
Ratings on the document supplied by its author or by other people
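An ordered list of Boolean rules can be sketched as below. This is an illustration only: the attribute names (`author`, `title`, `rating`), the addresses and the folder names used as actions are all hypothetical.

```python
def first_matching_action(document, rules, default="inbox"):
    """Scan the ordered rule list and return the action of the first rule that holds."""
    for condition, action in rules:
        if condition(document):
            return action
    return default


# Each rule pairs a Boolean condition on document attributes with an action
# (here, the name of a folder). Order matters: earlier rules win.
rules = [
    (lambda d: d["author"] == "boss@example.com", "urgent"),
    (lambda d: "filtering" in d["title"].lower() and d["rating"] >= 4, "read-soon"),
    (lambda d: d["rating"] <= 1, "trashcan"),
]

doc = {"author": "someone@example.com", "title": "Information Filtering Survey", "rating": 5}
folder = first_matching_action(doc, rules)
```

Here the first rule does not hold and the second does, so the document is assigned the "read-soon" action. A document matching no rule falls through to the default.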
Filtering rules and attributes
Filtering can be done in servers or in clients.
This figure shows how a server can filter messages before downloading them to the client.
Advantage: Filtering can be done in the background.
Disadvantage: Communication between user and filtering system becomes more complex.
Filtering rules and attributes
Alternatively, filters may be part of the client, and apply to sets of documents after they have been downloaded to the client.
Delivery of filtering results
The most common way of delivering filtering results is to sort documents into different folders.
Users choose to read new items one folder at a time.
The filter helps users read messages on the same topic at the same time.
The user can also set a personal priority for the order in which the folders are read.
Unwanted messages can be filtered to special “trashcan” folders.
Intelligent filtering
Intelligent filtering means the use of artificial intelligence (AI) methods to enhance filtering.
This can be done in different ways:
To derive attributes for documents
To derive filtering rules
For the filtering process itself
With the machine learning approach, such filtering can be done in the background, with little or no interaction with the user.
It can also be done in a way where the user can interact with the filter and help it understand why the user likes certain messages.
Filtering against spamming
Many people want filters that remove unsolicited direct marketing e-mail messages, so-called spam.
The filter has to recognize special properties of spam messages that distinguish them from other messages.
Examples of such properties are:
A message does not have your name or e-mail address in the message heading, and it does not come from any mailing list to which you subscribe.
Filtering against spamming
Examples of such properties are:
The author or sender of a message has an invalid e-mail address.
Certain words, such as “money” or “$$$”, appear in the subject. This is not very dependable; it has the same problem as all intelligent filtering.
If you often get similar spam, you might be able to recognize special properties of it and use them to stop further similar spam.
The same message, with identical content, was sent to very many users.
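Several of these properties can be checked on a single message, as in the sketch below. It is a rough illustration under stated assumptions: mailing-list traffic is assumed to carry the list address among the recipients, the sender check tests only address syntax, and all names and addresses are hypothetical. Detecting that identical content went to very many users is left out, since it needs a view of other users' mail.

```python
import re

# A deliberately loose syntax check, not a full address validator.
ADDRESS_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def spam_signals(message, my_address, subscribed_lists,
                 suspicious_words=("money", "$$$")):
    """Return the names of the spam properties that the message exhibits."""
    signals = []
    recipients = [r.lower() for r in message["to"]]
    addressed_to_me = my_address.lower() in recipients
    from_my_list = any(lst.lower() in recipients for lst in subscribed_lists)
    if not addressed_to_me and not from_my_list:
        signals.append("not addressed to me or to a list I subscribe to")
    if not ADDRESS_PATTERN.match(message["from"]):
        signals.append("malformed sender address")
    subject = message["subject"].lower()
    if any(word in subject for word in suspicious_words):
        signals.append("suspicious word in subject")
    return signals


suspect = {"from": "win@@prizes", "to": ["undisclosed@example.net"],
           "subject": "Make MONEY fast"}
normal = {"from": "alice@example.org", "to": ["me@example.org"],
          "subject": "Lecture notes"}
suspect_signals = spam_signals(suspect, "me@example.org", ["ir-course@example.org"])
normal_signals = spam_signals(normal, "me@example.org", ["ir-course@example.org"])
```

Returning the list of signals rather than a yes/no answer lets the caller decide how many weak indications are needed before a message is treated as spam, which matters because no single property is dependable.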
Type of filters
Various types of filters:
Content-based filters
Collaborative filters
Hybrid filters
Content-based Filters
A content-based filter makes use of the content of the information items to evaluate whether an item is interesting.
Profiles take the form of either user-specified keywords or rules, and reflect the long-term interests of the user.
Often the user would prefer that the system learn the profile rather than being required to provide one.
This generally involves the application of Machine Learning (ML) techniques.
The user’s feedback can be acquired either implicitly, by observing the user, or explicitly, by asking the user to rate the information items seen.
Content-based Filters (cont.)
The two primary weaknesses of using ML techniques to learn a user profile are:
Most techniques require large amounts of data.
If a new information item is significantly different from anything seen (and hence labeled) by the user before, the learned profile cannot make an accurate prediction.
Content-based filters have been used successfully in various domains, including:
Web browsing (Letizia and Syskill & Webert)
News filtering (NewsWeeder, WebMate and NewsDude)
Email filtering (Re:Agent and EmailValet)
Collaborative Filters
Collaborative filters, also known as social filters, are often used in recommender systems.
A collaborative filter makes use of a database of user preferences to find users with similar interests.
It predicts whether an unseen information item is likely to be of interest to you based on how other users have rated this item.
A community of users has to continuously rate whether the information they have seen is interesting to them or not.
Generally this rating is on a scale (e.g., from 1, meaning “not interesting”, to 5, meaning “very interesting”).
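The prediction step can be sketched as follows. This is one simple user-based variant, assuming Pearson correlation as the measure of similarity between users and a similarity-weighted average as the prediction; the slides do not prescribe either choice, and the names and ratings are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated (0.0 if undefined)."""
    common = [i for i in ratings_a if i in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)
                    * sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


def predict(user, item, all_ratings):
    """Similarity-weighted average of the ratings given to the item by other users.

    Only positively correlated users contribute. Returns None when no such
    user has rated the item.
    """
    num = den = 0.0
    for other, ratings in all_ratings.items():
        if other == user or item not in ratings:
            continue
        sim = pearson(all_ratings[user], ratings)
        if sim > 0:
            num += sim * ratings[item]
            den += sim
    return num / den if den else None


# Ratings on the 1 to 5 scale; ann has not yet seen d4.
ratings = {
    "ann": {"d1": 5, "d2": 1, "d3": 4},
    "bob": {"d1": 5, "d2": 2, "d3": 4, "d4": 5},
    "cat": {"d1": 1, "d2": 5, "d3": 2, "d4": 1},
}
prediction = predict("ann", "d4", ratings)
nobody_rated = predict("ann", "d9", ratings)
```

In this example ann agrees with bob and disagrees with cat, so the prediction for d4 follows bob's high rating. For an item that nobody has rated, the function returns None because there is nothing to base a prediction on.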
Collaborative Filters (cont.)
Collaborative filters have two common weaknesses:
The first-rater problem: if no users have rated an information item, the filter cannot evaluate whether that item is likely to be of interest to its user.
Sparse data: most users do not rate much information because of the time it takes, so it is not always easy to find users with similar profiles.
Collaborative filters work quite well and have been applied successfully in a variety of domains, including:
Finding people who are knowledgeable in a given field (Tapestry)
Netnews (GroupLens)
Music recommendation (Ringo)
Helping people to find Web resources (PHOAKS)
CDNow.com, reel.com, and Amazon.com
Hybrid Filters
The goal of hybrid filters is to combine the best features of content-based and collaborative filters and minimize the impact of their weaknesses, so as to outperform each individually.
Generally they start with one type of filter (content-based or collaborative) and incorporate features from the other type of filter to improve the performance of the original filter.
One simple approach is to have the content-based and collaborative filters each produce separate recommendations, and then combine their predictions.
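Combining two separate predictions can be as simple as a weighted average. The sketch below assumes both filters score items on the same scale; the function name and the default weight of 0.5 are hypothetical.

```python
def hybrid_score(content_score, collaborative_score, content_weight=0.5):
    """Combine two predictions with a weighted average.

    Falls back to whichever score is available when the other is None,
    for example when no other user has rated the item yet.
    """
    if content_score is None and collaborative_score is None:
        return None
    if collaborative_score is None:
        return content_score
    if content_score is None:
        return collaborative_score
    return content_weight * content_score + (1 - content_weight) * collaborative_score


both = hybrid_score(4.0, 2.0)      # both filters produced a prediction
unrated = hybrid_score(4.0, None)  # the collaborative filter had no ratings to use
```

The fallback shows how one filter can cover for a weakness of the other: an item with no ratings can still be scored from its content.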
Profiling and Filtering Technologies
Most information filtering systems are based on a number of key techniques used to describe information, create a user profile, and create the interaction and filtering needed for a useful system.
Key filtering technologies:
Keyword vectors
N-grams
Hyperlink structures
Collaborative and economic-based filtering
Data-mining techniques
Keyword vectors
Keywords are the most popular way of representing documents and are also used to represent user profiles.
Most representations are based on a standard information retrieval technique called weighted vector representation.
Document similarity, and the distance of a document from a preferred profile vector, can easily be obtained by comparing the respective vectors with, for instance, k-nearest-neighbor algorithms.
User profiles can be obtained by determining (clusters of) document vectors that are indicative of the type of information of interest to the user.
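A weighted vector representation combined with a k-nearest-neighbor decision can be sketched as follows. The sketch assumes tf-idf as the weighting scheme and cosine similarity as the comparison, neither of which is specified on the slide; for brevity the new document is included when the weights are computed. All names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build a tf-idf weighted vector (term -> weight) for each tokenized document."""
    n = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({t: c * math.log(n / doc_freq[t]) for t, c in counts.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_is_relevant(new_vec, labeled, k=3):
    """Majority vote among the k labeled vectors most similar to new_vec."""
    nearest = sorted(labeled, key=lambda pair: cosine(new_vec, pair[0]), reverse=True)[:k]
    votes = sum(1 for _, relevant in nearest if relevant)
    return votes > len(nearest) / 2


docs = [
    "filtering user profile".split(),         # judged relevant
    "profile based filtering".split(),        # judged relevant
    "football match report".split(),          # judged not relevant
    "football league table".split(),          # judged not relevant
    "user profile filtering system".split(),  # new, unjudged document
]
vectors = tfidf_vectors(docs)
labeled = list(zip(vectors[:4], [True, True, False, False]))
decision = knn_is_relevant(vectors[4], labeled, k=3)
```

The new document's nearest neighbors are the two documents previously judged relevant, so the vote classifies it as relevant.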
N-grams
An n-gram is a sequence of n letters. Typically n is at least three.
For each n and alphabet size there is a finite number of letter sequences of length n (the alphabet size raised to the power n), and thus a fixed number of possible n-grams.
A text can be converted to an n-gram distribution by counting the number of times each possible n-gram appears within the text.
The main benefit of n-grams lies in the fact that they are less sensitive to spelling errors and that the (large!) n-gram vector also incorporates more of the document structure as compared to keywords.
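The tolerance to spelling errors can be seen in a small sketch. It counts character trigrams and compares two distributions with a Dice-style overlap; the choice of overlap measure and the sample strings are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count every sequence of n consecutive characters in the text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap(a, b):
    """Share of n-gram occurrences that two distributions have in common (Dice)."""
    shared = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0


correct = ngram_counts("information filtering")
misspelled = ngram_counts("infromation filtering")  # two letters swapped
unrelated = ngram_counts("football league table")

typo_score = overlap(correct, misspelled)
other_score = overlap(correct, unrelated)
```

A whole-word keyword match would treat "infromation" as a different word altogether, whereas most of its trigrams still coincide with those of the correct spelling, so `typo_score` stays high while `other_score` stays low.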
Hyperlink structures
Especially for documents with linked structures, such as web pages, graph-like representations can be extracted, mapping out the relationships between documents, and between words near links to other documents.
Such structure can be exploited to filter web pages into different categories.
Collaborative and economic-based filtering
Collaborative (or social) filtering utilizes feedback and ratings from different users to filter out irrelevant information.
The information of interest to a user is gathered on the fly by using the opinions of other users with similar interests.
Economic-based filtering augments this idea with a cost-benefit analysis on behalf of the user.
It takes into consideration parameters like the price of a document and its cost of transmission when making filtering decisions.
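A cost-benefit decision of this kind can be sketched as below. The sketch assumes that the predicted value of a document, its price and its transmission cost are all expressed in the same unit; that assumption, the field names and the figures are hypothetical.

```python
def worth_delivering(predicted_value, price, transmission_cost):
    """Deliver a document only if its expected benefit exceeds its total cost."""
    return predicted_value > price + transmission_cost


def select_documents(candidates):
    """Keep the candidates that pass the cost-benefit test, best net benefit first."""
    kept = [c for c in candidates
            if worth_delivering(c["value"], c["price"], c["transmission_cost"])]
    return sorted(kept,
                  key=lambda c: c["value"] - c["price"] - c["transmission_cost"],
                  reverse=True)


candidates = [
    {"id": "report", "value": 5.0, "price": 1.0, "transmission_cost": 0.5},
    {"id": "video", "value": 3.0, "price": 2.0, "transmission_cost": 2.0},
    {"id": "memo", "value": 2.0, "price": 0.0, "transmission_cost": 0.5},
]
chosen = [c["id"] for c in select_documents(candidates)]
```

The video is predicted to be more interesting than the memo, yet it is rejected because its combined price and transmission cost exceed its value.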
Data-mining techniques
Data-mining techniques can be employed to find similarities between data entries, and thus to infer that the profile of a given user might be very close to the profiles of some other users.
Correlating the current customer with previous users (people-to-people correlation, e.g., you are like this type of customer who typically likes...) or with those previous users' interests (item-to-item correlation, e.g., this item you are considering is very much like these items...) allows companies to present a customer with information that he or she is likely to be interested in.
Key user-modeling techniques
User modeling can be defined as the effort to create a profile of the user's interests and habits.
Profiles can be acquired or generated in a variety of ways:
By explicit modeling by humans:
By direct user interviews and questionnaires
By knowledge engineers using user stereotypes
Rule-based profiles, where the users specify their own rules in the profile, rules that control the behavior of the model
Key user-modeling techniques (cont.)
By automated software techniques:
Machine learning techniques like inference, induction and classification, where the modeler tries to identify certain patterns in the user's behavior
Profile building by example, where the user provides examples of his/her behavior and the modeling software records them
At the moment, the first method (explicit modeling by humans) is much further developed and significantly more widely applied than the second (automated software techniques), which is still in its development phase.
Conclusion
Information retrieval and Information filtering are indeed two sides of the same coin.
They work together to help people get the information needed to perform their tasks.