
Design and Deployment of a Personalized News Service

Mark Stefik, Lance Good

From 2008 to 2010 we built an experimental personalized news system where readers subscribed to organized channels of topical information that were curated by experts. AI technology was employed to present the right information efficiently to each reader and to radically reduce the workload of curators. The system went through three implementation cycles and processed more than 20 million news stories from about 12,000 Really Simple Syndication (RSS) feeds on more than 8000 topics organized by 160 curators for more than 600 registered readers. This article describes the approach, engineering, and AI technology of the system.

Problem Description

It is hard to keep up on what matters. The limiting factor is not the amount of information available but our available attention (Simon 1971). Traditional mainstream media cover general interest news to attract a large audience. Although most people are interested in some of the topics covered in the mainstream, they also have specialized interests from their personal lives, professions, and hobbies that are not popular enough to be covered there. Satisfying all of our information needs requires visiting multiple sites, but this requires navigation and there is often much overlap in news coverage across sites. The goal of our system (called Kiffets) is to provide readers with news at a single site that is personalized, curated, and organized.

Approach to Personalizing News

A snapshot of a personal information appetite or "information diet" reveals further nuances. Some interests endure. Some interests are transient, following the events of life. When we form new interests, a topically organized orientation helps to guide our understanding of a new subject area.

Kiffets readers explicitly select channels to cover subject areas that interest them. The list of My Channels on the left in figure 1 shows the channels to which this reader subscribes, including USA, The Future of Journalism, Sustainable Living, Planet Watch, Science and Politics, and several more.

The information on the right under My Overview shows the top stories from these channels. The top four stories included a Privacy story from the Information Media channel, a story about Snow from the USA channel, a story about Egypt from the World News channel, and a story about the relative market share of tablets from the Information Media channel. The subsequent presentation shows top stories for each of the channels, organized by topic.

To provide this reader experience, a news delivery system needs to use knowledge from news editors. It needs to collect articles from appropriate sources, classify them, and organize them by topic. Creating this organized presentation of news automatically requires capturing and harnessing the relevant subject matter expertise. In traditional news publications, editors use their expertise to select sources and organize news presentations. Publishers arrange to have enough material to satisfy their audiences and enough editors to vet and organize the material.

We call our corresponding subject matter experts curators. We depart from tradition by enabling any user to be a curator and to attract a following by sharing articles in organized, topical channels. We also simplify curation by automating the repetitive work. The idea is to extend topical coverage down the long tail (Anderson 2006) of specialized interests by using a large group of curators.

What Readers and Curators Need

Based on interviews with Kiffets users, busy readers want to forage for news efficiently. A "5-20-60 rule" expresses how a reader with a busy lifestyle allocates reading time.

5-20-60 rule. I typically take time to read 20 articles a day and scan 60. When I'm in a hurry I only have time to read 5 articles according to my interests. When I have time I want to drill down selectively for more information.

Such busy readers want to optimize their use of available time to follow the news. They want to read the important stories first and to scan the rest of the news. Although they want to be informed about topics at the edges of their interests, they have a strong sense of priority topics that they want to keep up on. They want adequate cues or "information scent" so that they can tell whether a given story is interesting. They drill down accordingly as their interests dictate and they have time. Such actions constitute information foraging strategies (Pirolli 2007).

Figure 1. A Reader's Overview of Subscribed Channels for News.

In Kiffets, readers leverage the knowledge of curators. A curator has a strong interest and expertise in a subject area. Curators want to share their perspective with others. They are not paid for their work and depend on Kiffets to do the heavy lifting of collecting, classifying, and presenting news. They are not programmers and typically have no understanding of AI or machine-learning technology.

Our goal was to provide curators with a simple and intuitive interface for selecting trustworthy news sources and creating a useful topic structure for organizing the news. They need a simple and intuitive way of conveying their understanding of how to classify articles into topics.

In this way our approach draws on three sources of power that we call the light work of the many (the readers), the harder work of the few (the curators), and the tireless work of the machines (our system).

Guide to Reading

Most of the technology supporting our personalized news service is web and distributed systems (cloud) programming. About one-fourth is artificial intelligence and information retrieval technology (Jones and Willett 1997). This includes mainly topic modeling and machine learning technologies that we developed.

The next two sections describe the Kiffets personalized news service first as a software-as-a-service application built to meet user requirements, and then in terms of its underlying AI technology. Later sections describe competing approaches and lessons learned.

Application Description

Manual curation is practical for traditional newspapers and magazines because the topics corresponding to newspaper sections are broad, their number is small, and the published articles are drawn from just a few sources.

Personalized news differs from traditional mainstream news because it requires a regime of information abundance. Some narrow topics are very infrequently reported, appearing sparsely in a range of publications. Topics of general interest are often widely reported, with similar articles appearing in many publications.

In 2010 Kiffets curators used more than 12,000 RSS feeds as sources for news. These feeds contributed between 25,000 and 35,000 articles every day, which were classified automatically by topic. The bigger channels on our system used a few hundred feeds as sources and could pull in hundreds of articles per day for their topics.

Kiffets employs a distributed computing architecture to deliver personalized news rapidly to readers as they explore their interests. It also gives quick feedback to curators as they define topic folksonomies and tune individual topic coverage by selecting training examples. Most of the information processing of news is done continuously by a collection of backend processors, caching results for delivery by a web server.

Supporting Readers

Readers select subject areas of interest and the system provides current information, vetted and organized.

Finding Relevant Channels

Suppose that a reader wants to explore information about Egypt, going beyond the channels to which the reader has already subscribed. Figure 2 shows the results of searching for channels about Egypt when a draft of this article was written. The results show the most relevant channels, together with their most relevant subtopics and a sample article.

The search query "Egypt" does not specify what aspect of that country the reader is interested in. The delivered results show channels that focus variously on Egypt as a travel destination, Egypt in the context of recent World Cup soccer, Egypt in world news, and so on.

Figure 3 shows the results of clicking the first search result to the Egypt channel. The display offers articles from different time periods, such as the previous 24 hours, the previous 2 days, and so on.

To guide the reader in further browsing, the presentation links to related topics from the Egypt sightseeing channel, including Cairo Sightseeing and Archeological Finds in Egypt. A reader interested in traveling to Egypt can check out the Egypt sightseeing channel. A reader more interested in the Egyptian political situation could visit one of the curated channels specific to that interest.

As readers' interests shift they can subscribe or unsubscribe to curated channels or create new channels themselves.

Making Foraging Efficient

Most users subscribe to a few channels, some on general news and some on specialized topics. Showing all of the articles on the front page would present an unwieldy flood of information.

Kiffets presents top articles as starting points and enables readers to drill down selectively. Figure 4 shows articles that appear if a reader drills down at Health and Safety in the USA folksonomy. Readers particularly interested in natural disasters can drill down again to see articles about earthquakes, fires, storms, volcano eruptions, and other natural disasters, as in figure 5. Each drill down reveals more topics and more articles. This novel Kiffets capability is enabled by AI technology described later. It supports information foraging and is one of our favorite features.

Supporting Curators

Curators are often busy people. Like traditional news editors, curators select reliable sources and decide how the articles should be organized. In contrast to traditional news organizations, curators on our system do not do this work manually every day. They expect the system to acquire their expertise and to allow them to tune it over time.

When articles come from many sources, the main work is finding and organizing them. Automating this work is the main opportunity for supporting curators.

Figure 6 shows the folksonomy of topics for the channel Future of Journalism. The curator for this channel organizes the topics and indicates training examples for each topic.

System support for curators is enabled by AI technology described later. Besides the automatic classification of articles by topic, it includes machine learning of topic models, hot topic detection, clustering methods to detect duplicate articles, source recommendation, and automatic selection and allocation of space for displaying top articles.

System Architecture

Figure 7 shows the system architecture. Users access the system through web browsers. Browser programs written in HTML/JavaScript and Adobe Flash provide interactivity. The API to the web server uses REST protocols with arguments encoded in JSON. The web server is outside our firewall and uses Django as the web framework. Code for transactions with the rest of the system is written in Python.
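To make the REST/JSON arrangement concrete, the sketch below shows what a channel-overview endpoint might look like in the Django web layer. The URL pattern, view name, response fields, and the stubbed cache lookup are illustrative assumptions (written with modern Django helpers), not the actual Kiffets API.

```python
# Hypothetical sketch of one REST endpoint in the Django web layer. The URL,
# view name, JSON fields, and the stubbed cache lookup are illustrative
# assumptions, not the actual Kiffets API.
from django.http import JsonResponse
from django.urls import path


def fetch_top_stories_from_cache(channel_id):
    # Placeholder for a Memcached/MySQL lookup of results precomputed
    # by the back-end jobs.
    return [{"title": "Example story", "url": "http://example.com/story",
             "topic": "Example topic"}]


def channel_overview(request, channel_id):
    """Return the precomputed top stories for one channel as JSON."""
    stories = fetch_top_stories_from_cache(channel_id)
    return JsonResponse({"channel": channel_id, "stories": stories})


urlpatterns = [
    path("api/channels/<int:channel_id>/overview", channel_overview),
]
```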

Transactions through the firewall are directed to a MySQL database, a Solr (variant of Lucene) server that indexes articles, and a topic search server that we wrote in Java. These specialized servers are for transactions that require fast, low-latency computations. The computations include user-initiated searches for articles or topics and also interactive services for curators who are tuning their topic models and finding new sources. A caching server reduces the load on the database for common queries. All of these servers run on fairly recent midclass Dell workstations.

Java programs running on a back-end Hadoop cluster of a dozen workstation-class computers carry out most of the information processing. The collector-scheduler process periodically schedules jobs to crawl curator-specified RSS feeds on the web, collect and parse articles, classify them by topic, and cluster related articles from multiple sources. Other periodic Hadoop jobs use AI technology to remove duplicates for topics, identify hot topics, and identify top articles. Still other Hadoop jobs for machine learning are triggered when curators mark articles as on topic or off topic.
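A minimal sketch of the collect-and-classify step appears below. The production collector-scheduler ran as Java jobs on Hadoop; this Python version, which uses the feedparser library and a hypothetical topic object with name and accepts attributes, is only meant to show the shape of the pipeline.

```python
# Minimal sketch of the collect-and-classify step. The production
# collector-scheduler ran as Java jobs on Hadoop; this Python version, using
# the feedparser library and a hypothetical topic object with `name` and
# `accepts` attributes, only shows the shape of the pipeline.
import feedparser


def collect_articles(feed_urls):
    """Fetch curator-specified RSS feeds and yield raw article records."""
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            yield {
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "text": entry.get("summary", ""),
                "source": url,
            }


def classify_articles(articles, topics):
    """Pair each article with every topic whose model accepts it."""
    for article in articles:
        matched = [topic.name for topic in topics if topic.accepts(article["text"])]
        if matched:
            yield article, matched
```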

Most of the article information (about 3 terabytes) is stored in HBase, a NoSQL (key-value pair) database that runs on Hadoop's distributed file system.
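For illustration, a key-value article record of the sort described here could be written and read through HBase's Thrift gateway as sketched below. The happybase client, the table and column-family names, and the row-key scheme are all assumptions; the production system accessed HBase from Java Hadoop jobs.

```python
# Sketch of storing and fetching one article record in HBase. The happybase
# client, table name, column family, and row-key scheme are assumptions; the
# production system accessed HBase from Java Hadoop jobs.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical host
articles = connection.table("articles")                  # hypothetical table

row_key = b"20100815:example.com:1234"                   # assumed key scheme
articles.put(row_key, {
    b"content:title": b"Example story",
    b"content:text": b"Body text of the article ...",
    b"meta:source": b"http://example.com/feed",
})

stored = articles.row(row_key)
print(stored[b"content:title"])
```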

Figure 2. Finding a Good Channel about Egypt.


Figure 3. A Channel about Egypt.

Figure 4. Subtopics and Articles for Health and Safety.


AI Technology Description

We used several kinds of AI technology to address the information-processing challenges. The main functions of the AI technology are classifying articles into topics and creating an effective news presentation.

Robust Topic Identification

Accurate topical classification of articles is difficult because there is variability in the words used in articles on a topic and "noise" in news articles. The noise in an article includes advertisements for products and blurbs inserted by publishers to keep readers engaged.

Optimal Query Generation

Many online news systems classify articles automatically by matching a Boolean query against articles. Several common conditions can cause this approach to be unsatisfactory. One issue is that common words often have multiple meanings. Does a search for "mustang" refer to a horse, a car, or something else? User expectations of precision are much higher for automatic article classification than for results of search engines. When people search interactively, they face a trade-off between carefully developing a precise query and spending time foraging through the results. It is acceptable if many of the results are off topic as long as a satisfactory result appears in the top few. In contrast, users find it unacceptable when a system supplies its own query and there are many off-topic articles.

Figure 5. Articles for Different Kinds of Natural Disasters.

Skilled query writers can address this issue by writing complex queries. We have found, however, that complex queries are prone to errors and refining them is often beyond the skill and patience of our curators. In practice few people write queries with more than two or three key words and seldom express Boolean constraints.

We developed a machine-learning approach to generate optimal queries. As curators read the news, they can mark articles that they encounter. If they flag an article as off topic, it becomes a negative training example for the topic. If they come upon a particularly good article in their reading, they can recommend it, making it an on-topic training example. Upon receiving new training examples, Kiffets schedules a training job to find a new optimal query and reclassify articles for the topic. Curators need not understand how this works.

Because we have reported on this approach elsewhere (Stefik 2011), we describe it here only briefly. Our system employs a hierarchical generate-and-test method (Stefik 1995) to generate and evaluate queries. The queries are expressed in a Lisp-like query language and compiled into Java objects that call each other to carry out a match. The articles are encoded as arrays of stemmed words represented as unique integers. With query-matching operations implemented as operations on memory-resident numeric arrays, the system is able to consider several tens of thousands of candidate queries for a topic in a few seconds. This is fast enough to support an interactive session when a curator wants to tune a topic.
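The sketch below illustrates the flavor of this representation: query operators compiled into objects that match articles encoded as integer word IDs. It is a simplified Python rendering (the production version was Java, and n-gram operators over word sequences are omitted), and the word IDs in the example are made up.

```python
# Sketch of query objects matching articles encoded as integer word IDs. The
# production system compiled a Lisp-like query language into Java objects and
# also supported n-gram (word sequence) operators, which are omitted here.

class Term:
    def __init__(self, word_id):
        self.word_id = word_id

    def matches(self, article):
        # article is a set of stemmed-word IDs
        return self.word_id in article


class And:
    def __init__(self, *parts):
        self.parts = parts

    def matches(self, article):
        return all(part.matches(article) for part in self.parts)


class Or:
    def __init__(self, *parts):
        self.parts = parts

    def matches(self, article):
        return any(part.matches(article) for part in self.parts)


# Example: "(gas OR gasoline) AND price", with made-up word IDs.
GAS, GASOLINE, PRICE = 101, 102, 205
query = And(Or(Term(GAS), Term(GASOLINE)), Term(PRICE))
assert query.matches({17, 101, 205, 998})   # contains "gas" and "price"
assert not query.matches({17, 101, 998})    # no "price"
```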

The query terms are chosen from the training examples, drawn from words that have high term frequency-inverse document frequency (tf-idf) ratios, that is, words whose frequencies in the training examples are substantially higher than their frequencies in a baseline corpus. The query relationships are conjunctions ("gas" AND "pollution"), disjunctions ("army" OR "navy"), n-grams (sequences of words), and recursive compositions of these. Candidate queries are scored, rewarding matches of on-topic examples, penalizing matches of off-topic examples, and rewarding query simplicity.
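Building on the query classes sketched above, the scoring step might look like the following. The reward, penalty, and simplicity weights are illustrative assumptions, and the training examples here are the integer-encoded articles used earlier.

```python
# Sketch of scoring a candidate query against integer-encoded training
# examples, building on the query classes above. The reward, penalty, and
# simplicity weights are illustrative assumptions, not the tuned Kiffets values.

def query_size(query):
    """Count the nodes of a query tree as a crude complexity measure."""
    parts = getattr(query, "parts", ())
    return 1 + sum(query_size(part) for part in parts)


def score_query(query, on_topic, off_topic,
                hit_reward=1.0, miss_penalty=2.0, size_penalty=0.05):
    """Reward matching on-topic examples, penalize matching off-topic
    examples, and reward simple queries."""
    hits = sum(1 for article in on_topic if query.matches(article))
    false_hits = sum(1 for article in off_topic if query.matches(article))
    return (hit_reward * hits
            - miss_penalty * false_hits
            - size_penalty * query_size(query))


# The generator would enumerate candidates built from high tf-idf terms and
# keep the best one, for example:
#   best = max(candidates, key=lambda q: score_query(q, on_topic, off_topic))
```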

Although the optimal query generator automates writing queries, this approach does not get around fundamental problems with using queries alone to classify articles. For example, it does not distinguish cases where articles match a query incidentally, such as when article web pages contain advertisements or short descriptions provided by a publisher to draw a reader to unrelated articles. The query approach also does not distinguish articles that are mainly on topic from articles that are mainly off topic but contain tangential references to a topic. For this reason, we characterize the query approach as having high precision and high vulnerability to noise.

Figure 6. A Topic Tree Folksonomy for Future of Journalism.


Single-Core Topics

To reduce noise vulnerability, we incorporate a statistical modeling approach for classifying articles. This approach complements query matching and has opposite characteristics. In contrast to the query approach, it has low vulnerability to noise but also low precision.

The statistical approach considers an article as a whole, rather than focusing on just the words and phrases in a query. It represents an article as a term vector (Salton and Buckley 1988), pairing basis words with their relative frequencies in the article. We compute the similarity of the term vector for an article to the term vectors for the topic as derived from its training examples. With a cosine similarity metric, the score approaches one for a highly similar article and zero for a dissimilar article. A similarity score of about 0.25 is a good threshold for acceptability.
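A minimal sketch of this test is shown below, using relative term frequencies and a plain cosine measure; the real system derived tf-idf reference vectors from the training examples, so the details here are simplifications.

```python
# Sketch of the term-vector similarity test. Tokenization and weighting are
# simplified (relative term frequencies); the real system used tf-idf
# reference vectors derived from the training examples.
import math
from collections import Counter


def term_vector(text):
    """Map each word to its relative frequency in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(v1, v2):
    dot = sum(weight * v2[word] for word, weight in v1.items() if word in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


SIMILARITY_THRESHOLD = 0.25  # the acceptability threshold cited in the text


def similar_enough(article_text, reference_vector):
    return cosine_similarity(term_vector(article_text), reference_vector) >= SIMILARITY_THRESHOLD
```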

For example, with the current concerns about energy and the economy, stories about gas prices often appear in the news. Although some stories are simply about the rise or fall of prices, other stories mention gas prices in relation to other matters. For example, stories about the expenses of living in the suburbs sometimes mention rising gas prices as a factor. Ecological stories about offshore or arctic drilling for oil sometimes predict the effects of increased regulation on gas prices.

Suppose that a curator wants to include articles that are mainly about the rise or fall of gas prices. The curator may be willing to include articles mentioning arctic drilling, but not if they are only incidentally about gas prices. This is our simplest kind of topic model. We call it a single-core topic because it has a single focus. (Multicore topics have multiple foci, such as if a curator specifically also wants to include gas price stories about arctic drilling for petroleum.)

Figure 7. System Architecture.

Figure 8 illustrates how we model single-core topics. The outer box surrounds the articles from the corpus. The dashed box surrounds articles that match a Boolean query. The query could be as simple as the terms "gas price" with its implied conjunction or it could be more complex, such as "(gas OR gasoline) AND (price OR cost)." The dark circle represents articles whose similarity to the positive training examples is above a threshold.

A reference vector is computed for each of the training examples. In this example, a reference vector for suburban living would include words describing living far from cities and commuting to work. Reference vectors give a computed tf-idf weight for each word and are the same for either on-topic or off-topic examples.

This combined topic model joins an optimal Boolean query with term vectors for the on-topic and off-topic examples. An article is classified on topic if it matches the query, its vector is similar enough to an on-topic reference vector, and it is not too similar to an off-topic reference vector. This approach combines the high precision of a Boolean query with graduated measures of similarity to training examples. It has proven precise enough for topics and robust against the noise found in most articles. Restated, this approach is much less vulnerable to false positive matches to an irrelevant advertisement on a page than query matching alone. For most topics, three to six training examples of each type are enough for satisfactory results.
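Putting the pieces together, the decision rule might be sketched as follows, reusing the query objects and the term_vector and cosine_similarity helpers from the earlier sketches. The off-topic threshold and the argument layout are assumptions; only the 0.25 acceptability value comes from the text.

```python
# Sketch of the combined single-core decision, reusing the query objects and
# the term_vector/cosine_similarity helpers sketched earlier. The off-topic
# threshold is an assumption; only the 0.25 value is from the text.

ON_TOPIC_THRESHOLD = 0.25    # acceptability threshold cited in the text
OFF_TOPIC_THRESHOLD = 0.25   # assumed; in practice a tunable value


def on_topic(article_word_ids, article_text, query, on_vectors, off_vectors):
    """An article is on topic only if it matches the Boolean query, is close
    enough to some on-topic reference vector, and is not too close to any
    off-topic reference vector."""
    if not query.matches(article_word_ids):
        return False
    vec = term_vector(article_text)
    near_on = any(cosine_similarity(vec, ref) >= ON_TOPIC_THRESHOLD
                  for ref in on_vectors)
    near_off = any(cosine_similarity(vec, ref) >= OFF_TOPIC_THRESHOLD
                   for ref in off_vectors)
    return near_on and not near_off
```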

Multiple-Core and Multiple-Level Topics

Consider a western newspaper that has a section of stories for each state. Suppose that the curator wants to include articles about the Oregon state economy covering lumber and high technology, but not tourism. For Oregon politics she wants to cover politics in the state capital, politics of the Portland mayor's office, and some controversial state initiatives. She may also want to cover major news from Portland, the state's biggest city.

Figure 9 suggests how the approach for modeling topics works recursively. Dashed boxes represent a set of articles defined by a query. Circles represent term-vector filters for including on-topic examples. Triangles represent filters for excluding off-topic examples.

Figure 8. A Single-Core Topic for Gas Prices.

Figure 9. A Multilevel, Multicore Topic Model for Oregon.

Figure 10. Displaying Articles Promoted from Subtopics.

Each of the Oregon subtopics is complex. We call these fractal topics in analogy with fractal curves like coastlines that maintain their shape complexity even as you look closer. The subtopic for Oregon Economy has its own on-topic and off-topic cores, shown as small circles and triangles, respectively. When you look closer, the subtopic for Oregon Politics also has multiple levels.

Prioritizing Articles

In an early version of the system, all of the new articles for a channel (over its entire folksonomy of topics) were presented at once. This was overwhelming for readers even though articles were classified by topic. In a second version, we showed only the articles from the top-level topics and readers had to click through the levels of the folksonomy to find articles on deeper topics. This was too much work and caused readers to miss important stories. In the current version, articles are presented a level at a time but a rationed number of articles from the leaf topics are selectively bubbled up the topic tree through their parents.

Which articles should be selected to propagate upwards? Kiffets combines several factors in promoting articles through levels. Articles are favored if they are central to a topic, that is, if their term vector is similar to a composite term vector for a topic or close to one of its individual training examples. Articles from a topic are favored if the topic is hot, meaning that the number of articles on the topic is dramatically increasing with time. Story coverage in a parent topic balances competing subtopics.

Articles can be classified as matching more than one topic. For example, in the Kiffets USA index, articles about difficulties between Google and China a few years ago were classified under a topic relating to trade, a topic about censorship, and a topic relating to cyber attacks. When top articles bubble up a folksonomy, similar or identical articles might appear from more than one of its child topics. The allocation algorithm presents an article only once at a level, showing the topic that matched it best. If the reader descends the topic tree, the article may reappear in a different context. Figure 10 gives examples of three articles promoted from subtopics of Health and Safety. The first article comes from the leaf topic Snow. Its full topic trail is USA > Health and Safety > natural disasters > Storms > Snow.
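The sketch below shows one way such a promotion step could be written: each subtopic contributes candidate articles scored by centrality with a boost for hot topics, a rationed number are promoted, and a duplicate article is shown only once per level under its best-matching subtopic. The scoring weights, the ration, and the balancing of competing subtopics (which the text mentions but this sketch omits) are assumptions.

```python
# Sketch of promoting a rationed number of articles from child topics up one
# level of the folksonomy. The centrality scores, hot-topic boost, and ration
# are illustrative assumptions; balancing coverage across competing subtopics
# is omitted.

def promote_articles(subtopics, ration=6, hot_boost=0.5):
    """Bubble up at most `ration` articles, showing each article only once,
    attributed to its best-scoring subtopic."""
    candidates = []   # (score, article_id, subtopic name)
    for topic in subtopics:
        boost = hot_boost if topic["is_hot"] else 0.0
        for article_id, centrality in topic["articles"]:
            candidates.append((centrality + boost, article_id, topic["name"]))

    candidates.sort(reverse=True)
    promoted, seen = [], set()
    for score, article_id, topic_name in candidates:
        if article_id in seen:
            continue                  # show an article only once per level
        seen.add(article_id)
        promoted.append((article_id, topic_name, score))
        if len(promoted) >= ration:
            break
    return promoted


# Example with made-up data: article "a1" appears under two subtopics but is
# promoted only once, under the hot Storms subtopic where it scores highest.
subtopics = [
    {"name": "Storms", "is_hot": True, "articles": [("a1", 0.8), ("a2", 0.6)]},
    {"name": "Earthquakes", "is_hot": False, "articles": [("a1", 0.7), ("a3", 0.9)]},
]
print(promote_articles(subtopics, ration=3))
```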

Detecting Duplicate Articles

Busy newsreaders are annoyed by duplicate articles. Exact duplicates of articles can arise when curators include multiple feeds that carry the same articles under different URLs. Reader perception of duplication, however, is more general than exact duplication and includes articles that are just very similar. The challenge is finding an effective and efficient way to detect duplicates.

Our approach begins with simple heuristics for detecting identical wording. The main method uses clustering. Since the number of clusters of similar articles is not known in advance, we developed a variant of agglomerative clustering. We employ a greedy algorithm with a fixed minimum threshold for similarity. Two passes through the candidate clusters are almost always enough to cluster the duplicate articles. An example of a clustering result is shown below the first article in figure 10 in the link to "All 2 stories like this." In our current implementation, duplicate removal is done only for displays of articles from the previous 24 hours.
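A greedy, threshold-based clustering pass of this kind might be sketched as follows, reusing the term_vector and cosine_similarity helpers from above. The similarity threshold and the single-pass simplification are assumptions; as noted, the production variant made up to two passes over the candidate clusters.

```python
# Sketch of greedy, threshold-based clustering of near-duplicate articles,
# reusing term_vector and cosine_similarity from above. The threshold value is
# an assumption, and this version makes a single pass; the text notes that two
# passes over the candidate clusters were almost always enough.

DUPLICATE_THRESHOLD = 0.8   # assumed value for "very similar" articles


def cluster_duplicates(article_texts, threshold=DUPLICATE_THRESHOLD):
    """Greedily assign each article to the first cluster whose representative
    vector is similar enough; otherwise start a new cluster."""
    clusters = []   # each cluster: {"rep": term vector, "members": [texts]}
    for text in article_texts:
        vec = term_vector(text)
        for cluster in clusters:
            if cosine_similarity(vec, cluster["rep"]) >= threshold:
                cluster["members"].append(text)
                break
        else:
            clusters.append({"rep": vec, "members": [text]})
    return clusters
```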

Other AI Technology for Information Processing

Most of the programming in Kiffets is for system tasks such as job scheduling, data storage and retrieval, and user interactions. Nonetheless, AI techniques have been essential for those parts of the system that need to embody knowledge or heuristics. Here are some examples:

A hot-topics detector prioritizes topics according to growth rates in editorial coverage across sources, identifying important breaking news (a minimal sketch of this growth-rate test appears after this list).

A related-topic detector helps users discover additional channels for their interests.

A near-misses identifier finds articles that are similar to other articles that match a topic, but which fail to match the topic's query. The near-miss articles can be inspected by curators and added as positive examples to broaden a topic.

A source recommender looks for additional RSS feeds that a curator has not chosen but that deliver articles that are on topic for a channel.
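As referenced in the hot-topics item above, here is a minimal sketch of the growth-rate idea: a topic is flagged when its recent per-day article count is several times its baseline rate. The window sizes, growth ratio, and minimum recent count are illustrative assumptions.

```python
# Sketch of a growth-rate test for hot topics. The window sizes, growth ratio,
# and minimum recent count are illustrative assumptions.

def is_hot(daily_counts, recent_days=2, baseline_days=14,
           min_growth=3.0, min_recent=5):
    """Flag a topic as hot when its recent article rate is several times its
    baseline rate across sources."""
    recent = daily_counts[-recent_days:]
    baseline = daily_counts[-(recent_days + baseline_days):-recent_days]
    if not baseline or sum(recent) < min_recent:
        return False
    recent_rate = sum(recent) / len(recent)
    baseline_rate = sum(baseline) / len(baseline)
    return recent_rate >= min_growth * max(baseline_rate, 1e-9)


# Example: a topic that jumps from about one article a day to eight is hot.
print(is_hot([1, 2, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 1, 7, 9]))   # True
```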

Competing Approaches

At a conference about the future of journalism, Google's Eric Schmidt spoke on the intertwined themes of abundance and personalization for news (Arthur 2010).

The Internet is the most disruptive technology in history, even more than something like electricity, because it replaces scarcity with abundance, so that any business built on scarcity is completely upturned as it arrives there.

He also reflected on the future of mass media and the news experience.

It is … delivered to a digital device, which has text, obviously, but also color and video and the ability to dig very deeply into what you are supplied with. … The most important thing is that it will be more personalized.

Many news aggregation and personalization services have appeared on the web over the last few years. Some of these services have been popular, at least for a while. In the following we describe the elements that are similar to or different from our approach.

Choosing Who to Follow

A few years ago RSS readers were introduced to enable people to get personalized news. RSS readers deliver articles from RSS feeds on the web, created by bloggers and news organizations. RSS readers do not organize news topically and do not provide headlines of top stories. Rather, they display articles by source. A news consumer can read articles from one source and then switch to read articles from another one. Systems vary in whether they are web based or desktop applications and in how they keep track of the articles that have been read. According to a 2008 Forrester report (Katz 2008), however, consumer adoption of RSS readers has only reached 11 percent, because people do not understand them.

The Pew Internet and American Life Project reports on changes in how people consume and interact with news. Much of the growth in online services with news is in systems like Twitter and Facebook, which are similar to RSS readers in that users specify their interests in terms of sources or people that they want to follow. According to Pew, Internet sources have now surpassed television and radio as the main source of news for people under 30. Kiffets benefited from the RSS work since it made feeds for news sites and blogs widely available.

Matching Key Words

News alert systems ask users to provide key words or a query that specifies the news that they want. This approach treats personalization as search. Typical users receive news alert messages in their email.

Since news alert systems maintain a wide spectrum of sources, they sidestep the problem of asking users to locate or choose appropriate RSS feeds on the web. However, a downside of using a broad set of sources to answer queries is that many of the articles delivered are essentially noise relative to the user's intent, due to unintended matches to incidental words on the web pages containing the articles.

Another disadvantage of news alert systems is that the precision of queries inherently limits their potential for surprise and discovery. In struggling to get just the right query, news consumers potentially miss articles that express things with different words. Furthermore, news consumers want to find out about what's happening without anticipating and specifying what the breaking news will be.

Personalized News by Mainstream Publishers

Some major news publishers let their customers choose from a predefined set of special interest sections such as, say, Science and Technology, or allow them to specify key words that are matched against news articles from the publisher. The predefined sections are manually curated, and the key-word sections rely on simple matching. According to a private communication from a technology officer of a major national news publisher, fewer than 3 percent of their mainstream news customers enter any form of customizing information.

Systems like Google News offer a similar combination of methods except that they draw from many sources. They offer predefined channels (World, Business, Sci/Tech) on broad topics, which seem to achieve topical coherence by showing only articles from appropriate manually curated feeds. Any user-defined channels based on key words have the same noise problems as other key-word approaches. Google News also uses a clustering approach to identify hot articles. Lacking sections defined by topic trees, it does not organize articles into coherent, fine-grained sections. These services are simpler to use than RSS readers because users need not select sources.

Collaborative Filtering

Besides these main approaches for personalized news, there are also social approaches for gathering and delivering news. Collaborative filtering approaches recognize that "birds of a feather" groups are powerful for recommending particular news (and movies, books, and music). These systems collect data about user preferences, match users to established groups of people with similar interests, and make recommendations based on articles preferred by members of the groups. Findory (www.findory.com) and DailyMe (www.dailyme.com) are examples of early and current news systems, respectively, that use collaborative filtering to deliver personalized news. (AI Magazine ran an issue with several articles [Burke, Felfernig, and Göker 2011] on recommender systems.)

Collaborative filtering systems need to identify affinity groups, for example, by explicitly asking users to rank their interests in a questionnaire. Systems can also keep track of the articles that users read and infer groups implicitly. Since people typically have several distinct news interests, systems must account for each interest separately.

Some news sites use collaborative filtering to support personalization. These systems keep track of stories that users read and use collaborative filtering to identify and predict personalized interests of the readers. Stories matching the personalized categories are promoted to higher prominence in the presentation. One of our unaddressed goals for Kiffets was to augment our user modeling and article ranking by collecting individual user data.

The NewsFinder system (Dong, Smith, and Buchanan 2011), which distributes news stories about artificial intelligence, collects and displays explicit user ratings for articles. Although that system is not designed for personalized news or open curation, it does employ some similar AI methods, such as detecting duplicates.

Social News Sites

Social news sites such as Reddit (www.reddit.com) or Digg (www.digg.com) enable people to submit articles. The articles are ranked by popularity according to reader votes. Social bookmarking sites such as Delicious (www.delicious.com) are near cousins to social news sites. Their primary purpose is to organize a personal set of browser bookmarks to web pages, and their secondary purpose is to share and rank the bookmarks. Social news sites rely on social participation both for collecting and ranking articles and, with enough participants, can address topics on the long tail of the users' specialized interests.

Social news sites face a challenge in getting an adequate stream of articles for narrow topics, especially when the participating groups are just getting established. Furthermore, the presentation of news on social news sites is not topically organized and usually appears quite haphazard because articles are listed by popularity without topical coherence.

In summary, the advent of RSS feeds has provided the basis for many groups to explore new ways to deliver news. Some technology elements such as key-word matching, tf-idf matching, and clustering have been used in many approaches. Kiffets explored new territory in its automated assistance to curation, its organization of articles in deep folksonomies, and in its robust topic models that combine symbolic and statistical methods.

Lessons Learned

Kiffets was inspired by "scent index" research (Chi et al. 2007) for searching the contents of books. That research returned book pages as search results organized by categories from the back-of-the-book index. For example, a search query like "Ben Bederson" in an HCI book returned results organized by topics corresponding to Bederson's research projects and institutional affiliations. We were inspired by how the system employed the organization of an index to help a user to tune a query. The organization of pages by index topic created a sense of conversation, informing the user about expert options for seeking more information. We thought it would be exciting to extrapolate the approach to the web.

Research on the project was internally funded at the Palo Alto Research Center (PARC). Our audacious goal was to develop a commercially viable product in a year. Our runway was longer than that. Kiffets was designed, implemented, and deployed by two people over two and a half years. Other project members worked on evaluation, channel development, user experience, release testing, and business development. About a dozen patent applications were filed on the technology.

In the course of this work we learned lessons about both business and technology. On the business side, we pursued two approaches for commercialization. One was to create an advertising-supported personalized news service. A second was to create a back-end platform as a service for news organizations. Although we had serious negotiations with two of the top five news organizations in the United States for several months, in the end PARC did not close a deal. The particulars of that are beyond the scope of this article, but PARC has since adapted its strategies to address patent indemnification and service levels. On the approach of creating a stand-alone business, it is worth noting that the news organizations that have grown and succeeded over the same time period either had an entertainment focus or a focused brand for a specific kind of news. (We did not even have pictures on our news pages!) There is little question that the news industry is being disrupted and that news is currently abundant. However, although personalized news seems appealing to busy people, at the time of this writing we know of no personalized news services that have become large and sustainable businesses. In 2008 when we sought investor funding, investors were pulling back from this area. In this approach it is important to develop a viral business where the user base grows very quickly. During our two-year runway, we did not find a key for that in our work.

Alpha and Beta Testing

A major lesson for us was the importance of adopting a lean startup approach (Ries 2011). The key to this approach is in trying ideas quickly with customers, rapidly pivoting to new alternatives. We came to understand the rhythm of a customer-development cycle better as the project proceeded. In the following we describe the interweaving of development and evaluation that we carried out.

In April 2008 we created a two-person team to explore the application of this technology. In October 2008 we opened our first prototype to alpha testing by a dozen users. We had a Flash-based wizard for curators and a simple web interface for readers. Each of the curators built a sample index and used it for a few weeks. Four more people joined the team, focusing on release testing, user interviews, design issues, and fund raising.

A major challenge was in making curation easy and reliable given the limited time that curators have available. Although the system was able to collect and deliver articles when we built the channels, curation was too difficult for our first curators. They had difficulty finding RSS feeds and did not completely grasp the requirements of curating. Extensive interviews and observation sessions helped us to identify key usability issues.

Over time we came to understand reader and curator habits more deeply. For example, when we recognized that curators wanted to tune their topic models while they were reading their daily news, we eliminated the separate wizard interface for curators and incorporated curation controls into the news-reading interface. This required changing how the machine-learning algorithms were triggered. We shifted from having curators request topic training explicitly to triggering it implicitly by marking articles while they were reading.

During our trial period, about one registered user in three created a channel and about one in four of those created a complex channel. We do not know ultimately what fraction of users might be expected to become curators. Many users create very simple channels without setting up topics.

The development and deployment of AI technology was driven by the goal of meeting user needs. For example, when article classification began failing excessively due to noisy articles from the web, we combined our symbolic query-based approach with the statistical similarity-based approach. For another example, multilevel topic presentation was developed to improve user foraging on big channels. Other additions such as the source recommender were prioritized when they became the biggest obstacles to user satisfaction.

As we learned about lean startup practices, we became obsessed with meeting customer needs. We followed a ruthless development process that divided user engagement into four stages: trying the system, understanding it, being delighted by it, and inviting friends. We divided possible system improvements into a track for curators and a track for readers. We built performance metrics into the system and monitored user engagement with Google Analytics. In 2010 we measured 1300 unique visitors per month with about 8900 page views. The average user stayed for about eight minutes, which was high. Every month we interviewed some users. Every morning we met for an hour to prioritize and coordinate the day's development activities.

Performance Tuning

In early 2009 we began beta testing with about 60 users. The system load from users and articles increased to a level where we had to prioritize scaling and robustness issues. The first version of the system began to stagger when we reached 100,000 articles. A recurring theme was to reduce the I/O in processes, since that dominated running time in most computations. For example, an early version of the classifier would read in arrays representing articles and use our optimized matching code to detect topic matches. Recognizing that most of the time was going into I/O, we switched to using Solr to compute word indexes for articles when they were first collected. The classifier could then match articles to Boolean queries without rereading their contents.

We switched to a NoSQL database for article contents to support the millions of articles that the system now held. We periodically reworked slow queries and found more ways to precompute results on the back end in order to reduce database delays for users.

In June of 2010, we started an open beta process by which any user could come to the system and try it without being previously invited. By August, the system had more than 600 users and was able to run for several months without crashing. Videos of Kiffets in operation are available on YouTube.

Concluding Remarks

Kiffets follows the knowledge-is-power logic of earlier AI systems in that it depends on the expertise of its curators. The system acquires curator expertise using a machine-learning approach where curators select sources that they trust (sometimes guided by source recommendations from the system), organize topics in a topic tree folksonomy according to how they make sense of the subject matter, and train the topic models with example articles. It creates fresh, organized channels of information for readers every day.

The news business is undergoing rapid change and economic challenges. It is changing on several fronts, including how news is delivered (mobile devices), how it is being reported (citizen journalists and content farms), and how it is paid for (subscription services, pay walls, and advertising). This project opened a further dimension of change: how abundant news can be socially curated.

Acknowledgments

This research was sponsored by the Palo Alto Research Center. We thank Barbara Stefik, Ryan Viglizzo, Sanjay Mittal, and Priti Mittal for their contributions to design, release testing, user studies, and great channels. Thank you to Lawrence Lee and Markus Fromherz for their guidance and support. Thank you to Jim Pitkow for advice on lean management and to our Kiffets users for much feedback.

References

Anderson, C. 2006. The Long Tail: Why the Future of Business Is Selling Less of More. New York: Hyperion.

Arthur, C. 2010. Eric Schmidt Talks about Threats to Google, Paywalls, and the Future. Guardian. London: Guardian News and Media Limited (online at www.guardian.co.uk/media/2010/jul/02/activate-eric-schmidt-google).

Burke, R.; Felfernig, A.; and Göker, M. H. 2011. Recommender Systems: An Overview. AI Magazine 32(3): 13–18.

Chi, E. H.; Hong, L.; Heiser, J.; Card, S. K.; and Gumbrecht, M. 2007. ScentIndex and ScentHighlights: Productive Reading Techniques for Conceptually Reorganizing Subject Indexes and Highlighting Passages. Information Visualization 6(1): 32–47.

Dong, L.; Smith, R. G.; and Buchanan, B. G. 2011. NewsFinder: Automating an Artificial Intelligence News Service. AI Magazine 33(2).

Jones, K. S., and Willett, P. 1997. Readings in Information Retrieval. San Francisco: Morgan Kaufmann Publishers.

Katz, J. 2008. What's Holding RSS Back: Consumers Still Don't Understand This Really Simple Technology. Cambridge, MA: Forrester Research (online at www.forrester.com/rb/Research/whats_holding_rss_back/q/id/47150/t/2).

Pirolli, P. 2007. Information Foraging Theory: Adaptive Interaction with Information. Oxford: Oxford University Press.

Ries, E. 2011. The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. New York: Crown Publishing Group.

Salton, G., and Buckley, C. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5): 513–523.

Simon, H. 1971. Designing Organizations for an Information-Rich World. In Communications and the Public Interest, ed. Martin Greenberger, 37–72. Baltimore, MD: The Johns Hopkins Press.

Stefik, M. 2011. We Digital Sensemakers. In Switching Codes: Thinking Through Digital Technology in the Humanities and Arts, ed. T. Bartscherer and R. Coover, 38–60. Chicago: The University of Chicago Press.

Stefik, M. 1995. Introduction to Knowledge Systems. San Francisco: Morgan Kaufmann Publishers.

Mark Stefik is a research fellow at PARC. He is a fellow in the Association for the Advancement of Artificial Intelligence and also in the American Association for the Advancement of Science. He is currently engaged in two projects involving smart infrastructure as digital nervous systems: urban parking services for smart cities and digital nurse assistants for smart health care. Stefik has published five books, most recently Breakthrough in 2006 by the MIT Press. Stefik's website is www.markstefik.com.

Lance Good is a software engineer at Google, where he works on Google shopping. He worked at PARC on Kiffets and several sensemaking systems. His previous work focused on using zoomable user interfaces for presentations, sensemaking, and television navigation. Good's website is goodle.org.