selection of data from the mass of information


Pergamon Library Acquisitions: Practice & Theory, Vol. 21, No. 3, pp. 303-317, 1997

Copyright © 1997 Elsevier Science Ltd Printed in the USA. All rights reserved

0364-6408/97 $17.00 + .00

PII S0364-6408(97)00059-8

THE FIRST ELSEVIER ELECTRONIC SUBSCRIPTIONS CONFERENCE OCTOBER 17-18, 1996

HEEMSKERK, THE NETHERLANDS

SELECTION OF DATA FROM THE MASS OF INFORMATION

JOHN BRENNAN

Chief Examiner

European Patent Office

DG1, Patentlaan 2

2288 EE Rijswijk (ZH)

The Netherlands

Abstract: In carrying out prior art searching, the European Patent Office (EPO) is obliged to take into account all information available in the public domain up to the date of filing of a patent application. In order to do this effectively, it is necessary to have rapid, selective, and comprehensive access to all information relevant to a topic being searched. In recent years, the method of searching in the EPO has been moving from a predominantly paper-based approach to one relying on handling information in electronic format. For such a change to be possible, it is necessary to have electronic access to all relevant information as either primary or secondary data. The means of searching electronic data is the Epoque system. The progress toward development of a uniform approach to the handling of patent and non-patent data in both character-coded and facsimile format is described. © 1997 Elsevier Science Ltd

Keywords: Search, Full text, Database, Facsimile


INTRODUCTION

The objective of this paper will be to illustrate how the European Patent Office (EPO) uses Elsevier electronic subscriptions. However, in order to place what we are doing in perspective, it will first be necessary to summarize what our search examiners are required to do, how this was formerly done predominantly using paper, and how this is being superseded using electronic data.

THE LEGAL FRAMEWORK FOR SEARCHING IN THE EPO

The European Patent Convention (EPC)

This convention defines the tasks for which the EPO is responsible. The EPO carries out a prior art search on every European patent application in order to establish whether that application satisfies the criteria of novelty and inventive step required by the EPC. In order to be novel (new), the EPC states:

"An invention shall be considered to be new if it does not form part of the state of the art." [Art. 54(1)]

For inventive step, the criteria are defined similarly:

"An invention shall be considered as involving an inventive step if, having regard to the state of the art, it is not obvious to a person skilled in the art." [Art. 56]

The state of the art is defined:

"The state of the art shall be held to comprise everything made available to the public by means of a written or oral description, by use, or in any other way, before the date of filing of the European patent application." [Art. 54(2)]

The effect of this is that any prior public disclosure of a claimed invention or an obvious equivalent, at any time, or in any place, can be sufficient to render that invention unpatentable. Such a disclosure may be written, oral, or available only on the Internet, and is relevant regardless of whether anyone has ever consulted the disclosure.

THE TRADITIONAL WORK OF THE SEARCH EXAMINER

Ordering and Accessing Technical Information

In order to provide search reports that will allow examination within the requirements of the EPC, the EPO must strive to make available for searching all technical disclosures regardless of age or origin. At the same time, the disclosures must be ordered in such a way as to allow an examiner to reliably access all information relevant to an application being searched in a relatively short period of time. Including searching, detailed study of the most relevant documents, and preparing a search report, this will be on average 1 to 2 days per application. The scale of the task of arranging the technical information for effective searching can be illustrated by the fact that, taking into account only patent publications, documents relating to approximately 400,000 distinct inventions are published every year (with much further duplication by "family members"). Non-patent literature (NPL) clearly increases this annual volume of information much further. Access to this information has traditionally been provided by classification of documents according to the International Patent Classification (IPC) and the EPO's further subdivided version of this, ECLA (European CLAssification).

The IPC and ECLA are hierarchical systems for assigning a classification(s) to technical disclosures. For example, the class H04 deals with electronic communication, its subclass H04N relates specifically to pictorial communication (e.g., television), and this is then divided into groups such as H04N1 for facsimile document reproduction and H04N9 for details of color television systems. These groups are then divided again into subgroups; for example, there are 120 subgroups in H04N9, within which 10 subgroups from H04N9/16 to H04N9/29 relate to details of color cathode ray tubes.

• H04N9/00: Details of colour television systems
• H04N9/12: Picture reproducers (electro-magneto- or acousto-optical modulation or deflection of light beams G02F)
• H04N9/16: Using cathode ray tubes (9/11 takes precedence; cathode ray tubes H01J 31/00)
• H04N9/22: Using the same beam for more than one primary color information (9/27 takes precedence)
• H04N9/24: Using means, integral with, or external to, the tube, for producing signal indicating instantaneous beam position.
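The hierarchy described above can be read directly off a classification symbol. The following is a minimal sketch, assuming plain symbols of the form shown in the list (for example H04N9/24); the names `IpcSymbol` and `parse_symbol` are illustrative only, and ECLA symbols carrying further subdivisions are not handled.

```python
import re
from typing import NamedTuple


class IpcSymbol(NamedTuple):
    """The four hierarchical levels named in the text (illustrative type)."""
    ipc_class: str  # e.g. "H04"  (electronic communication)
    subclass: str   # e.g. "H04N" (pictorial communication)
    group: str      # e.g. "H04N9"
    subgroup: str   # e.g. "H04N9/24"


# Assumed symbol layout: section letter A-H, two digits, subclass letter,
# group number, "/", subgroup number.
_SYMBOL = re.compile(r"^([A-H]\d{2})([A-Z])\s*(\d+)/(\d+)$")


def parse_symbol(symbol: str) -> IpcSymbol:
    """Split a subgroup-level symbol into its hierarchical levels."""
    match = _SYMBOL.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a subgroup-level symbol: {symbol!r}")
    cls, sub, group, subgroup = match.groups()
    return IpcSymbol(
        ipc_class=cls,
        subclass=cls + sub,
        group=f"{cls}{sub}{group}",
        subgroup=f"{cls}{sub}{group}/{subgroup}",
    )


if __name__ == "__main__":
    print(parse_symbol("H04N9/24"))
```

Each level is a prefix of the next, so the paper files for a whole group can be found from any of its subgroup symbols.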

In concrete terms, each of the subgroups exists as a collection of hardcopy documents, each of whose content falls (at least partly) within the scope of the corresponding definition. The subgroup H04N9/24 comprises 700 in-file documents. For many years, electronic records have also been kept of the bibliographic data relating to patents (patent and priority numbers, etc.) and the classifications that have been applied to the documents.

The documents present in the "search groups" arrive there as a result of their classification by EPO examiners and, wherever appropriate, they are multiply classified. Copies of these documents are then placed in ordered files corresponding to each of the assigned subclasses. The documents comprise patent, journal, abstract, review, and commercial literature. The system is dynamic, and further divisions are introduced with the purpose of trying to keep the number of paper documents that are to be consulted during a search more or less constant, as the number of publications in a broad area of technology increases. This was important in trying to maintain productivity as the total documentation grew. In general terms, as well as limiting the size of the paper document set, the classification provides a concept coding to sets of records, which can be directly retrieved electronically, independent of textual search terms.

An examiner who wished to carry out a search on a patent application would decide which ECLA subgroup(s) was relevant to that application and retrieve the files containing the relevant hard copy from the ordered documentation collection. Each document present in the file was then compared with the application being searched in order to assess whether it discloses the same or significantly related subject matter. Effective searching using this approach requires the skill of being able to quickly assess the importance of each document, while searching in files often containing several hundred documents.

MOVING TOWARD ELECTRONIC SEARCHING

The Mid 1980s

From 1977 when searching began in the EPO, it was carried out as described above and in generally the same manner as had been practiced in most other patent offices for the previous century. An application was studied, appropriate classes were designated, and the paper "search groups" corresponding to these classes were located. The documents present in the "search groups" (inherited from the former IIB and the Dutch Patent Office) were examined as were, for example, the paper texts of Chemical Abstracts, or Derwent Agdoc and Plasdoc where appropriate. This part of the search was directed toward both novelty and inventive step, and any documents relevant to either of these aspects of the application were cited in the search report. As external online sources became available, these were used to supplement the paper search. For example, use of online Chemical Abstracts bibliographic searching began in 1978, Agdoc online (part of WPI) has been used in the pesticides and herbicides area (IPC A01N) since 1979, and chemical structure searching in DARC commenced shortly after it became available in 1981. As a result, by the mid 1980s examiners in the field of pure organic chemistry (claims for new low molecular weight compounds) had abandoned the systematic classification of photocopied journal articles and abstracts, on the basis that most significant non-patent information could be retrieved online. In electricity-physics, a small amount of use was made of the external WPI and INSPEC databases beginning in the mid 1980s.

However, until the mid 1980s, the online element of most searches (even in chemistry) was secondary to that of the paper, and in many cases was totally absent. This limited use of online searching was, to some degree, a consequence of the fact that most of the online interrogation was carried out by "operators" or "intermediaries" who were provided with search terms or substructures by the examiners. Clearly, while online searching retained the mystique of a highly specialized activity that could only be carried out by a select group of experts, it could never play more than a secondary role in the overall searching process. In summary, therefore, in most cases the search with respect to both novelty and inventive step began with and was principally based upon direct inspection of the classified paper documentation, and external online searching was used as a tool for filling in any subsequent "gaps."

Developments in the Late 1980s

Around 1986 the move toward external online searching being carried out by the end users commenced, and from 1985 to 1990 the percentage of examiners in DG1 executing independent online searches increased from approximately 20% to approximately 95% (Figure 1). It is reasonable to assume that the figure would now be close to 100%.

At the same time as most examiners were becoming independent searchers of external online systems, all examiners were being provided with PCs connected by a token ring local area network to the mainframe computer. This was to allow examiners to be able to take advantage of a number of important developments that were taking place with the internal computer documentation systems.

EPOQUE

Epoque is an internal host system available to all examiners whose development commenced in the late 1980s. It now comprises over 60 databases loaded on our mainframe, including Derwent WPI, Patent Abstracts of Japan (PAJ), INSPEC, IBM TDB, and patent full text, and it allows access to all major external hosts. As well as providing searchable databases with the possibilities for left-hand truncation, cluster searching, cross-file searching, and automatic uploading of queries, it allows the manipulation, downloading, and printing of results (including chemical structures). Furthermore, the Viewer function allows visualization of formulae present on the first pages of documents retrieved and all drawings as well as the corresponding text (including JP abstracts from PAJ). The development of Epoque has been described on a number of occasions [1,2]. The main screen of Epoque is shown in Figure 2.

[Figure 1. Percentage of examiners executing independent external online searches. Chart title: Changes in Types of Searchers; vertical axis 0 to 100; horizontal axis 1985 to 1992; series: Independent, Intermediary, Operator.]

The potential for manipulating and interrelating information offered by Epoque clearly provided the opportunity for examiners to explore alternatives to traditional paper-led searching in many areas. Therefore, much of what follows is a description of techniques that are recently introduced methodologies in some fields or experimental approaches in others.

EPOQUE AND THE MIGRATION TO ELECTRONIC SEARCHING

At a simple level, instead of an examiner going initially to the appropriate paper search group in order to see the application being searched in the context of the relevant prior art, the corresponding approach may be taken online using the EPODOC database in combination with other internal databases. EPODOC comprises comprehensive bibliographic data on all of the documents present in the classified collections and their family members. Included are keyword searchable English language abstracts and titles of EP, WO, US, or GB documents in the basic index, and searchable ECLA and IPC indexes. A typical entry is shown below:

• 1/227 - (C) EPODOC/EPO
• PN - US5359266 A 941025
• AP - US920978934 921120
• PR - US920978934 921120; WO94US02785 940315
• EC - H01J29/34 ; H01J31/20 ; **H04N9/24**
• CT - US3939486 A; US4163250 A; US4306248 A
• DT - *;TF
• PA - NUSBAUM HOWARD (US); ZACCARDO JR COSMO F (US)
• TI - System for generating triggering pulses for use in beam indexing type color cathode ray tubes
• AB - A single gun color cathode ray tube (CRT) having a screen consisting of alternating red, blue and green vertical phosphor strips in groups of three, with strips of conductive or photovoltaic material overlaying each phosphor strip, or placed between adjacent strips. The same color conductive strips are connected in parallel in three arrays, and connected to three individual trigger buses. The conductive lines are connected to a positive bias voltage such that when the electron beam within the CRT strikes a conductive strip, a pulse is generated and fed to the corresponding trigger bus. As the single electron beam scans the screen, trigger pulses are generated in the conductive strips and are fed to a trigger pulse amplifier and then to a multiplexer and used as control signals. The multiplexer in turn gates the color video signals from a remote source, such as a workstation or television set, such that the corresponding video signal for that color is modulating the beam as the electron beam strikes a phosphor strip corresponding to that color.

[Figure 2. The main screen of Epoque. The screen shows four search statements, /ec h04n9/24 (1.085 results), gun (14.037 results), us/pn (4.231.906 results), and "and 1,2,3" (7 results), followed by the display of the first of the 7 records.]

Using electronic searching, a subclass corresponding to the subject matter of an application can be used as a search term in EPODOC in order to generate all of the patent numbers relating to that subject. This corresponds with determining the file of paper documents in traditional searching. An extract of some fields from such a search in H04N9/24 is shown below:

• 6/227 - (C) EPODOC / EPO
• PN - US5138441 A 920811
• PR - JP890176491 890707
• EC - **H04N9/24**
• 7/227 - (C) EPODOC / EPO
• PN - US5097324 A 920317
• PR - JP890171725 890703
• EC - **H04N9/24**
• 8/227 - (C) EPODOC / EPO
• PN - XP000098050 A 000000
• EC - G09G1/04 ; G09G1/28 ; **H04N9/24**
• 9/227 - (C) EPODOC / EPO
• PN - XP000094664 A 000000
• EC - **H04N9/24**
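Using a classification symbol as a search term amounts to a lookup in an index from symbol to document numbers. The sketch below illustrates that idea only. It is not a description of how EPODOC is implemented, the record layout is reduced to the PN and EC fields, and the record numbered XX0000001 is an invented filler.

```python
from collections import defaultdict

# Toy records modelled on the extract above (PN and EC fields only).
RECORDS = [
    {"PN": "US5138441", "EC": ["H04N9/24"]},
    {"PN": "US5097324", "EC": ["H04N9/24"]},
    {"PN": "XP000098050", "EC": ["G09G1/04", "G09G1/28", "H04N9/24"]},
    {"PN": "XP000094664", "EC": ["H04N9/24"]},
    {"PN": "XX0000001", "EC": ["H04N9/12"]},  # invented filler record
]


def build_ec_index(records):
    """Map every classification symbol to the set of document numbers carrying it."""
    index = defaultdict(set)
    for record in records:
        for symbol in record["EC"]:
            index[symbol].add(record["PN"])
    return index


def search_ec(index, symbol):
    """Return the sorted document numbers classified under `symbol`."""
    return sorted(index.get(symbol, set()))


if __name__ == "__main__":
    ec_index = build_ec_index(RECORDS)
    print(search_ec(ec_index, "H04N9/24"))
```

Because a multiply classified document is entered under each of its symbols, it is retrieved from any of them, just as copies of the paper document sit in several files.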

Abstract Searching

At the simplest level, it is possible to electronically search the EPODOC abstracts of a classified document set using suitable text search terms, in order to limit the size and relevance of the answer set (e.g., H04N9/24/ec and multiplex+). However, EPODOC does not have a complete set of abstracts, and those present vary in quality because in most cases they are those that were written by the applicant. Other patent databases available in-house include Derwent WPI and Patent Abstracts of Japan (PAJ), which include comprehensive, professionally written abstracts and additional technical coding. It is possible to use the Epoque software to extract and memorize the priorities or patent numbers of all of the patent documents present with a specific classification in EPODOC. After going to a cluster of abstract databases (e.g., WPI and PAJ), the list of patent or priority numbers can be used to generate a set of records (with abstracts) corresponding to the patent documents in the selected classification. This set can then be searched in combination with appropriate text or coding terms relating to the application.
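The two-step procedure just described, memorizing a number list from one database and then searching the matching records in another, can be sketched as follows. This is an illustration under stated assumptions: the trailing "+" is taken to mean right-hand truncation, the abstract store is a plain dictionary, and the XX numbers and their texts are invented.

```python
import re

# Numbers memorized from a classification search (the XX entries are invented).
MEMORIZED = ["US5359266", "XX0000002", "XX0000003"]

# Toy abstract database keyed by patent number; XX0000003 has no record here.
ABSTRACTS = {
    "US5359266": "Trigger pulses are fed to an amplifier and then to a multiplexer.",
    "XX0000002": "A deflection yoke for a cathode ray tube.",
}


def matches_term(term, text):
    """Word match; a trailing '+' is treated as right-hand truncation (assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if term.endswith("+"):
        stem = term[:-1].lower()
        return any(word.startswith(stem) for word in words)
    return term.lower() in words


def search_within(numbers, abstracts, term):
    """Restrict the search to the memorized numbers, then apply the text term."""
    return [
        number
        for number in numbers
        if number in abstracts and matches_term(term, abstracts[number])
    ]


if __name__ == "__main__":
    print(search_within(MEMORIZED, ABSTRACTS, "multiplex+"))
```

Numbers without a record in the abstract database simply drop out of the answer set, which mirrors the incomplete coverage noted above.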

Full-Text Searching

As an alternative to abstract searching, a list of patent numbers can be used to generate the set of corresponding full-text documents, insofar as the records are available. These might be available in either facsimile or character-coded format or both. For facsimile images, the records or documents can be retrieved using a limited set of searchable parameters, such as patent number or other document identifier. The images of pages of the selected document are then displayed exactly as they appear in the original, and the database will have been compiled by photographing the original paper documents. The text of the documents themselves is not electronically searchable.

Facsimile Full Text

In 1986, the EPO, the United States Patent and Trademark Office (USPTO), and the Japanese Patent Office (JPO) agreed to capture the essential parts of their regional patent documentation published from 1920 onward in facsimile form and to share this data. This would mean that within a few years, examiners in each office could have access to the same documentation via an electronic facsimile display as they already had in paper format. This was instrumental in providing the impetus for exploring the possibilities of carrying out "paperless searches." Before committing the significant resources necessary to create a comprehensive centralized system oriented to the provision of search facilities based on searching classified facsimile document sets, it was decided to test the process on certain fields of heterocyclic chemistry (as a model for pure chemistry) and certain fields of photographic chemistry (as a model for applied chemistry). The images corresponding to the documents present in the appropriate classification units were copied onto magneto-optical disks, and workstations were assembled comprising disk drives and state-of-the-art high-resolution 24-inch screens with software capable of a page-to-page flip rate of <0.5 seconds.

Testing began in 1992. Searching with this system consisted of reading the patent application and assigning the appropriate ECLA classes. Instead of collecting the file containing the relevant documents and bringing it to a desk, the examiner collected the disk containing the images of the relevant documents and brought it to the workstation. However, when it came to viewing the documents, the interaction of the examiners with the electronic medium was in no way analogous to their experience using paper. Essentially, despite the high quality of the workstation, the examiners found it difficult to efficiently peruse more than 20 to 30 documents, with the objective of carrying out an effective search. This was true despite the presence in these facsimile records of all of the layout stimuli and navigation prompts such as different font sizes, formulae, and tables.

The conclusions drawn by us regarding facsimile full text are:

1. For searching for technical information, the interaction between paper and eye and hand and paper is fundamentally different from the interaction between electronic display and eye and hand and keyboard (or mouse).

2. For searching with the objective of efficiently and effectively processing the maximum amount of information in the minimum possible time, facsimile full text is not a good starting point.

3. Searching in facsimile is more acceptable when the number of documents to be viewed has been reduced to a relatively small number (10 to 30) of highly relevant documents using an external, electronic filtering procedure.

4. The above points apply to searching as described under number 2. Conclusions cannot necessarily be drawn with regard to the electronic equivalent of a researcher scanning a scientific journal to see what is of potential interest in a technical area.

These observations are in keeping with a number of published reports of on-screen searching [3,4], although these mainly appear to have been carried out using character-coded display for reading as opposed to facsimile. It is also noteworthy that, in a number of studies, the fatigue associated with reading from a screen was reduced by use of higher resolution display. In our case, there was a loss of concentration in reading large amounts of facsimile even with very high-resolution screens. Instead of detailed reading of all of a classified document set, the testers were initially searching in databases that allowed the query to be defined and searched in coded form, that is, in character-coded indexed text, structural codes, classification symbols, or substructure searching in connection tables. Having established the relevance of the answer set using such a filter, they were then prepared to use the facsimile display in combination with the previous results to select the most relevant documents. Therefore, it was decided to make the facsimile collection available only as a numerical collection (BNS [BACON Numerical Service]) to be used for viewing limited sets of documents, previously defined by searching in coded media. This also had the cost advantage that delivery of the images to the workstations from tape robots in near-line mode would be acceptable.

Tools for the primary searching approach of electronic interrogation of character-coded text were then developed with respect to both abstract and full-text data.


Searching in Character-Coded Full Text

For character-coded, full-text documents, searchable databases can be created in which every word of the text of the records is, in principle, searchable. In practice, selected "stop words" are not indexed. These databases are compiled by taking text produced in machine-readable form (e.g., ASCII) and constructing an alphabetical index of all of the words against the accession numbers of all of the records present in the database. Since 1970, the USPTO has generated the text of U.S. patents in ASCII format and has retained the tapes containing this data, after publication of the fixed runs of paper documents. Similarly, the EPO has had available the ASCII text of all EP patent applications from 1987, of which 64.5% are in English, 26% in German, and 9.5% in French.
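The index construction described above, every word listed alphabetically against the accession numbers of the records containing it, with stop words left out, can be sketched in a few lines. The stop-word list and the sample records here are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed stop-word list; a real system would use its own selection.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})

# Toy records: accession number -> machine-readable text.
SAMPLE_TEXTS = {
    1: "The beam strikes a phosphor strip",
    2: "A strip of conductive material",
}


def build_word_index(texts):
    """Return an alphabetically ordered mapping word -> set of accession numbers."""
    index = defaultdict(set)
    for accession, text in texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(accession)
    # Sorting the keys gives the alphabetical index described in the text.
    return dict(sorted(index.items()))


if __name__ == "__main__":
    for word, accessions in build_word_index(SAMPLE_TEXTS).items():
        print(word, sorted(accessions))
```

A query for a single word is then a dictionary lookup, and Boolean combinations become set operations on the accession-number sets.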

Older documents being available in character-coded form is generally a result of their having been prepared in this format to facilitate publication, rather than with any view to providing searchable databases. However, as computer storage and processing have become cheaper, this possible use has become feasible.

The first searchable, in-house, patent full-text database to be created by the EPO was a test system limited to U.S. data for patents in the photographic field. The response of examiners involved in the test was both positive and enthusiastic, and significant numbers of important documents were being retrieved that had not been found using the conventional abstract databases. As a result of this test, all of the U.S. data were used in the creation of the in-house TXTUS1 database. It contained records corresponding to the approximately 1.7 million U.S. patent publications from 1970 to 1994 and included separate indexed fields for description and claims in the basic index, as well as a patent number index. This required a total of 78 GB of storage. Subsequently, corresponding full-text databases comprising records for the English, French, and German language EP publications were created, as well as TXTUS2 for U.S. publications from 1995 on.

While both left- and right-hand truncation are available on most of the EPO's in-house abstract databases, so far only right-hand truncation is allowed with the English and French language full-text databases, because left-hand truncated indexes require relatively large amounts of storage. In contrast, both left- and right-hand truncation are available in the German language database, because much important information could otherwise be lost as a result of the widespread use of compound words in the German language. This comes at a cost: the right-hand truncated database indexes require just over 30% of the storage required for the text, whereas in the German database the index requires >110% of the storage needed for the text.
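One common way to support left-hand truncation, shown here purely as an illustration and not as a description of the Epoque indexes, is to keep a second sorted list of the terms spelled backwards, so that a search for a word ending becomes a prefix search. The class name `TruncationIndex` and the sample German terms are assumptions.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms in a sorted list that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms sharing a prefix are contiguous in sorted order
        matches.append(term)
    return matches


class TruncationIndex:
    """Forward term list for right-hand truncation, reversed list for left-hand."""

    def __init__(self, terms):
        self.forward = sorted(set(terms))
        # The reversed copy is the extra storage that left-hand truncation costs.
        self.backward = sorted(term[::-1] for term in self.forward)

    def right_truncated(self, stem):
        return prefix_matches(self.forward, stem)

    def left_truncated(self, ending):
        hits = prefix_matches(self.backward, ending[::-1])
        return sorted(term[::-1] for term in hits)


if __name__ == "__main__":
    index = TruncationIndex(["bild", "bildschirm", "farbbild", "fernsehbild", "strahl"])
    print(index.left_truncated("bild"))
    print(index.right_truncated("bild"))
```

With compound words, the left-truncated search is the one that finds "farbbild" and "fernsehbild" from "bild", which is why the facility matters more for German text.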

Three fundamental approaches can be used for searching in full text. The first, defining the document set by classification, is directly analogous to the approach to searching facsimile full text, except that the set of records generated is in character-coded form. In addition, it can be refined by the use of textual search terms and boolean or proximity operators, as opposed to visual inspection of each record. Because the ECLA class has been used as a filter to define the group of documents being searched, broader keywords can be used for limiting the answer list compared with a simple keyword search, and, thus, the number of false drops can be reduced.

The second approach, defining the document set using abstracts, uses keywords or controlled terms in abstracts to reliably bring together all records relating to a given concept. However, where the detailed aspect being searched is not normally considered to be of high significance, the 200 or so words in an abstract are often inadequate to include all secondary details. With the availability of full text, it is possible to use the abstract to define the set of documents comprising the primary feature and the corresponding full-text records to search for secondary details. The final approach, defining the document set by text terms, attempts to address the problem that sometimes neither classification nor abstracts can wholly encompass the concept being searched. In these cases, searching using text terms in the complete full-text database may be included as a valid search strategy, although a relatively high proportion of false drops can normally be expected. In the case of patent full-text databases, selectively searching certain terms only in the claims field allows a significant degree of refinement.
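The refinement offered by proximity operators and by restricting a term to the claims field can be illustrated with a small sketch. The field names, the word-distance definition of proximity, and the two XX records are assumptions made for the example, not the operators of any particular host system.

```python
import re

# Toy full-text records with separate description and claims fields (invented).
FULL_TEXT = {
    "XX0000011": {
        "description": "The trigger bus carries a pulse",
        "claims": "A screen with phosphor strips",
    },
    "XX0000012": {
        "description": "Conductive strips overlay the phosphor",
        "claims": "A trigger pulse amplifier coupled to the bus",
    },
}


def near(text, first, second, distance):
    """True if `first` and `second` occur within `distance` words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    return any(abs(i - j) <= distance for i in first_positions for j in second_positions)


def search_field(records, field, first, second, distance):
    """Apply the proximity test to one named field of every record."""
    return [
        number
        for number, fields in records.items()
        if near(fields.get(field, ""), first, second, distance)
    ]


if __name__ == "__main__":
    print(search_field(FULL_TEXT, "claims", "trigger", "pulse", 1))
```

Tightening the distance or limiting the field removes records where the two words merely co-occur somewhere in a long description, which is the kind of false drop the text refers to.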

DETAILED COMPREHENSION OF THE WHOLE CONTENT OF A DOCUMENT

In a number of fields where drawings were important, it was also felt that text searching was incapable of reducing the retrieval to 30 documents without the loss of a significant number of relevant documents. In a few cases, the content of the ASCII full-text display of a document may be sufficient to make a clear positive decision as to its relevance. However, in most fields in which patenting is carried out, additional non-textual information such as chemical formulae, electrical circuits, or mechanical drawings is essential in fully interpreting the text. Viewing the documents in BNS is a possibility. However, BNS works in near-line mode, which means it may take 10-20 minutes for a set of 30 documents to become available on the workstation. Furthermore, where the description of the invention is long, considerable time can be wasted finding the page(s) corresponding to the drawings of interest. Where the electronic searching has reliably reduced the number of documents to <30, viewing these in full facsimile format in BNS may suffice to make a final selection. In other cases, an intermediate stage between text and facsimile was provided. This was known as the Epoque Viewer. It was capable of providing on a split screen the searchable text of a document on one half and the drawings associated with the document on the other. Using this, the drawings could be inspected while displaying the parts of the text relating to selected highlighted keywords. In this mode a much higher record handling capacity (50-100, or more) in real time is possible. The most relevant records seen in the Epoque Viewer can be tagged for final inspection of the facsimile format in BNS.

NON-PATENT LITERATURE

The discussion so far has concerned itself with patent literature, simply because it was the availability of electronic patent data that primarily influenced the developments that took place at the EPO. This was due, in part, to the fact that patent offices knew best what they themselves and other patent offices are doing, and in part due to the fact that trilateral agreements between the world's main publishers of patents, the EPO, JPO, and USPTO, allowed decisions to be made that encompassed access to the vast bulk of the world's patent information. However, NPL is as relevant to patenting as patent literature. The state of the art referred to earlier makes no discrimination between disclosures in the patent and non-patent literature. New NPL is read and classified in the same manner as patent literature and is present in the paper search documentation. It is worth noting that a preliminary test of the Elsevier CAPCAS electronic abstract data had not been developed, because of the view expressed by many of the testers that searching a limited range of abstracts served no purpose when these were comprised in databases such as CAS and INSPEC, which were routinely consulted.

Treatment of NPL

As the approaches to electronic patent literature searching were becoming established, the requests from examiners to have a uniform approach in searching patent and non-patent literature increased. This would mean that the EPO would have to be able to acquire or generate electronic NPL in abstract and full-text, character-coded, and facsimile formats, as well as allowing classification-based searching. This also meant that a scheme of identification and interrelation of NPL documents had to be developed, which would allow the grouping of sets of records from a common origin. The method currently being used is that of an XP number. In the same way that patents from every country have a two letter country code followed by a sequential number, each non-patent document would have the "country code" XP followed by a sequential number.

While this has certain disadvantages, such as the inability to infer anything about the publisher or source of the document, it has the advantages that it imposes a uniform format and allows records to be assigned numbers automatically upon arrival from predesignated blocks of free numbers. It also means that a central electronic register, the NPL database, can be created, which contains records comprising the XP numbers as the accession numbers. Each record contains standardized bibliographic information, including title, source, publisher, and publication date, all searchable in any of these fields (e.g., year=1996 AND publisher=elsevier). More significantly, it allows the grouping of records of the same document in different formats; for example, the suffix A might be applied to a facsimile record and B to a document in ASCII, but all records relating to the same original document would have the same basic XP number. Finally, the XP numbers are included in the Patent Number field of EPODOC. This means that when searching for all documents assigned a given classification unit, a list of patent and XP numbers would result, which could be used to generate the character-coded full text in the appropriate databases.
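The XP numbering scheme can be illustrated with a minimal sketch. Only the XP prefix, the example suffixes A (facsimile) and B (ASCII), and the query year=1996 AND publisher=elsevier come from the text; the class and function names, the six-digit number width, the block boundaries, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NplRecord:
    xp_number: str  # basic XP number shared by all formats of one document
    suffix: str     # "A" = facsimile, "B" = ASCII (example suffixes from the text)
    title: str
    publisher: str
    year: int


def xp_allocator(block_start, block_end):
    """Yield XP numbers sequentially from a predesignated block of free numbers.

    The six-digit zero padding is an assumption, not the EPO's actual format.
    """
    for n in range(block_start, block_end + 1):
        yield f"XP{n:06d}"


def search(records, **criteria):
    """AND together exact field matches; string fields compare case-insensitively."""
    def matches(rec):
        for field, wanted in criteria.items():
            value = getattr(rec, field)
            if isinstance(value, str):
                if value.lower() != str(wanted).lower():
                    return False
            elif value != wanted:
                return False
        return True
    return [rec for rec in records if matches(rec)]


def group_by_document(records):
    """Group records of the same original document by their basic XP number."""
    groups = defaultdict(dict)
    for rec in records:
        groups[rec.xp_number][rec.suffix] = rec
    return dict(groups)


# Hypothetical sample data: one document held in two formats, one in a single format.
alloc = xp_allocator(1, 999)
first, second = next(alloc), next(alloc)
records = [
    NplRecord(first, "A", "Sample article one", "Elsevier", 1996),
    NplRecord(first, "B", "Sample article one", "Elsevier", 1996),
    NplRecord(second, "B", "Sample article two", "Other Press", 1995),
]

hits = search(records, year=1996, publisher="elsevier")
groups = group_by_document(hits)
```

Because both format records carry the same basic number, a single lookup on that number returns the facsimile and the ASCII version together, which is the grouping property the text describes.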

Selection of Journals to be Included in the NPL Project

The fact that Elsevier was making available full-text electronic data relating to publications that it controlled made this an ideal starting place for the extension of full-text searching into the NPL area. Discussions with examiners and analysis of the range of Elsevier subscriptions received and the frequency of their citation in EPO search reports led to the initiation of our first multi-journal full-text project. The greatest interest in full-text searching in the NPL was shown by examiners working in the fields of electricity and physics (including computers) and a list of Elsevier journals for which we had a current active paper subscription was circulated among these examiners. Because of the price structure of the electronic subscriptions, it was felt that there was little point in including journals that had not been considered of sufficient interest to warrant a paper subscription, or as in certain cases had already had their subscriptions cancelled. The result of this preliminary exercise was the compilation of a list of 39 Elsevier journal titles, all related to electricity-physics, which were selected as the basis for the construction of a full-text NPL database. Furthermore, it was decided to include all available backfiles for these titles, as well as the continuing 1996 input. This database was to become known as XPESP (XP-Elsevier Science Publications).

Creation of the Searchable XPESP Database

The decision was made at an early point to assign a separate searchable database to each publisher of NPL with whom we reached an agreement. There were three advantages of this: (1) allowing the publisher to keep a clear overview of the use of their data; (2) reducing the technical problems of loading data in different formats and media; and (3) easing the database specifications by limiting the structure in each database to reflect the available fields. It was implicit in the proposals for this database that all of the database functionality available in Epoque and currently used with in-house databases such as EPODOC, INSPEC, and WPI would be available for XPESP. These include hit highlighting, Focus, cross-filing of terms between databases (e.g., from EPODOC to XPESP), and the ability to run prepared standard queries automatically in the database. In terms of field searchability, as far as available from the source data, all searchable fields currently present in the internal databases would be created. Development took place using the DATASET.TOC information from the Elsevier PreCAP production document and sample records provided on CD. Although the bibliographic headers are available in SGML, it was decided to postpone the problems of treating SGML until the complete data, including full text, could be processed in a single operation.

TABLE 1
Full Field and Index Format of the Database

Field   Index   Index type   Title
AB      BI      TEXT         Abstract
AN      AN      CODE         EPO document number
AU      AU      KW           Author
AUAW    AU      TEXT         Author affiliation
AUW     AU      TEXT         Author
DT      DT      TEXT         Document type
IRN     IRN     KW           International registry number
IW      BI      TEXT         Indexing words
LA      LA      KW           Language of article
NLIN    NLIN    ONUM         Number of lines at input
NR      SO      TEXT         Issue number
OT              TEXT         Other title
PAN     PAN     CODE         Publisher accession number
PD      PD      ODAT         Publication date
PG      SO      TEXT         Inclusive page numbers
PUB     SO      TEXT         Publication data
TFT     TFT     KW           Technical field type
TI      BI      TEXT         Title of article
TXT     BI      TEXT         Text
VOL     SO      TEXT         Volume

Most of the fields present in DATASET.TOC have been included in the database. Examples of those omitted include the purely administrative data included in the _t0 level, _t2 data relating to the whole issue of a journal (total number of pages in issue, page range on spine, location of table of contents, etc.), and the manifestation file data (_mf) at the _t3 document level. The reason for this is that, in a technical information database, none of these fields provides any data that would be of interest for search or for locating a document; although the issue page number range is absent, page and volume numbers for each document are searchable. The full field and index format of the database is given in Table 1.

Searchable fields available include ISSN, journal title, publication date, and, of course, the full text. In several databases we have an Author Affiliation field because examiners are often interested in seeing what a company has recently published in the area of the patent application being searched. We were unable to create this directly from the DATASET.TOC. However, we applied the data from the author correspondence address (_ca) field to the affiliation field in XPESP; generally, this compromise will be more effective for searching for XEROX than for SIEMENS because the field also contains the author name. The basic index consists of the title, abstract, keywords, and raw ASCII full text of the document. Limitation to a single field or fields within this is always possible. In keeping with XPESP being a full-text database, only right-hand truncation is currently available in the basic index. Other fields are searchable by applying the appropriate field indicator, and autoposting is applied in fields such as publication date in order to allow the search to be restricted to single years or ranges.
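As a rough illustration of the two search behaviors just described, the sketch below treats right-hand truncation as a prefix lookup over a sorted term list, and treats autoposting as indexing each publication date under its year and month as well as the full date. Both are assumptions about how such features are commonly built, not a description of the Epoque implementation; the function names, the YYYYMMDD date form, and the sample terms and record identifiers are hypothetical.

```python
from bisect import bisect_left


def expand_right_truncation(sorted_terms, stem):
    """Return every index term that begins with `stem` (right-hand truncation)."""
    lo = bisect_left(sorted_terms, stem)
    # Appending a very high code point gives an upper bound for the prefix range.
    hi = bisect_left(sorted_terms, stem + "\uffff")
    return sorted_terms[lo:hi]


def autopost_date(yyyymmdd):
    """Post a date under the full date, the year-month, and the year alone."""
    return [yyyymmdd, yyyymmdd[:6], yyyymmdd[:4]]


def build_date_index(dates_by_record):
    """Map each posted date term to the set of record identifiers carrying it."""
    index = {}
    for rec_id, date in dates_by_record.items():
        for term in autopost_date(date):
            index.setdefault(term, set()).add(rec_id)
    return index


def search_years(index, first_year, last_year):
    """Restrict to a single year or an inclusive range of years."""
    hits = set()
    for year in range(first_year, last_year + 1):
        hits |= index.get(str(year), set())
    return hits


# Hypothetical basic-index terms and publication dates.
terms = sorted(["vat", "vector", "vectorial", "vectors", "vertex"])
expanded = expand_right_truncation(terms, "vector")

date_index = build_date_index(
    {"XP000001": "19960315", "XP000002": "19951101", "XP000003": "19940620"}
)
in_range = search_years(date_index, 1995, 1996)
```

Posting the year as its own term means a year restriction is a direct lookup rather than a scan of full dates, which is one plausible reason for autoposting in date fields.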

Certain additional fields were created by us for a variety of reasons. The technical field type has three possible inputs, CH, EP, and ME, corresponding to chemistry, electricity-physics, and mechanics, one or more of which may be applied. If we are receiving text in all of these areas from a publisher, we shall be able to assign the technical field(s) on the basis of the ISSN. This is to allow restriction of answer sets to specific fields where search terms are common to more than one of the above areas, e.g., "vector." Currently all of the records in XPESP are assigned as EP, but it is conceivable that when we renew the contract for 1997, titles outside the EP area may be included. The NLIN field was inserted at the request of colleagues responsible for developing technical aspects of databases.
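The assignment of technical field types from the ISSN might be sketched as a simple lookup followed by a filter on the answer set. The codes CH, EP, and ME are from the text; the ISSN values are placeholders rather than real journal identifiers, and the function names and record layout are assumptions.

```python
TECHNICAL_FIELDS = {
    "CH": "chemistry",
    "EP": "electricity-physics",
    "ME": "mechanics",
}

# Placeholder ISSNs for illustration only; one or more codes per journal.
ISSN_TO_FIELDS = {
    "0000-0001": {"EP"},
    "0000-0002": {"CH", "ME"},
}


def assign_technical_fields(issn, table=ISSN_TO_FIELDS):
    """Return the sorted technical field codes for a journal, or [] if unknown."""
    codes = table.get(issn, set())
    unknown = codes - set(TECHNICAL_FIELDS)
    if unknown:
        raise ValueError(f"unrecognized technical field codes: {sorted(unknown)}")
    return sorted(codes)


def restrict_to_field(answer_set, code):
    """Keep only the records whose technical field list contains `code`."""
    return [rec for rec in answer_set if code in rec["TFT"]]


# Two hypothetical hits for a term shared between technical areas.
answer_set = [
    {"AN": "XP000001", "TI": "Vector control of drives",
     "TFT": assign_technical_fields("0000-0001")},
    {"AN": "XP000002", "TI": "Vector methods in mechanics",
     "TFT": assign_technical_fields("0000-0002")},
]
ep_only = restrict_to_field(answer_set, "EP")
```

Keeping the mapping at the journal level means a new title only needs one table entry for all of its records to be assigned automatically.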

Six predefined display formats are possible, as well as any user-defined field combination. The default STDR and MAX formats are shown below.

STDR   AN IRN PUB VOL NR PG PD AU TI AB IW DT TFT
MAX    AN PAN IRN PUB VOL NR PG PD AU AUAW TI OT AB TXT IW DT TFT LA NLIN

Most significantly, the default record display does not include the full-text (TXT) field. This is because of the often cumbersome nature of the full-text display, whereas the default display will normally be complete on the screen. Operating in the browsing mode, it is possible to inspect the records one at a time in the default mode and redisplay in the MAX format only those that appear to be relevant. As an alternative in Epoque, it is possible to use the Focus mode, which will display only the title and the parts of the full record that contain the search term(s); again it is possible to redisplay any record in the MAX format.
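A display format can be thought of as an ordered list of field names applied to a record. The sketch below uses the STDR and MAX field lists given above, writing the technical field type as TFT in both on the assumption that the two spellings in the source refer to the same field. The render function, its column width, and the sample record are hypothetical.

```python
DISPLAY_FORMATS = {
    "STDR": "AN IRN PUB VOL NR PG PD AU TI AB IW DT TFT".split(),
    "MAX": ("AN PAN IRN PUB VOL NR PG PD AU AUAW TI OT AB TXT "
            "IW DT TFT LA NLIN").split(),
}


def render(record, fmt="STDR", fields=None):
    """Render a record in a predefined format or a user-defined field list.

    Fields that are absent or empty in the record are skipped.
    """
    names = fields if fields is not None else DISPLAY_FORMATS[fmt]
    return "\n".join(
        f"{name:<5}{record[name]}" for name in names if record.get(name)
    )


# Hypothetical record with only a few fields populated.
record = {
    "AN": "XP000001",
    "PD": "19960315",
    "TI": "Sample title",
    "TXT": "Full text of the article ...",
}

default_view = render(record)               # STDR omits the TXT field
full_view = render(record, "MAX")           # MAX includes the full text
custom_view = render(record, fields=["TI", "PD"])
```

Since STDR simply leaves TXT out of its field list, the default display stays short, and a record is redisplayed in MAX only when the full text is wanted.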

Having created a set of records by searching with text or other terms in the XPESP database, the same general problem of interpreting the total content of the original documents arises here as well as in the patent full-text databases. In a limited number of cases, the content will be fully comprehensible from the XPESP record. However, in many cases, the content will only be comprehensible in combination with formulae or drawings. Formulae in the full text are essentially destroyed by the OCR used in its creation, while drawings are not part of the text.

The option of viewing the document set in facsimile format in BNS is also available here. The XP numbers for the database records will correspond with the numbers for the BNS records, and this list can be automatically extracted from XPESP and loaded in BNS. The difficulties associated with reading large document sets in facsimile previously discussed for patent literature will also have some relevance here. Factors that favor NPL over patents in facsimile mode include the fact that NPL documents are often much shorter than patents and that the layout of a journal article with its embedded drawings and graphs makes it easier to associate these with the relevant text. Part of the difficulty with patents is that drawings are not allowed to be included in the description or claims, and so an effort has to be made to associate relevant text with a drawing.

The Epoque Viewer was conceived in order to reduce the problems of associating text with drawings by providing searchable text alongside drawings on a split screen display. Fundamental to this is the fact that the text and the drawings comprise distinct subparts of a patent document and as such are separable. What is being combined in the Viewer is essentially the ASCII text from one source with the facsimile images from the drawing pages part of the whole facsimile document. The situation for NPL documents (including those of Elsevier) is that drawings are included as embedded images. Using conventional techniques, these drawings are inserted by the printers in order to obtain a pleasing layout. While text can be regenerated in a separate file using OCR techniques, there is no analogous simple and relatively inexpensive solution for embedded images. As a result, the Epoque Viewer has limited use in improving the handling of conventional NPL data. The approach being developed for SGML will be discussed under future developments.

Current Use of the Elsevier Data

The XPESP database has only been available to examiners for 4 months and is currently under evaluation by a limited group of approximately 50 users. They are examining both search and classification issues. After any final modifications suggested by this group, the database will be made available to all examiners. The provisional date for this is the end of October 1996. At the same time a small group of examiners is looking at the potential of developing a search engine that will automatically preclassify full-text records using ranking algorithms. The objective here is to allow examiners to intellectually classify a broader range of journals in a shorter period of time, by only bringing to their attention those items of relevance to their fields (as opposed to inspection of complete indexes of hardcopy journals). Further significant contributions from the introduction of the Elsevier data have been to develop a general approach to the best ways to handle NPL from the point of view of structuring of information as well as to provide us with practical experience in handling NPL data that we hope to apply to data from further publishers.

Future XPESP Developments

As has been described, the XPESP database and corresponding facsimile collection relate to 39 journal titles in the electricity-physics area. This technical field was chosen because of the interest shown by examiners in that field, but in principle it would be interesting to extend the database to other technical areas. The EPO currently has well in excess of 100 subscriptions to journals controlled by Elsevier. Excluding the current XPESP titles, the remainder appear to be mainly in the chemical, biochemical, and materials areas. Part of the reason for a lack of interest in chemistry is the highly complex and non-uniform nature of the substance nomenclature (if it is applied at all), which makes searching in the CAS databases using Registry Numbers or substructures much more reliable. This is clearly not the case in the biological fields, but the number of journals that could be included is probably no more than 10, which is only a small fraction of the coverage of a database such as BIOSIS. However, it is intended to extend both the range of titles and technical fields in 1997. The subject of SGML has already been touched upon briefly, and it seems clear that in the next year or so, full document data will become available in this format from Elsevier and a number of other publishers. This means that, in addition to creating searchable databases, it will be possible to display the text and drawings in the Epoque Viewer. To optimize this, a new Viewer is already under development that will allow the display of text or integrated text and drawings, as well as the capability to hold a drawing in one half of a split screen display while moving through the document content in the other half.

TRAINING

When this presentation was originally proposed, it was suggested that I should discuss how we were providing training for our examiners in the use of the Elsevier data. I hope that it is clear from the above description of our activities that no special training should be necessary in the searching of the data. Every effort has been made to make the data searchable in Epoque in the same way as full-text data from any other source, and an examiner should be able to begin searching effectively as soon as he/she has looked at the field and index structure of the data. What nobody can do at the moment is provide tested guidelines on how to search most efficiently using the XPESP database in combination with the facsimile data in BNS. This expertise can only evolve from examiners searching in their own fields of expertise and evolving strategies best adapted to the characteristics of these fields.

GENERAL PROBLEMS ASSOCIATED WITH DEVELOPING A CRITICAL MASS OF NPL

Despite the encouraging results with Elsevier data, the problems were and still are many. Principal among these is the fact that even though agreement between three organizations (EPO, JPO, and USPTO) is enough to provide a very broad foundation on which to build, in a predictable manner, databases in patent literature covering all fields of technology, the number of major publishers of NPL covering the same areas runs into hundreds. Each has to be approached individually; some are reluctant to make the data available; different contracts have to be entered into with each; data is not always available in ASCII abstract, full-text, and facsimile formats; character-coded data is structured differently in every case; and delivery media and methods vary. Also of importance for any retrospective search system is the availability of a comprehensive electronic backfile. This is fundamental to the patent offices, but of no direct interest to commercial publishers. At the same time, developments in dissemination of information on the Internet have made publishers less afraid of leasing their data in electronic format than they were 2 to 3 years ago. It seems probable that in the electricity-physics area it will be possible to achieve broad front file database coverage within the next year or so, which should provide benefits in both classification and search.

REFERENCES

1. Nuyts, A. "A New Dimension in Patent Indexing and Searching: The Use of Navigation Tools," Proceedings of the 1995 International Chemical Information Conference, Nîmes, France, ed. H. Collier. Calne, England: Infonortics Ltd, 1995, pp. 78-88.

2. Nuyts, A. "Present and Future EPO Systems for Automation of Search in Directorate General 1: Epoque, BACON and CAESAR," Proceedings of the Montreux International Chemical Information Conference, ed. H. Collier. New York: Springer Verlag, 1989, pp. 187-190.

3. Dillon, A. "Reading From Paper Versus Screens: A Critical Review of the Empirical Literature," Ergonomics. 35 (October 1992), 1297-1326.

4. Cushman, W.H. "Reading From Microfiche, a VDT, and the Printed Page: Subjective Fatigue and Performance," Human Factors, 28 (1986), 63-73.