All change at WIPO - A review of the recent expansion in WIPO Published Sequence Listings coverage

Robert Austin – FIZ Karlsruhe



Agenda

• A review of the WIPO Published Listings Contained in Published PCT Applications web site, data coverage and implementation on STN

• A comparison to the WIPO/PCT sequence coverage in INSDC / EMBL-EBI databases

• A review of the changes from 1st July 2009, and any influence on recent publications (Jan 2010)

• The results of a typical BLAST search example

• Challenges and conclusions

PCT Administrative Instructions (AI) Part 8* were introduced on 11th January 2001

In particular, Section 801(a):

“(a) Pursuant to Rules 89bis and 89ter, where an international application contains disclosure of one or more nucleotide and/or amino acid sequence listings (“sequence listings”), the receiving Office may, if it is prepared to do so, accept that the sequence listing part of the description, as referred to in Rule 5.2(a) and/or any table related to the sequence listing(s) (“sequence listings and/or tables”), be filed, at the option of the applicant:

(i) only on an electronic medium in electronic form in accordance with Section 802; or

(ii) both on an electronic medium in electronic form and on paper in accordance with Section 802;

provided that the other elements of the international application are filed as otherwise provided for under the Regulations and these Instructions.”

* http://www.wipo.int/export/sites/www/pct/en/texts/pdf/ai_8.pdf

If a sequence listing ran to over 400 pages, it was cheaper to file the PCT application with an electronic listing

“ Where the PCT application as filed contains a sequence listing part in computer readable form only, or both in that form and on paper, under Section 801 of the Administrative Instructions under the PCT, and where that application is filed with a receiving Office which is prepared to accept such filings, a fixed component of 400 times the fee per sheet over 30 is payable for the sequence listing part, irrespective of the actual length of that part (see Section 803(ii) of the Administrative Instructions). ”

Page 19, Footnotes to Fee Tables, PCT NEWSLETTER, July-August 2009.
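
The break-even point implied by the footnote can be sketched as follows. This is a minimal illustration, not the official fee calculation: it counts fees in units of "the fee per sheet over 30", and it assumes that a paper sequence listing is simply counted together with the other sheets of the application. The function names are hypothetical.

```python
FREE_SHEETS = 30            # sheets covered before the per-sheet fee applies (quoted footnote)
FIXED_LISTING_UNITS = 400   # fixed component for an electronic listing (quoted footnote)


def fee_units_paper(body_sheets, listing_sheets):
    """Per-sheet fee units when the listing is filed on paper.

    Assumption: listing sheets are counted with the rest of the application.
    """
    return max(0, body_sheets + listing_sheets - FREE_SHEETS)


def fee_units_electronic(body_sheets):
    """Per-sheet fee units when the listing is filed under Section 801.

    The listing costs a fixed 400 units, irrespective of its length.
    """
    return max(0, body_sheets - FREE_SHEETS) + FIXED_LISTING_UNITS


if __name__ == "__main__":
    # Hypothetical application with a 50-sheet description, claims and drawings.
    for listing_sheets in (100, 400, 1000):
        paper = fee_units_paper(50, listing_sheets)
        electronic = fee_units_electronic(50)
        print(listing_sheets, paper, electronic)
```

Under these assumptions the two routes cost the same at a 400-sheet listing, and the electronic route is cheaper for anything longer.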

WIPO Published Listings Contained in Published PCT Applications was launched in August 2001

“ As from 2 August 2001, the sequence listing parts of the international applications filed under Section 801 of the Administrative Instructions under the PCT will be published on the Internet on the date of publication of the rest of the international application of which it forms a part. ”

Page 4, SEQUENCE LISTINGS FILED UNDER SECTION 801 TO BE PUBLISHED ON THE INTERNET, PCT NEWSLETTER, August 2001.

WIPO Published Listings Contained in Published PCT Applications was used to create PCTGEN

• PCTGEN was created by FIZ Karlsruhe using the WIPO data, and was launched in March 2003

• The database incorporated all the available WIPO/PCT data, from August 2001 onwards

• Each PCTGEN record featured searchable publication number and date, patent applicant name(s) and the original publication title

• Sequence length, SEQ ID, organism name and molecule type were also included for each sequence

• The database continues to be updated weekly, within 24 hours of WIPO/PCT publication

The purpose of the Published Listings Contained in Published PCT Applications service remained unchanged until September 2007

• From August 2001 until September 2007, the service provided electronic format sequence listings from “mixed-mode”* PCT applications

• To qualify for the AI Section 801(a) electronic publication, sequence listings, theoretically, had to comply with the WIPO ST.25 standard

• However, PCT applicants frequently did not supply sequences in acceptable ST.25 format

• By September 2007, 2,700+ PCT application sequence listings, representing 5.7+ million sequences in PCTGEN, had been posted

(* Mixed-mode: a paper PCT application, with a separately published electronic sequence listing.)

As much as possible of the non-standard WIPO/PCT data was loaded into PCTGEN

• For example, in 2006, 1,000 AI Section 801(a) electronic format listings were published, and downloaded for processing by FIZ Karlsruhe

• A total of 428 of these sequence listings did not comply with WIPO ST.25 text format rules
– Including non-ASCII text, special characters, missing mandatory headings and incorrectly used headings
– Listings in PDF, TIF, or PDF-of-TIF file formats

• A total of 299 of these 2006 problematic listings were successfully converted for PCTGEN

Source: FIZ Karlsruhe Editorial.

From October 2007, WIPO publication policy changed to include all sequence listings

“Changes regarding sequence listings [SLs] have recently been made. The purpose of this change is to make the publication of sequence listing more complete and easier to use….”

“All SLs will be included (i.e. including the SLs extracted from the pamphlets). SLs embedded in the description will be gradually removed….”

“The new publication system will be effective as of October 1st, 2007. The former publication system will still run in parallel until that date.”

See: http://www.wipo.int/patentscope/en/news/pctdb/2007/news_0010.html

WIPO Published Sequence Listings Contained in Published PCT Applications: http://www.wipo.int/pctdb/en/sequences/

Recap: Prior to October 2007, only WIPO sequence listings submitted and published in electronic form under the mixed-mode filing provided by “PCT Administrative Instructions, Section 801(a)” were available for download. Since October 2007, “all sequence listings” are provided.

An example of a downloaded AI Annex C WIPO ST.25* text format sequence listing

* http://www.wipo.int/scit/en/standards/pdf/03-25-01.pdf

Information   ST.25
SEQ ID NO     <210>
Length        <211>
Type          <212>
Organism      <213>
Feature       <220>
Name/key      <221>
Location      <222>
Sequence      <400>

Note: this is the example highlighted by the red box on the previous slide.
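
The numeric identifiers listed above lend themselves to a simple line-based reader. The following is a minimal sketch, assuming a well-formed ST.25 text file in which each identifier starts its own line. It is not the FIZ Karlsruhe loader. The function name `parse_st25` and the toy listing in `SAMPLE` are invented for illustration, and amino acid translations interleaved under a nucleotide sequence are not handled.

```python
import re

TAG_LINE = re.compile(r"^<(\d{3})>\s*(.*)$")
FIELD_NAMES = {"212": "mol_type", "213": "organism"}


def parse_st25(text):
    """Split an ST.25 text listing into one dict per <210> SEQ ID NO."""
    records, current, in_sequence = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.group(1), match.group(2).strip()
            in_sequence = False
            if tag == "210":
                # A new sequence record starts at every <210>.
                current = {"seq_id": value, "length": None,
                           "features": [], "sequence": ""}
                records.append(current)
            elif current is None:
                continue  # general information before the first <210>
            elif tag == "211":
                current["length"] = int(value) if value.isdigit() else None
            elif tag in FIELD_NAMES:
                current[FIELD_NAMES[tag]] = value
            elif tag in ("220", "221", "222", "223"):
                current["features"].append((tag, value))
            elif tag == "400":
                in_sequence = True
        elif in_sequence and current is not None:
            # Drop spacing and position counters from the sequence lines.
            current["sequence"] += re.sub(r"[\s\d]", "", line)
    return records


# Toy listing, invented for this example.
SAMPLE = """\
<160> 1
<210> 1
<211> 17
<212> DNA
<213> Homo sapiens
<400> 1
taatgcaggt gacaaat                                          17
"""

if __name__ == "__main__":
    for record in parse_st25(SAMPLE):
        print(record["seq_id"], record["organism"], len(record["sequence"]))
```

Comparing the declared <211> length with the length of the parsed sequence is one cheap consistency check that such a reader makes possible.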

The example from the previous slide as it appears in the PCTGEN database

L1 ANSWER 1 OF 1 PCTGEN COPYRIGHT 2010 WIPO on STN
AN   2010010000.1 DNA PCTGEN
TI   SCREENING METHOD, DIAGNOSTIC METHOD AND SMALL NUCLEIC ACID FOR THE TREATMENT OF CNS DISORDERS
PA   Karolinska Innovations AB
PI   WO 2010010000 20100128
RLI  US 2008-82492P 20080721
ED   20100129
DT   Patent
ORGN Homo sapiens
SQL  94517
SEQ
    1 ctgaaataga taatcagaaa tacagccaac tgatctttga aaaaaaagca
   51 aaggcaattc aatggggaat aatagtcttt tccacaaaga ctaatggagt
  101 aactggagat tcacatacag aaaaatgaat ctggaaatag accttacatc
. . . .
94451 agcattatta tttgctgagt caggttatta gaccttcctt cctttgtgca
94501 taatgcaggt gacaaat

FEATURE TABLE:
Key       |Location        |
==========+================+=======================
gene      |(1)..(94517)    |CYP2C19*17 allele
5'UTR     |(1)..(3913)     |
3'UTR     |(94123)..(94517)|

Sequence records typically enter PCTGEN within 24 hours of WIPO/PCT publication.

AN 2010010000.1 is SEQ ID NO 1 from WO 2010010000.

Note: this PCTGEN sequence record is not currently present in DGENE, REGISTRY or INSDC (as of 30th January 2010).


Since October 2007, around 50% of the PCT sequence listings are TIF images (or PDFs of TIF images), rather than ST.25 format text files.

The example from the previous slide as it appears in the PCTGEN database

L1 ANSWER 1 OF 1 PCTGEN COPYRIGHT 2010 WIPO on STN
AN   2010009500.1 DNA PCTGEN
TI   Improved vegetable oils and uses therefor [File created by using OCR software]
PA   Commonwealth Scientific and Industrial Research Organisation
PI   WO 2010009500 20100128
ED   20100129
DT   Patent
ORGN Gossypium hirsutum
SQL  1997
SEQ
    1 tctctccttt ctcaatgctg tggtggcggc gcaaccccta acaaagacgt
   51 gggcttgatt tcttccttcc gtggatccac cattcaaggc ttgatggctt
  101 cttgcttggc ttttgagcct tgtgatgatt attattcctc caaaaatggt
  151 agctttttcg gtcaaaatgg aagcttttca tctttcttcg gctccaaaaa
  201 tgttcctttc aataaaaatc gcaagcaaaa aaggctcaat cgacgagctc
  251 atcattctgg acaagccatg gctatagctg tgcaacccac aagagagatt
  301 acaacgaaga agaagcctcc tacgaagcaa agacgagtgg ttgtgactgg
  351 gatgggagta gtaactccgc ttggacatga gcctgatgtt ttctataaca
  401 acctgcttga gggtgttagt ggtataagtg aaatcgagac ttttgactgc
  451 gctcagtttc cgacaaggat tgctggagag atcaaatctt tctcaactga
  501 tggatgggtc gcaccgaaac tttccaagag gatggacaaa ttcatgcttt
  551 attctcttac tgccggaaag aaagctttgc aagatggggg agtaaatgaa

. . . .

Since October 2007, records created from image format sequence listings are clearly marked.

Note: this PCTGEN sequence record is not currently present in DGENE, REGISTRY or INSDC (as of 30th January 2010).

WIPO/PCT documents with published sequence listings in PCTGEN

[Bar chart: number of documents per year, 2001 to 2009; y-axis from 0 to 5,000.]

FIZ Karlsruhe Editorial began to work extensively on PCTGEN from November 2005.

A greatly expanded WIPO publication policy and a new Editorial workflow were introduced from October 2007.

New FIZ Karlsruhe Editorial workflow for PCTGEN from October 2007

Source: FIZ Karlsruhe Editorial.

Check, clean, amend: Editorial clean-up and quality control of sequence records

• Check (= compliance with WIPO ST.25)
– Mandatory tags
– Allowed tags
– Forbidden special characters
– Consistency
– Allowed sequence characters

• Clean (= remove extraneous text)
– Headers
– Footnotes
– Page numbers

• Amend (= label OCR records)
– [this file was created by using OCR software]
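
The three steps above can be sketched in a few lines. This is only an illustration of the idea, not the FIZ Karlsruhe Editorial tooling. The tag list is an illustrative subset taken from the ST.25 identifiers shown earlier, the page-number pattern is a guess at typical page furniture, and all names are hypothetical. The OCR label uses the wording that appears in the PCTGEN record display.

```python
import re

REQUIRED_TAGS = ("<210>", "<211>", "<212>", "<213>", "<400>")  # illustrative subset
NUCLEOTIDE_CHARS = set("acgtumrwsykvhdbn")  # ST.25 nucleotide symbols
PAGE_FURNITURE = re.compile(
    r"^(page\s+\d+(\s+of\s+\d+)?|\d+\s*/\s*\d+)$", re.IGNORECASE
)  # hypothetical pattern for stray page numbers
OCR_LABEL = "[File created by using OCR software]"


def check(text, sequence):
    """Check: return a list of compliance problems (empty if none found)."""
    problems = [f"missing tag {tag}" for tag in REQUIRED_TAGS if tag not in text]
    if not text.isascii():
        problems.append("non-ASCII characters")
    bad = sorted(set(sequence) - NUCLEOTIDE_CHARS)
    if bad:
        problems.append("unexpected sequence characters: " + "".join(bad))
    return problems


def clean(text):
    """Clean: drop lines that look like page numbers."""
    kept = [ln for ln in text.splitlines() if not PAGE_FURNITURE.match(ln.strip())]
    return "\n".join(kept)


def amend(title, from_ocr):
    """Amend: label records that were created from image listings."""
    return f"{title} {OCR_LABEL}" if from_ocr else title


if __name__ == "__main__":
    listing = "<210> 1\nPage 3 of 12\n<211> 4\n<212> DNA\n<213> Homo sapiens\n<400> 1"
    print(check(clean(listing), "acgt"))
    print(amend("Improved vegetable oils and uses therefor", from_ocr=True))
```

A real workflow would also need the manual review shown on the following slides, because OCR errors inside a sequence can still consist of allowed characters.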

Editorial figures for 2009 give insight into the workflow challenges involved

Sequence listing files downloaded from WIPO: 4,616 (2009)

Text    Corrected  Unrepaired   Open  In PCTGEN*
2359    1084       26           9     2324 (98.5%)

Images  Converted  Unprocessed  Open  In PCTGEN*
2257    2109       160          148   1949 (86.4%)

Source: FIZ Karlsruhe Editorial. * Status as of 7th January 2010 update.
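
The percentages in the table follow directly from the counts. A small sketch of the arithmetic, using only the figures shown on this slide (the variable and function names are illustrative):

```python
def loaded_percent(loaded, downloaded):
    """Share of downloaded listings that reached PCTGEN, to one decimal place."""
    return round(100 * loaded / downloaded, 1)


# Figures from the slide (status as of the 7th January 2010 update).
text_files, text_loaded = 2359, 2324
image_files, image_loaded = 2257, 1949

if __name__ == "__main__":
    print("total downloaded:", text_files + image_files)
    print("text listings loaded (%):", loaded_percent(text_loaded, text_files))
    print("image listings loaded (%):", loaded_percent(image_loaded, image_files))
```

The text and image counts add up to the 4,616 files downloaded, and the gap between the two load rates reflects the extra OCR and correction work needed for image listings.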

Example PCT listing in image format.

Initial OCR text file.

Cleaned and corrected text file ready for loading on STN.

Example of an error which needed manual correction.

Source: FIZ Karlsruhe Editorial.

Example problematic sequence listing files


The INSDC: http://www.insdc.org/

• The INSDC is a collaboration between DDBJ, EMBL-EBI and NCBI, including the daily mutual exchange of submitted nucleotide sequence data

• INSDC sequence data may be searched at DDBJ, EMBL-EBI or NCBI (GenBank)

• INSDC patent sequence data is provided to DDBJ, EMBL-EBI and NCBI by the JPO/KIPO, EPO and USPTO respectively; this collection includes WIPO data

INSDC (GenBank) provides JPO, KIPO, EPO and USPTO nucleotide sequence data

In addition, EMBL-EBI provides JPO, KIPO, USPTO and EPO protein sequence data

www.ebi.ac.uk/Tools/blastall/

• EMBL-EBI provides separately searchable patent protein data from the JPO, KIPO, USPTO and EPO.

• Note: almost all of the available WIPO protein data is provided within the EPO Patent Protein Database.

Total WIPO/PCT sequence records in PCTGEN and INSDC/EMBL-EBI

[Bar chart: total WIPO/PCT sequence records (nucleic and protein) in PCTGEN versus EMBL-EBI; y-axis from 0 to 8,000,000.]

Statistics: 28th January 2010.

The WIPO/PCT sequence data available in INSDC/EMBL-EBI comes from PCT ISAs

• WIPO/PCT nucleotide sequence data in the INSDC database comes from major International Searching Authorities (ISAs), mostly via the EPO/EMBL-EBI

• Similarly, ISAs may also provide WIPO protein sequences to the EPO, which incorporates them into the protein database it provides to EMBL-EBI

• Sequences provided by ISAs are those submitted by PCT applicants for the purposes of search and/or preliminary examination, under PCT Rule 13ter*

• Note: PCT Rule 13ter sequence data may not actually form part of the formal PCT application

* See: http://www.wipo.int/pct/en/texts/rules/r13ter.htm

Overview of the Patenting Process using the PCT System

[Flow diagram linking the Invention, the Patent application, the Receiving Office (local patent office), the International Searching Authority (ISA), the International Preliminary Examining Authority (IPEA), the PCT International Bureau, the International Publication, the Designated Offices (foreign patent offices) and the granted Patents.]

Months from Priority Date:
0   File Local Application (Priority Date)
12  File PCT Application with Receiving Office (International Filing Date)
16  ISR & Written Opinion (IPRP I)
18  International Publication
22  File Demand for IPRP II (optional)
28  IPRP II
30  National Phase Entry (where Applicants seek Protections)

The International Phase is followed by the National Phase.

* ISA transmit International Search Reports (ISR) & the Written Opinions / IPEA transmit International Preliminary Reports on Patentability II (IPRP II) (optional)

Source: WIPO Statistics on the PCT System: http://www.wipo.int/ipstats/en/statistics/pct/

Rule 13ter sequences go into INSDC / EMBL-EBI.

Published sequences go into PCTGEN.

2009 WIPO/PCT document coverage in PCTGEN and INSDC (GenBank)

Diagram: the number of WIPO/PCT applications published in 2009, which have at least one nucleic acid sequence record in either PCTGEN, INSDC (GenBank), or in both databases (the overlap).

PCTGEN only: 1,298    Both (overlap): 2,029    INSDC only: 725

PCTGEN total: 3,327    INSDC total: 2,754

Statistics: 25th January 2010.
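
The diagram figures can be cross-checked with simple set arithmetic. The sketch below uses only the three region counts from the slide. The "share of all covered applications" measure is an illustrative addition, not a figure from the presentation.

```python
# Region counts from the diagram (statistics of 25th January 2010).
pctgen_only, both, insdc_only = 1298, 2029, 725

pctgen_total = pctgen_only + both          # applications found in PCTGEN
insdc_total = insdc_only + both            # applications found in INSDC (GenBank)
union = pctgen_only + both + insdc_only    # applications found in at least one database


def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print("PCTGEN total:", pctgen_total)
    print("INSDC total:", insdc_total)
    print("covered by either database:", union)
    print("PCTGEN share of union (%):", share(pctgen_total, union))
    print("INSDC share of union (%):", share(insdc_total, union))
```

The totals reproduce the 3,327 and 2,754 shown on the slide, and the two "only" regions show that neither database alone covers every 2009 application with a nucleic acid sequence record.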

A 2008 paper* suggests why INSDC/EMBL-EBI WIPO/PCT coverage appears to be incomplete

“ We have noticed that, generally, if the ISA is the EPO or the JPO then the data will be made available to the public via the EBI and DDBJ respectively (and then GenBank), but if the ISA is the USPTO then the data will be missing. The principal exception to this rule is when the sequence listing is filed with the international application and is then published by WIPO. ”

* Piet Jan Andree, Mark F. Harper, Stéphane Nauche, Robert A. Poolman, Jo Shaw, Joop C. Swinkels, Sally Wycherley, A comparative study of patent sequence databases, World Patent Information 30 (2008) 300–308

Part 8 of the Administrative Instructions was deleted with effect from 1st July 2009

“ [Part 8] was introduced in 2001 as a temporary solution to problems arising from the filing of very large sequence listings on paper, which were difficult to handle for applicants and Offices and very expensive for applicants. Now that applicants may, and do to an increasing extent, file international applications in electronic form, such temporary provisions have become less relevant, and Part 8 of the Administrative Instructions has been deleted with effect from 1 July 2009. ”

Page 10, Practical Advice, PCT NEWSLETTER, July-August 2009.

From 1 July 2009, there are no fees for a sequence listing filed in electronic format

“ Where a sequence listing is contained in an international application filed in electronic form, the calculation of the international filing fee should not take into account any sheet of the sequence listing if that listing is presented as a separate part of the description in accordance with PCT Rule 5.2(a) and is in the electronic document format specified in the Administrative Instructions under the PCT, Annex C, paragraph 40 (that is, in text format). ”

Page 19, Footnotes to Fee Tables, PCT NEWSLETTER, July-August 2009.

So what has been the effect of new rules on recent WIPO/PCT sequence data?

• PCT Applications filed since 1 July 2009, began to be published in January 2010

• Over the four publication dates – 7th, 14th, 21st and 28th of January – 340 WIPO/PCT listings were published and entered into PCTGEN

• 208 of the sequence listings (61%) were text files, and 132 were image files (39%)

• Conclusion: the 1 July 2009 rules have not yet ensured that ST.25 text files will replace image files, but the proportion of text files has increased somewhat from the 2009 average – from 50% up to 61%

A BLAST example shows that including WIPO Published Listings Contained in Published PCT Applications data can be very important

(Search conducted on 25th January 2010)

Search Question: Find all patent references to Homo sapiens transforming growth factor beta 1 induced transcript 1 isoform 2 (NCBI: NP_057011)

MPRSGAPKERPAEPLTPPPSYGHQPQTGSGESSGASGDKDHLYSTVCKPRSPKPAAPAAPPFSSSSGVLGTGLCELDRLLQELNATQFNITDEIMSQFPSSKVASGEQKEDQSEDKKRPSLPSSPSPGLPKASATSATLELDRLMASLSDFRVQNHLPASGPTQPPVVSSTNEGSPSPPEPTGKGSLDTMLGLLQSDLSRRGVPTQAKGLCGSCNKPIAGQVVTALGRAWHPEHFVCGGCSTALGGSSFFEKDGAPFCPECYFERFSPRCGFCNQPIRHKMVTALGTHWHPEHFCCVSCGEPFGDEGFHEREGRPYCRRDFLQLFAPRCQGCQGPILDNYISALSALWHPDCFVCRECFAPFSGGSFFEHEGRPLCENHFHARRGSLCATCGLPVTGRCVSALGRRFHPDHFTCTFCLRPLTKGSFQERAGKPYCQPCFLKLFG

A summary of results for NP_057011

Uniquely retrieved patent families indicated in (red)

Database      Sequences (≥ 80%)  Publications  Patent Families
DGENE         15                 11            8
USGENE        11                 9             5 (1)
PCTGEN        2                  2             2 (1)
REGISTRY      18                 11            10 (3)
NCBI*         2                  2             1
EMBL-EBI      12                 9             5
Total Unique  -                  -             13

(* Unlike EMBL-EBI, the NCBI only provides patent protein sequence data from the USPTO. Both NCBI hits were for U.S. patents, and neither was unique compared to USGENE.)

The unique answer from PCTGEN

L1 ANSWER 1 OF 1 PCTGEN COPYRIGHT 2010 WIPO on STN
AN   2008021290.65800 PRT PCTGEN
TI   ORGAN-SPECIFIC PROTEINS AND METHODS OF THEIR USE
PA   Homestead Clinical Corporation
     Institute for Systems Biology
     Hood, Leroy
     Beckmann, M. Patricia
     Johnson, Richard
     Marelli, Marcello
     Li, Xiaojun
PI   WO 2008021290 20080221
RLI  US 2006-836986P 20060809
ED   20090324
DT   Patent
ORGN Homo sapiens
SCORE 954 100% of query self score 954
BLASTALIGN

Query = 444 letters
Length = 444
Score = 954 bits (2465), Expect = 0.0
Identities = 444/444 (100%), Positives = 444/444 (100%)

Query: 1  MPRSGAPKERPAEPLTPPPSYGHQPQTGSGESSGASGDKDHLYSTVCKPRSPKPAAPAAP
          MPRSGAPKERPAEPLTPPPSYGHQPQTGSGESSGASGDKDHLYSTVCKPRSPKPAAPAAP
Sbjct: 1  MPRSGAPKERPAEPLTPPPSYGHQPQTGSGESSGASGDKDHLYSTVCKPRSPKPAAPAAP

Query: 61 PFSSSSGVLGTGLCELDRLLQELNATQFNITDEIMSQFPSSKVASGEQKEDQSEDKKRPS
          PFSSSSGVLGTGLCELDRLLQELNATQFNITDEIMSQFPSSKVASGEQKEDQSEDKKRPS
Sbjct: 61 PFSSSSGVLGTGLCELDRLLQELNATQFNITDEIMSQFPSSKVASGEQKEDQSEDKKRPS

. . . .

As of 25th January 2010, this relevant WIPO/PCT application, and its entire patent family, was retrieved only by PCTGEN.

Conclusions and challenges

• The WIPO Published Listings Contained in Published PCT Applications web site expanded its scope of coverage significantly to “all listings” in October 2007

• Since October 2007, WIPO data has been regularly loaded into PCTGEN from a roughly 50/50 mixture of text format and OCRed image format listings

• It is not yet clear how much effect the July 2009 WIPO Administrative Instructions (AI) changes will have on the proportion of text and image format published listings

• PCT Rule 13ter sequences in INSDC / EMBL-EBI, represent different sequence coverage from what is published by WIPO and loaded into PCTGEN

• Future challenges: WIPO is working on a 1999-2006 backlog of listings (mostly image files), and is also making amendments to previously published listings


www.stn-international.com/pctgen.html