The Challenges of Managing “Big Data” in the Patent Field 14-15 April 2014, Nice Olivier Huc


Page 1: II-SDV 2014 The Challenges of Managing “Big Data” in the Patent Field: Patents for business (Olivier Huc, Minesoft, UK)

The Challenges of Managing “Big Data” in the Patent Field

14-15 April 2014, Nice

Olivier Huc

Page 2: What we do

Specialists in Patent Information
Building Intelligent Patent Information Solutions since 1996

Trusted by IP experts Worldwide
Corporations, National Patent Offices, Patent Attorneys and Patent Search Firms worldwide

International Customer Support
Global client base with offices and support across Europe, North America and Asia

Page 3

Patent Families
Analytics
Quality Control
Fast Search
Legal Status Review
Alerts

• 23 Full Text Collections
• 48 Million Families
• 103 Issuing Authorities

• IPC, CPC, US and JP classes
• Quality Controlled content
• Normalised data

Page 4: 3 Patent Data Myths

• Myth #1: Patent data is just another type of “Big Data”
• Myth #2: Patent data is handled automatically
• Myth #3: Patent data is consistent worldwide

Page 5: The reality

• Patent data volume may be smaller, but the data is more complex (languages, text, fields)
• Patent data is not retrieved on the fly; it is hosted, indexed and optimized
• There are multiple sources, with overlap
• Data quality is a major issue
• Users have a low tolerance for errors

Page 6: Database Facts

• Total data volume exceeds 35 TB
• 49 million families and 103 publishing bodies
• 95 million publications
• 47 million full texts, including over 23 million machine translations from non-Latin languages into English
• 54 million clipped images and 45 million complete sets of drawings

Page 7: Hardware & Search Engine

• Minesoft and RWS host their own data center, located just outside London
  • Control
  • Confidentiality
  • Reactivity
  • Speed
• Distributed search engine
• Continuous data update and indexing => no need to interrupt or restart the online services, and new data is immediately searchable

Page 8: Sources

• Multiple data sources:
  • DOCDB weekly feeds (EPO)
  • National Patent Offices
  • Commercial collections
  • External information (such as National Registers)
• Despite the complexity, having multiple sources for the same country is a great advantage:
  • Complementarity
  • Improved quality
  • Security
  • Speed

Page 9: Data Quality

• We perform stringent quality checks
  • Human
  • Programmatic
• Manual checks on some source data collections as they arrive, e.g. India (IN), Thailand (TH) and the Philippines (PH)
• Errors in data are identified programmatically against strict pre-set parameters and then corrected manually by our data team
  • e.g. IC8=AO1G1/00 (the letter O appears where the digit 0 belongs)
• Although we follow the EPO’s INPADOC rules for families (extended), we recreate all our families to ensure consistency
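The programmatic check behind the IC8=AO1G1/00 example can be sketched as a pattern test. This is a minimal illustration, not the actual PatBase rule set: the regular expression is a simplified assumption about the IPC symbol layout (section letter, two-digit class, subclass letter, main group, slash, subgroup), and the names `IPC_PATTERN`, `is_valid_ipc` and `flag_for_review` are hypothetical.

```python
import re

# Simplified IPC layout (an assumption for illustration):
# section A-H, two-digit class, subclass letter, 1-4 digit main group,
# "/" and a 2-6 digit subgroup.
IPC_PATTERN = re.compile(r"^[A-H][0-9]{2}[A-Z][0-9]{1,4}/[0-9]{2,6}$")


def is_valid_ipc(symbol: str) -> bool:
    """Return True if the symbol matches the simplified IPC layout."""
    return IPC_PATTERN.match(symbol.replace(" ", "")) is not None


def flag_for_review(symbols):
    """Collect the symbols that fail the check so a person can correct them."""
    return [s for s in symbols if not is_valid_ipc(s)]


print(flag_for_review(["A01G1/00", "AO1G1/00"]))  # ['AO1G1/00']
```

The check only detects the problem. Correction stays manual, as the slide describes.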

Page 10: Data Quality

Adding extra value to PatBase data:

• Families are automatically reviewed and, if necessary, rebuilt when we receive new and/or corrected information (e.g. priority)
• Examples, paragraphs and claims are tagged to make it easier to search specific sections of text
• Machine translation: when a family gets new text, the family is reassessed to see whether a machine translation needs to be added, replaced or deleted
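Rebuilding a family when priority information changes can be pictured as regrouping publications that share a priority, directly or indirectly. The sketch below uses a union-find structure for this. It is an illustration of the extended-family idea only, not the PatBase implementation, and the publication and priority identifiers are placeholders.

```python
from collections import defaultdict


def build_families(priorities):
    """Group publications linked by shared priorities, directly or indirectly.

    `priorities` maps a publication number to the set of priority numbers it
    claims. Returns a sorted list of families (each a sorted list).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for pub, prios in priorities.items():
        find(("pub", pub))  # register publications that claim no priority
        for prio in prios:
            union(("pub", pub), ("prio", prio))

    groups = defaultdict(list)
    for pub in priorities:
        groups[find(("pub", pub))].append(pub)
    return sorted(sorted(members) for members in groups.values())


# Placeholder identifiers, not real patent numbers.
data = {"EP1": {"P1"}, "US1": {"P1", "P2"}, "JP1": {"P2"}, "CN1": {"P3"}}
print(build_families(data))   # CN1 forms its own family

data["CN1"] = {"P2", "P3"}    # a corrected priority arrives
print(build_families(data))   # all four publications now share one family
```

Rerunning the grouping after each correction is the simplest approach. A production system would more likely update only the affected families.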

Page 11: Standardisation of patent data

Formatting application and priority information

TW AN/PR (Emperor year conversion & type of application)
Input                     Output
083303675                 TW19940303675F
092128911                 TW20030128911
092128911                 TW20040201682U

US AN/PR (type of application & year)
Input                     Output
US29/356,858 20100303     US20100356858F
1301618611 A              US20110016186

AT AN/PR (type of application & year)
Input                     Output
A 709/95                  AT19950000709
GM647/96                  AT19960000647U
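The conversions in the table above can be sketched as small normalisation functions. The rules here are inferred from the examples on the slide rather than taken from any Minesoft specification. They are a four-digit year, a serial padded to seven digits and a suffix for the application type. The two-digit-year pivot in `expand_year` is an additional assumption, and all function names are hypothetical.

```python
import re

ROC_YEAR_OFFSET = 1911  # Republic of China (Minguo) year 1 corresponds to 1912
# Suffix per leading type digit, inferred from the slide's TW examples.
TW_TYPE_SUFFIX = {"1": "", "2": "U", "3": "F"}


def normalise_tw(raw: str) -> str:
    """'083303675' -> 'TW19940303675F' (3-digit ROC year + 6-digit number)."""
    if not re.fullmatch(r"\d{9}", raw):
        raise ValueError(f"unexpected TW number: {raw!r}")
    year = int(raw[:3]) + ROC_YEAR_OFFSET
    number = raw[3:]
    return f"TW{year}{number.zfill(7)}{TW_TYPE_SUFFIX[number[0]]}"


def expand_year(two_digits: str, pivot: int = 50) -> int:
    """Assumed pivot rule: 50-99 -> 19xx, 00-49 -> 20xx."""
    yy = int(two_digits)
    return (1900 if yy >= pivot else 2000) + yy


def normalise_at(raw: str) -> str:
    """'A 709/95' -> 'AT19950000709'; 'GM647/96' -> 'AT19960000647U'."""
    m = re.fullmatch(r"\s*(A|GM)\s*(\d+)/(\d{2})\s*", raw)
    if m is None:
        raise ValueError(f"unexpected AT number: {raw!r}")
    kind, number, yy = m.groups()
    suffix = "U" if kind == "GM" else ""
    return f"AT{expand_year(yy)}{number.zfill(7)}{suffix}"


def normalise_us(raw: str, filing_date: str) -> str:
    """'US29/356,858' filed '20100303' -> 'US20100356858F'."""
    m = re.fullmatch(r"US(\d{2})/(\d{3}),?(\d{3})", raw)
    if m is None:
        raise ValueError(f"unexpected US number: {raw!r}")
    series, head, tail = m.groups()
    suffix = "F" if series == "29" else ""  # series code 29 = design applications
    return f"US{filing_date[:4]}{(head + tail).zfill(7)}{suffix}"


print(normalise_tw("083303675"))
print(normalise_at("GM647/96"))
print(normalise_us("US29/356,858", "20100303"))
```

Inputs that do not fit the expected shape raise an error instead of being guessed at, which matches the idea of routing exceptions to manual review.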

Page 12: Standardisation of patent data

• Formatting patent numbers and kind codes
• Formatting dates
  • Thailand uses Buddhist years (Gregorian calendar year plus 543)
  • US date format: 2011/09/02 (9 February 2011)
  • European date format: 2011/09/02 (2 September 2011)

[Example image; recoverable text: 2007]
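Two of the date issues above lend themselves to a short sketch: converting a Buddhist-era year to the Gregorian year, and reading the same digits as day-first or month-first. The target format `YYYYMMDD` and the function names are assumptions for illustration. The example string `09/02/2011` is also chosen here and is not taken from the slide.

```python
from datetime import date, datetime

BUDDHIST_ERA_OFFSET = 543  # Buddhist year = Gregorian year + 543


def buddhist_to_gregorian_year(be_year: int) -> int:
    """Convert a Thai Buddhist-era year to the Gregorian year."""
    return be_year - BUDDHIST_ERA_OFFSET


def parse_date(text: str, day_first: bool) -> date:
    """Parse 'xx/xx/YYYY' with an explicit day-first or month-first reading."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


def to_standard(d: date) -> str:
    """Render a date in one unambiguous form (assumed here: YYYYMMDD)."""
    return d.strftime("%Y%m%d")


print(buddhist_to_gregorian_year(2550))                       # 2007
print(to_standard(parse_date("09/02/2011", day_first=True)))  # 20110209
print(to_standard(parse_date("09/02/2011", day_first=False))) # 20110902
```

The source's convention has to be known per collection, because the string alone cannot reveal which reading is intended.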

Page 13: Standardisation of patent data

The EPO standardises names to assist searching.

PatBase contains both standard and non-standard names.

• Standard name: assigned by the EPO
• Non-standard name: whatever is filed or published on the patent

Standard       Non-standard
PIRELLI IND    PIRELI SPA
PIRELLI IND    PIRELLA SPA
PIRELLI IND    PIRELLE S P A
PIRELLI IND    PIRELLI DPA
PIRELLI IND    PIRELLI S p A
PIRELLI IND    PIRELLI S A
PIRELLI IND    PIRELLI S P A
PIRELLI IND    PIRELLI S P A FIRMA
PIRELLI IND    PIRELLI S P A IT
PIRELLI IND    PIRELLI S P CA
PIRELLI IND    PIRELLI SPA IT
PIRELLI IND    PIRELLI SPP
PIRELLI IND    PIRELLU SPA
PIRELLI IND    PIRELLY SPA

This is a small sample of the non-standard names to which the EPO assigns the standard name ‘Pirelli’.

There are currently 188 non-standard names for the standard name ‘Pirelli’.
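Keeping both name forms searchable amounts to a lookup from the published spelling to the standard name. The sketch below uses a few of the Pirelli variants listed on the slide. The record identifiers (`DOC-1` etc.) and function names are hypothetical, and the mapping table itself would come from the EPO rather than from code like this.

```python
from collections import defaultdict

# A few of the variants from the slide, keyed by the published spelling.
STANDARD_NAME = {
    "PIRELI SPA": "PIRELLI IND",
    "PIRELLA SPA": "PIRELLI IND",
    "PIRELLU SPA": "PIRELLI IND",
    "PIRELLY SPA": "PIRELLI IND",
}


def standardise(name: str) -> str:
    """Return the standard name, or the cleaned input if no mapping exists."""
    key = " ".join(name.upper().split())  # collapse case and spacing
    return STANDARD_NAME.get(key, key)


def index_by_standard_name(records):
    """records: iterable of (record_id, published_name) pairs."""
    index = defaultdict(list)
    for record_id, name in records:
        index[standardise(name)].append(record_id)
    return dict(index)


records = [("DOC-1", "Pireli SpA"), ("DOC-2", "PIRELLY  SPA"), ("DOC-3", "ACME LTD")]
print(index_by_standard_name(records))
# {'PIRELLI IND': ['DOC-1', 'DOC-2'], 'ACME LTD': ['DOC-3']}
```

A search on the standard name then finds records filed under any misspelling, while the published spelling stays available on each record.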

Page 14: Data Improvements

• Date formats
• All fields (e.g. patent classifications, assignees, text) have set parameters. Where these are not matched, data errors are flagged for manual editing.
• If a text is illegible (we have programmatic systems in place to measure this), it is not allowed into the database and is flagged as requiring manual attention (often manual typing).
• Character conversions
  • We have thousands of symbol / letter conversions in our programs:
    • & is replaced by and
    • œ is replaced by oe
    • ß is replaced by ss
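The character conversions listed above can be sketched with a translation table. Only the three rules from the slide are included, and the last one is assumed to refer to the German sharp s (ß). The names `CONVERSIONS` and `convert` are hypothetical.

```python
# Three of the conversions named on the slide; a real table would hold thousands.
CONVERSIONS = str.maketrans({
    "&": "and",
    "œ": "oe",
    "ß": "ss",  # assumed to be the German sharp s
})


def convert(text: str) -> str:
    """Apply the symbol / letter conversions before indexing."""
    return text.translate(CONVERSIONS)


print(convert("Strauß & Cœur"))  # Strauss and Coeur
```

`str.translate` applies all single-character rules in one pass, so the result does not depend on the order in which rules are listed.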

Page 15: Data Improvements

Insertion of paragraph breaks and paragraph numbers

[Figure: source text compared with the output in PatBase]

Page 16: Manual checks

• Errors appear in source data, so manual checks are essential
• Example: granted patent information from the Indian Patent Office Journal. Three different inventions have incorrectly been given the same publication number

IN000008

Page 17: Manual checks

Data quality issues

On the Thai patent office website, the same publication number is used for two different applications.

Patent copy for TH48405 A

In PatBase:
Application number: TH19981004295, Publication number: TH48406 A
Application number: TH19981002185, Publication number: TH48405 A

[Figure labels: Wrong number / Correct number]
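Cases like the Indian and Thai examples, where one publication number is attached to several applications, can be surfaced programmatically before a person decides which number is right. This is a minimal sketch under that assumption. The function name is hypothetical and the records reproduce the state described for the Thai patent office website.

```python
from collections import defaultdict


def find_reused_publication_numbers(records):
    """records: iterable of (publication_number, application_number) pairs.

    Returns the publication numbers that appear with more than one
    application number, mapped to the sorted applications involved.
    """
    applications = defaultdict(set)
    for publication, application in records:
        applications[publication].add(application)
    return {
        pub: sorted(apps)
        for pub, apps in applications.items()
        if len(apps) > 1
    }


# As found at the source: TH48405 A used for two different applications.
source_records = [
    ("TH48405 A", "TH19981004295"),
    ("TH48405 A", "TH19981002185"),
]
print(find_reused_publication_numbers(source_records))
# {'TH48405 A': ['TH19981002185', 'TH19981004295']}
```

The check flags the conflict only. Working out the correct number still requires the manual review the slide describes.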

Page 18: Identifying data errors

• Acquiring data from multiple sources lets us supplement records and also alerts us to errors, which helps ensure accuracy

KR20010012826 A: Glial Cell Line-Derived Neurotrophic Factor Receptors

KR20010112826 A: Single phase six pole DC brushless axial fan motor of transistor type

Source: EPO. Error in information: this EPO record is a combination of two inventions. The publication number does not match the invention.
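One way to let a second source flag such errors is to compare the same field across sources and queue large disagreements for review. The sketch below compares titles with `difflib.SequenceMatcher` from the Python standard library. The threshold of 0.6, the function names and the assignment of titles to "source A" and "source B" are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio between two titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_conflicts(source_a, source_b, threshold=0.6):
    """Return publication numbers whose titles disagree between two sources.

    Both arguments map publication number -> title. Only publications
    present in both sources are compared.
    """
    conflicts = []
    for pub in sorted(source_a.keys() & source_b.keys()):
        if title_similarity(source_a[pub], source_b[pub]) < threshold:
            conflicts.append(pub)
    return conflicts


# Hypothetical split of the slide's two titles across two sources.
source_a = {"KR20010012826 A": "Glial Cell Line-Derived Neurotrophic Factor Receptors"}
source_b = {
    "KR20010012826 A": "Single phase six pole DC brushless axial fan motor of transistor type"
}
print(find_conflicts(source_a, source_b))
```

A text-similarity threshold is a blunt instrument, so translated or abbreviated titles would need a more tolerant comparison.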

Page 19: Identifying data errors

Incorrect data received from source

NULL values were supplied as Applicants in the EPO’s DOCDB file

In cases such as these, we correct the error in PatBase and inform the EPO

Page 20: Identifying data errors

Example of an incorrect assignment from the USPTO

[Figures: PatBase family 41683901; excerpt from the USPTO assignments database]

Page 21: Translations

• Principle: the English text of an equivalent is always better than the machine translation
• All non-Latin texts are machine translated into English and indexed when added to PatBase
• On a rolling basis, we re-translate texts to benefit from the continuous improvements of translation engines
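The principle that an equivalent's English text beats a machine translation can be sketched as a simple preference rule when choosing the English text for a family. The `Text` record, its field names and the sample publication labels are hypothetical. Only the preference order comes from the slide.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Text:
    publication: str
    language: str            # language of `body`, e.g. "en"
    body: str
    machine_translated: bool = False


def best_english_text(texts: Iterable[Text]) -> Optional[Text]:
    """Prefer original English from an equivalent; fall back to a machine translation."""
    english = [t for t in texts if t.language == "en"]
    if not english:
        return None
    # False sorts before True, so original English wins over machine output.
    return min(english, key=lambda t: t.machine_translated)


family = [
    Text("KR-member", "en", "machine translated text", machine_translated=True),
    Text("US-member", "en", "original English text"),
]
print(best_english_text(family).publication)  # US-member
```

If a family has no English member, the machine translation is returned. If it has no English text at all, the function returns `None`, which would signal that a translation is still needed.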

Page 22: Translations

Machine translation

• Machine translations are generated as data is added, removed or rebuilt. This is all done before indexing.
• We run a rolling re-translate and re-index program to optimize the quality of our machine-translated full text

[Figures: two examples comparing the original translation with the re-translation, Thai into English]

Page 23: Translations

[Figure: original translation compared with the re-translation, Korean into English]

Page 24: Assignee translations

• Non-Latin assignees are indexed
• Non-Latin assignees are also translated
  • The first 10,000 CN and JP assignees have been manually translated by RWS
  • All others are machine translated until an “official” Latin name appears in the family
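The fallback order for assignee names described above can be written as a short chain: an official Latin name from the family first, then a manual translation, then machine translation. The function name, its parameters and the sample name are hypothetical, and the machine translation step is a stub standing in for a real engine.

```python
from typing import Callable, Mapping, Optional


def latin_assignee_name(
    original: str,
    official_latin: Optional[str],
    manual_translations: Mapping[str, str],
    machine_translate: Callable[[str], str],
) -> str:
    """Pick the Latin-script name for a non-Latin assignee."""
    if official_latin:                    # an official name appeared in the family
        return official_latin
    if original in manual_translations:   # one of the manually translated names
        return manual_translations[original]
    return machine_translate(original)    # last resort


def stub_engine(name: str) -> str:
    """Placeholder for a machine translation call."""
    return f"MT({name})"


manual = {"示例公司": "Example Co"}  # hypothetical entry
print(latin_assignee_name("示例公司", None, manual, stub_engine))          # Example Co
print(latin_assignee_name("示例公司", "Example Co Ltd", manual, stub_engine))  # Example Co Ltd
```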

Page 25: Cross-lingual Tool

• Initially developed by WIPO, CLIR (Cross Lingual Information Retrieval) allows our users to generate multilingual searches

• Using an advanced statistical text analysis system based on the PCT corpus, the cross-lingual search tool identifies variants in multiple languages for search terms entered by the user.

=> Better translation – translated words originate from PCT applications
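The effect of the cross-lingual tool on a query can be illustrated as term expansion: each search term is joined with its variants in other languages. In this sketch the variant dictionary is a hand-written stand-in for the statistical system built on the PCT corpus. The `OR` syntax and all names are assumptions, not the PatBase query language.

```python
def expand_query(term, variants, languages):
    """Build an OR query from a term and its variants in the chosen languages.

    `variants` maps term -> {language code -> list of variants}.
    """
    terms = [term]
    for lang in languages:
        terms.extend(variants.get(term, {}).get(lang, []))
    return " OR ".join(f'"{t}"' for t in terms)


# Hand-written stand-in for variants learned from a parallel corpus.
VARIANTS = {"bicycle": {"fr": ["bicyclette", "vélo"], "de": ["Fahrrad"]}}

print(expand_query("bicycle", VARIANTS, ["fr", "de"]))
# "bicycle" OR "bicyclette" OR "vélo" OR "Fahrrad"
```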

Page 26: Legal Status

• Source: INPADOC
• All legal status events are categorised with a PRS code
• Challenge: 2628 different PRS codes, some no longer in use
• Solution: Grouping similar legal events together:
  • Reassignment
  • Deemed Withdrawn/Abandoned
  • Examined
  • Renewal Fees Paid
  • Granted
  • Lapsed/Expired/Ceased/Dead
  • Licence
  • Non-Entry into National Phase
  • National Phase Entry
  • Opposition Filed/Request for revocation
  • Published
  • Restored/Reinstated/Amended
  • Revoked/Rejected/Annulled/Invalid
  • Withdrawn/Abandoned/Terminated/Void
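Grouping thousands of event codes into a handful of categories is essentially a lookup table applied to each event. In the sketch below the group names come from the slide, but the codes `X001` to `X003` are placeholders and not real PRS codes. The "Ungrouped" fallback and the function name are also assumptions.

```python
# Placeholder codes for illustration; real PRS codes are not reproduced here.
GROUP_BY_CODE = {
    "X001": "Granted",
    "X002": "Renewal Fees Paid",
    "X003": "Lapsed/Expired/Ceased/Dead",
}


def group_events(events):
    """events: iterable of (iso_date, code) pairs.

    Returns the events in date order with each code replaced by its group.
    Unknown or retired codes fall into 'Ungrouped' for later review.
    """
    return [
        (event_date, GROUP_BY_CODE.get(code, "Ungrouped"))
        for event_date, code in sorted(events)
    ]


events = [("2012-05-01", "X002"), ("2010-03-15", "X001"), ("2013-05-01", "X999")]
for event_date, group in group_events(events):
    print(event_date, group)
```

Sorting the grouped events by date also gives the raw material for a timeline view.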

Page 27: Legal Status Timeline

Page 28: Analytics

• Most patent databases are structured and optimized for patent searching, not for analytics
• At Minesoft, we developed a special database with proprietary meta tags dedicated to analytics
• Coverage is important: beware of data gaps
• Importance of a web service (API)
• Importance of incorporating your own custom data or legal status information in your analysis

Page 29

Page 30

Thank you

PatBase celebrates its 10th anniversary

Olivier Huc