Post on 20-Apr-2020

Page 1: MadFast Similarity Search · What is MadFast Similarity Search? • A young product, released in last December • Engine for fast similarity searching with efficient in-memory storage

MadFast Similarity Search

Gábor Imre

Agenda

• Short introduction to MadFast

• Demo:

– Getting started

– Using Web UI / REST API

– Using the command line

– Searching large datasets

• Roadmap

• Q&A

Introduction: a short one

How fast?

• Very.

• Some numbers (on an EC2 c3.8xlarge / r3.8xlarge instance):

– Search for a few tens of most similar structures for a single query: 200 M targets / s

– Prepare a single fingerprint: 1 M targets / min

– Read prepared data into memory: 1 M targets / s

– Memory use: ~250 MB per 1 M targets (1024-bit fingerprints)

• What does it mean?

– Real-time similarity search of tens of millions of structures, even on a desktop

– Or handle as many as 1 B structures on an r3.8xlarge instance

– Or provide near-real-time search of 1 B structures (~5 s / query)

– Or do an exhaustive 1 M x 1 M similarity search in 30 minutes
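
The consequences listed above follow from the quoted rates by simple arithmetic. A minimal sketch of that arithmetic in Python; only the rates come from the slide, while the function and variable names are illustrative:

```python
# Back-of-the-envelope checks using only the figures quoted on the slide.
SEARCH_RATE = 200e6        # targets compared per second for a single query
MEM_PER_MILLION_MB = 250   # MB of memory per 1 M targets (1024-bit fingerprints)
LOAD_RATE = 1e6            # targets read into memory per second


def memory_gb(n_targets):
    """Estimated memory footprint in GB (taking 1 GB = 1000 MB)."""
    return n_targets / 1e6 * MEM_PER_MILLION_MB / 1000


def single_query_seconds(n_targets):
    """Estimated time for one query over n_targets."""
    return n_targets / SEARCH_RATE


def load_minutes(n_targets):
    """Estimated time to read prepared data into memory."""
    return n_targets / LOAD_RATE / 60


one_billion = 1_000_000_000
print(memory_gb(one_billion))             # 250.0 (the r3.8xlarge has 244 GiB, about 262 GB)
print(single_query_seconds(one_billion))  # 5.0
print(round(load_minutes(one_billion), 1))

# The bits alone: 1024 bits = 128 bytes per fingerprint, i.e. 128 MB per 1 M targets.
raw_mb_per_million = 1024 / 8 * 1e6 / 1e6

# Comparisons per second implied by a 1 M x 1 M search finishing in 30 minutes.
implied_rate = (1e6 * 1e6) / (30 * 60)
print(f"{implied_rate:.2e}")
```

The quoted ~250 MB per million targets is roughly twice the 128 MB taken by the fingerprint bits alone; the slide does not break down what the remainder is used for.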

What is MadFast Similarity Search?

• A young product, released last December

• Engine for fast similarity searching with efficient in-memory storage

– Binary and float vector descriptors with various metrics

• Fast descriptor (fingerprint) calculation is also provided

– CFP, ECFP and MACCS-166 are included; externally calculated fingerprints can also be used

• Collection of stand-alone tools implemented in Java

• Provides CLI, Web UI and REST API interfaces
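
To make "similarity search over binary descriptors" concrete, here is a small illustrative sketch in Python. It assumes the Tanimoto metric, which is a common choice for binary fingerprints but is not named on the slide, and it stores each fingerprint as a plain Python integer. It is not MadFast's implementation, only the general idea of ranking targets by dissimilarity to a query.

```python
import heapq


def tanimoto_dissimilarity(a: int, b: int) -> float:
    """1 - |a AND b| / |a OR b| for two bit vectors stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here: two empty fingerprints count as identical
    common = bin(a & b).count("1")
    return 1.0 - common / union


def most_similar(query: int, targets: list, count: int) -> list:
    """Return (dissimilarity, target index) pairs for the `count` closest targets."""
    scored = ((tanimoto_dissimilarity(query, fp), i) for i, fp in enumerate(targets))
    return heapq.nsmallest(count, scored)


# Toy 8-bit fingerprints; real ones would be e.g. 1024 bits long.
targets = [0b11110000, 0b11111100, 0b00001111, 0b11110000]
hits = most_similar(0b11110000, targets, count=2)
print(hits)  # [(0.0, 0), (0.0, 3)]
```

A dissimilarity of 0.0 means the two fingerprints are identical, which is why an exact match in the later examples is reported with 0.0.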

Speed of typical tasks

Getting started is simple

On the product page

At https://www.chemaxon.com/products/madfast

Follow the “Download MadFast” link

To the download page

Where a .tar distribution is available.

You will need Linux, or Windows with Cygwin, to use it.

You will need

• Oracle Java 1.8 installed

• Cygwin installed (on Windows)

• A ChemAxon license file

– free evaluation: [email protected]

– copy it to ~/.chemaxon/license.cxl or c:\Users\<username>\chemaxon\license.cxl

• The .tar file unpacked (we go with Windows + Cygwin from now on)

The Web UI and the REST API

No further installation required.

• Run a command to check that Java works correctly: bin/searchStorage.sh -h

• Launch a self-contained example: examples/rest-api-example.sh
  and connect to the launched embedded server at http://localhost:8085/

• We will explore the Web UI using a more meaningful dataset. Launch examples/rest-api-small.sh
  then connect to http://localhost:8085/

Under the hood

Focused chemical space exploration

Or query through the REST API

curl \
  -X POST \
  -d "count=4" \
  --data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
  -g \
  "http://localhost:8085/rest/descriptors/nci-250k-cfp7/find-most-similars" | python -m json.tool

{
    "query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
    "querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
    "searchtime": 16,
    "targetcount": 249081,
    "targets": [
        {
            "base64img": null,
            "dissimilarity": 0.0,
            "targetid": "NCI8117",
            "targetimageurl": "rest/molecules/nci-250k/7975/png-or-placeholder?w=100&h=100",
            "targetindex": 7975,
            "targetmolurl": "rest/molecules/nci-250k/7975"
        },
        ...
    ]
}
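
The same request can be sent from a script. Below is a minimal sketch using only the Python standard library. It assumes the endpoint and the two form fields (count, query) behave exactly as in the curl call above; the helper names build_request, summarize and find_most_similars are illustrative and are not part of MadFast.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; it only answers while the example server runs.
URL = "http://localhost:8085/rest/descriptors/nci-250k-cfp7/find-most-similars"


def build_request(smiles, count, url=URL):
    """Build the same form-encoded POST that the curl command sends."""
    body = urllib.parse.urlencode({"count": count, "query": smiles}).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def summarize(response_text):
    """Extract (targetid, dissimilarity) pairs from a response shaped like the one above."""
    doc = json.loads(response_text)
    return [(t["targetid"], t["dissimilarity"]) for t in doc["targets"]]


def find_most_similars(smiles, count=4):
    """Send the query and summarize the hits (requires a running server)."""
    with urllib.request.urlopen(build_request(smiles, count)) as resp:
        return summarize(resp.read().decode("utf-8"))


# Offline check against a trimmed copy of the response shown on the slide.
sample = '{"targets": [{"targetid": "NCI8117", "dissimilarity": 0.0, "targetindex": 7975}]}'
print(summarize(sample))  # [('NCI8117', 0.0)]
```

With the example server running, find_most_similars("C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O") would issue the request; urlencode takes care of escaping the brackets and other special characters in the SMILES string.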

The command line

Multi query search

gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
  bin/searchStorage.sh \
    -tmf - \
    -qmf data/molecules/vitamins/vitamins.smi \
    -context createSimpleCfp7Context

Query  Target  Dissimilarity
0      54      0.0
1      409     0.14814814814814814
2      6031    0.02631578947368421
3      44      0.0
4      32      0.0
5      513     0.0
...

More hits, IDs, formatting, out file

gzip -dc data/molecules/drugbank/drugbank-common_name.smi.gz | \
  bin/searchStorage.sh \
    -tmf - \
    -qmf data/molecules/vitamins/vitamins.smi \
    -context createSimpleCfp7Context \
    -mode MOSTSIMILARS -count 3 \
    -qidname -tidname -out-numeric-format "%.3f" -out res.txt

cat res.txt

Query                      Target  Dissimilarity
Vitamin A - Retinol        Vitamin A  0.000
Vitamin A - Retinol        Alitretinoin  0.148
Vitamin A - Retinol        Tretinoin  0.148
Vitamin A - Retinal        Alitretinoin  0.148
Vitamin A - Retinal        Tretinoin  0.148
Vitamin A - Retinal        Isotretinoin  0.148
Vitamin A - beta-Carotene  1,3,3-trimethyl-2-[(1E,3E)-3-methylpenta-1,3-dien-1-yl]cyclohexene  0.026
Vitamin A - beta-Carotene  Vitamin A  0.174
Vitamin A - beta-Carotene  (6e)-6-[(2e,4e,6e)-3,7-Dimethylnona-2,4,6,8-Tetraenylidene]-1,5,5-Trimethylcyclohexene  0.200
Vitamin B1 - Thiamine      Thiamine  0.000
Vitamin B1 - Thiamine      Thiamin Phosphate  0.158
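
A result file like this is easy to post-process. The sketch below groups the hits by query name. It assumes the three columns are separated by tab characters, which the slide does not confirm (names contain spaces, so some unambiguous delimiter is needed); the function name hits_by_query is illustrative.

```python
import csv
import io
from collections import defaultdict


def hits_by_query(lines):
    """Group (target, dissimilarity) pairs under each query name."""
    reader = csv.reader(lines, delimiter="\t")  # assumed delimiter: tab
    next(reader)  # skip the "Query / Target / Dissimilarity" header row
    grouped = defaultdict(list)
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        query, target, dissimilarity = row
        grouped[query].append((target, float(dissimilarity)))
    return dict(grouped)


# A few rows from the slide, re-typed with tabs between the columns.
sample = io.StringIO(
    "Query\tTarget\tDissimilarity\n"
    "Vitamin A - Retinol\tVitamin A\t0.000\n"
    "Vitamin A - Retinol\tAlitretinoin\t0.148\n"
    "Vitamin B1 - Thiamine\tThiamine\t0.000\n"
)
grouped = hits_by_query(sample)
for query, hits in grouped.items():
    print(query, hits)
```

For the real file, open it with open("res.txt", newline="") and pass the file object to hits_by_query; adjust the delimiter if the actual output uses something other than tabs.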

Heatmap

bin/searchStorage.sh \
  -context createSimpleCfp7Context \
  -qmf data/molecules/vitamins/vitamins.smi \
  -qidname \
  -tmf data/molecules/vitamins/vitamins.smi \
  -tidname \
  -mode FULLMATRIX \
  -out vitamins-fullmatrix.txt \
  -heatmap-image vitamins-fullmatrix.png \
  -heatmap-image-cellsize 15 \
  -heatmap-image-query-ids-length 250 \
  -heatmap-image-target-ids-length 250

Further inputs for search

Search ~1B structures

A server with GDB-13 launched

• An r3.8xlarge instance on Amazon EC2 is running during this webinar: 32 vCPUs, 244 GiB RAM, currently $2.66 / hour

• Near-real-time search: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/gdb-13-cfp7&dist=hide

• Plus SureChEMBL: http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/simsearch.html?ref=rest/descriptors/surechembl-cfp7/

REST API is also available

curl \
  -X POST \
  -d "count=100" \
  --data-urlencode "query=C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O" \
  -g \
  http://ec2-54-74-38-126.eu-west-1.compute.amazonaws.com:8081/rest/descriptors/gdb-13-cfp7/find-most-similars | python -m json.tool

{
    "query": "C([C@@H]([C@@H]1C(=C(C(=O)O1)O)O)O)O",
    "querysmi": "OC[C@H](O)[C@H]1OC(=O)C(O)=C1O",
    "searchtime": 3917,
    "targetcount": 977468301,
    "targets": [
        {
            "base64img": null,
            "dissimilarity": 0.20270270270270271,
            "targetid": "MOLECULE-043953590",
            "targetimageurl": "rest/molecules/gdb-13/43953590/png-or-placeholder?w=100&h=100",
            "targetindex": 43953590,
            "targetmolurl": "rest/molecules/gdb-13/43953590"
        },
        ...
    ]
}
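
The searchtime and targetcount fields give a rough throughput figure for each response. A small sketch, assuming searchtime is reported in milliseconds (the unit is not stated on the slides, but milliseconds would match the "~5 s / query" claim for 1 B structures):

```python
def targets_per_second(target_count, searchtime_ms):
    """Targets compared per second, assuming searchtime is in milliseconds."""
    return target_count / (searchtime_ms / 1000.0)


# Values copied from the two JSON responses shown in this deck.
gdb13 = targets_per_second(977_468_301, 3917)
nci = targets_per_second(249_081, 16)

print(f"GDB-13:   {gdb13 / 1e6:.0f} M targets / s")   # GDB-13:   250 M targets / s
print(f"NCI-250k: {nci / 1e6:.1f} M targets / s")     # NCI-250k: 15.6 M targets / s
```

Under that assumption, the GDB-13 response is in line with the 200 M targets / s figure quoted earlier.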

Check out our study

• Poster presented at Fragments 2017, available at https://www.chemaxon.com/library/similarity-implicated-exploration-of-the-fragment-galaxy/

• Using MadFast to search for drug analogues among 977 M targets from GDB-13

• The analogues are assessed for patent coverage by searching the 16 M structures in SureChEMBL

• After 4 h of setup time: 20 s / query (assessment of the 100 best analogues)

• Overlap visualization concepts

What's next: roadmap, development directions

Roadmap

• Overlap analysis visualization

• Clustering

- Real time clustering

- Similarity based hierarchical clustering

- On all interfaces

• Query remote DB using JDBC

• Single desktop UI release.

• Public Java API components for developers

Feedback welcome

• What are your use cases? What are your pain points? Distribution, deployments, platforms, interfaces

• The proposed roadmap: feedback on priorities, functionalities, what's missing

• Syncing a remote DB over JDBC: requirements, data sizes, update frequency, update patterns

• Interactive clustering: typical set sizes, method preferences, workflows

• MadFast Substructure Search: query semantics, use cases, requirements

Further resources

Product page

https://www.chemaxon.com/products/madfast

From https://chemaxon.com follow Products ⇒ Discovery Toolkit ⇒ MadFast ...

Contains

- Introduction, overview

- Links to download, documentation

- Link to online demo

Documentation

Detailed documentation, including

- Step-by-step getting started guide.

- Walkthrough of typical use cases

- Advanced topics

- Java / REST API docs

Also available in the downloaded distribution.

https://disco.chemaxon.com/products/madfast/latest/

Online demo

One of the examples from the distribution is available online at https://disco.chemaxon.com/madfast-demo

Questions

Please help us during Q&A

By answering three questionnaires regarding

• The distribution

• The interfaces

• The roadmap

The roadmap

We would be interested in

• Similarity based overlap visualization

• Clustering provided with CLI/REST API

• Real time clustering

• Similarity based diverse selection

• Syncing to existing databases over JDBC

Please tell us about the typical set sizes you prefer in the chat.

The interfaces

• Would like to use a Java client library to connect to the MadFast REST API

• Would like to use a Java API to embed MadFast

• Web UI would be needed to access all functionalities

• Full featured desktop GUI would be needed to access all functionalities

• Authentication/authorization would be needed on Web UI / REST API

The distribution

• Would need Windows .zip distribution / .bat starter scripts

• Would need Linux installer

• Would need Windows installer

• Would need MacOS installer

THANK YOU