Memex - PyData Seattle

© 2015 Continuum Analytics - Confidential & Proprietary. Memex: Mining the Deep Web. Katrina Riehl, PhD, Data Scientist, Continuum Analytics. July 25, 2015.

TRANSCRIPT

Page 1: Memex - PyData Seattle

© 2015 Continuum Analytics - Confidential & Proprietary

Memex: Mining the Deep Web

Katrina Riehl, PhD
Data Scientist
Continuum Analytics

July 25, 2015

Page 2: Memex - PyData Seattle

Explaining THE DEEP WEB

Page 3: Memex - PyData Seattle

When you ask the internet a question, who is answering?

Page 4: Memex - PyData Seattle
Page 5: Memex - PyData Seattle
Page 6: Memex - PyData Seattle
Page 7: Memex - PyData Seattle

An introduction to DARPA MEMEX

Page 8: Memex - PyData Seattle

What is MEMEX?
• Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.
• DARPA launched the Memex program in September 2014.
• Memex seeks to develop software that advances online search capabilities
• Creation of a new domain-specific indexing and search paradigm
  • content discovery
  • information extraction
  • information retrieval
  • user collaboration
• Extension of current search capabilities
  • deep web
  • dark web
  • nontraditional (e.g. multimedia) content

Page 9: Memex - PyData Seattle
Page 10: Memex - PyData Seattle

Memex Search Domains
• Human/Labor Trafficking
• Weapons
• Material Research Science
• Financial Fraud
• Counterfeit Electronics
• Patent Trolling
• Child Exploitation

Page 11: Memex - PyData Seattle

http://opencatalog.darpa.mil

Page 12: Memex - PyData Seattle

LARGE SCALE DATA ANALYTICS
An Overview of the Ecosystem

Page 13: Memex - PyData Seattle

Ecosystem diagram. Areas shown: BI - DB, DM/Stats/ML, Scientific Computing, Distributed Systems. Legible tool labels: Numba, bcolz, RHadoop.

Page 14: Memex - PyData Seattle

THE ANALYTICS PIPELINE

Page 15: Memex - PyData Seattle

Analytics Pipeline
• Web Crawlers & Scrapers
• Entity Extractors
• Indexers
• Visual Analytics
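The four pipeline stages can be sketched end to end in a few lines. This is a toy illustration only, not the Memex tooling: the `PAGES` dictionary stands in for the web, the phone-number regex stands in for a real entity extractor, and every name here (`crawl`, `extract_entities`, `build_index`, the `example.test` URLs) is an assumption made for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
A = "http://example.test/a"
B = "http://example.test/b"
PAGES = {
    A: ("Call 555-0100 about the listing", [B]),
    B: ("Listing moved, call 555-0199 or 555-0100", []),
}

PHONE = re.compile(r"\b\d{3}-\d{4}\b")  # toy entity pattern, not production-grade


def crawl(seed):
    """Breadth-first crawl over PAGES, yielding (url, text) once per URL."""
    seen, queue = {seed}, [seed]
    while queue:
        url = queue.pop(0)
        text, links = PAGES[url]
        yield url, text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)


def extract_entities(text):
    """Return phone-number-like strings found in the text."""
    return PHONE.findall(text)


def build_index(docs):
    """Build an inverted index: lowercase token -> set of URLs."""
    index = defaultdict(set)
    for url, text in docs:
        for token in re.findall(r"[a-z0-9-]+", text.lower()):
            index[token].add(url)
    return index


docs = list(crawl(A))                      # crawl
index = build_index(docs)                  # index
entity_counts = Counter(                   # extract, then summarize
    e for _, text in docs for e in extract_entities(text)
)
```

The `entity_counts` tally is the kind of aggregate a visual analytics layer would chart; a lookup such as `index["listing"]` returns the set of crawled URLs that contain the term.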

Page 16: Memex - PyData Seattle

Memex Explorer
• Pluggable Framework for Crawling & Data Discovery
• Django Web Application
• Elasticsearch Index
• Bokeh Visualizations for Crawling Stats
• Kibana Dashboards for Initial Data Exploration
• Apache Nutch Crawler
• NYU ACHE Crawler
• NYU Domain Discovery Tool

Page 17: Memex - PyData Seattle
Page 18: Memex - PyData Seattle
Page 19: Memex - PyData Seattle
Page 20: Memex - PyData Seattle
Page 21: Memex - PyData Seattle
Page 22: Memex - PyData Seattle

Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json

Abstract expressions: selection, filter, group by, join, column wise

Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy
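The idea behind the diagram, one abstract expression evaluated by interchangeable computational backends, can be sketched with the standard library alone. This is an illustrative assumption, not the API of any tool named in the slides: the `QUERY` dictionary, the `run_python` and `run_sql` functions, and the sample rows are all hypothetical. A filter plus group-by is described once, then run by plain streaming Python and by SQLite.

```python
import sqlite3
from collections import defaultdict

# Hypothetical records; field names are invented for this example.
ROWS = [
    {"domain": "weapons", "pages": 12},
    {"domain": "fraud", "pages": 7},
    {"domain": "weapons", "pages": 5},
    {"domain": "fraud", "pages": 1},
]

# Backend-neutral description: keep rows with pages >= 5,
# then group by domain and sum the pages column.
QUERY = {"where": ("pages", 5), "group_by": "domain", "sum": "pages"}


def run_python(query, rows):
    """Evaluate the query by streaming over plain Python dicts."""
    column, minimum = query["where"]
    totals = defaultdict(int)
    for row in rows:
        if row[column] >= minimum:
            totals[row[query["group_by"]]] += row[query["sum"]]
    return dict(totals)


def run_sql(query, rows):
    """Evaluate the same query by translating it to SQL for SQLite."""
    column, minimum = query["where"]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (domain TEXT, pages INTEGER)")
    conn.executemany("INSERT INTO t VALUES (:domain, :pages)", rows)
    # Column names come from the fixed QUERY above, never from user input.
    sql = "SELECT {g}, SUM({s}) FROM t WHERE {c} >= ? GROUP BY {g}".format(
        g=query["group_by"], s=query["sum"], c=column
    )
    result = dict(conn.execute(sql, (minimum,)).fetchall())
    conn.close()
    return result
```

Calling `run_python(QUERY, ROWS)` and `run_sql(QUERY, ROWS)` gives the same totals per domain, which is the point of separating the expression from the backend that computes it.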

Page 23: Memex - PyData Seattle
Page 24: Memex - PyData Seattle
Page 25: Memex - PyData Seattle

DATA ANALYSIS

Page 26: Memex - PyData Seattle

Topic Modeling

Page 27: Memex - PyData Seattle

Topic Modeling

Page 28: Memex - PyData Seattle

Topic Modeling

Page 29: Memex - PyData Seattle
Page 30: Memex - PyData Seattle

QUESTIONS?
Thank you!!