Memex - PyData New York 2015

Memex: Mining the Deep Web

Katrina Riehl, PhD
Sr. Data Scientist, Continuum Analytics

November 9, 2015

© 2015 Continuum Analytics - Confidential & Proprietary



Explaining THE DEEP WEB

When you ask the internet a question, who is answering?


An introduction to DARPA MEMEX

What is MEMEX?


• Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.

• DARPA launched the Memex program in September 2014.

• Memex seeks to develop software that advances online search capabilities.

• Creation of a new domain-specific indexing and search paradigm:
    • content discovery
    • information extraction
    • information retrieval
    • user collaboration

• Extension of current search capabilities to:
    • the deep web
    • the dark web
    • nontraditional (e.g. multimedia) content


Memex Search Domains

• Human/Labor Trafficking
• Child Exploitation
• Weapons
• Illicit Pharmaceuticals
• Material Research Science
• Autonomous Systems Research
• Financial Fraud
• Counterfeit Electronics

http://opencatalog.darpa.mil


LARGE SCALE DATA ANALYTICS: An Overview of the Ecosystem

[Diagram: the large-scale data analytics ecosystem, spanning BI - DB, DM/Stats/ML, Scientific Computing, and Distributed Systems. Surviving tool labels: Numba, bcolz, RHadoop.]

THE ANALYTICS PIPELINE

Analytics Pipeline


• Web Crawlers & Scrapers
• Entity Extractors
• Indexers
• Visual Analytics
• Search Applications
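The stages listed above can be sketched end to end with only the Python standard library. This is a toy illustration, not Memex code: the pages, URLs, regular expressions, and function names are all hypothetical, the crawler stage is replaced by an in-memory dictionary of HTML, and the visual analytics stage is omitted.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical stand-in for crawler output: URL -> raw HTML.
PAGES = {
    "http://example.com/a": (
        "<html><body><h1>Parts for sale</h1>"
        "<p>Call 555-123-4567 or mail seller@example.com</p></body></html>"
    ),
    "http://example.com/b": (
        "<html><body><p>Research notes on autonomous systems. "
        "Contact lab@example.org</p></body></html>"
    ),
}


class TextExtractor(HTMLParser):
    """Scraper stage: collect the text nodes of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def scrape(html):
    """Return the page text with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Entity extractor stage: deliberately crude patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_entities(text):
    return {"emails": EMAIL_RE.findall(text), "phones": PHONE_RE.findall(text)}


def build_index(docs):
    """Indexer stage: inverted index mapping each token to the URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)
    return index


def search(index, query):
    """Search stage: return URLs that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


docs = {url: scrape(html) for url, html in PAGES.items()}
entities = {url: extract_entities(text) for url, text in docs.items()}
index = build_index(docs)
```

With the sample pages, `search(index, "autonomous systems")` returns only the second URL, and `entities` holds one email address and one phone number for the first page. A real deployment would replace each function with a dedicated component; the point here is only how the stages hand data to one another.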

Memex Explorer


• Pluggable Framework for Crawling & Data Discovery
• Django Web Application
• Elasticsearch Index
• Bokeh Visualizations for Crawling Stats
• Kibana Dashboards for Initial Data Exploration
• Apache Nutch Crawler
• NYU ACHE Crawler
• NYU Domain Discovery Tool
• Collaborations
    • CMU Time Anomaly Detection Tool
    • Sotera Datawake Plug-in


[Diagram: three layers.
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json.
Abstract expressions: selection, filter, group by, join, column wise.
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy.]
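The layering in this diagram (data storage, abstract expressions, computational backend) can be illustrated with a small standard-library sketch: one backend-neutral description of a filter plus group-by is evaluated once in plain Python and once in SQL. Everything here is an assumption made for illustration, including the `Query` class, the sample rows, and the two `run_in_*` functions; it is not the API of any library named on the slide.

```python
import sqlite3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Abstract expression: keep rows at or above a threshold, then count per group."""
    table: str
    filter_column: str
    min_value: float
    group_by: str


# Hypothetical sample data.
ROWS = [
    {"domain": "weapons", "site": "a.example", "score": 0.9},
    {"domain": "weapons", "site": "b.example", "score": 0.2},
    {"domain": "fraud", "site": "c.example", "score": 0.7},
    {"domain": "fraud", "site": "d.example", "score": 0.8},
]


def run_in_python(query, rows):
    """Backend 1: evaluate the expression over an in-memory list of dicts."""
    counts = defaultdict(int)
    for row in rows:
        if row[query.filter_column] >= query.min_value:
            counts[row[query.group_by]] += 1
    return dict(counts)


def run_in_sqlite(query, conn):
    """Backend 2: translate the same expression to SQL.

    Identifiers cannot be bound as SQL parameters, so the column and table
    names are interpolated and must come from trusted code; the threshold
    value is bound with a placeholder.
    """
    sql = (
        f'SELECT "{query.group_by}", COUNT(*) FROM "{query.table}" '
        f'WHERE "{query.filter_column}" >= ? GROUP BY "{query.group_by}"'
    )
    return dict(conn.execute(sql, (query.min_value,)).fetchall())


# Storage layer for the SQL backend: an in-memory SQLite table with the same rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (domain TEXT, site TEXT, score REAL)")
conn.executemany("INSERT INTO pages VALUES (:domain, :site, :score)", ROWS)

query = Query(table="pages", filter_column="score", min_value=0.5, group_by="domain")
python_result = run_in_python(query, ROWS)
sqlite_result = run_in_sqlite(query, conn)
```

Both backends produce the same counts per domain, which is the property the layered design is after: the expression is written once and the choice of storage and compute engine is made separately.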


DATA ANALYSIS

Topic Modeling


QUESTIONS? Thank you!!