
  • Permission to reprint or distribute any content from this presentation requires the prior written approval of S&P Capital IQ. Not for distribution to the public.

    Copyright 2012 Standard & Poor's Financial Services LLC, a subsidiary of The McGraw-Hill Companies, Inc. All rights reserved.

    Big Data: Wall Street Style

    Jeff Sternberg

    Jen Zeralli

    S&P Capital IQ

    February 29, 2012

  • 2

    Boring Financial Chart

  • 3

    Boring Financial Chart: less boring with labels

    As of 2/24/2012.

  • 4

    Boring Financial Chart = kind of interesting, actually

    More than $2.35 trillion invested in Information Technology over the last 10 years.

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.

  • 5

    How Does That Compare?

    Total Investment over the last 10 years:

    Industrials = $3.49 trillion

    Energy = $2.61 trillion

    Healthcare = $2.47 trillion

    Information Technology = $2.35 trillion

    Telecom = $2.13 trillion

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.

  • 6

    So Is Big Data

    Big Money?

  • 7

    Big Money?

    Total Investment over the last three years:

    Information Technology = $774.4 billion

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.

  • 8

    Big Money?

    Total Investment over the last three years:

    Information Technology = $774.4 billion

    Big Data = $32.4 billion

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.

  • 9

    Big Money?

    Total Investment over the last three years:

    Information Technology = $774.4 billion

    Big Data = $32.4 billion

    So, 4.2%

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.

  • 10

    Big Money?

    Total Investment over the last three years:

    Information Technology = $774.4 billion

    Big Data = $32.4 billion

    So, 4.2%

    Hey, at least we're not just the 1%

    Source: S&P Capital IQ Transaction Screening. As of 2/24/2012.
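
The 4.2% on this slide is simply the ratio of the two totals quoted above. A quick check, using variable names that are ours rather than the presenters':

```python
# Totals quoted on the slide, in billions of US dollars (last three years).
it_investment_bn = 774.4
big_data_investment_bn = 32.4

# Big Data's share of total Information Technology investment.
share = big_data_investment_bn / it_investment_bn
print(f"{share:.1%}")  # 4.2%
```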

  • 11

    But What We Really Wanted To Talk About

    Strata: Making Data Work

    February 29, 2012

  • 12

    But What We Really Wanted To Talk About

    S&P Capital IQ: Data Is Our Product

    About Data Collection

    Standardization

    Linking: The Curious, Special Case of Entities

    Suggesting Data

    Projections

  • 13

    S&P Capital IQ: Data Is Our Product

    Strata: Making Data Work

    February 29, 2012

  • 14

    Data Is Our Product

  • 15

    Data Is Our Product

    Capital IQ started as an investment bank in 1999*

    Data = competitive advantage over other banks

    Built a database of financial investments, relationships and transactions

    *Acquired by Standard & Poor's in 2004, now part of S&P Capital IQ.

  • 16

    Hey, Let's Sell That!

    For illustrative purposes only. Source: S&P Capital IQ as of 2/2012.

  • 17

    Data Is Our Product: What We Offer

    Datasets:

    Financials and Valuation

    Qualitative Data

    Global Market Data

    Sell-Side Research

    Earnings Estimates

    News and Events

    Fixed Income

    Alpha and Risk Models

    Use Cases:

    Research Companies

    Generate Ideas

    Build Models

    Monitor Markets

    Analyze Performance

    Quantitative Research

    Tools:

    Web Portal

    Real-Time Workstation

    ClariFi

    Mobile

    Data Feeds

    Web Services

    Office Plug-Ins

  • 18

    Data Is Our Product: Who We Help

    Investment Bankers

    Asset Managers

    Private Equity Firms

    Venture Capital Firms

    Credit/Equity Analysts

    Corporations

    Consultants and Advisors

    Academia & Government

  • 19

    Data Is Our Product: Some Stats

    Company and Person Profiles

    Companies with full quantitative data: 100,000

    Private company profiles: 2.7 million

    Professionals and board members: 4.2 million

    Quantitative data points per company: 5,000

    Qualitative data points per company: 1,500

    Transactions

    M&A transactions: 425,000

    Private placements: 190,000

    Public offerings: 138,000

    News and Key Developments

    Daily news articles across 184 countries: 16,000

    Key Developments (curated news): 9.7 million

    As of 2/2/2012.

  • 20

    Data Is Our Product

    DEMO

  • 21

    About Data Collection

    Strata: Making Data Work

    February 29, 2012

  • 22

    About Data Collection

    To Have A Data Product, One Must First Acquire Data.

  • 23

    About Data Collection

    Data Collection Goals

    Coverage

    Quality

    Timeliness

    Auditability

  • 24

    About Data Collection

    It starts with documents: 67,000 per day

    Sources

    Company filings (SEC)

    News feeds (press releases)

    Web crawling

    We store these in our document repository

  • 25

    About Data Collection

    Document repository

    SQL for metadata

    Regular file storage for docs

    Solr/Lucene indexing for fast search

    99.3 million documents

    240.3 million versions (files)

    As of 2/24/2012.

  • 26

    About Data Collection

    Document Repository = SQL db + Filesystem

    SQL db:

    Document_tbl: documentID int PK, sourceID smallint FK

    Version_tbl: versionID int PK, documentID int FK, rootID smallint FK, versionIndex smallint, filePath varchar(100)

    Element_tbl: elementID int PK, [doc/vers/rel]ID int FK, typeID int FK, value [strongly typed]

    ObjectRel_tbl: relID int PK, documentID int FK, objectID int FK

    Filesystem: html, pdf, text, sgml, ...
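
To make the split between SQL metadata and file storage concrete, here is a minimal sketch of three of the tables above in SQLite. The column names come from the slide; the SQLite type mapping, the sample rows and file paths are assumptions for illustration, and Element_tbl is left out because the slide does not pin down its polymorphic key or its value types.

```python
import sqlite3

# Hypothetical DDL built from the column names on the slide.
# int/smallint are both mapped to SQLite INTEGER; the lookup tables
# that sourceID, rootID and objectID point to are not shown in the deck.
SCHEMA = """
CREATE TABLE Document_tbl (
    documentID INTEGER PRIMARY KEY,
    sourceID   INTEGER
);
CREATE TABLE Version_tbl (
    versionID    INTEGER PRIMARY KEY,
    documentID   INTEGER REFERENCES Document_tbl(documentID),
    rootID       INTEGER,
    versionIndex INTEGER,
    filePath     VARCHAR(100)
);
CREATE TABLE ObjectRel_tbl (
    relID      INTEGER PRIMARY KEY,
    documentID INTEGER REFERENCES Document_tbl(documentID),
    objectID   INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One document with two stored versions (IDs and paths are made up).
conn.execute("INSERT INTO Document_tbl VALUES (1, 7)")
conn.executemany(
    "INSERT INTO Version_tbl VALUES (?, ?, ?, ?, ?)",
    [(10, 1, None, 0, "docs/1/v0.html"), (11, 1, None, 1, "docs/1/v1.pdf")],
)

# The database holds only metadata; filePath points at the file on disk.
latest_path = conn.execute(
    "SELECT filePath FROM Version_tbl WHERE documentID = ? "
    "ORDER BY versionIndex DESC LIMIT 1",
    (1,),
).fetchone()[0]
print(latest_path)  # docs/1/v1.pdf
```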

  • 27

    About Data Collection

    Content search

    Which docs have relevant content?

    Search rules drive collection workflow

    1000+ search rules per doc

    65,000+ automated searches per day
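
One way to picture search rules driving the workflow: each rule pairs a text pattern with the collection process that should receive matching documents. The sketch below uses plain regular expressions as a stand-in for the Solr/Lucene queries mentioned earlier; the rule names, patterns and process names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRule:
    name: str      # hypothetical rule name
    pattern: str   # regular expression run against the document text
    process: str   # collection process that receives matching documents


RULES = [
    SearchRule("dividend_change", r"\bdividend\b", "KeyDevelopments"),
    SearchRule("merger_announcement", r"\b(merger|acquisition)\b", "Transactions"),
    SearchRule("quarterly_results", r"\bquarterly (results|report)\b", "Financials"),
]


def match_rules(text, rules=RULES):
    """Return the processes whose rules match the document text, in rule order."""
    hits = []
    for rule in rules:
        if re.search(rule.pattern, text, flags=re.IGNORECASE):
            hits.append(rule.process)
    return hits


doc = "Acme Corp. announces quarterly results and raises its dividend."
print(match_rules(doc))  # ['KeyDevelopments', 'Financials']
```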

  • 28

    About Data Collection

    Collection workflow

    Core engine that routes work items

    Organized into Processes, Stages, Statuses

    Prioritization based on usage (and other factors)

    Simple GetNext(), Commit() API

    177.8 million Commits in 2011

    Avg. 130K+ Commits per day in Financials

    As of 2/24/2012.
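
The slide names only GetNext() and Commit(), so the following is a guess at the shape of such an engine rather than the actual API: a per-stage priority queue where higher-usage items come out first. The class, the enqueue helper, the stage names and the status strings are all assumptions.

```python
import heapq
import itertools


class WorkflowEngine:
    """Toy work-item router with a GetNext()/Commit() style interface."""

    def __init__(self):
        self._queues = {}              # stage -> heap of (-priority, seq, doc_id)
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority
        self.status = {}               # doc_id -> (stage, status)

    def enqueue(self, doc_id, stage, priority):
        heap = self._queues.setdefault(stage, [])
        heapq.heappush(heap, (-priority, next(self._seq), doc_id))
        self.status[doc_id] = (stage, "pending")

    def get_next(self, stage):
        """Hand out the highest-priority pending item for a stage, or None."""
        heap = self._queues.get(stage)
        if not heap:
            return None
        _, _, doc_id = heapq.heappop(heap)
        self.status[doc_id] = (stage, "in_progress")
        return doc_id

    def commit(self, doc_id, next_stage=None, priority=0):
        """Mark the item done in its stage and optionally route it onward."""
        stage, _ = self.status[doc_id]
        self.status[doc_id] = (stage, "committed")
        if next_stage is not None:
            self.enqueue(doc_id, next_stage, priority)


engine = WorkflowEngine()
engine.enqueue(doc_id=101, stage="extraction", priority=1)
engine.enqueue(doc_id=102, stage="extraction", priority=5)  # heavily used company

first = engine.get_next("extraction")
engine.commit(first, next_stage="review", priority=5)
print(first, engine.status[first])  # 102 ('review', 'pending')
```

Keeping the interface to two calls means a collector, human or automated, never has to know how priorities are computed; it just asks for the next item and commits the result.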

  • 29

    About Data Collection

    Collection process

    Automated extraction

    Manual collection

    1000s of quality checks

    Basic integrity

    Variance from prior period

    All data stored as reported with Doc ID
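
As a rough illustration of the two kinds of check named above, here is a sketch of one basic-integrity rule and one variance rule. The specific rules, thresholds and field names are assumptions; the deck does not describe its actual checks.

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    doc_id: int    # source document the value was collected from
    item: str      # e.g. "Revenue"
    period: str    # e.g. "FY2011"
    value: float   # stored as reported


def check_balance_sheet(assets, liabilities, equity, tolerance=0.5):
    """Basic integrity: assets should equal liabilities plus equity."""
    return abs(assets - (liabilities + equity)) <= tolerance


def check_variance(current, prior, max_change=0.5):
    """Pass only if the value moved at most max_change (50%) from the prior period."""
    if prior.value == 0:
        return False  # no ratio can be computed; send to manual review
    change = abs(current.value - prior.value) / abs(prior.value)
    return change <= max_change


prior = DataPoint(doc_id=1, item="Revenue", period="FY2010", value=100.0)
current = DataPoint(doc_id=2, item="Revenue", period="FY2011", value=1000.0)

print(check_variance(current, prior))          # False: a 10x jump, possibly a units error
print(check_balance_sheet(150.0, 90.0, 60.0))  # True
```

Because every value carries its document ID, a failed check can send a reviewer straight back to the source filing.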

  • 30

    Standardization

    Strata: Making Data Work

    February 29, 2012

  • 31

    Standardization

    Compare apples to apples (or Facebooks)

    For illustrative purposes only. Source: S&P Capital IQ as of 2/24/2012.

  • 32

    Linking: The Curious, Special Case Of Entities

    Strata: Making Data Work

    February 29, 2012

  • 33

    Linking: Managing Entities

    Entities we like to think about

    Companies (public, private, investment firms)

    Government agencies (the Fed)

    Governments (munis, countries, the EU)

    Securities (equity or debt, issued by the above)

    Indices, funds, rates, other aggregations

    People (executives, board members, investors, shareholders)

  • 34

    Linking: Managing Entities

    Goal: Blend entity data from different sources

    Ex: unified view of stock price and ratings

    First: What's the identifier? Or identifiers?

    Name, ticker, CUSIP, others

    Next: Can we auto-link?

    Use historical links to make future links easier

    Quality checks

    Look for outlier cases

    Remember that things change over time

    So entity links create a time series
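
Treating links as a time series means each identifier-to-entity mapping carries a validity interval, and lookups are made as of a date. A minimal sketch, with a hypothetical record layout and a made-up ticker history:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class EntityLink:
    id_type: str               # "ticker", "CUSIP", "name", ...
    id_value: str
    entity_id: int             # hypothetical internal entity key
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means still in effect


def resolve(links, id_type, id_value, as_of):
    """Return the entity an identifier pointed to on a given date, or None."""
    for link in links:
        if (link.id_type, link.id_value) != (id_type, id_value):
            continue
        started = link.valid_from <= as_of
        not_ended = link.valid_to is None or as_of < link.valid_to
        if started and not_ended:
            return link.entity_id
    return None


# Made-up history: ticker "ABC" is reassigned from entity 1 to entity 2.
links = [
    EntityLink("ticker", "ABC", 1, date(2005, 1, 1), date(2010, 6, 1)),
    EntityLink("ticker", "ABC", 2, date(2010, 6, 1), None),
]

print(resolve(links, "ticker", "ABC", date(2008, 3, 15)))  # 1
print(resolve(links, "ticker", "ABC", date(2011, 3, 15)))  # 2
```

The same history doubles as the "historical links" used for auto-linking: an identifier seen before can be resolved without manual work, and only unmatched or conflicting cases need review.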

  • 35

    An Example Of Difficult Entity Linking: Public Ownership

    Tracks portfolio holdings and values over time

    Example: Vanguard vs. Fidelity Funds

    Many disparate sources

    Reported from both owner and owned side

    Varied requirements by exchange (50+ countries)

    Many different entity types

    People, Institutions, Pension Funds, Mutual Funds

    Common Equity, Derivatives, Options

    Many different security identifiers

    CUSIP, ISIN, SEDOL, Ticker, Name, etc.
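
With this many identifier schemes in play, a cheap first step before any linking is to reject identifiers that are malformed. The deck does not say whether S&P Capital IQ does this; the sketch below simply shows the standard CUSIP check-digit test as one example of such a sanity check.

```python
def cusip_check_digit(base8):
    """Compute the check digit for the first eight characters of a CUSIP."""
    total = 0
    for position, char in enumerate(base8, start=1):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        else:
            value = {"*": 36, "@": 37, "#": 38}[char]  # KeyError if not allowed
        if position % 2 == 0:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def is_valid_cusip(cusip):
    """True if the string is nine characters long and its check digit matches."""
    cusip = cusip.strip().upper()
    if len(cusip) != 9 or not ("0" <= cusip[8] <= "9"):
        return False
    try:
        return cusip_check_digit(cusip[:8]) == int(cusip[8])
    except KeyError:
        return False


print(is_valid_cusip("037833100"))  # True  (Apple common stock)
print(is_valid_cusip("037833101"))  # False (wrong check digit)
```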

  • 36

    Suggesting Data

    Strata: Making Data Work

    February 29, 2012

  • 37

    Suggesting Data

    Goal: Platform that learns from user behavior

    Suggest company profiles that the user may be interested in viewing

    Use data exhaust to build better products

  • 38

    Suggesting Data

    Challenges

    We're an impartial data platform

    We may not provide investment advice!

    Clients are super-secret about their deals

    Ergo, we can't use a collaborative filtering approach

  • 39

    Suggesting Data

    Advantage: We have lots of great data!

    Key developments

    Curated news product

    Get smart on a company

    News searches catch interesting press releases

    In-house researchers ensure:

    Quality entity linking

    Event typing (categorization)

  • 40

    For illustrative purposes only.

  • 41

    Suggesting Data

    Key development event ranking

    Popular & infrequent events = interesting

    Example: Dividend increase is more noteworthy than dividend affirmation

    User selectivity

    Based on clicks

    Sector, region, company type
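
The slide gives the intuition (popular and infrequent event types are interesting, weighted by what each user tends to click) but no formula. One plausible reading, sketched here with made-up counts, multiplies clicks per occurrence by an IDF-style rarity weight and then by the user's click share for the sector. The function names and the formula itself are assumptions.

```python
import math


def event_type_scores(event_counts, click_counts):
    """Score event types so that rare, often-clicked types rank highest."""
    total_events = sum(event_counts.values())
    scores = {}
    for event_type, n_events in event_counts.items():
        popularity = click_counts.get(event_type, 0) / n_events  # clicks per event
        rarity = math.log(total_events / n_events)               # IDF-style weight
        scores[event_type] = popularity * rarity
    return scores


def user_affinity(user_clicks, sector):
    """Share of a user's clicks that fall in the given sector."""
    total = sum(user_clicks.values())
    return user_clicks.get(sector, 0) / total if total else 0.0


# Made-up counts for illustration only.
event_counts = {"dividend_affirmation": 900, "dividend_increase": 100}
click_counts = {"dividend_affirmation": 90, "dividend_increase": 50}
scores = event_type_scores(event_counts, click_counts)

user_clicks = {"Energy": 30, "Healthcare": 10}
suggestion_score = scores["dividend_increase"] * user_affinity(user_clicks, "Energy")

print(scores["dividend_increase"] > scores["dividend_affirmation"])  # True
```

With these counts the dividend increase outranks the affirmation, matching the slide's example; sector here stands in for any of the selectivity dimensions (sector, region, company type).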

  • 42

    Suggesting Data

    Score each suggestion for each user based on signals via Hadoop + Hive

    Remove items that the user has already seen!

    Present in a widget on the dashboard

    Measure th...