
Case Studies in Creating Quant Models from Large Scale Unstructured Text

Dr. Sameena Shah (sameena.shah@thomsonreuters.com)

DISRUPTIONS

•  LARGE SCALE DATA ANALYSIS:
   –  Hadoop, Spark

•  NATURAL LANGUAGE PROCESSING:
   –  Sentiment, context, text mining

•  NOVEL/EFFICIENT ALGORITHMS:
   –  Deep Learning, Topic Modeling

•  NOVEL DATA SETS:
   –  Twitter, satellite images

•  Accessibility

Some large scale textual datasets

•  Social Media

•  SEC filings

•  News

•  Courtwires

•  Patents

ANALYZING UNSTRUCTURED TEXT IN SEC FILINGS

•  All public companies, domestic and foreign, trading on any of the US exchanges are required to file with the SEC registration statements, periodic reports, insider trading forms, and other forms describing any significant changes.

•  Filings typically contain financial statements as well as large amounts of 'unstructured text' describing the past, present, and anticipated future performance of the company.

For example, a company might change its accounting methods to inflate its earnings, change its fiscal year end to include some extra sales, shift some expenses to a later period, include revenues that are not yet payable, or improperly expense or capitalize certain items.

Can we

•  Create an automated system that identifies "abnormal" sentences in filings, thereby alerting regulators/investors faster?

•  Even for humans, recognizing such sentences usually requires deep domain expertise.

•  Value is clear … but

•  > 3 TB in compressed format

•  Running this on a small subset of the data on a dual-core machine gave an estimated runtime of a few months

Text modeling on Hadoop

•  Reading compressed files through a custom input reader

•  Parsing of sections

•  Division into sentences and comparison across different reference groups

•  Scoring each sentence w.r.t. the reference group model (a rough sketch of this step follows below)

•  Divergence of scores from the distribution of the reference group

•  All this in under 30 minutes for 8 years of filings
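The deck describes this pipeline only at the bullet level. As an illustration, the sentence-scoring step could be written as a Hadoop-streaming-style Python mapper like the sketch below, assuming a per-reference-group unigram language model; the file reference_model.json, the tab-separated input format, and the scoring function are hypothetical stand-ins, not the actual implementation.

# Minimal Hadoop-streaming-style mapper sketch: score each sentence of a
# filing section against a reference-group unigram language model.
# reference_model.json, the input format, and the scoring are assumptions.
import json
import math
import re
import sys

# Hypothetical model file: {"reference_group": {"token": probability}}
with open("reference_model.json") as f:
    REFERENCE_MODEL = json.load(f)

UNSEEN_PROB = 1e-8  # floor probability for out-of-vocabulary tokens

def sentences(text):
    """Crude sentence splitter; a real pipeline would use a proper NLP splitter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, model):
    """Average negative log-likelihood of the sentence under a unigram model."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    logprob = sum(math.log(model.get(t, UNSEEN_PROB)) for t in tokens)
    return -logprob / len(tokens)

def main():
    # Each input line: <reference_group>\t<filing section text>
    for line in sys.stdin:
        try:
            group, text = line.rstrip("\n").split("\t", 1)
        except ValueError:
            continue
        model = REFERENCE_MODEL.get(group, {})
        for sent in sentences(text):
            # Emit (group, score); a reducer would compare each score against
            # the group's score distribution and flag large divergences.
            print(f"{group}\t{score_sentence(sent, model):.4f}")

if __name__ == "__main__":
    main()

A reducer stage would then aggregate the emitted scores per reference group and surface sentences whose scores fall far outside the group's distribution, in line with the bullets above.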

TEXT PROCESSING

•  Use of text processing techniques to check for
   –  Clarity in overall disclosure compared to peers
   –  Redundancy in language
   –  Comparison of the language model across sector and market cap peers (see the sketch after this list)
   –  Comparison of the model with the company's own historical model
   –  Overly vague or 'boilerplate' disclosures in revenue recognition
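The slides do not say how these language-model comparisons are computed. One plausible reading, sketched below purely as an illustration, is a smoothed, symmetrised KL divergence between unigram distributions of the company's disclosure and a reference corpus (sector/market-cap peers, or the company's own earlier filings); the function names and smoothing choices are assumptions.

# Illustrative sketch: compare a company's disclosure language against a
# reference corpus using a smoothed, symmetrised KL divergence over
# unigram distributions. Not the method from the deck; an assumption.
import math
import re
from collections import Counter

def unigram_dist(text, vocab, alpha=0.5):
    """Laplace-smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def disclosure_divergence(company_text, reference_text):
    """Symmetric KL divergence; larger values = language further from the reference."""
    vocab = set(re.findall(r"[a-z']+", (company_text + " " + reference_text).lower()))
    p = unigram_dist(company_text, vocab)
    q = unigram_dist(reference_text, vocab)
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical usage; real texts would come from parsed filing sections.
if __name__ == "__main__":
    company = "Revenue is recognized when persuasive evidence of an arrangement exists."
    peers = "Revenue is recognized when products are delivered and collection is reasonably assured."
    print(round(disclosure_divergence(company, peers), 4))

Larger divergence values indicate disclosure language that drifts further from the reference group, which could then be ranked or thresholded for review.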

SIGNALS FROM SOCIAL MEDIA

Winning Traders

•  Questions:
   –  Can we find good traders and follow them to make money?

•  Method:
   –  Identify trading-related tweets (buy/sell a specific stock)
   –  Evaluate traders based on past performance
   –  Follow their trades (a toy backtest sketch follows below)
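As an illustration only, the follow-the-winners idea can be sketched as below; the Trade record, the ranking by average realised return, and the top-N cutoff are assumptions standing in for the real tweet extraction and trader evaluation described in the deck.

# Toy sketch of follow-the-winners: rank traders by the past performance of
# their tweeted calls, then keep only trades from the top performers.
# The Trade record, returns, and cutoff are illustrative assumptions.
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

@dataclass
class Trade:
    trader: str     # Twitter handle
    ticker: str
    side: str       # "buy" or "sell"
    ret: float      # realised return of the call over the evaluation horizon

def trader_scores(history):
    """Average realised (signed) return of each trader's past tweeted trades."""
    by_trader = defaultdict(list)
    for t in history:
        signed = t.ret if t.side == "buy" else -t.ret
        by_trader[t.trader].append(signed)
    return {trader: mean(rets) for trader, rets in by_trader.items()}

def follow_winners(history, new_trades, top_n=10):
    """Keep only new trades from the top_n traders ranked on past performance."""
    scores = trader_scores(history)
    winners = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return [t for t in new_trades if t.trader in winners]

# Hypothetical usage with fabricated example records:
if __name__ == "__main__":
    past = [Trade("@alice", "AAPL", "buy", 0.04), Trade("@bob", "XOM", "buy", -0.02)]
    new = [Trade("@alice", "MSFT", "buy", 0.0), Trade("@bob", "GE", "sell", 0.0)]
    print([t.trader for t in follow_winners(past, new, top_n=1)])  # ['@alice']

In the real system, the trades would be extracted from tweets with NLP and evaluated against market prices over a fixed horizon before the ranking step.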


Why do people express their trading positions?

•  Everyone has an opinion!

•  Positive motivations
   –  Enhance reputation/brand
   –  Build a network by attracting other experts
   –  Benefit personal trading positions

•  Negative motivations
   –  Hired to promote a position
   –  Nothing else to do …


The Winners strategy gained 9.48% while the S&P 500 lost 3.55%

13.03% difference

Transaction costs matter: 0.2% per transaction
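To make the cost effect concrete, a small sketch of netting a fixed per-transaction cost out of a sequence of round-trip trades is shown below; the per-trade gross returns are hypothetical, and charging the 0.2% twice per trade (entry and exit) is an assumption.

# Net out a fixed per-transaction cost from a sequence of round-trip trades.
# The per-trade gross returns are made up; 0.2% per transaction is the
# figure quoted on the slide, applied here on both entry and exit.
COST_PER_TRANSACTION = 0.002  # 0.2%

def compound(returns, cost=0.0):
    """Compound per-trade returns, charging the cost on entry and exit of each trade."""
    wealth = 1.0
    for r in returns:
        wealth *= (1.0 + r) * (1.0 - cost) ** 2  # two transactions per round trip
    return wealth - 1.0

if __name__ == "__main__":
    hypothetical_trades = [0.01, -0.005, 0.02, 0.015, -0.01]
    print(f"gross: {compound(hypothetical_trades):.2%}")
    print(f"net:   {compound(hypothetical_trades, COST_PER_TRANSACTION):.2%}")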

Conclusions

•  While Twitter's signal-to-noise ratio is very low, targeted data collection and mining can be more promising

•  In event-based sentiment analysis, we assumed stock market related tweets posted after bad (good) news have negative (positive) polarity. This data can be used to train a supervised model (a labelling sketch follows below).

•  User-based analysis (following traders with a good record of trading based on their tweets) also showed that adapting to traders' moves in the market could be a winning strategy.

•  M. Makrehchi, S. Shah, W. Liao. Stock Prediction Using Event Information from Twitter. In Web Intelligence, 2013.
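The event-based labelling rule stated above can be written down directly. The sketch below is an illustration only: the Event/Tweet records, the 24-hour window, and the way polarity is assigned are assumptions, not details from the paper.

# Distant labelling per the conclusion bullet: tweets about a stock posted
# shortly after a bad (good) news event are labelled negative (positive).
# Event records, the 24-hour window, and the field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    ticker: str
    time: datetime
    polarity: int   # +1 good news, -1 bad news

@dataclass
class Tweet:
    ticker: str
    time: datetime
    text: str

def label_tweets(tweets, events, window=timedelta(hours=24)):
    """Return (text, polarity) pairs for tweets falling inside an event's window."""
    labelled = []
    for tw in tweets:
        for ev in events:
            if tw.ticker == ev.ticker and ev.time <= tw.time <= ev.time + window:
                labelled.append((tw.text, ev.polarity))
                break
    return labelled

if __name__ == "__main__":
    events = [Event("ACME", datetime(2013, 5, 1, 9, 30), -1)]
    tweets = [Tweet("ACME", datetime(2013, 5, 1, 11, 0), "$ACME getting crushed today")]
    print(label_tweets(tweets, events))  # [('$ACME getting crushed today', -1)]

The resulting (text, polarity) pairs could then be used to train a standard supervised text classifier, as the conclusion suggests.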


Q & A


The Winners strategy gained 19.76% while the S&P 500 lost 3.55%
