Analysis of non-stochastic time varying data - FINGRID
Lee Gillam
Department of Computing, University of Surrey
Fingrid (RES-149-25-0028), ASW on Quant. Methods in e-Social Science, 6 April 2005
Financial Decision Making
Challenge:
analysis of streaming financial (time series) data and financial and political news
At the interface of quantitative and qualitative?
FINGRID Project
Aimed at an information management/processing challenge in the social sciences: analysis and fusion of distributed quantitative and qualitative data and programs.
12-month eSS PDP involving econometrics (Essex) and computing academics, particularly in grid computing and artificial intelligence, at Surrey (social anthropologists & criminologists)
Third project at Surrey that deals with qualitative data (news and reports) and quantitative data (time series), following EU Projects ACE (1996-99) and GIDA (2001-03).
Motivation
Market sentiment - quantifying effects of news in the Efficient Market Hypothesis?
Technicalists (chart patterns, stats) and fundamentalists (intrinsic "book" value) locked away from the outside world - no CNN?
Challenge of treating multiple data sources
Bounded rationality (Simon 1972, Kahneman 2002)?
Self-deception of investors rejecting new evidence in favour of prior (incorrect) information (Lakonishok, Lee & Poteshman 2003, Kindleberger 2001) - e.g. “.com” bubble
Buy/sell - human (re-)action is documented in the dataset
FINGRID methods/techniques
sentiment analysis: automatic terminology extraction; ontology learning; local grammars.
Learning the rules for Information Extraction (IE). Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)
time series analysis (bootstrapping, wavelet analysis)
visualization of large volume time series and texts
Grid - Globus, Condor, OGSA-DAI, SRB
FINGRID Technologies
24 computers (dual-proc, hyperthreaded) provide:
1. Globus Toolkit 3.0.2 (GT3),
2. Java and FORTRAN software compilers,
3. Java Commodity Grid kit (CogKit), and
4. Local security certification.
FINGRID uses the Java CogKit to integrate: (i) the MATLAB wavelet toolbox via JMatLink; (ii) Reuters data via the Reuters SSL SDK; (iii) bootstrap simulation written in FORTRAN; and (iv) System Quirk components via the Quirk Java SDK.
Condor (management of distributed processing – 76 procs in pool), Storage Resource Broker (Data Grids) also configured: expansion and testing in progress.
Streaming Data (Reuters)
FOREX (GBP/USD) tick data
Datasets
HFDF data (O&A), e.g. 5-minute compression GBP/USD, 1992 to 2003 inclusive; 1.25M datapoints (12*24*365*12), approximately 4MB.
Text corpora: RCV 1 (over 800000 news stories in 12 months of 1996-7); RCV 2 (13 languages)
Copyrights/contracts?
Numerical data: time series of price/value movement of financial instruments; c. 5MB/day per instrument (XML), including sources of quote (>1GB/year/instrument)
Textual data: text streams of news items; financial reports; company brochures; government documents…; c. 40MB/day (>10GB/year)
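The volume figures on this slide can be checked with a few lines of arithmetic. The sketch below only reproduces the slide's own numbers; the bytes-per-datapoint value is not stated anywhere in the slides and is merely what the quoted 4MB would imply.

```python
# Back-of-envelope check of the dataset sizes quoted on the slide.
SAMPLES_PER_HOUR = 12   # one sample every 5 minutes
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365     # leap days and market closures ignored, as on the slide
YEARS = 12              # 1992 to 2003 inclusive

datapoints = SAMPLES_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
print(datapoints)  # 1261440, i.e. the "1.25M datapoints" on the slide

# Bytes per datapoint implied by "approximately 4MB" (an inference, not a stated figure).
implied_bytes_per_point = 4 * 1024 * 1024 / datapoints
print(round(implied_bytes_per_point, 2))

# Yearly volumes implied by the daily rates (365 days assumed).
numeric_mb_per_year = 5 * DAYS_PER_YEAR    # per instrument, consistent with ">1GB/year"
text_mb_per_year = 40 * DAYS_PER_YEAR      # consistent with ">10GB/year"
print(numeric_mb_per_year, text_mb_per_year)
```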
Non-stochastic?
Encyclopedia of Chart Patterns
Japanese Candlestick Charting Techniques
If price increases, demand decreases?
Methods - Bootstrap
With many financial series, it may be difficult to select and fit an appropriate model. Block bootstrap is a procedure for generating bootstrap samples from time series when a parametric model is not available.
The blocking procedure consists of dividing data into blocks and sampling blocks randomly with replacement.
Bootstrap techniques are inherently computationally demanding, even using efficient computational algorithms (Nankervis 2002).
The bootstrap can be iterated so that a further layer of resampling is performed (a double bootstrap), which results in improved properties of estimators and test statistics.
To make realistic statistical inferences from data using bootstrapping, a large number of replications (c. 10000) should be used (Lobato, Nankervis & Savin 2001).
Other bootstrap-based procedures applicable to financial data include estimating the distribution of returns for Value at Risk (VaR) models (Ruiz and Pascual, 2002).
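As an illustration of the blocking procedure described here, the following is a minimal sketch of a non-overlapping block bootstrap in Python. It is not the project's FORTRAN implementation; the function names, the block length, the percentile interval and the sample data are all assumptions made for the example.

```python
import random
import statistics

def block_bootstrap_sample(series, block_len, rng):
    """Resample a series by drawing non-overlapping blocks with replacement."""
    # Divide the data into consecutive blocks (a short final block is kept).
    blocks = [series[i:i + block_len] for i in range(0, len(series), block_len)]
    sample = []
    while len(sample) < len(series):
        sample.extend(rng.choice(blocks))
    return sample[:len(series)]

def bootstrap_mean_interval(series, block_len=5, replications=1000, seed=0):
    """Percentile interval (2.5%, 97.5%) for the mean over block-bootstrap replications."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(block_bootstrap_sample(series, block_len, rng))
        for _ in range(replications)
    )
    low = means[int(0.025 * replications)]
    high = means[int(0.975 * replications) - 1]
    return low, high

# Hypothetical return series, for illustration only.
returns = [0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.2, -0.05, 0.15, -0.25, 0.1, 0.05]
low, high = bootstrap_mean_interval(returns, block_len=3, replications=2000)
print(low, high)
```

Each replication is independent of the others once its random seed is fixed, which is what makes the replications easy to spread over several machines.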
Simple Bootstrapping
[Figure: time in seconds (0 to 2500) vs. # of machines (1, 2, 4, 8); series: Bootstrap rep=500, Bootstrap rep=1000]
1000 bootstrap replications:
2 nodes: 1050 seconds (17.5 mins)
8 nodes: 404 seconds (6.73 mins)
10000+ replications? Linear speedup?
Hypothesis testing – dismiss bad ideas more quickly?
Methods - Bootstrap
Bespoke FORTRAN implementation of bootstrapping [Nankervis] algorithm (Globus, Java CoGkit – Grid service)
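The "Linear speedup?" question can be explored with Amdahl's law. The sketch below fits the non-parallelisable fraction f to the two timings quoted for 1000 replications (2 nodes: 1050 s, 8 nodes: 404 s). It assumes the simple Amdahl model holds with no communication overhead, and two data points are far too few for a reliable estimate, so the result is indicative only. The function names are illustrative.

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup on n processors when a fraction f of the work is serial."""
    return 1.0 / (f + (1.0 - f) / n)

def serial_fraction_from_timings(n_a, t_a, n_b, t_b):
    """Solve t(n) = t1 * (f + (1 - f) / n) for f, given two (nodes, seconds) timings."""
    r = t_a / t_b
    # From f + (1 - f)/n_a = r * (f + (1 - f)/n_b), rearranged for f.
    numerator = r / n_b - 1.0 / n_a
    denominator = (1.0 - 1.0 / n_a) - r * (1.0 - 1.0 / n_b)
    return numerator / denominator

f = serial_fraction_from_timings(2, 1050.0, 8, 404.0)
print(round(f, 3))                       # roughly 0.099 under the stated assumptions
print(round(amdahl_speedup(f, 8), 2))    # predicted speedup on 8 nodes
print(round(1.0 / f, 1))                 # upper bound on speedup as n grows
```

Under this reading, going from 2 to 8 nodes gives about a 2.6x reduction in time rather than the ideal 4x, so the speedup is sub-linear.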
Distributed bootstrapping
Bootstrap is partially parallelizable:
Amdahl’s law: the fraction of code f, which cannot be parallelised, affects speedup factor - replication seeds, results.
Condor and Condor DAGs (compose metalevel description)
[Diagram: seed -> calculate (x4) -> results]
Job A seed.cmd
Job B calculate1.cmd
Job C calculate2.cmd
Job D calculate3.cmd
Job E calculate4.cmd
Job F results.cmd
PARENT A CHILD B C D E
PARENT B C D E CHILD F
executable = calculate.exe
input =
output = calculate.1.out
error = calculate.1.err
transfer_input_files = outs_aa
transfer_files = ALWAYS
log = calculate.1.log
arguments = outs_aa 250
queue
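The fan-out/fan-in shape of the DAG above (one seed job, several calculate jobs, one results job) can be generated for any number of calculate jobs. This is a minimal sketch that only writes the DAG description text; the job names SEED, C1..Cn and RESULTS are hypothetical (the slide uses A to F), and the submit files such as calculate1.cmd are assumed to exist already.

```python
def make_dag(n_calculate):
    """Return DAG text: one seed job, n parallel calculate jobs, one results job."""
    calc_names = [f"C{i}" for i in range(1, n_calculate + 1)]
    lines = ["JOB SEED seed.cmd"]
    lines += [f"JOB {name} calculate{i}.cmd"
              for i, name in enumerate(calc_names, start=1)]
    lines.append("JOB RESULTS results.cmd")
    children = " ".join(calc_names)
    # The seed job must finish before any calculate job starts...
    lines.append(f"PARENT SEED CHILD {children}")
    # ...and all calculate jobs must finish before results are collected.
    lines.append(f"PARENT {children} CHILD RESULTS")
    return "\n".join(lines) + "\n"

dag_text = make_dag(4)
print(dag_text)
```

With n_calculate = 4 this reproduces the structure on the slide; only the seed and results steps remain serial.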
Wavelet analysis
Conventional Signal Processing:
• Variation in time-domain OR variation in frequency domain applicable to stationary series
Wavelet-based Analysis:
• Variation in time-domain AND variation in frequency domain applicable to non-stationary series.
Aussem & Murtagh (1997) use wavelet analysis combined with neural networks to provide time series forecasts
Wavelet Multiscale Analysis
Fourier Power Spectra can be computed for each scale – discover cyclicals
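To make the idea of "a power spectrum per scale" concrete, here is a small pure-Python sketch. It uses the Haar wavelet and a plain DFT purely for illustration; the slides do not say which wavelet family the MATLAB toolbox analysis used, and the synthetic signal is an assumption.

```python
import cmath
import math

def haar_step(signal):
    """One Haar level: pairwise averages (smooth) and pairwise differences (detail)."""
    pairs = range(0, len(signal) - 1, 2)
    smooth = [(signal[i] + signal[i + 1]) / 2.0 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in pairs]
    return smooth, detail

def haar_multiscale(signal, levels):
    """Return the detail series for each scale and the final low-frequency smooth."""
    details, smooth = [], list(signal)
    for _ in range(levels):
        smooth, detail = haar_step(smooth)
        details.append(detail)
    return details, smooth

def power_spectrum(series):
    """DFT power |X_k|^2 for k = 1 .. n//2, after removing the mean."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))) ** 2
            for k in range(1, n // 2 + 1)]

def dominant_period(series):
    """Period, in samples of this series, of the strongest DFT component."""
    power = power_spectrum(series)
    return len(series) / (power.index(max(power)) + 1)

# Synthetic series: a slow cycle (period 16) plus a weaker fast cycle (period 4).
signal = [math.sin(2 * math.pi * t / 16) + 0.5 * math.sin(2 * math.pi * t / 4)
          for t in range(128)]
details, smooth = haar_multiscale(signal, levels=2)

# Each level halves the sampling rate, so periods are rescaled to original samples.
print(dominant_period(details[0]) * 2)  # 4.0: the fast cycle shows up at the finest scale
print(dominant_period(smooth) * 4)      # 16.0: the slow cycle survives in the smooth
```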
Methods - Wavelet analysis
Most dominant cycle (brown rectified sine wave) has a period of 85.3333 and starts at 57.3333
Next dominant cycle (green rectified sine wave) has a period of 42.6667 and starts at 41.3333
Other cycles in order of their importance are 23.27, 11.6364, 5.68 and 3.12
UPTREND from 1 to 260 with a slope of 12.57 and a y-intercept of 2626.37
DOWNTREND from 261 to 358 with a slope of -8.6956 and a y-intercept of 8166.4091
The series loses its stationarity (variance change occurs) at 141 (black vertical line)
Possible turning points (black circles):
68, 144, 148, 152, 154, 165, 212, 220, 228, 260, 298, 299, 300, 348, 358, and 358
[Panels: Visualization; Textual Summary]
Methods - Wavelet analysis
Matlab toolboxes for Wavelet and Signal processing analysis
Matlab -> JMatLink (Java) -> Java CoGkit – Grid service
Parallel/performance evaluation?
JMatLink engine = new JMatLink();
engine.engOpen();                                // start the MATLAB engine
engine.engEvalString("array=randn(500)");        // evaluate a MATLAB expression
…
double[][] array = engine.engGetArray("array");  // copy the MATLAB result into Java
engine.engClose();                               // shut the engine down
public class TSAanalysisServiceGridLocator extends org.globus.ogsa.impl.core.service.ServiceLocator implements org.globus.ogsa.GridLocator {
Workflow
Select instrument tick data
Use sampling rule (OHLC) to create a time series [4 series, C at equally-spaced intervals]
Use close series for n-scale Wavelet transform [n-series]
Identify trends in low-frequency scale; apply Fourier analysis to each n-series to discover cycles
Apply bootstrap to modelling individual series?
Combination of model and trends = prediction?
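The OHLC sampling step of the workflow can be sketched as follows. The tick format (timestamp in seconds, price), the function names and the choice to carry the previous close forward over empty intervals are assumptions for illustration; the slides do not specify how gaps are handled.

```python
def ohlc_bars(ticks, interval_seconds):
    """Compress (timestamp_seconds, price) ticks into fixed-width OHLC bars."""
    bars = {}
    for ts, price in sorted(ticks, key=lambda tick: tick[0]):
        bucket = int(ts // interval_seconds) * interval_seconds
        if bucket not in bars:
            bars[bucket] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[bucket]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the last tick seen in the interval
    return bars

def close_series(bars, interval_seconds):
    """Equally spaced close series; empty intervals repeat the previous close."""
    buckets = sorted(bars)
    series, last = [], None
    for bucket in range(buckets[0], buckets[-1] + interval_seconds, interval_seconds):
        if bucket in bars:
            last = bars[bucket]["close"]
        series.append(last)
    return series

# Hypothetical GBP/USD ticks, compressed to 5-minute (300 s) bars.
ticks = [(0, 1.50), (100, 1.52), (290, 1.51), (310, 1.53), (950, 1.49)]
bars = ohlc_bars(ticks, 300)
closes = close_series(bars, 300)
print(bars[0])
print(closes)  # [1.51, 1.53, 1.53, 1.49]
```

The resulting close series is the equally spaced input that the wavelet step expects.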
Methods - Textual time series
Streaming news text
Named entity identification (e.g. company name)
Sentiment discovery (local grammars)
Up/down series for market / company (qual -> quant?)
System Quirk JDK + Java CoGkit = Grid Service
-> time series analysis
-> covariance analysis
Methods - Textual time series
Local Grammar | Example | Frequency
said PN, TITLE at ORG. | said Alex Scott, research analyst at Seven Investment Management. | 23.49%
said PN, TITLE at MOD ORG. | said Mike Lenhoff, chief strategist at private client fund manager Gerrard. | 4.23%
said PN at ORG. | said Alex Bannister at Nationwide. | 3.26%
said PN of ORG. | said Andrew Pendrill of ABN AMRO. | 2.06%
said PN, TITLE at ORG in PLACE. | said David Marshall, analyst at NCB Stockbrokers in Dublin. | 2.04%
said PN at MOD ORG. | said Simon Rubinsohn at brokerage Gerrard Ltd. | 1.97%
said PN, a TITLE at ORG. | said Conor Bill, a partner at Lawrence & Co. | 1.40%
said PN, an TITLE at ORG. | said David Pope, an industry analyst at Brewin Dolphin Securities. | 1.25%
said PN, TITLE at the ORG. | said Oliver Froehlich, euro-transition team leader at the Frankfurter Volksbank. | 0.83%
said PN, TITLE of the ORG. | said Mohammad Nazri Fatullah, chairman of the Afghan Trade Association. | 0.74%
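A few of the local grammar patterns in the table can be approximated with regular expressions. This is only a rough sketch: the definitions of PN, TITLE and ORG below are simplistic assumptions (capitalised words for names and organisations, lower-case words for titles), not the grammar-learning method used by System Quirk.

```python
import re
from collections import Counter

# Simplified slot definitions (assumptions for illustration).
PN = r"(?P<pn>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"          # two or more capitalised words
TITLE = r"(?P<title>[a-z]+(?: [a-z]+)*)"              # lower-case job title
ORG = r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"  # capitalised organisation name

PATTERNS = [
    ("said PN, TITLE at ORG.", re.compile(rf"said {PN}, {TITLE} at {ORG}\.")),
    ("said PN at ORG.", re.compile(rf"said {PN} at {ORG}\.")),
    ("said PN of ORG.", re.compile(rf"said {PN} of {ORG}\.")),
]

def match_local_grammar(sentence):
    """Return (pattern label, extracted slots) for the first matching pattern, else None."""
    for label, regex in PATTERNS:
        found = regex.search(sentence)
        if found:
            return label, found.groupdict()
    return None

# Example sentences taken from the table above.
sentences = [
    "said Alex Scott, research analyst at Seven Investment Management.",
    "said Alex Bannister at Nationwide.",
    "said Andrew Pendrill of ABN AMRO.",
]
results = [match_local_grammar(s) for s in sentences]
frequencies = Counter(label for label, _ in results)
for label, slots in results:
    print(label, slots)
```

Counting matches per pattern over a corpus, as with `frequencies` here, is one way a frequency column like the one in the table could be produced.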
Methods - Textual time series
Patterns identified for Chinese also: “up” (上升 ) in Chinese
上半年 /NTN 地產 /NN 投資 /NN 收入 /NN
上升 /NN 約 /FPM 百分之八 /MM ﹐ 至 /I 十九億 /MM 元 /U ﹔
first half of this year, estate investment
up about 8 percent, to 19 billion dollars
月 /NTN 期 /NN 指 /VT 全 /PA 日 /NTN 收 /VT 報 /NN 一萬一千三百 /MM 點 /U ﹐
上升 /VI 二十 /MM 點 /U ﹐ 低 /A 水 /NN 四十五 /MM 點 /U ﹐ 成交 /VT 合約 /NN
day-close value of the monthly index was 11300 points,
up 20 points, 45 points below average
Methods - Textual time series
Text Analysis Throughput tested with various sizes of corpora
– against benchmark (wordlists – Hughes et al 2004)
[Figure: Text Analysis; time in seconds (0 to 600) vs. # of machines (1, 2, 4, 8); legend: Text Analysis (process time in ms)]
Time required to process one month’s news.
RCV1 takes about 95 minutes on 16 machines.
Further experiments in progress
Qual. meets Quant.
Decision Matrix / probability of direction
Columns: Market (security) | JRC Fractal present? | Divergence exists? | Momentum changed | Volume increased | Reversal exists? | Overbought/oversold | Leverage sufficient | Other, e.g. UniS Sentiment | confidence level
Row: EUR/USD + + + + + + up 4
Qual. meets Quant.
SYSTEM QUIRK
Reuters News Feed
Up / Down
Time Series of Up and Down
Financial instrument (Reuters) e.g. FTSE 100 INDEX
[Figure: ratio (0 to 1.2) by date (1 to 30); series: Good words, FTSE100]
Generate Signal (Buy / Sell)
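One simple way to turn a day's news into an up/down figure and a signal is sketched below. Everything specific here is an assumption made for illustration: the word lists, the definition of the ratio (good words as a share of all sentiment words) and the buy/sell thresholds are hypothetical, and the slides do not define how the plotted ratio is computed.

```python
GOOD_WORDS = {"up", "rise", "rose", "gain", "gained", "higher"}   # hypothetical list
BAD_WORDS = {"down", "fall", "fell", "loss", "lost", "lower"}     # hypothetical list

def good_ratio(texts):
    """Share of sentiment words that are 'good' in one day's news (None if none found)."""
    good = bad = 0
    for text in texts:
        for token in text.lower().split():
            word = token.strip(".,;:!?\"'()")
            if word in GOOD_WORDS:
                good += 1
            elif word in BAD_WORDS:
                bad += 1
    total = good + bad
    return good / total if total else None

def signal(ratio, buy_above=0.6, sell_below=0.4):
    """Map a daily ratio to a signal; the thresholds are illustrative only."""
    if ratio is None:
        return "hold"
    if ratio > buy_above:
        return "buy"
    if ratio < sell_below:
        return "sell"
    return "hold"

# Hypothetical news items grouped by day.
news_by_day = {
    1: ["Shares rose as banks gained.", "Oil prices fell."],
    2: ["Markets fell sharply.", "Retailers lost ground."],
    3: ["No major announcements."],
}
ratios = {day: good_ratio(texts) for day, texts in news_by_day.items()}
signals = {day: signal(ratio) for day, ratio in ratios.items()}
print(ratios)
print(signals)
```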
Qual. meets Quant.
FINGRID’s Sentiment and Time Series: Financial analysis system (SATISFI): for visualising and correlating the sentiment and instrument time series
Composition of Grid services
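Correlating a sentiment series with an instrument series can be illustrated with a lagged Pearson correlation. This is a generic sketch, not a description of how SATISFI computes its correlations; the function names and the two short series are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(sentiment, returns, lag):
    """Correlate sentiment on day t with the return on day t + lag (lag >= 0)."""
    if lag > 0:
        return pearson(sentiment[:-lag], returns[lag:])
    return pearson(sentiment, returns)

# Made-up series in which returns follow sentiment with a one-day delay.
sentiment = [0.2, 0.8, 0.4, 0.9, 0.1, 0.7]
returns = [0.0, -0.6, 0.6, -0.2, 0.8, -0.8]

same_day = lagged_correlation(sentiment, returns, lag=0)
next_day = lagged_correlation(sentiment, returns, lag=1)
print(round(same_day, 3), round(next_day, 3))
```

Scanning several lags in this way is one simple check of whether news sentiment leads, lags or merely accompanies price movement.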
FINGRID -> qual.
• System Quirk
• text + terminology + ontology + local grammars + ….
• Neural network classifiers (Hebbian networks, Websom)
• Case-based and fuzzy reasoning
• Automatic Text Summarisation
• Text alignment
• Metadata
• ISO-standardized (ISO 11179-3 conformant data registries - LIRICS project); application to text management (Virtual Corpora); Text Categorisation (+ Terminology lookup)
• ISO 639 (codes for the names of languages); ISO 16642 (Terminology Markup Framework); LMF; MAF and other TLAs
Recap
sentiment analysis: automatic terminology extraction; ontology learning; local grammars.
Learning the rules for Information Extraction (IE). Patterns derived from a corpus (MB to GB) of texts (arbitrary domain)
time series analysis (bootstrapping, wavelet analysis)
visualization of large volume time series and texts
Grid - Globus, Condor, OGSA-DAI, SRB
Acknowledgements
Saif Ahmad, Research Student, Wavelet Analysis;
David Cheng, Research Officer, Text Analysis;
Gary Dear, Computing Officer, Grid Implementation;
Pensiri Manomaisupat, Research Student, Text Categorisation;
Ademola Popoola, Research Student, Fuzzy Logic Analysis;
Hayssam Traboulsi, Research Student, Named Entity Extraction;
Tuğba Taşkaya-Temizel, Tutor, Grid Computing, Grid Architect;
Khurshid Ahmad, Principal Investigator;
Jon Nankervis, Co-Investigator (Essex)
Outlook
Lessons from: Value at Risk Computation (RiskGrid - BeSC); Aircraft vibration time-series (DAME - York), ….
Proposed activity on qual analysis (content analysis meets code-based); qual-quant integration/fusion?
Integration with Sheffield’s GATE system
Complement and draw upon the work of eSS PDPs and the existing nodes: text analysis (Nottingham), modelling & simulation (Leeds), mixed media (Bristol), and quantitative analysis (Lancaster).
Additional activities (Surrey): EPSRC: REVEAL (auto-annotation of crime-related CCTV); EU eContent: LIRICS
Further information
http://www.computing.surrey.ac.uk/grid/