researcharticle financial trading strategy system based on

Research ArticleFinancial Trading Strategy System Based on Machine Learning

Yanjun Chen1 Kun Liu1 Yuantao Xie 2 and Mingyu Hu2

1School of Information and Management University of International Business and Economics Beijing China2School of Insurance and Economics University of International Business and Economics Beijing China

Correspondence should be addressed to Yuantao Xie xieyuantaouibeeducn

Received 6 May 2020 Accepted 7 July 2020 Published 28 July 2020

Guest Editor Qian Zhang

Copyright copy 2020 Yanjun Chen et al )is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

)e long-term and short-term volatilities of financial market combined with the complex influence of linear and nonlinearinformation make the prediction of stock price extremely difficult )is paper breaks away from the traditional researchframework of increasing the number of explanatory variables to improve the explanatory ability of multifactor model and providesa new financial trading strategy system by introducing Light Gradient Boosting Machine (LightGBM) algorithm into stock priceprediction and by constructing the minimum variance portfolio of mean-variance model with Conditional Value at Risk (CVaR)constraint )e new system can capture the nonlinear relationship between pricing factors without specific distributions )esystem uses Exclusive Feature Bundling to solve the problem of sparse high-dimensional feature matrix in financial data so as toimprove the ability of predicting stock price and it can also intuitively screen variables with high impact through the factorimportance score Furthermore the risk assessment based on CVaR in the system is more sufficient and consistent than thetraditional portfolio theory )e experiments on Chinarsquos stock market from 2008 to 2018 show that the trading strategy systemprovides a strong logical basis and practical effect for Chinarsquos financial market decision

1 Introduction

With the development of stock market the efficiency ofartificial subjective investment mode is gradually reduceddue to the complex and diverse investment targetsBenefiting from the advancement of data science and sta-tistical method the former subjective investment mode hasbeen gradually replaced by quantitative investment strategywhich uses data and models to construct investmentstrategies New investment model selecting stocks withinvestment value by combining the open information in themarket with statistical methods avoids the subjective impactof human to some extent

As the most widely used quantitative stock selectionmodel at present multifactor model is based on finding outfactors with the highest correlation with the stock returnrate which can predict the stock return to some extentHowever in the empirical test scholars have found that itcould not bring sustained returns to investors due to the lowprediction accuracy and the lack of stability of the predictionresults

At the same time through the empirical study with fi-nancial market data scholars found that the financial marketis a dynamic system with high complexity including long-term and short-term fluctuations and linear and nonlinearinformation )e formation and change of stock price in-volve various uncertain factors and there are complex re-lationships among them To further study and analyzefinancial data and for more accurate prediction the appli-cation of machine learning algorithms in the research offinancial time series has been widely concerned by scholars

Compared with the linear model machine learning al-gorithm takes the nonlinear relationship between variablesinto account It does not need to be based on the assumptionof independence and specific distribution and has higherflexibility and efficiency making it excel at dealing with bigdata especially the huge amount of financial data

)e innovations of this paper are as follows Firstly itbreaks away from the traditional research framework ofimproving the explanatory power of multifactor model byincreasing the number of explanatory variables and providesa new trading strategy system by introducing one of the

HindawiMathematical Problems in EngineeringVolume 2020 Article ID 3589198 13 pageshttpsdoiorg10115520203589198

latest machine learning algorithms LightGBM into the fieldof portfolio LightGBM does not need to consider thespecific distribution form of financial data and it can capturethe nonlinear relationship between pricing factors and theExclusive Feature Bundlingmethod can solve the problem ofsparsity of high-dimensional characteristic matrix and im-prove the prediction accuracy of stock returns Secondly thenew system can be used to generate importance score offactors It directly shows the impact variables have on stockreturn which has important practical significance for stockselection )irdly for stock position allocation module weconstruct minimum-variance weight method of mean-var-iance model with CVaR constraint and test the tradingstrategy system on the real data of Chinarsquos A-share market)e experiment result shows that the system can bring stableexcess return to investors and provides a logical basis andpractical effect for Chinarsquos stock market

2 Literature Review

SinceMarkowitz proposed mean-variance model to quantifythe risk in 1952 scholars have been trying to find idealmodels for stock pricing In 1960s the CAPM model pro-posed by Sharpe John Lintner and JanMossin expresses therelationship between expected return and expected risk as asimple linear relationship [1 2] Ross [3] put forward theArbitrage Pricing )eory on this basis revealing that stockreturn is affected by multiple factors rather than a singlefactor Fama and French [4 5] proposed a three-factormodel which includes market value book value ratio and PE ratio of listed companies as compensation for the riskfactors that β cannot reflect However the classic modelsabove were lacking explanation in the empirical test At thebeginning scholars attributed that to the omission of ex-planatory variables so they successively put forward factorsthat may lead to excess return For example Datar et al [6]used the data of NYSE for empirical analysis and found thatthere was a significant negative correlation between stockreturn and stock turnover Piotroski [7] selected nine in-dicators from the perspective of the companyrsquos financialindicators including profitability robustness and growthBy comprehensively scoring each indicator he established astock pool and selected the stocks according to the scoresNovy-Marx [8] proved that there is a strong correlationbetween the companyrsquos profitability and stock returnAharoni et al [9] also found that the companyrsquos investmentlevel and stock return are remarkably correlated In 2014Fama and French added Robust Minus Weak (RMW) asprofit factor and Conservative Minus Aggressive (CMA) asinvestment factor based on the three-factor model and putforward five-factor model to further explain the stock return[10]

However none of the models above deviated from theresearch framework of linear asset pricing under thebackground of small sample data that is extract excessreturn factors from historical data and then use these factorsas independent variables to construct a linear model toevaluate the investment value of stocks

Since the 1970s with the increasing availability of em-pirical data in the financial market and the improvement ofcomputer technology scholars have found several abnormalphenomena in the financial market through empirical re-search)ese abnormal phenomena are contrary to the basicassumption of CAPM financial market data obey normaldistribution have no long memory and satisfy the linearmodel which challenges the traditional model [11] For thestock market Greene and Fielitz [12] performed a test andconfirmed that the American stock market has the feature oflong memory which showed that even if the time interval isvery long it still had significant autocorrelation that is thehistorical events would affect the future for a long time Onthe one hand it proved the importance of historical in-formation and the predictability of return on the otherhand it also reflected the nature of nonlinear structure ofstock market As for the distribution of financial datascholars pointed out that in reality the distribution of fi-nancial data is usually characterized by thick tail andasymmetry [13] )erefore the traditional use of normaldistribution to fit the actual financial data has limitationsFor example in VaR calculation due to the thick tail offinancial data distribution the calculation under the as-sumption of normal distribution will lead to huge errors[14] In order to find the most reasonable distribution hy-pothesis Mandelbrot [15] proposed replacing the normaldistribution of financial data with the stable distributionHowever because the tail of the stable distribution is usuallythicker than the actual distribution some scholars proposedusing the truncated stable distribution as the distribution ofsecurities returns [16] but where to cut off had becomeanother question

In recent years in order to analyze and predict financialdata more accurately machine learning has received wideattention from scholars Compared with the traditionalmodel machine learning has a unique advantage in dealingwith financial data First of all it can automatically identifythe hidden features behind the financial data reducinghuman intervention Secondly the traditional linear assetpricing model is based on the assumption that the financialsystem is linear However scholarsrsquo research on the non-linear characteristics of financial time series such as longmemory and nonpairing distribution indicates that thestock market system is actually a dynamic system with linearand nonlinear information Machine learning models candeal with high-dimensional and collinear factors and are notlimited to the probability distribution of investment incomeMachine learning models do not need to calculate high-dimensional covariance matrix [17]

Mukhejee (1997) firstly proposed the application ofsupport vector machine in nonlinear chaotic time serieswhich provided the basis for the application of stock seriesIn the empirical study Fan and Palaniswami [18] first ap-plied the support vector machine model to stock selectionBased on the data of Australian stock exchange the modelthey constructed could identify the stocks that outperformthe market and the five-year yield of the equal weightportfolio constructed was 208 which was far higher thanthe benchmark return of the large market Kim [19] used

2 Mathematical Problems in Engineering

support vector machine (SVM) and artificial neural network(ANN) to predict the market index and the results showedthat SVM had more advantages than ANN in stability )ereare also some literatures that focus on the differences be-tween different models in variable selection and modelingcharacteristics for example Xie et al [20] and Huang et al[21] set the rise and fall of stock market as dichotomousvariables and used linear model BP neural network andsupport vector machine model to predict them It was foundthat SVM had better classification performance than othermethods and the combined model performed best in allprediction methods when SVM is combined with othermodels Nair et al [22] used C45 decision tree algorithm toextract the characteristics of stock data and then applied it tothe prediction of stock trend )ey found that the predictioneffect of C45 decision tree was better than neural networkand naive Bayesian model Zhu et al [23] applied Classi-fication and Regression Tree (CART) algorithm and tradi-tional linear multifactor model in North American marketduring the outbreak of financial crisis and found that thestock selection model based on CART algorithm had asignificant effect on risk dispersion Kumar and )enmozhi[24] used random forest model to predict the up and downdirection of Standard and Poorrsquos and found that the resultwas better than that of SVM Bogle and Potter [25] useddecision tree artificial neural network support vectormachine and other machine learning models to predict thestock price of Jamaica stock exchange market and foundthat in this market the prediction accuracy of stock pricecould reach 90

In recent years the first mock exam has also been madein some areas such as the sequence dependence of financialtime series data and the local association characteristics ofdifferent financial market time series data For example Xieand Li [26] discussed the joint pricing models and Yan [27]constructed a CNN-GRU neural network which combinesthe advantages of convolutional neural network (CNN) andgating loop unit (GRU) neural network )ere are also somepapers that study the computing power time consumptionand even hardware layout of various algorithms For thediscussion of machine learning related hardware refer toTang et al [28 29]

3 System Introduction

Based on machine learning algorithm this paper constructsan optimal trading strategy system which aims to bringstable excess return to investors According to Figure 1 thissystem is divided into four modules data preprocessingstock pool selection position allocation and risk mea-surement )e details are as follows

31 Data Preprocessing Because of the noise and formatasymmetry in financial data preprocessing plays an im-portant role in getting accurate prediction results Accordingto Figure 2 we preprocess the financial data according to thefollowing steps

Step 1(financial data processing) due to the differencesin the format of financial statements of differentcompanies the data sets have sparse data spaces )efinancial accounts with over 20 missing values arediscarded directly )e remaining default values arefilled with the average values of the previous and nextthree quarters Since the income statement and cashflow statement are process quantities representing theaccumulation of quarterly values the data of these twofinancial statements are differentially processed toobtain quarterly dataStep 2 (market data processing) since the stock marketdata set is monthly in order to match the data in thefinancial statements it is processed on a quarterlyaverage basisStep 3 (data screening) due to the differences in theformat of financial statements in different industriesfinancial data are divided into banking securities in-dustry insurance industry and genera business As thebanking insurance and security industries are allsubindustries under the financial industry their busi-nesses are complicated and are greatly affected by themacro impact resulting in the uncertainty and vola-tility of their stock prices far greater compared to thegeneral business )erefore prediction from their fi-nancial data alone is difficult Moreover after pre-processing it is found that the data of these industriesare too insufficient to make a prediction In view of thelack of reference in the forecast results these threeindustries are deleted from the data and only the dataof general business are retainedStep 4 (data splicing) we take the companyrsquos stock codeas the primary key and combine the financial data withmarket data of the same quarter into a wide table toprepare for feature engineeringStep 5 (feature engineering) this paper applied ma-chine learning models into stock return forecasting)e reason why machine learning can achieve highprediction accuracy is that it can deal with the non-linear relationship between variables Unlike simplelinear model the complexity of the model leads tomachine learning being regard as a ldquoblack boxrdquo Due tothe lack of interpretation it is contradicted by thetraditional financial industry In order to improve thecredibility of our model not only do we use the im-portance of variables to analyze the important influ-encing factors of stock price return but also we accordto the previous literature and construct the charac-teristics that have been proved to have strong signifi-cance in the previous multifactor model Table 1 showsthe calculation method and index implicationStep 6 (missing value processing and standardization)due to the high requirements of data integrity in stockforecast we delete the data with the missing rate morethan 10 For the remaining missing value based onthe idea of moving average we take the same fields ofthe three records before and after the missing record to

Mathematical Problems in Engineering 3

calculate the moving average value thus retaining thetrend information of the stock as much as possibleFinally in order to improve the convergence speed andprediction accuracy of the model the data arestandardized

32 Stock Pool Generation )is paper uses LightGBM al-gorithm under the sliding time window training method topredict the quarterly earnings of stocks and compares it withtraditional linear model support vector machine artificialneural network random forest and other machine learningmodels )e evaluation criteria of models are R2 and RMSEBased on the prediction results a stock pool composed of alimited number of stocks is generated and then the portfoliois constructed according to different position allocationmethods

321 Algorithm Principle LightGBM Compared with thegeneralized linear model machine learning as a new modeldoes not need to be based on the assumption that thevariables are independent and obey the distribution ofspecific functions and has greater flexibility and efficiencyMachine learning algorithms have unique advantages in

dealing with a large amount of data such as financial marketdata

Among many machine learning algorithms LightGBMalgorithm has the characteristics of high speed high accu-racy high stability and low memory space It has wideapplication space in the financial field with large amount ofdata and high requirements for prediction accuracy andstability LightGBM is essentially an enhanced gradientlifting tree that can be used for regression and classificationCompared with the previous gradient lifting tree (GBT) ithas the following advantages LightGBM uses the leaf-wise(best-first) strategy to grow a tree find the leaf with thelargest gain each time and split it while other nodes withoutthe maximum gain do not continue to split LightGBM doesnot need artificial trim so the result is relatively objectiveand stable At the same time LightGBM adopts histogramalgorithm and accumulates statistics in the histogramaccording to the value after discretization as the index tocalculate the information gain LightGBM also adopts twonew calculation methods Gradient-Based One-Side Sam-pling and Exclusive Feature Bundling greatly improving theaccuracy and efficiency of calculation

(1) Given the supervised training set X (xi yi)1113864 1113865n

i1the goal of LightGBM is to minimize the objectivefunction which is

Balance sheet

Market data

Previousincome statement

Currentincome statement

Currentcash flow statement

Previouscash flow statement

State Process

Difference

Financial data

Market value

Turnover value

Close price

Seasonal average

Figure 2 Data and data processing

Missing values

Data difference

Feature engineering

Feature selection

Sliding timewindow

Machine learningmodels

Equally weighted

Market-value weighted

Minimum-variance model

VaRCVaR

Data preprocessing

Stock pool selection

Allocation of stock positions Risk evaluation

Figure 1 Framework of the trading strategy system


J(ϕ) 1113944i

l 1113954yi yi( 1113857 (1)

where i denotes the i minus th sample and l(1113954yi yi) de-notes the prediction error of the i minus th sample

(2) For optimizing the objective function LightGBMuses gradient boosting method to train rather thanusing bagging method to directly optimize the wholeobjective function )e gradient boosting trainingmethod optimizes the objective function step by stepFirstly optimize the first tree and then optimize thesecond one and so on until the K tree is completed

(3) When generating a new optimal tree LightGBM usesthe leaf-wise algorithm with depth limitation to growvertically )erefore the leaf-wise algorithm is moreaccurate compared with the level-wise algorithmwhen they have the same number of splitting times

When the T minus th tree is generated every time a newlygenerated split node is added and then the objectivefunction can be obtained as follows

J 12

1113936iisinILgi1113872 1113873

2

1113936iisinILhi + λ

+1113936iisinIR

gi1113872 11138732

1113936iisinIRhi + λ

minus1113936iisinIgi( 1113857

2

1113936iisinIhi + λ⎡⎢⎢⎢⎣ ⎤⎥⎥⎥⎦ (2)

where gi z1113954y

(tminus1) l(yi 1113954y(tminus 1)) hi z21113954y

(tminus 1) l(yi 1113954y(tminus 1)) IL andIR are the sample sets of the left and right branchesrespectively

322 A Method for High-Dimensional Feature MatrixSparsity Exclusive Feature Bundling In order to get the dataof each quarter we carried out differential processing on thefinancial statements of process volume during data pre-processing However due to the lack of mandatory regu-lations on the quarterly reports of listed companies in Chinasome companies have the problem of time lag in thequarterly reports and there are a large number of zero valuesin the data of nonupdated adjacent quarterly reports afterdifferential processing At the same time due to businessdifferences between companies nonshared business ac-counts in financial statements sometimes are also zero

In conclusion the financial data of listed companies withlarge subjects and different businesses eventually form ahigh-dimensional feature matrix and a large number of zerovalues in the matrix lead to the problem of data sparsityTraditional statistical methods are often unable to extractenough effective information when dealing with high-di-mensional and sparse data which makes the predictionresults inaccurate or even wrong In order to solve thisproblem and improve the prediction accuracy we introduceExclusive Feature Bundling (EFF) method of LightGBM intothe stock pool selection module

Table 1 Features and implication

Index Calculation method Implication

ProfitabilityROE Net profitownerrsquos equity Return on equityROA Net profittotal assets Return on assetsROS Total profitoperating income Return on sales

FluidityDE Liabilitiesownerrsquos equity Debt to equity ratio

Cash ratio (Current assetsminus inventory)current liabilities Ratio of quick assets to current liabilitiesCurrent ratio Current assetscurrent liabilities Ratio of current assets to current liabilities

Operatingefficiency

Equity turnover Sales incomeaverage shareholdersrsquo equity )e efficiency of the company in using theownerrsquos assets

Asset turnoverSales incomeaverage total assets

Average balance of total assets openingbalance + closing balance2

An important financial ratio to measure theefficiency of enterprise asset management

Valuationindex

BM Outstanding stocklowast closing priceshareholderequity

Book-to-market ratioHigh BM considered to be undervalued by the

market resulting in high yield

PE Market price per common shareearnings percommon share per year

Price-to-earnings ratio)e lower the price earnings ratio is the lowerthe profitability of the market price relative to

the stock is

PB Share pricenet asset per share

Price-to-book ratio)e higher the investment value of stocks with

low market to net ratio is the lower theinvestment value is

Market value oflisted company

Market price per sharelowast total number of sharesissued

)e total value of shares issued by a listedcompany at market price

Turnover Trading volumetotal issued shares )e higher the turnover of a stock is the moreactive the stock is in trading

β Coefficient Systematic riskcoefficient

Regression of the historical rate of return of a singlestock asset to the index rate of return of the same

period

β describes the systemic risk of a fullydiversified portfolio


Exclusive Feature Bundling aims to solve the problem ofdata sparsity by merging mutually exclusive features toreduce the dimension of feature matrix [30] SinceLightGBM stores the features divided into discrete values byconstructing histogram instead of storing continuous valuesdirectly we can combine mutually exclusive features byassigning them to different intervals of the same histogram

For example Xa and Xb are two mutually exclusivefeatures where Xa isin [0 a) and Xb isin [0 b)

)e new bundling feature Xc can be obtained by addingthe value range of Xa as offset and value range of Xb whereXc isin [0 a + b)

Xa Xc Xb 0 when 0leXc lt a

Xa 0 Xb Xc minus a when aleXc lt a + b1113896 (3)

323 TrainingMethod Sliding TimeWindow Since the dataof the stockmarket is a time series the historical informationof the company has a great influence on the future stockprice Considering that the sequence has a great influence onthe values of the sequence nodes the prediction model instock pool selection should consider the key element ldquotimerdquoin the prediction process instead of treating the stock pricesof all times as the same data and randomly selecting thetraining set and the test set )erefore this system does notuse cross-validation but uses sliding time window to ran-domly simulate the prediction process

In the experiment suppose that the size of sliding timewindow was N quarters In a unit time window the first N-1quarter is the training set and the last quarter is the test set)e size of the time window should take into account thecharacteristics of the data set If it is too short the naturaltime period of the test set may be outside the training setand the time information brought by the time window willbe greatly reduced if it is too long some unnecessary noisemay be introduced

In essence the model in each time window is a newmodel and they are all independent of each other Accordingto Figure 3 the stock price of each stock in the next quarterin each time window is predicted according to all the in-formation in the latest quarter

324 Stock Selection )e goal of this paper is to apply a newmachine learning model LightGBM to the prediction ofstock return and to construct a low-risk and high-yieldportfolio compared with the stock price prediction modelsused in previous studies In order to highlight the risk ofdifferent portfolio construction methods the first N stocksare selected as the stock pool from the stock list sorted by theyield only the long purchase rule is allowed in this part toensure that the yield equals the required value )en adjustthe position of each stock according to different weights tofind the optimal portfolio

33 Stock Position Allocation Method )is paper uses threemethods of stock position allocation (1) equal-weight method(2) market-value weightingmethod and (3) minimum-variance

method of mean-variance model with CVaR constraintCombined with the sliding time window training method topredict the quarterly earnings of the stock we useR-squared andRMSE of the model as the evaluation criteria while in thetraditional linearmodel support vectormachine artificial neuralnetwork random forest and other machine learningmodels areused Based on the prediction results a stock pool composed of alimited number of stocks is generated and then the portfolio isconstructed according to different position allocation methods

331 Method 1 Equal-Weight Method Each stock isassigned the same weight If there are n stocks in the stockpool the weight of each stock is wi 1n

332 Method 2 Market-Value Weighting Method )e ratioof market value of stock i to themarket value of all n stocks inthe stock pool is taken as the weight of this stock and thecalculation formula is as follows

wi Market valuei

1113936ni1 Market valuei

(4)

where Market valuei is the market value of the company atthe closing of the previous period and it is calculated bymultiplying the market price of each share by the totalnumber of shares issued

333 Method 3 Minimum-Variance Weight Method of theMean-Variance Model with CVaR Constraint In 1952Markowitz published the beginning article of modernportfolio theory ldquoportfolio selectionrdquo in the financial mag-azine which studies how to allocate risk assets effectivelyMarkowitz believed that investors only consider two factorsof expected return and standard deviation of forecast whenmaking portfolio decision so portfolio decision was mainlybased on the following two points (1) when the investmentreturn is the same investors want to minimize the risk (2)when the risk is the same investors want to maximize theincome According to the principle of mean-variance effi-ciency the optimal portfolio can be expressed by mathe-matical programming in the process of investing in assets

Assuming that the return of risk assets obeys normaldistribution consider CVaR constraint in Markowitz mean-variance model and then the portfolio optimization modelbased on CVaR constraint is

min σ2p minXT 1113936 X

st CVaRβ C2(β)σp minus E rp1113872 1113873leL

E rp1113872 1113873 XTR

XTI 1 I (1 1 1)T

xiisin [0 1]

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)

where C2(β) ϕ(Φminus1(β)) R (R1 R2

Rn) Ri E(ri)is the expected return of i minus th stock and X

(x1 x2 xn)T is the weight of each portfolio We have thefollowing


(1) According to the constraint equation CVaRβ

C2(β)σP minus E(rP)le L and we obtain

CVaRβ C2(β)σP minus E rP( 1113857 L (6)

(2) After transformation we obtain

σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)

(3) Define RP E(rP) XTR and σ2P XT 1113936 X equa-tion (7) can be

XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)

(4) Since the rank of the system of linear equation (8) isn minus 2 the number of its basic solutions is 1 )at is tosay x2 x3 xnminus1 can be linearly expressed by x1(obtained by elimination method) Since 1113936

ni1 xi 1

xn can also obviously be expressed by x1 bysubstituting x2 x3 xnminus1 into equation (8) we canget a quadratic equation of x1 and then we canobtain x2 x3 xnminus1 by calculating x1

(5) )erefore if x1 has no solution the constraint linedoes not intersect the mean-variance effective frontwhich means CVaR does not play a constraint role ifx1 has two multiple roots there is only one inter-section A between the constraint line and the mean-variance effective front and the weight of A is x1 ifthere are two different roots the constraint line andthe mean-variance effective front have two inter-sections A and B )e expected return and varianceof portfolio at A and B are RA RB σ2A and σ2B

34 Risk Evaluation )is paper mainly studies how to re-duce the risk of portfolio At present the most commonmeasurement methods of portfolio risk are sensitivity

method volatility method VaRmethod and CVaR method)ey are introduced respectively as follows

341 Method 1 Sensitivity Method Sensitivity method is amethod to measure the risk of financial assets by using thesensitivity of the value of financial assets to market factorsSensitivity refers to the percentage change in the value offinancial assets when market factors change by a percentageunit )e greater the sensitivity of financial assets is thegreater the impact of market factors is and the greater therisk is)e sensitivity of different types of financial assets hasdifferent names and forms such as the duration and con-vexity of bonds beta of stocks Delta Gamma and Vega ofderivative assets )e problems of sensitivity method lie inthe following (1))e sensitivity is only valid when the rangeof market factors changes is very small (2) A certain sen-sitivity concept is only applicable to a certain class of specificassets or a certain class of specific market factors whichmakes it difficult to compare the risks among different kindsof assets (3) )e sensitivity is only a relative proportionconcept which cannot determine the risk loss of a certainportfolio body value

342 Method 2 Volatility Method Volatility method is astatistical method which is usually described by standarddeviation or covariance Variance represents the volatility ofthe actual rate of return deviating from the average rate ofreturn )e greater the volatility the greater the uncertaintyof the actual rate of return regardless of whether the actualrate of return is higher than the average rate of return orlower than the actual rate of return One of the main defectsof variance measurement of investment risk is that variancerepresents positive and negative deviation generallyspeaking investors do not want that the actual return is lessthan the expected return but they do not refuse when theactual return is higher than the expected return )ereforeMarkowitz put forward the semi variance method in 1959that is the part of the actual income higher than the expectedincome is not included in the risk and only the loss isincluded )ere are some problems in both variance methodand semi variance method (1) )e method is based on the

2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3

Feature windowLabel window

Figure 3 Sliding time window model


assumption that variance exists but whether the variance ofreturn rate exists is still questionable (2) Variance impliesthe assumption of normality so only the linear correlationstructure between risks can be analyzed In reality the riskdependence structure may be a nonlinear complex structure(3) )e square deviation does not specifically indicate howmuch the loss of the portfolio is (4) )is method is notsuitable for comparing the risk of assets with different ex-pected return

343 Method 3 VaR Method VaR is Value at Risk [31]VaR is the maximum possible loss expected in the holdingperiod of an investment within a certain confidence level Itsmathematical expression is as follows

Prob(ΔpleVaR) α (9)

where Δp is the loss amount of the portfolio during theholding periodΔt which is the Value at Risk under the givenfixed credit level α that is the upper limit of possible loss)e meaning of the expression of the above formula is thatthe risk loss of the portfolio is not less than the VaR at thelevel of probability In the research the above expression isregarded as a function of VaR on α and the probabilitydistribution function of the portfolio return is expressedwith F(α) which means

VaRα Fminus1

(α) (10)

)e advantages of VaR method are the following (1))emeasurement of risk is simple and clear the risk mea-surement standard is unified and it is easy for managers andinvestors to understand and grasp (2) VaR can also be usedto compare the risks of different types of financial assets butits disadvantage lies in the inconsistency

344 Method 4 CVaR Method CVaR is a new risk mea-surement method proposed by Roekafellor and Uryasev(1999) also known as Conditional Value at Risk methodwhich means that under a certain confidence level the lossof portfolio exceeds themean value of a given VaR reflectingthe average level of excess loss Its mathematical expressionis as follows

CVaRα E minusX | minus XgeVaRα( 1113857 (11)

where minusX represents the random loss of the portfolio andVaRα is the Value at Risk under the confidence level

)e advantages of CVaR are as follows (1) It solves theproblem of inconsistency measurement satisfies the addi-tivity of risk and improves the defect of VaR (2) It does notneed to realize the form of assumed distribution and in anycase its calculation can be realized by simulation (3) It fullymeasures tail loss and calculates the average value of tail losswhich considers all tail information larger than VaR ratherthan based on a single quantile to calculate

)erefore this paper mainly measures the risk of stockportfolio based on the VaR and CVaR Monte Carlo sim-ulation method is used in the specific calculation methodwhich is the most effective method to calculate VaR and

CVaR as it can solve the nonlinear relationship of varioustargets well without making assumptions on the distributionof portfolio income

4 Empirical Research

41 Experimental Results and Analysis )e data used in thispaper are the market data of 3676 A-share listed companiesin China from 2008 to 2018 as well as the financial datadisclosed by the company on a quarterly basis (including thecompanyrsquos balance sheet profit statement and cash flowstatement)

After data preprocessing and feature engineering theprocessed data from the fourth quarter of 2008 to the firstquarter of 2018 are selected as the final data set

Based on the total split times of features the top 20variables in the variable importance score obtained by ourtrading strategy system are as follows

According to Figure 4 the factor affecting the next stockpricemost significantly is the closing price of the current stock(CLOSE_PRICE) According to Charles Dowrsquos technicalanalysis theory the historical price of the stock contains a lotof information )e price will evolve in the way of trend andthe history will always be repeated because of human psy-chology and market behavior By studying the historical priceof the stock investors can find out the current market and thetrend so as to better detect the stock selection target and theopportunity to build a position)erefore the closing price ofthe current stock is highly related to the next stock pricewhich is the most important variable to predict the stockprice )e second in the list is accounts payment (AP) Inorder to expand sales and increase market share enterprisesoften buy materials and accept services first and then payservice fees and commissions )e time difference betweensales and payment also reflects the risk that enterprises may beshort of funds and cannot pay in time )erefore AP is animportant reference index for investment)e third index thegrowth rate of the previous periodrsquos stock price (Price_rate)has a great contribution to the prediction of the stock price Ithas a positive correlation with the growth of the currentperiodrsquos stock price Generally the higher the growth rate ofthe previous periodrsquos stock price is the higher the growth ofthe stock is and the greater the possibility of continuousgrowth in the current period is

Other financial indicators such as the balance of cashand cash equivalents at the beginning of balance(N_CE_BEG_BAL) turnover (TURNOVER_VALUE) cashpaid for goods purchased and services received(C_PAID_G_S) surplus reserve (SURPLUS_RESER) andreturn on assets (ROA) also have impact on the stock price)e balance of cash and cash equivalents at the beginning ofthe balance refers to the amount of cash and cash equivalentscarried over from the previous year to the current year forcurrent turnover It reflects the cash stock of the enterpriseTurnover rate measures the frequency of stock turnover inthe market within a certain period of time reflecting theactivity of market trading investment Its calculation for-mula is as follows turnover rate trading volumetotalnumber of shares issued )e higher the turnover of a stock


is the more active the stock is )e cash paid for purchasinggoods and receiving services is the main cash flow generatedin the business activities of industrial and commercial en-terprises which reflects the status of the main business )esurplus reserve is the accumulation of earnings that theenterprise keeps in the enterprise from the after tax profits Itcan be used to expand production and operation increasecapital (or share capital) or distribute dividends which has adirect impact on the stock price )e return on assets is animportant index to measure the profitability of a companyrelative to its total assets )e calculation method is asfollows return on assets net profittotal assets )e higherthe index is the better the asset utilization effect of theenterprise is indicating that the enterprise has achievedgood results in increasing revenue and saving capital use Itcan be seen that LightGBM considers a variety of financialindicators comprehensively and the importance score alsoprovides a reference for the research and analysis of long-term investment in enterprise value

It is noteworthy that among the top 20 factors obtainedby the model 40 of the factors are constructed by usaccording to the fundamental theory On the one hand itreflects the scientific nature of combining the fundamentalaspect theory with the pure machine learning method Onthe other hand it can also be seen that in the previousliterature the prediction of stocks completely depending onindividual fundamental factors may cause inaccuracy

411 Comparisons of Models In order to compare the ac-curacy of LightGBM and other algorithms this paper usesGLM (generalized linear model) DNN (deep neural net-work) RF (random forest) SVM (support vector machine)

and LightGBM to predict the next stock price of eachquarter )e results are shown in Figure 5 R-squaredmeasures the goodness of fit which is equal to the ratio of thesum of squares of regression to the total sum of squares )ecloser R-squared is to 1 the better the fitting degree ofregression model is RMSE represents the root mean squareerror)e smaller RMSE is the more accurate the predictionis It can be seen from the left that the R-squared underLightGBM model is 0798 that is this model can solve thevariation of 798 of the stock price which is higher than theother four methods and indicates that LightGBM is the bestto fit the stock price It is noteworthy that R-squared of thelinear model is almost 0443 and the correlation is veryweak It is speculated that the reason is that the stock priceand a large number of factors do not satisfy the linear re-lationship at the same time )e linear model is only ap-plicable to the model composed of a few fundamentalfactors Even if it contains the most influencing factors asmuch as possible such a model still lacks explanation forexcess return It can be seen from (b) that the RMSE ofLightGBM model is 61829 which is lower than the otherfour methods also showing the high accuracy of LightGBM

412 Risk Assessment of Models In order to compare therisk of the portfolio of GLM DNN RF SVM and LightGBMunder equal-weight allocation method market-valueweighting method and minimum-variance weight methodof mean-variance model with CVaR constraint the top Nstocks with the highest investment income in stock pool arecalculated under these 15 conditions respectively and theVaR and CVaR of portfolio investment are calculated at theconfidence level of α 5 as shown in Figure 6

77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP

Current RatioIntant Assets

T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA

C_Paid_G_SSurplus_Reser

Turnover_ValueN_CE_Beg_Bal

Price_RateAP

Close_Price

Feat

ures

Feature importance by split

50 100 150 200 2500Feature importance

Figure 4 Variable importance score


In the selection of position allocation methods it can beseen from Figure 6 that whether the risk is measured by VaRor CVaR minimum-variance weight method is more able tominimize the risk of the portfolio and reduce the losscompared to the other two allocation methods At the same

time the overall ranking of algorithms under VaR and CVaRis basically the same It is because CVaR is based on VaR andCVaR is the optimization of VaR in risk measurement )etwo are highly correlated andmeet the expectations thus themodel results are reasonable

ndash000149

ndash000117

ndash000089

ndash000148ndash000141

ndash000116

ndash000157 ndash000152

ndash000122



ndash000148

ndash000073

Unweighted Market_value_weighted Min_variance_weighted

GLMSVMDNN

RandomForestLightGBM

ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)

Figure 6 Risk assessment of 5 algorithms under 3 position allocation methods (a) Var (b) CVaR

0443

07232 0697106209

0798

GLM DNN RF SVM LightGBM

R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)

Figure 5 R-squared and RMSE comparisons of five models


In algorithm selection LightGBM has lower risk com-pared to the other four models under minimum-varianceweight method of mean-variance model with CVaR con-straint Its VaR is minus00073 and CVaR is minus002853 whichmeans that under the normal fluctuation of the stockmarket the probability that the return of the optimalportfolio declines by more than 073 due to market pricechange is 5 and the expected loss of the whole stockportfolio is 2835 Under equal-weight allocation methodLightGBM also has lower risk than the other four modelsVaR is minus000124 and CVaR is minus006324 which shows that itcan significantly reduce the risk of portfolio and make theexpected return of investors more stable than linear modelsand traditional machine learning algorithms It is note-worthy that LightGBM is not the best under market-valueweighting method )e reason may be that market-valueweighting method pays too much attention to the stocks

with higher market value and does not consider the in-vestment value of small- and medium-sized stocks in thestock market well However investors in real market oftentend to invest in potential stocks with small market value atthe early stage of growth so the reference value of the resultof market-value weighting method is limited On balanceLightGBM has a higher risk reduction effect in a morepractical situation

413 Yield Analysis In this paper we calculate the quarterlyyield of stocks in the stock pool based on LightGBM underthree position allocation methods and use it to track CSI 300index in Chinarsquos stock market from 2009 Q4 to 2018 Q1From Figure 7 it is obvious that the yield obtained by threeposition allocation methods can outperform the market andit is more likely to ensure a considerable yield in a bear

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season

CSI 300Marketvalue_weighted

Minvariance_weightedUnweighted

ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate

Figure 7 Market index tracking results

0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986

Random training times

R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085

Figure 8 1000 repetitive tests of 5 models


market Moreover compared with market-value weightedmethod and equal-weight method the optimal portfolioconstructed by minimum-variance weight method has astable performance in the actual market )e combinationtracking error constructed by market-value weightingmethod is the largest and its performance is not as good asthe other two methods in the bear market )erefore theportfolio constructed by minimum-variance weight methodof mean-variance model with CVaR constraints can con-tinuously obtain relatively stable excess income

414 Model Robustness Comparison )e data structure isdifferent for different stock markets In order to reflect theinfluence of different data structures on the model proposedin this paper we use stochastic simulation technology toverify the robustness In order to keep these dependencystructures [17] this paper uses repeated sampling technol-ogy to do random simulation In order to compare therobustness and generalization ability of models we ran-domly select 1 yearrsquos original data each time and conductrepeated regression predictions1000 times and the resultsare shown in Figure 8 According to the comparison of curvevolatility LightGBM can still maintain a lower volatilitywhile maintaining a higher goodness of fit regardless of theaccuracy of prediction or the robustness of the algorithmLightGBMrsquos regression effect on this type of data set issignificantly better compared to the other models

5 Conclusion

)is paper takes the financial risks and returns of the stockmarket as the research object and uses the method of ma-chine learning and data mining to build a financial tradingstrategy system based on LightGBM During data pre-processing and feature engineering we construct multiplevariables that have been proved to have strong significantfeatures in previous multifactor models Experimented onthe market data of Chinarsquos A-share listed companies theprediction model in this system is trained to predict thestock price of next quarter)emethod will perform in otherstock markets )e variable importance score affecting thenext stock price is also generated and the accuracy of oursystems is compared with GLM DNN RF and SVMmodelAt the same time VaR CVaR and quarterly return of theportfolio based on LightGBM are calculated and comparedwith the market index

)e results show the following

(1) Compared with the traditional linear model ma-chine learning models do not need to be based on theassumption that the variables are independent andobey the distribution of specific functions and theyhave greater advantages in dealing with big data infinancial market )e result of LightGBM is 0798 forR-squared and 61829 for RMSE which is muchbetter than GLM )e prediction error of LightGBMis also significantly smaller than that of the othermachine learning models which shows thatLightGBM has high accuracy

(2) Compared with equal-weight method and market-value weight method the portfolio under minimum-variance weight method of mean-variance modelwith CVaR constraint has the best risk aversioneffect At the same time the three position allocationmethods can outperform the market and are morelikely to ensure a considerable yield in a bear marketIn general the portfolio constructed by minimum-variance weight method of mean-variance modelwith CVaR constraint has the best stability and yieldfollowed by market-value weighting method andequal-weight method

(3) In this paper we generate feature importance scoreto find the most important factor affecting the nextstock price )e three most influencing factors arethe closing price of the current stock accountspayable and the growth rate of stock price Otherfinancial indicators like the balance of cash and cashequivalents at the beginning of schedule turnoverrate and cash paid for goods and services etc alsohave great impact on the stock price

However there is still room for improvement in thispaper (1))e experimental data in this paper is the quarterlydata of stocks which is of great value for the long-termstrategy In the future we can try to use the monthly data ofstocks or even the daily data (2) When we allocate theposition of stock we do not think about shorting and theweights are limited between 0 and 1 so we can try to add theshort strategy in the later research (3) We choose the mean-variance portfolio model with CVaR constraint for positionallocation but the variance itself is inconsistent In thefuture we can consider using the mean-CVaR model tocalculate the weight of the portfolio

Data Availability

All stock return data are available

Conflicts of Interest

)e authors declare that they have no conflicts of interest

References

[1] W F Sharpe and F William ldquoCapital asset prices a theory ofmarket equilibrium under conditions of riskrdquo lte Journal ofFinance vol 19 no 3 pp 425ndash442 1964

[2] J Mossin ldquoEquilibrium in a capital asset marketrdquo Econo-metrica vol 34 no 4 pp 768ndash783 1966

[3] S A Ross ldquo)e arbitrage theory of capital asset pricingrdquoJournal of Economic lteory vol 13 no 3 pp 341ndash360 1976

[4] E F Fama and K R French ldquo)e cross-section of expectedstock returnsrdquo lte Journal of Finance vol 47 no 2pp 427ndash465 1992

[5] E F Fama and K R French ldquoCommon risk factors in thereturns on stocks and bondsrdquo Journal of Financial Economicsvol 33 no 1 pp 3ndash56 1993

[6] V T Datar N Y Naik and R Radcliffe ldquoLiquidity and stockreturns an alternative testrdquo Journal of Financial Marketsvol 1 no 2 pp 203ndash219 1998


[7] J D Piotroski ldquoValue investing the use of historical financialstatement information to separate winners from losersrdquoJournal of Accounting Research vol 38 pp 1ndash41 2000

[8] R Novy-Marx ldquo)e other side of value the gross profitabilitypremiumrdquo Journal of Financial Economics vol 108 no 1pp 1ndash28 2013

[9] G Aharoni B Grundy and Q Zeng ldquoStock returns and themiller modigliani valuation formula revisiting the FamaFrench analysisrdquo Journal of Financial Economics vol 110no 2 pp 347ndash357 2013

[10] E F Fama and K R French ldquoA five-factor asset pricingmodelrdquo Journal of Financial Economics vol 116 no 1 2014

[11] H Markowitz ldquoPortfolio selectionrdquo lte Journal of Financevol 7 no 1 pp 77ndash91 1952

[12] M T Greene and B D Fielitz ldquoLong-term dependence incommon stock returnsrdquo Journal of Financial Economicsvol 4 no 3 pp 339ndash349 1977

[13] S-R Yang and B W Brorsen ldquoNonlinear dynamics of dailyfutures prices conditional heteroskedasticity or chaosrdquoJournal of Futures Markets vol 13 1993

[14] J Fajardo A R Farias and J R H Ornelas ldquoGoodness-of-fittests focus on value-at-risk estimationrdquo Brazilian Review ofEconometrics vol 26 no 2 pp 309ndash326 2006

[15] B Mandelbrot ldquo)e variation of certain speculative pricesrdquolte Journal of Business vol 36 no 4 pp 394ndash419 1963

[16] M Y Romanovsky ldquoTruncated levy distribution of sp500stock index fluctuations Distribution of one-share fluctua-tions in a model spacerdquo Physica A vol 287 no 3-4pp 450ndash460 2000

[17] J Zhang Y T Xie and J Yang ldquoRisk dependence consistensyrisk measurement and portfolio based on mean-copula-CVaR modelsrdquo Journal of Financial Research vol 10pp 159ndash173 2016

[18] A Fan and M Palaniswami ldquoStock selection using supportvector machinesrdquo in Proceedings of the International JointConference on Neural Networks (Cat No01CH37222) vol 3IEEE Washington DC USA pp 1793ndash1798 2001

[19] K Kim ldquoFinancial time series forecasting using supportvector machinesrdquo Neurocomputing vol 55 no 1 2003

[20] Y T Xie J Yang and W Wang ldquoComparison analysis onlogistic regression and tree models based on response ratio ofcredit mail advertisingrdquo Statistics amp Information Forumvol 26 no 6 pp 96ndash101 2011 in Chinese

[21] W Huang Y Nakamori and S-Y Wang ldquoForecasting stockmarket movement direction with support vector machinerdquoComputers amp Operations Research vol 32 no 10 pp 2513ndash2522 2005

[22] B B Nair V P Mohandas and N R Sakthivel ldquoA decisiontree- rough set hybrid system for stock market trend pre-dictionrdquo International Journal of Computer Applicationsvol 6 no 9 pp 1ndash6 2010

[23] M Zhu D Philpotts and M J Stevenson ldquo)e benefits oftree-based models for stock selectionrdquo Journal of AssetManagement vol 13 no 6 pp 437ndash448 2012

[24] M Kumar and M )enmozhi ldquoForecasting stock indexreturns using arima-SVM arima-ANN and arima-randomforest hybrid modelsrdquo International Journal of BankingAccounting and Finance vol 5 no 3 p 284 2014

[25] S Bogle and W Potter ldquoUsing hurst exponent and machinelearning to build a predictive model for the Jamaica frontiermarketrdquo in Transactions on Engineering Technologiespp 397ndash411 Springer Singapore 2016

[26] Y T Xie and Z X Li ldquoExtension of bonus-malus factor basedon joint pricing modelsrdquo Statistics amp Information Forumvol 30 no 6 pp 33ndash39 2015

[27] H Yan ldquoIntegrated prediction of financial time series databased on deep learningrdquo Statistics amp Information Forumvol 35 no 4 pp 33ndash41 2020

[28] Z Tang R Zhu P Lin et al ldquoA hardware friendly unsu-pervised memristive neural network with weight sharingmechanismrdquo Neurocomputing vol 332 pp 193ndash202 2019

[29] Z Chang Y Chen S Ye et al ldquoFully memristive spiking-neuron learning framework and its applications on patternrecognition and edge detectionrdquo Neurocomputing vol 403pp 80ndash87 2020

[30] G L Ke Q Meng T Finley et al ldquoLightGBM a highlyefficient gradient boosting decision treerdquo Advances in NeuralInformation Processing Systems vol 30 pp 3149ndash3157 2017

[31] P Jorion ldquoRisk 2 measuring the risk in value at riskrdquo Fi-nancial Analysts Journal vol 52 no 6 pp 47ndash56 1996


latest machine learning algorithms LightGBM into the fieldof portfolio LightGBM does not need to consider thespecific distribution form of financial data and it can capturethe nonlinear relationship between pricing factors and theExclusive Feature Bundlingmethod can solve the problem ofsparsity of high-dimensional characteristic matrix and im-prove the prediction accuracy of stock returns Secondly thenew system can be used to generate importance score offactors It directly shows the impact variables have on stockreturn which has important practical significance for stockselection )irdly for stock position allocation module weconstruct minimum-variance weight method of mean-var-iance model with CVaR constraint and test the tradingstrategy system on the real data of Chinarsquos A-share market)e experiment result shows that the system can bring stableexcess return to investors and provides a logical basis andpractical effect for Chinarsquos stock market

2 Literature Review

SinceMarkowitz proposed mean-variance model to quantifythe risk in 1952 scholars have been trying to find idealmodels for stock pricing In 1960s the CAPM model pro-posed by Sharpe John Lintner and JanMossin expresses therelationship between expected return and expected risk as asimple linear relationship [1 2] Ross [3] put forward theArbitrage Pricing )eory on this basis revealing that stockreturn is affected by multiple factors rather than a singlefactor Fama and French [4 5] proposed a three-factormodel which includes market value book value ratio and PE ratio of listed companies as compensation for the riskfactors that β cannot reflect However the classic modelsabove were lacking explanation in the empirical test At thebeginning scholars attributed that to the omission of ex-planatory variables so they successively put forward factorsthat may lead to excess return For example Datar et al [6]used the data of NYSE for empirical analysis and found thatthere was a significant negative correlation between stockreturn and stock turnover Piotroski [7] selected nine in-dicators from the perspective of the companyrsquos financialindicators including profitability robustness and growthBy comprehensively scoring each indicator he established astock pool and selected the stocks according to the scoresNovy-Marx [8] proved that there is a strong correlationbetween the companyrsquos profitability and stock returnAharoni et al [9] also found that the companyrsquos investmentlevel and stock return are remarkably correlated In 2014Fama and French added Robust Minus Weak (RMW) asprofit factor and Conservative Minus Aggressive (CMA) asinvestment factor based on the three-factor model and putforward five-factor model to further explain the stock return[10]

However none of the models above deviated from theresearch framework of linear asset pricing under thebackground of small sample data that is extract excessreturn factors from historical data and then use these factorsas independent variables to construct a linear model toevaluate the investment value of stocks

Since the 1970s with the increasing availability of em-pirical data in the financial market and the improvement ofcomputer technology scholars have found several abnormalphenomena in the financial market through empirical re-search)ese abnormal phenomena are contrary to the basicassumption of CAPM financial market data obey normaldistribution have no long memory and satisfy the linearmodel which challenges the traditional model [11] For thestock market Greene and Fielitz [12] performed a test andconfirmed that the American stock market has the feature oflong memory which showed that even if the time interval isvery long it still had significant autocorrelation that is thehistorical events would affect the future for a long time Onthe one hand it proved the importance of historical in-formation and the predictability of return on the otherhand it also reflected the nature of nonlinear structure ofstock market As for the distribution of financial datascholars pointed out that in reality the distribution of fi-nancial data is usually characterized by thick tail andasymmetry [13] )erefore the traditional use of normaldistribution to fit the actual financial data has limitationsFor example in VaR calculation due to the thick tail offinancial data distribution the calculation under the as-sumption of normal distribution will lead to huge errors[14] In order to find the most reasonable distribution hy-pothesis Mandelbrot [15] proposed replacing the normaldistribution of financial data with the stable distributionHowever because the tail of the stable distribution is usuallythicker than the actual distribution some scholars proposedusing the truncated stable distribution as the distribution ofsecurities returns [16] but where to cut off had becomeanother question

In recent years in order to analyze and predict financialdata more accurately machine learning has received wideattention from scholars Compared with the traditionalmodel machine learning has a unique advantage in dealingwith financial data First of all it can automatically identifythe hidden features behind the financial data reducinghuman intervention Secondly the traditional linear assetpricing model is based on the assumption that the financialsystem is linear However scholarsrsquo research on the non-linear characteristics of financial time series such as longmemory and nonpairing distribution indicates that thestock market system is actually a dynamic system with linearand nonlinear information Machine learning models candeal with high-dimensional and collinear factors and are notlimited to the probability distribution of investment incomeMachine learning models do not need to calculate high-dimensional covariance matrix [17]

Mukhejee (1997) firstly proposed the application ofsupport vector machine in nonlinear chaotic time serieswhich provided the basis for the application of stock seriesIn the empirical study Fan and Palaniswami [18] first ap-plied the support vector machine model to stock selectionBased on the data of Australian stock exchange the modelthey constructed could identify the stocks that outperformthe market and the five-year yield of the equal weightportfolio constructed was 208 which was far higher thanthe benchmark return of the large market Kim [19] used
















Balance sheet

Market data





State Process

Difference

Financial data

Market value

Turnover value

Close price

Seasonal average


Missing values

Data difference

Feature engineering

Feature selection

Sliding timewindow


Equally weighted



VaRCVaR

Data preprocessing





J(ϕ) 1113944i

l 1113954yi yi( 1113857 (1)





J 12

1113936iisinILgi1113872 1113873

2


+1113936iisinIR

gi1113872 11138732



2

1113936iisinIhi + λ⎡⎢⎢⎢⎣ ⎤⎥⎥⎥⎦ (2)

where gi z1113954y










Operatingefficiency





Valuationindex






the stock is










period
















wi Market valuei


(4)






E rp1113872 1113873 XTR

XTI 1 I (1 1 1)T

xiisin [0 1]

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)









σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)


XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)


ni1 xi 1







2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3








VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References
















































Balance sheet

Market data





State Process

Difference

Financial data

Market value

Turnover value

Close price

Seasonal average


Missing values

Data difference

Feature engineering

Feature selection

Sliding timewindow


Equally weighted



VaRCVaR

Data preprocessing





J(ϕ) 1113944i

l 1113954yi yi( 1113857 (1)





J 12

1113936iisinILgi1113872 1113873

2


+1113936iisinIR

gi1113872 11138732



2

1113936iisinIhi + λ⎡⎢⎢⎢⎣ ⎤⎥⎥⎥⎦ (2)

where gi z1113954y










Operatingefficiency





Valuationindex






the stock is










period
















wi Market valuei


(4)






E rp1113872 1113873 XTR

XTI 1 I (1 1 1)T

xiisin [0 1]

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)









σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)


XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)


ni1 xi 1







2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3








VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References


































J(ϕ) 1113944i

l 1113954yi yi( 1113857 (1)





J 12

1113936iisinILgi1113872 1113873

2


+1113936iisinIR

gi1113872 11138732



2

1113936iisinIhi + λ⎡⎢⎢⎢⎣ ⎤⎥⎥⎥⎦ (2)

where gi z1113954y










Operatingefficiency





Valuationindex






the stock is










period
















wi Market valuei


(4)






E rp1113872 1113873 XTR

XTI 1 I (1 1 1)T

xiisin [0 1]

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)









σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)


XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)


ni1 xi 1







2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3








VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References















































wi Market valuei


(4)






E rp1113872 1113873 XTR

XTI 1 I (1 1 1)T

xiisin [0 1]

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(5)









σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)


XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)


ni1 xi 1







2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3








VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References






































σ2P 1

C2(β)( 11138572 E rP( 1113857 + L( 1113857

2

1C2(β)( 1113857

2

middot E2

rP( 1113857 + 2LE rP( 1113857 + L2

1113872 1113873

(7)


XT

1113944 X 1

C2(β)( 11138572 X

TR1113872 1113873

2+ 2L X

TR1113872 1113873 + L

21113874 1113875

(8)


ni1 xi 1







2009Q1 2009Q2 2009Q3 2009Q4 2010Q1

2009Q2 2009Q3 2009Q4 2010Q22010Q1

2009Q3 2009Q4 2010Q32010Q1 2010Q2

Dataset 1

Dataset 2

Dataset 3








VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References






































VaRα Fminus1

(α) (10)




















77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References







































77889991111

141617

212222

2527

4248

235

N_CF_FR_Finan_ACIP


T_Compr_IncomeAR

Income_TaxROS

PB RatioT_CA

Cash_C_EquivROEROA



Price_RateAP

Close_Price

Feat

ures







ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References




































ndash000149

ndash000117

ndash000089


ndash000116


ndash000122



ndash000148

ndash000073


GLMSVMDNN


ndash00018

ndash00015

ndash00012

ndash00009

ndash00006

ndash00003

0

(a)

GLMSVMDNN


ndash007213

ndash005329

ndash003487


ndash00517


ndash005773



ndash007338

ndash002853


ndash008

ndash006

ndash004

ndash002

0

(b)


0443

07232 0697106209

0798


R-squared of models

0010203040506070809

(a)

75978 7207580994 81934

61829


RMSE of models

0123456789

(b)






0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References





































0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32Season



ndash05

ndash03

ndash01

01

03

05

07

09

11

Gro

wth

rate


0 17 34 51 68 85 102

119

136

153

170

187

204

221

238

255

272

289

306

323

340

357

374

391

408

425

442

459

476

493

510

527

544

561

578

595

612

629

646

663

680

697

714

731

748

765

782

799

816

833

850

867

884

901

918

935

952

969

986


R-sq

uare

d

DNNGLMLightGBM

RFSVM

035

04

045

05

055

06

065

07

075

08

085





5 Conclusion







Data Availability




References




































5 Conclusion







Data Availability




References


































researcharticle financial trading strategy system based on

Documents