
The Ninth Chinese Information Fusion Conference, Taiyuan, China, September 2019

The algorithm is complex, its adaptability and real-time performance are poor, and the detection result depends on the accuracy of the algorithm. Papers [8][9] use artificial intelligence and machine learning to detect and identify Morse code, but this approach has high complexity and requires large sample sets for training.

Current detection algorithms process the CW signal directly, without signal synchronization, and the recognition stage generally uses a hard decision threshold. However, the short-wave ionospheric channel exhibits severe random multipath, which causes not only fading of the signal amplitude but also interference, distortion, and multipath time delay. The synchronization performance of the CW telegraph receiver is therefore critical to the communication system. Because current algorithms cannot cope with noise of unknown, time-varying intensity, the BER is high and communication performance is poor[10]. The automatic detection and identification of CW telegraphs is therefore a focus of current tactical communication development.

This paper proposes an algorithm for automatic recognition of CW telegraphs based on the optimal estimation property of the Kalman filter. The CW telegraph signal is synchronized and segmented in the time domain by a self-synchronization method[8]. The signal energy at the CW characteristic frequency is obtained by the Goertzel algorithm, and an adaptive energy threshold set by a Kalman filter is then used to judge the energy value of the CW signal. This realizes synchronous detection and adaptive recognition of the CW signal under a strong noise background while guaranteeing the recursiveness and real-time performance of the algorithm.

1 CW signal

The wireless shortwave radio receiver is tuned to the CW signal carrier frequency for telegraph reception. Because of the multipath effect of the actual high-frequency ionospheric channel and the influence of various noises in the channel, the high-frequency CW signal is modeled as a sinusoidal signal whose frequency is known and whose amplitude and phase are unknown:

$$S(t) = A e^{j(2\pi f_0 t + \varphi)} \ \text{or}\ 0, \quad t \in [0, T] \tag{1}$$

The CW signal frequency $f_0$ is usually set in the range of 300 Hz to 3400 Hz. The message information in CW communication is mainly Morse code. Morse code[9] is a time-interrupted signal code consisting of two basic signals ('dot' and 'dash') and intervals of different lengths; different orderings express different English letters, numbers, and punctuation. Because of its conciseness, low cost, and high efficiency, it still occupies an important position in increasingly advanced communication technology. The 'dot' pulse duration $T_s$ is related to the communication code rate in characters per minute (CPM). Two typical reference words are in common use: "PARIS" and "CODEX". PARIS mimics a word rate that is typical of natural-language words and reflects the benefit of Morse code's shorter code durations for common characters such as 'e' and 't'. CODEX offers a word rate that is typical of 5-letter code groups (sequences of random letters). With PARIS as the standard, one word contains 50 dot units, so the dot length at 20 words per minute is 60/(50 × 20) s = 60 ms. With CODEX, one word contains 60 dot units, so the dot length at 20 words per minute is 60/(60 × 20) s = 50 ms. This paper uses "CODEX" as the standard for communication. From $T_s = 5/\mathrm{CPM}$ (s), at a speed of 100 CPM the dot length $T_s$ is 50 ms. In Fig.1, the characteristic frequency $f_0$ of the CW signal is 1000 Hz, the sampling frequency $f_s$ is 8000 Hz, and the CW telegraph signal carrying the message "CQ SOS" is constructed according to the Morse code rules.
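The timing rules above can be sketched in code. The following Python sketch is illustrative only: it assumes the CODEX relation Ts = 5/CPM from the text, the standard international Morse patterns for C, Q, S, and O, and a real-valued sinusoid with unit amplitude and zero phase. The function names `dot_length`, `keying_units`, and `cw_signal` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Standard international Morse patterns for the characters in "CQ SOS".
MORSE = {"C": "-.-.", "Q": "--.-", "S": "...", "O": "---"}


def dot_length(cpm):
    """Dot duration Ts in seconds under the CODEX convention, Ts = 5 / CPM."""
    return 5.0 / cpm


def keying_units(message):
    """Return the on/off keying sequence of the message, one entry per dot unit."""
    units = []
    words = message.split()
    for wi, word in enumerate(words):
        for ci, ch in enumerate(word):
            pattern = MORSE[ch]
            for ei, element in enumerate(pattern):
                # A dot lasts Ts, a dash lasts 3Ts.
                units += [1] * (1 if element == "." else 3)
                if ei < len(pattern) - 1:
                    units += [0]          # gap between elements: Ts
            if ci < len(word) - 1:
                units += [0] * 3          # gap between characters: 3Ts
        if wi < len(words) - 1:
            units += [0] * 7              # gap between words: 7Ts
    return units


def cw_signal(message, cpm=100.0, f0=1000.0, fs=8000.0):
    """Keyed sinusoid at f0 (assumed unit amplitude, zero phase), sampled at fs."""
    samples_per_unit = int(round(dot_length(cpm) * fs))
    key = np.repeat(np.array(keying_units(message), dtype=float), samples_per_unit)
    t = np.arange(key.size) / fs
    return key * np.sin(2.0 * np.pi * f0 * t)
```

With the Fig.1 parameters (100 CPM, 1000 Hz, 8000 Hz), each dot unit spans 400 samples, and the message "CQ SOS" occupies 61 dot units.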

Fig.1 The CW Signal Form of "CQ SOS"
[Fig.1 annotations: dot = Ts, dash = 3Ts; space between elements = Ts, between characters = 3Ts, between words = 7Ts; message C Q S O S = -.-. --.- ... --- ...; horizontal axis T/s, vertical axis U/v]

2 CW telegraph automatic recognition

CW telegraph recognition is the process of extracting the signal from noise, interference, distortion, and delay, and recovering the transmitted information. The algorithm first uses the self-synchronization method to extract synchronization information[11][12] and achieve bit synchronization, then uses the Goertzel algorithm to obtain the energy value of each signal segment, then establishes a Kalman filter state-space model to estimate a dynamic, optimal threshold for soft decision, and finally realizes automatic recognition of the CW telegraph.

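Fig.2 Flow Chart of Automatic Recognition of CW
[Fig.2 blocks: base-band signal → synchronous detection → segmentation (synchronous positioning) → Goertzel algorithm → Kalman filter (adaptive threshold) → soft decision → recognition]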

2.1 CW telegraph signal self-synchronization

Any communication system combines transmitted and received signals, and receiving a signal presupposes system synchronization[10][11], which yields the start and end times of each symbol so that complete symbol information is obtained. Synchronization performance directly affects the performance of the communication system: synchronization errors or loss of synchronization degrade or disrupt communication. The short-wave receiver down-converts the high-frequency CW telegraph to obtain the base-band signal, which contains the synchronization information of the CW telegraph signal. The self-synchronization method then achieves bit synchronization as follows. The base-band CW signal undergoes a second down-conversion, and the theoretical frequency of the synchronous sinusoidal signal is obtained from the 'dot' pulse duration. Because of Doppler spread and Doppler shift in the actual high-frequency ionospheric channel, an appropriate frequency range is set around this theoretical frequency to detect the spike in the base-band signal spectrum. The synchronous sinusoidal signal is then extracted and used to segment the CW telegraph signal in the time domain, as shown in Fig.3.

Fig.3 Flow Chart of CW Signal Synchronization
[Fig.3 blocks: base-band signal → down conversion → spike extraction → synchronous segmentation]

The Goertzel algorithm[13][14] is an improved

algorithm based on the discrete Fourier transform. The

discrete Fourier transform (DFT) of the sequence is:

$$X(k) = \sum_{l=0}^{N-1} x(l)\, e^{-j\frac{2\pi}{N}kl} = \sum_{l=0}^{N-1} x(l)\, W_N^{kl} \tag{2}$$

In the formula, $W_N = e^{-j2\pi/N}$. Since $W_N^{-kN} = 1$, the above formula can be written as:

$$X(k) = \sum_{l=0}^{N-1} x(l)\, W_N^{-k(N-l)} \tag{3}$$

The summation above has the form of a convolution, so $X(k)$ can be regarded as the output of a system excited by the sequence $x(n)$. This first-order recursive system is improved into a second-order recursive system whose transfer function is:

$$H_k(z) = \frac{1 - W_N^{k} z^{-1}}{(1 - W_N^{-k} z^{-1})(1 - W_N^{k} z^{-1})} = \frac{1 - W_N^{k} z^{-1}}{1 - 2\cos\left(\frac{2\pi k}{N}\right) z^{-1} + z^{-2}} \tag{4}$$

The second-order system of the above formula can be expressed by the following difference equations:

$$\begin{aligned} v_k(n) &= x(n) + 2\cos\left(\frac{2\pi k}{N}\right) v_k(n-1) - v_k(n-2) \\ X(k) &= y_k(N) = v_k(N) - W_N^{k}\, v_k(N-1) \\ v_k(-1) &= v_k(-2) = 0 \end{aligned} \tag{5}$$

Taking $N$ samples (equal to the DFT block size), $X(k)$ is calculated iteratively by the Goertzel algorithm to obtain the amplitude information of the signal's DFT. The magnitude $|X(k)|^2$ can be used to detect the energy of the signal at the frequency $k f_s / N$. With a system sampling frequency $f_s$ and $N$ samples, the frequency resolution of the Goertzel algorithm is $f_s / N$. At a given sampling frequency, the variables $N$ and $k$ can be adjusted to obtain the signal energy at the frequency to be detected.
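The recursion described above can be illustrated with a short Python sketch. It is a minimal example, assuming a real-valued input block and an integer bin index k; the function name `goertzel_power` is hypothetical. It runs the second-order recursion over the N samples, takes one extra step with a zero input to reach v_k(N), and evaluates |v_k(N) - W_N^k v_k(N-1)|^2 in closed form.

```python
import math


def goertzel_power(x, k):
    """Return |X(k)|^2 for the length-N block x using the Goertzel recursion."""
    n = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    v1 = 0.0  # v_k(n-1), starts from v_k(-1) = 0
    v2 = 0.0  # v_k(n-2), starts from v_k(-2) = 0
    for sample in x:
        # v_k(n) = x(n) + 2 cos(2*pi*k/N) v_k(n-1) - v_k(n-2)
        v0 = sample + coeff * v1 - v2
        v2, v1 = v1, v0
    # One more step with x(N) = 0 yields v_k(N); v2 then holds v_k(N-1).
    v0 = coeff * v1 - v2
    v2, v1 = v1, v0
    # |v_k(N) - exp(-j*2*pi*k/N) v_k(N-1)|^2 expanded into real arithmetic.
    return v1 * v1 + v2 * v2 - coeff * v1 * v2
```

For example, with fs = 8000 Hz and N = 400, bin k = 50 corresponds to k·fs/N = 1000 Hz, the characteristic frequency used in Fig.1.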

The Goertzel algorithm[15][16] yields the same real and imaginary parts as the conventional discrete Fourier transform (DFT) or FFT, but it obtains the spectral amplitude of the signal at a specific frequency without calculating the spectrum over the entire frequency band, and it processes each sample as soon as it arrives. Compared with the FFT, which processes samples in blocks, the Goertzel algorithm requires less computation for a single frequency and has better real-time performance.

2.2 Adaptive recognition based on Kalman filter

The Goertzel algorithm yields two kinds of signal energy: the sum of the effective signal energy and the noise energy, and the energy of channel noise and background noise within the various intervals of the CW signal (the interval between dots and dashes, the character interval, and the word interval). A threshold is therefore needed to decide whether an energy value contains effective signal energy. In high-frequency communication, however, the ionospheric channel is a random-parameter channel, and a fixed decision threshold cannot cope with the time-varying intensity of the interference noise, which results in a high BER. The Kalman filter[17][18] can estimate the optimal state of a dynamic system from a series of noisy measurements, is particularly suited to tracking dynamic signals against a strong noise background, and performs well in optimal estimation of dynamic systems. However, the standard Kalman filter requires the statistical characteristics of the system noise to be known; a wrong system model or measurement model, or inaccurate noise statistics, will cause the estimate to diverge. The algorithm therefore uses 'CQ' as the message header and performs synchronous detection and recognition on it to obtain the statistical characteristics of the communication channel noise. The adaptive threshold is set by the Kalman filter to judge the energy value $X(k)$, and the threshold is adjusted dynamically to cope with the time variation of the interference noise energy.

The Kalman filter[19][20] uses a state-space model of signal and noise, and the estimate of the state variable at the current time is updated from the estimate at the previous time and the observation at the current time:

$$\begin{aligned} X(k+1) &= \Phi X(k) + \Gamma w(k) \\ Y(k) &= H X(k) + v(k) \end{aligned} \tag{6}$$

The above equations are the Kalman filter state equation and measurement equation, respectively, where $w(k)$ is the state noise and $v(k)$ is the observation noise. The state value $X(k)$ is the energy value of the CW signal at its characteristic frequency, and the observation value is defined as:

$$Y(k) = ratio \cdot \max\left([X(k), \ldots, X(k-7)]\right), \quad ratio = 0.6 \tag{7}$$

The sliding window spans eight values because the longest interval in the CW signal is 7 times the dot length, which ensures that at least one valid signal energy value falls within the window.

The state transition matrix $\Phi$, the observation matrix $H$, and the noise driving matrix $\Gamma$ are all set to the identity matrix. The state noise and the observation noise are both zero-mean Gaussian white noise processes whose errors at different times are uncorrelated. Their variances $Q$ and $R$ are obtained from the header information 'CQ', and the energy threshold is updated in real time using the Kalman update equations.





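Fig.4 Flow Chart of Recognition of CW Signal
[Fig.4 blocks: the header of information 'CQ' provides the variance of state noise $Q$ and the variance of measurement noise $R$; the state transition matrix $\Phi$, the measurement matrix $H$, the threshold initial value $\hat{X}(0|0)$, and the initial covariance $P(0|0)$ initialize the filter. Each energy value $X(k)$ then passes through the three steps below, followed by soft decision of $X(k)$ against $\hat{X}(k|k)$ and dot-dash recognition.
One-step state and covariance prediction: $\hat{X}(k+1|k) = \Phi \hat{X}(k|k)$, $P(k+1|k) = \Phi P(k|k) \Phi^T + Q$
Update filter gain: $K(k+1) = P(k+1|k) H^T [H P(k+1|k) H^T + R]^{-1}$
Threshold optimal estimation and covariance update: $\hat{X}(k+1|k+1) = \hat{X}(k+1|k) + K(k+1)\varepsilon(k+1)$, $\varepsilon(k+1) = Y(k+1) - H\hat{X}(k+1|k)$, $P(k+1|k+1) = [I - K(k+1)H] P(k+1|k)$]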

Fig.4 shows the computation of the energy threshold estimate $\hat{X}(k+1|k+1)$ at time $k+1$, initialized with $\hat{X}(0|0) = E[x(0)] = u_0$. $\hat{X}(k+1|k)$ is the one-step prediction of the threshold; $P(k+1|k+1)$ is the estimation error covariance matrix, whose initial value $P(0|0)$ is calculated from $E[(x(0) - u_0)(x(0) - u_0)^T]$; $P(k+1|k)$ is the one-step prediction error covariance matrix; and $K(k+1)$ is the Kalman filter gain.

The adaptive energy threshold $\hat{X}(k|k)$ obtained by the Kalman filter[16] is used to judge the signal energy value $X(k)$. If $X(k) > \hat{X}(k|k)$, the value is valid, in other words it contains the energy of the CW signal; otherwise it is noise. Finally, the pure CW signal is extracted for recognition.
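The threshold update and soft decision can be sketched as a scalar Kalman filter. The Python sketch below is illustrative, not the authors' implementation: it assumes Φ = H = Γ = 1 as stated in the text, takes the noise variances `q` and `r` and the initial values `x0` and `p0` as plain arguments (the paper derives them from the 'CQ' header but does not give the procedure), and uses the hypothetical function name `adaptive_threshold`.

```python
import numpy as np


def adaptive_threshold(energy, q, r, x0, p0, ratio=0.6, window=8):
    """Track an energy threshold with a scalar Kalman filter and classify each value.

    energy : per-segment energy values X(k), e.g. from the Goertzel algorithm
    q, r   : state and observation noise variances (assumed known here)
    x0, p0 : initial threshold estimate and initial error covariance
    """
    energy = np.asarray(energy, dtype=float)
    x_est, p_est = float(x0), float(p0)
    thresholds = np.empty(energy.size)
    is_signal = np.empty(energy.size, dtype=bool)
    for k in range(energy.size):
        # Observation: scaled maximum over the latest `window` energy values.
        y = ratio * energy[max(0, k - window + 1): k + 1].max()
        # One-step prediction with Phi = Gamma = 1.
        x_pred = x_est
        p_pred = p_est + q
        # Gain, innovation, and update with H = 1.
        gain = p_pred / (p_pred + r)
        x_est = x_pred + gain * (y - x_pred)
        p_est = (1.0 - gain) * p_pred
        thresholds[k] = x_est
        # Soft decision: energy above the tracked threshold counts as CW signal.
        is_signal[k] = energy[k] > x_est
    return thresholds, is_signal
```

Because the observation is the scaled maximum over eight values, the threshold follows slow changes in the peak signal energy rather than individual noise-only segments.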

3 Experimental results and analysis

3.1 Simulation results and analysis

In the Matlab simulation, the characteristic frequency $f_0$ of the CW signal is set to 1000 Hz, the sampling frequency $f_s$ to 8000 Hz, and the communication code rate to 80 CPM; the signal contains 90541 samples and lasts 11.3176 s. Gaussian white noise with an SNR from -20 dB to 0 dB is added to the pure CW signal, and a Rayleigh fading channel is simulated at the same time, in order to synchronously detect and recognize the CW signal. Because the Gaussian white noise and the simulated Rayleigh fading channel affect the signal randomly, detection and recognition were performed 100 times at each fixed SNR, and the bit synchronization phase difference, the bit error rate, and the variance of the results at each SNR were recorded.

Fig.5 Synchronization Phase Difference and Bit Error Rate

The phase difference is the deviation between the average phase of the synchronization signal and the optimal synchronization phase, and is an indicator of bit synchronization performance. As shown in Fig.5, as the noise intensity increases, the phase difference and the BER of synchronous detection increase gradually. Synchronization performance also directly affects the bit error rate of signal recognition: as the bit synchronization phase difference increases, the recognition error rate increases.

3.2 Detection and recognition of actual CW telegraph

To verify the performance of the algorithm in actual short-wave communication, Chongqing University of Posts and Telecommunications and Shizhu County of Chongqing City were used as test points. The straight-line distance between them is 150 km, the test time was 9:28 am on March 28, 2018, the communication frequency was 5.47 MHz, and the code rate was set to 80 CPM.

In Shizhu County, a WT-B150 medium-high frequency single-sideband radio station with an inverted 'V' antenna was used to transmit the CW telegraph signal. At Chongqing University of Posts and Telecommunications, a WR-G33DDC short-wave broadband receiver with an inverted 'V' antenna saved the received signal as a 'wav' file, and the algorithm was then used to detect and recognize the CW telegraph signal. The result is shown below.

Fig.6 Recognition Results of CW Signal

The result shows that the algorithm can detect the CW telegraph signal in actual short-wave communication, but the BER is relatively high. The reason is that the algorithm is designed mainly for stationary Gaussian white noise, whereas the noise in actual communication is non-stationary colored noise. The next step is to improve the algorithm to handle the actual communication channel environment.

4 Conclusion

To address the weak noise immunity and poor real-time performance of current high-frequency CW telegraph signal detection methods, this paper proposes an automatic recognition algorithm for CW telegraph signals based on the Kalman filter. First, the statistical characteristics of the communication channel noise are obtained by processing the header signal. The Kalman filter is then used to recursively optimize the adaptive energy threshold, which suppresses the influence of channel multipath and interference noise on the signal, and the pure CW signal is extracted for recognition and decoding. The experiments show that detection and recognition are effective under strong Gaussian white noise interference and in an actual short-wave communication environment. The algorithm is adaptive, runs in real time, and can be implemented recursively. It has practical value for the automatic detection and recognition of high-frequency CW signals and for improving their anti-interference ability.

References:

[1] Wang Jinlong, Ding Guoru, Wang Haichao. HF

communications: past, present, and future[J]. China

Communications, 2018, 15(09): 9-17.

[2] Uysal M, Heidarpour M. Cooperative communication

techniques for future-generation HF radios[J]. IEEE

Communications Magazine, 2012, 50(10): 56-63.

[3] Huang Lihui. Research on anti-jamming technology of short-

wave communication [J]. China New Communication,

2016(17):37-37.

[4] Ma Wei, Zhang Jingxiu, Wang Hubang. Morse code

automatic decoding system[J]. Ordnance Industry

Automation, 2007(06):51-52+55.

[5] Lin Jinzhao, Li Guojun, Zhou Xiaona, et al. Automatic detection of high-frequency CW telegraph signal in strong noise environment[J]. Journal of Chongqing University of Posts and Telecommunications, 2008,(05):505-509.

[6] Li Guojun, Zeng Xiaoping, Zhou Xiaona, et al. Adaptive

Filtering of Weak High-Frequency CW Telegraph Signal[J].

Journal of University of Electronic Science and Technology

of China, 2010,39(02):227-231+250.

[7] Li Guojun, Zeng Xiaoping, Zhou Xiaona, et al. Detection

Technique Based on Stochastic Resonance for Weak High-

Frequency CW Signal[J]. Journal of University of

Electronic Science and Technology of China, 2010,

39(05):737-741.

[8] Wei Zhihao, Jia Kebin, Sun Zhonghua. An Automatic

Detection Method for Morse Signal Based on Machine

Learning[C]// International Conference on Intelligent

Information Hiding and Multimedia Signal Processing.

Darmstadt: Springer, Cham, 2018: 185-191.

[9] Zhang Rubo, He Ligang, Li Xueyao. Automatic detection

and recognition of Morse signal in strong noise

environment[J]. Harbin Gongcheng Daxue Xuebao/Journal

of Harbin Engineering University, 27(1), 2006: 112-117.

[10] Li Guojun, Zhou Xiaona, Jiang Yong, et al. Review on

Automatic Detection Technique of High-frequency CW

Signal[J]. World Sci-Tech R& D, 2013, 35(03): 337-343.

[11] Jinlong L I. Approach for Synchronous Signal Detection of

Program Based on Audio[J]. Audio Engineering, 2014,

38(12):90-97.

[12] Jianbing W. Study and analysis on the synchronization

method of shortwave frequency hopping communication

system[J]. Wireless Internet Technology, 2017, (14): 1-2.

[13] Kececioglu O F, Gani A, Sekkeli M. A performance

comparison of static VAr compensator based on Goertzel

and FFT algorithm and experimental validation[J].

SpringerPlus, 2016, 5(1):391.





[14] Kevin. The Goertzel Algorithm[J]. Embedded Systems

Programming, 2002, 54(9): 34.

[15] Kekelj M, Bulic N, Sucic V. An FPGA implementation of

the Goertzel algorithm in a Non-Destructive Eddy current

Testing[C]// 2017 International Conference on Signals and

Systems (ICSigSys). Sanur: IEEE Press, 2017: 180-184.

[16] Li Yuying. Research of DTMF dialing system based on the

goertzel algorithm and MATLAB simulation[C]//

Information Technology & Artificial Intelligence

Conference. Chongqing: IEEE Press, 2015: 93-97.

[17] Zarchan P, Zarchan P. Fundamentals of Kalman Filtering[J].

Progress in Astronautics & Aeronautics, 2015, 190(8):83.

[18] Hamilton F, Berry T, Sauer T. Ensemble Kalman Filtering

without a Model[J]. Physical Review X, 2016, 6(1): 11-21.

[19] Izanloo R, Fakoorian S A, Yazdi H S, et al. Kalman filtering

based on the maximum correntropy criterion in the presence

of non-Gaussian noise[C]// 2016 Annual Conference on

Information Science and Systems (CISS). Princeton, NJ:

IEEE Press, 2016: 500-505.

[20] Vullings R, Peters C H L, Sluijter R J , et al. Dynamic

segmentation and linear prediction for maternal ECG

removal in antenatal abdominal recordings[J].

Physiological Measurement, 2009, 30(3):291

About the Author:

LI Guo-jun(1978-), male, Ph.D. degrees,

the research direction: statistical

processing and detection of weak signal

including physiological recordings,etc.

Email: [email protected]

TIAN Fei-xiang(1994-), male, master's

degree,

the research direction: CW telegraph,

channelized receiver, etc.





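Sea Surface Target Size Estimation Based on Multi-feature Fusion Processing

LIU Rui, CHEN Shu-wen, DENG Xiao-bo

(AVIC Lei-Hua Electronic Technology Institute, Wuxi, Jiangsu 214063, China)

Abstract: Size estimation of sea-surface targets is important for sea-surface target classification. When the signal-to-clutter-plus-noise ratio (SCNR) of the echo is low and masking effects are present, size estimation based on the one-dimensional range profile often fails to estimate the target size accurately because the echo information is incomplete. To address this problem, this paper establishes a mapping model between the RCS and the size of typical sea-surface targets and proposes a sea-surface target size estimation technique based on multi-feature fusion processing. The technique combines indirect size estimation based on RCS features with direct size estimation based on the raw echo, evaluates the confidence of each estimate separately, and then applies multi-scan-line fusion filtering to improve the accuracy of the size estimate.

Keywords: sea detection; size estimation; target classification; fusion filtering

CLC number: V243.2  Document code: A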

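0 Introduction

Radar is an important tool for acquiring battlefield information in modern warfare and can monitor the entire battlefield environment in real time. Because radar has a long detection range and a wide scanning coverage, the echoes received during sea search often contain a large number of targets of different types; in littoral operations in particular, targets are dense. Facing such a complex target environment, a search radar has difficulty quickly picking out the key targets among many, which seriously affects situation assessment and decision analysis during operations. Classifying sea-surface targets quickly while searching therefore plays an important role in improving radar search efficiency.

Radar target information falls into four categories: RCS information, one-dimensional range profile information, SAR/ISAR image information, and polarization information. RCS information is easy to extract and reflects target size to some extent. However, for structurally complex targets the RCS fluctuates strongly with aspect angle, typically by about 10 dB and by as much as 40 dB, and it cannot be mapped to intuitive physical properties of the target. SAR/ISAR can obtain more target information and achieve more accurate classification, but the system overhead is large and the SNR requirement is high. In general the SAR/ISAR imaging range is only half of the normal detection range, and the range at which targets can be accurately classified and identified is shorter still.

For sea-surface target classification, researchers at home and abroad have therefore concentrated on classification techniques based on the one-dimensional range profile of the target echo: features are extracted from the envelope of the raw range-profile echo to measure the target directly and obtain its radial size, and the target is then assigned to a large, medium, or small class.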

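Ship target recognition using one-dimensional range profiles began earlier abroad. In 1983, Zwicke et al.[1] first applied the Mellin transform to range-profile recognition of ship targets, exploiting its scale invariance to overcome the aspect sensitivity of the HRRP. In 1999, S. Slomka et al.[3] proposed several ship-target features for range-profile recognition: length, number of scattering centers, centroid, quantized range profile, and Fourier-Mellin transform coefficients. In 2011, C. M. Pilcher et al.[4] proposed a nonlinear classifier combination method.

Domestic researchers applied one-dimensional range profiles to ship target classification and recognition relatively late. In 2009, Liu Xiankang, Liang Jing et al.[5] used central-moment features of the range profile for ship target recognition. In the same year, Liu Jiangbo, Xi Zemin et al. did extensive work on support vector machines (SVM) for ship target recognition. In 2012, Xi Zemin, Sun Jianbo et al.[9] proposed a multi-aspect-angle correlation matching method. Wang Jinzhang, Wei Cunwei et al.[10] proposed a ship target recognition method based on Relax scattering-point features and a scattering-center nearest-neighbor fuzzy classifier. Yuan Zuxia, Gao Guiming et al.[11] applied the maximum correlation coefficient (MCC) method and principal component analysis (PCA) to ship target recognition.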

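Although researchers at home and abroad have studied various solutions to the difficulties of range-profile-based sea-surface target classification, extracting complete target information from the range profile presupposes that the radar echo contains complete target information. This requires not only a high SCNR for the echo of each scattering point of the target but also the absence of severe occlusion. The high SCNR requirement greatly shortens the range at which targets can be classified accurately, and occlusion means that in most cases the target echo information is incomplete.

To address these difficulties, this paper starts from the analysis of measured data, builds a mapping model between target RCS and size, and proposes a sea-surface target size estimation technique based on multi-feature fusion processing. The range-profile and RCS features of the target are first extracted separately and are then fused across multiple scan lines, which improves the accuracy of target size estimation without affecting the normal detection function of the radar.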

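1 Multi-feature fusion technique for sea-surface target size estimation

The one-dimensional range profile is an important and effective feature for sea-surface target classification. Because sea-surface targets generally have complex structures, the aspect sensitivity and speckle effects of such targets are more pronounced in their echoes, and occlusion further degrades the accuracy of direct size estimation. To improve classification accuracy, the sea-surface target size is estimated by fusing indirect size estimation based on RCS inversion with direct size estimation based on the target echo.

Indirect target size estimation based on RCS inversion. According to the radar equation, the target echo power is proportional to the target RCS and inversely proportional to the fourth power of the target range. On this basis, calibrating the system error with a standard sphere yields a fairly accurate RCS inversion. Given an accurate RCS estimate, the target size can be estimated indirectly through the mapping model between the RCS and the size of typical sea-surface targets. This mapping model is the basis of the fusion processing.

Radial size estimation based on the target echo. At high resolution the target echo splits into multiple scattering points. To preserve as much of the echo information as possible, a CFAR detection technique with outlier rejection is used to prevent the edges of large targets from being masked, and the radial size is then extracted from the target cells that exceed the threshold.

RCS and size fusion processing. Using information such as the target echo SCR and the heading angle, the confidence of the indirect and the direct size estimates is evaluated separately, and multi-scan-line fusion filtering is then applied to improve the accuracy of the size estimate. This part is the focus of this paper.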

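2 Mapping model between RCS and size of sea-surface targets

Transforming different parameters into the same dimension is a prerequisite for fusion filtering, so a mapping model between the RCS and the size of sea-surface targets must be established first. As a rule of thumb, the RCS of a typical ship is comparable to its tonnage, and the hull tonnage is roughly $L \cdot B \cdot H / 3$, where $L$, $B$, and $H$ are the length, width, and height of the hull. Let the length-to-width ratio of a sea-surface target be $\alpha$; the table below gives the length-to-width ratios of typical sea-surface targets:

Tab.1 Typical surface target aspect ratio

Typical target | RCS | Length-to-width ratio
Sphere, cylinder (snorkel, periscope) | < 10 m² | 1
Pirate ship, missile boat, small vessel | > 10 m² and < 1000 m² | 5.5
Large warship (destroyer) | > 1000 m² | 8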

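In general, the height and width of a typical sea-surface target are related by $H = B/3$. With $B = L/\alpha$, where $\alpha$ is the length-to-width ratio, the relationship between the length and the RCS of a typical sea-surface target is:

$$RCS(\mathrm{dB}) = 10\log\left(\frac{1}{3} L \cdot B \cdot H\right) = 10\log\left(\frac{1}{9\alpha^2} L^3\right) \tag{1}$$

With this model, the target RCS can be mapped to the size domain, $L_{RCS} = \sqrt[3]{9\alpha^2 \cdot 10^{RCS/10}}$, which estimates the target size indirectly and lays the foundation for RCS and size fusion filtering. Differentiating gives the accuracy $dL_{RCS}$ of the RCS-based size estimate:

$$dL_{RCS} = d\left[(9\alpha^2)^{1/3} \cdot 10^{RCS/30}\right] = \frac{\ln(10)}{30} (9\alpha^2)^{1/3} \cdot 10^{RCS/30}\, d_{RCS} \tag{2}$$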

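$d_{RCS}$ is the RCS estimation error. Fig.1 shows the RCS inversion result for a 1 m² standard sphere based on radar echoes; a preliminary analysis shows that the RCS estimation accuracy of this radar is about 2 dB.

Fig.1 RCS inversion results of 1 m² standard sphere based on radar echo

According to the above model, $dL_{RCS}$ increases as the RCS increases. For a given RCS estimation error, $L_{RCS}$ is therefore estimated fairly accurately for targets with a small RCS, and the accuracy of $L_{RCS}$ degrades rapidly as the target RCS increases.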
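The mapping and its sensitivity can be checked numerically. The Python sketch below is a minimal illustration, assuming the RCS is given in dB relative to 1 m² and that the length-to-width ratio is passed in as `alpha` (a symbol name chosen here); the function names `size_from_rcs` and `size_error_from_rcs` are hypothetical.

```python
import math


def size_from_rcs(rcs_db, alpha):
    """Indirect length estimate L_RCS = (9 * alpha^2 * 10^(RCS/10))^(1/3)."""
    return (9.0 * alpha ** 2 * 10.0 ** (rcs_db / 10.0)) ** (1.0 / 3.0)


def size_error_from_rcs(rcs_db, alpha, d_rcs_db):
    """First-order length error dL_RCS = ln(10)/30 * L_RCS * d_RCS."""
    return math.log(10.0) / 30.0 * size_from_rcs(rcs_db, alpha) * d_rcs_db
```

With a fixed 2 dB RCS error, the length error returned by `size_error_from_rcs` grows in proportion to `size_from_rcs`, which matches the statement that the accuracy degrades as the RCS increases.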

[Fig.1 plot: title F120161214 113731 (small sphere RCS); horizontal axis: target index (0 to 50); vertical axis: feature parameter (-10 to 8); legend: RCS]




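3 Direct estimation of sea-surface target size

Target size estimation based on the target echo has two parts: the radial size of the target is first extracted from the raw echo, and the size is then corrected using the heading angle obtained from tracking. Radial size estimation from the raw echo amounts to extracting the length of the run of target cells that exceed the detection threshold. Fig. 2 shows the detection results of a large-scale target echo under different CFAR detectors; a CFAR detector with outlier rejection retains the target echo information more completely. After comparing the performance of different CFAR detectors, this paper uses LACA-CFAR, a recursive-mean CFAR detector with outlier removal, for target detection, then clusters the detections, extracts the cells that exceed the detection threshold, and takes their range extent as the radial size of the target.

Fig. 2 Performance of different CFAR detectors in large-scale target detection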

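After the radial size has been estimated, it must be corrected using the target heading angle, which can be obtained during target tracking. Given the heading angle and the radial size, the target size can be corrected according to the radar illumination angle. The simple geometry is shown in Fig. 3. Assuming that the heading angle is estimated accurately and that the angle between the target heading and the radar line of sight is $A$, the radial extent of the target echo is $L = \max(L_1, L_2)$, where $L_1 = L_0 \cos A$ and $L_2 = L_0 \sin A / \alpha$, with $\alpha$ the length-to-width ratio.

[Fig. 3 labels: heading; radar illumination direction; $L_1$; $L_2$; $A$]

Fig. 3 Schematic diagram of heading angle compensation

The relationship between the target size $L_0$, the radial size $L$, and the angle $A$ is therefore:

$$L_0 = \begin{cases} L / \cos A, & A \le \arctan(\alpha) \\ \alpha L / \sin A, & A > \arctan(\alpha) \end{cases} \tag{3}$$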

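The above model reflects only the two-dimensional planar relationship between the radial size and the true size of a simple target. It shows that the direct size estimation error depends on both the radial size estimation error and the heading angle estimation error. If these two errors are independent of each other, differentiation gives the direct size estimation error:

$$dL_0 = \frac{\partial L_0}{\partial L} dL + \frac{\partial L_0}{\partial A} dA = \begin{cases} \dfrac{1}{\cos A}\, dL + L \dfrac{\tan A}{\cos A}\, dA, & A \le \arctan(\alpha) \\ \dfrac{\alpha}{\sin A}\, dL - \alpha L \dfrac{\cot A}{\sin A}\, dA, & A > \arctan(\alpha) \end{cases} \tag{4}$$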
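The heading-angle correction can be illustrated with a small round-trip check. The Python sketch below assumes a rectangular target outline with length-to-width ratio `alpha`, an angle in radians strictly between 0 and π/2, and no occlusion; the function names `radial_extent` and `true_length` are hypothetical.

```python
import math


def radial_extent(l0, angle_rad, alpha):
    """Forward model: L = max(L0 cos A, L0 sin A / alpha)."""
    return max(l0 * math.cos(angle_rad), l0 * math.sin(angle_rad) / alpha)


def true_length(radial, angle_rad, alpha):
    """Inverse model: recover L0 from the radial size L and the angle A."""
    if angle_rad <= math.atan(alpha):
        # The length projection dominates the radial extent.
        return radial / math.cos(angle_rad)
    # The width projection dominates the radial extent.
    return alpha * radial / math.sin(angle_rad)
```

Applying `true_length` to the output of `radial_extent` returns the original length on both sides of the switching angle arctan(alpha), which is a quick consistency check of the two branches.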

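In practice the target echo model is often more complex. In addition to the two-dimensional planar mapping, there is an occlusion effect, shown in Fig. 4, which leaves the target echo information incomplete, so that the directly estimated size is often smaller than the true size. As a rule of thumb, occlusion is most pronounced when $A$ is close to 0° or 90°.

[Fig. 4 labels: radar illumination; shadowed region]

Fig. 4 Schematic diagram of masking effect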

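4 RCS and size fusion filtering

Based on the above models, in the ideal case the RCS inversion error of the radar system, the radial size error, and the target heading angle error obtained during tracking can each be estimated; $L_{RCS}$ and $L_0$ are then filtered separately, and the target size is obtained by minimum-variance fusion weighting. In the ideal case the fusion weighting formula is:

$$L_C = \frac{dL_0 \cdot L_{RCS} + dL_{RCS} \cdot L_0}{dL_0 + dL_{RCS}} \tag{5}$$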
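The ideal weighting can be sketched in a few lines. The Python example below is an assumption-laden illustration: it weights each estimate by the error of the other one, which is the usual form of such a fusion, and it does not include the confidence-based weight corrections that the paper applies in practice. The function name `fuse_sizes` and its argument names are hypothetical.

```python
def fuse_sizes(l_rcs, d_l_rcs, l_direct, d_l_direct):
    """Fuse the indirect (RCS-based) and direct (echo-based) size estimates.

    Each estimate is weighted by the error of the other one, so the estimate
    with the smaller error dominates the result.
    """
    return (d_l_direct * l_rcs + d_l_rcs * l_direct) / (d_l_direct + d_l_rcs)
```

For example, with an RCS-based estimate of 100 m (error 30 m) and a direct estimate of 80 m (error 10 m), the fused size is 85 m, closer to the more reliable direct estimate.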

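In practice, however, the situation is often more complicated, and there are many non-ideal factors:
1. The target RCS depends strongly on the incidence angle, and sea-surface targets are highly varied, so it is difficult to build a universal and accurate mapping model between sea-surface target RCS and size;
2. When the SCNR is low or occlusion is present, the target echo information is often incomplete, and most target size estimates are smaller than the actual value.
To overcome these engineering difficulties, this paper corrects the above weights in the fusion weighting, based on the characteristics of the two size estimation approaches and the analysis of a large amount of data. The correction strategy is as follows:
1. The confidence of $L_{RCS}$ and $L_0$ is evaluated from $dL_{RCS}$ and $dL_0$ respectively. In general, $L_{RCS}$ has higher confidence when the RCS is small, and $L_0$ has higher confidence when the angle between the target heading and the radar line of sight is between 30° and 60° and the SCNR is high;
2. Because $L_0$ is directly related to the true target size, the weight of $L_0$ is increased in the filter when the confidence of $L_0$ is high.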

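Fig. 5 shows simulation results of target size estimation based on measured radar data of a cooperative ship (90 m long). The first curve is the target size $L_0$ computed from the range extent of the target echo and the heading angle; the second curve is the $L_{RCS}$ result with uniform weighted filtering; the third curve is the $L_0$ filtering result based on the $L_0$ confidence evaluation; and the fourth curve is the result of the fusion weighted filtering algorithm proposed in this paper.

[Fig. 5 plot: title 052 (155 m) data \20171224\second\track 7; horizontal axis: relative radar time (5400 to 7400); vertical axis: target size /m (-50 to 300); legend: L0, filtered target LRCS, filtered L0, target size after fusion weighting]

Fig. 5 Target size estimation based on radar echo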

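The figure shows that:
1. The target size computed from the range extent of the echo and the heading angle is mostly smaller than the true size, which agrees with the analysis in this paper;
2. The target size obtained by RCS inversion has a large error relative to the true size, partly because the model in this paper has errors and partly because the RCS of sea-surface targets is extremely sensitive to the incidence angle;
3. The $L_0$ filtering based on the $L_0$ confidence evaluation is already fairly close to the true size, but because the target heading angle converges slowly and there are few high-confidence measurements, the size estimate takes a long time to converge.
The fusion weighted filtering method based on target RCS and size proposed in this paper achieves a fairly accurate size estimate and, by using the RCS as an aid in the early stage, shortens the convergence time of the size estimate.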

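5 Conclusion

Accurately estimating target size from the raw radar echo, without affecting the normal detection function of the radar, is of great significance for classifying ships among sea-surface targets and for improving the efficiency of sea search radar. Starting from theory, and based on a simple mapping model between target RCS and size and a two-dimensional planar model relating the radial size of a sea-surface target to its true size, this paper analyzes in detail the characteristics of indirect size estimation based on target RCS and of direct size estimation based on the target echo and heading angle, proposes a sea-surface target size estimation technique based on multi-feature fusion processing, and verifies the effectiveness of the algorithm with measured data.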

References:

[1] Zwicke P E, Kiss I. A new implementation of the Mellin

transform and its application to radar classification of ships

[J]. Pattern Analysis and Machine Intelligence, IEEE

Transactions on, 1983 (2): 191-199.

[2] Inggs M R, Robinson A D. Ship target recognition using

low resolution radar and neural networks[J]. Aerospace

and Electronic Systems, IEEE Transactions on, 1999,

35(2): 386

[3] Slomka S, Gibbins D, Gray D, et al. Features for high

resolution radar range profile based ship classification[J].

1999.

[4] Pilcher C M, Khotanzad A. Maritime ATR using classifier

combination and high resolution range profiles[J].

Aerospace and Electronic Systems, IEEE Transactions on,

2011, 47(4): 2558-2573.

[5] 刘先康，梁菁，任杰，等.基于 HRRP 中心矩特征的舰

船目标识别[J].全国第三届信号和智能信息处理与应

用学术交流会专刊, 2009.

Liu xiankang, liang jing, ren jie, et al. Ship target

identification based on HRRP central moment feature [J].

Special issue of the third national academic exchange

conference on signal and intelligent information

processing and application, 2009.

[6] 刘先康，梁菁，任杰，等.基于 Fisher 判决率加权的修正

最近邻模糊分类器设计[J].舰船科学技术, 2010 (2): 68-

72.

Liu xiankang, liang jing, ren jie, et al. Design of modified

nearest neighbor fuzzy classifier based on Fisher decision

weighting [J]. Ship science and technology, 2010 (2): 68-72.

[7] 刘江波，席泽敏，卢建斌，等.基于修正核函数 SVM 的一维距离像识别[J].雷达科学与技术, 2009, 7(6): 437-442.

Liu jiangbo, xi zemin, lu jianbin, et al. One-dimensional

range image recognition based on modified kernel SVM [J].

Radar science and technology, 2009, 7(6): 437-442.

[8] 冷家旭，黄惠明，龙方.基于剪辑支持向量机的雷达

目标识别方法[J].舰船电子工程, 2010 (4): 68-71.

Leng jiaxu, huang hui-ming, long fang. Radar target

recognition method based on clip support vector machine

[J]. Naval electronics engineering, 2010 (4): 68-71.

[9] 孙剑波，席泽敏，卢建斌,等.舰船目标一维距离像多

姿态角相关匹配识别[J].舰船电子工程, 2012, 31(11):

49-52.

Sun jianbo, xi zemin, lu jianbin, et al. Multi-attitude Angle

correlation matching recognition of one-dimensional range

image of ship target [J]. Ship electronic engineering, 2012,

31(11): 49-52.

[10] 王锦章, 魏存伟, 刘先康, 等.基于 Relax 散射点特征

提取的舰船目标识别方法[J].电子科技, 2011, 24(4):

8-11.

Wang jin-zhang, wei cunwei, liu xian-kang, et al. Ship

target recognition method based on Relax scattering point

feature extraction [J]. Electronic science and technology,

2011, 24(4): 8-11.

[11] 袁祖霞, 高贵明.基于高分辨率一维距离像雷达目标

识别研究[J].雷达与对抗, 2010, 30(1): 11-14.

Yuan zu-xia, gao gui-ming. Target recognition based on

high-resolution one-dimensional range image radar [J].

Radar and countermeasures, 2010, 30(1): 11-14.

[12] 李青，李斌，胡文俊等. 基于低分辨率雷达的海面

舰船目标分类识别技术[J]. 现代雷达， 2012,34(12):

45-49.

Li qing, li bin, hu wenjun et al. Classification and

identification technology for surface ship targets based on

low-resolution radar [J]. Modern radar, 2012,34(12): 45-

49.

[13] 王曙光，田西兰. 一种窄带雷达舰船目标分类的决策

方法[J]. 雷达科学与技术，2016,14（2）: 159-162.

Wang shuguang, tian zilan. A decision method for the

classification of narrow-band radar ships [J]. Radar

science and technology, 2016,14 (2): 159-162.

About the author:

LIU Rui (1988-), male, engineer, master's degree; research interests: radar signal processing, target detection, etc.


Sea Surface Target Size Estimation Based on Multi-feature Fusion Processing

LIU Rui, CHEN Shu-wen, DENG Xiao-bo

(AVIC Lei-Hua Electronic Technology Institute, Wuxi, Jiangsu 214063, China)

Abstract: Size estimation of sea surface targets is of great significance to sea surface target classification. When the SCNR of the target is low and the masking effect exists, size estimation based on HRRP often cannot estimate the target size accurately because the echo information is incomplete. To solve this problem, this paper establishes a mapping model between the RCS and the size of typical sea surface targets and proposes a sea surface target size estimation method based on multi-feature fusion processing. The method combines indirect size estimation based on RCS with direct size estimation based on the original echo and, after evaluating the confidence of the two estimates, adopts multi-line fusion filtering to improve the accuracy of target size estimation.

Keywords: sea detection; target size estimation; target classification; fusion filtering




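A Two-Stream Network for Human Action Recognition Based on Improved Decision-Level Fusion

PENG Shiyu, SU Tingli, JIN Xuebo, KONG Jianlei, BAI Yuting

(School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048)

Abstract: To address the wide variety of human actions in video data and the low recognition accuracy, this paper proposes a deep two-stream network model based on improved decision-level fusion. The model first computes frame differences of the color images and feeds them into a deep difference network to extract temporal features, which greatly reduces computation time while maintaining accuracy. It then improves the Softmax logistic-regression loss function and adds a decision-level fusion mechanism that preserves the high-dimensional spatio-temporal features of the inter-frame images from the different networks, reflecting the information contained in the video more faithfully. The proposed model is validated on two public human action recognition datasets, UCF101 and HMDB51. The experimental results demonstrate the effectiveness of the improved decision-level fusion mechanism, and the improved two-stream network achieves a high recognition rate.

Keywords: two-stream network; decision-level fusion mechanism; human action recognition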

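0 Introduction

Human action recognition based on video content is an active direction in computer vision[1], with significant research and application value in video surveillance, intelligent transportation, motion analysis, nursing and medical care, and human-computer interaction[2]. Current human action recognition algorithms are usually based on deep learning networks, mainly three-dimensional convolutional neural networks[3] [4], long short-term memory networks (LSTM)[5], and two-stream networks[6], which let the computer learn high-dimensional spatio-temporal features on its own and perform classification automatically.

Methods based on 3D convolution extend the original 2D convolution kernel to a 3D kernel in an attempt to extract temporal features; they are generally relatively accurate, but their parameter settings are complex and the number of parameters is very large[7]. LSTM-based methods use a convolutional network to extract the features of each video frame at each time step; their advantage is that they make full use of temporal information, and their drawback is that frame-by-frame computation limits the running speed of the network[6] [8]. Two-stream networks take optical flow images as network input to compensate for the temporal information that the spatial network cannot capture, and interpret features from both the spatial and temporal dimensions, but extracting optical flow is time-consuming[9].

Considering the strengths and weaknesses of these networks, this paper improves the two-stream convolutional neural network. While the spatial network extracts spatial features, a deep difference network is introduced to take over the role of the temporal network: frame differences of the color images, rather than optical flow images, are fed into the deep difference network to extract temporal features. In addition, this paper proposes a decision-level feature fusion mechanism with an improved Softmax logistic-regression function, which preserves more of the spatial and temporal information of the inter-frame images from the different networks. The model is validated on two public human action recognition datasets, UCF101 and HMDB51, and improves the accuracy of human action recognition. Fig.1 shows the overall experimental pipeline of this paper; the proposed model can be trained end to end from input video data to recognition result.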

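[Figure 1: flowchart. Video data → data preprocessing → color images and differential images → deep color network and deep differential network (spatio-temporal feature extraction) → decision-level fusion → recognition result]

Fig. 1 Overall workflow of the proposed method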




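[Figure 2: network diagram. Color images → spatial network → improved Softmax loss function; differential images → deep differential network → improved Softmax loss function; both outputs → decision-level fusion mechanism → human action class]

Fig. 2 Spatio-temporal feature extraction of the deep two-stream network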

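1 Spatio-temporal feature extraction with the two-stream network

In human action recognition, a two-stream network consists of a spatial network and a temporal network, whose inputs are color images and optical flow images respectively. However, the long time needed to generate optical flow images has held back its development. This paper improves the two-stream network by proposing a deep differential network that replaces the temporal network to capture the inter-frame and temporal relationships of the video, with differential keyframes replacing optical flow images as input. This greatly reduces computation time while ensuring that the improved deep differential network retains good pose representation and classification ability for human actions. The overall structure is shown in Fig. 2, where orange, purple, cyan, and white denote convolution layers, normalization, activation functions, and pooling layers respectively, and the large and small network sizes in the figure are 7×7 and 3×3.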

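Differential keyframes are obtained by differencing adjacent keyframes of a video, so they capture the relationship between adjacent frames, much as optical flow images do. Compared with optical flow images, they are faster to generate and cheaper to compute, while reflecting the temporal relationship between video keyframes equally well. Because using every frame of a video is redundant and computationally costly, this paper samples frames at multiple time scales to obtain the keyframes used in subsequent computation. Let an input video of $t$ frames be denoted $X$. Each video is first divided into $T$ segments of equal duration, and a keyframe $x_i$ is extracted from each segment, so the whole video is written $X = \{x_1, x_2, \ldots, x_T\}$. Differencing adjacent keyframes yields the differential keyframes $Y = \{y_1, y_2, \ldots, y_T\}$. The keyframes and differential keyframes are then fed into the spatial convolutional network and the deep differential network respectively, and the high-dimensional spatio-temporal features of each frame are computed independently, giving the feature vectors $\{S_1, S_2, \ldots, S_T\}$, where $S_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, T$, and $d$ is the keyframe feature dimension.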

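The two-stream network built in this paper efficiently extracts high-dimensional spatio-temporal features from the keyframes of each human action video. Each keyframe becomes a 1×1×1024 vector after global average pooling, and the final spatio-temporal features are extracted by the last convolution layer. Each convolution kernel is followed by a pooling operation. The pooling operations used in this model are average pooling:

$$Q_{i \to j} = \frac{S_i + S_{i+1} + \cdots + S_j}{j - i + 1} \tag{1}$$

and max pooling:

$$Q_{i \to j} = \max\{S_i, S_{i+1}, \ldots, S_j\} \tag{2}$$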

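After convolution, pooling, and fully connected computation on each keyframe, the final output of the deep differential network takes the form of a $d$-dimensional feature vector, which represents the high-dimensional spatio-temporal information of the whole video.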
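The differencing and pooling steps of this section can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and an absolute pixel-wise difference; the paper only states that adjacent keyframes are differenced, so the absolute value, the function names, and the fact that $T$ keyframes yield $T-1$ difference frames here are all assumptions of this example rather than the authors' implementation.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def difference_keyframes(keyframes: List[Frame]) -> List[Frame]:
    """Pixel-wise absolute difference between each pair of adjacent keyframes."""
    diffs = []
    for prev, curr in zip(keyframes, keyframes[1:]):
        diffs.append([[abs(c - p) for p, c in zip(prev_row, curr_row)]
                      for prev_row, curr_row in zip(prev, curr)])
    return diffs


def average_pool(features: List[List[float]]) -> List[float]:
    """Eq. (1): element-wise mean of the feature vectors S_i, ..., S_j."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def max_pool(features: List[List[float]]) -> List[float]:
    """Eq. (2): element-wise maximum of the feature vectors S_i, ..., S_j."""
    return [max(column) for column in zip(*features)]
```

For example, `average_pool([[1.0, 4.0], [3.0, 0.0]])` averages two 2-dimensional feature vectors component by component, while `max_pool` on the same input keeps the larger value in each component.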

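2 Decision-level fusion mechanism with an improved loss function

Because human motion recognition in video data is inherently complex, a single network is easily disturbed by environmental noise and may make a wrong decision that affects the final output of the whole model. This paper therefore proposes a decision-level fusion mechanism to address the problem. The spatial network and the deep differential network extract different feature content from the same motion, so each network's understanding of an action is informative but one-sided; only by fusing both kinds of features can the content of a motion be fully expressed. The decision-level fusion mechanism is feature-invariant with respect to the different high-dimensional spatio-temporal features: it determines the human motion video clip from each network's output and collects the decision information of the other networks.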

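This paper improves the Softmax logistic regression classifier and constructs a strong classifier that takes the highest class probability of each image as the human motion recognition probability of the test image, which better solves the action recognition problem. Suppose there are $k$ classes, so that the label $y^{(i)} \in \{1, 2, \ldots, k\}$. In Softmax logistic regression, for a given test video $x$, the hypothesis function estimates the probability of each class as

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} \tag{3}$$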

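Expanding the recognition probability $h_\theta(x^{(i)})$ gives

$$h_\theta(x^{(i)}) = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix} \tag{4}$$

Here the model parameter $\theta$ is a matrix with $k$ rows, each row holding the parameters of one class. The factor $1 / \sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}$ in Eq. (4) normalizes the probability distribution so that the recognition probabilities of all actions sum to 1, which puts the network outputs on the same scale for further processing. The parameter matrix $\theta$ can be written as

$$\theta = \begin{bmatrix} \theta_1^T \\ \theta_2^T \\ \vdots \\ \theta_k^T \end{bmatrix} \tag{5}$$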

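The loss function of the model is

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} I\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right] \tag{6}$$

In Eq. (6), $I\{y^{(i)} = j\}$ is an indicator function that equals 1 when sample $i$ belongs to class $j$ and 0 otherwise, and $m$ is the number of training samples. To minimize over the model parameter matrix, first consider the loss in its logistic regression form:

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{7}$$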

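Softmax logistic regression normalizes over the scores of all $k$ classes, so the probability that sample $x^{(i)}$ is recognized as class $j$ is

$$p(y^{(i)} = j \mid x^{(i)}; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \tag{8}$$

Eq. (8) is the probabilistic generalization of logistic regression to $k$ classes. Substituting it into Eq. (7) and simplifying gives

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} I\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; \theta) \right] \tag{9}$$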

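Gradient descent is generally used to minimize the loss function during recognition. First compute the gradient of the loss function:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( I\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] \tag{10}$$

In Eq. (10), $\nabla_{\theta_j} J(\theta)$ is a vector whose $l$-th element $\partial J(\theta) / \partial \theta_{jl}$ is the partial derivative of the loss function with respect to the $l$-th parameter of class $j$. Gradient descent on Eq. (9) then updates the parameters iteratively to minimize the loss function, with the update rule

$$\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta) \tag{11}$$

where $\alpha$ is the learning rate.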

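However, this paper finds that using the above function directly in Softmax logistic regression leads to redundant parameters, which affects parameter updates. For example, modify Eq. (8) by subtracting the same vector $\psi$ from every $\theta_j$:

$$p(y^{(i)} = j \mid x^{(i)}; \theta) = \frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x^{(i)}}} \tag{12}$$

Expanding Eq. (12) gives

$$\frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \tag{13}$$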
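The redundancy expressed by Eq. (13) can be checked numerically. The sketch below uses made-up parameter values and a hypothetical helper `softmax_probs`; it only illustrates that shifting every class parameter vector by the same vector leaves the Softmax probabilities unchanged.

```python
import math
from typing import List


def softmax_probs(theta: List[List[float]], x: List[float]) -> List[float]:
    """Eq. (8): class probabilities for one sample x under parameters theta."""
    scores = [sum(t * v for t, v in zip(row, x)) for row in theta]
    top = max(scores)  # subtracting the max score is a standard stability step
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only: k = 3 classes, 2 features.
theta = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]
psi = [4.0, -2.5]  # arbitrary shift applied to every class
shifted = [[t - p for t, p in zip(row, psi)] for row in theta]
x = [1.0, 2.0]

p_original = softmax_probs(theta, x)
p_shifted = softmax_probs(shifted, x)
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(p_original, p_shifted))
print(same)
```

The two probability vectors agree up to floating-point rounding, so an unregularized loss cannot distinguish `theta` from `shifted`.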

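Eq. (13) shows that when the same vector $\psi$ is subtracted from every parameter vector $\theta_j$, the predicted probabilities, and hence the value of the loss function, do not change. The parameters that minimize the loss function are therefore not unique. To resolve this, a weight decay term is added to the loss function in the classifier design. It penalizes large parameter values and makes the loss function strictly convex, so that optimization converges to the unique global optimum. The loss function becomes

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} I\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right] + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2 \tag{14}$$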

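where $\lambda > 0$. The gradient of the loss function is then

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( I\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j \tag{15}$$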
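As a rough illustration of Eqs. (11), (14), and (15), the following sketch computes the weight-decayed Softmax loss, its gradient, and one gradient descent step in plain Python. The function names, the 0-based class labels, and the toy data in the usage note are assumptions of this example and not the authors' PyTorch code.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # stability shift; does not change the result
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(theta: Matrix, xs: Matrix, ys: List[int],
                  lam: float) -> Tuple[float, Matrix]:
    """Return J(theta) from Eq. (14) and its gradient from Eq. (15).

    theta has one row per class, xs holds m samples, ys holds labels 0..k-1.
    """
    m, k = len(xs), len(theta)
    loss = 0.0
    grad = [[lam * t for t in row] for row in theta]  # weight decay: lambda * theta_j
    for x, y in zip(xs, ys):
        probs = softmax([sum(t * v for t, v in zip(row, x)) for row in theta])
        loss -= math.log(probs[y]) / m
        for j in range(k):
            coeff = ((1.0 if y == j else 0.0) - probs[j]) / m
            for l, value in enumerate(x):
                grad[j][l] -= coeff * value
    loss += 0.5 * lam * sum(t * t for row in theta for t in row)
    return loss, grad


def gradient_step(theta: Matrix, grad: Matrix, alpha: float) -> Matrix:
    """Eq. (11): theta_j := theta_j - alpha * grad_j for every class j."""
    return [[t - alpha * g for t, g in zip(t_row, g_row)]
            for t_row, g_row in zip(theta, grad)]
```

With all-zero parameters and two classes, every class probability is 0.5, so the initial loss equals $\log 2$; repeated calls to `gradient_step` with a small `alpha` then reduce it.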

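Minimizing the loss function with the gradient in Eq. (15) yields the classification probabilities of the improved Softmax logistic regression. A decision-level fusion mechanism is then constructed. The recognition probability $p_i$ that the improved Softmax logistic regression assigns to each video test image is fed into the fusion mechanism, and the two probabilities computed for a single input image, $p_{i\_\text{color}}$ and $p_{i\_\text{diff}}$, serve as the single-network recognition probabilities for that image.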

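Finally, Eq. (16) computes the recognition probability value $u$ over multiple videos as the final recognition probability for the whole action type:

$$u = \sum_{i}^{k} \arg\max\{p_{i\_\text{color}},\; p_{i\_\text{diff}}\} \tag{16}$$

where $p_{i\_\text{color}}$ and $p_{i\_\text{diff}}$ are the recognition probabilities from the color (spatial) network and the differential network, $i$ is the number of videos contained in each action type, and $k$ is the total number of action types.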

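The proposed decision-level fusion mechanism rests on the observation that, in a practical system, it is a low-probability event for several networks to misrecognize the same video action. In other words, multiple networks are unlikely to make a wrong decision on the same video motion at the same time. The decision value readjusted by the majority voting principle is therefore more accurate, and the mechanism greatly reduces the probability of a wrong final output from a single network, improving the model's human motion recognition performance.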
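One possible reading of this fusion rule is sketched below: for each keyframe, the class proposed by the more confident of the two networks is kept, and the video-level label is the majority vote over keyframes. This interpretation, the function names, and the tie-breaking in favor of the color network are assumptions of the example; the paper does not spell out these details.

```python
from collections import Counter
from typing import List


def fuse_frame(p_color: List[float], p_diff: List[float]) -> int:
    """Keep the class proposed by whichever network is more confident on this frame."""
    best_color = max(range(len(p_color)), key=p_color.__getitem__)
    best_diff = max(range(len(p_diff)), key=p_diff.__getitem__)
    # Assumption: ties go to the color (spatial) network.
    return best_color if p_color[best_color] >= p_diff[best_diff] else best_diff


def fuse_video(color_probs: List[List[float]],
               diff_probs: List[List[float]]) -> int:
    """Majority vote over the per-frame fused decisions."""
    votes = Counter(fuse_frame(c, d) for c, d in zip(color_probs, diff_probs))
    return votes.most_common(1)[0][0]
```

Each inner list holds the Softmax probabilities of one keyframe over all classes, so `fuse_video` takes one such list per keyframe from each network and returns a single class index for the video.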

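3 Experiments and analysis of results

3.1 Experimental datasets

UCF101: The UCF101 dataset is a human action recognition dataset of realistic action videos collected from YouTube. It contains 101 action classes and 13,320 manually edited video clips. The dataset covers five types of action: human-object interaction, movement of part of the body, human-human interaction, playing musical instruments, and single-person sports. Each action class contains 25 groups of videos, and each group contains 4 to 7 clips. The ratio of training set to test set is approximately 2.5 to 1.

HMDB51: The HMDB51 dataset is drawn from a variety of media sources, mainly movie clips, with a small portion from public databases such as the Prelinger archive, YouTube, and Google. It contains 6,849 manually edited video clips in 51 action classes, grouped into five broad types: general facial actions and expressions, local body movements that manipulate objects, general body movements, body movements with object interaction, and body movements for human interaction.

These two datasets are used because they provide a reasonable distribution of action classes in the video data and can therefore test and validate the recognition performance of the model. Table 1 gives details of the datasets, including the number of videos, the total number of action classes, and the total number of clips. For both datasets, the official standard training/test splits are used.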

Table 1 Details of the experimental datasets

Dataset    Total clips    Videos    Classes
UCF101     13,320         2,500     101
HMDB51     6,766          3,312     51

Fig. 3 Multi-time-scale video keyframe extraction strategy

Fig. 4 Visualization of color keyframes and differential keyframes

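3.2 Data preprocessing

So that the extracted features fully express the temporal action information in a video, this paper uses a multi-time-scale frame sampling strategy to extract keyframes from the raw video data. This strategy lets the raw data fed to the network contain more of the video's temporal information, giving the model richer temporal features while keeping the input dimension fixed. For each video in the datasets, FFmpeg is used to convert the video into a frame sequence; Fig. 3 visualizes the result. Once the total number of frames is known, one of two sampling schemes is applied to the video:

1) When the total number of frames in the video is less than or equal to the number of frames required as model input, all frames are taken as keyframes. The number of full passes over the frames and the number of supplementary frames are computed, and the final keyframes fed to the model are drawn from the available frames accordingly.

2) When the total number of frames is greater than the number required as model input, keyframes are taken forward from a starting interval. Subtracting the required number of frames from the total gives the range of random starting points; adding the required number of frames to the starting point gives the end position, and keyframes are drawn within this feasible range.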
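The two sampling cases above can be sketched as a single index-selection function. The function name, the choice of consecutive frames in case 2, and the looping order in case 1 are assumptions made for illustration; the paper does not give the exact procedure.

```python
import random
from typing import List, Optional


def sample_frame_indices(total: int, needed: int,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Choose `needed` frame indices from a video with `total` frames."""
    if total <= 0 or needed <= 0:
        raise ValueError("total and needed must be positive")
    rng = rng or random.Random()
    if total <= needed:
        # Case 1: loop over all frames, then add the leftover frames.
        passes, extra = divmod(needed, total)
        return list(range(total)) * passes + list(range(extra))
    # Case 2: random start in [0, total - needed], then `needed` frames forward.
    start = rng.randint(0, total - needed)
    return list(range(start, start + needed))
```

For a 3-frame video and 8 required frames, case 1 makes two full passes and adds two supplementary frames; for a longer video, case 2 always stays inside the valid frame range because the start point is capped at `total - needed`.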

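3.3 Data visualization

The color images obtained from multi-time-scale frame sampling are augmented to generate varied training samples and prevent overfitting during training. Local image cropping and horizontal flipping are used to enlarge the training set: the network input images are fixed at 256×340, the width and height of the cropping region are chosen at random from {256, 224, 192, 168}, and the cropped regions are then resized to 224×224 for network training. This augmentation both increases the amount of data and makes the model more robust.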
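The cropping and flipping step might be planned as in the sketch below, which only draws the crop box and flip decision and leaves the actual cropping and resizing to an image library. Reading 256×340 as height × width, placing the crop at a uniformly random position, and flipping with probability 0.5 are assumptions of this example.

```python
import random
from typing import NamedTuple, Optional

CROP_SIZES = (256, 224, 192, 168)      # candidate crop widths and heights from the text
IMAGE_HEIGHT, IMAGE_WIDTH = 256, 340   # assumption: 256×340 means height × width
OUTPUT_SIZE = 224                      # every crop is later resized to 224×224


class CropPlan(NamedTuple):
    left: int
    top: int
    width: int
    height: int
    flip: bool


def random_crop_plan(rng: Optional[random.Random] = None) -> CropPlan:
    """Draw one random crop box and horizontal-flip decision for a frame."""
    rng = rng or random.Random()
    width = rng.choice(CROP_SIZES)
    height = rng.choice(CROP_SIZES)
    left = rng.randint(0, IMAGE_WIDTH - width)    # inclusive bounds keep the box inside
    top = rng.randint(0, IMAGE_HEIGHT - height)
    return CropPlan(left, top, width, height, flip=rng.random() < 0.5)
```

Because width and height are drawn independently, the crop can be non-square before it is resized to `OUTPUT_SIZE`, which adds aspect-ratio variation on top of the position and flip variation.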

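After augmentation, the keyframe image set is obtained, and differencing adjacent frames gives the differential keyframe image set, visualized in Fig. 4. From left to right, the action classes are horse riding, cycling, ice skating, skiing, jumping jacks, and a multi-player basketball game. The figure shows that most of the background noise has been removed while the moving human targets are preserved. Color images and differential images capture visual information from different perspectives, so the same action is represented in different ways that complement each other at the semantic level.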




Fig. 5 Training loss of the model

Fig. 6 Training accuracy of the model




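3.4 Experimental results

To verify the recognition performance of the proposed model, the experiments were implemented with the open-source PyTorch 0.4 library on Ubuntu 16.04 and trained in parallel on four NVIDIA Tesla P40 (48 GB) GPUs. The spatial network and the deep differential network were each trained for 80 epochs with a batch size of 256 to avoid running out of memory. The learning rate was held constant for the first third of training and then decayed by 10% every 10 epochs to prevent overfitting. Automatic gradient clipping was used during training to ensure that the gradients decreased as expected. The training loss and training accuracy on the two datasets are shown in Fig. 5 and Fig. 6 respectively.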
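The learning rate schedule described above can be written as a small function. The base learning rate is not given in the paper, so `base_lr` is a hypothetical parameter, and the sketch assumes the first decay takes effect 10 epochs after the constant phase ends.

```python
def learning_rate(epoch: int, base_lr: float, total_epochs: int = 80,
                  decay_every: int = 10, decay_factor: float = 0.9) -> float:
    """Learning rate for a 0-based epoch index.

    The rate stays at base_lr for the first third of training, then is
    multiplied by decay_factor (a 10% decay) once every decay_every epochs.
    """
    hold = total_epochs // 3  # 26 epochs for an 80-epoch run
    if epoch < hold:
        return base_lr
    steps = (epoch - hold) // decay_every
    return base_lr * decay_factor ** steps
```

Under these assumptions an 80-epoch run applies five decays in total, so the final learning rate is `base_lr * 0.9 ** 5`, roughly 59% of the starting value.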

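Fig. 5 shows that on UCF101 the training loss of the deep differential network fluctuates while that of the spatial network is relatively stable, and the training loss of the fused two-stream network converges quickly and stably. The behavior on HMDB51 likewise shows that the two-stream network converges well, so it can extract high-dimensional spatio-temporal features of human actions and learn the information they contain more efficiently. In Fig. 6, on both UCF101 and HMDB51 the training accuracy of the fused two-stream network rises quickly and approaches 100%. This indicates that the decision-level fusion mechanism effectively fuses the spatial information extracted by the spatial network with the temporal information extracted by the deep differential network, interpreting human actions from both spatial and temporal perspectives and compensating for the incomplete, one-sided view of a single network.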

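Table 2 compares the recognition results of the proposed model with those of current popular algorithms. On UCF101, the model is 6.1 percentage points more accurate than the LSTM-optimized two-stream method[6] and 1.2 percentage points more accurate than the 152-layer residual two-stream network. Among 3D convolution methods, both the pioneering C3D[4] and T3D[7], which introduces a feature pyramid model with different scales, are considerably less accurate than the proposed model. Thanks to the deep differential network and the improved decision-level fusion mechanism, the model improves on the long-term temporal convolution (LTC) network[8] by 3 percentage points. HMDB51 confirms this conclusion: compared with RNN+FV[10] and the deep trajectory features (TDD+iDT+FV)[11], the improvements are close to 15 and 6 percentage points respectively, showing that the proposed model recognizes human actions more effectively.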

Table 2 Comparison of human action recognition accuracy

Method                       UCF101    HMDB51
C3D [4]                      85.2%     -
Soft Attention + LSTM [5]    -         41.3%
Two-Stream + LSTM [6]        88.6%     -
RNN+FV [10]                  88.0%     54.3%
T3D [7]                      90.3%     59.2%
TDD+iDT+FV [11]              90.3%     63.2%
LTC [8]                      91.7%     64.8%
ActionVLAD [12]              92.7%     66.9%
ST-ResNet [13]               93.5%     66.4%
Proposed model               94.7%     69.0%

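4 Conclusion

To address the wide variety of actions and the low recognition accuracy in human action recognition, this paper feeds differences computed between color images into a deep differential network to extract temporal features, using differential keyframes in place of optical flow images as network input. This preserves accuracy while greatly reducing the computation time incurred by optical flow. In addition, this paper proposes a decision-level feature fusion mechanism with an improved Softmax logistic regression function, which retains more of the spatial and temporal information of inter-frame images from the different networks. With the deep differential network replacing the temporal stream of the two-stream network to capture the inter-frame and temporal relationships of the video, the two-stream network reflects the information contained in the video more faithfully in both space and time, the probability of a wrong output from a single network is greatly reduced, and the improved deep differential network retains good pose representation and classification ability for human actions.

Comparison with other deep learning algorithms on the UCF101 and HMDB51 datasets verifies the effectiveness of the proposed model. Future work will consider fusing multimodal information such as audio features into the model, to further improve human action recognition accuracy on video data that contains useful audio.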

References:

[1] Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey[J]. Image and Vision Computing, 2017, 60: 4-21.

[2] Ke Q, Bennamoun M, An S, et al. Leveraging structural context models and ranking score fusion for human interaction prediction[J]. IEEE Transactions on Multimedia, 2017, 20(7): 1712-1723.

[3] Ofer E, Epstein A, Sadeh D, et al. Applying Deep Learning to Object Store Caching[C]//Proceedings of the 11th ACM International Systems and Storage Conference. Haifa, Israel: ACM, 2018: 126-126.

[4] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3d convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile, 2015: 4489-4497.

[5] Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[C]//3rd International Conference on Learning Representations. Puerto Rico, USA, 2015.

[6] Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Puerto Rico, USA, 2015: 4694-4702.

[7] Diba A, Fayyaz M, Sharma V, et al. Temporal 3d convnets using temporal transition layer[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Utah, USA, 2018: 1117-1121.

[8] Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1510-1517.

[9] Wang X, Gao L, Wang P, et al. Two-stream 3-d convnet fusion for action recognition in videos with arbitrary size and length[J]. IEEE Transactions on Multimedia, 2017, 20(3): 634-644.

[10] Lev G, Sadeh G, Klein B, et al. Rnn fisher vectors for action recognition and image annotation[C]//European Conference on Computer Vision. Springer: Cham, Switzerland, 2016: 833-850.

[11] Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.

[12] Girdhar R, Ramanan D, Gupta A, et al. Actionvlad: Learning spatio-temporal aggregation for action classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii, USA, 2017: 971-980.

[13] Feichtenhofer C, Pinz A, Wildes R. Spatiotemporal residual networks for video action recognition[C]//Advances in Neural Information Processing Systems. Barcelona, Spain, 2016: 3468-3476.

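About the author:

PENG Shi-yu (born 1995), male, master's student. Research interests: deep learning, information fusion, video classification, and human action recognition.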


Two-Stream Network Based on Improved Decision-Level Fusion Mechanism for Human Action Recognition

PENG Shi-yu, JIN Xue-bo, SU Ting-li, KONG Jian-lei, BAI Yu-ting

(School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048)

Abstract: To handle the diversity of human actions and increase recognition accuracy in video data, we propose a two-stream network based on an improved decision-level fusion mechanism, which feeds differential images instead of optical flow into a deep differential network. Extracting temporal features in this way preserves accuracy while greatly reducing computation time. In addition, by improving the Softmax logistic regression loss function, the decision-level fusion mechanism retains more of the spatio-temporal features of the different networks, so that the information contained between keyframes is reflected more faithfully. The proposed model is validated on the UCF101 and HMDB51 public datasets. The experimental results demonstrate the effectiveness of the improved decision-level fusion mechanism, and the improved two-stream network achieves higher recognition accuracy than the deep learning algorithms it is compared with.

Keywords: Two-stream network; Decision-level fusion mechanism; Human action recognition

