
Aalto University publication series
DOCTORAL DISSERTATIONS 67/2020

Sparsity Driven Statistical Learning for High-Dimensional Regression and Classification

MUHAMMAD NAVEED TABASSUM

Aalto University
School of Electrical Engineering
Department of Signal Processing and Acoustics
www.aalto.fi
2020

A plethora of applications involve data generation in high-dimensional (HD) settings due to technological advancement and the nature of the task. Accordingly, the massive amount of HD datasets demands advanced supervised learning approaches for their explanatory and predictive modeling that helps in extracting meaningful and decisive information. This dissertation proposes new sparsity- and compressibility-driven techniques for regression and classification, which offer improved explanatory and predictive powers. The developed methods are successfully applied (i) for the detection and estimation of directions-of-arrival using compressed beamforming technique, and (ii) for the data-dependent feature (gene) selection in the gene expression-based classification problems.

ISBN 978-952-60-3858-2 (printed)
ISBN 978-952-60-3859-9 (pdf)
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)



Aalto University publication series
DOCTORAL DISSERTATIONS 67/2020

Sparsity Driven Statistical Learning for High-Dimensional Regression and Classification

MUHAMMAD NAVEED TABASSUM

A doctoral dissertation completed for the degree of Doctor of Science (Technology) to be defended, with the permission of the Aalto University School of Electrical Engineering, via a remote connection, on 14 May 2020 at 16:00.

Aalto University
School of Electrical Engineering
Department of Signal Processing and Acoustics

The public defense on 22nd May 2020 at 12:00 will be organized via remote technology.

Link: https://aalto.zoom.us/j/61370096692

Zoom Quick Guide: https://www.aalto.fi/en/services/zoom-quick-guide


Supervising professor
Associate Professor Esa Ollila, Aalto University, Finland

Thesis advisor
Associate Professor Esa Ollila, Aalto University, Finland

Preliminary examiners
Professor Andreas Jakobsson, Lund University, Sweden
Professor David Brie, University of Lorraine, France

Opponents
Professor Andreas Jakobsson, Lund University, Sweden
Professor Christoph Mecklenbräuker, Vienna University of Technology, Austria

Aalto University publication series
DOCTORAL DISSERTATIONS 67/2020

© 2020 MUHAMMAD NAVEED TABASSUM

ISBN 978-952-60-3858-2 (printed)
ISBN 978-952-60-3859-9 (pdf)
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-60-3859-9

Unigrafia Oy
Helsinki 2020

Finland

Publication orders (printed book):

[email protected]

Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Author MUHAMMAD NAVEED TABASSUM

Name of the doctoral dissertation
Sparsity Driven Statistical Learning for High-Dimensional Regression and Classification

Publisher School of Electrical Engineering

Unit Department of Signal Processing and Acoustics

Series Aalto University publication series DOCTORAL DISSERTATIONS 67/2020

Field of research Signal Processing Technology

Manuscript submitted 29 January 2020    Date of the defence 14 May 2020

Permission for public defence granted (date) 3 April 2020    Language English

Monograph / Article dissertation / Essay dissertation

Abstract
Statistical analysis and modeling techniques are needed to acquire information from a plethora of high-dimensional (HD) data which are being generated due to digitalization, an increase of computing power, sensors and smart devices in our everyday life. Statistical learning from HD data is still a challenging problem despite the continuous improvement of computational resources and learning techniques. The performance of supervised learning approaches, such as regression or classification, often degrades when there exists an insufficient number of observations (samples) compared to data dimensionality (variables). Developing supervised learning methods for explanatory and predictive modeling of such HD data sets is crucial. Therefore, in this thesis, we propose new sparsity-driven methods for regression and classification which offer improved explanatory and predictive powers.

In this thesis, new solvers are proposed for sparsity-driven linear regression problems, namely Lasso and elastic net (EN), which are specially designed to handle complex-valued data. These solvers are applied for explanatory modeling to estimate the direction-of-arrivals (DoAs) of impinging sources to a sensor array using compressed beamforming (CBF) technique. The developed methods are, however, completely general and can be applied in various HD linear regression problems dealing with complex- or real-valued data. Moreover, an approach called sequential adaptive EN is developed to enhance the recovery of the exact support of the sparse signal vector. This is then used to find the DoAs of sources using the CBF framework. Furthermore, the regularization paths of the Lasso and EN computed by the developed algorithm and generalized information criterion are used in proposing a novel method for detecting the sparsity level of the signal, which corresponds to the number of sources in DoA estimation problem.

This thesis also proposes a compressive classification framework for predicting the class of high-dimensional observation. The proposed compressive regularized discriminant analysis (CRDA)-based set of classifiers is applied for feature selection and classification of HD data, particularly gene expression data. CRDA-based approach outperforms current state-of-the-art methods that fail in at least one of the three facets, namely accuracy, learning speed and interpretability.

Keywords Classification, Compressibility, Feature Selection, High-Dimensional Statistics, Joint-sparse Recovery, Regression, Sparsity, Statistical Learning.

ISBN (printed) 978-952-60-3858-2    ISBN (pdf) 978-952-60-3859-9
ISSN (printed) 1799-4934    ISSN (pdf) 1799-4942

Location of publisher Helsinki    Location of printing Helsinki    Year 2020

Pages 148    urn http://urn.fi/URN:ISBN:978-952-60-3859-9


To all my family and friends, especially Nano, Ami, Abu, Qurattulain,
Abdullah, Hajra, Amina, and Mohammad, who were my source of intrinsic motivation.


Preface

The research work presented in this doctoral dissertation has been carried out, during the years 2016-2020, at the Department of Signal Processing and Acoustics, Aalto University, Finland. I express my profound gratitude and appreciation to my supervisor, Prof. Esa Ollila, for teaching me the principles of science, pedagogy, and reproducible research. I am indebted to him for his compassion, expert guidance, and constant encouragement and support. He was always very accommodating and approachable and provided me critical suggestions after stimulating discussions. These are qualities that rank very high on a student's wish list and I consider myself fortunate to have him as my supervisor.

This research work has been funded by the Aalto ELEC Doctoral School funding and the Academy of Finland. I am thankful to the Nokia Foundation for its scholarship that helped me partially fund my doctoral dissertation.

I am thankful to Prof. Christoph Mecklenbräuker and Prof. Andreas Jakobsson for having taken the time to be the opponents for my defense. I also wish to thank my dissertation's reviewers and preliminary examiners, Prof. Andreas Jakobsson and Prof. David Brie, for their time and effort to carefully read my dissertation and give constructive comments which helped in finalizing the dissertation. I would like to thank Prof. Visa Koivunen and Prof. Sergiy A. Vorobyov for their informative courses and critical comments in the group seminars. I would also like to express my gratitude to my former supervisors, Prof. Ibrahim Elshafiey, Prof. Mubashir Alam, Prof. Saleh A. Alshebeili, Prof. Amjad Hussain and Late Luqman Qadir, without whom, I would not have come so far in my career and for pursuing the doctorate degree in this lovely country.

Everything I have achieved and whatever is in store for me is all the will (and Qadar) of ALLAH, the Best of the Planners. Accordingly, I am blessed with a large number of friends and family who deserve my gratitude for being part of my career (and life) journey. It is difficult to name all of them here, but the first special mention includes Muhammad Naveed Akram who is always there when needed for whatever help and for various intellectual conversations. Others among the best of the lot are Ustadh Abdul-Basit Syed, Ustadh Shahzad Farooq, Dr. Mustafa K. Masood, Abdul-Razzaq, Dr. Inam Ullah, Usman Farid, Rizwan Ali, Rashed Meer, Imtiaz Ali, and Dr. Waqar A. Malik, who are the catalyst for improving my focus to the life of both herein and hereafter. Indeed, this fostered a strong sense of solace and made me concentrate on things that matter and are in my control. Besides, I cannot forget to thank and mention Yaseen Khan, Dr. Ammar Arshad, Nouman Ali, Awais, Farhan, Dr. Saad, (Adnan)3, (Fawad)2, Dr. Waqar, Hammad, Shankar, Nadir, Ghaffar, Waleed, other OKKians, Ijaz Akhter and Ijaz Khan for being the reason of physical and other social activities during my doctoral studies. Other important mentions, whom I appreciate most for being there in different moments of my life, include Dr. Zeshan Ali, Dr. Ishtiaq Ahmad, Nadeem Ashraf, Shoaiba, Dr. Umar Saeed, Wasif Naveed, Dr. Aqdas Malik, Dr. Nasrullah, Dr. Ghufran Hashmi, Abid, Kaleem, (Shahid)3, (Wasim)2, Rawaiz, Tauseef Siddiqui, Dr. Suleiman Khan, all KSUians, and all OSPians. This list of so many nice people I have in my life is to be continued beyond this page.

Finally, I am indebted to my beloved parents and my Late Grandma (aka Nano who was my best buddy and passed away just a few weeks ago) for their immense love, unconditional support, and consistent prayers, not only through the doctoral studies but all throughout the highs and lows of my life. I would like to especially thank my father, who along with so many other things also made my foundations in the subject of life and its ups and downs. I also wish to thank my brother, sisters and the rest of my relatives for their well-wishes. Lastly, I am extremely grateful to my caring wife and our adorable children Abdullah, Hajra, Amina and Mohammad for absolutely everything.

Espoo, May 14, 2020,

Muhammad Naveed Tabassum


Contents

Preface 3

Contents 5

List of Publications 7

Author's Contribution 9

List of Abbreviations 11

List of Symbols 15

1. Introduction 17
   1.1 Background and Motivation 17
   1.2 Aims and Scope 18
   1.3 Scientific Contributions 19
   1.4 Dissertation Structure 19

2. Sparse Signal Recovery: Estimation 21
   2.1 Sparse Linear Model 21
   2.2 Sparse Penalized Regression 23
       2.2.1 Regularized least squares 24
       2.2.2 The bias-variance tradeoff 25
   2.3 Complex-valued Lasso and Elastic Net 26
       2.3.1 Cyclic coordinate descent in complex domain 27
   2.4 Complex-valued LARS method for Lasso and EN 28
       2.4.1 The c-LARS-Lasso method 29
       2.4.2 The c-PW-EN method 30
   2.5 Numerical example 32

3. Sparse Signal Recovery: Detection 35
   3.1 Overview 35
   3.2 Covariance Test for Elastic net 36
   3.3 Orthonormal Predictor Matrix 37
       3.3.1 Rayleigh test 38
   3.4 GIC-based Sparsity Order Detection 40
       3.4.1 Information criteria 41
       3.4.2 The c-LARS-GIC method 41
   3.5 An Example 43

4. Detection and Estimation of Directions-of-Arrivals of Sources 47
   4.1 Sensor Array Signal Model 47
   4.2 Conventional DoA Estimation 49
       4.2.1 Beamformers 49
       4.2.2 Subspace methods 50
       4.2.3 Parametric techniques 51
       4.2.4 Limitations 52
   4.3 Single-snapshot DoA Estimation 52
       4.3.1 Compressed beamforming 53
       4.3.2 Sequentially adaptive elastic net approach 53
   4.4 Detection and Estimation of DoAs 55

5. High-dimensional classification 57
   5.1 Motivation and Background 58
   5.2 High-dimensional Gene Expression Data 59
       5.2.1 Microarrays 61
       5.2.2 Conventional statistical tests 61
   5.3 Compressive Regularized Discriminant Analysis 62
       5.3.1 Basic set-up 63
       5.3.2 Elements of CRDA 64
   5.4 Numerical examples on real-world data 67

6. Discussion, Conclusions and Future Work 69

A. Appendix 71
   A.1 Expected Prediction Error 71
   A.2 Rayleigh Test Statistic for Global Null 72

Bibliography 73

Errata 83

Publications 85


List of Publications

This thesis consists of an overview and of the following publications which are referred to in the text by their Roman numerals.

I M. N. Tabassum and E. Ollila. Single-snapshot DoA estimation using adaptive elastic net in the complex domain. In Proc. of 4th IEEE International Workshop on Compressed Sensing Theory and its Applications to Radar, Sonar and Remote Sensing (CoSeRa), Aachen, Germany, pp. 197-201, Sept. 2016.

II M. N. Tabassum and E. Ollila. Pathwise least angle regression and a significance test for the elastic net. In Proc. of 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, pp. 1309-1313, 28 Aug. – 2 Sept. 2017.

III M. N. Tabassum and E. Ollila. Compressive Regularized Discriminant Analysis of High-Dimensional Data with Applications to Microarray Studies. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, pp. 4204-4208, April 2018.

IV M. N. Tabassum and E. Ollila. Sequential adaptive elastic net approach for single-snapshot source localization. The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. 3873–3882, Jun. 2018.

V M. N. Tabassum and E. Ollila. Simultaneous Signal Subspace Rank and Model Selection with an Application to Single-snapshot Source Localization. In Proc. of 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, pp. 1592-1596, Sept. 2018.

VI M. N. Tabassum and E. Ollila. A Compressive Classification Framework for High-Dimensional Data. Submitted to , Jan. 2020.


Author’s Contribution

Publication I: “Single-snapshot DoA estimation using adaptive elastic net in the complex domain”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.

Publication II: “Pathwise least angle regression and a significance test for the elastic net”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.

Publication III: “Compressive Regularized Discriminant Analysis of High-Dimensional Data with Applications to Microarray Studies”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.


Publication IV: “Sequential adaptive elastic net approach for single-snapshot source localization”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.

Publication V: “Simultaneous Signal Subspace Rank and Model Selection with an Application to Single-snapshot Source Localization”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.

Publication VI: “A Compressive Classification Framework for High-Dimensional Data”

The idea of the method and the algorithm was jointly developed with the co-author. The author was leading further developments of the methods, writing of the article and the software codes, as well as designing the simulation experiments, with inputs from the co-author.


List of Abbreviations

AIC Akaike information criterion

BIC Bayesian information criterion

BPDN Basis pursuit denoising

CBF Compressed beamforming

CCD Cyclic coordinate descent

CDF Cumulative distribution function

CEN Complex-valued elastic net

CoSaMP Compressive sampling matching pursuit

CRDA Compressive regularized discriminant analysis

CS Compressed sensing

CT Covariance test

CV Cross-validation

dB Decibels

DE Differentially expressed

DNN Deep neural network

DoA Direction-of-arrival

DoAs Direction-of-arrivals

DR Detection rate

EN Elastic net

EPE Expected prediction error

ESPRIT Estimation of signal parameters via rotational invariance techniques


ESRR Exact support recovery rate

FSR Feature selection rate

GEO Gene expression omnibus

GEP Gene expression profiling

GIC Generalized information criterion

HD High-dimensional

HDLSS High-dimension low-sample-size

IID Independent and identically distributed

KKT Karush-Kuhn-Tucker

LARS Least angle regression and shrinkage

Lasso Least absolute shrinkage and selection operator

LDA Linear discriminant analysis

MC Monte-Carlo

MDL Minimum description length

ML Maximum likelihood

mRNA Messenger ribonucleic acid

MSE Mean squared error

MUSIC Multiple signal classification

MVDR Minimum variance distortionless response

OLS Ordinary least squares

OMP Orthogonal matching pursuit

PLDA Penalized linear discriminant analysis

PSCM Penalized sample covariance matrix

PW Pathwise

Q-Q Quantile-quantile

QDA Quadratic discriminant analysis


RDA Regularized (linear) discriminant analysis

RIC Risk inflation criterion

RLS Regularized least squares

RSCM Regularized sample covariance matrix

RSS Residual sum of squares

RT Rayleigh test

SAEN Sequentially adaptive elastic net

SCM Sample covariance matrix

SCRDA Shrunken centroids regularized discriminant analysis

SNR Signal-to-noise ratio

SPCALDA Supervised principal component analysis based LDA

SPICE Sparse iterative covariance-based estimator

SSR Sparse signal recovery

SST Shrinkage-soft-thresholding

SVM Support vector machine

TER Test error rate

ULA Uniform linear array

varSelRF Variable selection based random forests

WEN Weighted elastic net


List of Symbols

Latin Letters
B Coefficient matrix.
B K-rowsparse coefficient matrix.
d(·) Discriminant functions vector.
G Number of total groups or classes.
H_K(·, ϕ) Hard-thresholding operator as a K-rowsparse transform.
K Level (i.e., order or degree) of sparsity.
M Group-means matrix.
n Number of samples, observations, data points or sensors.
n_g Number of samples (or observations) for g-th group.
p Number of predictors, variables or features.
P_FA Probability of false alarm.
R_y Covariance matrix of observed measurements.
X The predictor (measurement or expression) matrix.
x[i] An i-th p-dimensional vector of predictors in the matrix X.
x_j A j-th column vector of the matrix X.
y Measurements vector (or a snapshot), or group-labels vector.
Y The snapshot matrix of multiple snapshots.

Greek Letters
α Tuning parameter in elastic net (EN) penalty.
β The unknown signal (i.e., p-dimensional regression vector).
γ Penalty parameter for only the ℓ1 part of EN penalty.
Δ Least-squares direction.
δ Least-squares direction using the active set.
η Penalty parameter for only the ℓ2 part of EN penalty.
λ Penalty (or regularization) parameter of EN penalty.
π Prior probabilities of all groups.
Σ Covariance matrix.
τ Threshold level.
ϕ Hard-thresholding selector function.


Notations
| · | Cardinality of a set, i.e., the number of elements in the set.
supp(·) The index set of a vector's non-zero elements.
A_k Support of a vector with k = |A_k| non-zero entries.
CN Complex Gaussian distribution.
C The set of complex numbers.
R The set of real numbers.
diag(·) Diagonal elements of a matrix.
‖a‖_0 Number of non-zero elements in the vector a ∈ C^p.
‖b‖_q The ℓ_q-norm (also called q-norm) of a vector b ∈ R^G.
⊙ The Hadamard (i.e., element-wise) product.
⊘ Element-wise division.
(·)* Complex conjugate of a matrix.
(·)† Moore-Penrose pseudo-inverse.
(·)^H Conjugate-transpose of a matrix.
(·)^⊤ Matrix transpose.
1(·) Indicator function.


1. Introduction

1.1 Background and Motivation

The collection of data and its diversity is mainly due to recent rapid advances in computer technologies. This data deluge has created high-dimensional (HD) [1] statistical learning problems that require accurate model fitting with an emphasis on computation and variable (feature) selection. HD data refers to a data set where the dimensionality staggeringly (e.g., genomics data) or comparatively exceeds the number of observations n [2], i.e., p ≈ n or p ≫ n.

Model fitting is crucial in supervised learning problems, e.g., in regression that involves predicting a numerical label or in classification that involves predicting a group label. Important aspects of model fitting are interpretation of learned models, their predictive power (e.g., misclassification rate) as well as explanatory (or descriptive) power, e.g., in terms of mean squared error (MSE). The goals in model selection include drawing qualitative inferences about the underlying statistical process, accurate predictions (e.g., correctly classifying unseen data) and the discovery of significant predictors or features in order to have interpretable models.

Sparsity or parsimony of underlying models is crucial for their proper interpretations [1, 3]. Regularization (or shrinkage estimation) techniques such as the least absolute shrinkage and selection operator (Lasso) [4] or the elastic net (EN) [5] can be used for simultaneous sparse estimation and variable selection in HD linear regression models, where the underlying signal either depends only on K ≪ p predictors or can be approximated by a K-sparse signal. Variable selection can improve both model interpretation and predictions by avoiding over-fitting [6]. An ideal sparse estimation and variable selection method should select parsimonious and interpretable models that offer high predictive power while still being fast to compute. Many of the standard learning methods lack in at least one of the aforementioned criteria. Accordingly, this dissertation focuses on developing supervised sparsity driven regression and classification methods that offer a good balance on the performance criteria (sparsity, accuracy, fast computation). To be more specific, new algorithms for sparse linear regression are developed and their usefulness is illustrated, e.g., in the compressive beamforming application; a significance test of entering predictors for the popular EN is developed; detectors of the true sparsity level of the signal are developed; and a compressive classification framework is proposed.

1.2 Aims and Scope

The main objective of the dissertation is to develop efficient sparsity-driven methods for HD classification and regression, especially for complex-valued data. Research conducted in this dissertation can be divided into two main research themes:

1. Detection and estimation of the unknown sparse signal vector in the complex-valued linear regression models.

2. Compressive classification of HD data offering accuracy, lower computational cost, and interpretability.

In the first theme, the main aim is to develop new solvers in the complex domain for the EN problem that is a superset of Lasso, and has the capacity of selecting groups of highly correlated variables. The first theme can also be divided into the following two subtasks:

• Sparsity order detection task, where the problem is to find the true sparsity (model) order, that is, the number of non-zero signal coefficients.

• Sparse signal estimation task, where the problem is to estimate the support and the non-zero elements of the unknown sparse signal given the sparsity order.

The second theme involves the development of a compressive classification framework that can perform feature selection during the learning (i.e., training) process. Research work in this dissertation also follows guidelines of reproducible research and offers publicly available software packages and toolboxes for the developed methods.

The main application of the developed sparse regression methods for complex-valued measurements is the grid-based direction-of-arrival (DoA) estimation of sources using a sensor array. It should be emphasized that the developed methods are widely applicable also in other application areas where sparse linear regression methods have been found to be useful.


1.3 Scientific Contributions

The main scientific contributions of this dissertation are briefly outlined as follows:

1. We developed a cyclic coordinate descent (CCD) algorithm for the complex-valued elastic net (CEN), referred to as the CCD-CEN method in this dissertation. The CCD-CEN can also be used to compute the Lasso estimator for complex-valued data. This contribution is summarized in Chapter 2.

2. We developed a least angle regression and shrinkage (LARS) [7] algorithm for finding the regularization (solution) path of the Lasso for complex-valued data, referred to as the c-LARS-Lasso method in this dissertation. It finds the values of the regularization parameter (called knots) where a new predictor enters the model. This contribution is summarized in Chapter 2.

3. The developed c-LARS-Lasso method is then used for developing the LARS algorithm for the EN, referred to as c-PW-EN in this dissertation. This contribution is summarized in Chapter 2. The algorithm is applicable for both real- and complex-valued data.

4. Based on the c-LARS-Lasso algorithm and the generalized information criterion (GIC) [8], we proposed a detector of the sparsity order, referred to as the c-LARS-GIC method. This contribution is summarized in Chapter 3.

5. We developed a statistical significance test of the variable entering the model in the EN solution path in the spirit of [9]. This contribution is summarized in Chapter 3.

6. For the DoA estimation problem, we proposed a sequentially adaptive elastic net (SAEN) approach that improves the recovery of the exact source DoAs, that is, the support of the underlying sparse signal vector. This contribution is summarized in Chapter 4.

7. We proposed a compressive classification framework for HD data that is a modification of the regularized linear discriminant analysis. This contribution is summarized in Chapter 5.

1.4 Dissertation Structure

This dissertation is divided into an introductory part that summarises the contributions in Publications P.I - P.VI, and a collection of these six original Publications P.I - P.VI. These attached publications include the original theory, methods, and results that are presented in this dissertation.

Figure 1.1. Organization of the dissertation.

The introductory part of this dissertation is organized in the manner illustrated in Figure 1.1. Namely, the task of sparse signal estimation is addressed in Chapter 2 and sparsity order detection in Chapter 3. In Chapter 4 we provide an overview of the traditional DoA estimation techniques and discuss our contribution for grid-based compressive beamforming that is used for detection and estimation of source DoAs. In Chapter 5, we discuss HD data, in particular, gene expression data obtained using microarrays and the proposed compressive classification framework. We provide comparison results for two real-life datasets as well. Finally, we conclude the dissertation with discussion and future directions.


2. Sparse Signal Recovery: Estimation

Sparse signal recovery (SSR) has attracted a lot of research interest over the past two decades due to the widespread use of the concept of sparsity (and compressibility) in mathematical modeling and due to its many applications in diverse fields; see e.g., [10, 11, 12, 13, 14, 15] and [16, 17]. Despite the great interest in exploiting sparsity in various applications, most of the tools developed for solving the SSR problem are focused on real-valued data, whereas this dissertation mainly deals with SSR for complex-valued data. In this chapter, we first provide a brief overview and description of the SSR problem, focusing on sparse regularized regression. Thereafter, we describe the proposed methods for solving the SSR problems, which are detailed in Publications P.I, P.II and P.IV.

Notations: Lowercase boldface letters are used for vectors and uppercase for matrices. The support A of a vector a ∈ C^p is the index set of its non-zero elements, i.e., A = supp(a) = {j ∈ {1, …, p} : a_j ≠ 0}. The ℓ0-(pseudo)norm of a is equal to the total number of non-zero elements in it, and is defined as ‖a‖_0 = Σ_{j=1}^{p} 1(a_j ≠ 0) = |supp(a)|, where 1(·) is an indicator function. The ℓ2-norm and the ℓ1-norm are defined as ‖a‖_2 = √(a^H a) and ‖a‖_1 = Σ_{i=1}^{p} |a_i|, respectively, where |a| = √(a*a) = √(a_R² + a_I²) denotes the modulus of a complex number a = a_R + ıa_I (with ı = √−1 denoting the imaginary unit). We denote by ⟨a, b⟩ = a^H b the Hermitian inner product of C^p. For a vector β ∈ C^p (resp. matrix X ∈ C^{n×p}) and an index set A_k ⊆ {1, 2, …, p} of cardinality |A_k| = k, we denote by β_{A_k} (resp. X_{A_k}) the k × 1 vector (resp. n × k matrix) restricted to the components of β (resp. columns of X) indexed by the set A_k.
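These notational conventions translate directly into array operations. The following minimal numpy sketch (an illustration added here, not part of the original text) evaluates the support, the ℓ0-, ℓ1- and ℓ2-norms and the Hermitian inner product for a small complex vector.

```python
import numpy as np

a = np.array([1 + 2j, 0, 0, 3 - 1j, 0])   # a in C^5
b = np.array([2j, 1, 0, 1 + 1j, -1])

support = np.flatnonzero(a != 0)            # supp(a) = {0, 3} (0-based indexing)
l0 = np.count_nonzero(a)                    # ||a||_0 = |supp(a)| = 2
l1 = np.sum(np.abs(a))                      # ||a||_1 = sum of moduli
l2 = np.sqrt(np.vdot(a, a).real)            # ||a||_2 = sqrt(a^H a)
inner = np.vdot(a, b)                       # <a, b> = a^H b (conjugates the first argument)

A_k = support                               # restriction to an index set
a_restricted = a[A_k]                       # beta_{A_k}-style sub-vector
```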

2.1 Sparse Linear Model

In several applications, the measurement vector can have a sparse (or compressible) representation in some suitable basis or domain. Sparse signal approximations are applied in diverse fields such as artificial intelligence, coding and information theory, imaging and signal processing. For instance, it is used for variable selection in regression models [4, 7, 5, 18, 19, 20, 17]. Thereafter, post-variable-selection inference can be performed, e.g., in the form of statistical hypothesis tests [9, 21, 22, 23]. Other examples include sparse principal component analysis [24] and compressed sensing (CS) [25, 26] that can be used to recover sparse or compressive signals with fewer measurements than the traditional methods.

Consider a collection of n predictor-response pairs {(x[i], yi)}, i = 1, …, n, where each x[i] ∈ C^p is a p-dimensional vector of predictors and each yi ∈ C is the associated response variable. The predictors and responses are related via a system of linear equations of the form

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad (2.1) $$

where y = (y1, y2, …, yn)^⊤ denotes the observed measurement vector, X ∈ C^{n×p} is the predictor matrix (measurement matrix or design matrix), β = (β1, β2, …, βp)^⊤ ∈ C^p is the unknown signal vector (or vector of regression coefficients) to be estimated, and the elements εi of the noise term ε are independent and identically distributed (IID) complex-valued random variables from the CN(0, σ²) distribution, that is, ε ∼ CN_n(0, σ²I). Without any loss of generality we assume that the column vectors of the predictor matrix

$$ \mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix} = \begin{pmatrix} \mathbf{x}_{[1]} & \mathbf{x}_{[2]} & \cdots & \mathbf{x}_{[n]} \end{pmatrix}^{\top} $$

are of unit norm, that is, ‖x_j‖_2 = 1 for j = 1, 2, …, p. Moreover, for simplicity, the intercept (or bias) term is ignored because we assume that predictors and responses in the linear model (2.1) are mean centered. All vectors and matrices in (2.1), i.e., y, X, β, ε, are complex-valued by default, and the imaginary part is considered to be zero when the data is real-valued.

Assumption 2.1. It is assumed that the signal vector β ∈ C^p is K-sparse, which means that it has

$$ K = \|\boldsymbol{\beta}\|_0 \ll p $$

non-zero elements whereas the rest of the elements are equal to zero. The parameter K is referred to as the sparsity level (or sparsity order) of β, and the support set (with K indices) is denoted by A* = supp(β).

Even if the K-sparsity of the signal vector does not hold exactly, it is often the case that the signal vector is at least compressible [26], meaning that ‖β − β_K‖_q ≤ cK^{−r}, where β_K is a best K-term approximation of β, c > 0, r > 1 and q ≥ 1.
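As a concrete illustration of the model (2.1) under Assumption 2.1, the following sketch (the dimensions and seed are hypothetical, not taken from the dissertation's experiments) generates a K-sparse complex signal, a predictor matrix with unit-norm columns, and noisy measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, sigma = 50, 200, 5, 0.1                   # illustrative sizes only

# Predictor matrix with unit l2-norm columns
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)

# K-sparse signal: support A* with K non-zero complex coefficients
beta = np.zeros(p, dtype=complex)
support = rng.choice(p, size=K, replace=False)
beta[support] = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# Circular complex Gaussian noise, eps ~ CN(0, sigma^2 I)
eps = sigma / np.sqrt(2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
y = X @ beta + eps                                 # measurements, eq. (2.1)
```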

Sparsity (or compressibility) of underlying models plays an important role in the development of statistical and machine learning techniques, e.g., for models' proper interpretations. Sparsity is particularly important in the undersampled (i.e., n ≪ p) case where the linear system is underdetermined and the conventional least-squares criterion has infinitely many solutions. The aim of the recovery process of sparse signals is twofold: first, to find the sparse signal β, and second, to ensure that the sparse model that is found is consistent with the measurements. Furthermore, the underlying number of non-zero entries in β is often assumed to be known, which may not be the case in practice. Therefore, it is crucial to correctly identify the underlying sparsity level (or model order). Consequently, it is increasingly important to develop efficient sparse recovery algorithms that are able to detect the true sparsity order and to estimate the underlying sparse signal vector β.

In this dissertation, we mostly consider the settings in which p > n or p ≫ n and the aim is to fit a sparse linear model that only depends on a subset of K ≪ p predictors (i.e., β is K-sparse). This task can also be divided into sub-problems: (a) finding the sparsity order K (Chapter 3) and (b) estimating the support A* (this chapter).

2.2 Sparse Penalized Regression

For a typical linear regression model, the ordinary least squares (OLS) method is used to estimate the unknown signal vector. The OLS estimator minimizes the residual sum of squares (RSS) criterion function

$$ \underset{\boldsymbol{\beta} \in \mathbb{C}^p}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2. \qquad (2.2) $$

The above minimization problem leads to the well-known closed-form solution β̂ = (X^H X)^{−1} X^H y when n > p and X is of full rank (i.e., rank p). However, all elements of the OLS estimate will generally be non-zero. Furthermore, in the case that p > n, the problem is ill-posed and does not have a unique solution. In order to pick a sparse solution, variable selection is often applied for picking the best subset of predictors that provides a good fit and uses a minimal number of variables.
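The following short sketch contrasts the two regimes mentioned above: when n > p the OLS closed form applies, while for p > n the normal-equations matrix X^H X is rank-deficient, so the minimizer is not unique (the sizes and the random data are purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

# Well-posed case (n > p): the closed-form OLS estimate exists
n, p = 100, 10
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
beta_ols = np.linalg.solve(X.conj().T @ X, X.conj().T @ y)   # (X^H X)^{-1} X^H y

# Ill-posed case (p > n): X^H X is rank-deficient, so OLS is not unique
n, p = 10, 100
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
print(np.linalg.matrix_rank(X.conj().T @ X))                 # at most n < p
```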

Traditionally, the variables were picked using subset selection approaches. In the first approach, the best subset of K variables is chosen that fits the data best amongst all possible p!/(K!(p − K)!) combinations of K predictors. However, as K increases, the number of combinations quickly becomes far too large. The second approach is to use stagewise subset selection procedures: in backward elimination, one starts from the full model (i.e., the model where all predictors are included) and then removes the least impactful predictor at each step until only K predictors are left; in forward selection, one starts with a null model (no predictors) and sequentially adds the best predictor until K predictors have been chosen. An alternative iterative method for subset selection of predictive variables is forward stagewise, which is less greedy but computationally expensive since it takes several small steps to find the final model. See [27] and [1] for more details.

Existing methods (or solvers) for SSR can be placed in two main groups:

1. Greedy algorithms that iteratively construct a K-sparse estimate. Basically, a greedy algorithm picks a new variable j at each step that maximally reduces the error, e.g., ‖y − X_{A∪{j}} β_{A∪{j}}‖_2², in approximating y using predictors from the current active set A. Finally, it terminates when K predictors have been chosen. The well-known greedy algorithms are orthogonal matching pursuit (OMP) [28, 29], regularized OMP [30], iterative hard-thresholding [31], compressive sampling matching pursuit (CoSaMP) [32], and their Bayesian counterparts [33, 34, 35].

2. Optimization methods that solve penalized objective functions that are based on convex (e.g., the ℓ_q norm for q ≥ 1) or non-convex (e.g., the ℓ_q pseudo-norm for 0 ≤ q < 1) penalty terms. In the case of the ℓ0-penalty, homotopy based heuristic search algorithms are proposed in [36], which can be used for variable selection. Main optimization methods for solving ℓ1-minimization problems include homotopy algorithms for solution path finding [37, 7, 38], the gradient projection method for solving a bound constrained quadratic programming reformulation of basis pursuit denoising (BPDN) [39], and iterative shrinkage-thresholding based methods which consist of a shrinkage (e.g., soft-thresholding) step and a gradient step [40, 41, 42, 43, 44, 45]. Moreover, a cyclic minimization algorithm was proposed in [46] for the ℓ1-minimization based generalized time-updating q-norm sparse iterative covariance-based estimator (SPICE), whereas group-SPICE is solved in [47] for group-sparse regression problems.

Greedy algorithms generally work well when the solution is sufficiently sparse and the columns of X are sufficiently incoherent, which is not the case in many applications, e.g., in the compressed beamforming studied in this dissertation. In this dissertation, we develop methods that belong to the latter category (optimization methods). Moreover, sparse penalized regression solvers in the complex domain are proposed. Specifically, we propose the complex-valued CCD and LARS methods for solving the EN. A benefit of our approach is that the methods also work for real-valued data.
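For concreteness, a minimal complex-valued orthogonal matching pursuit (OMP) routine is sketched below. It follows the generic greedy template described above (pick the predictor most correlated with the residual, refit by least squares, update the residual); it is only a textbook illustration of the first category and is not one of the solvers developed in this dissertation.

```python
import numpy as np

def omp(y, X, K):
    """Greedy K-sparse estimate: at each step add the predictor most
    correlated with the current residual, then refit by least squares."""
    n, p = X.shape
    support, residual = [], y.copy()
    beta = np.zeros(p, dtype=complex)
    for _ in range(K):
        scores = np.abs(X.conj().T @ residual)     # |<x_j, r>| for all j
        scores[support] = -np.inf                   # never re-select active predictors
        support.append(int(np.argmax(scores)))
        beta_S, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta_S
    beta[support] = beta_S
    return beta, sorted(support)
```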

2.2.1 Regularized least squares

Greedy algorithms have some drawbacks such as intensive computation, difficulties in deriving sampling properties of the estimators, and instability [48]. These shortcomings can often be alleviated by optimization methods that use regularization techniques. Penalized regression uses regularization penalties to constrain the model fitting (or estimation) process, e.g., regularized least squares (RLS) (cf. [4, 5, 18, 49, 50, 51]). RLS-based constrained optimization problems jointly minimize the RSS and a regularization term by solving

$$ \underset{\boldsymbol{\beta} \in \mathbb{C}^p}{\text{minimize}} \;\; E(\boldsymbol{\beta}) \quad \text{subject to} \quad J(\boldsymbol{\beta}) \leq t, \qquad (2.3) $$

where the empirical error E(β) = (1/2)‖y − Xβ‖_2² is the sum of squared residuals and J(β) ≥ 0 is a non-negative penalty function whose strength is regulated by a threshold parameter t ≥ 0. Different kinds of penalty functions are used in the literature: ridge regression, also known as Tikhonov regularization [49], uses J(β) = ‖β‖_2² and has the bridge estimator [50] as its extension; the Lasso estimator [4] uses J(β) = ‖β‖_1; further examples are the smoothly clipped absolute deviation (SCAD) estimator [51] and OSCAR (octagonal shrinkage and clustering algorithm for regression), which involves a non-smooth and non-separable penalty and aims to perform clustering of selected variables [52]. Moreover, there exist other extensions of the Lasso, for example, the group Lasso [53], the sparse-group Lasso [54], the EN estimator [5], the weighted version of the Lasso known as the adaptive Lasso estimator [18], and the reweighted ℓ1 minimization based algorithm [55].
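Among the penalties listed above, the ridge penalty is the one with a closed-form minimizer, which makes it a convenient reference point. The small sketch below (an illustration using one common Lagrangian weighting of the penalty, not a statement about the exact convention used in the cited works) computes the regularized least-squares estimate with J(β) = ‖β‖_2².

```python
import numpy as np

def ridge(y, X, lam):
    """Closed-form ridge (Tikhonov) estimate minimizing
    0.5*||y - X beta||_2^2 + 0.5*lam*||beta||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.conj().T @ X + lam * np.eye(p), X.conj().T @ y)
```

Unlike the Lasso or EN fits discussed next, every entry of this estimate is generically non-zero, which is why ℓ1-type penalties are needed for variable selection.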

2.2.2 The bias-variance tradeoff

Notice that t = 0 in (2.3) results in the maximum constraint (i.e., maximal bias) with solution β̂(t) = 0, whereas t ≥ J(β̂) involves no constraint (i.e., maximal variance), where β̂ denotes the OLS estimate. The smaller the threshold t, the more sparse is the obtained solution β̂(t) of (2.3). Moreover, t controls the bias-variance tradeoff, which is measured in terms of the MSE given as MSE(β̂(t)) = E[‖β̂(t) − β‖_2²]. The MSE is related to the expected prediction error (EPE), which measures the quality of prediction for a specific (new) observation {y0, x0}, that is, EPE(β̂) = E[|y0 − β̂(t)^H x0|²]. It can be expressed as

$$ \mathrm{EPE} = \underbrace{(\mathrm{bias})^2 + (\mathrm{variance})}_{\text{reducible error}} + \underbrace{(\text{noise variance})}_{\text{irreducible error}}. \qquad (2.4) $$

See [1] and Appendix A.1 for details. Ideally, one would choose a regularization parameter t that minimizes the MSE (and hence the EPE). This often means introducing some bias in order to reduce the variance of the estimator.
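For completeness, a compact sketch of the standard decomposition behind (2.4) (the full derivation is in Appendix A.1) is the following, assuming y0 = β^H x0 + ε0 with ε0 zero-mean, of variance σ², and independent of the training data used to form β̂(t):

$$
\begin{aligned}
\mathrm{EPE}\big(\hat{\boldsymbol{\beta}}(t)\big) &= \mathbb{E}\big[\,|y_0 - \hat{\boldsymbol{\beta}}(t)^{\mathsf H}\mathbf{x}_0|^2\,\big] \\
&= \underbrace{\big|\big(\boldsymbol{\beta} - \mathbb{E}[\hat{\boldsymbol{\beta}}(t)]\big)^{\mathsf H}\mathbf{x}_0\big|^2}_{(\mathrm{bias})^2}
+ \underbrace{\mathbb{E}\big[\,\big|\big(\hat{\boldsymbol{\beta}}(t) - \mathbb{E}[\hat{\boldsymbol{\beta}}(t)]\big)^{\mathsf H}\mathbf{x}_0\big|^2\,\big]}_{\mathrm{variance}}
+ \underbrace{\sigma^2}_{\text{noise variance}},
\end{aligned}
$$

where the cross terms vanish because ε0 and β̂(t) − E[β̂(t)] are zero-mean and mutually independent.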

A sparse regression estimator introduces bias but reduces the variance. The Lasso is the most commonly used sparse regression estimator, but so far most of the research has focused on real-valued data. Herein, we first solve the Lasso problem for complex-valued data and thereafter extend it to the EN estimator (having the Lasso as a special case). The EN can consistently select the true model even when the Lasso cannot [56, 57], and it has the ability of selecting grouped variables (cf. Figure 1 in Publication P.IV [58]).


2.3 Complex-valued Lasso and Elastic Net

It is well known that penalizing (2.2) with the ℓ0-norm for finding the sparsest solution of an underdetermined system is NP (nondeterministic polynomial time) hard. Therefore, in the presence of noise, the convex relaxation-based RLS problem with an ℓ1-norm penalty function J(β) = ‖β‖_1 was introduced as a practical and tractable alternative to promote sparsity. This is known as the Lasso in statistics and is equivalent to the signal processing BPDN problem, which is the noise-tolerant version of basis pursuit [59, 60]. Moreover, the work in [61] has shown that the generalized Lasso problem can be cast as a Lasso problem when the regularization matrix is under-determined with full rank conditions.

For the EN, the RLS-based constrained optimization problem in (2.3) can also be equivalently written in the Lagrangian form

$$ \hat{\boldsymbol{\beta}}(\lambda, \alpha) = \underset{\boldsymbol{\beta} \in \mathbb{C}^p}{\arg\min} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda J_{\alpha}(\boldsymbol{\beta}) \qquad (2.5) $$

based on duality and the Karush-Kuhn-Tucker (KKT) conditions, where

$$ J_{\alpha}(\boldsymbol{\beta}) = \sum_{j=1}^{p} \Big( \alpha |\beta_j| + \frac{(1-\alpha)}{2} |\beta_j|^2 \Big) $$

is the EN penalty (or constraint) function with α ∈ [0, 1] being the EN tuning parameter, and λ is the penalty (or regularization) parameter with a 1-to-1 relationship with the threshold parameter t. The penalty function then dominates the minimization problem as λ increases. When λ → ∞, all the involved coefficients (i.e., the β-s) flatten out towards zero. For 0 < λ < ∞, the EN contains the Lasso and ridge regression as its special cases for α = 1 and α = 0, respectively. Furthermore, the OLS estimate is obtained when λ = 0.

The Lasso can offer a sparse solution but ridge regression cannot. However, ridge regression has some benefits over the Lasso when there exist high correlations between variables. For example, if there exist two correlated variables (predictors), then ridge regression has a tendency to pick both by shrinking their coefficients towards one another, whereas the Lasso generally picks only one of them. The EN is a compromise between the Lasso and ridge regression that attempts to shrink and perform a sparse (and when needed a group) variable selection simultaneously (cf. Figure 1 and related discussions in Publication P.IV [58]). It has plenty of theoretical guarantees, e.g., consistency, sparsistency (i.e., sparse + consistency) and group selection [5, 56, 57]. In the presence of correlations between predictors, the EN has a tendency to correctly select the true variables, thanks to its group selection property (cf. Lemma 2 and Theorem 1 in [5]). The EN is preferred over the Lasso and ridge regression because it offers flexibility and overcomes the limitations of both.
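A small helper that evaluates the EN objective in (2.5) makes the role of the tuning pair (λ, α) explicit: α = 1 gives the Lasso cost and α = 0 the ridge cost. This snippet is an illustration only and is not part of the published toolbox.

```python
import numpy as np

def en_objective(y, X, beta, lam, alpha):
    """Elastic net cost 0.5*||y - X beta||_2^2 + lam * J_alpha(beta), eq. (2.5)."""
    rss = 0.5 * np.linalg.norm(y - X @ beta) ** 2
    mod = np.abs(beta)
    penalty = np.sum(alpha * mod + 0.5 * (1.0 - alpha) * mod ** 2)
    return rss + lam * penalty

# alpha = 1.0 -> Lasso objective; alpha = 0.0 -> ridge objective
```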


2.3.1 Cyclic coordinate descent in complex domain

We first proposed an easy-to-implement and efficient CCD method for complex-valued data, which finds an EN estimate for a specified penalty parameter value λ. The method is named the CCD-CEN method. It can also be used to find a K-sparse solution by using (or searching for) a suitable penalty parameter value. The key idea behind the CCD method is to successively minimize each dimension of the objective function in (2.5) in a cyclic fashion, while holding the values of the parameters in the other dimensions fixed. Both parts of the EN cost function in (2.5) (i.e., the RSS criterion and the EN penalty) are convex. Additionally, the RSS criterion is smooth and differentiable whereas the EN penalty is non-smooth. However, the EN penalty part is separable, which makes the CCD approach applicable for solving (2.5).

When holding the other coefficients (say βk, k ≠ j) fixed, a closed-form solution of the complex-valued EN problem for the single variable βj can be found. Following the derivation for its weighted version in Publication P.I, the coordinate-wise update can be written as (cf. [62]):

$$ \hat{\beta}_j \leftarrow S\big(\hat{\beta}_j + \langle \mathbf{x}_j, \mathbf{r} \rangle;\, \gamma, \eta\big), \qquad (2.6) $$

which cycles over j = 1, 2, …, p, 1, 2, … until convergence, where r denotes the current full residual vector r = y − Xβ̂ and

$$ \gamma = \lambda \alpha \in \mathbb{R}_0^{+} \quad \text{and} \quad \eta = \lambda(1 - \alpha) \in \mathbb{R}_0^{+}. \qquad (2.7) $$

Above in (2.6), the shrinkage-soft-thresholding (SST) operator of the EN is

$$ S(\beta; \gamma, \eta) = \frac{\mathrm{soft}_{\gamma}(\beta)}{1 + \eta}, \quad \beta \in \mathbb{C}, \qquad (2.8) $$

which expands the soft-thresholding operator soft_γ(β) = sign(β)(|β| − γ)_+ via ridge-style shrinking, where (t)_+ = max(t, 0) denotes the positive part of t ∈ R, and

$$ \mathrm{sign}(\beta) = \begin{cases} \dfrac{\beta}{|\beta|}, & \text{for } \beta \neq 0 \\[4pt] 0, & \text{for } \beta = 0 \end{cases} $$

is the complex signum function. The EN solution β̂ = β̂(λ, α) suffers from a double shrinkage problem (cf. [5]), which is corrected by debiasing β̂ as:

$$ \hat{\boldsymbol{\beta}} \leftarrow \big[\, 1 + \eta \,\big] \odot \hat{\boldsymbol{\beta}}, \qquad (2.9) $$

where ⊙ denotes the Hadamard product, i.e., the component-wise product of two vectors. The procedure for the CCD-CEN method is given in Algorithm 1, which assumes that the data is already centered and standardized. See Publication P.I [62] for more details.


Algorithm 1: CCD-CEN algorithm¹

Input: y ∈ C^n, X ∈ C^{n×p}, λ (or K), α and β_init
Output: EN solution β̂ = β̂(λ, α) ∈ C^p
Initialize: β̂ = β_init, γ = λα and η = λ(1 − α)

1  while stopping criteria not met do
2      for j = 1 to p do
3          r ← y − Xβ̂                          // current full residual vector
4          β̂_j ← S(β̂_j + ⟨x_j, r⟩; γ, η)         // coordinate-wise update
5  β̂ ← [1 + η] ⊙ β̂                             // shrinkage correction step

¹ It is part of the MATLAB toolbox available at https://github.com/mntabassm/SAEN-LARS.
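A plain Python/numpy sketch of the loop in Algorithm 1 is given below. It is only an illustrative transcription of the steps (2.6)-(2.9) under the same assumptions (centered data, unit-norm columns); the published implementation is the MATLAB toolbox linked in the footnote, and the function and variable names here are ours.

```python
import numpy as np

def soft_threshold(b, gamma):
    """Complex soft-thresholding sign(b) * (|b| - gamma)_+ with sign(b) = b/|b|."""
    mod = abs(b)
    if mod <= gamma:
        return 0.0 + 0.0j
    return (b / mod) * (mod - gamma)

def ccd_cen(y, X, lam, alpha=1.0, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent sketch for the complex-valued EN (CCD-CEN)."""
    n, p = X.shape
    gamma, eta = lam * alpha, lam * (1.0 - alpha)        # eq. (2.7)
    beta = np.zeros(p, dtype=complex)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r = y - X @ beta                              # current full residual
            z = beta[j] + np.vdot(X[:, j], r)             # beta_j + <x_j, r>
            beta[j] = soft_threshold(z, gamma) / (1.0 + eta)   # SST update, eqs. (2.6), (2.8)
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return (1.0 + eta) * beta                             # debiasing step, eq. (2.9)
```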

The CCD-CEN needs a proper value of the penalty parameter λ in order that the obtained solution β̂(λ, α) has the intended sparsity level. These λ-values are not fixed but depend on the given predictor-response data {y, X}. Hence, it is useful to devise a data-dependent approach that finds the exact λ-values (called knots) in the solution path. Such a (homotopy) approach is presented in Section 2.4.

2.4 Complex-valued LARS method for Lasso and EN

In this section, we consider the solution β̂(λ, α) to (2.5) from the perspective of solving for the λ-values where the sparsity order changes.

Definition 2.1. For each λ (or t), we obtain an EN solution and a set of λ's, which trace out a path of solutions. The regularization (or solution) path is the graph of [β̂(λ, α)]_j as a function of λ for each j = 1, …, p.

Definition 2.2. A knot is the value of the penalty parameter λ after which a new variable (whose coefficient is about to grow from zero) enters the model or leaves the model, that is, the number of elements in the selected model (i.e., the sparsity level) changes. Knots are denoted by {λ0, λ1, …, λL} where λ0 > λ1 > ⋯ > λL.

For real-valued data, knots are singular points in the Lasso solution path, which is continuous and piecewise linear with respect to λ (cf. [7, 63, 64, 65]).

Grid searching is one obvious method for finding the knots. It is, however, a computationally expensive approach and would require a fine grid of candidate values. Consequently, a pathwise (PW) method based on the LARS principle, which finds the knot-values in the EN solution path, is developed in this dissertation. This method is referred to as c-PW-EN and uses our proposed complex-valued extension of the real-valued LARS method for the Lasso (i.e., c-LARS-Lasso) as its core computational engine. Both of the aforementioned proposals, along with the LARS principle, are briefly explained next. We note that more general versions of these methods, which solve the weighted (or adaptive) Lasso and EN problems, are detailed in Publication P.IV [58].

2.4.1 The c-LARS-Lasso method

Let β̂(λ) denote the Lasso estimator, i.e., a solution to (2.5) for α = 1. The basic principle of LARS is to move the estimator along optimal directions (e.g., the bisector) by taking optimally-sized steps. These fit directions are equiangular (i.e., equicorrelated) with each variable in the current active set A. LARS chooses the most advantageous (correlated) dimension to extend until the sub-gradients (scores) corresponding to all existing variables in A and the one entering have the same absolute value:

$$ |\mathbf{x}_{\ell}^{\top} \mathbf{r}(\lambda)| = |\mathbf{x}_{j}^{\top} \mathbf{r}(\lambda)|. $$

Next we discuss the complex-valued extension, c-LARS-Lasso, developed in this dissertation. First consider the Lasso estimate β̂(λ) in (2.5) for some fixed value of the penalty parameter λ. The smallest value of λ such that all coefficients of the Lasso solution are zero, i.e., β̂(λ0) = 0_p, is λ0 = max_j |⟨x_j, y⟩|. Let A = supp{β̂(λ)} denote the active set at the regularization parameter value λ < λ0. At each step in the regularization path, the solution β̂(λ) needs to verify the generalized KKT conditions for all the predictors included in the active set A. That is, β̂(λ) is a solution to (2.5) if and only if it verifies the zero sub-gradient equations given by

$$ \langle \mathbf{x}_j, \mathbf{r}(\lambda) \rangle = \lambda\, s_j \quad \text{for } j = 1, \ldots, p, \qquad (2.10) $$

where s_j ∈ sign{β̂_j(λ)}, meaning that s_j = β̂_j(λ)/|β̂_j(λ)| if β̂_j(λ) ≠ 0, and some number inside the unit circle, s_j ∈ {x ∈ C : |x| ≤ 1}, otherwise.

An active set at a knot λ_k is denoted by

$$ \mathcal{A}_k = \mathrm{supp}\{\hat{\boldsymbol{\beta}}(\lambda_k)\}. $$

The active set A1 thus contains a single index, A1 = {j1}, where j1 is the predictor that becomes active first, i.e., j1 = argmax_{j ∈ {1,…,p}} |⟨x_j, y⟩|. By definition of the knots, one has that A_k = supp{β̂(λ)} for all λ ∈ [λ_k, λ_{k−1}) and A_k ≠ A_{k+1} for all k = 1, …, L. At step k, notice that the condition in (2.10) becomes |⟨x_j, r(λ)⟩| = λ for the active set A_k, whereas |⟨x_j, r(λ)⟩| < λ holds for the non-active predictors.

The c-LARS-Lasso, outlined in algorithm 2, is a generalization of the LARS-Lasso algorithm to the complex-valued case. We refer the reader to Section III of Publication P.IV [58] for more details on the steps of algorithm 2 using unit weights. If the desired sparsity level K is not provided, we compute the solution path for min(n − 1, p) steps.


Algorithm 2: c-LARS-Lasso algorithm¹
Input: y ∈ Cn, X ∈ Cn×p, sparsity level K ≤ min{n − 1, p}.
Output: {Ak, λk, β(λk)}, k = 0, . . . , K.
Initialize: β(λ0) = Δ = 0p, A0 = ∅ and the residual r0 = y.

1  λ0 = max_{j=1,...,p} |〈xj, r0〉| and A1 = argmax_{j=1,...,p} |〈xj, r0〉|.
2  for k = 1, . . . , K do
3    Estimate the least-squares direction δ using the active set Ak, giving [Δ]_{Ak} = δ, as
       δ = (1/λk−1) (X_{Ak}^H X_{Ak})^−1 X_{Ak}^H rk−1.
4    For all non-active predictors (i.e., ℓ ∉ Ak), calculate the correlations bℓ = 〈xℓ, X_{Ak} δ〉 and cℓ = 〈xℓ, rk−1〉 with the current fit direction and the residual, respectively.
5    Find γℓ from the roots {γ_{ℓ1}, γ_{ℓ2}} of the second-order polynomial equation
       (|bℓ|² − 1) γℓ² + 2(λk−1 − Re(cℓ bℓ*)) γℓ + (|cℓ|² − λk−1²) = 0,  for ℓ ∉ Ak,
     as
       γℓ = min(γ_{ℓ1}, γ_{ℓ2})      if γ_{ℓ1} > 0 and γ_{ℓ2} > 0,
       γℓ = {max(γ_{ℓ1}, γ_{ℓ2})}₊   otherwise.
6    The smallest non-negative step size γ_{jk+1} = min_{ℓ∉Ak} γℓ, with index jk+1 = argmin_{ℓ∉Ak} γℓ, leads to Ak+1 = Ak ∪ {jk+1}.
7    Update the coefficient values at the new knot λk = λk−1 − γ_{jk+1} (i.e., the largest λ-value for λk−1 ≥ λ > 0) that verifies (2.10):
       β(λk) = β(λk−1) + γ_{jk+1} Δ,
     and set rk = y − Xβ(λk).

¹It is part of the MATLAB toolbox available at https://github.com/mntabassm/SAEN-LARS.

It should be noted that the algorithm assumes that the found K knots {λ1, . . . , λK} are such that a variable does not leave the active set. In the real-valued case, it is possible to include an additional step (cf. [17, p. 120]) that handles the case when an active variable leaves the active set. Extending this to the complex-valued case is currently under development.
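To make the recursion concrete, the following minimal NumPy sketch traces the knot updates of Algorithm 2 for complex data. It is only an illustration of the equations above (the released toolbox referenced in the footnote is written in MATLAB, not Python); the function name is ours, the simplistic handling of the polynomial roots is a shortcut, and, as noted, variables leaving the active set are not handled.

import numpy as np

def c_lars_lasso_knots(y, X, K):
    """Sketch of Algorithm 2: trace the first K knots of the complex Lasso path."""
    n, p = X.shape
    beta = np.zeros(p, dtype=complex)
    r = y.copy()
    c0 = np.abs(X.conj().T @ y)
    lam = c0.max()                          # lambda_0
    A = [int(c0.argmax())]                  # A_1
    knots, betas = [lam], [beta.copy()]
    for _ in range(K):
        XA = X[:, A]
        # step 3: least-squares (equiangular) direction on the active set
        delta = np.linalg.solve(XA.conj().T @ XA, XA.conj().T @ r) / lam
        fit = XA @ delta
        gamma_best, j_best = np.inf, None
        for l in set(range(p)) - set(A):    # steps 4-6: smallest positive step size
            b = np.vdot(X[:, l], fit)       # <x_l, X_A delta>
            c = np.vdot(X[:, l], r)         # <x_l, r_{k-1}>
            roots = np.roots([abs(b)**2 - 1,
                              2 * (lam - (c * np.conj(b)).real),
                              abs(c)**2 - lam**2]).real   # real roots assumed
            pos = roots[roots > 0]
            g = pos.min() if pos.size else 0.0
            if g < gamma_best:
                gamma_best, j_best = g, l
        lam = lam - gamma_best              # step 7: new knot lambda_k
        beta = beta.copy()
        beta[A] += gamma_best * delta
        r = y - X @ beta
        A.append(j_best)
        knots.append(lam)
        betas.append(beta)
    return knots, betas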

2.4.2 The c-PW-EN method

For a fixed α, the first knot λ0(α) = max_{j=1,...,p} |〈xj, y〉|/α is the smallest value of λ that results in the all-zeros EN estimate β(λ, α) = 0. In this dissertation, we construct a pathwise method for finding the next knots, λ1(α) > λ2(α) > · · · > λK(α), in the EN regularization path, which we refer to as the c-PW-EN algorithm. Since α is fixed, we drop the dependency of the penalty parameter on α and simply write λ or λk instead of λ(α) or


λk(α). Yet it is important to keep in mind that the values of the knots are different for each given α. c-PW-EN finds the knots for different values of the EN tuning parameter α in a grid

[α] = {αi ∈ (0, 1] : α1 = 1 > α2 > · · · > αm > 0}. (2.11)

To accomplish this, it uses the c-LARS-Lasso algorithm in an iterative manner. To explain the idea, we first convert the EN problem to an equivalent Lasso-type problem in an augmented (data) space. Namely, we can write the EN objective function from (2.5) in an augmented form as follows:

(1/2)‖y − Xβ‖₂² + λ Jα(β) = (1/2)‖ya − Xa(η)β‖₂² + γ‖β‖₁, (2.12)

where (γ, η) is the new parameterization of (α, λ) defined in (2.7), and

ya = ( y  )        Xa(η) = (   X    )
     ( 0p )                ( √η Ip  )

are the augmented (data) forms of the original data. Note that (2.12) resembles the Lasso objective function with ya ∈ C^{n+p} and Xa(η) ∈ C^{(n+p)×p}. See Publications P.II and P.IV for more details.
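As a quick illustration of the augmented-data trick in (2.12), the sketch below forms ya and Xa(η) and, in comments, shows how one inner step of the pathwise method can reuse the c_lars_lasso_knots sketch from Section 2.4.1. The helper names are ours and not the toolbox API.

import numpy as np

def augment_for_en(y, X, eta):
    """Form the augmented data of (2.12): y_a = [y; 0_p], X_a = [X; sqrt(eta) I_p]."""
    n, p = X.shape
    y_a = np.concatenate([y, np.zeros(p, dtype=complex)])
    X_a = np.vstack([X, np.sqrt(eta) * np.eye(p)])
    return y_a, X_a

# One inner step of Algorithm 3 (steps 4-6), assuming c_lars_lasso_knots from above:
#   eta_k    = lam_k_prev_alpha * (1 - alpha_i)
#   y_a, X_a = augment_for_en(y, X, eta_k)
#   gamma_k  = c_lars_lasso_knots(y_a, X_a, k)[0][k]   # k-th knot of the augmented Lasso
#   lam_k    = gamma_k / alpha_i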

Our c-PW-EN method is given in algorithm 3. The algorithm finds K-sparse EN solutions β(λK, α) for a set of EN tuning parameter values α in the pre-determined grid [α] in (2.11). It then picks the best among them by comparing the value of the LS fit obtained when only the predictors in the support set AK(α) are included in the model. The best EN tuning parameter value α is then the one on the grid that yields the best LS fit. This value and the respective K-sparse EN solution are then chosen as the optimal K-sparse solution. We note that in step 5 of algorithm 3, the notation

{λk, β(λk)} = c-LARS-Lasso(y, X)|k

is used to denote that we are extracting the k-th knot (and the corresponding solution) from the sequence of knot-solution pairs found by the c-LARS-Lasso algorithm.

Computations are moderate since we no longer need an exhaustive grid search over the λ penalty parameter space for a specified sparsity level K. If one has some prior knowledge of the underlying model, then the α-grid can be selected accordingly. By default we use a grid from α1 = 1 to αm = 0.9.

The c-PW-EN method provides better estimates of the coefficients for any given sparsity level K than the Lasso alone, thanks to its ability to find the border values of λ (i.e., the knots) after which the entering predictor's coefficient becomes non-zero, and its ability to compute the solutions for a set of α values in the grid.


Algorithm 3: c-PW-EN algorithm¹
input: y ∈ Cn, X ∈ Cn×p, [α] ∈ Rm (recall α1 = 1), K, DEBIAS.
output: β ∈ Cp and A*, estimates of β and A*.

1   {λk(α1), β(λk, α1)}, k = 0, . . . , K  =  c-LARS-Lasso(y, X, K)
2   for i = 2 to m do
3     for k = 1 to K do
4       ηk = λk(αi−1) · (1 − αi)
5       {γk, β(λk, αi)} = c-LARS-Lasso(ya, Xa(ηk))|k
6       λk(αi) = γk/αi
7     AK = supp{β(λK, αi)}
8     βLS(λK, αi) = X⁺_{AK} y
9     RSS(αi) = ‖y − X_{AK} βLS(λK, αi)‖₂²
10  ℓ = argmin_i RSS(αi)
11  β = β(λK, αℓ) and A* = supp{β}
12  if DEBIAS then β_{AK} = X⁺_{AK} y

¹It is part of the MATLAB toolbox available at https://github.com/mntabassm/SAEN-LARS.

2.5 Numerical example

Next we illustrate the computation of the solution paths using c-PW-EN of algorithm 3 (coupled with algorithm 2). Moreover, the EN regularization path by c-PW-EN is compared with the coefficient path computed using CCD-CEN of algorithm 1 over a dense grid of λ-values.

We generated the noisy response as y ∼ CNn(Xβ, σ²I), where the number of observations is n = 20, the sparsity level is K = ‖β‖0 = 5 with support A* = supp(β) = {1, 2, . . . , 5}, and the total number of predictors is p = 30. The noise variance is σ² = σs² × 10^(−SNRdB/10), where σs² = (1/K)(|β1|² + · · · + |βK|²), and the signal-to-noise ratio (SNR) in decibels (dB) is 10 dB. The row vectors of the predictor matrix X are generated randomly from a p-variate multivariate (circular) complex normal distribution, x[i] ∼ CNp(0, Σ), i = 1, . . . , n, where Σ is a block-diagonal matrix consisting of three (d × d) sub-matrices Σ(ρ)d = (1 − ρ) Id + ρ 1d 1d^⊤ having different correlations ρ, as illustrated in Figure 2.1. Note that the first two predictors are highly correlated.
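This simulation setup can be reproduced with a short NumPy sketch. The block-diagonal Σ, the circular complex-Gaussian rows and the SNR-matched noise variance follow the description above; the random seed, helper names and the unit-magnitude values of the nonzero coefficients are our own placeholders, since the text does not specify the latter.

import numpy as np

rng = np.random.default_rng(0)
n, p, K, snr_db = 20, 30, 5, 10

def block(rho, d):
    # equicorrelation block Sigma(rho)_d = (1 - rho) I_d + rho 1_d 1_d^T
    return (1 - rho) * np.eye(d) + rho * np.ones((d, d))

Sigma = np.block([
    [block(0.9, 2),      np.zeros((2, 2)),  np.zeros((2, 26))],
    [np.zeros((2, 2)),   block(0.6, 2),     np.zeros((2, 26))],
    [np.zeros((26, 2)),  np.zeros((26, 2)), block(0.0, 26)],
])

# circular complex Gaussian rows x[i] ~ CN_p(0, Sigma)
L = np.linalg.cholesky(Sigma)
Z = (rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))) / np.sqrt(2)
X = Z @ L.T

beta = np.zeros(p, dtype=complex)
beta[:K] = 1.0                               # K-sparse signal on support {1,...,5}
sigma2_s = np.mean(np.abs(beta[:K]) ** 2)
sigma2 = sigma2_s * 10 ** (-snr_db / 10)     # noise variance giving 10 dB SNR
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
y = X @ beta + noise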

Figure 2.2 displays the EN (with α = 0.9) regularization path computed by the CCD-CEN algorithm using a dense grid [λ] that includes the knot values found by c-PW-EN in Figure 2.3a. Figure 2.3 depicts the solution paths of the EN (α = 0.9), Lasso (α = 1) and Ridge (α = 0) for recovering the sparse signal β, where the c-PW-EN algorithm, outlined in algorithm 3, is used to compute the first 10 knots, {λk}, k = 0, . . . , 9, and the solutions β(λk, α) at the knots in the case of the EN with α = 0.9 and the Lasso (α = 1).


Σ = [ Σ(0.9)2     0         0
        0      Σ(0.6)2      0
        0         0      Σ(0)26 ]


Figure 2.1. Right panel shows the empirical covariance structure of the standardized data matrix X ∈ C20×30, where the observations (row vectors) are generated from a multivariate normal distribution with the block-diagonal 30 × 30 covariance matrix Σ shown on the left panel.


Figure 2.2. Results for a complex-valued sparse linear model with correlated predictors (p = 30, K = 5, A* = {1, 2, . . . , 5}). The figure displays the EN regularization paths computed by the CCD-CEN algorithm using a dense grid [λ] of penalty parameter values.

Firstly, we may now compare the regularization path in Figure 2.2 to that found by the c-PW-EN algorithm and displayed in Figure 2.3a. When comparing Figure 2.3a to Figure 2.2, we notice that the found EN regularization paths are indistinguishable and that the c-PW-EN algorithm has found the knot values with high precision. Secondly, notice that the regularization path of the EN demonstrates its group selection ability: the coefficients corresponding to both of the correlated subsets of variables, namely {A*[1], A*[2]} and {A*[3], A*[4]}, shrink towards zero as a group as λ increases (note that λ0 > λ9), whereas the Lasso fails to do so. This example illustrates that the EN is clearly favourable to the Lasso in the sparse recovery of correlated predictors. Figure 2.3c displays the solution paths of Ridge (α = 0) regression at 10 grid points of λ. As can be seen, Ridge regression fails to correctly estimate the true coefficients, and none of the signal vector estimates on its solution path are sparse, i.e., ridge regression only shrinks the coefficients, but none of the coefficients are set to zero.


[Figure 2.3: coefficient paths as a function of λ for (a) EN (α = 0.9), (b) Lasso (α = 1), and (c) Ridge regression (α = 0).]

Figure 2.3. Results for a complex-valued sparse linear model with correlated predictors (p = 30, K = 5, A* = {1, 2, . . . , 5}). We used the c-PW-EN algorithm to compute the first 10 knots and the solutions at the knots for (a) the EN with α = 0.9 and (b) the Lasso (α = 1). Panel (c) displays the solution path of Ridge (α = 0) regression at 10 grid points of λ. Notice that the Lasso lacks grouping of correlated variables and Ridge does not yield sparse solutions.


3. Sparse Signal Recovery: Detection

In chapter 2, we assumed that the sparsity order K is known a priori, which is rarely the case in practice. In this chapter, the focus is on detecting the sparsity order. We first present methods based on statistical hypothesis testing using the knots of the solution paths described in chapter 2, and then methods based on information criteria. These methods find both the sparsity order and the respective estimate of the underlying sparse signal.

3.1 Overview

Typically, most SSR (e.g., greedy pursuit [29]) methods presume that K is known or that an estimate of it is available. In practice, however, it is often unknown and can be dynamically varying, e.g., as new measurements (or snapshots) become available. Furthermore, the correct sparsity order is also required for verifying theoretical sparse signal recovery properties such as the null space property and the restricted isometry property [66]. The performance of greedy pursuit methods relies heavily on the accuracy of detecting the true sparsity order. Sparse linear regression methods (such as the Lasso or EN) can perform both tasks simultaneously but are computationally expensive. For example, the Lasso uses a data-dependent penalty parameter λ, commonly chosen by cross-validation (CV), to compute an estimate β of the sparse vector β. An estimate of the sparsity order is then obtained as K = ‖β‖0.

In the context of the detection task, the existing research can be divided into two categories. The first category consists of iterative sparse recovery algorithms that use certain stopping criteria (e.g., cf. [67, 68, 69]). These methods are often computationally inefficient, requiring a large number of iterations, and can have convergence and consistency issues. The second category involves statistically driven approaches to find the sparsity order (or to perform hyperparameter selection), such as the minimum description length (MDL) principle [70], sequential generalized likelihood ratio tests (GLRT) [71], information criteria [8, 72, 73, 74, 75], and hypothesis testing


[76, 77]. Some other existing methods in the literature can be found, e.g., in [78, 79, 80, 81, 82, 83, 84, 85].

In this dissertation, we exploit the EN regularization paths and propose two novel methods:

(a) a statistical significance test based on EN

(b) using information criteria with a nested set of hypotheses.

The methods utilize the c-PW-EN algorithm displayed in algorithm 3, which finds the knots λk in the EN regularization path and the respective solutions β(λk, α) at the knots. Recall that the knots are values of the penalty parameter (singular points) after which a change in the sparsity order occurs.

3.2 Covariance Test for Elastic net

We assume that the data (y, X) follows the linear model described in section 2.1 and that the underlying signal vector β is K-sparse, i.e., Assumption 2.1 holds. Consider a test that evaluates the statistical significance of the model under the null hypothesis H0. A test statistic T = T(y, X) measures the compatibility of the data with what is expected under the null hypothesis. Let FT(t) = Pr(T ≤ t | H0) denote the cumulative distribution function (CDF) of the test statistic under the null hypothesis. Then, given the actual observed (measured) data (y, X), let t denote the corresponding observed value of the test statistic. A P-value, or observed significance level, gives the probability of observing a value of the test statistic at least as extreme as the one observed, i.e.,

P = Pr(T > t|H0) = 1− FT (t).

Such a test statistic can be constructed to test the statistical significance of the model found at the knots λk(α) in the EN regularization path. For this, the following assumption needs to be made.

Assumption 3.1. Let E(α) = [supp{β(λK, α)} = A*] denote the event that the true non-zero coefficients (i.e., the true support A*) are active at the first K knot points of the EN solution path. Recall that the knots depend on y and X.

Given that event E(α) holds true, the active set at a knot k ≥ K + 1 depends also on inactive variables. The distribution of the significance test proposed below is thus conditioned on event E(α). In this dissertation, we extend the covariance test developed in [9] for the Lasso to the EN for both real-valued and complex-valued data.


For real-valued data, the covariance test that we developed in Publication P.II tests the statistical significance of the entering predictor at each knot in the EN regularization path for a given (fixed) EN tuning parameter α. The proposed covariance test statistic is defined as

Tk(α) = ((1 + ηk+1)/σ²) ( 〈y, Xβ(λk+1, α)〉 − 〈y, X_{Ak} β_{Ak}(λk+1, α)〉 ), (3.1)

where σ² is the noise variance, λk+1 ≡ λk+1(α) denotes the (k+1)-th knot in the EN solution path for fixed α, and ηk+1 = λk+1 · (1 − α). Above, β_{Ak}(λk+1, α) denotes the EN estimate that is computed using only the active predictors in Ak ≡ Ak(α) and the knot value λk+1 as the penalty parameter. The proposed test statistic Tk(α) is an extension of the covariance test (CT) statistic for the Lasso [9] and reduces to it when α = 1.
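A direct transcription of (3.1) for real-valued data is given below, assuming the two EN solutions at the knot λk+1 (the full one and the one restricted to Ak) have already been computed, e.g., with the pathwise sketches above; the function name and argument layout are ours.

import numpy as np

def covariance_test_stat(y, X, beta_full, beta_Ak, Ak, lam_next, alpha, sigma2):
    """Covariance test statistic T_k(alpha) of (3.1) for real-valued data."""
    eta_next = lam_next * (1 - alpha)
    fit_full = y @ (X @ beta_full)               # <y, X beta(lambda_{k+1}, alpha)>
    fit_restricted = y @ (X[:, Ak] @ beta_Ak)    # <y, X_Ak beta_Ak(lambda_{k+1}, alpha)>
    return (1 + eta_next) / sigma2 * (fit_full - fit_restricted)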

Consider a null hypothesis

H0 : Ak ⊇ A∗,

i.e., that all of the K truly active signal variables are in Ak at knot k. It was shown in [9, Sec. 8] that TK →d Exp(1) as n, p → ∞ under the assumption that the noise terms are IID Gaussian, εi ∼ N(0, σ²) for i = 1, . . . , n, and under general conditions on the predictor matrix X. This result was postulated in [9, Sec. 8], where the authors showed that the Exp(1) distribution holds true for Tk(α) when the predictor matrix is orthonormal (X^⊤ X = Ip). Our empirical results reported in Publication P.II support the postulation that the Exp(1) distribution holds true for Tk(α) in the general case as well.

It is instructive to consider the covariance test in the simplest case where the predictor matrix is orthonormal. In this case, the test statistic simplifies to a simple expression of the knots, which allows constructing several other alternative test statistics based on the knots. This special case is addressed in the next section.

3.3 Orthonormal Predictor Matrix

Assume that the design matrix is orthonormal, that is, X^H X = I and n = p. In this case the knots can be expressed in terms of uj = xj^H y for j = 1, . . . , p, or rather using the order statistics |u|(1) ≥ |u|(2) ≥ · · · ≥ |u|(p) of |u1|, |u2|, . . . , |up|.

To describe the solution of the EN problem, we need to recall the SST operator from Equation 2.8:

S(β; γ, η) = softγ(β)/(1 + η),   β ∈ C,


where γ = λα and η = λ(1 − α). In the orthonormal case, the Lasso/EN admits a closed-form solution that can be expressed in terms of the uj's as follows:

βj(λ, α) = S(uj; γ, η) = (uj − γ sign(uj))/(1 + η)  if |uj| > γ,   and   βj(λ, α) = 0  if |uj| ≤ γ. (3.2)

Furthermore, from (3.2) and γ = λα, we can observe that for any given α the knots are

λk(α) = |u|(k+1)/α,   k = 0, 1, . . . , p − 1. (3.3)

This means that the order of entrance of the variables is the same for the EN (for any α ≠ 1) as for the Lasso (α = 1). Note that this is true only in the orthogonal model.
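In the orthonormal case the whole solution path is therefore available in closed form. The sketch below evaluates (3.2) and the knots (3.3), using the complex sign u/|u| so that it covers both real- and complex-valued data; the function names and the small numerical guard are ours.

import numpy as np

def en_orthonormal(y, X, lam, alpha):
    """Closed-form EN solution (3.2) when X^H X = I."""
    u = X.conj().T @ y
    gamma, eta = lam * alpha, lam * (1 - alpha)
    sign = np.where(np.abs(u) > 0, u / np.maximum(np.abs(u), 1e-300), 0)
    shrunk = np.maximum(np.abs(u) - gamma, 0.0)     # soft-thresholding of |u_j|
    return sign * shrunk / (1 + eta)

def en_knots_orthonormal(y, X, alpha):
    """Knots (3.3): lambda_k(alpha) = |u|_(k+1) / alpha, k = 0, ..., p-1."""
    u_sorted = np.sort(np.abs(X.conj().T @ y))[::-1]   # |u|_(1) >= ... >= |u|_(p)
    return u_sorted / alpha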

It is also easy to verify that the covariance test statistic for the Lasso (α = 1) reduces to

Tk = λk(λk − λk+1)/σ² = |u|(k+1)(|u|(k+1) − |u|(k+2))/σ². (3.4)

Since the knots for the EN in (3.3) are just scaled versions of the Lasso knots, one does not need to consider the test statistic separately for the EN in the orthonormal case. Thus we simply take α = 1 and consider an alternative test based on the knots in the next section.

3.3.1 Rayleigh test

The Rayleigh test is conditioned on the event E = E(α = 1), which in the orthonormal case can be expressed as

E = { min_{j∈A*} |uj| > max_{j∉A*} |uj| }. (3.5)

We consider the null hypothesis that the K truly active signal variables are in Ak at a knot λk, i.e.,

H0 : Ak = A∗. (3.6)

The Rayleigh test (RT) statistic is

Ak = λk/σ = |u|(k+1)/σ. (3.7)

This statistic was considered in [77] under the name test-A. We first consider the global null (i.e., no predictor truly active) case, so


β = 0 and K = 0. In this case, y ∼ CNn(0, σ²I), and therefore

uj = xj^H y ∼ CN(0, σ²),   j = 1, . . . , p,

are independent random variables. Consequently, |uj|²/σ² =d (1/2)χ²₂ = Exp(1) and |uj|/σ =d Rayl(1/√2); see Appendix A.2 for more details. Thus, the set {Ak}, k ≥ K, forms the order statistics of Rayleigh(1/√2) random variables when K = 0, conditioned on the event E. In the real-valued case, however, it forms the order statistics of an independent sample of χ1 variates, since |uj|/σ =d χ1.

Theorem 3.3.1. The CDF of Ak conditional on event E is

Pr(Ak ≤ a) = (1 − exp(−a²))^(p−k)   for k ≥ K. (3.8)

Proof. Recall that |u|(K+1)/σ ≥ |u|(K+2)/σ ≥ · · · ≥ |u|(p)/σ are the order statistics of a set of IID Rayl(1/√2) random variables. Thus, for k ≥ K, the test statistic Ak = |u|(k+1)/σ is the maximum of p − k IID Rayl(1/√2) random variables and hence its CDF is obtained by

Pr(Ak ≤ a) = Pr( max(u1, . . . , up−k) ≤ a ) = Pr(u1 ≤ a) · · · Pr(up−k ≤ a) = Pr(u1 ≤ a)^(p−k),

where Pr(u1 ≤ a) = 1 − exp(−a²) is the CDF of the Rayl(1/√2) distribution and the ui denote IID random variables from Rayl(1/√2).

The distribution of the RT statistic above has two corrections relative to [77, Proposition 2]. In [77, Proposition 2] the scale parameter 1/√2 of the Rayleigh distribution was omitted and the power in (3.8) was n − k instead of p − k. Thus one may compare the test statistic Ak to a threshold τk at each successive knot starting from k = 0. Here τk is defined through PFA = Pr(Ak > τk), which using the CDF in (3.8) yields

τk = √( −ln( 1 − (1 − PFA)^(1/(p−k)) ) ), (3.9)

where PFA is the fixed (desired) probability of false alarm. One then stops when Ak ≤ τk, i.e., the first time one accepts the hypothesis of (3.6). This procedure is outlined in algorithm 4. Another alternative is to start from k = p and decrease k until the first "rejection" of the hypothesis, as in [77, Algorithm 2].



Algorithm 4: Rayleigh test (RT) for the first acceptance of the hypothesis of (3.6).
input: y, X, σ, PFA (e.g., = 0.05).
initialize: Compute λk = |u|(k+1) and τk in (3.9), for k = 0, 1, . . . , p − 1.

1  for k = 0, 1, . . . , p − 1 do
2    Ak = λk/σ
3    if Ak ≤ τk then
4      break

output: Estimate of the sparsity order K = k.
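The thresholds (3.9) and the stopping rule translate directly into a few NumPy lines; the sketch below assumes the orthonormal setting of this section and a known σ, and its function name is ours.

import numpy as np

def rayleigh_test_order(y, X, sigma, p_fa=0.05):
    """Algorithm 4: estimate the sparsity order K by the Rayleigh test."""
    p = X.shape[1]
    lam = np.sort(np.abs(X.conj().T @ y))[::-1]     # knots lambda_k = |u|_(k+1), k = 0,...,p-1
    for k in range(p):
        tau_k = np.sqrt(-np.log(1.0 - (1.0 - p_fa) ** (1.0 / (p - k))))   # threshold (3.9)
        if lam[k] / sigma <= tau_k:                 # A_k <= tau_k: accept hypothesis (3.6)
            return k
    return p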

As a simple example, we consider an orthonormal model, X^H X = I, with n = p = 30 and sparsity level K = 2. The distributions of the test statistics Ak at the knot λK = λ2 are investigated. The quantile-quantile (Q-Q) plot of A2 shown in Fig. 3.1 is obtained by generating 1000 replicates from the linear model and computing the statistic A2 for each replicate. Also shown is the CT statistic T2 computed using (3.4). As can be seen, the Q-Q plot of the CT statistic deviates more from the assumed Exp(1) distribution. This is because it is an asymptotic approximation, whereas the distribution of A2 derived in Theorem 3.3.1 holds for finite samples. The downside of the RT is that it does not generalize to the non-orthonormal case, unlike the CT statistic.


Figure 3.1. The Q-Q plots (based on 1000 MC replicates) of the RT statistic A2 computed using (3.7) and the CT statistic T2 computed using (3.4) in the linear model with orthonormal predictors (n = p = 30 and sparsity level K = 2). The solid line has slope 1.

3.4 GIC-based Sparsity Order Detection

Significance tests need either knowledge of the noise variance σ or some other assumptions on the matrix X, e.g., its orthogonality. Noise information is unknown in most applications, e.g., channel estimation, where the received signal is corrupted by the time-varying channel (multipath fading environments and intersymbol interference). The distribution of the test statistic is no longer the same when the noise variance is replaced by its estimate, affecting the statistical power of the test. Therefore, an information criteria-based method using the EN regularization path is considered


next for identifying the sparsity order.

3.4.1 Information criteria

Information criteria measure the plausibility of models in the variable selection process. They pick a model by comparing the complexity of alternative models and their ability to fit the data. They order the set of candidate models, which is referred to as the model class, and choose the model with the minimal value of the used information criterion. Several information criteria have been proposed in the literature, such as Akaike, Bayesian, Deviance, Focused, Kashyap and their extensions; see [86, 87, 71] for a review.

The Akaike information criterion (AIC) focuses more on predictive, rather than explanatory, modeling [88]. It approximates the out-of-sample error as a sum of the in-sample error and a penalty term (cf. [86] for its detailed derivation). Moreover, the bias-corrected AIC (AICc) is a small-sample method that provides a trade-off between the risk of under- and over-fitting [89]. Besides, a generalized and robust version of the AIC, known as the Takeuchi information criterion, contains the AIC as a special case. However, it is rarely used in practice since the estimation of its penalty term needs a very large sample size. The Bayesian information criterion (BIC), also called the Schwarz information criterion, is better than the AIC at picking the true model, assuming it is contained in the model class [90, 91]. Moreover, the BIC tends to be less conservative than the AIC, i.e., it often chooses a more parsimonious model (with fewer variables) [91]. Another criterion, called the risk inflation criterion (RIC), aims to minimize the risk of selecting irrelevant parameters [92].

In our approach, referred to as the c-LARS-GIC method and proposed in Publication P.V [93], we use the GIC of [8]. The GIC is chosen as it includes the AIC, BIC and RIC as special cases. Finally, the potential weakness in the usage of a GIC-based approach is addressed by having better rival candidate models at the knot values in the EN solution path.

3.4.2 The c-LARS-GIC method

The regularization path of the EN is combined with the GIC in order to construct a better method for detecting the sparsity (model) order. For Ak ≡ Ak(α), let the index set of the (p − k) non-active (i.e., zero) signal coefficients be Ak∁ = {1, . . . , p} \ Ak. Ideally, for k = 0, 1, . . . , κ = min(n, p) − 1, the best (plausible) solution (or set Ak) amongst all possible (κ choose k) solutions may be picked by considering the set of all hypotheses {HAk} of the form

HAk : βAk∁ = 0,   |Ak| = k (or |Ak∁| = p − k), (3.10)


where k = 0, 1, . . . , κ. This is obviously computationally infeasible even for small κ. Nevertheless, solving the knots in the EN regularization path by the c-PW-EN algorithm drastically reduces the number of tested hypotheses to (κ + 1) and allows one to consider a nested set of hypotheses

H0 ⊂ H1 ⊂ · · · ⊂ Hκ,   Hk : βAk∁ = 0,   k = 0, 1, . . . , κ. (3.11)

These are then tested in the c-LARS-GIC method by computing the GIC value at each knot (i.e., hypothesis Hk),

GICγ(λk) = n ln σ²u(λk) + k cn,γ, (3.12)

where k = |Ak| is the cardinality of Ak at the knot λk ≡ λk(α), i.e., the sparsity (model) order of hypothesis Hk, and the constant cn,γ is indexed by γ ∈ {0, 1, 2, 3, 4, 5} with the following acronyms/values:

• GIC0 = BIC [72] uses cn,0 = lnn.

• GIC1 uses cn,1 = lnn · ln(ln p) as suggested in [73].

• GIC2 uses cn,2 = ln p · ln(lnn) as suggested in [74].

• GIC3 = AIC [88] uses cn,3 = 2.

• GIC4 = corrected AIC (AICc) [89] uses cn,4 = 2n/(n − k − 1).

• GIC5 = RIC [92] uses cn,5 = ln p.

The proposed criterion in (3.12) uses the following bias-corrected (i.e., unbiased) ML estimate of the error scale σ² > 0 under Hk:

σ²u(λk) = (1/(n − k)) ‖y − yAk‖₂², (3.13)

where yAk = X_{Ak} X_{Ak}^† y is the LS fit under Hk.

The procedure of the c-LARS-GIC method for picking both the correct model

and the sparsity order is given in algorithm 5; see Publication P.V [93] for more details. Note that the nested set of hypotheses (3.11) is not fixed but depends on the data. Therefore, the performance of c-LARS-GIC mainly depends on having the correct model A* among the active sets {Ak}, k = 1, . . . , κ, in the EN regularization path. When this condition holds, c-LARS-GIC has a high probability of choosing the correct model due to the consistency property of the used information criterion [8, 74, 73]. This feature is verified empirically by the simulation studies in section 3.5 and in Section III-B of Publication P.V [93]. These results illustrate the power of using the EN regularization path in estimating both the support and the coefficients of the nonzero entries of a sparse signal, which allows the GIC to efficiently detect the sparsity order. Using the GIC for a conventional set of candidate models (i.e., without the help of the solution path) results in degraded performance. This shows that the nested-hypotheses-based GIC approach is a powerful and potential candidate for sparsity order detection.


Algorithm 5: c-LARS-GIC algorithm¹
input: y ∈ Cn, X ∈ Cn×p and γ ∈ {0, 1, 2, 3, 4, 5}.
output: {K, A*, β} as estimates of {K, A*, β}.
initialize: β = 0p and κ = min(n, p) − 1.

1  For k = 0, 1, . . . , κ, compute the active sets Ak at all Lasso knots λk using the c-PW-EN method with α = 1 (algorithm 3).
2  Find the sparsity order K (and the respective AK) using (3.12) as
     K = argmin_{k∈{0,1,...,κ}} { GICγ(λk) = n ln σ²u(λk) + k cn,γ },
   and set A* = AK.
3  Set β_{AK} = X_{AK}^† y to get the K-sparse estimate β ∈ Cp.

¹It is part of the MATLAB toolbox available at https://github.com/mntabassm/SAEN-LARS.
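Steps 2-3 of Algorithm 5 can be sketched compactly in NumPy, assuming the active sets at the Lasso knots have already been produced (e.g., by the c-LARS-Lasso sketch of Section 2.4.1). The penalty constant shown is the GIC2 choice cn,2 = ln p · ln(ln n); the function and argument names are ours, not the toolbox interface.

import numpy as np

def c_lars_gic_select(y, X, active_sets, c_pen=None):
    """Pick the sparsity order by minimizing GIC (3.12) over the knots."""
    n, p = X.shape
    if c_pen is None:
        c_pen = np.log(p) * np.log(np.log(n))          # GIC2 penalty constant
    gic = []
    for k, A in enumerate(active_sets):                # A = A_k with |A_k| = k
        if k == 0:
            resid = y
        else:
            XA = X[:, list(A)]
            resid = y - XA @ np.linalg.lstsq(XA, y, rcond=None)[0]   # LS fit under H_k
        sigma2_u = np.sum(np.abs(resid) ** 2) / (n - k)              # unbiased scale (3.13)
        gic.append(n * np.log(sigma2_u) + k * c_pen)
    K_hat = int(np.argmin(gic))
    A_hat = list(active_sets[K_hat])
    beta_hat = np.zeros(p, dtype=complex)
    if K_hat > 0:
        beta_hat[A_hat] = np.linalg.lstsq(X[:, A_hat], y, rcond=None)[0]
    return K_hat, A_hat, beta_hat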


3.5 An Example

Sparse approximation of signals in overcomplete dictionaries is known as sparse coding. This is the process of computing the representation coefficients β ∈ Cp based on the given signal y ∈ Cn and an overcomplete dictionary matrix D ∈ Cn×p. This process, commonly referred to as atom decomposition, can be applied, for example, in multiuser private communications as

y = Dβ + ε,

where β is the K-sparse message sent by a user at time instant t, y is the (noisy) received signal with complex random noise ε ∈ Cn, and D is the dictionary that is known to the receiver and unknown to other users. This example is analogous to the time-delay estimation of backscattered echoes from underwater targets [94], where determining which atoms of the dictionary represent the backscattered signal is the main task. Herein, we formulate both the detection and the selection of the dictionary columns for the sparse representation of signals as a joint combinatorial optimization problem.

For n = 64, we generate a collection D = [I, F, C, N] with p = 4n columns by forming a union of four complex-valued orthonormal bases: the shifted Kronecker delta subdictionary I of spikes, the Fourier matrix F,



Figure 3.2. The coherence structure of the dictionary in the sparse coding example (n = 64 and p = 4n); the figure depicts the heat-map of the elements of the Gram matrix |D^H D|.

the two-dimensional discrete cosine transform-II basis C, and the noiselet matrix N. Fig. 3.2 displays the correlation structure for all four subdictionaries. It can be noted that various atoms are highly coherent within D.

For different values of the sparsity order K = ‖β‖0, we generated L = 1000 Monte Carlo (MC) trials, where for each MC trial the message β has a random true support set A* and the noise terms εi are IID from the CN(0, σ²) distribution, where σ² is chosen to yield the desired SNR in dB, i.e., σ² = 10^(−SNRdB/10). The performance is measured by computing the (empirical) detection rate (DR),

DR = (1/L) Σ_{l=1}^{L} 1(K̂ = K),

as well as the exact support recovery rate (ESRR),

ESRR = (1/L) Σ_{l=1}^{L} 1(Â* = A*),

where 1(·) denotes the indicator function.
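Both figures of merit amount to averaging indicator variables over the trials; a minimal sketch, assuming per-trial lists of estimated orders and supports collected by whichever detector is being evaluated (names are ours).

import numpy as np

def detection_and_recovery_rates(K_true, K_hats, A_trues, A_hats):
    """Empirical DR and ESRR over L Monte Carlo trials."""
    dr = np.mean([k_hat == K_true for k_hat in K_hats])                                  # 1(K_hat = K)
    esrr = np.mean([set(a_hat) == set(a_true) for a_true, a_hat in zip(A_trues, A_hats)])  # 1(A_hat = A*)
    return dr, esrr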

Based on the Lasso, we now compare the detection performance of the CT and the RT with test-C and test-D proposed in [77]. The latter two tests, test-C and test-D, are proposed for orthogonal and non-orthogonal models with unknown and known noise variance, respectively. Figure 3.3 shows the results, where the Oracle method is assumed to know the true sparsity level. As can be seen, the RT performed relatively better than the other significance tests. The CT is not performing too well, as the asymptotic approximation of its null distribution is not accurate for small values of n and p. Yet, c-LARS-GIC2 outperformed all of them and achieved almost equal values of DR and ESRR. Here we show the results only for GIC2, which was empirically the best c-LARS-GIC based method (cf. Section III-B


of Publication P.V [93]). This example illustrates that c-LARS-GIC has a high probability of detecting the true sparsity order when A* is among the considered model class.

[Figure 3.3 panels: (a) detection rates (DR) and (b) exact support recovery rates (ESRR) as a function of the sparsity order.]

Figure 3.3. The sparse coding example (n = 64 and p = 4n); panels (a) and (b) illustrate the detection rate (DR) and the exact support recovery rate (ESRR), respectively, for varying sparsity order. Rates are computed over 1000 MC trials using an SNR level of 20 dB, where the support set A* is picked randomly for each MC trial.


4. Detection and Estimation ofDirections-of-Arrivals of Sources

The DoA estimation problem aims at detecting and locating K radiating sources by using an array of n passive sensors. The emitted energy can be, for example, acoustic or electromagnetic, whereas the receiving sensors can be, for instance, hydrophones or antennas. This problem arises in different applications related to radar, sonar, communications, seismology, and underwater surveillance. Herein, we first introduce the problem and provide an overview of existing approaches, e.g., spatial spectral analysis that determines the energy distribution over space. We then proceed by discussing a recent line of work that relies on sparse signal recovery algorithms, and finally we discuss our SSR approach developed in Publication P.IV [58].

4.1 Sensor Array Signal Model

The sources radiate energy that is carried to the sensors by the propagating waves. These source signals contain information about the location (or direction) of the sources. The far-field (i.e., plane-wave) assumption applies when the aperture of the n-element sensor array is much smaller than the propagation distance. Moreover, a non-dispersive (i.e., homogeneous) medium is considered for the propagation of the wavefields. Therefore, for K < n far-field incident source signals, the array output (called a snapshot) sampled at an arbitrary time instant t is given by the narrowband planar wavefront model

y(t) = Σ_{k=1}^{K} a(θk) sk(t) + ε(t) = A(θ) s(t) + ε(t), (4.1)

where ε(t) represents the measurement noise, s(t) = (s1(t), . . . , sK(t))^⊤ contains the source waveforms, θ ≜ (θ1, . . . , θK)^⊤ are the DoAs of the sources, and

A(θ) ≜ ( a(θ1) · · · a(θK) )


is the steering matrix, where a(θk) is the k-th source's steering vector that is determined by the geometry of the sensor array. In the case of a uniform linear array (ULA) with equispaced sensor elements at half a wavelength inter-element spacing, the steering matrix has a Vandermonde structure and a steering vector is given by

a(θ) = (1/√n) (1, e^{ıπ sin θ}, . . . , e^{ıπ(n−1) sin θ})^⊤. (4.2)

Figure 4.1 illustrates the array configuration in the case that a single source at DoA θ ∈ [−π/2, π/2] with respect to the array axis is impinging on a ULA with inter-element spacing d. The output of the array at time t collects the noisy observations into an n-vector y(t), referred to as the snapshot.


Figure 4.1. Plane wave propagation in a general setup of the DoA estimation problem, where a single source is impinging on an n-element ULA whose (noisy) output is y(t).

Several conventional DoA estimation methods use the spatial (array) covariance matrix Ry = E[y(t) y^H(t)] ∈ Cn×n, which carries the spatial and spectral information of the sources. Assuming that the uncorrupted signal, A(θ) s(t), is uncorrelated with the spatially white Gaussian noise ε(t) ∼ CNn(0, σ²I), the array covariance matrix has the structure

Ry = A(θ) Rs A^H(θ) + σ² I, (4.3)

where A(θ) Rs A^H(θ) is the signal covariance matrix and σ² I is the noise covariance matrix,

i.e., it can be represented as a linear combination of a rank-K matrix and a scaled identity matrix, where σ² = E[|εi(t)|²] > 0 denotes the noise variance and Rs = E[s(t) s^H(t)] denotes the source covariance matrix, whose diagonal elements represent the source powers and whose off-diagonal elements represent the source correlations. Notice that Rs is assumed to be a positive definite Hermitian K × K matrix and hence of full rank, i.e., the source signals are


assumed to be incoherent. Hence, Ry is also a positive definite Hermitian matrix of full rank.

The array covariance matrix Ry can be estimated from the received noisy data y(t) using the sample covariance matrix (SCM),

Ry = (1/T) Y Y^H = (1/T) Σ_{i=1}^{T} yi yi^H, (4.4)

where Y = (y1 · · · yT) denotes the n × T snapshot matrix collecting the

snapshots, denoted shortly as

yi = y(ti), i = 1, . . . , T,

and measured at distinct time points t1, . . . , tT .
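The SCM in (4.4) is a one-liner once the snapshots are stacked as the columns of Y; a minimal sketch (the function name is ours).

import numpy as np

def sample_covariance(Y):
    """SCM (4.4): R_hat = (1/T) Y Y^H for an n x T snapshot matrix Y."""
    T = Y.shape[1]
    return (Y @ Y.conj().T) / T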

4.2 Conventional DoA Estimation

The DoA estimation methods can be roughly split into three categories: beamformers, subspace methods and parametric techniques. In the following, these three categories and their limitations are briefly discussed. For a complete review of these methods, the reader is referred to [95, 96, 97, 98].

4.2.1 Beamformers

A conventional beamformer forms a spatial filter using a beamformer weight vector w = w(θ) that linearly combines the weighted sensor outputs into the beamformer output z = w^H y(t), with the aim of enhancing a signal of interest from a look direction θ and attenuating undesired signals from other directions. It estimates the DoA based on the concentration (or peak) of the estimated beamformer output power

P(θ) = (1/T) Σ_{i=1}^{T} |w^H yi|² = w^H ( (1/T) Σ_{i=1}^{T} yi yi^H ) w = w^H Ry w,

where T denotes the number of snapshots and Ry denotes the SCM in (4.4). The plot of P(θ) as a function of the look direction θ is called the spatial spectrum. Conventional beamforming methods for DoA estimation include the delay-and-sum and the minimum variance distortionless response (MVDR) beamformers, which are explained briefly herein (cf. [99, 100] for more details). These are non-parametric methods that do not exploit any data model.

• Classical beamformer: The classical delay-and-sum beamformer uses


the beamformer weight vector w = a(θ) and performs DoA estimation by finding the peaks of the spatial power spectrum

Pconv(θ) = a^H(θ) Ry a(θ) / (a^H(θ) a(θ)),

which is computed across the beamscan region of look directions [101]. It has low resolution (or a larger beamwidth), e.g., θBW ≈ 2 × wavelength/(n d |cos θ|) for the ULA in Figure 4.1, where the beam pattern has a main lobe around the look direction θ [96]. This limits the number of DoAs that can be estimated. For instance, the minimum separation needed is 18° between two sources when θ = 45° and the number of sensors in the ULA with half a wavelength inter-element spacing is n = 9.

• Capon's MVDR addresses the aforementioned resolution issue to improve the estimation performance. It estimates the DoAs of the K sources by finding the K maxima of the spectrum

Pcapon(θ) = 1 / (a^H(θ) Ry⁻¹ a(θ)),

where Ry is defined in (4.4). This results in a beam pattern that has relatively sharp peaks with smaller beamwidths at the DoAs (a code sketch of both scan spectra is given after this list).
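Both scan spectra can be evaluated over a grid of look directions with a few lines of NumPy, using the ULA steering vector (4.2); the grid spacing and the function names below are our own choices.

import numpy as np

def ula_steering(theta, n):
    """ULA steering vector (4.2) for broadside angle theta (radians)."""
    return np.exp(1j * np.pi * np.arange(n) * np.sin(theta)) / np.sqrt(n)

def scan_spectra(R_hat, n, grid=np.deg2rad(np.arange(-90, 90, 1.0))):
    """Classical (delay-and-sum) and Capon/MVDR spatial spectra on a grid."""
    R_inv = np.linalg.inv(R_hat)
    p_conv, p_capon = [], []
    for theta in grid:
        a = ula_steering(theta, n)
        p_conv.append((a.conj() @ R_hat @ a).real / (a.conj() @ a).real)
        p_capon.append(1.0 / (a.conj() @ R_inv @ a).real)
    return grid, np.array(p_conv), np.array(p_capon)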

4.2.2 Subspace methods

Subspace methods were developed after noticing that second-order estimation methods can retrieve the DoAs [102]. Subspace methods estimate the signal and noise subspaces from an estimated covariance matrix (e.g., the SCM Ry). They require detecting the number of DoAs K or assuming it is known a priori. The eigenvectors corresponding to the K

largest eigenvalues span the signal subspace, and the remaining eigenvectors are orthogonal to the signal space and span the noise subspace. Due to the structure (4.3) of the array covariance matrix Ry, its n − K smallest eigenvalues are equal to σ² and the corresponding eigenvectors eK+1, . . . , en are orthogonal to the columns of A(θ), i.e.,

a^H(θi) ej = 0,   i = 1, . . . , K,   j = K + 1, . . . , n. (4.5)

Thereafter, DoA estimation is performed by utilizing either the noise or the signal subspace in the algorithm. A number of subspace-based direction-finding algorithms have been developed since the 1980s, such as multiple signal classification (MUSIC) [103, 104] and the estimation of signal parameters via rotational invariance techniques (ESPRIT) [105, 106] and their variants [97, 95].

The widely used MUSIC method exploits the orthogonality of the signal


subspace with the noise subspace and property (4.5). Subspace methods start by computing the eigenvalue decomposition of the array output covariance matrix Ry in Equation 4.3,

Ry = Σ_{i=1}^{n} ζi ei ei^H,

where {e1, e2, . . . , eK, eK+1, . . . , en} denote the eigenvectors of Ry and ζ1 ≥ ζ2 ≥ · · · ≥ ζK ≥ ζK+1 ≥ · · · ≥ ζn denote the associated eigenvalues. The signal and noise subspaces can be defined via the spans of the matrices

Es = (e1 e2 . . . eK) and Eε = (eK+1 eK+2 . . . en),

respectively. However, the number of source signals K is unknown in practice and its estimate (e.g., using the MDL estimator) is needed in the MUSIC method. An estimate Eε of the noise subspace Eε is obtained in practice by selecting the eigenvectors corresponding to the n − K smallest eigenvalues of the SCM Ry. The MUSIC method then computes the pseudospectrum

Pmusic(θ) = 1 / (a^H(θ) Eε Eε^H a(θ))

for a grid of look directions (e.g., θ ∈ [−90°, 90°]) and estimates the source DoAs as the K largest peaks in the pseudospectrum. The resolution of this approach depends on the sensor parameters.
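A sketch of the MUSIC pseudospectrum for a given (or estimated) K, reusing the ula_steering helper from the beamforming sketch above; the grid and names are our own choices.

import numpy as np

def music_spectrum(R_hat, K, n, grid=np.deg2rad(np.arange(-90, 90, 0.5))):
    """MUSIC pseudospectrum: its K largest peaks indicate the source DoAs."""
    eigval, eigvec = np.linalg.eigh(R_hat)      # eigenvalues in ascending order
    E_noise = eigvec[:, : n - K]                # eigenvectors of the n - K smallest eigenvalues
    P = []
    for theta in grid:
        a = ula_steering(theta, n)
        P.append(1.0 / np.linalg.norm(E_noise.conj().T @ a) ** 2)
    return grid, np.array(P)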

Another popular subspace-based method is ESPRIT [105, 106], which utilizes the rotational invariance property of two identical subarrays in the case of a ULA. It is relatively less sensitive to noise and improves the resolution.

4.2.3 Parametric techniques

Parametric techniques formulate a data model for y and exploit the posed model to estimate the DoAs. They offer higher accuracy but have increased computational complexity. In general, they involve a multi-dimensional optimization problem.

The well-known maximum likelihood (ML) approach is representative of this family. It derives estimates by maximizing the likelihood function over the parameters of interest (e.g., the DoAs). It has two variants, deterministic ML and stochastic ML, based on different interpretations of the source signals. Deterministic ML, also known as the nonlinear least-squares method, views the signal as a fixed realization of a stochastic process, whereas stochastic ML considers the case where the array snapshots are IID and normally distributed with zero mean and covariance matrix Ry in (4.3), i.e.,

yi ∼ CN n(0,A(θ)RsAH(θ) + σ2I), i = 1, . . . , T.


Both variants assume independent, zero-mean, spatially white Gaussian noise. They exploit the statistical distribution of the array observations to find the DoAs that make the observed data y1, . . . , yT as likely as possible.

For both ML variants, the maximization of the objective function is a nonlinear optimization problem and, in the absence of a closed-form solution, requires iterative schemes. Accordingly, ML-based methods can be implemented by the expectation-maximization, space-alternating generalized expectation-maximization and iterative quadratic maximum likelihood algorithms that are applicable in the ULA case [97]. Other parametric techniques include the covariance matching estimation methods (cf. [107, 108]), which are an alternative to ML estimation and are referred to as generalized least squares in the literature. Subspace fitting is another parametric approach that uses a nonlinear least-squares formulation and provides a unified framework for the deterministic ML estimator and subspace-based methods [97]; an example is the weighted subspace fitting algorithm, also known as the method of direction estimation [97], which uses the signal subspace as the source of a projection into the model space.

4.2.4 Limitations

The conventional DoA estimation techniques suffer from certain well-known limitations, e.g., when the sources are coherent or at oblique angles with respect to the array axis, or when only a few (or a single) snapshots are available [109], for example in the case of fast-moving sources and multi-path arrivals. The MVDR beamformer requires that Ry is full rank and hence invertible, i.e., T ≥ n. Besides, the subspace-based methods are not robust against signal coherence and they also need a priori knowledge of the number of sources, K, which may be difficult to estimate in practice. Subspace methods also require a sufficient number of data snapshots to accurately estimate the array covariance matrix Ry. Moreover, they can be sensitive to source correlations, which tend to cause a rank deficiency in the sample data covariance matrix [110]. In addition, the DoAs of sources having varying source power/energy can often be difficult to estimate by these methods.

4.3 Single-snapshot DoA Estimation

Detection and estimation of DoAs can be quite challenging when only a small number of snapshots or, in the worst case, a single snapshot is available. Sparsity-based methods for DoA estimation [111, 112], particularly those inspired by CS [13], can tackle coherent sources and a single snapshot, as illustrated in Publication P.I [62] and Publication P.IV [58]. Therefore, the


sparse estimation (or optimization) methods are used in this dissertation for DoA estimation based on a single-snapshot measurement. These can be applied in several demanding scenarios, including cases with no knowledge of the number of sources and highly or completely correlated sources.

4.3.1 Compressed beamforming

Estimation of DoAs by compressed beamforming (CBF) is based on the principle of sparse representation [113, 114]; that is, it uses the observed single snapshot at a single sampled time instant in (4.1),

y = A(θ) s+ ε, (4.6)

In CBF, one converts the over-determined model of (4.6) into the sparse linear model of (2.1) via discretization of the parameter space at the desired resolution [115]. The sparse representation framework uses a finite angular grid (i.e., a number of steering vectors a(θ[i]) at grid points θ[i]), whereas the actual DoA estimation problem involves a continuous parameter space. Therefore, finer grids are needed to obtain better resolution, which results in a coherent design matrix in the sparse linear model.

Consider an angular grid of size p (commonly p ≫ n) of look directions of interest (i.e., arrival angles) in terms of the broadside angles,

[θ] = {θ[j] ∈ [−π/2, π/2) : θ[1] < · · · < θ[p]}.

Let the j-th column of the measurement matrix X in the model (2.1) be the array response for look direction θ[j], so xj = a(θ[j]). Assuming that the true source DoAs are contained in the chosen angular grid, i.e., θk ∈ [θ] for k = 1, . . . , K, the snapshot y in (4.6) can be equivalently modelled as in (2.1), where β is exactly K-sparse (i.e., ‖β‖0 = K) and the non-zero elements of β contain the source waveforms s, i.e., βA = s, where A = supp(β). Thus in CBF, identifying the true DoAs is equivalent to identifying the non-zero elements of β. Hence, based on a single snapshot only, we can utilize the proposed SSR algorithms developed in chapter 2 and chapter 3 to find the number of sources K and their DoAs.
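The conversion amounts to stacking steering vectors over the angular grid as the columns of X; a minimal sketch reusing the ula_steering helper from Section 4.2.1's sketch, with an illustrative default grid size of our own choosing.

import numpy as np

def cbf_dictionary(n, p=180):
    """Measurement matrix X for compressed beamforming: x_j = a(theta[j])."""
    grid = np.deg2rad(np.linspace(-90, 90, p, endpoint=False))
    X = np.column_stack([ula_steering(theta, n) for theta in grid])
    return grid, X

# Single-snapshot CBF then reduces to the sparse recovery problem y = X beta + eps,
# which can be handled by the c-PW-(W)EN / SAEN machinery of Chapters 2 and 3.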

4.3.2 Sequentially adaptive elastic net approach

For sources at oblique directions, the standard (i.e., non-weighted, wj ≡ 1 for j = 1, . . . , p) EN estimator may perform inconsistent variable selection. The adaptive Lasso [18] obtains the oracle variable selection property by using cleverly chosen adaptive weights for the regression coefficients in the ℓ1-penalty. We extended this idea and considered a weighted elastic net (WEN) penalty (cf. Publication P.IV [58]), coupling it with a sequential active set approach, where the WEN is applied only to the nonzero (active) predictors


that were found in the previous run. Our proposed adaptive EN uses data-dependent weights, defined as

wj = 1/|βinit,j|  if βinit,j ≠ 0,   and   wj = ∞  if βinit,j = 0,   for j = 1, . . . , p. (4.7)

Above, βinit ∈ Cp denotes a sparse initial estimator of β. The idea is that only the nonzero coefficients are exploited, i.e., basis vectors with βinit,j = 0 are omitted from the model, and thus the dimensionality of the linear model is reduced from p to K = ‖βinit‖0. Moreover, a larger weight means that the corresponding variable is penalized more heavily. The vector w = (w1, . . . , wp)^⊤ can be written compactly as

w = 1p ⊘ |βinit|, (4.8)

where the notation |β| means element-wise application of the absolute value operator to the vector, i.e., |β| = (|β1|, . . . , |βp|)^⊤, and the operator ⊘ is used for element-wise division.
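A literal transcription of (4.7)/(4.8), with np.inf standing in for the infinite weight that drops a zero coefficient from the model; the function name and the small numerical guard are ours.

import numpy as np

def adaptive_weights(beta_init):
    """Adaptive EN weights (4.7): w_j = 1/|beta_init_j|, with w_j = inf when beta_init_j = 0."""
    mag = np.abs(beta_init)
    return np.where(mag > 0, 1.0 / np.maximum(mag, 1e-300), np.inf)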

Next we turn our attention to how to choose the adaptive (i.e., data-dependent) weights in the c-PW-WEN algorithm of Publication P.IV [58]. A special case of this algorithm using unit weights, called c-PW-EN, was presented in subsection 2.4.2 and in algorithm 3. In the adaptive Lasso [18], one ideally uses the OLS or, if p > n, the Lasso as the initial estimator βinit to construct the weights given in (4.7). The problem is that both the OLS and the Lasso estimator have very poor accuracy (high variance) when there are high correlations between the predictors, a condition commonly occurring in the CBF problem due to high correlations between the steering vectors corresponding to closely spaced angles. This lowers the probability of exact recovery of the adaptive Lasso significantly.

To overcome the problem above, we developed the SAEN approach, which obtains a K-sparse solution in a sequential manner, decreasing the sparsity level of the solution at each iteration and using the previous solution as adaptive weights for c-PW-WEN. The SAEN is described in algorithm 6. SAEN runs the c-PW-WEN algorithm three times. In the first step, it finds a standard (unit-weight) c-PW-WEN solution with 3K nonzero (active) coefficients, denoted β3. The obtained solution determines the adaptive weights via (4.7) (using βinit = β3), and hence the active set of 3K nonzero coefficients, which is used in the second step to compute the c-PW-WEN solution that has 2K nonzero coefficients. This again determines the adaptive weights via (4.7) (and hence the active set of 2K nonzero coefficients), which is used in the third step to compute the c-PW-WEN solution that has the desired K nonzero coefficients. It is important to notice that since we start from a solution with 3K nonzeros, it is quite likely that the true K non-zero coefficients will be included in the active set of β3, which is computed in the first step of the SAEN algorithm. Note that the choice of 3K as the number of active elements is similar to the CoSaMP algorithm [32], which also uses 3K as an initial support size. Using 3K also usually guarantees that X_{A3K} is well conditioned, which may not be the case if a larger value than 3K is chosen.


Algorithm 6: SAEN algorithm (sequentially adaptive elastic net)¹
input: y ∈ Cn, X ∈ Cn×p, [α] and K.
output: β ∈ Cp and A* ⊂ {1, . . . , p} with |A*| = K, corresponding to estimates of the K-sparse signal β and the support set A*.

1  {β3, A3} = c-PW-WEN(y, X, 1, [α], 3K, 0)
2  {β2, A2} = c-PW-WEN(y, X, 1 ⊘ |β3|, [α], 2K, 0)
3  {β, A*} = c-PW-WEN(y, X, 1 ⊘ |β2|, [α], K, 1)

¹It is part of the MATLAB toolbox available at https://github.com/mntabassm/SAEN-LARS.
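The three stages of Algorithm 6 in sketch form, assuming a c_pw_wen(y, X, w, alphas, K, debias) routine with the interface of Algorithm 3 extended by a weight vector, and reusing the adaptive_weights helper defined above; both names and the calling convention are ours (the released toolbox is in MATLAB).

import numpy as np

def saen(y, X, alphas, K, c_pw_wen):
    """Sequentially adaptive elastic net (Algorithm 6), as a sketch."""
    p = X.shape[1]
    w = np.ones(p)
    beta3, _ = c_pw_wen(y, X, w, alphas, 3 * K, debias=False)                        # step 1
    beta2, _ = c_pw_wen(y, X, adaptive_weights(beta3), alphas, 2 * K, debias=False)  # step 2
    beta, A = c_pw_wen(y, X, adaptive_weights(beta2), alphas, K, debias=True)        # step 3
    return beta, A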


4.4 Detection and Estimation of DoAs

The proposed SAEN method assumes that the number of sources, K, is known. However, in practice, K is often unknown and needs to be estimated, for example using the algorithms proposed in chapter 3. Notice that the MDL criterion is known to suffer in detection performance for a small number of snapshots, and it is not applicable in the single-snapshot case.

In the following, a simulation study is conducted considering a grid from −90° to 90° with a spacing of 2°, i.e., p = 90, and with n = 40 sensors. The SNR level is 20 dB. The detection and estimation approach is as follows:

1. Find an estimate K of the sparsity level K using c-LARS-GIC, the Rayleigh test (RT), or the covariance test (CT) discussed in chapter 3, or test-C and test-D proposed in [77].

2. Compute an estimate of the K-sparse signal β and the support A* using the c-PW-WEN and SAEN approach.

When K sources are present, the results for the detection and estimation of DoAs, using the solution paths by c-PW-EN in algorithm 3, are plotted in Figure 4.2a and Figure 4.2b, respectively. They compare different methods (cf. chapter 3) and show the variations in the DR and ESRR for different numbers of sources. It can be noticed from the results in Figure 4.2a that GIC2 outperforms the other methods in detecting the number of sources K. Furthermore, the exact estimation of DoAs is also close to the Oracle that


knows a priori the value of K. Besides, Figure 4.2b also shows the ESRR rate for the SAEN approach, which results in an improvement.

Finally, we note that Publication P.IV [58] provides a detailed comparison of DoA estimation results illustrating that the SAEN approach improves DoA estimation in a large number of scenarios.

[Figure 4.2 panels: (a) detection rates (DR) and (b) exact support recovery rates (ESRR) as a function of the number of sources.]

Figure 4.2. Detection and estimation of source DoAs by c-PW-EN and the methods in chapter 3, along with estimation results by the SAEN approach: panels (a) and (b) illustrate the detection rate (DR) and the exact support recovery rate (ESRR), respectively, for a varying number of sources. Rates are computed over 1000 MC trials using an SNR level of 20 dB, where the support set A* is picked randomly for each MC trial.


5. High-dimensional classification

Consider having n samples (i.e., observations or cases) and p features (i.e., predictors or variables). A collection of feature vectors for n different samples constitutes the feature matrix X ∈ Rp×n. Each column vector xi ∈ Rp of X is called a feature vector corresponding to the i-th observation and is associated with a class label yi ∈ {1, . . . , G}, where G denotes the number of classes. The class labels of all n samples are collected into an output vector y. The pair (y, X) is then the basic data object on which predictive models are learned and is called the training data. HD data refers to a dataset for which p > n or p ≫ n, i.e., HD settings involve the case where the number of features may staggeringly exceed the number of samples.

Genomics data of gene expressions, measured by microarrays or RNA sequencing, often contain expression levels of tens of thousands of genes but often for only a few dozen samples (i.e., p ≫ n). Such a data configuration is referred to as the high-dimension low-sample-size (HDLSS) setting. In this chapter, we consider the classification of HD data and propose a novel classification method, called compressive regularized discriminant analysis (CRDA), which performs feature selection as part of the classifier's learning process. Our examples on simulated and real-world data illustrate that CRDA improves in all three facets of the performance criteria considered, namely accuracy, interpretability, and computational efficiency (cf. Publication P.VI).

For a genomics dataset of gene expressions, the feature matrix can be represented in a tabular form that is called the expression matrix, shown in Fig. 5.1 in transposed form X ∈ Rn×p for illustration purposes. This matrix contains covariate information for n samples across p genes; each row represents one particular sample, each column represents a gene, and each entry of the matrix is the measured expression level of a particular gene in a sample.


[Figure 5.1 schematic: expression (feature) matrix X with rows sample 1, . . . , sample n (expression profiles) and columns gene 1, . . . , gene p (expression signatures), together with the class-label vector y.]

Figure 5.1. Expression (feature) matrix with n expression profiles and p expression signatures. The respective class labels of the n observations are given in a vector y.

5.1 Motivation and Background

A classification rule (or model) assigns an observation x ∈ Rp, consisting of measured features, to a discrete group or class in {1, . . . , G}. Selecting only a small subset of significant features is often necessary to improve the classification accuracy in the HDLSS setting. It also helps to avoid overfitting, as less redundant data is present after the feature selection. Overfitting means that the model is so closely aligned to the training dataset that it does not know how to respond to new situations. Feature selection also improves the interpretability of the obtained results. More details on the challenges in HD statistical inference are discussed in [116, 117]. Ideally, an HD classifier should pick only relevant features, preferably using a data-dependent feature selection mechanism.

HD data is common in bioinformatics. The advent of ultra-high-throughput sequencing provides researchers and clinicians a variety of tools to probe genomes in greater depth, leading to an enhanced understanding of how genome sequence variants underlie phenotype and disease [118]. The rapid and low-cost sequencing technologies produce massive amounts of data. This brings a new set of challenges related to processing and analysing the HD data to acquire knowledge, and requires advanced statistical tools and methods that are sufficiently fast and computationally efficient, especially in HDLSS settings. Such developments are particularly useful after the recent precision medicine initiatives with goals of sequencing tens of thousands of genomes [119, 120].

For example, the discovery of disease subtypes and their classification are major tasks in delivering precision (or personalized) medicine, e.g., for cancer or diabetes. It takes into account, for example, individual variability in genes to predict more accurately which treatment and prevention strategies for a particular disease will work in which groups of people, i.e., for which type (or subtype) of that disease. This information can help classify patients, identify genes that can be used as molecular markers of a certain type of disease, and deepen the understanding of the molecular mechanisms of that disease.

In HDLSS settings, the number of training samples is not sufficiently large to accurately train modern nonlinear classifiers based, for example, on deep neural networks (DNN) [121]. DNNs with many layers suffer from severe overfitting and high-variance gradients for HDLSS data, and this phenomenon becomes more pronounced with an imbalanced training dataset, where some classes may have only very few cases. If a linear algorithm achieves good performance with tens or hundreds of cases per class, one may need thousands of cases per class for a nonlinear classifier.

The proposed CRDA classifier is a modification of linear discriminant analysis (LDA). The main difference between LDA and quadratic discriminant analysis (QDA) is that LDA classifies an observation x using a discriminant rule that is a linear combination of the variables x_1, . . . , x_p, while QDA uses a quadratic discriminant rule that is a combination of linear terms and second-order interactions, x_i x_j, between variables. LDA-based methods often offer substantially lower variance in HDLSS settings than QDA [122]. However, LDA can suffer from high bias if its modelling assumption, namely that the predictor variables share a common covariance matrix, is badly off. Nevertheless, accurate feature selection can offset this problem and offer improved performance.

Most standard classifiers, like the support vector machine (SVM), diagonal LDA, and diagonal QDA, either do not select features or depend on separately hand-picked features. There are a few classifiers, such as shrunken centroids regularized discriminant analysis (SCRDA) [123], penalized linear discriminant analysis (PLDA) [124], sparse partial least squares based logistic regression [125], supervised principal component analysis based LDA (SPCALDA) [126], and variable selection based random forests (varSelRF) [127], which perform feature selection during the learning process but often offer inferior accuracy in at least one aspect of the considered performance criteria. Furthermore, they involve parameters which are not easy to tune optimally.

5.2 High-dimensional Gene Expression Data

The collection of proteins within each cell, known as the cellular proteins, determines its function and performs most of the cellular mechanisms. Therefore, measuring the complete protein expression of cells would be ideal to learn their actions or reactions. However, this is not fully possible with the current state of the art, and, therefore, approximate alternatives are commonly used. Currently, the most prevalent way of estimating the type of cell activity utilizes the expression of genes in a particular condition. Gene expression is the process by which the instructions in our DNA are converted into a functional product, such as a protein. For example, during the expression of a protein-coding gene, information flows from DNA → RNA → protein. In this flow, the DNA transcription first transfers the information in DNA to a messenger ribonucleic acid (mRNA) molecule, and thereafter the resulting mature mRNA must be translated into the proteins that carry out different functions in the cell. Genes are expressed by being transcribed into RNA.

Gene expression profiling (GEP) measures the expression level of mRNAs (the transcripts) in a cell population, which shows the pattern of genes expressed by a cell at the transcription level [128]. This often means measuring relative mRNA amounts in two or more experimental conditions, and then assessing which conditions resulted in specific genes being expressed. The characteristic GEP patterns, referred to as gene expression signatures, are well known for their ability to differentiate between the different types and the behavior of cells [129, 130].

Different genes can be upregulated (increased in expression) or downregulated (decreased in expression) in distinct types of cells. Being able to map out these gene expression changes and find differentially regulated genes characteristic of a cell type is thus invaluable to biomedical research. This helps in understanding different medical conditions [131], processes of diseases like cancer [132], and otherwise exploring the therapeutic applications of drugs [133]. For instance, in cancer studies, it is valuable to identify the survival-related genes whose expressions are altered by a drug [134].

There are different GEP techniques that measure the activity (the expression) of thousands of genes at once. GEP has been defined by repeated technological innovations that led to a tremendous revolution in genomics research and the emergence of the high-throughput genomics era. The two most popular and influential inventions are microarrays and high-throughput DNA sequencing, commonly known as RNA sequencing [135]. The latter is also frequently called next-generation sequencing. DNA microarrays have been commonly used since the end of the previous century. In this dissertation, the performance of the CRDA classifier is tested mainly on microarray gene expression data, largely due to the availability of such data sets, e.g., thanks to the gene expression omnibus (GEO) database [136]. The effectiveness of CRDA-based classifiers on a broad array of microarray gene expression data sets was illustrated in detail in Publications P.III and P.VI. Next, we discuss GEP using microarrays.


5.2.1 Microarrays

Microarrays are solid supports, usually of glass or silicon, upon which DNA is attached in an organized, pre-determined grid fashion. GEP using microarrays is a robust, efficient, and cost-effective approach to assay a cell or tissue's transcriptome at a particular time or under a specific condition. It measures the expression of thousands of short predefined sequences, which are then preprocessed to obtain gene-level activity. As mRNA has relatively homogeneous properties, analyzing gene expression at the level of transcription is much easier than at the level of the proteins that actually carry out functions in cells. In most cases, the levels of mRNA produced from different genes can be assumed to be proportional to the concentrations of the corresponding active cellular proteins. Thus, measuring the abundance of transcripts of different genes is a convenient way of obtaining gene expression profiles of cells.

To obtain gene expression profiles via microarrays, mRNA is extracted from cell or tissue samples and measured using hybridization. After miscellaneous processing and amplification steps, the sample is applied onto a microarray chip. The chip contains spotted probes for different genes, and the amount of sample bound to the spots correlates with the expression level of the gene that the probes are for. Using various fluorescent labeling techniques, the abundance of the bound sample in each spot can then be quantified. These abundances are proportional to the gene expression levels of different genes.

There are several challenges with GEP using microarrays at different steps of the genomics data pipeline. The measurement data can be significantly noisy, for example, due to the various preprocessing steps [137]. Shortcomings in experimental design can introduce systematic error into the results, making them inaccurate. In microarray experimental design, care should be taken at each step of acquiring and processing samples to avoid introducing excessive random variation or bias.

5.2.2 Conventional statistical tests

For genomics data, the differential expression is usually computed as a fold-change (FC), i.e., a log2-ratio, for example, of the averages of the treatment and control sample values. This accounts for the biological significance of the genes, and their statistical significance is commonly tested using standard statistical tests such as the t-test. Typically, relevant genes (features) are picked using multiple hypothesis testing and correction, and by using the fold-change parameter and the false discovery rate (FDR) [138, 139, 140, 141], where the FDR is defined as the expected ratio of the number of false positives to the total number of positive calls in a differential expression analysis between two groups of samples [142].
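The fold-change and FDR based screening described above can be sketched in a few lines. The snippet below is a minimal illustration on simulated data (the variable names, sizes, and spiked-in effect are hypothetical): it computes a per-gene log2 fold-change, ordinary t-test p-values, and Benjamini-Hochberg adjusted p-values, and declares a gene differentially expressed only if it passes both the FDR cutoff and the fold-change threshold.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy expression data: p genes, n_t treatment and n_c control samples
p, n_t, n_c = 1000, 10, 10
treatment = rng.lognormal(mean=1.0, sigma=0.5, size=(p, n_t))
control = rng.lognormal(mean=1.0, sigma=0.5, size=(p, n_c))
treatment[:50] *= 4.0                       # spike in 50 differentially expressed genes

# Fold-change as the log2-ratio of group averages (biological significance)
log2_fc = np.log2(treatment.mean(axis=1) / control.mean(axis=1))

# Ordinary two-sample t-test per gene (statistical significance)
pvals = stats.ttest_ind(treatment, control, axis=1).pvalue

# Benjamini-Hochberg adjusted p-values (controls the FDR)
order = np.argsort(pvals)
ranked = pvals[order] * p / np.arange(1, p + 1)
qvals = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
fdr_adjusted = np.empty_like(qvals)
fdr_adjusted[order] = qvals

# A gene is declared DE only if it is both statistically and biologically significant
de_genes = np.where((fdr_adjusted < 0.05) & (np.abs(log2_fc) > 1.0))[0]
print(f"{de_genes.size} genes declared differentially expressed")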


Using permutation-based multiple testing, SAM (Significance Analysis of Microarrays), along with its extensions, represents a state-of-the-art statistical technique for finding significant genes [143, 144]. It uses repeated permutations of the data to determine whether the expression of genes is significantly related to the response. It improves on the ordinary Student t-test by imposing limits on the variability of genes, in order to exclude genes that do not change by at least a pre-specified fold-change level. It computes the expected order statistics and offers, through its R package samr, the option of estimating a p-value for each gene by using either the t-statistic or the Mann-Whitney (two-sample Wilcoxon) statistic. The obtained p-values for all the genes are compared to a cutoff which is chosen to attain an acceptable FDR. Moreover, SAM also computes the q-value, which is the familiar p-value adapted to multiple-testing situations, i.e., the lowest FDR at which the gene is called significant [145].

The cutoff for statistical significance is based on the FDR, and the value of the fold-change parameter ensures that selected genes are biologically significant by at least a pre-specified amount. Thus, a gene picked as differentially expressed (DE) must be both statistically and biologically significant. The fold-change parameter is generally difficult to tune, and the quantification of how much change in the mean expression makes a gene DE is subjective. All in all, these traditional gene selection approaches are not part of the classification process, and they often lack the ability to select all the DE genes.

5.3 Compressive Regularized Discriminant Analysis

CRDA is based on the LDA rule, which uses a linear discriminant function (i.e., a linear combination of the feature variables) in order to discriminate between the classes. LDA assumes that each of the G populations has a multivariate normal (MVN) distribution with a common positive definite covariance matrix Σ ∈ R^{p×p} but a different mean vector μ_g ∈ R^p, g = 1, 2, . . . , G. Assuming for the moment that both of these quantities are known, an observation in the training data from a class g ∈ {1, 2, . . . , G} is assumed to be generated from

x_{g,j} ∼ N_p(μ_g, Σ), for 1 ≤ j ≤ n_g,

where n_g is the number of observations from class g, and the total sample size is n = n_1 + n_2 + · · · + n_G. LDA assigns an observation x to the class g that maximizes its likelihood, or equivalently stated, that minimizes the Mahalanobis distance (MD) of the given observation x over all G groups:

x ∈ group [ ĝ = arg min_{g=1,...,G} { (x − μ_g)^⊤ Σ^{−1} (x − μ_g) } ].


In certain cases, the usage of some prior knowledge, for example the expected proportion of observations from each population, can be plausible. Therefore, let π_g, g = 1, . . . , G, denote the a priori class probabilities, i.e., π_g is the probability that a randomly selected observation is from class g, where π_1 + π_2 + · · · + π_G = 1. One then maximizes the posterior probability instead of the likelihood. This results in a slightly modified classification rule, expressed as

x ∈ group [ ĝ = arg min_{g=1,...,G} { (1/2) (x − μ_g)^⊤ Σ^{−1} (x − μ_g) − ln π_g } ].

One may further simplify the rule above and write it in the equivalent form

x ∈ group [ ĝ = arg max_{g=1,...,G} d_g(x) ],   (5.1)

where

d_g(x) = x^⊤ Σ^{−1} μ_g − (1/2) μ_g^⊤ Σ^{−1} μ_g + ln π_g   (5.2)

denotes the discriminant function for group g. Hence, (5.2) illustrates that the assumption of a common covariance matrix Σ and multivariate normality of the class populations results in a discriminant rule that is linear in x.
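To make the rule concrete, the following minimal sketch (toy numbers only; the function name and the assumption of known μ_g, Σ, and π_g are purely illustrative) evaluates the discriminant functions (5.2) for an observation x and assigns the group with the largest value, as in (5.1).

import numpy as np

def lda_discriminants(x, means, cov, priors):
    # Evaluate d_g(x) = x' Sigma^{-1} mu_g - 0.5 mu_g' Sigma^{-1} mu_g + ln pi_g, cf. (5.2)
    prec = np.linalg.inv(cov)               # Sigma^{-1}, assumed known and well-conditioned here
    return np.array([x @ prec @ mu - 0.5 * mu @ prec @ mu + np.log(pi)
                     for mu, pi in zip(means, priors)])

# Toy example with G = 2 groups in p = 3 dimensions (hypothetical numbers)
means = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
cov, priors = np.eye(3), [0.5, 0.5]
x = np.array([0.8, 0.9, 1.1])
d = lda_discriminants(x, means, cov, priors)
print("assigned group:", int(np.argmax(d)) + 1)   # rule (5.1): arg max_g d_g(x)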

In QDA one needs to estimate the covariance matrix for each class, whereas in LDA the assumption of a common covariance matrix between classes allows one to estimate the covariance matrix from the pooled data that is centred group-wise with respect to the sample mean vectors

μ̂_g = (1/n_g) Σ_{i=1}^{n_g} x_{g,i}

of the classes. This simplicity offered by LDA leads to substantially lower variance and improved prediction performance in HDLSS settings. This is because the number of observations n_g in a particular class g can be severely insufficient, leading to a poor estimate of the class covariance matrix. This is the main reason why LDA is often preferred over QDA in high-dimensional classification problems. LDA is also the chosen approach in the proposed CRDA framework.

5.3.1 Basic set-up

In the proposed CRDA approach, we collect the discriminant functions d_g(x) in (5.2) into a vector d(·) as

d(x) = (d_1(x), . . . , d_G(x)) = x^⊤ B + c,   (5.3)

where

c = −(1/2) diag(M^⊤ B) + ln π   (5.4)

is a vector of constants, diag(·) retrieves the diagonal elements of its matrix argument, ln π = (ln π_1, . . . , ln π_G) collects the log-prior probabilities, M = (μ_1 · · · μ_G) contains the group means, and the coefficient matrix B is defined as

B = (b_1 · · · b_G) = Σ^{−1} M.   (5.5)

Both of the unknowns in (5.3), the coefficient matrix B ∈ R^{p×G} and the constant vector c = c(B) ∈ R^G, depend on two matrices, namely the group-means matrix M ∈ R^{p×G} and the precision matrix Σ^{−1} ∈ R^{p×p}. In practice, both of these matrices are unknown, and thus we need to estimate them from the training data. The matrix of sample group means, M̂ = (μ̂_1 · · · μ̂_G), is used as an estimate of M. The estimate of the priors can be either uniform (π̂_g = 1/G) or the empirical probability of each group, i.e., π̂_g = n_g/n.

When p is huge, the resulting classifier is difficult to interpret, since the classification rule involves a linear combination of all p features. Moreover, notice that when the i-th row of B is a zero vector 0, the i-th feature does not contribute to the classification rule and hence can be eliminated. Consequently, this allows us to perform feature selection during the learning process, given that we find and use a row-sparse estimate of B. This is one of the main elements of CRDA, along with using an appropriate estimate of the precision matrix Σ^{−1}. This is explained in the next subsection.

5.3.2 Elements of CRDA

In the HDLSS setting, LDA faces the problem that the SCM, i.e., the maximum likelihood estimate of the covariance matrix, can be ill-conditioned (when n is only slightly larger than p) or singular (when p > n). This dictates the use of a regularized version of LDA, referred to shortly as RDA, in which state-of-the-art regularized estimators of the covariance matrix are used instead of the SCM. These include two regularized sample covariance matrix (RSCM) estimators, namely the Ell1-RSCM and Ell2-RSCM estimators developed in [146, 147], and a penalized sample covariance matrix (PSCM) estimator [148, 149] that solves a penalized Gaussian likelihood function using an additive Riemannian distance penalty on Σ with shrinkage towards a scaled identity matrix, referred to as the Rie-PSCM estimator. These three estimators result in three CRDA classifiers, named CRDA1, CRDA2 and CRDA3, respectively.
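As a rough illustration of the regularization idea only (this is not the Ell1-RSCM, Ell2-RSCM, or Rie-PSCM estimator; their formulas are given in [146, 147, 148, 149]), the sketch below forms a generic shrinkage estimator that convexly combines the SCM with a scaled identity matrix, so that the result is invertible even when p > n.

import numpy as np

def shrinkage_scm(X, alpha=0.5):
    # Generic shrinkage covariance estimate: (1 - alpha) * SCM + alpha * (tr(SCM)/p) * I.
    # X has shape (p, n) with observations as columns; alpha in (0, 1] is a hypothetical
    # regularization parameter (the estimators used in CRDA choose it in a data-adaptive way).
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    S = (Xc @ Xc.T) / (n - 1)                   # sample covariance matrix (SCM)
    return (1.0 - alpha) * S + alpha * (np.trace(S) / p) * np.eye(p)

# HDLSS example: p = 200 features, n = 30 samples; the SCM is singular but the
# shrinkage estimate is well-conditioned and can be inverted to form the precision matrix.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 30))
Sigma_hat = shrinkage_scm(X, alpha=0.3)
precision = np.linalg.inv(Sigma_hat)
print("condition number:", np.linalg.cond(Sigma_hat))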

Another element of CRDA is the compression of features by retaining a subset of the features and discarding the rest. This subset selection produces a model that is interpretable and possibly has lower misclassification rates or prediction error than the model using the full set of features.

We have G groups (i.e., G dependent variables) and p independent variables. A unit change in a particular dimension changes the distance d(x) by an amount equal to the corresponding coefficient value. A predictor with a small coefficient can therefore be called non-significant, or noise, and hence be deleted from the feature space. We apply an ℓ_q-norm (or sample variance) based metric to rank the features for subset selection. The selected subset of variables contains those K out of p features whose effect is most pronounced in the coefficient matrix. Accordingly, CRDA's discriminant function is

d̂(x) = (d̂_1(x), . . . , d̂_G(x)) = x^⊤ B̂ + ĉ,   (5.6)

where ĉ = −(1/2) diag(M̂^⊤ B̂) ∈ R^G and the K-rowsparse B̂ ∈ R^{p×G} has p − K zero rows which do not contribute to the classification rule.

CRDA estimates the coefficient matrix (5.5) as

B̂ = (b̂_1 · · · b̂_G) = Σ̂^{−1} M̂,

where Σ̂^{−1} is the precision matrix based on one of the three aforementioned covariance matrix estimators (Ell1-RSCM, Ell2-RSCM, or Rie-PSCM) and M̂ = (μ̂_1 · · · μ̂_G) contains the sample group means. The K-rowsparse coefficient matrix is then obtained by applying a hard-thresholding operator H_K(·, ϕ) to B̂, i.e., via the transform H_K(B̂, ϕ), where the function ϕ is a hard-thresholding selector function that acts on the row vectors of B̂ to rank the features. This is shown in Figure 5.2, where the function ϕ for a vector b ∈ R^G is the ℓ_q-norm when q ∈ [1, ∞), i.e., ϕ_q(b) = ‖b‖_q, and the sample variance when q = 0, denoted as

ϕ_0(b) = (1/(G − 1)) Σ_{i=1}^{G} |b_i − b̄|²,

where b̄ = (1/G) Σ_{i=1}^{G} b_i. Note that any norm with q ≥ 1 could be used, and we consider the most commonly used ℓ_q-norms, namely q = 1, q = 2 and q = ∞. We then use CV to choose the best ϕ-function among the candidates {ϕ_0, ϕ_1, ϕ_2, ϕ_∞}, along with the joint-sparsity level K ∈ {1, . . . , p}. This CV procedure, which gives the optimal B̂ used in (5.6), is detailed in Publication P.VI. Finally, the training (or learning) and testing (i.e., prediction) phases in the CRDA framework are illustrated in Figure 5.2.
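As an illustration of the hard-thresholding step, the following minimal sketch (hypothetical function and variable names; not the exact implementation of Publication P.VI) ranks the rows of a coefficient matrix with one of the selector functions ϕ_q or ϕ_0 and keeps only the K best rows, zeroing the remaining p − K rows.

import numpy as np

def hard_threshold_rows(B, K, q=2):
    # Row-wise hard-thresholding H_K(B, phi): keep the K rows of B with the largest
    # phi-value and set the remaining p - K rows to zero.
    # q in {1, 2, np.inf} uses the l_q-norm phi_q; q = 0 uses the sample variance phi_0.
    if q == 0:
        scores = np.var(B, axis=1, ddof=1)          # phi_0: sample variance of each row
    else:
        scores = np.linalg.norm(B, ord=q, axis=1)   # phi_q: l_q-norm of each row
    keep = np.argsort(scores)[::-1][:K]             # indices of the K most influential features
    B_sparse = np.zeros_like(B)
    B_sparse[keep] = B[keep]
    return B_sparse, keep

# Toy usage (hypothetical numbers): p = 6 features, G = 3 groups, keep K = 2 rows
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 3))
B_K, selected = hard_threshold_rows(B, K=2, q=np.inf)
print("selected features:", selected)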


[Figure 5.2 is a block diagram of the CRDA framework. Learning: from the training set {X, y}, the sample group means M̂ and the precision matrix Σ̂^{−1} are formed and combined into B̂ = Σ̂^{−1}M̂ = (b̂_1 · · · b̂_G); row-wise variance or ℓ_q-norm scores are sorted in descending order and K features are selected (p − K zero rows) via H_K(B̂, ϕ), giving the final coefficient matrix and the constant vector ĉ. Prediction: for a test point x_t, d̂(x_t) = x_t^⊤ B̂ + ĉ is evaluated and the group label with the maximum d̂(x_t)-value is assigned.]

Figure 5.2. CRDA framework where the unknown parameters are found using the training data set {X, y}. During the learning process, it performs feature selection to obtain the final coefficient matrix and constants, which are later used for predictions on the test set.


5.4 Numerical examples on real-world data

Next, we evaluate the performance of CRDA classifiers on two real-world datasets and compare the results with state-of-the-art classification methods. We note that these analysis results were not included in Publication P.III or Publication P.VI and hence complement the findings therein. Both of the datasets listed in Table 5.1 are available in the GEO database, which is an international public repository that archives and freely distributes high-throughput gene expression and other functional genomics data sets [136]. The accession numbers of the data sets are GDS968 and GDS3715.

The first dataset is for the analysis of UV- and IR-irradiated lymphoblastoid cell lines derived from the peripheral blood of patients with acute radiation toxicity. Samples are taken 2 months after the completion of radiation therapy. The results provide insight into the role of the response to DNA damage in radiation toxicity. The second dataset is for the analysis of vastus lateralis muscle biopsies from insulin-sensitive subjects, insulin-resistant subjects, and diabetic patients following insulin treatment. The results provide insight into the molecular basis of insulin action in skeletal muscle and the underlying defects causing insulin resistance. Details about the datasets are provided in Table 5.1. Note that the available data is split randomly into training and test sets of fixed length, n and n_t, respectively, such that the relative proportions of observations in the classes are preserved as accurately as possible. Then L = 10 such random splits are performed, and averages of the error rates are reported in order to diminish the randomness introduced by the splits.
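The repeated stratified splitting described above can be sketched as follows (a minimal illustration only: the feature matrix, the placeholder predictor, and the random seeds are hypothetical, and the class sizes mimic data set #1). Any of the compared classifiers would replace the dummy predictor inside the loop.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(171, 50))                     # hypothetical: 171 samples, 50 features
y = np.repeat([0, 1, 2, 3], [42, 45, 39, 45])      # class sizes as in data set #1

# L = 10 random splits that preserve the class proportions in train and test sets
splitter = StratifiedShuffleSplit(n_splits=10, test_size=69, random_state=0)
test_errors = []
for train_idx, test_idx in splitter.split(X, y):
    # ... fit the classifier on (X[train_idx], y[train_idx]) and predict on X[test_idx];
    # here a dummy majority-class predictor stands in for the actual classifier.
    majority = np.bincount(y[train_idx]).argmax()
    test_errors.append(np.mean(y[test_idx] != majority))

print(f"average TER over splits: {100 * np.mean(test_errors):.2f} %")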

Table 5.1. Summary of the real datasets used for comparison.

Dataset   GEO Accession #   p       n     n_t   {n_1, . . . , n_G}   G
#1        GDS968            5748    102   69    {42, 45, 39, 45}     4
#2        GDS3715           12626   66    44    {30, 40, 40}         3

For both datasets, the results are compared with several state-of-the-art methods, such as SCRDA, PLDA, SPCALDA, varSelRF, and linear SVM, as well as regularized multinomial regression with the ℓ_1 (Lasso) penalty or the Elastic Net (EN) penalty [150], for which we use the acronyms Lasso and EN. These sets of estimators and the software used for the calculations are explained in more detail in Table 3 of Publication P.VI. Table 5.2 lists the comparison results in terms of the averages and standard deviations of both the test error rate (TER) and the feature selection rate (FSR).

It can be seen from the table that the CRDA-based classifiers outperform all other methods. Nevertheless, the Lasso and EN based feature selection have the best interpretability, although they have higher misclassification rates. For data set #1, CRDA3, which uses the Riemannian PSCM, performs best in terms of the test error rate. The good performance of CRDA3 could be related to the fact that it leverages the result that the parameter space forms a Riemannian manifold (the space of symmetric positive definite matrices) in the design of the associated penalty function. However, CRDA3 is not always the best method (e.g., for data set #2), and validating this empirical observation would require more extensive studies. Overall, the CRDA approach offers the best trade-off, combining high interpretability with the fewest misclassifications.

Table 5.2. Classification results for both datasets in terms of test error rates (TER) and feature selection rates (FSR). Standard deviations are given in brackets.

              Data set #1                   Data set #2
              TER            FSR            TER            FSR
CRDA1         6.23 (3.28)    13.94 (2.76)   8.41 (2.41)    18.12 (8.38)
CRDA2         7.25 (4.53)    12.73 (4.44)   8.41 (2.41)    18.00 (8.58)
CRDA3         5.94 (3.83)    16.82 (0.73)   8.55 (4.45)    13.31 (3.88)
Lasso         17.25 (4.8)    0.49 (0.03)    15.23 (7.73)   0.12 (0.04)
EN            14.2 (5.37)    1.24 (0.05)    12.73 (6.36)   0.4 (0.1)
SCRDA         7.83 (4.64)    100 (0)        14.77 (8.04)   85.5 (30.03)
PLDA          47.54 (3.97)   44.2 (18)      46.14 (8.9)    86.58 (7.09)
SPCALDA       58.7 (15.25)   100 (0)        40.23 (16.3)   100 (0)
varSelRF      23.04 (3.51)   40.1 (31.04)   21.82 (11)     23.4 (31.24)
Linear SVM    39 (10.55)     100 (0)        13.18 (4.65)   100 (0)


6. Discussion, Conclusions and Future Work

This dissertation covers topics in HD statistics and signal processing. An increasingly common occurrence is the collection of large amounts of information on each individual sample point, even though the number of sample points themselves may remain relatively small. This is the case in bioinformatics, where HD classification is performed using high-throughput data like microarrays or next-generation sequencing, or in compressed beamforming using a sensor array. Regularization and shrinkage are commonly used tools in many applications, such as regression or classification, to overcome the significant statistical challenges posed particularly by HDLSS data settings, in which the number of features, p, is often several orders of magnitude larger than the sample size, n (i.e., p ≫ n). In this dissertation, we developed methods and tools for solving HD regression and classification problems, with an emphasis on applications dealing with complex-valued data.

In the context of the linear model, we focused on a convex penalized optimization problem using the EN penalty. The EN is a sparse linear regression estimator that includes the popular Lasso estimator as a special case. It is suitable for HD data and obtains good prediction accuracy while encouraging a grouping effect. One of its benefits over the Lasso is its ability to cope with predictors that are highly correlated. Such a situation occurs commonly in many real-world applications and is prevalent in compressed beamforming based DoA finding due to the densely discretized angular space.

In this dissertation, we developed new algorithms for the EN optimization problem (and its weighted version). First, we devised an easy-to-implement CCD algorithm to compute the EN solution for complex-valued data, referred to as CCD-CEN (Algorithm 1). Then we extended the popular LARS algorithm for solving the Lasso knots (the penalty parameter values after which the level of sparsity changes, cf. Algorithm 2) and the EN problem in the complex domain. The resulting algorithm, called c-PW-EN (Algorithm 3), is especially useful in problems where the sparsity level of the signal vector is known, since it permits solving for the EN solution with the specified sparsity level directly.

In many applications, however, the sparsity level is unknown and needs to be estimated. Therefore, the solution path based GIC method was proposed for detecting the sparsity order, e.g., the number of sources in DoA estimation. The developed algorithm, called c-LARS-GIC (cf. Algorithm 5), is simple and computationally fast since it only considers models in the Lasso solution path. Detection and estimation of source DoAs were performed in the CBF application. Finally, a novel sequential estimation strategy, called SAEN (cf. Algorithm 6), was proposed to improve on the exact support recovery in the sparse linear regression problem when the sparsity level is known. Its usefulness was demonstrated in the DoA finding problem.

For HD classification problems, a compressive classification framework, referred to as CRDA, was proposed. CRDA performs feature selection during the learning phase. Feature selection is important in many applications; for example, in genomic studies it is equivalent to finding significant genes that influence the particular trait under study. CRDA-based classifiers deploy state-of-the-art regularized covariance matrix estimators, and the hyperparameters of the method, the joint-sparsity level and the hard-thresholding selector function, can be easily tuned via cross-validation.

The research done in this dissertation opens up many avenues for future work. For example, we assumed that only a single measurement vector y is available from the sparse linear model. In some cases, there are several measurement vectors that all share the same design (measurement) matrix and sparse support. This is the case in the compressed beamforming problem when several snapshots are available and the sources have been static during the measurement interval. In future work, the aim is to extend the developed methods to such a multiple measurement vector model. Another research direction is to extend the CRDA-based classifiers to handle next-generation sequencing data, where measurements are not on a continuous scale but are positive count data.


A. Appendix

A.1 Expected Prediction Error

Let f̂(x_0) = β̂^H x_0 be the prediction for new measurement data {y_0, x_0}. Then, the expected prediction error (EPE) can be defined as

EPE ≜ E[ |y_0 − f̂(x_0)|² ]
    = E[ |y_0 − E[f̂(x_0)] − {f̂(x_0) − E[f̂(x_0)]}|² ]
    = E[ |y_0 − E[f̂(x_0)]|² ] + var[f̂(x_0)],   (1.1)

where, writing f(x_0) = β^H x_0 for the true (noiseless) response,

E[ |y_0 − E[f̂(x_0)]|² ] = E[ |y_0 − f(x_0) − {E[f̂(x_0)] − f(x_0)}|² ]
                        = var(ε) + (Bias[f̂(x_0)])².

Collecting these in Equation (1.1), we have

EPE ≜ E[ |y_0 − f̂(x_0)|² ]
    = (Bias[f̂(x_0)])² + var[f̂(x_0)]   [together: MSE[f̂(x_0)], the reducible error]
      + var(ε)   [σ², the irreducible error]
    = (bias)² + (variance) + (variance of noise/error).   (1.2)

Thereafter, note that the mean squared error (MSE) is

MSE[f̂(x_0)] = E[ |β^H x_0 − β̂^H x_0|² ]
            = x_0^H E[ (β̂ − β)(β̂ − β)^H ] x_0
            = x_0^H MSE(β̂) x_0.

Therefore, minimizing the EPE is directly linked with minimizing the MSE of β̂.
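As a sanity check of the decomposition (1.2), the following Monte Carlo sketch (a real-valued toy model with hypothetical dimensions, using a ridge estimator simply as an example of a biased β̂) compares the simulated EPE with the sum of squared bias, variance, and noise variance.

import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, lam = 50, 10, 0.5, 5.0
beta = rng.normal(size=p)
x0 = rng.normal(size=p)
f0 = x0 @ beta                                   # true noiseless response at x0

preds, errs = [], []
for _ in range(20000):
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # ridge estimate
    f_hat = x0 @ beta_hat
    preds.append(f_hat)
    errs.append((f0 + sigma * rng.normal() - f_hat) ** 2)            # |y0 - f_hat|^2

preds = np.array(preds)
bias2 = (preds.mean() - f0) ** 2
variance = preds.var()
print("EPE (Monte Carlo)      :", np.mean(errs))
print("bias^2 + var + sigma^2 :", bias2 + variance + sigma**2)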

A.2 Rayleigh Test Statistic for Global Null

Considering a global null (i.e., no predictor truly active), that is, H_0 : y ∼ CN_n(0, σ²I), we have

u_j = u_{jR} + ı u_{jI} = x_j^H y ∼ CN(0, σ²)

for j = 1, 2, . . . , p, where u_{jR} ∼ N(0, σ²/2) and u_{jI} ∼ N(0, σ²/2) are the real and imaginary parts of u_j. Hence, since

|u_{jR}| / (σ/√2) =_d χ_1  and  |u_{jI}| / (σ/√2) =_d χ_1,

one has that

|u_j|² = (u_{jR})² + (u_{jI})²
       =_d (σ/√2)² [ (u_{jR}/(σ/√2))² + (u_{jI}/(σ/√2))² ]
       =_d (σ²/2) [ χ²_1 + χ²_1 ]

⇔  |u_j|² / σ²  =_d  (1/2) χ²_2  =_d  Exp(1)

⇔  |u_j| / σ  =_d  (1/√2) χ_2  =_d  Rayleigh(1/√2)  =_d  Weibull(1, 2).

Above, the notation χ²_k denotes the chi-squared distribution with k degrees of freedom, and =_d reads "has the same distribution as".
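A quick simulation (hypothetical sizes and seed, and assuming a unit-norm column x_j as used above) can confirm this distributional result: drawing y under the global null and computing |x_j^H y|/σ, the empirical distribution should match Rayleigh(1/√2).

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma, reps = 100, 2.0, 50000

x_j = rng.normal(size=n) + 1j * rng.normal(size=n)
x_j /= np.linalg.norm(x_j)                       # unit-norm regressor column

# y ~ CN_n(0, sigma^2 I): real and imaginary parts each N(0, sigma^2 / 2)
y = sigma * (rng.normal(size=(reps, n)) + 1j * rng.normal(size=(reps, n))) / np.sqrt(2)
u = y @ x_j.conj()                               # u_j = x_j^H y for each replication
r = np.abs(u) / sigma

# Compare the empirical distribution with Rayleigh(scale = 1/sqrt(2))
ks = stats.kstest(r, stats.rayleigh(scale=1 / np.sqrt(2)).cdf)
print("KS statistic:", round(ks.statistic, 4), " p-value:", round(ks.pvalue, 3))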


Bibliography

[1] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (online: 12th printing Jan 2017), 2nd ed. Springer, New York, NY, 2009. (Cited on pages 17, 24, and 25.)

[2] A. Zollanvari, “High-dimensional statistical learning: Roots, justifications,and potential machineries,” Cancer informatics, vol. 14, pp. CIN–S30 804,2015. (Cited on page 17.)

[3] N. Meinshausen, B. Yu et al., “Lasso-type recovery of sparse representationsfor high-dimensional data,” The annals of statistics, vol. 37, no. 1, pp. 246–270, 2009. (Cited on page 17.)

[4] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journalof the Royal Statistical Society. Series B (Methodological), pp. 267–288,1996. (Cited on pages 17, 22, 24, and 25.)

[5] H. Zou and T. Hastie, “Regularization and variable selection via the elasticnet,” Journal of the Royal Statistical Society: Series B (Statistical Method-ology), vol. 67, no. 2, pp. 301–320, 2005. (Cited on pages 17, 22, 24, 25, 26, and 27.)

[6] X. Ying, “An overview of overfitting and its solutions,” in Journal of Physics:Conference Series, vol. 1168, no. 2. IOP Publishing, 2019, p. 022022. (Citedon page 17.)

[7] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., “Least angle regression(with discussion),” The Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.(Cited on pages 19, 22, 24, and 28.)

[8] R. Nishii, “Asymptotic properties of criteria for selection of variables inmultiple regression,” Ann. Statist., vol. 12, no. 2, pp. 758–765, 06 1984.(Cited on pages 19, 35, 41, and 42.)

[9] R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani, “A significancetest for the Lasso (with discussion),” Annals of statistics, vol. 42, no. 2, p.413, 2014. (Cited on pages 19, 22, 36, and 37.)

[10] P. Gerstoft, C. F. Mecklenbräuker, W. Seong, and M. Bianco, “Introductionto compressive sensing in acoustics,” The Journal of the Acoustical Societyof America, vol. 143, no. 6, pp. 3731–3736, 2018. (Cited on page 21.)

[11] P. Gerstoft, S. Nannuru, C. F. Mecklenbräuker, and G. Leus, “DOA estima-tion in heteroscedastic noise,” Signal Processing, vol. 161, pp. 63–73, 2019.(Cited on page 21.)


[12] E. Schwab, H. E. Cetingul, B. Mailhe, and M. S. Nadar, “Sparse recovery offiber orientations using multidimensional prony method,” Patent 10 324 155,June, 2019, US Patent App. 10/324,155. (Cited on page 21.)

[13] R. G. Baraniuk, T. Goldstein, A. C. Sankaranarayanan, C. Studer, A. Veer-araghavan, and M. B. Wakin, “Compressive video sensing: algorithms,architectures, and applications,” IEEE Signal Processing Magazine, vol. 34,no. 1, pp. 52–66, 2017. (Cited on pages 21 and 52.)

[14] X. Ma, F. Yang, S. Liu, J. Song, and Z. Han, “Design and optimization ontraining sequence for mmWave communications: A new approach for sparsechannel estimation in massive MIMO,” IEEE Journal on Selected Areas inCommunications, vol. 35, no. 7, pp. 1486–1497, 2017. (Cited on page 21.)

[15] Z. Qin, J. Fan, Y. Liu, Y. Gao, and G. Y. Li, “Sparse representation forwireless communications: A compressive sensing approach,” IEEE SignalProcessing Magazine, vol. 35, no. 3, pp. 40–58, 2018. (Cited on page 21.)

[16] S. Tiwari, K. Kaur, and K. Arya, “A study on dictionary learning basedimage reconstruction techniques for big medical data,” in Handbook ofMultimedia Information Security: Techniques and Applications. Springer,2019, pp. 377–393. (Cited on page 21.)

[17] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical learning withsparsity: the Lasso and generalizations. Chapman and Hall/CRC, 2015.(Cited on pages 21, 22, and 30.)

[18] H. Zou, “The adaptive Lasso and its oracle properties,” Journal of theAmerican Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.(Cited on pages 22, 24, 25, 53, and 54.)

[19] L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll, “Sparse discriminantanalysis,” Technometrics, vol. 53, no. 4, pp. 406–413, 2011. (Cited on page 22.)

[20] J. H. Friedman, “Fast sparse regression and classification,” InternationalJournal of Forecasting, vol. 28, no. 3, pp. 722–738, 2012. (Cited on page 22.)

[21] J. Taylor and R. J. Tibshirani, “Statistical learning and selective inference,”Proceedings of the National Academy of Sciences, vol. 112, no. 25, pp. 7629–7634, 2015. (Cited on page 22.)

[22] R. J. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani, “Exact post-selection inference for sequential regression procedures,” Journal of theAmerican Statistical Association, vol. 111, no. 514, pp. 600–620, 2016. (Citedon page 22.)

[23] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor, “Exact post-selectioninference, with application to the Lasso,” Ann. Statist., vol. 44, no. 3, pp.907–927, 06 2016. [Online]. Available: https://doi.org/10.1214/15-AOS1371(Cited on page 22.)

[24] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,”Journal of computational and graphical statistics, vol. 15, no. 2, pp. 265–286, 2006. (Cited on page 22.)

[25] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling[a sensing/sampling paradigm that goes against the common knowledgein data acquisition],” IEEE signal processing magazine, vol. 25, no. 2, pp.21–30, 2008. (Cited on page 22.)

[26] S. Foucart and H. Rauhut, A mathematical introduction to compressivesensing. Springer, 2013, vol. 1, no. 3. (Cited on page 22.)


[27] J. Wakefield, Bayesian and frequentist regression methods. SpringerScience & Business Media, 2013. (Cited on page 24.)

[28] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matchingpursuit: Recursive function approximation with applications to waveletdecomposition,” in Proceedings of 27th Asilomar conference on signals,systems and computers. IEEE, 1993, pp. 40–44. (Cited on page 24.)

[29] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurementsvia orthogonal matching pursuit,” IEEE Transactions on Information The-ory, vol. 53, no. 12, pp. 4655–4666, Dec 2007. (Cited on pages 24 and 35.)

[30] D. Needell and R. Vershynin, “Signal recovery from incomplete and inac-curate measurements via regularized orthogonal matching pursuit,” IEEEJournal of selected topics in signal processing, vol. 4, no. 2, pp. 310–316,2010. (Cited on page 24.)

[31] T. Blumensath and M. E. Davies, “Iterative hard thresholding for com-pressed sensing,” Applied and computational harmonic analysis, vol. 27,no. 3, pp. 265–274, 2009. (Cited on page 24.)

[32] D. Needell and J. Tropp, “CoSaMP: Iterative signal recovery from incom-plete and inaccurate samples,” Applied and Computational Harmonic Anal-ysis, vol. 26, no. 3, pp. 301 – 321, 2009. (Cited on pages 24 and 55.)

[33] M. Elad, Sparse and redundant representations: from theory to applicationsin signal and image processing. Springer Science & Business Media, 2010.(Cited on page 24.)

[34] R. Crandall, B. Dong, and A. Bilgin, “Randomized iterative hard thresh-olding for sparse approximations,” in 2014 Data Compression Conference.IEEE, 2014, pp. 403–403. (Cited on page 24.)

[35] A. Ejaz, E. Ollila, and V. Koivunen, “Randomized simultaneous orthogonalmatching pursuit,” in 2015 23rd European Signal Processing Conference(EUSIPCO). IEEE, 2015, pp. 704–708. (Cited on page 24.)

[36] C. Soussen, J. Idier, J. Duan, and D. Brie, “Homotopy based algorithmsfor �0-regularized least-squares,” IEEE Transactions on Signal Processing,vol. 63, no. 13, pp. 3301–3316, 2015. (Cited on page 24.)

[37] M. Osborne, B. Presnell, and B. Turlach, “A new approach to variableselection in least squares problems,” IMA Journal of Numerical Analysis,vol. 20, no. 3, pp. 389–403, 2000. (Cited on page 24.)

[38] A. Panahi and M. Viberg, “Fast candidate points selection in the LASSOpath,” IEEE Signal Processing Letters, vol. 19, no. 2, pp. 79–82, 2012. (Citedon page 24.)

[39] M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection forsparse reconstruction: Application to compressed sensing and other inverseproblems,” IEEE Journal of selected topics in signal processing, vol. 1, no. 4,pp. 586–597, 2007. (Cited on page 24.)

[40] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algo-rithm for linear inverse problems with a sparsity constraint,” Communica-tions on Pure and Applied Mathematics: A Journal Issued by the CourantInstitute of Mathematical Sciences, vol. 57, no. 11, pp. 1413–1457, 2004.(Cited on page 24.)

[41] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, “Sparse reconstructionby separable approximation,” IEEE Transactions on Signal Processing,vol. 57, no. 7, pp. 2479–2493, July 2009. (Cited on page 24.)


[42] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithmfor linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1,pp. 183–202, 2009. (Cited on page 24.)

[43] S. Becker, J. Bobin, and E. J. Candès, “NESTA: A fast and accurate first-order method for sparse recovery,” SIAM Journal on Imaging Sciences,vol. 4, no. 1, pp. 1–39, 2011. (Cited on page 24.)

[44] S. Yun and K.-C. Toh, “A coordinate gradient descent method for �1-regularized convex minimization,” Computational Optimization and Appli-cations, vol. 48, no. 2, pp. 273–307, 2011. (Cited on page 24.)

[45] Q. Wang, C. Meng, W. Ma, C. Wang, and L. Yu, “Compressive sensingreconstruction for vibration signals based on the improved fast iterativeshrinkage-thresholding algorithm,” Measurement, vol. 142, pp. 68–78, 2019.(Cited on page 24.)

[46] Y. Zhang, A. Jakobsson, D. Mao, Y. Zhang, Y. Huang, and J. Yang, “Gener-alized time-updating sparse covariance-based spectral estimation,” IEEEAccess, vol. 7, pp. 143 876–143 887, 2019. (Cited on page 24.)

[47] T. Kronvall, S. I. Adalbjörnsson, S. Nadig, and A. Jakobsson, “Group-sparseregression using the covariance fitting criterion,” Signal Processing, vol.139, pp. 116–130, 2017. (Cited on page 24.)

[48] L. Breiman et al., “Heuristics of instability and stabilization in modelselection,” The annals of statistics, vol. 24, no. 6, pp. 2350–2383, 1996.(Cited on page 24.)

[49] A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation fornonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.(Cited on pages 24 and 25.)

[50] I. E. Frank and J. H. Friedman, “A statistical view of some chemometricsregression tools (with discussion),” Technometrics, vol. 35, no. 2, pp. 109–135, 1993. (Cited on pages 24 and 25.)

[51] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihoodand its oracle properties,” Journal of the American statistical Association,vol. 96, no. 456, pp. 1348–1360, 2001. (Cited on pages 24 and 25.)

[52] H. D. Bondell and B. J. Reich, “Simultaneous regression shrinkage, variableselection, and supervised clustering of predictors with OSCAR,” Biometrics,vol. 64, no. 1, pp. 115–123, 2008. (Cited on page 25.)

[53] M. Yuan and Y. Lin, “Model selection and estimation in regression withgrouped variables,” Journal of the Royal Statistical Society: Series B (Sta-tistical Methodology), vol. 68, no. 1, pp. 49–67, 2006. (Cited on page 25.)

[54] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani, “A sparse-group Lasso,”Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231–245, 2013. (Cited on page 25.)

[55] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity byreweighted �1 minimization,” Journal of Fourier analysis and applications,vol. 14, no. 5-6, pp. 877–905, 2008. (Cited on page 25.)

[56] M. Yuan and Y. Lin, “On the non-negative garrotte estimator,” Journal ofthe Royal Statistical Society: Series B (Statistical Methodology), vol. 69,no. 2, pp. 143–161, 2007. (Cited on pages 25 and 26.)

[57] J. Jia and B. Yu, “On model selection consistency of the elastic net whenp � n,” Statistica Sinica, vol. 20, no. 2, pp. 595–611, 2010. [Online].Available: http://www.jstor.org/stable/24309012 (Cited on pages 25 and 26.)


[58] M. N. Tabassum and E. Ollila, “Sequential adaptive elastic net approach forsingle-snapshot source localization,” The Journal of the Acoustical Societyof America, vol. 143, no. 6, pp. 3873–3882, 2018. [Online]. Available:https://doi.org/10.1121/1.5042363 (Cited on pages 25, 26, 29, 47, 52, 53, 54, and 56.)

[59] S. Chen and D. Donoho, “Basis pursuit,” in Proceedings of 1994 28th Asilo-mar Conference on Signals, Systems and Computers, vol. 1. IEEE, 1994,pp. 41–44. (Cited on page 26.)

[60] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition bybasis pursuit,” SIAM review, vol. 43, no. 1, pp. 129–159, 2001. (Cited onpage 26.)

[61] J. Duan, C. Soussen, D. Brie, J. Idier, M. Wan, and Y.-P. Wang, “GeneralizedLASSO with under-determined regularization matrices,” Signal processing,vol. 127, pp. 239–246, 2016. (Cited on page 26.)

[62] M. N. Tabassum and E. Ollila, “Single-snapshot doa estimation using adap-tive elastic net in the complex domain,” in 2016 4th International Workshopon Compressed Sensing Theory and its Applications to Radar, Sonar andRemote Sensing (CoSeRa), Sep. 2016, pp. 197–201. (Cited on pages 27 and 52.)

[63] S. Rosset and J. Zhu, “Piecewise linear regularized solution paths,” Ann.Statist., vol. 35, no. 3, pp. 1012–1030, 07 2007. (Cited on page 28.)

[64] R. J. Tibshirani and J. Taylor, “The solution path of the generalized Lasso,”Ann. Statist., vol. 39, no. 3, pp. 1335–1371, 06 2011. (Cited on page 28.)

[65] R. J. Tibshirani, “The Lasso problem and uniqueness,” Electron. J. Statist.,vol. 7, pp. 1456–1490, 2013. (Cited on page 28.)

[66] A. M. Tillmann and M. E. Pfetsch, “The computational complexity of therestricted isometry property, the nullspace property, and related conceptsin compressed sensing,” IEEE Transactions on Information Theory, vol. 60,no. 2, pp. 1248–1259, 2013. (Cited on page 35.)

[67] T. T. Do, L. Gan, N. Nguyen, and T. D. Tran, “Sparsity adaptive matchingpursuit algorithm for practical compressed sensing,” in 2008 42nd AsilomarConference on Signals, Systems and Computers. IEEE, 2008, pp. 581–587.(Cited on page 35.)

[68] D. Angelosante and G. B. Giannakis, “RLS-weighted Lasso for adaptiveestimation of sparse signals,” in IEEE International Conference on Acoustics,Speech and Signal Processing, 2009. ICASSP 2009. IEEE, 2009, pp. 3245–3248. (Cited on page 35.)

[69] D. M. Malioutov, S. R. Sanghavi, and A. S. Willsky, “Sequential compressedsensing,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2,pp. 435–444, 2010. (Cited on page 35.)

[70] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14,no. 5, pp. 465–471, 1978. (Cited on page 35.)

[71] V. Koivunen and E. Ollila, Academic Press Library in Signal Processing,Volume 3: Array and Statistical Signal Processing. Elsevier, 2014, ch.Model order selection, pp. 9–26. (Cited on pages 35 and 41.)

[72] G. Schwarz, “Estimating the dimension of a model,” Annals of Statistics,vol. 6, no. 2, pp. 461–464, 1978. (Cited on pages 35 and 42.)

[73] H. Wang, B. Li, and C. Leng, “Shrinkage tuning parameter selection with adiverging number of parameters,” Journal of the Royal Statistical Society:Series B (Statistical Methodology), vol. 71, no. 3, pp. 671–683, 2009. (Citedon pages 35 and 42.)


[74] Y. Fan and C. Y. Tang, “Tuning parameter selection in high dimensionalpenalized likelihood,” Journal of the Royal Statistical Society: Series B(Statistical Methodology), vol. 75, no. 3, pp. 531–552, 2013. (Cited on pages 35and 42.)

[75] C. D. Austin, R. L. Moses, J. N. Ash, and E. Ertin, “On the relation betweensparse reconstruction and parameter estimation with model order selection,”IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 3, pp. 560–570, 2010. (Cited on page 35.)

[76] T. Kronvall and A. Jakobsson, “Hyperparameter selection for group-sparseregression: A probabilistic approach,” Signal Processing, vol. 151, pp. 107–118, 2018. (Cited on page 36.)

[77] R. Jagannath, “Detection, estimation and grid matching of multiple targetswith single snapshot measurements,” Digital Signal Processing, vol. 92, pp.82–96, 2019. (Cited on pages 36, 38, 39, 44, and 55.)

[78] P.-J. Chung, J. F. Bohme, C. F. Mecklenbrauker, and A. O. Hero, “Detectionof the number of signals using the benjamini-hochberg procedure,” IEEETransactions on Signal Processing, vol. 55, no. 6, pp. 2497–2508, 2007.(Cited on page 36.)

[79] P.-J. Chung, M. Viberg, and C. F. Mecklenbräuker, “Broadband ML estima-tion under model order uncertainty,” Signal processing, vol. 90, no. 5, pp.1350–1356, 2010. (Cited on page 36.)

[80] Y. Wang, Z. Tian, and C. Feng, “Sparsity order estimation and its applicationin compressive spectrum sensing for cognitive radios,” IEEE Transactionson Wireless Communications, vol. 11, no. 6, pp. 2116–2125, 2012. (Cited onpage 36.)

[81] A. Lavrenko, F. Roemer, G. Del Galdo, R. Thomä, and O. Arikan, “An empir-ical eigenvalue-threshold test for sparsity level estimation from compressedmeasurements,” in 2014 22nd European Signal Processing Conference (EU-SIPCO). IEEE, 2014, pp. 1761–1765. (Cited on page 36.)

[82] M. E. Lopes, “Unknown sparsity in compressed sensing: Denoising andinference,” IEEE Transactions on Information Theory, vol. 62, no. 9, pp.5145–5166, 2016. (Cited on page 36.)

[83] S. Semper, F. Römer, T. Hotz, and G. DelGaldo, “Sparsity order estimationfrom a single compressed observation vector,” IEEE Transactions on SignalProcessing, vol. 66, no. 15, pp. 3958–3971, 2018. (Cited on page 36.)

[84] Y. Gao, Y. Si, B. Zhu, and Y. Wei, “Sparsity order estimation algorithm incompressed sensing by exploiting slope analysis,” in 2018 14th Interna-tional Wireless Communications & Mobile Computing Conference (IWCMC).IEEE, 2018, pp. 753–756. (Cited on page 36.)

[85] Y. Luo, J. Dang, and Z. Song, “Optimal compressive spectrum sensingbased on sparsity order estimation in wideband cognitive radios,” IEEETransactions on Vehicular Technology, 2019. (Cited on page 36.)

[86] P. Stoica and Y. Selen, “Model-order selection: a review of informationcriterion rules,” IEEE Signal Processing Magazine, vol. 21, no. 4, pp. 36–47,2004. (Cited on page 41.)

[87] J. Ding, V. Tarokh, and Y. Yang, “Model selection techniques: An overview,”IEEE Signal Processing Magazine, vol. 35, no. 6, pp. 16–34, 2018. (Cited onpage 41.)


[88] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974. (Cited on pages 41 and 42.)

[89] C. M. Hurvich and C.-L. Tsai, “Regression and time series model selectionin small samples,” Biometrika, vol. 76, no. 2, pp. 297–307, 1989. (Cited onpages 41 and 42.)

[90] J. Ding, V. Tarokh, and Y. Yang, “Bridging AIC and BIC: a new criterion forautoregression,” IEEE Transactions on Information Theory, vol. 64, no. 6,pp. 4024–4043, 2017. (Cited on page 41.)

[91] J. E. Cavanaugh and A. A. Neath, “The akaike information criterion: Back-ground, derivation, properties, application, interpretation, and refinements,”Wiley Interdisciplinary Reviews: Computational Statistics, vol. 11, no. 3, p.e1460, 2019. (Cited on page 41.)

[92] D. P. Foster, E. I. George et al., “The risk inflation criterion for multipleregression,” The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.(Cited on pages 41 and 42.)

[93] M. N. Tabassum and E. Ollila, “Simultaneous signal subspace rank andmodel selection with an application to single-snapshot source localization,”in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018,pp. 1592–1596. (Cited on pages 41, 42, and 45.)

[94] X. Meng, X. Li, A. Jakobsson, and Y. Lei, “Sparse estimation of backscat-tered echoes from underwater object using integrated dictionaries,” TheJournal of the Acoustical Society of America, vol. 144, no. 6, pp. 3475–3484,2018. (Cited on page 43.)

[95] H. Krim and M. Viberg, “Two decades of array signal processing research:the parametric approach,” IEEE Signal Processing Magazine, vol. 13, no. 4,pp. 67–94, 1996. (Cited on pages 49 and 50.)

[96] P. Stoica, R. L. Moses et al., “Spectral analysis of signals,” 2005. (Cited onpages 49 and 50.)

[97] P.-J. Chung, M. Viberg, and J. Yu, “DOA estimation methods and algo-rithms,” in Academic Press Library in Signal Processing. Elsevier, 2014,vol. 3, pp. 599–650. (Cited on pages 49, 50, and 52.)

[98] A. M. Zoubir et al., “Academic press library in signal processing: Volume 3array and statistical signal processing.” (Cited on page 49.)

[99] J. Capon, “High-resolution frequency-wavenumber spectrum analysis,” Pro-ceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969. (Cited on page 49.)

[100] D. H. Johnson and D. E. Dudgeon, Array signal processing: concepts andtechniques. PTR Prentice Hall Englewood Cliffs, 1993. (Cited on page 49.)

[101] M. Bartlett, “An introduction to stochastic processes,” Cambridge, Mass,1966. (Cited on page 50.)

[102] V. F. Pisarenko, “The retrieval of harmonics from a covariance function,”Geophysical Journal International, vol. 33, no. 3, pp. 347–366, 1973. (Citedon page 50.)

[103] R. O. Schmidt, “A signal subspace approach to multiple emitter locationand spectral estimation.” 1982. (Cited on page 50.)

[104] R. Schmidt, “Multiple emitter location and signal parameter estimation,”IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276–280,1986. (Cited on page 50.)


[105] R. Roy, A. Paulraj, and T. Kailath, “ESPRIT–a subspace rotation approachto estimation of parameters of cisoids in noise,” IEEE transactions onacoustics, speech, and signal processing, vol. 34, no. 5, pp. 1340–1342, 1986.(Cited on pages 50 and 51.)

[106] R. Roy and T. Kailath, “ESPRIT-estimation of signal parameters via rota-tional invariance techniques,” IEEE Transactions on acoustics, speech, andsignal processing, vol. 37, no. 7, pp. 984–995, 1989. (Cited on pages 50 and 51.)

[107] A. B. Gershman, C. F. Mecklenbrauker, and J. F. Bohme, “Matrix fittingapproach to direction of arrival estimation with imperfect spatial coherenceof wavefronts,” IEEE Transactions on Signal Processing, vol. 45, no. 7, pp.1894–1899, 1997. (Cited on page 52.)

[108] B. Ottersten, P. Stoica, and R. Roy, “Covariance matching estimation tech-niques for array signal processing applications,” Digital Signal Processing,vol. 8, no. 3, pp. 185–210, 1998. (Cited on page 52.)

[109] P. Häcker and B. Yang, “Single snapshot DOA estimation,” Advances inRadio Science, vol. 8, pp. 251–256, 2010. (Cited on page 52.)

[110] Z. Yang, J. Li, P. Stoica, and L. Xie, “Sparse methods for direction-of-arrivalestimation,” in Academic Press Library in Signal Processing, Volume 7.Elsevier, 2018, pp. 509–581. (Cited on page 52.)

[111] M.-A. Bisch, S. Miron, D. Brie, and D. Lejeune, “A sparse approach for DOAestimation with a multiple spatial invariance sensor array,” in 2016 IEEEStatistical Signal Processing Workshop (SSP). IEEE, 2016, pp. 1–5. (Citedon page 52.)

[112] C. F. Mecklenbräuker, P. Gerstoft, and E. Zöchmann, “c–LASSO and itsdual for sparse signal estimation from array data,” Signal Processing, vol.130, no. Supplement C, pp. 204 – 216, 2017. (Cited on page 52.)

[113] C. F. Mecklenbräuker, P. Gerstoft, A. Panahi, and M. Viberg, “Sequentialbayesian sparse signal reconstruction using array data,” IEEE Transactionson Signal Processing, vol. 61, no. 24, pp. 6344–6354, Dec 2013. (Cited onpage 53.)

[114] P. Gerstoft, A. Xenaki, and C. F. Mecklenbräuker, “Multiple and singlesnapshot compressive beamforming,” The Journal of the Acoustical Societyof America, vol. 138, no. 4, pp. 2003–2014, 2015. (Cited on page 53.)

[115] D. Malioutov, M. Cetin, and A. S. Willsky, “A sparse signal reconstructionperspective for source localization with sensor arrays,” IEEE Transactionson Signal Processing, vol. 53, no. 8, pp. 3010–3022, Aug 2005. (Cited onpage 53.)

[116] D. L. Donoho et al., “High-dimensional data analysis: The curses andblessings of dimensionality,” AMS math challenges lecture, vol. 1, no. 2000,p. 32, 2000. (Cited on page 58.)

[117] J. Fan and R. Li, “Statistical challenges with high dimensionality: fea-ture selection in knowledge discovery,” in Proceedings of the InternationalCongress of Mathematicians Madrid, August 22–30, 2006, 2007, pp. 595–622. (Cited on page 58.)

[118] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: tenyears of next-generation sequencing technologies,” Nature Reviews Genetics,vol. 17, no. 6, p. 333, 2016. (Cited on page 58.)


[119] F. S. Collins and H. Varmus, “A new initiative on precision medicine,” New England Journal of Medicine, vol. 372, no. 9, pp. 793–795, 2015. (Cited on page 58.)

[120] P. R. Njølstad et al., “Roadmap for a precision-medicine initiative in the Nordic region,” Nature Genetics, vol. 51, no. 6, p. 924, 2019. (Cited on page 58.)

[121] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. (Cited on page 59.)

[122] J. H. Friedman, “Regularized discriminant analysis,” Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989. (Cited on page 59.)

[123] Y. Guo, T. Hastie, and R. Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2007. (Cited on page 59.)

[124] D. M. Witten and R. Tibshirani, “Penalized classification using Fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011. (Cited on page 59.)

[125] G. Durif, L. Modolo, J. Michaelsson, J. E. Mold, S. Lambert-Lacroix, and F. Picard, “High dimensional classification with combined adaptive sparse PLS and logistic regression,” Bioinformatics, vol. 34, no. 3, pp. 485–493, 2018. (Cited on page 59.)

[126] Y. S. Niu, N. Hao, and B. Dong, “A new reduced-rank linear discriminant analysis method and its applications,” Statistica Sinica, pp. 189–202, 2018. (Cited on page 59.)

[127] R. Diaz-Uriarte, “GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest,” BMC Bioinformatics, vol. 8, no. 1, p. 328, 2007. (Cited on page 59.)

[128] M. R. Fielden and T. R. Zacharewski, “Challenges and limitations of gene expression profiling in mechanistic and predictive toxicology,” Toxicological Sciences, vol. 60, no. 1, pp. 6–10, 2001. (Cited on page 60.)

[129] L. Zhang et al., “Gene expression profiles in normal and cancer cells,” Science, vol. 276, no. 5316, pp. 1268–1272, 1997. (Cited on page 60.)

[130] D. T. Ross et al., “Systematic variation in gene expression patterns in human cancer cell lines,” Nature Genetics, vol. 24, no. 3, pp. 227–235, 2000. (Cited on page 60.)

[131] J. Lu et al., “MicroRNA expression profiles classify human cancers,” Nature, vol. 435, no. 7043, pp. 834–838, 2005. (Cited on page 60.)

[132] U. R. Chandran et al., “Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process,” BMC Cancer, vol. 7, no. 1, p. 64, 2007. (Cited on page 60.)

[133] L. J. Van’t Veer et al., “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, 2002. (Cited on page 60.)

[134] M. H. Cheok et al., “Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells,” Nature Genetics, vol. 34, no. 1, pp. 85–90, 2003. (Cited on page 60.)

[135] R. Lowe, N. Shirley, M. Bleackley, S. Dolan, and T. Shafee, “Transcriptomics technologies,” PLoS Computational Biology, vol. 13, no. 5, p. e1005457, 2017. (Cited on page 60.)


[136] E. Clough and T. Barrett, “The Gene Expression Omnibus database,” in Statistical Genomics. Springer, 2016, pp. 93–110. (Cited on pages 60 and 67.)

[137] J. M. Raser and E. K. O’Shea, “Noise in gene expression: origins, consequences, and control,” Science, vol. 309, no. 5743, pp. 2010–2013, 2005. (Cited on page 61.)

[138] S. Dudoit, J. P. Shaffer, J. C. Boldrick et al., “Multiple hypothesis testing in microarray experiments,” Statistical Science, vol. 18, no. 1, pp. 71–103, 2003. (Cited on page 61.)

[139] D. Dembélé and P. Kastner, “Fold change rank ordering statistics: a new method for detecting differentially expressed genes,” BMC Bioinformatics, vol. 15, no. 1, p. 14, 2014. (Cited on page 61.)

[140] J. Hemerik and J. J. Goeman, “False discovery proportion estimation by permutations: confidence for significance analysis of microarrays,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 80, no. 1, pp. 137–155, 2018. (Cited on page 61.)

[141] J. H. Kim, Advanced Microarray Data Analysis. Singapore: Springer Singapore, 2019, pp. 79–93. [Online]. Available: https://doi.org/10.1007/978-981-13-1942-6_5 (Cited on page 61.)

[142] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences, vol. 100, no. 16, pp. 9440–9445, 2003. (Cited on page 61.)

[143] V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proceedings of the National Academy of Sciences, vol. 98, no. 9, pp. 5116–5121, 2001. (Cited on page 62.)

[144] I. Dinu et al., “Gene-set analysis and reduction,” Briefings in Bioinformatics, vol. 10, no. 1, pp. 24–34, 2008. (Cited on page 62.)

[145] J. D. Storey, “A direct approach to false discovery rates,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 64, no. 3, pp. 479–498, 2002. (Cited on page 62.)

[146] E. Ollila, “Optimal high-dimensional shrinkage covariance estimation for elliptical distributions,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 1639–1643. (Cited on page 64.)

[147] E. Ollila and E. Raninen, “Optimal shrinkage covariance matrix estimation under random sampling from elliptical distributions,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2707–2719, 2019. (Cited on page 64.)

[148] L. Philip, X. Wang, and Y. Zhu, “High dimensional covariance matrix estimation by penalizing the matrix-logarithm transformed likelihood,” Computational Statistics & Data Analysis, vol. 114, pp. 12–25, 2017. (Cited on page 64.)

[149] D. E. Tyler and M. Yi, “Shrinking the sample covariance matrix using convex penalties on the matrix-log transformation,” arXiv preprint arXiv:1903.08281, 2019. (Cited on page 64.)

[150] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, vol. 33, no. 1, p. 1, 2010. (Cited on page 67.)


Errata

Publication III

Equation (2) should read μ_g = x̄_g = (1/n_g) ∑_{c(i)=g} x_i instead of μ_g = ∑_{c(i)=g} x_i.
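For readers cross-checking the correction, the following is a minimal NumPy sketch of the corrected group-mean computation μ_g = x̄_g = (1/n_g) ∑_{c(i)=g} x_i. It is not code from Publication III; the function name, variable names, and toy data are illustrative assumptions only.

```python
# Illustrative sketch only: group-wise sample means as in the corrected Eq. (2),
# i.e., mu_g is the average of the observations x_i assigned to group g.
import numpy as np

def group_means(X, c):
    """X: (n, p) data matrix; c: length-n vector of group labels c(i)."""
    return {g: X[c == g].mean(axis=0)      # (1/n_g) * sum over {i : c(i) = g} of x_i
            for g in np.unique(c)}

# Hypothetical toy data
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = np.array([0, 0, 1])
print(group_means(X, c))   # {0: array([2., 3.]), 1: array([5., 6.])}
```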

Publication IV

• p. 3875: The expression |〈x_j, r(λ)〉| < λ holds for non-active predictors instead of |〈x_j, r(λ)〉| ≤ λ (see the sketch after these items).

• p. 3877: The output of Algorithm 2 (i.e., the c-PW-WEN algorithm) should mention A_K ∈ ℕ^K instead of A_K ∈ ℝ^K.
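Purely as an illustration of the corrected strict inequality, and not as an implementation of the c-PW-WEN algorithm, the sketch below flags a predictor as non-active at penalty level λ when |〈x_j, r(λ)〉| < λ. The data, residual, and penalty level are hypothetical, and the scaling of λ is assumed to follow the convention used in the publication.

```python
# Illustrative sketch only: at penalty level lam, non-active predictors satisfy
# |<x_j, r(lam)>| < lam (strict inequality), while active predictors attain equality.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))        # hypothetical predictor matrix, columns x_j
r = rng.standard_normal(50)             # hypothetical residual r(lam) at some lam
lam = 20.0                              # hypothetical penalty level

corr = np.abs(X.T @ r)                  # |<x_j, r(lam)>| for each predictor j
non_active = np.flatnonzero(corr < lam) # strict inequality, per the erratum
print(corr.round(2), non_active)
```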
