Circuit-centric quantum classifiers

Maria Schuld,1,2,3 Alex Bocharov,3 Krysta Svore,3 and Nathan Wiebe3

1Quantum Research Group, School of Chemistry and Physics, University of KwaZulu-Natal, Durban 4000, South Africa
2National Institute for Theoretical Physics, KwaZulu-Natal, Durban 4000, South Africa
3Quantum Architectures and Computation Group, Station Q, Microsoft Research, Redmond, WA (USA)
(Dated: April 3, 2018)

The current generation of quantum computing technologies calls for quantum algorithms that require a limited number of qubits and quantum gates, and which are robust against errors. A suitable design approach is that of variational circuits, where the parameters of gates are learnt, an approach that is particularly fruitful for applications in machine learning. In this paper, we propose a low-depth variational quantum algorithm for supervised learning. The input feature vectors are encoded into the amplitudes of a quantum system, and a quantum circuit of parametrised single- and two-qubit gates together with a single-qubit measurement is used to classify the inputs. This circuit architecture ensures that the number of learnable parameters is poly-logarithmic in the input dimension. We propose a quantum-classical training scheme where the analytical gradients of the model can be estimated by running several slightly adapted versions of the variational circuit. We show with simulations that the circuit-centric quantum classifier performs well on standard classical benchmark datasets while requiring dramatically fewer parameters than other methods. We also evaluate the sensitivity of the classification to state preparation and parameter noise, introduce a quantum version of dropout regularisation, and provide a graphical representation of quantum gates as highly symmetric linear layers of a neural network.

    I. INTRODUCTION

Quantum computing – information processing with devices that are based on the principles of quantum theory – is currently undergoing a transition from a purely academic discipline to an industrial technology. So-called “non-fault-tolerant”, “small-scale” or “near-term” quantum devices are being developed on a variety of hardware platforms, and offer for the first time a testbed for quantum algorithms. However, allowing for only of the order of 1,000–10,000 elementary operations on 50–100 qubits [1], and without the costly feature of error correction, these early devices are not yet suitable to implement the algorithms that made quantum computing famous. A new generation of quantum routines that use only very limited resources and are robust against errors has therefore been created in recent years [2–4]. While many of those small-scale algorithms have the sole purpose of demonstrating the power of quantum computing over classical information processing [5], an important goal is to find quantum solutions to useful applications.

One increasingly popular candidate application for near-term quantum computing is machine learning [6]. Machine learning is data-driven decision making in which a computer fits a mathematical model to data (training) and uses the model to derive decisions (inference). Numerous quantum algorithms for machine learning have been proposed in the past years [7, 8]. A prominent strategy [9–12] is to encode data into the amplitudes of a quantum state (here referred to as amplitude encoding), and use quantum circuits to manipulate these amplitudes. Quantum algorithms that are only polynomial in the number n of qubits can perform computations on 2^n amplitudes. If these 2^n amplitudes are used to encode the data, one can therefore process data inputs in polylogarithmic time. However, most of the existing literature on amplitude-encoded quantum machine learning translates known machine learning models into non-trivial quantum subroutines that lead to resource-intensive algorithms which cannot be implemented on small-scale devices. Furthermore, quantum versions of training algorithms are limited to specific, mostly convex optimisation problems. Hybrid approaches called “variational algorithms” [2, 13, 14] are much better suited to near-term quantum computing and have rapidly gained popularity in the quantum research community in recent months. A general picture of variational circuits for machine learning is introduced in [15]. The importance of low-depth circuits for quantum machine learning was emphasised in [16], where entanglement was analysed as a resource for low-depth architectures in the context of Boltzmann machines. A very recent preprint that comes closest to the designs presented here is Farhi and Neven [17]. The latter focusses mostly on the classification of discrete and discretised data that is encoded into qubits rather than amplitudes, which requires an exponentially larger number of qubits for a given input dimension. The circuit architectures proposed in that work are of a more general nature compared to our focus on a slim parameter count through the systematic use of entanglement.

FIG. 1. Idea of the circuit-centric quantum classifier. Inference with the model f(x, θ) = y is executed by a quantum device (the quantum processing unit, or QPU), which consists of a state preparation circuit S_x encoding the input x into the amplitudes of a quantum system, a model circuit U_θ, and a single-qubit measurement. The measurement retrieves the probability of the model predicting 0 or 1, from which in turn the binary prediction can be inferred. The classification circuit parameters θ are learnable and can be trained by a variational scheme.

This paper presents a quantum framework for supervised learning that makes use of the advantages of amplitude encoding, but is based on a variational approach and is therefore particularly designed for small-scale quantum devices (see Fig. 1). To achieve low algorithmic depth we propose a circuit-centric design which understands a generic strongly entangling quantum circuit U_θ as the core of the machine learning model f(x; θ) = y, where x is an input, θ a set of parameters, and y is the prediction or output of the model. We call this circuit the model circuit. The model circuit consists of parametrised single and controlled single-qubit gates, with learnable (classical) parameters. The number of parametrised gates in the family of model circuits we propose grows only polynomially with the number of qubits, which means that our quantum machine learning algorithm has a number of parameters that is overall poly-logarithmic in the input dimension.

The model circuit acts on a quantum state that represents the input x via amplitude encoding. To prepare such a quantum state, a static state preparation circuit S_x has to be applied to the initial ground state. After applying the state preparation as well as the model circuit, the prediction is retrieved from the measurement of a single qubit. If the data is sufficiently low-dimensional or its structure allows for efficient approximate preparation, this yields a compact circuit that can be understood as a black-box routine that executes the inference step of the machine learning algorithm on a small-scale quantum computer.

We propose a hybrid quantum-classical gradient descent training algorithm. On the analytical side we show how the exact gradients of the circuit can be retrieved from running slight variations of the inference algorithm (and for now assuming perfect precision in the prediction) a small, constant number of times and adding up the results, a strategy we call classical linear combination of unitaries. The parameter updates are then calculated on a classical computer. Keeping the model parameters as a classical quantity allows us not only to implement a large number of iterations without worrying about growing coherence times, but also to store and reuse learnt parameters at will. Using single-batch gradient descent only requires the state preparation circuit S_x to encode one input at a time. In addition to that, we can easily improve the gradient descent scheme by standard methods such as an adaptive learning rate, regularisation and momenta.

We analyse the resulting circuit-centric quantum classifier theoretically as well as via simulations to judge its performance compared to other models. We show that, mathematically, a quantum circuit closely resembles a neural network architecture with unitary layers, and discuss ways to include dropout and nonlinearities. The unitarity of the “pseudo-layers” is a favourable property from a machine learning point of view [19, 20], since it maintains the length of an input vector throughout the layers and therefore circumvents the notorious problems of vanishing or exploding gradients. Unitary weight matrices have also been shown to make the convergence time of gradient descent independent of the circuit depth [21] – an important guarantee to avoid the growing complexity of training deep architectures. Possibly the most important feature of the model is that it uses a number of parameters that is logarithmic in the data size, a huge saving compared to a neural network, where the first layer alone already has a number of weights at least linear in the dimension of the input.

In the remainder of the paper we introduce the circuit-centric quantum classifier in Section II, along with design considerations for the circuit architecture in Section III, as well as the training scheme in Section IV. We analyse its performance in Section V and show that, compared with out-of-the-box methods, it performs reasonably well. We finally propose a number of ways to extend the work in Section VI.

II. THE CIRCUIT-CENTRIC QUANTUM CLASSIFIER

The task our model intends to solve is that of supervised pattern recognition, a standard problem in machine learning with applications in image recognition, fraud detection, medical diagnosis and many other areas. To formalise the problem, let X be a set of inputs and Y a set of outputs. Given a dataset D = {(x^1, y^1), ..., (x^M, y^M)} of pairs of so-called training inputs x^m ∈ X and target outputs y^m ∈ Y for m = 1, ..., M, our goal is to predict the output y ∈ Y of a new input x ∈ X. For simplicity we will assume in the following that X = R^N and Y = {0, 1}, which is a binary classification task on an N-dimensional real input space (see Fig. 2). Most machine learning algorithms solve this task in two steps: they first train a model f(x, θ) with the data by adjusting a set of parameters θ, and then use the trained model to infer the prediction y.

FIG. 2. Supervised binary classification for 2-dimensional inputs. Given the red circle and blue triangle data points belonging to two different classes, guess the class of the new input (pink square).

The main idea of the circuit-centric design is to turn a generic quantum circuit of single- and two-qubit quantum gates into a model for classification. One can divide the full inference algorithm into four steps. As shown in Fig. 3, these four steps can be described using the language of quantum circuits, but also as a formal mathematical model, and finally, using the idea of graphical representations for neural networks, as a graphical model.

From a quantum circuit point of view, we use the state preparation circuit S_x to encode the data into the state of an n-qubit quantum system, which effectively maps an input x ∈ R^N to the 2^n-dimensional amplitude vector ϕ(x) that describes the initial quantum state |ϕ(x)⟩. Second, the model circuit U_θ is applied to the quantum state. Third, the prediction is read out from the final state |ϕ′⟩ = U_θ|ϕ(x)⟩. For this purpose we measure the first of the n qubits. Repeated applications of the overall circuit and measurements resolve the probability of measuring the qubit in state 1. Lastly, the result is postprocessed by adding a learnable bias parameter b and mapping the result through a step function to the output y ∈ {0, 1}.

FIG. 3. Inference with the circuit-centric quantum classifier consists of four steps, here displayed in four colours, and can be viewed from three different perspectives, i.e. from a formal mathematical framework, a quantum circuit framework and a graphical neural network framework. In the first step, the feature map from the input space to the feature space, R^N → R^K, is executed for an input by a state preparation scheme. The quantum circuit applies a unitary transformation to the feature vector, which can be understood as one linear layer (or, when decomposed into gates, several linear layers) of a neural network. The measurement statistics of the first qubit are interpreted as the continuous output of the classifier and effectively implement a weightless nonlinear layer in which every component of the last half of all units is mapped by an absolute square and summed up. The postprocessing stage binarises the result with a thresholding function via classical computing.

From a purely mathematical point of view, this procedure (that is, if we could perfectly resolve the probability of the first qubit by measurements) formally defines a classifier that takes decisions according to

$$
f(x; \theta, b) =
\begin{cases}
1 & \text{if } \sum_{k=2^{n-1}+1}^{2^n} |(U_\theta\, \varphi(x))_k|^2 + b > 0.5, \\
0 & \text{else.}
\end{cases}
\qquad (1)
$$

Here ϕ : R^N → C^{2^n} is a map that describes the procedure of information encoding via the state preparation routine (n is an integer such that 2^n ≥ N), U_θ is the parametrised unitary matrix describing the model circuit, and (U_θ ϕ(x))_k is the k-th entry of the result after we applied this matrix to ϕ(x). The sum over the second half of the resulting vector corresponds to the single-qubit measurement resulting in state 1. Postprocessing adds the bias b and thresholds to compute a binary prediction.
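To make the decision rule (1) concrete, the following is a minimal numpy sketch of the formal model. The helper names (amplitude_encode, classify), the zero-padding choice, and the random unitary standing in for a trained model circuit are our own illustrative assumptions, not the implementation used in this work:

```python
import numpy as np

def amplitude_encode(x):
    """Pad x with zeros to the next power of 2 and normalise to unit length."""
    n = int(np.ceil(np.log2(len(x))))
    padded = np.zeros(2**n, dtype=complex)
    padded[:len(x)] = x
    return padded / np.linalg.norm(padded)

def classify(x, U, b):
    """Decision rule of Eq. (1): threshold the probability of measuring q0 = 1."""
    phi_out = U @ amplitude_encode(x)
    p1 = np.sum(np.abs(phi_out[len(phi_out) // 2:])**2)  # second half of amplitudes
    return 1 if p1 + b > 0.5 else 0

# toy usage with a random 2-qubit unitary in place of the trained U_theta
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
U, _ = np.linalg.qr(A)
print(classify(np.array([0.2, 0.4, 0.1, 0.3]), U, b=0.0))
```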

Lastly, if we formulate the four steps in the language of neural networks and their graphical representation, state preparation corresponds to a feature map on the input space, while the unitary circuit resembles a neural network of several parametrised linear layers. This is followed by two nonlinear layers, one simulating the readout via measurement (adding the squares of some units from the previous layer) and one that maps the output to the final binary decision. We will go through the four different steps in more detail and discuss our specific design decisions for the model.

    A. State preparation

There are various strategies to encode input vectors into the n-qubit system of a quantum computer. In the most general terms, state preparation implements a feature map ϕ : R^N → C^{2^n}, where n is the total number of qubits used to represent the features. In the following we focus on amplitude encoding, where an input vector x ∈ R^N – possibly with some further preprocessing to bring it into a suitable form – is directly associated with the amplitudes of the 2^n-dimensional ‘ket’ vector of the quantum system written in the computational basis. This option can be extended by preparing a set of copies of the initial quantum state, which effectively implements a tensor product of copies of the input, mapping it to much higher-dimensional spaces.

To directly associate an amplitude vector in the computational basis with a data input, we require that N is a power of 2 (so that we can use all 2^n amplitudes of an n-qubit system), and that the input is normalised to unit length, x^T x = 1. If N is not a power of 2, we can ‘pad’ the original input with a suitable number of zero features. (For example, x = (x_1, x_2, x_3)^T would be extended to x′ = (x_1, x_2, x_3, 0)^T.) Normalisation can pose a bigger challenge. Although many datasets carry proximity relations between vectors in their angles and not their length, some datasets can become significantly distorted by normalisation. A possible solution is to embed the data in a higher-dimensional space. Practically, this can be achieved by adding non-zero padding terms before normalisation. Let N be the dimensionality of the original feature space, and let c_1, ..., c_D be the padding terms, which may in general depend on the informative features x_1 to x_N. The preprocessing necessary for amplitude encoding maps

$$ (x_1, \dots, x_N)^T \rightarrow \chi\, (x_1, \dots, x_N, c_1, \dots, c_D)^T, \qquad (2) $$

with

$$ \chi = \frac{1}{\sqrt{\sum_j x_j^2 + \sum_k |c_k|^2}}, $$

on the original data.

For the designs investigated in this paper it is convenient to choose the padding width D such that N′ = N + D is an exact power of 2, and to choose {c_1, ..., c_D} as a set of non-informative constants. This choice has in fact two desirable side effects for the feature normalisation (2): first, it creates an ‘ancillary’ space of dimension D, which in the language of neural networks is analogous to having more “nodes” in the first hidden layer than in the input layer; second, the constants create a state vector that is not homogeneous with respect to the vector of the original features (the importance of this will appear shortly in the discussion of the tensorial maps).
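As an illustration of the padding-and-normalisation map (2), here is a short numpy sketch; the function name preprocess and the constant padding value 0.1 are our own assumptions:

```python
import numpy as np

def preprocess(x, pad_const=0.1):
    """Pad x with non-informative constants up to the next power of 2,
    then rescale by the factor chi of Eq. (2) so the result has unit norm."""
    N_prime = 2**int(np.ceil(np.log2(len(x))))
    c = np.full(N_prime - len(x), pad_const)          # padding terms c_1, ..., c_D
    chi = 1.0 / np.sqrt(np.sum(x**2) + np.sum(np.abs(c)**2))
    return chi * np.concatenate([x, c])

x = np.array([0.5, 1.2, -0.3])
print(preprocess(x), np.linalg.norm(preprocess(x)))   # 4 entries, unit norm
```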

Preparing a quantum state whose amplitude vector in the computational basis is equivalent to the preprocessed input ϕ(x) can always be done with a circuit that is linear in the number of features in the input vector, for example with the routines in Refs. [22–24]. When more structure in the data can be exploited, preparation routines with polylogarithmic dependence on the number of features might be applicable [25, 26]. A largely uninvestigated option is also approximate state preparation of feature vectors, which may reduce the resources needed for the circuit S_x at the expense of an error in the inputs.

To map input data into vastly higher-dimensional spaces we can apply a tensorial feature map by preparing d copies of the state [27]. If |ψ⟩ is the ‘ket’ vector produced by amplitude encoding, this prepares

$$ |\psi\rangle \rightarrow \underbrace{|\psi\rangle \otimes \dots \otimes |\psi\rangle}_{d \text{ times}}. $$

For amplitude encoding with N = 2 and d = 2, and without any of the preprocessing described above, this would map a feature vector (x_1, x_2)^T to

$$ \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \otimes \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2 x_1 \\ x_2^2 \end{pmatrix}, $$

and can give rise to interesting nonlinearities that may facilitate the classification procedure in the following steps (see also [28]).
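The tensorial feature map is a one-liner in numpy (the helper name tensorial_map is ours); since the Kronecker product of unit vectors is again a unit vector, the output of d copies remains a valid amplitude vector:

```python
import numpy as np
from functools import reduce

def tensorial_map(psi, d):
    """Map an amplitude-encoded state to d tensor copies of itself."""
    return reduce(np.kron, [psi] * d)

psi = np.array([0.6, 0.8])            # unit-norm single-qubit state
print(tensorial_map(psi, 2))          # [0.36, 0.48, 0.48, 0.64]
```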

    B. The model circuit

Given an encoded feature vector ϕ(x), which is now a ‘ket’ vector in the Hilbert space of an n-qubit system, the model circuit maps this ket vector to another ket vector ϕ′ = U_θ ϕ(x) by a unitary operation U_θ which is parametrised by a set of variables θ.


    1. Decomposition into (controlled) single qubit gates

    As described before, we decompose U into

$$ U = U_L \cdots U_\ell \cdots U_1, \qquad (3) $$

where each U_ℓ is either a single-qubit or a two-qubit quantum gate. As a reminder, a single-qubit gate G_k acting on the k-th of n qubits can be expressed as

$$ U_\ell = I_0 \otimes \cdots \otimes G_k \otimes \cdots \otimes I_{n-1}. \qquad (4) $$

If the circuit depth L is in Ω(4^n), this decomposition allows us to represent general unitary transformations. Remember that unitary operators are linear transformations that preserve the length of a vector, a fact that holds a number of advantages for the classifier, as we will discuss later.
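Equation (4) is easy to realise numerically with Kronecker products; a small sketch (the helper name lift is ours):

```python
import numpy as np
from functools import reduce

def lift(G, k, n):
    """Embed a single-qubit gate G acting on qubit k into an n-qubit register,
    as in Eq. (4): I ⊗ ... ⊗ G ⊗ ... ⊗ I."""
    ops = [np.eye(2)] * n
    ops[k] = G
    return reduce(np.kron, ops)

X = np.array([[0, 1], [1, 0]])        # NOT gate
print(lift(X, 1, 2))                  # 4x4 matrix flipping the second qubit
```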

We further restrict the type of two-qubit gate to simplify our “elementary parametrised gate set”. A two-qubit unitary gate is called imprimitive if it can map a two-qubit product state into a non-product state. A common case of an imprimitive two-qubit gate is a singly-controlled single-qubit gate C(G), which in the standard computational basis can be written as

$$ C_a(G_b)\,|x\rangle|y\rangle = |x\rangle \otimes G^x|y\rangle, \qquad (5) $$

where G is a single-qubit gate other than a global phase factor on qubit b, and the state x of qubit a is either 0 or 1 (G^0 is the identity). For example, G could be a NOT gate, in which case C(G) is simply the frequently used CNOT gate. It is known [29] that single-qubit gates together with any set of imprimitive two-qubit gates provide for quantum universality:

Observation 1. Circuits of the form (3) composed out of single-qubit gates and at least one type of imprimitive two-qubit gate generate the entire unitary group U(2^n) in a topological sense. That is, for any ε > 0 and any unitary V ∈ U(2^n), there is a circuit of the form (3) the value of which is ε-close to V.

To make the single-qubit gates trainable we need to formulate them in terms of parameters that can be learnt. The way the parametrisation is defined can have a significant impact on training, since it defines the shape of the cost function. A single-qubit gate G is a 2 × 2 unitary, which can always be written [30] as

$$ G(\alpha, \beta, \gamma, \phi) = e^{i\phi} \begin{pmatrix} e^{i\beta}\cos\alpha & e^{i\gamma}\sin\alpha \\ -e^{-i\gamma}\sin\alpha & e^{-i\beta}\cos\alpha \end{pmatrix} \qquad (6) $$

and is fully defined by four parameters {α, β, γ, φ}. For quantum gates – where we cannot physically measure overall phase factors – we may neglect the prefactor e^{iφ} and only consider three learnable parameters per gate. The advantage of using angles (instead of, for example, a parametrisation with Pauli matrices) is that training does not need an additional condition on the model parameters. A disadvantage might be the unfavourable convergence properties of trigonometric functions close to their optima.
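For reference, the three-parameter gate (global phase dropped) in code, together with a unitarity check (a sketch; the function name G mirrors the notation above):

```python
import numpy as np

def G(alpha, beta, gamma):
    """Parametrised single-qubit gate of Eq. (6) without the global phase."""
    return np.array([
        [np.exp(1j*beta)*np.cos(alpha),    np.exp(1j*gamma)*np.sin(alpha)],
        [-np.exp(-1j*gamma)*np.sin(alpha), np.exp(-1j*beta)*np.cos(alpha)],
    ])

g = G(0.3, 0.1, -0.2)
print(np.allclose(g.conj().T @ g, np.eye(2)))   # True: G is unitary
```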

Note that there may be much more efficient “elementary parametrised gate sets” for specific hardware, since some single-qubit gates might naturally be parametrised in the device (i.e., where the parameter corresponds to the intensity of a laser pulse). For the agnostic case we consider here, every parametrised gate has to be decomposed into the constant elementary gate set of the physical device, which adds an efficient overhead per gate that depends on the fidelity with which we seek to approximate it (see Section IV E).

We treat the circuit architecture, i.e. which qubit a certain gate acts on and where to place the controls, as fixed here, and will discuss design choices in Section III. Of course, strategies to learn the circuit architecture are also worth investigating, but we expect this to be a non-trivial problem due to the vast impact that each gate choice in the architecture bears on the final state (see [31] and the discussion of “quantum chaos” in random circuits).

    C. Read out and postprocessing

After executing the quantum circuit U_θ ϕ(x) in Step 2, the measurement of the first qubit (Step 3) results in state 1 with probability [32]

$$ p(q_0 = 1; x, \theta) = \sum_{k=2^{n-1}+1}^{2^n} |(U_\theta\, \varphi(x))_k|^2. $$

To resolve these statistics we have to run the entire circuit S times and measure the first qubit. We estimate p(q_0 = 1) from these samples s_1, ..., s_S. This is a Bernoulli parameter estimation problem, which we discuss in Section IV E.

The classical postprocessing (Step 4) consists of adding a learnable bias term b to produce the continuous output of the model,

$$ \pi(x; \theta, b) = p(q_0 = 1; x, \theta) + b. \qquad (7) $$

Thresholding this value finally yields the binary output that is the overall prediction of the model:

$$ f(x; \theta) = \begin{cases} 1 & \text{if } \pi(x; \theta) > 0.5, \\ 0 & \text{else.} \end{cases} $$

In Dirac notation, the measurement result can be written as the expectation value of a σ_z operator acting on the first qubit, measured after applying U to the initial state |ϕ(x)⟩. In the absence of a nonlinear activation, the expectation value of the σ_z operator on the subspace of the first qubit is given by

$$ E(\sigma_z) = \langle \varphi(x) | U^\dagger (\sigma_z \otimes I \otimes \dots \otimes I)\, U | \varphi(x) \rangle, $$

and we can retrieve the continuous output via

$$ \pi(x; \theta) = \left( \frac{E(\sigma_z)}{2} + \frac{1}{2} \right) + b. \qquad (8) $$

FIG. 4. Generic model circuit architecture for 8 qubits. The circuit consists of two ‘code blocks’ B_1 and B_3 with a range of controls of r = 1 and r = 3, respectively. The circuit contains 17 trainable single-qubit gates G = G(α, β, γ), as well as 16 trainable controlled single-qubit gates C(G), which have in turn to be decomposed into the elementary constant gate set used by the quantum computer on which it is implemented. If the optimisation methods are used to reduce the controlled gates to a single parameter, we have 3 · 33 + 1 = 100 parameters to learn in total for this model circuit. These 100 parameters are used to classify inputs of 2^8 = 256 dimensions, which shows that the circuit-centric classifier is a much more compact model than a conventional feed-forward neural network.

    III. CIRCUIT ARCHITECTURES

Our initial goal was to build a classifier that at its core has a low-depth quantum circuit. With the circuit decomposed into L single or controlled single-qubit gates, we therefore want to constrain L to be polynomial in n, which allows us to do inference with a number of elementary quantum operations that grows only polylogarithmically in the dimension of the dataset. However, this obviously comes at a price. The vectors of the form U_θ|0...0⟩ exhaust only a small subset of the Hilbert space of n qubits. In other words, the set of amplitude vectors ϕ′ = U_θ ϕ(x) that the circuit can ‘reach’ is limited. In machine learning terms, this limits the flexibility of the classifier. Much like in classical machine learning, the challenge of finding a generic circuit architecture is therefore to engineer circuits (3) of polynomial depth that still create powerful classifiers for a subclass of datasets.

    A. Strongly entangling circuits

A natural approach to the problem of circuit design is to consider circuits that prepare strongly entangled quantum states. For one, such circuits can reach ‘wide corners of the Hilbert space’ with U_θ|0, ..., 0⟩. Argued in reverse, they have a better chance of projecting an input data state |ϕ(x)⟩ with class label y onto the subspace |y⟩ ⊗ |η⟩, η ∈ C^{2^{n-1}}, which corresponds to a decision of p(q_0) = 0, 1 in our classifier (for a zero bias). Moreover, from a theoretical point of view a classifier has to capture both short- and long-range correlations in the input data, and there is mounting evidence [19, 33] that shallow circuits may be suitable for this purpose when they are strongly entangling.

More specifically, we compose the circuit (3) out of several code blocks B (see dotted boxes in the example in Figure 4). A code block consists of a layer of single-qubit gates G = G(α, β, γ) applied to each of the n qubits, followed by a layer of n/gcd(n, r) controlled gates, where r is the ‘range’ of the control and gcd(n, r) is the greatest common divisor of n and r. For j ∈ [1..n/gcd(n, r)], the j-th two-qubit gate C_{c_j}(G_{t_j}) of a block has qubit number t_j = (jr − r) mod n as the target and qubit number c_j = jr mod n as the control. A full block has the following composition:

$$ B = \prod_{k=0}^{n-1} C_{c_k}(G_{t_k}) \prod_{j=0}^{n-1} G_j. \qquad (9) $$
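The control/target pattern of the entangling layer can be generated directly from these formulas; a small sketch (the helper name block_wiring is ours):

```python
import math

def block_wiring(n, r):
    """(control, target) pairs of the entangling layer in a code block (Eq. (9)):
    the j-th controlled gate has control (j*r) mod n and target (j*r - r) mod n."""
    m = n // math.gcd(n, r)             # number of controlled gates in the block
    return [((j*r) % n, (j*r - r) % n) for j in range(1, m + 1)]

print(block_wiring(8, 1))   # ring of nearest-neighbour controls (block B1 in Fig. 4)
print(block_wiring(8, 3))   # range-3 controls (block B3 in Fig. 4)
```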

We observe that such a code block is capable of entangling/unentangling all the qubits whose numbers are a multiple of gcd(n, r). In particular, assuming r is relatively prime with n, all n qubits can be entangled/unentangled.

FIG. 5. Illustration of the first step of the proof of Observation 2, for an example of the first 5 gates of a code block of 3 qubits with range r = 1. Decomposing the controlled rotations and merging single-qubit gates reduces the number of parameters needed to represent the model circuit architecture. For simplicity the gates are displayed without indices or parameters.

As an example that demonstrates the entangling power of the circuit, select a block with n = 4, r = 1. Let all controlled gates be CNOTs and let all single-qubit gates be identities, except for G_0 = G_2 = H, which are Hadamard gates. Applying the circuit to the basis product state |0000⟩ we get the state

$$ |\psi\rangle = \frac{1}{2}\left( |00\rangle|00\rangle + |01\rangle|11\rangle + |10\rangle|01\rangle + |11\rangle|10\rangle \right). $$

If A is the subsystem consisting of qubits 0, 1 and B the subsystem of qubits 2, 3, then the marginal density matrix corresponding to the state |ψ⟩ and the partitioning A ⊗ B is completely mixed. Therefore the state |ψ⟩ strongly entangles the two subsystems.
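This claim is quick to verify numerically (our own check, assuming the big-endian ordering |q0 q1 q2 q3⟩ of basis states):

```python
import numpy as np

# |psi> = (|0000> + |0111> + |1001> + |1110>)/2 in the 16-dimensional basis
psi = np.zeros(16)
for idx in (0b0000, 0b0111, 0b1001, 0b1110):
    psi[idx] = 0.5

# reduced density matrix of subsystem A = qubits (0,1): trace out qubits (2,3)
M = psi.reshape(4, 4)                  # rows: (q0,q1), columns: (q2,q3)
rho_A = M @ M.conj().T
print(np.round(rho_A, 3))              # 0.25 * identity: maximally mixed marginal
```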

    B. Optimising the architecture

The definition of the code block as per Equation (9) is fairly redundant. It turns out that the parameter space of the circuit (9) can for practical purposes be reduced to roughly 5n parameters. For this we need to introduce a controlled phase gate C_j(P_k(φ)), φ ∈ R, that applies the phase shift e^{iφ} to a standard basis vector if and only if both the j-th and the k-th qubits are in state |1⟩. (Note the symmetry of the definition, which means that it does not matter which of the qubits is the control and which is the target.)

Observation 2. A circuit block of the form (9) can, up to a global phase, be uniquely rewritten as

$$ B = \prod_{k=0}^{n-1} RX_k\, C_{c_k}(P_{t_k}) \prod_{j=0}^{n-1} G_j, \qquad (10) $$

where ∀j, G_j ∈ SU(2) are single-qubit gates with the usual three parameters (and, moreover, G_j is an axial rotation for j > 0), P is a single-parameter phase gate, and RX is a single-parameter X-rotation.

Proof. The proof is based on transformations of the C_a(G_b) gates and subsequent normalisations of the single-qubit unitaries. Let us diagonalise the single-qubit unitary G = Q D Q†, where Q is some other single-qubit unitary and D = diag(e^{iλ}, e^{i(λ+φ)}) with λ, φ ∈ R is the diagonal matrix of the eigenvalues. Then C_a(G_b) = Q_b C_a(P_b(φ)) Q_b† P_a(λ) (see Figure 5). We further merge the Q_b† with the corresponding G_b of the code block (9). In case a = 0, the P_a(λ) can be commuted to the beginning of the layer and merged with G_0. In case a ≠ 0, the P_a(λ) can be commuted through to the end of the layer and either merged into the next layer or, if we are looking at the last layer in the circuit, traced out. At this point the action of the circuit (9) is equivalent to that of $(\prod_{k=0}^{n-1} \tilde{Q}_{t_k} C_{c_k}(P_{t_k})) \prod_{j=0}^{n-1} \tilde{G}_j$, where G̃_j and Q̃_{t_k} are updated single-qubit gates resulting from the merging operation. Note that the single-qubit gates in the following layer are also updated.

In the second round we split all the single-qubit gates Q̃, up to a global phase, into a product of three rotations, Q̃ ∼ RZ(µ_1) RX(µ_2) RZ(µ_3). We conclude the proof by noting that each of the diagonal operators RZ(µ_1)_{t_j}, RZ(µ_3)_{t_j} can be commuted through all controlled phase gates to either the end of the layer or to the beginning (in which case it can be merged with one of the G̃_j gates).

To summarise, with the possible exception of the last layer in the classifier, a layer is described (up to a global phase) by at most 5n parameters: at most n for all controlled phase gates C(P), at most n for all X-rotations RX, and at most 3n for all fully parametrised single-qubit gates G.

    C. Graphical representation of gates

As a product of elementary gates, the model circuit U_θ can be understood as a sequence of linear layers of a neural network with the same number of units in each “hidden layer”. This perspective facilitates the comparison of the circuit-centric quantum classifier with widely studied neural network models, and visualises the connectivity power of (controlled) single-qubit gates. The position of the qubit (as well as the control) determines the architecture of each layer, i.e. which units are connected and which “weights” are tied in a “gate-layer”.

FIG. 6. Graphical representation of quantum gates. Left: a single-qubit gate applied to the first qubit of a 2-qubit register. Right: a single-qubit gate and a controlled single-qubit gate applied to a two-qubit register. A solid line corresponds to a unit weight, while the other lines stand for a variable weight parameter. The same line styles indicate the same weights.

To show an example, consider a Hilbert space of dimension 2^n with n = 2 qubits |q_0 q_1⟩. A single-qubit unitary G applied to q_0 would have the following matrix representation,

$$ G_0 = \begin{pmatrix} e^{i\beta}\cos\alpha & 0 & e^{i\gamma}\sin\alpha & 0 \\ 0 & e^{i\beta}\cos\alpha & 0 & e^{i\gamma}\sin\alpha \\ -e^{-i\gamma}\sin\alpha & 0 & e^{-i\beta}\cos\alpha & 0 \\ 0 & -e^{-i\gamma}\sin\alpha & 0 & e^{-i\beta}\cos\alpha \end{pmatrix}, $$

while the same unitary controlled by qubit q_1, C_1(G_0), has matrix representation

$$ C_1(G_0) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & e^{i\beta}\cos\alpha & 0 & e^{i\gamma}\sin\alpha \\ 0 & 0 & 1 & 0 \\ 0 & -e^{-i\gamma}\sin\alpha & 0 & e^{-i\beta}\cos\alpha \end{pmatrix}. $$

At the same time, these two gates can be understood as layers with connections displayed in Figure 6.
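The controlled matrix above can be assembled from projectors on the control qubit, which also makes the structure in Figure 6 explicit; a short numpy check (helper names are ours):

```python
import numpy as np

def G(alpha, beta, gamma):
    return np.array([
        [np.exp(1j*beta)*np.cos(alpha),    np.exp(1j*gamma)*np.sin(alpha)],
        [-np.exp(-1j*gamma)*np.sin(alpha), np.exp(-1j*beta)*np.cos(alpha)],
    ])

P0, P1 = np.diag([1, 0]), np.diag([0, 1])   # projectors on |0> and |1> of q1

def c1_g0(alpha, beta, gamma):
    """C_1(G_0) on |q0 q1>: apply G to q0 iff the control qubit q1 is |1>."""
    return np.kron(np.eye(2), P0) + np.kron(G(alpha, beta, gamma), P1)

U = c1_g0(0.3, 0.1, -0.2)
print(np.allclose(U.conj().T @ U, np.eye(4)))   # True: still unitary
```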

It becomes obvious that a single-qubit gate connects two sets of two variables with the same weights; in other words, it ties the parameters of these connections. The control removes half of the ties and replaces them with identities. A quantum circuit can therefore be understood as an analogue of a neural network architecture with highly symmetric, unitary linear layers, where controls break some of the symmetry. Note that although we speak of linear layers here, the weights (i.e., the entries of the weight matrix representing a gate) have a nonlinear dependency on the model parameters θ, a circumstance that plays a role for the convergence of the hybrid training method.

    IV. TRAINING

We consider a stochastic gradient descent method for training. The parameters that define every single-qubit gate of the quantum circuit are, at every stage of the quantum algorithm, classical values. However, we are computing the model function on a quantum device, and therefore have no ‘classical’ access to its gradients. This means that the training procedure has to be a hybrid scheme that combines classical processing to update the parameters with quantum information processing to extract the gradients. We will show how to use the quantum circuit to extract estimates of the analytical gradients, as opposed to other proposals for variational algorithms based on derivative-free or finite-difference gradients (see [34]). A related approach, but for a different gate representation, was proposed during the time of writing in Ref. [17].

    A. Cost function

We choose a standard least-squares objective to evaluate the cost of a parameter configuration θ and a bias b, given a training set D = {(x^1, y^1), ..., (x^M, y^M)}:

$$ C(\theta, b; D) = \frac{1}{2} \sum_{m=1}^{M} |\pi(x^m; \theta, b) - y^m|^2, $$

where π is the continuous output of the model defined in Equation (7). Note that we can easily add a regularisation term (i.e., an L1 or L2 regulariser) to this objective, since it does not require any additional quantum information processing. For the sake of simplicity we do not consider regularisation in this paper.

Gradient descent updates each parameter µ from the set of circuit parameters θ via

$$ \mu^{(t)} = \mu^{(t-1)} - \eta\, \frac{\partial C(\theta, b; D)}{\partial \mu}, $$

and similarly for the bias,

$$ b^{(t)} = b^{(t-1)} - \eta\, \frac{\partial C(\theta, b; D)}{\partial b}. $$

The learning rate η can be adapted during training, and we can also add momenta to the updates, which can significantly decrease the convergence time.
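For concreteness, one vanilla update step in code (a sketch; in the hybrid scheme the gradients would be estimated on the QPU as described in Sections IV B and IV C):

```python
import numpy as np

def update(theta, b, grad_theta, grad_b, eta=0.05):
    """One gradient descent step on the circuit parameters and the bias."""
    return theta - eta * grad_theta, b - eta * grad_b

theta, b = np.random.default_rng(0).uniform(0, 2*np.pi, size=9), 0.0
theta, b = update(theta, b, grad_theta=np.full(9, 0.1), grad_b=-0.02)
```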

In stochastic gradient descent, we do not consider the entire training set D in every iteration, but only a subset or batch B ⊂ D [35]. The derivatives in the parameter updates are therefore taken with respect to C(θ, b; B) instead of C(θ, b; D). In principle, quantum computing allows us to encode a batch of B training inputs into a quantum state in superposition and feed it into the classifier, which can be used to extract gradients for the updates from the quantum device. However, guided by the design principle of a low-depth circuit, this would extend the state preparation routine to be in O(BN) for general cases, where N is the size of each input in the batch (which becomes even worse for more sophisticated feature maps in Step 1). We therefore consider single-batch gradient descent here (i.e., B = 1), where only one randomly sampled training input is considered in each iteration. Single-batch stochastic gradient descent can have favourable convergence properties, for example in cases where there is a lot of data available [36].

FIG. 7. Idea of the hybrid training method. The quantum processing unit (QPU) is used to compute outputs and gradients of the model in order to update the parameters for each step of the gradient descent training.

    B. Hybrid gradient descent scheme

The derivative of the objective function with respect to a model parameter ν = b, µ (where µ ∈ θ is a circuit parameter) for a single data sample (x^m, y^m) is calculated as

$$ \frac{\partial C}{\partial \nu} = (\pi(x^m; \nu) - y^m)\, \partial_\nu \pi(x^m; \nu). $$

Note that π(x^m; ν) is a real-valued function, and the y^m and the parameters are also real-valued. Hence ∂C/∂ν ∈ R. While π(x^m; ν) is a simple prediction we can get from the quantum device, and y^m is a target from the classical training set, we have to look closer at how to compute the gradient ∂_ν π. For ν = b this is in fact trivial, since

$$ \partial_b \pi(x^m; b) = 1. $$

In the case of ν = µ, the gradient forces us to compute derivatives of the unitary operator. In the following we will calculate the gradients in vector as well as in Dirac notation, and show how a trick allows us to estimate these gradients using a slight variation of the model circuit.

The derivative of the continuous output of the model with respect to the circuit parameter µ is formally given by

$$
\partial_\mu \pi(x^m; \mu) = \partial_\mu\, p(q_0 = 1; x^m, \theta)
= \partial_\mu \sum_{k=2^{n-1}+1}^{2^n} (U_\theta\, \varphi(x))_k^\dagger\, (U_\theta\, \varphi(x))_k
= 2\,\mathrm{Re}\Big\{ \sum_{k=2^{n-1}+1}^{2^n} (\partial_\mu U_\theta\, \varphi(x))_k^\dagger\, (U_\theta\, \varphi(x))_k \Big\}.
$$

The last expression contains the ‘derivative of the circuit’, ∂_µ U_θ, which is given by

$$ \partial_\mu U_\theta = U_L \cdots (\partial_\mu U_i) \cdots U_1, $$

where we assume for simplicity that only the parametrised gate U_i depends on the parameter µ. If the parameters of different unitary matrices are tied, then the derivative can simply be found by applying the product rule.

In Dirac notation, we have expressed the probability of measuring the first qubit in state 1 through the expectation value of a σ_z operator acting on the same qubit, p(q_0 = 1; x^m, θ) = ½(⟨U_θ ϕ(x^m)|σ_z|U_θ ϕ(x^m)⟩ + 1) (see Equation (8)). We can use this expression to write the gradient in Dirac notation,

$$ \partial_\mu \pi(x^m; \theta, b) = \mathrm{Re}\{ \langle (\partial_\mu U_\theta)\varphi(x^m) | \sigma_z | U_\theta\, \varphi(x^m) \rangle \}. \qquad (11) $$

This notation reveals the challenge in computing the gradients using the quantum device. The gradient of a unitary is not necessarily a unitary, which means that |(∂_µ U_θ)ϕ(x^m)⟩ is not a quantum state that can arise from a quantum evolution. How can we still estimate gradients using the quantum device?

    C. Classical linear combinations of unitaries

It turns out that in our architecture we can always represent ∂_µ U_θ as a linear combination of unitaries. The linear combination of unitaries is a known technique in quantum mechanics [37], where the sum is implemented in a coherent fashion. In our case, where we allow for classical postprocessing, we do not have to apply unitaries in superposition, but can simply run the quantum circuit several times and collect the output. This is what we call a classical linear combination of unitaries here.

Consider the derivative of U_i for the single-qubit gate defined in Equation (4),

$$ \partial_\mu U_i = I \otimes \cdots \otimes \partial_\mu G(\alpha, \beta, \gamma) \otimes \cdots \otimes I, $$

where G(α, β, γ) is given in the parametrisation introduced in Equation (6), discounting the global phase. The derivatives of the single-qubit gate G(α, β, γ) with respect to the parameters µ = α, β, γ are as follows:

$$ \partial_\alpha G = G(\alpha + \tfrac{\pi}{2}, \beta, \gamma), \qquad (12) $$
$$ \partial_\beta G = \tfrac{1}{2} G(\alpha, \beta + \tfrac{\pi}{2}, 0) + \tfrac{1}{2} G(\alpha, \beta + \tfrac{\pi}{2}, \pi), \qquad (13) $$
$$ \partial_\gamma G = \tfrac{1}{2} G(\alpha, 0, \gamma + \tfrac{\pi}{2}) + \tfrac{1}{2} G(\alpha, \pi, \gamma + \tfrac{\pi}{2}). \qquad (14) $$

One can see that while the derivative with respect to α requires us to implement the same gate but with the first parameter shifted by π/2, the derivative with respect to µ = β [µ = γ] is a linear combination of single-qubit gates where the original parameter β [γ] is shifted by π/2, while γ [β] is replaced by 0 or π.
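A quick finite-difference check of the derivative identities (12)–(14) (our own verification sketch):

```python
import numpy as np

def G(a, b, c):
    """Single-qubit gate of Eq. (6) without the global phase."""
    return np.array([
        [np.exp(1j*b)*np.cos(a),   np.exp(1j*c)*np.sin(a)],
        [-np.exp(-1j*c)*np.sin(a), np.exp(-1j*b)*np.cos(a)],
    ])

a, b, c, h = 0.4, 0.7, -0.3, 1e-7
fd = lambda f: (f(h) - f(0.0)) / h                       # forward difference

print(np.allclose(fd(lambda e: G(a+e, b, c)), G(a + np.pi/2, b, c), atol=1e-6))
print(np.allclose(fd(lambda e: G(a, b+e, c)),
                  0.5*G(a, b + np.pi/2, 0) + 0.5*G(a, b + np.pi/2, np.pi), atol=1e-6))
print(np.allclose(fd(lambda e: G(a, b, c+e)),
                  0.5*G(a, 0, c + np.pi/2) + 0.5*G(a, np.pi, c + np.pi/2), atol=1e-6))
```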

Differentiating a controlled single-qubit gate is not as immediate, but fortunately we have

$$ \partial_\mu C(G) = \frac{1}{2}\left( C(\partial_\mu G) - C(-\partial_\mu G) \right), $$

which means that the derivative of the controlled single-qubit gate is half of the difference between a controlled derivative gate and the controlled negative version of that gate. In our design, when µ = α, each of the two controlled gates is unitary, while µ = β, γ requires us to use the linear combinations in (13) and (14).

If we plug the gate derivatives back into the expression for the gradient in Equation (11), we see that the gradients, irrespective of the gate or parameter, can be computed as ‘classical’ linear combinations of the form

$$ \partial_\mu \pi(x^m; \theta, b) = \sum_{j=1}^{J} a_j\, \mathrm{Re}\{ \langle U_{\theta[j]}\, \varphi(x^m) | \sigma_z | U_\theta\, \varphi(x^m) \rangle \}, $$

where θ[j] is a modified vector of parameters corresponding to a term appearing in Equations (12)–(14), and a_j is the corresponding coefficient, also stemming from Equations (12)–(14). If there is no parameter tying between the constituent gates, for example, then J is either 2 or 4, depending on whether the gate containing parameter µ is a one- or two-qubit gate. For each circuit, the eventual derivative has to be estimated by repeated measurements, and we will discuss the number of repetitions in the following section.

The last thing to show is that we can compute the terms Re{⟨U_{θ[j]} ϕ(x^m)|σ_z|U_θ ϕ(x^m)⟩} with the quantum device, so that classical multiplication and summation can deliver estimates of the desired gradients.

Observation 3. Given two unitary quantum circuits A and B that act on an n-qubit register to prepare the two quantum states |A⟩, |B⟩, and which can be applied conditioned on the state of an ancilla qubit, we can use the quantum device to sample from the probability distribution p = ½ + ½ Re⟨A|B⟩.

Proof. The proof of this observation follows from the exact same reasoning that underlies the Hadamard test. For concreteness, we specify the algorithm below. We use the circuits A, B to prepare the two states |A⟩, |B⟩ conditioned on an ancilla,

$$ \frac{1}{\sqrt{2}} \left( |0\rangle|A\rangle + |1\rangle|B\rangle \right). $$

Applying a Hadamard on the ancilla yields

$$ \frac{1}{2} \left( |0\rangle(|A\rangle + |B\rangle) + |1\rangle(|A\rangle - |B\rangle) \right), $$

where the probability of the ancilla being in state 0 is given by

$$ p(a = 0) = \frac{1}{2} + \frac{1}{2}\, \mathrm{Re}\langle A|B\rangle. $$

To use this interference routine we have to add an extra qubit and implement U_θ and U_{θ[j]} conditioned on the state of the ancilla. Since these two circuits coincide in all but one gate, we in fact only need to apply the differing gate in conditional mode. This turns a single-qubit gate into a singly-controlled single-qubit gate, and a controlled gate into a doubly-controlled gate. The desired value Re⟨A|B⟩ can be derived by resolving p(a = 0) through measurements and computing

$$ \mathrm{Re}\langle A|B\rangle = 2\,p(a = 0) - 1. $$
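A small simulation of this estimation procedure, with random states standing in for U_θ ϕ(x^m) and U_{θ[j]} ϕ(x^m) (all names here are our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_state(dim):
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

A, B = random_state(4), random_state(4)

# after the conditioned preparations and the Hadamard on the ancilla, the
# ancilla branches are |0>(|A>+|B>)/2 and |1>(|A>-|B>)/2
p0 = np.linalg.norm((A + B) / 2)**2             # p(a = 0) = 1/2 + 1/2 Re<A|B>

S = 100_000
samples = rng.random(S) < p0                    # S ancilla measurements
print(2 * samples.mean() - 1)                   # estimated Re<A|B>
print(np.vdot(A, B).real)                       # exact Re<A|B>
```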

    D. Dropout

Despite the relatively small parameter space, our circuit-centric architecture is not immune to overfitting. Benchmarking on smaller datasets reveals cases where the training data is fit perfectly (zero misclassifications) by a model with exponentially few parameters, but the same model has significant generalisation errors on the test holdout.

An approach that often helps is a simple dropout regularisation that is both quantum-inspired and quantum-ready (in the sense that it is easy in both classical simulation and quantum execution). The essence of the approach is to randomly select and measure one of the qubits, and set it aside for a certain number N_dropout of parameter update epochs. After that, the qubit is re-added to the circuit and another qubit (or, perhaps, no qubit) is randomly dropped. This strategy works by “smoothing” the model fit; it generally inflates the training error, but often deflates the generalisation error.

The effect of such dropout regularisation is similar, in spirit, to dropout regularisation in a traditional neural network when the dropout probability p = 0.5 is used. Indeed, freezing a randomly chosen qubit for a certain number of epochs prevents half of the amplitudes in the amplitude encoding from affecting the stochastic gradient during these epochs. In the graphical representation of the circuit-centric classifier this is analogous to removing half of the nodes from a hidden layer for a certain number of epochs.

    E. Performance analysis

In order to use the circuit-centric quantum classifier with near-term quantum devices, we need to motivate that it only requires a small number of qubits, a low circuit depth, as well as a high error tolerance. After introducing the details of the algorithms for inference and training, we want to discuss these three points in more detail.

    1. Circuit depth and width

The number of qubits needed for the circuit-centric quantum classifier (if we use amplitude encoding as explained in Section II A) is given by d⌈log₂ N⌉, where N is the dimension of the inputs and d is the number of copies we consider for a tensorial feature map. For example, if d = 1, we can process a dataset of 1000-dimensional inputs with n = 10 qubits. With about 50 qubits we can use a tensorial feature map with d = 5 (i.e., prepare 5 copies of the state) and map the data into a 2^50-dimensional feature space. For the inner-product subroutine in the hybrid training scheme, we need one extra ancilla qubit. The algorithm is therefore very compact as far as circuit width is concerned, a feature stemming from the amplitude encoding strategy.

The bottleneck of the circuit depth is the state preparation routine S_x. Comparably, implementing the model circuit costs a negligible amount of resources. Using an architecture with K code blocks of ranges (r_1, ..., r_K) and n qubits, we need

$$ Kn + \sum_{k=1}^{K} n/\gcd(n, r_k) $$

parametrised (controlled) single-qubit gates to implement U_θ, which is polynomial in the number of qubits. Each of these gates has to be decomposed into the elementary constant gate set used in the physical implementation of the quantum computer. Every parametrised single-qubit gate can be efficiently translated into a circuit G̃ of at most O(log 1/δ) constant elementary gates from a given gate set such as “Clifford + T”, to a fidelity of at least (1 − δ) (cf. [38–41]). Methods such as automated optimisation [42] may reduce the costs further.

General state preparation can in the worst case require c_cn 2^n CNOT gates as well as c_sgl 2^n single-qubit gates. For current algorithms, c_sgl and c_cn are equal to or slightly larger than 1 [22–24, 43, 44]. This means that for the example of N = 1000 from above, we would indeed require 2 · 2^10 = 2048 gates just to prepare the states. Issues of fidelity arise, since without error correction we cannot guarantee to prepare a close enough approximation of x. Our simulations show that adding 5% noise to the inputs does not change the classification result significantly, which suggests that the classifier is rather robust against input noise. Still, until error correction becomes a reality, it is advisable to focus on lower-dimensional datasets. Two interesting exceptions have to be mentioned. First, if an algorithm is known that efficiently approximates the (preprocessed) inputs with a product state, x ≈ a_1 ⊗ · · · ⊗ a_K, the resources reduce to the number of gates required to prepare a_1, ..., a_K in amplitude encoding [25]. Second, like other authors in quantum machine learning research, we point out that if the data is given by a shallow and robust digital quantum simulation routine performed on the same register of qubits, our classifier can be used to train with ‘quantum data’, or inputs that are ‘true’ wavefunctions.

    2. Number of repetitions for output estimation

The continuous output of the circuit-centric quantum classifier is based on the probability of measuring the first qubit in state 1. To resolve this number, we have to repeat the entire algorithm multiple times. Each measurement samples from the Bernoulli distribution p(q_0 = 1) = ν, and we want to estimate ν from the S samples q_0^1, ..., q_0^S. The number of samples needed to estimate ν at error ε with probability > 2/3 scales as O(Var(σ_z)/ε²), where Var(σ_z) is the variance of the sigma-z operator that we measure with respect to the final quantum state [12, 34]. If amplitude estimation is used, then the number of repetitions of the circuit-centric classifier falls into O(1/ε), at the price of increasing the circuit depth by a factor of O(1/ε).
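A minimal illustration of this Bernoulli estimation (our own sketch): the sampling error shrinks as 1/√S, which is the quadratic overhead that amplitude estimation would remove.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_p(nu, S):
    """Estimate nu = p(q0 = 1) from S repeated runs of the circuit."""
    return rng.binomial(1, nu, size=S).mean()

for S in (100, 10_000, 1_000_000):
    print(S, abs(estimate_p(0.73, S) - 0.73))   # error decreases roughly as 1/sqrt(S)
```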

    3. Parameter noise

An important feature of the circuit-centric classifier is its robustness to noise in the inputs and parameters. Suppose δ > 0 is some small value and we allow parameter perturbations (resp. input perturbations) such that for each constituent gate G the perturbed gate G′ is δ-close to G, ||G − G′|| < δ (or, for an encoded input ϕ, a perturbed input ϕ′ is δ-close to ϕ). In other words, we allow certain imprecisions in some or all parameter values, bounded by the constant δ. Since all the constituent operations are unitary, the impact of the parameter imprecisions is never amplified across the circuit, and the defect imposed by the imperfect circuit on the final state before the measurement is bounded by 4Lδ in the worst theoretical case, where L is the number of elementary parametrised gates, which have at most 4 parameters each. In practice the propagated error should be much smaller than this bound.

Noise level | Mean RI | St.Dev. RI | Max RI | Min RI
0.1%        | 1%      | 1.47%      | 3.5%   | 0%
1%          | 7%      | 6.67%      | 17%    | 0%
10%         | 60.2%   | 55.8%      | 192.3% | 0%

TABLE I. Relative impact (RI) of uncorrelated parameter noise on the classification test error over the SEMEION and MNIST256 data, using the generic 8-qubit model circuit displayed in Figure 4.

The same analysis applies to imperfections in the execution of the quantum gates (other than parameter drift). There is no amplification of the defect across the circuit, and the imperfection of the final state is bounded by the sum of the imperfections of the individual gates. Finally, the ket encoding of the input data does not have to be perfect either. A possible imperfection or approximation during the state preparation will not be amplified by the classification circuit, and the drift of the pre-measurement state will never be greater than the drift of the initial state. Another widely advertised advantage of variational quantum algorithms is that they can learn to counterbalance systematic errors in the device architecture – for example, when one gate always over-rotates the state by the same value.

In our simulation experiments we have systematically evaluated the effects on the quality of the classification of 0.1%, 1% and 10% random perturbations in (a) the circuit parameters and (b) the input data vectors. As expected due to unitarity, the effect of input noise is not amplified by the classifier circuit and thus had a proportionate impact on the percentage of misclassifications. Somewhat more surprisingly, random perturbations of the trained circuit parameters almost never had the worst-case estimated impact on the classification error. The 0.1% uncorrelated parameter noise in the majority of cases had no impact on the classification results.

    Here we limit the noise impact discussion to the context of the “SEMEION” and “MNIST256” data sets of the benchmark sets displayed in Table II. Our observations are easier to calibrate in this context, since both datasets are encoded with 8 qubits and the same model circuit architecture with 33 gates at depth 19 (as shown in Figure 4) is used in the classifier.

    The 0.1% parameter noise had no impact on classification in about 60% of our test runs. The maximum relative drift of the test error was 3.5% (in one of the remaining runs), the mean drift was 1% with a standard deviation of approximately 1.47%. The 1% parameter noise had a more pronounced, albeit fairly robust impact, which was non-trivial in about 90% of our test runs. The maximum relative change in the test error rate was 17%, the mean relative change was 7% with a standard deviation of approximately 6.67%. These statistics are summarized in Table I. Finally, the 10% parameter noise led to a significant loss of classification robustness (although still smaller than the worst-case analysis suggests). The maximum relative change in the test error rate was 192.3%, the mean was 60.2% with a standard deviation of 55.8%. Curiously, the minimum change was zero in one of the “SEMEION”-based runs. This suggests that a 10% perturbation of the parameters has no stable amplification pattern, and the model is best re-trained after such a perturbation.

    Noise level   RI Mean   RI St.Dev.   RI max    RI min
    0.1%          1%        1.47%        3.5%      0%
    1%            7%        6.67%        17%       0%
    10%           60.2%     55.8%        192.3%    0%

    TABLE I. Relative impact (RI) of uncorrelated parameter noise on the classification test error over SEMEION and MNIST256 data, using the generic 8-qubit model circuit displayed in Figure 4.
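    A sketch of the perturbation protocol behind Table I is given below; `evaluate_test_error` and `trained_params` are placeholders for the reader's own classifier, and multiplicative Gaussian noise is our assumption for how the relative noise levels are applied:

        import numpy as np

        def relative_impact(trained_params, evaluate_test_error,
                            noise_level=0.01, runs=50, seed=0):
            """Relative impact (RI) of uncorrelated parameter noise, cf. Table I."""
            rng = np.random.default_rng(seed)
            base = max(evaluate_test_error(trained_params), 1e-12)  # avoid /0
            impacts = []
            for _ in range(runs):
                noisy = trained_params * (1 + noise_level *
                                          rng.standard_normal(trained_params.shape))
                impacts.append(abs(evaluate_test_error(noisy) - base) / base)
            impacts = np.asarray(impacts)
            return impacts.mean(), impacts.std(), impacts.max(), impacts.min()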

    The practical takeaway from these observations is that the circuit-centric classifiers may work on small quantum computers even in the absence of strong quantum error correction.

    V. SIMULATIONS AND BENCHMARKING

    To demonstrate that the circuit-centric quantum classifier works well in comparison with classical machine learning methods, we present some simulations. The circuit-centric classifier was implemented on a classical computer using the F# programming language.

    1. Datasets

    We select five standard benchmarking datasets (see Table II). The CANCER, SONAR, WINE and SEMEION data sets are taken from the well-known UCI repository of benchmark datasets for machine learning [45]. The MNIST data set is the official Modified NIST handwritten digit recognition data set. While CANCER and SONAR are binary classification exercises, the other data sets call for multi-class classification. Although the circuit-centric quantum classifier could be operated as a multi-class classifier, we limit ourselves to the case of binary classification discussed above and cast the multi-label tasks as a set of “one-versus-all” binary discrimination subtasks. For example, in the case of digit recognition the i-th subtask would be to distinguish the digit “i” from all other digits. The results are an average taken over all subtasks.
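    The one-versus-all reduction can be written in a few lines; in the sketch below, `train_binary` stands in for training the circuit-centric classifier on a binarised label set, and the training-set error is used for simplicity:

        import numpy as np

        def one_versus_all_error(X, y, train_binary):
            """Average error over the d one-versus-all subtasks of a d-label problem."""
            errors = []
            for label in np.unique(y):
                y_bin = (y == label).astype(int)     # digit "i" vs. all other digits
                model = train_binary(X, y_bin)       # one binary classifier per label
                errors.append(np.mean(model.predict(X) != y_bin))
            return float(np.mean(errors))            # average over all subtasks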

    The data is preprocessed for the needs of the quantum algorithm. The MNIST data vectors were coarse-grained into real-valued data vectors of dimension 256. With the exception of this data set, in all other cases the data vectors have been padded non-informatively so that their dimension after padding is the nearest power of 2. After padding, each input vector was renormalized to unit norm. Thus each of the N-dimensional original data vectors would require n = ⌈log2(N)⌉ qubits to encode in amplitude encoding. In our simulations we do not use feature maps, which would multiply the input dimensions. However, we expect that the circuit-centric classifier will gain a lot of power from this strategy, which can be tested on real quantum devices without the same amount of overhead.

    ID         domain     # features   # classes   # samples   preprocessing
    CANCER     decision   32           2           569         none
    SONAR      decision   60           2           208         padding with noninformative features
    WINE       decision   13           3           178         padding with noninformative features
    SEMEION    OCR*       256          10          1593        padding with noninformative features
    MNIST256   OCR        256          10          2766        coarse-graining and deskewing

    * Optical Character Recognition

    TABLE II. Benchmark datasets and preprocessing.

    ID         model                                fixed hyperparameters                                                                        variable hyperparameters
    QC         circuit-centric quantum classifier   entangling circuit architecture                                                              dropout rate, number of blocks, range
    PERC       perceptron                           -                                                                                            regularisation type
    MLPlin     neural network                       dim. of hidden layer = N                                                                     regularisation strength, optimiser, initial learning rate
    MLPshal    neural network                       dim. of hidden layer = ⌈log2 N⌉                                                              regularisation strength, optimiser, initial learning rate
    MLPdeep    neural network                       dim. of hidden layers = ⌈log2 N⌉                                                             regularisation strength, optimiser, initial learning rate
    SVMpoly1   support vector machine               polynomial kernel with d = 1, regularisation strength of slack variables = 1, offset c = 1   -
    SVMpoly2   support vector machine               polynomial kernel with d = 2, regularisation strength of slack variables = 1, offset c = 1   -

    TABLE III. Benchmark models explained in the text and possible choices for hyperparameters.
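    The preprocessing just described amounts to the following numpy sketch; the constant used for the non-informative padding entries is an assumption, since the text does not fix its value:

        import numpy as np

        def preprocess(x, pad_value=1e-2):
            """Pad to the nearest power of two and renormalise to unit norm."""
            N = len(x)
            n = int(np.ceil(np.log2(N)))             # qubits needed: ceil(log2 N)
            padded = np.full(2 ** n, pad_value)      # non-informative entries
            padded[:N] = x
            return padded / np.linalg.norm(padded)

        x = preprocess(np.random.rand(13))           # e.g. WINE: 13 -> 16 features
        print(len(x), np.linalg.norm(x))             # 16 1.0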

    2. Benchmark models

    For the circuit-centric quantum classifier (QC) we use the data-agnostic entangling circuit of n qubits, which has been explained in Section III. We use one, two or three entangling layers, which means that the circuit contains anywhere from n + 1 to 6n + 1 single-qubit and two-qubit gates. Therefore the number of real trainable parameters varies from 3n + 2 to 18n + 2. For each dataset we selected the circuit architecture with the lowest training error while reducing overfitting.

    Defining a fair, systematic comparison is not straightforward, since there are many different models and training algorithms we could compare with, each being further specified by hyperparameters. Without the use of feature maps, our classifier has only limited flexibility, and benchmarking against state-of-the-art models such as convolutional neural networks will therefore not provide much insight. Instead, we choose to compare our model to six classical benchmark models (see Table III) that are selected for their mathematical structure, which is related to the circuit-centric classifier.

    Section III C showed an interesting parallel to neural networks, which is why we take neural networks as one benchmark model family. From this family we choose three different architectures, shown in Figure 8. The MLPlin model has a linear hidden layer of the same dimension N as the input layer and resembles the architecture of our QC model displayed in Figure 3. The MLPshal and MLPdeep models have hidden layers of size ⌈log2 N⌉. The motivation is to compare the QC with an architecture that, after the input layer of size N, gives rise to a polylogarithmic number of weights.

    FIG. 8. The three architectures of the benchmark artificial neural network models. Left is the MLPlin model, which has a hidden linear layer of the size of the input N and a logistic output layer. The MLPshal model in the middle has a hidden layer of size ⌈log2 N⌉ with nonlinear activations and a logistic output layer. The MLPdeep model on the right adds a second nonlinear hidden layer.
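    For concreteness, the three MLP baselines can be sketched as follows; the use of scikit-learn and the tanh activation are our assumptions, as the text specifies only the layer dimensions and (non)linearity:

        from math import ceil, log2
        from sklearn.neural_network import MLPClassifier

        def benchmark_mlps(N):
            h = ceil(log2(N))
            return {
                # linear hidden layer of input size N, logistic output layer
                "MLPlin": MLPClassifier(hidden_layer_sizes=(N,), activation="identity"),
                # one nonlinear hidden layer of size ceil(log2 N)
                "MLPshal": MLPClassifier(hidden_layer_sizes=(h,), activation="tanh"),
                # a second nonlinear hidden layer of the same size
                "MLPdeep": MLPClassifier(hidden_layer_sizes=(h, h), activation="tanh"),
            }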

    We use a support vector machine as a second benchmark model. Support vector machines are similar to the circuit-centric classifier, since we can also think of them as a linear classifier in a feature space that is defined by a kernel κ [46]. We mentioned in Section II A that amplitude encoding can be supplemented by a feature map which is very similar to that of a support vector machine with a polynomial kernel,

    κ(x, x′) = (x^T x′ + c)^d.

    The offset c can be loosely compared to the effect of padding. The degree d of the kernel is not one-to-one comparable to the degree of the polynomial feature map in amplitude encoding, since our QC model effectively contains an ‘extra’ nonlinearity which derives from the measurement process (and is therefore not precisely a linear model in feature space). This can be seen in Figure 9, where we compare the decision boundaries of an SVM with polynomial degree d = 1, 2 with the QC model and a feature map of degree d = 1, 2. To counterbalance the potential advantage of the QC, we consider two support vector machines, one with d = 1 (SVMpoly1) and one with d = 2 (SVMpoly2). The QC model does not use any feature maps. Finally, we add a perceptron (PERC) to the list of benchmark models to get an impression of the linear separability of the datasets.
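    With scikit-learn (our choice of library; the paper's own implementation is not specified), the two SVM baselines with the hyperparameters of Table III read:

        from sklearn.svm import SVC

        # Polynomial kernel (gamma * x^T x' + coef0)^degree; setting gamma = 1 and
        # coef0 = 1 recovers kappa(x, x') = (x^T x' + 1)^d from the text.
        svm_poly1 = SVC(kernel="poly", degree=1, gamma=1.0, coef0=1.0, C=1.0)
        svm_poly2 = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)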

    Since one of our goals is to build a model with a small parameter space, we compare the number of trainable parameters for each model in Table IV. For the MLP and PERC models these parameters are the weights between units, and their number is defined by the dimension of the inputs as well as the architecture of the network. For the SVM models we consider the number of input data points, since in their dual formulation these models start with assigning a weight to each input, after which they reduce the training set to a few support vectors used for inference.

    FIG. 9. Comparison of the decision boundary for the circuit-centric quantum classifier (QC) and a support vector machine with polynomial kernel (SVMpoly), for d = 1 and d = 2. The 2-dimensional data gets embedded into a 4-dimensional feature space (we padded with 2 non-informative features), where it is classified by the two models. The colour scale indicates the probability that the model predicts class “green circles”. For the QC model, the parameter d corresponds to the degree of the polynomial feature map in amplitude encoding (see Section II A). For the SVMpoly, d is the degree of the polynomial kernel. One can see that for d = 1 the QC model is slightly more flexible than the SVMpoly.

    ID         QC    PERC   MLPlin   MLPshal   MLPdeep   SVM
    CANCER     79    32     1056     165       190       208
    SONAR      60    60     1952     305       330       569
    WINE       28    13     272      51        60        178
    SEMEION    100   256    65792    2056      2120      1593
    MNIST256   124   256    65792    2056      2120      800

    TABLE IV. Number of parameters each model has to train for the different benchmark datasets. The circuit-centric quantum classifier QC has a logarithmic growth in the number of parameters with the input dimension N and data set size M, while all other models show a linear or polynomial parameter growth in either N or M.

    3. Results

    For every benchmarking test (except for the QC model on SEMEION and MNIST256) we use 5-fold cross-validation with 10 repetitions each. This means that the results are an average of 50 repetitions of training the model and calculating the training and test error. Due to the significant cost of quantum circuit simulations, for the QC experiments on the SEMEION and MNIST256 datasets only one repetition of the 5-fold cross-validation was carried out. The results are summarized in Table V.
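    This evaluation protocol corresponds to the following scikit-learn sketch (our rendering; the original experiments were run in F#):

        import numpy as np
        from sklearn.model_selection import RepeatedKFold

        def crossvalidate(model_factory, X, y, seed=0):
            """5-fold cross-validation with 10 repetitions: 50 train/test splits."""
            rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=seed)
            test_errors = []
            for train_idx, test_idx in rkf.split(X):
                model = model_factory().fit(X[train_idx], y[train_idx])
                test_errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
            return float(np.mean(test_errors))       # average over the 50 repetitions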

    As we can read from the non-zero training error of the PERC model, none of the datasets is linearly separable.


               CANCER        SONAR         WINE∗         SEMEION∗      MNIST256∗
    QC         0.022/0.058   0.000/0.195   0.000/0.028   0.031/0.031   0.031/0.033
    PERC       0.128/0.137   0.283/0.315   0.067/0.134   0.022/0.038   0.065/0.066
    MLPlin     0.060/0.075   0.117/0.263   0.001/0.039   0.001/0.025   0.038/0.041
    MLPshal    0.064/0.077   0.010/0.174   0.029/0.063   0.002/0.024   0.011/0.018
    MLPdeep    0.056/0.076   0.001/0.174   0.010/0.063   0.001/0.026   0.014/0.021
    SVMpoly1   0.373/0.367   0.452/0.477   0.430/0.466   0.101/0.100   0.092/0.092
    SVMpoly2   0.169/0.169   0.334/0.383   0.090/0.099   0.100/0.101   0.091/0.092
    Average    0.125/0.136   0.171/0.283   0.090/0.137   0.037/0.049   0.040/0.043

    TABLE V. Results of the benchmarking experiments. The cells are of the format “training error/validation error”. The variance between the 50 repetitions for each experiment was of the order of 0.01–0.001 for the training and test error. ∗For multilabel classification problems with d labels, the train and test errors were averaged over all d one-versus-all problems.

    One finds that the QC model performs significantly better than the SVMpoly1 and SVMpoly2 models across all data sets. In further simulations we verified that for support vector machines with a polynomial kernel, degrees of d = 6 to d = 8 return the best results on the datasets, which are also better than those of the MLP models.

    Although showing slightly worse test errors than MLPshal and MLPdeep (and with mixed success compared to MLPlin), the QC performs comparably to the MLP models, which train a lot more parameters. For the SONAR and WINE datasets we find that the QC model overfits the training data. This is even worse without using the dropout qubit regularisation technique explained above. The observation is interesting, since the QC model is ‘slimmer’ than the MLP and SVM models in terms of its parameter count. An open question is therefore how to introduce other means of regularisation.

    VI. CONCLUSION

    We have developed a machine learning design which is both quantum-inspired and implementable by near-term quantum technology. The key building block of this design is a unitary model circuit with relatively few trainable parameters that assumes amplitude encoding of the data vectors and systematically uses the entangling properties of quantum circuits as a resource for capturing correlations in the data. After state preparation, the prediction of the model is computed by applying only a small number of one- and two-qubit quantum gates. At the same time, simulating these gates on a classical computer requires a number of elementary operations that scales with the number of features. We are aware of prior designs of unitary neural nets in the literature (cf. [19, 20] and related work). In these designs the number of learnable parameters is proportional to the number of data features, whereas the size of our model circuit scales with the number of qubits and thus allows, qubit-wise, for exponentially fewer learnable parameters than what traditional methods would use. We have also shown in some preliminary simulations that the design can indeed achieve results close to off-the-shelf methods that have comparable limitations but considerably more tunable parameters.

    The quantum classifiers based on model circuits that we have explored so far belong to a class of weakly nonlinear classifiers. The main source of nonlinearity in our classifiers stems from the concluding measurement and the thresholding on the probabilities of the measurement outcomes. These probabilities are roughly quadratic in the amplitudes of the final post-circuit state. Since the overall effect of the model circuit on the amplitudes is linear and reversible, one concludes that the models we have experimented with can capture (amplitude-wise) quadratic separation of classes in the original feature space.
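    This quadratic dependence is easy to see numerically: the final state is linear in the input amplitudes, and the output probability sums the squared moduli of its ‘second half’ (cf. [32]). In the sketch below, a random unitary stands in for the trained model circuit:

        import numpy as np
        from scipy.stats import unitary_group

        n = 3                                        # toy register of 3 qubits
        U = unitary_group.rvs(2 ** n)                # stands in for the trained circuit
        x = np.random.rand(2 ** n)
        x /= np.linalg.norm(x)                       # amplitude-encoded input

        psi = U @ x                                  # final state: linear in x
        p = np.sum(np.abs(psi[2 ** (n - 1):]) ** 2)  # p(q0 = 1): quadratic in x
        print(p)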

    There is a potential for tracing class separation boundaries of higher polynomial degrees by encoding several copies of a classical data vector in disjoint quantum registers and then having the model circuit work on the tensor power of the data vector. Of course, this setup requires vastly more computational resources to simulate the quantum circuit, and we leave such experiments for the future. We furthermore expect that the most beneficial application of our quantum classifiers is to quantum data. One can conceive entangling a quantum system with a classifier circuit and using the latter to discriminate between various states of the quantum system.
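    A minimal sketch of the copy construction: the tensor power of a normalised input is again a normalised state, and measurement probabilities on it are degree-2K polynomials of the original amplitudes (here K = 2):

        import numpy as np

        x = np.random.rand(4)
        x /= np.linalg.norm(x)                       # normalised 2-qubit input
        xx = np.kron(x, x)                           # two copies in disjoint registers
        print(np.linalg.norm(xx))                    # 1.0: still a valid quantum state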

    This work contributes to the growing literature on variational circuits for machine learning applications in proposing a specific circuit architecture and parametrisation, a dropout regularisation technique, a hybrid training method, as well as a graphical interpretation of quantum operations in the language of neural networks. However, an overwhelming number of questions is still largely unexplored. For example, we noticed that our slim circuit design still suffers from overfitting. Also, full-fledged numerical benchmarks on larger datasets are needed to systematically analyse the effect of logarithmically few parameters with growing input spaces. The power of feature maps to introduce nonlinearities is another open question. Further numerical as well as theoretical studies are therefore crucial to understand the convergence properties and representational power of circuit-centric models for classification.

    ACKNOWLEDGMENTS

    The authors are grateful to Jeongwan Haah for insightful remarks and wish to thank Martin Roetteler for numerous useful discussions of the ongoing research and early drafts of this paper. MS wishes to thank the entire QuArC group at Microsoft Research, Redmond, whose collective help during her research internship was generous, professional and very welcoming.

    [1] Masoud Mohseni, Peter Read, Hartmut Neven, Sergio Boixo, Vasil Denchev, Ryan Babbush, Austin Fowler, Vadim Smelyanskiy, and John Martinis. Commercialize quantum technologies in five years. Nature, 543(7644):171, 2017.

    [2] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028, 2014.

    [3] Michael J Bremner, Richard Jozsa, and Dan J Shepherd. Classical simulation of commuting quantum computations implies collapse of the polynomial hierarchy. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, page 20100301. The Royal Society, 2010.

    [4] Joonsuk Huh, Gian Giacomo Guerreschi, Borja Peropadre, Jarrod R McClean, and Alán Aspuru-Guzik. Boson sampling for molecular vibronic spectra. Nature Photonics, 9(9):615–620, 2015.

    [5] Aram W Harrow and Ashley Montanaro. Quantum computational supremacy. Nature, 549(7671):203, 2017.

    [6] Alejandro Perdomo-Ortiz, Marcello Benedetti, John Realpe-Gómez, and Rupak Biswas. Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers. arXiv preprint arXiv:1708.09757, 2017.

    [7] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, 2015.

    [8] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195, 2017.

    [9] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113:130503, Sep 2014.

    [10] Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. arXiv preprint arXiv:1603.08675, 2016.

    [11] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Physical Review Letters, 109(5):050505, 2012.

    [12] Maria Schuld, Mark Fingerhuth, and Francesco Petruccione. Implementing a distance-based classifier with a quantum interference circuit. EPL (Europhysics Letters), 119(6):60002, 2017.

    [13] Jarrod R McClean, Jonathan Romero, Ryan Babbush, and Alán Aspuru-Guzik. The theory of variational hybrid quantum-classical algorithms. New Journal of Physics, 18(2):023023, 2016.

    [14] Jonathan Romero, Jonathan P Olson, and Alan Aspuru-Guzik. Quantum autoencoders for efficient compression of quantum data. Quantum Science and Technology, 2(4):045001, 2017.

    [15] Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit learning. arXiv preprint arXiv:1803.00745, 2018.

    [16] G. Verdon, M. Broughton, and J. Biamonte. A quantum algorithm to train neural networks using low-depth circuits. arXiv preprint arXiv:1712.05304, 2017.

    [17] E. Farhi and H. Neven. Classification with quantum neural networks on near term processors. arXiv preprint arXiv:1802.06002, 2018.

    [18] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O'Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5, 2014.

    [19] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. Journal of Machine Learning Research, 48, 2016.

    [20] Li Jing, Yichen Shen, Tena Dubček, John Peurifoy, Scott Skirlo, Max Tegmark, and Marin Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. arXiv preprint arXiv:1612.05231, 2016.

    [21] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

    [22] Mikko Möttönen, Juha J Vartiainen, Ville Bergholm, and Martti M Salomaa. Quantum circuits for general multiqubit gates. Physical Review Letters, 93(13):130502, 2004.

    [23] Emanuel Knill. Approximation by quantum circuits. arXiv preprint quant-ph/9508006, 1995.

    [24] Martin Plesch and Časlav Brukner. Quantum-state preparation with universal gate decompositions. Physical Review A, 83(3):032302, 2011.

    [25] Lov Grover and Terry Rudolph. Creating superpositions that correspond to efficiently integrable probability distributions. arXiv preprint quant-ph/0208112, 2002.

    [26] Andrei N Soklakov and Rüdiger Schack. Efficient state preparation for a register of quantum bits. Physical Review A, 73(1):012307, 2006.

    [27] These can be exact copies when direct classical access to data is available. Otherwise these could be approximate clones of the data point generated at sufficient fidelity.

    [28] Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.

    [29] J.-L. Brylinski and R. Brylinski. Universal quantum gates. pages 101–116, 2002.

    [30] Adriano Barenco, Charles H Bennett, Richard Cleve, David P DiVincenzo, Norman Margolus, Peter Shor, Tycho Sleator, John A Smolin, and Harald Weinfurter. Elementary gates for quantum computation. Physical Review A, 52(5):3457, 1995.

    [31] Sergio Boixo, Sergei V Isakov, Vadim N Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, John M Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. arXiv preprint arXiv:1608.00263, 2016.

    [32] According to the laws of quantum mechanics we have to sum over the absolute square values of the amplitudes that correspond to basis states where the first qubit is in state 1. Using the standard computational basis, this is exactly the ‘second half’ of the amplitude vector, ranging from entry 2^{n-1} + 1 to 2^n.

    [33] Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum entanglement: Fundamental connections with implications to network design. arXiv preprint arXiv:1704.01552, 2017.

    [34] Gian Giacomo Guerreschi and Mikhail Smelyanskiy. Practical optimization for hybrid quantum-classical algorithms. arXiv preprint arXiv:1701.01450, 2017.

    [35] Stochastic gradient descent originally referred to the case B = 1, but is often used as a synonym for minibatch gradient descent as opposed to batch gradient descent using the full dataset.

    [36] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

    [37] Andrew M Childs and Nathan Wiebe. Hamiltonian simulation using linear combinations of unitary operations. Quantum Information and Computation, 12:901–924, 2012.

    [38] A. Bocharov, Y. Gurevich, and K.M. Svore. Efficient decomposition of single-qubit gates into V basis circuits. Physical Review A, 88:012313, 2013.

    [39] V. Kliuchnikov, A. Bocharov, and K.M. Svore. Asymptotically optimal topological quantum compiling. Physical Review Letters, 112:140504, 2014.

    [40] P. Selinger. Efficient Clifford+T approximation of single-qubit operators. Quantum Information and Computation, 15:159–180, 2015.

    [41] V. Kliuchnikov, A. Bocharov, M. Roetteler, and J. Yard. A framework for approximating qubit unitaries. arXiv preprint arXiv:1510.03888, 2016.

    [42] Yunseong Nam, Neil J Ross, Yuan Su, Andrew M Childs, and Dmitri Maslov. Automated optimization of large quantum circuits with continuous parameters. arXiv preprint arXiv:1710.07345, 2017.

    [43] Juha J Vartiainen, Mikko Möttönen, and Martti M Salomaa. Efficient decomposition of quantum gates. Physical Review Letters, 92(17):177902, 2004.

    [44] Raban Iten, Roger Colbeck, Ivan Kukuljan, Jonathan Home, and Matthias Christandl. Quantum circuits for isometries. Physical Review A, 93(3):032318, 2016.

    [45] The most recently accessed location of the UCI database is https://archive.ics.uci.edu/ml/datasets.html (more specifically, the CANCER data set is retrieved from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29, the SEMEION from https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit, the SONAR from https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar%2C+Mines+vs.+Rocks%29 and the WINE from https://archive.ics.uci.edu/ml/datasets/Wine).

    [46] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

