compiler design for distributed quantum computing

23
Compiler Design for Distributed Quantum Computing Davide Ferrari 1 , Angela Sara Cacciapuoti 2,3 , Michele Amoretti 1 , and Marcello Caleffi 2,3 1 Quantum Software, Department of Engineering and Architecture (DIA), University of Parma, Parma, 43124 Italy (e-mail: [email protected], [email protected]). Web: http://www.qis.unipr.it/quantumsoftware.html 2 FLY: Future Communications Laboratory, Department of Electrical Engineering and Information Technology (DIETI), University of Naples Federico II, Naples, 80125 Italy (e-mail: marcello.caleffi@unina.it, [email protected]). Web: www.quantuminternet.it. 3 Laboratorio Nazionale di Comunicazioni Multimediali, National Inter-University Consortium for Telecommunications (CNIT), Naples, 80126 Italy. Abstract In distributed quantum computing architectures, with the network and communications functionalities pro- vided by the Quantum Internet, remote quantum pro- cessing units (QPUs) can communicate and cooperate for executing computational tasks that single NISQ de- vices cannot handle by themselves. To this aim, dis- tributed quantum computing requires a new generation of quantum compilers, for mapping any quantum algo- rithm to any distributed quantum computing architec- ture. With this perspective, in this paper, we first dis- cuss the main challenges arising with compiler design for distributed quantum computing. Then, we analytically derive an upper bound of the overhead induced by quan- tum compilation for distributed quantum computing. The derived bound accounts for the overhead induced by the underlying computing architecture as well as the additional overhead induced by the sub-optimal quan- tum compiler – expressly designed through the paper to achieve three key features, namely, general-purpose, efficient and effective. Finally, we validate the analyt- ical results and we confirm the validity of the compiler design through an extensive performance analysis. Index terms— Quantum Internet, Quantum Net- works, Distributed Quantum Computing, Distributed Quantum Systems, Quantum Compiling. 1 Introduction Current quantum computers are commonly defined as noisy intermediate-scale quantum (NISQ) devices, be- ing characterized by few dozens of quantum bits (qubits) with non-uniform quality and highly constrained physi- cal connectivity. Hence, the growing demand for large-scale quantum computers is motivating research on distributed quan- tum computing architectures [1,2], and experimental ef- forts have demonstrated some of the building blocks for such a design [3]. Indeed, with the network and com- munications functionalities provided by the Quantum Internet [1, 2, 4–11], remote quantum processing units (QPUs) can communicate and cooperate – through the distributed computing paradigm as a virtual quantum processor with a number of qubits that scales linearly with the number of remote QPUs [12] – for executing computational tasks that each NISQ device cannot han- dle by itself. As overviewed in recent literature such as [3,12], sev- eral challenges arise with the design of a distributed quantum computing architecture. In the following, we focus on the problem of designing a quantum algorithm compiler for distributed quantum compilation. Compiling a quantum algorithm means translating a hardware-agnostic description of the algorithm – i.e., the quantum circuit 1 – into a functionally-equivalent one that takes into account the physical constraints of the underlying computing architecture – i.e., the compiled quantum circuit. When it comes to distributed comput- ing architectures, two are the main issues arising with the compiler design. First, a fundamental question arises with distributed computation: at what price ? Indeed, distributed com- putation requires the different processors being able to communicate each others for coordinating and data ex- changing, and these tasks introduce an overhead that strongly depend on the particulars of the distributed computing architecture. For instance, the induced over- head becomes more severe as the connectivity between the QPUs shrinks or as the number of qubits stored at the QPUs decreases. Hence, from a compiling perspec- tive, it is crucial to estimate the overhead effects onto the compiled quantum circuit, effects that are generally measured in terms of depth of the compiled circuit with respect to the depth of the original one. Furthermore, compiling a quantum circuit is a very 1 See Section 2 for a proper introduction to quantum circuits, circuit compilation and circuit depth. 1 arXiv:2012.09680v1 [quant-ph] 17 Dec 2020

Upload: others

Post on 02-Oct-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Compiler Design for Distributed Quantum Computing

Compiler Design for Distributed Quantum Computing

Davide Ferrari1, Angela Sara Cacciapuoti2,3, Michele Amoretti1, and Marcello Caleffi2,3

1Quantum Software, Department of Engineering and Architecture (DIA), University of Parma, Parma, 43124 Italy (e-mail:[email protected], [email protected]). Web: http://www.qis.unipr.it/quantumsoftware.html

2FLY: Future Communications Laboratory, Department of Electrical Engineering and Information Technology (DIETI),University of Naples Federico II, Naples, 80125 Italy (e-mail: [email protected], [email protected]).

Web: www.quantuminternet.it.3Laboratorio Nazionale di Comunicazioni Multimediali, National Inter-University Consortium for Telecommunications

(CNIT), Naples, 80126 Italy.

Abstract

In distributed quantum computing architectures, withthe network and communications functionalities pro-vided by the Quantum Internet, remote quantum pro-cessing units (QPUs) can communicate and cooperatefor executing computational tasks that single NISQ de-vices cannot handle by themselves. To this aim, dis-tributed quantum computing requires a new generationof quantum compilers, for mapping any quantum algo-rithm to any distributed quantum computing architec-ture. With this perspective, in this paper, we first dis-cuss the main challenges arising with compiler design fordistributed quantum computing. Then, we analyticallyderive an upper bound of the overhead induced by quan-tum compilation for distributed quantum computing.The derived bound accounts for the overhead inducedby the underlying computing architecture as well as theadditional overhead induced by the sub-optimal quan-tum compiler – expressly designed through the paperto achieve three key features, namely, general-purpose,efficient and effective. Finally, we validate the analyt-ical results and we confirm the validity of the compilerdesign through an extensive performance analysis.

Index terms— Quantum Internet, Quantum Net-works, Distributed Quantum Computing, DistributedQuantum Systems, Quantum Compiling.

1 Introduction

Current quantum computers are commonly defined asnoisy intermediate-scale quantum (NISQ) devices, be-ing characterized by few dozens of quantum bits (qubits)with non-uniform quality and highly constrained physi-cal connectivity.

Hence, the growing demand for large-scale quantumcomputers is motivating research on distributed quan-tum computing architectures [1,2], and experimental ef-forts have demonstrated some of the building blocks for

such a design [3]. Indeed, with the network and com-munications functionalities provided by the QuantumInternet [1, 2, 4–11], remote quantum processing units(QPUs) can communicate and cooperate – through thedistributed computing paradigm as a virtual quantumprocessor with a number of qubits that scales linearlywith the number of remote QPUs [12] – for executingcomputational tasks that each NISQ device cannot han-dle by itself.

As overviewed in recent literature such as [3,12], sev-eral challenges arise with the design of a distributedquantum computing architecture. In the following, wefocus on the problem of designing a quantum algorithmcompiler for distributed quantum compilation.

Compiling a quantum algorithm means translating ahardware-agnostic description of the algorithm – i.e., thequantum circuit1 – into a functionally-equivalent onethat takes into account the physical constraints of theunderlying computing architecture – i.e., the compiledquantum circuit. When it comes to distributed comput-ing architectures, two are the main issues arising withthe compiler design.

First, a fundamental question arises with distributedcomputation: at what price? Indeed, distributed com-putation requires the different processors being able tocommunicate each others for coordinating and data ex-changing, and these tasks introduce an overhead thatstrongly depend on the particulars of the distributedcomputing architecture. For instance, the induced over-head becomes more severe as the connectivity betweenthe QPUs shrinks or as the number of qubits stored atthe QPUs decreases. Hence, from a compiling perspec-tive, it is crucial to estimate the overhead effects ontothe compiled quantum circuit, effects that are generallymeasured in terms of depth of the compiled circuit withrespect to the depth of the original one.

Furthermore, compiling a quantum circuit is a very

1See Section 2 for a proper introduction to quantum circuits,circuit compilation and circuit depth.

1

arX

iv:2

012.

0968

0v1

[qu

ant-

ph]

17

Dec

202

0

Page 2: Compiler Design for Distributed Quantum Computing

challenging task even for a single-processor architecture,being such a task a NP-complete problem [13]. Hence,optimal circuit compiling for distributed quantum ar-chitectures can be achieved only for very small circuitinstances. Conversely, the compilation of medium-to-large circuits of practical value induces an additionaloverhead – whose severity depends on the suboptimal-ity of the quantum compiler – that further increases thedepth of the compiled circuit.

With this in mind, in this paper we analytically derivean upper bound of the overhead induced by quantumcircuit compilation for distributed quantum computing:

• by considering the overhead induced by the worst-case scenario for a distributed quantum computingarchitecture, namely a scenario characterized by i)the lowest possible number of qubits at each QPU,and ii) the poorest connectivity among the QPUs,

• and by considering the additional overhead inducedby a sub-optimal quantum compiler.

Clearly, with reference to the last point, the additionaloverhead strongly depends on the particulars of thequantum compiler. To this aim, through the manuscriptwe design a quantum compiler with three key features:

• general-purpose, namely, requiring no particular as-sumptions on the quantum circuits to be compiled;

• efficient, namely, exhibiting a polynomial-timecomputational complexity so that it can success-fully compile medium-to-large circuits of practicalvalue;

• effective, being the total circuit depth overhead in-duced by the quantum circuit compilation alwaysupper-bounded by a factor that grows linearly withthe number of logical qubits of the original quantumcircuit.

The rest of the paper is organized as follows. In Sec-tion 2 we review some preliminaries about quantum cir-cuits and quantum compilers. Then, in Section 3 wedetail the problem of circuit compilation for distributedquantum computing, discussing the challenges that arisewith the compiler design and the relevant literature.These basics are crucial for understanding the compilerdesign as well as the analytical derivation of the over-head bound given in Section 4. Then, Section 5 presentsthe performance analysis for the proposed compiler de-sign. In particular, we present the implementation of acompiler that is able to cope with the worst-case scenariofor a distributed quantum computing architecture; wevalidate the compiling overhead upper bound; we illus-trate experimental results regarding the compilation ofseveral quantum circuits, with our compiler compared toa state-of-the-art solution. Finally, Section 6 concludesthe paper.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

First, a fundamental question arises with distributed com-putation: at what price? Indeed, distributed computationrequires the different processors being able to communicateeach others for coordinating and data exchanging, and thesetasks introduce an overhead that strongly depend on theparticulars of the distributed computing architecture. Forinstance, the induced overhead becomes more severe as theconnectivity between the QPUs shrinks or as the number ofqubits stored at the QPUs decreases. Hence, from a compilingperspective, it is crucial to estimate the overhead effectsonto the compiled quantum circuit, effects that are generallymeasured in terms of depth of the compiled circuit withrespect to the depth of the original one.

Furthermore, compiling a quantum circuit is a very chal-lenging task even for a single-processor architecture, beingsuch a task a NP-complete problem [13]. Hence, optimalcircuit compiling for distributed quantum architectures canbe achieved only for very small circuit instances. Conversely,the compilation of medium-to-large circuits of practical valueinduces an additional overhead – whose severity dependson the suboptimality of the quantum compiler – that furtherincreases the depth of the compiled circuit.

With this in mind, in this paper we analytically derive anupper bound of the overhead induced by quantum circuitcompilation for distributed quantum computing:• by considering the overhead induced by the worst-case

scenario for a distributed quantum computing architec-ture, namely a scenario characterized by i) the lowestpossible number of qubits at each QPU, and ii) thepoorest connectivity among the QPUs,

• and by considering the additional overhead induced bya sub-optimal quantum compiler.

Clearly, with reference to the last point, the additionaloverhead strongly depends on the particulars of the quantumcompiler. To this aim, through the manuscript we design aquantum compiler with three key features:• general-purpose, namely, requiring no particular as-

sumptions on the quantum circuits to be compiled;• efficient, namely, exhibiting a polynomial-time compu-

tational complexity so that it can successfully compilemedium-to-large circuits of practical value;

• effective, being the total circuit depth overhead in-duced by the quantum circuit compilation always upper-bounded by a factor that grows linearly with the numberof logical qubits of the original quantum circuit.

The rest of the paper is organized as follows. In Section IIwe review some preliminaries about quantum circuits andquantum compilers. Then, in Section III we detail the prob-lem of circuit compilation for distributed quantum comput-ing, discussing the challenges that arise with the compilerdesign and the relevant literature. These basics are crucial forunderstanding the compiler design as well as the analyticalderivation of the overhead bound given in Section IV. Then,Section V presents the performance analysis for the proposedcompiler design. In particular, we present the implementation

LAYER 1-QUBIT GATE 2-QUBIT GATE

Q0 H T

Q1 H T H H

Q2 H T H

Q3 H T H H

Q4 H T H H

FIGURE 1: Example of a 5-qubit quantum circuit from [17],with each horizontal line representing the time-evolution of thestate of a single logical qubit.

of a compiler that is able to cope with the worst-case sce-nario for a distributed quantum computing architecture; wevalidate the compiling overhead upper bound; we illustrateexperimental results regarding the compilation of severalquantum circuits, with our compiler compared to a state-of-the-art solution. Finally, Section VI concludes the paper.

II. BACKGROUNDWe refer the reader to [14] for an introduction to the concep-tual and notation differences separating quantum computingfrom conventional computing, and to [15] for an in-depthtreatise of the subject.

In this work, we consider QPUs that support the quantumcircuit model [16], which is the most popular and developedmodel for quantum computation. A quantum circuit is amodel of a quantum algorithm, where quantum operatorsare described as quantum gates. A quantum circuit is still alogical abstraction, not to be confused with its realization onan actual quantum hardware device. Hence, in the following,the abstract qubits subjected to quantum gates as specifiedby the quantum circuit are called logical qubits to distinguishthem from the physical qubits embedded within a quantumprocessor.

Figure 1 shows a simple quantum circuit, where eachhorizontal line represents the time-evolution of the stateof a single logical qubit, with time flowing from left toright, dictating the order of execution of the different gates.More specifically, gates affecting the same qubit must beexecuted sequentially, and this agrees with the intuition.Conversely, gates acting on different qubits can be performedsimultaneously as long as the “ordering” arising from gatesaffecting multiple qubits is respected. This concept underliesthe notion of layer, i.e., the set of gates that can be performedsimultaneously on a disjoint set of qubits. The number oflayers in a quantum circuit is denoted as circuit depth. As anexample, the quantum circuit given in Figure 1 is composedof 9 layers and hence its depth is equal to 9. The number ofgates within the circuit is denoted as circuit size.

2 VOLUME 4, 2016

Figure 1: Example of a 5-qubit quantum circuit from[17], with each horizontal line representing the time-evolution of the state of a single logical qubit.

2 Background

We refer the reader to [14] for an introduction to theconceptual and notation differences separating quantumcomputing from conventional computing, and to [15] foran in-depth treatise of the subject.

In this work, we consider QPUs that support thequantum circuit model [16], which is the most popu-lar and developed model for quantum computation. Aquantum circuit is a model of a quantum algorithm,where quantum operators are described as quantumgates. A quantum circuit is still a logical abstraction, notto be confused with its realization on an actual quantumhardware device. Hence, in the following, the abstractqubits subjected to quantum gates as specified by thequantum circuit are called logical qubits to distinguishthem from the physical qubits embedded within a quan-tum processor.

Figure 1 shows a simple quantum circuit, where eachhorizontal line represents the time-evolution of the stateof a single logical qubit, with time flowing from left toright, dictating the order of execution of the differentgates. More specifically, gates affecting the same qubitmust be executed sequentially, and this agrees with theintuition. Conversely, gates acting on different qubitscan be performed simultaneously as long as the “order-ing” arising from gates affecting multiple qubits is re-spected. This concept underlies the notion of layer, i.e.,the set of gates that can be performed simultaneously ona disjoint set of qubits. The number of layers in a quan-tum circuit is denoted as circuit depth. As an example,the quantum circuit given in Figure 1 is composed of 9layers and hence its depth is equal to 9. The number ofgates within the circuit is denoted as circuit size.

2.1 Quantum Compilation

Given a quantum algorithm, there exist several equiva-lent quantum circuits modeling the same computation

2

Page 3: Compiler Design for Distributed Quantum Computing

Authors: Distributed Quantum Compiling

Q0 Q1 Q2 Q3 Q4 Q5 Q6

Q14 Q13 Q12 Q11 Q10 Q9 Q8 Q7

FIGURE 2: Coupling map of the IBM Melbourne quantumprocessor [21]. The fifteen physical qubits are representedby circles. The arrows denote the possibility to realize atwo-qubit CNOT gate between the connected qubits, with thearrow pointing toward the target qubit. As an example, aCNOT between qubits Q1 (control) and Q0 (target) can bedirectly executed by the quantum processor, whereas a CNOTbetween qubits Q2 and Q0 cannot.

A. QUANTUM COMPILATION

Given a quantum algorithm, there exist several equivalentquantum circuits modeling the same computation with adifferent arrangement or different ordering of gates.

Circuits with fewer gates – i.e., with lower size – maybe preferred to reduce the circuit complexity. However, theexecution time of the circuit – rather than its size – is gen-erally considered the key factor to be optimized [18], [19].The rationale is to keep the execution time of the quantumcircuit within the coherence time of the underlying quantumhardware architecture [15], [20]. By oversimplifying, the ex-ecution time increases with the number of layers. Therefore,it is crucial to build – for a given quantum algorithm – aquantum circuit characterized by the lowest possible depth.However, two issues arise as a consequence of the quantumprocessor characteristics.

First, even if there exists an uncountable number of quan-tum logic gates, the set of gates that can be executed on acertain quantum processor can be limited, as a consequenceof the constraints imposed by the underlying qubit technol-ogy [3]. In this case, any gate outside this reduced set mustbe obtained with a proper combination of the allowed gatesthrough a process known as gate synthesis.

Furthermore, regardless of the underlying qubit technol-ogy, any quantum processor exhibits physical constraints– arising as a consequence of the noise and the physical-space limitations – on the possible interactions between thedifferent physical qubits. For example, CNOT gates cannotbe applied to any physical qubit pair, but they are insteadrestricted to certain pairs, as shown in Figure 2 with thecoupling map of an IBM quantum processor.

From the above, it becomes clear that the execution of aquantum algorithm on a certain quantum processor requiresthat: i) each logical qubit of the quantum circuit is mapped 2

onto a physical qubit of the quantum processor, and ii) eachCNOT operation between non-adjacent (within the couplingmap) physical qubits is mapped into a sequence of CNOT

2Indeed, NISQ technology may require a logical qubit to be mapped ontoseveral physical qubits to implement proper fault-tolerant techniques [23].Nevertheless, in the following we assume a one-to-one mapping for the sakeof clarity, without any loss of generality.

operations between adjacent physical qubits, as shown3 inFigure 3.

This process, known as quantum compilation, must beoptimized so that the depth of the compiled circuit – i.e.,the equivalent quantum circuit satisfying all the constrainsimposed by the quantum processor – is minimized [24], [25],[27], [28].

III. COMPILERS FOR DISTRIBUTED QUANTUMCOMPUTINGAs highlighted in Section I, the demand for large-scale quan-tum computers is motivating research on distributed quantumcomputing architectures, where multiple small-scale quan-tum processors interact and cooperate through the QuantumInternet for solving challenging computational tasks. As aconsequence, a new generation of quantum compilers isneeded, for mapping any quantum algorithm to any dis-tributed quantum computing architecture.

Let us consider a toy model for distributed quantumcomputing, in which a generic quantum algorithm must beexecuted on two quantum processors interconnected by aquantum link, as shown in Figure 4.

A. CHALLENGESSeveral challenges arise with the design of a quantum com-piler for mapping an arbitrary quantum circuit into a dis-tributed quantum computing architecture, as discussed in thefollowing.

Data Qubits vs Communication qubitsSimilarly to classical distributed computing, a key require-ment for distributed quantum computing is the possibilityto perform remote operations, namely operations betweenqubits stored at different processors. But, differently fromthe classical domain, quantum mechanics does not allowan unknown qubit to be copied or even simply read ormeasured in any way, without causing an irreversible loss ofthe quantum information stored within the qubit [14], [15].

Thankfully, entanglement provides an invaluable tool forimplementing remote operations without violating quantummechanics [12]. Entanglement is a property of two (or more,in case of multipartite entanglement) quantum particles thatexist in a special type of superposition state, such that anyaction on a particle affects instantaneously the other particleas well. This sort of quantum correlation, with no counterpartin the classical world, holds even when the particles arefar away from each other. For an in-depth discussion aboutentanglement from an information engineering point of view,we refer the reader to [20].

By exploiting the availability of a Bell state – that is astate of two maximally-entangled qubits – shared betweenthe two remote processors, it is possible to perform a remoteCNOT through a sequence of local CNOTs and single-qubitoperations/measurements as shown in Figure 4b.

3With the state transfer strategy based on SWAPs usually preferred overthe ancilla strategy [24]–[26].

VOLUME 4, 2016 3

Figure 2: Coupling map of the IBM Melbourne quan-tum processor [21]. The fifteen physical qubits are rep-resented by circles. The arrows denote the possibilityto realize a two-qubit CNOT gate between the connectedqubits, with the arrow pointing toward the target qubit.As an example, a CNOT between qubits Q1 (control) andQ0 (target) can be directly executed by the quantumprocessor, whereas a CNOT between qubits Q2 and Q0

cannot.

with a different arrangement or different ordering ofgates.

Circuits with fewer gates – i.e., with lower size – maybe preferred to reduce the circuit complexity. How-ever, the execution time of the circuit – rather than itssize – is generally considered the key factor to be op-timized [18, 19]. The rationale is to keep the executiontime of the quantum circuit within the coherence time ofthe underlying quantum hardware architecture [15, 20].By oversimplifying, the execution time increases withthe number of layers. Therefore, it is crucial to build– for a given quantum algorithm – a quantum circuitcharacterized by the lowest possible depth. However,two issues arise as a consequence of the quantum pro-cessor characteristics.

First, even if there exists an uncountable number ofquantum logic gates, the set of gates that can be exe-cuted on a certain quantum processor can be limited, asa consequence of the constraints imposed by the under-lying qubit technology [3]. In this case, any gate outsidethis reduced set must be obtained with a proper combi-nation of the allowed gates through a process known asgate synthesis.

Furthermore, regardless of the underlying qubit tech-nology, any quantum processor exhibits physical con-straints – arising as a consequence of the noise and thephysical-space limitations – on the possible interactionsbetween the different physical qubits. For example, CNOTgates cannot be applied to any physical qubit pair, butthey are instead restricted to certain pairs, as shownin Figure 2 with the coupling map of an IBM quantumprocessor.

From the above, it becomes clear that the executionof a quantum algorithm on a certain quantum proces-sor requires that: i) each logical qubit of the quantumcircuit is mapped 2 onto a physical qubit of the quan-

2Indeed, NISQ technology may require a logical qubit to bemapped onto several physical qubits to implement proper fault-tolerant techniques [23]. Nevertheless, in the following we assume

tum processor, and ii) each CNOT operation between non-adjacent (within the coupling map) physical qubits ismapped into a sequence of CNOT operations between ad-jacent physical qubits, as shown3 in Figure 3.

This process, known as quantum compilation, mustbe optimized so that the depth of the compiled circuit– i.e., the equivalent quantum circuit satisfying all theconstrains imposed by the quantum processor – is min-imized [24,25,27,28].

3 Compilers for DistributedQuantum Computing

As highlighted in Section 1, the demand for large-scale quantum computers is motivating research on dis-tributed quantum computing architectures, where mul-tiple small-scale quantum processors interact and coop-erate through the Quantum Internet for solving chal-lenging computational tasks. As a consequence, a newgeneration of quantum compilers is needed, for map-ping any quantum algorithm to any distributed quan-tum computing architecture.

Let us consider a toy model for distributed quantumcomputing, in which a generic quantum algorithm mustbe executed on two quantum processors interconnectedby a quantum link, as shown in Figure 4.

3.1 Challenges

Several challenges arise with the design of a quantumcompiler for mapping an arbitrary quantum circuit intoa distributed quantum computing architecture, as dis-cussed in the following.

Data Qubits vs Communication qubits

Similarly to classical distributed computing, a key re-quirement for distributed quantum computing is thepossibility to perform remote operations, namely opera-tions between qubits stored at different processors. But,differently from the classical domain, quantum mechan-ics does not allow an unknown qubit to be copied or evensimply read or measured in any way, without causingan irreversible loss of the quantum information storedwithin the qubit [14,15].

Thankfully, entanglement provides an invaluable toolfor implementing remote operations without violatingquantum mechanics [12]. Entanglement is a propertyof two (or more, in case of multipartite entanglement)quantum particles that exist in a special type of super-position state, such that any action on a particle affects

a one-to-one mapping for the sake of clarity, without any loss ofgenerality.

3With the state transfer strategy based on SWAPs usually pre-ferred over the ancilla strategy [24–26].

3

Page 4: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

REVERSE CNOT

Q0 H H

Q1

(a) Reversing CNOT [22]. A CNOT betweenQ0 (control) and Q1 (target) can be ex-ecuted with the coupling map given inFigure 2 by performing a CNOT betweenQ1 (control) and Q0 (target) sandwichedbetween two H gates. We note that IBMMelbourne processor (shown in Figure 2)natively supports CNOTs in both directionsbetween neighbor qubits.

SWAP Q0 AND Q1 SWAP BACK

≡ ≡

Q0

Q1

Q2

(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantumstate transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent qubits inthe coupling maps, then by performing a CNOT between Q0 and Q2, and finally by swapping againqubits Q0 and Q1 so that they recover their initial position, or ii) ancilla qubit, by performing fourCNOT operations between neighbour qubits with the help of the intermediate qubit Q1.

FIGURE 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOTinto a sequence of CNOTs that can be directly executed by a given quantum processor.

Q0

Q1

DATA QUBIT

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0COMMUNICATIONQUBIT

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

QUANTUM NETWORK

QUANTUM LINK

CLASSICAL LINK

(a) Two IBM Yorktown quantum processors are interconnected with aquantum link and a classical link. The classical link is used to transmitclassical information, whereas the quantum link is needed to distributeBell states – that is, maximally-entangled two-qubit states – betweenremote processors to execute remote operations. Indeed, at least onephysical qubit at each processor must be reserved for storing the Bellstate, as discussed in Figure 4b. This kind of qubits – dark-blue-coloredin the figure – are called communication qubits [2], [29] to distinguishthem from the remaining physical qubits – white-colored in the figure –devoted to computing and referred to as data qubits.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #2

Q4 Z DATA QUBIT (CONTROL)

|Φ+〉

Q3 COMMUNICATION QUBIT

Q′0 H COMMUNICATION QUBIT

Q′1 X DATA QUBIT (TARGET)

(b) Remote CNOT. To perform a CNOT between remote physical qubitsstored at different processors – say data qubits Q4 and Q′

1 in Figure 4a– a Bell state such as |Φ+〉must be distributed through the quantum linkso that each pair member is stored within a communication qubit at eachprocessor. Once the Bell state is available, the remote CNOT is obtainedwith a local CNOT between the data and the communication qubit at eachprocessor, followed by a conditional gate on the data qubit dependingon the measurement of the remote communication qubit. The doubleline denotes the transmission of one bit of classical information – i.e.,the measurement output – between the remote processors.

FIGURE 4: Toy-model for distributed quantum computing, with two quantum processors interconnected through a quantumnetwork. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4b provides thequantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.

To distribute Bell states between different quantum pro-cessors, at least one qubit at each processor – referred to ascommunication qubit [29] to distinguish it from the remain-ing data qubits devoted to processing – must be reserved forremote inter-processor operations. Hence, a crucial trade-offbetween communication and data qubits arises. Specifically,for each remote CNOT, a Bell state is consumed (Figure 4b)and a new Bell state must be distributed between the remoteprocessors through the quantum link before another remoteCNOT can be executed. Hence, the more communicationqubits are available within a processor, the more remoteCNOTs can be executed in parallel, reducing the overheadinduced by the distributed computation. But the more com-munication qubits are available for inter-processor commu-nication, the less valuable resources – i.e., data qubits – areavailable for computing.

It is unlikely that a data qubit could be dynamically turnedinto a communication qubit during the compilation, given thededicated hardware – such as a matter-flying qubit interface

[20] – required for entanglement distribution. Conversely, itis reasonable to envision that the distributed quantum com-piler could easily reserve – when multiple communicationqubits are available at the same processor – a subset of thecommunication qubits for computing. This optimization taskrepresents an interesting yet unaddressed open problem.

Dynamic Connectivity

As mentioned in Section II-A, with single-processor quantumcomputing all the constraints on the possible interactionsbetween different qubits – arising from the underlying phys-ical computing architecture – can be effectively representedwith a coupling map. Formally, a coupling map is a visualrepresentation of the directed graph G:

G = (V, E) (1)

where V = {vi} denotes the set of vertices representing thequbits and E = {ei,j}vi,vj∈V denotes the set of directed

4 VOLUME 4, 2016

(a) Reversing CNOT [22]. A CNOT be-tween Q0 (control) and Q1 (target)can be executed with the couplingmap given in Figure 2 by perform-ing a CNOT between Q1 (control) andQ0 (target) sandwiched between twoH gates. We note that IBM Mel-bourne processor (shown in Figure 2)natively supports CNOTs in both di-rections between neighbor qubits.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

REVERSE CNOT

Q0 H H

Q1

(a) Reversing CNOT [22]. A CNOT betweenQ0 (control) and Q1 (target) can be ex-ecuted with the coupling map given inFigure 2 by performing a CNOT betweenQ1 (control) and Q0 (target) sandwichedbetween two H gates. We note that IBMMelbourne processor (shown in Figure 2)natively supports CNOTs in both directionsbetween neighbor qubits.

SWAP Q0 AND Q1 SWAP BACK

≡ ≡

Q0

Q1

Q2

(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantumstate transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent qubits inthe coupling maps, then by performing a CNOT between Q0 and Q2, and finally by swapping againqubits Q0 and Q1 so that they recover their initial position, or ii) ancilla qubit, by performing fourCNOT operations between neighbour qubits with the help of the intermediate qubit Q1.

FIGURE 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOTinto a sequence of CNOTs that can be directly executed by a given quantum processor.

Q0

Q1

DATA QUBIT

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0COMMUNICATIONQUBIT

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

QUANTUM NETWORK

QUANTUM LINK

CLASSICAL LINK

(a) Two IBM Yorktown quantum processors are interconnected with aquantum link and a classical link. The classical link is used to transmitclassical information, whereas the quantum link is needed to distributeBell states – that is, maximally-entangled two-qubit states – betweenremote processors to execute remote operations. Indeed, at least onephysical qubit at each processor must be reserved for storing the Bellstate, as discussed in Figure 4b. This kind of qubits – dark-blue-coloredin the figure – are called communication qubits [2], [29] to distinguishthem from the remaining physical qubits – white-colored in the figure –devoted to computing and referred to as data qubits.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #2

Q4 Z DATA QUBIT (CONTROL)

|Φ+〉

Q3 COMMUNICATION QUBIT

Q′0 H COMMUNICATION QUBIT

Q′1 X DATA QUBIT (TARGET)

(b) Remote CNOT. To perform a CNOT between remote physical qubitsstored at different processors – say data qubits Q4 and Q′

1 in Figure 4a– a Bell state such as |Φ+〉must be distributed through the quantum linkso that each pair member is stored within a communication qubit at eachprocessor. Once the Bell state is available, the remote CNOT is obtainedwith a local CNOT between the data and the communication qubit at eachprocessor, followed by a conditional gate on the data qubit dependingon the measurement of the remote communication qubit. The doubleline denotes the transmission of one bit of classical information – i.e.,the measurement output – between the remote processors.

FIGURE 4: Toy-model for distributed quantum computing, with two quantum processors interconnected through a quantumnetwork. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4b provides thequantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.

To distribute Bell states between different quantum pro-cessors, at least one qubit at each processor – referred to ascommunication qubit [29] to distinguish it from the remain-ing data qubits devoted to processing – must be reserved forremote inter-processor operations. Hence, a crucial trade-offbetween communication and data qubits arises. Specifically,for each remote CNOT, a Bell state is consumed (Figure 4b)and a new Bell state must be distributed between the remoteprocessors through the quantum link before another remoteCNOT can be executed. Hence, the more communicationqubits are available within a processor, the more remoteCNOTs can be executed in parallel, reducing the overheadinduced by the distributed computation. But the more com-munication qubits are available for inter-processor commu-nication, the less valuable resources – i.e., data qubits – areavailable for computing.

It is unlikely that a data qubit could be dynamically turnedinto a communication qubit during the compilation, given thededicated hardware – such as a matter-flying qubit interface

[20] – required for entanglement distribution. Conversely, itis reasonable to envision that the distributed quantum com-piler could easily reserve – when multiple communicationqubits are available at the same processor – a subset of thecommunication qubits for computing. This optimization taskrepresents an interesting yet unaddressed open problem.

Dynamic Connectivity

As mentioned in Section II-A, with single-processor quantumcomputing all the constraints on the possible interactionsbetween different qubits – arising from the underlying phys-ical computing architecture – can be effectively representedwith a coupling map. Formally, a coupling map is a visualrepresentation of the directed graph G:

G = (V, E) (1)

where V = {vi} denotes the set of vertices representing thequbits and E = {ei,j}vi,vj∈V denotes the set of directed

4 VOLUME 4, 2016

(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed througheither: i) quantum state transfer, by first swapping qubits Q0 and Q1 so that Q0 andQ2 become adjacent qubits in the coupling maps, then by performing a CNOT betweenQ0 and Q2, and finally by swapping again qubits Q0 and Q1 so that they recovertheir initial position, or ii) ancilla qubit, by performing four CNOT operations betweenneighbour qubits with the help of the intermediate qubit Q1.

Figure 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbi-trary CNOT into a sequence of CNOTs that can be directly executed by a given quantum processor.

instantaneously the other particle as well. This sort ofquantum correlation, with no counterpart in the classicalworld, holds even when the particles are far away fromeach other. For an in-depth discussion about entangle-ment from an information engineering point of view, werefer the reader to [20].

By exploiting the availability of a Bell state – thatis a state of two maximally-entangled qubits – sharedbetween the two remote processors, it is possible to per-form a remote CNOT through a sequence of local CNOTsand single-qubit operations/measurements as shown inFigure 4b.

To distribute Bell states between different quantumprocessors, at least one qubit at each processor – re-ferred to as communication qubit [29] to distinguish itfrom the remaining data qubits devoted to processing –must be reserved for remote inter-processor operations.Hence, a crucial trade-off between communication anddata qubits arises. Specifically, for each remote CNOT,a Bell state is consumed (Figure 4b) and a new Bellstate must be distributed between the remote proces-sors through the quantum link before another remoteCNOT can be executed. Hence, the more communicationqubits are available within a processor, the more remoteCNOTs can be executed in parallel, reducing the overheadinduced by the distributed computation. But the morecommunication qubits are available for inter-processorcommunication, the less valuable resources – i.e., dataqubits – are available for computing.

It is unlikely that a data qubit could be dynamicallyturned into a communication qubit during the compila-tion, given the dedicated hardware – such as a matter-flying qubit interface [20] – required for entanglementdistribution. Conversely, it is reasonable to envision thatthe distributed quantum compiler could easily reserve –

when multiple communication qubits are available at thesame processor – a subset of the communication qubitsfor computing. This optimization task represents an in-teresting yet unaddressed open problem.

Dynamic Connectivity

As mentioned in Section 2.1, with single-processor quan-tum computing all the constraints on the possible in-teractions between different qubits – arising from theunderlying physical computing architecture – can be ef-fectively represented with a coupling map. Formally, acoupling map is a visual representation of the directedgraph G:

G = (V, E) (1)

where V = {vi} denotes the set of vertices representingthe qubits and E = {ei,j}vi,vj∈V denotes the set of di-rected edges representing the possibility to perform4 aCNOT with vi and vj acting as control and target qubit,respectively.

But when it comes to distributed quantum computing,a new kind of constraints arises as a consequence of theunderlying physical network topology.

More in detail, similarly to single-processor quantumcompiling, the remote operations are restricted to cer-tain fixed pairs. Specifically, they are restricted to pairscomposed by data qubits directly connected to a commu-nication qubit within the processor coupling map. Forinstance, with reference to the network topology shownin Figure 4a, a remote CNOT between data qubits Q2 and

4In the following, for the sake of presentation, we consider thesimplest binary case, i.e., either the operation can or cannot beexecuted. But the discussions, as well as the results derived in thefollowing, continue to hold when a weight – usually representingthe gate fidelity – is mapped on the edge.

4

Page 5: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

REVERSE CNOT

Q0 H H

Q1

(a) Reversing CNOT [22]. A CNOT betweenQ0 (control) and Q1 (target) can be ex-ecuted with the coupling map given inFigure 2 by performing a CNOT betweenQ1 (control) and Q0 (target) sandwichedbetween two H gates. We note that IBMMelbourne processor (shown in Figure 2)natively supports CNOTs in both directionsbetween neighbor qubits.

SWAP Q0 AND Q1 SWAP BACK

≡ ≡

Q0

Q1

Q2

(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantumstate transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent qubits inthe coupling maps, then by performing a CNOT between Q0 and Q2, and finally by swapping againqubits Q0 and Q1 so that they recover their initial position, or ii) ancilla qubit, by performing fourCNOT operations between neighbour qubits with the help of the intermediate qubit Q1.

FIGURE 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOTinto a sequence of CNOTs that can be directly executed by a given quantum processor.

Q0

Q1

DATA QUBIT

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0COMMUNICATIONQUBIT

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

QUANTUM NETWORK

QUANTUM LINK

CLASSICAL LINK

(a) Two IBM Yorktown quantum processors are interconnected with aquantum link and a classical link. The classical link is used to transmitclassical information, whereas the quantum link is needed to distributeBell states – that is, maximally-entangled two-qubit states – betweenremote processors to execute remote operations. Indeed, at least onephysical qubit at each processor must be reserved for storing the Bellstate, as discussed in Figure 4b. This kind of qubits – dark-blue-coloredin the figure – are called communication qubits [2], [29] to distinguishthem from the remaining physical qubits – white-colored in the figure –devoted to computing and referred to as data qubits.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #2

Q4 Z DATA QUBIT (CONTROL)

|Φ+〉

Q3 COMMUNICATION QUBIT

Q′0 H COMMUNICATION QUBIT

Q′1 X DATA QUBIT (TARGET)

(b) Remote CNOT. To perform a CNOT between remote physical qubitsstored at different processors – say data qubits Q4 and Q′

1 in Figure 4a– a Bell state such as |Φ+〉must be distributed through the quantum linkso that each pair member is stored within a communication qubit at eachprocessor. Once the Bell state is available, the remote CNOT is obtainedwith a local CNOT between the data and the communication qubit at eachprocessor, followed by a conditional gate on the data qubit dependingon the measurement of the remote communication qubit. The doubleline denotes the transmission of one bit of classical information – i.e.,the measurement output – between the remote processors.

FIGURE 4: Toy-model for distributed quantum computing, with two quantum processors interconnected through a quantumnetwork. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4b provides thequantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.

To distribute Bell states between different quantum pro-cessors, at least one qubit at each processor – referred to ascommunication qubit [29] to distinguish it from the remain-ing data qubits devoted to processing – must be reserved forremote inter-processor operations. Hence, a crucial trade-offbetween communication and data qubits arises. Specifically,for each remote CNOT, a Bell state is consumed (Figure 4b)and a new Bell state must be distributed between the remoteprocessors through the quantum link before another remoteCNOT can be executed. Hence, the more communicationqubits are available within a processor, the more remoteCNOTs can be executed in parallel, reducing the overheadinduced by the distributed computation. But the more com-munication qubits are available for inter-processor commu-nication, the less valuable resources – i.e., data qubits – areavailable for computing.

It is unlikely that a data qubit could be dynamically turnedinto a communication qubit during the compilation, given thededicated hardware – such as a matter-flying qubit interface

[20] – required for entanglement distribution. Conversely, itis reasonable to envision that the distributed quantum com-piler could easily reserve – when multiple communicationqubits are available at the same processor – a subset of thecommunication qubits for computing. This optimization taskrepresents an interesting yet unaddressed open problem.

Dynamic Connectivity

As mentioned in Section II-A, with single-processor quantumcomputing all the constraints on the possible interactionsbetween different qubits – arising from the underlying phys-ical computing architecture – can be effectively representedwith a coupling map. Formally, a coupling map is a visualrepresentation of the directed graph G:

G = (V, E) (1)

where V = {vi} denotes the set of vertices representing thequbits and E = {ei,j}vi,vj∈V denotes the set of directed

4 VOLUME 4, 2016

(a) Two IBM Yorktown quantum processors are intercon-nected with a quantum link and a classical link. The clas-sical link is used to transmit classical information, whereasthe quantum link is needed to distribute Bell states – that is,maximally-entangled two-qubit states – between remote pro-cessors to execute remote operations. Indeed, at least onephysical qubit at each processor must be reserved for storingthe Bell state, as discussed in Figure 4b. This kind of qubits– dark-blue-colored in the figure – are called communicationqubits [2,29] to distinguish them from the remaining physicalqubits – white-colored in the figure – devoted to computingand referred to as data qubits.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

REVERSE CNOT

Q0 H H

Q1

(a) Reversing CNOT [22]. A CNOT betweenQ0 (control) and Q1 (target) can be ex-ecuted with the coupling map given inFigure 2 by performing a CNOT betweenQ1 (control) and Q0 (target) sandwichedbetween two H gates. We note that IBMMelbourne processor (shown in Figure 2)natively supports CNOTs in both directionsbetween neighbor qubits.

SWAP Q0 AND Q1 SWAP BACK

≡ ≡

Q0

Q1

Q2

(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantumstate transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent qubits inthe coupling maps, then by performing a CNOT between Q0 and Q2, and finally by swapping againqubits Q0 and Q1 so that they recover their initial position, or ii) ancilla qubit, by performing fourCNOT operations between neighbour qubits with the help of the intermediate qubit Q1.

FIGURE 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOTinto a sequence of CNOTs that can be directly executed by a given quantum processor.

Q0

Q1

DATA QUBIT

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0COMMUNICATIONQUBIT

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

QUANTUM NETWORK

QUANTUM LINK

CLASSICAL LINK

(a) Two IBM Yorktown quantum processors are interconnected with aquantum link and a classical link. The classical link is used to transmitclassical information, whereas the quantum link is needed to distributeBell states – that is, maximally-entangled two-qubit states – betweenremote processors to execute remote operations. Indeed, at least onephysical qubit at each processor must be reserved for storing the Bellstate, as discussed in Figure 4b. This kind of qubits – dark-blue-coloredin the figure – are called communication qubits [2], [29] to distinguishthem from the remaining physical qubits – white-colored in the figure –devoted to computing and referred to as data qubits.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #2

Q4 Z DATA QUBIT (CONTROL)

|Φ+〉

Q3 COMMUNICATION QUBIT

Q′0 H COMMUNICATION QUBIT

Q′1 X DATA QUBIT (TARGET)

(b) Remote CNOT. To perform a CNOT between remote physical qubitsstored at different processors – say data qubits Q4 and Q′

1 in Figure 4a– a Bell state such as |Φ+〉must be distributed through the quantum linkso that each pair member is stored within a communication qubit at eachprocessor. Once the Bell state is available, the remote CNOT is obtainedwith a local CNOT between the data and the communication qubit at eachprocessor, followed by a conditional gate on the data qubit dependingon the measurement of the remote communication qubit. The doubleline denotes the transmission of one bit of classical information – i.e.,the measurement output – between the remote processors.

FIGURE 4: Toy-model for distributed quantum computing, with two quantum processors interconnected through a quantumnetwork. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4b provides thequantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.

To distribute Bell states between different quantum pro-cessors, at least one qubit at each processor – referred to ascommunication qubit [29] to distinguish it from the remain-ing data qubits devoted to processing – must be reserved forremote inter-processor operations. Hence, a crucial trade-offbetween communication and data qubits arises. Specifically,for each remote CNOT, a Bell state is consumed (Figure 4b)and a new Bell state must be distributed between the remoteprocessors through the quantum link before another remoteCNOT can be executed. Hence, the more communicationqubits are available within a processor, the more remoteCNOTs can be executed in parallel, reducing the overheadinduced by the distributed computation. But the more com-munication qubits are available for inter-processor commu-nication, the less valuable resources – i.e., data qubits – areavailable for computing.

It is unlikely that a data qubit could be dynamically turnedinto a communication qubit during the compilation, given thededicated hardware – such as a matter-flying qubit interface

[20] – required for entanglement distribution. Conversely, itis reasonable to envision that the distributed quantum com-piler could easily reserve – when multiple communicationqubits are available at the same processor – a subset of thecommunication qubits for computing. This optimization taskrepresents an interesting yet unaddressed open problem.

Dynamic Connectivity

As mentioned in Section II-A, with single-processor quantumcomputing all the constraints on the possible interactionsbetween different qubits – arising from the underlying phys-ical computing architecture – can be effectively representedwith a coupling map. Formally, a coupling map is a visualrepresentation of the directed graph G:

G = (V, E) (1)

where V = {vi} denotes the set of vertices representing thequbits and E = {ei,j}vi,vj∈V denotes the set of directed

4 VOLUME 4, 2016

(b) Remote CNOT. To perform a CNOT between remote physicalqubits stored at different processors – say data qubits Q4 andQ′

1 in Figure 4a – a Bell state such as |Φ+〉 must be distributedthrough the quantum link so that each pair member is storedwithin a communication qubit at each processor. Once theBell state is available, the remote CNOT is obtained with a localCNOT between the data and the communication qubit at eachprocessor, followed by a conditional gate on the data qubitdepending on the measurement of the remote communicationqubit. The double line denotes the transmission of one bit ofclassical information – i.e., the measurement output – betweenthe remote processors.

Figure 4: Toy-model for distributed quantum computing, with two quantum processors interconnected through aquantum network. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4bprovides the quantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed toexecute a remote operation.Authors: Distributed Quantum Compiling

Q0

Q1

Q2

Q4

VIRTUAL QUANTUM PROCESSOR

Q′1

Q′2 Q′3

Q′4

LOCAL CNOT

TIME-CONSTRAINED REMOTE CNOT

FIGURE 5: Dynamic coupling map for the network topologyshown in Figure 4a. The two 5-qubit quantum processorsconstitute a 8-qubit virtual quantum processor with qubits inter-connected through both local and remote CNOTs. While thelocal CNOTs can be concurrently executed – i.e., they are un-constrained – the parallel execution of multiple remote CNOTsis constrained to the availability of multiple Bell states, with oneBell state for each concurrent remote CNOT. Given that onlyone communication qubit is available at each processor inFigure 4a, out of four remote CNOTs only one can be executedat any time.

edges representing the possibility to perform4 a CNOT withvi and vj acting as control and target qubit, respectively.

But when it comes to distributed quantum computing,a new kind of constraints arises as a consequence of theunderlying physical network topology.

More in detail, similarly to single-processor quantum com-piling, the remote operations are restricted to certain fixedpairs. Specifically, they are restricted to pairs composed bydata qubits directly connected to a communication qubitwithin the processor coupling map. For instance, with ref-erence to the network topology shown in Figure 4a, a remoteCNOT between data qubits Q2 and Q′1 can be directly mappedonto the circuit given in Figure 4b. Conversely, a remoteCNOT between data qubits Q2 and Q′4 cannot be directlyexecuted between the pair, but it requires to distribute theoperation through the neighbor qubits as shown in Figure 3b.

However, differently from single-processor compiling, theremote operations are subjects to two types of temporalconstraints.

• Simultaneity Limitations. As previously discussed, eachremote CNOT relies on the availability of a Bell statestored within a communication qubit. Hence, even ifremote CNOTs can – in principle – be executed betweendifferent remote pairs, the number of remote CNOTs thatcan be executed simultaneously between two proces-sors is limited by the number of communication qubitsjointly available at each processor. With reference toFigure 5, out of four possible remote CNOTs (denotedwith blue arrows), only one can be executed at any time.

• Consecutiviness Limitations. Each remote CNOT con-sumes a Bell state as a consequence of the measurement

4In the following, for the sake of presentation, we consider the simplestbinary case, i.e., either the operation can or cannot be executed. But thediscussions, as well as the results derived in the following, continue to holdwhen a weight – usually representing the gate fidelity – is mapped on theedge.

operations on the communication qubits [20]. Accord-ingly, a new Bell state must be generated and distributedthrough the quantum link to the communication qubits,before a subsequent remote CNOT could be executed.And even if the Bell state distribution can start rightafter the measurements, it is reasonable to assume –given the several order of magnitudes separating intra-processor qubit distance from inter-processor one –that the time needed to entangle the communicationqubits significantly exceeds the time required for localCNOTs5. Accordingly, we have two major issues. First,the “clock” of the remote operations will be significantlylower than the “clock” of the local operations and,hence, it becomes fundamental to minimize the numberof remote – rather than the number of local – opera-tions to preserve the quantum information integrity fromdecoherence (Section II-A). Furthermore, there may beperiods of time – following the execution of a remoteoperation up to the successful distribution of a newBell state – during which the quantum processors aredisconnected and only local operations are possible.

These additional constraints must be properly modeledwithin the coupling map, so that the distributed quantumcompiler can optimize the quantum circuit by accounting forthe temporal dynamics arising with the distributed architec-ture. And this represents an open problem.

Augmented ConnectivityAs shown in Figure 3b, single-processor quantum computingmust resort to either state transfer (swapping) or ancillastrategy to implement a CNOT between non-adjacent (withinthe coupling map) physical qubits. The rationale for thislays in the impossibility to have direct interactions betweendistant qubits. And the further the qubits are within thecoupling map, the longer the sequence of additional CNOTsis required, regardless of the adopted strategy.

Conversely, distributed quantum computing can exploit astrategy – called entanglement swapping [3] and summarizedin Fig 6 – to implement a remote CNOT between qubits storedat remote processors, even if the processors are not directlyconnected through a quantum link.

In a nutshell, to distribute a Bell state between remote pro-cessors – say quantum processor #1 and #3 in Figure 6a – twoBell states must be first distributed through the quantum linksso that one Bell state is shared between the first processorand an intermediate node and another Bell state is shared bythe same intermediate node and the second processor. Then,by performing a Bell state measurement (consisting of a Hand a CNOT gate, followed by a joint measurement) on thecommunication qubits at the intermediate node – i.e., qubitsQ′0 and Q′3 in Figure 6b – a Bell state is obtained at theremote communication qubits – i.e., qubits Q′′0 and Q3 inFigure 6a – by applying some local processing at the remote

5In the order of hundreds of nanoseconds for the IBM Yorktown quantumprocessor [30]

VOLUME 4, 2016 5

Figure 5: Dynamic coupling map for the network topol-ogy shown in Figure 4a. The two 5-qubit quantum pro-cessors constitute a 8-qubit virtual quantum processorwith qubits inter-connected through both local and re-mote CNOTs. While the local CNOTs can be concurrentlyexecuted – i.e., they are unconstrained – the parallel ex-ecution of multiple remote CNOTs is constrained to theavailability of multiple Bell states, with one Bell statefor each concurrent remote CNOT. Given that only onecommunication qubit is available at each processor inFigure 4a, out of four remote CNOTs only one can beexecuted at any time.

Q′1 can be directly mapped onto the circuit given in Fig-ure 4b. Conversely, a remote CNOT between data qubitsQ2 and Q′4 cannot be directly executed between the pair,

but it requires to distribute the operation through theneighbor qubits as shown in Figure 3b.

However, differently from single-processor compiling,the remote operations are subjects to two types of tem-poral constraints.

• Simultaneity Limitations. As previously discussed,each remote CNOT relies on the availability of a Bellstate stored within a communication qubit. Hence,even if remote CNOTs can – in principle – be exe-cuted between different remote pairs, the number ofremote CNOTs that can be executed simultaneouslybetween two processors is limited by the numberof communication qubits jointly available at eachprocessor. With reference to Figure 5, out of fourpossible remote CNOTs (denoted with blue arrows),only one can be executed at any time.

• Consecutiviness Limitations. Each remote CNOT

consumes a Bell state as a consequence of the mea-surement operations on the communication qubits[20]. Accordingly, a new Bell state must be gen-erated and distributed through the quantum linkto the communication qubits, before a subsequentremote CNOT could be executed. And even if theBell state distribution can start right after themeasurements, it is reasonable to assume – giventhe several order of magnitudes separating intra-processor qubit distance from inter-processor one –

5

Page 6: Compiler Design for Distributed Quantum Computing

that the time needed to entangle the communica-tion qubits significantly exceeds the time requiredfor local CNOTs5. Accordingly, we have two majorissues. First, the “clock” of the remote operationswill be significantly lower than the “clock” of the lo-cal operations and, hence, it becomes fundamentalto minimize the number of remote – rather than thenumber of local – operations to preserve the quan-tum information integrity from decoherence (Sec-tion 2.1). Furthermore, there may be periods oftime – following the execution of a remote opera-tion up to the successful distribution of a new Bellstate – during which the quantum processors aredisconnected and only local operations are possible.

These additional constraints must be properly mod-eled within the coupling map, so that the distributedquantum compiler can optimize the quantum circuit byaccounting for the temporal dynamics arising with thedistributed architecture. And this represents an openproblem.

Augmented Connectivity

As shown in Figure 3b, single-processor quantum com-puting must resort to either state transfer (swapping)or ancilla strategy to implement a CNOT between non-adjacent (within the coupling map) physical qubits. Therationale for this lays in the impossibility to have directinteractions between distant qubits. And the further thequbits are within the coupling map, the longer the se-quence of additional CNOTs is required, regardless of theadopted strategy.

Conversely, distributed quantum computing can ex-ploit a strategy – called entanglement swapping [3] andsummarized in Fig 6 – to implement a remote CNOT be-tween qubits stored at remote processors, even if theprocessors are not directly connected through a quan-tum link.

In a nutshell, to distribute a Bell state between remoteprocessors – say quantum processor #1 and #3 in Fig-ure 6a – two Bell states must be first distributed throughthe quantum links so that one Bell state is shared be-tween the first processor and an intermediate node andanother Bell state is shared by the same intermediatenode and the second processor. Then, by performing aBell state measurement (consisting of a H and a CNOT

gate, followed by a joint measurement) on the commu-nication qubits at the intermediate node – i.e., qubitsQ′0 and Q′3 in Figure 6b – a Bell state is obtained at theremote communication qubits – i.e., qubits Q′′0 and Q3

in Figure 6a – by applying some local processing at theremote nodes depending on the (classical) output of theBell state measurement.

5In the order of hundreds of nanoseconds for the IBM Yorktownquantum processor [30]

From the above, it becomes clear that entanglementswapping significantly increases the connectivity withinthe virtual quantum processor. As an instance, qubitQ4 in Figure 6a can interact with just two qubits withinthe same processor via local CNOTs and two qubits withinthe neighbor processor via remote CNOTs. However, itcan interact with two more qubits – i.e., Q′′1 and Q′′2– via entanglement swapping. And the higher is thenumber of available quantum processors, the higher isthe number of possible interactions. Indeed, the num-ber of additional interactions via entanglement swappingscales linearly with the number of available processorswhen only two communication qubits are available ateach intermediate processor. If this constraint is relaxed,the number of additional interactions via entanglementswapping scales more than linearly.

However, it must be acknowledged that the aug-mented connectivity provided by entanglement swap-ping does not come for free. Indeed, entanglement swap-ping consumes the Bell states stored within the commu-nication qubits at the intermediate processors. And thehigher the number of intermediate processors, the higherthe number of consumed Bell states.

Hence, a trade-off between “augmented connectivity”and “EPR cost” arises with entanglement swapping, anda distributed quantum compiler must carefully accountfor these pros and cons.

3.2 Related Work

Most quantum computer proposals are based on varia-tions of the nearest-neighbor, two-qubit, and concurrentexecution (NTC) architecture [31]. Depending on thelayout of qubits, there are three NTC architectures: 1D,2D and 3D. The 1D model, called Linear Nearest Neigh-bor (LNN) [32], consists of qubits located in a single line.In this model, only two neighboring qubits can inter-act. This is the most challenging scenario. The effectsof the LNN model on performance have been investi-gated for many relevant use cases, such as the quantumFourier transform [33,34], Shor’s algorithm [35,36], andadders [37].

Beals et al. [38] provided algorithms for efficientlymoving and addressing quantum memory in parallel.These imply that the standard circuit model can be sim-ulated with low overhead by a more realistic model of adistributed quantum computer. The authors show thatfor an LNN N -qubit architecture, O(N) time steps arenecessary for performing N/2 two-qubit gates in paral-lel. However, it is worthwhile to note that the devel-oped analysis does not consider any additional overheadinduced by the compilation task and the derived Big-O bound relies on linear constant that is in the ”many,many thousands” [39]. Conversely, in the following wedevelop an analysis that explicitly considers the addi-tional overhead induced by the compilation task, as dis-

6

Page 7: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

Q0

Q1

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

Q′′0

Q′′1

Q′′2 Q3

Q′′4

QUANTUM PROCESSOR #3

QUANTUM LINK QUANTUM LINK

VIRTUAL QUANTUM LINK

(a) By swapping the entanglement at the intermediate nodes – namely, quantum processor #2 – it is possible to distribute a Bell state betweenremote processors – namely, processors #1 and #3 – even if they are not adjacent, i.e., they are not directly connected through a quantum link.Hence, entanglement swapping enhances the network connectivity through virtual quantum links.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #3

|Φ+〉

Q3 Z Q3

|Φ+〉Q′0 H

|Φ+〉

Q′3

Q′′0 X Q′′0

(b) Entanglement swapping. A Bell state can bedistributed between remote processors by swap-ping the entanglement at an intermediate nodethrough local processing and classical communi-cation.

Q0

Q1

Q2

Q4

VIRTUAL QUANTUM PROCESSOR

Q′1

Q′2

Q′4

Q′′1

Q′′2 Q′′3

Q′4

LOCAL CNOT

TIME-CONSTRAINED REMOTE CNOT

TIME-CONSTRAINED REMOTE CNOT

VIA ENTANGLEMENT SWAPPING

(c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue linesdenote remote CNOTs between adjacent processors, whereas the dotted blue lines denoteremote CNOTs between distant processors achievable via entanglement swapping.

FIGURE 6: Augmented connectivity. Entanglement swapping increases the connectivity between physical qubits, with a numberof possible remote CNOTs that scales at least linearly with the number of processors.

nodes depending on the (classical) output of the Bell statemeasurement.

From the above, it becomes clear that entanglement swap-ping significantly increases the connectivity within the virtualquantum processor. As an instance, qubit Q4 in Figure 6a caninteract with just two qubits within the same processor vialocal CNOTs and two qubits within the neighbor processorvia remote CNOTs. However, it can interact with two morequbits – i.e., Q′′1 and Q′′2 – via entanglement swapping. Andthe higher is the number of available quantum processors,the higher is the number of possible interactions. Indeed, thenumber of additional interactions via entanglement swappingscales linearly with the number of available processors whenonly two communication qubits are available at each inter-mediate processor. If this constraint is relaxed, the numberof additional interactions via entanglement swapping scalesmore than linearly.

However, it must be acknowledged that the augmentedconnectivity provided by entanglement swapping does notcome for free. Indeed, entanglement swapping consumesthe Bell states stored within the communication qubits atthe intermediate processors. And the higher the number ofintermediate processors, the higher the number of consumed

Bell states.Hence, a trade-off between “augmented connectivity” and

“EPR cost” arises with entanglement swapping, and a dis-tributed quantum compiler must carefully account for thesepros and cons.

B. RELATED WORKMost quantum computer proposals are based on variationsof the nearest-neighbor, two-qubit, and concurrent execution(NTC) architecture [31]. Depending on the layout of qubits,there are three NTC architectures: 1D, 2D and 3D. The 1Dmodel, called Linear Nearest Neighbor (LNN) [32], consistsof qubits located in a single line. In this model, only twoneighboring qubits can interact. This is the most challengingscenario. The effects of the LNN model on performance havebeen investigated for many relevant use cases, such as thequantum Fourier transform [33], [34], Shor’s algorithm [35],[36], and adders [37].

Beals et al. [38] provided algorithms for efficiently movingand addressing quantum memory in parallel. These implythat the standard circuit model can be simulated with lowoverhead by a more realistic model of a distributed quan-tum computer. The authors show that for an LNN N -qubit

6 VOLUME 4, 2016

(a) By swapping the entanglement at the intermediate nodes – namely, quantum processor #2 – it is possible to distributea Bell state between remote processors – namely, processors #1 and #3 – even if they are not adjacent, i.e., they are notdirectly connected through a quantum link. Hence, entanglement swapping enhances the network connectivity through virtualquantum links.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

Q0

Q1

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

Q′′0

Q′′1

Q′′2 Q3

Q′′4

QUANTUM PROCESSOR #3

QUANTUM LINK QUANTUM LINK

VIRTUAL QUANTUM LINK

(a) By swapping the entanglement at the intermediate nodes – namely, quantum processor #2 – it is possible to distribute a Bell state betweenremote processors – namely, processors #1 and #3 – even if they are not adjacent, i.e., they are not directly connected through a quantum link.Hence, entanglement swapping enhances the network connectivity through virtual quantum links.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #3

|Φ+〉

Q3 Z Q3

|Φ+〉Q′0 H

|Φ+〉

Q′3

Q′′0 X Q′′0

(b) Entanglement swapping. A Bell state can bedistributed between remote processors by swap-ping the entanglement at an intermediate nodethrough local processing and classical communi-cation.

Q0

Q1

Q2

Q4

VIRTUAL QUANTUM PROCESSOR

Q′1

Q′2

Q′4

Q′′1

Q′′2 Q′′3

Q′4

LOCAL CNOT

TIME-CONSTRAINED REMOTE CNOT

TIME-CONSTRAINED REMOTE CNOT

VIA ENTANGLEMENT SWAPPING

(c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue linesdenote remote CNOTs between adjacent processors, whereas the dotted blue lines denoteremote CNOTs between distant processors achievable via entanglement swapping.

FIGURE 6: Augmented connectivity. Entanglement swapping increases the connectivity between physical qubits, with a numberof possible remote CNOTs that scales at least linearly with the number of processors.

nodes depending on the (classical) output of the Bell statemeasurement.

From the above, it becomes clear that entanglement swap-ping significantly increases the connectivity within the virtualquantum processor. As an instance, qubit Q4 in Figure 6a caninteract with just two qubits within the same processor vialocal CNOTs and two qubits within the neighbor processorvia remote CNOTs. However, it can interact with two morequbits – i.e., Q′′1 and Q′′2 – via entanglement swapping. Andthe higher is the number of available quantum processors,the higher is the number of possible interactions. Indeed, thenumber of additional interactions via entanglement swappingscales linearly with the number of available processors whenonly two communication qubits are available at each inter-mediate processor. If this constraint is relaxed, the numberof additional interactions via entanglement swapping scalesmore than linearly.

However, it must be acknowledged that the augmentedconnectivity provided by entanglement swapping does notcome for free. Indeed, entanglement swapping consumesthe Bell states stored within the communication qubits atthe intermediate processors. And the higher the number ofintermediate processors, the higher the number of consumed

Bell states.Hence, a trade-off between “augmented connectivity” and

“EPR cost” arises with entanglement swapping, and a dis-tributed quantum compiler must carefully account for thesepros and cons.

B. RELATED WORKMost quantum computer proposals are based on variationsof the nearest-neighbor, two-qubit, and concurrent execution(NTC) architecture [31]. Depending on the layout of qubits,there are three NTC architectures: 1D, 2D and 3D. The 1Dmodel, called Linear Nearest Neighbor (LNN) [32], consistsof qubits located in a single line. In this model, only twoneighboring qubits can interact. This is the most challengingscenario. The effects of the LNN model on performance havebeen investigated for many relevant use cases, such as thequantum Fourier transform [33], [34], Shor’s algorithm [35],[36], and adders [37].

Beals et al. [38] provided algorithms for efficiently movingand addressing quantum memory in parallel. These implythat the standard circuit model can be simulated with lowoverhead by a more realistic model of a distributed quan-tum computer. The authors show that for an LNN N -qubit

6 VOLUME 4, 2016

(b) Entanglement swapping. A Bell statecan be distributed between remote proces-sors by swapping the entanglement at anintermediate node through local processingand classical communication.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

Q0

Q1

Q2 Q3

Q4

QUANTUM PROCESSOR #1

Q′0

Q′1

Q′2 Q′3

Q′4

QUANTUM PROCESSOR #2

Q′′0

Q′′1

Q′′2 Q3

Q′′4

QUANTUM PROCESSOR #3

QUANTUM LINK QUANTUM LINK

VIRTUAL QUANTUM LINK

(a) By swapping the entanglement at the intermediate nodes – namely, quantum processor #2 – it is possible to distribute a Bell state betweenremote processors – namely, processors #1 and #3 – even if they are not adjacent, i.e., they are not directly connected through a quantum link.Hence, entanglement swapping enhances the network connectivity through virtual quantum links.

QUANTUM PROCESSOR #1

QUANTUM PROCESSOR #3

|Φ+〉

Q3 Z Q3

|Φ+〉Q′0 H

|Φ+〉

Q′3

Q′′0 X Q′′0

(b) Entanglement swapping. A Bell state can bedistributed between remote processors by swap-ping the entanglement at an intermediate nodethrough local processing and classical communi-cation.

Q0

Q1

Q2

Q4

VIRTUAL QUANTUM PROCESSOR

Q′1

Q′2

Q′4

Q′′1

Q′′2 Q′′3

Q′4

LOCAL CNOT

TIME-CONSTRAINED REMOTE CNOT

TIME-CONSTRAINED REMOTE CNOT

VIA ENTANGLEMENT SWAPPING

(c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue linesdenote remote CNOTs between adjacent processors, whereas the dotted blue lines denoteremote CNOTs between distant processors achievable via entanglement swapping.

FIGURE 6: Augmented connectivity. Entanglement swapping increases the connectivity between physical qubits, with a numberof possible remote CNOTs that scales at least linearly with the number of processors.

nodes depending on the (classical) output of the Bell statemeasurement.

From the above, it becomes clear that entanglement swap-ping significantly increases the connectivity within the virtualquantum processor. As an instance, qubit Q4 in Figure 6a caninteract with just two qubits within the same processor vialocal CNOTs and two qubits within the neighbor processorvia remote CNOTs. However, it can interact with two morequbits – i.e., Q′′1 and Q′′2 – via entanglement swapping. Andthe higher is the number of available quantum processors,the higher is the number of possible interactions. Indeed, thenumber of additional interactions via entanglement swappingscales linearly with the number of available processors whenonly two communication qubits are available at each inter-mediate processor. If this constraint is relaxed, the numberof additional interactions via entanglement swapping scalesmore than linearly.

However, it must be acknowledged that the augmentedconnectivity provided by entanglement swapping does notcome for free. Indeed, entanglement swapping consumesthe Bell states stored within the communication qubits atthe intermediate processors. And the higher the number ofintermediate processors, the higher the number of consumed

Bell states.Hence, a trade-off between “augmented connectivity” and

“EPR cost” arises with entanglement swapping, and a dis-tributed quantum compiler must carefully account for thesepros and cons.

B. RELATED WORKMost quantum computer proposals are based on variationsof the nearest-neighbor, two-qubit, and concurrent execution(NTC) architecture [31]. Depending on the layout of qubits,there are three NTC architectures: 1D, 2D and 3D. The 1Dmodel, called Linear Nearest Neighbor (LNN) [32], consistsof qubits located in a single line. In this model, only twoneighboring qubits can interact. This is the most challengingscenario. The effects of the LNN model on performance havebeen investigated for many relevant use cases, such as thequantum Fourier transform [33], [34], Shor’s algorithm [35],[36], and adders [37].

Beals et al. [38] provided algorithms for efficiently movingand addressing quantum memory in parallel. These implythat the standard circuit model can be simulated with lowoverhead by a more realistic model of a distributed quan-tum computer. The authors show that for an LNN N -qubit

6 VOLUME 4, 2016

(c) Dynamic coupling map for the network topology shown in Figure 6a. Thesolid blue lines denote remote CNOTs between adjacent processors, whereas thedotted blue lines denote remote CNOTs between distant processors achievable viaentanglement swapping.

Figure 6: Augmented connectivity. Entanglement swapping increases the connectivity between physical qubits, witha number of possible remote CNOTs that scales at least linearly with the number of processors.

cussed in Section 1.

Zomorodi-Moghadam et al. [9] proposed a general ap-proach, based on the Kernighan-Lin algorithm for graphpartitioning, to optimize the number of teleportationsfor a DQC consisting of two spatially separated andlong-distance quantum subsystems. The same authorsproposed also an approach based on dynamic program-ming [40].

Andres-Martınez and Heunen [41] proposed an ap-proach that may distribute circuits across any number ofquantum devices. The main idea is to turn the quantumcircuit into a hypergraph, then find a partitioning thatminimizes the number of cuts, as each cut correspondsto a Bell state shared across two QPUs by means ofcommunication qubits. The partitioning problem is ad-dressed by means of the KaHyPar solver [42]. The pro-posed solution has some drawbacks, in particular that

there is no way to define the number of communicationqubits of each QPU. In the software implementation ofthe algorithm, the number of available communicationqubits is unlimited and cannot be constrained.

4 Compiler Design and OverheadBounds

As discussed in Section 3, several additional constraintsarise with the shift from single-processor to distributedquantum compiling. And given that single-processorquantum compiling has been already proved to be NP-complete [13], it is reasonable to expect that optimaldistributed quantum compiling is an even harder chal-lenging task.

For this reason, in the following, we take a completely

7

Page 8: Compiler Design for Distributed Quantum Computing

different approach. Specifically, we aim at designing ageneral-purpose, efficient and effective compiler for dis-tributed quantum computing.

General-purpose because our compiler does not re-quire any particular assumption on the quantum circuitto be compiled.

Efficient because – as proved in Section 5.1 – our com-piler is computationally efficient, exhibiting a polyno-mial time complexity that grows polynomially with thenumber of logical qubits and linearly with the depth ofthe quantum circuit to be compiled.

Effective because – as proved in Section 4.2 – our com-piler assures a polynomial worst-case overhead, in termsof both: i) depth of the compiled quantum circuit andii) number of calls to the costliest and most challengingtask, i.e., the link entanglement generation.

4.1 System Model

We consider the worst-case scenario shown in Figure 7.More in detail, we assume that only one data qubit isavailable at each quantum processor6. The rationale forthis choice is as follows. Whenever multiple data qubitsare available at a single quantum processor, a local CNOTcan be executed between these data qubits without in-curring in any overhead induced by the distributed com-putation. Conversely, with just one data qubit availableat each processor, each and every CNOT within the quan-tum circuit must be mapped into a remote CNOT, andhence the overhead induced by the distributed compu-tation is the highest possible.

Furthermore, we assume that the quantum processorsare interconnected through a one-dimensional nearest-neighbor topology, as shown in Figure 7. Again, therationale for this choice is to consider the worst-casescenario in terms of overhead induced by the distributedcomputation. In fact, the considered topology is char-acterized by the lowest possible number of communica-tions qubits – i.e., 2n−2 with n denoting the number ofquantum processors – since the removal of any communi-cation qubit would disconnect the network into two dis-joint subsets of quantum processors. And the quantumprocessors are arranged in a line – rather than in a star –to maximize both the number of non-adjacent quantumprocessors and the maximum distance – in terms of hops– between two non-adjacent quantum processors.

From the above, it becomes clear that the consid-ered architecture represents the worst-case scenario interms of overhead induced by the distributed computa-tion. Hence, the actual overhead induced by any real-world architecture will be always upper-bounded by thecommunication overhead induced by the considered ar-chitecture.

6Clearly, the total number of data qubits within the distributedarchitecture must be greater than the number of logical qubitswithin the quantum circuit to be compiled.

Clearly, we need to choose a metric for measuring theoverhead induced by the distributed computation. Asdiscussed in Section 2.1, there exists a general consen-sus on circuit depth as a key performance metric of cir-cuit compilation. Hence, in the following, we measurethe overhead in terms of number of additional layers re-quired to distribute the computation of a single layer inthe original quantum circuit. Furthermore, we also eval-uate the overhead in terms of how many calls to the linkentanglement generation process are required from thecompiling algorithm.

4.2 Basic Strategies for DistributingCNOTs

Let us consider a single layer of the original n-qubitquantum circuit. Clearly, the number of CNOTs in eachlayer is lower or equal to n

2 , given that at most n2 gates

can be executed simultaneously (and thus belong to thesame layer) by operating on different pairs of qubits.

As discussed in Section 4.1, we aim at consideringthe worst-case scenario in terms of overhead inducedby the distributed computing architecture. Hence, eachCNOT within the quantum circuit – given that it oper-ates on physical qubits stored at different processors asdiscussed in Section 4.1 – is a remote CNOT. As a conse-quence, the compiler must map at most n

2 remote CNOTsin each layer.

Entanglement Swapping Based Strategy

The first strategy for implementing remote CNOTs isbased on the entanglement swapping technique discussedin Section 3.1 and shown in Figure 6.

Accordingly, each remote CNOT is implemented byfirstly generating link entanglement [43] among neigh-bor nodes. To this aim, different techniques for entan-glement generation can be employed, depending on theparticulars of the underlying qubit technology [20]. Nev-ertheless, link entanglements can be simultaneously gen-erated, given that each processor is equipped with twocommunication qubits. Once generated, the entangle-ment is simultaneously7 swapped at intermediate nodesso that a Bell state is distributed between the two re-mote processors and, finally, the remote CNOT is obtainedas shown in Figure 4b.

The entanglement swapping based strategy is outlinedin Figure 8b in terms of basic tasks. Within the figure,

7In general, the capability to generate (and to re-generate, oncedepleted) and distribute entangled Bell states through differentlinks in parallel depends on the quantum resources available, i.e.,both the number of communication qubits at each processor andthe inter-connection (shared bus vs. point-to-point) among thecommunication qubits. Differently, the possibility to simultane-ously swap the entanglement at the intermediate nodes dependsonly on classical resources, i.e., the possibility to simultaneouslytransfer classical information.

8

Page 9: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc1,2

QUANTUM PROCESSOR #1

Qc2,1 Qd

2Qc

2,3

QUANTUM PROCESSOR #2

Qc3,2 Qd

3Qc

3,4

QUANTUM PROCESSOR #3

Qcn,n−1 Qd

n

QUANTUM PROCESSOR #N

QUANTUM LINK QUANTUM LINK QUANTUM LINK QUANTUM LINK

FIGURE 7: Worst-case scenario in terms of overhead induced by the distributed computation: the quantum processors areinterconnected through a one-dimensional nearest-neighbor topology, and only one data qubit is available at each quantumprocessor. Intra-processor coupling between communication qubits omitted for the sake of simplicity.

LAYER #i LAYER #i+1

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

SECOND CNOT IN LAYER #i+1

Processor #1LinkEnt LinkEnt

Processor #2LinkEnt

EntSwap

LinkEntProcessor #3

LinkEnt LinkEnt

EntSwap

Processor #4LinkEnt LinkEnt

EntSwap

Processor #5LinkEnt LinkEnt

EntSwap

Processor #6

(b) Compiled quantum circuit.

FIGURE 8: Entanglement swapping strategy. Each remote CNOT in Figure 8a requires two preliminary tasks: i) Link Entanglement,for distributing the entanglement between neighbor nodes, and, ii) the Entanglement Swapping, for entangling the two remoteprocessors involved within the CNOT. Clearly, the swapping task is omitted whenever the CNOT operates between data qubitsstored at processors that are neighbor within the network topology, as for the i-th layer.

Entanglement Swapping Based StrategyThe first strategy for implementing remote CNOTs is basedon the entanglement swapping technique discussed in Sec-tion III-A and shown in Figure 6.

Accordingly, each remote CNOT is implemented by firstlygenerating link entanglement [43] among neighbor nodes.To this aim, different techniques for entanglement genera-tion can be employed, depending on the particulars of theunderlying qubit technology [20]. Nevertheless, link entan-glements can be simultaneously generated, given that eachprocessor is equipped with two communication qubits. Oncegenerated, the entanglement is simultaneously7 swapped atintermediate nodes so that a Bell state is distributed betweenthe two remote processors and, finally, the remote CNOT isobtained as shown in Figure 4b.

The entanglement swapping based strategy is outlined inFigure 8b in terms of basic tasks. Within the figure, theparticulars of each task are omitted for the sake of clarity.For instance, entanglement swapping – although depicted asa single block – is indeed obtained with a quantum circuitcomposed by three layers as shown in Figure 6b. Similarly,the link entanglement generation requires a quantum circuitwith a depth equal or greater than two, depending on the par-

7In general, the capability to generate (and to re-generate, once depleted)and distribute entangled Bell states through different links in parallel de-pends on the quantum resources available, i.e., both the number of commu-nication qubits at each processor and the inter-connection (shared bus vs.point-to-point) among the communication qubits. Differently, the possibilityto simultaneously swap the entanglement at the intermediate nodes dependsonly on classical resources, i.e., the possibility to simultaneously transferclassical information.

ticulars of the quantum technology underlying entanglementgeneration and distribution [20].

Nevertheless, the figure8 provides a clear intuition of both:i) the sequentiality constraints between the different tasks,and ii) the parallelism achievable within each task. Specif-ically, whenever the CNOTs overlaps9 within the networktopology (as for the CNOTs of the layer #i+1 in Figure 8a),they must be executed sequentially. Differently, CNOTs thatdon’t overlap (as for the CNOTs of the i-th layer in Figure 8a)can be executed simultaneously. Since we are interested inassessing the worst-case overhead induced by distributedcomputation, in the following we consider the worst-casescenario in which all the CNOTs of an arbitrary layer of thequantum circuit overlap within the network topology. Hence,we have that the depth overhead of the entanglement swap-ping based strategy does not exceed the following depth:

n

2des (2)

8We note that – for the sake of simplicity – in Figure 8b we simplymapped the j-th logical qubit Qj of layer #i in Figure 8a onto the j + 1-thprocessor, ignoring so any optimization achievable with a proper mappingof the logical qubits of the quantum circuit onto the physical qubits of thequantum processor.

9The term “overlap” indicates the case when the execution of the con-sidered CNOTs involves overlapping sets of intermediate processors as aconsequence of the constraint we imposed on the network topology of having2n − 2 communication qubits. With reference to the example in Figure 8a,the CNOT between Q0 and Q2 in the layer #i+1 overlaps with the CNOTbetween Q1 and Q4, being the communication qubits at the processors #1and #2 needed to both of them. Differently, the CNOT between Q3 and Q5

does not overlap with CNOT between Q0 and Q2 and, hence, they can beperformed in parallel.

8 VOLUME 4, 2016

Figure 7: Worst-case scenario in terms of overhead induced by the distributed computation: the quantum processorsare interconnected through a one-dimensional nearest-neighbor topology, and only one data qubit is available at eachquantum processor. Intra-processor coupling between communication qubits omitted for the sake of simplicity.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc1,2

QUANTUM PROCESSOR #1

Qc2,1 Qd

2Qc

2,3

QUANTUM PROCESSOR #2

Qc3,2 Qd

3Qc

3,4

QUANTUM PROCESSOR #3

Qcn,n−1 Qd

n

QUANTUM PROCESSOR #N

QUANTUM LINK QUANTUM LINK QUANTUM LINK QUANTUM LINK

FIGURE 7: Worst-case scenario in terms of overhead induced by the distributed computation: the quantum processors areinterconnected through a one-dimensional nearest-neighbor topology, and only one data qubit is available at each quantumprocessor. Intra-processor coupling between communication qubits omitted for the sake of simplicity.

LAYER #i LAYER #i+1

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

SECOND CNOT IN LAYER #i+1

Processor #1LinkEnt LinkEnt

Processor #2LinkEnt

EntSwap

LinkEntProcessor #3

LinkEnt LinkEnt

EntSwap

Processor #4LinkEnt LinkEnt

EntSwap

Processor #5LinkEnt LinkEnt

EntSwap

Processor #6

(b) Compiled quantum circuit.

FIGURE 8: Entanglement swapping strategy. Each remote CNOT in Figure 8a requires two preliminary tasks: i) Link Entanglement,for distributing the entanglement between neighbor nodes, and, ii) the Entanglement Swapping, for entangling the two remoteprocessors involved within the CNOT. Clearly, the swapping task is omitted whenever the CNOT operates between data qubitsstored at processors that are neighbor within the network topology, as for the i-th layer.

Entanglement Swapping Based StrategyThe first strategy for implementing remote CNOTs is basedon the entanglement swapping technique discussed in Sec-tion III-A and shown in Figure 6.

Accordingly, each remote CNOT is implemented by firstlygenerating link entanglement [43] among neighbor nodes.To this aim, different techniques for entanglement genera-tion can be employed, depending on the particulars of theunderlying qubit technology [20]. Nevertheless, link entan-glements can be simultaneously generated, given that eachprocessor is equipped with two communication qubits. Oncegenerated, the entanglement is simultaneously7 swapped atintermediate nodes so that a Bell state is distributed betweenthe two remote processors and, finally, the remote CNOT isobtained as shown in Figure 4b.

The entanglement swapping based strategy is outlined inFigure 8b in terms of basic tasks. Within the figure, theparticulars of each task are omitted for the sake of clarity.For instance, entanglement swapping – although depicted asa single block – is indeed obtained with a quantum circuitcomposed by three layers as shown in Figure 6b. Similarly,the link entanglement generation requires a quantum circuitwith a depth equal or greater than two, depending on the par-

7In general, the capability to generate (and to re-generate, once depleted)and distribute entangled Bell states through different links in parallel de-pends on the quantum resources available, i.e., both the number of commu-nication qubits at each processor and the inter-connection (shared bus vs.point-to-point) among the communication qubits. Differently, the possibilityto simultaneously swap the entanglement at the intermediate nodes dependsonly on classical resources, i.e., the possibility to simultaneously transferclassical information.

ticulars of the quantum technology underlying entanglementgeneration and distribution [20].

Nevertheless, the figure8 provides a clear intuition of both:i) the sequentiality constraints between the different tasks,and ii) the parallelism achievable within each task. Specif-ically, whenever the CNOTs overlaps9 within the networktopology (as for the CNOTs of the layer #i+1 in Figure 8a),they must be executed sequentially. Differently, CNOTs thatdon’t overlap (as for the CNOTs of the i-th layer in Figure 8a)can be executed simultaneously. Since we are interested inassessing the worst-case overhead induced by distributedcomputation, in the following we consider the worst-casescenario in which all the CNOTs of an arbitrary layer of thequantum circuit overlap within the network topology. Hence,we have that the depth overhead of the entanglement swap-ping based strategy does not exceed the following depth:

n

2des (2)

8We note that – for the sake of simplicity – in Figure 8b we simplymapped the j-th logical qubit Qj of layer #i in Figure 8a onto the j + 1-thprocessor, ignoring so any optimization achievable with a proper mappingof the logical qubits of the quantum circuit onto the physical qubits of thequantum processor.

9The term “overlap” indicates the case when the execution of the con-sidered CNOTs involves overlapping sets of intermediate processors as aconsequence of the constraint we imposed on the network topology of having2n − 2 communication qubits. With reference to the example in Figure 8a,the CNOT between Q0 and Q2 in the layer #i+1 overlaps with the CNOTbetween Q1 and Q4, being the communication qubits at the processors #1and #2 needed to both of them. Differently, the CNOT between Q3 and Q5

does not overlap with CNOT between Q0 and Q2 and, hence, they can beperformed in parallel.

8 VOLUME 4, 2016

(a) Original quantum cir-cuit.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc1,2

QUANTUM PROCESSOR #1

Qc2,1 Qd

2Qc

2,3

QUANTUM PROCESSOR #2

Qc3,2 Qd

3Qc

3,4

QUANTUM PROCESSOR #3

Qcn,n−1 Qd

n

QUANTUM PROCESSOR #N

QUANTUM LINK QUANTUM LINK QUANTUM LINK QUANTUM LINK

FIGURE 7: Worst-case scenario in terms of overhead induced by the distributed computation: the quantum processors areinterconnected through a one-dimensional nearest-neighbor topology, and only one data qubit is available at each quantumprocessor. Intra-processor coupling between communication qubits omitted for the sake of simplicity.

LAYER #i LAYER #i+1

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

SECOND CNOT IN LAYER #i+1

Processor #1LinkEnt LinkEnt

Processor #2LinkEnt

EntSwap

LinkEntProcessor #3

LinkEnt LinkEnt

EntSwap

Processor #4LinkEnt LinkEnt

EntSwap

Processor #5LinkEnt LinkEnt

EntSwap

Processor #6

(b) Compiled quantum circuit.

FIGURE 8: Entanglement swapping strategy. Each remote CNOT in Figure 8a requires two preliminary tasks: i) Link Entanglement,for distributing the entanglement between neighbor nodes, and, ii) the Entanglement Swapping, for entangling the two remoteprocessors involved within the CNOT. Clearly, the swapping task is omitted whenever the CNOT operates between data qubitsstored at processors that are neighbor within the network topology, as for the i-th layer.

Entanglement Swapping Based StrategyThe first strategy for implementing remote CNOTs is basedon the entanglement swapping technique discussed in Sec-tion III-A and shown in Figure 6.

Accordingly, each remote CNOT is implemented by firstlygenerating link entanglement [43] among neighbor nodes.To this aim, different techniques for entanglement genera-tion can be employed, depending on the particulars of theunderlying qubit technology [20]. Nevertheless, link entan-glements can be simultaneously generated, given that eachprocessor is equipped with two communication qubits. Oncegenerated, the entanglement is simultaneously7 swapped atintermediate nodes so that a Bell state is distributed betweenthe two remote processors and, finally, the remote CNOT isobtained as shown in Figure 4b.

The entanglement swapping based strategy is outlined inFigure 8b in terms of basic tasks. Within the figure, theparticulars of each task are omitted for the sake of clarity.For instance, entanglement swapping – although depicted asa single block – is indeed obtained with a quantum circuitcomposed by three layers as shown in Figure 6b. Similarly,the link entanglement generation requires a quantum circuitwith a depth equal or greater than two, depending on the par-

7In general, the capability to generate (and to re-generate, once depleted)and distribute entangled Bell states through different links in parallel de-pends on the quantum resources available, i.e., both the number of commu-nication qubits at each processor and the inter-connection (shared bus vs.point-to-point) among the communication qubits. Differently, the possibilityto simultaneously swap the entanglement at the intermediate nodes dependsonly on classical resources, i.e., the possibility to simultaneously transferclassical information.

ticulars of the quantum technology underlying entanglementgeneration and distribution [20].

Nevertheless, the figure8 provides a clear intuition of both:i) the sequentiality constraints between the different tasks,and ii) the parallelism achievable within each task. Specif-ically, whenever the CNOTs overlaps9 within the networktopology (as for the CNOTs of the layer #i+1 in Figure 8a),they must be executed sequentially. Differently, CNOTs thatdon’t overlap (as for the CNOTs of the i-th layer in Figure 8a)can be executed simultaneously. Since we are interested inassessing the worst-case overhead induced by distributedcomputation, in the following we consider the worst-casescenario in which all the CNOTs of an arbitrary layer of thequantum circuit overlap within the network topology. Hence,we have that the depth overhead of the entanglement swap-ping based strategy does not exceed the following depth:

n

2des (2)

8We note that – for the sake of simplicity – in Figure 8b we simplymapped the j-th logical qubit Qj of layer #i in Figure 8a onto the j + 1-thprocessor, ignoring so any optimization achievable with a proper mappingof the logical qubits of the quantum circuit onto the physical qubits of thequantum processor.

9The term “overlap” indicates the case when the execution of the con-sidered CNOTs involves overlapping sets of intermediate processors as aconsequence of the constraint we imposed on the network topology of having2n − 2 communication qubits. With reference to the example in Figure 8a,the CNOT between Q0 and Q2 in the layer #i+1 overlaps with the CNOTbetween Q1 and Q4, being the communication qubits at the processors #1and #2 needed to both of them. Differently, the CNOT between Q3 and Q5

does not overlap with CNOT between Q0 and Q2 and, hence, they can beperformed in parallel.

8 VOLUME 4, 2016

(b) Compiled quantum circuit.

Figure 8: Entanglement swapping strategy. Each remote CNOT in Figure 8a requires two preliminary tasks: i) LinkEntanglement, for distributing the entanglement between neighbor nodes, and, ii) the Entanglement Swapping, forentangling the two remote processors involved within the CNOT. Clearly, the swapping task is omitted whenever theCNOT operates between data qubits stored at processors that are neighbor within the network topology, as for thei-th layer.

the particulars of each task are omitted for the sake ofclarity. For instance, entanglement swapping – althoughdepicted as a single block – is indeed obtained with aquantum circuit composed by three layers as shown inFigure 6b. Similarly, the link entanglement generationrequires a quantum circuit with a depth equal or greaterthan two, depending on the particulars of the quantumtechnology underlying entanglement generation and dis-tribution [20].

Nevertheless, the figure8 provides a clear intuition ofboth: i) the sequentiality constraints between the differ-ent tasks, and ii) the parallelism achievable within eachtask. Specifically, whenever the CNOTs overlaps9 within

8We note that – for the sake of simplicity – in Figure 8b wesimply mapped the j-th logical qubit Qj of layer #i in Figure 8aonto the j+1-th processor, ignoring so any optimization achievablewith a proper mapping of the logical qubits of the quantum circuitonto the physical qubits of the quantum processor.

9The term “overlap” indicates the case when the execution ofthe considered CNOTs involves overlapping sets of intermediateprocessors as a consequence of the constraint we imposed on thenetwork topology of having 2n − 2 communication qubits. Withreference to the example in Figure 8a, the CNOT between Q0 andQ2 in the layer #i+1 overlaps with the CNOT between Q1 andQ4, being the communication qubits at the processors #1 and #2needed to both of them. Differently, the CNOT between Q3 andQ5 does not overlap with CNOT between Q0 and Q2 and, hence,they can be performed in parallel.

the network topology (as for the CNOTs of the layer #i+1in Figure 8a), they must be executed sequentially. Dif-ferently, CNOTs that don’t overlap (as for the CNOTs ofthe i-th layer in Figure 8a) can be executed simultane-ously. Since we are interested in assessing the worst-caseoverhead induced by distributed computation, in the fol-lowing we consider the worst-case scenario in which allthe CNOTs of an arbitrary layer of the quantum circuitoverlap within the network topology. Hence, we havethat the depth overhead of the entanglement swappingbased strategy does not exceed the following depth:

n

2des (2)

where n denotes the number of logical qubits within thequantum circuit and des is a constant factor (indepen-dent from the characteristics of the original quantumcircuit) given by:

des = cle + cbsm + ccx (3)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entangle-ment swapping task, respectively, and ccx denoting thenumber of layers required to perform a remote CNOT oncethe Bell state has been distributed between two proces-sors. The actual values of cle, cbsm and ccx depend onthe particulars of the underlying hardware technology.

9

Page 10: Compiler Design for Distributed Quantum Computing

From (2), we have that the actual depth of an ar-bitrary d-depth quantum circuit compiled with the en-tanglement swapping based strategy will be always lowerthan n

2 d, neglecting the constant des. Hence, thedepth overhead grows linearly with the number of logicalqubits of the quantum circuit to be compiled. And giventhat this result holds for the worst-case scenario (one-data-qubit processors arranged in a one-dimensional net-work topology), the actual depth overhead induced byany arbitrary distributed architecture will be alwaysupper-bounded by (2).

We further note that classical information must be ex-changed between the quantum processors. For instance,the entanglement swapping task requires the transmis-sion of classical information (i.e., the measurement out-put) throughout the quantum network. Hence, in case oflong-distance quantum processors, the actual executiontime of the compiled quantum circuit may be affectedby the latency induced by the classical communications.

Finally, due to the complex and stochastic nature ofthe physical mechanisms underlying quantum entangle-ment [43], several attempts can be required for establish-ing a link entanglement, and this may impact as well theexecution time of the compiled quantum circuit. Indeed,we should consider link entanglement as the critical taskfor distributed quantum computation, given that the re-maining tasks require only local quantum operations andclassical communications. From this perspective, the en-tanglement swapping based strategy requires at most n

2repetitions of the link entanglement task, regardless ofthe original quantum circuit and regardless of the char-acteristics of the network topology underlying the dis-tributed computing architecture.

Data-Qubit Swapping Based Strategy

The entanglement swapping based strategy takes full ad-vantage of the augmented connectivity enabled by thecommunication qubits – as discussed in Section 3.1 –to allow interactions between remote processors withineach layer.

Nevertheless, whenever the original quantum circuitpresents repetitions of the same CNOT interaction patternbetween logical qubits – as for layers #i+1 and #i+2 inFigure 9a – a more elaborate strategy – based on movingthe data qubits – can provide better performance.

The strategy is shown in Figure 9: the objective isto arrange (i.e., to swap) the data qubits within thequantum processors so that eventually each CNOT of theoriginal layer operates on qubits stored at neighbor pro-cessors within the network topology.

Intuitively, the strategy goal can be modeled as an ar-ray sorting problem. Indeed, similarly to classical sort-ing, the n data qubits (representing the values to besorted) must be ordered within the network topology(representing an array with size equal or greater than

Algorithm 1 Data-Qubit SwappingInput: n-qubit circuit layer L with mod(n,4) = 0 and n

2CNOTs

Output: layer L with each CNOT operating on neighbor qubits

1: function Sort(L)2: if ∃ CNOT(qi, qj) with i, j ≤ n

2 then3: // ∃ CNOT(qk, ql) with k, l > n

24: Swap(qi+1, qj)5: Swap(qk+1, ql)6: L = L \ {qi, qi+1, qk, qk+1}7: else8: // ∃ CNOT(qn

2, ql) with l > n

29: // and ∃ CNOT(qi, ql−1) with i < n/2

10: Swap(qn2, ql−1)

11: Swap(qi, qn2−1)

12: L = L \ {qn2−1, qn

2, ql−1, ql}

13: end if14: if L 6= ∅ then15: Sort(L)16: end if17: end function

n ). However, differently from classical sorting whereany couple of values can be swapped regardless fromtheir position within the array, with the data-qubit swap-ping the constraints arising from the underlying networktopology must be carefully taken into account. To thisaim, by taking advantage of the sorting network the-ory, it is easy to model the network topology constraintsthrough the notion of insertion network (or, equiva-lently, bubble network). As a consequence, the overalldepth of the equivalent quantum circuit grows with thenumber n of logical qubits as [44]:

2n− 3 (4)

instead of a logarithmic log n depth factor as for classicalsorting.

Nevertheless, sorting networks – and in general clas-sical sorting – are based on the assumption that thereexists a total (monotonic) order over the array elements.Hence, there exists a unique solution to the sorting prob-lem. Conversely, the data-qubit swapping based strat-egy admits several equivalent solutions for the arrangingproblem, as exemplified in Figure 10. We now formalizethese considerations with the following theorem.

Theorem 1. Let us consider the i-th layer of an arbi-trary n-qubit quantum circuit. The depth of the cor-responding compiled quantum circuit, obtained throughthe data-qubit swapping based strategy, does not exceedthe following depth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independentfrom the characteristics of the original quantum circuit)

10

Page 11: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

LAYER #i LAYER #i+1 LAYER #i+2

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

QUBIT SWAPPING

Processor #1LinkEnt

Processor #2LinkEnt LinkEnt LinkEnt

Processor #3LinkEnt

Processor #4LinkEnt LinkEnt LinkEnt

Processor #5LinkEnt

Processor #6

(b) Compiled quantum circuit.

FIGURE 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageouswhenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it isnecessary to swap data qubits stored at processors that are not neighbor within the network topology.

LAYER #i

Q0

Q1

Q2

Q3

(a) Original quantumcircuit.

Q0

Q3

Q1

Q2

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(b) Possible arrange-ment.

Q1

Q2

Q0

Q3

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(c) Alternative equiva-lent arrangement.

FIGURE 10: Data-qubit swapping based strategy: equivalentmappings. Both Figure 10b and Figure 10c represent validarrangements of the data qubits within the remote quantumprocessors so that each CNOT in Figure 10a operates onqubits stored at processors neighbor within the network topol-ogy.

swapping based strategy, does not exceed the followingdepth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independent from thecharacteristics of the original quantum circuit) given by:

dqs =3 (cle + cbsm + ccx) (6)d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entanglementswapping task, respectively, and ccx denoting the number oflayers required to perform the remote CNOTs once the Bellstate has been distributed between two processors.

Proof: The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is called atmost n

4 times, and ii) after these calls to SORT(·), all theCNOTs, by acting on qubits stored at neighbor processors,can be executed at once through link entanglement followedby local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).

In the former case (lines 3-6), there exists a CNOT actingwithin the first half portion – i.e., the first n

2 logical qubits– of the original layer. Since we are considering the worst-case – namely, a layer with n

2 CNOTs – then we have thatthere exists at least one CNOT acting on the last half portion– i.e., the last n

2 logical qubits – of the original layer. Hence,the two CNOTs do not overlap and two simultaneous SWAPoperations can be executed – one in each half portion of theoriginal quantum circuit – as shown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTs acton qubits stored at neighbor processors. Within the previousdiagram, as well as in Algorithm 1, we omitted some minorparticulars for the sake of simplicity. For instance, we implic-itly assumed that i (and k) is odd – i.e., mod(i, 2) = 1 – sothat qj must be swapped with qi+1. Clearly whether i shouldbe even, qj must be swapped with qi−1.In the latter case (lines 8-12), each and every CNOT acts ontwo logical qubits belonging to both the half portions of theoriginal layer. Let us consider, with no lack of generality, theCNOT acting on the n

2 -th logical qubit (i.e., the last qubit ofthe first half portion) and let us denote as ql the second qubiton which such a CNOT operates. Since we are considering alayer with n

2 CNOTs, then we have that there exists a CNOTacting on the l − 1-th qubit. By denoting as qi the secondqubit on which such a CNOT operates, it follows i < n

2 .Although the two CNOTs overlap, by properly selecting twoSWAP operations as shown here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executed sothat, within the compiled circuit – the two CNOTs act on

10 VOLUME 4, 2016

(a) Original quantum circuit.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

LAYER #i LAYER #i+1 LAYER #i+2

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

QUBIT SWAPPING

Processor #1LinkEnt

Processor #2LinkEnt LinkEnt LinkEnt

Processor #3LinkEnt

Processor #4LinkEnt LinkEnt LinkEnt

Processor #5LinkEnt

Processor #6

(b) Compiled quantum circuit.

FIGURE 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageouswhenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it isnecessary to swap data qubits stored at processors that are not neighbor within the network topology.

LAYER #i

Q0

Q1

Q2

Q3

(a) Original quantumcircuit.

Q0

Q3

Q1

Q2

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(b) Possible arrange-ment.

Q1

Q2

Q0

Q3

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(c) Alternative equiva-lent arrangement.

FIGURE 10: Data-qubit swapping based strategy: equivalentmappings. Both Figure 10b and Figure 10c represent validarrangements of the data qubits within the remote quantumprocessors so that each CNOT in Figure 10a operates onqubits stored at processors neighbor within the network topol-ogy.

swapping based strategy, does not exceed the followingdepth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independent from thecharacteristics of the original quantum circuit) given by:

dqs =3 (cle + cbsm + ccx) (6)d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entanglementswapping task, respectively, and ccx denoting the number oflayers required to perform the remote CNOTs once the Bellstate has been distributed between two processors.

Proof: The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is called atmost n

4 times, and ii) after these calls to SORT(·), all theCNOTs, by acting on qubits stored at neighbor processors,can be executed at once through link entanglement followedby local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).

In the former case (lines 3-6), there exists a CNOT actingwithin the first half portion – i.e., the first n

2 logical qubits– of the original layer. Since we are considering the worst-case – namely, a layer with n

2 CNOTs – then we have thatthere exists at least one CNOT acting on the last half portion– i.e., the last n

2 logical qubits – of the original layer. Hence,the two CNOTs do not overlap and two simultaneous SWAPoperations can be executed – one in each half portion of theoriginal quantum circuit – as shown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTs acton qubits stored at neighbor processors. Within the previousdiagram, as well as in Algorithm 1, we omitted some minorparticulars for the sake of simplicity. For instance, we implic-itly assumed that i (and k) is odd – i.e., mod(i, 2) = 1 – sothat qj must be swapped with qi+1. Clearly whether i shouldbe even, qj must be swapped with qi−1.In the latter case (lines 8-12), each and every CNOT acts ontwo logical qubits belonging to both the half portions of theoriginal layer. Let us consider, with no lack of generality, theCNOT acting on the n

2 -th logical qubit (i.e., the last qubit ofthe first half portion) and let us denote as ql the second qubiton which such a CNOT operates. Since we are considering alayer with n

2 CNOTs, then we have that there exists a CNOTacting on the l − 1-th qubit. By denoting as qi the secondqubit on which such a CNOT operates, it follows i < n

2 .Although the two CNOTs overlap, by properly selecting twoSWAP operations as shown here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executed sothat, within the compiled circuit – the two CNOTs act on

10 VOLUME 4, 2016

(b) Compiled quantum circuit.

Figure 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can beadvantageous whenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, asfor layers #i+1 and #i+2 in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as inFigure 8b) are needed whenever it is necessary to swap data qubits stored at processors that are not neighbor withinthe network topology.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

LAYER #i LAYER #i+1 LAYER #i+2

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

QUBIT SWAPPING

Processor #1LinkEnt

Processor #2LinkEnt LinkEnt LinkEnt

Processor #3LinkEnt

Processor #4LinkEnt LinkEnt LinkEnt

Processor #5LinkEnt

Processor #6

(b) Compiled quantum circuit.

FIGURE 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageouswhenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it isnecessary to swap data qubits stored at processors that are not neighbor within the network topology.

LAYER #i

Q0

Q1

Q2

Q3

(a) Original quantumcircuit.

Q0

Q3

Q1

Q2

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(b) Possible arrange-ment.

Q1

Q2

Q0

Q3

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(c) Alternative equiva-lent arrangement.

FIGURE 10: Data-qubit swapping based strategy: equivalentmappings. Both Figure 10b and Figure 10c represent validarrangements of the data qubits within the remote quantumprocessors so that each CNOT in Figure 10a operates onqubits stored at processors neighbor within the network topol-ogy.

swapping based strategy, does not exceed the followingdepth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independent from thecharacteristics of the original quantum circuit) given by:

dqs =3 (cle + cbsm + ccx) (6)d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entanglementswapping task, respectively, and ccx denoting the number oflayers required to perform the remote CNOTs once the Bellstate has been distributed between two processors.

Proof: The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is called atmost n

4 times, and ii) after these calls to SORT(·), all theCNOTs, by acting on qubits stored at neighbor processors,can be executed at once through link entanglement followedby local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).

In the former case (lines 3-6), there exists a CNOT actingwithin the first half portion – i.e., the first n

2 logical qubits– of the original layer. Since we are considering the worst-case – namely, a layer with n

2 CNOTs – then we have thatthere exists at least one CNOT acting on the last half portion– i.e., the last n

2 logical qubits – of the original layer. Hence,the two CNOTs do not overlap and two simultaneous SWAPoperations can be executed – one in each half portion of theoriginal quantum circuit – as shown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTs acton qubits stored at neighbor processors. Within the previousdiagram, as well as in Algorithm 1, we omitted some minorparticulars for the sake of simplicity. For instance, we implic-itly assumed that i (and k) is odd – i.e., mod(i, 2) = 1 – sothat qj must be swapped with qi+1. Clearly whether i shouldbe even, qj must be swapped with qi−1.In the latter case (lines 8-12), each and every CNOT acts ontwo logical qubits belonging to both the half portions of theoriginal layer. Let us consider, with no lack of generality, theCNOT acting on the n

2 -th logical qubit (i.e., the last qubit ofthe first half portion) and let us denote as ql the second qubiton which such a CNOT operates. Since we are considering alayer with n

2 CNOTs, then we have that there exists a CNOTacting on the l − 1-th qubit. By denoting as qi the secondqubit on which such a CNOT operates, it follows i < n

2 .Although the two CNOTs overlap, by properly selecting twoSWAP operations as shown here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executed sothat, within the compiled circuit – the two CNOTs act on

10 VOLUME 4, 2016

(a) Original quan-tum circuit.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

LAYER #i LAYER #i+1 LAYER #i+2

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

QUBIT SWAPPING

Processor #1LinkEnt

Processor #2LinkEnt LinkEnt LinkEnt

Processor #3LinkEnt

Processor #4LinkEnt LinkEnt LinkEnt

Processor #5LinkEnt

Processor #6

(b) Compiled quantum circuit.

FIGURE 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageouswhenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it isnecessary to swap data qubits stored at processors that are not neighbor within the network topology.

LAYER #i

Q0

Q1

Q2

Q3

(a) Original quantumcircuit.

Q0

Q3

Q1

Q2

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(b) Possible arrange-ment.

Q1

Q2

Q0

Q3

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(c) Alternative equiva-lent arrangement.

FIGURE 10: Data-qubit swapping based strategy: equivalentmappings. Both Figure 10b and Figure 10c represent validarrangements of the data qubits within the remote quantumprocessors so that each CNOT in Figure 10a operates onqubits stored at processors neighbor within the network topol-ogy.

swapping based strategy, does not exceed the followingdepth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independent from thecharacteristics of the original quantum circuit) given by:

dqs =3 (cle + cbsm + ccx) (6)d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entanglementswapping task, respectively, and ccx denoting the number oflayers required to perform the remote CNOTs once the Bellstate has been distributed between two processors.

Proof: The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is called atmost n

4 times, and ii) after these calls to SORT(·), all theCNOTs, by acting on qubits stored at neighbor processors,can be executed at once through link entanglement followedby local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).

In the former case (lines 3-6), there exists a CNOT actingwithin the first half portion – i.e., the first n

2 logical qubits– of the original layer. Since we are considering the worst-case – namely, a layer with n

2 CNOTs – then we have thatthere exists at least one CNOT acting on the last half portion– i.e., the last n

2 logical qubits – of the original layer. Hence,the two CNOTs do not overlap and two simultaneous SWAPoperations can be executed – one in each half portion of theoriginal quantum circuit – as shown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTs acton qubits stored at neighbor processors. Within the previousdiagram, as well as in Algorithm 1, we omitted some minorparticulars for the sake of simplicity. For instance, we implic-itly assumed that i (and k) is odd – i.e., mod(i, 2) = 1 – sothat qj must be swapped with qi+1. Clearly whether i shouldbe even, qj must be swapped with qi−1.In the latter case (lines 8-12), each and every CNOT acts ontwo logical qubits belonging to both the half portions of theoriginal layer. Let us consider, with no lack of generality, theCNOT acting on the n

2 -th logical qubit (i.e., the last qubit ofthe first half portion) and let us denote as ql the second qubiton which such a CNOT operates. Since we are considering alayer with n

2 CNOTs, then we have that there exists a CNOTacting on the l − 1-th qubit. By denoting as qi the secondqubit on which such a CNOT operates, it follows i < n

2 .Although the two CNOTs overlap, by properly selecting twoSWAP operations as shown here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executed sothat, within the compiled circuit – the two CNOTs act on

10 VOLUME 4, 2016

(b) Possible ar-rangement.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

LAYER #i LAYER #i+1 LAYER #i+2

Q0

Q1

Q2

Q3

Q4

Q5

(a) Original quantum circuit.

QUBIT SWAPPING

Processor #1LinkEnt

Processor #2LinkEnt LinkEnt LinkEnt

Processor #3LinkEnt

Processor #4LinkEnt LinkEnt LinkEnt

Processor #5LinkEnt

Processor #6

(b) Compiled quantum circuit.

FIGURE 9: Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageouswhenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it isnecessary to swap data qubits stored at processors that are not neighbor within the network topology.

LAYER #i

Q0

Q1

Q2

Q3

(a) Original quantumcircuit.

Q0

Q3

Q1

Q2

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(b) Possible arrange-ment.

Q1

Q2

Q0

Q3

Proc. #1

Proc. #2

Proc. #3

Proc. #4

(c) Alternative equiva-lent arrangement.

FIGURE 10: Data-qubit swapping based strategy: equivalentmappings. Both Figure 10b and Figure 10c represent validarrangements of the data qubits within the remote quantumprocessors so that each CNOT in Figure 10a operates onqubits stored at processors neighbor within the network topol-ogy.

swapping based strategy, does not exceed the followingdepth:

n

4dqs + d′qs (5)

where dqs and d′qs are constant factors (independent from thecharacteristics of the original quantum circuit) given by:

dqs =3 (cle + cbsm + ccx) (6)d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entanglementswapping task, respectively, and ccx denoting the number oflayers required to perform the remote CNOTs once the Bellstate has been distributed between two processors.

Proof: The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is called atmost n

4 times, and ii) after these calls to SORT(·), all theCNOTs, by acting on qubits stored at neighbor processors,can be executed at once through link entanglement followedby local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).

In the former case (lines 3-6), there exists a CNOT actingwithin the first half portion – i.e., the first n

2 logical qubits– of the original layer. Since we are considering the worst-case – namely, a layer with n

2 CNOTs – then we have thatthere exists at least one CNOT acting on the last half portion– i.e., the last n

2 logical qubits – of the original layer. Hence,the two CNOTs do not overlap and two simultaneous SWAPoperations can be executed – one in each half portion of theoriginal quantum circuit – as shown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTs acton qubits stored at neighbor processors. Within the previousdiagram, as well as in Algorithm 1, we omitted some minorparticulars for the sake of simplicity. For instance, we implic-itly assumed that i (and k) is odd – i.e., mod(i, 2) = 1 – sothat qj must be swapped with qi+1. Clearly whether i shouldbe even, qj must be swapped with qi−1.In the latter case (lines 8-12), each and every CNOT acts ontwo logical qubits belonging to both the half portions of theoriginal layer. Let us consider, with no lack of generality, theCNOT acting on the n

2 -th logical qubit (i.e., the last qubit ofthe first half portion) and let us denote as ql the second qubiton which such a CNOT operates. Since we are considering alayer with n

2 CNOTs, then we have that there exists a CNOTacting on the l − 1-th qubit. By denoting as qi the secondqubit on which such a CNOT operates, it follows i < n

2 .Although the two CNOTs overlap, by properly selecting twoSWAP operations as shown here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executed sothat, within the compiled circuit – the two CNOTs act on

10 VOLUME 4, 2016

(c) Alternativeequivalent arrange-ment.

Figure 10: Data-qubit swapping based strategy : equiv-alent mappings. Both Figure 10b and Figure 10c rep-resent valid arrangements of the data qubits within theremote quantum processors so that each CNOT in Fig-ure 10a operates on qubits stored at processors neighborwithin the network topology.

given by:

dqs =3 (cle + cbsm + ccx) (6)

d′qs = cle + ccx (7)

with cle and cbsm denoting the number of layers requiredto perform the link entanglement task and the entangle-ment swapping task, respectively, and ccx denoting thenumber of layers required to perform the remote CNOTsonce the Bell state has been distributed between twoprocessors.

Proof. The proof easily follows by recognizing that: i)the function SORT(·), defined in Algorithm 1, is calledat most n

4 times, and ii) after these calls to SORT(·), allthe CNOTs, by acting on qubits stored at neighbor proces-sors, can be executed at once through link entanglement

followed by local operations, as shown in Figure 4b.More specifically, in each call we have two disjoint cases(line 2).In the former case (lines 3-6), there exists a CNOT act-ing within the first half portion – i.e., the first n

2 logicalqubits – of the original layer. Since we are consideringthe worst-case – namely, a layer with n

2 CNOTs – then wehave that there exists at least one CNOT acting on thelast half portion – i.e., the last n

2 logical qubits – of theoriginal layer. Hence, the two CNOTs do not overlap andtwo simultaneous SWAP operations can be executed – onein each half portion of the original quantum circuit – asshown here:

q1 . . . qi qi+1 . . . qj . . . qn2. . . qk qk+1 . . . ql . . . qn

q1 . . . qi qj . . . qi+1 . . . qn2. . . qk ql . . . qk+1 . . . qn

so that, within the compiled circuit – the two CNOTsact on qubits stored at neighbor processors. Within theprevious diagram, as well as in Algorithm 1, we omittedsome minor particulars for the sake of simplicity. For in-stance, we implicitly assumed that i (and k) is odd – i.e.,mod(i, 2) = 1 – so that qj must be swapped with qi+1.Clearly whether i should be even, qj must be swappedwith qi−1.In the latter case (lines 8-12), each and every CNOT actson two logical qubits belonging to both the half portionsof the original layer. Let us consider, with no lack of gen-erality, the CNOT acting on the n

2 -th logical qubit (i.e.,the last qubit of the first half portion) and let us denoteas ql the second qubit on which such a CNOT operates.Since we are considering a layer with n

2 CNOTs, then wehave that there exists a CNOT acting on the l−1-th qubit.By denoting as qi the second qubit on which such a CNOT

operates, it follows i < n2 . Although the two CNOTs over-

lap, by properly selecting two SWAP operations as shown

11

Page 12: Compiler Design for Distributed Quantum Computing

here:

q1 . . . qi . . . qn2−1 qn

2. . . ql−1 ql . . . qn

q1 . . . qn2−1 . . . qi ql−1 . . . qn

2ql . . . qn

we have that the SWAPs can be simultaneously executedso that, within the compiled circuit – the two CNOTsact on qubits stored at neighbor processors. Regardlessof which case holds, each call to the SORT(·) functioncompiles two CNOTs. By recalling that at most n

2 CNOTsare present in a layer, the thesis follows. �

From (5), we have that the actual depth of an arbi-trary d-depth quantum circuit compiled with the data-qubit swapping based strategy will be always lower thann4 d, neglecting the constant factors. Hence, the depthoverhead is asymptotically lower than the overhead in-duced by the entanglement swapping based strategy.However, an explicit comparison between the two strate-gies depends on the particulars of the underlying qubittechnology through the exact expressions of des and dqs.Furthermore, it depends also on the repetitions of thesame CNOT interaction patterns within the original quan-tum circuit.

As regards to the number of repetitions of the linkentanglement task, in general it depends on the charac-teristics of the underlying qubit technology as discussedin Section 4.3. With reference to the IBM quantum pro-cessors, where a SWAP operation is obtained through asequence of three CNOTs, from (5)-(7) we have that thedata-qubit swapping based strategy requires at most 2nrepetitions of the link entanglement task, regardless ofthe original quantum circuit and regardless of the char-acteristics of the network topology underlying the dis-tributed computing architecture.

4.3 Discussion

As already mentioned above, the performance of the twostrategies firmly depends on the particulars of the un-derlying hardware technology through the parametersdes, dqs and d′qs given in equations (3), (6) and (7).

To better clarify this point, let us consider dqs, whichinherently denotes the cost for a SWAP operation. Hav-ing assumed through the manuscript the CNOT being thefundamental multi-qubit gate, a single remote SWAP op-eration can be obtained through three remote CNOTs asin Figure 3b. And this is the rationale for the constantfactor equal to 3 in (6), which accounts for the cost ofthree remote CNOTs.

Clearly, by changing the assumptions on the underly-ing hardware technology, the expression of dqs changesas well. For instance, photonic technology can providethe SWAP gate as the native operation [45], and in sucha case dqs is equal to 1. Nevertheless the main result

– i.e., equation (5) – continues to hold. Furthermore,whenever the SWAP gate is the native operation, a sin-gle CNOT can be obtained through two consecutive SWAPsinterleaved by single-qubit operations [46]. Hence, theexpression of des must change accordingly but the mainresult – i.e., equation (3) – continues to hold as well.

Indeed it is worthwhile to note that, despite the dif-ferences between the performance of the two strategies,there exists a one-to-one mapping between the strate-gies. Specifically, there exists an admissible transforma-tion allowing to map the compiled circuit obtained witha strategy into the compiled circuit obtained with theother strategy. And the corresponding computationaltask exhibits a polynomial-time complexity10 for everyoriginal circuit.

5 Performance Analysis

Here we perform a performance analysis for the compilerdesign conducted in Section 4.

More in detail, in Section 5.1 we illustrate the algo-rithmic implementation of the compiler, proving so itsattractive feature – a polynomial time complexity thatgrows polynomially with the number of logical qubits andlinearly with the depth of the quantum circuit to be com-piled – from a computational perspective. Then, in Sec-tion 5.2 we validate the theoretical upper bounds on thenumber of layers that result from compiling a layer ofremote CNOTs, derived in Section 4.2, against an ex-tensive set of medium-size quantum circuits of practicalinterest. Finally, with Sections 5.3 and 5.4 we concludethe performance analysis through an unfair – as clearlyshown with Figure 14 – comparison with the state-of-theart for two different network topologies.

5.1 Compiler Implementation

We implemented the strategies discussed in Section 4.2in Python, using Qiskit [26] as the development frame-work. Given a quantum circuit described in the QASMformat, the compiler proceeds to instantiate a dis-tributed architecture that mimics the one described inSection 4.1. To model the worst-case scenario depictedin Figure 7, each QPU has one data qubit and two com-munication qubits. Moreover, each QPU has two neigh-bor QPUs, with the exception of the outer QPUs thathave one neighbor QPU only. Each qubit of the circuitis assigned to the data qubit of one QPU.

10Given that both the strategies exhibit a polynomial-time com-putational complexity as proved in Section 5.1.

12

Page 13: Compiler Design for Distributed Quantum Computing

Figure 11: Workflow of the proposed compiler.

Algorithm 2 CompileFrontLayerInput: the front layer F to be compiled, the executed gates Euntil nowOutput: the executed gates E updated with the compiled frontlayer F

if data-qubit swapping based strategy thenprepare interactions vectorswaps← SortPairs(interactions)for all swap ∈ swaps do // Perform remote SWAPs

if two EPR pairs between QPUs thenTeleport(swap.q1, swap.q2)

elseRemoteCnot(swap.q1, swap.q2)RemoteCnot(swap.q2, swap.q1)RemoteCnot(swap.q1, swap.q2)

end ifend for

end iffor all gate ∈ F do

RemoteCnot(gate.control, gate.target)end forreturn E

function RemoteCnot(control, target)if control and target are not on neighboring QPUs then

qpu0 ← QPUs of controlqpun ← QPUs of targetfor i ∈ {0, .., n− 2} do

add link entanglementbetween qpui and qpui+1 to E

end foradd entanglement swap between qpu0 and qpun to E

end ifadd link entanglement between qpu0 and qpun to Eadd a remote CNOT between control and target to E

end function

function Teleport(q1, q2)if q1 and q2 are not on neighboring QPUs then

qpu0 ← QPUs of q1qpun ← QPUs of q2for i ∈ {0, .., n− 2} do

add link entanglementbetween qpui and qpui+1 to Eusing two available EPR pairs between QPUs

end fore01, e02 ← the two communication qubits at qpu0

en1, en2 ← the two communication qubits at qpun

add entanglement swap between e01 at qpu0 anden1 at qpun to E

add entanglement swap between e02 at qpu0 anden2 at qpun to E

add quantum teleportation between q1 and en1 to Eadd quantum teleportation between q2 and e02 to Eadd local swap between q1 and e02add local swap between q2 and en1

end ifend function

The compilation process is summarized and illus-trated in Figure 11. Reading the circuit from left toright, the front layer, i.e., a layer comprising only CNOT

gates that can be executed in parallel, is updated. Tothis end, one-qubit gates are immediately mapped tothe compiled circuit, while CNOT gates are added tothe front layer. This is done until all logical qubits areinterested by a CNOT or there are no more CNOT gatesthat can be executed in parallel in the current frontlayer. Then, the front layer is compiled, rendering allcurrently involved CNOT gates executable. This processis repeated until no more front layers can be computed,meaning that all the circuit gates have been mapped tothe distributed architecture.

Front layer compilation is based on Algorithm 2,where one can choose between the two strategies dis-cussed in Section 4.2. When the entanglement swap-ping based strategy is adopted, remote CNOT gates pre-ceded by entanglement swapping are applied wheneverthe involved QPUs are not neighbors. Note that to per-form entanglement swapping, as well as remote CNOTs,we need to generate link entanglement between all in-volved QPUs.

Regarding the more advanced data-qubit swappingbased strategy, the list representing the interaction be-tween qubits is prepared and then Algorithm 3 is usedto compute the data swap operations needed to reorderthe qubits. Referring to Figure 12a, the list representingthe interaction between qubits before swapping would be[112323], while the sorted list after swapping would be[112233], meaning that no overlapping CNOTs are left inthe layer.

The data-qubit swapping routine operates on lists witha number of elements – i.e., a number of logical qubits– that is a multiple of 4. For this reason, a couple ofdummy values, set to −1, may have to be added to theend of the list whenever the number of CNOTs is odd.Given that the algorithm searches for swaps form leftto right, the dummy couple at the end will be left un-touched. After creating masks to keep track of alreadyswapped qubits, the algorithm finds all necessary swapsin exactly n

4 steps, where n is the number of elements inthe list, i.e., the number of logical qubits interested byCNOTs.

13

Page 14: Compiler Design for Distributed Quantum Computing

Algorithm 3 SortPairsInput: a vector v representing CNOT interactions between qubitspairs, ex. [1 2 2 3 1 3]Output: the swaps to perform

// f.e.o. stands for ’first element of’if length(v) mod4 6= 0 then // length(v) must be multiple of4

add {−1,−1} to v // Add a dummy pair at the endend ifswaps← ∅ // List of SWAPs to performn← length(v)mask1← {0, ..., n

2− 1}

mask2← {n2, ..., n− 1}

cycle← 0while cycle < n/4 do

w ← indexes of unique values in vindex last← {0, ..., |v[mask1]|} \ wif index last = ∅ then

swap v[mask1[end]] andf.e.o. s.t. v[mask2] 6= v[mask1[end]]

update swapsend ifv,mask1, swaps← swap(v, mask1, swaps)v,mask2, swaps← swap(v, mask2, swaps)

end whilereturn swaps

function swap(v,mask, swaps)w ← indexes of unique values in vindex last← f.e.o. {0, ..., |v[mask]|} \ windex first← f.e.o. v[mask] s.t. = v[mask[index last]]if index last is odd then

index swap← index last− 1else

index swap← index last + 1end ifswap v[mask[index swap]] and v[mask[index first]]update swapsremove mask[index swap] and mask[index last] from maskreturn v,mask, swaps

end function

Taking into account the topology of the worst-casescenario shown in Figure 7, to perform a swap at leastthree remote CNOTs are needed. After all the data-qubit swaps have been applied, the necessary remoteCNOTs have to be placed. Ideally, this last step wouldinvolve only remote CNOTs between neighbor QPUs,but when not all qubits of the circuit are involved inthe front layer, such as Q1 in Figure 12a, it may bestill necessary to perform some entanglement swappingoperations, as shown in Figure 12b. This is because,when sorting pairs, our algorithm does not take into ac-count QPUs that are not involved in the current frontlayer. Consequently, the implementation of the data-qubit swapping based strategy is actually an hybrid be-tween the two strategies described in Sections 4.2, avoid-ing data-qubit swapping if not necessary and resortinginstead to entanglement swapping.

As shown in Figure 12, starting from the layer in Fig-ure 12a, following the data-qubit swapping based strategy,in Figure 12b we apply a remote SWAP between qubit 3and qubit 4, and then we can execute all remote CNOTs

in parallel.

Moving from the topology illustrated in Figure 7 tothe one in Figure 13, we can devise a different strategyto execute swaps by exploiting the augmented connec-tivity, described in Algorithm 2. Specifically, to per-form a SWAP between QPU Qx and Qy, our compileruses one communication qubit at each QPU as a buffermemory and exploits quantum teleportation to move thestate of a data qubit from Qx to Qy and vice versa, inparallel. After the two parallel teleportations, we canexecute two parallel local SWAPs between communica-tion qubits and data qubits at each QPU to effectivelyachieve data qubit swapping between Qx and Qy. Thisis clearly beneficial as it only requires one layer of linkentanglement generation between the interested QPUs,unlike the previous scenario where we needed three layerof link entanglement generation to perform three remoteCNOTs.

The difference between data-qubit swapping and en-tanglement swapping lies in the fact that, if the subse-quent front layers are similar (in terms of CNOT interac-tion pattern) to the one just compiled, not much dataswapping and very little entanglement swapping (giventhat the front layers involve most of the qubits) will benecessary to compile those layers.

Regarding the computational complexity, Alg. 4 readsthe circuit from left to right while updating the frontlayer, so its computational complexity is O(d), whered is the depth of the circuit, i.e., the number of lay-ers. Alg. 2 loops through all necessary swaps found byAlg. 3, and applies remote CNOTs. As we must take intoaccount for possible entanglement swapping operationsbetween QPUs, applying a remote CNOT has a compu-tational complexity of O(n), with n being the numberof QPUs. Given that the we need at most n/4 swaps,the computational complexity of Alg. 2 turns out to beO(n2). The overall computational complexity of dis-tributing a circuit is therefore O(dn2).

5.2 Compiling Overhead Validation

We validate the theoretical upper bounds (derived inSection 4.2) on the number of layers that result fromcompiling a layer of remote CNOTs, considering an ex-tensive set of medium-size quantum circuits (the largestones requiring 16 qubits, with the exception of a GHZand two random circuit with 20 qubits). Specifically,we consider quantum circuits that are publicly avail-able and widely adopted for testing quantum compil-ers [24, 25]11, plus a few quantum chemistry circuits forthe implementation of the unitary quantum Coupled-Cluster [47, 48] (qUCC) and the RYRZ heuristic [49]wavefunction Ansatze.

11https://github.com/deeptechlabs/quantum_compiler_

optim/tree/master/examples

14

Page 15: Compiler Design for Distributed Quantum Computing

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc′1,2

Qc′′1,2

QUANTUM PROCESSOR #1

Qc′2,1

Qc′′2,1

Qd2

Qc′2,3

Qc′′2,3

QUANTUM PROCESSOR #2

Qc′3,2

Qc′′3,2

Qd3

Qc′3,4

Qc′3,4

QUANTUM PROCESSOR #3

Qc′n,n−1

Qc′′n,n−1

Qdn

QUANTUM PROCESSOR #N

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

FIGURE 12: Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, butneighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

CNOT 1

CNOT 2

CNOT 3

Q0

Q1

Q2

Q3

Q4

Q5

Q6

(a) A layer with threeparallel CNOTs.

CNOT 1

CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5RemoteSwap

Processor #6LinkEnt

Processor #7

(b) The layer distributed with the Sort strategy.

CNOT 1

REMOTESWAP CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5LinkEnt LinkEnt LinkEnt

Processor #6LinkEnt

Processor #7

(c) The distributed layer after decomposing the Remote SWAP.

FIGURE 13: Compilation of a layer with three parallel CNOTs using the sorting strategy.

each QPU memory to one data qubit but we could not imposeany limitation over the number of communication qubits perQPU nor the topology of the quantum network, which is al-ways assumed to be an hypercube. Such a topology is shownin Figure 14b, where one can clearly see the connectivitydisparity, compared to the worst-case topology illustrated inFigure 14a. In Figures 15-17, the results of the comparativeevaluation are plotted.

Figure 15 shows the number of Link Generation Layers,i.e., the number of layers in the distributed circuit thatcomprise only Bell state generation and distribution betweenQPUs. It is clear that our compiler requires less layers of linkgeneration for almost every tested circuit. Having fewer lay-ers of link generation reduces the time that data qubits haveto spend idle, i.e., possibly affected by decoherence, whilewaiting to be able to perform remote operations. Figure 15calso shows that choosing the data-qubit swapping basedstrategy against the entanglement swapping based strategyis generally the best choice.

Regarding the depth of the distributed circuits, illustratedin Figure 16, we can see that for some circuits our compilerclearly outperforms Andrés-Martínez’s one. Figure 16c con-firms that the data-qubit swapping based strategy is betterthan the entanglement swapping based one.

With respect to the number of generated Bell states, de-picted in Figure 17, it can be observed that our compilerconsumes a fair amount of Bell states compared to Andrés-Martínez’s compiler. This was expected and it is mostly due

to the fact that Andrés-Martínez’s compiler benefits from ahypercube network topology, as showed in Figure 14b. Usingsuch a topology means that, in most cases, a link betweentwo QPUs can be directly generated with just one Bell state,which is in direct contrast with the worst-case topology thatwe used, depicted in Figure 14a. In our linear topology, togenerate a link between two non neighboring QPUs we needto perform entanglement swapping, generating and consum-ing Bell states shared by all the others QPUs in between.Nevertheless, the time needed to generate one Bell stateshould be the same as to generate n Bell states in parallel, andwith Figure 15 we already showed that our compiler usuallyneeds fewer layers of link generation.

It is worthwhile to note that with network topologiesdifferent from the considered one – namely, the worst-casetopology where each CNOT must be mapped into a remoteCNOT – new optimization challenges arise. As instance,whenever multiple data qubits are available at each (or somenodes), only a subset of CNOTs must be mapped into remoteoperations. Hence, the compiler should be able to optimizechoices such as which sub-circuit should be mapped to whichnode or which CNOT should be performed via communica-tion qubits. Clearly, the optimal strategies – as well as themetrics to measure the optimality of a strategy – representsinteresting open problems.

14 VOLUME 4, 2016

(a) A layer with threeparallel CNOTs.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc′1,2

Qc′′1,2

QUANTUM PROCESSOR #1

Qc′2,1

Qc′′2,1

Qd2

Qc′2,3

Qc′′2,3

QUANTUM PROCESSOR #2

Qc′3,2

Qc′′3,2

Qd3

Qc′3,4

Qc′3,4

QUANTUM PROCESSOR #3

Qc′n,n−1

Qc′′n,n−1

Qdn

QUANTUM PROCESSOR #N

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

FIGURE 12: Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, butneighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

CNOT 1

CNOT 2

CNOT 3

Q0

Q1

Q2

Q3

Q4

Q5

Q6

(a) A layer with threeparallel CNOTs.

CNOT 1

CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5RemoteSwap

Processor #6LinkEnt

Processor #7

(b) The layer distributed with the Sort strategy.

CNOT 1

REMOTESWAP CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5LinkEnt LinkEnt LinkEnt

Processor #6LinkEnt

Processor #7

(c) The distributed layer after decomposing the Remote SWAP.

FIGURE 13: Compilation of a layer with three parallel CNOTs using the sorting strategy.

each QPU memory to one data qubit but we could not imposeany limitation over the number of communication qubits perQPU nor the topology of the quantum network, which is al-ways assumed to be an hypercube. Such a topology is shownin Figure 14b, where one can clearly see the connectivitydisparity, compared to the worst-case topology illustrated inFigure 14a. In Figures 15-17, the results of the comparativeevaluation are plotted.

Figure 15 shows the number of Link Generation Layers,i.e., the number of layers in the distributed circuit thatcomprise only Bell state generation and distribution betweenQPUs. It is clear that our compiler requires less layers of linkgeneration for almost every tested circuit. Having fewer lay-ers of link generation reduces the time that data qubits haveto spend idle, i.e., possibly affected by decoherence, whilewaiting to be able to perform remote operations. Figure 15calso shows that choosing the data-qubit swapping basedstrategy against the entanglement swapping based strategyis generally the best choice.

Regarding the depth of the distributed circuits, illustratedin Figure 16, we can see that for some circuits our compilerclearly outperforms Andrés-Martínez’s one. Figure 16c con-firms that the data-qubit swapping based strategy is betterthan the entanglement swapping based one.

With respect to the number of generated Bell states, de-picted in Figure 17, it can be observed that our compilerconsumes a fair amount of Bell states compared to Andrés-Martínez’s compiler. This was expected and it is mostly due

to the fact that Andrés-Martínez’s compiler benefits from ahypercube network topology, as showed in Figure 14b. Usingsuch a topology means that, in most cases, a link betweentwo QPUs can be directly generated with just one Bell state,which is in direct contrast with the worst-case topology thatwe used, depicted in Figure 14a. In our linear topology, togenerate a link between two non neighboring QPUs we needto perform entanglement swapping, generating and consum-ing Bell states shared by all the others QPUs in between.Nevertheless, the time needed to generate one Bell stateshould be the same as to generate n Bell states in parallel, andwith Figure 15 we already showed that our compiler usuallyneeds fewer layers of link generation.

It is worthwhile to note that with network topologiesdifferent from the considered one – namely, the worst-casetopology where each CNOT must be mapped into a remoteCNOT – new optimization challenges arise. As instance,whenever multiple data qubits are available at each (or somenodes), only a subset of CNOTs must be mapped into remoteoperations. Hence, the compiler should be able to optimizechoices such as which sub-circuit should be mapped to whichnode or which CNOT should be performed via communica-tion qubits. Clearly, the optimal strategies – as well as themetrics to measure the optimality of a strategy – representsinteresting open problems.

14 VOLUME 4, 2016

(b) The layer distributed with the Sortstrategy.

Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc′1,2

Qc′′1,2

QUANTUM PROCESSOR #1

Qc′2,1

Qc′′2,1

Qd2

Qc′2,3

Qc′′2,3

QUANTUM PROCESSOR #2

Qc′3,2

Qc′′3,2

Qd3

Qc′3,4

Qc′3,4

QUANTUM PROCESSOR #3

Qc′n,n−1

Qc′′n,n−1

Qdn

QUANTUM PROCESSOR #N

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

FIGURE 12: Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, butneighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

CNOT 1

CNOT 2

CNOT 3

Q0

Q1

Q2

Q3

Q4

Q5

Q6

(a) A layer with threeparallel CNOTs.

CNOT 1

CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5RemoteSwap

Processor #6LinkEnt

Processor #7

(b) The layer distributed with the Sort strategy.

CNOT 1

REMOTESWAP CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5LinkEnt LinkEnt LinkEnt

Processor #6LinkEnt

Processor #7

(c) The distributed layer after decomposing the Remote SWAP.

FIGURE 13: Compilation of a layer with three parallel CNOTs using the sorting strategy.

each QPU memory to one data qubit but we could not imposeany limitation over the number of communication qubits perQPU nor the topology of the quantum network, which is al-ways assumed to be an hypercube. Such a topology is shownin Figure 14b, where one can clearly see the connectivitydisparity, compared to the worst-case topology illustrated inFigure 14a. In Figures 15-17, the results of the comparativeevaluation are plotted.

Figure 15 shows the number of Link Generation Layers,i.e., the number of layers in the distributed circuit thatcomprise only Bell state generation and distribution betweenQPUs. It is clear that our compiler requires less layers of linkgeneration for almost every tested circuit. Having fewer lay-ers of link generation reduces the time that data qubits haveto spend idle, i.e., possibly affected by decoherence, whilewaiting to be able to perform remote operations. Figure 15calso shows that choosing the data-qubit swapping basedstrategy against the entanglement swapping based strategyis generally the best choice.

Regarding the depth of the distributed circuits, illustratedin Figure 16, we can see that for some circuits our compilerclearly outperforms Andrés-Martínez’s one. Figure 16c con-firms that the data-qubit swapping based strategy is betterthan the entanglement swapping based one.

With respect to the number of generated Bell states, de-picted in Figure 17, it can be observed that our compilerconsumes a fair amount of Bell states compared to Andrés-Martínez’s compiler. This was expected and it is mostly due

to the fact that Andrés-Martínez’s compiler benefits from ahypercube network topology, as showed in Figure 14b. Usingsuch a topology means that, in most cases, a link betweentwo QPUs can be directly generated with just one Bell state,which is in direct contrast with the worst-case topology thatwe used, depicted in Figure 14a. In our linear topology, togenerate a link between two non neighboring QPUs we needto perform entanglement swapping, generating and consum-ing Bell states shared by all the others QPUs in between.Nevertheless, the time needed to generate one Bell stateshould be the same as to generate n Bell states in parallel, andwith Figure 15 we already showed that our compiler usuallyneeds fewer layers of link generation.

It is worthwhile to note that with network topologiesdifferent from the considered one – namely, the worst-casetopology where each CNOT must be mapped into a remoteCNOT – new optimization challenges arise. As instance,whenever multiple data qubits are available at each (or somenodes), only a subset of CNOTs must be mapped into remoteoperations. Hence, the compiler should be able to optimizechoices such as which sub-circuit should be mapped to whichnode or which CNOT should be performed via communica-tion qubits. Clearly, the optimal strategies – as well as themetrics to measure the optimality of a strategy – representsinteresting open problems.

14 VOLUME 4, 2016

(c) The distributed layer after decomposing the Remote SWAP.

Figure 12: Compilation of a layer with three parallel CNOTs using the sorting strategy.Journal of LATEX Class Files, Vol. .., No. 8, August 2015

. . . . . .Qd1

Qc′1,2

Qc′′1,2

QUANTUM PROCESSOR #1

Qc′2,1

Qc′′2,1

Qd2

Qc′2,3

Qc′′2,3

QUANTUM PROCESSOR #2

Qc′3,2

Qc′′3,2

Qd3

Qc′3,4

Qc′3,4

QUANTUM PROCESSOR #3

Qc′n,n−1

Qc′′n,n−1

Qdn

QUANTUM PROCESSOR #N

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

QUANTUM LINK

FIGURE 12: Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, butneighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

CNOT 1

CNOT 2

CNOT 3

Q0

Q1

Q2

Q3

Q4

Q5

Q6

(a) A layer with threeparallel CNOTs.

CNOT 1

CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5RemoteSwap

Processor #6LinkEnt

Processor #7

(b) The layer distributed with the Sort strategy.

CNOT 1

REMOTESWAP CNOT 2

CNOT 3

Processor #1LinkEnt

Processor #2LinkEnt

EntSwap

Processor #3

Processor #4LinkEnt

Processor #5LinkEnt LinkEnt LinkEnt

Processor #6LinkEnt

Processor #7

(c) The distributed layer after decomposing the Remote SWAP.

FIGURE 13: Compilation of a layer with three parallel CNOTs using the sorting strategy.

each QPU memory to one data qubit but we could not imposeany limitation over the number of communication qubits perQPU nor the topology of the quantum network, which is al-ways assumed to be an hypercube. Such a topology is shownin Figure 14b, where one can clearly see the connectivitydisparity, compared to the worst-case topology illustrated inFigure 14a. In Figures 15-17, the results of the comparativeevaluation are plotted.

Figure 15 shows the number of Link Generation Layers,i.e., the number of layers in the distributed circuit thatcomprise only Bell state generation and distribution betweenQPUs. It is clear that our compiler requires less layers of linkgeneration for almost every tested circuit. Having fewer lay-ers of link generation reduces the time that data qubits haveto spend idle, i.e., possibly affected by decoherence, whilewaiting to be able to perform remote operations. Figure 15calso shows that choosing the data-qubit swapping basedstrategy against the entanglement swapping based strategyis generally the best choice.

Regarding the depth of the distributed circuits, illustratedin Figure 16, we can see that for some circuits our compilerclearly outperforms Andrés-Martínez’s one. Figure 16c con-firms that the data-qubit swapping based strategy is betterthan the entanglement swapping based one.

With respect to the number of generated Bell states, de-picted in Figure 17, it can be observed that our compilerconsumes a fair amount of Bell states compared to Andrés-Martínez’s compiler. This was expected and it is mostly due

to the fact that Andrés-Martínez’s compiler benefits from ahypercube network topology, as showed in Figure 14b. Usingsuch a topology means that, in most cases, a link betweentwo QPUs can be directly generated with just one Bell state,which is in direct contrast with the worst-case topology thatwe used, depicted in Figure 14a. In our linear topology, togenerate a link between two non neighboring QPUs we needto perform entanglement swapping, generating and consum-ing Bell states shared by all the others QPUs in between.Nevertheless, the time needed to generate one Bell stateshould be the same as to generate n Bell states in parallel, andwith Figure 15 we already showed that our compiler usuallyneeds fewer layers of link generation.

It is worthwhile to note that with network topologiesdifferent from the considered one – namely, the worst-casetopology where each CNOT must be mapped into a remoteCNOT – new optimization challenges arise. As instance,whenever multiple data qubits are available at each (or somenodes), only a subset of CNOTs must be mapped into remoteoperations. Hence, the compiler should be able to optimizechoices such as which sub-circuit should be mapped to whichnode or which CNOT should be performed via communica-tion qubits. Clearly, the optimal strategies – as well as themetrics to measure the optimality of a strategy – representsinteresting open problems.

14 VOLUME 4, 2016

Figure 13: Improving over the worst-case scenario topology. Only one data qubit is available at each quantumprocessor, but neighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

circuit name # qubits n # CNOT layersEntanglement Swap Data-Qubit Swap

# CNOT layers ×n2 des compiled # layers compiled/theoretical # CNOT layers ×(n

4 dqs + d′qs) compiled # layers compiled/theoretical

4gt12-v1 89 16 88 2112 212 0.10 3344 233 0.074gt4-v0 73 16 160 3840 372 0.10 6080 399 0.074mod7-v1 96 16 65 1560 151 0.10 2470 155 0.069symml 195 16 12849 308376 34721 0.11 488262 32809 0.07adder 10 55 825 144 0.17 1100 187 0.17alu-v2 31 16 172 4128 459 0.11 6536 436 0.07ghz 20 20 19 570 56 0.10 893 56 0.06ghz 4 4 3 18 8 0.44 33 8 0.24H2 RYRZ 4 25 150 66 0.44 275 69 0.25H2 UCCSD 4 52 312 72 0.23 572 72 0.13H2O RYRZ 14 125 2625 971 0.37 3625 2040 0.56H2O UCCSD 14 12937 271677 16017 0.06 375173 16017 0.04ising model 16 16 20 480 31 0.06 760 31 0.04life 238 16 8356 200544 22391 0.11 317528 21073 0.07LiH RYRZ 12 105 1890 711 0.38 3045 1547 0.51LiH UCCSD 12 7264 130752 9216 0.07 210656 9216 0.04one-two-three-v2 100 16 29 696 76 0.11 1102 69 0.06Random 20q RYRZ 20 185 5550 1991 0.36 8695 4113 0.47Random 20q UCCSD 20 110497 3314910 130197 0.04 5193359 130197 0.03random1 n5 d5 5 15 90 69 0.77 165 53 0.32random2 n16 d16 16 48 1152 666 0.58 1824 588 0.32rd53 138 16 42 1008 116 0.12 1596 100 0.06root 255 16 5965 143160 16699 0.12 226670 15973 0.07sqn 258 16 3719 89256 9713 0.11 141322 9210 0.07sym9 146 16 91 2184 277 0.13 3458 254 0.07

Table 1: Validation of the theoretical upper bounds derived in Section 4.2 against a heterogeneous set of quantumcircuits. Each circuit is characterized by a number of qubits n and a number of CNOT layers, i.e., layers thatcomprise only CNOT gates. For each compiling strategy, there is a theoretical upper bound on the number of layersthat are necessary to realize the remote CNOTs and an actual number of layers resulting from the compilationprocess. The ratio between the latter and the former is also reported.

More into details, Table 1 reports a sample of theresults that have been collected by compiling the cir-cuits with both the entanglement swapping based and

the data-qubit swapping based strategies. Within the ta-ble, the first column shows the name of the circuit andthe second column shows the number n of logical qubits

15

Page 16: Compiler Design for Distributed Quantum Computing

Q₃

Q₄

Q₅

Q₆

Q₇Q₉

Q₁₃

Q₁₁

Q₁₂

Q₁₀Q₈

Q₁₄

Q₁₆Q₁₅

Q₁

Q₂

(a) Linear Nearest Neighbor topology.

Q₃

Q₄

Q₅

Q₆

Q₇Q₉

Q₁₃

Q₁₁

Q₁₂

Q₁₀Q₈

Q₁₄

Q₁₆Q₁₅

Q₁

Q₂

(b) Hypercube topology.

Figure 14: Comparison of the worst-case scenario linear nearest neighbor topology with the hypercube topologyassumed by the compiler proposed by Andres-Martınez et al. [41]. As shown in Figure 14a, for n = 16 QPUs thelongest path between two QPUs consists of n − 1 = 15 links, while with the hypercube topology in Figure 14b isonly about log2n = 3 links.

within the circuit. The third column shows the numberof CNOT layers in the original circuit – i.e., the un-compiled circuit. The fourth and the seventh columnsshow the theoretical upper bound of the depth of thecompiled CNOT layers – computed in agreement withequations (2) and (4), respectively – whereas the fifthand the eighth column shows the depth of the compiledCNOT layers. For computing the upper bound valuesand collecting the experimental results, we set the pa-rameters cle, cbsm and ccx in equations (3), (6) and (7)as unit factors, thus obtaining des = 3, dqs = 9 andd′qs = 2.

Table 1 clearly shows that the upper bounds on thenumber of layers that result from compiling the layers ofremote CNOTs are widely respected and hold, for all theconsidered examples. Indeed, by comparing the actualdepth with the theoretical one, it becomes evident theoverestimation of the bounds given in Section 4.2. Therationale for this lays in the number of CNOTs in eachlayer, which are usually significantly lower than whatassumed. For instance, let us consider the GHZ circuits,where each layer contains a single CNOT and hence thederived bounds – by assuming n/2 CNOTs in each layer– overestimate the depth. Nevertheless, the discrepancycan be easily fixed by substituting the n/2 factor withthe actual estimation on the average number of CNOTsin each layer with no loss of generality.

5.3 Experimental Results for the Worst-Case Topology

We compared our compiler with the one proposed byAndres-Martınez et al. [41], in respect of which we wereable to set each QPU memory to one data qubit but

we could not impose any limitation over the number ofcommunication qubits per QPU nor the topology of thequantum network, which is always assumed to be an hy-percube. Such a topology is shown in Figure 14b, whereone can clearly see the connectivity disparity, comparedto the worst-case topology illustrated in Figure 14a. InFigures 15-17, the results of the comparative evalua-tion are plotted. For a better readability of the figures,we omit data related to a 20 qubit random circuit thatpresents values far greater than the rest of the data set,for all compiling strategies including the state of the art(such data have been included in Table 1).

Figure 15 shows the number of Link Generation Lay-ers, i.e., the number of layers in the distributed circuitthat comprise only Bell state generation and distributionbetween QPUs. It is clear that our compiler requires lesslayers of link generation for almost every tested circuit.Having fewer layers of link generation reduces the timethat data qubits have to spend idle, i.e., possibly affectedby decoherence, while waiting to be able to perform re-mote operations. Figure 15c also shows that choosingthe data-qubit swapping based strategy against the en-tanglement swapping based strategy is generally the bestchoice.

Regarding the depth of the distributed circuits, illus-trated in Figure 16, we can see that for some circuitsour compiler clearly outperforms Andres-Martınez’s one.Figure 16c confirms that the data-qubit swapping basedstrategy is better than the entanglement swapping basedone.

With respect to the number of generated Bell states,depicted in Figure 17, it can be observed that our com-piler consumes a fair amount of Bell states compared toAndres-Martınez’s compiler. This was expected and it is

16

Page 17: Compiler Design for Distributed Quantum Computing

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Ent

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Martinez - Link Generation Layers

0

250

500

750

0

250

500

750

(a)

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Dat

a-Q

ubit

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Martinez - Link Generation Layers

0

250

500

750

0

250

500

750

(b)

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Dat

a-Q

ubit

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Ent Swap Based - Link Generation Layers

0

250

500

750

0

250

500

750

(c)

Figure 15: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler withAndres-Martınez’s compiler [41], in terms of link entanglement generation layers. Our compiler distributes circuitson the worst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andres-Martınez’s one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies arecharacterized by one data qubit per QPU.

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Ent

Swap

Bas

ed -

Dep

th

Martinez - Depth

0

250

500

750

1000

0

250

500

750

100

0

(a)

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Dat

a-Q

ubit

Swap

Bas

ed -

Dep

th

Martinez - Depth

0

250

500

750

1000

0

250

500

750

100

0

(b)

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Dat

a-Q

ubit

Swap

Bas

ed -

Dep

th

Ent Swap Based - Depth

0

250

500

750

1000

0

250

500

750

100

0(c)

Figure 16: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compilerwith Andres-Martınez’s compiler [41], in terms of circuit depth. Our compiler distributes circuits on the worst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andres-Martınez’s oneexploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterizedby one data qubit per QPU.

mostly due to the fact that Andres-Martınez’s compilerbenefits from a hypercube network topology, as showedin Figure 14b. Using such a topology means that, inmost cases, a link between two QPUs can be directlygenerated with just one Bell state, which is in directcontrast with the worst-case topology that we used, de-picted in Figure 14a. In our linear topology, to gener-ate a link between two non neighboring QPUs we needto perform entanglement swapping, generating and con-suming Bell states shared by all the others QPUs inbetween. Nevertheless, the time needed to generate oneBell state should be the same as to generate n Bell statesin parallel, and with Figure 15 we already showed that

our compiler usually needs fewer layers of link genera-tion.

It is worthwhile to note that with network topologiesdifferent from the considered one – namely, the worst-case topology where each CNOT must be mapped into aremote CNOT – new optimization challenges arise. Asinstance, whenever multiple data qubits are available ateach (or some nodes), only a subset of CNOTs must bemapped into remote operations. Hence, the compilershould be able to optimize choices such as which sub-circuit should be mapped to which node or which CNOT

should be performed via communication qubits. Clearly,the optimal strategies – as well as the metrics to measure

17

Page 18: Compiler Design for Distributed Quantum Computing

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Ent

Swap

Bas

ed -

EPR

Pai

rs

Martinez - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(a)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Dat

a-Q

ubit

Swap

Bas

ed -

EPR

Pai

rs

Martinez - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(b)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Dat

a-Q

ubit

Swap

Bas

ed -

EPR

Pai

rs

Ent Swap Based - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(c)

Figure 17: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compilerwith Andres-Martınez’s compiler [41], in terms of consumed EPR pairs. Our compiler distributes circuits on theworst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andres-Martınez’sone exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterizedby one data qubit per QPU.

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Ent

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Martinez - Link Generation Layers

0

250

500

750

0

250

500

750

(a)

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Dat

a-Q

ubit

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Martinez - Link Generation Layers

0

250

500

750

0

250

500

750

(b)

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

22000

24000

0

200

0

400

0

600

0

800

0

100

00

120

00

140

00

160

00

180

00

200

00

220

00

240

00

Dat

a-Q

ubit

Swap

Bas

ed -

Lin

k G

ener

atio

n La

yers

Ent Swap Based - Link Generation Layers

0

250

500

750

0

250

500

750(c)

Figure 18: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler withAndres-Martınez’s compiler [41] over link entanglement generation layers. Our compiler distributed circuits on thetopology illustrated in Figure 13, with two links between neighboring QPUs(illustrated in Figure 7), while Andres-Martınez’s one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies arecharacterized by one data qubit per QPU.

the optimality of a strategy – represents interesting openproblems.

5.4 Additional Experimental Results

To show that the proposed compiling strategies can beapplied to more complex topologies, we tested them ona slight variation of the worst-case scenario. This newtopology is depicted in Figure 13, where we doubled thenumber of EPR pairs per QPU, hence we doubled thenumber of links between neighbor QPUs. As describedin Section 5.1, this setting enables our compiler to per-form data-qubit swapping in a more efficient way and

also greatly reduces the number of layers dedicated tothe link entanglement generation, as we can generateand use double the number of links in parallel. Such aperformance improvement is clearly shown in Figure 18.

The same considerations apply to the depth of thecompiled circuits, illustrated in Figure 19, where we canobserve an appreciable improvement against the worst-case scenario and the state of the art compiler. As forthe number of generated EPR pairs, aside from an im-perceptible difference with the worst-case scenario, theadvantage of an hypercube topology is still evident.

18

Page 19: Compiler Design for Distributed Quantum Computing

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Ent

Swap

Bas

ed -

Dep

th

Martinez - Depth

0

250

500

750

1000

0

250

500

750

100

0

(a)

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Dat

a-Q

ubit

Swap

Bas

ed -

Dep

th

Martinez - Depth

0

250

500

750

1000

0

250

500

750

100

0

(b)

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

0

200

00

400

00

600

00

800

00

100

000

120

000

140

000

160

000

180

000

200

000

Dat

a-Q

ubit

Swap

Bas

ed -

Dep

th

Ent Swap Based - Depth

0

250

500

750

1000

0

250

500

750

100

0

(c)

Figure 19: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compilerwith Andres-Martınez’s compiler [41], in terms of circuits’ depth. Our compiler distributed circuits on the topologyillustrated in Figure 13, with two links between neighboring QPUs (illustrated in Figure 7), while Andres-Martınez’sone exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterizedby one data qubit per QPU.

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Ent

Swap

Bas

ed -

EPR

Pai

rs

Martinez - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(a)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Dat

a-Q

ubit

Swap

Bas

ed -

EPR

Pai

rs

Martinez - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(b)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

0

100

00

200

00

300

00

400

00

500

00

600

00

700

00

800

00

900

00

100

000

Dat

a-Q

ubit

Swap

Bas

ed -

EPR

Pai

rs

Ent Swap Based - EPR Pairs

0

1000

2000

3000

4000

0 1

000

200

0 3

000

400

0

(c)

Figure 20: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compilerwith Andres-Martınez’s compiler [41] over consumed EPR pairs. Our compiler distributed circuits on the topologyillustrated in Figure 13, with two links between neighboring QPUs (illustrated in Figure 7), while Andres-Martınez’sone exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterizedby one data qubit per QPU.

6 Conclusion

In this paper, we have discussed the main challenges aris-ing with compiler design for distributed quantum com-puting. Then, we analytically derived an upper bound ofthe overhead induced by quantum compilation for dis-tributed quantum computing. The derived bound ac-counts for the overhead induced by the underlying com-puting architecture as well as the additional overheadinduced by the sub-optimal quantum compiler. To thisaim, we designed a quantum compiler with three key fea-tures: i) general-purpose, namely, requiring no particu-lar assumptions on the quantum circuits to be compiled,

ii) efficient, namely, exhibiting a polynomial-time com-putational complexity so that it can successfully compilemedium-to-large circuits of practical value, and iii) effec-tive, being the total circuit depth overhead induced bythe quantum circuit compilation always upper-boundedby a factor that grows linearly with the number of logi-cal qubits of the original quantum circuit. We validatedthe theoretical upper bound against an extensive set ofmedium-size quantum circuits of practical interest, andwe confirmed the validity of the compiler design throughan extensive performance analysis.

19

Page 20: Compiler Design for Distributed Quantum Computing

Acknowledgments

The authors would like to thank Pablo Andres-Martınezfor answering questions about his related work.

References

[1] Marcello Caleffi, Angela Sara Cacciapuoti, andGiuseppe Bianchi. Quantum internet: From commu-nication to distributed computing! In Proceedings ofthe 5th ACM International Conference on NanoscaleComputing and Communication, 2018. Invited Paper.

[2] Marcello Caleffi, Daryus Chandra, Daniele Cuomo,Shima Hasaanpour, and Angela Sara Cacciapuoti.The Rise of the Quantum Internet. IEEE Computer,2020.

[3] R. Van Meter and S. J. Devitt. The path to scalabledistributed quantum computing. Computer, 49(9):31–42, Sept 2016. doi:10.1109/MC.2016.291.

[4] Stefano Pirandola and Samuel L. Braunstein.Physics: Unite to Build a Quantum Internet. Nature,532(7598):169–171, Apr. 2016.

[5] Elizabeth Gibney. Chinese Satellite is One Gi-ant Step for the Quantum Internet. Nature,535(7613):478–479, July 2016.

[6] Wolfgang Dur, Raphael Lamprecht, and StefanHeusler. Towards a Quantum Internet. EuropeanJournal of Physics, 38(4):043001, May 2017.

[7] Christoph Simon. Towards a Global Quantum Net-work. Nature Photonics, 11(11):678–680, 2017.

[8] Stephanie Wehner, David Elkouss, and Ronald Han-son. Quantum internet: A vision for the road ahead.Science, 362(6412), 2018.

[9] M. Zomorodi-Moghadam, M. Houshmand, andM. Houshmand. Optimizing teleportation cost in dis-tributed quantum circuits. Int J Theor Phys, 57:848–861, 2018.

[10] Laszlo Gyongyosi and Sandor Imre. Entanglementconcentration service for the quantum internet. Quan-tum Information Processing, 19(8):221, 2020.

[11] Laszlo Gyongyosi and Sandor Imre. Routing spaceexploration for scalable routing in the quantum inter-net. Scientific Reports, 10(1):11874, 2020.

[12] Daniele Cuomo, Marcello Caleffi, and Angela SaraCacciapuoti. Towards a distributed quantum comput-ing ecosystem. IET Quantum Communication, 1:3–8,July 2020. Invited Paper.

[13] Adi Botea, Akihiro Kishimoto, and Radu Mari-nescu. On the complexity of quantum circuit com-pilation. In SOCS, 2018.

[14] Eleanor G Rieffel and Wolfgang H Polak. Quantumcomputing: A gentle introduction. MIT Press, 2011.

[15] Michael A Nielsen and Isaac L Chuang. QuantumComputation and Quantum Information. CambridgeUniversity Press, 2010.

[16] David Deutsch. Quantum theory, the church–turingprinciple and the universal quantum computer. Pro-ceedings of the Royal Society of London A: Mathemat-ical, Physical and Engineering Sciences, 400:97–117,1985.

[17] Sergio Boixo, Sergei V. Isakov, Vadim N. Smelyan-skiy, et al. Characterizing quantum supremacy innear-term devices. Nature Physics, 14(6):595–600,2018.

[18] Abhinav Kandala, Kristan Temme, Antonio D.Corcoles, Antonio Mezzacapo, Jerry M. Chow, andJay M. Gambetta. Error mitigation extends the com-putational reach of a noisy quantum processor. Na-ture, 567(7749):491–495, 2019.

[19] Laszlo Gyongyosi and Sandor Imre. Circuit depthreduction for gate-model quantum computers. Scien-tific Reports, 10(1):11229, 2020.

[20] A. S. Cacciapuoti, M. Caleffi, R. Van Meter, andL. Hanzo. When Entanglement Meets Classical Com-munications: Quantum Teleportation for the Quan-tum Internet. IEEE Transactions on Communica-tions, pages 1–1, 2020. invited paper.

[21] IBM. Ibm quantum systems.https://quantum-computing.ibm.

com/docs/cloud/backends/systems/

#system-backends-systems-available, 2020.

[22] Juan Carlos Garcia-Escartin and Pedro Chamorro-Posada. Equivalent quantum circuits, Oct. 2011.arXiv:1110.2998.

[23] A. D. Corcoles, A. Kandala, A. Javadi-Abhari,D. T. McClure, A. W. Cross, K. Temme, P. D. Na-tion, M. Steffen, and J. M. Gambetta. Challengesand opportunities of near-term quantum computingsystems. Proceedings of the IEEE, pages 1–15, 2020.in press.

[24] A. Zulehner, A. Paler, and R. Wille. An effi-cient methodology for mapping quantum circuits tothe ibm qx architectures. IEEE Transactions onComputer-Aided Design of Integrated Circuits andSystems, 38(7):1226–1236, 2019.

20

Page 21: Compiler Design for Distributed Quantum Computing

[25] Gushu Li, Yufei Ding, and Yuan Xie. Tacklingthe qubit mapping problem for nisq-era quantum de-vices. In Proceedings of the Twenty-Fourth Interna-tional Conference on Architectural Support for Pro-gramming Languages and Operating Systems, ASP-LOS ’19, page 1001–1014, 2019.

[26] Hector Abraham et al. Qiskit: An open-sourceframework for quantum computing, 2019. doi:10.

5281/zenodo.2562110.

[27] Davide Ferrari and Michele Amoretti. Efficient andeffective quantum compiling for entanglement-basedmachine learning on ibm q devices. InternationalJournal of Quantum Information, 16(08):1840006,2018.

[28] Lukasz Cincio, Yigit Subası, Andrew T Sornborger,and Patrick J Coles. Learning the quantum algo-rithm for state overlap. New Journal of Physics,20(11):113022, Nov. 2018.

[29] Wojciech Kozlowski, Stephanie Wehner, Rod-ney Van Meter, Bruno Rijsman, Angela Sara Caccia-puoti, and Marcello Caleffi. Architectural Principlesfor a Quantum Internet. Internet-Draft draft-irtf-qirg-principles-03, Internet Engineering Task Force, March2020. Work in Progress.

[30] Norbert M. Linke, Dmitri Maslov, Martin Roet-teler, Shantanu Debnath, Caroline Figgatt, Kevin A.Landsman, Kenneth Wright, and Christopher Mon-roe. Experimental comparison of two quantum com-puting architectures. Proceedings of the NationalAcademy of Sciences, 114(13):3305–3310, 2017.

[31] R. Van Meter and K. M. Itoh. Fast quantum mod-ular exponentiation. Phys. Rev. A, 71:052320, May2005.

[32] A. G. Fowler, S. J. Devitt, and L. C. L. Hollenberg.Implementation of shor’s algorithm on a linear near-est neighbor qubit array. Quantum Information andComputation, 4:237–251, 2004.

[33] Y. Takahashi, N. Kunihiro, and K. Ohta. The quan-tum fourier transform on a linear nearest neighbor ar-chitecture. Quantum Information and Computation,7:383–391, 2007.

[34] R. Van Meter. Communications topology and dis-tribution of the quantum fourier transform. In Pro-ceedings of Quantum Information Technology Sympo-sium (QIT10, 2004.

[35] S. A. Kutin. Shor’s algorithm on a nearest-neighbor machine, 2007. URL: http://arxiv.org/abs/quant-ph/0609001.

[36] Rodney Van Meter, W. J. Munro, Kae Nemoto, andKohei M. Itoh. Arithmetic on a distributed-memoryquantum multicomputer. J. Emerg. Technol. Comput.Syst., 3(4), January 2008.

[37] Byung-Soo Choi and Rodney Van Meter. On theeffect of quantum interaction distance on quantum ad-dition circuits. 7(3), 2011.

[38] Robert Beals, Stephen Brierley, Oliver Gray,Aram W. Harrow, Samuel Kutin, Noah Linden, DanShepherd, and Mark Stather. Efficient distributedquantum computing. Proceedings of the Royal SocietyA: Mathematical, Physical and Engineering Sciences,469(2153):20120686, 2013.

[39] Thomas H. Cormen, Charles E. Leiserson,Ronald L. Rivest, and Clifford Stein. Introductionto Algorithms. The MIT Press, 2nd edition, 2001.

[40] Z. Davarzani, M. Zomorodi-Moghadam, M. Housh-mand, and M. Nouri-baygi. A dynamic programmingapproach for distributing quantum circuits by bipar-tite graphs. Quantum Information Processing, 19,2020.

[41] P. Andres-Martınez and C. Heunen. Automateddistribution of quantum circuits wia hypergraph par-titioning. Phys Rev A, 100:032308–1–11, 2019.

[42] Yaroslav Akhremtsev, Tobias Heuer, Peter Sanders,and Sebastian Schlag. Engineering a direct k-way hy-pergraph partitioning algorithm. In 2017 Proceedingsof the Ninteenth Workshop on Algorithm Engineeringand Experiments (ALENEX), 2017.

[43] M. Caleffi. Optimal routing for quantum networks.IEEE Access, 5:22299–22312, 2017.

[44] Donald E. Knuth. The Art of Computer Program-ming, Volume 3: (2nd Ed.) Sorting and Searching.Addison Wesley Longman Publishing Co., Inc., USA,1998.

[45] Takafumi Ono, Ryo Okamoto, Masato Tanida, Hol-ger F. Hofmann, and Shigeki Takeuchi. Implementa-tion of a quantum controlled-swap gate with photoniccircuits. Scientific Reports, 7(1):45353, 2017.

[46] Norbert Schuch and Jens Siewert. Natural two-qubit gate for quantum computation using the XYinteraction. Phys. Rev. A, 67:032301, Mar 2003.

[47] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung,X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L.O’Brien. A variational eigenvalue solver on a pho-tonic quantum processor. Nature Communications,5(4213):1–7, 2014.

21

Page 22: Compiler Design for Distributed Quantum Computing

[48] Panagiotis Kl. Barkoutsos, Jerome F. Gonthier,Igor Sokolov, Nikolaj Moll, Gian Salis, AndreasFuhrer, Marc Ganzhorn, Daniel J. Egger, MatthiasTroyer, Antonio Mezzacapo, Stefan Filipp, and IvanoTavernelli. Quantum algorithms for electronic struc-ture calculations: Particle-hole Hamiltonian and op-timized wave-function expansions. Phys. Rev. A,98:022322, 2018.

[49] Abhinav Kandala, Antonio Mezzacapo, KristanTemme, Maika Takita, Markus Brink, Jerry M. Chow,and Jay M. Gambetta. Hardware-efficient variationalquantum eigensolver for small molecules and quan-tum magnets. Nature, 549(7671):242–246, September2017.

7 Appendix A

Here we present a pseudocode description of the wholecompilation process illustrated in Figure 11 and sum-marized by Algorithm 4. The front layer is iterativelyupdated with Algorithm 5 and compiled by Algorithm 2.This is done until no more front layers can be computed,meaning that all gates have been mapped and the com-pilation process has finished.

Algorithm 4 CompileInput: the circuit to be compiledOutput: the compiled circuit

G ← all gates from circuitE ← ∅ // executed gatescreate new circuit as an empty circuitwhile G 6= ∅ doG,F , E ← UpdateFrontLayer(G, E) // F is the front layerif F 6= ∅ thenE ← CompileFrontLayer(F , E)

end ifend whilefor all gate ∈ E do:

add gate to new circuitend forreturn new circuit

Algorithm 5 UpdateFrontLayerInput: G gates to be executed, E executed gates until nowOutput: updated G, updated E, new front layer FA ← ∅ // allocated qubitsF ← ∅ // new front layerR← ∅ // gates to removewidth← total number of data qubits availablefor all gate ∈ G do

if |A| = width thenbreak

end ifif A ∩ gate.qubits = ∅ then

if gate is a one-qubit gate thenadd gate to E

elseadd gate to Fadd gate to Eadd gate.qubits to A

end ifadd gate to R

else:add gate.qubits to A

end ifend forfor all gate ∈ R do

remove gate from Gend forreturn G,F , E

22

Page 23: Compiler Design for Distributed Quantum Computing

Davide Ferrari (GS’20) receivedthe BSc degree in Computer Engi-neering from the Polytechnic of Milan,Italy, in 2016 and the MSc degree inComputer Engineering from the Uni-versity of Parma, Italy, in 2019. Rightafter, he has been a research scholarat Future Technology Lab of the Uni-versity of Parma, working on the de-sign of efficient algorithms for quan-

tum compiling. Currently, he is a PhD student of the Depart-ment of Engineering and Architecture of the University ofParma. He is involved in the Quantum Information Science(QIS) research initiative at the University of Parma, wherehe is a member of the Quantum Software research unit. In2020, he won the ’IBM Quantum Awards Circuit Optimiza-tion Developer Challenge’. His research focuses on efficientquantum compiling for quantum simulations and quantummachine learning.

Angela Sara Cacciapuoti (M’10,SM’16) is a faculty at the Universityof Naples Federico II, Italy. In 2009,she received the Ph.D. degree in Elec-tronic and Telecommunications Engi-neering, and in 2005 a ‘Laurea’ (in-tegrated BS/MS) summa cum laudein Telecommunications Engineering,both from the University of NaplesFederico II. She was a visiting re-

searcher at Georgia Institute of Technology (USA) and atthe Universitat Politecnica de Catalunya (Spain). Since July2018, she held the national habilitation as “Full Professor” inTelecommunications Engineering. Her work has appeared infirst tier IEEE journals and she has received different awards,including the elevation to the grade of IEEE Senior Mem-ber in 2016, most downloaded article, and most cited arti-cle awards, and outstanding young faculty/researcher fellow-ships for conducting research abroad. Currently, Angela Saraserves as Area Editor for IEEE Communications Letters, andas Editor/Associate Editor for the journals: IEEE Trans.on Communications, IEEE Trans. on Wireless Communica-tions, IEEE Open Journal of Communications Society andIEEE Trans. on Quantum Engineering. She was a recipientof the 2017 Exemplary Editor Award of the IEEE Communi-cations Letters. In 2016 she has been an appointed memberof the IEEE ComSoc Young Professionals Standing Commit-tee. From 2017 to 2018, she has been the Award Co-Chairof the N2Women Board. Since 2017, she has been an electedTreasurer of the IEEE Women in Engineering (WIE) Affin-ity Group of the IEEE Italy Section. In 2018, she has beenappointed as Publicity Chair of the IEEE ComSoc Women inCommunications Engineering (WICE) Standing Committee.And since 2020, she is the vice-chair of WICE. Her currentresearch interests are mainly in Quantum Communications,Quantum Networks and Quantum Information Processing.

Michele Amoretti (S’01–M’06–SM’19) received his PhD in Informa-tion Technologies in 2006 from theUniversity of Parma, Parma, Italy.He is Associate Professor of Com-puter Engineering at the Universityof Parma. In 2013, he was a Visit-ing Researcher at LIG Lab, in Greno-

ble, France. He authored or co-authored over 100 researchpapers in refereed international journals, conference proceed-ings, and books. He serves as Associate Editor for the jour-nals: IEEE Trans. on Quantum Engineering and Interna-tional Journal of Distributed Sensor Networks. He is in-volved in the Quantum Information Science (QIS) researchand teaching initiative at the University of Parma, where heleads the Quantum Software research unit. He is the CINIConsortium delegate in the CEN-CENELEC Focus Groupon Quantum Technologies. His current research interests aremainly in High Performance Computing, Quantum Comput-ing, and the Internet of Things.

Marcello Caleffi (M’12, SM’16)received the M.S. degree (summa cumlaude) in computer science engineer-ing from the University of Lecce,Lecce, Italy, in 2005, and the Ph.D.degree in electronic and telecommu-nications engineering from the Uni-versity of Naples Federico II, Naples,Italy, in 2009. Currently, he is with

the DIETI Department, University of Naples Federico II, andwith the National Laboratory of Multimedia Communica-tions, National Inter-University Consortium for Telecommu-nications (CNIT). From 2010 to 2011, he was with the Broad-band Wireless Networking Laboratory at Georgia Instituteof Technology, Atlanta, as visiting researcher. In 2011, hewas also with the NaNoNetworking Center in Catalunya(N3Cat) at the Universitat Politecnica de Catalunya (UPC),Barcelona, as visiting researcher. Since July 2018, he held theItalian national habilitation as Full Professor in Telecommu-nications Engineering. His work appeared in several premierIEEE Transactions and Journals, and he received multipleawards, including best strategy award, most downloaded ar-ticle awards and most cited article awards. Currently, heserves as associate technical editor for IEEE Communica-tions Magazine and as editor for IEEE Trans. on QuantumEngineering and IEEE Communications Letters. He servedas Chair, TPC Chair, Session Chair, and TPC Member forseveral premier IEEE conferences. In 2016, he was elevatedto IEEE Senior Member and in 2017 he has been appointedas Distinguished Lecturer from the IEEE Computer Society.In December 2017, he has been elected Treasurer of the JointIEEE VT/ComSoc Chapter Italy Section. In December 2018,he has been appointed member of the IEEE New InitiativesCommittee.

23