Post on 06-Jan-2016
A Scalable Pipelined Associative SIMD Array With Reconfigurable PE Interconnection Network For Embedded Applications
Hong Wang & Robert A. Walker
Computer Science Department
Kent State University
Kent, OH 44242 USA
ASC Processor Group2
Outline of Talk
- SIMD Associative Computing
  - Associative Search & Associative Processing
  - PE Interconnection Network
  - Multiple Instruction Streams
- ASC Processor (Work Mostly Complete)
  - Pipelined Architecture
  - Reconfigurable PE Interconnection Network
  - Processor and Network Performance
- MASC Architecture (Work in Progress)
  - Implementation of Task Manager and Instruction Stream
  - Sample Code
  - Architecture and Sample Execution
- Sample Application
  - String Matching
- Conclusion
Associative SIMD Array
[Figure: an Associative Control Unit (CU) driving an array of six memory cells, each with its own associative PE (APE)]
Associative Computing = Associative Search + Associative Processing
- Tabular data, cells referenced by content
- Associative Search: on a successful search, the cell is flagged
- Associative Processing: flagged cells are processed further
SIMD Associative Computing
- Each memory cell has an associative processing element (APE), so all cells are searched concurrently
Associative Search
Associative PEs (APEs) search for a key, and those that find it are flagged as responders.
Example: find all “Ford” cars for sale.
[Figure: the Associative Control Unit (CU) and five APEs, each with a responder bit (R), holding one record each:]
Ford       Taurus    $22,000  Kent
Chevrolet  Malibu    $20,000  Akron
Ford       Taurus    $18,000  Akron
Ford       Focus     $14,000  Kent
Jeep       Wrangler  $25,000  Akron
Associative Processing
The responders can be processed further: just one, all sequentially, or all in parallel, as needed.
[Figure: the same five car records; three APEs are flagged as responders (R), matching the three Ford records]
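The search-then-process idea can be sketched in software. The following is only a minimal illustration in Python, not the hardware design: the `Cell` class, the field names, and the `associative_search` helper are assumptions introduced here, and the loop merely models a comparison that every APE would perform at the same time.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One memory cell with its associative PE (hypothetical model)."""
    make: str
    model: str
    price: int
    city: str
    responder: bool = False  # corresponds to the "R" flag in the figures


def associative_search(cells, **key):
    """Flag every cell whose fields equal the broadcast key.

    On the array all APEs compare concurrently; this loop only models that.
    """
    for cell in cells:
        cell.responder = all(getattr(cell, name) == value
                             for name, value in key.items())
    return [cell for cell in cells if cell.responder]


cars = [
    Cell("Ford", "Taurus", 22000, "Kent"),
    Cell("Chevrolet", "Malibu", 20000, "Akron"),
    Cell("Ford", "Taurus", 18000, "Akron"),
    Cell("Ford", "Focus", 14000, "Kent"),
    Cell("Jeep", "Wrangler", 25000, "Akron"),
]

# Associative search: find all Ford cars for sale.
responders = associative_search(cars, make="Ford")

# Associative processing: work on the responders only, here picking the cheapest.
cheapest = min(responders, key=lambda c: c.price)
print(len(responders), cheapest.model, cheapest.price)  # 3 Focus 14000
```

Picking the cheapest responder is just one example of further processing; any per-responder operation could take its place.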
ASC: Associative SIMD Array w/ PE Network
[Figure: an Associative Control Unit (CU) driving eight APEs, each with its own memory; the APEs are linked by a PE interconnection network]
MASC: ASC + Multiple Control Units / Instruction Streams
[Figure: three Associative Control Units (CUs) connected through a CU interconnection network to eight APEs, each with its own memory; the APEs are linked by a PE interconnection network]
ASC Processor’s Pipelined Architecture
- We have implemented a pipelined SIMD Associative (ASC) Processor using Altera FPGAs
- Five single-clock-cycle pipeline stages are split between the SIMD Control Unit (CU) and the PEs
  - In the Control Unit
    - Instruction Fetch (IF)
    - Part of Instruction Decode (ID)
  - In the Scalar PE (SPE) and in each Parallel PE (PPE)
    - Rest of Instruction Decode (ID)
    - Execute (EX)
    - Memory Access (MEM)
    - Data Write Back (WB)
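To make the five-stage overlap concrete, here is a small sketch in Python that prints which stage each instruction occupies in each clock cycle. It assumes an ideal pipeline (one instruction issued per cycle, no stalls or hazards), which the slides do not state; `pipeline_diagram` is a hypothetical helper, not part of the processor.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # the five single-cycle stages


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c.

    Assumes an ideal pipeline: a new instruction enters every cycle.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i reaches stage s in cycle i + s
        rows.append(row)
    return rows


for row in pipeline_diagram(3):
    print(" ".join(f"{stage:>3}" for stage in row))
```

Under these assumptions, n instructions finish in n + 4 cycles, so the throughput approaches one instruction per clock cycle for long instruction sequences.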
Pipelined ASC Processor with Reconfigurable Interconnection Network
[Figure: block diagram of the Control Unit (CU), Sequential PE (SPE), and Parallel PE (PPE) Array. Labeled blocks: Instruction Memory, IF/ID Latch, Decoder, Register File, ID/EX Latch, EX/MEM Latch, Data Memory, MEM/WB Latch. Labeled data paths: Immediate Data, Broadcast Register Data.]
Processing Element (PE)
[Figure: PE datapath. Labeled blocks: Register File, Data Switch, Comparator, ID/EX Latch, Mask, EX/MEM Latch, Data Memory, MUX, MEM/WB Latch.]
- The Comparator implements associative search: it pushes ‘1’ onto the top of the mask stack for responders, ‘0’ otherwise
- A ‘0’ on top of the mask stack disables the ID/EX Latch
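The mask-stack behavior can be modeled in a few lines. This is a minimal sketch in Python under stated assumptions, not the FPGA design: the `PE` class and its methods are hypothetical, every PE is assumed to start enabled, and a PE that is already disabled is assumed to push ‘0’ so that nested searches stay consistent.

```python
class PE:
    """Toy model of one PE with a mask stack (hypothetical, for illustration)."""

    def __init__(self, value):
        self.value = value
        self.mask = [1]  # assumption: every PE starts enabled

    def enabled(self):
        # A 0 on top of the mask stack disables the PE (its ID/EX latch).
        return self.mask[-1] == 1

    def search(self, key):
        # Comparator: push 1 for a responder, 0 otherwise.
        # Assumption: an already-disabled PE always pushes 0.
        hit = self.enabled() and self.value == key
        self.mask.append(1 if hit else 0)

    def execute(self, op):
        # Only enabled PEs carry out the broadcast instruction.
        if self.enabled():
            self.value = op(self.value)

    def end_search(self):
        # Pop the mask to restore the previous responder set.
        self.mask.pop()


pes = [PE(v) for v in (3, 5, 3)]
for pe in pes:
    pe.search(3)                  # associative search for the key 3
for pe in pes:
    pe.execute(lambda v: v + 10)  # only the responders add 10
for pe in pes:
    pe.end_search()
print([pe.value for pe in pes])   # [13, 5, 13]
```

Keeping the mask as a stack lets a nested search narrow the responder set and then restore the outer set with a single pop.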
Pipelined ASC Processor’s Performance
- Our pipelined ASC Processor has been implemented on an Altera APEX20KC1000 FPGA with 70 8-bit PEs
  - Other 8-bit processor cores implemented on this FPGA / speed grade have clock speeds ranging from 30 to 106 MHz, typically 60-68 MHz
  - Our pipelined ASC Processor has a clock speed of 56.4 MHz, comparable with these other processors
- With the 5-stage pipeline, our ASC Processor can approach a peak performance of 300 MHz
Reconfigurable PE Interconnection Network
- Our pipelined ASC Processor also has a reconfigurable PE interconnection network
- The reconfigurable PE network allows arbitrary PEs in the PE Array to be connected, without the restriction of physical adjacency, via
  - Linear array (currently implemented), or
  - 2D mesh (to be implemented soon)
- Each PE in the PE Array can
  - Choose to stay in the PE interconnection network, or
  - Choose to stay out of the PE interconnection network, so that it is bypassed by any inter-PE communication
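The effect of bypassing on a linear array can be sketched as follows. This is only an illustrative Python model under assumptions: `neighbors` and `shift_right` are hypothetical helpers, and the `in_network` list stands in for each PE's choice to stay in or out of the network.

```python
def neighbors(in_network, i):
    """Return (left, right) indices of the nearest in-network PEs around PE i.

    PEs that opted out are skipped; None means no neighbor on that side.
    """
    left = next((j for j in range(i - 1, -1, -1) if in_network[j]), None)
    right = next((j for j in range(i + 1, len(in_network)) if in_network[j]), None)
    return left, right


def shift_right(values, in_network):
    """Each in-network PE sends its value to its right in-network neighbor."""
    result = list(values)
    for i, active in enumerate(in_network):
        if active:
            _, right = neighbors(in_network, i)
            if right is not None:
                result[right] = values[i]  # bypassed PEs keep their own data
    return result


# PE1 and PE2 stay out of the network, so PE0 and PE3 become neighbors.
in_network = [True, False, False, True, True]
print(neighbors(in_network, 3))                       # (0, 4)
print(shift_right([10, 20, 30, 40, 50], in_network))  # [10, 20, 30, 10, 40]
```

In the example, PE0 and PE3 communicate directly even though they are not physically adjacent, which is the property the reconfigurable network provides.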
Reconfigurable Network Implementation
[Figure: Data Switch. Labels: Register File, Register Data (from SPE), Immediate Data (from CU), Left Neighbor, Right Neighbor, Top of Mask Stack, Comparator & ID/EX Latch.]
- Data switch
  - Passes register, broadcast, and immediate data to the PE and to its two neighbors
  - Routes data from the PE’s neighbors to its EX stage
- Reconfigurable network
  - Supports Bypass Mode to remove non-responder PEs from the network
  - Will be needed by the MASC Processor
ASC Processor’s Network Performance
- Performance of the ASC Processor degrades as the number of PEs is increased with Bypass Mode present
  - Due to the long path from the first PE to the last PE in the PE array
- A 4-PE ASC Processor requires 2152 LEs and runs at 56.4 MHz with Bypass Mode present
  - When the number of PEs is increased to 50, the clock frequency drops to 22 MHz
- In the future we hope to reduce this delay using a pipelined or other multi-hop architecture
[Figure: state diagrams of the Task Manager and the Instruction Stream. State labels: IDLE, Task_Allocation, Wait_For_IS, Join, Call_TM, Task_Execution, IDLE.]
MASC PE Structure
[Figure: a PE with an ID Register and an IS_TM_Chooser connected to IS1, IS2, TM1, and TM2]
[Figure: the Task Manager and Instruction Stream state diagrams (IDLE, Task_Allocation, Wait_For_IS, Join, Call_TM, Task_Execution, IDLE), annotated with TM ID and IS ID labels]
Assembly Code Example
.
.
101 Parallel_Select_Start Mem(110)
102 Pcase Condition1 Mem(104)
103 Pcase Condition2 Mem(107)
104 Case1
105 …
106 Parallel_Case_End
107 Case2
108 …
109 Parallel_Case_End
110 Parallel_Select_End (note: this does not trigger JOIN; running out of tasks does)
.
.
[Figure: MASC example configuration with three Task Managers (TM0, TM1, TM2), three Instruction Streams (IS0, IS1, IS2), and six PEs (PE0 to PE5)]
Originally, all PEs listen to IS0
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with IS0 highlighted]
When Parallel Select is met, the Task Manager takes over the PEs
101 Parallel_Select_Start Mem(110)
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with TM0 highlighted]
The TM then calls IS0 to perform the 1st task
102 Pcase Condition1 Mem(104)
104 Case1
105 …
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with TM0 and IS0 highlighted]
The TM then calls IS1 to perform the 2nd task
IS0: 102 Pcase Condition1 Mem(104)
     104 Case1
     105 …
IS1: 103 Pcase Condition2 Mem(107)
     107 Case2
     108 …
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with TM0, IS0, and IS1 highlighted]
The 2nd task finishes and gives control back to the TM
IS0: 102 Pcase Condition1 Mem(104)
     104 Case1
     105 …
IS1: 107 Case2
     108 …
     109 Parallel_Case_End
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with TM0 and IS0 highlighted]
The 1st task finishes and gives control back to the TM
IS0: 104 Case1
     105 …
     106 Parallel_Case_End
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with TM0 highlighted]
Control is back to the last finished IS, which is IS0
110 Parallel_Select_End
.
.
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5, with IS0 highlighted]
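The sequence just walked through (allocate tasks to idle instruction streams, collect them as they finish, join when no tasks remain) can be mimicked by a small event-driven sketch in Python. Everything here is an assumption made for illustration: the function name, the task durations, and the rule of giving each task to the lowest-numbered idle IS are not taken from the MASC implementation.

```python
import heapq


def run_parallel_select(tasks, num_streams):
    """Simulate a Task Manager handing tasks to idle Instruction Streams.

    tasks: list of (name, duration) pairs; durations are hypothetical.
    Returns (id of the last IS to finish, event log).
    """
    if num_streams < 1:
        raise ValueError("need at least one instruction stream")
    idle = list(range(num_streams))  # idle IS ids, lowest first
    pending = list(tasks)
    running = []                     # heap of (finish_time, is_id, task_name)
    log = []
    now = 0
    last_finished = None
    while pending or running:
        # Task allocation: give pending tasks to idle instruction streams.
        while pending and idle:
            name, duration = pending.pop(0)
            is_id = idle.pop(0)
            heapq.heappush(running, (now + duration, is_id, name))
            log.append((now, f"TM allocates {name} to IS{is_id}"))
        # Wait for the next IS to reach its Parallel_Case_End.
        now, is_id, name = heapq.heappop(running)
        log.append((now, f"IS{is_id} finishes {name}"))
        idle.append(is_id)
        idle.sort()
        last_finished = is_id
    # No tasks are left, which is what triggers the join.
    log.append((now, f"JOIN: control returns to IS{last_finished}"))
    return last_finished, log


last, events = run_parallel_select([("Case1", 5), ("Case2", 3)], num_streams=3)
for time, message in events:
    print(time, message)
```

With the made-up durations 5 and 3, Case2 on IS1 finishes first and Case1 on IS0 finishes last, so control returns to IS0, matching the order shown in the slides.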
IS1 meets a nested Parallel Select
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5]
TM1 allocates the two tasks to IS1 and IS2
[Figure: Task Managers TM0 to TM2, Instruction Streams IS0 to IS2, and PE0 to PE5. Labels: A = 2, C = A, B = A, Common Register.]
Sample Application — String Matching
- One of the most fundamental computing operations
- The Variable Length Don’t Care (VLDC) SIMD associative algorithm (Mary Esenwein 1997 and Ping Xu 2005) can find all instances of a pattern string within a larger text string
  - Exact-match version shown here
  - Extensions support single-character and variable-length “don’t cares”
- Demonstrates associative search, associative computing (responder processing), and the linear PE interconnection network
String Match by Associative Computing
- Look for a match of pattern string AB in text string ABAA
- Initialize variables as shown below
- Note that “$” indicates a parallel variable
[Figure: the Associative Control Unit (CU) and five APEs, each with a responder bit (R)]
Cell  text$  counter$  match$
1     @      0         0
2     A      0         0
3     B      0         0
4     A      0         0
5     A      0         0
Scalar variables: patt_string = AB, patt_length = 2, patt_counter = 0
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: same state as above; j marks the current character of patt_string (B, the last character), and one APE, cell 3 with text$ = B, is flagged as the responder (R)]
ASC Processor Group37
String Match by Associative Computing
Assoc.Control
Unit(CU)
APE
APE
APE
APE
APE
R
1
2
3
4
5
Responders add 1 to counter$ and send resultto counter$ of preceding cell via network;
patt_counter++;
text$
0
counter$
0
match$
0
0 0
0 0
@
A
B
A
A 0 0
0 1
patt_counter
2
patt_length
AB
patt_string
j
0 1
Responders are text$ == patt_string[j] and counter$ == patt_counter;
[Figure: j now marks A and patt_counter = 1; one APE, cell 2 with text$ = A and counter$ = 1, is flagged as the responder (R)]
Responders add 1 to counter$ and send the result to counter$ of the preceding cell via the network;
patt_counter++;
[Figure: counter$ of cell 1 changes from 0 to 2, and patt_counter changes from 1 to 2]
Responders are counter$ == patt_length;
[Figure: patt_counter = patt_length = 2; one APE, cell 1 with counter$ = 2, is flagged as the responder (R)]
Responders send 1 to match$ of the next cell via the network;
[Figure: cell 1 is the responder (R); match$ of cell 2 changes from 0 to 1]
Responders are match$ == 1;
Indicates cell(s) where the match of pattern string AB in text string ABAA begins
[Figure: one APE, cell 2 with match$ = 1, is flagged as the responder (R); the match begins at cell 2]
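The exact-match walkthrough can be condensed into a short sequential simulation. This is a sketch in Python written from the slide captions, not the authors' code: the function name is hypothetical, the loops stand in for operations that all PEs perform at once, and two details are assumptions read off the figures (the pattern is scanned from its last character to its first, and a responder writes only to the preceding cell's counter$, leaving its own unchanged).

```python
def associative_string_match(text, pattern):
    """Return the 0-based text positions where pattern begins.

    Cell 0 is the sentinel cell (shown as "@" in the slides, which number
    cells from 1); cells 1..n hold the text characters.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    cells = [None] + list(text)   # text$ (None models the "@" sentinel)
    n = len(cells)
    counter = [0] * n             # counter$
    match = [0] * n               # match$
    patt_counter = 0
    patt_length = len(pattern)

    # Assumption: j walks the pattern from its last character to its first.
    for j in range(patt_length - 1, -1, -1):
        # Associative search: every cell compares at once on the array.
        responders = [i for i in range(1, n)
                      if cells[i] == pattern[j] and counter[i] == patt_counter]
        # Responders send counter$ + 1 to the preceding cell; a responder's
        # counter$ equals patt_counter, so the value sent is patt_counter + 1.
        for i in responders:
            counter[i - 1] = patt_counter + 1
        patt_counter += 1

    # Cells that saw the whole pattern to their right flag the next cell.
    for i in [i for i in range(n) if counter[i] == patt_length]:
        match[i + 1] = 1

    # Convert flagged cells back to 0-based positions in the text.
    return [i - 1 for i in range(1, n) if match[i] == 1]


print(associative_string_match("ABAA", "AB"))  # [0]
```

Position 0 in the text corresponds to cell 2 in the slides, because the slides count cells from 1 and cell 1 holds the sentinel.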
Conclusion
- We have implemented a SIMD associative ASC Processor (on an FPGA) that combines the parallelism of SIMD architectures with the search capabilities of associative computing
  - Performance is improved by adding a 5-stage pipeline, split between the Control Unit and the PEs
  - Additional functionality is provided by a reconfigurable PE interconnection network
- Future work will include
  - Support for multiple Control Units (in progress)
  - Performance improvement to support more efficient broadcast to a large number of PEs