goldstrike 1: cointerra’s first generation ... - hot chips
TRANSCRIPT
GOLDSTRIKETM 1:
COINTERRA’S FIRST GENERATION
CRYPTO-CURRENCY PROCESSOR
FOR BITCOIN MINING MACHINES
Javed Barkatullah, Ph.D., MBA
Timo Hanke, Ph.D.
Ravi Iyengar
Ricky Lewelling
Jim O’Connor
BITCOIN MINING WORK
For a message m find a nonce n such
that the 256-bit result
Sha-256 (Sha-256 (m || n) )
has a specified number of leading
zero bits (“target”)
• Sha-256 is a cryptographic hash function
• Cryptographic hash functions are “one way” functions
• This search problem is best solved by trial-and-error
VersionPrevious
HashMerkle
RootTarget Time Nonce Hash
DoubleSha
Tx_id_0 Tx_id_1 Tx_id_n Tx_id_n+1
Tx_id_0,1 Tx_id_n,n+1
Merkle
Root
VersionPrevious
HashMerkle
RootTarget Time Nonce
MessageHash
DoubleSha
Entry m
Entry m+1
Transaction 0 Transaction 1
DoubleSha
DoubleSha
Block Header
HISTORY OF BITCOIN MINING HARDWARE
GPU based
Platform
FPGA based
Platform
Graph source: http://bitcoin.sipa.be
Custom ASIC
based Platform
First Bitcoin
Network & CPU
based Platform
↑Hashing Capacity
↑Network Difficulty
↑Platform Hashing Power
Network Difficulty
adjusted every 2016
blocks mined
1e-06
1e-04
1e-02
1
100
1e04
TH
ash
/s
1e-06
1e-04
1e-02
1
100
1e04
Dif
ficu
lty
x 1
06
GOLDSTRIKE™ 1 DEVELOPMENT TIMELINE
• 4 months from RTL start to tape out!
• 49 days from tape out to first silicon
• Packaged silicon arrived on Dec. 28, 2013
• First system shipped to customer around mid January, 2014
Cointerra, Inc. Founded
May2013
June2013
July2013
August2013
Sept.2013
Oct.2013
Nov.2013
Dec2013
Jan.2014
RTL Start
RTL Freeze
Tape Out
First Silicon
System Shipped!
Board & System Design
Start
Feb.2014
• Motorola compatible 4-pin SPI Port
• PLL with simple bit-bang interface
• 120 Hash Engines arranged into 16 super-pipes
• 128 deep Input Work FIFO
• 128 deep Output Status FIFO
• 384-bit Pipe Control Register (PCR) to enable/disable individual hash engine
• Low I/O bandwidth requirement
• New work (384 bits) every 232 clock cycles per engine
GOLDSTRIKE™ 1 ARCHITECTURE
• Two rounds of SHA-256 processing
• Searches for a result in 232 nonce range
• Each round consists of 64 iterations
• Fully unrolled iterations
• Two parallel but connected pipelines – message & compressor
• Generates a result out only if target criteria met
HASH ENGINE
Input Interface
Output Interface
Hash_out
Work_in
Nonce Counter
Comparator
Stage 0Stage 1Stage 2
Stage 63Stage 62C
om
pre
sso
rR
ou
nd
1
Stage 0Stage 1Stage 2
Stage 63Stage 62
Me
ssa
ge
Ro
un
d 2
Stage 0Stage 1Stage 2
Stage 63Stage 62 M
ess
ag
eR
ou
nd
1
Stage 0Stage 1Stage 2
Stage 63Stage 62
Co
mp
ress
or
Ro
un
d 2
COMPRESSOR STAGE OF SHA-256 PIPELINE
S0 Maj S1 Ch
+ +
+
+ +
A DB C E F HG
Ain Bin Cin Din Ein Fin Gin Hin Kin Win
Ain Bin Cin Ein Fin Gin
Aout Bout Cout Dout Eout Fout Gout Hout
S0(A) = ( A >>> s1) ( A >>> s2 )
( A >>> s3)
S1(E) = ( E >>> s4) ( E >>> s5 )
( E >>> s6)
Ch(E,F,G) = (E F) (~E G)
Maj(A,B,C) = (A B) (B C)
(C A)
All registers are 32-bits wide
MESSAGE STAGE SHA-256 PIPELINE
• 512 bits message word divided in to 16 words, 32-bit wide (W0 to W15)
• s0(W1,in) = (W1,in >>> s7) (W1,in >>> s8 ) (W1,in >>> s9)
• s1(W14,in) = (W14,in >>> s10) (W14,in >>> s11 ) (W14,in >>> s12)
+
W15
,in
W14
,in
W13
,in
W12
,in
W11
,in
W10
,in
W9
,in
W8
,in
W7,
in
W6
,in
W5,
in
W4
,in
W3,
in
W2
,in
W1,
in
W0
,in
W13 W12 W11 W10 W9 W8 W7 W6 W5 W4 W3 W2 W1 W0W15 W14
s0s1Win
To Compressor
• Global Foundries HKMG
28nm HPP process
• 9 metal layers
• 120 hash engines in
11x11 array (grey boxes)
• Top level logic block in
the center
DIE MICROGRAPH
• 37.5 x 37.5 mm FCBGA
package
• 4 bare dies per package
• 1296 pins
• > 500 GH/s @ 1.05GHz & 0.7v
GOLDSTRIKE™-1 (GS1) PACKAGE
HEAT DISSIPATION CHALLENGE
• Cooling options examined:
• Heat sink + Airflow Common
in CPU applications
• Liquid Cooling Popular
among over-clockers
• Immersion Efficient for data
centers
• Liquid cooling with direct
attach cooling head selected
• Enable a common platform for
both home & data center
customers
Air Temps on Plane 3mm
above PCB Top
Up to 2TH/s hash rate per appliance • Dual PCB with 4 GS1 packages total
• Power budget to meet household outlet capacity
Layout - 4U chassis Design
Driven by cooling requirements Radiator cross-section
Fan Size
Similar design for TerraMiner IV data center and home models
Push pull airflow design for maximum performance
Fans chosen for balance between cost, performance & audible noise
Dual 1U power supplies for minimal volume impact
TERRAMINER APPLIANCE
TERRAMINER APPLIANCE IMAGES
Front View
Back View
TERRAMINERTM IN DATACENTER
ASIC DESIGN CHALLENGES & CHOICES
Challenges:
• High power density and high node toggle rates
• Power delivery
• Heat dissipation
• IR drop and di/dt noise
• Very high sequential cell count
• Reduce die area and power consumption
• Very short (4-month) schedule from RTL start to tape out
• Very small design team
Choices:
• Optimize common core blocks
• Maximize design repeat & reuse
• Utilize highly experienced design team
CONCLUDING REMARKS
• Continued demand for higher performance and lower power
appliance
• Maintain Cointerra’s leadership position in Bitcoin mining
industry
• New designs with increased power efficiency and
performance