Infra Yarou Azure Team Night
TRANSCRIPT
#infrayarou
{
"Name" : "Toru Makabe (真壁 徹)",
"Affiliation" : "Microsoft Japan Co., Ltd.",
"Role" : "Cloud Solution Architect",
"Background" : "Daiwa Institute of Research, HP Enterprise",
"Specialty" : "Cloud & open source"
}
https://docs.microsoft.com/ja-jp/azure/
[Microsoft Global Datacenters and Network Infrastructure]
https://www.youtube.com/watch?v=bqZrejosqWU
32 Regions Worldwide, 24 Generally Available…
Central US: Iowa
West US: California
East US 2: Virginia
US Gov: Virginia
North Central US: Illinois
US Gov: Iowa
South Central US: Texas
Brazil South: Sao Paulo State
West Europe: Netherlands
China North*: Beijing
China South*: Shanghai
Japan East: Tokyo, Saitama
Japan West: Osaka
India South: Chennai
East Asia: Hong Kong
SE Asia: Singapore
Australia South East: Victoria
Australia East: New South Wales
India Central: Pune
Canada East: Quebec City
Canada Central: Toronto
India West: Mumbai
Germany North East: Magdeburg
Germany Central: Frankfurt
United Kingdom: Regions (2)
North Europe: Ireland
US DoD West: TBA
US DoD East: TBA
East US: Virginia
Korea: Regions (2)
*Operated by 21Vianet
Announced/not operational
Operational
32 regions announced / 24 operational: the layout at that time (*)
(*) Now 38 announced / 30 operational
$ azure location list --details
info: Executing command location list
+ Getting ARM registered providers
info: Getting locations...
data:
data: Location : eastasia
data: DisplayName : East Asia
data:
data:
[…]
https://blogs.technet.microsoft.com/hybridcloud/2016/05/26/microsoft-and-facebook-to-build-subsea-cable-across-atlantic/
https://azure.microsoft.com/ja-jp/blog/microsoft-invests-in-subsea-cables-to-connect-datacenters-globally/
Generation 1: Colocation. 2.0+ Power Usage Effectiveness (PUE). Discrete servers; Capacity; 20 year technology.
Generation 2: Density. 1.4 – 1.6 PUE. Rack; Density & deployment; Minimized resource impact.
Generation 3: Containment. 1.2 – 1.5 PUE. Containers, PODs; Scalability & sustainability; Air & water economization; Differentiated SLAs.
Generation 4: Modular. 1.12 – 1.20 PUE. Deployment Areas & ITPACs; No more traditional IT; Right-sized; Faster time-to-market; Outside air cooled.
Generation 5: Hyper-scale. 1.07 – 1.19 PUE. Fully integrated; Resilient software; Common infrastructure; Operational simplicity; Flexible & scalable.
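PUE is the ratio of total facility power to IT equipment power, so the ranges above translate directly into overhead per unit of IT load. The following is a minimal sketch of that arithmetic. The 1 MW IT load is an illustrative assumption, not a figure from the slides, and the per-generation values are simply the low end of each range.

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by a PUE value (PUE = total / IT)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * pue


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Non-IT power (cooling, distribution losses, and so on)."""
    return facility_power_kw(it_load_kw, pue) - it_load_kw


# Low end of each PUE range on the slide ("2.0+" is taken as 2.0).
LOW_END_PUE = {"Gen 1": 2.0, "Gen 2": 1.4, "Gen 3": 1.2, "Gen 4": 1.12, "Gen 5": 1.07}

if __name__ == "__main__":
    it_load = 1000.0  # hypothetical 1 MW of IT equipment
    for gen, pue in LOW_END_PUE.items():
        print(f"{gen}: PUE {pue:.2f} -> {overhead_kw(it_load, pue):.0f} kW overhead")
```

Under these assumptions the non-IT overhead for the same IT load shrinks from roughly the size of the IT load itself (Generation 1) to a few percent of it (Generation 5).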
S. Sankar, K. Vaid, M. Shaw “Impact of Temperature on Hard Disk Drive Reliability in Large Datacenters” Microsoft, IEEE, 2011
Inlet Temperature and Impact on Hard Disk Failure Rates

                 HDDs in Front, ΔT 1°C           Buried HDDs Design, ΔT 20°C (cold) de-rated to ΔT 10°C (hot)
Inlet Temp       HDD Case Temp   Relative AFR    HDD Case Temp   Relative AFR
10 C /  50 F     11 C            100%            30 C            100%
15 C /  59 F     16 C            100%            34 C            100%
20 C /  68 F     21 C            100%            38 C            100%
25 C /  77 F     26 C            100%            41 C            106%
30 C /  86 F     31 C            100%            45 C            131%
35 C /  95 F     36 C            100%            49 C            153%
40 C / 104 F     41 C            106%            53 C            189%
45 C / 113 F     46 C            138%            56 C            231%
50 C / 122 F     51 C            179%            60 C            281%
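The buried-HDD case temperatures in the table above appear consistent with a ΔT that falls linearly from 20°C at a 10°C inlet to 10°C at a 50°C inlet. The sketch below reproduces the column under that assumption. The linear model and the round-half-up step are inferences from the numbers, not something the cited paper is quoted as stating here.

```python
import math

# (inlet temp C, buried-HDD case temp C) pairs copied from the table.
TABLE = [(10, 30), (15, 34), (20, 38), (25, 41), (30, 45),
         (35, 49), (40, 53), (45, 56), (50, 60)]


def buried_delta_t(inlet_c: float) -> float:
    """Assumed linear de-rating: 20 C at a 10 C inlet down to 10 C at 50 C."""
    return 20.0 - (inlet_c - 10.0) * (20.0 - 10.0) / (50.0 - 10.0)


def buried_case_temp(inlet_c: float) -> int:
    """Case temperature for buried HDDs, rounded half up to whole degrees."""
    return math.floor(inlet_c + buried_delta_t(inlet_c) + 0.5)


def front_case_temp(inlet_c: float) -> int:
    """HDDs in front: constant delta-T of 1 C, as stated in the table header."""
    return int(inlet_c + 1)


if __name__ == "__main__":
    for inlet, expected in TABLE:
        print(inlet, buried_case_temp(inlet), expected)
```

Note that both columns show the same relative AFR (106%) at a 41°C case temperature, which suggests the failure rate is driven by case temperature rather than inlet temperature as such.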
“Azure Network and Datacenter Infrastructure: Enterprise Quality at Cloud Scale” Microsoft Ignite 2015
http://natick.research.microsoft.com/
• Microsoft as a company already achieved carbon neutrality in 2014
( https://blogs.microsoft.com/green/category/renewable-energy/ )
https://news.microsoft.com/2016/11/14/microsoft-announces-largest-wind-energy-purchase-to-date
[Figure: hierarchy of Geo > Region > DCs/Zones; the Geo shown contains two Regions, each with its own DCs/Zones]
General-purpose / flexible    Efficiency / performance
( https://docs.microsoft.com/ja-jp/azure/virtual-machines/virtual-machines-linux-sizes )
October 15, 2016
• Hyper-V VMSwitch extension
• Core functionality for implementing SDN in Azure
• Address Virtualization for VNET
• VIP -> DIP Translation for SLB
• ACLs, Metering, and Security Guards
• Per-packet actions defined through programmable rule/flow tables
• Available in Windows Server 2016
[Figure: a NIC and a VM Switch containing VFP, with VMs attached through vNICs. VFP layers: ACLs, Metering, Security; VNET; SLB (NAT)]
“Microsoft's Production Configurable Cloud” Mark Russinovich, Chief Technology Officer, Microsoft Azure, SCS Distinguished Lecture, 11/15/2016
Host: 10.4.1.5
• VMSwitch exposes a match-action-table style API to the controller
• The controller defines the policies
• One table per policy
• Each table strictly defines how every packet is to be processed
Controller: Tenant Description, VNet Description, VNet Routing Policy, ACLs, NAT, Endpoints
VM1 (10.1.1.2) -> VFP -> NIC

VNET
  Flow              Action
  TO: 10.2/16       Encap to GW
  TO: 10.1.1.5      Encap to 10.5.1.7
  TO: !10/8         NAT out of VNET

LB NAT
  Flow              Action
  TO: 79.3.1.2      DNAT to 10.1.1.2
  TO: !10/8         SNAT to 79.3.1.2

ACLS
  Flow              Action
  TO: 10.1.1/24     Allow
  10.4/16           Block
  TO: !10/8         Allow
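The per-policy tables above can be read as an ordered chain of match-action lookups. The following is a minimal sketch of that idea, not VFP's actual API. It assumes first-match-wins within a table, matching on the destination address only (including the `10.4/16` rule, whose direction the slide does not state), an outbound layer order of ACLS, LB NAT, VNET, and no default action when nothing matches. All function names are hypothetical.

```python
import ipaddress
from typing import Callable, List, Optional, Tuple

Match = Callable[[ipaddress.IPv4Address], bool]
Rule = Tuple[Match, str]


def to_net(cidr: str) -> Match:
    """Match destinations inside the given prefix."""
    net = ipaddress.ip_network(cidr)
    return lambda dst: dst in net


def not_net(cidr: str) -> Match:
    """Match destinations outside the given prefix (the slide's '!10/8')."""
    net = ipaddress.ip_network(cidr)
    return lambda dst: dst not in net


# Tables transcribed from the slide; prefixes are written out in full.
ACLS: List[Rule] = [
    (to_net("10.1.1.0/24"), "Allow"),
    (to_net("10.4.0.0/16"), "Block"),
    (not_net("10.0.0.0/8"), "Allow"),
]
LB_NAT: List[Rule] = [
    (to_net("79.3.1.2/32"), "DNAT to 10.1.1.2"),
    (not_net("10.0.0.0/8"), "SNAT to 79.3.1.2"),
]
VNET: List[Rule] = [
    (to_net("10.2.0.0/16"), "Encap to GW"),
    (to_net("10.1.1.5/32"), "Encap to 10.5.1.7"),
    (not_net("10.0.0.0/8"), "NAT out of VNET"),
]

# Assumed outbound layer order, from the VM towards the NIC.
LAYERS = [("ACLS", ACLS), ("LB NAT", LB_NAT), ("VNET", VNET)]


def first_match(table: List[Rule], dst: ipaddress.IPv4Address) -> Optional[str]:
    """Return the action of the first matching rule, or None."""
    for match, action in table:
        if match(dst):
            return action
    return None


def process_outbound(dst: str) -> List[Tuple[str, str]]:
    """Walk the layers in order; stop as soon as an ACL blocks the packet."""
    addr = ipaddress.ip_address(dst)
    applied = []
    for name, table in LAYERS:
        action = first_match(table, addr)
        if action is None:
            continue  # no rule in this layer applies (no default is modeled)
        applied.append((name, action))
        if action == "Block":
            break
    return applied
```

For example, `process_outbound("10.1.1.5")` yields the ACL `Allow` followed by `Encap to 10.5.1.7`, while a destination outside `10/8` picks up the SNAT rule and then `NAT out of VNET`.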
[Figure: GFT offload. Controllers program VFP in the VMSwitch through the Northbound API. VFP layers, each a Rule/Action table: SLB Decap (Decap*), SLB NAT (DNAT*), VNET (Rewrite*), ACL (Allow*), Metering (Meter*). For the First Packet of a flow, the Transposition Engine combines the layer actions into a single entry:
  Flow: 1.2.3.1->1.3.4.1, 62362->80    Action: Decap, DNAT, Rewrite, Meter
The GFT Offload Engine passes the entry through the Southbound API / GFT Offload API (NDIS) to the GFT Table on the 50G SmartNIC, which also provides QoS, Crypto, and RDMA. Other labels: VM, Rewrite, Encap.]
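The first-packet path in the figure above can be sketched as an exact-match flow cache in front of a slower layered lookup. This is a minimal illustration, not the GFT API. The class and function names are hypothetical, the key uses only the addresses and ports shown on the slide, and the slow path is a stand-in that returns the slide's actions for every flow.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


Actions = Tuple[str, ...]


class FlowCache:
    """Exact-match flow table filled by a slow path on the first packet."""

    def __init__(self, slow_path: Callable[[FlowKey], Actions]) -> None:
        self._slow_path = slow_path
        self._table: Dict[FlowKey, Actions] = {}
        self.slow_path_hits = 0
        self.fast_path_hits = 0

    def lookup(self, key: FlowKey) -> Actions:
        actions = self._table.get(key)
        if actions is None:
            # First packet of the flow: run every layer, then cache the result.
            actions = self._slow_path(key)
            self._table[key] = actions
            self.slow_path_hits += 1
        else:
            self.fast_path_hits += 1
        return actions


def layered_slow_path(key: FlowKey) -> Actions:
    """Stand-in for per-layer rule evaluation; ignores the key's contents."""
    layers = [("SLB Decap", "Decap"), ("SLB NAT", "DNAT"),
              ("VNET", "Rewrite"), ("ACL", "Allow"), ("Metering", "Meter")]
    # The slide's combined entry omits 'Allow', presumably because it does
    # not modify the packet, so it is dropped here as well.
    return tuple(action for _, action in layers if action != "Allow")


if __name__ == "__main__":
    cache = FlowCache(layered_slow_path)
    key = FlowKey("1.2.3.1", "1.3.4.1", 62362, 80)
    print(cache.lookup(key), cache.lookup(key))
    print(cache.slow_path_hits, cache.fast_path_hits)
```

Only the first packet of a flow pays for the layered lookup. Every later packet with the same key is served from the table, which is the part a SmartNIC can hold in hardware.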
• Available on the D15v2 and DS15v2 IaaS virtual machine sizes
• Currently in private preview
[Figure: two racks, each with a ToR switch above four servers (CS0 to CS3 and SP0 to SP3). Each server's NIC connects to the ToR through an FPGA. Tier labels: L0, L1/L2.]
[Figure: FPGA network transport engine attached to the Datacenter Network through Ethernet Encap / Ethernet Decap and a 40G MAC+PHY. An ElasticRouter (multi-VC on-chip router) exchanges Credits, Virtual Channel, Data, and Header signals with both paths. Send side: Send Connection Table, Connection Lookup, Send Frame Queue, Packetizer and Transmit Buffer, Unack'd Frame Store, Transmit State Machine, Ack Receiver. Receive side: Receive Connection Table, Depacketizer, Credit Management, Ack Generation, Receive State Machine. Solid links show data flow, dotted links show ACK flow.]
https://www.sdxcentral.com/articles/news/microsoft-azure-will-use-intel-silicon-photonics/2016/08/
Microsoft expects to deploy silicon photonics in Azure data centers soon, “initially going for switch-to-switch connectivity,” said Kushagra Vaid, Azure’s general manager of hardware engineering, speaking at the Intel Developer Forum.
“The problem I have right now? It is supply chain. I am not so worried about technology. We have our Open Cloud Server, which I think is very compelling in that it offers some real economic capabilities. But I have got to nurture my supply chain because traditionally we bought from OEMs and now we are designing with ODMs so we can take advantage of prices and lower our overall costs. So I am moving very, very quickly to build out new capacity, and I want to do it in a very efficient and effective way and it is really about the commoditization of the infrastructure.”
( https://www.nextplatform.com/2016/09/26/rare-tour-microsofts-hyperscale-datacenters/ )
Rick Bakken, Sr. Director, Data Center Evangelism, Microsoft
https://azure.microsoft.com/en-us/blog/microsoft-reimagines-open-source-cloud-hardware/
Azure Storage
https://infrayarou.blob.core.windows.net/vhds/myubuntu.vhd
Layers: Front End (FE), Partition Layer, Stream Layer

Request 1: Partition F; Row 102
  FE 2 -> Partition 3 (F-J) -> Stream 2
  A simple example.

Request 2: Partition F; Row 507
  FE 1 -> Partition 3 (F-J) -> Stream 4
  Different Front End, same Partition Server, different Stream Server.

Request 3: Partition T; Row 356
  FE 4 -> Partition 4 (K-T) -> Stream 2
  Different Front End, different Partition Server, same Stream Server.

Request 4: Partition W; Rows 213 & 672
  FE 4 -> Partition 5 (U-Z) -> Stream 3 and Stream 4
  A transaction example: a single Partition Server atomically updates data held on multiple Stream Servers.
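In the requests above, the Partition Server is determined by the partition key's range regardless of which Front End receives the request. The following is a minimal sketch of such a range lookup, assuming keys are routed by their first letter. Only the three ranges visible on the slides are included, and the function name is hypothetical.

```python
import bisect
from typing import List, Tuple

# (first letter, last letter, server name) for the ranges visible on the
# slides. Partitions 1 and 2 are not shown there, so they are left out.
PARTITION_MAP: List[Tuple[str, str, str]] = [
    ("F", "J", "Partition 3"),
    ("K", "T", "Partition 4"),
    ("U", "Z", "Partition 5"),
]
_STARTS = [start for start, _, _ in PARTITION_MAP]


def partition_server_for(partition_key: str) -> str:
    """Find the Partition Server whose key range covers partition_key."""
    if not partition_key:
        raise ValueError("empty partition key")
    first = partition_key[0].upper()
    # Rightmost range whose start is <= the key's first letter.
    idx = bisect.bisect_right(_STARTS, first) - 1
    if idx < 0 or first > PARTITION_MAP[idx][1]:
        raise KeyError(f"no known partition range covers {partition_key!r}")
    return PARTITION_MAP[idx][2]


if __name__ == "__main__":
    for key in ("F", "T", "W"):
        print(key, "->", partition_server_for(key))
```

Because the lookup depends only on the key, Requests 1 and 2 both land on Partition 3 even though they enter through different Front Ends.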
https://docs.microsoft.com/ja-jp/azure/storage/storage-scalability-targets
[Figure: a Disk (Page Blob) holding C:\ or /dev/sda; disks are created by copy, shown both directly and through an ImageCache]
[Figure: traditional design with an L3 tier above L2 domains]
East/West traffic has a long path
Routers become large and expensive
LB/FW become bottlenecks
[Figure: network tiers, bottom to top. Rack: T0-1, T0-2 ... T0-20 above Servers. Row Spine: T1-1, T1-2 ... T1-7, T1-8. Data Center Spine: T2-1-1, T2-1-2 ... T2-1-8 through T2-4-1, T2-4-2 ... T2-4-4. Regional Spine: T3-1, T3-2, T3-3, T3-4.]
https://azure.github.io/SONiC/
” Albert Greenberg, Distinguished Engineer Director of Networking, Microsoft, SIGCOMM 2015
P802.3by)
• Today’s Server to Tier 0
  • Interconnect is based on 25G technology
  • Links are 50G Ethernet - 2x25G based on 25G Ethernet Consortium spec
  • Bandwidth growth drove us to use 50G
  • Don’t require an 802.3 specification here
• Tomorrow’s Server to Tier 0
  • Interconnect will be based upon 50G PAM4 technology
  • Expect links will be 100G Ethernet (2x50G)
  • Choice for 802.3:
    • Create the specification
    • Let a consortium do it
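Scale
  What Azure wants to achieve:
  • 100 Gbps per VIP
  • On failure, quickly reconfigure thousands of VIPs
  With an implementation built on LB products:
  • $80,000 for 20 Gbps
  • 20 Gbps per VIP
  • Reconfiguration takes 1 second per VIP

Availability
  What Azure wants to achieve:
  • N+1 redundancy and quick failover
  With an implementation built on LB products:
  • 1+1 redundancy or slow failover

Placement flexibility
  What Azure wants to achieve:
  • Place servers and LB/NAT flexibly across L2 boundaries
  With an implementation built on LB products:
  • NAT and DSR (Direct Server Return) are supported only within the same L2

Tenant isolation
  What Azure wants to achieve:
  • Overload caused by one user tenant must not affect other tenants
  With an implementation built on LB products:
  • Excessive SNAT requests from one user tenant affect other tenants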
“Ananta: Cloud Scale Load Balancing” Microsoft, SIGCOMM 2013
[Figure: Ananta architecture. Ananta Manager (a group of Controllers) takes the VIP Configuration (VIP, ports, # DIPs). Below it are a pool of Multiplexers and, on every host, a VM Switch with a Host Agent serving VM1 ... VMN.]
1st Tier: Provides packet-level (layer-3) load spreading, implemented in routers via ECMP.
2nd Tier: Provides connection-level (layer-4) load spreading, implemented in servers.
3rd Tier: Provides stateful NAT implemented in the virtual switch in every server.
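[Figure: packet flow through Routers, MUXes, and the Host (Host Agent and VM with its DIP), steps 1 to 8.
  Packet headers from the Client: Dest: VIP, Src: Client
  Packet headers from the MUX to the Host: outer Dest: DIP, Src: Mux; inner Dest: VIP, Src: Client
  Packet headers back to the Client: Dest: Client, Src: VIP]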
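The header rewrites in the figure above can be sketched as three small steps: a router picks a Mux by flow hash, the Mux picks a DIP and encapsulates, and the Host Agent decapsulates and later rewrites the reply. This is a simplified illustration rather than Ananta's implementation. The hash function, the stateless DIP choice, and all names are assumptions. The reply is built without involving a Mux, in the spirit of the DSR mentioned in the comparison earlier.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# src ip, src port, dst ip, dst port, protocol
FiveTuple = Tuple[str, int, str, int, str]


def stable_hash(flow: FiveTuple) -> int:
    """Deterministic hash so repeated packets of a flow pick the same target."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    inner: Optional["Packet"] = None  # set when the packet is encapsulated


def ecmp_pick_mux(flow: FiveTuple, muxes: List[str]) -> str:
    """Tier 1: the router spreads packets over Muxes by flow hash."""
    return muxes[stable_hash(flow) % len(muxes)]


def mux_forward(pkt: Packet, flow: FiveTuple, mux_ip: str,
                vip_map: Dict[str, List[str]]) -> Packet:
    """Tier 2: pick a DIP for the VIP and encapsulate towards it."""
    dips = vip_map[pkt.dst]
    dip = dips[stable_hash(flow) % len(dips)]
    return Packet(src=mux_ip, dst=dip, inner=pkt)


def host_agent_receive(outer: Packet) -> Packet:
    """Tier 3: decapsulate and rewrite the destination from VIP to DIP."""
    if outer.inner is None:
        raise ValueError("expected an encapsulated packet")
    return Packet(src=outer.inner.src, dst=outer.dst)


def host_agent_reply(vip: str, client: str) -> Packet:
    """Return path: the source is rewritten to the VIP; no Mux is involved."""
    return Packet(src=vip, dst=client)
```

A flow such as `("198.51.100.7", 40000, "203.0.113.10", 80, "tcp")` (documentation addresses chosen for illustration) always maps to the same Mux and DIP, so no per-connection coordination between routers and Muxes is needed in this simplified model.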
[Figure: outbound SNAT. The VM's packet has headers Dest: Server:80, Src: DIP2:5555. Using the mapping VIP:1025 <-> DIP2, it reaches the Server with headers Dest: Server:80, Src: VIP:1025.]
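The mapping in the figure above (DIP2:5555 leaving as VIP:1025) amounts to a two-way port table. The following is a minimal sketch of such a table. Sequential allocation starting at port 1025 is an assumption made to match the slide's example, how ports are actually handed out is not shown there, and the class name is hypothetical.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


class SnatTable:
    """Maps (DIP, port) pairs to ports on a shared VIP and back again."""

    def __init__(self, vip: str, first_port: int = 1025, last_port: int = 65535) -> None:
        self.vip = vip
        self._next_port = first_port
        self._last_port = last_port
        self._out: Dict[Endpoint, int] = {}   # (dip, dip_port) -> vip_port
        self._back: Dict[int, Endpoint] = {}  # vip_port -> (dip, dip_port)

    def translate_outbound(self, src: Endpoint) -> Endpoint:
        """Rewrite a DIP source endpoint to a VIP endpoint, allocating if new."""
        port = self._out.get(src)
        if port is None:
            if self._next_port > self._last_port:
                raise RuntimeError("SNAT port range exhausted")
            port = self._next_port
            self._next_port += 1
            self._out[src] = port
            self._back[port] = src
        return (self.vip, port)

    def translate_inbound(self, dst: Endpoint) -> Endpoint:
        """Rewrite a reply addressed to VIP:port back to the original DIP endpoint."""
        ip, port = dst
        if ip != self.vip or port not in self._back:
            raise KeyError(f"no SNAT mapping for {dst}")
        return self._back[port]


if __name__ == "__main__":
    table = SnatTable("VIP")
    print(table.translate_outbound(("DIP2", 5555)))
    print(table.translate_inbound(("VIP", 1025)))
```

Because the port range on a VIP is finite and shared, a tenant that opens many outbound connections can exhaust it, which is the isolation concern raised in the comparison earlier.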
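When capacity runs short, simply add servers; do not re-engineer each time
The scale and pace of change rule out manual expansion and configuration
In a world beyond 50 Gbps, CPUs alone cannot keep up
Various kinds of chips are in use, but FPGAs are the key
LinkedIn's engineering team is also cutting-edge
They are active in publishing information as well (https://engineering.linkedin.com/blog )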