api for real time/dynamic image processing

36
Presented By Ivan Wong VP and Co-founder Oct 16 th 2018 API for Real time/Dynamic Image Processing

Upload: others

Post on 06-Nov-2021

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: API for Real time/Dynamic Image Processing

Presented By

Ivan Wong

VP and Co-founder

Oct 16th 2018

API for Real time/Dynamic

Image Processing

Page 2: API for Real time/Dynamic Image Processing

About CTAccel

Founded in Mar, 2016

Based in Hong Kong & Shenzhen

Focus on FPGA-based accelerated computing for Data

Center

US Patent

Core value of our company: Quality & Innovation

CTAccel Limited

CTAccel & Xilinx partnership

Partner with Xilinx Inc since 2016

Delivered first Xilinx FPGA based Off-Premise

solution in 2016

Delivered first Xilinx FPGA based Cloud

solution in 2017

Background

Page 3: API for Real time/Dynamic Image Processing

Challenges:

• Huge consumption of computational and storage resource

• Server and storage performance IS NOT KEEPING PACE

Faster Access

Better Quality

More Data

Market demand and challenges

Users generate more image and video data everyday

Higher resolution in capture and display

Users crave better viewing experience

Better Quality

Faster Access

Customers demand instant access to the resource

More Data

Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)*

*Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/

Background

Page 4: API for Real time/Dynamic Image Processing

CPU performance per core has not

increased since 2005

Internet traffic increases by 26% annually,

image is a large portion of internet data.

How to achieve the fastest image processing with the least amount of computational resources?

Chip Frequency graphed against year of introduction

Source: Cisco--VNI Forecast Highlights Tool

Constraints in existing solutionsBackground

Source: The Future of Computing Performance (2011)

Page 5: API for Real time/Dynamic Image Processing

Modern image formats continue to emerge

Lepton

HEIF

Professional

image

processing

accelerator

OpenCV

ImageMagick

GraphicsMagick

Mozjpeg

BPG

WebP JPEG Baseline

/Progressive

Guetzli

Pixels Processing

Background

Page 6: API for Real time/Dynamic Image Processing

Where is CTAccel image processor located in customers’ production environment?

Image Video AI

UGC Web portals

News app

……Cloud storage

Cloud album Social network E-commerce

Scenarios

CIP in Object-oriented storage

Page 7: API for Real time/Dynamic Image Processing

CIP software stack and accelerated functions

JPEG

Lepton

Resizing

Crop

JPEG

WebP

Lepton

Pixel Processing

EncodeDecode

Accelerated Function UnitsSystem Stack

Image Video AI

Page 8: API for Real time/Dynamic Image Processing

Video & Image

Processing

Data Analytics

Genomics

Machine

Learning

Compute Acceleration

GenomicsFinancial

AnalyticsVideo/Image Processing Big Data AnalyticsMachine

LearningSecurity

Applications

Software-Defined

Development

Environments

Solution Development

Partners & Customers

Cloud, Enterprise to

Edge-Computing

Increase

Service Level

Why Xilinx?Image Video AI

Page 9: API for Real time/Dynamic Image Processing

Problems solved Solution

Key features

Scenarios

Values

CIP

End-user

Smart phone/PAD Camera PC

App

Cloud album Social network News app…

JPEG

DecodeFPGA Resize

JPEG

Encode

Accelerate the speed of JPEG

image thumbnail generation,

reduce image-processing

servers.

High Throughput

Low Latency

Low CPU Utilization

TCO reduction:

Reduce image-processing servers

OPEX reduction

Improve customer experience:

Accelerate the image rendering speed

Internet apps with JPEG

E-commerce platform

Cloud album

Social network with pictures

Web portals

News application

Image Video AI

Product 1:JPEG 2 JPEG

Page 10: API for Real time/Dynamic Image Processing

JPEG 2 JPEG performance on cloud

Input:10000*4096x2160(Average Size 803k)

310.4

62.2

7.7 70

50

100

150

200

250

300

350

640x480 2048x1080

Throughput(MB/s)

Input:10000*1024x768 (Average Size 130k)

94.9

50.8

13.38.5

0

20

40

60

80

100

240x180 768x576

Throughput(MB/s)

28

201

1122 1162

0

200

400

600

800

1000

1200

1400

640x480 2048x1080

Latency(ms)

423

48

197

0

50

100

150

200

250

240x180 768x576

Latency(ms)

63.00%

95.00%99.00% 99.00%

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

120.00%

640x480 2048x1080

CPU Utilization

37.00%

70.00%

83.00% 87.00%

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

240x180 768x576

CPU Utilization

VU9P (HUAWEI fp1c.2xlarge)

8vCPU (HUAWEI fp1c.2xlarge)

Image Video AI

Page 11: API for Real time/Dynamic Image Processing

Image Video AI

Problems solved Solution

Key features

Scenarios

Values

CIP

End-user

Smart phone/PAD Camera PC

App

E-commerce Social network News app…

JPEG

DecodeFPGA Resize

WebP

Encode

Vast amount of computational

resources and long time when

processing WebP encoding by

CPU;

Accelerate the transcoding from

JPEG to WebP;

Internet apps with WebP

E-commerce platform

Social network with pictures

Web portals

News application

High Throughput

Low Latency

Low CPU Utilization

TCO reduction:

Save CDN traffic

Reduce image-processing servers

OPEX reduction

Improve customer experience:

Accelerate the image rendering speed

Product 2: JPEG 2 WebP

Page 12: API for Real time/Dynamic Image Processing

384

286

108

721

470

208

9956 25

0

100

200

300

400

500

600

700

800

640x480 1024x768 2048x1080

QPS

FPGA*1 FPGA*2 CPU

60.97 82.79

220.07

32.7 49.94113.03

240.64

422.73

941.62

0

200

400

600

800

1000

640x480 1024x768 2048x1080

Latency (ms)

FPGA*1 FPGA*2 CPU

25% 23% 21%

44% 45% 45%

100% 100% 100%

0%

20%

40%

60%

80%

100%

120%

640x480 1024x768 2048x1080

CPU Utilization

FPGA*1 FPGA*2 CPU

318237

89

599

390

173

8246 21

0

100

200

300

400

500

600

700

640x480 1024x768 2048x1080

Throughput (MB/s)

FPGA*1 FPGA*2 CPU

Input: 10000*4096 x2160 images

QPSSingle FPGA is 5.11 times of CPU

Dual FPGA is 8.39 times of CPU

LatencySingle FPGA is 0.20 times of CPU

Dual FPGA is 0.12 times of CPU

JPEG to WebP performance

Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2

RAM: 128GB

OS: CentOS Linux release 7.2.1511

FPGA Used:

Single Xilinx Kintex UltraScale+ KU115

Dual Xilinx Kintex UltraScale+ KU115

Image Video AI

Page 13: API for Real time/Dynamic Image Processing

Modern image format-Lepton and its problem

Image Video AI

Lepton is a new image compression format developed by Dropbox and made open source in 2016.

Lepton compresses JPEG images losslessly. It reduces file size by an average of 22%.

It preserves the original JPEG file bit-by-bite perfectly, including all metadata.

Lepton can be applied into the scenarios with massive images storage . It can effectively reduce

the storage cost.

JPEG Lepton Compression Ratio

1024x768 420 1.3G 1.1G 15%

1024x768 422 1.4G 1.1G 21%

1024x768 444 1.6G 1.3G 18%

4096x2160 420 8.3G 6.3G 24%

4096x2160 422 8.7G 6.6G 24%

4096x2160 444 9.5G 7.1G 25%

Average 21%

The downside of Lepton is that it requires heavy computation power for both compression and

decompression. A conventional X86 server with dual E5-2630 CPU can only compress JPEG files into Lepton format at a rate of 20 megabytes per second.

Problem

Page 14: API for Real time/Dynamic Image Processing

Product 3: JPEG to LeptonImage Video AI

Problems solved Solution

Key features

Scenarios

Values

CIP

End-user

Smart phone/PAD Camera PC

App

Cloud album Cloud storage

JPEG

DecodeFPGA

Lepton

Encode

Scenarios with massive images

storage :

Cloud album;

Cloud storage;

TCO Reduction:

Save CDN traffic

Reduce image-processing servers

Reduce image-storage servers

OPEX reduction

High Throughput

Low Latency

Low CPU Utilization

Vast amount of computational

resources and long time when

processing Lepton encoding by

CPU;

Accelerate the transcoding from

JPEG to Lepton;

Page 15: API for Real time/Dynamic Image Processing

JPEG to Lepton performance

FPGA Used:

Virtex UltraScale+ FPGA VCU1525

Input images:

Total number: 999 images

Total file size: 3.8GB

108.2

25.7

0.0

20.0

40.0

60.0

80.0

100.0

120.0

24 threads

Throughput(MB/s)28.3

6.7

0.0

5.0

10.0

15.0

20.0

25.0

30.0

24 threads

QPS

827.77

3492.08

0.00

500.00

1000.00

1500.00

2000.00

2500.00

3000.00

3500.00

4000.00

24 threads

Latency(ms)

7%

100%

0%

20%

40%

60%

80%

100%

120%

24 threads

CPU Utilization

QPSFPGA is 4.2 times of CPU

LatencyFPGA is 0.24 times of CPU

Test environment :

CPU: Intel(R) Xeon(R) E5-2630v2 x2

RAM: 128GB

OS: CentOS Linux release 7.3.1611

VCU1525

CPU

VCU1525

CPU

VCU1525

CPU

VCU1525

CPU

Image Video AI

Page 16: API for Real time/Dynamic Image Processing

Product 4: HEIC to JPEGImage Video AI

Problems solved Solution

Key features

Scenarios

Values

CIP

End-user

Smart phone/PAD Camera PC

App

Cloud album Social network News app…

HEIC

DecodeFPGA Resize

JPEG

Encode

High Throughput

Low Latency

Low CPU Utilization

TCO reduction:

Reduce image-processing servers

OPEX reduction

Improve customer experience:

Accelerate the image rendering speed

Internet apps with HEIC

E-commerce platform

Social network with pictures

Web portals

News application

Vast amount of computational

resources and long time when

processing HEIC decoding by CPU;

Poor customer experience;

Page 17: API for Real time/Dynamic Image Processing

Features of CIP

5-10 times throughput promotion compare to CPU

3 times latency reduction compare to CPU

20W-40W per FPGA card

TCO reduction: improve computing density of DC, reduce racks

Support: ImageMagick, OpenCV, GraphicsMagick, Lepton

Allow seamless migration from software-based implementation to CIP

Remote upgrading

PR(Partial Reconfiguration) technology allows fast and easy context switch in

accelerator functionality without rebooting the server

High

performance

Low power

Software

compatible

Ease of

maintenance

Image Video AI

Summary of CIP’s values

Page 18: API for Real time/Dynamic Image Processing

Advantages

Image thumbnail

generation for cloud

album

Scenario

Customer

Leading mobile phone

manufacturer in China

Feedback from customer

·Maintaining stable operating status

JPEG

decodeResize

CIP

· CIP has been deployed in the production environment for more than 16 months

High

performance

TCO

reduction

Ease of

maintenance

Reduce TCO by

25%

3x latency reduction

Deterministic

timing

Smaller cluster of

servers

Reduce the total

number of server

cluster by 50%

Improve the

throughput by 50%

Better

QoS

Xilinx FPGA

Image Video AI

Case study 1

Page 19: API for Real time/Dynamic Image Processing

Image Video AI

Case study 2

Advantages

Transcoding from

JPEG to WebP

Scenario

Customer

Famous video portal

website in China

Feedback from customer

·CIP has passed the test of customer

JPEG

decodeResize

CIP

· Test period: 2 months

TCO

reduction

Reduce TCO by

50% at least

Improve customer

experience

Reduce latency by

xx% (not announced the

specific values by our

customer)

Deterministic

timing: the same

low latency at full-

load and no-load

High

performance

1 server with one CIP

accelerator has the

equivalent computing

capability with 3

servers without CIP

WebP

encode·CIP will be deployed in the production environment in 2018Q4

Page 20: API for Real time/Dynamic Image Processing

20192018Q4Ready

JPEG->Lepton

WebP, M2

Pixels

JPEG

HEIC

BPG

Mozjpeg

Guetzli

JPEG-Progressive

Lepton->JPEG

WebP, M6

Ready

Image Video AI

Evolution planning of CIP

Page 21: API for Real time/Dynamic Image Processing

Virtex UltraScale+

VCU1525

Kintex®

UltraScale™115

Kintex®

UltraScale™ 060

Zynq

UltraScale+

PartnersCloud platform support

Hardware platform support

Image Video AI

Page 22: API for Real time/Dynamic Image Processing

Video AIImage

Higher performance

density

Improve quality of

serviceTCO reduction

High-quality FPGA-based accelerated computing solution provider

New product line of CTAccelImage Video AI

Page 23: API for Real time/Dynamic Image Processing

Xilinx MPSoC with VCU - H264/H265

Problems solved Solution

Key features

Scenarios

Values

Xilinx MPSoC with VCU

End-user

Smartphone PC TV

App

Video portal

website

Short video

app

Live

broadcast…

H264/H265

DecodeFPGA

H264/H265

Encode

Applications which require video

transcoding:

Video portal websites ;

Short video apps;

FPS

PSNR

TCO Reduction:

Reduce processing servers

OPEX reduction

Improve customer experience

Accelerate the speed of video

processing, improve customer

experience

Image Video AI

Page 24: API for Real time/Dynamic Image Processing

Performance & quality of H.264 (MPSoC)Image Video AI

0

50

100

150

200

250

300

350

0 2 4 6

FPS

Bitrate(Mbps)

H.264 Performance

50

60

70

80

90

100

0 2 4 6 8Bitrate(Mbps)

H.264 VMAF

32

34

36

38

40

42

44

46

0 2 4 6 8

FPS

Bitrate(Mbps)

H.264 PSNR

0.86

0.88

0.9

0.92

0.94

0.96

0.98

1

0 2 4 6 8Bitrate(Mbps)

H.264 SSIM

FPGA: H.264

GPU: NV P4(h264) fast

GPU: NV P4(h264) slow

CPU: x264 ultrafast

CPU: x264 medium

CPU: x264 placebo

FPGA: H.264

GPU: NV P4(h264) fast

GPU: NV P4(h264) slow

CPU: x264 ultrafast

CPU: x264 medium

CPU: x264 placebo

FPGA: H.264

GPU: NV P4(h264) fast

GPU: NV P4(h264) slow

CPU: x264 ultrafast

CPU: x264 medium

CPU: x264 placebo

FPGA: H.264

GPU: NV P4(h264) fast

GPU: NV P4(h264) slow

CPU: x264 ultrafast

CPU: x264 medium

CPU: x264 placebo

Page 25: API for Real time/Dynamic Image Processing

Image Video AI

0

50

100

150

200

250

300

350

400

0 2 4 6 8

FPS

Bitrate(Mbps)

H.265 Performance

70

75

80

85

90

95

100

0 2 4 6 8Bitrate(Mbps)

H.265 VMAF

38

40

42

44

46

48

0 2 4 6 8Bitrate(Mbps)

H.265 PSNR

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

0 2 4 6 8Bitrate(Mbps)

H.265 SSIM

FPGA: H.265

GPU: NV P4(h265) fast

GPU: NV P4(h265) slow

CPU: x265 ultrafast

CPU: x265 medium

CPU: x265 placebo

FPGA: H.265

GPU: NV P4(h265) fast

GPU: NV P4(h265) slow

CPU: x265 ultrafast

CPU: x265 medium

CPU: x265 placebo

FPGA: H.265

GPU: NV P4(h265) fast

GPU: NV P4(h265) slow

CPU: x265 ultrafast

CPU: x265 medium

CPU: x265 placebo

FPGA: H.265

GPU: NV P4(h265) fast

GPU: NV P4(h265) slow

CPU: x265 ultrafast

CPU: x265 medium

CPU: x265 placebo

Performance & quality of H.265 (MPSoC)

Page 26: API for Real time/Dynamic Image Processing

Image Video AI

Summary of H264/H265 encoding (MPSoC)

Codec ItemBitrate(Mbps)

Low(<2) Medium(2-5) High(>5)

H.264

FPGA vs

GPU

Performance Better than GPU

Quality Better than NV P4 fast, worse than NV P4 slowBetter than NV P4 fast,

similar to NV P4 slow

FPGA vs

CPU

Performance Worse thanx264 ultrafast,better than x264 medium and x264 placebo

Quality

Better than x264 ultrafast

Worse than x264 placebo

Worse than x264 medium

Better than x264 ultrafast

Worse than x264 placebo

Similar to x264 medium

H.265

FPGA vs

GPU

Performance Worse than GPU

Quality<1.6M:worse than GPU

1.6M~2M:better than GPUBetter than GPU

FPGA vs

CPU

Performance Better than CPU

Quality

<1.6M:worse than CPU

1.6M~2M:better than x265 ultrafast,worse than x265 placebo,worse than x265 medium

Worse than x265 placebo ,better than x265 ultrafast

and x265 medium

Page 27: API for Real time/Dynamic Image Processing

Image Video AI

I-Frame 2 JPEG

Problems solved Solution

Key features

Scenarios

Values

CIP + MPSoC

End-user

Smartphone PC TV

App

Video portal

website

Short video

app

Live

broadcast…

H264/H265

DecodeFPGA Resize

JPEG

Encode

Applications which require “I-

Frame 2 JPEG”:

Video portal websites ;

Short video apps;

High Throughput

Low Latency

Low CPU Utilization

TCO Reduction:

Reduce processing servers

OPEX reduction

Improve customer experience

Accelerate the speed of ”I-Frame 2

JPEG”, improve customer

experience

Page 28: API for Real time/Dynamic Image Processing

Image Video AI

AI accelerator: face detection/recognition

Problems solved Solution

Key features

Scenarios

Values

Smart City;

Intelligent Transportation;

Safe City;

Security system;

End-user

Smartphone PAD PC

App

Safe City Smart CityIntelligent

Transportation…

Image

Decode

FPGA

ResizeFace

recognition

GPU/FPGA

High Throughput

Low Latency

Low CPU Utilization

TCO Reduction:

Reduce image-processing servers

OPEX reduction

Improve customer experience

Improve the speed of face

detection/recognition

Accelerate the speed of ”face

detection/recognition”, improve

customer experience, reduce

TCO

Page 29: API for Real time/Dynamic Image Processing

Image Video AI

AI accelerator: test scenarios

GPU GPU

GPU

GPU

GPU

GPU computing matrix

Memory caching

FPGA #1

FPGA #2

Test-1:Alexnet model

FPGA #3

Test-2:nsfw model

Test-3:Inception V4 model

Test-4:ResNet-50 model

Test environment CPU: 2*Intel(R) Xeon(R) CPU E5-2690v4 x 2 RAM: 128GB OS: CentOS Linux release 7.4 Kernel version: 3.10.0-514.2.2.el7.x86_64 Python version: 2.7

Image type Resolution Quantity

8 Mega Pixel 2647X3278 14708

16 Mega Pixel 5312x2988 9280

1080p 1920x1080 21400

Test data

Page 30: API for Real time/Dynamic Image Processing

Image Video AI

Test result 1-Alexnet

Alexnet model

QPS:

FPGA+GPU is 2.5 faster than CPU+GPU

Latency:

FPGA+GPU is 30% of CPU+GPU

CPU usage:

FPGA+GPU is 20% of CPU+GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

8M Pixels 16M Pixels 1080p

Latency(ms)

3 FPGA+ 5 GPU CPU+ 5 GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

8M Pixels 16M Pixels 1080p

CPU utilization(%)

3 FPGA+ 5 GPU CPU + 5 GPU

0.00

500.00

1000.00

1500.00

2000.00

2500.00

8M Pixels 16M Pixels 1080p

Throughput(MB/S)

3 FPGA + 5 GPU CPU+ 5 GPU

Page 31: API for Real time/Dynamic Image Processing

Image Video AI

Test result 2-nsfw

nsfw model

0.00

500.00

1000.00

1500.00

2000.00

2500.00

8M Pixels 16M Pixels 1080p

Throughput(MB/S)

3 FPGA+ 5 GPU CPU+5 GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

90.00

8M Pixels 16M Pixels 1080p

Latency(ms)

3 FPGA+ 5 GPU CPU+5 GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

8M Pixels 16M Pixels 1080p

CPU utilization(%)

3 FPGA+ 5 GPU CPU+5 GPU

QPS:

FPGA+GPU is 3 faster than CPU+GPU

Latency:

FPGA+GPU is 30% of CPU+GPU

CPU usage:

FPGA+GPU is 20% of CPU+GPU

Page 32: API for Real time/Dynamic Image Processing

Image Video AI

Test result 3-Inception V4

InceptionV4 model

0

500

1000

1500

2000

2500

8M Pixels 16M Pixels 1080p

Throughput(MB/S)

3 FPGA+ 5 GPU CPU+5 GPU

0

10

20

30

40

50

60

70

80

90

8M Pixels 16M Pixels 1080p

Latency(ms)

3 FPGA+ 5 GPU CPU+5 GPU

0

10

20

30

40

50

60

70

8M Pixels 16M Pixels 1080p

CPU utilization(%)

3 FPGA+ 5 GPU CPU+5 GPU

QPS:

FPGA+GPU is 2 faster than CPU+GPU

Latency:

FPGA+GPU is 40% of CPU+GPU

CPU usage:

FPGA+GPU is 30% of CPU+GPU

Page 33: API for Real time/Dynamic Image Processing

Image Video AI

Test result 4-ResNet-50

ResNet-50 model

0.00

500.00

1000.00

1500.00

2000.00

2500.00

8M Pixels 16M Pixels 1080p

Throughput(MB/S)

3 FPGA+ 5 GPU CPU+5 GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

8M Pixels 16M Pixels 1080p

Latency(ms)

3 FPGA+ 5 GPU CPU+5 GPU

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

8M Pixels 16M Pixels 1080p

CPU utilization(%)

3 FPGA+ 5 GPU CPU+5 GPU

QPS:

FPGA+GPU is 3 faster than CPU+GPU

Latency:

FPGA+GPU is 30% of CPU+GPU

CPU usage:

FPGA+GPU is 20% of CPU+GPU

Page 34: API for Real time/Dynamic Image Processing

Image Video AI

Test result 5-Accuracy

Model Category Accuracy(top1) Accuracy(top5)

AlexnetTensorFlow 0.49 0.74

accel 0.49 0.73

Inceptionv4TensorFlow 0.80 0.95

accel 0.79 0.95

ResNet50TensorFlow 0.73 0.91

accel 0.72 0.90

nsfwTensorFlow 0.75 N/A

accel 0.75 N/A

Page 35: API for Real time/Dynamic Image Processing

Product website:www.ct-accel.com

Product hotline:+86-0755-88914045

Product enquiry email:[email protected]

Cloud partners:

AWS

HUAWEI Cloud

Alibaba Cloud

Baidu Cloud

Tencent Cloud

FPGA Partner:

For any product related enquiries:

Page 36: API for Real time/Dynamic Image Processing

Adaptable.

Intelligent.