Multi-Block GPU Implementation of a Stokes Equations Solver for Absolute Permeability Computation
Nicolas Combaret, Ph.D. - Software Engineer, FEI Visualization Sciences Group


  • Multi-Block GPU Implementation of a Stokes Equations Solver for Absolute Permeability Computation

    Nicolas Combaret, Ph.D. - Software Engineer

    FEI Visualization Sciences Group

    GTC 2014, March 26, 2014

  • Preamble


    • Avizo Fire for Material Sciences:

    – Visualization

    – Image Processing

    – Measures & Quantification

    • Compute physical properties:

    – Diffusive properties (thermal, electrical, molecular)

    – Absolute Permeability


  • Why Absolute Permeability?

    [Figure: application context: exploration wells and well analysis; production wells, upscaling, and reservoir modeling]

  • Overview

    • Stokes Equations for Absolute Permeability

    • Solving Stokes Equations

    • Out-of-Core GPU implementation


  • Stokes Equations for Absolute Permeability

  • Absolute Permeability

    • Measures the ability of a porous medium to transmit a fluid

    • Found throughout the world of porous media:

    – Soils (petroleum, mining, civil engineering)

    – Rocks, core sample, core plugs

    – Cement, foams, ceramics

    – Powders, sands


  • Darcy’s Law

    • Empirical law:

      Q/S = −(k/μ) · (ΔP/L)

    To estimate k:

    • S, L and μ are external parameters

    • Q and ΔP need to be computed

    [Figure: porous medium of length L and cross-section area S; a fluid flow Q is driven by the pressure drop ΔP between input and output]
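    Rearranged for the permeability (simple algebra, implied but not written on the slide):

      k = −(Q · μ · L) / (S · ΔP)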

  • Stokes Equations

    • Simplification of the Navier-Stokes equations for an incompressible, Newtonian fluid in steady-state, laminar flow

    • ∇²v − ∇p = 0

      ∇ · v = 0

    – v: local fluid velocity

    – p: local fluid pressure

    • With v known everywhere: Q

    • With p known everywhere: ∆P


  • Solving Stokes Equations

  • Stokes Equations Discretization (1)

    • Finite volume, explicit discretization

    • To compute one v at time step t:

    v(t − 1) + p(t − 1) → v(t)

  • Stokes Equations Discretization (2)

    • Finite volume, explicit discretization

    • To compute one p at time step t:

    v(t) + p(t − 1) → p(t)

  • Time and Space Dependency

    v(t − 1) + p(t − 1) → v(t)

    v(t) + p(t − 1) → p(t)
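    A minimal sketch of what one such explicit update can look like in CUDA, for a single velocity component on the regular 3D grid. The Jacobi-style radius-1 stencil, the relaxation factor, and the grid spacing h are assumptions; the deck does not give the actual finite-volume coefficients.

    __global__ void update_velocity(const double* v_old, const double* p_old,
                                    double* v_new, int nx, int ny, int nz,
                                    double h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
            return;                                  // boundary values stay fixed

        size_t c  = ((size_t)k * ny + j) * nx + i;   // linear cell index
        size_t sy = nx, sz = (size_t)nx * ny;        // strides along y and z

        // Discrete Laplacian of v and pressure gradient along x.
        double lap = (v_old[c - 1]  + v_old[c + 1]
                   +  v_old[c - sy] + v_old[c + sy]
                   +  v_old[c - sz] + v_old[c + sz] - 6.0 * v_old[c]) / (h * h);
        double gradp = (p_old[c + 1] - p_old[c - 1]) / (2.0 * h);

        // Relax toward the steady Stokes solution ∇²v − ∇p = 0.
        v_new[c] = v_old[c] + 0.1 * (lap - gradp);
    }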

  • Iterative solver

    • Iterative solver with two time steps t − 1 and t

    • p(t) depends on v(t)

    • Convergence:

    – Slow: a lot of iterations are necessary

    – Guaranteed: no divergence

    [Flowchart: initialize data structures → compute v(t) → compute p(t) → convergence? → loop back or output results]
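    In code, the loop this flowchart sketches could look like the following; all helper names are hypothetical, and the every-100-iterations convergence check comes from the implementation slide further down.

    // Hypothetical host-side driver for the iterative solver.
    void solve(Solver& s)
    {
        initialize(s);                    // allocate v and p for steps t-1 and t
        for (int it = 1; ; ++it) {
            launch_velocity_update(s);    // v(t) from v(t-1) and p(t-1)
            launch_pressure_update(s);    // p(t) from v(t) and p(t-1)
            swap_time_steps(s);           // step t becomes step t-1
            if (it % 100 == 0 && error(s) < s.tolerance)
                break;                    // converged
        }
        output_results(s);                // derive Q from v and ΔP from p
    }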

  • First Implementation

    • CPU implementation:

    – Double indirection approach

    • Direct port to GPU: bad performance

    – Too many non-coalesced memory accesses

    [Figure: cell values reached through an indices array (double indirection)]
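    For illustration, the access pattern behind this (hypothetical names, not the original code) is a gather through an index table, which defeats coalescing:

    __global__ void gather(const int* indices, const double* values,
                           double* out, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            out[t] = values[indices[t]];  // neighboring threads read scattered addresses
    }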

  • Current Implementation

    • Target CUDA GPUs with Compute Capability ≥ 2.0 (Fermi)

    • Target workstations with one or two GPUs

    • Regular 3D grid

    • Velocities and pressures allocated on the GPU

    • Each GPU thread computes one velocity and one pressure value

    • Error (for convergence) computed on the GPU every 100 iterations

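    One possible way to evaluate that error on the GPU, sketched here with Thrust (the deck does not say how it is actually computed): take the maximum absolute change of a field between the two stored time steps.

    #include <thrust/execution_policy.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/transform_reduce.h>
    #include <thrust/functional.h>
    #include <cmath>

    // Absolute change of one cell between time steps t and t-1.
    struct abs_diff {
        const double *a, *b;              // device pointers
        __host__ __device__ double operator()(size_t i) const {
            return fabs(a[i] - b[i]);
        }
    };

    double max_change(const double* d_new, const double* d_old, size_t n)
    {
        thrust::counting_iterator<size_t> first(0), last(n);
        return thrust::transform_reduce(thrust::device, first, last,
                                        abs_diff{d_new, d_old},
                                        0.0, thrust::maximum<double>());
    }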

  • Results

    Data size   CPU time / 100 iter. (s)   GPU time / 100 iter. (s)   Speedup
    50³         0.202                      0.341                      0.6
    100³        1.854                      0.628                      3.0
    200³        16.5                       2.684                      6.1
    400³        151.583                    18.454                     8.2
    500³        283.129                    26.842                     10.5

    GPU: Quadro K6000; CPU: 2×4 cores

  • Out-of-Core GPU implementation

  • Memory Limit is an Issue

    • Max memory on a GPU: up to 12 GB for Quadro K6000 and Tesla K40

    • Solver memory consumption:

    – 4 unknowns per cell (3 velocity components + 1 pressure)

    – Double precision (8 bytes each) = 32 bytes

    – 2 time steps for each cell = 64 bytes

    – Number of cells: a 1000³ data set = 64 GB


  • Idea

    • Divide the data set into blocks that fit in GPU memory

    • Overlap block transfers with GPU computation


  • Blocks Transfer Process

    [Figure sequence over eight slides: the data set is split into four blocks (1-4); each block is copied to the GPU, computed, and copied back in turn, so that one block's transfer is in flight while another block is being processed]
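    A minimal double-buffering sketch of this pattern (hypothetical names, not the presenter's code): one stream transfers the next block while another stream computes on the resident one. h_block(b) is assumed to return pinned host memory for block b, and compute_block stands in for the solver kernels.

    #include <cuda_runtime.h>

    __global__ void compute_block(double* block);     // solver kernels (assumed)

    void stream_blocks(int num_blocks, size_t block_bytes,
                       double* (*h_block)(int), dim3 grid, dim3 threads)
    {
        cudaStream_t xfer, comp;
        cudaStreamCreate(&xfer);
        cudaStreamCreate(&comp);

        double* d_buf[2];                             // two resident block buffers
        cudaMalloc(&d_buf[0], block_bytes);
        cudaMalloc(&d_buf[1], block_bytes);

        cudaMemcpyAsync(d_buf[0], h_block(0), block_bytes,
                        cudaMemcpyHostToDevice, xfer);
        for (int b = 0; b < num_blocks; ++b) {
            cudaStreamSynchronize(xfer);              // block b has arrived
            cudaStreamSynchronize(comp);              // the other buffer is free
            if (b + 1 < num_blocks)                   // prefetch the next block
                cudaMemcpyAsync(d_buf[(b + 1) % 2], h_block(b + 1), block_bytes,
                                cudaMemcpyHostToDevice, xfer);
            compute_block<<<grid, threads, 0, comp>>>(d_buf[b % 2]);
            cudaMemcpyAsync(h_block(b), d_buf[b % 2], block_bytes,
                            cudaMemcpyDeviceToHost, comp);
        }
        cudaDeviceSynchronize();
        cudaFree(d_buf[0]);  cudaFree(d_buf[1]);
        cudaStreamDestroy(xfer);  cudaStreamDestroy(comp);
    }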

  • Challenge

    • Covering data transfer with computation:

    – Several iterations computed on each block

    – Halo data transferred (colors refer to the slide's stencil figure):

    • Values in the black cell: 2 iterations

    • Values in the green cells: only 1 iteration

    • Values in the white cells: not computed

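    A rough way to quantify the trade-off (assuming a radius-1 stencil, which the deck does not state): running n iterations on a cubic block of edge B cells requires a halo n cells wide on each face, so only about ((B − 2n) / B)³ of the transferred cells receive all n updates; the rest is the useless computation balanced on the next slide.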

  • Challenge (continued)

    • Available memory on the GPU is determined at runtime

    – Defines the maximal size for a block (1/3 of GPU memory)

    • Need to balance:

    – Number of iterations = halo size = useless computation

    – Number of blocks = number of transfer-compute cycles
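    A sketch of that runtime sizing: the one-third cap is from the slide and the 64 bytes per cell from the memory-limit slide, while the function itself and total_cells are illustrative.

    #include <cuda_runtime.h>

    // How many transfer-compute cycles a data set needs, given that a block
    // may occupy at most one third of the currently free GPU memory.
    size_t blocks_needed(size_t total_cells)
    {
        size_t free_bytes, total_bytes;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        size_t max_block_bytes = free_bytes / 3;      // 1/3 of GPU memory
        size_t bytes_per_cell  = 4 * 8 * 2;           // 4 unknowns × 8 B × 2 steps
        size_t cells_per_block = max_block_bytes / bytes_per_cell;
        return (total_cells + cells_per_block - 1) / cells_per_block;
    }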

  • Result

    [Figure: profiler timeline showing CPU→GPU and GPU→CPU copies overlapped with GPU kernel execution]

  • Results

    Data size   CPU time / 100 iter. (s)   GPU time / 100 iter. (s)   Speedup
    50³         0.202                      0.341                      0.6
    100³        1.854                      0.628                      3.0
    200³        16.5                       2.684                      6.1
    400³        151.583                    18.454                     8.2
    500³        283.129                    26.842                     10.5
    800³        461.86                     161.42                     2.86
    1024³       711.89                     238.89                     2.98

    (The 800³ and 1024³ cases exceed GPU memory and run out-of-core.)

  • Conclusion & future work

  • Conclusion

    • Implementation of a Stokes equations solver in CUDA

    • Able to manage an “unlimited” data size on one GPU

    • Speedup from 3× (out-of-core) to 10× (in-core) compared to the CPU

    • CUDA was integrated:

    – Into a general-purpose software package

    – With a large number of supported devices

    – Within a limited development time

  • Future work

    • Optimize GPU kernels: textures? Shared memory?

    • Use more GPUs:

    – Peer-to-peer memory access if the data fits in memory

    – Distribute blocks across several GPUs

    • Optimize the division into blocks:

    – Fewer blocks

    – Better overlap of memory copies and compute

  • Acknowledgments

    • NVIDIA for training and support:

    – François Courteille

    – Paulius Micikevicius

    – Julien Demouth


  • Thank you for your attention.