The power of C++: Project Austin app
Ale Contenti, Visual C++ | Principal Dev Manager
4-001
• What’s Austin
• Why we built it
• C++ at work
• Go build amazing apps!
Diving deep into project Austin
Austin
Austin is a digital note-taking app for Windows 8
• You can add pages to your notebook, delete them, or move them around
• You can use digital ink to write or draw things on those pages
• You can add photos from your computer, from SkyDrive, or directly from your computer's camera
• You can share the notes you create with other Windows 8 apps such as Email or SkyDrive
Beautiful and simple
Austin: just a pen and a piece of paper
Austin: why we built it
We used Visual C++ 2012 to build an amazing app:
• Written in “modern C++”
• DirectX, XAML for UI
• C++/CX to interact with WinRT
• Auto-vectorizer for faster ink smoothing
• C++ AMP for faster page curling
• …and it was fun (the code is available on CodePlex, too)
Showcase the power of Windows 8, the native platform and C++
• Modern C++
• DirectX and XAML UI
• C++/CX layer
Modern C++
We strived to write Austin in a “modern” way:
• C++ Standard Library, augmented with PPL and Boost
• Smart pointers instead of raw pointers
• Pervasive RAII pattern
• Handle errors using C++ exceptions
• Coding conventions inspired by Boost
No bare pointers, no delete
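A minimal sketch of what "no bare pointers, no delete" can look like in practice. The Page and Notebook types below are hypothetical and are not taken from the Austin sources; std::make_unique is a C++14 convenience, so with an older compiler one would write std::unique_ptr<Page>(new Page(title)) instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not from the Austin code base.
struct Page {
    explicit Page(std::string t) : title(std::move(t)) {}
    std::string title;
};

class Notebook {
public:
    // Errors are reported with an exception rather than an error code.
    Page& add_page(const std::string& title) {
        if (title.empty())
            throw std::invalid_argument("page title must not be empty");
        // The vector of unique_ptr owns every page (RAII): pages are
        // released when the notebook is destroyed, with no explicit delete.
        pages_.push_back(std::make_unique<Page>(title));
        return *pages_.back();
    }

    std::size_t page_count() const { return pages_.size(); }

private:
    std::vector<std::unique_ptr<Page>> pages_;
};
```

If add_page throws, the notebook is left unchanged and nothing leaks, because ownership never passes through a raw pointer.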
DirectX and XAML
• DirectX to create an immersive, fluid user interface that's built as a 3D scene with lights, shadows, and a camera
• On the DirectX render target, we draw the notebook's pages, photos, ink strokes, and background
• A 3D engine library abstracts some of the DirectX complexity
DirectX for a fast, fluid, true-to-life experience
• XAML UI is used for the settings menu, the app bar, and the rest of the user interface
• SwapChainBackgroundPanel hosts the 3D scene inside the XAML UI page
C++/CX
C++/CX is used at the “boundary” to interact with Windows (via the WinRT objects) and to leverage XAML UI
• Used for loading and saving images, file picker, camera, storage files and folders (SkyDrive, etc.), implementing the “share” contract
• Very useful for XAML UI: UI elements and events hook-ups
• We were careful not to let C++/CX code “bleed” too much into our Standard C++ code (15 files out of 350)
Windows is the RunTime
Ink smoothing and auto-vectorizer
Ink smoothing: the problem
We have on the order of 5 ms or less to smooth the strokes
In real time, please…
for (int j = 0; j < numPoints; j++)
{
    float t = (float)j / (float)(numPoints - 1);
    smoothedPressure[j] = (1-t)*p2p + t*p3p;
    smoothedPoints_X[j] = (2*t*t*t - 3*t*t + 1) * p2x
                        + (-2*t*t*t + 3*t*t) * p3x
                        + (t*t*t - 2*t*t + t) * L*(p3x-p1x)
                        + (t*t*t - t*t) * L*(p4x-p2x);
    smoothedPoints_Y[j] = (2*t*t*t - 3*t*t + 1) * p2y
                        + (-2*t*t*t + 3*t*t) * p3y
                        + (t*t*t - 2*t*t + t) * L*(p3y-p1y)
                        + (t*t*t - t*t) * L*(p4y-p2y);
}
Ink smoothing: the code
The C++ compiler is obsessed with optimization: in this case, it will auto-vectorize the loop
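The loop above can be wrapped into a self-contained function to make its behavior easier to check. This is only a sketch: InkPoint, SmoothedStroke and smooth_segment are names assumed for illustration, and L is simply passed in as the tangent scale factor used on the slide. The arithmetic is the same as in the slide's loop, with the four polynomial weights named.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; the slide uses separate float arrays.
struct InkPoint { float x, y, pressure; };
struct SmoothedStroke { std::vector<float> x, y, pressure; };

// Interpolates between p2 and p3, using p1 and p4 to shape the tangents.
SmoothedStroke smooth_segment(const InkPoint& p1, const InkPoint& p2,
                              const InkPoint& p3, const InkPoint& p4,
                              float L, int numPoints) {
    assert(numPoints >= 2);  // t divides by (numPoints - 1)
    const std::size_t n = static_cast<std::size_t>(numPoints);
    SmoothedStroke out;
    out.x.resize(n);
    out.y.resize(n);
    out.pressure.resize(n);

    for (std::size_t j = 0; j < n; ++j) {
        const float t = static_cast<float>(j) / static_cast<float>(n - 1);
        const float t2 = t * t;
        const float t3 = t2 * t;
        const float w2  = 2 * t3 - 3 * t2 + 1;   // weight of p2
        const float w3  = -2 * t3 + 3 * t2;      // weight of p3
        const float wt2 = t3 - 2 * t2 + t;       // weight of the tangent at p2
        const float wt3 = t3 - t2;               // weight of the tangent at p3

        out.pressure[j] = (1 - t) * p2.pressure + t * p3.pressure;
        out.x[j] = w2 * p2.x + w3 * p3.x
                 + wt2 * L * (p3.x - p1.x) + wt3 * L * (p4.x - p2.x);
        out.y[j] = w2 * p2.y + w3 * p3.y
                 + wt2 * L * (p3.y - p1.y) + wt3 * L * (p4.y - p2.y);
    }
    return out;
}
```

The first output sample lands on p2 and the last on p3, so consecutive segments join up at the original ink points.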
Auto-vectorizer (super simplified view)
[Diagram: registers xmm0 and xmm1 each hold four floats, A[0..3] and B[0..3]. A single “addps xmm1, xmm0” adds all four pairs at once and leaves A[0]+B[0] … A[3]+B[3] in xmm1.]
// scalar loop
for (i = 0; i < 1000; i++)
{
    C[i] = A[i] + B[i];
}
// vectorized view (slice pseudo-notation)
for (i = 0; i < 1000; i += 4)
{
    C[i:i+3] = A[i:i+3] + B[i:i+3];
}
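The slice notation above is not real C++. A rough sketch of the same idea in plain C++, assuming float data and a hypothetical function name: the main loop handles four elements per iteration (the part a compiler can map to one SIMD add), and a scalar tail loop handles whatever is left when the length is not a multiple of four. The compiler does this transformation for you; writing it by hand here is only meant to show the shape of the generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise sum, written in blocks of four.
std::vector<float> add_in_blocks_of_four(const std::vector<float>& a,
                                         const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);

    std::size_t i = 0;
    // Main loop: four independent additions per iteration.
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    // Tail loop: the remaining 0 to 3 elements, one at a time.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```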
Auto-vectorizer: info from the compiler
When does the auto-vectorizer kick in? On the command line:
• /Qvec-report:1 will report the vectorized loops
• /Qvec-report:2 will report both vectorized and non-vectorized loops, and the reason why some loops were not vectorized
• Refer to the Vectorizer and Parallelizer Messages in MSDN
From the build output, with /Qvec-report:1:
ink_renderer.cpp(1092) : info C5001: loop vectorized
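#include <vector>
void test1()
{
    // all three vectors have the same size, so b[i] and c[i] stay in bounds
    std::vector<int> a(10000), b(10000), c(10000);
    for (int i = 0; i < a.size(); ++i) // a.size() is re-evaluated every iteration
    {
        a[i] = b[i] + c[i];
    }
}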
Auto-vectorizer: it’s not always easy
info C5002: loop not vectorized due to reason '501'
(reason 501: the induction variable is not local, or the upper bound is not loop-invariant)
#include <vector>
void test1()
{
    std::vector<int> a(10000), b(10000), c(10000);
    // read the size once, so the upper bound is loop-invariant
    for (int i = 0, iMax = static_cast<int>(a.size()); i < iMax; ++i)
    {
        a[i] = b[i] + c[i];
    }
}
Auto-vectorizer: it’s not always easy
info C5001: loop vectorized
• For the ink-smoothing algorithm, we got a 30% speed-up
• For the first part of the page curling algorithm, we got a 175% speed-up
• Auto-vectorizer can analyze very complex loops
• Always measure with a profiler to understand which loops you need to speed up
• Leverage the Vectorizer and Parallelizer Messages guide for help
The compiler will analyze the loop and emit the right code
Auto-vectorizer at work in Austin
Page curling and C++ AMP
// pseudo-code
for each triangle
{
    Position vertex1Pos = triangle.vertex1.position;
    Position vertex2Pos = triangle.vertex2.position;
    Position vertex3Pos = triangle.vertex3.position;

    Normal triangleNormal = cross(vertex2Pos - vertex1Pos, vertex3Pos - vertex1Pos);
    triangleNormal.normalize();

    triangle.vertex1.normal += triangleNormal;
    triangle.vertex2.normal += triangleNormal;
    triangle.vertex3.normal += triangleNormal;
}
Page curling: calculating normals
Lots of triangles: we have less than 15 ms to “turn a page” in real time; we need to parallelize this algorithm
C++ AMP is a good candidate, since the data size is pretty large
We’re looping over each triangle
Computing triangleNormal is safe, because it works on a single triangle at a time: no races
But the += updates touch vertices that are shared between triangles -> race!
This algorithm only works on a single thread
Page curling: split the loop to make it parallelizable
[Diagram: the single loop “for each triangle: calculate triangle normals, calculate vertex normals” is split in two. First “for each triangle: calculate triangle normals” (cache triangle normals), then “for each vertex: calculate vertex normals”.]
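The split can be sketched in standard C++ before bringing in the GPU. Everything below (Vec3, Triangle, vertex_normals, the adjacency table) is assumed for illustration and is not the Austin code. In each pass, every iteration writes only to its own output slot, which is what makes the passes safe to run in parallel. The adjacency table is built serially here; on a regular grid such as a page, the neighboring triangles could instead be derived from the vertex index.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

Vec3 normalized(Vec3 v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (len == 0.0f) return v;  // leave degenerate vectors untouched
    return {v.x / len, v.y / len, v.z / len};
}

using Triangle = std::array<std::size_t, 3>;  // three vertex indices

std::vector<Vec3> vertex_normals(const std::vector<Vec3>& positions,
                                 const std::vector<Triangle>& triangles) {
    // Pass 1: for each triangle, compute and cache its normal.
    // Iteration t writes only triNormals[t], so there is no race.
    std::vector<Vec3> triNormals(triangles.size());
    for (std::size_t t = 0; t < triangles.size(); ++t) {
        const Triangle& tri = triangles[t];
        assert(tri[0] < positions.size() && tri[1] < positions.size() &&
               tri[2] < positions.size());
        const Vec3 p0 = positions[tri[0]];
        triNormals[t] = normalized(cross(positions[tri[1]] - p0,
                                         positions[tri[2]] - p0));
    }

    // Serial preparation: which triangles touch each vertex.
    std::vector<std::vector<std::size_t>> trianglesOfVertex(positions.size());
    for (std::size_t t = 0; t < triangles.size(); ++t)
        for (std::size_t v : triangles[t])
            trianglesOfVertex[v].push_back(t);

    // Pass 2: for each vertex, gather the cached triangle normals.
    // Iteration v writes only normals[v], so again there is no race.
    std::vector<Vec3> normals(positions.size());
    for (std::size_t v = 0; v < positions.size(); ++v) {
        Vec3 sum{0.0f, 0.0f, 0.0f};
        for (std::size_t t : trianglesOfVertex[v]) sum = sum + triNormals[t];
        normals[v] = normalized(sum);
    }
    return normals;
}
```

The key change from the single-loop version is that vertices gather from triangles instead of triangles scattering into shared vertices.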
First, loop for each triangle…
c::array<b::float32, 2> tempTriangleNormals(3, (int)triangleCount());
parallel_for_each(extent<1>(triangleCount()),
    [=, &tempTriangleNormals](index<1> idx) restrict(amp) // arrays are captured by reference
{
    // simplified: "triangle" stands for the triangle at idx
    Position vertex1Pos = triangle.vertex1.position;
    Position vertex2Pos = triangle.vertex2.position;
    Position vertex3Pos = triangle.vertex3.position;

    Normal triangleNormal = cross(vertex2Pos - vertex1Pos, vertex3Pos - vertex1Pos);
    triangleNormal.normalize();

    tempTriangleNormals[idx] = triangleNormal;
});
We use C++ AMP
Same as before, we calculate the normals for each triangle
We collect the normals into a temporary array, which stays in GPU memory
…then, loop for each vertex
parallel_for_each(extent<2>(vertexCountY, vertexCountX),
    [=](index<2> idx) restrict(amp)
{
    Normal vertexNormal = vertexNormalView(idx);

    // go find the normals from nearby triangles
    vertexNormal += sumTriangleNormals(idx);

    vertexNormal.normalize();

    vertexNormalView(idx) = vertexNormal;
});
We go over each vertex, so no races
In sumTriangleNormals, we fetch the normals from tempTriangleNormals, i.e., the temporary array we kept in GPU memory
• Running this algorithm on the GPU yields between 3x and 7x speed-ups
• CPU is now free to execute other code
• Even when DirectX 11 capable GPU hardware is not present, C++ AMP will fall back to WARP, which leverages multi-core and SSE2
Massive Parallelism with GPU and WARP
Page curling: C++ AMP at work
Key takeaways
• Use modern C++: RAII, r-value references, lambdas, const, Standard C++ Libraries, Boost, other 3rd party libraries, etc.
• DirectX for fast and powerful graphics
• XAML UI for standard UI elements
• C++/CX to talk to Windows, to other components and to other languages (e.g., JS)
• Auto-vectorizer and PPL to distribute work on the CPU
• C++ AMP to leverage the GPU's massively parallel compute power
• C++ Rocks! Go write great apps!!
Resources
• Tue/5:45/B92 Odyssey: Connecting C++ Apps to the Cloud via Casablanca
• Wed/11:15/B92 Odyssey: It’s all about performance: Using Visual C++ 2012 to make the best use of your hardware
• Wed/1:45/B92 Stinger: DirectX Graphics Development with Visual Studio 2012
• Wed/5:15/B33 Cascade: Diving deep into C++/CX and WinRT
• Thu/5:15/B92 Nexus/Normandy: Building a Windows Store app using XAML and C++ - Photo app, the hilo project
• Fri/12:45/B33 McKinley: The Future of C++
Related Sessions
• vcblog
• Project Austin Part 1 of 6: Introduction
• Project Austin on CodePlex
• Auto-Vectorizer in Visual Studio 2012
• C++ AMP in a nutshell
• Parallel Patterns Library (PPL)
Resources
Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Appendix
• The line must be continuous, as must its first and second derivatives
• We approximate with the “cardinal” spline solution
• With auto-vectorizer, we get a nice 30% speed-up
Ink smoothing: the math
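For reference, the formula below is read directly off the ink-smoothing loop shown earlier in the deck; the symbols x_1 … x_4 (for p1x … p4x) and N (for numPoints) are notation assumed here. Between the two middle points, the smoothed x coordinate is a cubic in t whose end tangents are L(x_3 - x_1) and L(x_4 - x_2); the y coordinate uses the same weights, and the pressure is interpolated linearly.

```latex
\begin{aligned}
t_j  &= \frac{j}{N-1}, \qquad j = 0, \dots, N-1 \\
x(t) &= (2t^3 - 3t^2 + 1)\,x_2 + (-2t^3 + 3t^2)\,x_3
        + (t^3 - 2t^2 + t)\,L\,(x_3 - x_1) + (t^3 - t^2)\,L\,(x_4 - x_2) \\
p(t) &= (1 - t)\,p_2 + t\,p_3
\end{aligned}
```

Setting t = 0 gives x = x_2 and t = 1 gives x = x_3, so each segment starts and ends exactly on the original ink points.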
• Brilliant paper by Hong et al., Turning Pages of 3D Electronic Books
• Turning a page of a physical book can be simulated as deforming a page around a cone
• Each “page” in Austin is made of a bunch of triangles
• In C++, we apply the page turning algorithm to all triangles
• The auto-vectorizer comes to the rescue again with a sweet 1.7x speed-up
Page curling: how do we turn the page
• Vertex normals are typically calculated as the normalized average of the surface normals of all triangles containing the vertex
• Using this approach, computing the vertex normals on the CPU simply involves iterating over all triangles depicting the page surface and accumulating the triangle normals in the normals of the respective vertices
• To me, the above screams “massively parallel”
Page curling: vertex normals and shading
// first calculate the triangle normals
c::array<b::float32, 2> triangleNormals(3, (int)triangleCount());
c::parallel_for_each(c::extent<1>(triangleCount()),
    [=, &triangleNormals](c::index<1> idx) restrict(amp)
{
    b::float32 v1PosX = vertexPositionArray(0, indexArray(2, idx[0])[0]);
    b::float32 v1PosY = vertexPositionArray(1, indexArray(2, idx[0])[0]);
    b::float32 v1PosZ = vertexPositionArray(2, indexArray(2, idx[0])[0]);
    b::float32 v2PosX = vertexPositionArray(0, indexArray(1, idx[0])[0]);
    b::float32 v2PosY = vertexPositionArray(1, indexArray(1, idx[0])[0]);
    b::float32 v2PosZ = vertexPositionArray(2, indexArray(1, idx[0])[0]);
    b::float32 v3PosX = vertexPositionArray(0, indexArray(0, idx[0])[0]);
    b::float32 v3PosY = vertexPositionArray(1, indexArray(0, idx[0])[0]);
    b::float32 v3PosZ = vertexPositionArray(2, indexArray(0, idx[0])[0]);

    b::float32 x1 = v2PosX - v1PosX;
    b::float32 y1 = v2PosY - v1PosY;
    b::float32 z1 = v2PosZ - v1PosZ;
    b::float32 x2 = v3PosX - v1PosX;
    b::float32 y2 = v3PosY - v1PosY;
    b::float32 z2 = v3PosZ - v1PosZ;

    // cross them
    b::float32 x3 = y1 * z2 - z1 * y2;
    b::float32 y3 = z1 * x2 - x1 * z2;
    b::float32 z3 = x1 * y2 - y1 * x2;
    NORMALIZE(x3, y3, z3);

    triangleNormals(0, idx[0]) = x3;
    triangleNormals(1, idx[0]) = y3;
    triangleNormals(2, idx[0]) = z3;
});
Page curling: C++ AMP