welcome [tc18.tableau.com] · operationalizing and scaling tableau prep. american culture. what is...

Post on 05-Oct-2020


Welcome

Tableau Prep: Below Decks

#TC18

Doug Thomae

Staff Software Engineer

Tableau

Goal

That you understand enough to start investigating the operation of Tableau Prep on your own, if you want to.

Agenda

American Culture

Executing A Flow (Batch)

Measuring Culture

Interacting With A Flow

Operationalizing and Scaling Tableau Prep

American Culture

What is Culture

The social behavior and norms found in human societies

It tells us how we should behave and relate to other people and other cultures

A “programmed” lens that affects how we interpret events in our environment

What’s dangerous?

What’s beneficial?

How do we decide?

Culture has a large influence on mass behavior…but influences individuals in the culture to varying degrees

American “Nations”

USA consists of 11 regional cultures

Based on history. Outlined in dialect maps, genetic study

Complex mapping from culture to politics, religion and issues

Similar ideas that came before

Wilbur Zelinsky, “Doctrine of First Effective Settlement”

Kevin Phillips, Emerging Republican Majority, 1969

Joel Garreau, The Nine Nations of North America, 1981

David Hackett Fischer, Albion’s Seed, 1989

David Hackett Fischer, Champlain’s Dream, 2008

Bill Bishop with Robert Cushing, The Big Sort, 2008

…and others…

“American Nations” vs. Genome Map

Han, Carbonetto, Curtis, Wang, Granka, Byrnes, Noto, Kermany, Myres, Barber, Rand, Song, Roman, Battat, Elyashiv, Guturu, Hong, Chahine, Ball, “Clustering of 770,000 genomes reveals post-colonial population structure of North America”, Nature Communications 8, Article number 14238, 07 February 2017

Executing Batch Flows

The Beginning of the Beginning: Filtered Map

Tableau Prep Desktop is Web Client/Server

Front End: Electron (embedded Chrome) hosting the TP Front End (TypeScript, React, Redux)

HTTPS between Front End and Back End

Back End: TP Back End (Spring + collection of Java services)

AQL from the Back End into the C++ Stack

C++ Stack: Tableau Query Pipeline + Connector Platform

PostgreSQL Server (“Customer Database”)

Tableau Prep Back End Services

Service Name: Purpose

Cache Analysis Service: Caching of data for interactive operation

Connection Service: Presentation models for connection dialogs. Enables sharing of presentation models/dialogs with rest of Tableau products

File Service: Storing and saving .tfl/.tflx documents. Probably should be named “document service”

Flow Executor Service: Entry point for compiling and initiating flow runs

Flow Operation Service: Manages binning and brushing during interactive operation

Function Def Service: Retrieves Tableau function definitions from C++ stack. Enables sharing of functions/formulas with rest of Tableau products

Tableau Prep Back End Services

Service Name: Purpose

Desktop Integration Service: Returns information about installed Tableau Desktop products

Licensing Service: License validation and activation

Versioning Service: Document versioning (in the “documents from different releases” sense of version)

LoomDoc Validator Service: Analyzes/validates LoomDoc objects

Node Validator Service: Validates single nodes by doing a front end compile and returning errors (or not)

MRU Flow Service: Persists/retrieves the most recently used documents list

Tableau Prep Back End Services

Service Name: Purpose

Status Service: Tracks/returns the status of flows that have been initiated by the Flow Executor Service.

Telemetry Service: Gathers/sends telemetry (if the user has chosen that option).

What is a Tableau Prep Flow?

Answer 1: It’s the graph displayed in the top pane

Answer 2: It’s a specification defined in loom-lang

“loom-lang” is the language that captures flow definitions

Its only current textual form is JSON

Answer 3: It’s a set of specifications for queries

Each node in a flow is a specification for a query (e.g. for a SQL database or in Hyper)

When federation is involved, it may be multiple queries

Same Flow Graph, One Level Down

[Figure: the flow graph one level down, labeled Input Node, Container Node, Output Node, and Filter off HI]

Flow Document/Loom-Lang

{
  "nodes": {
    <see next page>
  },
  "connections": {
    "53bcf9c0-59a8-4f42-bf28-daf4be6b144c": {
      "connectionType": ".v1.SqlConnection",
      "isPackaged": false,
      "name": "dthomae2.tsi.lan",
      "connectionAttributes": {
        "server": "dthomae2.tsi.lan",
        "dbname": "tc18",
        "port": "5432",
        "class": "postgres"
      }
    }
  }
}

Connection id, unique within a flow

Standard fields for all connections

connectionAttributes differ by connection class

Flow Document/Loom-Lang, continued

{
  "nodes": {
    "074e9fd5-e4a5-4217-80b2-2caa214f02bf": {
      "nodeType": ".v1.LoadSql",
      "name": "county_to_nation_map",
      "id": "074e9fd5-e4a5-4217-80b2-2caa214f02bf",
      "baseType": "input",
      "nextNodes": [
        {
          "namespace": "Default",
          "nextNodeId": "a77e4d8e-387d-4ccd-be23-75b487896686",
          "nextNamespace": "Default"
        }
      ],
      <node type specific fields>
    }
  },
  "connections": {…}
}

Node id, unique within a flow

Standard fields for all nodes

074e9fd5-e4a5-4217-80b2-2caa214f02bf

Flow Document/Loom-Lang, continued

{
  "nodes": {
    "074e9fd5-e4a5-4217-80b2-2caa214f02bf": {
      "nodeType": ".v1.LoadSql",
      "baseType": "input",
      "nextNodes": [{"nextNodeId": "a77…", "nextNamespace": "Default"}]
    },
    "a77e4d8e-387d-4ccd-be23-75b487896686": {
      "nodeType": ".v1.Container",
      "baseType": "container",
      "nextNodes": [{"nextNodeId": "132…", "nextNamespace": "Default"}],
      "loomContainer": {
        "nodes": {
          "120daf25-3ae2-4f11-b83d-b5c87651edfd": {
            "nodeType": ".v1.RangeFilter",
            "baseType": "transform",
            "nextNodes": []
          }
        }
      }
    }
  },
  …

Loom-Lang, continued

All nodes have:

A type, which has a version component and a type name

A name

An id which is unique within the flow

A base type, one of input, output, transform, container, and supernode (another type of container)

…followed by node type specific fields

Every node input and output exists in a namespace:

Namespaces are how Tableau Prep keeps duplicate column names straight

Single input/single output nodes use the “Default” namespace

Join nodes have an incoming “Left” and “Right” namespace

General multi-input nodes (e.g. Unions) generate guids as namespaces
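The node fields listed above are enough to follow a flow from its inputs to its outputs. The sketch below is only an illustration: the document is a trimmed, hypothetical flow shaped like the slide examples, the short node ids and the `.v1.ExampleOutput` type are made up, and `walk_flow` is not part of Tableau Prep.

```python
import json
from collections import deque

# Hypothetical, trimmed flow document shaped like the examples above.
# The ids and the ".v1.ExampleOutput" node type are invented for illustration.
FLOW_JSON = """
{
  "nodes": {
    "n-input": {
      "nodeType": ".v1.LoadSql", "name": "county_to_nation_map",
      "id": "n-input", "baseType": "input",
      "nextNodes": [{"namespace": "Default", "nextNodeId": "n-clean",
                     "nextNamespace": "Default"}]
    },
    "n-clean": {
      "nodeType": ".v1.Container", "name": "Filter off HI",
      "id": "n-clean", "baseType": "container",
      "nextNodes": [{"namespace": "Default", "nextNodeId": "n-out",
                     "nextNamespace": "Default"}]
    },
    "n-out": {
      "nodeType": ".v1.ExampleOutput", "name": "Output",
      "id": "n-out", "baseType": "output",
      "nextNodes": []
    }
  },
  "connections": {}
}
"""


def walk_flow(flow):
    """Return (name, baseType) pairs, breadth-first from the input nodes."""
    nodes = flow["nodes"]
    queue = deque(nid for nid, node in nodes.items() if node["baseType"] == "input")
    seen = set(queue)
    order = []
    while queue:
        node = nodes[queue.popleft()]
        order.append((node["name"], node["baseType"]))
        # Each nextNodes entry names the downstream node and the namespace it enters.
        for edge in node.get("nextNodes", []):
            target = edge["nextNodeId"]
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


ORDER = walk_flow(json.loads(FLOW_JSON))
for name, base_type in ORDER:
    print(f"{base_type:>10}  {name}")
```

The traversal ignores the nodes nested inside a container's `loomContainer`; descending into them would be a straightforward recursive extension.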

Compilation and Queries

Flow Executor Service

Loom engine

Front end compiler

Pre-compilation

Build node and type info

Back end compiler

Build execution plan

Create node logical and physical models

AQLRunner

Query pipeline

Connector platform

PostgreSQL database

Database-agnostic “logical” query

Database-dependent SQL query

Error info

Tableau data platform

Logical Query for Our Flow

<logical-query>
  <selects>
    <field>[stcou]</field>
    ...other fields
  </selects>
  <projectOp class="logical-operator">
    <expressions>
      <binding name="[stcou]">
        <identifierExp identifier="[stcou]" class="logical-expression"/>
      </binding>
      …other fields
  <projectOp>
  <selectOp class="logical-operator">
    <predicate>
      <funcallExp function="!" shape="scalar" class="logical-expression">
        <funcallExp function="&amp;&amp;" shape="scalar" class="logical-expression">
          <funcallExp function="==" shape="scalar" class="logical-expression">
            <identifierExp identifier="[state]" class="logical-expression"/>
            <literalExp value="&quot;HI&quot;" datatype="string" class="logical-expression"/>
          </funcallExp>
          <funcallExp function="!" shape="scalar" class="logical-expression">
            <funcallExp function="ISNULL" shape="scalar" class="logical-expression">
              <identifierExp identifier="[state]" class="logical-expression"/>
            </funcallExp>
          </funcallExp>
        </funcallExp>
      </funcallExp>
    </predicate>
    …table and field name information
</logical-query>

PostgreSQL Query

SELECT "e1b673e1-afa9-47aa-bbac-0b12dc"."stcou" AS "stcou",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."county" AS "county",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."state" AS "state",
  "e1b673e1-afa9-47aa-bbac-0b12dc"."nation" AS "nation"
FROM "public"."county_to_nation_map" "e1b673e1-afa9-47aa-bbac-0b12dc"
WHERE (NOT (("e1b673e1-afa9-47aa-bbac-0b12dc"."state" = 'HI') AND (NOT ("e1b673e1-afa9-47aa-bbac-0b12dc"."state" IS NULL))))
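One observable effect of the generated predicate is that rows with a NULL state survive the filter, which a plain `state <> 'HI'` would drop under SQL's three-valued logic. The sketch below checks this with Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table contents are made up, and nothing here is taken from Tableau's own code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE county_to_nation_map (stcou INTEGER, state TEXT)")
# Made-up rows: one ordinary state, one HI row, one row with a NULL state.
conn.executemany(
    "INSERT INTO county_to_nation_map VALUES (?, ?)",
    [(1001, "AL"), (15001, "HI"), (99999, None)],
)

# Same predicate shape as the generated query above.
generated = """
    SELECT t.stcou FROM county_to_nation_map t
    WHERE (NOT ((t.state = 'HI') AND (NOT (t.state IS NULL))))
    ORDER BY t.stcou
"""
# A hand-written filter that looks equivalent but is not.
naive = """
    SELECT t.stcou FROM county_to_nation_map t
    WHERE t.state <> 'HI'
    ORDER BY t.stcou
"""

kept_generated = [row[0] for row in conn.execute(generated)]
kept_naive = [row[0] for row in conn.execute(naive)]

print(kept_generated)  # [1001, 99999]  (the NULL-state row is kept)
print(kept_naive)      # [1001]         (the NULL-state row is dropped)
```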

We Use Hyper Under the Covers a Lot

Local Files (e.g. .csv, .xls) are put into Hyper:

Connector creates a table in Hyper and transfers data into it

Queries are then generated for Hyper, just as if it was any other database

Hyper is used for federation

Federation brings together data in one place to do cross database joins

Hyper is the place where the data is brought together

Federation for Tableau Prep is exactly the same as it is for other Tableau products

Handling Local Files, continued

Same data comes from .csv instead of PostgreSQL. Hyper sees:

1) At ingestion time a table is created and data copied into it

CREATE TABLE "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU" BIGINT, "County" TEXT COLLATE "en_US", "State" TEXT COLLATE "en_US", "Nation" TEXT COLLATE "en_US")

COPY "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU", "County", "State", "Nation") FROM STDIN WITH (FORMAT HYPERBINARY, SANITIZE)

2) Later, when the query happens

SELECT "-1384900078"."STCOU" AS "STCOU",
  "-1384900078"."County" AS "County",
  "-1384900078"."State" AS "State",
  "-1384900078"."Nation" AS "Nation"
FROM "TableauTemp"."CountyToNationMapUSA#csv" "-1384900078"
WHERE (NOT (("-1384900078"."State" = 'HI') AND (NOT ("-1384900078"."State" IS NULL))))
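The two steps above (load the file into a temporary table, then query it like any other database) can be imitated in a few lines. This is a rough sketch only: SQLite stands in for Hyper, the CSV rows are made up, and the `ingest` helper is hypothetical rather than anything the connector actually does.

```python
import csv
import io
import sqlite3

# Made-up CSV content with the same column names as the slide's example.
CSV_TEXT = (
    "STCOU,County,State,Nation\n"
    "1001,Example County A,AL,Example Nation 1\n"
    "15001,Example County B,HI,Example Nation 2\n"
)


def ingest(conn, table, csv_text):
    """Step 1: create a temp table from the CSV header and copy the rows in."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    # Naive identifier quoting; fine for this fixed example only.
    conn.execute(f'CREATE TEMP TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)
    return len(data)


conn = sqlite3.connect(":memory:")
LOADED = ingest(conn, "CountyToNationMapUSA#csv", CSV_TEXT)

# Step 2: the flow's filter runs as an ordinary query against the temp table.
QUERY = """
    SELECT t."STCOU", t."State"
    FROM "CountyToNationMapUSA#csv" t
    WHERE (NOT ((t."State" = 'HI') AND (NOT (t."State" IS NULL))))
"""
ROWS = conn.execute(QUERY).fetchall()
print(LOADED, ROWS)
```

The columns are created without types here, so values stay as text; the real ingestion shown above assigns types such as BIGINT and TEXT.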

Measuring Culture

Hofstede’s Cultural Dimensions

Geert Hofstede devised a set of dimensions that can be used to compare cultures:

Power Distance—degree of acceptance of unequal power

Individualism vs. collectivism—degree of integration into groups

Uncertainty avoidance—a society’s tolerance for things outside the status quo

Masculinity vs. femininity—degree of preference for achievement, heroism, assertiveness

Long-term orientation—degree to which a society is able/willing to adapt or change

Indulgence vs. restraint—degree to which behavior is controlled by social norms

Hofstede’s Dimensions Can Be Measured

High power distance:

Greater income inequality

Smaller middle class

Dictatorships or oligarchies

Violence in national politics

Political systems changed by revolution

Business executives older

Innovations only when supported by hierarchy

Low power distance:

Smaller income inequality

Larger middle class

Separation of powers

Peaceful political conflict resolution

Political systems changed by evolution

Business executives younger

Spontaneous innovations

Note that these are measures that allow cultures to be compared, not absolute indices

Gini Coefficient

The Gini coefficient measures income inequality: 0 means everyone has the same income, and values near 1 mean a few people hold almost all of it.
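For readers who have not met the measure before, here is a minimal sketch of the standard Gini calculation over a list of incomes. The function name and the sample incomes are made up for illustration; the ACS figures used later in the session are published already computed, so this is not how the flow obtains them.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes.

    Uses the sorted-rank form of the definition:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    where x_1 <= ... <= x_n and i runs from 1 to n.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one income and a positive total")
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))    # perfectly equal incomes
print(gini([0, 0, 0, 10]))   # one person holds everything
```

With four people the most unequal case gives 0.75 rather than 1, because the upper bound for a finite population is (n - 1) / n.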

Interacting With Flows

Prepping ACS Gini Data

Binning

Binning produces the vertical “bar chart” of values in the profile pane:

The Flow Operation service is called to generate the binned values

It uses the Flow Executor service to generate the actual queries

Binning uses various “bin strategies” to decide how to do the actual binning:

A bin strategy decides how to select the values/ranges that will be shown to the user

When the user clicks on the node a bin strategy is chosen based on the type of the column

The final binning operation is a count of something, although continuous ranges need to be partitioned first

Walking Through the Gini Flow (Binning)

The Hyper View

SELECT COUNT(1) AS Measure,

t0.Dimension AS Dimension

FROM (

SELECT hyper.GEO.id AS GEO.id,

hyper.HD02_VD01 AS HD02_VD01,

hyper.GEO.display-label AS GEO.display-label,

hyper.HD01_VD01 AS HD01_VD01,

hyper.File Paths AS File Paths,

hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,

hyper.GEO.id2 AS GEO.id2,

hyper.HD02_VD01 AS Dimension

FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper

LIMIT 1048576

) t0

WHERE ((NOT (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL))

AND (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 > 0)

AND ((t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL)

OR (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 <= 8868)))

GROUP BY 2

ORDER BY Dimension ASC NULLS FIRST

Hyper uses PostgreSQL conventions; this is the same as COUNT(*)

TP issues one query per column; the column aliased “AS Dimension” changes in each one

This has to do with paging data to the UI

Null is always sorted to the top

Interactive ops will always limit to default 1M rows
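The "one query per column" pattern can be sketched as follows. This is an illustration under stated assumptions, not Tableau Prep's implementation: SQLite stands in for Hyper, the table name `acs_gini` and its rows are made up, and `binning_query` is a hypothetical helper that keeps only the count, the row limit, and the grouping from the query above.

```python
import sqlite3

ROW_LIMIT = 1048576  # same cap as the LIMIT in the query above


def binning_query(table, column):
    """Build a count-per-value query for one column (naive identifier quoting)."""
    return (
        f'SELECT COUNT(1) AS "Measure", t0."Dimension" AS "Dimension" '
        f'FROM (SELECT "{column}" AS "Dimension" FROM "{table}" LIMIT {ROW_LIMIT}) t0 '
        f'GROUP BY t0."Dimension" '
        # SQLite already sorts NULL first in ascending order.
        f'ORDER BY t0."Dimension" ASC'
    )


def bin_all_columns(conn, table, columns):
    """Issue one binning query per column, as the profile pane does."""
    return {col: conn.execute(binning_query(table, col)).fetchall() for col in columns}


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "acs_gini" ("GEO.id2" TEXT, "HD01_VD01" REAL)')
# Made-up rows; one Gini value is missing.
conn.executemany(
    'INSERT INTO "acs_gini" VALUES (?, ?)',
    [("01", 0.47), ("02", 0.41), ("03", None), ("04", 0.47)],
)

BINS = bin_all_columns(conn, "acs_gini", ["GEO.id2", "HD01_VD01"])
for column, bins in BINS.items():
    print(column, bins)
```

Each result row is a (count, value) pair, with the NULL bin sorted to the top as in the slide's note.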

Brushing

Brushing is binning with a condition:

The user picks the condition by clicking on a value

Brushing and binning are both performed by “analyzers”. Binners are a special case analyzer.

There are null values for the Gini coefficient values. Where do they come from?

Walking Through the Gini Flow (Brushing)

The Hyper View

The query generated when Gini Coefficient null value was selected

SELECT COUNT(1) AS Measure,

t0.Dimension AS Dimension

FROM (

SELECT hyper.GEO.id AS GEO.id,

hyper.HD02_VD01 AS HD02_VD01,

hyper.GEO.display-label AS GEO.display-label,

hyper.HD01_VD01 AS HD01_VD01,

hyper.File Paths AS File Paths,

hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,

hyper.GEO.id2 AS GEO.id2,

hyper.HD02_VD01 AS Dimension

FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper

LIMIT 1048576

) t0

WHERE (t0.HD01_VD01 IS NULL)

GROUP BY 2

ORDER BY Dimension ASC NULLS FIRST

It’s the binning query with a condition added

It’s still the old name: the column name is unchanged at the database level, and TP knows that

The Hyper View

After the exclusion of Geography is added to the recipe the binning query looks like:

SELECT t0.GEO.id AS GEO.id,

t0.File Paths AS File Paths,

t0.GEO.id2 AS GEO.id2,

t0.Gini Coefficient Error AS Gini Coefficient Error,

t0.Gini Coefficient AS Gini Coefficient,

t0.GEO.display-label AS GEO.display-label,

t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82

FROM (

SELECT hyper.GEO.id AS GEO.id,

hyper.HD02_VD01 AS HD02_VD01,

hyper.GEO.display-label AS GEO.display-label,

hyper.HD01_VD01 AS HD01_VD01,

hyper.File Paths AS File Paths,

hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,

hyper.GEO.id2 AS GEO.id2,

hyper.HD02_VD01 AS Gini Coefficient Error,

hyper.HD01_VD01 AS Gini Coefficient

FROM Extract.tmp-e20YjwL8onYfAWC4NZP4GxsweOTq18Hj4q0mczFO7bI=-Default hyper

LIMIT 1048576

) t0

WHERE (NOT ((t0.GEO.display-label = 'Geography') AND (NOT (t0.GEO.display-label IS NULL))))

LIMIT 1000

The query handles the column rename

Conditions from the recipe are included in future binning and brushing queries

Log File Locations

The “Logs” directory under “My Tableau Prep Repository”

hyperd.log – the operations (including queries) as seen by Hyper

log.txt – the operations (including queries) that are sent by Tableau Prep

Download Tableau Log Viewer!

https://github.com/tableau/tableau-log-viewer

Exercise For the Interested:

What query is sent to get “what’s in/what’s out” data in a join node?

Hint: Search for “FULL” in hyperd.log in Tableau Log Viewer or text editor
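If you would rather script the search than use a viewer, a plain text scan is enough to get started. The sketch below makes no assumption about the log's line format; the `Documents` location in the usage section is an assumption about where the repository lives on your machine, and `find_lines` is a made-up helper.

```python
from pathlib import Path


def find_lines(log_path, needle="FULL"):
    """Yield (line_number, text) for every line of the log that contains needle."""
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Assumed location; adjust to wherever "My Tableau Prep Repository" is.
    log = Path.home() / "Documents" / "My Tableau Prep Repository" / "Logs" / "hyperd.log"
    if log.exists():
        for number, text in find_lines(log):
            print(f"{number}: {text[:200]}")
    else:
        print(f"No log found at {log}")
```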

Getting Started With the Exercise

Walking Through the Gini Flow (Join)

How Do Gini Coefficients Compare?

Operationalizing and Scaling Tableau Prep

Tableau Prep Conductor – V1

Add Tableau Prep Capabilities to Tableau Server:

Use the same Flow Executor Service used by Tableau Prep Desktop

Use the same Versioning Service used by Tableau Prep Desktop

Add:

Flow Orchestrator Service – to set up connections needed by Flow Executor Service

Flow Publishing Service – API to publish flows

Flow Service – API for UI to retrieve flow inputs/outputs, decrypt credentials and other functions

Extend/Use Existing Tableau Server Mechanisms:

Job type to schedule flows

Backgrounder support for running flows

Secure credential storage

Enforcement of permissions

Extension of content types (e.g. data sources, workbooks, flows)

Extension of administrative views

Tableau Prep Conductor – Post V1

Enable Web Authoring:

Port most remaining services

Port existing web UI

Server version of Hyper caching

Improve Scheduling and Resource Management:

Part of larger data platform efforts

Trigger flow runs when inputs are updated

Scaling to Larger Datasets

Output To Database:

Tableau Prep currently outputs to local files or data sources (hyper, csv)

For large datasets move computation to data:

- Generate a query using existing mechanisms

- Wrap it in an upsert, send to database. Tableau systems never handle data at all in batch runs.

Augmenting Data Warehouse/Lake With Local Data:

Don’t pull down the big dataset to federate with the local data

Push the smaller, local data to a temp table to work with larger dataset

Incremental Update and Query:

Large Data Warehouse/Lake datasets are built one hourly/daily/weekly/etc. update at a time

Parameterized Data Pulls

Incremental Upserts

Finishing Up

What Should I Remember?

Tableau Prep flows are specifications that get turned into queries

Tableau Prep Desktop is actually a client/server system

Tableau Prep is built on top of the Tableau data platform

Tableau Prep is architected to scale…although many of the mechanisms aren’t built out yet

Tableau Prep | Below Decks

SESSION REPEATS

Tue 10/23 | 2:15 – 3:15 | MCCNO – L2 - 297

Wed 10/24 | 12:00 – 1:00 | MCCNO – L2 - 263

Preparing Your Data the Tableau Prep Way

RELATED SESSIONS

Thu 10/25 | 12:30 – 1:30 | MCCNO – L3 - 388

How Aggregate Friends and Influence Pivots

Wed 10/24 | 3:30 – 4:30 | MCCNO – L2 – New Orleans Theater A

Please complete the session survey from the Session Details screen in your TC18 app

Thank you!

#TC18

Douglas Thomae

dthomae@tableau.com

.tfl/.tflx Files

Both are always in zip format:

Some older output files from Tableau Alpha were JSON files

The x on .tflx files is a hint that they contain data files, but has no other significance

You can open them up with a standard zip utility.

They’re not encrypted and will never contain secrets like passwords

The content is segmented by zip stream:

maestroMetadata – other stream names, “document versions”

displaySettings – data or config that affects the way things are displayed (e.g. column order)

flow – the flow definition in loom-lang

data files (streams named using a guid to avoid name collision)
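Because the files are ordinary zip archives, Python's standard `zipfile` module can read them. The sketch below assumes the flow definition is stored in an entry named exactly `flow` as UTF-8 JSON, which matches the stream list above but is not otherwise confirmed; the sample archive is built in memory with made-up contents, and `read_flow` is a hypothetical helper.

```python
import io
import json
import zipfile


def read_flow(tfl_file):
    """Return (entry names, parsed flow definition) from a .tfl/.tflx archive."""
    with zipfile.ZipFile(tfl_file) as archive:
        names = archive.namelist()
        # Assumption: the loom-lang definition lives in an entry called "flow".
        flow = json.loads(archive.read("flow").decode("utf-8"))
    return names, flow


def make_sample_tfl():
    """Build a tiny stand-in archive in memory; its contents are made up."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("maestroMetadata", "{}")
        archive.writestr("displaySettings", "{}")
        archive.writestr("flow", json.dumps({"nodes": {}, "connections": {}}))
    buffer.seek(0)
    return buffer


NAMES, FLOW = read_flow(make_sample_tfl())
print(NAMES)
print(sorted(FLOW))
```

To inspect a real document, pass its path instead of the in-memory sample, for example `read_flow("my_flow.tfl")`, where the file name is a placeholder.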
