Performance tuning dataset refresh in Power BI
Chris Webb
Power BI Customer Advisory Team
Microsoft
Agenda
• Gathering requirements for data refresh in Power BI
• Choosing a storage mode
• Import refresh tuning methodology
• Measuring refresh performance
• Data modelling
• Tuning your data source
• Tuning the Power Query engine
• Tuning the Analysis Services engine
• Refresh in the Power BI Service
Why is refresh performance important?
• Your reports are ready for your users to view faster
• You can refresh more frequently during the day if you need to
• Dataset development is easier
• If something goes wrong with your data you can fix and reload faster
• Slow refresh of one dataset may impact:
  • Refresh performance of other datasets
  • Report performance
• But how fast is fast enough?
How often do you want your data to refresh?
I want real-time data!
Requirements for data refresh
• Don’t ask what your users want, ask what they need
• Questions:
  • When is your source data ready to use?
  • How often does your source data change?
  • What time do you need your data by?
  • How many times do you need to refresh in a day? What is the business need?
  • What if you unexpectedly need to refresh (e.g. to fix data problems)?
  • How important is keeping data up-to-date versus report performance?
Choosing a storage mode
• Import – fastest query performance but data must be refreshed
• Push – data is pushed into a dataset; many limitations
• DirectQuery – no need to refresh but query performance is slower
  • Composite models allow you to mix DirectQuery and Import tables
  • Aggregations are pre-aggregated tables that improve query performance
  • Use automatic page refresh to make sure your report always shows the latest data
• Use Import unless you have a good reason not to!
What happens during refresh?
[Diagram: data sources feed several Power Query queries inside Power BI, which load the dataset held in the Analysis Services engine]
Import refresh tuning methodology
• Steps:
  • Model your data properly
  • Remove all data that isn’t needed for your reports/analysis
  • Tune your data source
  • Tune your Power Query queries
  • Tune the Analysis Services engine inside Power BI
• You need to check:
  • Performance of a single refresh while developing
  • Actual performance of dataset refresh in production
Measuring overall refresh performance
• SQL Server Profiler is the best tool for measuring refresh performance
  • Connect to Power BI Desktop via DAX Studio
  • Connect to Power BI Premium capacities via XMLA endpoint
    • Not possible to connect to Power BI Shared capacity
  • Displays all activity in the Analysis Services engine
  • Look for Process command and Duration column
• Power BI Service refresh history also has overall refresh times
• Refresh summary page (and API) shows refresh times for datasets in Premium
• Power BI Capacity Metrics app shows refresh times for Premium
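The refresh times listed above are easier to compare once they are turned into durations. A minimal Python sketch, assuming refresh history records that carry `startTime`, `endTime` and `status` fields in the style of the refresh history returned by the Power BI REST API; the field names are an assumption here and the sample records are invented for illustration only:

```python
from datetime import datetime

def parse_utc(ts):
    # The sample timestamps end in "Z"; swap it for an explicit UTC offset
    # so that datetime.fromisoformat accepts them on older Python versions.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def refresh_durations(history):
    """Return (start time string, duration in seconds) for each completed refresh."""
    results = []
    for record in history:
        # Skip failed or still-running refreshes: they have no meaningful duration.
        if record.get("status") != "Completed" or not record.get("endTime"):
            continue
        elapsed = parse_utc(record["endTime"]) - parse_utc(record["startTime"])
        results.append((record["startTime"], elapsed.total_seconds()))
    return results

# Invented sample data, not real refresh history.
sample_history = [
    {"startTime": "2024-01-10T06:00:05Z", "endTime": "2024-01-10T06:12:35Z", "status": "Completed"},
    {"startTime": "2024-01-11T06:00:10Z", "endTime": "2024-01-11T06:03:10Z", "status": "Failed"},
    {"startTime": "2024-01-12T06:01:00Z", "endTime": "2024-01-12T06:16:00Z", "status": "Completed"},
]

durations = refresh_durations(sample_history)
slowest = max(durations, key=lambda item: item[1])
print(durations)
print("Slowest refresh:", slowest)
```

Tracking these durations over several days shows whether production refresh is getting slower, which a single test refresh in Desktop cannot tell you.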
Data modelling and refresh performance
• Good data modelling is important for many reasons – data refresh performance is only one of them
• Good modelling may make refresh performance slower, but will make report query performance faster
• Basic rule: always build a star schema!
• Common problems:
  • Tables with lots of columns
    • Do you need to unpivot measures?
    • Do some of your fact table columns actually belong on a dimension table?
    • Are you even going to use all of these columns?
  • One big table instead of fact tables and dimension tables
  • Use of expensive data types, e.g. Double instead of Currency
Only load the data you need
• The more data you load, the slower refresh will be
• So:
  • Remove any columns you don’t need
  • Filter out any rows you don’t need
  • Think about applying a limit on history, e.g. only loading one year of data
• Do this as soon as possible, ideally before the data even reaches Power BI
• It’s easier to add data back if you need it than remove data from a dataset in production
• Deployment pipelines (in Premium) can be used to limit the amount of data you work with in a development environment
Data source type and refresh performance
• How quickly can your data source send data to Power BI?
• Some tips:
  • Relational databases perform better than files
  • CSV files will perform better than JSON, XML and especially Excel
  • Files stored in SharePoint may be slow to load compared to local files
  • Web services may also be slow
• Consider loading your data into a fast data source before loading it into Power BI
Tuning your data source
• If your data source is a relational database, tune the SQL queries that are run when refresh takes place
  • Tools like SQL Server Profiler can be used to see what queries are run
• Other useful tools:
  • Fiddler for viewing requests made to web services
  • Process Monitor for viewing reads from text files
  • Power Query Query Diagnostics
Data source location
• Network latency between your data source and Power BI can affect refresh performance
  • If you’re using an on-premises data gateway, think about the location of the gateway machine
• Power BI Premium allows you to locate different capacities in different Azure Regions
Power Query engine performance
• Power Query performance can vary depending on where Power Query queries are run:
  • Power BI Desktop – when you are developing
  • Power BI Service – if you’re only connecting to cloud data sources
  • On-premises data gateway – if any of your data sources are on-prem, all traffic has to go through a gateway
• Performance will depend on:
  • Hardware of the machine where queries are run
  • Configuration settings and properties
  • Efficiency of the queries themselves
Power Query in Power BI Desktop
• Measure performance of Power Query queries in Desktop using:
  • SQL Server Profiler
  • Power Query Query Diagnostics
• Settings to improve performance in Power BI Desktop:
  • Disable queries that you don’t need to load into the dataset
  • Turn off “Allow data preview to download in the background”
  • Turn off data privacy checks – but only if you know what this means!
  • Experiment with “Enable parallel loading of tables”
  • Use Table.View to stop multiple reads
  • Turn off “Include in report refresh” if a query doesn’t need to be refreshed
Query folding
• Query folding refers to the way the Power Query engine can push transformations back to the data source
• Almost always results in much better performance
• Only possible with some data sources: relational databases, Analysis Services, OData feeds, some others
• Only possible for some transformations
  • Different data sources support folding for different transformations
  • Some transformations stop other folding happening
  • Writing your own SQL queries also prevents folding
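The benefit of pushing work back to the source can be illustrated outside Power Query. The following is a minimal Python sketch, not Power Query itself: it uses an in-memory SQLite database as a stand-in for a relational data source, and the `sales` table and its columns are hypothetical. The first function pulls every row and then filters and sums locally (what happens when folding does not take place); the second sends the filter and aggregation to the source, as a folded query would.

```python
import sqlite3

# In-memory SQLite database standing in for a relational data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2019 + i % 5, float(i)) for i in range(1000)],
)

def total_without_folding(conn, year):
    # Every row crosses the wire; the filter and sum run in the client.
    rows = conn.execute("SELECT year, amount FROM sales").fetchall()
    kept = [amount for (row_year, amount) in rows if row_year == year]
    return len(rows), sum(kept)

def total_with_folding(conn, year):
    # The filter and sum run in the source; only one row comes back.
    rows = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE year = ?", (year,)
    ).fetchall()
    return len(rows), rows[0][0]

unfolded_rows, unfolded_total = total_without_folding(conn, 2023)
folded_rows, folded_total = total_with_folding(conn, 2023)
print("rows transferred without folding:", unfolded_rows)
print("rows transferred with folding:", folded_rows)
print("totals match:", unfolded_total == folded_total)
```

Both approaches give the same answer, but the second moves far less data, which is why a folded query usually refreshes faster.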
Tuning the Power Query engine
• If query folding is not taking place, then the Power Query engine does the transformations in your queries
• Some transformations such as sort, merge, pivot/unpivot require all data to be loaded into RAM
  • A query is limited to using 256MB RAM, so paging may take place
• Some transformations force multiple reads from a data source
  • Using Table.Buffer may help – but may also cause paging
Tuning the on-premises data gateway
• If you are using an on-premises data gateway to load data, your Power Query queries will be executed on the gateway machine
• Tips:
  • Locate the gateway machine close to the data source
  • Make sure the gateway server has enough CPU and memory
  • Clustered gateways allow for the load to be spread across multiple servers
  • Turn on performance logging and use the Power BI template report to analyse it
Using dataflows to improve performance
• Dataflows let you share the output of a Power Query query between multiple datasets
  • Do complex transformations once instead of inside multiple datasets
  • Do transformations when the data for one query is ready, no need to wait until all data needed by the dataset is ready
  • Data privacy checks are off by default -> better performance
• In a Premium capacity:
  • Enhanced compute engine improves performance by loading data into SQL
  • Container Size property = more RAM for the Power Query engine
  • Computed entities allow you to stage data from slow data sources
[Diagram: without a dataflow, Dataset A and Dataset B in Power BI each run their own query against the same data source table; with a dataflow, a single dataflow entity reads the table and both datasets query the entity]
Tuning the Analysis Services engine
• SQL Server Profiler displays a lot of detail about what happens during refresh in the Analysis Services engine
• Official support for Tabular Editor within Desktop will allow changing more properties:
  • IsAvailableInMDX – controls whether hierarchies are built on columns (only relevant for clients that query using MDX such as Excel)
  • EncodingHint – suggests which type of encoding to use for a column
Calculated columns and calculated tables
• Calculated columns and calculated tables are evaluated during refresh
  • So the more you have, the slower refresh will be
• Can you replace a calculated column with a measure?
  • Strange but true: this may also help query performance
• Can you replace a calculated table with a Power Query query or a table in your data source?
• Loading data into hidden tables and then using DAX to transform it is usually a bad thing
• BUT certain calculations will be much quicker in DAX
Incremental refresh
• Incremental refresh lets you refresh only the data that is new or has changed
  • Less data to load -> faster refresh
• Works by creating and managing partitions within the table
• Now available in Power BI Shared as well as Premium
• Designed for use with data warehouses built on relational databases
• Can be adapted for use with other data sources such as:
  • Web services
  • Folders containing multiple files
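The partition idea behind incremental refresh can be sketched in a few lines. This is a simplified Python model, not how Power BI actually manages partitions (the real implementation also handles partition merging and other details): it assumes one partition per month, and the function and parameter names are hypothetical. Given how many months to store and how many recent months to refresh, it returns all partitions and the subset that would be reloaded.

```python
from datetime import date

def month_partitions(today, months_to_store, months_to_refresh):
    """Return (all partition labels, labels to reload), oldest first."""
    if not 1 <= months_to_refresh <= months_to_store:
        raise ValueError("months_to_refresh must be between 1 and months_to_store")
    labels = []
    year, month = today.year, today.month
    # Walk backwards from the current month, one partition per month.
    for _ in range(months_to_store):
        labels.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    labels.reverse()
    # Only the most recent partitions are reloaded; older ones are left alone.
    return labels, labels[-months_to_refresh:]

all_parts, to_refresh = month_partitions(
    date(2024, 2, 15), months_to_store=12, months_to_refresh=3
)
print("partitions stored:", all_parts)
print("partitions reloaded on refresh:", to_refresh)
```

In this example only 3 of the 12 monthly partitions are reloaded, which is where the saving in refresh time comes from.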
Refresh in the Power BI Service
• Refresh in the Power BI Service only starts when resources are available
• Therefore, refresh does not always start at the scheduled time
• Refresh may be slower in the Service because:
  • You have a very fast development PC
  • It takes longer to load data into the cloud than into Power BI Desktop
• Refresh may run faster on Premium because:
  • More resources = more parallelism, but only on a P2+
  • More likely to start on time – assuming your capacity isn’t overloaded