Microsoft Power BI is a very popular platform for IT professionals, managers, and analysts to build and consume reports with interactive visuals using a no-code / low-code approach. When working with Power BI you have several different tools available for transforming data throughout various stages of your data pipeline.
This article covers considerations of using dataflows which are part of the Microsoft Power BI platform and discuss when and why we should leverage these dataflows in the first place. The article focuses on the challenges we have faced in transforming data stored in Data Lake Gen2 storage.
Before we dive in, I want to mention that a user must have Power BI Pro or Power BI premium account to utilise these mammoths.
Complex datasets impacting data refresh performance.
Power Query is a fundamental part of Power BI's data transformation capabilities however during development it can be very resource-intensive when performing complex transformations on large transactional tables with millions of records. As datasets evolve during the typical development process, any changes (changing a data type, transforming a column, etc) could trigger a refresh of the dataset. It is here that complex merge operations may take hours to re-apply the transformations and can consume most of the memory of your development machine, bringing productivity to its knees! We can leverage dataflows here and let it perform the heavy lifting before data even hits our dataset. We can simply define these same transformations in dataflows instead of Power Query desktop and set it for scheduled refresh to load data.
Power BI Dataflows became generally available in PowerBI.com in mid-2019. The purpose of Dataflows is to provide self-service big data preparation in the report development process. Those who are familiar with Power Query in Excel or Power BI Desktop would find that UI is very similar to Power Query editor. So, is it just the cloud version of Power Query? Well, No; It is a much more powerful and useful tool. However, dataflows and Power Query editor share a similar user interface, dataflows are completely different in terms of features they provide. Let's dive in to find out what we can get from them!
What's the difference?
Dataflows serve the purpose of Big Data preparation and can be both created and processed in the Power BI service without any impact on performance on your development machine. It also supports a time-based incremental refresh which tracks the changes in data sources and only bring changes instead of refreshing whole data.
The dataflows store these transformed data in data lake storage and uses common data model schema. Dataflows provides reusability of transformations done by data engineers. In the organizations where multiple data models are used across different departments, data engineers can define these transformations once in dataflows and then can reuse them in different data models (datasets). Then, each dataset can contain a data model to deliver a set of reports.
Another advantage of using dataflows is: they run in the cloud! It still may take some time to refresh based on the size of the data but we can further optimise the performance of these dataflows by leveraging the power of enhanced compute engines which is a preview feature currently. This enhanced compute engine first stores this data in Azure SQL cache and then applies these transformations in optimised SQL engine. This makes dataflows efficient and scalable enough to lift complex transformations. Enabling enhanced compute engine doesn't change the way user define transformations, it would be still the power query interface. Power query intelligently folds the query to compute engine to let it do heavy transformations.
Is it worth the effort?
We can model our data in a dataset so why go through the trouble of adding an additional layer by utilising dataflows? It is because dataflows provide reusability of these transformations along with efficient computing.
In a nutshell, dataflows provides a central storage of transformed data across organisations which can be leveraged by many departments. It is a recommended to use when data lake is the source of the data. An incremental scheduled refresh would be beneficial to keep them with up-to-date data. Defining transformations in data flow is easy using Power Query online that is available in Power BI service which can be accessible using browser on any device.