This article is based on exciting information just released at the Microsoft Build conference on 23 May 2023.
What we have today
When Synapse Analytics was created, the technical sessions inspired me with some comparisons and explanations, and I reproduced them in my own technical sessions and writing.
Synapse was created at the request of many Microsoft customers. They required the ability to use a single tool for the entire data intelligence platform: collect data, store it, process it, query it, apply data science and create reports.
Synapse is a real Swiss army knife: we can ingest data using Synapse Data Factory; query and process data using different methods, such as the Serverless SQL Pool or the Dedicated SQL Pool; and apply data science using Spark Pools and additional ML frameworks. Finally, Synapse is also connected to Power BI, so we can use some shortcuts to create visualizations.
These unique features have always been great, far better than the isolated tools we had before. But in light of Microsoft Fabric, we can see the missing points in Synapse:
- The integration of the various tools was limited. It was the best at the time, but the integration was limited compared to Microsoft Fabric.
- We still have to choose between different infrastructure resources like serverless SQL pool and dedicated SQL pool instead of sharing all data.
- We still need to make infrastructure decisions, specifically the size of the dedicated SQL pool. Decisions were often largely based on guesswork.
- It does not completely isolate storage and processing. When you use a dedicated SQL pool, processing and storage are linked.
Synapse was considered so advanced that few people noticed these problems, and even those who did rarely noticed all of them. Microsoft Fabric, the new product announced during Build, exposes these gaps and more.
What is Microsoft Fabric?
Like Synapse, Microsoft Fabric brings together all the services required for a data intelligence environment, but it is highly integrated and built in a way that requires far less technical effort to use.
The image below shows the services included in Microsoft Fabric
In the following sections, I will introduce these new concepts.
Data Intelligence and Software as a Service (SaaS)
Microsoft Fabric arrives breaking standards and establishing new ones. In the cloud environment, we are used to classifying services into Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and SaaS. Synapse is classified as PaaS, while Microsoft Fabric is officially classified as SaaS. The following diagram shows, in general terms, which areas are managed for you at each service level.
Undoubtedly, the level of managed services provided by Microsoft Fabric is far above Synapse. Many tasks in Synapse would require careful configuration, but in Microsoft Fabric there is a kind of automatic setting that works out of the box.
Usually, when we think of a SaaS service, we think of an end-user application like Office 365 or many other applications where the user simply uses them. It is a concept not typically associated with software used to collect, transform, model and generate intelligent results from data.
This is Microsoft Fabric, software that pushes the boundaries of what we know about cloud software and services.
Microsoft Fabric is not in the Azure environment; it is in the Power BI portal. This results in a very different environment from what we have in Synapse.
But the new environment looks like something we know well: the Power BI portal. The environment is designed around different experiences: you choose an experience according to the type of task you want to perform, and the environment adapts to the usual tasks associated with that experience.
The following experiences are available:
- Power BI: The typical Power BI environment and tools.
- Data Factory: This persona allows you to create and manage dataflows and data pipelines, as in Data Factory.
- Data Activator: This is a brand new feature that allows you to create triggers on your visuals in Power BI.
- Data Engineering: This experience includes several tasks. It is responsible for creating and managing lakehouses, but also allows you to create notebooks and orchestrate them with pipelines.
- Data Science: With this experience, you can apply Azure ML techniques to your data.
- Data Warehouse: This experience allows you to model your data as you would in a SQL database and use SQL over your data. It's hard to compare this to anything else. We can create many star schema models across our data lake, and these models will be reused by our Power BI datasets, making it easier to have a central model for all our reports.
- Real-Time Analytics: This persona is somewhat comparable to Power BI Streaming Dataflows and allows for real-time data ingestion.
Switching between ingestion (Data Factory), processing (Data Engineering), modeling and SQL (Data Warehouse) and more is just a matter of choosing the right experience to work with the same data sets.
Switching personas is a way to focus the environment on the type of activity you want to do. The objects themselves are still created in a Power BI workspace.
In addition, the main new objects, the lakehouse and the data warehouse, have their own way of switching work between the two.
Microsoft Fabric and OneLake
OneLake is the core feature of Microsoft Fabric. It exposes a data lake as a service, so we can build our data lake without having to deploy it first. It is the central data store for all data in Microsoft Fabric, and it is provisioned for the tenant the first time a Microsoft Fabric artifact is created.
The name OneLake also fits very well with the shortcut feature in OneLake: we can create shortcuts to external files and access them directly as if they were in our own lake.
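To make the shortcut idea concrete, here is a toy Python model. It is not the OneLake API: the `Lake` class, its methods and all paths are hypothetical. It only illustrates the point made above, that a shortcut stores a reference to external data rather than a copy, so a read through the shortcut sees whatever the external location currently holds.

```python
class Lake:
    """Toy in-memory stand-in for a data lake (hypothetical, not the OneLake API)."""

    def __init__(self, name):
        self.name = name
        self.files = {}      # path -> content physically stored in this lake
        self.shortcuts = {}  # local path -> (target lake, path inside the target)

    def write(self, path, content):
        self.files[path] = content

    def add_shortcut(self, local_path, target_lake, target_path):
        # Only a reference is stored; no data is copied into this lake.
        self.shortcuts[local_path] = (target_lake, target_path)

    def read(self, path):
        # A shortcut is resolved at read time against the target lake.
        if path in self.shortcuts:
            target_lake, target_path = self.shortcuts[path]
            return target_lake.read(target_path)
        return self.files[path]


# Hypothetical external storage holding a raw file.
external = Lake("external-storage")
external.write("sales/2023.csv", "id,amount\n1,100\n")

# Our own lake only holds a shortcut to that file.
one_lake = Lake("onelake")
one_lake.add_shortcut("Files/sales_2023.csv", external, "sales/2023.csv")
first_read = one_lake.read("Files/sales_2023.csv")

# A change at the source is visible through the shortcut without a copy step.
external.write("sales/2023.csv", "id,amount\n1,100\n2,250\n")
second_read = one_lake.read("Files/sales_2023.csv")
```

In this sketch the second read returns the updated content even though nothing was ever written into `one_lake.files`, which is the behaviour the shortcut description implies.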
The image below shows how OneLake relates to the other Microsoft Fabric features.
OneLake, Lakehouse and Workspaces
The lakehouse is one of the core objects that we can create in Microsoft Fabric. We create the lakehouse with the Data Engineering persona, and the lakehouse is contained in a workspace, the same kind we know as a Power BI workspace.
Once we have created a lakehouse, we can use Data Factory to load data into the Files area or the Tables area.
The Files area is the unmanaged area of the lake and accepts any type of file. That is where we place the raw files for further processing. The Tables area, however, only contains data in Delta format.
The lakehouse optimizes the Tables area with a special structure that can make a regular Delta table up to 10x faster, while maintaining full compliance with the Delta format.
However, the lakehouse is not the largest data structure we have. That position is reserved for OneLake. It is an invisible, automatically provisioned data store that contains all the data for data warehouses, lakehouses, datasets and more.
This allows us to build an enterprise architecture that uses workspaces as department lakes. Lakehouse shortcuts make it possible to share data across departments. This ensures domain ownership of the data and, at the same time, a relationship between domains.
This is just the starting point for an enterprise architecture: OneLake ensures consistent control and management of the data. Data lineage, privacy, certification, catalog integration and more are unified features that OneLake brings to every lakehouse created in an organization.
All these features are managed by the Power BI environment, which gives the company an enterprise-wide governance environment.
OneLake and processing isolation
When you use Synapse, the Synapse Dedicated Pool stores and processes the data. This is a scenario where storage and processing are linked.
In OneLake, storage and processing are independent of each other. The same data in OneLake can be processed using many different methods, which ensures independence in storage and processing.
Let's analyze the different methods we have to process the data in OneLake.
Every workspace enabled for Microsoft Fabric has a feature called Live Pool. The Live Pool allows notebooks to run without configuring a Spark cluster beforehand.
As soon as the first block of code runs in a notebook, the Live Spark Pool starts up in seconds and performs the execution.
We can process OneLake data with Spark notebooks, with the benefit of Live Pools.
Data Factory objects such as pipelines and dataflows in the Power BI environment mark the beginning of a unification of ETL tools: we have pipelines and dataflows from Data Factory, and dataflows from Power BI.
These two are now united and work together under Microsoft Fabric. We have an additional advantage: Dataflows Gen2.
Dataflows Gen2 is a step up from the Power BI dataflows or wrangling dataflows we are used to. One of the most interesting features, in my opinion, is the ability to define the destination of a transformation, something we were never able to do before in Power BI dataflows (or wrangling dataflows).
Microsoft Fabric provides two different methods of accessing data using SQL, as if the data were in a regular database.
One of the methods is to use the Lakehouse object. This object provides us with a SQL endpoint that allows us to model the tables and query the data using SQL.
The second method uses a data warehouse object that provides a complete SQL processing environment for the data in OneLake.
The table below shows all the differences between the Lakehouse SQL Endpoint and the data warehouse. Some of these differences are available in the documentation. Some of these are my personal conclusions.
Microsoft Fabric offering
SQL MPP – Polaris
To the vertical
Vertipaq for tables
Open data format - Delta
| | Data warehouse | Lakehouse SQL endpoint |
|---|---|---|
| Capabilities | Complete data warehouse with transactional support in T-SQL | Read-only, system-generated SQL endpoint for the lakehouse, for T-SQL querying and serving. Only supports queries and views over lakehouse Delta tables |
| Recommended use case | | |
| T-SQL support | Full DQL, DML and DDL T-SQL support. Full transaction support | Full DQL, no DML, limited DDL T-SQL support such as SQL views and TVFs |
| Data loading | SQL, pipelines, dataflows | Spark, pipelines, dataflows, shortcuts |
| Delta table support | Reads and writes Delta tables | Reads Delta tables |
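The T-SQL part of this comparison can be read as a simple rule set. The sketch below is a toy check in Python, built only from the comparison above: the names (`classify`, `is_allowed`, the endpoint labels) are hypothetical, the keyword matching is deliberately naive, and "limited DDL" is approximated as "may create views and functions". It is not how Fabric validates statements.

```python
DQL = {"SELECT"}
DML = {"INSERT", "UPDATE", "DELETE", "MERGE"}
DDL = {"CREATE", "ALTER", "DROP"}


def classify(statement):
    """Classify a T-SQL statement by its first keyword (naive on purpose)."""
    words = statement.split()
    if not words:
        return "OTHER"
    keyword = words[0].upper()
    if keyword in DQL:
        return "DQL"
    if keyword in DML:
        return "DML"
    if keyword in DDL:
        return "DDL"
    return "OTHER"


def is_allowed(endpoint, statement):
    """Return True if the statement kind fits the endpoint, per the comparison."""
    kind = classify(statement)
    if endpoint == "warehouse":
        # Full DQL, DML and DDL support.
        return kind in {"DQL", "DML", "DDL"}
    if endpoint == "lakehouse_sql_endpoint":
        if kind == "DQL":
            return True
        if kind == "DDL":
            # Assumption: "limited DDL" means creating views and functions only.
            words = statement.upper().split()
            return len(words) > 1 and words[0] == "CREATE" and words[1] in {"VIEW", "FUNCTION"}
        return False  # no DML on the read-only endpoint
    raise ValueError(f"unknown endpoint: {endpoint}")


insert_on_warehouse = is_allowed("warehouse", "INSERT INTO sales VALUES (1, 100)")
insert_on_lakehouse = is_allowed("lakehouse_sql_endpoint", "INSERT INTO sales VALUES (1, 100)")
view_on_lakehouse = is_allowed("lakehouse_sql_endpoint", "CREATE VIEW v_sales AS SELECT * FROM sales")
table_on_lakehouse = is_allowed("lakehouse_sql_endpoint", "CREATE TABLE t (id INT)")
```

With these rules, the `INSERT` is accepted only on the data warehouse, while the lakehouse SQL endpoint accepts the `CREATE VIEW` but rejects the `CREATE TABLE`.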
Microsoft Fabric is closely tied to the Power BI environment. At many points in our work, whether from a lakehouse or a data warehouse, we can start creating a Power BI report.
The access method is the best part: Power BI has a new method to access OneLake, called Direct Lake.
Direct Lake is a new connection method between Power BI datasets and OneLake.
If we use DirectQuery, every refresh sends a new query to the source, which makes the connection slower. On the other hand, if we use Import, the data is stored in memory and performance is better, but when the data is updated, a refresh of the dataset is required. Updates are not immediately visible in datasets and reports.
The Direct Lake connection combines the best of both scenarios: it offers the performance of Import mode by keeping the data in memory, and the real-time data updates of DirectQuery.
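The trade-off between the three modes can be illustrated with a toy Python simulation. Everything in it is hypothetical (`Source`, `ImportDataset`, `DirectQueryDataset`, `DirectLakeDataset`). In particular, the Direct Lake class simply reloads its in-memory copy when it sees a new version of the source; that is an assumption made for illustration, not a description of how Power BI implements the feature.

```python
class Source:
    """Toy data source with a version counter that grows on every write."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.version = 1
        self.reads = 0  # number of full reads served by the source

    def append(self, row):
        self.rows.append(row)
        self.version += 1

    def scan(self):
        self.reads += 1
        return list(self.rows)


class ImportDataset:
    """Keeps a snapshot in memory; only an explicit refresh updates it."""

    def __init__(self, source):
        self.source = source
        self.cache = source.scan()

    def refresh(self):
        self.cache = self.source.scan()

    def total(self):
        return sum(self.cache)


class DirectQueryDataset:
    """Goes back to the source for every query."""

    def __init__(self, source):
        self.source = source

    def total(self):
        return sum(self.source.scan())


class DirectLakeDataset:
    """Answers from memory, reloading only when the source version changed."""

    def __init__(self, source):
        self.source = source
        self.cache = None
        self.cached_version = None

    def total(self):
        if self.cached_version != self.source.version:
            self.cache = self.source.scan()
            self.cached_version = self.source.version
        return sum(self.cache)


source = Source([10, 20])
imported = ImportDataset(source)
direct_query = DirectQueryDataset(source)
direct_lake = DirectLakeDataset(source)

source.append(30)  # new data arrives after the import snapshot was taken

import_total = imported.total()            # still based on the old snapshot
direct_query_total = direct_query.total()  # always current, always hits the source
direct_lake_total = direct_lake.total()    # current, loaded into memory once

reads_before = source.reads
direct_lake.total()                        # served from memory, no new source read
reads_after = source.reads
```

In this model the Import dataset keeps returning the stale total until `refresh()` is called, DirectQuery is always current but reads the source on every query, and the Direct Lake stand-in is current while answering repeated queries from memory.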
What about Azure Synapse and Data Factory?
Customers using Data Factory and the Synapse Dedicated Pool can also expect easy ways to migrate to Microsoft Fabric. Microsoft is focused on making the transition as smooth as possible.
Data Factory users even benefit from Dataflows Gen2, which Azure Data Factory does not support. So you benefit from developing dataflows and pipelines using Microsoft Fabric and have an easy migration path from one to the other.
Microsoft Fabric seems to be the beginning of a new era. In an era of OpenAI/ChatGPT and Copilot, we get an extremely powerful tool that makes complex data solutions available to all companies, and we can imagine a Copilot in Microsoft Fabric in the future.