About a year ago I wrote a blog on workspace topology. Since then, I have spoken to a number of customers and engineers at various meetups on capacities and the challenges they can present.
From those conversations, I've come to realise that a follow-on is needed to help demystify capacity management. For those who haven't read the previous blog, I clearly called out that workspaces shouldn't be created without planning.
In this blog, I'm going to explain why this is the case and how you should go about managing workspace creation.
The dangers of creating workspaces without planning
Whilst Fabric makes it really easy to stand up a new workspace on an existing capacity through the UI, there are a number of reasons why we shouldn't be doing this.
Financial management
To understand the dangers of workspace creation, we need to first review the ways that Fabric items can be billed:
- Capacity units (CUs) purchased via SKU, with burst and smooth functionality enabled (the default). This is the approach we're all used to with Fabric: purchase a SKU via the Azure portal to get a number of CUs. When these are exhausted, Microsoft bursts the capacity to give you more power than you pay for, then uses a burn-down rate to claw back the cost once your job completes.
- CUs purchased via SKU, with burst and smooth disabled at workspace level. This was added as part of the August 2025 release.
- Spark autoscale billing. This is now GA, and means we can switch Spark billing from the SKU approach to traditional consumption-based pricing.
It's the first of those options that presents us with the issue.
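To make the burst-and-smooth mechanics concrete, here is a toy model of how overage accumulates and burns down. This is a deliberately simplified sketch: the real smoothing windows and burn-down rules are more nuanced, and the CU figures are illustrative only.

```python
# Toy model of burst and smooth: a job may burst above the purchased CU
# rate, with the excess recorded as overage; idle headroom later burns
# the overage back down. Numbers are illustrative, not real Fabric maths.

def run_interval(purchased_cu: float, used_cu: float, overage: float) -> float:
    """Return the new overage balance after one interval."""
    if used_cu > purchased_cu:
        # Bursting: the excess is carried forward as overage.
        overage += used_cu - purchased_cu
    else:
        # Spare capacity burns the overage down.
        overage = max(0.0, overage - (purchased_cu - used_cu))
    return overage

# A weekend-long runaway job bursting to double a 64 CU capacity:
overage = 0.0
for _ in range(48):  # 48 intervals at full tilt
    overage = run_interval(64, 128, overage)
print(overage)       # 48 * 64 = 3072.0 CU of overage to repay
```

The point of the sketch is the asymmetry: overage builds the whole time the job runs, and only starts to shrink once utilisation drops below what you've purchased.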
Imagine the scenario: you have your production Lakehouse(s) and a data science sandpit on the same capacity (sorry, data scientists - it's this type of workload that has caused issues each time I've seen this happen).
Your data science team are developing a new model and kick off a process in dev at 5:30pm on a Friday. Come Monday, they get into the office to see that the boost process has kicked in and the capacity has been running at full utilisation the entire weekend. Because the shared capacity is at 100% utilisation, your production workloads have all been impacted and business-critical reporting isn't available.
Now you have your CFO, CEO, CMO and CTO all shouting that they need their reports for the weekly board meeting. You go to look at the capacity metrics app and see that everything is being throttled and you have a massive overage that needs to burn down before you get the capacity back again.
The only way out of this situation at this point is to pause the capacity. But stop before you do that.
What a lot of customers I've spoken to don't realise is that hitting pause means an instant bill from Microsoft on top of your SKU price. The second the capacity is paused, Microsoft takes the outstanding overage and charges your Azure subscription to pay it back. That might not be much if you've caught it quickly or have a small capacity, but for others it could be days and days of CU costs.
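The arithmetic of that pause bill is simple but sobering. The rate below is a placeholder, not a real Fabric price - check the pay-as-you-go rate for your SKU and region before doing this sum for real.

```python
# Rough sketch of the bill raised when a capacity is paused with
# outstanding overage. CU_HOUR_RATE_USD is a placeholder, not a real price.

CU_HOUR_RATE_USD = 0.18  # hypothetical per-CU-hour rate for illustration

def bill_on_pause(overage_cu_hours: float, rate: float = CU_HOUR_RATE_USD) -> float:
    """The outstanding overage is invoiced immediately, on top of the SKU cost."""
    return round(overage_cu_hours * rate, 2)

# Two days of unnoticed overage on a 64 CU capacity:
print(bill_on_pause(48 * 64))  # 3072 CU-hours -> 552.96 at the placeholder rate
```

Even at a modest per-CU rate, a weekend of runaway overage turns into a bill your CFO will notice.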
Now, not only do you have the effort of re-running all the processes that were missed, you've also got to go to your CFO to explain why they'll see an unexpected bill from Microsoft. It's really a position you want to avoid.
Security and governance
If you suddenly drop a new workspace into the platform, do you know whether the security and governance controls applied across the rest of the estate have been applied there too? I suspect not.
Suddenly you have a gap in your protocols, meaning you can't guarantee that data isn't being exposed to unauthorised users of the platform, or that your KPIs will match existing analytical output.
Shadow IT estate starts to grow
We've all seen situations where one department or another has gone off and purchased a shadow platform - or business-critical reports are living on individual laptops/dev platforms.
Suddenly you've taken on technical debt that will cost real money to repay - and you'll probably be asked to pay it back yesterday, putting development teams under pressure.
The right way to do it
To avoid these traps, there are a number of areas we need to address across technology, people, and process.
Technology
Capacity provisioning best practice
The guidance I've had from Microsoft teams is that, as a minimum, 3 capacities should form the basis of any Fabric deployment:
- Shared capacity for dev/build/test/pre-prod. To keep costs down, these non-production environments can safely share a capacity. Let's face it: if these environments are down for a week, it's annoying, but it isn't the end of the world (you can always spin up a new capacity and migrate the unaffected workspaces across to get things moving).
- Capacity for back-stage production Fabric objects. This will be used for things like your pipelines, Lakehouses, Warehouses, Databases, etc. and would be left at the defaults (i.e. burst and smooth turned on).
- Capacity for front-stage production Fabric objects. This will be things like downstream sandboxes, Power BI reports, etc. On this capacity you'll turn off burst and smooth. That way, when capacity is reached, new requests will be throttled until running processes end - but critically, the capacity won't completely lock up.
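The three-capacity baseline above can be captured as a config structure that your provisioning code consumes. The names and SKU sizes here are illustrative, not recommendations:

```python
# The three-capacity topology as data: one shared non-prod capacity, one
# back-stage prod capacity (defaults on), one front-stage prod capacity
# with burst and smooth disabled. Names and SKUs are placeholders.

CAPACITIES = [
    {"name": "cap-nonprod",    "sku": "F8",  "stage": "non-production", "burst_and_smooth": True},
    {"name": "cap-prod-back",  "sku": "F64", "stage": "production",     "burst_and_smooth": True},
    {"name": "cap-prod-front", "sku": "F32", "stage": "production",     "burst_and_smooth": False},
]

def front_stage_is_protected(capacities: list) -> bool:
    """Check that at least one production capacity has burst and smooth
    disabled, so reporting workloads throttle rather than lock up."""
    return any(c["stage"] == "production" and not c["burst_and_smooth"]
               for c in capacities)

print(front_stage_is_protected(CAPACITIES))  # True
```

Keeping the topology as data like this means the "is the front stage protected?" check can run in CI, rather than relying on someone remembering the setting.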
Whilst this adds more cost to a Fabric deployment than you may have expected, it's cost that is necessary to guard against the financial risk described above.
Use infrastructure as code
Whilst Fabric is a SaaS platform that can be deployed via the GUI, that doesn't mean it is a good idea to do it that way.
With the Fabric Terraform provider, we can use infrastructure as code to define our capacity, workspaces connected to GitHub, folder and shortcut setup, Fabric gateway config, mirrored database initialisation, etc. - meaning we have the config backed up for disaster recovery, and can repeat the deployment across environments.
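Where Terraform isn't an option, the same provisioning can be driven through the Fabric REST API. The sketch below builds the request for the Create Workspace endpoint; the URL and body shape follow the published API, but verify them against the current docs, and note that the capacity ID is a placeholder.

```python
# Sketch: building a "create workspace on a capacity" request for the
# Fabric REST API. Verify endpoint and body against the current API
# reference; CAPACITY_ID is a placeholder GUID.

FABRIC_API = "https://api.fabric.microsoft.com/v1"
CAPACITY_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical

def create_workspace_request(name: str, capacity_id: str):
    """Return the (url, json_body) pair for creating a workspace bound to a capacity."""
    return (
        f"{FABRIC_API}/workspaces",
        {"displayName": name, "capacityId": capacity_id},
    )

url, body = create_workspace_request("ws-sales-prod", CAPACITY_ID)
# POST this with an Entra bearer token, e.g. requests.post(url, json=body, headers=...)
print(url, body["displayName"])
```

Separating "build the request" from "send it" keeps the logic testable without credentials, which is also how you'd unit test this in a pipeline.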
Use workspace integration with Azure DevOps (ADO) git/GitHub
Make sure you have configured your workspace to integrate with ADO/GitHub for code management. Now that all items are supported within Fabric, you not only have a copy for DR purposes with a change history, but you'll also be able to promote a version of your code through your environments.
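The git integration itself can also be configured programmatically via the Fabric Git APIs rather than clicking through the UI. The request shape below follows the documented "Git - Connect" endpoint for Azure DevOps, but treat it as a sketch and check the current API reference; the organisation, project, and repository values are placeholders.

```python
# Sketch: wiring a workspace to Azure DevOps via the Fabric "Git - Connect"
# API. Field names follow the documented request body; all values below
# are placeholders for illustration.

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def git_connect_request(workspace_id: str):
    """Return the (url, json_body) pair to connect a workspace to an ADO repo."""
    return (
        f"{FABRIC_API}/workspaces/{workspace_id}/git/connect",
        {
            "gitProviderDetails": {
                "gitProviderType": "AzureDevOps",
                "organizationName": "my-org",      # placeholder
                "projectName": "my-project",       # placeholder
                "repositoryName": "fabric-items",  # placeholder
                "branchName": "main",
                "directoryName": "/",
            }
        },
    )
```

Scripting this step means every environment's workspace points at the right branch by construction, rather than by convention.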
Use ADO / GitHub actions for CICD
With your Terraform code written, it's time to start building out your deployment pipelines. Within these, we can use our Terraform setup to deploy infrastructure changes across environments with a few clicks.
On top of that, we can use the Fabric CLI to programmatically deploy our Entra-group-based security setup - again meaning it's repeatable and backed up for DR purposes.
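If you'd rather hit the API than shell out to the CLI, the same Entra-group assignment can be sketched against the workspace role assignment endpoint. The shape follows the documented API, but verify it before use; the IDs are placeholders.

```python
# Sketch: granting an Entra security group a role on a workspace via the
# Fabric "Add Workspace Role Assignment" API. IDs are placeholders;
# verify the endpoint against the current API reference.

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def add_group_role_request(workspace_id: str, group_object_id: str, role: str):
    """Return the (url, json_body) pair to assign a role to an Entra group."""
    return (
        f"{FABRIC_API}/workspaces/{workspace_id}/roleAssignments",
        {"principal": {"id": group_object_id, "type": "Group"}, "role": role},
    )

# Example: make a hypothetical "data-engineers" group a Member of a workspace.
url, body = add_group_role_request("ws-guid", "group-guid", "Member")
```

Driving security from group assignments in code, rather than individual user grants in the UI, is what makes the setup auditable and re-deployable.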
With our foundations in-place, and connected up to the relevant git branches, we can use the Fabric APIs to automate our sync between the Git branch and our workspace.
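The sync step itself maps onto the "Git - Update From Git" API: after checking the workspace's git status, you apply the remote commit, with a policy for resolving conflicts. The body below follows the documented shape, but treat field names as assumptions to verify; the commit hash is a placeholder.

```python
# Sketch: syncing a workspace from its connected git branch via the
# Fabric "Git - Update From Git" API. Field names follow the documented
# request body; verify against the current reference before use.

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def update_from_git_request(workspace_id: str, remote_commit_hash: str):
    """Return the (url, json_body) pair to pull the given commit into the workspace."""
    return (
        f"{FABRIC_API}/workspaces/{workspace_id}/git/updateFromGit",
        {
            "remoteCommitHash": remote_commit_hash,
            "conflictResolution": {
                "conflictResolutionType": "Workspace",
                "conflictResolutionPolicy": "PreferRemote",  # git wins on conflict
            },
        },
    )
```

Preferring the remote on conflict keeps git as the source of truth, which is the behaviour you generally want in an automated pipeline.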
At this stage, you have all the foundations you need to be able to use code to get your platform up and running from a blank sheet - and recover the setup and code if you lose it all.
Purview
Lastly, you'll want to bring in Purview for cross-workspace governance, ensuring you can quickly and easily document what's on the platform and its key data flows.
Process
To facilitate all this, you'll need a process to control who can request a new workspace and how that request is logged - with SLAs on how long it takes to fulfil. All supported with technologies like ServiceNow.
You'll also need coding standards and onboarding processes to make sure those using your new workspaces build to a consistent standard.
People
Lastly, these are the skill sets you'll need to support this type of flow:
- Resource(s) familiar with Terraform
- Resource(s) familiar with ADO/GitHub
- Resource(s) familiar with the Fabric APIs
- Resource(s) familiar with Purview
- Resource(s) made accountable for running deployments
Depending on the size of your organisation, this might be a couple of individuals with this knowledge, right the way through to teams dedicated to specific parts of this setup.
It seems a big overhead, but if you want to minimise the risk of running into some of the common challenges I mentioned at the start, then this is how you do it - whilst ensuring each step is reliable and repeatable at a moment's notice.