Azure Synapse Analytics - Serverless SQL Pool

Following on from our Director of Data Science & Analytics blog on how Azure Synapse Analytics went GA, Sandeep Sohal takes a closer look at one of the new capabilities Synapse SQL.

Introduction

As you may have already noticed on the 3^rd of December 2020, Azure Synapse Analytics went GA, our Director of Data Science & Analytics Jonathan Scott has written a great overview in his blog post here.

Although there are a plethora of new capabilities that have gone to GA in Synapse we have for this blog post focused on Synapse SQL. Synapse SQL gives Azure Synapse Analytics workspace users the ability to do SQL based analytics at scale. We see that there are nuances to using this from a technical perspective, also potential for substantially changing cost model in azure for SQL workloads.

Synapse SQL has two consumption models. Dedicated SQL pools, where you can provision a SQL pool at a unit of scale, scale the service up or down and pause it during non-operational hours. The other is serverless SQL pool, where you do not need to provision a server, it auto-scales and you consume the service on a pay-per-query cost model. A workspace can have more than one dedicated SQL pool and only ever one serverless SQL pool. Which resource type you use is determined by the cost model that suits you and your computing needs.

We are going to take a closer look at one element of Synapse SQL the serverless SQL pool, why you might use it, the architecture, cost and how you use it to query data in different Azure storage sources.

Why use the serverless SQL pool?

By default, each Synapse workspace that you create comes with a serverless SQL pool endpoint. In Synapse Studio under Manage>SQL Pools you will see a pool called “Built-In”; this is your serverless SQL pool as shown below.

Due to the nature of the SQL pool being serverless, there is no infrastructure to set up or cluster to maintain, it auto-scales to meet your query needs and you can start querying data stored in Azure Storage as soon as you have created a Synapse workspace.

As you can appreciate there will be numerous use cases for the use of a serverless SQL Pool, below are sample use cases:

Explore and discover data in Azure Storage quickly and on-demand.
Create a Logical data warehouse by creating transformation views in a metadata store for up to data, directly querying from your raw ADLS Gen2 (Data Lake) storage layer.
Simple data transformation of data in Azure Storage for use in other BI systems.
Build data visualisation layers in Power BI, directly querying against Spark Tables.

Ultimately the decision to use the serverless SQL pool will come down to individual uses cases and cost vs benefit analysis. Is the optimal option serverless SQL pool or would a dedicated resource like dedicated SQL pools be a better choice for particular workloads.

Architecture

Serverless SQL pool uses a scale-out architecture to distribute processing across multiple nodes like dedicated SQL pools. And like dedicated SQL pools, compute and storage are separate from one another, this is what allows compute to scale independently of your data.

One of the key differences in architecture between the two Synapse SQL services is that dedicated SQL pools compute power is determined by the provisioned unit of scale known as data warehouse units (DWU), and queries are distributed in parallel across compute nodes using a Massively Parallel Processing (MPP) engine. A serverless SQL pool on the other hand uses a Distributed Querying Processing (DQP) engine which assigns Tasks to compute nodes to complete a query. In the diagram below a query is passed to the central Control Node that uses DQP. Four compute nodes were needed to complete the query based on the number of Tasks determined by the DQP engine. This is what allows serverless SQL pool to scale automatically to accommodate any querying requirement.

Cost

Serverless SQL pool uses a pay-per-query costing model. Cost is calculated per TB of data processed by the queries that you run. This is unlike dedicated SQL pools where you pay for a reserved resource at a pre-determined scale. Current approx. pricing for serverless SQL pool is £3.73 per TB data processed, so costs can quickly escalate if the queries processed are over very large multi TB datasets.

If the thought of end-users unwittingly running massive queries using a pay-per-query costing model scares you, there is a way to manage spend. The Cost Control feature in Synapse enables you to set budgets on the amount that can be spent on a daily, weekly, or monthly basis, and can be configured either through Synapse Studio or through the stored procedure in T-SQL.

To configure a budget In Synapse Studio you need to navigate to Manage->SQL Pools and click on the Cost Control icon for the “Built-In” serverless pool as shown below.

This will bring up the Cost Control configuration screen as shown below. Here budgets can be configured in terms of TB per day, week, or month. You also get a view of actual spend in terms of the amount of data that has been processed. This can be useful to see when budget limits might be close to being reached.

The drawback with the Cost Control feature is that limits can only be configured for an entire serverless SQL pool per workspace. This suggests that if want more granular control over costs, you may need to create multiple workspaces to manage costs per departmentteam etc.

Querying

With the serverless SQL pool, you can query data stored in ADLS Gen2 (Data Lake), Azure Blob Storage, Cosmos DB and Spark Tables using T-SQL. File formats supported include Parquet, CSV and JSON.

You are provided with a serverless SQL pool endpoint when you create a Synapse workspace and a familiar T-SQL language. Therefore any client tool that can establish a SQL TDS connection can connect to and query data using the serverless SQL pool. This includes Synapse Studio, SQL Server Management Studio, Azure Data Studio, Azure Analysis Services and Power BI.

The familiar T-SQL OPENROWSET function is provided so that you can query your primary Data Lake associated with your workspace or external Azure storage data sources. Useful features like automatic schema inference allow you to read Parquet files without having to declare column names in your query. Below is an example query that queries data from the publicly available New York Taxi dataset.

What is exciting is the ability to read Spark Tables using T-SQL syntax without the need to provision a Spark Cluster, pay for the reserved resource, wait for it to start up and then execute a SQL query. With the serverless SQL pool, you can query a Spark Table in seconds as opposed to minutes, and with connectivity to Power BI, this opens a whole new possibility of powerful data visualisations and quicker data insights.

There are limitations to the T-SQL that you can use with the serverless SQL pool. For further details on what T-SQL is supported by serverless SQL pool, please refer to the following Microsoft article here.

For more information, please do not hesitate to get in touch with our experts.

Sandeep Sohal

Sandeep is an Analytics Architect in the Data Science and Analytics team at Hitachi Solutions. In his role, Sandeep helps our customers unlock the value from their data whether that be by helping them implement good data governance or by designing and architecting modern cloud-based data solutions. Sandeep has been working in the field of Data and Analytics for over 15 years designing and developing enterprise data warehouses, modern cloud-based data platforms and big data solutions both in consultancy and in the retail and marketing sectors.

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Analytics" category .
cookielawinfo-checkbox-functional	1 year	The cookie is set by the GDPR Cookie Consent plugin to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Necessary" category .
JSESSIONID	session	The JSESSIONID cookie is used by New Relic to store a session identifier so that New Relic can monitor session counts for an application.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin to store whether or not the user has consented to the use of cookies. It does not store any personal data.
wordpress_monolith_access_gated_content	3 days	A cookie to remember gated content entries

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
AnalyticsSyncHistory	1 month	Used by LinkedIn to store information about the time a sync took place with the lms_analytics cookie
bcookie	1 year	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	1 year	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	5 months 27 days	Used by LinkedIn to store consent of guests regarding the use of cookies for non-essential purposes
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
ln_or	1 day	Used by LinkedIn to determine if Oribi analytics can be carried out on a specific domain
pll_language	1 year	The pll _language cookie is used by Polylang to remember the language selected by the user when returning to the website, and also to get the language information when not available in another way.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_UA-97336965-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
cusid	30 minutes	ClickDimensions sets this cookie to establish and continue a user session with the site.
cuvid	2 years	This cookie, set by ClickDimensions, is written to the browser upon the first visit to the site from that web browser.
cuvon	30 minutes	ClickDimensions sets this cookie to store the last time a visitor viewed a page.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.

Cookie	Duration	Description
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.

Azure Synapse Analytics – Serverless SQL Pool