Why Use Apache Spark in Azure Synapse Analytics?

25.08.2020 · 28.04.2021

In this blog post, we’ll cover the main libraries of Apache Spark to understand why having it in Azure Synapse Analytics is an excellent idea. Azure Synapse Analytics brings Data Warehousing and Big Data together, and Apache Spark is a key component within the big data space.

In my previous blog post on Apache Spark, we covered how to create an Apache Spark cluster in Azure Synapse Analytics. Today, let’s check out some of its main components.


A few facts about Apache Spark

You can develop solutions writing code in Scala, Python, Spark SQL, .NET or R using language bindings.
Matei Zaharia started Apache Spark in 2009 to process large volumes of data and to overcome the limitations of MapReduce. Version 1.0 was released in 2014. Matei Zaharia is also one of the co-founders of Databricks (founded around 2013).
It is an open-source platform that combines batch and real-time (micro-batch) processing in a single engine.
Azure Synapse Analytics offers version 2.4 (released on 2018-11-02) of Apache Spark, while the latest version is 3.0 (released on 2020-06-08). You can expect to have version 3.0 in Azure Synapse Analytics in the near future.
Azure Databricks made Apache Spark 3.0 available only 10 days after its release (2020-06-18). Technology providers must be on top of their game when it comes to releasing new platforms.
Azure HDInsight Apache Spark also runs version 2.4. Is it a coincidence? No, Azure Synapse Analytics takes advantage of existing technology built into HDInsight.
Apache Spark was built for, and has been proven to work with, environments holding over 100 PB (petabytes) of data.

Apache Spark Libraries

You can find 4 main libraries:

Spark SQL
Spark Streaming
MLlib
GraphX

Spark SQL

I imagine Spark SQL was thought of as a must-have feature when they built the product. Spark SQL allows developers to use SQL to work with structured datasets. It allows you to:

Perform distributed in-memory computations of large volumes of data using SQL
Scale your relational databases with big data capabilities by using SQL to build data movements (ETL pipelines). This works with both unstructured and structured datasets
Take advantage of existing knowledge in writing queries with SQL
Integrate relational and procedural programs using data frames and SQL

Additionally,

Many Business Intelligence (BI) tools use SQL as their input language and can connect through JDBC/ODBC connectors. This lets your BI tool consume big data
By creating tables, you can easily consume information with Python, Scala, R, and .NET
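
To make the idea of mixing relational and procedural code concrete, here is a minimal sketch. It does not use Spark at all: Python's built-in sqlite3 module stands in for Spark SQL as a single-node substitute, and the sales table, its rows, and the tier function are hypothetical. In Spark you would typically register a DataFrame as a temporary view and query it with spark.sql instead, but the shape of the workflow is similar.

```python
import sqlite3

# Hypothetical sample data: (customer, amount) sales rows.
sales = [
    ("alice", 120.0),
    ("bob", 35.5),
    ("alice", 80.0),
    ("carol", 200.0),
    ("bob", 14.5),
]

# sqlite3 is only a stand-in here; it is not distributed and not Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)

# Relational step: aggregate with plain SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM sales GROUP BY customer "
    "ORDER BY total DESC, customer"
).fetchall()
conn.close()


# Procedural step: apply custom logic to the query result in Python.
def tier(total):
    return "gold" if total >= 200 else "standard"


report = [(customer, total, tier(total)) for customer, total in totals]
print(report)
```

The SQL handles the set-based aggregation, while the Python function applies logic that would be awkward to express in SQL. That split is the same one you get when combining data frames and SQL in Spark.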

Spark Streaming

Bringing real-time data streaming into Apache Spark closes the gap between batch and real-time processing by using micro-batches. Before, you usually needed different technologies for these scenarios. For example, Hadoop and MapReduce for batch processing and Apache Storm for real-time streaming.

With the Continuous Processing mode of Structured Streaming (experimental since Spark 2.3), you can stream real-time data and apply transformations with end-to-end latencies as low as 1 millisecond.
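
To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not the Spark Streaming API: the micro_batches helper and the readings list are hypothetical, and batches are cut by event count to keep the example short, whereas Spark cuts them by time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event stream into micro-batches of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sensor readings arriving as a stream.
readings = [3, 5, 8, 1, 9, 4, 7]

# Each micro-batch is processed with ordinary batch logic (here: a sum).
batch_totals = [sum(batch) for batch in micro_batches(readings, 3)]
print(batch_totals)
```

The point of the sketch is that the code inside the loop is plain batch logic. The streaming part only decides where one batch ends and the next begins, which is why one engine can serve both workloads.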

MLlib (machine learning)

MLlib speeds up data scientists’ experimentation, not only because of the large number of algorithms it includes, but also because analyzing large volumes of information is time-consuming and Apache Spark is built to handle that.

It provides tools such as (the following information comes from Apache Spark documentation):

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
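
The pipeline concept in the list above can be sketched in a few lines of plain Python. This is not the MLlib API: the MinMaxScaler, Threshold, and Pipeline classes below are hypothetical stand-ins written for this post, and the sketch simplifies things by having fit return the pipeline itself rather than a separate fitted model object.

```python
class MinMaxScaler:
    """Stage that learns min/max in fit() and rescales to [0, 1] in transform()."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]


class Threshold:
    """Stage with nothing to learn: labels values at or above a cutoff as 1."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def fit(self, values):
        return self

    def transform(self, values):
        return [1 if v >= self.cutoff else 0 for v in values]


class Pipeline:
    """Chains stages so the output of one becomes the input of the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


# Hypothetical training data and new observations.
training = [10, 20, 30, 40, 50]
model = Pipeline([MinMaxScaler(), Threshold(0.5)]).fit(training)
predictions = model.transform([15, 30, 45])
print(predictions)
```

Because the scaling learned from the training data is reused when new data arrives, the whole chain can be saved, reloaded, and tuned as a single unit, which is the reason pipelines and persistence appear side by side in the list.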

GraphX (graph)

GraphX enables you to perform graph computation using edges and vertices. It is developed and enhanced for each Apache Spark release, bringing new algorithms to the platform.

Graph analysis can be resource-intensive, but with GraphX you get performance gains from Spark's distributed computational engine.

GraphX covers specific analytical scenarios, and it extends Spark RDDs with a graph abstraction.
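
As a small illustration of what graph computation over vertices and edges looks like, here is a plain-Python sketch that finds connected components. It is not GraphX, which exposes a Scala API and runs distributed: the vertices, edges, and the connected_components function are hypothetical, and edges are treated as undirected.

```python
# Hypothetical graph: vertex id -> name, plus a list of (source, destination) edges.
vertices = {1: "Ana", 2: "Ben", 3: "Cho", 4: "Dev", 5: "Eli"}
edges = [(1, 2), (2, 3), (4, 5)]


def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        # Push the lower label across every edge until nothing changes.
        for src, dst in edges:
            lowest = min(labels[src], labels[dst])
            if labels[src] != lowest or labels[dst] != lowest:
                labels[src] = labels[dst] = lowest
                changed = True
    return labels


components = connected_components(vertices, edges)
print(components)
```

Each pass only looks at one edge and its two endpoints at a time, which hints at why this style of algorithm spreads well across a cluster: the work can be split by edge and repeated until the labels stop changing.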

Summary

In this blog post, you looked at some of the components within Apache Spark to understand how it makes Azure Synapse Analytics a game-changing one-stop-shop for analytics and helps develop data warehousing or big data workloads.

Final Thoughts

During the past few years while working in the data analytics space, I’ve seen the rise of big data technologies, with some of the main limitations for their adoption being deployment, maintenance, governance, and anything related to their lifecycle. Having managed clusters in Azure Synapse Analytics or Azure Databricks helps mitigate these limitations.

What’s next?

During the next few weeks, we’ll explore more features and services within the Azure offering.  

Please follow me on Twitter at TechTalkCorner for more articles, insights, and tech talk! 

Check out my other posts

Azure SQL Analytics Pool Engine Version

No-code Experience for Querying JSON Files in Azure Synapse Analytics Serverless

Soft Deletes in Azure Storage Accounts


David Alzamendi

As a Data Architect, I help organisations to adopt Azure data analytics technologies that mitigate some of their business challenges. I’ve been working in the data analytics space since 2011, mainly in the data warehousing area and I’m specialized in the design and implementation of data analytics solutions with Microsoft technologies. I am responsible for providing end-to-end technical guidance and expertise across multiple data analytics projects.
