In this blog post, I’ll show you how to easily query JSON files with Notebooks by converting them to temporal tables in Apache Spark and using Spark SQL.
With the appearance of Data Lakes and other file formats in the data analytics space, people are curious about how to consume these new dataset formats. While a highly skilled technical resource can do it, you can also use Azure Synapse Analytics to get insights with a no-code experience or by writing only a few lines of code.
Querying JSON Files Tutorial
In this tutorial, we are going to be querying some files from the Adventure Works LT database. You can download the files and get a copy of the notebook.
%%pyspark ## Customers customers_data_path = spark.read.load([ 'abfss://firstname.lastname@example.org/Demos/AdventureWorks/json/2020/202007/20200712/SalesLT_Customer_20200712.json' ], format='json') customers_data_path.createOrReplaceTempView("Customer") ## Customers Address customers_data_path = spark.read.load([ 'abfss://email@example.com/Demos/AdventureWorks/json/2020/202007/20200712/SalesLT_CustomerAddress_20200712.json' ], format='json') customers_data_path.createOrReplaceTempView("CustomerAddress") ## Address customers_data_path = spark.read.load([ 'abfss://firstname.lastname@example.org/Demos/AdventureWorks/json/2020/202007/20200712/SalesLT_Address_20200712.json' ], format='json') customers_data_path.createOrReplaceTempView("Address")
Next, after loading the files into temporal views, query them using Spark SQL.
%%sql select CountryRegion, count(*) as NumberOfCustomers from Customer CUS join CustomerAddress CAD on CUS.CustomerID = CAD.CustomerID join Address ADR on ADR.AddressID = CAD.AddressID group by CountryRegion order by count(*) desc
Is that it? Yes, that’s all! Now you are ready to start consuming your JSON files.
What if you have multiple files?
Query multiple files at the same time by using the wildcard such as *. For example:
To sum up, querying JSON files is not a challenging on this platform thanks to Notebooks. This enables data engineers, data scientists and data analysts to easily consume information using different file formats. In my opinion, it’s certainly one of the best features of Azure Synapse Analytics.
Looking forward, I’ll continue introducing new features that are already available in Azure Synapse Analytics Workspaces. Please leave me a comment if you have any questions.
My previous post: Azure Synapse Analytics Notebooks.
Another post on Azure Synapse Analytics: Create Parquet Files