Read Microsoft Excel files in Azure Databricks Cluster


Microsoft Excel needs no introduction. But suppose you have thousands of Excel files and need to process the data stored in them; that is no easy task. With an Azure Databricks Spark cluster, however, the job becomes much simpler.

Azure Databricks lets you quickly spin up a Spark cluster and process petabytes of data. Many customers prefer to keep their data in CSV, JSON, Parquet, or Avro format, because these formats are universal and provide high throughput for reading, writing, and compression. If you spin up an Azure Databricks cluster and access data stored in CSV, JSON, or Parquet format, you don’t need any special libraries. To read Excel files, however, you need to install a specific library and run a specific Spark/Scala version. This article explains how to install the library, validate the correct Spark/Scala version, and use PySpark commands to read an Excel file.
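Before installing anything, it is worth confirming which Spark and Scala versions your cluster is actually running. The check below can be run in any notebook cell; note that the Scala lookup goes through PySpark’s internal py4j JVM handle, so treat it as a convenience trick rather than a stable API.

# Print the Spark version of the cluster, e.g. 2.4.5
print(spark.version)

# Print the Scala version via the internal py4j JVM handle, e.g. "version 2.11.12"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())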

So, let’s start with step-by-step instructions on how to read Excel files in an Azure Databricks Spark cluster.

  • Log in to the Azure portal with your login ID and password.
  • In the Azure portal, select Create a resource > Analytics > Azure Databricks.
  • Under Azure Databricks Service, provide the values to create a Databricks workspace.
  • Select Review + Create and then Create. The workspace creation takes a few minutes. During workspace creation, you can view the deployment status in Notifications. Once this process is finished, your user account is automatically added as an admin user in the workspace.
  • In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.
  • You are redirected to the Azure Databricks portal. From the portal, click New Cluster.
  • On the New cluster page, provide the values to create a cluster.
  • Make sure you select Runtime: 6.4 (Scala 2.11, Spark 2.4.5). If you select the latest runtime, the Spark cluster won’t be able to read Excel files even though the required library installs successfully.
  • In the left pane, select Azure Databricks. From the Common Tasks, select Import Library.
  • In the Create Library dialog box, select Maven under Library Source and click Search Packages.
  • In the Search Packages dialog box, select Maven Central and type “spark-excel_2.11” in the search box. The library’s Scala version must match the cluster’s Scala 2.11 runtime.
  • Click Select under Options for release 0.13.7 in the results section.
  • Wait for the library to be successfully installed.
  • In the left pane, select Azure Databricks. From the Common Tasks, select Create New Table.
  • In the Create New Table dialog box, click Browse under Files and upload an Excel file from your laptop/desktop.
  • In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.
  • In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster you created earlier.
  • The following command lets Spark read the Excel file stored in DBFS and display its contents.

# Read the Excel file from DBFS
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .option("inferSchema", "true")
    .option("addColorColumns", "false")
    .load("/FileStore/tables/<Name of Your excel file>.xlsx"))

display(df)
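Because inferSchema is set to true, it is worth verifying what Spark actually inferred before building anything on top of the DataFrame. The snippet below uses only standard Spark APIs; the view name excel_data is just an illustrative placeholder.

# Inspect the inferred schema and the row count
df.printSchema()
print(df.count())

# Register a temporary view so the data can also be queried with SQL
df.createOrReplaceTempView("excel_data")
display(spark.sql("SELECT * FROM excel_data LIMIT 10"))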

Use the following code to read an Excel file stored in your Azure Blob Storage account. First, mount the blob container on DBFS:

# Mount the blob storage container
SasURL = "<Blob Service SAS URL>"

# The SAS key is everything from the '?' to the end of the SAS URL
indQuestionMark = SasURL.index('?')
SasKey = SasURL[indQuestionMark:]

StorageAccount = "<Storage account>"
ContainerName = "<Container name>"
MountPoint = "/mnt/<Mount point name>"

# Mount the container on DBFS, authenticating with the SAS key
dbutils.fs.mount(
    source = "wasbs://%s@%s.blob.core.windows.net/" % (ContainerName, StorageAccount),
    mount_point = MountPoint,
    extra_configs = {"fs.azure.sas.%s.%s.blob.core.windows.net" % (ContainerName, StorageAccount): SasKey}
)
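Once the mount call succeeds, you can list the mount point to confirm the container is reachable before trying to read from it. This uses the standard dbutils.fs helpers available in Databricks notebooks.

# Verify the mount by listing its contents
display(dbutils.fs.ls(MountPoint))

# If the SAS key changes later, unmount first and then mount again
# dbutils.fs.unmount(MountPoint)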

# Read the Excel file from the mount point
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .option("inferSchema", "true")
    .option("addColorColumns", "false")
    .load("/mnt/<Mount point name>/<Excel file name>"))

display(df)
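As noted in the introduction, formats such as Parquet offer much better read/write throughput than Excel. If the same data will be read repeatedly, a common pattern is to convert it to Parquet once and query the Parquet copy afterwards; the output path below is a hypothetical example.

# Persist the Excel data as Parquet so downstream jobs avoid the Excel reader
output_path = "/mnt/<Mount point name>/excel_as_parquet"  # hypothetical path
df.write.mode("overwrite").parquet(output_path)

# Subsequent reads need no extra library
parquet_df = spark.read.parquet(output_path)
display(parquet_df)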

That’s it. This is how you can read Excel files into your Azure Databricks cluster. The key thing here is the Spark and Scala version: if you select the latest runtime with the latest Spark and Scala versions, chances are your Spark cluster won’t be able to read the Excel file.

Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.
