Read Microsoft Excel files in Azure Databricks Cluster

kapil rajyaguru
Apr 26, 2021

Microsoft Excel needs no introduction. But suppose you have thousands of Excel files and need to process the data stored in them; that is not an easy task. With an Azure Databricks Spark cluster, however, the job becomes much simpler.

Azure Databricks lets you quickly spin up a Spark cluster and process petabytes of data. Many customers prefer to keep their data in CSV, JSON, Parquet, or Avro format. The key reason for using these formats is that they are universal and provide high throughput for reading, writing, and compression. If you spin up an Azure Databricks cluster and access data stored in CSV, JSON, or Parquet format, you don't need any special libraries. To read Excel files, however, you need a specific library and a specific Spark/Scala version. This article explains how to install the library, validate the correct Spark/Scala version, and use PySpark to read an Excel file.
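To make the distinction concrete, here is a small, hypothetical helper (plain Python, not part of Databricks or Spark) that maps a file extension to the data-source name you would pass to spark.read.format(). Only the Excel entry requires installing an extra library:

```python
# Hypothetical helper: map a file extension to the Spark data-source name
# used with spark.read.format(). CSV, JSON, Parquet, and Avro readers ship
# with the Databricks runtime; Excel requires the spark-excel library.
BUILTIN_FORMATS = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".avro": "avro",
}

def spark_format_for(path):
    ext = "." + path.rsplit(".", 1)[-1].lower()
    if ext in (".xlsx", ".xls"):
        # Requires the com.crealytics spark-excel library to be installed
        return "com.crealytics.spark.excel"
    return BUILTIN_FORMATS.get(ext, "text")  # fall back to plain text

print(spark_format_for("sales.parquet"))  # parquet
print(spark_format_for("report.xlsx"))    # com.crealytics.spark.excel
```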

So, let's start with step-by-step instructions on how to read Excel files in an Azure Databricks Spark cluster.

  • Log in to the Azure portal with your login ID and password.
  • In the Azure portal, select Create a resource > Analytics > Azure Databricks.
  • Under Azure Databricks Service, provide the values to create a Databricks workspace.
  • Select Review + Create and then Create. The workspace creation takes a few minutes. During workspace creation, you can view the deployment status in Notifications. Once this process is finished, your user account is automatically added as an admin user in the workspace.
  • In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.
  • You are redirected to the Azure Databricks portal. From the portal, click New Cluster.
  • In the New cluster page, provide the values to create a cluster.
  • Please make sure you select Runtime: 6.4 (Scala 2.11, Spark 2.4.5). If you select the latest version, the Spark cluster won't read Excel files even though the required library installs successfully.
  • In the left pane, select Azure Databricks. From the Common Tasks, select Import Library.
  • In the Create Library dialog box, select Maven under Library Source and click search Packages.
  • Under the Search Packages dialog box, select Maven Central and type “spark-excel_2.11” (the artifact's Scala suffix must match the cluster's Scala 2.11 version) under the search box.
  • Click Select next to release 0.13.7 in the results section.
  • Wait for the library to be successfully installed.
  • In the left pane, select Azure Databricks. From the Common Tasks, select Create New Table.
  • In the Create New Table dialog box, click Browse under Files and upload an Excel file from your laptop/desktop.
  • In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.
  • In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster you created earlier.
  • The following command reads the Excel file stored in DBFS and displays its content.

# Read the Excel file from DBFS
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("inferSchema", "true")
      .option("addColorColumns", "false")
      .load("/FileStore/tables/<Name of Your excel file>.xlsx"))

display(df)

Use the following code to read an Excel file stored in your Azure Blob Storage account.

# Mount the blob storage container
SasURL = "<Blob Service SAS URL>"
indQuestionMark = SasURL.index('?')
SasKey = SasURL[indQuestionMark:]  # everything from '?' onward, i.e. the SAS token

StorageAccount = "<Storage account>"
ContainerName = "<Container name>"
MountPoint = "/mnt/<Mount point name>"

dbutils.fs.mount(
    source = "wasbs://%s@%s.blob.core.windows.net/" % (ContainerName, StorageAccount),
    mount_point = MountPoint,
    extra_configs = {"fs.azure.sas.%s.%s.blob.core.windows.net" % (ContainerName, StorageAccount): "%s" % SasKey}
)
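The string slicing in the mount code assumes the SAS URL always contains a query string. As a slightly more defensive sketch (plain Python, runnable outside Databricks), you can use urllib.parse to pull out the SAS token, which is everything after the '?':

```python
from urllib.parse import urlsplit

def extract_sas_key(sas_url):
    """Return the SAS token (the query string, with its leading '?') from a SAS URL."""
    query = urlsplit(sas_url).query
    if not query:
        raise ValueError("SAS URL contains no query string, so no SAS token")
    return "?" + query

# Example with a made-up, non-functional SAS URL:
url = "https://myaccount.blob.core.windows.net/mycontainer?sv=2020-08-04&sig=abc123"
print(extract_sas_key(url))  # ?sv=2020-08-04&sig=abc123
```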

# Read the Excel file from the mount point
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("inferSchema", "true")
      .option("addColorColumns", "false")
      .load("/mnt/<Mount point name>/<Excel file name>"))

display(df)

That's it. This is how you read Excel files in your Azure Databricks cluster. The key thing here is the Spark and Scala version: if you select the latest runtime with the latest Spark and Scala versions, chances are your cluster won't be able to read the Excel file.
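To see why the versions must line up, note that the Maven artifact name itself encodes the Scala version. Here is a small, hypothetical helper (plain Python, my own illustration, not part of any library) that builds the matching coordinate for a given cluster:

```python
# Hypothetical helper: build the spark-excel Maven coordinate whose Scala
# suffix matches the cluster's Scala version. An artifact built for one
# Scala minor version (e.g. _2.12) will not load on a cluster running
# another (e.g. 2.11), which is why the runtime choice matters.
def spark_excel_coordinate(scala_version, library_version="0.13.7"):
    # Keep only major.minor, e.g. "2.11.12" -> "2.11"
    scala_suffix = ".".join(scala_version.split(".")[:2])
    return "com.crealytics:spark-excel_%s:%s" % (scala_suffix, library_version)

# Databricks Runtime 6.4 ships Scala 2.11, so:
print(spark_excel_coordinate("2.11"))  # com.crealytics:spark-excel_2.11:0.13.7
```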

Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.


Enabling Organizations with IT Transformation & Cloud Migrations | Principal CSM Architect at IBM, Ex-Microsoft, Ex-AWS. My opinions are my own.