Store Azure Databricks logs into Azure Data Lake Gen2

Photo by Joshua Sortino on Unsplash

One of the common questions I receive from customers who use Azure Databricks (ADB) is how to store Azure Databricks notebook execution/error logs in an Azure Data Lake Storage (ADLS) Gen2 storage account. While you can view the Spark driver and executor logs in the Spark UI, Databricks can also deliver the logs to an ADLS Gen2 destination. The following steps show how.

Prerequisites

  • Azure subscription with sign-in access to the Azure portal
  • Azure Databricks workspace
  • Azure Data Lake Storage Gen2 account

Generate Azure Databricks personal access token

  • In the Azure portal, search for Azure Databricks and open your Azure Databricks workspace.
  • Click the user profile icon in the upper right corner of your Databricks workspace.
  • Click User Settings.
  • Go to the Access Tokens tab.
  • Click the Generate New Token button.
  • Optionally enter a description (comment) and expiration period.
  • Click the Generate button.
  • Copy the generated token and store it in a secure location.
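Before using the token, you can verify it works by calling the Clusters API with it. A quick sketch in Python, assuming the token is exported as the DATABRICKS_TOKEN environment variable and <databricks-instance> is a placeholder for your workspace URL:

import os
import requests  # third-party library: pip install requests

host = "https://<databricks-instance>"  # placeholder; use your workspace URL
token = os.environ["DATABRICKS_TOKEN"]

# Listing clusters succeeds only if the token authenticates.
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())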

Mount Azure Data Lake storage Gen2 containers to DBFS

  • Follow the instructions provided here to mount the ADLS Gen2 container to DBFS; a minimal mount sketch follows this list.
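For reference, the mount itself runs from a notebook. Here is a minimal sketch in Python, assuming a service principal that has access to the storage account; <application-id>, <tenant-id>, <storage-account>, the container name logs, and the secret scope/key are all placeholders you must replace:

# All names below are placeholders; substitute your own values.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the "logs" container so it is reachable at dbfs:/mnt/logs.
dbutils.fs.mount(
    source="abfss://logs@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/logs",
    extra_configs=configs,
)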

Create a cluster with logs delivered to ADLS Gen2 location

The following cURL command creates a cluster named cluster_log_dbfs and requests that Databricks deliver its logs to dbfs:/mnt/logs with the cluster ID as the path prefix.

curl -X POST \
-H 'Authorization: Bearer <Access Token>' \
-H 'Content-Type: application/json' \
-d '{
  "cluster_name": "cluster_log_dbfs",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/mnt/logs"
    }
  }
}' https://<databricks-instance>/api/2.0/clusters/create

Replace <databricks-instance> with the workspace URL of your Databricks deployment.

The response should contain the cluster ID. The deployment may take a few minutes; you can validate it in the ADB workspace under Clusters, where the cluster state will initially show as Pending.
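If you prefer to validate from a script rather than the portal, the Clusters API reports the state directly. A sketch, assuming the same DATABRICKS_TOKEN environment variable as above and the cluster_id value returned by the create call:

import os
import time
import requests

host = "https://<databricks-instance>"  # placeholder; use your workspace URL
token = os.environ["DATABRICKS_TOKEN"]
cluster_id = "<cluster-id>"  # from the create response

# Poll until the cluster leaves the PENDING state.
while True:
    resp = requests.get(
        f"{host}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {token}"},
        params={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
    state = resp.json()["state"]
    print(state)
    if state != "PENDING":
        break
    time.sleep(30)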

After cluster creation, Databricks syncs log files to the destination every five minutes. For a cluster with ID 1111-223344-abc55, it uploads driver logs to dbfs:/mnt/logs/1111-223344-abc55/driver and executor logs to dbfs:/mnt/logs/1111-223344-abc55/executor.
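Once a sync has run, you can confirm delivery from a notebook. For example (using the same illustrative cluster ID as above):

# List the driver log files delivered for this cluster.
display(dbutils.fs.ls("dbfs:/mnt/logs/1111-223344-abc55/driver"))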

Hope this helps you save your ADB logs to an ADLS Gen2 account for future audit and troubleshooting purposes.

Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.
