Store Azure Databricks logs into Azure Data Lake Gen2

kapil rajyaguru
2 min read · Apr 30, 2021
Photo by Joshua Sortino on Unsplash

One of the common questions I receive from customers who use Azure Databricks (ADB) is how to store Azure Databricks notebook execution and error logs in an Azure Data Lake Storage (ADLS) Gen2 account. While you can view the Spark driver and executor logs in the Spark UI, Databricks can also deliver the logs to an ADLS Gen2 destination. The following steps show how.

Prerequisites

  • Azure subscription with sign in access to the Azure Portal
  • Azure Databricks workspace
  • Azure Data Lake Storage Gen2 account

Generate Azure Databricks personal access token

  • In the Azure portal, search for Azure Databricks and open Azure Databricks workspace
  • Click the user profile icon in the upper right corner of your Databricks workspace.
  • Click User Settings.
  • Go to the Access Tokens tab.
  • Click the Generate New Token button.
  • Optionally enter a description (comment) and expiration period.
  • Click the Generate button.
  • Copy the generated token and store it in a secure location.
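Before using the token in the steps below, you can confirm it works by calling any read-only REST endpoint, such as the Clusters API. The following is a minimal sketch assuming the Python requests library; the workspace URL and token are placeholders you must supply.

import requests

# Placeholders: replace with your workspace URL and the token generated above.
DATABRICKS_INSTANCE = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

# Listing clusters is a harmless smoke test for the token.
resp = requests.get(
    f"{DATABRICKS_INSTANCE}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())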

Mount Azure Data Lake Storage Gen2 containers to DBFS

  • Follow the instructions provided here to mount the ADLS Gen2 container to DBFS; a minimal sketch of the mount pattern follows.
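For reference, this is a sketch of the OAuth (service principal) mount pattern, run from a Databricks notebook. The application ID, secret scope and key, tenant ID, container, and storage account names are all placeholders; your environment may use a different authentication method.

# Run in a Databricks notebook. All angle-bracket values are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the ADLS Gen2 container at /mnt/logs so cluster logs can be delivered there.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/logs",
    extra_configs=configs,
)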

Create a cluster with logs delivered to ADLS Gen2 location

The following cURL command creates a cluster named cluster_log_dbfs and requests that Databricks send its logs to dbfs:/mnt/logs, with the cluster ID as the path prefix.

curl -X POST \
  -H 'Authorization: Bearer <Access Token>' \
  -H 'Content-Type: application/json' \
  -d '{
    "cluster_name": "cluster_log_dbfs",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "cluster_log_conf": {
      "dbfs": {
        "destination": "dbfs:/mnt/logs"
      }
    }
  }' https://<databricks-instance>/api/2.0/clusters/create

Replace <databricks-instance> with the workspace URL of your Databricks deployment.

The response should contain the cluster ID. The deployment may take a few minutes; you can validate it by going to ADB workspace → Clusters, where the cluster state will show as Pending.
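If you prefer to check from code rather than the UI, you can poll the Clusters API for the cluster state. This sketch reuses the placeholder workspace URL and token from earlier and assumes the cluster ID returned by the create call.

import time
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"  # placeholder
TOKEN = "<personal-access-token>"                      # placeholder
CLUSTER_ID = "<cluster-id-from-create-response>"       # placeholder

# Poll until the cluster leaves the PENDING state (e.g., becomes RUNNING).
while True:
    resp = requests.get(
        f"{DATABRICKS_INSTANCE}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()
    state = resp.json()["state"]
    print(state)
    if state != "PENDING":
        break
    time.sleep(30)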

After cluster creation, Databricks syncs log files to the destination every 5 minutes. It uploads driver logs to dbfs:/mnt/logs/1111-223344-abc55/driver and executor logs to dbfs:/mnt/logs/1111-223344-abc55/executor.
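Once the first sync has run, you can list the delivered files from any notebook in the workspace; the cluster ID below is a placeholder.

# List the driver logs delivered for a given cluster (placeholder ID).
for f in dbutils.fs.ls("dbfs:/mnt/logs/<cluster-id>/driver"):
    print(f.path, f.size)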

I hope this helps you save ADB logs to an ADLS Gen2 account for future auditing and troubleshooting.

Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.

