Store Azure Databricks logs into Azure Data Lake Gen2
One of the most common questions I receive from customers who use Azure Databricks (ADB) is how to store notebook execution and error logs in an Azure Data Lake Storage (ADLS) Gen2 account. While you can view the Spark driver and executor logs in the Spark UI, Databricks can also deliver those logs to an ADLS Gen2 destination. The steps below show how.
Prerequisites
- Azure subscription with sign in access to the Azure Portal
- Azure Databricks workspace
- Azure Data Lake Storage Gen2 account
Generate Azure Databricks personal access token
- In the Azure portal, search for Azure Databricks and open your Azure Databricks workspace.
- Click the user profile icon in the upper right corner of your Databricks workspace.
- Click User Settings.
- Go to the Access Tokens tab.
- Click the Generate New Token button.
- Optionally enter a description (comment) and expiration period.
- Click the Generate button.
- Copy the generated token and store it in a secure location.
Mount Azure Data Lake Storage Gen2 containers to DBFS
- Follow the instructions in the Azure Databricks documentation to mount the ADLS Gen2 container to DBFS. The rest of this post assumes the container is mounted at /mnt/logs.
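As a rough illustration of the mount step, here is a minimal Python sketch meant to run in a Databricks notebook. It assumes you authenticate with an Azure AD service principal over OAuth, which is only one of the supported options and is not something this post prescribes. The helper names (build_oauth_configs, mount_logs_container) and the container, storage account, and secret scope values are hypothetical placeholders.

```python
def build_oauth_configs(client_id, client_secret, tenant_id):
    """Return Spark configs for OAuth access to ADLS Gen2 (assumes a service principal)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_logs_container(dbutils, container, storage_account, configs,
                         mount_point="/mnt/logs"):
    """Mount the container at mount_point unless that mount point already exists."""
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    # dbutils.fs.mounts() lists the current mounts; skip mounting if ours is present.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return mount_point
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return mount_point


# Example notebook usage (all values are placeholders; dbutils exists only in Databricks):
# secret = dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>")
# configs = build_oauth_configs("<application-id>", secret, "<directory-id>")
# mount_logs_container(dbutils, "<container-name>", "<storage-account-name>", configs)
```

Passing dbutils in as an argument keeps the helpers importable outside a notebook, and reading the client secret from a secret scope avoids pasting it into the notebook.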
Create a cluster with logs delivered to ADLS Gen2 location
The following cURL command creates a cluster named cluster_log_dbfs and asks Databricks to send its logs to dbfs:/mnt/logs, using the cluster ID as the path prefix.
curl -X POST -H 'Authorization: Bearer <Access Token>' -H 'Content-Type: application/json' -d \
'{
  "cluster_name": "cluster_log_dbfs",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/mnt/logs"
    }
  }
}' https://<databricks-instance>/api/2.0/clusters/create
Replace <databricks-instance> with the workspace URL of your Databricks deployment and <Access Token> with the personal access token you generated earlier.
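If you would rather script the same call than use cURL, the sketch below sends an equivalent request with Python's standard library. The function names and the parameters instance, token, and destination are assumptions made for illustration; the payload mirrors the cURL example above.

```python
import json
import urllib.request


def build_create_request(instance, token, destination="dbfs:/mnt/logs"):
    """Build the POST request for /api/2.0/clusters/create with log delivery enabled."""
    payload = {
        "cluster_name": "cluster_log_dbfs",
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
        "cluster_log_conf": {"dbfs": {"destination": destination}},
    }
    return urllib.request.Request(
        f"https://{instance}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_cluster(instance, token, destination="dbfs:/mnt/logs"):
    """Send the request and return the cluster_id field from the JSON response."""
    request = build_create_request(instance, token, destination)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["cluster_id"]
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches the workspace.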
The response should contain the cluster ID. The deployment may take a few minutes. You can check progress in the ADB workspace under Clusters, where the new cluster will show a state of Pending until it is ready.
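The cluster state can also be checked from a script instead of the workspace UI. The sketch below calls the Clusters API get endpoint and reads the state field from the response. The function names and parameters are hypothetical, and the cluster ID is the one returned by the create call.

```python
import json
import urllib.parse
import urllib.request


def cluster_state_request(instance, token, cluster_id):
    """Build a GET request for /api/2.0/clusters/get for the given cluster."""
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    url = f"https://{instance}/api/2.0/clusters/get?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def get_cluster_state(instance, token, cluster_id):
    """Return the cluster's state string, such as PENDING or RUNNING."""
    request = cluster_state_request(instance, token, cluster_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["state"]
```

Calling get_cluster_state every minute or so until it returns RUNNING is one simple way to wait for the deployment to finish.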
After cluster creation, Databricks syncs log files to the destination every 5 minutes. For example, for a cluster with ID 1111-223344-abc55, it uploads driver logs to dbfs:/mnt/logs/1111-223344-abc55/driver and executor logs to dbfs:/mnt/logs/1111-223344-abc55/executor.
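To confirm that logs are arriving, you can list the delivery folders from a notebook. This is a small sketch under the assumption that the destination is dbfs:/mnt/logs as above; log_dirs and list_driver_logs are hypothetical helper names, and dbutils is available only inside Databricks.

```python
def log_dirs(cluster_id, destination="dbfs:/mnt/logs"):
    """Return the driver and executor log folders for a cluster ID."""
    base = f"{destination.rstrip('/')}/{cluster_id}"
    return {"driver": f"{base}/driver", "executor": f"{base}/executor"}


def list_driver_logs(dbutils, cluster_id, destination="dbfs:/mnt/logs"):
    """List the paths of the files delivered to the driver log folder so far."""
    driver_dir = log_dirs(cluster_id, destination)["driver"]
    return [info.path for info in dbutils.fs.ls(driver_dir)]


# Example notebook usage (the cluster ID is the one returned by the create call):
# for path in list_driver_logs(dbutils, "<cluster-id>"):
#     print(path)
```

Because logs are synced every 5 minutes, the folder may not exist or may be empty right after the cluster starts.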
I hope this helps you save ADB logs to an ADLS Gen2 account for future audits and troubleshooting.
Disclaimer: I work for @Microsoft Azure Cloud & my opinions are my own.