Databricks – Developer Blog

Azure | Cookbook Databricks

Databricks CLI

Export all Notebooks

databricks workspace list | ForEach { databricks workspace export_dir /$_ $_ }

Troubleshooting

Problem

Error in SQL statement: AnalysisException: Can not create the managed table('`demo`'). The associated location('dbfs:/user/hive/warehouse/demo') already exists.;

Solution

dbutils.fs.rm("dbfs:/user/hive/warehouse/demo/", true)

Handling Complex Data Scenarios

When working with nested data structures in Databricks, the explode() function is essential but comes with hidden pitfalls. Here are key insights for advanced users:

1. The Null Trap in explode()

The standard explode() function silently drops rows with empty arrays or null values – a common pain point in production pipelines. Consider this dataset:

pythondata = [
    (1, "Luke", ["baseball", "soccer"]),
    (2, "Lucy", None),
    (3, "Eve", [])
]

df = spark.createDataFrame(data, ["id", "name", "likes"])

Standard explode behavior

Output retains only Luke’s exploded rows

df.select("id", "name", explode("likes")).show()

Solution: explode_outer()

Preserves Lucy (null) and Eve (empty array) with null values

from pyspark.sql.functions import explode_outer

df.select("id", "name", explode_outer("likes")).show()

2. Advanced Array Handling

For complex nested structures, combine explode_outer() with struct typing:

pythonfrom pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("sport", StringType()),
    StructField("level", StringType())
])

df.withColumn("nested", array(struct(lit("baseball").alias("sport"), 
                                   lit("pro").alias("level")))) \
  .select(explode_outer("nested")) \
  .select("col.*") \
  .show()

3. Z-Order Optimization for Exploded Data

When working with large exploded datasets, optimize Delta Lake storage:

python(df
 .write
 .format("delta")
 .option("delta.optimizeWrite", "true")
 .option("delta.dataSkippingNumIndexedCols", "3")
 .saveAsTable("exploded_data")
)

spark.sql("OPTIMIZE exploded_data ZORDER BY (id, sport)")

Developer Blog

Azure Databricks| Working with Unit Tests

Introduction

Problem

Solution

Create Notebook with the Code

Create Notebook with the Tests

Azure | Cookbook Databricks

Databricks CLI

Export all Notebooks

Troubleshooting

Problem

Solution

Handling Complex Data Scenarios

1. The Null Trap in explode()

Standard explode behavior

Solution: explode_outer()

2. Advanced Array Handling

3. Z-Order Optimization for Exploded Data

4. Performance Comparison

5. Best Practices

Azure | Working with Widgets

TL;DR

Requirements

Define needed moduls and functions

Create DataFrame for this post:

Working with Widgets

Default Widgets

Multiselect Widgets

Monitor the changes when selection values

Filter Query by widgets

Prepare widgets

Now, change some values

Azure Databricks | Cookbook

Reading Data

Create Table from CSV file with SQL

Create Table from CSV file with PySpark

Analyse Data

Group and Display