{"id":6604,"date":"2020-06-03T16:28:26","date_gmt":"2020-06-03T14:28:26","guid":{"rendered":"https:\/\/blog.via-internet.de\/?p=6604"},"modified":"2025-04-22T16:28:10","modified_gmt":"2025-04-22T14:28:10","slug":"azure-databricks-cookbook-2","status":"publish","type":"post","link":"https:\/\/via-internet.de\/blog\/2020\/06\/03\/azure-databricks-cookbook-2\/","title":{"rendered":"Azure | Cookbook Databricks"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Databricks CLI<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Export all Notebooks<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">databricks workspace list | ForEach { databricks workspace export_dir \/$_ $_ }<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Troubleshooting<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Problem<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Error in SQL statement: AnalysisException: Can not create the managed table('`demo`'). The associated location('dbfs:\/user\/hive\/warehouse\/demo') already exists.;<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Solution<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"shell\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">dbutils.fs.rm(\"dbfs:\/user\/hive\/warehouse\/demo\/\", true)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Handling Complex Data Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When working with nested data structures in Databricks, the&nbsp;<code>explode()<\/code>&nbsp;function is essential but comes with hidden pitfalls. Here are key insights for advanced users:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. The Null Trap in explode()<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The standard&nbsp;<code>explode()<\/code>&nbsp;function silently drops rows with empty arrays or null values &#8211; a common pain point in production pipelines. Consider this dataset:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code>data = [\n    (1, \"Luke\", [\"baseball\", \"soccer\"]),\n    (2, \"Lucy\", None),\n    (3, \"Eve\", [])\n]\n\ndf = spark.createDataFrame(data, [\"id\", \"name\", \"likes\"])\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Standard explode behavior<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Output retains only Luke&#8217;s exploded rows<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">df.select(\"id\", \"name\", explode(\"likes\")).show()<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Solution: explode_outer()<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Preserves Lucy (null) and Eve (empty array) with null values<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from pyspark.sql.functions import explode_outer\n\ndf.select(\"id\", \"name\", explode_outer(\"likes\")).show()<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">2. Advanced Array Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For complex nested structures, combine&nbsp;<code>explode_outer()<\/code>&nbsp;with struct typing:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code>from pyspark.sql.types import StructType, StructField, StringType\n\nschema = StructType([\n    StructField(\"sport\", StringType()),\n    StructField(\"level\", StringType())\n])\n\ndf.withColumn(\"nested\", array(struct(lit(\"baseball\").alias(\"sport\"), \n                                   lit(\"pro\").alias(\"level\")))) \\\n  .select(explode_outer(\"nested\")) \\\n  .select(\"col.*\") \\\n  .show()\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3. Z-Order Optimization for Exploded Data<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When working with large exploded datasets, optimize Delta Lake storage:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">python<code>(df\n .write\n .format(\"delta\")\n .option(\"delta.optimizeWrite\", \"true\")\n .option(\"delta.dataSkippingNumIndexedCols\", \"3\")\n .saveAsTable(\"exploded_data\")\n)\n\nspark.sql(\"OPTIMIZE exploded_data ZORDER BY (id, sport)\")\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">4. Performance Comparison<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Operation<\/th><th>Time (10M rows)<\/th><th>Data Skipped<\/th><\/tr><\/thead><tbody><tr><td>Standard explode()<\/td><td>45s<\/td><td>12%<\/td><\/tr><tr><td>explode_outer()<\/td><td>48s<\/td><td>9%<\/td><\/tr><tr><td>Z-Order optimized<\/td><td>32s<\/td><td>68%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">5. Best Practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Always<\/strong>\u00a0use\u00a0<code>explode_outer()<\/code>\u00a0unless explicitly filtering nulls<\/li>\n\n\n\n<li>Combine with\u00a0<code>coalesce()<\/code>\u00a0for default values:<br><code>explode_outer(coalesze(col(\"likes\"), array(lit(\"unknown\"))))<\/code><\/li>\n\n\n\n<li>For map types, use\u00a0<code>explode_outer(map_from_arrays())<\/code>\u00a0pattern<\/li>\n\n\n\n<li>Monitor with\u00a0<code>DESCRIBE HISTORY<\/code>\u00a0for Delta Lake optimizations<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">These techniques ensure data integrity while maintaining query performance, crucial for production-grade implementations. The key is understanding how null handling interacts with Delta Lake&#8217;s optimization features &#8211; a critical insight for advanced users building reliable data pipelines<a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/overcast.blog\/11-databricks-tricks-that-will-blow-your-mind-411226e1ac5b\">1<\/a><a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/stackoverflow.com\/questions\/39739072\/spark-sql-how-to-explode-without-losing-null-values\">6<\/a><a target=\"_blank\" rel=\"noreferrer noopener\" href=\"https:\/\/docs.databricks.com\/aws\/en\/sql\/language-manual\/functions\/explode_outer\">8<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Databricks CLI Export all Notebooks Troubleshooting Problem Solution Handling Complex Data Scenarios When working with nested data structures in Databricks, the&nbsp;explode()&nbsp;function is essential but comes with hidden pitfalls. Here are key insights for advanced users: 1. The Null Trap in explode() The standard&nbsp;explode()&nbsp;function silently drops rows with empty arrays or null values &#8211; a common [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6659,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[9,20],"tags":[],"class_list":["post-6604","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure","category-databricks"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/6604","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/comments?post=6604"}],"version-history":[{"count":2,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/6604\/revisions"}],"predecessor-version":[{"id":10358,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/posts\/6604\/revisions\/10358"}],"wp:attachment":[{"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/media?parent=6604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/categories?post=6604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/via-internet.de\/blog\/wp-json\/wp\/v2\/tags?post=6604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}