# Wednesday, July 10, 2019

Azure Databricks is a platform built on top of Apache Spark and deployed to Microsoft's Azure cloud platform. It provides a web-based interface that makes it simple for users to create and scale clusters of Spark servers and deploy jobs and notebooks to those clusters. Spark provides a general-purpose compute engine ideal for working with big data, thanks to its built-in parallelization engine.

In the last article in this series, I showed how to create a new Databricks Cluster in a Microsoft Azure Databricks Workspace.

In this article, I will show how to create a notebook and run it on that cluster.

Navigate to the Databricks service, as shown in Fig. 1.

db01-OverviewBlade
Fig. 1

Click the [Launch Workspace] button (Fig. 2) to open the Azure Databricks page, as shown in Fig. 3.

db02-LaunchWorkspaceButton
Fig. 2

db03-DatabricksHomePage
Fig. 3

Click the "New Notebook" link under "Common Tasks" to open the "Create Notebook" dialog, as shown in Fig. 4.

db04-CreateNotebookDialog
Fig. 4

At the "Name" field, enter a name for your notebook. The name must be unique within this workspace.

At the "Language" dropdown, select the default language for your notebook. Current options are Python, Scala, SQL, and R. Selecting a language does not limit you to only using that language within this notebook. You can override the language in a given cell.
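As a rough illustration of the per-cell override: in Databricks notebooks, a cell that begins with a language magic command such as `%sql` or `%scala` runs in that language instead of the notebook default. The helper below is a hypothetical sketch in plain Python (`cell_language` and `LANGUAGE_MAGICS` are names invented for this example, not part of any Databricks API) that mimics that rule.

```python
# Hypothetical illustration only: this mimics the override rule and is
# not how Databricks itself is implemented.
LANGUAGE_MAGICS = {
    "%python": "Python",
    "%scala": "Scala",
    "%sql": "SQL",
    "%r": "R",
}


def cell_language(cell_source: str, notebook_default: str) -> str:
    """Return the language a cell would run in under the override rule."""
    tokens = cell_source.split()
    first_token = tokens[0] if tokens else ""
    # A recognized magic command at the start of the cell wins;
    # otherwise the notebook's default language applies.
    return LANGUAGE_MAGICS.get(first_token, notebook_default)


sql_cell = "%sql\nSELECT 1 AS answer"
plain_cell = "print('hello')"

print(cell_language(sql_cell, "Python"))    # SQL
print(cell_language(plain_cell, "Python"))  # Python
```

In a real notebook you would simply type `%sql` on the first line of a cell; no helper like this is needed.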

Click the [Create] button to create the new notebook. A blank notebook displays, as shown in Fig. 5.

db05-BlankNotebook
Fig. 5

Fig. 6 shows a notebook with some simple code added to the first two cells.

db06-Notebook
Fig. 6
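The code in the figure is not reproduced here. As an assumption about what simple code in two cells could look like, the sketch below uses plain Python with made-up values. It also shows that a variable defined in one cell is available to later cells in the same notebook.

```python
# Cell 1: define some data (illustrative values, not taken from the figure)
temperatures_c = [20.0, 25.0, 30.0]

# Cell 2: reuse the variable defined in the previous cell
temperatures_f = [c * 9 / 5 + 32 for c in temperatures_c]
print(temperatures_f)  # [68.0, 77.0, 86.0]
```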

You can add, move, or manipulate cells by clicking the cell menu at the top right of an existing cell, as shown in Fig. 7.

db07-AddCell
Fig. 7

To run your notebook, you will need to attach it to an existing, running cluster. Click the "Attach to" dropdown and select from the clusters in the current workspace, as shown in Fig. 8. See the previous article in this series for information on how to create a cluster.

db08-AttachCluster
Fig. 8

You can run all the cells in a notebook by clicking the "Run all" button in the toolbar, as shown in Fig. 9.

db09-RunAll
Fig. 9

Use the "Run" menu in the top right of a cell to run only that cell or the cells above or below it, as shown in Fig. 10.

db10-RunCell
Fig. 10

Fig. 11 shows a notebook after all cells have been run. Note the output displayed below each cell.

db11-NotebookWithResults
Fig. 11

In this article, I showed how to create, run, and manage a notebook in an Azure Databricks workspace.

Wednesday, July 10, 2019 9:20:00 AM (GMT Daylight Time, UTC+01:00)
# Tuesday, July 9, 2019

Azure Databricks is a platform built on top of Apache Spark and deployed to Microsoft's Azure cloud platform. It provides a web-based interface that makes it simple for users to create and scale clusters of Spark servers and deploy jobs and notebooks to those clusters. Spark provides a general-purpose compute engine ideal for working with big data, thanks to its built-in parallelization engine.

In the last article in this series, I showed how to create a new Databricks service in Microsoft Azure.

A cluster is a set of compute nodes that can work together. All Databricks jobs run in a cluster, so you will need to create one if you want to do anything with your Databricks service.

In this article, I will show how to create a cluster in that service.

Navigate to the Databricks service, as shown in Fig. 1.

db01-OverviewBlade
Fig. 1

Click the [Launch Workspace] button (Fig. 2) to open the Azure Databricks page, as shown in Fig. 3.

db02-LaunchWorkspaceButton
Fig. 2

db03-DatabricksHomePage
Fig. 3

Click the "New Cluster" link to open the "Create Cluster" dialog, as shown in Fig. 4.

db04-CreateCluster
Fig. 4

At the "Cluster Name" field, enter a descriptive name for your cluster.

At the "Cluster Mode" dropdown, select "Standard" or "High Concurrency". The "High Concurrency" option can run multiple jobs concurrently.

At the "Databricks Runtime Version" dropdown, select the runtime version you wish to support on this cluster. I recommend selecting the latest non-beta version.

At the "Python Version" dropdown, select the version of Python you wish to support. New code will likely be written in version 3, but you may be running old notebooks written in version 2.

I recommend checking the "Enable autoscaling" checkbox. This allows the cluster to automatically spin up the number of nodes required for a job, effectively balancing cost and performance.

I recommend checking the "Terminate after ___ minutes" checkbox and entering a reasonable period of inactivity (I usually set this to 60 minutes) after which the cluster shuts down. Running a cluster is expensive, so you will save a lot of money if you shut clusters down when they are not in use. Because it takes a long time to spin up a cluster, consider how frequently new jobs arrive before setting this value too low. You may need to experiment with this value to get it right for your situation.
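To make the trade-off concrete, here is a minimal sketch of the arithmetic. The hourly rate and the number of idle periods are made-up numbers, and `idle_cost` is a hypothetical helper; it estimates only the upper bound of what you pay while a cluster sits idle waiting for the timeout to expire.

```python
def idle_cost(hourly_rate: float, timeout_minutes: int, idle_periods: int) -> float:
    """Upper-bound cost of idle time before auto-termination kicks in.

    Assumes the cluster sits idle for the full timeout each time it
    goes quiet; real usage will vary.
    """
    return hourly_rate * (timeout_minutes / 60) * idle_periods


# Hypothetical figures: 2.00 per hour, 20 idle periods per month
print(idle_cost(2.0, 60, 20))  # 40.0
print(idle_cost(2.0, 30, 20))  # 20.0
```

Halving the timeout halves this idle cost, but it also makes it more likely that the next job has to wait for the cluster to start again.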

At the "Worker Type" dropdown, select the size of machines to include in your cluster. If you enabled autoscaling, you can set the minimum and maximum number of worker nodes as well. If you did not enable autoscaling, you can only set a fixed number of worker nodes. My experience is that more nodes and smaller machines tend to be more cost-effective than fewer nodes and more powerful machines, but you may want to experiment with your jobs to find the optimum setting for your organization.

At the "Driver Type" dropdown, select "Same as worker".

You can expand the "Advanced Options" section to pass specific data to your cluster, but this is usually not necessary.
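The same choices can also be written down as data. The sketch below builds a Python dictionary whose field names follow the Databricks Clusters REST API as I understand it (`cluster_name`, `spark_version`, `node_type_id`, `driver_node_type_id`, `autoscale`, `autotermination_minutes`); check them against the current API reference before relying on them. The function `build_cluster_spec` and all of the values passed to it are placeholders for illustration, and the code does not call any service.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       min_workers, max_workers,
                       autotermination_minutes=60):
    """Collect the dialog's settings into one dictionary (hypothetical helper)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Mirrors choosing "Same as worker" for the driver type
        "driver_node_type_id": node_type_id,
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values; substitute the runtime and VM size you selected
spec = build_cluster_spec(
    name="demo-cluster",
    spark_version="5.4.x-scala2.11",
    node_type_id="Standard_DS3_v2",
    min_workers=2,
    max_workers=8,
)
print(json.dumps(spec, indent=2))
```

Keeping the settings in one structure like this makes it easier to recreate a cluster with the same configuration later.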

Click the [Create Cluster] button to create this cluster. It will take a few minutes to create and start a new cluster.

When the cluster is created, you will see it listed, as shown in Fig. 5, with a state of "Running".

db05-Clusters
Fig. 5

You are now ready to create jobs and run them on this cluster. I will cover this in a future article.

In this article, you learned how to create a cluster in an existing Azure Databricks workspace.

Tuesday, July 9, 2019 9:37:00 AM (GMT Daylight Time, UTC+01:00)
# Friday, July 5, 2019

Azure Databricks is a web-based platform built on top of Apache Spark and deployed to Microsoft's Azure cloud platform.

Databricks provides a web-based interface that makes it simple for users to create and scale clusters of Spark servers and deploy jobs and Notebooks to those clusters. Spark provides a general-purpose compute engine ideal for working with big data, thanks to its built-in parallelization engine.

Apache Spark is open source, and Databricks is owned by the Databricks company, but Microsoft adds value by providing the hardware and fabric on which these tools are deployed, including capacity on which to scale and built-in fault tolerance.

To create an Azure Databricks environment, navigate to the Azure Portal, log in, and click the [Create Resource] button (Fig. 1).

db01-CreateResourceButton
Fig. 1

From the menu, select Analytics | Azure Databricks, as shown in Fig. 2.

db02-NewDataBricksMenu
Fig. 2

The "Azure Databricks service" blade displays, as shown in Fig. 3.

db03-NewDataBricksBlade
Fig. 3

At the "Workspace name" field, enter a unique name for the Databricks workspace you will create.

At the "Subscription" field, select the subscription associated with this workspace. Most of you will have only one subscription.

At the "Resource group" field, click the "Use existing" radio button and select an existing Resource Group from the dropdown below; or click the "Create new" button and enter the name and region of a new Resource Group when prompted.

At the "Location" field, select the location in which to store your workspace. Considerations include the location of the data on which you will be working and the location of developers and users who will access this workspace.

At the "Pricing Tier" dropdown, select the desired pricing tier. The Pricing Tier options are shown in Fig. 4.

db04-PricingTier
Fig. 4

If you wish to deploy this workspace to a particular virtual network, select the "Yes" radio button at this question.

When completed, the blade should look similar to Fig. 5.

db05-NewDataBricksBlade-Completed
Fig. 5

Click the [Create] button to create the new Databricks service. This may take a few minutes.

Navigate to the Databricks service, as shown in Fig. 6.

db06-OverviewBlade
Fig. 6

Click the [Launch Workspace] button (Fig. 7) to open the Azure Databricks page, as shown in Fig. 8.

db07-LaunchWorkspaceButton
Fig. 7

db08-DatabricksHomePage
Fig. 8

In this article, I showed you how to create a new Azure Databricks service. In future articles, I will show how to create clusters, notebooks, and otherwise make use of your Databricks service.

Friday, July 5, 2019 9:00:00 AM (GMT Daylight Time, UTC+01:00)