# Monday, July 1, 2019

Episode 570

Laurent Bugnion on Migrating Data to Azure

Laurent Bugnion describes how he migrated from on-premises MongoDB and SQL Server databases to CosmosDB and Azure SQL Database running in Microsoft Azure, using both native tools and the Database Migration Service.

Monday, July 1, 2019 9:39:00 AM (GMT Daylight Time, UTC+01:00)
# Friday, June 28, 2019

Azure Data Factory (ADF) is an example of an Extract, Transform, and Load (ETL) tool, meaning that it is designed to extract data from a source system, optionally transform its format, and load it into a different destination system.

The source and destination data can reside in different locations, in different data stores, and can support different data structures.

For example, you can extract data from an Azure SQL database and load it into an Azure Blob storage container.

To create a new Azure Data Factory, log into the Azure Portal, click the [Create a resource] button (Fig. 1) and select Integration | Data Factory from the menu, as shown in Fig. 2.

df01-CreateResource
Fig. 1

df02-IntegrationDataFactory
Fig. 2

The "New data factory" blade displays, as shown in Fig. 3.

df03-NewDataFactory
Fig. 3

At the "Name" field, enter a unique name for this Data Factory.

At the Subscription dropdown, select the subscription with which you want to associate this Data Factory. Most of you will only have one subscription, making this an easy choice.

At the "Resource Group" field, select an existing Resource Group or create a new Resource Group which will contain your Data Factory.

At the "Version" dropdown, select "V2".

At the "Location" dropdown, select the Azure region in which you want your Data Factory to reside. Consider the location of the data with which it will interact and try to keep the Data Factory close to this data, in order to reduce latency.

Check the "Enable GIT" checkbox, if you want to integrate your ETL code with a source control system.

After the Data Factory is created, you can search for it by name or within the Resource Group containing it. Fig. 4 shows the "Overview" blade of a Data Factory.

df04-OverviewBlade
Fig. 4

To begin using the Data Factory, click the [Author & Monitor] button in the middle of the blade.

The "Azure Data Factory Getting Started" page displays in a new browser tab, as shown in Fig. 5.

df05-GetStarted
Fig. 5

Click the [Copy Data] button (Fig. 6) to display the "Copy Data" wizard, as shown in Fig. 7.

df06-CopyDataIcon
Fig. 6

df07-Properties
Fig. 7

This wizard steps you through the process of creating a Pipeline and its associated artifacts. A Pipeline performs an ETL on a single source and destination and may be run on demand or on a schedule.

At the "Task name" field, enter a descriptive name to identify this pipeline later.

Optionally, you can add a description to your task.

You have the option to run the task on a regular or semi-regular schedule (Fig. 8); but you can set this later, so I prefer to select "Run once now" until I know it is working properly.

df08-Schedule
Fig. 8

Click the [Next] button to advance to the "Source data store" page, as shown in Fig. 9.

df09-Source
Fig. 9

Click the [+ Create new connection] button to display the "New Linked Service" dialog, as shown in Fig. 10.

df10-NewLinkedService
Fig. 10

This dialog lists all the supported data stores.
At the top of the dialog is a search box and a set of links, which allow you to filter the list of data stores, as shown in Fig. 11.

df11-AzureSql
Fig. 11

Fig. 12 shows the next dialog if you select Azure SQL Database as your data source.

df12-AzureSqlDetails
Fig. 12

In this dialog, you can enter information specific to the database from which you are extracting data. When complete, click the [Test connection] button to verify your entries are correct; then click the [Finish] button to close the dialog.

After successfully creating a new connection, the connection appears in the "Source data store" page, as shown in Fig. 13.

df13-Source
Fig. 13

Click the [Next] button to advance to the next page in the wizard, which asks questions specific to the type of data in your data source. Fig. 14 shows the page for Azure SQL databases, which allows you to select which tables to extract.

df14-SelectTables
Fig. 14

Click the [Next] button to advance to the "Destination data store", as shown in Fig. 15.

df15-Destination
Fig. 15

Click the [+ Create new connection] button to display the "New Linked Service" dialog, as shown in Fig. 16.

df16-NewLinkedService
Fig. 16

As with the source data connection, you can filter this list via the search box and top links, as shown in Fig. 17. Here we are selecting Azure Data Lake Storage Gen2 as our destination data store.

df17-NewLinkedService-ADL
Fig. 17

After selecting a service, click the [Continue] button to display a dialog requesting information about the data service you selected. Fig. 18 shows the page for Azure Data Lake. When complete, click the [Test connection] button to verify your entries are correct; then click the [Finish] button to close the dialog.

df18-ADLDetails
Fig. 18

After successfully creating a new connection, the connection appears in the "Destination data store" page, as shown in Fig. 19.

df19-Destination
Fig. 19

Click the [Next] button to advance to the next page in the wizard, which asks questions specific to the type of data in your data destination. Fig. 20 shows the page for Azure Data Lake, which allows you to select the destination folder and file name.

df20-ChooseOutput
Fig. 20

Click the [Next] button to advance to the "File format settings" page, as shown in Fig. 21.

df21-FileFormatSettings
Fig. 21

At the "File format" dropdown, select a format in which to structure your output file. The prompts change depending on the format you select. Fig.  21 shows the prompts for a Text format file.

Complete the page and click the [Next] button to advance to the "Settings" page, as shown in Fig. 22.

df22-Settings
Fig. 22

The important question here is "Fault tolerance": when an error occurs, do you want to abort the entire activity, skipping the remaining records, or do you want to log the error, skip the bad record, and continue with the remaining records?

Click the [Next] button to advance to the "Summary" page as shown in Fig. 23.

df23-Summary
Fig. 23

This page lists the selections you have made to this point. You may edit a section if you want to change any settings. When satisfied with your changes, click the [Next] button to kick off the activity and advance to the "Deployment complete" page, as shown in Fig. 24.

df24-DeploymentComplete
Fig. 24

You will see the progress of the major steps in this activity as they run. You can click the [Monitor] button to see a more detailed real-time progress report, or you can click the [Finish] button to close the wizard.

In this article, you learned about the Azure Data Factory and how to create a new data factory with an activity to copy data from a source to a destination.

Friday, June 28, 2019 9:04:00 AM (GMT Daylight Time, UTC+01:00)
# Thursday, June 27, 2019

GCast 54:

Azure Storage Replication

Learn about the data replication options in Azure Storage and how to set the option appropriate for your needs.

Thursday, June 27, 2019 4:16:00 PM (GMT Daylight Time, UTC+01:00)
# Tuesday, June 25, 2019

Data Lake storage is a type of Azure Storage that supports a hierarchical structure.

There are no pre-defined schemas in a Data Lake, so you have a lot of flexibility on the type of data you want to store. You can store structured data or unstructured data or both. In fact, you can store data of different data types and structures in the same Data Lake.

Typically a Data Lake is used for ingesting raw data in order to preserve that data in its original format. The low cost, lack of schema enforcement, and optimization for inserts make it ideal for this. From the Microsoft docs: "The idea with a data lake is to store everything in its original, untransformed state."

After saving the raw data, you can then use ETL tools, such as SSIS or Azure Data Factory, to copy and/or transform this data into a more usable format in another location.

Like most solutions in Azure, it is inherently highly scalable and highly reliable.

Data in Azure Data Lake is stored in a Data Lake Store.

Under the hood, a Data Lake Store is simply an Azure Storage account with some specific properties set.

To create a new Data Lake storage account, navigate to the Azure Portal, log in, and click the [Create a Resource] button (Fig. 1).

dl01-CreateResource
Fig. 1

From the menu, select Storage | Storage Account, as shown in Fig. 2.

dl02-MenuStorageAccount
Fig. 2

The "Create Storage Account" dialog with the "Basic" tab selected displays, as shown in Fig. 3.

dl03-Basics
Fig. 3

At the “Subscription” dropdown, select the subscription with which you want to associate this account. Most of you will have only one subscription.

At the "Resource group" field, select a resource group in which to store your service or click "Create new" to store it in a newly-created resource group. A resource group is a logical container for Azure resources.

At the "Storage account name" field, enter a unique name for the storage account.

At the "Location" field, select the Azure Region in which to store this service. Consider where the users of this service will be, so you can reduce latency.

At the "Performance" field, select the "Standard" radio button. You can select the "Premium" performance button to achieve faster reads; however, there may be better ways to store your data if performance is your primary objective.

At the "Account kind" field, select "Storage V2"

At the "Replication" dropdown, select your preferred replication. Replication is explained here.

At the "Access tier" field, select the "Hot" radio button.

Click the [Next: Advanced>] button to advance to the "Advanced" tab, as shown in Fig. 4.

dl04-Advanced
Fig. 4

The important field on this tab is "Hierarchical namespace". Select the "Enabled" radio button at this field.
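
If you would rather script this step, the Azure CLI can create the same kind of account. This is a minimal sketch with placeholder names and region; the key detail is enabling the hierarchical namespace, which is what makes the account a Data Lake Storage Gen2 account.

# Create a StorageV2 account with the hierarchical namespace enabled (placeholder names and region)
az storage account create --name mydatalakeacct --resource-group MyResourceGroup --location eastus --kind StorageV2 --sku Standard_LRS --enable-hierarchical-namespace true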

Click the [Review + Create] button to advance to the "Review + Create" tab, as shown in Fig. 5.

dl05-Review
Fig. 5

Verify all the information on this tab; then click the [Create] button to begin creating the Data Lake Store.

After a minute or so, a storage account is created. Navigate to this storage account and click the [Data Lake Gen2 file systems] button, as shown in Fig. 6.

dl06-Services
Fig. 6

The "File Systems" blade displays, as shown in Fig. 7.

dl07-FileSystem
Fig. 7

Data Lake data is partitioned into file systems, so you must create at least one file system. Click the [+ File System] button and enter a name for the file system you wish to create, as shown in Fig. 8.

dl08-AddFileSystem
Fig. 8

Click the [OK] button to add this file system and close the dialog. The newly-created file system displays, as shown in Fig. 9.

dl09-FileSystem
Fig. 9
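
Newer versions of the Azure CLI can also create the file system for you. This is a small sketch, using the placeholder account name from the earlier CLI example and signing in with your Azure AD credentials:

# Create a file system (container) in the Data Lake Storage Gen2 account
az storage fs create --name myfilesystem --account-name mydatalakeacct --auth-mode login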

If you double-click the file system in the list, a page displays where you can set access control and read about how to manage the files in this Data Lake Storage, as shown in Fig. 10.

dl10-FileSystem
Fig. 10

In this article, you learned how to create a Data Lake Storage and a file system within it.

Tuesday, June 25, 2019 10:10:00 AM (GMT Daylight Time, UTC+01:00)
# Thursday, June 20, 2019

GCast 53:

Creating a Data Warehouse in Azure

Learn how to create a new SQL Server data warehouse in Microsoft Azure.

Thursday, June 20, 2019 9:24:00 AM (GMT Daylight Time, UTC+01:00)
# Monday, May 20, 2019

Episode 564

David Makogon on Streaming Data

David Makogon talks about streaming data and the tools to help you make it happen.

David on Twitter

Monday, May 20, 2019 9:10:00 AM (GMT Daylight Time, UTC+01:00)
# Monday, April 22, 2019

Episode 560

Frank Gill on Azure SQL Database Managed Instances

DBA Frank Gill discusses Azure SQL Database Managed Instances - a cloud-based managed database service. He describes what they are, how they differ from Azure SQL Databases, and when it is appropriate to consider them.

Links:

https://skreebydba.com/
https://twitter.com/skreebydba

Monday, April 22, 2019 9:49:00 AM (GMT Daylight Time, UTC+01:00)
# Thursday, February 28, 2019

GCast 37:

Managing Blobs with the Azure Storage Explorer

The Azure Storage Explorer is a free resource to manage Azure Storage Accounts.
This video shows how to manage Azure blobs with this tool.

Thursday, February 28, 2019 8:55:00 AM (GMT Standard Time, UTC+00:00)
# Thursday, January 31, 2019

GCast 33:

An Introduction to Power BI

Power BI is a tool that allows you to create visualizations from a variety of data sources. This video shows how to get started with this tool.

Thursday, January 31, 2019 8:03:00 AM (GMT Standard Time, UTC+00:00)
# Tuesday, January 8, 2019

Microsoft Power BI is a tool for users to create visualizations of data from disparate sources. You can get started with the browser-based version of Power BI by navigating to https://powerbi.com/. This page is shown in Fig. 1.

pbi01-start
Fig. 1

If your company has an Office 365 account, you may be able to sign in by clicking the [Sign in] button and start using Power BI. If not, you can click the [START FREE] button to create a free account.

Once you are signed in, the Home page displays, as shown in Fig. 2.

pbi02-home
Fig. 2

You can add data to Power BI to begin working with it by clicking the [Get Data] button (Fig. 3).

pbi03-GetDataButton
Fig. 3

The "Get Data" page displays, as shown in Fig. 4.

pbi04-GetDataPage
Fig. 4

From this page, you can choose to import data published by your organization ("My organization"), from a third-party service ("Services"), from a file on your computer or network ("Files") or from data in a database ("Databases").

Click the [Get] button on the "Files" blade (Fig. 5) to display the "Files" page, as shown in Fig. 6.

pbi05-GetFilesTile
Fig. 5

pbi06-Files
Fig. 6

As you can see, you can import data from your local file system, from OneDrive, or from a SharePoint site.

Click the "Local File" blade (Fig. 7) to open a File Open dialog, as shown in Fig. 8.

pbi07-LocalFile
Fig. 7

pbi08-SelectFile
Fig. 8

Navigate to the folder containing your data file, select the file, and click the [Open] button.

A list of all files imported is displayed, as shown in Fig. 9.

pbi09-FileList
Fig. 9

Click on your data file name to display the "Ask a question about your data" blade (Fig. 10); then, click the file name in this blade.

pbi10-AskAQuestion
Fig. 10

A blank canvas displays, along with a side menu and a list of fields in your data, as shown in Fig. 11.

pbi11-BlankCanvas
Fig. 11

Select the checkboxes next to some of the fields to create the first visualization on the canvas. In the example in Fig. 12, I selected "nMonths", which holds a number from 1-12, representing the month (Jan-Dec) that a measurement was taken. This is set as the x value. I also selected "temp", which contains the measured temperature. This created the bar chart visualization at the left of the canvas.

pbi12-SelectData
Fig. 12

If you click on "temp" under "Value", you will notice that it shows the Sum of the temperatures, which is not very useful information. You can select something more useful, like "Average", "Minimum", or "Maximum" temperature from this menu, as shown in Fig. 13.

pbi13-SelectAggregateFunction
Fig. 13

If you don't like a bar chart, you can also change the type of visualization by selecting something different from the "VISUALIZATIONS" blade, as shown in Fig. 14.

pbi14-Visualizations
Fig. 14

This quick overview shows some of the features available in Microsoft Power BI.

Tuesday, January 8, 2019 9:54:00 AM (GMT Standard Time, UTC+00:00)
# Tuesday, November 6, 2018

Azure CosmosDB is a flexible, fast, reliable, scalable, geographically distributed NoSQL database.

You can create a CosmosDB account and database in the Azure portal.

Navigate to the Azure portal and log in.

Click the [Create a resource] button, as shown in Fig. 1.

CDB01-CreateResourceButton
Fig. 1

From the menu, select Database | Azure CosmosDB, as shown in Fig. 2.

CDB01-DatabaseCosmosDb
Fig. 2

The "Create Azure CosmosDB Account" blade displays, as shown in Fig. 3.

CDB03-CreateAzureCosmosDbAccount
Fig. 3

At the Subscription dropdown, select your Azure subscription. Most of you will have only one subscription.

At the Resource Group dropdown, select an existing resource group or click "Create new" to display the New Resource Group dialog, as shown in Fig. 4.

CDB04-CreateResourceGroup
Fig. 4

In the New Resource Group dialog, enter a unique name for your resource group and click the [OK] button.

At the "API" dropdown, select the API you want to use to access the databases in this account, as shown in Fig. 5.  Options are

  • Core (SQL)
  • MongoDB
  • Cassandra
  • Azure Table
  • Gremlin (graph)

CDB05-Api
Fig. 5

If you are migrating data from another database, you may want to choose the API that resembles your old database in order to minimize changes to the client code accessing the database. If this is a new database, you may wish to choose the API with which you and your team are most familiar.

At the "Location" dropdown, select a region in which to store your data. It is a good idea to keep your data near your users and/or near any services that will interact with your data.

The "Geo-Redundancy" and "Multi-region writes" options allow you to globally distribute your data. There is an extra charge for enabling these features.

You can enable Geo-Redundancy by clicking the [Enable] button next to "Geo-Redundancy". This creates a copy of your data in another nearby region and keeps that data in sync.

Click the [Enable] button next to "Multi-region writes" if you wish to allow data to be written in multiple regions. This will improve the performance when writing data to the database.

Notice the tabs at the top of the page (Fig. 6). The "Basics" tab displays first, but the "Network", "Tags", and "Summary" tabs are also available.

CDB06-Tabs
Fig. 6

The "Network" tab (Fig. 7) allows you to add your CosmosDB account to a specific Virtual Network and Subnet. This is not required.

CDB07-NetworkTab
Fig. 7

The "Tags" tab (Fig. 8) allows you to assign metadata to this CosmosDB account, which may help when grouping together related accounts on a report. This is not required.

CDB08-TagsTab
Fig. 8

The "Summary" tab (Fig. 9) displays all the options you have chosen and validates that you completed the required responses and that all responses are consistent. You can navigate to this tab by clicking the "Summary" tab link at the top or by clicking the [Review + create] button on any other tab.

CDB09-SummaryTab
Fig. 9

Click the [Create] button to begin creating your CosmosDB account. This will take a few minutes. A message displays as shown in Fig. 10 when the account is created and deployed.

CDB10-DeploymentComplete
Fig. 10

As you can see, there are a number of links to documentation and tutorials.

Click the [Go to resource] button to open the CosmosDB account. By default, the "Quick start" blade displays, as shown in Fig. 11.

CDB11-CosmosDBQuickStartPage
Fig. 11

In this article, I showed how to create a new Azure CosmosDB account. In the next article, I will show how to add a database with containers to that account.

Tuesday, November 6, 2018 6:28:00 AM (GMT Standard Time, UTC+00:00)
# Thursday, October 18, 2018
Thursday, October 18, 2018 9:54:00 PM (GMT Daylight Time, UTC+01:00)
# Thursday, October 11, 2018
Thursday, October 11, 2018 3:20:41 PM (GMT Daylight Time, UTC+01:00)
# Thursday, October 4, 2018
Thursday, October 4, 2018 4:08:28 PM (GMT Daylight Time, UTC+01:00)
# Tuesday, September 11, 2018

An Azure storage account gives you the ability to create, store, and manage tables, queues, blobs, and files.

Azure Storage Services

Available Azure storage services are:

Blobs

A blob is any unstructured or semi-structured object that you want to store. Examples include videos, images, and text files. Blob storage lets you store an object without the system having to know much about that object.

Files

Azure Files allows you to store files in Azure and access them as a file share using the standard SMB protocol. Azure Files runs on the same Azure Storage account infrastructure as blobs.

Tables

Azure Table Storage provides a simple NoSQL database for your application. Data in an Azure Table is stored as rows. Each row contains a key and one or more properties. You do not need to define these properties beforehand, and you can define different properties for different rows in the same table.
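
For example, the Azure CLI can insert two rows with completely different properties into the same table. This is only an illustration with placeholder account and table names; it assumes the table already exists and that your account key or connection string is available to the CLI.

# Two rows in the same table, each with its own set of properties
az storage entity insert --account-name mystorageacct --table-name Products --entity PartitionKey=Monitors RowKey=001 DiagonalInches=27
az storage entity insert --account-name mystorageacct --table-name Products --entity PartitionKey=HardDrives RowKey=002 CapacityGB=512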

Queues

A queue is a popular mechanism for designing asynchronous applications. One application can drop a message onto a queue and another application can pick up that message later and process it. This decoupling allows for scalability, flexibility, and faster response times.

Creating an Azure Storage Account

To create a new Azure Storage Account, navigate to https://portal.azure.com.

Click the [Create a resource] button; then select Storage | Storage account - blob, file, table, queue from the menu, as shown in Fig. 1.

Fig01-NewResource
Fig. 1

The New Storage Account blade displays, as shown in Fig. 2.

Fig02-NewStorage
Fig. 2

At the "name" field, enter a unique name for your storage account. You will be able to access this account through a REST API by sending requests to

https://accountname.core.windows.net

where accountname is the name you enter for your account.

At the "Deployment model" radio button, select "Resource manager".

At the "Account kind" dropdown, select "Storage (general purpose)"

At the "Location" dropdown, select a location in which to create your account. To minimize latency, you should create an account either near you or near the users and applications that will access the data in this account.

At the "Resource Group" field, click the "Create new" link to display the "Create Resource Group" dialog, as shown in Fig. 3.

Fig03-NewResourceGroup
Fig. 3

Enter a unique name for your new resource group and click the [OK] button to close the dialog and return to the previous blade.

Review your Storage Account settings and click the [Create] button to create your new storage account.
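
The equivalent account can also be created from the Azure CLI. This is a minimal sketch with placeholder names and region; adjust the --kind and --sku values to match the choices described above.

# Create a resource group and a general-purpose storage account (placeholder names and region)
az group create --name MyStorageResourceGroup --location eastus
az storage account create --name mystorageacct --resource-group MyStorageResourceGroup --location eastus --kind Storage --sku Standard_LRS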

After a few seconds, you will be able to access the properties of your Storage Account, manage the account, and connect to the account.

The "Overview" tab (Fig. 4) displays information about the account, along with links to create and manage Blobs, Files, Tables, and Queues.

Fig04-StorageOverview
Fig. 4

Click the "Blobs" link in the "Services" section of the "Overview" tab to create and manage blobs and containers, as shown in Fig. 5.

Fig05-BlobsBlade
Fig. 5

Click the "Files" link in the "Services" section of the "Overview" tab to create and manage files, as shown in Fig. 6.

Fig06-FilesBlade
Fig. 6

Click the "Tables" link in the "Services" section of the "Overview" tab to create and manage tables , as shown in Fig. 7.

Fig07-TablesBlade
Fig. 7

Click the "Queues" link in the "Services" section of the "Overview" tab to create and manage queues, as shown in Fig. 8.

Fig08-QueuesBlade
Fig. 8
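
Each of the four services can also be created from the Azure CLI. This sketch assumes the placeholder account name below is replaced with your own and that the CLI has your account key or connection string.

# Create a blob container, a file share, a table, and a queue in the account
az storage container create --name mycontainer --account-name mystorageacct
az storage share create --name myshare --account-name mystorageacct
az storage table create --name mytable --account-name mystorageacct
az storage queue create --name myqueue --account-name mystorageacct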

Azure Storage gives you many options for storing data in the cloud.

Tuesday, September 11, 2018 9:37:00 AM (GMT Daylight Time, UTC+01:00)
# Thursday, July 19, 2018

GCast 7:

Azure SQL Database

Azure MySQL Database

Thursday, July 19, 2018 8:46:00 AM (GMT Daylight Time, UTC+01:00)
# Thursday, July 12, 2018

GCast 6:

Azure SQL Database

How to create an Azure SQL database in the Azure portal.

Thursday, July 12, 2018 9:16:00 AM (GMT Daylight Time, UTC+01:00)
# Monday, July 2, 2018
Monday, July 2, 2018 9:35:00 AM (GMT Daylight Time, UTC+01:00)
# Sunday, June 24, 2018

MySQL is a popular relational database. Although the product is owned by Oracle, which offers several versions for purchase, it is an open source project, and a free "Community Edition" is available.

There are several ways to install and start running MySQL.

You can download a copy from https://www.mysql.com/downloads/. The community edition offers install files for various platforms (an MSI file for Windows) that allow you to install the database engine locally.

My preferred option is to install a database on Azure and store my data in the cloud. Because Azure offers MySQL as a service, I don't need to create a Virtual Machine or install the database engine anywhere. To create a MySQL service in Azure, navigate to the Azure portal and select Create a Resource | Databases | Azure Database for MySQL; then complete the blade that displays.
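
You can also create the Azure Database for MySQL service from the Azure CLI or Cloud Shell. This is a rough sketch with placeholder names; pick a --sku-name and admin credentials appropriate for your workload.

# Create a resource group and an Azure Database for MySQL server (placeholder names and region)
az group create --name MyMySqlResourceGroup --location eastus
az mysql server create --resource-group MyMySqlResourceGroup --name mymysqlserver --location eastus --admin-user myadmin --admin-password <a-strong-password> --sku-name GP_Gen5_2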

Whether using a cloud service or installing locally, you will be prompted to create an admin user and password. Remember these.

Once you have access to MySQL, you can use the command line to create and manage databases. On a local machine, open a command prompt; in Azure, you can click the "Cloud Shell" button at the top of the Azure portal.

To begin working with MySQL, you need to connect to the database engine with the following command:

mysql -h hostname -u username -p
  

where hostname is the name of the MySQL instance and username is the name of the admin user.

For example, if I have an Azure MySQL instance named "dgmysql" and a user named "dgiard", I would type the following:

mysql -h dgmysql.mysql.database.azure.com -u dgiard@dgmysql -p
  

When prompted, enter the appropriate password for the user.

The following commands are useful for working with databases.

# Create a new database
CREATE DATABASE databasename;

# List all databases available
SHOW DATABASES;

# Switch to a specific database
USE publicspeaking;
  

Now you can start using standard SQL commands to work with database objects and data in the current database.

For example, you could create and query a simple table (the schema here is only an illustration):

# Create a table (illustrative schema)
CREATE TABLE talks (id INT PRIMARY KEY, title VARCHAR(100));

# Insert a row
INSERT INTO talks (id, title) VALUES (1, 'Intro to MySQL');

# Query the table
SELECT id, title FROM talks;

For the most part, MySQL follows ANSI SQL syntax.

If you are familiar with another relational database, such as Microsoft SQL Server, working with MySQL will feel natural.

Sunday, June 24, 2018 7:16:00 AM (GMT Daylight Time, UTC+01:00)
# Tuesday, May 8, 2018

The DataFest concept was created back in 2011 by the American Statistical Association. Students are provided a large data set and are given 2 full days to report on some useful insights and/or visualizations about the data.

I attended the ASA DataFest at the University of Toronto May 1-2. The event was organized by UT Professor Nathan Taback, who served as host.

The students – all from U of T – worked in teams of 2-4 and presented their findings on the evening of the second day. This was a judged competition with prizes for the top 4 teams.

Students had no knowledge of the data set before it was made available to them when they showed up at the venue. Data was provided by the Indeed job search engine and included information on job postings in the United States, Canada, and Germany.

Following Dr. Taback's opening remarks, I delivered a presentation on Data Science tools in Azure, including demos of Machine Learning Studio and Azure Notebooks. Over half the teams ended up using these tools in their analysis.

In addition to the opening ceremonies, I served as a mentor during the DataFest and a judge at the end. Several professors and students donated their time as mentors during the event, and judges included professors and industry professionals. I also recruited local MVPs Atley Hunter and Vivek Patel, along with user group leader Ashraf Ghonaim, to serve as mentors and/or judges.

Almost 200 students attended, and 19 teams presented their findings on Day 2.

The winning team used Azure ML Studio to split users into low, medium, and high salary ranges and to determine the factors required to move from one level to the next.

Microsoft donated prizes and money for food to the event (along with my time) and Azure credits for the students to use.

Tuesday, May 8, 2018 12:01:00 PM (GMT Daylight Time, UTC+01:00)
# Monday, November 20, 2017
Monday, November 20, 2017 9:43:00 AM (GMT Standard Time, UTC+00:00)
# Monday, May 15, 2017
Monday, May 15, 2017 1:49:00 PM (GMT Daylight Time, UTC+01:00)
# Tuesday, February 7, 2017

Managing Big Data takes a lot of processing power. Data often needs to be captured, scrubbed, merged, and queried, and each of these steps can take many hours of compute time. But often they can be performed in parallel - reducing the amount of time, but increasing the number of computers required.

You could buy a bunch of computers, cluster them, and process your data on this cluster. But this is expensive, and these computers are likely to sit idle most of the time.

Cloud Computing tends to be an ideal solution for most Big Data processing because you can rent the servers you need and only pay for them while they are running.

Microsoft Azure offers a full suite of Big Data tools. These tools are based on the popular Hadoop open source project and are collectively known as "HD Insight".

HBase

HBase is a NoSQL data store that is optimized for big data. Unlike SQL Server and other relational databases, the database does not enforce referential integrity, pre-defined schemas, or auto-generated keys. The developer must code these features into the client application. Because the database doesn't need to worry about these things, inputting data tends to be much faster than in a relational database.

HBase also can be scaled to store petabytes of data.

Storm

Apache Storm is a framework that allows you to build workflow engines against real-time data. This is ideal for scenarios like collecting IoT data. The Storm topology consists of a Stream, which is a container that holds a Spout and one or more Bolts. A Spout is a component that accepts data into the Stream and hands it off to Bolts. Each Bolt takes in data; performs some discrete actions, such as cleaning up the data or looking up values from IDs; and passes data on to one or more other Bolts. Data is passed as "Tuples", which are sets of name-value pairs formatted as JSON. You can write your code in C#, Java, or Python, and a Visual Studio template helps you create these components.

Hive

Hive is a data warehouse. With it, you can query NoSQL data (such as HBase) and relational data (such as SQL Server). Hive ships with a query language - HiveQL - that is similar to SQL. Where HiveQL falls short, you can even write user-defined functions to perform more complex calculations.

Spark

Spark is a fast, general-purpose data processing engine that is well suited to interactive analysis and visualization. In Spark, you can write code in R, Python, or Scala. Jupyter notebooks are a popular interactive tool that allows you to create templates consisting of text and code, so that you can generate real-time reports. Jupyter notebooks support both Python and Scala. Spark also ships with a number of libraries that make it easier to connect to data, create graphs, and perform a number of other tasks.

Clusters

Each of the services described above supports running in clusters of servers. In a cluster, these servers process in parallel, greatly reducing the amount of time required to process the data. You can easily create a cluster in the portal, or you can write a script in PowerShell or the CLI.

The ease of creating clusters is a big advantage of running HD Insight over deploying your own Hadoop servers and clustering them yourself. Of course, the other advantage is that you do not have to purchase and maintain servers that are only being used occasionally, which can be a big cost saving.

Warning

One word of caution about using these services: you pay for each server in a cluster by the minute, and this can quickly add up. Typically, you don't need a cluster running for very long to complete its tasks, so it is a good idea to shut it down when it is finished. Scripting the creation and deletion of your cluster makes these tasks easy to repeat.
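
For example, the Azure CLI can create and delete a cluster, which makes it easy to spin one up only when you need it. This is only a sketch with placeholder names; a storage account must already exist, and you should check az hdinsight create --help for the full list of required parameters for your cluster type.

# Create a Spark cluster (placeholder names and passwords)
az hdinsight create --name mycluster --resource-group MyHdiResourceGroup --type spark --http-password <a-strong-password> --ssh-password <a-strong-password> --storage-account mystorageaccount

# Delete the cluster when processing is finished to stop the per-minute charges
az hdinsight delete --name mycluster --resource-group MyHdiResourceGroup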

Tuesday, February 7, 2017 6:08:01 PM (GMT Standard Time, UTC+00:00)
# Monday, December 26, 2016
Monday, December 26, 2016 10:48:00 AM (GMT Standard Time, UTC+00:00)
# Monday, November 28, 2016
Monday, November 28, 2016 9:34:00 AM (GMT Standard Time, UTC+00:00)
# Saturday, May 21, 2016

Last month, I had the privilege of attending the AWS Summit in Chicago. It was a great experience for me because, although I do a lot of work with cloud computing, I have very little experience with the Amazon Web Services (AWS) platform.

The most interesting session I attended was about a service called "Aurora" (Amazon tends to give all their services catchy names). This is a relational database that looks and acts almost exactly like MySQL but runs much faster. The official product page brags that Aurora is a "MySQL-compatible relational database with 5X performance"; however, the session I attended claimed that they found cases in which Aurora was 63 times faster than MySQL. The presenters didn't share details of those cases, but even if results are only a fraction of that speed, it's still an impressive performance improvement.

Because Aurora is MySQL-compliant, you should be able to plug it into any application and use it just like MySQL. The SQL syntax is identical and the management tools will be familiar to anyone used to managing MySQL.

Of course, the fact that Aurora is hosted on a cloud platform like AWS gives it the advantage of high availability and flexible scaling that cloud computing offers.

Since most of my cloud computing experience is with Microsoft Azure, I tend to use Azure as a reference point for the services I saw at this summit. I was drawn to Aurora in part because I'm not aware of the same offering in Microsoft Azure.

MySQL as a service is available on Azure, but it's offered and supported by ClearDb - a third party. If you want better performance or scalability on Azure than ClearDb offers, you will need to either switch to a different database or create a Virtual Machine and install MySQL on it, in which case you would be using Infrastructure as a Service instead of Software as a Service.

In many cases, this is a non-issue. If you are building a new application, you have the flexibility to choose your preferred database technology. MySQL and SQL server have very similar languages; and, although I won't get into a debate here as to which is "better", it would be difficult to argue that SQL server is significantly less reliable or enterprise-ready than MySQL.

But there are times when you don't have a choice of database technologies. For example, if you have a large legacy application that you want to migrate to Azure, it may be a daunting task to migrate every stored procedure and SQL statement to use T-SQL.  Or if you are using a framework that is specifically built on top of MySQL, it makes sense to use that database, rather than re-writing the entire data access layer. Luckily, some frameworks have alternative data access layers. For example, Project Nami is a data access layer for WordPress that uses SQL Server as a data store, rather than MySQL.

Although the various cloud computing companies follow one another and are likely to build a service when they see traction on their competitor's platform, I find it interesting to see these gaps in offerings.

Saturday, May 21, 2016 11:28:00 AM (GMT Daylight Time, UTC+01:00)
# Monday, March 28, 2016
Monday, March 28, 2016 10:27:00 AM (GMT Daylight Time, UTC+01:00)
# Monday, January 4, 2016
Monday, January 4, 2016 12:29:00 PM (GMT Standard Time, UTC+00:00)
# Wednesday, September 23, 2015

Earlier this week, dozens of technologists from the Microsoft DX Team met in San Diego for a team hackathon.

Some brought projects they started back home; some brought hardware with them to control via Bluetooth or USB cable or through the Internet; some brought an idea for a software project; some for a hardware project.

I came with a desire to learn more about Azure Machine Learning. I was inspired by the work that my teammate Jennifer Marsman was doing analyzing EEG data with AML. (link)

I began by walking through a couple tutorials: here and here.

Then I tried it myself. AML provides some sample data sources, so I imported the xxx data. I cleaned the data and applied a Category algorithm.

Machine Learning seems complex and the AML tools are not all intuitive when you first begin working with them; but they are not difficult to master. And the graphical interface of ML Studio lowers the learning curve considerably.

I'll provide more details and instructions about this project in a future blog post.

For now, my message is that building something yourself is the best way to learn any technology. Pick a project, set aside some time, and build it. I know that not every company invests in a day of hacking like mine did, so many of you will need to invest your own time in order to get this benefit. But it’s worth it.

My project wasn't nearly as sexy as some created by my colleagues. But my knowledge of Machine Learning is an order of magnitude greater than it was a week ago.

DXHack
My Machine Learning experiment

Wednesday, September 23, 2015 10:48:13 PM (GMT Daylight Time, UTC+01:00)
# Wednesday, September 2, 2015

One of the nice things about Azure storage is that Azure always makes extra copies of your data. How and where those copies are made is up to you.

In the current Azure portal, you select the REPLICATION property of a new Storage Account; in the Azure Preview Portal, you select the Storage Account Type.

In both cases, the options are:

  • Locally Redundant
  • Geo-Redundant
  • Read-Access Geo-Redundant
  • Zone Redundant

Here is an explanation of each type:

Locally Redundant

Three copies of your storage data are created - all within the same region. No two copies will reside in the same Fault Domain and no two copies will reside in the same Upgrade domain.

This provides fault tolerance in case of the failure of one of the machines on which a copy of the storage account is stored. If the entire data center goes down, no copies of your data will be available.

This is the cheapest of the available redundancy options.

Geo-Redundant

As with Locally Redundant storage, Geo-Redundant storage also creates 3 copies of your data on separate fault domains and update domains in the same data center. But it also creates 3 more copies of your data in another region - typically the region nearest the primary region for this account. For example, if you select North Central US as your storage account's primary region, the account data will be replicated in the South Central US Region.

Once this cross-region replication occurs, you are protected from data loss, even if an entire Azure region fails.

Read-Access Geo-Redundant

Read-Access Geo-Redundant storage is identical to Geo-Redundant storage, but it also provides read access to data stored in the secondary region.

Zone Redundant

Three copies of your storage data are created and stored in at least 2 geographically disparate data centers. These data centers may or may not be in the same Region. This provides fault tolerance, even if an entire data center fails.

Zone Redundant Storage Accounts only support Block Blob storage, so selecting this option will limit the uses of your Storage account.
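
If you create a storage account with the Azure CLI rather than the portal, the replication option is expressed as part of the SKU. This is a minimal sketch with placeholder names and region.

# The --sku value selects the replication option:
# Standard_LRS = Locally Redundant, Standard_GRS = Geo-Redundant, Standard_RAGRS = Read-Access Geo-Redundant, Standard_ZRS = Zone Redundant
az storage account create --name mystorageacct --resource-group MyResourceGroup --location eastus --sku Standard_GRS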

Considerations

Data within a region or data center is always distributed across multiple update domains and fault domains to protect against most hardware failures or planned maintenance downtime.

Replication within a data center is an atomic operation. In other words, success is not reported to the client until all 3 copies have been successfully written.

Replication to a secondary data center is done asynchronously and typically completes after success has been reported to the client. The good news about this is that clients don't experience any extra latency when writing to a Geo-Redundant storage account. A potential downside is that data in the secondary data center is only eventually consistent. If the primary data center fails, it is possible that not all data has yet been written to the secondary data center.

Geo-Redundant and Read-Access Geo-Redundant are very similar - both create 6 copies of your data spread across two regions. The difference is that in a Geo-Redundant scenario, the data in the secondary region is only accessible in the event of a failure in the primary region. If all 3 copies of the data in the primary region are unavailable, Azure will fail over to a copy of the data in the secondary region. This also holds true with Read-Access Geo-Redundant, but you get one more benefit: users can access a read-only copy of the data in the secondary region, even if there is no failure in the first region. This can make for greater availability and access speed for users. This also explains why Read-Access Geo-Redundant is the most expensive option: it's the only option that allows users to read the secondary copies of the data.

Which Should I choose?

For maximum performance and reliability, Read-Access Geo-Redundant storage is your best option, but it is also the most expensive. If you are very cost-conscious, or if the government requires you to keep data within specific geographic boundaries, you should consider Locally Redundant or Zone Redundant storage. Geo-Redundant storage is a good compromise between these options for most scenarios.

Wednesday, September 2, 2015 1:08:06 AM (GMT Daylight Time, UTC+01:00)
# Thursday, July 30, 2015

In this video, you will see how to use the portal to quickly create a table linked to an Azure Mobile Service and a Windows Universal App client that connects to that mobile service.

G-Cast 2

Thursday, July 30, 2015 12:10:00 PM (GMT Daylight Time, UTC+01:00)
# Tuesday, June 23, 2015

The past few years, I've heard a lot about something called NoSQL. Some people really love it. Those who love it talk about its lack of ceremony, the speed with which you can develop, the speed with which it reads and writes, and its scalability. It all sounds so awesome!

But I grew up on relational databases. My first computer language was FoxPro, which included a relational database and supported a powerful version of SQL. From there, I graduated to SQL Server and I've dabbled occasionally in Microsoft Access. I've even worked with Oracle and MySQL and, as a developer, I find them intuitive. Should I abandon the databases with which I am familiar and travel to this brave, new world of NoSQL? Is NoSQL the best solution for every project? For some projects? How do I know?

Let's start with a definition for NoSQL. This is harder than you might think, because NoSQL databases are basically defined by what they are not. The only real definition is that they are not SQL databases. They tend not to have pre-defined schemas; they tend not to enforce relationships; they tend to be able to store hierarchical data; and, of course, they tend not to support the SQL language (although some support syntaxes similar to SQL, such as LINQ). These are broad definitions and only address things that NoSQL databases don't do. There is no standard language, syntax, storage mechanism, or API for addressing NoSQL databases.

For purposes of this article, I'll define SQL databases as those in which the database engine provides the following features:

  1. Supports SQL
  2. Enforces pre-defined schemas
  3. Enforces referential integrity among related, normalized tables

This includes database engines supported by large companies, such as Microsoft SQL Server and Oracle, as well as Open Source databases, such as MySQL.

I'll lump all other persistent storage technologies as NoSQL databases. This includes MongoDB, RavenDB, Azure table storage, and DocumentDB.

So when should we choose good old SQL databases and when should we use this newfangled NoSQL thing?

Let's start with SQL databases. They have a few advantages:

SQL databases

Advantages of SQL DBs

First, they are relational, so they make it easy to normalize your database into a set of related tables. This almost always saves disk space and often makes your data more consistent (e.g., you can change the name of a product in one table and it changes throughout your entire application). Databases like this also allow you to create persistent relationships between these tables, and these relationships enforce referential integrity, ensuring that we are not left with orphaned records (who wants an order line without a corresponding order?). You can set up cascading deletes or force users to create and delete records in an order that never leaves inconsistent data.

The SQL language itself is very flexible, allowing users to create either pre-defined or ad-hoc queries against a relational database. This makes SQL databases great for reporting.

The schema in a SQL database helps catch errors in almost the same way that a compiler or a unit test does. If you want to capture a customer's last name and you create a “LastName” column, but one time you accidentally misspell it as "LastNmae", the database will catch this and throw an exception which should be early and obvious enough for you to fix the error.

Disadvantages of SQL DBs

But these features come at a price. There is overhead in enforcing database schemas and referential integrity. As a result, saving to a SQL database tends to be slower.

Also, when developers build an application intended for human interaction, they almost never structure the application's objects in the way that they normalize the data in their relational database. An entire class of Object Relational Mapper (ORM) software exists simply to deal with this mismatch. It requires code, time, and CPU cycles to map between objects in an application and data in a database.

NoSQL databases

Advantages of NoSQL DBs

Because NoSQL databases don't need to enforce schemas or relationships, they tend to perform faster than their SQL cousins.

Database development tends to be faster because developers and DBAs are not required to pre-define the columns in each table.

The lack of database relationship enforcement also makes it easier to move parts of a database to another server, which makes it easier to support very large data sets. Relational databases can move across servers, but it tends to be more difficult because of their need to enforce referential integrity.

The lack of schema also adds flexibility, especially if you are capturing data in which different objects may have different properties. For example, a product catalogue table may contain some items (such as computer monitors) for which diagonal size in inches is an important property, and other items (such as hard drives) for which capacity in GB is an important property. Mapping these disparate needs to a relational database table would add complexity to your data model.

Finally, it is possible to serialize and de-serialize objects in the same format that they are used in an application's user interface. This eliminates the need for an ORM, which makes applications simpler and faster.

Disadvantages of NoSQL DBs

When reading data, NoSQL databases tend to be very fast, as long as you are looking up rows by an index or key. If you want to look up a row by any other property or filter your data by a property, this often requires a full table scan, which is very slow. Some NoSQL databases allow you to create indexes on non-key properties, which speeds up such searches but slows down data writes - decreasing one of the advantages of NoSQL.

Other factors

It's worth looking at the cost of any database solution. For example, Azure provides both SQL and NoSQL databases as a service. If we compare the cost of Azure SQL Database with Azure table storage (a NoSQL option), we can see that the price of table storage is far less than the cost of Azure SQL Database. Table storage might not be the answer for your application, but it's worth examining whether some of your data can work with Azure table storage.

Conclusion

As with most questions facing IT developers, architects and managers, there is no clear-cut answer to whether to use SQL or NoSQL databases. SQL databases tend to be better when ad-hoc reporting is required, while NoSQL databases tend to shine when saving and retrieving transactional data from a user application. Many applications take advantage of the features of both database types by creating a NoSQL database with which their application interacts; then transforming and regularly copying this data into a relational database, which can be queried and reported on.

There are many options for your persistent storage needs. Choose the right one for your application.

Tuesday, June 23, 2015 2:42:00 PM (GMT Daylight Time, UTC+01:00)
# Monday, June 15, 2015
Monday, June 15, 2015 1:54:00 PM (GMT Daylight Time, UTC+01:00)
# Monday, May 26, 2014
Monday, May 26, 2014 6:11:00 PM (GMT Daylight Time, UTC+01:00)
# Monday, April 21, 2014
Monday, April 21, 2014 11:01:22 PM (GMT Daylight Time, UTC+01:00)
# Monday, February 17, 2014
Monday, February 17, 2014 5:01:00 PM (GMT Standard Time, UTC+00:00)
# Monday, September 2, 2013
Monday, September 2, 2013 6:16:00 PM (GMT Daylight Time, UTC+01:00)