# Tuesday, 07 February 2017

Managing Big Data takes a lot of process power. Data often needs to be captured, scrubbed, merged, and queried and each of these things can take many hours of compute time. But often they can be performed in parallel - reducing the amount of time, but increasing the number of computers required.

You could buy a bunch of computers, cluster them, and process your data on this process. But this is expensive and these computers are likely to sit idle most of the time.

Cloud Computing tends to be an ideals solution for most Big Data processing because you can rent the servers you need and only pay for them while they are running.

Microsoft Azure offers a full suite of Big Data tools. These tools are based on the popular Hadoop open source project and are collectively known as "HD Insight".


HBase is a NoSQL data store that is optimized for big data. Unlike SQL Server and other relational databases, the database does not enforce referential integrity, pre-defined schemas, or auto-generated keys. The developer must code these features into the client application. Because the database doesn't need to worry about these things, inputting data tends to be much faster than in a relational database.

HBase also can be scaled to store petabytes of data.


Apache Storm is a framework that allows you to build workflow engines against real-time data. This is ideal for scenarios like collecting IoT data. The Storm topology consists of a Stream, which is a container that holds a Spout and one or more Bolts. A Spout is a component that accepts data into the Stream and hands it off to Bolts. Each Bolt takes in data; preforms some discrete actions, such as cleaning up the data or looking up values from IDs; and passes data onto one or more other Bolts. Data is passed as "Tuples", which are sets of name-value pairs formatted as JSON. You can write your code in C#, Java, or Python and a Visual Studio template helps you create these components.


Hive is a data warehouse. With it, you can query NoSQL data (such as Hive) and relational data (such as SQL Server). Hive ships with a query language - HiveQL - that is similar to SQL. Where HiveQL falls short, you can even write user-defined functions to perform more complex calculations.


Spark is a visualization tool. In Spark, you can write code in R, Python, or Scala. Jupyter notebooks are a popular interactive tools that allow you to create templates consisting of text and code, so that you can generate real-time reports. Jupyter notebooks support both Python and Scala. Spark also ships with a number of libraries that make it easier to connect to data, create graphs, and perform a number of other tasks.


Each of the services described above supports running in clusters of servers. In a cluster, these servers process in parallel, greatly reducing the amount of time required to process the data.  You can easily create a cluster in the portal or you can write a script in PowerShell or CLI.

The ease of creating clusters is a big advantage of running HD Insight over deploying your own Hadoop servers and clustering them yourself. Of course, the other advantage is that you do not have to purchase and maintain servers that are only being used occasionally, which can be a big cost saving.


One word of caution about using these services. You pay for each server in a cluster by the minute. This can quickly add up. Typically, you don't need to have your cluster running for very long in order to complete tasks, so it is a good idea to shut them down when they are finished. Because of this, it's a good idea to script the creation and deletion of your cluster to make it easy to perform these tasks.

Tuesday, 07 February 2017 18:08:01 (GMT Standard Time, UTC+00:00)
# Monday, 26 December 2016
Monday, 26 December 2016 10:48:00 (GMT Standard Time, UTC+00:00)
# Monday, 28 November 2016
Monday, 28 November 2016 09:34:00 (GMT Standard Time, UTC+00:00)
# Saturday, 21 May 2016

Last month, I had the privilege of attending the AWS Summit in Chicago. It was a great experience for me because, although I do a lot of work with cloud computing, I have very little experience with the Amazon Web Services (AWS) platform.

The most interesting session I attended was about a service called "Aurora" (Amazon tends to give all their services catchy names). This is a relational database that looks and acts almost exactly like MySQL but runs much faster. The official product page brags that Aurora is a "MySQL-compatible relational database with 5X performance", however the session I attended claimed that they found cases in which Aurora was 63 times faster than MySQL. The presenters didn't share details of those cases, but even if results are only a fraction of that speed, it's still an impressive performance improvement.

Because Aurora is MySQL-compliant, you should be able to plug it into any application and use it just like MySQL. The SQL syntax is identical and the management tools will be familiar to anyone used to managing MySQL.

Of course, the fact that Aurora is hosted on a cloud platform like AWS gives it the advantage of high availability and flexible scaling that cloud computing offers.

Since most of my cloud computing experience is with Microsoft Azure, I tend to use Azure as a reference point for the services I saw at this summit. I was drawn to Aurora in part because I'm not aware of the same offering in Microsoft Azure.

MySQL as a service is available on Azure, but it's offered and supported by ClearDb - a third party.  If you want better performance or scalability on Azure than that offered by ClearDb, you will need to either switch to a different database or create a Virtual Machine and install MySQL on that, in which case you would be using Infrastructure as a Service, instead of Software as a Service.

In many cases, this is a non-issue. If you are building a new application, you have the flexibility to choose your preferred database technology. MySQL and SQL server have very similar languages; and, although I won't get into a debate here as to which is "better", it would be difficult to argue that SQL server is significantly less reliable or enterprise-ready than MySQL.

But there are times when you don't have a choice of database technologies. For example, if you have a large legacy application that you want to migrate to Azure, it may be a daunting task to migrate every stored procedure and SQL statement to use T-SQL.  Or if you are using a framework that is specifically built on top of MySQL, it makes sense to use that database, rather than re-writing the entire data access layer. Luckily, some frameworks have alternative data access layers. For example, Project Nami is a data access layer for WordPress that uses SQL Server as a data store, rather than MySQL.

Although the various cloud computing companies follow one another and are likely to build a service when they see traction on their competitor's platform, I find it interesting to see these gaps in offerings.

Saturday, 21 May 2016 11:28:00 (GMT Daylight Time, UTC+01:00)
# Monday, 28 March 2016
Monday, 28 March 2016 10:27:00 (GMT Daylight Time, UTC+01:00)
# Monday, 04 January 2016
Monday, 04 January 2016 12:29:00 (GMT Standard Time, UTC+00:00)
# Wednesday, 23 September 2015

Earlier this week, dozens of technologists from the Microsoft DX Team met in San Diego for a team hackathon.

Some brought projects they started back home; some brought hardware with them to control via Bluetooth or USB cable or through the Internet; some brought an idea for a software project; some for a hardware project.

I came with a desire to learn more about Azure Machine Learning. I was inspired by the work that my teammate Jennifer Marsman was doing analyzing EEG data with AML. (link)

I began by walking through a couple tutorials: here and here.

Then I tried it myself. AML provides some sample data sources, so I imported the xxx data. I cleaned the data and applied a Category algorithm.

Machine Learning seems complex and the AML tools are not all intuitive when you first begin working with them; but they are not difficult to master. And the graphical interface of ML Studio lowers the learning curve considerably.

I'll provide more details and instructions about this project in a future blog post.

For now, my message is that building something yourself is the best way to learn any technology. Pick a project, set aside some time, and build it. I know that not every company invests in a day of hacking like mine did, so many of you will need to invest your own time in order to get this benefit. But it’s worth it.

My project wasn't nearly as sexy as some created by my colleagues. But my knowledge of Machine Learning is an order of magnitude greater than it was a week ago.

My Machine Learning experiment

Wednesday, 23 September 2015 22:48:13 (GMT Daylight Time, UTC+01:00)
# Wednesday, 02 September 2015

One of the nice things about Azure storage is that Azure always makes extra copies of your data. How and where those copies are made is up to you.

In the current Azure portal, you select the REPLICATION property of a new Storage Account; In the Azure Preview Portal, you select the Storage Account Type.

In both cases, the options are:

  • Locally Redundant
  • Geo-Redundant
  • Read-Access Geo-Redundant
  • Zone Redundant

Here is an explanation of each type:

Locally Redundant

Three copies of your storage data are created - all within the same region. No two copies will reside in the same Fault Domain and no two copies will reside in the same Upgrade domain.

This provides fault tolerance in case of the failure of one of the machines on which a copy of the storage account is stored. If the entire data center goes down, no copies of your data will be available.

This is the cheapest of the available redundancy options.


As with Locally Redundant storage, Geo-Redundant storage also creates 3 copies of your data on separate fault domains and update domains in the same data center. But it also creates 3 more copies of your data in another region - typically the region nearest the primary region for this account. For example, if you select North Central US as your storage account's primary region, the account data will be replicated in the South Central US Region.

Once this cross-region replication occurs, you are protected from data loss, even if an entire Azure region fails.

Read-Access Geo-Redundant

Read-Access Geo-Redundant storage is identical to Geo-Redundant storage, but it also provides read access to data stored in the secondary region,

Zone Redundant

Three copies of your storage data are created and stored in at least 2 geographically disparate data centers. These data centers may or may not be in the same Region. This provides fault tolerance, even if an entire data center fails.

Zone Redundant Storage Accounts only support Block Blog storage, so selecting this option will limit the uses of your Storage account.


Data within a region or data center is always distributed across multiple update domains and fault domains to protect against most hardware failures or planned maintenance downtime.

Replication within a data center is an atomic operation. In other words, success is not reported to the client until all 3 copies have been successfully written.

Replication to a secondary data center is done asynchronously and typically completes after success has been reported to the client. The good news about this is that clients don't experience any latency when writing to one of a Geo-Redundant storage account. A potential downside is that there is that data in the secondary data center is eventually consistent. If the primary data center fails, it is possible that not all data was written yet to the secondary data center.

Geo-Redundant and Read-Access Geo-Redundant are very similar - both create 6 copies of your data spread across two regions. The difference is that in a Geo-Redundant scenario, the data in the secondary region is only accessible in the event of a failure in the primary region. If all 3 copies of the data in the primary region are unavailable, Azure will fail over to a copy of the data in the secondary region. This also holds true with Read-Access Geo-Redundant, but you get one more benefit: users can access a read-only copy of the data in the secondary region, even if there is no failure in the first region. This can make for greater availability and access speed for users. This also explains why Read-Access Geo-Redundant is the most expensive option. It's the only option that allows users to read copies of the data.

Which Should I choose?

For maximum performance and reliability, Read-Access Geo-Redundant storage is your best option. But this is also the most expensive option. If you are very cost-conscious or if the government requires you to keep data within specific geographic boundaries, you should consider Zone Replication. Geo-Redundant storage is a good compromise between these two options for most scenarios.

Wednesday, 02 September 2015 01:08:06 (GMT Daylight Time, UTC+01:00)
# Thursday, 30 July 2015

In this video, you will see how to use the portal to quickly create a table linked to an Azure Mobile Service and a Windows Universal App client that connects to that mobile service.

G-Cast 2

Thursday, 30 July 2015 12:10:00 (GMT Daylight Time, UTC+01:00)
# Tuesday, 23 June 2015

The past few years, I've heard a lot about something called NoSQL. Some people really love it. Those who love it, talk about its lack of ceremony and the speed with which you can develop and the speed with which it reads and writes and its scalability. It sounds all sounds so awesome!

But I grew up on relational databases. My first computer language was FoxPro, which included a relational database and supported a powerful version of SQL. From there, I graduated to SQL Server and I've dabbled occasionally in Microsoft Access. I've even worked with Oracle and MySQL and, as a developer, I find them intuitive. Should I abandon the databases with which I am familiar and travel to this brave, new world of NoSQL? Is NoSQL the best solution for every project? For some projects? How do I know?

Let's start with a definition for NoSQL. This is harder than you might think, because NoSQL databases are basically defined by what they are not. The only real definition is that they are not SQL databases. They tend not to have pre-defined schemas; they tend not to enforce relationships; and they tend to be able to store hierarchical data; and, of course, they tend not to support the SQL language (although some support syntaxes similar to SQL, such as LINQ). These are broad definitions and only address things that NoSQL databases don't do. There is no standard language, syntax, storage mechanism, or API for addressing NoSQL databases.

For purposes of this article, I'll define SQL databases as those in which the database engined provides the following features:

  1. Supports SQL
  2. Enforces pre-defined schemas
  3. Enforces referential integrity among related, normalized tables

This includes database engines supported by large companies, such as Microsoft SQL Server and Oracle, as well as Open Source databases, such as MySQL.

I'll lump all other persistent storage technologies as NoSQL databases. This includes MongoDB, RavenDB, Azure table storage, and DocumentDB.

So when should we choose good old SQL databases and when should we use this newfangled NoSQL thing?

Let's start with SQL databases. They have a few advantages:

SQL databases

Advantages of SQL DBs

First, they are relational, so they make it easy to normalize your database into a set of related tables. This almost always saves disc space and often makes your data more consistent (e.g., you can change the name of a product in one table and it changes throughout your entire application). Databases like this also allow you to create persistent relationships between these tables and these relationships enforce referential integrity, ensuring that we are not left with orphaned records (Who wants an order line without a corresponding order?)  You can set up cascading deletes or force users to create and delete records in an order that will never have inconsistent data.

The SQL language itself is very flexible, allowing users to create either pre-defined or ad-hoc queries against a relational database. This makes SQL databases great for reporting.

The schema in a SQL database helps catch errors in almost the same way that a compiler or a unit test does. If you want to capture a customer's last name and you create a “LastName” column, but one time you accidentally misspell it as "LastNmae", the database will catch this and throw an exception which should be early and obvious enough for you to fix the error.

Disadvantages of SQL DBs

But these features come at a price. There is overhead in enforcing database schemas and referential integrity. As a result, saving to a SQL database tends to be slower.

Also, when developers build an application intended for human interaction, they almost never structure normalize the application's objects in the  way that they normalize the data in their relational database. An entire class of Object Relational Mapper (ORM) software exists simply to deal with this mismatch. It requires code, time, and CPU cycles to map between objects in an application and data in a database.

NoSQL databases

Advantages of NoSQL DBs

Because NoSQL databases don't need to enforce schemas or relationships, they tend to perform faster than their SQL cousins.

Database development tends to be faster because developers and DBAs are not required to pre-define the columns in each table.

The lack of database relationship enforcement also makes it easier to move parts of a database to another server, which makes it easier to support very large data sets. Relational databases can move across servers, but it tends to be more difficult because of their need to enforce referential integrity.

The lack of schema also adds flexibility, especially if you are capturing data in which different objects may have different properties. For example, a product catalogue table may contain some items (such as computer monitors) for which diagonal size in inches is an important property, and other items (such as hard drives) for which capacity in GB is an important property. Mapping these disparate needs to a relational database table would be add complexity to your data model.

Finally, it is possible to serialize and de-serialize objects in the same format that they are used in an application's user interface. This eliminates the need for an ORM, which makes applications simpler and faster.

Disadvantages of NoSQL DBs

When reading data, NoSQL databases tend to be very fast, as long as you are looking up rows by an index or key. If you want to look up a row by any other property or filter your data by a property, this often requires a full table scan, which is very slow. Some NoSQL databases allow you to create index on non-key rows, which speeds up such searches but slows down data writes - decreasing one of the advantages of NoSQL.

Other factors

It's worth looking at the cost of any database solution. For example, Azure provides both SQL and NoSQL databases as a service. If we compare the cost of Azure SQL Database with Azure table storage (a NoSQL option), we can see that the price of table storage is far less than the cost of SQL Server. Table storage might not be the answer for your application, but it's worth examining whether some of your data can work with Azure table storage.


As with most questions facing IT developers, architects and managers, there is no clear-cut answer to whether to use SQL or NoSQL databases. SQL databases tend to be better when ad-hoc reporting is required, while NoSQL databases tend to shine when saving and retrieving transactional data from a user application. Many applications take advantage of the features of both database types by creating a NoSQL database with which their application interacts; then transforming and regularly copying this data into a relational database, which can be queried and reported on.

There are many options for your persistent storage needs. Choose the right one for your application.

Tuesday, 23 June 2015 14:42:00 (GMT Daylight Time, UTC+01:00)
# Monday, 15 June 2015
Monday, 15 June 2015 13:54:00 (GMT Daylight Time, UTC+01:00)
# Monday, 26 May 2014
Monday, 26 May 2014 18:11:00 (GMT Daylight Time, UTC+01:00)
# Monday, 21 April 2014
Monday, 21 April 2014 23:01:22 (GMT Daylight Time, UTC+01:00)
# Monday, 17 February 2014
Monday, 17 February 2014 17:01:00 (GMT Standard Time, UTC+00:00)
# Monday, 02 September 2013
Monday, 02 September 2013 18:16:00 (GMT Daylight Time, UTC+01:00)