# Tuesday, 07 February 2017

Managing Big Data takes a lot of processing power. Data often needs to be captured, scrubbed, merged, and queried, and each of these steps can take many hours of compute time. Often, though, the steps can be performed in parallel, which reduces the elapsed time but increases the number of computers required.

You could buy a bunch of computers, cluster them, and process your data on this cluster. But this is expensive, and these computers are likely to sit idle most of the time.

Cloud computing tends to be an ideal solution for most Big Data processing because you can rent the servers you need and pay for them only while they are running.

Microsoft Azure offers a full suite of Big Data tools. These tools are based on the popular Hadoop open source project and are collectively known as "HDInsight".


HBase is a NoSQL data store that is optimized for big data. Unlike SQL Server and other relational databases, the database does not enforce referential integrity, pre-defined schemas, or auto-generated keys. The developer must code these features into the client application. Because the database doesn't need to worry about these things, inputting data tends to be much faster than in a relational database.

HBase also can be scaled to store petabytes of data.
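To illustrate what "coding these features into the client application" can look like, here is a minimal sketch in plain Python. It does not call HBase at all; the `READING_SCHEMA` table layout, the column names, and the row key format are all hypothetical, chosen only to show the client generating its own key and checking its own schema before a write.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical schema for a "readings" table: column name -> expected type.
# The data store would not enforce any of this, so the client has to.
READING_SCHEMA = {
    "info:device_id": str,
    "info:temperature": float,
}


def make_row_key(device_id: str, when: datetime) -> str:
    """Build a row key in the client, since the store will not generate one."""
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random suffix to avoid key collisions
    return f"{device_id}#{timestamp}#{suffix}"


def validate_row(row: dict) -> None:
    """Raise ValueError if the row does not match READING_SCHEMA."""
    missing = set(READING_SCHEMA) - set(row)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    unexpected = set(row) - set(READING_SCHEMA)
    if unexpected:
        raise ValueError(f"unexpected columns: {sorted(unexpected)}")
    for column, expected_type in READING_SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")


# Example: validate a row and build its key before handing it to the store.
row = {"info:device_id": "sensor-7", "info:temperature": 21.5}
validate_row(row)
key = make_row_key("sensor-7", datetime(2017, 2, 7, 12, 0, tzinfo=timezone.utc))
```

The point of the sketch is only that these checks run before the write, in your own code, which is part of why the store itself can accept data so quickly.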


Apache Storm is a framework that allows you to build workflow engines against real-time data. This is ideal for scenarios like collecting IoT data. A Storm topology consists of Spouts and Bolts connected by Streams. A Spout is a component that brings data into the topology and hands it off to Bolts. Each Bolt takes in data; performs some discrete action, such as cleaning up the data or looking up values from IDs; and passes data on to one or more other Bolts. Data is passed as "Tuples", which are named lists of values that can be serialized as JSON. You can write your code in C#, Java, or Python, and a Visual Studio template helps you create these components.
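The flow of data from a Spout through a chain of Bolts can be sketched with ordinary Python generators. This is not the Storm API, only an illustration of the idea; the function names, the `DEVICE_NAMES` lookup table, and the sample readings are all made up for the example.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical lookup table used by one of the bolts.
DEVICE_NAMES = {1: "boiler", 2: "freezer"}


def sensor_spout(raw: Iterable[Tuple[int, Optional[float]]]) -> Iterator[Dict]:
    """Spout: accept raw readings and emit them as tuples of named values."""
    for device_id, temperature in raw:
        yield {"device_id": device_id, "temperature": temperature}


def clean_bolt(tuples: Iterable[Dict]) -> Iterator[Dict]:
    """Bolt: clean the data by dropping readings with no temperature."""
    for item in tuples:
        if item["temperature"] is not None:
            yield item


def lookup_bolt(tuples: Iterable[Dict]) -> Iterator[Dict]:
    """Bolt: look up a device name from its ID and add it to the tuple."""
    for item in tuples:
        name = DEVICE_NAMES.get(item["device_id"], "unknown")
        yield {**item, "device_name": name}


def run_topology(raw: Iterable[Tuple[int, Optional[float]]]) -> list:
    """Wire the spout to the two bolts and collect what comes out the end."""
    return list(lookup_bolt(clean_bolt(sensor_spout(raw))))


results = run_topology([(1, 20.5), (2, None), (3, 4.0)])
```

Each stage does one small job and passes its output along, which is what makes it possible for a real topology to run many copies of each Bolt side by side.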


Hive is a data warehouse. With it, you can query NoSQL data (such as HBase) and relational data (such as SQL Server). Hive ships with a query language - HiveQL - that is similar to SQL. Where HiveQL falls short, you can even write user-defined functions to perform more complex calculations.
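The idea of extending a SQL-like language with a user-defined function can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in, not Hive; real Hive functions are written and registered differently. The `readings` table and the `to_fahrenheit` function are invented for the example.

```python
import sqlite3


def fahrenheit(celsius: float) -> float:
    """The custom calculation we want to call from inside a query."""
    return celsius * 9 / 5 + 32


conn = sqlite3.connect(":memory:")

# Register the Python function so that SQL statements can call it by name.
conn.create_function("to_fahrenheit", 1, fahrenheit)

conn.execute("CREATE TABLE readings (device TEXT, celsius REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("boiler", 100.0), ("freezer", -20.0)],
)

# The query reads like ordinary SQL, with the custom function mixed in.
rows = conn.execute(
    "SELECT device, to_fahrenheit(celsius) FROM readings ORDER BY device"
).fetchall()
conn.close()
```

The query itself stays declarative, and only the piece the query language cannot express is written as code.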


Spark is an in-memory data processing engine that is well suited to interactive analysis and visualization. In Spark, you can write code in R, Python, or Scala. Jupyter notebooks are a popular interactive tool that lets you create documents combining text and code, so that you can generate real-time reports. Jupyter notebooks support both Python and Scala. Spark also ships with a number of libraries that make it easier to connect to data, create graphs, and perform a number of other tasks.


Each of the services described above supports running in clusters of servers. In a cluster, these servers process data in parallel, greatly reducing the amount of time required to process the data. You can easily create a cluster in the Azure portal, or you can write a script in PowerShell or the Azure CLI.
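The split, process, and merge pattern behind cluster processing can be shown on a single machine. In this minimal sketch, threads from Python's standard library stand in for the servers in a cluster, and the word-count task and function names are just an example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(records: List[str], parts: int) -> List[List[str]]:
    """Divide the records into roughly equal chunks, one per worker."""
    return [records[i::parts] for i in range(parts)]


def count_words(lines: List[str]) -> int:
    """The work each worker performs on its own chunk."""
    return sum(len(line.split()) for line in lines)


def parallel_word_count(lines: List[str], workers: int = 4) -> int:
    """Process the chunks side by side, then merge the partial results."""
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


total = parallel_word_count(["a b c", "d e", "f"], workers=2)
```

A real cluster adds the hard parts, such as moving data between machines and recovering from failures, but the shape of the work is the same.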

The ease of creating clusters is a big advantage of running HDInsight over deploying your own Hadoop servers and clustering them yourself. Of course, the other advantage is that you do not have to purchase and maintain servers that are only used occasionally, which can be a big cost saving.


One word of caution about using these services: you pay for each server in a cluster by the minute, and this can quickly add up. Typically, you don't need to keep a cluster running for very long in order to complete your tasks, so it is a good idea to delete the cluster when the tasks are finished. For this reason, it's worth scripting the creation and deletion of your cluster so that both are easy to perform.
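A quick calculation shows how per-minute billing adds up. The numbers below are hypothetical (an 8-node cluster at $0.50 per node per hour), not actual Azure prices; check the current pricing page for real rates.

```python
def cluster_cost(nodes: int, price_per_node_hour: float, minutes: float) -> float:
    """Cost of running a cluster, billed per node for each minute it is up."""
    return nodes * price_per_node_hour * minutes / 60


# Hypothetical figures, for illustration only.
NODES = 8
PRICE_PER_NODE_HOUR = 0.50

# A 90-minute job on a cluster that is deleted when the job finishes.
job_only = cluster_cost(NODES, PRICE_PER_NODE_HOUR, minutes=90)

# The same cluster left running for 30 days.
left_running = cluster_cost(NODES, PRICE_PER_NODE_HOUR, minutes=30 * 24 * 60)

print(f"Job only:     ${job_only:,.2f}")
print(f"Left running: ${left_running:,.2f}")
```

With these assumed figures, the job costs $6.00, while the forgotten cluster costs $2,880.00 for the month.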