The reason for Hadoop's success has been simple – before Hadoop, data storage was expensive. Hadoop lets you store as much data as you want, in whatever form you need, simply by adding more servers to a cluster. These can be commodity x86 machines with a relatively low price tag, and each new server adds more storage and more processing power to the overall cluster.
Hadoop also lets companies store data as it comes in, structured or unstructured, so you don't have to spend money and time forcing it into the rigid tables of a relational database management system – a very expensive proposition.
However, Hadoop has its limitations and bottlenecks. It was envisioned as a batch-oriented system, and its real-time capabilities are still emerging, which has created a gap that fast in-memory NewSQL databases are rushing to fill. NewSQL vendors such as MemSQL and VoltDB are working towards real-time analytics on huge data stores, with latencies measured in milliseconds.
Q: Where does MapReduce fit into the Big Data landscape?
A: For those of you who are not familiar with MapReduce, it is the key algorithm [or framework] that the Hadoop engine uses to filter and distribute work around a cluster. A MapReduce program is composed of two steps, sketched in code after this list:

1. The Map procedure performs filtering and sorting of data, such as sorting students by first name into queues, one queue for each name.

2. The Reduce procedure performs a summary operation, such as counting the number of students in each queue, yielding name frequencies.
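A minimal, single-process sketch of those two steps, using the student example above. Plain Python stands in for Hadoop here; the function names and data are illustrative, not Hadoop's actual API:

from itertools import groupby
from operator import itemgetter

def map_phase(students):
    # Map: emit a (key, value) pair for each record --
    # here, (first_name, 1) for every student.
    for name in students:
        yield (name, 1)

def reduce_phase(pairs):
    # Sort the pairs by key to form one "queue" per name,
    # then Reduce: sum the counts in each queue.
    by_name = sorted(pairs, key=itemgetter(0))
    for name, queue in groupby(by_name, key=itemgetter(0)):
        yield (name, sum(count for _, count in queue))

students = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]
print(dict(reduce_phase(map_phase(students))))
# {'Alice': 3, 'Bob': 2, 'Carol': 1}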
The framework marshals the distributed servers, running the tasks in parallel whilst also managing all communication between the component parts of the system. The key property of the MapReduce algorithm is that, provided every Map and Reduce is independent of all other ongoing Maps and Reduces, the operation can be run in parallel on different servers and different sets of data. Consequently, on a large cluster of machines you can go one step further and run the Map operations on the servers where the data lives. Rather than copying the data over the network to the program, you push the program out to the machines. The output data is saved to the distributed file system, and the Reducers then run to merge the results.
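Because each Map touches only its own slice of the data, the Map work can be handed to separate workers and the partial results merged by a single Reduce. A toy sketch of the idea, with Python worker processes standing in for the servers in a cluster – the chunking and names are illustrative assumptions, not Hadoop code:

from collections import Counter
from multiprocessing import Pool

def map_chunk(names):
    # Each worker counts only the records held on "its" node.
    return Counter(names)

def reduce_counts(partials):
    # Merge the partial counts from all workers into one result.
    total = Counter()
    for partial in partials:
        total += partial
    return total

if __name__ == "__main__":
    # Three chunks standing in for data blocks on three servers.
    chunks = [["Alice", "Bob"], ["Alice", "Carol"], ["Bob", "Alice"]]
    with Pool(processes=3) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))
    # Counter({'Alice': 3, 'Bob': 2, 'Carol': 1})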
There are, however, limitations to MapReduce:

- For maximum parallelism, the Maps and Reduces need to be stateless and must not depend on any data generated by the same MapReduce job.

- It is very inefficient if you are repeating similar searches again and again. A database with an index will always be faster than running a MapReduce job over unindexed data, and the repeated scans waste both CPU time and power (see the sketch after this list).

- In Hadoop, Reduce operations do not take place until all the Maps are complete, so no output is available until all mapping has finished.
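To make the indexing point concrete, here is a toy comparison in plain Python, with a dict standing in for a database index and a full list scan standing in for a pass over unindexed data – the sizes and names are illustrative:

# One million (key, value) records.
records = [("id%06d" % i, i) for i in range(1_000_000)]

def scan_lookup(key):
    # Unindexed lookup: every query re-reads all the records -- O(n).
    return [value for k, value in records if k == key]

# Build the index once; each subsequent lookup is then O(1).
index = dict(records)

def indexed_lookup(key):
    return index.get(key)

print(scan_lookup("id000042"))     # [42], after scanning 1,000,000 records
print(indexed_lookup("id000042"))  # 42, via a single hash lookup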
Q: What other Big Data solutions are out there?
A: The world of Big Data solutions and vendors is divided into two camps. On one side are the pure-play Big Data start-ups, bringing innovation and buzz to the marketplace. On the other are the established database and data warehouse vendors, moving into the world of Big Data from a position of strength, both in terms of an installed base and a proven product line.
Apache Hadoop, now a ten-year-old platform inspired by Google's MapReduce work and first used by internet giants such as Yahoo and Facebook, led the Big Data revolution. The jury is still out on whether Hadoop will become as indispensable as database management systems [DBMS], although it has proven its utility and cost advantages where volume and variety are extreme. Cloudera introduced commercial support for enterprises in 2008, and MapR and Hortonworks piled on in 2009 and 2011 respectively. Among data management incumbents, IBM and EMC spinout Pivotal have each introduced their own Hadoop distribution. Microsoft and Teradata offer complementary software and first-line support for Hortonworks' platform. Oracle resells and supports Cloudera, while HP, SAP and others work with multiple Hadoop software providers.

In-memory analysis is gaining steam as Moore's law brings us faster, more affordable and more memory-rich processors. SAP has been the biggest champion of the in-memory approach with its Hana platform, but Microsoft and Oracle are now poised to introduce in-memory options for their flagship databases too.
Niche vendors include Actian, InfiniDB, HP Vertica, Splunk, Platfora, Infobright and Kognitio, all of which have centred their Big Data stories on database management systems focused entirely on analytics rather than transaction processing.
In addition to the Big Data solution providers mentioned above there are analytics vendors, such as Alpine Data Labs, Revolution Analytics