Data-intensive computing

Chris Brown, data analytics consultant at OCF


Just so we all understand the size (according to Intel) of the big data phenomenon: from the dawn of time until 2003 mankind generated five exabytes of data; in 2012 it generated 2.7 zettabytes¹ (500 times more data); and by 2015 it is estimated that figure will have grown to eight zettabytes (one zettabyte is 1,000 exabytes). How does this mass of information affect the scientific community? Well, you can thank the scientific community for ‘big data’, emanating as it does from ‘big science’. It made the link between having an enormous amount of data to work with and the mathematically huge probabilities of actually finding anything useful, spawning projects such as astronomy imaging (planet detection), physics research (supercollider data analytics), medical research (drug interaction), weather prediction and others. It is also the scientific community that is at the forefront of the new technologies making big data analytics possible and cost-effective, and major projects are under way to evaluate core technologies and tools that take advantage of collections of large data sets.

One technology addressing this emerging market is the Hadoop framework, which redefines the way data is managed and analysed by leveraging the power of a distributed grid of computing resources and a simple programming model to enable distributed processing of large data sets on clusters of computers. Its technology stack includes common utilities, a distributed file system, analytics and data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow and configuration management.
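To make that ‘simple programming model’ concrete, here is the standard introductory word-count job written against Hadoop’s Java MapReduce API: the mapper emits (word, 1) pairs from each split of the input, and the reducer sums the counts for each word across the cluster. It is a minimal sketch rather than production code; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: tokenise each line of input and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework takes care of splitting the input across the cluster, moving the computation to the data, shuffling intermediate pairs to the reducers and re-running failed tasks; the programmer only writes the two small functions above.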


Jeff Denworth, VP of marketing at DataDirect Networks


The reality is that we’re seeing two shifts occur in the marketplace. Firstly, big data is being distilled down into analytics of very large data sets. Click down one more level and you’ll find that almost everyone is talking about Hadoop. The two have become so linked I predict that, in the future, the term ‘big data’ will be used only to talk about Hadoop or NoSQL-style parallel data processing.

The general industry trend is to take scale-out servers and deploy the technology on those servers. But there are a growing number of customers around the world who are realising there may be a smarter way of building more efficient and scalable infrastructure for these technologies. The HPC market is looking for centralised storage that enables people to, first and foremost, decouple the capacity they need to deploy from the compute, because there’s not always a 1:1 ratio. Secondly, high-throughput centralised storage allows people to deliver to a compute node faster I/O performance than they can get by simply deploying 12 drives within that node. Drives aren’t getting any faster, but computers are and they’re getting hungrier – a lot hungrier if you add an accelerator or GPGPU to the system. As a result, people are really challenging the notion of data nodes – some call them super data nodes – because no matter how many disks they put into the system it seems like they can’t get the greatest level of compute efficiency. An alternate approach is to put a number of InfiniBand cards into that node and really drive the LAN speed as opposed to the speed of commodity disks.

The Hadoop ecosystem has become an extraordinary movement, where developers are involved and a lot of work is being done at scale as people recognise that it is the default platform for batch and real-time computing. Internally, we refer to it as a data refinery. In the past, the general trend has been towards a ‘pizza box’ scale-out mentality, but now the industry is really engaging in discussion of smarter approaches to a Hadoop infrastructure. Parallel processing at massive scale is the next stage of data analytics. Performance and compute efficiency are increasingly key system attributes, and balanced, efficient systems will usher in broad adoption of parallel Map Reduce processing methods.
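A rough back-of-envelope calculation shows why the fabric, rather than local disk, can become the better path to data. The figures below (per-drive throughput, InfiniBand port count and bandwidth) are illustrative assumptions, not measurements of any particular system.

```java
// Back-of-envelope comparison of local-disk versus network I/O for a Hadoop
// data node. All throughput figures are illustrative assumptions, not benchmarks.
public class NodeIoEstimate {
  public static void main(String[] args) {
    int localDrives = 12;                 // drives packed into one scale-out node
    double mbPerSecPerDrive = 120.0;      // assumed sequential throughput per commodity drive

    int ibPorts = 2;                      // assumed InfiniBand FDR ports per node
    double mbPerSecPerPort = 6000.0;      // ~56 Gbit/s FDR, less protocol overhead (assumption)

    double localMBps = localDrives * mbPerSecPerDrive;
    double networkMBps = ibPorts * mbPerSecPerPort;

    System.out.printf("Local disks: %.0f MB/s aggregate%n", localMBps);
    System.out.printf("InfiniBand:  %.0f MB/s aggregate%n", networkMBps);
    System.out.printf("Centralised storage over the fabric offers ~%.1fx the per-node bandwidth%n",
        networkMBps / localMBps);
  }
}
```

On those assumed numbers, a node fed from a fast centralised store sees several times the bandwidth of its own spindles, which is the gap between hungry processors and stagnant drive speeds described above.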




Ted Dunning, chief application architect, MapR


What Map Reduce, the Hadoop framework that assigns work to the nodes in a cluster, does well – and this is not the entire Hadoop experience – is certain types of computations on very large out-of-core datasets. Usually these datasets tend to be in the form of sparse observations of some kind – this may mean log lines, measurements from a physical system, or text derived from messages. The critical points are that we have an online computation the size of the data, and that the cost of that computation is linear. The scientific computations that fit that constraint are numerous, especially anything that’s embarrassingly parallel. With image processing in astronomy, for example, most parts of the sky can be processed fairly independently, so you can do data reduction on whole-sky imagery or the large-scale synthetic aperture radio astronomy of the kind that the Square Kilometre Array (SKA) will be producing. That can be done pretty efficiently with Hadoop.
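Embarrassingly parallel data reduction of this kind often fits a reducer-free Hadoop job: every record is processed on its own and nothing has to be brought back together. The sketch below is a hypothetical illustration (the class names and the per-record ‘science’ are placeholders), not code from any of the projects mentioned.

```java
// A map-only Hadoop job: each input record is reduced independently, and with
// zero reduce tasks the mapper output is written straight to the output files.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndependentReduction {

  public static class ReduceOneRecord extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Placeholder for the per-record science: calibrate, filter or transform
      // a single observation without reference to any other record.
      String reduced = value.toString().trim().toLowerCase();
      context.write(new Text(reduced), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "independent data reduction");
    job.setJarByClass(IndependentReduction.class);
    job.setMapperClass(ReduceOneRecord.class);
    job.setNumReduceTasks(0);            // map-only: no shuffle, no reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because no shuffle or reduce phase is needed, the cost stays linear in the data and the cluster spends its time on the records themselves.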


The drawback is that many of the numerical algorithms we’ve used in the past don’t work well in this new environment. A good example is the singular value decomposition (SVD) of a large matrix: there is an algorithm called Lanczos that has been known to be wonderfully good for many years. Involving sequential matrix-by-vector multiplications, it works very well on supercomputers, but on Hadoop the cost of iterating a process like that is much higher, so Lanczos becomes less appropriate. The rise of Hadoop has meant that we’ve had to find alternative algorithms that fit well in this new paradigm. Traditional clustering is also iterative, but there are new algorithms that allow you to go through the data very quickly and develop a sketch of it, so that you can then apply a classic algorithm. Many of these problems require substantial restatements, and that’s a severe downside of Hadoop. But, to cause a revolution in computing, you have to restate the problem.
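As an illustration of the sketch idea, here is a hedged, self-contained toy (not MapR’s or Apache Mahout’s actual implementation): a single linear pass folds each observation into the nearest of a bounded set of weighted centroids, and a classic clustering algorithm can afterwards run on those few centroids instead of on the raw data. The distance threshold and the synthetic input are assumptions chosen purely for the demonstration.

```java
// One-pass clustering sketch: linear cost in the data, no iteration over the full set.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamingSketch {

  static class Centroid {
    double[] mean;
    long weight;
    Centroid(double[] p) { mean = p.clone(); weight = 1; }
    void absorb(double[] p) {
      weight++;
      for (int i = 0; i < mean.length; i++) {
        mean[i] += (p[i] - mean[i]) / weight;   // running mean update
      }
    }
  }

  final List<Centroid> sketch = new ArrayList<>();
  final double threshold;   // max distance at which a point joins an existing centroid
  StreamingSketch(double threshold) { this.threshold = threshold; }

  // Constant work per point: fold it into the nearest centroid or start a new one.
  void add(double[] point) {
    Centroid nearest = null;
    double best = Double.MAX_VALUE;
    for (Centroid c : sketch) {
      double d = distance(c.mean, point);
      if (d < best) { best = d; nearest = c; }
    }
    if (nearest != null && best <= threshold) nearest.absorb(point);
    else sketch.add(new Centroid(point));
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(s);
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    StreamingSketch s = new StreamingSketch(1.5);
    // Stream a million synthetic 2-D observations drawn around three centres.
    double[][] centres = {{0, 0}, {10, 0}, {0, 10}};
    for (int i = 0; i < 1_000_000; i++) {
      double[] c = centres[i % 3];
      s.add(new double[]{c[0] + rnd.nextGaussian(), c[1] + rnd.nextGaussian()});
    }
    System.out.println("Sketch size: " + s.sketch.size() + " weighted centroids");
    // A classic, iterative algorithm such as k-means can now run on the tiny sketch.
  }
}
```

Because the pass touches each point once and never revisits the full dataset, it fits the linear-cost, single-scan pattern that Map Reduce rewards; the expensive iterative step runs only on the small sketch.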





