Big Data


You have probably heard the term Big Data by now. I certainly mentioned it in passing in my post on Information Overload. So I was amused to receive the final edition of IEEE Computing in Science & Engineering for 2011 with Big Data as the topic.

 


I learned a few more things about Big Data that I hadn’t considered up until now:

 

  • storing the data is a major issue
  • moving the data between storage and processing is an even bigger issue
  • processing capacity is increasing faster than storage or transport capacity
  • for simulations, the results matrices are so huge that reducing them before storage is the only way they can be handled

 

An example where all these points converge is climate modelling where the exponential growth in sensors and the complexity of the models mean that there is too much data too widely dispersed to get it to one place, process it and get the results back out efficiently. A new methodology is required for problems like this.

 

IO Bottleneck

So we are back to the old IO Bottleneck problem. The graph that really got my attention tracked the growth in data access rates versus the growth in data processing rates.

 

Data Storage Versus Data Processing


The rate of performance improvement in disks (red line) is much lower than that in computing systems (blue line), driving the need for larger disk counts in each generation of Supercomputer. This approach isn’t sustainable regardless of whether you look at cost, power or reliability. Richard Freitas of IBM Almaden Research provided some of this data for IEEE.
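
To see why the gap compounds so quickly, here is a small illustration. The growth rates below are assumptions I have picked for the sake of the sketch, not the actual figures from the IEEE graph:

```python
# Illustrative only: assumed annual improvement rates, not the
# actual numbers behind the IEEE figure.
compute_growth = 1.6   # assume computing improves ~60% per year
disk_growth = 1.2      # assume disk access improves ~20% per year

# Even modest-looking differences compound dramatically.
gap = 1.0
for year in range(10):
    gap *= compute_growth / disk_growth

print(f"After 10 years the compute-to-disk gap has grown {gap:.1f}x")
```

With these assumed rates the gap grows almost 18-fold in a decade, which is why each Supercomputer generation needs disproportionately more disks just to keep the processors fed.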

 

So we have reached the point where the storage and movement of data is now the limiting factor in computing analysis. 40 years ago Seymour Cray had to overcome this within the individual computer to build the Supercomputers he is famous for. Today we have hit it at the level of entire systems.

 

Areas being looked at for innovative solutions are:

 

  • higher density and faster storage systems
  • data compression or subsetting algorithms to reduce the amount of data to be moved or stored
  • parallel processing techniques with parallel storage to reduce the bottleneck
  • results summarisation so less storage is required for the analysis results
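
The subsetting and summarisation ideas above can be sketched in a few lines. This is a minimal illustration of block averaging a results matrix before storage; the function name, block size and data are my own, not from any particular simulation code:

```python
def block_mean(matrix, k):
    """Reduce a 2-D results matrix by averaging non-overlapping k x k blocks."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            sum(matrix[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
            for c in range(0, cols, k)
        ]
        for r in range(0, rows, k)
    ]

# A toy 4x4 results matrix reduced to 2x2: a 4x saving in storage
# and transport, at the cost of fine detail.
results = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
reduced = block_mean(results, 2)
print(reduced)  # [[1.0, 2.0], [3.0, 4.0]]
```

The trade-off is exactly the one the article describes: you decide what resolution you can afford to keep before the data ever hits the disk.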

 

And all this while trying to maintain data integrity and traceability for proof of scientific rigour. Answers will be found, that much we can be sure of from history.

 

And there is a lot of money to be made from doing this well. Forbes puts the value of the Big Data Market at $50 Billion.

 

And even Google can have issues with Big Data Overload.

 

Successful Endeavours specialise in Electronics Design and Embedded Software Development. Ray Keefe has developed market leading electronics products in Australia for nearly 30 years. This post is Copyright © 2012 Successful Endeavours Pty Ltd

Information Overload

How Much Data?

According to IBM, 90% of the data created in the history of the world was created in the past 2 years. The article was looking at Social Media Information but the claim was generic. Talk about Information Overload. How do we keep up with this?
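
Taken at face value, the IBM claim implies a remarkable growth rate: if only 10% of today’s data existed 2 years ago, the total grew tenfold in 2 years. A quick back-of-envelope check:

```python
# If 90% of all data was created in the last 2 years, then what
# existed 2 years ago was only 10% of the current total.
growth_over_2_years = 1 / 0.10        # 10x in 2 years
annual = growth_over_2_years ** 0.5   # implied compound annual growth factor

print(f"Implied annual growth: {annual:.2f}x per year")  # ~3.16x
```

That is data volumes more than tripling every year, which puts the sceptics’ "out by a factor of 10" caveat below into perspective: even then, the growth is staggering.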

 

There are sceptics who believe this Data Deluge is overstated, but even if they are out by a factor of 10, it seems we are in danger of moving from the Information Age to drowning in data.

 

I worked with a very fast thinker once. Working with him was like trying to see ahead underwater while travelling in the wake of an outboard motor. The trick was to decide what to ignore so you could address just the important things. He used it as a tactic to get his own way during meetings. I was reminded of this while thinking about this topic. It seems the whole human race is about to face the same dilemma: how to sort the important information from the huge volume of total information being produced.

 

Information Overload


Information Relevance

Not all of the information produced is of the same quality, usefulness or relevance. Assessing Information Relevance will become increasingly important. A post on Facebook letting us all know that someone’s dog just farted is, for most of us, far less valuable than news of a new law that puts a carbon tax on high carbon emitters.

 

The CERN Large Hadron Collider (LHC) is expected to produce data equal to 1% of the world’s data production rate when it is running. This required a new approach to data storage. For those who aren’t familiar with it, the Large Hadron Collider is a higher energy version of the Australian Synchrotron, with specialised detectors that examine the fine details of how the matter of the universe is constructed. The intent is to look for evidence that the Higgs Boson exists as predicted by the Standard Model of particle physics.

 

CERN Large Hadron Collider

Test Everything

I mention it here because they have to record the experimental data knowing that it may be some time before they can fully interpret it. They have planned for the Information Overload as well as the long term Information Storage.

 

In fact it is a great example of long term planning, with the original proposal in 1985, construction beginning in 1994 and completion in 2008. You can see the steps involved in the LHC Milestones.

 

Stephen Wolfram has put together a timeline of the Advance of the Data Civilisation and if you are keen you can also buy the Historical Timeline of Systematic Data from them.

 

Information Storage

So how do you store all that data?

 

If we used DVDs it would produce a stack that goes to the Moon and back. That’s too big to store as DVDs.
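
A rough sanity check of the DVD comparison, using commonly quoted figures as assumptions (4.7 GB per single-layer disc, 1.2 mm disc thickness, 384,400 km average Earth–Moon distance):

```python
# Assumed figures for a back-of-envelope estimate.
DVD_CAPACITY_GB = 4.7            # single-layer DVD
DVD_THICKNESS_M = 1.2e-3         # 1.2 mm per disc
MOON_DISTANCE_M = 384_400_000    # average Earth-Moon distance

stack_length_m = 2 * MOON_DISTANCE_M          # to the Moon and back
discs = stack_length_m / DVD_THICKNESS_M      # discs in the stack
total_gb = discs * DVD_CAPACITY_GB            # data the stack would hold

# 1 zettabyte = 1e12 GB (decimal units)
print(f"{discs:.2e} DVDs holding roughly {total_gb / 1e12:.1f} zettabytes")
```

That works out to around 640 billion discs and a few zettabytes, which is in the same ballpark as contemporary estimates of the world’s total data, so the Moon-and-back image holds up.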

 

The increase in data comes from 3 sources:

 

  • new data sources such as ubiquitous sensors, LHC, business metrics, research…
  • increased data creation from existing sources such as social media, blogs, web publishing…
  • unprecedented processing power

 

So far the storage solution is the growth of server farms, and while many higher density storage technologies are being investigated, most data is stored on conventional hard disks. Redundancy and data security are of course hot topics.
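
As a small aside on redundancy, the idea behind RAID-5 style protection on those hard disks is simple XOR parity: lose any one disk and you can rebuild it from the survivors. A toy sketch, not production code:

```python
def parity(blocks):
    """XOR parity across equal-length data blocks (RAID-5 style)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Three "disks" of data plus one parity block.
disks = [b"data1", b"data2", b"data3"]
p = parity(disks)

# Simulate losing disk 1: rebuild it from the other disks plus parity,
# because a ^ b ^ (a ^ b ^ c) == c.
rebuilt = parity([disks[0], disks[2], p])
print(rebuilt == disks[1])  # True
```

Real arrays layer a lot more on top (striping, rotating parity, scrubbing), but the core arithmetic is just this.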

 

Hard Disk Storage


Information Processing

The other major issue is how we make sense of all this data. Traditional Data Integration tools are considered not ready for Big Data, and this is likely to get worse before it gets better. Information Processing is going to be one of the opportunity areas of the next decade.

 

According to CNN, Data Scientist will be one of the hot jobs in 2022.

 

Even in the much smaller world of Successful Endeavours, where we develop new products and have to do the Innovation, Research, Prototypes and testing associated with them, managing all the data requires both discipline and planning.

 
