What is Big Data?

Every day human beings eat, sleep, work, play, and produce data—lots and lots of data. According to IBM, the human race generates 2.5 quintillion (25 billion x billion) bytes of data every day.

That’s the equivalent of a stack of DVDs reaching to the moon and back, and encompasses everything from the texts we send and photos we upload to industrial sensor metrics and machine-to-machine communications.

That’s a big reason why “big data” has become such a common catch phrase. Simply put, when people talk about big data, they mean the ability to take large portions of this data, analyse it, and turn it into something useful.

Exactly what is Big Data?

But big data is much more than that. It’s about:

  • taking vast quantities of data, often from multiple sources
  • and not just lots of data but different kinds of data, often, multiple kinds of data at the same time, as well as data that changed over time, that didn’t need to be first transformed into a specific format or made consistent
  • and analysing the data in a way that allows for ongoing analysis of the same data pools for different purposes
  • and doing all of that quickly, even in real time.

In the early days, the industry came up with an acronym to describe three of these four facets:

VVV, for volume (the vast quantities), variety (the different kinds of data and the fact that data changes over time), and velocity (speed).

Big data vs the Data Warehouse

What the VVV acronym missed was the key notion that data did not need to be permanently changed (transformed) to be analysed. That nondestructive analysis meant that organisations could both analyse the same pools of data for different purposes and could analyse data from sources gathered for different purposes.

By contrast, the data warehouse was purpose-built to analyse specific data for specific purposes, and the data was structured and converted to specific formats, with the original data essentially destroyed in the process, for that specific purpose, and no other, in what was called extract, transform, and load (ETL).

Data warehousing’s ETL approach limited analysis to specific data for specific analyses. That was fine when all your data existed in your transaction systems, but not so much in today’s Internet-connected world with data from everywhere.

However, don’t think for a moment that big data makes the data warehouse obsolete.  Big data systems let you work with unstructured data largely as it comes, but the type of query results you get is nowhere near the sophistication of the data warehouse.

After all, the data warehouse is designed to get deep into data, and it can do that precisely because it has transformed all the data into a consistent format that lets you do things like build cubes for deep drilldown? Data warehousing vendors have spent many years optimising their query engines to answer the queries typical of a business environment.

Big data lets you analyse much more data from more sources, but at less resolution. Thus, we will be living with both traditional data warehouses and the new style for some time to come.  

The Technology breakthroughs behind Big Data

To accomplish the four required facets of big data, volume, variety, nondestructive use, and speed, required several technology breakthroughs, including the development of a distributed file system (Hadoop), a method to make sense of disparate data on the fly (first Google’s MapReduce, and more recently Apache Spark), and a cloud/internet infrastructure for accessing and moving the data as needed.

Until about a dozen years ago, it wasn’t possible to manipulate more than a relatively small amount of data at any one time. Limitations on the amount and location of data storage, computing power, and the ability to handle disparate data formats from multiple sources made the task all but impossible.

Then, sometime around 2003, researchers at Google developed MapReduce. This programming technique simplifies dealing with large data sets by first mapping the data to a series of key/value pairs, then performing calculations on similar keys to reduce them to a single value, processing each chunk of data in parallel on hundreds or thousands of low-cost machines.

This massive parallelism allowed Google to generate faster search results from increasingly larger volumes of data.

Around 2003, Google created the two breakthroughs that made big data possible: One was Hadoop, which consists of two key services:

  • reliable data storage using the Hadoop Distributed File System (HDFS)
  • high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data, and run large-scale, high-performance processing jobs, in spite of system changes or failures.

Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-ons, cross-integration, and custom implementations of the technology.

To that end, Hadoop offers subprojects, which add functionality and new capabilities to the platform:

  • Hadoop Common: The common utilities that sup- port the other Hadoop subprojects.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that sup- ports structured data storage for large tables.
  • HDFS: A distributed le system that provides high throughput access to application data.
  • Hive: A data warehouse infrastructure that provides data summarisation and ad hoc querying.
  • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  • Pig: A high-level data- ow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

Most implementations of a Hadoop platform include at least some of these subprojects, as they are often necessary for exploiting big data. For example, most organizations choose to use HDFS as the primary distributed file system and HBase as a database, which can store billions of rows of data.

And the use of MapReduce or the more recent Spark is almost a given since they bring speed and agility to the Hadoop platform.

With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.

The MapReduce framework is broken down into two functional areas:

  • Map, a function that parcels out work to different nodes in the distributed cluster.
  • Reduce, a function that collates the work and resolves the results into a single value.

One of MapReduce’s primary advantages is that it is fault-tolerant, which it accomplishes by monitoring each node in the cluster; each node is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.

Apache Hadoop, an open-source framework that uses MapReduce at its core, was developed two years later. Originally built to index the now-obscure Nutch search engine, Hadoop is now used in virtually every major industry for a wide range of big data jobs.

Thanks to Hadoop’s Distributed File System and YARN (Yet Another Resource Negotiator), the software lets users treat massive data sets spread across thousands of devices as if they were all on one enormous machine.

In 2009, University of California at Berkeley researchers developed Apache Spark as an alternative to MapReduce. Because Spark performs calculations in parallel using in-memory storage, it can be up to 100 times faster than MapReduce. Spark can work as a standalone framework or inside Hadoop.

Even with Hadoop, you still need a way to store and access the data. That’s typically done via a NoSQL database like MongoDB, like CouchDB, or Cassandra, which specialise in handling unstructured or semi-structured data distributed across multiple machines.

Unlike in data warehousing, where massive amounts and types of data are, converged into a unified format and stored in a single data store, these tools don’t change the underlying nature or location of the data, emails are still emails, sensor data is still sensor data, and can be stored virtually anywhere.

Still, having massive amounts of data stored in a NoSQL database across clusters of machines isn’t much good until you do something with it. That’s where big data analytics comes in. Tools like Tableau, Splunk, and Jasper BI let you parse that data to identify patterns, extract meaning, and reveal new insights. What you do from there will vary depending on your needs.

InfoWorld

You Might Also Read:

Measuring the Economic Value of Data:

Tech Giants Put Big Data To Work:

« AI Makes People In Your Business More Important
The Shifting Sands of Cybersecurity »

CyberSecurity Jobsite
Perimeter 81

Directory of Suppliers

Authentic8

Authentic8

Authentic8 transforms how organizations secure and control the use of the web with Silo, its patented cloud browser.

ON-DEMAND WEBINAR: What Is A Next-Generation Firewall And Why Does It Matter

ON-DEMAND WEBINAR: What Is A Next-Generation Firewall And Why Does It Matter

See how to use next-generation firewalls (NGFWs) and how they boost your security posture.

Practice Labs

Practice Labs

Practice Labs is an IT competency hub, where live-lab environments give access to real equipment for hands-on practice of essential cybersecurity skills.

LockLizard

LockLizard

Locklizard provides PDF DRM software that protects PDF documents from unauthorized access and misuse. Share and sell documents securely - prevent document leakage, sharing and piracy.

MIRACL

MIRACL

MIRACL provides the world’s only single step Multi-Factor Authentication (MFA) which can replace passwords on 100% of mobiles, desktops or even Smart TVs.

TrustedIA

TrustedIA

TrustedIA is a cyber and protective security company. Our mission is to help businesses protect themselves from disruptive events that can impact their successful operation.

Kernelios

Kernelios

Kernelios is a simulator-based training center and an incubator for cyber experts worldwide.

National Agency for Information & Communication Technologies (ANTIC) - Cameroon

National Agency for Information & Communication Technologies (ANTIC) - Cameroon

ANTIC is responsible for regulating the activities of electronic security and regulation of the Internet in Cameroon.

National Cybersecurity Institute (NCI) - Excelsior College

National Cybersecurity Institute (NCI) - Excelsior College

NCI is Excelsior College’s research center dedicated to assisting government, industry, military and academic sectors meet the challenges in cybersecurity policy, technology and education.

National Cyber Security Centre (NCSC) - Ireland

National Cyber Security Centre (NCSC) - Ireland

The National Cyber Security Centre (NCSC) is the operational side of the Department of Communications in regard to network and information security in the Republic of Ireland.

Infosec Partners

Infosec Partners

Whether you’re looking for complete managed security or an on-call expert advisor, we offer a range of managed security services to complement your internal team or primary outsource partner.

T-REX

T-REX

T-REX is a coworking space, technology incubator, and entrepreneur resource center for technology startups.

US-Africa Cybersecurity Group (USAFCG)

US-Africa Cybersecurity Group (USAFCG)

USAFCG provides cybersecurity consulting services and delivers training programs for capacity building in Africa.

Newtec Services

Newtec Services

IT should be responsive, adaptive, and smart. Now more than ever, you need a business that runs efficiently and can adapt to today's challenges. We can help with custom IT solutions.

AirITSystems

AirITSystems

AirITSystems offer companies comprehensive IT security solutions that take all security considerations into account and are tailored to your business.

HEQA Security

HEQA Security

HEQA Security (formerly QuantLR) offer the world’s most cost-effective, easy-to-integrate, and secure Quantum Key Distribution (QKD) solution

Babble

Babble

Babble is a Unified Comms, Contact Centre and Cyber Solutions provider. We believe in making next-generation technology simple to use, deploy and manage.

Phriendly Phishing

Phriendly Phishing

Phriendly Phishing offers phishing awareness training programs designed to ward off potential security threats and minimise the impact of cyber attacks.

Radix Technologies

Radix Technologies

Radix offer end-to-end device management solutions, consolidating all the organization devices, processes and stakeholders into one easy-to-use management platform.

Exodata

Exodata

Exodata is a French digital services company specializing in the outsourcing of IT Systems and solutions.

Wired Assurance

Wired Assurance

Wired Assurance is a testing and assurance company, specialized in software applications and blockchain smart contracts.