What is Big Data?

Big data refers to collections of datasets so large that they cannot be processed or stored with traditional tools, or on a single computer, within a reasonable amount of time. This data may be structured, unstructured, or semi-structured.

The uniqueness of Big Data (3 Vs and more)

Big data was initially characterized by Doug Laney in 2001 along three dimensions:

  1. Volume – This refers to the amount/size of data.
  2. Variety – This refers to the different types and formats of data (e.g., financial records, IoT sensor data, text, images).
  3. Velocity – This refers to the speed at which the data is generated, collected, processed, and stored.

Over time, this list has been extended to include:

  1. Veracity – This refers to the degree of accuracy and trustworthiness of data.
  2. Variability – This refers to how the meaning and usage of collected data can change over time or across contexts, which may lead to inconsistency.
  3. Value – This refers to the business value that can be derived from data. Deriving business value is the ultimate goal of big data!

Big Data Life Cycle

The implementation may differ across individuals and organizations, but our goal is to discuss the commonalities and baseline strategies of the big data life cycle. The steps can be summarized as follows:

  1. Data identification – This involves finding the appropriate datasets to work with to give a solution to a particular business problem (derive business value).
  2. Data acquisition and ingestion – This involves fetching required datasets from the source(s) which could include internal or third-party sources.
  3. Data processing – This involves filtering, aggregating, and validating the data against rules set by the business. This is also the stage where the various datasets obtained should be merged, for instance by joining product and customer details through a common key such as a customer ID (see the sketch after this list). There are two main data processing strategies: online transaction processing (OLTP) and online analytical processing (OLAP).
  4. Data storage – This involves persisting datasets on a suitable storage system, a choice that depends largely on the processing strategy selected in the previous step. Storage systems primarily include relational database systems (e.g., MySQL, Oracle) and non-relational (NoSQL) systems (e.g., the Hadoop file system, key-value, document, and wide-column stores).
  5. Data analysis and knowledge creation – This involves the generation of insights (business value) from data using statistical and/or machine learning methods.
  6. Data visualization – This involves presenting business insights, for instance to business executives, to support decision making.
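
As a minimal sketch of the processing step above, the snippet below joins two hypothetical datasets (customers and orders) on a shared customer ID and aggregates the result. It assumes the pandas library is available; the dataset names and columns are illustrative only.

```python
import pandas as pd

# Hypothetical datasets: customer details and their purchases.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "region": ["East", "West", "East"],
})
orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "customer_id": [1, 1, 2, 3],
    "amount": [250.0, 75.5, 310.0, 42.0],
})

# Merge the datasets through the common key (customer_id),
# then aggregate to get total spend per region.
merged = orders.merge(customers, on="customer_id", how="inner")
spend_by_region = merged.groupby("region")["amount"].sum()
print(spend_by_region)
```

In a real pipeline, the same join and aggregation would typically run on a distributed engine rather than on a single machine.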

Common Terminologies in Big Data

  • Structured Data 

This is data that is organized according to rules set by an organization (a defined schema) and is thus readily compatible with conventional relational database formats. It is easy to evaluate, process, and transform because each field can be accessed individually or together with other fields, e.g., sales records and employee details.

  • Unstructured data

This is schema-less data, and thus cannot be stored in conventional database formats in its original form. It can be textual (e.g., documents and free-form files) or non-textual (e.g., images, chats, video, audio, social media content).

  • Semi-Structured Data

This sits between structured and unstructured data. It has some characteristics of structured data, such as tags or key-value pairs, but lacks a rigid schema and thus does not adhere to structured relational database formats. Examples include data in JSON and XML. Note that the majority of NoSQL databases, such as MongoDB, can query semi-structured data.
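
A small illustration of semi-structured data using Python's standard json module: the hypothetical records below share some fields, but no fixed schema is enforced, so optional fields have to be handled explicitly.

```python
import json

# Semi-structured records: each document shares some fields, but the
# schema is not fixed -- "phone" and "tags" appear only sometimes.
raw = '''
[
  {"id": 1, "name": "Alice", "phone": "555-0100"},
  {"id": 2, "name": "Bob", "tags": ["vip", "newsletter"]},
  {"id": 3, "name": "Carol"}
]
'''

records = json.loads(raw)

# Fields must be accessed defensively because not every record has them.
for rec in records:
    print(rec["id"], rec["name"], rec.get("phone", "no phone on file"))
```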

  • Data lake

This is a large repository of data collected in its raw format. This data may be structured, unstructured, or semi-structured, e.g., a lake built on Hadoop or Amazon S3.

  • Data warehouse

This is a large, ordered data repository that can be used for analysis and reporting. It is composed of data that has been cleaned and integrated from multiple sources, e.g., warehouses built with Informatica or Postgres.

  • Batch processing

This is a computing strategy that involves processing data in large blocks (batches), typically on a schedule. Jobs are submitted, run over the full dataset, and return their results at a later time rather than immediately.

  • Realtime processing

This is a computing strategy that involves instant or near-instant data processing thus the need for a constant flow of data. Also referred to as stream processing.
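
The toy sketch below contrasts the two strategies in plain Python: the batch function waits for the whole dataset, while the streaming generator updates its result as each hypothetical event arrives.

```python
import time

events = [3, 7, 2, 9, 4]  # Hypothetical stream of sensor readings.

# Batch processing: accumulate the full dataset, then process it at once.
def batch_average(readings):
    return sum(readings) / len(readings)

print("batch average:", batch_average(events))

# Stream (real-time) processing: update results as each record arrives,
# never waiting for the complete dataset.
def stream_average(reading_source):
    total, count = 0, 0
    for reading in reading_source:
        total += reading
        count += 1
        yield total / count  # running average after each event

for running_avg in stream_average(events):
    print("running average so far:", running_avg)
    time.sleep(0.1)  # simulate events arriving over time
```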

  • Cluster computing

This is a mechanism of pooling the resources of multiple computers (typically memory, storage, and compute) and managing their collective capabilities to execute and complete tasks. Computer clusters often require a cluster management layer to handle communication among nodes and coordination of work assignments.

  • Data mining

This is a general term for the practice of discovering patterns in large sets of data. Its goal is to refine a mass of data into a more understandable and cohesive dataset.

  • ETL 

This is an abbreviation for extract, transform, and load. It refers to the process of fetching, preparing, and storing raw data for use. 
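
A minimal, self-contained ETL sketch using only Python's standard library; the CSV string stands in for a real source system, and SQLite stands in for the target store.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (a CSV string stands in for a file or API).
raw_csv = "customer_id,amount\n1,250.0\n2,310.0\n2,-5.0\n3,42.0\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: validate and reshape (drop negative amounts, cast types).
clean = [(int(r["customer_id"]), float(r["amount"]))
         for r in rows if float(r["amount"]) >= 0]

# Load: persist the prepared data into the target store (SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
print(conn.execute(
    "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
).fetchall())
```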

  • Hadoop

This is an Apache project that uses clustered computers to solve problems involving massive amounts of data. It consists of a distributed filesystem called HDFS, with a cluster management and resource scheduler on top called YARN (Yet Another Resource Negotiator). Batch processing capabilities are provided by the MapReduce computation engine. Other computational systems such as Apache Spark can be run alongside MapReduce in modern Hadoop deployments.

  • In-memory computing

This is a strategy that involves moving the working and processing datasets entirely into a cluster’s collective memory, where intermediate results are also held. This gives in-memory computing systems like Apache Spark a huge speed advantage over systems like Hadoop’s MapReduce, which writes intermediate results back to disk and is therefore I/O bound.
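
As a hedged sketch of the idea, assuming a local PySpark installation, the snippet below caches a derived dataset in memory so that later actions reuse it instead of recomputing it from the source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Build a dataset and keep the derived result in the cluster's memory;
# subsequent actions reuse the cached partitions instead of recomputing.
numbers = spark.sparkContext.parallelize(range(1_000_000))
squares = numbers.map(lambda n: n * n).cache()

print(squares.count())  # first action materializes and caches the data
print(squares.sum())    # second action is served from memory

spark.stop()
```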

  • Machine learning

This is a system (a set of algorithms) that can learn, adjust, and improve from data with respect to a particular task and performance measure.
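
For instance, a minimal sketch using scikit-learn (an assumed dependency): the task is classifying iris flowers, and the performance measure is accuracy on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Task: classify iris flowers. Performance measure: accuracy on held-out data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                       # the algorithm learns from data
print("accuracy:", model.score(X_test, y_test))   # improvement is measured here
```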

  • Map reduce

This is a programming model for distributing work across a computing cluster. The process involves splitting the work into tasks mapped onto different nodes and computing over them to produce intermediate key-value results, shuffling those results to group like keys together, and then reducing each group to a single output value.
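
The word-count example below mimics the three phases in plain Python; a real MapReduce job distributes the map and reduce tasks across cluster nodes, but the data flow is the same.

```python
from collections import defaultdict

documents = ["big data is big", "data is valuable"]

# Map: tokenize each document and emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group intermediate pairs by key so like sets end up together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: collapse each group to a single value per key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```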

  • NoSQL

This is a broad term referring to databases designed outside of the traditional relational model. NoSQL databases have different trade-offs compared to relational databases, but they are often the best choice for big data use cases due to their versatility, flexibility, and distributed-first architecture.

Industrial Applications & Benefits of Big Data

Big data is applicable in almost all industries (manufacturing, retail, financial services, agriculture) as well as in government. Below is a highlight of its main benefits:

  1. Superior customer experience through a 360-degree customer view, enabling improved customer acquisition, management, and churn prevention
  2. Fraud prevention and detection
  3. Optimized and efficient operations through automation
  4. Continuous and timely interventions
  5. Faster innovation

Challenges: Why do Big Data Projects fail?

Any organization desiring to immerse itself in big data must be ready to address a few challenges.

  1. Lack of skilled professionals 
  2. The uniqueness of big data (the 6 Vs) presents infrastructural challenges around storage, security, and networking.
  3. Data quality and compliance violations – Business value is only guaranteed if the data used is accurate and timely. Curating this kind of data from raw data while adhering to data privacy and regulatory requirements is a big challenge.
  4. Siloed data systems – Disparate data sources present integration complexities that must be conquered to derive business value. Many businesses struggle to re-engineer established processes and systems to set up and gain leverage from big data systems.

Way forward on Big Data Solutions

To venture into big data solutions, a business may first need to define its picture of success. A great place to begin is to identify and evaluate a specific use case (e.g., a product or intervention), the opportunity, the economic value of the data available, and the organization’s strategy as a whole. It is therefore highly recommended to have all stakeholders involved from the get-go.