
What are Big Data Technologies

Big Data and Processing Frameworks

Contents of Big Data:

  • What is Big Data?
  • Big Data Processing Frameworks (Hadoop, Spark)
  • Distributed Data Storage (HDFS, S3)
  • Distributed Data Processing (MapReduce, Spark)

What is Big Data?

Big Data refers to the vast amount of structured and unstructured data that is generated from various sources such as social media, sensors, logs, and transactions. It is challenging to analyze and evaluate this data using conventional data processing techniques because of its immense volume, velocity, and variety. Therefore, specialized Big Data technologies and frameworks have been developed to handle and process this data efficiently.

Big Data Concepts


Big Data Processing Frameworks:

Hadoop is a popular Big Data processing framework that provides distributed storage and processing capabilities for large datasets. It is based on the MapReduce programming model, which allows developers to write distributed processing jobs that can be executed across a cluster of commodity hardware.

Apache Spark is another Big Data processing framework that provides fast, in-memory data processing capabilities. It is designed to handle a wide range of Big Data processing tasks such as batch processing, real-time streaming, machine learning, and graph processing.
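
As a small illustration of Spark's in-memory processing, the sketch below caches a DataFrame so that repeated actions reuse the data instead of re-reading it from disk. This is only a sketch: it assumes PySpark is installed and running in local mode, and the file path and the "category" column are placeholders, not part of any real dataset.

python code

from pyspark.sql import SparkSession

# Start a local Spark session (the app name is arbitrary)
spark = SparkSession.builder.appName("cacheDemo").getOrCreate()

# Load a dataset once ("events.csv" is a placeholder path)
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Cache the DataFrame in memory so repeated actions avoid re-reading the file
events.cache()

# Both of these actions reuse the in-memory copy after the first run
print(events.count())
events.groupBy("category").count().show()

spark.stop()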

Distributed Data Storage:

Hadoop Distributed File System (HDFS) is a distributed file system that provides reliable and scalable storage for Big Data applications. It is designed to store large files across a cluster of commodity hardware and provides built-in fault tolerance and data replication capabilities.
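
For example, a Python script can drive the standard hdfs dfs command-line tool to move files into HDFS and control replication. This is only a sketch: it assumes a Hadoop client is installed and an HDFS cluster is reachable, and the paths are placeholders.

python code

import subprocess

# Copy a local file into HDFS (paths are placeholders)
subprocess.run(["hdfs", "dfs", "-put", "data.csv", "/user/demo/data.csv"], check=True)

# List the directory to confirm the file is there
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)

# Ask HDFS to keep 3 replicas of the file for fault tolerance
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/user/demo/data.csv"], check=True)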


Amazon S3 is a cloud-based storage service that provides reliable and scalable storage for Big Data applications. It is designed to store and retrieve large amounts of data from anywhere in the world and provides a pay-as-you-go pricing model.
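
A common way to use S3 from Python is the boto3 library. The sketch below uploads, lists, and downloads objects; the bucket name and keys are placeholders, and it assumes boto3 is installed and AWS credentials are configured.

python code

import boto3

# Create an S3 client (assumes AWS credentials are configured locally)
s3 = boto3.client("s3")

# Upload a local file to a bucket (bucket and key names are placeholders)
s3.upload_file("data.csv", "my-bigdata-bucket", "raw/data.csv")

# List objects under the same prefix
response = s3.list_objects_v2(Bucket="my-bigdata-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the file back when needed
s3.download_file("my-bigdata-bucket", "raw/data.csv", "data_copy.csv")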

Distributed Data Processing:

The MapReduce programming model allows large datasets to be processed in a distributed fashion. It lets developers write parallel processing jobs that run across a cluster of commodity hardware. MapReduce consists of two phases: the Map phase, which processes input data and generates key-value pairs, and the Reduce phase, which aggregates those key-value pairs and produces the final output.
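
To make the two phases concrete, here is a minimal word-count sketch written in plain Python. It only simulates what Hadoop does: on a real cluster the framework shuffles and sorts the key-value pairs between the Map and Reduce phases and runs them on many machines, while here both phases run in one process. The function names and sample sentences are our own illustration, not part of any Hadoop API.

python code

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) key-value pair for every word in the input
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: aggregate the values for each key into a final count
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data needs big tools", "spark and hadoop process big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)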

Apache Spark provides an alternative to MapReduce that is faster and more efficient. It is designed to handle complex data processing tasks such as real-time stream processing, machine learning, and graph processing. Spark's distributed processing engine runs on a cluster of commodity hardware and includes built-in support for SQL, streaming, and machine learning.
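
As a brief illustration of the built-in SQL support, the sketch below registers a small DataFrame as a temporary view and queries it with Spark SQL. The data, column names, and view name are made up for the example, and it assumes PySpark running in local mode.

python code

from pyspark.sql import SparkSession

# Start a Spark session (local mode; the app name is arbitrary)
spark = SparkSession.builder.appName("sqlDemo").getOrCreate()

# Build a small DataFrame in place of a real distributed dataset
people = spark.createDataFrame(
    [(1, "Ann", 34), (2, "Bob", 17), (3, "Cid", 25)],
    ["id", "name", "age"],
)

# Register it as a temporary view and query it with Spark's SQL engine
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 18").show()

spark.stop()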

Big Data Visualization


Example code for running a Spark job:

python code

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create a SparkContext
sc = SparkContext(appName="myApp")

# Create a SparkSession
spark = SparkSession.builder.appName("myApp").getOrCreate()

# Read data from HDFS
data = spark.read.csv("hdfs://localhost:9000/data.csv")

# Perform some data processing (the steps that produce 'output' are missing here)

# Write output to HDFS
output.write.csv("hdfs://localhost:9000/output")

This is a good starting point for a Spark job, but some parts are missing, such as the actual data processing steps. Here's a more complete example that reads data from HDFS, processes it, and writes the output back to HDFS:

python code

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create a SparkContext
sc = SparkContext(appName="myApp")

# Create a SparkSession
spark = SparkSession.builder.appName("myApp").getOrCreate()

# Read data from HDFS
data = spark.read.csv("hdfs://localhost:9000/data.csv")

# Perform some data processing: rename the columns and cast age to an integer
processed_data = data.selectExpr("_c0 as id", "_c1 as name", "cast(_c2 as int) as age")

# Keep only rows where age is greater than 18
processed_data = processed_data.filter(processed_data.age > 18)

# Write output to HDFS
processed_data.write.csv("hdfs://localhost:9000/output")

# Stop the SparkContext
sc.stop()

In this example, we read a CSV file from HDFS using SparkSession's read.csv method. We then perform some data processing: selecting specific columns, casting age to an integer (CSV columns are read as strings by default), and filtering out rows where age is less than or equal to 18. We write the processed data back to HDFS using the write.csv method; note that the output directory must not already exist, or an error will occur. Finally, we stop the SparkContext to free up resources.
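
If you need to rerun the job without deleting the output directory by hand, the DataFrame writer also accepts an overwrite mode. A one-line variation of the final write step from the example above:

python code

# Overwrite the output directory if it already exists, instead of failing
processed_data.write.mode("overwrite").csv("hdfs://localhost:9000/output")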

To Main (Topics of Data Science)

Continue to (Data Visualization and Communication)

