Big Data and Processing Frameworks
Contents:
- What is Big Data?
- Big Data Processing Frameworks (Hadoop, Spark)
- Distributed Data Storage (HDFS, S3)
- Distributed Data Processing (MapReduce, Spark)
What is Big Data?
Big Data refers to the vast amounts of structured and unstructured data generated by sources such as social media, sensors, logs, and transactions. This data is difficult to analyze with conventional data processing techniques because of its immense volume, velocity, and variety. Specialized Big Data technologies and frameworks have therefore been developed to store and process it efficiently.
Big Data Processing Frameworks:
Hadoop is a popular Big Data processing framework that provides distributed storage and processing capabilities for large datasets. It is based on the MapReduce programming model, which allows developers to write distributed processing jobs that can be executed across a cluster of commodity hardware.
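To make the MapReduce model more concrete, here is a minimal word-count sketch in the style of a Hadoop Streaming job, with the mapper and reducer written as two small Python scripts. The script names and the tab-separated record format are assumptions for illustration, not taken from a specific project.
python code
# mapper.py -- emits a (word, 1) pair for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts for each word; Hadoop delivers the mapper
# output sorted by key, so identical words arrive on consecutive lines
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
On a Hadoop cluster, these two scripts would typically be run through the Hadoop Streaming jar, which splits the input across mappers, sorts and shuffles the mapper output by key, and feeds each key's records to a reducer.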
Hadoop Distributed File System (HDFS) is a distributed file system that provides reliable and scalable storage for Big Data applications. It is designed to store large files across a cluster of commodity hardware and provides built-in fault tolerance and data replication capabilities.
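Applications usually load their input into HDFS before processing it. As a rough sketch, the snippet below calls the standard hdfs dfs shell commands from Python to copy a local file into the cluster, adjust its replication factor, and list the result; the file paths and the replication factor of 2 are illustrative assumptions.
python code
import subprocess

# Copy a local CSV file into HDFS (paths are illustrative)
subprocess.run(["hdfs", "dfs", "-put", "-f", "data.csv", "/data.csv"], check=True)

# HDFS stores each block on several DataNodes (three copies by default);
# request a specific replication factor for this file
subprocess.run(["hdfs", "dfs", "-setrep", "2", "/data.csv"], check=True)

# List the root directory to confirm the file is stored in the cluster
subprocess.run(["hdfs", "dfs", "-ls", "/"], check=True)
This block-level replication across DataNodes is what provides the fault tolerance mentioned above.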
Example code for running a Spark job:
python code
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create a SparkContext
sc = SparkContext(appName="myApp")

# Create a SparkSession
spark = SparkSession.builder.appName("myApp").getOrCreate()

# Read data from HDFS
data = spark.read.csv("hdfs://localhost:9000/data.csv")

# Perform some data processing
# (processing steps that produce `output` would go here)

# Write output to HDFS
output.write.csv("hdfs://localhost:9000/output")
This is a good starting point for a Spark job, but some parts are missing, such as the actual data processing steps. Here is a more complete example that reads data from HDFS, processes it, and writes the output back to HDFS:
python code
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Create a SparkContext
sc = SparkContext(appName="myApp")
# Create a SparkSession
spark = SparkSession.builder.appName("myApp").getOrCreate()
# Read data from HDFS, inferring column types so that the age filter below
# compares numbers rather than strings
data = spark.read.csv("hdfs://localhost:9000/data.csv", inferSchema=True)
# Perform some data processing
processed_data = data.selectExpr("_c0 as id", "_c1 as name", "_c2 as age")
processed_data = processed_data.filter(processed_data.age > 18)
# Write output to HDFS
processed_data.write.csv("hdfs://localhost:9000/output")
# Stop the SparkContext
sc.stop()
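In practice, this script would be saved to a file and launched with spark-submit from a machine that can reach the cluster; the hdfs://localhost:9000 URIs in the example assume a NameNode listening locally on port 9000, so they would need to be adjusted for a real cluster.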
Continue to (Data Visualization and Communication)
Please share your opinion about the content of this blog so that it can be developed further. Thank you. Dr. Srinivas