Know the Data Collection Methods and Cleaning Techniques
Contents of data collection and cleaning:
- Data Collection Methods
- Data Quality Assessment
- Data Cleaning Techniques
- Outlier Detection
Data Collection Methods
Data Collection is the process of gathering relevant data from various sources so that it can be used for analysis. The two main categories of data collection techniques are:
Primary Data Collection:
Primary data collection involves collecting data directly from the source for a specific purpose. This method involves the use of surveys, interviews, observations, and experiments to collect data.
Secondary Data Collection:
Secondary data collection involves the use of data that has already been collected by someone else, often for a different purpose. Typical sources include books, journals, newspapers, and government publications.
Data Quality Assessment
Data Quality Assessment is the process of evaluating the quality of data collected. Data quality can be assessed based on the following factors:
Accuracy:
The data must be accurate, i.e., free from errors and mistakes.
Completeness:
The data must be complete, i.e., all required fields must be filled.
Consistency:
The data must be consistent, i.e., there should not be any conflicts or contradictions in the data.
Timeliness:
The data must be timely, i.e., it should be collected within a specific time frame and be current enough for the analysis.
Relevance:
The data must be relevant, i.e., it should be related to the problem at hand.
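As a rough illustration, some of these factors can be turned into simple checks with pandas. The sketch below uses made-up records; the column names, the plausible age range of 0 to 120, and the cutoff date are assumptions chosen only for the example.

```python
import pandas as pd

# Made-up records; column names are illustrative assumptions.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, None, -5],
    "signup_date": pd.to_datetime(
        ["2023-01-10", "2023-02-01", "2023-02-01", "2019-06-30"]
    ),
})

# Completeness: share of non-missing values in each column.
completeness = df.notna().mean()

# Consistency: number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Accuracy (plausibility check): non-missing ages outside an assumed 0-120 range.
implausible_age = int(((~df["age"].between(0, 120)) & df["age"].notna()).sum())

# Timeliness: records older than an assumed cutoff date.
cutoff = pd.Timestamp("2022-01-01")
stale_rows = int((df["signup_date"] < cutoff).sum())

print(completeness)
print(duplicate_rows, implausible_age, stale_rows)
```

Relevance is harder to check automatically, since it depends on the question being asked rather than on the values themselves.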
Data Cleaning Techniques
Data cleaning is the process of detecting and correcting or removing errors, inconsistencies, and inaccuracies in data. Some of the commonly used data cleaning techniques are:
Handling Missing Values:
Data may be missing for various reasons, such as non-response or recording errors. One of the techniques to handle missing values in numeric columns is to replace them with the mean or median of the available data; the median is less sensitive to extreme values.
Handling Outliers:
Outliers are data points that deviate significantly from other data points. One of the techniques to handle outliers is to remove them from the dataset.
Data Standardization:
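Data Standardization is the process of transforming data to a standard scale, typically by subtracting the mean and dividing by the standard deviation so that each feature has a mean of 0 and a standard deviation of 1. This technique is used to ensure that all features are on a comparable scale. It does not remove outliers, so they still need to be handled separately.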
Data Normalization:
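Data Normalization is the process of rescaling data to a fixed range, usually 0 to 1, based on the minimum and maximum values. It changes the scale of the data but not the shape of its distribution, so it does not make the data normally distributed. Because it depends on the minimum and maximum, it is sensitive to outliers.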
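To make the two rescalings concrete, here is a minimal sketch that applies the usual z-score formula, (x - mean) / std, and the min-max formula, (x - min) / (max - min), to made-up values with NumPy. The numbers are assumptions chosen for illustration.

```python
import numpy as np

# Made-up data with one extreme value.
values = np.array([10.0, 20.0, 30.0, 40.0, 500.0])

# Standardization (z-score scaling): subtract the mean, divide by the
# standard deviation. values.std() is the population standard deviation.
standardized = (values - values.mean()) / values.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1.
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized.round(2))
print(normalized.round(2))
```

In the output, the extreme value squeezes the other four values into a narrow band near 0 after min-max scaling, which shows that rescaling by itself does not deal with outliers.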
Outlier Detection
Outliers are data points that differ markedly from the other data points in a dataset. Outlier Detection is the process of identifying these points so that they can be handled. Some of the commonly used outlier detection techniques are:
Z-Score:
Z-Score is a statistical technique used to identify outliers based on their deviation from the mean of the dataset, measured in standard deviations. A common rule treats data points having a Z-Score larger than 3 or lower than -3 as outliers.
Interquartile Range (IQR):
IQR is a statistical technique used to identify outliers based on the distance between the first and third quartiles of the dataset. Data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.
Local Outlier Factor (LOF):
LOF is a machine learning technique used to identify outliers by comparing the local density around each data point with the density around its nearest neighbors. Data points that lie in areas of much lower density than their neighbors are considered outliers.
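LOF has no code later in this post, so here is a minimal sketch using scikit-learn's LocalOutlierFactor. The points are made up, and n_neighbors=3 is an arbitrary assumption suited only to this tiny sample.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up 2-D points: a tight cluster plus one isolated point at the end.
X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [1.2, 1.0],
    [0.8, 0.9], [1.1, 1.1], [0.9, 0.8], [8.0, 8.0],
])

# n_neighbors=3 is an arbitrary choice for this small example.
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 marks outliers, 1 marks inliers
scores = lof.negative_outlier_factor_   # more negative means more anomalous

# Index of the point with the most anomalous score.
most_anomalous = int(np.argmin(scores))
print(labels)
print(most_anomalous)
```

On real data the number of neighbors usually needs tuning, because it controls how local the density comparison is.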
Python code for the commonly used data cleaning techniques:
Handling Missing Values:
python code
import pandas as pd
# Load the dataset
df = pd.read_csv('data.csv')
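# Replace missing values in numeric columns with the column mean
# (numeric_only=True avoids an error when the dataset has text columns)
df.fillna(df.mean(numeric_only=True), inplace=True)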
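The median can be used in the same way as the mean. The sketch below, on a made-up column (the name income and its values are assumptions), shows how the two fill values differ when the data contains an extreme value.

```python
import numpy as np
import pandas as pd

# Made-up income column with one missing value and one extreme value.
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, np.nan, 1000.0]})

mean_value = df["income"].mean()      # pulled upward by the extreme value
median_value = df["income"].median()  # much less affected by it

# Fill the gap with each statistic in a separate column for comparison.
df["income_mean_filled"] = df["income"].fillna(mean_value)
df["income_median_filled"] = df["income"].fillna(median_value)

print(mean_value, median_value)
print(df)
```

Here the mean fill value is far from the typical incomes, which is why the median is often the safer choice for skewed columns.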
Handling Outliers:
The complete code for outlier detection using Z-Score and IQR:
python code
import pandas as pd
import numpy as np
from scipy import stats
# Load the dataset
df = pd.read_csv('data.csv')
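# Work on numeric columns only; Z-Scores and quartiles are undefined for text columns
# Fill or drop missing values first: a NaN makes every Z-Score in its column NaN
numeric = df.select_dtypes(include='number')
# Z-Score: keep rows where every numeric value has |Z-Score| below the threshold
z_score = np.abs(stats.zscore(numeric))
threshold = 3
df_cleaned_zscore = df[(z_score < threshold).all(axis=1)]
# IQR: keep rows where no numeric value falls outside the 1.5 * IQR limits
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
df_cleaned_iqr = df[~((numeric < lower_limit) | (numeric > upper_limit)).any(axis=1)]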
In the above code, the Z-Score technique is used to detect outliers where data points with a Z-Score greater than 3 or less than -3 are considered outliers. The IQR technique is used to detect outliers where data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.
The code loads the dataset using pd.read_csv('data.csv'). The stats.zscore() function calculates the Z-Score of each value relative to its column, and the np.abs() function returns the absolute values of the Z-Scores. The all(axis=1) method checks that every value in a row is True, so a row is kept only if all of its Z-Scores are below the threshold.
The code also uses the quantile() method to calculate the first and third quartiles, from which the interquartile range (IQR) is computed. The lower limit is calculated by subtracting 1.5 times the IQR from the first quartile, and the upper limit by adding 1.5 times the IQR to the third quartile. The any(axis=1) method checks for any True value in a row, and the ~ operator negates the resulting boolean array so that rows containing an outlier are dropped.
The cleaned datasets are stored in df_cleaned_zscore and df_cleaned_iqr, respectively.
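To see both rules on concrete numbers, here is a small sketch on twelve made-up values computed directly with NumPy instead of scipy. The values are assumptions chosen so that one point is clearly extreme.

```python
import numpy as np

# Twelve made-up measurements; the last one is far from the rest.
values = np.array([10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 10, 50], dtype=float)

# Z-Score rule: distance from the mean in (population) standard deviations.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR rule: limits at 1.5 * IQR below the first and above the third quartile.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower_limit) | (values > upper_limit)]

print(z_outliers)
print(lower_limit, upper_limit, iqr_outliers)
```

Both rules flag only the value 50 here. Note that with the population standard deviation, the largest possible absolute Z-Score in a sample of n points is the square root of n - 1, so a threshold of 3 cannot flag anything in samples of 10 or fewer points; the IQR rule is often the more practical choice for very small datasets.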
Remove outliers using Z-Score
import numpy as np
from scipy import stats
# Keep rows whose absolute Z-Score in the chosen column is below the threshold
z_score = np.abs(stats.zscore(df['column_name']))
threshold = 3
df = df[(z_score < threshold)]
Remove outliers using IQR
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Keep values on or inside the limits; only values beyond them are outliers
df = df[(df['column_name'] >= lower_limit) & (df['column_name'] <= upper_limit)]
Data Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['column_name'] = scaler.fit_transform(df[['column_name']])
Data Normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['column_name'] = scaler.fit_transform(df[['column_name']])
Note: In the above code, 'column_name' refers to the name of the column in the dataset that needs to be cleaned.
Please share your opinion about the content of this blog so that it can be developed further and improved. Thank you. Dr. Srinivas