
What is Data Collection and Cleaning

Understanding Data Collection Methods and Cleaning Techniques

Contents of Data Collection and Cleaning:

  • Data Collection Methods
  • Data Quality Assessment
  • Data Cleaning Techniques
  • Outlier Detection

Data Collection Methods

Data Collection is the process of gathering relevant data from various sources that can be used for analysis. The two primary categories of data collection techniques are:

    Primary Data Collection

Primary data collection involves gathering data directly from the source for a specific purpose, typically through surveys, interviews, observations, and experiments.

    Secondary Data Collection

Secondary data collection uses data that has already been collected and made available, such as data obtained from books, journals, newspapers, and government publications.

Data Quality Assessment

Data Quality Assessment is the process of evaluating the quality of the collected data. Data quality can be assessed based on the following factors (a short pandas sketch illustrating a few of these checks follows the list):

    Accuracy

The data must be accurate, i.e., free from errors and mistakes.

    Completeness

The data must be complete, i.e., all required fields must be filled.

    Consistency

The data must be consistent, i.e., there should not be any conflicts or contradictions in the data.

    Timeliness

The data must be timely, i.e., it should be up to date and collected within the time frame relevant to the analysis.

    Relevance

The data must be relevant, i.e., it should be related to the problem at hand.
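
As a rough illustration of how some of these factors can be checked in practice, the sketch below uses pandas on the placeholder file data.csv; the specific checks are examples, not a complete quality audit.

python code

import pandas as pd

# Load the dataset

df = pd.read_csv('data.csv')

# Completeness: count missing values in each column

print(df.isnull().sum())

# Consistency: count exact duplicate rows

print(df.duplicated().sum())

# Accuracy (spot check): summary statistics to look for impossible values

print(df.describe())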

Data Cleaning Techniques

Data cleaning is the process of detecting and correcting or removing errors, inconsistencies, and inaccuracies in data. Some of the commonly used data cleaning techniques are:

    Handling Missing Values

Data may be missing for many reasons. One common technique to handle missing values is to replace them with the mean or median of the available data.

    Handling Outliers

Outliers are data points that deviate significantly from other data points. One of the techniques to handle outliers is to remove them from the dataset.

    Data Standardization

Data Standardization is the process of rescaling data so that each feature has zero mean and unit variance. This technique is used to put all features on a comparable scale so that no feature dominates simply because of its units.

    Data Normalization

Data Normalization is the process of rescaling data to a fixed range, typically 0 to 1 (min-max scaling). This technique is used to ensure that all features share the same range, which helps algorithms that are sensitive to the magnitude of values.

Outlier Detection

Outliers are data points that differ dramatically from the rest of the dataset. Outlier Detection is the process of identifying and handling such points. Some commonly used outlier detection techniques are:

    Z-Score

Z-Score is a statistical technique used to identify outliers based on their deviation from the mean of the dataset, computed as z = (x - μ) / σ. Data points with a Z-Score greater than 3 or less than -3 are commonly treated as outliers.

    Interquartile Range (IQR)

IQR is a statistical technique used to identify outliers based on the distance between the first and third quartiles of the dataset. Data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.

    Local Outlier Factor (LOF)

LOF is a machine learning technique used to identify outliers based on the density of data points relative to their neighbors. Data points that lie in low-density regions are considered outliers.
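
Since the code section that follows covers Z-Score and IQR but not LOF, here is a minimal sketch of LOF-based outlier removal using scikit-learn's LocalOutlierFactor. It assumes the same placeholder file data.csv with all-numeric columns, and n_neighbors=20 is an arbitrary illustrative choice rather than a recommendation.

python code

import pandas as pd

from sklearn.neighbors import LocalOutlierFactor

# Load the dataset (assumes all columns are numeric)

df = pd.read_csv('data.csv')

# Fit LOF; fit_predict returns -1 for outliers and 1 for inliers

lof = LocalOutlierFactor(n_neighbors=20)

labels = lof.fit_predict(df)

# Keep only the inliers

df_cleaned_lof = df[labels == 1]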

Python code for the commonly used data cleaning techniques:

Handling Missing Values:

python code

import pandas as pd

# Load the dataset

df = pd.read_csv('data.csv')

# Replace missing values in numeric columns with the column mean

df.fillna(df.mean(numeric_only=True), inplace=True)
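
The section above also mentions the median as a replacement value; replacing missing values with the column median instead would look like this (same placeholder dataset):

# Alternative: replace missing values with the column median instead

df.fillna(df.median(numeric_only=True), inplace=True)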

Handling Outliers:

Below is the complete code for outlier detection using Z-Score and IQR:

python code

import pandas as pd

import numpy as np

from scipy import stats

# Load the dataset

df = pd.read_csv('data.csv')

# Z-Score (assumes all columns in the dataset are numeric)

z_score = np.abs(stats.zscore(df))

threshold = 3

df_cleaned_zscore = df[(z_score < threshold).all(axis=1)]

# IQR

Q1 = df.quantile(0.25)

Q3 = df.quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5*IQR

upper_limit = Q3 + 1.5*IQR

df_cleaned_iqr = df[~((df < lower_limit) | (df > upper_limit)).any(axis=1)]

In the above code, the Z-Score technique is used to detect outliers where data points with a Z-Score greater than 3 or less than -3 are considered outliers. The IQR technique is used to detect outliers where data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.

The code loads the dataset using pd.read_csv('data.csv'). The stats.zscore() function calculates the Z-Score of each value, and np.abs() returns its absolute value. The all(axis=1) call keeps only the rows in which every column's absolute Z-Score is below the threshold.

The code also uses the quantile() function to calculate the first and third quartiles and the interquartile range (IQR). The lower and upper limits are calculated by subtracting and adding 1.5 times the IQR from the first and third quartiles, respectively. The any(axis=1) call checks whether any value in a row falls outside these limits, and the ~ operator negates the resulting boolean mask so that such rows are dropped.

The cleaned datasets are stored in df_cleaned_zscore and df_cleaned_iqr, respectively.

Remove outliers using Z-Score

import numpy as np

from scipy import stats

z_score = np.abs(stats.zscore(df['column_name']))

threshold = 3

df = df[(z_score < threshold)]

Remove outliers using IQR

Q1 = df['column_name'].quantile(0.25)

Q3 = df['column_name'].quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR

upper_limit = Q3 + 1.5 * IQR

df = df[(df['column_name'] > lower_limit) & (df['column_name'] < upper_limit)]

Data Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df['column_name'] = scaler.fit_transform(df[['column_name']])

Data Normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['column_name'] = scaler.fit_transform(df[['column_name']])

Note: In the above code, 'column_name' refers to the name of the column in the dataset that needs to be cleaned.

To Main (Topics of Data Science)

Continue to (Data Exploration and Visualization)

