
What is Data Collection and Cleaning

Know the Data Collection Methods and Cleaning Techniques

Contents of Data Collection and Cleaning:

  • Data Collection Methods
  • Data Quality Assessment
  • Data Cleaning Techniques
  • Outlier Detection

Data Collection Methods

Data Collection is the process of gathering relevant data from various sources that can be used for analysis. The two primary categories of data collection techniques are:

    Primary Data Collection

Primary data collection involves collecting data directly from the source for a specific purpose. This method involves the use of surveys, interviews, observations, and experiments to collect data.

    Secondary Data Collection

Secondary data collection involves the use of data that has already been collected by someone else, usually for a different purpose. Typical sources include books, journals, newspapers, and government publications.

Data Quality Assessment

Data Quality Assessment is the process of evaluating the quality of data collected. Data quality can be assessed based on the following factors:

    Accuracy

The data must be accurate, i.e., free from errors and mistakes.

    Completeness

The data must be complete, i.e., all required fields must be filled.

    Consistency

The data must be consistent, i.e., there should not be any conflicts or contradictions in the data.

    Timeliness

The data must be timely, i.e., it should be recent enough to reflect the situation being analyzed.

    Relevance

The data must be relevant, i.e., it should be related to the problem at hand.
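Some of these factors can be checked programmatically. The following is a minimal sketch, assuming a small hypothetical pandas DataFrame; the column names and the plausible age range of 0 to 120 are illustrative assumptions, not part of any real dataset. It checks completeness (missing values), consistency (duplicate rows), and accuracy (values outside a plausible range).

```python
import pandas as pd

# Hypothetical toy data: one missing age, one duplicated row, one impossible age.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 27, 27, None, -5],
    "country": ["IN", "US", "US", "IN", "UK"],
})

# Completeness: share of missing values per column.
missing_share = df.isna().mean()

# Consistency: number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Accuracy: non-missing ages outside an assumed plausible range of 0 to 120.
invalid_ages = int((~df["age"].between(0, 120) & df["age"].notna()).sum())

print(missing_share)
print("duplicate rows:", duplicate_rows)
print("invalid ages:", invalid_ages)
```

Timeliness and relevance usually depend on the problem being solved, so they are harder to check with a generic script.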

Data Cleaning Techniques

Data cleaning is the process of detecting and correcting or removing errors, inconsistencies, and inaccuracies in data. Some of the commonly used data cleaning techniques are:

    Handling Missing Values

Data may be missing for various reasons. One common technique for numeric columns is to replace missing values with the mean or median of the available data; the median is less sensitive to extreme values.

    Handling Outliers

Outliers are data points that deviate significantly from other data points. One technique is to remove them from the dataset, which is appropriate when they are errors rather than genuine extreme observations.

    Data Standardization

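Data Standardization is the process of rescaling data so that each feature has a mean of 0 and a standard deviation of 1. This puts features measured in different units on a comparable scale. Because it relies on the mean and standard deviation, it does not by itself remove the influence of outliers.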

    Data Normalization

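Data Normalization is the process of rescaling data to a fixed range, typically 0 to 1 (min-max scaling). It changes the scale of the data but not the shape of its distribution, so it does not make the data normally distributed. Because it relies on the minimum and maximum values, it is sensitive to outliers.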

Outlier Detection

Outlier Detection is the process of identifying data points that differ dramatically from the rest of a dataset. Some of the commonly used outlier detection techniques are:

    Z-Score

Z-Score is a statistical technique that measures how many standard deviations a data point lies from the mean of the dataset. Data points with a Z-Score greater than 3 or less than -3 are commonly treated as outliers.

    Interquartile Range (IQR)

IQR is a statistical technique used to identify outliers based on the distance between the first and third quartiles of the dataset. Data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.

    Local Outlier Factor (LOF)

LOF is a machine learning technique used to identify outliers based on the density of the data points. Data points whose local density is much lower than that of their neighbors are considered outliers.
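LOF can be tried with scikit-learn's LocalOutlierFactor class. The following is a minimal sketch on hypothetical synthetic data (a tight cluster plus one distant point); the contamination value is an assumed share of outliers chosen for this toy example and would need tuning on real data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2-D data: a tight cluster of 50 points plus one distant point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
X = np.vstack([cluster, [[8.0, 8.0]]])

# n_neighbors=20 is scikit-learn's default; contamination is an assumption.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)  # -1 marks outliers, 1 marks inliers

# Row positions that LOF flagged as outliers.
outlier_rows = np.where(labels == -1)[0]
print(outlier_rows)
```

The distant point is the last row of X, so its position (50) is expected to appear among the flagged rows.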

Python code for the commonly used data cleaning techniques:

Handling Missing Values:

python code

import pandas as pd

# Load the dataset

df = pd.read_csv('data.csv')

# Replace missing values with the mean

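# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions

df.fillna(df.mean(numeric_only=True), inplace=True)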
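The choice between mean and median matters when a column contains extreme values. The following is a minimal sketch on a hypothetical toy column (the name income and its values are made up for illustration) that compares the two fill values.

```python
import pandas as pd

# Hypothetical toy column with one missing value and one extreme value.
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, None, 500.0]})

# Fill the gap with the mean and, separately, with the median of the known values.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

print(mean_filled[3])    # → 149.25, pulled upward by the extreme value 500
print(median_filled[3])  # → 33.5, close to the typical values
```

Here the median gives a fill value that resembles the bulk of the data, while the mean is dominated by the single extreme value.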

Handling Outliers:

The complete code for outlier detection using Z-Score and IQR:

python code

import pandas as pd

import numpy as np

from scipy import stats

# Load the dataset

df = pd.read_csv('data.csv')

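# Z-Score (computed on numeric columns only, since text columns cannot be scored)

z_score = np.abs(stats.zscore(df.select_dtypes(include='number')))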

threshold = 3

df_cleaned_zscore = df[(z_score < threshold).all(axis=1)]

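# IQR (computed on numeric columns only)

numeric = df.select_dtypes(include='number')

Q1 = numeric.quantile(0.25)

Q3 = numeric.quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5*IQR

upper_limit = Q3 + 1.5*IQR

# Keep only the rows where no numeric value falls outside the limits

df_cleaned_iqr = df[~((numeric < lower_limit) | (numeric > upper_limit)).any(axis=1)]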

In the above code, the Z-Score technique is used to detect outliers where data points with a Z-Score greater than 3 or less than -3 are considered outliers. The IQR technique is used to detect outliers where data points that are more than 1.5 times the IQR above the third quartile or below the first quartile are considered outliers.

The code loads the dataset using pd.read_csv('data.csv'). The stats.zscore() function calculates the Z-Score of each value, and the np.abs() function returns the absolute values of the Z-Scores. The all(axis=1) method keeps a row only if every Z-Score in that row is below the threshold.

The code also uses the quantile() method to calculate the first and third quartiles, from which the interquartile range (IQR) is derived. The lower limit is the first quartile minus 1.5 times the IQR, and the upper limit is the third quartile plus 1.5 times the IQR. The any(axis=1) method flags every row that contains at least one value outside these limits, and the ~ operator negates the result so that only rows without outliers are kept.

The cleaned datasets are stored in df_cleaned_zscore and df_cleaned_iqr, respectively.
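The two rules do not always agree. The following is a minimal sketch on a hypothetical sample of 10 readings (values made up for illustration). In a sample this small, the extreme value inflates the mean and standard deviation so much that its own Z-Score stays below 3, while the IQR rule still flags it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample of 10 readings with one obvious extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 100])

# Z-Score rule: |z| > 3 (scipy uses the population standard deviation by default).
z = np.abs(stats.zscore(s))
z_outliers = s[z > 3]

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(list(z_outliers))    # → []
print(list(iqr_outliers))  # → [100]
```

This is one reason the IQR rule is often preferred for small samples, although the right choice depends on the data.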

Remove outliers using Z-Score

import numpy as np

from scipy import stats

z_score = np.abs(stats.zscore(df['column_name']))

threshold = 3

df = df[(z_score < threshold)]

Remove outliers using IQR

Q1 = df['column_name'].quantile(0.25)

Q3 = df['column_name'].quantile(0.75)

IQR = Q3 - Q1

lower_limit = Q1 - 1.5 * IQR

upper_limit = Q3 + 1.5 * IQR

# Keep values within the limits (values exactly on a limit are not outliers)

df = df[(df['column_name'] >= lower_limit) & (df['column_name'] <= upper_limit)]

Data Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df['column_name'] = scaler.fit_transform(df[['column_name']])

Data Normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df['column_name'] = scaler.fit_transform(df[['column_name']])

Note: In the above code, 'column_name' refers to the name of the column in the dataset that needs to be cleaned.
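To see what the two scalers actually produce, the following is a minimal sketch on a hypothetical single feature with five made-up values. StandardScaler applies (x - mean) / standard deviation, and MinMaxScaler applies (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with five values (one column, five rows).
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Standardization: result has mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(values)

# Min-max normalization: result lies in the range 0 to 1.
normalized = MinMaxScaler().fit_transform(values)

print(standardized.ravel())
print(normalized.ravel())  # → [0.   0.25 0.5  0.75 1.  ]
```

Both transformations keep the relative spacing of the values; only the scale changes.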
