Data Preprocessing and Feature Selection Techniques
Contents
- Data Preprocessing Techniques
- Feature Engineering Techniques
- Feature Selection Techniques
- Dimensionality Reduction Techniques
Data Preprocessing Techniques:
Data cleaning: handling missing values, duplicates, and erroneous entries so the dataset is consistent and reliable.
Data transformation: converting data into a form suitable for modeling, such as scaling numeric columns or encoding categorical ones (a sketch follows the cleaning example below).
Data integration: combining data from multiple sources or tables into a single consistent dataset (also sketched below).
Example code for data cleaning:
Python code
import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Drop irrelevant columns
data = data.drop(['id', 'date'], axis=1)

# Fill missing values in numeric columns with the median
data = data.fillna(data.median(numeric_only=True))

# Correct data errors: clip negative ages to 0
data['age'] = data['age'].apply(lambda x: max(x, 0))

# Save cleaned data
data.to_csv('cleaned_data.csv', index=False)
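Example sketches for data transformation and data integration (a minimal illustration; the 'gender' column, the 'extra.csv' file, and the 'customer_id' join key are assumptions, not part of the original dataset):
Python code
import pandas as pd

# Load the cleaned data produced above
data = pd.read_csv('cleaned_data.csv')

# Data transformation: one-hot encode a categorical column
# ('gender' is an assumed column name for illustration)
data = pd.get_dummies(data, columns=['gender'])

# Data integration: merge in records from a second source
# ('extra.csv' and the 'customer_id' key are assumptions)
extra = pd.read_csv('extra.csv')
data = data.merge(extra, on='customer_id', how='left')

# Save the combined result
data.to_csv('integrated_data.csv', index=False)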
Feature Engineering Techniques:
Feature scaling: rescaling numeric features to a common range or distribution so that no single feature dominates a model.
Feature extraction: deriving new, more informative features from the raw data, such as calendar parts of a timestamp (a sketch follows the scaling example below) or principal components (covered in the PCA section).
Example code for feature scaling:
Python code
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load data
data = pd.read_csv('data.csv')

# Standardize the 'age' and 'income' columns to zero mean and unit variance
scaler = StandardScaler()
data[['age', 'income']] = scaler.fit_transform(data[['age', 'income']])

# Save scaled data
data.to_csv('scaled_data.csv', index=False)
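Example sketch for feature extraction (a minimal illustration, assuming the raw data has a 'date' column of timestamps):
Python code
import pandas as pd

# Load data, parsing the assumed 'date' column as timestamps
data = pd.read_csv('data.csv', parse_dates=['date'])

# Extract simple calendar features from the timestamp
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek

# Save data with the extracted features
data.to_csv('extracted_data.csv', index=False)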
Feature Selection Techniques:
Filter methods: score each feature with a statistical test (such as the F-test or chi-squared test) and keep the highest-scoring ones, independently of any model.
Embedded methods: let a model select features as part of its own training, for example through Lasso coefficients or tree-based feature importances (see the sketch after the chi-squared example below).
Example code for feature selection:
Python code
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Load data
data = pd.read_csv('data.csv')

# Split into features and target
X = data.drop(['target'], axis=1)
y = data['target']

# Select the top 5 features based on the F-test
selector = SelectKBest(f_regression, k=5)
X_new = selector.fit_transform(X, y)

# Save the selected features
selected_features = X.columns[selector.get_support()]
selected_data = pd.DataFrame(X_new, columns=selected_features)
selected_data['target'] = y
selected_data.to_csv('selected_data.csv', index=False)
Example code for feature selection using the chi-squared test:
Python code
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Load data
data = pd.read_csv('data.csv')

# Define the dependent variable and independent variables
Y = data['Target']
X = data.drop('Target', axis=1)

# Apply feature selection using the chi-squared test
# (note: chi2 requires all feature values to be non-negative)
selector = SelectKBest(chi2, k=5)
X_new = selector.fit_transform(X, Y)

# Print the names of the selected features
print(X.columns[selector.get_support(indices=True)])
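The two examples above are filter methods. Here is a minimal sketch of an embedded method using scikit-learn's SelectFromModel with a Lasso regressor; the 'target' column name mirrors the F-test example, and the alpha value is an arbitrary assumption:
Python code
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Load data ('target' column name assumed, as in the F-test example)
data = pd.read_csv('data.csv')
X = data.drop(['target'], axis=1)
y = data['target']

# Fit a Lasso model; features whose coefficients shrink to zero are dropped
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)

# Print the features the model kept
print(X.columns[selector.get_support()])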
Dimensionality Reduction Techniques:
Principal Component Analysis (PCA): projects the data onto a smaller set of orthogonal directions (principal components) that capture as much of the variance as possible.
Example code for PCA:
Python code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Fit PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Convert to a pandas DataFrame for visualization
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['target'] = y

# Plot the data
sns.scatterplot(x='PC1', y='PC2', hue='target', data=df)
plt.title('PCA with 2 components')
plt.show()
This code loads the iris dataset, performs PCA with 2 components, and visualizes the data in a scatterplot. The resulting plot shows how the different species of iris are separated in the reduced two-dimensional space.
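To check how much information the two components retain, the fitted PCA object exposes the explained_variance_ratio_ attribute; for the iris data the first two components together capture well over 90% of the total variance:
Python code
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)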
Please share your opinion about the content of this blog so that I can keep improving it. Thank you. Dr. Srinivas