Understand Probability and Statistics in Data Science
Contents of Probability and Statistics
- Probability Basics
- Random Variables and Probability Distributions
- Statistical Inference (Hypothesis Testing, Confidence Intervals)
- Regression Analysis
Probability Basics
Solution: Sample Space = {H, T} (where H stands for Head and T stands for Tail)
Solution:
The sample space is {1, 2, 3, 4, 5, 6}. Each outcome is equally likely, so the probability distribution is:
Hypothesis testing involves making a decision about a population parameter based on sample data.
The null hypothesis (H0) states that there is no effect or difference, for example that a population parameter equals a specified value.
The alternative hypothesis (Ha) states that there is an effect or difference, for example that the population parameter does not equal that value.
The hypothesis testing process involves the following steps:
Formulate the null and alternative hypotheses.
Choose an appropriate test statistic and significance level (alpha) based on the research question and data.
Calculate the test statistic and the corresponding p-value.
Compare the p-value to the significance level. If the p-value is less than or equal to the significance level, reject the null hypothesis; otherwise, fail to reject it.
Draw conclusions based on the results.
For example, suppose we want to test the hypothesis that the mean height of a population is equal to 68 inches, based on a sample of 50 individuals. Our null hypothesis would be that the population mean is 68 inches, and our alternative hypothesis would be that the population mean is not equal to 68 inches.
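The height example can be sketched in code as follows. This is a minimal illustration, not a definitive procedure: the sample mean (69.1) and sample standard deviation (3.0) are hypothetical numbers invented for the example, and the helper name z_test_mean is an assumption. It uses a large-sample z-test from the Python standard library as an approximation; with a sample of 50, a t-test would give a very similar answer.

```python
import math
from statistics import NormalDist


def z_test_mean(sample_mean, sample_sd, n, mu0):
    """Two-sided large-sample z-test for H0: population mean == mu0."""
    standard_error = sample_sd / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Two-sided p-value: probability of a statistic at least this extreme under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical summary statistics for a sample of 50 individuals
z, p_value = z_test_mean(sample_mean=69.1, sample_sd=3.0, n=50, mu0=68.0)

alpha = 0.05  # significance level
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the p-value falls below 0.05, so the null hypothesis of a 68-inch mean would be rejected; different sample statistics could lead to the opposite decision.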
A confidence interval is a range of values, calculated from sample data at a selected confidence level, that is likely to contain the true population parameter. For example, at a 95% confidence level, about 95% of intervals constructed this way from repeated samples would contain the true value.
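As a rough sketch of how such an interval can be computed for a population mean, the code below uses the normal approximation (mean plus or minus a critical value times the standard error). The sample statistics are hypothetical, the function name mean_confidence_interval is an assumption, and for small samples a t-based interval would be the more usual choice.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # Critical value, e.g. about 1.96 for a 95% confidence level
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: 50 heights with mean 69.1 inches and standard deviation 3.0
lower, upper = mean_confidence_interval(sample_mean=69.1, sample_sd=3.0, n=50)
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level (for example to 0.99) widens the interval, while a larger sample size narrows it.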
Define the research question: The first step is to clearly define the research question and the objective of the analysis.
Collect data: The next step is to collect relevant data for the analysis.
Clean the data: After collecting the data, clean it by removing irrelevant or incomplete records, handling missing values and outliers, and transforming the data if necessary.
Visualize the data: The data needs to be visualized to understand the relationship between the dependent variable and independent variables.
Choose the regression model: Based on the nature of the problem and the data, choose the appropriate regression model.
Fit the model: Once the regression model is selected, the next step is to fit the model to the data using the chosen algorithm.
Evaluate the model: Evaluate the model by checking the assumptions of the model and measuring the accuracy of the predictions.
Use the model: After evaluating the model, it can be used to make predictions and draw conclusions about the relationship between the variables.
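The "Evaluate the model" step is often done with summary metrics. The sketch below computes two common ones, R-squared and root mean squared error (RMSE), in plain Python. The actual and predicted values are a small hypothetical example, and the function names are illustrative rather than taken from any library.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed values and the predictions of a fitted model
actual = [2, 4, 5, 4, 6]
predicted = [2.6, 3.4, 4.2, 5.0, 5.8]

print(f"R-squared = {r_squared(actual, predicted):.3f}")
print(f"RMSE = {rmse(actual, predicted):.3f}")
```

An R-squared closer to 1 and an RMSE closer to 0 indicate a better fit, although checking the model assumptions (for example by inspecting the residuals) is still needed.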
The general linear regression model is:
Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables. The general multiple linear regression model can be represented as:
Logistic Regression:
Logistic regression is a statistical technique for modeling the relationship between a binary dependent variable and one or more independent variables. The logistic regression model uses the logistic function to transform the linear equation of the independent variables into a probability of the dependent variable being in one of the binary classes.
We then fit our logistic regression model using the sm.Logit() function and pass in our dependent variable Y and independent variables X as arguments. We assign the result of this to the model variable.
Finally, we print the summary of our model using the summary() function.
It's important to note that in logistic regression, the dependent variable should be binary (taking on only two values, usually 0 and 1). In our example above, we assume that our dependent variable AdClicked is binary. If it is not, we would need to transform it to a binary variable before fitting a logistic regression model.
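One simple way to do such a transformation is sketched below. It assumes, purely for illustration, that the raw values are stored as the strings "yes" and "no"; the helper name to_binary and the sample values are hypothetical.

```python
def to_binary(values, positive_label):
    """Map each value to 1 if it equals positive_label, otherwise 0."""
    return [1 if value == positive_label else 0 for value in values]


# Hypothetical raw values of the dependent variable
ad_clicked_raw = ["yes", "no", "no", "yes", "yes"]
ad_clicked = to_binary(ad_clicked_raw, positive_label="yes")
print(ad_clicked)
```

The resulting list contains only 0 and 1, which is the form a logistic regression model expects for its dependent variable.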
Probability and Statistics
Probability and statistics are foundational concepts in data science. Probability refers to the likelihood of an event occurring, while statistics involves collecting, analyzing, and interpreting data.
Probability is a measure of how likely an event is. It is expressed as a number between 0 and 1, where 0 means the event is impossible and 1 means the event is certain.
The basic concepts in probability are:
Sample space: The set of all possible outcomes of an experiment.
Event: A subset of the sample space.
Probability: The likelihood of an event occurring, represented as a number between 0 and 1.
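These three concepts can be illustrated with a short sketch. It assumes all outcomes are equally likely, so the probability of an event is the number of favorable outcomes divided by the total number of outcomes. The two-coin experiment and the function name event_probability are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product


def event_probability(sample_space, event):
    """P(event) = favorable outcomes / total outcomes, for equally likely outcomes."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))


# Sample space for two tosses of a fair coin: HH, HT, TH, TT
sample_space = ["".join(tosses) for tosses in product("HT", repeat=2)]

# Event: at least one head appears
at_least_one_head = {outcome for outcome in sample_space if "H" in outcome}

p = event_probability(sample_space, at_least_one_head)
print(f"Sample space: {sample_space}")
print(f"P(at least one head) = {p}")
```

Here the event is a subset of the sample space (three of the four outcomes), so its probability is 3/4.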
Example: A fair coin is tossed. How likely is it that you will obtain a head?
Event = {H}
Probability of getting a head = P(H) = Number of favorable outcomes / Total number of outcomes = 1/2 = 0.5
Python Code for calculating Probability:
python code
# Importing Required Libraries
import numpy as np
# Estimate the probability of getting a Head by simulating coin tosses
n_trials = 10000
n_heads = np.sum(np.random.choice(['H', 'T'], size=n_trials) == 'H')
P_H = n_heads / n_trials
print(f"The estimated probability of getting a Head is {P_H}")
Random Variables and Probability Distributions
A random variable is a variable whose value is determined by the outcome of a random experiment. Probability distributions describe the likelihood of different values occurring for a random variable.
Probability distributions come in two flavors: discrete and continuous.
Discrete Probability Distributions:
Discrete random variables have finite or countably infinite possible values.
Examples: Bernoulli, Binomial, Poisson distributions.
Continuous Probability Distributions:
Continuous random variables can take any value within a range.
Examples: Normal, Exponential, Gamma distributions.
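For a continuous random variable, probabilities are attached to intervals rather than to single values. The sketch below illustrates this with a normal distribution from the Python standard library; the mean of 68 and standard deviation of 3 are hypothetical values, and interval_probability is an illustrative helper name.

```python
from statistics import NormalDist

# Hypothetical continuous variable: heights in inches, Normal(mean=68, sd=3)
heights = NormalDist(mu=68, sigma=3)


def interval_probability(dist, low, high):
    """P(low < X < high), computed from the cumulative distribution function."""
    return dist.cdf(high) - dist.cdf(low)


# Probability of falling within one standard deviation of the mean
p_within_one_sd = interval_probability(heights, 65, 71)
print(f"P(65 < X < 71) = {p_within_one_sd:.4f}")

# The density at a point is not a probability; it only describes relative likelihood
print(f"Density at the mean = {heights.pdf(68):.4f}")
```

The first value is roughly 0.68, the familiar share of a normal distribution that lies within one standard deviation of its mean.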
Example: Suppose that a die is rolled. What is the probability distribution of the number rolled?
X P(X=x)
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
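The table above can also be written as a small dictionary, which makes it easy to check that the probabilities sum to 1 and to derive other quantities. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die: each face has probability 1/6
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# A valid probability distribution must sum to 1
total = sum(die_pmf.values())

# Expected value: each face weighted by its probability
expected = sum(face * p for face, p in die_pmf.items())

# Probability of an event, here rolling an even number
p_even = sum(p for face, p in die_pmf.items() if face % 2 == 0)

print(f"Total probability = {total}")
print(f"Expected value = {expected}")
print(f"P(even) = {p_even}")
```

Using exact fractions avoids floating-point rounding, so the total is exactly 1, the expected value is 7/2, and the probability of an even roll is 1/2.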
Python Code for Discrete Probability Distribution (Bernoulli Distribution):
python code
# Importing Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Bernoulli Distribution (Discrete Probability Distribution)
n_trials = 10000
p = 0.3 # Probability of success
bernoulli_dist = np.random.binomial(1, p, n_trials)
sns.displot(bernoulli_dist)
plt.title('Bernoulli Distribution')
plt.show()
Statistical Inference
Statistical inference is the process of drawing conclusions about a population from a sample of its data. It involves using sample data to estimate population parameters and to test hypotheses about those parameters.
Hypothesis Testing:
Confidence Intervals:
Regression Analysis:
Regression analysis is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It is frequently used for forecasting and prediction.
The most common type of regression analysis is linear regression, which involves fitting a line to the data that best represents the relationship between the variables.
Steps in Regression Analysis:
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
where Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2, ..., βk are the regression coefficients, and ε is the error term.
The goal of regression analysis is to estimate the regression coefficients and use them to make predictions about the dependent variable. There are several types of regression analysis, including simple linear regression, multiple linear regression, and logistic regression.
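To make "estimating the regression coefficients" concrete, here is a minimal sketch of ordinary least squares for the simplest case of one independent variable, written in plain Python. The dataset is a small hypothetical example and the function name fit_simple_linear_regression is an assumption; libraries such as scikit-learn or statsmodels perform the same estimation for the general case.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Least-squares estimates of the intercept (β0) and slope (β1)."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: covariation of x and y divided by the variation of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # Intercept: forces the fitted line through the point of means
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Small hypothetical dataset
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

intercept, slope = fit_simple_linear_regression(x, y)
prediction = intercept + slope * 6  # predicted Y for a new value X = 6
print(f"Intercept = {intercept:.2f}, slope = {slope:.2f}, prediction = {prediction:.2f}")
```

For this data the estimates work out to an intercept of about 1.8 and a slope of about 0.8, so the prediction for X = 6 is about 6.6.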
Here is example code for performing simple linear regression using Python's scikit-learn library:
import numpy as np
from sklearn.linear_model import LinearRegression
# create some example data
X = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 4, 5, 4, 6])
# create a linear regression object and fit the model
model = LinearRegression()
model.fit(X, y)
# make a prediction for a new value of X
new_X = np.array([6]).reshape((-1, 1))
new_y = model.predict(new_X)
# print the coefficients and the predicted value
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Predicted value:", new_y)
This code creates some example data with one independent variable (X) and one dependent variable (y), creates a LinearRegression object, fits the model to the data, and then uses the model to make a prediction for a new value of X. Finally, it prints the coefficients of the linear equation, the intercept, and the predicted value.
Here is a second example of simple linear regression in Python, this time with a plot of the data and the fitted line:
python code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Create a small example dataset
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 3, 4, 5, 6])
# Create a linear regression model
model = LinearRegression()
# Fit the model to the data
model.fit(x, y)
# Make a prediction for a new data point
x_new = np.array([6]).reshape((-1, 1))
y_new = model.predict(x_new)
print("Predicted value:", y_new)
# Plot the data and the regression line
plt.scatter(x, y)
plt.plot(x, model.predict(x), color='red')
plt.show()
This code creates a small example dataset and fits a linear regression model to it using the LinearRegression class from the scikit-learn library. It then makes a prediction for a new data point and plots the data and the regression line using matplotlib.
Multiple Linear Regression:
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
where Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2, ..., βk are the regression coefficients, and ε is the error term.
To perform multiple linear regression in Python, we can use the statsmodels library. Here is an example:
python code
import pandas as pd
import statsmodels.api as sm
# Load data
data = pd.read_csv('data.csv')
# Define dependent variable and independent variables
Y = data['Sales']
X = data[['TV', 'Radio', 'Newspaper']]
# Add constant term to independent variables
X = sm.add_constant(X)
# Fit the model
model = sm.OLS(Y, X).fit()
# Print model summary
print(model.summary())
The general logistic regression model can be represented as:
P(Y=1) = e^(β0 + β1X1 + β2X2 + ... + βkXk) / (1 + e^(β0 + β1X1 + β2X2 + ... + βkXk))
where Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2, ..., βk are the regression coefficients, and P(Y=1) is the probability of the dependent variable being in class 1.
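The formula can be evaluated directly to see how a linear combination of the independent variables becomes a probability. In the sketch below the coefficient values and the single feature (an age of 40) are hypothetical, and logistic_probability is an illustrative helper. It uses the form 1 / (1 + e^(-z)), which is algebraically the same as e^z / (1 + e^z).

```python
import math


def logistic_probability(intercept, coefficients, features):
    """P(Y=1) for a logistic regression model with the given coefficients."""
    # Linear part: β0 + β1X1 + ... + βkXk
    linear_term = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Logistic function maps any real number into the interval (0, 1)
    return 1 / (1 + math.exp(-linear_term))


# Hypothetical model: β0 = -3.0 and β1 = 0.05 for a single feature (age = 40)
p_click = logistic_probability(intercept=-3.0, coefficients=[0.05], features=[40])
print(f"P(Y=1) = {p_click:.4f}")
```

When the linear part equals 0 the probability is exactly 0.5; larger positive values push it toward 1 and larger negative values push it toward 0.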
To perform logistic regression in Python, we can use the statsmodels library. Here is an example:
Python code
import pandas as pd
import statsmodels.api as sm
# Load data
data = pd.read_csv('data.csv')
# Define dependent variable and independent variables
Y = data['AdClicked']
# Assumption: Gender is already encoded numerically (for example, 0/1)
X = data[['Age', 'Income', 'Gender']]
# Add constant term to independent variables
X = sm.add_constant(X)
# Fit the model
model = sm.Logit(Y, X).fit()
# Print model summary
print(model.summary())
In the code above, we first load our data using pandas and define our dependent variable Y and independent variables X. We then add a constant term to our independent variables using the sm.add_constant() function from statsmodels.
Please share your opinion about the content of this blog so that it can be improved. Thank you. Dr. Srinivas