
What is Probability and Statistics

Understand Probability and Statistics in Data Science

Contents of Probability and Statistics

  • Probability Basics
  • Random Variables and Probability Distributions
  • Statistical Inference (Hypothesis Testing, Confidence Intervals)
  • Regression Analysis

Probability Basics

Solution (coin toss example)

Sample Space = {H, T} (where H stands for Head and T stands for Tail)

Solution (die roll example)

The sample space is {1, 2, 3, 4, 5, 6}. Each outcome is equally likely, so each outcome has probability 1/6.

Hypothesis testing involves making a decision about a population parameter based on sample data.

The null hypothesis (H0) is the hypothesis that there is no effect or difference, for example that a population parameter equals a specified value.

The alternative hypothesis (Ha) is the hypothesis that there is an effect or difference, for example that the population parameter differs from that specified value.



The hypothesis testing process involves the following steps:

Formulate the null and alternative hypotheses.

Choose an appropriate test statistic and significance level (alpha) based on the research question and data. Calculate the test statistic and corresponding p-value.

Compare the p-value to the significance level. If the p-value is less than or equal to the significance level, reject the null hypothesis; otherwise, fail to reject it.

Draw conclusions based on the results.

For example, suppose we want to test the hypothesis that the mean height of a population is equal to 68 inches, based on a sample of 50 individuals. Our null hypothesis would be that the population mean is 68 inches, and our alternative hypothesis would be that the population mean is not equal to 68 inches.
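The height example above can be sketched in Python as a two-sided one-sample t-test. This is only an illustration: the sample is simulated (the mean of 69 and standard deviation of 3 used to generate it are assumptions, not real data), the significance level of 0.05 is an assumed choice, and scipy is assumed to be installed.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 simulated heights in inches (not real data).
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=69.0, scale=3.0, size=50)

# H0: population mean = 68 inches; Ha: population mean != 68 inches.
alpha = 0.05  # assumed significance level
t_stat, p_value = stats.ttest_1samp(heights, popmean=68.0)

# Reject H0 only when the p-value is at or below the significance level.
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, decision: {decision}")
```

With real data, the simulated array would be replaced by the measured heights; the rest of the steps stay the same.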

A confidence interval is a range of values that, at a chosen confidence level, is likely to contain the true population parameter. It is calculated from sample data and the selected confidence level.

Steps in Regression Analysis:

Define the research question: The first step is to clearly define the research question and the objective of the analysis.

Collect data: The next step is to collect relevant data for the analysis.

Clean the data: After collecting data, it needs to be cleaned by removing any irrelevant or incomplete data, handling missing values and outliers, and transforming the data if necessary.

Visualize the data: The data needs to be visualized to understand the relationship between the dependent variable and independent variables.

Choose the regression model: Based on the nature of the problem and the data, choose the appropriate regression model.

Fit the model: Once the regression model is selected, the next step is to fit the model to the data using the chosen algorithm.

Evaluate the model: Evaluate the model by checking the assumptions of the model and measuring the accuracy of the predictions.

Use the model: After evaluating the model, it can be used to make predictions and draw conclusions about the relationship between the variables.

The general linear regression model is:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Multiple linear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables. The general multiple linear regression model can be represented as:


Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

    Logistic Regression:

Logistic regression is a statistical technique for modelling the relationship between a binary dependent variable and one or more independent variables. The logistic regression model uses the logistic function to transform the linear combination of the independent variables into the probability that the dependent variable belongs to one of the two classes.

We then fit our logistic regression model using the sm.Logit() function and pass in our dependent variable Y and independent variables X as arguments. We assign the result of this to the model variable.

Finally, we print the summary of our model using the summary() function.

It's important to note that in logistic regression, the dependent variable should be binary (taking on only two values, usually 0 and 1). In our example, we assume that our dependent variable AdClicked is binary. If it is not, we would need to transform it to a binary variable before fitting a logistic regression model.

Probability and Statistics

Probability and statistics are foundational concepts in data science. Probability refers to the likelihood of an event occurring, while statistics involves collecting, analyzing, and interpreting data.

Probability is a measure of how likely an event is to occur. It is expressed as a number between 0 and 1, where 0 means the event is impossible and 1 means the event is certain.

The basic concepts in probability are:

Sample space: The set of all possible outcomes of an experiment.

Event: A subset of the sample space.

Probability: The likelihood of an event occurring, represented as a number between 0 and 1.

Example: A fair coin is tossed. How likely is it that you will obtain a head?

Event = {H}

Probability of getting a head = P(H) = Number of favorable outcomes / Total number of outcomes = 1/2 = 0.5

Python Code for calculating Probability:

python code

# Import required libraries
import numpy as np

# Estimate the probability of getting a head by simulating coin tosses
n_trials = 10000
n_heads = np.sum(np.random.choice(['H', 'T'], size=n_trials) == 'H')
P_H = n_heads / n_trials

print(f"The estimated probability of getting a head is {P_H}")

Random Variables and Probability Distributions

Variables whose values are decided by chance are known as random variables. Probability distributions describe the likelihood of different values occurring for a random variable.

Probability distributions come in two flavors: discrete and continuous.

    Discrete Probability Distributions:

Discrete random variables have finite or countably infinite possible values.

Examples: Bernoulli, Binomial, Poisson distributions.

    Continuous Probability Distributions:

Continuous random variables can take any value within a range.

Examples: Normal, Exponential, Gamma distributions.
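As a rough illustration of the difference, a discrete distribution assigns probability to individual values, while a continuous one assigns probability to ranges of values. The sketch below assumes scipy is installed; the parameters (10 trials with success probability 0.3, and a standard normal distribution) are arbitrary choices for the example.

```python
from scipy import stats

# Discrete: probability of exactly 3 successes in 10 trials with p = 0.3.
p_three = stats.binom.pmf(3, n=10, p=0.3)

# Continuous: probability that a standard normal value is at most 0.
p_below_mean = stats.norm.cdf(0.0, loc=0.0, scale=1.0)

print(f"P(X = 3) for Binomial(10, 0.3): {p_three:.4f}")
print(f"P(Z <= 0) for Normal(0, 1): {p_below_mean:.4f}")
```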

Example: Suppose that a die is rolled. What is the probability distribution of the number rolled?

X        P(X=x)

1        1/6

2        1/6

3        1/6

4        1/6

5        1/6

6        1/6
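The table can be checked with a short calculation. The sketch below uses exact fractions from Python's standard library to confirm that the probabilities sum to 1; the expected value and the probability of an even roll are extra illustrations that are not part of the original example.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# A valid probability distribution must sum to exactly 1.
total = sum(pmf.values())

# Expected value: each outcome weighted by its probability.
expected_value = sum(x * p for x, p in pmf.items())

# Probability of an event, here rolling an even number: {2, 4, 6}.
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)

print(total, expected_value, p_even)  # prints: 1 7/2 1/2
```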

Python Code for Discrete Probability Distribution (Bernoulli Distribution):

python code

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Bernoulli distribution (discrete probability distribution)
n_trials = 10000
p = 0.3  # Probability of success

# A binomial draw with a single trial is a Bernoulli draw
bernoulli_dist = np.random.binomial(1, p, n_trials)

# Plot the counts of 0s (failures) and 1s (successes)
sns.displot(bernoulli_dist)
plt.title('Bernoulli Distribution')
plt.show()

Statistical Inference

Drawing conclusions about a population based on a sample of its data is known as statistical inference. It involves using sample data to estimate population parameters and to test hypotheses about those parameters.

    Hypothesis Testing:

Confidence Intervals:

For example, suppose we want to estimate the mean weight of a population with a 95% confidence interval, based on a sample of 100 individuals. We calculate the sample mean and standard deviation and use a t-distribution to find the confidence interval. Our resulting interval might be (150, 170) pounds, which means that we are 95% confident that the true population mean weight falls within this range.
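A minimal sketch of that calculation follows. The weights are simulated (the mean of 160 and standard deviation of 25 used to generate them are assumptions, not real data), so the resulting interval will not match the (150, 170) figure above; scipy is assumed to be installed.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 100 simulated weights in pounds (not real data).
rng = np.random.default_rng(seed=1)
weights = rng.normal(loc=160.0, scale=25.0, size=100)

n = weights.size
mean = weights.mean()
sem = weights.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Two-sided 95% interval: use the 97.5th percentile of t with n - 1 df.
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"Sample mean: {mean:.1f}")
print(f"95% confidence interval: ({ci_low:.1f}, {ci_high:.1f})")
```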

    Regression Analysis:

Regression analysis is a statistical technique for modelling the relationship between a dependent variable and one or more independent variables. It is frequently used for forecasting and prediction.

The most common type of regression analysis is linear regression, which involves fitting a line to the data that best represents the relationship between the variables.

The general linear regression model is:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

where Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2, ..., βk are the regression coefficients, and ε is the error term.

The goal of regression analysis is to estimate the regression coefficients and use them to make predictions about the dependent variable. There are several types of regression analysis, including simple linear regression, multiple linear regression, and logistic regression.

Here's an example code for performing simple linear regression using Python's Scikit-learn library:

import numpy as np
from sklearn.linear_model import LinearRegression

# create some example data
X = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 4, 5, 4, 6])

# create a linear regression object and fit the model
model = LinearRegression()
model.fit(X, y)

# make a prediction for a new value of X
new_X = np.array([6]).reshape((-1, 1))
new_y = model.predict(new_X)

# print the coefficients and the predicted value
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Predicted value:", new_y)

This code creates some example data with one independent variable (X) and one dependent variable (y), creates a LinearRegression object, fits the model to the data, and then uses the model to make a prediction for a new value of X. Finally, it prints the coefficients of the linear equation, the intercept, and the predicted value.
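To see what the fit is doing, the same slope and intercept can be computed by hand with the ordinary least squares formulas for a single independent variable. The sketch below uses only NumPy and the same example data as the code above; the R-squared calculation is an extra illustration of how the fit might be evaluated. Its results should agree with the scikit-learn output up to floating-point rounding.

```python
import numpy as np

# Same example data as in the scikit-learn code above.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

# Least squares estimates for the model y = intercept + slope * x.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Prediction for a new value of x.
predicted = intercept + slope * 6.0

# R-squared: share of the variance in y explained by the fitted line.
residuals = y - (intercept + slope * x)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"predicted={predicted:.3f}, R^2={r_squared:.3f}")
```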

Here is a second example of simple linear regression in Python, this time plotting the data and the fitted line:

python code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create some example data
x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
y = np.array([2, 3, 4, 5, 6])

# Create a linear regression model
model = LinearRegression()

# Fit the model to the data
model.fit(x, y)

# Make a prediction
x_new = np.array([6]).reshape((-1, 1))
y_new = model.predict(x_new)

# Plot the data and the regression line
plt.scatter(x, y)
plt.plot(x, model.predict(x), color='red')
plt.show()

This code creates a small example dataset and fits a linear regression model to it using the LinearRegression class from the scikit-learn library. It then makes a prediction for a new data point and plots the data and the regression line using matplotlib.

    Multiple Linear Regression:

In the multiple linear regression model Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2, ..., βk are the regression coefficients, and ε is the error term.

To perform multiple linear regression in Python, we can use the statsmodels library. Here is an example code:

python code

import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_csv('data.csv')

# Define dependent variable and independent variables
Y = data['Sales']
X = data[['TV', 'Radio', 'Newspaper']]

# Add constant term to independent variables
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(Y, X).fit()

# Print model summary
print(model.summary())

The general logistic regression model can be represented as:

P(Y=1) = e^(β0 + β1X1 + β2X2 + ... + βkXk) / (1 + e^(β0 + β1X1 + β2X2 + ... + βkXk))

where Y is the dependent variable, X1, X2, ..., Xk are the independent variables, β0, β1, β2,..., βk are the regression coefficients, and P(Y=1) is the probability of the dependent variable being in the class 1.
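The formula can be evaluated directly. The sketch below is an illustration only: the function name logistic_probability and the coefficient values are hypothetical, chosen for the example rather than estimated from data. It uses the algebraically equivalent form 1 / (1 + e^(-z)), where z = β0 + β1X1 + ... + βkXk.

```python
import numpy as np

def logistic_probability(x, beta0, betas):
    """Return P(Y=1) for feature values x under a logistic model."""
    z = beta0 + np.dot(x, betas)       # linear combination of the inputs
    return 1.0 / (1.0 + np.exp(-z))    # same as e^z / (1 + e^z)

# Hypothetical coefficients and one observation with two features.
beta0 = -1.0
betas = np.array([0.5, 0.25])
x = np.array([2.0, 4.0])

prob = logistic_probability(x, beta0, betas)
print(f"P(Y=1) = {prob:.4f}")
```

When the linear combination z equals 0, the probability is exactly 0.5; larger values of z push the probability toward 1 and smaller values push it toward 0.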

To perform logistic regression in Python, we can use the statsmodels library. Here is an example code:

Python code

import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_csv('data.csv')

# Define dependent variable and independent variables
# (this assumes all columns, including Gender, are already numeric)
Y = data['AdClicked']
X = data[['Age', 'Income', 'Gender']]

# Add constant term to independent variables
X = sm.add_constant(X)

# Fit the model
model = sm.Logit(Y, X).fit()

# Print model summary
print(model.summary())

In the code above, we first load our data using pandas and define our dependent variable Y and independent variables X. We then add a constant term to our independent variables using the sm.add_constant() function from statsmodels.
