Overview: This project aims to develop a predictive model using linear regression to estimate the profitability of companies based on various independent variables, including R&D expenditure, administration costs, and marketing spend. The goal is to identify key factors that influence profit and to provide insights that can guide business strategies.
Data Import and Preparation: The dataset, comprising information on 1000 companies, is imported using the Pandas library. The independent variables (features) include:
R&D Spend: Investment in research and development.
Administration: Administrative expenses.
Marketing Spend: Expenditure on marketing activities.
State: The geographical location of the company.
The dependent variable (target) is Profit, representing the company's profit in USD. The data is then cleaned and preprocessed to prepare it for analysis.
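The import step described above might look like the following minimal sketch. The column names come from the feature list; the inline sample rows, the state labels "A", "B" and "C", and the use of dropna() as a cleaning step are assumptions for illustration only, since the actual file name and cleaning steps are not stated here.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real 1000-company file.
csv_text = """R&D Spend,Administration,Marketing Spend,State,Profit
100000,120000,300000,A,150000
80000,110000,250000,B,130000
60000,130000,200000,C,110000
40000,90000,150000,A,90000
"""

# With the real dataset this would be pd.read_csv(<path to the CSV file>).
df = pd.read_csv(io.StringIO(csv_text))

# Assumed cleaning step: drop rows with missing values.
df = df.dropna()

# Separate the independent variables (features) from the target.
X = df.drop(columns="Profit")
y = df["Profit"]
print(X.shape, y.shape)
```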
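Data Visualization: A correlation matrix of the numeric variables is generated and plotted with Seaborn's heatmap function. The heatmap shows how strongly each feature correlates with Profit and with the other features, which helps identify the most informative predictors and flags features that are strongly correlated with one another.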
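A correlation heatmap of this kind can be sketched as follows. The data is synthetic and generated only so that the example runs on its own; the coefficients used to generate it, the colour map, and the plot title are all assumptions rather than details of the project.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumption): Profit depends mostly on R&D Spend.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 150_000, n),
    "Administration": rng.uniform(50_000, 180_000, n),
    "Marketing Spend": rng.uniform(0, 450_000, n),
})
df["Profit"] = (
    40_000
    + 0.8 * df["R&D Spend"]
    + 0.05 * df["Marketing Spend"]
    + rng.normal(0, 5_000, n)
)

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes("number").corr()

# Annotated heatmap; the colour scale is fixed to [-1, 1] so colours are comparable.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
plt.tight_layout()
# Call plt.show() or plt.savefig(...) here to view the figure.
```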
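Encoding Categorical Data: Since the State variable is categorical, it must be converted to numbers before regression. Label Encoding first maps each state name to an integer, and One-Hot Encoding then expands those integers into one binary (dummy) column per state, so that the model does not treat the states as ordered. One of the dummy columns is then removed to avoid the dummy variable trap: with an intercept in the model, the full set of dummy columns always sums to 1 and would be perfectly collinear.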
Splitting the Dataset: The dataset is divided into training and testing sets using the train_test_split function from Scikit-learn. 80% of the data is used for training the model, while 20% is reserved for testing its performance.
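A minimal sketch of the 80/20 split, using random placeholder arrays in place of the encoded features and profits. The random_state value is an assumption added so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 rows, 5 encoded feature columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# test_size=0.2 reserves 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```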
Model Training: A Multiple Linear Regression model is fitted to the training data using Scikit-learn’s LinearRegression class. This model learns the relationship between the independent variables and the target variable (Profit).
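Fitting the model can be sketched as below. The training data is synthetic and generated from known coefficients (an assumption made purely so the recovered values can be checked); the variable name regressor is also illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumption): y is a known linear function of X plus noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(800, 3))
true_coef = np.array([0.8, -0.05, 0.1])
y_train = 50.0 + X_train @ true_coef + rng.normal(0, 1.0, size=800)

# Ordinary least squares fit; an intercept is estimated by default.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.coef_)       # should be close to true_coef
print(regressor.intercept_)  # should be close to 50.0
```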
Predicting Outcomes: The trained model is then used to predict profits for the test set. Comparing these predictions with the actual profits of the held-out companies shows how well the model generalizes to data it was not trained on.
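The prediction step, together with a side-by-side comparison against the actual values, might look like this. The data is again synthetic, and the comparison table with its column names is an illustrative addition rather than part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) so the example is self-contained.
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(1000, 3))
y = 50.0 + X @ np.array([0.8, -0.05, 0.1]) + rng.normal(0, 5.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# Predict the target for the held-out rows.
y_pred = regressor.predict(X_test)

# Put actual and predicted values next to each other, with the residual.
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
comparison["residual"] = comparison["actual"] - comparison["predicted"]
print(comparison.head())
```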
Model Evaluation: The fitted coefficients and intercept of the regression model are inspected to understand the influence of each feature on profit: each coefficient is the expected change in Profit for a one-unit increase in that feature, with the other features held constant. The R-squared value is also computed to measure the proportion of the variance in Profit that the model explains.
Coefficients: [ -8.80536598e+02, -6.98169073e+02, 5.25845857e-01, 8.44390881e-01, 1.07574255e-01 ]
Intercept: -51035.229724
R-squared Value: 0.9113
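The evaluation described above can be sketched as follows, on synthetic data, so the printed numbers will not match the values reported here. The manual calculation is included to show what r2_score computes: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 3))
y = 50.0 + X @ np.array([0.8, -0.05, 0.1]) + rng.normal(0, 5.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Fitted parameters of the regression.
print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)

# R-squared on the test set, via Scikit-learn and by hand.
r2 = r2_score(y_test, y_pred)
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print("R-squared:", r2, r2_manual)
```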