This project was developed to determine the cost of rent in Brazil. The interactive interface is based on Streamlit, which allows you to easily interact with the model and analyze the results.
1. Project objective: To create an analytical tool for researching the value of rents in Brazil, including various factors, including regions.
2. Project tasks:
- Data analysis: to identify the key factors that influence the cost;
- Model building: using machine learning and statistical analyses to create a model that can predict rental values;
- User interface: develop an interactive interface that will allow users to enter new data, analyze the results and make predictions based on the model.
The project was implemented using the following technologies:
- Python: the main programming language;
- Docker Compose: to simplify the process of deploying and managing the project in the Docker environment.
- Pandas: for data processing;
- Numpy: for numerical calculations;
- Scikit-learn: for building and evaluating machine learning models;
- Matplotlib and Seaborn: for data visualization;
- Streamlit: for creating an interactive interface;
- Joblib: for efficient serialization (saving) and loading of Python objects.
The dataset used for this project has the following characteristics:
- https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset
- format:
.csv
; - contains the following key columns:
city
,rent amount
,bill_avg
,parking spaces
, etc.
This dataset contains 10692 rental properties with 13 different characteristics:
- city: the city in which the property is located;
- area: area of the property;
- rooms: number of rooms;
- bathroom: number of bathrooms;
- parking spaces: number of parking spaces;
- floor: floor;
- animals: permission to stay with animals;
- furniture: furniture;
- hoa (R$) : homeowners association tax;
- rent amount (R$): the amount of rent;
- property tax: Municipal property tax;
- fire insurance (R$): the cost of fire insurance;
- total (R$): the total sum of all values.
✅ In São Paulo, the average cost of rent is higher than in other cities. And this is not surprising, as it is the center of a large agglomeration with a population of 23 million, also one of the largest in the world.
Correlation analysis conclusions for rent amount:
Strong positive relationship: variables that are highly positively correlated with rent amount have a direct relationship. That is, when the value of these variables increases, the rent also increases;
-
fire insurance (0.987): very strong correlation with rent amount. This is expected as the amount of insurance can be proportional to the rent;
-
bathrooms (0.666): having more bathrooms is associated with higher rents. This indicates that living spaces with more amenities are more expensive;
-
parking spaces (0.574): housing with parking spaces has higher rents, as it is often a sign of luxury or convenience;
-
rooms (0.537): a larger number of rooms is also associated with higher rents, which is consistent with the logic of larger areas;
-
City São Paulo (0.25): location strongly influences higher rents.
Weak positive relationship:
-
area (0.178): area has a moderate positive correlation, but is not a key factor. This may indicate that a larger area does not always mean significantly higher rents;
-
property tax (0.107): property tax has a weak impact. It is probably taken into account by property owners, but is not a direct indicator of rents.
Almost neutral impact:
-
floor (0.071): floor has a very weak impact on rents, which may depend on the city and the architecture of the buildings;
-
animal accept (0.06): minor impacts on rent levels;
-
hoa (0.052): small relationship with homeowners' association fees. This may only affect certain types of housing (e.g., apartments in condominiums);
Negative correlation: variables with a negative correlation have an inverse relationship, i.e., when the value of the variable increases, the rent decreases.
-
animal_not_accept (-0.06): in premises where animals are not allowed, rents are slightly lower;
-
furniture_not furnished (-0.17): show an inverse relationship, possibly due to tenant preferences.
Cities, for example, city_Belo Horizonte, city_Campinas: In cities with a negative correlation with rent amount, rents may be lower compared to the base cities (e.g. city_São Paulo).
-
the following models were tested in the project: LinearRegression; Linear regression with Lasso, Ridge, ElasticNet regularization; SVR, RandomForestRegressor;
-
was used to select the best hyperparameters: GridSearchCV.
Model | MAE | MSE | RMSE | R2 |
---|---|---|---|---|
Linear Regression (Train) | 296.182627 | 2.156508e+05 | 464.382169 | 0.981444 |
Linear Regression (Test) | 321.153450 | 6.703823e+05 | 818.768761 | 0.946516 |
Linear regression with Lasso regularization (Train) | 293.663125 | 2.247485e+05 | 474.076464 | 0.980661 |
Linear regression with Lasso regularization (Test) | 315.717036 | 3.939180e+05 | 627.628853 | 0.968573 |
SVR (Train) | 284.258514 | 1.639049e+06 | 1280.253446 | 0.858968 |
SVR (Test) | 291.741362 | 4.293574e+05 | 655.253668 | 0.965745 |
Random Forest Regressor (Train) | 123.370415 | 5.525399e+04 | 235.061670 | 0.995246 |
Random Forest Regressor (Test) | 342.022606 | 9.401107e+05 | 969.593048 | 0.924996 |
Linear regression with Ridge regularization (Train) | 296.434150 | 2.156853e+05 | 464.419328 | 0.981441 |
Linear regression with Ridge regularization (Test) | 321.891060 | 6.876466e+05 | 829.244573 | 0.945138 |
Linear regression with Elastic Net regularization (Train) | 294.810219 | 2.231826e+05 | 472.422034 | 0.980796 |
Linear regression with Elastic Net regularization (Test) | 294.810219 | 2.231826e+05 | 472.422034 | 0.980796 |
Model evaluation is based on several metrics: MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R² (coefficient of determination). The best model can be chosen depending on the purpose of the analysis, but in general, lower errors (MAE, MSE, RMSE) and higher R² indicate a better model.
❎ Best model:
Best model: Elastic Net (a regularized linear regression). Why:
High R² on training (0.9808) and test (0.9808) data. The lowest RMSE on the test data (472.42), indicating the best fit to real-world values.
Elastic Net balances the regularization, avoiding overfitting.
Alternative model: Lasso Regression (a regularized linear regression). Why: The second best R² on the test data (0.9686). Low RMSE (627.63), which also indicates good consistency. Less sensitive to collinearity than simple linear regression.
⭕ Worst Model:
Random Forest overfits the training data, with the highest test errors and reduced generalizability. It may require hyperparameter tuning to improve its performance.
Clone the repository:
git clone https://github.com/MariiaSam/Rent-in-Brazil.git
cd Rent-in-Brazil
Set up the virtual environment with Poetry
Set up project dependencies:
poetry install
To activate the virtual environment, run the command:
poetry shell
To add a dependency to a project, run the command:
poetry add <package_name>
To pull in existing dependencies:
poetry install
Run the Streamlit application with the command:
streamlit run app.py
This project also supports Docker containerization, which makes it easy to run the application without having to manually configure the environment.
1. Starting a project via Docker
Use the command to run a container based on your Docker Image:
docker run --rm -d -p 8880:8880 rentinbrazil:latest
--rm - automatically removes the container after stopping;
-d - runs the container in the background;
-p 8880:8880 - displays port 8880 on the local machine to access the application.
2. Access to the application:
After starting the container, the application will be available at the address:
http://localhost:8880
- Stopping the project:
To stop the project, do:
docker ps
Then use the command to stop the container:
docker stop <container_id>
Where <container_id> is your container's identifier, obtained from the docker ps command.