Bias in Mortgage Lending
Skills learned: model training and evaluation, hyperparameter tuning, minority oversampling, fairness evaluation, correlation analysis, data preprocessing, and data visualization.
-
The mortgage lending system in the United States plays a crucial role in enabling homeownership, a key component of financial stability and wealth creation for many Americans. Despite regulatory efforts such as the Home Mortgage Disclosure Act (HMDA), which mandates transparency by requiring the disclosure of loan information, demographic disparities persist. These biases not only hinder fair access to homeownership but also impact long-term wealth accumulation among disadvantaged communities. Addressing discriminatory practices in mortgage lending is essential not only for ensuring equity in housing opportunities but also for helping financial institutions adhere to fairness laws such as the Fair Housing Act and the Equal Credit Opportunity Act.
Our project aims to employ machine learning techniques to train models on datasets provided under HMDA to predict mortgage approvals. We then plan to use fairness techniques to detect and address biases, expanding the scope beyond the commonly studied racial disparities to include other demographic factors like socioeconomic status and geographic distribution.
The novelty of our project lies in this comprehensive approach to understanding biases within mortgage lending. By moving beyond traditional focus areas such as race to include other variables, our project offers a more holistic view of the issues at hand and supports the development of nuanced models that reflect real-world complexities.
The significance of our project lies in its practical implications for enhancing transparency and accountability within the financial sector. We aim to generate actionable insights that help financial institutions comply with legal standards and promote trust among consumers, thereby fostering a more just financial ecosystem.
-
For this project, we utilized the 2022 Home Mortgage Disclosure Act (HMDA) Public Loan/Application Register (LAR) data, a comprehensive compilation of mortgage applications filed during the year. This dataset is publicly available and substantial in volume, comprising 16,080,210 observations across 99 variables. It incorporates demographic attributes such as race, ethnicity, age, and sex, alongside loan-specific details like loan purpose, type, interest rate, and loan-to-value ratio. The dataset's high dimensionality requires careful preprocessing and feature selection before training and evaluating machine learning models.
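As a rough illustration of the preprocessing involved, the sketch below loads a subset of LAR columns, builds a binary approval label, and one-hot encodes categorical fields. The column subset, file name, and cleaning steps are illustrative assumptions rather than our final feature list, and the delimiter may differ between HMDA releases.

```python
# Hypothetical preprocessing sketch for the 2022 HMDA public LAR snapshot.
# Column names follow the public LAR schema; the subset kept here is illustrative.
import pandas as pd

cols = [
    "action_taken", "loan_purpose", "loan_type", "interest_rate",
    "loan_to_value_ratio", "income", "derived_race", "derived_ethnicity",
    "derived_sex", "applicant_age",
]
# Some releases are pipe-delimited; adjust sep= accordingly.
df = pd.read_csv("2022_public_lar.csv", usecols=cols, low_memory=False)

# Keep originated (1) vs. denied (3) applications and build a binary approval label.
df = df[df["action_taken"].isin([1, 3])]
df["approved"] = (df["action_taken"] == 1).astype(int)

# Numeric fields can arrive as strings with sentinels such as "Exempt"; coerce and drop.
for col in ["interest_rate", "loan_to_value_ratio", "income"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.dropna()

# One-hot encode the remaining categorical fields for model training.
X = pd.get_dummies(df.drop(columns=["action_taken", "approved"]), drop_first=True)
y = df["approved"]
```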
-
Our approach was structured to ensure the development of robust predictive models before evaluating their fairness. Using our preprocessed data, we first aimed to establish a baseline model that performed reasonably well, setting a standard for subsequent, more complex models. To this end, we chose logistic regression without any tuning or regularization as our baseline.
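A minimal sketch of this baseline, assuming the feature matrix X and label vector y produced by a preprocessing step like the one above (names are hypothetical):

```python
# Baseline: plain logistic regression, no hyperparameter tuning.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# penalty=None disables the default L2 regularization (requires scikit-learn >= 1.2).
baseline = LogisticRegression(penalty=None, max_iter=1000)
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
```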
To explore the potential for higher performance and better handling of the high-dimensional data, we incorporated more sophisticated ensemble models: Random Forest and XGBoost. These models are known for robust performance across a variety of data scenarios and for modeling non-linear relationships more effectively than simpler models.
Random Forest
We chose Random Forest because it handles high-dimensional data well and reduces the risk of overfitting through its ensemble approach. It also provides feature importance metrics, which help identify the key factors that influence lending decisions.
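A sketch of how such a model and its feature importances might be obtained, reusing the train/test split from the baseline sketch; the hyperparameter values are illustrative, not our tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(n_estimators=200, max_depth=20, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))

# Feature importances point to the variables that most influence predicted decisions.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```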
XGBoost
We chose XGBoost for reasons similar to Random Forest: robust performance on high-dimensional datasets and the ability to handle class imbalance effectively, especially when combined with techniques like SMOTE. XGBoost also provides feature importance information.
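The sketch below pairs SMOTE oversampling with XGBoost on the same train/test split; the parameter values are illustrative assumptions:

```python
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Oversample the minority class in the training data only, never the test data.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

xgb = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1,
    eval_metric="logloss", random_state=42,
)
xgb.fit(X_res, y_res)
print("Test accuracy:", xgb.score(X_test, y_test))
```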
AI Fairness 360
To address the critical aspect of fairness in our model predictions, we utilized the AI Fairness 360 toolkit provided by IBM. The toolkit is designed to detect and mitigate bias in machine learning datasets and models, and it offers a comprehensive suite of metrics and algorithms that enabled us to assess fairness in both our dataset and the models we developed. We examined fairness in the dataset using two metrics (Disparate Impact and Statistical Parity Difference) and in the models using four metrics (Disparate Impact, Statistical Parity Difference, Average Odds Difference, and Equal Opportunity Difference).
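A sketch of how these metrics can be computed with AIF360, assuming a fully numeric DataFrame test_df that holds the encoded features, the true approved label, a model prediction column, and a binary protected-attribute column (the column name derived_race_White is a hypothetical example):

```python
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric, ClassificationMetric

priv = [{"derived_race_White": 1}]
unpriv = [{"derived_race_White": 0}]

# Ground-truth dataset (features + true labels).
dataset_true = BinaryLabelDataset(
    df=test_df.drop(columns=["prediction"]),
    label_names=["approved"],
    protected_attribute_names=["derived_race_White"],
)
# Same dataset with the model's predictions substituted for the labels.
dataset_pred = dataset_true.copy(deepcopy=True)
dataset_pred.labels = test_df["prediction"].values.reshape(-1, 1)

# Dataset-level metrics: Disparate Impact and Statistical Parity Difference.
dm = BinaryLabelDatasetMetric(dataset_true, unprivileged_groups=unpriv, privileged_groups=priv)
print("Disparate impact:", dm.disparate_impact())
print("Statistical parity difference:", dm.statistical_parity_difference())

# Model-level metrics add Average Odds Difference and Equal Opportunity Difference.
cm = ClassificationMetric(dataset_true, dataset_pred,
                          unprivileged_groups=unpriv, privileged_groups=priv)
print("Average odds difference:", cm.average_odds_difference())
print("Equal opportunity difference:", cm.equal_opportunity_difference())
```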
-
Overall, we observed moderate to significant biases against certain demographic groups in mortgage lending. Even a moderate level of bias can lead to significant real-world harm, particularly for individuals who depend on fair lending practices, so it is critical to explore methods to mitigate these biases effectively. Our suggested mitigation strategy consists of three parts: preprocessing, inprocessing, and postprocessing. Preprocessing techniques such as reweighing and optimized preprocessing (OP) adjust the weights of training instances and modify labels and features, respectively, to promote fairness before model training. Inprocessing techniques such as adversarial debiasing, which maximizes prediction accuracy while limiting an adversary's ability to recover the protected attribute from the predictions, reduce bias during training. Postprocessing methods, such as equalized odds postprocessing, refine the model's outputs by adjusting the decision boundary to balance false positive and false negative rates across groups. Alongside these techniques, thorough exploratory data analysis and visualization remain essential for understanding and addressing the underlying biases in the data.
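As a concrete illustration of the preprocessing part, the sketch below applies AIF360's Reweighing to a training dataset and feeds the resulting instance weights into a scikit-learn model; dataset_train is a hypothetical BinaryLabelDataset built from the training split, with the same group definitions as above:

```python
from aif360.algorithms.preprocessing import Reweighing
from sklearn.linear_model import LogisticRegression

# Reweighing assigns each (group, label) combination a weight chosen to equalize
# outcomes between privileged and unprivileged groups before training.
rw = Reweighing(unprivileged_groups=unpriv, privileged_groups=priv)
train_transf = rw.fit_transform(dataset_train)

# The reweighed dataset exposes per-instance weights that downstream models can
# consume, e.g. through scikit-learn's sample_weight argument.
clf = LogisticRegression(max_iter=1000)
clf.fit(
    train_transf.features,
    train_transf.labels.ravel(),
    sample_weight=train_transf.instance_weights,
)
```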