# Doing Your Own Fair Lending Statistical Analysis

At the IBAT Lending Compliance Summit in April 2014 and at the SWGSB Alumni program in May, there was much discussion about the regulatory focus on Fair Lending in general and the statistical analysis being done to identify disparate treatment. This is the second in a series of articles on statistical analysis as it applies to Fair Lending. The first article in the series, *Preparing for a Fair Lending Examination Statistical Analysis*, discusses how to collect and prepare a dataset that a regulatory agency will use for Fair Lending analysis. The steps described in this article involve doing your own Fair Lending compliance analysis to anticipate problems that might come up during an examination. Fortunately, there are now open source statistical tools that can do very sophisticated analysis, though these tools may require skills that the bank may not have in-house.

The article assumes that you have already cleaned up and prepared your dataset as described in *Preparing for a Fair Lending Examination Statistical Analysis*. The article is divided into the following sections:

- Geocode Addresses
- Estimate the Race, Ethnicity and Gender of Loan Applicants
- Calculate Manhattan Distance or Drive Time from Borrower to Nearest Branch
- Join Loan Data with Rate Sheet Historical Data
- Create Variables for Analysis and Create Training and Test Data Sets
- Testing for Disparate Treatment
- Conclusions

## Geocode Addresses

The first step in almost any customer-oriented analysis is to geocode the customer's address. Geocoding is the process of converting a street address into latitude and longitude coordinates that can be plotted on a map, used to merge address data with census data, used to calculate a drive time between two locations or used in calculating the Manhattan or the straight-line distances between two points. It is very useful in a variety of banking analysis problems, not the least of which is address clean-up; if an address won’t geocode and isn’t a P.O. Box, it probably has some problems that need to be fixed. Ten years ago, geocoding was difficult and expensive. Today, there are a variety of applications to do this in volumes that are reasonable for small banks:

- Most Master Customer Information File (MCIF) marketing system vendors provide services to add demographic data and frequently geocode addresses as part of this service. This is probably the easiest way to geocode a set of loans.
- If you have a commercial address standardization package or your statement mailing vendor does address standardization, it may have geocoding available by default or as an added feature; it is worth investigating.
- For in-house geocoding without a commercial package, the most convenient geocoder is probably Google Maps. Make sure to review the Google Maps API Terms of Service and potential privacy issues with your bank's attorney before choosing this option. Most institutions will want to obtain a Google Maps API key and set up payment; otherwise, the geocoding application would need to be throttled to stay within Google’s Terms of Service. The Google Maps geocoder can be called from a variety of programming languages, including PERL and Python (one or both may be known by IT Systems Administrators) and R (a statistical language). There are also other open source packages available for geocoding.
- The PERL programming language has an open source geocoding package available for download and installation. See the Comprehensive PERL Archive Network and search on “geocode”.
- The Python package geopy offers several open source geocoders using several different cloud APIs.
- The R ggmap package offers geocoding via the Google Maps API.
- FFIEC offers a geocoding service, but it would require screen scraping and isn't really suited to doing a large volume.
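Whichever geocoder you choose, the batch job around it looks much the same: call the service once per address, pause between calls if the service requires throttling, and set aside any address that fails to geocode and isn't a P.O. Box. The following is a minimal sketch of that loop. The `geocode_fn` argument is a hypothetical placeholder for whatever geocoder you actually use, and `fake_geocoder` with its coordinates exists only so the example runs without a network connection.

```python
import re
import time

# Matches "PO Box", "P.O. Box", "P O Box", etc.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)

def geocode_all(addresses, geocode_fn, min_interval=0.0):
    """Geocode each address, pausing min_interval seconds between calls.

    geocode_fn is assumed to return (latitude, longitude) or None.
    Returns (results, needs_review), where needs_review lists addresses
    that failed to geocode and are not P.O. Boxes.
    """
    results = {}
    needs_review = []
    for address in addresses:
        coords = geocode_fn(address)
        if coords is not None:
            results[address] = coords
        elif not PO_BOX.search(address):
            needs_review.append(address)
        time.sleep(min_interval)  # crude throttle between service calls
    return results, needs_review

# Stand-in geocoder with one illustrative address (not a real service).
def fake_geocoder(address):
    known = {"100 Main St, Austin, TX": (30.2672, -97.7431)}
    return known.get(address)

results, needs_review = geocode_all(
    ["100 Main St, Austin, TX", "P.O. Box 12, Austin, TX", "1OO Mian Stret"],
    fake_geocoder,
)
print(needs_review)
```

The `needs_review` list doubles as an address clean-up worklist.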

## Estimate the Race, Ethnicity and Gender of Loan Applicants

Since no one collects information on race and ethnicity in loan applications, the regulatory agencies must come up with some way to estimate the race, ethnicity and gender of a borrower when doing their Fair Lending analysis. All of the ways to do this are error-prone to one degree or another, but that discussion is beyond the scope of this article. Use one of the alternatives below to come up with an estimate for the race, gender and ethnicity of each borrower, and then create an array of variables to use in the analysis.

### The Hard Way--Do It Yourself

- Merge Loan Data with Census Data
Once you have geocoded all of your loans, join them with Census data, and add census variables for the race and ethnicity of the surrounding block group to your dataset. You'll end up with a number of

- Join Loan Data Set with Census Ethnic Surname Database
An on-line research publication, *Using the Census Bureau’s surname list to improve estimates of race/ethnicity and associated disparities*, provides a description of a methodology for estimating the race/ethnicity for a person using Census surname data and the geocoded block group; the method described has a correlation of 0.76 when compared to self-reported race/ethnicity at a health insurance provider. The surname frequency and race/ethnicity probability can be downloaded from http://www.census.gov/genealogy/www/data/2000surnames/index.html.
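One common way to combine a surname-based probability with a geography-based probability is a Bayesian update that assumes surname and location are independent once race/ethnicity is known. The sketch below illustrates that idea only; it is not necessarily the exact procedure in the cited publication, and every probability and group label in it is made up for illustration.

```python
def combine_surname_and_geography(p_given_surname, p_given_block_group, p_overall):
    """Combine two probability estimates for the same set of groups.

    Assumes surname and block group are conditionally independent given
    the group, so the posterior is proportional to
    P(group | surname) * P(group | block group) / P(group).
    """
    combined = {
        group: p_given_surname[group] * p_given_block_group[group] / p_overall[group]
        for group in p_given_surname
    }
    total = sum(combined.values())
    return {group: value / total for group, value in combined.items()}

# Illustrative inputs only (not actual Census figures).
surname = {"hispanic": 0.90, "white": 0.07, "black": 0.01, "other": 0.02}
block_group = {"hispanic": 0.20, "white": 0.60, "black": 0.15, "other": 0.05}
overall = {"hispanic": 0.16, "white": 0.64, "black": 0.12, "other": 0.08}

posterior = combine_surname_and_geography(surname, block_group, overall)
for group, p in sorted(posterior.items(), key=lambda item: -item[1]):
    print(f"{group}: {p:.3f}")
```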

### The Easy Way--Merge Loan Data with MCIF Vendor Race and Ethnicity Data

Acxiom and a number of data brokers routinely provide estimated race and ethnicity data in demographic data sets. If this is available to you, merge this data with your loan data. Although you may not do business directly with Acxiom or the other major data brokers, many MCIF vendors offer demographic data enhancement services that resell Acxiom’s services. You can probably get this service through your MCIF vendor.

### Generate Array of Variables for Estimated Race, Ethnicity and Gender

For each race, ethnicity and gender, create a variable with a probability that the person fits into each category. You should develop two sets: one with all of the census detail, and a second where all of the unprotected groups (talk to your Compliance Officer on this) are merged. For each set, create a variable that is the best estimate of race and ethnicity. You will use the second set of variables to determine whether or not you have a Fair Lending compliance problem, and the first set to diagnose and refine your understanding should you identify a Fair Lending compliance problem.
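As a rough illustration of the two variable sets, the sketch below takes one borrower's detailed probabilities, builds the merged set, and picks a best estimate for each. The group names are placeholders, and which groups belong in the merged "unprotected" category is a decision for your Compliance Officer, not something this code decides.

```python
def build_race_variables(detail_probs, unprotected_groups):
    """Return the detailed and merged probability sets plus best estimates."""
    merged = {"unprotected": 0.0}
    for group, probability in detail_probs.items():
        if group in unprotected_groups:
            merged["unprotected"] += probability
        else:
            merged[group] = probability
    return {
        "detail": dict(detail_probs),
        "detail_best": max(detail_probs, key=detail_probs.get),
        "merged": merged,
        "merged_best": max(merged, key=merged.get),
    }

# Placeholder group names and probabilities for a single borrower.
borrower = {"group_a": 0.30, "group_b": 0.25, "group_c": 0.35, "group_d": 0.10}
variables = build_race_variables(borrower, unprotected_groups={"group_a", "group_b"})
print(variables["detail_best"], variables["merged_best"])
```

Note that the best estimate can differ between the two sets, as it does here, which is one reason to keep both.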

## Calculate Manhattan Distance or Drive Time from Borrower to Nearest Branch

Since you have the latitude and longitude for each borrower, go ahead and calculate the Manhattan distance (distance North/South + distance East/West) between the borrower and your nearest branch and between the borrower and the nearest competitor’s branch. Distance to a branch is a strong predictor for a variety of consumer financial behaviors, so it is worth having it available for analysis. The Manhattan distance is easy to compute and doesn't require expensive software.
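A minimal sketch of the calculation follows. It treats the earth as a sphere and scales the east/west leg by the cosine of the average latitude, which is a reasonable approximation over the short distances involved in branch analysis. The branch names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean earth radius

def manhattan_miles(lat1, lon1, lat2, lon2):
    """Approximate north/south distance plus east/west distance, in miles."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    north_south = EARTH_RADIUS_MILES * abs(math.radians(lat2 - lat1))
    east_west = (
        EARTH_RADIUS_MILES * abs(math.radians(lon2 - lon1)) * math.cos(mean_lat)
    )
    return north_south + east_west

def nearest_branch(borrower, branches):
    """Return (branch_name, distance) for the closest branch."""
    lat, lon = borrower
    distances = (
        (name, manhattan_miles(lat, lon, b_lat, b_lon))
        for name, (b_lat, b_lon) in branches.items()
    )
    return min(distances, key=lambda pair: pair[1])

# Hypothetical branch locations and borrower.
branches = {"Downtown": (30.27, -97.74), "North": (30.40, -97.72)}
name, miles = nearest_branch((30.29, -97.75), branches)
print(name, round(miles, 2))
```

Running the same function against a list of competitor branch locations gives the second distance variable.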

The time it takes to drive from the customer's address to the branch is a much better predictor than distance, but it is difficult to calculate. The easiest option is a commercial package called Arcview Business Analyst and the related Arcview Network Analyst products from ESRI. They aren't cheap, so you might want to contract with ESRI to do this part of the work. There may be other alternatives within the Google Maps API or other navigation web services.

## Join Loan Data with Rate Sheet Historical Data

One of the most important tests is to look at deviations from published rate sheets (See FDIC Compliance Manual--January 2014 page IV-1.8 section P2). To do this you will need to join your historical rate sheets with the loan data for corresponding dates and then calculate the deviation from the rate sheet.
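The join is an "as of" lookup: for each loan, find the most recent rate sheet that was in effect on the loan's pricing date. The sketch below shows one way to do that. The field names (`product`, `lock_date`, `rate`) and the one-rate-per-product rate sheet are simplifying assumptions; real rate sheets usually vary by term, credit tier and other factors, so the lookup key would need to include those.

```python
from bisect import bisect_right
from datetime import date

def rate_in_effect(rate_sheets, product, on_date):
    """Return the published rate in effect on on_date, or None if none applies.

    rate_sheets maps product -> list of (effective_date, rate), sorted by date.
    """
    sheets = rate_sheets[product]
    effective_dates = [effective for effective, _ in sheets]
    index = bisect_right(effective_dates, on_date)
    return None if index == 0 else sheets[index - 1][1]

def add_rate_deviation(loans, rate_sheets):
    """Attach the rate sheet rate and the deviation from it to each loan."""
    for loan in loans:
        sheet_rate = rate_in_effect(rate_sheets, loan["product"], loan["lock_date"])
        loan["sheet_rate"] = sheet_rate
        loan["rate_deviation"] = (
            None if sheet_rate is None else round(loan["rate"] - sheet_rate, 4)
        )
    return loans

# Hypothetical rate sheet history and loans.
rate_sheets = {
    "auto_60": [(date(2014, 1, 1), 4.50), (date(2014, 3, 1), 4.75)],
}
loans = add_rate_deviation(
    [
        {"product": "auto_60", "lock_date": date(2014, 2, 15), "rate": 4.75},
        {"product": "auto_60", "lock_date": date(2014, 3, 1), "rate": 4.75},
    ],
    rate_sheets,
)
print([loan["rate_deviation"] for loan in loans])
```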

## Create Variables for Analysis and Create Training and Test Data Sets

Now that you have cleaned up all of your data as described in *Preparing for a Fair Lending Examination Statistical Analysis*, and done all of the estimates for the borrower’s race/ethnicity/gender, combine everything into a single dataset to be used for analysis. For codes that are numbers, make sure to identify them as factor or ordered factor data types rather than real-valued numbers. Now is also the time to add transformed copies of variables using common data transformations: normalize values to a 0 to 1 scale, or take the log of a variable whose values span several orders of magnitude. This step is the first that must be done in a statistical tool, as databases and flat files don't support the concepts of "factor" and "ordered factor."
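The transformations themselves are simple. The sketch below shows a 0 to 1 normalization, a log transform, and indicator (one-hot) columns, which are a common way to represent a factor in tools that have no native factor type. The function names and sample values are illustrative.

```python
import math

def min_max_scale(values):
    """Rescale values to the 0 to 1 range."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def log_transform(values):
    """Base-10 log; every value must be greater than zero."""
    return [math.log10(v) for v in values]

def one_hot(codes):
    """Turn a numeric code column into one 0/1 indicator column per level."""
    levels = sorted(set(codes))
    return {
        f"code_{level}": [1 if code == level else 0 for code in codes]
        for level in levels
    }

loan_amounts = [5_000, 50_000, 500_000]
print(min_max_scale(loan_amounts))
print(log_transform(loan_amounts))
print(one_hot([1, 3, 1]))
```

Keep the original column alongside each transformed copy so that either can be offered to the model.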

You should also create training and test datasets, to determine whether or not the models that you generate are over-fitted. If protected groups (or unprotected groups) are very infrequent in your dataset, you should consider repeating some of these low-frequency observations in the training data set. If they are very infrequent, they may be ignored, and a pattern could be present, but not recognized, in which case you could get a rude awakening during the examination.
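One way to do both at once is to split each group separately, so that every group appears in both data sets, and then repeat randomly chosen training observations for any group that falls below a minimum count. The sketch below is a simple illustration; the `min_train_count` threshold of 30 and the 30% test fraction are arbitrary assumptions. Only the training data is padded, so the test data still reflects the real mix of borrowers.

```python
import random

def split_and_oversample(records, group_key, test_fraction=0.3,
                         min_train_count=30, seed=0):
    """Split records into training and test sets, group by group."""
    rng = random.Random(seed)
    by_group = {}
    for record in records:
        by_group.setdefault(record[group_key], []).append(record)

    train, test = [], []
    for members in by_group.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = int(len(members) * test_fraction)
        group_test, group_train = members[:n_test], members[n_test:]
        # Repeat low-frequency observations in the training data only.
        extra = max(0, min_train_count - len(group_train))
        group_train += [rng.choice(group_train) for _ in range(extra)]
        train += group_train
        test += group_test
    return train, test

# Hypothetical dataset: 100 loans in group A, 10 in group B.
records = [{"id": i, "group": "A"} for i in range(100)]
records += [{"id": 100 + i, "group": "B"} for i in range(10)]
train, test = split_and_oversample(records, "group")
print(len(train), len(test))
```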

## Testing for Disparate Treatment

There are several approaches to look for disparate treatment under the Fair Lending regulations. Since we are interested in screening for problems rather than proving a problem, it is appropriate to use an approach that casts a wide net and identifies issues that may cause problems even though they might not rise to the level of *statistical significance*, *financial materiality*, *frequency*, or *causation*. These terms are mine and don't appear in any regulation or compliance manual that I've seen. I use them because most of the discussions that I've heard combine all of these concepts into the term “significant” and aren't all that precise.

The statistical approach that follows is hopefully not the procedure used by regulators. The predictive modeling approach that I discuss below will probably indicate patterns where race/ethnicity/gender are useful in predicting price where hypothesis testing approaches might not identify race/ethnicity/gender as statistically significant. Remember that in this analysis, we want to cast a broad net to find anything that might be remotely problematic.

Once all of the data preparation is done, you can begin to look at the data and identify any race/ethnicity/gender patterns that exist. Broadly speaking, you will need to look at interest rates on loans that were approved, including both loans that closed and loans that did not close. You will also need to look at loan approvals.

Because we are screening for problems, we don't want to spend a lot of time if we can help it. The approaches below are by no means exhaustive, but instead are intended to be a labor-efficient approach to screening for problems. The section is divided into the following steps:

- Visualizations
- Disparate Treatment in Pricing--Test Approved Loans Including Loans that Did Not Close
- Disparate Treatment in Underwriting--Test Approved vs. Denied Loan Applications
- Disparate Treatment in Product Steering--Test Qualified vs. Sold
- Test Models for Over-Fitting

### Visualizations

Before doing any statistical tests on the dataset, it is usually helpful to look at some simple visualizations. A few possibilities are listed below:

- Generate visualizations for all of the variables that you have in the dataset. The best way to start out is to plot the deviation from the rate sheet vs. each variable. In the R statistical program, you can do this easily using the `lattice` package.
- Plot the geocodes for all of the loans on a map, along with branch locations. This won't shed any light on the disparate treatment directly, but it is an easy plot to do and may help you to understand sales patterns better.
- For each racial/ethnic group, plot the deviation from the rate sheet as a time series to see if there are any seasonal patterns; you may find significant deviations immediately before and after rate sheet changes. If this is the case, you should look at these by race and ethnicity to see if there are patterns in who got the old rate after the rate sheet change in a rising environment and who got the new rate early in a falling environment.

### Disparate Treatment in Pricing--Test Approved Loans Including Loans that Did Not Close

There are many ways that you can look for disparate treatment in pricing in approved loans, but the fastest way to get an understanding of the data would be to do a stepwise regression to predict the interest rate on the loan using all of the credit-worthiness metrics available plus race/ethnicity/gender. If the race/ethnicity/gender variable shows up as significant in the stepwise regression model, you either have disparate treatment in pricing, you have been making credit decisions on creditworthiness variables that are not included in your dataset, or you need to do further analysis to find a model that better explains the patterns present in the data. Stepwise regression gives good models quickly, but there may well be a model that better explains the patterns present in the data that the stepwise automation didn’t find.

At a minimum, you should do the following:

- Perform a step-wise linear regression to predict interest rate
- Perform a step-wise linear regression to predict interest rate deviation from rate sheet
- Repeat analysis for each indirect dealer/originator

In all cases, make sure to check the residual plots for the various regression models.
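To make the idea concrete, here is a small forward stepwise selection written from scratch. It is a sketch of the general technique, not a substitute for a statistical package: it adds one variable at a time and keeps going only while the Akaike Information Criterion (AIC) improves, which is the same criterion R's `step` function uses by default. The data is synthetic, and the variable names (`credit_score`, `protected_group`, `unrelated`) are made up for the example.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares for least squares with an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gaussian elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    c = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        c[i], c[pivot] = c[pivot], c[i]
        for r in range(i + 1, k):
            factor = A[r][i] / A[i][i]  # fails if columns are perfectly collinear
            for j in range(i, k):
                A[r][j] -= factor * A[i][j]
            c[r] -= factor * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return sum((y[r] - sum(X[r][j] * b[j] for j in range(k))) ** 2 for r in range(n))

def forward_stepwise(candidates, y):
    """Add the variable that lowers AIC the most until nothing improves it."""
    n = len(y)

    def aic(names):
        rss = max(ols_rss([candidates[m] for m in names], y), 1e-12)
        return n * math.log(rss / n) + 2 * (len(names) + 1)

    selected, best = [], aic([])
    while True:
        trials = [(aic(selected + [m]), m) for m in candidates if m not in selected]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected

# Synthetic loans: the rate depends on credit score AND on group membership.
rng = random.Random(0)
n = 200
credit_score = [rng.uniform(600, 800) for _ in range(n)]
protected_group = [float(rng.random() < 0.3) for _ in range(n)]
unrelated = [rng.gauss(0, 1) for _ in range(n)]
rate = [6.0 - 0.004 * (cs - 700) + 0.5 * p + rng.gauss(0, 0.1)
        for cs, p in zip(credit_score, protected_group)]

selected = forward_stepwise(
    {"credit_score": credit_score, "protected_group": protected_group,
     "unrelated": unrelated},
    rate,
)
print(selected)
```

If the group variable is selected even though the credit-worthiness variables were available, as it is in this synthetic data, that is the pattern that calls for follow-up.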

Although it won't tell you anything directly, looking at the closing rates for each racial and ethnic group as shown in Table 1 can point you to further investigation; if there are statistically significant differences, you will probably want to expend more effort in the later steps.

| | Approved | Closed | Close Rate | P-value That Group Has Same Average as Non-Hispanic White |
| --- | --- | --- | --- | --- |
| Non-Hispanic White | 100 | 50 | 0.50 | |
| Ethnic Group 1 | 100 | 40 | 0.40 | |
| Ethnic Group 2 | 100 | 60 | 0.60 | |
| Ethnic Group 3 | 100 | 45 | 0.45 | |
| All Groups | 400 | 195 | 0.4875 | |
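The p-value column can be filled in with a two-proportion z-test comparing each group's close rate with the Non-Hispanic White close rate. The sketch below uses the counts from Table 1 and a normal approximation; this is one reasonable choice of test, not necessarily the one a regulator would use, and an exact test may be more appropriate when counts are small.

```python
import math

def two_proportion_p_value(closed_a, approved_a, closed_b, approved_b):
    """Two-sided p-value that two groups share the same close rate."""
    rate_a = closed_a / approved_a
    rate_b = closed_b / approved_b
    pooled = (closed_a + closed_b) / (approved_a + approved_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / approved_a + 1 / approved_b))
    z = (rate_a - rate_b) / std_error
    return math.erfc(abs(z) / math.sqrt(2))

# (approved, closed) counts taken from Table 1.
reference = (100, 50)  # Non-Hispanic White
groups = {
    "Ethnic Group 1": (100, 40),
    "Ethnic Group 2": (100, 60),
    "Ethnic Group 3": (100, 45),
}

p_values = {
    name: two_proportion_p_value(closed, approved, reference[1], reference[0])
    for name, (approved, closed) in groups.items()
}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```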

### Disparate Treatment in Underwriting--Test Approved vs. Denied Loan Applications

To look for disparate treatment in underwriting, you will need to look at both approvals and denials. To get a quick understanding, do a stepwise logistic regression to predict loan approval.

- Perform stepwise logistic regression to predict loan approval
- Repeat analysis for each indirect dealer/originator

In all cases, make sure to check the residual plots for the various regression models.

### Disparate Treatment in Product Steering--Test Qualified vs. Sold

Finally, we need to look at product selection to make sure that qualified borrowers aren't steered into more expensive sub-prime products when they qualify for a prime product. For this analysis, we will generate a matrix like the one shown in Table 2 below for each racial/ethnic group:

| | Qualified for Prime | Not Qualified for Prime |
| --- | --- | --- |
| Sold Prime | 100 | |
| Sold Sub-Prime | | 100 |

In this table, everyone should be on the northwest to southeast diagonal. If you have non-zero entries on the southwest to northeast diagonal for any of the racial or ethnic groups, you will need to perform a chi-squared test (χ²) to determine if the groups are treated differently from a steering perspective.
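As an illustration, the sketch below compares two groups of borrowers who all qualified for prime, counting how many in each group were sold prime versus sub-prime, and computes the Pearson chi-squared statistic for the resulting 2x2 table. The counts are hypothetical. The p-value formula used here is valid only for a 2x2 table (one degree of freedom), and with very small cell counts an exact test is generally a safer choice.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are two borrower groups; columns are sold prime, sold sub-prime.
    Returns (statistic, p_value) with one degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # no variation to test
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts among borrowers who qualified for prime:
# reference group: 95 sold prime, 5 sold sub-prime
# comparison group: 85 sold prime, 15 sold sub-prime
statistic, p_value = chi_squared_2x2(95, 5, 85, 15)
print(f"chi-squared = {statistic:.2f}, p = {p_value:.4f}")
```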

### Test Models for Over-Fitting

If you come up with models that include race/ethnicity/gender as significant predictors even when all other creditworthiness variables are available to the stepwise regression, make sure to run them against the test data set. If the model continues to predict well, then one of three things is true: you are missing a creditworthiness variable with strong race/ethnicity/gender patterns, you have a lot of work ahead to find a manually constructed model that performs better, or you have a disparate treatment problem that needs to be addressed.
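The hold-out check can be reduced to one question: does knowing the group still improve predictions on data the model has never seen? The sketch below is a deliberately simplified stand-in for the regression models. It "fits" an overall average and a per-group average rate deviation on the training data, then compares their prediction error on the test data. The groups and deviations are invented for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def holdout_comparison(train, test):
    """Compare test-set error without and with the group variable.

    train and test are lists of (group, rate_deviation) pairs.
    Returns (rmse_without_group, rmse_with_group), both measured on test.
    """
    overall_mean = sum(dev for _, dev in train) / len(train)
    totals = {}
    for group, dev in train:
        total, count = totals.get(group, (0.0, 0))
        totals[group] = (total + dev, count + 1)
    group_mean = {g: total / count for g, (total, count) in totals.items()}

    actual = [dev for _, dev in test]
    without_group = rmse(actual, [overall_mean] * len(test))
    with_group = rmse(actual, [group_mean.get(g, overall_mean) for g, _ in test])
    return without_group, with_group

# Invented rate deviations (percentage points) for two groups.
train = [("A", 0.0), ("A", 0.1), ("A", -0.1), ("A", 0.0),
         ("B", 0.5), ("B", 0.4), ("B", 0.6), ("B", 0.5)]
test = [("A", 0.05), ("A", -0.05), ("B", 0.45), ("B", 0.55)]

without_group, with_group = holdout_comparison(train, test)
print(f"test RMSE without group: {without_group:.3f}, with group: {with_group:.3f}")
```

If the error with the group variable stays clearly lower on the test data, the pattern is not an artifact of over-fitting.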

## Conclusions

The analysis described above will help you to identify whether you have disparate treatment patterns that could appear in a Fair Lending examination statistical analysis. Since the exact procedures that the regulatory agencies use are not public, this analysis is not a guarantee that issues won’t come up.