Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
This article is the third in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on using an interactive 3D scatter plot to show five data dimensions at once. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
 Which articles need work to improve search engine ranking
 Which articles are well ranked but do not get clicked, and need work on titles or meta data
 Where to focus efforts for new content
 How to use passive web search data to focus new product development
The other articles in the series are:
 Visualizing Web Analytics Data in R Part 1: the Problem
 Visualizing Web Analytics Data in R Part 2: Interactive Outliers
 Visualizing Web Analytics Data in R Part 4: Interactive Globe
 Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
 Visualizing Web Analytics Data in R Part 6: Interactive Networks
 Visualizing Web Analytics Data in R Part 7: Interactive Complex
Interactive 3D Scatterplot showing Five Dimensional Data
Showing more than three dimensions of data is a problem in any visualization; parallel axes work very well if you are presenting the data to scientists and engineers who are familiar with this type of plot, but they do not go over well with typical business professionals. For users who have not been trained to read parallel axes and other advanced charts, a 3D scatter plot is a good way to represent up to five dimensions: three on the x, y and z axes, a fourth with dot size and a fifth with color. If you can use additional dot shapes, you can represent a sixth data dimension.
The R threejs package makes it easy to generate a 3D scatterplot with five variables, though it has some limitations that will hopefully be removed as the package is enhanced.
Figure 1 shows a 3D scatter plot of Google Analytics clickthrough rate (CTR), search position, and article size generated with the scatterplot3d call in the threejs package. The call allows two different rendering tools, "canvas" and "webgl". The "webgl" option is faster in most browsers as it uses OpenGL, but it does not allow changing the size of the dots. It also may not allow users on some mobile devices to manipulate the image, as some devices do not support OpenGL. The "canvas" rendering option is slower, but it allows changes to dot size and works better on mobile devices.
Figure 2 shows the same plot but uses “canvas” so that the dot size represents the number of times the page was presented in a search (impressions) and the color represents the classification of the page as general (blue), web (red), banking (green) and consumer (yellow).
Rotating the plot so that the Y axis (position) is closest to the foreground, Figure 2 shows that the CTR (X axis) increases strongly with article size (Z axis) for general, banking and consumer articles, but not as strongly for web articles. With the X axis (CTR) in the foreground, it is clear that search position improves (a low number, to the left, is good) with article size.
The most prominent article, stoppingrachelfromcardholderservices, is pretty clearly in the top half of article size (Z axis). The single largest article, statisticalexampleoffairlendingdisparatetreatmentproblem, has one of the highest clickthrough rates and is generally positioned well in searches. The small articles that do well on both CTR and position tend to be very specific articles that people are trying to find: brucemoore, the author's contact page, and various articles with links to presentation slides.
Linear Regression of Article Size
The linear regression shown in Figure 3 below indicates that increasing page size and decreasing position improve the CTR, but the adjusted R-squared value of only 0.1181 indicates that this is not a good regression model, and that some other variable is probably much more important to the CTR. Including the query area (general, banking, web and consumer) does not shed much additional light on the situation, as shown in Figure 4.
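The regression itself is a one-liner in R; the sketch below uses simulated data and assumed column names rather than the author's actual dataset:

```r
# Simulated stand-in for the page-level Search Console data
set.seed(42)
ga <- data.frame(
  size     = runif(100, 200, 3000),   # article length in words
  position = runif(100, 1, 50)        # average search position
)
# CTR weakly driven by size and position, plus a lot of noise
ga$ctr <- pmax(0, 0.05 + 0.00001 * ga$size - 0.0005 * ga$position +
                  rnorm(100, sd = 0.03))

fit <- lm(ctr ~ size + position, data = ga)
summary(fit)$adj.r.squared  # a low value means other factors drive CTR
```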
With no additional data elements in the Google Search Console data set, there isn't much more insight that can be gained from this dataset without doing text mining on both the queries and the article content. Since this series is primarily about visualization, the next article in the series will look at using the globejs call to visualize geographic data on an interactive globe.
Notes
This article was written in RStudio and uses the threejs package for all graphics except for the linear regression residual plot. The formula display uses MathJax.
 Details
 Written by Bruce Moore
 Hits: 1683
Visualizing Web Analytics Data in R Part 2: Interactive Outliers
This article is the second in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on determining which articles have clickthrough rates higher than would be expected for their average search position and which articles need work. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
 Which articles need work to improve search engine ranking
 Which articles are well ranked but do not get clicked, and need work on titles or meta data
 Where to focus efforts for new content
 How to use passive web search data to focus new product development
The other articles in the series are:
 Visualizing Web Analytics Data in R Part 1: the Problem
 Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
 Visualizing Web Analytics Data in R Part 4: Interactive Globe
 Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
 Visualizing Web Analytics Data in R Part 6: Interactive Networks
 Visualizing Web Analytics Data in R Part 7: Interactive Complex
Interactive Scatterplot Labels Help Domain Experts Understand Data
In the previous article, simple plots of the daily Google Search Console information provided a great deal of information on how search position influences the clickthrough rate (CTR), but they did not help explain why some days performed better than expected, or for that matter which days performed better than expected. The googleVis R package provides flyover labels for data points, which makes it much easier to figure out what is going on with better-than-expected and worse-than-expected events.
Figure 1 shows a googleVis plot of CTR vs. position with point labels that include the date and number of impressions. This makes it possible to figure out which dates to go back and investigate further. Clearly, June 28, 29, and 30 are days to investigate further to figure out what content users were searching to find (which drove high numbers of impressions) and what content they found that was especially relevant (high CTR).
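One way to get these flyover labels (a sketch only; the author's actual call and column names may differ) is googleVis's gvisBubbleChart, where the idvar column supplies the label text:

```r
library(googleVis)

# Hypothetical daily data; idvar must name a character column.
daily <- data.frame(
  date        = c("2015-06-28", "2015-06-29", "2015-06-30"),
  position    = c(17.2, 16.8, 17.5),
  ctr         = c(0.09, 0.11, 0.10),
  impressions = c(5200, 6100, 5800),
  stringsAsFactors = FALSE
)

chart <- gvisBubbleChart(
  daily,
  idvar   = "date",         # flyover label shows the date
  xvar    = "position",
  yvar    = "ctr",
  sizevar = "impressions",  # bubble size encodes impressions
  options = list(hAxis = "{direction: -1}")  # reverse axis: better rank right
)
plot(chart)  # opens the interactive chart in a browser
```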
An analysis of active queries at that time showed that there was a high search volume for information on a Google Analytics referral spam domain, 100dollarsseo.com; during this period, an article on this web site ranked highly on this search term and was unusual in that most of the other highly ranked articles were operated by the referral spammer. The content was clearly different from the other highly ranked sites due to the title and snippet.
This still does not directly tell what page to look at; the next section looks at visualizations for landing pages.
Interactive Scatterplot of Landing Page Data
Although the previous analysis provided a lot of seasonality information, it did not give actionable information on which individual pages worked well and which did not. For that, a data set keyed by page is needed. Figure 2 shows a googleVis plot of pages by search position and CTR, where the average position is 19.057 and the average CTR is 0.038. The pages above a CTR of 0.04 perform better than expected; a qualitative review of these pages indicates three general attributes:
 The articles contain specialized schema.org structured markup.
 The articles are topically very different from others returned for the search terms.
 The articles may be longer than the pages that underperform with respect to CTR.
The simple step of adding interactive flyover text can provide a great deal of information quickly and easily. The next visualization will show how to display five data dimensions in a three-dimensional interactive scatter plot.
Notes
This article was written in RStudio and uses the googleVis package for all graphics.
Visualizing Web Analytics Data in R Part 1: The Problem
The analysis of Google Analytics and other web analytics data is a big part of many marketing efforts today, and can provide useful information for market analysis and web site design as well as the more obvious topic of search engine optimization (SEO). This is the first of a series of articles that demonstrate the use of R to analyze and visualize web analytics data from Google Webmaster Tools (now Google Search Console) and Google Analytics. The series is based upon a presentation given to the Dallas Infographics and Data Visualization Meetup on September 24, 2015.
The fundamental business problem for this case study is how to decide where to focus web site redesign and search engine optimization efforts on a web site: this web site. The articles that follow use traditional two-dimensional plots and new interactive visualization tools to help answer the following business questions:
 Which articles need work to improve search engine ranking
 Which articles are well ranked but do not get clicked, and need work on titles or meta data
 Where to focus efforts for new content
 How to use passive web search data to focus new product development
The other articles in the series are:
 Visualizing Web Analytics Data in R Part 2: Interactive Outliers
 Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
 Visualizing Web Analytics Data in R Part 4: Interactive Globe
 Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
 Visualizing Web Analytics Data in R Part 6: Interactive Networks
 Visualizing Web Analytics Data in R Part 7: Interactive Complex
Search Console Data Give Seasonality
The first thing to do in analyzing web analytics data is to plot the basic data from Google Search Console:
 Impressions–how many times did Google present your site in a search result?
 Position–where in the search results did Google rank your web site?
 Clickthrough Rate (CTR)–when presented in a search result, what portion of users clicked on your web site?
 Clicks–how many users ultimately clicked on your site? This is the product of impressions and CTR.
Figure 1 is a simple plot of the four quantities on one plot, but it isn't very useful, because Impressions is at least in the thousands while Position is generally less than 50; we need to scale the four variables by dividing each by its maximum value, as shown in Figure 2. Finally, we need to reverse the Position variable so that improvements in position (a smaller number) move in the same direction as improvements in clicks (a bigger number), as shown in Figure 3. The calculation for this is
\[ \begin{aligned} \text{Reversed Position}&=1 - \frac{\text{position}}{\text{max(position)}} \end{aligned} \]
Figure 3 shows that the site gets a lot of traffic during the business week, but that isn't surprising since the products and services are aimed at businesses rather than consumers. Generally, the search ranking is slowly improving over time, and something interesting happened during the first week of July: although position did not change significantly, the number of impressions, CTR and clicks increased dramatically for a short period of time. Understanding this increase in search traffic and the important search terms for content development is the topic for the second and third articles in this series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers and Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D).
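In R, the scaling and reversal amount to a few lines of base code (the column names below are assumptions; a real Search Console export will differ):

```r
# Simulated daily Search Console data
gsc <- data.frame(
  impressions = c(1200, 3400, 2800),
  clicks      = c(24, 170, 84),
  position    = c(32, 18, 25)
)
gsc$ctr <- gsc$clicks / gsc$impressions

# Scale each series to [0, 1] and reverse position so "up is good" everywhere
sc <- data.frame(
  impressions  = gsc$impressions / max(gsc$impressions),
  ctr          = gsc$ctr / max(gsc$ctr),
  clicks       = gsc$clicks / max(gsc$clicks),
  rev_position = 1 - gsc$position / max(gsc$position)
)
```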
These plots give us a clear understanding of seasonality, but they don't give a complete picture of the importance of position in a search: if we buy ads to show up in the paid top spots in search results, will that improve the clickthrough rate, and by how much? The section that follows Figure 3 looks at these problems.
How are Clicks and Clickthrough Rate Related to Search Position?
To figure out what to advertise and how to allocate article revision and SEO efforts, it is important to understand how clicks and clickthrough rate are related to search position. Figure 4 shows a plot of clicks vs. position along with a fitted curve; this is very helpful, but since clicks are related to impressions and CTR, we know that the high values are related to an event in July that does not apply to the web site in general. To get a more general understanding, we need to plot CTR vs. position, as shown in Figure 5. In this plot, we can clearly see that some days get higher clickthrough than would be expected from the search rank while other days underperform. The R-squared value in the regression model shown in Figure 6 indicates that about 50% of the CTR is related to search position, and 50% is related to whether or not the user thought the article was relevant. Clearly, the clarity of the title and snippet matter a lot in whether a user will click through to read an article; getting a good search ranking by advertising is only 50% of the battle. The residual plot shows some nonlinearity as the fitted value increases; this probably reflects the big difference between showing up on the first page of search results and the second.
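A minimal sketch of such a regression and its residual plot, on simulated data (the real analysis uses the actual daily export):

```r
# Simulated daily data: CTR falls roughly linearly with position, plus noise
set.seed(1)
daily <- data.frame(position = runif(90, 5, 40))
daily$ctr <- 0.10 - 0.002 * daily$position + rnorm(90, sd = 0.01)

fit <- lm(ctr ~ position, data = daily)
summary(fit)$r.squared  # share of CTR variation explained by position
plot(fit, which = 1)    # residuals vs. fitted values: look for curvature
```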
The next article in the series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers, looks at how to use interactive graphics and flyovers to help in understanding which pages are overperforming or underperforming relative to their position in search results.
Notes
This article was written in RStudio and uses the ggplot2 package for all graphics except for the linear regression residual plot. The formula display uses MathJax.
Doing Your Own Fair Lending Statistical Analysis
At the IBAT Lending Compliance Summit in April 2014 and at the SWGSB Alumni program in May, there was much discussion about the regulatory focus on Fair Lending in general and the statistical analysis that is being done to identify disparate treatment. This is the second in a series of articles that discuss statistical analysis as it can be used for Fair Lending analysis. The first article in the series, Preparing for a Fair Lending Examination Statistical Analysis, discusses how to collect and prepare a dataset that a regulatory agency will use for Fair Lending analysis. The article that follows describes how to do your own Fair Lending compliance analysis to anticipate problems that might come up during an examination. Fortunately, there are now open source statistical tools to do very sophisticated analysis, though these tools may require skills that the bank may not have in-house.
The article assumes that you have already cleaned up and prepared your dataset as described in Preparing for a Fair Lending Examination Statistical Analysis. The article is divided into the following sections:
 Geocode Addresses
 Estimate the Race, Ethnicity and Gender of Loan Applicants
 Calculate Manhattan Distance or Drive Time from Borrower to Nearest Branch
 Join Loan Data with Rate Sheet Historical Data
 Create Variables for Analysis and Create Training and Test Data Sets
 Testing for Disparate Treatment
 Conclusions
Geocode Addresses
The first step in almost any customer-oriented analysis is to geocode the customer's address. Geocoding is the process of converting a street address into latitude and longitude coordinates that can be plotted on a map, used to merge address data with census data, used to calculate a drive time between two locations, or used in calculating the Manhattan or straight-line distance between two points. It is very useful in a variety of banking analysis problems, not the least of which is address cleanup; if an address won't geocode and isn't a P.O. Box, it probably has some problems that need to be fixed. Ten years ago, geocoding was difficult and expensive. Today, there are a variety of applications to do this in volumes that are reasonable for small banks:
 Most Master Customer Information File (MCIF) marketing system vendors provide services to add demographic data and frequently geocode addresses as part of this service. This is probably the easiest way to geocode a set of loans.
 If you have a commercial address standardization package or your statement mailing vendor does address standardization, it may have geocoding available by default or as an added feature; it is worth investigating.
 For in-house geocoding without a commercial package, the most convenient geocoder is probably Google Maps. Make sure to review the Google Maps API Terms of Service and potential privacy issues with your bank's attorney before choosing this option. Most institutions would want to get a Google Maps API key to use in their geocoding application to set up payment; otherwise the geocoding application would need to be throttled in order to meet Google's Terms of Service. This is available from a variety of programming languages including PERL, Python (one or both may be known by IT Systems Administrators) and R (a statistical language). There are other open source packages available for geocoding.
 The PERL programming language has an open source geocoding package available for download and installation. See the Comprehensive PERL Archive Network and search on “geocode”.
 The Python package geopy offers several open source geocoders using several different cloud APIs.
 The R ggmap package offers geocoding via the Google Maps API.
 FFIEC offers a geocoding service, but it would require screen scraping and isn't really suited to doing a large volume.
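As an illustration of the ggmap option (the API key is a placeholder, and the register_google call exists only in more recent ggmap versions):

```r
library(ggmap)

# Placeholder key: obtain your own and review the Terms of Service first.
register_google(key = "YOUR-API-KEY")

# Returns a data frame with lon and lat columns for the address
geocode("1600 Pennsylvania Ave NW, Washington, DC")
```

For a batch of loans, geocode() accepts a character vector of addresses, but remember the rate limits discussed above.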
Estimate the Race, Ethnicity and Gender of Loan Applicants
Since information on race and ethnicity is generally not collected in loan applications, the regulatory agencies must come up with some way to estimate the race, ethnicity and gender of a borrower in doing their Fair Lending analysis. All of the ways to do this are error-prone to one degree or another, but that discussion is beyond the scope of this article. Use one of the alternatives below to come up with an estimate for the race, gender and ethnicity of each borrower, and then create an array of variables to use in the analysis.
The Hard Way: Do It Yourself
 Merge Loan Data with Census Data
Once you have geocoded all of your loans, join the loan data with Census data, and add census variables for the race and ethnicity of the surrounding block group to your dataset. You'll end up with a number of new race and ethnicity variables for each loan.
 Join Loan Data Set with Census Ethnic Surname Database
An online research publication, Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities, describes a methodology for estimating the race/ethnicity of a person using Census surname data and the geocoded block group; the method described has a correlation of 0.76 when compared to self-reported race/ethnicity at a health insurance provider. The surname frequency and race/ethnicity probability data can be downloaded from http://www.census.gov/genealogy/www/data/2000surnames/index.html.
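A much-simplified sketch of the paper's Bayesian idea, with made-up probabilities (the published method is more careful, adjusting block-group composition against national baselines):

```r
# Prior from the surname list: P(race | surname), made-up numbers
p_surname <- c(white = 0.70, black = 0.10, hispanic = 0.15, asian = 0.05)

# Evidence from the geocoded block group's composition, made-up numbers
p_block <- c(white = 0.20, black = 0.60, hispanic = 0.15, asian = 0.05)

# Bayes-style update: multiply and renormalize
posterior <- p_surname * p_block
posterior <- posterior / sum(posterior)
round(posterior, 3)
```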
The Easy Way: Merge Loan Data with MCIF Vendor Race and Ethnicity Data
Acxiom and a number of data brokers routinely provide estimated race and ethnicity data in demographic data sets. If this is available to you, merge this data with your loan data. Although you may not do business directly with Acxiom or the other major data brokers, many MCIF vendors offer demographic data enhancement services that resell Acxiom’s services. You can probably get this service through your MCIF vendor.
Generate Array of Variables for Estimated Race, Ethnicity and Gender
For each race, ethnicity and gender, create a variable with a probability that the person fits in to each category. You should develop two sets; one with all of the census detail, and a second where all of the unprotected groups (talk to your Compliance Officer on this) are merged. For each set create a variable that is the best estimate of race and ethnicity. You will use the second set of variables to determine whether or not you have a Fair Lending compliance problem, and the first set to diagnose and refine your understanding should you identify a Fair Lending compliance problem.
Calculate Manhattan Distance or Drive Time from Borrower to Nearest Branch
Since you have the latitude and longitude for each borrower, go ahead and calculate the Manhattan distance (distance North/South + distance East/West) between the borrower and your nearest branch and between the borrower and the nearest competitor’s branch. Distance to a branch is a strong predictor for a variety of consumer financial behaviors, so it is worth having it available for analysis. The Manhattan distance is easy to compute and doesn't require expensive software.
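A sketch of the calculation in base R, using a flat-earth approximation whose constants (miles per degree) are only approximate but are fine at branch-network scale:

```r
# Manhattan distance in miles between two lat/long points.
# North/south and east/west legs computed separately, then summed.
manhattan_miles <- function(lat1, lon1, lat2, lon2) {
  ns <- abs(lat1 - lat2) * 69.0                       # miles per degree latitude
  ew <- abs(lon1 - lon2) * 69.17 * cos(lat1 * pi / 180)  # shrinks with latitude
  ns + ew
}

# Example: a borrower and a branch (made-up coordinates near Dallas)
manhattan_miles(32.78, -96.80, 32.90, -96.95)
```

Apply it over all loan/branch pairs and keep the minimum per loan to get "distance to nearest branch."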
The time it takes to drive from the customer's address to the branch is a much better predictor than distance, but it is more difficult to calculate. The easiest approach is a commercial package called Arcview Business Analyst and the related Arcview Network Analyst products from ESRI. They aren't cheap, so you might want to contract with ESRI to do this part of the work. There may be other alternatives within the Google Maps API or other navigation web services.
Join Loan Data with Rate Sheet Historical Data
One of the most important tests is to look at deviations from published rate sheets (see the FDIC Compliance Manual, January 2014, page IV-1.8, section P2). To do this you will need to join your historical rate sheets with the loan data for corresponding dates and then calculate the deviation from the rate sheet.
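One base-R way to sketch this join, assuming each rate sheet applies from its effective date until the next one (data and column names are made up):

```r
# Historical rate sheets, sorted by effective date
rates <- data.frame(effective  = as.Date(c("2014-01-01", "2014-03-15")),
                    sheet_rate = c(4.50, 4.25))

# Funded loans with their contract rates
loans <- data.frame(funded = as.Date(c("2014-02-01", "2014-04-01")),
                    rate   = c(4.75, 4.20))

# For each loan, find the last rate sheet on or before the funding date
idx <- findInterval(loans$funded, rates$effective)
loans$sheet_rate <- rates$sheet_rate[idx]
loans$deviation  <- loans$rate - loans$sheet_rate
loans
```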
Create Variables for Analysis and Create Training and Test Data Sets
Now that you have cleaned up all of your data as described in Preparing for a Fair Lending Examination Statistical Analysis, and done all of the estimates for the borrower's race/ethnicity/gender, combine everything into a single dataset to be used for analysis. For codes that are numbers, make sure to identify them as factor or ordered factor data types rather than real-valued numbers. Now is also the time to create duplicate variables with common transformations: normalize values to a 0 to 1 scale, or take the log of a variable whose values span orders of magnitude. This step is the first that must be done in a statistical tool, as databases and flat files don't support the concepts of "factor" and "ordered factor."
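A sketch of these conversions in R, with invented variable names:

```r
# Made-up loan data with numeric codes and a skewed dollar variable
loans <- data.frame(
  branch_code = c(1, 2, 1, 3),               # a code, not a quantity
  credit_tier = c(1, 2, 2, 3),               # ordered: 1 is best
  income      = c(35000, 80000, 1200000, 52000)
)

loans$branch_code <- factor(loans$branch_code)   # unordered categorical
loans$credit_tier <- ordered(loans$credit_tier)  # ordered categorical
loans$log_income  <- log10(loans$income)         # spans orders of magnitude
loans$income01    <- (loans$income - min(loans$income)) /
                     (max(loans$income) - min(loans$income))  # 0-to-1 scale
```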
You should also create training and test datasets to determine whether or not the models that you generate are overfitted. If protected groups (or unprotected groups) are very infrequent in your dataset, you should consider repeating some of these low-frequency observations in the training data set. If they are very infrequent, they may be ignored; a pattern could be present but not recognized, in which case you could get a rude awakening during the examination.
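A minimal sketch of such a split, oversampling a rare group in the training data only (never in the test data, which must stay representative):

```r
set.seed(7)
n <- 100
d <- data.frame(id = 1:n, rare = c(rep(TRUE, 5), rep(FALSE, 95)))

# 70/30 train/test split
train_idx <- sample(n, 0.7 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Repeat the rare-group rows four extra times in the training set
rare_rows <- train[train$rare, ]
train <- rbind(train, rare_rows[rep(seq_len(nrow(rare_rows)), 4), ])
```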
Testing for Disparate Treatment
There are several approaches to looking for disparate treatment under the Fair Lending regulations. Since we are interested in screening for problems rather than proving a problem, it is appropriate to use an approach that casts a wide net and identifies issues that might not rise to the levels of statistical significance, financial materiality, frequency, or causation that would cause problems. These terms are mine and don't appear in any regulation or compliance manual that I've seen. I use them because most of the discussions that I've heard combine all of these concepts into the term "significant" and aren't all that precise.
The statistical approach that follows is not necessarily the procedure used by regulators. The predictive modeling approach that I discuss below will probably indicate patterns where race/ethnicity/gender are useful in predicting price where hypothesis testing approaches might not identify race/ethnicity/gender as statistically significant. Remember that in this analysis, we want to cast a broad net to find anything that might be remotely problematic.
Once all of the data preparation is done, you can begin to look at the data and identify any race/ethnicity/gender patterns that exist. Broadly speaking, you will need to look at interest rates on loans that were approved, including both loans that closed and loans that did not close. You will also need to look at loan approvals.
Because we are screening for problems, we don't want to spend a lot of time if we can help it. The approaches below are by no means exhaustive, but instead are intended to be a labor-efficient approach to screening for problems. The section is divided into the following steps:
 Visualizations
 Disparate Treatment in Pricing: Test Approved Loans Including Loans that Did Not Close
 Disparate Treatment in Underwriting: Test Approved vs. Denied Loan Applications
 Disparate Treatment in Product Selection: Test Qualified vs. Sold
 Test Models for Over-Fitting
Visualizations
Before doing any statistical tests on the dataset, it is usually helpful to look at some simple visualizations. A few possibilities are listed below:
 Generate visualizations for all of the variables that you have in the dataset. The best way to start is to plot the deviation from the rate sheet vs. each variable; in the R statistical program, you can do this easily using the lattice package.
 Plot the geocodes for all of the loans on a map, along with branch locations. This won't shed any light on disparate treatment directly, but it is an easy plot to do and may help you to understand sales patterns better.
 For each racial/ethnic group, plot the deviation from the rate sheet as a time series to see if there are any seasonal patterns; you may find significant deviations immediately before and after rate sheet changes. If this is the case, you should look at these by race and ethnicity to see if there are patterns in who got the old rate after the rate sheet change in a rising environment and who got the new rate early in a falling environment.
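The first bullet might look like this with lattice (the column names are assumptions):

```r
library(lattice)

# Deviation from rate sheet vs. a creditworthiness variable,
# one panel per racial/ethnic group
xyplot(deviation ~ credit_score | ethnicity, data = loans,
       xlab = "Credit score",
       ylab = "Deviation from rate sheet")
```

Repeating the call over each predictor quickly surfaces any variable with a visible race/ethnicity pattern.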
Disparate Treatment in Pricing: Test Approved Loans Including Loans that Did Not Close
There are many ways that you can look for disparate treatment in pricing in approved loans, but the fastest way to get an understanding of the data would be to do a stepwise regression to predict the interest rate on the loan using all of the creditworthiness metrics available plus race/ethnicity/gender. If the race/ethnicity/gender variable shows up as significant in the stepwise regression model, you either have disparate treatment in pricing, you have been making credit decisions on creditworthiness variables that are not included in your dataset, or you need to do further analysis to find a model that better explains the patterns present in the data. Stepwise regression gives good models quickly, but there may well be a model that better explains the patterns present in the data that the stepwise automation didn’t find.
At a minimum, you should do the following:
 Perform a stepwise linear regression to predict interest rate
 Perform a stepwise linear regression to predict interest rate deviation from rate sheet
 Repeat analysis for each indirect dealer/originator
In all cases, make sure to check the residual plots for the various regression models.
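A sketch of the stepwise screen on simulated loan data (variable names are invented; a real dataset will have many more creditworthiness variables):

```r
# Simulated loans: deviation driven by credit score and LTV only
set.seed(3)
n <- 200
loans <- data.frame(
  credit_score = rnorm(n, 700, 50),
  ltv          = runif(n, 0.5, 1.0),
  minority     = rbinom(n, 1, 0.3)   # estimated protected-group indicator
)
loans$deviation <- 0.5 - 0.001 * loans$credit_score + 0.2 * loans$ltv +
                   rnorm(n, sd = 0.05)

full <- lm(deviation ~ credit_score + ltv + minority, data = loans)
fit  <- step(full, trace = 0)  # AIC-based stepwise selection
names(coef(fit))  # does "minority" survive the selection?
```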
Although it won't tell you anything directly, looking at the closing rates for each racial and ethnic group as shown in Table 1 can point you to further investigation; if there are statistically significant differences, you will probably want to expend more effort in the later steps.
Group               Approved  Closed  Close Rate  P-value vs. Non-Hispanic White
Non-Hispanic White       100      50      0.50
Ethnic Group 1           100      40      0.40
Ethnic Group 2           100      60      0.60
Ethnic Group 3           100      45      0.45
All Groups               400     195      0.4875
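The p-value column can be filled in with a two-sample proportion test; for example, comparing Ethnic Group 1 to Non-Hispanic White using the counts in Table 1:

```r
# 40 of 100 closed vs. 50 of 100 closed: is the difference significant?
pt <- prop.test(x = c(50, 40), n = c(100, 100))
pt$p.value
```

With these counts the difference is not statistically significant at conventional levels, which is the kind of quick screen this table is for.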
Disparate Treatment in Underwriting: Test Approved vs. Denied Loan Applications
To look for disparate treatment in underwriting, you will need to look at both approvals and denials. To get a quick understanding, do a stepwise logistic regression to predict loan approval.
 Perform stepwise logistic regression to predict loan approval
 Repeat analysis for each indirect dealer/originator
In all cases, make sure to check the residual plots for the various regression models.
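A sketch of the logistic version on simulated application data (variable names invented):

```r
# Simulated applications: approval driven by credit score and DTI only
set.seed(5)
n <- 300
apps <- data.frame(
  credit_score = rnorm(n, 680, 60),
  dti          = runif(n, 0.1, 0.6),
  minority     = rbinom(n, 1, 0.3)
)
p <- plogis(-10 + 0.018 * apps$credit_score - 3 * apps$dti)
apps$approved <- rbinom(n, 1, p)

full <- glm(approved ~ credit_score + dti + minority,
            family = binomial, data = apps)
fit  <- step(full, trace = 0)  # AIC-based stepwise selection
summary(fit)
```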
Disparate Treatment in Product Steering: Test Qualified vs. Sold
Finally, we need to look at product selection to make sure that qualified borrowers aren't steered into more expensive subprime products when they qualify for a prime product. For this analysis, we will generate a matrix like the one shown in Table 2 below for each racial/ethnic group:
                Qualified for Prime  Not Qualified for Prime
Sold Prime                      100                        0
Sold SubPrime                     0                      100
In this table, everyone should be on the northwest-to-southeast diagonal. If you have nonzero entries on the southwest-to-northeast diagonal for any of the racial or ethnic groups, you will need to perform a chi-squared (χ²) test to determine whether the groups are treated differently from a steering perspective.
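In R this is a one-liner once the counts are tabulated; the counts below are made up, comparing two groups among borrowers who qualified for prime:

```r
# Among prime-qualified borrowers: product sold, by group (made-up counts)
steer <- matrix(c(90, 10,    # Group A: sold prime, sold subprime
                  70, 30),   # Group B: sold prime, sold subprime
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Group A", "Group B"),
                                c("Sold Prime", "Sold SubPrime")))

chisq.test(steer)$p.value  # small p-value suggests steering differs by group
```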
Test Models for Over-Fitting
If you come up with models that include race/ethnicity/gender as significant predictors even when all other creditworthiness variables are available to the stepwise regression, make sure to run them against the test data set. If the model continues to predict well, then either you are missing a creditworthiness variable with strong race/ethnicity/gender patterns, you have a lot of work ahead to find a manually constructed model that performs better, or you have a disparate treatment problem that needs to be addressed.
Conclusions
The analysis described above will help you to identify whether you have disparate treatment patterns that could appear in a Fair Lending examination statistical analysis. Since the exact procedures that the regulatory agencies use are not public, this is not a guarantee that issues won't come up.