Chapter 18 Generalized Low Rank Models

The PCs constructed in PCA are linear in nature, which can cause deficiencies in its performance. This is much like the deficiency that linear regression has in capturing nonlinear relationships. Alternative approaches, known as matrix factorization methods have helped address this issue. More recently, however, a generalization of PCA and matrix factorization, called generalized low rank models (GLRMs) (Udell et al. 2016), has become a popular approach to dimension reduction.

18.1 Prerequisites

This chapter leverages the following packages:

# Helper packages
library(dplyr)    # for data manipulation
library(ggplot2)  # for data visualization
library(tidyr)    # for data reshaping

# Modeling packages
library(h2o)  # for fitting GLRMs

To illustrate GLRM concepts, we’ll continue using the my_basket data set created in the previous chapter:

url <- ""
my_basket <- readr::read_csv(url)

18.2 The idea

GLRMs reduce the dimension of a data set by producing a condensed vector representation for every row and column in the original data. Specifically, given a data set A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. The matrix Y has k rows and n columns, where n is equal to the total dimension of the embedded features in A. For example, if A has 4 numeric columns and 1 categorical column with 3 distinct levels (e.g., red, blue, and green), then Y will have 7 columns (due to one-hot encoding). When A contains only numeric features, the number of columns in A and Y are identical, as shown Eq. (18.1).

\[\begin{equation} \tag{18.1} m \Bigg \{ \overbrace{\Bigg [ \quad A \quad \Bigg ]}^n \hspace{0.5em} \approx \hspace{0.5em} m \Bigg \{ \overbrace{\Bigg [ X \Bigg ]}^k \hspace{0.5em} \overbrace{\big [ \quad Y \quad \big ]}^n \big \}k \end{equation}\]

Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected onto this smaller dimensional feature space. We can approximately reconstruct A from the matrix product \(X \times Y\), which has rank k. The number k is chosen to be much less than both m and n (e.g., for 1 million rows and 2,000 columns of numeric data, k could equal 15). The smaller k is, the more compression we gain from our low rank representation.

To make this more concrete, lets look at an example using the mtcars data set (available from the built-in datasets package) where we have 32 rows and 11 features (see ?datasets::mtcars for details):

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars represents our original matrix A. If we want to reduce matrix A to a rank of \(k = 3\) then our objective is to produce two matrices X and Y that when we multiply them together produce a near approximation to the original values in A.

We call the condensed columns and rows in matrices X and X, respectively, “archetypes” because they are a representation of the original features and observations. The archetypes in X represent each observation projected onto the smaller dimensional space, and the archetypes in Y represent each feature projeced onto the smaller dimensional space.

Example GLRM where we reduce the mtcars data set down to a rank of 3.

Figure 18.1: Example GLRM where we reduce the mtcars data set down to a rank of 3.

The resulting archetypes are similar in spirit to the PCs in PCA; as they are a reduced feature set that represents our original features. In fact, if our features truly behave in a linear and orthogonal manner than our archetypes produced by a GLRM will produce the same reduced feature set as PCA. However, if they are not linear, then GLRM will provide archetypes that are not necessarily orthogonal.

However, a few questions remain:

  1. How does GLRM produce the archetype values?
  2. How do you select the appropriate value for k?

We’ll address these questions next.

18.3 Finding the lower ranks

18.3.1 Alternating minimization

There are a number of methods available to identify the optimal archetype values for each element in X and Y; however, the most common is based on alternating minimization. Alternating minimization simply alternates between minimizing some loss function for each feature in X and Y. In essence, random values are initially set for the archetype values in X and Y. The loss function is computed (more on this shortly), and then the archetype values in X are slightly adjusted via gradient descent (Section 12.2.2) and the improvement in the loss function is recorded. The archetype values in Y are then slightly adjusted and the improvement in the loss function is recorded. This process is continued until the loss function is optimized or some suitable stopping condition is reached.

18.3.2 Loss functions

As stated above, the optimal achetype values are selected based on minimizing some loss function. The loss function should reflect the intuitive notion of what it means to “fit the data well”. The most common loss function is the quadratic loss. The quadratic loss is very similar to the SSE criterion (Section 2.6) for supervised learning models where we seek to minimize the squared difference between the actual value in our original data (matrix A) and the predicted value based on our achetypal matrices (\(X \times Y\)) (i.e., minimizing the squared residuals).

\[\begin{equation} \tag{18.2} \text{quadratic loss} = minimize \bigg\{ \sum^m_{i=1}\sum^{n}_{i=1}\left(A_{i,j} - X_iY_j\right)^2 \bigg\} \end{equation}\]

However, note that some loss functions are preferred over others in certain scenarios. For example, quadratic loss, similar to SSE, can be heavily influenced by outliers. If you do not want to emphasize outliers in your data set, or if you just want to try minimize errors for lower values in addition to higher values (e.g., trying to treat low-cost products equally as important as high-cost products) then you can use the Huber loss function. For brevity we do not show the Huber loss equation but it essentially applies quadratic loss to small errors and uses the absolute value for errors with larger values. Figure 18.2 illustrates how the quadratic and Huber loss functions differ.

Huber loss (green) compared to quadratic loss (blue).  The $x$-axis represents a particular value at $A_{i,j}$ and the $y$-axis represents the predicted value produced by $X_iY_j$. Note how the Huber loss produces a linear loss while the quadratic loss produces much larger loss values as the residual value increases.

Figure 18.2: Huber loss (green) compared to quadratic loss (blue). The \(x\)-axis represents a particular value at \(A_{i,j}\) and the \(y\)-axis represents the predicted value produced by \(X_iY_j\). Note how the Huber loss produces a linear loss while the quadratic loss produces much larger loss values as the residual value increases.

As with supervised learning, the choice of loss function should be driven by the business problem.

18.3.3 Regularization

Another important component to fitting GLRMs that you, the analyst, should consider is regularization. Much like the regularization discussed in Chapter 6, regularization applied to GLRMs can be used to constrain the size of the archetypal values in X (with \(r_x\left(X\right)\) in the equation below) and/or Y (with \(r_y\left(Y\right)\) in the equation below). This can help to create sparse X and/or Y matrices to mitigate the effect of negative features in the data (e.g., multicollinearity or excessive noise) which can help prevent overfitting.

If you’re using GLRMs to merely describe your data and gain a better understanding of how observations and/or features are similar then you do not need to use regularization. If you are creating a model that will be used to assign new observations and/or features to these dimensions, or you want to use GLRMs for imputation then you should use regularization as it can make your model generalize better to unseen data.

\[\begin{equation} \tag{18.3} \text{regularization} = minimize \bigg\{ \sum^m_{i=1}\sum^{n}_{i=1}\left(A_{i,j} - X_iY_j\right)^2 + r_x\left(X\right) + r_y\left(Y\right) \bigg\} \end{equation}\]

As the above equation illustrates, we can regularize both matrices X and Y. However, when performing dimension reduction we are mainly concerned with finding a condensed representation of the features, or columns. Consequently, we’ll be more concerned with regularizing the Y matrix (\(r_y\left(Y\right)\)). This regularizer encourages the Y matrix to be column-sparse so that many of the columns are all zero. Columns in Y that are zero, means that those features are likely uninformative in reproducing the original matrix A.

Even when we are focusing on dimension reduction, applying regularization to the X matrix can still improve performance. Consequently, it is good practice to compare different approaches.

There are several regularizers to choose from. You can use a ridge regularizer to retain all columns but force many of the values to be near zero. You can also use a LASSO regularizer which will help zero out many of the columns; the LASSO helps you perform automated feature selection. The non-negative regularizer can be used when your feature values should always be zero or positive (e.g., when performing market basket analysis).

The primary purpose of the regularizer is to minimize overfitting. Consequently, performing GRLMs without a regularizer will nearly always perform better than when using a regularizer if you are only focusing on a single data set. The choice of regularization should be led by statistical considerations, so that the model generalizes well to unseen data. This means you should always incorporate some form of CV to assess the performance of regularization on unseen data.

18.3.4 Selecting k

Lastly, how do we select the appropriate value for k? There are two main approaches, both of which will be illustrated in the section that follows. First, if you’re using GLRMs to describe your data, then you can use many of the same approaches we discussed in Section 17.5 where we assess how different values of k minimize our loss function. If you are using GLRMs to produce a model that will be used to assign future observations to the reduced dimensions then you should use some form of CV.

18.4 Fitting GLRMs in R

h2o is the preferred package for fitting GLRMs in R. In fact, a few of the key researchers that developed the GLRM methodology helped develop the h2o implementation as well. Let’s go ahead and start up h2o:

h2o.no_progress()  # turn off progress bars
h2o.init(max_mem_size = "5g")  # connect to H2O instance

18.4.1 Basic GLRM model

First, we convert our my_basket data frame to an appropriate h2o object before calling h2o.glrm(). The following performs a basic GLRM analysis with a quadratic loss function. A few arguments that h2o.glrm() provides includes:

  • k: rank size desired, which declares the desired reduced dimension size of the features. This is specified by you, the analysts, but is worth tuning to see which size k performs best.
  • loss: there are multiple loss functions to apply. The default is “quadratic”.
  • regularization_x: type of regularizer to apply to the X matrix.
  • regularization_y: type of regularizer to apply to the Y matrix.
  • transform: if your data are not already standardized this will automate this process for you. You can also normalize, demean, and descale.
  • max_iterations: number of iterations to apply for the loss function to converge. Your goal should be to increase max_iterations until your loss function plot flatlines.
  • seed: allows for reproducibility.
  • max_runtime_secs: when working with large data sets this will limit the runtime for model training.

There are additional arguments that are worth exploring as you become more comfortable with h2o.glrm(). Some of the more useful ones include the magnitude of the regularizer applied (gamma_x, gamma_y). If you’re working with ordinal features then multi_loss = “Ordinal” may be more appropriate. If you’re working with very large data sets than min_step_size can be adjusted to speed up the learning process.

# convert data to h2o object
my_basket.h2o <- as.h2o(my_basket)

# run basic GLRM
basic_glrm <- h2o.glrm(
  training_frame = my_basket.h2o,
  k = 20, 
  loss = "Quadratic",
  regularization_x = "None", 
  regularization_y = "None", 
  transform = "STANDARDIZE", 
  max_iterations = 2000,
  seed = 123

We can check the results with summary(). Here, we see that our model converged at 901 iterations and the final quadratic loss value (SSE) is 31,004.59. We can also see how many iterations it took for our loss function to converge to its minimum:

# get top level summary information on our model
## Model Details:
## ==============
## H2ODimReductionModel: glrm
## Model Key:  GLRM_model_R_1538746363268_1 
## Model Summary: 
##   number_of_iterations final_step_size final_objective_value
## 1                  901         0.36373           31004.59190
## H2ODimReductionMetrics: glrm
## ** Reported on training data. **
## Sum of Squared Error (Numeric):  31004.59
## Misclassification Error (Categorical):  0
## Number of Numeric Entries:  84000
## Number of Categorical Entries:  0
## Scoring History: 
##             timestamp   duration iterations step_size   objective
## 1 2018-10-05 09:32:54  1.106 sec          0   0.66667 67533.03413
## 2 2018-10-05 09:32:54  1.149 sec          1   0.70000 49462.95972
## 3 2018-10-05 09:32:55  1.226 sec          2   0.46667 49462.95972
## 4 2018-10-05 09:32:55  1.257 sec          3   0.31111 49462.95972
## 5 2018-10-05 09:32:55  1.289 sec          4   0.32667 41215.38164
## ---
##               timestamp   duration iterations step_size   objective
## 896 2018-10-05 09:33:22 28.535 sec        895   0.28499 31004.59207
## 897 2018-10-05 09:33:22 28.566 sec        896   0.29924 31004.59202
## 898 2018-10-05 09:33:22 28.597 sec        897   0.31421 31004.59197
## 899 2018-10-05 09:33:22 28.626 sec        898   0.32992 31004.59193
## 900 2018-10-05 09:33:22 28.655 sec        899   0.34641 31004.59190
## 901 2018-10-05 09:33:22 28.685 sec        900   0.36373 31004.59190

# Create plot to see if results converged - if it did not converge, 
# consider increasing iterations or using different algorithm
Loss curve for our GLRM model. The model converged at 901 iterations.

Figure 18.3: Loss curve for our GLRM model. The model converged at 901 iterations.

Our model object (basic_glrm) contains a lot of information (see everything it contains with str(basic_glrm)). Similar to h2o.pca(), we can see how much variance each archetype (aka principal component) explains by looking at the model$importance component:

# amount of variance explained by each archetype (aka "pc")
## Importance of components: 
##                             pc1      pc2      pc3      pc4      pc5      pc6      pc7
## Standard deviation     1.513919 1.473768 1.459114 1.440635 1.435279 1.411544 1.253307
## Proportion of Variance 0.054570 0.051714 0.050691 0.049415 0.049048 0.047439 0.037400
## Cumulative Proportion  0.054570 0.106284 0.156975 0.206390 0.255438 0.302878 0.340277
##                             pc8      pc9     pc10     pc11     pc12     pc13     pc14
## Standard deviation     1.026387 1.010238 1.007253 0.988724 0.985320 0.970453 0.964303
## Proportion of Variance 0.025083 0.024300 0.024156 0.023276 0.023116 0.022423 0.022140
## Cumulative Proportion  0.365360 0.389659 0.413816 0.437091 0.460207 0.482630 0.504770
##                            pc15     pc16     pc17     pc18     pc19     pc20
## Standard deviation     0.951610 0.947978 0.944826 0.932943 0.931745 0.924206
## Proportion of Variance 0.021561 0.021397 0.021255 0.020723 0.020670 0.020337
## Cumulative Proportion  0.526331 0.547728 0.568982 0.589706 0.610376 0.630713

Consequently, we can use this information just like we did in the PCA chapter to determine how many components to keep (aka how large should our k be). For example, the following provides nearly the same results as we saw in Section 17.5.2.

When your data aligns to the linearity and orthogonal assumptions made by PCA, the default GLRM model will produce nearly the exact same results regarding variance explained. However, how features align to the archetypes will be different than how features align to the PCs in PCA.

    PC  = basic_glrm@model$importance %>% seq_along(),
    PVE = basic_glrm@model$importance %>% .[2,] %>% unlist(),
    CVE = basic_glrm@model$importance %>% .[3,] %>% unlist()
) %>%
    gather(metric, variance_explained, -PC) %>%
    ggplot(aes(PC, variance_explained)) +
    geom_point() +
    facet_wrap(~ metric, ncol = 1, scales = "free")
Variance explained by the first 20 archetypes in our GLRM model.

Figure 18.4: Variance explained by the first 20 archetypes in our GLRM model.

We can also extract how each feature aligns to the different archetypes by looking at the model$archetypes component:

t(basic_glrm@model$archetypes)[1:5, 1:5]
##              Arch1      Arch2      Arch3      Arch4       Arch5
## 7up     -0.5783538 -1.5705325  0.9906612 -0.9306704  0.17552643
## lasagna  0.2196728  0.1213954 -0.7068851  0.8436524  3.56206178
## pepsi   -0.2504310 -0.8156136 -0.7669562 -1.2551630 -0.47632696
## yop     -0.1856632  0.4000083 -0.4855958  1.1598919 -0.26142763
## redwine -0.1372589 -0.1059148 -0.9579530  0.4641668 -0.08539977

We can use this information to see how the different features contribute to Archetype 1 or compare how features map to multiple Archetypes (similar to how we did this in the PCA chapter). The following shows that many liquid refreshments (e.g., instant coffee, tea, horlics, and milk) contribute positively to archetype 1. We also see that some candy bars contribute strongly to archetype 2 but minimally, or negatively, to archetype 1. The results are displayed in Figure 18.5.

p1 <- t(basic_glrm@model$archetypes) %>% %>% 
  mutate(feature = row.names(.)) %>%
  ggplot(aes(Arch1, reorder(feature, Arch1))) +

p2 <- t(basic_glrm@model$archetypes) %>% %>% 
  mutate(feature = row.names(.)) %>%
  ggplot(aes(Arch1, Arch2, label = feature)) +

gridExtra::grid.arrange(p1, p2, nrow = 1)
Feature contribution for archetype 1 and 2.

Figure 18.5: Feature contribution for archetype 1 and 2.

If we were to use the scree plot approach (Section 17.5.3) to determine \(k\), we would decide on \(k = 8\). Consequently, we would want to re-run our model with \(k = 8\). We could then use h2o.reconstruct() and apply our model to a data set to see the predicted values. Below we see that our predicted values include negative numbers and non-integers. Considering our original data measures the counts of each product purchased we would need to apply some additional rounding logic to convert values to integers:

# Re-run model with k = 8
k8_glrm <- h2o.glrm(
  training_frame = my_basket.h2o,
  k = 8, 
  loss = "Quadratic",
  regularization_x = "None", 
  regularization_y = "None", 
  transform = "STANDARDIZE", 
  max_iterations = 2000,
  seed = 123

# Reconstruct to see how well the model did
my_reconstruction <- h2o.reconstruct(k8_glrm, my_basket.h2o, reverse_transform = TRUE)

# Raw predicted values
my_reconstruction[1:5, 1:5]
##   reconstr_7up reconstr_lasagna reconstr_pepsi reconstr_yop reconstr_red-wine
## 1  0.025595726      -0.06657864    -0.03813350 -0.012225807        0.03814142
## 2 -0.041778553       0.02401056    -0.05225379 -0.052248809       -0.05487031
## 3  0.012373600       0.04849545     0.05760424 -0.009878976        0.02492625
## 4  0.338875544       0.00577020     0.48763580  0.187669229        0.53358405
## 5  0.003869531       0.05394523     0.07655745 -0.010977765        0.51779314
## [5 rows x 5 columns]

# Round values to whole integers
my_reconstruction[1:5, 1:5] %>% round(0)
##   reconstr_7up reconstr_lasagna reconstr_pepsi reconstr_yop reconstr_red-wine
## 1            0                0              0            0                 0
## 2            0                0              0            0                 0
## 3            0                0              0            0                 0
## 4            0                0              0            0                 1
## 5            0                0              0            0                 1
## [5 rows x 5 columns]

18.4.2 Tuning to optimize for unseen data

A more sophisticated use of GLRMs is to create a model where the reduced archetypes will be used on future, unseen data. The preferred approach to deciding on a final model when you are going to use a GLRM to score future observations, is to perform a validation process to select the optimally tuned model. This will help your final model generalize better to unseen data.

As previously mentioned, when applying a GLRM model to unseen data, using a regularizer can help to reduce overfitting and help the model generalize better. Since our data represents all positive values (items purchases which can be 0 or any positive integer), we apply the non-negative regularizer. This will force all predicted values to at least be non-negative. We see this when we use predict() on the results.

If we compare the non-regularized GLRM model (k8_glrm) to our regularized model (k8_glrm_regularized), you will notice that the non-regularized model will almost always have a lower loss value. However, this is because the regularized model is being generalized more and is not overfitting to our training data, which should help improve on unseen data.

# Use non-negative regularization
k8_glrm_regularized <- h2o.glrm(
  training_frame = my_basket.h2o,
  k = 8, 
  loss = "Quadratic",
  regularization_x = "NonNegative", 
  regularization_y = "NonNegative",
  gamma_x = 0.5,
  gamma_y = 0.5,
  transform = "STANDARDIZE", 
  max_iterations = 2000,
  seed = 123

# Show predicted values
predict(k8_glrm_regularized, my_basket.h2o)[1:5, 1:5]
##   reconstr_7up reconstr_lasagna reconstr_pepsi reconstr_yop reconstr_red-wine
## 1     0.000000                0      0.0000000    0.0000000         0.0000000
## 2     0.000000                0      0.0000000    0.0000000         0.0000000
## 3     0.000000                0      0.0000000    0.0000000         0.0000000
## 4     0.609656                0      0.6311428    0.4565658         0.6697422
## 5     0.000000                0      0.0000000    0.0000000         0.8257210
## [5 rows x 5 columns]

# Compare regularized versus non-regularized loss
par(mfrow = c(1, 2))
Loss curve for original GLRM model that does not include regularization (left) compared to a GLRM model with regularization (right).

Figure 18.6: Loss curve for original GLRM model that does not include regularization (left) compared to a GLRM model with regularization (right).

GLRM models behave much like supervised models where there are several hyperparameters that can be tuned to optimize performance. For example, we can choose from a combination of multiple regularizers, we can adjust the magnitude of the regularization (i.e., the gamma_* parameters), and we can even tune the rank \(k\).

Unfortunately, h2o does not currently provide an automated tuning grid option, such as h2o.grid() which can be applied to supervised learning models. To perform a grid search with GLRMs, we need to create our own custom process.

First, we create training and validation sets so that we can use the validation data to see how well each hyperparameter setting does on unseen data. Next, we create a tuning grid that contains 225 combinations of hyperparameters. For this example, we’re going to assume we want \(k = 8\) and we only want to tune the type and magnitude of the regularizers. Lastly, we create a for loop to go through each hyperparameter combination, apply the given model, assess on the model’s performance on the hold out validation set, and extract the error metric.

The squared error loss ranges from as high as 58,908 down to 13,371. This is a significant reduction in error. We see that the best models all have errors in the 13,700+ range and the majority of them have a large (signaled by gamma_x) L1 (LASSO) regularizer on the X matrix and also a non-negative regularizer on the Y matrix. However, the magnitude of the Y matrix regularizers (signaled by gamma_y) has little to no impact.

The following tuning and validation process took roughly 35 minutes to complete.

# Split data into train & validation
split <- h2o.splitFrame(my_basket.h2o, ratios = 0.75, seed = 123)
train <- split[[1]]
valid <- split[[2]]

# Create hyperparameter search grid
params <- expand.grid(
  regularization_x = c("None", "NonNegative", "L1"),
  regularization_y = c("None", "NonNegative", "L1"),
  gamma_x = seq(0, 1, by = .25),
  gamma_y = seq(0, 1, by = .25),
  error = 0,
  stringsAsFactors = FALSE

# Perform grid search
for(i in seq_len(nrow(params))) {
  # Create model
  glrm_model <- h2o.glrm(
    training_frame = train,
    k = 8, 
    loss = "Quadratic",
    regularization_x = params$regularization_x[i], 
    regularization_y = params$regularization_y[i],
    gamma_x = params$gamma_x[i],
    gamma_y = params$gamma_y[i],
    transform = "STANDARDIZE", 
    max_runtime_secs = 1000,
    seed = 123
  # Predict on validation set and extract error
  validate <- h2o.performance(glrm_model, valid)
  params$error[i] <- validate@metrics$numerr

# Look at the top 10 models with the lowest error rate
params %>%
  arrange(error) %>%
##    regularization_x regularization_y gamma_x gamma_y    error
## 1                L1      NonNegative    1.00    0.25 13731.81
## 2                L1      NonNegative    1.00    0.50 13731.81
## 3                L1      NonNegative    1.00    0.75 13731.81
## 4                L1      NonNegative    1.00    1.00 13731.81
## 5                L1      NonNegative    0.75    0.25 13746.77
## 6                L1      NonNegative    0.75    0.50 13746.77
## 7                L1      NonNegative    0.75    0.75 13746.77
## 8                L1      NonNegative    0.75    1.00 13746.77
## 9                L1             None    0.75    0.00 13750.79
## 10               L1               L1    0.75    0.00 13750.79

Once we identify the optimal model, we’ll want to re-run this on the entire training data set. We can then score new unseen observations with this model, which tells us based on their buying behavior and how this behavior aligns to the \(k = 8\) dimensions in our model, what products are they’re likely to buy and would be good opportunities to market to them.

# Apply final model with optimal hyperparamters
final_glrm_model <- h2o.glrm(
  training_frame = my_basket.h2o,
  k = 8, 
  loss = "Quadratic",
  regularization_x = "L1", 
  regularization_y = "NonNegative",
  gamma_x = 1,
  gamma_y = 0.25,
  transform = "STANDARDIZE", 
  max_iterations = 2000,
  seed = 123

# New observations to score
new_observations <- as.h2o(sample_n(my_basket, 2))

# Basic scoring
predict(final_glrm_model, new_observations) %>% round(0)
##   reconstr_7up reconstr_lasagna reconstr_pepsi reconstr_yop reconstr_red-wine reconstr_cheese reconstr_bbq reconstr_bulmers reconstr_mayonnaise reconstr_horlics reconstr_chicken-tikka reconstr_milk reconstr_mars reconstr_coke
## 1            0                0              0            0                 0               0            0                0                   0                0                      0             0             0             0
## 2            0               -1              1            0                 0               0            0                0                   0                1                      0             1             0             1
##   reconstr_lottery reconstr_bread reconstr_pizza reconstr_sunny-delight reconstr_ham reconstr_lettuce reconstr_kronenbourg reconstr_leeks reconstr_fanta reconstr_tea reconstr_whiskey reconstr_peas reconstr_newspaper
## 1                0              0              0                      0            0                0                    0              0              0            0                0             0                  0
## 2                0              0             -1                      0            0                0                    0              0              0            1                0             0                  0
##   reconstr_muesli reconstr_white-wine reconstr_carrots reconstr_spinach reconstr_pate reconstr_instant-coffee reconstr_twix reconstr_potatoes reconstr_fosters reconstr_soup reconstr_toad-in-hole reconstr_coco-pops
## 1               0                   0                0                0             0                       0             0                 0                0             0                     0                  0
## 2               1                   0                0                0             0                       1             0                 0                0             0                     0                  1
##   reconstr_kitkat reconstr_broccoli reconstr_cigarettes
## 1               0                 0                   0
## 2               0                 0                   0

18.5 Final thoughts

GLRMs are an extension of the well-known matrix factorization methods such as PCA. While PCA is limited to numeric data, GLRMs can handle mixed numeric, categorical, ordinal, and boolean data with an arbitrary number of missing values. It allows the user to apply regularization to \(X\) and \(Y\), imposing restrictions like non-negativity appropriate to a particular data science context. Thus, it is an extremely flexible approach for analyzing and interpreting heterogeneous data sets. Although this chapter focused on using GLRMs for dimension/feature reduction, GLRMs can also be used for clustering, missing data imputation, compute memory reduction, and speed improvements.


Udell, Madeleine, Corinne Horn, Reza Zadeh, Stephen Boyd, and others. 2016. “Generalized Low Rank Models.” Foundations and Trends in Machine Learning 9 (1). Now Publishers, Inc.: 1–118.