Walkthrough

This page gives a basic example of how to use EmpiricalRiskMinimization.jl, with links to other documentation pages where you can learn about the more advanced functionality available in the package.

Suppose we want to solve a regularized least squares linear regression problem. Let's first generate some data.

n = 2000; k = 30;
d = k + 1;
U = randn(n, k); theta = randn(d);
v = [ones(n) U] * theta + 0.5 * randn(n);

So, we've generated 2000 random raw data points, occupying the rows of U; each data point has 30 features. Additionally, we generated targets v, so that v[i] is the label associated with example U[i, :].
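Written out, the problem we are setting up is the standard regularized least squares objective. This is a sketch for orientation: the regularization weight λ and the exact normalization ERM uses internally are assumptions here, not taken from the package.

```latex
\underset{\theta \in \mathbf{R}^d}{\text{minimize}} \quad
\frac{1}{n} \sum_{i=1}^{n} \bigl( x_i^T \theta - v_i \bigr)^2
\;+\; \lambda \sum_{j=2}^{d} \theta_j^2
```

Here x_i = (1, u_i) is the i-th embedded data point, and the coefficient θ_1 on the constant feature is left unregularized, which matches the "Not regularizing constant feature X[:,1]" message in the solver output later on this page.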

Formulating and solving (regularized) least squares linear regression with ERM.jl is simple. The first step is to instantiate the model.

M = Model(U, v, embedall=true);
Model: applying default embedding

The option embedall=true builds the true training data X from U by appending a constant feature to each row of U; it also standardizes the data for us. Many more options are available for training, embedding, and modeling. Of course, to specify a different model, users provide different losses and regularizers.
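Conceptually, the default embedding amounts to something like the following sketch. The function name and the details (per-column standardization to zero mean and unit variance) are illustrative assumptions, not ERM's actual internals.

```julia
using Statistics

# Illustrative sketch of what a default embedding does (assumed behavior,
# not ERM's internals): standardize each raw feature column, then prepend
# a constant column as feature 1.
function embed(U)
    Us = (U .- mean(U, dims=1)) ./ std(U, dims=1)  # zero mean, unit variance per column
    return [ones(size(U, 1)) Us]                   # constant feature in column 1
end

X = embed(randn(5, 3))
size(X)  # (5, 4): one constant column plus 3 standardized features
```

With n = 2000 and k = 30 raw features, this is why the training summary below reports 31 columns in X.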

Training the model and viewing the output takes two lines of code.

train(M)
status(M)
FrameSource: Applying feature map: embed all at once
FrameSource: Applying feature map: Add column 1 as feature0
FrameSource: Applying feature map: standardize column feature0
Model: splitting data
Model: calling solver: QRSolver
Model: Not regularizing constant feature X[:,1]
----------------------------------------
Results for single train/test
  training  loss: 0.012991910517894887
  test loss: 0.014343329348235463
  training samples: 1600
  test samples: 400
  columns in X: 31
----------------------------------------
----------------------------------------
Results for single train/test
  training  loss: 0.012991910517894887
  test loss: 0.014343329348235463
  training samples: 1600
  test samples: 400
  columns in X: 31
----------------------------------------

This training summary is the most basic validation tool that ERM provides; cross-validation and repeated out-of-sample validation are also available.

To assess the accuracy of the model on the train and test sets, we can compute the (average) train and test losses.

println("Training error = $(trainloss(M))")
println("Testing error = $(testloss(M))")
Training error = 0.012991910517894887
Testing error = 0.014343329348235463
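For a square-loss model, these values are just mean squared errors, so you can reproduce the computation directly on data you hold yourself. The helper name below is hypothetical, and matching ERM's reported numbers exactly depends on its internal conventions (train/test split, standardization).

```julia
using Statistics

# Mean square loss of a linear model theta on data (X, v) -- the quantity
# trainloss/testloss report for a square-loss model, up to ERM's conventions.
avg_square_loss(X, theta, v) = mean((X * theta .- v) .^ 2)

X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
theta = [0.5, 1.0]
v = [1.5, 2.5, 3.5, 4.5]
avg_square_loss(X, theta, v)  # 0.0: this theta fits the data exactly
```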

Finally, suppose we want to retrieve the model's predictions on the test data.

v_test_pred = predict_v_from_test(M);

More prediction functions are available. These allow you to supply alternative model parameters, unembed predictions, and evaluate the model on other datasets.