ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
ppforest2 — C++ API Reference

ppforest2 is a fast, memory-efficient implementation of Projection Pursuit Random Forests, built on Projection Pursuit (oblique) Decision Trees. By learning linear projections at each split, the model captures complex structure that axis-aligned trees often miss, without sacrificing interpretability or scalability.

The C++ core is the single source of truth for the implementation. Language bindings (R; Python is planned) wrap the core through thin interface layers.

Overview

The library is organised around a small set of core abstractions:

Abstraction         Header                                  Purpose
Tree                models/Tree.hpp                         A single projection pursuit decision tree
Forest              models/Forest.hpp                       An ensemble of bootstrap trees with OOB estimation
TrainingSpec        models/TrainingSpec.hpp                 Composes seven strategies into a training configuration
ProjectionPursuit   strategies/pp/ProjectionPursuit.hpp     Projection pursuit index optimisation (e.g. PDA)
VariableSelection   strategies/vars/VariableSelection.hpp   Variable subset selection
Cutpoint            strategies/cutpoint/Cutpoint.hpp        Split cutpoint computation
StopRule            strategies/stop/StopRule.hpp            Node stopping condition
Binarization        strategies/binarize/Binarization.hpp    Multiclass → binary regrouping
Grouping            strategies/grouping/Grouping.hpp        Group partition management
LeafStrategy        strategies/leaf/LeafStrategy.hpp        Leaf node creation
GroupPartition      stats/GroupPartition.hpp                Grouped observations with arbitrary row indices
VariableImportance  models/Evaluation.hpp                   Three variable importance measures (permuted, projections, weighted projections)

Quick start

Training a single tree

#include "ppforest2.hpp"
using namespace ppforest2;
// Load data. The last CSV column is the response; labels are encoded as integers.
auto data = io::csv::read_sorted("iris.csv");
// Compose strategies into a training spec:
TrainingSpec spec(
    0, 0, 0, 3, // size (0 = single tree), seed, threads, max_retries
    pp::pda(0.0), // LDA (no penalty)
    vars::all(), // use all variables
    cutpoint::mean_of_means()); // midpoint split rule
// Train via the unified entry point:
auto model = Model::train(spec, data.x, data.y);
// Or train a tree directly. Tree is abstract, so train() returns a Tree::Ptr:
auto tree = Tree::train(spec, data.x, data.y);
types::Outcome label = tree->predict(data.x.row(0));
types::OutcomeVector preds = tree->predict(data.x);

Training a random forest

#include "ppforest2.hpp"
using namespace ppforest2;
// PDA with uniform random variable selection:
TrainingSpec spec(
    500, 0, 0, 3, // size, seed, threads, max_retries
    pp::pda(0.0), // lambda = 0, i.e. plain LDA
    vars::uniform(3), // sample 3 variables per split
    cutpoint::mean_of_means()); // midpoint split rule
// Use the typed factory when you want the classification-specific API
// (proportions, oob_error) without downcasting from the dual-mode
// (classification/regression) base class.
auto forest = ClassificationForest::train(spec, x, y);
types::OutcomeVector preds = forest->predict(x_test);
// Out-of-bag error — uses only trees where each obs was not in the
// bootstrap sample. Returns std::optional<double>: std::nullopt when
// no observation has any OOB tree (e.g. all-in-bag degenerate forest).
std::optional<double> oob = oob_error(*forest, x, y);
// Vote proportions — (n x G) matrix, rows sum to 1. Classification only.
types::FeatureMatrix probs = predict_proportions(*forest, x_test);
// Variable importance — free functions (see models/Evaluation.hpp).
auto vi_perm = vi_permuted(*forest, x, y, 0);
auto vi_proj = vi_projections(*forest, x.cols());
auto vi_wt = vi_weighted_projections(*forest, x, y);
// Convenience: compute all three in one call.
auto vi = variable_importance(*forest, x, y, 0);
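
The out-of-bag rule described in the comments above can be sketched independently of the library. The block below is a minimal illustration, not ppforest2's implementation: the name oob_error_sketch, the in_bag and preds layouts, and the tie-breaking rule (smallest label wins) are all assumptions made for the example. It requires C++17 for std::optional.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Hypothetical inputs (not ppforest2 types):
//   in_bag[t][i] is true when observation i was in tree t's bootstrap sample.
//   preds[t][i]  is tree t's predicted label for observation i.
inline std::optional<double> oob_error_sketch(
    std::vector<std::vector<bool>> const &in_bag,
    std::vector<std::vector<int>> const &preds,
    std::vector<int> const &y) {
  assert(in_bag.size() == preds.size());
  std::size_t scored = 0;
  std::size_t wrong = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    // Collect votes only from trees that did not see observation i.
    std::map<int, int> votes;
    for (std::size_t t = 0; t < in_bag.size(); ++t) {
      if (!in_bag[t][i]) {
        ++votes[preds[t][i]];
      }
    }
    if (votes.empty()) {
      continue; // no OOB tree for this observation, so it is not scored
    }
    // Majority vote; ties go to the smallest label (an assumption).
    int best_label = votes.begin()->first;
    int best_count = votes.begin()->second;
    for (auto const &kv : votes) {
      if (kv.second > best_count) {
        best_label = kv.first;
        best_count = kv.second;
      }
    }
    ++scored;
    if (best_label != y[i]) {
      ++wrong;
    }
  }
  if (scored == 0) {
    return std::nullopt; // degenerate case: every observation is always in-bag
  }
  return static_cast<double>(wrong) / static_cast<double>(scored);
}
```

Returning std::nullopt rather than 0.0 or NaN keeps "no estimate available" distinct from "zero error", which is the behaviour the comment above attributes to oob_error.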

Serialisation

#include "io/IO.hpp"
// Save
auto j = serialization::to_json(forest);
io::json::write_file(j, "model.json");
// Load
auto j2 = io::json::read_file("model.json");
auto restored = serialization::forest_from_json(j2); // Forest is abstract, so do not hold it by value

Strategy pattern

Tree training is parameterised by seven pluggable strategies. Concrete implementations are composed at runtime via TrainingSpec, so new optimisation criteria, variable selection methods, or splitting rules can be added without changing the tree-building logic.

  • pp::ProjectionPursuit — projection pursuit index optimisation.
    Built-in: pp::PDA (Generalised LDA, with optional PDA penalty via lambda). lambda = 0 gives standard LDA; lambda in (0, 1] penalises the within-group covariance matrix.
  • vars::VariableSelection — variable subset selection.
    Built-in: vars::All (use all variables, for single trees), vars::Uniform (sample n_vars variables uniformly at random at each split, for forests).
  • cutpoint::Cutpoint — split cutpoint computation.
    Built-in: cutpoint::MeanOfMeans (midpoint of projected group means).
  • stop::StopRule — node stopping condition.
    Built-in: stop::PureNode (stop when the node contains a single group).
  • binarize::Binarization — multiclass → binary regrouping.
    Built-in: binarize::LargestGap (largest gap between projected group means).
  • grouping::Grouping — group partition management.
    Built-in: grouping::ByLabel (route all observations of a group to the same child).
  • leaf::LeafStrategy — leaf node creation.
    Built-in: leaf::MajorityVote (assign the majority class).
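
As an illustration of one built-in rule from the list above, the following sketch regroups classes into two supergroups at the largest gap between sorted projected group means. It is a standalone approximation of the idea behind binarize::LargestGap, not the library's code: RegroupSketch and largest_gap_sketch are hypothetical names, and the input is assumed to be one projected mean per group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: group indices assigned to each supergroup.
struct RegroupSketch {
  std::vector<std::size_t> lower;
  std::vector<std::size_t> upper;
};

// group_means[g] is the mean of group g after projection onto one dimension.
inline RegroupSketch largest_gap_sketch(std::vector<float> const &group_means) {
  assert(group_means.size() >= 2);
  // Order group indices by projected mean; stable_sort keeps ties deterministic.
  std::vector<std::size_t> order(group_means.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return group_means[a] < group_means[b];
                   });
  // Find the widest gap between neighbouring means; the first widest gap wins.
  std::size_t split = 1;
  float best_gap = group_means[order[1]] - group_means[order[0]];
  for (std::size_t k = 2; k < order.size(); ++k) {
    float const gap = group_means[order[k]] - group_means[order[k - 1]];
    if (gap > best_gap) {
      best_gap = gap;
      split = k;
    }
  }
  RegroupSketch out;
  auto const mid = order.begin() + static_cast<std::ptrdiff_t>(split);
  out.lower.assign(order.begin(), mid);
  out.upper.assign(mid, order.end());
  return out;
}
```

With two supergroups in hand, a mean-of-means cutpoint would then sit halfway between their projected means.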

TrainingSpec is a single concrete class that composes these strategies together with forest-level parameters (size, seed, threads, max retries):

using namespace ppforest2;
// Single tree with PDA:
TrainingSpec tree_spec(
    0, 0, 0, 3, // size (0 = single tree), seed, threads, max_retries
    pp::pda(0.5), // PDA with lambda = 0.5
    vars::all(), // use all variables
    cutpoint::mean_of_means()); // midpoint of group means
// Random forest:
TrainingSpec forest_spec(
    100, 0, 0, 3, // size, seed, threads, max_retries
    pp::pda(0.5), // PDA with lambda = 0.5
    vars::uniform(4), // sample 4 variables per split
    cutpoint::mean_of_means()); // midpoint of group means

Strategies are held via shared_ptr and are immutable after construction, so TrainingSpec can be freely copied and shared across trees without deep cloning. Each strategy implements to_json() for serialisation.
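
The sharing behaviour described above can be sketched with a toy spec. Everything in this block is hypothetical (CutpointSketch, VarSelectionSketch, SpecSketch, make_spec_sketch); it mirrors the shape of the design, not the real ppforest2 class definitions. The point is that copying the spec copies only shared pointers to const strategies.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical strategy interface: all methods are const, so an instance
// cannot change after construction.
struct CutpointSketch {
  using Ptr = std::shared_ptr<CutpointSketch const>;
  virtual ~CutpointSketch() = default;
  virtual float cut(float mean_a, float mean_b) const = 0;
};

struct MidpointSketch final : CutpointSketch {
  float cut(float mean_a, float mean_b) const override {
    return 0.5f * (mean_a + mean_b); // midpoint of two projected group means
  }
};

struct VarSelectionSketch {
  using Ptr = std::shared_ptr<VarSelectionSketch const>;
  virtual ~VarSelectionSketch() = default;
  virtual std::vector<int> select(int n_vars) const = 0;
};

struct AllVarsSketch final : VarSelectionSketch {
  std::vector<int> select(int n_vars) const override {
    assert(n_vars >= 0);
    std::vector<int> idx(static_cast<std::size_t>(n_vars));
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, ..., n_vars - 1
    return idx;
  }
};

// Hypothetical spec: plain parameters plus shared, immutable strategies.
struct SpecSketch {
  int size;
  int seed;
  VarSelectionSketch::Ptr vars;
  CutpointSketch::Ptr cutpoint;
};

inline SpecSketch make_spec_sketch(int size, int seed) {
  return SpecSketch{size, seed,
                    std::make_shared<AllVarsSketch>(),
                    std::make_shared<MidpointSketch>()};
}
```

Because the pointee types are const, a copy handed to each tree can safely alias the same strategy objects; no locking or cloning is needed in this sketch.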

For a step-by-step guide to implementing new strategies, see Extending: Custom Strategies.

Visitor pattern

Two visitor interfaces avoid dynamic_cast and keep traversal logic decoupled from the model types:

  • TreeNode::Visitor — dispatches over internal nodes (TreeBranch) and leaf nodes (TreeLeaf). Used by serialisation, visualisation, and variable importance.
  • Model::Visitor — dispatches over Tree and Forest. Used by the serialisation layer.

For a step-by-step guide to implementing new visitors, see Extending: Custom Visitors.
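
A minimal double-dispatch sketch of the node visitor idea follows. The types here (NodeSketch, BranchSketch, LeafSketch, NodeVisitorSketch, LeafCounter) are illustrative stand-ins and should not be read as the actual TreeNode::Visitor interface; the sketch only shows how traversal logic can live in the visitor without any dynamic_cast. It uses std::make_unique-compatible ownership and needs C++14 or later.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct BranchSketch;
struct LeafSketch;

// Hypothetical visitor interface: one overload per concrete node type.
struct NodeVisitorSketch {
  virtual ~NodeVisitorSketch() = default;
  virtual void visit(BranchSketch const &branch) = 0;
  virtual void visit(LeafSketch const &leaf) = 0;
};

struct NodeSketch {
  virtual ~NodeSketch() = default;
  // Each node calls back into the overload matching its own type.
  virtual void accept(NodeVisitorSketch &visitor) const = 0;
};

struct LeafSketch final : NodeSketch {
  int label;
  explicit LeafSketch(int label_) : label(label_) {}
  void accept(NodeVisitorSketch &visitor) const override { visitor.visit(*this); }
};

struct BranchSketch final : NodeSketch {
  std::unique_ptr<NodeSketch> lower;
  std::unique_ptr<NodeSketch> upper;
  BranchSketch(std::unique_ptr<NodeSketch> lo, std::unique_ptr<NodeSketch> up)
      : lower(std::move(lo)), upper(std::move(up)) {
    assert(lower && upper);
  }
  void accept(NodeVisitorSketch &visitor) const override { visitor.visit(*this); }
};

// Example visitor: counts leaves by recursing through branches.
struct LeafCounter final : NodeVisitorSketch {
  int leaves = 0;
  void visit(BranchSketch const &branch) override {
    branch.lower->accept(*this);
    branch.upper->accept(*this);
  }
  void visit(LeafSketch const &) override { ++leaves; }
};
```

A new traversal (for example a printer or a serialiser) would be another NodeVisitorSketch subclass; the node classes stay unchanged.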

Variable importance

Three measures quantify each variable's contribution to predictions:

  • VI1 (permuted): For each tree, measures the degradation in OOB fit when each variable is randomly permuted among the OOB observations. Classification uses the drop in accuracy; regression uses the increase in NMSE (MSE increase normalised by Var(y_oob)). The result is averaged over all trees. See vi_permuted() in models/Evaluation.hpp.
  • VI2 (projections): Accumulates |a_j| / G_s at every split node, where a_j is the projection coefficient of variable j and G_s is the number of groups at split s. Optionally scaled by per-variable standard deviations. Mode-agnostic. See the vi_projections() overloads for Forest and Tree.
  • VI3 (weighted projections): Each tree's contribution is weighted by a per-tree OOB quality score (classification: 1 - error_rate; regression: max(0, 1 - NMSE)), and each split contributes I_s × |a_j|. See vi_weighted_projections().
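
To make the VI2 accumulation concrete, here is a small sketch for a single tree. It assumes the splits have already been collected into a flat list; SplitSketch and vi_projections_sketch are hypothetical names, the optional standard-deviation scaling is left out, and no normalisation across splits or trees is applied.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record of one split node.
struct SplitSketch {
  std::vector<float> projection; // coefficients a_1 .. a_p
  int n_groups;                  // G_s, number of groups reaching the split
};

// Sum |a_j| / G_s over all splits, one total per variable.
inline std::vector<float> vi_projections_sketch(
    std::vector<SplitSketch> const &splits, std::size_t n_vars) {
  std::vector<float> vi(n_vars, 0.0f);
  for (auto const &s : splits) {
    assert(s.projection.size() == n_vars);
    assert(s.n_groups > 0);
    for (std::size_t j = 0; j < n_vars; ++j) {
      vi[j] += std::fabs(s.projection[j]) / static_cast<float>(s.n_groups);
    }
  }
  return vi;
}
```

Dividing by G_s gives splits that separate many groups a smaller per-split contribution for the same coefficient magnitude.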

Numeric precision

The core uses single-precision (float) arithmetic for all feature data. This is sufficient for classification and reduces memory usage. If a strategy needs higher precision internally (e.g. for regression loss computation), it can cast to double within its own scope. See types::Feature.
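
The "cast to double within its own scope" idea might look like the sketch below: inputs stay in float storage while the accumulator is double. The function name mse_sketch is hypothetical, and plain std::vector<float> stands in for the library's Eigen-based types.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean squared error over float data, accumulated in double precision.
inline double mse_sketch(std::vector<float> const &y,
                         std::vector<float> const &pred) {
  assert(y.size() == pred.size());
  assert(!y.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    // Widen each value before subtracting so the residual is computed in double.
    double const r = static_cast<double>(y[i]) - static_cast<double>(pred[i]);
    sum += r * r;
  }
  return sum / static_cast<double>(y.size());
}
```

Only the local accumulator pays the cost of double precision; the feature matrices keep their single-precision memory footprint.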

Namespace guide

Namespace                 Purpose
ppforest2                 Core model types: Tree, Forest, TrainingSpec, VariableImportance
ppforest2::types          Numeric type aliases (Feature, Outcome, FeatureMatrix, ...)
ppforest2::stats          Statistical infrastructure: RNG, GroupPartition, Uniform, ConfusionMatrix
ppforest2::pp             Projection pursuit strategies
ppforest2::vars           Variable selection strategies
ppforest2::cutpoint       Split cutpoint strategies
ppforest2::stop           Stopping rule strategies
ppforest2::binarize       Multiclass binarization strategies
ppforest2::grouping       Group partition strategies
ppforest2::leaf           Leaf node creation strategies
ppforest2::serialization  JSON serialization and deserialization
ppforest2::viz            Visualisation visitors for tree structure and decision boundaries
ppforest2::io             File I/O (CSV, JSON) and presentation utilities
ppforest2::io::style      ANSI-aware colored terminal output
ppforest2::io::layout     Column-driven table formatting
ppforest2::math           Numeric comparison utilities
ppforest2::cli            Command-line interface parsing and subcommands
ppforest2::sys            System utilities (memory measurement)

Reproducibility

Results are identical across platforms (Ubuntu/GCC, macOS/Clang, Windows/MinGW) for the same seed. This is enforced by:

  • Using pcg32 exclusively (never std::mt19937).
  • Using Lemire's rejection method for unbiased random integers (never std::uniform_int_distribution).
  • Using Fisher-Yates shuffle via stats::Uniform::distinct() (never std::shuffle).
  • Using std::stable_sort where element order affects downstream results.
  • Verifying against golden reference files on every platform in CI.
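
The first three points can be illustrated together. The sketch below pairs the published minimal PCG32 step function with Lemire's multiply-and-reject bounded integer method and a Fisher-Yates shuffle. The names Pcg32Sketch and shuffle_sketch are hypothetical, and the seeding and stream choices are assumptions, so its output should not be expected to match ppforest2's stats::RNG or stats::Uniform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class Pcg32Sketch {
public:
  explicit Pcg32Sketch(std::uint64_t seed, std::uint64_t stream = 1u)
      : state_(0u), inc_((stream << 1u) | 1u) {
    next();
    state_ += seed;
    next();
  }

  // One PCG32 (XSH-RR) step: 64-bit LCG state, 32-bit permuted output.
  std::uint32_t next() {
    std::uint64_t const old = state_;
    state_ = old * 6364136223846793005ULL + inc_;
    auto const xorshifted = static_cast<std::uint32_t>(((old >> 18u) ^ old) >> 27u);
    auto const rot = static_cast<std::uint32_t>(old >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
  }

  // Lemire's method: unbiased integer in [0, range) using a 64-bit multiply
  // and rejection of the small biased region.
  std::uint32_t bounded(std::uint32_t range) {
    assert(range > 0u);
    std::uint64_t m = static_cast<std::uint64_t>(next()) * range;
    auto low = static_cast<std::uint32_t>(m);
    if (low < range) {
      std::uint32_t const threshold = (0u - range) % range; // 2^32 mod range
      while (low < threshold) {
        m = static_cast<std::uint64_t>(next()) * range;
        low = static_cast<std::uint32_t>(m);
      }
    }
    return static_cast<std::uint32_t>(m >> 32u);
  }

private:
  std::uint64_t state_;
  std::uint64_t inc_;
};

// Fisher-Yates shuffle driven only by the generator above.
template <typename T>
void shuffle_sketch(std::vector<T> &v, Pcg32Sketch &rng) {
  for (std::size_t i = v.size(); i > 1; --i) {
    std::size_t const j = rng.bounded(static_cast<std::uint32_t>(i));
    std::swap(v[i - 1], v[j]);
  }
}
```

Every operation here is fixed-width integer arithmetic with fully specified behaviour, which is why this style of generator and shuffle gives the same sequence on any conforming compiler, unlike the standard library distributions and std::shuffle, whose algorithms are implementation-defined.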

See also

  • GitHub repository
  • R package documentation