ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
ppforest2 — C++ API Reference

ppforest2 is a fast, memory-efficient implementation of Projection Pursuit Random Forests, built on Projection Pursuit (oblique) Decision Trees. By learning linear projections at each split, the model captures complex structure that axis-aligned trees often miss, without sacrificing interpretability or scalability.

The C++ core is the single source of truth for the implementation. Language bindings wrap the core via thin interface layers; an R package is available and Python bindings are planned.

Overview

The library is organised around a small set of core abstractions:

Abstraction         Header                         Purpose
Tree                models/Tree.hpp                A single projection pursuit decision tree
Forest              models/Forest.hpp              An ensemble of bootstrapped trees with OOB error estimation
TrainingSpec        models/TrainingSpec.hpp        Composes a PP, DR, and SR strategy into a training configuration
PPStrategy          models/PPStrategy.hpp          Projection pursuit index optimisation (e.g. PDA)
DRStrategy          models/DRStrategy.hpp          Dimensionality reduction / variable selection
SRStrategy          models/SRStrategy.hpp          Split threshold rule
GroupPartition      stats/GroupPartition.hpp       Contiguous-block representation of grouped observations
VariableImportance  models/VariableImportance.hpp  Three variable importance measures (permuted, projections, weighted)

Quick start

Training a single tree

#include "ppforest2.hpp"

// Load data — the last column of the CSV is the response; labels are encoded as integers.
auto data = io::csv::read_sorted("iris.csv");

// Compose strategies into a training spec:
TrainingSpec spec(
    pp::pda(0.0),          // LDA (no penalty)
    dr::noop(),            // use all variables
    sr::mean_of_means());  // midpoint split rule

// Train via the unified entry point:
auto model = Model::train(spec, data.x, data.y);

// Or train directly:
stats::RNG rng(0);
Tree tree = Tree::train(spec, data.x, data.y, rng);

// Predict a single observation or a whole matrix:
types::Response label = tree.predict(data.x.row(0));
types::ResponseVector preds = tree.predict(data.x);

Training a random forest

#include "ppforest2.hpp"

// PDA with uniform random variable selection:
TrainingSpec spec(
    pp::pda(0.0),          // LDA
    dr::uniform(3),        // sample 3 variables per split
    sr::mean_of_means(),   // midpoint split rule
    500, 0);               // size, seed

Forest forest = Forest::train(spec, x, y);
types::ResponseVector preds = forest.predict(x_test);

// Out-of-bag error — for each observation, uses only the trees whose bootstrap
// sample did not contain it.
double oob = forest.oob_error(x, y);

// Vote proportions — (n x G) matrix, rows sum to 1.
types::FeatureMatrix probs = forest.predict(x_test, Proportions{});

// Variable importance (three measures).
auto vi_perm = variable_importance_permuted(forest, x, y, 0); // seed
auto vi_proj = variable_importance_projections(forest, x.cols());
auto vi_wt   = variable_importance_weighted_projections(forest, x, y);

Serialisation

#include "io/IO.hpp"

// Save
auto j = serialization::to_json(forest);
io::json::write_file(j, "model.json");

// Load
auto j2 = io::json::read_file("model.json");
Forest restored = serialization::forest_from_json(j2);

See io/IO.hpp for the file I/O utilities (JSON and CSV reading and writing).

Strategy pattern

Tree training is parameterised by three pluggable strategies. Concrete implementations are composed at runtime via TrainingSpec, so new optimisation criteria or variable selection methods can be added without changing the tree-building logic.

  • PPStrategy — projection pursuit index optimisation.
    Built-in: PPPDAStrategy (Generalised LDA, with optional PDA penalty via lambda). lambda = 0 gives standard LDA; lambda in (0, 1] penalises the within-group covariance matrix.
  • DRStrategy — dimensionality reduction (variable subset selection).
    Built-in: DRNoopStrategy (use all variables, for single trees), DRUniformStrategy (sample n_vars uniformly at random, for forests).
  • SRStrategy — split threshold computation.
    Built-in: SRMeanOfMeansStrategy (midpoint of projected group means).
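For intuition, the idea behind the mean-of-means rule can be sketched in a few lines of self-contained code (hypothetical names, not the library's SRMeanOfMeansStrategy): average the projected values within each group, then take the mean of those group means as the split threshold.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of a mean-of-means split rule: the threshold is the average of the
// per-group means of the projected values. Illustrative only.
double mean_of_means_threshold(const std::vector<double>& projected,
                               const std::vector<int>& groups) {
    std::unordered_map<int, std::pair<double, std::size_t>> acc; // sum, count
    for (std::size_t i = 0; i < projected.size(); ++i) {
        auto& [sum, count] = acc[groups[i]];
        sum += projected[i];
        ++count;
    }
    double total = 0.0;
    for (const auto& [group, sc] : acc)
        total += sc.first / static_cast<double>(sc.second);
    return total / static_cast<double>(acc.size());
}
```

With two groups this reduces to the midpoint between the two projected group means.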

TrainingSpec is a single concrete class that composes these strategies together with forest-level parameters (size, seed, threads, max retries):

// Single tree with PDA:
TrainingSpec tree_spec(pp::pda(0.5), dr::noop(), sr::mean_of_means());

// Random forest:
TrainingSpec forest_spec(
    pp::pda(0.5),          // PDA with lambda = 0.5
    dr::uniform(4),        // sample 4 variables per split
    sr::mean_of_means(),   // midpoint of group means
    100, 0);               // size, seed

Strategies are held via shared_ptr and are immutable after construction, so TrainingSpec can be freely copied and shared across trees without deep cloning. Each strategy implements to_json() for serialisation.

For a step-by-step guide to implementing new strategies, see Extending: Custom Strategies.

Visitor pattern

Two visitor interfaces avoid dynamic_cast and keep traversal logic decoupled from the model types:

  • TreeNode::Visitor — dispatches over internal nodes (TreeCondition) and leaf nodes (TreeResponse). Used by serialisation, visualisation, and variable importance.
  • Model::Visitor — dispatches over Tree and Forest. Used by the serialisation layer.
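The double-dispatch shape these interfaces follow can be sketched as below (all names are hypothetical and simplified from the real TreeNode hierarchy; the point is that traversal needs no dynamic_cast):

```cpp
#include <cassert>
#include <memory>

struct Condition;
struct Response;

// Visitor interface: one overload per concrete node type.
struct NodeVisitor {
    virtual ~NodeVisitor() = default;
    virtual void visit(const Condition& node) = 0;
    virtual void visit(const Response& node) = 0;
};

// Node base: each concrete node dispatches to the matching overload.
struct Node {
    virtual ~Node() = default;
    virtual void accept(NodeVisitor& v) const = 0;
};

struct Condition : Node {
    std::unique_ptr<Node> lower, upper; // child subtrees
    void accept(NodeVisitor& v) const override { v.visit(*this); }
};

struct Response : Node {
    int label = 0; // predicted class at this leaf
    void accept(NodeVisitor& v) const override { v.visit(*this); }
};

// Example visitor: count the leaves of a tree by recursing through conditions.
struct LeafCounter : NodeVisitor {
    int leaves = 0;
    void visit(const Condition& node) override {
        node.lower->accept(*this);
        node.upper->accept(*this);
    }
    void visit(const Response&) override { ++leaves; }
};
```

Serialisation, visualisation, and variable importance can all be expressed as visitors of this shape without touching the node classes.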

For a step-by-step guide to implementing new visitors, see Extending: Custom Visitors.

Variable importance

Three measures quantify each variable's contribution to predictions:

  • VI1 (permuted) — For each tree, measures the drop in OOB accuracy when each variable is randomly permuted among the OOB observations. Averaged over all trees. See variable_importance_permuted().
  • VI2 (projections) — Accumulates |a_j| / G_s at every split node, where a_j is the projection coefficient and G_s is the number of groups. Optionally scaled by per-variable standard deviations. See variable_importance_projections().
  • VI3 (weighted projections) — Each tree's contribution is weighted by (1 - OOB error), and each split contributes I_s x |a_j|. See variable_importance_weighted_projections().
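The accumulation behind VI2 can be sketched as follows (hypothetical types, not the library's implementation): each split node contributes |a_j| / G_s to variable j.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One split node: its projection vector a and the number of groups G_s it sees.
struct SplitInfo {
    std::vector<double> coefficients; // projection coefficients a
    std::size_t n_groups;             // G_s
};

// VI2 sketch: sum |a_j| / G_s over all split nodes, per variable.
std::vector<double> vi_projections(const std::vector<SplitInfo>& splits,
                                   std::size_t n_vars) {
    std::vector<double> importance(n_vars, 0.0);
    for (const auto& s : splits)
        for (std::size_t j = 0; j < n_vars; ++j)
            importance[j] +=
                std::fabs(s.coefficients[j]) / static_cast<double>(s.n_groups);
    return importance;
}
```

The optional scaling by per-variable standard deviations would be applied to the returned vector afterwards.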

Numeric precision

The core uses single-precision (float) arithmetic by default. Compile with -DPPFOREST2_DOUBLE_PRECISION=ON to switch to double throughout. See types::Feature.
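The switch can be pictured as a compile-time alias (a sketch only; the actual definition lives in the library's types header):

```cpp
// Hypothetical sketch of the precision switch wired to the CMake option.
#ifdef PPFOREST2_DOUBLE_PRECISION
using Feature = double;
#else
using Feature = float; // default: single precision
#endif

static_assert(sizeof(Feature) == sizeof(float) || sizeof(Feature) == sizeof(double),
              "Feature must be a standard floating-point type");
```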

Namespace guide

Namespace                  Purpose
ppforest2                  Core model types: Tree, Forest, TrainingSpec, VariableImportance
ppforest2::types           Numeric type aliases (Feature, Response, FeatureMatrix, ...)
ppforest2::stats           Statistical infrastructure: RNG, GroupPartition, Uniform, ConfusionMatrix
ppforest2::pp              Projection pursuit strategies
ppforest2::dr              Dimensionality reduction strategies
ppforest2::sr              Split rule strategies
ppforest2::serialization   JSON serialisation and deserialisation
ppforest2::viz             Visualisation visitors for tree structure and decision boundaries
ppforest2::io              File I/O (CSV, JSON) and presentation utilities
ppforest2::io::style       ANSI-aware coloured terminal output
ppforest2::io::layout      Column-driven table formatting
ppforest2::math            Numeric comparison utilities
ppforest2::cli             Command-line interface parsing and subcommands
ppforest2::sys             System utilities (memory measurement)

Reproducibility

Results are identical across platforms (Ubuntu/GCC, macOS/Clang, Windows/MinGW) for the same seed. This is enforced by:

  • Using pcg32 exclusively (never std::mt19937).
  • Using Lemire's rejection method for unbiased random integers (never std::uniform_int_distribution).
  • Using Fisher-Yates shuffle via stats::Uniform::distinct() (never std::shuffle).
  • Using std::stable_sort where element order affects downstream results.
  • Verifying against golden reference files on every platform in CI.
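Lemire's method, for reference, maps one 32-bit draw into [0, range) with a multiply-shift and rejects only the small biased region. A self-contained sketch (the generator is passed in as a parameter here, whereas the library uses pcg32 internally):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Lemire's rejection method: unbiased integers in [0, range) from a 32-bit
// generator, without std::uniform_int_distribution (whose output varies
// across standard library implementations).
std::uint32_t bounded_rand(std::function<std::uint32_t()> next,
                           std::uint32_t range) {
    std::uint64_t m = static_cast<std::uint64_t>(next()) * range;
    std::uint32_t low = static_cast<std::uint32_t>(m);
    if (low < range) {
        // threshold = (2^32 - range) mod range, via unsigned wraparound
        std::uint32_t threshold = (0u - range) % range;
        while (low < threshold) { // reject draws from the biased region
            m = static_cast<std::uint64_t>(next()) * range;
            low = static_cast<std::uint32_t>(m);
        }
    }
    return static_cast<std::uint32_t>(m >> 32);
}
```

Because the multiply-shift is defined purely on fixed-width integers, the same generator state yields the same bounded draw on every platform.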
See also

  • GitHub repository
  • R package documentation