ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
ppforest2::stats Namespace Reference

Statistical infrastructure for training and evaluation. More...

Namespaces

namespace  simulation
 

Classes

struct  ClassificationMetrics
 Classification evaluation metrics. More...
 
struct  ConfusionMatrix
 A confusion matrix comparing predicted vs actual group labels. More...
 
struct  DataPacket
 Bundled dataset: features, response, and group labels. More...
 
class  GroupPartition
 Contiguous-block representation of grouped observations. More...
 
class  Normal
 Normal (Gaussian) random number generator. More...
 
struct  RegressionMetrics
 Regression evaluation metrics. More...
 
struct  Split
 Indices for a train/test split. More...
 
class  Uniform
 Discrete uniform random integer generator over [min, max]. More...
 

Typedefs

using Metrics = std::variant<ClassificationMetrics, RegressionMetrics>
 Mode-polymorphic metrics block.
 
using RNG = pcg32
 

Functions

float accuracy (types::OutcomeVector const &predictions, types::GroupIdVector const &actual)
 Fraction of predictions matching the ground-truth class label.
 
double error_rate (types::OutcomeVector const &predictions, types::GroupIdVector const &actual)
 Misclassification rate — 1 - accuracy.
 
double error_rate (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
 Convenience overload: float-typed labels (cast to GroupId locally).
 
std::map< int, int > get_labels_map (types::GroupIdVector const &y_pred, types::GroupIdVector const &y)
 Build a sorted mapping from unique group labels to contiguous indices.
 
std::map< types::GroupId, int > group_indices (std::set< types::GroupId > const &groups)
 Map each label in groups to its index in iteration order.
 
double mae (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
 Mean absolute error.
 
types::Outcome majority_vote (std::vector< types::Outcome > const &preds)
 Majority vote over a sequence of integer-coded class labels.
 
types::Outcome mean (std::vector< types::Outcome > const &preds)
 Arithmetic mean of a sequence of outcome values.
 
Metrics metrics_from_outcomes (types::OutcomeVector const &y_pred, types::OutcomeVector const &y, types::Mode mode)
 Build a mode-appropriate Metrics variant from in-memory tensors.
 
double mse (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
 Mean squared error.
 
double r_squared (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
 Coefficient of determination (R²). Returns 0 when total variance is 0.
 
template<typename Derived>
double sd (Eigen::MatrixBase< Derived > const &data)
 Sample standard deviation of a vector — sqrt(var(data)).
 
types::FeatureVector sd (types::FeatureMatrix const &data)
 Column-wise sample standard deviation — element-wise sqrt of var.
 
DataPacket simulate (int n, int p, int G, RNG &rng, simulation::params::Classification const &params={})
 Generate a simulated classification dataset.
 
DataPacket simulate (int n, int p, RNG &rng, simulation::params::Regression const &params={})
 Generate a simulated regression dataset.
 
template<typename Y>
void sort (types::FeatureMatrix &x, Y &y)
 Sort a feature matrix and a response vector by the response values.
 
Split split (DataPacket const &data, float train_ratio, RNG &rng)
 Perform a stratified random train/test split on a DataPacket.
 
std::set< types::GroupId > unique (types::GroupIdVector const &column)
 Unique group labels in a response vector.
 
template<typename Derived>
double var (Eigen::MatrixBase< Derived > const &data)
 Sample variance of a vector (unbiased, n-1 denominator).
 
types::FeatureVector var (types::FeatureMatrix const &data)
 Column-wise sample variance of a matrix.
 

Detailed Description

Statistical infrastructure for training and evaluation.

Provides the random number generator (pcg32), discrete uniform sampling (Lemire's method), grouped-observation bookkeeping (GroupPartition), confusion matrices, data simulation, and basic descriptive statistics used throughout the training pipeline.
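As an illustration of the discrete uniform sampling mentioned above, here is a minimal sketch of Lemire's multiply-and-reject method for drawing an integer in [min, max] without modulo bias. It is not the library's Uniform class: the names bounded_sketch and uniform_int_sketch are hypothetical, and the generator is taken as a template parameter assumed to yield 32-bit values (std::mt19937 stands in for pcg32 in a quick test).

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: unbiased integer in [0, range) via Lemire's method.
// Assumes gen() yields uniformly distributed 32-bit values and range > 0.
template <typename Gen>
std::uint32_t bounded_sketch(Gen& gen, std::uint32_t range) {
  std::uint64_t m =
      static_cast<std::uint64_t>(static_cast<std::uint32_t>(gen())) * range;
  std::uint32_t low = static_cast<std::uint32_t>(m);
  if (low < range) {
    // threshold = 2^32 mod range; draws whose low word falls below it are rejected.
    std::uint32_t const threshold = (0u - range) % range;
    while (low < threshold) {
      m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(gen())) * range;
      low = static_cast<std::uint32_t>(m);
    }
  }
  // The high 32 bits of the product are the sample.
  return static_cast<std::uint32_t>(m >> 32);
}

// Hypothetical wrapper: integer in the closed interval [min, max].
// Assumes min <= max and that max - min + 1 fits in 32 bits.
template <typename Gen>
int uniform_int_sketch(Gen& gen, int min, int max) {
  std::uint32_t const range = static_cast<std::uint32_t>(max - min) + 1u;
  return min + static_cast<int>(bounded_sketch(gen, range));
}
```

The rejection loop runs only when the low word of the product is smaller than the range, so most draws cost one multiplication and no division.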

Typedef Documentation

◆ Metrics

Mode-polymorphic metrics block.

Carries ClassificationMetrics for classification models and RegressionMetrics for regression models. Consumers that need the underlying scalar or matrix can std::visit the variant instead of branching on which JSON key (or strategy mode) is present.
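The visiting pattern described above might look like the following sketch. The two structs are hypothetical stand-ins with a single made-up field each, since the real members of ClassificationMetrics and RegressionMetrics are not listed on this page; summarize is likewise an illustrative name.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-ins: the real structs' fields are not shown on this page.
struct ClassificationMetrics { double error_rate; };
struct RegressionMetrics { double mse; };

using Metrics = std::variant<ClassificationMetrics, RegressionMetrics>;

// Produce a one-line summary; the caller never inspects the mode itself.
inline std::string summarize(Metrics const& m) {
  return std::visit(
      [](auto const& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, ClassificationMetrics>) {
          return "error_rate=" + std::to_string(v.error_rate);
        } else {
          return "mse=" + std::to_string(v.mse);
        }
      },
      m);
}
```

Because the dispatch is resolved on the alternative's type, adding a third metrics type to the variant would turn a forgotten branch into a visible wrong result in the else arm, so a production visitor would more likely use one overload per alternative.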

◆ RNG

using ppforest2::stats::RNG = pcg32

Function Documentation

◆ accuracy()

float ppforest2::stats::accuracy ( types::OutcomeVector const & predictions,
types::GroupIdVector const & actual )

Fraction of predictions matching the ground-truth class label.

Returns
Accuracy in [0, 1].

◆ error_rate() [1/2]

double ppforest2::stats::error_rate ( types::OutcomeVector const & predictions,
types::GroupIdVector const & actual )

Misclassification rate — 1 - accuracy.

Returns
Error rate in [0, 1].
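The relationship between the two metrics can be sketched as follows. This is an assumption-laden illustration, not the library code: std::vector replaces the types:: vector aliases, Outcome is taken to be float and GroupId to be int, the _sketch names are hypothetical, and the accuracy sketch returns double for simplicity where the documented function returns float.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Outcome = float;  // assumption: outcomes are float-coded
using GroupId = int;    // assumption: group ids are ints

// Fraction of predictions whose integer-cast label equals the true label.
inline double accuracy_sketch(std::vector<Outcome> const& predictions,
                              std::vector<GroupId> const& actual) {
  if (predictions.size() != actual.size() || actual.empty()) {
    throw std::invalid_argument("inputs must be non-empty and equally sized");
  }
  std::size_t hits = 0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (static_cast<GroupId>(predictions[i]) == actual[i]) ++hits;
  }
  return static_cast<double>(hits) / static_cast<double>(actual.size());
}

// Misclassification rate, defined as the complement of accuracy.
inline double error_rate_sketch(std::vector<Outcome> const& predictions,
                                std::vector<GroupId> const& actual) {
  return 1.0 - accuracy_sketch(predictions, actual);
}
```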

◆ error_rate() [2/2]

double ppforest2::stats::error_rate ( types::OutcomeVector const & predictions,
types::OutcomeVector const & actual )
inline

Convenience overload: float-typed labels (cast to GroupId locally).

Used by the unified training pipeline where y is carried as OutcomeVector for both classification and regression.

◆ get_labels_map()

std::map< int, int > ppforest2::stats::get_labels_map ( types::GroupIdVector const & y_pred,
types::GroupIdVector const & y )

Build a sorted mapping from unique group labels to contiguous indices.

Includes labels from both the predictions and the actual values, so the confusion matrix has cells for every class that occurs in either vector: classes the model predicted that never appear in the ground truth, and ground-truth classes the model never predicted.

Returns
A map from label value to its 0-based index.
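A minimal sketch of such a mapping, assuming plain std::vector<int> inputs in place of GroupIdVector; labels_map_sketch is a hypothetical name, and the real implementation may differ.

```cpp
#include <map>
#include <set>
#include <vector>

// Map every label seen in either vector to a 0-based index, in ascending label order.
inline std::map<int, int> labels_map_sketch(std::vector<int> const& y_pred,
                                            std::vector<int> const& y) {
  // std::set deduplicates and keeps the union of labels sorted.
  std::set<int> labels(y_pred.begin(), y_pred.end());
  labels.insert(y.begin(), y.end());

  std::map<int, int> index;
  int next = 0;
  for (int label : labels) index[label] = next++;
  return index;
}
```

With predictions {5, 2} and actual labels {2, 9}, the union is {2, 5, 9}, so label 5 still gets a row and column even though it never occurs in the ground truth.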

◆ group_indices()

std::map< types::GroupId, int > ppforest2::stats::group_indices ( std::set< types::GroupId > const & groups)

Map each label in groups to its index in iteration order.

Used by predict(FeatureMatrix, Proportions) on ClassificationTree and ClassificationForest to assign each group a column in the proportions matrix. std::set iterates in ascending key order, so the resulting indices are sorted by group label.

◆ mae()

double ppforest2::stats::mae ( types::OutcomeVector const & predictions,
types::OutcomeVector const & actual )

Mean absolute error.

◆ majority_vote()

types::Outcome ppforest2::stats::majority_vote ( std::vector< types::Outcome > const & preds)

Majority vote over a sequence of integer-coded class labels.

Returns the label with the largest count. Ties are broken by the smallest GroupId: std::map iterates in ascending key order and the strict > comparison keeps the earliest winner.

Used for classification ensemble aggregation: per-row OOB voting (models/Evaluation.cpp) and per-row in-bag voting (ClassificationForest::predict). Throws on empty input.
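The tie-breaking rule can be illustrated with the sketch below. It assumes Outcome is float and GroupId is int, uses the hypothetical name majority_vote_sketch, and picks std::invalid_argument as the exception type, which this page does not specify.

```cpp
#include <map>
#include <stdexcept>
#include <vector>

using Outcome = float;  // assumption: labels arrive float-coded
using GroupId = int;    // assumption: group ids are ints

inline Outcome majority_vote_sketch(std::vector<Outcome> const& preds) {
  if (preds.empty()) throw std::invalid_argument("majority vote of empty input");

  // Count votes per label; std::map keeps labels in ascending order.
  std::map<GroupId, int> counts;
  for (Outcome p : preds) ++counts[static_cast<GroupId>(p)];

  GroupId best = counts.begin()->first;
  int best_count = 0;
  for (auto const& entry : counts) {
    // Strict '>' means a later label with an equal count never replaces the winner.
    if (entry.second > best_count) {
      best = entry.first;
      best_count = entry.second;
    }
  }
  return static_cast<Outcome>(best);
}
```

For the votes {2, 1, 2, 1} both labels have two votes, and the sketch returns 1 because label 1 is visited first.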

◆ mean()

types::Outcome ppforest2::stats::mean ( std::vector< types::Outcome > const & preds)

Arithmetic mean of a sequence of outcome values.

Used for regression ensemble aggregation: per-row OOB averaging (models/Evaluation.cpp) and per-row in-bag averaging (RegressionForest::predict). The sum is accumulated in double to avoid loss of precision when many float outcomes are added. Throws on empty input.
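A minimal sketch of the double-accumulation idea, assuming Outcome is float; mean_sketch is a hypothetical name and the exception type is an assumption.

```cpp
#include <stdexcept>
#include <vector>

using Outcome = float;  // assumption: outcomes are float

inline Outcome mean_sketch(std::vector<Outcome> const& preds) {
  if (preds.empty()) throw std::invalid_argument("mean of empty input");

  // Accumulate in double so rounding error does not grow with the number of floats.
  double sum = 0.0;
  for (Outcome p : preds) sum += static_cast<double>(p);

  // Narrow back to the outcome type only once, at the end.
  return static_cast<Outcome>(sum / static_cast<double>(preds.size()));
}
```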

◆ metrics_from_outcomes()

Metrics ppforest2::stats::metrics_from_outcomes ( types::OutcomeVector const & y_pred,
types::OutcomeVector const & y,
types::Mode mode )

Build a mode-appropriate Metrics variant from in-memory tensors.

Sibling to serialization::metrics_from_json — same variant, different source. Lets callers compute metrics once and pass the variant to polymorphic consumers (e.g. print_metrics_block).

◆ mse()

double ppforest2::stats::mse ( types::OutcomeVector const & predictions,
types::OutcomeVector const & actual )

Mean squared error.

◆ r_squared()

double ppforest2::stats::r_squared ( types::OutcomeVector const & predictions,
types::OutcomeVector const & actual )

Coefficient of determination (R²). Returns 0 when total variance is 0.
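The zero-variance guard can be sketched as follows, assuming the standard definition R² = 1 - SS_res / SS_tot and std::vector<float> in place of OutcomeVector. The name r_squared_sketch is hypothetical and the library's actual arithmetic may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

inline double r_squared_sketch(std::vector<float> const& predictions,
                               std::vector<float> const& actual) {
  if (predictions.size() != actual.size() || actual.empty()) {
    throw std::invalid_argument("inputs must be non-empty and equally sized");
  }
  double mean = 0.0;
  for (float a : actual) mean += a;
  mean /= static_cast<double>(actual.size());

  double ss_res = 0.0;  // sum of squared residuals
  double ss_tot = 0.0;  // sum of squared deviations from the mean
  for (std::size_t i = 0; i < actual.size(); ++i) {
    double const r = static_cast<double>(actual[i]) - predictions[i];
    double const d = static_cast<double>(actual[i]) - mean;
    ss_res += r * r;
    ss_tot += d * d;
  }
  // A constant response has no variance to explain; return 0 instead of dividing by 0.
  if (ss_tot == 0.0) return 0.0;
  return 1.0 - ss_res / ss_tot;
}
```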

◆ sd() [1/2]

template<typename Derived>
double ppforest2::stats::sd ( Eigen::MatrixBase< Derived > const & data)

Sample standard deviation of a vector — sqrt(var(data)).

◆ sd() [2/2]

types::FeatureVector ppforest2::stats::sd ( types::FeatureMatrix const & data)

Column-wise sample standard deviation — element-wise sqrt of var.

◆ simulate() [1/2]

DataPacket ppforest2::stats::simulate ( int n,
int p,
int G,
RNG & rng,
simulation::params::Classification const & params = {} )

Generate a simulated classification dataset.

Rows are sorted by group label.

◆ simulate() [2/2]

DataPacket ppforest2::stats::simulate ( int n,
int p,
RNG & rng,
simulation::params::Regression const & params = {} )

Generate a simulated regression dataset.

Rows are sorted by the continuous response. Distinguished from the classification overload by the absence of a group-count parameter.

◆ sort()

template<typename Y>
void ppforest2::stats::sort ( types::FeatureMatrix & x,
Y & y )

Sort a feature matrix and a response vector by the response values.

Templated on the response vector type so callers can sort by integer group labels (GroupIdVector) or continuous responses (OutcomeVector).
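One way to sort rows and responses together is to sort a permutation of row indices and then apply it to both containers. The sketch below assumes a row-major std::vector of rows in place of FeatureMatrix and uses std::stable_sort, although this page does not say whether the real sort is stable; sort_sketch is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

template <typename Y>
void sort_sketch(std::vector<std::vector<float>>& x, std::vector<Y>& y) {
  // Build the identity permutation, then order it by the response values.
  std::vector<std::size_t> order(y.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });

  // Apply the same permutation to the rows and to the response.
  std::vector<std::vector<float>> x_sorted;
  std::vector<Y> y_sorted;
  x_sorted.reserve(x.size());
  y_sorted.reserve(y.size());
  for (std::size_t i : order) {
    x_sorted.push_back(std::move(x[i]));
    y_sorted.push_back(y[i]);
  }
  x = std::move(x_sorted);
  y = std::move(y_sorted);
}
```

The same template instantiates for int labels and float responses, which mirrors the reason given above for templating on the response type.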

◆ split()

Split ppforest2::stats::split ( DataPacket const & data,
float train_ratio,
RNG & rng )

Perform a stratified random train/test split on a DataPacket.

Samples indices within each group proportional to train_ratio so that group balance is preserved in both train and test sets.

Parameters
data  The full dataset.
train_ratio  Proportion of data to use for training, in the interval (0, 1).
rng  Random number generator.
Returns
A Split containing train and test index vectors.
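A stratified split can be sketched by shuffling the indices of each group separately and cutting each group at the same ratio. Everything here is illustrative: SplitSketch and stratified_split_sketch are hypothetical names, the input is a bare label vector instead of a DataPacket, std::mt19937 stands in for the RNG typedef, and rounding the per-group train count to the nearest integer is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <random>
#include <stdexcept>
#include <vector>

struct SplitSketch {
  std::vector<int> train;  // row indices assigned to training
  std::vector<int> test;   // row indices assigned to testing
};

inline SplitSketch stratified_split_sketch(std::vector<int> const& groups,
                                           float train_ratio,
                                           std::mt19937& rng) {
  if (!(train_ratio > 0.0f && train_ratio < 1.0f)) {
    throw std::invalid_argument("train_ratio must lie in (0, 1)");
  }
  // Collect the row indices of each group.
  std::map<int, std::vector<int>> by_group;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    by_group[groups[i]].push_back(static_cast<int>(i));
  }
  SplitSketch out;
  for (auto& entry : by_group) {
    std::vector<int>& idx = entry.second;
    std::shuffle(idx.begin(), idx.end(), rng);
    // Assumed rounding rule: nearest integer to ratio * group size.
    std::size_t const n_train =
        static_cast<std::size_t>(std::lround(train_ratio * idx.size()));
    for (std::size_t k = 0; k < idx.size(); ++k) {
      (k < n_train ? out.train : out.test).push_back(idx[k]);
    }
  }
  return out;
}
```

Because the cut is made inside each group, a 50/50 split of two groups of four rows always puts two rows of each group in the training set, whatever the seed.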

◆ unique()

std::set< types::GroupId > ppforest2::stats::unique ( types::GroupIdVector const & column)

Unique group labels in a response vector.

Parameters
column  Group ID vector.
Returns
Set of unique group labels.

◆ var() [1/2]

template<typename Derived>
double ppforest2::stats::var ( Eigen::MatrixBase< Derived > const & data)

Sample variance of a vector (unbiased, n-1 denominator).

Returns 0 for single-element inputs (variance undefined; 0 is the natural degenerate value for "no spread"). Throws on empty input.
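The edge cases above can be sketched as follows, using std::vector<double> in place of an Eigen vector. The names var_sketch and sd_sketch are hypothetical and the exception type is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Unbiased sample variance with the n-1 denominator.
inline double var_sketch(std::vector<double> const& data) {
  if (data.empty()) throw std::invalid_argument("variance of empty input");
  if (data.size() == 1) return 0.0;  // degenerate case: no spread

  double mean = 0.0;
  for (double v : data) mean += v;
  mean /= static_cast<double>(data.size());

  double ss = 0.0;  // sum of squared deviations from the mean
  for (double v : data) ss += (v - mean) * (v - mean);
  return ss / static_cast<double>(data.size() - 1);
}

// Sample standard deviation as the square root of the sample variance.
inline double sd_sketch(std::vector<double> const& data) {
  return std::sqrt(var_sketch(data));
}
```

For the input {1, 2, 3} the mean is 2, the squared deviations sum to 2, and dividing by n - 1 = 2 gives a variance of 1.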

◆ var() [2/2]

types::FeatureVector ppforest2::stats::var ( types::FeatureMatrix const & data)

Column-wise sample variance of a matrix.

Parameters
data  Feature matrix with at least 2 rows.
Returns
FeatureVector of size p (one σ² per column).