Statistical infrastructure for training and evaluation.

Namespaces
namespace	simulation

Classes
struct	ClassificationMetrics
	Classification evaluation metrics.

struct	ConfusionMatrix
	A confusion matrix comparing predicted vs actual group labels.

struct	DataPacket
	Bundled dataset: features, response, and group labels.

class	GroupPartition
	Contiguous-block representation of grouped observations.

class	Normal
	Normal (Gaussian) random number generator.

struct	RegressionMetrics
	Regression evaluation metrics.

struct	Split
	Indices for a train/test split.

class	Uniform
	Discrete uniform random integer generator over [min, max].

Typedefs
using	Metrics = std::variant<ClassificationMetrics, RegressionMetrics>
	Mode-polymorphic metrics block.

using	RNG = pcg32

Functions
float	accuracy (types::OutcomeVector const &predictions, types::GroupIdVector const &actual)
	Fraction of predictions matching the ground-truth class label.

double	error_rate (types::OutcomeVector const &predictions, types::GroupIdVector const &actual)
	Misclassification rate — `1 - accuracy`.

double	error_rate (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
	Convenience overload: float-typed labels (cast to `GroupId` locally).

std::map< int, int >	get_labels_map (types::GroupIdVector const &y_pred, types::GroupIdVector const &y)
	Build a sorted mapping from unique group labels to contiguous indices.

std::map< types::GroupId, int >	group_indices (std::set< types::GroupId > const &groups)
	Map each label in `groups` to its index in iteration order.

double	mae (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
	Mean absolute error.

types::Outcome	majority_vote (std::vector< types::Outcome > const &preds)
	Majority vote over a sequence of integer-coded class labels.

types::Outcome	mean (std::vector< types::Outcome > const &preds)
	Arithmetic mean of a sequence of outcome values.

Metrics	metrics_from_outcomes (types::OutcomeVector const &y_pred, types::OutcomeVector const &y, types::Mode mode)
	Build a mode-appropriate `Metrics` variant from in-memory outcome vectors.

double	mse (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
	Mean squared error.

double	r_squared (types::OutcomeVector const &predictions, types::OutcomeVector const &actual)
	Coefficient of determination (R²). Returns 0 when total variance is 0.

template<typename Derived>
double	sd (Eigen::MatrixBase< Derived > const &data)
	Sample standard deviation of a vector — `sqrt(var(data))`.

types::FeatureVector	sd (types::FeatureMatrix const &data)
	Column-wise sample standard deviation — element-wise `sqrt` of `var`.

DataPacket	simulate (int n, int p, int G, RNG &rng, simulation::params::Classification const &params={})
	Generate a simulated classification dataset.

DataPacket	simulate (int n, int p, RNG &rng, simulation::params::Regression const &params={})
	Generate a simulated regression dataset.

template<typename Y>
void	sort (types::FeatureMatrix &x, Y &y)
	Sort a feature matrix and a response vector by the response values.

Split	split (DataPacket const &data, float train_ratio, RNG &rng)
	Perform a stratified random train/test split on a DataPacket.

std::set< types::GroupId >	unique (types::GroupIdVector const &column)
	Unique group labels in a response vector.

template<typename Derived>
double	var (Eigen::MatrixBase< Derived > const &data)
	Sample variance of a vector (unbiased, `n-1` denominator).

types::FeatureVector	var (types::FeatureMatrix const &data)
	Column-wise sample variance of a matrix.

Detailed Description

Statistical infrastructure for training and evaluation.

Provides the random number generator (pcg32), discrete uniform sampling (Lemire's method), grouped-observation bookkeeping (GroupPartition), confusion matrices, data simulation, and basic descriptive statistics used throughout the training pipeline.
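
As an illustration of the bounded-integer technique named above, here is a minimal sketch of Lemire's multiply-and-reject method for a 32-bit generator. It is not the library's Uniform class: std::mt19937 stands in for pcg32, and bounded_sketch and uniform_sketch are hypothetical names introduced only for this example.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: unbiased draw in [0, s) from a 32-bit source (requires s > 0).
// std::mt19937 is a stand-in for the pcg32 generator used by the library.
std::uint32_t bounded_sketch(std::mt19937& rng, std::uint32_t s) {
  std::uint64_t m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * s;
  std::uint32_t low = static_cast<std::uint32_t>(m);
  if (low < s) {
    // threshold == 2^32 mod s; rejecting draws with low < threshold removes the bias.
    std::uint32_t const threshold = (0u - s) % s;
    while (low < threshold) {
      m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * s;
      low = static_cast<std::uint32_t>(m);
    }
  }
  // The high 32 bits of the product are the result.
  return static_cast<std::uint32_t>(m >> 32);
}

// Hypothetical helper: unbiased draw in the closed range [min, max] (requires min <= max).
std::uint32_t uniform_sketch(std::mt19937& rng, std::uint32_t min, std::uint32_t max) {
  std::uint32_t const range = max - min + 1u;
  if (range == 0u) {
    // The range covers every 32-bit value, so any raw draw is already uniform.
    return static_cast<std::uint32_t>(rng());
  }
  return min + bounded_sketch(rng, range);
}
```

The modulo is only evaluated on the rare path where the low word falls below s, which is why the method is usually described as nearly divisionless.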

Typedef Documentation

◆ Metrics

using ppforest2::stats::Metrics = std::variant<ClassificationMetrics, RegressionMetrics>

Mode-polymorphic metrics block.

Carries ClassificationMetrics for classification models and RegressionMetrics for regression models. Consumers that need the scalar or matrix payload can std::visit the variant rather than branching on which JSON key (or strategy mode) is present.
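
The visiting pattern can be sketched as follows. The struct members here (error_rate, mse) are assumptions made for the example, since this page does not list the fields of ClassificationMetrics or RegressionMetrics, and describe is a hypothetical consumer.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Stand-in structs: the field names are assumptions, not the library's definitions.
struct ClassificationMetrics { double error_rate; };
struct RegressionMetrics { double mse; };

using Metrics = std::variant<ClassificationMetrics, RegressionMetrics>;

// Hypothetical consumer: dispatches on the active alternative rather than on a mode flag.
std::string describe(Metrics const& metrics) {
  return std::visit(
      [](auto const& m) -> std::string {
        using T = std::decay_t<decltype(m)>;
        if constexpr (std::is_same_v<T, ClassificationMetrics>) {
          return "classification error=" + std::to_string(m.error_rate);
        } else {
          return "regression mse=" + std::to_string(m.mse);
        }
      },
      metrics);
}
```

Because the lambda is instantiated once per alternative, adding a third metrics type to the variant would surface as a compile-time decision in each visitor instead of a silently missed runtime branch.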

◆ RNG

using ppforest2::stats::RNG = pcg32

Function Documentation

◆ accuracy()

float ppforest2::stats::accuracy	(	types::OutcomeVector const &	predictions,
		types::GroupIdVector const &	actual )

Fraction of predictions matching the ground-truth class label.

Returns: Accuracy in [0, 1].

◆ error_rate() [1/2]

double ppforest2::stats::error_rate	(	types::OutcomeVector const &	predictions,
		types::GroupIdVector const &	actual )

Misclassification rate — 1 - accuracy.

Returns: Error rate in [0, 1].
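
The relationship between the two classification scores can be sketched with plain standard-library containers. This is an assumption-laden illustration: std::vector<float> and std::vector<int> stand in for types::OutcomeVector and types::GroupIdVector, the _sketch names are hypothetical, and throwing on empty or mismatched input is a choice made for the example rather than documented behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for accuracy(): fraction of predictions equal to the true label.
double accuracy_sketch(std::vector<float> const& predictions, std::vector<int> const& actual) {
  if (predictions.empty() || predictions.size() != actual.size()) {
    // Assumption: the real function's handling of this case is not shown on this page.
    throw std::invalid_argument("inputs must be non-empty and of equal length");
  }
  std::size_t hits = 0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    // Predictions are float-coded labels, so compare after casting to the label type.
    if (static_cast<int>(predictions[i]) == actual[i]) {
      ++hits;
    }
  }
  return static_cast<double>(hits) / static_cast<double>(actual.size());
}

// Hypothetical stand-in for error_rate(): the complement of accuracy.
double error_rate_sketch(std::vector<float> const& predictions, std::vector<int> const& actual) {
  return 1.0 - accuracy_sketch(predictions, actual);
}
```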

◆ error_rate() [2/2]

double ppforest2::stats::error_rate	(	types::OutcomeVector const &	predictions,
		types::OutcomeVector const &	actual )

inline

Convenience overload: float-typed labels (cast to GroupId locally).

Used by the unified training pipeline where y is carried as OutcomeVector for both classification and regression.

◆ get_labels_map()

std::map< int, int > ppforest2::stats::get_labels_map	(	types::GroupIdVector const &	y_pred,
		types::GroupIdVector const &	y )

Build a sorted mapping from unique group labels to contiguous indices.

Includes labels from both the predictions and the ground truth, so the confusion matrix has a cell for every class that appears in either: classes the model predicted that never occur in the ground truth, and vice versa.

Returns: A map from label value to its 0-based index.
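
A minimal sketch of the label-union idea, assuming plain std::vector<int> in place of types::GroupIdVector; labels_map_sketch is a hypothetical name and not the library function.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for get_labels_map(): index the union of both label sets.
std::map<int, int> labels_map_sketch(std::vector<int> const& y_pred, std::vector<int> const& y) {
  // std::set deduplicates and keeps the labels in ascending order.
  std::set<int> labels(y.begin(), y.end());
  labels.insert(y_pred.begin(), y_pred.end());

  std::map<int, int> index;
  int next = 0;
  for (int label : labels) {
    index[label] = next++;  // contiguous 0-based indices in sorted label order
  }
  return index;
}
```

With predictions {3, 1} and ground truth {1, 7}, the union is {1, 3, 7}, so label 3 receives index 1 even though it never occurs in the ground truth.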

◆ group_indices()

std::map< types::GroupId, int > ppforest2::stats::group_indices ( std::set< types::GroupId > const & groups )

Map each label in groups to its index in iteration order.

Used by predict(FeatureMatrix, Proportions) on ClassificationTree and ClassificationForest to assign each group a column in the proportions matrix. std::set iterates in ascending key order, so the resulting indices are sorted by group label.

◆ mae()

double ppforest2::stats::mae	(	types::OutcomeVector const &	predictions,
		types::OutcomeVector const &	actual )

Mean absolute error.

◆ majority_vote()

types::Outcome ppforest2::stats::majority_vote ( std::vector< types::Outcome > const & preds )

Majority vote over a sequence of integer-coded class labels.

Returns the label with the largest count. Ties are broken in favor of the smallest GroupId: the counts are scanned in std::map order, which is ascending by key, and the strict > comparison keeps the earliest winner.

Used for classification ensemble aggregation: per-row OOB voting (models/Evaluation.cpp) and per-row in-bag voting (ClassificationForest::predict). Throws on empty input.
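
The tie-breaking rule can be sketched as below. The example assumes types::Outcome is a float carrying an integer-coded label and uses std::invalid_argument for the empty case; both are assumptions, and majority_vote_sketch is a hypothetical name.

```cpp
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for majority_vote(): most frequent label, smallest label on ties.
float majority_vote_sketch(std::vector<float> const& preds) {
  if (preds.empty()) {
    throw std::invalid_argument("majority vote over an empty sequence");
  }
  std::map<int, int> counts;  // label -> number of votes, iterated in ascending label order
  for (float p : preds) {
    ++counts[static_cast<int>(p)];
  }
  int best_label = counts.begin()->first;
  int best_count = 0;
  for (auto const& [label, count] : counts) {
    // Strict '>' means a later label with the same count never displaces an earlier one.
    if (count > best_count) {
      best_label = label;
      best_count = count;
    }
  }
  return static_cast<float>(best_label);
}
```

For the votes {2, 1, 2, 1} both labels have two votes, and the scan visits label 1 first, so 1 is returned.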

◆ mean()

types::Outcome ppforest2::stats::mean ( std::vector< types::Outcome > const & preds )

Arithmetic mean of a sequence of outcome values.

Used for regression ensemble aggregation: per-row OOB averaging (models/Evaluation.cpp) and per-row in-bag averaging (RegressionForest::predict). Sum is computed in double to avoid loss-of-precision when many float outcomes are added. Throws on empty input.
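
The double-precision accumulation can be sketched as follows, assuming types::Outcome is float and that the empty case raises std::invalid_argument (the exception type is not stated on this page); mean_sketch is a hypothetical name.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for mean(): accumulate in double, return in the outcome type.
float mean_sketch(std::vector<float> const& preds) {
  if (preds.empty()) {
    throw std::invalid_argument("mean of an empty sequence");
  }
  double sum = 0.0;  // wider accumulator limits rounding error over many float terms
  for (float p : preds) {
    sum += p;
  }
  return static_cast<float>(sum / static_cast<double>(preds.size()));
}
```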

◆ metrics_from_outcomes()

Metrics ppforest2::stats::metrics_from_outcomes	(	types::OutcomeVector const &	y_pred,
		types::OutcomeVector const &	y,
		types::Mode	mode )

Build a mode-appropriate Metrics variant from in-memory outcome vectors.

Sibling to serialization::metrics_from_json — same variant, different source. Lets callers compute metrics once and pass the variant to polymorphic consumers (e.g. print_metrics_block).

◆ mse()

double ppforest2::stats::mse	(	types::OutcomeVector const &	predictions,
		types::OutcomeVector const &	actual )

Mean squared error.

◆ r_squared()

double ppforest2::stats::r_squared	(	types::OutcomeVector const &	predictions,
		types::OutcomeVector const &	actual )

Coefficient of determination (R²). Returns 0 when total variance is 0.
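
For reference, the three regression scores can be sketched with standard formulas: MAE as the mean of |prediction - actual|, MSE as the mean of squared residuals, and R² as 1 - SS_res / SS_tot with the zero-variance guard described above. The sketch assumes std::vector<float> in place of types::OutcomeVector and equal-length, non-empty inputs; the _sketch names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Precondition for all three (an assumption of this sketch): same length, non-empty.

double mae_sketch(std::vector<float> const& predictions, std::vector<float> const& actual) {
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    sum += std::abs(static_cast<double>(predictions[i]) - static_cast<double>(actual[i]));
  }
  return sum / static_cast<double>(actual.size());
}

double mse_sketch(std::vector<float> const& predictions, std::vector<float> const& actual) {
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    double const r = static_cast<double>(predictions[i]) - static_cast<double>(actual[i]);
    sum += r * r;
  }
  return sum / static_cast<double>(actual.size());
}

double r_squared_sketch(std::vector<float> const& predictions, std::vector<float> const& actual) {
  double mean = 0.0;
  for (float a : actual) {
    mean += a;
  }
  mean /= static_cast<double>(actual.size());

  double ss_res = 0.0;  // residual sum of squares
  double ss_tot = 0.0;  // total sum of squares around the mean of the actual values
  for (std::size_t i = 0; i < actual.size(); ++i) {
    double const r = static_cast<double>(actual[i]) - static_cast<double>(predictions[i]);
    double const d = static_cast<double>(actual[i]) - mean;
    ss_res += r * r;
    ss_tot += d * d;
  }
  if (ss_tot == 0.0) {
    return 0.0;  // constant response: R² is undefined, so report 0 as documented
  }
  return 1.0 - ss_res / ss_tot;
}
```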

◆ sd() [1/2]

template<typename Derived>

double ppforest2::stats::sd ( Eigen::MatrixBase< Derived > const & data )

Sample standard deviation of a vector — sqrt(var(data)).

◆ sd() [2/2]

types::FeatureVector ppforest2::stats::sd ( types::FeatureMatrix const & data )

Column-wise sample standard deviation — element-wise sqrt of var.

◆ simulate() [1/2]

DataPacket ppforest2::stats::simulate	(	int	n,
		int	p,
		int	G,
		RNG &	rng,
		simulation::params::Classification const &	params = {} )

Generate a simulated classification dataset.

Rows are sorted by group label.

◆ simulate() [2/2]

DataPacket ppforest2::stats::simulate	(	int	n,
		int	p,
		RNG &	rng,
		simulation::params::Regression const &	params = {} )

Generate a simulated regression dataset.

Rows are sorted by the continuous response. Distinguished from the classification overload by the absence of a group-count parameter.

◆ sort()

template<typename Y>

void ppforest2::stats::sort	(	types::FeatureMatrix &	x,
		Y &	y )

Sort a feature matrix and a response vector by the response values.

Templated on the response vector type so callers can sort by integer group labels (GroupIdVector) or continuous responses (OutcomeVector).
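
One way to sort rows and responses together is to sort a permutation of row indices and then apply it to both containers. The sketch below assumes a row-major std::vector<std::vector<double>> in place of types::FeatureMatrix and uses a stable sort, which is a choice made for the example; the page does not say whether the library's sort is stable. sort_sketch is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for sort(): reorder rows of x and entries of y by ascending y.
template <typename Y>
void sort_sketch(std::vector<std::vector<double>>& x, std::vector<Y>& y) {
  // Build the identity permutation 0, 1, ..., n-1 and sort it by the response values.
  std::vector<std::size_t> order(y.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });

  // Apply the permutation to both containers so rows stay aligned with responses.
  std::vector<std::vector<double>> x_sorted;
  std::vector<Y> y_sorted;
  x_sorted.reserve(x.size());
  y_sorted.reserve(y.size());
  for (std::size_t idx : order) {
    x_sorted.push_back(x[idx]);
    y_sorted.push_back(y[idx]);
  }
  x = std::move(x_sorted);
  y = std::move(y_sorted);
}
```

The same template works for integer group labels and for continuous responses, since it only requires operator< on the element type.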

◆ split()

Split ppforest2::stats::split	(	DataPacket const &	data,
		float	train_ratio,
		RNG &	rng )

Perform a stratified random train/test split on a DataPacket.

Samples indices within each group proportional to train_ratio so that group balance is preserved in both train and test sets.

Parameters

data	The full dataset.
train_ratio	Proportion of data to use for training, in the open interval (0, 1).
rng	Random number generator.

Returns: A Split containing train and test index vectors.
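
The stratification idea can be sketched as: bucket row indices by group, shuffle each bucket, and send the first train_ratio share of each bucket to the training set. Everything below is an assumption made for illustration: SplitSketch and stratified_split_sketch are hypothetical names, std::mt19937 stands in for pcg32, the input is a bare label vector rather than a DataPacket, and rounding the per-group count to the nearest integer is one of several reasonable choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical stand-in for Split: row indices assigned to each side.
struct SplitSketch {
  std::vector<int> train;
  std::vector<int> test;
};

// Hypothetical stand-in for split(): sample within each group so group balance is kept.
SplitSketch stratified_split_sketch(std::vector<int> const& groups, float train_ratio,
                                    std::mt19937& rng) {
  // Collect the row indices that belong to each group label.
  std::map<int, std::vector<int>> rows_by_group;
  for (int i = 0; i < static_cast<int>(groups.size()); ++i) {
    rows_by_group[groups[static_cast<std::size_t>(i)]].push_back(i);
  }

  SplitSketch out;
  for (auto& [label, rows] : rows_by_group) {
    std::shuffle(rows.begin(), rows.end(), rng);
    // Assumption: the per-group training count is rounded to the nearest integer.
    auto const n_train = static_cast<std::ptrdiff_t>(
        std::round(train_ratio * static_cast<float>(rows.size())));
    out.train.insert(out.train.end(), rows.begin(), rows.begin() + n_train);
    out.test.insert(out.test.end(), rows.begin() + n_train, rows.end());
  }
  return out;
}
```

Which rows land on each side depends on the generator state, but the per-group counts do not: with four rows in each of two groups and a ratio of 0.5, each side always receives two rows per group.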

◆ unique()

std::set< types::GroupId > ppforest2::stats::unique ( types::GroupIdVector const & column )

Unique group labels in a response vector.

Parameters

column Group ID vector.

Returns: Set of unique group labels.

◆ var() [1/2]

template<typename Derived>

double ppforest2::stats::var ( Eigen::MatrixBase< Derived > const & data )

Sample variance of a vector (unbiased, n-1 denominator).

Returns 0 for single-element inputs (variance undefined; 0 is the natural degenerate value for "no spread"). Throws on empty input.
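
The edge cases above can be sketched with a plain std::vector<double> in place of an Eigen expression. The _sketch names are hypothetical, and std::invalid_argument is an assumed exception type, since the page only says the function throws.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for var(): unbiased sample variance with an n-1 denominator.
double var_sketch(std::vector<double> const& data) {
  if (data.empty()) {
    throw std::invalid_argument("variance of an empty vector");
  }
  if (data.size() == 1) {
    return 0.0;  // degenerate case: a single observation has no spread
  }
  double mean = 0.0;
  for (double v : data) {
    mean += v;
  }
  mean /= static_cast<double>(data.size());

  double sum_sq = 0.0;
  for (double v : data) {
    sum_sq += (v - mean) * (v - mean);
  }
  return sum_sq / static_cast<double>(data.size() - 1);
}

// Hypothetical stand-in for sd(): square root of the sample variance.
double sd_sketch(std::vector<double> const& data) {
  return std::sqrt(var_sketch(data));
}
```

For the values 1, 2, 3, 4 the mean is 2.5 and the squared deviations sum to 5, so the sample variance is 5 / 3.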

◆ var() [2/2]

types::FeatureVector ppforest2::stats::var ( types::FeatureMatrix const & data )

Column-wise sample variance of a matrix.

Parameters

data	Feature matrix with at least 2 rows.

Returns: FeatureVector of size p (one σ² per column).
