Estimating non-overfitted convex production technologies: a stochastic machine learning approach

Journal: European Journal of Operational Research

Date: 2025

Authors: Maria D. Guillen, Vincent Charles, Juan Aparicio

Abstract:
Overfitting is a classical statistical issue that occurs when a model fits a particular observed data sample too closely, potentially limiting its generalizability. While Data Envelopment Analysis (DEA) is a powerful non-parametric method for assessing the relative efficiency of decision-making units (DMUs), its reliance on the minimal extrapolation principle can lead to concerns about overfitting, particularly when the goal extends beyond evaluating the specific DMUs in the sample to making broader inferences. In this paper, we propose an adaptation of Stochastic Gradient Boosting to estimate production possibility sets that mitigate overfitting while satisfying shape constraints such as convexity and free disposability. Our approach is not intended to replace DEA but to complement it, offering an additional tool for scenarios where generalization is important. Through simulation experiments, we demonstrate that the proposed method performs well compared to DEA, especially in high-dimensional settings. Furthermore, the new machine learning-based technique is compared to the Corrected Concave Non-parametric Least Squares (C2NLS), showing competitive performance. We also illustrate how the usual efficiency measures in DEA can be implemented under our approach. Finally, we provide an empirical example based on data from the Program for International Student Assessment (PISA) to demonstrate the applicability of the new method.

Background and Context

The Overfitting Challenge

Data Envelopment Analysis (DEA) relies on the minimal extrapolation principle, which can lead to overfitting and limit generalizability beyond the observed sample.

Proposed Solution

The researchers develop SEATBoosting, adapting Stochastic Gradient Boosting to estimate production possibility sets that mitigate overfitting while maintaining shape constraints.
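
For readers unfamiliar with the underlying technique, the following is a minimal sketch of generic Stochastic Gradient Boosting for squared-error loss, not the authors' SEATBoosting algorithm. Everything in it is an assumption made for illustration: the single-feature regression stump as base learner, the function names, the hyperparameter defaults, and the toy data. It does not enforce convexity or free disposability; it only shows the two ingredients the name refers to, fitting each new learner to the current residuals and drawing a random subsample at every round.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: (threshold, left_mean, right_mean)."""
    best = None
    # Exclude the largest x so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def stochastic_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1,
                                 subsample=0.5, seed=0):
    """Return a prediction function built from stumps fitted on random subsamples."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)            # initial constant model
    preds = [base] * len(ys)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)           # the "stochastic" step
        residuals = [ys[i] - preds[i] for i in idx]   # negative gradient of squared loss
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

    return predict


# Hypothetical single-input, single-output sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.4, 1.7, 2.0, 2.2, 2.4, 2.6, 2.8]
model = stochastic_gradient_boosting(xs, ys)
```

The subsampling and the small learning rate are what make the fit less tied to any single observation, which is the property the summary above associates with reduced overfitting.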

Research Approach

The study compares SEATBoosting with traditional DEA and C²NLS through simulation experiments and evaluates its practical application using PISA education data.

DEA vs. SEATBoosting: From Overfitting to Generalization

DEA builds a frontier that tightly envelops the observed data points, which can lead to overfitting.
SEATBoosting produces a smoother frontier that generalizes better to unobserved data points.
The smoother frontier supports more accurate efficiency predictions for units outside the sample.
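
To make the "tight wrap" concrete, here is a small sketch of a DEA frontier under convexity and free disposability for the simplest case of one input and one output. The function names and the four data points are hypothetical, and the pairwise enumeration is a shortcut that works in this one-input, one-output case only; general DEA models are solved as linear programs. The sketch also computes the usual output-oriented radial score, where a value of 1 means the unit lies on the estimated frontier.

```python
from itertools import combinations


def dea_frontier(x0, inputs, outputs):
    """Largest output attainable at input level x0 (one input, one output).

    Assumes convexity and free disposability, i.e. a variable-returns-to-scale
    DEA technology built from the observed (input, output) pairs.
    """
    points = list(zip(inputs, outputs))
    # Free disposability: any observed unit using no more than x0 is attainable.
    candidates = [y for x, y in points if x <= x0]
    # Convexity: interpolate between a unit below x0 and a unit above x0.
    for a, b in combinations(points, 2):
        low, high = (a, b) if a[0] <= b[0] else (b, a)
        if low[0] < x0 < high[0]:
            w = (x0 - low[0]) / (high[0] - low[0])
            candidates.append((1 - w) * low[1] + w * high[1])
    if not candidates:
        raise ValueError("x0 is below every observed input level")
    return max(candidates)


def output_efficiency(x0, y0, inputs, outputs):
    """Output-oriented radial score: 1.0 on the frontier, above 1.0 if inefficient."""
    return dea_frontier(x0, inputs, outputs) / y0


# Hypothetical sample of four decision-making units.
inputs = [1.0, 2.0, 4.0, 3.0]
outputs = [1.0, 3.0, 4.0, 2.0]

scores = [output_efficiency(x, y, inputs, outputs)
          for x, y in zip(inputs, outputs)]
# scores == [1.0, 1.0, 1.0, 1.75]: three of the four units sit on the frontier.
```

Because the frontier is built only from observed points, most units in a small sample end up on it, which illustrates why a frontier that wraps the data this closely may generalize poorly.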

SEATBoosting Significantly Reduces Mean Squared Error Compared to DEA

SEATBoosting achieves lower Mean Squared Error (MSE) than DEA across all input dimensions tested.
The improvement is most dramatic in high-dimensional settings (9+ inputs), where SEATBoosting reduces MSE by up to 80%.
SEATBoosting also outperforms C²NLS in higher dimensions, while being computationally more efficient.

Superior Bias Reduction in Efficiency Estimation with SEATBoosting

SEATBoosting demonstrates consistently lower bias than DEA, with improvements ranging from 30% to 60%.
The bias reduction is largest for high-dimensional problems (12-15 inputs).
Lower bias indicates SEATBoosting produces efficiency estimates closer to the true production frontier.
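
The two accuracy metrics used in these comparisons can be sketched as follows. These are the textbook definitions of mean squared error and mean bias; the exact formulas used in the article's simulations may differ, and the numbers below are made up purely for illustration.

```python
def mse(estimates, truth):
    """Mean squared error between estimated and true frontier values."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def bias(estimates, truth):
    """Mean signed error; negative values mean the frontier is underestimated."""
    return sum(e - t for e, t in zip(estimates, truth)) / len(truth)


# Hypothetical true frontier values and estimates at four evaluation points.
true_frontier = [2.0, 4.0, 6.0, 8.0]
estimated_frontier = [1.0, 3.0, 6.0, 8.0]

error = mse(estimated_frontier, true_frontier)        # 0.5
signed_bias = bias(estimated_frontier, true_frontier)  # -0.5
absolute_bias = abs(signed_bias)                       # 0.5
```

MSE penalizes large deviations more heavily, while bias shows the direction of the error, so the two are reported separately.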

SEATBoosting Offers Significant Computational Efficiency Over C²NLS

SEATBoosting is considerably faster than C²NLS, with execution times 6 to 20 times shorter.
The computational advantage becomes more pronounced as the number of inputs increases.
While DEA remains the fastest method, SEATBoosting offers a practical balance between accuracy and speed.

SEATBoosting Shows Greater Discriminatory Power in Educational Efficiency Assessment

Applied to PISA data, DEA classified 19 schools as efficient while SEATBoosting identified only 1.
SEATBoosting's greater discrimination helps identify truly exceptional performers versus merely above-average ones.
This improved discrimination is valuable for policy decisions targeting educational resource allocation and improvement strategies.

Contribution and Implications

SEATBoosting complements DEA by providing more accurate efficiency estimates when generalization beyond the sample is important.
The method is particularly valuable in high-dimensional settings, offering significant improvements in both accuracy and computational efficiency.
For policymakers, the approach allows for more reliable benchmarking and resource allocation decisions in education and other sectors.
SEATBoosting maintains key economic shape constraints while leveraging machine learning, bridging theoretical rigor with modern analytics.
The framework can be extended to various efficiency measures, making it adaptable to different analytical contexts and requirements.

Data Sources

The conceptual comparison (Visualization 1) illustrates the key theoretical difference between the DEA and SEATBoosting approaches described in the article.
The MSE comparison chart (Visualization 2) was created using data from Table 3 of the article for the sample size n = 150.
The bias comparison chart (Visualization 3) was also constructed from Table 3 and shows absolute bias values across input dimensions.
The execution time comparison (Visualization 4) was derived from the "Mean Time (sec)" columns in Table 3.
The PISA case study visualization (Visualization 5) reflects the findings in Table 5, showing the number of schools each method classifies as efficient.