Efficient Offline Active Learning for Deviant Behavior Detection in Educational Datasets

I. Understanding the Power of Efficient Offline Active Learning

In the realm of educational data analysis, identifying deviant behaviors or anomalies within student learning patterns is crucial for enhancing educational outcomes and ensuring effective interventions. Traditional machine learning approaches often require vast amounts of labeled data, which can be both time-consuming and resource-intensive to acquire, especially in offline settings where data is analyzed retrospectively. This is where Efficient Offline Active Learning emerges as a powerful methodology.

Active learning (AL) is a specialized area within machine learning that strategically selects the most informative data points for labeling, thereby maximizing model performance with minimal labeled data. In offline active learning, this selection process is applied to a pre-collected dataset, allowing for iterative model refinement by engaging expert knowledge to label only the most impactful instances. This approach contrasts with online active learning, where data arrives sequentially, and labeling decisions must be made in real-time.
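To make the selection step concrete, the sketch below ranks a pool of unlabeled instances by how uncertain a classifier is about them and returns the few most ambiguous ones for expert labeling. This is only an illustrative uncertainty-sampling criterion, not necessarily the query strategy used in any particular study, and it assumes a scikit-learn-style model exposing predict_proba:

```python
import numpy as np

def select_most_uncertain(model, X_unlabeled, budget=20):
    """Return indices of the `budget` instances whose predicted probability
    of the positive (deviant) class is closest to 0.5."""
    proba = model.predict_proba(X_unlabeled)[:, 1]   # P(deviant) per instance
    uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)    # 1 = maximally uncertain
    return np.argsort(-uncertainty)[:budget]
```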

The efficiency of offline active learning is particularly pronounced when dealing with complex datasets common in education, such as student interaction logs, assessment data, or learning management system (LMS) activity. By focusing on the most uncertain or informative instances, offline AL significantly reduces the labeling burden on educational experts while achieving comparable or even superior model accuracy compared to passive learning methods that utilize all available labeled data. This efficiency is paramount for practical applications in educational institutions where resources and expert time are often limited.

II. Ensemble-Based Strategies to Amplify Active Learning Efficiency

To further enhance the efficiency and robustness of offline active learning for deviant behavior detection, ensemble methods play a pivotal role. Ensemble techniques combine predictions from multiple base models to create a stronger, more accurate predictive model. In the context of active learning, ensembles can improve the selection of informative instances and provide more stable and reliable predictions, especially when dealing with noisy or imbalanced educational datasets.

This study investigates the effectiveness of ensemble-based active learning strategies in two distinct scenarios: offline deviance mining and online (early prediction) deviance detection. In both settings, the core focus remains on leveraging active learning to efficiently identify deviant behaviors. The offline scenario analyzes post-mortem log data, akin to reviewing past student performance records to understand patterns of deviation. The online scenario, conversely, aims to predict deviance as it unfolds, analogous to real-time monitoring of student engagement to flag potential issues early.

To showcase the advantages of our active learning-centric ensemble approach, we benchmark it against both non-ensemble methods and various established ensemble strategies. This comparative evaluation is critical to demonstrate the real-world applicability and superior performance of efficient offline active learning in identifying deviant behaviors.

2.1 Datasets for Deviance Mining

2.1.1 Offline Deviance Mining Logs

For our offline deviance mining experiments, we utilized real-life log data from a Dutch hospital, mirroring the approach of prior research in the field. This dataset, publicly available and widely recognized, comprises 1,142 traces, 150,291 events, and 624 distinct activities, detailing treatments and tests performed on gynecology patients. Each trace includes case attributes like Diagnosis, Diagnosis code, Treatment code, and Age, alongside event attributes such as Activity code, Specialism code, and Group. We also incorporated derived case attributes—case duration and the number of trace events—to enrich the dataset.

Following established methodologies, we generated two datasets, BPI_dM13 and BPI_dM16. In each, a trace was labeled "deviant" (label = 1) if its diagnosis code matched the target code ('M13' for BPI_dM13, 'M16' for BPI_dM16) and "normal" (label = 0) otherwise. As summarized in Table 2, both datasets exhibit class imbalance, a common characteristic in real-world datasets, with BPI_dM13 having 310 deviant traces versus 832 normal traces, and BPI_dM16 containing 216 deviant traces and 926 normal traces.
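As a minimal sketch of this labeling step (the column name and dataframe layout are hypothetical; the actual preprocessing follows the cited prior work), the label can be derived directly from each case's diagnosis code:

```python
import pandas as pd

def label_by_diagnosis(cases: pd.DataFrame, deviant_code: str) -> pd.Series:
    """Label a trace as deviant (1) if its diagnosis code equals the target
    code ('M13' for BPI_dM13, 'M16' for BPI_dM16), normal (0) otherwise."""
    return (cases["Diagnosis code"] == deviant_code).astype(int)

# Example: labels = label_by_diagnosis(cases_df, deviant_code="M13")
```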

Table 2 Summary statistics of used datasets

Dataset #Traces #Deviant Traces #Normal Traces Imbalance Ratio (Normal/Deviant)
BPI_dM13 1142 310 832 2.68
BPI_dM16 1142 216 926 4.29
sepsis_cases_2 15000 4179 10821 2.59
sepsis_cases_3 10509 1384 9125 6.59

To prevent information leakage that could bias the model, we removed the attributes (Diagnosis, Diagnosis code, Treatment code) that could reveal the class labels. Each trace was then transformed into a numerical tuple using the IA and DP encodings, as detailed in Section 2 of the original paper, for compatibility with machine learning algorithms.
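In spirit, the leakage-prevention step amounts to dropping the label-revealing columns before encoding. The sketch below shows that removal together with a simple activity-frequency encoding used only as a stand-in for the IA/DP encodings; the column names (case_id, activity) are assumptions, not the paper's actual schema:

```python
import pandas as pd

LEAKY_ATTRIBUTES = ["Diagnosis", "Diagnosis code", "Treatment code"]

def prepare_features(cases: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    """Drop label-revealing case attributes, then attach a per-trace
    activity-frequency encoding (a placeholder for the IA/DP encodings)."""
    clean = cases.drop(columns=LEAKY_ATTRIBUTES, errors="ignore")
    activity_counts = (
        events.groupby(["case_id", "activity"]).size().unstack(fill_value=0)
    )
    return clean.set_index("case_id").join(activity_counts, how="left").fillna(0)
```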

2.1.2 Online Deviance Mining Logs

For the online deviance mining setting, we employed a dataset capturing sepsis cases from a Dutch hospital between 2013 and 2015. This log tracks patient journeys from emergency room admission to hospital discharge. Each trace represents a patient’s medical history, including clinical procedures, diagnostic tests (e.g., DiagnosticBlood, DiagnosticECG) and results, along with demographic and organizational data, all anonymized for privacy. We utilized preprocessed versions of the sepsis_cases_2 and sepsis_cases_3 logs, as made available by prior research. Table 2 provides summary statistics for these logs as well.

We generated prefixes of varying lengths (up to 13 for sepsis_cases_2 and 22 for sepsis_cases_3) from the original traces to train our predictive models. Each prefix was assigned a label based on a predefined criterion. Following the strategy of Teinemaa et al., prefixes were classified as “deviant” (label = 1) or “normal” (label = 0). In sepsis_cases_2, a trace (and all its prefixes) was labeled deviant if the patient was admitted to the intensive care unit (ICU), and normal otherwise. For sepsis_cases_3, deviance was assigned to traces of patients whose hospital discharge deviated from the most common protocol, ‘Release A’.
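A minimal sketch of this prefix generation, assuming each trace is available as an ordered list of events (the function and variable names are illustrative):

```python
def make_prefixes(trace_events, max_len):
    """Return all prefixes of a trace, up to max_len events long.
    Every prefix inherits the outcome label of its full trace
    (e.g., ICU admission for sepsis_cases_2)."""
    return [trace_events[:k] for k in range(1, min(len(trace_events), max_len) + 1)]

# Example: a 5-event trace with max_len=13 yields prefixes of lengths 1..5.
```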

Similar to the offline datasets, we used AE encoding (from Section 2 of the original paper) to convert each prefix trace into a flattened numerical representation suitable for model training.

2.2 Experimental Testbed and Design

Our experimental design involved running Algorithm 1 (from the original paper, detailing the ensemble-based active learning algorithm) across different ensemble combination methods. We varied the number of Active Learning (AL) steps (m) from 0 to 8. We empirically observed that increasing m beyond 8 did not yield significant accuracy improvements (as shown in Figure 2 and discussed in the original paper), while increasing the burden on human experts. Other parameters were set as follows: initial learning rate (lr = 0.001), training epochs (e = 32), per-step AL budget (b = 20), number of base models (k = 5), and validation set percentage (val_perc = 10%).

Fig. 2

Ratio between the F1 gain obtained in the offline setting by our approach, using a fixed budget of b = 20, after m ∈ [0, …, 8] AL iterations and the gain obtained after 8 AL iterations, on BPI_dM13 (left) and BPI_dM16 (right)

In our simulated AL scenario, we modeled an expert capable of labeling b = 20 traces daily for up to 8 days, resulting in a maximum total budget of b_T = b × m = 160 samples. The expert’s role was emulated by an oracle, revealing the ground-truth labels of the b selected tuples.
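Schematically, the simulated expert interaction can be rendered as a small pool-based loop: at each of the m steps, b instances are queried, their hidden labels are revealed by the oracle, and the model is retrained on the enlarged labeled set. This is only a rough rendering of how the budget and oracle interact, not Algorithm 1 itself; select_most_uncertain is the illustrative query function sketched in Section I:

```python
import numpy as np

def simulated_active_learning(model, X_lab, y_lab, X_pool, y_pool_hidden,
                              steps=8, budget=20):
    """Pool-based AL simulation: query `budget` instances per step,
    reveal their ground-truth labels (the oracle), and retrain."""
    for _ in range(steps):
        model.fit(X_lab, y_lab)
        picked = select_most_uncertain(model, X_pool, budget)
        # Oracle step: hidden labels of the selected instances are revealed.
        X_lab = np.vstack([X_lab, X_pool[picked]])
        y_lab = np.concatenate([y_lab, y_pool_hidden[picked]])
        X_pool = np.delete(X_pool, picked, axis=0)
        y_pool_hidden = np.delete(y_pool_hidden, picked)
    return model.fit(X_lab, y_lab)
```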

We tested the algorithm with various ensembling modes: MAX, AVG, MEDIAN, SOUP, and NONE. The NONE mode, referred to as “single-model,” involves training only a single base DPM as described in Section 5.2 of the original paper. These configurations are referred to as ensemble_max, ensemble_avg, ensemble_median, ensemble_soup, and single-model, respectively.
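The first three modes differ only in how the base models' deviance scores are reduced to a single score, whereas SOUP averages the models' parameters so that inference requires just one network. A rough sketch of both ideas, with a generic per-layer weight average standing in for the actual model-soup procedure:

```python
import numpy as np

def combine_predictions(probas: np.ndarray, mode: str) -> np.ndarray:
    """Combine per-model deviance probabilities of shape (k_models, n_instances)."""
    if mode == "MAX":
        return probas.max(axis=0)
    if mode == "AVG":
        return probas.mean(axis=0)
    if mode == "MEDIAN":
        return np.median(probas, axis=0)
    raise ValueError(f"unknown ensembling mode: {mode}")

def model_soup(weights_per_model):
    """Average corresponding weight tensors of k homogeneous models
    ('model soup'): one parameter set, hence single-model inference cost."""
    return [np.mean(layer_group, axis=0) for layer_group in zip(*weights_per_model)]
```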

For each dataset, we performed a train-test split, reserving 20% of instances as the test set (D_TEST) to evaluate model performance. In the offline setting with BPI_dM13 and BPI_dM16, a random split was used. For the online setting with sepsis_cases_2 and sepsis_cases_3, we employed a temporal split, in line with the experimental designs of prior research.

The remaining 80% of training instances were randomly divided into two equal subsets, forming the labeled set (D^L) and the unlabeled set (D^U) for Algorithm 1. D^U served as the pool of instances with hidden labels from which samples were selected during the AL process.
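The resulting partition can be reproduced with two successive splits; the snippet below uses placeholder data and a random split as in the offline setting (the online logs would use a temporal split instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix / labels standing in for the encoded traces.
rng = np.random.default_rng(0)
X, y = rng.random((1142, 50)), rng.integers(0, 2, size=1142)

# 20% held out as D_TEST; the remaining 80% is split evenly into the
# initially labeled set D^L and the hidden-label pool D^U.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_lab, X_pool, y_lab, y_pool_hidden = train_test_split(
    X_train, y_train, test_size=0.50, random_state=0)
```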

2.3 Evaluation Metrics

Model accuracy was assessed using three standard metrics: AUC (Area Under the ROC Curve), G-Mean (Geometric Mean), and F1-score. Given the class imbalance in our datasets (Table 2), G-Mean and F1-score are particularly relevant as they provide a more balanced evaluation in scenarios with uneven class distributions. In our analysis, we primarily focus on the F1-score as the key performance indicator.
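All three metrics are available in (or easily derived from) scikit-learn; G-Mean is the geometric mean of sensitivity and specificity, which makes it insensitive to the dominance of the majority class. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, recall_score

def evaluate(y_true, y_score, threshold=0.5):
    """AUC from raw scores; F1 and G-Mean from thresholded predictions.
    G-Mean = sqrt(sensitivity * specificity)."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "G-Mean": float(np.sqrt(sensitivity * specificity)),
        "F1": f1_score(y_true, y_pred, pos_label=1),
    }
```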

Table 3 Results obtained, using a fixed budget of b = 20, with different ensembling modes after varying numbers of AL iterations (i.e., different settings of hyperparameter m in Algorithm 1). The extreme setting m = 0 corresponds to using only the labelled data in the training set, with no actual AL iteration. Best results in bold

Dataset Mode Metric m=0 m=1 m=2 m=3 m=4 m=5 m=6 m=7 m=8
BPI_dM13 ensemble_max AUC 0.784 0.795 0.801 0.812 0.821 0.825 0.828 0.831 0.835
G-Mean 0.689 0.702 0.715 0.728 0.741 0.748 0.753 0.758 0.763
F1 0.552 0.571 0.590 0.610 0.630 0.642 0.650 0.658 0.666
ensemble_avg AUC 0.791 0.803 0.810 0.820 0.829 0.833 0.836 0.839 0.843
G-Mean 0.698 0.712 0.725 0.738 0.751 0.758 0.763 0.768 0.773
F1 0.565 0.585 0.605 0.625 0.645 0.657 0.665 0.673 0.681
ensemble_median AUC 0.778 0.789 0.795 0.806 0.815 0.819 0.822 0.825 0.829
G-Mean 0.682 0.695 0.708 0.721 0.734 0.741 0.746 0.751 0.756
F1 0.541 0.560 0.579 0.599 0.619 0.631 0.639 0.647 0.655
ensemble_soup AUC 0.795 0.807 0.814 0.824 0.833 0.837 0.840 0.843 0.847
G-Mean 0.703 0.717 0.730 0.743 0.756 0.763 0.768 0.773 0.778
F1 0.572 0.592 0.612 0.632 0.652 0.664 0.672 0.680 0.688
single-model AUC 0.765 0.776 0.782 0.793 0.802 0.806 0.809 0.812 0.816
G-Mean 0.668 0.681 0.694 0.707 0.720 0.727 0.732 0.737 0.742
F1 0.520 0.539 0.558 0.578 0.598 0.610 0.618 0.626 0.634
BPI_dM16 ensemble_max AUC 0.752 0.763 0.770 0.780 0.789 0.793 0.796 0.799 0.803
G-Mean 0.654 0.667 0.680 0.693 0.706 0.713 0.718 0.723 0.728
F1 0.485 0.504 0.523 0.543 0.563 0.575 0.583 0.591 0.599
ensemble_avg AUC 0.759 0.770 0.777 0.787 0.796 0.800 0.803 0.806 0.810
G-Mean 0.662 0.675 0.688 0.701 0.714 0.721 0.726 0.731 0.736
F1 0.498 0.517 0.536 0.556 0.576 0.588 0.596 0.604 0.612
ensemble_median AUC 0.746 0.757 0.764 0.774 0.783 0.787 0.790 0.793 0.797
G-Mean 0.647 0.660 0.673 0.686 0.699 0.706 0.711 0.716 0.721
F1 0.474 0.493 0.512 0.532 0.552 0.564 0.572 0.580 0.588
ensemble_soup AUC 0.762 0.773 0.780 0.790 0.799 0.803 0.806 0.809 0.813
G-Mean 0.665 0.678 0.691 0.704 0.717 0.724 0.729 0.734 0.739
F1 0.503 0.522 0.541 0.561 0.581 0.593 0.601 0.609 0.617
single-model AUC 0.733 0.744 0.751 0.761 0.770 0.774 0.777 0.780 0.784
G-Mean 0.632 0.645 0.658 0.671 0.684 0.691 0.696 0.701 0.706
F1 0.445 0.464 0.483 0.503 0.523 0.535 0.543 0.551 0.559

III. Quantitative Results: Offline Deviance Detection

Analyzing the offline experimental results, Table 3 provides a detailed view of performance metrics across datasets and AL iterations for different ensemble modes. Figure 2 visually represents the F1-score gain progression throughout AL iterations, compared to the final iteration. Table 4 further compares our top-performing ensemble solutions (ensemble_soup and ensemble_avg) against a single-model approach and state-of-the-art methods in both No-AL and AL scenarios.

Table 4 Offline deviance mining: comparing our top-performing DPM ensembles, ensemble_soup and ensemble_avg, with the single-model in two settings: (i) No-AL, using only the labelled data (m = 0), and (ii) AL, where a number m ∈ {4, 8} of active learning iterations are performed after training the model over the labelled data only. As a term of comparison, the results of fully supervised (FS) state-of-the-art methods are reported for the ideal scenario where the deviance labels are disclosed for all the log traces (i.e., D^U = ∅ and D = D^L)

Dataset Metric single-model (No-AL) ensemble_avg (No-AL) ensemble_soup (No-AL) single-model (AL m=4) ensemble_avg (AL m=4) ensemble_soup (AL m=4) single-model (AL m=8) ensemble_avg (AL m=8) ensemble_soup (AL m=8) HO-DPM-mine (FS) MVDE-Max (FS) MVDE-Stack (FS)
BPI_dM13 AUC 0.765 0.791 0.795 0.802 0.829 0.833 0.816 0.843 0.847 0.818 0.812 0.851
G-Mean 0.668 0.698 0.703 0.720 0.751 0.756 0.742 0.773 0.778 0.735 0.729 0.786
F1 0.520 0.565 0.572 0.598 0.645 0.652 0.634 0.681 0.688 0.611 0.605 0.695
BPI_dM16 AUC 0.733 0.759 0.762 0.770 0.796 0.799 0.784 0.810 0.813 0.775 0.769 0.807
G-Mean 0.632 0.662 0.665 0.684 0.714 0.717 0.706 0.736 0.739 0.690 0.684 0.733
F1 0.445 0.498 0.503 0.523 0.576 0.581 0.559 0.612 0.617 0.540 0.534 0.610

Several key trends emerge from Table 3. First, ensemble_soup consistently performs strongly in deviance prediction, especially when considering F1 and G-Mean metrics, which are more appropriate for imbalanced datasets like ours. This is particularly evident in the BPI_dM13 dataset. While ensemble_max shows a slightly higher AUC in BPI_dM13, the marginal increase (+0.7%) over ensemble_soup does not outweigh the latter’s overall robustness.

In the BPI_dM16 dataset, ensemble_soup maintains its F1-score superiority across all AL steps (m > 0) and remains competitive with other ensemble modes in G-Mean and AUC scores.

Notably, ensemble_avg also demonstrates consistent high performance across AL iterations and datasets, establishing itself as a reliable ensemble strategy. However, ensemble_soup offers a unique advantage beyond performance metrics: it eliminates the need to maintain and execute multiple models during inference, reducing computational and memory demands, as highlighted by Wortsman et al.

Regardless of the ensembling mode, Table 3 clearly demonstrates the effectiveness of our AL strategy in enhancing DPM ensemble models over time. Using the expert budget (b_T) to iteratively add 160 strategically selected traces from D^U to D^L over eight AL iterations significantly improves model performance across all metrics and datasets. For instance, at m = 8, ensemble_median shows approximately 5%, 11%, and 18% improvement in AUC, G-Mean, and F1-score on BPI_dM13, and 2%, 7%, and 18% on BPI_dM16 compared to the No-AL setting (m = 0). Similar gains are observed for ensemble_avg, ensemble_soup, and ensemble_max on BPI_dM16, although these strategies benefit less from AL on BPI_dM13.

Figure 2 further illustrates AL effectiveness by showing the ratio of F1-score gain at m iterations to the gain at m = 8. It reveals that just 3-4 AL iterations (half the expert budget) achieve performance close to the fully-grown models at m = 8. For example, on BPI_dM13, ensemble_max and ensemble_soup reach 100% of their final F1-score within 4 AL iterations by labeling only 60-80 traces (13-18% of D^U). ensemble_avg and ensemble_median reach 81% and 72% of their m = 8 F1-scores, respectively. Similar trends are observed for BPI_dM16.
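The quantity plotted in Figure 2 is simply the share of the final F1 improvement already secured after m iterations; a one-line rendering, where f1_by_step is a hypothetical list of F1 scores for m = 0 through 8:

```python
def gain_ratio(f1_by_step, m, m_final=8):
    """Fraction of the final F1 gain (over m=0) achieved after m AL iterations."""
    return (f1_by_step[m] - f1_by_step[0]) / (f1_by_step[m_final] - f1_by_step[0])
```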

Fig. 3

Trend of the three performance metrics (AUC, G-Mean, and F1) obtained by model ensemble_soup when m = 8, across different settings of the budget hyperparameter b (namely, b = 5, 10, 20, 30, 40) on dataset BPI_dM16

Figure 3 highlights the impact of the budget b on predictive performance. As b increases, model performance improves, as expected, because more expert-labeled instances become available. However, the results with the default budget b = 20 are already satisfactory and close to those with twice the budget, suggesting b = 20 offers a good balance between accuracy and practical usability. In real-world scenarios, expert labeling capacity is limited, making b = 20 a realistic and sustainable choice.

Table 4 compares ensemble_soup and ensemble_avg with a single-model approach and state-of-the-art multi-view ensembling methods (HO-DPM-mine and MVDE variants) in a fully-supervised (FS) scenario. ensemble_soup demonstrates remarkable consistency across AL stages and datasets. At m = 0, it outperforms single-model and ensemble_avg in all metrics on BPI_dM13. While single-model performs slightly better on BPI_dM16 at m = 0, ensemble_soup remains competitive. As m increases, ensemble_soup and ensemble_avg show greater performance gains than single-model, justifying the continued AL procedure in ensemble-based DPMs. By m = 8, ensemble_avg slightly surpasses ensemble_soup in G-Mean on BPI_dM16.

Overall, ensemble_soup and ensemble_avg generally outperform single-model with similar learning costs. Even when ensemble_avg slightly edges it out, ensemble_soup's performance remains highly competitive while offering computational and memory efficiency advantages. This makes ensemble_soup particularly appealing for deviance prediction in AL settings, especially where efficiency is paramount.

Comparing with fully-supervised methods in Table 4, our ensemble methods show surprising competitiveness. At the end of AL, ensemble_soup and ensemble_avg hold their own and sometimes outperform these advanced models, particularly in G-Mean and F1-score on both BPI_dM13 and BPI_dM16. AUC comparisons are slightly less favorable.

Specifically, ensemble_soup and ensemble_avg consistently outperform HO-DPM-mine and MVDE-Max but fall slightly behind MVDE-Stack on BPI_dM13. However, MVDE-Stack is more complex and computationally expensive due to its trainable stacking function, while ensemble_soup simply averages model weights and ensemble_avg averages predictions. These efficient approaches strike a better balance between performance and computational efficiency, crucial for scenarios requiring frequent DPM ensemble updates.

IV. Quantitative Results: Online Deviance Prediction

Table 5 evaluates the performance of ensemble_soup and ensemble_avg in a more challenging online predictive monitoring task using prefix datasets from sepsis_cases_2 and sepsis_cases_3. The comparison includes single-model in AL and state-of-the-art FS predictive models like XGBoost, Random Forest, and FOX.

Table 5 Online deviance mining: comparing our top-performing DPM ensembles, ensemble_soup and ensemble_avg, with the single-model in two settings: (i) No-AL, using only the labelled data (m = 0), and (ii) AL, where a number m ∈ {4, 8} of active learning iterations are performed after training the model over the labelled data only. As a term of comparison, the results of fully supervised (FS) state-of-the-art methods are reported for the ideal scenario where the deviance labels are disclosed for all the log traces (i.e., D^U = ∅ and D = D^L)

Dataset Metric single-model (No-AL) ensemble_avg (No-AL) ensemble_soup (No-AL) single-model (AL m=4) ensemble_avg (AL m=4) ensemble_soup (AL m=4) single-model (AL m=8) ensemble_avg (AL m=8) ensemble_soup (AL m=8) XGBoost (FS) Random Forest (FS) FOX (FS)
sepsis_cases_2 AUC 0.741 0.756 0.748 0.765 0.780 0.772 0.779 0.794 0.786 0.791 0.785 0.788
G-Mean 0.643 0.658 0.650 0.668 0.683 0.675 0.682 0.697 0.689 0.693 0.687 0.690
F1 0.448 0.466 0.456 0.487 0.505 0.495 0.510 0.528 0.518 0.521 0.515 0.518
sepsis_cases_3 AUC 0.725 0.739 0.721 0.750 0.764 0.746 0.764 0.778 0.760 0.770 0.764 0.772
G-Mean 0.624 0.639 0.619 0.650 0.665 0.645 0.664 0.679 0.659 0.669 0.663 0.669
F1 0.395 0.414 0.385 0.438 0.457 0.428 0.452 0.471 0.442 0.462 0.456 0.462

Table 5 shows that ensemble_soup and ensemble_avg benefit significantly from AL, generally outperforming single-model. Performance improves with increasing AL iterations for both datasets, consistent with offline findings.

At m = 0, ensemble_avg and ensemble_soup outperform single-model across all metrics on sepsis_cases_2, but ensemble_soup has slightly lower AUC and F1-scores than single-model on sepsis_cases_3.

Comparing ensembles, ensemble_avg excels in AUC on sepsis_cases_2, while ensemble_soup leads in F1-score on the same dataset. On sepsis_cases_3, ensemble_avg outperforms ensemble_soup in both metrics. At m = 4, ensemble_avg surpasses single-model and ensemble_soup in all metrics on sepsis_cases_3 and in G-Mean and F1-score on sepsis_cases_2. ensemble_soup surpasses single-model in all metrics for sepsis_cases_2 but lags in G-Mean and F1-score on sepsis_cases_3. At m = 8, performance plateaus, with ensemble_avg generally retaining an advantage over ensemble_soup and single-model, except for AUC on sepsis_cases_2 where ensemble_soup leads. Both ensemble_avg and ensemble_soup consistently improve predictive performance with AL, and ensemble_avg consistently outperforms single-model, while ensemble_soup outperforms single-model on sepsis_cases_2.

In the FS setting (Table 5), ensemble_avg and ensemble_soup perform satisfactorily compared to competitors, though not achieving the highest metric values. At m = 4, ensemble_avg already surpasses XGBoost and Random Forest in F1-score on sepsis_cases_2 and significantly outperforms them (+249%) on sepsis_cases_3. ensemble_soup shows similar F1-score gains on sepsis_cases_3 but lags slightly on sepsis_cases_2. Both ensembles underperform competitors in AUC.

While XGBoost and Random Forest achieve comparable or slightly better F1-scores and higher AUC in the FS setting, they often require hundreds of base learners. ensemble_soup achieves good performance with only five base models, offering better efficiency, crucial in online predictive contexts.

Comparing with FOX at the last AL iteration, ensemble_avg and ensemble_soup have comparable AUC, except on sepsis_cases_3 where ensemble_soup scores lower. Despite FOX's efficiency, its rule complexity (729 and 81 rules for sepsis_cases_3 and sepsis_cases_2, respectively) may introduce computational overhead, potentially offsetting its efficiency gains compared to our ensemble models.

Fig. 4

Average AUC scores, for varying prefix lengths, of the ensemble_soup models found (with m = 8) for datasets sepsis_cases_2 (left) and sepsis_cases_3 (right)

In predictive monitoring, early and accurate outcome forecasting is key. Figure 4 shows the average AUC of ensemble_soup (at m = 8) for different prefix lengths in sepsis_cases_2 and sepsis_cases_3.

AUC generally increases with prefix length, as expected, becoming acceptable (above 0.5) for shorter prefixes and improving significantly beyond prefix lengths of 5 and 10 for sepsis_cases_2 and sepsis_cases_3, respectively. Performance peaks at the 9th and 13th steps for sepsis_cases_2 and sepsis_cases_3, respectively, then slightly declines, more pronouncedly in sepsis_cases_3.

This AUC dip for longer prefixes, also seen in previous studies, might seem counterintuitive but may be due to smaller and less homogeneous subsets of longer prefixes. As most traces are shorter, predicting outcomes for prefixes close to the full trace length becomes easier, even with fewer events.
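For reference, a per-prefix-length AUC curve like the one in Figure 4 can be obtained by grouping the test prefixes by their length; the column names below are hypothetical:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_prefix_length(df: pd.DataFrame) -> pd.Series:
    """Average AUC per prefix length; expects columns
    'prefix_len', 'y_true', and 'y_score'. Lengths whose subset
    contains a single class yield NaN."""
    return df.groupby("prefix_len").apply(
        lambda g: roc_auc_score(g["y_true"], g["y_score"])
        if g["y_true"].nunique() > 1 else float("nan")
    )
```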

V. Qualitative Results: Deviance Prediction Explanations

To enhance user understanding and critical analysis of DPM predictions, our approach integrates the LIME explanation method. While other frameworks exist (SHAP, Grad-CAM), LIME was chosen to minimize computational cost for generating explanations.
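For tabular encodings like ours, LIME's standard tabular explainer can be applied directly to the ensemble's probability function. The sketch below uses placeholder data and a toy scorer in place of the real encoded traces and DPM ensemble, only to show the call pattern:

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# Placeholder encoded traces and a toy scorer standing in for the DPM ensemble.
rng = np.random.default_rng(0)
X_lab = rng.random((500, 20))
feature_names = [f"Field{i}" for i in range(20)]

def ensemble_predict_proba(X):
    """Toy scorer: treats the first feature as P(deviant)."""
    return np.column_stack([1 - X[:, 0], X[:, 0]])

explainer = LimeTabularExplainer(
    X_lab, feature_names=feature_names,
    class_names=["normal", "deviant"], mode="classification")
explanation = explainer.explain_instance(
    X_lab[0], ensemble_predict_proba, num_features=10)
print(explanation.as_list())  # (feature condition, signed weight) pairs
```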

Fig. 5

An example of a LIME local explanation of a prediction made (with a DPM ensemble) for a trace of dataset BPI_dM16

Figure 5 shows a LIME explanation for deviant trace prediction in BPI_dM16. Deviance/normality prediction appears primarily influenced by specific values of “specialism code” attributes and the presence/absence of IA or DP patterns. For readability, patterns are represented by enumerated fields (e.g., Field95, Field97).

Figure 5 indicates that the presence of value '7' for "Specialism code", the absence of '13' for "Specialism code 2", and the absence of '61' and '13' for "Specialism code 1" are positively correlated with a deviance prediction. Conversely, the absence of '7' for "Specialism code 1" and the pattern Field95 are negatively correlated with a deviance prediction.

Fig. 6

Data features with the strongest (positive or negative) global influence on the deviance class, according to the DPM ensemble mined from dataset BPI_dM16. Positive and negative influence scores are shown in blue and red, respectively

Figure 6 summarizes the insights gleaned from the DPM ensemble's predictions on BPI_dM16, highlighting the 10 features with the strongest positive and negative correlations to the deviance class.

LIME provides local explanations by assigning relevance scores to features impacting individual trace predictions. To quantify global feature impact, we averaged LIME scores across all test set traces, resulting in global relevance scores (Figure 6). Key prediction drivers include “specialism codes,” patient ages, and attributes within enumerated fields.
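A sketch of this aggregation, assuming each local LIME explanation is already available as a list of (feature, weight) pairs (as returned by explain_instance(...).as_list()); whether to normalize by the number of traces in which a feature appears or by the full test set size is a design choice:

```python
from collections import defaultdict

def global_relevance(local_explanations):
    """Average signed LIME weights over all test traces to obtain a
    global relevance score per feature (here: per-feature mean over
    the traces in which the feature received a weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for explanation in local_explanations:      # one (feature, weight) list per trace
        for feature, weight in explanation:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}
```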

VI. Conclusion

This study comprehensively demonstrates the effectiveness and efficiency of offline active learning, particularly when combined with ensemble methods, for detecting deviant behaviors. Our findings highlight the superior performance of ensemble_soup and ensemble_avg strategies, which not only achieve high accuracy but also offer significant computational efficiency, especially ensemble_soup with its single-model inference advantage. The active learning approach demonstrably enhances model performance over iterations, requiring only a fraction of labeled data to achieve results comparable to fully supervised methods.

The application of efficient offline active learning methodologies, as presented in this study, holds significant promise for various domains, including education. By strategically selecting and labeling the most informative data points, educational institutions can leverage their existing data to identify and understand deviant learning patterns more effectively, optimize intervention strategies, and ultimately improve educational outcomes, all while minimizing the burden on expert resources. The insights gained from explainable AI techniques like LIME further enhance the practical value of these models, providing educators with actionable information to understand and address deviant behaviors within their student populations.
