Accuracy and Fairness for Web-Based Content Analysis under Temporal Shifts and Delayed Labeling (2024)

Abdulaziz A. Almuzaini, Department of Computer Science, Rutgers University, United States, aaa395@rutgers.edu

David M. Pennock, Department of Computer Science, Rutgers University, United States, dpennock@dimacs.rutgers.edu

Vivek K. Singh, School of Communication and Information, Rutgers University, United States, v.singh@rutgers.edu


DOI: https://doi.org/10.1145/3614419.3644028
WEBSCI '24: ACM Web Science Conference, Stuttgart, Germany, May 2024

Web-based content analysis tasks, such as labeling toxicity, misinformation, or spam, often rely on machine learning models to achieve cost and scale efficiencies. As these models impact real human lives, ensuring their accuracy and fairness is critical. However, maintaining the performance of these models over time can be challenging due to temporal shifts in the application context and the sub-populations represented. Furthermore, there is often a delay in obtaining human expert labels for the raw data, which hinders the timely adaptation and safe deployment of the models. To overcome these challenges, we propose a novel approach that anticipates future data distributions, particularly in settings where unlabeled data becomes available earlier than the labels, and uses them to estimate the future distribution of labels per sub-population and adapt the model preemptively. We evaluate our approach using multiple temporally-shifting datasets and consider bias based on racial, political, and demographic identities. We find that the proposed approach yields promising performance with respect to both accuracy and fairness. Our paper contributes to the web science literature by proposing a novel method for enhancing the quality and equity of web-based content analysis using machine learning. Experimental code and datasets are publicly available at https://github.com/Behavioral-Informatics-Lab/FAIRCAST.

CCS Concepts: • Computing methodologies → Machine learning; • Computing methodologies → Learning settings; • Computing methodologies → Online learning settings.


Keywords: algorithmic fairness, distribution shifts, temporal shifts, domain adaptation, continual learning


ACM Reference Format:
Abdulaziz A. Almuzaini, David M. Pennock, and Vivek K. Singh. 2024. Accuracy and Fairness for Web-Based Content Analysis under Temporal Shifts and Delayed Labeling. In ACM Web Science Conference (WEBSCI '24), May 21--24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3614419.3644028

1 INTRODUCTION

Web-based content analysis tasks, such as labeling misinformation, toxicity, spam, or bots, are increasingly automated using machine learning (ML) models. These models can have significant impact on the lives and opportunities of people who are affected by the decisions made by such models. For instance, systematic discrepancies in labeling toxicity as targeted at different population groups can have serious downstream implications for those groups [6]. Therefore, ensuring accuracy and fairness of ML models is a crucial challenge for web science.

However, ML models may not be able to maintain accuracy and fairness over time due to temporal shifts, i.e., changes in the data distribution over time [44]. This can happen because of a natural evolution of the application context, data, and/or labels per sub-population. For instance, the Google Flu Trends (GFT) tracking system, which performed well in its initial years of deployment, was later found to become error-prone [24]. Similarly, there was a rise in hate messages targeting certain population groups during the COVID pandemic [27]. Hence, algorithms trained on older data will struggle to provide accurate and fair results across those population groups in the changed settings. Therefore, ensuring accuracy and fairness of ML models in temporally evolving settings is an important challenge [11, 44].

A natural way to tackle the above challenge is to retrain the ML model at each time instance. However, in such settings, ML models are always trailing the recent data. In other words, the model (re)trained on data from batch/time t is applied to data from batch t + 1, which can be an issue if there is a shift in population groups or the associated labels across these time instances. A key challenge in detecting whether temporal shifts have occurred between time t and t + 1 is the availability of the labels from t + 1, which are often not observed immediately. While manual labeling of the entire data from t + 1 is impractical for ML deployments, even labels for smaller (e.g., validation) sets arrive with a delay, i.e., human expert labels for batch t + 1 become available later than the unlabeled data for batch t + 1, which hinders timely model validation and safe model deployment. This combination of temporal shifts and delayed labeling is common in web-based tasks such as labeling toxicity, spam, or misinformation.

A recent method proposed to address fairness in these settings involves temporal averaging of past data distributions to estimate future values in order to proactively mitigate unfair behaviors [3]. However, this approach does not acknowledge that raw, unlabeled data for the next batch are available before the ML decisions are made. These data can provide insights into macro-properties for the next batch (i.e., t + 1), which can be used to realign the model at the current time instance, enhancing its accuracy and fairness when deployed for the next batch.

Given this context, our research question is: How can we enhance the accuracy and fairness of machine learning models for web-based content analysis tasks under temporal shifts and delayed labeling?

To tackle this question, we introduce a novel approach called FAIRCAST: Fairness and Accuracy in Content Analysis Across Time. This approach combines forecasting with fairness and utilizes unlabeled data to anticipate the future distribution of labels across different population groups and to proactively adapt the model. Our proposed method assumes an online batch processing setting where unlabeled data from t + 1 can reveal important information about the macro properties of the label distribution at that time, such as label prevalence and the correlation of labels with group identities. We leverage these insights to adjust the current model's parameters, optimizing for both accuracy and fairness. Our proposed approach is versatile and can be applied to a variety of web-based content analysis tasks that involve temporal shifts, spanning different tasks (e.g., labeling toxicity, misinformation) and notions of sensitive attributes (e.g., racial identity, political leaning).

We motivate our approach by highlighting the importance of temporal shifts and delayed labeling in web-based content analysis tasks, and the challenges they pose for accuracy and fairness. We review the related work on fairness-aware learning, temporal shifts, and domain adaptation, and identify the gaps and limitations of the existing methods. We then present our approach in detail and explain how it estimates the future label distribution, adapts the model, and combines the past models.

Our FAIRCAST approach builds upon iterative re-training as the baseline, but improves it based on four components:

  • We anticipate the label distribution for next time instance t + 1 based on the unlabeled data that becomes available earlier than the human labels. This allows us to estimate the future distribution of labels for different population groups and adapt the model preemptively, rather than waiting for the labels to become available and re-training the model after the fact.
  • We apply domain adaptation to improve the accuracy and fairness performance for t + 1 based on the above estimates. We use domain adaptation techniques to align the label distributions of the training (t) and deployment (t + 1) data to counter the temporal shifts.
  • We dynamically re-weigh different population groups (e.g., those with higher error rates) based on past performance of the models. We use a re-weighing fairness approach to assign different weights to different groups of data, based on their error rates and sensitivities. This helps to balance the trade-off between accuracy and fairness, and to mitigate the unfairness caused by the temporal shifts.
  • We keep a log of past models to support continual training. We use a temporal trail of past models to average them and yield smoothened models for deployment. This helps to retain past knowledge, reduce the variance and noise of the models, and to handle the abrupt changes in the data distribution.

We evaluate our approach using multiple temporally-shifting datasets and consider bias based on racial, political, and demographic identities. We compare our approach with several baseline methods, such as static deployment, retraining, domain adaptation and past related work on anticipatory bias correction. We measure both accuracy and fairness using various metrics, such as area under the ROC curve, demographic parity and equalized odds. We find that our approach yields promising performance with respect to both accuracy and fairness, and outperforms the baseline methods in most cases. Our paper makes the following main contributions to the web science literature:

  • We propose a novel approach for enhancing accuracy and fairness of machine learning models for web-based content analysis tasks under temporal shifts and delayed labeling.
  • We evaluate our approach using multiple temporally-shifting datasets and consider bias based on racial, political, and demographic identities.
  • We discuss the implications and limitations of our approach for web science research and practice.

Table 1: Summary of related work and existing methods including: Empirical Risk Minimization (ERM), Re-weighing (RW), Domain Adaptation (DA) and Anticipatory Bias Correction (ABC).

Methods        Stationarity   Fairness   Anticipation   Continuity
ERM [41]       iid            –          –              –
RW [23]        iid            ✓          –              –
DA [43]        non-iid        –          ✓              –
ABC [3]        non-iid        ✓          ✓              –
FAIRCAST       non-iid        ✓          ✓              ✓

2 RELATED WORK

Web-based content analysis tasks, such as web search, misinformation detection, cyberbullying prevention, and toxicity moderation, are increasingly automated using machine learning algorithms [2, 32]. These algorithms can have significant impacts on the web ecosystem and the society at large, as they can affect the access, quality, diversity, and credibility of the web information and services [5]. Therefore, ensuring fairness in web-based content analysis algorithms is a crucial challenge for web science [29, 32]. Fairness in web-based content analysis algorithms can be defined as the property that the algorithms do not discriminate or harm different groups of web users or web entities based on their protected or sensitive attributes, such as race, gender, age, religion, political affiliation, location, or preference [42]. Fairness can be measured and optimized using various criteria and methods, such as statistical parity, individual fairness, group fairness, counterfactual fairness, causal fairness, and adversarial fairness [34].

However, achieving fairness in web-based content analysis algorithms is not a trivial task, as there are many sources and types of bias and unfairness that can affect web data and web algorithms. For example, bias and unfairness can arise at each stage of the web data pipeline: collection, representation, processing, analysis, interpretation, dissemination, and feedback. Moreover, bias and unfairness can arise in web algorithm design, implementation, evaluation, deployment, etc. [29]. In this work, we focus on the performance of machine learning models that are frequently used in web applications for tasks such as labeling misinformation, toxicity, spam, and bots.

Some of the major challenges of web-based content analysis using algorithms are temporal shifts and delayed labeling of the web data. Temporal shifts refer to the change in the data distribution over time (e.g., based on world events, seasons, demographic shifts), which can affect both the accuracy and fairness of the ML algorithms [11, 20, 28, 44, 46]. Delayed labeling refers to the lag in obtaining human expert labels for the raw, unlabeled data, which limits the retraining and re-evaluation opportunities of the algorithms and often results in a significant performance drop [17, 37, 47]. These challenges are quite common in web-based content analysis tasks where the web data evolves rapidly over time and the labels require human expert inputs [27, 31]. Most of the existing work on algorithmic fairness assumes that the data distribution is stationary, i.e., independently and identically distributed (iid), and that the labels are available immediately after the data is collected. These assumptions do not hold in web-based content analysis tasks, where the data distribution changes over time and the labels are often delayed [27, 47]. Therefore, there is a need for new methods that can handle temporal shifts and delayed labeling in web-based content analysis algorithms, and that can maintain or improve both accuracy and fairness over time.

Two of the closest related concepts to temporal shifts and delayed labeling in the literature are domain adaptation and anticipatory bias correction [3, 21]. Domain adaptation is a technique that aims to improve the performance of ML algorithms when the training and deployment data come from different domains or distributions [21]. Anticipatory bias correction is a technique that aims to proactively reduce the bias of machine learning algorithms before it occurs, by using anticipations regarding the relative distributions of population groups in the next cycle [3]. However, these concepts have some limitations and gaps that need to be addressed. The majority of past work on domain adaptation has focused on improving accuracy, and only a few works have addressed fairness, typically in non-temporal settings [36, 38, 45, 46]. Moreover, some of the existing domain adaptation methods may not be able to handle the complex and dynamic temporal shifts that occur in web-based content analysis tasks, such as label shift, demographic shift, or sub-population shift [12, 44, 45]. Therefore, there is a need for domain adaptation methods that can support both accuracy and fairness in temporally-shifting settings, and that can handle different types of temporal shifts.

There has been only one work to date on anticipatory bias correction [3]. This work proposed a method for correcting the algorithm to mitigate bias before it occurs, by using anticipations regarding the relative distributions of population groups in the next cycle. However, this method is limited in its temporal estimation approach and simply uses past data to project the ratio of different population groups. It does not leverage the unlabeled data that becomes available earlier than the labels to estimate the future distribution of labels, nor does it consider the relative difficulty of label estimation for different population groups over time. Therefore, there is a need for anticipatory bias correction methods that can utilize the unlabeled data to improve the temporal estimation, consider the past error rates for different population groups, and build a robust, continual learning algorithm that supports both the accuracy and fairness objectives.

In this paper, we propose a new approach called FAIRCAST to address the challenges of temporal shifts and delayed labeling in web-based content analysis algorithms, and to support both accuracy and fairness over time. Our approach combines the advantages of domain adaptation, anticipatory bias correction, and bias mitigation methods, and addresses the gaps and challenges in the current literature.

Table 1 summarizes the existing methods and the main aspects we are addressing in this work, namely the stationarity, anticipation, fairness, and continuity. To the best of our knowledge, the proposed approach is the first attempt at tackling the abovementioned four issues in a single framework.

3 METHODOLOGY

3.1 Problem Formulation

We define random variables (A, X) to represent the sensitive variable A (e.g., gender) and the input variable X (e.g., tweets). We consider A a binary random variable. We also represent the ground-truth label of the web classification task by a binary random variable Y. Each instance x has a label y ∈ Y and a sensitive attribute a ∈ A, such that Y ∈ {y+, y−}, where + (−) denotes the positive (negative) class, and A ∈ {a+, a−}, where + (−) denotes the advantaged (disadvantaged) demographic group. We utilize a function f: (A, X) → Y, representing a binary classifier.

We follow online/continual learning settings by assuming that the data arrive sequentially in batches {B1, B2,...}, in which each batch Bt is a collection of (a, x, y) instances drawn iid from a distribution Pt(A, X, Y), and t represents the timestamp metadata [10]. We also assume that two distributions sampled from consecutive time instances may differ with respect to the joint distribution of A and Y (see Eq. 2). This type of temporal shift is known in the literature as a joint shift or a sub-population shift. Additionally, we represent the data sampled from Pt as Bt, the current training batch on which we train a web classification model ft (i.e., ft(Bt)), whereas Bt + 1 is the following deployment batch on which we apply ft (i.e., ft(Bt + 1)). We assume that during training we have access to labeled samples from the training distribution Pt(A, X, Y) and unlabeled samples from the deployment distribution Pt + 1(A, X), i.e., Y is missing but A and X are available. In this work, we additionally assume the human labels for t + 1 are delayed; therefore, blindly deploying a model ft on t + 1 is risky if a temporal shift exists. Hence, domain adaptation methods can be beneficial.
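To make the setting concrete, the following minimal Python sketch illustrates the batch protocol described above: the model is trained on the labeled batch Bt, adapted using only the unlabeled (A, X) portion of Bt + 1, and deployed before the labels for Bt + 1 arrive. The names (Batch, run_stream, train_model, adapt_model) are illustrative assumptions, not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence
import numpy as np

@dataclass
class Batch:
    A: np.ndarray              # sensitive attribute per instance (0/1)
    X: np.ndarray              # input features
    Y: Optional[np.ndarray]    # labels; None until human annotations arrive

def run_stream(batches: Sequence[Batch],
               train_model: Callable, adapt_model: Callable):
    """Train on labeled B_t, adapt using unlabeled B_{t+1}, then deploy on B_{t+1}."""
    for t in range(len(batches) - 1):
        B_t, B_next = batches[t], batches[t + 1]
        model = train_model(B_t.A, B_t.X, B_t.Y)        # labels for B_t are available
        model = adapt_model(model, B_next.A, B_next.X)  # only (A, X) of B_{t+1} is seen
        preds = model.predict(np.column_stack([B_next.A, B_next.X]))
        yield t + 1, preds                              # labels for B_{t+1} arrive later
```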


3.2 Overall Approach

Our overall FAIRCAST approach is summarized in Figure 1. We address the temporal shift, delayed labeling, and algorithmic bias challenges simultaneously to enhance the trade-off between accuracy and fairness. Specifically, we first estimate the joint shift between the current and the next time instance (the anticipation aspect, Section 3.2.1). Next, we follow the guidelines of the domain adaptation literature to re-sample the joint training distribution at time t so that it aligns with the expected deployment distribution at time t + 1 (the stationarity aspect, Section 3.2.2), which supports high accuracy at time t + 1. Next, we adopt a popular re-weighing method for bias mitigation (the fairness aspect, Section 3.2.3) on the re-sampled distribution, which yields a model optimized for the distribution anticipated at time t + 1. Further, we extend the RW method to dynamically re-weigh different groups based on their past performance. Lastly, we keep a history of past models to retain some of the knowledge from past cycles and to avoid sudden changes in the algorithm's predictions (the continuity aspect, Section 3.2.4). Specifically, we undertake a weighted average over the historical models to yield a smoothed temporal ensemble model that is ready to be applied to data from time t + 1. See Algorithm 1 for the step-wise procedure and the following sub-sections for details.

3.2.1 Joint Shift Estimation (Step 1). When there is dissimilarity between two consecutive distributions with respect to the correlation of the sensitive and the label variables, a web classification model trained on a certain distribution is not guaranteed to perform well when this distribution changes. This change can be dynamic, which adversely affects a previously trained model. To further explain this phenomenon and to estimate this type of shift, we follow the label shift estimation assumptions but extend them to include the sensitive variable [25].

We assume access to the fully observed training distribution at time t, Pt(A, X, Y). We can empirically derive the joint distribution of the sensitive variable A and the label Y, Pt(A, Y) and the conditional distribution Pt(X|A, Y) to learn a Bayes optimal classifier Pt(Y|A, X) in the training distribution [21]:

\begin{equation} P_{t}(Y|A, X) \propto {P_{t}(A, Y) P_{t}(X|A, Y)} \end{equation}
(1)

In the presence of a joint shift over time, Eq. (1) might not perform well on the deployment distribution since we assume the shift to be in the following form:

\begin{equation} P_{t}(A, Y) \ne P_{t+1}(A, Y) \end{equation}
(2)

whereas

\begin{equation} P_{t}(X |A, Y) = P_{t+1}(X| A, Y) \end{equation}
(3)

Eq. (2) states that a joint shift has occurred over time, which causes a performance drop for models previously trained on a different distribution. Specifically, consider a fair model that utilizes this information in its prediction task: a model trained when the joint probability of a disadvantaged targeted group and the positive label was high (e.g., P(A = "Asian", Y = "Toxic") during COVID) may over-predict this combination once that joint probability has decreased in the post-COVID era. If this change is not accounted for, a web-based toxicity classifier would flag any text mentioning or associated with Asian identity as toxic, resulting in a higher false positive rate for the Asian group. Eq. (3) states that the conditional distribution remains constant; for example, the writing style does not change between time t and t + 1 with regard to the sensitive attribute and the class label. Lastly, to account for the change in Eq. (2) and to enhance performance at time t + 1, adapting the model with respect to the following ratio is necessary to mitigate the effects of the joint shift [21]:

\begin{equation} w^{DA} = \frac{P_{t+1} (A, Y)}{P_{t} (A, Y)} \end{equation}
(4)

This ratio gives a higher importance weight to those groups that are more represented in the deployment distribution than in the training distribution. However, the joint distribution Pt + 1(A, Y) is not available at training time; thus, it needs to be estimated. Almuzaini et al. [3] provided an estimate using the average of the past k joint distributions. A more effective way to obtain this estimate, especially when there is variation over time (e.g., pre/post COVID), is to take advantage of the unlabeled data available at time t + 1 (i.e., (A, X)) to derive a better estimate $\hat{P}_{t+1}(A, Y)$. To obtain this estimate, we implement a Probabilistic Adjusted Classify and Count (PACC) algorithm that was initially introduced for the label shift problem, and we extend it to account for joint shift estimation [8].
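As an illustration, the sketch below shows one way such a PACC-style estimate of P_{t+1}(A, Y) could be computed from the unlabeled batch: within each group, the average posterior on the target batch is corrected using the classifier's expected posteriors on labeled data from time t (following Bella et al. [8]). The per-group extension and the function names are our assumptions; the exact procedure is in the released repository.

```python
import numpy as np

def estimate_joint_next(clf, A_val, X_val, Y_val, A_next, X_next):
    """Estimate P_{t+1}(A=a, Y=y) using only (A, X) from the t+1 batch."""
    joint = {}
    p_val = clf.predict_proba(np.column_stack([A_val, X_val]))[:, 1]
    p_tgt = clf.predict_proba(np.column_stack([A_next, X_next]))[:, 1]
    for a in np.unique(A_val):
        # Expected posteriors per true class, on held-out labeled data from time t
        tpr_pa = p_val[(A_val == a) & (Y_val == 1)].mean()  # E[p(y=1|x) | y=1, a]
        fpr_pa = p_val[(A_val == a) & (Y_val == 0)].mean()  # E[p(y=1|x) | y=0, a]

        # Average posterior on the unlabeled target batch for the same group
        mask_next = (A_next == a)
        p_bar = p_tgt[mask_next].mean()

        # PACC correction, clipped to a valid probability
        prev_y1 = np.clip((p_bar - fpr_pa) / (tpr_pa - fpr_pa + 1e-12), 0.0, 1.0)
        joint[(a, 1)] = prev_y1 * mask_next.mean()          # P(A=a) observed at t+1
        joint[(a, 0)] = (1.0 - prev_y1) * mask_next.mean()
    return joint
```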

3.2.2 Joint Shift Alignment (Step 2). Using the PACC algorithm, we can estimate this ratio (Eq. 4) and mitigate the joint shift by either: (1) re-sampling or re-weighing the training distribution with respect to this shift, so that $P_{t}(A,Y) \approx \hat{P}_{t+1}(A, Y)$, and then re-training on the newly sampled distribution, or (2) correcting the learned web classifier Pt(Y|A, X) with respect to this ratio to compute the new model for time t + 1 as:

\begin{equation} P_{t+1}(Y|A, X) \propto \frac{\hat{P}_{t+1} (A, Y)}{P_{t} (A, Y)} P_{t}(Y|A, X) \end{equation}
(5)
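The snippet below sketches option (1): the estimated ratio of Eq. 4 is converted into per-instance importance weights that can be passed as sample weights when re-training the classifier (e.g., via scikit-learn's `fit(..., sample_weight=...)`). Here `joint_next` is assumed to be the output of the estimator sketched in Section 3.2.1; the helper name is hypothetical.

```python
import numpy as np

def da_sample_weights(A_t, Y_t, joint_next):
    """Per-instance importance weights w^DA = P̂_{t+1}(a, y) / P_t(a, y) (Eq. 4)."""
    w = np.ones(len(Y_t))
    for a in np.unique(A_t):
        for y in (0, 1):
            mask = (A_t == a) & (Y_t == y)
            p_t = mask.mean()                 # empirical P_t(a, y) from batch B_t
            if p_t > 0:
                w[mask] = joint_next[(a, y)] / p_t
    return w

# Usage sketch: clf.fit(np.column_stack([A_t, X_t]), Y_t,
#                       sample_weight=da_sample_weights(A_t, Y_t, joint_next))
```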

3.2.3 Bias Mitigation Strategy (Step 3). To mitigate discrimination in the training distribution, we adopt a popular pre-processing method [23]. The Re-weighing (RW) method assigns a different weight wRW to each sub-population according to its representation in the current batch t, which helps the model give equal importance (on aggregate) to different demographic groups. Unfortunately, this method only addresses bias in the current batch and does not capture the natural difficulty of correctly classifying each group over time. Thus, we propose a dynamic re-weighing method (DRW) that addresses this issue by giving more fairness weight to those groups that have historically been misclassified at a higher rate by the previous temporal models. Specifically, we define a set of misclassification rates for each group from the previous timestamp (i.e., per-group error rates):

\begin{equation} E_{t-1} = \lbrace P(\hat{Y} \ne Y | Y=y, A=a) \rbrace \end{equation}
(6)

 ∀y ∈ {y+, y−} and ∀a ∈ {a+, a−}. The per-group error rate examines the FPR and FNR for each subgroup (y, a). Finally, we combine the fairness weight wRW at time t and Et − 1 from time t − 1 to obtain a new dynamic fairness group weight:

\begin{equation} w_{t}^{DRW} = w_{t}^{RW} \times e^{E_{t-1}} \end{equation}
(7)

where sub-groups with higher error rates receive more attention, thus addressing fairness and accuracy jointly.
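For concreteness, the following sketch combines the standard re-weighing weights of Kamiran and Calders [23] with the exponential of the previous per-group error rates (Eq. 7). Representing E_{t-1} as a dictionary keyed by (a, y) is an illustrative assumption about the bookkeeping, not the paper's exact implementation.

```python
import numpy as np

def dynamic_group_weights(A_t, Y_t, prev_error):
    """w^DRW(a, y) = w^RW(a, y) * exp(E_{t-1}(a, y))  (Eq. 7).

    w^RW(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y) is the standard re-weighing weight [23];
    prev_error maps (a, y) to that group's misclassification rate at time t-1.
    """
    w = np.ones(len(Y_t))
    for a in np.unique(A_t):
        for y in (0, 1):
            mask = (A_t == a) & (Y_t == y)
            if mask.any():
                w_rw = ((A_t == a).mean() * (Y_t == y).mean()) / mask.mean()
                w[mask] = w_rw * np.exp(prev_error.get((a, y), 0.0))
    return w
```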

3.2.4 Temporal Ensembling (Step 4). Preserving past trained models helps in building a robust web classifier, since each model has been trained on a different distribution; thus we can exploit the previous models and their past performance (i.e., how fair/accurate each model was) to build a temporal ensemble.1 Finally, the prediction for Bt + 1 is: $\hat{Y}_{t+1} = \sum _{i=1} ^{t} d_i \, f_{i}(B_{t+1})$, where di is the model weight determined from each historical model's performance on a validation set sampled from the current time instance. We primarily derive these weights based on how fair each model is with respect to the demographic parity metric.
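A minimal sketch of this ensembling step is shown below. Here each past model's weight d_i is derived from its demographic parity gap on the current validation set and normalized to sum to one; this is one plausible instantiation of the weighting described above, not necessarily the exact released implementation.

```python
import numpy as np

def temporal_ensemble_predict(models, val_dp_gaps, AX_next, threshold=0.5):
    """Weighted average of past models' scores, weighted by validation fairness.

    models       : list of fitted classifiers f_1, ..., f_t with predict_proba
    val_dp_gaps  : demographic parity gap of each model on the current validation set
    AX_next      : stacked [A, X] features for the batch B_{t+1}
    """
    d = np.array([1.0 - gap for gap in val_dp_gaps])   # fairer models weigh more
    d = d / d.sum()
    scores = sum(d_i * m.predict_proba(AX_next)[:, 1] for d_i, m in zip(d, models))
    return (scores >= threshold).astype(int)
```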


4 EXPERIMENT

4.1 Setup

To evaluate the performance of a web classification task under temporal shifts, the typical cross-validation approach no longer suffices [1, 39]. Therefore, we use a temporal evaluation setting, training and evaluating on batches sampled from different timestamps [14]. For all methods, we start training at the second time instance and test forward, since the ABC baseline [3] must reserve the first two sets for estimating the upcoming distributions (i.e., ABC utilizes the average of the previous two distributions as the anticipated distribution for the current time instance). We experiment with a linear classifier (i.e., logistic regression) with default parameters due to its interpretability. Lastly, for the web text-classification tasks (i.e., the ClaimBuster and CivilComments datasets), we represent each textual document as a 100-dimensional feature vector extracted using a pre-trained GloVe embedding model [33]. A sketch of this feature pipeline appears below.
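The following sketch illustrates such a feature pipeline under stated assumptions: documents are represented by averaged 100-dimensional GloVe vectors (here the glove-wiki-gigaword-100 model from gensim, which may differ from the exact embeddings used in our experiments) and fed, together with the sensitive attribute, to a default scikit-learn logistic regression.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Any pre-trained 100-d GloVe model would do; this particular one is an assumption.
glove = api.load("glove-wiki-gigaword-100")

def doc_vector(text):
    """Mean of the GloVe vectors of in-vocabulary tokens (zeros if none match)."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)

# Usage sketch (texts_t, A_t, Y_t are the documents, groups, and labels of batch B_t):
# X_t = np.vstack([doc_vector(t) for t in texts_t])
# clf = LogisticRegression().fit(np.column_stack([A_t, X_t]), Y_t)
```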

Table 2: Datasets summary

Dataset Period No. Samples Sensitive Target
ACS 2010-2018 1,663,672 Race Income >=50K?
ClaimBuster 1960-2016 18,596 Political Check worthy?
CivilComments 2016-2017 447,875 Identity Toxic?

4.2 Datasets

A dataset suitable for our analysis needs to include timestamp metadata, a target variable, and a sensitive attribute. We must first acknowledge the relative dearth of temporally varying datasets in the fairness in machine learning literature [3]. In particular, we use one dataset (American Community Survey), which is considered an algorithmic fairness benchmark, and two other datasets that are directly relevant to the web science literature and meet the temporal shift, delayed labeling, and sensitive attribute criteria.

American Community Survey (ACS) [15]: a successor to the well-known algorithmic fairness benchmark, the 1994 Adult dataset [7]. Considering the temporal shift aspect, we utilize a newly released version in which samples are collected between 2010 and 2018 and span US states. We build a model for the ACS Income prediction task, i.e., predicting whether an individual earns more than 50K US dollars based on attributes such as education level and expertise. We utilize a subset of the dataset focusing on the state of New York (NY), and we examine racial bias (i.e., White vs. Black) in the algorithmic predictions. We address the temporal shift by evaluating the model yearly.

ClaimBuster (CB) [4]: a misinformation-related dataset consisting of 18,596 statements extracted from U.S. general election presidential debates (1960-2016) and annotated with three categories: non-factual, unimportant factual, and check-worthy factual statements. It also provides metadata associated with each statement, such as the speaker's name and political party. The dataset has been used in web-related applications such as building automated fact-checking systems and fake news/misinformation detection tools [30]. As identifying factual statements is an important step in such analyses, we focus here on the task of distinguishing check-worthy statements from non-factual statements. We consider the political leaning of the politician (Republican or Democrat) as the sensitive attribute, and exclude speakers who are considered Independent. We examine the temporal shifts on a yearly basis.

CivilComments (CC) [9]: a dataset compiled to examine the toxicity of user-generated comments posted on the CivilComments forum. Each comment is labeled for toxicity by human annotators (i.e., toxic vs. non-toxic). Additional annotation includes the demographic identities mentioned within the comments (e.g., black, christian, gay). Following [35], we utilize a coarse version of this dataset and consider the presence of demographic identities in the comment text to be the sensitive attribute. In effect, the presence (respectively, absence) of demography-identifying descriptors should not change the toxicity detection performance. We use the timestamp metadata to examine the temporal shifts on a monthly basis, starting from 2016-01 and ending in 2017-11, thus experimenting with a total of 23 months.

Additionally, we control the number of samples in each batch by using 5,000 samples for the ACS and CivilComments datasets and 3,000 samples for the ClaimBuster dataset, due to its limited number of samples. We also ensure that the probabilities P(Y), P(A), and P(A, Y) remain consistent with the original distribution for each timestamp in this sampling procedure (a sketch appears below). Lastly, we evaluate the results over 20 different runs and report the averages and standard deviations. A summary of the datasets is shown in Table 2.
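The following sketch shows one way such a sampling control could be implemented: a fixed-size batch is drawn so that the proportions of each (A, Y) cell, and hence P(A), P(Y), and P(A, Y), match the source data for that timestamp. The function name and rounding scheme are illustrative assumptions.

```python
import numpy as np

def stratified_batch(A, Y, n_samples, rng=None):
    """Draw indices for a batch whose (A, Y) cell proportions match the source data."""
    rng = rng or np.random.default_rng(0)
    idx = []
    for a in np.unique(A):
        for y in np.unique(Y):
            cell = np.flatnonzero((A == a) & (Y == y))
            k = int(round(n_samples * len(cell) / len(Y)))   # preserve P(A=a, Y=y)
            idx.extend(rng.choice(cell, size=min(k, len(cell)), replace=False))
    return np.array(idx)
```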


4.3 Model Evaluation

For the overall performance, we use the Area Under the ROC Curve (AUROC), since it is relatively robust to class imbalance. To assess model unfairness, we utilize three bias measurements that are common in the literature and suitable for toxicity and misinformation tasks [9, 26].

Demographic Parity (DP) [16] measures the group disparity of being assigned to a positive class. This measure ensures that the model predictions are statistically independent of the sensitive attributes:

\begin{equation} DP = | P(\hat{Y} = \hat{y}^{+} | A = a^{+}) - P(\hat{Y} = \hat{y}^{+} | A = a^{-})| \end{equation}
(8)

Equalized Odds (EOD) [19] accounts for the true label Y and measures the disparity in the False Negative Rate (FNR) and the False Positive Rate (FPR) (i.e., Type II and Type I errors) across groups. Hence, it requires equal FNR and FPR for individuals from different demographic groups:

\begin{equation} \Delta \,FNR = |P(\hat{Y} = \hat{y}^{-} | Y = y^{+}, A = a^{+}) - P(\hat{Y} = \hat{y}^{-} | Y = y^{+}, A = a^{-})| \end{equation}
(9)

\begin{equation} \Delta \, FPR = |P(\hat{Y} = \hat{y}^{+} | Y = y^{-}, A = a^{+}) - P(\hat{Y} = \hat{y}^{+} | Y = y^{-}, A = a^{-})| \end{equation}
(10)

Distance to optimal (DTO) [18] examines the performance trade-off by providing a single number that can be used to compare alternative models/approaches. It is based on the idea of an “optimal point”, the point corresponding to the highest accuracy and fairness scores (i.e., 100% accurate and 100% fair), which marks a theoretical goal for any model to strive for. DTO simply calculates the Euclidean distance to the optimal point, and the model closer to the optimal point is considered better. We use DTO with respect to accuracy (AUROC) and each bias measure. A small sketch of these metrics follows.
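For reference, the sketch below computes AUROC, DP, ΔFPR, ΔFNR (Eqs. 8-10), and DTO for a binary task with a binary sensitive attribute. It is a straightforward reading of the definitions above, not the paper's evaluation code; function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_metrics(y_true, y_pred, y_score, a):
    """DP, ΔFPR, ΔFNR and AUROC for binary labels y and binary groups a."""
    def rate(pred_val, cond):
        return (y_pred[cond] == pred_val).mean() if cond.any() else 0.0
    g1, g0 = (a == 1), (a == 0)
    dp   = abs(rate(1, g1) - rate(1, g0))                                  # Eq. 8
    dfpr = abs(rate(1, g1 & (y_true == 0)) - rate(1, g0 & (y_true == 0)))  # Eq. 10
    dfnr = abs(rate(0, g1 & (y_true == 1)) - rate(0, g0 & (y_true == 1)))  # Eq. 9
    return {"AUROC": roc_auc_score(y_true, y_score),
            "DP": dp, "dFPR": dfpr, "dFNR": dfnr}

def dto(auroc, bias):
    """Euclidean distance to the optimal point (accuracy = 1, bias = 0)."""
    return float(np.hypot(1.0 - auroc, bias))
```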

5 RESULTS AND DISCUSSION

To validate the proposed approach, we consider three different datasets as described in Section 4.2. We also experiment with two (static) baselines that do not consider the non-stationarity aspect.

  • (1) Empirical Risk Minimization (ERM), a model that is trained once on the first training distribution (i.e., t) and never gets updated [41]. This is by far the most common approach currently used for web algorithm deployments.
  • (2) Re-weighing (RW) method is a static bias mitigation strategy. (RW) is similar to (ERM) but with additional fairness constraints enforced at time t [23].

We also consider two other dynamic baselines, where (a) the model is retrained at every timestamp, and (b) certain properties of the next batch t + 1 are anticipated and used for building a decision-making model.

  • (3) Domain Adaptation (DA) [40] utilizes the change in joint shift (Eq. 4) to obtain the potential accuracy gains but does not focus on bias reduction.
  • (4) The Anticipatory Bias Correction (ABC) approach [3] utilizes the recent past joint distributions to anticipate the future joint shift and create a bias mitigation strategy, but it neither utilizes the unlabeled data (Bt + 1) for estimating the joint shift nor applies a domain adaptation approach for ensuring accuracy.

We start with localized evaluation of design choices (e.g., need for temporal shift adaption, the value of sophisticated estimation approaches) in Section 5.1, followed by aggregated assessment of the overall approach in Section 5.2.

5.1 Temporal Data Shift Assessment and its Anticipation

5.1.1 Data distributions shift over time. To examine the joint shift that occurs naturally in each dataset, we compute the Jensen-Shannon divergence (JSD) between the joint distribution derived from samples of the first timestamp and each subsequent distribution. To recap, joint shift refers to the dissimilarity between two distributions with respect to the correlation between the sensitive variable and the output variable, where a JSD of 0 implies no joint shift and 1 indicates a maximal shift. As shown in Fig. 2, the ACS dataset has an increasing joint shift divergence over time, indicated by the darker colors towards later timestamps. On the other hand, the ClaimBuster and CivilComments datasets show fluctuating rather than monotonic patterns. Overall, we see clear evidence that data distributions evolve over time, and a train-once, deploy-forever approach (as used by ERM and RW) is unlikely to work well in practical settings. A sketch of this shift measure appears below.
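The following sketch computes this base-2 Jensen-Shannon divergence between the empirical joint distributions P(A, Y) of two batches. It is a minimal illustration of the measure, assuming binary A and Y; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import entropy

def joint_jsd(A1, Y1, A2, Y2):
    """JSD (base 2, in [0, 1]) between empirical P(A, Y) of two batches."""
    def joint(A, Y):
        return np.array([np.mean((A == a) & (Y == y))
                         for a in (0, 1) for y in (0, 1)])
    p, q = joint(A1, Y1), joint(A2, Y2)
    m = 0.5 * (p + q)
    # JSD = 0.5 * KL(p || m) + 0.5 * KL(q || m), with base-2 logarithms
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)
```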

Hence, we move the attention to dynamic models where there is (a) retraining over time and (b) anticipation of macro properties of the next timestamp.

5.1.2 Anticipation performance of different approaches. A good anticipation mechanism should be able to create estimates that are close to the actual observations for the next time period. Here, we examine the joint distribution estimation errors that arise from the abovementioned joint shift in the data. In Fig. 3, we report the JSD between the actual and the estimated joint distributions across the three datasets to compare the simple estimation method used by ABC and the more robust method utilized by the DA method (Section 3.2.1). Across different timestamps and datasets, we observe a consistent trend: the DA-based estimation was better, i.e., had a lower estimation error, compared to ABC. We note that ABC is the only anticipation approach discussed in the prior fairness literature, and the DA approach has not yet been applied to support fairness objectives. Hence, this motivates the combination of DA with fairness adaptation as adopted in the proposed approach.


Table 3: Results of the mean and std dev (%) calculated across all timestamps along with other primary fairness evaluation and the DTO metrics.

Dataset Metrics ERM RW DA ABC FAIRCAST (Ours)
ACS AUROC ↑ 78.2 ± 0.7 78.1 ± 0.7 78.6 ± 0.6 78.5 ± 0.7 78.8 ± 0.6
DP ↓ 15.5 ± 1.7 (0.066) 12.0 ± 2.7 (0.029) 16.9 ± 2.1 (0.082) 12.9 ± 1.8 (0.037) 9.5 ± 2.0 (0.000)
Δ FPR ↓ 7.7 ± 1.3 (0.045) 5.6 ± 1.7 (0.024) 9.5 ± 1.7 (0.062) 6.2 ± 1.5 (0.028) 3.5 ± 1.7 (0.000)
Δ FNR ↓ 17.6 ± 3.1 (0.116) 11.7 ± 3.6 (0.053) 17.6 ± 3.8 (0.116) 12.0 ± 3.2 (0.056) 6.9 ± 2.8 (0.000)
ClaimBuster (CB) AUROC ↑ 84.3 ± 1.0 84.4 ± 1.0 85.0 ± 1.0 85.5 ± 0.9 86.4 ± 0.8
DP ↓ 6.4 ± 2.2 (0.025) 7.1 ± 1.9 (0.028) 11.4 ± 3.5 (0.064) 7.2 ± 2.3 (0.021) 5.6 ± 2.1 (0.000)
Δ FPR ↓ 2.7 ± 1.6 (0.024) 3.1 ± 1.5 (0.024) 6.8 ± 2.9 (0.045) 4.1 ± 2.0 (0.018) 4.0 ± 2.0 (0.013)
Δ FNR ↓ 10.0 ± 3.8 (0.034) 11.7 ± 3.9 (0.049) 12.9 ± 4.7 (0.059) 8.3 ± 3.3 (0.013) 7.7 ± 3.4 (0.000)
CivilComments (CC) AUROC ↑ 71.6 ± 1.4 69.7 ± 1.6 77.8 ± 1.2 77.4 ± 1.1 78.8 ± 1.0
DP ↓ 28.1 ± 3.4 (0.238) 8.5 ± 2.3 (0.115) 31.3 ± 4.4 (0.255) 8.2 ± 2.3 (0.017) 7.9 ± 1.8 (0.000)
Δ FPR ↓ 25.9 ± 3.4 (0.238) 6.5 ± 2.2 (0.116) 28.2 ± 4.5 (0.244) 5.6 ± 2.2 (0.018) 5.0 ± 1.8 (0.000)
Δ FNR ↓ 25.4 ± 5.7 (0.239) 5.2 ± 3.5 (0.115) 28.7 ± 6.5 (0.255) 5.4 ± 3.5 (0.021) 4.3 ± 3.0 (0.000)

5.2 Overall Assessment of FAIRCAST

We now proceed to discuss the overall performance of the proposed FAIRCAST approach in terms of both fairness and accuracy. We apply the FAIRCAST method and the baselines on the ACS, ClaimBuster and CivilComment datasets and report the performance for each timestamp (Figure 4) and on an aggregated basis (Table 3) with various evaluation metrics. As can be seen in the figures, FAIRCAST obtains some of the highest AUROC scores. The trend is largely consistent over time and this yields the highest average AUROC scores as documented in Table 3.

Across the three datasets, we clearly see that for the basic baselines (ERM and RW), accuracy tends to decrease over time (see Figs. 4 (a, e, i)). On the other hand, since ABC, DA, and FAIRCAST have temporal adaptation and anticipation components, they are less susceptible to temporal shifts. FAIRCAST tends to be, broadly, the most accurate method across the three datasets and yields the best performance on aggregate (see Fig. 4 and Table 3).

From the fairness perspective, a consistent trend shows that ERM and DA are the most susceptible to bias, as they mainly optimize for accuracy, whereas RW, ABC, and FAIRCAST tend to be fairer. FAIRCAST consistently yields the least biased (i.e., most fair) results, as seen in Figs. 4 (b, c, d) for the ACS dataset. Similarly, for the ClaimBuster and CivilComments datasets, the FAIRCAST method tends to have lower bias scores on average compared to the other baselines (see Fig. 4 and Table 3).

One of the key goals of the proposed approach is to optimize for accuracy and fairness simultaneously. Hence, we compare the accuracy-fairness trade-off in Figure 5. We plot the AUROC against each bias measurement (DP, Δ FPR, and Δ FNR), representing each method by its mean across all timestamps (the means and standard deviations are also provided in Table 3). Fig. 5 shows that the results for the FAIRCAST method (shown with a plus shape in black) tend to lie in the top-left corner, indicating a better trade-off: higher accuracy (toward the top) and lower bias (toward the left) compared to other approaches. FAIRCAST strictly dominates the compared methods in eight of the nine scenarios and lies on the Pareto-optimal curve in the remaining one [22].

Moving beyond the proposed approach, we note that (ERM) and (DA) are most susceptible to bias as they do not consider the bias aspects. We see that (DA) and (ABC) have higher accuracy compared to (ERM) as they focus on anticipating the future performance. Lastly (ABC) typically does better than other baselines but is worse than the proposed approach.

In Table 3, we summarize Figures 4 and 5 by reporting the mean and standard deviation for each dataset across the main metrics. The table further reports the DTO as the trade-off metric between pairs of accuracy and fairness metrics (e.g., the DTO for AUROC and DP on ACS is 0.000). Consistent with the earlier trends, we notice that the FAIRCAST approach yields the best aggregate performance on 11 of the 12 considered metrics. These results demonstrate the ability of the proposed FAIRCAST approach to achieve both fairness and accuracy in the considered settings.

5.3 Implications for Web Science Literature

The proposed FAIRCAST approach has several implications for web science, in terms of both research and practice. The approach can enhance the quality and equity of web-based content analysis, which is a core task in web science. The proposed approach can help create better versions of web content analysis algorithms such as those for labeling toxicity, misinformation, or spam.

Further, it facilitates models' swift adaptation to the web's dynamic nature, a major web science challenge. Leveraging early-available unlabeled data, our approach proactively estimates label distributions and updates models despite label delays, which would otherwise render models obsolete as the data stream progresses.

Overall, consistent with this year's web science conference theme of reflecting on the web, AI, and society, this work shines a spotlight on the ethical and social aspects of AI techniques used in web science, and underscores the need to consider social dimensions when developing algorithms for web data analysis.

5.4 Limitations

The proposed approach relies on the assumption that the future distribution of labels can be estimated from the unlabeled data that become available earlier than the labels. However, this assumption may not always hold. Similarly, the proposed approach assumes batch processing with a sufficient amount of unlabeled data to estimate the future distribution of labels. Further, we have not addressed other types of distribution shift that might co-exist with joint shift, such as covariate or conditional shifts [13]. Lastly, the proposed approach does not consider other aspects of web-based content analysis that are relevant to web science, such as explainability, transparency, accountability, and user feedback.

Despite the limitations, this work marks an important first step toward achieving anticipatory fairness in delayed labeling settings and will hopefully encourage more work on temporal, and especially anticipatory bias correction in diverse web related tasks.

6 CONCLUSION

In this paper, we have presented a new approach to address the challenges of temporal shifts and delayed labeling in algorithmic fairness settings. Our approach is particularly relevant in the practical setting of web science, where distributions vary over time and unlabeled data is much more plentiful and accessible than labeled data. By estimating future distributions based on available unlabeled data, our method enables preemptive model adaptation, ensuring continued effectiveness in terms of fairness and accuracy. The robustness of our approach is evidenced by its strong performance, which consistently surpasses other baselines, for fairness, accuracy, and the trade-off between them.

Significantly, our approach has the potential to anticipate and prevent algorithmic harm in real-world situations marked by temporal changes, including natural shifts as well as sudden transitions such as pandemics, which can impact different societal groups disproportionately. Our method proactively navigates such temporal shifts, thereby safeguarding the integrity of web-based content analysis. It sets a strong benchmark for algorithmic fairness, helping to ensure that our digital future is shaped by equitable and unbiased analytical practices.

ACKNOWLEDGMENTS

This research is partially supported by funding from Rutgers University under the SC&I Scholarly Futures Program.

REFERENCES

  • Oshin Agarwal and Ani Nenkova. 2022. Temporal Effects on Pre-trained Models for Language Processing Tasks. Transactions of the Association for Computational Linguistics 10 (2022), 904–921. https://doi.org/10.1162/tacl_a_00497
  • Abdullah Almaatouq, Ahmad Alabdulkareem, Mariam Nouh, Erez Shmueli, Mansour Alsaleh, Vivek K. Singh, Abdulrahman Alarifi, Anas Alfaris, and Alex Pentland. 2014. Twitter: who gets caught? observed trends in social micro-blogging spam. In Proceedings of the 2014 ACM Conference on Web Science. 33–41.
  • Abdulaziz A. Almuzaini, Chidansh A. Bhatt, David M. Pennock, and Vivek K. Singh. 2022. ABCinML: Anticipatory Bias Correction in Machine Learning Applications. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 1552–1560.
  • Fatma Arslan, Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2020. A benchmark dataset of check-worthy factual claims. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 14. 821–829.
  • Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61.
  • Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning: Limitations and Opportunities. http://www.fairmlbook.org.
  • Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5XW20.
  • Antonio Bella, Cesar Ferri, José Hernández-Orallo, and Maria Jose Ramirez-Quintana. 2010. Quantification via probability estimators. In 2010 IEEE International Conference on Data Mining. IEEE, 737–742.
  • Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of the World Wide Web Conference. 491–500.
  • Zhipeng Cai, Ozan Sener, and Vladlen Koltun. 2021. Online continual learning with natural distribution shifts: An empirical study with visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8281–8290.
  • Alessandro Castelnovo, Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica, and Andrea Cosentini. 2021. Towards fairness through time. In European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 647–663.
  • Ilias Chalkidis and Anders Søgaard. 2022. Improved Multi-label Classification under Temporal Concept Drift: Rethinking Group-Robust Algorithms in a Label-Wise Setting. In Findings of the Association for Computational Linguistics: ACL 2022. 2441–2454.
  • Lingjiao Chen, Matei Zaharia, and James Y. Zou. 2022. Estimating and explaining model performance when both covariates and labels shift. Advances in Neural Information Processing Systems 35 (2022), 11467–11479.
  • Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, et al. 2023. Prospective Learning: Principled Extrapolation to the Future. In Conference on Lifelong Learning Agents. PMLR, 347–357.
  • Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. Retiring Adult: New Datasets for Fair Machine Learning. arXiv preprint arXiv:2108.04884 (2021).
  • Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 214–226.
  • Maciej Grzenda, Heitor Murilo Gomes, and Albert Bifet. 2020. Delayed labelling evaluation for data streams. Data Mining and Knowledge Discovery 34, 5 (2020), 1237–1266.
  • Xudong Han, Timothy Baldwin, and Trevor Cohn. 2021. Balancing out Bias: Achieving Fairness Through Balanced Training. arXiv preprint arXiv:2109.08253 (2021).
  • Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016), 3315–3323.
  • Christina X. Ji, Ahmed M. Alaa, and David Sontag. 2023. Large-Scale Study of Temporal Shift in Health Insurance Claims. In Conference on Health, Inference, and Learning. PMLR, 243–278.
  • Jing Jiang. 2008. A literature survey on domain adaptation of statistical classifiers. URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey 3, 1-12 (2008), 3.
  • Yaochu Jin and Bernhard Sendhoff. 2008. Pareto-based multiobjective machine learning: An overview and case studies. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 3 (2008), 397–415.
  • Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
  • David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. The parable of Google Flu: traps in big data analysis. Science 343, 6176 (2014), 1203–1205.
  • Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning. PMLR, 3122–3130.
  • Karima Makhlouf, Sami Zhioua, and Catuscia Palamidessi. 2020. On the applicability of ML fairness notions. arXiv preprint arXiv:2006.16745 (2020).
  • Lydia Manikonda, Mee Young Um, and Rui Fan. 2022. Shift of User Attitudes about Anti-Asian Hate on Reddit Before and During COVID-19. In Proceedings of the 14th ACM Web Science Conference 2022. 364–369.
  • Binny Mathew, Anurag Illendula, Punyajoy Saha, Soumya Sarkar, Pawan Goyal, and Animesh Mukherjee. 2020. Hate begets hate: A temporal study of hate speech. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (2020), 1–24.
  • Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
  • Taichi Murayama. 2021. Dataset of fake news detection and fact verification: a survey. arXiv preprint arXiv:2111.03299 (2021).
  • Rahul Pandey, Carlos Castillo, and Hemant Purohit. 2019. Modeling human annotation errors to design bias-aware systems for social stream processing. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. 374–377.
  • Jinkyung Park, Rahul Dev Ellezhuthil, Joseph Isaac, Christoph Mergerson, Lauren Feldman, and Vivek Singh. 2023. Misinformation Detection Algorithms and Fairness across Political Ideologies: The Impact of Article Level Labeling. In Proceedings of the 15th ACM Web Science Conference 2023. 107–116.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
  • Dana Pessach and Erez Shmueli. 2023. Algorithmic fairness. In Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook. Springer, 867–886.
  • Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2019. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731 (2019).
  • Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, et al. 2022. Maintaining fairness across distribution shift: do we have viable solutions for real-world applications? arXiv preprint arXiv:2202.01034 (2022).
  • Shreya Shankar, Bernease Herman, and Aditya G. Parameswaran. 2022. Rethinking streaming machine learning evaluation. arXiv preprint arXiv:2205.11473 (2022).
  • Harvineet Singh, Rina Singh, Vishwali Mhasawade, and Rumi Chunara. 2021. Fairness violations and mitigation under covariate shift. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 3–13.
  • Anders Søgaard, Sebastian Ebert, Jasmijn Bastings, and Katja Filippova. 2021. We Need To Talk About Random Splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational Linguistics, Online, 1823–1832. https://doi.org/10.18653/v1/2021.eacl-main.156
  • Qingyao Sun, Kevin Murphy, Sayna Ebrahimi, and Alexander D'Amour. 2022. Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. arXiv preprint arXiv:2211.15646 (2022).
  • Vladimir N. Vapnik. 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10, 5 (1999), 988–999.
  • Xiaomeng Wang, Yishi Zhang, and Ruilin Zhu. 2022. A brief review on algorithmic fairness. Management System Engineering (2022). https://link.springer.com/article/10.1007/s44176-022-00006-z
  • Zeyu Wang, Klint Qinami, Ioannis Christos Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. 2020. Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8919–8928.
  • Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh, and Chelsea Finn. 2022. Wild-time: A benchmark of in-the-wild distribution shift over time. arXiv preprint arXiv:2211.14238 (2022).
  • Zhiqi Yu, Jingjing Li, Zhekai Du, Lei Zhu, and Heng Tao Shen. 2023. A Comprehensive Survey on Source-free Domain Adaptation. arXiv preprint arXiv:2302.11803 (2023).
  • Yuji Zhang, Jing Li, and Wenjie Li. 2023. VIBE: Topic-driven temporal adaptation for Twitter classification. arXiv preprint arXiv:2310.10191 (2023).
  • Indrė Žliobaitė. 2010. Change with delayed labeling: When is it detectable?. In 2010 IEEE International Conference on Data Mining Workshops. IEEE, 843–850.

FOOTNOTE

1An alternative (but expensive) approach is to keep the old data and use them for weighted re-training. Unfortunately, web-related data tend to be high-dimensional and voluminous. Therefore, we keep only the models' parameters for efficiency.

This work is licensed under a Creative Commons Attribution 4.0 International License.

WEBSCI '24, May 21–24, 2024, Stuttgart, Germany

© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0334-8/24/05.
DOI: https://doi.org/10.1145/3614419.3644028
