What happens when data is excluded?
The University of California system made waves in 2021 when it announced it would no longer use SAT and ACT scores in admissions decisions. The aim of this change was to evaluate student applications more fairly.
In a world in which algorithms (step-by-step instructions for accomplishing a task) drive decisions, do policies that limit the data available to an algorithm actually increase fairness? And what happens to the algorithm’s accuracy?
In a working paper, Northwestern’s Annie Liang, UCLA’s Jay Lu and Princeton’s Xiaosheng Mu study the trade-off between fairness and accuracy for algorithms. They focus on how the design of the data input to an algorithm can influence decisions.
Fairness at the Cost of Accuracy
Accuracy, measured by the overall error rate of an algorithm's predictions, used to be users' main concern when it came to algorithms. But that was before algorithms were applied to society's high-stakes decisions: who receives bail, who gets medical treatment, a loan, or even a job. Algorithms making these decisions can have different error rates for different races, genders, income brackets, and so on. For this reason, algorithm designers today may face a trade-off: give up some of an algorithm's overall accuracy in order to increase fairness across groups. (Fairness here is defined as balanced error rates across all groups.)
To better understand how the data input into an algorithm can cause unequal errors across social groups, consider an algorithm that decides whether a patient needs medical treatment. The algorithm was likely trained on several variables, including the patient's age, past medical conditions, and the frequency of their past doctor visits. But that last variable can lead to different predictions depending on a patient's income. For a wealthy patient, a low number of past doctor visits suggests that the patient has been healthy in the past. A low-income patient, however, might not have been able to afford treatment, so a low number of past doctor visits does not necessarily indicate good health. An algorithm using this variable as an input is likely to make more prediction errors for low-income patients than for wealthier ones.
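The skew in this example can be made concrete with a small sketch. The numbers below are hypothetical, invented for illustration (they do not come from the paper): the same input, past doctor visits, produces very different error rates for the two income groups under a single decision rule.

```python
# Toy illustration (hypothetical numbers): the same input -- past doctor
# visits -- carries different meaning for wealthy vs. low-income patients.

# Each record: (past_visits, truly_needs_treatment)
wealthy = [(1, False), (2, False), (5, True), (6, True)]
# Low-income patients who needed care but could not afford visits
# show up with few recorded visits.
low_income = [(1, False), (1, True), (2, True), (5, True)]

def predict(visits):
    """Toy rule: flag a patient as needing treatment if visits >= 3."""
    return visits >= 3

def error_rate(group):
    """Fraction of patients the toy rule misclassifies."""
    errors = sum(predict(v) != truth for v, truth in group)
    return errors / len(group)

print(error_rate(wealthy))     # 0.0 -- visit counts track health well here
print(error_rate(low_income))  # 0.5 -- few visits can mask unmet need
```

The single rule is error-free for the wealthy group but wrong half the time for the low-income group, which is exactly the kind of imbalance the researchers study.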
The researchers call the inputs in the previous example "group-skewed." With group-skewed inputs, even the best predictions for one group, such as the low-income group, will still have higher errors than the best predictions for another group. When, instead, the best predictions achievable for each group yield comparably low errors across groups, the inputs are called "group-balanced." The researchers note that whether inputs are group-skewed or group-balanced is an important factor in determining the effectiveness of a public policy that removes or adjusts the data input into an algorithm. In fact, a policy that lowers accuracy for all groups but improves fairness only makes sense when inputs are group-skewed.
Liang, Lu and Mu developed a theoretical framework that considers an algorithm user's preferences across fairness and accuracy. These preferences can range from caring only about an algorithm's overall accuracy to caring only about its fairness, with any combination in between. They build what they call the "fairness-accuracy frontier," which describes this trade-off between accuracy and fairness for any such preference, and they examine various real-world scenarios through the lens of this frontier.
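A hedged sketch of this idea, using a toy formalization of our own rather than the paper's construction: summarize each candidate algorithm by its error rate on two groups, then score candidates with a weight that slides between caring only about overall accuracy and caring only about balance. Sweeping the weight traces out which algorithm a user with each preference would pick.

```python
# Toy fairness-accuracy trade-off (our own simplification, not the
# paper's formal frontier). Each candidate algorithm is summarized by
# its error rates on group A and group B (illustrative numbers).

candidates = {
    "most_accurate": (0.04, 0.20),   # lowest overall error, largest gap
    "compromise":    (0.10, 0.16),
    "most_balanced": (0.30, 0.30),   # equal errors, highest overall error
}

def score(errors, fairness_weight):
    """Lower is better: blend overall error with the error gap."""
    e_a, e_b = errors
    overall = (e_a + e_b) / 2        # assumes equal-sized groups
    gap = abs(e_a - e_b)             # unfairness = imbalance in errors
    return (1 - fairness_weight) * overall + fairness_weight * gap

# Weight 0 = accuracy only; weight 1 = fairness only.
for w in (0.0, 0.5, 1.0):
    best = min(candidates, key=lambda name: score(candidates[name], w))
    print(f"weight {w}: {best}")
# weight 0.0: most_accurate
# weight 0.5: compromise
# weight 1.0: most_balanced
```

Each weight selects a different algorithm, which is the intuition behind a frontier: no single algorithm is best for every preference.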
Is Test Score Elimination a Reasonable Policy?
Using their framework, the researchers can study what happens when certain data is banned from use in an algorithm. They find that when errors from the data are already group-balanced, removing data always produces a worse outcome. This is true for any preference across accuracy and fairness.
To illustrate this, consider the previous example, in which an algorithm determines whether a patient needs medical care. If using medical records alone results in group-balanced error rates, then the outcomes are always worse when a variable identifying the patient's income group is excluded: every preference across fairness and accuracy sees both higher errors and greater unfairness without the group identifier.
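A toy version of this claim, again with illustrative numbers of our own: suppose each income group has its own error-free decision rule, so group-aware predictions are balanced. Forcing one shared rule on the pooled data, as happens when the group identifier is removed, then raises errors for both groups.

```python
# Toy check (illustrative numbers, not from the paper): when group-aware
# predictions are balanced -- here, perfect for both groups -- removing
# the group identifier raises errors for *both* groups.
# Ground truth: wealthy patients (A) need care iff visits >= 3;
# low-income patients (B) iff visits >= 1, since cost suppresses visits.

group_a = [(1, False), (1, False), (2, False), (3, True), (4, True)]
group_b = [(0, False), (1, True), (2, True), (2, True), (3, True)]

def error(records, threshold):
    """Fraction misclassified by the rule 'needs care iff visits >= t'."""
    return sum((v >= threshold) != truth for v, truth in records) / len(records)

def best_threshold(records):
    """Pick the threshold in 0..4 minimizing the error on these records."""
    return min(range(5), key=lambda t: error(records, t))

# With the group identifier, each group gets its own threshold: zero error.
print(error(group_a, best_threshold(group_a)))   # 0.0
print(error(group_b, best_threshold(group_b)))   # 0.0

# Without it, one shared threshold must serve both -- and both do worse.
t = best_threshold(group_a + group_b)
print(error(group_a, t), error(group_b, t))      # 0.2 0.2
```

The shared threshold lands between the two group-optimal rules, so each group absorbs errors it would not face if the algorithm could see the group identifier.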
When the results from the medical data inputs are group-skewed, however, removing the group identifier affects errors in ways that depend on one's preferences between accuracy and fairness.
This brings us back to removing SAT and ACT scores from the admissions process. Assuming that errors in the admissions process are group-skewed and that group identifiers like race are removed from the data fed to an admissions algorithm, the researchers' framework suggests that error rates across races can be balanced, improving fairness, at the cost of some accuracy. The recent ban on affirmative action means that "universities with certain (fairness-accuracy) preferences will have more reason to ban the use of test scores in admissions decisions," the authors note.
The authors also investigate what happens if the designer of an algorithm can only control the data input into an algorithm while the user can choose any algorithm. And they consider outcomes when the designer and user have misaligned fairness-accuracy preferences.