In the ever-growing constellation of ISO best-practice documents, ISO/IEC TR 24027:2021 gives AI technical and governance teams tools and techniques to establish whether their AI systems are fair and to detect and treat unwanted bias. It’s not the typical heavy corporate read, but a rather short and practical best-practice document. I’ll try to demystify ISO 24027 and break it down in a few simple paragraphs.

What is bias and what is fairness?

Bias is a systematic disparity of treatment, judgment, or outcomes towards certain groups of individuals or attributes. On the other hand, fairness is a treatment or behaviour that respects facts, beliefs, and norms, and is not determined by favouritism.

Why this matters for AI systems

A classic example explaining why these concepts are so crucial for AI systems relates to recruitment. You apply for a job, the company receives your CV and scans it with an AI parsing tool. Based on its analysis of the document’s text, the tool will decide whether you’re a good enough fit to progress your application and, if so, pass it on to the relevant HR person. It’s an awful lot of time saved on the HR side. But how can we be sure that the initial decision made by that AI system is accurate?

In fact, the training data of that AI system might have carried unwanted, unintentional bias. Perhaps a gender bias. If not properly mitigated or removed at the training/development stage, that bias will go unnoticed and undisturbed, skewing results and propagating itself through the model’s outputs. A parsing tool trained on biased data may predict that a candidate is a better fit because they are male and discard the application of a better-qualified candidate because they are female. The model is not sexist, but the training data may have been (even unintentionally).

ISO 24027 distinguishes between bias and fairness. Bias is not the direct opposite of fairness, but one of its potential drivers. The document mentions cases in which biased inputs do not necessarily result in unfair predictions (a university hiring policy biased towards holders of relevant qualifications), as well as cases of unbiased systems that can still be considered unfair (a policy that indiscriminately rejects all candidates is unbiased towards all categories, but hardly fair for that reason).

Besides, achieving fairness is also about making difficult trade-offs, which can put the governance person’s ethical coordinates to the test. Take an AI system tasked with deciding which university graduate should be awarded a scholarship. Which stakeholder should the algorithm listen to? The university’s diversity officer, who wants a fair distribution of awards to students from varied geographic regions, or the professor who wants the scholarship to go to a deserving student with a keen interest in a specific research area? There is probably no right answer, and it is the duty of the AI governance function to recognise that possibility. The best course of action remains orienting the AI system and the overarching governance strategy toward the guiding star of transparency.

A Lot of Biases!

The possible unwanted biases influencing an AI model are numerous. ISO 24027 divides them into three categories: human cognitive bias, data bias, and bias introduced by engineering decisions.

Human cognitive bias

Whether conscious or unconscious, biases influencing human beings’ decisions are likely to affect data collection, AI systems’ design and model training. The most common include:

Other human cognitive biases comprise:

Data bias

Data bias is more technical in nature, in that it originates from technical design decisions. Human cognitive bias, training methodology and variations in infrastructure can also indirectly introduce bias into datasets.

Given the overlap between the AI domain and statistics (think of inference, prediction, and data science more broadly), it will not come as a surprise that statistical bias is a big one. It comprises:

Other sources of data bias include:

Bias introduced by engineering decisions

AI model architectures can also bring biases:

Assessing Your AI Model’s Bias and Fairness

You can determine whether your AI model is affected by bias and exhibits fairness by assessing its prediction outputs. ISO 24027 proposes a set of fairness metrics to test the quality of a model’s outputs.

Confusion Matrix

Let’s start with the confusion matrix – a visual breakdown of an AI model’s predictions, showing true positives, false positives, true negatives, and false negatives. It allows for a nuanced understanding of where the model performs well and where it falls short, which is particularly helpful for binary classification tasks.


For example, a high false negative rate might indicate an issue with how the model weighs certain criteria (an AI loan model denying loans to applicants who actually meet the requirements). Conversely, a high false positive rate (approving loans for high-risk applicants) could signify overly lenient predictions. By adjusting parameters, we can fine-tune the model to reduce specific errors, and that fine-tuning starts from exactly this kind of confusion-matrix analysis.
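
As a rough illustration, this is how such a matrix could be computed in Python with scikit-learn. The loan labels below are invented purely for the example (1 = approve, 0 = deny).

```python
# Minimal sketch: confusion matrix for a hypothetical loan-approval classifier.
# The labels are invented for illustration; 1 = approve, 0 = deny.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # did the applicant actually qualify?
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # what the model predicted

# For binary labels, ravel() unpacks the 2x2 matrix into the four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```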

Precision

The first metric drawn from the matrix is precision, which measures the reliability of the model’s positive predictions: the proportion of true positives out of all positive predictions made (true positives + false positives). Precision is calculated as follows:

Precision = TP / (TP + FP)

Accuracy

Accuracy looks at the model’s overall success rate across both positive and negative predictions. It is calculated by dividing the number of correct predictions (both true positives and true negatives) by the total number of predictions made:

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Recall

Recall measures how many of the actual positives were correctly identified. It is obtained by dividing the true positives by the sum of true positives and false negatives:

Recall = TP / (TP + FN)
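
Using the counts from the confusion-matrix sketch above (TP=3, FP=1, TN=4, FN=2, all invented), the three metrics can be computed directly:

```python
# Minimal sketch: precision, accuracy, and recall from the confusion-matrix counts above.
tp, fp, tn, fn = 3, 1, 4, 2  # invented example counts

precision = tp / (tp + fp)                    # reliability of positive predictions
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # overall success rate
recall    = tp / (tp + fn)                    # share of actual positives found

print(f"precision={precision:.2f}, accuracy={accuracy:.2f}, recall={recall:.2f}")
```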

ROC / AUC Curve

ROC (Receiver Operating Characteristic) plots the true positive rate against the false positive rate at various thresholds, highlighting the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate).

AUC (Area Under Curve) quantifies the model’s overall performance across all thresholds, providing a single metric that reflects the model’s ability to distinguish between classes.
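
As a minimal sketch, scikit-learn can produce both from a set of prediction scores; the labels and scores below are invented for illustration.

```python
# Minimal sketch: ROC curve points and AUC for hypothetical prediction scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth labels (invented)
y_scores = [0.9, 0.2, 0.6, 0.8, 0.4, 0.7, 0.55, 0.1]   # model scores (invented)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_scores)               # area under that curve
print(f"AUC = {auc:.2f}")
```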

Fairness metrics

Equalised odds

Equalised odds requires a model to have both equal true positive and equal false positive rates across demographic categories. It ensures that those who qualify for a positive outcome are treated equally, and that the model’s mistakes are distributed fairly across groups.

It is assessed by comparing the true positive rate and the false positive rate across groups. If both rates are similar across groups, the model satisfies equalised odds.

Equal opportunity

Equal opportunity is a metric that requires the same true positive rates across all protected groups. To see if your model satisfies equal opportunity, you need to compare the true positive rates across groups.

Demographic parity

Demographic parity requires that the selection rates for an AI model are the same across different demographic groups. Basically, each group should receive positive outcomes at the same rate, no matter their actual qualification or risk. To establish whether this condition is satisfied, you need to compare the positive prediction rate across different categories.

Predictive equality

Predictive equality implies that false positive rates are equal across demographic categories. The main idea is that no category should be more likely to be wrongly flagged as positive than another. When false positive rates across groups are similar, then predictive equality is satisfied.
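
To make these comparisons concrete, here is a small Python sketch, with invented records and group labels, that computes per group the rates each metric compares: the selection rate (demographic parity), the true positive rate (equal opportunity), the false positive rate (predictive equality), and both rates together (equalised odds).

```python
# Minimal sketch: group-wise rates behind the four fairness metrics above.
# The records are invented: each tuple is (group, actual, predicted).
from collections import defaultdict

records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 0, 0), ("B", 1, 1), ("B", 0, 0),
]

counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
for group, actual, pred in records:
    key = ("t" if pred == actual else "f") + ("p" if pred == 1 else "n")
    counts[group][key] += 1

for group, c in counts.items():
    tpr = c["tp"] / (c["tp"] + c["fn"])                       # equal opportunity / equalised odds
    fpr = c["fp"] / (c["fp"] + c["tn"])                       # predictive equality / equalised odds
    selection_rate = (c["tp"] + c["fp"]) / sum(c.values())    # demographic parity
    print(f"group {group}: TPR={tpr:.2f}, FPR={fpr:.2f}, selection rate={selection_rate:.2f}")
```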

Treating bias across the AI lifecycle

If any of the above metrics reveals a disparity across groups, unwanted bias is present in the model’s predictions. Ideally, bias should be mitigated as much as possible before deployment. In any case, ISO 24027 provides a framework to treat bias across the whole lifecycle.

Inception

Requirements analysis

At this stage, you want to scan and evaluate all internal and external requirements of your system.

Stakeholder engagement

Given that unwanted bias is a known issue in social sciences, ISO 24027 recommends involving an array of trans-disciplinary professionals, spanning social scientists, ethicists, legal and technical experts. To further ensure regulatory compliance and mitigate societal impacts, the document suggests identifying all stakeholders directly or indirectly affected by the AI system.

Data gathering

Another key focus area is the selection of data sources. Organisations should rigorously apply data selection criteria such as

Design and development

Data labelling

A key activity of these phases is data representation and labelling, which carries a high risk of bias due to implicit human assumptions. Generally speaking, this occurs when developers label data “good” or “bad” based on their assumptions or beliefs. The use of crowd(sourced) workers can help mitigate bias, as they bring diverse perspectives when annotating data. What is good for some may not be so good for others.

Sophisticated training

ISO 24027 warns that merely removing features potentially responsible for bias (ethnicity, gender) is insufficient to mitigate it. In fact, there may still be features acting as proxies that reveal the same information that was removed. Techniques to ensure better data representativeness include data re-weighting, sampling, and stratified sampling. A different approach consists in offsetting the bias ex post in the results. ISO 24027 mentions a series of advanced, highly technical operations, including (but not limited to):

Tuning

Your model can be appropriately tuned using bias mitigation algorithms, which are typically classified into three categories:

  1. Data-based methods (e.g., use of synthetic data, up-sampling of under-represented populations).
  2. Model-based methods (e.g., introduction of regularisation or constraints to force an objective).
  3. Post-hoc methods (e.g., identifying specific decision thresholds to equalise false positive rates; see the sketch after this list).
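
As a minimal sketch of the third category, and assuming per-group scores and labels are available, one could pick a separate decision threshold for each group so that false positive rates end up roughly equal; the scores, labels, and target rate below are invented for illustration.

```python
# Minimal sketch of a post-hoc method: choose a per-group decision threshold so that
# false positive rates are (approximately) equalised across groups.
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr):
    """Return the decision threshold whose false positive rate is closest to target_fpr."""
    negatives = np.asarray(scores)[np.asarray(labels) == 0]
    # The (1 - target_fpr) quantile of negative scores approximates the desired threshold:
    # roughly target_fpr of negatives will score above it.
    return float(np.quantile(negatives, 1 - target_fpr))

# Hypothetical scores and labels for two demographic groups
group_scores = {"A": [0.2, 0.4, 0.7, 0.9], "B": [0.1, 0.3, 0.6, 0.8]}
group_labels = {"A": [0, 0, 1, 1],          "B": [0, 0, 1, 1]}

thresholds = {g: threshold_for_fpr(group_scores[g], group_labels[g], target_fpr=0.1)
              for g in group_scores}
print(thresholds)  # apply each group's own threshold at prediction time
```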

Verification and Validation

Several techniques can be deployed to measure how well your model is doing, including:

Deployment

Continuous monitoring is critical for AI models because they are subject to performance degradation over time. In addition, your organisation must commit to transparency regarding the model’s performance and limitations. Your model documentation should comprise qualitative information (ethical considerations and use cases) and quantitative information (model evaluation split across different subgroups and combinations of multiple subgroups).
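
A lightweight way to operationalise such monitoring is a recurring job that recomputes metrics per subgroup on fresh production data and raises an alert when they drift; the function, data, and 5% tolerance below are invented for illustration.

```python
# Minimal sketch: recompute accuracy per subgroup on recent production data and flag
# degradation against a stored baseline. Names, data, and tolerance are hypothetical.
def monitor(batches, baseline_accuracy, tolerance=0.05):
    for group, y_true, y_pred in batches:  # one entry per subgroup in the latest window
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        acc = correct / len(y_true)
        if acc < baseline_accuracy[group] - tolerance:
            print(f"ALERT: accuracy for group {group} dropped to {acc:.2f}")

monitor(
    batches=[("A", [1, 0, 1, 1], [1, 0, 0, 1]), ("B", [0, 1, 0, 1], [0, 1, 0, 0])],
    baseline_accuracy={"A": 0.90, "B": 0.90},
)
```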

High effort, high reward

The AI Governance & Compliance strategy needs a multi-disciplinary approach and cross-functional efforts to be truly effective. At minimum, you want compliance and risk management experts, legal and data privacy advisors, and technical/engineering teams. ISO 24027 generally falls under the remit of the latter group.

Nonetheless, this standard – or at least its broader meaning and implications – should be shared knowledge among your AI governance team, because the mission of making AI safe and trustworthy is a collective challenge. And once the dust of excitement around AI settles, only the companies that are AI-robust will thrive.