Concordance index

Kamil Rzechowski

11 Sep 2024, 6 minutes read

Why the C-index?

In biomedical machine learning, we often face the challenge of performing survival analysis. Survival analysis aims to predict the time until an event occurs, such as disease recurrence. Although it is most widely used in the medical field, it also applies to machine failure monitoring and the insurance sector. To evaluate predictive algorithms, we can use the L1 or L2 distance between predicted and observed times, but that is not always the best metric.

Let's imagine a situation where only a subset of patients experience disease recurrence. Other patients may experience it in the future, but had not at the time of evaluation, and we don’t know whether they ever will. In that case, we only have ground truth times for patients who have had a biochemical recurrence, yet patients who have not still carry information about the quality of our algorithm. The commonly used evaluation metric in this setting is the concordance index, also known as the C-index. It considers all cases, positive and negative, and treats the negative ones as censored.

The base concordance index described in this article is Harrell’s concordance index, initially presented in 1982 in “Evaluating the Yield of Medical Tests” by Frank Harrell and colleagues. It is the most intuitive version of the C-index. However, it has been shown that Harrell’s concordance index becomes too optimistic as the amount of censored data increases, and that it is not very useful if the primary interest lies in a specific time range, for example if we are only interested in outcomes within 4 years. The first issue can be addressed, for example, with the concordance index for right-censored survival data based on inverse probability of censoring weights, and the latter by extending the AUC ROC to a time-dependent area under the ROC curve.

Concept behind c-index

The concordance index can be thought of as a measure of how well patients are sorted according to event occurrence. It measures the ability of a predictor to order subjects by estimating the proportion of correctly ordered pairs among all comparable pairs in the dataset. A positive case is correctly ordered with respect to another case (positive or negative) that outlives it if that other case was also predicted to outlive it. The C-index is computed on a pairwise basis: the result is the ratio between the number of comparable pairs that meet this condition and the number of all comparable pairs.
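The ratio described above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the original paper: T_i is the ground truth time of case i (event time or last follow-up), \delta_i is 1 if the event was observed and 0 if the case is censored, and \hat{T}_i is the predicted time. Ties in times or predictions are ignored in this sketch.

```latex
C = \frac{\sum_{i \neq j} \delta_i \, \mathbb{1}[T_i < T_j] \, \mathbb{1}[\hat{T}_i < \hat{T}_j]}
         {\sum_{i \neq j} \delta_i \, \mathbb{1}[T_i < T_j]}
```

The denominator counts the comparable pairs (the earlier case must be a positive one), and the numerator counts those comparable pairs whose predictions are in the same order as the ground truth.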

The C-index ranges from 0 to 1. A value of 0.5 corresponds to a totally random ordering and 1.0 to perfect ordering; values below 0.5 mean the ordering is worse than random, that is, the predictions tend to be reversed. It is important to notice that the metric does not measure how accurate the time prediction is, but rather whether patients were correctly sorted. It also doesn’t directly measure how well patients were classified into those with disease recurrence and those without. However, it partially takes this into account, since negative cases are expected to outlive positive cases. Moreover, the C-index handles the uncertainty of negative cases.
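To illustrate that only the ordering matters, here is a minimal sketch in plain Python. The function `c_index` and all data below are hypothetical and made up for this illustration; ties are not handled. Two sets of predictions with very different absolute errors receive the same score because they sort the patients identically.

```python
from itertools import permutations


def c_index(times, events, predicted):
    """Fraction of comparable pairs whose predicted times are correctly ordered."""
    # A pair (i, j) is comparable if case i is positive and has the shorter time.
    comparable = [
        (i, j)
        for i, j in permutations(range(len(times)), 2)
        if events[i] and times[i] < times[j]
    ]
    concordant = sum(predicted[i] < predicted[j] for i, j in comparable)
    return concordant / len(comparable)


def mae_on_events(times, events, predicted):
    """Mean absolute error over positive cases only (censored times are not true event times)."""
    errors = [abs(p - t) for t, e, p in zip(times, events, predicted) if e]
    return sum(errors) / len(errors)


# Hypothetical ground truth: time in years and whether the event was observed.
times = [2.0, 4.0, 6.0, 8.0]
events = [True, True, False, True]

close = [2.1, 3.9, 6.2, 8.1]        # nearly exact predictions
far_off = [20.0, 40.0, 60.0, 80.0]  # ten times too large, same ordering

c_close = c_index(times, events, close)
c_far_off = c_index(times, events, far_off)
mae_close = mae_on_events(times, events, close)
mae_far_off = mae_on_events(times, events, far_off)

print(c_close, c_far_off)      # both are 1.0
print(mae_close < mae_far_off)  # True
```

If the size of the error matters for the use case, the C-index would need to be reported alongside a distance-based metric computed on the positive cases.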

A negative case can be a false negative, since the disease may still recur in the future even though it had not at the moment of evaluation. The C-index therefore uses the last follow-up time to sort negative cases and treats them as censored. As a result, the first item in each compared pair is always a positive case. A positive case is comparable only to negative cases with a longer follow-up time (negative cases with a shorter follow-up time are skipped) and to other positive cases with a longer time to event.
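The comparability rule can be sketched as a small helper that lists which pairs enter the computation. The function name `comparable_pairs` and the four-patient data are hypothetical, chosen only to show one skipped negative case.

```python
def comparable_pairs(times, events):
    """Return index pairs (i, j) where case i is positive and case j has a longer time."""
    pairs = []
    for i, (time_i, event_i) in enumerate(zip(times, events)):
        if not event_i:
            continue  # a censored case is never the first item of a pair
        for j, time_j in enumerate(times):
            if time_j > time_i:
                pairs.append((i, j))
    return pairs


# Hypothetical data: time in years and whether the event was observed.
times = [2.0, 3.0, 5.0, 7.0]
events = [True, False, True, False]

pairs = comparable_pairs(times, events)
print(pairs)  # [(0, 1), (0, 2), (0, 3), (2, 3)]
```

The positive case at index 2 is not paired with the negative case at index 1, because that case was last seen at 3.0 years and nobody knows what happened to it afterwards.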

Figure 1. Sort patients by ground truth time in ascending order and iterate in ascending direction, skipping negative cases. Compare each positive case with the cases that follow it. Increase the numerator for each concordant pair (predicted in the correct order). Use the number of all compared pairs as the denominator.

Example

It is always easier to understand the algorithm using an example.

{
   "results": [
     {
       "case_id": "case_0",
       "case_id_gt_time": 1.35,
       "case_id_gt_event": 0,
       "case_id_prediction_years_to_recurrence": 1.48
     },
     {
       "case_id": "case_1",
       "case_id_gt_time": 11.89,
       "case_id_gt_event": 1,
       "case_id_prediction_years_to_recurrence": 3.52
     },
     {
       "case_id": "case_2",
       "case_id_gt_time": 19.17,
       "case_id_gt_event": 0,
       "case_id_prediction_years_to_recurrence": 5.52
     }
   ],
   "aggregates": {
     "c_index": 1.0
   }
 }

The trained algorithm yields the results above. There are two negative cases (case_0 and case_2) and one positive case (case_1). For negative cases, the time of the last negative check is reported (1.35y and 19.17y), and for the positive case, the time to recurrence is reported (11.89y). The algorithm was run on each case and case_id_prediction_years_to_recurrence is reported. The predicted years to recurrence for negative cases have no ground truth value to compare against. However, for a negative case to count as correctly predicted, its prediction should be larger than that of any positive case with a shorter ground truth time.

Looking at the results above, we can notice something unintuitive about the C-index score. The C-index is 1.0 (the best possible), even though negative case_0 was predicted to have a shorter case_id_prediction_years_to_recurrence than positive case_1. This makes sense, because case_0 was only followed for 1.35 years and could still turn positive before reaching the gt_time of case_1. As we don’t know whether it will, we cannot penalize the algorithm for not predicting case_0 to outlive case_1.

Algorithm steps:

  1. Sort all cases by the ground truth time in ascending order: [case_0, case_1, case_2].
  2. Iterate over the sorted cases, starting from the first element, and skipping negative elements (censored cases). As the first element in the list is negative, let’s move straight to the second element.
  3. Always compare elements in the ascending direction of the list. The current element is case_1, so we compare case_1 with the elements that follow it. As the list consists of 3 cases, the only case left is case_2. We check whether case_1 has a shorter predicted time than case_2. It does, so we increase numerator += 1. If the list had more elements, we would compare case_1 with case_3, case_4, and so on. For each pair that meets the condition, the numerator is increased by one. As the list has no more elements, we exit the loop.
  4. At the end, the numerator is divided by the number of all evaluated pairs. In the given example the numerator and denominator are both equal to 1 (only one comparable pair, and the pair meets the condition). Therefore the C-index is equal to 1.
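The steps above can be sketched in plain Python. This is a minimal illustration, not the sksurv implementation: the function name `c_index_from_times` and the tuple layout are hypothetical, and skipping pairs with identical ground truth times is an assumption, since the steps do not cover ties.

```python
def c_index_from_times(cases):
    """Harrell's C-index for cases given as (gt_time, gt_event, predicted_time) tuples."""
    # Step 1: sort by ground truth time in ascending order.
    ordered = sorted(cases, key=lambda case: case[0])
    numerator = 0
    denominator = 0
    for i, (time_i, event_i, pred_i) in enumerate(ordered):
        # Step 2: skip negative (censored) cases.
        if not event_i:
            continue
        # Step 3: compare only with cases further down the sorted list.
        for time_j, _event_j, pred_j in ordered[i + 1:]:
            if time_j == time_i:
                continue  # assumption: tied ground truth times are not comparable
            denominator += 1
            if pred_i < pred_j:
                numerator += 1
    if denominator == 0:
        raise ValueError("no comparable pairs, the C-index is undefined")
    # Step 4: concordant pairs divided by all evaluated pairs.
    return numerator / denominator


# The three cases from the example: (gt_time, gt_event, predicted years to recurrence).
cases = [
    (1.35, 0, 1.48),   # case_0
    (11.89, 1, 3.52),  # case_1
    (19.17, 0, 5.52),  # case_2
]
result = c_index_from_times(cases)
print(result)  # 1.0
```

Lowering the prediction for case_2 below 3.52 would make the single comparable pair discordant and drop the score to 0.0, while changing the prediction for case_0 has no effect at all.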

Visual example 1. Visualization of concordance index computation.

Implementation

The C-index can be computed with the concordance_index_censored function from the sksurv (scikit-survival) library.

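from sksurv.metrics import concordance_index_censored

# groundtruth_events: boolean array, True where the event (recurrence) was observed.
# groundtruth_times: event time for positive cases, last follow-up time for censored cases.
# predicted_event_probabilities: risk scores, where a higher value means an earlier expected event.
(
    cindex,
    concordant_pairs,
    discordant_pairs,
    tied_risk,
    tied_time,
) = concordance_index_censored(
    event_indicator=groundtruth_events,
    event_time=groundtruth_times,
    estimate=predicted_event_probabilities,
)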

Note that the function takes a risk estimate rather than a predicted time: the probability of an event occurring is inversely related to the time to the event. If our risk model is any good, patients who developed the disease more quickly should have higher risk scores. In other words, between two patients, the one with the higher risk score should experience the disease sooner. Seen this way, Harrell’s concordance index is closely related to ranking metrics used in binary classification, such as the AUC ROC.
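Because the score direction matters, a model that predicts time to event has to be converted to a risk score first, for instance by negating the prediction. The sketch below shows the effect on the example data in plain Python. The function `harrell_c_from_risk` is hypothetical, and counting tied risk scores as half-concordant is an assumption of this sketch.

```python
def harrell_c_from_risk(times, events, risks):
    """Harrell's C-index for risk scores, where a higher risk means an earlier event."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored cases cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # assumption: a tie in risk counts as half
    return concordant / comparable if comparable else float("nan")


# Ground truth and predictions from the example above.
times = [1.35, 11.89, 19.17]
events = [False, True, False]
predicted_years = [1.48, 3.52, 5.52]

# Negating the predicted time gives a score that grows as the event gets closer.
risks = [-years for years in predicted_years]

c_with_risk = harrell_c_from_risk(times, events, risks)
c_with_raw_times = harrell_c_from_risk(times, events, predicted_years)
print(c_with_risk)       # 1.0
print(c_with_raw_times)  # 0.0
```

Passing predicted times directly where a risk score is expected inverts the result, so a C-index well below 0.5 is often a sign that the score direction is wrong rather than that the model is bad.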

Conclusions

The C-index allows for the evaluation of survival prediction algorithms. The main benefit of the metric is that it can handle censored data, for example patients who have not yet experienced recurrence of the disease, or patients for whom follow-up on the event of interest became impossible because of another event. The C-index, however, has drawbacks. It does not quantify the prediction error: how far the predicted values are from the ground truth values remains unknown. As noted earlier, Harrell’s version also becomes too optimistic as the amount of censored data increases.

If you are wondering how AI can help in your medical use case, feel free to reach out to us. I would be happy to discuss it.

Reviewed by Rafal Pytel
