Why OVO May Work Better Than OVR for Imbalanced Multi-Class Classification

This paper is about a pretty classic machine learning question: when we have a multi-class classification problem, should we use one-versus-rest (OVR) or one-versus-one (OVO)?

At first this sounds like an old topic, because both methods have been used for a long time. But the paper makes a useful point: the usual claim that OVR and OVO perform about the same is based mostly on accuracy. When the dataset is imbalanced, accuracy can hide a lot of problems.

The Paper

Paper: Revisiting One-Versus-One and One-Versus-Rest: Insights into Imbalanced Multi-class Classification

Venue: IEEE International Conference on Data Mining 2025 (ICDM25)

Topic: Imbalanced multi-class classification, OVO, OVR, kernel SVM, and neural-network loss design.

The Problem and Why It Matters

In many real classification tasks, the classes are not balanced. Some classes have many samples, while some classes only have a few. If we only look at accuracy, the model can still look good even if it almost ignores the minority classes.

This is why the paper is useful. It asks whether OVR and OVO are still “similar” when we care about metrics like Macro-F1, G-mean, and balanced accuracy.

The answer is: not always. OVO can be much better when the dataset is imbalanced.
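To make the metric point concrete, here is a small sketch of how accuracy can look fine while the minority classes are completely ignored. The class counts and the "always predict the majority class" model are my own toy assumptions, not data from the paper. G-mean is computed here as the geometric mean of per-class recall, which is one common definition; the paper may define it slightly differently.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

# Hypothetical 3-class test set with 90 / 8 / 2 samples (assumed for illustration).
y_true = np.array([0] * 90 + [1] * 8 + [2] * 2)
# A lazy model that predicts the majority class for every sample.
y_pred = np.zeros_like(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (one common definition of G-mean)."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
gm = g_mean(y_true, y_pred)

print(f"accuracy          = {acc:.3f}")       # 0.900
print(f"balanced accuracy = {bal_acc:.3f}")   # 0.333
print(f"Macro-F1          = {macro_f1:.3f}")  # 0.316
print(f"G-mean            = {gm:.3f}")        # 0.000
```

The model gets 90% accuracy without ever predicting class 1 or class 2, and the three imbalance-aware metrics all expose that.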

The Method

OVR trains one binary classifier per class: class 1 vs everything else, class 2 vs everything else, and so on. The problem is that “everything else” can be much larger than the target class, so for a small class the binary problem is already very imbalanced.

OVO is different. It trains one classifier for every pair of classes: class 1 vs class 2, class 1 vs class 3, class 2 vs class 3, and so on. Each classifier only needs to separate two classes, so a minority class gets direct comparisons instead of having to fight all other classes at once.
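The difference is easy to see with some arithmetic. The sketch below counts how imbalanced each binary subproblem is under the two decompositions. The class counts are hypothetical numbers I picked for illustration, and `ratio` is just a helper name.

```python
from itertools import combinations

# Hypothetical class counts for a 4-class dataset (assumed for illustration).
counts = {"A": 1000, "B": 500, "C": 50, "D": 10}
total = sum(counts.values())


def ratio(a, b):
    """Imbalance ratio of a binary problem: larger side divided by smaller side."""
    return max(a, b) / min(a, b)


# OVR: each class against the union of all other classes.
ovr_ratios = {c: ratio(n, total - n) for c, n in counts.items()}

# OVO: each pair of classes, ignoring all other samples.
ovo_ratios = {
    (a, b): ratio(counts[a], counts[b]) for a, b in combinations(counts, 2)
}

# Worst pairwise ratio that the smallest class "D" ever faces under OVO.
worst_ovo_for_d = max(r for pair, r in ovo_ratios.items() if "D" in pair)

print(ovr_ratios["D"])         # 1550 / 10 = 155.0
print(worst_ovo_for_d)         # 1000 / 10 = 100.0
print(ovo_ratios[("C", "D")])  # 50 / 10 = 5.0
```

OVO does not remove the imbalance, because the pair D vs A is still 100:1. But no pairwise problem is worse than the OVR problem for the same class, and some pairs (like C vs D) are close to balanced.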

The paper tests this idea in two ways.

First, it uses kernel SVMs and compares OVR and OVO directly.

Second, it moves the idea to neural networks. Instead of training many separate neural networks, the authors keep one network and design two losses: an OVR-style loss and an OVO-style loss. This makes the comparison cleaner because the model architecture stays the same.
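To give a feel for what "OVR-style" and "OVO-style" losses on a single network could look like, here is a minimal NumPy sketch. These are my own guesses at plausible forms, not the paper's exact definitions: the OVR-style version sums one binary logistic loss per class, and the OVO-style version applies a logistic loss to the margin between the true-class logit and each other logit. The function names are hypothetical.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def ovr_style_loss(logits, y):
    """Sum of K binary logistic losses, one 'this class vs the rest' per class."""
    n, k = logits.shape
    signs = -np.ones((n, k))
    signs[np.arange(n), y] = 1.0  # +1 for the true class, -1 for all others
    return softplus(-signs * logits).sum(axis=1).mean()


def ovo_style_loss(logits, y):
    """Logistic loss on the margin between the true class and each other class."""
    n, k = logits.shape
    true_logit = logits[np.arange(n), y][:, None]
    margins = true_logit - logits  # the true-class column is 0 by construction
    mask = np.ones((n, k), dtype=bool)
    mask[np.arange(n), y] = False  # drop the "true class vs itself" term
    return (softplus(-margins) * mask).sum(axis=1).mean()


# One sample with 3 class logits; the true class is 0.
logits = np.array([[2.0, 0.0, -1.0]])
y = np.array([0])
ovr_value = ovr_style_loss(logits, y)
ovo_value = ovo_style_loss(logits, y)
```

One design difference shows up even in this toy version: the pairwise loss only depends on differences between logits, so adding a constant to every logit leaves it unchanged, while the OVR-style loss treats each logit as its own binary score.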

What Is Novel

The methods themselves are not new. OVR and OVO are both old ideas.

The novel part is the perspective. The paper shows that the old conclusion depends on what metric we use. If we only care about accuracy, OVR and OVO look similar. But if we care about minority-class performance, OVO often looks better.

I also like the neural-network part because it turns a classical decomposition idea into a loss-design question. That makes the paper more relevant than just another SVM comparison.

Reproduction Plans

I have not fully reproduced the paper yet. For a first reproduction, I would keep it simple.

I would first reproduce the SVM part using imbalanced tabular datasets from UCI, LIBSVM, or KEEL. The setup would be:

RBF-kernel SVM
OVR vs OVO
same train/test splits
metrics: accuracy, Macro-F1, G-mean, balanced accuracy
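A minimal sketch of that setup with scikit-learn could look like the following. The synthetic dataset from `make_classification` is only a stand-in for a real UCI, LIBSVM, or KEEL dataset, and the class weights, `C`, and `gamma` values are assumptions I chose for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced 4-class data (assumed weights; swap in a real dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1,
    weights=[0.70, 0.20, 0.07, 0.03], random_state=0,
)
# Same stratified split for both decompositions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (one common definition of G-mean)."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


def evaluate(wrapper):
    """Fit an RBF-kernel SVM under the given decomposition and score it."""
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    model = make_pipeline(StandardScaler(), wrapper(svm))
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, y_hat),
        "macro_f1": f1_score(y_te, y_hat, average="macro", zero_division=0),
        "g_mean": g_mean(y_te, y_hat),
        "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
    }


results = {
    "OVR": evaluate(OneVsRestClassifier),
    "OVO": evaluate(OneVsOneClassifier),
}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

scikit-learn's `SVC` already handles multi-class problems with a one-vs-one scheme internally, but wrapping it explicitly keeps the two decompositions symmetric and easy to swap. A real reproduction would also repeat this over several splits and tune `C` and `gamma` per dataset.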

Then I would reproduce the neural-network part with one small MLP. I would compare cross-entropy, OVR-style loss, and OVO-style loss.

The main result I would expect is:

OVO should not change accuracy that much, but it should improve Macro-F1, G-mean, and balanced accuracy, especially when the imbalance ratio is high.

The plot I would make is simple: x-axis is imbalance ratio, y-axis is OVO improvement over OVR. If the paper is right, the gap should become larger when imbalance becomes more serious.

Limitations and Open Questions

The biggest limitation is that the experiments are mostly on smaller tabular datasets. That is good for a clean study, but I still want to see this tested on modern long-tailed datasets.

For example, I would like to see:

long-tailed image classification
long-tailed text classification
transformer embeddings plus OVO-style loss
comparison with focal loss, class-balanced loss, and logit adjustment

Another gap is calibration. OVO may improve Macro-F1, but does it give better probabilities? The paper does not really answer that.

Also, OVO can become expensive when the number of classes is large. The paper could discuss more clearly when OVO is worth the extra cost.
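The cost concern is easy to quantify in terms of model count. With K classes, OVR needs K binary classifiers and OVO needs K(K-1)/2. The small sketch below only counts classifiers and says nothing about actual training time, since each OVO classifier trains on the samples of just two classes. The function name is my own.

```python
def n_binary_classifiers(k):
    """Number of binary classifiers needed for k classes under each scheme."""
    return {"ovr": k, "ovo": k * (k - 1) // 2}


for k in (3, 10, 100, 1000):
    c = n_binary_classifiers(k)
    print(f"K={k:>4}: OVR={c['ovr']:>4}, OVO={c['ovo']}")
# K=   3: OVR=   3, OVO=3
# K=  10: OVR=  10, OVO=45
# K= 100: OVR= 100, OVO=4950
# K=1000: OVR=1000, OVO=499500
```

So for a handful of classes the overhead is small, but the quadratic growth becomes a real issue for tasks with hundreds or thousands of classes.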

Ideas for Our Work

This paper connects to my work because it is really about evaluation scale.

If we only use accuracy, OVR and OVO look similar. But when we use Macro-F1 or balanced accuracy, the difference becomes obvious. That is very close to the problem I keep seeing in fairness and urban evaluation: a global metric can hide the thing we actually care about.

This also connects to RISE. In RISE, I am trying to show that standard fairness metrics are not enough by themselves. A model can have acceptable global accuracy or fairness numbers, but the sorted residual plot can still show local error concentration, group separation, or unfair behavior in specific residual regions.

So the connection is:

This paper: accuracy can hide minority-class failure.
RISE: aggregate fairness metrics can hide group-level and local residual failure.
UrbanContrastiveQA: raw counts can hide baseline-relative errors.

All point to the same evaluation problem: if the metric or comparison scale is wrong, the conclusion becomes too optimistic.

For RISE specifically, the issue is not only “we need more metrics.” The issue is that different evaluation views reveal different failure modes. OVO makes minority-class behavior more visible through pairwise comparison. RISE makes fairness behavior more visible through residual sorting, group coloring, and knee-based inspection.

This also makes me think RISE could add a class-imbalance case study. For example, we could compare OVR and OVO models and use RISE to inspect where the residual errors happen. If OVO really helps minority classes, then the residual plot should show less extreme minority-class error concentration.

For my urban work, this paper also supports the idea that pairwise comparison is not just a benchmark trick. It can actually change what the model focuses on. In UrbanContrastiveQA, we use contrastive pairs to test whether a model follows baseline-relative information or just follows raw count. This paper gives another example of why pairwise structure matters. OVO helps because it compares classes directly. Our benchmark does something similar for cities and zones.

A possible follow-up idea is:

Pairwise Decomposition for Baseline-Relative Urban Reasoning

Instead of asking a model to judge all zones globally, we can ask it to compare two zones at a time with controlled baseline information.

The question would be:

Can pairwise comparison reduce raw-count bias and make the model better at baseline-relative urban reasoning?

That feels like a useful direction because many urban tasks are not about “which number is bigger.” They are about which place is more unusual compared with its own normal pattern.

A second follow-up idea is:

RISE for Imbalanced Multi-Class Evaluation

Use RISE to visualize how OVR and OVO distribute errors across classes. The goal would be to see whether OVO only improves the final metric, or whether it actually changes the residual pattern for minority classes.

This could be a clean bridge between classical imbalanced classification and my fairness visualization work.

For more information about our research, return to our homepage: ufdatastudio.com.