Chapter 10: Model Evaluation and Performance Analysis
Learning Objectives
- Master how object detection evaluation metrics are calculated
- Learn methods for model performance analysis and error analysis
- Understand model visualization and interpretability methods
- Become familiar with A/B testing and model comparison techniques
10.1 Object Detection Evaluation Metrics
10.1.1 Basic Evaluation Metrics
IoU (Intersection over Union)
IoU is the most fundamental evaluation metric in object detection, used to measure the overlap between predicted boxes and ground truth boxes.
IoU Calculation Formula:
IoU = Area(A ∩ B) / Area(A ∪ B)
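As a concrete reference, here is a minimal IoU computation for two axis-aligned boxes in [x1, y1, x2, y2] format (the box format and function name are illustrative):
# Minimal IoU sketch for axis-aligned boxes in [x1, y1, x2, y2] format
def compute_iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0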
Impact of IoU Threshold:
- IoU ≥ 0.5: Generally considered correct detection
- IoU ≥ 0.7: More stringent evaluation standard
- IoU ≥ 0.9: Extremely high precision requirement
Precision and Recall
Precision: Among all samples predicted as positive, the proportion that are truly positive
Precision = TP / (TP + FP)
Recall: Among all truly positive samples, the proportion correctly identified
Recall = TP / (TP + FN)
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
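With these counts in hand, precision and recall follow directly from the formulas above; a quick worked example with illustrative numbers:
# Precision and recall from confusion-matrix counts (illustrative numbers)
tp, fp, fn = 80, 20, 10
precision = tp / (tp + fp)   # 0.80 -> 80% of predictions are correct
recall = tp / (tp + fn)      # ~0.89 -> 89% of actual objects are found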
10.1.2 Comprehensive Evaluation Metrics
AP (Average Precision)
AP summarizes a single class's precision-recall tradeoff: it is the precision averaged over different recall levels, equivalent to the area under the P-R curve.
Calculation Steps:
- Sort all detection results by confidence in descending order
- Calculate precision and recall at each confidence threshold
- Plot the P-R curve
- Calculate the area under the curve (see the code sketch below)
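A minimal sketch of the all-point-interpolation variant of this procedure, assuming the precision and recall arrays are already ordered by descending confidence (compute_ap is an illustrative name):
# All-point-interpolation AP from precision/recall arrays (a sketch)
import numpy as np

def compute_ap(recalls, precisions):
    # Pad so the curve starts at recall 0 and ends at recall 1
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Replace each precision with the maximum precision to its right
    # (the monotonically decreasing envelope of the P-R curve)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas at the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))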
mAP (mean Average Precision)
mAP is the average AP across all classes and is the most important evaluation metric for object detection.
COCO Dataset Evaluation Standards:
- mAP@0.5: mAP at IoU threshold of 0.5
- mAP@0.75: mAP at IoU threshold of 0.75
- mAP@0.5:0.95: Average mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (the primary COCO metric)
10.1.3 Other Important Metrics
FPS (Frames Per Second)
Metric measuring model inference speed:
FPS = 1 / Single Frame Inference Time
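A minimal benchmarking sketch in PyTorch, assuming the model and a representative dummy input are available (names are illustrative); warm-up iterations and GPU synchronization keep queued GPU work from distorting the measurement:
# FPS benchmark sketch (PyTorch, illustrative)
import time
import torch

@torch.no_grad()
def measure_fps(model, dummy_input, warmup=10, runs=100):
    model.eval()
    for _ in range(warmup):              # warm-up iterations
        model(dummy_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # wait for queued GPU work
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)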
Model Complexity Metrics
- Parameters: Total number of model parameters
- FLOPs: Floating-point operations
- Model Size: Storage space occupied on disk (a measurement sketch follows this list)
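Parameter count and on-disk size can be measured directly in PyTorch, while FLOPs typically require a profiler such as fvcore or thop; a minimal sketch (the temporary file path is illustrative):
# Parameter count and serialized model size (a sketch)
import os
import torch

def model_complexity(model, path='model_tmp.pt'):
    params_m = sum(p.numel() for p in model.parameters()) / 1e6  # millions
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / (1024 ** 2)                # MB on disk
    os.remove(path)
    return {'parameters_M': params_m, 'model_size_MB': size_mb}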
10.2 Model Performance Analysis
10.2.1 Detailed Performance Analysis
Analysis by Category
Analysis by Object Size
- Small Objects: Pixel area < 32²
- Medium Objects: 32² ≤ Pixel area < 96²
- Large Objects: Pixel area ≥ 96² (a bucketing sketch follows this list)
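A minimal bucketing sketch using these thresholds, assuming ground-truth boxes in [x, y, width, height] pixel format (names are illustrative):
# Bucket a ground-truth box by COCO-style size ranges (a sketch)
def size_bucket(box):
    area = box[2] * box[3]        # width * height in pixels
    if area < 32 ** 2:
        return 'small'
    elif area < 96 ** 2:
        return 'medium'
    return 'large'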
10.2.2 Error Analysis Framework
Classification of Detection Errors
False Positive Analysis
- Background False Positives: Mistaking background areas as targets
- Localization Errors: Correct target identification but inaccurate localization
- Classification Errors: Correct localization but wrong class prediction
- Duplicate Detection: Multiple detection boxes for the same target (a categorization sketch follows this list)
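A simplified categorization sketch along these lines, assuming each detection and ground truth exposes box and cls fields, ground truths carry a matched flag, and an IoU helper such as compute_iou from Section 10.1.1 is passed in (thresholds and names are illustrative):
# Simplified detection-error categorization (a sketch)
def categorize_detection(det, gts, iou_fn, loc_thresh=0.5, bg_thresh=0.1):
    # Find the ground-truth box with the highest overlap
    best_iou, best_gt = 0.0, None
    for gt in gts:
        overlap = iou_fn(det.box, gt.box)
        if overlap > best_iou:
            best_iou, best_gt = overlap, gt
    if best_gt is None or best_iou < bg_thresh:
        return 'background_false_positive'   # fires on background
    if best_gt.cls != det.cls:
        return 'classification_error'        # right place, wrong class
    if best_iou < loc_thresh:
        return 'localization_error'          # right class, poor box
    if best_gt.matched:
        return 'duplicate_detection'         # target already detected
    best_gt.matched = True
    return 'true_positive'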
10.2.3 Performance Bottleneck Analysis
Inference Time Breakdown
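A minimal per-stage timing sketch for one image, splitting the pipeline into pre-processing, model forward pass, and post-processing/NMS (the stage functions are placeholders; for GPU inference, call torch.cuda.synchronize() before reading the clock):
# Per-stage timing breakdown for one image (stage functions are placeholders)
import time

def profile_pipeline(image, preprocess, model, postprocess):
    timings = {}
    t0 = time.perf_counter()
    tensor = preprocess(image)                                # resize, normalize
    timings['preprocess_ms'] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    raw_outputs = model(tensor)                               # forward pass
    timings['inference_ms'] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    detections = postprocess(raw_outputs)                     # decode + NMS
    timings['postprocess_ms'] = (time.perf_counter() - t0) * 1000
    return detections, timings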
10.3 Model Visualization and Interpretability
10.3.1 Feature Visualization
Activation Heatmaps
Using techniques like Grad-CAM to visualize regions of model attention:
# Pseudocode: Grad-CAM visualization (a PyTorch-style sketch; assumes the
# model returns per-class scores, as with a classification head)
import torch
import torch.nn.functional as F

def generate_gradcam(model, image, target_layer, class_idx):
    acts, grads = {}, {}
    # Hooks capture the target layer's activations and output gradients
    fwd = target_layer.register_forward_hook(lambda m, i, o: acts.update(feat=o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(grad=go[0]))
    # Forward pass, then backward pass from the chosen class score
    score = model(image)[0, class_idx]
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()
    # Calculate channel weights: global average pooling of the gradients
    weights = grads['grad'].mean(dim=(2, 3), keepdim=True)
    # Generate heatmap: weighted combination of activation maps, then ReLU
    heatmap = F.relu((weights * acts['feat']).sum(dim=1))
    return heatmap
Feature Map Visualization
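Visualizing intermediate feature maps shows which patterns each channel responds to. A minimal sketch that displays the first few channels of a feature map captured with a forward hook, assuming a tensor of shape [1, C, H, W]:
# Show the first n channels of a feature map of shape [1, C, H, W] (a sketch)
import matplotlib.pyplot as plt

def show_feature_maps(feature_map, n=16):
    maps = feature_map[0, :n].detach().cpu()
    for i in range(n):
        plt.subplot(4, 4, i + 1)       # 4x4 grid of channel activations
        plt.imshow(maps[i], cmap='viridis')
        plt.axis('off')
    plt.show()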
10.3.2 Detection Result Visualization
Confidence Distribution
# Pseudocode: Confidence analysis
import matplotlib.pyplot as plt

def analyze_confidence_distribution(detections):
    confidences = [det.confidence for det in detections]
    # Plot confidence distribution histogram
    plt.hist(confidences, bins=50)
    plt.xlabel('Confidence Score')
    plt.ylabel('Count')
    plt.title('Confidence Distribution')
    plt.show()
    # Analyze how many detections survive different confidence thresholds
    for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
        filtered_dets = [det for det in detections if det.confidence >= threshold]
        print(f"Threshold {threshold}: {len(filtered_dets)} detections")
PR Curve Visualization
# Pseudocode: PR curve plotting
import matplotlib.pyplot as plt

def plot_pr_curve(precisions, recalls, ap_score):
    plt.figure(figsize=(8, 6))
    plt.plot(recalls, precisions, linewidth=2)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'Precision-Recall Curve (AP = {ap_score:.3f})')
    plt.grid(True)
    plt.show()
10.3.3 Error Sample Analysis
Failure Case Visualization
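A minimal overlay sketch that draws false positives and missed ground truths in contrasting colors with OpenCV, assuming boxes in [x1, y1, x2, y2] format (field names are illustrative):
# Draw false positives (red) and missed ground truths (blue) on an image
import cv2

def draw_failure_cases(image, false_positives, missed_ground_truths):
    for det in false_positives:
        x1, y1, x2, y2 = map(int, det.box)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 255), 2)   # red (BGR)
    for gt in missed_ground_truths:
        x1, y1, x2, y2 = map(int, gt.box)
        cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 2)   # blue (BGR)
    return image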
10.4 A/B Testing and Model Comparison
10.4.1 Experimental Design Principles
Controlled Experiment Design
Evaluation Dimensions
- Accuracy Dimension: mAP, AP at different IoU thresholds
- Speed Dimension: FPS, inference time
- Resource Dimension: Memory usage, GPU utilization
- Robustness Dimension: Performance stability across different scenarios
10.4.2 Statistical Significance Testing
t-test
# Pseudocode: Performance difference significance test
from scipy import stats

def significance_test(model_a_scores, model_b_scores):
    # Perform paired t-test (scores must come from the same test splits/runs)
    t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
    alpha = 0.05
    if p_value < alpha:
        print("Performance difference is statistically significant")
    else:
        print("Performance difference is not statistically significant")
    return t_stat, p_value
Effect Size Calculation
# Cohen's d effect size
import numpy as np

def cohens_d(group1, group2):
    # Pooled standard deviation uses sample variances (ddof=1)
    pooled_std = np.sqrt(((len(group1) - 1) * np.var(group1, ddof=1) +
                          (len(group2) - 1) * np.var(group2, ddof=1)) /
                         (len(group1) + len(group2) - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std
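By common convention, |d| ≈ 0.2 is considered a small effect, 0.5 a medium effect, and 0.8 or above a large effect; reporting the effect size alongside the p-value shows whether a statistically significant difference is also practically meaningful.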
10.4.3 Model Comparison Report
Comprehensive Performance Comparison Table
| Model | mAP@0.5:0.95 | mAP@0.5 | FPS | Parameters(M) | Model Size(MB) |
|---|---|---|---|---|---|
| YOLOv5s | 37.2 | 56.0 | 140 | 7.2 | 14.1 |
| YOLOv8s | 44.9 | 61.8 | 120 | 11.2 | 21.5 |
| YOLOv9s | 46.8 | 63.4 | 110 | 13.8 | 26.7 |
Performance-Efficiency Tradeoff Chart
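Using the figures from the comparison table above, a minimal sketch of such a chart plots accuracy against speed, with marker size proportional to parameter count:
# Accuracy-vs-speed tradeoff scatter plot (data taken from the table above)
import matplotlib.pyplot as plt

models = ['YOLOv5s', 'YOLOv8s', 'YOLOv9s']
map_scores = [37.2, 44.9, 46.8]
fps = [140, 120, 110]
params_m = [7.2, 11.2, 13.8]

plt.scatter(fps, map_scores, s=[p * 20 for p in params_m])   # size ~ parameters
for name, x, y in zip(models, fps, map_scores):
    plt.annotate(name, (x, y))
plt.xlabel('FPS')
plt.ylabel('mAP@0.5:0.95')
plt.title('Accuracy vs. Speed Tradeoff')
plt.grid(True)
plt.show()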
10.5 Evaluation Practice Guide
10.5.1 Test Dataset Preparation
Test Set Requirements
- Representativeness: Reflects real application scenarios
- Diversity: Covers various conditions and challenges
- Annotation Quality: Accurate and consistent annotations
- Appropriate Scale: Sufficient statistical significance
Dataset Split Strategy
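When no dedicated benchmark exists, a common strategy is a random 70/15/15 train/validation/test split, with the test portion fixed once and never used for tuning; a minimal sketch with scikit-learn, assuming a list of image paths named image_files:
# Random 70/15/15 split of image paths (a sketch; image_files is assumed)
from sklearn.model_selection import train_test_split

train_files, holdout = train_test_split(image_files, test_size=0.3, random_state=42)
val_files, test_files = train_test_split(holdout, test_size=0.5, random_state=42)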
10.5.2 Evaluation Tools and Frameworks
Common Evaluation Tools
- COCO API: Official COCO evaluation tool (a usage sketch follows this list)
- mAP Calculation Libraries: e.g., mean-average-precision
- Visualization Tools: TensorBoard, Weights & Biases
- Statistical Analysis: Python scipy, R language
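For COCO-format datasets, the COCO API implements the full mAP@0.5:0.95 protocol; a minimal usage sketch, assuming detections have been exported to a COCO-format results JSON (file paths are illustrative):
# COCO-style bbox evaluation with pycocotools (paths are illustrative)
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val2017.json')   # ground-truth annotations
coco_dt = coco_gt.loadRes('predictions.json')          # detection results
evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[.5:.95], AP@.5, AP@.75 and size-wise AP/AR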
Automated Evaluation Pipeline
# Pseudocode: Automated evaluation pipeline
class ModelEvaluator:
    def __init__(self, model, test_dataset):
        self.model = model
        self.test_dataset = test_dataset

    def evaluate(self):
        predictions = []
        ground_truths = []
        for batch in self.test_dataset:
            pred = self.model.predict(batch.images)
            predictions.extend(pred)
            ground_truths.extend(batch.annotations)
        # Calculate various metrics
        metrics = self.compute_metrics(predictions, ground_truths)
        # Generate report
        self.generate_report(metrics)
        return metrics

    def compute_metrics(self, preds, gts):
        return {
            'mAP@0.5': compute_map(preds, gts, iou_thresh=0.5),
            'mAP@0.75': compute_map(preds, gts, iou_thresh=0.75),
            'FPS': measure_fps(self.model),
            'model_size': get_model_size(self.model)
        }
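The evaluator sketched above would then be driven along these lines (object names are illustrative):
# Example usage of the evaluator sketch above
evaluator = ModelEvaluator(model, test_dataset)
metrics = evaluator.evaluate()
print(metrics)   # e.g. {'mAP@0.5': ..., 'mAP@0.75': ..., 'FPS': ..., 'model_size': ...}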
10.5.3 Evaluation Best Practices
Evaluation Principles
- Multi-perspective Evaluation: Accuracy, speed, resource consumption
- Scenario-based Testing: Targeted at specific application scenarios
- Long-term Monitoring: Continuous tracking of model performance
- Reproducibility: Detailed recording of evaluation conditions
Common Pitfalls and Avoidance Methods
Chapter Summary
Model evaluation and performance analysis is a critical part of any YOLO object detection project, as it directly determines how well a model performs in real applications. In this chapter, we have covered:
- Evaluation Metric System: Complete evaluation system from IoU, Precision/Recall to mAP
- Performance Analysis Methods: Including error analysis, bottleneck identification, and visualization techniques
- Model Comparison Techniques: A/B testing design and statistical significance testing
- Practical Guidance: Evaluation process design and best practice summary
Mastering these evaluation methods helps us:
- Objectively evaluate model performance
- Discover model issues and improvement directions
- Guide model optimization and selection
- Ensure practical application effectiveness
In the next chapter, we will learn how to optimize and accelerate models based on evaluation results, further improving the practicality of YOLO models.