Chapter 11: Performance Optimization and Large Project Applications

Haiyue

Learning Objectives
  • Master Pylint performance optimization techniques in large projects
  • Learn to configure distributed and parallel checking
  • Understand incremental checking and caching mechanisms
  • Master quality management strategies for large-scale codebases

Key Concepts

Performance Optimization Strategies

[Mermaid diagram: overview of performance optimization strategies]

Large Project Challenges

  • Long execution time: checking a large codebase takes a long time; mitigate with parallel processing and incremental checking.
  • High memory usage: parsed ASTs occupy a lot of memory; mitigate with sharded processing and memory limits.
  • Complex configuration: multiple teams have different needs; mitigate with layered configuration and an inheritance mechanism.
  • Result management: a large volume of findings is difficult to act on; mitigate with result aggregation and priority classification.
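
Several of these mitigations can be expressed directly in Pylint's own configuration before any external tooling is added. The sketch below is a minimal, illustrative .pylintrc for a large repository; the ignored paths and disabled checks are assumptions to adapt to your project, and the [MAIN] section is spelled [MASTER] in older Pylint releases.

# .pylintrc -- minimal sketch for a large repository (values are illustrative)
[MAIN]
# Use all available CPU cores for Pylint's built-in parallel mode.
jobs = 0
# Keep collected statistics between runs.
persistent = yes
# Skip generated or vendored code (comma-separated regular expressions).
ignore-paths = .*/migrations/.*,
               .*/third_party/.*

[MESSAGES CONTROL]
# Rule optimization: silence a few expensive or low-value checks project-wide.
disable = duplicate-code,
          too-many-ancestors

Setting jobs to 0 lets Pylint pick the number of worker processes itself, which is often enough before reaching for an external runner such as the one shown below.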

Code Examples

Parallel Processing Configuration

# parallel_pylint.py
"""
Parallel Pylint Executor

Pylint wrapper supporting multi-process and distributed execution.
"""

import multiprocessing
import subprocess
import json
import time
from pathlib import Path
from typing import List, Dict, Any, Optional
from concurrent.futures import ProcessPoolExecutor, as_completed
import argparse

class ParallelPylintRunner:
    """Parallel Pylint runner"""

    def __init__(self, project_root: str, config_file: str = '.pylintrc'):
        self.project_root = Path(project_root)
        self.config_file = config_file
        self.max_workers = min(multiprocessing.cpu_count(), 8)

    def discover_python_files(self, directories: Optional[List[str]] = None,
                              exclude_patterns: Optional[List[str]] = None) -> List[Path]:
        """Discover Python files"""
        if directories is None:
            directories = ['.']

        if exclude_patterns is None:
            exclude_patterns = [
                '__pycache__',
                '.git',
                '.venv',
                'venv',
                '.tox',
                'build',
                'dist',
                '*.egg-info'
            ]

        python_files = []
        for directory in directories:
            dir_path = self.project_root / directory

            if not dir_path.exists():
                continue

            for py_file in dir_path.rglob('*.py'):
                # Check if should be excluded
                should_exclude = any(
                    pattern in str(py_file) for pattern in exclude_patterns
                )
                if not should_exclude:
                    python_files.append(py_file)

        return python_files

    def create_file_chunks(self, files: List[Path],
                          chunk_size: Optional[int] = None) -> List[List[Path]]:
        """Group files into processing chunks"""
        if chunk_size is None:
            # Dynamically calculate chunk size based on worker count and file count
            chunk_size = max(1, len(files) // (self.max_workers * 2))

        chunks = []
        for i in range(0, len(files), chunk_size):
            chunk = files[i:i + chunk_size]
            chunks.append(chunk)

        return chunks

    def run_pylint_on_chunk(self, files: List[Path]) -> Dict[str, Any]:
        """Run Pylint on file chunk"""
        start_time = time.time()

        # Build command
        cmd = [
            'pylint',
            f'--rcfile={self.config_file}',
            '--output-format=json',
            '--reports=no',
            '--score=no'
        ] + [str(f) for f in files]

        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                cwd=self.project_root,
                timeout=300  # 5 minute timeout
            )

            # Parse results
            messages = []
            if result.stdout:
                try:
                    messages = json.loads(result.stdout)
                except json.JSONDecodeError:
                    # If JSON parsing fails, try parsing line by line
                    for line in result.stdout.split('\n'):
                        if line.strip():
                            try:
                                msg = json.loads(line)
                                messages.append(msg)
                            except json.JSONDecodeError:
                                continue

            execution_time = time.time() - start_time

            return {
                'files': [str(f) for f in files],
                'messages': messages,
                'execution_time': execution_time,
                'return_code': result.returncode,
                'stderr': result.stderr
            }

        except subprocess.TimeoutExpired:
            return {
                'files': [str(f) for f in files],
                'messages': [],
                'execution_time': time.time() - start_time,
                'return_code': -1,
                'stderr': 'Timeout expired',
                'error': 'timeout'
            }
        except Exception as e:
            return {
                'files': [str(f) for f in files],
                'messages': [],
                'execution_time': time.time() - start_time,
                'return_code': -1,
                'stderr': str(e),
                'error': str(e)
            }

    def run_parallel(self, directories: Optional[List[str]] = None,
                     exclude_patterns: Optional[List[str]] = None,
                     max_workers: Optional[int] = None) -> Dict[str, Any]:
        """Run Pylint in parallel"""
        if max_workers:
            self.max_workers = max_workers

        print(f"Discovering Python files...")
        files = self.discover_python_files(directories, exclude_patterns)
        print(f"Found {len(files)} Python files")

        if not files:
            return {
                'total_files': 0,
                'total_messages': 0,
                'execution_time': 0,
                'chunks': []
            }

        print(f"Creating file chunks...")
        chunks = self.create_file_chunks(files)
        print(f"Processing {len(chunks)} chunks with {self.max_workers} workers")

        start_time = time.time()
        chunk_results = []

        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit tasks
            future_to_chunk = {
                executor.submit(self.run_pylint_on_chunk, chunk): i
                for i, chunk in enumerate(chunks)
            }

            # Collect results
            for future in as_completed(future_to_chunk):
                chunk_index = future_to_chunk[future]
                try:
                    result = future.result()
                    chunk_results.append(result)
                    print(f"Completed chunk {chunk_index + 1}/{len(chunks)} "
                          f"({len(result['messages'])} messages, "
                          f"{result['execution_time']:.2f}s)")
                except Exception as e:
                    print(f"Error in chunk {chunk_index + 1}: {e}")
                    chunk_results.append({
                        'files': [str(f) for f in chunks[chunk_index]],
                        'messages': [],
                        'execution_time': 0,
                        'return_code': -1,
                        'error': str(e)
                    })

        total_time = time.time() - start_time

        # Aggregate results
        all_messages = []
        for result in chunk_results:
            all_messages.extend(result['messages'])

        summary = {
            'total_files': len(files),
            'total_messages': len(all_messages),
            'execution_time': total_time,
            'chunks': len(chunks),
            'chunk_results': chunk_results,
            'messages': all_messages
        }

        print(f"Analysis complete! {len(all_messages)} total messages in {total_time:.2f}s")
        return summary

    def save_results(self, results: Dict[str, Any], output_file: str):
        """Save results to file"""
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

        print(f"Results saved to {output_file}")

def main():
    """Main function"""
    parser = argparse.ArgumentParser(description='Parallel Pylint Runner')
    parser.add_argument('--directories', nargs='+', default=['.'],
                       help='Directories to analyze')
    parser.add_argument('--exclude', nargs='+',
                       help='Patterns to exclude')
    parser.add_argument('--config', default='.pylintrc',
                       help='Pylint configuration file')
    parser.add_argument('--workers', type=int,
                       help='Number of worker processes')
    parser.add_argument('--output', default='pylint-parallel-results.json',
                       help='Output file for results')

    args = parser.parse_args()

    runner = ParallelPylintRunner('.', args.config)
    results = runner.run_parallel(
        directories=args.directories,
        exclude_patterns=args.exclude,
        max_workers=args.workers
    )

    runner.save_results(results, args.output)

if __name__ == "__main__":
    main()
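
The script can be run directly from the command line (for example, python parallel_pylint.py --directories src tests --workers 4), or driven from another script. A minimal sketch of the latter, with illustrative paths and settings, is:

# run_parallel_example.py -- illustrative driver for ParallelPylintRunner
from parallel_pylint import ParallelPylintRunner

runner = ParallelPylintRunner('/path/to/project', config_file='.pylintrc')
results = runner.run_parallel(
    directories=['src', 'tests'],               # example directories
    exclude_patterns=['__pycache__', '.venv'],  # example exclusions
    max_workers=4,
)
print(f"{results['total_messages']} messages in {results['total_files']} files "
      f"({results['execution_time']:.1f}s)")
runner.save_results(results, 'pylint-parallel-results.json')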

Incremental Checking Implementation

# incremental_pylint.py
"""
Incremental Pylint Checker

Only checks changed files to improve checking efficiency in large projects.
"""

import json
import subprocess
import hashlib
import pickle
from pathlib import Path
from typing import Dict, List, Set, Optional, Any
from datetime import datetime

class IncrementalPylintChecker:
    """Incremental Pylint checker"""

    def __init__(self, project_root: str, cache_dir: str = '.pylint_cache'):
        self.project_root = Path(project_root)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

        # Cache file paths
        self.file_hashes_cache = self.cache_dir / 'file_hashes.json'
        self.results_cache = self.cache_dir / 'results.pickle'
        self.config_cache = self.cache_dir / 'config_hash.txt'

    def calculate_file_hash(self, file_path: Path) -> str:
        """Calculate file hash"""
        try:
            with open(file_path, 'rb') as f:
                content = f.read()
            return hashlib.md5(content).hexdigest()
        except Exception:
            return ''

    def get_config_hash(self, config_file: str) -> str:
        """Get configuration file hash"""
        config_path = self.project_root / config_file
        if config_path.exists():
            return self.calculate_file_hash(config_path)
        return ''

    def load_file_hashes(self) -> Dict[str, str]:
        """Load file hash cache"""
        if self.file_hashes_cache.exists():
            try:
                with open(self.file_hashes_cache, 'r') as f:
                    return json.load(f)
            except Exception:
                return {}
        return {}

    def save_file_hashes(self, hashes: Dict[str, str]):
        """Save file hash cache"""
        with open(self.file_hashes_cache, 'w') as f:
            json.dump(hashes, f, indent=2)

    def load_results_cache(self) -> Dict[str, List[Dict]]:
        """Load results cache"""
        if self.results_cache.exists():
            try:
                with open(self.results_cache, 'rb') as f:
                    return pickle.load(f)
            except Exception:
                return {}
        return {}

    def save_results_cache(self, results: Dict[str, List[Dict]]):
        """Save results cache"""
        with open(self.results_cache, 'wb') as f:
            pickle.dump(results, f)

    def get_changed_files(self, files: List[Path],
                         config_file: str = '.pylintrc') -> Set[Path]:
        """Get changed files"""
        # Check if configuration file changed
        current_config_hash = self.get_config_hash(config_file)
        cached_config_hash = ''

        if self.config_cache.exists():
            try:
                with open(self.config_cache, 'r') as f:
                    cached_config_hash = f.read().strip()
            except Exception:
                pass

        config_changed = current_config_hash != cached_config_hash

        if config_changed:
            print("Configuration changed, checking all files")
            # Save new configuration hash
            with open(self.config_cache, 'w') as f:
                f.write(current_config_hash)
            return set(files)

        # Load file hash cache
        cached_hashes = self.load_file_hashes()
        changed_files = set()

        print("Checking for changed files...")
        for file_path in files:
            relative_path = str(file_path.relative_to(self.project_root))
            current_hash = self.calculate_file_hash(file_path)

            if (relative_path not in cached_hashes or
                cached_hashes[relative_path] != current_hash):
                changed_files.add(file_path)

        print(f"Found {len(changed_files)} changed files out of {len(files)} total")
        return changed_files

    def update_file_hashes(self, files: List[Path]):
        """Update file hash cache"""
        cached_hashes = self.load_file_hashes()

        for file_path in files:
            relative_path = str(file_path.relative_to(self.project_root))
            cached_hashes[relative_path] = self.calculate_file_hash(file_path)

        self.save_file_hashes(cached_hashes)

    def run_pylint_on_files(self, files: List[Path],
                           config_file: str = '.pylintrc') -> Dict[str, List[Dict]]:
        """Run Pylint on specified files"""
        if not files:
            return {}

        print(f"Running Pylint on {len(files)} files...")

        cmd = [
            'pylint',
            f'--rcfile={config_file}',
            '--output-format=json',
            '--reports=no',
            '--score=no'
        ] + [str(f) for f in files]

        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                cwd=self.project_root
            )

            messages = []
            if result.stdout:
                try:
                    messages = json.loads(result.stdout)
                except json.JSONDecodeError:
                    # Handle parsing error
                    print("Warning: Could not parse Pylint output as JSON")

            # Organize results by file, keyed by the path relative to the project
            # root so the keys match the relative paths used by the hash cache.
            results_by_file = {}
            for msg in messages:
                msg_path = Path(msg['path'])
                if msg_path.is_absolute():
                    msg_path = msg_path.relative_to(self.project_root)
                results_by_file.setdefault(str(msg_path), []).append(msg)

            return results_by_file

        except Exception as e:
            print(f"Error running Pylint: {e}")
            return {}

    def run_incremental_check(self, directories: Optional[List[str]] = None,
                              config_file: str = '.pylintrc') -> Dict[str, Any]:
        """Run incremental check"""
        if directories is None:
            directories = ['.']

        # Discover all Python files
        all_files = []
        for directory in directories:
            dir_path = self.project_root / directory
            if dir_path.exists():
                all_files.extend(dir_path.rglob('*.py'))

        # Filter out files that shouldn't be checked
        filtered_files = [
            f for f in all_files
            if not any(exclude in str(f) for exclude in [
                '__pycache__', '.git', '.venv', 'venv', '.tox'
            ])
        ]

        print(f"Found {len(filtered_files)} Python files")

        # Get changed files
        changed_files = self.get_changed_files(filtered_files, config_file)

        if not changed_files:
            print("No files changed, using cached results")
            cached_results = self.load_results_cache()
            all_messages = []
            for file_messages in cached_results.values():
                all_messages.extend(file_messages)

            return {
                'total_files': len(filtered_files),
                'changed_files': 0,
                'cached_files': len(filtered_files),
                'total_messages': len(all_messages),
                'messages': all_messages,
                'incremental': True
            }

        # Run Pylint check on changed files
        new_results = self.run_pylint_on_files(list(changed_files), config_file)

        # Load cached results
        cached_results = self.load_results_cache()

        # Update results cache; changed files with no remaining findings get an empty entry
        for changed_file in changed_files:
            relative_path = str(changed_file.relative_to(self.project_root))
            cached_results[relative_path] = new_results.get(relative_path, [])

        # Remove results for deleted files
        existing_files = {str(f.relative_to(self.project_root)) for f in filtered_files}
        cached_results = {
            k: v for k, v in cached_results.items()
            if k in existing_files
        }

        # Save updated cache
        self.save_results_cache(cached_results)
        self.update_file_hashes(filtered_files)

        # Aggregate all results
        all_messages = []
        for file_messages in cached_results.values():
            all_messages.extend(file_messages)

        return {
            'total_files': len(filtered_files),
            'changed_files': len(changed_files),
            'cached_files': len(filtered_files) - len(changed_files),
            'total_messages': len(all_messages),
            'messages': all_messages,
            'incremental': True
        }

    def clear_cache(self):
        """Clear all cache"""
        cache_files = [
            self.file_hashes_cache,
            self.results_cache,
            self.config_cache
        ]

        for cache_file in cache_files:
            if cache_file.exists():
                cache_file.unlink()

        print("Cache cleared")

    def get_cache_stats(self) -> Dict[str, Any]:
        """Get cache statistics"""
        stats = {
            'cache_dir': str(self.cache_dir),
            'cache_size': 0,
            'files': {}
        }

        cache_files = [
            ('file_hashes', self.file_hashes_cache),
            ('results', self.results_cache),
            ('config', self.config_cache)
        ]

        for name, path in cache_files:
            if path.exists():
                size = path.stat().st_size
                stats['cache_size'] += size
                stats['files'][name] = {
                    'path': str(path),
                    'size': size,
                    'exists': True,
                    'modified': datetime.fromtimestamp(path.stat().st_mtime).isoformat()
                }
            else:
                stats['files'][name] = {
                    'path': str(path),
                    'size': 0,
                    'exists': False
                }

        return stats
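
A minimal driver for the incremental checker, with an illustrative project path, might look like this:

# run_incremental_example.py -- illustrative driver for IncrementalPylintChecker
from incremental_pylint import IncrementalPylintChecker

checker = IncrementalPylintChecker('/path/to/project')
summary = checker.run_incremental_check(directories=['src'], config_file='.pylintrc')
print(f"{summary['changed_files']} changed, {summary['cached_files']} served from cache, "
      f"{summary['total_messages']} messages in total")

# Inspect the cache, or wipe it to force a full re-check.
print(checker.get_cache_stats())
# checker.clear_cache()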

Performance Optimization Best Practices
  1. Parallel Processing: Use multi-core CPUs to check files in parallel
  2. Incremental Checking: Only check changed files to avoid repeated analysis
  3. Caching Mechanism: Cache analysis results and file hashes between runs
  4. Sharded Processing: Process large codebases in shards to control memory usage (see the memory-limit sketch after this list)
  5. Rule Optimization: Disable unnecessary checks and focus on core quality issues
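
One way to enforce the memory limit mentioned in item 4 is to cap the address space of each pylint subprocess. The sketch below is a rough, POSIX-only illustration; the 2 GiB cap, the helper names, and the use of preexec_fn are assumptions rather than an established recipe.

# memory_limited_shard.py -- illustrative sketch of a memory-capped pylint shard (POSIX only)
import resource
import subprocess

MEMORY_LIMIT_BYTES = 2 * 1024 ** 3  # illustrative 2 GiB cap per shard

def _limit_memory():
    """Applied in the child process just before pylint starts."""
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))

def check_shard(files, rcfile='.pylintrc'):
    """Run pylint on one shard of files with an address-space limit."""
    cmd = ['pylint', f'--rcfile={rcfile}', '--output-format=json',
           '--reports=no', '--score=no'] + [str(f) for f in files]
    return subprocess.run(cmd, capture_output=True, text=True,
                          preexec_fn=_limit_memory)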

Precautions
  1. Resource Limits: Monitor CPU and memory usage to avoid overloading the system
  2. Network Latency: Account for network transfer overhead in distributed execution
  3. Cache Consistency: Ensure the cache stays consistent with code changes
  4. Error Recovery: Implement retry and recovery mechanisms for failed tasks (a retry sketch follows this list)
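
For the error-recovery point, a simple retry wrapper around the chunk runner from earlier in this chapter can absorb transient failures such as timeouts. The retry count and backoff below are illustrative assumptions:

# retry_chunk.py -- illustrative retry wrapper for failed chunks
import time

def run_chunk_with_retry(runner, chunk, max_retries=2, backoff_seconds=5):
    """Retry a chunk a few times before giving up (runner is a ParallelPylintRunner)."""
    result = None
    for attempt in range(max_retries + 1):
        result = runner.run_pylint_on_chunk(chunk)
        # Timeouts and unexpected errors set an 'error' key; anything else is final.
        if 'error' not in result:
            return result
        if attempt < max_retries:
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
    return result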

By combining parallel execution, incremental checking, caching, and distributed execution, Pylint can remain fast even on very large codebases, making code quality management practical at scale.