Mastering Python Refactoring: Decoupling Logic and Configuration for Scale

Overview

Writing code that works is only half the battle. In software engineering, the real challenge lies in making that code maintainable, testable, and flexible. When dealing with complex tasks like web scraping or PDF analysis, scripts often start as a single, monolithic file where configuration, logic, and external dependencies are tightly coupled.

This tutorial focuses on high-level Python refactoring techniques. We will dismantle "god classes" that instantiate their own subclasses (a major anti-pattern) and replace them with clean functions and Python Protocols. Furthermore, we will explore how to move hardcoded strings and settings into external JSON configuration files, allowing the application to change behavior without a single line of code being rewritten.

Prerequisites

Refactoring A PDF And Web Scraper Part 2 // CODE ROAST

To get the most out of this guide, you should be comfortable with:

  • Intermediate Python syntax (classes, functions, and decorators).
  • The concepts of OOP and composition.
  • Type hinting and why it matters for modern development.
  • A basic understanding of Python data classes.

Key Libraries & Tools

  • Python Protocols: part of the typing module, used for structural subtyping (static duck typing).
  • Pandas: used for data manipulation, specifically handling data frames in the scraper.
  • tqdm: a library for displaying smart progress bars during long-running loops.
  • JSON: the standard format for our external configuration files.
  • Hydra (mentioned): a framework for elegantly configuring complex applications.

Code Walkthrough: From Classes to Functions

One of the biggest issues in the original code was a ScrapeRequest class that was responsible for creating its own subclasses. This creates a circular dependency and makes the code difficult to extend. We solve this by using plain functions and Python Protocols.

1. Defining the Scraper Protocol

Instead of a rigid class hierarchy, we define what a "scraper" looks like using a Protocol. Any class that has a scrape method matching this signature is now a valid scraper.

from typing import Protocol
from dataclasses import dataclass

@dataclass
class ScrapeResult:
    keywords: list[str]
    word_frequencies: dict[str, int]

class Scraper(Protocol):
    def scrape(self, search_text: str) -> ScrapeResult:
        ...
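To see structural subtyping in action, here is a toy class that satisfies the protocol without ever inheriting from it. WordCountScraper is a hypothetical stand-in for a real web or PDF scraper, and ScrapeResult and Scraper are repeated so the snippet runs on its own:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ScrapeResult:  # repeated from above
    keywords: list[str]
    word_frequencies: dict[str, int]

class Scraper(Protocol):  # repeated from above
    def scrape(self, search_text: str) -> ScrapeResult: ...

class WordCountScraper:
    """Satisfies the Scraper protocol structurally -- it never inherits from it."""

    def scrape(self, search_text: str) -> ScrapeResult:
        # Hypothetical stand-in: a real scraper would fetch and parse a document.
        words = search_text.lower().split()
        return ScrapeResult(
            keywords=sorted(set(words)),
            word_frequencies=dict(Counter(words)),
        )

scraper: Scraper = WordCountScraper()  # the type checker accepts this assignment
print(scraper.scrape("python refactoring python").word_frequencies)
# {'python': 2, 'refactoring': 1}
```

Because the check is structural, WordCountScraper can live in a module that never imports Scraper at all.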

2. Refactoring Requests into Functions

We don't need a class for every type of request. By converting them into functions, we simplify the flow. These functions now accept a Scraper instance as a dependency.

def fetch_terms_from_doi(target: str, scraper: Scraper) -> ScrapeResult:
    # Logic to process target and call the scraper
    result = scraper.scrape(target)
    return result

3. Centralizing Logging

Duplicate logging logic is a maintenance nightmare. We create a dedicated log.py to handle both file logging and console printing in one place.

import logging

# Configure file logging once; without this, logging.info() is dropped by default.
logging.basicConfig(filename="scraper.log", level=logging.INFO)

def log_message(message: str) -> None:
    logging.info(message)  # written to the log file
    print(message)         # echoed to the console

The Power of External Configuration

Hardcoding paths, URLs, and word lists directly into your logic makes your script brittle. If you want to share your tool with a non-programmer, they shouldn't have to touch Python code to change the input folder. We use Python data classes to map JSON data into a typed object.

import json
from dataclasses import dataclass

@dataclass
class ScrapeConfig:
    export_dir: str
    paper_folder: str
    target_words_file: str

def read_config(config_file: str) -> ScrapeConfig:
    with open(config_file, "r") as f:
        data = json.load(f)
    return ScrapeConfig(**data)

By passing this ScrapeConfig object down the call stack, we ensure that every component has access to the settings it needs without relying on global variables.
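A matching config file might look like the JSON below (the directory and file names are illustrative). Parsing it with json.loads shows how the keys map onto the fields; ScrapeConfig is repeated so the snippet runs on its own:

```python
import json
from dataclasses import dataclass

@dataclass
class ScrapeConfig:  # repeated from above
    export_dir: str
    paper_folder: str
    target_words_file: str

# An illustrative config.json -- the keys must match the field names exactly.
config_text = """
{
    "export_dir": "output",
    "paper_folder": "papers",
    "target_words_file": "keywords.txt"
}
"""

config = ScrapeConfig(**json.loads(config_text))
print(config.paper_folder)
# papers
```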

Syntax Notes

  • Protocol: This is a powerful feature of Python's typing system. Unlike traditional inheritance, a class doesn't need to explicitly inherit from Scraper to be considered a Scraper. It just needs the right method.
  • Unpacking Operator (**data): We use the double asterisk to unpack a dictionary directly into the initializer of a Python data class. This only works if the keys in the JSON exactly match the field names in the class.
  • Context Managers: Always use with open(...) for file operations and directory changes to ensure resources are cleaned up even if an error occurs.
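The key-matching rule behind ScrapeConfig(**data) is easy to demonstrate with a minimal stand-in dataclass (Point here is purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Point:  # a minimal stand-in for ScrapeConfig
    x: int
    y: int

Point(**{"x": 1, "y": 2})       # works: every key matches a field name

try:
    Point(**{"x": 1, "z": 2})   # 'z' is not a field
except TypeError as err:
    print(err)  # TypeError: ... got an unexpected keyword argument 'z'
```

This is why a stray or renamed key in config.json fails loudly at load time instead of silently producing a half-filled object.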

Practical Examples

This refactoring approach is essential for:

  • Data Science Pipelines: Where file paths and filtering parameters change with every experiment.
  • CI/CD Environments: Where different configurations are needed for testing, staging, and production.
  • User-Facing Tools: Allowing users to modify a simple config.json instead of editing source code.

Tips & Gotchas

  • Avoid Instance Variable Bloat: Don't store temporary data as self.variable in a class if it's only used within a single method. Use local variables to keep the object state clean.
  • Type Checking Gaps: Libraries like tqdm and Pandas don't always have perfect type hints. You might encounter "Unknown" types; use typing.Any or # type: ignore sparingly when these external tools fail the linter.
  • Configuration Trickle: High-level objects should receive the whole ScrapeConfig, but low-level helpers should only receive the specific strings or sets they need. This keeps the low-level code reusable in other projects that don't use your specific config structure.
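The configuration trickle can be sketched as follows. The function names run_pipeline and load_target_words are illustrative, and ScrapeConfig is repeated so the snippet stands alone:

```python
from dataclasses import dataclass

@dataclass
class ScrapeConfig:  # repeated from above
    export_dir: str
    paper_folder: str
    target_words_file: str

def load_target_words(path: str) -> list[str]:
    # Low level: depends only on a plain path, so it is reusable
    # in projects that have no ScrapeConfig at all.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def run_pipeline(config: ScrapeConfig) -> list[str]:
    # High level: receives the whole config and passes down only
    # the specific value each helper needs.
    return load_target_words(config.target_words_file)
```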
