Refactoring Python: Dismantling Over-Engineered Class Hierarchies
Overview

Software development often suffers from a phenomenon where developers use complex language features simply because they exist. In this tutorial, we analyze a script designed for academic paper scraping that fell into this trap. The original code used convoluted dunder method overrides to change how class initialization works, creating a maintenance nightmare.
We will walk through the process of simplifying this architecture. The goal is to move away from rigid, deep inheritance and toward a functional, modular design. By leveraging and , we can create a codebase that is easier to read, faster to execute, and significantly simpler to test. Proper refactoring isn't just about making code look "clean"; it's about reducing the cognitive load required for the next developer to understand the logic.
Prerequisites
To get the most out of this guide, you should be comfortable with:
- Python Fundamentals: Basic syntax, loops, and list comprehensions.
- Object-Oriented Programming (OOP): Understanding classes, instances, and inheritance.
- Type Hinting: Familiarity with the
typingmodule (List,Tuple,Set). - Basic Tooling: Experience using an IDE like or .
Key Libraries & Tools
- : A built-in Python module used to create classes that primarily store data with less boilerplate.
- : The Natural Language Toolkit, used here for tokenizing text and identifying stop words.
- : A library for extracting text and metadata from PDF files.
- : Python’s built-in unit testing framework used to verify our refactored functions.
Code Walkthrough: Cleaning the PDF Scraper
The original PDFScrape class was bloated with instance variables that were only used temporarily. This created unnecessary state within the object. We start by defining a clear structure for our output using a data class.
from dataclasses import dataclass
from typing import List, Tuple
@dataclass
class ScrapeResult:
doi: string
word_score: int
frequency: List[Tuple[str, int]]
study_design: List[Tuple[str, int]]
Moving Data Out of the Class
The original code hard-coded target word lists inside the class. This makes the class difficult to reuse. We move these to constants (or eventually an external config file) and convert them to sets for better performance.
TARGET_WORDS = {"analysis", "results", "methodology"}
BYCATCH_WORDS = {"introduction", "references"}
Converting Methods to Pure Functions
One of the biggest wins in refactoring is moving logic out of classes when it doesn't need to access self. Functions like guess_doi or compute_filtered_tokens should be "pure"—meaning they only depend on their input arguments.
def guess_doi(path_name: str) -> str:
base_name = os.path.basename(path_name)
# Simplified extraction logic
doi_part = base_name.split('_')[0]
return f"doi_prefix/{doi_part}"
By moving these out, we can test guess_doi without ever needing to instantiate a heavy scraper object. This is the cornerstone of testable architecture.
Syntax Notes: The Power of Sets
In the original script, the developer wrote a custom overlap function to find common words between two lists. This is a classic example of "reinventing the wheel." Python’s set type handles this natively and much more efficiently.
Instead of iterating through lists, use the intersection operator:
# The old, slow way
overlap = [word for word in list_a if word in list_b]
# The Pythonic, fast way
overlap = set_a.intersection(set_b)
# Or even shorter:
overlap = set_a & set_b
Using sets changes the time complexity of lookups from O(n) to O(1) on average. When processing thousands of words across hundreds of academic papers, these micro-optimizations stack up.
Practical Examples
This refactoring technique applies to any data processing pipeline. Imagine you are building a log analyzer. Instead of a LogAnalyzer class with complex inheritance for different log formats (Nginx, Apache, Syslog), you can create a suite of pure parsing functions.
A main execution script then selects the appropriate function based on the file extension. This "Functional Core, Imperative Shell" pattern keeps your logic isolated from the messy details of file I/O and configuration.
Tips & Gotchas
- The Dunder New Trap: Never override
__new__unless you are doing something extremely specialized like creating a Singleton or a C-extension. If you find yourself using__new__to return instances of other classes, you actually want a Factory Pattern. - Instance Variable Pollution: Only use
self.variableif that data needs to live for the entire life of the object. If you only need it for the duration of one method, make it a local variable. This prevents bugs where one method accidentally relies on the leftover state from another. - Naming Clarity: Avoid generic names like
downloadif the function is actually reading a local file. Usescrapeorparseto accurately reflect the action. Predictable naming reduces the need for documentation. - Test Early: If a function is hard to test, it is probably too tightly coupled. Breaking it down into smaller, pure functions makes writing unit tests effortless.
- 20%· libraries
- 10%· libraries
- 10%· libraries
- 10%· software
- 10%· programming languages
- Other topics
- 40%

Refactoring A PDF And Web Scraper Part 1 // CODE ROAST
WatchArjanCodes // 37:44
On this channel, I post videos about programming and software design to help you take your coding skills to the next level. I'm an entrepreneur and a university lecturer in computer science, with more than 20 years of experience in software development and design. If you're a software developer and you want to improve your development skills, and learn more about programming in general, make sure to subscribe for helpful videos. I post a video here every Friday. If you have any suggestion for a topic you'd like me to cover, just leave a comment on any of my videos and I'll take it under consideration. Thanks for watching!