Refactoring Data Science Projects: Finalizing Configuration and Data Flow

ArjanCodes//Oct 22, 2021//3 min read

Overview

Most data science projects begin as experimental scripts. While this works for quick prototyping, it often leads to a tangled mess of hard-coded paths, hidden dependencies, and bloated main functions. This tutorial focuses on the final stage of refactoring a digit recognition project. We aim to move logic into dedicated runner classes, centralize configuration constants, and decouple data loading from the core model logic. By doing this, we make the project adaptable to new datasets without hunting through dozens of source files.

Prerequisites

To follow this walkthrough, you should have a solid grasp of Python fundamentals, including classes and functions. Familiarity with PyTorch concepts like DataLoaders and training loops is highly recommended.

Key Libraries & Tools

Python: The primary language for the implementation.
pathlib: A standard-library module for object-oriented filesystem paths.
PyTorch: The deep learning framework used for the model and training logic.

Code Walkthrough: Cleaning the Main Loop

The first step involves extracting the training loop from main.py into a specialized function within running.py. This reduces the complexity of the entry point.

def run_epoch(test_runner, train_runner, tracker, epoch_id, epoch_total):
    # Core epoch logic moved here
    train_runner.run(epoch_id)
    test_runner.run(epoch_id)
    tracker.log_epoch(epoch_id, epoch_total)

Because the runners and the tracker are passed in as arguments, main.py no longer has to manage their internal state. We also replace the non-standard hparams dictionary with clear, uppercase constants at the top of the file. This makes hyperparameters such as BATCH_SIZE and LEARNING_RATE immediately visible and editable.
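
As a rough sketch of how the slimmed-down entry point could look, the snippet below defines the hyperparameters as module-level constants and calls run_epoch in a loop. The StubRunner and StubTracker classes, the EPOCH_COUNT constant, and all numeric values are placeholders invented for this illustration so that it runs without PyTorch; the project's real runners and tracker do considerably more.

```python
# Hyperparameters as module-level constants (values are illustrative only).
EPOCH_COUNT = 3
LEARNING_RATE = 5e-5
BATCH_SIZE = 128


class StubRunner:
    """Hypothetical stand-in for a train or test runner."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.epochs_seen: list[int] = []

    def run(self, epoch_id: int) -> None:
        # A real runner would iterate over a DataLoader here.
        self.epochs_seen.append(epoch_id)


class StubTracker:
    """Hypothetical stand-in for an experiment tracker."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def log_epoch(self, epoch_id: int, epoch_total: int) -> None:
        self.messages.append(f"epoch {epoch_id + 1}/{epoch_total}")


def run_epoch(test_runner, train_runner, tracker, epoch_id, epoch_total):
    # Same shape as the function extracted into running.py.
    train_runner.run(epoch_id)
    test_runner.run(epoch_id)
    tracker.log_epoch(epoch_id, epoch_total)


def main() -> StubTracker:
    train_runner = StubRunner("train")
    test_runner = StubRunner("test")
    tracker = StubTracker()
    for epoch_id in range(EPOCH_COUNT):
        run_epoch(test_runner, train_runner, tracker, epoch_id, EPOCH_COUNT)
    return tracker
```

With this layout, changing the number of epochs or the batch size means editing a single constant; main itself only wires the objects together.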

Decoupling Data Loading

A common mistake is hard-coding file paths inside data-loading functions. We refactor this by injecting the path from the configuration level.

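from pathlib import Path
from torch.utils.data import DataLoader

# load_image_data, load_label_data, and MNISTDataset are the project's own helpers.
def create_data_loader(batch_size: int, data_path: Path, label_path: Path, shuffle: bool = True) -> DataLoader:
    images = load_image_data(data_path)
    labels = load_label_data(label_path)
    dataset = MNISTDataset(images, labels)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)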

This approach ensures that MNISTDataset doesn't need to know where the files live. It simply receives the data it needs to operate.
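
To make the injection concrete, here is a minimal, PyTorch-free sketch. The DATA_DIR, TRAIN_DATA, and TRAIN_LABELS constants, the load_values and create_pairs helpers, and the one-number-per-line file format are all assumptions made for this example, not the project's actual layout. The only point is that the loading code receives its paths from the caller.

```python
from pathlib import Path

# Hypothetical configuration constants; in the article's setup these
# would sit alongside the other constants at the configuration level.
DATA_DIR = Path("data")
TRAIN_DATA = DATA_DIR / "train_images.txt"
TRAIN_LABELS = DATA_DIR / "train_labels.txt"


def load_values(path: Path) -> list[int]:
    """Read one integer per non-empty line from the file at `path`."""
    lines = path.read_text().splitlines()
    return [int(line) for line in lines if line.strip()]


def create_pairs(data_path: Path, label_path: Path) -> list[tuple[int, int]]:
    """Pair each data value with its label; no path is hard-coded here."""
    data = load_values(data_path)
    labels = load_values(label_path)
    return list(zip(data, labels))
```

Pointing the code at a different dataset then comes down to passing different Path objects, for example create_pairs(TRAIN_DATA, TRAIN_LABELS), without touching the loading functions.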

Syntax Notes & Best Practices

Type hints are non-negotiable for professional code. By annotating parameters with types such as Path or int, you allow your IDE to catch type errors before execution. We also follow the Information Expert principle: assign responsibilities to the class that has the most information to fulfill them. The Runner class handles metrics because it sees the raw data during the training loop.
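
A small sketch of both ideas follows. The AccuracyRunner class and its method names are hypothetical, chosen for illustration rather than taken from the project. The runner sees every prediction and label, so it is the natural owner of the accuracy calculation, and the annotations let a type checker flag a call that passes the wrong kind of argument.

```python
from dataclasses import dataclass


@dataclass
class AccuracyRunner:
    """Hypothetical runner that owns the metric it has the data for."""

    correct: int = 0
    seen: int = 0

    def record_batch(self, predictions: list[int], labels: list[int]) -> None:
        # The runner sees predictions and labels, so it updates the counts.
        if len(predictions) != len(labels):
            raise ValueError("predictions and labels must have the same length")
        self.correct += sum(p == y for p, y in zip(predictions, labels))
        self.seen += len(labels)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero before any batch was recorded.
        return self.correct / self.seen if self.seen else 0.0
```

Callers only ask the runner for its accuracy; they never need to reach into the raw batches themselves.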

Tips & Gotchas

Avoid "shotgun surgery," where one change requires edits in five different files. If you change your data directory, you should only have to update one constant in main.py. Also, keep validation and loading separate. Don't put assert statements regarding data shape inside your loading function; move those to a dedicated validation suite or unit test.
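
One way to keep validation out of the loader is to put the shape checks in their own function and exercise it from a test. The sketch below uses plain nested lists in place of tensors so it runs without PyTorch; the validate_images name and the test class are assumptions for this example, and 28 is used as the default because MNIST images are 28 by 28 pixels.

```python
import unittest


def validate_images(images: list[list[list[int]]], size: int = 28) -> None:
    """Raise ValueError unless every image is a size-by-size grid."""
    for index, image in enumerate(images):
        if len(image) != size or any(len(row) != size for row in image):
            raise ValueError(f"image {index} is not {size}x{size}")


class ValidateImagesTest(unittest.TestCase):
    def test_accepts_square_images(self) -> None:
        images = [[[0] * 28 for _ in range(28)]]
        validate_images(images)  # should not raise

    def test_rejects_wrong_shape(self) -> None:
        images = [[[0] * 27 for _ in range(28)]]
        with self.assertRaises(ValueError):
            validate_images(images)
```

Saved in a test module, this can be run with python -m unittest, which keeps the loading function free of assert statements while the shape rules still get checked.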


Source video

Refactoring A Data Science Project Part 3 - Configuration Cleanup

ArjanCodes // 16:27
