Mastering High-Performance CSV Imports: From Collections to MySQL Native Loaders

Overview: The Challenge of the Million-Row Import

Importing data is a deceptive task. It starts simple: read a file, loop through the rows, and save them to a database. But as the row count climbs from hundreds to millions, the "simple" way becomes a catastrophic failure. Memory limits are breached, database connections time out, and processing times stretch from seconds into hours. This guide explores how to scale a CSV import from a basic collection-based approach to high-performance methods using PDO and native MySQL features.

We will operate under three strict constraints to simulate a real-world production environment: a memory limit of 256MB, a refusal to use background queues/jobs for this specific exercise, and a focus on vanilla PHP and Laravel tools without external packages. The goal is to reach a sub-20-second import for one million records.

Prerequisites

Before diving in, you should have a solid grasp of the following:

  • PHP 8.x: Familiarity with closures, arrays, and file handles.
  • Laravel 11: Basic understanding of Eloquent and the Laravel framework.
  • MySQL: Knowledge of tables and basic SQL syntax.
  • CLI: Comfort running Laravel Artisan commands.

Key Libraries & Tools

  • Laravel 11: Our primary environment, specifically using the new Concurrency facade.
  • PDO: Used for low-level database interactions to bypass Eloquent overhead.
  • MySQL: The target database engine.
  • fgetcsv(): A native PHP function that reads a line from a file pointer and parses it into CSV fields.

Code Walkthrough: Evolution of an Import

Phase 1: The Collection Pitfall

The most intuitive Eloquent approach is loading the entire file into a collection. While readable, it is the least efficient.

$rows = file($filePath);
$data = collect($rows)->skip(1)->map(function ($line) {
    $row = str_getcsv($line);
    return [
        'name' => $row[1],
        'email' => $row[2],
        // ... more fields
    ];
});

foreach ($data as $customer) {
    Customer::create($customer);
}

Why it fails: The file() function reads the entire file into memory at once. At 100,000 rows, you hit the 256MB limit. Furthermore, Customer::create() triggers a separate INSERT for every row, making it incredibly slow.

Phase 2: Streamed File Pointers and PDO Chunks

To scale, we must stop loading the whole file. Instead, we use fopen() and fgetcsv() to stream the file line by line. To improve database performance, we move away from Eloquent and use PDO prepared statements with manual chunking.

$handle = fopen($filePath, 'r');
fgetcsv($handle); // Skip header

$pdo = DB::connection()->getPdo();
// Add further columns and placeholders as your schema requires
$stmt = $pdo->prepare("INSERT INTO customers (name, email) VALUES (?, ?)");

while (($row = fgetcsv($handle)) !== false) {
    $stmt->execute([$row[1], $row[2]]);
}
fclose($handle);

Executing 1,000,000 individual queries through a single prepared statement keeps the memory footprint near zero. However, the import still takes approximately 3 minutes. To go faster, we must bundle the rows into "multi-insert" statements.
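A multi-insert pass might look like the following sketch (not the article's exact code; the customers table with name and email columns is carried over from the earlier examples, and the chunk size is derived from the placeholder limit discussed under Tips & Gotchas):

```php
$handle = fopen($filePath, 'r');
fgetcsv($handle); // skip header

$pdo = DB::connection()->getPdo();

$columns   = 2;                       // name, email
$chunkSize = intdiv(65535, $columns); // stay under the placeholder limit
$buffer    = [];

// Build one INSERT with a (?, ?) group per buffered row, then flatten
// the buffer into a single flat array of bind values.
$flush = function (array $rows) use ($pdo) {
    $placeholders = rtrim(str_repeat('(?, ?),', count($rows)), ',');
    $stmt = $pdo->prepare("INSERT INTO customers (name, email) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));
};

while (($row = fgetcsv($handle)) !== false) {
    $buffer[] = [$row[1], $row[2]];
    if (count($buffer) >= $chunkSize) {
        $flush($buffer);
        $buffer = [];
    }
}
if ($buffer) {
    $flush($buffer); // remaining partial chunk
}
fclose($handle);
```

Because the number of rows in the final chunk varies, the statement is re-prepared per flush rather than reused.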

Phase 3: Parallel Processing with Laravel Concurrency

Laravel 11 introduced a Concurrency facade that allows us to run tasks in parallel. We can divide the file processing across 10 different worker processes. We use a modulo strategy to ensure each process only handles its assigned lines (e.g., worker 0 handles lines 0, 10, 20...).

$tasks = [];
foreach (range(0, 9) as $i) {
    $tasks[] = function () use ($filePath, $i) {
        DB::reconnect(); // each process needs its own connection (see Gotchas)
        $handle = fopen($filePath, 'r');
        fgets($handle); // skip header
        $currentLine = 0;
        while (($line = fgets($handle)) !== false) {
            if ($currentLine % 10 !== $i) {
                $currentLine++;
                continue; // not this worker's line
            }
            // Process and insert the line here...
            $currentLine++;
        }
        fclose($handle);
    };
}

Concurrency::run($tasks);

Performance Note: This method slashed the 1,000,000-row import time down to 4 seconds. Parallelization spreads the parsing and I/O work across multiple CPU cores simultaneously.

Syntax Notes: The Power of fgetcsv vs str_getcsv

During high-volume processing, the choice of function matters. fgetcsv() is designed to work with file handles, making it perfect for streaming. On the other hand, str_getcsv() is better when you already have a string in memory (like when using fgets()). In the parallel approach, we use fgets() first to quickly skip rows we don't own, and only call str_getcsv() on the rows we intend to process. This saves significant CPU cycles.
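The skip-cheaply-then-parse pattern can be sketched as follows (a minimal illustration, assuming a worker that owns every tenth line, as in the parallel example above):

```php
$handle = fopen('customers.csv', 'r');

// fgetcsv(): reads the next line from the handle AND parses it as CSV.
$header = fgetcsv($handle);

// fgets() + str_getcsv(): read the raw line cheaply, parse only when needed.
$workerId = 0; // assumed: this worker's slot out of 10
$lineNo   = 0;
while (($line = fgets($handle)) !== false) {
    if ($lineNo++ % 10 !== $workerId) {
        continue; // skipped without paying for CSV parsing
    }
    $fields = str_getcsv($line);
    // ... insert $fields ...
}
fclose($handle);
```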

Practical Examples

  • E-commerce Migration: Moving millions of customer records or legacy orders from an old system to a new
    Laravel
    app.
  • Financial Reporting: Importing daily transaction logs for analysis where speed is critical to meet processing windows.
  • Data Warehousing: Seeding local development environments with massive datasets to test performance at scale.

Tips & Gotchas

  • The PDO Placeholder Limit: MySQL via PDO has a limit on the number of placeholders (?) in a single query (usually 65,535). If you chunk 10,000 rows with 10 columns each, you will hit this limit. Always calculate your chunk size as limit / number of columns.
  • MySQL local-infile: The absolute fastest method is the native LOAD DATA LOCAL INFILE command. However, this is often disabled for security. If you use it, you must enable it on both the PDO client and the MySQL server (SET GLOBAL local_infile = 1).
  • Database Reconnection: When using the Concurrency facade, each process might need to explicitly reconnect to the database to avoid sharing a socket, which can cause intermittent connection errors.
  • Date Formats: Databases are picky. Ensure you format your CSV dates (e.g., Y-m-d H:i:s) before inserting, or use SQL functions like NOW() for timestamps to reduce PHP memory usage.