Mastering High-Performance CSV Imports: From Collections to MySQL Native Loaders
Overview: The Challenge of the Million-Row Import
Importing data is a deceptive task. It starts simple: read a file, loop through the rows, and save them to a database. But as the row count climbs from hundreds to millions, the "simple" way becomes a catastrophic failure. Memory limits are breached, database connections time out, and processing times stretch from seconds into hours. This guide explores how to scale a CSV import from a naive collection-based loop all the way to MySQL's native loader, taking a 1,000,000-row import from minutes down to seconds.
We will operate under three strict constraints to simulate a real-world production environment: a memory limit of 256MB, a refusal to use background queues/jobs for this specific exercise, and a focus on vanilla Laravel and PHP tooling rather than third-party import packages.
Prerequisites
Before diving in, you should have a solid grasp of the following:
- PHP 8.x: Familiarity with closures, arrays, and file handles.
- Laravel 11: Basic understanding of Eloquent and the Laravel framework.
- MySQL: Knowledge of tables and basic SQL syntax.
- CLI: Comfort running Laravel commands.
Key Libraries & Tools
- Laravel: Our primary environment, specifically using the new Concurrency facade.
- PDO: Used for low-level database interactions to bypass Eloquent overhead.
- MySQL: The target database engine.
- fgetcsv(): A native PHP function that reads a line from a file pointer and parses it into CSV fields.
Code Walkthrough: Evolution of an Import
Phase 1: The Collection Pitfall
The most intuitive approach is to load the whole file, wrap it in a collection, and let Eloquent save each row:
$rows = file($filePath);

$data = collect($rows)->skip(1)->map(function ($line) {
    $row = str_getcsv($line);

    return [
        'name' => $row[1],
        'email' => $row[2],
        // ... more fields
    ];
});

foreach ($data as $customer) {
    Customer::create($customer);
}
Why it fails: The file() function reads the entire file into memory at once. At 100,000 rows, you hit the 256MB limit. Furthermore, Customer::create() triggers a separate INSERT for every row, making it incredibly slow.
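Even before leaving collections and Eloquent behind, the memory half of the problem can be fixed by streaming: a generator yields one parsed row at a time instead of holding the whole file. A minimal sketch (the readCustomers name is illustrative, and the field indexes follow the example above):

```php
<?php

// Stream the CSV one row at a time instead of using file(),
// which loads the entire file into memory at once.
function readCustomers(string $filePath): \Generator
{
    $handle = fopen($filePath, 'r');
    fgetcsv($handle); // skip the header row

    while (($row = fgetcsv($handle)) !== false) {
        yield [
            'name'  => $row[1],
            'email' => $row[2],
        ];
    }

    fclose($handle);
}
```

Iterating this with foreach and Customer::create() is still slow on the write side, but peak memory stays flat no matter how large the file grows.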
Phase 2: Streamed File Pointers and PDO Chunks
To scale, we must stop loading the whole file. Instead, we use fopen() and fgetcsv() to stream the file line-by-line. To improve database performance, we move away from Eloquent and talk to the database through a raw PDO prepared statement.
$handle = fopen($filePath, 'r');
fgetcsv($handle); // Skip the header row

$pdo = DB::connection()->getPdo();
$stmt = $pdo->prepare('INSERT INTO customers (name, email) VALUES (?, ?)');

while (($row = fgetcsv($handle)) !== false) {
    $stmt->execute([$row[1], $row[2]]);
}

fclose($handle);
Streaming plus a single prepared statement drops the memory footprint to nearly zero, even though we still execute 1,000,000 individual queries. The import now takes approximately 3 minutes. To go faster, we must bundle these into "multi-insert" statements.
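One way to bundle rows is to build a multi-row INSERT whose placeholder count is computed from the chunk size. A sketch under assumptions: buildMultiInsert is a hypothetical helper, and the customers columns follow the earlier examples:

```php
<?php

// Build "INSERT INTO t (c1, c2) VALUES (?, ?), (?, ?), ..." for a
// chunk of rows. Keep $rowCount * count($columns) below MySQL's
// placeholder limit of 65,535 (see Tips & Gotchas).
function buildMultiInsert(string $table, array $columns, int $rowCount): string
{
    $placeholderRow = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';

    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, $rowCount, $placeholderRow))
    );
}

// Usage sketch: flatten the chunk and run one statement per chunk.
// $stmt = $pdo->prepare(buildMultiInsert('customers', ['name', 'email'], count($chunk)));
// $stmt->execute(array_merge(...$chunk));
```

One round-trip per chunk instead of one per row is where most of the speedup comes from.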
Phase 3: Parallel Processing with Laravel Concurrency
With memory solved, the remaining bottleneck is the single process. Laravel's Concurrency facade lets us fan the work out across ten worker processes, each claiming every tenth line of the same file:

$tasks = [];

for ($i = 0; $i < 10; $i++) {
    $tasks[] = function () use ($filePath, $i) {
        $handle = fopen($filePath, 'r');
        $currentLine = 0;

        while (($line = fgets($handle)) !== false) {
            if ($currentLine % 10 !== $i) {
                $currentLine++;
                continue;
            }

            // Process and insert the line here...
            $currentLine++;
        }

        fclose($handle);
    };
}

Concurrency::run($tasks);
Performance Note: This method slashed the 1,000,000-row import time down to 4 seconds. Parallelization allows the CPU to utilize multiple cores to handle the parsing and I/O simultaneously.
Syntax Notes: The Power of fgetcsv vs str_getcsv
During high-volume processing, the choice of function matters. fgetcsv() is designed to work with file handles, making it perfect for streaming. On the other hand, str_getcsv() is better when you already have a string in memory (like when using fgets()). In the parallel approach, we use fgets() first to quickly skip rows we don't own, and only call str_getcsv() on the rows we intend to process. This saves significant CPU cycles.
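The division of labor described above can be sketched as a small helper: fgets() reads every line cheaply, and str_getcsv() runs only on the lines this worker owns. The ownedRows name and the worker count are illustrative:

```php
<?php

// Return only the rows assigned to this worker, CSV-parsing just those.
function ownedRows(string $filePath, int $workerIndex, int $workers = 10): array
{
    $handle = fopen($filePath, 'r');
    $rows = [];
    $currentLine = 0;

    while (($line = fgets($handle)) !== false) {
        if ($currentLine++ % $workers !== $workerIndex) {
            continue; // cheap skip: no CSV parsing for other workers' rows
        }

        $rows[] = str_getcsv(rtrim($line, "\r\n"));
    }

    fclose($handle);

    return $rows;
}
```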
Practical Examples
- E-commerce Migration: Moving millions of customer records or legacy orders from an old system to a new Laravel app.
- Financial Reporting: Importing daily transaction logs for analysis where speed is critical to meet processing windows.
- Data Warehousing: Seeding local development environments with massive datasets to test performance at scale.
Tips & Gotchas
- The PDO Placeholder Limit: MySQL via PDO has a limit on the number of placeholders (?) in a single query, usually 65,535. If you chunk 10,000 rows with 10 columns each, you will hit this limit. Always calculate your chunk size as limit / number of columns.
- MySQL local-infile: The absolute fastest method is the native LOAD DATA LOCAL INFILE command. However, this is often disabled for security. If you use it, you must enable it on both the PDO client and the MySQL server (SET GLOBAL local_infile = 1).
- Database Reconnection: When using the Concurrency facade, each process may need to explicitly reconnect to the database to avoid sharing a socket, which can cause intermittent connection errors.
- Date Formats: Databases are picky. Format your CSV dates (e.g., Y-m-d H:i:s) before inserting, or use SQL functions like NOW() for timestamps to reduce PHP memory usage.
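For the LOAD DATA LOCAL INFILE route mentioned in the tips, the statement can be assembled roughly as below. A sketch, not a hardened implementation: the table, columns, and buildLoadDataSql helper are assumptions, and the var_export() quoting is only acceptable for a trusted, server-controlled path:

```php
<?php

// Assemble a LOAD DATA LOCAL INFILE statement for a headered CSV.
function buildLoadDataSql(string $filePath, string $table, array $columns): string
{
    return sprintf(
        'LOAD DATA LOCAL INFILE %s INTO TABLE %s ' .
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " .
        "LINES TERMINATED BY '\\n' IGNORE 1 LINES (%s)",
        var_export($filePath, true), // naive quoting; fine for a trusted path
        $table,
        implode(', ', $columns)
    );
}

// The PDO client must also opt in to local-infile when connecting:
// $pdo = new PDO($dsn, $user, $pass, [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);
// $pdo->exec(buildLoadDataSql('/tmp/customers.csv', 'customers', ['name', 'email']));
```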
