Pandas

Products

May 2021 • 1 video • ArjanCodes

Oct 2021 • 1 video • ArjanCodes

Dec 2021 • 2 videos • ArjanCodes

Jun 2022 • 1 video • ArjanCodes

Aug 2022 • 1 video • ArjanCodes

Jan 2023 • 1 video • ArjanCodes

Mar 2023 • 1 video • ArjanCodes

Apr 2023 • 1 video • ArjanCodes

Aug 2023 • 1 video • ArjanCodes

Nov 2023 • 1 video • ArjanCodes

Aug 2024 • 1 video • ArjanCodes

Dec 2024 • 1 video • ArjanCodes

Jan 2025 • 1 video • ArjanCodes

Feb 2025 • 1 video • ArjanCodes

May 2025 • 1 video • ArjanCodes

Jun 2025 • 1 video • ArjanCodes

Aug 2025 • 1 video • ArjanCodes

Sep 2025 • 1 video • ArjanCodes

Jun 2026 • 1 video • AI Engineer

TL;DR

ArjanCodes (10 mentions) places Pandas within broader discussions of efficient Python programming, demonstrating its use with DataFrames in videos like "10 Python Features You’re Not Using (But Really Should)" and when optimizing tasks, as shown in "The Lazy Loading Pattern".

// AI Engineer
The Costly Mess of Unstructured PDFs

Unstructured data sits in silos. Organizations run on PDFs, slide decks, and scanned invoices, but large language models cannot read them natively. Simple text parsers fail quickly because they merge side-by-side columns, drop images, or turn tables into unreadable strings.

Bad parsing leads to bad outputs. A recent academic paper highlighted this issue when an AI tool merged two separate columns from an old, scanned document, creating a non-existent word that other researchers eventually cited. Accurate extraction is the foundation of reliable AI systems. An open-source parser called Docling, supported by the Linux Foundation, resolves these layout issues locally without sending sensitive files to third-party APIs.

Prerequisites and Tooling

To follow this tutorial, you need a basic understanding of Python and command-line interfaces. We will use several libraries to build our processing pipeline:

* **Docling**: The core document conversion library.
* **Ollama**: For running local vision language models.
* **Pydantic**: For structured data validation.

Install the primary package using pip:

```bash
pip install docling
```

Extracting Tables and Text with DocumentConverter

The central class in the library is `DocumentConverter`. This component orchestrates layout analysis and Optical Character Recognition (OCR) models to identify headers, text blocks, images, and tables. Here is how to run a basic conversion:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")
print(result.document.export_to_markdown())
```

This script downloads a remote research PDF, analyzes the structure, and outputs clean Markdown.
If your documents contain critical tabular data, you can isolate those elements and convert them directly into Pandas DataFrames for analysis:

```python
# Continues from the conversion above: `result` is the output of converter.convert(...)
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df.head())
```

Because Docling represents the parsed output as a Pydantic model, you can programmatically inspect pages, bounding boxes, and specific document segments without writing fragile regular expressions.

Chunkless RAG and Agentic Workflows

Traditional Retrieval-Augmented Generation (RAG) pipelines cut text into arbitrary, fixed-size chunks, encode them into vectors, and store them in databases. This process often breaks paragraph context. Using structured document parsing, you can implement chunkless RAG. Instead of a vector database, the Markdown outline of the document serves as your index:

```python
# Use the structural outline as the search index
outline = result.document.export_to_markdown()
```

An LLM agent reviews this hierarchical outline first, identifies the exact section containing the answer, and retrieves only that specific block of text. This setup completely bypasses embedding models and vector search.

Scaling with Microservices and MCP

Processing thousands of documents in a single script is slow. You can scale your pipeline by running the parser as a REST API service:

```bash
pip install docling-serve
docling-serve --port 8000
```

For developer environments, configure the Docling Model Context Protocol (MCP) server in your AI editor. This lets development assistants natively parse local files during chat sessions. When running these workflows, watch your CPU resources; layout detection and image OCR are intensive tasks.
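The outline-as-index idea from the chunkless RAG section can be sketched in plain Python, without any Docling calls. This is a minimal illustration under stated assumptions, not the video's implementation: `split_by_headings` and `pick_section` are hypothetical helpers, the sample Markdown is made up, and the word-overlap scoring is only a stand-in for the step where an LLM agent reads the outline and picks a section.

```python
import re


def split_by_headings(markdown: str) -> dict[str, str]:
    """Map each Markdown heading to the text that follows it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            current = match.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections


def pick_section(question: str, headings: list[str]) -> str:
    """Stand-in for the LLM step: pick the heading sharing the most words with the question."""
    words = set(re.findall(r"\w+", question.lower()))
    return max(headings, key=lambda h: len(words & set(re.findall(r"\w+", h.lower()))))


# Hypothetical parsed document; in practice this would be the exported Markdown.
sample = (
    "# Intro\nDocling parses PDFs.\n\n"
    "## Table extraction\nTables become DataFrames.\n\n"
    "## Scaling\nRun it as a service.\n"
)

sections = split_by_headings(sample)
chosen = pick_section("How does table extraction work?", list(sections))
print(chosen)
print(sections[chosen].strip())
```

Only the chosen section's text would be passed on to the model. Note that this simple heading regex would misread `#` comments inside fenced code blocks, so a real pipeline would more likely walk the structured document object than re-parse Markdown.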
Jun 28, 2026
// ArjanCodes
Sep 26, 2025
// ArjanCodes
Aug 29, 2025
// ArjanCodes
Jun 27, 2025
// ArjanCodes
May 16, 2025
// ArjanCodes
Overview: The Analytic Power of DuckDB

DuckDB represents a shift in how we handle local data analysis. While SQLite dominates transactional workloads, it often struggles with the heavy aggregation and scanning required for big data analytics. DuckDB fills this gap as a relational database management system (RDBMS) designed specifically for analytical workloads. It operates as an embedded database, meaning it runs directly inside your application process without the overhead of a separate server. This architecture allows for lightning-fast querying of Pandas DataFrames, CSVs, and Parquet files using standard SQL.

Prerequisites

To follow this guide, you should have a basic understanding of Python and SQL syntax. Familiarity with Pandas DataFrames is helpful, as DuckDB's primary advantage is its ability to interface with these objects. Ensure you have a Python environment ready (version 3.8+ recommended).

Key Libraries & Tools

- **DuckDB**: The core engine for analytical SQL queries.
- **Pandas**: The industry-standard library for data manipulation in Python.
- **uv**: A high-performance Python package and project manager used for dependency installation.
- **Jupyter Notebook**: An interactive computing environment for testing queries.

Code Walkthrough: Querying DataFrames Directly

One of the most impressive features of DuckDB is its "Python magic": the ability to recognize local variables within a SQL string.

```python
import pandas as pd
import duckdb

# Create a sample DataFrame
df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [150000, 90000]})

# Query the DataFrame variable 'df' directly using SQL
result = duckdb.query("SELECT * FROM df WHERE salary > 100000").to_df()
print(result)
```

DuckDB inspects the calling scope to find the variable name used in the `FROM` clause. While this is convenient, it can confuse static analysis tools like Pylance, which may flag the variable as unused.
For cleaner code, I recommend explicit registration:

```python
# `df` is the DataFrame created in the previous example
con = duckdb.connect()
con.register("employees", df)
filtered_df = con.execute("SELECT * FROM employees").df()
```

Persistent vs. In-Memory Storage

By default, `duckdb.connect()` creates an in-memory database. This is perfect for unit tests where you want a clean state for every run. However, once the connection closes, the data vanishes. To save your work, specify a file path:

```python
import duckdb

# This creates a persistent database file on disk
con = duckdb.connect("company_data.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS staff AS SELECT * FROM 'data.csv'")
```

Advanced SQL Extensions

DuckDB includes powerful diagnostic tools that usually require heavy enterprise databases. Use `DESCRIBE` to see schema details, or `SUMMARIZE` to get instant statistics like percentiles and null counts. If a query is running slowly, prepend it with `EXPLAIN` to see the physical execution plan, including filters and projections.

Tips & Gotchas

- **Explicit is Better**: Always use `con.register()` to avoid IDE errors and make data lineage clear.
- **Thread Safety**: DuckDB supports multithreading, but ensure you manage connections properly when using the `threading` or `multiprocessing` modules.
- **CSV Performance**: While DuckDB reads CSVs quickly, repeatedly scanning massive files in an in-memory database will slow down your scripts. Use persistent storage for large datasets.
Feb 7, 2025
// ArjanCodes
Moving from a competent coder to an elite developer requires more than just knowing syntax; it demands a deep understanding of the language's internal philosophy. Python provides a unique set of tools that, when used correctly, create code that is not only functional but elegant and highly maintainable. These ten strategies bridge the gap between basic script writing and professional software engineering.

Data Structures and Lazy Evaluation

To write truly Pythonic code, you must move beyond the basic for-loop. Python offers comprehensions that extend far beyond simple lists. By using dictionary and set comprehensions, you can transform data in a single, readable line without the overhead of manual initialization. However, the real efficiency comes from generators. Unlike lists that store every element in memory, generators utilize lazy evaluation. They produce values on demand using the `yield` keyword, which is essential when processing massive datasets or real-time streams where memory conservation is paramount. Mastering the shift from eager to lazy evaluation is a hallmark of a mature developer.

The Power of Advanced Formatting and Built-ins

Modern Python development favors f-strings for string manipulation. These aren't just for variable interpolation; they support complex expressions and specialized formatting. You can center text, truncate floating points, or even use the debugging syntax `{var=}` to print both the name and value of a variable instantly. This readability should extend to your use of built-in functions. Many developers reinvent the wheel by manually tracking indices or merging lists. Using `enumerate()`, `zip()`, and functional tools like `map()` and `filter()` simplifies your logic. These built-ins are implemented in C in CPython, so they often outperform equivalent hand-written Python loops.

Resource Management and External Ecosystems

Reliable software must handle resources like files and database connections gracefully.
Context managers, invoked via the `with` statement, automate the setup and teardown of these resources. This ensures that even if an error occurs, your files close and your database locks release. Beyond the language core, the strength of this ecosystem lies in its libraries. For data-heavy tasks, Pandas and NumPy are non-negotiable. For networking, HTTPX offers a modern alternative for API requests. Knowing when to rely on a battle-tested library versus writing a custom implementation is a vital skill for project velocity.

Structural Integrity Through Typing and Abstraction

As projects grow, clarity becomes your biggest challenge. Type annotations serve as a form of living documentation. They tell other developers exactly what a function expects and what it promises to return. When combined with abstraction tools like Abstract Base Classes (ABCs) and Protocols, you can decouple your code. ABCs provide a strict blueprint for inheritance, while Protocols allow for structural subtyping or "duck typing." This allows you to write functions that care only about what an object can *do* (like a `.log()` method) rather than what it *is*, making your system incredibly flexible and easy to test.

The Professional Workflow: Testing and Logic Choice

No code is truly finished until it is tested. Using Pytest allows you to build a safety net that catches regressions as you refactor. Effective testing often relies on the abstractions mentioned earlier, allowing you to swap real database repositories for mock versions. Finally, the most common struggle for developers is choosing the right structure: functions, classes, or data classes. Use functions for stateless logic, data classes for pure data containers, and full classes only when you need to encapsulate both state and complex behavior. Balancing these choices ensures your codebase remains lean and purposeful.

Python offers a path to simplicity through sophisticated tools.
By integrating these ten principles, you ensure your code is not just working, but built to last.
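A few of the features named above (generators with `yield`, f-string formatting including `{var=}`, and `enumerate()` with `zip()`) fit into one short sketch. The function and variable names are invented for illustration; only the language features themselves come from the summary.

```python
from collections.abc import Iterator


def squares(limit: int) -> Iterator[int]:
    """Yield squares one at a time instead of building a full list."""
    for n in range(limit):
        yield n * n


# Lazy evaluation: creating the generator computes nothing yet.
lazy = squares(1_000_000)
first_three = [next(lazy) for _ in range(3)]  # only three values are ever produced

# f-string features: debug syntax, centering, and float truncation.
total = 3
debug_line = f"{total=}"
centered = f"{'hi':^6}"
rounded = f"{3.14159:.2f}"

# enumerate() and zip() replace manual index bookkeeping.
names = ["Ada", "Bob"]
scores = [90, 75]
pairs = [
    f"{i}: {name} scored {score}"
    for i, (name, score) in enumerate(zip(names, scores), start=1)
]

print(first_three, debug_line, repr(centered), rounded, pairs)
```

The generator is the part that matters most for large datasets: the million-element range is never materialized, because values are only produced when `next()` asks for them.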
Jan 24, 2025
// ArjanCodes
Overview

Most developers fall into the trap of over-engineering early in a project. We often reach for complex design patterns like Model-View-Controller (MVC) or the Command pattern because they feel like the professional way to build. However, as this exploration of the Data Validator CLI demonstrates, excessive abstraction can drown your logic in boilerplate. This guide focuses on identifying "pattern fatigue" and refactoring a class-heavy Python application into a streamlined, functional, and testable tool.

We are looking at an interactive shell designed to load CSV files, filter data, and perform validations. While the original architecture used separate classes for every possible user command, we will strip away that complexity. By favoring functions over classes and Protocols over Abstract Base Classes (ABCs), we create a codebase that is easier to maintain and far less brittle.

Prerequisites

To follow this tutorial, you should have a solid grasp of Python (3.10+) fundamentals, including dictionaries, decorators, and basic typing. Familiarity with Pandas for data manipulation and Pytest for unit testing is highly recommended. You should also understand the concept of a CLI (Command Line Interface) and how interactive shells differ from standard script execution.

Key Libraries & Tools

* **Python**: The core programming language used for the entire application.
* **Pandas**: Used for high-performance data manipulation and loading CSV files into memory.
* **Pydantic**: Originally used for argument validation (later refactored for simplicity).
* **Pytest**: Our primary testing framework for ensuring refactored logic remains sound.
* **Typing Module**: Utilized for adding type hints, `Protocol`, and `Callable` definitions to improve code clarity.

Code Walkthrough: From Classes to Functions

The original code used a classic Command pattern where every command (e.g., `exit`, `import`, `merge`) was a separate class with an `execute` method.
This created a massive amount of file-system noise. Here is how we simplify it.

1. Decoupling the Event System

The project uses an event system to handle updates. Instead of nesting this inside a controller, we move it to a standalone module and simplify the logic. We add support for a "star" (`*`) listener, allowing one function to catch all events, which is ideal for a shell that just needs to print messages to the user.

```python
# events.py
from typing import Any, Callable

_event_listeners: dict[str, set[Callable]] = {}


def register_event(event_name: str, listener: Callable[..., None]) -> None:
    if event_name not in _event_listeners:
        _event_listeners[event_name] = set()
    _event_listeners[event_name].add(listener)


def raise_event(event_name: str, *args: Any, **kwargs: Any) -> None:
    listeners = _event_listeners.get("*", set()).union(_event_listeners.get(event_name, set()))
    for listener in listeners:
        listener(*args, **kwargs)
```

2. Refactoring Commands to Functions

There is no need for a `ShowFilesCommand` class when a simple function will do. By using a dictionary to map strings to functions, we eliminate the need for a complex Factory pattern. We also replace Pydantic models with direct validation calls to reduce the number of small, single-use classes.

```python
# commands/show_files.py
from .model import Model
from ..events import raise_event


def show_files(model: Model) -> None:
    table_names = list(model.data_frames.keys())
    message = f"Files present: {', '.join(table_names)}"
    raise_event("display_message", message)
```

3. Implementing the Command Factory

With commands now being functions, the factory becomes a simple registry. This is much easier to read and extend than a series of class registrations.
```python
# commands/factory.py
from typing import Any, Callable

from .exit import exit_app
from .show_files import show_files

CommandFunc = Callable[..., None]

COMMANDS: dict[str, CommandFunc] = {
    "exit": exit_app,
    "files": show_files,
}


def execute_command(name: str, *args: Any) -> None:
    # Look up the function registered under this name and call it
    if name in COMMANDS:
        COMMANDS[name](*args)
```

Syntax Notes: Protocols vs. ABCs

One major change in this refactor is the move from Abstract Base Classes to Protocols. ABCs require explicit inheritance (nominal subtyping), which can make your code rigid. If you want to replace the Model with a different implementation, you must inherit from the ABC. Protocols, on the other hand, use structural subtyping (often called static duck typing). As long as an object has the required methods, it matches the protocol. This is cleaner and more Pythonic.

```python
from typing import Any, Protocol


class Model(Protocol):
    def get_data(self, alias: str) -> Any: ...
    def delete_data(self, alias: str) -> None: ...
```

Practical Examples

This refactored architecture is ideal for any CLI tool that manages state in memory. For instance, a local database explorer or a file conversion utility benefits from this "flat" structure. By keeping the main entry point as a "patching" area where you register events and initialize the shell, you keep the logic of individual commands isolated and easy to test.

In a real-world scenario, you might extend this by:

1. **Adding a Logger**: Instead of just printing, have the event system send data to a logging service.
2. **Configuration Files**: Use TOML or JSON to define a list of files that should automatically load when the shell starts.
3. **Advanced Querying**: Integrate DuckDB to allow SQL-like queries directly on the loaded Pandas DataFrames.

Tips & Gotchas

* **Avoid Global Namespace Pollution**: Always wrap your startup code in an `if __name__ == "__main__":` block and a `main()` function. This prevents variables from leaking into the global scope and makes your code easier to import for testing.
* **Relative vs. Absolute Imports**: When working within a package, use relative imports (`from . import module`). This allows you to rename folders or move the package without breaking every internal reference.
* **The YAGNI Principle**: "You Ain't Gonna Need It." Don't build an MVC structure just because you might add a GUI later. Build the simplest version that works today. If you need a GUI tomorrow, the clean, functional code you wrote will be easy to adapt.
* **Testing Output**: Use the `capsys` fixture in Pytest to capture `stdout`. This is the most reliable way to test that your shell is actually displaying the correct messages to the user.
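The pieces of this walkthrough can be wired together in a single file to see the star listener and the function registry working end to end. This is a simplified sketch, not the project's code: `show_files` takes a plain list instead of a model, the "Unknown command" branch is an assumption added for illustration, and `contextlib.redirect_stdout` stands in for what Pytest's `capsys` fixture would do inside a test.

```python
import io
from contextlib import redirect_stdout
from typing import Any, Callable

_listeners: dict[str, set[Callable[..., None]]] = {}


def register_event(event_name: str, listener: Callable[..., None]) -> None:
    _listeners.setdefault(event_name, set()).add(listener)


def raise_event(event_name: str, *args: Any, **kwargs: Any) -> None:
    # Listeners registered under "*" receive every event.
    for listener in _listeners.get("*", set()) | _listeners.get(event_name, set()):
        listener(*args, **kwargs)


def show_files(files: list[str]) -> None:
    raise_event("display_message", f"Files present: {', '.join(files)}")


COMMANDS: dict[str, Callable[..., None]] = {"files": show_files}


def execute_command(name: str, *args: Any) -> None:
    if name in COMMANDS:
        COMMANDS[name](*args)
    else:
        # Hypothetical fallback, not part of the original walkthrough.
        raise_event("display_message", f"Unknown command: {name}")


# The shell only needs to print messages, so one catch-all listener is enough.
register_event("*", print)

buffer = io.StringIO()
with redirect_stdout(buffer):
    execute_command("files", ["sales.csv", "users.csv"])
    execute_command("merge")
output = buffer.getvalue()
```

Because commands communicate only through events, the test never needs to know how messages reach the user; it just inspects what was written to `stdout`.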
Dec 20, 2024
// ArjanCodes
Overview

Python 3.13 introduces an experimental feature that allows developers to run code without the Global Interpreter Lock (GIL). Historically, the GIL prevented multiple threads from executing Python bytecode simultaneously to ensure memory safety. Removing this lock enables true parallelism, allowing CPU-bound tasks to utilize multiple processor cores effectively within a single process. This guide explores the technical shift toward a "no-GIL" Python ecosystem.

Prerequisites

To follow this exploration, you should understand:

* **Threading vs. Multiprocessing**: Knowing how Python handles concurrent execution.
* **CPython Internals**: Basic familiarity with how the default Python interpreter manages memory.
* **Compilation**: Comfort with building Python from source, as early no-GIL builds require custom flags.

Key Libraries & Tools

* **CPython**: The standard Python implementation currently undergoing these architectural changes.
* **threading**: The built-in module for managing concurrent execution threads.
* **multiprocessing**: A module used to side-step the GIL by spawning separate memory spaces.
* **FastAPI** and **SQLAlchemy**: High-level frameworks that may require updates for thread-safety in a no-GIL environment.

Code Walkthrough

Testing the impact of the GIL involves comparing standard threaded execution against a no-GIL build. In a standard environment, the following CPU-bound task gains no speed from threading:

```python
import threading
import time


def count_primes(n):
    # Intensive calculation logic here
    pass


# Standard threading hampered by the GIL
threads = [threading.Thread(target=count_primes, args=(1000000,)) for _ in range(4)]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Elapsed: {time.perf_counter() - start}")
```

When running this on a Python 3.13 build with the GIL disabled, the execution time drops significantly.
The interpreter no longer forces threads to wait for the mutex, allowing the operating system to distribute the `count_primes` workload across four physical CPU cores simultaneously.

Syntax Notes

Disabling the GIL is currently a build-time configuration. You can check the status at runtime with `sys._is_gil_enabled()`, which was added in Python 3.13 and does not exist in earlier versions. The implementation relies heavily on new C macros in the `ceval_gil.c` source file, which conditionally compile locking logic based on the `--disable-gil` flag.

Practical Examples

* **Data Science**: Running heavy NumPy or pandas transformations across threads without the overhead of inter-process communication.
* **AI/ML**: Scaling model inference locally by utilizing all available CPU threads within a single memory space.
* **Web Servers**: Handling high-concurrency requests in frameworks like FastAPI more efficiently.

Tips & Gotchas

Removing the GIL is not a free lunch. Single-threaded performance in early no-GIL builds may actually decrease due to the overhead of new thread-safety mechanisms like biased reference counting. Furthermore, many third-party C extensions assume the GIL protects them; running these in a no-GIL environment can lead to race conditions or segmentation faults. Always test your dependency tree before migrating to a free-threaded build.
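To try the comparison described in this walkthrough, a runnable variant of the benchmark might look like the sketch below. The trial-division body of `count_primes`, the small limit of 10,000, and the `worker` helper are assumptions chosen so the script finishes quickly; the `getattr` fallback is there because `sys._is_gil_enabled()` is missing before Python 3.13.

```python
import sys
import threading
import time


def count_primes(limit: int) -> int:
    """Count primes below limit with trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


# On interpreters without sys._is_gil_enabled(), assume the GIL is active.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

results: list[int] = []
lock = threading.Lock()


def worker(limit: int) -> None:
    found = count_primes(limit)
    with lock:  # protect the shared list explicitly rather than relying on the GIL
        results.append(found)


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]

start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"GIL enabled: {gil_enabled}, elapsed: {elapsed:.3f}s, results: {results}")
```

Running the same file under a standard build and a free-threaded build, with a larger limit, is the simplest way to see whether the four threads actually run in parallel on your machine.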
Aug 2, 2024
// ArjanCodes
The Architecture of Agile Analysis

Many developers fall into the trap of viewing data science scripts as disposable. Since objectives shift as insights emerge, the temptation is to ignore software design. This is a mistake. Arjan Egges argues that proper project structure is precisely what allows for rapid iteration. If your code is a mess, you can't pivot when the data reveals a new direction.

Standardizing the Starting Line

Consistency across projects reduces the cognitive load of context switching. For teams, this is a requirement for collaboration, not just a preference. Using Cookiecutter allows you to instantiate projects from a template like Cookiecutter Data Science, ensuring every experiment begins with the same directory structure and configuration.

Pipeline Power and Library Leverage

Writing custom code for data cleaning often introduces unnecessary bugs. Mature libraries like pandas and scikit-learn offer tested, optimized patterns that actually teach you domain standards. For complex workflows, tools like Taipy provide a backend-to-frontend pipeline that manages scenarios and versioning.

```bash
# Installing Taipy for pipeline management
pip install taipy
```

Decoupling Data and Configuration

Hard-coding constants is the fastest way to break a deployment. Keep your configuration in a single, separate location. Using environment variables via a `.env` file is the gold standard, as it integrates seamlessly with cloud environments and prevents sensitive database paths from leaking into version control.

The Notebook Exit Strategy

Jupyter notebooks are excellent for exploration but terrible for maintenance. Once a piece of logic is stable, move it into a shared Python package. This transition enables professional tooling: auto-formatters, linters, and most importantly, unit tests.

Robustness Beyond the Chart

Visualizing data isn't a substitute for testing. Subtle bugs might not skew a scatter plot but can lead to catastrophic decision-making errors.
Writing unit tests ensures that when you swap a dataset or hand code to a colleague, the underlying logic remains sound. It’s about building a project that functions autonomously, rather than one that requires your constant intervention to survive a deadline.
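The advice about keeping configuration in one place and reading it from environment variables can be sketched with the standard library alone. Everything named here is hypothetical: the `Settings` fields, the variable names `DATA_DIR`, `TEST_SIZE`, and `RANDOM_SEED`, and their defaults. Loading an actual `.env` file would typically be handled by a separate package before this code runs.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """All tunable values for an analysis live in one immutable object."""
    data_dir: Path
    test_size: float
    random_seed: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    # Read from the real environment unless a mapping is passed in (handy for tests).
    env = os.environ if env is None else env
    return Settings(
        data_dir=Path(env.get("DATA_DIR", "data/raw")),
        test_size=float(env.get("TEST_SIZE", "0.2")),
        random_seed=int(env.get("RANDOM_SEED", "42")),
    )


# Simulated environment for illustration; RANDOM_SEED falls back to its default.
settings = load_settings({"DATA_DIR": "/tmp/experiment", "TEST_SIZE": "0.3"})
print(settings)
```

Passing the environment in as a mapping keeps the loader testable without touching real environment variables, which fits the article's point that stable logic should be unit tested.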
Nov 3, 2023
// ArjanCodes
The Hidden Costs of Exploratory Coding

Jupyter Notebooks offer unparalleled freedom for exploratory data analysis, letting you execute code block by block and visualize outputs instantly. However, this flexibility introduces a dangerous trap: hidden global state. Because you can run cells in arbitrary order, your active memory easily becomes desynchronized from the linear sequence on the screen, leading to phantom imports and stale variable values.

Anatomy of a Broken Notebook

Consider this typical scenario where we simulate dice rolls. We start by defining a global variable and a simulation function:

```python
import random

NUMBER_OF_SIDES = 6


def roll_dice(n: int) -> int:
    return sum(random.randint(1, NUMBER_OF_SIDES) for _ in range(n))
```

If you run this, set `NUMBER_OF_SIDES = 20` in another cell, and then call the original function again, your results change silently: the function reads whatever value the kernel currently holds, not the one shown next to its definition. The function depends on mutable global state rather than explicit parameters. Even worse, deleting the `import random` statement from a cell won't trigger an error in subsequent runs because the Python kernel retains that module in memory. When you share this notebook, it immediately breaks for your colleagues.

Best Practices and Syntax Patterns

To write reproducible code, eliminate global dependencies. Refactor your code to use explicit arguments and default values:

```python
import random


def roll_dice(n: int, sides: int = 6) -> int:
    return sum(random.randint(1, sides) for _ in range(n))
```

Tooling and Testing

To prevent state pollution, use the **Restart Kernel and Clear Outputs** button in your IDE. If your notebook logic grows complex, extract your core helper functions into a standalone `dice.py` script and import them:

```python
# In your Jupyter Notebook
from dice import roll_dice
```

Moving logic to traditional scripts allows you to use standard tooling like pandas safely, run unit tests, and leverage automated linters.
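Once `roll_dice` takes explicit arguments, it becomes straightforward to unit test. The sketch below goes one step beyond the summary by adding an optional `rng` parameter, which is an assumption introduced here so that tests can pass a seeded `random.Random` and get repeatable results; the test function names are also hypothetical.

```python
import random
from typing import Optional


def roll_dice(n: int, sides: int = 6, rng: Optional[random.Random] = None) -> int:
    """Sum of n dice rolls; pass a seeded rng for reproducible results."""
    source = rng if rng is not None else random
    return sum(source.randint(1, sides) for _ in range(n))


def test_roll_is_within_bounds() -> None:
    rng = random.Random(0)
    for _ in range(1_000):
        assert 3 <= roll_dice(3, sides=20, rng=rng) <= 60


def test_same_seed_same_result() -> None:
    assert roll_dice(5, rng=random.Random(42)) == roll_dice(5, rng=random.Random(42))


# Pytest would discover these automatically; here they are called directly.
test_roll_is_within_bounds()
test_same_seed_same_result()
```

Neither test depends on notebook state, so they pass or fail the same way on a colleague's machine as on yours.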
Aug 18, 2023
// ArjanCodes
The Limitation of Standard Type Hints

Python type annotations offer a great developer experience, but they fall short when dealing with the internal structure of a Pandas DataFrame. While you can annotate a function to return a `pd.DataFrame`, that tells you nothing about the columns, data types, or value constraints inside that table. In production data pipelines, knowing a variable is a "table" isn't enough; you need to know that the "Quantity" column contains positive integers and "Email" contains valid addresses.

Validation with Pandera

Pandera bridges this gap by providing a flexible validation layer specifically designed for Pandas. It allows you to define a schema that acts as a contract for your data. If data entering your pipeline violates these rules, Pandera catches it immediately. One of its most powerful features is **Schema Inference**. You can pass an existing DataFrame to `pa.infer_schema(df)`, and it will automatically generate a starting schema based on the current data distribution, which you can then refine.

Implementing Class-Based Schemas

While Pandera supports several syntax styles, the class-based approach using `SchemaModel` is the cleanest. It mirrors the familiar Pydantic syntax, making your validation logic readable and modular.

```python
import pandera as pa
from pandera.typing import DataFrame, Series


class OutputSchema(pa.SchemaModel):
    item_name: Series[str]
    quantity: Series[int] = pa.Field(ge=1)
    price: Series[float] = pa.Field(le=1000)


@pa.check_types
def process_data(df: DataFrame) -> DataFrame[OutputSchema]:
    # Your logic here
    return df
```

By using the `@pa.check_types` decorator, Pandera validates the data at runtime based on the type hint `DataFrame[OutputSchema]`. This creates a self-documenting pipeline where the types actually enforce data integrity.

Ecosystem Integrations

Pandera doesn't live in a vacuum.
It integrates seamlessly with FastAPI for validating incoming API dataframes and Hypothesis for generating synthetic test data. This interoperability makes it a core tool for modern Python data engineering, moving beyond simple scripts into robust, verifiable software systems.
Apr 14, 2023
// ArjanCodes
Overview

When you process large datasets, memory becomes your most expensive resource. Pandas is built on top of Python and NumPy, providing a high-level interface for data manipulation. However, if you rely solely on default settings, your memory usage can balloon unnecessarily, by over 90% in some cases. This tutorial explores how to control data types to build efficient, scalable data pipelines.

Prerequisites

To follow along, you should be comfortable with Python basics. You will need Pandas installed in your environment. Familiarity with tabular data concepts like rows and columns is essential.

Key Libraries & Tools

* Pandas: The primary library for data structures (DataFrames and Series).
* NumPy: The numerical engine that provides the underlying C-based data types.
* Pip: The package manager used to install these tools.

Code Walkthrough

Type Inference and Metadata Issues

When reading a CSV, Pandas often struggles with files containing metadata rows. This results in every column defaulting to the expensive `object` type.

```python
import pandas as pd

# Skipping metadata rows to help Pandas infer types correctly
df = pd.read_csv("airports.csv", skiprows=2)
print(df.dtypes)
```

By skipping the first two rows, Pandas correctly identifies integers and floats rather than treating everything as generic objects.

Manual Type Casting

You can force specific types using the `astype` method or specialized conversion functions like `to_numeric` and `to_datetime`.

```python
# Mapping multiple columns at once
type_map = {
    "name": "string",
    "is_active": "bool",
}
df = df.astype(type_map)

# Converting to datetime
df["last_updated"] = pd.to_datetime(df["last_updated"])
```

The Power of Categorical Types

For columns with many repeated strings (like 'State' or 'City'), the `category` type stores data as integers internally, mapped to a unique set of strings. This can reduce memory footprints by up to 98%.
```python
df["state"] = df["state"].astype("category")
```

Syntax Notes

* **Object Type**: The fallback for any data Pandas doesn't recognize; it is highly memory-inefficient.
* **astype()**: A versatile method that accepts a single type or a dictionary for bulk conversion.
* **memory_usage(deep=True)**: Essential for seeing the true cost of string data stored in object columns.

Practical Examples

In a Brazilian e-commerce dataset with 100,000 records, switching a "State" column from `object` to `category` slashed memory usage significantly because there are only 26 unique states. This optimization allows you to process millions of rows on standard hardware.

Tips & Gotchas

Avoid using the categorical type if the column has high cardinality, meaning almost every value is unique (like a Zip Code). In these cases, the overhead of maintaining the category map actually increases memory consumption.
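The effect of the `category` conversion is easy to measure with `memory_usage(deep=True)`. The sketch below uses made-up data (three state codes repeated 10,000 times), so the savings it prints are illustrative only and will not match the figures quoted for the real dataset.

```python
import pandas as pd

# Hypothetical low-cardinality column: 30,000 rows, 3 unique values.
states = ["SP", "RJ", "MG"] * 10_000
df = pd.DataFrame({"state": states})

# deep=True counts the actual string payloads, not just the pointers.
before = df["state"].memory_usage(deep=True)

df["state"] = df["state"].astype("category")
after = df["state"].memory_usage(deep=True)

print(f"before: {before:,} bytes, after: {after:,} bytes")
print(f"unique categories: {list(df['state'].cat.categories)}")
```

Repeating the measurement on a column where nearly every value is unique shows the opposite result, which is the high-cardinality caveat from the tips above.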
Mar 17, 2023
// ArjanCodes
## Modern Software Design: Beyond the Python Hype

When we look at the trajectory of software development in 2023, it is easy to get swept up in the latest library or the newest language version. However, the real work of a developer remains centered on the architecture of logic. **Software design is the art of keeping things manageable.** While much of my recent work focuses on Python, the principles of clean code are largely language-agnostic. Whether you are working in Rust, TypeScript, or Java, the challenge remains the same: how do we structure our systems so they do not collapse under their own weight as they grow?

One of the most frequent requests I receive is for more content on Artificial Intelligence and Machine Learning. While these are undoubtedly the "noisy" sectors of our industry right now, I have intentionally kept my focus on the niche of software design. There is a specific reason for this. In the rush to implement neural networks or data pipelines, many developers abandon the fundamental practices that make software sustainable. A machine learning model wrapped in spaghetti code is a liability, not an asset. My goal is to ensure that as we move into these complex domains, we carry with us the habits of clean functions, decoupled classes, and robust testing.

## The Protocol Shift: Inheritance vs. Composition

One of the more nuanced discussions in modern development involves the transition away from heavy inheritance hierarchies. In the past, Object-Oriented Programming (OOP) often forced us into rigid parent-child relationships between classes. Today, I find myself moving toward a more functional approach, favoring protocols and composition over abstract base classes. This is a significant shift in how we think about interfaces.

In Python, Protocols allow for structural subtyping, often described as static "duck typing." This means we define what an object *does* rather than what it *is*. If an object has the required methods, it satisfies the protocol. This leads to much cleaner code because it removes the need for a central inheritance tree that every developer must understand to make a change. When you define a protocol close to the function that uses it, you are documenting the requirements of that function explicitly. This is not just a syntax choice; it is a design philosophy that prioritizes flexibility and reduces the cognitive load on the developer.

We must also be careful about where we place our business logic. A common mistake is overloading constructors with complex operations. Creating an object should be lightweight. If you bury heavy logic in an `__init__` method, you lose control over the execution flow. You cannot easily create objects for testing or previewing without triggering those side effects. By keeping constructors thin and moving logic into dedicated methods or factory functions, you gain the ability to manage state more effectively, which is essential for building responsive applications.

## Navigating the Ecosystem: Tools, Frameworks, and Risks

Choosing a tech stack is rarely about finding the "best" tool; it is about managing risk. Take the choice between FastAPI and newer contenders like Starlite. FastAPI has become a staple because of its speed and developer experience, but it is largely maintained by one person. This creates a "bus factor" risk. If the primary maintainer disappears, the ecosystem stalls. Conversely, a newer framework might have more maintainers but lacks the massive community support, plugin ecosystem, and battle-tested stability of the market leader.

For production environments, I always lean toward stability. It is fun to experiment with the latest web framework or a new language like Mojo for a hobby project, but when users' data and company revenue are on the line, you want the tool that has the most eyes on its GitHub issues. The same applies to deployment. Docker has become non-negotiable for the modern developer because it solves the "it works on my machine" problem. Understanding how your code lives in a container and how that container interacts with a cloud provider like AWS is no longer a specialty; it is a baseline requirement for being an effective software engineer.

## The AI Assistant: GitHub Copilot and the Future of Work

There is a lot of anxiety surrounding ChatGPT and GitHub Copilot. People ask if these tools will replace us. My experience has been the opposite: they make us more powerful, provided we remain the architects. GitHub Copilot is excellent at generating boilerplate or suggesting the implementation of a standard algorithm. It saves time on the repetitive parts of coding, allowing the developer to focus on the high-level design and the integration of components.

However, a chat interface is not the future of programming. Coding is about context and overview. You need to see how a change in one module affects the entire system. AI tools struggle with this holistic view. They are optimized for the immediate snippet. As an engineer, your value is not in your ability to type syntax; it is in your ability to define the problem and verify that the solution is correct. We are moving from being "code writers" to "code reviewers" and "system architects." This shift requires even stronger analytical skills and a deeper understanding of design patterns, as you must be able to spot when the AI-generated code is subtly wrong or architecturally unsound.

## Balancing the Grind: Career Growth and Learning

One of the hardest parts of being a developer is the constant feeling that you are falling behind. New frameworks emerge every week, and the industry's pace is relentless. My advice is to find a way to incorporate learning into your professional life rather than sacrificing every evening and weekend to the grind. If you are learning new skills, you are becoming a more valuable asset to your employer. It should be a win-win scenario.

For those looking to transition into the field or move into management, remember that credentials matter less than demonstrated skill. While a Computer Science degree provides a solid foundation, many successful engineers come from diverse backgrounds like electrical engineering or self-taught paths via coding schools. What matters most is the ability to break down complex problems and communicate solutions. If you want to move into management, start by taking an advisory role in technical decisions. Show that you understand the business impact of code, not just the technical elegance. The most successful lead developers are those who can bridge the gap between a messy business requirement and a clean technical implementation.

Ultimately, software development is a long game. Whether you are dealing with workplace politics, choosing between Scrum and Kanban, or debating the merits of Graph Databases, the key is to stay curious and methodical. Don't be afraid to step out of your comfort zone; it is the only place where real growth happens. Keep building, keep breaking things, and most importantly, keep designing with the future in mind.
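The two design points from the protocol discussion earlier in this piece, structural typing and thin constructors, can be sketched in a few lines. This is a minimal illustration under assumed names: `Notifier`, `EmailNotifier`, `notify_signup`, and `Report` are hypothetical and do not come from the original talk.

```python
from dataclasses import dataclass
from typing import Protocol


class Notifier(Protocol):
    """Anything with a matching `send` method satisfies this protocol."""

    def send(self, message: str) -> None: ...


class EmailNotifier:
    """Does not inherit from Notifier; it only has the right method."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def send(self, message: str) -> None:
        self.sent.append(f"email: {message}")


def notify_signup(notifier: Notifier, username: str) -> None:
    # The protocol sits next to the function that needs it and
    # documents exactly what that function requires.
    notifier.send(f"Welcome, {username}!")


@dataclass
class Report:
    """Thin constructor: it only stores already-prepared data."""

    rows: list[int]

    @classmethod
    def from_text(cls, text: str) -> "Report":
        # The parsing work lives in a factory method, not in __init__.
        rows = [int(line) for line in text.splitlines() if line.strip()]
        return cls(rows)

    def total(self) -> int:
        return sum(self.rows)


notifier = EmailNotifier()
notify_signup(notifier, "Ada")
report = Report.from_text("1\n2\n3\n")
```

Because `Report(rows)` does no work of its own, a test can build one directly from a list without going through the parsing step.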
Jan 10, 2023
// ArjanCodes
## Overview

Constructing a robust financial dashboard requires more than just displaying charts; it demands a solid data pipeline and a flexible architecture. This guide focuses on transitioning from static sample data to a dynamic, file-driven application. By integrating pandas for data manipulation and Plotly Dash for the user interface, you can build tools that respond instantly to complex user queries. The goal is to create a multi-layered filtering system where dropdowns for years, months, and expense categories interact seamlessly to update visualizations like bar and pie charts.

## Prerequisites

To follow along, you should have a baseline understanding of Python syntax and the basic structure of a Dash application. Familiarity with pandas DataFrames and the concept of 'callbacks' in reactive programming will help you navigate the logic behind the UI updates.

## Key Libraries & Tools

- **Plotly Dash**: The core framework used to build the web-based analytical interface.
- **pandas**: Essential for loading CSV data, performing data cleaning, and generating pivot tables for aggregation.
- **Plotly Express**: A high-level wrapper for Plotly that allows for rapid creation of complex charts with minimal code.

## Structured Data Loading and Schemas

Hardcoding column names throughout your application is a recipe for technical debt. A cleaner approach involves defining a `DataSchema` class. This centralizes your string identifiers, allowing your IDE to provide autocompletion and ensuring that a change in the CSV header only requires a single update in your code.

```python
import pandas as pd


class DataSchema:
    # Column names used throughout the app, defined in one place
    AMOUNT = "amount"
    CATEGORY = "category"
    DATE = "date"
    YEAR = "year"
    MONTH = "month"


def load_transaction_data(path: str) -> pd.DataFrame:
    # Parse the date column and fix the dtypes while reading the file
    data = pd.read_csv(
        path,
        parse_dates=[DataSchema.DATE],
        dtype={DataSchema.AMOUNT: float, DataSchema.CATEGORY: str},
    )
    # Feature Engineering: Extract year and month for better filtering
    data[DataSchema.YEAR] = data[DataSchema.DATE].dt.year.astype(str)
    data[DataSchema.MONTH] = data[DataSchema.DATE].dt.month.astype(str)
    return data
```

## Implementing Multi-Input Callbacks

The power of Plotly Dash lies in its callback system. While a simple callback might update a chart based on one dropdown, a sophisticated dashboard often requires listening to multiple inputs. For instance, a bar chart should update whenever the year, month, or category selection changes.

```python
import plotly.express as px
from dash import Input, Output, dcc

# `app` (a Dash instance) and `data` (the DataFrame returned by
# load_transaction_data) are assumed to be defined in the enclosing scope.
@app.callback(
    Output("bar-chart", "children"),
    [
        Input("year-dropdown", "value"),
        Input("month-dropdown", "value"),
        Input("category-dropdown", "value"),
    ],
)
def update_bar_chart(years, months, categories):
    # Keep only the rows that match every dropdown selection
    filtered_data = data.query(
        "year in @years and month in @months and category in @categories"
    )
    if filtered_data.empty:
        return "No data selected"

    # Aggregate for visualization
    pivot = filtered_data.pivot_table(
        values=DataSchema.AMOUNT,
        index=DataSchema.CATEGORY,
        aggfunc="sum",
    ).reset_index()

    fig = px.bar(pivot, x=DataSchema.CATEGORY, y=DataSchema.AMOUNT)
    return dcc.Graph(figure=fig)
```

## Syntax Notes and Practical Examples

The `query` method in pandas provides a concise, readable way to filter data compared to traditional boolean indexing. Using the `@` symbol inside the query string allows you to refer to local variables directly. This is particularly useful in dashboards where filter criteria are passed as list arguments from the UI. Real-world applications of this pattern include personal expense trackers, corporate budget monitors, and stock portfolio analysis tools.
## Tips & Gotchas

Avoid using global variables for data state whenever possible. Instead, pass the DataFrame through your layout functions to keep components decoupled. A common mistake is forgetting that, by default, Dash callbacks also fire once when the page loads; always ensure your functions can handle empty input lists or null values to prevent the app from crashing during initialization.
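The tip above about empty or null inputs can be handled by moving the filtering step into a small function that receives the DataFrame as an argument, which also avoids a global. The sketch below is one possible shape for it, not the tutorial's own code: the name `filter_transactions` and the sample rows are made up for illustration.

```python
from typing import Optional

import pandas as pd


def filter_transactions(
    data: pd.DataFrame,
    years: Optional[list[str]],
    months: Optional[list[str]],
    categories: Optional[list[str]],
) -> pd.DataFrame:
    """Return the rows matching all three selections.

    A dropdown with nothing selected may pass None or an empty list;
    both are treated as "no rows selected" instead of raising an error.
    """
    if not years or not months or not categories:
        return data.iloc[0:0]  # empty frame with the same columns
    return data.query(
        "year in @years and month in @months and category in @categories"
    )


# Made-up sample rows using the column names from the schema above.
sample = pd.DataFrame(
    {
        "year": ["2022", "2022", "2023"],
        "month": ["1", "2", "1"],
        "category": ["food", "rent", "food"],
        "amount": [10.0, 500.0, 12.5],
    }
)

on_page_load = filter_transactions(sample, None, ["1"], ["food"])
food_in_january = filter_transactions(sample, ["2022", "2023"], ["1"], ["food"])
print(on_page_load.empty)
print(food_in_january["amount"].sum())
```

A callback can then call this function and return a placeholder message when the result is empty, as the bar-chart callback already does.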
Aug 19, 2022