Okay, so that's a bit fatalistic, but if you've spent any time in Python you've likely stubbed your toes on the language, the interpreter, or that dreaded pip install. There's no one perfect tool or language out there, and if there were, Python isn't it (sorry, not sorry, speaking as someone who’s built a career in data with Python).
Don't get me wrong... Python has many great applications and is a language both accessible and versatile. For 80%+ of the market, it’s effectively the only choice right now for data and ML workloads.
When it comes to rapid iteration, prototyping, or one-off scripting, it can be the fastest path to delivery. Memory management and the type system have a strong focus on getting out of your way and allowing you to quickly express ideas in code. Also, its long-running success has created an enormous ecosystem of tools that cater to science, finance, data, and ML professionals.
Any good tool requires experience, diligence, and discipline. For instance, accuse a C++ developer of working with a foot shotgun (maybe even throw in a reference to Keith, the "official" mascot), and you'll get a retort to the effect that it's only a problem with developers who don't know the language well or aren't using the newest standards.
While that point's debatable, there are plenty of pitfalls in Python that can trip up developers of all experience levels.
You should check out Prefect, the sponsor of the newsletter this week! Prefect is a workflow orchestration tool that gives you observability across all of your data pipelines. Deploy your Python code in minutes with Prefect Cloud.
I believe that being honest with yourself, and with your tools, is an underrated skill. To become better, to use the right tool at the right time … we can’t walk around with blinders on. We simply must look ourselves in the mirror and say the hard truths.
Flexibility isn't always your friend
Let's be honest - one of the strengths of Python is the ease with which you can crank out a working program; it’s quite amazing if you stop to ponder it.
It's simple enough for beginners and powerful enough for veterans, and its dynamic typing means you don't need to fight type systems along the way. However, the lack of guard rails is a double-edged sword.
For instance, consider the following function signature:
def calculate_total(items):
You might tease out a bit from how the function and its parameter are named, but you're still left hanging when it comes to understanding that function.
Does it return something? What side effects does it have? Somewhere, a Haskell dev might be encountering that line of code right now and fighting back some bile.
Unless that function is well-documented, you're forced to take a look at the code. Of course, you're not totally out of luck; PEP 484 has your back, but type-hinting is optional and does nothing to help you with legacy codebases that don't already take advantage of it.
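By contrast, a hinted version of that signature answers most of those questions at a glance (the types here are assumptions on my part, since the original contract is unspecified):

from collections.abc import Iterable

# A hinted sketch of the same function; the element type (float) is an
# assumption, not something the bare signature tells us.
def calculate_total(items: Iterable[float]) -> float:
    return sum(items, 0.0)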
What if you didn't know about type hints? You might try to use a default parameter value to serve in its place:
def calculate_total(items = []):
Here be dragons. The statement itself is innocuous enough and offers a hint to readers of your code that you're dealing with a list. However, if you didn't already know that the items = [] initialization is evaluated only once, you might inadvertently assume it's reinitialized with every call of the function.
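To see why that matters, here's a sketch of the trap (the Product, discounts, and p2 names follow the walkthrough below; the values are illustrative):

class Product:
    def __init__(self, name, discounts=[]):  # the [] is created once, at definition time
        self.name = name
        self.discounts = discounts

p1 = Product("hammer")
p2 = Product("wrench", discounts=[0.10])  # p2 supplies its own list
p3 = Product("saw")

p1.discounts.append(0.25)
print(p3.discounts)  # [0.25] -- p1 and p3 are sharing the default list
print(p2.discounts)  # [0.1]  -- p2 is unaffected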
Above, we can see that unless a discounts list was provided as an argument, as was done in the p2 declaration, every instance of the Product class shares the same list of discounts.
Beginners beware
As already mentioned, Python positions itself as a valuable tool for all calibers of developers. That said, it ends up being sold pretty hard to students and learners as a "beginner" language. So, what kinds of things tend to trip this crowd up?
Here are some things that a neophyte might try (a few illustrative sketches; hardly an exhaustive list):
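# Treating input() as a number (input() always returns a str):
age = input("Age? ")
# age + 1              # TypeError: can only concatenate str (not "int") to str
age = int(age) + 1     # what they actually wanted

# Expecting a copy, getting shared references:
row = [0] * 3
grid = [row] * 3       # three references to the SAME row
grid[0][0] = 1
print(grid)            # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]

# Mutating a list while iterating over it:
nums = [1, 2, 2, 3]
for n in nums:
    if n % 2 == 0:
        nums.remove(n)  # shifts the list under the iterator
print(nums)             # [1, 2, 3] -- one of the 2s survives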
Any rookie mistakes I missed? Leave a note of your favorite in the comments.
There are always two sides to every story. You can be like me and write Python for 20 years, and you’re still human. If you’re in a hurry, need to fix something, the tests seem to have you covered … and you just push some quick-fix PR … next thing you know … yup … a syntax error.
Scope
Seasoned developers are primed to understand scoping rules. For new developers, though, this is easily one of the more confusing topics they'll encounter regardless of their choice of language. Python is no exception.
Consider a trivial scenario where Python's scoping allows for closures (a minimal sketch; the make_counter name is illustrative):
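def make_counter():
    count = 0
    def current():
        return count  # reading count from the enclosing scope works fine
    return current

counter = make_counter()
print(counter())  # 0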
Pretty cool, right? A novice might see an example like that and think that being inside the closure gives them full access to parent-scoped variables, inspiring them to write something like:
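def make_counter():
    count = 0
    def increment():
        count = count + 1  # UnboundLocalError: the assignment makes count
        return count       # local to increment(), shadowing the outer one
    return increment

counter = make_counter()
counter()  # blows up ('nonlocal count' inside increment would fix it)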
Behavior like that can make a person want to avoid using closure scopes, particularly if coming from another language such as JavaScript, where the same pattern simply works (another illustrative sketch):
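function makeCounter() {
    let count = 0;
    return function () {
        // no "let" here, so we're not redeclaring count - we're assigning
        // to the one already in scope from makeCounter
        count = count + 1;
        return count;
    };
}

const counter = makeCounter();
console.log(counter());  // 1
console.log(counter());  // 2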
As mentioned in the comment, it's unambiguous what's going on here, since the omission of let or any of its alternatives (const, var) means we're not trying to redeclare a variable but instead are accessing one that is already assumed to be in scope.
Not your speed? Look at how PHP does things (trigger warning: we're about to see some PHP; once again, an illustrative sketch):
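function makeCounter() {
    $count = 0;
    // "use" explicitly declares what comes in from the parent scope,
    // and the & makes it a direct reference rather than a copy
    return function () use (&$count) {
        $count = $count + 1;
        return $count;
    };
}

$counter = makeCounter();
echo $counter();  // 1
echo $counter();  // 2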
In this language, there are now additional guards: you must explicitly declare what's coming in from a parent scope, and whether it comes in as a direct reference (the & before the variable) rather than a copy. While it requires a bit more code than Python, it leaves no room for ambiguity and behaves as expected.
Fail fast
There's a mentality shared between programmers and entrepreneurs - "fail fast" or "fail early." If something's not going to work, you'd rather know early on than trip over the problem in a deployed product.
This is where compiled and strongly typed languages, such as Rust, really shine; they don't let you out of the gates if there's a syntax error, typo, missing variable declaration, ambiguous typing, etc.
def my_func():
    print(my_first_variable)  # my_first_variable is never defined anywhere

print("Not calling the function for now...")
The contrived example above shows how you can have a glaring issue (in this case, my_first_variable wasn't declared), and potentially not know about it even as the program otherwise successfully runs.
Supposing you're developing a shared library and your functions aren't often called within your application, you might be shipping all manner of show-stopping bugs without ever witnessing them for yourself.
The Importance of Unit Tests for Python
This can easily be remedied, of course, with unit tests. With 100% code coverage, ensuring that every branching statement is handled, you can preemptively identify and get in front of these kinds of mistakes. That said, can you guarantee that you or your colleagues have a perfect testing system in place?
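For instance, even a trivial smoke test would have flushed out the bug above (a sketch; mymodule is a hypothetical home for my_func):

from mymodule import my_func  # hypothetical module containing my_func

def test_my_func_runs():
    my_func()  # fails with NameError: name 'my_first_variable' is not defined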
Testing is of utmost importance in Python codebases. It can lend some trustworthiness to things that are otherwise a toss-up.
Alternatively, your IDE or toolchain might catch such issues with linting. For instance, Visual Studio Code, with the Python and Pylance extensions enabled, will flag that usage of my_first_variable with a warning:
"my_first_variable" is not defined
If this is your approach, you should make a point of knowing all of the warnings emitted from your codebase; unfortunately, many developers have a tendency to ignore warnings or otherwise fail to consider them.
The pain with the toolchain
Everything to this point has been an idiosyncrasy of the Python language. However, there are concerns beyond just the grammar and syntax of a language that deserve attention.
Global Interpreter Lock (GIL)
Python's dirty little not-so-secret: it's effectively single-threaded. The reference interpreter (CPython) does this by design, using a mutex (mutually exclusive lock) to ensure only one thread executes Python bytecode at a time. In a nutshell, you could think of the mutex as a baton, and only the thread holding that baton is allowed to run.
You might ask why. Well, it's an effective tool for preventing concurrent access to the same resource from multiple threads. It's not the only way this could have been done, and other languages find ways to support truly concurrent threads, but it's been around long enough that getting rid of it would be extremely painful and almost certainly break existing code.
What it means for you is that, if you want true parallelism in Python, you'll need to reach outside of the interpreter (e.g. spawning new processes, relying on non-Python-based libraries). That said, you may take some comfort in knowing that the interpreter itself is inherently thread-safe; languages lacking this mechanism often run into some nasty issues (race conditions, deadlocks) that can create some of the most painful bugs imaginable.
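A minimal sketch makes the tradeoff visible - CPU-bound work gains nothing from extra threads (timings are illustrative and machine-dependent):

import time
from threading import Thread

def spin(n):
    while n:  # pure CPU-bound busywork
        n -= 1

N = 20_000_000

start = time.perf_counter()
spin(N)
spin(N)
print(f"sequential:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
threads = [Thread(target=spin, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")  # about the same, or worse
# multiprocessing (separate interpreters, separate GILs) is the usual escape hatch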
The environments (pip, venv, conda, etc.)
A naive developer might start by unboxing the Python runtime and getting to work with pip. Eventually, they'll find themselves with an installation riddled with dependencies for all the one-off tasks they were working on, possibly even tripping over version conflicts. Thus, the need for an environment wrapper to encapsulate a given program's dependencies.
If you haven’t been stabbed in the back by a requirements.txt file … you haven’t been working on Python code bases long enough.
If you're using virtualenv, you have a bit of rigmarole to go through when starting and working within a project (creating, activating, and deactivating), though all of it is fairly painless.
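The dance looks roughly like this (a bash sketch; the .venv path is just a convention):

python -m venv .venv               # create the environment next to your project
source .venv/bin/activate          # activate it (Windows: .venv\Scripts\activate)
pip install -r requirements.txt    # dependencies now land inside .venv
deactivate                         # back to the system interpreter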
If you have many projects, you might also find yourself with multiple copies of a given dependency occupying your drive, but disk space is cheap and this only really becomes a problem if you juggle many projects at once or are a packrat.
The strange third cousin … conda.
There's also conda, which has the added advantage of being useful for more than just Python. Rather than being directly tied to a single project, a conda environment is stored in a centralized location that can be used across multiple projects.
It'll even manage multiple versions of Python for you and bring several precompiled binaries to the table for popular dependencies like SciPy or toolchain components like GCC.
Be careful if you're not running a solo development act, though; their TOS can make this a potentially expensive venture if you have enough revenue or people on staff.
Whatever your solution, though, it's undeniable that developing in Python is more complex than just the code you write - you ultimately depend on tooling, system libraries, specific versions of Python, and assorted dependencies. For a beginner, understanding the options and their respective pros and cons can be overwhelming.
Performance
For a computationally heavy project, Python's performance may fall flat when compared to compiled languages, particularly system languages like Rust, C/C++, or Go. This likely won't be a deterrent for many people, and indeed often Python is "good enough" to be the first choice for new project development.
Want to know how Python stacks up against the competition? Well, it's complicated - the workload and environmental considerations can drastically affect the relative performance between languages.
You'll find no shortage of people attempting to quantify the differences, and some can be more flattering than others when trying to make a case for a particular choice. That said, let's be honest: there is an obvious and unavoidable tradeoff when choosing Python or any other interpreted language over one that is compiled, even when allowing for technologies such as JIT.
Decisions with Python
Also, much of the pain (or relief thereof) can be attributed to which dependencies you pull in. For instance, Polars is rapidly gaining popularity over Pandas because it performs dramatically better in many regards.
Often, choosing the right dependency will have a much larger impact on your project than your choice of language, and language choice also carries its own tradeoffs in development difficulty and time-to-market.
Most of this is irrelevant to the beginning developer, but for greenfield development, there is an often painful and contentious battle over which language to choose and what the short- and long-term impacts will be.
In these cases, Python can be great for establishing a proof-of-concept or a skeletal high-level workflow, but could also lock you in with a worse-performing solution. To some extent, that last point can be dealt with if you allow for iterative refactors through best practices (e.g. good TDD coverage, clear separations of concerns, quality code standards).
Best practices
Whether you're in Python or any other language, you should approach a project with professionalism and discipline. You could perhaps afford to be sloppy on a one-off pet project, but for anything that requires collaboration or that might be picked up by another person (including future-you), you should strive for clean, consistent, and maintainable code.
Because "good" code is a subjective topic, you're bound to butt heads with other people on what the right standards are. I'll say this now: a "bad" standard is better than no standard, so you might have to swallow your pride or preference when choosing standards as part of a team.
It is critical that the standards and expectations for a project are clearly defined upfront. At a minimum, a project should have code quality tools like pylint or ruff in place, and enforcement of these rules (git hooks, CI/CD restrictions) should be automated.
Testing is a must in Python projects.
Tests are your friend, but if you're going to do tests, you need to do it right. Decide with your team what needs to be tested versus what doesn't; establish coverage requirements (sometimes, 100% just isn't feasible, but you should justify yourself if you can't manage it); and decide what kinds of tests (unit, BDD, integration, boundary/contract, etc.) offer the greatest rewards for your project.
"Clean code" is a somewhat loaded term, but the version popularized by Robert "Uncle Bob" Martin can be an excellent starting point. While his original work dealt with Java and its nuances, there have been efforts (e.g. here) to adapt the principles to Python.
Many of the principles described within may seem self-evident, but it pays to have the team explicitly agree on certain practices, such as …
a maximum number of statements per function
single-responsibility principles
naming conventions
when and how to use comments, etc.
All these decisions will go a long way in protecting your code from yourself or other developers.
What else should Python developers be doing?
Where possible, tools and automated processes should enforce rules, but you also want a strong peer-review culture. In addition to having reviewers provide perspective that you may lack, you ensure knowledge sharing across development teams by putting multiple eyes on all committed work.
At a minimum (where feasible), all code should be introduced through pull requests that cannot be merged prior to having unanimous approval from at least two others.
As for Python itself, you'll want to have good tooling to preemptively handle many of your problems. A quality IDE with refactoring and code-quality capabilities, such as PyCharm, will not only empower your development cycle but also aid in proper debugging (and no, sprinkling print statements all over your code is not a "proper" approach).
Conclusion
Python's actually a great language, dare I say the greatest? It's not the best overall (if there even is such a thing), and in many aspects, it will lose to its alternatives, but at the same time, it is also a terrific first choice for assorted problems.
If you want to make the most of it, though, you need to put in the time to understand it and grow in your skills. What ultimately makes or breaks most projects isn't the choice of language, but the developers responsible for its creation.
Always look for opportunities to continue to develop yourself even as you are trying to ship code, and be aware of the ever-changing landscape of tools, frameworks, language evolution, and best practices.
A couple of points regarding conda:
1. You can manage conda dependencies using miniforge (https://github.com/conda-forge/miniforge) without downloading the full Anaconda distribution. Miniforge is available under an open-source license (BSD-3), so you don't need to pay a subscription, even for large commercial projects.
2. You mention that conda can manage binary dependencies, but this is just one way it delivers its key advantage over pip: it gives you a completely deterministic and reproducible build. OTOH, if you install a project using virtualenv/pip, the resulting build (combination of program + dependencies) can sometimes depend on the particular environment in which the installation was conducted. This can make troubleshooting very hard, because the same code works differently, or not at all, depending on how it was installed.
Regarding linting and PyCharm: rather than rely on proprietary IDEs, another practice is to use open-source tooling such as mypy (strict static type-checking), pylint, pyright, black, flake8, etc. You can also use pre-commit to ensure that all checks pass before a commit is allowed into the repo: https://dev.to/techishdeep/maximize-your-python-efficiency-with-pre-commit-a-complete-but-concise-guide-39a5. Note that static checking and unit tests are not mutually exclusive. Best practice is to use both.
Finally, you can configure a github workflow to ensure that, on every commit, all current dependencies can be successfully installed, and that all tests pass: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#testing-your-code. Developers will then be immediately alerted if they have made any changes that break the pipeline.
All of the above is part of standard best-practice for any project using Continuous Integration and/or Continuous Delivery (https://www.continuous-delivery.co.uk/), which have been empirically proven to improve software reliability. Irrespective of the programming language used, Continuous Integration is an essential prerequisite of reliability, even/especially in data engineering projects: https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c.
One common failure mode of data engineering pipelines arises from lack of reproducibility. For example, it is very tempting for data scientists to clean and pre-process data using quick-and-dirty "temporary" scripts which are then discarded (sometimes not even saved in version control), and then save the final cleaned data in a database. But this makes it very hard to detect or rectify mistakes in the cleaning process, or respond to changes in pre-processing requirements. Data pipelines are still pieces of software. So if you want a reliable data pipeline, it's important to adopt best-practice software engineering: version control, fully automated builds and tests, pair programming; i.e., Continuous Integration.