
A couple of points regarding conda:

1. You can manage conda dependencies using miniforge (https://github.com/conda-forge/miniforge) without downloading the full Anaconda distribution. Miniforge is available under an open-source license (BSD-3), so you don't need to pay for a subscription, even for large commercial projects.

2. You mention that conda can manage binary dependencies. But this is just one way it delivers its key advantage over pip: a deterministic, reproducible build. OTOH, if you install a project using virtualenv/pip, the resulting build (the combination of program plus dependencies) can depend on the particular environment in which the installation was performed. This can make troubleshooting very hard, because the same code works differently, or not at all, depending on how it was installed.
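As a minimal sketch of what this looks like in practice (the project name, packages, and versions here are purely illustrative), you check a pinned environment spec into version control and recreate the environment from it on every machine:

```yaml
# environment.yml -- kept in version control; names and versions are illustrative
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy=1.26
  - pandas=2.2
```

Anyone can then rebuild the same environment with `conda env create -f environment.yml`, rather than depending on whatever happened to be installed locally.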

Regarding linting and PyCharm: rather than relying on proprietary IDEs, another practice is to use open-source tooling such as mypy (strict static type-checking), pylint, pyright, black, flake8, etc. You can also use pre-commit to ensure that all checks pass before a commit is allowed into the repo: https://dev.to/techishdeep/maximize-your-python-efficiency-with-pre-commit-a-complete-but-concise-guide-39a5. Note that static checking and unit tests are not mutually exclusive. Best practice is to use both.
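A minimal pre-commit configuration might look like the sketch below (the pinned revisions are illustrative; in practice you'd pin the real releases you want):

```yaml
# .pre-commit-config.yaml -- hooks run automatically before each commit
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0        # illustrative revision; pin a real release
    hooks:
      - id: black      # auto-format code
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0        # illustrative revision
    hooks:
      - id: mypy       # static type-checking
```

After `pre-commit install`, a commit that fails formatting or type-checking is rejected before it ever reaches the repo.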

Finally, you can configure a GitHub workflow to ensure that, on every commit, all current dependencies can be successfully installed and all tests pass: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#testing-your-code. Developers are then immediately alerted if they have made any changes that break the pipeline.
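A bare-bones workflow along those lines might look like this (the Python version and the `requirements.txt`/`pytest` setup are assumptions about the project):

```yaml
# .github/workflows/ci.yml -- runs on every push and pull request
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"   # illustrative version
      - run: pip install -r requirements.txt   # fails the build if deps can't install
      - run: pytest                            # fails the build if any test fails
```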

All of the above is part of standard best practice for any project using Continuous Integration and/or Continuous Delivery (https://www.continuous-delivery.co.uk/), which have been empirically shown to improve software reliability. Irrespective of the programming language used, Continuous Integration is an essential prerequisite of reliability, even/especially in data engineering projects: https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c.

One common failure mode of data engineering pipelines is lack of reproducibility. For example, it is very tempting for data scientists to clean and pre-process data using quick-and-dirty "temporary" scripts which are then discarded (sometimes without even being saved in version control), with only the final cleaned data saved to a database. This makes it very hard to detect or rectify mistakes in the cleaning process, or to respond to changes in pre-processing requirements. Data pipelines are still pieces of software. So if you want a reliable data pipeline, it's important to adopt best-practice software engineering: version control, fully automated builds and tests, pair programming. In other words, Continuous Integration.
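The alternative to a throwaway script is to keep the cleaning step as a small, tested function in the repo. A minimal sketch (the `clean_ages` function and its data format are hypothetical, just to illustrate the practice):

```python
def clean_ages(raw: list[str]) -> list[int]:
    """A cleaning step kept in version control and under test,
    instead of a throwaway 'temporary' script.

    Keeps only well-formed, non-negative integer ages; discards
    malformed entries such as 'n/a' or empty strings.
    """
    cleaned = []
    for value in raw:
        value = value.strip()
        if value.isdigit():          # keep only well-formed ages
            cleaned.append(int(value))
    return cleaned


def test_clean_ages():
    # The test documents the cleaning rules and catches regressions
    # when pre-processing requirements change.
    assert clean_ages([" 42", "n/a", "17 ", ""]) == [42, 17]
```

Because the logic lives in version control with a test, a mistake in cleaning can be found, fixed, and re-run over the raw data, rather than being baked invisibly into a database.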
