📦 Python packaging: a simple overview
Index
- Introduction
- Packages: what are they
- Packages: what types are there
- Packages: how to create them
- Packages: how to distribute them
- Missing topics
Introduction
The literature on how to create Python packages has changed over the years. It is quite diverse and often lags behind, given the age of Python and how much the tooling around the encapsulation and distribution of code has changed over those years.
This post tries to summarize how things stand as of 2021, and to serve as a quick bootcamp for anyone coming to Python from other languages.
Note: most of the information presented here comes from discussions with other IRIS-HEP members and from the amazing resources that they have created for physicists to level up their Python game 🐍
Packages: what are they
Within the Python world, the word “package” generates some confusion. In Python, that word can be applied to two different concepts interchangeably:
- A folder with an __init__.py file in it (the file marks that folder as importable).
- A set of source files, and possibly data files, which get distributed, and hopefully installed, together.
The second concept is what other languages call a library, and it is the one covered in this post.
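To make the first meaning concrete, here is a minimal sketch that creates a folder with an __init__.py file in a temporary directory and imports it. The name demo_pkg and the GREETING constant are hypothetical, chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_import_package(root: Path, name: str) -> Path:
    """Create a folder containing an __init__.py file."""
    pkg_dir = root / name
    pkg_dir.mkdir()
    # Whatever is defined in __init__.py becomes available on the imported package.
    (pkg_dir / "__init__.py").write_text("GREETING = 'hello from the package'\n")
    return pkg_dir


with tempfile.TemporaryDirectory() as tmp:
    make_import_package(Path(tmp), "demo_pkg")  # hypothetical package name
    sys.path.insert(0, tmp)  # make the parent folder visible to the import system
    importlib.invalidate_caches()
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        greeting = demo_pkg.GREETING
    finally:
        sys.path.remove(tmp)

print(greeting)
```

This only covers the "importable folder" meaning of the word; nothing here has been distributed or installed.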
Packages: what types are there
Like other languages, Python has multiple types of distributables that can be generated from a given project. Each has a different purpose and may contain a different set of files. They are:
- The source distribution (sdist for short).
- The built distribution (bdist for short).
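There is a hierarchical relationship between the two, and between them and what is generally considered a project repository. The relationship can be visualized as a collection of nested subsets: the built distribution is contained in the source distribution, which in turn is contained in the project repository.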
Source distribution
A source distribution is the minimal subset of files within a Python project repository needed to create a built distribution out of it.
It obviously contains all the source files (all the .py files), in addition to others that may be needed to build the complete metadata of the package (README.md, VERSION, LICENSE…), and a couple of special files that gather all the metadata as well as the “instructions” on what to include in the built distribution: setup.py and / or setup.cfg.
These last two (setup.py and setup.cfg) take their names from setuptools, the long-lived package-building library maintained by the Python Packaging Authority (PyPA).
Something to keep in mind: the PyPA is encouraging developers to move their metadata and built distribution “instructions” from setup.py to setup.cfg as much as possible (the two files can coexist). There are many good reasons for this, but I will not cover them in this post. For more information, check this Paul Ganssle article on deprecating setup.py.
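As a rough illustration of what declarative metadata looks like, the sketch below holds a small setup.cfg as a string and reads it back with the standard-library configparser module. The project name, version and description are hypothetical; the section and key names ([metadata], [options], packages = find:) are the ones setuptools understands.

```python
import configparser

# A minimal, hypothetical setup.cfg: everything here is static metadata,
# so no Python code has to run in order to read it.
SETUP_CFG = """\
[metadata]
name = example-pkg
version = 0.1.0
description = A hypothetical example package
long_description = file: README.md
long_description_content_type = text/markdown

[options]
packages = find:
python_requires = >=3.6
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

metadata = dict(parser["metadata"])
options = dict(parser["options"])

print(metadata["name"], metadata["version"])
print(options["packages"])
```

Being able to read the metadata without executing any project code is one practical difference from a setup.py file, which has to be run to know what it declares.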
Built distribution
A built distribution is the set of files within a Python source distribution that need to be installed on the end user's computer to offer all the desired functionality.
It contains all the source files (all the .py files), in addition to the full metadata of the package.
These built distributions have taken different formats over the years. Many years ago, they were distributed as eggs, and every built distribution had the .egg extension. Nowadays, most of them have migrated to the wheel format (.whl extension), which requires having the wheel library installed in the local environment when building them.
Why the distinction
One could ask: why the distinction between source and built distributions? Why not just distribute the built one and be done with it? Well, it is not that simple 😕.
In some cases, Python projects rely on underlying libraries, written in other programming languages, that get bundled alongside them. In these cases (and some others) the translation between a project and a built distribution is not that direct, as the compilation needed for those bundled libraries depends on the target system (macOS / Linux / Windows, x64 / x32 / ARM architecture…).
To ease this problem, the source distribution idea was introduced: it defines the minimal set of files that any developer needs in order to create a built distribution for any target system.
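The dependency on the target system shows up in the file names of wheels, which follow the pattern {distribution}-{version}(-{build})-{python}-{abi}-{platform}.whl. The following is a simplified sketch of a parser for that pattern, not a full implementation of the wheel specification; the two file names are hypothetical.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        distribution, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:  # the optional build tag is present
        distribution, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelTags(distribution, version, build, python, abi, platform)


# Hypothetical file names: a pure-Python wheel and a compiled one.
pure = parse_wheel_filename("example_pkg-1.0.0-py3-none-any.whl")
compiled = parse_wheel_filename("example_pkg-1.0.0-cp39-cp39-manylinux2014_x86_64.whl")

print(pure.python, pure.abi, pure.platform)
print(compiled.python, compiled.abi, compiled.platform)
```

A pure-Python project can ship a single wheel tagged for any platform, while a project with compiled code typically needs one wheel per interpreter and platform combination, all of them built from the same source distribution.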
Packages: how to create them
When considering how to create the distributions, the tooling has evolved in recent years. Until recently, setuptools was the almighty, all-powerful CLI tool. It was used to (I) install, (II) test, (III) build and (IV) upload the packages.
Those days are long gone.
Nowadays, there are multiple tools to perform any of those actions, most of them developed and maintained by the Python Packaging Authority (PyPA).
When it comes to building the distributions, build is the right tool for the job 🔧. It can be easily used to create both the source and the built distributions for any pure-Python project:
python -m build \
    --sdist \
    --wheel \
    --outdir dist \
    .
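If the build needs to be triggered from a Python script (a release helper, for instance), the same command can be wrapped with subprocess. This is only a sketch: the function names are hypothetical, and it assumes the build package is installed in the interpreter running the script.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(project_dir: str = ".", outdir: str = "dist") -> List[str]:
    """Return the argument list equivalent to the shell command above."""
    return [
        sys.executable, "-m", "build",
        "--sdist",
        "--wheel",
        "--outdir", outdir,
        project_dir,
    ]


def build_distributions(project_dir: str = ".", outdir: str = "dist") -> List[Path]:
    """Run the build and list whatever ended up in the output folder."""
    # check=True raises CalledProcessError if the build fails.
    subprocess.run(build_command(project_dir, outdir), check=True)
    return sorted(Path(outdir).iterdir())


command = build_command()
print(" ".join(command[1:]))
```

Calling build_distributions() inside a project folder would run the build; the snippet itself only prints the arguments.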
Packages: how to distribute them
Finally, when it comes to distributing the packages, there are alternative ways to do so, depending on how much developers want to depend on third-party platforms.
Distribution without an index
The simplest way of distributing Python packages (if they come from a version-controlled repository) is to rely on the functionality that pip, the official PyPA tool for installing packages, already provides.
pip recognizes a version control tool identifier prefixed to the URL of the repository containing the package code (for example, git+), and uses that tool to fetch the necessary code. There are some real-world examples within the pip documentation, but some of the most common ones are:
🔓 For public repos:
pip install git+https://github.com/<ORGANIZATION_NAME>/<REPOSITORY_NAME>.git@<COMMIT/TAG>
🔐 For private repos:
pip install git+ssh://git@github.com/<ORGANIZATION_NAME>/<REPOSITORY_NAME>.git@<COMMIT/TAG>
Please consider that, in order to authenticate with the repository hosting service (e.g. GitHub) to download and install a private package, the public key of an RSA key pair must be set up as a deploy key within the target repository.
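The two URL shapes above differ only in the transport, so they are easy to generate programmatically, for example when writing a requirements file. A minimal sketch, assuming GitHub as the hosting service; the function name and the organization, repository and tag values are hypothetical.

```python
def github_requirement(organization: str, repository: str, ref: str,
                       private: bool = False) -> str:
    """Build a pip-installable VCS URL pinned to a commit or tag."""
    if private:
        # SSH transport: relies on a key that the repository accepts.
        base = f"git+ssh://git@github.com/{organization}/{repository}.git"
    else:
        base = f"git+https://github.com/{organization}/{repository}.git"
    return f"{base}@{ref}"


public_url = github_requirement("example-org", "example-repo", "v1.0.0")
private_url = github_requirement("example-org", "example-repo", "v1.0.0", private=True)

print(public_url)
print(private_url)
```

Either string can be passed to pip install as-is.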
Distribution with an index
When relying on a package index, the offered packages have already been built and are ready to download. This is preferable to distributing packages directly from a version control system, where the packages are built on the fly at install time.
The Python Software Foundation, in collaboration with some private partners, offers a free package index for every developer to use, called the Python Package Index (PyPI). This is the main index, but not the only one, where users can fetch built packages from 🏗️.
Developers must perform some actions in order to distribute their packages through an index:
- Set up an account with PyPI (duh).
- Have a procedure to create the distributions out of their repository (see previous section).
- Have a procedure to upload the distributions to some package index.
For PyPI, the best tool for uploading is twine, which, of course, is also provided by the PyPA. This CLI tool hides a lot of complexity behind a simple interface:
twine upload <PATH_TO_THE_DISTRIBUTIONS>/*
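In a shell, the trailing /* is expanded into the list of files before twine sees it. When scripting the upload from Python, that expansion has to be done by hand. The sketch below only assembles the command; the function name and file names are hypothetical, and actually running the command would require twine to be installed and PyPI credentials to be configured.

```python
import tempfile
from pathlib import Path
from typing import List


def twine_upload_command(dist_dir: str) -> List[str]:
    """Collect the sdist and wheel files in dist_dir into a twine command."""
    files = sorted(
        str(path)
        for path in Path(dist_dir).iterdir()
        if path.name.endswith((".whl", ".tar.gz"))
    )
    if not files:
        raise FileNotFoundError(f"no distributions found in {dist_dir!r}")
    return ["twine", "upload", *files]


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_pkg-0.1.0.tar.gz").touch()
    (Path(tmp) / "example_pkg-0.1.0-py3-none-any.whl").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not a distribution
    upload_command = twine_upload_command(tmp)

print(upload_command[:2], len(upload_command) - 2, "files")
```

The resulting list can be handed to subprocess.run once real distributions exist.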
If GitHub Actions automation is preferred instead, check out the PyPI publish action:
name: Publish package
on:
  ...
jobs:
  publish:
    steps:
      - ...
      - name: "Publish Python package"
        uses: pypa/gh-action-pypi-publish@v1.4.2
        with:
          user: __token__
          password: ${{ secrets.PYPI_API_TOKEN }} # Your PyPI auth token
          verify_metadata: true
Missing topics
This post aimed to provide a high-level summarized overview of Python packages and the logistics surrounding them. There are many aspects and peculiarities not covered. For a complete explanation, check out the official documentation.
Please consider that the tooling around the encapsulation and distribution of packages has changed a lot throughout the history of Python, so you may find old blog posts talking about:
- The easy_install tool.
- The .egg format.
- The distutils library.
- How great it is to invoke setup.py directly (python setup.py <command>).
- …
Most of these concepts / tools are considered legacy as of 2021.