- Python
- Project setup
- Python versions
- Recommended sources of information
- Dependencies and package management
- Editors and IDEs
- Coding style conventions
- Building and packaging code
- Testing
- Code quality analysis tools and services
- Debugging and profiling
- Logging
- Writing Documentation
- Recommended additional packages and libraries
Python
Python is the “dynamic language of choice” of the Netherlands eScience Center.
Project setup
When starting a new Python project, consider using our Python template. This template provides a basic project structure, so you can spend less time setting up and configuring your new Python packages, and comply with the software guide right from the start.
Python versions
Currently, there are two Python versions: 2 and 3.
Should I use Python 2 or Python 3 for my development activity?
Generally, Python 2.x is legacy, Python 3.x is the present and future of the language. However, not all Python libraries are compatible with Python 3.
- Six: Python 2 and 3 Compatibility Library
- 2to3: Automated Python 2 to 3 code translation
- python-modernize: wrapper around 2to3
The philosophy of Python is summarized in the Zen of Python. In Python, this text can be retrieved with the `import this` command.
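For example, in an interactive interpreter session:

```python
# Print the Zen of Python to the console.
import this
```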
Recommended sources of information
- A good way to learn Python is by doing it the hard way at http://learnpythonthehardway.org/
- Introduction to python for data science: http://skillsmatter.com/podcast/java-jee/introducing-python-for-data-science
- Blog by Ian Ozsvald, mostly on high performance python.
- Planet Python
- Using `pylint` and `yapf` while learning Python is an easy way to get familiar with best practices and commonly used coding styles.
Dependencies and package management
Use `pip` or `conda` (note that pip and conda can be used side by side; see also what is the difference between pip and conda?).
If you are planning on distributing your code at a later stage, be aware that your choice of package management may affect your packaging process. See Building and packaging for more info.
Pip + virtualenv
Create isolated Python environments with virtualenv. This is highly recommended for all Python projects, since it:
- installs Python modules when you are not root,
- contains all Python dependencies so the environment keeps working after an upgrade, and
- lets you select the Python version per environment, so you can test code compatibility between Python 2.x and 3.x.
To manage multiple virtualenv environments and reference them only by name, use virtualenvwrapper. To create a new environment, run `mkvirtualenv environment_name`; to start using it, run `workon environment_name`; and to stop working with it, run `deactivate`.
If you are using Python 3 only, you can also make use of the standard library `venv` module. Creating a virtual environment with it is as easy as running `python3 -m venv /path/to/environment`. Run `. /path/to/environment/bin/activate` to start using it and `deactivate` to stop.
With virtualenv and venv, pip is used to install all dependencies. An increasing number of packages are using `wheel`, so pip downloads and installs them as binaries. This means they have no build dependencies and are much faster to install. If the installation of a package fails because of its native extensions or system library dependencies and you are not root, you have to revert to Conda (see below).
To keep a log of the packages used by your package, run `pip freeze > requirements.txt` in the root of your package. If some of the packages listed in `requirements.txt` are needed during testing only, use an editor to move those lines to `test_requirements.txt`. Now your package can be installed with
pip install -r requirements.txt
pip install -e .
The `-e` flag will install your package in editable mode, i.e. it will create a symlink to your package in the installation location instead of copying the package. This is convenient when developing, because any changes you make to the source code will immediately be available for use in the installed version.
Conda
Conda can be used instead of virtualenv and pip. It easily installs binary dependencies, like Python itself or system libraries. Installation of packages that do not use `wheel` but have a lot of native code is much faster than with `pip`, because Conda does not compile the package; it only downloads compiled packages. The disadvantage of Conda is that the package needs to have a Conda build recipe. Many Conda build recipes already exist, but they are less common than the `setup.py` that practically all Python packages have.
There are two main distributions of Conda: Anaconda and Miniconda. Anaconda is large and contains a lot of common packages, like numpy and matplotlib, whereas Miniconda is very lightweight and only contains Python. If you need more, the `conda` command acts as a package manager for Python packages.
Use `conda install` to install new packages and `conda update` to keep your system up to date. The `conda` command can also be used to create virtual environments.
For environments where you do not have admin rights (e.g. DAS-5) either Anaconda or Miniconda is highly recommended, since the install is very straightforward. The installation of packages through Conda seems very robust. If you want to add packages to the (Ana)conda repositories, please check Build using conda.
A possible downside of Anaconda is that it is offered by a commercial supplier, but we don't foresee any vendor lock-in issues.
Editors and IDEs
- Every major text editor supports Python, either natively or through plugins. At the Netherlands eScience Center, commonly used editors are Atom, Sublime Text and Vim.
- PyDev is an open source IDE. The source code is available here. It has support for debugging, unit testing and reporting (code analysis, code coverage).
- For those seeking an IDE, JetBrains PyCharm is the Python IDE of choice. PyCharm Community Edition is open source. The source code is available here. It has a visual debugger, unit testing and code coverage support, and a profiler. A list of other tools can be found here.
Coding style conventions
The style guide for Python code is PEP8 and for docstrings it is PEP257. The `autopep8` package can automatically format most Python code to conform to the PEP 8 style guide. The more comprehensive `yapf` tool can automatically format code for optimal readability according to a chosen style (PEP 8 is the default). The `isort` package automatically formats and groups all imports in a standard, readable way.
Many linters exist for Python. `prospector` is a tool for running a suite of linters; it supports, among others, pylint, pycodestyle, pyflakes, mccabe, pydocstyle and pyroma.
Make sure to set strictness to `veryhigh` for best results. `prospector` has its own configuration file (see here for an example), but it also supports configuration files for any of the linters that it runs. Most of the above tools can be integrated in text editors and IDEs for convenience.
Building and packaging code
To create an installable Python package, create a file `setup.py` and use the `setuptools` module. Make sure you only import standard library packages in `setup.py`, directly or through importing other modules of your package, or your package will fail to install on systems that do not have the required dependencies pre-installed. Set up continuous integration to test your installation script. Use `pyroma` (can be run as part of `prospector`) as a linter for your installation script.
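A minimal `setup.py` sketch is shown below; the package name, description and dependencies are placeholders rather than recommendations:

```python
# Minimal setup.py sketch; all metadata values are placeholders.
from setuptools import setup, find_packages

setup(
    name="mypackage",                 # hypothetical package name
    version="0.1.0",
    description="Short description of the package",
    packages=find_packages(),
    install_requires=["numpy"],       # runtime dependencies (example only)
    tests_require=["pytest"],         # test-only dependencies (example only)
)
```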
For packaging your code, you can either use `pip` or `conda`. Neither of them is better than the other; they are different, so use the one that is more suitable for your project. `pip` may be more suitable for distributing pure Python packages, and it provides some support for binary dependencies using `wheels`. `conda` may be more suitable when you have external dependencies which cannot be packaged in a wheel.
- Use twine to upload your package to PyPI (so it can be installed with pip) (tutorial)
  - Packages should be uploaded to PyPI using the `nlesc` account
  - When distributing code through PyPI, non-Python files (such as `requirements.txt`) will not be packaged automatically; you need to add them to a `MANIFEST.in` file.
  - To test whether your distribution will work correctly before uploading to PyPI, you can run `python setup.py sdist` in the root of your repository. Then try installing your package with `pip install dist/<your_package>.tar.gz`.
- Build using conda
  - If possible, add packages to conda-forge. Use BioConda or custom channels (hosted on GitHub) as alternatives if need be.
- Python wheels are the new standard for distributing Python packages. For pure Python code, without C extensions, use `bdist_wheel` with a Python 2 and a Python 3 setup, or use `bdist_wheel --universal` if the code is compatible with both Python 2 and 3. If C extensions are used, each OS needs to have its own wheel. The manylinux docker images can be used for building wheels compatible with multiple Linux distributions. See the manylinux demo for an example. Wheel building can be automated using Travis (for pure Python, Linux and OS X) and Appveyor (for Windows).
Testing
- pytest is a full-featured Python testing tool. You can use it with `unittest`.
  - Pytest intro
  - Using mocks in Python
- unittest is a framework available in the Python Standard Library.
  - Dr. Dobb's on Unit Testing with Python
Dr.Dobb’s on Unit Testing with Python - doctest searches for pieces of text that look like interactive Python sessions, and then executes those sessions to verify that they work exactly as shown. Always use this if you have example code in your documentation to make sure your examples actually work.
Using `pytest` is preferred over `unittest`: `pytest` has a much more concise syntax and supports many useful features.
Please make sure the command `python setup.py test` can be used to run your tests. When using `pytest`, this can be easily configured as described here.
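As a small illustration of the concise pytest syntax (the module `mypackage.stats` and its `mean` function are hypothetical):

```python
# Contents of test_mean.py; pytest discovers files named test_*.py
# and functions named test_* automatically.
import pytest

from mypackage.stats import mean  # hypothetical module under test


def test_mean_of_integers():
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_of_empty_sequence_raises():
    with pytest.raises(ZeroDivisionError):
        mean([])
```

Running `pytest` in the project root will collect and run these tests.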
Code coverage
When you have tests, it is also good to see which parts of the source code are exercised by the test suite.
Code coverage can be measured with the coverage Python package.
The coverage package can also generate HTML reports which show which lines were covered.
Most test runners have the coverage package integrated.
Code coverage reports can be published online in a code quality service or a code coverage service.
Preferably, use one of the code quality services listed below that also handles code coverage.
If this is not possible or does not fit, use one of the generic code coverage services listed in the software guide.
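Coverage is usually collected from the command line or through the test runner, but the package also exposes a small Python API; the sketch below only illustrates the idea and measures a toy expression instead of a real test suite:

```python
# Sketch of the coverage API; in practice you would run your test
# suite between start() and stop().
import coverage

cov = coverage.Coverage()
cov.start()

total = sum(range(100))  # stand-in for the code you want to measure

cov.stop()
cov.save()
cov.html_report(directory="htmlcov")  # per-line HTML report in htmlcov/
```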
Code quality analysis tools and services
Code quality services are explained in the generic software guide.
There are multiple code quality services available for Python.
There is no single best one; below is a short list of services with their different strengths.
Codacy
Code quality and coverage grouped by file.
You can set up goals to improve quality or coverage by file or category.
For an example project, see https://www.codacy.com/app/3D-e-Chem/kripodb/dashboard.
Note that Codacy does not install your dependencies, which prevents it from correctly identifying import errors.
Scrutinizer
Code quality and coverage grouped by class and function.
For an example project, see https://scrutinizer-ci.com/g/NLeSC/eEcology-Annotation-WS/
Landscape
Dedicated to Python code quality.
Has Celery-, Django- and Flask-specific behaviors.
The analysis tool used by Landscape, `prospector`, can also be run locally.
For an example project, see https://landscape.io/github/NLeSC/MAGMa
Debugging and profiling
Debugging
- Python has its own debugger called pdb. It is part of the Python distribution (a minimal usage sketch follows this list).
- pudb is a console-based Python debugger which can easily be installed using pip.
- If you are looking for IDEs with debugging capabilities, please check the Editors and IDEs section.
- If you are using Windows, Python Tools for Visual Studio adds Python support for Visual Studio.
- If you would like to integrate pdb with the vim editor, you can use Pyclewn.
- A list of other available software can be found here.
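A minimal pdb sketch (the function is a made-up example); at the `set_trace()` call, execution pauses and you get an interactive debugger prompt where you can inspect variables and step through the code:

```python
# Drop into the debugger just before the suspect line.
import pdb


def buggy_sum(values):
    total = 0
    for value in values:
        pdb.set_trace()  # pauses here; inspect 'total' and 'value', step with 'n'
        total += value
    return total


buggy_sum([1, 2, 3])
```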
If you are looking for some tutorials to get started:
Profiling
There are a number of available profiling tools that are suitable for different situations.
- cProfile measures the number of function calls and how much CPU time they take. The output can be further analyzed using the `pstats` module (a short sketch follows this list).
- For more fine-grained, line-by-line CPU time profiling, two modules can be used:
  - line_profiler provides a function decorator that measures the time spent on each line inside the function.
  - pprofile is less intrusive; it simply times entire Python scripts line-by-line. It can give output in callgrind format, which allows you to study the statistics and call tree in `kcachegrind` (often used for analyzing C(++) profiles from `valgrind`).
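A short cProfile/pstats sketch (the profiled function is a made-up example):

```python
# Profile a function call, save the raw data and print a summary.
import cProfile
import pstats


def slow_function():
    return sum(i * i for i in range(1000000))


cProfile.run("slow_function()", "profile.out")   # write raw profiling data
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # 10 most expensive calls
```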
More realistic profiling information can usually be obtained by using statistical or sampling profilers. The profilers listed below all create nice flame graphs.
Logging
- The logging module is the most commonly used tool for tracking events in Python code (a minimal sketch follows below).
- Tutorials:
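A minimal logging sketch; the format string and messages are just examples:

```python
# Configure the root logger once, then log through a module-level logger.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger(__name__)

logger.info("Starting analysis")
logger.warning("Input file is empty, using defaults")
```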
Writing Documentation
Python uses Docstrings for function level documentation. You can read a detailed description of docstring usage in PEP 257.
The default location to put HTML documentation is Read the Docs. You can connect your account at Read the Docs to your GitHub account and let the HTML be generated automatically using Sphinx.
Autogenerating the documentation
There are several tools that automatically generate documentation from docstrings. These are the most used:
- pydoc
- Sphinx (uses reStructuredText as its markup language)
  - sphinx tutorial
  - Restructured Text (reST) and Sphinx CheatSheet
  - Instead of using reST, Sphinx can also generate documentation from the more readable NumPy style or Google style docstrings. The Napoleon extension needs to be enabled.
We recommend using Sphinx and the Google documentation style. Sphinx can easily be integrated with setuptools, so documentation can be built with the command `python setup.py build_sphinx`.
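A short sketch of a Google-style docstring as rendered by Sphinx with the Napoleon extension (the function itself is made up for illustration):

```python
def resample(signal, factor):
    """Resample a signal by an integer factor.

    Args:
        signal (list of float): The input samples.
        factor (int): Keep every ``factor``-th sample.

    Returns:
        list of float: The downsampled signal.

    Raises:
        ValueError: If ``factor`` is smaller than one.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return signal[::factor]
```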
Recommended additional packages and libraries
General scientific
- NumPy
- SciPy
- Pandas data analysis toolkit
- scikit-learn: machine learning in Python
- Cython speeds up Python code by using C types and calling C functions
- dask provides larger-than-memory arrays and parallel execution
IPython and Jupyter notebooks (aka IPython notebooks)
IPython is an interactive Python interpreter — very much the same as the standard Python interactive interpreter, but with some extra features (tab completion, shell commands, in-line help, etc).
Jupyter notebooks (formerly known as IPython notebooks) are browser-based interactive Python environments. They incorporate the same features as the IPython console, plus some extras like in-line plotting. Look at some examples to find out more. Within a notebook you can alternate code with Markdown comments (and even LaTeX), which is great for reproducible research.
Notebook extensions add extra functionality to notebooks.
JupyterLab is a web-based environment with a lot of improvements and integrated tools. JupyterLab is still under development and may not be suitable if you need a stable tool.
Visualization
- Matplotlib has been the standard in scientific visualization. It supports quick-and-dirty plotting through the `pyplot` submodule. Its object-oriented interface can be somewhat arcane, but it is highly customizable and runs natively on many platforms, making it compatible with all major OSes and environments. It supports most sources of data, including native Python objects, NumPy and Pandas (a short `pyplot` example follows this list).
  - Seaborn is a Python visualisation library based on Matplotlib and aimed towards statistical analysis. It supports numpy, pandas, scipy and statsmodels.
- Web-based:
  - Bokeh is Interactive Web Plotting for Python.
  - Plotly is another platform for interactive plotting through a web browser, including in Jupyter notebooks.
  - altair is a declarative statistical visualization library in the grammar-of-graphics style. It does not render visualizations itself, but instead outputs Vega-Lite JSON data. This can lead to a simplified workflow.
  - ggplot is a plotting library imported from R.
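A short `pyplot` sketch plotting a sine curve:

```python
# Quick-and-dirty plotting with the pyplot submodule.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("amplitude")
plt.legend()
plt.show()
```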
Database Interface
- psycopg is a PostgreSQL adapter
- cx_Oracle enables access to Oracle databases
- monetdb.sql is the MonetDB Python client
- pymongo allows working with MongoDB databases
- py-leveldb provides thread-safe Python bindings for LevelDB
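A small sketch using psycopg (imported as `psycopg2`); the connection parameters, table and column names are placeholders for illustration only:

```python
# Connect, run a parameterized query and print the results.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="me", password="secret")  # placeholder credentials
try:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM samples WHERE name = %s", ("demo",))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```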
Parallelisation
CPython (the official and mainstream Python implementation) is not built for parallel processing due to the global interpreter lock (GIL). Note that the GIL only applies to actual Python code, so compiled modules such as `numpy` do not suffer from it.
Having said that, there are many ways to run Python code in parallel:
- The multiprocessing module is the standard way to do parallel execution on one or multiple machines; it circumvents the GIL by creating multiple Python processes.
- A much simpler alternative in Python 3 is the `concurrent.futures` module (see the sketch after this list).
- Many modules have parallel capabilities or can be compiled to have them.
- At the eScience Center, we have developed the Noodles package for creating computational workflows and automatically parallelizing them by dispatching independent subtasks to parallel and/or distributed systems.
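A minimal `concurrent.futures` sketch that runs a CPU-bound function on several inputs in separate processes, sidestepping the GIL (the worker function is a made-up example):

```python
from concurrent.futures import ProcessPoolExecutor


def heavy_computation(n):
    # CPU-bound toy workload
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(heavy_computation, [10**6, 10**7, 10**7]))
    print(results)
```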
Web Frameworks
There are a lot of web frameworks for Python that are very easy to run. We recommend `flask`.
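A minimal flask sketch with a single route returning JSON (the route and payload are made up for illustration):

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/status")
def status():
    # Return a small JSON document.
    return jsonify(status="ok")


if __name__ == "__main__":
    app.run(debug=True)  # built-in development server; not for production
```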