diff --git a/content/good-practices.md b/content/good-practices.md new file mode 100644 index 0000000..840bfbd --- /dev/null +++ b/content/good-practices.md @@ -0,0 +1,598 @@ +# Tools and useful practices + +:::{objectives} +- What does good Python code look like? And if we only had 30 minutes, which + good practices should we highlight? +- Some of the points are inspired by the excellent [Effective Python](https://effectivepython.com/) book by Brett Slatkin. +::: + + +## Follow the PEP 8 style guide + +- Please browse the [PEP 8 style guide](https://pep8.org/) so that you are familiar with the most important rules. +- Using a consistent style makes your code easier to read and understand for others. +- You don't have to check and adjust your code manually. There are tools that can do this for you (see below). + + +## Linting and static type checking + +A **linter** is a tool that analyzes source code to detect potential errors, unused +imports, unused variables, code style violations, and to improve readability. +- Popular linters: + - [Autoflake](https://pypi.org/project/autoflake/) + - [Flake8](https://flake8.pycqa.org/) + - [Pyflakes](https://pypi.org/project/pyflakes/) + - [Pycodestyle](https://pycodestyle.pycqa.org/) + - [Pylint](https://pylint.readthedocs.io/) + - [Ruff](https://docs.astral.sh/ruff/) + +We recommend [Ruff](https://docs.astral.sh/ruff/) since it +can do **both checking and formatting** and you don't have to switch between +multiple tools. + +:::{discussion} Linters and formatters can be configured to your liking +These tools typically have good defaults. But if you don't like the defaults, +you can configure what they should ignore or how they should format or not format. 
+::: + +This code example (which we possibly recognize from the previous section about +{doc}`profiling`) +has a few problems (highlighted): +```{code-block} python +--- +emphasize-lines: 2, 7, 10 +--- +import re +import requests + + +def count_unique_words(file_path: str) -> int: + unique_words = set() + forgotten_variable = 13 + with open(file_path, "r", encoding="utf-8") as file: + for line in file: + words = re.findall(r"\b\w+\b", line.lower())) + for word in words: + unique_words.add(word) + return len(unique_words) +``` + +Try to locate these problems using Ruff: +```console +$ ruff check +``` + +If you use version control and like to have your code checked or formatted +**before you commit the change**, you can use tools like [pre-commit](https://pre-commit.com/). + +Many editors can be configured to automatically check your code as you type. Ruff can also +be used as a **language server**. + + +## Use an auto-formatter + +[Ruff](https://docs.astral.sh/ruff/) is one of the best tools to automatically +format your code according to a consistent style. 
+ +To demonstrate how it works, let us try to auto-format a code example which is badly formatted and also difficult +to read: +:::::{tabs} + ::::{tab} Badly formatted + ```python + import re + def count_unique_words (file_path : str)->int: + unique_words=set() + with open(file_path,"r",encoding="utf-8") as file: + for line in file: + words=re.findall(r"\b\w+\b",line.lower()) + for word in words: + unique_words.add(word) + return len( unique_words ) + ``` + :::: + + ::::{tab} Auto-formatted + ```python + import re + + + def count_unique_words(file_path: str) -> int: + unique_words = set() + with open(file_path, "r", encoding="utf-8") as file: + for line in file: + words = re.findall(r"\b\w+\b", line.lower()) + for word in words: + unique_words.add(word) + return len(unique_words) + ``` + + This was done using: + ```console + $ ruff format + ``` + :::: +::::: + +Other popular formatters: + - [Black](https://black.readthedocs.io/) + - [YAPF](https://github.com/google/yapf) + +Many editors can be configured to automatically format for you when you save the file. + +It is possible to automatically format your code in Jupyter notebooks! +For this to work you need +the following three dependencies installed: +``` +jupyterlab-code-formatter +black +isort +``` + +More information and a screen-cast of how this works can be found at +. 
+ + +## Consider annotating your functions with type hints + +Compare these two versions of the same function and discuss how the type hints +can help you and your tools (such as editors and type checkers) understand the function better: +:::::{tabs} + ::::{tab} Without type hints + ```{code-block} python + --- + emphasize-lines: 1 + --- + def count_unique_words(file_path): + unique_words = set() + with open(file_path, "r", encoding="utf-8") as file: + for line in file: + words = re.findall(r"\b\w+\b", line.lower()) + for word in words: + unique_words.add(word) + return len(unique_words) + ``` + :::: + ::::{tab} With type hints + ```{code-block} python + --- + emphasize-lines: 1 + --- + def count_unique_words(file_path: str) -> int: + unique_words = set() + with open(file_path, "r", encoding="utf-8") as file: + for line in file: + words = re.findall(r"\b\w+\b", line.lower()) + for word in words: + unique_words.add(word) + return len(unique_words) + ``` + :::: +::::: + +A (static) type checker is a tool that checks whether the types of variables in your +code match the types that you have specified. +Popular tools: +- [Mypy](https://mypy.readthedocs.io/) +- [Pyright](https://github.com/microsoft/pyright) (Microsoft) +- [Pyre](https://pyre-check.org/) (Meta) + + +## Consider using AI-assisted coding + +We can use AI as an assistant/apprentice: +- Code completion +- Write a test based on an implementation +- Write an implementation based on a test + +Or we can use AI as a mentor: +- Explain a concept +- Improve code +- Show a different (possibly better) way of implementing the same thing + + +:::{figure} img/productivity/chatgpt.png +:alt: Screenshot of ChatGPT +:width: 100% + +Example of using a chat-based AI tool. +::: + +:::{figure} img/productivity/code-completion.gif +:alt: Screen-cast of working with GitHub Copilot +:width: 100% + +Example of using AI to complete code in an editor. 
+::: + +:::{admonition} AI tools open up a box of questions which are beyond our scope here +- Legal +- Ethical +- Privacy +- Lock-in/monopolies +- Lack of diversity +- Will we still need to learn programming? +- How will it affect learning and teaching programming? +::: + + +## Debugging with print statements + +Print-debugging is a simple, effective, and popular way to debug your code, like +this: +```python +print(f"file_path: {file_path}") +``` +Or more elaborate: +```python +print(f"I am in function count_unique_words and the value of file_path is {file_path}") +``` + +But there can be better alternatives: + +- [Logging](https://docs.python.org/3/library/logging.html) module + ```python + import logging + + logging.basicConfig(level=logging.DEBUG) + + logging.debug("This is a debug message") + logging.info("This is an info message") + ``` +- [IceCream](https://github.com/gruns/icecream) offers compact helper functions for print-debugging + ```python + from icecream import ic + + ic(file_path) + ``` + + +## Often you can avoid using indices + +People coming to Python from other languages, in particular, tend to use indices +where they are not needed. Indices can be error-prone (off-by-one errors and +reading/writing past the end of the collection). 
+ +### Iterating +:::::{tabs} + ::::{tab} Verbose and can be brittle + ```python + scores = [13, 5, 2, 3, 4, 3] + + for i in range(len(scores)): + print(scores[i]) + ``` + :::: + + ::::{tab} Better + ```python + scores = [13, 5, 2, 3, 4, 3] + + for score in scores: + print(score) + ``` + :::: +::::: + + +### Enumerate if you need the index +:::::{tabs} + ::::{tab} Verbose and can be brittle + ```python + particle_masses = [7.0, 2.2, 1.4, 8.1, 0.9] + + for i in range(len(particle_masses)): + print(f"Particle {i} has mass {particle_masses[i]}") + ``` + :::: + + ::::{tab} Better + ```python + particle_masses = [7.0, 2.2, 1.4, 8.1, 0.9] + + for i, mass in enumerate(particle_masses): + print(f"Particle {i} has mass {mass}") + ``` + :::: +::::: + + + +### Zip if you need to iterate over two collections + +:::::{tabs} + ::::{tab} Using an index can be brittle + ```python + persons = ["Alice", "Bob", "Charlie", "David", "Eve"] + favorite_ice_creams = ["vanilla", "chocolate", "strawberry", "mint", "chocolate"] + + for i in range(len(persons)): + print(f"{persons[i]} likes {favorite_ice_creams[i]} ice cream") + ``` + :::: + + ::::{tab} Better + ```python + persons = ["Alice", "Bob", "Charlie", "David", "Eve"] + favorite_ice_creams = ["vanilla", "chocolate", "strawberry", "mint", "chocolate"] + + for person, flavor in zip(persons, favorite_ice_creams): + print(f"{person} likes {flavor} ice cream") + ``` + :::: +::::: + + +### Unpacking +:::::{tabs} + ::::{tab} Verbose and can be brittle + ```python + coordinates = (0.1, 0.2, 0.3) + + x = coordinates[0] + y = coordinates[1] + z = coordinates[2] + ``` + :::: + + ::::{tab} Better + ```python + coordinates = (0.1, 0.2, 0.3) + + x, y, z = coordinates + ``` + :::: +::::: + + +### Prefer catch-all unpacking over indexing/slicing + +:::::{tabs} + ::::{tab} Verbose and can be brittle + ```python + scores = [13, 5, 2, 3, 4, 3] + + sorted_scores = sorted(scores) + + smallest = sorted_scores[0] + rest = sorted_scores[1:-1] + largest = 
sorted_scores[-1] + + print(smallest, rest, largest) + # Output: 2 [3, 3, 4, 5] 13 + ``` + :::: + + ::::{tab} Better + ```python + scores = [13, 5, 2, 3, 4, 3] + + sorted_scores = sorted(scores) + + smallest, *rest, largest = sorted_scores + + print(smallest, rest, largest) + # Output: 2 [3, 3, 4, 5] 13 + ``` + :::: +::::: + + +### List comprehensions, map, and filter instead of loops + +:::::{tabs} + ::::{tab} For-loop + ```python + string_numbers = ["1", "2", "3", "4", "5"] + + integer_numbers = [] + for element in string_numbers: + integer_numbers.append(int(element)) + + print(integer_numbers) + # Output: [1, 2, 3, 4, 5] + ``` + :::: + + ::::{tab} List comprehension + ```python + string_numbers = ["1", "2", "3", "4", "5"] + + integer_numbers = [int(element) for element in string_numbers] + + print(integer_numbers) + # Output: [1, 2, 3, 4, 5] + ``` + :::: + + ::::{tab} Map + ```python + string_numbers = ["1", "2", "3", "4", "5"] + + integer_numbers = list(map(int, string_numbers)) + + print(integer_numbers) + # Output: [1, 2, 3, 4, 5] + ``` + :::: +::::: + +:::::{tabs} + ::::{tab} For-loop + ```python + def is_even(number: int) -> bool: + return number % 2 == 0 + + + numbers = [1, 2, 3, 4, 5, 6] + + even_numbers = [] + for number in numbers: + if is_even(number): + even_numbers.append(number) + + print(even_numbers) + # Output: [2, 4, 6] + ``` + :::: + + ::::{tab} List comprehension + ```python + def is_even(number: int) -> bool: + return number % 2 == 0 + + + numbers = [1, 2, 3, 4, 5, 6] + + even_numbers = [number for number in numbers if is_even(number)] + + print(even_numbers) + # Output: [2, 4, 6] + ``` + :::: + + ::::{tab} Filter + ```python + def is_even(number: int) -> bool: + return number % 2 == 0 + + + numbers = [1, 2, 3, 4, 5, 6] + + even_numbers = list(filter(is_even, numbers)) + + print(even_numbers) + # Output: [2, 4, 6] + ``` + :::: +::::: + + +## Know your collections + +How to choose the right collection type: +- Ordered and modifiable: `list` 
+- Fixed and (rather) immutable: `tuple` +- Key-value pairs: `dict` +- Dictionary with default values: `defaultdict` from {py:mod}`collections` +- Members are unique, no duplicates: `set` +- Optimized operations at both ends: `deque` from {py:mod}`collections` +- Cyclical iteration: `cycle` from {py:mod}`itertools` +- Adding/removing elements in the middle: Create a linked list (e.g. using a dictionary or a dataclass) +- Priority queue: {py:mod}`heapq` library +- Search in sorted collections: {py:mod}`bisect` library + +What to avoid: +- Need to add/remove elements at the beginning or in the middle? Don't use a list. +- Need to make sure that elements are unique? Don't use a list. + + +## Making functions more ergonomic + +- Make API functions less error-prone and reduce backwards-incompatible changes by enforcing keyword-only arguments: + ```python + def send_message(*, message: str, recipient: str) -> None: + print(f"Sending to {recipient}: {message}") + ``` + +- Use dataclasses or named tuples or dictionaries instead of too many input or output arguments. + +- Docstrings instead of comments: + ```python + def send_message(*, message: str, recipient: str) -> None: + """ + Sends a message to a recipient. + + Parameters: + - message (str): The content of the message. + - recipient (str): The name of the person receiving the message. + """ + print(f"Sending to {recipient}: {message}") + ``` + +- Consider issuing a `DeprecationWarning` via the {py:mod}`warnings` module when deprecating functions or arguments. + + +## Iterating + +- When working with large lists or large data sets, consider using generators or iterators instead of lists. + Discuss and compare these two: + ```python + even_numbers1 = [number for number in range(10000000) if number % 2 == 0] + + even_numbers2 = (number for number in range(10000000) if number % 2 == 0) + ``` + +- Beware of functions which iterate over the same collection multiple times. + A generator can be iterated over only once. 
+ +- Know about {py:mod}`itertools`, which provides many functions for working with iterators. + + +## Use relative paths and pathlib + +- Scripts that read data from absolute paths are not portable and typically + break when shared with a colleague or support help desk or reused by the next + student/PhD student/postdoc. +- {py:mod}`pathlib` is a modern and portable way to handle paths in Python. + + +## Project structure + +- As your project grows from a simple script, you should consider organizing + your code into modules and packages. + +- Function too long? Consider splitting it into multiple functions. + +- File too long? Consider splitting it into multiple files. + +- Difficult to name a function or file? It might be doing too much or unrelated things. + +- If your script can be imported into other scripts, wrap the call to your main function in + an `if __name__ == "__main__":` block: + ```python + def main(): + ... + + if __name__ == "__main__": + main() + ``` + +- Why this construct? You can try to either import or run the following script: + ```python + if __name__ == "__main__": + print("I am being run as a script") # importing will not run this part + else: + print("I am being imported") + ``` + +- Try to have all code inside some function. This can make it easier to + understand, test, and reuse. It can also help Python to free up memory when + the function is done. + + +## Reading and writing files + +- A good construct for reading a file: + ```python + with open("input.txt", "r") as file: + for line in file: + print(line) + ``` +- Reading a huge data file? Read and process it in chunks or with buffering, or use a library that does it for you. +- On supercomputers, avoid reading and writing thousands of small files. +- For input files, consider using standard formats like CSV, YAML, or TOML - then you don't need to write a parser. + + +## Use subprocess instead of os.system + +- Many things can go wrong when launching external processes from Python. 
The {py:mod}`subprocess` module is the recommended way to do this. +- `os.system` is not portable and not secure enough. + + +## Parallelizing + +- Use one of the many libraries: {py:mod}`multiprocessing`, {py:mod}`mpi4py`, [Dask](https://dask.org/), [Parsl](https://parsl-project.org/), ... +- Identify independent tasks. +- More often than not, you can convert an expensive loop into a command-line + tool and parallelize it using workflow management tools like + [Snakemake](https://snakemake.github.io/). diff --git a/content/img/productivity/chatgpt.png b/content/img/productivity/chatgpt.png new file mode 100644 index 0000000..6794801 Binary files /dev/null and b/content/img/productivity/chatgpt.png differ diff --git a/content/img/productivity/code-completion.gif b/content/img/productivity/code-completion.gif new file mode 100644 index 0000000..e0bed42 Binary files /dev/null and b/content/img/productivity/code-completion.gif differ diff --git a/content/index.md b/content/index.md index 11f1b09..c006e63 100644 --- a/content/index.md +++ b/content/index.md @@ -61,9 +61,16 @@ them to own projects**. - 11:30-12:15 - Lunch break -- 12:15-13:00 - TBA +- 12:15-12:45 - {doc}`dependencies` -- 13:15-14:00 - {doc}`dependencies` +- 12:45-13:30 - Working with Notebooks + - Notebooks and version control + - Other tooling + - Sharing notebooks + +- 13:30-14:15 - Other useful tools for Python development + - {doc}`good-practices` + - {doc}`profiling` - 14:15-15:00 - Debriefing and Q&A - Participants work on their projects @@ -76,12 +83,9 @@ them to own projects**. 
- 09:00-10:00 - {doc}`testing` -- 10:15-11:30 - Code quality and good practices - - Tools - - Concepts in refactoring and modular code design - - From a notebook to a script to a workflow - - Good practices - - {doc}`profiling` (30 min) +- 10:15-11:30 - Modular code development + - {doc}`refactoring-concepts` + - How to parallelize independent tasks using workflows - 11:30-12:15 - Lunch break @@ -124,6 +128,8 @@ documentation collaboration dependencies testing +refactoring-concepts +good-practices profiling software-licensing publishing diff --git a/content/refactoring-concepts.md b/content/refactoring-concepts.md index d136161..554d03b 100644 --- a/content/refactoring-concepts.md +++ b/content/refactoring-concepts.md @@ -1,7 +1,3 @@ ---- -orphan: true ---- - # Concepts in refactoring and modular code design diff --git a/content/refactoring-demo.md b/content/refactoring-demo.md deleted file mode 100644 index b085680..0000000 --- a/content/refactoring-demo.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -orphan: true ---- - -# Demo: From a script towards a workflow - -In this episode we will explore code quality and good practices in Python using -a hands-on approach. We will together build up a small project and improve it -step by step. - -We will start from a relatively simple image processing script which can read a -telescope image of stars and our goal is to **count the number of stars** in -the image. Later we will want to be able to process many such images. - -The (fictional) telescope images look like the one below here ([in this -repository](https://github.com/workshop-material/random-star-images) we can find more): -:::{figure} refactoring/stars.png -:alt: Generated image representing a telescope image of stars -:width: 60% - -Generated image representing a telescope image of stars. -::: - -:::{admonition} Rough plan for this demo -- (15 min) Discuss how we would solve the problem, run example code, and make it work (as part of a Jupyter notebook)? 
-- (15 min) Refactor the positioning code into a function and a module -- (15 min) Now we wish to process many images - discuss how we would approach this -- (15 min) Introduce CLI and discuss the benefits -- (30 min) From a script to a workflow (using Snakemake) -::: - -:::{solution} Starting point (spoiler alert) - -We can imagine that we pieced together the following code -based on some examples we found online: -```python -import matplotlib.pyplot as plt -from skimage import io, filters, color -from skimage.measure import label, regionprops - - -image = io.imread("stars.png") -sigma = 0.5 - -# if there is a fourth channel (alpha channel), ignore it -rgb_image = image[:, :, :3] -gray_image = color.rgb2gray(rgb_image) - -# apply a gaussian filter to reduce noise -image_smooth = filters.gaussian(gray_image, sigma) - -# threshold the image to create a binary image (bright stars will be white, background black) -thresh = filters.threshold_otsu(image_smooth) -binary_image = image_smooth > thresh - -# label connected regions (stars) in the binary image -labeled_image = label(binary_image) - -# get properties of labeled regions -regions = regionprops(labeled_image) - -# extract star positions (centroids) -star_positions = [region.centroid for region in regions] - -# plot the original image -plt.figure(figsize=(8, 8)) -plt.imshow(image, cmap="gray") - -# overlay star positions with crosses -for star in star_positions: - plt.plot(star[1], star[0], "rx", markersize=5, markeredgewidth=0.1) - -plt.savefig("detected-stars.png", dpi=300) - -print(f"number of stars detected: {len(star_positions)}") -``` -::: - - -## Plan - -Topics we wish to show and discuss: -- Naming (and other) conventions, project organization, modularity -- The value of pure functions and immutability -- Refactoring (explained through examples) -- Auto-formatting and linting with tools like black, vulture, ruff -- Moving a project under Git -- How to document dependencies -- Structuring larger software 
projects in a modular way -- Command-line interfaces -- Workflows with Snakemake - -We will work together on the code on the big screen, and participants will be -encouraged to give suggestions and ask questions. We will **end up with a Git -repository** which will be shared with workshop participants. - - -## Possible solutions - -:::{solution} Script after some work, with command-line interface (spoiler alert) -This is one possible solution (`countstars.py`): -```{literalinclude} refactoring/countstars.py -:language: python -``` -::: - -:::{solution} Snakemake rules which define a workflow (spoiler alert) -This is one possible solution (`snakefile`): -```{literalinclude} refactoring/snakefile -:language: python -``` - -We can process as many images as we like by running: -```console -$ snakemake --cores 4 # adjust to the number of available cores -``` -::: diff --git a/content/refactoring/countstars.py b/content/refactoring/countstars.py deleted file mode 100644 index e139fd1..0000000 --- a/content/refactoring/countstars.py +++ /dev/null @@ -1,66 +0,0 @@ -import click -import matplotlib.pyplot as plt -from skimage import io, filters, color -from skimage.measure import label, regionprops - - -def convert_to_gray(image): - # if there is a fourth channel (alpha channel), ignore it - rgb_image = image[:, :, :3] - return color.rgb2gray(rgb_image) - - -def locate_positions(image): - gray_image = convert_to_gray(image) - - # apply a gaussian filter to reduce noise - image_smooth = filters.gaussian(gray_image, sigma=0.5) - - # threshold the image to create a binary image (bright objects will be white, background black) - thresh = filters.threshold_otsu(image_smooth) - binary_image = image_smooth > thresh - - # label connected regions in the binary image - labeled_image = label(binary_image) - - # get properties of labeled regions - regions = regionprops(labeled_image) - - # extract positions (centroids) - positions = [region.centroid for region in regions] - - return 
positions - - -def plot_positions(image, positions, file_name): - # plot the original image - plt.figure(figsize=(8, 8)) - plt.imshow(image, cmap="gray") - - # overlay positions with crosses - for y, x in positions: - plt.plot(y, x, "rx", markersize=5, markeredgewidth=0.1) - - plt.savefig(file_name, dpi=300) - - -@click.command() -@click.option( - "--image-file", type=click.Path(exists=True), help="Path to the input image" -) -@click.option("--output-file", type=click.Path(), help="Path to the output file") -@click.option("--generate-plot", is_flag=True, default=False) -def main(image_file, output_file, generate_plot): - image = io.imread(image_file) - - star_positions = locate_positions(image) - - if generate_plot: - plot_positions(image, star_positions, f"detected-{image_file}") - - with open(output_file, "w") as f: - f.write(f"number of stars detected: {len(star_positions)}\n") - - -if __name__ == "__main__": - main() diff --git a/content/refactoring/snakefile b/content/refactoring/snakefile deleted file mode 100644 index f40bdab..0000000 --- a/content/refactoring/snakefile +++ /dev/null @@ -1,21 +0,0 @@ -# the comma is there because glob_wildcards returns a named tuple -numbers, = glob_wildcards("input-images/stars-{number}.png") - - -# rule that collects the target files -rule all: - input: - expand("results/{number}.txt", number=numbers) - - -rule process_data: - input: - "input-images/stars-{number}.png" - output: - "results/{number}.txt" - log: - "logs/{number}.txt" - shell: - """ - python countstars.py --image-file {input} --output-file {output} - """ diff --git a/content/refactoring/stars.png b/content/refactoring/stars.png deleted file mode 100644 index f4c5cf7..0000000 Binary files a/content/refactoring/stars.png and /dev/null differ