diff --git a/00-setting-the-scene.md b/00-setting-the-scene.md
new file mode 100644
index 000000000..d12eb84ee
--- /dev/null
+++ b/00-setting-the-scene.md
@@ -0,0 +1,219 @@
+---
+title: Setting the Scene
+start: no
+colour: '#FBED65'
+teaching: 15
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Setting the scene and expectations
+- Making sure everyone has all the necessary software installed
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What are we teaching in this course?
+- What motivated the selection of topics covered in the course?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+So, you have gained basic software development skills either by self-learning or attending,
+e.g., a [novice Software Carpentry course][swc-lessons].
+You have been applying those skills for a while by writing code to help with your work
+and you feel comfortable developing code and troubleshooting problems.
+However, your software has now reached a point where there is too much code to be kept in one script.
+Perhaps it now involves more researchers (developers) and users,
+and more collaborative development effort is needed to add new functionality
+while ensuring previous development efforts remain functional and maintainable.
+
+This course provides the next step in software development -
+it teaches some **intermediate software engineering skills and best practices**
+to help you restructure existing code and design more robust,
+reusable and maintainable code,
+automate the process of testing and verifying software correctness
+and support collaborations with others in a way that
+mimics a typical software development process within a team.
+
+The course uses a number of different **software development tools and techniques**
+interchangeably, as you would in real life.
+We had to make some choices about topics and tools to teach here,
+based on established best practices,
+ease of tool installation for the audience,
+length of the course and other considerations.
+Tools used here are not mandated though:
+alternatives exist and we point some of them out along the way.
+Over time, you will develop a preference for certain tools and programming languages
+based on your personal taste
+or based on what is commonly used by your group, collaborators or community.
+However, the topics covered should give you a solid foundation for working on software development
+in a team and producing high quality software that is easier to develop
+and sustain in the future by yourself and others.
+Skills and tools taught here, while Python-specific,
+are transferable to other similar tools and programming languages.
+
+The course is organised into the following sections:
+
+![Course overview diagram](fig/course-overview.svg){alt="Course overview diagram. Arrows connect the following boxed text in order: 1) Setting up software environment 2) Verifying software correctness 3) Software development as a process 4) Collaborative development for reuse 5) Managing software over its lifetime."}
+
+### [Section 1: Setting up Software Environment](10-section1-intro.md)
+
+In the first section we are going to set up our working environment
+and familiarise ourselves with various tools and techniques for
+software development in a typical collaborative code development cycle:
+
+- **Virtual environments** for **isolating a project** from other projects developed on the same machine,
+- **Command line** for running code and interacting with the **command line tool Git**,
+- **Integrated Development Environment** for **code development, testing and debugging**,
+- **Version control** and using code branches to develop new features in parallel,
+- **GitHub** (central and remote source code management platform supporting version control with Git)
+  for **code backup, sharing and collaborative development**, and
+- **Python code style guidelines** to make sure our code is
+  **documented, readable and consistently formatted**.
+
+### [Section 2: Verifying Software Correctness at Scale](20-section2-intro.md)
+
+Once we know our way around different code development tools, techniques and conventions,
+in this section we learn:
+
+- how to set up a **test framework** and write tests to verify the behaviour of our code is correct, and
+- how to automate and scale testing with **Continuous Integration (CI)** using
+  **GitHub Actions** (a CI service available on GitHub).
+
+### [Section 3: Software Development as a Process](30-section3-intro.md)
+
+In this section, we step away from writing code for a bit
+to look at software from a higher level as a process of development and its components:
+
+- different types of **software requirements** and **designing and architecting software** to meet them,
+  how these fit within the larger **software development process**
+  and what we should consider when **testing** against particular types of requirements.
+- different **programming and software design paradigms**,
+  each representing a slightly different way of thinking about,
+  structuring
+  and **implementing** the code.
+
+### [Section 4: Collaborative Software Development for Reuse](40-section4-intro.md)
+
+Advancing from developing code as an individual,
+in this section you will start working with your fellow learners
+on a group project (as you would do when collaborating on a software project in a team), and learn:
+
+- how **code review** can help improve team software contributions,
+  identify wider codebase issues, and increase codebase knowledge across a team.
+- what we can do to prepare our software for further development and reuse,
+  by adopting best practices in
+  **documenting**,
+  **licensing**,
+  **tracking issues**,
+  **supporting** your software,
+  and **packaging software** for release to others.
+
+### [Section 5: Managing and Improving Software Over Its Lifetime](50-section5-intro.md)
+
+Finally, we move beyond just software development to managing a collaborative software project and will look into:
+
+- internal **planning and prioritising tasks** for future development
+  using agile techniques and effort estimation,
+  management of **internal and external communication**,
+  and **software improvement** through feedback.
+- how to adopt a critical mindset not just towards our own software project
+  but also to **assess other people's software to ensure it is suitable** for us to reuse,
+  identify areas for improvement,
+  and how to use GitHub to register good-quality issues with a particular code repository.
+
+## Before We Start
+
+A few notes before we start.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Prerequisite Knowledge
+
+This is an intermediate-level software development course
+intended for people who have already been developing code in Python (or other languages)
+and applying it to their own problems after gaining basic software development skills.
+You are therefore expected to have some prerequisite knowledge of the topics covered,
+as outlined at the [beginning of the lesson](../index.md#prerequisites).
+Check out this [quiz](../learners/quiz.md) to help you test your prior knowledge
+and determine if this course is for you.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Setup, Common Issues \& Fixes
+
+Have you [set up and installed](../learners/setup.md) all the tools and accounts required for this course?
+Check the list of [common issues, fixes \& tips](../learners/common-issues.md)
+if you experience any problems running any of the tools you installed -
+your issue may be solved there.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Compulsory and Optional Exercises
+
+Exercises are a crucial part of this course and the narrative.
+They are used to reinforce the points taught
+and give you an opportunity to practice things on your own.
+Please do not be tempted to skip exercises
+as that will get your local software project out of sync with the course and break the narrative.
+Exercises that are clearly marked as "optional" can be skipped without breaking things
+but we advise you to go through them too, if time allows.
+All exercises contain solutions but, wherever possible, try to work out a solution on your own.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Outdated Screenshots
+
+Throughout this lesson we will make use of and show content
+from Graphical User Interface (GUI) tools (PyCharm and GitHub).
+These are evolving tools and platforms, always adding new features and new visual elements.
+Screenshots in the lesson may therefore become out of sync -
+referring to or showing content that no longer exists or differs from what you see on your machine.
+If during the lesson you find screenshots that no longer match
+or differ significantly from what you see,
+please [open an issue]({{ site.github.repository_url }}/issues/new) describing what you see
+and how it differs from the lesson content.
+Feel free to add as many screenshots as necessary to clarify the issue.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- This lesson focuses on core, intermediate skills covering the whole software development life-cycle that will be of most use to anyone working collaboratively on code.
+- For code development in teams - you need more than just the right tools and languages.
You need a strategy (best practices) for how you'll use these tools as a team.
+- The lesson follows on from the novice Software Carpentry lesson, but this is not a prerequisite for attending as long as you have some basic Python, command line and Git skills and you have been using them for a while to write code to help with your work.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+[swc-lessons]: https://software-carpentry.org/lessons/

diff --git a/10-section1-intro.md b/10-section1-intro.md
new file mode 100644
index 000000000..9d0e4ea0e
--- /dev/null
+++ b/10-section1-intro.md
@@ -0,0 +1,143 @@
+---
+title: 'Section 1: Setting Up Environment For Collaborative Code Development'
+teaching: 10
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Provide an overview of all the different tools that will be used in this course.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What tools are needed to collaborate on code development effectively?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The first section of the course is dedicated to setting up your environment for collaborative software development
+and introducing the project that we will be working on throughout the course.
+In order to build working (research) software efficiently
+and to do it in collaboration with others rather than in isolation,
+you will have to get comfortable with using a number of different tools interchangeably
+as they will make your life a lot easier.
+There are many options when it comes to deciding
+which software development tools to use for your daily tasks -
+in this course we will use a few that we believe make a difference.
+There are sometimes multiple tools for the job -
+we select one to use but mention alternatives too.
+
+As you get more comfortable with different tools and their alternatives,
+you will select the one that is right for you based on your personal preferences
+or based on what your collaborators are using.
+
+![Section 1 Overview](fig/section1-overview.svg){alt='Tools needed to collaborate on code development effectively'}
+
+Here is an overview of the tools we will be using.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Setup, Common Issues \& Fixes
+
+Have you [set up and installed](../learners/setup.md) all the tools and accounts required for this course?
+Check the list of [common issues, fixes \& tips](../learners/common-issues.md)
+if you experience any problems running any of the tools you installed -
+your issue may be solved there.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Command Line \& Python Virtual Development Environment
+
+We will use the [command line](https://en.wikipedia.org/wiki/Shell_\(computing\))
+(also known as the command line shell/prompt/console)
+to run our Python code
+and interact with the version control tool Git and software sharing platform GitHub.
+We will also use command line tools
+[`venv`](https://docs.python.org/3/library/venv.html)
+and [`pip`](https://pip.pypa.io/en/stable/)
+to set up a Python virtual development environment
+and isolate our software project from other Python projects we may work on.
+
+***Note:** some Windows users experience the issue where Python hangs from Git Bash
+(i.e. typing `python` causes it to just hang with no error message or output) -
+[see the solution to this issue](../learners/common-issues.md#python-hangs-in-git-bash).*
+
+### Integrated Development Environment (IDE)
+
+An IDE integrates a number of tools that we need
+to develop a software project that goes beyond a single script -
+including a smart code editor, a code compiler/interpreter, a debugger, etc.
+It will help you write well-formatted and readable code that conforms to code style guides
+(such as [PEP8](https://www.python.org/dev/peps/pep-0008/) for Python)
+more efficiently by giving relevant and intelligent suggestions
+for code completion and refactoring.
+IDEs often integrate a command line console and version control tools -
+we teach them separately in this course
+as this knowledge can be ported to other programming languages
+and command line tools you may use in the future
+(but is applicable to the integrated versions too).
+
+We will use [PyCharm](https://www.jetbrains.com/pycharm/) in this course -
+a free, open source IDE.
+
+### Git \& GitHub
+
+[Git](https://git-scm.com/) is a free and open source distributed version control system
+designed to save every change made to a (software) project,
+allowing others to collaborate and contribute.
+In this course, we use Git to version control our code in conjunction with [GitHub](https://github.com/)
+for code backup and sharing.
+GitHub is one of the leading integrated products and social platforms
+for modern software development, monitoring and management -
+it will help us with
+version control,
+issue management,
+code review,
+code testing/Continuous Integration,
+and collaborative development.
+An important concept in collaborative development is version control workflows
+(i.e. how to effectively use version control on a project with others).
+
+### Python Coding Style
+
+Most programming languages will have associated standards and conventions for how the source code
+should be formatted and styled.
+Although this sounds pedantic,
+it is important for maintaining the consistency and readability of code across a project.
+Therefore, one should be aware of these guidelines
+and adhere to whatever the project you are working on has specified.
+In Python, we will be looking at a convention called PEP8.
+
+Let us get started with setting up our software development environment!
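To give a flavour of the kind of rules PEP8 specifies, compare the two functionally identical versions of a small function below. The function and variable names are invented purely for illustration - PEP8 itself covers much more (naming, whitespace, line length, imports, and so on).

```python
# Style-only illustration: both functions compute the same result, but the
# second follows PEP8 conventions (snake_case names, descriptive identifiers,
# spaces around operators, one statement per line, a docstring).

def ComputeMean(l):
    s=0
    for x in l: s=s+x
    return s/len(l)


def compute_mean(values):
    """Return the arithmetic mean of a list of numbers."""
    total = 0
    for value in values:
        total = total + value
    return total / len(values)


print(compute_mean([1, 2, 3, 4]))  # prints 2.5
```

Tools such as IDEs and linters can flag most of the issues in the first version automatically, which is one reason we set up an IDE in this section.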
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- In order to develop (write, test, debug, back up) code efficiently, you need to use a number of different tools.
+- When there is a choice of tools for a task you will have to decide which tool is right for you, which may be a matter of personal preference or what the team or community you belong to is using.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::

diff --git a/11-software-project.md b/11-software-project.md
new file mode 100644
index 000000000..2a66ddacb
--- /dev/null
+++ b/11-software-project.md
@@ -0,0 +1,364 @@
+---
+title: 1.1 Introduction to Our Software Project
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Use Git to obtain a working copy of our software project from GitHub.
+- Inspect the structure and architecture of our software project.
+- Understand Model-View-Controller (MVC) architecture in software design and its use in our project.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What is the design architecture of our example software project?
+- Why is splitting code into smaller functional units (modules) good when designing software?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Patient Inflammation Study Project
+
+You have joined a software development team that has been working on the
+[patient inflammation study project](https://github.com/carpentries-incubator/python-intermediate-inflammation)
+developed in Python and stored on GitHub.
+The project studies the effect of a new treatment for arthritis
+by analysing the inflammation levels in patients who have been given this treatment.
+It reuses the inflammation datasets from the
+[Software Carpentry Python novice lesson](https://swcarpentry.github.io/python-novice-inflammation/index.html).
+ +![](fig/inflammation-study-pipeline.png){alt='Snapshot of the inflammation dataset' .image-with-shadow width="800px" } + +

Inflammation study pipeline from the Software Carpentry Python novice lesson

+ +::::::::::::::::::::::::::::::::::::::::: callout + +## What Does Patient Inflammation Data Contain? + +Each dataset records inflammation measurements from a separate clinical trial of the drug, +and each dataset contains information for 60 patients, +who had their inflammation levels recorded (in some arbitrary units of inflammation measurement) for 40 days whilst participating in the trial. +A snapshot of one of the data files is shown in the diagram above. + +Each of the data files uses the popular +[comma-separated (CSV) format](https://en.wikipedia.org/wiki/Comma-separated_values) +to represent the data, where: + +- each row holds inflammation measurements for a single patient +- each column represents a successive day in the trial +- each cell represents an inflammation reading on a given day for a patient + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The project is not finished and contains some errors. +You will be working on your own and in collaboration with others +to fix and build on top of the existing code during the course. + +## Downloading Our Software Project + +To start working on the project, you will first create a fork of the software project repository +from GitHub within your own GitHub account +and then obtain a local copy of that project (from your GitHub) on your machine. + +1. Make sure you have a GitHub account + and that you have [set up SSH key pair for authentication with GitHub](../learners/setup.md#secure-access-to-github-using-git-from-command-line). + + ***Note:** while it is possible to use HTTPS with a personal access token for authentication + with GitHub, the recommended and supported authentication method to use for this course + is SSH with key pairs.* + +2. Log into your GitHub account. + +3. Go to the [software project repository](https://github.com/carpentries-incubator/python-intermediate-inflammation) + in GitHub. 
+ + ![](fig/github-fork-repository.png){alt='Software project fork repository in GitHub' .image-with-shadow width="900px" } + +4. Click the `Fork` button + towards the top right of the repository's GitHub page to **create a fork** of the repository + under your GitHub account. + Remember, you will need to be signed into GitHub for the `Fork` button to work. + + ***Note:** each participant is creating their own fork of the project to work on.* + + ***Note 2:** we are creating a fork of the software project repository (instead of copying it + from its template) because we want to preserve the history of all commits (with template copying + you only get a snapshot of a repository at a given point in time).* + +5. Make sure to select your personal account + and set the name of the project to `python-intermediate-inflammation` + (you can call it anything you like, + but it may be easier for future group exercises if everyone uses the same name). + Ensure that you **uncheck** the `Copy the main branch only` option. + This guarantees you get all the branches from this repository needed for later exercises. + + ![](fig/github-fork-repository-confirm.png){alt='Making a fork of the software project repository in GitHub' .image-with-shadow width="600px" } + +6. Click the `Create fork` button + and wait for GitHub to create the forked copy of the repository under your account. + +7. Locate the forked repository under your own GitHub account. + GitHub should redirect you there automatically after creating the fork. + If this does not happen, click your user icon in the top right corner and select + `Your Repositories` from the drop-down menu, then locate your newly created fork. 
+ + ![](fig/github-forked-repository-own.png){alt='View of your own fork of the software repository in GitHub' .image-with-shadow width="900px" } + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Obtain the Software Project Locally + +Using the command line, clone the copied repository +from your GitHub account into the home directory on your computer using SSH. +Which command(s) would you use to get a detailed list of contents of the directory you have just cloned? + +::::::::::::::: solution + +## Solution + +1. Find the SSH URL of the software project repository to clone from your GitHub account. + Make sure you do not clone the original repository but rather your own fork, + as you should be able to push commits to it later on. + Also make sure you select the **SSH** tab and not the **HTTPS** one. + For this course, SSH is the preferred way of authenticating when sending your changes back to GitHub. + If you have only authenticated through HTTPS in the past, + please follow the guidance [at the top of this section](#downloading-our-software-project) + to add an SSH key to your GitHub account. + +![](fig/clone-repository.png){alt='URL to clone the repository in GitHub' .image-with-shadow width="800px" } + +2. Make sure you are located in your home directory in the command line with: + + ```bash + $ cd ~ + ``` + +3. From your home directory in the command line, do: + + ```bash + $ git clone git@github.com:/python-intermediate-inflammation.git + ``` + + Make sure you are cloning your fork of the software project and not the original repository. + +4. 
Navigate into the cloned repository folder in your command line with: + + ```bash + $ cd python-intermediate-inflammation + ``` + + Note: If you have accidentally copied the **HTTPS** URL of your repository instead of the SSH one, + you can easily fix that from your project folder in the command line with: + + ```bash + $ git remote set-url origin git@github.com:/python-intermediate-inflammation.git + ``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Our Software Project's Structure + +Let's inspect the content of the software project from the command line. +From the root directory of the project, +you can use the command `ls -l` to get a more detailed list of the contents. +You should see something similar to the following. + +```bash +$ cd ~/python-intermediate-inflammation +$ ls -l +total 24 +-rw-r--r-- 1 carpentry users 1055 20 Apr 15:41 README.md +drwxr-xr-x 18 carpentry users 576 20 Apr 15:41 data +drwxr-xr-x 5 carpentry users 160 20 Apr 15:41 inflammation +-rw-r--r-- 1 carpentry users 1122 20 Apr 15:41 inflammation-analysis.py +drwxr-xr-x 4 carpentry users 128 20 Apr 15:41 tests +``` + +As can be seen from the above, our software project contains the `README` file +(that typically describes the project, its usage, installation, authors and how to contribute), +Python script `inflammation-analysis.py`, +and three directories - +`inflammation`, `data` and `tests`. + +The Python script `inflammation-analysis.py` provides +the main entry point in the application, +and on closer inspection, +we can see that the `inflammation` directory contains two more Python scripts - +`views.py` and `models.py`. +We will have a more detailed look into these shortly. 
+ +```bash +$ ls -l inflammation +total 24 +-rw-r--r-- 1 alex staff 71 29 Jun 09:59 __init__.py +-rw-r--r-- 1 alex staff 838 29 Jun 09:59 models.py +-rw-r--r-- 1 alex staff 649 25 Jun 13:13 views.py +``` + +Directory `data` contains several files with patients' daily inflammation information +(along with some other files): + +```bash +$ ls -l data +total 264 +-rw-r--r-- 1 alex staff 5365 25 Jun 13:13 inflammation-01.csv +-rw-r--r-- 1 alex staff 5314 25 Jun 13:13 inflammation-02.csv +-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-03.csv +-rw-r--r-- 1 alex staff 5367 25 Jun 13:13 inflammation-04.csv +-rw-r--r-- 1 alex staff 5345 25 Jun 13:13 inflammation-05.csv +-rw-r--r-- 1 alex staff 5330 25 Jun 13:13 inflammation-06.csv +-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-07.csv +-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-08.csv +-rw-r--r-- 1 alex staff 5327 25 Jun 13:13 inflammation-09.csv +-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-10.csv +-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-11.csv +-rw-r--r-- 1 alex staff 5340 25 Jun 13:13 inflammation-12.csv +-rw-r--r-- 1 alex staff 22554 25 Jun 13:13 python-novice-inflammation-data.zip +-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-01.csv +-rw-r--r-- 1 alex staff 15 25 Jun 13:13 small-02.csv +-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-03.csv +``` + +As [previously mentioned](#what-does-patient-inflammation-data-contain), +each of the inflammation data files contains separate trial data for 60 patients over 40 days. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Have a Peek at the Data + +Which command(s) would you use to list the contents or a first few +lines of `data/inflammation-01.csv` file? + +::::::::::::::: solution + +## Solution + +1. To list the entire content of a file from the project root do: `cat data/inflammation-01.csv`. +2. 
To list the first 5 lines of a file from the project root do: + +```bash +head -n 5 data/inflammation-01.csv +``` + +```output +0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0 +0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1 +0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1 +0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1 +0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1 +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Directory `tests` contains several tests that have been implemented already. +We will be adding more tests during the course as our code grows. + +```bash +$ ls -l tests +total 16 +-rw-r--r-- 1 alex staff 941 18 Dec 11:42 test_models.py +-rw-r--r-- 1 alex staff 182 18 Dec 11:42 test_patient.py +``` + +An important thing to note here is that the structure of our project is not arbitrary. +One of the big differences between novice and intermediate software development is +planning the structure of your code. +This structure includes software components and behavioural interactions between them, +including how these components are laid out in a directory and file structure. +A novice will often make up the structure of their code as they go along. +However, for more advanced software development, +we need to plan and design this structure - called a *software architecture* - beforehand. + +Let us have a quick look into what a software architecture is +and which architecture is used by our software project +before we start adding more code to it. + +### Software Architecture + +A software architecture is the fundamental structure of a software system +that is decided at the beginning of project development +based on its requirements and cannot be changed that easily once implemented. 
+It refers to a "bigger picture" of a software system
+that describes high-level components (modules) of the system
+and how they interact.
+
+In software design and development,
+large systems or programs are often decomposed into a set of smaller modules,
+each with a subset of functionality.
+Typical examples of modules in programming are software libraries;
+some software libraries, such as `numpy` and `matplotlib` in Python,
+are bigger modules that contain several smaller sub-modules.
+Classes in object-oriented programming languages are another example of modules.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Programming Modules and Interfaces
+
+Although modules are self-contained and independent elements to a large extent
+(they can depend on other modules),
+there are well-defined ways in which they interact with one another.
+These rules of interaction are called **programming interfaces** -
+they define how other modules (clients) can use a particular module.
+Typically, an interface to a module includes
+rules on how a module can take input from
+and how it gives output back to its clients.
+A client can be a human, in which case we also call these user interfaces.
+Even smaller functional units such as functions/methods have clearly defined interfaces -
+a function/method's definition
+(also known as a *signature*)
+states what parameters it can take as input and what it returns as an output.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We are going to talk about software architecture and design a
+bit more in [Section 3](30-section3-intro.md) - for now
+it is sufficient to know that the way our software project's code is structured is intentional.
+
+### Our Project's Architecture
+
+Our software project uses the [Model-View-Controller (MVC) architecture](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller).
+MVC architecture divides the software logic into three interconnected modules: + +- **Model** (data) - represents the data used by a program and contains operations/rules + for manipulating and changing the data in the model (a database, a file, a single data object + or a series of objects - for example a table representing patients' data). +- **View** (client interface) - provides means of displaying data to users/clients within an + application (i.e. provides visualisation of the state of the model). + For example, displaying a window with input fields and buttons (Graphical User Interface, GUI) + or textual options within a command line (Command Line Interface, CLI) are examples of Views. +- **Controller** (processes that handle input/output and manipulate the data) - + accepts input from the **View** and performs the corresponding action on the **Model** + (changing the state of the model) and then updates the **View** accordingly. + +In our project, `inflammation-analysis.py` is the **Controller** module +that performs basic statistical analysis over patient data +and provides the main entry point into the application. +The **View** and **Model** modules are contained in the files `views.py` and `models.py`, respectively, +and are conveniently named. +Data underlying the **Model** is contained within the directory `data` - +as we have seen already it contains several files with patients' daily inflammation information. + +We will revisit the software architecture and MVC topics once again in later episodes +when we talk in more detail about [software architecture and design](32-software-architecture-design.md). +We now proceed to set up our virtual development environment +and start working with the code using a more convenient graphical tool - +[IDE PyCharm](https://www.jetbrains.com/pycharm/). 
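To make the MVC separation described above concrete, here is a minimal, self-contained sketch of the three roles. The function names and data are invented for illustration and are not the project's actual code.

```python
# A toy MVC sketch (illustrative only - not the actual project code).
# Each row of `data` is one patient; each column is one day of the trial.

def daily_mean(data):
    """Model: compute the mean inflammation per day across all patients."""
    days = len(data[0])
    return [sum(row[day] for row in data) / len(data) for day in range(days)]

def display(values):
    """View: present the computed values to the user (here, a simple CLI)."""
    print("Daily means:", ", ".join(f"{v:.1f}" for v in values))

def main(data):
    """Controller: accept the input, invoke the Model, update the View."""
    display(daily_mean(data))

main([[0, 1, 2], [0, 3, 4]])  # prints "Daily means: 0.0, 2.0, 3.0"
```

Note how each role could be swapped out independently - for example, replacing the CLI `display` with a plotting function would not require touching `daily_mean`.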
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Programming interfaces define how individual modules within a software application interact among themselves or how the application itself interacts with its users.
+- MVC is a software design architecture which divides the application into three interconnected modules: Model (data), View (user interface), and Controller (input/output and data manipulation).
+- The software project we use throughout this course is an example of an MVC application that manipulates patients’ inflammation data and performs basic statistical analysis using Python.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/12-virtual-environments.md b/12-virtual-environments.md
new file mode 100644
index 000000000..afa971423
--- /dev/null
+++ b/12-virtual-environments.md
@@ -0,0 +1,612 @@
+---
+title: 1.2 Virtual Environments For Software Development
+start: no
+teaching: 30
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Set up a Python virtual environment for our software project using `venv` and `pip`.
+- Run our software from the command line.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What are virtual environments in software development and why should you use them?
+- How can we manage Python virtual environments and external (third-party) libraries?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+So far we have cloned our software project from GitHub and inspected its contents and architecture a bit.
+We now want to run our code to see what it does -
+let us do that from the command line.
+For most of the course we will run our code
+and interact with Git from the command line.
+While we will develop and debug our code using the PyCharm IDE,
+and it is possible to use Git from PyCharm too,
+typing commands in the command line allows you to familiarise yourself with it and learn it well.
+A bonus is that this knowledge is transferable to running code in other programming languages
+and is independent of any IDE you may use in the future.
+
+If you have a little peek into our code
+(e.g. run `cat inflammation/views.py` from the project root),
+you will see the following two lines somewhere at the top.
+
+```python
+from matplotlib import pyplot as plt
+import numpy as np
+```
+
+This means that our code requires two **external libraries**
+(also called third-party packages or dependencies) -
+`numpy` and `matplotlib`.
+Python applications often use external libraries that don't come as part of the standard Python distribution.
+This means that you will have to use a *package manager* tool to install them on your system.
+Applications will also sometimes need a
+specific version of an external library
+(e.g. because they were written to work with a feature, class,
+or function that may have been updated in more recent versions),
+or a specific version of the Python interpreter.
+This means that each Python application you work with may require a different setup
+and a set of dependencies, so it is useful to be able to keep these configurations
+separate to avoid confusion between projects.
+The solution to this problem is to create a self-contained
+**virtual environment** per project,
+which contains a particular version of the Python installation
+plus a number of additional external libraries.
+
+Virtual environments are not just a feature of Python -
+most modern programming languages use a similar mechanism to isolate libraries or dependencies
+for a specific project, making it easier to develop, run, test and share code with others.
+Some examples include Bundler for Ruby, Conan for C++, or Maven with classpath for Java.
+This can also be achieved with more generic package managers like Spack,
+which is used extensively in HPC settings to resolve complex dependencies.
+In this episode, we learn how to set up a virtual environment to develop our code
+and manage our external dependencies.
+
+## Virtual Environments
+
+So what exactly are virtual environments, and why use them?
+
+A Python virtual environment helps us create an **isolated working copy** of a software project
+that uses a specific version of the Python interpreter
+together with specific versions of a number of external libraries
+installed into that virtual environment.
+Python virtual environments are implemented as
+directories with a particular structure within software projects,
+containing links to specified dependencies,
+allowing isolation from other software projects on your machine that may require
+different versions of Python or external libraries.
+
+As more external libraries are added to your Python project over time,
+you can add them to its specific virtual environment
+and avoid a great deal of confusion by having
+separate (smaller) virtual environments for each project
+rather than one huge global environment with potential package version clashes.
+Another big motivator for using virtual environments is
+that they make sharing your code with others much easier
+(as we will see shortly).
+Here are some typical scenarios where
+the use of virtual environments is highly recommended (almost unavoidable):
+
+- You have an older project that only works under Python 2.
+  You do not have the time to migrate the project to Python 3
+  or it may not even be possible as some of the third party dependencies
+  are not available under Python 3.
+  You have to start another project under Python 3.
+  The best way to do this on a single machine is
+  to set up two separate Python virtual environments.
+- One of your Python 3 projects is locked to use
+  a particular older version of a third party dependency.
+  You cannot use the latest version of the dependency as it breaks things in your project.
+  In a separate branch of your project,
+  you want to try and fix the problems introduced by the new version of the dependency
+  without affecting the working version of your project.
+  You need to set up a separate virtual environment for your branch to
+  'isolate' your code while testing the new feature.
+
+Most of the time, you do not have to worry too much about the specific versions
+of the external libraries that your project depends on.
+Virtual environments also enable you to always use
+the latest available version without specifying it explicitly.
+They also enable you to use a specific older version of a package for your project, should you need to.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## A Specific Python or Package Version is Only Ever Installed Once
+
+Note that you will not have separate Python or package installations for each of your projects -
+they will only ever be installed once on your system but will be referenced
+from different virtual environments.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Managing Python Virtual Environments
+
+There are several commonly used command line tools for managing Python virtual environments:
+
+- `venv`, available by default from the standard `Python` distribution from `Python 3.3+`
+- `virtualenv`, needs to be installed separately but supports both `Python 2.7+` and `Python 3.3+` versions
+- `pipenv`, created to fix certain shortcomings of `virtualenv`
+- `conda`, a package and environment management system
+  (also included as part of the Anaconda Python distribution often used by the scientific community)
+- `poetry`, a modern Python packaging tool which handles virtual environments automatically
+
+While there are pros and cons for using each of the above,
+all will do the job of managing Python virtual environments for you
+and it may be a matter of personal preference which one you go for.
+In this course, we will use `venv` to create and manage our virtual environment
+(which is the preferred way for Python 3.3+).
+The upside is that `venv` virtual environments created from the command line are
+also recognised and picked up automatically by the PyCharm IDE,
+as we will see in the next episode.
+
+### Managing External Packages
+
+Part of managing your (virtual) working environment involves
+installing, updating and removing external packages on your system.
+The Python package manager tool `pip` is most commonly used for this -
+it interacts with, and obtains packages from, the central repository called the
+[Python Package Index (PyPI)](https://pypi.org/).
+`pip` can now be used with all Python distributions (including Anaconda).
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## A Note on Anaconda and `conda`
+
+Anaconda is an open source Python distribution commonly used for scientific programming -
+it conveniently installs Python, the package and environment manager `conda`,
+and a number of commonly used scientific computing packages
+so you do not have to obtain them separately.
+`conda` is an independent command line tool
+(available separately from the Anaconda distribution too) with dual functionality:
+(1) it is a package manager that helps you find Python packages
+from remote package repositories and install them on your system, and
+(2) it is also a virtual environment manager.
+So, you can use `conda` for both tasks instead of using `venv` and `pip`.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Many Tools for the Job
+
+Installing and managing Python distributions,
+external libraries and virtual environments is, well, complex.
+There is an abundance of tools for each task,
+each with its advantages and disadvantages,
+and there are different ways to achieve the same effect
+(and even different ways to install the same tool!).
+Note that each Python distribution comes with its own version of `pip` -
+and if you have several Python versions installed, you have to be extra careful to
+use the correct `pip` to manage external packages for that Python version.
+
+`venv` and `pip` are considered the *de facto* standards for virtual environment
+and package management for Python 3.
+However, the advantage of using Anaconda and `conda` is that
+you get (most of) the packages needed for scientific code development included with the distribution.
+If you are only collaborating with others who are also using Anaconda,
+you may find that `conda` satisfies all your needs.
+It is good, however, to be aware of all these tools, and to use them accordingly.
+As you become more familiar with them you will realise that
+equivalent tools work in a similar way even though the command syntax may be different
+(and that other programming languages have equivalent tools too,
+to which your knowledge can be ported).
+
+![Python Environment Hell from [XKCD](https://xkcd.com/1987/) (Creative Commons Attribution-NonCommercial 2.5 License)](fig/python-environment-hell.png){alt='Python environment hell XKCD comic'}
+
+Let us have a look at how we can create and manage virtual environments from the command line
+using `venv` and manage packages using `pip`.
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Making Sure You Can Invoke Python
+
+You can test your Python installation from the command line with:
+
+```bash
+$ python3 --version # on Mac/Linux
+$ python --version # on Windows — Windows installation comes with a python.exe file rather than a python3.exe file
+```
+
+If you are using Windows and invoking the `python` command causes your Git Bash terminal to hang with no error message or output, you may
+need to create an alias for the python executable `python.exe`, as explained in the [troubleshooting section](../learners/common-issues.md#python-hangs-in-git-bash).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Creating Virtual Environments Using `venv`
+
+Creating a virtual environment with `venv` is done by executing the following command:
+
+```bash
+$ python3 -m venv /path/to/new/virtual/environment
+```
+
+where `/path/to/new/virtual/environment` is a path to a directory where you want to place it -
+conventionally within your software project so they are co-located.
+This will create the target directory for the virtual environment
+(and any parent directories that don't exist already).
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What is the `-m` Flag in the `python3` Command?
+
+The Python `-m` flag means "module" and tells the Python interpreter to treat what follows `-m`
+as the name of a module and not as a single, executable program with the same name.
+Some modules (such as `venv` or `pip`) have main entry points
+and the `-m` flag can be used to invoke them on the command line via the `python` command.
+The main difference between running such modules as standalone programs
+(e.g. executing "venv" by running the `venv` command directly)
+versus using the `python3 -m` command is that
+with the latter you are in full control of which Python module will be invoked
+(the one that came with your environment's Python interpreter vs.
+some other version you may have on your system).
+This makes it a more reliable way to set things up correctly
+and avoid issues that could prove difficult to trace and debug.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+For our project, let us create a virtual environment called "venv".
+First, ensure you are within the project root directory, then: + +```bash +$ python3 -m venv venv +``` + +If you list the contents of the newly created directory "venv", on a Mac or Linux system +(slightly different on Windows as explained below) you should see something like: + +```bash +$ ls -l venv +``` + +```output +total 8 +drwxr-xr-x 12 alex staff 384 5 Oct 11:47 bin +drwxr-xr-x 2 alex staff 64 5 Oct 11:47 include +drwxr-xr-x 3 alex staff 96 5 Oct 11:47 lib +-rw-r--r-- 1 alex staff 90 5 Oct 11:47 pyvenv.cfg +``` + +So, running the `python3 -m venv venv` command created the target directory called "venv" +containing: + +- `pyvenv.cfg` configuration file + with a home key pointing to the Python installation from which the command was run, +- `bin` subdirectory (called `Scripts` on Windows) + containing a symlink of the Python interpreter binary used to create the environment + and the standard Python library, +- `lib/pythonX.Y/site-packages` subdirectory (called `Lib\site-packages` on Windows) + to contain its own independent set of installed Python packages isolated from other projects, and +- various other configuration and supporting files and subdirectories. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Naming Virtual Environments + +What is a good name to use for a virtual environment? +Using "venv" or ".venv" as the name for an environment +and storing it within the project's directory seems to be the recommended way - +this way when you come across such a subdirectory within a software project, +by convention you know it contains its virtual environment details. +A slight downside is that all different virtual environments on your machine +then use the same name +and the current one is determined by the context of the path you are currently located in. 
+A (non-conventional) alternative is
+to use your project name for the name of the virtual environment,
+with the downside that there is nothing to indicate that such a directory contains a virtual environment.
+In our case, we have settled on the name "venv" instead of ".venv"
+since it is not a hidden directory and we want it to be displayed by the command line
+when listing directory contents
+(a leading "." in its name would, by convention, make it hidden).
+In the future, you will decide what naming convention works best for you.
+Here are some references for each of the naming conventions:
+
+- [The Hitchhiker's Guide to Python](https://docs.python-guide.org/dev/virtualenvs/)
+  notes that "venv" is the general convention used globally
+- [The Python Documentation](https://docs.python.org/3/library/venv.html)
+  indicates that ".venv" is common
+- ["venv" vs ".venv" discussion](https://discuss.python.org/t/trying-to-come-up-with-a-default-directory-name-for-virtual-environments/3750)
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Once you've created a virtual environment, you will need to activate it.
+
+On Mac or Linux, it is done as:
+
+```bash
+$ source venv/bin/activate
+(venv) $
+```
+
+On Windows, recall that we have a `Scripts` directory instead of `bin`,
+and activating a virtual environment is done as:
+
+```bash
+$ source venv/Scripts/activate
+(venv) $
+```
+
+Activating the virtual environment will change your command line's prompt
+to show what virtual environment you are currently using
+(indicated by its name in round brackets at the start of the prompt),
+and modify the environment so that running Python will get you
+the particular version of Python configured in your virtual environment.
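Besides the changed prompt, you can also check from within Python itself whether an environment is active - for environments created with `venv`, `sys.prefix` points at the environment directory while `sys.base_prefix` points at the base Python installation (this behaviour is documented for `venv`; other environment managers may behave differently):

```python
import sys

# In a venv-created environment, sys.prefix differs from sys.base_prefix;
# outside one, the two are equal.
in_virtual_env = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_virtual_env)
print("Environment directory:", sys.prefix)
```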
+
+You can verify you are using your virtual environment's version of Python
+by checking the path using the command `which`:
+
+```bash
+(venv) $ which python3
+```
+
+```output
+/home/alex/python-intermediate-inflammation/venv/bin/python3
+```
+
+When you're done working on your project, you can exit the environment with:
+
+```bash
+(venv) $ deactivate
+```
+
+If you have just run `deactivate`,
+ensure you reactivate the environment ready for the next part:
+
+```bash
+$ source venv/bin/activate
+(venv) $
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Python Within A Virtual Environment
+
+Within an active virtual environment,
+the commands `python3` and `python` should both refer to the version of Python 3
+you created the environment with (note you may have multiple Python 3 versions installed).
+
+However, on some machines with Python 2 installed,
+the `python` command may still be hardwired to the copy of Python 2
+installed outside of the virtual environment - this can cause errors and confusion.
+
+You can always check which version of Python you are using in your virtual environment
+with the command `which python` to be absolutely sure.
+We continue using `python3` in this material to avoid mistakes,
+but the command `python` may work for you as expected.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Note that, since our software project is being tracked by Git,
+the newly created virtual environment will show up in version control -
+we will see how to handle it using Git in one of the subsequent episodes.
+
+### Installing External Packages Using `pip`
+
+We noticed earlier that our code depends on two *external packages/libraries* -
+`numpy` and `matplotlib`.
+In order for the code to run on your machine,
+you need to install these two dependencies into your virtual environment.
+
+To install the latest version of a package with `pip`,
+use pip's `install` command and specify the package's name, e.g.:
+
+```bash
+(venv) $ python3 -m pip install numpy
+(venv) $ python3 -m pip install matplotlib
+```
+
+or, for short, like this to install multiple packages at once:
+
+```bash
+(venv) $ python3 -m pip install numpy matplotlib
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## How About the `pip3 install` Command?
+
+You may have seen or used the `pip3 install` command in the past, which is shorter
+and perhaps more intuitive than `python3 -m pip install`. However, the
+[official Pip documentation](https://pip.pypa.io/en/stable/user_guide/#running-pip) recommends
+`python3 -m pip install`, and core Python developer Brett Cannon offers a
+[more detailed explanation](https://snarky.ca/why-you-should-use-python-m-pip/)
+of edge cases when the two commands may produce different results and why `python3 -m pip install`
+is recommended. In this material, we will use `python3 -m` whenever we have to invoke a Python
+module from the command line.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+If you run the `python3 -m pip install` command on a package that is already installed,
+`pip` will notice this and do nothing.
+
+To install a specific version of a Python package,
+give the package name followed by `==` and the version number,
+e.g. `python3 -m pip install numpy==1.21.1`.
+
+To specify a minimum version of a Python package,
+you can do `python3 -m pip install 'numpy>=1.20'`
+(the quotes prevent the shell from interpreting `>` as output redirection).
+
+To upgrade a package to the latest version, use the `--upgrade` flag,
+e.g. `python3 -m pip install --upgrade numpy`.
+
+To display information about a particular installed package, do:
+
+```bash
+(venv) $ python3 -m pip show numpy
+```
+
+```output
+Name: numpy
+Version: 1.26.2
+Summary: Fundamental package for array computing in Python
+Home-page: https://numpy.org
+Author: Travis E. Oliphant et al.
+Author-email:
+License: Copyright (c) 2005-2023, NumPy Developers.
+All rights reserved.
+...
+Required-by: contourpy, matplotlib
+```
+
+To list all packages installed with `pip` (in your current virtual environment):
+
+```bash
+(venv) $ python3 -m pip list
+```
+
+```output
+Package         Version
+--------------- -------
+contourpy       1.2.0
+cycler          0.12.1
+fonttools       4.45.0
+kiwisolver      1.4.5
+matplotlib      3.8.2
+numpy           1.26.2
+packaging       23.2
+Pillow          10.1.0
+pip             23.0.1
+pyparsing       3.1.1
+python-dateutil 2.8.2
+setuptools      67.6.1
+six             1.16.0
+```
+
+To uninstall a package installed in the virtual environment, do `python3 -m pip uninstall <package-name>`.
+You can also supply a list of packages to uninstall at the same time.
+
+### Exporting/Importing Virtual Environments Using `pip`
+
+You are collaborating on a project with a team so, naturally,
+you will want to share your environment with your collaborators
+so they can easily 'clone' your software project with all of its dependencies
+and everyone can replicate equivalent virtual environments on their machines.
+`pip` has a handy way of exporting, saving and sharing virtual environments.
+
+To export your active environment,
+use the `python3 -m pip freeze` command to produce a list of packages installed in the virtual environment.
+A common convention is to put this list in a `requirements.txt` file:
+
+```bash
+(venv) $ python3 -m pip freeze > requirements.txt
+(venv) $ cat requirements.txt
+```
+
+```output
+contourpy==1.2.0
+cycler==0.12.1
+fonttools==4.45.0
+kiwisolver==1.4.5
+matplotlib==3.8.2
+numpy==1.26.2
+packaging==23.2
+Pillow==10.1.0
+pyparsing==3.1.1
+python-dateutil==2.8.2
+six==1.16.0
+```
+
+The first of the above commands will create a `requirements.txt` file in your current directory.
+Yours may look a little different,
+depending on the versions of the packages you have installed,
+as well as any differences in the packages that they themselves use.
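If you ever need to inspect such a file programmatically - for example, to see which dependencies are pinned to exact versions - a few lines of Python suffice. This sketch only handles the plain `name==version` lines that `pip freeze` emits, not the full requirements-file syntax:

```python
def pinned_versions(requirements_text):
    """Return a dict of {package: version} for lines pinned with '=='.

    Only handles the simple 'name==version' lines produced by 'pip freeze';
    comments, blank lines and other version specifiers are skipped.
    """
    pinned = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        pinned[name] = version
    return pinned

example = "numpy==1.26.2\nmatplotlib==3.8.2\n# a comment\n"
print(pinned_versions(example))
```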
+
+The `requirements.txt` file can then be committed to a version control system
+(we will see how to do this using Git in one of the following episodes)
+and shipped as part of your software and shared with collaborators and/or users.
+They can then replicate your environment
+and install all the necessary packages from the project root as follows:
+
+```bash
+(venv) $ python3 -m pip install -r requirements.txt
+```
+
+As your project grows, you may need to update your environment for a variety of reasons.
+For example, one of your project's dependencies has just released a new version
+(dependency version number update),
+you need an additional package for data analysis (adding a new dependency)
+or you have found a better package and no longer need the older package
+(adding a new and removing an old dependency).
+What you need to do in this case
+(apart from installing the new and removing the packages that are no longer needed
+from your virtual environment)
+is update the contents of the `requirements.txt` file accordingly
+by re-issuing the `pip freeze` command
+and propagate the updated `requirements.txt` file to your collaborators
+via your code sharing platform (e.g. GitHub).
+
+::::::::::::::::::::::::::::::::::::: testimonial
+
+## Official Documentation
+
+For a full list of options and commands,
+consult the [official `venv` documentation](https://docs.python.org/3/library/venv.html)
+and the [Installing Python Modules with `pip` guide](https://docs.python.org/3/installing/index.html#installing-index).
+Also check out the guide
+["Installing packages using `pip` and virtual environments"](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/#installing-packages-using-pip-and-virtual-environments).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Running Python Scripts From Command Line
+
+Congratulations!
+Your environment is now activated and set up
+to run our `inflammation-analysis.py` script from the command line.
+
+You should already be located in the root of the `python-intermediate-inflammation` directory
+(if not, please navigate to it from the command line now).
+To run the script, type the following command:
+
+```bash
+(venv) $ python3 inflammation-analysis.py
+```
+
+```output
+usage: inflammation-analysis.py [-h] infiles [infiles ...]
+inflammation-analysis.py: error: the following arguments are required: infiles
+```
+
+In the above command, we tell the command line two things:
+
+1. to find a Python interpreter
+   (in this case, the one that was configured via the virtual environment), and
+2. to use it to run our script `inflammation-analysis.py`,
+   which resides in the current directory.
+
+As we can see, the Python interpreter ran our script, which threw an error -
+`inflammation-analysis.py: error: the following arguments are required: infiles`.
+It looks like the script expects a list of input files to process,
+so this is expected behaviour since we do not supply any.
+We will fix this error in a moment.
+
+## Optional exercises
+
+Check out [this optional exercise](17-section1-optional-exercises.md)
+to try out different virtual environment managers.
+Or, try [this exercise](17-section1-optional-exercises.md)
+to customise the command line.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Virtual environments keep Python versions and dependencies required by different projects separate.
+- A virtual environment is itself a directory structure.
+- Use `venv` to create and manage Python virtual environments.
+- Use `pip` to install and manage Python external (third-party) libraries.
+- `pip` allows you to declare all dependencies for a project in a separate file (by convention called `requirements.txt`) which can be shared with collaborators/users and used to replicate a virtual environment.
+- Use `python3 -m pip freeze > requirements.txt` to take a snapshot of your project's dependencies.
+- Use `python3 -m pip install -r requirements.txt` to replicate someone else's virtual environment on your machine from the `requirements.txt` file.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/13-ides.md b/13-ides.md
new file mode 100644
index 000000000..5324931f3
--- /dev/null
+++ b/13-ides.md
@@ -0,0 +1,629 @@
+---
+title: 1.3 Integrated Software Development Environments
+start: no
+teaching: 25
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Set up a (virtual) development environment in PyCharm
+- Use PyCharm to run a Python script
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What are Integrated Development Environments (IDEs)?
+- What are the advantages of using IDEs for software development?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+As we have seen in the previous episode -
+even a simple software project is typically split into smaller functional units and modules,
+which are kept in separate files and subdirectories.
+As your code starts to grow and becomes more complex,
+it will involve many different files and various external libraries.
+You will need an application to help you manage all the complexities of,
+and provide you with some useful (visual) facilities for,
+the software development process.
+Such clever and useful graphical software development applications are called
+Integrated Development Environments (IDEs).
+
+## Integrated Development Environments
+
+An IDE normally consists of at least a source code editor,
+build automation tools
+and a debugger.
+The boundaries between modern IDEs and other aspects of the broader software development process
+are often blurred.
+Nowadays IDEs also offer version control support,
+tools to construct graphical user interfaces (GUIs)
+and web browser integration for web app development,
+source code inspection for dependencies and many other useful functionalities.
+The following is a list of the most commonly seen IDE features:
+
+- **syntax highlighting** -
+  to show language constructs, keywords and syntax errors
+  with visually distinct colours and font effects
+- **code completion** -
+  to speed up programming by offering a set of possible (syntactically correct) code options
+- **code search** -
+  finding package, class, function and variable declarations, their usages and references
+- **version control support** -
+  to interact with source code repositories
+- **debugging support** -
+  for setting breakpoints in the code editor,
+  step-by-step execution of code and inspection of variables
+
+IDEs are extremely useful and modern software development would be very hard without them.
+There are a number of IDEs available for Python development;
+a good overview is available from the
+[Python Project Wiki](https://wiki.python.org/moin/IntegratedDevelopmentEnvironments).
+In addition to IDEs, there are also a number of code editors that have Python support.
+Code editors can be as simple as a text editor
+with syntax highlighting and code formatting capabilities
+(e.g., GNU EMACS, Vi/Vim).
+Most good code editors can also execute code and control a debugger,
+and some can also interact with a version control system.
+Compared to an IDE, a good dedicated code editor is usually smaller and quicker,
+but often less feature-rich.
+You will have to decide which one is the best for you -
+in this course we will learn how to use [PyCharm](https://www.jetbrains.com/pycharm/),
+a free, open source Python IDE.
+Some popular alternatives include the
+free and open source IDE [Spyder](https://www.spyder-ide.org/)
+and Microsoft's free [Visual Studio Code (VS Code)](https://code.visualstudio.com/).
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Using VS Code for This Course
+
+If you want to use VS Code as your IDE for this course, there is a separate [extras episode](../learners/vscode.md)
+to help you set it up. The instructions for PyCharm in the course will not apply to you verbatim, but there
+is equivalent functionality in VS Code for each of the actions we ask you to do in PyCharm.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Using the PyCharm IDE
+
+Let us open our project in PyCharm now and familiarise ourselves with some commonly used features.
+
+### Opening a Software Project
+
+If you do not have PyCharm running yet, start it up now.
+You can skip the initial configuration steps which just go through
+selecting a theme and other aspects.
+You should be presented with a dialog box that asks you what you want to do,
+e.g. `Create New Project`, `Open`, or `Check out from Version Control`.
+
+Select `Open` and find the software project directory
+`python-intermediate-inflammation` you cloned earlier.
+This directory is now the current working directory for PyCharm,
+so when we run scripts from PyCharm, this is the directory they will run from.
+
+PyCharm will show you a *'Tip of the Day'* window which you can safely ignore and close for now.
+You may also get a warning *'No Python interpreter configured for the project'* -
+we [will deal with this](#configuring-a-virtual-environment-in-pycharm)
+shortly after we familiarise ourselves with the PyCharm environment.
+You will notice the IDE shows you a project/file navigator window on the left hand side,
+to traverse and select the files (and any subdirectories) within the working directory,
+and an editor window on the right.
+At the bottom, you would typically have a panel for version control,
+a terminal (the command line within PyCharm) and a TODO list.
+ +![](fig/pycharm-open-project.png){alt='View of an opened project in PyCharm' .image-with-shadow width="1000px" } + +Select the `inflammation-analysis.py` file in the project navigator on the left +so that its contents are displayed in the editor window. +You may notice a warning about the missing Python interpreter +at the top of the editor panel showing `inflammation-analysis.py` file - +this is one of the first things you will have to configure for your project +before you can do any work. + +![](fig/pycharm-missing-python-interpreter.png){alt='Missing Python Interpreter Warning in PyCharm' .image-with-shadow width="800px" } + +You may take the shortcut and click on one of the offered options above +but we want to take you through the whole process of setting up your environment in PyCharm +as this is important conceptually. + +### Configuring a Virtual Environment in PyCharm + +Before you can run the code from PyCharm, +you need to explicitly specify the path to the Python interpreter on your system. +The same goes for any dependencies your code may have - +you need to tell PyCharm where to find them - +much like we did from the command line in the previous episode. +Luckily for us, we have already set up a virtual environment for our project +from the command line +and PyCharm is clever enough to understand it. + +#### Adding a Python Interpreter + +1. Select either `PyCharm` > `Settings` (Mac) or `File` > `Settings` (Linux, Windows). +2. In the window that appears, + select `Project: python-intermediate-inflammation` > `Python Interpreter` from the left. + You'll see a number of Python packages displayed as a list, and importantly above that, + the current Python interpreter that is being used. + These may be blank or set to ``, + or possibly the default version of Python installed on your system, + e.g. `Python 2.7 /usr/bin/python2.7`, + which we do not want to use in this instance. +3. 
Select the cog-like button in the top right, then `Add...`
+   (or `Add Local...` depending on your PyCharm version).
+   An `Add Python Interpreter` window will appear.
+4. Select `Virtualenv Environment` from the list on the left
+   and ensure that the `Existing environment` checkbox is selected within the popup window.
+   In the `Interpreter` field point to the Python 3 executable inside
+   your virtual environment's `bin` directory
+   (make sure you navigate to it and select it from the file browser rather than
+   just accept the default offered by PyCharm).
+   Note that there is also an option to create a new virtual environment,
+   but we are not using that option as we want to reuse the one we created
+   from the command line in the previous episode.
+   ![](fig/pycharm-configuring-interpreter.png){alt='Configuring Python Interpreter in PyCharm' .image-with-shadow width="800px"}
+5. Select the `Make available to all projects` checkbox
+   so we can also use this environment for other projects if we wish.
+6. Select `OK` in the `Add Python Interpreter` window.
+   Back in the `Settings` window, you should select "Python 3.11 (python-intermediate-inflammation)"
+   or similar (that you have just added) from the `Python Interpreter` drop-down list.
+
+Note that a number of external libraries have magically appeared under the
+"Python 3.11 (python-intermediate-inflammation)" interpreter,
+including `numpy` and `matplotlib`.
+PyCharm has recognised the virtual environment we created from the command line using `venv`
+and has added these libraries, effectively replicating our virtual environment in PyCharm
+(referred to as "Python 3.11 (python-intermediate-inflammation)").
+
+![](fig/pycharm-installed-packages.png){alt='Packages Currently Installed in a Virtual Environment in PyCharm' .image-with-shadow width="800px"}
+
+Also note that, although the names are not the same -
+this is one and the same virtual environment
+and changes done to it in PyCharm will propagate to the command line and vice versa.
+Let us see this in action through the following exercise.
+
+::: challenge
+
+## Compare External Libraries in the Command Line and PyCharm
+
+Can you recall two places where information about our project's dependencies
+can be found from the command line?
+Compare that information with the equivalent configuration in PyCharm.
+
+Hint: We can use an argument to `pip`,
+or find the packages directly in a subdirectory of our virtual environment directory "venv".
+
+:::: solution
+
+From the previous episode,
+you may remember that we can get the list of packages in the current virtual environment
+using `pip`:
+
+```bash
+(venv) $ python3 -m pip list
+```
+
+```output
+Package         Version
+--------------- -------
+contourpy       1.2.0
+cycler          0.12.1
+fonttools       4.45.0
+kiwisolver      1.4.5
+matplotlib      3.8.2
+numpy           1.26.2
+packaging       23.2
+Pillow          10.1.0
+pip             23.0.1
+pyparsing       3.1.1
+python-dateutil 2.8.2
+setuptools      67.6.1
+six             1.16.0
+```
+
+However, `python3 -m pip list` shows all the packages in the virtual environment -
+if we want to see only the list of packages that we installed,
+we can use the `python3 -m pip freeze` command instead:
+
+```bash
+(venv) $ python3 -m pip freeze
+```
+
+```output
+contourpy==1.2.0
+cycler==0.12.1
+fonttools==4.45.0
+kiwisolver==1.4.5
+matplotlib==3.8.2
+numpy==1.26.2
+packaging==23.2
+Pillow==10.1.0
+pyparsing==3.1.1
+python-dateutil==2.8.2
+six==1.16.0
+```
+
+We see the `pip` package in `python3 -m pip list` but not in `python3 -m pip freeze`
+as we did not install it using `pip`.
+Remember that we use `python3 -m pip freeze` to update our `requirements.txt` file,
+to keep a list of the packages our virtual environment includes.
+Python will not do this automatically;
+we have to manually update the file when our requirements change using:
+
+```bash
+python3 -m pip freeze > requirements.txt
+```
+
+If we want, we can also see the list of packages directly in the following subdirectory of `venv`:
+
+```bash
+(venv) $ ls -l venv/lib/python3.11/site-packages
+```
+
+```output
+total 88
+drwxr-xr-x 105 alex staff 3360 20 Nov 15:34 PIL
+drwxr-xr-x 9 alex staff 288 20 Nov 15:34 Pillow-10.1.0.dist-info
+drwxr-xr-x 4 alex staff 128 20 Nov 15:34 __pycache__
+drwxr-xr-x 5 alex staff 160 20 Nov 15:32 _distutils_hack
+drwxr-xr-x 16 alex staff 512 20 Nov 15:34 contourpy
+drwxr-xr-x 7 alex staff 224 20 Nov 15:34 contourpy-1.2.0.dist-info
+drwxr-xr-x 5 alex staff 160 20 Nov 15:34 cycler
+drwxr-xr-x 8 alex staff 256 20 Nov 15:34 cycler-0.12.1.dist-info
+drwxr-xr-x 14 alex staff 448 20 Nov 15:34 dateutil
+-rw-r--r-- 1 alex staff 151 20 Nov 15:32 distutils-precedence.pth
+drwxr-xr-x 33 alex staff 1056 20 Nov 15:34 fontTools
+drwxr-xr-x 9 alex staff 288 20 Nov 15:34 fonttools-4.45.0.dist-info
+drwxr-xr-x 8 alex staff 256 20 Nov 15:34 kiwisolver
+drwxr-xr-x 8 alex staff 256 20 Nov 15:34 kiwisolver-1.4.5.dist-info
+drwxr-xr-x 150 alex staff 4800 20 Nov 15:34 matplotlib
+drwxr-xr-x 20 alex staff 640 20 Nov 15:34 matplotlib-3.8.2.dist-info
+drwxr-xr-x 5 alex staff 160 20 Nov 15:34 mpl_toolkits
+drwxr-xr-x 43 alex staff 1376 20 Nov 15:34 numpy
+drwxr-xr-x 9 alex staff 288 20 Nov 15:34 numpy-1.26.2.dist-info
+drwxr-xr-x 18 alex staff 576 20 Nov 15:34 packaging
+drwxr-xr-x 9 alex staff 288 20 Nov 15:34 packaging-23.2.dist-info
+drwxr-xr-x 9 alex staff 288 20 Nov 15:32 pip
+drwxr-xr-x 10 alex staff 320 20 Nov 15:33 pip-23.0.1.dist-info
+drwxr-xr-x 6 alex staff 192 20 Nov 15:32 pkg_resources
+-rw-r--r-- 1 alex staff 90 20 Nov 15:34 pylab.py
+drwxr-xr-x 15 alex
staff 480 20 Nov 15:34 pyparsing
+drwxr-xr-x 7 alex staff 224 20 Nov 15:34 pyparsing-3.1.1.dist-info
+drwxr-xr-x 9 alex staff 288 20 Nov 15:34 python_dateutil-2.8.2.dist-info
+drwxr-xr-x 49 alex staff 1568 20 Nov 15:32 setuptools
+drwxr-xr-x 10 alex staff 320 20 Nov 15:32 setuptools-67.6.1.dist-info
+drwxr-xr-x 8 alex staff 256 20 Nov 15:34 six-1.16.0.dist-info
+-rw-r--r-- 1 alex staff 34549 20 Nov 15:34 six.py
+```
+
+Finally, if you look at both the contents of
+`venv/lib/python3.11/site-packages` and `requirements.txt`
+and compare that with the packages shown in PyCharm's Python Interpreter Configuration -
+you will see that they all contain equivalent information.
+
+::::
+
+:::
+
+#### Adding an External Library
+
+We have already added packages `numpy` and `matplotlib` to our virtual environment
+from the command line in the previous episode,
+so we are up-to-date with all external libraries we require at the moment.
+However, we will need the library `pytest` soon to implement tests for our code.
+We will use this opportunity to install it from PyCharm in order to see
+an alternative way of doing this and how it propagates to the command line.
+
+1. Select either `PyCharm` > `Settings` (Mac) or `File` > `Settings` (Linux, Windows).
+2. In the settings window that appears,
+   select `Project: python-intermediate-inflammation` > `Python Interpreter` from the left.
+3. Select the `+` icon at the top of the window.
+   In the window that appears, search for the name of the library (`pytest`),
+   select it from the list,
+   then select `Install Package`.
+   Once it finishes installing, you can close that window.
+   ![](fig/pycharm-add-library.png){alt='Installing a package in PyCharm' .image-with-shadow width="800px" }
+4. Select `OK` in the `Preferences`/`Settings` window.
+
+It may take a few minutes for PyCharm to install it.
+After it is done, the `pytest` library is added to our virtual environment.
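As a quick sanity check (a sketch; it assumes the virtual environment is still active in your shell), `pip show` prints the metadata of an installed package:

```bash
# Confirm pytest is now installed in the active virtual environment;
# prints Name, Version, Location, etc. of the package
python3 -m pip show pytest
```

If the package is not installed, `pip show` prints a warning instead and exits with a non-zero status.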
+You can also verify this from the command line by
+listing the `venv/lib/python3.11/site-packages` subdirectory.
+Note, however, that `requirements.txt` is not updated -
+as we mentioned earlier this is something you have to do manually.
+Let us do this as an exercise.
+
+::: challenge
+## Update `requirements.txt` After Adding a New Dependency
+
+Export the newly updated virtual environment into the `requirements.txt` file.
+
+:::: solution
+
+Let us first verify that the newly installed library `pytest` appears in our virtual environment
+but not in `requirements.txt`. First, let us check the list of installed packages:
+
+```bash
+(venv) $ python3 -m pip list
+```
+
+```output
+Package         Version
+--------------- -------
+contourpy       1.2.0
+cycler          0.12.1
+fonttools       4.45.0
+iniconfig       2.0.0
+kiwisolver      1.4.5
+matplotlib      3.8.2
+numpy           1.26.2
+packaging       23.2
+Pillow          10.1.0
+pip             23.0.1
+pluggy          1.3.0
+pyparsing       3.1.1
+pytest          7.4.3
+python-dateutil 2.8.2
+setuptools      67.6.1
+six             1.16.0
+```
+
+We can see the `pytest` library appearing in the listing above. However, if we do:
+
+```bash
+(venv) $ cat requirements.txt
+```
+
+```output
+contourpy==1.2.0
+cycler==0.12.1
+fonttools==4.45.0
+kiwisolver==1.4.5
+matplotlib==3.8.2
+numpy==1.26.2
+packaging==23.2
+Pillow==10.1.0
+pyparsing==3.1.1
+python-dateutil==2.8.2
+six==1.16.0
+```
+
+`pytest` is missing from `requirements.txt`. To add it, we need to update the file by repeating the command:
+
+```bash
+(venv) $ python3 -m pip freeze > requirements.txt
+```
+
+`pytest` is now present in `requirements.txt`:
+
+```output
+contourpy==1.2.0
+cycler==0.12.1
+fonttools==4.45.0
+iniconfig==2.0.0
+kiwisolver==1.4.5
+matplotlib==3.8.2
+numpy==1.26.2
+packaging==23.2
+Pillow==10.1.0
+pluggy==1.3.0
+pyparsing==3.1.1
+pytest==7.4.3
+python-dateutil==2.8.2
+six==1.16.0
+```
+
+::::
+
+:::
+
+#### Adding a Run Configuration for Our Project
+
+Having configured a virtual environment, we now need to tell PyCharm to use it for our project.
+This is done by creating and adding a **Run Configuration** to a project. +Run Configurations in PyCharm are named sets of startup properties +that define which main Python script to execute and what (optional) +runtime parameters/environment variables (i.e. additional configuration options) to pass +and use on top of virtual environments. + +1. To add a new Run Configuration for a project - + select `Run` > `Edit Configurations...` from the top menu. +2. Select `Add new run configuration...` then `Python`. + ![](fig/pycharm-add-run-configuration.png){alt='Adding a Run Configuration in PyCharm' .image-with-shadow width="800px" } +3. In the new popup window, in the `Script path` field select the folder button + and find and select `inflammation-analysis.py`. + This tells PyCharm which script to run (i.e. what the main entry point to our application is). + ![](fig/pycharm-run-configuration-popup.png){alt='Run Configuration Popup in PyCharm' .image-with-shadow width="800px" } +4. In the same window, select "Python 3.11 (python-intermediate-inflammation)" + (i.e. the virtual environment and interpreter you configured earlier in this episode) + in the `Python interpreter` field. +5. You can give this run configuration a name at the top of the window if you like - + e.g. let us name it `inflammation analysis`. +6. You can optionally configure run parameters and environment variables in the same window - + we do not need this at the moment. +7. Select `Apply` to confirm these settings. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Virtual Environments And Run Configurations in PyCharm + +We configured the Python interpreter to use for our project by pointing PyCharm +to the virtual environment we created from the command line +(which encapsulates a Python interpreter and external libraries our code needs to run). 
+Recall that you can create several virtual environments based on the same Python interpreter +but with different external libraries - +this is helpful when you need to develop different types of applications. +For example, you can create one virtual environment +based on Python 3.11 to develop Django Web applications +and another virtual environment +based on the same Python 3.11 to work with scientific libraries. + +Run Configurations provided by PyCharm are one extra layer on top of virtual environments - +you can vary a run configuration each time your code is executed and +you can have separate configurations for running, debugging and testing your code. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now you know how to configure and manipulate your environment in both tools +(command line and PyCharm), +which is a useful parallel to be aware of. +Let us have a look at some other features afforded to us by PyCharm. + +### Syntax Highlighting + +The first thing you may notice is that code is displayed using different colours. +Syntax highlighting is a feature that displays source code terms +in different colours and fonts according to the syntax category the highlighted term belongs to. +It also makes syntax errors visually distinct. +Highlighting does not affect the meaning of the code itself - +it is intended only for humans to make reading code and finding errors easier. + +![](fig/pycharm-syntax-highlighting.png){alt='Syntax Highlighting Functionality in PyCharm' .image-with-shadow width="1000px" } + +### Code Completion + +As you start typing code, +PyCharm will offer to complete some of the code for you in the form of an auto completion popup. +This is a context-aware code completion feature +that speeds up the process of coding +(e.g. reducing typos and other common mistakes) +by offering available variable names, +functions from available packages, +parameters of functions, +hints related to syntax errors, +etc. 
+
+![](fig/pycharm-code-completion.png){alt='Code Completion Functionality in PyCharm' .image-with-shadow width="600px" }
+
+### Code Definition \& Documentation References
+
+You will often need code reference information to help you code.
+PyCharm shows this useful information,
+such as definitions of symbols
+(e.g. functions, parameters, classes, fields, and methods)
+and documentation references by means of quick popups and inline tooltips.
+
+For a selected piece of code,
+you can access various code reference information from the `View` menu
+(or via various keyboard shortcuts),
+including:
+
+- Quick Definition -
+  where and how symbols (functions, parameters, classes, fields, and methods) are defined
+- Quick Type Definition -
+  type definition of variables, fields or any other symbols
+- Quick Documentation -
+  inline documentation ([*docstrings*](15-coding-conventions.md)
+  for any symbol created in accordance with [PEP-257](https://peps.python.org/pep-0257/))
+- Parameter Info -
+  the names and expected types of parameters in method and function calls.
+  Use this when the cursor is on the argument of a function call.
+- Type Info -
+  the type of an expression
+
+![](fig/pycharm-code-reference.png){alt='Code References Functionality in PyCharm' .image-with-shadow width="1000px" }
+
+### Code Search
+
+You can search for a text string within a project,
+use different scopes to narrow your search process,
+use regular expressions for complex searches,
+include/exclude certain files from your search, and find usages and occurrences.
+To find a search string in the whole project:
+
+1. From the main menu,
+   select `Edit | Find | Find in Path ...`
+   (or `Edit | Find | Find in Files...` depending on your version of PyCharm).
+2. Type your search string in the search field of the popup.
+   Alternatively, in the editor, highlight the string you want to find
+   and press `Command-Shift-F` (on Mac) or `Control-Shift-F` (on Windows).
+   PyCharm places the highlighted string into the search field of the popup.
+
+   ![](fig/pycharm-code-search.png){alt='Code Search Functionality in PyCharm' .image-with-shadow width="800px" }
+
+   If needed, specify additional options in the popup.
+   PyCharm will list the search strings and all the files that contain them.
+3. Check the results in the preview area of the dialog where you can replace the search string
+   or select another string,
+   or press `Command-Shift-F` (on Mac) or `Control-Shift-F` (on Windows) again
+   to start a new search.
+4. To see the list of occurrences in a separate panel,
+   click the `Open in Find Window` button in the bottom right corner.
+   The find panel will appear at the bottom of the main window;
+   use this panel and its options to group the results, preview them,
+   and work with them further.
+
+   ![](fig/pycharm-find-panel.png){alt='Find Panel in PyCharm' .image-with-shadow width="1000px" }
+
+### Version Control
+
+PyCharm supports a directory-based versioning model,
+which means that each project directory can be associated with a different version control system.
+Our project was already under Git version control and PyCharm recognised it.
+It is also possible to add an unversioned project directory to version control directly from PyCharm.
+
+During this course,
+we will do all our version control commands from the command line,
+but it is worth noting that PyCharm supports a comprehensive subset of Git commands
+(i.e. it is possible to perform a set of common Git commands from PyCharm but not all).
+A very useful version control feature in PyCharm is
+graphically comparing changes you made locally to a file
+with the version of the file in a repository,
+a different commit version
+or a version in a different branch -
+this is something that cannot be done equally well from the text-based command line.
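For comparison, the closest text-based equivalents on the command line are the `git diff` family of commands (a quick sketch; the branch names in the final comment are illustrative):

```bash
# Compare your local changes textually from the command line
git diff            # working directory vs the staging area
git diff --staged   # staging area vs the last commit
# To compare two branches, e.g.: git diff main feature-branch (illustrative names)
```

The output is a unified diff, with removed lines prefixed by `-` and added lines by `+`.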
+
+You can find the full
+[documentation on PyCharm's built-in version control support](https://www.jetbrains.com/help/pycharm/version-control-integration.html)
+online.
+
+![](fig/pycharm-version-control.png){alt='Version Control Functionality in PyCharm' .image-with-shadow width="1000px" }
+
+### Running Scripts in PyCharm
+
+We have configured our environment and explored some of the most commonly used PyCharm features
+and are now ready to run our script from PyCharm!
+To do so, right-click the `inflammation-analysis.py` file
+in the PyCharm project/file navigator on the left,
+and select `Run 'inflammation analysis'` (i.e. the Run Configuration we created earlier).
+
+![](fig/pycharm-run-script.png){alt='Running a script from PyCharm' .image-with-shadow width="800px" }
+
+The script will run in a terminal window at the bottom of the IDE window and display something like:
+
+```output
+/Users/alex/work/python-intermediate-inflammation/venv/bin/python /Users/alex/work/python-intermediate-inflammation/inflammation-analysis.py
+usage: inflammation-analysis.py [-h] infiles [infiles ...]
+inflammation-analysis.py: error: the following arguments are required: infiles
+
+Process finished with exit code 2
+```
+
+This is the same error we got when running the script from the command line.
+We will get back to this error shortly -
+for now, the good thing is that we managed to set up our project for development
+both from the command line and PyCharm and are getting the same outputs.
+Before we move on to fixing errors and writing more code,
+let us have a look at the last set of tools for collaborative code development
+which we will be using in this course - Git and GitHub.
+
+## Optional exercises
+
+Check out [this optional exercise](17-section1-optional-exercises.md)
+to try out different IDEs and code editors.
+ + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- An IDE is an application that provides a comprehensive set of facilities for software development, including syntax highlighting, code search and completion, version control, testing and debugging. +- PyCharm recognises virtual environments configured from the command line using `venv` and `pip`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/14-collaboration-using-git.md b/14-collaboration-using-git.md new file mode 100644 index 000000000..23ca9d868 --- /dev/null +++ b/14-collaboration-using-git.md @@ -0,0 +1,651 @@ +--- +title: 1.4 Software Development Using Git and GitHub +start: no +teaching: 35 +exercises: 0 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Commit changes in a software project to a local repository and publish them in a remote repository on GitHub +- Create branches for managing different threads of code development +- Learn to use feature branch workflow to effectively collaborate with a team on a software project + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are Git branches and why are they useful for code development? +- What are some best practices when developing software collaboratively using Git? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +So far we have checked out our software project from GitHub +and used command line tools to configure a virtual environment for our project and run our code. +We have also familiarised ourselves with PyCharm - +a graphical tool we will use for code development, testing and debugging. +We are now going to start using another set of tools +from the collaborative code development toolbox - +namely, the version control system Git and code sharing platform GitHub. +These two will enable us to track changes to our code and share it with others. 
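Before going further, a quick check (a sketch; the values printed will be your own details, set during the course setup) that Git is installed and knows who we are:

```bash
# Check the installed Git version and your identity configuration
git --version
git config --global user.name    # should print your name
git config --global user.email   # should print your email address
```

If the last two commands print nothing, your identity has not been configured yet and commits will fail with a prompt to set it.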
+ +You may recall that we have already made some changes to our project locally - +we created a virtual environment in the directory called "venv" +and exported it to the `requirements.txt` file. +We should now decide which of those changes we want to check in and share with others in our team. +This is a typical software development workflow - +you work locally on code, +test it to make sure it works correctly and as expected, +then record your changes using version control +and share your work with others via a shared and centrally backed-up repository. + +Firstly, let us remind ourselves how to work with Git from the command line. + +## Git Refresher + +Git is a version control system for tracking changes in computer files +and coordinating work on those files among multiple people. +It is primarily used for source code management in software development +but it can be used to track changes in files in general - +it is particularly effective for tracking text-based files +(e.g. source code files, CSV, Markdown, HTML, CSS, Tex, etc. files). + +Git has several important characteristics: + +- support for non-linear development + allowing you and your colleagues to work on different parts of a project concurrently, +- support for distributed development + allowing for multiple people to be working on the same project + (even the same file) at the same time, +- every change recorded by Git remains part of the project history + and can be retrieved at a later date, + so even if you make a mistake you can revert to a point before it. + +The diagram below shows a typical software development lifecycle with Git +(in our case starting from making changes in a local branch that "tracks" a remote branch) and the commonly used commands to interact +with different parts of the Git infrastructure, including: + +- **working directory** - + a local directory (including any subdirectories) where your project files live + and where you are currently working. 
+ It is also known as the "untracked" area of Git. + Any changes to files will be marked by Git in the working directory. + If you make changes to the working directory and do not explicitly tell Git to save them - + you will likely lose those changes. + Using `git add filename` command, + you tell Git to start tracking changes to file `filename` within your working directory. +- **staging area (index)** - + once you tell Git to start tracking changes to files + (with `git add filename` command), + Git saves those changes in the staging area on your local machine. + Each subsequent change to the same file needs to be followed by another `git add filename` command + to tell Git to update it in the staging area. + To see what is in your working directory and staging area at any moment + (i.e. what changes is Git tracking), + run the command `git status`. +- **local repository** - + stored within the `.git` directory of your project locally, + this is where Git wraps together all your changes from the staging area + and puts them using the `git commit` command. + Each commit is a new, permanent snapshot (checkpoint, record) of your project in time, + which you can share or revert to. +- **remote repository** - + this is a version of your project that is hosted somewhere on the Internet + (e.g., on GitHub, GitLab or somewhere else). + While your project is nicely version-controlled in your local repository, + and you have snapshots of its versions from the past, + if your machine crashes - you still may lose all your work. Furthermore, you cannot + share or collaborate on this local work with others easily. + Working with a remote repository involves pushing your local changes remotely + (using `git push`) and pulling other people's changes from a remote repository to + your local copy (using `git fetch` or `git pull`) to keep the two in sync + in order to collaborate (with a bonus that your work also gets backed up to another machine). 
+  Note that a common best practice when collaborating with others on a shared repository
+  is to always do a `git pull` before a `git push`, to ensure you have the latest changes before you push your own.
+
+
+![Software development lifecycle with Git](fig/git-lifecycle.svg){alt='Development lifecycle with Git, containing Git commands add, commit, push, fetch, restore, merge and pull' .image-with-shadow width="600px"}
+
+## Checking-in Changes to Our Project
+
+Let us check in the changes we have made to our project so far.
+The first thing to do upon navigating into our software project's directory root
+is to check the current status of our local working directory and repository.
+
+```bash
+$ git status
+```
+
+```output
+On branch main
+Your branch is up to date with 'origin/main'.
+
+Untracked files:
+  (use "git add <file>..." to include in what will be committed)
+	requirements.txt
+	venv/
+
+nothing added to commit but untracked files present (use "git add" to track)
+```
+
+As expected,
+Git is telling us that we have some untracked files -
+`requirements.txt` and directory "venv" -
+present in our working directory which we have neither
+staged nor committed to our local repository yet.
+You do not want to commit the newly created directory "venv" and share it with others
+because this directory is specific to your machine and setup only
+(i.e. it contains local paths to libraries on your system
+that most likely would not work on any other machine).
+You do, however, want to share `requirements.txt` with your team
+as this file can be used to replicate the virtual environment on your collaborators' systems.
+
+To tell Git to intentionally ignore and not track certain files and directories,
+you need to specify them in the `.gitignore` text file in the project root.
+Our project already has `.gitignore`,
+but in cases where you do not have it -
+you can simply create it yourself.
+In our case, we want to tell Git to ignore the "venv" directory
+(and ".venv" as another naming convention for directories containing virtual environments)
+and stop notifying us about it.
+Edit your `.gitignore` file in PyCharm
+and add a line containing "venv/" and another one containing ".venv/".
+It does not matter much in this case where within the file you add these lines,
+so let us do it at the end.
+Your `.gitignore` should look something like this:
+
+```output
+# IDEs
+.vscode/
+.idea/
+
+# Intermediate Coverage file
+.coverage
+
+# Output files
+*.png
+
+# Python runtime
+*.pyc
+*.egg-info
+.pytest_cache
+
+# Virtual environments
+venv/
+.venv/
+```
+
+You may notice that we are already not tracking certain files and directories
+with useful comments about what exactly we are ignoring.
+You may also notice that each line in `.gitignore` is actually a pattern,
+so you can ignore multiple files that match a pattern
+(e.g. "*.png" will ignore all PNG files in the current directory).
+
+If you run the `git status` command now,
+you will notice that Git has cleverly understood that
+you want to ignore changes to the "venv" directory so it is not warning us about it any more.
+However, it has now detected a change to the `.gitignore` file that needs to be committed.
+
+```bash
+$ git status
+```
+
+```output
+On branch main
+Your branch is up to date with 'origin/main'.
+
+Changes not staged for commit:
+  (use "git add <file>..." to update what will be committed)
+  (use "git restore <file>..." to discard changes in working directory)
+	modified:   .gitignore
+
+Untracked files:
+  (use "git add <file>..." to include in what will be committed)
+	requirements.txt
+
+no changes added to commit (use "git add" and/or "git commit -a")
+```
+
+To commit the changes to `.gitignore` and `requirements.txt` to the local repository,
+we first have to add these files to the staging area to prepare them for committing.
+We can do that for both files in one command:
+
+```bash
+$ git add .gitignore requirements.txt
+```
+
+Now we can commit them to the local repository with:
+
+```bash
+$ git commit -m "Initial commit of requirements.txt. Ignoring virtual env. folder."
+```
+
+Remember to use meaningful messages for your commits.
+
+So far we have been working in isolation -
+all the changes we have done are still only stored locally on our individual machines.
+In order to share our work with others,
+we should push our changes to the remote repository on GitHub.
+Before we push our changes, however, we should first do a `git pull`.
+This is considered best practice, since any changes made to the repository -
+notably by other people -
+may impact the changes we are about to push.
+This could occur, for example,
+by two collaborators making different changes to the same lines in a file.
+By pulling first, we are made aware of any changes made by others,
+in particular if there are any conflicts between their changes and ours.
+
+```bash
+$ git pull
+```
+
+Now that we have ensured our repository is synchronised with the remote one,
+we can push our changes:
+
+```bash
+$ git push origin main
+```
+
+In the above command,
+`origin` is an alias for the remote repository you used when cloning the project locally
+(it is called that by convention and set up automatically by Git
+when you run the `git clone remote_url` command to replicate a remote repository locally);
+`main` is the name of our main (and currently only) development branch.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## GitHub Authentication/Authorisation Error
+
+If, at this point (i.e. the first time you try to write to a remote repository on GitHub),
+you get a warning/error that HTTPS access is deprecated, or a personal access token is required,
+then you have cloned the repository using HTTPS and not SSH.
+You should revisit the [instructions
+on setting up your GitHub for SSH and key pair authentication](../learners/setup.md#secure-access-to-github-using-git-from-command-line)
+and can fix this from the command line by
+changing the remote repository's HTTPS URL to its SSH equivalent:
+
+```bash
+$ git remote set-url origin git@github.com:/python-intermediate-inflammation.git
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Git Remotes
+
+Note that systems like Git allow us to synchronise work between
+any two or more copies of the same repository -
+the ones that are not located on your machine are "Git remotes" for you.
+In practice, though,
+it is easiest to agree with your collaborators to use one copy as a central hub
+(such as GitHub or GitLab), where everyone pushes their changes to.
+This also avoids risks associated with keeping the "central copy" on someone's laptop.
+You can have more than one remote configured for your local repository,
+each of which generally is either read-only or read/write for you.
+Collaborating with others involves
+managing these remote repositories and pushing and pulling information
+to and from them when you need to share work.
+
+![](fig/git-distributed.png){alt='git-distributed' .image-with-shadow width="400px"}
+

+*Git - distributed version control system, from W3Docs (freely available)*
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Git Branches
+
+When we do `git status`,
+Git also tells us that we are currently on the `main` branch of the project.
+A branch is one version of your project (the files in your repository)
+that can contain its own set of commits.
+We can create a new branch,
+make changes to the code which we then commit to the branch,
+and, once we are happy with those changes,
+merge them back to the main branch.
+To see what other branches are available, do:
+
+```bash
+$ git branch
+```
+
+```output
+* main
+```
+
+At the moment, there is only one branch (`main`)
+and hence only one version of the code available.
+When you create a Git repository for the first time,
+by default you only get one version (i.e. branch) - `main`.
+Let us have a look at why having different branches might be useful.
+
+### Feature Branch Software Development Workflow
+
+While it is technically OK to commit your changes directly to the `main` branch,
+and you may often find yourself doing so for some minor changes,
+the best practice is to use a new branch for each separate and self-contained unit/piece of work
+you want to add to the project.
+This unit of work is also often called a *feature*
+and the branch where you develop it is called a *feature branch*.
+Each feature branch should have its own meaningful name -
+indicating its purpose (e.g. "issue23-fix").
+If we keep making changes and pushing them directly to the `main` branch on GitHub,
+then anyone who downloads our software from there will get all of our work in progress -
+whether or not it is ready to use!
+So, working on a separate branch for each feature you are adding is good for several reasons: + +- it enables the main branch to remain stable + while you and the team explore and test the new code on a feature branch, +- it enables you to keep the untested and not-yet-functional feature branch code + under version control and backed up, +- you and other team members may work on several features + at the same time independently from one another, and +- if you decide that the feature is not working or is no longer needed - + you can easily and safely discard that branch without affecting the rest of the code. + +Branches are commonly used as part of a feature-branch workflow, shown in the diagram below. + +![](fig/git-feature-branch.svg){alt='Git feature branch workflow diagram' .image-with-shadow width="800px"} + +

+*Git feature branches. Adapted from Git Tutorial by sillevl (Creative Commons Attribution 4.0 International License)*
+
+In the software development workflow,
+we typically have a main branch which is the version of the code that is
+tested, stable and reliable.
+Then, we normally have a development branch
+(called `develop` or `dev` by convention)
+that we use for work-in-progress code.
+As we work on adding new features to the code,
+we create new feature branches that first get merged into `develop`
+after a thorough testing process.
+After even more testing - the `develop` branch will get merged into `main`.
+The points when feature branches are merged to `develop`,
+and `develop` to `main`,
+depend entirely on the practice/strategy established in the team.
+For example, for smaller projects
+(e.g. if you are working alone on a project or in a very small team),
+feature branches sometimes get directly merged into `main` upon testing,
+skipping the `develop` branch step.
+In other projects,
+the merge into `main` happens only at the point of making a new software release.
+Whichever is the case for you, a good rule of thumb is -
+nothing that is broken should be in `main`.
+
+### Creating Branches
+
+Let us create a `develop` branch to work on:
+
+```bash
+$ git branch develop
+```
+
+This command does not give any output,
+but if we run `git branch` again,
+without giving it a new branch name, we can see the list of branches we have -
+including the new one we have just made.
+
+```bash
+$ git branch
+```
+
+```output
+  develop
+* main
+```
+
+The `*` indicates the currently active branch.
+So how do we switch to our new branch?
+We use the `git switch` command with the name of the branch:
+
+```bash
+$ git switch develop
+```
+
+```output
+Switched to branch 'develop'
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Create and Switch to Branch Shortcut
+
+A shortcut to create a new branch and immediately switch to it:
+
+```bash
+$ git switch -c develop
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Updating Branches
+
+If we start updating and committing files now,
+the commits will happen on the `develop` branch
+and will not affect the version of the code in `main`.
+We add and commit things to the `develop` branch in the same way as we do to `main`.
+
+Let us make a small modification to `inflammation/models.py` in PyCharm,
+and, say, change the spelling of "2d" to "2D" in docstrings for functions
+`daily_mean()`,
+`daily_max()` and
+`daily_min()` to see updating branches in action.
+
+If we do:
+
+```bash
+$ git status
+```
+
+```output
+On branch develop
+Changes not staged for commit:
+  (use "git add <file>..." to update what will be committed)
+  (use "git restore <file>..." to discard changes in working directory)
+
+	modified:   inflammation/models.py
+
+no changes added to commit (use "git add" and/or "git commit -a")
+```
+
+Git is telling us that we are on branch `develop`
+and which tracked files have been modified in our working directory.
+
+We can now `add` and `commit` the changes in the usual way.
+
+```bash
+$ git add inflammation/models.py
+$ git commit -m "Spelling fix"
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Currently Active Branch
+
+Remember, `add` and `commit` commands always act on the currently active branch.
+You have to be careful and aware of which branch you are working with at any given moment.
+`git status` can help with that, and you will find yourself invoking it very often.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Pushing New Branch Remotely
+
+We push the contents of the `develop` branch to GitHub
+in the same way as we pushed the `main` branch.
+However, as we have just created this branch locally,
+it still does not exist in our remote repository.
+You can check that in GitHub by listing all branches.
+
+![](fig/github-main-branch.png){alt="Software project's main branch" .image-with-shadow width="600px"}
+
+To push a new local branch remotely for the first time,
+you could use the `-u` flag and the name of the branch you are creating and pushing to:
+
+```bash
+$ git push -u origin develop
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Git Push With `-u` Flag
+
+Using the `-u` switch with the `git push` command is a handy shortcut for:
+(1) creating the new remote branch and
+(2) setting your local branch to automatically track the remote one at the same time.
+You need to use the `-u` switch only once to set up that association between
+your branch and the remote one explicitly.
+After that you could simply use `git push`
+without specifying the remote repository, if you wish.
+We still prefer to explicitly state this information in commands.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Let us confirm that the new branch `develop` now exists remotely on GitHub too.
+From the `Code` tab in your repository in GitHub,
+click the branch dropdown menu (currently showing the default branch `main`).
+You should see your `develop` branch in the list too.
+
+![](fig/github-develop-branch.png){alt="Software project's develop branch" .image-with-shadow width="600px"}
+
+You may also have noticed GitHub's notification about the latest push to your `develop` branch
+just above the repository files and branches drop-down menu.
+
+Now others can check out the `develop` branch too and continue to develop code on it.
+
+After the initial push of the new branch,
+every subsequent time we can push to it in the usual manner (i.e. without the `-u` switch):
+
+```bash
+$ git push origin develop
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What is the Relationship Between Originating and New Branches?
+
+It is natural to think that new branches have a parent/child relationship
+with their originating branch,
+but in actual Git terms, branches themselves do not have parents
+but single commits do.
+Any commit can have zero parents (a root, or initial, commit),
+one parent (a regular commit),
+or multiple parents (a merge commit),
+and using this structure, we can build
+a 'view' of branches from a set of commits and their relationships.
+A common way to look at it is that Git branches are really only
+[lightweight, movable pointers to commits](https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell).
+So as a new commit is added to a branch,
+the branch pointer is moved to the new commit.
+
+What this means is that when you perform a merge between two branches,
+Git is able to determine the common 'commit ancestor'
+through the commits in a 'branch',
+and use that common ancestor to
+determine which commits need to be merged onto the destination branch.
+It also means that, in theory, you could merge any branch with any other at any time...
+although it may not make sense to do so!
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Merging Into Main Branch
+
+Once you have tested your changes on the `develop` branch,
+you will want to merge them onto the `main` branch.
+To do so, make sure you have committed all your changes on the `develop` branch and then switch to `main`:
+
+```bash
+$ git switch main
+```
+
+```output
+Switched to branch 'main'
+Your branch is up to date with 'origin/main'.
+```
+
+To merge the `develop` branch on top of `main` do:
+
+```bash
+$ git merge develop
+```
+
+```output
+Updating 05e1ffb..be60389
+Fast-forward
+ inflammation/models.py | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+```
+
+If there are no conflicts,
+Git will merge the branches without complaining
+and apply all new commits from `develop` on top of the last commit from `main`.
+If there are merge conflicts
+(e.g. a team collaborator modified the same portion of the same file you are working on
+and checked in their changes before you),
+the particular files with conflicts will be marked
+and you will need to resolve those conflicts
+and commit the changes before attempting to merge again.
+Since we have no conflicts, we can now push the `main` branch to the remote repository:
+
+```bash
+$ git push origin main
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## All Branches Are Equal
+
+In Git, all branches are equal - there is nothing special about the `main` branch.
+It is called that by convention and is created by default,
+but it can also be called something else.
+A good example is the `gh-pages` branch
+which is often the source branch for website projects hosted on GitHub
+(rather than `main`).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::: testimonial
+
+## Keeping Main Branch Stable
+
+Good software development practice is to keep the `main` branch stable
+while you and the team develop and test new functionalities on feature branches
+(which can be done in parallel and independently by different team members).
+The next step is to merge feature branches onto the `develop` branch,
+where more testing can occur to verify that the new features
+work well with the rest of the code (and not just in isolation).
+We talk more about different types of code testing in one of the following episodes.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+ + + + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- A branch is one version of your project that can contain its own set of commits. +- Feature branches enable us to develop / explore / test new code features without affecting the stable `main` code. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/15-coding-conventions.md b/15-coding-conventions.md new file mode 100644 index 000000000..a45506b61 --- /dev/null +++ b/15-coding-conventions.md @@ -0,0 +1,808 @@ +--- +title: 1.5 Python Code Style Conventions +teaching: 20 +exercises: 20 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Understand the benefits of following community coding conventions + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Why should you follow software code style conventions? +- Who is setting code style conventions? +- What code style conventions exist for Python? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +We now have all the tools we need for software development and are raring to go. +But before you dive into writing some more code and sharing it with others, +ask yourself what kind of code should you be writing and publishing? +It may be worth spending some time learning a bit about Python coding style conventions +to make sure that your code is consistently formatted and readable by yourself and others. + +> *"Any fool can write code that a computer can understand. 
+> Good programmers write code that humans can understand."*
+>
+> --- [Martin Fowler](https://en.wikiquote.org/wiki/Martin_Fowler), British software engineer, author and international speaker on software development
+
+## Python Coding Style Guide
+
+One of the most important things we can do to make sure our code is readable by other developers
+(and ourselves a few months down the line)
+is to make sure that it is descriptive,
+cleanly and consistently formatted
+and uses sensible, descriptive names for variables, functions and modules.
+In order to help us format our code, we generally follow guidelines known as a **style guide**.
+A style guide is a set of conventions that we agree upon
+with our colleagues or community,
+to ensure that everyone contributing to the same project is
+producing code which looks similar in style.
+While a group of developers may choose to write
+and agree upon a new style guide unique to each project,
+in practice many programming languages have a single style guide
+which is adopted almost universally by the communities around the world.
+In Python, although we do have a choice of style guides available,
+the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guide is most commonly used.
+PEP here stands for Python Enhancement Proposals;
+PEPs are design documents for the Python community,
+typically specifications or conventions for how to do something in Python,
+a description of a new feature in Python, etc.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Style consistency
+
+One of the
+[key insights from Guido van Rossum](https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds),
+the creator of the Python programming language and one of the PEP 8 authors,
+is that code is read much more often than it is written.
+Style guidelines are intended to improve the readability of code
+and make it consistent across the wide spectrum of Python code.
+Consistency with the style guide is important.
+Consistency within a project is more important.
+Consistency within one module or function is the most important.
+However, know when to be inconsistent -
+sometimes style guide recommendations are just not applicable.
+When in doubt, use your best judgment.
+Look at other examples and decide what looks best. And do not hesitate to ask!
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+As we have already covered in the
+[episode on PyCharm IDE](13-ides.md),
+PyCharm highlights the language constructs (reserved words)
+and syntax errors to help us with coding.
+PyCharm also gives us recommendations for formatting the code -
+these recommendations are mostly taken from the PEP 8 style guide.
+
+A full list of style guidelines is available from the
+[PEP 8 website](https://www.python.org/dev/peps/pep-0008/);
+here we highlight a few.
+
+### Indentation
+
+Python uses indentation as a way of grouping
+statements that belong to a particular block of code.
+Spaces are the recommended indentation method in Python code.
+The guideline is to use 4 spaces per indentation level -
+so 4 spaces on level one, 8 spaces on level two and so on.
+Many people prefer the use of tabs to spaces to indent the code for many reasons
+(e.g. spaces require additional typing,
+it is easy to introduce an error by missing a single space character,
+accessibility for individuals using screen readers, etc.)
+and do not follow this guideline.
+Whether you decide to follow this guideline or not,
+be consistent and follow the style already used in the project.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Indentation in Python 2 vs Python 3
+
+Python 2 allowed code indented with a mixture of tabs and spaces.
+Python 3 disallows mixing the use of tabs and spaces for indentation.
+Whichever you choose, be consistent throughout the project.
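As a quick illustrative sketch (not part of the original lesson text), you can see Python 3 enforcing this rule by asking it to compile a snippet whose body mixes a tab with spaces - the compiler refuses with a `TabError`:

```python
# Illustrative sketch: Python 3 rejects code whose indentation mixes
# tabs and spaces inconsistently, raising TabError at compile time.
# One body line below is indented with a tab, the other with spaces.
mixed_indentation = "def f():\n\tx = 1\n        return x\n"

try:
    compile(mixed_indentation, "<example>", "exec")
except TabError as err:
    print("TabError:", err)
```

Running this prints the `TabError` message instead of producing a code object, which is what makes the "pick one and stick to it" advice enforceable in Python 3.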
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+PyCharm has built-in support for converting tab indentation to spaces
+"under the hood" for Python code in order to conform to PEP 8.
+So, you can type a tab character and PyCharm will automatically convert it to 4 spaces.
+You can control the number of spaces that PyCharm uses to replace one tab character,
+or you can decide to keep the tab character altogether and prevent automatic conversion.
+You can modify these settings in PyCharm's `Settings`\>`Editor`\>`Code Style`\>`Python`.
+
+![](fig/pycharm-indentation.png){alt='Python code indentation settings in PyCharm' .image-with-shadow width="800px"}
+
+You can also tell the editor to show non-printable characters
+if you are ever unsure what character exactly is being used
+by selecting `Settings` > `Editor` > `General` > `Appearance` then checking the "Show whitespaces" option.
+
+![](fig/pycharm-whitespace.png){alt='Python code whitespace settings in PyCharm' .image-with-shadow width="800px"}
+
+There are more complex rules on indenting single units of code that continue over several lines,
+e.g. function, list or dictionary definitions can all take more than one line.
+The preferred way of wrapping such long lines is
+by using Python's implied line continuation inside delimiters such as
+parentheses (`()`),
+brackets (`[]`)
+and braces (`{}`),
+or a hanging indent.
+
+```python
+# Add an extra level of indentation (extra 4 spaces) to distinguish arguments from the rest of the code that follows
+def long_function_name(
+        var_one, var_two, var_three,
+        var_four):
+    print(var_one)
+
+
+# Aligned with opening delimiter
+foo = long_function_name(var_one, var_two,
+                         var_three, var_four)
+
+# Use hanging indents to add an indentation level like paragraphs of text where all the lines in a paragraph are
+# indented except the first one
+foo = long_function_name(
+    var_one, var_two,
+    var_three, var_four)
+
+# Using hanging indent again, but closing bracket aligned with the first non-blank character of the previous line
+a_long_list = [
+    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0.33, 0.66, 1], [0.66, 0.83, 1], [0.77, 0.88, 1]]
+    ]
+
+# Using hanging indent again, but closing bracket aligned with the start of the multiline construct
+a_long_list2 = [
+    1,
+    2,
+    3,
+    # ...
+    79
+]
+```
+
+More details on good and bad practices for continuation lines can be found in
+[PEP 8 guideline on indentation](https://www.python.org/dev/peps/pep-0008/#indentation).
+
+### Maximum Line Length
+
+All lines should be limited to a maximum of 79 characters;
+for lines containing comments or docstrings (to be covered later)
+the line length limit should be 72 -
+see [this discussion](https://stackoverflow.com/q/15438326)
+for reasoning behind these numbers.
+Some teams strongly prefer a longer line length,
+and seem to have settled on a length of 100.
+Long lines of code can be broken over multiple lines
+by wrapping expressions in delimiters,
+as mentioned above (preferred method),
+or using a backslash (`\`) at the end of the line
+to indicate line continuation (slightly less preferred method).
+
+```python
+# Using delimiters ( ) to wrap a multi-line expression
+if (a == True and
+        b == False):
+    pass
+
+# Using a backslash (\) for line continuation
+if a == True and \
+        b == False:
+    pass
+```
+
+### Should a Line Break Before or After a Binary Operator?
+
+Lines should break before binary operators
+so that the operators do not get scattered across different columns on the screen.
+In the example below, the eye does not have to do the extra work to tell
+which items are added and which are subtracted:
+
+```python
+# PEP 8 compliant - easy to match operators with operands
+income = (gross_wages
+          + taxable_interest
+          + (dividends - qualified_dividends)
+          - ira_deduction
+          - student_loan_interest)
+```
+
+### Blank Lines
+
+Top-level function and class definitions should be surrounded with two blank lines.
+Method definitions inside a class should be surrounded by a single blank line.
+You can use blank lines in functions, sparingly, to indicate logical sections.
+
+### Whitespace in Expressions and Statements
+
+Avoid extraneous whitespace in the following situations:
+
+- immediately inside parentheses, brackets or braces
+
+  ```python
+  # PEP 8 compliant:
+  my_function(colour[1], {id: 2})
+
+  # Not PEP 8 compliant:
+  my_function( colour[ 1 ], { id: 2 } )
+  ```
+
+- immediately before a comma,
+  semicolon,
+  or colon
+  (unless doing slicing where the colon acts like a binary operator,
+  in which case it should have equal amounts of whitespace on either side)
+
+  ```python
+  # PEP 8 compliant:
+  if x == 4: print(x, y); x, y = y, x
+
+  # Not PEP 8 compliant:
+  if x == 4 : print(x , y); x , y = y, x
+  ```
+
+- immediately before the open parenthesis that starts the argument list of a function call
+
+  ```python
+  # PEP 8 compliant:
+  my_function(1)
+
+  # Not PEP 8 compliant:
+  my_function (1)
+  ```
+
+- immediately before the open bracket that starts an indexing or slicing
+
+  ```python
+  # PEP 8 compliant:
+  my_dct['key'] = my_lst[id]
+  first_char = my_str[:1]
+
+  # Not PEP 8 compliant:
+  my_dct ['key'] = my_lst [id]
+  first_char = my_str [:1]
+  ```
+
+- more than one space around an assignment (or other) operator to align it with another
+
+  ```python
+  # PEP 8 compliant:
+  x = 1
+  y = 2
+  student_loan_interest = 3
+
+  # Not PEP 8 compliant:
+  x                     = 1
+  y                     = 2
+  student_loan_interest = 3
+  ```
+
+- Avoid trailing whitespace anywhere - it is not necessary and can cause errors.
+  For example, if you use backslash (`\`) for continuation lines
+  and have a space after it,
+  the continuation line will not be interpreted correctly.
+
+- Surround these binary operators with a single space on either side:
+  assignment (=),
+  augmented assignment (+=, -= etc.),
+  comparisons (==, \<, >, !=, \<=, >=, in, not in, is, is not),
+  booleans (and, or, not).
+
+- Do not use spaces around the = sign
+  when used to indicate a keyword argument assignment
+  or to indicate a default value for an unannotated function parameter
+
+  ```python
+  # PEP 8 compliant use of spaces around = for variable assignment
+  axis = 'x'
+  angle = 90
+  size = 450
+  name = 'my_graph'
+
+  # PEP 8 compliant use of no spaces around = for keyword argument assignment in a function call
+  my_function(
+      1,
+      2,
+      axis=axis,
+      angle=angle,
+      size=size,
+      name=name)
+  ```
+
+### String Quotes
+
+In Python, single-quoted strings and double-quoted strings are the same.
+PEP 8 does not make a recommendation for this
+apart from picking one rule and consistently sticking to it.
+When a string contains single or double quote characters,
+use the other one to avoid backslashes in the string as it improves readability.
+
+### Naming Conventions
+
+There are a lot of different naming styles in use, including:
+
+- lower\_case\_with\_underscores (or snake\_case)
+- UPPER\_CASE\_WITH\_UNDERSCORES
+- CapitalisedWords (or PascalCase) (note: when using acronyms in CapitalisedWords, capitalise all the letters of the acronym,
+  e.g. HTTPServerError)
+- camelCase (differs from CapitalisedWords/PascalCase by the initial lowercase character)
+- Capitalised\_Words\_With\_Underscores
+
+As with other style guide recommendations - consistency is key.
+Follow the one already established in the project, if there is one.
+If there is not, follow any standard language style (such as
+[PEP 8](https://www.python.org/dev/peps/pep-0008/) for Python).
+Failing that, just pick one, document it and stick to it.
+
+Some things to be wary of when naming things in the code:
+
+- Avoid any names that could cause confusion (e.g. `l` (lowercase L) is
+  hard to distinguish from a `1` (one), `O` (uppercase O) from a `0` (zero),
+  `I` (uppercase I) from `l` (lowercase L)).
+- Avoid using non-ASCII (e.g. Unicode) characters for identifiers as these
+  can trip up software that does not support Unicode.
+- If your audience is international and English is the common language,
+  try to use English words for identifiers and comments whenever possible
+  but try to avoid abbreviations/local slang as they may not be understood by everyone.
+  Also consider sticking with either 'American' or 'British' English spellings
+  and try not to mix the two.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Function, Variable, Class, Module, Package Naming in Python
+
+- Function and variable names should use lower\_case\_with\_underscores.
+- Avoid single character names in almost all instances.
+- Variable names should tell you what they store, and not just the type (e.g. `name_of_patient` is better than `string`).
+- Function names should tell you what the function does.
+- Class names should use the CapitalisedWords convention.
+- Modules should have short, all-lowercase names.
+  Underscores can be used in the module name if it improves readability.
+- Packages should also have short, all-lowercase names,
+  although the use of underscores is discouraged.
+
+A more detailed guide on
+[naming functions, modules, classes and variables](https://www.python.org/dev/peps/pep-0008/#package-and-module-names)
+is available from PEP 8.
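To make these conventions concrete, here is a small sketch using hypothetical names (none of these identifiers come from the project code - they are invented for illustration only):

```python
# Hypothetical examples of the PEP 8 naming conventions listed above
MAX_RETRIES = 3                    # constant: UPPER_CASE_WITH_UNDERSCORES


def mean_inflammation(readings):   # function: lower_case_with_underscores
    """Return the mean of a non-empty sequence of readings."""
    return sum(readings) / len(readings)


class PatientRecord:               # class: CapitalisedWords
    def __init__(self, name_of_patient):
        # variable name says what it stores, not its type
        self.name_of_patient = name_of_patient
```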
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Comments
+
+Comments allow us to provide the reader with additional information on what the code does -
+reading and understanding source code is slow, laborious and can lead to misinterpretation,
+plus it is always a good idea to keep others in mind when writing code.
+A good rule of thumb is to assume that someone will *always* read your code at a later date,
+and this includes a future version of yourself.
+It can be easy to forget why you did something a particular way in six months' time.
+Write comments as complete sentences and in English
+unless you are 100% sure the code will never be read by people who do not speak your language.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## The Good, the Bad, and the Ugly Comments
+
+As a side reading, check out the
+['Putting comments in code: the good, the bad, and the ugly' blogpost](https://medium.com/free-code-camp/code-comments-the-good-the-bad-and-the-ugly-be9cc65fbf83).
+Remember - a comment should answer the "why" question,
+occasionally the "what" question.
+The "how" question should be answered by the code itself.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Block comments generally apply to some (or all) code that follows them,
+and are indented to the same level as that code.
+Each line of a block comment starts with a `#` and a single space
+(unless it is indented text inside the comment).
+
+```python
+def fahr_to_cels(fahr):
+    # Block comment example: convert temperature in Fahrenheit to Celsius
+    cels = (fahr - 32) * (5 / 9)
+    return cels
+```
+
+An inline comment is a comment on the same line as a statement.
+Inline comments should be separated by at least two spaces from the statement.
+They should start with a `#` and a single space and should be used sparingly.
+
+```python
+def fahr_to_cels(fahr):
+    cels = (fahr - 32) * (5 / 9)  # Inline comment example: convert temperature in Fahrenheit to Celsius
+    return cels
+```
+
+Python does not have any multi-line comments,
+like you may have seen in other languages like C++ or Java.
+However, there are ways to do it using *docstrings* as we will see in a moment.
+
+The reader should be able to understand a single function or method
+from its code and its comments,
+and should not have to look elsewhere in the code for clarification.
+The kind of things that need to be commented are:
+
+- Why certain design or implementation decisions were adopted,
+  especially in cases where the decision may seem counter-intuitive
+- The names of any algorithms or design patterns that have been implemented
+- The expected format of input files or database schemas
+
+However, there are some restrictions.
+Comments that simply restate what the code does are redundant,
+and comments must be accurate and updated with the code,
+because an incorrect comment causes more confusion than no comment at all.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Improve Code Style of Our Project
+
+Let us look at improving the coding style of our project.
+First, from the project root, use `git switch` to create a new feature branch called `style-fixes`
+from our `develop` branch.
+(Note that at this point the `develop` and `main` branches
+are pointing at the same commit so it does not really matter which one we are branching off -
+in real collaborative software development environments, you'd likely be expected to branch off `develop`
+as it would contain the latest code developed by your team.)
+
+```bash
+$ git switch develop
+$ git switch -c style-fixes
+```
+
+Next look at the `inflammation-analysis.py` file in PyCharm
+and identify where the above guidelines have not been followed.
+Fix the discovered inconsistencies and commit them to the feature branch.
+
+::::::::::::::: solution
+
+## Solution
+
+Modify `inflammation-analysis.py` from PyCharm,
+which is helpfully marking inconsistencies with coding guidelines by underlining them.
+There are a few things to fix in `inflammation-analysis.py`, for example:
+
+1. Line 30 in `inflammation-analysis.py` is too long and not very readable.
+   A better style would be to use multiple lines and hanging indent,
+   with the closing brace `}` aligned either with
+   the first non-whitespace character of the last line of the list
+   or the first character of the line that starts the multiline construct
+   or simply moved to the end of the previous line.
+   All three acceptable modifications are shown below.
+
+   ```python
+   # Using hanging indent, with the closing '}' aligned with the first non-blank character of the previous line
+   view_data = {
+       'average': models.daily_mean(inflammation_data),
+       'max': models.daily_max(inflammation_data),
+       'min': models.daily_min(inflammation_data)
+       }
+   ```
+
+   ```python
+   # Using hanging indent, with the closing '}' aligned with the start of the multiline construct
+   view_data = {
+       'average': models.daily_mean(inflammation_data),
+       'max': models.daily_max(inflammation_data),
+       'min': models.daily_min(inflammation_data)
+   }
+   ```
+
+   ```python
+   # Using hanging indent where all the lines of the multiline construct are indented except the first one
+   view_data = {
+       'average': models.daily_mean(inflammation_data),
+       'max': models.daily_max(inflammation_data),
+       'min': models.daily_min(inflammation_data)}
+   ```
+
+2. Variable 'InFiles' in `inflammation-analysis.py` uses the CapitalisedWords naming convention
+   which is recommended for class names but not variable names.
+   By convention, variable names should be in lowercase with optional underscores
+   so you should rename the variable 'InFiles' to, e.g., 'infiles' or 'in\_files'.
+
+3. There are two blank lines starting from line 19 in `inflammation-analysis.py`.
+   Normally, you should not use blank lines in the middle of the code
+   unless you want to separate logical units -
+   in which case only one blank line is used.
+   Note how PyCharm is warning us by underlining the whole line below.
+
+4. There is only one blank line between the end of the definition of function `main`
+   and the rest of the code below line 27 in `inflammation-analysis.py` -
+   there should be two blank lines (PEP 8 recommends surrounding top-level function
+   (and class) definitions with two blank lines).
+   Note how PyCharm is warning us by underlining the whole line below.
+
+Finally, let us add and commit our changes to the feature branch.
+We will check the status of our working directory first.
+
+```bash
+$ git status
+```
+
+```output
+On branch style-fixes
+Changes not staged for commit:
+(use "git add <file>..." to update what will be committed)
+(use "git restore <file>..." to discard changes in working directory)
+modified: inflammation-analysis.py
+
+no changes added to commit (use "git add" and/or "git commit -a")
+```
+
+Git tells us we are on branch `style-fixes`
+and that we have unstaged and uncommitted changes to `inflammation-analysis.py`.
+Let us commit them to the local repository.
+
+```bash
+$ git add inflammation-analysis.py
+$ git commit -m "Code style fixes."
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Optional Exercise: Improve Code Style of Your Other Python Projects
+
+If you have another Python project, check to what extent it conforms to PEP 8 coding style.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Documentation Strings aka Docstrings
+
+If the first thing in a function is a string that is not assigned to a variable,
+that string is attached to the function as its documentation.
+Consider the following code implementing a function
+for calculating the nth Fibonacci number:
+
+```python
+def fibonacci(n):
+    """Calculate the nth Fibonacci number.
+
+    A recursive implementation of the Fibonacci sequence.
+
+    :param n: integer
+    :raises ValueError: raised if n is less than zero
+    :returns: Fibonacci number
+    """
+    if n < 0:
+        raise ValueError('Fibonacci is not defined for N < 0')
+    if n == 0:
+        return 0
+    if n == 1:
+        return 1
+
+    return fibonacci(n - 1) + fibonacci(n - 2)
+```
+
+Note here we are explicitly documenting our input variables,
+what is returned by the function,
+and also when the `ValueError` exception is raised.
+Along with a helpful description of what the function does,
+this information can act as a *contract* for readers to understand what to expect in terms of
+behaviour when using the function,
+as well as how to use it.
+
+A special comment string like this is called a **docstring**.
+We do not need to use triple quotes when writing one,
+but if we do, we can break the text across multiple lines.
+Docstrings can also be used at the start of a Python module
+(a file containing a number of Python functions)
+or at the start of a Python class
+(containing a number of methods)
+to list their contents as a reference.
+You should not confuse docstrings with comments though -
+docstrings are context-dependent and should only be used in specific locations
+(e.g. at the top of a module and immediately after `class` and `def` keywords as mentioned).
+Using triple quoted strings in locations where
+they will not be interpreted as docstrings
+or using triple quotes as a way to 'quickly' comment out an entire block of code
+is considered bad practice.
+
+In our example case, we used the
+[Sphinx/ReadTheDocs docstring style](https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html)
+formatting for the `param`, `raises` and `returns` fields - other docstring formats exist as well.
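For comparison, and purely as an illustration (our code sticks with the Sphinx/ReadTheDocs style), the same function could be documented in the [Google docstring style](https://google.github.io/styleguide/pyguide.html) instead - the information content is identical, only the markup for parameters, exceptions and return values differs:

```python
def fibonacci(n):
    """Calculate the nth Fibonacci number.

    A recursive implementation of the Fibonacci sequence.

    Args:
        n (int): The index of the Fibonacci number to calculate.

    Raises:
        ValueError: If n is less than zero.

    Returns:
        int: The nth Fibonacci number.
    """
    if n < 0:
        raise ValueError('Fibonacci is not defined for N < 0')
    if n == 0:
        return 0
    if n == 1:
        return 1

    return fibonacci(n - 1) + fibonacci(n - 2)
```

Whichever format you pick, the important thing is to use it consistently across a project.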
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Python PEP 257 - Recommendations for Docstrings
+
+[PEP 257](https://peps.python.org/pep-0257/)
+is another one of the Python Enhancement Proposals;
+this one deals with docstring conventions to standardise how they are used.
+For example, on the subject of module-level docstrings, PEP 257 says:
+
+```
+The docstring for a module should generally list
+the classes,
+exceptions
+and functions
+(and any other objects)
+that are exported by the module, with a one-line summary of each.
+(These summaries generally give less detail than the summary line in the object's docstring.)
+The docstring for a package
+(i.e., the docstring of the package's `__init__.py` module)
+should also list the modules and subpackages exported by the package.
+```
+
+Note that an `__init__.py` file used to be a required part of a package
+(pre Python 3.3)
+where a package was typically implemented as a directory containing
+an `__init__.py` file which got implicitly executed when a package was imported.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+So, at the beginning of a module file we can just add
+a docstring explaining the nature of the module.
+For example, if `fibonacci()` was included in a module with other functions,
+our module could have at the start of it:
+
+```python
+"""A module for generating numerical sequences of numbers that occur in nature.
+
+Functions:
+  fibonacci - returns the Fibonacci number for a given integer
+  golden_ratio - returns the golden ratio number to a given Fibonacci iteration
+  ...
+"""
+...
+```
+
+The docstring for a function or a module
+is returned when calling the `help` function and passing its name -
+for example from the interactive Python console/terminal available from the command line
+or when rendering code documentation online
+(e.g. see [Python documentation](https://docs.python.org/3.11/library/index.html)).
+PyCharm also displays the docstring for a function/module
+in a little help popup window when using tab-completion.
+
+```python
+help(fibonacci)
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Fix the Docstrings
+
+Look into `models.py` in PyCharm and improve docstrings for functions
+`daily_mean`,
+`daily_min`,
+`daily_max`.
+Commit those changes to feature branch `style-fixes`.
+
+::::::::::::::: solution
+
+## Solution
+
+For example,
+the improved docstrings for the above functions would contain explanations
+for parameters and return values.
+
+```python
+def daily_mean(data):
+    """Calculate the daily mean of a 2D inflammation data array for each day.
+
+    :param data: A 2D data array with inflammation data (each row contains measurements for a single patient across all days).
+    :returns: An array of mean values of measurements for each day.
+    """
+    return np.mean(data, axis=0)
+```
+
+```python
+def daily_max(data):
+    """Calculate the daily maximum of a 2D inflammation data array for each day.
+
+    :param data: A 2D data array with inflammation data (each row contains measurements for a single patient across all days).
+    :returns: An array of max values of measurements for each day.
+    """
+    return np.max(data, axis=0)
+```
+
+```python
+def daily_min(data):
+    """Calculate the daily minimum of a 2D inflammation data array for each day.
+
+    :param data: A 2D data array with inflammation data (each row contains measurements for a single patient across all days).
+    :returns: An array of minimum values of measurements for each day.
+    """
+    return np.min(data, axis=0)
+```
+
+Once we are happy with the modifications,
+we check the status of our working directory as usual,
+before staging and committing our changes:
+
+```bash
+$ git status
+```
+
+```output
+On branch style-fixes
+Changes not staged for commit:
+(use "git add <file>..." to update what will be committed)
+(use "git restore <file>..."
to discard changes in working directory)
+modified: inflammation/models.py
+
+no changes added to commit (use "git add" and/or "git commit -a")
+```
+
+As expected, Git tells us we are on branch `style-fixes`
+and that we have unstaged and uncommitted changes to `inflammation/models.py`.
+Let us commit them to the local repository.
+
+```bash
+$ git add inflammation/models.py
+$ git commit -m "Docstring improvements."
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+In the previous exercises, we made some code improvements on feature branch `style-fixes`.
+We have committed our changes locally but
+have not pushed this branch remotely for others to have a look at our code
+before we merge it onto the `develop` branch.
+Let us do that now, namely:
+
+- push `style-fixes` to GitHub
+- merge `style-fixes` into `develop` (once we are happy with the changes)
+- push updates to `develop` branch to GitHub (to keep it up to date with the latest developments)
+- finally, merge `develop` branch into the stable `main` branch
+
+Here is a set of commands that will achieve the above actions
+(remember to use `git status` often in between other Git commands
+to double check which branch you are on and its status):
+
+```bash
+$ git push -u origin style-fixes
+$ git switch develop
+$ git merge style-fixes
+$ git push origin develop
+$ git switch main
+$ git merge develop
+$ git push origin main
+```
+
+::::::::::::::::::::::::::::::::::::: testimonial
+
+## Typical Code Development Cycle
+
+What you have done in the exercises in this episode mimics a typical software development workflow -
+you work locally on code on a feature branch,
+test it to make sure it works correctly and as expected,
+then record your changes using version control
+and share your work with others via a centrally backed-up repository.
+Other team members work on their feature branches in parallel
+and similarly share their work with colleagues for discussions.
+Different feature branches from around the team get merged onto the development branch, +often in small and quick development cycles. +After further testing and verifying that no code has been broken by the new features - +the development branch gets merged onto the stable main branch, +where new features finally resurface to end-users in bigger "software release" cycles. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Always assume that someone else will read your code at a later date, including yourself. +- Community coding conventions help you create more readable software projects that are easier to contribute to. +- Python Enhancement Proposals (or PEPs) describe a recommended convention or specification for how to do something in Python. +- Style checking to ensure code conforms to coding conventions is often part of IDEs. +- Consistency with the style guide is important - whichever style you choose. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/16-verifying-code-style-linters.md b/16-verifying-code-style-linters.md new file mode 100644 index 000000000..5a9ad6f8d --- /dev/null +++ b/16-verifying-code-style-linters.md @@ -0,0 +1,221 @@ +--- +title: 1.6 Verifying Code Style Using Linters +teaching: 15 +exercises: 5 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Use code linting tools to verify a program's adherence to a Python coding style convention. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What tools can help with maintaining a consistent code style? +- How can we automate code style checking? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Verifying Code Style Using Linters + +We have seen how we can use PyCharm to help us format our Python code in a consistent style. 
+This aids reusability,
+since consistent-looking code is easier to read and understand,
+and is therefore easier to modify.
+We can also use tools,
+called [**code linters**](https://en.wikipedia.org/wiki/Lint_%28software%29),
+to identify consistency issues and report them to us.
+Linters analyse source code to identify and report on stylistic and even programming errors.
+Let us look at a widely used one of these called `pylint`.
+
+First, let us ensure we are on the `style-fixes` branch once again.
+
+```bash
+$ git switch style-fixes
+```
+
+Pylint is just a Python package so we can install it in our virtual environment using:
+
+```bash
+$ python3 -m pip install pylint
+```
+
+We should also update our `requirements.txt` with this new addition:
+
+```bash
+$ python3 -m pip freeze > requirements.txt
+```
+
+Pylint is a command-line tool that can help improve our code in many ways:
+
+- **Check PEP 8 compliance:**
+  whilst in-IDE context-sensitive highlighting such as that provided via PyCharm
+  helps us stay consistent with PEP 8 as we write code, this tool provides a full report
+- **Perform basic error detection:** Pylint can look for certain Python type errors
+- **Check variable naming conventions**:
+  Pylint often goes beyond PEP 8 to include other common conventions,
+  such as naming variables outside of functions in upper case
+- **Customisation**:
+  you can specify which errors and conventions you wish to check for, and those you wish to ignore
+
+Pylint can also identify **code smells**.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## How Does Code Smell?
+
+There are many ways that code can exhibit bad design
+whilst not breaking any rules and working correctly.
+A *code smell* is a characteristic that indicates
+that there is an underlying problem with source code, e.g.
+large classes or methods,
+methods with too many parameters,
+duplicated statements in both if and else blocks of conditionals, etc.
+They aren't functional errors in the code,
+but rather are certain structures that violate principles of good design
+and impact design quality.
+They can also indicate that code is in need of maintenance and refactoring.
+
+The phrase has its origins in Chapter 3 "Bad smells in code"
+by Kent Beck and Martin Fowler in
+[Fowler, Martin (1999). Refactoring. Improving the Design of Existing Code. Addison-Wesley. ISBN 0-201-48567-2](https://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672/).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Pylint recommendations are given as warnings or errors,
+and Pylint also scores the code with an overall mark.
+We can look at a specific file (e.g. `inflammation-analysis.py`),
+or a package (e.g. `inflammation`).
+Let us look at our `inflammation` package and code inside it (namely `models.py` and `views.py`).
+From the project root do:
+
+```bash
+$ pylint inflammation
+```
+
+You should see an output similar to the following:
+
+```output
+************* Module inflammation.models
+inflammation/models.py:13:23: C0303: Trailing whitespace (trailing-whitespace)
+inflammation/models.py:34:0: C0305: Trailing newlines (trailing-newlines)
+************* Module inflammation.views
+inflammation/views.py:4:0: W0611: Unused numpy imported as np (unused-import)
+
+------------------------------------------------------------------
+Your code has been rated at 8.50/10 (previous run: 8.50/10, +0.00)
+```
+
+Your own outputs of the above commands may vary depending on
+how you have implemented and fixed the code in previous exercises
+and the coding style you have used.
+
+The five-character codes, such as `C0303` (a letter followed by four digits),
+are unique identifiers for warnings,
+with the first character indicating the type of warning.
+There are five different types of warnings that Pylint looks for,
+and you can get a summary of them by doing:
+
+```bash
+$ pylint --long-help
+```
+
+Near the end you'll see:
+
+```output
+  Output:
+    Using the default text output, the message format is :
+    MESSAGE_TYPE: LINE_NUM:[OBJECT:] MESSAGE
+    There are 5 kind of message types :
+    * (C) convention, for programming standard violation
+    * (R) refactor, for bad code smell
+    * (W) warning, for python specific problems
+    * (E) error, for probable bugs in the code
+    * (F) fatal, if an error occurred which prevented pylint from doing
+    further processing.
+```
+
+So for an example of a Pylint Python-specific `warning`,
+see the "W0611: Unused numpy imported as np (unused-import)" warning.
+
+It is important to note that while tools such as Pylint are great at giving you
+a starting point to consider how to improve your code,
+they will not find everything that may be wrong with it.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## How Does Pylint Calculate the Score?
+
+The Python formula used is
+(with the variables representing numbers of each type of infraction
+and `statement` indicating the total number of statements):
+
+```python
+10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
+```
+
+For example, if `models.py` and `views.py` contained 20 statements between them
+and Pylint reported the three infractions shown above
+(two convention issues and one warning),
+the formula would give `10.0 - ((5*0 + 1 + 0 + 2) / 20) * 10 = 8.50` -
+the score reported in the output.
+Note whilst there is a maximum score of 10, given the formula,
+there is no minimum score - it is quite possible to get a negative score!
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Further Improve Code Style of Our Project
+
+Select and fix a few of the issues with our code that Pylint detected.
+Make sure you do not break the rest of the code in the process and that the code still runs.
+After making any changes, run Pylint again to verify you have resolved these issues.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Make sure you commit and push `requirements.txt`,
+and any files with further code style improvements,
+to the `style-fixes` branch and then
+merge all these changes into your development branch.
+
+For the time being, we will not merge
+the development branch onto `main` until we finish testing our code a bit further and automating
+those tests with GitHub's Continuous Integration service called GitHub Actions
+(to be covered in the next section).
+Note that it is also possible to automate these kinds of code style checks
+with GitHub Actions - we will come back to automated linting in the episode on
+["Diagnosing Issues and Improving Robustness"](24-diagnosing-issues-improving-robustness.md).
+
+```bash
+$ git add requirements.txt
+$ git commit -m "Added Pylint library"
+$ git push origin style-fixes
+$ git switch develop
+$ git merge style-fixes
+$ git push origin develop
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Optional Exercise: Improve Code Style of Your Other Python Projects
+
+If you have a Python project you are working on or you worked on in the past,
+run it past Pylint to see what issues with your code are detected, if any.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::: challenge
+
+## Optional Exercise: More on Pylint
+
+Check out [this optional exercise](17-section1-optional-exercises.md)
+to learn more about `pylint`.
+
+:::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use linting tools on the command line (or via continuous integration) to automatically check your code style.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/17-section1-optional-exercises.md b/17-section1-optional-exercises.md
new file mode 100644
index 000000000..f9b169118
--- /dev/null
+++ b/17-section1-optional-exercises.md
@@ -0,0 +1,124 @@
+---
+title: 1.7 Optional Exercises for Section 1
+start: no
+teaching: 0
+exercises: 45
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Explore different options for your coding environment.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I further fine-tune my coding environment?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+This episode holds some optional exercises for section 1.
+The exercises are exploratory in nature, so feel free to go off in any direction that interests you.
+You will be looking at some tools that either complement or are alternatives to those already introduced.
+Even if you find something that you really like,
+we still recommend sticking with the tools that were introduced prior to this episode when you move on to other sections of the course.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Apply to your own project(s)
+
+Apply what you learned in this section to your own project(s).
+This is the time to ask your instructors or helpers any questions.
+Everyone has different preferences for tooling, so getting the input of experienced developers is a great opportunity.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Try out different Integrated Development Environments
+
+Install different Integrated Development Environments (IDEs) and test them out.
+Which one do you like the most and why?
+
+You can try:
+
+- [Visual Studio Code](https://code.visualstudio.com/), with setup instructions [in the Extras of this course](../learners/vscode.md)
+- [Atom](https://atom-editor.cc/)
+- [Sublime Text](https://www.sublimetext.com/)
+- [RStudio](https://posit.co/download/rstudio-desktop/)
+
+Technically, compared to PyCharm, the 'IDEs' listed above are source code editors capable of functioning as an IDE
+(with the exception of RStudio).
+To function as an IDE, you have to manually install plugins for more powerful features
+such as support for a specific programming language or unit testing.
+What do you prefer, a lot of tooling out of the box or a lightweight editor with optional extensions?
+
+If you want an even more lightweight setup you can try out these configurable source code editors:
+
+- [Emacs](https://www.gnu.org/software/emacs/)
+- [Vim](https://www.vim.org/)
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Customize the command line
+
+You can customize the command line or use alternatives to `bash` to make yourself more productive.
+
+- Try out [Bash Prompt Generator](https://bash-prompt-generator.org/), it lets you try out different prompts,
+  depending on the information you want displayed.
+- Try out [z, a simple tool to more quickly move around directories](https://github.com/rupa/z).
+- Try out [Z shell (zsh)](https://zsh.sourceforge.io/), a shell designed for interactive use.
+- Try out [Oh My ZSH](https://ohmyz.sh/), which is a theming and package manager for the `zsh` shell.
+- Try out [fish](https://fishshell.com/), a smart and user-friendly command line shell.
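For a quick taste of prompt customisation, you can preview a different Bash prompt in your current session before committing anything to `~/.bashrc` (the prompt string below is just one arbitrary choice):

```shell
# Set a prompt showing user, short hostname and current working directory.
# This affects only the current shell session; add the line to ~/.bashrc to keep it.
PS1='\u@\h:\w\$ '
printf '%s\n' "$PS1"   # preview the escape sequence we just set
```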
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Try out different virtual environment managers
+
+So far we used `venv`, but there are other virtual environment managers for Python:
+
+- [Poetry](https://python-poetry.org/), which we will explore using in
+  [Section 4](43-software-release.md).
+- conda, which is part of [Anaconda Distribution](https://www.anaconda.com/download).
+
+Anaconda is widely used in academia, but the current license does not allow use for research in most circumstances.
+An open-source alternative is [Miniforge](https://github.com/conda-forge/miniforge).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Customize `pylint`
+
+You decide to change the max line length of your project to 100 instead of the default 80.
+Find out how you can configure pylint. You can first try to use the pylint command line interface,
+but also play with adding a configuration file that pylint reads in.
+
+::::::::::::::: solution
+
+## Solution
+
+### By passing an argument to `pylint` in the command line
+
+Specify the max line length as an argument: `pylint --max-line-length=100`
+
+### Using a configuration file
+
+You can create a file `.pylintrc` in the root of your project folder to overwrite pylint settings:
+
+```
+[FORMAT]
+max-line-length=100
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
diff --git a/20-section2-intro.md b/20-section2-intro.md
new file mode 100644
index 000000000..58a8af6ec
--- /dev/null
+++ b/20-section2-intro.md
@@ -0,0 +1,82 @@
+---
+title: 'Section 2: Ensuring Correctness of Software at Scale'
+teaching: 5
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Introduce the testing tools, techniques, and infrastructure that will be used in this section.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What should we do to ensure our code is correct? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We have just set up a suitable environment for the development of our software project +and are ready to start coding. +However, we want to make sure that the new code we contribute to the project +is actually correct and is not breaking any of the existing code. +So, in this section, +we will look at testing approaches that can help us ensure +that the software we write is behaving as intended, +and how we can diagnose and fix issues once faults are found. +Using such approaches requires us to change our practice of development. +This can take time, but potentially saves us considerable time +in the medium to long term +by allowing us to more comprehensively and rapidly find such faults, +as well as giving us greater confidence in the correctness of our code - +so we should try and employ such practices early on. +We will also make use of techniques and infrastructure that allow us to do this +in a scalable, automated and more performant way as our codebase grows. + +![Section 2 Overview](fig/section2-overview.svg){alt='Tools for scaled software testing'} + + + +In this section we will: + +- Make use of a **test framework** called Pytest, + a free and open source Python library to help us structure and run automated tests. +- Design, write and run **unit tests** using Pytest + to verify the correct behaviour of code and identify faults, + making use of test **parameterisation** + to increase the number of different test cases we can run. +- Automatically run a set of unit tests using **GitHub Actions** - + a **Continuous Integration** infrastructure that allows us to + automate tasks when things happen to our code, + such as running those tests when a new commit is made to a code repository. 
+- Use PyCharm's integrated **debugger** to + help us locate a fault in our code while it is running, and then fix it. + + + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Using testing requires us to change our practice of code development, but saves time in the long run by allowing us to more comprehensively and rapidly find faults in code, as well as giving us greater confidence in the correctness of our code. +- The use of test techniques and infrastructures such as **parameterisation** and **Continuous Integration** can help scale and further automate our testing process. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/21-automatically-testing-software.md b/21-automatically-testing-software.md new file mode 100644 index 000000000..8e607804c --- /dev/null +++ b/21-automatically-testing-software.md @@ -0,0 +1,647 @@ +--- +title: 2.1 Automatically Testing Software +teaching: 30 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Explain the reasons why testing is important +- Describe the three main types of tests and what each are used for +- Implement and run unit tests to verify the correct behaviour of program functions + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Does the code we develop work the way it should do? +- Can we (and others) verify these assertions for themselves? +- To what extent are we confident of the accuracy of results that are generated by code and appear in publications? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +Being able to demonstrate that a process generates the right results +is important in any field of research, +whether it is software generating those results or not. +So when writing software we need to ask ourselves some key questions: + +- Does the code we develop work the way it should do? +- Can we (and others) verify these assertions for themselves? 
+- Perhaps most importantly, to what extent are we confident of + the accuracy of results that software produces? + +If we are unable to demonstrate that our software fulfills these criteria, +why would anyone use it? +Having well-defined tests for our software is useful for this, +but manually testing software can prove an expensive process. + +Automation can help, and automation where possible is a good thing - +it enables us to define a potentially complex process in a repeatable way +that is far less prone to error than manual approaches. +Once defined, automation can also save us a lot of effort, particularly in the long run. +In this episode we will look into techniques of automated testing to +improve the predictability of a software change, +make development more productive, +and help us produce code that works as expected and produces desired results. + +## What Is Software Testing? + +For the sake of argument, if each line we write has a 99% chance of being right, +then a 70-line program will be wrong more than half the time. +We need to do better than that, +which means we need to test our software to catch these mistakes. + +We can and should extensively test our software manually, +and manual testing is well-suited to testing aspects such as +graphical user interfaces and reconciling visual outputs against inputs. +However, even with a good test plan, +manual testing is very time consuming and prone to error. +Another style of testing is automated testing, +where we write code that tests the functions of our software. +Since computers are very good and efficient at automating repetitive tasks, +we should take advantage of this wherever possible. + +There are three main types of automated tests: + +- **Unit tests** are tests for fairly small and specific units of functionality, + e.g. determining that a particular function returns output as expected given specific inputs. 
+- **Functional or integration tests** work at a higher level,
+  and test functional paths through your code,
+  e.g. given some specific inputs,
+  a set of interconnected functions across a number of modules
+  (or the entire code) produce the expected result.
+  These are particularly useful for exposing faults in how functional units interact.
+- **Regression tests** make sure that your program's output hasn't changed,
+  for example after making changes to your code to add new functionality or fix a bug.
+
+For the purposes of this course, we will focus on unit tests.
+But the principles and practices we will talk about can be built on
+and applied to the other types of tests too.
+
+## Set Up a New Feature Branch for Writing Tests
+
+We are going to look at how to run some existing tests and also write some new ones,
+so let us ensure we are initially on our `develop` branch.
+We will create a new feature branch called `test-suite` off the `develop` branch
+('test suite' being a common term for a set of tests) that we will use for our test writing work:
+
+```bash
+$ git switch develop
+$ git switch -c test-suite
+```
+
+Good practice is to write our tests around the same time we write our code on a feature branch.
+But since the code already exists, we are creating a feature branch for just these extra tests.
+Git branches are designed to be lightweight, and where necessary, transient,
+and use of branches for even small bits of work is encouraged.
+
+Later on, once we have finished writing these tests and are convinced they work properly,
+we will merge our `test-suite` branch back into `develop`.
+Once bigger code changes are completed, merged to `develop` from various feature branches
+and tested *together* with existing code -
+which of course may also have been changed by other developers in the meantime -
+we will merge all of the work into `main`.
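Before we move on to the project code, it is worth seeing the shape of a unit test in its simplest form. The sketch below uses a made-up `add()` function (not part of our inflammation project) checked with plain `assert` statements - a test framework such as Pytest builds directly on this idea:

```python
# A minimal, hand-rolled unit test - a sketch only, using a made-up add()
# function that is not part of our inflammation project.
def add(a, b):
    return a + b

def test_add():
    # Each assert states an expectation; a failing assert raises AssertionError
    assert add(1, 2) == 3
    assert add(-1, 1) == 0

test_add()  # silent if all expectations hold; raises AssertionError otherwise
print('All tests passed')
```

A test framework's job is to find functions like `test_add()`, run them for us, and report which ones pass and fail.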
+
+## Inflammation Data Analysis
+
+Let us go back to our [patient inflammation software project](11-software-project.md).
+Recall that it is based on a clinical trial of inflammation
+in patients who have been given a new treatment for arthritis.
+There are a number of datasets in the `data` directory
+recording inflammation information in patients
+(each file representing a different trial),
+each stored in comma-separated values (CSV) format:
+each row holds information for a single patient,
+and the columns represent successive days when inflammation was measured in patients.
+
+Let us take a quick look at the data now from within the Python command line console.
+Change directory to the repository root
+(which should be in your home directory `~/python-intermediate-inflammation`),
+ensure you have your virtual environment activated in your command line terminal
+(particularly if opening a new one),
+and then start the Python console by invoking the Python interpreter without any parameters, e.g.:
+
+```bash
+$ cd ~/python-intermediate-inflammation
+$ source venv/bin/activate
+$ python3
+```
+
+The last command will start the Python console within your shell,
+which enables us to execute Python commands interactively.
+Inside the console enter the following:
+
+```python
+import numpy as np
+data = np.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
+data.shape
+```
+
+```output
+(60, 40)
+```
+
+The data in this case is two-dimensional -
+it has 60 rows (one for each patient)
+and 40 columns (one for each day).
+Each cell in the data represents an inflammation reading on a given day for a patient.
+
+Our patient inflammation application has a number of statistical functions
+held in `inflammation/models.py`: `daily_mean()`, `daily_max()` and `daily_min()`,
+for calculating the mean, the maximum, and the minimum values
+for a given number of rows in our data.
+For example, the `daily_mean()` function looks like this:
+
+```python
+def daily_mean(data):
+    """Calculate the daily mean of a 2D inflammation data array for each day.
+
+    :param data: A 2D data array with inflammation data (each row contains measurements for a single patient across all days).
+    :returns: An array of mean values of measurements for each day.
+    """
+    return np.mean(data, axis=0)
+```
+
+Here, we use NumPy's `np.mean()` function to calculate the mean *vertically* across the data
+(denoted by `axis=0`),
+which is then returned from the function.
+So, if `data` was a NumPy array of three rows like...
+
+```python
+[[1, 2],
+ [3, 4],
+ [5, 6]]
+```
+
+...the function would return a 1D NumPy array of `[3, 4]` -
+each value representing the mean of each column
+(which are, coincidentally, the same values as the second row in the above data array).
+
+To show this working with our patient data,
+we can use the function like this,
+passing the first four patient rows to the function in the Python console:
+
+```python
+from inflammation.models import daily_mean
+
+daily_mean(data[0:4])
+```
+
+Note we use a different form of `import` here -
+only importing the `daily_mean` function from our `models` module instead of everything.
+This also has the effect that we can refer to the function using only its name,
+without needing to include the module name too
+(i.e. `inflammation.models.daily_mean()`).
+
+The above code will return the mean inflammation for each day column
+across the first four patients
+(as a 1D NumPy array of shape `(40,)`):
+
+```output
+array([ 0.  ,  0.5 ,  1.5 ,  1.75,  2.5 ,  1.75,  3.75,  3.  ,  5.25,
+        6.25,  7.  ,  7.  ,  7.  ,  8.  ,  5.75,  7.75,  8.5 , 11.  ,
+        9.75, 10.25, 15.  ,  8.75,  9.75, 10.  ,  8.  , 10.25,  8.  ,
+        5.5 ,  8.  ,  6.  ,  5.  ,  4.75,  4.75,  4.  ,  3.25,  4.  ,
+        1.75,  2.25,  0.75,  0.75])
+```
+
+The other statistical functions are similar.
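It is worth making the role of `axis` concrete before moving on. This short standalone snippet (assuming only that NumPy is installed) compares the column-wise mean used by `daily_mean()` with the row-wise alternative:

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])

# axis=0 collapses the rows: one mean per column (i.e. per day)
print(np.mean(data, axis=0))  # [3. 4.]

# axis=1 collapses the columns: one mean per row (i.e. per patient)
print(np.mean(data, axis=1))  # [1.5 3.5 5.5]
```

In our data's terms, `axis=0` gives one value per day (averaged over patients), whereas `axis=1` would give one value per patient (averaged over days).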
+Note that in real situations
+functions we write are often likely to be more complicated than these,
+but simplicity here allows us to reason about what's happening -
+and what we need to test -
+more easily.
+
+Let us now look into how we can test each of our application's statistical functions
+to ensure they are functioning correctly.
+
+## Writing Tests to Verify Correct Behaviour
+
+### One Way to Do It
+
+One way to test our functions would be to write a series of checks or tests,
+each executing a function we want to test with known inputs against known valid results,
+throwing an error if we encounter a result that is incorrect.
+So, referring back to our simple `daily_mean()` example above,
+we could use `[[1, 2], [3, 4], [5, 6]]` as an input to that function
+and check whether the result equals `[3, 4]`:
+
+```python
+import numpy.testing as npt
+
+test_input = np.array([[1, 2], [3, 4], [5, 6]])
+test_result = np.array([3, 4])
+npt.assert_array_equal(daily_mean(test_input), test_result)
+```
+
+So we use the `assert_array_equal()` function -
+part of NumPy's testing library -
+to test that our calculated result is the same as our expected result.
+This function explicitly checks that the array's shape and elements are the same,
+and throws an `AssertionError` if they are not.
+In particular, note that we cannot just use `==` or other Python equality methods,
+since these will not work properly with NumPy arrays in all cases.
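To see why plain `==` falls short, here is a small standalone illustration (assuming only NumPy):

```python
import numpy as np

a = np.array([3, 4])
b = np.array([3, 4])

# '==' on NumPy arrays compares element-wise, returning an array of booleans
# rather than a single True/False value
print(a == b)  # [ True  True]

# A bare 'assert a == b' would therefore raise:
#   ValueError: The truth value of an array with more than one element is ambiguous
# We could write (a == b).all() ourselves, but we would lose the shape checks
# and the helpful failure report that assert_array_equal() provides
assert (a == b).all()
```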
+
+We could then add to this with other tests that use and test against other values,
+and end up with something like:
+
+```python
+test_input = np.array([[2, 0], [4, 0]])
+test_result = np.array([2, 0])
+npt.assert_array_equal(daily_mean(test_input), test_result)
+
+test_input = np.array([[0, 0], [0, 0], [0, 0]])
+test_result = np.array([0, 0])
+npt.assert_array_equal(daily_mean(test_input), test_result)
+
+test_input = np.array([[1, 2], [3, 4], [5, 6]])
+test_result = np.array([3, 4])
+npt.assert_array_equal(daily_mean(test_input), test_result)
+```
+
+However, if we were to enter these in this order, we would get the following after the first test:
+
+```output
+...
+AssertionError:
+Arrays are not equal
+
+Mismatched elements: 1 / 2 (50%)
+Max absolute difference: 1.
+Max relative difference: 0.5
+ x: array([3., 0.])
+ y: array([2, 0])
+```
+
+This tells us that one element between our generated and expected arrays does not match,
+and shows us the different arrays.
+
+We could put these tests in a separate script to automate the running of these tests.
+But a Python script halts at the first failed assertion,
+so the second and third tests are not run at all.
+It would be more helpful if we could get data from all of our tests every time they are run,
+since the more information we have,
+the faster we are likely to be able to track down bugs.
+It would also be helpful to have some kind of summary report:
+if our set of tests - known as a **test suite** - includes thirty or forty tests
+(as it well might for a complex function or library that's widely used),
+we would like to know how many passed or failed.
+
+Going back to our failed first test, what was the issue?
+As it turns out, the test itself was incorrect, and should have read:
+
+```python
+test_input = np.array([[2, 0], [4, 0]])
+test_result = np.array([3, 0])
+npt.assert_array_equal(daily_mean(test_input), test_result)
+```
+
+This highlights an important point:
+as well as making sure our code is returning correct answers,
+we also need to ensure the tests themselves are correct.
+Otherwise, we may go on to fix our code only to return
+an incorrect result that *appears* to be correct.
+So a good rule is to keep tests simple enough to understand,
+so we can reason about the correctness of our tests as well as our code.
+Otherwise, our tests hold little value.
+
+### Using a Testing Framework
+
+Keeping these things in mind,
+here's a different approach that builds on the ideas we have seen so far
+but uses a **unit testing framework**.
+In such a framework we define the tests we want to run as functions,
+and the framework automatically runs each of these functions in turn,
+summarising the outputs.
+And unlike our previous approach,
+it will run every test regardless of any encountered test failures.
+
+Most people do not enjoy writing tests,
+so if we want them to actually do it,
+it must be easy to:
+
+- Add or change tests,
+- Understand the tests that have already been written,
+- Run those tests, and
+- Understand those tests' results
+
+Test results must also be reliable.
+If a testing tool says that code is working when it is not,
+or reports problems when there actually aren't any,
+people will lose faith in it and stop using it.
+
+Look at `tests/test_models.py`:
+
+```python
+"""Tests for statistics functions within the Model layer."""
+
+import numpy as np
+import numpy.testing as npt
+
+from inflammation.models import daily_mean
+
+def test_daily_mean_zeros():
+    """Test that mean function works for an array of zeros."""
+
+    test_input = np.array([[0, 0],
+                           [0, 0],
+                           [0, 0]])
+    test_result = np.array([0, 0])
+
+    # Need to use NumPy testing functions to compare arrays
+    npt.assert_array_equal(daily_mean(test_input), test_result)
+
+
+def test_daily_mean_integers():
+    """Test that mean function works for an array of positive integers."""
+
+    test_input = np.array([[1, 2],
+                           [3, 4],
+                           [5, 6]])
+    test_result = np.array([3, 4])
+
+    # Need to use NumPy testing functions to compare arrays
+    npt.assert_array_equal(daily_mean(test_input), test_result)
+...
+```
+
+Here, although we have specified two of our previous manual tests as separate functions,
+they run the same assertions.
+Each of these test functions is, in a general sense, a **test case** -
+a specification of:
+
+- Inputs, e.g. the `test_input` NumPy array
+- Execution conditions -
+  what we need to do to set up the testing environment to run our test,
+  e.g. importing the `daily_mean()` function so we can use it.
+  Note that, for clarity of the testing environment,
+  we only import the functions we need to run the tests
+- Testing procedure, e.g. running `daily_mean()` with our `test_input` array
+  and using `assert_array_equal()` to test its validity
+- Expected outputs, e.g. our `test_result` NumPy array that we test against
+
+Each test case is also defined so that it can be run independently,
+requiring no manual intervention.
+
+Going back to our list of requirements, how easy is it to run these tests?
+We can do this using a Python package called `pytest`.
+Pytest is a testing framework that allows you to write test cases using Python.
+You can use it to test things like Python functions,
+database operations,
+or even service APIs -
+essentially anything that has inputs and expected outputs.
+We will be using Pytest to write unit tests,
+but what you learn can scale to more complex functional testing for applications or libraries.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What About Unit Testing Frameworks in Python and Other Languages?
+
+Other unit testing frameworks exist for Python,
+including Nose2 and Unittest, with Unittest supplied as part of the standard Python library.
+It is also worth noting that Pytest supports tests written for Unittest,
+a useful feature if you wish to prioritise use of the standard library initially,
+but retain the option to move to Pytest in the future.
+
+The unit testing approach can be translated to (and is supported within) other languages as well,
+e.g. pFUnit for Fortran,
+JUnit for Java (the original unit testing framework),
+Catch or gtest for C++, etc.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Why Use pytest over unittest?
+
+We could alternatively use another Python unit test framework, [unittest](https://docs.python.org/3/library/unittest.html),
+which has the advantage of being installed by default as part of Python. 
This is certainly a solid and established
+option; however, [pytest has many distinct advantages](https://realpython.com/pytest-python-testing/#what-makes-pytest-so-useful),
+particularly for learning, including:
+
+- unittest requires additional knowledge of object-oriented frameworks (covered later in the course)
+  to write unit tests, whereas pytest tests are written as simpler functions, so it is easier to learn
+- Being written using simpler functions, pytest's scripts are more concise and contain less boilerplate, and thus are
+  easier to read
+- pytest output, particularly in regard to test failure output, is generally considered more helpful and readable
+- pytest has a vast ecosystem of plugins available if ever you need additional testing functionality
+- unittest-style unit tests can be run from pytest out of the box!
+
+A common challenge, particularly at the intermediate level, is the selection of a suitable tool from many alternatives
+for a given task. Once you have become accustomed to object-oriented programming you may find unittest a better fit
+for a particular project or team, so you may want to revisit it at a later date.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Installing Pytest
+
+If you have already installed the `pytest` package in your virtual environment,
+you can skip this step.
+Otherwise, as we have seen, we have a couple of options for installing external libraries:
+
+1. via PyCharm
+   (see ["Adding an External Library"](13-ides.md) section
+   in ["Integrated Software Development Environments"](13-ides.md) episode),
+   or
+2. via the command line
+   (see ["Installing External Libraries in an Environment With `pip`"](12-virtual-environments.md) section
+   in ["Virtual Environments For Software Development"](12-virtual-environments.md) episode).
+
+To do it via the command line -
+exit the Python console first (either with `Ctrl-D` or by typing `exit()`),
+then do:
+
+```bash
+$ python3 -m pip install pytest
+```
+
+Whether we do this via PyCharm or the command line,
+the results are exactly the same:
+our virtual environment will now have the `pytest` package installed for use.
+
+### Running Tests
+
+Now we can run these tests using `pytest`:
+
+```bash
+$ python3 -m pytest tests/test_models.py
+```
+
+Here, we use the `-m` flag of the `python3` command to invoke the `pytest` module,
+and specify the `tests/test_models.py` file to run the tests in that file explicitly.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Why Run Pytest Using `python3 -m pytest` and Not `pytest`?
+
+`pytest` is another Python module that can be run via its own command, but this is a good example
+of why invoking Python modules via `python3 -m` may be better (recall the [explanation of Python interpreter's `-m` flag](12-virtual-environments.md)).
+Had we used the `pytest tests/test_models.py` command directly,
+this would have led to a "ModuleNotFoundError: No module named 'inflammation'" error. This is
+because the `pytest` command (unlike `python3 -m pytest`) does not add the current directory to its list of
+directories to search for modules, so the `inflammation` package is not
+'seen' by `pytest`, causing the `ModuleNotFoundError`. There are ways to work around this problem,
+but `python3 -m pytest` ensures it does not happen in the first place.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+```output
+============================================== test session starts =================================
+platform darwin -- Python 3.11.4, pytest-7.4.3, pluggy-1.3.0
+rootdir: /Users/alex/work/SSI/training/lessons/python-intermediate-inflammation
+plugins: anyio-4.0.0
+collected 2 items
+
+tests/test_models.py ..
[100%] + +=============================================== 2 passed in 0.79s ================================== +``` + +Pytest looks for functions whose names also start with the letters 'test\_' and runs each one. +Notice the `..` after our test script: + +- If the function completes without an assertion being triggered, + we count the test as a success (indicated as `.`). +- If an assertion fails, or we encounter an error, + we count the test as a failure (indicated as `F`). + The error is included in the output so we can see what went wrong. + +So if we have many tests, we essentially get a report indicating which tests succeeded or failed. +Going back to our list of requirements (the bullet points under [Using a Testing +Framework](#using-a-testing-framework) section), +do we think these results are easy to understand? + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Write Some Unit Tests + +We already have a couple of test cases in `test/test_models.py` +that test the `daily_mean()` function. +Looking at `inflammation/models.py`, +write at least two new test cases that test the `daily_max()` and `daily_min()` functions, +adding them to `test/test_models.py`. Here are some hints: + +- You could choose to format your functions very similarly to `daily_mean()`, + defining test input and expected result arrays followed by the equality assertion. +- Try to choose cases that are suitably different, + and remember that these functions take a 2D array and return a 1D array + with each element the result of analysing each *column* of the data. + +Once added, run all the tests again with `python -m pytest tests/test_models.py`, +and you should also see your new tests pass. + +::::::::::::::: solution + +## Solution + +```python +from inflammation.models import daily_max, daily_mean, daily_min +... 
+def test_daily_max():
+    """Test that max function works for an array of positive integers."""
+
+    test_input = np.array([[4, 2, 5],
+                           [1, 6, 2],
+                           [4, 1, 9]])
+    test_result = np.array([4, 6, 9])
+
+    npt.assert_array_equal(daily_max(test_input), test_result)
+
+
+def test_daily_min():
+    """Test that min function works for an array of positive and negative integers."""
+
+    test_input = np.array([[ 4, -2, 5],
+                           [ 1, -6, 2],
+                           [-4, -1, 9]])
+    test_result = np.array([-4, -6, 2])
+
+    npt.assert_array_equal(daily_min(test_input), test_result)
+...
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The big advantage is that as our code develops we can update our test cases and commit them back,
+ensuring that we (and others) always have a set of tests
+to verify our code at each step of development.
+This way, when we implement a new feature, we can check
+a) that the feature works using a test we write for it, and
+b) that the development of the new feature does not break any existing functionality.
+
+### What About Testing for Errors?
+
+There are some cases where seeing an error is actually the correct behaviour,
+and pytest allows us to test for expected exceptions.
+Add this test in `tests/test_models.py`:
+
+```python
+import pytest
+from inflammation.models import daily_min
+...
+def test_daily_min_string():
+    """Test for TypeError when passing strings"""
+
+    with pytest.raises(TypeError):
+        error_expected = daily_min([['Hello', 'there'], ['General', 'Kenobi']])
+```
+
+Note that you need to import the `pytest` library at the top of your `test_models.py` file
+with `import pytest` so that you can use `pytest`'s `raises()` function.
+
+Run all your tests as before.
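As an aside, `pytest.raises()` also works outside test files, which makes it easy to experiment with in the Python console. Here is a minimal standalone sketch using a toy function (not part of the inflammation project, and assuming `pytest` is installed):

```python
import pytest

def divide(a, b):
    """Toy function used only to demonstrate pytest.raises()."""
    return a / b

# The 'with' block passes only if the named exception is raised inside it;
# if no exception (or a different one) occurs, pytest flags a failure
with pytest.raises(ZeroDivisionError):
    divide(1, 0)

print("ZeroDivisionError raised as expected")
```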
+
+Since we have installed `pytest` in our environment,
+we should also regenerate our `requirements.txt`:
+
+```bash
+$ python3 -m pip freeze > requirements.txt
+```
+
+Finally, let us commit our new `test_models.py` file,
+`requirements.txt` file,
+and test cases to our `test-suite` branch,
+and push this new branch and all its commits to GitHub:
+
+```bash
+$ git add requirements.txt tests/test_models.py
+$ git commit -m "Add initial test cases for daily_max() and daily_min()"
+$ git push -u origin test-suite
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Why Should We Test Invalid Input Data?
+
+Testing your code's behaviour against inputs, both valid and invalid,
+is a really good idea and is known as *data validation*.
+Even if you are developing command line software
+that cannot be exploited by malicious data entry,
+testing behaviour against invalid inputs prevents generation of erroneous results
+that could lead to serious misinterpretation
+(as well as saving time and compute cycles,
+which may be expensive for longer-running applications).
+It is generally best not to assume your users' inputs will always be rational.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- The three main types of automated tests are **unit tests**, **functional tests** and **regression tests**.
+- We can write unit tests to verify that functions generate expected output given a set of specific inputs.
+- It should be easy to add or change tests, understand and run them, and understand their results.
+- We can use a unit testing framework like Pytest to structure and simplify the writing of tests in Python.
+- We should test for expected errors in our code.
+- Testing program behaviour against both valid and invalid inputs is important and is known as **data validation**.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/22-scaling-up-unit-testing.md b/22-scaling-up-unit-testing.md new file mode 100644 index 000000000..0831600af --- /dev/null +++ b/22-scaling-up-unit-testing.md @@ -0,0 +1,370 @@ +--- +title: 2.2 Scaling Up Unit Testing +teaching: 10 +exercises: 5 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Use parameterisation to automatically run tests over a set of inputs +- Use code coverage to understand how much of our code is being tested using unit tests + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can we make it easier to write lots of tests? +- How can we know how much of our code is being tested? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +We are starting to build up a number of tests that test the same function, +but just with different parameters. +However, continuing to write a new function for every single test case +is not likely to scale well as our development progresses. +How can we make our job of writing tests more efficient? +And importantly, as the number of tests increases, +how can we determine how much of our code base is actually being tested? + +## Parameterising Our Unit Tests + +So far, we have been writing a single function for every new test we need. +But when we simply want to use the same test code but with different data for another test, +it would be great to be able to specify multiple sets of data to use with the same test code. +Test **parameterisation** gives us this. + +So instead of writing a separate function for each different test, +we can **parameterise** the tests with multiple test inputs. 
+For example, in `tests/test_models.py` let us rewrite
+the `test_daily_mean_zeros()` and `test_daily_mean_integers()` tests
+into a single test function:
+
+```python
+from inflammation.models import daily_mean
+
+@pytest.mark.parametrize(
+    "test, expected",
+    [
+        ([ [0, 0], [0, 0], [0, 0] ], [0, 0]),
+        ([ [1, 2], [3, 4], [5, 6] ], [3, 4]),
+    ])
+def test_daily_mean(test, expected):
+    """Test mean function works for array of zeroes and positive integers."""
+    npt.assert_array_equal(daily_mean(np.array(test)), np.array(expected))
+```
+
+Here, we use Pytest's **mark** capability to add metadata to this specific test -
+in this case, marking that it is a parameterised test.
+The `parametrize()` function is actually a
+[Python **decorator**](https://www.programiz.com/python-programming/decorator).
+A decorator, when applied to a function,
+adds some functionality to it when it is called, and here,
+what we want to do is specify multiple input and expected output test cases
+so the function is called over each of these inputs automatically when this test is called.
+
+We specify these as arguments to the `parametrize()` decorator,
+firstly indicating the names of these arguments that will be
+passed to the function (`test`, `expected`),
+and secondly the actual arguments themselves that correspond to each of these names -
+the input data (the `test` argument),
+and the expected result (the `expected` argument).
+In this case, we are passing in two tests to `test_daily_mean()` which will be run sequentially.
+
+So our first test will run `daily_mean()` on `[ [0, 0], [0, 0], [0, 0] ]` (our `test` argument),
+and check to see if it equals `[0, 0]` (our `expected` argument).
+Similarly, our second test will run `daily_mean()`
+with `[ [1, 2], [3, 4], [5, 6] ]` and check it produces `[3, 4]`.
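To build some intuition, the decorator behaves roughly like a loop over the (input, expected) pairs, with each pair reported as a separate test. Here is a standalone sketch of that idea, using a stand-in for `daily_mean()` (assumed here to be `np.mean` over `axis=0`, as in our project):

```python
import numpy as np
import numpy.testing as npt

def daily_mean(data):
    """Stand-in for inflammation.models.daily_mean()."""
    return np.mean(data, axis=0)

# Roughly what @pytest.mark.parametrize does for us: run the same test body
# once per (test, expected) pair, instead of duplicating the test function
cases = [
    ([[0, 0], [0, 0], [0, 0]], [0, 0]),
    ([[1, 2], [3, 4], [5, 6]], [3, 4]),
]

for test, expected in cases:
    npt.assert_array_equal(daily_mean(np.array(test)), np.array(expected))

print(f"{len(cases)} cases passed")
```

Unlike this plain loop, pytest still reports each parameterised case individually and keeps running the remaining cases after a failure.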
+
+The big plus here is that we do not need to write separate functions for each of the tests -
+our test code can remain compact and readable as we write more tests,
+and adding more tests scales better as our code becomes more complex.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Write Parameterised Unit Tests
+
+Rewrite your test functions for `daily_max()` and `daily_min()` to be parameterised,
+adding in new test cases for each of them.
+
+::::::::::::::: solution
+
+## Solution
+
+```python
+from inflammation.models import daily_max, daily_min
+...
+@pytest.mark.parametrize(
+    "test, expected",
+    [
+        ([ [0, 0, 0], [0, 0, 0], [0, 0, 0] ], [0, 0, 0]),
+        ([ [4, 2, 5], [1, 6, 2], [4, 1, 9] ], [4, 6, 9]),
+        ([ [4, -2, 5], [1, -6, 2], [-4, -1, 9] ], [4, -1, 9]),
+    ])
+def test_daily_max(test, expected):
+    """Test max function works for zeroes, positive integers, mix of positive/negative integers."""
+    npt.assert_array_equal(daily_max(np.array(test)), np.array(expected))
+
+
+@pytest.mark.parametrize(
+    "test, expected",
+    [
+        ([ [0, 0, 0], [0, 0, 0], [0, 0, 0] ], [0, 0, 0]),
+        ([ [4, 2, 5], [1, 6, 2], [4, 1, 9] ], [1, 1, 2]),
+        ([ [4, -2, 5], [1, -6, 2], [-4, -1, 9] ], [-4, -6, 2]),
+    ])
+def test_daily_min(test, expected):
+    """Test min function works for zeroes, positive integers, mix of positive/negative integers."""
+    npt.assert_array_equal(daily_min(np.array(test)), np.array(expected))
+...
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Try them out!
+
+Let us commit our revised `test_models.py` file and test cases to our `test-suite` branch
+(but do not push them to the remote repository just yet!):
+
+```bash
+$ git add tests/test_models.py
+$ git commit -m "Add parameterised mean, min, max test cases"
+```
+
+## Code Coverage - How Much of Our Code is Tested?
+
+Pytest cannot think of test cases for us.
+We still have to decide what to test and how many tests to run.
+Our best guide here is economics:
+we want the tests that are most likely to give us useful information that we do not already have.
+For example, if `daily_mean(np.array([[2, 0], [4, 0]]))` works,
+there is probably not much point testing `daily_mean(np.array([[3, 0], [4, 0]]))`,
+since it is hard to think of a bug that would show up in one case but not in the other.
+
+Now, we should try to choose tests that are as different from each other as possible,
+so that we force the code we are testing to execute in all the different ways it can -
+to ensure our tests have a high degree of **code coverage**.
+
+A simple way to check the code coverage for a set of tests is
+to install an additional package, `pytest-cov`, in our virtual environment,
+which is used by `pytest` to tell us how many statements in our code are being tested.
+
+```bash
+$ python3 -m pip install pytest-cov
+$ python3 -m pytest --cov=inflammation.models tests/test_models.py
+```
+
+Here, we pass the additional named argument `--cov` to `pytest`,
+specifying the code to analyse for test coverage.
+
+```output
+============================= test session starts ==============================
+platform darwin -- Python 3.11.4, pytest-7.4.3, pluggy-1.3.0
+rootdir: /Users/alex/work/SSI/training/lessons/python-intermediate-inflammation
+plugins: cov-4.1.0
+collected 9 items
+
+tests/test_models.py .........                                           [100%]
+
+---------- coverage: platform darwin, python 3.11.4-final-0 ----------
+Name                     Stmts   Miss  Cover
+--------------------------------------------
+inflammation/models.py       9      1    89%
+--------------------------------------------
+TOTAL                        9      1    89%
+
+============================== 9 passed in 0.26s ===============================
+```
+
+Here we can see that our tests are doing very well -
+89% of statements in `inflammation/models.py` have been executed.
+But which statements are not being tested?
+The additional argument `--cov-report term-missing` can tell us:
+
+```bash
+$ python3 -m pytest --cov=inflammation.models --cov-report term-missing tests/test_models.py
+```
+
+```output
+...
+Name                     Stmts   Miss  Cover   Missing
+------------------------------------------------------
+inflammation/models.py       9      1    89%   18
+------------------------------------------------------
+TOTAL                        9      1    89%
+...
+```
+
+So there is still one statement not being tested at line 18,
+and it turns out it is in the function `load_csv()`.
+Here we should consider whether or not to write a test for this function,
+and, in general, for any other functions that may not be tested.
+Of course, if there are hundreds or thousands of lines that are not covered
+it may not be feasible to write tests for them all.
+But we should prioritise which ones to write tests for, considering
+how often they are used,
+how complex they are,
+and importantly, the extent to which they affect our program's results.
+
+Again, we should update our `requirements.txt` file with our latest package environment,
+which now also includes `pytest-cov`, and commit it:
+
+```bash
+$ python3 -m pip freeze > requirements.txt
+$ cat requirements.txt
+```
+
+You'll notice `pytest-cov` and `coverage` have been added.
+Let us commit this file and push our new branch to GitHub:
+
+```bash
+$ git add requirements.txt
+$ git commit -m "Add coverage support"
+$ git push origin test-suite
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What about Testing Against Indeterminate Output?
+
+What if your implementation depends on a degree of random behaviour?
+This can be desired within a number of applications,
+particularly in simulations (for example, molecular simulations)
+or other stochastic behavioural models of complex systems.
+So how can you test against such systems if the outputs are different when given the same inputs?
+
+One way is to *remove the randomness* during testing.
+For those portions of your code that
+use a language feature or library to generate a random number,
+you can instead produce a known sequence of numbers when testing,
+to make the results deterministic and hence easier to test against.
+You could encapsulate this different behaviour in separate functions, methods, or classes
+and call the appropriate one depending on whether you are testing or not.
+This is essentially a type of **mocking**,
+where you are creating a "mock" version that mimics some behaviour for the purposes of testing.
+
+Another way is to *control the randomness* during testing
+to provide results that are deterministic - the same each time.
+Implementations of randomness in computing languages, including Python,
+are actually never truly random - they are **pseudorandom**:
+the sequence of 'random' numbers is typically generated using a mathematical algorithm.
+A **seed** value is used to initialise an implementation's random number generator,
+and from that point, the sequence of numbers is actually deterministic.
+Many implementations just use the system time as the default seed,
+but you can set your own.
+By doing so, the generated sequence of numbers is the same,
+e.g. using Python's `random` library to randomly select a sample
+of ten numbers from a sequence between 0-99:
+
+```python
+import random
+
+random.seed(1)
+print(random.sample(range(0, 100), 10))
+random.seed(1)
+print(random.sample(range(0, 100), 10))
+```
+
+This will produce:
+
+```output
+[17, 72, 97, 8, 32, 15, 63, 57, 60, 83]
+[17, 72, 97, 8, 32, 15, 63, 57, 60, 83]
+```
+
+So since your program's randomness is essentially eliminated,
+your tests can be written to test against the known output.
+The trick, of course, is to ensure that the output being tested against is definitively correct!
+
+The other thing you can do while keeping the random behaviour
+is to *test the output data against expected constraints* of that output.
+For example, if you know that all data should be within particular ranges,
+or within a particular statistical distribution type (e.g. normal distribution over time),
+you can test against that,
+conducting multiple test runs that take advantage of the randomness
+to fill the known "space" of expected results.
+Note that this approach is not as precise or complete as testing against known outputs,
+and bear in mind it could mean you need to run *a lot* of tests,
+which may take considerable time.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Test Driven Development
+
+In the [previous episode](21-automatically-testing-software.md)
+we learnt how to create *unit tests* to make sure our code is behaving as we intended.
+**Test Driven Development** (TDD) is an extension of this.
+If we can define a set of tests for everything our code needs to do,
+then why not treat those tests as the specification?
+
+When doing Test Driven Development,
+we write our tests first and only write enough code to make the tests pass.
+We tend to do this at the level of individual features -
+define the feature,
+write the tests,
+write the code.
+The main advantages are:
+
+- It forces us to think about how our code will be used before we write it
+- It prevents us from doing work that we do not need to do, e.g. "I might need this later..."
+- It forces us to check that the tests *fail* before we have implemented the code, meaning we
+  do not inadvertently forget to add the correct asserts.
+
+You may also see this process called **Red, Green, Refactor**:
+'Red' for the failing tests,
+'Green' for the code that makes them pass,
+then 'Refactor' (tidy up) the result.
+
+For the challenges from here on,
+try to first convert the specification into a unit test,
+then try writing the code to pass the test.
+
+## Limits to Testing
+
+Like any other piece of experimental apparatus,
+a complex program requires a much higher investment in testing than a simple one.
+Putting it another way,
+a small script that is only going to be used once,
+to produce one figure,
+probably does not need separate testing:
+its output is either correct or not.
+A linear algebra library that will be used by
+thousands of people in twice that number of applications over the course of a decade,
+on the other hand, definitely does.
+The key is to identify and prioritise testing of
+what will most affect the code's ability to generate accurate results.
+
+It is also important to remember that unit testing cannot catch every bug in an application,
+no matter how many tests you write.
+To mitigate this, manual testing is also important.
+Also remember to test using as much input data as you can,
+since very often code is developed and tested against the same small sets of data.
+Increasing the amount of data you test against - from numerous sources -
+gives you greater confidence that the results are correct.
+
+Our software will inevitably increase in complexity as it develops.
+Using automated testing where appropriate can save us considerable time,
+especially in the long term,
+and allows others to verify that the software behaves correctly.
+
+## Optional exercises
+
+Check out
+[these optional exercises](25-section2-optional-exercises.md)
+to learn more about code coverage.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- We can assign multiple inputs to tests using parametrisation.
+- It is important to understand the **coverage** of our tests across our code.
+- Writing unit tests takes time, so apply them where it makes the most sense.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/23-continuous-integration-automated-testing.md b/23-continuous-integration-automated-testing.md new file mode 100644 index 000000000..215faef16 --- /dev/null +++ b/23-continuous-integration-automated-testing.md @@ -0,0 +1,449 @@ +--- +title: 2.3 Continuous Integration for Automated Testing +teaching: 45 +exercises: 0 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the benefits of using Continuous Integration for further automation of testing +- Enable GitHub Actions Continuous Integration for public open source repositories +- Use continuous integration to automatically run unit tests and code coverage when changes are committed to a version control repository +- Use a build matrix to specify combinations of operating systems and Python versions to run tests over + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I automate the testing of my repository's code in a way that scales well? +- What can I do to make testing across multiple platforms easier? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +So far we have been manually running our tests as we require. +Once we have made a change, +or added a new feature with accompanying tests, +we can re-run our tests, +giving ourselves (and others who wish to run them) +increased confidence that everything is working as expected. +Now we are going to take further advantage of automation +in a way that helps testing scale across a development team with very little overhead, +using **Continuous Integration**. + +## What is Continuous Integration? + +The automated testing we have done so far only takes into account +the state of the repository we have on our own machines. 
+In a software project involving multiple developers working and pushing changes on a repository, +it would be great to know holistically how all these changes are affecting our codebase +without everyone having to pull down all the changes and test them. +If we also take into account the testing required on different target user platforms +for our software and the changes being made to many repository branches, +the effort required to conduct testing at this scale +can quickly become intractable for a research project to sustain. + +Continuous Integration (CI) aims to reduce this burden by further automation, +and automation - wherever possible - helps us to reduce errors +and makes predictable processes more efficient. +The idea is that when a new change is committed to a repository, +CI clones the repository, +builds it if necessary, +and runs any tests. +Once complete, it presents a report to let you see what happened. + +There are many CI infrastructures and services, +free and paid for, +and subject to change as they evolve their features. +We will be looking at [GitHub Actions](https://github.com/features/actions) - +which unsurprisingly is available as part of GitHub. + +## Continuous Integration with GitHub Actions + +### A Quick Look at YAML + +YAML is a text format used by GitHub Action workflow files. +It is also increasingly used for configuration files and storing other types of data, +so it is worth taking a bit of time looking into this file format. + +[YAML](https://www.commonwl.org/user_guide/yaml/) +(a recursive acronym which stands for "YAML Ain't Markup Language") +is a language designed to be human readable. +A few basic things you need to know about YAML to get started with GitHub Actions are +key-value pairs, arrays, maps and multi-line strings. 
+
+So firstly, YAML files are essentially made up of **key-value** pairs,
+in the form `key: value`, for example:
+
+```yaml
+name: Kilimanjaro
+height_metres: 5892
+first_scaled_by: Hans Meyer
+```
+
+In general, you do not need quotes for strings,
+but you can use them when you want to explicitly distinguish between numbers and strings,
+e.g. `height_metres: "5892"` would be a string,
+but in the above example it is an integer.
+It turns out Hans Meyer is not the only first ascender of Kilimanjaro,
+so one way to add this person as another value to this key is by using YAML **arrays**,
+like this:
+
+```yaml
+first_scaled_by:
+- Hans Meyer
+- Ludwig Purtscheller
+```
+
+An alternative to this format for arrays is the following, which would have the same meaning:
+
+```yaml
+first_scaled_by: [Hans Meyer, Ludwig Purtscheller]
+```
+
+If we wanted to express more information for one of these values
+we could use a feature known as **maps** (dictionaries/hashes),
+which allow us to define nested, hierarchical data structures, e.g.
+
+```yaml
+...
+height:
+  value: 5892
+  unit: metres
+  measured:
+    year: 2008
+    by: Kilimanjaro 2008 Precise Height Measurement Expedition
+...
+```
+
+So here, `height` itself is made up of three keys `value`, `unit`, and `measured`,
+with the last of these being another nested key with the keys `year` and `by`.
+Note the convention of using two spaces for each level of indentation
+(YAML does not allow tab characters), rather than Python's four.
+
+We can also combine maps and arrays to describe more complex data.
+Let us say we want to add more detail to our list of initial ascenders:
+
+```yaml
+...
+first_scaled_by:
+- name: Hans Meyer
+  date_of_birth: 22-03-1858
+  nationality: German
+- name: Ludwig Purtscheller
+  date_of_birth: 22-03-1858
+  nationality: Austrian
+```
+
+So here we have a YAML array of our two mountaineers,
+each with additional keys offering more information.
+
+GitHub Actions also makes use of the `|` symbol to indicate a multi-line string
+that preserves new lines.
For example:
+
+```yaml
+shakespeare_couplet: |
+  Good night, good night. Parting is such sweet sorrow
+  That I shall say good night till it be morrow.
+```
+
+The key `shakespeare_couplet` would hold the full two-line string,
+preserving the new line after "sorrow".
+
+As we will see shortly, GitHub Actions workflows will use all of these.
+
+### Defining Our Workflow
+
+With a GitHub repository there is a way we can set up CI
+to run our tests automatically when we commit changes.
+Let us do this now by adding a new file to our repository whilst on the `test-suite` branch.
+First, create the new directories `.github/workflows`:
+
+```bash
+$ mkdir -p .github/workflows
+```
+
+This directory is used specifically for GitHub Actions,
+allowing us to specify any number of workflows that can be run under a variety of conditions,
+each of which is also written using YAML.
+So let us add a new YAML file called `main.yml`
+(note its extension is `.yml` without the `a`)
+within the new `.github/workflows` directory:
+
+```yaml
+name: CI
+
+# We can specify which GitHub events will trigger a CI build
+on: push
+
+# Now define a single job 'build' (but we could define more)
+jobs:
+
+  build:
+
+    # We can also specify the OS to run tests on
+    runs-on: ubuntu-latest
+
+    # A job is a sequence of steps
+    steps:
+
+    # Next we need to check out our repository, and set up Python
+    # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
+    - name: Checkout repository
+      uses: actions/checkout@v4
+
+    - name: Set up Python 3.11
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.11"
+
+    - name: Install Python dependencies
+      run: |
+        python3 -m pip install --upgrade pip
+        python3 -m pip install -r requirements.txt
+        python3 -m pip install -e .
+
+    - name: Test with PyTest
+      run: |
+        python3 -m pytest --cov=inflammation.models tests/test_models.py
+```
+
+***Note**: be sure to create this file as `main.yml`
+within the newly created `.github/workflows` directory,
+or it will
not work!*
+
+So as well as giving our workflow a name - CI -
+we indicate with `on` that we want this workflow to run when we `push` commits to our repository.
+The workflow itself is made of a single `job` named `build`,
+and we could define any number of jobs after this one if we wanted,
+and each one would run in parallel.
+
+Next, we define what our build job will do.
+With `runs-on` we first state which operating system we want to use,
+in this case just Ubuntu for now.
+We will be looking at ways we can scale this up to testing on more systems later.
+
+Lastly, we define the `step`s that our job will undertake in turn,
+to set up the job's environment and run our tests.
+You can think of the job's environment initially as a blank slate:
+much like a freshly installed machine (albeit virtual) with very little installed on it,
+we need to prepare it with what it needs to be able to run our tests.
+These steps are:
+
+- **Checkout repository for the job:**
+  `uses` indicates that we want to use a GitHub Action called `checkout` that does this
+- **Set up Python version:**
+  here we use the `setup-python` Action, indicating that we want Python version 3.11.
+  Note we specify the version within quotes,
+  to ensure that this is interpreted as a complete string.
+  Otherwise, if we wanted to test against, for example, Python 3.10,
+  specifying `3.10` without the quotes
+  would be interpreted as the floating point number `3.1` -
+  which, although numerically equal to `3.10`,
+  refers to the wrong Python version!
+- **Install latest version of pip, dependencies, and our inflammation package:**
+  In order to locally install our `inflammation` package
+  it is good practice to upgrade the version of pip that is present first,
+  then we use pip to install our package dependencies.
+  Once installed, we can use `python3 -m pip install -e .` as before to install our own package.
+  We use `run` here to run these commands in the CI shell environment
+- **Test with PyTest:** lastly, we run `python3 -m pytest`,
+  with the same arguments we used manually before
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What about other Actions?
+
+Our workflow here uses standard GitHub Actions (indicated by `actions/*`).
+Beyond the standard set of actions,
+others are available via the
+[GitHub Marketplace](https://docs.github.com/en/developers/github-marketplace/github-marketplace-overview).
+It contains many third-party actions (as well as apps)
+that you can use with GitHub for many tasks across many programming languages,
+particularly for setting up environments for running tests,
+code analysis and other tools,
+setting up and using infrastructure (for things like Docker or Amazon's AWS cloud),
+or even managing repository issues.
+You can even contribute your own.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Triggering a Build on GitHub Actions
+
+Now if we commit and push this change a CI run will be triggered:
+
+```bash
+$ git add .github
+$ git commit -m "Add GitHub Actions configuration"
+$ git push origin test-suite
+```
+
+Since we are only committing the GitHub Actions configuration file
+to the `test-suite` branch for the moment,
+only the contents of this branch will be used for CI.
+We can pass this file upstream into other branches (i.e. via merges) when we are happy it works,
+which will then allow the process to run automatically on these other branches.
+This again highlights the usefulness of the feature-branch model -
+we can work in isolation on a feature until it is ready to be passed upstream
+without disrupting development on other branches,
+and in the case of CI,
+we are starting to see its scaling benefits across a larger development team
+working across potentially many branches.
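As an aside, if we later wanted builds to run only for certain branches, rather than on every push, GitHub Actions lets us filter the events in the `on` section. A hypothetical variation on our workflow (the branch names here are illustrative, and we do not need this for the course) might look like:

```yaml
# Trigger builds only for pushes to main and develop,
# and for pull requests targeting main
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
```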
+
+### Checking Build Progress and Reports
+
+Handily, we can see the progress of the build from our repository on GitHub
+by selecting the `test-suite` branch from the dropdown menu
+(which currently says `main`),
+and then selecting `commits`
+(located just above the code directory listing on the right,
+alongside the last commit message and a small image of a timer).
+
+![](fig/ci-initial-ga-build.png){alt='Continuous Integration with GitHub Actions - Initial Build' .image-with-shadow width="1000px"}
+
+You'll see a list of commits for this branch,
+and likely see an orange marker next to the latest commit
+(clicking on it yields `Some checks haven't completed yet`),
+meaning the build is still in progress.
+This is a useful view, as over time, it will give you a history of commits,
+who made them, and whether each commit resulted in a successful build or not.
+
+Hopefully after a while, the marker will turn into a green tick indicating a successful build.
+Clicking it gives you even more information about the build,
+and selecting the `Details` link takes you to a complete log of the build and its output.
+
+![](fig/ci-initial-ga-build-log.png){alt='Continuous Integration with GitHub Actions - Build Log' .image-with-shadow width="1000px"}
+
+The logs are actually truncated; selecting the arrows next to the entries -
+which are the `name` labels we specified in the `main.yml` file -
+will expand them with more detail, including the output from the actions performed.
+
+![](fig/ci-initial-ga-build-details.png){alt='Continuous Integration with GitHub Actions - Build Details' .image-with-shadow width="1000px"}
+
+GitHub Actions offers these continuous integration features
+as a completely free service for public repositories,
+and supplies 2000 build minutes a month on as many private repositories as you like.
+Paid levels are available too.
+ +## Scaling Up Testing Using Build Matrices + +Now we have our CI configured and building, +we can use a feature called **build matrices** +which really shows the value of using CI to test at scale. + +Suppose the intended users of our software use either Ubuntu, Mac OS, or Windows, +and either have Python version 3.10 or 3.11 installed, +and we want to support all of these. +Assuming we have a suitable test suite, +it would take a considerable amount of time to set up testing platforms +to run our tests across all these platform combinations. +Fortunately, CI can do the hard work for us very easily. + +Using a build matrix we can specify testing environments and parameters +(such as operating system, Python version, etc.) +and new jobs will be created that run our tests for each permutation of these. + +Let us see how this is done using GitHub Actions. +To support this, we define a `strategy` as +a `matrix` of operating systems and Python versions within `build`. +We then use `matrix.os` and `matrix.python-version` to reference these configuration possibilities +instead of using hardcoded values - +replacing the `runs-on` and `python-version` parameters +to refer to the values from the matrix. 
+So, our `.github/workflows/main.yml` should look like the following:
+
+```yaml
+# Same key-value pairs as in "Defining Our Workflow" section
+name: CI
+on: push
+jobs:
+  build:
+
+    # Here we add the matrix definition:
+    strategy:
+      matrix:
+        os: ["ubuntu-latest", "macos-latest", "windows-latest"]
+        python-version: ["3.10", "3.11"]
+
+    # Here we add the reference to the os matrix values
+    runs-on: ${{ matrix.os }}
+
+    # Same key-value pairs as in "Defining Our Workflow" section
+    steps:
+
+    - name: Checkout repository
+      uses: actions/checkout@v4
+
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        # Here we add the reference to the python-version matrix values
+        python-version: ${{ matrix.python-version }}
+    # Same steps as in "Defining Our Workflow" section
+    - name: Install Python dependencies
+      run: |
+        python3 -m pip install --upgrade pip
+        python3 -m pip install -r requirements.txt
+    - name: Test with PyTest
+      run: |
+        python3 -m pytest --cov=inflammation.models tests/test_models.py
+```
+
+The `${{ }}` syntax is used
+to reference configuration values from the matrix.
+This way, every possible permutation of Python versions 3.10 and 3.11
+with the latest versions of Ubuntu, Mac OS and Windows operating systems
+will be tested and we can expect 6 build jobs in total.
+
+Let us commit and push this change and see what happens:
+
+```bash
+$ git add .github/workflows/main.yml
+$ git commit -m "Add GA build matrix for os and Python version"
+$ git push origin test-suite
+```
+
+If we go to our GitHub build now, we can see that a new job has been created for each permutation.
+
+![](fig/ci-ga-build-matrix.png){alt='Continuous Integration with GitHub Actions - Build Matrix' .image-with-shadow width="1000px"}
+
+Note that all jobs run in parallel (up to the limit allowed by our account),
+which potentially saves us a lot of time waiting for testing results.
+Overall, this approach allows us to massively scale our automated testing
+across the platforms we wish to test.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Failed CI Builds
+
+A CI build can fail when, for example, a Python package we depend on no longer supports a particular version of
+Python indicated in a GitHub Actions CI build matrix. In this case, the solution is either to
+upgrade the Python version in the build matrix (when possible) or downgrade the package version (and not use the latest one like we have been doing in this course).
+
+Also note that, if one job in the build fails for any reason, all subsequent jobs will get cancelled because of the default behaviour of
+GitHub Actions. From [GitHub's documentation](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#handling-failures):
+
+*GitHub will cancel all in-progress and queued jobs in the matrix if any job in the matrix fails.* This behaviour can be controlled by changing the value of the `fail-fast` property:
+
+```yaml
+...
+    strategy:
+      fail-fast: false
+      matrix:
+...
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Continuous Integration can run tests automatically to verify changes as code develops in our repository.
+- CI builds are typically triggered by commits pushed to a repository.
+- We need to write a configuration file to inform a CI service what to do for a build.
+- We can use a build matrix to specify multiple platforms and programming language versions to test against.
+- Builds can be enabled and configured separately for each branch.
+- We can run - and get reports from - different CI infrastructure builds simultaneously.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/24-diagnosing-issues-improving-robustness.md b/24-diagnosing-issues-improving-robustness.md new file mode 100644 index 000000000..40ad7bc9f --- /dev/null +++ b/24-diagnosing-issues-improving-robustness.md @@ -0,0 +1,888 @@ +--- +title: 2.4 Diagnosing Issues and Improving Robustness +teaching: 30 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Use a debugger to explore behaviour of a running program +- Describe and identify edge and corner test cases and explain why they are important +- Apply error handling and defensive programming techniques to improve robustness of a program +- Integrate linting tool style checking into a continuous integration job + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Once we know our program has errors, how can we locate them in the code? +- How can we make our programs more resilient to failure? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +Unit testing can tell us something is wrong in our code +and give a rough idea of where the error is +by which test(s) are failing. +But it does not tell us exactly where the problem is (i.e. what line of code), +or how it came about. +To give us a better idea of what is going on, we can: + +- output program state at various points, + e.g. by using print statements to output the contents of variables, +- use a logging capability to output + the state of everything as the program progresses, or +- look at intermediately generated files. + +But such approaches are often time consuming +and sometimes not enough to fully pinpoint the issue. +In complex programs, like simulation codes, +we often need to get inside the code while it is running and explore. +This is where using a **debugger** can be useful. 
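For instance, the first approach - outputting program state - might look like the following sketch, which uses Python's built-in `logging` module (the `normalise()` function and its data are invented purely for illustration):

```python
import logging

# Show DEBUG-level messages, which are hidden by default
logging.basicConfig(level=logging.DEBUG)


def normalise(values):
    """Scale a list of numbers by its maximum value."""
    max_value = max(values)
    # Output intermediate program state so we can inspect it as the program runs
    logging.debug("max_value = %s", max_value)
    return [value / max_value for value in values]


print(normalise([2, 4, 8]))  # prints [0.25, 0.5, 1.0]
```

This works for simple cases, but as noted above, scattering such statements through a larger program quickly becomes time consuming - which is where a debugger comes in.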
+ +## Setting the Scene + +Let us add a new function called `patient_normalise()` to our inflammation example +to normalise a given inflammation data array so that all entries fall between 0 and 1. +(Make sure you create a new feature branch for this work off your `develop` branch.) +To normalise each patient's inflammation data +we need to divide it by the maximum inflammation experienced by that patient. +To do so, we can add the following code to `inflammation/models.py`: + +```python +def patient_normalise(data): + """Normalise patient data from a 2D inflammation data array.""" + max = np.max(data, axis=0) + return data / max[:, np.newaxis] +``` + +***Note:** there are intentional mistakes in the above code, +which will be detected by further testing and code style checking below +so bear with us for the moment.* + +In the code above, we first go row by row +and find the maximum inflammation value for each patient +and store these values in a 1-dimensional NumPy array `max`. +We then want to use NumPy's element-wise division, +to divide each value in every row of inflammation data +(belonging to the same patient) +by the maximum value for that patient stored in the 1D array `max`. +However, we cannot do that division automatically +as `data` is a 2D array (of shape `(60, 40)`) +and `max` is a 1D array (of shape `(60, )`), +which means that their shapes are not compatible. + +![](fig/numpy-incompatible-shapes.png){alt='NumPy arrays of incompatible shapes' .image-with-shadow width="800px"} + +Hence, to make sure that we can perform this division and get the expected result, +we need to convert `max` to be a 2D array +by using the `newaxis` index operator to insert a new axis into `max`, +making it a 2D array of shape `(60, 1)`. + +![](fig/numpy-shapes-after-new-axis.png){alt="NumPy arrays' shapes after adding a new\_axis" .image-with-shadow width="800px"} + +Now the division will give us the expected result. 
+Even though the shapes are not identical,
+NumPy's automatic `broadcasting` (adjustment of shapes) will make sure that
+the shape of the 2D `max` array is now "stretched" ("broadcast")
+to match that of `data` - i.e. `(60, 40)`,
+and element-wise division can be performed.
+
+![](fig/numpy-shapes-after-broadcasting.png){alt="NumPy arrays' shapes after broadcasting" .image-with-shadow width="800px"}
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Broadcasting
+
+The term broadcasting describes how NumPy treats arrays with different shapes
+during arithmetic operations.
+Subject to certain constraints,
+the smaller array is "broadcast" across the larger array
+so that they have compatible shapes.
+Be careful, though, to understand how the arrays get stretched
+to avoid getting unexpected results.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Note there is an assumption in this calculation
+that the minimum value we want is always zero.
+This is a sensible assumption for this particular application,
+since the zero value is a special case indicating that a patient
+experienced no inflammation on a particular day.
+
+Let us now add a new test in `tests/test_models.py`
+to check that the normalisation function is correct for some test data.
+
+```python
+from inflammation.models import patient_normalise
+
+@pytest.mark.parametrize(
+    "test, expected",
+    [
+        ([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0.33, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]])
+    ])
+def test_patient_normalise(test, expected):
+    """Test normalisation works for arrays of one and positive integers.
+    Test with a relative and absolute tolerance of 0.01."""
+
+    result = patient_normalise(np.array(test))
+    npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2)
+```
+
+Note that we are using the `assert_allclose()` NumPy testing function
+instead of `assert_array_equal()`,
+since it allows us to test against values that are **close** to each other.
+This is very useful when we have numbers with arbitrary decimal places +and are only concerned with a certain degree of precision, +like the test case above. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Relative and absolute tolerance + +**Relative tolerance** in unit testing means that the acceptable difference between the expected and actual results +depends on the size of the expected result itself. So, if your expected result is 100, +a relative tolerance of 0.1 (or 10%) means the actual result can be anywhere from 90 to 110 and still be considered correct. + +**Absolute tolerance**, on the other hand, +sets a fixed allowable difference regardless of the magnitude of the expected result. +For example, if you set an absolute tolerance of 5, +it means the actual result can be within 5 units of the expected result, +regardless of whether the expected result is 10 or 1000. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Run the tests again using `python -m pytest tests/test_models.py` +and you will note that the new test is failing, +with an error message that does not give many clues as to what went wrong. + +```output +E AssertionError: +E Not equal to tolerance rtol=0.01, atol=0.01 +E +E Mismatched elements: 6 / 9 (66.7%) +E Max absolute difference: 0.57142857 +E Max relative difference: 0.57356077 +E x: array([[0.142857, 0.285714, 0.428571], +E [0.5 , 0.625 , 0.75 ], +E [0.777778, 0.888889, 1. ]]) +E y: array([[0.33, 0.67, 1. ], +E [0.67, 0.83, 1. ], +E [0.78, 0.89, 1. ]]) + +tests/test_models.py:53: AssertionError +``` + +Let us use a debugger at this point to see what is going on and why the function failed. + +## Debugging in PyCharm + +Think of debugging like performing exploratory surgery - on code! +Debuggers allow us to peer at the internal workings of a program, +such as variables and other state, +as it performs its functions. 
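Incidentally, a full IDE is not the only way to do this: Python ships with a command-line debugger, `pdb`, and the built-in `breakpoint()` function pauses execution and drops you into it. As a small sketch (using a simplified, pure-Python stand-in for our NumPy-based function, invented here for illustration):

```python
def patient_normalise(data):
    """Normalise each row of a nested list by its maximum value
    (a pure-Python stand-in for the NumPy version)."""
    max_per_patient = [max(row) for row in data]
    # breakpoint()  # uncomment to pause here; at the (Pdb) prompt try
    #               # 'p max_per_patient', 'n' (next line) or 'c' (continue)
    return [[value / m for value in row]
            for row, m in zip(data, max_per_patient)]


print(patient_normalise([[1, 2, 3], [4, 5, 6]]))
```

We will use PyCharm's graphical debugger below, which offers the same core capabilities in a friendlier interface.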
+
+### Running Tests Within PyCharm
+
+Firstly, to make it easier to track what's going on,
+we can set up PyCharm to run and debug our tests
+instead of running them from the command line.
+If you have not done so already,
+you will first need to enable the Pytest framework in PyCharm.
+You can do this by:
+
+1. Select either `PyCharm` > `Preferences` (Mac) or `File` > `Settings` (Linux, Windows).
+2. Then, in the preferences window that appears,
+   select `Tools` > `Python Integrated Tools` from the left.
+3. Under `Testing`, for `Default test runner` select `pytest`.
+4. Select `OK`.
+
+![](fig/pycharm-test-framework.png){alt='Setting up test framework in PyCharm' .image-with-shadow width="1000px"}
+
+We can now run `pytest` over our tests in PyCharm,
+similarly to how we ran our `inflammation-analysis.py` script before.
+Right-click the `test_models.py` file
+under the `tests` directory in the file navigation window on the left,
+and select `Run 'pytest in test_model...'`.
+You'll see the results of the tests appear in PyCharm in a bottom panel.
+If you scroll down in that panel you should see
+the failed `test_patient_normalise()` test result
+looking something like the following:
+
+![](fig/pytest-pycharm-run-tests.png){alt='Running pytest in PyCharm' .image-with-shadow width="1000px"}
+
+We can also run our test functions individually.
+First, let us check that our PyCharm running and testing configurations are correct.
+Select `Run` > `Edit Configurations...` from the PyCharm menu,
+and you should see something like the following:
+
+![](fig/pytest-pycharm-check-config.png){alt='Ensuring testing configurations in PyCharm are correct' .image-with-shadow width="800px"}
+
+PyCharm allows us to configure multiple ways of running our code.
+Looking at the figure above,
+the first of these -
+`inflammation-analysis` under `Python` -
+was configured when we set up how to run our script from within PyCharm.
+The second - +`pytest in test_models.py` under `Python tests` - +is our recent test configuration. +If you see just these, you are good to go. +We do not need any others, +so select any others you see and click the `-` button at the top to remove them. +This will avoid any confusion when running our tests separately. +Click `OK` when done. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Buffered Output + +Whenever a Python program prints text to the terminal or to a file, +it first stores this text in an **output buffer**. +When the buffer becomes full or is **flushed**, +the contents of the buffer are written to +the terminal / file in one go and the buffer is cleared. +This is usually done to increase performance +by effectively converting multiple output operations into just one. +Printing text to the terminal is a relatively slow operation, +so in some cases this can make quite a big difference +to the total execution time of a program. + +However, using buffered output can make debugging more difficult, +as we can no longer be quite sure when a log message will be displayed. +In order to make debugging simpler, +PyCharm automatically adds the environment variable `PYTHONUNBUFFERED` +we see in the screenshot above, +which disables output buffering. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now, if you select the green arrow next to a test function +in our `test_models.py` script in PyCharm, +and select `Run 'pytest in test_model...'`, +we can run just that test: + +![](fig/pytest-pycharm-run-single-test.png){alt='Running a single test in PyCharm' .image-with-shadow width="800px"} + +Click on the "run" button next to `test_patient_normalise`, +and you will be able to see that PyCharm runs just that test function, +and we see the same `AssertionError` that we saw before. + +### Running the Debugger + +Now we want to use the debugger to investigate +what is happening inside the `patient_normalise` function. 
+To do this we will add a *breakpoint* in the code.
+A breakpoint will pause execution at that point allowing us to explore the state of the program.
+
+To set a breakpoint, navigate to the `models.py` file
+and move your mouse to the `return` statement of the `patient_normalise` function.
+Click just to the right of the line number for that line
+and a small red dot will appear,
+indicating that you have placed a breakpoint on that line.
+
+![](fig/pytest-pycharm-set-breakpoint.png){alt='Setting a breakpoint in PyCharm' .image-with-shadow width="600px"}
+
+Now if you select the green arrow next to the `test_patient_normalise` function
+and instead select `Debug 'pytest in test_model...'`,
+you will notice that execution will be paused
+at the `return` statement of `patient_normalise`.
+In the debug panel that appears below,
+we can now investigate the exact state of the program
+prior to it executing this line of code.
+
+In the debug panel below,
+in the `Debugger` tab you will be able to see
+two sections that look something like the following:
+
+![](fig/pytest-pycharm-debug.png){alt='Debugging in PyCharm' .image-with-shadow width="1000px"}
+
+- The `Frames` section on the left,
+  which shows the **call stack**
+  (the chain of functions that have been executed to lead to this point).
+  We can traverse this chain of functions if we wish,
+  to observe the state of each function.
+- The `Variables` section on the right,
+  which displays the local and global variables currently in memory.
+  You will be able to see the `data` array
+  that is input to the `patient_normalise` function,
+  as well as the `max` local array
+  that was created to hold the maximum inflammation values for each patient.
+
+We also have the ability to run any Python code we wish at this point
+to explore the state of the program even further!
+This is useful if you want to view a particular combination of variables,
+or perhaps a single element or slice of an array to see what went wrong.
+Select the `Console` tab in the panel (next to the `Debugger` tab),
+and you'll be presented with a Python prompt.
+Try entering the expression `max[:, np.newaxis]` in the console,
+and you will be able to see the column vector that we are dividing `data` by
+in the return line of the function.
+
+![](fig/pytest-pycharm-console.png){alt='Debug console in PyCharm' .image-with-shadow width="1000px"}
+
+Now, looking at the `max` variable,
+we can see that something looks wrong,
+as the maximum values for each patient do not correspond to the `data` array.
+Recall that the input `data` array we are using for the function is
+
+```python
+ [[1, 2, 3],
+  [4, 5, 6],
+  [7, 8, 9]]
+```
+
+So the maximum inflammation for each patient should be `[3, 6, 9]`,
+whereas the debugger shows `[7, 8, 9]`.
+You can see that the latter corresponds exactly to the last row of `data`,
+and we can immediately conclude that
+we took the maximum along the wrong axis of `data`.
+Now that we have our answer,
+stop the debugging process by selecting
+the red square at the top right of the main PyCharm window.
+
+So to fix the `patient_normalise` function in `models.py`,
+change `axis=0` in the first line of the function to `axis=1`.
+With this fix in place,
+running all the tests again should result in all tests passing.
+Navigate back to `test_models.py` in PyCharm,
+right-click `test_models.py`
+and select `Run 'pytest in test_model...'`.
+You should be rewarded with:
+
+![](fig/pytest-pycharm-all-tests-pass.png){alt='All tests in PyCharm are successful' .image-with-shadow width="1000px"}
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## NumPy Axis
+
+Getting the axes right in NumPy is not trivial -
+the [following tutorial](https://www.sharpsightlabs.com/blog/numpy-axes-explained/#:~:text=NumPy%20axes%20are%20the%20directions,along%20the%20rows%20and%20columns)
+offers a good explanation on how axes work when applying NumPy functions to arrays.
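To see the difference between the two axes for our test data, here is a quick standalone sketch in plain NumPy:

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# axis=0 collapses the rows, giving the maximum of each column
print(np.max(data, axis=0))  # [7 8 9]

# axis=1 collapses the columns, giving the maximum of each row (patient)
print(np.max(data, axis=1))  # [3 6 9]
```

Remembering that the axis you pass is the axis that gets *collapsed* can help avoid this class of bug.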
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Debugging Outside of an IDE
+
+It is worth being aware that you do not need to use an IDE to debug code,
+although it does certainly make it easier!
+The Python standard library comes with a command-line capable debugger built in, called [pdb](https://docs.python.org/3/library/pdb.html).
+The easiest way to use it is to put one of these lines
+anywhere in your code you would like the debugger to stop:
+`import pdb; pdb.set_trace()` or `breakpoint()`.
+Then you are able to run your Python program from the command line like you normally would,
+but instead of completing or erroring out,
+a different prompt for the debugger will come up in your terminal.
+The debugger has its own commands that you can read about in
+[the documentation for pdb](https://docs.python.org/3/library/pdb.html#debugger-commands).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Corner or Edge Cases
+
+The test case that we have currently written for `patient_normalise`
+is parameterised with a fairly standard data array.
+However, when writing your test cases,
+it is important to consider parameterising them by unusual or extreme values,
+in order to test all the edge or corner cases that your code could be exposed to in practice.
+Generally speaking, it is at these extreme cases that you will find your code failing,
+so it is beneficial to test them beforehand.
+
+What is considered an "edge case" for a given component depends on
+what that component is meant to do.
+In the case of the `patient_normalise` function, the goal is to normalise an array of numbers.
+For numerical values, extreme cases could be zeros,
+very large or small values,
+not-a-number (`NaN`) or infinity values.
+Since we are specifically considering an *array* of values,
+an edge case could be that all the numbers of the array are equal.
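A quick check shows why zeros in particular deserve attention here: in NumPy, dividing zero by zero does not raise an exception but silently produces `nan` values (a standalone sketch, not project code):

```python
import numpy as np

data = np.zeros((3, 3))          # an edge case: every measurement is zero
max_data = np.max(data, axis=1)  # the maximum of each all-zero row is 0

with np.errstate(invalid='ignore'):  # suppress the floating-point warning for 0/0
    result = data / max_data[:, np.newaxis]

print(result)  # every element is nan - no exception was raised
```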
+
+For all the given edge cases you might come up with,
+you should also consider their likelihood of occurrence.
+It is often too much effort to exhaustively test a given function against every possible input,
+so you should prioritise edge cases that are likely to occur.
+For our `patient_normalise` function, some common edge cases might be the occurrence of zeros,
+and the case where all the values of the array are the same.
+
+When you are considering edge cases to test for,
+try also to think about what might break your code.
+For `patient_normalise` we can see that there is a division by
+the maximum inflammation value for each patient,
+so this will clearly break if we are dividing by zero here,
+resulting in `NaN` values in the normalised array.
+
+With all this in mind,
+let us add a few edge cases to our parametrisation of `test_patient_normalise`.
+We will add two extra tests,
+corresponding to an input array of all zeros,
+and an input array of all ones.
+
+```python
+@pytest.mark.parametrize(
+    "test, expected",
+    [
+        ([[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0]]),
+        ([[1, 1, 1], [1, 1, 1], [1, 1, 1]], [[1, 1, 1], [1, 1, 1], [1, 1, 1]]),
+        ([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0.33, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]]),
+    ])
+```
+
+Running the tests now from the command line results in the following assertion error,
+due to the division by zero as we predicted.
+
+```output
+E           AssertionError:
+E           Not equal to tolerance rtol=0.01, atol=0.01
+E
+E           x and y nan location mismatch:
+E            x: array([[nan, nan, nan],
+E                  [nan, nan, nan],
+E                  [nan, nan, nan]])
+E            y: array([[0, 0, 0],
+E                  [0, 0, 0],
+E                  [0, 0, 0]])
+
+tests/test_models.py:88: AssertionError
+```
+
+How can we fix this?
+Luckily, there is a NumPy function that is useful here,
+[`np.isnan()`](https://numpy.org/doc/stable/reference/generated/numpy.isnan.html),
+which we can use to replace all the NaNs with our desired result of 0.
+We can also silence the run-time warning using +[`np.errstate`](https://numpy.org/doc/stable/reference/generated/numpy.errstate.html): + +```python +... +def patient_normalise(data): + """ + Normalise patient data from a 2D inflammation data array. + + NaN values are ignored, and normalised to 0. + + Negative values are rounded to 0. + """ + max = np.nanmax(data, axis=1) + with np.errstate(invalid='ignore', divide='ignore'): + normalised = data / max[:, np.newaxis] + normalised[np.isnan(normalised)] = 0 + normalised[normalised < 0] = 0 + return normalised +... +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Exploring Tests for Edge Cases + +Think of some more suitable edge cases to test our `patient_normalise()` function +and add them to the parametrised tests. +After you have finished remember to commit your changes. + +::::::::::::::: solution + +## Possible Solution + +```python +from inflammation.models import patient_normalise + +@pytest.mark.parametrize( + "test, expected", + [ + ( + [[0, 0, 0], [0, 0, 0], [0, 0, 0]], + [[0, 0, 0], [0, 0, 0], [0, 0, 0]], + ), + ( + [[1, 1, 1], [1, 1, 1], [1, 1, 1]], + [[1, 1, 1], [1, 1, 1], [1, 1, 1]], + ), + ( + [[float('nan'), 1, 1], [1, 1, 1], [1, 1, 1]], + [[0, 1, 1], [1, 1, 1], [1, 1, 1]], + ), + ( + [[1, 2, 3], [4, 5, float('nan')], [7, 8, 9]], + [[0.33, 0.67, 1], [0.8, 1, 0], [0.78, 0.89, 1]], + ), + ( + [[-1, 2, 3], [4, 5, 6], [7, 8, 9]], + [[0, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]], + ), + ( + [[1, 2, 3], [4, 5, 6], [7, 8, 9]], + [[0.33, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]], + ) + ]) +def test_patient_normalise(test, expected): + """Test normalisation works for arrays of one and positive integers.""" + + result = patient_normalise(np.array(test)) + npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2) +... +``` + +You could also, for example, test and handle the case of a whole row of NaNs. 
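For instance, a self-contained sketch of that all-NaN-row case might look like the following. It re-states the `patient_normalise` from above so it can be run on its own, renames the local variable to `max_data` to avoid shadowing the built-in `max`, and silences the `RuntimeWarning` that `np.nanmax` emits for an all-NaN row:

```python
import warnings

import numpy as np
import numpy.testing as npt


def patient_normalise(data):
    """Normalise patient data, mapping NaN (and negative) results to 0."""
    with warnings.catch_warnings():
        # np.nanmax warns 'All-NaN slice encountered' for a row of NaNs
        warnings.simplefilter('ignore', category=RuntimeWarning)
        max_data = np.nanmax(data, axis=1)
    with np.errstate(invalid='ignore', divide='ignore'):
        normalised = data / max_data[:, np.newaxis]
    normalised[np.isnan(normalised)] = 0
    normalised[normalised < 0] = 0
    return normalised


# A whole row of NaNs should come out as a row of zeros
test = np.array([[np.nan, np.nan, np.nan], [1, 2, 3], [4, 5, 6]])
expected = np.array([[0, 0, 0], [0.33, 0.67, 1], [0.67, 0.83, 1]])
npt.assert_allclose(patient_normalise(test), expected, rtol=1e-2, atol=1e-2)
```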
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Defensive Programming
+
+In the previous section, we made a few design choices for our `patient_normalise` function:
+
+1. We are implicitly converting any `NaN` and negative values to 0,
+2. Normalising a constant 0 array of inflammation results in an identical array of 0s,
+3. We do not warn the user of any of these situations.
+
+This could have been handled differently.
+We might decide that we do not want to silently make these changes to the data,
+but instead to explicitly check that the input data satisfies a given set of assumptions
+(e.g. no negative values)
+and raise an error if this is not the case.
+Then we can proceed with the normalisation,
+confident that our normalisation function will work correctly.
+
+Checking that input to a function is valid via a set of preconditions
+is one of the simplest forms of **defensive programming**
+which is used as a way of avoiding potential errors.
+Preconditions are checked at the beginning of the function
+to make sure that all assumptions are satisfied.
+These assumptions are often based on the *value* of the arguments, like we have already discussed.
+However, in a dynamic language like Python
+one of the more common preconditions is to check that the arguments of a function
+are of the correct *type*.
+Currently there is nothing stopping someone from calling `patient_normalise` with
+a string, a dictionary, or another object that is not an `ndarray`.
+
+As an example, let us change the behaviour of the `patient_normalise()` function
+to raise an error on negative inflammation values.
+Edit the `inflammation/models.py` file,
+and add a precondition check to the beginning of the `patient_normalise()` function like so:
+
+```python
+...
+    if np.any(data < 0):
+        raise ValueError('Inflammation values should not be negative')
+...
+``` + +We can then modify our test function in `tests/test_models.py` +to check that the function raises the correct exception - a `ValueError` - +when input to the test contains negative values +(i.e. input case `[[-1, 2, 3], [4, 5, 6], [7, 8, 9]]`). +The [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError) exception +is part of the standard Python library +and is used to indicate that the function received an argument of the right type, +but of an inappropriate value. + +```python +from inflammation.models import patient_normalise + +@pytest.mark.parametrize( + "test, expected, expect_raises", + [ + ... # previous test cases here, with None for expect_raises, except for the next one - add ValueError + ... # as an expected exception (since it has a negative input value) + ( + [[-1, 2, 3], [4, 5, 6], [7, 8, 9]], + [[0, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]], + ValueError, + ), + ( + [[1, 2, 3], [4, 5, 6], [7, 8, 9]], + [[0.33, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]], + None, + ), + ]) +def test_patient_normalise(test, expected, expect_raises): + """Test normalisation works for arrays of one and positive integers.""" + + if expect_raises is not None: + with pytest.raises(expect_raises): + result = patient_normalise(np.array(test)) + npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2) + else: + result = patient_normalise(np.array(test)) + npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2) +``` + +Be sure to commit your changes so far and push them to GitHub. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Optional Exercise: Add a Precondition to Check the Correct Type and Shape of Data + +Add preconditions to check that data is an `ndarray` object and that it is of the correct shape. +Add corresponding tests to check that the function raises the correct exception. 
+You will find the Python function
+[`isinstance`](https://docs.python.org/3/library/functions.html#isinstance)
+useful here, as well as the Python exception
+[`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError).
+Once you are done, commit your new files,
+and push the new commits to your remote repository on GitHub.
+
+::::::::::::::: solution
+
+## Solution
+
+In `inflammation/models.py`:
+
+```python
+...
+def patient_normalise(data):
+    """
+    Normalise patient data from a 2D inflammation data array to values between 0 and 1.
+
+    Any NaN values are ignored, and normalised to 0
+
+    :param data: 2D array of inflammation data
+    :type data: ndarray
+
+    """
+    if not isinstance(data, np.ndarray):
+        raise TypeError('data input should be ndarray')
+    if len(data.shape) != 2:
+        raise ValueError('inflammation array should be 2-dimensional')
+    if np.any(data < 0):
+        raise ValueError('inflammation values should be non-negative')
+    max = np.nanmax(data, axis=1)
+    with np.errstate(invalid='ignore', divide='ignore'):
+        normalised = data / max[:, np.newaxis]
+    normalised[np.isnan(normalised)] = 0
+    return normalised
+...
+```
+
+In `tests/test_models.py`:
+
+```python
+from inflammation.models import patient_normalise
+...
+@pytest.mark.parametrize(
+    "test, expected, expect_raises",
+    [
+        ...
+        (
+            'hello',
+            None,
+            TypeError,
+        ),
+        (
+            3,
+            None,
+            TypeError,
+        ),
+        (
+            [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
+            [[0.33, 0.67, 1], [0.67, 0.83, 1], [0.78, 0.89, 1]],
+            None,
+        )
+    ])
+def test_patient_normalise(test, expected, expect_raises):
+    """Test normalisation works for arrays of one and positive integers."""
+    if isinstance(test, list):
+        test = np.array(test)
+    if expect_raises is not None:
+        with pytest.raises(expect_raises):
+            result = patient_normalise(test)
+            npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2)
+
+    else:
+        result = patient_normalise(test)
+        npt.assert_allclose(result, np.array(expected), rtol=1e-2, atol=1e-2)
+...
+```
+
+Note the conversion from `list` to `np.array` has been moved
+out of the call to `npt.assert_allclose()` within the test function,
+and is now only applied to list items (rather than all items).
+This allows for greater flexibility with our test inputs,
+since converting the string or integer test cases to arrays
+would mean the `TypeError` is never raised.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+If you do the challenge, again, be sure to commit your changes and push them to GitHub.
+
+You should not take it too far by trying to code preconditions for every conceivable eventuality.
+You should aim to strike a balance between
+making sure you secure your function against incorrect use,
+and writing an overly complicated and expensive function
+that handles cases that are likely never going to occur.
+For example, it would be sensible to validate the shape of your inflammation data array
+when it is actually read from the CSV file (in `load_csv`),
+and therefore there is no reason to test this again in `patient_normalise`.
+You can also decide against adding explicit preconditions in your code,
+and instead state the assumptions and limitations of your code
+for users of your code in the docstring
+and rely on them to invoke your code correctly.
+This approach is useful when explicitly checking the precondition is too costly.
+
+## Improving Robustness with Automated Code Style Checks
+
+Let us re-run Pylint over our project after having added some more code to it.
+From the project root do:
+
+```bash
+$ pylint inflammation
+```
+
+You may see something like the following in Pylint's output:
+
+```output
+************* Module inflammation.models
+...
+inflammation/models.py:60:4: W0622: Redefining built-in 'max' (redefined-builtin)
+...
+```
+
+The above output indicates that by using the local variable called `max`
+in the `patient_normalise` function,
+we have redefined a built-in Python function called `max`.
+This is not a good idea and may have some undesired effects
+(e.g. if you redefine a built-in name in a global scope
+you may cause yourself some trouble which may be difficult to trace).
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Fix Code Style Errors
+
+Rename our local variable `max` to something else (e.g. call it `max_data`),
+then rerun your tests, commit these latest changes and
+push them to GitHub using our usual feature branch workflow.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+It may be hard to remember to run linter tools every now and then.
+Luckily, we can now add this Pylint execution to our continuous integration builds
+as one of the extra tasks.
+To add Pylint to our CI workflow,
+we can add the following step to our `steps` in `.github/workflows/main.yml`:
+
+```yaml
+...
+    - name: Check style with Pylint
+      run: |
+        python3 -m pylint --fail-under=0 --reports=y inflammation
+...
+```
+
+Note we need to add `--fail-under=0`, otherwise
+the builds will fail if we do not get a 'perfect' score of 10!
+A perfect score seems unlikely, so let us be more pessimistic
+and accept any score for now.
+We have also added `--reports=y` which will give us a more detailed report of the code analysis.
+
+Then we can just add this to our repo and trigger a build:
+
+```bash
+$ git add .github/workflows/main.yml
+$ git commit -m "Add Pylint run to build"
+$ git push origin test-suite
+```
+
+Then once complete, under the build(s) reports you should see
+an entry with the output from Pylint as before,
+but with an extended breakdown of the infractions by category
+as well as other metrics for the code,
+such as the number and line percentages of code, docstrings, comments, and empty lines.
+
+So we specified a minimum score of 0, which is very low.
+If we decide as a team on a suitable minimum score for our codebase,
+we can specify this instead.
+There are also ways to specify particular style rules that, if broken,
+will cause Pylint to fail,
+which could be even more useful if we want to mandate a consistent style.
+
+We can specify overrides to Pylint's rules in a file called `.pylintrc`
+which Pylint can helpfully generate for us.
+In our repository root directory:
+
+```bash
+$ pylint --generate-rcfile > .pylintrc
+```
+
+Looking at this file, you'll see it is already pre-populated.
+No behaviour is currently changed from the default by generating this file,
+but we can amend it to suit our team's coding style.
+For example, a typical rule to customise - favoured by many projects -
+is the one involving line length.
+You'll see the maximum line length is set to 100, so let us set that to a more reasonable 120.
+While we are at it, let us also set our `fail-under` in this file:
+
+```ini
+...
+# Specify a score threshold to be exceeded before program exits with error.
+fail-under=0
+...
+# Maximum number of characters on a single line.
+max-line-length=120
+...
+```
+
+Do not forget to remove the `--fail-under` argument to Pylint
+in our GitHub Actions configuration file too,
+since we do not need it anymore.
+
+Now when we run Pylint we will not be penalised for having a reasonable line length.
+For some further hints and tips on how to approach using Pylint for a project,
+see [this article](https://pythonspeed.com/articles/pylint/).
+
+## Merging to `develop` Branch
+
+Now that we are happy with our test suite, we can merge this work
+(which currently only exists on our `test-suite` branch)
+with our parent `develop` branch.
+Again, this reflects us working with impunity on a logical unit of work,
+involving multiple commits,
+on a separate feature branch until it is ready to be escalated to the `develop` branch.
+
+Be sure to commit all your changes to `test-suite` and then merge to the
+`develop` branch in the usual manner.
+
+```bash
+$ git switch develop
+$ git merge test-suite
+```
+
+Then, assuming there are no conflicts,
+we can push these changes back to the remote repository as we have done before:
+
+```bash
+$ git push origin develop
+```
+
+Now that these changes have migrated to our parent `develop` branch,
+`develop` will also inherit the configuration to run CI builds,
+so these will run automatically on this branch as well.
+
+This highlights a big benefit of CI when you perform merges (and apply pull requests).
+As new branch code is merged into upstream branches like `develop` and `main`
+these newly integrated code changes are automatically tested *together* with existing code -
+which of course may also have been changed by other developers working on the code at the same time.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Unit testing can show us what does not work, but does not help us locate problems in code.
+- Use a **debugger** to help you locate problems in code.
+- A **debugger** allows us to pause code execution and examine its state by adding **breakpoints** to lines in code.
+- Use **preconditions** to ensure correct behaviour of code.
+- Ensure that unit tests check for **edge** and **corner cases** too.
+- Using linting tools to automatically flag suspicious programming language constructs and stylistic errors can help improve code robustness.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/25-section2-optional-exercises.md b/25-section2-optional-exercises.md
new file mode 100644
index 000000000..664f39bf1
--- /dev/null
+++ b/25-section2-optional-exercises.md
@@ -0,0 +1,71 @@
+---
+title: 2.5 Optional Exercises for Section 2
+start: no
+teaching: 0
+exercises: 45
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Further explore how to measure and use test coverage.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What is a desirable way to measure and use test coverage?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+This episode holds some optional exercises for section 2.
+The exercises have an explorative nature, so feel free to go off in any direction that interests you.
+You will be looking at some tools that either complement or are alternatives to those already introduced.
+Even if you find something that you really like,
+we still recommend sticking with the tools that were introduced prior to this episode when you move on to other sections of the course.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Apply to your own project(s)
+
+Apply what you learned in this section to your own project(s).
+You could think of adding unit tests, setting up continuous integration pipelines,
+or measuring the test coverage of your project(s).
+This is the time to ask your instructors or helpers any questions.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Branch coverage versus line coverage
+
+For your test coverage, have a look at the concept of
+[branch coverage](https://about.codecov.io/blog/line-or-branch-coverage-which-type-is-right-for-you/)
+as opposed to just line coverage.
+Which do you prefer and why?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Desirable test coverage
+
+Look at the projects below and see how much test coverage they have.
+Should 100% line (or branch) coverage always be the goal? Why or why not?
+ +- [pytest](https://github.com/pytest-dev/pytest) +- [pyjokes](https://github.com/pyjokes/pyjokes) +- [scikit-learn](https://github.com/scikit-learn/scikit-learn) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Coverage badges + +Try to add a [coverage badge](https://github.com/marketplace/actions/coverage-badge) to the inflammation project. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + + + diff --git a/30-section3-intro.md b/30-section3-intro.md new file mode 100644 index 000000000..8ddba6251 --- /dev/null +++ b/30-section3-intro.md @@ -0,0 +1,173 @@ +--- +title: 'Section 3: Software Development as a Process' +teaching: 10 +exercises: 0 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the differences between writing code and engineering software. +- Define the fundamental stages in a software development process. +- List the benefits of following a process of software development. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can we design and write 'good' software that meets its goals and requirements? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +In this section, we will take a step back from coding development practices and tools +and look at the bigger picture of software as a *process* of development. + +> *"If you fail to plan, you are planning to fail."* +> +> --- Benjamin Franklin + +![](fig/section3-overview.svg){alt='Software design and architecture overview flowchart'} + + + +## Writing Code vs Engineering Software + +Traditionally in academia, software - and the process of writing it - +is often seen as a necessary but throwaway artefact in research. 
+For example, there may be research questions for a given research project,
+code is created to answer those questions,
+the code is run over some data and analysed,
+and finally a publication is written based on those results.
+These steps are often taken informally.
+
+The terms *programming* (or even *coding*) and *software engineering* are often used interchangeably.
+They are not the same thing.
+Programmers or coders tend to focus on one part of software development more than any other:
+implementation.
+In academic research, they are often writing software for themselves,
+where they are their own stakeholders.
+Ideally, they write software from a design
+that fulfils a research goal, in order to publish research papers.
+
+Someone who is engineering software takes a wider view:
+
+- The *lifecycle* of software: recognises that software development is a *process*
+  that proceeds from understanding what is needed,
+  to writing the software and using/releasing it,
+  to what happens afterwards.
+- Who will (or may) be involved: software is written for *stakeholders*.
+  This may only be the researcher initially,
+  but there is an understanding that others may become involved later
+  (even if that is not evident yet).
+  A good rule of thumb is to always assume that
+  code will be read and used by others later on, which includes yourself!
+- Software (or code) is an asset: software inherently contains value -
+  for example, in terms of what it can do,
+  the lessons learned throughout its development,
+  and as an implementation of a research approach
+  (i.e. a particular research algorithm, process, or technical approach).
+- As an asset, it could be reused:
+  again, it may not be evident initially that the software will have use
+  beyond its initial purpose or project,
+  but there is an assumption that the software - or even just a part of it -
+  could be reused in the future.
+ +## Software Development Process + +The typical stages of a software development process can be categorised as follows: + +- **Requirements gathering:** + the process of identifying and recording the exact requirements for a software project + before it begins. + This helps maintain a clear direction throughout development, + and sets clear targets for what the software needs to do. +- **Design:** where the requirements are translated into an overall design for the software. + It covers what will be the basic software 'components' and how they will fit together, + as well as the tools and technologies that will be used, + which will together address the requirements identified in the first stage. +- **Implementation:** the software is developed according to the design, + implementing the solution that meets the requirements + set out in the requirements gathering stage. +- **Testing:** the software is tested with the intent to discover and rectify any defects, + and also to ensure that the software meets its defined requirements, + i.e. does it actually do what it should do reliably? +- **Deployment:** where the software is deployed or in some way released, + and used for its intended purpose within its intended environment. +- **Maintenance:** where updates are made to the software to ensure it remains fit for purpose, + which typically involves fixing any further discovered issues + and evolving it to meet new or changing requirements. + +The process of following these stages, particularly when undertaken in this order, +is referred to as the *waterfall* model of software development: +each stage's outputs flow into the next stage sequentially. + +Whether projects or people that develop software are aware of them or not, +these stages are followed implicitly or explicitly in every software project. +What is required for a project (during requirements gathering) is always considered, for example, +even if it is not explored sufficiently or well understood. 
+
+Following a **process** of development offers some major benefits:
+
+- **Stage gating:** a quality *gate* at the end of each stage,
+  where stakeholders review the stage's outcomes to decide
+  if that stage has completed successfully before proceeding to the next one
+  (and whether the next stage is warranted at all -
+  for example, it may be discovered during requirements or design
+  that development of the software is not practical or even required).
+- **Predictability:** each stage is given attention in a logical sequence;
+  the next stage should not begin until prior stages have completed.
+  Returning to a prior stage is possible and may be needed, but may prove expensive,
+  particularly if an implementation has already been attempted.
+  However, at least this is an explicit and planned action.
+- **Transparency:** essentially, each stage generates output(s) into subsequent stages,
+  which presents opportunities for them to be published
+  as part of an open development process.
+- **Time saving:** a well-known result from
+  [empirical software engineering studies](https://web.archive.org/web/20160731150816/http://superwebdeveloper.com/2009/11/25/the-incredible-rate-of-diminishing-returns-of-fixing-software-bugs/)
+  is that fixing software mistakes is exponentially more expensive in later software development
+  stages.
+  For example, if a mistake takes 1 hour to fix in the requirements stage,
+  it may take 5 times that during design,
+  and perhaps as much as 20 times that to fix if discovered during testing.
+
+In this section we will place the actual writing of software (implementation)
+within the context of a typical software development process:
+
+- Explore the **importance of software requirements**,
+  different classes of requirements,
+  and how we can interpret and capture them.
+- How requirements inform and drive the **design of software**, + the importance, role, and examples of **software architecture**, + and the ways we can describe a software design. +- How to **improve** existing code to be more **readable**, **testable** and **maintainable**. +- Consider different strategies for writing well designed code, including + using **pure functions**, **classes** and **abstractions**. +- How to create, assess and improve **software design**. + + + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Software engineering takes a wider view of software development beyond programming (or coding). +- Ensuring requirements are sufficiently captured is critical to the success of any project. +- Following a process makes software development predictable, saves time in the long run, and helps ensure each stage of development is given sufficient consideration before proceeding to the next. +- Once you get the hang of a programming language, writing code to do what you want is relatively easy. The hard part is writing code that is easy to adapt when your requirements change. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/31-software-requirements.md b/31-software-requirements.md new file mode 100644 index 000000000..0a0d4e56d --- /dev/null +++ b/31-software-requirements.md @@ -0,0 +1,344 @@ +--- +title: 3.1 Software Requirements +teaching: 25 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the different types of software requirements. +- Explain the difference between functional and non-functional requirements. +- Describe some of the different kinds of software and explain how the environment in which software is used constrains its design. +- Derive new user and solution requirements from business requirements. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- Where do we start when beginning a new software project? 
+- How can we capture and organise what is required for software to function as intended?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+The requirements of our software are the basis on which the whole project rests -
+if we get the requirements wrong, we will build the wrong software.
+However, it is unlikely that we will be able to determine all of the requirements upfront.
+Especially when working in a research context,
+requirements are flexible and may change as we develop our software.
+
+## Types of Requirements
+
+Requirements can be categorised in many ways,
+but at a high level a useful way to split them is into
+*business requirements*,
+*user requirements*,
+and *solution requirements*.
+Let us take a look at these now.
+
+### Business Requirements
+
+Business requirements describe what is needed from the perspective of the organisation,
+and define the strategic path of the project,
+e.g. to increase profit margin or market share,
+or embark on a new research area or collaborative partnership.
+These are captured in something like a Business Requirements Specification.
+
+For adapting our inflammation software project, example business requirements could include:
+
+- BR1: improving the statistical quality of clinical trial reporting
+  to meet the needs of external audits
+- BR2: increasing the throughput of trial analyses
+  to meet higher demand during peak periods
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: New Business Requirements
+
+Think of a new hypothetical business-level requirement for this software.
+This can be anything you like, but be sure to keep it at the high level of the business itself.
+
+::::::::::::::: solution
+
+## Solution
+
+One hypothetical new business requirement (BR3) could be
+extending our clinical trial system to keep track of doctors who are involved in the project.
+
+Another hypothetical new business requirement (BR4) may be
+adding a new parameter to the treatment
+and checking if it improves the effect of the drug being tested -
+e.g. taking it in conjunction with omega-3 fatty acids and/or
+increasing physical activity while taking the drug therapy.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### User (or Stakeholder) Requirements
+
+These define what particular stakeholder groups each expect from an eventual solution,
+essentially acting as a bridge between the higher-level business requirements
+and specific solution requirements.
+These are typically captured in a User Requirements Specification.
+
+For our inflammation project,
+they could include things for trial managers such as (building on the business requirements):
+
+- UR1.1 (from BR1):
+  add support for statistical measures in generated trial reports
+  as required by revised auditing standards (standard deviation, ...)
+- UR1.2 (from BR1): add support for producing textual representations of statistics in trial reports
+  as required by revised auditing standards
+- UR2.1 (from BR2): ability to have an individual trial report processed and generated
+  in under 30 seconds (if we assume it usually takes longer than that)
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: New User Requirements
+
+Break down your new business requirements from the
+[previous exercise](31-software-requirements.md)
+into a number of logical user requirements,
+ensuring they stay above the level and detail of implementation.
+
+::::::::::::::: solution
+
+## Solution
+
+For our business requirement BR3 from the previous exercise,
+the new user/stakeholder requirements may be the ability to
+see all the patients a doctor is responsible for (UR3.1),
+and to find out which doctor is looking after any individual patient (UR3.2).
+ +For our business requirement BR4 from the previous exercise, +the new user/stakeholder requirements may be the ability to +see the effect of the drug with and without the additional parameters +in all reports and graphs (UR4.1). + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Solution Requirements + +Solution (or product) requirements describe characteristics that software must have to +satisfy the stakeholder requirements. +They fall into two key categories: + +- *Functional requirements* focus on functions and features of a solution. + For our software, building on our user requirements, e.g.: + - SR1.1.1 (from UR1.1): + add standard deviation to data model and include a graph visualisation view + - SR1.2.1 (from UR1.2): + add a new view to generate a textual representation of statistics, + which is invoked by an optional command line argument +- *Non-functional requirements* focus on + *how* the behaviour of a solution is expressed or constrained, + e.g. performance, security, usability, or portability. + These are also known as *quality of service* requirements. + For our project, e.g.: + - SR2.1.1 (from UR2.1): + generate graphical statistics report on clinical workstation configuration + in under 30 seconds + +::::::::::::::::::::::::::::::::::::::::: callout + +## Labelling Requirements + +Note that the naming scheme we used for labelling our requirements is quite arbitrary - +you should reference them in a way that is consistent +and makes sense within your project and team. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +#### The Importance of Non-functional Requirements + +When considering software requirements, +it is *very* tempting to just think about the features users need. 
+However, many design choices in a software project quite rightly depend on
+the users themselves and the environment in which the software is expected to run,
+and these aspects should be considered as part of the software's non-functional requirements.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Types of Software
+
+Think about some software you are familiar with
+(it could be software you have written yourself or software written by someone else)
+and how the environment it is used in has affected its design or development.
+Here are some examples of questions you can use to get started:
+
+- What environment does the software run in?
+- How do people interact with it?
+- Why do people use it?
+- What features of the software have been affected by these factors?
+- If the software needed to be used in a different environment,
+  what difficulties might there be?
+
+Some examples of design / development choices constrained by environment might be:
+
+- Mobile Apps
+  - Must have graphical interface suitable for a touch display
+  - Usually distributed via a controlled app store
+  - Users will not (usually) modify / compile the software themselves
+  - Should work on a range of hardware specifications
+    with a range of Operating System (OS) versions
+    - But OS is unlikely to be anything other than Android or iOS
+  - Documentation probably in the software itself or on a Web page
+  - Typically written in one of the platform's preferred languages
+    (e.g.
Java, Kotlin, Swift) +- Embedded Software + - May have no user interface - user interface may be physical buttons + - Usually distributed pre-installed on a physical device + - Often runs on low power device with limited memory and CPU performance - + must take care to use these resources efficiently + - Exact specification of hardware is known - + often not necessary to support multiple devices + - Documentation probably in a technical manual with a separate user manual + - May need to run continuously for the lifetime of the device + - Typically written in a lower-level language (e.g. C) for better control of resources + +::::::::::::::: solution + +## Some More Examples + +- Desktop Application + - Has a graphical interface for use with mouse and keyboard + - May need to work on multiple, very different operating systems + - May be intended for users to modify / compile themselves + - Should work on a wide range of hardware configurations + - Documentation probably either in a manual or in the software itself +- Command-line Application - UNIX Tool + - User interface is text based, probably via command-line arguments + - Intended to be modified / compiled by users - though most will choose not to + - Documentation has standard formats - also accessible from the command line + - Should be usable as part of a pipeline +- Command-line Application - High Performance Computing + - Similar to a UNIX Tool + - Usually supports running across multiple networked machines simultaneously + - Usually operated via a scheduler - interface should be scriptable + - May need to run on a wide range of hardware + (e.g. different CPU architectures) + - May need to process large amounts of data + - Often entirely or partially written in a lower-level language for performance + (e.g. 
C, C++, Fortran) +- Web Application + - Usually has components which run on server and components which run on the user's device + - Graphical interface should usually support both Desktop and Mobile devices + - Client-side component should run on a range of browsers and operating systems + - Documentation probably part of the software itself + - Client-side component typically written in JavaScript + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: New Solution Requirements + +Now break down your new user requirements from the +[earlier exercise](31-software-requirements.md) +into a number of logical solution requirements (functional and non-functional), +that address the detail required to be able to implement them in the software. + +::::::::::::::: solution + +## Solution + +For our new hypothetical business requirement BR3, +new functional solution requirements could be extending +the clinical trial system to keep track of: + +- the names of all patients (SR3.1.1) and doctors (SR3.1.2) involved in the trial +- the name of the doctor for a particular patient (SR3.1.3) +- a group of patients being administered by a particular doctor (SR3.2.1). + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Optional Exercise: Requirements for Your Software Project + +Think back to a piece of code or software (either small or large) you have written, +or which you have experience using. +First, try to formulate a few of its key business requirements, +then derive these into user and then solution requirements. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Long- or Short-Lived Code? + +Along with requirements, here is something to consider early on. 
+You, perhaps with others, may be developing open-source software
+with the intent that it will live on after your project completes.
+It could be important to you that your software is adopted and used by other projects
+as this may help you get future funding.
+It can make your software more attractive to potential users
+if they have the confidence that they can fix bugs that arise or add new features they need,
+and if they can be assured that the evolution of the software is not dependent upon
+the lifetime of your project.
+The intended longevity and post-project role of software should be reflected in its requirements -
+particularly within its non-functional requirements -
+so be sure to consider these aspects.
+
+On the other hand, you might want to knock together some code to prove a concept
+or to perform a quick calculation
+and then just discard it.
+But can you be sure you will never want to use it again?
+Maybe a few months from now you will realise you need it after all,
+or you'll have a colleague say "I wish I had a..."
+and realise you have already made one.
+A little effort now could save you a lot in the future.
+
+## From Requirements to Implementation, via Design
+
+In practice, these different types of requirements are sometimes confused and conflated
+when different classes of stakeholder are discussing them, which is understandable:
+each group of stakeholders has a different view of *what is required* from a project.
+The key is to understand the stakeholder's perspective as to
+how their requirements should be classified and interpreted,
+and for that to be made explicit.
+A related misconception is that each of these types is simply
+requirements specified at different levels of detail.
+At each level, not only are the perspectives different,
+but so are the nature of the objectives and the language used to describe them,
+since they each reflect the perspective and language of their stakeholder group.
+
+It is often tempting to go right ahead and implement requirements within existing software,
+but this neglects a crucial step:
+do these new requirements fit within our existing design,
+or does our design need to be revisited?
+It may not need any changes at all,
+but if the new requirements do not fit logically, our design will need a bigger rethink
+so they can be implemented in a sensible way.
+We will look at this a bit later in this section,
+but simply adding new code without considering
+how the design and implementation need to change at a high level
+can make our software increasingly messy and difficult to change in the future.
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- When writing software used for research, requirements will almost *always* change.
+- Consider non-functional requirements (*how* the software will behave) as well as functional requirements (*what* the software is supposed to do).
+- The environment in which users run our software has an effect on many design choices we might make.
+- Consider the expected longevity of any code before you write it.
+- The perspective and language of a particular requirement stakeholder group should be reflected in requirements for that group.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/32-software-architecture-design.md b/32-software-architecture-design.md
new file mode 100644
index 000000000..f15dc89ff
--- /dev/null
+++ b/32-software-architecture-design.md
@@ -0,0 +1,406 @@
+---
+title: 3.2 Software Architecture and Design
+teaching: 25
+exercises: 25
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- List the common aspects of software architecture and design.
+- Describe the term technical debt and how it impacts software.
+- Understand the goals and principles of designing 'good' software.
+- Use a diagramming technique to describe a software architecture.
+- Describe the components of the Model-View-Controller (MVC) architecture.
+- Understand the use of common design patterns to improve the extensibility, reusability and overall quality of software.
+- List some best practices when designing software.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- Why should we invest time in software design?
+- What should we consider when designing software?
+- What is software architecture?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+Ideally, we should have at least a rough design of our software sketched out
+before we write a single line of code.
+This design should be based around the requirements and the structure of the problem we are trying
+to solve: what are the concepts we need to represent in our code,
+what are the relationships between them,
+and, importantly, who will be using our software and how will they interact with it?
+
+As a piece of software grows,
+it will reach a point where there is too much code for us to keep in mind at once.
+At this point, it becomes particularly important to think about the overall design and
+structure of our software, how all the pieces of functionality should fit together,
+and how we should work towards fulfilling this overall design throughout development.
+Even if you did not think about the design of your software from the very beginning -
+it is not too late to start now.
+
+It is not easy to come up with a complete definition for the term **software design**,
+but some of the common aspects are:
+
+- **Software architecture** -
+  what components will the software have and how will they cooperate?
+- **System architecture** -
+  what other things will this software have to interact with and how will it do this?
+- **UI/UX** (User Interface / User Experience) -
+  how will users interact with the software?
+- **Algorithm design** -
+  what method are we going to use to solve the core research/business problem?
+
+There is literature on each of the above software design aspects - we will not go into the details of
+them all here.
+Instead, we will learn some techniques to structure our code better to satisfy some of the
+requirements of 'good' software and revisit
+our software's [MVC architecture](11-software-project.md)
+in the context of software design.
+
+## Poor Design Choices \& Technical Debt
+
+When faced with a problem that you need to solve by writing code, it may be tempting to
+skip the design phase and dive straight into coding.
+What happens if you do not follow good software design and development practices?
+It can lead to accumulated 'technical debt',
+which (according to [Wikipedia](https://en.wikipedia.org/wiki/Technical_debt))
+is the "cost of additional rework caused by choosing an easy (limited) solution now
+instead of using a better approach that would take longer".
+The pressure to achieve project goals can sometimes lead to quick and easy solutions,
+which make the software
+messier, more complex, and more difficult to understand and maintain.
+The extra effort required to make changes in the future is the interest paid on the (technical) debt.
+It is natural for software to accrue some technical debt,
+but it is important to pay off that debt during a maintenance phase -
+simplifying, clarifying the code, making it easier to understand -
+to keep these interest payments on making changes manageable.
+
+There is only so much time available in a project.
+How much effort should we spend on designing our code properly
+and using good development practices?
+The following [XKCD comic](https://xkcd.com/844/) summarises this tension:
+
+![](fig/xkcd-good-code-comic.png){alt='Writing good code comic' .image-with-shadow width="400px" }
+
+At an intermediate level there is a wealth of practices that *could* be used,
+and applying suitable design and coding practices is what separates
+an *intermediate developer* from someone who has just started coding.
+The key for an intermediate developer is to balance these concerns
+for each software project appropriately,
+and employ design and development practices *enough* so that progress can be made.
+It is very easy to under-design software,
+but remember it is also possible to over-design software.
+
+## Good Software Design Goals
+
+Aspirationally, what makes good code can be summarised in the following quote from the
+[Intent HG blog](https://intenthq.com/blog/it-audience/what-is-good-code-a-scientific-definition/):
+
+> *"Good code is written so that is readable, understandable,
+> covered by automated tests, not over complicated
+> and does well what is intended to do."*
+
+Software has become a crucial aspect of reproducible research, as well as an asset that
+can be reused or repurposed.
+Thus, it is even more important to take time to design the software to be easily *modifiable* and
+*extensible*, to save ourselves and our team a lot of time later on when we have
+to fix a problem or the software's requirements change.
+
+Satisfying the above properties will lead to an overall software design
+goal of having *maintainable* code, which is:
+
+- **Understandable** by developers who did not develop the code,
+  by having a clear and well-considered high-level design (or *architecture*) that separates out the different components and aspects of its function logically
+  and in a modular way, and having the interactions between these different parts clear, simple, and sufficiently high-level that they do not contravene this design.
This is known as *separation of concerns*, and is a key principle in good software design.
+  - Moving this forward into implementation, *understandable* would mean being consistent in coding style, using sensible naming conventions for functions, classes and variables, documenting and commenting code, having a simple control flow, and having small functions and methods focused on single tasks.
+- **Adaptable** by designing the code to be easily modifiable and extensible to satisfy new requirements,
+  by incorporating points in the modular design where new behaviour can be added in a clear and straightforward manner
+  (e.g. as individual functions in existing modules, or perhaps at a higher level as plugins).
+  - In an implementation sense, this means writing low-coupled/decoupled code where each part of the code has a separate concern, and has the lowest possible dependency on other parts of the code.
+    This makes it easier to test, update or replace.
+- **Testable** by designing the code in a sufficiently modular way to make it easier to test the functionality within a modular design,
+  either as a whole or in terms of its individual functions.
+  - This would carry forward in an implementation sense in two ways. Firstly, having functions sufficiently small to be amenable to individual (ideally automated) test cases, e.g. by writing unit and regression tests to verify the code produces
+    the expected outputs from controlled inputs and exhibits the expected behaviour over time
+    as the code changes.
+    Secondly, at a higher level in implementation, this would allow functional tests to be written to verify entire pathways through the code, from initial software input to eventual output.
+
+Now that we know what goals we should aspire to, let us take a critical look at the code in our
+software project and try to identify ways in which it can be improved.
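Before we do that, here is a tiny, self-contained illustration of the 'testable' goal above. The `daily_mean` function below follows the style of this project's statistical functions, but treat it as an illustrative sketch rather than the project's exact code: because it is small and has a single responsibility, an automated unit test for it is short and obvious.

```python
import numpy as np

def daily_mean(data):
    """Calculate the per-day (per-column) mean of a 2D inflammation array,
    where rows are patients and columns are days."""
    return np.mean(data, axis=0)

def test_daily_mean():
    # Controlled input with a known expected output.
    data = np.array([[1, 2],
                     [3, 4]])
    np.testing.assert_array_equal(daily_mean(data), np.array([2.0, 3.0]))

test_daily_mean()  # raises AssertionError if the behaviour regresses
```

A test runner such as `pytest` would discover and run `test_daily_mean` automatically; keeping functions this small is exactly what makes such tests easy to write.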
+
+Our software project contains a pre-existing branch `full-data-analysis` that contains code for a new feature of our
+inflammation analysis software, which we will consider as a contribution by another developer.
+Recall that you can see all your branches as follows:
+
+```bash
+$ git branch --all
+```
+
+Once you have saved and committed any current changes,
+check out this `full-data-analysis` branch:
+
+```bash
+$ git switch full-data-analysis
+```
+
+This new feature enables the user to pass a new command-line parameter `--full-data-analysis` causing
+the software to find the directory containing the first input data file (provided via command line
+parameter `infiles`) and invoke the data analysis over all the data files in that directory.
+This bit of functionality is handled by `inflammation-analysis.py` in the project root.
+
+The new data analysis code is located in the `compute_data.py` file within the `inflammation` directory,
+in a function called `analyse_data()`.
+This function loads all the data files for a given directory path, then
+calculates and compares standard deviation across all the data by day and finally plots a graph.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Identify How the Code Can Be Improved
+
+Critically examine the code in the `analyse_data()` function in the `compute_data.py` file.
+
+In what ways does this code not live up to the ideal properties of 'good' code?
+Think about ways in which you find it hard to read and understand.
+Think about the kinds of changes you might want to make to it, and what would
+make those changes challenging.
+
+::::::::::::::: solution
+
+## Solution
+
+You may have found others, but here are some of the things that make the code
+hard to read, test and maintain.
+
+- **Hard to read:** everything is implemented in a single function.
+  In order to understand it, you need to understand how file loading works at the same time as
+  the analysis itself.
+- **Hard to read:** using the `--full-data-analysis` flag changes the meaning of the `infiles` argument
+  to indicate a single data directory, instead of a set of data files, which may cause confusion.
+- **Hard to modify:** if you wanted to use the data for some other purpose and not just
+  plotting the graph, you would have to change the `analyse_data()` function.
+- **Hard to modify or test:** it only analyses a set of CSV data files
+  matching a very particular hardcoded `inflammation*.csv` pattern, which seems an unreasonable assumption.
+  What if someone wanted to use files which do not match this naming convention?
+- **Hard to modify:** it does not have any tests so we cannot be 100% confident the code does
+  what it claims to do; any changes to the code may break something and it would be harder and
+  more time-consuming to figure out what.
+
+Make sure to keep the list you have created in the exercise above.
+For the remainder of this section, we will work on improving this code.
+At the end, we will revisit your list to check that you have learnt ways to address each of the
+problems you found.
+
+There may be other things to improve with the code on this branch, e.g. how command line
+parameters are being handled in `inflammation-analysis.py`, but we are focussing on the
+`analyse_data()` function for the time being.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Software Architecture
+
+A software architecture is the fundamental structure of a software system
+that is typically decided at the beginning of project development
+based on its requirements and is not that easy to change once implemented.
+It refers to a "bigger picture" of a software system
+that describes high-level components (modules) of the system, what their functionality/roles are
+and how they interact.
+
+The basic idea with software architecture design is that you draw boxes that will represent
+different units of code, as well as other components of the system (such as users, databases, etc).
+Then connect these boxes with lines where information or control will be exchanged.
+These lines represent the interfaces in your system.
+
+As well as helping to visualise the work, doing this sketch can help you troubleshoot potential
+issues - for example, a circular dependency between two sections of the design.
+It can also help with estimating how long the work will take, as it forces you to consider all
+the components that need to be made.
+
+Diagrams are not flawless, but are a great starting point to break down the different
+responsibilities and think about the kinds of information different parts of the system will need.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Design a High-Level Architecture for a New Requirement
+
+Sketch out an architectural design for a new feature requested by a user.
+
+*"I want there to be a Google Drive folder such that when I upload new inflammation data to it,
+the software automatically pulls it down and updates the analysis.
+The new result should be added to a database with a timestamp.
+An email should then be sent to a group mailing list notifying them of the change."*
+
+You can draw by hand on a piece of paper or whiteboard, or use an online drawing tool
+such as [Excalidraw](https://excalidraw.com/).
+ +::::::::::::::: solution + +## Solution + +![](fig/example-architecture-diagram.svg){alt='Diagram showing proposed architecture of the problem' width="600px" } + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We have been developing our software using the **Model-View-Controller** (MVC) architecture, +but MVC is just one of the common [software architectural patterns](../learners/software-architecture-extra.md) +and is not the only choice we could have made. + +### Model-View-Controller (MVC) Architecture + +Recall that the MVC architecture divides the related program logic into three interconnected components or modules: + +- **Model** (data) +- **View** (client interface), and +- **Controller** (processes that handle input/output and manipulate the data). + +The *Model* represents the data used by a program and also contains operations/rules +for manipulating and changing the data in the model. +This may be a database, a file, a single data object or a series of objects - +for example a table representing patients' data. + +The *View* is the means of displaying data to users/clients within an application +(i.e. provides visualisation of the state of the model). +For example, displaying a window with input fields and buttons (Graphical User Interface, GUI) +or textual options within a command line (Command Line Interface, CLI) are examples of Views. +They include anything that the user can see from the application. +While building GUIs is not the topic of this course, +we do cover building CLIs (handling command line arguments) in Python to a certain extent. + +The *Controller* manipulates both the Model and the View. +It accepts input from the View +and performs the corresponding action on the Model (changing the state of the model) +and then updates the View accordingly. 
+For example, on user request,
+the Controller updates a picture on a user's GitHub profile
+and then modifies the View by displaying the updated profile back to the user.
+
+### Limitations to Architectural Design
+
+There are limits to everything - and MVC architecture is no exception.
+The Controller often transcends into the Model and View,
+and a clear separation is sometimes difficult to maintain.
+For example, the Command Line Interface provides both the View
+(what the user sees and how they interact with the command line)
+and the Controller (invoking a command) aspects of a CLI application.
+In Web applications, the Controller often manipulates the data (received from the Model)
+before displaying it to the user or passing it from the user to the Model.
+
+There are many variants of an MVC-like pattern
+(such as [Model-View-Presenter](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93presenter) (MVP),
+[Model-View-Viewmodel](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93viewmodel) (MVVM), etc.),
+where the Controller role is handled slightly differently,
+but in most cases, the distinction between these patterns is not particularly important.
+What really matters is that we are making conscious decisions about the architecture of our software
+that suit the way in which we expect to use it.
+We should reuse and be consistent with these established ideas where we can,
+but we do not need to stick to them exactly.
+
+The key thing to take away is the distinction between the Model and the View code, while
+the View and the Controller can be more or less coupled together (e.g. the code that specifies
+there is a button on the screen might be the same code that specifies what that button does).
+The View may be hard to test, or use special libraries to draw the UI, but should not contain any
+complex logic, and is really just a presentation layer on top of the Model.
+The Model, conversely, should not care how the data is displayed.
+For example, the View may present dates as "Monday 24th July 2023",
+but the Model stores them using a `Date` object rather than their string representation.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Reusable "Patterns" of Architecture
+
+[Architectural](https://www.redhat.com/architect/14-software-architecture-patterns) and
+[programming patterns](https://refactoring.guru/design-patterns/catalog) are reusable templates for
+software systems and code that provide solutions for some common software design challenges.
+MVC is one architectural pattern.
+Patterns are a useful starting point for how to design your software and also provide
+a common vocabulary for discussing software designs with other developers.
+They may not always provide a full design solution as some problems may require
+a bespoke design that maps cleanly on to the specific problem you are trying to solve.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Architectural Design Guidelines
+
+Creating good software architecture is not about applying any rules or patterns blindly,
+but about practising and taking care to:
+
+- Discuss design with your colleagues before writing the code.
+- Separate different concerns into different sections of the code.
+- Avoid duplication of code or data.
+- Keep how much a person has to understand at once to a minimum.
+- Try not to have too many abstractions (if you have to jump around a lot when reading the
+  code that is a clue that your code may be too abstract).
+- Think about how your components will interface with other components and external systems.
+- Avoid trying to design a future-proof solution or to anticipate future requirements or adaptations
+  of the software - design the simplest solution that solves the problem at hand.
+- When working on a less well-structured part of the code, start by refactoring it so that your
+  change fits in cleanly.
+- Try to leave the code in a better state than you found it.
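As a minimal, hypothetical sketch of the "separate different concerns" guideline, the date example above could look like this in Python: the Model stores a `date` object, while the View alone decides how to format it (the `Appointment` class and `format_appointment()` function are invented here purely for illustration).

```python
from datetime import date

# Model: stores the data in a structured form, with no formatting concerns.
# (Appointment is a hypothetical class, used only for illustration.)
class Appointment:
    def __init__(self, on):
        self.on = on  # a datetime.date object, not a string

# View: presentation only - turns the Model's data into text for display
def format_appointment(appointment):
    return appointment.on.strftime("%A %d %B %Y")

appointment = Appointment(date(2023, 7, 24))
print(format_appointment(appointment))  # Monday 24 July 2023
```

Because the Model holds a `date` rather than a string, a different View (say, one using ISO dates) can be swapped in without touching the Model at all.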
+ +## Techniques for Good Software Design + +Once we have a good high-level architectural design, +it is important to follow this philosophy through to the process of developing the code itself, +and there are some key techniques to keep in mind that will help. + +As we have discussed, +how code is structured is important for helping people who are developing and maintaining it +to understand and update it. +By breaking down our software into modular components with a single responsibility, +we avoid having to rewrite it all when requirements change. +This also means that these smaller components can be understood individually without having to understand +the entire codebase at once. +The following techniques build on this concept of modularity: + +- *Abstraction* is the process of hiding the implementation details of a piece of + code (typically behind an interface) - i.e. the details of *how* something works are hidden away, + leaving code developers to deal only with *what* it does. + This allows developers to work with the code at a higher level + of abstraction, without needing to understand fully (or keep in mind) all the underlying + details at any given time and thereby reducing the cognitive load when programming. + Abstraction can be achieved through techniques such as *encapsulation*, *inheritance*, and + *polymorphism*, which we will explore in the next episodes. There are other [abstraction techniques](https://en.wikipedia.org/wiki/Abstraction_\(computer_science\)) + available too. + +- *Code decoupling* is a code design technique that involves breaking a (complex) + software system into smaller, more manageable parts, and reducing the interdependence + between these different parts of the system. + This means that a change in one part of the code usually does not require a change in the other, + thereby making its development more efficient and less error prone. 
+
+- *Code refactoring* is the process of improving the design of existing code -
+  changing the internal structure of code without changing its
+  external behavior, with the goal of making the code more readable, maintainable, efficient or easier
+  to test.
+  This can include things such as renaming variables, reorganising
+  functions to avoid code duplication and increase reuse, and simplifying conditional statements.
+
+Writing good code is hard and takes practice.
+You may also be faced with an existing piece of code that breaks some (or all) of the
+good code principles, and your job will be to improve/refactor it so that it can evolve further.
+We will now look into some examples of these techniques that can help us redesign our code
+and incrementally improve its quality.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- 'Good' code is designed to be maintainable: readable by people who did not author the code, testable through a set of automated tests, adaptable to new requirements.
+- Use abstraction and decoupling to logically separate the different aspects of your software within design as well as implementation.
+- Use refactoring to improve existing code, increasing its internal consistency and its fit within the overall architecture.
+- Include software design as a key stage in the lifecycle of your project so that development and maintenance becomes easier.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/33-code-decoupling-abstractions.md b/33-code-decoupling-abstractions.md
new file mode 100644
index 000000000..616741b74
--- /dev/null
+++ b/33-code-decoupling-abstractions.md
@@ -0,0 +1,545 @@
+---
+title: 3.3 Code Decoupling & Abstractions
+teaching: 30
+exercises: 45
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the benefits of code decoupling.
+- Introduce appropriate abstractions to simplify code.
+- Understand the principles of encapsulation, polymorphism and interfaces.
+- Use mocks to replace a class in test code.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What is decoupled code?
+- What are commonly used code abstractions?
+- When is it useful to use classes to structure code?
+- How can we make sure the components of our software are reusable?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+**Code decoupling** refers to breaking up the software into smaller components and reducing the
+interdependence between these components so that they can be tested and maintained independently.
+Two components of code can be considered *decoupled* if a change in one does not
+necessitate a change in the other.
+While two connected units cannot always be totally decoupled, *loose coupling*
+is something we should aim for.
+
+**Code abstraction** is the process of hiding the implementation details of a piece of
+code behind an interface - i.e. the details of *how* something works are hidden away,
+leaving us to deal only with *what* it does.
+This allows developers to work with the code at a higher level
+of abstraction, without needing to understand fully (or keep in mind) all the underlying
+details and thereby reducing the cognitive load when programming.
+
+Abstractions can aid decoupling of code.
+If one part of the code only uses another part through an appropriate abstraction
+then it becomes easier for these parts to change independently.
+Benefits of using these techniques include having a codebase that is:
+
+- easier to read as you only need to understand the
+  details of the (smaller) component you are looking at and not the whole monolithic codebase.
+- easier to test, as one of the components can be replaced
+  by a test or a mock version of it.
+- easier to maintain, as changes can be isolated
+  from other parts of the code.
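The testing benefit in particular can be sketched with a small, hypothetical example: if an analysis function depends only on an abstraction (here, any object with a `read_data()` method) rather than on a concrete file format, a dummy data source can stand in for the real one (`ListReader` and `total()` are invented names, used only for illustration).

```python
class ListReader:
    """A dummy data source: returns fixed, in-memory data (hypothetical example)."""
    def __init__(self, data):
        self.data = data

    def read_data(self):
        return self.data

def total(reader):
    # Depends only on the read_data() interface,
    # not on CSV/JSON/database details
    return sum(reader.read_data())

print(total(ListReader([1, 2, 3])))  # 6
```

A CSV-backed reader with the same `read_data()` method could later replace `ListReader` without any change to `total()`.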
+
+Let us start redesigning our code by introducing some of the abstraction techniques
+to incrementally decouple it into smaller components to improve its overall design.
+
+In the code from our current branch `full-data-analysis`,
+you may have noticed that loading data from CSV files from a `data` directory is "hardcoded" into
+the `analyse_data()` function.
+Data loading is functionality separate from data analysis, so firstly
+let us decouple the data loading part into a separate component (function).
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Decouple Data Loading from Data Analysis
+
+Modify `compute_data.py` to separate out the data loading functionality from `analyse_data()` into a new function
+`load_inflammation_data()`, that returns a list of 2D NumPy arrays with inflammation data
+loaded from all inflammation CSV files found in a specified directory path.
+Then, change your `analyse_data()` function to make use of this new function instead.
+
+::::::::::::::: solution
+
+## Solution
+
+The new function `load_inflammation_data()` that reads all the inflammation data into the
+format needed for the analysis could look something like:
+
+```python
+def load_inflammation_data(dir_path):
+    data_file_paths = glob.glob(os.path.join(dir_path, 'inflammation*.csv'))
+    if len(data_file_paths) == 0:
+        raise ValueError(f"No inflammation CSV files found in path {dir_path}")
+    data = map(models.load_csv, data_file_paths) # Load inflammation data from each CSV file
+    return list(data) # Return the list of 2D NumPy arrays with inflammation data
+```
+
+The modified function `analyse_data()` could then look like:
+
+```python
+def analyse_data(data_dir):
+    data = load_inflammation_data(data_dir)
+
+    means_by_day = map(models.daily_mean, data)
+    means_by_day_matrix = np.stack(list(means_by_day))
+
+    daily_standard_deviation = np.std(means_by_day_matrix, axis=0)
+
+    graph_data = {
+        'standard deviation by day': daily_standard_deviation,
+    }
+    views.visualize(graph_data)
+```
+
+The code is now easier to follow since we do not need to understand the data loading part
+to understand the statistical analysis part, and vice versa.
+In most cases, functions work best when they are short!
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+However, even with this change, the data loading is still coupled with the data analysis to a
+large extent.
+For example, if we have to support loading data from different sources
+(e.g. JSON files or an SQL database), we would have to pass some kind of flag into `analyse_data()`
+indicating the type of data source we want to read from. Instead, we would like to decouple the
+consideration of data source from the `analyse_data()` function entirely.
+One way we can do this is by using *encapsulation* and *classes*.
+
+## Encapsulation \& Classes
+
+**Encapsulation** is the process of packing the "data" and "functions operating on that data" into a
+single component/object.
+It also provides a mechanism for restricting access to that data.
+Encapsulation means that the internal representation of a component is generally hidden
+from view outside of the component's definition.
+
+Encapsulation allows developers to present a consistent interface to the component/object
+that is independent of its internal implementation.
+For example, encapsulation can be used to hide the values or
+state of a structured data object inside a **class**, preventing direct access to them
+that could violate the object's state maintained by the class' methods.
+Note that object-oriented programming (OOP) languages support encapsulation,
+but encapsulation is not unique to OOP.
+
+So, a class is a way of grouping together data with some methods that manipulate that data.
+In Python, you can *declare* a class as follows:
+
+```python
+class Circle:
+    pass
+```
+
+Classes are typically named using the "CapitalisedWords" naming convention - e.g. FileReader,
+OutputStream, Rectangle.
+
+You can *construct* an *instance* of a class elsewhere in the code by doing the following:
+
+```python
+my_circle = Circle()
+```
+
+When you construct a class in this way, the class' *constructor* method is called.
+It is also possible to pass values to the constructor in order to configure the class instance:
+
+```python
+class Circle:
+    def __init__(self, radius):
+        self.radius = radius
+
+my_circle = Circle(10)
+```
+
+The constructor has the special name `__init__`.
+Note it has a special first parameter called `self` by convention - it is
+used to access the current *instance* of the object being created.
+
+A class can be thought of as a cookie cutter template, and instances as the cookies themselves.
+That is, one class can have many instances.
+
+Classes can also have other methods defined on them.
+Like constructors, they have the special parameter `self` that must come first.
+
+```python
+import math
+
+class Circle:
+    ...
+    def get_area(self):
+        return math.pi * self.radius * self.radius
+...
+print(my_circle.get_area())
+```
+
+On the last line of the code above, the instance of the class, `my_circle`, will be automatically
+passed as the first parameter (`self`) when calling the `get_area()` method.
+The `get_area()` method can then access the variable `radius` encapsulated within the object, which
+is otherwise invisible to the world outside of the object.
+The method `get_area()` itself can also be accessed via the object/instance only.
+
+As we can see, the internal representation of any instance of the class `Circle` is hidden
+outside of the class (encapsulation).
+In addition, the implementation of the method `get_area()` is hidden too (abstraction).
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Encapsulation \& Abstraction
+
+Encapsulation provides **information hiding**. Abstraction provides **implementation hiding**.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Use Classes to Abstract out Data Loading
+
+Inside `compute_data.py`, declare a new class `CSVDataSource` that contains the
+`load_inflammation_data()` function we wrote in the previous exercise as a method of this class.
+The directory path to load the files from should be passed to the class' constructor method.
+Finally, construct an instance of the class `CSVDataSource` outside the statistical
+analysis and pass it to the `analyse_data()` function.
+
+> ## Hint
+>
+> At the end of this exercise, the code in the `analyse_data()` function should look like:
+>
+> ```python
+> def analyse_data(data_source):
+>     data = data_source.load_inflammation_data()
+>     ...
+> ```
+>
+> The controller code should look like:
+>
+> ```python
+> data_source = CSVDataSource(os.path.dirname(infiles[0]))
+> analyse_data(data_source)
+> ```
+
+::::::::::::::: solution
+
+## Solution
+
+For example, we can declare the class `CSVDataSource` like this:
+
+```python
+class CSVDataSource:
+    """
+    Loads all the inflammation CSV files within a specified directory.
+    """
+    def __init__(self, dir_path):
+        self.dir_path = dir_path
+
+    def load_inflammation_data(self):
+        data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+        if len(data_file_paths) == 0:
+            raise ValueError(f"No inflammation CSV files found in path {self.dir_path}")
+        data = map(models.load_csv, data_file_paths)
+        return list(data)
+```
+
+In the controller, we create an instance of `CSVDataSource` and pass it
+into the statistical analysis function.
+
+```python
+data_source = CSVDataSource(os.path.dirname(infiles[0]))
+analyse_data(data_source)
+```
+
+The `analyse_data()` function is modified to receive any data source object (that implements
+the `load_inflammation_data()` method) as a parameter.
+
+```python
+def analyse_data(data_source):
+    data = data_source.load_inflammation_data()
+    ...
+```
+
+We have now fully decoupled the reading of the data from the statistical analysis and
+the analysis is not fixed to reading from a directory of CSV files. Indeed, we can pass various
+data sources to this function now, as long as they implement the `load_inflammation_data()`
+method.
+
+While the overall behaviour of the code and its results are unchanged,
+the way we invoke data analysis has changed.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Interfaces
+
+An **interface** is another important concept in software design related to abstraction and
+encapsulation. For a software component, it declares the operations that can be invoked on
+that component, along with input arguments and what it returns.
By knowing these details,
+we can communicate with this component without the need to know how it implements this interface.
+
+An API (Application Programming Interface) is one example of an interface that allows separate
+systems (external to one another) to communicate with each other.
+For example, a request to the Google Maps service API may get
+you the latitude and longitude for a given address.
+The Twitter API may return all tweets that contain
+a given keyword that have been posted within a certain date range.
+
+Internal interfaces within software dictate how
+different parts of the system interact with each other.
+Even when these are not explicitly documented - they still exist.
+
+For example, our `Circle` class implicitly has an interface - you can call the `get_area()` method
+on it and it will return a number representing its surface area.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Identify an Interface Between `CSVDataSource` and `analyse_data`
+
+What would you say is the interface between the `CSVDataSource` class
+and the `analyse_data()` function?
+Think about what functions `analyse_data()` needs to be able to call to perform its duty,
+what parameters they need and what they return.
+
+::::::::::::::: solution
+
+## Solution
+
+The interface is the `load_inflammation_data()` method, which takes no parameters and
+returns a list where each entry is a 2D NumPy array of patient inflammation data (read from some
+data source).
+
+Any object passed into `analyse_data()` should conform to this interface.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Polymorphism
+
+In general, **polymorphism** is the idea of having multiple implementations/forms/shapes
+of the same abstract concept.
+It is the provision of a single interface to entities of different types,
+or the use of a single symbol to represent multiple different types.
+
+There are [different versions of polymorphism](https://www.bmc.com/blogs/polymorphism-programming/).
+For example, method or operator overloading is one
+type of polymorphism enabling methods and operators to take parameters of different types.
+
+We will have a look at *interface-based polymorphism*.
+In OOP, it is possible to have different object classes that conform to the same interface.
+For example, let us have a look at the following class representing a `Rectangle`:
+
+```python
+class Rectangle:
+    def __init__(self, width, height):
+        self.width = width
+        self.height = height
+    def get_area(self):
+        return self.width * self.height
+```
+
+Like `Circle`, this class provides the `get_area()` method.
+The method takes the same number of parameters (none), and returns a number.
+However, the implementation is different. This is interface-based polymorphism.
+
+The word "polymorphism" means "many forms", and in programming it refers to
+methods/functions/operators with the same name that can be executed on many objects or classes.
+
+Using our `Circle` and `Rectangle` classes, we can create a list of different shapes and iterate
+through the list to find their total surface area as follows:
+
+```python
+my_circle = Circle(radius=10)
+my_rectangle = Rectangle(width=5, height=3)
+my_shapes = [my_circle, my_rectangle]
+total_area = sum(shape.get_area() for shape in my_shapes)
+```
+
+Note that we have not created a common superclass or linked the classes `Circle` and `Rectangle`
+together in any way. This is possible due to polymorphism.
+You could also say that, when we are calculating the total surface area,
+the method for calculating the area of each shape is abstracted away to the relevant class.
+
+How can polymorphism be useful in our software project?
+For example, we can replace our `CSVDataSource` with another class that reads a totally
+different file format (e.g. JSON), or reads from an external service or a database.
+All of these changes can now be made without changing the analysis function as we have decoupled
+the process of data loading from the data analysis earlier.
+Conversely, if we wanted to write a new analysis function, we could support any of these
+data sources with no extra work.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Add an Additional DataSource
+
+Create another class that supports loading patient data from JSON files, with the
+appropriate `load_inflammation_data()` method.
+There is a function in `models.py` that loads from JSON in the following format:
+
+```json
+[
+  {
+    "observations": [0, 1]
+  },
+  {
+    "observations": [0, 2]
+  }
+]
+```
+
+Finally, at run-time, construct an appropriate data source instance based on the file extension.
+
+::::::::::::::: solution
+
+## Solution
+
+The class that reads inflammation data from JSON files could look something like:
+
+```python
+class JSONDataSource:
+    """
+    Loads patient data with inflammation values from JSON files within a specified folder.
+    """
+    def __init__(self, dir_path):
+        self.dir_path = dir_path
+
+    def load_inflammation_data(self):
+        data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.json'))
+        if len(data_file_paths) == 0:
+            raise ValueError(f"No inflammation JSON files found in path {self.dir_path}")
+        data = map(models.load_json, data_file_paths)
+        return list(data)
+```
+
+Additionally, in the controller we will need to select an appropriate DataSource instance to
+provide to the analysis:
+
+```python
+_, extension = os.path.splitext(infiles[0])
+if extension == '.json':
+    data_source = JSONDataSource(os.path.dirname(infiles[0]))
+elif extension == '.csv':
+    data_source = CSVDataSource(os.path.dirname(infiles[0]))
+else:
+    raise ValueError(f'Unsupported data file format: {extension}')
+analyse_data(data_source)
+```

+As you can see, all the above changes have been made without modifying
+the analysis code itself.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Testing Using Mock Objects
+
+We can use a **mock object** abstraction to make testing more straightforward.
+Rather than having our tests use real data stored on a file system, we can provide
+a mock or dummy implementation in place of one of the real classes.
+Providing that what we use as a substitute conforms to the same interface,
+the code we are testing should work just the same.
+Such a mock/dummy implementation could just return some fixed example data.
+
+A convenient way to do this in Python is using Python's [mock object library](https://docs.python.org/3/library/unittest.mock.html).
+This is a whole topic in itself -
+but a basic mock can be constructed using a couple of lines of code:
+
+```python
+from unittest.mock import Mock
+
+mock_version = Mock()
+mock_version.method_to_mock.return_value = 42
+```
+
+Here we construct a mock in the same way you would construct a class.
+Then we specify a method that we want to behave a specific way.
+
+Now whenever you call `mock_version.method_to_mock()` the return value will be `42`.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Test Using a Mock Implementation
+
+Complete this test for `analyse_data()`, using a mock object in place of the
+`data_source`:
+
+```python
+from unittest.mock import Mock
+
+def test_compute_data_mock_source():
+    from inflammation.compute_data import analyse_data
+    data_source = Mock()
+
+    # TODO: configure data_source mock
+
+    result = analyse_data(data_source)
+
+    # TODO: add assert on the contents of result
+```
+
+Create a mock that returns some fixed data to use as the `data_source` in order to test
+the `analyse_data()` method.
+
+Do not forget to import `Mock` from the `unittest.mock` package.
+
+::::::::::::::: solution
+
+## Solution
+
+```python
+import math
+
+import numpy.testing as npt
+from unittest.mock import Mock
+
+def test_compute_data_mock_source():
+    from inflammation.compute_data import analyse_data
+    data_source = Mock()
+    data_source.load_inflammation_data.return_value = [[[0, 2, 0]],
+                                                       [[0, 1, 0]]]
+
+    result = analyse_data(data_source)
+    npt.assert_array_almost_equal(result, [0, math.sqrt(0.25), 0])
+```

+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Safe Code Structure Changes
+
+With the changes we have made to the code structure using decoupling and abstractions, we have
+already refactored our code to a certain extent, but we have not tested that the changes work as
+intended.
+We will now look into how to properly refactor code to guarantee that the code still works
+as before any modifications.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Code decoupling is separating code into smaller components and reducing the interdependence between them so that the code is easier to understand, test and maintain.
+- Abstractions can hide certain details of the code behind classes and interfaces.
+- Encapsulation bundles data into a structured component along with methods that operate on the data, and provides a mechanism for restricting access to that data, hiding the internal representation of the component.
+- Polymorphism describes the provision of a single interface to entities of different types, or the use of a single symbol to represent different types.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/34-code-refactoring.md b/34-code-refactoring.md
new file mode 100644
index 000000000..244427bce
--- /dev/null
+++ b/34-code-refactoring.md
@@ -0,0 +1,392 @@
+---
+title: 3.4 Code Refactoring
+teaching: 30
+exercises: 20
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Employ code refactoring to improve the structure of existing code.
+- Understand the use of regression tests to avoid breaking existing code when refactoring.
+- Understand the use of pure functions in software design to make the code easier to read, test and maintain.
+- Refactor a piece of code to separate out 'pure' from 'impure' code.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do you refactor existing code without breaking it?
+- What are the benefits of using pure functions in code?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+Code refactoring is the process of improving the design of an existing codebase - changing the
+internal structure of code without changing its external behavior, with the goal of making the code
+more readable, maintainable, efficient or easier to test. This can include introducing things such
+as code decoupling and abstractions, but also renaming variables, reorganising functions to avoid
+code duplication and increase reuse, and simplifying conditional statements.
+
+In the previous episode, we have already changed the structure of our code (i.e. refactored it
+to a certain extent)
+when we separated out data loading from data analysis but we have not tested that the new code
+works as intended. This is particularly important with bigger code changes but even a small change
+can easily break the codebase and introduce bugs.
+
+When faced with an existing piece of code that needs modifying, a good refactoring
+process to follow is:
+
+1. Make sure you have tests that verify the current behaviour
+2. Refactor the code
+3. Verify that the behaviour of the code is identical to that before refactoring.
+
+In this episode we will further improve the code from our project in the following two ways:
+
+- add more tests so we can be more confident that future changes will have the
+  intended effect and will not break the existing code.
+- further split the `analyse_data()` function into a number of smaller and more
+  decoupled functions (continuing the work from the previous episode).
+
+## Writing Tests Before Refactoring
+
+When refactoring, first we need to make sure there are tests in place that can verify
+the code behaviour as it is now (or write them if they are missing),
+then refactor the code and, finally, check that the original tests still pass.
+
+There is a bit of a "chicken and egg" problem here - if the refactoring is supposed to make it easier
+to write tests in the future, how can we write tests before doing the refactoring?
+The tricks to get around this trap are:
+
+- test at a higher level, with coarser accuracy, and
+- write tests that you intend to replace or remove.
+
+The best tests are the ones that test a single bit of functionality rigorously.
+However, with our current `analyse_data()` code that is not possible because it is a
+large function doing a little bit of everything.
+Instead we will make minimal changes to the code to make it a bit more testable.
+
+Firstly,
+we will modify the function to return the data instead of visualising it because graphs are harder
+to test automatically (i.e. they need to be viewed and inspected manually in order to determine
+their correctness).
+Next, we will make the assert statements verify what the current outcome is, rather than check
+whether that is correct or not.
+Such tests are meant to
+verify that the behaviour does not *change* rather than checking the current behaviour is correct
+(there should be another set of tests checking the correctness).
+This kind of testing is called **regression testing** as we are testing for
+regressions in existing behaviour.
+
+Refactoring code is not meant to change its behaviour, but sometimes, in order to verify
+that you are not changing the important behaviour, you have to make small tweaks to the code to
+write the tests at all.
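As a minimal illustration (with a hypothetical `normalise()` function, not part of our project), a regression test simply pins down the output the current implementation produces, so that a later refactoring can be checked against it:

```python
# Hypothetical function we want to refactor without changing its behaviour
def normalise(values):
    top = max(values)
    return [v / top for v in values]

def test_normalise_regression():
    # The expected value below was captured by running the current
    # implementation, not derived from first principles - it guards
    # against accidental changes in behaviour during refactoring.
    assert normalise([2, 4, 8]) == [0.25, 0.5, 1.0]

test_normalise_regression()
```

After refactoring `normalise()`, rerunning `test_normalise_regression()` tells us whether the observable behaviour survived the change.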
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Write Regression Tests
+
+Modify the `analyse_data()` function not to plot a graph and to return the data instead.
+Then, add a new test file called `test_compute_data.py` in the `tests` folder and
+add a regression test to verify the current output of `analyse_data()`. We will use this test
+in the remainder of this section to verify the output of `analyse_data()` is unchanged each time
+we refactor or change code in the future.
+
+Start from the skeleton test code below:
+
+```python
+from pathlib import Path
+
+def test_analyse_data():
+    from inflammation.compute_data import analyse_data, CSVDataSource
+    path = Path.cwd() / "../data"
+    data_source = CSVDataSource(path)
+    result = analyse_data(data_source)
+
+    # TODO: add assert statement(s) to test the result value is as expected
+```
+
+Use `assert_array_almost_equal` from the `numpy.testing` library to
+compare arrays of floating point numbers.
+
+::::::::::::::: solution
+
+## Hint
+
+When determining the correct return data result to use in tests, it may be helpful to assert the
+result equals some random made-up data, observe the test fail initially and then
+copy and paste the correct result into the test.
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+## Solution
+
+One approach we can take is to:
+
+- comment out the visualise method in `analyse_data()`
+  (otherwise, displaying the graph would block the test until the plot window is closed)
+- return the data (instead of plotting it on a graph), so we can write assert statements
+  on the data
+- see what the calculated result value is, and assert that it is the same as the expected value
+
+Putting this together, our test may look like:
+
+```python
+import numpy.testing as npt
+from pathlib import Path
+
+def test_analyse_data():
+    from inflammation.compute_data import analyse_data, CSVDataSource
+    path = Path.cwd() / "../data"
+    data_source = CSVDataSource(path)
+    result = analyse_data(data_source)
+    expected_output = [0.,0.22510286,0.18157299,0.1264423,0.9495481,0.27118211,
+                       0.25104719,0.22330897,0.89680503,0.21573875,1.24235548,0.63042094,
+                       1.57511696,2.18850242,0.3729574,0.69395538,2.52365162,0.3179312,
+                       1.22850657,1.63149639,2.45861227,1.55556052,2.8214853,0.92117578,
+                       0.76176979,2.18346188,0.55368435,1.78441632,0.26549221,1.43938417,
+                       0.78959769,0.64913879,1.16078544,0.42417995,0.36019114,0.80801707,
+                       0.50323031,0.47574665,0.45197398,0.22070227]
+    npt.assert_array_almost_equal(result, expected_output)
+```
+
+Note that while the above test will detect if we accidentally break the analysis code and
+change the output of the analysis, it is still not a complete test for the following reasons:
+
+- It is not obvious why the `expected_output` is correct
+- It does not test edge cases
+- If the data files in the directory change - the test will fail
+
+We would need to add additional tests to check the above.
+
+### Pure Functions
+
+A pure function in programming works like a mathematical function -
+it takes in some input and produces an output, and that output is
+always the same for the same input.
+That is, the output of a pure function does not depend on any information
+which is not present in the input (such as global variables).
+Furthermore, pure functions do not cause any *side effects* - they do not modify the input data
+or data that exist outside the function (such as printing text, writing to a file or
+changing a global variable). They perform actions that affect nothing but the value they return.
+
+### Benefits of Pure Functions
+
+Pure functions are easier to understand because they eliminate side effects.
+The reader only needs to concern themselves with the input
+parameters of the function and the function code itself, rather than
+the overall context the function is operating in.
+Similarly, a function that calls a pure function is also easier
+to understand - we only need to understand what the function returns, which will probably
+be clear from the context in which the function is called.
+Finally, pure functions are easier to reuse as the caller
+only needs to understand what parameters to provide, rather
+than anything else that might need to be configured prior to the call.
+For these reasons, you should try to write as much of the complex, analytical and mathematical
+code as pure functions as possible.
+
+Some parts of a program are inevitably impure.
+Programs need to read input from users, generate a graph, or write results to a file or a database.
+Well designed programs separate complex logic from the necessary impure "glue" code that
+interacts with users and other systems.
+This way, you have easy-to-read and easy-to-test pure code that contains the complex logic
+and simplified impure code that reads data from a file or gathers user input.
Impure code may +be harder to test but, when simplified like this, may only require a handful of tests anyway. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Refactoring To Use a Pure Function + +Refactor the `analyse_data()` function to delegate the data analysis to a new +pure function `compute_standard_deviation_by_day()` and separate it +from the impure code that handles the input and output. +The pure function should take in the data, and return the analysis result, as follows: + +```python +def compute_standard_deviation_by_day(data): + # TODO + return daily_standard_deviation +``` + +::::::::::::::: solution + +## Solution + +The analysis code will be refactored into a separate function that may look something like: + +```python +def compute_standard_deviation_by_day(data): + means_by_day = map(models.daily_mean, data) + means_by_day_matrix = np.stack(list(means_by_day)) + + daily_standard_deviation = np.std(means_by_day_matrix, axis=0) + return daily_standard_deviation +``` + +The `analyse_data()` function now calls the `compute_standard_deviation_by_day()` function, +while keeping all the logic for reading the data, processing it and showing it in a graph: + +```python +def analyse_data(data_dir): + """Calculates the standard deviation by day between datasets. 
+
+    Gets all the inflammation data from CSV files within a directory, works out the mean
+    inflammation value for each day across all datasets, then visualises the
+    standard deviation of these means on a graph."""
+    data_file_paths = glob.glob(os.path.join(data_dir, 'inflammation*.csv'))
+    if len(data_file_paths) == 0:
+        raise ValueError(f"No inflammation CSV files found in path {data_dir}")
+    data = map(models.load_csv, data_file_paths)
+    daily_standard_deviation = compute_standard_deviation_by_day(data)
+
+    graph_data = {
+        'standard deviation by day': daily_standard_deviation,
+    }
+    # views.visualize(graph_data)
+    return daily_standard_deviation
+```
+
+Make sure to re-run the regression test to check this refactoring has not
+changed the output of `analyse_data()`.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Testing Pure Functions
+
+Now that we have our analysis implemented as a pure function, we can write tests that cover
+all the things we would like to check without depending on CSV files.
+This is another advantage of pure functions - they are very well suited to automated testing,
+i.e. their tests are:
+
+- **easier to write** - we construct input and assert the output
+  without having to think about making sure the global state is correct before or after
+- **easier to read** - the reader will not have to open a CSV file to understand why
+  the test is correct
+- **easier to maintain** - if at some point the data format changes
+  from CSV to JSON, the bulk of the tests need not be updated
+
+::: challenge
+## Exercise: Testing a Pure Function
+
+Add tests for `compute_standard_deviation_by_day()` that check for situations
+when there is only one file with multiple rows,
+multiple files with one row, and any other cases you can think of that should be tested.
+
+:::: solution
+
+You might have thought of more tests, but we can easily extend the test by parametrizing
+with more inputs and expected outputs:
+
+```python
+import math
+
+import numpy.testing as npt
+import pytest
+
+@pytest.mark.parametrize('data,expected_output', [
+    ([[[0, 1, 0], [0, 2, 0]]], [0, 0, 0]),
+    ([[[0, 2, 0]], [[0, 1, 0]]], [0, math.sqrt(0.25), 0]),
+    ([[[0, 1, 0], [0, 2, 0]], [[0, 1, 0], [0, 2, 0]]], [0, 0, 0])
+],
+ids=['Two patients in same file', 'Two patients in different files', 'Two identical patients in two different files'])
+def test_compute_standard_deviation_by_day(data, expected_output):
+    from inflammation.compute_data import compute_standard_deviation_by_day
+
+    result = compute_standard_deviation_by_day(data)
+    npt.assert_array_almost_equal(result, expected_output)
+```
+
+::::
+:::
+
+::: callout
+
+## Functional Programming
+
+**Functional programming** is a programming paradigm where programs are constructed by
+applying and composing/chaining pure functions.
+Some programming languages, such as Haskell or Lisp, support writing pure functional code only.
+Other languages, such as Python, Java and C++, allow mixing **functional** and **procedural**
+programming paradigms.
+Read more about this paradigm, and when it can be very useful to switch to it
+(e.g. to employ the MapReduce approach for data processing),
+in the [extra episode on functional programming](../learners/functional-programming.md).
+
+:::
+
+There are no definite rules in software design, but building your complex logic out of
+composed pure functions is a great place to start when trying to make your code readable,
+testable and maintainable. This is particularly useful for:
+
+* Data processing and analysis
+(for example, using the [Python Pandas library](https://pandas.pydata.org/) for data manipulation, where most of the functions behave like pure functions)
+* Doing simulations, where each update step can be written as a pure function of the current state
+* Translating data from one format to another (for example, converting a table of data from CSV to JSON)
+
+## Programming Paradigms
+
+Until this section, we have mainly been writing procedural code.
+In the previous episode, we touched a bit upon classes, encapsulation and polymorphism,
+which are characteristic of (but not limited to) object-oriented programming (OOP).
+In this episode, we mentioned [pure functions](./index.html#pure-functions)
+and Functional Programming.
+
+These are examples of different [programming paradigms](../learners/programming-paradigms.md),
+which provide varied approaches to structuring your code -
+each with certain strengths and weaknesses when used to solve particular types of problems.
+In many cases, particularly with modern languages, a single language can allow many different
+structural approaches and mixing programming paradigms within your code.
+Once your software begins to get more complex, it is common to use aspects of [different paradigms](../learners/programming-paradigms.md)
+to handle different subtasks.
+Because of this, it is useful to know about the [major paradigms](../learners/programming-paradigms.md),
+so you can recognise where it might be useful to switch.
+A full discussion is outside the scope of this course - we have some extra episodes on the topics of
+[procedural programming](../learners/procedural-programming.md),
+[functional programming](../learners/functional-programming.md) and
+[object-oriented programming](../learners/object-oriented-programming.md) if you want to know more.
+
+::: callout
+
+## So Which One is Python?
+
+Python is a multi-paradigm and multi-purpose programming language.
+You can use it as a procedural language and you can use it in a more object-oriented way.
+It does tend to land more on the object-oriented side as all its core data types
+(strings, integers, floats, booleans, lists,
+sets, arrays, tuples, dictionaries, files)
+as well as functions, modules and classes are objects.
+
+Since functions in Python are also objects that can be passed around like any other object,
+Python is also well suited to functional programming.
+One of the most popular Python libraries for data manipulation,
+[Pandas](https://pandas.pydata.org/) (built on top of NumPy),
+supports a functional programming style
+as most of its functions do not change the data (no side effects)
+but instead produce new data to reflect the result of the function.
+:::
+
+## Software Design and Architecture
+
+In this section so far we have been talking about **software design** - the individual modules and
+components of the software. We are now going to have a brief look at **software architecture** -
+the overall structure that these software components fit into, often following a *design pattern*
+that captures a commonly successful arrangement of software components.
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Code refactoring is a technique for improving the structure of existing code.
+- Implementing regression tests before refactoring gives you confidence that your changes have not broken the code.
+- Using pure functions that process data without side effects whenever possible makes the code easier to understand, test and maintain.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::

diff --git a/35-software-architecture-revisited.md b/35-software-architecture-revisited.md
new file mode 100644
index 000000000..000748bc7
--- /dev/null
+++ b/35-software-architecture-revisited.md
@@ -0,0 +1,365 @@
+---
+title: 3.5 Software Architecture Revisited
+teaching: 15
+exercises: 30
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Analyse new code to identify Model, View, Controller aspects.
+- Refactor new code to conform to an MVC architecture.
+- Adapt our existing code to include the new re-architected code.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How do we handle code contributions that do not fit within our existing architecture? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +In the previous few episodes we have looked at the importance and principles of good software architecture and design, +and how techniques such as code abstraction and refactoring fulfil that design within an implementation, +and help us maintain and improve it as our code evolves. + +Let us now return to software architecture and consider how we may refactor some new code to fit within our existing MVC architectural design using the techniques we have learnt so far. + +## Revisiting Our Software's Architecture + +Recall that in our software project, the **Controller** module is in `inflammation-analysis.py`, +and the View and Model modules are contained in +`inflammation/views.py` and `inflammation/models.py`, respectively. +Data underlying the Model is contained within the directory `data`. + +Looking at the code in the branch `full-data-analysis` (where we should be currently located), +we can notice that the new code was added in a separate script `inflammation/compute_data.py` and +contains a mix of Model, View and Controller code. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Identify Model, View and Controller Parts of the Code + +Looking at the code inside `compute_data.py`, what parts could be considered +Model, View and Controller code? + +::::::::::::::: solution + +## Solution + +- Computing the standard deviation belongs to Model. +- Reading the data from CSV files also belongs to Model. +- Displaying of the output as a graph is View. +- The logic that processes the supplied files is Controller. + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Within the Model further separations make sense. 
+
+For example, as we did before, separating out the impure code that interacts with
+the file system from the pure calculations helps with readability and testability.
+Nevertheless, the MVC architectural pattern is a great starting point when thinking about
+how you should structure your code.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Split out the Model, View and Controller Code
+
+Refactor the `analyse_data()` function so that the Model, View and Controller code
+we identified in the previous exercise is moved to appropriate modules.
+
+::::::::::::::: solution
+
+## Solution
+
+The idea here is for the `analyse_data()` function not to have any "view" considerations.
+That is, it should just compute and return the data and
+should be located in `inflammation/models.py`.
+
+```python
+def analyse_data(data_source):
+    """Calculate the standard deviation by day between datasets.
+    Gets all the inflammation csvs within a directory, works out the mean
+    inflammation value for each day across all datasets, then graphs the
+    standard deviation of these means."""
+    data = data_source.load_inflammation_data()
+    daily_standard_deviation = compute_standard_deviation_by_day(data)
+
+    return daily_standard_deviation
+```
+
+There can be a separate bit of code in the Controller `inflammation-analysis.py`
+that chooses how data should be presented, e.g. as a graph:
+
+```python
+if args.full_data_analysis:
+    _, extension = os.path.splitext(infiles[0])
+    if extension == '.json':
+        data_source = JSONDataSource(os.path.dirname(infiles[0]))
+    elif extension == '.csv':
+        data_source = CSVDataSource(os.path.dirname(infiles[0]))
+    else:
+        raise ValueError(f'Unsupported file format: {extension}')
+    data_result = analyse_data(data_source)
+    graph_data = {
+        'standard deviation by day': data_result,
+    }
+    views.visualize(graph_data)
+    return
+```
+
+Note that this is, more or less, the change we made when writing our regression test.
+This demonstrates that splitting up Model code from View code can
+immediately make your code much more testable.
+Ensure you re-run our regression test to check this refactoring has not
+changed the output of `analyse_data()`.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+At this point, you have refactored and tested all the code on branch `full-data-analysis`
+and it is working as expected. The branch is ready to be incorporated into `develop`
+and then, later on, `main`, which may also have been changed by other developers working on
+the code at the same time, so make sure to update your branch accordingly and resolve any conflicts.
+
+```bash
+$ git switch develop
+$ git merge full-data-analysis
+```
+
+Let us now have a closer look at our Controller, and at how we can handle command line arguments in Python
+(which is something you may find yourself doing often if your code needs to be run from the
+command line).
+
+### Controller Structure
+
+You will have noticed already that the structure of the `inflammation-analysis.py` file
+follows this pattern:
+
+```python
+# import modules
+
+def main(args):
+    # perform some actions
+
+if __name__ == "__main__":
+    # perform some actions before main()
+    main(args)
+```
+
+In this pattern the actions performed by the script are contained within the `main` function
+(which does not need to be called `main`,
+but using this convention helps others in understanding your code).
+The `main` function is then called within the `if` statement `__name__ == "__main__"`,
+after some other actions have been performed
+(usually the parsing of command-line arguments, which will be explained below).
+`__name__` is a special dunder variable which is set,
+along with a number of other special dunder variables,
+by the Python interpreter before the execution of any code in the source file.
+What value is given by the interpreter to `__name__` is determined by
+the manner in which it is loaded.
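We can observe this behaviour with a small, self-contained experiment (the module name `demo_mod` is invented for this sketch and is not part of our project):

```python
import pathlib
import subprocess
import sys
import tempfile

# Write a throwaway module whose only action is to report its own __name__
tmp_dir = tempfile.mkdtemp()
module_path = pathlib.Path(tmp_dir) / "demo_mod.py"
module_path.write_text("print(__name__)\n")

# Case 1: run the file directly - the interpreter sets __name__ to "__main__"
direct_run = subprocess.run(
    [sys.executable, str(module_path)],
    capture_output=True, text=True
).stdout.strip()
print(direct_run)  # __main__

# Case 2: import the file - __name__ is set to the module's name instead
sys.path.insert(0, tmp_dir)
import demo_mod  # the module body runs on import and prints "demo_mod"
```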
+
+If we run the source file directly using the Python interpreter, e.g.:
+
+```bash
+$ python3 inflammation-analysis.py
+```
+
+then the interpreter will assign the hard-coded string `"__main__"` to the `__name__` variable:
+
+```python
+__name__ = "__main__"
+...
+# rest of your code
+```
+
+However, if your source file is imported by another Python script, e.g.:
+
+```python
+import inflammation-analysis
+```
+
+then the interpreter will assign the name `"inflammation-analysis"`
+from the import statement to the `__name__` variable:
+
+```python
+__name__ = "inflammation-analysis"
+...
+# rest of your code
+```
+
+(Strictly speaking, because our file name contains a hyphen, the plain `import` statement
+above is not valid Python syntax - such a module would have to be imported with
+`importlib.import_module("inflammation-analysis")` instead - but the effect on `__name__`
+is the same.)
+
+Because of this behaviour of the interpreter,
+we can put any code that should only be executed when running the script
+directly within the `if __name__ == "__main__":` structure,
+allowing the rest of the code within the script to be
+safely imported by another script if we so wish.
+
+While it may not seem very useful to have your controller script importable by another script,
+there are a number of situations in which you would want to do this:
+
+- for testing of your code, you can have your testing framework import the main script,
+  and run special test functions which then call the `main` function directly;
+- where you want to not only be able to run your script from the command-line,
+  but also provide a programmer-friendly application programming interface (API) for advanced users.
+
+### Passing Command-line Options to Controller
+
+The standard Python library for reading command line arguments passed to a script is
+[`argparse`](https://docs.python.org/3/library/argparse.html).
+This module reads arguments passed by the system,
+and enables the automatic generation of help and usage messages.
+These include, as we saw at the start of this course,
+the generation of helpful error messages when users give the program invalid arguments.
+
+The basic usage of `argparse` can be seen in the `inflammation-analysis.py` script.
+First we import the library: + +```python +import argparse +``` + +We then initialise the argument parser class, passing an (optional) description of the program: + +```python +parser = argparse.ArgumentParser( + description='A basic patient inflammation data management system') +``` + +Once the parser has been initialised we can add +the arguments that we want argparse to look out for. +In our basic case, we want only the names of the file(s) to process: + +```python +parser.add_argument( + 'infiles', + nargs='+', + help='Input CSV(s) containing inflammation series for each patient') +``` + +Here we have defined what the argument will be called (`'infiles'`) when it is read in; +the number of arguments to be expected +(`nargs='+'`, where `'+'` indicates that there should be 1 or more arguments passed); +and a help string for the user +(`help='Input CSV(s) containing inflammation series for each patient'`). + +You can add as many arguments as you wish, +and these can be either mandatory (as the one above) or optional. +Most of the complexity in using `argparse` is in adding the correct argument options, +and we will explain how to do this in more detail below. + +Finally we parse the arguments passed to the script using: + +```python +args = parser.parse_args() +``` + +This returns an object (that we have called `args`) containing all the arguments requested. +These can be accessed using the names that we have defined for each argument, +e.g. `args.infiles` would return the filenames that have been input. + +The help for the script can be accessed using the `-h` or `--help` optional argument +(which `argparse` includes by default): + +```bash +$ python3 inflammation-analysis.py --help +``` + +```output +usage: inflammation-analysis.py [-h] infiles [infiles ...] 
+
+A basic patient inflammation data management system
+
+positional arguments:
+  infiles     Input CSV(s) containing inflammation series for each patient
+
+optional arguments:
+  -h, --help  show this help message and exit
+```
+
+The help page starts with the command line usage,
+illustrating what inputs can be given (any within `[]` brackets are optional).
+It then lists the **positional** and **optional** arguments,
+giving as detailed a description of each as you have added to the `add_argument()` command.
+Positional arguments are arguments that need to be included
+in the proper position or order when calling the script.
+
+Note that optional arguments are indicated by `-` or `--`, followed by the argument name.
+Positional arguments are simply inferred by their position.
+It is possible to have multiple positional arguments,
+but usually this is only practical where all (or all but one) of the positional arguments
+contain a clearly defined number of elements.
+If more than one option can have an indeterminate number of entries,
+then it is better to create them as 'optional' arguments.
+These can still be made required, though,
+by setting `required=True` within the `add_argument()` command.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Positional and Optional Argument Order
+
+The usage section of the help page above shows
+the optional arguments going before the positional arguments.
+This is the customary way to present options, but is not mandatory.
+Instead there are two rules which must be followed for these arguments:
+
+1. Positional and optional arguments must each be given all together, and not inter-mixed.
+   For example, the order can be either "optional, positional" or "positional, optional",
+   but not "optional, positional, optional".
+2. Positional arguments must be given in the order that they are shown
+   in the usage section of the help page.
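These rules can be checked with a small, hypothetical parser (the argument names below are invented for this sketch and are not part of our project):

```python
import argparse

parser = argparse.ArgumentParser(description='Argument order demo')
parser.add_argument('infiles', nargs='+')
parser.add_argument('--full-data-analysis', action='store_true')

# Optional argument before the positional ones...
args = parser.parse_args(['--full-data-analysis', 'a.csv', 'b.csv'])
print(args.infiles)  # ['a.csv', 'b.csv']

# ...or after them - both orders are accepted (rule 1)
args = parser.parse_args(['a.csv', 'b.csv', '--full-data-analysis'])
print(args.full_data_analysis)  # True
```

Intermixing them (e.g. `a.csv --full-data-analysis b.csv`) would break rule 1 and is typically rejected by `argparse`.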
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Additional Reading Material \& References
+
+Now that we have covered and revisited [software architecture](../learners/software-architecture-extra.md)
+and [different programming paradigms](../learners/programming-paradigms.md)
+and how we can integrate them into our architecture,
+there are two optional extra episodes which you may find interesting.
+
+Both episodes cover the persistence layer of software architectures
+and methods of persistently storing data, but take different approaches.
+The episode on [persistence with JSON](../learners/persistence.md) covers
+some more advanced concepts in Object Oriented Programming, while
+the episode on [databases](../learners/databases.md) starts to build towards a true multilayer architecture,
+which would allow our software to handle much larger quantities of data.
+
+## Towards Collaborative Software Development
+
+Having looked at some aspects of software design and architecture,
+we are now circling back to implementing our software design
+and developing our software to satisfy the requirements collaboratively in a team.
+At an intermediate level of software development,
+there is a wealth of practices that could be used,
+and applying suitable design and coding practices is what separates
+an intermediate developer from someone who has just started coding.
+The key for an intermediate developer is to balance these concerns
+for each software project appropriately,
+and to employ design and development practices to a sufficient degree
+that quality is maintained while progress can still be made.
+
+One practice that should always be considered,
+and has been shown to be very effective in team-based software development,
+is that of *code review*.
+Code reviews help to ensure the 'good' coding standards are achieved
+and maintained within a team by having multiple people
+look at and comment on key code changes to see how they fit within the codebase.
+Such reviews check the correctness of the new code, its test coverage and any functionality changes,
+and confirm that the changes follow the coding guidelines and best practices.
+Let us have a look at some code review techniques available to us.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Sometimes new, contributed code needs refactoring for it to fit within an existing codebase.
+- Try to leave the code in a better state than you found it.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+

diff --git a/40-section4-intro.md b/40-section4-intro.md
new file mode 100644
index 000000000..c7b97a249
--- /dev/null
+++ b/40-section4-intro.md
@@ -0,0 +1,94 @@
+---
+title: 'Section 4: Collaborative Software Development for Reuse'
+teaching: 5
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the code review process and employ it to improve the quality of code.
+- Understand the process and best practices for preparing Python code for reuse by others.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What practices help us develop software collaboratively that will make it easier for us and others to further develop and reuse it?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+When changes - particularly big changes - are made to a codebase,
+how can we as a team ensure that these changes are well considered and represent good solutions?
+And how can we increase the overall knowledge of a codebase across a team?
+Sometimes project goals and time pressures take precedence
+and producing maintainable, reusable code is not given the time it deserves.
+So, when a change or a new feature is needed -
+often the shortest route to making it work is taken as opposed to a more carefully thought-out solution.
+For this reason, it is important not to write code alone and in isolation -
+instead, team members should verify each other's code and measure it against shared coding standards.
+This process of having multiple team members comment on key code changes is called *code review* -
+this is one of the most important practices of collaborative software development
+that helps ensure the 'good' coding standards are achieved and maintained within a team,
+as well as increasing knowledge about the codebase across the team.
+We will thus look at the benefits of reviewing code,
+in particular, the value of this type of activity within a team,
+and how this can fit within various ways of team working.
+We will see how GitHub can support code review activities via pull requests,
+and how we can conduct them ourselves, making use of best practices.
+
+After that, we will look at some general principles of software maintainability
+and the benefits that writing maintainable code can give you.
+There will also be some practice at identifying problems with existing code,
+and some general, established practices you can apply
+when writing new code or to the code you have already written.
+We will also look at how we can package software for release and distribution,
+using **Poetry** to manage our Python dependencies
+and produce a code package we can use with a Python package indexing service
+to illustrate these principles.
+
+![](fig/section4-overview.svg){alt='Software design and architecture' .image-with-shadow width="1000px" }
+
+
+
+
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Agreeing on a set of best practices within a software development team will help to improve your software's understandability, extensibility, testability, reusability and overall sustainability.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+

diff --git a/41-code-review.md b/41-code-review.md
new file mode 100644
index 000000000..694784d76
--- /dev/null
+++ b/41-code-review.md
@@ -0,0 +1,721 @@
+---
+title: '4.1 Developing Software In a Team: Code Review'
+teaching: 30
+exercises: 30
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Describe commonly used code review techniques.
+- Understand how to do a pull request via GitHub to engage in code review with a team and contribute to a shared code repository.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do we develop software in a team?
+- What is code review and how can it improve the quality of code?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+So far in this course we've focused on learning software design
+and (some) technical practices, tools and infrastructure that
+help the development of software in a team environment, but in an individual setting.
+Despite developing tests to check our code, no one else from the team had a look at it
+before we merged it into the main development stream.
+Software is often designed and built as part of a team,
+so in this episode we will be looking at how to manage the process of team software development
+and improve our code by engaging in a code review process with other team members.
+
+## Collaborative Code Development Models
+
+The way a team provides contributions to a shared codebase depends on
+the type of development model used in a project.
+Two commonly used models are described below.
+
+### Fork and Pull Model
+
+In the **fork and pull** model, anyone can **fork** an existing repository
+(to create their copy of the project linked to the source)
+and push changes to their personal fork.
+A contributor can work independently on their own fork as they do not need +permissions on the source repository to push modifications to a fork they own. +The changes from contributors can then be **pulled** into the source repository +by the project maintainer on request and after a code review process. +This model is popular with open source projects as it +reduces the start up costs for new contributors +and allows them to work independently without upfront coordination +with source project maintainers. +So, for example, you may use this model when you are an external collaborator on a project +rather than a core team member. + +### Shared Repository Model + +In the **shared repository model**, collaborators are granted push access to a single shared code repository. +By default, collaborators have write access to the main branch. +However, it is best practice to create feature branches for new developments +and protect the main branch from direct and unreviewed commits to keep it stable - see +[GitHub's documentation](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/managing-a-branch-protection-rule) +on how to do this. +While this model of collaboration requires more upfront coordination, +it makes it easier to share each other's work. It works well for more stable teams and +is more prevalent with teams and organisations collaborating on private projects. + +## Code Review + +Regardless of the collaborative code development model your team uses, +[code review][code-review] is one of the widely accepted best practices for software development in teams +and something you should adopt in your development process too. + +Code review is a software quality assurance practice +where one or several people from the team (different from the code's author) +check the software by viewing parts of its source code at the point when the code changes. 
+Code review is very useful for all parties involved -
+someone checks your design or code for errors and gets to learn from your solution;
+having to explain code to someone else clarifies
+your rationale and design decisions in your mind too.
+
+Code review is universally applicable throughout the software development cycle -
+from design to development to maintenance.
+According to Michael Fagan, the author of the
+[code inspection technique](https://en.wikipedia.org/wiki/Fagan_inspection),
+rigorous inspections can remove 60-90% of errors from the code
+even before the first tests are run ([Fagan, 1976](https://doi.org/10.1147%2Fsj.153.0182)).
+Furthermore, according to Fagan,
+the cost to remedy a defect in the early (design) stage is 10 to 100 times less compared to
+fixing the same defect in the development and maintenance stages, respectively.
+Since the cost of bug fixes grows in orders of magnitude throughout the software lifecycle,
+it is far more efficient to find and fix defects
+as close as possible to the point where they were introduced.
+
+There are several **code review techniques** with various degrees of formality
+and use of technical infrastructure, including:
+
+- [Over-the-shoulder code review](https://about.gitlab.com/topics/version-control/what-is-code-review/#Over-the-shoulder%20reviews) -
+  one developer talks the other developer through the code changes while sitting
+  at the same machine.
+- [Pair programming](https://about.gitlab.com/topics/version-control/what-is-code-review/#Pair%20programming) -
+  two developers work on the code at the same time with one of them actively coding and the
+  other providing real-time feedback.
+- [Formal code inspection](https://en.wikipedia.org/wiki/Fagan_inspection) -
+  up to 6 participants go through a formalised process to inspect the code specification or
+  design for defects.
+- [Tool-assisted code review](https://about.gitlab.com/topics/version-control/what-is-code-review/#Tool-assisted%20reviews) -
+  developers use tools such as GitHub to review the code independently and give feedback.
+
+You can read more about these techniques in the ["Five Types of Review" section](https://www.khoury.northeastern.edu/home/lieber/courses/cs4500/f07/lectures/code-review-types.pdf) of the ["Best Kept Secrets of Peer Code Review" eBook](https://www.yumpu.com/en/document/view/19324443/best-kept-secrets-of-peer-code-review-pdf-smartbear).
+
+It is worth trying multiple code review techniques to see what works
+best for you and your team.
+We will have a look at the **tool-assisted code review process**, which is likely to be the most effective and efficient.
+We will use GitHub's built-in code review tool - **pull requests**, or PRs.
+It is a lightweight tool, included for free with GitHub's core service,
+that has gained popularity within the software development community in recent years.
+
+## Code Reviews via GitHub's Pull Requests
+
+Pull requests are fundamental to how teams review and improve code
+on GitHub (and similar code sharing platforms) -
+they let you tell others about changes you have pushed to a branch in a repository on GitHub
+and that your code is ready for review.
+Once a pull request is opened,
+you can discuss and review the potential changes with others on the team
+and add follow-up commits based on the feedback
+before your changes are merged into the development branch.
+The name 'pull request' suggests you are **requesting** the codebase maintainers
+to **pull** your changes into the codebase.
+
+Such changes are normally done on a feature branch,
+to ensure that they are separate and self-contained,
+that the main branch only contains "production-ready" work,
+and that the development branch contains code that has already been extensively tested.
+You create a branch for your work based on one of the existing branches
+(typically the development branch, but it can be any other branch),
+make some commits on that branch,
+and, once you are ready to merge your changes,
+create a pull request to bring the changes back to the branch that you started from.
+In this context, the branch from which you branched off to do your work,
+and to which the changes should be applied back,
+is called the **base branch**,
+while the feature branch that contains the changes you would like to be applied is the **compare branch**.
+
+How you create your feature branches and open pull requests in GitHub will depend on
+your collaborative code development model:
+
+- In the fork and pull model,
+  where you do not have write permissions to the source repository,
+  you need to fork the repository first
+  before you create a feature branch (in your fork) to base your pull request on.
+- In the shared repository model,
+  in order to create a feature branch and open a pull request based on it,
+  you must have write access to the source repository or,
+  for organisation-owned repositories,
+  you must be a member of the organisation that owns the repository.
+  Once you have access to the repository,
+  you proceed to create a feature branch on that repository directly.
+
+In both development models,
+it is recommended to create a feature branch for your work and the subsequent pull request,
+even though you can submit pull requests from any branch or commit.
+This is because, with a feature branch,
+you can push follow-up commits in response to feedback
+and update your proposed changes within a self-contained bundle.
+The only difference in creating a pull request between the two models is
+how you create the feature branch.
+In either model, once you are ready to merge your changes in,
+you will need to specify the base branch and the compare branch.
+
+Let us see this in action -
+you are going to act as a reviewer on a proposed change to the codebase contributed by a
+fictional colleague on the repository of one of your fellow learners.
+One of your fellow learners will review the proposed changes on your repository.
+Once the review is done, you will then take on the role of the fictional colleague
+and respond to the review on your repository.
+If you are completing the course by yourself, you can add the review on the proposed changes in
+your own repository and then respond to your own review comments by fixing the proposed code.
+This is actually a very sensible thing to do in general - looking
+at your own code in a review window will allow you to spot mistakes you
+have not seen before.
+
+Here is an outline of the process of a tool-assisted code review.
+
+
+
+![](fig/code-review-sequence-diagram.svg){alt='Code review process sequence' .image-with-shadow width="600px"}
+
+Recall [solution requirement SR1.1.1](31-software-requirements.md)
+from an earlier episode.
+A fictional colleague has implemented it according to the specification
+and pushed it to a branch `feature-std-dev` of our software repository.
+You will turn this branch into a pull request on your repository.
+You will then engage in code review for the change (acting as a code reviewer) on
+a fellow learner's repository.
+Once complete, you will respond to the comments from another team member on the pull request
+on your repository (acting as a code author).
+
+### Raising a Pull Request
+
+1. Head over to your software repository in GitHub.
+2. Navigate to the `Pull requests` tab.
+3. Create a new pull request by clicking the green `New pull request` button.
+  ![](fig/github-pull-request-tab.png){alt='GitHub pull requests tab' .image-with-shadow width="900px"}
+4. Select the base and the compare branch - `main` and `feature-std-dev`, respectively.
+  Recall that the base branch is where you want your changes to be merged
+  and the compare branch contains the changes.
+5. Click the `Create pull request` button to open the request.
+  ![](fig/github-create-pull-request.png){alt='Creating a new pull request.' .image-with-shadow width="900px"}
+6. Add a comment describing the nature of the changes,
+  and then submit the pull request by clicking the `Create pull request` button (in the new window).
+  ![](fig/github-submit-pull-request.png){alt='Submitting a pull request.' .image-with-shadow width="900px"}
+7. At this point, the code review process is initiated.
+
+We will now discuss what to look for in a code review,
+before practising it on this fictional change.
+
+### Reviewing a Pull Request
+
+Once a pull request has been raised, it is over to the reviewer to review the code
+and submit feedback.
+
+Reviewing code effectively takes practice.
+However, here is some guidance on what you should
+be looking for when reviewing a piece of code.
+
+#### Things to Look for in a Code Review
+
+Start by understanding what the code *should* do, by reading the specification/user requirements,
+the pull request description, or talking to the developer if need be.
+In this case, understand what [SR1.1.1](31-software-requirements.md) means.
+
+Once you are happy, start reading the code (skip the test code for now - we will come back to it later).
+You are going to be assessing the code in the following key areas.
+
+##### Is the proposed code readable?
+
+- Think about the names of the variables and functions - do they [follow naming conventions](15-coding-conventions.md)?
+- Do you understand what the condition in each `if` statement is for?
+- Does each function's name match its behaviour?
+
+##### Is the proposed code a minimal change?
+
+- Does the code reimplement anything that already exists, either
+  elsewhere in the codebase or in a library you know about?
+- Does the code implement something that is not in the requirements or the issue/ticket?
+
+##### Is the structure of the code clear?
+
+- Do functions do just one thing?
+- Is the code using the right level of modularity?
+- Is the code consistent with the structure of the rest of the code?
+
+##### Is there appropriate and up-to-date documentation for the proposed code?
+
+- If functionality has changed, has the corresponding documentation been
+  updated?
+- If new functions have been added, do they have associated documentation?
+- Does the documentation make sense?
+- Are there clear and useful comments that explain complex designs
+  and focus on the "why/because" rather than the "what/how"?
+
+#### Things Not to Look for in a Code Review
+
+The overriding priority for reviewing code should be making sure progress is being made -
+do not let perfect be the enemy of the good here.
+According to ["Best Kept Secrets of Peer Code Review" (Cohen, 2006)](https://www.amazon.co.uk/Best-Kept-Secrets-Peer-Review/dp/1599160676),
+the first hour of reviewing code is the most effective, with diminishing returns after that.
+
+To that end, here are a few things you *should not* be trying to spot when reviewing:
+
+- Linting issues, or anything else that an automated tool can spot - get the Continuous Integration (CI) to do it.
+- Bugs - instead, make sure there are tests for all cases.
+- Issues that pre-date the change - raise separate PRs to fix these and avoid heading down a rabbit hole.
+- Architecture re-writes - try to have design discussions upfront,
+  or else have a meeting to decide whether the code needs to be rewritten.
+
+#### Adding a review comment
+
+Here, we are outlining the process of adding a review to a pull request.
+There is going to be an exercise next for you to practise it.
+
+1. Your fellow learner should add you as a collaborator on their repository to begin with.
+  They can do that from
+  the `Settings` tab of the repository, then the `Collaborators and teams` tab on the left,
+  then clicking the `Add people` button.
+  Once they find you by your GitHub username, you will receive an email invitation to join the
+  repository as a collaborator.
+  You will have to do the same for the collaborator doing the review on your repository.
+
+  ::::::::::::::::::::::::::::::::::::::::: callout
+
+  ## Code Review from External Contributors
+
+  You do not have to be a collaborator on a public repository to do code reviews,
+  so this step is not strictly necessary.
+  We are still asking you to do it now as we will get you working in teams
+  for the rest of the course, so it may make sense to start setting up your collaborators now.
+
+  ![](fig/github-add-collaborator.png){alt='Adding a collaborator in GitHub' .image-with-shadow width="900px"}
+
+  ::::::::::::::::::::::::::::::::::::::::::::::::::
+
+2. Locate the pull request from the `Pull Requests` tab on the home page
+  of your fellow learner's software repository, then head to the `Files changed` tab
+  on the pull request.
+  ![](fig/github-pull-request-files-changed.png){alt='The files changed tab of a pull request' .image-with-shadow width="900px"}
+
+3. When you find a line that you want to add a comment to, click on the blue
+  plus (+) button next to the line. This will bring up a "Write" box to add your comment.
+  ![](fig/github-pull-request-add-comment.png){alt='Adding a review comment to a pull request' .image-with-shadow width="800px"}
+  You can also add comments referring to multiple lines by clicking the plus and
+  dragging down over the relevant lines.
+  If you want to make a concrete suggestion or a change to the code directly,
+  such as renaming a variable, you can click the `Add a suggestion` button
+  (which looks like a document with a plus and a minus in it).
+  This will populate the comment with the existing code, and you can edit it to be
+  what you think the code should be.
+
+  ***Note:** you can only make direct code suggestions if you are a collaborator on a repository.
+  Otherwise, you can only add comments.*
+  ![](fig/github-pull-request-add-suggestion.png){alt='Adding a suggestion to a pull request' .image-with-shadow width="800px"}
+  GitHub will then provide a button for the code author to apply your changes directly.
+
+4. Write your comment in the box, and then click `Start review`.
+  This will save your comment, but not publish it yet.
+  You can use the `Add single comment` button to post a comment immediately.
+  However, it is best to batch the comments into a single review, so that the author
+  knows when you have finished adding comments
+  (and to avoid spamming their email with notifications).
+
+5. Continue adding comments in this way, if you have any, using the `Add review comment` button.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Effective Review Comments
+
+- Make sure your review comments are specific and actionable.
+- Try to be as specific as you can - instead of "this code is unclear",
+  say "I do not understand what values this variable can hold".
+- Make it clear in the comment if you want something to change as part
+  of this pull request.
+- Ideally provide a concrete suggestion (e.g. a better variable name).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Review Some Code
+
+Pair up with a colleague from your group/team and go to the pull request your colleague created on their
+project repository.
+If there is an odd number of people in your group, three people can go in a round-robin fashion
+(the first team member will review the pull request on the second member's repository
+and will receive comments on the pull request on their repository from
+the third team member, and so on).
+If you are going through the material on your own and do not have a collaborator,
+you can be the reviewer on the pull requests on your own repository.
+
+Review the code, looking for the kinds of problems that we have just discussed.
+There are examples of all four main problem areas in the pull request,
+so try to make at least one suggestion for each area.
+
+**Add your review comments but do not submit your review just yet.**
+
+::::::::::::::: solution
+
+## Solution
+
+Here are some of the things you might have found were wrong with the code.
+
+##### Is the proposed code readable?
+
+- The function name `s_dev` is not self-explanatory - it uses an uncommon abbreviation
+  and does not make it immediately clear what the function does without reading the code.
+  A better name is `standard_deviation`.
+- It is not clear what the variable `number` contains - a better option is a business-logic name
+  like `mean` or `mean_of_data`.
+
+##### Is the proposed code a minimal change?
+
+- Could have used `np.std` to compute the standard deviation of the data without having to reimplement
+  it from scratch.
+
+##### Is the structure of the proposed code clear?
+
+- Have the function return the data, rather than having the graph name (a view layer consideration)
+  leak into the model code.
+
+##### Is there appropriate and up-to-date documentation for the proposed code?
+
+- The documentation says it returns the standard deviation, but it actually returns a dictionary containing
+  the standard deviation.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+#### Making Sure Code is Valid
+
+The other key thing you want to verify in a code review is that the code is correct and
+well tested.
+One way to do this is to build up a list of tests you would expect to see
+(and the results you would expect them to have),
+and then verify that all these tests are present and correct.
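To make this concrete, here is a minimal sketch of what a couple of checklist tests might look like for the standard deviation feature discussed above. The function below is an illustrative re-implementation based on the review comments (using `np.std` and returning a dictionary), not the actual code from the pull request, and the dictionary key is an assumption:

```python
import numpy as np

def standard_deviation(data):
    """Sketch of the reviewed feature: standard deviation by day (column)
    across all patients (rows). Illustrative only - the real PR code,
    and the dictionary key, may differ."""
    return {'standard deviation by day': np.std(data, axis=0)}

def test_known_values():
    # Data chosen so the variance ([1, 4]) differs from the standard
    # deviation ([1, 2]) - this catches a function that computes the
    # variance by mistake.
    data = np.array([[0.0, 0.0], [2.0, 4.0]])
    result = standard_deviation(data)['standard deviation by day']
    np.testing.assert_allclose(result, [1.0, 2.0])

def test_zero_spread():
    # Identical observations should give a standard deviation of zero.
    data = np.array([[3.0], [3.0], [3.0]])
    result = standard_deviation(data)['standard deviation by day']
    np.testing.assert_allclose(result, [0.0])

test_known_values()
test_zero_spread()
print("checklist tests passed")
```

Note how each test uses the function's return value in its assertion - a test that never inspects the result can pass even when the code is wrong.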
+
+Start by listing out all the tests you would expect to see based on the specification.
+As you are going through the code, add to this list any other tests you can think
+of, making sure to add tests for:
+
+- All paths through the code.
+- Each `if` statement being evaluated as both `True` and `False`.
+- Executing loops with empty, single-element and multi-element sequences.
+- Edge cases that you spot.
+- Any circumstances where you are not certain how the code would behave.
+
+Once you have the list, go through the tests in the pull request.
+Inspect them closely to make sure they test everything you expect them to.
+
+### Submitting a Review
+
+Once you have a list of tests you want the author to add, it is time to
+submit your review.
+
+1. To do this, click the `Finish your review` button at the top of the `Files changed` tab.
+  ![](fig/github-submit-pull-request-review.png){alt='Using the finishing your review dialog' .image-with-shadow width="800px"}
+  In the comment box, you can add any other comments that are not
+  associated with a specific line.
+  For example, you can put the list of tests that you want to see
+  added here.
+2. Next, you will need to select one of `Comment`, `Approve` or `Request changes`.
+  ![](fig/github-finish-pull-request-review.png){alt='Using the finishing your review dialog' .image-with-shadow width="900px"}
+
+- Use `Approve` if you would be happy for the code to
+  go in with no further changes.
+- Use `Request changes` to communicate to the author that
+  they should address your comments before you will approve it.
+- Use `Comment` if you do not want to express a decision on
+  whether the code should be accepted. For example, if you have been asked
+  to look at a specific part of the code, or if you are part way through
+  a review, but wanted to share some comments sooner.
+
+3. Finally, you can press the `Submit review` button.
+ This will publish all the comments you have made as part of the review and + let the author know that the review is complete and it is their + turn for action. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Review the Code for Suitable Tests + +Remind yourself of the [specification for SR1.1.1](31-software-requirements.md) +and write a list of tests you would expect to see for this feature. +Review the code again and expand this list to include any other +edge cases the code makes you think of. +Go through the tests in the pull request and work out which tests are present. + +Once you are happy, you can submit your review. +Select `Request changes` to let the author know they need to address your comments. + +::::::::::::::: solution + +## Solution + +Your list might include the following: + +1. Standard deviation for one patient with multiple observations. +2. Standard deviation for two patients. +3. Graph includes a standard deviation graph. +4. Standard deviation function should raise an error if given empty data. +5. Computing standard deviation where deviation is different from variance. +6. Standard deviation function should give correct result given negative inputs. +7. Function should work with numpy arrays. + +Looking at the tests in the PR, you might be content that tests for 1, 4 and 7 are present +so you would request changes to add tests 2, 3, 5 and 6. + +In looking at the tests, you may have noticed that the test for numpy arrays is currently +spuriously passing as it does not use the return value from the function in the assert. + +You may have spotted that the function actually computes the variance rather than +the standard deviation. Perhaps that made you think to add the test +for some data where the variance and standard deviation are different. +In more complex examples, it is often easier to spot code that looks like it could be wrong +and think of a test that will exercise it. 
This saves embarrassment if the code turns out +to be right, means you have the missing test written if it is wrong, and is often quicker +than trying to execute the code in your head to find out if it is correct. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Responding to Review Comments + +Once you receive comments on your code, a few different scenarios can occur: + +1. You understand and agree with the reviewer's comments. + In this scenario, you should make the requested change to your branch (or accept the + suggested change by the reviewer) and commit it. + It might be helpful to add a thumbs up reaction to the comment, so the reviewer knows + you have addressed it. Even better, leave a comment such as "Fixed via #commit\_number" with a link + to your commit that implemented the change. + ![](fig/github-respond-to-review-comment-with-emoji.png){alt='Responding to a review comment with an emoji' .image-with-shadow width="800px"} + ![](fig/github-respond-to-review-comment-with-commit-link.png){alt='Responding to a review comment with a link to commit' .image-with-shadow width="800px"} +2. It is not completely clear what the requested change should be - in this scenario + you should reply to such a review to ask for clarification. +3. You disagree with the reviewer - in this scenario, it might be best to talk to them in person. + Discussions that happen on code reviews can often feel quite adversarial - + discussing what the best solution is in person can help defuse this. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Responding to Comments + +Look at the pull request that you created on your repository. +By now you should have someone else's comments on it. +For each comment, either reply explaining why you do not think the change is necessary +or make the change and push a commit fixing it. Reply to each of the comments indicating you +have addressed it. 
+
+At the same time, people will be addressing your comments on the pull request in their repository.
+If you are satisfied that your comment has been suitably addressed, you can mark it as resolved.
+Once all comments have been addressed, you can approve the pull request by submitting
+a new review and this time selecting `Approve`.
+This tells the author you are happy for them to merge the pull request.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Approving a Pull Request
+
+1. Once the reviewer approves the changes, the repository owner can
+  merge the changes into the base branch.
+  Typically, it is the code author's responsibility to merge,
+  but this may differ from team to team.
+  In our case, you will merge the changes on the PR on your repository.
+  ![](fig/github-merge-pull-request.png){alt='Merging a pull request in GitHub' .image-with-shadow width="800px"}
+2. Delete the merged branch to reduce the clutter in the repository.
+
+## Writing Easy-To-Review Code
+
+There are a few things you can do to make it
+as easy as possible for the reviewer to review your code:
+
+- Keep the changes **small**.
+- Keep each commit as **one logical change**.
+- Provide a **clear description** of the change.
+- **Review your code yourself**, before requesting a review.
+
+The most important thing to keep in mind is how long your pull request is.
+Smaller changes that just make one small improvement will be much quicker and easier to review.
+There is no golden rule, but [studies into code review](https://smartbear.com/resources/ebooks/the-state-of-code-review-2020-report/) show that you should not review more
+than 400 lines of code at a time, so this is a reasonable target to aim for.
+You can refer to some [studies](https://jserd.springeropen.com/articles/10.1186/s40411-018-0058-0)
+and [Google recommendations](https://google.github.io/eng-practices/review/developer/small-cls.html)
+as to what counts as a "large pull request", but be aware that it is not an exact science.
+
+Try to keep each commit within your pull request to one logical change.
+This can especially help with larger pull requests that would otherwise be harder to review.
+In particular, if you have reformatted, refactored and changed the behaviour of the
+code, make sure each of these is in a separate commit
+(i.e. reformat the code, commit; refactor the code, commit; alter the behaviour of the code, commit).
+
+Make sure you write a clear description of the content and purpose of the change,
+and provide it as the pull request description.
+It should give the reviewer the context needed to read the code.
+
+It is also a good idea to review your code yourself
+before requesting a review.
+In doing this you will spot the more obvious issues with your code,
+allowing your reviewer to focus on the things you cannot spot.
+
+## Writing Effective Review Comments
+
+Code is written by humans (mostly!), and code review is a form of communication.
+As such, empathy is important for effective reviewing.
+
+When reviewing code, it can sometimes be frustrating when code is confusing, particularly
+when it is implemented differently from how you would have done it.
+However, it is important as a reviewer to be compassionate towards the
+person whose code you are reviewing.
+Specifically:
+
+- Identify positives in the code as and when you find them (particularly if it is an improvement on
+  something you have fed back on in a previous review).
+- Remember that different does not mean better - only request changes if the code is wrong or
+  hard to understand.
+- Only provide a few non-critical suggestions - you are aiming for better rather than perfect.
+- Ask questions to understand why something has been done a certain way rather than assuming you
+  know a better way.
+- If a conversation on a review has not been resolved by a
+  single back-and-forth exchange, then schedule a conversation to discuss it instead
+  (and record the outcome of the discussion in the PR's comments).
+
+## Defining a Review Process For Your Team
+
+To be effective, code review needs to be a process that is followed by everyone
+in the team developing the code.
+Everyone should believe that the process provides value.
+One way to foster this is to agree on the review process as a team and consider, e.g.:
+
+- Whether all changes need to go through code review
+- What technologies you are going to use to manage the review process
+- How quickly you expect someone to review the code once a PR has been raised
+- How long to spend reviewing code
+- What kinds of issues are (and are not) appropriate to raise in a PR
+- How someone will know when they are expected to take action (e.g. review a PR).
+
+You could also consider using pull request states in GitHub:
+
+- Open a pull request in a `DRAFT` state to show progress or request early feedback
+- `READY FOR REVIEW` when you are ready for feedback
+- `CHANGES REQUESTED` to let the author know
+  they need to fix the requested changes or discuss more
+- `APPROVED` to let the author know they can merge their pull request.
+
+Once you have agreed on a review process, you should monitor (either formally or
+informally) how well it is working.
+
+It is important that reviews are processed quickly, to avoid the costly context switching
+involved in the code author moving on to other work and then coming back to their PR.
+Try to set targets for when you would want the first review submitted on a PR
+and the PR merged, based on how your team works.
+If you are regularly missing your targets, then you should review your process to identify
+where things are getting stuck and work out what you can do to move things along.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Optional Exercise: Code Review in Your Own Working Environment
+
+In this episode we have looked at some best practices for code review and practised
+tool-assisted code review with GitHub's pull requests.
+
+Now think about how you and your collaborators typically develop code.
+What benefits do you see for introducing a code review process in
+your work environment?
+How might you institute code review practices within your environment?
+Write down a process for a tool-assisted code review for your team, answering the questions
+above.
+
+Once complete, discuss with the rest of the class the advantages of
+a code review process and the challenges you think you would face in implementing
+this process in your own working environment.
+
+::::::::::::::: solution
+
+## Solution
+
+The purposes of code review include:
+
+- improving internal code readability, understandability, quality and maintainability,
+- checking for coding standards compliance, code uniformity and consistency,
+- checking for test coverage and detecting bugs and code defects early,
+- detecting performance problems and identifying code optimisation points,
+- finding alternative/better solutions,
+- sharing knowledge of the code, and of coding standards and expectations of quality.
+
+Finally, it helps increase the sense of collective code ownership and responsibility,
+which in turn helps increase the "bus factor"
+and reduce the risk resulting from information and capabilities
+being held by a single person "responsible" for a certain part of the codebase
+and not being shared among team members.
+
+Challenges you might face introducing a code review process:
+
+- complaints that it is a waste of time,
+- creating a negative atmosphere where people are overly critical of each other's work,
+  or are defensive of their own,
+- perfectionism leading to slower development,
+- people not sharing code to avoid the review process.
+
+Make sure to monitor whether these are happening, and adjust the process accordingly.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Further Reading
+
+There are multiple perspectives on a code review process -
+from general practices to technical details relating to the different roles involved in the process.
+We have discussed the main points, but do check these useful code review blogs from [Swarmia](https://www.swarmia.com/blog/a-complete-guide-to-code-reviews/?utm_term=code%20review&utm_campaign=Code+review+best+practices&utm_source=adwords&utm_medium=ppc&hsa_acc=6644081770&hsa_cam=14940336179&hsa_grp=131344939434&hsa_ad=552679672005&hsa_src=g&hsa_tgt=kwd-17740433&hsa_kw=code%20review&hsa_mt=b&hsa_net=adwords&hsa_ver=3&gclid=Cj0KCQiAw9qOBhC-ARIsAG-rdn7_nhMMyE7aeSzosRRqZ52vafBOyMrpL4Ypru0PHWK4Rl8QLIhkeA0aAsxqEALw_wcB)
+and [Smartbear](https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/).
+
+The key thing is to try it, and iterate the process until it works well for your team.
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Code review is a team software quality assurance practice where team members look at parts of the codebase in order to improve their code's readability, understandability, quality and maintainability.
+- It is important to agree on a set of best practices and establish a code review process in a team to help sustain good, stable and maintainable code for many years.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/42-software-reuse.md b/42-software-reuse.md new file mode 100644 index 000000000..ca2b67f8f --- /dev/null +++ b/42-software-reuse.md @@ -0,0 +1,528 @@ +--- +title: 4.2 Preparing Software for Reuse and Release +start: no +teaching: 35 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the different levels of software reusability +- Explain why documentation is important +- Describe the minimum components of software documentation to aid reuse +- Create a repository README file to guide others to successfully reuse a program +- Understand other documentation components and where they are useful +- Describe the basic types of open source software licence +- Explain the importance of conforming to data policy and regulation +- Prioritise and work on improvements for release as a team + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What can we do to make our programs reusable by others? +- How should we document and license our code? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +In previous episodes we have looked at skills, practices, and tools to help us +design and develop software in a collaborative environment. +In this lesson we will be looking at +a critical piece of the development puzzle that builds on what we have learnt so far - +sharing our software with others. + +## The Levels of Software Reusability - Good Practice Revisited + +Let us begin by taking a closer look at software reusability and what we want from it. + +Firstly, whilst we want to ensure our software is reusable by others, as well as ourselves, +we should be clear what we mean by 'reusable'. 
+
There are a number of definitions out there,
but a helpful one written by [Benureau and Rougier in 2017](https://dx.doi.org/10.3389/fninf.2017.00069)
offers the following levels by which software can be characterised:

1. Re-runnable: the code is simply executable
   and can be run again (but there are no guarantees beyond that)
2. Repeatable: the software will produce the same result more than once
3. Reproducible: published research results generated from the same version of the software
   can be generated again from the same input data
4. Reusable: easy to use, understand, and modify
5. Replicable: the software can act as an available reference
   for any ambiguity in the algorithmic descriptions made in the published article.
   That is, a new implementation can be created from the descriptions in the article
   that provides the same results as the original implementation,
   and the original - or reference - implementation
   can be used to clarify any ambiguity in those descriptions for the purposes of reimplementation

Later levels imply the earlier ones.
So what should we aim for?
As researchers who develop software - or developers who write research software -
we should be aiming for at least the fourth one: reusability.
Reproducibility is required if we are to successfully claim that
what we are doing when we write software fits within acceptable scientific practice,
but it is also crucial that we can write software that can be *understood*
and ideally *modified* by others.
If others are unable to verify that a piece of software follows published algorithms,
how can they be certain it is producing correct results?
Where 'others', of course, can include a future version of ourselves.
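The gap between 're-runnable' and 'repeatable' is easiest to see with code that involves randomness. Below is a minimal Python sketch (our own illustration, not taken from any lesson codebase): seeding the pseudo-random number generator turns a merely re-runnable simulation into a repeatable one.

```python
import random

def simulate_measurements(num_values, seed=None):
    """Draw pseudo-random 'measurements'; pass a seed to make runs repeatable."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_values)]

# Unseeded runs are re-runnable but two runs almost certainly differ;
# a fixed seed makes every run produce exactly the same result.
assert simulate_measurements(3, seed=42) == simulate_measurements(3, seed=42)
```

Reproducibility then builds on this: recording the seed and input data alongside the software version is what lets published results be regenerated later.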
+

## Documenting Code to Improve Reusability

Reproducibility is a cornerstone of science,
and scientists who work in many disciplines are expected to document
the processes by which they have conducted their research so it can be reproduced by others.
In medicinal, pharmacological, and similar research fields for example,
researchers use logbooks which are then used to write up protocols and methods for publication.

Many things we have covered so far contribute directly to making our software
reproducible - and indeed reusable - by others.
A key part of this we will cover now is software documentation,
which is ironically very often given short shrift in academia.
This is often the case even in fields where
the documentation and publication of research methods is otherwise taken very seriously.

A few reasons for this are that writing documentation is often considered:

- A low priority compared to actual research (if it is even considered at all)
- Expensive in terms of effort, with little reward
- Boring to write!

A very useful form of documentation for understanding our code is code commenting,
which is most effective when used to explain complex interfaces or behaviour,
or the reasoning behind why something is coded a certain way.
But code comments only go so far.

Whilst it is certainly arguable that writing documentation is not as exciting as writing code,
it does not have to be expensive and brings many benefits.
In addition to enabling general reproducibility by others, documentation...
+ +- Helps bring new staff researchers and developers up to speed quickly with using the software +- Functions as a great aid to research collaborations involving software, + where those from other teams need to use it +- When well written, can act as a basis for detailing + algorithms and other mechanisms in research papers, + such that the software's functionality can be *replicated* and re-implemented elsewhere +- Provides a descriptive link back to the science that underlies it. + As a reference, it makes it far easier to know how to + update the software as the scientific theory changes (and potentially vice versa) +- Importantly, it can enable others to understand the software sufficiently to + *modify and reuse* it to do different things + +In the next section we will see that writing +a sensible minimum set of documentation in a single document does not have to be expensive, +and can greatly aid reproducibility. + +### Writing a README + +A README file is the first piece of documentation +(perhaps other than publications that refer to it) +that people should read to acquaint themselves with the software. +It concisely explains what the software is about and what it is for, +and covers the steps necessary to obtain and install the software +and use it to accomplish basic tasks. +Think of it not as a comprehensive reference of all functionality, +but more a short tutorial with links to further information - +hence it should contain brief explanations and be focused on instructional steps. + +Our repository already has a README that describes the purpose of the repository for this workshop, +but let us replace it with a new one that describes the software itself. +First let us delete the old one: + +```bash +$ rm README.md +``` + +In the root of your repository create a replacement `README.md` file. 
+
The `.md` indicates this is a **Markdown** file,
written in a lightweight markup language: essentially plain text with
some extra syntax to indicate formatting.
A big advantage of Markdown files is that they can be read as plain text
or rendered with their formatting applied,
and they are very quick to write.
GitHub provides a very useful [guide to writing Markdown][github-markdown] for its repositories.

Let us start writing `README.md` using a text editor of your choice and add the following line.

```markdown
# Inflam
```

So here, we are giving our software a name.
Ideally something unique, short, snappy, and perhaps to some degree an indicator of what it does.
We would ideally rename the repository to reflect the new name, but let us leave that for now.
In Markdown, a single `#` designates a heading, two (`##`) a subheading, and so on.
The Software Sustainability Institute's
[guide on naming projects][ssi-choosing-name]
and products provides some helpful pointers.

We should also add a short description underneath the title.

```markdown
# Inflam
Inflam is a data management system written in Python that manages trial data used in clinical inflammation studies.
```

To give readers an idea of the software's capabilities, let us add some key features next:

```markdown
# Inflam
Inflam is a data management system written in Python that manages trial data used in clinical inflammation studies.
+

## Main features
Here are some key features of Inflam:

- Provide basic statistical analyses over clinical trial data
- Ability to work on trial data in Comma-Separated Value (CSV) format
- Generate plots of trial data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture
```

As well as knowing what the software aims to do and its key features,
it is very important to specify what other software and related dependencies
are needed to use the software (typically listed under 'dependencies' or 'prerequisites'):

```markdown
# Inflam
Inflam is a data management system written in Python that manages trial data used in clinical inflammation studies.

## Main features
Here are some key features of Inflam:

- Provide basic statistical analyses over clinical trial data
- Ability to work on trial data in Comma-Separated Value (CSV) format
- Generate plots of trial data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture

## Prerequisites
Inflam requires the following Python packages:

- [NumPy](https://www.numpy.org/) - makes use of NumPy's statistical functions
- [Matplotlib](https://matplotlib.org/stable/index.html) - uses Matplotlib to generate statistical plots

The following optional packages are required to run Inflam's unit tests:

- [pytest](https://docs.pytest.org/en/stable/) - Inflam's unit tests are written using pytest
- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
```

Here we are making use of Markdown links,
with some text describing the link within `[]` followed by the link itself within `()`.

One really neat feature of many CI infrastructures - and a common practice - is that
we can include the status of recently run tests within our README file.
+
Just below the `# Inflam` title in our README.md file,
add the following (replacing `<your-github-username>` with your own):

```markdown
# Inflam
![Continuous Integration build in GitHub Actions](https://github.com/<your-github-username>/python-intermediate-inflammation/actions/workflows/main.yml/badge.svg?branch=main)
...
```

This will embed a *badge* (icon) at the top of our page that
reflects the most recent GitHub Actions build status of your software repository,
essentially showing whether the tests that were run
when the last change was made to the `main` branch succeeded or failed.

That's got us started with documenting our code,
but there are other aspects we should also cover:

- *Installation/deployment:* step-by-step instructions for setting up the software so it can be used
- *Basic usage:* step-by-step instructions that cover using the software to accomplish basic tasks
- *Contributing:* for those wishing to contribute to the software's development,
  this is an opportunity to detail what kinds of contribution are sought and how to get involved
- *Contact information/getting help:* which may include things like key author email addresses,
  and links to mailing lists and other resources
- *Credits/acknowledgements:* where appropriate, be sure to credit those who
  have helped in the software's development or inspired it
- *Citation:* particularly for academic software,
  it is a very good idea to specify a reference to an appropriate academic publication
  so other academics can cite use of the software in their own publications and media.
+
  You can do this within a separate
  [CITATION text file](https://github.com/citation-file-format/citation-file-format)
  within the repository's root directory and link to it from the Markdown
- *Licence:* a short description of and link to the software's licence

For more verbose sections,
there are usually just highlights in the README with links to further information,
which may be held within other Markdown files within the repository or elsewhere.

We will finish these off later.
See [Matias Singer's curated list of awesome READMEs](https://github.com/matiassingers/awesome-readme) for inspiration.

### Other Documentation

There are many other types of documentation you should also consider
writing and making available, most of which are beyond the scope of this course.
The key is to consider which audiences you need to write for,
e.g. end users, developers, maintainers, etc.,
and what they need from the documentation.
There is a Software Sustainability Institute
[blog post on best practices for research software documentation](https://www.software.ac.uk/blog/2019-06-21-what-are-best-practices-research-software-documentation)
that helpfully covers the kinds of documentation to consider
and other effective ways to convey the same information.

One type that you should always consider is **technical documentation**.
This typically aims to help other developers understand your code
sufficiently well to make their own changes to it,
including external developers, other members of your team, and a future version of yourself too.
This may include documentation that covers the software's architecture,
including its different components and how they fit together,
API (Application Programming Interface) documentation
that describes the interface points designed into your software for other developers to use,
e.g. for a software library,
or technical tutorials/'HOW TOs' to accomplish developer-oriented tasks.
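In Python, API documentation conventionally starts with docstrings, which tools such as Sphinx can render into reference pages. Here is a small sketch - the `daily_mean` function below and its docstring are invented for illustration, not taken from the Inflam codebase:

```python
def daily_mean(data):
    """Calculate the mean inflammation value for each day across all patients.

    :param data: 2D list of inflammation values, one row per patient,
                 one column per day
    :returns: list containing the mean value for each day
    """
    num_patients = len(data)
    num_days = len(data[0])
    return [sum(row[day] for row in data) / num_patients
            for day in range(num_days)]

print(daily_mean([[1, 2], [3, 4]]))  # → [2.0, 3.0]
```

A docstring like this documents the interface (parameters, return value) rather than the implementation, which is exactly the distinction technical documentation needs to make.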
+

## Choosing an Open Source Licence

Software licensing is a whole topic in itself, so we'll just summarise here.
Your institution's Intellectual Property (IP) team will be able to offer specific guidance that
fits the way your institution thinks about software.

In IP law, software is considered a creative work of literature,
so any code you write automatically has copyright protection applied.
This copyright will usually belong to the institution that employs you,
but this may be different for PhD students.
If you need to check, look at your employment/studentship contract
or talk to your university's IP team.

Since software is automatically under copyright, without a licence no one may:

- Copy it
- Distribute it
- Modify it
- Extend it
- Use it (actually unclear at present - this has not been properly tested in court yet)

Fundamentally there are two kinds of licence,
**Open Source licences** and **Proprietary licences**,
which serve slightly different purposes:

- *Proprietary licences* are designed to pass on limited rights to end users,
  and are most suitable if you want to commercialise your software.
  They tend to be customised to suit the requirements of the software
  and the institution to which it belongs -
  again, your institution's IP team will be able to help here.
- *Open Source licences* are designed more to protect the rights of end users -
  they specifically grant permission to make modifications and redistribute the software to others.
  The [Choose a License website](https://choosealicense.com/) provides recommendations
  and a simple summary of some of the most common open source licences.

Within the open source licences, there are two categories, **copyleft** and **permissive**:

- The permissive licences such as MIT and the multiple variants of the BSD licence
  are designed to give maximum freedom to the end users of software.
+
  These licences allow the end user to do almost anything with the source code.
- Copyleft licences such as the GPL still give a lot of freedom to the end users,
  but any code that they write based on GPLed code must also be licensed under the same licence.
  This gives the developer assurance that anyone building on their code is also
  contributing back to the community.
  It's actually a little more complicated than this,
  and the variants all have slightly different conditions and applicability,
  but this is the core of the licence.

Which of these types of licence you prefer is up to you and those you develop code with.
If you want more information, or help choosing a licence,
the [Choose An Open-Source Licence](https://choosealicense.com/)
or [tl;dr Legal](https://tldrlegal.com/) sites can help.

::::::::::::::::::::::::::::::::::::::: challenge

## Exercise: Preparing for Release

In a (hopefully) highly unlikely and thoroughly unrecommended scenario,
your project leader has informed you of the need to release your software
within the next half hour,
so it can be assessed for use by another team.
You'll need to consider finishing the README,
choosing a licence,
and fixing any remaining problems you are aware of in your codebase.
Ensure you prioritise and work on the most pressing issues first!

::::::::::::::::::::::::::::::::::::::::::::::::::

## Merging into `main`

Once you have done these updates,
commit your changes,
and if you are doing this work on a feature branch also ensure you merge it into `develop`,
e.g.:

```bash
$ git switch develop
$ git merge my-feature-branch
```

Finally, once we have fully tested our software
and are confident it works as expected on `develop`,
we can merge our `develop` branch into `main`:

```bash
$ git switch main
$ git merge develop
$ git push origin main
```

The software on your `main` branch is now ready for release.
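Under that kind of time pressure, a quick scripted self-check can catch missing release essentials. A rough sketch - the `release_checklist` helper and the file names it checks are our own illustration of common conventions, not a standard tool:

```python
from pathlib import Path

# Files users and reviewers commonly expect in a release-ready repository.
EXPECTED_FILES = ["README.md", "LICENSE"]

def release_checklist(repo_root="."):
    """Map each expected file name to whether it exists under repo_root."""
    root = Path(repo_root)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}

for name, present in release_checklist().items():
    print(f"{name}: {'ok' if present else 'MISSING'}")
```

Running it from the repository root gives a quick yes/no per item, leaving you free to spend the remaining time on the actual fixes.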
+

## Tagging a Release in GitHub

There are many ways in which Git and GitHub can help us make a software release from our code.
One of these is via **tagging**,
where we attach a human-readable label to a specific commit.
Let us see what tags we currently have in our repository:

```bash
$ git tag
```

Since we have not tagged any commits yet, there is unsurprisingly no output.
We can create a new tag on the last commit in our `main` branch by doing:

```bash
$ git tag -a v1.0.0 -m "Version 1.0.0"
```

So we can now do:

```bash
$ git tag
```

```output
v1.0.0
```

And also, for more information:

```bash
$ git show v1.0.0
```

You should see something like this:

```output
tag v1.0.0
Tagger: <Name> <email address>
Date: Fri Dec 10 10:22:36 2021 +0000

Version 1.0.0

commit 2df4bfcbfc1429c12f92cecba751fb2d7c1a4e28 (HEAD -> main, tag: v1.0.0, origin/main, origin/develop, origin/HEAD, develop)
Author: <Name> <email address>
Date: Fri Dec 10 10:21:24 2021 +0000

    Finalising README.

diff --git a/README.md b/README.md
index 4818abb..5b8e7fd 100644
--- a/README.md
+++ b/README.md
@@ -22,4 +22,33 @@ Flimflam requires the following Python packages:
 The following optional packages are required to run Flimflam's unit tests:

 - [pytest](https://docs.pytest.org/en/stable/) - Flimflam's unit tests are written using pytest
-- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
\ No newline at end of file
+- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
+
+## Installation
+- Clone the repo ``git clone repo``
+- Check everything runs by running ``python -m pytest`` in the root directory
+- Hurray
+
+## Contributing
+- Create an issue [here](https://github.com/Onoddil/python-intermediate-inflammation/issues)
+  - What works, what does not?
You tell me
+- Randomly edit some code and see if it improves things, then submit a [pull request](https://github.com/Onoddil/python-intermediate-inflammation/pulls)
+- Just yell at me while I edit the code, pair programmer style!
+
+## Getting Help
+- Nice try
+
+## Credits
+- Directed by Michael Bay
+
+## Citation
+Please cite [J. F. W. Herschel, 1829, MmRAS, 3, 177](https://ui.adsabs.harvard.edu/abs/1829MmRAS...3..177H/abstract) if you used this work in your day-to-day life.
+Please cite [C. Herschel, 1787, RSPT, 77, 1](https://ui.adsabs.harvard.edu/abs/1787RSPT...77....1H/abstract) if you actually use this for scientific work.
+
+## License
+This source code is protected under international copyright law. All rights
+reserved and protected by the copyright holders.
+This file is confidential and only available to authorized individuals with the
+permission of the copyright holders. If you encounter this file and do not have
+permission, please contact the copyright holders and delete this file.
\ No newline at end of file
```

Now that we have added a tag, we need this reflected in our GitHub repository.
You can push this tag to your remote by doing:

```bash
$ git push origin v1.0.0
```

::::::::::::::::::::::::::::::::::::::::: callout

## What is a Version Number Anyway?

Software version numbers are everywhere,
and there are many different ways to do it.
A popular one to consider is [**Semantic Versioning**](https://semver.org/),
where a given version number uses the format MAJOR.MINOR.PATCH.
You increment the:

- MAJOR version when you make incompatible API changes
- MINOR version when you add functionality in a backwards compatible manner
- PATCH version when you make backwards compatible bug fixes

You can also add a hyphen followed by characters to denote a pre-release version,
e.g.
1.0.0-alpha1 (first alpha release) or 1.2.3-beta4 (fourth beta release)


::::::::::::::::::::::::::::::::::::::::::::::::::

We can now use the more memorable tag to refer to this specific commit.
Plus, once we have pushed this back up to GitHub,
it appears as a specific release within our code repository
which can be downloaded in compressed `.zip` or `.tar.gz` formats.
Note that these downloads just contain the state of the repository at that commit,
and not its entire history.

Using features like tagging allows us to highlight commits that are particularly important,
which is very useful for *reproducibility* purposes.
We can (and should) refer to specific commits for software in
academic papers that make use of results from software,
but tagging with a specific version number makes that just a little bit easier for humans.

## Conforming to Data Policy and Regulation

We may also wish to make data available to either
be used with the software or as generated results.
This may be via GitHub or some other means.
An important aspect to remember with sharing data on such systems is that
they may reside in other countries,
and we must be careful depending on the nature of the data.

We need to ensure that we are still conforming to
the relevant policies and guidelines regarding how we manage research data,
which may include funding council,
institutional,
national,
and even international policies and laws.
Within Europe, for example, there is the need to conform to things like [GDPR][gdpr].
It is a very good idea to make yourself aware of these aspects.



:::::::::::::::::::::::::::::::::::::::: keypoints

- The reuse battle is won before it is fought. Select and use good practices consistently throughout development and not just at the end.
+

::::::::::::::::::::::::::::::::::::::::::::::::::


diff --git a/43-software-release.md b/43-software-release.md
new file mode 100644
index 000000000..d66f5a421
--- /dev/null
+++ b/43-software-release.md
@@ -0,0 +1,327 @@
---
title: 4.3 Packaging Code for Release and Distribution
teaching: 0
exercises: 20
---

::::::::::::::::::::::::::::::::::::::: objectives

- Describe the steps necessary for sharing Python code as installable packages.
- Use Poetry to prepare an installable package.
- Explain the differences between runtime and development dependencies.

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How do we prepare our code for sharing as a Python package?
- How do we release our project for other people to install and reuse?

::::::::::::::::::::::::::::::::::::::::::::::::::

## Why Package our Software?

We have now got our software ready to release -
the last step is to package it up so that it can be distributed.

For very small pieces of software,
for example a single source file,
it may be appropriate to distribute to non-technical end-users as source code,
but in most cases we want to bundle our application or library into a package.
A package is typically a single file which contains within it our software
and some metadata which allows it to be installed and used more simply -
e.g. a list of dependencies.
By distributing our code as a package,
we reduce the complexity of fetching, installing and integrating it for the end-users.

In this session we will introduce
one widely used method for building an installable package from our code.
There is a range of methods in common use,
so it is likely you will also encounter projects which take different approaches.

There is some confusing terminology in this episode around the use of the term "package".
+
This term is used to refer to both:

- A directory containing Python files / modules and an `__init__.py` - a "module package"
- A way of structuring / bundling a project for easier distribution and installation -
  a "distributable package"

## Packaging our Software with Poetry

### Installing Poetry

Because we have recommended Git Bash if you are using Windows,
we are going to install Poetry using a different method to the officially recommended one.
If you are on MacOS or Linux,
are comfortable with installing software at the command line
and want to use Poetry to manage multiple projects,
you may instead prefer to follow the official
[Poetry installation instructions](https://python-poetry.org/docs/#installation).

We can install Poetry much like any other Python distributable package, using `pip`:

```bash
$ source venv/bin/activate
$ python3 -m pip install poetry
```

To test, we can ask where Poetry is installed:

```bash
$ which poetry
```

```output
/home/alex/python-intermediate-inflammation/venv/bin/poetry
```

If you do not get similar output,
make sure you have got the correct virtual environment activated.

Poetry can also handle virtual environments for us,
so in order to behave similarly to how we used them previously,
let us change the Poetry config to put them in the same directory as our project:

```bash
$ poetry config virtualenvs.in-project true
```

### Setting up our Poetry Config

Poetry uses a **pyproject.toml** file to describe
the build system and requirements of the distributable package.
This file format was introduced to solve problems with bootstrapping packages
(the processing we do to prepare to process something)
using the older convention with **setup.py** files and to support a wider range of build tools.
It is described in
[PEP 518 (Specifying Minimum Build System Requirements for Python Projects)](https://www.python.org/dev/peps/pep-0518/).
+ +Make sure you are in the root directory of your software project +and have activated your virtual environment, +then we are ready to begin. + +To create a `pyproject.toml` file for our code, we can use `poetry init`. +This will guide us through the most important settings - +for each prompt, we either enter our data or accept the default. + +*Displayed below are the questions you should see +with the recommended responses to each question so try to follow these, +although use your own contact details!* + +**NB: When you get to the questions about defining our dependencies, +answer no, so we can do this separately later.** + +```bash +$ poetry init +``` + +```output +This command will guide you through creating your pyproject.toml config. + +Package name [example]: inflammation +Version [0.1.0]: 1.0.0 +Description []: Analyse patient inflammation data +Author [None, n to skip]: James Graham +License []: MIT +Compatible Python versions [^3.11]: ^3.11 + +Would you like to define your main dependencies interactively? (yes/no) [yes] no +Would you like to define your development dependencies interactively? (yes/no) [yes] no +Generated file + +[tool.poetry] +name = "inflammation" +version = "1.0.0" +description = "Analyse patient inflammation data" +authors = ["James Graham "] +license = "MIT" + +[tool.poetry.dependencies] +python = "^3.11" + +[tool.poetry.dev-dependencies] + +[build-system] +requires = ["poetry-core>=1.0.0"] +build-backend = "poetry.core.masonry.api" + + +Do you confirm generation? (yes/no) [yes] yes +``` + +Note that we have called our package "inflammation" in the setup above, +instead of "inflammation-analysis". +This is because Poetry will automatically find our code +if the name of the distributable package matches the name of our module package. 
+
If we wanted our distributable package to have a different name,
for example "inflammation-analysis",
we could do this by explicitly listing the module packages to bundle -
see [the Poetry docs on packages](https://python-poetry.org/docs/pyproject/#packages)
for how to do this.

### Project Dependencies

Previously, we looked at using a `requirements.txt` file to define the dependencies of our software.
Here, Poetry takes inspiration from package managers in other languages,
particularly NPM (Node Package Manager),
often used for JavaScript development.

Tools like Poetry and NPM understand that there are two different types of dependency:
runtime dependencies and development dependencies.
Runtime dependencies are those dependencies that
need to be installed for our code to run, like NumPy.
Development dependencies are dependencies which
are an essential part of your development process for a project,
but are not required to run it.
Common examples of development dependencies are linters and test frameworks,
like `pylint` or `pytest`.

When we add a dependency using Poetry,
Poetry will add it to the list of dependencies in the `pyproject.toml` file,
add a reference to it in a new `poetry.lock` file,
and automatically install the package into our virtual environment.
If we do not yet have a virtual environment activated,
Poetry will create it for us - using the name `.venv`,
so it appears hidden unless we do `ls -a`.
Because we have already activated a virtual environment, Poetry will use ours instead.
The `pyproject.toml` file has two separate lists,
allowing us to distinguish between runtime and development dependencies.

```bash
$ poetry add matplotlib numpy
$ poetry add --group dev pylint
$ poetry install
```

These two sets of dependencies will be used in different circumstances.
When we build our package and upload it to a package repository,
Poetry will only include references to our runtime dependencies.
+This is because someone installing our software through a tool like `pip` is only using it, +but probably does not intend to contribute to the development of our software +and does not require development dependencies. + +In contrast, if someone downloads our code from GitHub, +together with our `pyproject.toml`, +and installs the project that way, +they will get both our runtime and development dependencies. +If someone is downloading our source code, +that suggests that they intend to contribute to the development, +so they will need all of our development tools. + +Have a look at the `pyproject.toml` file again to see what's changed. + +### Packaging Our Code + +The final preparation we need to do is to +make sure that our code is organised in the recommended structure. +This is the Python module structure - +a directory containing an `__init__.py` and our Python source code files. +Make sure that the name of this Python package +(`inflammation` - unless you have renamed it) +matches the name of your distributable package in `pyproject.toml` +unless you have chosen to explicitly list the module packages. + +By convention distributable package names use hyphens, +whereas module package names use underscores. +While we could choose to use underscores in a distributable package name, +we cannot use hyphens in a module package name, +as Python will interpret them as a minus sign in our code when we try to import them. + +Once we have got our `pyproject.toml` configuration done and our project is in the right structure, +we can go ahead and build a distributable version of our software: + +```bash +$ poetry build +``` + +This should produce two files for us in the `dist` directory. +The one we care most about is the `.whl` or **wheel** file. +This is the file that `pip` uses to distribute and install Python packages, +so this is the file we would need to share with other people who want to install our software. 
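Wheel filenames are not arbitrary - they follow the pattern `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl` from the wheel specification (originally PEP 427). The sketch below (our own `wheel_filename` helper, not part of Poetry) builds the name we would expect for a pure-Python package:

```python
def wheel_filename(distribution, version,
                   python_tag="py3", abi_tag="none", platform_tag="any"):
    """Build the conventional wheel filename for a pure-Python package.

    The defaults spell 'py3-none-any': any Python 3, no compiled ABI,
    any platform - which is what Poetry produces for pure-Python code.
    """
    return f"{distribution}-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"

print(wheel_filename("inflammation", "1.0.0"))  # → inflammation-1.0.0-py3-none-any.whl
```

Knowing this convention explains why a glob pattern like `dist/inflammation*.whl` reliably finds the built package.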
+

Now if we gave this wheel file to someone else,
they could install it using `pip` -
you do not need to run this command yourself,
since you have already installed the package using `poetry install` above.

```bash
$ python3 -m pip install dist/inflammation*.whl
```

The star in the line above is a **wildcard**,
which means Bash should use any filenames that match that pattern,
with any number of characters in place of the star.
We could also rely on Bash's autocomplete functionality and type `dist/inflammation`,
then hit the Tab key if we have only got one version built.

After we have been working on our code for a while and want to publish an update,
we just need to update the version number in the `pyproject.toml` file
(using [SemVer](https://semver.org/) perhaps),
then use Poetry to build and publish the new version.
If we do not increment the version number,
people might end up using this version,
even though they thought they were using the previous one.
Any re-publishing of the package, no matter how small the changes,
needs to come with a new version number.
The advantage of [SemVer](https://semver.org/) is that the change in the version number
indicates the degree of change in the code and thus the degree of risk of breakage when we update.

```bash
$ poetry build
```

In addition to the commands we have already seen,
Poetry contains a few more that can be useful for our development process.
For the full list see the [Poetry CLI documentation](https://python-poetry.org/docs/cli/).

The final step is to publish our package to a package repository.
A package repository could be either public or private -
while you may at times be working on public projects,
it is likely the majority of your work will be published internally
using a private repository such as JFrog Artifactory.
Every repository may be configured slightly differently,
so we will leave that to you to investigate.

## What if We Need More Control?
+
+Sometimes we need more control over the process of
+building our distributable package than Poetry allows.
+There are many ways to distribute Python code in packages,
+with some degree of flux in terms of which methods are most popular.
+For a more comprehensive overview of Python packaging you can see the
+[Python docs on packaging](https://packaging.python.org/en/latest/),
+which contain a helpful guide to the overall
+[packaging process, or 'flow'](https://packaging.python.org/en/latest/flow/),
+using the [Twine](https://pypi.org/project/twine/) tool as an alternative way to
+upload created packages to PyPI for distribution.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Optional Exercise: Enhancing our Package Metadata
+
+The [Python Packaging User Guide](https://packaging.python.org/)
+provides documentation on
+[how to package a project](https://packaging.python.org/en/latest/tutorials/packaging-projects/)
+using a manual approach to building a `pyproject.toml` file,
+and using Twine to upload the distribution packages to PyPI.
+
+Referring to the
+[section on metadata](https://packaging.python.org/en/latest/tutorials/packaging-projects/#configuring-metadata)
+in the documentation,
+enhance your `pyproject.toml` with some additional metadata fields
+to improve the information about your package.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Poetry allows us to produce an installable package and upload it to a package repository.
+- Making our software installable with `pip` makes it easier for others to start using it.
+- For complete control over building a package, we can use a `setup.py` file.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/50-section5-intro.md b/50-section5-intro.md
new file mode 100644
index 000000000..7c5a01417
--- /dev/null
+++ b/50-section5-intro.md
@@ -0,0 +1,94 @@
+---
+title: 'Section 5: Managing and Improving Software Over Its Lifetime'
+teaching: 5
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Use established tools to track and manage software problems and enhancements in a team.
+- Understand the importance of critical reflection to improving software quality and reusability.
+- Improve software through feedback, work estimation, prioritisation and agile development.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do we manage the process of developing and improving our software?
+- How do we ensure we reuse other people's code while maintaining the sustainability of our own software?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+In this section of the course we look at managing the **development and evolution** of software -
+how to keep track of the tasks the team has to do,
+how to improve the quality and reusability of our software for others as well as ourselves,
+and how to assess other people's software for reuse within our project.
+The focus in this section will move beyond just software development to **software management**:
+internal planning and prioritising tasks for future development,
+management of internal communication as well as
+how the outside world interacts with and makes use of our software,
+how others can interact with us to report issues,
+and the ways we can successfully manage software improvement in response to feedback.
+
+![](fig/section5-overview.svg){alt='Managing software' .image-with-shadow width="1000px" }
+
+
+
+In this section we will:
+
+- Use GitHub to **track issues with our software** registered by ourselves and external users.
+- Use GitHub's **Mentions** and notifications system to
+  effectively **communicate within the team** on software development tasks.
+- Use GitHub's **Project Boards** and **Milestones** for project planning and management.
+- Learn to manage the **improvement of our software through feedback**
+  using **agile** management techniques.
+- Employ **effort estimation** of development tasks
+  as a foundational tool for prioritising future team work,
+  and use the **MoSCoW approach** and software development **sprints** to manage improvement.
+  As we will see, it is very difficult to prioritise work effectively
+  without knowing both its relative importance to others
+  and the effort required to deliver those work items.
+- Learn how to employ a critical mindset when reviewing software for reuse.
+
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- For software to succeed it needs to be managed as well as developed.
+- Estimating the effort to deliver work items is a foundational tool for prioritising that work.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/51-managing-software.md b/51-managing-software.md new file mode 100644 index 000000000..8e372bb23 --- /dev/null +++ b/51-managing-software.md @@ -0,0 +1,503 @@ +--- +title: 5.1 Managing a Collaborative Software Project +teaching: 15 +exercises: 20 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Register and track progress on issues with the code in our project repository +- Describe some different types of issues we can have with software +- Manage communications on software development activities within the team using GitHub's notification system **Mentions** +- Use GitHub's **Project Boards** and **Milestones** for software project management, planning sprints and releases + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can we keep track of identified issues and the list of tasks the team has to do? +- How can we communicate within a team on code-related issues and share responsibilities? +- How can we plan, prioritise and manage tasks for future development? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +Developing software is a project and, like most projects, it consists of multiple tasks. +Keeping track of identified issues with the software, +the list of tasks the team has to do, progress on each, +prioritising tasks for future development, +planning sprints and releases, etc., +can quickly become a non-trivial task in itself. +Without a good team project management process and framework, +it can be hard to keep track of what's done, or what needs doing, +and particularly difficult to convey that to others +in the team or share the responsibilities. + +## Using GitHub to Manage Issues With Software + +As a piece of software is used, +bugs and other issues will inevitably come to light - nothing is perfect! 
+If you work on your code with collaborators,
+or have non-developer users,
+it can be helpful to have a single shared record of
+all the problems people have found with the code,
+not only to keep track of them for you to work on later,
+but to avoid people emailing you to report a bug that you already know about!
+
+GitHub provides **Issues** -
+a framework for managing bug reports, feature requests, and lists of future work.
+
+Go back to the home page for your `python-intermediate-inflammation` repository in GitHub,
+and click on the `Issues` tab.
+You should see a page listing the open issues on your repository -
+currently there should be none.
+If you do not see the `Issues` tab, you must first enable it in the settings of your repository:
+go to the `Settings` tab, scroll down to the `Features` section and tick the `Issues` checkbox.
+
+![](fig/github-issue-list.png){alt='List of project issues in GitHub' .image-with-shadow width="1000px"}
+
+Let us go through the process of creating a new issue.
+Start by clicking the `New issue` button.
+
+![](fig/github-new-issue.png){alt='Creating a new issue in GitHub' .image-with-shadow width="1000px"}
+
+When you create an issue, you can add a range of details to it.
+An issue can be *assigned to a specific developer*, for example -
+this can be a helpful way to know who, if anyone, is currently working to fix the issue,
+or a way to assign responsibility to someone to deal with it.
+
+Issues can also be assigned a *label*.
+The labels available for issues can be customised,
+and given a colour,
+allowing you to see at a glance the state of your code's issues.
+The [default labels available in GitHub](https://docs.github.com/en/issues/using-labels-and-milestones-to-track-work/managing-labels) include:
+
+- `bug` - indicates an unexpected problem or unintended behavior
+- `documentation` - indicates a need for improvements or additions to documentation
+- `duplicate` - indicates similar or already reported issues, pull requests, or discussions
+- `enhancement` - indicates new feature requests,
+  or if they are created by a developer, indicate planned new features
+- `good first issue` - indicates a good issue for first-time contributors
+- `help wanted` - indicates that a maintainer wants help on an issue or pull request
+- `invalid` - indicates that an issue, pull request, or discussion is no longer relevant
+- `question` - indicates that an issue, pull request, or discussion needs more information
+- `wontfix` - indicates that work will not continue on an issue, pull request, or discussion
+
+You can also create your own custom labels to help with classifying issues.
+There are really no rules about naming labels -
+use whatever makes sense for your project.
+Some conventional custom labels include:
+`status:in progress` (to indicate that someone started working on the issue),
+`status:blocked` (to indicate that the progress on addressing the issue is
+blocked by another issue or activity), etc.
+
+As well as highlighting problems,
+the `bug` label can make code much more usable by
+allowing users to find out if anyone has had the same problem before,
+and also how to fix (or work around) it on their end.
+Enabling users to solve their own problems can save you a lot of time.
+In general, a good bug report should contain only one bug,
+specific details of the environment in which the issue appeared
+(e.g. operating system or browser, version of the software and its dependencies),
+and sufficiently clear and concise steps that allow a developer to reproduce the bug themselves.
+A bug report should also distinguish between what the reporter considers factual
+("I did this and this happened")
+and what is speculation
+("I think it was caused by this").
+If an error report was generated from the software itself,
+it is a very good idea to include that in the issue.
+
+The `enhancement` label is a great way to communicate your future priorities
+to your collaborators but also to yourself -
+it's far too easy to leave a software project for a few months to work on something else,
+only to come back and forget the improvements you were going to make.
+If you have other users for your code,
+they can use the label to request new features,
+or changes to the way the code operates.
+It's generally worth paying attention to these suggestions,
+especially if you spend more time developing than running the code.
+It can be very easy to end up with quirky behaviour
+because of off-the-cuff choices during development.
+Extra pairs of eyes can point out ways the code can be made more accessible -
+the easier the code is to use, the more widely it will be adopted
+and the greater impact it will have.
+
+One interesting label is `wontfix`,
+which indicates that an issue simply will not be worked on for whatever reason.
+Maybe the bug it reports is outside of the use case of the software,
+or the feature it requests simply is not a priority.
+This can make it clear you have thought about an issue and dismissed it.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Locking and Pinning Issues
+
+The **Lock conversation** and **Pin issue** buttons are both available
+from individual issue pages.
+Locking conversations allows you to block future comments on the issue,
+e.g. if the conversation around the issue is not constructive
+or violates your team's code of conduct.
+Pinning issues allows you to pin up to three issues to the top of the issues page,
+e.g. to emphasise their importance.
+ + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: testimonial + +## Manage Issues With Your Code Openly + +Having open, publicly-visible lists of the limitations and problems with your code +is incredibly helpful. +Even if some issues end up languishing unfixed for years, +letting users know about them can save them a huge amount of work +attempting to fix what turns out to be an unfixable problem on their end. +It can also help you see at a glance what state your code is in, +making it easier to prioritise future work! + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: Our First Issue! + +Individually, with a critical eye, +think of an aspect of the code you have developed so far that needs improvement. +It could be a bug, for example, +or a documentation issue with your README, +a missing LICENSE file, +or an enhancement. +In GitHub, enter the details of the issue and select `Submit new issue`. +Add a label to your issue, if appropriate. + +::::::::::::::: solution + +## Solution + +For example, "Add a licence file" could be a good first issue, with a label `documentation`. + + + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Issue (and Pull Request) Templates + +GitHub also allows you to set up issue and pull request templates for your software project. +Such templates provide a structure for the issue/pull request descriptions, +and/or prompt issue reporters and collaborators to fill in answers to pre-set questions. +They can help contributors raise issues or submit pull requests +in a way that is clear, helpful and provides enough information for maintainers to act upon +(without going back and forth to extract it). 
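+
+As a sketch of what such a template might contain
+(the filename and fields here are our own invention, not a GitHub default),
+a minimal bug report template stored at `.github/ISSUE_TEMPLATE/bug_report.md`
+could look like:
+
+```markdown
+---
+name: Bug report
+about: Report something that is not working as expected
+labels: bug
+---
+
+## What happened?
+
+## What did you expect to happen?
+
+## Steps to reproduce
+
+## Environment
+Operating system, Python version, and version of this software.
+```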
+GitHub provides a range of default templates, +but you can also [write your own](https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository). + +## Using GitHub's Notifications \& Referencing System to Communicate + +GitHub implements a comprehensive +[notifications system](https://docs.github.com/en/account-and-profile/managing-subscriptions-and-notifications-on-github/setting-up-notifications/configuring-notifications) +to keep the team up-to-date with activities in your code repository +and notify you when something happens or changes in your software project. +You can choose whether to watch or unwatch an individual repository, +or can choose to only be notified of certain event types +such as updates to issues, pull requests, direct mentions, etc. +GitHub also provides an additional useful notification feature for collaborative work - **Mentions**. +In addition to referencing team members +(which will result in an appropriate notification), +GitHub allows us to reference issues, pull requests and comments from one another - +providing a useful way of connecting things and conversations in your project. + +### Referencing Team Members Using Mentions + +The mention system notifies team members when somebody else references them +in an issue, comment or pull request - +you can use this to notify people when you want to check a detail with them, +or let them know something has been fixed or changed +(much easier than writing out all the same information again in an email). + +You can use the mention system to link to/notify an individual GitHub account +or a whole team for notifying multiple people. +Typing @ in GitHub will bring up a list of +all accounts and teams linked to the repository that can be "mentioned". +People will then receive notifications based on their preferred notification methods - +e.g. via email or GitHub's User Interface. 
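+
+For example, a comment like the following
+(the username is made up for illustration)
+would notify that person directly:
+
+```text
+@sandra-dev When you have a moment, could you check whether this
+also happens on Windows? I cannot reproduce it locally.
+```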
+
+### Referencing Issues, Pull Requests and Comments
+
+GitHub also lets you mention/reference one issue or pull request from another
+(and people "watching" these will be notified of any such updates).
+Whilst writing the description of an issue, or commenting on one,
+if you type \# you should see
+a list of the issues and pull requests on the repository.
+They are coloured green if they are open, or white if they are closed.
+Continue typing the issue number, and the list will narrow down,
+then you can hit Return to select the entry and link the two.
+For example, if you realise that several of your bugs have common roots,
+or that one enhancement cannot be implemented before you have finished another,
+you can use the mention system to link the dependent issue(s).
+This is a simple way to add much more information to your issues.
+
+While not strictly notifying anyone,
+GitHub also lets you reference individual comments and commits.
+If you click the `...` button on a comment,
+from the drop down list you can select to `Copy link`
+(which is a URL that points to that comment that can be pasted elsewhere)
+or to `Reference [a comment] in a new issue`
+(which opens a new issue and references the comment by its URL).
+Within a text box for comments, issue and pull request descriptions,
+you can reference a commit by pasting its long, unique identifier
+(or its first few digits which uniquely identify it)
+and GitHub will render it nicely using the identifier's short form
+and link to the commit in question.
+
+![](fig/github-reference-comments-commits.png){alt='Referencing comments and commits in GitHub' .image-with-shadow width="700px"}
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Our First Mention/Reference!
+
+Add a mention to one of your team members using the `@` notation
+in a comment within an issue or a pull request in your repository -
+e.g. to ask them a question, request a clarification, or ask them to do some additional work.
+
+Alternatively, add another issue to your repository
+and reference the issue you created in the previous exercise
+using the `#` notation.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## You Are Also a User of Your Code
+
+This section focuses a lot on how issues and mentions can help
+communicate the current state of the code to others
+and document what conversations were held around particular issues.
+As a sole developer, and possibly also the only user of the code,
+you might be tempted to not bother with recording issues, comments and new features
+as you do not need to communicate the information to anyone else.
+
+Unfortunately, human memory is not infallible!
+After spending six months on a different topic,
+it is inevitable you'll forget some of the plans you had and problems you faced.
+Not documenting these things can lead to you having to
+re-learn things you already put the effort into discovering before.
+Also, if others are brought on to the project at a later date,
+the software's existing issues and potential new features are already in place to build upon.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Software Project Management in GitHub
+
+Managing issues within your software project is one aspect of project management but it gives a relatively flat
+representation of tasks and may not be as suitable for higher-level project management such as
+prioritising tasks for future development, planning sprints and releases. Luckily,
+GitHub provides two project management tools for this purpose - **Projects** and **Milestones**.
+
+Both GitHub Projects and Milestones provide [agile development and project management systems](https://www.atlassian.com/agile)
+and ways of organising issues into smaller "sub-projects" (i.e.
+smaller than the "project" represented by the whole repository).
+Projects provide a way of visualising and organising work which is not time-bound and is on a higher level (e.g. more suitable for
+project management tasks). Milestones are typically used to
+organise lower-level tasks that have deadlines and whose progress needs to be closely tracked
+(e.g. release and version management). The main difference is that Milestones are a repository-level feature
+(i.e. they belong to and are managed from a single repository), whereas Projects are account-level and can manage tasks
+across many repositories under the same user or organisational account.
+
+How you organise and partition your project work and which tool you want to use
+to track progress (if at all) is up to you and the size of your project. For example, you could create a project per
+milestone or have several milestones in a single project, or split milestones into shorter sprints.
+We will use Milestones soon to organise work on a mini sprint within our team -
+for now, we will have a brief look at Projects.
+
+### Projects
+
+A GitHub Project uses a "project board" consisting of columns and cards to keep track of tasks
+(although GitHub now also provides a table view over a project's tasks).
+You break down your project into smaller sub-projects,
+which in turn are split into tasks which you write on cards,
+then move the cards between columns that describe the status of each task.
+Cards are usually small, descriptive and self-contained tasks that build on each other.
+Breaking a project down into clearly-defined tasks makes it a lot easier to manage.
+GitHub project boards interact and integrate with the other features of the site
+such as issues and pull requests -
+cards can be added to track the progress of such tasks
+and automatically moved between columns based on their progress or status.
+
+GitHub projects are "an adaptable, flexible tool for planning and tracking work on GitHub" -
+they now provide interchangeable spreadsheet, task-board, or roadmap views of your project
+that integrate with your issues and pull requests on GitHub to help you
+plan and track your work effectively.
+We recommend you have a look at [GitHub's documentation on Projects](https://docs.github.com/en/issues/planning-and-tracking-with-projects/learning-about-projects/about-projects)
+and see if they are suitable for your software development workflow.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## GitHub Projects are a Cross-Repository Management Tool
+
+[Projects in GitHub](https://docs.github.com/en/issues/planning-and-tracking-with-projects/learning-about-projects/about-projects)
+are created on a user or organisation level,
+i.e. they can span all repositories owned by a user or organisation in GitHub
+and are not a repository-level feature any more.
+A project can integrate your issues and pull requests on GitHub from multiple repositories
+to help you plan and track your team's work effectively.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Let us have a quick look at how Projects are created in GitHub - we will not use them much in
+this course but it is good to be aware of how to make use of them when suitable.
+
+1. From your GitHub account's home page (not your repository's home page!),
+   select the "Projects" tab, then click the `New project` button on the right.
+
+   ![](fig/github-new-project.png){alt='Adding a new project board in GitHub' .image-with-shadow width="800px"}
+
+2. In the "Create project" pop-up window, you can either start from one of the featured existing
+   project templates or create your project from scratch using one of the three standard project
+   types/views that you customise yourself:
+
+   - Table - a spreadsheet-style table to filter, sort and group your issues and pull requests.
+   - Board - a "cards on a board" view of the project, with issues and pull requests being
+     spread across customizable columns as cards on a kanban board.
+   - Roadmap - suitable for a high-level visualisation of your project over time.
+   ![](fig/github-project-template.png){alt='Selecting a project board template in GitHub' .image-with-shadow width="800px"}
+   Regardless of which project type/view you select, you can easily switch to a different
+   project layout later on.
+
+3. For example, select the "Board" type for the project, fill in the name of your project
+   (e.g. "Inflammation project - release v0.1"), and select `Create project`.
+
+4. After it is created, you should also populate the description of the project from the project's Settings,
+   which can be found by clicking the `...` button in the top right corner of the project.
+   ![](fig/github-project-settings.png){alt='Project board setting in GitHub' .image-with-shadow width="800px"}
+   ![](fig/github-project-description.png){alt='Adding project description and metadata in GitHub' .image-with-shadow width="800px"}
+   After adding a description, select `Save`.
+
+5. GitHub's default card board template contains
+   the following three columns with pretty self-explanatory names:
+
+   - `To Do`
+   - `In Progress`
+   - `Done`
+
+   ![](fig/github-project-view-add-remove-items.png){alt='Default card board in GitHub' .image-with-shadow width="800px"}
+
+   You can add or remove columns from your project board to suit your use case.
+   One commonly seen extra column is `On hold` or `Waiting` -
+   if you have tasks that get held up by waiting on other people
+   (e.g. to respond to your questions)
+   then moving them to a separate column makes their current state clearer.
+   Another way to organise your table is to have a column for each quarter of the year -
+   it is up to you to decide how you want to view your project's activities.
+
+   To add a new column,
+   press the `+` button on the right;
+   to remove a column select the `...` button in the top right corner of the column itself
+   and then the `Delete column` option.
+
+6. You can now add new items (cards) to columns by pressing
+   the `+ Add item` button at the bottom of each column (see the previous image) -
+   a text box to add a card will appear.
+   Cards can be simple textual notes
+   which you type into the text box and press `Enter` when finished.
+   Cards can also be (links to) existing issues and pull requests,
+   which can be selected from the text box by pressing `#`
+   (to activate GitHub's referencing mechanism)
+   and selecting the repository
+   and an issue or pull request from that repository that you want to add.
+
+   ![](fig/github-project-new-items.png){alt='Adding issues and notes to a project board in GitHub' .image-with-shadow width="800px"}
+
+   Notes contain task descriptions and can have detailed content like checklists.
+   In some cases, e.g. if a note becomes too complex,
+   you may want to convert it into an issue so you can add labels,
+   assign it to team members
+   or write more detailed comments
+   (for that, use the `Convert to issue` option from the `...` menu on the card itself).
+
+   ![](fig/github-convert-task-to-issue.png){alt='Converting a task to issue' .image-with-shadow width="800px"}
+
+7. In addition to creating new tasks as notes and converting them to issues -
+   you can add an existing issue or pull request (from any repository visible to you)
+   as a task on a column by pasting its URL into the `Add item` field
+   and pressing the `Enter` key.
+
+8. You can drag a task/card from the `To Do` to the `In Progress` column to indicate that you are working on it
+   or to the `Done` column to indicate that it has been completed.
+   Issues and pull requests on cards will automatically be moved to the `Done` column for you
+   when you close the issue or merge the pull request -
+   which is very convenient and can save you some project management time.
+
+9. Finally, you can change the way you view your project by adding another view.
+   For example, we can add a Table view to our Board view by clicking the `New` button
+   and selecting it from the drop down menu.
+   ![](fig/github-project-add-view.png){alt='Add another project view' .image-with-shadow width="800px"}
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Working With Projects
+
+Spend a few minutes planning what you want to do with your project as a bigger chunk of work
+(you can continue working on the first release of your software if you like)
+and play around with your project board to manage tasks around the project:
+
+- practice adding and removing columns,
+- practice adding different types of cards
+  (notes and from already existing open issues and/or unmerged pull requests),
+- practice turning cards into issues and closing issues, etc.
+
+Make sure you have added enough issues to your repository
+to be able to use them in your project board.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Prioritisation With Project Boards
+
+Once your project board has a large number of cards on it,
+you might want to begin prioritising them.
+Not all tasks are going to be equally important,
+and some will require others to be completed before they can even be begun.
+Common methods of prioritisation include:
+
+- **Vertical position**:
+  the vertical arrangement of cards in a column implicitly represents their importance.
+  High-priority issues go to the top of `To Do`,
+  whilst tasks that depend on others go beneath them.
+  This is the easiest one to implement,
+  though you have to remember to correctly place cards when you add them.
+- **Priority columns**: instead of a single `To Do` column,
+  you can have two or more, for example -
+  `To Do: Low Priority` and `To Do: High Priority`.
+  When adding a card, you pick which is the appropriate column for it.
+  You can even add a `Triage` column for newly-added issues
+  that you've not yet had time to classify.
+  This format works well for project boards devoted to bugs.
+- **Labels**: if you convert each card into an issue,
+  then you can label them with their priority -
+  remember GitHub lets you create custom labels and set their colours.
+  Label colours can provide a very visually clear indication of issue priority
+  but require more administrative work on the project,
+  as each card has to be an issue to be assigned a label.
+  If you choose this route for issue prioritisation -
+  be aware of accessibility issues for colour-blind people when picking colours.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- We should use GitHub's **Issues** to keep track of software problems and other requests for change - even if we are the only developer and user.
+- GitHub’s **Mentions** play an important part in communicating between collaborators and are used as a way of alerting team members of activities and referencing one issue/pull request/comment/commit from another.
+- Without a good project and issue management framework, it can be hard to keep track of what’s done, or what needs doing, and particularly difficult to convey that to others in the team or share the responsibilities.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/52-assessing-software-suitability-improvement.md b/52-assessing-software-suitability-improvement.md new file mode 100644 index 000000000..ed838ea06 --- /dev/null +++ b/52-assessing-software-suitability-improvement.md @@ -0,0 +1,120 @@ +--- +title: 5.2 Assessing Software for Suitability and Improvement +teaching: 15 +exercises: 30 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Explain why a critical mindset is important when selecting software +- Conduct an assessment of software against suitability criteria +- Describe what should be included in software issue reports and register them + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What makes good code actually good? +- What should we look for when selecting software to reuse? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +What we have been looking at so far enables us to adopt +a more proactive and managed attitude and approach to the software we develop. +But we should also adopt this attitude when +selecting and making use of third-party software we wish to use. +With pressing deadlines it is very easy to reach for +a piece of software that appears to do what you want +without considering properly whether it is a good fit for your project first. +A chain is only as strong as its weakest link, +and our software may inherit weaknesses in any dependent software or create other problems. + +Overall, when adopting software to use it is important to consider +not only whether it has the functionality you want, +but a broader range of qualities that are important for your project. +Adopting a critical mindset when assessing other software against suitability criteria +will help you adopt the same attitude when assessing your own software for future improvements. 
+
+## Assessing Software for Suitability
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Decide on Your Group's Repository!
+
+You all have code repositories that you have been working on throughout the course so far.
+For the upcoming exercise,
+groups will exchange repositories and review the code of the repository they inherit,
+and provide feedback.
+
+1. Decide as a team on one of your repositories that will represent your group.
+   You can do this any way you wish,
+   but if you are having trouble then a pseudo-random number might help:
+   `python -c "import numpy as np; print(np.random.randint(low=1, high=<number-of-teams> + 1))"`
+   (replace `<number-of-teams>` with the actual number of teams;
+   note that `randint`'s `high` bound is exclusive).
+2. Add the URL of the repository to
+   the section of the shared notes labelled 'Decide on your Group's Repository',
+   next to your team's name.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Conduct Assessment on Third-Party Software
+
+*The scenario:* It is envisaged that a piece of software developed by another team will be
+adopted and used for the long term in a number of future projects.
+You have been tasked with conducting an assessment of this software
+to identify any issues that need resolving prior to working with it,
+and will provide feedback to the developing team to fix these issues.
+
+1. As a team, briefly decide who will assess which aspect of the repository,
+   e.g. its documentation, tests, codebase, etc.
+2. Obtain the URL for the repository you will assess from the shared notes document,
+   in the section labelled 'Decide on your Group's Repository' -
+   see the last column, which indicates which team's repository you are assessing.
+3. Conduct the assessment
+   and register any issues you find on the other team's software repository on GitHub.
+4. Be meticulous in your assessment and register as many issues as you can!
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Supporting Your Software - How and How Much? + +Within your collaborations and projects, what should you do to support other users? +Here are some key aspects to consider: + +- Provide contact information: + so users know what to do and how to get in contact if they run into problems +- Manage your support: + an issue tracker - like the one in GitHub - is essential to track and manage issues +- Manage expectations: + let users know the level of support you offer, + in terms of when they can expect responses to queries, + the scope of support (e.g. which platforms, types of releases, etc.), + the types of support (e.g. bug resolution, helping develop tailored solutions), + and expectations for support in the future (e.g. when project funding runs out) + +All of this requires effort, and you cannot do everything. +It is therefore important to agree and be clear on +how the software will be supported from the outset, +whether it is within the context of a single laboratory, +project, +or other collaboration, +or across an entire community. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- It is as important to have a critical attitude to adopting software as we do to developing it. +- As a team agree on who and to what extent you will support software you make available to others. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/53-improvement-through-feedback.md b/53-improvement-through-feedback.md new file mode 100644 index 000000000..8fd79cdee --- /dev/null +++ b/53-improvement-through-feedback.md @@ -0,0 +1,313 @@ +--- +title: 5.3 Software Improvement Through Feedback +teaching: 25 +exercises: 45 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Prioritise and work on externally registered issues +- Respond to submitted issue reports and provide feedback +- Explain the importance of software support and choosing a suitable level of support + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How should we handle feedback on our software? +- How, and to what extent, should we provide support to our users? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction + +When a software project has been around for even just a short amount of time, +you'll likely discover many aspects that can be improved. +These can come from issues that have been registered via collaborators or users, +but also those you are aware of internally, +which should also be registered as issues. +When starting a new software project, +you'll also have to determine how you'll handle all the requirements. +But which ones should you work on first, +which are the most important and why, +and how should you organise all this work? + +Software has a fundamental role to play in doing science, +but unfortunately software development is often +given short shrift in academia when it comes to prioritising effort. +There are also many other draws on our time +in addition to the research, development, and writing of publications that we do, +which makes it all the more important to prioritise our time for development effectively. 
+
+In this lesson we will be looking at prioritising the work we need to do
+and what we can use from the agile perspective of project management
+to help us do this in our software projects.
+
+## Estimation as a Foundation for Prioritisation
+
+For simplicity, we will refer to our issues as *requirements*,
+since that's essentially what they are -
+new requirements for our software to fulfil.
+
+But before we can prioritise our requirements,
+there are some things we need to find out.
+
+Firstly, we need to know:
+
+- *The period of time we have to resolve these requirements* -
+  e.g. before the next software release, pivotal demonstration,
+  or other deadlines requiring their completion.
+  This is known as a **timebox**.
+  This might be a week or two, but for agile, this should not be longer than a month.
+  Longer deadlines with more complex requirements may be split into a number of timeboxes.
+- *How much overall effort we have available* -
+  i.e. who will be involved and how much of their time we will have during this period.
+
+We also need estimates for how long each requirement will take to resolve,
+since we cannot meaningfully prioritise requirements without
+knowing what the effort tradeoffs will be.
+Even if we know how important each requirement is,
+how would we know whether completing all of them is feasible?
+And if we do not know how long it will take
+to deliver those requirements we deem critical to the success of the project,
+how can we know whether we can include other, less important ones?
+
+Although it is often not the reality, estimation should ideally be done
+by the people likely to do the actual work
+(i.e. the Research Software Engineers, researchers, or developers),
+not by project managers or PIs:
+they are not best placed to estimate,
+and those doing the work are the ones effectively committing to these figures.
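The capacity check implied above is simple arithmetic; here is a toy sketch (all requirement names and figures are hypothetical, not from the lesson) comparing total estimated effort against the effort available in a timebox:

```python
# Toy timebox capacity check - all requirements and figures are hypothetical.
# Effort is measured in person-days.

estimates = {
    "fix crash on empty input file": 2.0,
    "add CSV export": 3.5,
    "improve error messages": 1.0,
}

# e.g. two developers, each with 3 days available during the timebox
available_effort = 2 * 3

total_estimate = sum(estimates.values())
print(f"Estimated: {total_estimate} person-days; available: {available_effort} person-days")
if total_estimate > available_effort:
    print("Over capacity - defer or deprioritise some requirements.")
else:
    print("The requirements fit within the timebox.")
```

In this made-up example the estimates (6.5 person-days) exceed the available effort (6 person-days), which is exactly the situation where prioritisation, covered next, becomes necessary.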
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Why is it so Difficult to Estimate?
+
+Estimation is a very valuable skill to learn, and one that is often difficult.
+Lack of experience in estimation can play a part,
+but a number of psychological causes can also contribute.
+One of these is the [Dunning-Kruger effect](https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect),
+a type of cognitive bias in which people tend to overestimate their abilities;
+in opposition to this is [imposter syndrome](https://en.wikipedia.org/wiki/Impostor_syndrome),
+where, due to a lack of confidence, people underestimate their abilities.
+The key message here is to be honest about what you can do,
+and find out as much information as is reasonably appropriate before arriving at an estimate.
+
+More experience in estimation will also help to reduce these effects.
+So keep estimating!
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+An effective way of making your estimates more accurate is to do it as a team.
+Other members can ask prudent questions that may not have been considered,
+and bring in other sanity checks and their own development experience.
+Just talking things through can help uncover other complexities and pitfalls,
+and raise crucial questions to clarify ambiguities.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Where to Record Effort Estimates?
+
+There is no dedicated place to record an effort estimate on an issue in GitHub's current interface.
+Therefore, you can agree on a convention within your team on how to record this information -
+e.g. you can add the effort in person-days to the issue title.
+Recording estimates within comments on an issue may not be the best idea,
+as they may get lost among other comments.
+Another place where you can record estimates for your issues is on project boards -
+there is no default field for this, but you can create a custom numeric field
+and use it to assign effort estimates
+(note that you cannot yet sum them in GitHub's current interface).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Estimate!
+
+As a team,
+go through the issues that your partner team has registered with your software repository,
+and quickly estimate how long each issue will take to resolve, in minutes.
+Do this by blind consensus first,
+each anonymously submitting an estimate,
+and then briefly discuss your rationale and decide on a final estimate.
+Make sure these are honest estimates,
+and that you are able to complete them in the allotted time!
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Using MoSCoW to Prioritise Work
+
+Now that we have our estimates, we can decide
+how important each requirement is to the success of the project.
+This should be decided by the project stakeholders:
+those - or their representatives -
+who have a stake in the success of the project
+and who either directly affect or are affected by it,
+e.g. Principal Investigators,
+researchers,
+Research Software Engineers,
+collaborators, etc.
+
+To prioritise these requirements we can use a method called **MoSCoW**,
+a way to reach a common understanding with stakeholders
+on the importance of successfully delivering each requirement for a timebox.
+MoSCoW is an acronym that stands for
+**Must have**,
+**Should have**,
+**Could have**,
+and **Won't have**.
+Each requirement is discussed by the stakeholder group and falls into one of these categories:
+
+- *Must Have* (MH) -
+  these requirements are critical to the success of the current timebox.
+  Even the inability to deliver just one of these would
+  cause the project to be considered a failure.
+- *Should Have* (SH) -
+  these are important requirements but not *necessary* for delivery in the timebox.
+  They may be as *important* as Must Haves,
+  but there may be other ways to achieve them,
+  or perhaps they can be held back for a future development timebox.
+- *Could Have* (CH) -
+  these are desirable but not necessary,
+  and each of these will be included in this timebox if it can be achieved.
+- *Won't Have* (WH) -
+  these are agreed to be out of scope for this timebox,
+  perhaps because they are the least important or not critical for this phase of development.
+
+In typical use, the ratio of requirements to aim for across the MH/SH/CH categories is
+60%/20%/20% for a particular timebox.
+Importantly, the division is by the requirement *estimates*,
+not by the number of requirements,
+so 60% means that Must Haves account for 60% of the overall estimated effort.
+
+Why is this important?
+Because it gives you a unique degree of control of your project for each time period.
+It gives you 40% flexibility in allocating your effort,
+depending on what is critical and how things progress.
+This effectively forces a tradeoff between the effort available and critical objectives,
+maintaining a significant safety margin.
+The idea is that as a project progresses,
+even if it becomes clear that you are only able to
+deliver the Must Haves for a particular time period,
+you have still delivered the timebox *successfully*.
+
+### GitHub's Milestones
+
+Once we have decided on the requirements we will work on (i.e. everything except the Won't Haves),
+we can (optionally) use a GitHub **Milestone** to organise them for a particular timebox.
+Remember, a milestone is a collection of issues to be worked on in a given period (or timebox).
+We can create a new one by selecting `Issues` on our repository,
+then `Milestones` to display any existing milestones,
+then clicking the "New milestone" button to the right.
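Because the 60%/20%/20% split is measured in estimated effort rather than in issue counts, it can be handy to check a breakdown with a few lines of code. A minimal sketch, with entirely made-up estimates in person-hours:

```python
# Check a MoSCoW breakdown against the 60/20/20 guideline.
# Estimates are hypothetical, in person-hours, grouped by category.
estimates = {
    "Must Have":   [8, 4, 12],
    "Should Have": [5, 3],
    "Could Have":  [6, 2],
}

# Shares are computed over total estimated effort, not issue counts.
total = sum(sum(hours) for hours in estimates.values())
for category, hours in estimates.items():
    share = 100 * sum(hours) / total
    print(f"{category:>11}: {sum(hours):>2}h ({share:.0f}% of estimated effort)")
```

With these figures the split comes out at exactly 60%/20%/20%; in practice you would move issues between categories until the Must Haves stay at or below 60% of the estimated effort.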
+
+![](fig/github-milestones.png){alt='Milestones in GitHub' .image-with-shadow width="1000px"}
+
+![](fig/github-create-milestone.png){alt='Create a milestone in GitHub' .image-with-shadow width="1000px"}
+
+We add in a title,
+a completion date (i.e. the end of this timebox),
+and any description for the milestone.
+
+![](fig/github-new-milestone-description.png){alt='Create a milestone in GitHub' .image-with-shadow width="800px"}
+
+Once created, we can view our issues
+and assign them to our milestone from the `Issues` page or from an individual issue page.
+
+![](fig/github-assign-milestone.png){alt='Milestones in GitHub' .image-with-shadow width="1000px"}
+
+Let us now use Milestones to plan and prioritise our team's next sprint.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Prioritise!
+
+Put your stakeholder hats on, and as a team apply MoSCoW to the repository issues
+to determine how you will prioritise effort to resolve them in the allotted time.
+Try to stick to the 60/20/20 rule,
+and assign all issues you will be working on (i.e. not `Won't Haves`) to a new milestone,
+e.g. "Tidy up documentation" or "version 0.1".
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Using Sprints to Organise and Work on Issues
+
+A sprint is an activity applied to a timebox,
+where development is undertaken on the agreed prioritised work for the period.
+In a typical sprint there are daily meetings called **scrum meetings**,
+which check on how work is progressing
+and serve to highlight any blockers and challenges to meeting the sprint goal.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Conduct a Mini Mini-Sprint
+
+For the remaining time in this course,
+assign repository issues to team members and work on resolving them as per your MoSCoW breakdown.
+Once an issue has been resolved, notable progress has been made, or an impasse has been reached,
+provide concise feedback on the repository issue.
+Be sure to add the other team members to the chosen repository so they have access to it.
+You can grant `Write` access to others on a GitHub repository
+via the `Settings` tab for a repository, then selecting `Collaborators`,
+where you can invite other GitHub users to your repository with specific permissions.
+
+Time: however long is left
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Depending on how many issues were registered on your repository,
+it is likely you will not have resolved all the issues in this first milestone.
+Of course, in reality, a sprint would be over a much longer period of time.
+In any event, as development progresses into future sprints,
+any unresolved issues can be reconsidered and prioritised for another milestone,
+then taken forward, and so on.
+This process of receiving new requirements, prioritising them,
+and working on them is naturally continuous,
+with the benefit that at key stages
+you are repeatedly **re-evaluating what is important and needs to be worked on**,
+which helps to ensure real, concrete progress against project goals and requirements
+that may change over time.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Project Boards For Planning Sprints
+
+Remember, you can use project boards for higher-level project management -
+e.g. planning several sprints in advance
+(and use milestones for tracking progress on individual sprints).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Prioritisation is a key tool in academia, where research goals can change and software development is often given short shrift.
+- In order to prioritise things to do we must first estimate the effort required to do them.
+- Effort estimation is most accurate when done by the people who will *actually do the work*.
+- Aim to reduce cognitive biases in effort estimation by being honest about your abilities.
+- Ask other team members - or do estimation as a team - to help make accurate estimates.
+- MoSCoW is a useful technique for prioritising work to help ensure projects deliver successfully.
+- Aim for a 60%/20%/20% ratio of Must Haves/Should Haves/Could Haves for requirements within a timebox.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/60-wrap-up.md b/60-wrap-up.md
new file mode 100644
index 000000000..3a3f73a5c
--- /dev/null
+++ b/60-wrap-up.md
@@ -0,0 +1,155 @@
+---
+title: Wrap-up
+teaching: 15
+exercises: 0
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Put the course in context with future learning.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- Looking back at what was covered and how different pieces fit together
+- Where are some advanced topics and further reading available?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+
+## Summary
+
+As part of this course we have looked at a core set of
+established, intermediate-level software development tools and best practices
+for working as part of a team.
+The course teaches a selected subset of skills that have been tried and tested
+in collaborative research software development environments,
+although it is not an all-encompassing set of every skill you might need
+(check some [further reading](.#further-resources)).
+It will provide you with a solid basis for writing industry-grade code,
+which relies on the same best practices taught in this course:
+
+- Collaborative techniques and tools play an important part
+  in research software development in teams,
+  but also have benefits in solo development.
+  We have looked at the benefits of a well-considered development environment,
+  using practices, tools and infrastructure
+  to help us write code more effectively in collaboration with others.
+- We have looked at the importance of being able to
+  verify the correctness of software,
+  and how we can leverage techniques and infrastructure
+  to automate and scale tasks such as testing to save us time -
+  but automation has a role beyond simply testing:
+  what else can you automate that would save you even more time?
+  We have also examined how to locate faults in our software.
+- We have gone beyond procedural programming and explored different software design paradigms,
+  such as object-oriented and functional styles of programming.
+  We have contrasted their pros and cons, and the situations in which they work best,
+  and seen how separation of concerns through modularity and architectural design
+  can help shape good software.
+- As an intermediate developer,
+  aspects other than technical skills become important,
+  particularly in development teams.
+  We have looked at the importance of good,
+  consistent practices for team working,
+  the importance of having a self-critical mindset when developing software,
+  and ways to manage feedback effectively and efficiently.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Reflection Exercise: Putting the Pieces Together
+
+As a group, reflect on the concepts
+(e.g. tools, techniques and practices)
+covered throughout the course:
+how they relate to one another,
+how they fit together in a bigger picture or into skill learning pathways,
+and in which order you need to learn them.
+
+::::::::::::::: solution
+
+## Solution
+
+One way to think about these concepts is to
+make a list and try to organise them along two axes -
+'perceived usefulness of a concept' versus
+'perceived difficulty or time needed to master a concept',
+as shown in the table below
+(you can make your own copy of the
+[template table](https://docs.google.com/document/d/1NdE6PjqxjSsf1K4ofkCoWc2GA3sY2RIsjRg8BghTXas/edit?usp=sharing)
+for the purpose of this exercise).
+You may then think about the order in which you want to learn the skills
+and how much effort they require -
+e.g. start with those that are more useful and, for the time being,
+hold off on those that are less useful to you and take a long time to master.
+You will likely want to focus on the concepts in the top right corner of the table first,
+but investing time to master more difficult concepts may pay off in the long run
+by saving you time and effort and helping reduce technical debt.
+
+![](fig/wrapup-perceived-usefulness-time.png){alt='Usefulness versus time to master grid' .image-with-shadow width="800px"}
+
+Another way you can organise the concepts is using a
+[concept map](https://en.wikipedia.org/wiki/Concept_map)
+(a directed graph depicting suggested relationships between concepts)
+or any other diagram/visual aid of your choice.
+Below are some example views of tools and techniques covered in the course using concept maps.
+Your views may differ, but that is not to say that any particular view is right or wrong.
+This exercise is meant to get you to reflect on what was covered in the course
+and hopefully to reinforce the ideas and concepts you learned.
+
+![](fig/wrapup-concept-map.png){alt='Overview of tools and techniques covered in the course' .image-with-shadow width="800px"}
+
+A different concept map organises concepts/skills based on their level of difficulty
+(novice, intermediate, advanced, and in-between!)
+and shows which skills are prerequisites for others
+and in which order you should consider learning them.
+
+![](fig/wrapup-concept-map-difficulty-level.png){alt='Overview of topics covered in the course based on level of difficulty' .image-with-shadow width="800px"}
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Further Resources
+
+Below are some additional resources to help you continue learning:
+
+- [Additional episode on persisting data](../learners/persistence.md)
+- [Additional episode on databases](../learners/databases.md)
+- [Additional episode on software architecture](../learners/software-architecture-extra.md)
+- [Additional episode on programming paradigms](../learners/programming-paradigms.md)
+- [CodeRefinery lessons][coderefinery-lessons] on writing software for open and reproducible research
+- [Python documentation][python-documentation]
+- [GitHub Actions documentation][github-actions]
+
+
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Collaborative techniques and tools play an important part in research software development in teams.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 000000000..f19b80495
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,13 @@
+---
+title: "Contributor Code of Conduct"
+---
+
+As contributors and maintainers of this project,
+we pledge to follow [The Carpentries Code of Conduct][coc].
+
+Instances of abusive, harassing, or otherwise unacceptable behavior
+may be reported by following our [reporting guidelines][coc-reporting].
+ + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/GOVERNANCE.md b/GOVERNANCE.md new file mode 100644 index 000000000..ccdacdedf --- /dev/null +++ b/GOVERNANCE.md @@ -0,0 +1,125 @@ +# Project Governance +This document describes the roles and responsibilities of people who manage the +python-intermediate-development curriculum in this repository +and the way they make decisions about how the project develops. +For information about how to contribute to the project, see [CONTRIBUTING.md](./CONTRIBUTING.md). +For information about the project's Code of Conduct +and its reporting and enforcement mechanisms, see [CODE_OF_CONDUCT.md](./CODE_OF_CONDUCT.md). + +## Roles + +### Maintainers +A team of 3-5 Maintainers is responsible for the lesson repository +and makes decisions about changes to be incorporated into the default branch. +Changes to the default branch can only be made by pull request, +and all pull requests to the default branch require +review and approval from at least one Maintainer before merging. + +Responsibilities of Maintainers include: + +* Reviewing and responding to new issues and pull requests in a timely manner +* Attending [Maintainer meetings](#maintainer-meetings) where availability allows +* Voting asynchronously on decisions where needed + +#### Lead Maintainer +The Maintainer team includes one person in a Lead role, +who is responsible for coordinating the activity of the group. 
+In addition to the responsibilities listed for all Maintainers above, +the Lead Maintainer: + +* schedules Maintainer meetings +* prepares [the agenda for Maintainer meetings](#meeting-agenda) +* shares the agenda with all Maintainers at least 24 hours before the meeting +* [assigns roles at the start of each meeting](#meeting-roles) +* schedules official releases of the lesson +* acts as a point of contact for the Maintainer team +* invites other community members to Maintainer meetings as non-voting participants + +Where needed e.g. due to absence, +the Lead Maintainer may defer any of these responsibilities to another member of the Maintainer team. + +The Lead Maintainer has a term length of 6 months, +and it is expected that the role will rotate among members of the Maintainer team. +If a Maintainer is up next in the rotation and wishes to decline the role of Lead +e.g. due to a lack of capacity, +they should communicate that with the other Maintainers at the earliest opportunity +to help the team plan accordingly. + +#### Current Maintainers +See [README.md](./README.md) for a list of the current project Maintainers. + +#### Joining/Leaving the Maintainer Team +Maintainers volunteer to take on the role, and other members of the community may +volunteer to join the Maintainer team at any time, +or be invited by the existing Maintainers. +Additions to the Maintainer team will be discussed and approved by the current membership. +No formal onboarding exists for new Maintainers, +but some informal onboarding can be expected from the existing Maintainers. + +Maintainers may step away from the role at any time, +but are expected to communicate the decision to the whole Maintainer team +and to coordinate with other Maintainers to transfer responsibilities, e.g. +re-assign issues, resolve outstanding pull requests, etc. 
+
+### Contributors
+Anyone who opens or comments on an issue or pull request,
+or who provides feedback on the curriculum through another means,
+is considered to be a Contributor to the project.
+
+Maintainers are responsible for ensuring that all such contributions are credited,
+e.g. on the curriculum site and/or when (and if) a release of the curriculum is made to Zenodo.
+
+Contributors of more significant changes to the lesson may be invited by the Maintainers to add themselves to the
+Authors list.
+
+## Maintainer Meetings
+The Maintainer team meets regularly: at a minimum, for 30 minutes four times per year.
+Meetings provide an opportunity for Maintainers to
+discuss outstanding issues and pull requests
+and co-work on the project where necessary.
+
+### Meeting schedule
+The Maintainer team aims to meet at 11:00 UK time (BST or GMT) on the fourth Wednesday of each month. The meetings alternate between operations meetings and co-working sprints.
+
+### Meeting agenda
+The [agenda for Maintainer meetings](https://docs.google.com/document/d/1-SvoY_2GvlQgJnu8zfr6VnU7sev_iWZAIwBUywNSfWE/edit#) will be prepared as a collaborative document,
+with (at least) sections to record the following information:
+
+* lists of Maintainers attending and absent from the meeting
+* a list of items for discussion and, if necessary, the amount of time assigned to each item
+  * wherever possible, the list should include a link to the relevant issue/pull request/discussion
+
+### Meeting roles
+Each meeting will have a Facilitator, a Notetaker, and (if needed) a Timekeeper:
+
+* Facilitator:
+  introduces agenda items (or delegates this responsibility to another participant)
+  and controls the flow of discussion by keeping track of who wishes to speak
+  and calling on them to do so.
+  The meeting Facilitator is responsible for keeping discussion on-topic,
+  ensuring decisions are made and recorded where appropriate,
+  and giving every attendee an equal opportunity to participate in the meeting.
+  They also act as backup Notetaker, taking over when the Notetaker is speaking.
+* Notetaker:
+  ensures that the main points of discussion are recorded throughout the meeting.
+  Although a full transcript of the discussion is not required,
+  the Notetaker is responsible for ensuring that the main points are captured
+  and any decisions made and actions required are noted prominently.
+* Timekeeper (if needed):
+  the Maintainer Lead or meeting Facilitator may choose to assign a Timekeeper,
+  whose responsibility is to note the time allotted for each item on the agenda
+  and communicate to the Facilitator when that time has run out.
+  The decision to move from one agenda item to the next belongs to the meeting Facilitator.
+
+### Decision-making
+Decisions within the Maintainer team will be made by [lazy consensus](https://medlabboulder.gitlab.io/democraticmediums/mediums/lazy_consensus/)
+among all team members,
+with fallback to a simple majority vote only in cases
+where a decision must be made urgently and no consensus can be found.
+
+Decisions will preferably be made during Maintainer meetings with every
+member of the team present.
+Where this is not possible, decision-making will happen asynchronously via
+an issue on the curriculum repository.
+Decisions made asynchronously must allow at least one week for Maintainers to respond and vote/abstain.
diff --git a/LICENSE.md b/LICENSE.md
new file mode 100644
index 000000000..7632871ff
--- /dev/null
+++ b/LICENSE.md
@@ -0,0 +1,79 @@
+---
+title: "Licenses"
+---
+
+## Instructional Material
+
+All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
+instructional material is made available under the [Creative Commons
+Attribution license][cc-by-human].
The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to ), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. + +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +* You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +* No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. 
+ +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/common-issues.md b/common-issues.md new file mode 100644 index 000000000..ded4ac79e --- /dev/null +++ b/common-issues.md @@ -0,0 +1,278 @@ +--- +title: Common Issues, Fixes & Tips +--- + +Here is a list of issues previous participants of the course encountered +and some tips to help you with troubleshooting. + +## Command Line/Git Bash Issues + +### Python Hangs in Git Bash + +Hanging issues with trying to run Python 3 in Git Bash on Windows +(i.e. 
typing `python` in the shell causes it to hang with no error message or output).
+The solution appears to be to use `winpty` -
+a Windows software package providing an interface similar to a Unix pty-master
+for communicating with Windows command line tools.
+Inside the shell type:
+
+```bash
+$ alias python="winpty python.exe"
+```
+
+This alias will be valid for the duration of the shell session.
+For a more permanent solution, from the shell do:
+
+```bash
+$ echo "alias python='winpty python.exe'" >> ~/.bashrc
+$ source ~/.bashrc
+```
+
+(and from there on remember to invoke Python as `python`
+or whatever command you aliased it to).
+Read more details on the issue at
+[Stack Overflow](https://stackoverflow.com/questions/32597209/python-not-working-in-the-command-line-of-git-bash)
+or [Superuser](https://superuser.com/questions/1403345/git-bash-not-running-python3-as-expected-hanging-issues).
+
+### Customising Command Line Prompt
+
+A minor annoyance is the ultra-long prompt the command line sometimes gives you -
+if you do not want a reminder of the current working directory,
+you can set it to just `$` by typing the following in your command line: `export PS1="$ "`.
+More details on command line prompt customisation can be found in this
+[guide](https://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html).
+
+## Git/GitHub Issues
+
+### Connection Issues When Accessing GitHub Using Git Over VPN or Protected Networks - Proxy Needed
+
+When accessing external services and websites
+(such as GitHub using `git` or to
+[install Python packages with `pip`](../learners/common-issues.md#connection-issues-when-installing-packages-with-pip-over-vpn-or-protected-networks---proxy-needed)),
+you may experience connection errors
+(e.g. similar to `fatal: unable to access '....': Failed connect to github.com`)
+or a connection that hangs.
+This may indicate that you need to configure a proxy server used by your organisation
+to tunnel SSH traffic through an HTTP proxy.
+
+To get `git` to work through a proxy server in Windows,
+you'll need the `connect.exe` program that comes with Git Bash
+(which you should have installed as part of setup, so no additional installation is needed).
+If installed in the default location,
+this file should be found at `C:\Program Files\Git\mingw64\bin\connect.exe`.
+Next, you'll need to modify your SSH config file (typically in `~/.ssh/config`)
+and add the following:
+
+```
+Host github.com
+   ProxyCommand "C:/Program Files/Git/mingw64/bin/connect.exe" -H <proxy-address>:<port> %h %p
+   TCPKeepAlive yes
+   IdentitiesOnly yes
+   User git
+   Port 22
+   Hostname github.com
+```
+
+Mac and Linux users can use the [Corkscrew tool](https://github.com/bryanpkc/corkscrew)
+for tunneling SSH through HTTP proxies,
+which would have to be installed separately.
+Next, you'll need to modify your SSH config file (typically in `~/.ssh/config`)
+and add the following:
+
+```
+Host github.com
+   ProxyCommand corkscrew <proxy-address> <port> %h %p
+   TCPKeepAlive yes
+   IdentitiesOnly yes
+   User git
+   Port 22
+   Hostname github.com
+```
+
+### Creating a GitHub Key Without 'Workflow' Authorisation Scope
+
+If a learner creates a GitHub authentication token
+but forgets to check the 'workflow' scope
+(to allow the token to be used to update GitHub Action workflows)
+they will get the following error when trying to push a new workflow
+(when adding the `pytest` action in Section 2) to GitHub:
+
+```error
+! [remote rejected] test-suite -> test-suite (refusing to allow an OAuth App to create or update workflow `.github/workflows/main.yml` without `workflow` scope)
+```
+
+The solution is to generate a new token with the correct scope/usage permissions
+and clear the local credential cache (if that's where the token has been saved).
+In some cases, simply clearing the credential cache was not enough, and updating to Git 2.29 was also needed.
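If you are unsure how to clear the locally cached credential, Git's credential plumbing can ask every configured credential helper to erase it. This is a generic sketch (not from the course material); adjust the host if you are not using github.com:

```shell
# Ask all configured credential helpers to erase the cached github.com
# credential, so the next HTTPS push prompts for the new token
printf 'protocol=https\nhost=github.com\n\n' | git credential reject
```

The next `git push` over HTTPS should then ask for (or look up) the newly generated token.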
+
+### `Please tell me who you are` Git Error
+
+If you experience the following error the first time you do a Git commit,
+you may not have configured your identity with Git on your machine:
+
+```error
+fatal: unable to auto-detect email address
+*** Please tell me who you are
+```
+
+This can be configured from the command line as follows:
+
+```bash
+$ git config --global user.name "Your Name"
+$ git config --global user.email "name@example.com"
+```
+
+The option `--global` tells Git to use these settings "globally"
+(i.e. for every project that uses Git for version control on your machine).
+If you use different identities for different projects,
+then you should not use the `--global` option.
+Make sure to use the same email address you used to open the GitHub account
+that you are using for this course.
+
+At this point it may also be a good time to configure your favourite text editor with Git,
+if you have not already done so.
+For example, to use the editor `nano` with Git:
+
+```bash
+$ git config --global core.editor "nano -w"
+```
+
+## SSH key authentication issues with Git Bash
+
+Git Bash uses its own SSH library by default, which may result in errors such as the one below
+even after adding your SSH key correctly:
+
+```
+$ git clone git@github.com:ukaea-rse-training/python-intermediate-inflammation
+Cloning into 'python-intermediate-inflammation'...
+git@github.com: Permission denied (publickey).
+fatal: Could not read from remote repository.
+
+Please make sure you have the correct access rights
+and the repository exists.
+```
+
+The solution is to change the SSH library used by Git:
+
+```
+$ git config --global core.sshCommand C:/windows/System32/OpenSSH/ssh.exe
+```
+
+## Python, `pip`, `venv` \& Installing Packages Issues
+
+### Issues With Numpy (and Potentially Other Packages) on New M1 Macs
+
+When using the `numpy` package installed via `pip` on a command line on a new Apple M1 Mac,
+you get a failed installation with the error:
+
+```error
+...
+mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e').
+...
+```
+
+Numpy is a package heavily optimised for performance,
+and many parts of it are written in C and compiled for specific architectures,
+such as Intel (x86\_64, x86\_32, etc.)
+or Apple's M1 (arm64e).
+In this instance, `pip` is obtaining a version of `numpy` with the incorrect compiled binaries,
+instead of the ones needed for Apple's M1 Mac.
+One way that was found to work was to install numpy via PyCharm into your environment instead,
+which seems able to determine the correct packages to download and install.
+
+### Python 3 Installed but not Found When Using `python3` Command
+
+Python 3 installed on some Windows machines
+may not be accessible using the `python3` command from the command line,
+but works fine when invoked via the command `python`.
+
+### Connection Issues When Installing Packages With `pip` Over VPN or Protected Networks - Proxy Needed
+
+If you encounter issues when trying to
+install packages with `pip` over your organisational network -
+it may be because you need to
+[use a proxy](https://stackoverflow.com/questions/30992717/proxy-awareness-with-pip)
+provided by your organisation.
+In order to get `pip` to use the proxy,
+you need to add an additional parameter when installing packages with `pip`:
+
+```bash
+$ python3 -m pip install --proxy <proxy-url> <package-name>
+```
+
+To keep these settings permanently,
+you may want to add the following to your `.zshrc`/`.bashrc` file
+to avoid having to specify the proxy for each session,
+and restart your command line terminal:
+
+```
+# call set_proxies to set proxies and unset_proxies to remove them
+set_proxies() {
+export {http,https,ftp}_proxy='<proxy-url>'
+export {HTTP,HTTPS,FTP}_PROXY='<proxy-url>'
+export NO_PROXY=localhost,127.0.0.1,10.96.0.0/12,192.168.99.0/24,192.168.39.0/24,192.168.64.2,.<your-domain>
+}
+
+unset_proxies() {
+export {http,https,ftp}_proxy=
+export {HTTP,HTTPS,FTP}_PROXY=
+export NO_PROXY=
+}
+```
+
+## PyCharm Issues
+
+### Using Git Bash from PyCharm
+
+To embed Git Bash in PyCharm as an external tool and work with it in the PyCharm window,
+go to Settings,
+select "Tools->Terminal->Shell path"
+and enter `"C:\Program Files\Git\bin\sh.exe" --login`.
+See [more details](https://stackoverflow.com/questions/20573213/embed-git-bash-in-pycharm-as-external-tool-and-work-with-it-in-pycharm-window-w)
+on Stack Overflow.
+
+### Virtual Environments Issue `"no such option: --build-dir"`
+
+Using PyCharm to add a package to a virtual environment created from the command line using `venv`
+can fail with the error `"no such option: --build-dir"`,
+which appears to be caused by the latest version of `pip` (20.3)
+where the flag `--build-dir` was removed but is required by PyCharm to install packages.
+A workaround is to:
+
+- Close PyCharm
+- Downgrade the version of `pip` used by `venv`, e.g. in a command line terminal type:
+  ```bash
+  $ python3 -m pip install pip==20.2.4
+  ```
+- Restart PyCharm
+
+See [the issue](https://youtrack.jetbrains.com/issue/PY-45712) for more details.
+This issue seems to only occur with older versions of PyCharm - recent versions should be fine.
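To confirm which version of `pip` a particular interpreter is actually using (for example, after downgrading it as above), you can query it from Python itself. This is a general-purpose check rather than anything PyCharm-specific:

```python
# Print the pip version visible to this interpreter (importlib.metadata
# is available in Python 3.8+)
from importlib import metadata

print(metadata.version("pip"))
```

Running `python3 -m pip --version` gives the same information from the command line.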
+
+### Invalid YAML Issue
+
+If YAML is copy+pasted from the course material,
+it might not get pasted correctly in PyCharm and some extra indentation may occur.
+Annoyingly, PyCharm will not flag this up as invalid YAML
+and learners may get all sorts of different issues and errors with these files -
+e.g. 'actions must start with run or uses' with GitHub Actions workflows.
+
+An example of incorrect extra indentation:
+
+```
+  steps:
+    - name: foo
+      uses: bar
+```
+
+Instead of
+
+```
+steps:
+  - name: foo
+    uses: bar
+```
+
+
+
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 000000000..56b86e2c9
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,130 @@
+#------------------------------------------------------------
+# Values for this lesson.
+#------------------------------------------------------------
+
+# Which carpentry is this (swc, dc, lc, or cp)?
+# swc: Software Carpentry
+# dc: Data Carpentry
+# lc: Library Carpentry
+# cp: Carpentries (to use for instructor training for instance)
+# incubator: The Carpentries Incubator
+# Note that you can also use a custom carpentry type. For more info,
+# see the documentation: https://carpentries.github.io/sandpaper-docs/editing.html
+carpentry: 'incubator'
+
+# Custom carpentry description
+# This will be used as the alt text for the logo
+# carpentry_description: "Custom Carpentry"
+
+# Overall title for pages.
+title: 'Intermediate Research Software Development' + +# Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: 2020-04-23 + +# Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' + +# Life cycle stage of the lesson +# possible values: pre-alpha, alpha, beta, stable +life_cycle: 'beta' + +# License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' + +# Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/python-intermediate-development' + +# Default branch of your lesson +branch: 'main' + +# Who to contact if there are any issues +contact: 'info@software.ac.uk' + +# Navigation ------------------------------------------------ +# +# Use the following menu items to specify the order of +# individual pages in each dropdown section. Leave blank to +# include all pages in the folder. +# +# Example ------------- +# +# episodes: +# - introduction.md +# - first-steps.md +# +# learners: +# - setup.md +# +# instructors: +# - instructor-notes.md +# +# profiles: +# - one-learner.md +# - another-learner.md + +# Disable sidebar automatic numbering +disable_sidebar_numbering: true + +# Order of episodes in your lesson +episodes: +- 00-setting-the-scene.md +- 10-section1-intro.md +- 11-software-project.md +- 12-virtual-environments.md +- 13-ides.md +- 14-collaboration-using-git.md +- 15-coding-conventions.md +- 16-verifying-code-style-linters.md +- 17-section1-optional-exercises.md +- 20-section2-intro.md +- 21-automatically-testing-software.md +- 22-scaling-up-unit-testing.md +- 23-continuous-integration-automated-testing.md +- 24-diagnosing-issues-improving-robustness.md +- 25-section2-optional-exercises.md +- 30-section3-intro.md +- 31-software-requirements.md +- 32-software-architecture-design.md +- 33-code-decoupling-abstractions.md +- 34-code-refactoring.md +- 35-software-architecture-revisited.md +- 40-section4-intro.md +- 41-code-review.md 
+- 42-software-reuse.md +- 43-software-release.md +- 50-section5-intro.md +- 51-managing-software.md +- 52-assessing-software-suitability-improvement.md +- 53-improvement-through-feedback.md +- 60-wrap-up.md + +# Information for Learners +learners: +- quiz.md +- installation-instructions.md +- common-issues.md +- software-architecture-extra.md +- programming-paradigms.md +- procedural-programming.md +- functional-programming.md +- object-oriented-programming.md +- persistence.md +- databases.md +- vscode.md +- reference.md + +# Information for Instructors +instructors: +- instructor-notes.md + +# Learner Profiles +profiles: + +# Customisation --------------------------------------------- +# +# This space below is where custom yaml items (e.g. pinning +# sandpaper and varnish versions) should live + + +carpentry_description: Lesson Description diff --git a/databases.md b/databases.md new file mode 100644 index 000000000..effa5b10d --- /dev/null +++ b/databases.md @@ -0,0 +1,472 @@ +--- +title: "Extra Content: Databases" +teaching: 30 +exercises: 30 +--- + +::: questions +- How can we persist complex structured data for efficient access? +::: + +::: objectives +- Describe the structure of a relational database +- Store and retrieve structured data using an Object Relational Mapping (ORM) +::: + + +::::::::::::::::::::::::::::::::::::::::: callout + +## Follow up from Section 3 + +This episode could be read as a follow up from the end of +[Section 3 on software design and development](../episodes/35-software-architecture-revisited.md). + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A **database** is an organised collection of data, +usually organised in some way to mimic the structure of the entities it represents. +There are several major families of database model, +but the dominant form is the **relational database**. + +Relational databases focus on describing the relationships between entities in the data, +similar to the object oriented paradigm. 
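As a minimal concrete example of a relational database, the sketch below uses Python's built-in `sqlite3` module with two invented tables (`doctors` and `patients`, not part of the inflammation project), where each patient row references a doctor row:

```python
import sqlite3

# An in-memory database: nothing is written to disk
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Two tables; each patient row points at a doctor row via a foreign key
cursor.execute("CREATE TABLE doctors (id INTEGER PRIMARY KEY, name TEXT)")
cursor.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, "
               "doctor_id INTEGER REFERENCES doctors(id))")

cursor.execute("INSERT INTO doctors (name) VALUES ('Dr Smith')")
cursor.execute("INSERT INTO patients (name, doctor_id) VALUES ('Alice', 1)")

# Join the two tables through the foreign key relationship
cursor.execute("SELECT patients.name, doctors.name FROM patients "
               "JOIN doctors ON patients.doctor_id = doctors.id")
print(cursor.fetchall())  # [('Alice', 'Dr Smith')]

connection.close()
```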
+The key concepts in a relational database are:
+
+Tables
+
+- Within a database we can have multiple tables -
+  each table usually represents all entities of a single type.
+- E.g., we might have a `patients` table to represent all of our patients.
+
+Columns / Fields
+
+- Each table has columns - each column has a name and holds data of a specific type.
+- E.g., we might have a `name` column in our `patients` table
+  which holds text data representing the names of our patients.
+
+Rows
+
+- Each table has rows - each row represents a single entity and has a value for each field.
+- E.g., each row in our `patients` table represents a single patient -
+  the value of the `name` field in this row is our patient's name.
+
+Primary Keys
+
+- Each row has a primary key -
+  this is a unique ID that can be used to select this row from the data.
+- E.g., each patient might have a `patient_id`
+  which can be used to distinguish two patients with the same name.
+
+Foreign Keys
+
+- A relationship between two entities is described using a foreign key -
+  this is a field which points to the primary key of another row / table.
+- E.g., each patient might have a foreign key field called `doctor`
+  pointing to a row in a `doctors` table representing the doctor responsible for them -
+  i.e. this doctor *has a* patient.
+
+While relational databases are typically accessed using **SQL queries**,
+we are going to use a library to help us translate between Python and the database.
+[SQLAlchemy](https://www.sqlalchemy.org/) is a popular Python library
+which contains an **Object Relational Mapping** (ORM) framework.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## SQLAlchemy
+
+For more information, see SQLAlchemy's [ORM tutorial](https://docs.sqlalchemy.org/en/13/orm/tutorial.html).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Our first step is to install SQLAlchemy, then we can create our first **mapping**.
+
+```bash
+$ python3 -m pip install sqlalchemy
+```
+
+A mapping is the core component of an ORM -
+it describes how to convert between our Python classes and the contents of our database tables.
+Typically, we can take our existing classes
+and convert them into mappings with a little modification,
+so we do not have to start from scratch.
+
+```python
+# file: inflammation/models.py
+from sqlalchemy import Column, create_engine, Integer, String
+from sqlalchemy.ext.declarative import declarative_base
+
+Base = declarative_base()
+
+...
+
+class Patient(Base):
+    __tablename__ = 'patients'
+
+    id = Column(Integer, primary_key=True)
+    name = Column(String)
+
+    def __init__(self, *args, **kwargs):
+        # Remove 'observations' before calling the SQLAlchemy constructor,
+        # as it is not a column of the 'patients' table
+        self.observations = kwargs.pop('observations', [])
+        super().__init__(*args, **kwargs)
+```
+
+Now that we have defined how to translate between our Python class and a database table,
+we need to hook our code up to an actual database.
+
+The library we are using, SQLAlchemy, does everything through a database **engine**.
+This is essentially a wrapper around the real database,
+so we do not have to worry about which particular database software is being used -
+we just need to write code for a generic relational database.
+
+For these lessons we are going to use the SQLite engine
+as this requires almost no configuration and no external software.
+Most relational database software runs as a separate service which we can connect to from our code.
+This means that in a large scale environment,
+we could have the database and our software running on different computers -
+we could even have the database spread across several servers
+if we have particularly high demands for performance or reliability.
+Some examples of databases which are used like this are PostgreSQL, MySQL and MSSQL.
+
+On the other hand, SQLite runs entirely within our software
+and uses only a single file to hold its data.
+It will not give us
+the extremely high performance or reliability of a properly configured PostgreSQL database,
+but it is good enough in many cases and much less work to get running.
+
+Let us write some test code to set up and connect to an SQLite database.
+For now we will store the database in memory rather than an actual file -
+it will not actually allow us to store data after the program finishes,
+but it allows us not to worry about **migrations**.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Migrations
+
+When we make changes to our mapping (e.g. adding / removing columns),
+we need to get the database to update its tables to make sure they match the new format.
+This is what the `Base.metadata.create_all` method does -
+it creates all of these tables from scratch,
+which is fine here because we are using an in-memory database which we know will be removed between runs.
+
+If we are actually storing data persistently,
+we need to make sure that when we change the mapping,
+we update the database tables without damaging any of the data they currently contain.
+We could do this manually,
+by running SQL queries against the tables to get them into the right format,
+but this is error-prone and can be a lot of work.
+
+In practice, we generate a migration for each change.
+Tools such as [Alembic](https://alembic.sqlalchemy.org/en/latest/)
+will compare our mappings to the known state of the database
+and generate a Python file which updates the database to the necessary state.
+
+Migrations can be quite complex, so we will not be using them here -
+but you may find it useful to read about them later.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+```python
+# file: tests/test_models.py
+
+...
+
+from sqlalchemy import create_engine
+from sqlalchemy.orm import sessionmaker
+
+from inflammation.models import Base, Patient
+
+...
+
+def test_sqlalchemy_patient_search():
+    """Test that we can save and retrieve patient data from a database."""
+
+    # Setup a database connection - we are using a database stored in memory here
+    engine = create_engine('sqlite:///:memory:', echo=True)
+    Session = sessionmaker(bind=engine)
+    session = Session()
+    Base.metadata.create_all(engine)
+
+    # Save a patient to the database
+    test_patient = Patient(name='Alice')
+    session.add(test_patient)
+
+    # Search for a patient by name
+    queried_patient = session.query(Patient).filter_by(name='Alice').first()
+    assert queried_patient.name == 'Alice'
+    assert queried_patient.id == 1
+
+    # Wipe our temporary database
+    Base.metadata.drop_all(engine)
+```
+
+For this test, we have imported our models inside the test function,
+rather than at the top of the file like we normally would.
+This is not recommended in normal code,
+as it means we are paying the performance cost of importing every time we run the function,
+but can be useful in test code.
+Since each test function only runs once per test session,
+this performance cost is not as important as it would be for a function we were going to call many times.
+Additionally, if we try to import something which does not exist, it will fail -
+by importing inside the test function,
+we limit this to that specific test failing,
+rather than the whole file failing to run.
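Under the hood, the session calls in this test are translated into ordinary SQL statements (passing `echo=True` to `create_engine` prints them as the test runs). For comparison, a rough equivalent using only the standard library's `sqlite3` module might look like this (a sketch, not part of the test suite):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Roughly what Base.metadata.create_all(engine) emits for our mapping
cursor.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name VARCHAR)")

# session.add(...) followed by a flush becomes an INSERT
cursor.execute("INSERT INTO patients (name) VALUES (?)", ("Alice",))

# session.query(Patient).filter_by(name='Alice').first() becomes a SELECT
cursor.execute("SELECT id, name FROM patients WHERE name = ? LIMIT 1", ("Alice",))
print(cursor.fetchone())  # (1, 'Alice')

connection.close()
```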
+
+We can also use the ORM's `relationship` helper function
+to allow us to go between the observations and patients
+without having to do any of the complicated table joins manually.
+
+```python
+from sqlalchemy import Column, ForeignKey, Integer, String
+from sqlalchemy.ext.declarative import declarative_base
+from sqlalchemy.orm import relationship
+
+...
+
+class Observation(Base):
+    __tablename__ = 'observations'
+
+    id = Column(Integer, primary_key=True)
+    day = Column(Integer)
+    value = Column(Integer)
+    patient_id = Column(Integer, ForeignKey('patients.id'))
+
+    patient = relationship('Patient', back_populates='observations')
+
+
+class Patient(Base):
+    __tablename__ = 'patients'
+
+    id = Column(Integer, primary_key=True)
+    name = Column(String)
+
+    observations = relationship('Observation',
+                                order_by=Observation.day,
+                                back_populates='patient')
+
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Time is Hard
+
+We are using an integer field to store the day on which a measurement was taken.
+This keeps us consistent with what we had previously
+as it is essentially the position of the measurement in the Numpy array.
+It also avoids us having to worry about managing actual date / times.
+
+The Python `datetime` module we have used previously in the Academics example would be useful here,
+and most databases have support for 'date' and 'time' columns,
+but to reduce the complexity, we will just use integers here.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Our test code for this is going to look very similar to our previous test code,
+so we can copy-paste it and make a few changes.
+This time, after setting up the database, we need to add a patient and an observation.
+We then test that we can get the observations from a patient we have searched for.
+
+```python
+# file: tests/test_models.py
+
+from inflammation.models import Base, Observation, Patient
+...
+
+def test_sqlalchemy_observations():
+    """Test that we can save and retrieve inflammation observations from a database."""
+
+    # Setup a database connection - we are using a database stored in memory here
+    engine = create_engine('sqlite:///:memory:', echo=True)
+    Session = sessionmaker(bind=engine)
+    session = Session()
+    Base.metadata.create_all(engine)
+
+    # Save a patient to the database
+    test_patient = Patient(name='Alice')
+    session.add(test_patient)
+
+    test_observation = Observation(patient=test_patient, day=0, value=1)
+    session.add(test_observation)
+
+    queried_patient = session.query(Patient).filter_by(name='Alice').first()
+    first_observation = queried_patient.observations[0]
+    assert first_observation.patient == queried_patient
+    assert first_observation.day == 0
+    assert first_observation.value == 1
+
+    # Wipe our temporary database
+    Base.metadata.drop_all(engine)
+```
+
+Finally, let us put in a way to convert all of our observations into a Numpy array,
+so we can use our previous analysis code.
+We will use the `property` decorator here again,
+to create a method that we can use as if it was a normal data attribute.
+
+```python
+# file: inflammation/models.py
+
+...
+
+class Patient(Base):
+    __tablename__ = 'patients'
+
+    id = Column(Integer, primary_key=True)
+    name = Column(String)
+
+    observations = relationship('Observation',
+                                order_by=Observation.day,
+                                back_populates='patient')
+
+    @property
+    def values(self):
+        """Convert inflammation data into Numpy array."""
+        last_day = self.observations[-1].day
+        values = np.zeros(last_day + 1)
+
+        for observation in self.observations:
+            values[observation.day] = observation.value
+
+        return values
+```
+
+Once again we will copy-paste the test code and make some changes.
+This time we want to create a few observations for our patient
+and test that we can turn them into a Numpy array.
+
+```python
+# file: tests/test_models.py
+from inflammation.models import Base, Observation, Patient
+...
+def test_sqlalchemy_observations_to_array():
+    """Test that we can save and retrieve inflammation observations from a database."""
+
+    # Setup a database connection - we are using a database stored in memory here
+    engine = create_engine('sqlite:///:memory:')
+    Session = sessionmaker(bind=engine)
+    session = Session()
+    Base.metadata.create_all(engine)
+
+    # Save a patient to the database
+    test_patient = Patient(name='Alice')
+    session.add(test_patient)
+
+    for i in range(5):
+        test_observation = Observation(patient=test_patient, day=i, value=i)
+        session.add(test_observation)
+
+    queried_patient = session.query(Patient).filter_by(name='Alice').first()
+    npt.assert_array_equal([0, 1, 2, 3, 4], queried_patient.values)
+
+    # Wipe our temporary database
+    Base.metadata.drop_all(engine)
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Further Array Testing
+
+There is an important feature of the behaviour of our `Patient.values` property
+that's not currently being tested.
+What is this feature?
+Write one or more extra tests to cover this feature.
+
+::::::::::::::: solution
+
+## Hint
+
+The `Patient.values` property creates an array of zeroes,
+then fills it with data from the table.
+If a measurement was not taken on a particular day,
+that day's value will be left as zero.
+
+If this is intended behaviour,
+it would be useful to write a test for it,
+to ensure that we do not break it in future.
+Using tests in this way is known as **regression testing**.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Refactoring for Reduced Redundancy
+
+You have probably noticed that there is a lot of replicated code in our database tests.
+It is fine if some code is replicated a bit,
+but if you keep needing to copy the same code,
+that's a sign it should be refactored.
+
+Refactoring is the process of changing the structure of our code,
+without changing its behaviour,
+and one of the main benefits of good test coverage is that it makes refactoring easier.
+If we have got a good set of tests,
+it is much more likely that we will detect any changes to behaviour -
+even when these changes might be in the tests themselves.
+
+Try refactoring the database tests to see if you can
+reduce the amount of replicated code
+by moving it into one or more functions at the top of the test file.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Advanced Challenge: Connecting More Views
+
+We have added the ability to store patient records in the database,
+but not actually connected it to any useful views.
+There is a common pattern in data management software
+which is often referred to as **CRUD** - Create, Read, Update, Delete.
+These are the four fundamental views that we need to provide
+to allow people to manage their data effectively.
+
+Each of these applies at the level of a single record,
+so for both patients and observations we should have a view to:
+create a new record,
+show an existing record,
+update an existing record
+and delete an existing record.
+It is also sometimes useful to provide a view which lists all existing records for each type -
+for example, a list of all patients would probably be useful,
+but a list of all observations might not be.
+
+Pick one (or several) of these views to implement -
+you may want to refer back to the section where we added our initial patient read view.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Advanced Challenge: Managing Dates Properly
+
+Try converting our existing models to use actual dates instead of just a day number.
+The Python [datetime module documentation](https://docs.python.org/3/library/datetime.html) +and SQLAlchemy [Column and Data Types page](https://docs.sqlalchemy.org/en/13/core/type_basics.html) +will be useful to you here. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + +::: keypoints +- Relational databases are often the best persistence mechanism for data +which fits well to the Object Oriented paradigm. +::: diff --git a/fig/PyCharm_Icon.png b/fig/PyCharm_Icon.png new file mode 100644 index 000000000..7df396f27 Binary files /dev/null and b/fig/PyCharm_Icon.png differ diff --git a/fig/Software_Development_Life_Cycle.jpg b/fig/Software_Development_Life_Cycle.jpg new file mode 100644 index 000000000..516431917 Binary files /dev/null and b/fig/Software_Development_Life_Cycle.jpg differ diff --git a/fig/Visual_Studio_Code_1.35_icon.png b/fig/Visual_Studio_Code_1.35_icon.png new file mode 100644 index 000000000..eb9c933af Binary files /dev/null and b/fig/Visual_Studio_Code_1.35_icon.png differ diff --git a/fig/car-dashboard.jpg b/fig/car-dashboard.jpg new file mode 100644 index 000000000..a21ed1a44 Binary files /dev/null and b/fig/car-dashboard.jpg differ diff --git a/fig/car-engine.jpg b/fig/car-engine.jpg new file mode 100644 index 000000000..aa42645ef Binary files /dev/null and b/fig/car-engine.jpg differ diff --git a/fig/car-fuel.jpg b/fig/car-fuel.jpg new file mode 100644 index 000000000..93011b7c3 Binary files /dev/null and b/fig/car-fuel.jpg differ diff --git a/fig/chef-food.jpg b/fig/chef-food.jpg new file mode 100644 index 000000000..fd3d07572 Binary files /dev/null and b/fig/chef-food.jpg differ diff --git a/fig/ci-ga-build-matrix.png b/fig/ci-ga-build-matrix.png new file mode 100644 index 000000000..c92ac0cdc Binary files /dev/null and b/fig/ci-ga-build-matrix.png differ diff --git a/fig/ci-initial-build-travis.png b/fig/ci-initial-build-travis.png new file mode 100644 index 000000000..1972c5f81 Binary files /dev/null and 
b/fig/ci-initial-build-travis.png differ diff --git a/fig/ci-initial-ga-build-details.png b/fig/ci-initial-ga-build-details.png new file mode 100644 index 000000000..de7437d99 Binary files /dev/null and b/fig/ci-initial-ga-build-details.png differ diff --git a/fig/ci-initial-ga-build-log.png b/fig/ci-initial-ga-build-log.png new file mode 100644 index 000000000..bd0562c31 Binary files /dev/null and b/fig/ci-initial-ga-build-log.png differ diff --git a/fig/ci-initial-ga-build.png b/fig/ci-initial-ga-build.png new file mode 100644 index 000000000..0ed6c4324 Binary files /dev/null and b/fig/ci-initial-ga-build.png differ diff --git a/fig/ci-initial-travis-build-log.png b/fig/ci-initial-travis-build-log.png new file mode 100644 index 000000000..a6fe74858 Binary files /dev/null and b/fig/ci-initial-travis-build-log.png differ diff --git a/fig/ci-travis-permissions.png b/fig/ci-travis-permissions.png new file mode 100644 index 000000000..15bdbb1da Binary files /dev/null and b/fig/ci-travis-permissions.png differ diff --git a/fig/clone-repository.png b/fig/clone-repository.png new file mode 100644 index 000000000..bba875706 Binary files /dev/null and b/fig/clone-repository.png differ diff --git a/fig/code-review-sequence-diagram.svg b/fig/code-review-sequence-diagram.svg new file mode 100644 index 000000000..dd57d1ec8 --- /dev/null +++ b/fig/code-review-sequence-diagram.svg @@ -0,0 +1 @@ +ReviewerAuthorReviewerAuthorloop[Until approved]Write some codeRaise a pull requestAdd comments to a reviewSubmit a reviewAddress or respond to review commentsClarify or resolve commentsApprove pull requestMerge pull request \ No newline at end of file diff --git a/fig/course-overview.svg b/fig/course-overview.svg new file mode 100644 index 000000000..6866a83f3 --- /dev/null +++ b/fig/course-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment
2. Verifying
software correctness
3. Software development
as a process
4. Collaborative
development for reuse
5. Managing software
over its lifetime
\ No newline at end of file diff --git a/fig/example-architecture-daigram.mermaid.txt b/fig/example-architecture-daigram.mermaid.txt new file mode 100644 index 000000000..c3ab99112 --- /dev/null +++ b/fig/example-architecture-daigram.mermaid.txt @@ -0,0 +1,18 @@ +graph TD + A[(GDrive Folder)] + B[(Database)] + C[GDrive Monitor] + C -- Checks periodically--> A + D[Download inflammation data] + C -- Trigger update --> D + E[Parse inflammation data] + D --> E + F[Perform analysis] + E --> F + G[Upload analysis] + F --> G + G --> B + H[Notify users] + I[Monitor database] + I -- Check periodically --> B + I --> H diff --git a/fig/example-architecture-diagram.svg b/fig/example-architecture-diagram.svg new file mode 100644 index 000000000..02a7ecceb --- /dev/null +++ b/fig/example-architecture-diagram.svg @@ -0,0 +1 @@ +
Checks periodically
Trigger update
Check periodically
GDrive Folder
Database
GDrive Monitor
Download inflammation data
Parse inflammation data
Perform analysis
Upload analysis
Notify users
Monitor database
\ No newline at end of file diff --git a/fig/file_explorer.png b/fig/file_explorer.png new file mode 100644 index 000000000..ec1caa2a3 Binary files /dev/null and b/fig/file_explorer.png differ diff --git a/fig/git-distributed.png b/fig/git-distributed.png new file mode 100644 index 000000000..92100c707 Binary files /dev/null and b/fig/git-distributed.png differ diff --git a/fig/git-feature-branch.svg b/fig/git-feature-branch.svg new file mode 100644 index 000000000..ca60846ba --- /dev/null +++ b/fig/git-feature-branch.svg @@ -0,0 +1,439 @@ + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + X + + + Develop + + + + + + + + + + diff --git a/fig/git-lifecycle.svg b/fig/git-lifecycle.svg new file mode 100644 index 000000000..5f420841b --- /dev/null +++ b/fig/git-lifecycle.svg @@ -0,0 +1 @@ +Remote Repository BranchLocal Repository BranchStaging AreaWorking DirectoryRemote Repository BranchLocal Repository BranchStaging AreaWorking Directorygit addgit commitgit pushgit fetchgit mergegit pull (shortcut for git fetch followed by git merge for a 'tracking branch') diff --git a/fig/github-add-collaborator.png b/fig/github-add-collaborator.png new file mode 100644 index 000000000..6b59686ea Binary files /dev/null and b/fig/github-add-collaborator.png differ diff --git a/fig/github-add-collaborators.png b/fig/github-add-collaborators.png new file mode 100644 index 000000000..7c24b4ae9 Binary files /dev/null and b/fig/github-add-collaborators.png differ diff --git a/fig/github-add-emoji.png b/fig/github-add-emoji.png new file mode 100644 index 000000000..34a5062e6 Binary files /dev/null and b/fig/github-add-emoji.png differ diff --git a/fig/github-assign-milestone.png b/fig/github-assign-milestone.png new file mode 100644 index 000000000..6d68d7187 Binary files /dev/null and b/fig/github-assign-milestone.png differ diff --git a/fig/github-board-template.png 
b/fig/github-board-template.png new file mode 100644 index 000000000..aafdf6aab Binary files /dev/null and b/fig/github-board-template.png differ diff --git a/fig/github-branch-protection-settings.png b/fig/github-branch-protection-settings.png new file mode 100644 index 000000000..7e3434a7f Binary files /dev/null and b/fig/github-branch-protection-settings.png differ diff --git a/fig/github-branches.png b/fig/github-branches.png new file mode 100644 index 000000000..2f19200f8 Binary files /dev/null and b/fig/github-branches.png differ diff --git a/fig/github-convert-task-to-issue.png b/fig/github-convert-task-to-issue.png new file mode 100644 index 000000000..62e900a46 Binary files /dev/null and b/fig/github-convert-task-to-issue.png differ diff --git a/fig/github-create-board.png b/fig/github-create-board.png new file mode 100644 index 000000000..907d9033a Binary files /dev/null and b/fig/github-create-board.png differ diff --git a/fig/github-create-milestone.png b/fig/github-create-milestone.png new file mode 100644 index 000000000..5f2c41539 Binary files /dev/null and b/fig/github-create-milestone.png differ diff --git a/fig/github-create-pull-request.png b/fig/github-create-pull-request.png new file mode 100644 index 000000000..8cbf5c7c0 Binary files /dev/null and b/fig/github-create-pull-request.png differ diff --git a/fig/github-develop-branch.png b/fig/github-develop-branch.png new file mode 100644 index 000000000..59769698a Binary files /dev/null and b/fig/github-develop-branch.png differ diff --git a/fig/github-finish-pull-request-review.png b/fig/github-finish-pull-request-review.png new file mode 100644 index 000000000..31e030022 Binary files /dev/null and b/fig/github-finish-pull-request-review.png differ diff --git a/fig/github-fork-repository-confirm.png b/fig/github-fork-repository-confirm.png new file mode 100644 index 000000000..12bb399d6 Binary files /dev/null and b/fig/github-fork-repository-confirm.png differ diff --git 
a/fig/github-fork-repository.png b/fig/github-fork-repository.png new file mode 100644 index 000000000..3eb6cb4e6 Binary files /dev/null and b/fig/github-fork-repository.png differ diff --git a/fig/github-forked-repository-own.png b/fig/github-forked-repository-own.png new file mode 100644 index 000000000..1a4755c2f Binary files /dev/null and b/fig/github-forked-repository-own.png differ diff --git a/fig/github-issue-list.png b/fig/github-issue-list.png new file mode 100644 index 000000000..6b405696c Binary files /dev/null and b/fig/github-issue-list.png differ diff --git a/fig/github-main-branch.png b/fig/github-main-branch.png new file mode 100644 index 000000000..a5da69aa7 Binary files /dev/null and b/fig/github-main-branch.png differ diff --git a/fig/github-manage-access.png b/fig/github-manage-access.png new file mode 100644 index 000000000..0bb1b6b85 Binary files /dev/null and b/fig/github-manage-access.png differ diff --git a/fig/github-merge-pull-request.png b/fig/github-merge-pull-request.png new file mode 100644 index 000000000..e69bdf4fc Binary files /dev/null and b/fig/github-merge-pull-request.png differ diff --git a/fig/github-milestone-in-project-board.png b/fig/github-milestone-in-project-board.png new file mode 100644 index 000000000..1653f31bc Binary files /dev/null and b/fig/github-milestone-in-project-board.png differ diff --git a/fig/github-milestones.png b/fig/github-milestones.png new file mode 100644 index 000000000..72b90ac4c Binary files /dev/null and b/fig/github-milestones.png differ diff --git a/fig/github-new-issue.png b/fig/github-new-issue.png new file mode 100644 index 000000000..0940754bf Binary files /dev/null and b/fig/github-new-issue.png differ diff --git a/fig/github-new-milestone-description.png b/fig/github-new-milestone-description.png new file mode 100644 index 000000000..850585b89 Binary files /dev/null and b/fig/github-new-milestone-description.png differ diff --git a/fig/github-new-project.png 
b/fig/github-new-project.png new file mode 100644 index 000000000..98f123e04 Binary files /dev/null and b/fig/github-new-project.png differ diff --git a/fig/github-project-add-view.png b/fig/github-project-add-view.png new file mode 100644 index 000000000..0301f8deb Binary files /dev/null and b/fig/github-project-add-view.png differ diff --git a/fig/github-project-description.png b/fig/github-project-description.png new file mode 100644 index 000000000..982619563 Binary files /dev/null and b/fig/github-project-description.png differ diff --git a/fig/github-project-new-items.png b/fig/github-project-new-items.png new file mode 100644 index 000000000..68d7e8a07 Binary files /dev/null and b/fig/github-project-new-items.png differ diff --git a/fig/github-project-settings.png b/fig/github-project-settings.png new file mode 100644 index 000000000..02c80e14f Binary files /dev/null and b/fig/github-project-settings.png differ diff --git a/fig/github-project-template.png b/fig/github-project-template.png new file mode 100644 index 000000000..ddb43d8ac Binary files /dev/null and b/fig/github-project-template.png differ diff --git a/fig/github-project-view-add-remove-items.png b/fig/github-project-view-add-remove-items.png new file mode 100644 index 000000000..c705b9c85 Binary files /dev/null and b/fig/github-project-view-add-remove-items.png differ diff --git a/fig/github-pull-request-add-comment.png b/fig/github-pull-request-add-comment.png new file mode 100644 index 000000000..56d127034 Binary files /dev/null and b/fig/github-pull-request-add-comment.png differ diff --git a/fig/github-pull-request-add-suggestion.png b/fig/github-pull-request-add-suggestion.png new file mode 100644 index 000000000..61326106d Binary files /dev/null and b/fig/github-pull-request-add-suggestion.png differ diff --git a/fig/github-pull-request-files-changed.png b/fig/github-pull-request-files-changed.png new file mode 100644 index 000000000..9a7393985 Binary files /dev/null and 
b/fig/github-pull-request-files-changed.png differ diff --git a/fig/github-pull-request-tab.png b/fig/github-pull-request-tab.png new file mode 100644 index 000000000..9a5c7751d Binary files /dev/null and b/fig/github-pull-request-tab.png differ diff --git a/fig/github-reference-comments-commits.png b/fig/github-reference-comments-commits.png new file mode 100644 index 000000000..63378f6d5 Binary files /dev/null and b/fig/github-reference-comments-commits.png differ diff --git a/fig/github-respond-to-review-comment-with-commit-link.png b/fig/github-respond-to-review-comment-with-commit-link.png new file mode 100644 index 000000000..5ef00d4d2 Binary files /dev/null and b/fig/github-respond-to-review-comment-with-commit-link.png differ diff --git a/fig/github-respond-to-review-comment-with-emoji.png b/fig/github-respond-to-review-comment-with-emoji.png new file mode 100644 index 000000000..7d6124ad4 Binary files /dev/null and b/fig/github-respond-to-review-comment-with-emoji.png differ diff --git a/fig/github-settings.png b/fig/github-settings.png new file mode 100644 index 000000000..c2be1daaa Binary files /dev/null and b/fig/github-settings.png differ diff --git a/fig/github-submit-pull-request-review.png b/fig/github-submit-pull-request-review.png new file mode 100644 index 000000000..65321aef2 Binary files /dev/null and b/fig/github-submit-pull-request-review.png differ diff --git a/fig/github-submit-pull-request.png b/fig/github-submit-pull-request.png new file mode 100644 index 000000000..59f7cce2e Binary files /dev/null and b/fig/github-submit-pull-request.png differ diff --git a/fig/inflammation-dataset.svg b/fig/inflammation-dataset.svg new file mode 100644 index 000000000..b33f577cc --- /dev/null +++ b/fig/inflammation-dataset.svg @@ -0,0 +1,398 @@ + + + + + + + + + Day 1 + Day 2 + Day 3 + Day 4 + Day 5 + Day 6 + Day 7 + Patients + + + + + + + + + + + + + + + + + + + + + + + + 0013124 + + + + + 0121213 + + + + + 0113326 + + + + + 0020422 + + + + + 0113313 + 
+ + Inflammation data + + + + + diff --git a/fig/inflammation-study-pipeline.png b/fig/inflammation-study-pipeline.png new file mode 100644 index 000000000..636df7cc7 Binary files /dev/null and b/fig/inflammation-study-pipeline.png differ diff --git a/fig/intro-diagrams.xcf b/fig/intro-diagrams.xcf new file mode 100644 index 000000000..219f9360b Binary files /dev/null and b/fig/intro-diagrams.xcf differ diff --git a/fig/mva-diagram.png b/fig/mva-diagram.png new file mode 100644 index 000000000..7c065dc45 Binary files /dev/null and b/fig/mva-diagram.png differ diff --git a/fig/mvc-DNA-guide-CLI.png b/fig/mvc-DNA-guide-CLI.png new file mode 100644 index 000000000..f12d8e23e Binary files /dev/null and b/fig/mvc-DNA-guide-CLI.png differ diff --git a/fig/mvc-DNA-guide-GUI.png b/fig/mvc-DNA-guide-GUI.png new file mode 100644 index 000000000..a924b03d3 Binary files /dev/null and b/fig/mvc-DNA-guide-GUI.png differ diff --git a/fig/mvc-car.png b/fig/mvc-car.png new file mode 100644 index 000000000..a95f843cb Binary files /dev/null and b/fig/mvc-car.png differ diff --git a/fig/mvc-car.xcf b/fig/mvc-car.xcf new file mode 100644 index 000000000..768ed28a7 Binary files /dev/null and b/fig/mvc-car.xcf differ diff --git a/fig/mvc-restaurant.png b/fig/mvc-restaurant.png new file mode 100644 index 000000000..0126dd2a5 Binary files /dev/null and b/fig/mvc-restaurant.png differ diff --git a/fig/mvc-restaurant.xcf b/fig/mvc-restaurant.xcf new file mode 100644 index 000000000..9706a5569 Binary files /dev/null and b/fig/mvc-restaurant.xcf differ diff --git a/fig/numpy-incompatible-shapes.png b/fig/numpy-incompatible-shapes.png new file mode 100644 index 000000000..dc6f32f60 Binary files /dev/null and b/fig/numpy-incompatible-shapes.png differ diff --git a/fig/numpy-shapes-after-broadcasting.png b/fig/numpy-shapes-after-broadcasting.png new file mode 100644 index 000000000..f62096753 Binary files /dev/null and b/fig/numpy-shapes-after-broadcasting.png differ diff --git 
a/fig/numpy-shapes-after-new-axis.png b/fig/numpy-shapes-after-new-axis.png new file mode 100644 index 000000000..fe9e900a3 Binary files /dev/null and b/fig/numpy-shapes-after-new-axis.png differ diff --git a/fig/paradigms.png b/fig/paradigms.png new file mode 100644 index 000000000..9275f7b8c Binary files /dev/null and b/fig/paradigms.png differ diff --git a/fig/pycharm-add-library.png b/fig/pycharm-add-library.png new file mode 100644 index 000000000..c93f2f9e1 Binary files /dev/null and b/fig/pycharm-add-library.png differ diff --git a/fig/pycharm-add-run-configuration.png b/fig/pycharm-add-run-configuration.png new file mode 100644 index 000000000..ffe0f950b Binary files /dev/null and b/fig/pycharm-add-run-configuration.png differ diff --git a/fig/pycharm-code-completion.png b/fig/pycharm-code-completion.png new file mode 100644 index 000000000..1f78f2aff Binary files /dev/null and b/fig/pycharm-code-completion.png differ diff --git a/fig/pycharm-code-reference.png b/fig/pycharm-code-reference.png new file mode 100644 index 000000000..2b4b9bdb4 Binary files /dev/null and b/fig/pycharm-code-reference.png differ diff --git a/fig/pycharm-code-search.png b/fig/pycharm-code-search.png new file mode 100644 index 000000000..025049ee2 Binary files /dev/null and b/fig/pycharm-code-search.png differ diff --git a/fig/pycharm-configuring-interpreter.png b/fig/pycharm-configuring-interpreter.png new file mode 100644 index 000000000..e36d96c27 Binary files /dev/null and b/fig/pycharm-configuring-interpreter.png differ diff --git a/fig/pycharm-find-panel.png b/fig/pycharm-find-panel.png new file mode 100644 index 000000000..26023c06e Binary files /dev/null and b/fig/pycharm-find-panel.png differ diff --git a/fig/pycharm-indentation.png b/fig/pycharm-indentation.png new file mode 100644 index 000000000..269eb1737 Binary files /dev/null and b/fig/pycharm-indentation.png differ diff --git a/fig/pycharm-installed-packages.png b/fig/pycharm-installed-packages.png new file mode 
100644 index 000000000..2940de150 Binary files /dev/null and b/fig/pycharm-installed-packages.png differ diff --git a/fig/pycharm-missing-python-interpreter.png b/fig/pycharm-missing-python-interpreter.png new file mode 100644 index 000000000..17673de22 Binary files /dev/null and b/fig/pycharm-missing-python-interpreter.png differ diff --git a/fig/pycharm-open-project.png b/fig/pycharm-open-project.png new file mode 100644 index 000000000..49ded5759 Binary files /dev/null and b/fig/pycharm-open-project.png differ diff --git a/fig/pycharm-run-configuration-popup.png b/fig/pycharm-run-configuration-popup.png new file mode 100644 index 000000000..a51b7fccc Binary files /dev/null and b/fig/pycharm-run-configuration-popup.png differ diff --git a/fig/pycharm-run-script.png b/fig/pycharm-run-script.png new file mode 100644 index 000000000..19a612c13 Binary files /dev/null and b/fig/pycharm-run-script.png differ diff --git a/fig/pycharm-syntax-highlighting.png b/fig/pycharm-syntax-highlighting.png new file mode 100644 index 000000000..618929785 Binary files /dev/null and b/fig/pycharm-syntax-highlighting.png differ diff --git a/fig/pycharm-test-framework.png b/fig/pycharm-test-framework.png new file mode 100644 index 000000000..4b43bb887 Binary files /dev/null and b/fig/pycharm-test-framework.png differ diff --git a/fig/pycharm-version-control.png b/fig/pycharm-version-control.png new file mode 100644 index 000000000..e805c1c55 Binary files /dev/null and b/fig/pycharm-version-control.png differ diff --git a/fig/pycharm-whitespace.png b/fig/pycharm-whitespace.png new file mode 100644 index 000000000..16831209b Binary files /dev/null and b/fig/pycharm-whitespace.png differ diff --git a/fig/pytest-pycharm-all-tests-pass.png b/fig/pytest-pycharm-all-tests-pass.png new file mode 100644 index 000000000..732ba79be Binary files /dev/null and b/fig/pytest-pycharm-all-tests-pass.png differ diff --git a/fig/pytest-pycharm-check-config.png b/fig/pytest-pycharm-check-config.png new 
file mode 100644 index 000000000..a9d4eb841 Binary files /dev/null and b/fig/pytest-pycharm-check-config.png differ diff --git a/fig/pytest-pycharm-console.png b/fig/pytest-pycharm-console.png new file mode 100644 index 000000000..8697e3bbb Binary files /dev/null and b/fig/pytest-pycharm-console.png differ diff --git a/fig/pytest-pycharm-debug.png b/fig/pytest-pycharm-debug.png new file mode 100644 index 000000000..247f017d0 Binary files /dev/null and b/fig/pytest-pycharm-debug.png differ diff --git a/fig/pytest-pycharm-run-single-test.png b/fig/pytest-pycharm-run-single-test.png new file mode 100644 index 000000000..85f6f71dc Binary files /dev/null and b/fig/pytest-pycharm-run-single-test.png differ diff --git a/fig/pytest-pycharm-run-tests.png b/fig/pytest-pycharm-run-tests.png new file mode 100644 index 000000000..86a9ee27d Binary files /dev/null and b/fig/pytest-pycharm-run-tests.png differ diff --git a/fig/pytest-pycharm-set-breakpoint.png b/fig/pytest-pycharm-set-breakpoint.png new file mode 100644 index 000000000..69d76cbc8 Binary files /dev/null and b/fig/pytest-pycharm-set-breakpoint.png differ diff --git a/fig/python-environment-hell.png b/fig/python-environment-hell.png new file mode 100644 index 000000000..4f482458d Binary files /dev/null and b/fig/python-environment-hell.png differ diff --git a/fig/section1-overview.svg b/fig/section1-overview.svg new file mode 100644 index 000000000..0d7f9f868 --- /dev/null +++ b/fig/section1-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment

- Isolate and run code: command line, virtual environment & IDE
- Version control and share code: Git & GitHub
- Write well-written code: PEP8
2. Verifying
software correctness
3. Software development
as a process
4. Collaborative
development for reuse
5. Managing software
over its lifetime
\ No newline at end of file diff --git a/fig/section2-overview.svg b/fig/section2-overview.svg new file mode 100644 index 000000000..c58603de0 --- /dev/null +++ b/fig/section2-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment
2. Verifying
software correctness

- Test frameworks
- Automate and scale testing: CI and GitHub Actions
- Debug code
3. Software development
as a process
4. Collaborative
development for reuse
5. Managing software
over its lifetime
\ No newline at end of file diff --git a/fig/section3-overview.svg b/fig/section3-overview.svg new file mode 100644 index 000000000..3eb7e7edd --- /dev/null +++ b/fig/section3-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment
2. Verifying
software correctness
3. Software development
as a process

- Software requirements
- Software architecture & design
- Programming paradigms
4. Collaborative
development for reuse
5. Managing software
over its lifetime
\ No newline at end of file diff --git a/fig/section4-overview.svg b/fig/section4-overview.svg new file mode 100644 index 000000000..e747024cf --- /dev/null +++ b/fig/section4-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment
2. Verifying
software correctness
3. Software development
as a process
4. Collaborative
development for reuse

- Code review
- Software documentation
- Software packaging & release
5. Managing software
over its lifetime
\ No newline at end of file diff --git a/fig/section5-overview.svg b/fig/section5-overview.svg new file mode 100644 index 000000000..2f22871a5 --- /dev/null +++ b/fig/section5-overview.svg @@ -0,0 +1 @@ +
1. Setting up
software environment
2. Verifying
software correctness
3. Software development
as a process
4. Collaborative
development for reuse
5. Managing software
over its lifetime

- Issue reporting & prioritisation
- Agile development in sprints
- Software project management
\ No newline at end of file diff --git a/fig/use_env.png b/fig/use_env.png new file mode 100644 index 000000000..b1c1facb4 Binary files /dev/null and b/fig/use_env.png differ diff --git a/fig/vim-vs-emacs.png b/fig/vim-vs-emacs.png new file mode 100644 index 000000000..18042a369 Binary files /dev/null and b/fig/vim-vs-emacs.png differ diff --git a/fig/vs-code-extensions.png b/fig/vs-code-extensions.png new file mode 100644 index 000000000..4b84dd0a9 Binary files /dev/null and b/fig/vs-code-extensions.png differ diff --git a/fig/vs-code-install-linter-extension.png b/fig/vs-code-install-linter-extension.png new file mode 100644 index 000000000..8eb09766c Binary files /dev/null and b/fig/vs-code-install-linter-extension.png differ diff --git a/fig/vs-code-linter-problems-pane-annotated.png b/fig/vs-code-linter-problems-pane-annotated.png new file mode 100644 index 000000000..68618807d Binary files /dev/null and b/fig/vs-code-linter-problems-pane-annotated.png differ diff --git a/fig/vs-code-python-extension.png b/fig/vs-code-python-extension.png new file mode 100644 index 000000000..af42ebd93 Binary files /dev/null and b/fig/vs-code-python-extension.png differ diff --git a/fig/vs-code-run-script.png b/fig/vs-code-run-script.png new file mode 100644 index 000000000..711e000e0 Binary files /dev/null and b/fig/vs-code-run-script.png differ diff --git a/fig/vs-code-run-test.png b/fig/vs-code-run-test.png new file mode 100644 index 000000000..7eca3e184 Binary files /dev/null and b/fig/vs-code-run-test.png differ diff --git a/fig/vs-code-test-explorer.png b/fig/vs-code-test-explorer.png new file mode 100644 index 000000000..437f5cdb5 Binary files /dev/null and b/fig/vs-code-test-explorer.png differ diff --git a/fig/vs-code-virtual-env-indicator.png b/fig/vs-code-virtual-env-indicator.png new file mode 100644 index 000000000..f60b81d21 Binary files /dev/null and b/fig/vs-code-virtual-env-indicator.png differ diff --git a/fig/vs-code.png b/fig/vs-code.png new file mode 
100644 index 000000000..9591453d7 Binary files /dev/null and b/fig/vs-code.png differ diff --git a/fig/waiter-food.png b/fig/waiter-food.png new file mode 100644 index 000000000..e84d0c6d7 Binary files /dev/null and b/fig/waiter-food.png differ diff --git a/fig/wrapup-concept-map-difficulty-level.png b/fig/wrapup-concept-map-difficulty-level.png new file mode 100644 index 000000000..d47ac0e6d Binary files /dev/null and b/fig/wrapup-concept-map-difficulty-level.png differ diff --git a/fig/wrapup-concept-map.png b/fig/wrapup-concept-map.png new file mode 100644 index 000000000..4a60ecb4b Binary files /dev/null and b/fig/wrapup-concept-map.png differ diff --git a/fig/wrapup-perceived-usefulness-time.png b/fig/wrapup-perceived-usefulness-time.png new file mode 100644 index 000000000..21ed648a9 Binary files /dev/null and b/fig/wrapup-perceived-usefulness-time.png differ diff --git a/fig/xkcd-good-code-comic.png b/fig/xkcd-good-code-comic.png new file mode 100644 index 000000000..627e9e703 Binary files /dev/null and b/fig/xkcd-good-code-comic.png differ diff --git a/functional-programming.md b/functional-programming.md new file mode 100644 index 000000000..3fb21a7ed --- /dev/null +++ b/functional-programming.md @@ -0,0 +1,879 @@ +--- +title: "Extra Content: Functional Programming" +teaching: 30 +exercises: 30 +--- + +::: questions +- What is functional programming? +- Which situations/problems is functional programming well suited for? +::: + +::: objectives +- Describe the core concepts that define the functional programming paradigm +- Describe the main characteristics of code that is written in functional programming + style +- Learn how to generate and process data collections efficiently using MapReduce and + Python's comprehensions +::: + +Functional programming is a programming paradigm where +programs are constructed by applying and composing/chaining **functions**. 
+Functional programming is based on the +[mathematical definition of a function](https://en.wikipedia.org/wiki/Function_\(mathematics\)) +`f()`, +which applies a transformation to some input data giving us some other data as a result +(i.e. a mapping from input `x` to output `f(x)`). +Thus, a program written in a functional style becomes a series of transformations on data +which are performed to produce a desired output. +Each function (transformation) taken by itself is simple and straightforward to understand; +complexity is handled by composing functions in various ways. + +Often when we use the term function we are referring to +a construct containing a block of code which performs a particular task and can be reused. +We have already seen this in procedural programming - +so how are functions in functional programming different? +The key difference is that functional programming is focussed on +**what** transformations are done to the data, +rather than **how** these transformations are performed +(i.e. a detailed sequence of steps which update the state of the code to reach a desired state). +Let us compare and contrast examples of these two programming paradigms. + +## Functional vs Procedural Programming + +The following two code examples implement the calculation of a factorial +in procedural and functional styles, respectively. +Recall that the factorial of a number `n` (denoted by `n!`) is calculated as +the product of integer numbers from 1 to `n`. + +The first example provides a procedural style factorial function. + +```python +def factorial(n): + """Calculate the factorial of a given number. 
+
+    :param int n: The number to calculate the factorial of
+    :return: The resultant factorial
+    """
+    if n < 0:
+        raise ValueError('Only use non-negative integers.')
+
+    factorial = 1
+    for i in range(1, n + 1): # iterate from 1 to n
+        # save intermediate value to use in the next iteration
+        factorial = factorial * i
+
+    return factorial
+```
+
+Functions in procedural programming are *procedures* that describe
+a detailed list of instructions to tell the computer what to do step by step
+and how to change the state of the program and advance towards the result.
+They often use *iteration* to repeat a series of steps.
+Functional programming, on the other hand, typically uses *recursion* -
+the ability of a function to call/repeat itself until a particular condition is reached.
+Let us see how it is used in the functional programming example below
+to achieve a similar effect to that of iteration in procedural programming.
+
+```python
+# Functional style factorial function
+def factorial(n):
+    """Calculate the factorial of a given number.
+
+    :param int n: The number to calculate the factorial of
+    :return: The resultant factorial
+    """
+    if n < 0:
+        raise ValueError('Only use non-negative integers.')
+
+    if n == 0 or n == 1:
+        return 1 # exit from recursion, prevents infinite loops
+    else:
+        return n * factorial(n-1) # recursive call to the same function
+```
+
+***Note:** You may have noticed that both functions in the above code examples have the same signature
+(i.e. they take an integer number as input and return its factorial as output).
+You could easily swap these equivalent implementations
+without changing the way that the function is invoked.
+Remember, a single piece of software may well contain instances of multiple programming paradigms -
+including procedural, functional and object-oriented -
+it is up to you to decide which one to use and when to switch
+based on the problem at hand and your personal coding style.*
+
+Functional computations only rely on the values that are provided as inputs to a function
+and not on the state of the program that precedes the function call.
+They do not modify data that exists outside the current function, including the input data -
+this property is referred to as the *immutability of data*.
+This means that such functions do not create any *side effects*,
+i.e. do not perform any action that affects anything other than the value they return.
+For example: printing text,
+writing to a file,
+modifying the value of an input argument,
+or changing the value of a global variable.
+Functions without side effects
+that return the same data each time the same input arguments are provided
+are called *pure functions*.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Pure Functions
+
+Which of these functions are pure?
+If you are not sure, explain your reasoning to someone else; do they agree?
+
+```python
+def add_one(x):
+    return x + 1
+
+def say_hello(name):
+    print('Hello', name)
+
+def append_item_1(a_list, item):
+    a_list += [item]
+    return a_list
+
+def append_item_2(a_list, item):
+    result = a_list + [item]
+    return result
+```
+
+::::::::::::::: solution
+
+## Solution
+
+1. `add_one` is pure - it has no effects other than to return a value and this value will always be the same when given the same inputs
+2. `say_hello` is not pure - printing text counts as a side effect, even though it is the clear purpose of the function
+3. `append_item_1` is not pure - the argument `a_list` gets modified as a side effect - try this yourself to prove it
+4. `append_item_2` is pure - the result is a new variable, so this time `a_list` does not get modified - again, try this yourself
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Benefits of Functional Code
+
+There are a few benefits we get when working with pure functions:
+
+- Testability
+- Composability
+- Parallelisability
+
+**Testability** indicates how easy it is to test the function - usually meaning unit tests.
+It is much easier to test a function if we can be certain that
+a particular input will always produce the same output.
+If a function we are testing might have different results each time it runs
+(e.g. a function that generates random numbers drawn from a normal distribution),
+we need to come up with a new way to test it.
+Similarly, it can be more difficult to test a function with side effects
+as it is not always obvious what the side effects will be, or how to measure them.
+
+**Composability** refers to the ability to make a new function from a chain of other functions
+by piping the output of one as the input to the next.
+If a function does not have side effects or non-deterministic behaviour,
+then all of its behaviour is reflected in the value it returns.
+As a consequence of this, any chain of combined pure functions is itself pure,
+so we keep all these benefits when we are combining functions into a larger program.
+As an example of this, we could make a function called `add_two`,
+using the `add_one` function we already have.
+
+```python
+def add_two(x):
+    return add_one(add_one(x))
+```
+
+**Parallelisability** is the ability for operations to be performed at the same time (independently).
+If we know that a function is fully pure and we have got a lot of data,
+we can often improve performance by
+splitting data and distributing the computation across multiple processors.
+The output of a pure function depends only on its input,
+so we will get the right result regardless of when or where the code runs.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Everything in Moderation
+
+Despite the benefits that pure functions can bring,
+we should not be trying to use them everywhere.
+Any software we write needs to interact with the rest of the world somehow,
+which requires side effects.
+With pure functions you cannot read any input, write any output,
+or interact with the rest of the world in any way,
+so we cannot usually write useful software using just pure functions.
+Python programs or libraries written in functional style will usually not
+go so far as to completely avoid reading input, writing output,
+updating the state of internal local variables, etc.;
+instead, they will provide a functional-appearing interface
+but may use non-functional features internally.
+An example of this is the [Python Pandas library](https://pandas.pydata.org/)
+for data manipulation built on top of NumPy -
+most of its functions appear pure
+as they return new data objects instead of changing existing ones.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+There are other advantageous properties that can be derived from the functional approach to coding.
+In languages which support functional programming,
+a function is a *first-class object* like any other object -
+not only can you compose/chain functions together,
+but functions can also be passed as inputs to other functions,
+stored and passed around like any other value,
+or returned as results from other functions
+(remember, in functional programming *code is data*).
+This is why functional programming is suitable for processing data efficiently -
+in particular in the world of Big Data, where code is much smaller than the data,
+so sending the code to where the data is located is cheaper and faster than the other way round.
+Let us see how we can do data processing using functional programming. 
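To make the "functions as first-class objects" idea concrete, here is a small illustrative sketch (the `compose` helper and the example functions are our own, not part of the course codebase) showing functions being passed into, and returned from, another function:

```python
def compose(f, g):
    """Return a new function that applies g first, then f."""
    def composed(x):
        return f(g(x))
    return composed

def add_one(x):
    return x + 1

def double(x):
    return 2 * x

# compose() receives two functions as inputs and returns a new function as its result
double_then_add_one = compose(add_one, double)
print(double_then_add_one(10))  # prints 21
```

Because `compose` only combines pure functions, the resulting `double_then_add_one` is itself pure - this is the composability property described above.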
+
+## MapReduce Data Processing Approach
+
+When working with data you will often find that you need to
+apply a transformation to each datapoint of a dataset
+and then perform some aggregation across the whole dataset.
+One instance of this data processing approach is known as MapReduce
+and is commonly applied when processing (but is not limited to) Big Data,
+e.g. using tools such as [Spark](https://en.wikipedia.org/wiki/Apache_Spark)
+or [Hadoop](https://hadoop.apache.org/).
+The name MapReduce comes from applying an operation to (mapping) each value in a dataset,
+then performing a reduction operation which
+collects/aggregates all the individual results together to produce a single result.
+MapReduce relies heavily on composability and parallelisability of functional programming -
+both map and reduce can be done in parallel and on smaller subsets of data,
+before aggregating all intermediate results into the final result.
+
+### Mapping
+
+`map(f, C)` is a function that takes another function `f()` and a collection `C` of data items as inputs.
+Calling `map(f, C)` applies the function `f(x)` to every data item `x` in a collection `C`
+and returns the resulting values as a new collection of the same size.
+
+This is a simple mapping that takes a list of names and
+returns a list of the lengths of those names using the built-in function `len()`:
+
+```python
+name_lengths = map(len, ["Mary", "Isla", "Sam"])
+print(list(name_lengths))
+```
+
+```output
+[4, 4, 3]
+```
+
+This is a mapping that squares every number in the passed collection using an anonymous,
+inlined *lambda* expression (a simple one-line mathematical expression representing a function):
+
+```python
+squares = map(lambda x: x * x, [0, 1, 2, 3, 4])
+print(list(squares))
+```
+
+```output
+[0, 1, 4, 9, 16]
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Lambda
+
+Lambda expressions are used to create anonymous functions that can be used to
+write more compact programs by inlining function code. 
+A lambda expression takes any number of input parameters and
+creates an anonymous function that returns the value of the expression.
+So, we can use the short, one-line `lambda x, y, z, ...: expression` code
+instead of defining and calling a named function `f()` as follows:
+
+```python
+def f(x, y, z, ...):
+    return expression
+```
+
+The major distinction between lambda functions and 'normal' functions is that
+lambdas do not have names.
+We could give a name to a lambda expression if we really wanted to -
+but at that point we should be using a 'normal' Python function instead.
+
+```python
+# Do not do this
+add_one = lambda x: x + 1
+
+# Do this instead
+def add_one(x):
+    return x + 1
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+In addition to using built-in or inlining anonymous lambda functions,
+we can also pass a named function that we have defined ourselves to the `map()` function.
+
+```python
+def add_one(num):
+    return num + 1
+
+result = map(add_one, [0, 1, 2])
+print(list(result))
+```
+
+```output
+[1, 2, 3]
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Check Inflammation Patient Data Against A Threshold Using Map
+
+Write a new function called `daily_above_threshold()` in our inflammation `models.py` that
+determines whether or not each daily inflammation value for a given patient
+exceeds a given threshold.
+
+Given a patient row number in our data, the patient dataset itself, and a given threshold,
+write the function to use `map()` to generate and return a list of booleans,
+with each value representing whether or not the daily inflammation value for that patient
+exceeded the given threshold.
+
+Ordinarily we would use NumPy's own vectorised operations for this,
+but for this exercise, let us try a solution without them. 
+
+::::::::::::::: solution
+
+## Solution
+
+```python
+def daily_above_threshold(patient_num, data, threshold):
+    """Determine whether or not each daily inflammation value exceeds a given threshold for a given patient.
+
+    :param patient_num: The patient row number
+    :param data: A 2D data array with inflammation data
+    :param threshold: An inflammation threshold to check each daily value against
+    :returns: A boolean list representing whether or not the patient's daily inflammation exceeded the threshold
+    """
+
+    return list(map(lambda x: x > threshold, data[patient_num]))
+```
+
+***Note:** The `map()` function returns a map iterator object
+which needs to be converted to a collection object
+(such as a list, dictionary, set, tuple)
+using the corresponding "factory" function (in our case `list()`).*
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+#### Comprehensions for Mapping/Data Generation
+
+Another way you can generate new collections of data from existing collections in Python is
+using *comprehensions*,
+which are an elegant and concise way of creating data from
+[iterable objects](https://www.w3schools.com/python/python_iterators.asp) using *for loops*.
+While not a pure functional concept,
+comprehensions provide data generation functionality
+and can be used to achieve the same effect as the built-in "pure functional" function `map()`.
+They are commonly used and actually recommended as a replacement for `map()` in modern Python.
+Let us have a look at some examples.
+
+```python
+integers = range(5)
+double_ints = [2 * i for i in integers]
+
+print(double_ints)
+```
+
+```output
+[0, 2, 4, 6, 8]
+```
+
+The above example uses a *list comprehension* to double each number in a sequence.
+Notice the similarity between the syntax for a list comprehension and a for loop -
+in effect, this is a for loop compressed into a single line. 
+In this simple case, the code above is equivalent to using a map operation on a sequence,
+as shown below:
+
+```python
+integers = range(5)
+double_ints = map(lambda i: 2 * i, integers)
+print(list(double_ints))
+```
+
+```output
+[0, 2, 4, 6, 8]
+```
+
+We can also use list comprehensions to filter data, by adding the filter condition to the end:
+
+```python
+double_even_ints = [2 * i for i in integers if i % 2 == 0]
+print(double_even_ints)
+```
+
+```output
+[0, 4, 8]
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Set and Dictionary Comprehensions and Generators
+
+We also have *set comprehensions* and *dictionary comprehensions*,
+which look similar to list comprehensions
+but use the set literal and dictionary literal syntax, respectively.
+
+```python
+double_even_int_set = {2 * i for i in integers if i % 2 == 0}
+print(double_even_int_set)
+
+double_even_int_dict = {i: 2 * i for i in integers if i % 2 == 0}
+print(double_even_int_dict)
+```
+
+```output
+{0, 4, 8}
+{0: 0, 2: 4, 4: 8}
+```
+
+Finally, there is one last 'comprehension' in Python - a *generator expression* -
+a type of iterable object which we can take values from and loop over,
+but which does not actually compute any of the values until we need them.
+Iterable is the generic term for anything we can loop or iterate over -
+lists, sets and dictionaries are all iterables.
+
+The `range` function is an example of this kind of lazy evaluation -
+if we create a `range(1000000000)` but do not iterate over it,
+we will find that it takes almost no time to create.
+Creating a list containing a similar number of values would take much longer,
+and could be at risk of running out of memory.
+
+We can build our own generators using a generator expression.
+These look much like the comprehensions above,
+but act like a generator when we use them.
+Note the syntax difference for generator expressions -
+parentheses are used in place of square or curly brackets. 
+
+```python
+doubles_generator = (2 * i for i in integers)
+for x in doubles_generator:
+    print(x)
+```
+
+```output
+0
+2
+4
+6
+8
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+Let us now have a look at reducing the elements of a data collection into a single result.
+
+### Reducing
+
+The `reduce(f, C, initialiser)` function accepts a function `f()`,
+a collection `C` of data items
+and an optional `initialiser`,
+and returns a single cumulative value which
+aggregates (reduces) all the values from the collection into a single result.
+The reduction function first applies the function `f()` to the first two values in the collection
+(or to the `initialiser`, if present, and the first item from `C`).
+Then for each remaining value in the collection,
+it takes the result of the previous computation
+and the next value from the collection as the new arguments to `f()`
+until we have processed all of the data and reduced it to a single value.
+For example, if collection `C` has 5 elements, the call `reduce(f, C)` calculates:
+
+```
+f(f(f(f(C[0], C[1]), C[2]), C[3]), C[4])
+```
+
+One example of reducing would be to calculate the product of a sequence of numbers.
+
+```python
+from functools import reduce
+
+sequence = [1, 2, 3, 4]
+
+def product(a, b):
+    return a * b
+
+print(reduce(product, sequence))
+
+# The same reduction using a lambda function
+print(reduce((lambda a, b: a * b), sequence))
+```
+
+```output
+24
+24
+```
+
+Note that `reduce()` is not a built-in function like `map()` -
+you need to import it from the `functools` library.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Calculate the Sum of a Sequence of Numbers Using Reduce
+
+Using `reduce()`, calculate the sum of a sequence of numbers.
+Although in practice we would use the built-in `sum()` function for this - try doing it without it. 
+ +::::::::::::::: solution + +## Solution + +```python +from functools import reduce + +sequence = [1, 2, 3, 4] + +def add(a, b): + return a + b + +print(reduce(add, sequence)) + +# The same reduction using a lambda function +print(reduce((lambda a, b: a + b), sequence)) +``` + +```output +10 +10 +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Putting It All Together + +Let us now put together what we have learned about map and reduce so far +by writing a function that calculates the sum of the squares of the values in a list +using the MapReduce approach. + +```python +from functools import reduce + +def sum_of_squares(sequence): + squares = [x * x for x in sequence] # use list comprehension for mapping + return reduce(lambda a, b: a + b, squares) +``` + +We should see the following behaviour when we use it: + +```python +print(sum_of_squares([0])) +print(sum_of_squares([1])) +print(sum_of_squares([1, 2, 3])) +print(sum_of_squares([-1])) +print(sum_of_squares([-1, -2, -3])) +``` + +```output +0 +1 +14 +1 +14 +``` + +Now let's assume we're reading in these numbers from an input file, +so they arrive as a list of strings. +We will modify the function so that it passes the following tests: + +```python +print(sum_of_squares(['1', '2', '3'])) +print(sum_of_squares(['-1', '-2', '-3'])) +``` + +```output +14 +14 +``` + +The code may look like: + +```python +from functools import reduce + +def sum_of_squares(sequence): + integers = [int(x) for x in sequence] + squares = [x * x for x in integers] + return reduce(lambda a, b: a + b, squares) +``` + +Finally, like comments in Python, we'd like it to be possible for users to +comment out numbers in the input file they give to our program. 
+We will finally extend our function so that the following tests pass:
+
+```python
+print(sum_of_squares(['1', '2', '3']))
+print(sum_of_squares(['-1', '-2', '-3']))
+print(sum_of_squares(['1', '2', '#100', '3']))
+```
+
+```output
+14
+14
+14
+```
+
+To do so, we may filter out certain elements and have:
+
+```python
+from functools import reduce
+
+def sum_of_squares(sequence):
+    integers = [int(x) for x in sequence if x[0] != '#']
+    squares = [x * x for x in integers]
+    return reduce(lambda a, b: a + b, squares)
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Extend Inflammation Threshold Function Using Reduce
+
+Extend the `daily_above_threshold()` function you wrote previously
+to return a count of the number of days a patient's inflammation is over the threshold.
+Use `reduce()` over the boolean array that was previously returned to generate the count,
+then return that value from the function.
+
+You may choose to define a separate function to pass to `reduce()`,
+or use an inline lambda expression to do it (which is a bit trickier!).
+
+Hints:
+
+- Remember that you can define an `initialiser` value with `reduce()`
+  to help you start the counter
+- If defining a lambda expression,
+  note that it can conditionally return different values using the syntax
+  `<value-if-true> if <condition> else <value-if-false>` in the expression.
+
+::::::::::::::: solution
+
+## Solution
+
+Using a separate function:
+
+```python
+def daily_above_threshold(patient_num, data, threshold):
+    """Count how many days a given patient's inflammation exceeds a given threshold. 
+
+    :param patient_num: The patient row number
+    :param data: A 2D data array with inflammation data
+    :param threshold: An inflammation threshold to check each daily value against
+    :returns: An integer representing the number of days a patient's inflammation is over a given threshold
+    """
+    def count_above_threshold(a, b):
+        if b:
+            return a + 1
+        else:
+            return a
+
+    # Use map to determine if each daily inflammation value exceeds a given threshold for a patient
+    above_threshold = map(lambda x: x > threshold, data[patient_num])
+    # Use reduce to count how many days inflammation was above the threshold for a patient
+    return reduce(count_above_threshold, above_threshold, 0)
+```
+
+Note that the `count_above_threshold` function used by `reduce()`
+was defined within the `daily_above_threshold()` function
+to limit its scope and clarify its purpose
+(i.e. it may only be useful as part of `daily_above_threshold()`,
+hence it is defined as an inner function).
+
+The equivalent code using a lambda expression may look like:
+
+```python
+from functools import reduce
+
+...
+
+def daily_above_threshold(patient_num, data, threshold):
+    """Count how many days a given patient's inflammation exceeds a given threshold.
+
+    :param patient_num: The patient row number
+    :param data: A 2D data array with inflammation data
+    :param threshold: An inflammation threshold to check each daily value against
+    :returns: An integer representing the number of days a patient's inflammation is over a given threshold
+    """
+
+    above_threshold = map(lambda x: x > threshold, data[patient_num])
+    return reduce(lambda a, b: a + 1 if b else a, above_threshold, 0)
+```
+
+Where could this be useful?
+For example, you may want to define the success criteria for a trial as, say,
+80% of patients not exhibiting inflammation on any of the trial days, or some similar metric. 
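As a quick sanity check of the counting version of `daily_above_threshold()`, here is a self-contained sketch using a small made-up data table (plain nested lists standing in for the 2D NumPy inflammation array - the values are illustrative only):

```python
from functools import reduce

def daily_above_threshold(patient_num, data, threshold):
    """Count how many days a given patient's inflammation exceeds a given threshold."""
    above_threshold = map(lambda x: x > threshold, data[patient_num])
    return reduce(lambda a, b: a + 1 if b else a, above_threshold, 0)

# Made-up data: two patients, four days of inflammation values
data = [[1, 5, 2, 6],
        [0, 1, 1, 2]]

print(daily_above_threshold(0, data, 3))  # prints 2 (days with values 5 and 6)
print(daily_above_threshold(1, data, 3))  # prints 0
```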
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Decorators
+
+Finally, we will look at one last aspect of Python where functional programming comes in handy.
+As we have seen in the
+[episode on parametrising our unit tests](../episodes/22-scaling-up-unit-testing.md#parameterising-our-unit-tests),
+a decorator can take a function, modify/decorate it, then return the resulting function.
+This is possible because Python treats functions as first-class objects
+that can be passed around as normal data.
+Here, we discuss decorators in more detail and learn how to write our own.
+Let us look at the following code for ways to "decorate" functions.
+
+```python
+def with_logging(func):
+    """A decorator which adds logging to a function."""
+    def inner(*args, **kwargs):
+        print("Before function call")
+        result = func(*args, **kwargs)
+        print("After function call")
+        return result
+
+    return inner
+
+
+def add_one(n):
+    print("Adding one")
+    return n + 1
+
+# Redefine function add_one by wrapping it within with_logging function
+add_one = with_logging(add_one)
+
+# Another way to redefine a function - using a decorator
+@with_logging
+def add_two(n):
+    print("Adding two")
+    return n + 2
+
+print(add_one(1))
+print(add_two(1))
+```
+
+```output
+Before function call
+Adding one
+After function call
+2
+Before function call
+Adding two
+After function call
+3
+```
+
+In this example, we see a decorator (`with_logging`)
+and two different syntaxes for applying the decorator to a function.
+The decorator is implemented here as a function which encloses another function.
+Because the inner function (`inner()`) calls the function being decorated (`func()`)
+and returns its result,
+it still behaves like the original function.
+Part of this is the use of `*args` and `**kwargs` -
+these allow our decorated function to accept any arguments or keyword arguments
+and pass them directly to the function being decorated. 
+Our decorator in this case does not need to modify any of the arguments,
+so we do not need to know what they are.
+Any additional behaviour we want to add as part of our decorated function,
+we can put before or after the call to the original function.
+Here we print some text both before and after the decorated function,
+to show the order in which events happen.
+
+We also see in this example the two different ways in which a decorator can be applied.
+The first of these is to use a normal function call (`with_logging(add_one)`),
+where we then assign the resulting function back to a variable -
+often using the original name of the function, so replacing it with the decorated version.
+The second syntax is the one we have seen previously (`@with_logging`).
+This syntax is equivalent to the previous one -
+the result is that we have a decorated version of the function,
+here with the name `add_two`.
+Both of these syntaxes can be useful in different situations:
+the `@` syntax is more concise if we never need to use the un-decorated version,
+while the function-call syntax gives us more flexibility -
+we can continue to use the un-decorated function
+if we make sure to give the decorated one a different name,
+and can even make multiple decorated versions using different decorators.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Measuring Performance Using Decorators
+
+One small task for which you might find a decorator useful is
+measuring the time taken to execute a particular function.
+This is an important part of performance profiling.
+
+Write a decorator which you can use to measure the execution time of the decorated function
+using the [time.process\_time\_ns()](https://docs.python.org/3/library/time.html#time.process_time_ns) function.
+There are several different timing functions, each with slightly different use-cases,
+but we won't worry about that here. 
+
+For the function to measure, you may wish to use this as an example:
+
+```python
+def measure_me(n):
+    total = 0
+    for i in range(n):
+        total += i * i
+
+    return total
+```
+
+::::::::::::::: solution
+
+## Solution
+
+```python
+import time
+
+def profile(func):
+    def inner(*args, **kwargs):
+        start = time.process_time_ns()
+        result = func(*args, **kwargs)
+        stop = time.process_time_ns()
+
+        print("Took {0} seconds".format((stop - start) / 1e9))
+        return result
+
+    return inner
+
+@profile
+def measure_me(n):
+    total = 0
+    for i in range(n):
+        total += i * i
+
+    return total
+
+print(measure_me(1000000))
+```
+
+```output
+Took 0.124199753 seconds
+333332833333500000
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::: keypoints
+- Functional programming is a programming paradigm where programs are constructed
+  by applying and composing smaller, simpler functions into more complex ones (which
+  describe the flow of data within a program as a sequence of data transformations).
+- In functional programming, functions tend to be *pure* - they do not exhibit *side effects*
+  (by not affecting anything other than the value they return or anything outside
+  a function). Functions can also be named, passed as arguments, and returned from
+  other functions, just as any other data type.
+- MapReduce is an instance of a data generation and processing approach, in particular
+  suited for functional programming and handling Big Data within parallel and distributed
+  environments.
+- Python provides comprehensions for lists, dictionaries, sets and generators - a
+  concise (if not strictly functional) way to generate new data from existing data
+  collections while performing sophisticated mapping, filtering and conditional logic
+  on the original dataset's members. 
+:::
diff --git a/index.md b/index.md
new file mode 100644
index 000000000..fb0384e5f
--- /dev/null
+++ b/index.md
@@ -0,0 +1,123 @@
+---
+permalink: index.html
+site: sandpaper::sandpaper_site
+---
+
+This course aims to teach a **core set** of established,
+intermediate-level software development skills
+and best practices for working as part of a team in a research environment
+using Python as an example programming language
+(see detailed [learning objectives](index.md#learning-objectives-for-the-workshop) below).
+The core set of skills we teach is not a comprehensive set of all-encompassing skills,
+but a selective set of tried-and-tested collaborative development skills
+that forms a firm foundation for continuing on your learning journey.
+
+A **typical learner** for this course may be someone who
+works in a research environment,
+needs to write some code,
+and has **gained basic software development skills**
+either by self-learning or attending,
+e.g., a novice [Software Carpentry Python course](https://software-carpentry.org/lessons).
+They have been **applying those skills in their domain of work by writing code for some time**,
+e.g. half a year or more.
+However, their software development-related projects are now becoming larger
+and are involving more researchers and other stakeholders (e.g. 
users), for example: + +- Software is becoming more complex + and more collaborative development effort is needed to keep the software running +- Software is going further than just the small group developing and/or using the code - + there are more users and an increasing need to add new features +- ['Technical debt'](https://en.wikipedia.org/wiki/Technical_debt) is increasing + with demands to add new functionality + while ensuring previous development efforts remain functional and maintainable + +They now need intermediate software engineering skills +to help them design more robust software code that goes +beyond a few thrown-together proof-of-concept scripts, +taking into consideration the lifecycle of software, +writing software for stakeholders, +team ethic +and applying a process to understanding, designing, building, releasing, and maintaining software. + +## Target Audience + +This course is for you if: + +- You have been writing software for a while, + which may be used by people other than yourself, + but it is currently undocumented or unstructured +- You want to learn: + - more intermediate software engineering techniques and tools + - how to collaborate with others to develop software + - how to prepare software for others to use +- You are currently comfortable with: + - basic Python programming + (though this may not be the main language you use) + and applying it to your work on a regular basis + - basic version control using Git + - command line interface (shell) + +This course is not for you if: + +- You have not yet started writing software + (in which case have a look at the + [Software Carpentry course](https://software-carpentry.org/lessons) + or some other Python course for novices first) +- You have learned the basics of writing software but have not + applied that knowledge yet (or are unsure how to apply it) to your work. 
+ In this case, we suggest you revisit the course + after you have been programming for at least 6 months +- You are well familiar with the + [learning objectives of the course](index.md#learning-objectives-for-the-workshop) + and those of individual episodes +- The software you write is fully documented and well architected + +::::::::::::::::::::::::::::::::::::: objectives + +## Learning Objectives + +After going through this course, participants will be able to: + +- Set up and use a suitable development environment + together with popular source code management infrastructure to develop software collaboratively +- Use a test framework to automate the verification of correct behaviour of code, + and employ parameterisation and continuous integration to scale and further automate code testing +- Design robust, extensible software through the application of suitable programming paradigms + and design techniques +- Understand the code review process and employ it to improve the quality of code +- Prepare and release software for reuse by others +- Manage software improvement from feedback through agile techniques + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Before joining this training, participants should meet the following criteria. +(You can use [this short quiz](learners/quiz.md) to test your prerequisite knowledge.) 
+
+### Git
+
+- **You are familiar with the concept of version control**
+- **You have experience configuring Git for the first time and creating a local repository**
+- **You have experience using Git to create and clone a repository
+  and add/commit changes to it and to push to/pull from a remote repository**
+- Optionally, you have experience comparing various versions of tracked files
+  or ignoring specific files
+
+### Python
+
+- **You have a basic knowledge of programming in Python
+  (using variables, lists, conditional statements,
+  functions and importing external libraries)**
+- **You have previously written Python scripts or IPython/Jupyter notebooks
+  to accomplish tasks in your domain of work**
+
+### Shell
+
+- **You have experience using a command line interface, such as Bash,
+  to navigate a UNIX-style file system and run commands with arguments**
+- Optionally, you have experience redirecting inputs and outputs from a command
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/installation-instructions.md b/installation-instructions.md
new file mode 100644
index 000000000..50f57d664
--- /dev/null
+++ b/installation-instructions.md
@@ -0,0 +1,317 @@
+---
+title: Installation Instructions
+---
+
+You will need the following software and accounts set up to be able to follow the course:
+
+- [Command line tool](#command-line-tool) (such as Bash, Zsh or Git Bash)
+- [Git version control program](#git-version-control-tool)
+- [GitHub account](#github-account)
+- [Python 3 distribution](#python-3-distribution)
+- [PyCharm](#pycharm-ide) integrated development environment (IDE)
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Common Issues \& Tips
+
+If you are having issues installing or running some of the tools below,
+check a list of [common issues](learners/common-issues.md) other course participants encountered and some useful tips for using the tools and working through the material. 
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Command Line Tool
+
+You will need a command line tool (shell/console) in order to run Python scripts and version control your code with Git.
+
+- On Windows, it is **strongly** recommended to use **Git Bash** (which is included in
+  the [Git For Windows package](https://gitforwindows.org/) - see the Git installation section below). The
+  Windows command line tool `cmd` is not suitable for this course. We also advise against using
+  [Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/) for this course as we do not
+  provide instructions for troubleshooting any potential issues between WSL and PyCharm.
+- On macOS and Linux, you will already have a command line tool available on your system. You can use a shell such as [**Bash**](https://www.gnu.org/software/bash/),
+  or any other [command line tool that has similar syntax to Bash](https://en.wikipedia.org/wiki/Comparison_of_command_shells),
+  since none of the content of this course is specific to Bash. Note that starting with macOS Catalina,
+  Macs use [Zsh (Z shell)](https://www.zsh.org/) as the default command line tool instead of Bash.
+
+To test your command line tool, start it up and type:
+
+```bash
+$ date
+```
+
+If your command line program is working, it should return the current date and time similar to:
+
+```output
+Wed 21 Apr 2021 11:38:19 BST
+```
+
+## Git Version Control Tool
+
+Git is a program that can be accessed from your command line tool.
+
+- On Windows, it is recommended to use **Git Bash**, which comes included as part of the [Git For Windows package](https://gitforwindows.org/) and will
+  install the Bash command line tool as well as Git.
+- On macOS, Git is included as part of Apple's [Xcode tools](https://en.wikipedia.org/wiki/Xcode)
+  and should be available from the command line as long as you have Xcode. 
If you do not have Xcode installed, you can download it from
+  [Apple's App Store](https://apps.apple.com/us/app/xcode/id497799835?mt=12) or you can
+  [install Git using alternative methods](https://git-scm.com/download/mac).
+- On Linux, Git can be installed using your favourite package manager.
+
+To test your Git installation, start your command line tool and type:
+
+```bash
+$ git help
+```
+
+If your Git installation is working, you should see something like:
+
+```output
+usage: git [-v | --version] [-h | --help] [-C <path>] [-c <name>=<value>]
+           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
+           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
+           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
+           [--config-env=<name>=<envvar>] <command> [<args>]
+
+These are common Git commands used in various situations:
+
+start a working area (see also: git help tutorial)
+   clone     Clone a repository into a new directory
+   init      Create an empty Git repository or reinitialize an existing one
+
+work on the current change (see also: git help everyday)
+   add       Add file contents to the index
+   mv        Move or rename a file, a directory, or a symlink
+   restore   Restore working tree files
+   rm        Remove files from the working tree and from the index
+
+examine the history and state (see also: git help revisions)
+   bisect    Use binary search to find the commit that introduced a bug
+   diff      Show changes between commits, commit and working tree, etc
+   grep      Print lines matching a pattern
+   log       Show commit logs
+   show      Show various types of objects
+   status    Show the working tree status
+
+grow, mark and tweak your common history
+   branch    List, create, or delete branches
+   commit    Record changes to the repository
+   merge     Join two or more development histories together
+   rebase    Reapply commits on top of another base tip
+   reset     Reset current HEAD to the specified state
+   switch    Switch branches
+   tag       Create, list, delete or verify a tag object signed with GPG
+
+collaborate (see also: git help workflows)
+   fetch     Download objects 
and refs from another repository + pull Fetch from and integrate with another repository or a local branch + push Update remote refs along with associated objects + +'git help -a' and 'git help -g' list available subcommands and some +concept guides. See 'git help ' or 'git help ' +to read about a specific subcommand or concept. +See 'git help git' for an overview of the system. +``` + +When you use Git on a machine for the first time, you need to configure a few things: + +- your name, +- your email address (the one you used to open your GitHub account with, which will be used to uniquely identify your commits), +- preferred text editor for Git to use (e.g. `nano` or another text editor of your choice), +- whether you want to use these settings globally (i.e. for every Git project on your machine). + +This can be done from the command line as follows: + +```bash +$ git config --global user.name "Your Name" +$ git config --global user.email "name@example.com" +$ git config --global core.editor "nano -w" +``` + +### GitHub Account + +GitHub is a free, online host for Git repositories that you will use during the course to store your code in so +you will need to open a free [GitHub](https://github.com/) account unless you do not already have one. + +### Secure Access To GitHub Using Git From Command Line + +In order to access GitHub using Git from your machine securely, +you need to set up a way of authenticating yourself with GitHub through Git. +The recommended way to do that for this course is to set up +[*SSH authentication*](https://www.ssh.com/academy/ssh/public-key-authentication) - +a method of authentication that is more secure than sending +[*passwords over HTTPS*](https://security.stackexchange.com/questions/110415/is-it-ok-to-send-plain-text-password-over-https) +and which requires a pair of keys - +one public that you upload to your GitHub account, and one private that remains on your machine. 
+
+GitHub provides full documentation and guides on how to:
+
+- [generate an SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent), and
+- [add an SSH key to a GitHub account](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
+
+A short summary of the commands you need to perform is shown below.
+
+To generate an SSH key pair, you will need to run the `ssh-keygen` command from your command line tool/Git Bash
+and provide **your identity for the key pair** (e.g. the email address you used to register with GitHub)
+via the `-C` parameter as shown below.
+Note that the `ssh-keygen` command can be run with different parameters -
+e.g. to select a specific public key algorithm and key length;
+if you do not use them, `ssh-keygen` will generate an
+[RSA](https://en.wikipedia.org/wiki/RSA_\(cryptosystem\)#:~:text=RSA%20involves%20a%20public%20key,by%20using%20the%20private%20key.)
+key pair for you by default.
+GitHub now recommends that you use a newer cryptographic standard, such as [Ed25519](https://cryptobook.nakov.com/digital-signatures/eddsa-and-ed25519) (a variant of the [EdDSA](https://en.wikipedia.org/wiki/EdDSA) algorithm),
+so please be sure to specify it using the `-t` flag as shown below.
+`ssh-keygen` will also prompt you to answer a few questions -
+e.g. where to save the keys on your machine and a passphrase to use to protect your private key.
+Pressing 'Enter' on these prompts will get `ssh-keygen` to use the default key location (the
+`.ssh` folder in your home directory)
+and set the passphrase to empty.
+
+```bash
+$ ssh-keygen -t ed25519 -C "your-github-email@example.com"
+```
+
+```output
+Generating public/private ed25519 key pair.
+Enter file in which to save the key (/Users/<YOUR_USERNAME>/.ssh/id_ed25519):
+Enter passphrase (empty for no passphrase):
+Enter same passphrase again:
+Your identification has been saved in /Users/<YOUR_USERNAME>/.ssh/id_ed25519
+Your public key has been saved in /Users/<YOUR_USERNAME>/.ssh/id_ed25519.pub
+The key fingerprint is:
+SHA256:qjhN/iO42nnYmlpink2UTzaJpP8084yx6L2iQkVKdHk your-github-email@example.com
+The key's randomart image is:
++--[ED25519 256]--+
+|.. ..            |
+| ..o A           |
+|. o..            |
+| .o.o .          |
+| ..+ =  B        |
+| .o = ..         |
+|o..X *.          |
+|++B=@.X          |
+|+*XOoOo+         |
++----[SHA256]-----+
+```
+
+Next, you need to copy your public key (**not your private key - this is important!**) over to
+your GitHub account. The `ssh-keygen` command above will let you know where your public key is saved (the file should have the
+extension ".pub"), and you can get its contents (e.g. on a macOS system) as follows:
+
+```bash
+$ cat /Users/<YOUR_USERNAME>/.ssh/id_ed25519.pub
+```
+
+```output
+ssh-ed25519 AABAC3NzaC1lZDI1NTE5AAAAICWGVRsl/pZsxx85QHLwSgJWyfMB1L8RCkEvYNkP4mZC your-github-email@example.com
+```
+
+Copy the line of output that starts with "ssh-ed25519" and ends with your email address
+(it may start with a different algorithm name based on the one you used to generate the key pair,
+and it may span multiple lines if your command line window is not wide enough).
+
+Finally, go to your [GitHub Settings -> SSH and GPG keys -> Add New](https://github.com/settings/ssh/new) page to add a new
+SSH public key. Give your key a memorable name (e.g. the name of the computer you are working on that contains the
+private key counterpart), paste the public key
+from your clipboard into the box labelled "Key" (making sure it does not contain any line breaks), then click the "Add SSH key" button.
+
+Now, we can check that the SSH connection is working:
+
+```bash
+$ ssh -T git@github.com
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## What About Passwords?
+
+While using passwords over HTTPS for authentication is easier to set up and will allow you *read access* to your repository on GitHub from your machine,
+it alone is no longer sufficient to allow you to send changes or *write* to your remote repository on GitHub. This is because,
+on 13 August 2021, GitHub [strengthened security requirements for all authenticated Git operations](https://github.blog/changelog/2021-08-12-git-password-authentication-is-shutting-down/). This means that, for added security, you would need to use a
+personal access token instead of your password each time you need to authenticate yourself to
+GitHub from the command line (e.g. when you want to push your local changes to your code repository on GitHub).
+While using an
+SSH key pair for authentication may seem complex, once set up it is actually more convenient than keeping track of/caching
+your access token.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Python 3 Distribution
+
+To download the latest Python 3 distribution for your operating system,
+please head to [Python.org](https://www.python.org/downloads/).
+
+If you are on Linux,
+it is likely that the system Python 3 that is already installed will satisfy the requirements
+of this course (the material has been tested using the standard Python distribution version 3.11,
+but any [supported version](https://devguide.python.org/versions/#versions) should work).
+
+The course uses `venv` for virtual environment management and `pip` for package management.
+The material has not been extensively tested with other Python distributions and package managers,
+but most sections are expected to work with some modifications.
+For example, package installation and virtual environments would need to be managed differently, but Python script
+invocations should remain the same regardless of the Python distribution used.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Recommended Python Version
+
+We recommend using the latest Python version, but any [supported version](https://devguide.python.org/versions/#versions)
+should work.
+Specifically, we recommend upgrading from Python 2.7 wherever possible;
+continuing to use it will likely result in difficulty finding supported dependencies or
+in syntax errors.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+You can
+test your Python installation from the command line with:
+
+```bash
+$ python3 --version # on Mac/Linux
+$ python --version # on Windows - the Windows installation comes with a python.exe file rather than a python3.exe file
+```
+
+If you are using Windows and invoking the `python` command causes your Git Bash terminal to hang with no error message or output, you may
+need to create an alias for the python executable `python.exe`, as explained in the [troubleshooting section](learners/common-issues.md#python-hangs-in-git-bash).
+
+If all is well with your installation, you should see something like:
+
+```output
+Python 3.11.4
+```
+
+To make sure you are using the standard Python distribution and not some other distribution you may have on your system,
+type the following in your shell:
+
+```bash
+$ python3 # python on Windows
+```
+
+This should drop you into a Python console and you should see something like:
+
+```bash
+Python 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
+Type "help", "copyright", "credits" or "license" for more information.
+>>>
+```
+
+Press `CONTROL-D` or type `exit()` to exit the Python console.
+
+### `venv` and `pip`
+
+If you are using a Python 3 distribution from [Python.org](https://www.python.org/),
+`venv` and `pip` will be automatically installed for you. If not, please make sure you have these
+two tools (that correspond to your Python distribution) installed on your machine.
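If you are unsure whether your distribution includes them, one quick way to check (a sketch using only Python's standard library; run it from the same interpreter you will use for the course) is:

```python
# Check whether the venv and pip modules ship with this Python interpreter
import importlib.util

for module in ("venv", "pip"):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'available' if found else 'MISSING'}")
```

If either module is reported as missing, install it using your operating system's package manager before continuing.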
+
+## PyCharm IDE
+
+We use JetBrains' [PyCharm Python Integrated Development Environment](https://www.jetbrains.com/pycharm) for the course.
+PyCharm can be downloaded from [the JetBrains website](https://www.jetbrains.com/pycharm/download).
+The Community edition is fine, though if you are developing software for the purpose of academic research you may be eligible for a free licence for the Professional edition, which contains extra features.
+
+
+
+
diff --git a/instructor-notes.md b/instructor-notes.md
new file mode 100644
index 000000000..b49c366fc
--- /dev/null
+++ b/instructor-notes.md
@@ -0,0 +1,166 @@
+---
+title: Instructor Notes
+---
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Common Issues \& Tips
+
+Check out a [list of issues](../learners/common-issues.md) previous participants of the course encountered
+and some tips to help you with troubleshooting at the workshop.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Course Design
+
+The course follows a narrative around
+a software development team working on an existing software project
+that is analysing patients' inflammation data
+(from the [novice Software Carpentry Python course](https://software-carpentry.org/lessons)).
+The course is meant to be delivered as a single unit,
+as the course's code examples and exercises build on top of previously covered topics and code -
+so skipping or missing bits of the course would cause students to
+get out of sync and make it difficult for them to follow subsequent sections.
+
+A typical learner for the course is
+someone who has gained foundational software development skills in using Git,
+the command line shell and Python
+(e.g. by attending prior courses or by self-learning),
+and has used these skills for individual code development and scripting.
+They are now joining a development team where they will require
+a number of software development tools and intermediate software development skills
+to engineer their code properly,
+taking into consideration the lifecycle of software,
+team ethic, writing software for stakeholders,
+and applying a process to understanding, designing, building, releasing, and maintaining software.
+
+The course has been separated into 5 sections:
+
+- Section 1: Setting Up Environment For Collaborative Code Development
+- Section 2: Ensuring Correctness of Software at Scale
+- Section 3: Software Development as a Process
+- Section 4: Collaborative Software Development for Reuse
+- Section 5: Improving and Managing Software Over Its Lifetime
+
+Each section can be delivered in approximately half a day, but it is even better if you can allow a full day per section.
+
+## Course Delivery
+
+The course is intended primarily for self-learning,
+but other modes of delivery have been used successfully
+(e.g. fully instructor-led code-along mode or mixing in elements of instructor-led with self-work).
+The way the course has been delivered so far is that
+students are organised in small groups from the outset
+and initially work individually through the material.
+In later sections,
+exercises involve more group work
+and people from the same group form small development teams
+and collaborate on a mini software project
+(to provide more in-depth practice for software development in teams).
+There are a number of helpers on hand who sit with learners in groups.
+This provides a more comfortable and less intimidating learning environment,
+with learners more willing to engage and chat with their group colleagues about what they are doing
+and ask for help.
+
+The course can be delivered online or in-person.
+A good ratio is 4-6 learners to 1 helper.
+If you have a smaller number of helpers than groups,
+helpers can roam around to make sure groups are making progress.
+While this course can be live-coded by an instructor as well,
+we felt that intermediate-level learners are capable of
+going through the material on their own at a reasonable speed
+and do not need to code along to the same extent as novice learners.
+In later stages, exercises require participants to develop code more individually
+so they can review and comment on each other's code,
+and their code needs to be sufficiently different for these exercises to be effective.
+For the instructor-led mode of delivery, you can have an instructor live-code these group exercises
+after learners have been given a chance to work on them as a team.
+
+A workshop kicks off with everyone together at the start of each day.
+One of the course leads/helpers provides a workshop introduction
+and motivation to paint the bigger picture and set the scene for the whole workshop.
+In addition, a short intro to the section topics is provided on each day,
+to explain what the students will be learning and doing on that particular day.
+After that, participants are split into groups
+and go through the materials for that day on their own with helpers on hand.
+Each section has optional exercises at the end for fast learners to go through if they finish early.
+At the end of each section, all reconvene for a joint Q\&A session, feedback and wrap-up.
+If participants have not finished all exercises for a section (in "self-learning with helpers" mode),
+they are asked to finish them off before the next section starts
+to make sure everyone is in sync as much as possible and working on similar things
+(though students will inevitably cover the material at different speeds).
+This synchronisation becomes particularly important for later workshop stages
+when students start with group exercises.
+
+Although not explicitly endorsed,
+it is quite possible for learners to do the course using VS Code instead of PyCharm.
+There is a section on setting up VS Code in [this adjacent extras page](../learners/vscode.md).
+However, when progressing through the section [Integrated Software Development Environments](../episodes/13-ides.md),
+it can be a bit difficult for learners to pay attention to both pages.
+Therefore, some instructors have found it helpful to perform a demonstration on their own machines of how to use VS Code to achieve the same functionality as PyCharm.
+It is worthwhile preparing this in advance of the session.
+
+### Helpers' Roles and Responsibilities
+
+At the workshop, when using the "self-learning with helpers" delivery mode, everyone in the training team is a helper and
+there are no instructors per se.
+You may have more experienced helpers delivering introductions to the workshop and sections.
+Contact the course authors for section intro slides you can reuse; you can also find slides for each
+section in the course repository (for the instructor-led delivery mode).
+
+Roles and responsibilities of helpers include:
+
+- Being familiar with the material
+- Facilitating groups/breakout rooms and helping people go through the material
+- Trying to prepare a few questions/discussion points
+  to take to groups/breakout rooms to make sure the groups are engaged
+  (but note some learners may find discussions distracting so try to find a balance)
+- Taking notes on what works well and what does not - throughout the workshop -
+  from their individual perspective and the perspectives of students:
+  - Collecting general feelings and comments
+  - Their thoughts as a potential student and instructor
+- Noting mistakes, inconsistencies and learning obstacles in the materials
+- Recording issues or making PRs in the lesson repository during or after the workshop
+- Helping students get through the material
+  but also being ready to answer questions on applying the material in learners' domains,
+  if possible
+- Monitoring the progress of students
+  - get up every now and then
and do a walk around the room, look at stickies and have a peek at
+    computer screens (particularly if the session is running a bit behind)
+  - ask any learners that you might have helped previously how they are getting on
+
+### Group Exercises
+
+Here is some advice on how best to sync and organise group exercises in later stages of the course.
+
+- For earlier workshop stages,
+  where learners go through the material individually (though placed in groups),
+  maintaining the same group composition is not all that important.
+  However, it would be good to maintain the same teams once group exercises start,
+  as each group will choose one software project to be the "team project" to work on.
+- Take a note of who was in which group between different days
+  (e.g. in a shared document where people can sign up),
+  as people tend to forget (especially for online workshops).
+- Some group exercises start in the middle (rather than at the beginning) of a section.
+  This means that synchronisation is needed to make sure
+  everyone starts at the same time during that particular session.
+  As some students will naturally be ready faster,
+  perhaps have a shared document for people to put their names down
+  as they are ready to start with the group exercises,
+  and organise them into teams based on the speed at which they are covering the material.
+  Even if these groups change from previous days,
+  it will ensure people's idle time is minimised.
+- People may lose motivation in the later stages involving teamwork
+  if some team members are missing -
+  while this may be inevitable due to other commitments,
+  make it clear during workshop advertising
+  that people should try to commit to the workshop days/times.
+- Make it obvious to the learners that they should
+  catch up with any unfinished material or exercises from the previous session
+  before joining the next one -
+  this is even more important for group exercises so the teams are not stalled.
+ + + + diff --git a/learner-profiles.md b/learner-profiles.md new file mode 100644 index 000000000..434e335aa --- /dev/null +++ b/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. diff --git a/links.md b/links.md new file mode 100644 index 000000000..2d8cd4b46 --- /dev/null +++ b/links.md @@ -0,0 +1,57 @@ +[best-practices]: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[ci]: http://communityin.org/ +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html +[coderefinery-lessons]: https://coderefinery.org/lessons/ +[code-review]: https://en.wikipedia.org/wiki/Code_review +[concept-maps]: https://carpentries.github.io/instructor-training/05-memory/ +[contrib-covenant]: https://contributor-covenant.org/ +[cran-checkpoint]: https://cran.r-project.org/package=checkpoint +[cran-knitr]: https://cran.r-project.org/package=knitr +[cran-stringr]: https://cran.r-project.org/package=stringr +[dc-lessons]: http://www.datacarpentry.org/lessons/ +[email]: mailto:team@carpentries.org +[functional-programming]: https://en.wikipedia.org/wiki/Functional_programming +[gdpr]: https://ec.europa.eu/info/law/law-topic/data-protection/eu-data-protection-rules_en +[github-importer]: https://import2.github.com/ +[github-markdown]: https://guides.github.com/features/mastering-markdown/ +[github-actions]: https://docs.github.com/en/actions +[good-practices]: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510 +[importer]: https://github.com/new/import +[jekyll-collection]: https://jekyllrb.com/docs/collections/ +[jekyll-install]: https://jekyllrb.com/docs/installation/ +[jekyll-windows]: http://jekyll-windows.juthilo.com/ 
+[jekyll]: https://jekyllrb.com/ +[jupyter]: https://jupyter.org/ +[kramdown]: https://kramdown.gettalong.org/ +[lc-lessons]: https://librarycarpentry.org/lessons/ +[lesson-example]: https://carpentries.github.io/lesson-example/ +[mit-license]: https://opensource.org/licenses/mit-license.html +[morea]: https://morea-framework.github.io/ +[numfocus]: https://numfocus.org/ +[numpy]: http://www.numpy.org/ +[osi]: https://opensource.org +[pandas]: https://pandas.pydata.org/ +[pandoc]: https://pandoc.org/ +[paper-now]: https://github.com/PeerJ/paper-now +[pull-request]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests +[python-documentation]: https://docs.python.org/3/ +[python-gapminder]: https://swcarpentry.github.io/python-novice-gapminder/ +[pyyaml]: https://pypi.org/project/PyYAML/ +[r-markdown]: https://rmarkdown.rstudio.com/ +[rstudio]: https://www.rstudio.com/ +[ruby-install-guide]: https://www.ruby-lang.org/en/downloads/ +[ruby-installer]: https://rubyinstaller.org/ +[rubygems]: https://rubygems.org/pages/download/ +[scikit-learn]: https://github.com/scikit-learn/scikit-learn +[ssi]: https://software.ac.uk/ +[ssi-choosing-name]: https://software.ac.uk/resources/guides/choosing-project-and-product-names +[styles]: https://github.com/carpentries/styles/ +[swc-lessons]: https://software-carpentry.org/lessons/ +[swc-programming-with-python]: https://swcarpentry.github.io/python-novice-inflammation/ +[swc-releases]: https://github.com/swcarpentry/swc-releases +[training]: https://carpentries.github.io/instructor-training/ +[yaml]: http://yaml.org/ +[ssi-fair-lesson]: https://github.com/carpentries-incubator/fair-research-software diff --git a/md5sum.txt b/md5sum.txt new file mode 100644 index 000000000..3870ccd74 --- /dev/null +++ b/md5sum.txt @@ -0,0 +1,58 @@ +"file" "checksum" "built" "date" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" 
"site/built/CODE_OF_CONDUCT.md" "2024-12-06" +"GOVERNANCE.md" "60fee43ad99002ea28cf82fc4f426c01" "site/built/GOVERNANCE.md" "2024-12-06" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-12-06" +"config.yaml" "4799f899c7c44468eede6fbfa1be4f8e" "site/built/config.yaml" "2025-01-23" +"index.md" "13a35007c30050e2d6cc5eccaec9a735" "site/built/index.md" "2024-12-06" +"links.md" "7615d5c37931c582c3c7e4d575011c33" "site/built/links.md" "2024-12-06" +"paper.md" "6f5f7ec22f895ea5fd49d6bf0016a2ee" "site/built/paper.md" "2024-12-06" +"episodes/00-setting-the-scene.md" "134311748519aaa07caa19a7989357ff" "site/built/00-setting-the-scene.md" "2024-12-06" +"episodes/10-section1-intro.md" "5a9130373443cadeb19aa1a46dd3c1b2" "site/built/10-section1-intro.md" "2024-12-06" +"episodes/11-software-project.md" "f5e403d2b25781407eaed7534dfff9ec" "site/built/11-software-project.md" "2024-12-06" +"episodes/12-virtual-environments.md" "f49837c733584f7bca96cef405b2dcc0" "site/built/12-virtual-environments.md" "2024-12-06" +"episodes/13-ides.md" "1ae19b798bc07c88f241ec792ff41ef3" "site/built/13-ides.md" "2024-12-06" +"episodes/14-collaboration-using-git.md" "065fefef86c0202023975a04120da090" "site/built/14-collaboration-using-git.md" "2024-12-06" +"episodes/15-coding-conventions.md" "930ee3f977bcde70f4919a8e11ba3e02" "site/built/15-coding-conventions.md" "2024-12-16" +"episodes/16-verifying-code-style-linters.md" "d790a49e61e0c28fc2eb153ec1a00d82" "site/built/16-verifying-code-style-linters.md" "2024-12-06" +"episodes/17-section1-optional-exercises.md" "c7a9b304ba70d7d15ea5334fd211248b" "site/built/17-section1-optional-exercises.md" "2024-12-06" +"episodes/20-section2-intro.md" "de5af08b90601de19c724a0c9d7c3082" "site/built/20-section2-intro.md" "2024-12-06" +"episodes/21-automatically-testing-software.md" "fefa4876a74e1ae8c7e0927912ca876c" "site/built/21-automatically-testing-software.md" "2024-12-06" +"episodes/22-scaling-up-unit-testing.md" 
"73bcee5ed247ef0092b202af9cc810de" "site/built/22-scaling-up-unit-testing.md" "2024-12-06" +"episodes/23-continuous-integration-automated-testing.md" "c9f389e7a47a33e1220497c43e69a1c4" "site/built/23-continuous-integration-automated-testing.md" "2024-12-06" +"episodes/24-diagnosing-issues-improving-robustness.md" "b9d217dd779141be70604c0b6b13195a" "site/built/24-diagnosing-issues-improving-robustness.md" "2024-12-06" +"episodes/25-section2-optional-exercises.md" "439682a4955568fa290b79ab2b486797" "site/built/25-section2-optional-exercises.md" "2024-12-06" +"episodes/30-section3-intro.md" "24e70667c1848061ecb3d42ecf17dbf8" "site/built/30-section3-intro.md" "2024-12-06" +"episodes/31-software-requirements.md" "7831d18a89d1e07dd312dc89a8e861dc" "site/built/31-software-requirements.md" "2024-12-06" +"episodes/32-software-architecture-design.md" "368bfd3da942c5aeceb2491a7203e787" "site/built/32-software-architecture-design.md" "2024-12-06" +"episodes/33-code-decoupling-abstractions.md" "d005ae2985203c1565a859b45a8fb597" "site/built/33-code-decoupling-abstractions.md" "2024-12-06" +"episodes/34-code-refactoring.md" "a326b0b49f7114cf0b2e56c81231c4c7" "site/built/34-code-refactoring.md" "2024-12-06" +"episodes/35-software-architecture-revisited.md" "8f92fb912d61d3aded15f51020699a14" "site/built/35-software-architecture-revisited.md" "2024-12-06" +"episodes/40-section4-intro.md" "d8aa3c327409db1b14826b7619287c45" "site/built/40-section4-intro.md" "2024-12-06" +"episodes/41-code-review.md" "fb1ab5cac0c57cfe3d162fc04c840b2e" "site/built/41-code-review.md" "2024-12-06" +"episodes/42-software-reuse.md" "d97b8a23401a52bfb6dc5e564981fcf2" "site/built/42-software-reuse.md" "2024-12-06" +"episodes/43-software-release.md" "af8aca5b2c4fb2575192c3ed74ebeb9a" "site/built/43-software-release.md" "2024-12-06" +"episodes/50-section5-intro.md" "85de74adcd13ba9e8a6df600a17b9ddc" "site/built/50-section5-intro.md" "2024-12-06" +"episodes/51-managing-software.md" 
"570fbc180f98caf5ce4aac2703d254cd" "site/built/51-managing-software.md" "2024-12-06" +"episodes/52-assessing-software-suitability-improvement.md" "9092d3b746792536b43dd9d9eb1da26e" "site/built/52-assessing-software-suitability-improvement.md" "2024-12-06" +"episodes/53-improvement-through-feedback.md" "b46cb516f900e5001609c4d4165254ae" "site/built/53-improvement-through-feedback.md" "2024-12-06" +"episodes/60-wrap-up.md" "8063854ac1eeeea9c176700fe6990045" "site/built/60-wrap-up.md" "2024-12-06" +"instructors/instructor-notes.md" "0145a4d0e4df14ce1d4f08ecaa171515" "site/built/instructor-notes.md" "2024-12-06" +"learners/quiz.md" "bd60170ec9f07bc2d510f55353179217" "site/built/quiz.md" "2024-12-06" +"learners/installation-instructions.md" "516ae49d52941813dae9d2ed82bd6127" "site/built/installation-instructions.md" "2024-12-06" +"learners/common-issues.md" "cbe920fbcf2d876be9bcceed8f3a57a7" "site/built/common-issues.md" "2025-01-23" +"learners/software-architecture-extra.md" "75cca9330b84bddf8223944131639f4f" "site/built/software-architecture-extra.md" "2024-12-06" +"learners/programming-paradigms.md" "2c3cdee71c1c975c0cf99260493b6e67" "site/built/programming-paradigms.md" "2024-12-06" +"learners/procedural-programming.md" "ede81ccae989c46e47af0417ac31b401" "site/built/procedural-programming.md" "2024-12-06" +"learners/functional-programming.md" "6e68bd30935f968e283bf1ecf5160edf" "site/built/functional-programming.md" "2024-12-06" +"learners/object-oriented-programming.md" "cde54aa934af3c87cc5224ef45775975" "site/built/object-oriented-programming.md" "2024-12-06" +"learners/persistence.md" "86e55224911506ab5b62abac5f002854" "site/built/persistence.md" "2024-12-06" +"learners/databases.md" "4954ff7461b7fdb63913df9719f16ad2" "site/built/databases.md" "2024-12-06" +"learners/vscode.md" "671387b7374a86e897f209d7336c75b2" "site/built/vscode.md" "2024-12-06" +"learners/reference.md" "d5d60c894664895ab3423a31bd1e2aaa" "site/built/reference.md" "2024-12-06" 
+"learners/setup.md" "172016f61e91f7d5402d1c8ebdde32df" "site/built/setup.md" "2024-12-06"
+"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2024-12-06"
+"slides/section_1_setting_up_environment.md" "3f43466ae09b48d6c2e9bd6d867562ae" "site/built/section_1_setting_up_environment.md" "2024-12-06"
+"slides/section_2_ensuring_correctness.md" "c07cf076af61c8fdcbcfe66f6817b5e7" "site/built/section_2_ensuring_correctness.md" "2024-12-06"
+"slides/section_3_software_dev_process.md" "95bbcd9fd58b0c7d522b786748cb0298" "site/built/section_3_software_dev_process.md" "2024-12-06"
+"slides/section_4_collaborative_soft_dev.md" "f729a79969cbe81f5b86cdfd1a037643" "site/built/section_4_collaborative_soft_dev.md" "2024-12-06"
+"slides/section_5_managing_software.md" "ee351c9e92ba29cda41bca7d2647fb1f" "site/built/section_5_managing_software.md" "2024-12-06"
diff --git a/object-oriented-programming.md b/object-oriented-programming.md
new file mode 100644
index 000000000..a2a9c6094
--- /dev/null
+++ b/object-oriented-programming.md
@@ -0,0 +1,921 @@
+---
+title: "Extra Content: Object Oriented Programming"
+teaching: 30
+exercises: 35
+---
+
+::: questions
+- How can we use code to describe the structure of data?
+- How should the relationships between structures be described?
+:::
+
+::: objectives
+- Describe the core concepts that define the object oriented paradigm
+- Use classes to encapsulate data within a more complex program
+- Structure concepts within a program in terms of sets of behaviour
+- Identify different types of relationship between concepts within a program
+- Structure data within a program using these relationships
+:::
+
+Object oriented programming is a programming paradigm based on the concept of objects,
+which are data structures that contain (encapsulate) data and code.
+Data is encapsulated in the form of fields (attributes) of objects,
+while code is encapsulated in the form of procedures (methods)
+that manipulate objects' attributes and define the "behaviour" of objects.
+So, in object oriented programming,
+we first think about the data and the things that we are modelling -
+and represent these by objects -
+rather than defining the logic of the program,
+and code becomes a series of interactions between objects.
+
+## Structuring Data
+
+One of the main difficulties we encounter when building more complex software is
+how to structure our data.
+So far, we have been processing data from a single source and with a simple tabular structure,
+but it would be useful to be able to combine data from a range of different sources
+and with more data than just an array of numbers.
+
+```python
+data = np.array([[1., 2., 3.],
+                 [4., 5., 6.]])
+```
+
+Using this data structure has the advantage of
+being able to use NumPy operations to process the data
+and Matplotlib to plot it,
+but often we need to have more structure than this.
+For example, we may need to attach more information about the patients
+and store this alongside our measurements of inflammation.
+
+We can do this using the Python data structures we are already familiar with,
+dictionaries and lists.
+For instance, we could attach a name to each of our patients:
+
+```python
+patients = [
+    {
+        'name': 'Alice',
+        'data': [1., 2., 3.],
+    },
+    {
+        'name': 'Bob',
+        'data': [4., 5., 6.],
+    },
+]
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: Structuring Data
+
+Write a function, called `attach_names`,
+which can be used to attach names to our patient dataset.
+When used as below, it should produce the expected output.
+
+If you are not sure where to begin,
+think about ways you might be able to effectively loop over two collections at once.
+Also, do not worry too much about the data type of the `data` value -
+it can be a Python list or a NumPy array, either is fine.
+
+```python
+data = np.array([[1., 2., 3.],
+                 [4., 5., 6.]])
+
+output = attach_names(data, ['Alice', 'Bob'])
+print(output)
+```
+
+```output
+[
+    {
+        'name': 'Alice',
+        'data': [1., 2., 3.],
+    },
+    {
+        'name': 'Bob',
+        'data': [4., 5., 6.],
+    },
+]
+```
+
+::::::::::::::: solution
+
+## Solution
+
+One possible solution, perhaps the most obvious,
+is to use the `range` function to index into both lists at the same location:
+
+```python
+def attach_names(data, names):
+    """Create datastructure containing patient records."""
+    output = []
+
+    for i in range(len(data)):
+        output.append({'name': names[i],
+                       'data': data[i]})
+
+    return output
+```
+
+However, this solution has a potential problem that can occur with some inputs.
+What might go wrong with this solution?
+How could we fix it?
+
+::::::::::::::: solution
+
+## A Better Solution
+
+What would happen if the `data` and `names` inputs were different lengths?
+
+If `names` is longer, we will loop through until we run out of rows in the `data` input,
+at which point we will stop processing the last few names.
+If `data` is longer, we will loop through, but at some point we will run out of names -
+this time we will try to access part of the list that does not exist,
+so we will get an exception.
+
+A better solution would be to use the `zip` function,
+which allows us to iterate over multiple iterables without needing an index variable.
+The `zip` function also limits the iteration to whichever of the iterables is smaller,
+so we will not raise an exception here,
+but this might not quite be the behaviour we want,
+so we will also explicitly `assert` that the inputs should be the same length.
+Checking that our inputs are valid in this way is an example of a precondition,
+which we introduced conceptually in an earlier episode.
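To see `zip`'s truncation behaviour concretely, here is a quick sketch using made-up lists (not part of the lesson code):

```python
data = [[1., 2., 3.], [4., 5., 6.]]
names = ['Alice', 'Bob', 'Carol']  # one name too many

# zip stops at the end of the shorter iterable, so the extra name is
# silently dropped rather than raising an exception
paired = list(zip(data, names))
print(paired)
# [([1.0, 2.0, 3.0], 'Alice'), ([4.0, 5.0, 6.0], 'Bob')]
```

This silent truncation is exactly why adding an explicit length check is safer than relying on `zip` alone.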
+ +If you have not previously come across the `zip` function, +read [this section](https://docs.python.org/3/library/functions.html#zip) +of the Python documentation. + +```python +def attach_names(data, names): + """Create datastructure containing patient records.""" + assert len(data) == len(names) + output = [] + + for data_row, name in zip(data, names): + output.append({'name': name, + 'data': data_row}) + + return output +``` + +::::::::::::::::::::::::: + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Classes in Python + +Using nested dictionaries and lists should work for some of the simpler cases +where we need to handle structured data, +but they get quite difficult to manage once the structure becomes a bit more complex. +For this reason, in the object oriented paradigm, +we use **classes** to help with managing this data +and the operations we would want to perform on it. +A class is a **template** (blueprint) for a structured piece of data, +so when we create some data using a class, +we can be certain that it has the same structure each time. + +With our list of dictionaries we had in the example above, +we have no real guarantee that each dictionary has the same structure, +e.g. the same keys (`name` and `data`) unless we check it manually. +With a class, if an object is an **instance** of that class +(i.e. it was made using that template), +we know it will have the structure defined by that class. +Different programming languages make slightly different guarantees +about how strictly the structure will match, +but in object oriented programming this is one of the core ideas - +all objects derived from the same class must follow the same behaviour. 
+
+You may not have realised, but you should already be familiar with
+some of the classes that come bundled as part of Python, for example:
+
+```python
+my_list = [1, 2, 3]
+my_dict = {1: '1', 2: '2', 3: '3'}
+my_set = {1, 2, 3}
+
+print(type(my_list))
+print(type(my_dict))
+print(type(my_set))
+```
+
+```output
+<class 'list'>
+<class 'dict'>
+<class 'set'>
+```
+
+Lists, dictionaries and sets are a slightly special type of class,
+but they behave in much the same way as a class we might define ourselves:
+
+- They each hold some data (**attributes** or **state**).
+- They also provide some methods describing the behaviours of the data -
+  what can the data do and what can we do to the data?
+
+The behaviours we may have seen previously include:
+
+- Lists can be appended to
+- Lists can be indexed
+- Lists can be sliced
+- Key-value pairs can be added to dictionaries
+- The value at a key can be looked up in a dictionary
+- The union of two sets can be found (the set of values present in any of the sets)
+- The intersection of two sets can be found (the set of values present in all of the sets)
+
+## Encapsulating Data
+
+Let us start with a minimal example of a class representing our patients.
+
+```python
+# file: inflammation/models.py
+
+class Patient:
+    def __init__(self, name):
+        self.name = name
+        self.observations = []
+
+alice = Patient('Alice')
+print(alice.name)
+```
+
+```output
+Alice
+```
+
+Here we have defined a class with one method: `__init__`.
+This method is the **initialiser** method,
+which is responsible for setting up the initial values and structure of the data
+inside a new instance of the class -
+this is very similar to **constructors** in other languages,
+so the term is often used in Python too.
+The `__init__` method is called every time we create a new instance of the class,
+as in `Patient('Alice')`.
+The argument `self` refers to the instance on which we are calling the method
+and gets filled in automatically by Python -
+we do not need to provide a value for this when we call the method.
+
+Data encapsulated within our Patient class includes
+the patient's name and a list of inflammation observations.
+In the initialiser method,
+we set a patient's name to the value provided,
+and create a list of inflammation observations for the patient (initially empty).
+Such data is also referred to as the attributes of a class
+and holds the current state of an instance of the class.
+Attributes are typically hidden (encapsulated) internal object details
+ensuring that access to data is protected from unintended changes.
+They are manipulated internally by the class,
+which, in addition, can expose certain functionality as public behaviour of the class
+to allow other objects to interact with this class' instances.
+
+## Encapsulating Behaviour
+
+In addition to representing a piece of structured data
+(e.g. a patient who has a name and a list of inflammation observations),
+a class can also provide a set of functions, or **methods**,
+which describe the **behaviours** of the data encapsulated in the instances of that class.
+To define the behaviour of a class we add functions which operate on the data the class contains.
+These functions are the member functions or methods.
+
+Methods on classes are the same as normal functions,
+except that they live inside a class and have an extra first parameter `self`.
+Using the name `self` is not strictly necessary, but is a very strong convention -
+it is extremely rare to see any other name chosen.
+When we call a method on an object,
+the value of `self` is automatically set to this object - hence the name.
+As we saw with the `__init__` method previously,
+we do not need to explicitly provide a value for the `self` argument -
+this is done for us by Python.
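To make the `self` mechanics concrete, here is a tiny sketch (the `greet` method is invented for illustration and is not part of the lesson's `Patient` class):

```python
class Patient:
    def __init__(self, name):
        self.name = name

    def greet(self):
        # `self` is the instance the method was called on
        return 'Hello, ' + self.name

alice = Patient('Alice')

# These two calls are equivalent - Python fills in `self` for us
print(alice.greet())           # Hello, Alice
print(Patient.greet(alice))    # Hello, Alice
```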
+ +Let us add another method on our Patient class that adds a new observation to a Patient instance. + +```python +# file: inflammation/models.py + +class Patient: + """A patient in an inflammation study.""" + def __init__(self, name): + self.name = name + self.observations = [] + + def add_observation(self, value, day=None): + if day is None: + if self.observations: + day = self.observations[-1]['day'] + 1 + else: + day = 0 + + new_observation = { + 'day': day, + 'value': value, + } + + self.observations.append(new_observation) + return new_observation + +alice = Patient('Alice') +print(alice) + +observation = alice.add_observation(3) +print(observation) +print(alice.observations) +``` + +```output +<__main__.Patient object at 0x7fd7e61b73d0> +{'day': 0, 'value': 3} +[{'day': 0, 'value': 3}] +``` + +Note also how we used `day=None` in the parameter list of the `add_observation` method, +then initialise it if the value is indeed `None`. +This is one of the common ways to handle an optional argument in Python, +so we will see this pattern quite a lot in real projects. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Class and Static Methods + +Sometimes, the function we are writing does not need access to +any data belonging to a particular object. +For these situations, we can instead use a **class method** or a **static method**. +Class methods have access to the class that they are a part of, +and can access data on that class - +but do not belong to a specific instance of that class, +whereas static methods have access to neither the class nor its instances. + +By convention, class methods use `cls` as their first argument instead of `self` - +this is how we access the class and its data, +just like `self` allows us to access the instance and its data. +Static methods have neither `self` nor `cls` +so the arguments look like a typical free function. +These are the only common exceptions to using `self` for a method's first argument. 
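As an illustrative sketch (the patient-counting behaviour here is invented, not part of the lesson's `Patient` class), a class method and a static method might look like this:

```python
class Patient:
    """Sketch showing class and static methods - hypothetical example."""
    total_patients = 0  # data belonging to the class itself, not to any instance

    def __init__(self, name):
        self.name = name
        Patient.total_patients += 1

    @classmethod
    def count(cls):
        # Receives the class (cls), so it can read class-level data,
        # but has no access to any particular instance
        return cls.total_patients

    @staticmethod
    def is_valid_name(name):
        # Receives neither the class nor an instance -
        # it reads like an ordinary (free) function
        return isinstance(name, str) and len(name) > 0

alice = Patient('Alice')
bob = Patient('Bob')
print(Patient.count())            # 2
print(Patient.is_valid_name(''))  # False
```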
+
+Both of these method types are created using **decorators** -
+for more information see
+the [classmethod](https://docs.python.org/3/library/functions.html#classmethod)
+and [staticmethod](https://docs.python.org/3/library/functions.html#staticmethod)
+decorator sections of the Python documentation.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Dunder Methods
+
+Why is the `__init__` method not called `init`?
+There are a few special method names that Python uses to provide some common behaviours,
+each of which begins and ends with a **d**ouble-**under**score,
+hence the name **dunder method**.
+
+When writing your own Python classes,
+you'll almost always want to write an `__init__` method,
+but there are a few other common ones you might need sometimes.
+You may have noticed in the code above that the call `print(alice)`
+displayed `<__main__.Patient object at 0x7fd7e61b73d0>`,
+which is the default string representation of the `alice` object.
+We may want the print statement to display the object's name instead.
+We can achieve this by overriding the `__str__` method of our class.
+
+```python
+# file: inflammation/models.py
+
+class Patient:
+    """A patient in an inflammation study."""
+    def __init__(self, name):
+        self.name = name
+        self.observations = []
+
+    def add_observation(self, value, day=None):
+        if day is None:
+            try:
+                day = self.observations[-1]['day'] + 1
+            except IndexError:
+                day = 0
+
+        new_observation = {
+            'day': day,
+            'value': value,
+        }
+
+        self.observations.append(new_observation)
+        return new_observation
+
+    def __str__(self):
+        return self.name
+
+
+alice = Patient('Alice')
+print(alice)
+```
+
+```output
+Alice
+```
+
+These dunder methods are not usually called directly,
+but rather provide the implementation of some functionality we can use -
+we didn't call `alice.__str__()`,
+but it was called for us when we did `print(alice)`.
+Some we see quite commonly are: + +- `__str__` - converts an object into its string representation, used when you call `str(object)` or `print(object)` +- `__getitem__` - Accesses an object by key, this is how `list[x]` and `dict[x]` are implemented +- `__len__` - gets the length of an object when we use `len(object)` - usually the number of items it contains + +There are many more described in the Python documentation, +but it's also worth experimenting with built in Python objects to +see which methods provide which behaviour. +For a more complete list of these special methods, +see the [Special Method Names](https://docs.python.org/3/reference/datamodel.html#special-method-names) +section of the Python documentation. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Exercise: A Basic Class + +Implement a class to represent a book. +Your class should: + +- Have a title +- Have an author +- When printed using `print(book)`, show text in the format "title by author" + +```python +book = Book('A Book', 'Me') + +print(book) +``` + +```output +A Book by Me +``` + +::::::::::::::: solution + +## Solution + +```python +class Book: + def __init__(self, title, author): + self.title = title + self.author = author + + def __str__(self): + return self.title + ' by ' + self.author +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Properties + +The final special type of method we will introduce is a **property**. +Properties are methods which behave like data - +when we want to access them, we do not need to use brackets to call the method manually. + +```python +# file: inflammation/models.py + +class Patient: + ... 
+ + @property + def last_observation(self): + return self.observations[-1] + +alice = Patient('Alice') + +alice.add_observation(3) +alice.add_observation(4) + +obs = alice.last_observation +print(obs) +``` + +```output +{'day': 1, 'value': 4} +``` + +You may recognise the `@` syntax from episodes on +parameterising unit tests and functional programming - +`property` is another example of a **decorator**. +In this case the `property` decorator is taking the `last_observation` function +and modifying its behaviour, +so it can be accessed as if it were a normal attribute. +It is also possible to make your own decorators, but we will not cover it here. + +## Relationships Between Classes + +We now have a language construct for grouping data and behaviour +related to a single conceptual object. +The next step we need to take is to describe the relationships between the concepts in our code. + +There are two fundamental types of relationship between objects +which we need to be able to describe: + +1. Ownership - x **has a** y - this is **composition** +2. Identity - x **is a** y - this is **inheritance** + +### Composition + +You should hopefully have come across the term **composition** already - +in the novice Software Carpentry, we use composition of functions to reduce code duplication. +That time, we used a function which converted temperatures in Celsius to Kelvin +as a **component** of another function which converted temperatures in Fahrenheit to Kelvin. + +In the same way, in object oriented programming, we can make things components of other things. + +We often use composition where we can say 'x *has a* y' - +for example in our inflammation project, +we might want to say that a doctor *has* patients +or that a patient *has* observations. + +In the case of our example, we are already saying that patients have observations, +so we are already using composition here. 
+We are currently implementing an observation as a dictionary with a known set of keys though, +so maybe we should make an `Observation` class as well. + +```python +# file: inflammation/models.py + +class Observation: + def __init__(self, day, value): + self.day = day + self.value = value + + def __str__(self): + return str(self.value) + +class Patient: + """A patient in an inflammation study.""" + def __init__(self, name): + self.name = name + self.observations = [] + + def add_observation(self, value, day=None): + if day is None: + try: + day = self.observations[-1].day + 1 + + except IndexError: + day = 0 + + new_observation = Observation(day, value) + + self.observations.append(new_observation) + return new_observation + + def __str__(self): + return self.name + + +alice = Patient('Alice') +obs = alice.add_observation(3) + +print(obs) +``` + +```output +3 +``` + +Now we are using a composition of two custom classes to +describe the relationship between two types of entity in the system that we are modelling. + +### Inheritance + +The other type of relationship used in object oriented programming is **inheritance**. +Inheritance is about data and behaviour shared by classes, +because they have some shared identity - 'x *is a* y'. +If class `X` inherits from (*is a*) class `Y`, +we say that `Y` is the **superclass** or **parent class** of `X`, +or `X` is a **subclass** of `Y`. + +If we want to extend the previous example to also manage people who aren't patients +we can add another class `Person`. +But `Person` will share some data and behaviour with `Patient` - +in this case both have a name and show that name when you print them. +Since we expect all patients to be people (hopefully!), +it makes sense to implement the behaviour in `Person` and then reuse it in `Patient`. + +To write our class in Python, +we used the `class` keyword, the name of the class, +and then a block of the functions that belong to it. 
+If the class **inherits** from another class, +we include the parent class name in brackets. + +```python +# file: inflammation/models.py + +class Observation: + def __init__(self, day, value): + self.day = day + self.value = value + + def __str__(self): + return str(self.value) + +class Person: + def __init__(self, name): + self.name = name + + def __str__(self): + return self.name + +class Patient(Person): + """A patient in an inflammation study.""" + def __init__(self, name): + super().__init__(name) + self.observations = [] + + def add_observation(self, value, day=None): + if day is None: + try: + day = self.observations[-1].day + 1 + + except IndexError: + day = 0 + + new_observation = Observation(day, value) + + self.observations.append(new_observation) + return new_observation + +alice = Patient('Alice') +print(alice) + +obs = alice.add_observation(3) +print(obs) + +bob = Person('Bob') +print(bob) + +obs = bob.add_observation(4) +print(obs) +``` + +```output +Alice +3 +Bob +AttributeError: 'Person' object has no attribute 'add_observation' +``` + +As expected, an error is thrown because we cannot add an observation to `bob`, +who is a Person but not a Patient. + +We see in the example above that to say that a class inherits from another, +we put the **parent class** (or **superclass**) in brackets after the name of the **subclass**. + +There is something else we need to add as well - +Python does not automatically call the `__init__` method on the parent class +if we provide a new `__init__` for our subclass, +so we will need to call it ourselves. +This makes sure that everything that needs to be initialised on the parent class has been, +before we need to use it. +If we do not define a new `__init__` method for our subclass, +Python will look for one on the parent class and use it automatically. +This is true of all methods - +if we call a method which does not exist directly on our class, +Python will search for it among the parent classes. 
+The order in which it does this search is known as the **method resolution order** - +a little more on this in the Multiple Inheritance callout below. + +The line `super().__init__(name)` gets the parent class, +then calls the `__init__` method, +providing the `name` variable that `Person.__init__` requires. +This is quite a common pattern, particularly for `__init__` methods, +where we need to make sure an object is initialised as a valid `X`, +before we can initialise it as a valid `Y` - +e.g. a valid `Person` must have a name, +before we can properly initialise a `Patient` model with their inflammation data. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Composition vs Inheritance + +When deciding how to implement a model of a particular system, +you often have a choice of either composition or inheritance, +where there is no obviously correct choice. +For example, it is not obvious whether a photocopier *is a* printer and *is a* scanner, +or *has a* printer and *has a* scanner. + +```python +class Machine: + pass + +class Printer(Machine): + pass + +class Scanner(Machine): + pass + +class Copier(Printer, Scanner): + # Copier `is a` Printer and `is a` Scanner + pass +``` + +```python +class Machine: + pass + +class Printer(Machine): + pass + +class Scanner(Machine): + pass + +class Copier(Machine): + def __init__(self): + # Copier `has a` Printer and `has a` Scanner + self.printer = Printer() + self.scanner = Scanner() +``` + +Both of these would be perfectly valid models and would work for most purposes. +However, unless there is something about how you need to use the model +which would benefit from using a model based on inheritance, +it is usually recommended to opt for **composition over inheritance**. +This is a common design principle in the object oriented paradigm and is worth remembering, +as it is very common for people to overuse inheritance once they have been introduced to it. 
+
+For much more detail on this see the
+[Python Design Patterns guide](https://python-patterns.guide/gang-of-four/composition-over-inheritance/).
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Multiple Inheritance
+
+**Multiple Inheritance** is when a class inherits from more than one direct parent class.
+It exists in Python, but is often not present in other Object Oriented languages.
+Although this might seem useful, like in our inheritance-based model of the photocopier above,
+it is best to avoid it unless you are sure it is the right thing to do,
+due to the complexity of the inheritance hierarchy.
+Often using multiple inheritance is a sign you should instead be using composition -
+again like the photocopier model above.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise: A Model Patient
+
+Let us use what we have learnt in this episode and combine it with what we have learnt on
+[software requirements](../episodes/31-software-requirements.md)
+to formulate and implement a
+[few new solution requirements](../episodes/31-software-requirements.md#exercise-new-solution-requirements)
+to extend the model layer of our clinical trial system.
+
+Let us start with extending the system such that there must be
+a `Doctor` class to hold the data representing a single doctor, which:
+
+- must have a `name` attribute
+- must have a list of patients that this doctor is responsible for.
+
+In addition to these, try to think of an extra feature you could add to the models
+which would be useful for managing a dataset like this -
+imagine we are running a clinical trial, what else might we want to know?
+Try using Test Driven Development for any features you add:
+write the tests first, then add the feature.
+The tests have been started for you in `tests/test_patient.py`,
+but you will probably want to add some more.
+
+Once you have finished the initial implementation, do you have much duplicated code?
+Is there anywhere you could make better use of composition or inheritance
+to improve your implementation?
+
+For any extra features you have added,
+explain them and how you implemented them to your neighbour.
+Would they have implemented that feature in the same way?
+
+::::::::::::::: solution
+
+## Solution
+
+One example solution is shown below.
+You may start by writing some tests (that will initially fail),
+and then develop the code to satisfy the new requirements and pass the tests.
+
+```python
+# file: tests/test_patient.py
+"""Tests for the Patient model."""
+from inflammation.models import Doctor, Patient, Person
+
+def test_create_patient():
+    """Check a patient is created correctly given a name."""
+    name = 'Alice'
+    p = Patient(name=name)
+    assert p.name == name
+
+def test_create_doctor():
+    """Check a doctor is created correctly given a name."""
+    name = 'Sheila Wheels'
+    doc = Doctor(name=name)
+    assert doc.name == name
+
+def test_doctor_is_person():
+    """Check if a doctor is a person."""
+    doc = Doctor("Sheila Wheels")
+    assert isinstance(doc, Person)
+
+def test_patient_is_person():
+    """Check if a patient is a person."""
+    alice = Patient("Alice")
+    assert isinstance(alice, Person)
+
+def test_patients_added_correctly():
+    """Check patients are being added correctly by a doctor."""
+    doc = Doctor("Sheila Wheels")
+    alice = Patient("Alice")
+    doc.add_patient(alice)
+    assert doc.patients is not None
+    assert len(doc.patients) == 1
+
+def test_no_duplicate_patients():
+    """Check adding the same patient to the same doctor twice does not result in duplicates."""
+    doc = Doctor("Sheila Wheels")
+    alice = Patient("Alice")
+    doc.add_patient(alice)
+    doc.add_patient(alice)
+    assert len(doc.patients) == 1
+...
+```
+
+```python
+# file: inflammation/models.py
+...
+class Person:
+    """A person."""
+    def __init__(self, name):
+        self.name = name
+
+    def __str__(self):
+        return self.name
+
+class Patient(Person):
+    """A patient in an inflammation study."""
+    def __init__(self, name):
+        super().__init__(name)
+        self.observations = []
+
+    def add_observation(self, value, day=None):
+        if day is None:
+            try:
+                day = self.observations[-1].day + 1
+            except IndexError:
+                day = 0
+        new_observation = Observation(day, value)
+        self.observations.append(new_observation)
+        return new_observation
+
+class Doctor(Person):
+    """A doctor in an inflammation study."""
+    def __init__(self, name):
+        super().__init__(name)
+        self.patients = []
+
+    def add_patient(self, new_patient):
+        # A crude check by name if this patient is already looked after
+        # by this doctor before adding them
+        for patient in self.patients:
+            if patient.name == new_patient.name:
+                return
+        self.patients.append(new_patient)
+...
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
+::: keypoints
+- Object oriented programming is a programming paradigm based on the concept of classes,
+  which encapsulate data and code.
+- Classes allow us to organise data into distinct concepts.
+- By breaking down our data into classes, we can reason about the behaviour of parts
+  of our data.
+- Relationships between concepts can be described using inheritance (*is a*) and composition
+  (*has a*).
+::: + diff --git a/paper.md b/paper.md new file mode 100644 index 000000000..c986f6fd2 --- /dev/null +++ b/paper.md @@ -0,0 +1,139 @@ +--- +title: 'Intermediate Research Software Development Skills (Python)' +tags: + - software design + - software engineering + - research software + - carpentry + - intermediate + - python +authors: + - name: Stephen Crouch + email: s.crouch@software.ac.uk + orcid: 0000-0001-8985-6814 + affiliation: 1 + - name: Aleksandra Nenadic + email: a.nenadic@software.ac.uk + orcid: 0000-0002-2269-3894 + affiliation: 1 + - name: James Graham + email: james.a.graham@kcl.ac.uk + orcid: 000-0001-5217-3104 + affiliation: 1,2 + - name: Martin Robinson + email: martin.robinson@cs.ox.ac.uk + orcid: 0000-0002-1572-6782 + affiliation: 3 + - name: Sam Mangham + email: S.Mangham@soton.ac.uk + orcid: 0000-0001-7511-5652 + affiliation: 1 + - name: Jacalyn Laird + email: + orcid: 000-0002-9048-9393 + affiliation: 1,4 + - name: Thomas Kiley + email: + orcid: + affiliation: + - name: Matthew Bluteau + email: matthew.bluteau@ukaea.uk + orcid: 0000-0001-9498-8475 + affiliation: 5 + - name: Sven van der Burg + email: s.vanderburg@esciencecenter.nl + orcid: 0000-0003-1250-6968 + affiliation: 6 + - name: Giulia Crocioni + email: + orcid: 0000-0002-0823-0121 + affiliation: 6 +affiliations: + - name: Software Sustainability Institute + index: 1 + - name: King's College London + index: 2 + - name: University of Oxford + index: 3 + - name: SAC Consulting + index: 4 + - name: UK Atomic Energy Authority + index: 5 + - name: Netherlands eScience Center + index: 6 +date: 2024-05-28 +bibliography: paper.bib + + +--- + +# Summary + +This course aims to teach a core set of established, intermediate-level software development skills and best practices for working as part of a team in a +research environment using Python as an example programming language. 
+The core set of skills we teach is not a comprehensive, all-encompassing set, but a selective set of tried-and-tested collaborative development
+skills that forms a firm foundation for continuing on your learning journey.
+The course teaches these skills in a way that mimics a typical software development process of working as part of a team, starting from an existing piece of software.
+
+# Statement of need
+
+A typical learner for this course may be someone who works in a research environment, needs to write some code, and has gained basic software development skills
+either by self-learning or attending, e.g., a novice Software Carpentry Python course.
+They have been applying those skills in their domain of work by writing code for some time, e.g. half a year or more.
+However, their software development-related projects are now becoming larger and are involving more researchers and other stakeholders (e.g. users), for example:
+
+* Software is becoming more complex and more collaborative development effort is needed to keep the software running
+* Software is going further than just the small group developing and/or using the code - there are more users and an increasing need to add new features
+* ‘Technical debt’ is increasing with demands to add new functionality while ensuring previous development efforts remain functional and maintainable
+
+They now need intermediate software engineering skills to help them design more robust code that goes beyond a few thrown-together proof-of-concept scripts,
+taking into consideration the lifecycle of software, writing software for stakeholders, team ethic and applying a process to understanding,
+designing, building, releasing, and maintaining software.
+
+# Learning objectives, design, and experience
+
+After going through this course, participants will be able to:
+
+* Set up and use a suitable development environment together with popular source code management infrastructure to develop software collaboratively
+* Use a test framework to automate the verification of correct behaviour of code, and employ parameterisation and continuous integration to scale and further automate code testing
+* Design robust, extensible software through the application of suitable programming paradigms and design techniques
+* Understand the code review process and employ it to improve the quality of code
+* Prepare and release software for reuse by others
+* Manage software improvement from feedback through agile techniques
+
+The course follows a narrative around a software development team working on an existing software project that is analysing patients’ inflammation data
+(from the novice Software Carpentry's "Programming in Python" course).
+The course is split into 5 sections, each of which can be delivered in approximately half to a full day, in either guided self-learning mode (where helpers provide help
+and answer questions - synchronously or asynchronously) or in a standard instructor-led mode.
+Learners are typically organised in small groups from the outset and initially work individually through the material on their own with the aid of helpers (or follow an instructor).
+In later sections, exercises involve more group work and learners from the same group form a development team and collaborate on a mini software project.
+
+# How the lesson came to be
+
+The Software Sustainability Institute (SSI) conducted an international RSE survey in 2018 as well as a series of internal interviews with the key RSE group leaders and
+[SSI's Open Call research software projects](https://www.software.ac.uk/news/need-free-help-your-research-software-try-institutes-open-call-1) we supported with free
+software development expertise and consultancy, and asked them about their current training needs.
+They all came back to us with the same feedback: they wanted to know what software engineering skills to learn next
+after gaining foundational computational skills via Software, Data or Library Carpentry, and where to find such training resources.
+There was also a shift away from working on research software projects solo and in isolation towards working collaboratively in teams,
+as software is developed in industry, and a need to learn the skills this involves.
+
+Original lesson authors Aleksandra Nenadic, James Graham, and Steve Crouch from the Software Sustainability Institute joined up to create this
+course to fill those gaps and started working on it in 2019.
+
+# Acknowledgements
+
+Original lesson authors Aleksandra Nenadic, James Graham, and Steve Crouch were supported by the UK's Software Sustainability Institute
+via the EPSRC, BBSRC, ESRC, NERC, AHRC, STFC and MRC grant EP/S021779/1.
+
+Since then, many people have contributed to the course material - see [AUTHORS](https://github.com/carpentries-incubator/python-intermediate-development/blob/gh-pages/AUTHORS).
+
+
+# References
+
+See [paper.bib](https://github.com/carpentries-incubator/python-intermediate-development/blob/gh-pages/paper.bib) file.
+
diff --git a/persistence.md b/persistence.md
new file mode 100644
index 000000000..b20ec956e
--- /dev/null
+++ b/persistence.md
@@ -0,0 +1,442 @@
+---
+title: "Extra Content: Persistence"
+teaching: 25
+exercises: 25
+---
+
+::: questions
+- How can we store and transfer structured data?
+- How can we make it easier to substitute new components into our software? +::: + +::: objectives +- Describe how the environment in which software is used may constrain its design. +- Identify common components of multi-layer software projects. +- Define serialisation and deserialisation. +- Store and retrieve structured data using an appropriate format. +- Define what is meant by a contract in the context of Object Oriented design. +- Explain the benefits of contracts and implement software components which fulfill + them. +::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Follow up from Section 3 + +This episode could be read as a follow up from the end of +[Section 3 on software design and development](../episodes/35-software-architecture-revisited.md#towards-collaborative-software-development). + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Our patient data system so far can read in some data, process it, and display it to people. +What's missing? + +Well, at the moment, if we wanted to add a new patient or perform a new observation, +we would have to edit the input CSV file by hand. +We might not want our staff to have to manage their patients +by making changes to the data by hand, +but rather provide the ability to do this through the software. +That way we can perform any necessary validation +(e.g. inflammation measurements must be a number) +or transformation before the data gets accepted. + +If we want to bring in this data, +modify it somehow, +and save it back to a file, +all using our existing MVC architecture pattern, +we will need to: + +- Write some code to perform data import / export (**persistence**) +- Add some views we can use to modify the data +- Link it all together in the controller + +## Serialisation and Serialisers + +The process of converting data from an object to and from storable formats +is often called **serialisation** and **deserialisation** +and is handled by a **serialiser**. 
+Serialisation is the process of +exporting our structured data to a usually text-based format for easy storage or transfer, +while deserialisation is the opposite. +We are going to be making a serialiser for our patient data, +but since there are many different formats we might eventually want to use to store the data, +we will also make sure it is possible to add alternative serialisers later and swap between them. +So let us start by creating a base class +to represent the concept of a serialiser for our patient data - +then we can specialise this to make serialisers for different formats +by inheriting from this base class. + +By creating a base class we provide a contract that any kind of patient serialiser must satisfy. +If we create some alternative serialisers for different data formats, +we know that we will be able to use them all in exactly the same way. +This technique is part of an approach called **design by contract**. + +We will call our base class `PatientSerializer` and put it in file `inflammation/serializers.py`. + +```python +# file: inflammation/serializers.py + +from inflammation import models + + +class PatientSerializer: + model = models.Patient + + @classmethod + def serialize(cls, instances): + raise NotImplementedError + + @classmethod + def save(cls, instances, path): + raise NotImplementedError + + @classmethod + def deserialize(cls, data): + raise NotImplementedError + + @classmethod + def load(cls, path): + raise NotImplementedError +``` + +Our serialiser base class has two pairs of class methods +(denoted by the `@classmethod` decorators), +one to serialise (save) the data and one to deserialise (load) it. +We are not actually going to implement any of them quite yet +as this is just a template for how our real serialisers should look, +so we will raise `NotImplementedError` to make this clear +if anyone tries to use this class directly. 
+The reason we have used class methods is that +we do not need to be able to pass any data in using the `__init__` method, +as we will be passing the data to be serialised directly to the `save` function. + +There are many different formats we could use to store our data, +but a good one is [**JSON** (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON). +This format comes originally from JavaScript, +but is now one of the most widely used serialisation formats +for exchange or storage of structured data, +used across most common programming languages. + +Data in JSON format is structured using nested +**arrays** (very similar to Python lists) +and **objects** (very similar to Python dictionaries). +For example, we are going to try to use this format to store data about our patients: + +```json +[ + { + "name": "Alice", + "observations": [ + { + "day": 0, + "value": 3 + }, + { + "day": 1, + "value": 4 + } + ] + }, + { + "name": "Bob", + "observations": [ + { + "day": 0, + "value": 10 + } + ] + } +] +``` + +Compared to the CSV format, +this gives us much more flexibility to describe complex structured data. +If we wanted to represent this data in CSV format, +the most natural way would be to have two separate files: +one with each row representing a patient, +the other with each row representing an observation. +We would then need to use a unique identifier to link each observation record to the relevant patient. +This is how relational databases work, +but it would be quite complicated to manage this ourselves with CSVs. + +Now, if we are going to follow +[TDD (Test Driven Development)](../episodes/22-scaling-up-unit-testing.md#test-driven-development), +we should write some test code. +Our JSON serialiser should be able to save and load our patient data to and from a JSON file, +so for our test we could try these save-load steps +and check that the result is the same as the data we started with. 
+Again, you might need to change these examples slightly
+to get them to fit with how you chose to implement your `Patient` class.
+
+```python
+# file: tests/test_serializers.py
+
+from inflammation import models, serializers
+
+def test_patients_json_serializer():
+    # Create some test data
+    patients = [
+        models.Patient('Alice', [models.Observation(i, i + 1) for i in range(3)]),
+        models.Patient('Bob', [models.Observation(i, 2 * i) for i in range(3)]),
+    ]
+
+    # Save and reload the data
+    output_file = 'patients.json'
+    serializers.PatientJSONSerializer.save(patients, output_file)
+    patients_new = serializers.PatientJSONSerializer.load(output_file)
+
+    # Check that we have got the same data back
+    for patient_new, patient in zip(patients_new, patients):
+        assert patient_new.name == patient.name
+
+        for obs_new, obs in zip(patient_new.observations, patient.observations):
+            assert obs_new.day == obs.day
+            assert obs_new.value == obs.value
+```
+
+Here we set up some patient data, which we save to a file named `patients.json`.
+We then load the data from this file and check that the results match the input.
+
+With our test, we know what the correct behaviour looks like - now it is time to implement it.
+For this, we will use one of Python's built-in libraries.
+Among other more complex features,
+the `json` library provides functions for
+converting between Python data structures and JSON formatted text files.
+Our test also did not specify what the structure of our output data should be,
+so we need to make that decision here -
+we will use the format from the JSON example we saw earlier.
+ +```python +# file: inflammation/serializers.py + +import json +from inflammation import models + +class PatientSerializer: + model = models.Patient + + @classmethod + def serialize(cls, instances): + return [{ + 'name': instance.name, + 'observations': instance.observations, + } for instance in instances] + + @classmethod + def deserialize(cls, data): + return [cls.model(**d) for d in data] + + +class PatientJSONSerializer(PatientSerializer): + @classmethod + def save(cls, instances, path): + with open(path, 'w') as jsonfile: + json.dump(cls.serialize(instances), jsonfile) + + @classmethod + def load(cls, path): + with open(path) as jsonfile: + data = json.load(jsonfile) + + return cls.deserialize(data) +``` + +For our `save` / `serialize` methods, +since the JSON format is similar to nested Python lists and dictionaries, +it makes sense as a first step to convert the data from our `Patient` class into a dictionary - +we do this for each patient using a list comprehension. +Then we can pass this to the `json.dump` function to save it to a file. + +As we might expect, the `load` / `deserialize` methods are the opposite of this. +Here we need to first read the data from our input file, +then convert it to instances of our `Patient` class. +The `**` syntax here may be unfamiliar to you - +this is the **dictionary unpacking operator**. +The dictionary unpacking operator can be used when calling a function +(like a class `__init__` method) +and passes the items in the dictionary as named arguments to the function. +The name of each argument passed is the dictionary key, +the value of the argument is the dictionary value. + +When we run the tests however, we should get an error: + +```error +FAILED tests/test_serializers.py::test_patients_json_serializer - TypeError: Object of type Observation is not JSON serializable +``` + +This means that our patient serializer almost works, +but we need to write a serializer for our observation model as well! 
+ +Since this new serializer is not a type of `PatientSerializer`, +we need to inherit from a new base class +which holds the design that is shared between `PatientSerializer` and `ObservationSerializer`. +Since we do not actually need to save the observation data to a file independently, +we will not worry about implementing the `save` and `load` methods for the `Observation` model. + +```python +# file: inflammation/serializers.py + +from inflammation import models + + +class Serializer: + @classmethod + def serialize(cls, instances): + raise NotImplementedError + + @classmethod + def save(cls, instances, path): + raise NotImplementedError + + @classmethod + def deserialize(cls, data): + raise NotImplementedError + + @classmethod + def load(cls, path): + raise NotImplementedError + + +class ObservationSerializer(Serializer): + model = models.Observation + + @classmethod + def serialize(cls, instances): + return [{ + 'day': instance.day, + 'value': instance.value, + } for instance in instances] + + @classmethod + def deserialize(cls, data): + return [cls.model(**d) for d in data] + +... +``` + +Now we can link this up to the `PatientSerializer` and our test should finally pass. + +```python +# file: inflammation/serializers.py +... + +class PatientSerializer(Serializer): + model = models.Patient + + @classmethod + def serialize(cls, instances): + return [{ + 'name': instance.name, + 'observations': ObservationSerializer.serialize(instance.observations), + } for instance in instances] + + @classmethod + def deserialize(cls, data): + instances = [] + + for item in data: + item['observations'] = ObservationSerializer.deserialize(item.pop('observations')) + instances.append(cls.model(**item)) + + return instances + +... +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Linking it All Together + +We have now got some code which we can use to save and load our patient data, +but we have not yet linked it up so people can use it. 
+ +Try adding some views to work with our patient data using the JSON serialiser. +When you do this, think about the design of the command line interface - +what arguments will you need to get from the user, +what output should they receive back? + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Equality Testing + +When we wrote our serialiser test, +we said we wanted to check that the data coming out was the same as our input data, +but we actually compared just parts of the data, +rather than just using `assert patients_new == patients`. + +The reason for this is that, +by default, `==` comparing two instances of a class +tests whether they are stored at the same location in memory, +rather than just whether they contain the same data. + +Add some code to the `Patient` and `Observation` classes, +so that we get the expected result when we do `assert patients_new == patients`. +When you have this comparison working, +update the serialiser test to use this instead. + +**Hint:** The method Python uses to check for equality of two instances of a class +is called `__eq__` and takes the arguments `self` (as all normal methods do) and `other`. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Advanced Challenge: Abstract Base Classes + +Since our `Serializer` class is designed not to be directly usable +and its methods raise `NotImplementedError`, +it ideally should be an abstract base class. +An abstract base class is one which is intended to be used only by creating subclasses of it +and can mark some or all of its methods as requiring implementation in the new subclass. + +Using Python's documentation on +the [abc module](https://docs.python.org/3/library/abc.html), +convert the `Serializer` class into an ABC. 
+ +**Hint:** The only component that needs to be changed is `Serializer` - +this should not require any changes to the other classes. + +**Hint:** The abc module documentation refers to metaclasses - do not worry about these. +A metaclass is a template for creating a class (classes are instances of a metaclass), +just like a class is a template for creating objects (objects are instances of a class), +but this is not necessary to understand +if you are just using them to create your own abstract base classes. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Advanced Challenge: CSV Serialization + +Try implementing an alternative serialiser, using the CSV format instead of JSON. + +**Hint:** Python also has a module for handling CSVs - +see the documentation for the [csv module](https://docs.python.org/3/library/csv.html). +This module provides a CSV reader and writer which are a bit more flexible, +but slower for purely numeric data, +than the ones we have seen previously as part of NumPy. + +Can you think of any cases when a CSV might not be a suitable format to hold our patient data? + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::: keypoints +- Planning software projects in advance can save a lot of effort later - even a partial + plan is better than no plan at all. +- The environment in which users run our software has an effect on many design choices + we might make. +- By breaking down our software into components with a single responsibility, we avoid + having to rewrite it all when requirements change. +- These components can be as small as a single function, or be a software package + in their own right. +- When writing software used for research, requirements *always* change. 
+:::
+
+
diff --git a/procedural-programming.md b/procedural-programming.md
new file mode 100644
index 000000000..8741bf785
--- /dev/null
+++ b/procedural-programming.md
@@ -0,0 +1,78 @@
+---
+title: "Extra Content: Procedural Programming"
+teaching: 10
+exercises: 0
+---
+
+::: questions
+- What is procedural programming?
+- Which situations/problems is procedural programming well suited for?
+:::
+::: objectives
+- Describe the core concepts that define the procedural programming paradigm
+- Describe the main characteristics of code that is written in procedural programming
+  style
+:::
+
+
+In procedural programming, code is grouped into
+procedures (also known as routines) - reusable pieces of code that perform a specific action but
+have no return value - and functions (similar to procedures, but they return a value after execution).
+Procedures and functions both perform a single task, with exactly one entry and one exit point, and
+contain a series of logical steps (instructions) to be carried out.
+The primary concern is the *process* through which the input is transformed into the desired output.
+
+Key features of procedural programming include:
+
+- Sequence control: the code execution process goes through the steps in a defined order, with clear starting and ending points.
+- Modularity: code can be divided into separate modules or procedures to perform specific tasks, making it easier to maintain and reuse.
+- Standard data structures: procedural programming makes use of standard data structures such as
+  arrays, lists, and records to store and manipulate data efficiently.
+- Abstraction: procedures encapsulate complex operations and allow them to be represented as simple, high-level commands.
+- Execution control: variable implementations of loops, branches, and jumps give more control over the flow of execution.
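These features can be illustrated with a short sketch (our own illustrative example, not code from the course project): data flows through a sequence of small procedures, each with one entry point and one exit point, and each reusable on its own:

```python
def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def centre(values, offset):
    """Return a new list with `offset` subtracted from each value."""
    return [value - offset for value in values]


def normalise(values):
    """Centre the values around zero by subtracting their mean."""
    return centre(values, mean(values))


readings = [3.0, 4.0, 5.0]
# Control flows step by step: compute the mean, then subtract it
print(normalise(readings))  # [-1.0, 0.0, 1.0]
```

Each function here performs a single task, and the overall behaviour is easy to follow by reading the calls in sequence.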
+ 
+To better understand procedural programming, it is useful to compare it with other prevalent
+programming paradigms such as
+[object-oriented programming](../learners/object-oriented-programming.md) (OOP)
+and [functional programming](../learners/functional-programming.md)
+to shed light on their distinctions, advantages, and drawbacks.
+
+Procedural programming uses a very detailed list of instructions to tell the computer what to do
+step by step. This approach uses iteration to repeat a series of steps as often as needed.
+Functional programming is an approach to problem solving that treats every computation as a
+mathematical function (an expression) and relies more heavily on recursion as a primary control
+structure (rather than iteration).
+Procedural languages treat data and procedures as two different
+entities whereas, in functional programming, code is also treated as data - functions
+can take other functions as arguments or return them as results.
+Compare and contrast [two different implementations](../learners/functional-programming.md#functional-vs-procedural-programming)
+of the same functionality in procedural and functional programming styles
+to better grasp their differences.
+
+Procedural and [object-oriented programming](../learners/object-oriented-programming.md) have fundamental differences in their approach to
+organising code and solving problems.
+In procedural programming, the code is structured around functions and procedures that execute a
+specific task or operation. Object-oriented programming is based around objects and classes,
+where data is encapsulated within objects and manipulated by methods defined on those objects.
+Both procedural and object-oriented programming paradigms support [abstraction and modularisation](../episodes/33-code-decoupling-abstractions.md).
+Procedural programming achieves this through procedures and functions, while OOP uses classes and
+objects.
+ 
+However, OOP goes further by encapsulating related data and methods within objects,
+enabling a higher level of abstraction and separation between different components.
+Inheritance and polymorphism are two vital features provided by OOP, which are not intrinsically
+supported by procedural languages. [Inheritance](../learners/object-oriented-programming.md#inheritance) allows the creation of classes that inherit
+properties and methods from existing classes - enabling code reusability and reducing redundancy.
+[Polymorphism](../episodes/33-code-decoupling-abstractions.md#polymorphism) permits a single function or method to operate on multiple data types or objects,
+improving flexibility and adaptability.
+
+The choice between procedural, functional and object-oriented programming depends primarily on
+the specific project requirements and personal preference.
+Procedural programming may be more suitable for smaller projects, whereas OOP is typically
+preferred for larger and more complex projects, especially when working in a team.
+Functional programming can offer more elegant and scalable solutions for complex problems,
+particularly in parallel computing.
+
+::: keypoints
+- Procedural Programming emphasises a structured approach to coding, using
+  a sequence of tasks and subroutines to create a well-organised program.
+:::
diff --git a/programming-paradigms.md b/programming-paradigms.md
new file mode 100644
index 000000000..86b5ebcf9
--- /dev/null
+++ b/programming-paradigms.md
@@ -0,0 +1,168 @@
+---
+title: "Extra Content: Programming Paradigms"
+teaching: 20
+exercises: 0
+---
+
+
+::: questions
+- What should we consider when designing software?
+:::
+
+::: objectives
+- Describe some of the major software paradigms we can use to classify programming languages.
+:::
+
+
+In addition to [architectural decisions](../learners/software-architecture-extra.md) on bigger components of your code, it is important
+to understand the wider landscape of programming paradigms and languages,
+with each supporting at least one way to approach a problem and structure your code.
+In many cases, particularly with modern languages,
+a single language can allow many different structural approaches within your code.
+
+One way to categorise these structural approaches is into **paradigms**.
+Each paradigm represents a slightly different way of thinking about and structuring our code
+and each has certain strengths and weaknesses when used to solve particular types of problems.
+Once your software begins to get more complex
+it is common to use aspects of different paradigms to handle different subtasks.
+Because of this, it is useful to know about the major paradigms,
+so you can recognise where it might be useful to switch.
+
+There are two major families that we can group the common programming paradigms into:
+**Imperative** and **Declarative**.
+An imperative program uses statements that change the program's state -
+it consists of commands for the computer to perform
+and focuses on describing **how** a program operates step by step.
+A declarative program expresses the logic of a computation
+to describe **what** should be accomplished
+rather than describing its control flow as a sequence of steps.
+
+We will look into three major paradigms
+from the imperative and declarative families that may be useful to you -
+**Procedural Programming**, **Functional Programming** and **Object-Oriented Programming**.
+Note, however, that most languages can be used with multiple paradigms,
+and it is common to see multiple paradigms within a single program -
+so this classification of programming languages by the paradigm they use is not strict.
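The contrast between the two families can be seen in just a few lines of Python (our own illustrative sketch, not from the course project) - the imperative version spells out how the program's state changes step by step, while the declarative version only states what result we want:

```python
numbers = [1, 2, 3, 4]

# Imperative: describe *how* - change state one step at a time
total = 0
for number in numbers:
    total += number * number

# Declarative: describe *what* - the sum of the squares of the numbers
declarative_total = sum(number * number for number in numbers)

print(total, declarative_total)  # 30 30
```

Both compute the same value, but the second leaves the step-by-step control flow to the language.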
+ +### Procedural Programming + +Procedural Programming comes from a family of paradigms known as the Imperative Family. +With paradigms in this family, we can think of our code as the instructions for processing data. + +Procedural Programming is probably the style you are most familiar with +and the one we used up to this point, +where we group code into +*procedures performing a single task, with exactly one entry and one exit point*. +In most modern languages we call these **functions**, instead of procedures - +so if you are grouping your code into functions, this might be the paradigm you are using. +By grouping code like this, we make it easier to reason about the overall structure, +since we should be able to tell roughly what a function does just by looking at its name. +These functions are also much easier to reuse than code outside of functions, +since we can call them from any part of our program. + +So far we have been using this technique in our code - +it contains a list of instructions that execute one after the other starting from the top. +This is an appropriate choice for smaller scripts and software +that we are writing just for a single use. +Aside from smaller scripts, Procedural Programming is also commonly seen +in code focused on high performance, with relatively simple data structures, +such as in High Performance Computing (HPC). +These programs tend to be written in C (which does not support Object Oriented Programming) +or Fortran (which did not until recently). +HPC code is also often written in C++, +but C++ code would more commonly follow an Object Oriented style, +though it may have procedural sections. + +Note that you may sometimes hear people refer to this paradigm as "functional programming" +to contrast it with Object Oriented Programming, +because it uses functions rather than objects, +but this is incorrect. 
+Functional Programming is a separate paradigm that +places much stronger constraints on the behaviour of a function +and structures the code differently as we will see soon. + +You can read more in an [extra episode on Procedural Programming](../learners/procedural-programming.md). + +### Functional Programming + +Functional Programming comes from a different family of paradigms - +known as the Declarative Family. +The Declarative Family is a distinct set of paradigms +which have a different outlook on what a program is - +here code describes *what* data processing should happen. +What we really care about here is the outcome - how this is achieved is less important. + +Functional Programming is built around +a more strict definition of the term **function** borrowed from mathematics. +A function in this context can be thought of as +a mapping that transforms its input data into output data. +Anything a function does other than produce an output is known as a **side effect** +and should be avoided wherever possible. + +Being strict about this definition allows us to +break down the distinction between **code** and **data**, +for example by writing a function which accepts and transforms other functions - +in Functional Programming *code is data*. + +The most common application of Functional Programming in research is in data processing, +especially when handling **Big Data**. +One popular definition of Big Data is +data which is too large to fit in the memory of a single computer, +with a single dataset sometimes being multiple terabytes or larger. +With datasets like this, we cannot move the data around easily, +so we often want to send our code to where the data is instead. +By writing our code in a functional style, +we also gain the ability to run many operations in parallel +as it is guaranteed that each operation will not interact with any of the others - +this is essential if we want to process this much data in a reasonable amount of time. 
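For example (a small sketch of our own to illustrate the style), `scale` below is a pure function - its output depends only on its inputs and it has no side effects - so mapping it over a dataset could safely be split across many workers:

```python
def scale(value):
    """Pure function: no side effects, output depends only on the input."""
    return value * 2


data = [1, 2, 3]

# `map` takes the function itself as an argument - here, code is data
scaled = list(map(scale, data))
print(scaled)  # [2, 4, 6]
```

Because `scale` never touches shared state, applying it to one element cannot interfere with applying it to another.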
+ +You can read more in an [extra episode on Functional Programming](../learners/functional-programming.md). + +### Object Oriented Programming + +Object Oriented Programming focuses on the specific characteristics of each object +and what each object can do. +An object has two fundamental parts - properties (characteristics) and behaviours. +In Object Oriented Programming, +we first think about the data and the things that we are modelling - and represent these by objects. + +For example, if we are writing a simulation for our chemistry research, +we are probably going to need to represent atoms and molecules. +Each of these has a set of properties which we need to know about +in order for our code to perform the tasks we want - +in this case, for example, we often need to know the mass and electric charge of each atom. +So with Object Oriented Programming, +we will have some **object** structure which represents an atom and all of its properties, +another structure to represent a molecule, +and a relationship between the two (a molecule contains atoms). +This structure also provides a way for us to associate code with an object, +representing any **behaviours** it may have. +In our chemistry example, this could be our code for calculating the force between a pair of atoms. + +Most people would classify Object Oriented Programming as an +[extension of the Imperative family of languages](https://www.digitalocean.com/community/tutorials/functional-imperative-object-oriented-programming-comparison) +(with the extra feature being the objects), but +[others disagree](https://stackoverflow.com/questions/38527078/what-is-the-difference-between-imperative-and-object-oriented-programming). + +You can read more in an [extra episode on Object Oriented Programming](../learners/object-oriented-programming.md). 
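The atom and molecule example above might be sketched like this (the class and attribute names here are our own invention, purely for illustration):

```python
class Atom:
    """An object bundling properties (mass, charge) with the thing modelled."""

    def __init__(self, symbol, mass, charge):
        self.symbol = symbol
        self.mass = mass
        self.charge = charge


class Molecule:
    """A molecule *contains* atoms - a relationship between objects."""

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = atoms

    def total_mass(self):
        # A behaviour associated with the object's data
        return sum(atom.mass for atom in self.atoms)


water = Molecule('water', [Atom('H', 1.008, 0), Atom('H', 1.008, 0), Atom('O', 15.999, 0)])
print(water.total_mass())
```

The data (masses, charges) and the behaviour that uses it (`total_mass`) live together on the objects that model the problem.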
+ 
+## Other Paradigms
+
+The three paradigms introduced here are some of the most common,
+but there are many others which may be useful for addressing specific classes of problem -
+for much more information see Wikipedia's page on
+[programming paradigms](https://en.wikipedia.org/wiki/Programming_paradigm).
+
+We have mainly used Procedural Programming in this lesson, but you can
+have a closer look at the [Functional](../learners/functional-programming.md) and
+[Object Oriented Programming](../learners/object-oriented-programming.md) paradigms
+in extra episodes and how they can affect our architectural design choices.
+
+
+::: keypoints
+- A software paradigm describes a way of structuring or reasoning about code.
+- Different programming languages are suited to different paradigms.
+- Different paradigms are suited to solving different classes of problems.
+- A single piece of software will often contain instances of multiple paradigms.
+:::
+
diff --git a/quiz.md b/quiz.md
new file mode 100644
index 000000000..507bb3f51
--- /dev/null
+++ b/quiz.md
@@ -0,0 +1,210 @@
+---
+title: Quiz
+---
+
+This is an intermediate-level software development course,
+so you are expected to have some prerequisite knowledge of the topics covered,
+as outlined at the [beginning of the lesson](../index.md#prerequisites).
+Here is a little quiz you can take to test your prior knowledge,
+determine where you fit on the skills spectrum and decide whether this course is for you.
+
+## Git
+
+1. Which command should you use to initialise a new Git repository?
+
+   ```
+   a. git bash
+   b. git install
+   c. git init
+   d. git start
+   ```
+
+   > ::::::::::::::: solution
+   >
+   > ## Solution
+   >
+   > `git init` is the command to initialise a Git repository
+   > and tell Git to start tracking files in it.
+   > `git bash`, `git start` and `git install` are not Git commands and will return an error.
+   >
+   >
+   > :::::::::::::::::::::::::
+
+2. 
After you initialise a new Git repository + and create a file named `LICENCE.md` in the root of the repository, + which of the following commands will not work? + + ``` + a. git add LICENCE.md + b. git status + c. git add . + d. git commit -m "Licence file added" + ``` + + > ::::::::::::::: solution + > + > ## Solution + > + > `git commit -m "Licence file added"` won't work + > because you need to add the file to Git's staging area first + > before you can commit. + > + > + > ::::::::::::::::::::::::: + +3. `git clone` command downloads and creates a local repository from a remote repository. + Which command can then be used to upload your local changes back to the remote repository? + + ``` + a. git push + b. git add + c. git upload + d. git commit + ``` + + > ::::::::::::::: solution + > + > ## Solution + > + > `git push` is the correct command. + > `git add` adds a file to the local staging area, + > `git commit` commits the staged changes to the local repository + > and `git push` will push those committed changes to the remote repository. + > `git upload` is not a Git command and will return an error. + > + > + > ::::::::::::::::::::::::: + +## Shell + +1. In the command line shell, + which command can you use to see the directory you are currently in? + + ``` + a. whereami + b. locate + c. map + d. pwd + ``` + + > ::::::::::::::: solution + > + > ## Solution + > + > `pwd` (which stands for 'print working directory') is the correct command. + > + > + > ::::::::::::::::::::::::: + +2. Which command do you use to go to the parent directory of the directory you are currently in? + + ``` + a. cd - + b. cd ~ + c. cd /up + d. cd .. + ``` + + > ::::::::::::::: solution + > + > ## Solution + > + > `cd ..` is the correct command. + > `cd -` goes to the previous location in history (not parent). + > `cd ~` goes to the home folder. + > `cd /up` goes to a folder `up` in the root (`/`) of the file system. + > + > + > ::::::::::::::::::::::::: + +3. 
How can you append the output of a command to a file?
+
+   ```
+   a. command > file
+   b. command >> file
+   c. command file
+   d. command < file
+   ```
+
+   > ::::::::::::::: solution
+   >
+   > ## Solution
+   >
+   > `command >> file` is the correct answer.
+   > `command > file` will redirect the output of a command to a file
+   > and overwrite its content,
+   > `command file` will pass the file as an argument to the command
+   > and `command < file` redirects input rather than output.
+   >
+   >
+   > :::::::::::::::::::::::::
+
+## Python
+
+1. Which of these collections defines a list in Python?
+
+   ```
+   a. {"apple", "banana", "cherry"}
+   b. {"name": "apple", "type": "fruit"}
+   c. ["apple", "banana", "cherry"]
+   d. ("apple", "banana", "cherry")
+   ```
+
+   > ::::::::::::::: solution
+   >
+   > ## Solution
+   >
+   > While all of the answers define a collection in Python,
+   > `["apple", "banana", "cherry"]` defines a list and is the correct answer.
+   > `{"apple", "banana", "cherry"}` defines a set;
+   > `{"name": "apple", "type": "fruit"}` defines a dictionary (a hash map);
+   > `("apple", "banana", "cherry")` defines a tuple (an ordered and unchangeable collection).
+   >
+   >
+   > :::::::::::::::::::::::::
+
+2. What is the correct syntax for an *if* statement in Python?
+
+   ```
+   a. if (x > 3):
+   b. if (x > 3) then:
+   c. if (x > 3)
+   d. if (x > 3);
+   ```
+
+   > ::::::::::::::: solution
+   >
+   > ## Solution
+   >
+   > `if (x > 3):` is the correct answer:
+   > an `if` statement in Python ends with a colon and does not use `then`.
+   >
+   >
+   > :::::::::::::::::::::::::
+
+3. Look at the following three assignment statements in Python.
+
+   ```
+   n = 300
+   m = n
+   n = -100
+   ```
+
+   What are the values of `n` and `m` after these assignments?
+
+   ```
+   a. n = 300 and m = 300
+   b. n = -100 and m = 300
+   c. n = -100 and m = -100
+   d. n = 300 and m = -100
+   ```
+
+   > ::::::::::::::: solution
+   >
+   > ## Solution
+   >
+   > `n = -100 and m = 300` is the correct answer.
+   > The assignment `m = n` makes `m` refer to the value `300`;
+   > reassigning `n` afterwards rebinds `n` only and does not change `m`.
+ > + > + > ::::::::::::::::::::::::: + + diff --git a/reference.md b/reference.md new file mode 100644 index 000000000..2a2cfb45d --- /dev/null +++ b/reference.md @@ -0,0 +1,16 @@ +--- +title: 'Reference' +--- + +## Glossary + +- Branch +- Commit +- CSS +- Git +- GitHub +- README +- Repository/repo +- Version Control +- YAML + diff --git a/section_1_setting_up_environment.md b/section_1_setting_up_environment.md new file mode 100644 index 000000000..df6a3fecc --- /dev/null +++ b/section_1_setting_up_environment.md @@ -0,0 +1,601 @@ +--- +jupyter: + celltoolbar: Slideshow + jupytext: + notebook_metadata_filter: -kernelspec,-jupytext.text_representation.jupytext_version,rise,celltoolbar + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + rise: + theme: solarized +--- + + + + # Intermediate Research Software Development in Python + + + +## Setup + +- As participants arrive, ask them about the installation of required software: Bash, Git, a GitHub account, Python with pip and venv +- Concurrently, send the link for the collaborative notes document and get them to sign the attendance list +- If you are recording the session, notify participants you will be doing so, and start recording + - Breakout rooms are not recorded + + + +## Setting the Scene + +Why this course? + +Have you ever thought: +- "there must be a better way to do this" +- "this software is getting in the way of my research" +- "why is it so difficult to get this program to run?" +- "this code is incomprehensible and really difficult to modify" +- "I screwed up my Python installation again and need to reinstall my OS" + + + +- why this course and why are you here? 
+ - we have learned programming to do our research: it is a tool and a means to an end + - likely we are mostly self-taught or have taken some intro courses + - but we now find the techniques we have picked up to be inadequate for the software we need to write + - single scripts no longer cut it and we are collaborating with more people, or have users for the software we are producing + - you need some new skills and tools to tackle these problems + +- You might have thought at some point (see questions on slide) + +- If so, then you have come to the right place. The objective of this course is to deal with these struggles you might be facing by teaching some intermediate software engineering skills + - just like maths, statistics, and physics theory, software engineering is a skill you need to continue to develop as a researcher + + + +### Intermediate Research Software Development + +What you will be able to do at the end that should help your work: + +- restructure existing code and design more robust software from scratch +- automate the process of testing and verifying software correctness +- support collaborations with others in a way that mimics a typical collaborative software development process +- get you ready to distribute your code for use by others + + + +### PSA: This is a Collaborative Learning Session + + + +- PSA: This is a Collaborative Learning Session + - this is meant in two ways + - I'll be doing a bit of instructing, but most of the learning will come from the many exercises and other activities that you will do in groups. This is the first aspect of collaboration. + - Secondly, there will naturally be a variety of different backgrounds and levels of experience, and we the instructors and helpers should not be seen as the final authority on these matters. We have valuable experience to share, but so do you. 
Everyone here can and should contribute to this learning process, and this shouldn't be viewed as knowledge being imparted from your instructors on high; please speak up and get involved in the conversation. + + + +### Tools + +- Python +- Integrated Development Environment: PyCharm or VS Code +- `pip` and `venv` +- GitHub + + + +- This course has necessarily made some decisions about the tools used to demonstrate the concepts being taught + - Python is used as a fairly ubiquitous and syntactically easy language; however, the point needs to be clear that this is not a course about Python; the course is about software engineering, and it is using Python as the playground to demonstrate the skills and concepts that should be valuable independent of the domain and language + - to this end, I will be trying to draw connections with other languages and development scenarios when applicable since I know Python is not necessarily the main development language for everyone at UKAEA + - Learners should have already been notified about the IDE selection and installation. If the instructor has decided to allow different editors, reiterate any caveats (e.g. happy for you to use these editors, but no guarantee that we can help you if you are stuck). At an intermediate level, it is likely learners already have exposure to a preferred IDE, so they can shoulder more of the responsibility for getting that to work. + - GitHub is ubiquitous in software development, and a lot of research code ends up there. Other platforms are similar and so whatever is learnt here will be applicable. 
+- in the long run, you will encounter many more tools than those shown here, and you will form your own preferences; that is fine and we are in no way suggesting these are the definitive tools that should be used by any researcher who codes
+
+
+
+### Rules of Engagement
+
+- Monitoring your status
+  - self reporting
+  - sporadic polls
+- Questions at any time by raising hand ✋
+- Some lecturing by instructor combined with independent study, exercises, and group activities
+- Take a break whenever you need it ☕
+
+
+
+- Rules of engagement
+- Monitoring your status
+  - We want to know how you are doing, and the more information we have about your progress, the better we can tailor the course to you and make it more valuable.
+  - There are two main ways to do this.
+  - Self reporting: Please use the green check mark and red 'x' in Zoom (or stickies if in person) to indicate your status with lessons or the current content; this is a more subtle way of indicating that you need help without interrupting the instructor. The helpers will be keeping an eye on the list of participants and their statuses. Can everyone please check now that they can put the green check mark up?
+  - Polls within Zoom will also be used to check how you are getting on. Please fill these in and do not ignore them! In person, it is easier to see how people are getting on.
+- Throughout the course, please feel free to interrupt at any point with a question (preferably by raising your hand if in person, or using the raise hand feature in Zoom or a relevant analogue).
+- Many portions of the course will involve breaking into separate groups to do work. Most of this will be independent work, but there are a few group tasks. There will usually be a helper in your room if you need assistance, but again, they are not all-knowing, so please help other participants if you think you can help.
+  - There will no doubt be a range of experiences and people moving at different paces in these groups. 
Please be mindful of that. If you find there is too much chatter and you cannot focus on getting things done, feel free to mute audio. + - If you fall behind on the independent exercises, do not worry and prioritise any group work or discussion at the end of a breakout session. You can catch up between sessions. + + + +## Content Overview + +![](../fig/course-overview.png) + + + +There are five main sections of the course, each roughly to be covered in one of the 6 sessions. + +1. Setting up Software Environment: **PyCharm or VSCode** for editing, testing and debugging, **GitHub** for collaborative development, **virtual environments** for dependency isolation, and **Python code style**. +2. Ensuring Software Correctness at Scale: how to set up a **test framework** and automate and scale testing with **Continuous Integration (CI)** +3. Software Development as a Process: an exploration of different **software design paradigms** and their advantages and disadvantages +4. Collaborative Software Development for Reuse: how do we start **collaborating** on a software project through processes like **code review** +5. Managing and Improving Software: move beyond the mechanics of _just_ collaborative software development and towards the maintenance and prioritisation of the evolution of our project through things like **issue tracking** and **software support** + + + +## Section 1: Setting Up The Environment For Collaborative Code Development + +![](../fig/section1-overview.png) + + + +- the overall objective for this section is to get set up with the _tools_ for collaborative code development, and of course there are lots of decisions to make +- the recommendations are opinionated but backed by experience + +1. Command Line & Virtual Development Environment: use command line to run our code and then the Python tools `venv` and `pip` to manage dependencies and isolate our project +2. 
Integrated Development Environment (IDE): course content supports **PyCharm** directly, but there is now additional material to support use of **VSCode** + - Show the "Extras" tab of the course material website and the VSCode information under that +3. GitHub and Git development workflows +4. Python coding style: PEP8 + +- But first, we will get an overview of the example project that we will be working on throughout the course and its structure. + + + +## Introduction to Our Project and Architecture + +The "patient inflammation" example from the Novice Software Carpentry Python Lesson. + +
+ + + +- Give an introduction to the "patient inflammation project" + - the software project studies inflammation in patients who have been given a new treatment for arthritis and re-uses the inflammation dataset from the novice Software Carpentry Python lesson + - The dataset contains information for 60 patients, who had their inflammation levels recorded for 40 days, so a 2D dataset like shown in the figure +- The analysis is incomplete and there are some errors that you will need to correct +- First, we need to get the project, so go to the course website and follow the instructions there for copying and then cloning the repository locally on your machine to work on + - The link for this episode is in the shared document as the section header "Introduction to Our Project and Architecture" + - Complete the lesson "Obtain the Software Project Locally" ~ 2-3 minutes + - please let us know when you are done by taking off your status now and then setting it to the green check at the end + + + +### Exercise: 🖉 Obtain the Software Project Locally + + + +### Project Structure + +
.
+├── data/
+│   └── inflammation-*.csv
+├── inflammation/
+│   ├── models.py
+│   └── views.py
+├── inflammation-analysis.py
+├── README.md
+└── tests/
+    ├── test_models.py
+    └── test_patient.py
+
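The `data/` directory above holds the inflammation datasets as plain CSV files, in which each row is a patient and each comma-separated column is one day's reading. A minimal sketch of loading such a file with NumPy — using a tiny inline stand-in with made-up values rather than the real 60-patient × 40-day files:

```python
import io

import numpy as np

# Tiny inline stand-in for a file such as data/inflammation-01.csv
# (hypothetical values): each row is one patient, each column one day.
csv_text = "0,1,2,1\n0,2,4,2\n0,1,1,1\n"

data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
print(data.shape)        # (patients, days) -> (3, 4) for this stand-in
print(data.max(axis=1))  # per-patient maximum inflammation over the days
```

For one of the real files, replace the `StringIO` object with the path `data/inflammation-01.csv`; the shape should then come out as `(60, 40)`.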
+
+
+- Let us take a look at the project structure
+  - I like to use `tree` (on Ubuntu installable through apt-get, not sure if it comes with Git for Windows)
+  - With this we see:
+    - README file (that typically describes the project, its usage, installation, authors and how to contribute),
+    - Python script `inflammation-analysis.py` provides the main entry point into the application
+    - three directories - inflammation, data and tests
+      - inflammation directory has two other Python scripts that we will look at more later
+      - data directory has the data we will be analysing in csv files
+      - tests directory has tests for our Python programs that we will be adding to and correcting
+  - **Important Point**: the structure of this project is not arbitrary
+    - a difference between novice and intermediate software development is that at the intermediate level the structure of the project should be planned in advance, and this includes the structure of abstract entities like software components and how they interact
+    - in contrast, a novice will make this structure up as they go along (nothing wrong with that, it is part of learning, but at some point you need to stop doing that and have a think about these things in advance before you start a project).
+    - this is probably an appropriate point to link to the Python Cookiecutter project template: https://github.com/ukaea/scientific-python-cookiecutter (navigate to this page in shared browser screen)
+
+
+
+### Exercise: 🖉 Have a Peek at the Data
+
+Please post your answers in the shared document.
+
+
+
+- This should be quick, 1-2 minutes, no need to check on status.
+- Demonstrate the answer from the command line using your preferred command: `head -n 5 data/inflammation-01.csv`
+- Explain that each line (row) is indexed by patient, and each comma-delimited field (i.e. 
column) is indexed by the day + + + +### Software Architecture + +**Theory covered later in Section 3: Software Architecture and Design** + + + +- _Skip over the higher-level discussion of architecture because it is a bit out of place here; it is covered anyway in the later Section 3: Software Design_ + + + +### Model-View-Adapter (MVA) + +
+
+By Soroush Khanlou, https://khanlou.com/2014/03/model-view-whatever/
+
+
+
+- Model represents the data used by a program and also contains operations/rules for manipulating and changing the data in the model. This may be a database, a file, a single data object or a series of objects. For our example project, the Model is embodied in the `inflammation/models.py` module, which contains the patient data in appropriate data structures along with the methods to manipulate that patient data to get useful statistics. Show the location of this file from the command line using `tree -L 2`.
+
+- View is the means of displaying data to users/clients within an application (i.e. provides visualisation of the state of the model) **and** collecting user input in the case of a Graphical User Interface (GUI). However, sometimes the line can get a bit blurred, and the Adapter might collect user input directly (which is actually the case for our example project). For our example project, the View is represented by the module `inflammation/views.py` and it contains the routines to produce graphs of our patient data and the results from analysis. Show this file from the command line.
+
+- Adapter manipulates both the Model and the View. Usually, it accepts input from the View and performs the corresponding action on the Model (changing the state of the model) and then updates the View accordingly. 
In our simple example project, the file `inflammation-analysis.py` is the Adapter, and it actually does handle user input, so it does not quite fully abide by MVA; it shares features with another architectural pattern called Model-View-Controller
+
+- Some final words on architecture and these particular patterns:
+  - do not get too caught up determining exactly what functionality should be the responsibility of each component
+  - the act of splitting things up and thinking about how they will interact through interfaces is where you get the most value
+  - it is likely you were already doing this in an informal fashion, but good to think about it more explicitly **and try to record your design in some appropriate format**
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+## Virtual Environments For Software Development
+
+
+
+
+- Switch to terminal and the directory of the example project at its initial commit
+  - Make sure you do not have a virtual environment activated, and preferably no numpy or matplotlib in your system Python installation. If you do, create a fresh virtual environment that does not have these packages.
+- Try to run the analysis script from the command line: `python3 inflammation-analysis.py`
+  - If you are in a clean Python installation, this should throw a `ModuleNotFoundError`, which proves we have some external dependencies that are not installed and that we need to get through a package manager
+  - Depending on what learners have in their `PYTHONPATH` and site packages for their current default environment, they may or may not have success with this command
+  - Take a look at the top of the views file to see the other dependencies: `head inflammation/views.py`
+- Before jumping to install matplotlib and numpy, it is worth thinking about other projects we might currently be working on, or may work on in the future
+  - what if they have a requirement for a different version of numpy or matplotlib? or a different Python version? 
how are you going to share your project with collaborators and make sure they have the correct dependencies?
+  - in general, each project is going to have its own unique configuration and set of dependencies
+  - to solve this in Python, we set up a virtual environment for each project, containing a set of libraries that will not interact with others on the system
+  - it can be thought of as an isolated partial installation of Python specifically for your project
+
+
+
+### Tools for Dependency Management
+
+- For creating and managing virtual environments: `venv`
+- For installing dependencies in those environments: `pip`
+
+
+
+- `venv` comes standard with `Python 3.3+`, which is the main advantage of using it
+  - however, an important thing to note with `venv` is that you can only ever use the system version of Python with it (e.g. if you have Python 3.8 on your system, you can only ever create a virtual environment with Python 3.8). Most of the time this is not a problem, but if you are in dire need of a particular Python version, then there are other tools that can do that job (next slide).
+  - Another consequence is that if there is an update of your system installation then your virtual environment will stop working, and you will need to get rid of it and create a new one (more on that later)
+- `pip` stands for "Pip Installs Packages" and it queries the Python Package Index (PyPI) to install dependencies
+  - it is ubiquitous and compatible with all Python distributions
+
+
+
+### Lots of other tools...
+
+
+
+
+- there are plenty of other tools out there that manage Python environments, and it can become messy
+- worth a note is Anaconda, which supplies `conda`
+  - `conda` is both a package manager and virtual environment manager, and it can install non-Python packages
+  - this has made it popular in a number of scientific settings; however, due to licensing ambiguity, we advise against the Anaconda-distributed version
+  - there is a community-maintained installer called `miniforge`, which defaults to the open source conda-forge channel, that you might consider if your project has a lot of non-Python dependencies
+
+
+
+### Breakout Exercise: Creating a `venv` Environment
+
+Read through and follow along until the end of the episode page.
+
+
+
+- send into breakout rooms to do work and read for about 15 minutes
+- remember they should clear their status and use the green check when they are done
+- Checks at the end of the breakout
+  1. Did everyone get the error when trying to run the `inflammation-analysis.py` script?
+  1. Consider running some quizzes (i.e. formative assessment). Question suggestions:
+     - Which statements are true?
+       1. Virtual environments created by `venv` are a completely new and distinct entire Python installation (i.e. the interpreter, standard library, and third party dependencies all installed in an isolated folder)
+       1. Virtual environments help keep dependencies required by different projects separate
+       1. A `requirements.txt` file is used by `venv` as a list of the dependencies that it will install
+       1. `pip` is a tool to download and install Python packages in whatever your current environment is
+       1. all of the above
+       1. a and c
+       1. b and d (correct answer)
+  1. A comment about exporting/importing an environment
+     - I think there are actually two scenarios here:
+       1. If you are providing a Python application (i.e. building and deploying something) or doing a project that is a scientific analysis, then it is fine to pin your dependencies as detailed here in a `requirements.txt`
+       2. 
If you are providing a reusable library (i.e. one that might be called from someone else's code or another library), then pinning can be overly restrictive and cause issues for package managers, and it is considered bad practice to pin your dependencies like this
+          - Instead, you should specify loose dependency requirements in the `install_requires=[...]` metadata of `setup.py`. A full `setup.py` project is outside the scope of this course, but there are many good resources on this.
+          - There are some links in the shared document that discuss this further
+            -
+            -
+            - and if you want a template for Python projects that keeps `requirements.txt` and `install_requires` synced:
+          - In general, I would recommend against pinning unless necessary
+
+
+
+### Need to recreate your virtual environment?
+
+```bash
+rm -r venv/
+python3 -m venv venv
+source venv/bin/activate
+pip install numpy matplotlib  # this project's dependencies
+# or
+pip install -r requirements.txt # great reason to have this file
+
+```
+
+
+
+- this should be in the shared document as well
+
+
+
+### Dependency Management in Other Languages
+
+- Each language will have its own way of handling this, and it will also depend on _where_ you are doing your development
+- The _coverall_ option these days is to develop in a Docker container (or relevant analogue)
+  - The `Dockerfile` codifies the dependencies and setup for your project
+- If you are on a cluster, then you might be familiar with the `module` command
+  - This allows you to get different versions of libraries without installing them yourself (and indeed, because you do not have permission to install them)
+  - Spack and EasyBuild are also quite popular package management tools for HPC; Spack has virtual environments! 
+- C++
+  - CMake is a ubiquitous build tool and overlaps with dependency management
+  - Conan is a specific package manager for C++
+  - Spack is also a good option
+- Fortran
+  - The Fortran package manager (fpm) is very nascent, and probably more for modern Fortran
+  - Spack again
+
+
+
+## Integrated Development Environments
+
+
+
+ + + +
+
+
+- Most of us probably started out programming with a simple text editor and ran our programs from the command line with a compiler or interpreter
+  - This is fine to start off, but as our projects become more complex with more files and configurations, it is natural that the tools we use to develop need to evolve as well
+  - Enter the Integrated Development Environment (IDE)
+- Preference for Code Editors and IDEs is one of the more contentious and strongly felt topics among software developers, but the bottom line is that if a tool works for you and helps you be productive, then it is absolutely fine to use that tool
+  - But again, for the practicalities of this course, the decision to support two editors, PyCharm and VSCode, has been made
+  - If you are comfortable enough in another IDE or code editor to get the functionality demonstrated in the content below, then please feel free to use that tool here, but this is a disclaimer that we cannot promise to resolve any issues you have, and if these issues are holding the group up then we will need to move on
+
+
+
+### Breakout Exercise: Using the PyCharm IDE
+
+Start from this heading and continue to the end of the page.
+
+
+
+- Before launching into this exercise, you should poll how many students are using each editor
+  - If the majority are using VS Code, consider doing a demo of all the features listed for PyCharm
+    using your own VS Code editor
+  - Otherwise, send learners off to read through and try out content from "Using the PyCharm IDE" (~ 30 mins, but could be less, so poll after 20 minutes to get a status check, or ask directly if in person)
+  - For VSCode users, remind them to consult the "Extras" content of the course web page and find the analogous functionality described there; if you are having trouble getting something to work, please ask for help! 
- Remind to use status green check when done (or red x if having trouble)
+  - Encourage learners to try out the features that are being discussed, and not to worry about making modifications to their code; since it is under version control, it will be easy to reset any changes
+  - Reinforce that we won't be using the version control interface of PyCharm, but it is a perfectly usable feature, and again this comes down to preference
+
+
+
+## ☕ 15 Minute Break ☕
+
+
+
+## Collaborative Software Development Using Git and GitHub
+
+
+
+ + + + +- Git is the de facto tool for version control in software development + - we should all be familiar with the time machine magic of git + - however, to call it just a version control tool misses the fact that what git really does is facilitate non-linear and distributed development collaboration on software projects +- Walk through this image as a Git refresher +- Do a poll to see if everyone is comfortable with all of the operations and terminology in that diagram + - Ask any uncertain terms to be put into the chat or shared document + - Go into more depth on the terms that come up + + + +### Breakout Exercise: Checking in Changes to Our Project + +Start from this heading and go until the "Git Branches" heading. + + + +- Get learners to independently go through the section "Checking-in Changes to Our Project" (~ 10 minutes) + - stop before the "Git Branches" Section + - note that SSH keys are now the recommended form of authentication with GitHub for this course, as explained in the Setup section + - if someone decides they want to do token authentication, this seems to be the only resource that is actually needed: https://www.edgoad.com/2021/02/using-personal-access-tokens-with-git-and-github.html (put this in shared document) + - Remind learners that they will need to copy the access token somewhere on their computer; if they use a password manager, consider making a new entry for this token; also, there are instructions to cache their token with the git cli, and that will make this more convenient since they will not need to enter the token with every git operation that communicates with GitHub + + + +### Git Branches + +
+
+
+- Git branches
+  - a branch is actually just a pointer to a commit, and that commit _can_ (but does not have to) define a distinct or divergent commit history from our main branch
+  - this allows developers to take "copies" of the code and make their own modifications without making changes to the original or affecting the commit history of the main branch (so others will not see the changes there until they are merged)
+  - this is the main aspect of git that facilitates collaboration
+  - talk through the image
+  - the best practice is to use a new branch for each separate and self-contained unit/piece of work you want to add to the project. This unit of work is also often called a feature and the branch where you develop it is called a feature branch. Each feature branch should have its own meaningful name - indicating its purpose (e.g. “issue23-fix”). If we keep making changes and pushing them directly to the main branch on GitHub, then anyone who downloads our software from there will get all of our work in progress - whether or not it’s ready to use! So, working on a separate branch for each feature you are adding is good for several reasons:
+    - it enables the main branch to remain stable while you and the team explore and test the new code on a feature branch,
+    - it enables you to keep the untested and not-yet-functional feature branch code under version control and backed up,
+    - you and other team members may work on several features at the same time independently from one another,
+    - if you decide that the feature is not working or is no longer needed - you can easily and safely discard that branch without affecting the rest of the code. 
+- Something missing from this section is a mention that a multi-person project, even if it is not external facing or has no users other than the developers, should have some record or agreement of how branching will work, and some document telling potential contributors how they can submit contributions through pull requests, usually in a `CONTRIBUTING.md` file.
+  - e.g. contributors fork your project, then work in their own feature branch, and when tested, they submit a PR to the *develop* branch of the upstream project
+
+
+
+### Breakout Exercise: 🖉 Creating Branches
+
+Continue from this heading to the end of the page.
+
+
+
+- Get learners to go through the remainder of the content from "Creating Branches" onwards (~ 15 minutes)
+- Once everyone is complete, consider running a quiz.
+  - You are working on a software project that has a main and a develop branch. Feature branches are supposed to be created off of the develop branch, but you mistakenly create your feature branch off of the main branch. You do not realise this until you have already committed some changes, and now you are freaking out because you think you might have affected the code on the main branch. Is this worry valid?
+    1. yes
+    1. no (correct answer)
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+## Python Code Style Conventions
+
+> "Any fool can write code that a computer can understand. 
Good programmers write code that humans can understand." — Martin Fowler
+
+- Coding _style_ is one factor that makes our code more understandable
+- Consistency is key
+
+
+
+- one of the main factors in whether code is understandable is whether it follows a consistent *style*
+- *style* encompasses, but is not limited to:
+  - cleanly and consistently formatted code
+  - descriptive comments and docstrings
+  - descriptive names for variables, functions, classes, and modules
+- the style you use for your code will vary depending on the language and what your team has agreed upon
+  - in order to help with implementing a consistent style, style guides or sets of conventions are used
+  - these can be agreed upon by colleagues or communities
+  - the important point is this: make sure that whatever style you choose is consistent **within** a project, and if possible also across related projects
+
+
+
+### Style in Different Languages and Tools
+
+- Python: PEP8
+  - `black`, `flake8`, `pylint`, etc...
+- C++: no language-wide consensus
+  - `clang-format` is widely used for enforcing formatting, and there are built-in presets for existing conventions followed by Google, LLVM, etc. Project-specific settings are made in a `.clang-format` file.
+  - `cpplint` is another option
+- Fortran: no language-wide consensus
+  - some tools for VSCode
+  - recent revival and there is a push towards modernising (best practices on new website)
+
+
+
+- Unless you have particular requirements, it is best to go with a style guide that has the majority consensus for a particular language (albeit sometimes this will not exist, so choose what seems best)
+  - In Python, this is PEP8
+  - In PyCharm, adherence to PEP8 will automatically be checked and violations flagged for fixing (demonstrate this live)
+  - VSCode can do the same thing with an extension. See the "Extras" section. 
+  - It is worth mentioning that at a project level, not everyone will be using the same IDE, so it is better to use an independent tool called a linter to enforce these style requirements
+    - `black` is a popular but harsh and opinionated tool that can take some getting used to
+    - `flake8` and `pylint` are a bit more conventional -> PyCharm can be configured to use one of these directly (outside of the scope of this course)
+  - C++ does not have a language-wide convention for style
+    - [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html) is widely used for enforcing formatting, and there are [built-in presets](https://clang.llvm.org/docs/ClangFormatStyleOptions.html#configurable-format-style-options) for existing conventions followed by Google, LLVM, etc. Project-specific settings are made in a `.clang-format` file.
+    - our guide on C++ for VSCode recommends cpplint: https://intranet.ccfe.ac.uk/software/guides/vscode-cpp.html
+    - Some other useful resources that cover a broader scope than just style and formatting are [Google's C++ Style Guide](https://google.github.io/styleguide/cppguide.html#Formatting) and the [C++ Core Guidelines by Bjarne Stroustrup (the creator of C++)](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md)
+  - Fortran also does not have a language-wide convention
+    - we have a great guide on tooling in VSCode: https://intranet.ccfe.ac.uk/software/guides/vscode-fortran.html
+    - this is a good online resource: https://fortran-lang.org/learn/best_practices
+
+
+
+### Breakout Exercise: 🖉 Indentation
+
+Start from this section heading and go to the end of the page.
+
+
+
+- Split learners into breakout rooms and get them to work through the content starting from the "Indentation" section (~ 30 minutes), going to the end of the page
+  - A lot of these checks for formatting can now be done automatically with your IDE or linters, so do not spend too long absorbing them. 
It is good to be aware of why rules are being applied, but the details of implementation are less important.
+  - poll/status check at the end
+- Some comments after the exercises
+  - There are many different docstring formats, and I tend to not like the Sphinx default very much. Google- or NumPy-style docstrings are much more readable.
+  - For the exercise to improve the docstrings, no mention is made of the fact that the module docstring should include a list of the functions in the module. This is another valid improvement. (Advance to next slide to see this)
+
+
+
+```python
+"""
+Functions:
+    load_csv - Load a Numpy array from a CSV file
+    daily_mean - Calculate the daily mean of a 2D inflammation data array
+    daily_max - Calculate the daily max of a 2D inflammation data array
+    daily_min - Calculate the daily min of a 2D inflammation data array
+"""
+```
+
+
+
+## Verifying Code Style Using Linters
+
+- A direct continuation of the previous lesson about coding conventions and style.
+- Linters help us enforce some aspects of these.
+
+
+
+### Linting vs Formatting
+
+- In Python, formatting is effectively reduced to how the whitespace and newlines are arranged around the actual text of your source code
+  - It is much easier to automate and enforce complete consistency for this subset of style. One could say it is "black or white", which is where the formatting tool `black` gets its name.
+- But there are other aspects of _style_ that are not formatting, e.g. the naming of variables from the previous section
+  - This is where linting comes in
+
+
+
+### Breakout Exercise
+
+Read from the top of the page to the bottom, completing exercises as you go.
+
+
+
+- Check in on students' status to see how they are making progress.
+- A quick quiz idea:
+  - Your IDE is telling you that your Python source file has a number of issues with indentation and line length. These problems fall under:
+    1. linting
+    1. formatting
+    1. style
+    1. all of the above
+    1. 
a and c (correct answer) + + + +## 🕓 End of Section 1 🕓 + diff --git a/section_2_ensuring_correctness.md b/section_2_ensuring_correctness.md new file mode 100644 index 000000000..32c97ca74 --- /dev/null +++ b/section_2_ensuring_correctness.md @@ -0,0 +1,210 @@ +--- +jupyter: + celltoolbar: Slideshow + jupytext: + notebook_metadata_filter: -kernelspec,-jupytext.text_representation.jupytext_version,rise,celltoolbar + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + rise: + theme: solarized +--- + + +# Section 2: Ensuring Correctness of Software at Scale + +
+
+
+
+
+
+- Probably the most important thing to take away from this course
+
+
+
+## Automatically Testing your Software
+
+- Big questions: how can we be sure the code we have written is correct, produces accurate results, and is of good quality?
+
+
+
+> **testing:** The process of operating a system or component under specified conditions, observing or recording the results, and making an evaluation of some aspect of the system or component
+> — IEEE Standard Glossary of Software Engineering Terminology
+
+
+
+- Big questions: how can we be sure the code we have written is correct, produces accurate results, and is of good quality?
+  - This is the domain of Verification and Validation (V&V), in which testing plays an important role
+
+> **testing:** The process of operating a system or component under specified conditions, observing or recording the results, and making an evaluation of some aspect of the system or component
+> — IEEE Standard Glossary of Software Engineering Terminology
+- i.e. inferring the _behaviour_ of our code through artifacts and making sure that matches what we expect or is required
+
+
+
+### Types of Testing
+
+- Types of testing
+  - Manual testing
+  - Automated testing
+    - Unit tests
+    - Functional or integration tests
+    - End-to-end tests
+    - Regression tests
+
+
+
+- Types of (dynamic) testing
+  - Manual testing: an important part of exploratory research
+  - Automated testing: codify the expected behaviour of our software such that verification can happen repeatedly without user inspection
+    - Unit tests: tests for small functional units of our code (i.e. functions, class methods, class objects)
+    - Functional or integration tests: work at a higher level, and test functional paths through your code, e.g. given some specific inputs, a set of interconnected functions across a number of modules (or the entire code) produce the expected result. 
+    - Regression tests: compare the current output of your code (usually an end-to-end result) to make sure it matches previous output that you do not want to change
+- there was a question that came in about drift in regression tests, and the short answer for how to deal with this is to first determine whether the output you are tracking is actually an invariant (or something close to an invariant)
+  - If not, then you will necessarily need to allow for relative proximity, but then you might question whether this is a good long-term output to base your regression test on.
+  - In our area and science broadly, invariants tend to be some observable or experimental physical results, so if your test is not based on those, you are probably going to have a tough time.
+
+
+
+### Breakout Exercise: 🖉 Set Up a New Feature Branch for Writing Tests
+
+Start from this section heading and go to the end of the page.
+
+
+
+- Breakout rooms from the page section "Set Up a New Feature Branch for Writing Tests" (~45 minutes) until the end of the page
+- status check and any questions
+- potential quiz questions:
+  - what is the correct order of these different types of tests if they should be in ascending order of scope (i.e. the amount of the codebase they cover) (answer: 3)
+    1. end-to-end, functional/integration, unit
+    2. functional/integration, unit, end-to-end
+    3. unit, functional/integration, end-to-end
+    4. unit, end-to-end, functional/integration
+  - automated testing of our code is important because (select all that apply, answer: 4)
+    1. it can quickly help anyone verify that the code is working the way that the developer intended it to
+    2. it provides a harness in which we can make changes to the code and be more sure that these changes do not alter behaviour of the code that shouldn't change
+    3. it reduces the overhead of new developers trying to determine if the code works on their machine
+    4. all of the above
+    5. 
none of the above
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+## Scaling Up Unit Testing
+
+1. Parameterise our tests to reduce repetition
+2. Check the test coverage of our code
+
+
+
+1. Parameterise our tests
+   - from the previous example, you may have noticed that if you want to run a test with the same logic but different input data, you need to create a new test function that is mostly the same
+   - there is a convenient way to avoid this in pytest called _parameterisation_, allowing a single test function to run through a variety of test input cases
+   - very powerful for improving the coverage of the parameter space that your code might be dealing with
+2. Check the test coverage
+   - on a related note, it is important to see how much of our code is "covered" (i.e. verified) by our tests so that we can get at least a relative idea of how the quality of our code is faring over time, and where we should focus testing efforts
+
+
+
+### Breakout Exercise: 🖉 Parameterising Our Unit Tests
+
+Go through this page to the end, starting from this section heading. In the last 5-7 minutes, please think about the question:
+
+_Where can and might the input data and corresponding expected results come from for code you use in your usual work?_
+
+Please discuss with your peers. Record answers in the shared document if you can.
+
+
+
+- send learners into breakout rooms for ~ 20 minutes
+  - before sending, make sure they are clear on the discussion question
+  - with about 5 minutes left, remind the groups to have a little discussion about their test data
+  - status check
+- check answers to the question in the shared document and briefly summarise
+  - example answer: You are working on an old plasma magnetohydrodynamics code that has been extensively tested against experiments. You have been tasked with adding some functionality to that code, but you want to make sure that you do not change the key results of the code. 
You take some inputs for well-known runs of the code that have been verified against experiment and save the outputs. You then use those outputs to compare against when you run the code in a test suite with the original inputs. This is basically creating some regression tests for the code, using results that you know are correct because of extensive experimental validation of the code in the past.
+- comments about the limits of testing:
+  - there are some good points there about getting value from testing
+  - what most researchers think:
+    - "Peer review of my paper will be the test"
+    - "Looking at a graph is enough"
+    - "I do not have time to implement a clunky testing framework"
+  - it hints that there is a spectrum between throwaway code that does not need to be tested and library code used by hundreds in a community that requires extensive testing suites with more than just unit tests
+  - where your particular code lies is a tricky question to answer sometimes, but a good rule of thumb is that if there is a chance that someone else will be using it, then you should give some thought to tests
+  - some further thoughts here: https://bielsnohr.github.io/2021/11/29/iccs-part2-and-testing.html
+  - testing has a demonstrably positive impact upon the design of your code
+  - it must of course also be acknowledged that testing is not the answer to everything, and that it cannot substitute for good manual and acceptance testing
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+# Continuous Integration for Automated Testing
+
+
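For reference when demonstrating this section: a GitHub Actions workflow is configured with a YAML file in the repository. A minimal sketch of what such a workflow might look like (the file path `.github/workflows/main.yml`, the Python versions and the `requirements.txt` install step are illustrative assumptions, not the course's exact workflow):

```yaml
# .github/workflows/main.yml (illustrative sketch)
name: CI
on: push
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Versions here are examples and will go out of date.
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install requirements
        run: python -m pip install -r requirements.txt
      - name: Run the test suite
        run: python -m pytest
```

The `strategy.matrix` entry is what makes the same test job run once per listed Python version, which is the feature the lesson uses to check the code on multiple versions at once.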
+
+_How do we know our tests—and code in general—will work on other people's machines?_
+
+
+
+- How do we know our tests—and code in general—will work on other people's machines?
+  - the main answer these days is to use Continuous Integration.
+- What is Continuous Integration?
+  - very loosely, it is an automated system that is triggered upon certain actions on your repository (like pushing or merging) and performs quality checks on your code (and nearly anything else you like too!)
+  - the key part is that this all happens on a remote "virtual" machine that is set up and torn down each time the tasks need to be performed, thus ensuring there are no idiosyncrasies that arise because of our particular development environment
+  - in our case, we will be setting up CI to run our tests on the remote service provided by GitHub called GitHub Actions
+
+
+
+### Breakout Exercise: 🖉 Continuous Integration with GitHub Actions
+
+Follow along from this section heading to the bottom of the page.
+
+
+
+- breakout rooms (for ~ 45 mins) from this section heading to the bottom of the page
+- status check: consider using a poll to see how many people have a green check on their repo
+- comments
+  - GitLab has very similar functionality and it is common for institutions to host their own GitLab instance internally. These instances will have their own documentation, and it is worthwhile to check if the RSE group or IT services have any guides to using these resources.
+  - Because the supported Python versions are constantly changing, the numbers above might be a little out of date, or inconsistent. 
+    - do not worry about this too much, but if you want to show the current supported Python versions, this site is very useful: https://devguide.python.org/versions/
+
+
+
+## ☕ 15 Minute Break ☕
+
+
+
+# Diagnosing Issues and Improving Robustness
+
+
+
+- while creating tests, you might already have encountered errors where it was not immediately obvious what was going on
+  - debugging offers a powerful technique for investigating these situations, and is useful more generally
+- there will also be some content about defensive programming
+
+
+
+### Breakout Exercise: 🖉 Setting the Scene (for Debugging)
+
+Follow along from this section heading to the bottom of the page.
+
+
+
+- split learners into breakout rooms (~50 mins, although likely less, so take a status check early) starting from this section heading and going to the end of the page
+  - if learners are using different editors, then encourage them to try and replicate the technique of debugging that is explained here
+- status check
+
+
+
+## 🕓 End of Section 2 🕓
+
+Please fill out the end-of-section survey!
+
diff --git a/section_3_software_dev_process.md b/section_3_software_dev_process.md
new file mode 100644
index 000000000..58ecb95bf
--- /dev/null
+++ b/section_3_software_dev_process.md
@@ -0,0 +1,889 @@
+---
+jupyter:
+  celltoolbar: Slideshow
+  jupytext:
+    notebook_metadata_filter: -kernelspec,-jupytext.text_representation.jupytext_version,rise,celltoolbar
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+  rise:
+    theme: solarized
+---
+
+
+# Section 3: Software Development as a Process
+
+
+
+
+
+
+- There is a lot bundled in here! Make it clear this will be a challenging section
+- We are going to step up a level and look at the overall process of developing software
+
+
+
+## Writing Code versus Engineering Software
+
+- Software is _not_ just a tool for answering a research question
+- Writing code is only concerned with the implementation of software
+- Software Engineering views software in a holistic manner
+  - Software has a _lifecycle_ ♻
+  - Software has stakeholders 👥
+  - Software is an asset with its own inherent value 💵
+  - Software can be reused 🔁
+
+
+
+- Software is _not_ just a tool for answering a research question
+  - Software is shared frequently between researchers and _reused_ after publication
+  - Therefore, we need to be concerned with more than just the implementation, i.e. "writing code"
+- Software Engineering views software in a holistic manner
+  - Software has a _lifecycle_: more on the next slide
+  - Software has stakeholders: it might just be you the researcher now, but invariably other people will be involved in using or developing the code eventually
+  - Software is an asset with its own inherent value: the algorithms it contains and what those can do, encoded knowledge of lessons learned along the way, etc.
+  - Software can be reused: like with stakeholders, it is hard to predict how the software will be used in the future, and we want to make it easy for reuse to happen
+
+
+
+## Software Development Lifecycle
+
+ +Cliffydcw, CC BY-SA 3.0, via Wikimedia Commons + + + +The typical stages of a software development process can be categorised as follows: + +- Requirements gathering (coming up next): the process of identifying and recording the exact requirements for a software project before it begins. This helps maintain a clear direction throughout development, and sets clear targets for what the software needs to do. +- Design (later in this section): where the requirements are translated into an overall design for the software. It covers what will be the basic software ‘components’ and how they’ll fit together, as well as the tools and technologies that will be used, which will together address the requirements identified in the first stage. Designs are quite dependent on what programming paradigm is used, something we will explore also in a later section. +- Implementation (throughout this course): the software is developed according to the design, implementing the solution that meets the requirements set out in the requirements gathering stage. +- Testing (done in section 2): the software is tested with the intent to discover and rectify any defects, and also to ensure that the software meets its defined requirements, i.e. does it actually do what it should do reliably? +- Deployment (not shown on this figure): where the software is deployed or in some way released, and used for its intended purpose within its intended environment. +- Maintenance/evolution: where updates are made to the software to ensure it remains fit for purpose, which typically involves fixing any further discovered issues and evolving it to meet new or changing requirements. + +The process of following these stages, particularly when undertaken in this order, is referred to as the waterfall model of software development. +Each stage’s outputs flow into the next stage sequentially. 
+As the cyclic nature of the image suggests, this linear process is not the only, nor necessarily the best,
+way to think about the SDLC.
+
+There is value in following some sort of process:
+
+- Stage gating: a quality gate at the end of each stage, where stakeholders review the stage’s outcomes to decide if that stage has completed successfully before proceeding to the next one, or if the next stage is not warranted at all. For example, it may be discovered during requirements collection, design, or implementation that development of the software isn’t practical or even required.
+- Predictability: each stage is given attention in a logical sequence; the next stage should not begin until prior stages have completed. Returning to a prior stage is possible and may be needed, but may prove expensive, particularly if an implementation has already been attempted. However, at least this is an explicit and planned action.
+- Transparency: essentially, each stage generates output(s) into subsequent stages, which presents opportunities for them to be published as part of an open development process.
+- It saves time: a well-known result from empirical software engineering studies is that it becomes exponentially more expensive to fix mistakes in future stages. For example, if a mistake takes 1 hour to fix in requirements, it may take 5 times that during design, and perhaps as much as 20 times that to fix if discovered during testing.
+
+
+
+
+## Software Requirements
+
+- How can we capture and organise what is required for software to function as intended?
+  - With software requirements of course!
+  - They are the linchpin of ensuring our software does what it is supposed to do
+- We will look at 3 types:
+  1. business requirements: the why
+  2. user requirements: the who and what
+  3. solution requirements: the how
+
+
+
+### Breakout: Reading and Exercises
+
+Read from the top of the "Software Requirements" page and do the exercises as you go. 
If you are using a shared document, you could have sections for each of the
+requirement types and get learners to write their suggestions in there.
+Afterwards, you could go through some of the suggestions and see whether there
+is agreement about whether they have been categorised correctly.
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+## Software Architecture and Design
+
+
+
+## Maintainable Code
+
+Software Architecture and Design is about writing *maintainable code*.
+
+ * Easy to read
+ * Testable
+ * Adaptable
+
+
+
+
+## Maintainable Code
+
+Maintainable code is vital as projects grow
+
+ * More people being involved
+ * Adding new features
+
+
+
+
+
+## Exercise:
+
+Try to come up with examples of code that has been hard to understand - why?
+
+Try to come up with examples of code that was easy to understand and modify - why?
+
+Time: 5min
+
+
+
+
+After 5 min, spend 5-15 min discussing examples the group has come up with
+
+
+
+
+## Cognitive Load
+
+For code to be readable, readers have to be able to understand what the code does.
+
+Cognitive load - the amount a reader has to remember at once
+
+There is a limit (and it is low!)
+
+
+
+
+
+## Cognitive Load
+
+Reduce cognitive load for a bit of code by:
+
+ * Good variable names: `toroidal_magnetic_field` much better than `btor`
+ * Simple control flow
+ * Functions doing one thing
+ * Good abstractions (next slide!)
+
+
+
+
+Good variable names - we no longer have punch-card restrictions, so use more descriptive names!
+
+Simple control flow - explain that this means avoiding lots of nested if statements or for loops
+
+
+
+
+## Abstractions
+
+An **abstraction** hides the details of one part of a system from another.
+
+
+
+
+Give some examples of abstractions, or maybe ask people to think of abstractions in the real world? 
Examples:
+
+- A brake pedal in a car: we don't need to know the exact mechanism by which the car slows down, so that implementation has been "abstracted" away from the car user
+- Similarly, a light switch is an abstraction: we don't need to know what happens with the wiring and flow of electricity in order to understand that one side means the light will be on and vice versa
+- human society is full of things like these...
+
+
+
+
+## Abstractions
+
+Help to make code easier to understand - as we do not have to understand all the details at once.
+
+Lowers the cognitive load for each part.
+
+
+
+
+
+## Refactoring
+
+**Refactoring** is modifying code, such that:
+
+ * external behaviour is unchanged,
+ * the code itself is easier to read / test / extend.
+
+
+
+
+
+## Refactoring
+
+Refactoring is vital for improving code quality.
+
+
+
+We are often working on existing software - refactoring is how we improve it
+
+
+
+
+## Refactoring Loop
+
+When making a change to a piece of software, do the following:
+
+* Automated tests verify current behaviour
+* Refactor the code (so the new change slots in cleanly)
+* Re-run tests to ensure nothing is broken
+* Make the desired change, which now fits in easily.
+
+
+
+
+
+## Refactoring
+
+In the rest of this section we will learn how to refactor an existing piece of code
+
+
+
+In the process of refactoring, we will try to target some of the "good practices" we just talked about, like making good abstractions and reducing cognitive load.
+
+
+
+
+## Refactoring Exercise
+
+Look at `inflammation/compute_data.py`
+
+
+
+Bring up the code
+
+Explain the feature:
+If the user adds `--full-data-analysis` then the program will scan the directory of one of the provided files, compare standard deviations across the data by day and plot a graph.
+
+The main body of it exists in inflammation/compute_data.py in a function called analyse_data.
+
+
+
+
+## Exercise: Why is this code not maintainable?
+
+How is this code hard to maintain? 
Maintainable code should be:
+
+ * Easy to read
+ * Easy to test
+ * Easy to extend or modify
+
+Time: 5min
+
+
+
+Solution:
+
+Hard to read: Everything is in a single function - reading it you have to understand how the file loading works at the same time as the analysis itself.
+
+Hard to modify: If you want to use the data without the graph, you’d have to change it
+
+Hard to modify or test: It is always analysing a fixed set of data stored on the disk
+
+Hard to modify: It doesn’t have any tests, meaning changes might break something
+
+
+
+
+## Key Points
+
+> "Good code is written so that it is readable, understandable, covered by automated tests, not over-complicated and does well what is intended to do."
+
+
+
+
+## ☕ 5 Minute Break ☕
+
+
+
+## Refactoring Functions to do Just One Thing
+
+
+
+## Introduction
+
+Functions that just do one thing are:
+
+* Easier to test
+* Easier to read
+* Easier to re-use
+
+
+
+
+We identified last episode that the code has a function that does much more than one thing
+
+Hard to understand - high cognitive load
+
+Hard to test as it mixes lots of different things together
+
+Hard to reuse as it was very fixed in its behaviour.
+
+
+
+## Test Before Refactoring
+
+* Write tests *before* refactoring to ensure we do not change behaviour.
+
+
+
+
+## Writing Tests for Code that is Hard to Test
+
+What can we do?
+
+* Test at a higher level, with coarser accuracy
+* Write "hacky" temporary tests
+
+
+
+
+Think of hacky tests like scaffolding - we will use them to ensure we can do the work safely,
+but we will remove them in the end.
+
+
+
+## Exercise: Write a Regression Test for Analyse Data Before Refactoring
+
+Add a new test file called `test_compute_data.py` in the tests folder. There is more information on the relevant web page.
+Complete the regression test to verify the current output of analyse_data is unchanged by the refactorings we are going to do. 
Time: 10min
+
+
+
+Hint: You might find it helpful to assert the results equal some made-up array, observe the test failing, and copy and paste the correct result into the test.
+
+When talking about the solution:
+
+ * We will have to remove it as we modified the code to get it working
+ * It is not a good test - it is not obvious that it is correct
+ * Brittle - changing the files will break the tests
+
+
+
+
+## Pure Functions
+
+A **pure function** takes in some inputs as parameters, and it produces a consistent output.
+
+That is, just like a mathematical function.
+
+The output does not depend on externalities.
+
+There will be no side effects from running the function.
+
+
+
+
+Externalities like what is in a database or the time of day
+
+Side effects like modifying a global variable or writing a file
+
+
+
+## Pure Functions
+
+Pure functions have a number of advantages for maintainable code:
+
+ * Easier to read as you do not need to know the calling context
+ * Easier to reuse as you do not need to worry about invisible dependencies
+
+
+
+
+## Refactor Code into a Pure Function
+
+Refactor the analyse_data function into a pure function with the logic, and an impure function that handles the input and output. The pure function should take in the data, and return the analysis results:
+
+```python
+def compute_standard_deviation_by_day(data):
+    # TODO
+    return daily_standard_deviation
+```
+
+Time: 10min
+
+
+
+
+## Testing Pure Functions
+
+Pure functions are also easier to test
+
+ * Easier to write as we can create the input exactly as we need it
+ * Easier to read as the tests do not need to read any external files
+ * Easier to maintain - tests will not need to change if the file format changes
+
+
+
+
+We can focus on making sure we cover all the edge cases without real-world considerations
+
+
+
+## Write Test Cases for the Pure Function
+
+Now we have refactored out a pure function, we can more easily write comprehensive tests. 
Add tests that check for when there is only one file with multiple rows, multiple files with one row, and any other cases you can think of that should be tested.
+
+Time: 10min
+
+
+
+
+## Functional Programming
+
+Pure functions are a concept from an approach to programming called **functional programming**.
+
+Python, and other languages, provide features that make it easier to write "functional" code:
+
+ * `map` / `filter` / `reduce` can be used to chain pure functions together into pipelines
+
+
+
+
+If there is time - do some live coding to show imperative code, then transform it into a pipeline:
+
+ * Take a sequence of numbers
+ * Remove all the odd numbers
+ * Square all the numbers
+ * Add them together
+
+
+```python
+# Imperative
+numbers = range(1, 100)
+total = 0
+for number in numbers:
+    if number % 2 == 0:
+        squared = number**2
+        total += squared
+
+
+# Functional
+def is_even(number):
+    return number % 2 == 0
+
+
+def squared(number):
+    return number**2
+
+
+total = sum(map(squared, filter(is_even, numbers)))
+```
+
+
+
+
+## ☕ 10 Minute Break ☕
+
+
+
+## Using Classes to Decouple Code
+
+
+
+### Decoupled Code
+
+When thinking about code, we tend to think of it in distinct parts or **units**.
+
+Two units are **decoupled** if changes in one can be made independently of the other
+
+
+
+
+E.g. we have the part that loads a file and the part that draws a graph
+
+Or the part that the user interacts with and the part that does the calculations
+
+
+
+### Decoupled Code
+
+Abstractions allow decoupling of code
+
+
+
+
+When we have a suitable abstraction, we do not need to worry about the inner workings of the other part.
+
+For example, the brakes of a car: the details of how to slow down are abstracted, so when we change how
+braking works, we do not need to retrain the driver.
+
+
+
+### Exercise: Decouple the File Loading from the Computation
+
+Currently the function is hard coded to load all the files in a directory. 
Decouple this into a separate function that returns all the files to load
+
+Time: 10min
+
+
+
+
+### Decoupled... but not completely
+
+Although we have separated out the data loading, there is still an assumption, and therefore coupling, in terms of the format of that data (in this case CSV).
+
+Is there a way we could make this more flexible?
+
+
+
+- The format in which the data is stored is a practical detail which we don't want to limit the use of our `analyse_data()` function
+- We could add an argument to our function to specify the format, but then we might have quite a long conditional list of all the different possible formats, and the user would need to request changes to `analyse_data()` any time they want to add a new format
+- Is there a way we can let the user more flexibly specify the way in which their data gets read?
+
+
+
+One way is with **classes**!
+
+
+
+### Python Classes
+
+A **class** is a Python feature that allows grouping methods (i.e. functions) with some data.
+
+
+
+
+Do some live coding, ending with:
+
+```python
+import math
+
+class Circle:
+    def __init__(self, radius):
+        self.radius = radius
+
+    def get_area(self):
+        return math.pi * self.radius * self.radius
+
+my_circle = Circle(10)
+print(my_circle.get_area())
+```
+
+
+
+
+### Exercise: Use a Class to Configure Loading
+
+Put the `load_inflammation_data` function we wrote in the last exercise as a member method of a new class called `CSVDataSource`.
+
+Put the configuration of where to load the files in the class' initialiser.
+
+Once this is done, you can construct this class outside the statistical analysis and pass the instance in to analyse_data.
+
+Time: 10min
+
+
+
+
+### Interfaces
+
+**Interfaces** describe how different parts of the code interact with each other.
+
+
+
+
+For example, the interface of the braking system in a car is the brake pedal.
+The user can push the pedal harder or softer to get more or less braking. 
The interface of our circle class is that the user can call get_area to get the 2D area of the circle
+as a number.
+
+
+
+### Interfaces
+
+Question: what is the interface for `CSVDataSource`?
+
+```python
+class CSVDataSource:
+    """
+    Loads all the inflammation CSVs within a specified folder.
+    """
+    def __init__(self, dir_path):
+        self.dir_path = dir_path
+
+    def load_inflammation_data(self):
+        data_file_paths = glob.glob(os.path.join(self.dir_path, 'inflammation*.csv'))
+        if len(data_file_paths) == 0:
+            raise ValueError(f"No inflammation CSVs found in path {self.dir_path}")
+        data = map(models.load_csv, data_file_paths)
+        return list(data)
+```
+
+
+
+
+Suggest discussing in groups for 1 min.
+
+Answer: the interface is the signature of the `load_inflammation_data()` method, i.e. what arguments it takes and what it returns.
+
+
+
+### Common Interfaces
+
+If we have two classes that share the same interface, we can use the interface without knowing which class we have
+
+
+
+
+Easiest shown with an example, so let's do more live coding (`Circle` is the class from earlier; thanks to duck typing, no common base class is needed):
+
+```python
+class Rectangle:
+    def __init__(self, width, height):
+        self.width = width
+        self.height = height
+    def get_area(self):
+        return self.width * self.height
+
+my_circle = Circle(radius=10)
+my_rectangle = Rectangle(width=5, height=3)
+my_shapes = [my_circle, my_rectangle]
+total_area = sum(shape.get_area() for shape in my_shapes)
+```
+
+
+
+
+### Polymorphism
+
+Using an interface to call different methods is a technique known as **polymorphism**.
+
+A form of abstraction - we have abstracted away what kind of shape we have.
+
+
+
+
+### Exercise: Introduce an alternative implementation of DataSource
+
+Polymorphism is very useful - suppose we want to read a JSON (JavaScript Object Notation) file.
+
+Write a class that has the same interface as `CSVDataSource` that
+loads from JSON.
+
+There is a function in `models.py` that loads from JSON. 
+

Time: 15min




Remind learners to check the course webpage for further details and some important hints.



### Mocks

Another use of polymorphism is **mocking** in tests.





Let's live code a mock shape:

```python
from unittest.mock import Mock

def test_sum_shapes():

    mock_shape1 = Mock()
    mock_shape1.get_area.return_value = 10

    mock_shape2 = Mock()
    mock_shape2.get_area.return_value = 13
    my_shapes = [mock_shape1, mock_shape2]
    total_area = sum(shape.get_area() for shape in my_shapes)

    assert total_area == 23
```

This test is easier to read because we do not need to understand how
`get_area()` might work for a real shape.

Focus on testing behaviour rather than implementation.




## Exercise: Test Using a Mock Implementation

Complete the exercise to write a mock data source for `analyse_data`.

Time: 15min





## Object Oriented Programming

These are techniques from **object oriented programming**.

There is a lot more that we will not go into:

* Inheritance
* Information hiding





## A note on Data Classes

Regardless of whether you are doing Object Oriented Programming or Functional Programming:

**Grouping data into logical classes is vital for writing maintainable code.**




## ☕ 10 Minute Break ☕



## Architecting Code to Separate Responsibilities



## Model-View-Controller

Reminder - this program is using the MVC Architecture:

* Model - Internal data of the program, and operations that can be performed on it
* View - How the data is presented to the user
* Controller - Responsible for how the user interacts with the system




### Breakout: Read and do the exercise

Read the section **Separating Out Responsibilities**.

Complete the exercise.

Time: 10min




Suggest discussing the answer to the exercise as a table. 
+Once time is up, ask one table to share their answer and any questions.
Then do the other exercise.



### Breakout Exercise: Split out the model code from the view code

Refactor `analyse_data` such that the view code we identified in the last exercise is removed from the function, so the function contains only model code, and the view code is moved elsewhere.

Time: 10min





## Programming Patterns

* MVC is a programming pattern
* Others exist - like the visitor pattern
* Useful for discussion and ideas - not a complete solution





Next slide if it feels like we have got loads of time.




### Breakout Exercise: Read about a random pattern on the website and share it with the group

Go to the website linked and pick a random pattern; see if you can understand what it is doing
and why you'd want to use it.

Time: 15min





## Architecting larger changes

* Use diagrams of boxes and lines to sketch out how code will be structured
* Useful for larger changes, new code, or even understanding complex projects





## Exercise: Design a high-level architecture

Sketch out a design for something you have come up with or for the current project.


Time: 10min





At the end of the time, share diagrams and discuss.





## Breakout: Read to end of page

Read to the end, including the exercise on real-world examples.

Time: 15min





At the end of the time, reconvene to discuss real-world examples as a group.





## Conclusion

Good software architecture and design is a **huge** topic. 
+

Practice makes perfect:

 * Spot signs things could be improved - like duplication
 * Think about why things are working or not working
 * Do not design for an imagined future
 * Keep refactoring as you go





## 🕓 End of Section 3 🕓

diff --git a/section_4_collaborative_soft_dev.md b/section_4_collaborative_soft_dev.md
new file mode 100644
index 000000000..92c8d8323
--- /dev/null
+++ b/section_4_collaborative_soft_dev.md
@@ -0,0 +1,394 @@
+---
+jupyter:
+  celltoolbar: Slideshow
+  jupytext:
+    notebook_metadata_filter: -kernelspec,-jupytext.text_representation.jupytext_version,rise,celltoolbar
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+  rise:
+    theme: solarized
+---
+
+
+# Section 4: Collaborative Software Development for Reuse
+
+
+
+ + + +- up until this point, the course has been primarily focussed on technical practices, tools, and infrastructure, and primarily from the perspective of a single developer/researcher, albeit within a team environment +- in this section, we are going to start broadening our attention to the collaborative side of software development + - there are primarily two practices that facilitate collaboration: code review and package release +- code review has many benefits, but top among them is that it provides a gate check on software quality, + - it is also a way to share knowledge within a team, improving the redundancy of that team (which is actually a good thing regardless of what corporate types might say!) + - getting another set of eyes on your code also means you are less likely to flout coding standards and convention + - there are many different types of code review, and we will explore the most common in this section +- the other collaborative practice is packaging our software for release + - it will be very difficult to collaborate if no one else is able to install our software + - we have used a very rudimentary technique for distributing our project up until now, and it has some key limitations + - to overcome these, we will look at a tool called Poetry and use it to help make our Python package more distributable + - we will also talk about some more general points around software maintainability and sustainability that should be done before distributing our software + + + +## Quick Check: Who Has All the Branches! 
+

Check if you have a branch named `remotes/origin/feature-std-dev` or `feature-std-dev` after running:

```bash
git branch --all
```

If not, please run these commands:

```bash
git remote add upstream git@github.com:ukaea-rse-training/python-intermediate-inflammation.git
git fetch upstream
git checkout upstream/feature-std-dev
git switch --create feature-std-dev
git push --set-upstream origin feature-std-dev
```



## Developing Software in a Team: Code Review

Two main ways to collaborate with git:

1. Fork and Pull Model
2. Shared Repository Model




TODO make a nice mermaid diagram for this

- In the absence of a nice diagram, draw something on the whiteboard for the above



### Code Review

> Code review (n.): a software quality assurance practice where one or several people from the team, different from the code’s author, check the software by viewing parts of its source code, making comments, and rejecting or approving those changes



- Up until now, we have been merging code into our main branches individually
  - This is generally not how things are done in teams
  - Instead, there is a gate check before anything gets merged into the main or develop branch of a repo



Lots of benefits:

1. 👥 Knowledge sharing: improve redundancy in the team.
2. 🧠 Explanation improves understanding and rationale. Better decisions are made.
3. ❌ Reduce errors in code. Between 60 and 90% of errors can be caught by rigorous code review (Fagan, 1979).
   - Errors caught earlier are 10 to 100 times less expensive or time-consuming to fix.



### Types of Code Review

1. Over-the-shoulder review
2. Pair programming
3. Formal code inspection
4. 
**Asynchronous, tool-assisted review ⬅️** + + + +- There are a variety of different code review techniques + - Briefly explain each +- We will be using **asynchronous, tool-assisted review** because it is currently the most common form in software development, especially with the rise of interfaces like GitHub and GitLab + + + +### Code Review Exercise Steps + +
+




TODO the source of the png is online from mermaid.ink editor. We should figure out a way to incorporate mermaid into these slides directly.

- Walk through the steps of the code review process



### Exercise: Raising a pull request for your fictional colleague

Go through the steps described under this heading. Stop when you reach **Reviewing a pull request**.



- Should be pretty quick, 5 minutes max.
- Status check then move on.



### Exercise: review some code

Pair up with someone else in your group and exchange repository links. You will be taking on the role of _Reviewer_ on your partner's repository. Before leaving review comments, read the content under the heading **Reviewing a pull request**. Try to make a comment from each of the main areas identified.

**Do not submit your review just yet!!!**



- 10-15 minutes
- Status check then move on.



### Exercise: review the code for suitable tests

Add a list of expected tests to the comment box after clicking `Finish your review` near the top right of the `Files changed` tab. Use the content under **Making sure code is valid** to come up with these tests, and think back to the requirement SR1.1.1.

When done, select `Request changes` from the list of toggles, then `Submit review`.



- 10-15 minutes
- Status check then move on.



### Exercise: responding and addressing comments

Respond to the _Reviewer's_ comments on the PR in _your_ repository. Use the information in **Responding to review comments** to guide your responses. And remember that you can talk to your _Reviewer_ for clarification; just make sure you record that in a comment on the PR.

Do not implement changes that will take more than 5 minutes. Instead, raise them as an issue on your repo for future work, and link to that issue in a comment on the PR.



- 10-15 minutes
  - tell learners not to worry too much about implementing all of the changes requested by reviewers. 
If it looks like a requested change will take longer than 5 minutes, open a new issue on your repository to address it in the future.
- Status check then move on.



### Making code easy to review

- 🤏 Keep the changes small.
- 1️⃣ Keep each commit as one logical change.
- 🪟 Provide a clear description of the change.
- 🕵️ Review your code yourself, before requesting a review.




### Empathy in review comments

* Identify positives in code as and when you find them
* Remember different does not mean better
* Only provide a few non-critical suggestions - you are aiming for better rather than perfect
* Ask questions to understand why something has been done a certain way rather than assuming you
  know a better way
* If a conversation is taking place on a review and hasn't been resolved by a
  single back-and-forth exchange, then schedule a conversation to discuss instead
  (recording the results of the discussion in the PR)



### Exercise: Code Review in Your Own Working Environment

Follow the instructions under this exercise heading. Read the content above the exercise to figure out what is involved in a code review process for a team. After about 5 minutes, have a small conversation in your group about what your code review process would look like.



## Preparing Software for Reuse and Release



- 🔁 We want our code to be somewhere on the "reusability" spectrum
- 📝 Documentation is an important part of our code being reusable



- We want our code to be somewhere on the "reusability" spectrum
  - but where exactly? 
this will depend on the maturity of your code and how widely it will be used (similar to testing)
  - at a minimum, we want to aim for reproducibility if we are publishing: someone else should be able to take our code and data and run it themselves and get the same result
  - however, for big library packages, we probably want to bump that up to reusable, where our code is easy to use, understand, and modify
- Documentation is an important part of our code being reusable
  - Even if you write incredibly expressive code, it will not be enough for someone new to start using and modifying your code base
  - How do they install it? Are there any development tools they need? What is the scientific context and limitations of the code?
  - We need to answer all of these questions and more if we want our code to be approachable and reusable

TODO would be nice to modify the image so that it better reflects the ACM definition of reproducibility/replicability



### Breakout: Start from the Top

Start from the top of this episode page and go to the end.



- Learners can skim the first two sections if you have talked about them in the previous slide
- Split into breakout rooms for about 50 minutes
- A preface note: if you have been using codimd or hackmd for the shared document, then learners will have already been exposed to Markdown, so this section will not contain much new for them
- Post episode comments
  - A README is a great place to start your documentation, but at some point it will outgrow that, and you will need a bigger documentation system. 
The most popular in Python is Sphinx, which can be used with Markdown or another markup language called ReStructuredText (`.rst` files) + - For writing documentation, this is another great link that can be added to the shared document: https://documentation.divio.com/ + - For licensing software, make some notes in the shared document about the policy of your institution + + + +## ☕ Break Time ☕ + + + +## Packaging Code for Release and Distribution + + + +### Why Package Our Software? + +- ⏬ It reduces "the complexity of fetching, installing, and integrating it [our code] for the end-users" +- 📦 Packaging combines the relevant source files and necessary metadata to achieve the above +- 😕 Confusing term, _package_ + - module _package_ : a directory containing Python files and an `__init__.py` file + - distributable _package_ : a way of structuring and bundling a Python project for easier distribution and installation + + + +### Packaging Our Software with Poetry + +- 📌 Pinning dependencies in a `requirements.txt` has some limitations +- 📜 Poetry is a tool that helps overcome some of these deficiencies + + + +- 📌 Pinning dependencies in a `requirements.txt` has some serious limitations + - It does not differentiate between production dependencies (i.e. what our package needs to be used in a standalone manner) and library dependencies (i.e. what our package needs to be used as part of another application, which itself has dependencies) + - It reduces the portability of our code across Python versions: some learners may have encountered this when setting up the CI matrices in GitHub Actions. e.g. Pinning dependencies at Python 3.10 could (and does!) cause issues if those same dependencies need to be installed in a Python 3.8 environment. + - It is prone to error: what if we forget to add a dependency to requirements.txt? 
We could happily use pip to install something into our environment and the code will work, but when someone tries to use it themselves, they will be missing a dependency and the code will error out. In other words, we have two disconnected steps we need to perform when installing a dependency. + - Distributing Python packages to popular repositories like PyPI requires more metadata than having a simple `requirements.txt` and we would need to manually create this +- 📜 Poetry is a tool that helps overcome some of these deficiencies + - It separates production and library dependencies between `poetry.lock` and `pyproject.toml` + - It provides a unified interface for adding dependencies to our project so that this is immediately recorded upon installation + - It partially automates the process of creating a distributable package + + + +### Installing Poetry + +| ⚠️ Warning ⚠️ | +|:--------------| +| The documentation for Poetry explicitly discourages installing Poetry into your current virtual environment. Therefore, please use the installation instructions from their website. | + +Since we are all on Linux, it should roughly be: + +```bash +curl -sSL https://install.python-poetry.org | python3 - +ls $HOME/.local/bin # make sure poetry executable is listed there +which poetry # if no output, then poetry not in your path +poetry --version # check we have access to the poetry executable +``` + + + +- Important warning: Poetry explicitly recommends that you shouldn't install Poetry within the virtual environment of a specific project. Rather, it should have its own isolated environment, which the official download script or `pipx` ensures. This is in direct contradiction to what the course material currently recommends. 
 - So, unless it really is not possible, encourage learners to follow the link to Poetry's install website and follow instructions there
- Give learners about 5 mins to complete this and status check at the end



### Setting up Our Poetry Config

- The current way of sharing our package is:
  ```bash
  git clone
  python -m venv venv
  . venv/bin/activate
  pip install -r requirements.txt
  python inflammation-analysis.py
  ...
  ```
  - and then there are a bunch of hoops to jump through to make sure import statements work when testing
- What if someone wants to just `pip install` our package?
- Poetry helps us with this



- The current way of sharing our package is... clunky!!!
  - and then there are a bunch of hoops to jump through to make sure import statements work when testing
- What if someone wants to just `pip install` our package?
- Poetry helps us with this
  - we need to define some metadata for our project so that poetry can properly install it in a Python environment
  - this is done in a `pyproject.toml` file that `poetry` can help us generate pretty quickly
  - send learners off to do this for about 5 minutes



### Project Dependencies

We will look at two types of dependencies:

1. Runtime dependencies
2. Development dependencies



Runtime dependencies can be further subdivided:

1. _Pinned_ runtime dependencies when our package is used as a standalone application
2. _Looser_ runtime dependencies when our package is used as a library



### Exercise: Project Dependencies

Commit your initial `pyproject.toml` into git. Then, run the commands:

```bash
poetry add matplotlib numpy
poetry add --group dev pylint
poetry install
```

Inspect how `pyproject.toml` has changed. Look at what has gone into `poetry.lock`. 
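For reference, after running the commands above the dependency sections of `pyproject.toml` should look roughly like this — the exact version specifiers below are illustrative assumptions and will differ on your machine:

```toml
[tool.poetry.dependencies]
python = "^3.11"
matplotlib = "^3.8.0"
numpy = "^1.26.0"

[tool.poetry.group.dev.dependencies]
pylint = "^3.0.0"
```

`poetry.lock`, by contrast, records the exact resolved versions of these dependencies and all of their transitive dependencies.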
+



- Give learners a few minutes to do this
  - Then, quickly run through it and look at the changes in your own repo
- Explain when `poetry.lock` is used: if it is present in a repo when `poetry install` is called, the _exact_ versions of dependencies in `poetry.lock` will be used. Again, useful if we are distributing a standalone application.
  - Do not check `poetry.lock` into version control if you want your package to be used as a library
- Note that the last command is quite important because it puts the current package we are developing into our environment
  - This means that some of those annoying `ModuleNotFoundError` errors will be eliminated
  - Generally, we want to install the package we are working on in our environment



### Packaging Our Code

Now that we have a `pyproject.toml` file, building a distributable package is as easy as:

```bash
poetry build
```

🤯🤯🤯



- Live code what happens when you run `poetry build`
- Look at the two files in the `dist/` folder
  - The `.whl` one is what we are most interested in
  - We can send this single file directly to someone and they can `pip install` it!
  - Demo this quickly by creating a new venv, installing the `.whl` file, then running a Python interpreter and proving that we have access to our package, e.g. 
`from inflammation.models import daily_mean`
  - Then, show how we can more easily share this file by attaching it to a GitHub release (use the web UI)
- There are other ways to distribute your packages, notably to PyPI (which is where pip grabs packages from by default), but we will leave that for participants to figure out from the content at the bottom of the lesson
- Look at specific package registries for your institution if you have time



## 🕓 End of Section 4 🕓

☕ Break time ☕

diff --git a/section_5_managing_software.md b/section_5_managing_software.md
new file mode 100644
index 000000000..46dd3bd44
--- /dev/null
+++ b/section_5_managing_software.md
@@ -0,0 +1,62 @@
+---
+jupyter:
+  celltoolbar: Slideshow
+  jupytext:
+    notebook_metadata_filter: -kernelspec,-jupytext.text_representation.jupytext_version,rise,celltoolbar
+    text_representation:
+      extension: .md
+      format_name: markdown
+      format_version: '1.3'
+  rise:
+    theme: solarized
+---
+
+
+# Section 5: Managing and Improving Software Over Its Lifetime
+
+
+
+



- In this section of the course we look at managing the development and evolution of software - how to keep track of the tasks the team has to do, how to improve the quality and reusability of our software for others as well as ourselves, and how to assess other people’s software for reuse within our project.
- We are therefore moving into the realm of software management, not just software development; do not be scared off!
  - We all need to do a bit of project management from time to time



## Assessing Software for Suitability and Improvement

### Breakout: Start from the Top

Start reading from the top of this episode page all the way to the end. Complete exercises as you go.

You will need to synchronise as a group at the **🖉 Decide on your Group’s Repository!** exercise. Please use a sticky/reaction to indicate when you have reached this exercise.

For the next exercise, you will then need to wait for the other group you are assessing to fill in their repo URL.



- Set learners into breakout rooms for 45 minutes with the instructions on the slide



## ☕ Break Time ☕



## Improvement Through Feedback

### Breakout: Start from the Top

Start reading from the top of this episode page all the way to the end. Complete exercises as a group.

There is a separate shared document specific to your group linked from the original shared document. This will give your group an uncrowded space to handle the issues that another group has submitted on your repo.



## 🕓 End of Section 5 🕓

Please fill out the end-of-course survey! 
+ diff --git a/setup.md b/setup.md new file mode 100644 index 000000000..8f8858d1d --- /dev/null +++ b/setup.md @@ -0,0 +1,16 @@ +--- +title: Setup +--- + +## Setup + +You will need the following software and accounts setup to be able to follow the course: + +- Command line tool (such as Bash, Zsh or Git Bash) +- Git version control program +- GitHub account +- Python 3 distribution +- PyCharm integrated development environment (IDE) + +Please follow the [installation instructions](installation-instructions.md) to install the above tools and +set up for the course. diff --git a/software-architecture-extra.md b/software-architecture-extra.md new file mode 100644 index 000000000..1a5d9c82c --- /dev/null +++ b/software-architecture-extra.md @@ -0,0 +1,179 @@ +--- +title: "Extra Content: Software Architecture" +teaching: 15 +exercises: 0 +--- + +::: questions +- What should we consider when designing software? +::: + +::: objectives +- Understand the components of multi-layer software architectures. +::: + +**Software architecture** provides an answer to the question +"what components will the software have and how will they cooperate?". +Software engineering borrowed this term, and a few other terms, +from architects (of buildings) as many of the processes and techniques have some similarities. +One of the other important terms we borrowed is 'pattern', +such as in **design patterns** and **architecture patterns**. +This term is often attributed to the book +['A Pattern Language' by Christopher Alexander *et al.*](https://en.wikipedia.org/wiki/A_Pattern_Language) +published in 1977 +and refers to a template solution to a problem commonly encountered when building a system. + +Design patterns are relatively small-scale templates +which we can use to solve problems which affect a small part of our software. 
+For example, the **[adapter pattern](https://en.wikipedia.org/wiki/Adapter_pattern)**
(which allows a class that does not have the "right interface" to be reused)
may be useful if part of our software needs to consume data
from a number of different external data sources.
Using this pattern,
we can create a component whose responsibility is
transforming the calls for data to the expected format,
so the rest of our program does not have to worry about it.

Architecture patterns are similar,
but larger scale templates which operate at the level of whole programs,
or collections of programs.
Model-View-Controller (which we chose for our project) is one of the best known architecture
patterns.
Many patterns rely on concepts from [Object Oriented Programming](../learners/object-oriented-programming.md).

There are many online sources of information about design and architecture patterns,
often giving concrete examples of cases where they may be useful.
One particularly good source is [Refactoring Guru](https://refactoring.guru/design-patterns). 
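To make the adapter idea above concrete, here is a minimal sketch; all class and method names are hypothetical, invented purely for illustration:

```python
class ThirdPartyXMLReader:
    """Stands in for an external class we cannot change,
    whose interface does not match the rest of our program."""
    def fetch_records_from_xml(self):
        return [{"id": 1, "value": 5}, {"id": 2, "value": 7}]

class XMLReaderAdapter:
    """Adapts ThirdPartyXMLReader to the load_data() interface
    the rest of our program expects."""
    def __init__(self, reader):
        self.reader = reader

    def load_data(self):
        # Transform the foreign call and data format into the one our program uses
        records = self.reader.fetch_records_from_xml()
        return [(record["id"], record["value"]) for record in records]

adapted = XMLReaderAdapter(ThirdPartyXMLReader())
```

The rest of the program only ever calls `load_data()`, so supporting another external data source means writing one more small adapter rather than changing the consuming code.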
+

There are various software architectures around, defining different ways of
dividing the code into smaller modules with well defined roles, for example:

- [Model–View–Controller (MVC) architecture](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller),
  which separates the code into three distinct components - **Model** represents the data and
  contains operations/rules for manipulating and changing it; **View** is responsible for
  displaying data to users; **Controller** accepts input from the View and performs the
  corresponding action on the Model and then updates the View accordingly,
- [Service-oriented architecture (SOA)](https://en.wikipedia.org/wiki/Service-oriented_architecture),
  which separates code into distinct services,
  accessible over a network by consumers (users or other services)
  that communicate with each other by passing data in a well-defined, shared format (protocol),
- [Client-server architecture](https://en.wikipedia.org/wiki/Client%E2%80%93server_model),
  where clients request content or service from a server,
  initiating communication sessions with servers,
  which await incoming requests (e.g. email, network printing, the Internet),
- [Multilayer architecture](https://en.wikipedia.org/wiki/Multitier_architecture),
  in which presentation,
  application processing
  and data management functions
  are split into distinct layers and may even be physically separated to run on separate machines.

### Multilayer Architecture

One common architectural pattern for larger software projects is **Multilayer Architecture**.
Software designed using this architecture pattern is split into layers,
each of which is responsible for a different part of the process of manipulating data. 
+ +Often, the software is split into three layers: + +- **Presentation Layer** + - This layer is responsible for managing the interaction between + our software and the people using it + - May include the **View** components if also using the MVC pattern +- **Application Layer / Business Logic Layer** + - This layer performs most of the data processing required by the presentation layer + - Likely to include the **Controller** components if also using an MVC pattern + - May also include the **Model** components +- **Persistence Layer / Data Access Layer** + - This layer handles data storage and provides data to the rest of the system + - May include the **Model** components of an MVC pattern + if they are not in the application layer + +Although we have drawn similarities here between the layers of a system and the components of MVC, +they are actually solutions to different scales of problem. +In a small application, a multilayer architecture is unlikely to be necessary, +whereas in a very large application, +the MVC pattern may be used just within the presentation layer, +to handle getting data to and from the people using the software. + +### Model-View-Controller (MVC) Architecture + +MVC architecture can be applied in scientific applications in the following manner. +Model comprises those parts of the application that deal with +some type of scientific processing or manipulation of the data, +e.g. numerical algorithm, simulation, DNA. +View is a visualisation, or format, of the output, +e.g. graphical plot, diagram, chart, data table, file. +Controller is the part that ties the scientific processing and output parts together, +mediating input and passing it to the model or view, +e.g. command line options, mouse clicks, input files. +For example, the diagram below depicts the use of MVC architecture for the +[DNA Guide Graphical User Interface application](https://www.software.ac.uk/developing-scientific-applications-using-model-view-controller-approach). 
+

![](fig/mvc-DNA-guide-GUI.png){alt='MVC example of a DNA Guide Graphical User Interface application' .image-with-shadow width="400px" }

::::::::::::::::::::::::::::::::::::::: challenge

## Exercise: MVC Application Examples From your Work

Think of some other examples from your work or life
where MVC architecture may be suitable
or have a discussion with your fellow learners.

::::::::::::::: solution

## Solution

MVC architecture is a popular choice when designing web and mobile applications.
Users interact with a web/mobile application by sending various requests to it.
Forms to collect users' inputs/requests
together with the info returned and displayed to the user as a result represent the View.
Requests are processed by the Controller,
which interacts with the Model to retrieve or update the underlying data.
For example, a user may request to view their profile.
The Controller retrieves the account information for the user from the Model
and passes it to the View for rendering.
The user may further interact with the application
by asking it to update their personal information.
The Controller verifies the correctness of the information
(e.g. the password satisfies certain criteria,
postal address and phone number are in the correct format, etc.)
and passes it to the Model for permanent storage.
The View is then updated accordingly and the user sees their updated profile details.

Note that not everything fits into the MVC architecture
but it is still good to think about how things could be split into smaller units.
For a few more examples, have a look at this short
[article on MVC from Codecademy](https://www.codecademy.com/articles/mvc). 
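The profile-viewing flow described above can be sketched, very schematically, in a few lines of Python; all names here are hypothetical and the "view" just returns a string rather than rendering a page:

```python
class Model:
    """Holds the underlying data and operations on it."""
    def __init__(self):
        self._profiles = {"alice": {"email": "alice@example.com"}}

    def get_profile(self, user):
        return self._profiles[user]

def view(profile):
    """Formats data for display to the user."""
    return f"Email: {profile['email']}"

def controller(model, user):
    """Mediates between the user's request, the Model and the View."""
    return view(model.get_profile(user))

print(controller(Model(), "alice"))  # Email: alice@example.com
```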
+


:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::: callout

## Separation of Concerns

Separation of concerns is important when designing software architectures
in order to reduce the code's complexity.
Note, however, there are limits to everything -
and MVC architecture is no exception.
The Controller often transcends into the Model and View
and a clear separation is sometimes difficult to maintain.
For example, the Command Line Interface provides both the View
(what the user sees and how they interact with the command line)
and the Controller (invoking of a command) aspects of a CLI application.
In Web applications, the Controller often manipulates the data (received from the Model)
before displaying it to the user or passing it from the user to the Model.

::::::::::::::::::::::::::::::::::::::::::::::::::




::: keypoints
- Software architecture provides an answer to the question 'what components
  will the software have and how will they cooperate?'.
:::

diff --git a/vscode.md b/vscode.md
new file mode 100644
index 000000000..4f0a1a1fc
--- /dev/null
+++ b/vscode.md
@@ -0,0 +1,213 @@
+---
+title: "Extra Content: Using Microsoft Visual Studio Code"
+---

::: objectives
- Use VS Code as an IDE of choice instead of PyCharm
:::

::: questions
- How do we set up VS Code as our IDE of choice for this course?
:::

[Visual Studio Code (VS Code)](https://code.visualstudio.com/), not to be confused with [Visual Studio](https://visualstudio.microsoft.com/),
is an Integrated Development Environment (IDE) by Microsoft. You can use it as your IDE for this course
instead of PyCharm - below are some instructions to help you set up.

## Installation

You can download VS Code from the [VS Code project website](https://code.visualstudio.com/download). 
+
+### Extensions
+
+VS Code can be used to develop code in many programming languages, provided the appropriate extensions have been installed.
+For this course we will require the extensions for Python. To install extensions, click the icon highlighted below
+in the VS Code sidebar:
+
+![](fig/vs-code-extensions.png){alt='VS Code application window with the Extensions button highlighted' .image-with-shadow width="800px" }
+
+In the search box, type "python" and select the Python extension by Microsoft (which provides IntelliSense support),
+then click the "Install" button to install the extension.
+You may be asked to reload the VS Code IDE for the changes to take effect.
+
+![](fig/vs-code-python-extension.png){alt='VS Code application with the list of extensions found by search term "python"' .image-with-shadow width="800px" }
+
+### Using VS Code with Windows Subsystem for Linux
+
+If you are developing software on Windows,
+and particularly software that comes from or targets Unix or Linux systems,
+it can be advantageous to use [WSL (Windows Subsystem for Linux)][wsl].
+Although this course does not explicitly support WSL,
+we will provide some guidance here on how best to link up WSL with VS Code (if that is your use case).
+In your WSL terminal, navigate to the project folder for this course and execute the command:
+
+```bash
+code .
+```
+
+This should launch VS Code in a way that ensures it performs most operations within WSL.
+To do this, the [WSL - Remote extension][vscode-wsl-extension]
+for VS Code should automatically be installed.
+If this does not happen, please install the extension manually.
+You can also launch WSL sessions from VS Code itself using the
+[instructions on the extension page][vscode-wsl-extension-launch-options].
+
+## Using the VS Code IDE
+
+Let us open our software project in VS Code and familiarise ourselves with some commonly used features needed for this course.
+
+### Opening a Software Project
+
+Select `File` > `Open Folder` from the top-level menu and navigate to the directory where you saved the
+[`python-intermediate-inflammation` project](../episodes/11-software-project.md#downloading-our-software-project),
+which we are using in this course.
+
+### Configuring a Virtual Environment in VS Code
+
+As in the episode on
+[virtual environments for software development](../episodes/12-virtual-environments.md),
+we want to create a virtual environment for our project to work in (unless you have already done so earlier in the course).
+From the top menu, select `Terminal` > `New Terminal` to open a new terminal (command line) session within the project directory,
+and run the following command to create a new environment:
+
+```bash
+python3 -m venv venv
+```
+
+This will create a new folder called `venv` within your project root.
+VS Code will notice the new environment and ask if you want to use it as the default Python interpreter for this project -
+click "Yes".
+
+![](fig/use_env.png){alt='VS Code popup window asking which Python interpreter to use for the current project'}
+
+***
+
+#### Troubleshooting Setting the Interpreter
+
+If the prompt did not appear, you can set the interpreter manually.
+
+1. Navigate to the location of the `python` binary within the virtual environment
+   using the file browser sidebar (see below). The binary will be located at `venv/bin/python` within the project directory.
+2. Right-click on the binary and select `Copy Path`.
+3. Use the keyboard shortcut `CTRL-SHIFT-P` to bring up the command palette, then search for `Python: Select Interpreter`.
+4. Click `Enter interpreter path...`, paste the path you copied, then press Enter.
+
+***
+
+You can verify the setup has worked correctly by selecting an existing Python script in the project folder (or creating a blank
+new one, if you do not have one, by right-clicking on the file explorer sidebar, selecting `New File` and creating a new file
+with the extension `.py`).
+
+If everything is set up correctly, when you select a Python file in the file explorer you should see
+the interpreter and virtual environment stated in the information bar at the bottom of VS Code, e.g.,
+something similar to the following:
+
+![](fig/vs-code-virtual-env-indicator.png){alt='VS Code bottom bar indicator of the virtual environment'}
+
+Any terminal you now open will start with the virtual environment activated.
+
+### Adding Dependencies
+
+For this course you will need to install `pytest`, `numpy` and `matplotlib`. Start a new terminal and run the
+following:
+
+```bash
+python3 -m pip install numpy matplotlib pytest
+```
+
+***
+
+#### Troubleshooting Dependencies
+
+If you are having issues with `pip`, it may be that the version of `pip` you have is too old.
+Pip will usually inform you via a warning if a newer version is available.
+You can upgrade `pip` by running the following from the terminal:
+
+```bash
+python3 -m pip install --upgrade pip
+```
+
+You can now try to install the packages again.
+
+***
+
+## Running Python Scripts in VS Code
+
+To run a Python script in VS Code, open the script by clicking on it,
+and then either click the Play icon in the top right corner,
+or use the keyboard shortcut `CTRL-ALT-N`.
+
+![](fig/vs-code-run-script.png){alt='VS Code application window with highlighted Run button' .image-with-shadow width="800px" }
+
+## Adding a Linter in VS Code
+
+In [the episode on coding style](../episodes/15-coding-conventions.md)
+and [the subsequent episode on linters](../episodes/16-verifying-code-style-linters.md),
+you are asked to use an automatic feature in PyCharm
+that picks up linting issues with the source code.
+Because it is language-agnostic, VS Code does not have a Python linter built into it.
+Instead, you will need to install an extension to get linting hints.
+Get to the "Extensions" side pane by one of these actions:
+
+1. Bring up the command palette with `CTRL-SHIFT-P` and search for `View: Show Extensions`
+2. Use the direct keyboard shortcut `CTRL-SHIFT-X`
+3. Click on the ["Extensions" icon](.#extensions) on the left side panel we used previously.
+
+In the Extensions panel, type "pylint" into the search bar. Select Pylint from the results panel
+that comes up, then click the `Install` button:
+
+![](fig/vs-code-install-linter-extension.png){alt='VS Code Extensions Panel showing searching for pylint extension' .image-with-shadow width="800px" }
+
+Once installed, Pylint warnings about your code should automatically populate the "Problems" panel
+at the bottom of the VS Code window, as shown below. You can also bring up the "Problems" panel using the keyboard shortcut `CTRL-SHIFT-M`.
+
+![](fig/vs-code-linter-problems-pane-annotated.png){alt='VS Code Problems Panel' .image-with-shadow width="800px" }
+
+There are other Python linters available, such as [Flake8](https://flake8.pycqa.org/en/latest/),
+and Python code formatters, such as [Black](https://pypi.org/project/black/).
+All are available as extensions that can be installed in a similar manner from the "Extensions" panel.
+
+We also recommend that you install these linters and formatters in your virtual environment,
+since then you will be able to run them from the terminal as well.
+For example, if you want the `pylint` and `black` packages, execute the following from the terminal:
+
+```bash
+$ python3 -m pip install pylint black
+```
+
+They will now both be available to run as command line applications,
+and you will find the details of how to run `pylint` in the lesson material (`black` is not covered).
+
+## Running Tests
+
+VS Code also allows you to run tests from a dedicated test viewer.
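To see what the test viewer will discover, here is a minimal sketch of a pytest-style test file, saved under a hypothetical name such as `tests/test_example.py` (the project's real test suite already lives in `tests/`; this throwaway file is only for checking that discovery works):

```python
# A throwaway example test file (hypothetical name: tests/test_example.py),
# used only to confirm that pytest discovery works in VS Code.
# pytest collects any function whose name starts with "test_".

def add(left, right):
    """A trivial function to exercise the test runner."""
    return left + right


def test_add():
    assert add(2, 3) == 5


def test_add_negative():
    assert add(-1, 1) == 0
```

Once saved, each `test_` function should appear as a separately runnable entry in the test browser; you can delete the file afterwards.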
+Clicking the "laboratory flask" button in the sidebar allows you to set up test exploration:
+
+![](fig/vs-code-test-explorer.png){alt='VS Code application window for setting up test framework' .image-with-shadow width="800px" }
+
+Click `Configure Python Tests`,
+select `pytest` as the test framework,
+and the `tests` directory as the directory to search.
+
+You should now be able to run tests individually
+using the test browser (available from the top-level menu `View` > `Testing`) by selecting the test of interest.
+
+![](fig/vs-code-run-test.png){alt='VS Code application window for running tests' .image-with-shadow width="800px" }
+
+### Running Code in Debug Mode
+
+When clicking on a test you will see two icons:
+the ordinary Run/Play icon, and a Run/Play icon with a bug.
+The latter allows you to run the tests in debug mode,
+which is useful for obtaining further information on why a failure has occurred - this will be covered in the main lesson material.
+
+::: keypoints
+- It is possible to switch to using VS Code for this course with a few tweaks
+:::
+
+[wsl]: https://learn.microsoft.com/en-us/windows/wsl/about
+[vscode-wsl-extension]: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-wsl
+[vscode-wsl-extension-launch-options]: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-wsl#commands