diff --git a/CHANGELOG.md b/CHANGELOG.md
index 202b562..e5c027b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -21,7 +21,19 @@
## MAJOR CHANGES
-* Improved `metrics/emd_per_samples` component (PR #16)
+* Updated file schema (PR #18):
+ * Added is_control obs to indicate whether a cell should be used as a control when correcting batch effects.
+ * Removed donor_id obs from unintegrated censored.
+ * Removed to_correct var from everything except common_dataset.
+ All datasets now contain only markers that need to be corrected.
+
+* Updated the file schema again (PR #19):
+ * Included changes in PR #21: the data processor component partitions cells between unintegrated (censored)
+ and validation.
+ * Added back to_correct var to every file except integrated to better reflect the real-world
+ batch correction workflow.
+ * Partially reverted PR #18, retaining only the first two changes (add is_control and remove
+ donor_id from unintegrated_censored).
## MINOR CHANGES
@@ -29,5 +41,7 @@
* Added integrated test resource (PR #5).
+* Updated file description in yaml file (PR #15).
+
## BUGFIXES
diff --git a/README.md b/README.md
index 019b536..a584f77 100644
--- a/README.md
+++ b/README.md
@@ -73,7 +73,7 @@ Format:
AnnData object
- obs: 'cell_type', 'batch', 'sample', 'donor', 'group'
+ obs: 'cell_type', 'batch', 'sample', 'donor', 'group', 'is_control', 'is_validation'
var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
layers: 'preprocessed'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
@@ -91,6 +91,8 @@ Data structure:
| `obs["sample"]` | `string` | Sample ID. |
| `obs["donor"]` | `string` | Donor ID. |
| `obs["group"]` | `string` | Biological group of the donor. |
+| `obs["is_control"]` | `integer` | Whether the sample the cell came from can be used as a control for batch effect correction. 0: cannot be used as a control. \>= 1: can be used as a control. For cells with \>= 1: cells with the same value come from the same donor. Different values indicate different donors. |
+| `obs["is_validation"]` | `boolean` | Whether the cell will be used as validation data or not. If FALSE, then the cell will only be included in unintegrated and unintegrated_censored. If TRUE, then the cell will only be included in validation. |
| `var["numeric_id"]` | `integer` | Numeric ID associated with each marker. |
| `var["channel"]` | `string` | The channel / detector of the instrument. |
| `var["marker"]` | `string` | (*Optional*) The marker name associated with the channel. |
@@ -118,25 +120,30 @@ Arguments:
| Name | Type | Description |
|:---|:---|:---|
| `--input` | `file` | A subset of the common dataset. |
-| `--output_unintegrated_censored` | `file` | (*Output*) Unintegrated dataset. |
-| `--output_unintegrated` | `file` | (*Output*) Unintegrated dataset. |
+| `--output_unintegrated_censored` | `file` | (*Output*) An unintegrated dataset with certain columns (cell metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias. The batch correction algorithm should not have to rely on this information to properly integrate different batches. This dataset is used as the input for the batch correction algorithm. The cells therein are identical to those in the unintegrated dataset. |
+| `--output_unintegrated` | `file` | (*Output*) The complete unintegrated dataset, including all cells’ metadata (columns) from the unintegrated_censored dataset. The cells in this dataset are the same as those in the unintegrated_censored dataset. |
| `--output_validation` | `file` | (*Output*) Hold-out dataset for validation. |
## File format: Unintegrated Censored
-Unintegrated dataset
+An unintegrated dataset with certain columns (cell metadata), such as
+the donor information, hidden. These columns are intentionally hidden to
+prevent bias. The batch correction algorithm should not have to rely on
+this information to properly integrate different batches. This dataset
+is used as the input for the batch correction algorithm. The cells
+therein are identical to those in the unintegrated dataset.
Example file:
-`resources_test/task_cyto_batch_integration/cxg_mouse_pancreas_atlas/train.h5ad`
+`resources_test/task_cyto_batch_integration/starter_file/unintegrated_censored.h5ad`
Format:
AnnData object
- obs: 'batch', 'sample', 'donor'
+ obs: 'batch', 'sample', 'is_control'
var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
layers: 'preprocessed'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
@@ -151,7 +158,7 @@ Data structure:
|:---|:---|:---|
| `obs["batch"]` | `string` | Batch information. |
| `obs["sample"]` | `string` | Sample ID. |
-| `obs["donor"]` | `string` | (*Optional*) Donor ID. |
+| `obs["is_control"]` | `integer` | Whether the sample the cell came from can be used as a control for batch effect correction. 0: cannot be used as a control. \>= 1: can be used as a control. For cells with \>= 1: cells with the same value come from the same donor. Different values indicate different donors. |
| `var["numeric_id"]` | `integer` | Numeric ID associated with each marker. |
| `var["channel"]` | `string` | The channel / detector of the instrument. |
| `var["marker"]` | `string` | (*Optional*) The marker name associated with the channel. |
@@ -170,17 +177,19 @@ Data structure:
## File format: Unintegrated
-Unintegrated dataset
+The complete unintegrated dataset, including all cells’ metadata
+(columns) from the unintegrated_censored dataset. The cells in this
+dataset are the same as those in the unintegrated_censored dataset.
Example file:
-`resources_test/task_cyto_batch_integration/cxg_mouse_pancreas_atlas/train.h5ad`
+`resources_test/task_cyto_batch_integration/starter_file/unintegrated.h5ad`
Format:
AnnData object
- obs: 'cell_type', 'batch', 'sample', 'donor', 'group'
+ obs: 'cell_type', 'batch', 'sample', 'donor', 'group', 'is_control'
var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
layers: 'preprocessed'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
@@ -198,6 +207,7 @@ Data structure:
| `obs["sample"]` | `string` | Sample ID. |
| `obs["donor"]` | `string` | Donor ID. |
| `obs["group"]` | `string` | Biological group of the donor. |
+| `obs["is_control"]` | `integer` | Whether the sample the cell came from can be used as a control for batch effect correction. 0: cannot be used as a control. \>= 1: can be used as a control. For cells with \>= 1: cells with the same value come from the same donor. Different values indicate different donors. |
| `var["numeric_id"]` | `integer` | Numeric ID associated with each marker. |
| `var["channel"]` | `string` | The channel / detector of the instrument. |
| `var["marker"]` | `string` | (*Optional*) The marker name associated with the channel. |
@@ -219,21 +229,29 @@ Data structure:
Hold-out dataset for validation.
Example file:
-`resources_test/task_cyto_batch_integration/cxg_mouse_pancreas_atlas/solution.h5ad`
+`resources_test/task_cyto_batch_integration/starter_file/validation.h5ad`
Description:
-Samples that were held out and will later be used only to assess whether
-the batch integration was successful. E.g. if a donor from batch 2 was
-corrected towards batch 1, but also actually measured in batch 1
-(without being used as input to the algorithm).
+Dataset containing cells from samples that were held out for evaluating
+batch integration output. The cells that are in this dataset belong to
+samples which are not included in the unintegrated or
+unintegrated_censored datasets. For example, if samples from donor A are
+present in batch 1 and 2, the sample from batch 1 may be used as input
+for the batch correction algorithm (and thus present in unintegrated and
+unintegrated_censored datasets). The sample from batch 2 may not be
+included as an input for the batch correction algorithm, but is needed
+to validate whether the algorithm managed to correct the batch
+effect in batch 2 towards batch 1. This sample will then be included in
+this dataset (but not in unintegrated and unintegrated_censored
+datasets).
Format:
AnnData object
- obs: 'cell_type', 'batch', 'sample', 'donor', 'group'
+ obs: 'cell_type', 'batch', 'sample', 'donor', 'group', 'is_control'
var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
layers: 'preprocessed'
uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'
@@ -251,6 +269,7 @@ Data structure:
| `obs["sample"]` | `string` | Sample ID. |
| `obs["donor"]` | `string` | Donor ID. |
| `obs["group"]` | `string` | Biological group of the donor. |
+| `obs["is_control"]` | `integer` | Whether the sample the cell came from can be used as a control for batch effect correction. 0: cannot be used as a control. \>= 1: can be used as a control. For cells with \>= 1: cells with the same value come from the same donor. Different values indicate different donors. |
| `var["numeric_id"]` | `integer` | Numeric ID associated with each marker. |
| `var["channel"]` | `string` | The channel / detector of the instrument. |
| `var["marker"]` | `string` | (*Optional*) The marker name associated with the channel. |
@@ -269,16 +288,16 @@ Data structure:
## Component type: Method
-A method.
+A method for correcting batch effects in cytometry data.
Arguments:
-| Name | Type | Description |
-|:-----------|:-------|:-------------------------------|
-| `--input` | `file` | Unintegrated dataset. |
-| `--output` | `file` | (*Output*) Integrated dataset. |
+| Name | Type | Description |
+|:---|:---|:---|
+| `--input` | `file` | An unintegrated dataset with certain columns (cell metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias. The batch correction algorithm should not have to rely on this information to properly integrate different batches. This dataset is used as the input for the batch correction algorithm. The cells therein are identical to those in the unintegrated dataset. |
+| `--output` | `file` | (*Output*) Integrated dataset whose batch effect was corrected by an algorithm. |
@@ -290,11 +309,11 @@ Arguments:
-| Name | Type | Description |
-|:-----------------------|:-------|:---------------------------------|
-| `--input_unintegrated` | `file` | Unintegrated dataset. |
-| `--input_validation` | `file` | Hold-out dataset for validation. |
-| `--output` | `file` | (*Output*) Integrated dataset. |
+| Name | Type | Description |
+|:---|:---|:---|
+| `--input_unintegrated` | `file` | The complete unintegrated dataset, including all cells’ metadata (columns) from the unintegrated_censored dataset. The cells in this dataset are the same as those in the unintegrated_censored dataset. |
+| `--input_validation` | `file` | Hold-out dataset for validation. |
+| `--output` | `file` | (*Output*) Integrated dataset whose batch effect was corrected by an algorithm. |
@@ -309,18 +328,18 @@ Arguments:
| Name | Type | Description |
|:---|:---|:---|
| `--input_validation` | `file` | Hold-out dataset for validation. |
-| `--input_unintegrated` | `file` | Unintegrated dataset. |
-| `--input_integrated` | `file` | Integrated dataset. |
+| `--input_unintegrated` | `file` | The complete unintegrated dataset, including all cells’ metadata (columns) from the unintegrated_censored dataset. The cells in this dataset are the same as those in the unintegrated_censored dataset. |
+| `--input_integrated` | `file` | Integrated dataset whose batch effect was corrected by an algorithm. |
| `--output` | `file` | (*Output*) File indicating the score of a metric. |
## File format: Integrated
-Integrated dataset
+Integrated dataset whose batch effect was corrected by an algorithm
Example file:
-`resources_test/task_cyto_batch_integration/cxg_mouse_pancreas_atlas/prediction.h5ad`
+`resources_test/task_cyto_batch_integration/starter_file/integrated.h5ad`
Format:
@@ -350,14 +369,14 @@ Data structure:
File indicating the score of a metric.
Example file:
-`resources_test/task_cyto_batch_integration/cxg_mouse_pancreas_atlas/score.h5ad`
+`resources_test/task_cyto_batch_integration/starter_file/score.h5ad`
Format:
AnnData object
- uns: 'dataset_id', 'normalization_id', 'method_id', 'metric_ids', 'metric_values'
+ uns: 'dataset_id', 'method_id', 'sample_ids', 'metric_ids', 'metric_values'
@@ -368,10 +387,10 @@ Data structure:
| Slot | Type | Description |
|:---|:---|:---|
| `uns["dataset_id"]` | `string` | A unique identifier for the dataset. |
-| `uns["normalization_id"]` | `string` | Which normalization was used. |
-| `uns["method_id"]` | `string` | A unique identifier for the method. |
+| `uns["method_id"]` | `string` | A unique identifier for the batch correction method. |
+| `uns["sample_ids"]` | `string` | The samples assessed by the metric. |
| `uns["metric_ids"]` | `string` | One or more unique metric identifiers. |
-| `uns["metric_values"]` | `double` | The metric values obtained for the given prediction. Must be of same length as ‘metric_ids’. |
+| `uns["metric_values"]` | `double` | The metric values obtained. Must be of same length as ‘metric_ids’. |
diff --git a/src/api/file_common_dataset.yaml b/src/api/file_common_dataset.yaml
index d9746a2..87795cd 100644
--- a/src/api/file_common_dataset.yaml
+++ b/src/api/file_common_dataset.yaml
@@ -31,6 +31,23 @@ info:
name: group
description: Biological group of the donor
required: true
+ - type: integer
+ name: is_control
+ description: |
+ Whether the sample the cell came from can be used as a control for batch
+ effect correction.
+ 0: cannot be used as a control.
+ >= 1: can be used as a control.
+ For cells with >= 1: cells with the same value come from the same donor.
+ Different values indicate different donors.
+ required: true
+ - type: boolean
+ name: is_validation
+ description: |
+ Whether the cell will be used as validation data or not.
+ If FALSE, then the cell will only be included in unintegrated and unintegrated_censored.
+ If TRUE, then the cell will only be included in validation.
+ required: true
var:
- type: integer
name: numeric_id
@@ -52,11 +69,6 @@ info:
name: to_correct
description: Whether the marker will be batch corrected
required: true
- # obsm:
- # - type: double
- # name: X_pca
- # description: The resulting PCA embedding.
- # required: true
uns:
- type: string
name: dataset_id
diff --git a/src/api/file_integrated.yaml b/src/api/file_integrated.yaml
index 9882901..0ef5380 100644
--- a/src/api/file_integrated.yaml
+++ b/src/api/file_integrated.yaml
@@ -1,7 +1,7 @@
type: file
example: "resources_test/task_cyto_batch_integration/starter_file/integrated.h5ad"
label: Integrated
-summary: "Integrated dataset"
+summary: "Integrated dataset whose batch effect was corrected by an algorithm"
info:
format:
type: h5ad
diff --git a/src/api/file_unintegrated.yaml b/src/api/file_unintegrated.yaml
index f8e9b00..c81705b 100644
--- a/src/api/file_unintegrated.yaml
+++ b/src/api/file_unintegrated.yaml
@@ -1,9 +1,11 @@
#TODO: Change to the required and/or optional fields of the anndata
type: file
example: "resources_test/task_cyto_batch_integration/starter_file/unintegrated.h5ad"
-label: "Unintegrated"
-summary: "Unintegrated dataset"
-
+label: Unintegrated
+summary: |
+ The complete unintegrated dataset, including all cells' metadata (columns) from the
+ unintegrated_censored dataset.
+ The cells in this dataset are the same as those in the unintegrated_censored dataset.
info:
format:
type: h5ad
@@ -33,6 +35,16 @@ info:
name: group
description: Biological group of the donor
required: true
+ - type: integer
+ name: is_control
+ description: |
+ Whether the sample the cell came from can be used as a control for batch
+ effect correction.
+ 0: cannot be used as a control.
+ >= 1: can be used as a control.
+ For cells with >= 1: cells with the same value come from the same donor.
+ Different values indicate different donors.
+ required: true
var:
- type: integer
name: numeric_id
@@ -54,11 +66,6 @@ info:
name: to_correct
description: Whether the marker will be batch corrected
required: true
- # obsm:
- # - type: double
- # name: X_pca
- # description: The resulting PCA embedding.
- # required: true
uns:
- type: string
name: dataset_id
diff --git a/src/api/file_unintegrated_censored.yaml b/src/api/file_unintegrated_censored.yaml
index 20d9f7b..874482e 100644
--- a/src/api/file_unintegrated_censored.yaml
+++ b/src/api/file_unintegrated_censored.yaml
@@ -1,8 +1,14 @@
#TODO: Change to the required and/or optional fields of the anndata
type: file
example: "resources_test/task_cyto_batch_integration/starter_file/unintegrated_censored.h5ad"
-label: "Unintegrated Censored"
-summary: "Unintegrated dataset"
+label: Unintegrated Censored
+summary: |
+ An unintegrated dataset with certain columns (cell metadata), such as the donor information, hidden.
+ These columns are intentionally hidden to prevent bias.
+ The batch correction algorithm should not have to rely on this information
+ to properly integrate different batches.
+ This dataset is used as the input for the batch correction algorithm.
+ The cells therein are identical to those in the unintegrated dataset.
info:
format:
type: h5ad
@@ -20,10 +26,16 @@ info:
name: sample
description: Sample ID
required: true
- - type: string
- name: donor
- description: Donor ID
- required: false
+ - type: integer
+ name: is_control
+ description: |
+ Whether the sample the cell came from can be used as a control for batch
+ effect correction.
+ 0: cannot be used as a control.
+ >= 1: can be used as a control.
+ For cells with >= 1: cells with the same value come from the same donor.
+ Different values indicate different donors.
+ required: true
var:
- type: integer
name: numeric_id
@@ -45,11 +57,6 @@ info:
name: to_correct
description: Whether the marker will be batch corrected
required: true
- # obsm:
- # - type: double
- # name: X_pca
- # description: The resulting PCA embedding.
- # required: true
uns:
- type: string
name: dataset_id
diff --git a/src/api/file_validation.yaml b/src/api/file_validation.yaml
index ad29b2e..dda4365 100644
--- a/src/api/file_validation.yaml
+++ b/src/api/file_validation.yaml
@@ -3,10 +3,17 @@ example: "resources_test/task_cyto_batch_integration/starter_file/validation.h5a
label: Validation
summary: Hold-out dataset for validation.
description: |
- Samples that were held out and will later be used only to assess whether
- the batch integration was successful. E.g. if a donor from batch 2 was corrected towards batch 1,
- but also actually measured in batch 1 (without being used as input to the algorithm).
-
+ Dataset containing cells from samples that were held out for evaluating batch integration output.
+ The cells that are in this dataset belong to samples which are not included in the unintegrated
+ or unintegrated_censored datasets.
+ For example, if samples from donor A are present in batch 1 and 2, the sample from batch 1
+ may be used as input for the batch correction algorithm (and thus present in unintegrated
+ and unintegrated_censored datasets).
+ The sample from batch 2 may not be included as an input for the batch correction algorithm,
+ but is needed to validate whether the algorithm managed to correct the batch effect
+ in batch 2 towards batch 1.
+ This sample will then be included in this dataset (but not in unintegrated
+ and unintegrated_censored datasets).
info:
format:
type: h5ad
@@ -36,6 +43,16 @@ info:
name: group
description: Biological group of the donor
required: true
+ - type: integer
+ name: is_control
+ description: |
+ Whether the sample the cell came from can be used as a control for batch
+ effect correction.
+ 0: cannot be used as a control.
+ >= 1: can be used as a control.
+ For cells with >= 1: cells with the same value come from the same donor.
+ Different values indicate different donors.
+ required: true
var:
- type: integer
name: numeric_id
@@ -57,11 +74,6 @@ info:
name: to_correct
description: Whether the marker will be batch corrected
required: true
- # obsm:
- # - type: double
- # name: X_pca
- # description: The resulting PCA embedding.
- # required: true
uns:
- type: string
name: dataset_id
diff --git a/src/data_processors/process_dataset/script.py b/src/data_processors/process_dataset/script.py
index 2ec82c0..e41720d 100644
--- a/src/data_processors/process_dataset/script.py
+++ b/src/data_processors/process_dataset/script.py
@@ -27,13 +27,12 @@
adata = ad.read_h5ad(par["input"])
print("input:", adata)
-validation_names = par["validation_sample_names"] or []
-is_validation = adata.obs["sample"].isin(validation_names)
+print(">> Creating unintegrated data", flush=True)
+adata_unintegrated = adata[~adata.obs["is_validation"]]
-print(">> Creating train data", flush=True)
output_unintegrated = subset_h5ad_by_format(
- adata[[not x for x in is_validation]],
+ adata_unintegrated,
config,
"output_unintegrated"
)
@@ -41,15 +40,18 @@
print(">> Creating test data", flush=True)
output_unintegrated_censored = subset_h5ad_by_format(
- adata[[not x for x in is_validation]],
+ adata_unintegrated,
config,
"output_unintegrated_censored"
)
print(f"output_unintegrated_censored: {output_unintegrated_censored}")
-print(">> Creating solution data", flush=True)
+print(">> Creating validation data", flush=True)
+
+adata_validation = adata[adata.obs["is_validation"]]
+
output_validation = subset_h5ad_by_format(
- adata[is_validation],
+ adata_validation,
config,
"output_validation"
)
diff --git a/src/metrics/emd_per_samples/config.vsh.yaml b/src/metrics/emd_per_samples/config.vsh.yaml
index 3cfc168..ba03b80 100644
--- a/src/metrics/emd_per_samples/config.vsh.yaml
+++ b/src/metrics/emd_per_samples/config.vsh.yaml
@@ -83,4 +83,4 @@ runners:
# Allows turning the component into a Nextflow module / pipeline.
- type: nextflow
directives:
- label: [midtime,midmem,midcpu]
+ label: [midtime,midmem,midcpu]
\ No newline at end of file
diff --git a/src/metrics/emd_per_samples/script.py b/src/metrics/emd_per_samples/script.py
index 98d7bca..838ab2f 100644
--- a/src/metrics/emd_per_samples/script.py
+++ b/src/metrics/emd_per_samples/script.py
@@ -38,9 +38,11 @@
# )
# ].copy()
-markers_to_assess = input_unintegrated.var[
- input_unintegrated.var["to_correct"]
-].index.to_numpy()
+# markers_to_assess = input_integrated.var[
+# input_integrated.var["to_correct"]
+# ].index.to_numpy()
+
+markers_to_assess = input_integrated.var.index.to_numpy()
print("Extracting samples to compute the metric for", flush=True)