# ---
# jupyter:
#   jupytext:
#     cell_metadata_filter: title,-all
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.16.2
#   kernelspec:
#     display_name: .venv
#     language: python
#     name: python3
# ---
# %% [markdown]
# # TidySLAV
# %% [markdown]
# TidySLAV takes Prepare-a-SLAV's preprocessed, columnar Parquet files and turns the table into a tidy, tall format ([paper](https://www.jstatsoft.org/article/view/v059i10), [informal version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) by the creator of the concept). This enables easy addition of experimental metadata and later analysis in a statistical software package, like with StatistiSLAV.
#
# Additionally, some columns are removed, such as those belonging to disconnected inputs and to sensors specified as faulty by the user. Brief periods when operational sensors were disconnected (e.g. due to behavioural testing or bedding change) are automatically recognised and set to *NA*, indicating an unknown activity state. Finally, the time series are standardised according to each sensor’s respective baseline measurements taken before treatment application.
#
# If you are running TidySLAV via Google Colab, TidySLAV will autodetect and set up the Colab environment in the following cell, and pull example data from the [MIROSLAV toolkit GitHub repository](https://github.com/davorvr/MIROSLAV-analysis).
#
# If you want to run TidySLAV in Google Colab *and* with your own data, you can upload it using the File Browser in the sidebar on the left after running the following cell.
# %%
import importlib.util
try:
    importlib.util.find_spec("google.colab")
except ModuleNotFoundError:
    IN_COLAB = False
else:
    IN_COLAB = True
    from IPython.display import clear_output
    import importlib.metadata
    import packaging.version
    if packaging.version.Version(importlib.metadata.version("pandas")) < packaging.version.Version("2.2"):
        # %pip install -Uq "pandas==2.2"
    from google.colab import output
    output.enable_custom_widget_manager()
    # !mkdir -p 1_outputs_prepared
    # !wget -O 1_outputs_prepared/mph-pir-rack_M-dtyped-resampled-1minute.parquet https://github.com/davorvr/MIROSLAV-analysis/raw/main/1_outputs_prepared/mph-pir-rack_M-dtyped-resampled-1minute.parquet
    # !wget -O 1_outputs_prepared/mph-pir-rack_R-dtyped-resampled-1minute.parquet https://github.com/davorvr/MIROSLAV-analysis/raw/main/1_outputs_prepared/mph-pir-rack_R-dtyped-resampled-1minute.parquet
# %% [markdown]
# ***
# %% [markdown]
# ## Setup
# %% [markdown]
# ### Import modules
# %%
import datetime
import warnings
from pathlib import Path
import pandas as pd
import numpy as np
import pprint
import plotly.io as pio
# if figures don't display properly, you can try changing the renderer
# https://plotly.com/python/renderers/
if IN_COLAB:
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "notebook"
#pio.renderers.default = "vscode"
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
# %% [markdown]
# ### Set the working directories
# %% [markdown]
# * `wd` - Working directory, set to where this notebook is located.
#
# * `dir_in` - Location of files generated by Prepare-a-SLAV.
#
# * `dir_out` - Location where the finished dataframe will be saved.
#
# * `dir_out_ch` - Location where "checkpoint" files will be stored, holding the state of the dataset after each big cleaning step.
# %%
wd = Path.cwd()
dir_in = wd / "1_outputs_prepared"
dir_out = wd / "2_outputs_tidy"
dir_out_ch = wd / "2_outputs_tidy" / "checkpoints"
# %% [markdown]
# Also create the output directories if they don't already exist.
# %%
dir_out.mkdir(exist_ok=True)
dir_out_ch.mkdir(exist_ok=True)
# %% [markdown]
# This dictionary will hold the locations and names of the checkpoint files so we can easily refer to them to create diagnostic plots at the end.
# %%
checkpoint_files = {}
# %% [markdown]
# ### Define variables pertaining to the experiment
# %% [markdown]
# All times are in 24-hour format.
#
# * `tod` - time of day of a recurring event
#
# * `ts` - timestamp of a one-shot event, the format is `year-month-day hour:minute:second`
#
# * `al` - animal label (or, more exactly, cage label)
#
# * `sl` - sensor labels corresponding to either of the two sensors attached to each cage (format `H_{cage label}` or `L_{cage label}`)
# %% [markdown]
# #### Day-night
# %% [markdown]
# Manually recorded values when habitat lights (as the main Zeitgeber) switch on or off. `day` in our case refers to lights on, and `night` to lights off.
# %%
tod_day_start = datetime.time(5,46)
tod_night_start = datetime.time(17,46)
# %% [markdown]
# #### Experiment start/end timestamps
# %% [markdown]
# Values in the data that come before `ts_experiment_start` and after `ts_experiment_end` will be discarded. Since the destiny of this analysis will be to fit a sine curve to each 24-hour period of each sensor with MIRO The Explorer, all of our timestamps are rounded to the nearest lights on/off event.
# %%
ts_experiment_start = pd.to_datetime("2022-05-06 17:46:00")
ts_experiment_end = pd.to_datetime("2022-05-31 17:46:00")
# %% [markdown]
# #### Baseline end timestamp
# %% [markdown]
# Due to inter-sensor variability, values will be standardised according to baseline measurements taken before any treatment was applied. This allows us to track intra- and inter-individual changes in activity over time with respect to their baselines.
#
# In our case, treatment was applied on the morning of 11th May 2022, between 8:00 and 10:00, so this timestamp is rounded to the last light switch event that came just before, which was lights-on at 5:46 in this case.
# %%
ts_baseline_end = pd.to_datetime("2022-05-11 05:46:00")
# %% [markdown]
# #### Additional metadata
# %% [markdown]
# Streptozotocin is applied intracerebroventricularly to induce a sporadic Alzheimer's disease model. It is given in two doses 48 hours apart. In accordance with how metadata is recorded in tall data formats, these timestamps will be used to generate extra columns in the data: one marking all rows after the first dose (treatment application) with a `1`, and one marking all rows after the second dose with a `1`, i.e. once the AD model induction procedure is complete.
#
# These values, however, won't be used programmatically to modify the MIROSLAV activity data (such as dataset trimming or standardisation as was the case previously).
# %%
ts_model_induction_start = pd.to_datetime("2022-05-11 05:46:00")
ts_model_induction_end = pd.to_datetime("2022-05-13 17:46:00")
# %% [markdown]
# #### Animals and their treatments
# %% [markdown]
#
# This is a dictionary of which animal (or, more accurately, cage) label corresponds to which treatment. More than two groups may be defined, but this experiment only had one treatment/two groups. Our animals were labelled using markers of different colours, so the names correspond to colours (in Croatian, as we are SLAVs :-) ).
# %%
al_treatment = {
    "stz": ["crni3", "crni5", "crni7", "crni9", "crni10",
            "crveni1", "crveni3", "crveni8", "crveni10",
            "zeleni1", "zeleni2", "zeleni6", "zeleni7", "zeleni8", "zeleni9"],
    "ctr": ["crni1", "crni2", "crni4", "crni6", "crni8",
            "crveni2", "crveni4", "crveni5", "crveni6", "crveni7", "crveni9",
            "zeleni3", "zeleni4", "zeleni5", "zeleni10"]
}
# %% [markdown]
# #### Faulty sensors
# %% [markdown]
# Sensors that have been recognised to have a technical fault and should therefore be removed from the analysis.
# %%
sl_faulty = ["L_crni1", "H_zeleni1", "L_crveni3", "L_crveni9", "H_crni7", "H_crveni9"]
# %% [markdown]
# #### Choose which data to load
# %% [markdown]
# This string corresponds to the one specified to Prepare-a-SLAV and indicates the resampling frequency of the data. It needs to be specified exactly as it was to the Prepare-a-SLAV script as it will be used to load the specified data files. It is a part of the input data filenames, so it can be found there as well.
# %%
input_data_period = "1 minute"
# %% [markdown]
# #### Additional resampling
# %% [markdown]
# Data can be resampled to even wider time intervals. This means you can produce a dataset with a narrow time interval once with Prepare-a-SLAV, and then resample it further here with TidySLAV, which is quicker than re-running Prepare-a-SLAV if you need wider bins. If you don't want to perform additional resampling, you can set the variable to `None`.
# %%
resample_period_string = "5 minutes"
#resample_period_string = None
# %% [markdown]
# #### Specify experiment name and select devices
# %% [markdown]
# Specify the experiment name and, optionally, which devices you want to merge into one dataset. These are the values specified in the Prepare-a-SLAV TOML configuration file, and can be found in the filenames of Prepare-a-SLAV's output files, which are saved in the `1_outputs_prepared` directory and follow this pattern (with spaces in `{input_data_period}` removed):
#
#     {experiment_name}-pir-{device_name}-dtyped-resampled-{input_data_period}.parquet
#
# The experiment name must be specified.
#
# Only the devices belonging to a specific experiment can be included, and you have two options:
#
# - `devices = ["rack_M", "rack_R"]` - Specify the exact device(s) that you want included. Other device files belonging to this experiment will be disregarded
#
# - `devices = []` - Leave the list empty. All devices/files found that belong to the experiment will be included automatically.
# %%
experiment_name = "mph"
devices = []
# %% [markdown]
# ***
# %% [markdown]
# ## Merging
# %% [markdown]
# Merge the data of multiple/all MIROSLAV devices used in an experiment into one dataframe.
# %%
# Get a list of all Prepare-a-SLAV files for the given experiment and bin
device_file = {}
prepfile_list = [f for f in dir_in.glob(f"{experiment_name}-pir-*-{input_data_period.replace(' ', '')}.parquet")]
if not devices:
    # If no specific devices are given, get all
    for f in prepfile_list:
        d_str_start = f.stem.find("-pir-")+5
        d_str_end = f.stem.find("-dtyped")
        d = f.stem[d_str_start:d_str_end]
        if d in device_file:
            raise ValueError(f"Multiple Prepare-a-SLAV files found for device {d}, bin {input_data_period}!")
        device_file[d] = f
else:
    # If specific devices are given, only get those
    for d in devices:
        device_file[d] = [f for f in prepfile_list if d in f.stem]
        if len(device_file[d]) == 0:
            raise ValueError(f"No Prepare-a-SLAV files found for device {d}, bin {input_data_period}!")
        elif len(device_file[d]) > 1:
            raise ValueError(f"Multiple Prepare-a-SLAV files found for device {d}, bin {input_data_period}!")
        else:
            device_file[d] = device_file[d][0]
if not device_file:
    raise ValueError("No Prepare-a-SLAV files found!")
# print the files we found in the format
# device name: input dir/file name
{d: f.parent.name+"/"+f.name for d, f in device_file.items()}
# %% [markdown]
# Now, we read the data from the files and rename the timestamp-delta columns appropriately (as we would otherwise end up with duplicate column names - one coming from each device).
#
# Then, we remove columns for inputs that were labelled as disconnected (i.e. no sensor array was connected to these inputs on the MIROSLAV input boards).
# %%
device_dataframes = {}
delta_cols = []
for d, file in device_file.items():
    df = pd.read_parquet(file)
    # We have two time columns
    # - ts_sent, recorded by the MIROSLAV device
    # - ts_recv, recorded by the computer logging MIROSLAV data
    # One of them has been defined by Prepare-a-SLAV as the index, holding
    # absolute time, and the other holding the delta between the two timestamps.
    delta_col_name = df.columns[df.columns.str.endswith("delta")][0]
    df = df.rename(columns={delta_col_name: delta_col_name+f"_{d}"})
    # We also record the names of the delta columns, this will be useful later on.
    delta_cols.append(delta_col_name+f"_{d}")
    # Here, we remove columns called "na", which means that no sensor array
    # was connected to them. We also remove duplicate rows, which can occur
    # as an artifact of the resampling process after piecewise reading.
    for sensor_type in ["H", "L"]:
        df = df.loc[:,~df.columns.str.startswith(f"{sensor_type}_na")]
    df = df[~df.index.duplicated()]
    device_dataframes[d] = df.copy()
    del(df)
# %% [markdown]
# Merge the processed device dataframes into one dataset, and save it to a file. This is a merged dataset without any additional post-processing which can be analysed further with other tools as well.
#
# Then delete the device dataframes to free up RAM.
# %%
df = pd.concat(device_dataframes.values(), axis=1)
checkpoint_files["1_merged"] = dir_out_ch / f"{experiment_name}-pir-tidy-1_merged-{input_data_period}.parquet".replace(" ", "")
df.to_parquet(checkpoint_files["1_merged"])
# delete these large variables to free up RAM
del(device_dataframes)
# %% [markdown]
# ***
# %% [markdown]
# ## Tidying
# %% [markdown]
# ### Melt the dataframe from a table into a tall format
# %% [markdown]
# Now, we turn the dataframe into a tall format - one row per measurement. Each row also contains the timestamp, the animal identifier, and the sensor identifier (`H` or `L`, each corresponding to one of the two sensors in a cage-mounted array; in this experiment, both were configured to the same sensitivity level for redundancy).
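# %% [markdown]
# As a minimal, self-contained illustration of the melting step (not part of the pipeline; the sensor label and values below are made up), a wide toy table with one column per sensor becomes one row per measurement:
# %%
toy_wide = pd.DataFrame({
    "ts": pd.date_range("2022-05-06 17:46", periods=2, freq="1min"),
    "H_crni1": [0.1, 0.2],
    "L_crni1": [0.0, 0.4],
})
toy_wide.melt(id_vars="ts", var_name="sensor_animal_id", value_name="miro_value")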
# %%
# which column holds the timestamp
ts_column = df.index.name
# drop faulty sensors
df = df.drop(columns=sl_faulty)
# reset the index so the timestamps become a regular column
# timestamps cannot be the index anymore since they will be duplicated in a tall format
# (we have two sensors * many animals for each timestamp)
df = df.reset_index()
# get a list of columns corresponding to the "H" and "L" sensors respectively
highcols = df[[c for c in df.columns if c.startswith("H_")]+[ts_column]+delta_cols]
lowcols = df[[c for c in df.columns if c.startswith("L_")]+[ts_column]+delta_cols]
# melt the "H" and "L" dataframes into a tall format separately
melted_h = highcols.melt(id_vars=[ts_column]+delta_cols, var_name="sensor_animal_id", value_name="miro_value")
melted_h["sensor_id"] = "H"
melted_h["animal_id"] = melted_h["sensor_animal_id"].str[2:]
melted_l = lowcols.melt(id_vars=[ts_column]+delta_cols, var_name="sensor_animal_id", value_name="miro_value")
melted_l["sensor_id"] = "L"
melted_l["animal_id"] = melted_l["sensor_animal_id"].str[2:]
# merge the two to recreate the full dataset, and save it (with the suffix -tall)
df = pd.concat([melted_h, melted_l])
df = df.reset_index()
# delete these large variables to free up RAM
del([melted_h, melted_l, highcols, lowcols])
# %% [markdown]
# All data is now in one dataframe, and it's tall, without any other processing. We will save another Parquet file as a checkpoint.
# %%
checkpoint_files["2_tall"] = dir_out_ch / f"{experiment_name}-pir-tidy-2_tall-{input_data_period}.parquet".replace(" ", "")
df.to_parquet(checkpoint_files["2_tall"])
# %% [markdown]
# ### Detect unplugged periods and set them to NA
# %% [markdown]
# When a sensor array gets unplugged, it is read as a continuous stream of `1` values. The following code finds these patterns of consecutive high values and replaces them with `NA`.
#
# The function `replace_consecutive_ones` does the work based on two parameters (a consecutive series of the same value is called a "run", and its length a "run length" - that is what the `rl` stands for in the `thresh_rl` argument):
#
# - `thresh_value` - How high does a value need to be to be recognised as part of an unplugged period?
#
# - `thresh_rl` - How many consecutive values are needed to state that this pattern is sufficiently long to indicate an unplugged sensor?
#
# Since our dataset is resampled, we use `0.95` as a threshold instead of `1`.
#
# %%
def replace_consecutive_ones(g, thresh_value, thresh_rl):
    """ Replace runs of at least `thresh_rl` consecutive values of at least `thresh_value` with NA """
    # get series where all 1s are True, all other values are False
    ones = g["miro_value"] >= thresh_value
    # get series where a True appears when the ones series switches from True to False or vice versa
    # cumsum basically turns this into a series of numbers that increment whenever a new
    # block of consecutive values starts
    group_split = ones.ne(ones.shift()).cumsum()
    # the aforementioned is used to group the mask of True and False values and get the
    # sizes/lengths of each of these blocks of values
    run_lengths = ones.groupby(group_split).transform("size")
    # blocks of False values (miro_value != 1) are set to 0
    run_lengths.loc[~ones] = 0
    # only blocks of 1s that reach thresh_rl are set to True
    run_lengths = np.where(run_lengths >= thresh_rl, True, False)
    # this is used as a mask to set all of these miro_value entries to NA
    g.loc[run_lengths, "miro_value"] = pd.NA
    return g
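# %% [markdown]
# As a quick, optional sanity check (illustrative only, with made-up values): with `thresh_rl=3`, a run of three consecutive `1.0`s is set to *NA*, while an isolated `1.0` is kept.
# %%
toy = pd.DataFrame({"miro_value": [0.2, 1.0, 1.0, 1.0, 0.3, 1.0, 0.1]})
replace_consecutive_ones(toy, thresh_value=0.95, thresh_rl=3)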
# %% [markdown]
# Set the index to the timestamp column again. We need this for the procedure to work, but it does mean we will have duplicate indices. This is fine, as we will operate on the data in batches - one sensor at a time. We will reset the index when we're done.
# %%
df = df.set_index(ts_column)
# %% [markdown]
# We calculate the frequency at which the data was resampled without relying on what is stated in the filename, just in case.
# %%
input_data_period_calculated = pd.Series(df.index).diff().mode()[0]
# %% [markdown]
# These are the values that will be passed to `thresh_value` and `thresh_rl` arguments of the `replace_consecutive_ones` function. More specifically, the run length as a number of values will be calculated from the time delta given.
# %%
THRESHOLD_VALUE = 0.95
THRESHOLD_MINUTES = pd.Timedelta("2 minutes")
if THRESHOLD_MINUTES < input_data_period_calculated:
    THRESHOLD_RUNLENGTH = 1
else:
    THRESHOLD_RUNLENGTH = int(round(THRESHOLD_MINUTES/input_data_period_calculated, 0))
if THRESHOLD_RUNLENGTH <= 1:
    print(f"Warning: Threshold run length is 1. This may lead to excessive NA values as all values >={THRESHOLD_VALUE} will be set to NA.")
# %% [markdown]
# We group the dataframe so each group contains values from one sensor (the `sensor_animal_id` column), and apply the function to each group. Then we reset the index to a numerical one again, which won't have duplicate values, and thus move the timestamps into a separate column.
# %%
grouped = df.groupby("sensor_animal_id")
grouped = grouped.apply(lambda x: replace_consecutive_ones(x, THRESHOLD_VALUE, THRESHOLD_RUNLENGTH),
include_groups=False)
df = grouped.reset_index()
# %% [markdown]
# We will save this dataframe, with NAs appropriately set, as another Parquet file checkpoint.
# %%
checkpoint_files["3_na"] = dir_out_ch / f"{experiment_name}-pir-tidy-3_na-{input_data_period}.parquet".replace(" ", "")
df.to_parquet(checkpoint_files["3_na"])
# %% [markdown]
# ***
# %% [markdown]
# ## Standardisation
# %% correct for baseline [markdown]
# We want all sensor readings to look the same at baseline, thereby disregarding individual differences between animals and sensors, as we're only interested in the effect of the treatment. So we perform data standardisation by:
#
# 1. **Subtracting the mean** - this centers the mean of the data around 0.
# 2. **Dividing by the standard deviation** - this reduces the standard deviation to 1 from whatever it may have been originally.
#
# Standardised sinusoidal rhythms look the same - subtracting the mean centers the sinusoids, and dividing by the standard deviation effectively matches the amplitudes. We do this for each sensor's readings individually, using only the mean and standard deviation of values recorded during the baseline period. Therefore, all rhythms will overlap as much as possible before treatment application, and divergences later on will be due to the effects of the treatment.
# %%
def miro_normalise(g, baseline_cutoff: pd.Timestamp):
    """ Perform data standardisation on MIRO values
    Data standardisation entails subtracting the mean and dividing by the standard deviation.
    Here, the mean and standard deviation of the baseline period (before `baseline_cutoff`)
    are used, so all miro values are scaled such that those before the cutoff have mean=0 and std=1 """
    miro = g["miro_value"]
    baseline_miro = g.loc[g.index < baseline_cutoff, "miro_value"]
    g["miro_value"] = (miro-baseline_miro.mean())/baseline_miro.std()
    return g
df = df.set_index(ts_column)
grouped = df.groupby("sensor_animal_id").apply(lambda x: miro_normalise(x, ts_baseline_end),
                                               include_groups=False)
df = grouped.reset_index()
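# %% [markdown]
# As an optional, illustrative sanity check (not part of the original pipeline), each sensor's baseline window should now have a mean of roughly 0 and a standard deviation of roughly 1:
# %%
baseline_check = (df[df[ts_column] < ts_baseline_end]
                  .groupby("sensor_animal_id")["miro_value"]
                  .agg(["mean", "std"]))
baseline_check.head()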
# %% [markdown]
# ***
# %% [markdown]
# ## Inserting metadata into the dataframe
# %% [markdown]
# At the end, the metadata defined at the beginning of the notebook is inserted into the fully processed dataframe. Understandably, this metadata heavily differs between experiments, so to add metadata to your own MIROSLAV dataframes, you can either modify and reuse the code below, or remove it altogether and add the metadata subsequently either programmatically in a different manner (e.g. with R), or manually.
#
# As a side note, manual metadata addition using a spreadsheet program (e.g. LibreOffice Calc, Microsoft Excel) might be doable in the tidied dataframe exported at the end of this notebook, if there aren't too many values (depends on bin size, duration of recording etc.), though this requires exporting to XLSX instead of Parquet at the end. For this reason, code for exporting to an XLSX spreadsheet will be provided as well.
# %% [markdown]
# Here, we insert a `treatment` column specifying which animal received which treatment/belongs to which group based on the `al_treatment` dictionary [defined near the beginning of the notebook](#animals-and-their-treatments)
# %% add experimental variables as df columns
for tr, al in al_treatment.items():
    df.loc[df["animal_id"].isin(al), "treatment"] = tr
# %% [markdown]
# These are additional metadata related to our experiment, which are [elaborated on earlier in the notebook](#additional-metadata).
# %%
# the column "treatment_applied" is added and all values are set to 0
df["treatment_applied"] = 0
# only those values after ts_model_induction_start are set to 1, as the first treatment has been applied at this point
df.loc[df[ts_column] > ts_model_induction_start, "treatment_applied"] = 1
# the column "model_induced" is added and all values are set to 0
df["model_induced"] = 0
# only those values after ts_model_induction_end are set to 1, as the complete model induction procedure has been completed
df.loc[df[ts_column] > ts_model_induction_end, "model_induced"] = 1
# %% [markdown]
# ***
# %% [markdown]
# ## Further resampling
# %% [markdown]
# Now that we're finished with everything, we can resample the data to an even wider bin, specified [earlier in the script](#additional-resampling). The bin width depends on which type of analysis you want to perform later on, with tradeoffs between wider and narrower bins described in the [Prepare-a-SLAV TOML configuration file](./1_Prepare-a-SLAV_config.toml):
#
# > _Wider bins give you easy insight into general trends across longer time periods, while narrower bins give you more detailed insight into the data. You can run the script multiple times to resample the data to different frequencies, e.g. first at "4 hours" to see general trends, and then to "15 minutes" to be able to analyse interesting periods in more detail._
#
# Even though Prepare-a-SLAV performs resampling just as well, running Prepare-a-SLAV takes ~15 minutes for us in our example as it needs to parse the raw textual logs. If you decide you want a different bin, you can do it here instead of re-running Prepare-a-SLAV as re-resampling the data at this stage takes considerably less time, making it easier to play with the data and examine it from many perspectives.
#
# This step is optional, and will be skipped if `resample_period_string` is set to `None`. The data will be saved as-is instead.
# %% resample to an arbitrary time period for easier analysis
if resample_period_string:
    agg_functions = {"miro_value": "mean",
                     "sensor_id": "last",
                     "animal_id": "last",
                     "treatment": "last",
                     "treatment_applied": "last",
                     "model_induced": "last"}
    resample_period = pd.to_timedelta(resample_period_string)
    df = df.set_index(ts_column)
    resampled_index = pd.date_range(start=ts_experiment_start, end=ts_experiment_end, freq=resample_period, inclusive="left")
    df = df.groupby("sensor_animal_id").resample(resample_period, origin=ts_experiment_start).agg(agg_functions)
    #df = df.drop(pd.to_datetime("2022-05-06 17:46:00"), level=1)
    df = df.reset_index().set_index(ts_column)
    def reindex_group(g):
        g = g.reindex(resampled_index, fill_value=pd.NA)
        for col in ["sensor_id", "animal_id", "treatment", "treatment_applied", "model_induced"]:
            g[col] = g[col].ffill().bfill()
        return g
    df = df.groupby("sensor_animal_id").apply(reindex_group, include_groups=False).reset_index(names=["sensor_animal_id", ts_column])
    checkpoint_files["finished"] = dir_out / f"{experiment_name}-pir-tidy-source{input_data_period}-resampled{resample_period_string}.parquet".replace(" ", "")
else:
    checkpoint_files["finished"] = dir_out / f"{experiment_name}-pir-tidy-source{input_data_period}.parquet".replace(" ", "")
# %% [markdown]
# ***
# %% [markdown]
# ## Saving
# %% [markdown]
# Finally, the finished tidy, tall dataframe is saved in the `2_outputs_tidy` directory as a Parquet file.
# %%
df.to_parquet(checkpoint_files["finished"])
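# %% [markdown]
# If you prefer to add metadata manually in a spreadsheet program, a minimal sketch of the XLSX export mentioned earlier would be the following (assuming the optional `openpyxl` dependency is installed; note that this can be slow and produce a large file with narrow bins):
# %%
#df.to_excel(checkpoint_files["finished"].with_suffix(".xlsx"), index=False)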
# %% [markdown]
# ***
# %% [markdown]
# ## Diagnostics
# %% [markdown]
# ### Per-sensor heatmaps
# %% [markdown]
# These tests produce heatmaps for each sensor individually, across the entire experimental period. They can be plotted from:
#
# * The `3_na` checkpoint - Handy for deciding which sensors are faulty to [go back and add them to `sl_faulty`](#faulty-sensors), or
#
# * The `finished` checkpoint - For examining the final product of TidySLAV on a per-sensor basis.
#
# Just uncomment the one you want, and comment out the other one.
# %%
checkpoint = "3_na"
#checkpoint = "finished"
# %% [markdown]
# For each sensor, we will produce three types of heatmaps:
#
# * **MIRO values** - circadian activity heatmap
#
# * **NA values** - periods when the sensors were disconnected
#
# * **Delta plots** - both of the above, but showing the difference between the two cage sensors
#
# The heatmaps use 15-minute bins, and each row is 24 hours.
#
# _Note: The code for calculating heatmap data is spaghetti test code adapted many times for different experiments. It was written from the start to be flexible so very few values are hardcoded (some are, though, e.g. the experimental period starts and ends at 6 AM), but, unfortunately, we haven't gotten around to cleaning it up yet so it's not very readable and well-documented at the moment. It's on our to-do list, though!_
# %% [markdown]
#
# #### Preparation (SPAGHETTI AHOY!)
# %% [markdown]
# You generally (hopefully) don't need to touch this part.
# %% [markdown]
# The following cell calculates and saves all the heatmap data we'll need.
# %%
df = pd.read_parquet(checkpoint_files[checkpoint])
#### produce activity heatmaps
testing = False
freq = "15min"
display_unit = "min"
if checkpoint == "finished":
timebin = resample_period_string
else:
timebin = input_data_period
freq_td = pd.to_timedelta(freq)
display_unit_td = pd.to_timedelta("1"+display_unit)
timebin_td = pd.to_timedelta(timebin)
# display unit coefficient
display_unit_coeff = display_unit_td/timebin_td
mirovals = df[[ts_column, "miro_value", "sensor_animal_id"]].copy()
mirovals["date"] = mirovals[ts_column].dt.date
unique_dates = np.sort(mirovals["date"].unique())
mirovals["is_na"] = mirovals["miro_value"].isna()
if mirovals[ts_column].min().time() < datetime.time(6, 0, 0):
    start_timestamp = mirovals["date"].min() - datetime.timedelta(days=1)
else:
    start_timestamp = mirovals["date"].min()
if mirovals[ts_column].max().time() >= datetime.time(6, 0, 0):
    end_timestamp = mirovals["date"].max() + datetime.timedelta(days=1)
else:
    end_timestamp = mirovals["date"].max()
start_timestamp = datetime.datetime.combine(start_timestamp, datetime.time(6, 0, 0))
end_timestamp = datetime.datetime.combine(end_timestamp, datetime.time(5, 59, 45))
time_range = pd.date_range(start=start_timestamp, end=end_timestamp, freq=freq)
data_heatmap = {}
data_delta = {}
data_na = {}
data_na_delta = {}
all_zmin = {"heatmap": 0,
"delta": 0,
"na": 0,
"na_delta": 0}
all_zmax = {"heatmap": 0,
"delta": 0,
"na": np.ceil(freq_td/display_unit_td),
"na_delta": 0}
def nan_to_None(x: np.ndarray):
x_obj = x.astype(object)
x_obj[np.isnan(x)] = None
return x_obj
for i, animal in enumerate(df["animal_id"].unique()):
    traces = {}
    pivot_tables = {"H": None, "L": None}
    pivot_tables_na = {"H": None, "L": None}
    data_na[animal] = {}
    data_heatmap[animal] = {}
    animal_values = mirovals.loc[(mirovals["sensor_animal_id"] == "H_"+animal) | (mirovals["sensor_animal_id"] == "L_"+animal), "miro_value"]
    animal_zmin = np.nanmin(animal_values)
    animal_zmax = np.nanmax(animal_values)
    for sensor in ["H", "L"]:
        sensor_animal_id = f"{sensor}_{animal}"
        # Filter data for the specific sensor and keep only the needed columns
        data_animal = mirovals[mirovals["sensor_animal_id"] == sensor_animal_id][[ts_column, "is_na", "miro_value"]].copy()
        data_animal["is_na"] = data_animal["is_na"].astype("int64")
        #data_animal[ts_column] = data_animal[ts_column].dt.time
        # Resample into bins of width `freq` and aggregate
        data_animal = data_animal.sort_values(by=ts_column)
        data_animal = data_animal.set_index(ts_column)
        data_animal = data_animal.resample(freq, origin="06:00").sum()
        # Make sampling intervals regular in case the df is just a collection of events that occurred at
        # random times. (Needed for VlaDiSlav, not needed for MIROSLAV)
        # Reindex the data with the defined time range and fill missing values with NaN
        data_animal = data_animal.reindex(time_range, fill_value=np.nan)
        # Set the "sensor_animal_id" column to the current id
        #data_animal["sensor_animal_id"] = animal
        # Replace NaN values in the "trials" column with 0
        #animal_data.loc[animal_data["trials"].isnull(), "trials"] = 0
        # Reset the index and rename the "index" column back to ts_column
        data_animal = data_animal.reset_index().rename({"index":ts_column}, axis="columns")
        data_animal["time"] = data_animal[ts_column].dt.strftime("%H:%M")
        data_animal["date"] = data_animal[ts_column].dt.date
        data_animal = data_animal.drop(ts_column, axis=1)
        # Pivot the table for the activity heatmap
        #heatmap_data_date = filled_data.pivot_table(index="animal", columns="t_start", values="impulsive_nosepokes", aggfunc="sum", dropna=False)
        heat_data_animal = data_animal.pivot_table(index="date", columns="time", values="miro_value", aggfunc="mean", dropna=False, sort=True)
        pivot_tables[sensor] = heat_data_animal.copy()
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            all_zmin["heatmap"] = np.nanmin((all_zmin["heatmap"], np.nanmin(heat_data_animal.values)))
            all_zmax["heatmap"] = np.nanmax((all_zmax["heatmap"], np.nanmax(heat_data_animal.values)))
        z = nan_to_None(heat_data_animal.values)
        # Create the heatmap using Plotly
        data_heatmap[animal][sensor] = go.Heatmap(z=z,
                                                  x=heat_data_animal.columns,
                                                  y=[x.isoformat() for x in heat_data_animal.index],
                                                  #zmin=heat_data_animal.values, zmax=mirovals["miro_value"].max(), #zmax=np.ceil(freq_td/display_unit_td),
                                                  zmin=animal_zmin, zmax=animal_zmax,
                                                  zauto=False,
                                                  xgap=0, ygap=0,
                                                  name=f"{sensor} sensor",
                                                  colorscale="Brwnyl",
                                                  colorbar=dict(ticksuffix=f" {display_unit}")
                                                  )
        # Pivot the table for the NA heatmap
        #heatmap_data_date = filled_data.pivot_table(index="animal", columns="t_start", values="impulsive_nosepokes", aggfunc="sum", dropna=False)
        na_data_animal = data_animal.pivot_table(index="date", columns="time", values="is_na", aggfunc="sum", dropna=False, sort=True)/display_unit_coeff
        pivot_tables_na[sensor] = na_data_animal.copy()
        all_zmin["na"] = np.nanmin((all_zmin["na"], np.nanmin(na_data_animal.values)))
        all_zmax["na"] = np.nanmax((all_zmax["na"], np.nanmax(na_data_animal.values)))
        # Create the heatmap using Plotly
        z = nan_to_None(na_data_animal.values)
        data_na[animal][sensor] = go.Heatmap(z=z,
                                             x=na_data_animal.columns,
                                             y=[x.isoformat() for x in na_data_animal.index],
                                             zmin=0, zmax=np.ceil(freq_td/display_unit_td),
                                             xgap=0, ygap=0,
                                             name=f"{sensor} sensor",
                                             colorscale="Brwnyl",
                                             colorbar=dict(ticksuffix=f" {display_unit}")
                                             )
    heatmap_delta = pivot_tables["H"]-pivot_tables["L"]
    heatmap_delta = heatmap_delta.copy()
    #all_zmin["delta"] = min(all_zmin["delta"], np.nanmin(heatmap_delta.values))
    #all_zmax["delta"] = max(all_zmax["delta"], np.nanmax(heatmap_delta.values))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        max_value = np.nanmax((np.abs(np.nanmin(heatmap_delta.values)), np.abs(np.nanmax(heatmap_delta.values))))
    # round up to the nearest 5
    max_value_round = (max_value//5)*5
    if max_value_round < max_value:
        max_value_round += 5
    zmax = np.nanmax((np.abs(max_value_round), all_zmax["delta"]))
    all_zmax["delta"] = zmax
    all_zmin["delta"] = -zmax
    z = nan_to_None(heatmap_delta.values)
    if np.isnan(max_value_round):
        max_value_round = 0
    data_delta[animal] = go.Heatmap(z=z,
                                    x=heatmap_delta.columns,
                                    y=[x.isoformat() for x in heatmap_delta.index],
                                    colorscale="Tealrose",
                                    zmin=-max_value_round, zmax=max_value_round,
                                    zauto=False,
                                    xgap=1, ygap=1,
                                    colorbar=dict(title="L"+" "*41+"H",
                                                  title_side="right")
                                    )
    na_delta = pivot_tables_na["H"]-pivot_tables_na["L"]
    max_value = np.nanmax((np.abs(na_delta.min().min()), np.abs(na_delta.max().max())))
    # round up to the nearest 5
    max_value_round = (max_value//5)*5
    if max_value_round < max_value:
        max_value_round += 5
    zmax = np.nanmax((np.abs(max_value_round), all_zmax["na_delta"]))
    all_zmax["na_delta"] = zmax
    all_zmin["na_delta"] = -zmax
    z = nan_to_None(na_delta.values)
    data_na_delta[animal] = go.Heatmap(z=z,
                                       x=na_delta.columns,
                                       y=[x.isoformat() for x in na_delta.index],
                                       colorscale="Tealrose",
                                       zmin=-max_value_round, zmax=max_value_round,
                                       zauto=False,
                                       xgap=1, ygap=1,
                                       colorbar=dict(title="L" + " "*41 + "H",
                                                     title_side="right"))
    #with open(wd / "nanplots" / f"{animal}.html", "a") as f:
    #    f.write(fig.to_html(full_html=False, include_plotlyjs='cdn'))
    #    f.write("<br><br>")
    #figs.append(fig)
    if testing and i > 1:
        break # Break after the second iteration for testing
#del(pivot_tables, pivot_tables_na)
# %% [markdown]
# This cell defines some handy functions for drawing the interactive plots.
# %%
def setup_hm(fig, autoscale=True, zmin=None, zmax=None):
    fig.add_traces([go.Heatmap(), go.Heatmap()], rows=[1, 2], cols=[1, 1])
    fig['layout']['yaxis']['title']='H sensor'
    fig['layout']['yaxis2']['title']='L sensor'
    fig.update_layout(plot_bgcolor="#ffffff",
                      height=800, width=800, autosize=False)
                      #margin=dict(l=100, r=50, t=50, b=50, pad=5))
    fig.update_yaxes(tickformat="%b %d", autorange="reversed")
    fig.update_xaxes(side="bottom",
                     tickmode="array",
                     #tickvals=[time(6, 0, 0), time(12, 0, 0), time(18, 0, 0), time(0, 0, 0)],
                     ticktext=[datetime.time(i, 0).strftime("%H:%M") for i in range(0,24)])
    if not autoscale:
        fig.update_traces(zmin=zmin, zmax=zmax)
def setup_delta_hm(fig, autoscale=True, zmin=None, zmax=None):
    fig.update_layout(plot_bgcolor="#ffffff",
                      height=500, width=800, autosize=False)
                      #margin=dict(l=100, r=50, t=100, b=50, pad=5))
    fig.update_yaxes(tickformat="%b %d", autorange="reversed")
    if not autoscale:
        fig.update_traces(zmin=zmin, zmax=zmax)
def update_hm(fig, plot_index, plot_indexes, data, title, checkpoint, autoscale=True, zmin=None, zmax=None):
    animal = plot_indexes[plot_index]
    #fig_delta.update(data=[data_delta[animal]])
    #fig_delta.update_layout(title=f'Miro NaNs: H-L sensor delta for {animal}<br><sup>positive: more NaNs in H, negative: more NaNs in L</sup>')
    if animal in data:
        if "H" in data[animal]:
            trace1 = data[animal]["H"]
        else:
            trace1 = go.Heatmap()
        if "L" in data[animal]:
            trace2 = data[animal]["L"]
        else:
            trace2 = go.Heatmap()
    else:
        trace1 = go.Heatmap()
        trace2 = go.Heatmap()
    fig.update(data=[trace1, trace2])
    if not autoscale:
        fig.update_traces(zmin=zmin, zmax=zmax, zauto=False)
    fig.update_layout(title=title.format(checkpoint=checkpoint, animal=animal))
    if IN_COLAB:
        fig.show()
def update_delta_hm(fig, plot_index, plot_indexes, data, title, checkpoint, autoscale=True, zmin=None, zmax=None):
    animal = plot_indexes[plot_index]
    #fig_delta.update(data=[data_delta[animal]])
    #fig_delta.update_layout(title=f'Miro NaNs: H-L sensor delta for {animal}<br><sup>positive: more NaNs in H, negative: more NaNs in L</sup>')
    if animal in data:
        trace1 = data[animal]
    else:
        trace1 = go.Heatmap()
    fig.update(data=[trace1])
    if not autoscale:
        fig.update_traces(zmin=zmin, zmax=zmax, zauto=False)
    fig.update_layout(title=title.format(checkpoint=checkpoint, animal=animal))
    if IN_COLAB:
        fig.show()
# %% [markdown]
# **We have survived the spaghetti.**
# %% [markdown]
#
# #### Plots
# %% [markdown]
# At the beginning of every cell, you can set the `autoscale` parameter.
#
# * `True` means that each plot's colour scale limits will correspond to that plot's minimum and maximum values. This is useful for examining plots individually.
#
# * `False` means that all plots from a single cell will use the same colour scale. This is useful for comparing different sensors.
#
# You can scroll through individual plots using the slider. If you click on it, you can use arrow keys on your keyboard as well.
# %% [markdown]
# ##### NA values, per sensor
# %%
autoscale = False
# %%
plot_indexes = dict(enumerate(data_na))
fig_na = go.FigureWidget(make_subplots(rows=2, cols=1,
shared_xaxes=True,
vertical_spacing=0.02))
setup_hm(fig_na, autoscale, all_zmin["na"], all_zmax["na"])
update_this_plot = lambda plot_index: update_hm(fig_na, plot_index, plot_indexes, data_na, "MIRO NAs (data: {checkpoint}): count for {animal} per sensor", checkpoint, autoscale, all_zmin["na"], all_zmax["na"])
if IN_COLAB:
    slider = widgets.IntSlider(value=0, min=0, max=len(data_na)-1, step=1, description="Plot index")
    w = widgets.interactive(update_this_plot, plot_index=slider)
    display(w)
else:
    # Fill the plot with the first dataset
    update_this_plot(0)
    # Create the slider and show it
    slider = widgets.IntSlider(value=0, min=0, max=len(data_na)-1, step=1, description="Plot index")
    display(slider)
    # Link the slider to the update function
    widgets.interactive(update_this_plot, plot_index=slider)
    # Show the plot
    display(fig_na)
# %% [markdown]
# ##### NA values, sensor deltas
# %%
autoscale = False
# %%
plot_indexes = dict(enumerate(data_na_delta))
fig_na_delta = go.FigureWidget(go.Heatmap())
setup_delta_hm(fig_na_delta, autoscale, all_zmin["na_delta"], all_zmax["na_delta"])
update_this_plot = lambda plot_index: update_delta_hm(fig_na_delta, plot_index, plot_indexes, data_na_delta, "MIRO NAs (data: {checkpoint}): H-L sensor delta for {animal}<br><sup>positive: more NAs in H, negative: more NAs in L</sup>", checkpoint, autoscale, all_zmin["na_delta"], all_zmax["na_delta"])
if IN_COLAB:
    slider = widgets.IntSlider(value=0, min=0, max=len(data_na_delta)-1, step=1, description="Plot index")
    w = widgets.interactive(update_this_plot, plot_index=slider)
    display(w)
else:
    # Fill the plot with the first dataset
    update_this_plot(0)
    # Create the slider and show it
    slider = widgets.IntSlider(value=0, min=0, max=len(data_na_delta)-1, step=1, description="Plot index")
    display(slider)
    # Link the slider to the update function
    widgets.interactive(update_this_plot, plot_index=slider)
    # Show the plot
    display(fig_na_delta)
# %% [markdown]
# ##### MIRO readings, per sensor
# %%
autoscale = False
# %%
plot_indexes = dict(enumerate(data_heatmap))
fig_hm = go.FigureWidget(make_subplots(rows=2, cols=1,
shared_xaxes=True,
vertical_spacing=0.02))
setup_hm(fig_hm, autoscale, all_zmin["heatmap"], all_zmax["heatmap"])
update_this_plot = lambda plot_index: update_hm(fig_hm, plot_index, plot_indexes, data_heatmap, "MIRO values (data: {checkpoint}): bin mean for {animal} per sensor", checkpoint, autoscale, all_zmin["heatmap"], all_zmax["heatmap"])
if IN_COLAB:
    slider = widgets.IntSlider(value=0, min=0, max=len(data_heatmap)-1, step=1, description="Plot index")
    w = widgets.interactive(update_this_plot, plot_index=slider)
    display(w)
else:
    # Fill the plot with the first dataset
    update_this_plot(0)
    # Create the slider and show it
    slider = widgets.IntSlider(value=0, min=0, max=len(data_heatmap)-1, step=1, description="Plot index")
    display(slider)
    # Link the slider to the update function
    widgets.interactive(update_this_plot, plot_index=slider)
    # Show the plot
    display(fig_hm)
# %% [markdown]
# ##### MIRO readings, sensor deltas
# %%
autoscale = False
# %%
plot_indexes = dict(enumerate(data_delta))
fig_hm_delta = go.FigureWidget(go.Heatmap())
setup_delta_hm(fig_hm_delta, autoscale, all_zmin["delta"], all_zmax["delta"])
update_this_plot = lambda plot_index: update_delta_hm(fig_hm_delta, plot_index, plot_indexes, data_delta, "MIRO values (data: {checkpoint}): H-L sensor delta for {animal}<br><sup>positive: more detections by H, negative: more detections by L</sup>", checkpoint, autoscale, all_zmin["delta"], all_zmax["delta"])
if IN_COLAB:
    slider = widgets.IntSlider(value=0, min=0, max=len(data_delta)-1, step=1, description="Plot index")
    w = widgets.interactive(update_this_plot, plot_index=slider)
    display(w)
else:
    # Fill the plot with the first dataset
    update_this_plot(0)
    # Create the slider and show it
    slider = widgets.IntSlider(value=0, min=0, max=len(data_delta)-1, step=1, description="Plot index")
    display(slider)
    # Link the slider to the update function
    widgets.interactive(update_this_plot, plot_index=slider)
    # Show the plot
    display(fig_hm_delta)
# %% [markdown]
#
# ### Standardisation tests
# %% [markdown]
# Check how well the standardisation worked as the final processing step.
# %%
df = pd.read_parquet(checkpoint_files["finished"])
# %% [markdown]
# Standardisation + baseline correction should've made the curves overlap as much as possible at baseline, and subsequently diverge if there is a treatment effect.
# %% [markdown]
# The curves will be really hard to compare when using narrow bins, so we can perform resampling again, this time only on the dataframe in memory that we've just loaded, thus making no permanent changes to our saved data (if we do want to change the saved data, we can set [`resample_period_string`](#Additional-resampling) appropriately and the [Further resampling code section](#Further-resampling) will do it for us).
# %%