5-Fold CV split Implementation #11

Open
adiv5 opened this issue Jul 18, 2024 · 1 comment

Comments


adiv5 commented Jul 18, 2024

Hi,
As mentioned in the paper

Each split was stratified according to the sample site to mitigate potential batch artifacts

In the code provided, I can't work out how the split was done. How was the site information sourced for each sample?

It would be great if you could point to the snippet you used to make the splits.

Note: I was going through this, and there the tissue source site is described as the component of the barcode that comes right after "TCGA". Could you confirm that this is correct?

Thanks in advance

Collaborator

ajv012 commented Dec 16, 2024

Yes, you are correct that the case name contains the tissue source site. However, some cohorts have 70+ sites, so we followed Howard et al., Nat. Commun., 2021 to make the site-stratified splits.
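For anyone landing here, a minimal sketch of reading the site out of a case name, assuming the standard TCGA barcode layout in which the second hyphen-separated field is the tissue source site (TSS) code. The helper name `tissue_source_site` and the case IDs are made up for illustration; this is not code from the repository.

```python
from collections import Counter


def tissue_source_site(barcode: str) -> str:
    """Return the TSS code, i.e. the field right after "TCGA" in a barcode."""
    parts = barcode.strip().upper().split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return parts[1]


# Hypothetical case IDs, for illustration only.
case_ids = [
    "TCGA-A1-0001",
    "TCGA-A1-0002",
    "TCGA-B2-0003",
    "TCGA-C3-0004-01Z-00-DX1",  # longer slide-level IDs parse the same way
]

# Number of cases contributed by each site.
site_counts = Counter(tissue_source_site(c) for c in case_ids)
```

Counting cases per site like this is a quick way to see how many sites a cohort has and how unevenly cases are spread across them.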

As a note to the community, it is imperative that we as a field move away from simple cross-validation folds for TCGA-related tasks. Such folds cause train-test leakage because cases from the same site end up in both the train and test sets, which can artificially inflate performance. We strongly advise the community to use site-stratified splits combined with external testing (for example, on CPTAC cohorts).
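To illustrate the leakage point, here is a rough sketch of fold assignment that keeps every site inside a single fold. It is a simple greedy heuristic (largest sites first, each placed in the currently smallest fold), not the procedure from Howard et al. and not the code used for the paper. It balances fold size only, not label distribution. The function name `site_grouped_folds` and the case IDs are hypothetical.

```python
from collections import defaultdict


def site_grouped_folds(case_ids, n_folds=5):
    """Map each case ID to a fold index so that no site spans two folds."""
    # Group cases by TSS code (second field of the TCGA barcode).
    by_site = defaultdict(list)
    for case in case_ids:
        by_site[case.split("-")[1]].append(case)
    if len(by_site) < n_folds:
        raise ValueError("fewer sites than folds; some folds would be empty")

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Largest sites first; ties broken by site code for reproducibility.
    ordered = sorted(by_site.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for site, cases in ordered:
        k = min(range(n_folds), key=fold_sizes.__getitem__)  # smallest fold
        for case in cases:
            fold_of[case] = k
        fold_sizes[k] += len(cases)
    return fold_of


# Hypothetical cohort: six made-up sites with uneven case counts.
site_sizes = [("A1", 5), ("B2", 3), ("C3", 3), ("D4", 2), ("E5", 2), ("F6", 1)]
cases = [f"TCGA-{site}-{i:04d}" for site, n in site_sizes for i in range(n)]
folds = site_grouped_folds(cases, n_folds=3)

# Hold out fold 0 for testing; train on the remaining folds.
test_cases = [c for c, k in folds.items() if k == 0]
train_cases = [c for c, k in folds.items() if k != 0]
```

With this kind of assignment, no site contributes cases to both `train_cases` and `test_cases`. If label balance across folds also matters, scikit-learn's `StratifiedGroupKFold` with the site code as the group is an off-the-shelf option worth considering.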
