bfcquery returns inconsistent column types for empty rows #26

Open
omsai opened this issue Jun 29, 2020 · 3 comments

omsai commented Jun 29, 2020

The create_time and access_time columns are character vectors when the query result is non-empty, and double vectors when it is empty.
I would expect them to have the same type in both cases, perhaps always character vectors, although it's not clear why they are not date or datetime types instead.
The inconsistent types cause an error when row-binding multiple queries with purrr::map_df if some of the queries return matches and others return none:

> files_remote
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_minus.bw"
[2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_plus.bw" 
> map_df(files_remote, bfcquery, x = bfc)
Error: Can't combine `create_time` <character> and `create_time` <double>.
Run `rlang::last_error()` to see where the error occurred.
> map_df(files_remote[1], bfcquery, x = bfc)
# A tibble: 1 x 10
  rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
  <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
1 BFC6  ftp:… 2020-06-29… 2020-06-29… /hom… web   ftp:…               NA NA   
# … with 1 more variable: expires <dbl>
> map_df(files_remote[2], bfcquery, x = bfc)
# A tibble: 0 x 10
# … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>, etag <chr>, expires <dbl>
>  
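
To see which columns disagree before binding, the column classes of each result can be compared. The following is a minimal base R sketch, not part of BiocFileCache: `column_class_conflicts` is a hypothetical helper, and `hit` and `miss` are made-up stand-ins for a non-empty and an empty `bfcquery` result.

```r
# Report columns whose class differs across a list of data frames.
# `column_class_conflicts` is a hypothetical helper, not a BiocFileCache function.
column_class_conflicts <- function(dfs) {
  # Only compare columns that every data frame has.
  cols <- Reduce(intersect, lapply(dfs, names))
  classes <- vapply(
    cols,
    function(col) {
      # First class of this column in each data frame, de-duplicated.
      seen <- unique(vapply(dfs, function(df) class(df[[col]])[1], character(1)))
      paste(seen, collapse = " / ")
    },
    character(1)
  )
  # Keep only the columns where more than one class was seen.
  classes[grepl(" / ", classes, fixed = TRUE)]
}

# Stand-ins for a non-empty and an empty query result (placeholder values).
hit <- data.frame(rid = "BFC1", create_time = "2020-01-01 00:00:00",
                  stringsAsFactors = FALSE)
miss <- data.frame(rid = character(0), create_time = numeric(0),
                   stringsAsFactors = FALSE)

conflicts <- column_class_conflicts(list(hit, miss))
print(conflicts)
```

With real results, something like `column_class_conflicts(lapply(files_remote, bfcquery, x = bfc))` should list the affected columns, assuming `bfcquery` returns data-frame-like objects as shown above.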
Author

omsai commented Jun 29, 2020

I'm a little behind on my R installation and can update if you can't reproduce the problem:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: PureOS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] usethis_1.6.1        tidyr_1.1.0          tibble_3.0.1        
 [4] stringr_1.4.0        purrr_0.3.4          dplyr_1.0.0         
 [7] rtracklayer_1.44.4   GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 
[10] IRanges_2.18.3       S4Vectors_0.22.1     GEOquery_2.52.0     
[13] Biobase_2.44.0       BiocGenerics_0.30.0  evolength_0.0.0.9000
[16] testthat_2.3.2      

loaded via a namespace (and not attached):
 [1] httr_1.4.1                  pkgload_1.1.0              
 [3] bit64_0.9-7                 Rdpack_0.11-1              
 [5] assertthat_0.2.1            BiocFileCache_1.8.0        
 [7] blob_1.2.1                  GenomeInfoDbData_1.2.1     
 [9] Rsamtools_2.0.3             remotes_2.1.1              
[11] sessioninfo_1.1.1           lattice_0.20-41            
[13] pillar_1.4.4                RSQLite_2.2.0              
[15] backports_1.1.7             glue_1.4.1                 
[17] limma_3.40.6                digest_0.6.25              
[19] XVector_0.24.0              Matrix_1.2-18              
[21] XML_3.99-0.3                pkgconfig_2.0.3            
[23] devtools_2.3.0              bibtex_0.4.2.2             
[25] zlibbioc_1.30.0             processx_3.4.2             
[27] BiocParallel_1.18.1         generics_0.0.2             
[29] ellipsis_0.3.1              withr_2.2.0                
[31] SummarizedExperiment_1.14.1 cli_2.0.2                  
[33] magrittr_1.5                crayon_1.3.4               
[35] memoise_1.1.0               ps_1.3.3                   
[37] fs_1.4.1                    fansi_0.4.1                
[39] xml2_1.3.2                  pkgbuild_1.0.8             
[41] tools_3.6.3                 prettyunits_1.1.1          
[43] hms_0.5.3                   matrixStats_0.56.0         
[45] gbRd_0.4-11                 lifecycle_0.2.0            
[47] DelayedArray_0.10.0         callr_3.4.3                
[49] Biostrings_2.52.0           RcppHMM_1.2.2              
[51] compiler_3.6.3              rlang_0.4.6                
[53] grid_3.6.3                  RCurl_1.98-1.2             
[55] rstudioapi_0.11             rappdirs_0.3.1             
[57] bitops_1.0-6                DBI_1.1.0                  
[59] curl_4.3                    R6_2.4.1                   
[61] GenomicAlignments_1.20.1    utf8_1.1.4                 
[63] bit_1.1-15.2                rprojroot_1.3-2            
[65] readr_1.3.1                 desc_1.2.0                 
[67] stringi_1.4.6               Rcpp_1.0.4.6               
[69] vctrs_0.3.0                 dbplyr_1.4.4               
[71] tidyselect_1.1.0           
> 

Contributor

lshep commented Apr 1, 2021

Sorry for the long delay. I'm looking into this and I'm not quite sure how to correct it. It looks like a bug that appears when using dplyr::filter, which somehow changes the column types.

> tbl
# Source:   table<resource> [?? x 11]
# Database: sqlite 3.35.2
#   [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
      id rid   rname  create_time access_time rpath rtype fpath last_modified_t…
   <int> <chr> <chr>  <chr>       <chr>       <chr> <chr> <chr> <chr>           
 1     1 BFC1  annot… 2020-07-20… 2021-03-30… 534a… web   http… 2021-03-15 14:4…
 2     2 BFC2  annot… 2020-07-20… 2021-03-30… 534a… rela… 534a… NA              
 3     4 BFC4  AH800… 2020-07-27… 2021-03-30… 21c6… web   http… NA              


> tbl %>% dplyr::filter(rid == NA_character_)
# Source:   lazy query [?? x 11]
# Database: sqlite 3.35.2
#   [/home/shepherd/.cache/BiocFileCache/BiocFileCache.sqlite]
# … with 11 variables: id <int>, rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>, etag <chr>, expires <dbl>

Author

omsai commented Dec 6, 2024

If I omit dplyr::filter, an empty BiocFileCache still defaults to double for the time columns and, as in the second bfcquery below, for columns containing only NA. Can empty results keep a consistent type?

library(purrr)
library(stringr)
library(BiocFileCache)

path <- tempfile()

bfc <- BiocFileCache(path, ask = FALSE)

files_remote <-
  str_c(file.path("ftp://ftp.ncbi.nlm.nih.gov",
                  "geo/samples/GSM1480nnn/GSM1480327/suppl",
                  "GSM1480327_K562_PROseq_"),
        c("minus", "plus"),
        ".bw")

map_df(files_remote, bfcquery, x = bfc)
#> # A tibble: 0 × 10
#> # ℹ 10 variables: rid <chr>, rname <chr>, create_time <dbl>, access_time <dbl>,
#> #   rpath <chr>, rtype <chr>, fpath <chr>, last_modified_time <dbl>,
#> #   etag <chr>, expires <dbl>

bfcadd(bfc, files_remote[1])
#> |======================================================================| 100%
#> BFC1 
#> "/tmp/RtmpDRIP5H/file2ff2a62a8acdc8/2ff2a64678a220_GSM1480327_K562_PROseq_minus.bw"

map_df(files_remote[1], bfcquery, x = bfc)
#> # A tibble: 1 × 10
#>   rid   rname create_time access_time rpath rtype fpath last_modified_time etag 
#>   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>              <dbl> <chr>
#> 1 BFC1  ftp:… 2024-12-06… 2024-12-06… /tmp… web   ftp:…                 NA NA   
#> # ℹ 1 more variable: expires <dbl>

map_df(files_remote, bfcquery, x = bfc)
#> Error in `dplyr::bind_rows()`:
#> ! Can't combine `..1$create_time` <character> and `..2$create_time` <double>.
#> Run `rlang::last_trace()` to see where the error occurred.
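
Until empty results keep a consistent type, one possible user-side workaround is to coerce the time columns to character before binding. This is a sketch under assumptions, not a fix in BiocFileCache itself: `as_character_cols` is a hypothetical helper, and `hit` and `miss` are made-up stand-ins for a non-empty and an empty query result.

```r
# Coerce the named columns to character so that empty and non-empty
# results have matching types. `as_character_cols` is a hypothetical helper.
as_character_cols <- function(df, cols = c("create_time", "access_time")) {
  # Skip any requested column that is not present in this result.
  for (col in intersect(cols, names(df))) {
    df[[col]] <- as.character(df[[col]])
  }
  df
}

# Stand-ins for a non-empty and an empty query result (placeholder values).
hit <- data.frame(rid = "BFC1",
                  create_time = "2020-01-01 00:00:00",
                  access_time = "2020-01-01 00:00:00",
                  stringsAsFactors = FALSE)
miss <- data.frame(rid = character(0),
                   create_time = numeric(0),
                   access_time = numeric(0),
                   stringsAsFactors = FALSE)

# Normalise each result, then row-bind them.
fixed <- lapply(list(hit, miss), as_character_cols)
combined <- do.call(rbind, fixed)
str(combined)
```

With the objects from this thread, the equivalent call would presumably be `map_df(files_remote, function(f) as_character_cols(bfcquery(bfc, f)))`. The `cols` argument can be extended if other columns, such as `last_modified_time`, turn out to vary in the same way.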
