Last updated: 2025-01-17

Checks: 6 passed, 1 warning

Knit directory: Genomic-Selection-for-Drought-Tolerance-Using-Genome-Wide-SNPs-in-Casava/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20221020) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 00c2e25. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/figure/GWS.Rmd/
    Ignored:    data/Articles/
    Ignored:    data/Artigo/
    Ignored:    data/allchrAR08.txt
    Ignored:    data/data.rar
    Ignored:    data/geno.rds
    Ignored:    data/pheno.rds
    Ignored:    output/BLUPS_density_med_row_col.png
    Ignored:    output/Density_residual_row_col.tiff
    Ignored:    output/Residuals_vs_fitted_row_col.tiff
    Ignored:    output/result_sommer_row_col_random.RDS
    Ignored:    output/varcomp_row_col.tiff

Untracked files:
    Untracked:  analysis/GWS_BayesA.Rmd
    Untracked:  analysis/GWS_BayesB.Rmd
    Untracked:  analysis/GWS_RF.Rmd
    Untracked:  analysis/clone_selection.Rmd

Unstaged changes:
    Modified:   output/BayesA_ETA_1_ScaleBayesA.dat
    Modified:   output/BayesA_mu.dat
    Modified:   output/BayesA_varE.dat

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


Configurations and packages

To perform the analyses, we will need the following packages:

library(furrr)
library(tidyverse)
library(cvTools)
library(kableExtra)
library(BGLR)

Data

geno <- readRDS("data/geno.rds")
pheno <- readRDS("data/pheno.rds")
traits <- colnames(pheno)[-1]

Bayes B

We can fit the Bayes B model in the Bayesian framework. For this purpose, we will use the BGLR package [@BGLR]. In this package, we need to build an object called “ETA”, in which we specify the linear predictors of the model and their priors.
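As a minimal sketch of how the ETA object is built and passed to BGLR (using simulated markers and phenotypes, not the cassava data; the short chain lengths are for illustration only):

```r
library(BGLR)

set.seed(100)
# Simulated data: 50 genotypes scored for 100 SNPs (coded 0/1/2) and one trait
X <- matrix(sample(0:2, 50 * 100, replace = TRUE), nrow = 50)
y <- rnorm(50)

# ETA: a list with one linear predictor, the marker matrix X under the Bayes B prior
ETA <- list(list(X = X, model = "BayesB"))

# Fit the model; BGLR writes sampler files (e.g. mu.dat) to the working directory
fit <- BGLR(y = y, ETA = ETA, nIter = 1000, burnIn = 200, verbose = FALSE)

head(fit$yHat)  # predicted values for the observations
```

Observations whose phenotype is set to NA are still predicted in `fit$yHat`, which is what the cross-validation below relies on.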

Cross-validation

To assess how accurate the predictions are, we should perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First, we separate the data into k folds. Then, we assign NA to one fold and try to predict its observations from the remaining folds. When choosing the number of folds, aim for a balanced number of observations in each fold.

In addition, this process should be repeated for further validation. The steps below will guide the CV in the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 5
nrept = 5

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.

2. Match the order of the data and the rows of the SNP matrix

We sort both by genotype name in increasing (numeric or alphabetical) order, so that each phenotype record lines up with the corresponding row of the SNP matrix.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE

3. Add a column indicating a number for each observation

This will be useful for assigning each observation to a fold, which is the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Folds assignment

In this step, we will assign each observation to a fold. Bear in mind that, for each repetition, the folds comprise different observations; the purpose of the repetitions is to ensure that the assignment is randomized differently each time.

For this step, we will use the cvTools package [@cvTools].

For better performance, we included this step inside the loop of the next step.
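As a standalone illustration of the fold assignment done inside the loop, here is `cvTools::cvFolds` applied to a hypothetical set of 20 observations:

```r
library(cvTools)

set.seed(100)
# Randomly assign 20 observations to 5 folds
folds <- cvFolds(20, type = "random", K = 5)

# Split the observation indices by fold membership
fold.list <- split(folds$subsets, f = folds$which)

lengths(fold.list)  # 4 observations per fold
```

Each element of `fold.list` holds the indices of one fold; in the analysis, those indices are the observations masked with NA in that CV round.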

Run model

The next step is the CV itself. Here, we will define the linear predictor and the lists that will be used in the loop. The first list, here called results_cv_BayesB, contains the results for each fold assignment built in the previous step. The second, GEBVS_BayesB, contains the GEBVs.

Then, we will construct the loop. Each iteration assigns NA to a different fold, and we use the remaining folds to predict the missing values. Note that the folds vary for each repetition and trait.

Set up parallel processing, then define and run the models to obtain the GEBVs. Attention: this process is time-consuming.

# Plan multisession processing for parallelization
plan(multisession)

# Start a list to store the results
results_list <-
  future_map(traits, .options = furrr_options(seed = 100), function(j)  {
    # Data pre-processing
    pheno2 <- pheno %>%
      select(ID_Clone, all_of(j)) %>%
      na.omit() %>%
      ungroup() %>%
      filter(ID_Clone %in% rownames(geno)) %>%
      mutate(ID = factor(1:nrow(.))) %>%
      droplevels()
    
    # Filter the rows of geno based on pheno2$ID_Clone
    ETA = list(list(X = geno[rownames(geno) %in% pheno2$ID_Clone, ], model = "BayesB"))
    
    # Seed for reproducibility
    set.seed(100)
    
    # Lists of folds, one per repetition
    fold.list <- lapply(1:nrept, function(a) {
      folds <- cvFolds(nlevels(pheno2$ID),
                       type = "random",
                       K = nfolds)
      
      split(folds$subsets, f = folds$which)
    })
    
    # Use future_map_dfr to run the cross-validation repetitions in parallel
    future_map_dfr(1:length(fold.list), function(z) {
      for (i in 1:nfolds) {
        # Training set
        train_data <- pheno2
        
        # Validation set: mask the current fold with NA
        train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
        
        # Fitting model
        CV_M <-
          BGLR(
            y = train_data[[j]],
            ETA = ETA,
            nIter = 20000,
            burnIn = 5000,
            thin = 5,
            verbose = F,
            saveAt = "output/BayesB_"
          )
        
        # GEBV
        Pred <-
          data.frame(
            Yhat = CV_M$yHat,
            ID = train_data$ID,
            ID_Clone = train_data$ID_Clone
          ) %>%
          filter(ID %in% fold.list[[z]][[i]])
        
        result1 <-
          tibble(
            GEBV = Pred$Yhat,
            ID = Pred$ID,
            ID_Clone = Pred$ID_Clone,
            Trait = j,
            rep = z,
            fold = i
          )
        
        # Accumulate the prediction results
        if (i == 1) {
          result <- result1
        } else {
          result <- rbind(result, result1)
        }
      }
      
      # Order the results by ID
      result <- result[order(result$ID, decreasing = F),]
      
      # Cross-validation results
      result_cv <- tibble(
        Trait = j,
        rep = z,
        Log = all(result$ID == pheno2$ID),
        Ac = round(cor(result$GEBV, pheno2[[j]], use = "na.or.complete"), 3),
        MSPE = round(mean((
          result$GEBV - pheno2[[j]]
        ) ^ 2, na.rm = TRUE), 3),
        result = list(result)
      )
    })
  })

# Combine all cross-validation results into a single structure
results_cv_BayesB <- do.call(rbind, results_list)

# Extract the GEBVs (Genomic Estimated Breeding Values) into a separate structure
GEBVS_BayesB <- do.call(rbind, results_cv_BayesB$result)

Results

The object “GEBVS_BayesB” contains the GEBVs for each trait and clone. The table below shows the first six clones and their average GEBVs for each trait.

GEBVS_BayesB |>
  group_by(Trait, ID_Clone) |>
  summarise(GEBV = mean(GEBV)) %>%
  pivot_wider(names_from = Trait, values_from = GEBV) %>%
  head() %>%
  kbl(escape = F, align = 'c') |>
  kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
`summarise()` has grouped output by 'Trait'. You can override using the
`.groups` argument.
| ID_Clone | DMC | FRY | HI | N_Roots | Nstem.Plant | Plant.Height | Root.Di | Root.Le | ShY | StC | StY | Stem.D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BGM_2042 | 0.4221528 | 1.1226259 | 1.6673786 | 0.3365653 | 0.1910706 | 0.0198205 | 0.3852329 | 0.9828070 | 1.6615206 | 0.4322724 | 0.2786178 | 0.0596624 |
| BGM0368.250437186 | 0.7210614 | -0.5003791 | 0.5863924 | 0.0493258 | -0.0206214 | -0.0151336 | -0.5668654 | -0.6252787 | -2.7935867 | 0.6956083 | -0.0522265 | -0.0404363 |
| BGM0424.250437187 | 0.1404585 | -0.2338929 | -0.0630925 | -0.2161889 | -0.0595420 | 0.0005057 | -0.5305112 | -0.5022821 | -1.7013112 | 0.1613155 | -0.0213632 | -0.0189943 |
| BGM0611.250437190 | -0.9767817 | -0.2149977 | -0.5346830 | -0.0229744 | 0.0674738 | -0.0567443 | -0.2465179 | 0.0802446 | -0.6964862 | -0.9717067 | -0.0749044 | -0.0729285 |
| BGM0652.250437861 | -0.1125248 | -0.1292787 | -0.4897120 | -0.1226576 | 0.0428889 | -0.0320396 | -0.5445785 | -0.1824430 | -0.2076394 | -0.1105305 | -0.0417151 | -0.0250116 |
| BGM0699.250437199 | -0.1998181 | -0.3962469 | -0.9569924 | -0.3755002 | -0.0598799 | -0.0101787 | -0.3868621 | -0.1280144 | -0.6278451 | -0.2131864 | -0.1028979 | 0.0334348 |

Accuracy and MSPE for each trait and repetition:

results_cv_BayesB[-6] |>
  group_by(Trait) %>% 
  summarise_all(mean) %>% 
  arrange(desc(Ac), MSPE) %>% 
  kbl(escape = F, align = 'c') |>
  kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
| Trait | rep | Log | Ac | MSPE |
|---|---|---|---|---|
| Root.Le | 3 | 1 | 0.3654 | 1.8142 |
| Stem.D | 3 | 1 | 0.3464 | 0.0076 |
| ShY | 3 | 1 | 0.3202 | 12.5772 |
| FRY | 3 | 1 | 0.3194 | 1.1634 |
| Root.Di | 3 | 1 | 0.3128 | 3.3362 |
| StY | 3 | 1 | 0.3004 | 0.0710 |
| Plant.Height | 3 | 1 | 0.2972 | 0.0074 |
| Nstem.Plant | 3 | 1 | 0.2628 | 0.0280 |
| N_Roots | 3 | 1 | 0.2494 | 0.8614 |
| HI | 3 | 1 | 0.2390 | 11.9912 |
| StC | 3 | 1 | 0.2116 | 5.8994 |
| DMC | 3 | 1 | 0.2094 | 5.9386 |

The object “results_cv_BayesB” stores one row per trait and repetition. “rep” is the repetition number, “Log” is a diagnostic indicating whether the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (the correlation between the GEBVs and the adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).
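For instance, the two metrics can be computed from a pair of vectors (hypothetical numbers, matching the formulas used inside the loop):

```r
# Adjusted means (observed) and predicted GEBVs for five hypothetical clones
obs  <- c(10.2, 11.5, 9.8, 12.1, 10.9)
gebv <- c(10.0, 11.9, 10.1, 11.8, 10.5)

# Prediction accuracy: correlation between GEBVs and adjusted means
Ac <- round(cor(gebv, obs, use = "na.or.complete"), 3)  # 0.923

# Mean square prediction error: average squared difference
MSPE <- round(mean((gebv - obs)^2, na.rm = TRUE), 3)    # 0.108
```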

Let’s now move on to fitting the other models. Since this can be time-consuming, I’ve parallelized the processing and put each model in its own script.


sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BGLR_1.1.3          kableExtra_1.4.0    cvTools_0.3.3      
 [4] robustbase_0.99-4-1 lattice_0.22-6      lubridate_1.9.4    
 [7] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
[10] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
[13] tibble_3.2.1        ggplot2_3.5.1       tidyverse_2.0.0    
[16] furrr_0.3.1         future_1.34.0      

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      xfun_0.50         bslib_0.8.0       tzdb_0.4.0       
 [5] vctrs_0.6.5       tools_4.3.3       generics_0.1.3    parallel_4.3.3   
 [9] DEoptimR_1.1-3-1  pkgconfig_2.0.3   truncnorm_1.0-9   lifecycle_1.0.4  
[13] compiler_4.3.3    git2r_0.35.0      munsell_0.5.1     codetools_0.2-20 
[17] httpuv_1.6.15     htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10      
[21] later_1.4.1       pillar_1.10.1     jquerylib_0.1.4   MASS_7.3-60.0.1  
[25] cachem_1.1.0      parallelly_1.41.0 tidyselect_1.2.1  digest_0.6.37    
[29] stringi_1.8.4     listenv_0.9.1     rprojroot_2.0.4   fastmap_1.2.0    
[33] grid_4.3.3        colorspace_2.1-1  cli_3.6.3         magrittr_2.0.3   
[37] withr_3.0.2       scales_1.3.0      promises_1.3.2    timechange_0.3.0 
[41] rmarkdown_2.29    globals_0.16.3    workflowr_1.7.1   hms_1.1.3        
[45] evaluate_1.0.3    knitr_1.49        viridisLite_0.4.2 rlang_1.1.4      
[49] Rcpp_1.0.14       glue_1.8.0        xml2_1.3.6        svglite_2.1.3    
[53] rstudioapi_0.17.1 jsonlite_1.8.9    R6_2.5.1          systemfonts_1.1.0
[57] fs_1.6.5         

  1. Weverton Gomes da Costa, Pós-Doutorando, Embrapa Mandioca e Fruticultura, ↩︎