Last updated: 2022-11-18

Checks: 6 1

Knit directory: Genomic-Selection-for-Drought-Tolerance-Using-Genome-Wide-SNPs-in-Casava/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has staged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20221020) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 38077f1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/allchrAR08.txt

Untracked files:
    Untracked:  analysis/figure/
    Untracked:  output/G_matrix.csv
    Untracked:  output/accuracy3.png
    Untracked:  output/accuracy4.png
    Untracked:  output/results_cv3.csv
    Untracked:  output/results_cv4.csv

Unstaged changes:
    Modified:   analysis/GWS.Rmd
    Modified:   analysis/phenotype.Rmd
    Modified:   analysis/phenotype_multi.Rmd
    Deleted:    output/BLUPS.RDS
    Deleted:    output/firts_results/accuracy.png
    Deleted:    output/firts_results/accuracy_multi.png
    Deleted:    output/herdabilidade.csv
    Deleted:    output/resultMM.Rdata

Staged changes:
    Modified:   analysis/GWS.Rmd
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-11-1.png
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-15-1.png
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-16-1.png
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-19-1.png
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-20-1.png
    Deleted:    analysis/figure/GWS.Rmd/unnamed-chunk-21-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-11-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-11-2.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-12-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-12-2.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-13-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-15-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-16-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-21-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-5-1.png
    Deleted:    analysis/figure/phenotype.Rmd/unnamed-chunk-7-1.png
    Modified:   analysis/gws-multi-enviroment.Rmd
    Modified:   data/genomic selection.pptx
    New:        output/accuracy.png
    New:        output/accuracy2.png
    New:        output/firts_results/accuracy.png
    New:        output/firts_results/accuracy_multi.png
    Modified:   output/results_cv.csv
    New:        output/results_cv2.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/GWS.Rmd) and HTML (docs/GWS.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 90dc112 WevertonGomesCosta 2022-11-17 Update
Rmd 6cc4d23 WevertonGomesCosta 2022-11-17 Update
html 6cc4d23 WevertonGomesCosta 2022-11-17 Update
Rmd 1936c51 WevertonGomesCosta 2022-11-11 Update
html 1936c51 WevertonGomesCosta 2022-11-11 Update
Rmd 62bf2ac WevertonGomesCosta 2022-11-11 Update
html 62bf2ac WevertonGomesCosta 2022-11-11 Update
Rmd 5988c27 WevertonGomesCosta 2022-11-11 Update
Rmd bf7b1d3 WevertonGomesCosta 2022-11-11 Update
html bf7b1d3 WevertonGomesCosta 2022-11-11 Update

To perform the analyses, we will need the following packages:

library(readxl)
require(tidyverse)
library(kableExtra)
library(janitor)
library(genomicMateSelectR)
library(AGHmatrix)
require(ComplexHeatmap)
require(rrBLUP)
library(ggpmisc)
library(cvTools)
theme_set(theme_bw())


Data

The data set is based in genotypes evalueted in five years (2016 to 2020), each year was considered as environment.

Names marker data

Primeiro vamos buscar os ID dos marcadores para cada clone.

names <-
  read_excel("data/Phenotyping.xlsx", sheet = "GBS") %>%
  rename(Clone = `Names trials Petrolina`,
         ID_Clone = `Nome GBS`) %>%
  mutate(ID_Clone = str_replace_all(ID_Clone, ":", ".")) %>%
  select(Clone, ID_Clone)

names %>%
  head() %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
Clone ID_Clone
4271 4271.250494233
9624-09 962409.250370255
Aipim Abacate ERETA.250437577
Alagoana Alagoana363.250437472
BGM-0004 CNPMF4.250370278
BGM-0019 CNPMF19.250370327

Phenotypic data

Agora vamos agrupar os IDs dos marcadores com os nomes dos clones.

pheno <- read.csv("output/BLUPS.csv") %>%
  inner_join(names) %>% # Join Phenotypic data with Genotypics names (ID_Clones) of the Clones
  mutate(Clone = as.factor(Clone),
         ID_Clone = as.factor(ID_Clone))
Joining, by = "Clone"
pheno %>%
  head() %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
Clone NR.P PTR PPA MS PROD.AMD AP HI AMD Porte RF CR DR DCaule Ácaro Nº.Hastes Stand_Final Branching_Level Staygreen ID_Clone
4271 -1.0991002 -0.9232850 1.2370042 4.1458588 0.0261588 -0.0519819 -4.7190501 4.1704796 0.1990073 -0.0646686 0.3402402 -1.684985 0.1064159 0.0098488 -0.2106357 0.2585191 0.2189612 -0.0017225 4271.250494233
9624-09 2.5448509 -1.2179385 2.1813479 0.0766224 -0.4243017 NA -5.3432293 -0.0078057 0.6737550 NA -0.2501030 -3.687546 NA NA NA NA -0.2181686 -0.0500388 962409.250370255
Aipim Abacate 1.0022448 1.0140978 1.1807247 0.9060262 0.2158898 0.3327817 4.4575307 0.9313454 -0.7791874 0.2490219 0.8804850 -1.147279 -0.0315661 -0.1144109 1.0019293 -0.3944873 -0.3775248 -0.0017225 ERETA.250437577
Alagoana -0.6253802 0.1014609 0.1866389 NA NA -0.0272367 0.2111938 NA NA -0.0402352 0.3013914 2.457162 -0.0021300 -0.1964558 NA NA NA NA Alagoana363.250437472
BGM-0004 -0.6980727 -3.3388524 -2.2929597 0.4748982 -1.0998681 -0.1925518 -13.6431414 0.4197789 -0.3558847 0.2427626 -6.1505681 -4.975497 -0.1898330 0.1316063 0.8123351 0.1444067 -0.0193399 0.0242331 CNPMF4.250370278
BGM-0019 -2.8554890 -2.4587770 -5.7454297 -3.5447503 -0.1443845 -0.1536869 -14.0193447 -3.5290772 0.3620398 -0.0646686 -5.4186360 -3.088693 -0.2775123 -0.1144109 -0.7880476 -0.0679841 0.4177899 -0.0017225 CNPMF19.250370327

Genotypic data

Agora vamos carregar os dados genotípicos dos marcadores GBS e corrigir os valores dos pares de base. Além disso, também vamos dividir a coluna alleles em duas colunas, para o alelo de referencia e o recessivo. E vamos selecionar as colunas com os nomes dos marcadores, alelos de referencia e as colunas com os IDs dos clones de acordo com os dados dos BLUPs.

geno <- read.table("data/allchrAR08.txt", header = T)

geno <- geno %>%
  mutate_at(vars(12:ncol(geno)), funs(
    case_when(
      . == "A" ~ "AA",
      . == "R" ~ "AG",
      . == "W" ~ "AT",
      . == "M" ~ "AC",
      . == "C" ~ "CC",
      . == "S" ~ "CG",
      . == "Y" ~ "CT",
      . == "G" ~ "GG",
      . == "K" ~ "GT",
      . == "T" ~ "TT",
    )
  )) %>%
  separate(alleles, c("reference", "recess")) %>%
  select(rs, reference, recess, any_of(pheno$ID_Clone))

geno[1:5, 1:20] %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
rs reference recess ERETA.250437577 Alagoana363.250437472 CNPMF4.250370278 CNPMF19.250370327 CNPMF30.250370293 CNPMF32.250370305 CNPMF36.250370328 CNPMF46.250370283 CNPMF48.250370295 CNPMF56.250370248 CNPMF65.250370285 CNPMF66.250370296 CNPMF74.250370321 CNPMF80.250370250 CNPMF84.250370263 CNPMF89.250370287 CNPMF91.250370299
S1_84637 A G AA AA AA AG AA AA AA AA AA AA AA AA AA AA AA AA AA
S1_84843 C A CC CC CC CC AC CC CC CC CC CC CC CC CC CC CC CC CC
S1_126260 A C AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
S1_126261 A G AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA
S1_126264 T G TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT

Agora precisamos fazer a conversão de pares de base para dosagem alélica de acordo com o alelo de referência. Também vou adcionar a coluna rs como nome das colunas. Depois vou excluir as colunas dos alelos de reference e recess.

geno <- geno %>%
  mutate_at(vars(4:ncol(geno)), funs(case_when(
    . == paste(reference, reference, sep = "") ~ 2,
    . == paste(recess, recess, sep = "") ~ 0,
    T ~ 1
  ))) %>%
  column_to_rownames("rs") %>%
  select(-c(reference, recess))

geno[1:5, 1:15] %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
ERETA.250437577 Alagoana363.250437472 CNPMF4.250370278 CNPMF19.250370327 CNPMF30.250370293 CNPMF32.250370305 CNPMF36.250370328 CNPMF46.250370283 CNPMF48.250370295 CNPMF56.250370248 CNPMF65.250370285 CNPMF66.250370296 CNPMF74.250370321 CNPMF80.250370250 CNPMF84.250370263
S1_84637 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
S1_84843 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
S1_126260 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
S1_126261 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
S1_126264 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Para converter no formato para realizar as análises de GWS temos que transpor a matriz de marcadores.

geno <- geno %>%
  t()

geno[1:5, 1:5] %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
S1_84637 S1_84843 S1_126260 S1_126261 S1_126264
ERETA.250437577 2 2 2 2 2
Alagoana363.250437472 2 2 2 2 2
CNPMF4.250370278 2 2 2 2 2
CNPMF19.250370327 1 2 2 2 2
CNPMF30.250370293 2 1 2 2 2

Vamos verificar quantos clones apresentam dados genotipados com os marcadores.

geno %>%
  dim() %>%
  t() %>%
  kbl(
    escape = F,
    align = 'c',
    col.names = c("Number of Clones", "Number of markers")
  ) %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
Number of Clones Number of markers
414 27045

Com a filtragem dos genótipos em comum, temos 414 genótipos e 27405 marcas.

Agora vamos filtrar os SNPS usando MAF de 0.01 e verificar quantos marcadores se manterão.

geno <- maf_filter(geno, thresh = 0.01)

geno %>%
  dim() %>%
  t() %>%
  kbl(
    escape = F,
    align = 'c',
    col.names = c("Number of Clones", "Number of markers")
  ) %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
  )
Number of Clones Number of markers
414 22779

Com o filtro de MAF a 1% restaram 22779 marcadores. Vou salvar a matriz agora para que possamos carregar ela caso seja necessário.

saveRDS(geno, "data/SNPs.rds")
geno<-readRDS("data/SNPs.rds")

Genomic selection

For this purpose, we will use only individuals with BLUps and SNPs available.

pheno <- pheno %>%
  filter(ID_Clone %in% rownames(geno)) %>%
  droplevels()

Building the G matrix

Again, we will use the AGHmatrix package [@amadeu_aghmatrix_2016] to build the G matrix:

G_matrix <- Gmatrix(geno,
                    method = "VanRaden",
                    ploidy = 2,
                    missingValue = NA)
Initial data: 
    Number of Individuals: 414 
    Number of Markers: 22779 

Missing data check: 
    Total SNPs: 22779 
     0 SNPs dropped due to missing data threshold of 1 
    Total of: 22779  SNPs 
MAF check: 
    No SNPs with MAF below 0 
Monomorphic check: 
    No monomorphic SNPs 
Summary check: 
    Initial:  22779 SNPs 
    Final:  22779  SNPs ( 0  SNPs removed) 
 
Completed! Time = 15.76  seconds 
write.csv(G_matrix, "output/G_matrix.csv", row.names = F, quote = F)

Now we have the whole G matrix (414 x 414), which we can represent using a heatmap:

Heatmap(
  G_matrix,
  show_row_names = F,
  show_column_names = F,
  heatmap_legend_param = list(title = "Res")
)

“Res” in the heatmap legend title is for “Resemblance”.

RRBLUP

The Ridge Regression BLUP, or RRBLUP, will predict the marker effect. In the RRBLUP, we will use the matrix of markers directly.

For this purpose, we will use the rrBLUP package [@endelman_2011]. In the code below, y is for the vector with the means, Z is where we will insert the SNPs matrix, K is for a covariance matrix for the random effects, which will be and identity matrix by default; and X is a design matrix for the fixed effects, which will be a vector of ones (1) by default. Note that we are returning to the “1, 0, -1” codification in the SNPs matrix. This is a requirement of the rrBlUP package.

result_RR_BLUP <- data.frame(
  Clone = unique(pheno$Clone),
  ID_Clone = unique(pheno$ID_Clone),
  stringsAsFactors = F
)

traits <- colnames(pheno)[2:19]

for (i in traits) {
  RRBLUP <- mixed.solve(y = pheno[[i]],
                        Z = geno - 1,
                        K = NULL,
                        X = NULL)
  GEBV <- geno %*% RRBLUP$u
  result <- data.frame(ID_Clone = rownames(GEBV),
                       stringsAsFactors = F)
  result[, i] <- data.frame(GEBV)
  result_RR_BLUP <-
    merge(result_RR_BLUP, result, by = "ID_Clone", all.x = T)
}

GEBV x BLUP

The scatter plot above represents the additive genetic value of each marker. Once we have acquired these values, we may calculate the Genomic Estimated Breeding Values (GEBV) of the genotypes. Aqui, iremos adicionar os valores médios dos valores fenotípicos aos GEBVs obtidos pelo RR-BLUP e aos BLUPs para uma melhor comparação. These are the product of the SNPs matrix with the vector of the markers’ genetic values:

media_pheno <- read.table("output/media_pheno.csv")

colnames(media_pheno) <-
  c(colnames(result_RR_BLUP[3:ncol(result_RR_BLUP)]))

for (i in traits) {
  result_RR_BLUP[i] <- result_RR_BLUP[i] + media_pheno[, i]
}

GEBVS_RR_BLUP <- result_RR_BLUP %>%
  pivot_longer(NR.P:Staygreen,
               names_to = "Variable",
               values_to = "GEBVS")

BLUPS_MED <- pheno

for (i in traits) {
  BLUPS_MED[i] <- BLUPS_MED[i] + media_pheno[, i]
}

BLUPS_MED <- BLUPS_MED %>%
  pivot_longer(NR.P:Staygreen,
               names_to = "Variable",
               values_to = "BLUPS")

data_gws_RR_BLUP <- GEBVS_RR_BLUP %>%
  full_join(BLUPS_MED)
Joining, by = c("ID_Clone", "Clone", "Variable")

Agora vamos produzir os gráficos para cada variável comparando a correlação entre os GEBVs e BLUPs.

data_gws_RR_BLUP %>%
  ggplot(aes(x = BLUPS, y = GEBVS)) +
  geom_point(aes(color = GEBVS), show.legend = F) +
  geom_smooth(method = lm, se = F) +
  stat_poly_eq(formula = y ~ x, label.y = "bottom") +
  scale_color_gradient(low = '#c80000', high = '#FFFF00') +
  facet_wrap(. ~ Variable, ncol = 6, scales = "free") +
  expand_limits(y = 0, x = 0) +
  ggtitle("RR-BLUP")
`geom_smooth()` using formula 'y ~ x'

Agora vamos comparar os modelos por meio de um boxplot.

data_gws_RR_BLUP %>%
  pivot_longer(4:5, names_to = "Method", values_to = "Values") %>%
  ggplot(aes(x = , y = Values, fill = Method)) +
  geom_boxplot() +
  facet_wrap(. ~ Variable, ncol = 6, scales = "free")  +
  expand_limits(y = 0, x = 0) +
  ggtitle("RR-BLUP")

GBLUP

In the GBLUP, we will use the G matrix instead of the SNPs matrix. Thus, we will obtain the GEBV directly. Note that we will need to build the G matrix again, since some genotypes were dropped after our filtering. The rrBLUP package has a function called “A.mat” that build the Additive Genomic Matrix from a SNP matrix with “-1,0,1” codification:

result_G_BLUP <- data.frame(
  Clone = unique(pheno$Clone),
  ID_Clone = unique(pheno$ID_Clone),
  stringsAsFactors = F
)

for (i in traits) {
  GBLUP <- mixed.solve(pheno[[i]], K = A.mat(geno - 1))
  result <- data.frame(ID_Clone = rownames(GBLUP$u),
                       stringsAsFactors = F)
  result[, i] <- data.frame(GBLUP$u)
  result_G_BLUP <-
    merge(result_G_BLUP, result, by = "ID_Clone", all.x = T)
}

Aqui, iremos adicionar novamente os valores médios dos valores fenotípicos aos GEBVs obtidos pelo G-BLUP.

for(i in traits) {
  result_G_BLUP[i] <- result_G_BLUP[i] + media_pheno[, i]
}

GEBVS_G_BLUP <- result_G_BLUP %>%
  pivot_longer(NR.P:Staygreen,
               names_to = "Variable",
               values_to = "GEBVS")

data_gws_G_BLUP <- GEBVS_G_BLUP %>%
  full_join(BLUPS_MED
  )
Joining, by = c("ID_Clone", "Clone", "Variable")

Agora vamos produzir os gráficos para cada variável comparando a correlação entre os GEBVs e BLUPs.

data_gws_G_BLUP %>%
  ggplot(aes(x = BLUPS, y = GEBVS)) +
  geom_point(aes(color = GEBVS), show.legend = F) +
  geom_smooth(method = lm, se = F) +
  stat_poly_eq(formula = y ~ x, label.y = "bottom") +
  scale_color_gradient(low = '#c80000', high = '#FFFF00') +
  facet_wrap(. ~ Variable, ncol = 6, scales = "free") +
  expand_limits(y = 0, x = 0) +
  ggtitle("G-BLUP")
`geom_smooth()` using formula 'y ~ x'

Agora vamos comparar os modelos por meio de um boxplot.

data_gws_G_BLUP %>%
  pivot_longer(4:5, names_to = "Method", values_to = "Values") %>%
  ggplot(aes(x = , y = Values, fill = Method)) +
  geom_boxplot() +
  facet_wrap(. ~ Variable, ncol = 6, scales = "free")  +
  expand_limits(y = 0, x = 0) +
  ggtitle("G-BLUP")

Agora vamos produzir os gráficos para cada variável comparando a correlação entre os GEBVs do RR-BLUP e GEBVs do G-BLUP.

GEBVS_RR_BLUP <- GEBVS_RR_BLUP %>%
  rename(GEBVS_RR_BLUP = GEBVS)

data_gws <- GEBVS_RR_BLUP %>%
  full_join(GEBVS_G_BLUP)
Joining, by = c("ID_Clone", "Clone", "Variable")
data_gws %>%
  ggplot(aes(x = GEBVS_RR_BLUP, y = GEBVS)) +
  geom_point() +
  geom_smooth(method = lm, se = F) +
  stat_poly_eq(formula = y ~ x, label.y = "bottom") +
  facet_wrap(. ~ Variable, ncol = 6, scales = "free") +
  expand_limits(y = 0, x = 0) +
  ggtitle("G-BLUP")
`geom_smooth()` using formula 'y ~ x'

Cross-validation 5 folds 5 rep

To prove that the prediction is accurate, we should perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First we separate the data into k folds. Then, we attribute NA for one fold and try to predict the data from this fold based on the others. When selecting the number of folds, one must prioritize the balance between the number of observations in each fold.

In addition, this process should be repeated for further validation. The step-by-step below will guide the CV in the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 5
nrept = 5

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.

2. Match the order of the data and the rows of the SNP matrix

The order is decreasing or increasing (numeric or alphabetical) regarding the name of the genotypes.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE

3. Add a column indicating a number for each observation

This will be useful to assign each observation for a fold, which will be the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Folds assignment

In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetition is to make sure of the randomness of the assignment step.

In this step, we will use the cvTools package [@cvTools]

set.seed(100)

sort <- list()

for (a in 1:nrept) {
  for (j in 1:nfolds) {
    folds <- cvFolds(nlevels(pheno$ID),
                     type = "random",
                     K = nfolds,
                     R = 1)
    Sample <- cbind(folds$which, folds$subsets)
    cv <- split(Sample[, 2], f = Sample[, 1])
  }
  sort[[a]] <- cv
}

rm(a, folds, j, cv, Sample)

5. Cross-validation

The next step is the very CV. Here, we will define the linear predictor and the lists that will be useful in the loop. The first list, here called “fold.list”, contains the folds assignation that we built in the previous step. The second (“results_cv”) is empty and will store the outputs of each iteration of the loop.

fold.list <- sort
results <- list()
results_cv <- data.frame()
traits <- colnames(media_pheno)

Then, we will construct the loop. Each iteration will assign NA for a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition.

for(j in traits) {
  for (z in 1:length(fold.list)) {
    for (i in 1:nfolds) {
      # Training set
      train_data <- pheno
      
      # Validation set
      train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
      
      # Fitting model
      GBLUP <- mixed.solve(train_data[[j]], K = A.mat(geno - 1))
      
      # GEBV
      Pred <- data.frame(Yhat = GBLUP$u, G = pheno$ID)
      
      rownames(Pred) <- rownames(geno)
      
      # Predicted GEBV
      results[[i]] <- Pred[Pred[, "G"] %in% fold.list[[z]][[i]],]
      
      # Remove
      #rmGBLUP, Pred, train_data)
    }
    
    GEBV <- do.call(rbind, results)
    GEBV <- GEBV[order(GEBV$G),]
    
    # Log
    log <- all(GEBV$G == pheno$ID)
    
    # Results
    result_cv <- data.frame(
      Trait = j,
      Rep = z,
      Log = log,
      Ac = round(cor(GEBV$Yhat, train_data[[j]], use = "na.or.complete"), 3),
      MSPE = round(mean(((GEBV$Yhat - train_data[[j]]) ^ 2
      ), na.rm = T), 3)
    )
    
    results_cv <-
      rbind(results_cv, result_cv)
  }
}

results_cv %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
    
  )
Trait Rep Log Ac MSPE
NR.P 1 TRUE 0.201 1.633
NR.P 2 TRUE 0.309 1.464
NR.P 3 TRUE 0.288 1.498
NR.P 4 TRUE 0.283 1.629
NR.P 5 TRUE 0.225 1.663
PTR 1 TRUE 0.323 3.729
PTR 2 TRUE 0.367 3.738
PTR 3 TRUE 0.392 3.471
PTR 4 TRUE 0.352 3.526
PTR 5 TRUE 0.295 3.752
PPA 1 TRUE 0.339 21.231
PPA 2 TRUE 0.321 22.917
PPA 3 TRUE 0.367 21.571
PPA 4 TRUE 0.304 20.694
PPA 5 TRUE 0.315 20.841
MS 1 TRUE 0.244 9.896
MS 2 TRUE 0.282 9.997
MS 3 TRUE 0.248 9.808
MS 4 TRUE 0.306 9.073
MS 5 TRUE 0.207 10.733
PROD.AMD 1 TRUE 0.285 0.325
PROD.AMD 2 TRUE 0.313 0.323
PROD.AMD 3 TRUE 0.360 0.295
PROD.AMD 4 TRUE 0.326 0.306
PROD.AMD 5 TRUE 0.256 0.312
AP 1 TRUE 0.329 0.016
AP 2 TRUE 0.323 0.017
AP 3 TRUE 0.345 0.018
AP 4 TRUE 0.307 0.017
AP 5 TRUE 0.327 0.016
HI 1 TRUE 0.243 47.302
HI 2 TRUE 0.307 43.743
HI 3 TRUE 0.286 43.342
HI 4 TRUE 0.297 43.804
HI 5 TRUE 0.178 48.268
AMD 1 TRUE 0.245 9.875
AMD 2 TRUE 0.285 9.974
AMD 3 TRUE 0.250 9.794
AMD 4 TRUE 0.307 9.063
AMD 5 TRUE 0.209 10.710
Porte 1 TRUE 0.338 0.229
Porte 2 TRUE 0.382 0.214
Porte 3 TRUE 0.358 0.212
Porte 4 TRUE 0.361 0.231
Porte 5 TRUE 0.293 0.226
RF 1 TRUE 0.251 0.027
RF 2 TRUE 0.243 0.028
RF 3 TRUE 0.233 0.030
RF 4 TRUE 0.229 0.028
RF 5 TRUE 0.252 0.029
CR 1 TRUE 0.344 4.065
CR 2 TRUE 0.387 3.808
CR 3 TRUE 0.343 4.320
CR 4 TRUE 0.359 4.142
CR 5 TRUE 0.344 4.271
DR 1 TRUE 0.304 9.342
DR 2 TRUE 0.323 9.496
DR 3 TRUE 0.320 8.959
DR 4 TRUE 0.255 9.009
DR 5 TRUE 0.214 10.105
DCaule 1 TRUE 0.373 0.024
DCaule 2 TRUE 0.347 0.025
DCaule 3 TRUE 0.367 0.026
DCaule 4 TRUE 0.303 0.028
DCaule 5 TRUE 0.366 0.027
Ácaro 1 TRUE 0.370 0.035
Ácaro 2 TRUE 0.371 0.038
Ácaro 3 TRUE 0.412 0.033
Ácaro 4 TRUE 0.378 0.033
Ácaro 5 TRUE 0.409 0.035
Nº.Hastes 1 TRUE 0.328 0.132
Nº.Hastes 2 TRUE 0.325 0.135
Nº.Hastes 3 TRUE 0.315 0.134
Nº.Hastes 4 TRUE 0.255 0.142
Nº.Hastes 5 TRUE 0.274 0.136
Stand_Final 1 TRUE 0.236 0.217
Stand_Final 2 TRUE 0.243 0.239
Stand_Final 3 TRUE 0.153 0.216
Stand_Final 4 TRUE 0.149 0.232
Stand_Final 5 TRUE 0.252 0.222
Branching_Level 1 TRUE 0.396 0.122
Branching_Level 2 TRUE 0.376 0.130
Branching_Level 3 TRUE 0.361 0.141
Branching_Level 4 TRUE 0.346 0.137
Branching_Level 5 TRUE 0.412 0.130
Staygreen 1 TRUE 0.266 0.000
Staygreen 2 TRUE 0.187 0.000
Staygreen 3 TRUE 0.214 0.000
Staygreen 4 TRUE 0.198 0.000
Staygreen 5 TRUE 0.109 0.000
write.csv(results_cv, "output/results_cv.csv", row.names = F, quote = F)

The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “Rep” is the number of the repetition, “Log” is a diagnostic indicating if the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBV and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).

Agora vamos plotar os resultados de acurácias para cada característica

results_cv %>%
  ggplot(aes(x = Trait, y = Ac , fill = Trait)) +
  geom_boxplot() +
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(
    angle = 60,
    vjust = 1,
    hjust = 1
  ),
  text = element_text(size = 15)) +
  labs(y = "Accuracy") + 
  guides(fill = "none")

ggsave(
  "output/accuracy.png",
  width = 16,
  height = 8,
  dpi = 300
)

Cross-validation 10 folds 10 rep

To prove that the prediction is accurate, we should perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First we separate the data into k folds. Then, we attribute NA for one fold and try to predict the data from this fold based on the others. When selecting the number of folds, one must prioritize the balance between the number of observations in each fold.

In addition, this process should be repeated for further validation. The step-by-step below will guide the CV in the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 10
nrept = 10

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.

2. Match the order of the data and the rows of the SNP matrix

The order is decreasing or increasing (numeric or alphabetical) regarding the name of the genotypes.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE

3. Add a column indicating a number for each observation

This will be useful to assign each observation for a fold, which will be the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Folds assignment

In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetition is to make sure of the randomness of the assignment step.

In this step, we will use the cvTools package [@cvTools]

set.seed(100)

sort <- list()

for (a in 1:nrept) {
  for (j in 1:nfolds) {
    folds <- cvFolds(nlevels(pheno$ID),
                     type = "random",
                     K = nfolds,
                     R = 1)
    Sample <- cbind(folds$which, folds$subsets)
    cv <- split(Sample[, 2], f = Sample[, 1])
  }
  sort[[a]] <- cv
}

rm(a, folds, j, cv, Sample)

5. Cross-validation

The next step is the very CV. Here, we will define the linear predictor and the lists that will be useful in the loop. The first list, here called “fold.list”, contains the folds assignation that we built in the previous step. The second (“results_cv”) is empty and will store the outputs of each iteration of the loop.

fold.list <- sort
results <- list()
results_cv2 <- data.frame()
traits <- colnames(media_pheno)

Then, we will construct the loop. Each iteration will assign NA for a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition.

for(j in traits) {
  for (z in 1:length(fold.list)) {
    for (i in 1:nfolds) {
      # Training set
      train_data <- pheno
      
      # Validation set
      train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
      
      # Fitting model
      GBLUP <- mixed.solve(train_data[[j]], K = A.mat(geno - 1))
      
      # GEBV
      Pred <- data.frame(Yhat = GBLUP$u, G = pheno$ID)
      
      rownames(Pred) <- rownames(geno)
      
      # Predicted GEBV
      results[[i]] <- Pred[Pred[, "G"] %in% fold.list[[z]][[i]],]
      
      # Remove
      #rmGBLUP, Pred, train_data)
    }
    
    GEBV <- do.call(rbind, results)
    GEBV <- GEBV[order(GEBV$G),]
    
    # Log
    log <- all(GEBV$G == pheno$ID)
    
    # Results
    result_cv <- data.frame(
      Trait = j,
      Rep = z,
      Log = log,
      Ac = round(cor(GEBV$Yhat, train_data[[j]], use = "na.or.complete"), 3),
      MSPE = round(mean(((GEBV$Yhat - train_data[[j]]) ^ 2
      ), na.rm = T), 3)
    )
    
    results_cv2 <-
      rbind(results_cv2, result_cv)
  }
}

results_cv2 %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
    
  )
Trait Rep Log Ac MSPE
NR.P 1 TRUE 0.291 1.480
NR.P 2 TRUE 0.258 1.576
NR.P 3 TRUE 0.244 1.650
NR.P 4 TRUE 0.271 1.554
NR.P 5 TRUE 0.257 1.552
NR.P 6 TRUE 0.281 1.489
NR.P 7 TRUE 0.271 1.579
NR.P 8 TRUE 0.282 1.535
NR.P 9 TRUE 0.286 1.550
NR.P 10 TRUE 0.304 1.616
PTR 1 TRUE 0.349 3.774
PTR 2 TRUE 0.357 3.583
PTR 3 TRUE 0.324 3.780
PTR 4 TRUE 0.341 3.575
PTR 5 TRUE 0.352 3.659
PTR 6 TRUE 0.370 3.612
PTR 7 TRUE 0.405 3.412
PTR 8 TRUE 0.340 3.683
PTR 9 TRUE 0.365 3.435
PTR 10 TRUE 0.383 3.438
PPA 1 TRUE 0.337 21.937
PPA 2 TRUE 0.335 20.650
PPA 3 TRUE 0.299 22.702
PPA 4 TRUE 0.335 22.768
PPA 5 TRUE 0.367 19.932
PPA 6 TRUE 0.339 21.828
PPA 7 TRUE 0.347 21.693
PPA 8 TRUE 0.288 22.502
PPA 9 TRUE 0.354 21.435
PPA 10 TRUE 0.347 21.449
MS 1 TRUE 0.282 10.131
MS 2 TRUE 0.262 10.155
MS 3 TRUE 0.230 10.729
MS 4 TRUE 0.223 10.597
MS 5 TRUE 0.243 9.996
MS 6 TRUE 0.262 10.185
MS 7 TRUE 0.242 10.381
MS 8 TRUE 0.249 10.504
MS 9 TRUE 0.287 9.985
MS 10 TRUE 0.209 10.472
PROD.AMD 1 TRUE 0.320 0.319
PROD.AMD 2 TRUE 0.334 0.311
PROD.AMD 3 TRUE 0.284 0.322
PROD.AMD 4 TRUE 0.309 0.316
PROD.AMD 5 TRUE 0.316 0.318
PROD.AMD 6 TRUE 0.334 0.303
PROD.AMD 7 TRUE 0.327 0.313
PROD.AMD 8 TRUE 0.295 0.312
PROD.AMD 9 TRUE 0.314 0.295
PROD.AMD 10 TRUE 0.309 0.310
AP 1 TRUE 0.317 0.017
AP 2 TRUE 0.328 0.017
AP 3 TRUE 0.349 0.018
AP 4 TRUE 0.311 0.017
AP 5 TRUE 0.343 0.017
AP 6 TRUE 0.326 0.017
AP 7 TRUE 0.359 0.017
AP 8 TRUE 0.333 0.017
AP 9 TRUE 0.337 0.017
AP 10 TRUE 0.321 0.018
HI 1 TRUE 0.294 44.290
HI 2 TRUE 0.290 43.571
HI 3 TRUE 0.276 46.070
HI 4 TRUE 0.305 41.078
HI 5 TRUE 0.276 44.543
HI 6 TRUE 0.319 43.163
HI 7 TRUE 0.323 42.088
HI 8 TRUE 0.322 43.122
HI 9 TRUE 0.272 44.295
HI 10 TRUE 0.314 42.444
AMD 1 TRUE 0.283 10.121
AMD 2 TRUE 0.263 10.143
AMD 3 TRUE 0.230 10.724
AMD 4 TRUE 0.225 10.584
AMD 5 TRUE 0.244 9.986
AMD 6 TRUE 0.263 10.170
AMD 7 TRUE 0.243 10.370
AMD 8 TRUE 0.250 10.494
AMD 9 TRUE 0.288 9.975
AMD 10 TRUE 0.210 10.466
Porte 1 TRUE 0.402 0.215
Porte 2 TRUE 0.372 0.225
Porte 3 TRUE 0.384 0.216
Porte 4 TRUE 0.350 0.221
Porte 5 TRUE 0.395 0.213
Porte 6 TRUE 0.389 0.216
Porte 7 TRUE 0.358 0.221
Porte 8 TRUE 0.385 0.214
Porte 9 TRUE 0.357 0.217
Porte 10 TRUE 0.361 0.223
RF 1 TRUE 0.275 0.028
RF 2 TRUE 0.275 0.029
RF 3 TRUE 0.311 0.027
RF 4 TRUE 0.303 0.029
RF 5 TRUE 0.326 0.028
RF 6 TRUE 0.271 0.027
RF 7 TRUE 0.289 0.027
RF 8 TRUE 0.286 0.027
RF 9 TRUE 0.277 0.028
RF 10 TRUE 0.328 0.025
CR 1 TRUE 0.348 4.271
CR 2 TRUE 0.374 4.161
CR 3 TRUE 0.344 4.335
CR 4 TRUE 0.326 4.538
CR 5 TRUE 0.376 4.117
CR 6 TRUE 0.377 4.200
CR 7 TRUE 0.384 4.216
CR 8 TRUE 0.340 4.375
CR 9 TRUE 0.360 4.213
CR 10 TRUE 0.378 4.304
DR 1 TRUE 0.275 9.507
DR 2 TRUE 0.268 8.986
DR 3 TRUE 0.301 9.211
DR 4 TRUE 0.284 8.539
DR 5 TRUE 0.301 9.049
DR 6 TRUE 0.346 8.955
DR 7 TRUE 0.352 8.529
DR 8 TRUE 0.315 8.944
DR 9 TRUE 0.310 9.157
DR 10 TRUE 0.291 8.925
DCaule 1 TRUE 0.358 0.025
DCaule 2 TRUE 0.355 0.026
DCaule 3 TRUE 0.346 0.027
DCaule 4 TRUE 0.344 0.026
DCaule 5 TRUE 0.369 0.026
DCaule 6 TRUE 0.358 0.025
DCaule 7 TRUE 0.364 0.026
DCaule 8 TRUE 0.351 0.026
DCaule 9 TRUE 0.374 0.026
DCaule 10 TRUE 0.408 0.025
Ácaro 1 TRUE 0.384 0.036
Ácaro 2 TRUE 0.406 0.034
Ácaro 3 TRUE 0.338 0.036
Ácaro 4 TRUE 0.371 0.036
Ácaro 5 TRUE 0.401 0.035
Ácaro 6 TRUE 0.402 0.035
Ácaro 7 TRUE 0.428 0.034
Ácaro 8 TRUE 0.362 0.036
Ácaro 9 TRUE 0.372 0.035
Ácaro 10 TRUE 0.353 0.036
Nº.Hastes 1 TRUE 0.314 0.138
Nº.Hastes 2 TRUE 0.268 0.143
Nº.Hastes 3 TRUE 0.263 0.137
Nº.Hastes 4 TRUE 0.322 0.129
Nº.Hastes 5 TRUE 0.270 0.136
Nº.Hastes 6 TRUE 0.284 0.133
Nº.Hastes 7 TRUE 0.259 0.135
Nº.Hastes 8 TRUE 0.295 0.138
Nº.Hastes 9 TRUE 0.295 0.137
Nº.Hastes 10 TRUE 0.309 0.141
Stand_Final 1 TRUE 0.222 0.240
Stand_Final 2 TRUE 0.161 0.223
Stand_Final 3 TRUE 0.235 0.237
Stand_Final 4 TRUE 0.167 0.226
Stand_Final 5 TRUE 0.177 0.220
Stand_Final 6 TRUE 0.222 0.236
Stand_Final 7 TRUE 0.163 0.242
Stand_Final 8 TRUE 0.172 0.237
Stand_Final 9 TRUE 0.230 0.243
Stand_Final 10 TRUE 0.203 0.242
Branching_Level 1 TRUE 0.398 0.127
Branching_Level 2 TRUE 0.394 0.128
Branching_Level 3 TRUE 0.354 0.133
Branching_Level 4 TRUE 0.359 0.133
Branching_Level 5 TRUE 0.356 0.128
Branching_Level 6 TRUE 0.360 0.124
Branching_Level 7 TRUE 0.371 0.129
Branching_Level 8 TRUE 0.364 0.124
Branching_Level 9 TRUE 0.414 0.122
Branching_Level 10 TRUE 0.371 0.134
Staygreen 1 TRUE 0.198 0.000
Staygreen 2 TRUE 0.219 0.000
Staygreen 3 TRUE 0.239 0.000
Staygreen 4 TRUE 0.215 0.000
Staygreen 5 TRUE 0.199 0.000
Staygreen 6 TRUE 0.213 0.000
Staygreen 7 TRUE 0.196 0.000
Staygreen 8 TRUE 0.213 0.000
Staygreen 9 TRUE 0.255 0.000
Staygreen 10 TRUE 0.156 0.000
write.csv(results_cv2, "output/results_cv2.csv", row.names = F, quote = F)

The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “Rep” is the number of the repetition, “Log” is a diagnostic indicating if the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBV and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).

Agora vamos plotar os resultados de acurácias para cada característica

results_cv2 %>%
  ggplot(aes(x = Trait, y = Ac , fill = Trait)) +
  geom_boxplot() +
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(
    angle = 60,
    vjust = 1,
    hjust = 1
  ),
  text = element_text(size = 15)) +
  labs(y = "Accuracy") + 
  guides(fill = "none")

ggsave(
  "output/accuracy2.png",
  width = 16,
  height = 8,
  dpi = 300
)

RHKS

We can also use the eigenvalues of the G covariance matrix to perform the analyses. But first, we have to build G:

X<-scale(geno,center=TRUE,scale=TRUE)

# Gaussian Kernel (GK)

dist<-as.matrix(dist(X))^2

GK<-exp(-dist/median(dist))

GK[1:5,1:5]
                      Alagoana363.250437472 Arari424.250437476  BGM_2042
Alagoana363.250437472             1.0000000          0.4274471 0.3888266
Arari424.250437476                0.4274471          1.0000000 0.3833092
BGM_2042                          0.3888266          0.3833092 1.0000000
BGM0290.250581760                 0.3544169          0.3449295 0.3623902
BGM0368.250437186                 0.3193199          0.3275549 0.3878667
                      BGM0290.250581760 BGM0368.250437186
Alagoana363.250437472         0.3544169         0.3193199
Arari424.250437476            0.3449295         0.3275549
BGM_2042                      0.3623902         0.3878667
BGM0290.250581760             1.0000000         0.3407408
BGM0368.250437186             0.3407408         1.0000000

Cross-validation 5 folds 5 rep

To prove that the prediction is accurate, we should perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First we separate the data into k folds. Then, we attribute NA for one fold and try to predict the data from this fold based on the others. When selecting the number of folds, one must prioritize the balance between the number of observations in each fold.

In addition, this process should be repeated for further validation. The step-by-step below will guide the CV in the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 5
nrept = 5

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.

2. Match the order of the data and the rows of the SNP matrix

The order is decreasing or increasing (numeric or alphabetical) regarding the name of the genotypes.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
GK <- GK[order(row.names(GK)),]
all(rownames(GK) == pheno$ID_Clone)
[1] TRUE

3. Add a column indicating a number for each observation

This will be useful to assign each observation for a fold, which will be the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Folds assignment

In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetition is to make sure of the randomness of the assignment step.

In this step, we will use the cvTools package [@cvTools]

set.seed(100)

sort <- list()

for (a in 1:nrept) {
  for (j in 1:nfolds) {
    folds <- cvFolds(nlevels(pheno$ID),
                     type = "random",
                     K = nfolds,
                     R = 1)
    Sample <- cbind(folds$which, folds$subsets)
    cv <- split(Sample[, 2], f = Sample[, 1])
  }
  sort[[a]] <- cv
}

rm(a, folds, j, cv, Sample)

5. Cross-validation

The next step is the very CV. Here, we will define the linear predictor and the lists that will be useful in the loop. The first list, here called “fold.list”, contains the folds assignation that we built in the previous step. The second (“results_cv”) is empty and will store the outputs of each iteration of the loop.

fold.list <- sort
results <- list()
results_cv3 <- data.frame()
traits <- colnames(media_pheno)

Then, we will construct the loop. Each iteration will assign NA for a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition.

for(j in traits) {
  for (z in 1:length(fold.list)) {
    for (i in 1:nfolds) {
      # Training set
      train_data <- pheno
      
      # Validation set
      train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
      
      # Fitting model
      RHKS <- mixed.solve(train_data[[j]], K = GK)
      
      # GEBV
      Pred <- data.frame(Yhat = RHKS$u, G = pheno$ID)
      
      rownames(Pred) <- rownames(geno)
      
      # Predicted GEBV
      results[[i]] <- Pred[Pred[, "G"] %in% fold.list[[z]][[i]],]
      
      # Remove
      #rmRHKS, Pred, train_data)
    }
    
    GEBV <- do.call(rbind, results)
    GEBV <- GEBV[order(GEBV$G),]
    
    # Log
    log <- all(GEBV$G == pheno$ID)
    
    # Results
    result_cv <- data.frame(
      Trait = j,
      Rep = z,
      Log = log,
      Ac = round(cor(GEBV$Yhat, train_data[[j]], use = "na.or.complete"), 3),
      MSPE = round(mean(((GEBV$Yhat - train_data[[j]]) ^ 2
      ), na.rm = T), 3)
    )
    
    results_cv3 <-
      rbind(results_cv3, result_cv)
  }
}
results_cv3 %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
    
  )
Trait Rep Log Ac MSPE
NR.P 1 TRUE 0.190 1.775
NR.P 2 TRUE 0.264 1.624
NR.P 3 TRUE 0.287 1.650
NR.P 4 TRUE 0.244 1.734
NR.P 5 TRUE 0.216 1.821
PTR 1 TRUE 0.315 3.892
PTR 2 TRUE 0.356 3.864
PTR 3 TRUE 0.416 3.592
PTR 4 TRUE 0.344 3.631
PTR 5 TRUE 0.301 3.867
PPA 1 TRUE 0.349 21.849
PPA 2 TRUE 0.325 23.935
PPA 3 TRUE 0.386 22.645
PPA 4 TRUE 0.307 22.099
PPA 5 TRUE 0.320 21.776
MS 1 TRUE 0.236 10.019
MS 2 TRUE 0.278 10.203
MS 3 TRUE 0.232 10.112
MS 4 TRUE 0.296 9.163
MS 5 TRUE 0.145 11.289
PROD.AMD 1 TRUE 0.266 0.338
PROD.AMD 2 TRUE 0.303 0.330
PROD.AMD 3 TRUE 0.391 0.298
PROD.AMD 4 TRUE 0.309 0.312
PROD.AMD 5 TRUE 0.248 0.321
AP 1 TRUE 0.300 0.018
AP 2 TRUE 0.346 0.018
AP 3 TRUE 0.366 0.018
AP 4 TRUE 0.319 0.018
AP 5 TRUE 0.313 0.017
HI 1 TRUE 0.251 46.889
HI 2 TRUE 0.303 44.048
HI 3 TRUE 0.303 42.820
HI 4 TRUE 0.290 44.187
HI 5 TRUE 0.198 47.569
AMD 1 TRUE 0.236 9.998
AMD 2 TRUE 0.280 10.176
AMD 3 TRUE 0.232 10.094
AMD 4 TRUE 0.297 9.151
AMD 5 TRUE 0.148 11.248
Porte 1 TRUE 0.341 0.240
Porte 2 TRUE 0.368 0.223
Porte 3 TRUE 0.361 0.216
Porte 4 TRUE 0.354 0.243
Porte 5 TRUE 0.287 0.242
RF 1 TRUE 0.226 0.028
RF 2 TRUE 0.211 0.029
RF 3 TRUE 0.202 0.030
RF 4 TRUE 0.208 0.028
RF 5 TRUE 0.232 0.030
CR 1 TRUE 0.342 4.437
CR 2 TRUE 0.390 4.185
CR 3 TRUE 0.355 4.716
CR 4 TRUE 0.362 4.395
CR 5 TRUE 0.344 4.707
DR 1 TRUE 0.307 9.495
DR 2 TRUE 0.318 9.625
DR 3 TRUE 0.329 9.081
DR 4 TRUE 0.257 9.110
DR 5 TRUE 0.210 10.397
DCaule 1 TRUE 0.360 0.026
DCaule 2 TRUE 0.357 0.028
DCaule 3 TRUE 0.375 0.029
DCaule 4 TRUE 0.306 0.030
DCaule 5 TRUE 0.352 0.030
Ácaro 1 TRUE 0.358 0.037
Ácaro 2 TRUE 0.347 0.041
Ácaro 3 TRUE 0.399 0.035
Ácaro 4 TRUE 0.354 0.035
Ácaro 5 TRUE 0.381 0.038
Nº.Hastes 1 TRUE 0.332 0.166
Nº.Hastes 2 TRUE 0.319 0.166
Nº.Hastes 3 TRUE 0.304 0.169
Nº.Hastes 4 TRUE 0.233 0.177
Nº.Hastes 5 TRUE 0.269 0.177
Stand_Final 1 TRUE 0.233 0.218
Stand_Final 2 TRUE 0.230 0.243
Stand_Final 3 TRUE 0.207 0.214
Stand_Final 4 TRUE 0.112 0.238
Stand_Final 5 TRUE 0.207 0.227
Branching_Level 1 TRUE 0.405 0.141
Branching_Level 2 TRUE 0.396 0.147
Branching_Level 3 TRUE 0.372 0.159
Branching_Level 4 TRUE 0.333 0.159
Branching_Level 5 TRUE 0.432 0.149
Staygreen 1 TRUE 0.246 0.000
Staygreen 2 TRUE 0.188 0.000
Staygreen 3 TRUE 0.235 0.000
Staygreen 4 TRUE 0.209 0.000
Staygreen 5 TRUE 0.079 0.000
write.csv(results_cv3, "output/results_cv3.csv", row.names = F, quote = F)

The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “Rep” is the number of the repetition, “Log” is a diagnostic indicating if the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBV and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).

Agora vamos plotar os resultados de acurácias para cada característica

results_cv3 %>%
  ggplot(aes(x = Trait, y = Ac , fill = Trait)) +
  geom_boxplot() +
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(
    angle = 60,
    vjust = 1,
    hjust = 1
  ),
  text = element_text(size = 15)) +
  labs(y = "Accuracy") + 
  guides(fill = "none")

ggsave(
  "output/accuracy3.png",
  width = 16,
  height = 8,
  dpi = 300
)

Cross-validation 10 folds 10 rep

To prove that the prediction is accurate, we should perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First we separate the data into k folds. Then, we attribute NA for one fold and try to predict the data from this fold based on the others. When selecting the number of folds, one must prioritize the balance between the number of observations in each fold.

In addition, this process should be repeated for further validation. The step-by-step below will guide the CV in the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 10
nrept = 10

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.

2. Match the order of the data and the rows of the SNP matrix

The order is decreasing or increasing (numeric or alphabetical) regarding the name of the genotypes.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
GK <- GK[order(row.names(GK)),]
all(rownames(GK) == pheno$ID_Clone)
[1] TRUE

3. Add a column indicating a number for each observation

This will be useful to assign each observation for a fold, which will be the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Folds assignment

In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetition is to make sure of the randomness of the assignment step.

In this step, we will use the cvTools package [@cvTools]

set.seed(100)

sort <- list()

for (a in 1:nrept) {
  for (j in 1:nfolds) {
    folds <- cvFolds(nlevels(pheno$ID),
                     type = "random",
                     K = nfolds,
                     R = 1)
    Sample <- cbind(folds$which, folds$subsets)
    cv <- split(Sample[, 2], f = Sample[, 1])
  }
  sort[[a]] <- cv
}

rm(a, folds, j, cv, Sample)

5. Cross-validation

The next step is the very CV. Here, we will define the linear predictor and the lists that will be useful in the loop. The first list, here called “fold.list”, contains the folds assignation that we built in the previous step. The second (“results_cv”) is empty and will store the outputs of each iteration of the loop.

fold.list <- sort
results <- list()
results_cv4 <- data.frame()
traits <- colnames(media_pheno)

Then, we will construct the loop. Each iteration will assign NA for a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition.

for(j in traits) {
  for (z in 1:length(fold.list)) {
    for (i in 1:nfolds) {
      # Training set
      train_data <- pheno
      
      # Validation set
      train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
      
      # Fitting model
      RHKS <- mixed.solve(train_data[[j]], K = A.mat(geno - 1))
      
      # GEBV
      Pred <- data.frame(Yhat = RHKS$u, G = pheno$ID)
      
      rownames(Pred) <- rownames(geno)
      
      # Predicted GEBV
      results[[i]] <- Pred[Pred[, "G"] %in% fold.list[[z]][[i]],]
      
      # Remove
      #rmRHKS, Pred, train_data)
    }
    
    GEBV <- do.call(rbind, results)
    GEBV <- GEBV[order(GEBV$G),]
    
    # Log
    log <- all(GEBV$G == pheno$ID)
    
    # Results
    result_cv <- data.frame(
      Trait = j,
      Rep = z,
      Log = log,
      Ac = round(cor(GEBV$Yhat, train_data[[j]], use = "na.or.complete"), 3),
      MSPE = round(mean(((GEBV$Yhat - train_data[[j]]) ^ 2
      ), na.rm = T), 3)
    )
    
    results_cv4 <-
      rbind(results_cv4, result_cv)
  }
}
results_cv4 %>%
  kbl(escape = F, align = 'c') %>%
  kable_classic(
    "hover",
    full_width = F,
    position = "center",
    fixed_thead = T
    
  )
Trait Rep Log Ac MSPE
NR.P 1 TRUE 0.291 1.480
NR.P 2 TRUE 0.258 1.576
NR.P 3 TRUE 0.244 1.650
NR.P 4 TRUE 0.271 1.554
NR.P 5 TRUE 0.257 1.552
NR.P 6 TRUE 0.281 1.489
NR.P 7 TRUE 0.271 1.579
NR.P 8 TRUE 0.282 1.535
NR.P 9 TRUE 0.286 1.550
NR.P 10 TRUE 0.304 1.616
PTR 1 TRUE 0.349 3.774
PTR 2 TRUE 0.357 3.583
PTR 3 TRUE 0.324 3.780
PTR 4 TRUE 0.341 3.575
PTR 5 TRUE 0.352 3.659
PTR 6 TRUE 0.370 3.612
PTR 7 TRUE 0.405 3.412
PTR 8 TRUE 0.340 3.683
PTR 9 TRUE 0.365 3.435
PTR 10 TRUE 0.383 3.438
PPA 1 TRUE 0.337 21.937
PPA 2 TRUE 0.335 20.650
PPA 3 TRUE 0.299 22.702
PPA 4 TRUE 0.335 22.768
PPA 5 TRUE 0.367 19.932
PPA 6 TRUE 0.339 21.828
PPA 7 TRUE 0.347 21.693
PPA 8 TRUE 0.288 22.502
PPA 9 TRUE 0.354 21.435
PPA 10 TRUE 0.347 21.449
MS 1 TRUE 0.282 10.131
MS 2 TRUE 0.262 10.155
MS 3 TRUE 0.230 10.729
MS 4 TRUE 0.223 10.597
MS 5 TRUE 0.243 9.996
MS 6 TRUE 0.262 10.185
MS 7 TRUE 0.242 10.381
MS 8 TRUE 0.249 10.504
MS 9 TRUE 0.287 9.985
MS 10 TRUE 0.209 10.472
PROD.AMD 1 TRUE 0.320 0.319
PROD.AMD 2 TRUE 0.334 0.311
PROD.AMD 3 TRUE 0.284 0.322
PROD.AMD 4 TRUE 0.309 0.316
PROD.AMD 5 TRUE 0.316 0.318
PROD.AMD 6 TRUE 0.334 0.303
PROD.AMD 7 TRUE 0.327 0.313
PROD.AMD 8 TRUE 0.295 0.312
PROD.AMD 9 TRUE 0.314 0.295
PROD.AMD 10 TRUE 0.309 0.310
AP 1 TRUE 0.317 0.017
AP 2 TRUE 0.328 0.017
AP 3 TRUE 0.349 0.018
AP 4 TRUE 0.311 0.017
AP 5 TRUE 0.343 0.017
AP 6 TRUE 0.326 0.017
AP 7 TRUE 0.359 0.017
AP 8 TRUE 0.333 0.017
AP 9 TRUE 0.337 0.017
AP 10 TRUE 0.321 0.018
HI 1 TRUE 0.294 44.290
HI 2 TRUE 0.290 43.571
HI 3 TRUE 0.276 46.070
HI 4 TRUE 0.305 41.078
HI 5 TRUE 0.276 44.543
HI 6 TRUE 0.319 43.163
HI 7 TRUE 0.323 42.088
HI 8 TRUE 0.322 43.122
HI 9 TRUE 0.272 44.295
HI 10 TRUE 0.314 42.444
AMD 1 TRUE 0.283 10.121
AMD 2 TRUE 0.263 10.143
AMD 3 TRUE 0.230 10.724
AMD 4 TRUE 0.225 10.584
AMD 5 TRUE 0.244 9.986
AMD 6 TRUE 0.263 10.170
AMD 7 TRUE 0.243 10.370
AMD 8 TRUE 0.250 10.494
AMD 9 TRUE 0.288 9.975
AMD 10 TRUE 0.210 10.466
Porte 1 TRUE 0.402 0.215
Porte 2 TRUE 0.372 0.225
Porte 3 TRUE 0.384 0.216
Porte 4 TRUE 0.350 0.221
Porte 5 TRUE 0.395 0.213
Porte 6 TRUE 0.389 0.216
Porte 7 TRUE 0.358 0.221
Porte 8 TRUE 0.385 0.214
Porte 9 TRUE 0.357 0.217
Porte 10 TRUE 0.361 0.223
RF 1 TRUE 0.275 0.028
RF 2 TRUE 0.275 0.029
RF 3 TRUE 0.311 0.027
RF 4 TRUE 0.303 0.029
RF 5 TRUE 0.326 0.028
RF 6 TRUE 0.271 0.027
RF 7 TRUE 0.289 0.027
RF 8 TRUE 0.286 0.027
RF 9 TRUE 0.277 0.028
RF 10 TRUE 0.328 0.025
CR 1 TRUE 0.348 4.271
CR 2 TRUE 0.374 4.161
CR 3 TRUE 0.344 4.335
CR 4 TRUE 0.326 4.538
CR 5 TRUE 0.376 4.117
CR 6 TRUE 0.377 4.200
CR 7 TRUE 0.384 4.216
CR 8 TRUE 0.340 4.375
CR 9 TRUE 0.360 4.213
CR 10 TRUE 0.378 4.304
DR 1 TRUE 0.275 9.507
DR 2 TRUE 0.268 8.986
DR 3 TRUE 0.301 9.211
DR 4 TRUE 0.284 8.539
DR 5 TRUE 0.301 9.049
DR 6 TRUE 0.346 8.955
DR 7 TRUE 0.352 8.529
DR 8 TRUE 0.315 8.944
DR 9 TRUE 0.310 9.157
DR 10 TRUE 0.291 8.925
DCaule 1 TRUE 0.358 0.025
DCaule 2 TRUE 0.355 0.026
DCaule 3 TRUE 0.346 0.027
DCaule 4 TRUE 0.344 0.026
DCaule 5 TRUE 0.369 0.026
DCaule 6 TRUE 0.358 0.025
DCaule 7 TRUE 0.364 0.026
DCaule 8 TRUE 0.351 0.026
DCaule 9 TRUE 0.374 0.026
DCaule 10 TRUE 0.408 0.025
Ácaro 1 TRUE 0.384 0.036
Ácaro 2 TRUE 0.406 0.034
Ácaro 3 TRUE 0.338 0.036
Ácaro 4 TRUE 0.371 0.036
Ácaro 5 TRUE 0.401 0.035
Ácaro 6 TRUE 0.402 0.035
Ácaro 7 TRUE 0.428 0.034
Ácaro 8 TRUE 0.362 0.036
Ácaro 9 TRUE 0.372 0.035
Ácaro 10 TRUE 0.353 0.036
Nº.Hastes 1 TRUE 0.314 0.138
Nº.Hastes 2 TRUE 0.268 0.143
Nº.Hastes 3 TRUE 0.263 0.137
Nº.Hastes 4 TRUE 0.322 0.129
Nº.Hastes 5 TRUE 0.270 0.136
Nº.Hastes 6 TRUE 0.284 0.133
Nº.Hastes 7 TRUE 0.259 0.135
Nº.Hastes 8 TRUE 0.295 0.138
Nº.Hastes 9 TRUE 0.295 0.137
Nº.Hastes 10 TRUE 0.309 0.141
Stand_Final 1 TRUE 0.222 0.240
Stand_Final 2 TRUE 0.161 0.223
Stand_Final 3 TRUE 0.235 0.237
Stand_Final 4 TRUE 0.167 0.226
Stand_Final 5 TRUE 0.177 0.220
Stand_Final 6 TRUE 0.222 0.236
Stand_Final 7 TRUE 0.163 0.242
Stand_Final 8 TRUE 0.172 0.237
Stand_Final 9 TRUE 0.230 0.243
Stand_Final 10 TRUE 0.203 0.242
Branching_Level 1 TRUE 0.398 0.127
Branching_Level 2 TRUE 0.394 0.128
Branching_Level 3 TRUE 0.354 0.133
Branching_Level 4 TRUE 0.359 0.133
Branching_Level 5 TRUE 0.356 0.128
Branching_Level 6 TRUE 0.360 0.124
Branching_Level 7 TRUE 0.371 0.129
Branching_Level 8 TRUE 0.364 0.124
Branching_Level 9 TRUE 0.414 0.122
Branching_Level 10 TRUE 0.371 0.134
Staygreen 1 TRUE 0.198 0.000
Staygreen 2 TRUE 0.219 0.000
Staygreen 3 TRUE 0.239 0.000
Staygreen 4 TRUE 0.215 0.000
Staygreen 5 TRUE 0.199 0.000
Staygreen 6 TRUE 0.213 0.000
Staygreen 7 TRUE 0.196 0.000
Staygreen 8 TRUE 0.213 0.000
Staygreen 9 TRUE 0.255 0.000
Staygreen 10 TRUE 0.156 0.000
write.csv(results_cv4, "output/results_cv4.csv", row.names = F, quote = F)

The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “Rep” is the number of the repetition, “Log” is a diagnostic indicating if the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBV and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).

Agora vamos plotar os resultados de acurácias para cada característica

results_cv4 %>%
  ggplot(aes(x = Trait, y = Ac , fill = Trait)) +
  geom_boxplot() +
  expand_limits(y = 0) +
  theme(axis.text.x = element_text(
    angle = 60,
    vjust = 1,
    hjust = 1
  ),
  text = element_text(size = 15)) +
  labs(y = "Accuracy") + 
  guides(fill = "none")

ggsave(
  "output/accuracy4.png",
  width = 16,
  height = 8,
  dpi = 300
)

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] cvTools_0.3.2            robustbase_0.95-0        lattice_0.20-45         
 [4] ggpmisc_0.5.1            ggpp_0.4.5               rrBLUP_4.6.1            
 [7] ComplexHeatmap_2.10.0    AGHmatrix_2.0.4          genomicMateSelectR_0.2.0
[10] janitor_2.1.0            kableExtra_1.3.4         forcats_0.5.2           
[13] stringr_1.4.1            dplyr_1.0.10             purrr_0.3.4             
[16] readr_2.1.2              tidyr_1.2.1              tibble_3.1.8            
[19] ggplot2_3.3.6            tidyverse_1.3.2          readxl_1.4.1            

loaded via a namespace (and not attached):
  [1] googledrive_2.0.0   colorspace_2.0-3    rjson_0.2.21       
  [4] ellipsis_0.3.2      rprojroot_2.0.3     circlize_0.4.15    
  [7] snakecase_0.11.0    GlobalOptions_0.1.2 fs_1.5.2           
 [10] clue_0.3-61         rstudioapi_0.14     farver_2.1.1       
 [13] MatrixModels_0.5-1  fansi_1.0.3         lubridate_1.8.0    
 [16] xml2_1.3.3          splines_4.1.3       codetools_0.2-18   
 [19] doParallel_1.0.17   cachem_1.0.6        confintr_0.2.0     
 [22] knitr_1.40          polynom_1.4-1       jsonlite_1.8.0     
 [25] workflowr_1.7.0     broom_1.0.1         cluster_2.1.2      
 [28] dbplyr_2.2.1        png_0.1-7           compiler_4.1.3     
 [31] httr_1.4.4          backports_1.4.1     assertthat_0.2.1   
 [34] Matrix_1.5-1        fastmap_1.1.0       gargle_1.2.1       
 [37] cli_3.3.0           later_1.3.0         htmltools_0.5.3    
 [40] quantreg_5.94       tools_4.1.3         gtable_0.3.1       
 [43] glue_1.6.2          Rcpp_1.0.9          cellranger_1.1.0   
 [46] jquerylib_0.1.4     vctrs_0.4.1         nlme_3.1-159       
 [49] svglite_2.1.0       iterators_1.0.14    xfun_0.32          
 [52] rvest_1.0.3         lifecycle_1.0.3     googlesheets4_1.0.1
 [55] DEoptimR_1.0-11     MASS_7.3-58.1       zoo_1.8-11         
 [58] scales_1.2.1        ragg_1.2.3          hms_1.1.2          
 [61] promises_1.2.0.1    parallel_4.1.3      SparseM_1.81       
 [64] RColorBrewer_1.1-3  yaml_2.3.5          sass_0.4.2         
 [67] stringi_1.7.6       highr_0.9           S4Vectors_0.32.4   
 [70] foreach_1.5.2       BiocGenerics_0.40.0 boot_1.3-28        
 [73] shape_1.4.6         rlang_1.0.6         pkgconfig_2.0.3    
 [76] systemfonts_1.0.4   matrixStats_0.62.0  evaluate_0.17      
 [79] labeling_0.4.2      tidyselect_1.2.0    magrittr_2.0.3     
 [82] R6_2.5.1            magick_2.7.3        IRanges_2.28.0     
 [85] generics_0.1.3      DBI_1.1.3           mgcv_1.8-40        
 [88] pillar_1.8.1        haven_2.5.1         whisker_0.4        
 [91] withr_2.5.0         survival_3.4-0      modelr_0.1.9       
 [94] crayon_1.5.2        utf8_1.2.2          tzdb_0.3.0         
 [97] rmarkdown_2.17      GetoptLong_1.0.5    git2r_0.30.1       
[100] reprex_2.0.2        digest_0.6.29       webshot_0.5.4      
[103] httpuv_1.6.5        textshaping_0.3.6   stats4_4.1.3       
[106] munsell_0.5.0       viridisLite_0.4.1   bslib_0.4.0        

  1. Weverton Gomes da Costa, Pós-Doutorando, Embrapa Mandioca e Fruticultura, ↩︎