Gene ID转换

为什么要在R中进行ID转换

在转录组数据或者其他分析中,好吧,其实是因为我做的物种是家鸡,根本就没有人和鼠那样有着非常完善和时刻更新的数据库做为支持,我经常会遇见GeneID转换,同源基因转换的问题,之前一直利用Ensembl主页里面的biomart进行数据下载,然后在excel里面利用vlookup进行操作,哎,费时费力,而且不能高度自动化和重复化~~(传统湿实验室人员的基本操作思维)~~,现在已经开始将大部分操作都在R里面进行了,刚好学习了biomaRt包可以完美替代之前的操作流程,下面就是相关笔记及备注:

好吧,这是第一篇Rmd笔记,Rmarkdown真的超级好用啊!

安装所需的软件包

下面的R包也包括Y叔的clusterProfiler了,里面也有ID转换的板块,也是非常实用的,一并汇总了。

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("biomaRt")

biomaRt示例

首先还是推荐看官方文档,没有什么攻略比官方文档更靠谱了,除非懒~

library("biomaRt")
library(org.Gg.eg.db)
library(org.Hs.eg.db)
library(clusterProfiler)
library(DOSE)

# 查看Marts库
listMarts(host="asia.ensembl.org")
##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 101
## 2   ENSEMBL_MART_MOUSE      Mouse strains 101
## 3     ENSEMBL_MART_SNP  Ensembl Variation 101
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 101

一步到位代码

选择所需的数据库代码,我需要的是人和家鸡的数据库,直接定义到dataset即可,有时网速很慢,添加了对应地区的host

# 定义人对应数据mart
human = useMart("ensembl",dataset="hsapiens_gene_ensembl",host="asia.ensembl.org")

# 定义家鸡对应数据mart
chicken = useMart("ensembl",dataset="ggallus_gene_ensembl",host="asia.ensembl.org")

# 查看对应`attributes`说明
listAttributes(chicken)[1:40,]
##                               name                                description
## 1                  ensembl_gene_id                             Gene stable ID
## 2          ensembl_gene_id_version                     Gene stable ID version
## 3            ensembl_transcript_id                       Transcript stable ID
## 4    ensembl_transcript_id_version               Transcript stable ID version
## 5               ensembl_peptide_id                          Protein stable ID
## 6       ensembl_peptide_id_version                  Protein stable ID version
## 7                  ensembl_exon_id                             Exon stable ID
## 8                      description                           Gene description
## 9                  chromosome_name                   Chromosome/scaffold name
## 10                  start_position                            Gene start (bp)
## 11                    end_position                              Gene end (bp)
## 12                          strand                                     Strand
## 13                            band                             Karyotype band
## 14                transcript_start                      Transcript start (bp)
## 15                  transcript_end                        Transcript end (bp)
## 16        transcription_start_site             Transcription start site (TSS)
## 17               transcript_length Transcript length (including UTRs and CDS)
## 18               transcript_appris                          APPRIS annotation
## 19              external_gene_name                                  Gene name
## 20            external_gene_source                        Source of gene name
## 21        external_transcript_name                            Transcript name
## 22 external_transcript_source_name                  Source of transcript name
## 23                transcript_count                           Transcript count
## 24      percentage_gene_gc_content                          Gene % GC content
## 25                    gene_biotype                                  Gene type
## 26              transcript_biotype                            Transcript type
## 27                          source                              Source (gene)
## 28               transcript_source                        Source (transcript)
## 29                         version                             Version (gene)
## 30              transcript_version                       Version (transcript)
## 31                 peptide_version                          Version (protein)
## 32                external_synonym                               Gene Synonym
## 33           phenotype_description                      Phenotype description
## 34                     source_name                                Source name
## 35               study_external_id                   Study external reference
## 36                     strain_name                                Strain name
## 37                   strain_gender                              Strain gender
## 38                         p_value                                    P value
## 39                           go_id                          GO term accession
## 40                       name_1006                               GO term name
##            page
## 1  feature_page
## 2  feature_page
## 3  feature_page
## 4  feature_page
## 5  feature_page
## 6  feature_page
## 7  feature_page
## 8  feature_page
## 9  feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page
## 21 feature_page
## 22 feature_page
## 23 feature_page
## 24 feature_page
## 25 feature_page
## 26 feature_page
## 27 feature_page
## 28 feature_page
## 29 feature_page
## 30 feature_page
## 31 feature_page
## 32 feature_page
## 33 feature_page
## 34 feature_page
## 35 feature_page
## 36 feature_page
## 37 feature_page
## 38 feature_page
## 39 feature_page
## 40 feature_page
listAttributes(human)[1:10,]
##                             name                  description         page
## 1                ensembl_gene_id               Gene stable ID feature_page
## 2        ensembl_gene_id_version       Gene stable ID version feature_page
## 3          ensembl_transcript_id         Transcript stable ID feature_page
## 4  ensembl_transcript_id_version Transcript stable ID version feature_page
## 5             ensembl_peptide_id            Protein stable ID feature_page
## 6     ensembl_peptide_id_version    Protein stable ID version feature_page
## 7                ensembl_exon_id               Exon stable ID feature_page
## 8                    description             Gene description feature_page
## 9                chromosome_name     Chromosome/scaffold name feature_page
## 10                start_position              Gene start (bp) feature_page
# 进行ID转换

hgnc_swissprot <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','hgnc_symbol','uniprotswissprot'),filters = 'ensembl_gene_id', values = 'ENSG00000139618', mart = human)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
hgnc_swissprot
##   ensembl_gene_id ensembl_transcript_id hgnc_symbol uniprotswissprot
## 1 ENSG00000139618       ENST00000671466       BRCA2                 
## 2 ENSG00000139618       ENST00000544455       BRCA2           P51587
## 3 ENSG00000139618       ENST00000530893       BRCA2                 
## 4 ENSG00000139618       ENST00000380152       BRCA2           P51587
## 5 ENSG00000139618       ENST00000670614       BRCA2                 
## 6 ENSG00000139618       ENST00000665585       BRCA2                 
## 7 ENSG00000139618       ENST00000528762       BRCA2                 
## 8 ENSG00000139618       ENST00000470094       BRCA2                 
## 9 ENSG00000139618       ENST00000666593       BRCA2

同源基因ID转换代码

跨物种同源基因转换使用getLDS命令即可,也是非常方便,但是有一个问题就是家鸡一个基因可能对应多个人的基因,系统会自动添加对应的旁系同源基因,所以转换后的数据使用需要小心,可能需要人工排查一下,或者人为的去掉部分数据。

#human = useMart("ensembl",dataset="hsapiens_gene_ensembl",host="asia.ensembl.org")
#chicken = useMart("ensembl",dataset="ggallus_gene_ensembl",host="asia.ensembl.org")

genelist_megre <- c("ENSGALG00000001136","ENSGALG00000005956","ENSGALG00000006407","ENSGALG00000007728","ENSGALG00000009628","ENSGALG00000010316","ENSGALG00000011190","ENSGALG00000011687","ENSGALG00000011806","ENSGALG00000015656")

genelist_megre
##  [1] "ENSGALG00000001136" "ENSGALG00000005956" "ENSGALG00000006407"
##  [4] "ENSGALG00000007728" "ENSGALG00000009628" "ENSGALG00000010316"
##  [7] "ENSGALG00000011190" "ENSGALG00000011687" "ENSGALG00000011806"
## [10] "ENSGALG00000015656"
genelist_megre_Hs <- getLDS(attributes = c("ensembl_gene_id","entrezgene_id", "external_gene_name"),      
                            filters = "ensembl_gene_id", 
                            values = genelist_megre, 
                            mart = chicken,
                            attributesL = c("ensembl_gene_id","entrezgene_id","external_gene_name"), 
                            martL = human)

genelist_megre_Hs
##       Gene.stable.ID NCBI.gene..formerly.Entrezgene..ID Gene.name
## 1 ENSGALG00000005956                             423774        NA
## 2 ENSGALG00000005956                             423774        NA
## 3 ENSGALG00000011687                          100859120        NA
## 4 ENSGALG00000011190                          100857411        NA
## 5 ENSGALG00000011806                          100859223        NA
## 6 ENSGALG00000007728                             770734        NA
## 7 ENSGALG00000007728                             770734        NA
##   Gene.stable.ID.1 NCBI.gene..formerly.Entrezgene..ID.1 Gene.name.1
## 1  ENSG00000264230                               728113     ANXA8L1
## 2  ENSG00000265190                               653145       ANXA8
## 3  ENSG00000185567                               113146      AHNAK2
## 4  ENSG00000145287                                51316       PLAC8
## 5  ENSG00000173702                                56667       MUC13
## 6  ENSG00000100033                                 5625       PRODH
## 7  ENSG00000277196                            102724788  AC007325.2

插一句,总算知道为什么R代码有规范了,参数主动换行,真的好看太多了。

提取蛋白/DNA序列并导出FASTA格式文件

使用过程中发现居然还可以调取序列等信息,少量操作也就不用打开网页,快速方便完成氨基酸/核酸序列的获取,非常实用!

protein = getSequence(id=c("GRP"),
                      type="external_gene_name",
                      seqType="peptide", 
                      mart= chicken)

protein
##                                                                                                                                                                 peptide
## 1              MGGGGPRRPGTLPLLALLALLAAHGGAAPLQPGGSPALTKIYPRGSHWAVGHLMGKKSTGDFPYAYEEENKIPLSASPENIKQLDDYLQREEMSKHLLQLLEGNENKSAHFSKGGLPWHTRNSWETDDSSSWKDVVEYLLQVVNMKESAPS*
## 2 MGGGGPRRPGTLPLLALLALLAAHGGAAPLQPGGSPALTKIYPRGSHWAVGHLMGKKSTGDFPYAYEEENKIPLSASPENIKQLDDYLQREEMSKHLLQLLEGNENKSAHFSKGGLPWHTRNSWETDDSSSWKDVSRTRCVSAFLTVTFCSKVAYQLCPTSALS*
##   external_gene_name
## 1                GRP
## 2                GRP

混杂基因名称转换处理,这里使用了bitr进行示例

在获得基因列表是不可避免会出现name和ID混杂在一起的情况,目前只能采用分批操作的流程进行处理,有更好方法再更新。

genelist <- c("S100A6","S100A10","SNCG","S100A1","S100A11","ID3","ENSGALG00000011190","AKR1B10","CSRP1","SPARCL1","HBEGF","FAM107B","ENSGALG00000009628","ENSGALG00000010316","ENSGALG00000011190","ENSGALG00000011687","ENSGALG00000011806","ENSGALG00000015656")

genelist
##  [1] "S100A6"             "S100A10"            "SNCG"              
##  [4] "S100A1"             "S100A11"            "ID3"               
##  [7] "ENSGALG00000011190" "AKR1B10"            "CSRP1"             
## [10] "SPARCL1"            "HBEGF"              "FAM107B"           
## [13] "ENSGALG00000009628" "ENSGALG00000010316" "ENSGALG00000011190"
## [16] "ENSGALG00000011687" "ENSGALG00000011806" "ENSGALG00000015656"
# 定义ENSEMBL命名的基因
index <- grep("ENSGALG", genelist)

index
## [1]  7 13 14 15 16 17 18
# 筛选ENSGALG命名的基因并转换
genelistA <- genelist[index]
geneA = bitr(genelistA, fromType="ENSEMBL", toType=c("SYMBOL", "ENTREZID"), OrgDb="org.Gg.eg.db")
head(geneA)
##              ENSEMBL     SYMBOL  ENTREZID
## 1 ENSGALG00000011190     PLACL2 100857411
## 2 ENSGALG00000009628     KCTD12    425504
## 3 ENSGALG00000010316      FRAS1    422506
## 4 ENSGALG00000011687     AHNAK2 100859120
## 5 ENSGALG00000011806      MUC13 100859223
## 6 ENSGALG00000015656 PALM2AKAP2 100533110
# 筛选SYMBOL命名的基因并转换
genelistB <- genelist[-index]
geneB = bitr(genelistB, fromType="SYMBOL", toType=c("ENSEMBL", "ENTREZID"), OrgDb="org.Gg.eg.db")
head(geneB)
##    SYMBOL            ENSEMBL  ENTREZID
## 1  S100A6 ENSGALG00000041826    373951
## 2 S100A10 ENSGALG00000028774    396506
## 3    SNCG ENSGALG00000002015    395392
## 4  S100A1 ENSGALG00000034458 100858258
## 5 S100A11 ENSGALG00000036711    396075
## 6     ID3 ENSGALG00000035317    395281
gene <- rbind(geneA,geneB)

gene
##               ENSEMBL     SYMBOL  ENTREZID
## 1  ENSGALG00000011190     PLACL2 100857411
## 2  ENSGALG00000009628     KCTD12    425504
## 3  ENSGALG00000010316      FRAS1    422506
## 4  ENSGALG00000011687     AHNAK2 100859120
## 5  ENSGALG00000011806      MUC13 100859223
## 6  ENSGALG00000015656 PALM2AKAP2 100533110
## 7  ENSGALG00000041826     S100A6    373951
## 8  ENSGALG00000028774    S100A10    396506
## 9  ENSGALG00000002015       SNCG    395392
## 10 ENSGALG00000034458     S100A1 100858258
## 11 ENSGALG00000036711    S100A11    396075
## 12 ENSGALG00000035317        ID3    395281
## 13 ENSGALG00000013072    AKR1B10    395338
## 14 ENSGALG00000000318      CSRP1    396176
## 15 ENSGALG00000010929    SPARCL1    422586
## 16 ENSGALG00000000949      HBEGF    395654
## 17 ENSGALG00000033226    FAM107B    426376
summary(gene)
##    ENSEMBL             SYMBOL            ENTREZID        
##  Length:17          Length:17          Length:17         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
# 同时可以导出到本地csv文件
#write.csv(as.data.frame(gene),"./gene.csv",row.names =FALSE)

当然,使用getBM也可以实现ID转换,但是结果会出现不少信息残缺,还是推荐bitr

gene_A <- getBM(attributes=c("ensembl_gene_id", "external_gene_name", "entrezgene_id"),
                filters = 'ensembl_gene_id', 
                values = genelistA, 
                mart = chicken)

gene_B <- getBM(attributes=c("ensembl_gene_id", "external_gene_name", "entrezgene_id"),
                filters = 'external_gene_name', 
                values = genelistB, 
                mart = chicken)

gene_ABmegre <- rbind(gene_A,gene_B)
gene_ABmegre
##       ensembl_gene_id external_gene_name entrezgene_id
## 1  ENSGALG00000009628                           425504
## 2  ENSGALG00000010316                           422506
## 3  ENSGALG00000011190                        100857411
## 4  ENSGALG00000011687                        100859120
## 5  ENSGALG00000011806                        100859223
## 6  ENSGALG00000015656         PALM2AKAP2     100533110
## 7  ENSGALG00000028774            S100A10        396506
## 8  ENSGALG00000000318              CSRP1        396176
## 9  ENSGALG00000036711            S100A11        396075
## 10 ENSGALG00000000949              HBEGF        395654
## 11 ENSGALG00000002015               SNCG        395392
## 12 ENSGALG00000013072            AKR1B10        395338
## 13 ENSGALG00000041826             S100A6        373951
## 14 ENSGALG00000034458             S100A1     100858258
## 15 ENSGALG00000010929            SPARCL1        422586
## 16 ENSGALG00000033226            FAM107B        426376
## 17 ENSGALG00000035317                ID3        395281
summary(gene_ABmegre)
##  ensembl_gene_id    external_gene_name entrezgene_id      
##  Length:17          Length:17          Min.   :   373951  
##  Class :character   Class :character   1st Qu.:   395654  
##  Mode  :character   Mode  :character   Median :   422506  
##                                        Mean   : 29929910  
##                                        3rd Qu.:100533110  
##                                        Max.   :100859223

小结

学习使人懒惰,以前要花半小时进行EXCEL文件准备和ID转换,现在也就几分钟,成功偷懒~~

参考资料

Jiannan Zhang
Jiannan Zhang
Associate Professor

My research interests include Avian Physiology, Endocrinology and Metabolism.

Related