fetchExtendedChromInfoFromUCSC {GenomeInfoDb} | R Documentation |
Fetch the chromosomes info for some of the UCSC genomes. Only supports hg38, hg19, mm10, dm3, and sacCer3 at the moment.
fetchExtendedChromInfoFromUCSC(genome, goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath")
genome |
A single string specifying the UCSC genome e.g. |
goldenPath_url |
A single string specifying the URL to the UCSC goldenPath location.
This URL is used internally to build the full URL to the 'chromInfo'
MySQL dump containing chromosomes information for |
Chromosomes information (e.g. names and lengths) for any UCSC genome is stored in the UCSC database in the 'chromInfo' table and is normally available as a MySQL dump at:
goldenPath_url/<genome>/database/chromInfo.txt.gz
fetchExtendedChromInfoFromUCSC
downloads and imports that table
into a data frame, and keeps only the UCSC_seqlevels
and
UCSC_seqlengths
columns (after renaming them). Then it lookups the
assembly report at NCBI for that genome corresponding (e.g. GRCh38
assembly for hg38), extracts the seqlevels and GenBank accession numbers
from the report, matches them to each UCSC seqlevels (using some heuristic),
and adds them to the returned data frame.
A data frame with one row per seqlevel in the UCSC genome, and with the following columns:
UCSC_seqlevels: Character vector with no NAs. This is the
chrom
field of the UCSC 'chromInfo' table for the
genome. See Details section above.
UCSC_seqlengths: Integer vector with no NAs. This is the
size
field of the UCSC 'chromInfo' table for the
genome. See Details section above.
NCBI_seqlevels: Character vector. This information is obtained from
the NCBI assembly report for the genome. Will contain NAs for UCSC
seqlevels with no corresponding NCBI seqlevels (e.g. for chrM in hg18
or chrUextra in dm3), in which case
fetchExtendedChromInfoFromUCSC
emits a warning.
GenBank_accns: Character vector. This information is obtained from the NCBI assembly report for the genome. Can contain NAs but no warning is emitted in that case.
Note that the rows are not sorted in any particular order.
Only supports the hg38, hg19, mm10, dm3, and sacCer3 genomes at the moment. More will come...
H. Pages
The seqlevels
getter and setter
in the GenomicRanges package.
The seqlevelsStyle
getter and setter.
The getBSgenome
utility in the
BSgenome package for searching the installed BSgenome
data packages.
## --------------------------------------------------------------------- ## A. BASIC EXAMPLE ## --------------------------------------------------------------------- chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3") chrominfo ## --------------------------------------------------------------------- ## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON ## THE GRCh38 GENOME ## --------------------------------------------------------------------- ## Load the BSgenome.Hsapiens.NCBI.GRCh38 package: library(BSgenome) genome <- getBSgenome("GRCh38") # this loads the # BSgenome.Hsapiens.NCBI.GRCh38 package ## A quick look at the GRCh38 seqlevels: length(seqlevels(genome)) head(seqlevels(genome), n=30) ## Fetch the extended chromosomes info for the hg38 genome: hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38") dim(hg38_chrominfo) head(hg38_chrominfo, n=30) ## 2 sanity checks: ## 1. Check the NCBI seqlevels: stopifnot(setequal(hg38_chrominfo$NCBI_seqlevels, seqlevels(genome))) ## 2. Check that the sequence lengths in 'hg38_chrominfo' (which are ## coming from the same 'chromInfo' table as the UCSC seqlevels) ## are the same as in 'genome': stopifnot( identical(hg38_chrominfo$UCSC_seqlengths, unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevels])) ) ## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as ## the names: hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevels, hg38_chrominfo$NCBI_seqlevels) ## Set the hg38 seqlevels on 'genome': seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)] head(seqlevels(genome), n=30)