XStringSet-io {Biostrings} | R Documentation |
Functions to read/write an XStringSet object from/to a file.
## Read FASTA (or FASTQ) files in an XStringSet object: readBStringSet(filepath, format="fasta", nrec=-1L, skip=0L, use.names=TRUE) readDNAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, use.names=TRUE) readRNAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, use.names=TRUE) readAAStringSet(filepath, format="fasta", nrec=-1L, skip=0L, use.names=TRUE) ## Extract basic information about FASTA (or FASTQ) files ## without actually loading the sequence data: fasta.info(filepath, nrec=-1L, skip=0L, use.names=TRUE, seqtype="B") fastq.geometry(filepath, nrec=-1L, skip=0L) ## Write an XStringSet object to a FASTA (or FASTQ) file: writeXStringSet(x, filepath, append=FALSE, format="fasta", ...) ## Serialize an XStringSet object: saveXStringSet(x, objname, dirpath=".", save.dups=FALSE, verbose=TRUE)
filepath |
A character vector (of arbitrary length when reading, of length 1
when writing) containing the path(s) to the file(s) to read or write.
Note that special values like |
format |
Either |
nrec |
Single integer. The maximum of number of records to read in. Negative values are ignored. |
skip |
Single non-negative integer. The number of records of the data file(s) to skip before beginning to read in records. |
use.names |
Should the returned vector be named? For FASTA the names are taken from the record description lines. For FASTQ they are taken from the record sequence ids. Dropping the names can help reducing memory footprint e.g. for a FASTQ file containing millions of reads. |
seqtype |
A single string specifying the type of sequences contained in the FASTA file(s). Supported sequence types:
Invalid one-letter sequence codes are ignored with a warning. |
x |
For For |
append |
|
... |
Further format-specific arguments.
If |
objname |
The name of the serialized object. |
dirpath |
The path to the directory where to save the serialized object. |
save.dups |
|
verbose |
|
Only FASTA and FASTQ files are supported for now. The qualities stored in the FASTQ records are ignored.
Reading functions readBStringSet
, readDNAStringSet
,
readRNAStringSet
and readAAStringSet
load sequences
from an input file (or set of input files) into an XStringSet
object.
When multiple input files are specified, they are read in the
corresponding order and their data are stored in the returned object
in that order. Note that when multiple input FASTQ files are specified,
all must have the same "width" (i.e. all their sequences must have the
same length).
The fasta.info
utility returns an integer vector with one element
per FASTA record in the input files. Each element is the length of the
sequence found in the corresponding record, that is, the number of
valid one-letter sequence codes in the record. See description of the
seqtype
argument above for how to control the set of valid
one-letter sequence codes.
The fastq.geometry
utility returns an integer vector describing
the "geometry" of the FASTQ files i.e. a vector of length 2 where the
first element is the total number of FASTQ records in the files and
the second element the common "width" of these files (this width is
NA
if the files contain no FASTQ records or records with
different widths).
writeXStringSet
writes an XStringSet object to a file.
WARNING: Please be aware that using writeXStringSet
on a
BStringSet object that contains the '\n' (LF) or '\r' (CR)
characters or the FASTA markup characters '>' or ';' is almost
guaranteed to produce a broken FASTA file!
Serializing an XStringSet object with saveXStringSet
is equivalent to using the standard save
mechanism. But it will
try to reduce the size of x
in memory first before calling
save
. Most of the times this leads to a much reduced size on disk.
http://en.wikipedia.org/wiki/FASTA_format
XStringSet-class, BString-class, DNAString-class, RNAString-class, AAString-class
## --------------------------------------------------------------------- ## A. READ/WRITE FASTA FILES ## --------------------------------------------------------------------- filepath <- system.file("extdata", "someORF.fa", package="Biostrings") fasta.info(filepath, seqtype="DNA") x <- readDNAStringSet(filepath) x out1 <- tempfile() writeXStringSet(x, out1) ## --------------------------------------------------------------------- ## B. READ/WRITE FASTQ FILES ## --------------------------------------------------------------------- filepath <- system.file("extdata", "s_1_sequence.txt", package="Biostrings") fastq.geometry(filepath) readDNAStringSet(filepath, format="fastq") library(BSgenome.Celegans.UCSC.ce2) ## Create a "sliding window" on chr I: sw_start <- seq.int(1, length(Celegans$chrI)-50, by=50) sw <- Views(Celegans$chrI, start=sw_start, width=10) my_fake_shortreads <- as(sw, "XStringSet") my_fake_ids <- sprintf("ID%06d", seq_len(length(my_fake_shortreads))) names(my_fake_shortreads) <- my_fake_ids my_fake_shortreads ## Fake quality ';' will be assigned to each base in 'my_fake_shortreads': out2 <- tempfile() writeXStringSet(my_fake_shortreads, out2, format="fastq") ## Passing qualities thru the 'qualities' argument: my_fake_quals <- rep.int(BStringSet("DCBA@?>=<;"), length(my_fake_shortreads)) my_fake_quals out3 <- tempfile() writeXStringSet(my_fake_shortreads, out3, format="fastq", qualities=my_fake_quals) ## --------------------------------------------------------------------- ## C. SERIALIZATION ## --------------------------------------------------------------------- saveXStringSet(my_fake_shortreads, "my_fake_shortreads", dirpath=tempdir())