NCBI Assembly Archive
RFC v.1.1
Martin Shumway1, Vladimir Alexeyev2, Deanna Church2 (communicating author; e-mail:
trace@ncbi.nlm.nih.gov), Steven Salzberg1
(1 The Institute for Genomic Research (TIGR), 2 Center for Biotechnology Information (NCBI))
1. Overview
The new Assembly Archive at NCBI is a repository of fully and partially complete genomic
assemblies that exists in association with sequence submissions in Genbank and trace
submissions in the NCBI Trace Archive. This is a major addition to the existing archives of
trace data and sequence data. The repository provides users with the ability to access and
evaluate the assemblies from which finished genomic nucleotide sequence has been
derived. Many benefits accrue to users of this data, including for example the ability to
determine that a spurious frame shift has occurred, or that a putative SNP is not well
supported by adequate coverage.
This Specification describes the information content of a submission object, its format, and
the procedure by which submissions and updates should be made. It does not describe
retrieval techniques and viewing options. It does not specify the manner in which the data is
stored at NCBI.
The repository has the ability to store two types of information, assembly and alignment.
The set of instructions detailing how a set of traces contributes to an assembly is
labeled 'assembly'. Information concerning the alignments of traces (that may or may not
have been used in the assembly) is labeled 'alignment'. The distinction is that an alignment
of a pool of traces to an assembly may give different results than the instructions given
when the same set of traces is used to define an assembly. During the course of an
assembly, heuristics are generally applied in specific regions in the assembly to produce a
more biologically relevant answer. The rest of the field names and their definitions remain
the same.
Information fields are denoted as object.field, for example
contig.submitter_reference. Data types are left unspecified in this draft but in all cases are
assumed to be strings.
The impetus for writing this specification came from a meeting between TIGR and NLM held
on 22 Aug 2003 hosted by David Lipman and Jim Ostell of NLM.
NCBI Assembly Archive RFC (v.1.1 March 4, 2005)
Built: March 4, 2005
1.1. Goals
· Provide a central repository of genome assemblies.
· Augment public repositories of experimental data that supports scientific results from
genome sequencing.
· Allow anyone to evaluate the quality of a genome assembly.
· Provide links to sequence data held in the Trace Archive.
· Allow for submission of alternate assemblies of the same sequences.
· Allow for submission of alternate base callings of the same assembly.
1.2. Scope
IS
· Submission and storage of contigs
· Submission and storage of contig consensus
· References to traces
· References to taxonomic objects
· Support for circularized contigs
· Construction of a multiple alignment for the traces in a contig
IS NOT
· Submission and storage of scaffolds
· Storage of trace data
· Submission and storage of contig features
· Description of sequencing project or organism
· Submission and storage of lab data including clone insert mapping
· Description of secondary structure such as hairpins and hard stops
· Storage of edits that would have to be applied to raw traces in order to construct a
multiple alignment
2. Assembly Submission
The core of each Assembly Submission is the ASSEMBLY.xml XML document, the format of
which is described in detail in this section.
The submission .xml file contains the following sections:
Assembly Block
Contigs
[
    Contig
    Traces
    [
        Trace
    ]*
]+
where '*' marks a section that can be repeated zero or more times, and '+' marks a
section that can be repeated one or more times.
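The nesting and repetition rules above can be mirrored in a small in-memory model. The following is only a sketch: the class names, the choice of fields, and the check_cardinality helper are hypothetical and are not the element names of the Assembly XML schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    # Hypothetical field: a Trace Archive identifier kept as a string.
    ti: str


@dataclass
class Contig:
    submitter_reference: str
    # ']*' in the outline: a contig may list zero or more traces.
    traces: List[Trace] = field(default_factory=list)


@dataclass
class Assembly:
    center_name: str
    # ']+' in the outline: a submission carries one or more contigs.
    contigs: List[Contig] = field(default_factory=list)


def check_cardinality(assembly: Assembly) -> List[str]:
    """Return a list of problems; an empty list means the outline is satisfied."""
    problems = []
    if len(assembly.contigs) < 1:
        problems.append("a submission needs at least one contig")
    return problems
```

A contig with an empty trace list passes this check, while a submission without contigs does not.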
2.1. Assembly Block
This block gives general information about the entry. Usually a submission corresponds
to a genome assembly of an organism or a structure within the organism's genome. The
assembly is uniquely identified by its ID (AI) which is assigned on submission.
The submission block consists of the following fields and/or attributes:
Assembly fields/elements:
center_name    Submitter's institution designation (required)
taxid          Genbank taxonomic reference (required)
date           Date that the submission was prepared (optional)
description    Freeform description of the assembly or the submission (required)
structure      Submitter's structural assignment, for example chromosome 3. For some
               genomes these designations may have been standardized (required)
ncontigs       Number of contigs in the assembly (required)
nconbases      Number of bases of consensus in the contigs of the assembly (optional for
               validation purposes) (required)
nbasecalls     Total number of base calls used in the assembly (required)
ntraces        Number of traces referred to in the assembly (required)
coverage       Ratio of (optional)
contig         The submission contains one or more contigs (required)
Assembly attributes:
submitter_reference    Submitter's free text reference attribute, submitter's internal
                       reference id (optional)
type                   Attribute of the submission type (NEW, UPDATE, REPLACE or REMOVE)
                       (required)
ai                     Assembly archive identifier (required if type is UPDATE, REPLACE or
                       REMOVE)
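The dependency between the type and ai attributes can be checked before a submission is sent. The following is a minimal sketch, assuming the attributes have already been read into a plain dictionary of strings; the function name and the dictionary representation are illustrative and not part of this specification.

```python
VALID_TYPES = {"NEW", "UPDATE", "REPLACE", "REMOVE"}


def check_assembly_attributes(attrs: dict) -> list:
    """Return a list of problems found in the assembly attributes."""
    problems = []
    sub_type = attrs.get("type")
    if sub_type not in VALID_TYPES:
        # type is required and must be one of the four listed values.
        problems.append("type must be one of NEW, UPDATE, REPLACE or REMOVE")
    elif sub_type != "NEW" and not attrs.get("ai"):
        # ai is required for UPDATE, REPLACE and REMOVE submissions.
        problems.append("ai is required when type is " + sub_type)
    return problems
```

For example, {"type": "NEW"} yields no problems, while {"type": "UPDATE"} without an ai value is reported.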
2.2. Contig Set
The submission contains one or more contigs. This is the contig set.
2.2.1. Contig
The contig record is uniquely identified by its ID (CI) which is assigned on submission.
The contig.submitter_reference attribute's value allows the contigs to be identified with
a designator meaningful to the submitting institution.
The contig is composed of a consensus sequence (contig.consensus) of
contig.nconbases base pairs and a list of gaps (contig.congaps) that must be inserted
in order to align the consensus with its constituent tiling. The contig's consensus can also
be accompanied by a sequence of quality scores, (contig.conqualities) which are
generated during the assembly.
The consensus is a sequence of nucleotide base calls that can comprise any IUB code
(A,C,G,T,M,R, #). In addition, further codes that denote ambiguities with respect to gaps
are supported:
a = A or gap
c = C or gap
t = T or gap
g = G or gap
On the NCBI side, these IUB extensions will be substituted with the corresponding standard
IUB codes: a with A, c with C, t with T, and g with G.
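The substitution described above amounts to a four-character translation. The sketch below illustrates it; the function name is hypothetical, and how NCBI actually performs the substitution is not specified here.

```python
# Map the four "base or gap" extension codes to the standard IUB codes.
GAP_AMBIGUITY_TO_IUB = str.maketrans("actg", "ACTG")


def normalize_consensus(consensus: str) -> str:
    """Replace a, c, t, g with A, C, T, G; all other characters are left untouched."""
    return consensus.translate(GAP_AMBIGUITY_TO_IUB)
```

Only the four extension codes are changed, so standard IUB codes such as M or R pass through as submitted.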
The contig consensus strand is not specified and its sense is always "forward".
The consensus also contains a set of gaps that describes how to "stretch" the contig
consensus in order to fit the tiling underneath. The gapped consensus so described is
indexed from 1 and is inclusive (thus the trivial consensus of one base has a span of
[1,1] ). The gap set is denoted as a set of relative offsets that tells how many base calls
to skip before inserting a gap. For example, the following ungapped consensus can be
re-gapped as follows:
ACTTTCATGGTACTGATCCAGT + (1,5,0,0,2) => A-CTTTC---AT-GGTACTGATCCAGT
A gapped consensus may start with a gap run and may end with a gap run.
The sum of contig.nconbases and |contig.ncongaps| is the size of the gapped
consensus, while contig.nbasecalls refers to the total number of basecalls within the
valid ranges of the traces used in the assembly.
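The re-gapping rule can be sketched in a few lines: each offset says how many base calls to copy before a gap is inserted. The function name apply_gaps and the use of '-' as the gap character are assumptions made for illustration, following the worked example above.

```python
def apply_gaps(consensus: str, congaps) -> str:
    """Insert one gap after skipping each relative offset in congaps."""
    pieces = []
    pos = 0
    for skip in congaps:
        if skip < 0 or pos + skip > len(consensus):
            raise ValueError("gap offset runs outside the consensus")
        pieces.append(consensus[pos:pos + skip])  # copy 'skip' base calls
        pieces.append("-")                        # then insert a single gap
        pos += skip
    pieces.append(consensus[pos:])                # remaining ungapped tail
    return "".join(pieces)


gapped = apply_gaps("ACTTTCATGGTACTGATCCAGT", (1, 5, 0, 0, 2))
# The gapped size is the number of consensus bases plus the number of gaps.
assert len(gapped) == 22 + 5
```

An offset of 0 extends a gap run, which is how a gapped consensus can start or end with several gaps.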
2.2.2. Alternate Consensus
Rather than supplying the consensus sequence as part of the Assembly Archive
submission (either in-line or as a file), it is possible to simply refer to an entry in NCBI
Genbank by accession or by GI. Set the contig.source value to ACCESSION.VERSION
or GI depending on which reference is used. The contig.congaps field must still be set.
Care must be taken to ensure that the assembly consensus indeed matches the
Genbank sequence. In addition, it is important that the submitter maintain the
consistency between the assembly and the Genbank entry if either is changed over time.
A major feature of the Assembly Archive Viewer is the ability to project the annotation of
genomic information from Genbank onto the assembly contigs.
2.2.3. Contig Conformation
It is possible for a contig to represent a circularized structure.
The contig.conformation attribute denotes whether the contig is linear (default) or
circular. In the case of circularity, the consensus should not wrap on itself. Rather, traces
that begin at the end of the contig and wrap around to the beginning are denoted with
their trace.tiling.start and trace.consensus.start having negative values determined by the
number of bases (gapped/ungapped) by which they wrap around. For example, if a trace
wraps around the end of a contig by two bases, then trace.consensus.start is -1. This
scheme maintains the invariant that the end minus the start equals the length
(gapped/ungapped). The trace ordering scheme defined in section 2.3 is also maintained.
2.2.4. Pseudo-molecules
It is possible for a contig to represent a pseudo-molecule. Such submissions may contain
more information about the relationship of contigs to one another, and more faithfully
reflect the associated Genbank sequence.
The preparation of such a contig is described as follows. The actual contigs in the
original assembly are ordered and oriented. Gaps are filled in with the approximate
number of Ns representing each gap's size. Negative gaps (overlaps) between
adjacent contigs are resolved by unifying their alignments. A single contig is produced
that spans the pseudo-molecule and aligns exactly with it. The contig may have areas of
zero trace coverage. This contig thus constitutes the assembly submission.
2.2.5. Description
In many cases it may be useful to provide descriptive information for a contig other than
just the identifier. Such optional descriptive information can be added here.
2.2.6. Scaffolds
Many assemblies produce a rich set of relationships among contigs and markers that
together represent a scaffolding of a genomic structure (e.g. chromosome arm). The
Assembly Archive Specification does not presently address support for scaffolds, but will
do so in a future revision; that support will probably be based on the existing Genbank AGP
format.
2.3. Trace Set
A consensus is composed of zero or more traces called a trace set. In the case where no
traces are available, contig.ntraces is 0 or this entry can be omitted.
Traces are references to entries in the NCBI Trace Archive and must have either the
trace.ti or the trace.trace_name attribute set. In the latter case the optional
trace.center_name attribute may be set; otherwise the center name is assumed to be the
same as assembly.center_name. Once a Trace Archive entry is inserted, the entry never
expires or changes, so its identifier uniquely identifies the entry. While the presence of the
trace.ti is guaranteed, it may be marked as replaced or withdrawn. The Assembly
Archive does not take responsibility for assessing whether a contig record is still valid
given the state of its referenced traces. No rule is proposed to ensure the consistency of
a contig record with respect to its constituent traces. For example, if one trace is
invalidated in the Trace Archive and it is used in an area of deep coverage in the
Assembly Archive record, then the contig might still be considered valid and its result is
not challenged.
The trace record is designed to be lightweight, referring to all content as stored in the
trace record. The number of bases in the trace that are used in the assembly is
trace.nbasecalls. The start and stop attributes of the valid range (trace.valid) denote
coordinates (1-based, inclusive) within the unclipped trace sequence as stored
in the Trace record. In other words, [trace.valid.start, trace.valid.stop] describes the
closed interval of the sequence that may be used in assembly. Because the trace record
is considered "forward", trace.valid.start < trace.valid.stop. The value of
trace.nbasecalls is trace.valid.stop - trace.valid.start + 1. The trace.nbasecalls
parameter is optional. The trace valid range does not have to be contained within the
trace's clear range (the region screened of vector and clear of bad quality, as stored in
the Trace Archive). This is done in order to permit alternative vector clipping and low
quality trimming, for example when an assembly is performed on traces submitted by
another center.
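The relation between the valid range and trace.nbasecalls can be verified with a short helper. This is a sketch under the rules stated above (1-based, inclusive, start strictly less than stop); the function name is illustrative.

```python
def valid_nbasecalls(valid_start: int, valid_stop: int) -> int:
    """Number of base calls in the closed interval [valid_start, valid_stop]."""
    if valid_start < 1:
        raise ValueError("valid range coordinates are 1-based")
    if not valid_start < valid_stop:
        raise ValueError("trace.valid.start must be less than trace.valid.stop")
    return valid_stop - valid_start + 1
```

Since trace.nbasecalls is optional, a submitter could use such a check only when the value is supplied, comparing it with the result computed from the valid range.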
The start and stop attributes of trace.tiling (1-based, inclusive) denote the
gapped coordinates that locate the trace within the contig's tiling. The trace.tiling
direction attribute denotes the trace's orientation within the tiling with respect to the trace
record. This attribute can be consulted for ease of validating mate pair orientations. Note
that this is not the same as the direction of the trace with respect to its clone (a trace
attribute). Unlike the contig consensus, a trace sequence may neither start nor end with a gap.
The start and stop attributes of trace.consensus denote the positions in the ungapped
consensus of the first and last bases of the trace that contribute to the
consensus. In other words, the trace.consensus.start-th base position (counting from 1) is
the first position covered by the trace. This is useful for quickly determining coverage depth
of the contig. These attributes are optional and help to locate the trace with respect to the
consensus of the genome rather than the tiling.
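As an illustration of the coverage use case, the sketch below derives per-base depth from a list of consensus ranges. It assumes a linear contig and ranges that lie within [1, nconbases]; wraparound traces on circular contigs are not handled. The function name and the tuple representation of ranges are hypothetical.

```python
def coverage_depth(nconbases: int, consensus_ranges) -> list:
    """Depth at each ungapped consensus position, from 1-based inclusive ranges."""
    # Difference array: +1 where a trace starts, -1 just past where it stops.
    diff = [0] * (nconbases + 2)
    for start, stop in consensus_ranges:
        if not 1 <= start <= stop <= nconbases:
            raise ValueError("range outside the linear consensus")
        diff[start] += 1
        diff[stop + 1] -= 1
    depth = []
    running = 0
    for pos in range(1, nconbases + 1):
        running += diff[pos]
        depth.append(running)
    return depth
```

The returned list is indexed from 0, so the depth at consensus position p is found at index p - 1.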
Traces are listed in order of their occurrence in the contig's tiling. They are ordered first
by trace.tiling.start, then by length of the tiling range (shortest first), then by trace.ti,
and finally by trace.trace_name.
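The ordering can be expressed as a single sort key. The sketch below assumes each trace is held as a dictionary with hypothetical keys tiling_start, tiling_stop, ti and trace_name, that tiling_start does not exceed tiling_stop, that trace.ti is compared numerically, and that a missing trace.ti sorts as 0; none of these representation choices are fixed by this specification.

```python
def tiling_sort_key(trace: dict):
    """Key implementing: tiling start, tiling length, then ti, then trace_name."""
    start = int(trace["tiling_start"])
    stop = int(trace["tiling_stop"])
    length = stop - start + 1                     # 1-based, inclusive range
    ti = int(trace["ti"]) if trace.get("ti") else 0
    return (start, length, ti, trace.get("trace_name", ""))


traces = [
    {"ti": "30", "tiling_start": 5, "tiling_stop": 20},
    {"ti": "10", "tiling_start": 1, "tiling_stop": 50},
    {"ti": "20", "tiling_start": 1, "tiling_stop": 30},
    {"ti": "5", "tiling_start": 1, "tiling_stop": 30},
]
ordered = sorted(traces, key=tiling_sort_key)
```

Here the two traces that share a start and a length are separated by trace.ti, and the longer trace starting at the same position follows them.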
The following invariants are observed:
· The number of bases in the valid range is equal to the number of base calls in the tiling
range.
· The consensus range is contained within the tiling range.
· The start coordinate is always
6. Appendix B: Assembly XML Schema