3. Accessing Sequence Information:
HIV sequences can be accessed locally from one of our several sequence
databases through the GDE sequence database menu.
Sequence information can also be accessed online using one of the specialized
bioinformatics tools:
3.1. Electropherogram analysis:
To access sequence information from sequencers, we suggest using the
phred/phrap/Consed programs.
Phred reads DNA sequencer trace data, calls bases, assigns quality
values to the bases, and writes the base calls and quality values to output
files. Phred can read trace data from SCF files and ABI model 310,
3100 and 377 DNA sequencer chromat files, automatically detecting the file
format.
Phrap ("phragment assembly program", or "phil's revised assembly
program"; a homonym of "frappe" = French for "swat") -- a program for assembling
shotgun DNA sequence data. Key features: allows use of entire read
(not just trimmed high quality part); uses a combination of user-supplied
and internally computed data quality information to improve accuracy of
assembly in the presence of repeats; constructs contig sequence as a mosaic
of the highest quality parts of reads (rather than a consensus); provides
extensive information about assembly (including quality values for contig
sequence) to assist trouble-shooting; able to handle very large datasets.
N.B. phrap does not provide editing or viewing capabilities; these
are available with consed and phrapview. It is strongly recommended
that phrap be used in conjunction with the base calls and base quality
values produced by the basecaller, phred; and with the sequence editor/assembly
viewer, consed.
Consed is a program for viewing and editing assemblies assembled
with the phrap assembly program.
A set of scripts has been developed that automatically integrates
the phred, phrap and consed programs. A picture of the program in action
and the manual are also available.
3.2. Sequence formats:
GenBank format
A sequence file in GenBank format can contain several sequences.
Each sequence starts with a line containing the word LOCUS, followed by
a number of annotation lines. The start of the sequence itself is marked
by a line containing "ORIGIN", and the end of the sequence is marked by
two slashes ("//").
An example sequence in GenBank format is:
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
      121 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
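The LOCUS / ORIGIN / "//" layout described above can be read with a few lines of code. The following is only a minimal sketch in Python, not part of GDE or ReadSeq: the function name read_genbank_sequences is hypothetical, annotation lines are simply skipped, and a record that lacks the closing "//" is still returned when the input ends.

```python
def read_genbank_sequences(lines):
    """Yield (locus_name, sequence) for each record in an iterable of lines."""
    name, chunks, in_origin = None, [], False
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            # The locus name is the first word after the LOCUS keyword.
            name = fields[1] if len(fields) > 1 else ""
            chunks, in_origin = [], False
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks, in_origin = None, [], False
        elif in_origin:
            # Drop the leading position number and the blanks between blocks.
            chunks.append("".join(c for c in line if c.isalpha()))
    if name is not None:
        # Record without a closing "//" (for example a truncated file).
        yield name, "".join(chunks)
```

Because the function is a generator, a file with several records can be processed one record at a time, for example with: for name, seq in read_genbank_sequences(open(path)).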
FASTA Format
Sequences in FASTA-formatted files are preceded by a line starting with ">".
The first word on this line is the name of the sequence. The rest of
the line is a description of the sequence. The remaining lines contain
the sequence itself.
FASTA files containing multiple sequences follow the same layout, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs. An example of a FASTA file containing multiple sequences is:
> seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
> seq2 This is the description of my second sequence.
GTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
> seq3 This is the description of my third sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
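The rules above (header line starting with ">", first word as the name, rest of the line as the description, remaining lines as sequence) translate directly into a small reader. This is a minimal sketch in Python; the function name read_fasta is hypothetical, and tolerating a blank after ">" (as in the example above) is an assumption, since some programs expect the name to follow the ">" immediately.

```python
def read_fasta(lines):
    """Yield (name, description, sequence) from an iterable of FASTA lines."""
    name, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, description, "".join(chunks)
            # First word is the name, the rest of the line is the description.
            header = line[1:].strip().split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)
```

Sequence lines are joined without separators, so a sequence wrapped over several lines is returned as one string.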
Sequence Formats for Phylogenetic Analysis:
Nexus format (PAUP)
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=10 NCHAR=30;
FORMAT DATATYPE=DNA MISSING=? GAP=- INTERLEAVE ;
MATRIX
[                10        20        30]
[                 .         .         .]
'Taxa A' GCAATCATCCAAATCGGTCAACTTAATAAA [30]
'Taxa B' GCCATCATCCATAACGGTGAACTTTTAATG [30]
'Taxa C' GCCATACTCCATAACGGTGAACTTGTAATA [30]
'Taxa D' GCCAAACCCCATATCGTGCAACTTAATAAG [30]
'Taxa E' GGCTATCCACCTTAAGTGTAAATTGTTGAT [30]
'Taxa F' GGCTATCCAACTATAGTGCAACTTAATACA [30]
'Taxa G' GGCTAGGCCAATAATATGAAACTTTTAATG [30]
'Taxa H' GTCTAGGCCAAAAATATGAAACTTGTTATA [30]
'Taxa I' GTCGAAGCAAAAATAGTGCAACTCAATAAA [30]
'Taxa J' GTCGAAGCAAAAATAGTGAAACTCAATAGA [30]
;
END;
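A DATA block of this kind can also be generated from an alignment held in memory. The sketch below is an illustration in Python, not the output routine of PAUP or ReadSeq: the function name write_nexus is hypothetical, the matrix is written as a single non-interleaved block (so the INTERLEAVE keyword is left out), and the bracketed ruler and length comments are omitted because they are optional.

```python
def write_nexus(alignment, datatype="DNA"):
    """Return a NEXUS DATA block for a dict mapping taxon name -> aligned sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    nchar = lengths.pop()
    # Column width of the quoted names (two extra characters for the quotes).
    width = max(len(name) for name in alignment) + 2
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(alignment)} NCHAR={nchar};",
        f"FORMAT DATATYPE={datatype} MISSING=? GAP=-;",
        "MATRIX",
    ]
    for name, seq in alignment.items():
        # Names containing blanks must be quoted; embedded quotes are doubled.
        quoted = "'" + name.replace("'", "''") + "'"
        lines.append(f"{quoted.ljust(width)} {seq}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"
```

Checking that all sequences have the same length before writing catches the most common cause of a rejected matrix, namely an NCHAR value that does not match the data.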
Phylip format
The first line of the input file contains the
number of species and the number of characters separated by blanks. The
information for each species follows, starting with a ten-character
species name (which can include punctuation marks and blanks), and continuing
with the characters for that species. Phylip format files can be interleaved,
as in the example below, or sequential. More information about the format
is available in the PHYLIP documentation.
4 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL

          TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
          TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
          TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
          TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
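The interleaved layout described above (a header with the number of species and characters, a first block whose lines begin with a ten-character name, then continuation blocks without names) can be read as follows. This is a minimal sketch in Python under those assumptions; the function name read_phylip_interleaved is hypothetical, and the code does not handle the sequential variant or relaxed name lengths.

```python
def read_phylip_interleaved(lines):
    """Return (ntax, nchar, {name: sequence}) from interleaved PHYLIP lines."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    ntax, nchar = (int(x) for x in rows[0].split()[:2])
    names, seqs = [], []
    for i, row in enumerate(rows[1:]):
        if i < ntax:
            # First block: columns 1-10 hold the name, the rest is sequence.
            names.append(row[:10].strip())
            seqs.append(row[10:].replace(" ", ""))
        else:
            # Later blocks repeat the species in the same order, without names.
            seqs[i % ntax] += row.replace(" ", "")
    return ntax, nchar, dict(zip(names, seqs))
```

The declared number of characters is returned rather than enforced, so the caller can compare it with the length of each sequence and report a truncated file.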
3.1 Readseq:
GDE uses the (excellent) biological sequence reading/writing/conversion
utility ReadSeq, written and copyrighted by Don Gilbert.
Readseq is particularly useful as it automatically detects many
sequence formats, and interconverts among them.
Formats added to this release include
+ MSF multi sequence format used by GCG software
+ PAUP's multiple sequence (NEXUS) format
+ PIR/CODATA format used by PIR
+ ASN.1 format used by NCBI
+ Pretty print with various options for nice looking output.
File Formats Conversion:
 1. IG/Stanford         10. Olsen (in-only)
 2. GenBank/GB          11. Phylip3.2
 3. NBRF                12. Phylip
 4. EMBL                13. Plain/Raw
 5. GCG                 14. PIR/CODATA
 6. DNAStrider          15. MSF
 7. Fitch               16. ASN.1
 8. Pearson/Fasta       17. PAUP
 9. Zuker               18. Pretty (out-only)
3.3. Sequence organization and storage.
It is crucially important to organize the sequence dataset that will
be analysed.
This includes:
1. A sequence name (maximum 10 characters) that includes the country
of isolation, the sequence identifier and the year of isolation.
E.g. patient TV001 from Brazil, isolated in 1999: BRtv001.99
2. Sequences always stored in FASTA format (which can be read by most of
the software).
3. A spreadsheet with clinical and patient details.
An example is given on the next page.
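The naming rule in item 1 can be applied consistently with a small helper. The sketch below is in Python and the function name make_sequence_name is hypothetical; writing the country code in upper case, the identifier in lower case and the year as two digits is inferred from the single example BRtv001.99 and is an assumption rather than a stated rule.

```python
def make_sequence_name(country, identifier, year):
    """Build a sequence name such as BRtv001.99 (at most 10 characters)."""
    # Country code in upper case, identifier in lower case, two-digit year.
    name = f"{country.upper()}{identifier.lower()}.{year % 100:02d}"
    if len(name) > 10:
        raise ValueError(f"name {name!r} is longer than 10 characters")
    return name
```

For the example in the text, make_sequence_name("BR", "TV001", 1999) returns "BRtv001.99". Rejecting names longer than 10 characters early avoids silent truncation by programs that use fixed-width name fields, such as the Phylip format.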
The best way to organize a sequence dataset is to develop an SQL database holding patient data, clinical data and sequences. This allows subsets of sequences to be selected according to different criteria, such as viral load level, stage of disease, body compartment, subtype or epidemiological linkage.
Never begin sequence analysis before the dataset is organized!
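As an illustration of such a database, the sketch below uses Python with the built-in sqlite3 module. The table and column names (patient, sample, sequence, viral_load, compartment and so on) are hypothetical choices made for this example, not an established schema, and the rows in the demonstration are invented placeholders rather than real data.

```python
import sqlite3


def build_database(path=":memory:"):
    """Create the three example tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE patient (
            patient_id  TEXT PRIMARY KEY,
            country     TEXT NOT NULL
        );
        CREATE TABLE sample (
            sample_id   INTEGER PRIMARY KEY,
            patient_id  TEXT NOT NULL REFERENCES patient(patient_id),
            year        INTEGER,
            viral_load  INTEGER,
            compartment TEXT
        );
        CREATE TABLE sequence (
            name        TEXT PRIMARY KEY,
            sample_id   INTEGER NOT NULL REFERENCES sample(sample_id),
            subtype     TEXT,
            residues    TEXT NOT NULL
        );
    """)
    return conn


def select_fasta(conn, subtype, min_viral_load):
    """Return, as FASTA text, the sequences matching both criteria."""
    rows = conn.execute(
        """SELECT s.name, s.residues
           FROM sequence AS s
           JOIN sample AS m ON m.sample_id = s.sample_id
           WHERE s.subtype = ? AND m.viral_load >= ?
           ORDER BY s.name""",
        (subtype, min_viral_load),
    )
    return "".join(f">{name}\n{residues}\n" for name, residues in rows)


if __name__ == "__main__":
    conn = build_database()
    # Invented placeholder rows, for demonstration only.
    conn.execute("INSERT INTO patient VALUES ('TV001', 'BR')")
    conn.execute("INSERT INTO sample VALUES (1, 'TV001', 1999, 50000, 'plasma')")
    conn.execute("INSERT INTO sequence VALUES ('BRtv001.99', 1, 'B', 'ACGT')")
    print(select_fasta(conn, "B", 10000), end="")
```

Keeping clinical values in the sample table rather than in the sequence name means that a new selection criterion only requires a different WHERE clause, and the selected subset comes out directly in FASTA format.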
3.4. Sequence database access:
Main Sequence Databases (include DNA, RNA, protein and structure data):
+ Entrez (GenBank+PIR+Medline) at NCBI
+ SRS6 (Sequence Retrieval System) at Sanger Centre - EMBL
+ DDBJ database from Japan
+ Pathways Search. KEGG - Pathway Information database
+ Protein Databases
HIV Specialized Databases:
+ Mycobacterium tuberculosis complex: TubercuList World-Wide Web Server
+ Los Alamos HIV Sequence Database, Los Alamos National Laboratory, USA
+ Los Alamos Immunology Website, the sister site of the sequence database, houses a huge searchable collection of HIV immunological epitopes
+ Los Alamos Drug Resistance Database contains information about anti-HIV drugs and drug-resistance-conferring mutations
+ Retrovirus Resources at NCBI
+ Stanford Drug Resistance Database - Curated database containing RT and Protease sequences for evolutionary and drug resistance studies
+ AIDS Reagent Program - The NIH AIDS Research and Reference Reagent Program provides biological and chemical materials for the study of HIV and related opportunistic infections
3.5. Los Alamos HIV Sequence Database Bioinformatics tools:
3.5. Alignment Reference sets:
The Subtype Reference Set for Phylogenetic Analysis can be downloaded from the Los Alamos HIV Sequence Database.
GDE for HIV Sequence analysis contains several
HIV sequence databases.