
3. Accessing Sequence Information:

HIV sequences can be accessed locally through the GDE sequence database menu, which provides several sequence databases.
Sequence information can also be accessed online using the specialized bioinformatics tools described in this section.


3.1. Electropherogram analysis:

To obtain sequence information from sequencer output, we suggest using the phred/phrap/consed programs.
Phred reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files.  Phred can read trace data from SCF files and ABI model 310, 3100 and 377 DNA sequencer chromat files, automatically detecting the file format.
Phrap ("phragment assembly program", or "phil's revised assembly program"; a homonym of "frappe" = French for "swat") is a program for assembling shotgun DNA sequence data.  Key features: allows use of the entire read (not just the trimmed high quality part); uses a combination of user-supplied and internally computed data quality information to improve accuracy of assembly in the presence of repeats; constructs the contig sequence as a mosaic of the highest quality parts of reads (rather than a consensus); provides extensive information about the assembly (including quality values for the contig sequence) to assist trouble-shooting; able to handle very large datasets.  N.B. phrap does not provide editing or viewing capabilities; these are available with consed and phrapview. It is strongly recommended that phrap be used in conjunction with the base calls and base quality values produced by the basecaller, phred, and with the sequence editor/assembly viewer, consed.
Consed is a program for viewing and editing assemblies produced by the phrap assembly program.
A set of scripts has been developed that automatically integrates the phred, phrap and consed programs. A picture of the program in action and the manual are available.
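The quality values that phred assigns are conventionally expressed on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The following is a minimal sketch of that relationship in Python; the function names are hypothetical and are not part of phred, phrap or consed.

```python
import math


def error_probability_to_quality(p_error):
    """Convert a base-call error probability into a phred-scaled quality value."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def quality_to_error_probability(quality):
    """Convert a phred-scaled quality value back into an error probability."""
    return 10.0 ** (-quality / 10.0)
```

Under this scale a quality value of 20 corresponds to an estimated error rate of 1 in 100, and a quality value of 30 to 1 in 1000.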

3.2. Sequence formats:

GenBank format

A sequence file in GenBank format can contain several sequences.
Each sequence entry starts with a line containing the word LOCUS, followed by a number
of annotation lines. The start of the sequence itself is marked by a line
containing "ORIGIN", and the end of the entry is marked by two slashes ("//").

    An example sequence in GenBank format is:

LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
      121 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
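The LOCUS / ORIGIN / "//" layout described above is regular enough to read with a short script. The following is a minimal sketch in Python, not the parser used by GDE or Readseq; the function name parse_genbank and the DEMO0001 record are hypothetical, and annotation lines are simply skipped.

```python
def parse_genbank(text):
    """Return a list of (locus_name, sequence) tuples from GenBank-formatted text."""
    records = []
    name = None
    in_sequence = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # The second field of the LOCUS line is the entry name.
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""
            in_sequence = False
            chunks = []
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            # Two slashes close the current entry.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = None
            in_sequence = False
        elif in_sequence:
            # Drop position numbers and blanks, keep only the residue letters.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records


# Hypothetical record, used only to exercise the function.
EXAMPLE = """LOCUS       DEMO0001      20 bp    DNA
DEFINITION  Hypothetical record for illustration.
ORIGIN
        1 aacctgcgga aggatcatta
//
"""

if __name__ == "__main__":
    for locus, sequence in parse_genbank(EXAMPLE):
        print(locus, len(sequence), sequence)
```

An entry that lacks the closing "//" is not returned by this sketch, which is one simple way to notice truncated files.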
 

FASTA Format

Sequences in FASTA-formatted files are preceded by a line starting with >.
The first word on this line is the name of the sequence. The rest of the line is a description of the sequence. The remaining lines contain the sequence itself.

FASTA files containing multiple sequences are just the same, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs. An example of a FASTA file containing multiple sequences is:

>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is the description of my second sequence.
GTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
>seq3 This is the description of my third sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTCGATGC
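Because the FASTA layout is so simple, it can be read with a few lines of code. The following Python sketch splits each header line into name and description as described above; the function name parse_fasta is hypothetical, and the code is an illustration rather than the reader used by any of the programs mentioned here.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples from FASTA text."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].strip()
            name, _, description = header.partition(" ")
            description = description.strip()
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


if __name__ == "__main__":
    demo = ">seq1 This is the description of my first sequence.\nAGTACG\nTAGTAG\n"
    print(parse_fasta(demo))
```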

Sequence Formats for Phylogenetic Analysis:

Nexus format (PAUP)

#NEXUS
BEGIN DATA;
DIMENSIONS  NTAX=10 NCHAR=30;
FORMAT DATATYPE=DNA  MISSING=? GAP=-  INTERLEAVE ;
MATRIX
 [                      10        20        30]
 [                      .         .         . ]

'Taxa A'      GCAATCATCCAAATCGGTCAACTTAATAAA   [30]
'Taxa B'      GCCATCATCCATAACGGTGAACTTTTAATG   [30]
'Taxa C'      GCCATACTCCATAACGGTGAACTTGTAATA   [30]
'Taxa D'      GCCAAACCCCATATCGTGCAACTTAATAAG   [30]
'Taxa E'      GGCTATCCACCTTAAGTGTAAATTGTTGAT   [30]
'Taxa F'      GGCTATCCAACTATAGTGCAACTTAATACA   [30]
'Taxa G'      GGCTAGGCCAATAATATGAAACTTTTAATG   [30]
'Taxa H'      GTCTAGGCCAAAAATATGAAACTTGTTATA   [30]
'Taxa I'      GTCGAAGCAAAAATAGTGCAACTCAATAAA   [30]
'Taxa J'      GTCGAAGCAAAAATAGTGAAACTCAATAGA   [30]
;
END;

Phylip format

     The first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks; shorter names are padded with blanks), and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential. More information about the Phylip format is available in the PHYLIP documentation.
 
 

     4 123
     seq1      ---------- ---------- ---KSKERYK DENGGNYFQL
     seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL
     seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL
     seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL

      TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
      TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
      TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
      TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
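The header line and the ten-character name field are the two details most often violated when a Phylip file is written by hand. The following Python sketch writes an alignment in the sequential variant of the format; the function name write_phylip_sequential is hypothetical, and names longer than ten characters are simply truncated, which is an assumption of this sketch.

```python
def write_phylip_sequential(records):
    """Format (name, sequence) pairs as sequential Phylip text.

    All sequences must already be aligned, i.e. have the same length.
    """
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all sequences must have the same length")
    # First line: number of species and number of characters.
    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        # The name field is exactly ten characters: truncate or pad with blanks.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(write_phylip_sequential([("seq1", "ACGT"), ("seq2", "AC-T")]), end="")
```

Truncation can make two names identical, so it is worth checking that the shortened names are still unique before running a phylogenetic program.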

Readseq:

GDE uses the (excellent) biological sequence reading/writing/conversion
utility ReadSeq, written and copyrighted by Don Gilbert.

Readseq is particularly useful as it automatically detects many
sequence formats and interconverts among them.
Formats added to this release include:
  + MSF multi sequence format used by GCG software
  + PAUP's multiple sequence (NEXUS) format
  + PIR/CODATA format used by PIR
  + ASN.1 format used by NCBI
  + Pretty print with various options for nice looking output.

    File format conversion:

         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP
         9. Zuker                 18. Pretty (out-only)
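Automatic format detection of the kind Readseq offers usually relies on the characteristic first line of each format. The following Python sketch illustrates the idea for the four formats shown in section 3.2 only; it is not Readseq's actual detection logic, and the function name guess_format is hypothetical.

```python
def guess_format(text):
    """Guess the sequence format from the first non-blank line of the text."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        if stripped.upper().startswith("#NEXUS"):
            return "nexus"
        # A Phylip file opens with two integers: species and characters.
        fields = stripped.split()
        if len(fields) == 2 and all(field.isdigit() for field in fields):
            return "phylip"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1 demo\nACGT\n"))
    print(guess_format("     4 123\n"))
```

Only the first non-blank line is inspected, so a file with leading comments or a damaged header is reported as unknown rather than guessed.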
 

3.3. Sequence organization and storage.

It is crucial to organize the sequence dataset before it is analysed.
This includes:
  1. Sequence name (maximum 10 characters), including country of isolation, sequence identifier and year of isolation.
     E.g. patient TV001 from Brazil, isolated in 1999: BRtv001.99
  2. Sequences always stored in FASTA format (readable by most software).
  3. Spreadsheet with clinical and patient details. Example on the next page.
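The naming rule in item 1 can be enforced with a small helper so that every name in a dataset follows the same pattern. The following Python sketch reproduces the BRtv001.99 example; the function name make_sequence_name, the upper-case country code, the lower-case identifier and the two-digit year are assumptions read off that single example.

```python
def make_sequence_name(country_code, patient_id, year):
    """Build a sequence name such as BRtv001.99 (country, identifier, year)."""
    name = f"{country_code.upper()}{patient_id.lower()}.{year % 100:02d}"
    if len(name) > 10:
        # Ten characters is the limit imposed by formats such as Phylip.
        raise ValueError(f"name {name!r} is longer than 10 characters")
    return name


if __name__ == "__main__":
    print(make_sequence_name("BR", "TV001", 1999))
```

Raising an error for over-long names, instead of truncating them silently, avoids two isolates ending up with the same name.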

The best way to organize a sequence dataset is to develop an SQL database holding patient data, clinical data and sequences. This allows subsets of sequences to be selected according to different criteria, such as viral load level, stage of disease, body compartment, subtype or epidemiological linkage.
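One possible layout for such a database is sketched below using Python's built-in sqlite3 module. The table names, column names and the select_sequences helper are all hypothetical; a real study would adapt them to its own clinical variables.

```python
import sqlite3

# Hypothetical schema: one patient has many samples, one sample has many sequences.
SCHEMA = """
CREATE TABLE patient (
    patient_id TEXT PRIMARY KEY,
    country    TEXT NOT NULL
);
CREATE TABLE clinical_sample (
    sample_id        INTEGER PRIMARY KEY,
    patient_id       TEXT NOT NULL REFERENCES patient(patient_id),
    sample_year      INTEGER,
    viral_load       INTEGER,
    disease_stage    TEXT,
    body_compartment TEXT
);
CREATE TABLE sequence (
    name      TEXT PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES clinical_sample(sample_id),
    subtype   TEXT,
    residues  TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def select_sequences(conn, subtype, min_viral_load):
    """Return (name, residues) for one subtype above a viral load threshold."""
    cursor = conn.execute(
        "SELECT s.name, s.residues FROM sequence AS s "
        "JOIN clinical_sample AS c ON c.sample_id = s.sample_id "
        "WHERE s.subtype = ? AND c.viral_load >= ? ORDER BY s.name",
        (subtype, min_viral_load),
    )
    return cursor.fetchall()
```

The rows returned by select_sequences can be written straight to a FASTA file, which keeps the subset selection reproducible.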

Never begin sequence analysis before the dataset is organized!

3.4. Sequence database access:
 
Main sequence databases (include DNA, RNA, protein and structure data):
Entrez (GenBank + PIR + Medline) at NCBI
SRS6 (Sequence Retrieval System) at Sanger Centre - EMBL
DDBJ (DNA Data Bank of Japan)
KEGG - pathway information database and pathway search

Protein Databases:
SWISS-PROT (amino acid sequences and others)
PIR (Protein Information Resource) (amino acid sequences and others)
PDB (Protein Data Bank) (three-dimensional structures of proteins)

Mycobacterium tuberculosis: TubercuList World-Wide Web Server (Mycobacterium tuberculosis complex)

HIV Specialized Databases:
Los Alamos HIV Sequence Database  Los Alamos National Laboratory, USA
Los Alamos Immunology Website - sister site of the HIV Sequence Database; houses a large searchable collection of HIV immunological epitopes
Los Alamos Drug Resistance Database contains information about anti-HIV drugs and drug-resistance-conferring mutations
Retrovirus Resources at NCBI
Stanford Drug Resistance Database - Curated database containing RT and protease sequences for evolutionary and drug resistance studies.
AIDS Reagent Program - The NIH AIDS Research and Reference Reagent Program provides biological and chemical materials for the study of HIV and related opportunistic infections.

3.5. Los Alamos HIV Sequence Database Bioinformatics tools:

Gapstrip This tool lets you strip out the gaps from your sequences, in preparation for making a tree or other analysis.
Multiple Motif Scan Search HXB2 or your own amino acid sequence for any HLA peptide binding motif.
Primalign Automatically align your primer or sequence fragment to the complete genome alignment. The interface returns the coordinates (HXB2 numbering) and an alignment of the fragment to all sequences in the whole genome alignment.
Epilign Automatically align your amino acid epitope against the alignments available on the Los Alamos website.
SeqPublish Paste your alignment into the window and have it formatted for publication: identical columns are replaced by dashes, and the sequences are printed in blocks of user-determined length.
HXB2 Numbering Engine A quick way to find position numbers in HIV relative to HXB2.
HIV Subtyping using BLAST This website allows subtyping of a new sequence by comparing it to a set of reference sequences using the BLAST local similarity search algorithm.

HIV Subtyping analysis at Los Alamos Sequence Database
Recombinant Identification Program (RIP) - Los Alamos.
SNAP (Synonymous/Non-synonymous Analysis Program)
HYPERMUT  This interface takes a nucleotide alignment and documents the nature and context of nucleotide substitutions in a sequence population relative to a reference sequence.
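Two of the tools listed above, Gapstrip and SeqPublish, perform transformations that are simple enough to sketch locally. The Python code below is an illustration only and not the Los Alamos implementation; it assumes that gap stripping means removing every alignment column that contains a gap, and that "identical" in SeqPublish means identical to the first sequence in the alignment. All function names are hypothetical.

```python
def _check_aligned(records):
    """Raise ValueError unless all sequences have the same length."""
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("all sequences must have the same length")


def strip_gap_columns(records, gap_chars="-."):
    """Remove every alignment column in which at least one sequence has a gap."""
    if not records:
        return []
    _check_aligned(records)
    length = len(records[0][1])
    keep = [
        i for i in range(length)
        if all(seq[i] not in gap_chars for _, seq in records)
    ]
    return [(name, "".join(seq[i] for i in keep)) for name, seq in records]


def format_for_publication(records, block_length=60):
    """Show the first sequence in full and mask matching residues with dashes.

    Existing gap characters are not treated specially in this sketch, so it
    is best applied after the gaps have been stripped.
    """
    if not records:
        return ""
    _check_aligned(records)
    reference = records[0][1]
    masked = [(records[0][0], reference)]
    for name, seq in records[1:]:
        shown = "".join("-" if a == b else a for a, b in zip(seq, reference))
        masked.append((name, shown))
    width = max(len(name) for name, _ in masked) + 2
    blocks = []
    for start in range(0, len(reference), block_length):
        rows = [
            f"{name:<{width}}{seq[start:start + block_length]}"
            for name, seq in masked
        ]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks)
```

Stripping the gap columns first and formatting afterwards avoids confusing a real gap with a dash that marks identity to the first sequence.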

3.6. Alignment reference sets:

The Subtype Reference Set for Phylogenetic Analysis can be downloaded from the Los Alamos HIV Sequence Database.

GDE for HIV Sequence analysis contains several HIV sequence databases.