Working with very large fasta datasets


  • Run FastQC on your data to make sure the format/content is what you expect. Run more QA as needed.
    • Search GTN tutorials with the keyword “qa-qc” for examples.
    • Search Galaxy Help with the keywords “qa-qc” and “fasta” for more help.
  • Assembly result?
    • Consider filtering by length to remove reads that did not assemble.
    • Formatting criteria:
      • All sequence identifiers must be unique.
      • Some tools require that the fasta title line (“>” line) contains only an identifier, with no description content. Use NormalizeFasta to remove the description (all content after the first whitespace) and wrap the sequences to 80 bases.
  • Custom genome, transcriptome, or exome?
    • Only appropriate for smaller genomes (bacterial, viral, most insects).
    • Not appropriate for any mammalian genomes, or some plants/fungi.
    • Sequence identifiers must exactly match those used in all other inputs, or expect problems. See GFF GTF GFF3.
    • Formatting criteria:
      • All sequence identifiers must be unique.
      • ALL tools require that the fasta title line (“>” line) contains only an identifier, with no description content. Use NormalizeFasta to remove the description (all content after the first whitespace) and wrap the sequences to 80 bases.
      • The only exception is when running the MakeBLASTdb tool with an input fasta that is in NCBI BLAST format (see the tool form).
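
To make the formatting criteria above concrete, here is a minimal Python sketch of the same checks applied outside of Galaxy: identifiers must be unique, the title line keeps only the identifier, sequences are wrapped to 80 bases, and short sequences can optionally be dropped. This is not the implementation of the NormalizeFasta tool. The function names `read_fasta` and `normalize_fasta` and the `min_length` parameter are hypothetical and exist only for this illustration. It also builds the whole result in memory, so for very large datasets the Galaxy tools remain the better choice.

```python
import io


def read_fasta(handle):
    """Yield (title, sequence) pairs from a fasta-formatted text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def normalize_fasta(handle, width=80, min_length=0):
    """Return fasta text with identifier-only title lines and wrapped sequences.

    Hypothetical helper for illustration only.
    Raises ValueError on a missing or duplicated identifier.
    """
    seen = set()
    lines = []
    for title, seq in read_fasta(handle):
        tokens = title.split()
        if not tokens:
            raise ValueError("title line without an identifier")
        identifier = tokens[0]  # keep only the first whitespace-delimited token
        if identifier in seen:
            raise ValueError(f"duplicate identifier: {identifier}")
        seen.add(identifier)
        if len(seq) < min_length:
            continue  # optional length filter, e.g. for unassembled reads
        lines.append(">" + identifier)
        # wrap the sequence to `width` bases per line
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    example = io.StringIO(">seq1 some description\nACGTACGT\n>seq2\nACGT\n")
    print(normalize_fasta(example, width=4), end="")
```

Raising an error on duplicate identifiers, rather than silently renaming them, is a deliberate choice in this sketch: identifiers usually need to match other inputs exactly, so a rename would hide the problem instead of fixing it.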