Using FASTQC: A Guide to Quality Control in High-Throughput Sequencing

Eli Mernit
January 14, 2025
High-throughput sequencing (HTS) generates large amounts of data, and it's important to ensure the quality of this data for analysis downstream.

One of the most widely used tools for this type of quality control is FASTQC. In this guide, we'll explain what FASTQC is, how to use it, and how to interpret its results.

What is FASTQC?

FASTQC is a quality control tool for high-throughput sequence data. It's designed to find potential issues in raw sequencing reads, and it's compatible with data from various sequencing platforms, including Illumina, Oxford Nanopore, and PacBio.

Key features of FASTQC

  • HTML reports that are easy to interpret
  • Quality score distribution analysis
  • Detection of adapter contamination
  • Identification of overrepresented sequences
  • Assessment of GC content
  • Per-sequence quality score metrics

Installing FASTQC

FASTQC can be installed on most operating systems.

Download FASTQC: Visit the FASTQC download page and download the right version for your operating system.

Install Java: FASTQC requires a Java Runtime Environment (JRE). Check that Java is installed by running: java -version

Install FASTQC: After downloading, extract the compressed file and add the fastqc executable to your PATH for easy access.

Test Installation: Confirm FASTQC is installed by running: fastqc -h
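If you want to check both prerequisites from a script before a pipeline starts, a minimal sketch along these lines may help. It only looks the executables up on your PATH; the function name find_tools is our own, not part of FASTQC.

```python
import shutil


def find_tools(names=("java", "fastqc")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report where each prerequisite was found, if at all.
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

A None value means the shell would not find that command either, so fix your PATH before moving on.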

Running FASTQC

You can run FASTQC on individual files or batch-process multiple files. It accepts FASTQ, BAM, or SAM files as input.

Basic Command:

fastqc filename.fastq

Batch Processing: Analyze multiple files at once:

fastqc *.fastq

Specify an Output Directory: By default, FASTQC writes each report to the same directory as its input file. Use -o to send all results to one directory, which must already exist because FASTQC will not create it:

fastqc -o /path/to/output/ *.fastq

Enable Multi-Threading: Use the -t option to set how many files FASTQC processes in parallel. Each thread handles one file, so this speeds up batch runs rather than the analysis of a single file:

fastqc -t 4 *.fastq
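The same options can be driven from Python, which is handy when a script needs to create the output directory first (FASTQC expects the -o directory to exist). The sketch below only uses the -o and -t flags shown above; the helper names build_fastqc_command and run_fastqc, and the directory name in the usage note, are our own choices.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Return the fastqc argument list for the given input files."""
    if not fastq_files:
        raise ValueError("at least one input file is required")
    return ["fastqc", "-o", str(outdir), "-t", str(threads),
            *[str(f) for f in fastq_files]]


def run_fastqc(fastq_files, outdir, threads=1):
    """Create the output directory, run fastqc, and raise on a non-zero exit."""
    # Create the directory up front so the -o target is guaranteed to exist.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)
```

For example, run_fastqc(sorted(Path(".").glob("*.fastq")), "fastqc_results", threads=4) would mirror the batch command above, assuming fastqc is on your PATH.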

Interpreting FASTQC Reports

FASTQC generates an HTML report for each input file, with a visual representation of the data. These are the main sections of the report:

Basic Statistics: File name, encoding, total sequences, GC content, and more.

Per Base Sequence Quality: Box plots of quality scores at each position in the read. Green, orange, and red background zones mark good, reasonable, and poor quality.
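To get a feel for what the quality axis means, recall that a Phred score Q corresponds to an error probability of 10^(-Q/10). The short sketch below converts scores to probabilities and decodes a quality string, assuming the common Phred+33 encoding (check the encoding reported under Basic Statistics for your own data).

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality line into integer scores, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")
```

So Q20 means roughly 1 wrong call in 100 bases, and Q30 roughly 1 in 1,000.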

Per Sequence Quality Scores: Distribution of quality scores across all reads. A single peak in the high-quality range is ideal.

Per Sequence GC Content: Compares the observed distribution of GC content across reads to a theoretical normal distribution. Deviation may indicate contamination or library bias.
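The observed side of that GC comparison is simple to approximate yourself. The sketch below computes the GC percentage of each read and bins the reads by rounded percentage. Ignoring N and other ambiguous bases is our assumption here, not necessarily how FASTQC counts them.

```python
from collections import Counter


def gc_percent(sequence):
    """Percentage of G and C among the A/C/G/T calls in a read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # no unambiguous bases to measure
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences):
    """Count reads per rounded GC percentage (0 to 100)."""
    return Counter(round(gc_percent(s)) for s in sequences)
```

A clean library from one organism usually gives a single roughly bell-shaped peak. A second peak or a broad shoulder is a hint to look for contamination.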

Adapter Content: Highlights the presence of adapter sequences. Significant adapter contamination may require trimming.

Overrepresented Sequences: Identifies sequences that appear more frequently than expected. These could be adapters or contaminants.
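The idea behind this module can be sketched in a few lines: count identical reads and report any that exceed a fraction of the total. The default of 0.1% below reflects our understanding of FASTQC's warning level and should be treated as an assumption. FASTQC's own implementation also differs in its details, so use this only to build intuition.

```python
from collections import Counter


def overrepresented(sequences, min_fraction=0.001):
    """Return (sequence, count, fraction) for reads above min_fraction of the total."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()  # most frequent first
        if n / total > min_fraction
    ]
```

Sequences returned here are good candidates to compare against known adapter and primer sequences.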

Sequence Length Distribution: Shows the distribution of read lengths. Uneven lengths could indicate trimming or sequencing issues.

Addressing Common Issues

Low Quality Scores: Use a read trimming tool like Trimmomatic or Cutadapt to remove low-quality bases.

Adapter Contamination: Identify and trim adapters using Cutadapt or similar tools.
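To see what adapter trimming does to a read, here is a deliberately naive sketch. It removes an exact copy of the adapter, or an exact adapter prefix hanging off the 3' end, together with the matching quality characters. This is not Cutadapt's algorithm, which tolerates mismatches. The function name and the min_overlap default are illustrative choices.

```python
def trim_adapter(sequence, quality, adapter, min_overlap=3):
    """Cut the read at the first exact adapter match or 3' partial adapter match."""
    cut = sequence.find(adapter)  # full adapter anywhere in the read
    if cut == -1:
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(sequence))
        for k in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:k]):
                cut = len(sequence) - k
                break
    if cut == -1:
        return sequence, quality  # no adapter found; leave the read untouched
    return sequence[:cut], quality[:cut]
```

For real data, prefer Cutadapt or Trimmomatic, and rerun FASTQC afterwards to confirm the Adapter Content module has improved.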

GC Content Bias: Check for contamination or bias in library preparation.

Overrepresented Sequences: Investigate sequences flagged by FASTQC to identify potential contaminants or artifacts.

Integrating FASTQC in Pipelines

FASTQC can be incorporated into bioinformatics workflows for automated quality checks. Tools like Snakemake, Nextflow, or bash scripts can simplify its integration.

Here's an example Snakemake rule:

rule fastqc:
    input:
        "{sample}.fastq"
    output:
        "{sample}_fastqc.zip",
        "{sample}_fastqc.html"
    shell:
        "fastqc {input} -o ./"
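Once a pipeline produces the _fastqc.zip files, a later step can read the pass/warn/fail status of each module without opening the HTML report. The sketch below assumes the archive contains a tab-separated summary.txt with the status in the first column and the module name in the second. The function names are our own.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FASTQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if n.endswith("summary.txt")]
        if not names:
            raise FileNotFoundError(f"no summary.txt found in {zip_path}")
        text = archive.read(names[0]).decode("utf-8")
    summary = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:  # status, module name, then the input file name
            summary[parts[1]] = parts[0]
    return summary


def failed_modules(summary):
    """List the modules whose status is FAIL, sorted by name."""
    return sorted(module for module, status in summary.items() if status == "FAIL")
```

A pipeline step could call failed_modules on each sample and stop, or flag the sample for review, when the list is not empty. Whether a FAIL should block the run depends on the experiment, since some library types trip certain modules by design.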

FASTQC is an important tool for any sequencing workflow, providing insight into the quality of your data and potential areas for improvement. Running FASTQC routinely on your datasets helps you catch quality problems before they affect your analyses.
