October 4, 2025 3 min to read

Navigating Bioinformatics Datasets

Understanding Core File Formats in Computational Biology Workflows

Getting Started

Downloading the Dataset

Before we can analyze data, we need to obtain it. The sample dataset for this lesson contains various file types commonly encountered in bioinformatics projects.

To download the dataset, run the following commands in your terminal:

git clone https://github.com/bioinformaticsguy/bash_essentials_for_bioinformatics/tree/main
cd data_for_bash_essentials

Once downloaded, you’ll find several file types:

Plain text files (.txt)
Tab-separated values files (.tsv)
Compressed files (.gz)
FastQ sequencing files (.fastq.gz)

Understanding FastQ Files

What Are FastQ Files?

FastQ files are the workhorse of high-throughput sequencing projects. These text-based files store biological sequence data (typically nucleotide sequences) along with quality scores that indicate the confidence of each base call.

FastQ File Structure

Each sequence entry in a FastQ file follows a four-line format:

Sequence Identifier: Begins with @ and contains a unique identifier for the sequence
Nucleotide Sequence: The actual biological sequence (A, T, C, G for DNA; A, U, C, G for RNA)
Separator Line: Begins with + and may optionally repeat the identifier from line 1
Quality Scores: ASCII-encoded characters representing the quality score for each nucleotide

Example:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Why FastQ Files Matter

FastQ files are essential for:

Sequence alignment and mapping
Variant calling and mutation detection
Quality control of sequencing runs
RNA-seq expression analysis
Genome assembly projects

Understanding TSV Files

What Are TSV Files?

TSV (Tab-Separated Values) files store data in a tabular format where:

Each row represents a single record or observation
Columns are separated by tab characters (\t)
The first row often contains column headers

TSV vs CSV

While similar to CSV (Comma-Separated Values) files, TSV files use tabs instead of commas as delimiters. This makes them particularly useful when data fields contain commas, avoiding parsing ambiguity.

Common Uses in Bioinformatics

TSV files frequently store:

Gene expression matrices
Annotation tables
Experimental metadata
Analysis results and summary statistics

Example:

Gene_ID	Sample1	Sample2	Sample3
GENE001	125.4	98.2	142.7
GENE002	45.8	52.1	48.9
GENE003	312.5	289.3	301.2

Working with Compressed Files

Many bioinformatics files use gzip compression (.gz extension) to save storage space. Sequencing data files can be enormous, so compression is standard practice.

Key points:

FastQ files are commonly distributed as .fastq.gz
Most bioinformatics tools can read compressed files directly
Use gunzip to decompress or gzip to compress files

What’s Next?

Now that you understand the file types in your dataset, the next lesson will cover practical commands for:

Exploring file contents
Extracting information from FastQ and TSV files
Navigating directory structures efficiently
Basic quality checks on sequencing data

Stay tuned for Lesson 05: Hands-On Data Exploration!

Key Takeaways

✓ FastQ files store sequencing data with quality scores in a four-line format
✓ TSV files provide a simple, tab-delimited structure for tabular data
✓ Compressed files (.gz) save space while remaining readable by most tools
✓ Understanding file formats is fundamental to bioinformatics workflows.

← Previous Next →

Bioinformatics Guy

Navigating Bioinformatics Datasets

Getting Started

Downloading the Dataset

Understanding FastQ Files

What Are FastQ Files?

FastQ File Structure

Why FastQ Files Matter

Understanding TSV Files

What Are TSV Files?

TSV vs CSV

Common Uses in Bioinformatics

Working with Compressed Files

What’s Next?

Key Takeaways

Viewing File Contents in Bioinformatics

Ali Hassan

Comments

Navigating Bioinformatics Datasets

Getting Started

Downloading the Dataset

Understanding FastQ Files

What Are FastQ Files?

FastQ File Structure

Why FastQ Files Matter

Understanding TSV Files

What Are TSV Files?

TSV vs CSV

Common Uses in Bioinformatics

Working with Compressed Files

What’s Next?

Key Takeaways

Viewing File Contents in Bioinformatics

Don't go yet!

From Command Line to Script

Mastering For Loops in Bash

Ali Hassan

Comments