Grep

Master Pattern Matching and Text Searching from the Command Line


If you’ve ever “googled” something, you’re already familiar with using a brand name as a verb. In the Unix world, programmers do the same thing with “grep.” But grep isn’t just slang—it’s one of the most essential and powerful commands in your terminal toolkit.

The Origin Story

The name “grep” stands for “global/regular expression/print,” which describes a common sequence of operations in early Unix text editors. Today, it’s evolved into a versatile command-line program that searches through text files to find patterns. Once you master grep, you’ll wonder how you ever worked without it.

Creating a Practice File

Before we dive into grep, let’s create a simple text file to practice with. We’ll use the echo command with redirection (>) to create content:

$ echo "This is line 1" > practice.txt
$ echo "This line contains the word PATTERN" >> practice.txt
$ echo "Another line here" >> practice.txt
$ echo "pattern appears again in lowercase" >> practice.txt
$ echo "You can also find this online" >> practice.txt
$ echo "Final line without the word" >> practice.txt

Note: We use > for the first line (creates/overwrites the file) and >> for subsequent lines (appends to the file).
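If you'd rather not type six separate echo commands, the same file can be written in one step. This is a minimal sketch using a here-document, assuming a POSIX-style shell; the EOF label is an arbitrary delimiter chosen for this example.

```shell
# Write all six practice lines at once with a here-document.
# Quoting the delimiter ('EOF') stops the shell from expanding anything inside.
cat > practice.txt <<'EOF'
This is line 1
This line contains the word PATTERN
Another line here
pattern appears again in lowercase
You can also find this online
Final line without the word
EOF

# Confirm that the file has six lines.
wc -l < practice.txt
```

Like the first echo above, the single > overwrites practice.txt if it already exists.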

Let’s verify our file was created:

$ cat practice.txt
This is line 1
This line contains the word PATTERN
Another line here
pattern appears again in lowercase
You can also find this online
Final line without the word

Basic Searching: Finding Patterns in Text

The basic syntax for grep is straightforward:

grep <pattern> <filename>

Let’s search for lines containing the word “line”:

$ grep line practice.txt
This is line 1
This line contains the word PATTERN
Another line here
You can also find this online
Final line without the word

Grep returns every line containing the letters “line.” That is five of the six lines in the file, including the one where “line” appears only as part of the word “online.”

Case-Insensitive Search: The -i Flag

By default, grep is case-sensitive. Let’s search for “pattern”:

$ grep pattern practice.txt
pattern appears again in lowercase

It only found the lowercase version. To make grep ignore case, use the -i flag:

$ grep -i pattern practice.txt
This line contains the word PATTERN
pattern appears again in lowercase

Now we found both uppercase and lowercase versions!

Searching for Whole Words: The -w Flag

What if you only want to match complete words, not fragments? The -w flag restricts matches to word boundaries:

$ grep -w line practice.txt
This is line 1
This line contains the word PATTERN
Another line here
Final line without the word

This finds “line” as a standalone word. Without -w, “line” would also match words like “online” or “outline.”

Adding Line Numbers: The -n Flag

When working with larger files, knowing which line contains your match is invaluable. The -n flag adds line numbers:

$ grep -n line practice.txt
1:This is line 1
2:This line contains the word PATTERN
3:Another line here
5:You can also find this online
6:Final line without the word

Inverting Your Search: The -v Flag

Sometimes you want to find everything that doesn’t match a pattern. The -v flag inverts your search:

$ grep -v line practice.txt
pattern appears again in lowercase

This returns the only line that doesn’t contain “line” anywhere in it.

Getting Context: The -A Flag

The -A flag (which stands for “After”) prints additional lines following each match:

$ grep -A 1 "word PATTERN" practice.txt
This line contains the word PATTERN
Another line here

The -A 1 flag shows us the line immediately following our match, giving us context.
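Context works in the other direction too. The sketch below shows -B (lines before the match) and -C (lines on both sides). It rebuilds the practice file in a scratch directory so it can run on its own; the demo variable and the directory are assumptions made for this example.

```shell
# Recreate the practice file in a scratch directory so the example is self-contained.
demo=$(mktemp -d)
printf '%s\n' \
  "This is line 1" \
  "This line contains the word PATTERN" \
  "Another line here" \
  "pattern appears again in lowercase" \
  "You can also find this online" \
  "Final line without the word" > "$demo/practice.txt"

# -B 1 prints one line before each match.
before=$(grep -B 1 "word PATTERN" "$demo/practice.txt")
printf '%s\n' "$before"

# -C 1 prints one line before and one line after each match.
around=$(grep -C 1 "word PATTERN" "$demo/practice.txt")
printf '%s\n' "$around"
```

With -C 1 the match on line 2 is printed together with lines 1 and 3.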

Real-World Example: Searching FASTA Files

Now let’s apply grep to a real bioinformatics scenario. We will be working with a file called sequences.fasta.

Let’s view our FASTA file:

$ cat sequences.fasta

You should see the following:

>gene_001 hypothetical protein
ATGCGATCGATCGATCG
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_004 transcription factor
CGCGCGCGCGCGCGCG
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
>gene_004 transcription factor
CGCGCGCGCGCGCGCG
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_003 hypothetical protein
TTAATTAATTAATTAA

Finding Specific Gene Headers

Let’s find all headers containing “hypothetical”:

$ grep hypothetical sequences.fasta
>gene_001 hypothetical protein
>gene_003 hypothetical protein
>gene_003 hypothetical protein
>gene_003 hypothetical protein

Extracting Headers and Sequences Together

Here’s where grep gets powerful for bioinformatics. Use -A 1 to get each header plus its sequence:

$ grep -A 1 "hypothetical" sequences.fasta
>gene_001 hypothetical protein
ATGCGATCGATCGATCG
--
>gene_003 hypothetical protein
TTAATTAATTAATTAA
--
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_003 hypothetical protein
TTAATTAATTAATTAA

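The -- is a separator grep prints between groups of results that are not adjacent in the file. The last two gene_003 records sit next to each other, so no separator appears between them.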
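If you plan to save the result as a new FASTA file, those separator lines need to go. One portable way, sketched here on a shortened three-record copy of the file (the scratch directory and the demo variable are assumptions for this example), is to pipe the output through a second grep that drops lines consisting only of two hyphens.

```shell
# Shortened copy of sequences.fasta in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' \
  ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" \
  ">gene_002 kinase domain" "GCTAGCTAGCTAGCTAG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" > "$demo/sequences.fasta"

# The first grep prints each matching header plus its sequence, with "--"
# between the two groups. The second grep removes lines that are exactly "--".
records=$(grep -A 1 "hypothetical" "$demo/sequences.fasta" | grep -v '^--$')
printf '%s\n' "$records"
```

GNU grep also has a --no-group-separator option that suppresses the separator directly, but it may not be available in other grep implementations, so the pipe above is the safer default.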

Finding All Headers

To extract all FASTA headers, search for lines starting with “>”:

$ grep "^>" sequences.fasta
>gene_001 hypothetical protein
>gene_002 kinase domain
>gene_003 hypothetical protein
>gene_004 transcription factor
>gene_002 kinase domain
>gene_004 transcription factor
>gene_003 hypothetical protein
>gene_003 hypothetical protein

The ^ symbol means “starts with,” so ^> finds lines beginning with “>”.
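The counterpart of ^ is $, which anchors a pattern to the end of a line. As a small sketch on a shortened copy of the file (the scratch directory and demo variable are assumptions for this example), the command below keeps only headers whose description ends in “protein”. Single quotes are used so the shell leaves the $ alone.

```shell
# Shortened copy of sequences.fasta in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' \
  ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" \
  ">gene_002 kinase domain" "GCTAGCTAGCTAGCTAG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" > "$demo/sequences.fasta"

# "$" matches at the end of the line, so only lines ending in "protein" match.
ends_with_protein=$(grep 'protein$' "$demo/sequences.fasta")
printf '%s\n' "$ends_with_protein"

# Both anchors together: a header line that ends in "domain".
domain_headers=$(grep '^>.*domain$' "$demo/sequences.fasta")
printf '%s\n' "$domain_headers"
```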

Counting Sequences

Use the -c flag to count how many sequences are in your file:

$ grep -c "^>" sequences.fasta
8

The -c flag counts matching lines instead of displaying them.
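A count of headers is not the same as a count of distinct genes, since a file can contain the same header more than once. One way to check, sketched on a shortened sample with a repeated gene_003 record (the scratch directory and variable names are assumptions for this example), is to pipe the headers through sort and uniq.

```shell
# Shortened sample in a scratch directory; gene_003 appears twice.
demo=$(mktemp -d)
printf '%s\n' \
  ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" \
  ">gene_002 kinase domain" "GCTAGCTAGCTAGCTAG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" > "$demo/sequences.fasta"

# Total number of header lines.
total=$(grep -c "^>" "$demo/sequences.fasta")

# Number of distinct header lines (sort -u removes duplicates).
unique=$(grep "^>" "$demo/sequences.fasta" | sort -u | wc -l | tr -d ' ')
echo "total headers: $total, distinct headers: $unique"

# How often each header occurs.
grep "^>" "$demo/sequences.fasta" | sort | uniq -c
```

uniq only collapses adjacent duplicates, which is why the headers are sorted first.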

Excluding Specific Genes

Want to find all genes except hypothetical proteins?

$ grep "^>" sequences.fasta | grep -v hypothetical
>gene_002 kinase domain
>gene_004 transcription factor
>gene_002 kinase domain
>gene_004 transcription factor

Here we pipe the output of the first grep (all headers) into a second grep with -v to exclude lines containing “hypothetical.”

Finding Sequences with Specific Patterns

Let’s search for sequences containing “GCTA”:

$ grep GCTA sequences.fasta
GCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGCTAG

To see which gene this belongs to, use -B 1 (Before) to show the line before:

$ grep -B 1 GCTA sequences.fasta
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
--
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG

Combining Multiple Flags

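Flags and pipes can be combined. The command below prints every header with its line number and the sequence that follows it, then filters out lines containing “hypothetical.” In the output, a colon after the line number marks a matching line and a hyphen marks a context line. Be aware that grep -v removes only the hypothetical headers themselves; their sequence lines (2, 6, 14, and 16) stay in the output without a header, so this is not a clean way to drop whole records: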

$ grep -n -A 1 "^>" sequences.fasta | grep -v hypothetical
2-ATGCGATCGATCGATCG
3:>gene_002 kinase domain
4-GCTAGCTAGCTAGCTAG
6-TTAATTAATTAATTAA
7:>gene_004 transcription factor
8-CGCGCGCGCGCGCGCG
9:>gene_002 kinase domain
10-GCTAGCTAGCTAGCTAG
11:>gene_004 transcription factor
12-CGCGCGCGCGCGCGCG
14-TTAATTAATTAATTAA
16-TTAATTAATTAATTAA
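Grep works one line at a time and has no notion of a FASTA record, so the sequences of the hypothetical proteins are left behind in the output above. To drop whole records, one option is awk rather than grep. The following is a sketch under stated assumptions, not the only way to do it: it uses a shortened sample in a scratch directory, and the keep variable is a name chosen for this example.

```shell
# Shortened sample in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' \
  ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" \
  ">gene_002 kinase domain" "GCTAGCTAGCTAGCTAG" \
  ">gene_003 hypothetical protein" "TTAATTAATTAATTAA" > "$demo/sequences.fasta"

# On each header line, decide whether the record should be kept.
# The bare "keep" pattern then prints every line while the flag is set,
# so a header and its sequence are kept or dropped together.
kept=$(awk '/^>/ { keep = ($0 !~ /hypothetical/) } keep' "$demo/sequences.fasta")
printf '%s\n' "$kept"
```

Because the decision is made per header, this also works when a sequence spans several lines.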

Working with VCF Files: Finding Genetic Variants

Now let’s explore grep with VCF (Variant Call Format) files, a standard format in genomics for storing gene sequence variations.

A Brief History of VCF Files

VCF (Variant Call Format) was developed in 2010 by the 1000 Genomes Project as a standardized text-based format for storing genetic variants discovered through DNA sequencing. Before VCF, researchers used various incompatible formats, making data sharing difficult. VCF solved this by providing a unified structure with a header section (metadata about the file) and data lines (individual variants with their chromosome position, reference allele, alternate allele, and quality scores). The format quickly became the gold standard, adopted by major genomics projects and sequencing platforms worldwide. Today, VCF files are essential for storing everything from SNPs (single nucleotide polymorphisms) to structural variants in both research and clinical genomics.

We’ll be working with a VCF file called variants.vcf. Let’s first confirm we have the file and examine its structure:

$ cat variants.vcf

This file is deliberately small for the sake of practice; real-world VCF files are often far larger.

##fileformat=VCFv4.2
##reference=hg38
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    12345   rs123456        A       G       99      PASS    DP=50;AF=0.45
chr1    67890   rs789012        C       T       85      PASS    DP=45;AF=0.38
chr2    23456   rs234567        G       A       95      PASS    DP=60;AF=0.52
chr2    78901   rs890123        T       C       75      LowQual DP=20;AF=0.25
chr3    34567   rs345678        A       T       100     PASS    DP=70;AF=0.48
chr3    89012   rs901234        C       G       60      LowQual DP=15;AF=0.30
chrX    45678   rs456789        G       C       98      PASS    DP=55;AF=0.42
chrX    90123   rs012345        T       A       88      PASS    DP=48;AF=0.40

Finding All PASS Variants

To find high-quality variants that passed all filters:

$ grep PASS variants.vcf
chr1    12345   rs123456        A       G       99      PASS    DP=50;AF=0.45
chr1    67890   rs789012        C       T       85      PASS    DP=45;AF=0.38
chr2    23456   rs234567        G       A       95      PASS    DP=60;AF=0.52
chr3    34567   rs345678        A       T       100     PASS    DP=70;AF=0.48
chrX    45678   rs456789        G       C       98      PASS    DP=55;AF=0.42
chrX    90123   rs012345        T       A       88      PASS    DP=48;AF=0.40

Extracting Variants from a Specific Chromosome

To find all variants on chromosome 2:

$ grep "^chr2" variants.vcf
chr2    23456   rs234567        G       A       95      PASS    DP=60;AF=0.52
chr2    78901   rs890123        T       C       75      LowQual DP=20;AF=0.25

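The ^ anchors the pattern to the start of the line, so “chr2” is matched only in the chromosome column. On its own, ^chr2 would still match “chr20” or “chr21” in a larger file; adding -w (grep -w "^chr2") restricts the match to chromosome 2.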
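A pattern such as ^chr2 matches any line that begins with those four characters, and “chr20” begins with them too. The sketch below makes the difference visible on a two-line, tab-separated sample. The chr20 row, including its rsID, is invented for this illustration and is not part of variants.vcf.

```shell
# Two-line tab-separated sample in a scratch directory.
# The chr20 row is hypothetical and exists only to show the problem.
demo=$(mktemp -d)
{
  printf 'chr2\t23456\trs234567\tG\tA\t95\tPASS\tDP=60;AF=0.52\n'
  printf 'chr20\t11111\trs000001\tC\tT\t90\tPASS\tDP=40;AF=0.33\n'
} > "$demo/sample.vcf"

# Anchoring alone: both chr2 and chr20 start with "chr2".
loose=$(grep -c "^chr2" "$demo/sample.vcf")

# Adding -w requires "chr2" to end at a word boundary, so chr20 is excluded.
strict=$(grep -c -w "^chr2" "$demo/sample.vcf")

echo "matches for ^chr2 alone: $loose, with -w: $strict"
```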

Finding X-Chromosome Variants

$ grep "^chrX" variants.vcf
chrX    45678   rs456789        G       C       98      PASS    DP=55;AF=0.42
chrX    90123   rs012345        T       A       88      PASS    DP=48;AF=0.40

Searching for Specific SNP IDs

To find a particular variant by its rsID (dbSNP identifier):

$ grep "rs234567" variants.vcf
chr2    23456   rs234567        G       A       95      PASS    DP=60;AF=0.52

This is useful when you need to verify if a known variant from a study or database exists in your dataset.
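Two refinements are worth knowing here, shown as a sketch on a small tab-separated sample. A plain search for rs234567 would also match a longer ID such as rs2345678, which -w prevents. And when you have a whole list of IDs, -f reads the patterns from a file while -F treats them as fixed strings. The chr5 row and the ids.txt file name are assumptions made up for this example.

```shell
# Tab-separated sample in a scratch directory.
# The chr5 row (rs2345678) is hypothetical and only there to show the problem.
demo=$(mktemp -d)
{
  printf 'chr2\t23456\trs234567\tG\tA\t95\tPASS\tDP=60;AF=0.52\n'
  printf 'chr5\t55555\trs2345678\tC\tT\t90\tPASS\tDP=40;AF=0.33\n'
  printf 'chr3\t34567\trs345678\tA\tT\t100\tPASS\tDP=70;AF=0.48\n'
} > "$demo/sample.vcf"

# Substring search: matches rs234567 and rs2345678.
substring_hits=$(grep -c "rs234567" "$demo/sample.vcf")

# Whole-word search: matches rs234567 only.
exact_hits=$(grep -c -w "rs234567" "$demo/sample.vcf")
echo "substring: $substring_hits, whole word: $exact_hits"

# Look up several IDs at once: one ID per line in a pattern file.
printf '%s\n' rs234567 rs345678 > "$demo/ids.txt"
listed=$(grep -w -F -f "$demo/ids.txt" "$demo/sample.vcf")
printf '%s\n' "$listed"
```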

Finding Low Quality Variants

To identify variants that failed quality filters:

$ grep "LowQual" variants.vcf
chr2    78901   rs890123        T       C       75      LowQual DP=20;AF=0.25
chr3    89012   rs901234        C       G       60      LowQual DP=15;AF=0.30

Counting Variants Per Chromosome

Use grep with the -c flag to count the variants on a given chromosome:

$ grep -c "^chr1" variants.vcf
2
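To get a count for every chromosome in one go, grep can strip the header lines and hand the rest to cut, sort, and uniq. This is a sketch on a small sample and assumes the columns are separated by tabs, as the VCF format specifies; the scratch directory and variable names are assumptions for this example.

```shell
# Small tab-separated sample in a scratch directory.
demo=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.45\n'
  printf 'chr1\t67890\trs789012\tC\tT\t85\tPASS\tDP=45;AF=0.38\n'
  printf 'chr2\t23456\trs234567\tG\tA\t95\tPASS\tDP=60;AF=0.52\n'
  printf 'chrX\t45678\trs456789\tG\tC\t98\tPASS\tDP=55;AF=0.42\n'
} > "$demo/variants.vcf"

# Drop header lines, keep only the first column, then count each value.
counts=$(grep -v '^#' "$demo/variants.vcf" | cut -f 1 | sort | uniq -c)
printf '%s\n' "$counts"
```

Each output line shows a count followed by the chromosome name.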

Combining Multiple Criteria

Find all high-quality PASS variants on autosomes (excluding sex chromosomes):

$ grep -v "^#" variants.vcf | grep PASS | grep -v "^chrX" | grep -v "^chrY"
chr1    12345   rs123456        A       G       99      PASS    DP=50;AF=0.45
chr1    67890   rs789012        C       T       85      PASS    DP=45;AF=0.38
chr2    23456   rs234567        G       A       95      PASS    DP=60;AF=0.52
chr3    34567   rs345678        A       T       100     PASS    DP=70;AF=0.48
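The same filter can be written with fewer greps. As a sketch: several patterns can be passed to one grep with repeated -e flags, and a bracket expression such as [XY] matches either letter. The sample below is tab-separated and lives in a scratch directory; its chrY row is invented for this illustration.

```shell
# Small tab-separated sample in a scratch directory.
# The chrY row is hypothetical and only there to exercise the filter.
demo=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.45\n'
  printf 'chr2\t78901\trs890123\tT\tC\t75\tLowQual\tDP=20;AF=0.25\n'
  printf 'chrX\t45678\trs456789\tG\tC\t98\tPASS\tDP=55;AF=0.42\n'
  printf 'chrY\t22222\trs000002\tA\tC\t97\tPASS\tDP=52;AF=0.41\n'
} > "$demo/variants.vcf"

# One grep -v with two patterns removes header lines and chrX/chrY rows;
# the second grep keeps the rows that contain PASS.
autosomal=$(grep -v -e '^#' -e '^chr[XY]' "$demo/variants.vcf" | grep PASS)
printf '%s\n' "$autosomal"
```

With -v, a line is dropped if it matches any of the -e patterns.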

Pro Tips for Grep Mastery

Use quotes for patterns with special characters: Patterns with characters like >, $, ^, or spaces should be quoted.
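The > character deserves special care, because an unquoted > is output redirection as far as the shell is concerned. The sketch below works on throwaway copies in a scratch directory (the file names and the demo variable are assumptions for this example) to show what happens when the quotes are forgotten.

```shell
# Throwaway files in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" > "$demo/sequences.fasta"
cp "$demo/sequences.fasta" "$demo/copy.fasta"

# Quoted: the ">" is part of the pattern, and the file is only read.
quoted=$(grep -c "^>" "$demo/sequences.fasta")
echo "headers found with a quoted pattern: $quoted"

# Unquoted: the shell reads this as "run grep with the pattern ^ and
# write its output to copy.fasta", so the file is emptied before grep starts.
# Standard input comes from /dev/null here so the command does not wait for typing.
grep ^> "$demo/copy.fasta" < /dev/null || true

# The copy is now empty.
wc -c < "$demo/copy.fasta"
```

Run against a real data file, the unquoted version would wipe it, so quoting patterns is a habit worth building early.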

Combine with pipes: Grep works beautifully in pipelines:

$ cat sequences.fasta | grep "^>" | wc -l

Additional useful flags:
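As a starting point, here is a sketch of three flags that often come up alongside the ones covered above: -E for extended regular expressions, -o to print only the matched text, and -l to list the files that contain a match. This is a suggestion rather than a complete list, and the two small files and the demo variable are assumptions for this example. -o is not part of the POSIX standard, though GNU and BSD grep both provide it.

```shell
# Two small FASTA files in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' \
  ">gene_001 hypothetical protein" "ATGCGATCGATCGATCG" \
  ">gene_002 kinase domain" "GCTAGCTAGCTAGCTAG" > "$demo/a.fasta"
printf '%s\n' ">gene_004 transcription factor" "CGCGCGCGCGCGCGCG" > "$demo/b.fasta"

# -E enables extended regular expressions; "|" means "either pattern".
either=$(grep -c -E "kinase|hypothetical" "$demo/a.fasta")
echo "headers matching either word: $either"

# -o prints only the part of the line that matched, one match per line.
ids=$(grep -o "gene_[0-9]*" "$demo/a.fasta")
printf '%s\n' "$ids"

# -l prints the names of files that contain a match, not the matching lines.
files=$(grep -l "transcription" "$demo"/*.fasta)
printf '%s\n' "$files"
```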

Real-World Bioinformatics Applications

Why Grep Matters

In an age of graphical search interfaces, grep might seem old-fashioned. But its speed, flexibility, and ability to integrate with other Unix tools make it irreplaceable for anyone working with text data, especially in bioinformatics where files can contain millions of sequences.

Whether you’re analyzing FASTA files, searching through FASTQ data, filtering GFF annotations, or debugging pipeline output, grep gives you surgical precision in finding exactly what you need.

Practice Makes Perfect

Start with simple searches and gradually incorporate more flags as you become comfortable. Create your own test files, experiment with different flag combinations, and pay attention to how the results change. Before long, you’ll be crafting complex grep commands that slice through massive genomic datasets to find precisely what you need.


Ready to dive deeper? Combine grep with other Unix tools using pipes (see our post on “The Power of Pipes”) to build sophisticated bioinformatics workflows that process millions of sequences in seconds.
