Grep
Master Pattern Matching and Text Searching from the Command Line
If you’ve ever “googled” something, you’re already familiar with using a brand name as a verb. In the Unix world, programmers do the same thing with “grep.” But grep isn’t just slang—it’s one of the most essential and powerful commands in your terminal toolkit.
The Origin Story
The name “grep” stands for “global/regular expression/print.” It comes from g/re/p, a command in the early Unix text editor ed that prints every line matching a regular expression. Today, it’s evolved into a versatile command-line program that searches through text files to find patterns. Once you master grep, you’ll wonder how you ever worked without it.
Creating a Practice File
Before we dive into grep, let’s create a simple text file to practice with. We’ll use the echo command with redirection (>) to create content:
$ echo "This is line 1" > practice.txt
$ echo "This line contains the word PATTERN" >> practice.txt
$ echo "Another line here" >> practice.txt
$ echo "pattern appears again in lowercase" >> practice.txt
$ echo "You can also find this online" >> practice.txt
$ echo "Final line without the word" >> practice.txt
Note: We use > for the first line (creates/overwrites the file) and >> for subsequent lines (appends to the file).
Let’s verify our file was created:
$ cat practice.txt
This is line 1
This line contains the word PATTERN
Another line here
pattern appears again in lowercase
You can also find this online
Final line without the word
Basic Searching: Finding Patterns in Text
The basic syntax for grep is straightforward:
grep <pattern> <filename>
Let’s search for lines containing the word “line”:
$ grep line practice.txt
This is line 1
This line contains the word PATTERN
Another line here
You can also find this online
Final line without the word
Grep returns every line containing the letters “line.” Here that is five of the file’s six lines. Note that the results include the line containing “online,” because grep matches the letters anywhere in a line, not just as a whole word.
Case-Insensitive Search: The -i Flag
By default, grep is case-sensitive. Let’s search for “pattern”:
$ grep pattern practice.txt
pattern appears again in lowercase
It only found the lowercase version. To make grep ignore case, use the -i flag:
$ grep -i pattern practice.txt
This line contains the word PATTERN
pattern appears again in lowercase
Now we found both uppercase and lowercase versions!
Searching for Whole Words: The -w Flag
What if you only want to match complete words, not fragments? The -w flag restricts matches to word boundaries:
$ grep -w line practice.txt
This is line 1
This line contains the word PATTERN
Another line here
Final line without the word
This finds “line” as a standalone word. Without -w, “line” would also match words like “online” or “outline.”
Adding Line Numbers: The -n Flag
When working with larger files, knowing which line contains your match is invaluable. The -n flag adds line numbers:
$ grep -n line practice.txt
1:This is line 1
2:This line contains the word PATTERN
3:Another line here
5:You can also find this online
6:Final line without the word
Inverting Your Search: The -v Flag
Sometimes you want to find everything that doesn’t match a pattern. The -v flag inverts your search:
$ grep -v line practice.txt
pattern appears again in lowercase
This returns the only line that doesn’t contain “line” anywhere in it.
Getting Context: The -A Flag
The -A flag (which stands for “After”) prints additional lines following each match:
$ grep -A 1 "word PATTERN" practice.txt
This line contains the word PATTERN
Another line here
The -A 1 flag shows us the line immediately following our match, giving us context.
Real-World Example: Searching FASTA Files
Now let’s apply grep to a real bioinformatics scenario. We will be working with a file named sequences.fasta.
Let’s view our FASTA file:
$ cat sequences.fasta
You should be able to see the following.
>gene_001 hypothetical protein
ATGCGATCGATCGATCG
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_004 transcription factor
CGCGCGCGCGCGCGCG
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
>gene_004 transcription factor
CGCGCGCGCGCGCGCG
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_003 hypothetical protein
TTAATTAATTAATTAA
Finding Specific Gene Headers
Let’s find all headers containing “hypothetical”:
$ grep hypothetical sequences.fasta
>gene_001 hypothetical protein
>gene_003 hypothetical protein
>gene_003 hypothetical protein
>gene_003 hypothetical protein
Extracting Headers and Sequences Together
Here’s where grep gets powerful for bioinformatics. Use -A 1 to get each header plus its sequence:
$ grep -A 1 "hypothetical" sequences.fasta
>gene_001 hypothetical protein
ATGCGATCGATCGATCG
--
>gene_003 hypothetical protein
TTAATTAATTAATTAA
--
>gene_003 hypothetical protein
TTAATTAATTAATTAA
>gene_003 hypothetical protein
TTAATTAATTAATTAA
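The -- is a separator that grep prints between groups of output lines that are not adjacent in the file. The last two records sit next to each other in sequences.fasta, so no separator appears between them.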
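If the separator lines get in the way, for example when the output should itself be a valid FASTA file, one option is to filter them out with a second grep. The sketch below is only an illustration: it builds a shortened scratch copy of the example file in a temporary directory (the workdir and records names are made up here) and assumes no real line consists of exactly “--”.

```shell
# Scratch copy of the first three example records, so nothing in your
# working directory is overwritten.
workdir=$(mktemp -d)
printf '%s\n' \
  '>gene_001 hypothetical protein' 'ATGCGATCGATCGATCG' \
  '>gene_002 kinase domain' 'GCTAGCTAGCTAGCTAG' \
  '>gene_003 hypothetical protein' 'TTAATTAATTAATTAA' \
  > "$workdir/sequences.fasta"

# -A 1 prints each matching header plus the line after it.
# The second grep drops lines that are exactly "--" (the group separator).
records=$(grep -A 1 "hypothetical" "$workdir/sequences.fasta" | grep -v '^--$')
printf '%s\n' "$records"
```

GNU grep also has a --no-group-separator option that does the same job in one step, but not every grep implementation supports it, so the extra pipe is the more portable choice.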
Finding All Headers
To extract all FASTA headers, search for lines starting with “>”:
$ grep "^>" sequences.fasta
>gene_001 hypothetical protein
>gene_002 kinase domain
>gene_003 hypothetical protein
>gene_004 transcription factor
>gene_002 kinase domain
>gene_004 transcription factor
>gene_003 hypothetical protein
>gene_003 hypothetical protein
The ^ symbol means “starts with,” so ^> finds lines beginning with “>”.
Counting Sequences
Combine grep with -c to count how many sequences are in your file:
$ grep -c "^>" sequences.fasta
8
The -c flag counts matching lines instead of displaying them.
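Because -c counts lines, repeated records are counted every time they appear, and the example file does contain repeated headers. A rough way to compare the total with the number of distinct headers is to pipe the headers through sort -u. This is a sketch on a shortened scratch file; the workdir, total, and unique names are placeholders chosen for this example.

```shell
# Scratch FASTA with one duplicated record.
workdir=$(mktemp -d)
printf '%s\n' \
  '>gene_001 hypothetical protein' 'ATGCGATCGATCGATCG' \
  '>gene_002 kinase domain' 'GCTAGCTAGCTAGCTAG' \
  '>gene_002 kinase domain' 'GCTAGCTAGCTAGCTAG' \
  > "$workdir/sequences.fasta"

# Every header line, duplicates included.
total=$(grep -c "^>" "$workdir/sequences.fasta")

# Distinct header lines only; tr strips the padding some wc versions add.
unique=$(grep "^>" "$workdir/sequences.fasta" | sort -u | wc -l | tr -d ' ')

echo "total records: $total, unique headers: $unique"
```

A gap between the two numbers is a quick hint that the file may need deduplication before further analysis.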
Excluding Specific Genes
Want to find all genes except hypothetical proteins?
$ grep "^>" sequences.fasta | grep -v hypothetical
>gene_002 kinase domain
>gene_004 transcription factor
>gene_002 kinase domain
>gene_004 transcription factor
Here we pipe the output of the first grep (all headers) into a second grep with -v to exclude lines containing “hypothetical.”
Finding Sequences with Specific Patterns
Let’s search for sequences containing “GCTA”:
$ grep GCTA sequences.fasta
GCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGCTAG
To see which gene each sequence belongs to, use -B 1 (Before) to show the line before each match:
$ grep -B 1 GCTA sequences.fasta
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
--
>gene_002 kinase domain
GCTAGCTAGCTAGCTAG
Combining Multiple Flags
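Flags can be combined, and the result can be piped into another grep. Here -n adds line numbers and -A 1 adds the sequence line after each header, and the second grep drops every line containing “hypothetical.” In the numbered output, a colon marks a matching line and a hyphen marks a context line. Note that only the hypothetical header lines are removed; their sequence lines (2, 6, 14, and 16) remain in the output: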
$ grep -n -A 1 "^>" sequences.fasta | grep -v hypothetical
2-ATGCGATCGATCGATCG
3:>gene_002 kinase domain
4-GCTAGCTAGCTAGCTAG
6-TTAATTAATTAATTAA
7:>gene_004 transcription factor
8-CGCGCGCGCGCGCGCG
9:>gene_002 kinase domain
10-GCTAGCTAGCTAGCTAG
11:>gene_004 transcription factor
12-CGCGCGCGCGCGCGCG
14-TTAATTAATTAATTAA
16-TTAATTAATTAATTAA
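The pipeline above filters out hypothetical headers but leaves their sequences behind. One possible way to keep complete records only is to collect the wanted headers first and then feed them back to grep as fixed-string patterns. The sketch below assumes each sequence sits on a single line, as in the example file; keep.txt, workdir, and kept are names invented for this illustration.

```shell
# Scratch copy of four example records.
workdir=$(mktemp -d)
printf '%s\n' \
  '>gene_001 hypothetical protein' 'ATGCGATCGATCGATCG' \
  '>gene_002 kinase domain' 'GCTAGCTAGCTAGCTAG' \
  '>gene_003 hypothetical protein' 'TTAATTAATTAATTAA' \
  '>gene_004 transcription factor' 'CGCGCGCGCGCGCGCG' \
  > "$workdir/sequences.fasta"

# Step 1: list the headers to keep (every header that is not hypothetical).
grep "^>" "$workdir/sequences.fasta" | grep -v hypothetical | sort -u \
  > "$workdir/keep.txt"

# Step 2: -f reads patterns from keep.txt, -F treats them as plain text,
# -x requires a whole-line match, and -A 1 adds each header's sequence.
# The final grep removes the "--" group separators.
kept=$(grep -A 1 -F -x -f "$workdir/keep.txt" "$workdir/sequences.fasta" \
  | grep -v '^--$')
printf '%s\n' "$kept"
```

For FASTA files whose sequences wrap over several lines, -A 1 is no longer enough, and a record-aware tool is a safer choice.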
Working with VCF Files: Finding Genetic Variants
Now let’s explore grep with VCF (Variant Call Format) files, a standard format in genomics for storing gene sequence variations.
A Brief History of VCF Files
VCF (Variant Call Format) was developed in 2010 by the 1000 Genomes Project as a standardized text-based format for storing genetic variants discovered through DNA sequencing. Before VCF, researchers used various incompatible formats, making data sharing difficult. VCF solved this by providing a unified structure with a header section (metadata about the file) and data lines (individual variants with their chromosome position, reference allele, alternate allele, and quality scores). The format quickly became the gold standard, adopted by major genomics projects and sequencing platforms worldwide. Today, VCF files are essential for storing everything from SNPs (single nucleotide polymorphisms) to structural variants in both research and clinical genomics.
We’ll be working with a VCF file called variants.vcf. Let’s first confirm we have the file and examine its structure:
$ cat variants.vcf
This is a small file for the sake of example and practice; real-world VCF files are far larger.
##fileformat=VCFv4.2
##reference=hg38
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 12345 rs123456 A G 99 PASS DP=50;AF=0.45
chr1 67890 rs789012 C T 85 PASS DP=45;AF=0.38
chr2 23456 rs234567 G A 95 PASS DP=60;AF=0.52
chr2 78901 rs890123 T C 75 LowQual DP=20;AF=0.25
chr3 34567 rs345678 A T 100 PASS DP=70;AF=0.48
chr3 89012 rs901234 C G 60 LowQual DP=15;AF=0.30
chrX 45678 rs456789 G C 98 PASS DP=55;AF=0.42
chrX 90123 rs012345 T A 88 PASS DP=48;AF=0.40
Finding All PASS Variants
To find high-quality variants that passed all filters:
$ grep PASS variants.vcf
chr1 12345 rs123456 A G 99 PASS DP=50;AF=0.45
chr1 67890 rs789012 C T 85 PASS DP=45;AF=0.38
chr2 23456 rs234567 G A 95 PASS DP=60;AF=0.52
chr3 34567 rs345678 A T 100 PASS DP=70;AF=0.48
chrX 45678 rs456789 G C 98 PASS DP=55;AF=0.42
chrX 90123 rs012345 T A 88 PASS DP=48;AF=0.40
Extracting Variants from a Specific Chromosome
To find all variants on chromosome 2:
$ grep "^chr2" variants.vcf
chr2 23456 rs234567 G A 95 PASS DP=60;AF=0.52
chr2 78901 rs890123 T C 75 LowQual DP=20;AF=0.25
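The ^ anchors the match to the start of the line, so “chr2” appearing elsewhere in a line is ignored. It does not stop the pattern from also matching “chr20”, “chr21”, or “chr22”. Our small file has no such chromosomes, but on a full genome you should add -w (grep -w "^chr2") or include the whitespace that follows the chromosome name in the pattern.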
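To see how a bare "^chr2" behaves on chromosomes with longer names, here is a small sketch. The demo.vcf file and its chr20 and chr21 rows are invented for this illustration and are not part of variants.vcf; the columns are tab-separated, as in real VCF files.

```shell
# Three made-up data lines: chr2, chr20, and chr21.
workdir=$(mktemp -d)
printf 'chr2\t23456\trs234567\tG\tA\t95\tPASS\tDP=60\n'  > "$workdir/demo.vcf"
printf 'chr20\t11111\trs000001\tC\tT\t90\tPASS\tDP=40\n' >> "$workdir/demo.vcf"
printf 'chr21\t22222\trs000002\tA\tG\t80\tPASS\tDP=30\n' >> "$workdir/demo.vcf"

# Anchor only: chr2, chr20, and chr21 all start with "chr2".
loose=$(grep -c "^chr2" "$workdir/demo.vcf")

# Anchor plus the whitespace after the name: chr2 only.
strict=$(grep -c "^chr2[[:space:]]" "$workdir/demo.vcf")

# Anchor plus -w (whole word): also chr2 only.
word=$(grep -c -w "^chr2" "$workdir/demo.vcf")

echo "loose=$loose strict=$strict word=$word"
```

Either of the stricter forms works; the [[:space:]] version has the small advantage of making the column boundary explicit in the pattern.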
Finding X-Chromosome Variants
$ grep "^chrX" variants.vcf
chrX 45678 rs456789 G C 98 PASS DP=55;AF=0.42
chrX 90123 rs012345 T A 88 PASS DP=48;AF=0.40
Searching for Specific SNP IDs
To find a particular variant by its rsID (dbSNP identifier):
$ grep "rs234567" variants.vcf
chr2 23456 rs234567 G A 95 PASS DP=60;AF=0.52
This is useful when you need to verify if a known variant from a study or database exists in your dataset.
Finding Low Quality Variants
To identify variants that failed quality filters:
$ grep "LowQual" variants.vcf
chr2 78901 rs890123 T C 75 LowQual DP=20;AF=0.25
chr3 89012 rs901234 C G 60 LowQual DP=15;AF=0.30
Counting Variants Per Chromosome
Use the -c flag to count the variants on a chromosome (piping the matches to wc -l gives the same number):
$ grep -c "^chr1" variants.vcf
2
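To get a count for several chromosomes at once, the same grep -c call can be wrapped in a loop. This is a sketch on a shortened, tab-separated scratch copy of the example data; the chromosome list and the workdir and counts names are assumptions made for the illustration.

```shell
# Scratch VCF with two header lines and four data lines.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.45\n'
  printf 'chr1\t67890\trs789012\tC\tT\t85\tPASS\tDP=45;AF=0.38\n'
  printf 'chr2\t23456\trs234567\tG\tA\t95\tPASS\tDP=60;AF=0.52\n'
  printf 'chrX\t45678\trs456789\tG\tC\t98\tPASS\tDP=55;AF=0.42\n'
} > "$workdir/variants.vcf"

counts=$(for chrom in chr1 chr2 chr3 chrX; do
  # grep -c still prints 0 when nothing matches, but exits non-zero;
  # "|| true" keeps that from stopping scripts that run with set -e.
  n=$(grep -c "^${chrom}[[:space:]]" "$workdir/variants.vcf" || true)
  printf '%s\t%s\n' "$chrom" "$n"
done)
printf '%s\n' "$counts"
```

The trailing [[:space:]] in the pattern keeps chr1 from also counting names such as chr10 in a larger file.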
Combining Multiple Criteria
Find all high-quality PASS variants on autosomes (excluding sex chromosomes):
$ grep -v "^#" variants.vcf | grep PASS | grep -v "^chrX" | grep -v "^chrY"
chr1 12345 rs123456 A G 99 PASS DP=50;AF=0.45
chr1 67890 rs789012 C T 85 PASS DP=45;AF=0.38
chr2 23456 rs234567 G A 95 PASS DP=60;AF=0.52
chr3 34567 rs345678 A T 100 PASS DP=70;AF=0.48
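The two sex-chromosome filters can be folded into one pattern with a bracket expression, and -w keeps PASS from matching inside a longer word. The sketch below uses a made-up scratch file (the chrY row is invented for the illustration); note that grep still looks for PASS anywhere in the line, not only in the FILTER column.

```shell
# Scratch VCF: one header line and four data lines.
workdir=$(mktemp -d)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\trs123456\tA\tG\t99\tPASS\tDP=50;AF=0.45\n'
  printf 'chr2\t78901\trs890123\tT\tC\t75\tLowQual\tDP=20;AF=0.25\n'
  printf 'chrX\t45678\trs456789\tG\tC\t98\tPASS\tDP=55;AF=0.42\n'
  printf 'chrY\t11111\trs000003\tA\tC\t97\tPASS\tDP=52;AF=0.41\n'
} > "$workdir/variants.vcf"

# 1) drop header lines, 2) keep lines with the whole word PASS,
# 3) drop lines starting with chrX or chrY ([XY] matches either letter).
autosomal_pass=$(grep -v "^#" "$workdir/variants.vcf" \
  | grep -w PASS \
  | grep -v "^chr[XY]")
printf '%s\n' "$autosomal_pass"
```

When the filter must apply to one column only, a column-aware tool such as awk or a dedicated VCF utility is the more precise choice.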
Pro Tips for Grep Mastery
Use quotes for patterns with special characters: Patterns with characters like >, $, ^, or spaces should be quoted.
Combine with pipes: Grep works beautifully in pipelines:
$ grep "^>" sequences.fasta | wc -l
Additional useful flags:
- -B: shows lines Before the match
- -C: shows lines of Context (both before and after)
- -c: counts matching lines
- -r: searches recursively through directories
- ^: matches start of line
- $: matches end of line
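Two of these flags, -C and -r, have not appeared in the examples so far. Here is a quick sketch of both on throwaway files; the directory layout and file names are invented for the illustration.

```shell
# Throwaway directory tree with three small text files.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/notes"
printf '%s\n' 'line one' 'line two' 'the PATTERN is here' 'line four' 'line five' \
  > "$workdir/project/a.txt"
printf '%s\n' 'nothing to see' > "$workdir/project/notes/b.txt"
printf '%s\n' 'another PATTERN' > "$workdir/project/notes/c.txt"

# -C 1 shows one line of context on each side of the match.
context=$(grep -C 1 PATTERN "$workdir/project/a.txt")
printf '%s\n' "$context"

# -r searches every file under the directory; each hit is printed
# with its file name in front, so counting lines counts the hits.
hits=$(grep -r PATTERN "$workdir/project" | wc -l | tr -d ' ')
echo "matching lines across the tree: $hits"
```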
Real-World Bioinformatics Applications
- Extract specific sequences by ID: Pull out particular genes or contigs from a FASTA file by searching for their unique identifier.
- Find all sequence headers: Get a quick list of all sequence IDs in your FASTA file by searching for lines that start with the “>” character.
- Count the number of sequences: Quickly determine how many sequences are in your FASTA file by counting header lines.
- Search for specific motifs: Find sequences containing a particular DNA pattern, such as promoter elements or restriction sites.
- Extract sequences and their headers: Get both the header and sequence for genes of interest using context flags to show the line following each match.
- Find sequences with ambiguous bases: Identify sequences containing N’s or other non-standard bases that might need further quality control.
- Filter out specific sequences: Remove unwanted sequences from your dataset, such as hypothetical proteins or low-confidence predictions.
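- Check FASTQ file integrity: Estimate the number of reads in a FASTQ file by counting lines that start with the “@” symbol. Quality strings can also begin with “@”, so treat this count as an upper bound; for standard four-line records, dividing the total line count by four is more reliable.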
- Search annotation files: Find specific features in GFF or GTF files, such as all genes on a particular chromosome or all exons in your annotation.
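As an illustration of the FASTQ item above, the sketch below compares the two ways of counting reads. The two-read file is made up, with a quality string that happens to start with “@”, and it assumes the standard four-lines-per-read layout.

```shell
# Made-up FASTQ with two reads; the second quality line starts with "@".
workdir=$(mktemp -d)
printf '%s\n' \
  '@read_1' 'ACGT' '+' 'IIII' \
  '@read_2' 'TTGA' '+' '@III' \
  > "$workdir/reads.fastq"

# Counting "@" lines also counts the quality line, giving 3.
by_at=$(grep -c "^@" "$workdir/reads.fastq")

# Four lines per read: total lines divided by four gives 2.
by_lines=$(( $(wc -l < "$workdir/reads.fastq") / 4 ))

echo "grep count: $by_at, line count / 4: $by_lines"
```

If the two numbers differ, the line-based count is usually the one to trust, provided the file has no wrapped sequence lines.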
Why Grep Matters
In an age of graphical search interfaces, grep might seem old-fashioned. But its speed, flexibility, and ability to integrate with other Unix tools make it irreplaceable for anyone working with text data, especially in bioinformatics where files can contain millions of sequences.
Whether you’re analyzing FASTA files, searching through FASTQ data, filtering GFF annotations, or debugging pipeline output, grep gives you surgical precision in finding exactly what you need.
Practice Makes Perfect
Start with simple searches and gradually incorporate more flags as you become comfortable. Create your own test files, experiment with different flag combinations, and pay attention to how the results change. Before long, you’ll be crafting complex grep commands that slice through massive genomic datasets to find precisely what you need.
Ready to dive deeper? Combine grep with other Unix tools using pipes (see our post on “The Power of Pipes”) to build sophisticated bioinformatics workflows that process millions of sequences in seconds.
