AgentSkillsCN

bio-sequence-similarity

使用迭代 BLAST(PSI-BLAST)、轮廓 HMM(HMMER)以及互惠最佳匹配分析,识别同源序列。在识别标准 BLAST 不够灵敏的同源基因、远缘同源序列,或蛋白质家族成员时,可选用此功能。

SKILL.md
--- frontmatter
name: bio-sequence-similarity
description: Find homologous sequences using iterative BLAST (PSI-BLAST), profile HMMs (HMMER), and reciprocal best hit analysis. Use when identifying orthologs, distant homologs, or protein family members where standard BLAST is not sensitive enough.
tool_type: mixed
primary_tool: BLAST+

Sequence Similarity Searches

Advanced methods for finding homologous sequences beyond standard BLAST.

PSI-BLAST (Position-Specific Iterated BLAST)

Builds a position-specific scoring matrix (PSSM) through iterations to find distant homologs.

Basic PSI-BLAST

bash
psiblast -query protein.fasta -db nr -out results.txt -num_iterations 3

Save PSSM for Reuse

bash
psiblast -query protein.fasta -db nr \
    -out results.txt \
    -out_pssm pssm.asn \
    -out_ascii_pssm pssm.txt \
    -num_iterations 5

Use Existing PSSM

bash
psiblast -in_pssm pssm.asn -db nr -out results.txt

Output Format

bash
psiblast -query protein.fasta -db nr \
    -out results.txt \
    -outfmt 6 \
    -num_iterations 3 \
    -inclusion_ethresh 0.001

Key Parameters

bash
psiblast -query protein.fasta -db nr \
    -num_iterations 5 \
    -inclusion_ethresh 0.001 \
    -evalue 0.01 \
    -num_threads 8 \
    -out results.txt

PSI-BLAST Parameters

ParameterDefaultDescription
-num_iterations1Number of iterations
-inclusion_ethresh0.002E-value for PSSM inclusion
-evalue10E-value threshold for reporting
-num_threads1CPU threads

HMMER for Profile Searches

HMMER uses profile hidden Markov models for sensitive sequence searches.

Search with Single Sequence

bash
jackhmmer -o results.txt -A aligned.sto --cpu 8 query.fasta database.fasta

Build Profile from Alignment

bash
hmmbuild profile.hmm alignment.sto

Search Database with Profile

bash
hmmsearch -o results.txt --tblout hits.tbl profile.hmm database.fasta
hmmsearch -o results.txt --domtblout domains.tbl profile.hmm database.fasta

Download Pfam Profiles

bash
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm

Scan Sequence Against Pfam

bash
hmmscan --tblout pfam_hits.tbl --domtblout domains.tbl Pfam-A.hmm query.fasta

Parse HMMER Output

bash
grep -v "^#" hits.tbl | head
awk '$5 < 1e-10' hits.tbl

HMMER Output Columns (--tblout)

ColumnDescription
1Target name
2Accession
3Query name
4Query accession
5E-value (full sequence)
6Score (full sequence)
7Bias
8E-value (best domain)
9Score (best domain)

Reciprocal Best Hit (RBH) Analysis

Find orthologs using bidirectional best hits.

Create BLAST Databases

bash
makeblastdb -in species_A.fasta -dbtype prot -out species_A_db
makeblastdb -in species_B.fasta -dbtype prot -out species_B_db

Bidirectional BLAST

bash
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > A_vs_B.txt
blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > B_vs_A.txt

Find Reciprocal Best Hits

bash
awk 'FNR==NR {a[$1]=$2; next} $2 in a && a[$2]==$1 {print $1"\t"$2}' \
    A_vs_B.txt B_vs_A.txt > reciprocal_best_hits.txt

Python RBH Script

python
def find_rbh(forward_blast, reverse_blast):
    '''Find reciprocal best hits from BLAST results'''
    forward = {}
    with open(forward_blast) as f:
        for line in f:
            parts = line.strip().split('\t')
            query, subject = parts[0], parts[1]
            if query not in forward:
                forward[query] = subject

    reverse = {}
    with open(reverse_blast) as f:
        for line in f:
            parts = line.strip().split('\t')
            query, subject = parts[0], parts[1]
            if query not in reverse:
                reverse[query] = subject

    rbh = []
    for a, b in forward.items():
        if b in reverse and reverse[b] == a:
            rbh.append((a, b))

    return rbh

rbh_pairs = find_rbh('A_vs_B.txt', 'B_vs_A.txt')
for a, b in rbh_pairs:
    print(f'{a}\t{b}')

Delta-BLAST

Uses conserved domain database for more sensitive initial search.

bash
deltablast -query protein.fasta -db nr -rpsdb cdd_delta -out results.txt

PHI-BLAST (Pattern-Hit Initiated)

Search with a pattern plus sequence.

bash
phi_pattern="G-x(2)-[ST]-x-[RK]"
phiblast -query protein.fasta -db nr -pattern "$phi_pattern" -out results.txt

Iterative Search with Biopython

python
from Bio.Blast import NCBIWWW, NCBIXML

with open('query.fasta') as f:
    query = f.read()

result_handle = NCBIWWW.qblast('psiblast', 'nr', query, expect=0.001, word_size=3)

with open('psiblast_result.xml', 'w') as out:
    out.write(result_handle.read())
result_handle.close()

with open('psiblast_result.xml') as f:
    records = NCBIXML.parse(f)
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < 1e-10:
                    print(f'{alignment.hit_def[:50]}: E={hsp.expect}')

HMMER with Biopython

python
from Bio import SearchIO

results = SearchIO.parse('hmmsearch_output.txt', 'hmmer3-text')
for query_result in results:
    print(f'Query: {query_result.id}')
    for hit in query_result:
        print(f'  Hit: {hit.id}, E-value: {hit.evalue}')
        for hsp in hit:
            print(f'    Domain: {hsp.bitscore} bits')

Jackhmmer (Iterative HMMER)

Similar to PSI-BLAST but uses HMM profiles.

bash
jackhmmer -N 5 -o results.txt --tblout hits.tbl query.fasta database.fasta
jackhmmer -N 5 -A iterations.sto --chkhmm checkpoint query.fasta database.fasta

OrthoFinder for Multi-Species Orthologs

bash
orthofinder -f proteomes/ -t 8
orthofinder -f proteomes/ -t 8 -M msa

Prepare Input

bash
mkdir proteomes
cp species_*.fasta proteomes/

Output Files

FileContent
Orthogroups.tsvAll orthogroups
Orthogroups_SingleCopyOrthologues.txt1:1 orthologs
Species_Tree/Inferred species tree
Gene_Trees/Individual gene trees

E-value vs Bit Score

E-valueInterpretation
< 1e-50Highly significant, likely homolog
1e-50 to 1e-10Significant, probable homolog
1e-10 to 1e-3Marginal, possible remote homolog
> 0.01Not significant

Complete Ortholog Finding Pipeline

bash
#!/bin/bash
SPECIES_A=$1
SPECIES_B=$2
EVALUE=1e-10
THREADS=8

echo "Building databases..."
makeblastdb -in $SPECIES_A -dbtype prot -out db_A
makeblastdb -in $SPECIES_B -dbtype prot -out db_B

echo "Running forward BLAST..."
blastp -query $SPECIES_A -db db_B -outfmt 6 -evalue $EVALUE \
    -max_target_seqs 1 -num_threads $THREADS > forward.txt

echo "Running reverse BLAST..."
blastp -query $SPECIES_B -db db_A -outfmt 6 -evalue $EVALUE \
    -max_target_seqs 1 -num_threads $THREADS > reverse.txt

echo "Finding reciprocal best hits..."
awk 'FNR==NR {best[$1]=$2; next}
     $2 in best && best[$2]==$1 {print $1"\t"$2}' \
     forward.txt reverse.txt > orthologs.txt

echo "Found $(wc -l < orthologs.txt) ortholog pairs"

rm -f db_A.* db_B.*

Related Skills

  • blast-searches - Basic remote BLAST
  • local-blast - Local BLAST databases
  • entrez-fetch - Download sequences
  • alignment - Align identified homologs