AgentSkillsCN

lucy-ng:CASE

完全从头解析结构——跳过去重,直接根据核磁共振相关性解析结构。当去重未找到匹配、化合物已知为新物质或希望从第一原理解析结构时使用。

SKILL.md
--- frontmatter
name: lucy-ng:CASE
description: Full de novo structure elucidation - skip dereplication and solve the structure from NMR correlations. Use when dereplication returned no matches, the compound is known to be novel, or you want to solve the structure from first principles.

lucy-ng:CASE

Full de novo structure elucidation - skip dereplication and solve the structure from NMR correlations.


Purpose

This skill performs FULL Computer-Assisted Structure Elucidation (CASE) without dereplication. Use this when:

  • Dereplication already returned no matches
  • You know the compound is novel/not in databases
  • You want to solve the structure from first principles
  • You're evaluating AI-based CASE methodology

Domain Knowledge

Reference: For NMR background, peak picking strategy, symmetry detection, dereplication scoring, LSD reference, and ranking interpretation, see the main skill document: skill/SKILL.md

This skill focuses on the CASE procedure (step-by-step execution). The main skill document contains all shared domain knowledge.


Prerequisites

bash
lucy --version || pip install lucy-ng
lucy lsd check  # Must show LSD and outlsd available

Required Data

DataEssential?Purpose
Molecular formulaYESFrom user (HRMS)
13C spectrumYESAll carbon positions
HSQCYESDirect C-H correlations
HMBCYESLong-range correlations
DEPT-135RecommendedMultiplicities (CH, CH2, CH3)
COSYOptionalH-H correlations

Workflow

Supervisor integration: When running under supervisor control, write CASE-PROGRESS.md after each LSD iteration (see Step 7c). This enables the supervisor to detect loops and provide diagnostic guidance.

Step 0: Setup Documentation

bash
mkdir -p analysis

Document all steps in analysis/ as you proceed.

Step 1: Request Molecular Formula

Always ask the user:

code
"Please provide the molecular formula for this unknown compound (typically from HRMS)."

Calculate key values from formula:

  • Total carbons
  • Total hydrogens
  • Heteroatoms (N, O, S, etc.)
  • Degree of unsaturation: DBE = (2C + 2 + N - H) / 2

Step 2: Identify Available Experiments

bash
for dir in */; do
    if [ -f "$dir/acqus" ]; then
        nuc=$(grep "##\$NUC1=" "$dir/acqus" | head -1)
        pp=$(grep "##\$PULPROG=" "$dir/acqus" | head -1)
        echo "Exp $dir: $nuc | $pp"
    fi
done

Map experiments:

  • 1H: zg*
  • 13C: zgdc*, zgpg*
  • DEPT: dept*
  • HSQC: hsqc*
  • HMBC: hmbc*
  • COSY: cosy*

Step 3: Analyze Symmetry

Compare expected vs observed signals:

bash
lucy analyze symmetry <data_dir> <formula>

Or manually:

  1. Count peaks in 13C spectrum
  2. Compare to carbons in formula
  3. If observed < expected → molecule has symmetry

Document:

markdown
## Symmetry Analysis
- Expected carbons (from formula): X
- Observed 13C signals: Y
- Interpretation: [No symmetry / C2 symmetry / etc.]

Step 4: Pick 13C Peaks

bash
lucy pick 1d <13c_experiment>

Or from peaklist.xml if binary data is poor:

  • Extract F1 values from <Peak1D F1="..."/> tags
  • List all carbon shifts

Document all peaks with proposed assignments:

#Shift (ppm)Type (if known)
1187.8Carbonyl?
2152.5C-N?
.........

Step 5: Pick HSQC Peaks

Get raw HSQC peaks:

bash
lucy pick hsqc <hsqc_exp> --format json

Apply DEPT-guided filtering (see skill/SKILL.md Section 3):

  1. Pick DEPT-135 peaks:
bash
lucy pick 1d <dept135_exp> --format json
  1. Match HSQC carbon positions to DEPT carbons within ±1.5 ppm
  2. Extract multiplicities from DEPT sign (positive = CH/CH3, negative = CH2)
  3. If DEPT-90 available, disambiguate CH vs CH3

Document:

  • Which carbons are protonated (have HSQC signals)
  • Which are quaternary (no HSQC signal)
  • Multiplicities (CH, CH2, CH3)

Step 6: Pick HMBC Peaks

Get raw HMBC peaks:

bash
lucy pick hmbc <hmbc_exp> --format json

Apply cross-validation filtering (see skill/SKILL.md Section 3):

  1. Validate each HMBC peak:
    • Carbon position exists in 13C peaks (±1.5 ppm)
    • Proton position exists in HSQC peaks (±0.1 ppm)
  2. Retain only validated correlations

Document all HMBC correlations:

Carbon (ppm)Proton (ppm)Notes
187.87.5Carbonyl to aromatic H
.........

Step 7: Generate LSD Input

Write the LSD file directly using skill knowledge:

Reference:

  • skill/SKILL.md Section 6 (LSD Reference)
  • skill/diagnostic/SKILL.md Section 1 (LSD Command Reference)

Build the LSD file manually:

code
; LSD input for <FORMULA>

; Atom definitions (MULT atom# element hybridization H-count)
MULT 1 C 2 0    ; Carbonyl carbon, sp2, 0H (quaternary)
MULT 2 C 2 1    ; Aromatic CH, sp2, 1H
MULT 3 N 3 1    ; Amine nitrogen, sp3, 1H (NH)
MULT 4 O 2 0    ; Carbonyl oxygen, sp2, 0H
...

; HSQC correlations (MUST come before HMBC)
HSQC 2 2        ; C2 has H2 attached
HSQC 5 5        ; C5 has H5 attached
...

; HMBC correlations
HMBC 1 2        ; C1 correlates to H2
HMBC 1 5        ; C1 correlates to H5
...

; Heteroatom constraints (optional but helpful)
BOND 1 4        ; C1 bonded to O4 (carbonyl)

Critical checks before running:

  • sp2 count is EVEN
  • Hydrogen count matches formula
  • All HSQC commands before HMBC commands
  • NO ELIM command on first run

Step 7b: Iterative HMBC Addition (Minimize Solutions)

CRITICAL: Do NOT add all HMBC correlations at once!

Adding too many HMBC correlations often leads to 0 solutions (over-constrained) due to:

  • Noise artifacts in the HMBC spectrum
  • Long-range correlations (⁴J+) that exceed LSD's default 2-3 bond assumption
  • Overlapping or incorrectly assigned peaks

Strategy: Gradually add HMBC correlations until solutions reach a minimum > 0

  1. Start with high-confidence correlations only (5-7 strongest peaks)
  2. Run LSD and check solution count
  3. Add 1-2 more correlations at a time
  4. Re-run LSD after each addition
  5. Stop when solutions are minimized but still > 0

Workflow example:

bash
# Start with base correlations
cp compound_base.lsd compound_test.lsd
lsd compound_test.lsd 2>&1 | grep solution
# → "47 solutions found"

# Add HMBC 4 9
echo "HMBC 4 9" >> compound_test.lsd
lsd compound_test.lsd 2>&1 | grep solution
# → "12 solutions found"

# Add HMBC 5 9
echo "HMBC 5 9" >> compound_test.lsd
lsd compound_test.lsd 2>&1 | grep solution
# → "1 solution found" ✓ IDEAL!

# If we add one more and get 0 solutions, remove it!

Tracking table (recommended):

HMBC CountCorrelations AddedSolutionsAction
5Base set47Add more
7+ C1→H7, C2→H1012Add more
8+ C8→H106Add more
9+ C6→H96Add more
10+ C4→H95Add more
11+ C5→H91STOP - Ideal!
12+ C3→H40Remove last

Key principles:

  • Ideal: 1 solution — uniquely determined structure
  • Acceptable: 2-10 solutions — can rank by 13C prediction
  • 0 solutions — over-constrained, remove last correlation(s)
  • Never use ELIM to "fix" 0 solutions — it masks the real problem

Prioritize correlations by:

  1. Intensity (stronger peaks are more reliable)
  2. Proximity to known fragment assignments
  3. Correlations that connect unassigned regions

Step 7c: Write Progress Checkpoint (CASE-PROGRESS.md)

After EVERY LSD iteration (including the baseline run), append an iteration entry to CASE-PROGRESS.md in the compound's working directory. This file is read by the supervisor agent to monitor progress, detect loops, and provide diagnostic guidance.

First iteration: Create the file with header section:

markdown
# CASE Progress Log

**Compound:** <compound_path>
**Formula:** <molecular_formula>
**Started:** <timestamp>

Each iteration: Append a new section:

markdown
---

## Iteration N: <brief description>

**Time:** <timestamp>
**LSD file:** <filename>.lsd
**Solution count:** <count>

**Constraints added:**
- <constraint and reasoning>

**Constraints removed:**
- <constraint and reasoning> (or "None")

**Why:** <natural language explanation of strategy for this iteration>

**Constraint effectiveness:** <% reduction from previous, or "baseline", or "over-constrained (0 solutions)">
**Confidence:** <qualitative assessment: too many solutions / converging / stuck / etc.>
**HMBC correlations used:** X/Y

**Notes:**
- sp2 count: <N> (<even/odd>) <check/warning>
- H budget: <matches/mismatch>
- <other observations>

Rules:

  • NEVER overwrite the file — always append new iteration sections
  • Include ALL required fields in every iteration entry
  • The "Why" field must explain reasoning, not just state what was done
  • The "Constraints added/removed" must list each constraint individually with reasoning
  • If recovering from 0 solutions, document which correlations were removed and why

For the complete format specification with examples, see skill/supervisor/SKILL.md Section 7.

Step 8: Run LSD Solver

bash
lucy lsd run compound.lsd

Or directly:

bash
LSD compound.lsd

For solution count interpretation and troubleshooting, see skill/SKILL.md Section 5 (LSD Reference).

Step 9: Convert to SMILES

bash
outlsd 5 < compound.sol > solutions.smi

Step 10: Rank Solutions

bash
lucy lsd rank solutions.smi --spectrum <13c_exp>
# Or with shift list:
lucy lsd rank solutions.smi --shifts "187.8,152.5,135.7,..."

For MAE score interpretation and ranking guidance, see skill/SKILL.md Section 6 (Ranking and Prediction).

Step 11: Analyze J-Coupling Path Lengths

After solving, use lucy lsd analyze to compute the actual J-coupling path lengths for all HMBC correlations:

bash
lucy lsd analyze compound.sol compound.lsd

This command:

  • Parses the OUTLSD section of the .sol file to extract molecular connectivity
  • Builds a graph from atom neighbors
  • Uses BFS shortest path to compute bonds between carbon and proton-bearing carbon
  • Reports nJ = path_length + 1 for each HMBC correlation

Example output:

code
Solution 2: 9× ²J 11× ³J (all ²J/³J, no ELIM needed)

HMBC Correlations:
-------------------------------------------------------
  C#   H#    C (ppm)   Path   J-coupling
-------------------------------------------------------
   1    7     131.29      1        ²J_CH
   1   10     131.29      1        ²J_CH
   2    7     124.71      2        ³J_CH
   ...

Interpretation:

  • All ²J/³J correlations: Structure is consistent with standard HMBC without ELIM
  • Contains ⁴J+ correlations: May explain why ELIM was needed

JSON output for PDF generation:

bash
lucy lsd analyze compound.sol compound.lsd --format json > analysis/j_coupling.json

Generate structure images with LSD atom numbering:

bash
lucy lsd analyze compound.sol compound.lsd --draw solution_{n}.png

This generates a 2D structure image where each atom is labeled with its LSD index (C1, C2, ..., O11), making the HMBC table directly readable against the structure.

Generate publication-quality correlation diagrams with arrows:

For visualizing HMBC correlations directly on the structure with curved arrows and J-coupling labels:

bash
# Generate correlation diagram with atom numbers and J-coupling labels
lucy visualize correlations \
    --sol compound.sol \
    --lsd-file compound.lsd \
    --show-atom-numbers \
    --show-j-coupling \
    -o analysis/hmbc_diagram.svg

This creates a publication-quality SVG diagram showing:

  • Clean 2D structure (from the solved .sol file)
  • Red atom number annotations positioned away from the structure
  • Curved HMBC arrows connecting correlating atoms
  • ²J/³J labels on arrows indicating coupling path length

Include the correlation diagram next to the HMBC table in your PDF report - it provides an immediate visual representation of how the HMBC correlations connect the molecular fragments.

Step 12: Report Results

markdown
## CASE Results

**Molecular Formula:** [formula]
**Degree of Unsaturation:** [DBE]

### Data Used
- 13C: [X] signals
- HSQC: [Y] correlations (Z protonated carbons)
- HMBC: [N] correlations
- Symmetry: [description]

### LSD Results
- Solutions found: [count]
- ELIM used: [Yes/No]

### Top Candidates

**Rank 1:** MAE = X.XX ppm ([Quality])

[SMILES]

code
- Key features: [description]

**Rank 2:** MAE = X.XX ppm ([Quality])

[SMILES]

code
- Differs from #1 in: [description]

### Confidence Assessment
[High/Medium/Low] - [reasoning]

### Recommendation
[Final structure proposal or need for additional data]

Step 13: Generate PDF Report

Always generate a PDF report with rendered structures and formatted tables at the end of every CASE analysis.

python
# Generate PDF report with structures and tables
python3 << 'EOF'
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image, Table, TableStyle
from reportlab.lib.enums import TA_CENTER
import io

# Create the PDF document
doc = SimpleDocTemplate(
    "analysis/CASE_Report.pdf",
    pagesize=A4,
    rightMargin=0.75*inch,
    leftMargin=0.75*inch,
    topMargin=0.75*inch,
    bottomMargin=0.75*inch
)

# Styles
styles = getSampleStyleSheet()
title_style = ParagraphStyle('CustomTitle', parent=styles['Heading1'],
    fontSize=20, spaceAfter=30, alignment=TA_CENTER)
heading_style = ParagraphStyle('CustomHeading', parent=styles['Heading2'],
    fontSize=14, spaceBefore=20, spaceAfter=10)
normal_style = styles['Normal']

story = []

# Title
story.append(Paragraph("CASE Structure Elucidation Report", title_style))
story.append(Spacer(1, 0.25*inch))

# Summary table
story.append(Paragraph("Summary", heading_style))
summary_data = [
    ["Molecular Formula", "<FORMULA>"],
    ["Molecular Weight", "<MW> Da"],
    ["Degree of Unsaturation (DBE)", "<DBE>"],
    ["LSD Solutions Found", "<COUNT>"],
]
summary_table = Table(summary_data, colWidths=[2.5*inch, 3*inch])
summary_table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
    ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('PADDING', (0, 0), (-1, -1), 8),
]))
story.append(summary_table)
story.append(Spacer(1, 0.3*inch))

# 13C NMR Data Table
story.append(Paragraph("13C NMR Data", heading_style))
c13_data = [
    ["#", "Shift (ppm)", "Multiplicity", "Assignment"],
    # Add rows for each carbon signal:
    # ["1", "131.29", "C (quat)", "=C< olefinic"],
]
c13_table = Table(c13_data, colWidths=[0.4*inch, 1.2*inch, 1.2*inch, 2.5*inch])
c13_table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#4472C4')),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('PADDING', (0, 0), (-1, -1), 6),
]))
story.append(c13_table)
story.append(Spacer(1, 0.3*inch))

# Structure rendering function
def smiles_to_image(smiles, size=(400, 300)):
    mol = Chem.MolFromSmiles(smiles)
    AllChem.Compute2DCoords(mol)
    img = Draw.MolToImage(mol, size=size)
    img_buffer = io.BytesIO()
    img.save(img_buffer, format='PNG')
    img_buffer.seek(0)
    return img_buffer

# For each candidate structure:
story.append(Paragraph("Structure Candidates", heading_style))
# candidate_smiles = ["SMILES1", "SMILES2", ...]
# for i, smi in enumerate(candidate_smiles, 1):
#     story.append(Paragraph(f"<b>Rank {i}:</b> {name}", normal_style))
#     story.append(Paragraph(f"MAE: {mae} ppm | SMILES: {smi}", normal_style))
#     img = smiles_to_image(smi)
#     story.append(Image(img, width=3*inch, height=2.25*inch))
#     story.append(Spacer(1, 0.2*inch))

# Ranking comparison table
story.append(Paragraph("Ranking Comparison", heading_style))
rank_data = [
    ["Rank", "Structure", "MAE (ppm)", "Quality", "Within 3ppm"],
    # ["1", "Name", "2.69", "Good", "6/10"],
]
rank_table = Table(rank_data, colWidths=[0.5*inch, 2.5*inch, 1*inch, 0.8*inch, 1*inch])
rank_table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#4472C4')),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
    ('PADDING', (0, 0), (-1, -1), 6),
]))
story.append(rank_table)

# Build PDF
doc.build(story)
print("PDF report generated: analysis/CASE_Report.pdf")
EOF

CRITICAL: Use data from the successful analysis

Do NOT re-pick peaks for the PDF. Extract all data directly from the LSD file that produced successful solutions. The LSD file contains the exact peaks and correlations that were used.

The PDF report must include complete tables of ALL data used:

  1. Summary table — formula, MW, DBE, solution count, recommended structure

  2. Complete 13C NMR table — ALL carbons used in the LSD file:

    • Carbon number (C1, C2, ...)
    • Chemical shift (ppm)
    • Multiplicity (C, CH, CH2, CH3) from DEPT
    • Hybridization (sp2/sp3)
    • H-count
    • Assignment/interpretation
  3. Complete HSQC table — ALL direct C-H correlations from the LSD file:

    • Every HSQC command in the LSD file becomes a row
    • Include carbon identity, shift, multiplicity, and proton chemical shift if known
  4. HMBC Correlation Diagram (placed ABOVE the HMBC table):

    • Generate the diagram FIRST before the HMBC table:
      bash
      lucy visualize correlations --sol compound.sol --lsd-file compound.lsd \
          --show-atom-numbers -o analysis/hmbc_diagram.svg
      
    • Convert SVG to PNG for ReportLab embedding:
      python
      import cairosvg
      cairosvg.svg2png(url='analysis/hmbc_diagram.svg',
                       write_to='analysis/hmbc_diagram.png', scale=2.0)
      
    • The diagram shows:
      • Clean 2D structure with explicit atom labels (C, H, O)
      • Red curved arrows connecting HMBC-correlating atoms
      • Atom numbers matching the LSD file numbering
      • Optimized layout to avoid overlaps between arrows and labels
    • Include as a centered Image in the PDF, full page width (~6 inches)
  5. Complete HMBC table (placed BELOW the diagram) — ALL long-range correlations from the LSD file:

    • Every HMBC command in the LSD file becomes a row
    • Columns: "From Carbon", "To Proton", "<sup>n</sup>J<sub>CH</sub>", "Structural Information"
    • The J-coupling column shows path length using spectroscopist notation:
      • ²J<sub>CH</sub> = 2-bond (C directly bonded to C bearing H)
      • ³J<sub>CH</sub> = 3-bond (most common in HMBC)
      • ⁴J<sub>CH</sub> = 4-bond (W-pathway, rare in HMBC)
    • CRITICAL: Use lucy lsd analyze to calculate path lengths, do NOT guess!
      bash
      lucy lsd analyze compound.sol compound.lsd --format json > analysis/j_coupling.json
      
      This parses the OUTLSD section and uses BFS to compute actual bond distances.
    • All HMBC correlations should be ²J or ³J. If you find ⁴J+, the CASE likely required ELIM.
    • ReportLab note: Use Paragraph() objects for cells with super/subscript. Use <super> and <sub> tags.
    • Note: Reciprocal correlations (e.g., C1→H7 and C7→H2) appear as separate entries because they provide independent constraints
  6. Excluded signals section — Document WHY certain peaks were not used:

    • Solvent peaks (e.g., CDCl3 at 77 ppm)
    • Noise/artifacts
    • Duplicate signals from overlapping peaks
    • Signals that couldn't be assigned confidently
  7. Structure candidates — Rendered 2D images (RDKit) with SMILES and MAE scores

  8. Ranking comparison table — All candidates with MAE, quality rating, carbons within tolerance

  9. Recommended structure — Larger image with SMILES and InChI, plus reasoning if not Rank #1

Required dependencies:

CRITICAL: Install missing dependencies - do NOT fall back to suboptimal solutions (like text placeholders instead of images).

bash
# Core PDF generation (RDKit should already be installed)
pip install reportlab

# SVG to PNG conversion for embedding diagrams in PDF
pip install cairosvg

# cairosvg requires the Cairo system library - install if not present:
# macOS:
brew install cairo
# Then run Python with the library path if needed:
# DYLD_LIBRARY_PATH=/opt/homebrew/opt/cairo/lib:$DYLD_LIBRARY_PATH python3 script.py

# Linux (Debian/Ubuntu):
# sudo apt-get install libcairo2-dev

# Linux (RHEL/CentOS):
# sudo yum install cairo-devel

Before generating the PDF, verify all dependencies are working:

python
# Test imports - if any fail, install the missing package
from reportlab.platypus import SimpleDocTemplate
from rdkit import Chem
from rdkit.Chem import Draw
import cairosvg  # For SVG→PNG conversion

If cairosvg import fails with "no library called cairo", install the system Cairo library as shown above.


Troubleshooting

For detailed troubleshooting guidance, see skill/SKILL.md Section 5 (LSD Reference) and Section 6 (Ranking and Prediction).

Quick checklist for 0 solutions: sp2 count is EVEN, hydrogen count matches formula, HMBC correlations correct, only then try ELIM 1 0.


Quick Reference

bash
# Full workflow
mkdir -p analysis
lucy pick 1d ./2                                    # 13C peaks
lucy pick hsqc ./5 ./3 --dept90 ./4                # HSQC + multiplicities
lucy pick hmbc ./6 ./2 ./5 --dept135 ./3           # HMBC correlations
lucy lsd generate . C16H10N2O2 -o analysis/compound.lsd  # Generate LSD input
cd analysis && LSD compound.lsd                     # Solve
outlsd 5 < compound.sol > solutions.smi            # Convert to SMILES
lucy lsd rank solutions.smi --spectrum ../2        # Rank by 13C prediction
lucy lsd analyze compound.sol compound.lsd --draw structure_{n}.png  # Analyze with numbered structures
# Generate PDF report (see Step 13 for full template)

IMPORTANT: Always generate a PDF report at the end of every CASE analysis (Step 13).