fiftyone-find-duplicates

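Finds duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation. Use it when deduplicating datasets, searching for similar images, or removing redundant samples.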

SKILL.md
--- frontmatter
name: fiftyone-find-duplicates
description: Finds duplicate or near-duplicate images in FiftyOne datasets using brain similarity computation. Use when deduplicating datasets, finding similar images, or removing redundant samples.

Find Duplicates in FiftyOne Datasets

Key Directives

ALWAYS follow these rules:

1. Set context first

python
set_context(dataset_name="my-dataset")

2. Launch FiftyOne App

Brain operators run as delegated operations and require a running FiftyOne App:

python
launch_app()

Wait 5-10 seconds for initialization.
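
Instead of a fixed sleep, the wait can be expressed as a bounded poll. The sketch below is a minimal illustration, assuming the App listens on a local TCP port (the manual-deletion step later in this skill uses http://localhost:5151/). The helper name `wait_for_port` and its parameters are hypothetical and not part of FiftyOne or the MCP server. An open port only shows that the server accepts connections, not that the App has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it polls the port until `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again.
            time.sleep(interval)
    return False


# Example (assumes the App uses port 5151):
#   ready = wait_for_port("localhost", 5151, timeout=10.0)
```

If the poll returns False, falling back to the 5-10 second wait described above is a reasonable default.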

3. Discover operators dynamically

python
# List all brain operators
list_operators(builtin_only=False)

# Get schema for specific operator
get_operator_schema(operator_uri="@voxel51/brain/compute_similarity")

4. Compute embeddings before finding duplicates

python
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "img_sim", "model": "mobilenet-v2-imagenet-torch"}
)

5. Close app when done

python
close_app()

Complete Workflow

Step 1: Setup

python
# Set context
set_context(dataset_name="my-dataset")

# Launch app (required for brain operators)
launch_app()

Step 2: Verify Brain Plugin

python
# Check if brain plugin is available
list_plugins(enabled=True)

# If not installed:
download_plugin(
    url_or_repo="voxel51/fiftyone-plugins",
    plugin_names=["@voxel51/brain"]
)
enable_plugin(plugin_name="@voxel51/brain")

Step 3: Discover Brain Operators

python
# List all available operators
list_operators(builtin_only=False)

# Get schema for compute_similarity
get_operator_schema(operator_uri="@voxel51/brain/compute_similarity")

# Get schema for find_near_duplicates
get_operator_schema(operator_uri="@voxel51/brain/find_near_duplicates")

Step 4: Compute Similarity

python
# Execute operator to compute embeddings
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={
        "brain_key": "img_duplicates",
        "model": "mobilenet-v2-imagenet-torch"
    }
)

Step 5: Find Near Duplicates

python
execute_operator(
    operator_uri="@voxel51/brain/find_near_duplicates",
    params={
        "similarity_index": "img_duplicates",
        "threshold": 0.3
    }
)

Threshold guidelines (distance-based, lower = more similar):

  • 0.1 = Very similar (near-exact duplicates)
  • 0.3 = Near duplicates (recommended default)
  • 0.5 = Similar images
  • 0.7 = Loosely similar
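
To make "distance-based, lower = more similar" concrete, here is a minimal sketch of how a distance threshold separates near duplicates from merely similar images. It assumes cosine distance between embedding vectors and treats a pair as a near duplicate when the distance is at or below the threshold. Both are assumptions for illustration: the metric actually used depends on how the similarity index is configured, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def is_near_duplicate(a, b, threshold=0.3):
    """Assumed rule: a pair counts as near-duplicate when distance <= threshold."""
    return cosine_distance(a, b) <= threshold


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
embedding_a = [1.0, 0.0]
embedding_b = [1.0, 1.0]
print(round(cosine_distance(embedding_a, embedding_b), 3))  # 0.293
```

With these toy vectors the pair passes at threshold 0.3 but not at 0.1, which mirrors how lowering the threshold keeps only closer matches.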

This operator creates two saved views automatically:

  • near duplicates: all samples that are near duplicates
  • representatives of near duplicates: one representative from each group

Step 6: View Duplicates in App

After finding duplicates, use set_view to display them in the FiftyOne App:

Option A: Filter by near_dup_id field

python
# Show all samples that have a near_dup_id (all duplicates)
set_view(exists=["near_dup_id"])

Option B: Show specific duplicate group

python
# Show samples with a specific duplicate group ID
set_view(filters={"near_dup_id": 1})

Option C: Load saved view (if available)

python
# Load the automatically created saved view
set_view(view_name="near duplicates")

Option D: Clear filter to show all samples

python
clear_view()

The find_near_duplicates operator adds a near_dup_id field to samples. Samples with the same ID are duplicates of each other.
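
As an illustration of how the near_dup_id field defines duplicate groups, the sketch below groups plain dictionaries by that field. The sample records and the helper name `group_by_near_dup_id` are hypothetical stand-ins. Real samples live in the FiftyOne dataset and are not accessed this way.

```python
from collections import defaultdict


def group_by_near_dup_id(samples):
    """Map each near_dup_id to the list of sample IDs that share it.

    Samples without a near_dup_id (None or missing) are not duplicates
    and are left out of the result.
    """
    groups = defaultdict(list)
    for sample in samples:
        dup_id = sample.get("near_dup_id")
        if dup_id is not None:
            groups[dup_id].append(sample["id"])
    return dict(groups)


# Hypothetical records standing in for dataset samples.
samples = [
    {"id": "a", "near_dup_id": 1},
    {"id": "b", "near_dup_id": 1},
    {"id": "c", "near_dup_id": None},
    {"id": "d", "near_dup_id": 2},
    {"id": "e", "near_dup_id": 2},
]
print(group_by_near_dup_id(samples))  # {1: ['a', 'b'], 2: ['d', 'e']}
```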

Step 7: Delete Duplicates

Option A: Use deduplicate operator (keeps one representative per group)

python
execute_operator(
    operator_uri="@voxel51/brain/deduplicate_near_duplicates",
    params={}
)
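
The effect of "keeps one representative per group" can be sketched on plain records. This is only an illustration: the helper `split_representatives` and the sample dictionaries are hypothetical, and keeping the first sample seen in each group is an assumption. How the operator actually chooses its representative is not specified here.

```python
def split_representatives(samples):
    """Split sample IDs into (keep, drop) lists.

    Assumed policy: the first sample seen for each near_dup_id is kept as the
    representative, and later samples with the same ID are dropped.
    Samples without a near_dup_id are always kept.
    """
    seen_groups = set()
    keep, drop = [], []
    for sample in samples:
        dup_id = sample.get("near_dup_id")
        if dup_id is None:
            keep.append(sample["id"])
        elif dup_id not in seen_groups:
            seen_groups.add(dup_id)
            keep.append(sample["id"])
        else:
            drop.append(sample["id"])
    return keep, drop


# Hypothetical records standing in for dataset samples.
samples = [
    {"id": "a", "near_dup_id": 1},
    {"id": "b", "near_dup_id": 1},
    {"id": "c", "near_dup_id": None},
    {"id": "d", "near_dup_id": 2},
]
keep, drop = split_representatives(samples)
print(keep, drop)  # ['a', 'c', 'd'] ['b']
```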

Option B: Manual deletion from App UI

  1. Use set_view(exists=["near_dup_id"]) to show duplicates
  2. Review samples in the App at http://localhost:5151/
  3. Select samples to delete
  4. Use the delete action in the App

Step 8: Clean Up

python
close_app()

Available Tools

Session View Tools

| Tool | Description |
|------|-------------|
| set_view(exists=[...]) | Filter samples where field(s) have non-None values |
| set_view(filters={...}) | Filter samples by exact field values |
| set_view(tags=[...]) | Filter samples by tags |
| set_view(sample_ids=[...]) | Select specific sample IDs |
| set_view(view_name="...") | Load a saved view by name |
| clear_view() | Clear filters, show all samples |

Brain Operators for Duplicates

Use list_operators() to discover operators and get_operator_schema() to inspect their parameters:

| Operator | Description |
|----------|-------------|
| @voxel51/brain/compute_similarity | Compute embeddings and similarity index |
| @voxel51/brain/find_near_duplicates | Find near-duplicate samples |
| @voxel51/brain/deduplicate_near_duplicates | Delete duplicates, keep representatives |
| @voxel51/brain/find_exact_duplicates | Find exact duplicate media files |
| @voxel51/brain/deduplicate_exact_duplicates | Delete exact duplicates |
| @voxel51/brain/compute_uniqueness | Compute uniqueness scores |

Common Use Cases

Use Case 1: Remove Exact Duplicates

For accidentally duplicated files (identical bytes):

python
set_context(dataset_name="my-dataset")
launch_app()

execute_operator(
    operator_uri="@voxel51/brain/find_exact_duplicates",
    params={}
)

execute_operator(
    operator_uri="@voxel51/brain/deduplicate_exact_duplicates",
    params={}
)

close_app()
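
For intuition about what "identical bytes" means, the sketch below groups files whose contents hash to the same SHA-256 digest. It illustrates the general idea only and is not presented as how the operator is implemented. The function names are hypothetical, and only the Python standard library is used.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(paths):
    """Return groups (lists of paths) whose files have identical contents."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(str(path))
    # Only digests shared by two or more files indicate duplicates.
    return [group for group in by_digest.values() if len(group) > 1]
```

Two images that look the same but were re-encoded or resized produce different digests, which is why near-duplicate detection relies on embeddings instead.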

Use Case 2: Find and Review Near Duplicates

For visually similar but not identical images:

python
set_context(dataset_name="my-dataset")
launch_app()

# Compute embeddings
execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "near_dups", "model": "mobilenet-v2-imagenet-torch"}
)

# Find duplicates
execute_operator(
    operator_uri="@voxel51/brain/find_near_duplicates",
    params={"similarity_index": "near_dups", "threshold": 0.3}
)

# View duplicates in the App
set_view(exists=["near_dup_id"])

# After review, deduplicate
execute_operator(
    operator_uri="@voxel51/brain/deduplicate_near_duplicates",
    params={}
)

# Clear view and close
clear_view()
close_app()

Use Case 3: Sort by Similarity

Find images similar to a specific sample:

python
set_context(dataset_name="my-dataset")
launch_app()

execute_operator(
    operator_uri="@voxel51/brain/compute_similarity",
    params={"brain_key": "search"}
)

execute_operator(
    operator_uri="@voxel51/brain/sort_by_similarity",
    params={
        "brain_key": "search",
        "query_id": "sample_id_here",
        "k": 20
    }
)

close_app()

Troubleshooting

Error: "No executor available"

  • Cause: Delegated operators rely on the App's executor and must be triggered from the App UI
  • Solution: Direct user to App UI to view results and complete deletion manually
  • Affected operators: find_near_duplicates, deduplicate_near_duplicates

Error: "Brain key not found"

  • Cause: Embeddings not computed
  • Solution: Run compute_similarity first with a brain_key

Error: "Operator not found"

  • Cause: Brain plugin not installed
  • Solution: Install with download_plugin() and enable_plugin()

Error: "Missing dependency" (e.g., torch, tensorflow)

  • The MCP server detects missing dependencies automatically
  • Response includes missing_package and install_command
  • Example response:
    json
    {
      "error_type": "missing_dependency",
      "missing_package": "torch",
      "install_command": "pip install torch"
    }
    
  • Offer to run the install command for the user
  • After installation, restart MCP server and retry
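
A minimal sketch of handling that response is shown below. It assumes the response arrives as JSON text with the three fields from the example above. The helper name `describe_missing_dependency` and its return shape are hypothetical.

```python
import json


def describe_missing_dependency(response_text):
    """Extract the package and install command from a missing-dependency response.

    Returns a dict with "package" and "command", or None if the text is not
    valid JSON or does not describe a missing dependency.
    """
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("error_type") != "missing_dependency":
        return None
    return {
        "package": payload.get("missing_package"),
        "command": payload.get("install_command"),
    }


response = (
    '{"error_type": "missing_dependency", '
    '"missing_package": "torch", '
    '"install_command": "pip install torch"}'
)
info = describe_missing_dependency(response)
if info is not None:
    # Ask before installing; never run the command without user consent.
    print(f"Missing package {info['package']}. Run `{info['command']}`?")
```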

Similarity computation is slow

  • Use a faster model, such as mobilenet-v2-imagenet-torch
  • Use a GPU if one is available
  • Process large datasets in batches

Best Practices

  1. Discover dynamically - Use list_operators() and get_operator_schema() to get current operator names and parameters
  2. Start with the default threshold - Use 0.3 and adjust as needed
  3. Review before deleting - Direct the user to the App to inspect duplicates
  4. Store embeddings - Reuse them across operations via the brain_key
  5. Handle executor errors gracefully - Guide the user to the App UI when needed

Performance Notes

Embedding computation time:

  • 1,000 images: ~1-2 minutes
  • 10,000 images: ~10-15 minutes
  • 100,000 images: ~1-2 hours

Memory requirements:

  • ~2KB per image for embeddings
  • ~4-8KB per image for similarity index
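
The per-image figures above can be turned into a quick estimate. The sketch below is a back-of-the-envelope calculation using only those numbers (about 2KB per image for embeddings and 4-8KB per image for the index). The function name is hypothetical, and actual usage depends on the model and index backend.

```python
EMBEDDING_KB_PER_IMAGE = 2          # ~2KB per image for embeddings
INDEX_KB_PER_IMAGE = (4, 8)         # ~4-8KB per image for the similarity index


def estimate_memory_kb(num_images):
    """Estimate memory in KB as low/high bounds from the per-image figures."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    embeddings = num_images * EMBEDDING_KB_PER_IMAGE
    index_low = num_images * INDEX_KB_PER_IMAGE[0]
    index_high = num_images * INDEX_KB_PER_IMAGE[1]
    return {
        "embeddings_kb": embeddings,
        "index_kb_low": index_low,
        "index_kb_high": index_high,
        "total_kb_low": embeddings + index_low,
        "total_kb_high": embeddings + index_high,
    }


estimate = estimate_memory_kb(100_000)
print(estimate["total_kb_low"], estimate["total_kb_high"])  # 600000 1000000
```

For 100,000 images this works out to roughly 0.6 to 1.0 GB in decimal units.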

Resources