AIND Analysis Infrastructure
This skill captures knowledge about the Allen Institute for Neural Dynamics (AIND) data infrastructure for analysis results.
Overview
AIND stores analysis results in a two-tier system:
- MongoDB (DocumentDB): Metadata and pointers to S3 assets
- S3: Actual result files (figures, tables, videos)
MongoDB Access
Connection Pattern
Use aind-data-access-api for DocumentDB access:
from aind_data_access_api.document_db import MetadataDbClient
client = MetadataDbClient(
    host="api.allenneuraldynamics.org",  # Public API endpoint
    database="analysis",  # Analysis database
    collection="<collection-name>",  # Project-specific collection
)
One Collection Per Project
Each analysis project has its own collection. Examples:
- dynamic-foraging-model-fitting - Foraging behavior MLE fitting
- Other projects follow the same pattern
Query Patterns
# Basic query - old format only
records = client.retrieve_docdb_records(
    filter_query={"status": "success"},
    projection={"_id": 1, "subject_id": 1, "s3_location": 1},
    paginate=False,  # Set True for large queries
)
# Query for new format
records = client.retrieve_docdb_records(
    filter_query={"processing.data_processes.output_parameters.additional_info": "success"},
    projection={"_id": 1, "location": 1},
    paginate=False,
)
Note: Because the two formats store the same fields at different paths, prefer aind-analysis-arch-result-access, which handles both formats automatically.
Two Pipeline Formats
AIND has two analysis pipeline formats with different document structures:
1. Prototype Pipeline (older format)
Flat structure with fields at root level:
{
  "_id": "fe96ff6c9e7b...", // SHA256 hash
  "subject_id": "744040",
  "session_date": "2024-12-09",
  "status": "success",
  "s3_location": "s3://aind-dynamic-foraging-analysis-prod-o5171v/fe96ff6c...",
  "analysis_datetime": "2025-01-18T05:14:46.722265",
  "nwb_name": "744040_2024-12-09_13-30-23.nwb",
  "analysis_spec": {
    "analysis_name": "MLE fitting",
    "analysis_args": { ... }
  },
  "analysis_results": {
    "fit_settings": { "agent_alias": "QLearning_L1F1_CK1_softmax", ... },
    "params": { ... },
    "log_likelihood": -144.83,
    "AIC": 299.67,
    "BIC": 319.50,
    "n_trials": 390,
    "prediction_accuracy": 0.807,
    "cross_validation": { ... }
  }
}
Key paths:
- subject_id → root level
- session_date → root level
- status → root level
- S3 location → s3_location
- Fitting results → analysis_results.*
- Agent alias → analysis_results.fit_settings.agent_alias
2. AIND Analysis Framework (new format)
Nested structure following aind-data-schema:
{
  "_id": "d2a652c73ee8420a...", // Shorter UUID
  "object_type": "Metadata",
  "name": "7d9b907880012b65...",
  "location": "s3://aind-analysis-prod-o5171v/dynamic-foraging-model-fitting/7d9b...",
  "processing": {
    "data_processes": [
      {
        "name": "han_df_mle_aind-analysis-wrapper",
        "start_date_time": "2026-01-10T04:07:11",
        "output_parameters": {
          "additional_info": "success",
          "subject_id": "781575",
          "session_date": "2025-07-14",
          "nwb_name": "behavior_781575_2025-07-14_21-41-11.nwb",
          "fitting_results": {
            "fit_settings": { "agent_alias": "ForagingCompareThreshold", ... },
            "params": { ... },
            "log_likelihood": -272.10,
            "AIC": 552.20,
            "BIC": 569.34,
            "n_trials": 536,
            "prediction_accuracy": 0.783,
            "cross_validation": { ... }
          }
        }
      }
    ]
  }
}
Key paths:
- subject_id → processing.data_processes[0].output_parameters.subject_id
- session_date → processing.data_processes[0].output_parameters.session_date
- status → processing.data_processes[0].output_parameters.additional_info
- S3 location → location
- Fitting results → processing.data_processes[0].output_parameters.fitting_results.*
- Agent alias → processing.data_processes[0].output_parameters.fitting_results.fit_settings.agent_alias
Field Mapping Summary
| Field | Old Format | New Format |
|---|---|---|
| subject_id | subject_id | processing.data_processes[0].output_parameters.subject_id |
| session_date | session_date | processing.data_processes[0].output_parameters.session_date |
| status | status | processing.data_processes[0].output_parameters.additional_info |
| S3 location | s3_location | location |
| agent_alias | analysis_results.fit_settings.agent_alias | processing.data_processes[0].output_parameters.fitting_results.fit_settings.agent_alias |
| n_trials | analysis_results.n_trials | processing.data_processes[0].output_parameters.fitting_results.n_trials |
| AIC/BIC | analysis_results.AIC/BIC | processing.data_processes[0].output_parameters.fitting_results.AIC/BIC |
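The mapping above can be applied client-side to flatten either document shape into one dictionary. The following is a minimal sketch, not the logic used by any AIND package: the function name normalize_record and the output key "format" are hypothetical, and it assumes the presence of a "processing" key is enough to identify a new-format document.

```python
from typing import Any, Dict


def normalize_record(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten an old- or new-format document into one common shape.

    Hypothetical helper; assumes new-format documents always carry a
    top-level "processing" key and old-format documents never do.
    """
    is_new = "processing" in doc
    if is_new:
        # New format: values live under the first data process.
        processes = doc["processing"].get("data_processes") or [{}]
        params = processes[0].get("output_parameters", {})
        results = params.get("fitting_results", {})
        status = params.get("additional_info")
        s3_location = doc.get("location")
    else:
        # Old format: values live at the root of the document.
        params = doc
        results = doc.get("analysis_results", {})
        status = doc.get("status")
        s3_location = doc.get("s3_location")

    return {
        "_id": doc.get("_id"),
        "subject_id": params.get("subject_id"),
        "session_date": params.get("session_date"),
        "status": status,
        "s3_location": s3_location,
        "agent_alias": results.get("fit_settings", {}).get("agent_alias"),
        "n_trials": results.get("n_trials"),
        "AIC": results.get("AIC"),
        "BIC": results.get("BIC"),
        "format": "new" if is_new else "old",
    }
```

Missing fields come back as None rather than raising, so records from failed analyses can pass through the same helper.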
Querying Both Formats
Use MongoDB projection aliasing to normalize fields:
# Query that works for both formats
projection = {
    "_id": 1,
    # Old format fields
    "subject_id": 1,
    "session_date": 1,
    "status": 1,
    "s3_location": 1,
    # New format fields (aliased)
    "subject_id_new": "$processing.data_processes.output_parameters.subject_id",
    "session_date_new": "$processing.data_processes.output_parameters.session_date",
    "location": 1,
}
The aind-analysis-arch-result-access package handles this automatically by querying both formats and merging results into a unified DataFrame.
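When querying manually, a single filter can match successful records in either format by combining the two status paths with $or. This is a sketch under assumptions: the helper name both_formats_filter is hypothetical, and it assumes the API forwards the filter to the database unchanged. It only builds the dictionary; the result would be passed as filter_query to retrieve_docdb_records.

```python
from typing import Any, Dict

# Path prefix for new-format fields, taken from the field mapping table.
NEW_PREFIX = "processing.data_processes.output_parameters"


def both_formats_filter(subject_id: str) -> Dict[str, Any]:
    """Build a filter matching successful records for one subject in
    either pipeline format (hypothetical helper)."""
    old_branch = {"status": "success", "subject_id": subject_id}
    new_branch = {
        f"{NEW_PREFIX}.additional_info": "success",
        f"{NEW_PREFIX}.subject_id": subject_id,
    }
    return {"$or": [old_branch, new_branch]}


# Example: filter for subject 778869 across both formats
query = both_formats_filter("778869")
```

Each branch repeats the subject condition because the field lives at a different path in each format.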
S3 Access
Public Buckets
Analysis results are in public S3 buckets (no auth needed):
import s3fs
# Anonymous access for public buckets
fs = s3fs.S3FileSystem(anon=True)
Common Bucket Paths
# Old pipeline bucket
S3_PATH_ANALYSIS_OLD = "s3://aind-dynamic-foraging-analysis-prod-o5171v"
# New pipeline bucket (AIND Analysis Framework)
S3_PATH_ANALYSIS_NEW = "s3://aind-analysis-prod-o5171v/dynamic-foraging-model-fitting"
# Bonsai processed data
S3_PATH_BONSAI_ROOT = "s3://aind-behavior-data/foraging_nwb_bonsai_processed"
Asset URL Construction
Convert S3 path to HTTPS for web display:
def s3_to_https(s3_path: str) -> str:
    """Convert s3://bucket/key to https://bucket.s3.amazonaws.com/key"""
    if s3_path.startswith("s3://"):
        s3_path = s3_path[5:]
    # Split once: the first component is the bucket, the rest is the key
    bucket, _, key = s3_path.partition("/")
    return f"https://{bucket}.s3.amazonaws.com/{key}"
Reading Files from S3
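import json
import pickle
import s3fs
fs = s3fs.S3FileSystem(anon=True)  # Anonymous access for public buckets
# JSON files
with fs.open("s3://bucket/path/file.json") as f:
    data = json.load(f)
# Pickle files (only unpickle data from sources you trust)
with fs.open("s3://bucket/path/file.pkl", "rb") as f:
    df = pickle.load(f)
# Check existence before reading optional assets
if fs.exists("s3://bucket/path/file.png"):
    pass  # File exists; read or link to it here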
Batch Operations
For multiple S3 reads, use ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
def fetch_file(s3_path):
    # ... fetch logic
    pass
# paths is the list of S3 paths to fetch
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(
        executor.map(fetch_file, paths),
        total=len(paths),
        desc="Fetching files"
    ))
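The fetch logic left open above could look like the following sketch for JSON assets. The names fetch_json and fetch_many are hypothetical, and the filesystem is passed in as an argument (for example an s3fs.S3FileSystem(anon=True) instance) so the helpers only rely on its exists and open methods. Missing or unreadable files yield None instead of aborting the whole batch.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterable, Optional


def fetch_json(fs: Any, s3_path: str) -> Optional[Any]:
    """Return the parsed JSON at s3_path, or None if the file is
    missing or cannot be read or parsed (hypothetical helper)."""
    try:
        if not fs.exists(s3_path):
            return None
        with fs.open(s3_path, "rb") as f:
            return json.load(f)
    except (OSError, ValueError):
        # OSError covers I/O failures; ValueError covers invalid JSON.
        return None


def fetch_many(fs: Any, paths: Iterable[str], max_workers: int = 10) -> Dict[str, Any]:
    """Fetch several JSON files concurrently, keyed by their path."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so zip pairs correctly.
        results = list(executor.map(lambda p: fetch_json(fs, p), paths))
    return dict(zip(paths, results))
```

Threads suit this workload because each read spends most of its time waiting on the network. Returning None hides the reason for a failure, so log inside the except branch if that matters.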
Using aind-analysis-arch-result-access
This package provides ready-made functions for specific analyses:
# Install: pip install aind-analysis-arch-result-access
from aind_analysis_arch_result_access import get_mle_model_fitting
# Get foraging model fitting results
# IMPORTANT: At least one filter parameter is required!
df = get_mle_model_fitting(
    subject_id="778869",  # Filter by subject (recommended for fast loading)
    # session_date="2024-10-24",  # Or filter by date
    # agent_alias="QLearning...",  # Or filter by model type
    # from_custom_query={...},  # Or use custom MongoDB query
    if_include_metrics=True,
    if_include_latent_variables=False,  # Set False for faster loading
    if_download_figures=False,
)
# For querying by date range:
from datetime import datetime, timedelta
three_months_ago = (datetime.now() - timedelta(days=90)).strftime("%Y-%m-%d")
df_recent = get_mle_model_fitting(
    from_custom_query={"session_date": {"$gte": three_months_ago}},
    if_include_metrics=True,
    if_include_latent_variables=False,
)
Important notes:
- The function requires at least one of: subject_id, session_date, agent_alias, or from_custom_query
- The package queries both pipeline formats separately and merges results into a unified DataFrame
- only_recent_version=True (default) deduplicates by keeping the most recent analysis
- Loading all records can be slow; filter by subject_id or date range for prototyping
Key columns in returned DataFrame:
- _id: Record identifier
- subject_id, session_date: Session info
- agent_alias: Model type used
- n_trials: Number of trials
- S3_location: Path to result files (use for constructing asset URLs)
- status: "success" or "failed"
- pipeline_source: "aind analysis framework" or "han's analysis pipeline"
- Metrics: log_likelihood, AIC, BIC, prediction_accuracy, etc.
Common Asset Types
Assets stored in S3 per record:
- fitted_session.png - Main result figure
- docDB_record.json - Full analysis results
- original_results_*.json - Raw output files
- Latent variables (q-values, RPE, etc.)
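Given a record's S3 location, HTTPS URLs for these assets can be built in one step. This sketch assumes the assets sit directly under the record's S3 location, which is not confirmed here; the helper name asset_urls and the example bucket are hypothetical. Check existence before relying on any URL, since not every record has every asset.

```python
from typing import Dict, Iterable


def asset_urls(
    s3_location: str,
    asset_names: Iterable[str] = ("fitted_session.png", "docDB_record.json"),
) -> Dict[str, str]:
    """Map asset file names to HTTPS URLs under one record's S3 location.

    Hypothetical helper; assumes assets are stored directly under
    s3_location with no extra subfolder.
    """
    base = s3_location.rstrip("/")
    if base.startswith("s3://"):
        base = base[len("s3://"):]
    # First path component is the bucket, the remainder is the key prefix.
    bucket, _, prefix = base.partition("/")
    root = f"https://{bucket}.s3.amazonaws.com"
    if prefix:
        root = f"{root}/{prefix}"
    return {name: f"{root}/{name}" for name in asset_names}


# Example with a made-up bucket and record id
urls = asset_urls("s3://example-bucket/abc123")
```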
Best Practices
- Filter early: Use MongoDB queries to reduce data before pandas operations
- Batch S3 operations: Use threading for multiple file reads
- Cache results: Consider caching DataFrames for repeated queries
- Handle both formats: Account for old and new pipeline structures
- Check S3 existence: Assets may not exist for all records
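One way to cache results is a small on-disk pickle cache with an expiry time. This is a sketch only: the function name cached, the cache path, and the one-hour default are hypothetical choices. It works for any picklable object, including a pandas DataFrame, and the loader would typically wrap a call such as get_mle_model_fitting.

```python
import pickle
import time
from pathlib import Path
from typing import Any, Callable, Union


def cached(
    cache_path: Union[str, Path],
    loader: Callable[[], Any],
    max_age_seconds: float = 3600,
) -> Any:
    """Return the cached result if it is fresh, else call loader and
    store its result on disk (hypothetical helper)."""
    path = Path(cache_path)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache is missing or stale: recompute and overwrite it.
    result = loader()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Use a separate cache file per query, because the cache is keyed only by its path. Only load cache files that you wrote yourself, as unpickling untrusted data is unsafe.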