Optimizing Performance

Name: optimizing-performance
Rating: 92
Author: inference-sh

Best practices for inference.sh apps.

Device Detection

Never hardcode "cuda" - use accelerate for automatic device detection:

python

from accelerate import Accelerator

class App(BaseApp):
    async def setup(self, config):
        self.accelerator = Accelerator()
        self.device = self.accelerator.device

Model Loading

Use HuggingFace hub for downloads:

python

import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

class App(BaseApp):
    async def setup(self, config):
        self.model_path = snapshot_download(
            repo_id="org/model-name",
            resume_download=True,
        )

Avoid:

•Hardcoded local directories (local_dir="./models")
•Subprocess calls to huggingface-cli
•Assuming specific file structures

Memory Cleanup

python

import torch, gc

def cleanup_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()

async def run(self, input_data):
    result = self.model(input_data)
    cleanup_memory()
    return result

Mixed Precision

python

model = model.to(dtype=torch.bfloat16)

# Or with autocast
from torch.amp import autocast
with autocast('cuda'):
    output = model(input)

Flash Attention

python

model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Error Handling

python

import logging

async def run(self, input_data):
    try:
        result = self.process(input_data)
        return AppOutput(result=result)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        raise ValueError(f"Failed to process: {str(e)}")

Optimizing Performance

Device Detection

Model Loading

Memory Cleanup

Mixed Precision

Flash Attention

Error Handling

Pre-deploy Checklist