SKILL.md
--- frontmatter
name: onnx
description: |
  Use when running pre-trained ONNX models for inference in .NET with ONNX Runtime. Covers session management, tensor inputs/outputs, execution providers (CPU/GPU/DirectML), model optimization, and integration with ASP.NET Core.
  USE FOR: running pre-trained ONNX models in .NET, image classification inference, NLP model inference, cross-framework model deployment (PyTorch/TensorFlow to .NET), GPU-accelerated inference with CUDA or DirectML
  DO NOT USE FOR: training ML models from scratch (use mlnet or Python), calling cloud-hosted LLMs (use microsoft-extensions-ai or azure-ai-inference), building agent workflows (use agent-framework), custom ML pipelines with feature engineering (use mlnet)
license: MIT
metadata:
  displayName: "ONNX Runtime"
  author: "Tyler-R-Kendrick"
  version: "1.0.0"
compatibility: claude, copilot, cursor

ONNX Runtime

Overview

ONNX Runtime is a high-performance inference engine for running models in the Open Neural Network Exchange (ONNX) format. It enables .NET applications to run models trained in PyTorch, TensorFlow, scikit-learn, or any framework that exports to ONNX. The runtime supports multiple execution providers (CPU, CUDA, DirectML, TensorRT) for hardware-accelerated inference with minimal code changes.

NuGet Packages

bash
dotnet add package Microsoft.ML.OnnxRuntime           # CPU inference
dotnet add package Microsoft.ML.OnnxRuntime.Gpu       # CUDA GPU inference
dotnet add package Microsoft.ML.OnnxRuntime.DirectML  # DirectML (Windows GPU)

Basic Inference

csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

using var session = new InferenceSession("model.onnx");

// Inspect model inputs and outputs
foreach (var input in session.InputMetadata)
{
    Console.WriteLine($"Input: {input.Key}, Shape: [{string.Join(",", input.Value.Dimensions)}], Type: {input.Value.ElementType}");
}
foreach (var output in session.OutputMetadata)
{
    Console.WriteLine($"Output: {output.Key}, Shape: [{string.Join(",", output.Value.Dimensions)}], Type: {output.Value.ElementType}");
}

// Create input tensor
var inputData = new float[] { 1.0f, 2.0f, 3.0f, 4.0f };
var inputTensor = new DenseTensor<float>(inputData, new[] { 1, 4 });

var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input", inputTensor)
};

// Run inference
using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = session.Run(inputs);
var output = results.First().AsTensor<float>();

Console.WriteLine($"Output shape: [{string.Join(",", output.Dimensions.ToArray())}]");
Console.WriteLine($"Values: [{string.Join(", ", output.ToArray())}]");

Image Classification

csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

public class ImageClassifier : IDisposable
{
    private readonly InferenceSession _session;
    private readonly string[] _labels;

    public ImageClassifier(string modelPath, string labelsPath)
    {
        _session = new InferenceSession(modelPath);
        _labels = File.ReadAllLines(labelsPath);
    }

    public (string Label, float Confidence) Classify(string imagePath)
    {
        // Load and preprocess image (224x224 for typical classification models)
        using var image = Image.Load<Rgb24>(imagePath);
        image.Mutate(x => x.Resize(224, 224));

        // Convert to tensor: [1, 3, 224, 224] in NCHW format
        var tensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });

        for (int y = 0; y < 224; y++)
        {
            for (int x = 0; x < 224; x++)
            {
                var pixel = image[x, y];
                // Normalize to [0, 1] and apply ImageNet mean/std
                tensor[0, 0, y, x] = (pixel.R / 255f - 0.485f) / 0.229f; // Red
                tensor[0, 1, y, x] = (pixel.G / 255f - 0.456f) / 0.224f; // Green
                tensor[0, 2, y, x] = (pixel.B / 255f - 0.406f) / 0.225f; // Blue
            }
        }

        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input", tensor)
        };

        using var results = _session.Run(inputs);
        var probabilities = Softmax(results.First().AsTensor<float>().ToArray());

        int maxIndex = Array.IndexOf(probabilities, probabilities.Max());
        return (_labels[maxIndex], probabilities[maxIndex]);
    }

    private static float[] Softmax(float[] logits)
    {
        float max = logits.Max();
        var exps = logits.Select(x => MathF.Exp(x - max)).ToArray();
        float sum = exps.Sum();
        return exps.Select(x => x / sum).ToArray();
    }

    public void Dispose() => _session.Dispose();
}

// Usage
using var classifier = new ImageClassifier("resnet50.onnx", "imagenet_labels.txt");
var (label, confidence) = classifier.Classify("photo.jpg");
Console.WriteLine($"Predicted: {label} ({confidence:P1})");
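
The classifier above returns only the single best label. When the runner-up classes matter, a top-k selection over the softmax output can be sketched as follows. `TopK` is a hypothetical helper, not part of ONNX Runtime, and the probabilities in the usage lines are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns the k highest-probability classes, best first.
static (int Index, float Probability)[] TopK(float[] probabilities, int k)
{
    if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));
    return probabilities
        .Select((p, i) => (Index: i, Probability: p))
        .OrderByDescending(t => t.Probability)
        .Take(k) // yields fewer than k entries if the model has fewer classes
        .ToArray();
}

// Usage with made-up probabilities for a four-class model.
var sampleProbabilities = new[] { 0.05f, 0.70f, 0.05f, 0.20f };
foreach (var (index, probability) in TopK(sampleProbabilities, 2))
{
    Console.WriteLine($"class {index}: {probability:P1}");
}
```

Each returned index can be mapped to a label with the same `_labels` array the classifier already loads.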

Execution Providers

Configure hardware acceleration by selecting execution providers.

csharp
// CPU (default; no provider needs to be appended)
using var cpuOptions = new SessionOptions();
using var cpuSession = new InferenceSession("model.onnx", cpuOptions);

// CUDA GPU (requires the Microsoft.ML.OnnxRuntime.Gpu package)
using var cudaOptions = new SessionOptions();
cudaOptions.AppendExecutionProvider_CUDA(deviceId: 0);
using var gpuSession = new InferenceSession("model.onnx", cudaOptions);

// DirectML (Windows GPU - AMD, Intel, NVIDIA)
using var dmlOptions = new SessionOptions();
dmlOptions.AppendExecutionProvider_DML(deviceId: 0);
using var dmlSession = new InferenceSession("model.onnx", dmlOptions);

// TensorRT (NVIDIA optimized); providers are tried in the order they are appended
using var trtOptions = new SessionOptions();
trtOptions.AppendExecutionProvider_Tensorrt(deviceId: 0);
trtOptions.AppendExecutionProvider_CUDA(deviceId: 0); // Fallback for nodes TensorRT cannot run
using var trtSession = new InferenceSession("model.onnx", trtOptions);
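
Appending a GPU provider can throw on a machine that lacks the required native libraries or drivers. The following is a minimal sketch of a CPU fallback, assuming a hypothetical helper named `CreateSession`. The exception types it catches are an assumption; which one is actually thrown can vary with the installed package and ONNX Runtime version.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

// Hypothetical helper: prefers CUDA and continues on CPU when CUDA cannot be loaded.
static InferenceSession CreateSession(string modelPath, int deviceId = 0)
{
    using var options = new SessionOptions();
    try
    {
        options.AppendExecutionProvider_CUDA(deviceId);
    }
    catch (Exception ex) when (ex is OnnxRuntimeException
                                   or DllNotFoundException
                                   or EntryPointNotFoundException)
    {
        // Assumption: a failed append leaves the options usable with the default CPU provider.
        Console.Error.WriteLine($"CUDA provider unavailable, using CPU: {ex.Message}");
    }

    // Model-loading errors (for example a missing file) are deliberately not swallowed.
    return new InferenceSession(modelPath, options);
}

// Usage: using var session = CreateSession("model.onnx");
```

Only the provider registration is wrapped in the `try` block, so a broken model path still surfaces as an error instead of silently triggering the fallback.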

Execution Provider Comparison

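| Provider | Package | Platform | Best For |
|----------|---------|----------|----------|
| CPU | Microsoft.ML.OnnxRuntime | All | Default, universally available |
| CUDA | Microsoft.ML.OnnxRuntime.Gpu | Linux/Windows + NVIDIA | High-throughput GPU inference |
| DirectML | Microsoft.ML.OnnxRuntime.DirectML | Windows | Cross-vendor GPU (AMD, Intel, NVIDIA) |
| TensorRT | Microsoft.ML.OnnxRuntime.Gpu | Linux/Windows + NVIDIA | Maximum NVIDIA throughput |
| CoreML | Microsoft.ML.OnnxRuntime | macOS/iOS | Apple Silicon optimization |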

Session Options and Optimization

csharp
using var options = new SessionOptions
{
    // Enable all graph optimizations (this is also the default level)
    GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL,

    // Set number of intra-op threads (within a single operator)
    IntraOpNumThreads = Environment.ProcessorCount,

    // Set number of inter-op threads (between operators);
    // only used when ExecutionMode is ORT_PARALLEL
    InterOpNumThreads = 2,

    // Enable memory pattern optimization
    EnableMemoryPattern = true,

    // Write the optimized graph to disk; load this file on later runs
    // to skip re-optimizing the original model
    OptimizedModelFilePath = "model_optimized.onnx"
};

using var session = new InferenceSession("model.onnx", options);

Batch Inference

csharp
public class BatchPredictor : IDisposable
{
    private readonly InferenceSession _session;

    public BatchPredictor(string modelPath)
    {
        var options = new SessionOptions
        {
            GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL
        };
        _session = new InferenceSession(modelPath, options);
    }

    public float[][] PredictBatch(float[][] inputs, int featureCount)
    {
        int batchSize = inputs.Length;

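        // Flatten batch into single tensor: [batchSize, featureCount]
        var flatInput = new float[batchSize * featureCount];
        for (int i = 0; i < batchSize; i++)
        {
            // Reject ragged rows; a longer row would otherwise be silently truncated
            if (inputs[i].Length != featureCount)
                throw new ArgumentException(
                    $"Row {i} has {inputs[i].Length} features; expected {featureCount}.", nameof(inputs));
            Array.Copy(inputs[i], 0, flatInput, i * featureCount, featureCount);
        }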

        var tensor = new DenseTensor<float>(flatInput, new[] { batchSize, featureCount });
        var onnxInputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input", tensor)
        };

        using var results = _session.Run(onnxInputs);
        var outputTensor = results.First().AsTensor<float>();
        int outputSize = outputTensor.Dimensions[1];

        var outputs = new float[batchSize][];
        for (int i = 0; i < batchSize; i++)
        {
            outputs[i] = new float[outputSize];
            for (int j = 0; j < outputSize; j++)
            {
                outputs[i][j] = outputTensor[i, j];
            }
        }

        return outputs;
    }

    public void Dispose() => _session.Dispose();
}

ASP.NET Core Integration

csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton(sp =>
{
    var options = new SessionOptions
    {
        GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL,
        IntraOpNumThreads = 4
    };
    return new InferenceSession("model.onnx", options);
});

var app = builder.Build();

app.MapPost("/predict", (PredictRequest request, InferenceSession session) =>
{
    var tensor = new DenseTensor<float>(
        request.Features, new[] { 1, request.Features.Length });

    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor("input", tensor)
    };

    using var results = session.Run(inputs);
    var output = results.First().AsTensor<float>().ToArray();

    return Results.Ok(new { predictions = output });
});

app.Run();

record PredictRequest(float[] Features);

Text Embedding with ONNX

csharp
public class TextEmbedder : IDisposable
{
    private readonly InferenceSession _session;

    public TextEmbedder(string modelPath)
    {
        _session = new InferenceSession(modelPath);
    }

    public float[] Embed(long[] inputIds, long[] attentionMask)
    {
        var idsTensor = new DenseTensor<long>(inputIds, new[] { 1, inputIds.Length });
        var maskTensor = new DenseTensor<long>(attentionMask, new[] { 1, attentionMask.Length });

        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids", idsTensor),
            NamedOnnxValue.CreateFromTensor("attention_mask", maskTensor)
        };

        using var results = _session.Run(inputs);
        var embeddings = results.First().AsTensor<float>();

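        // Mean pooling over the sequence dimension, assuming the first output
        // is the token-level hidden state with shape [1, seqLen, hiddenSize]
        int seqLen = inputIds.Length;
        int hiddenSize = embeddings.Dimensions[2];
        var pooled = new float[hiddenSize];

        // Count the same tokens the loop below sums over, and avoid dividing by zero
        int validTokens = attentionMask.Count(m => m == 1);
        if (validTokens == 0)
            throw new ArgumentException("attentionMask must mark at least one token.", nameof(attentionMask));

        for (int h = 0; h < hiddenSize; h++)
        {
            float sum = 0;
            for (int s = 0; s < seqLen; s++)
            {
                if (attentionMask[s] == 1)
                    sum += embeddings[0, s, h];
            }
            pooled[h] = sum / validTokens;
        }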

        return pooled;
    }

    public void Dispose() => _session.Dispose();
}
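
Pooled embeddings such as those returned by `Embed` are usually compared with cosine similarity. A minimal sketch, assuming a hypothetical helper `CosineSimilarity` and made-up vectors. Returning 0 for a zero-length vector is a choice made here for illustration, not a convention taken from the source.

```csharp
using System;

// Hypothetical helper: cosine similarity between two embeddings of equal length.
static float CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embeddings must have the same length.");

    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Cosine similarity is undefined for a zero vector; 0 is returned by choice.
    if (normA == 0f || normB == 0f) return 0f;
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

// Usage with made-up vectors: parallel vectors score near 1, orthogonal ones score 0.
Console.WriteLine(CosineSimilarity(new[] { 1f, 2f }, new[] { 2f, 4f }));
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 1f }));
```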

Best Practices

  • Register InferenceSession as a singleton in DI because it is thread-safe for Run() calls and expensive to initialize; creating a session per request causes severe performance degradation.
  • Inspect session.InputMetadata and session.OutputMetadata at startup to verify input/output names, shapes, and data types match your code rather than relying on documentation that may be outdated.
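  • GraphOptimizationLevel.ORT_ENABLE_ALL enables constant folding, node fusion, and layout optimizations. It is already the default level, so set it explicitly mainly to document intent; the latency gain depends on the model and hardware, so measure it rather than assume a fixed percentage.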
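  • Set OptimizedModelFilePath on SessionOptions to save the optimized graph to disk, then load that saved file on later starts so the session can skip re-optimization. Generate it on the same hardware and execution provider you deploy to, because the higher optimization levels can be hardware-specific.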
  • Batch multiple inputs into a single tensor (e.g., shape [N, features]) and call session.Run once rather than looping N times, which avoids repeating per-call overhead and memory allocation.
  • Pre-allocate DenseTensor<T> instances and reuse them across inference calls in hot paths to avoid repeated heap allocations that trigger garbage collection pauses.
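  • Append CUDA or DirectML ahead of the CPU provider, which ONNX Runtime always keeps as the final fallback for operators the GPU provider cannot run. Appending a GPU provider throws if its native libraries are missing, so catch that failure and continue with CPU-only options so the application runs on any machine regardless of GPU availability.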
  • Normalize input data consistently with the model's training preprocessing (e.g., ImageNet mean/std for vision models, tokenizer for NLP models); mismatched normalization is the most common cause of incorrect predictions.
  • Use session.Run(inputs, outputNames) with explicit output names when the model has multiple outputs to avoid fetching and copying outputs you do not need.
  • Monitor inference latency with Stopwatch or OpenTelemetry and set up alerts for p99 latency regressions that may indicate model size issues or execution provider misconfiguration.
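
The latency-monitoring practice above can be sketched with Stopwatch and a nearest-rank percentile. `TimeMs` and `Percentile` are hypothetical helpers, and the timed action is a stand-in for a real `session.Run(inputs)` call; a production service would more likely export these measurements through OpenTelemetry than compute percentiles in process.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: runs an action and returns the elapsed time in milliseconds.
static double TimeMs(Action action)
{
    var stopwatch = Stopwatch.StartNew();
    action();
    stopwatch.Stop();
    return stopwatch.Elapsed.TotalMilliseconds;
}

// Hypothetical helper: nearest-rank percentile, with p in (0, 100].
static double Percentile(IReadOnlyList<double> samples, double p)
{
    if (samples.Count == 0) throw new ArgumentException("No samples recorded.");
    var sorted = new List<double>(samples);
    sorted.Sort();
    int rank = (int)Math.Ceiling(p * sorted.Count / 100.0);
    return sorted[Math.Clamp(rank, 1, sorted.Count) - 1];
}

var latencies = new List<double>();
for (int i = 0; i < 100; i++)
{
    // Stand-in workload; replace with the real inference call.
    latencies.Add(TimeMs(() => Thread.SpinWait(10_000)));
}

Console.WriteLine(
    $"p50: {Percentile(latencies, 50):F3} ms, p99: {Percentile(latencies, 99):F3} ms");
```

Recording the first few calls separately is worth considering, since session warm-up can skew the tail.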