azure-ai-inference

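Use when calling Azure-hosted AI models via the Azure.AI.Inference SDK. Covers chat completions, embeddings, streaming, model selection, and Azure AI model catalog integration. Use for: calling Azure AI model catalog models, Azure OpenAI chat completions, generating embeddings from Azure-hosted models, streaming chat responses, working with Mistral/Cohere/Llama models on Azure. Do not use for: provider-agnostic AI abstractions (use microsoft-extensions-ai), building agent workflows with tools (use agent-framework), on-device ML inference (use onnx), training custom ML models (use mlnet).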

SKILL.md
--- frontmatter
name: azure-ai-inference
description: |
  Use when calling Azure-hosted AI models via the Azure.AI.Inference SDK. Covers chat completions, embeddings, streaming, model selection, and Azure AI model catalog integration.
  USE FOR: calling Azure AI model catalog models, Azure OpenAI chat completions, generating embeddings from Azure-hosted models, streaming chat responses, working with Mistral/Cohere/Llama models on Azure
  DO NOT USE FOR: provider-agnostic AI abstractions (use microsoft-extensions-ai), building agent workflows with tools (use agent-framework), on-device ML inference (use onnx), training custom ML models (use mlnet)
license: MIT
metadata:
  displayName: "Azure.AI.Inference"
  author: "Tyler-R-Kendrick"
  version: "1.0.0"
compatibility: claude, copilot, cursor

Azure.AI.Inference

Overview

Azure.AI.Inference is the .NET SDK for calling models deployed in the Azure AI model catalog (including OpenAI, Mistral, Cohere, Meta Llama, and others) through a unified API. It provides ChatCompletionsClient for chat/text generation, including streaming, tool calling, and structured output, and EmbeddingsClient for vector embeddings. Both clients accept either an API key or an Azure Identity token credential.

NuGet Packages

bash
dotnet add package Azure.AI.Inference --prerelease   # Needed while the package ships only preview versions
dotnet add package Azure.Identity   # For managed identity / DefaultAzureCredential

Chat Completions

csharp
using Azure;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-model.eastus.models.ai.azure.com");
var credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_AI_KEY")!);

var client = new ChatCompletionsClient(endpoint, credential);

var requestOptions = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful coding assistant specializing in C#."),
        new ChatRequestUserMessage("Write a method to validate an email address using regex.")
    },
    Temperature = 0.7f,
    MaxTokens = 1024
};

ChatCompletions response = await client.CompleteAsync(requestOptions);
Console.WriteLine(response.Choices[0].Message.Content);
Console.WriteLine($"Tokens used: {response.Usage.TotalTokens}");

Streaming Chat Completions

csharp
var requestOptions = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestSystemMessage("You are a technical writer."),
        new ChatRequestUserMessage("Explain async/await in C# with examples.")
    },
    Temperature = 0.5f
};

// CompleteStreamingAsync returns a Task, so await it before enumerating the updates.
await foreach (StreamingChatCompletionsUpdate update in
    await client.CompleteStreamingAsync(requestOptions))
{
    if (update.ContentUpdate is { } content)
    {
        Console.Write(content);
    }
}
Console.WriteLine();
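
A chat UI that streams text usually still needs the complete reply afterwards, for example to append it to the conversation history. The following is a minimal sketch of that accumulation pattern. It runs without the SDK: FakeUpdates is a hypothetical stand-in for the update stream, yielding the same kind of nullable content chunks that ContentUpdate carries.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK's streaming updates: some chunks carry
// text, others (such as role-only updates) carry no content.
static async IAsyncEnumerable<string?> FakeUpdates()
{
    foreach (var chunk in new string?[] { "Hello", null, ", ", "world" })
    {
        await Task.Yield(); // simulate asynchronous arrival
        yield return chunk;
    }
}

var transcript = new StringBuilder();

await foreach (var content in FakeUpdates())
{
    if (content is null) continue;  // skip updates without text
    Console.Write(content);         // show the partial reply immediately
    transcript.Append(content);     // keep it for later use
}
Console.WriteLine();

Console.WriteLine($"Captured {transcript.Length} characters"); // Captured 12 characters
```

With the real client, the loop body stays the same and the source becomes the awaited result of CompleteStreamingAsync.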

Using Azure Identity (Managed Identity / Entra ID)

csharp
using Azure.Identity;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-model.eastus.models.ai.azure.com");
var credential = new DefaultAzureCredential();

var client = new ChatCompletionsClient(endpoint, credential);

ChatCompletions response = await client.CompleteAsync(new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage("Summarize the benefits of managed identity in Azure.")
    }
});

Console.WriteLine(response.Choices[0].Message.Content);

Embeddings

csharp
using Azure;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-embedding-model.eastus.models.ai.azure.com");
var credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_AI_KEY")!);

var client = new EmbeddingsClient(endpoint, credential);

var request = new EmbeddingsOptions(new List<string>
{
    "The quick brown fox jumps over the lazy dog",
    "Azure AI provides enterprise-grade model hosting",
    "Semantic search uses vector embeddings for relevance"
});

EmbeddingsResult result = await client.EmbedAsync(request);

foreach (EmbeddingItem item in result.Data)
{
    // Embedding is BinaryData holding a JSON array of floats, so decode it before use.
    float[] vector = item.Embedding.ToObjectFromJson<float[]>();
    Console.WriteLine($"Index {item.Index}: [{string.Join(", ", vector.Take(5))}...]");
    Console.WriteLine($"  Dimensions: {vector.Length}");
}
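
Once embeddings are decoded to float vectors, semantic search typically ranks texts by cosine similarity. The following is a minimal sketch, assuming the vectors have already been retrieved. CosineSimilarity is a hypothetical helper, not part of Azure.AI.Inference, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```csharp
using System;

// Hypothetical helper (not part of Azure.AI.Inference): cosine similarity
// between two embedding vectors of equal length.
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimensionality.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // A zero vector has no direction, so treat its similarity as 0.
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up 3-dimensional vectors standing in for real embeddings.
float[] query = { 1f, 0f, 0f };
float[] sameDirection = { 2f, 0f, 0f };
float[] orthogonal = { 0f, 1f, 0f };

Console.WriteLine(CosineSimilarity(query, sameDirection)); // 1
Console.WriteLine(CosineSimilarity(query, orthogonal));    // 0
```

Scores are only meaningful between vectors produced by the same embedding model, which is why the helper rejects mismatched lengths.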

Tool Calling (Function Calling)

csharp
using System.Text.Json;

var getWeatherTool = new ChatCompletionsFunctionToolDefinition
{
    Name = "get_weather",
    Description = "Get current weather for a location",
    Parameters = BinaryData.FromObjectAsJson(new
    {
        type = "object",
        properties = new
        {
            location = new { type = "string", description = "City name" },
            unit = new { type = "string", @enum = new[] { "celsius", "fahrenheit" } }
        },
        required = new[] { "location" }
    })
};

var options = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage("What's the weather in Seattle?")
    },
    Tools = { getWeatherTool }
};

ChatCompletions response = await client.CompleteAsync(options);
var choice = response.Choices[0];

if (choice.FinishReason == CompletionsFinishReason.ToolCalls)
{
    // Add the assistant's tool-call message once, before any tool results.
    options.Messages.Add(new ChatRequestAssistantMessage(choice.Message));

    foreach (var toolCall in choice.Message.ToolCalls.OfType<ChatCompletionsFunctionToolCall>())
    {
        Console.WriteLine($"Function: {toolCall.Name}");
        Console.WriteLine($"Arguments: {toolCall.Arguments}");

        // Execute the function (stubbed here) and return its result to the model.
        var weatherResult = JsonSerializer.Serialize(new { temperature = 62, condition = "cloudy" });
        options.Messages.Add(new ChatRequestToolMessage(toolCall.Id, weatherResult));
    }

    // Get the final response with tool results
    ChatCompletions finalResponse = await client.CompleteAsync(options);
    Console.WriteLine(finalResponse.Choices[0].Message.Content);
}
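
The Arguments value on a tool call is JSON generated by the model, so it is worth parsing it into a typed shape before running anything. The following is a minimal sketch using only System.Text.Json. WeatherArgs and TryParseWeatherArgs are hypothetical names chosen to mirror the get_weather schema above, and the UnmappedMemberHandling setting assumes .NET 8 or later.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical helper: parse and validate model-generated arguments for the
// get_weather tool. Returns false instead of throwing on bad input.
static bool TryParseWeatherArgs(string json, out WeatherArgs? args)
{
    args = null;
    try
    {
        var parsed = JsonSerializer.Deserialize<WeatherArgs>(json, WeatherArgs.JsonOptions);
        if (parsed is null || string.IsNullOrWhiteSpace(parsed.Location)) return false;
        if (parsed.Unit is not (null or "celsius" or "fahrenheit")) return false;
        args = parsed;
        return true;
    }
    catch (JsonException)
    {
        // Malformed JSON, wrong value types, or properties the schema lacks.
        return false;
    }
}

// Valid arguments
Console.WriteLine(TryParseWeatherArgs("""{"location":"Seattle","unit":"celsius"}""", out var ok)); // True
// Undeclared property
Console.WriteLine(TryParseWeatherArgs("""{"location":"Seattle","admin":true}""", out _)); // False
// Missing location, unsupported unit
Console.WriteLine(TryParseWeatherArgs("""{"unit":"kelvin"}""", out _)); // False

// Hypothetical typed model mirroring the tool's JSON schema.
record WeatherArgs(
    [property: JsonPropertyName("location")] string? Location,
    [property: JsonPropertyName("unit")] string? Unit)
{
    public static readonly JsonSerializerOptions JsonOptions = new()
    {
        // Reject properties the schema does not declare (.NET 8+).
        UnmappedMemberHandling = JsonUnmappedMemberHandling.Disallow
    };
}
```

In the tool-calling loop, a failed parse could be reported back to the model as the tool result rather than executing the function with partial input.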

Multi-Turn Conversation

csharp
var conversationHistory = new List<ChatRequestMessage>
{
    new ChatRequestSystemMessage("You are a .NET architecture advisor.")
};

async Task<string> ChatAsync(string userMessage)
{
    conversationHistory.Add(new ChatRequestUserMessage(userMessage));

    var options = new ChatCompletionsOptions();
    foreach (var msg in conversationHistory)
    {
        options.Messages.Add(msg);
    }
    options.Temperature = 0.7f;
    options.MaxTokens = 2048;

    ChatCompletions response = await client.CompleteAsync(options);
    var assistantMessage = response.Choices[0].Message.Content;

    conversationHistory.Add(new ChatRequestAssistantMessage(assistantMessage));
    return assistantMessage;
}

Console.WriteLine(await ChatAsync("Should I use microservices or a monolith?"));
Console.WriteLine(await ChatAsync("What about for a team of 3 developers?"));
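
The conversationHistory list above grows with every turn, so a long session will eventually exceed the model's context window. The following is a minimal sliding-window sketch, written over plain (role, text) pairs so that it runs without the SDK. TrimHistory and maxMessages are hypothetical names, and a real implementation would likely budget by tokens rather than by message count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keep every system message plus the most recent
// maxMessages non-system messages, preserving their order.
static List<(string Role, string Text)> TrimHistory(
    IReadOnlyList<(string Role, string Text)> history, int maxMessages)
{
    var system = history.Where(m => m.Role == "system");
    var recent = history.Where(m => m.Role != "system").TakeLast(maxMessages);
    return system.Concat(recent).ToList();
}

var history = new List<(string Role, string Text)>
{
    ("system", "You are a .NET architecture advisor."),
    ("user", "Q1"), ("assistant", "A1"),
    ("user", "Q2"), ("assistant", "A2"),
    ("user", "Q3"),
};

foreach (var (role, text) in TrimHistory(history, maxMessages: 3))
    Console.WriteLine($"{role}: {text}");
// system: You are a .NET architecture advisor.
// user: Q2
// assistant: A2
// user: Q3
```

Applied to ChatAsync, the same idea would run just before the messages are copied into ChatCompletionsOptions, so the system prompt always survives while older turns drop off.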

Model Comparison on Azure AI

| Model Provider | Typical Use | Azure Endpoint Pattern |
| --- | --- | --- |
| OpenAI (GPT-4o) | General chat, code, reasoning | *.openai.azure.com |
| Mistral Large | Multilingual, code, reasoning | *.models.ai.azure.com |
| Cohere Command R+ | RAG, search-grounded generation | *.models.ai.azure.com |
| Meta Llama 3.1 | Open-weight general purpose | *.models.ai.azure.com |
| Phi-3 | Compact, on-device capable | *.models.ai.azure.com |

Dependency Injection Integration

csharp
using Azure.AI.Inference;
using Azure.Identity;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureAI:Endpoint"]!);
    var credential = new DefaultAzureCredential();
    return new ChatCompletionsClient(endpoint, credential);
});

builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureAI:EmbeddingsEndpoint"]!);
    var credential = new DefaultAzureCredential();
    return new EmbeddingsClient(endpoint, credential);
});

var app = builder.Build();

app.MapPost("/chat", async (ChatRequest request, ChatCompletionsClient client) =>
{
    var options = new ChatCompletionsOptions
    {
        Messages = { new ChatRequestUserMessage(request.Message) },
        MaxTokens = 1024
    };
    ChatCompletions response = await client.CompleteAsync(options);
    return Results.Ok(new { response = response.Choices[0].Message.Content });
});

app.Run();

record ChatRequest(string Message);

Best Practices

  • Use DefaultAzureCredential from Azure.Identity instead of API keys in production; it supports managed identity, Azure CLI, and Visual Studio credentials with automatic fallback.
  • Register ChatCompletionsClient and EmbeddingsClient as singletons in DI since they are thread-safe and designed for reuse across requests.
  • Set explicit MaxTokens on every request to prevent unexpectedly large responses that consume budget; pair this with Temperature tuning per use case (0.0 for near-deterministic output, 0.7-1.0 for creative output).
  • Use CompleteStreamingAsync for user-facing chat interfaces to deliver partial responses in real time rather than waiting for full completion.
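  • Handle transient 429 (rate limit) and 503 (service unavailable) errors with exponential backoff; the Azure.Core pipeline already retries these by default, so tune attempts and delays through the Retry settings on the client options (AzureAIInferenceClientOptions) instead of hand-rolling a retry loop.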
  • Validate tool call arguments with JsonSerializer.Deserialize into strongly-typed models before executing functions to prevent injection of unexpected parameters.
  • Trim conversation history to stay within model context windows by summarizing older messages or using a sliding window of recent turns.
  • Store endpoint URLs and model deployment names in IConfiguration (appsettings, environment variables, or Key Vault) rather than hardcoding them.
  • Monitor token usage from response.Usage.TotalTokens and log it per request to track costs and detect anomalous consumption patterns.
  • Use a separate EmbeddingsClient instance for each embedding model deployment, and embed queries and documents with the same model, because vectors from different models (or with different dimensionality) cannot be compared.
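
To make the exponential backoff recommendation above concrete, here is a minimal sketch of the delay schedule such a policy produces. BackoffDelay is a hypothetical helper for illustration only, not Azure.Core's implementation, and the 1-second base and 10-second cap are assumed values.

```csharp
using System;

// Illustrative only: the wait before retry number `attempt` (0-based) under
// plain exponential backoff, capped at maxDelay. A production policy would
// typically also add jitter and honor Retry-After response headers.
static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
}

var baseDelay = TimeSpan.FromSeconds(1);  // assumed base delay
var maxDelay = TimeSpan.FromSeconds(10);  // assumed cap

for (int attempt = 0; attempt < 5; attempt++)
{
    var wait = BackoffDelay(attempt, baseDelay, maxDelay);
    Console.WriteLine($"retry {attempt + 1}: wait {wait.TotalSeconds}s");
}
// retry 1: wait 1s
// retry 2: wait 2s
// retry 3: wait 4s
// retry 4: wait 8s
// retry 5: wait 10s
```

The cap matters for rate-limit errors: without it, a handful of retries would push waits past what a user-facing request can tolerate.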