Azure.AI.Inference
Overview
Azure.AI.Inference is the .NET SDK for calling models deployed from the Azure AI model catalog (including OpenAI, Mistral, Cohere, Meta Llama, and others) through a unified API. It provides ChatCompletionsClient for chat and text generation, with support for streaming, tool calling, and structured output, and EmbeddingsClient for vector embeddings. Both clients accept either an API key or an Azure credential.
NuGet Packages
bash
dotnet add package Azure.AI.Inference
dotnet add package Azure.Identity # For managed identity / DefaultAzureCredential
Chat Completions
csharp
using Azure;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-model.eastus.models.ai.azure.com");
var credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_AI_KEY")!);
var client = new ChatCompletionsClient(endpoint, credential);

var requestOptions = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestSystemMessage("You are a helpful coding assistant specializing in C#."),
        new ChatRequestUserMessage("Write a method to validate an email address using regex.")
    },
    Temperature = 0.7f,
    MaxTokens = 1024
};

ChatCompletions response = await client.CompleteAsync(requestOptions);
Console.WriteLine(response.Choices[0].Message.Content);
Console.WriteLine($"Tokens used: {response.Usage.TotalTokens}");
Streaming Chat Completions
csharp
// Reuses the ChatCompletionsClient (client) created in the previous example
var requestOptions = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestSystemMessage("You are a technical writer."),
        new ChatRequestUserMessage("Explain async/await in C# with examples.")
    },
    Temperature = 0.5f
};

// CompleteStreamingAsync returns a Task, so await it before enumerating the updates
await foreach (StreamingChatCompletionsUpdate update in
    await client.CompleteStreamingAsync(requestOptions))
{
    if (update.ContentUpdate is { } content)
    {
        Console.Write(content);
    }
}
Console.WriteLine();
Using Azure Identity (Managed Identity / Entra ID)
csharp
using Azure.Identity;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-model.eastus.models.ai.azure.com");
var credential = new DefaultAzureCredential();
var client = new ChatCompletionsClient(endpoint, credential);

// The explicit ChatCompletions type unwraps the Response<ChatCompletions> returned by CompleteAsync
ChatCompletions response = await client.CompleteAsync(new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage("Summarize the benefits of managed identity in Azure.")
    }
});
Console.WriteLine(response.Choices[0].Message.Content);
Embeddings
csharp
using Azure;
using Azure.AI.Inference;

var endpoint = new Uri("https://my-embedding-model.eastus.models.ai.azure.com");
var credential = new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_AI_KEY")!);
var client = new EmbeddingsClient(endpoint, credential);

var request = new EmbeddingsOptions(new List<string>
{
    "The quick brown fox jumps over the lazy dog",
    "Azure AI provides enterprise-grade model hosting",
    "Semantic search uses vector embeddings for relevance"
});

EmbeddingsResult result = await client.EmbedAsync(request);
foreach (EmbeddingItem item in result.Data)
{
    // Embedding is returned as BinaryData holding a JSON array; deserialize it to floats
    float[] vector = item.Embedding.ToObjectFromJson<float[]>();
    Console.WriteLine($"Index {item.Index}: [{string.Join(", ", vector.Take(5))}...]");
    Console.WriteLine($" Dimensions: {vector.Length}");
}
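Embedding vectors like the ones printed above are usually compared with cosine similarity when they are used for semantic search. The following is a minimal sketch that does not depend on the SDK: CosineSimilarity is a hypothetical helper, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Azure.AI.Inference): cosine similarity
// between two embedding vectors of equal length.
static double CosineSimilarity(IReadOnlyList<float> a, IReadOnlyList<float> b)
{
    if (a.Count != b.Count)
        throw new ArgumentException("Vectors must have the same dimensionality.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Count; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Treat zero vectors as having no similarity to anything.
    if (normA == 0 || normB == 0)
        return 0;

    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Made-up vectors standing in for real embeddings.
var query = new float[] { 1f, 0f, 0f };
var docA = new float[] { 2f, 0f, 0f }; // same direction as the query
var docB = new float[] { 0f, 3f, 0f }; // orthogonal to the query

Console.WriteLine(CosineSimilarity(query, docA)); // 1
Console.WriteLine(CosineSimilarity(query, docB)); // 0
```

In a real application the inputs would be the float arrays deserialized from each EmbeddingItem, and both vectors would need to come from the same embedding model.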
Tool Calling (Function Calling)
csharp
using System.Text.Json;

var getWeatherTool = new ChatCompletionsFunctionToolDefinition
{
    Name = "get_weather",
    Description = "Get current weather for a location",
    Parameters = BinaryData.FromObjectAsJson(new
    {
        type = "object",
        properties = new
        {
            location = new { type = "string", description = "City name" },
            unit = new { type = "string", @enum = new[] { "celsius", "fahrenheit" } }
        },
        required = new[] { "location" }
    })
};

var options = new ChatCompletionsOptions
{
    Messages =
    {
        new ChatRequestUserMessage("What's the weather in Seattle?")
    },
    Tools = { getWeatherTool }
};

ChatCompletions response = await client.CompleteAsync(options);
var choice = response.Choices[0];

if (choice.FinishReason == CompletionsFinishReason.ToolCalls)
{
    // Add the assistant message that requested the tool calls once, before any tool results
    options.Messages.Add(new ChatRequestAssistantMessage(choice.Message));

    foreach (var toolCall in choice.Message.ToolCalls.OfType<ChatCompletionsFunctionToolCall>())
    {
        Console.WriteLine($"Function: {toolCall.Name}");
        Console.WriteLine($"Arguments: {toolCall.Arguments}");

        // Execute the function (stubbed here) and return the result for this tool call
        var weatherResult = JsonSerializer.Serialize(new { temperature = 62, condition = "cloudy" });
        options.Messages.Add(new ChatRequestToolMessage(toolCallId: toolCall.Id, content: weatherResult));
    }

    // Get the final response with tool results
    ChatCompletions finalResponse = await client.CompleteAsync(options);
    Console.WriteLine(finalResponse.Choices[0].Message.Content);
}
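The arguments of a tool call arrive as a JSON string produced by the model, so it is worth validating them before running anything. The sketch below assumes .NET 8 or later (for UnmappedMemberHandling); WeatherArgs and ParseWeatherArgs are hypothetical names that mirror the get_weather schema above and are not part of the SDK.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical validation helper: returns null when the arguments are malformed
// or fail the checks, so the caller can refuse to execute the function.
static WeatherArgs? ParseWeatherArgs(string argumentsJson)
{
    var jsonOptions = new JsonSerializerOptions
    {
        PropertyNameCaseInsensitive = true,
        // Reject properties that the model type does not declare (.NET 8+).
        UnmappedMemberHandling = JsonUnmappedMemberHandling.Disallow
    };

    try
    {
        var args = JsonSerializer.Deserialize<WeatherArgs>(argumentsJson, jsonOptions);
        if (args is null || string.IsNullOrWhiteSpace(args.Location)) return null;
        if (args.Unit is not (null or "celsius" or "fahrenheit")) return null;
        return args;
    }
    catch (JsonException)
    {
        // Invalid JSON, wrong value types, or an undeclared property.
        return null;
    }
}

Console.WriteLine(ParseWeatherArgs("""{"location":"Seattle","unit":"celsius"}""") is not null); // True
Console.WriteLine(ParseWeatherArgs("""{"location":"Seattle","admin":true}""") is not null);     // False
Console.WriteLine(ParseWeatherArgs("""{"unit":"kelvin"}""") is not null);                       // False

// Hypothetical model mirroring the get_weather parameter schema.
record WeatherArgs(string Location, string? Unit = null);
```

In the tool-calling loop, toolCall.Arguments would be passed to a helper like this, and a null result could be reported back to the model as an error in the tool message instead of executing the function.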
Multi-Turn Conversation
csharp
var conversationHistory = new List<ChatRequestMessage>
{
    new ChatRequestSystemMessage("You are a .NET architecture advisor.")
};

async Task<string> ChatAsync(string userMessage)
{
    conversationHistory.Add(new ChatRequestUserMessage(userMessage));

    // Send the full history so the model sees the earlier turns
    var options = new ChatCompletionsOptions();
    foreach (var msg in conversationHistory)
    {
        options.Messages.Add(msg);
    }
    options.Temperature = 0.7f;
    options.MaxTokens = 2048;

    ChatCompletions response = await client.CompleteAsync(options);
    var assistantMessage = response.Choices[0].Message.Content;

    // Record the reply so the next turn includes it
    conversationHistory.Add(new ChatRequestAssistantMessage(assistantMessage));
    return assistantMessage;
}

Console.WriteLine(await ChatAsync("Should I use microservices or a monolith?"));
Console.WriteLine(await ChatAsync("What about for a team of 3 developers?"));
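Because ChatAsync resends the whole conversationHistory on every call, the request grows with each turn. One way to bound it is a sliding window that keeps the system prompt plus the most recent messages. The sketch below is a generic illustration: TrimHistory and the maxRecent parameter are hypothetical, and strings stand in for the SDK message types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keeps the first message (assumed to be the system prompt)
// plus the most recent maxRecent messages. Generic so it works with any message type.
static List<T> TrimHistory<T>(IReadOnlyList<T> history, int maxRecent)
{
    if (maxRecent < 0)
        throw new ArgumentOutOfRangeException(nameof(maxRecent));

    // Nothing to trim when the history already fits in the window.
    if (history.Count <= maxRecent + 1)
        return history.ToList();

    var trimmed = new List<T> { history[0] };
    trimmed.AddRange(history.Skip(history.Count - maxRecent));
    return trimmed;
}

// Strings stand in for ChatRequestMessage instances.
var history = new List<string>
{
    "system", "user 1", "assistant 1", "user 2", "assistant 2", "user 3"
};

var window = TrimHistory(history, maxRecent: 3);
Console.WriteLine(string.Join(" | ", window));
// system | user 2 | assistant 2 | user 3
```

A message-count window is only a rough proxy for the token budget, and a window that cuts between an assistant tool request and its tool result would leave the history inconsistent, so a production version would likely trim on whole turns.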
Model Comparison on Azure AI
| Model Provider | Typical Use | Azure Endpoint Pattern |
|---|---|---|
| OpenAI (GPT-4o) | General chat, code, reasoning | *.openai.azure.com |
| Mistral Large | Multilingual, code, reasoning | *.models.ai.azure.com |
| Cohere Command R+ | RAG, search-grounded generation | *.models.ai.azure.com |
| Meta Llama 3.1 | Open-weight general purpose | *.models.ai.azure.com |
| Phi-3 | Compact, on-device capable | *.models.ai.azure.com |
Dependency Injection Integration
csharp
using Azure.AI.Inference;
using Azure.Identity;

var builder = WebApplication.CreateBuilder(args);

// Register both clients as singletons; endpoints come from configuration
builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureAI:Endpoint"]!);
    var credential = new DefaultAzureCredential();
    return new ChatCompletionsClient(endpoint, credential);
});

builder.Services.AddSingleton(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureAI:EmbeddingsEndpoint"]!);
    var credential = new DefaultAzureCredential();
    return new EmbeddingsClient(endpoint, credential);
});

var app = builder.Build();

app.MapPost("/chat", async (ChatRequest request, ChatCompletionsClient client) =>
{
    var options = new ChatCompletionsOptions
    {
        Messages = { new ChatRequestUserMessage(request.Message) },
        MaxTokens = 1024
    };

    ChatCompletions response = await client.CompleteAsync(options);
    return Results.Ok(new { response = response.Choices[0].Message.Content });
});

app.Run();

record ChatRequest(string Message);
Best Practices
- Use DefaultAzureCredential from Azure.Identity instead of API keys in production; it supports managed identity, Azure CLI, and Visual Studio credentials with automatic fallback.
- Register ChatCompletionsClient and EmbeddingsClient as singletons in DI since they are thread-safe and designed for reuse across requests.
- Set an explicit MaxTokens on every request to prevent unexpectedly large responses that consume budget; pair this with Temperature tuning per use case (0.0 for near-deterministic output, 0.7-1.0 for creative output).
- Use CompleteStreamingAsync for user-facing chat interfaces to deliver partial responses in real time rather than waiting for full completion.
- Handle transient 429 (rate limit) and 503 (service unavailable) errors with retries and exponential backoff; Azure.Core provides a built-in retry policy that can be tuned through the Retry settings on ChatCompletionsClientOptions.
- Validate tool call arguments with JsonSerializer.Deserialize into strongly-typed models before executing functions to prevent injection of unexpected parameters.
- Trim conversation history to stay within model context windows by summarizing older messages or using a sliding window of recent turns.
- Store endpoint URLs and model deployment names in IConfiguration (appsettings, environment variables, or Key Vault) rather than hardcoding them.
- Monitor token usage from response.Usage.TotalTokens and log it per request to track costs and detect anomalous consumption patterns.
- Use a separate EmbeddingsClient instance for each embedding model, and compare only vectors produced by the same model; embeddings with different dimensionality are not comparable.
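The retry bullet above relies on the retry settings that Azure.Core exposes on its ClientOptions base class. The sketch below tunes them through that base class so it does not depend on the exact name of the Azure.AI.Inference options type, which may differ between package versions; ConfigureRetries is a hypothetical helper and the values are illustrative, not recommendations.

```csharp
using System;
using Azure.Core;

// Hypothetical helper: tunes the built-in retry policy on any Azure SDK client
// options object. ClientOptions is the Azure.Core base class for client options.
static void ConfigureRetries(ClientOptions options)
{
    options.Retry.Mode = RetryMode.Exponential;         // back off exponentially between attempts
    options.Retry.MaxRetries = 5;                       // illustrative value
    options.Retry.Delay = TimeSpan.FromSeconds(1);      // initial delay before the first retry
    options.Retry.MaxDelay = TimeSpan.FromSeconds(30);  // upper bound on the delay
}
```

The configured options instance would then be passed to the client constructor overload that accepts client options, alongside the endpoint and credential.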