PdfPig

Overview

PdfPig is a fully managed .NET library for reading and extracting content from PDF documents. It provides access to text (at page, word, and letter levels), images, annotations, and document metadata. PdfPig does not depend on native libraries or external tools, making it fully cross-platform.

PdfPig is built for reading: it parses and extracts content from existing PDFs and does not modify them. Its PDF creation support (PdfDocumentBuilder) covers only simple documents, so for full-featured PDF generation use PdfSharpCore or QuestPDF. PdfPig is particularly useful for document processing pipelines, search indexing, data extraction, and PDF content analysis.

Install via NuGet:

code

dotnet add package PdfPig

Basic Text Extraction

Open a PDF and extract text from each page.

csharp

using System;
using UglyToad.PdfPig;

using var document = PdfDocument.Open("report.pdf");

Console.WriteLine($"Pages: {document.NumberOfPages}");

foreach (var page in document.GetPages())
{
    Console.WriteLine($"--- Page {page.Number} ---");
    Console.WriteLine($"Size: {page.Width:F0}x{page.Height:F0}");
    Console.WriteLine(page.Text);
    Console.WriteLine();
}
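
Text returned by page.Text often has inconsistent spacing, so a small clean-up pass before indexing or comparing it can help. The following is a minimal sketch using only the .NET base class library; TextNormalizer and its rules are hypothetical (not part of PdfPig) and should be tuned to the documents at hand.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PdfPig.
public static class TextNormalizer
{
    public static string Normalize(string raw)
    {
        if (string.IsNullOrEmpty(raw)) return string.Empty;

        // Unify line endings so the later patterns only have to handle '\n'.
        var text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop a space that sits directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Keep at most one blank line between paragraphs.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");

        return text.Trim();
    }
}
```

Inside the page loop this could be applied as TextNormalizer.Normalize(page.Text) before printing or storing the text.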

Word-Level Extraction

Extract individual words with their positions for structured text analysis.

csharp

using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using var document = PdfDocument.Open("invoice.pdf");
var page = document.GetPage(1);

// Get all words on the page
var words = page.GetWords().ToList();

Console.WriteLine($"Word count: {words.Count}");

foreach (var word in words.Take(20))
{
    Console.WriteLine(
        $"  \"{word.Text}\" at ({word.BoundingBox.Left:F1}, {word.BoundingBox.Bottom:F1}) " +
        $"font: {word.FontName}, size: {word.Letters.First().PointSize:F1}pt");
}

// Find words in a specific region (e.g., top-right for invoice number)
var topRightWords = words
    .Where(w => w.BoundingBox.Left > page.Width * 0.6
             && w.BoundingBox.Top > page.Height * 0.8)
    .OrderByDescending(w => w.BoundingBox.Top)
    .ThenBy(w => w.BoundingBox.Left);

Console.WriteLine("\nTop-right region:");
foreach (var word in topRightWords)
{
    Console.Write($"{word.Text} ");
}

Extracting Text by Region

Extract text from specific rectangular regions of a page for structured document parsing.

csharp

using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Geometry; // Contains(...) extension methods for PdfRectangle

using var document = PdfDocument.Open("form.pdf");
var page = document.GetPage(1);

// Define a region of interest (coordinates from bottom-left)
var region = new PdfRectangle(50, 700, 300, 750); // left, bottom, right, top

var wordsInRegion = page.GetWords()
    .Where(w => region.Contains(w.BoundingBox.BottomLeft))
    .OrderBy(w => w.BoundingBox.Left);

string regionText = string.Join(" ", wordsInRegion.Select(w => w.Text));
Console.WriteLine($"Text in region: {regionText}");

// Extract table-like data by defining column boundaries
double[] columnBoundaries = { 50, 200, 350, 500 };
var rows = page.GetWords()
    .GroupBy(w => Math.Round(w.BoundingBox.Bottom / 15) * 15) // bucket baselines into 15-point rows; tune to the line height
    .OrderByDescending(g => g.Key)
    .Select(row => columnBoundaries
        .Select((colStart, i) =>
        {
            var colEnd = i < columnBoundaries.Length - 1
                ? columnBoundaries[i + 1]
                : page.Width;
            return string.Join(" ", row
                .Where(w => w.BoundingBox.Left >= colStart && w.BoundingBox.Left < colEnd)
                .OrderBy(w => w.BoundingBox.Left)
                .Select(w => w.Text));
        })
        .ToArray());

foreach (var row in rows)
{
    Console.WriteLine(string.Join(" | ", row));
}
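
Rounding baselines to a fixed grid, as in the table example, can split one visual line in two when its words fall on either side of a bin edge (with a 15-point grid, baselines of 7.4 and 7.6 land in different rows). A tolerance-based grouping is one alternative. The sketch below is self-contained and uses a hypothetical PositionedWord record rather than PdfPig's own types; the 3-point default tolerance is an assumption, not a recommended value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal shape; project PdfPig words into it before grouping.
public record PositionedWord(string Text, double Left, double Bottom);

public static class LineGrouper
{
    // Groups words into lines (top of page first) by comparing each word's
    // baseline with the baseline of the first word in the current line.
    public static List<List<PositionedWord>> GroupIntoLines(
        IEnumerable<PositionedWord> words, double tolerance = 3.0)
    {
        var lines = new List<List<PositionedWord>>();
        double lineBottom = 0;

        // PDF y-coordinates grow upward, so descending Bottom reads top to bottom.
        foreach (var word in words.OrderByDescending(w => w.Bottom))
        {
            if (lines.Count == 0 || Math.Abs(lineBottom - word.Bottom) > tolerance)
            {
                lines.Add(new List<PositionedWord>());
                lineBottom = word.Bottom;
            }

            lines[^1].Add(word);
        }

        // Restore left-to-right reading order inside each line.
        return lines.Select(line => line.OrderBy(w => w.Left).ToList()).ToList();
    }
}
```

With PdfPig, the input could be produced with page.GetWords().Select(w => new PositionedWord(w.Text, w.BoundingBox.Left, w.BoundingBox.Bottom)).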

Searching PDF Content

Search for specific text across all pages of a PDF.

csharp

using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;

public class PdfSearcher
{
    public IReadOnlyList<SearchResult> Search(string filePath, string searchTerm)
    {
        var results = new List<SearchResult>();

        using var document = PdfDocument.Open(filePath);
        foreach (var page in document.GetPages())
        {
            // Full-text search
            if (page.Text.Contains(searchTerm, StringComparison.OrdinalIgnoreCase))
            {
                // Collect each individual word that contains the term.
                // A multi-word phrase can pass the page check above yet match no single word.
                var matchingWords = page.GetWords()
                    .Where(w => w.Text.Contains(searchTerm, StringComparison.OrdinalIgnoreCase))
                    .Select(w => new SearchResult(
                        page.Number,
                        w.Text,
                        w.BoundingBox.Left,
                        w.BoundingBox.Bottom))
                    .ToList();

                results.AddRange(matchingWords);
            }
        }

        return results;
    }
}

public record SearchResult(int PageNumber, string MatchedText, double X, double Y);

// Usage (in a single file, top-level statements like these must come before the type declarations)
var searcher = new PdfSearcher();
var matches = searcher.Search("contract.pdf", "payment");
foreach (var match in matches)
{
    Console.WriteLine($"Page {match.PageNumber}: \"{match.MatchedText}\" at ({match.X:F0}, {match.Y:F0})");
}
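
The searcher above matches one word at a time, so a phrase such as "payment terms" is found in page.Text but produces no per-word results. One way to handle phrases is to compare runs of consecutive words. The sketch below works on plain strings; PhraseMatcher is a hypothetical helper, and it assumes the word list is already in reading order and that punctuation attached to a word (for example "terms.") should not match.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of PdfPig.
public static class PhraseMatcher
{
    // Returns the start index of every run of consecutive words that equals
    // the space-separated tokens of the phrase, ignoring case.
    public static List<int> FindPhrase(IReadOnlyList<string> words, string phrase)
    {
        var tokens = phrase.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var hits = new List<int>();
        if (tokens.Length == 0) return hits;

        for (int start = 0; start + tokens.Length <= words.Count; start++)
        {
            bool match = true;
            for (int offset = 0; offset < tokens.Length && match; offset++)
            {
                match = string.Equals(
                    words[start + offset], tokens[offset],
                    StringComparison.OrdinalIgnoreCase);
            }

            if (match) hits.Add(start);
        }

        return hits;
    }
}
```

Each returned index can be mapped back to the corresponding entry of page.GetWords().ToList() to recover the bounding box of the first word in the match.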

Extracting Images

Extract images embedded in PDF pages.

csharp

using System;
using System.IO;
using System.Linq;
using UglyToad.PdfPig;

using var document = PdfDocument.Open("brochure.pdf");

int imageCount = 0;
foreach (var page in document.GetPages())
{
    var images = page.GetImages().ToList();
    Console.WriteLine($"Page {page.Number}: {images.Count} images");

    foreach (var image in images)
    {
        imageCount++;
        Console.WriteLine(
            $"  Image {imageCount}: {image.WidthInSamples}x{image.HeightInSamples}, " +
            $"Bounds: ({image.Bounds.Left:F0}, {image.Bounds.Bottom:F0})");

        // Prefer a PNG conversion; fall back to the raw (possibly still encoded) image bytes
        if (image.TryGetPng(out var pngBytes))
        {
            File.WriteAllBytes($"image_{imageCount}.png", pngBytes);
        }
        else
        {
            File.WriteAllBytes($"image_{imageCount}.raw", image.RawBytes.ToArray());
        }
    }
}

Reading Document Metadata

Access PDF document properties and metadata.

csharp

using System;
using UglyToad.PdfPig;

using var document = PdfDocument.Open("document.pdf");
var info = document.Information;

Console.WriteLine($"Title: {info.Title}");
Console.WriteLine($"Author: {info.Author}");
Console.WriteLine($"Subject: {info.Subject}");
Console.WriteLine($"Creator: {info.Creator}");
Console.WriteLine($"Producer: {info.Producer}");
// CreationDate and ModifiedDate are the raw date strings stored in the PDF
Console.WriteLine($"Created: {info.CreationDate}");
Console.WriteLine($"Modified: {info.ModifiedDate}");
Console.WriteLine($"Pages: {document.NumberOfPages}");
Console.WriteLine($"PDF Version: {document.Version}");

Processing PDFs as a Service

Wrap PdfPig in a DI-friendly service for document processing pipelines.

csharp

using System.Collections.Generic;
using System.IO;
using System.Linq;
using UglyToad.PdfPig;

public interface IPdfExtractor
{
    PdfContent Extract(Stream pdfStream);
    PdfContent Extract(string filePath);
}

public record PdfContent(
    string FullText,
    IReadOnlyList<PageContent> Pages,
    DocumentInfo Metadata);

public record PageContent(int Number, string Text, int WordCount);
public record DocumentInfo(string? Title, string? Author, int PageCount);

public class PdfPigExtractor : IPdfExtractor
{
    public PdfContent Extract(string filePath)
    {
        using var stream = File.OpenRead(filePath);
        return Extract(stream);
    }

    public PdfContent Extract(Stream pdfStream)
    {
        using var document = PdfDocument.Open(pdfStream);

        var pages = document.GetPages()
            .Select(p => new PageContent(
                p.Number,
                p.Text,
                p.GetWords().Count()))
            .ToList();

        var fullText = string.Join("\n\n", pages.Select(p => p.Text));

        var metadata = new DocumentInfo(
            document.Information.Title,
            document.Information.Author,
            document.NumberOfPages);

        return new PdfContent(fullText, pages, metadata);
    }
}

// DI registration
// builder.Services.AddTransient<IPdfExtractor, PdfPigExtractor>();

PdfPig vs Other PDF Libraries

Feature	PdfPig	iTextSharp	PdfSharpCore	QuestPDF
Read text	Yes	Yes	Limited	No
Read images	Yes	Yes	No	No
Create PDFs	Basic (PdfDocumentBuilder)	Yes	Yes	Yes
Edit PDFs	No	Yes	Yes	No
Managed only	Yes	Yes	Yes	No (native Skia dependency)
License	Apache 2.0	AGPL/Commercial	MIT	Community MIT/Commercial

Best Practices

• Always wrap PdfDocument in a using statement because it holds file handles and internal buffers that must be released.
• Use GetWords() instead of page.Text when you need structured text with positions, font information, or region-based extraction.
• Handle malformed PDFs gracefully by wrapping PdfDocument.Open in a try/catch, since real-world PDFs often violate the specification.
• Process large PDFs page-by-page rather than loading all text at once: iterate with document.GetPages() and process each page independently.
• Use bounding box coordinates for region-based extraction when parsing structured documents (invoices, forms) where text position determines meaning.
• Normalize extracted text by trimming whitespace, collapsing multiple spaces, and handling line breaks, since PDF text extraction often produces inconsistent spacing.
• Group words into lines by Y-coordinate using rounding or binning to reconstruct readable text from position-based word extraction.
• Open PDFs from streams (PdfDocument.Open(stream)) in web applications rather than writing to temporary files, to reduce disk I/O and cleanup overhead.
• Check TryGetPng before saving extracted images because not all PDF image encodings can be converted to PNG; fall back to raw bytes for unsupported formats.
• Validate PDF file headers before processing by checking the first bytes for %PDF- to reject non-PDF files early and avoid cryptic parsing errors.
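
The header check from the last practice can be sketched with the base class library alone. PdfHeaderCheck is a hypothetical helper, not a PdfPig API. It assumes a seekable stream and a signature at byte 0, so it would reject files that carry extra bytes before the %PDF- marker.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of PdfPig.
public static class PdfHeaderCheck
{
    private static readonly byte[] Magic = Encoding.ASCII.GetBytes("%PDF-");

    // Returns true when the stream starts with "%PDF-".
    // The stream must be seekable; its position is restored afterwards.
    public static bool LooksLikePdf(Stream stream)
    {
        if (stream is null) throw new ArgumentNullException(nameof(stream));
        if (!stream.CanSeek) throw new ArgumentException("Stream must be seekable.", nameof(stream));

        long original = stream.Position;
        try
        {
            stream.Position = 0;
            var header = new byte[Magic.Length];
            int read = 0;

            // Read may return fewer bytes than requested, so loop until full or EOF.
            while (read < header.Length)
            {
                int n = stream.Read(header, read, header.Length - read);
                if (n == 0) return false; // stream is shorter than the signature
                read += n;
            }

            for (int i = 0; i < Magic.Length; i++)
            {
                if (header[i] != Magic[i]) return false;
            }

            return true;
        }
        finally
        {
            stream.Position = original;
        }
    }
}
```

A caller could run PdfHeaderCheck.LooksLikePdf(stream) first and only then pass the same stream to PdfDocument.Open(stream) inside a try/catch, since a valid header does not guarantee the rest of the file parses.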