PDF Extraction

How to Extract Text or Images from a PDF File

📅 April 4, 2026 ⏱ 9 min read ✍️ BuildPDF Team

PDF files are great for sharing finished documents, but what happens when you need to get the content out of a PDF? Maybe you want to copy text from a digitally created report, or pull out individual page images from a multi-page PDF. BuildPDF supports both modes of PDF extraction, entirely in your browser.

Two Ways to Extract PDF Content

Text Mode

Extract as Plain Text (.txt)

BuildPDF uses PDF.js to read your PDF's text layer and exports all readable text as a clean .txt file. Best for PDFs with selectable text (not scans).

Image Mode

Extract as Images (.zip of JPGs)

Each page of your PDF is rendered as a JPG image at screen resolution and packaged into a .zip file for download. Works with any PDF, including scanned documents.

Step-by-Step: Extract Content from a PDF

Go to BuildPDF

Open buildpdf.co in your browser.

Upload your PDF

Drag and drop your .pdf file onto the converter, or click "Choose Files." BuildPDF automatically detects that it's a PDF and switches to extraction mode.

Choose your output format

In the options panel, select either "Plain Text (.txt)" to extract the text layer, or "ZIP of Images (JPG)" to render each page as an image.

Click "Extract PDF" and download

The extraction runs in your browser. Download the .txt file or .zip archive when complete.

Which Mode Should I Use?

Use Text Extraction if:

Your PDF was created digitally (e.g., exported from Word, Google Docs, or a web browser "Print to PDF")
You can already select and highlight text when you open the PDF in a viewer
You need to copy, edit, or search the text content
You want a small output file

Use Image Extraction if:

Your PDF is a scanned document (photos of physical pages)
You cannot select text in the PDF — it's "flat" content
You want to post individual pages as images on a website or social media
You need to view or share each page separately
The PDF contains charts, diagrams, or visual content you want as image files

⚠️ Scanned PDFs and OCR: BuildPDF can render the pages of a scanned PDF as images, but it cannot automatically perform OCR (Optical Character Recognition) to extract text from scanned pages. For OCR, consider Adobe Acrobat or Google Drive (which offers free built-in OCR when you upload a PDF and open it with Google Docs).

What Gets Extracted?

Text extraction

BuildPDF extracts all text content from the PDF's text layer, page by page, separated by page markers. Mathematical symbols, special characters, and most Unicode text are supported. Table structure may not be perfectly preserved — the output is a sequential plain-text approximation.
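As a rough illustration of the page-by-page output described above, the following TypeScript sketch joins per-page text into one string with a marker line before each page. The marker format (`--- Page N ---`) and the function names are assumptions made for this example; BuildPDF's actual page markers may look different.

```typescript
// Hypothetical marker format; BuildPDF's real page markers may differ.
function pageMarker(pageNumber: number): string {
  return "--- Page " + pageNumber + " ---";
}

// Join per-page text into a single plain-text document,
// placing a marker line before each page's content.
function joinPagesWithMarkers(pageTexts: string[]): string {
  return pageTexts
    .map(function (text, index) {
      return pageMarker(index + 1) + "\n" + text.trim();
    })
    .join("\n\n");
}

const combinedExport = joinPagesWithMarkers(["First page text.", "Second page text."]);
console.log(combinedExport);
```

Because the result is a flat string, anything that depended on layout (such as table columns) is reduced to sequential lines, which is why table structure is only approximated.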

Image extraction

Each page is rendered at screen resolution to a JPG image. For a 10-page PDF, you'll receive a ZIP file containing 10 images named page-1.jpg, page-2.jpg, etc. Image quality is high by default.

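💡 Tip: Page image quality depends on the source. Text and vector graphics have no fixed resolution, but embedded raster content such as scans and photos does. For best results, make sure any scans or images in the original PDF were captured at 150 DPI or higher. PDFs built from screen-resolution images will produce lower-quality page images.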

What PDF.js Can and Cannot Extract

BuildPDF uses PDF.js — Mozilla's open-source, battle-tested PDF rendering library — to process your files entirely in the browser. Understanding its capabilities helps you know when it will work brilliantly and when you'll need a different tool.

PDF.js can extract: all text stored in a PDF's text layer (characters, words, paragraphs, including Unicode and most special symbols), the visual appearance of each page rendered as a raster image, and document metadata like title and author if present in the file.

PDF.js cannot extract: text from scanned PDFs (image-only content with no text layer), vector graphics as editable vectors (they are rasterised during page rendering), embedded fonts as independent font files, form field data as a structured dataset, or digital signatures in a form that can still be verified.

The most important limitation for most users is the scanned document scenario. A PDF created by scanning physical pages is, underneath its .pdf wrapper, just a series of page images with no text layer at all. PDF.js will render these pages as images just fine, but text extraction will return nothing, because there is literally no text to extract.

⚠️ How to tell if your PDF has a text layer: Open it in any PDF viewer and try to click and drag to select a word. If you can highlight individual words, there is a text layer and BuildPDF's text extraction will work. If nothing highlights, or the entire page selects as a single object, it's a scanned (image-only) PDF.
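The same check can be approximated in code. This is a minimal sketch, assuming you already have each page's text as a string (for example, by joining the `str` values of the items that PDF.js's `page.getTextContent()` returns). The function names and the default threshold of one visible character are illustrative choices, not part of BuildPDF.

```typescript
// Count characters that are not whitespace.
function countVisibleChars(text: string): number {
  return text.replace(/\s/g, "").length;
}

// Heuristic: treat the PDF as having a text layer if its pages together
// yield at least minChars non-whitespace characters. Scanned, image-only
// PDFs typically yield empty strings for every page.
function looksLikeTextPdf(pageTexts: string[], minChars: number = 1): boolean {
  let total = 0;
  for (const pageText of pageTexts) {
    total += countVisibleChars(pageText);
  }
  return total >= minChars;
}

console.log(looksLikeTextPdf(["", "  \n"]));         // false: nothing extractable
console.log(looksLikeTextPdf(["Quarterly report"])); // true: text layer present
```

A scanned PDF that has already been through OCR carries a text layer, so it would pass this check as well.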

OCR: The Gap BuildPDF Doesn't Fill (and What Does)

Optical Character Recognition (OCR) reads text out of images, including scanned PDFs. It's a computationally intensive AI task requiring specialised models. BuildPDF, by contrast, only extracts text that already exists as machine-readable content inside the PDF. For scanned PDFs that need OCR, here are genuinely useful alternatives:

Google Drive (free): Upload the PDF to Google Drive, right-click it, and choose "Open with → Google Docs." Google runs OCR automatically and opens a Docs file with the extracted text. Quality is excellent for clean scans in major languages.
Adobe Acrobat (paid): Acrobat's "Recognize Text" feature is the industry gold standard. It also adds the recognised text back as a searchable layer in the PDF.
Tesseract (free, open source): A command-line OCR engine that runs locally. No cloud, no subscription, supports 100+ languages. Requires some technical comfort but is very capable for batch processing.
Microsoft OneNote (free): Insert the scanned PDF page as an image into OneNote, right-click, and choose "Copy Text from Picture." Works surprisingly well for clean document scans.

Legal and Ethical Considerations When Extracting PDF Content

Extracting text or images from a PDF you own — your own documents, your scanned receipts, your own reports — is entirely straightforward. However, some PDFs belong to others and may be protected by copyright or access restrictions.

Many PDFs have password protection or DRM applied. BuildPDF will not bypass password protection — if a PDF requires a password to open, it cannot be processed without that password. If you legitimately have the password (e.g., it's your own file), enter it first in a PDF reader to unlock it, then re-save the unlocked version for extraction.

Beyond technical locks, copying significant portions of copyrighted material (a textbook chapter, a commercial report, a published article) for redistribution may infringe copyright, regardless of the technical ease of extraction. Always ensure you have the right to use extracted content in the way you intend.

Real-World Use Cases for PDF Extraction

Reusing content from old reports: You have a PDF report from five years ago and need to update its figures in a new document. Text extraction saves you from retyping hundreds of words of analysis.
Creating social media images from slide decks: Export each page of a PDF presentation as a JPG image, then use those images directly on LinkedIn, Twitter, or Instagram without needing the original slide software.
Archiving receipts as images: Accounting teams sometimes need individual receipt pages as image files for expense management systems that don't accept PDFs. Extract each page as a JPG.
Extracting text for AI or analysis: Feed a PDF's text content into a summarisation tool or spreadsheet for analysis. Text extraction gives you the raw content in a paste-able format.
Converting a PDF manual to a searchable text file: A product manual in PDF might not be full-text searchable if your system doesn't index PDFs. Extract the text, save it, and your standard file search will find any keyword in it.

Batch Extraction Tips

BuildPDF processes one PDF at a time, but there are practical ways to handle batch extraction efficiently. For image extraction, each run produces a ZIP containing all page JPGs. If you need images from several PDFs, run the tool once per file — each ZIP will be clearly named and self-contained.

For text extraction from multiple PDFs, extract each PDF's text separately, saving each .txt file with a descriptive name. You can then combine them in a text editor, or import them into a spreadsheet or document for further processing. This manual but fast approach beats any server-based batch tool for privacy, since every file stays local.

💡 Name your extractions clearly: When BuildPDF downloads an extracted file, it typically uses the original PDF's filename. If you're processing multiple PDFs, rename each output immediately after download so you don't mix up document.txt from three different files.
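If you keep a list of the PDFs you are processing, the renaming step can be planned up front. The sketch below is one possible approach, not something BuildPDF does for you: it derives an output name from each PDF's filename and appends a numeric suffix when two inputs would collide. Both function names are hypothetical.

```typescript
// Derive an output filename from a PDF filename, e.g. "report.pdf" -> "report.txt".
function outputNameFor(pdfName: string, extension: string): string {
  const base = pdfName.replace(/\.pdf$/i, "");
  return base + "." + extension;
}

// Assign an output name to every input, appending "-2", "-3", ...
// when two PDFs would otherwise produce the same name.
// Comparison is case-insensitive, since many file systems are.
function uniqueOutputNames(pdfNames: string[], extension: string): string[] {
  const seen: { [key: string]: number } = {};
  return pdfNames.map(function (pdfName) {
    const candidate = outputNameFor(pdfName, extension);
    const key = candidate.toLowerCase();
    const count = (seen[key] || 0) + 1;
    seen[key] = count;
    if (count === 1) {
      return candidate;
    }
    const dot = candidate.lastIndexOf(".");
    return candidate.slice(0, dot) + "-" + count + candidate.slice(dot);
  });
}

console.log(uniqueOutputNames(["document.pdf", "document.pdf", "invoice.PDF"], "txt"));
// [ 'document.txt', 'document-2.txt', 'invoice.txt' ]
```

This simple version does not guard against an input that is already named like a suffixed output (for example, a file literally called document-2.pdf), so treat it as a starting point.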

Common Questions

Can I extract just a specific page, not the whole PDF?

BuildPDF's extraction currently processes the entire PDF. For image extraction, you receive a ZIP with every page — simply open the ZIP and discard the pages you don't need. For text extraction, the output .txt file includes all pages separated by page markers, so you can easily copy just the section you need.
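If you would rather not copy the section by hand, a few lines of code can pull one page out of the exported .txt file. This sketch assumes each page begins with a marker line of the form `--- Page N ---`, which is a hypothetical format; check your own output and adjust the pattern to match the markers BuildPDF actually writes.

```typescript
// Return the text of one page from a combined .txt export, or null if
// that page is not present. Assumes each page starts with a marker line
// of the form "--- Page N ---" (a hypothetical format).
function extractPageSection(fullText: string, pageNumber: number): string | null {
  const markerPattern = /^--- Page (\d+) ---$/;
  const collected: string[] = [];
  let insideTarget = false;
  let found = false;

  for (const line of fullText.split(/\r?\n/)) {
    const match = markerPattern.exec(line.trim());
    if (match) {
      // A marker line either starts the target page or ends it.
      insideTarget = parseInt(match[1] || "", 10) === pageNumber;
      found = found || insideTarget;
    } else if (insideTarget) {
      collected.push(line);
    }
  }

  return found ? collected.join("\n").trim() : null;
}

const sampleExport =
  "--- Page 1 ---\nAlpha\n\n--- Page 2 ---\nBeta\nGamma\n\n--- Page 3 ---\nDelta";
console.log(extractPageSection(sampleExport, 2)); // prints "Beta" and "Gamma" on two lines
```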

Why is the extracted text in the wrong order?

PDF text ordering is notoriously difficult. Unlike HTML or Word documents, a PDF stores text as positioned elements on a page, so reading order must be inferred. PDF.js does a good job for most Western-language documents with a simple single-column layout. In multi-column layouts, complex tables, and right-to-left languages, text may be extracted in visual position order rather than reading order. Manual cleanup is often necessary for complex layouts.
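To see why position alone is not enough, consider this small sketch. It is not how PDF.js orders text; it is a deliberately naive top-to-bottom, left-to-right sort over a simplified fragment shape (`str`, `x`, `y`) invented for this example. In PDF.js itself, text items expose `str` plus a `transform` array whose last two entries hold the x and y position.

```typescript
// Simplified positioned text fragment (hypothetical shape for this sketch).
// PDF coordinates start at the bottom-left, so a larger y is higher on the page.
interface PositionedText {
  str: string;
  x: number;
  y: number;
}

// Naive ordering: top-to-bottom by line, then left-to-right within a line.
// Fragments whose y values differ by at most lineTolerance count as one line.
// (A tolerance-based comparator is a simplification that suits small examples.)
function naiveReadingOrder(items: PositionedText[], lineTolerance: number = 2): string[] {
  const sorted = items.slice().sort(function (a, b) {
    if (Math.abs(a.y - b.y) > lineTolerance) {
      return b.y - a.y; // higher on the page comes first
    }
    return a.x - b.x; // same line: leftmost comes first
  });
  return sorted.map(function (item) {
    return item.str;
  });
}

// Two columns: the left column reads "A1 A2", the right column reads "B1 B2".
const twoColumnPage: PositionedText[] = [
  { str: "A1", x: 50, y: 700 },
  { str: "A2", x: 50, y: 680 },
  { str: "B1", x: 320, y: 700 },
  { str: "B2", x: 320, y: 680 },
];

// Position order interleaves the columns instead of reading each one in turn.
console.log(naiveReadingOrder(twoColumnPage).join(" ")); // "A1 B1 A2 B2"
```

A human reader would expect "A1 A2 B1 B2". Recovering that order requires detecting the column boundary, which is the kind of inference that breaks down on complex layouts.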

The ZIP file has images at a lower resolution than I'd like — how do I improve it?

The page images are rendered at a resolution determined by the PDF's internal page dimensions and your browser's rendering engine. For most PDFs, the output JPGs are screen-resolution images — excellent for web use but not ideal for print. For higher-resolution images from a PDF, the most reliable approach is Adobe Acrobat's "Export to Image" feature, which lets you specify DPI output directly.
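The relationship between page size, DPI, and pixel dimensions is simple arithmetic. PDF page dimensions are measured in points, and one point is 1/72 inch by default, so the pixel size at a given DPI is the size in points divided by 72 and multiplied by the DPI. The sketch below works this out for a US Letter page; the function names are illustrative. In PDF.js, the same ratio corresponds to the `scale` value passed to `page.getViewport({ scale })`, though whether BuildPDF exposes such a setting is not something this article states.

```typescript
const POINTS_PER_INCH = 72; // PDF user-space units are 1/72 inch by default

// Scale factor that maps PDF points to pixels at a target DPI.
function scaleForDpi(dpi: number): number {
  return dpi / POINTS_PER_INCH;
}

// Pixel dimensions of a page rendered at a target DPI.
function pixelSizeAtDpi(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points (8.5 x 11 inches).
console.log(pixelSizeAtDpi(612, 792, 72));  // { width: 612, height: 792 }
console.log(pixelSizeAtDpi(612, 792, 150)); // { width: 1275, height: 1650 }
console.log(pixelSizeAtDpi(612, 792, 300)); // { width: 2550, height: 3300 }
```

A 72 DPI render of a Letter page is only 612 pixels wide, which is fine on screen but coarse in print. At 300 DPI the same page needs more than four times the width and height.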

Common Issues & How to Fix Them

Text extraction produces only a blank .txt file

This is the scanned PDF scenario: the file has no text layer, only image content, so text extraction has nothing to work with. Solution: use Google Drive's free OCR (described above) to extract the text, or use BuildPDF's image extraction mode to get the page images and work with those instead.

The page images in the ZIP look washed out or lower quality than the PDF

JPG compression can affect colour and sharpness. Make sure the quality setting in the options panel is High (95%) before extracting. Note that PDFs whose images were prepared for screen display at 72 DPI will naturally produce lower-quality JPGs than print-quality PDFs prepared at 300 DPI.

The PDF fails to load or extract at all

A handful of PDFs use features that PDF.js doesn't support: advanced encryption, non-standard PDF extensions, or corrupted file structures. Try opening the PDF in Adobe Reader first — if it doesn't open there, the file may be corrupted. If it opens in Adobe Reader but not in BuildPDF, the file may use DRM or encryption that PDF.js cannot process. In that case, use Adobe Acrobat's own export tools.

Privacy & Security

Your PDF file is processed entirely within your browser using PDF.js, Mozilla's open-source PDF rendering engine. The file is never uploaded to any server. This makes BuildPDF safe for extracting sensitive content from confidential PDFs like contracts, financial statements, or medical records.

Extract content from your PDF now

Free, private, instant. No uploads, no sign-up.
