Definition

Data Extraction

Also called: document extraction, table extraction.

Data extraction is the process of pulling structured data (rows, fields and tables) out of documents, files or systems so it can be analysed, stored or reported on. It is usually the first step in a data workflow: before you can clean, join or chart anything, you have to get the data out of wherever it is trapped.

Where it shows up

  • Pulling line items out of invoices and receipts.
  • Turning a PDF bank statement into transaction rows.
  • Reading tables from financial reports and exports.

The hard part is preserving structure: a good extractor keeps a description in the description column and an amount as a number, rather than returning a jumble of text. For documents specifically, see OCR, which handles the scanned-image case.
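As a rough illustration of what preserving structure means, the sketch below turns a few lines of already-extracted invoice text into typed records. The line layout (description, then quantity, then amount), the `LineItem` type and the `parse_line_item` helper are all hypothetical and chosen for this example; real documents vary widely and usually need layout-aware tooling rather than a single pattern.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class LineItem(NamedTuple):
    description: str
    quantity: int
    amount: Decimal  # kept as a number, not a string


# Assumed (hypothetical) layout: "<description>  <quantity>  <amount>"
LINE_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+(?P<amount>[\d,]+\.\d{2})$"
)


def parse_line_item(line: str) -> Optional[LineItem]:
    """Return a typed LineItem, or None if the line is not a line item."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return LineItem(
        description=match["description"],
        quantity=int(match["quantity"]),
        # Drop thousands separators before converting to a number.
        amount=Decimal(match["amount"].replace(",", "")),
    )


raw_text = """\
Widget, large      2    1,250.00
Consulting hours  10      900.00
Thank you for your business
"""

# Keep only the lines that parse as line items; other text is skipped.
items = []
for line in raw_text.splitlines():
    item = parse_line_item(line)
    if item is not None:
        items.append(item)

# Because amounts are numbers, they can be summed directly.
total = sum((item.amount for item in items), Decimal("0"))
```

Here the description stays in its own field and the amount becomes a `Decimal`, so downstream steps can sum or filter it without re-parsing text. The footer line that does not fit the assumed layout is skipped rather than forced into a row.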

Frequently asked questions

What is the difference between data extraction and OCR?
OCR converts an image of text into characters. Data extraction goes further: it reconstructs the structure (rows, columns and fields) so the result is usable data, not just a wall of text. OCR is often a step inside extraction for scanned documents.

Is data extraction the same as scraping?
They overlap. “Scraping” usually refers to pulling data from web pages; “extraction” is the broader term and most often refers to documents and files like PDFs, spreadsheets and statements.