Definition
Data Extraction
Also called: document extraction, table extraction.
Data extraction is the process of pulling structured data (rows, fields and tables) out of documents, files or systems so it can be analysed, stored or reported on. It is usually the first step in a data workflow: before you can clean, join or chart anything, you have to get the data out of wherever it is trapped.
Where it shows up
- Pulling line items out of invoices and receipts.
- Turning a PDF bank statement into transaction rows.
- Reading tables from financial reports and exports.
The hard part is preserving structure: a good extractor keeps a description in the description column and an amount as a number, rather than returning a jumble of text. For scanned documents, see OCR, which turns the page image into characters before the structure can be rebuilt.
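As a rough illustration of preserving structure, the sketch below turns the text of a bank statement into typed transaction rows. It is a minimal example under stated assumptions, not a description of any particular extractor: the line layout (ISO date, description, amount with two decimals) and the names `Transaction`, `parse_line` and `extract_rows` are all hypothetical.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import NamedTuple, Optional


class Transaction(NamedTuple):
    posted: date
    description: str
    amount: Decimal  # kept as a number, not as text


# Assumed layout: "YYYY-MM-DD  free-text description  1,234.56"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def parse_line(line: str) -> Optional[Transaction]:
    """Return a typed row, or None if the line is not a transaction."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Transaction(
        posted=datetime.strptime(m["date"], "%Y-%m-%d").date(),
        description=m["description"],
        # Drop thousands separators before converting to a number.
        amount=Decimal(m["amount"].replace(",", "")),
    )


def extract_rows(text: str) -> list[Transaction]:
    """Keep only the lines that parse as transactions (skips headers, footers)."""
    rows = []
    for line in text.splitlines():
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows


# Made-up sample text, standing in for the text layer of a PDF statement.
SAMPLE = """\
Statement of account
2024-03-01  Coffee Shop 12      -4.50
2024-03-02  Salary March     2,150.00
Page 1 of 1
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The design choice worth noting is that each field is converted to its own type (`date`, `str`, `Decimal`) at extraction time, so later steps never have to guess whether "2,150.00" is a number.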
Frequently asked questions
- What is the difference between data extraction and OCR?
  OCR converts an image of text into characters. Data extraction goes further: it reconstructs the structure (rows, columns and fields) so the result is usable data, not just a wall of text. OCR is often a step inside extraction for scanned documents.
- Is data extraction the same as scraping?
  They overlap. “Scraping” usually refers to pulling data from web pages; “extraction” is the broader term and most often refers to documents and files like PDFs, spreadsheets and statements.
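To make the OCR answer above concrete, the sketch below shows one simple way structure can be rebuilt from OCR output: words that sit at roughly the same height on the page are grouped into a row, then ordered left to right. It assumes the OCR step yields each word with a position; the `Word` type, the `group_into_rows` function and the tolerance value are hypothetical, and real extractors use more robust layout analysis.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical position of the word on the page


def group_into_rows(words, y_tolerance=3.0):
    """Group words into rows by vertical position, then sort each row by x."""
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        # Same row if the word is close in height to the row's first word.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Made-up OCR output, deliberately out of reading order.
WORDS = [
    Word("4.50", 300, 101),
    Word("Coffee", 50, 100),
    Word("Rent", 50, 120),
    Word("900.00", 300, 119.5),
    Word("Shop", 95, 100.5),
]

if __name__ == "__main__":
    for row in group_into_rows(WORDS):
        print(row)
```

This only recovers rows of tokens; assigning tokens to named columns and converting amounts to numbers would be further steps.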
Related terms
OCR (Optical Character Recognition)
OCR (optical character recognition) converts images of text — like a scanned page or photo — into machine-readable characters.
ETL (Extract, Transform, Load)
ETL stands for Extract, Transform, Load — the three steps of moving data from a source, cleaning and reshaping it, and writing it to a destination.
Data Pipeline
A data pipeline is an automated sequence that moves data from a source through cleaning and transformation to a destination, usually on a schedule.