How to extract structured data from messy text

Which tool to choose

For occasional extractions, any AI assistant in a normal conversation will do: you paste the text and ask for the fields. If you extract from the same type of document many times (a hundred invoices, hundreds of emails), it's worth fixing a template prompt whose fields and rules never change, so every extraction follows the same schema. If the documents are images or photos (receipts, tickets), use a tool that can read images: you show it the file and it extracts the text before structuring it. Start in a conversation with one example; standardize into a template once the documents become numerous.

How to do it

List the fields you want, precisely. Not "the important data," but "date, supplier, total amount, invoice number." The AI extracts well only what you name explicitly.
Impose a rule against inventions. This is the step that saves your data. If a field isn't in the text, the AI must say so instead of filling it in with a guess.
Give the rules for ambiguous cases. A working prompt looks like this:

Extract from the text below these fields: date, sender, amount, due date.
Return them as a table with one row per document.
Rules: dates in day/month/year format; amounts as a number with no currency symbol;
if a field is not present in the text write "not present" and don't infer it.
If multiple dates appear, use the one furthest in the future as the "due date".
[paste the text]

Add an example if the format is unusual. Showing one row already filled in the way you want it ("example of correct output: ...") guides the AI better than a thousand explanations.
Verify the critical fields. Amounts, dates, codes: check a sample against the original text before trusting the whole batch.
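
The steps above can be sketched in Python for anyone who runs the same extraction often. This is only an illustration under stated assumptions: the function names are hypothetical, amounts are assumed to use a decimal point, and the format check covers only the rules in the prompt above. It catches a badly formatted value, not a wrong one, so it does not replace checking a sample against the original text.

```python
import re
from datetime import datetime

NOT_PRESENT = "not present"

def build_prompt(fields, rules, text):
    """Assemble the same extraction prompt for every document (hypothetical helper)."""
    lines = [
        "Extract from the text below these fields: " + ", ".join(fields) + ".",
        "Return them as a table with one row per document.",
        "Rules: " + "; ".join(rules) + ";",
        'if a field is not present in the text write "' + NOT_PRESENT + '" and do not infer it.',
        "",
        text,
    ]
    return "\n".join(lines)

def is_valid_date(value):
    """True for 'not present' or a real calendar date in day/month/year format."""
    if value == NOT_PRESENT:
        return True
    try:
        datetime.strptime(value, "%d/%m/%Y")
        return True
    except ValueError:
        return False

def is_valid_amount(value):
    """True for 'not present' or a plain number with no currency symbol.

    Assumption: a decimal point, not a comma, separates the cents.
    """
    return value == NOT_PRESENT or re.fullmatch(r"\d+(\.\d+)?", value) is not None

def format_problems(row):
    """Return the names of the fields in one extracted row that break the format rules."""
    problems = []
    for name in ("date", "due date"):
        if not is_valid_date(row[name]):
            problems.append(name)
    if not is_valid_amount(row["amount"]):
        problems.append("amount")
    return problems
```

A row such as {"date": "03/02/2024", "sender": "Acme", "amount": "€120", "due date": "not present"} would be reported with "amount" as its only problem, because the currency symbol breaks the rule.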

A concrete example

You receive twenty job-application emails, each written its own way: some put the phone number in the signature, others in the body; some mention their years of experience at the top, others bury it in a later sentence. By hand, that means opening and copying twenty times. You paste all the emails to the AI with the prompt: "For each candidate extract: name, email, phone, years of experience, desired role. A table with one row per candidate. If a piece of data isn't there write 'not present', don't invent it." In a few seconds you have an orderly twenty-row table. You spot the rows with "not present" in the phone field immediately, so you know whom to ask. The chaos of twenty different formats has become a grid you can finally sort and filter.
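
If you ask for the table as CSV, filtering it takes a few lines of Python. The sketch below assumes the AI returned CSV with exactly the column names from the prompt; the two candidates are invented for illustration and the function name is hypothetical.

```python
import csv
import io

NOT_PRESENT = "not present"

def rows_missing(csv_text, field):
    """Return the rows whose given field was marked 'not present'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[field].strip().lower() == NOT_PRESENT]

# Hypothetical AI output, invented for illustration only.
sample = """name,email,phone,years of experience,desired role
Ada Rossi,ada@example.com,not present,5,analyst
Leo Bianchi,leo@example.com,555-0100,not present,designer
"""

# Names of the candidates you need to ask for a phone number.
to_contact = [row["name"] for row in rows_missing(sample, "phone")]
print(to_contact)  # ['Ada Rossi']
```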

When it does NOT work (and how to fix it)

If the AI invents the missing data

This is the number-one risk: faced with an absent field, it tends to fill it with something plausible. Fix: always put in the explicit rule "if it's missing, write 'not present' and don't infer," and then check that it respected it. A plausible but invented amount is worse than an empty field.
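
Part of that check can be automated. A minimal sketch, assuming the fields you test were copied verbatim from the document (names, emails, phone numbers, invoice codes): it flags any value that does not appear anywhere in the source text. Fields the prompt asks the AI to reformat, such as dates and amounts, will not match literally and still need a check by eye. The function names are hypothetical.

```python
NOT_PRESENT = "not present"

def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return " ".join(text.lower().split())

def ungrounded_fields(row, source_text, verbatim_fields):
    """Return the fields whose extracted value cannot be found in the source text."""
    haystack = normalize(source_text)
    suspects = []
    for name in verbatim_fields:
        value = row[name]
        if value == NOT_PRESENT:
            continue  # an honest gap, nothing to check
        if normalize(value) not in haystack:
            suspects.append(name)  # the value appears nowhere in the document
    return suspects
```

For a document reading "Invoice no. INV-204 from Acme Srl." and an extracted row that also contains a phone number, the phone field would be flagged as a suspect, since no phone number appears in the text.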

If it gets the ambiguous cases wrong (multiple dates, multiple amounts)

Without instructions, the AI has no way to know which date or amount you want, so it picks one arbitrarily. Fix: give the disambiguation rule in the prompt ("if there are multiple amounts, use the final total," "if there are multiple dates, use the due date"). You know the document, the AI doesn't.

If on long batches it skips rows or stops halfway

When given many documents at once, the AI can truncate its output. Fix: work in smaller blocks (ten documents at a time) and ask at the end "how many rows did you extract?" to check that the number matches the inputs.
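
If you script the process, the same fix looks roughly like this. It is a sketch, not a finished tool: extract_block is a hypothetical function you supply, standing for whatever sends one block to the AI and returns one row per document.

```python
def in_blocks(documents, size=10):
    """Yield consecutive blocks of at most `size` documents."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]

def extract_all(documents, extract_block, size=10):
    """Run the extraction block by block and stop if a block's row count is off.

    `extract_block` is a placeholder for your own call to the AI; it must
    return a list with one row per document in the block.
    """
    rows = []
    for number, block in enumerate(in_blocks(documents, size), start=1):
        block_rows = extract_block(block)
        if len(block_rows) != len(block):
            raise ValueError(
                f"block {number}: sent {len(block)} documents, got {len(block_rows)} rows"
            )
        rows.extend(block_rows)
    return rows
```

Counting rows per block, rather than once at the end, tells you which ten documents to resend instead of forcing you to redo the whole batch.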

A tip from someone who actually uses it

The rule "don't invent, write 'not present'" isn't a detail: it's the difference between reliable data and poisoned data. An empty field you see and handle; a field filled with fantasy enters your calculations and decisions without your noticing. Put it in every extraction prompt and always verify the AI actually respected it, because sometimes it acknowledges the rule and then invents anyway. Wrong data costs more than the time you thought you were saving.

Frequently asked questions

Can I extract data from a photo or a PDF?

Yes, with a tool that reads images and documents: you show it the file and it first derives the text, then structures it. The quality depends on legibility: a sharp photo gives good results, a blurry or crooked one generates errors. For documents that matter, verify the extracted fields against the original.

In what format is it best to get the extracted data?

A table if the data is going into a spreadsheet; CSV if you import it into a program; a format like JSON (a standard way to organize data as name-value pairs) if you pass it to another application. Choose based on where the data will go, and ask the AI directly for the destination's format.
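
To make the difference concrete, here is the same pair of extracted rows written out as CSV and as JSON with Python's standard library. The rows are invented for illustration and the two function names are hypothetical.

```python
import csv
import io
import json

# Hypothetical extracted rows, invented for illustration only.
rows = [
    {"date": "03/02/2024", "supplier": "Acme Srl",
     "total amount": "120.50", "invoice number": "INV-204"},
    {"date": "not present", "supplier": "Beta Spa",
     "total amount": "89.00", "invoice number": "not present"},
]

def to_csv(rows):
    """Serialize the rows as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows):
    """Serialize the rows as a JSON array of name-value objects."""
    return json.dumps(rows, indent=2)

print(to_csv(rows))
print(to_json(rows))
```

CSV keeps one line per document and opens directly in a spreadsheet; JSON repeats the field names in every record, which is what most applications expect when they receive data from another program.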

Is automatic extraction reliable enough for important data?

Reliable enough to reduce the work, yes; reliable enough to trust blindly, no. On data that matters (amounts, due dates, tax codes), the AI gets it wrong often enough to make a check mandatory. The serious way to use it is as a first pass that takes you from chaos to an orderly grid, followed by a human check on the critical fields. You skip the check only where an error does no harm.

Quick answer