indexify.dev - Document Processing for AI Agents

Structured Knowledge

Structured Document Access

The API returns layout-aware structure rather than flat text alone: per-document elements, sections, tables, figures, and relationships derived from the JSON parse. Start with search to obtain a documentId, then use the document-scoped GET routes below. Full schemas are in the API reference; for MCP, see the sections, tables, and figures tools.

Tip: the snippets are meant to be copy/paste friendly. Start with the happy path, then deliberately test missing scopes, missing fields, and bad inputs so you know what your app will see in production.

Per-document structure
- List or fetch elements, sections, tables, figures, and relationships for a single documentId. Typical uses: show every table in a policy, caption↔figure linking, section-scoped tools in a viewer.
Elements
- Each element has elementType, optional title, and grounding (pages, bounding region, section path) so clients can highlight the same region users see.
- From a search hit, follow ids to GET …/elements/{elementId}, chunk lists (GET …/chunks), or GET …/sections using the same documentId.
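
As a rough illustration of the grounding described above, the following sketch turns one element into highlight regions for a viewer. The key names (`grounding`, `pages`, `boundingRegion`, `sectionPath`, and the `x`/`y`/`width`/`height` box keys) are assumptions made for this example; the text above only states that grounding carries pages, a bounding region, and a section path, so check the API reference for the real schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Highlight:
    page: int
    box: tuple[float, float, float, float]  # x, y, width, height (assumed layout)
    breadcrumb: str


def highlights_for(element: dict[str, Any]) -> list[Highlight]:
    """Build one highlight per page the element is grounded on.

    Every key read here is hypothetical; adjust to the documented schema.
    """
    grounding = element.get("grounding") or {}
    region = grounding.get("boundingRegion") or {}
    box = (
        float(region.get("x", 0.0)),
        float(region.get("y", 0.0)),
        float(region.get("width", 0.0)),
        float(region.get("height", 0.0)),
    )
    breadcrumb = " > ".join(grounding.get("sectionPath") or [])
    return [Highlight(page, box, breadcrumb) for page in grounding.get("pages") or []]


# Hand-written sample element, not real API output.
sample = {
    "elementType": "table",
    "title": "Coverage limits",
    "grounding": {
        "pages": [4],
        "boundingRegion": {"x": 72, "y": 140, "width": 468, "height": 210},
        "sectionPath": ["Policy", "Limits"],
    },
}
print(highlights_for(sample))
```

Elements without grounding yield an empty list, so a viewer can skip them without special-casing.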
Element types (public enum)
- Documented values include: title, heading, paragraph, list, table, figure, caption, code_block, section, container, other—confirm the current list in the API reference.
- Headings / sections — table of contents and in-document navigation.
- Table, figure, caption — multimodal and layout-grounded answers.
- List, paragraph — structured summaries; code_block — technical documentation.
- container / other — preserved hierarchy when the parser cannot map to a finer type.
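
One way a client might model the documented values is a small enum, sketched below. The member values are the ones listed above; the fallback to `other` for unrecognized strings is a client-side choice made for this example, not documented API behavior, and the list should be re-checked against the API reference.

```python
from enum import Enum


class ElementType(str, Enum):
    TITLE = "title"
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    LIST = "list"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"
    CODE_BLOCK = "code_block"
    SECTION = "section"
    CONTAINER = "container"
    OTHER = "other"


def parse_element_type(raw: str) -> ElementType:
    """Map a raw string to ElementType; unknown values fall back to OTHER."""
    try:
        return ElementType(raw)
    except ValueError:
        return ElementType.OTHER


# Types a table-of-contents view would care about (illustrative grouping).
NAVIGATION_TYPES = {ElementType.HEADING, ElementType.SECTION}

print(parse_element_type("table").value)
print(parse_element_type("sidebar").value)
```

Falling back instead of raising keeps a client working if the enum gains values later, at the cost of hiding the new type until the client is updated.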
Relationship types
- Each relationship has a type label plus source and target element ids within the document.
- document_contains_element — document root to structural descendants.
- parent_child — hierarchy within a document.
- precedes — reading or layout order between elements.
- caption_of — caption linked to its figure.
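
The relationship types above can be walked with a simple index, as in this sketch. The record keys (`type`, `sourceId`, `targetId`) and the edge directions (caption to figure for `caption_of`, parent to child for `parent_child`) are assumptions for illustration; only the type labels come from the list above.

```python
from collections import defaultdict

# Hypothetical relationship records; only the type labels come from the docs.
relationships = [
    {"type": "parent_child", "sourceId": "sec-1", "targetId": "fig-1"},
    {"type": "parent_child", "sourceId": "sec-1", "targetId": "cap-1"},
    {"type": "precedes", "sourceId": "fig-1", "targetId": "cap-1"},
    {"type": "caption_of", "sourceId": "cap-1", "targetId": "fig-1"},
]


def index_by_type(edges):
    """Group (source, target) pairs by relationship type."""
    index = defaultdict(list)
    for edge in edges:
        index[edge["type"]].append((edge["sourceId"], edge["targetId"]))
    return index


def captions_for(figure_id, edges):
    """Return caption ids, assuming caption_of points caption -> figure."""
    return [src for src, dst in index_by_type(edges)["caption_of"] if dst == figure_id]


def children_of(parent_id, edges):
    """Return child ids, assuming parent_child points parent -> child."""
    return [dst for src, dst in index_by_type(edges)["parent_child"] if src == parent_id]


print(captions_for("fig-1", relationships))
print(children_of("sec-1", relationships))
```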
Authorization scopes
- docs:parsed:read — element, section, table, figure, and relationship routes (GET …/sections, GET …/elements, GET …/relationships, …); confirm each route in the API reference.
- docs:ingest / kb:write — upload, reprocess, and job retry as documented.
Listing elements
- GET …/elements accepts filters such as type, sectionId, page, include—defaults and limits are in that operation’s spec.
- Combine with search: use documentId from a hit, then narrow by element type for grounded UI.
- Lineage without leaving the data plane: GET …/elements/{elementId}/relationships · GET …/elements/{elementId}/related.
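
A minimal sketch of building the list-elements URL from the filters named above. The helper name `elements_url` is hypothetical, and the comma-separated encoding of `type` and `include` is inferred from the cURL example on this page rather than from the operation's spec, so treat it as an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.indexify.dev"


def elements_url(project_id, kb_id, document_id, *, types=None, section_id=None,
                 page=None, include=None, limit=None, cursor=None):
    """Build a GET .../elements URL, omitting any filter that was not supplied."""
    params = {}
    if limit is not None:
        params["limit"] = limit
    if types:
        params["type"] = ",".join(types)
    if section_id is not None:
        params["sectionId"] = section_id
    if page is not None:
        params["page"] = page
    if include:
        params["include"] = ",".join(include)
    if cursor is not None:
        params["cursor"] = cursor
    path = f"/projects/{project_id}/kbs/{kb_id}/documents/{document_id}/elements"
    # safe="," keeps list separators readable instead of encoding them as %2C.
    query = urlencode(params, safe=",")
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


print(elements_url("p", "k", "d", types=["table", "figure"],
                   include=["payload", "grounding"], limit=50))
```

Each key is added at most once, which avoids sending duplicate query keys by accident.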
Integration patterns
- Navigation assistants — List sections or filter elements → fetch chunks for the chosen branch.
- Audit / policy — Use search with elementTypes constrained to heading / section / table, then walk relationships to supporting paragraphs.
- Retrieval + structure — Use sections or elements to choose where to read, then search or chunk fetch for what to pass to a model, keeping citations consistent.
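
The audit / policy pattern above might start from a request body like the one sketched here. `elementTypes` and `groupBy` are named on this page; the `query` key and the helper name `audit_search_body` are assumptions, so confirm the search schema in the API reference before relying on them.

```python
import json


def audit_search_body(query, element_types=("heading", "section", "table"), group_by=None):
    """Assemble a search body constrained to structural element types.

    The "query" key is hypothetical; elementTypes and groupBy appear in the docs.
    """
    body = {"query": query, "elementTypes": list(element_types)}
    if group_by is not None:
        body["groupBy"] = group_by
    return body


body = audit_search_body("data retention period", group_by="section")
print(json.dumps(body, indent=2))
```

From each hit, the relationship routes can then be followed to the supporting paragraphs.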
The cURL examples below cover the parsed document body, chunk paging, element listing, and single-element reads. Pair structured reads with search for end-to-end flows.

Parsed document body (format set by the Accept header; no query string; duplicate keys → 400 invalid_input)

curl -s -X GET "https://api.indexify.dev/projects/550e8400-e29b-41d4-a716-446655440001/kbs/660e8400-e29b-41d4-a716-446655440002/documents/770e8400-e29b-41d4-a716-446655440003/parsed" \
  -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJtYWNoaW5lIn0.dBjftJeZ4CVP-mB92K27uhbUJU1p1b_wW1gFWFOEjXk" \
  -H "Accept: text/markdown"

Chunks page (docs:parsed:read; next page: append cursor=<nextCursor> from prior JSON)

curl -s -X GET "https://api.indexify.dev/projects/550e8400-e29b-41d4-a716-446655440001/kbs/660e8400-e29b-41d4-a716-446655440002/documents/770e8400-e29b-41d4-a716-446655440003/chunks?limit=20&format=markdown" \
  -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJtYWNoaW5lIn0.dBjftJeZ4CVP-mB92K27uhbUJU1p1b_wW1gFWFOEjXk" \
  -H "Accept: application/json"
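
The cursor paging described for the chunks route can be sketched as a generator that keeps requesting pages until no `nextCursor` comes back. The `items` key and the `fetch_page` callable are assumptions for illustration; in real code `fetch_page` would issue the GET above with `cursor=<nextCursor>` appended, and the response shape should be taken from the API reference.

```python
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items across pages, following nextCursor until it is absent."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])  # "items" is an assumed key
        cursor = page.get("nextCursor")
        if not cursor:
            break


def fake_fetch(cursor):
    """Stand-in for the HTTP call so the sketch runs offline."""
    pages = {
        None: {"items": [{"id": "c1"}, {"id": "c2"}], "nextCursor": "p2"},
        "p2": {"items": [{"id": "c3"}]},
    }
    return pages[cursor]


print([chunk["id"] for chunk in iter_pages(fake_fetch)])
```

Injecting the fetch function keeps the paging logic testable without network access.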

List document elements (docs:parsed:read; omit include= for full payload+grounding; cursor = prior nextCursor UUID)

curl -s -X GET "https://api.indexify.dev/projects/550e8400-e29b-41d4-a716-446655440001/kbs/660e8400-e29b-41d4-a716-446655440002/documents/770e8400-e29b-41d4-a716-446655440003/elements?limit=50&type=table,figure&include=payload,grounding" \
  -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJtYWNoaW5lIn0.dBjftJeZ4CVP-mB92K27uhbUJU1p1b_wW1gFWFOEjXk"

Get one element by id (docs:parsed:read; include=… is the only query parameter; add relationships to include for embedded edges, subject to a cap)

curl -s -X GET "https://api.indexify.dev/projects/550e8400-e29b-41d4-a716-446655440001/kbs/660e8400-e29b-41d4-a716-446655440002/documents/770e8400-e29b-41d4-a716-446655440003/elements/aa0e8400-e29b-41d4-a716-446655440006?include=payload,grounding,relationships" \
  -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJtYWNoaW5lIn0.dBjftJeZ4CVP-mB92K27uhbUJU1p1b_wW1gFWFOEjXk"

Why structured access matters for agents

Vector search finds relevant passages; structured data-plane routes supply tables as data, figure assets, section trees, cross-references, and processing artifacts so tools and UIs stay grounded. Use per-document GET …/documents/{documentId}/… APIs together with POST …/search when you need cross-document discovery—see Building agents with Indexify for end-to-end scenarios.

Use the bullets as a checklist: read once, then keep this open while you implement so you don’t miss the small but important details (like token type, scopes, and strict query rules).

Beyond chunk-only RAG
- Chunks are lossy and overlapping; listing or fetching elements, sections, tables, and figures by stable ids gives the model layout-aware inputs—especially when the answer depends on row/cell accuracy or visual content, not a paraphrase of PDF text.
Safer tool-calling
- After POST …/search, use GET …/tables/{tableId}, GET …/elements/{elementId}, or GET …/elements/{elementId}/related to assemble exact fields for external systems (ticketing, billing, CRM) instead of invented parameters.
Scoped retrieval and long-document UX
- GET …/sections, POST …/search with groupBy: section or elementTypes, and per-document chunks / elements let the agent or user narrow to a chapter, clause, or manual branch before pulling evidence—patterns that are awkward when you only have an unordered set of text chunks.
Explainability and audit
- GET …/relationships and related-elements routes expose cross-references (“see Figure 2”, “per §3.1”) so citations and policy answers map to explicit regions.