indexify.dev - Document Processing for AI Agents

Reference & Troubleshooting

Best Practices and Anti-Patterns

Favor clear data boundaries, observable pipelines, and least-privilege credentials. The platform gives you parsing, chunking, search, and structured document routes—you still own corpus design, evaluation, and how evidence is shown to end users.

Use the bullets as a checklist: read once, then keep this open while you implement so you don’t miss the small but important details (like token type, scopes, and strict query rules).

Do
- Align each knowledge base with one domain or compliance boundary so settings and access patterns stay explainable.
- Tune retrieval and rerank with labeled queries before changing LLM prompts—bad evidence rarely fixes itself in the prompt.
- Enable source metadata in search when user-facing answers must cite documents.
Avoid
- One oversized mixed corpus—split KBs and use metadata filters where the API allows.
- Shipping without citation metadata or retrieval metrics (Glossary — integration quality).
- Recycling one machine credential across unrelated services (Scope model).

Troubleshooting

Start from HTTP status and Problem JSON (title, detail, optional code), then map to token type, scopes, KB settings, or provider configuration. Cross-check the API reference for the exact route you called.
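
As a rough illustration of the triage order described above, the sketch below parses a Problem JSON body and maps the status to a first thing to check. The `Problem` class and the `parse_problem` and `first_check` helpers are hypothetical names, not part of any Indexify SDK, and the hint strings only paraphrase this page.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    status: int
    title: str
    detail: str
    code: Optional[str] = None  # optional in the Problem JSON shape


def parse_problem(body: str) -> Problem:
    """Parse an application/problem+json body into a Problem."""
    data = json.loads(body)
    return Problem(
        status=int(data["status"]),
        title=data.get("title", ""),
        detail=data.get("detail", ""),
        code=data.get("code"),
    )


def first_check(problem: Problem) -> str:
    """Return a first thing to check for a status (hypothetical helper)."""
    hints = {
        401: "token missing, invalid, or wrong token type for the route",
        403: "missing scope, or projectId in path differs from the token's project",
        429: "throttle concurrency, read limit headers, back off with jitter",
    }
    return hints.get(problem.status, "read detail and cross-check the API reference")
```

Logging `title`, `detail`, and `code` together with the route you called usually gives enough context to pick the right row in the list that follows.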

Symptom → check
- 401 / 403 — Wrong token type (developer vs project), wrong projectId in path, or missing scope (Available scopes).
- Search errors or empty hits — the requested format must be one of the parse output formats produced for the document; confirm the relevant jobs completed (Jobs).
- Rerank failures — KB rerankConfig and provider credentials; see Embedding models & rerank.
- 429 — Throttle concurrency; read limit headers (Throttling); exponential backoff with jitter.
- Low relevance — Widen retrieveTopK, adjust hybrid settings, or revisit chunking (Advanced retrieval).
- Webhooks — Verify X-Indexify-Signature on the raw body (Verification).
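
For the 429 row, one common way to implement exponential backoff with jitter is the "full jitter" variant sketched below. The function name, the `base` and `cap` defaults, and the choice to prefer a server-supplied Retry-After value are assumptions made for illustration; tune them against the limit headers you actually observe.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # If the server sent Retry-After (already parsed to seconds), honor it.
    if retry_after is not None:
        return max(0.0, retry_after)
    source = rng or random
    # Exponential growth, capped, then a uniform draw in [0, ceiling].
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Drawing the delay uniformly from the whole interval spreads retries from many workers apart, which helps avoid synchronized bursts after a throttling event.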

Data plane endpoint checklist (KB settings ↔ endpoints)

Use this to quickly map an endpoint returning empty/missing data to the likely KB setting or missing artifact (pipeline stage, output format, or multimodal structure).

Parsed document
- Endpoints
  - GET …/parsed
- KB settings or artifacts to check
  - settings.pipeline includes parse
  - Requested representation is in settings.parse.outputFormats (via Accept)
  - Parse job finished successfully
Document chunks
- Endpoints
  - GET …/chunks (?format=… per spec)
- KB settings or artifacts to check
  - settings.pipeline includes chunk (or includes index, since index ⇒ chunk)
  - Requested format is in settings.parse.outputFormats
  - Chunking job finished successfully
Search
- Endpoints
  - POST …/search (?format=… query param)
- KB settings or artifacts to check
  - settings.pipeline includes index (embeddings exist)
  - Requested format is in settings.parse.outputFormats
Elements, relationships, sections, tables, figures
- Endpoints
  - GET …/elements · GET …/relationships · GET …/sections · GET …/tables · GET …/tables/{tableId} · GET …/figures · GET …/figures/{figureId} · GET …/elements/{elementId}/relationships · GET …/elements/{elementId}/related
- KB settings or artifacts to check
  - settings.parse.outputFormats includes json
  - Parse(JSON) finished successfully
Search with section or element narrowing
- Endpoints
  - POST …/search (with groupBy: section and/or elementTypes)
- KB settings or artifacts to check
  - settings.pipeline includes index (embeddings available)
  - settings.parse.outputFormats includes json when relying on section/element metadata
  - Parse and index jobs finished successfully
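
The checklist above can be turned into a small client-side diagnostic. The sketch below assumes that `settings.pipeline` is a list of stage names and `settings.parse.outputFormats` is a list of format strings; the route labels and the `missing_prerequisites` helper are hypothetical. It does not check job status, which you still need to confirm separately.

```python
from typing import Any, Dict, List

# Route groups taken from the checklist; the labels are illustrative.
JSON_ROUTES = {"elements", "relationships", "sections", "tables", "figures"}
FORMAT_ROUTES = {"parsed", "chunks", "search"}


def missing_prerequisites(settings: Dict[str, Any], route: str,
                          fmt: str = "") -> List[str]:
    """List KB settings that would explain empty or missing data for a route."""
    if route not in JSON_ROUTES | FORMAT_ROUTES:
        raise ValueError(f"unknown route: {route}")
    pipeline = set(settings.get("pipeline", []))
    formats = set(settings.get("parse", {}).get("outputFormats", []))
    problems: List[str] = []

    if route == "parsed" and "parse" not in pipeline:
        problems.append("settings.pipeline lacks parse")
    # index implies chunk, so either stage satisfies the chunks route.
    if route == "chunks" and not ({"chunk", "index"} & pipeline):
        problems.append("settings.pipeline lacks chunk (or index)")
    if route == "search" and "index" not in pipeline:
        problems.append("settings.pipeline lacks index")
    if route in JSON_ROUTES and "json" not in formats:
        problems.append("settings.parse.outputFormats lacks json")
    if route in FORMAT_ROUTES and fmt and fmt not in formats:
        problems.append(f"format '{fmt}' not in settings.parse.outputFormats")
    return problems
```

An empty list means the settings look compatible with the route, so the next thing to check is whether the parse, chunk, or index job finished successfully.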

Glossary

Short definitions for terms used across this guide and the API reference. For procedures, prefer linked sections: Authentication, Ingestion, Retrieval, MCP.

If you’re reading this as a reference, feel free to jump around—each section is written to stand on its own, and the left-hand search is the fastest way to find a specific endpoint or concept.

Identity, access, and tenancy

Who is calling and what they may do. How-to: Authentication and Security, Appendix — token matrix.

Definitions
- Developer — human user of the product (signup, login, account UI). Often authenticated with a session or JWT for Control & Ingestion Plane actions.
- Developer JWT — bearer token representing the logged-in developer, used for account- and project-level operations (not the same as machine project tokens).
- Project — top-level container for credentials, knowledge bases, and billing-style boundaries. Admin routes (projects, credentials, KB listing): GET /projects, POST /projects, GET /projects/{projectId}, POST …/credentials, DELETE …/credentials/{credentialId}, GET …/kbs — full matrix in API reference.
- Project access token — short-lived OAuth2 access token obtained with client_credentials using a project’s client id/secret. Used for KB, documents, jobs, search, webhooks against that project.
- Machine credential / project credential — the client id + secret pair created on a project; used only on trusted servers to mint project access tokens.
- Scope — fine-grained permission string on a project token (for example docs:read, search:run). Requests fail with 403 if the token lacks the required scope.
- Least privilege — practice of issuing narrowly scoped credentials per service (ingest-only vs search-only) to limit blast radius if a secret leaks.
- Control & Ingestion Plane vs data plane — Control & Ingestion Plane: projects, settings, credential CRUD. Data plane: runtime ingestion, search, and webhooks on KBs.
- Rate limit key — identifier used for throttling (typically per project when using a project token, else per developer or client IP).
- Idempotency key — client-supplied header value so retries of the same logical operation do not create duplicates.

Data model and content hierarchy

Projects, KBs, documents, chunks, and structured outputs. See Knowledge Base Design and Structured document access.

Definitions
- Knowledge Base (KB) — named configuration and storage boundary under a project. Each KB has its own parse/chunk/index/search settings, documents, jobs, and optional webhook.
- Document — a single uploaded file or ingestion unit in a KB. Has metadata (name, type, size), processing jobs, and derived artifacts (parsed tree, chunks, vectors).
- Source file / blob — raw bytes stored for a document; processing reads this to produce parsed outputs.
- Parsed document — structured output of the parse stage: layout-aware representation (sections, elements, tables, figures, etc.) rather than a single plain string.
- Element — typed node in a parsed document (paragraph, heading, table cell region, figure, etc.). Useful for UI and agents that need more than flat text.
- Section — logical subdivision of a document (often heading-derived). API routes expose sections and their child elements.
- Chunk — text (and metadata) segment stored for retrieval. Created in the chunk stage; embedding vectors typically correspond to chunks.
- Chunking — policy that splits or merges parsed content into chunk boundaries (size, hierarchy, markdown tables, metadata attachment).
- Embedding vector — fixed-length numeric representation of chunk or query text used for similarity search in vector space.
- Index / indexing — stage that writes embeddings and retrieval structures so search can run. Distinct from “search index” as a generic term.
- Corpus — the set of chunks (and associated metadata) searchable within a KB after successful processing.

Ingestion pipeline and jobs

Parse → chunk → index and async jobs. Operational guide: Jobs and webhooks, Document ingestion.

Definitions
- Pipeline — ordered processing stages for each document, reflected in job payloads (GET …/jobs/{jobId}); typically parse → chunk → index.
- Parse stage — converts source format into structured parsed form and chosen output formats (markdown, html, json, doctags, etc.).
- Chunk stage — transforms parsed content into chunks according to KB chunk settings.
- Job — asynchronous unit of work for a document ingest or reprocess. GET …/jobs/{jobId} returns status, optional stages, and errors when failed.
- Job stage — fine-grained state within a job (for example parsing, chunking, indexing) with success/failure and timing.
- Reprocess / retry — trigger processing again after failures or settings changes; may create new attempts or follow idempotency rules.
- Processing complete — state where required stages succeeded and search can return results for the affected content (subject to eventual consistency).
- Failure / terminal error — job or stage stopped with an error payload; clients should surface codes and retryability hints when present.

Search, retrieval, and ranking

Search modes, top-K, rerank. Guides: Retrieval fundamentals, Advanced retrieval, Embedding models & rerank.

Definitions
- Search / query — HTTP request to a KB’s search endpoint with a natural language or keyword string plus optional settings overrides.
- Semantic search — retrieval using embedding similarity between query and chunks (meaning-based match).
- Keyword search — retrieval matching lexical terms (important for SKUs, error codes, API paths).
- Hybrid search — combines semantic and keyword signals so both paraphrases and exact tokens can rank well.
- Retriever — first stage that pulls a candidate set of chunks (often top-K by similarity or hybrid score).
- Rerank / reranker — second-stage model that scores each query–chunk pair for finer relevance; reduces noise before the LLM sees context.
- retrieveTopK — setting controlling how many candidates the first stage returns before rerank.
- rerank.topK — setting controlling how many chunks survive reranking into the final context set.
- Top-K — generic term for “K best results” after a scoring step.
- Similarity / distance — geometric relationship between vectors; higher similarity usually means closer in embedding space.
- Embedding model — provider-specific model id (for example openai:text-embedding-3-small) configured on the KB for both index and query embedding when applicable.
- Query embedding — vector computed for the user query at search time using the KB’s embedding configuration.
- Format (search) — response projection for chunk text (markdown, html, text, json, doctags). Must align with parse output formats available for the document.
- Pagination cursor — opaque token (nextCursor) for stable continuation of large result sets.
- Source metadata / citation metadata — fields tying a hit back to document, offsets, or section paths so answers can cite sources.
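
To make the relationship between retrieveTopK and rerank.topK concrete, here is a toy two-stage pipeline. It is a conceptual sketch only, not how Indexify implements search: the scoring callables stand in for the retriever and the reranker, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Score = Callable[[str, str], float]


def two_stage(query: str, chunks: Sequence[str],
              retrieve_score: Score, rerank_score: Score,
              retrieve_top_k: int, rerank_top_k: int) -> List[str]:
    """Retrieve a wide candidate set, then keep the best after reranking."""
    # Stage 1: cheap scoring over all chunks, keep retrieve_top_k candidates.
    candidates = sorted(chunks, key=lambda c: retrieve_score(query, c),
                        reverse=True)[:retrieve_top_k]
    # Stage 2: finer scoring over candidates only, keep rerank_top_k.
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:rerank_top_k]
```

The reranker can only reorder what the first stage returned, which is why widening retrieveTopK is the usual first move when relevant chunks are missing from the final set.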

RAG, LLMs, and agents

Patterns for models plus tools. See Building agent workflows and MCP Integration.

Definitions
- RAG (retrieval-augmented generation) — pattern where an LLM answers using retrieved chunks as context, reducing hallucinations when evidence exists.
- Grounding — anchoring model output in retrieved passages; strong grounding implies citations trace to real chunks.
- Hallucination — plausible but unsupported statement; RAG and citations mitigate hallucinations but do not eliminate them.
- Context window — maximum tokens an LLM can attend to; rerank and top-K exist to fit the best evidence within this budget.
- Prompt — instructions and retrieved text sent to the model; quality of evidence often matters more than prompt tricks.
- Agent — system that plans multiple steps (tool calls, retrieval, refinement) rather than one-shot prompt/answer.
- Tool / function calling — agent invokes external APIs (including your wrappers around Indexify search) with structured arguments.
- MCP (Model Context Protocol) — standard way for agent hosts to expose tools to a model; Indexify exposes a subset of the REST surface as MCP tools (MCP Integration).
- Orchestration layer — your own scheduler or a third-party agent framework that sequences tool calls and manages state between steps.

Webhooks, events, and delivery

Push notifications to your HTTPS endpoint. Setup: Jobs and webhooks, Webhook security.

Definitions
- Webhook — HTTPS callback URL registered on a KB to receive JSON payloads when subscribed events occur.
- Webhook event — named occurrence (job.pending, job.completed, job.failed, coarse stage completions, granular milestones like job.parsing_format_completed, partial-failure signals, etc.).
- Payload — JSON body POSTed to your URL, including event type, timestamps, and nested project/KB/document/job data.
- Delivery — single HTTP attempt or retry sequence for one webhook payload; failures may retry per webhook retry policy.
- Retry policy — max attempts and backoff between deliveries when your endpoint returns non-success or times out (configured on the webhook).
- Signing secret — shared HMAC secret; Indexify may send X-Indexify-Signature so you can verify authenticity.
- At-least-once delivery — duplicates are possible; your handler should be idempotent (dedupe by delivery or event id).
- Test webhook — synthetic POST (webhook.test) to validate connectivity and signature verification without waiting for real jobs.
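
Because delivery is at-least-once, a handler needs some form of deduplication. The sketch below assumes each payload carries a `deliveryId` field; where that identifier actually appears (body or header) should be confirmed in the API reference. `DedupingHandler` is a hypothetical class, and the in-memory set stands in for a shared store with expiry that a multi-instance service would need.

```python
from typing import Any, Callable, Dict, Set


class DedupingHandler:
    """Process each delivery at most once per process (illustrative only)."""

    def __init__(self, process: Callable[[Dict[str, Any]], None]) -> None:
        self._process = process
        self._seen: Set[str] = set()

    def handle(self, payload: Dict[str, Any]) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        delivery_id = payload["deliveryId"]  # assumed field name
        if delivery_id in self._seen:
            return False
        self._process(payload)
        # Record only after success so a failed attempt can be retried.
        self._seen.add(delivery_id)
        return True
```

Marking the delivery as seen after processing means a crash mid-handler leads to a retry rather than a silently dropped event.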

Structured access and multimodal

Layout-aware content from JSON parse. Guide: Structured document access.

Definitions
- Structured access — per-document elements, sections, tables, figures, and relationships exposed via document-scoped GET routes.
- Relationship — typed edge between elements within a document (containment, reading order, captions, etc.).
- Multimodal — processing that retains non-textual structure (tables as tables, figures with captions, layout) rather than flattening everything to prose.
- Table (document) — structured grid extracted as first-class content; may support row-level retrieval modes when enabled.
- Figure — image or diagram with caption and metadata.
- Bounding box / bbox — coordinates grounding content in a page; useful for PDF viewers and provenance.
- OCR — optical character recognition for scanned pages or images inside documents.
- Output format — parse-time representation choice (markdown, html, etc.) influencing what search can return.

Security, reliability, and observability

HTTP semantics, secrets, and SLO-style thinking. See Reliability and operations and Troubleshooting.

Definitions
- TLS / HTTPS — required for webhook URLs in typical configurations; protects payloads in transit.
- Secret management — storing client secrets and webhook secrets in vaults or managed secret stores, not in repos or frontends.
- 403 Forbidden — authentication succeeded but scopes or ownership checks failed.
- 401 Unauthorized — missing or invalid token.
- 429 Too Many Requests — rate limit exceeded; honor Retry-After and backoff.
- 504 / upstream errors — transient faults from providers or dependencies; retry with limits.
- Correlation id — client-generated id propagated across logs to tie webhook handling to originating API calls.
- SLO / SLA — service objectives you define (freshness, latency); Indexify provides metrics hooks via jobs and headers you can chart.
- Monitoring email — optional project setting to receive failure/recovery notifications for jobs and webhooks.

Integration, testing, and quality

Testing vocabulary for retrieval systems. Apply with Best practices and API Runner.

Definitions
- Smoke test — minimal automated path (auth → upload or existing doc → job success → search hit) run after deploys or config changes.
- Golden query set — fixed list of questions with expected source documents or passages; used to regression-test retrieval.
- Precision@K — fraction of top-K results that are relevant; common offline metric for retrieval tuning.
- Recall — fraction of all relevant documents (or chunks) found in the candidate set; trades off with precision when K is small.
- Regression — retrieval quality or latency gets worse after a change; guard with golden sets and dashboards.
- Shadow mode — run new retrieval settings or providers in parallel without serving users, compare scores offline.
- Feature flag — toggle in your app to route traffic between retrieval profiles or Indexify KBs.
- Canary — roll out a change to a small slice of users or traffic before full cutover.
- Dead letter queue — store failed webhook deliveries (or your handler’s failures) for manual replay after fixing bugs.
- Backpressure — slowing upstream producers (for example, your upload or query workers) when ingestion or search load spikes, so downstream consumers are not overwhelmed and failures do not cascade.
- Cold start — first query or first document in a new KB may pay one-time latency until caches warm.
- Warm path — steady-state requests after caches and connections are established.
- Determinism — same inputs producing the same stored chunks and vectors given fixed embedding/rerank models and KB settings.
- Model drift — embedding or rerank model changes upstream so vectors are no longer comparable without reprocessing.
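
Precision@K and recall, as defined above, are simple to compute over a golden query set. The sketch below uses the standard definitions; the function names and the idea of identifying results by document or chunk id strings are assumptions for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

With ranked results a, b, c, d and relevant items a, c, e, precision at 2 is 1/2 and recall at 4 is 2/3. Averaging these over the golden set before and after a settings change gives a basic regression signal.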

General and cross-cutting

HTTP, errors, and spec terms. Primary references: API reference, Appendix.

Definitions
- API — HTTP JSON interface to Indexify (api.indexify.dev in production examples).
- REST — resource-oriented HTTP patterns (paths, verbs, status codes) used by most endpoints.
- Problem JSON — API errors often use Content-Type: application/problem+json with a JSON body { status, title, detail, code? } (code is optional). This is RFC 7807–inspired but Indexify does not set a type URI field.
- OpenAPI / spec — machine-readable description of endpoints; the Indexify landing site may expose a spec for download and for the docs UI.
- Environment — logical deployment tier (production, staging, development); use separate projects and credentials per tier.
- Deprecation — older fields or behaviors scheduled for removal; check changelog and migration notes.
- KB settings — JSON blob on create/update controlling pipeline, parse, chunk, index embedding, search rerank.
- Provider — third-party AI vendor (OpenAI, Cohere, Voyage, Jina, Google, AWS Bedrock, Nomic) behind embedding or rerank configuration.
- Token (LLM) — length unit for language model input; distinct from OAuth access token.
- Token (OAuth) — bearer credential for API authorization.
- Base URL — host prefix for all API calls; examples use https://api.indexify.dev but your deployed region or vanity host may differ.
- Content-Type — HTTP header; JSON bodies use application/json, uploads typically multipart/form-data.
- Accept — HTTP header indicating preferred response shapes where negotiated (most Indexify APIs return JSON).
- User-agent — optional client identifier string; useful for support when debugging abuse or quotas.

Appendix

Dense reference: how the API surface is grouped, which token types apply, HTTP semantics, limits, webhooks, and KB settings keys. For project-token scope strings, see Available scopes. Use this alongside the API reference and API Runner for path-level detail and interactive calls.

API layout and path conventions

How URLs are structured so you can predict auth and scope needs from the path alone.

URLs & auth classes
- Host — production examples use https://api.indexify.dev; use the Base URL for your account/region.
- URL layout — project admin: GET /projects, POST /projects, GET /projects/{projectId}; credentials: POST …/credentials, DELETE …/credentials/{credentialId}; KB index: GET …/kbs, POST …/kbs. KB-scoped data plane lives under /projects/{projectId}/kbs/{kbId}/ — see API reference.
- projectId in the path must match the project access token’s project.
- OAuth machine auth — POST /oauth/token (client_credentials, form body).
- Developer (human) auth — POST /auth/signup, POST /auth/login and related routes in API reference; not project tokens.
- Use only paths documented in the public API reference; undocumented URLs are unsupported.
Errors & lists
- Problem JSON — application/problem+json with status, title, detail, optional code (no type URI today).
- Pagination — nextCursor in the response body; pass it back as cursor (or the parameter name the spec defines) on the follow-up request.
- Cursors are not immortal — re-list after large data changes.
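
Cursor pagination as described above can be wrapped in a generator. In this sketch, `fetch_page` is a hypothetical callable that you supply to perform the HTTP request for a given cursor, and the `items` key is an assumed name for the list field; only `nextCursor` comes from this page.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_items(fetch_page: Callable[[Optional[str]], Dict[str, Any]],
               items_key: str = "items") -> Iterator[Any]:
    """Yield items across pages until the response has no nextCursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get(items_key, [])
        cursor = page.get("nextCursor")
        if not cursor:
            break
```

Since cursors can go stale after large data changes, a caller that receives an error mid-iteration may be better off restarting the listing than retrying the same cursor.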

Token types and what they unlock (summary)

Rule of thumb: developer JWT for Control & Ingestion Plane operations; project token for data plane operations.

Token types
- Developer JWT — Projects, account, credentials, settings (e.g. monitoring email). Never in public clients.
- Project access token — KBs, docs, jobs, search, parsed content, webhooks via client_credentials.
- Mismatch — 401/403 often = wrong token type or projectId in path ≠ token’s project.
- TTL — Short-lived; in-memory cache + refresh; never commit tokens or secrets.
- Rotation — Create a new credential, update your services, revoke the old credential; brief overlap avoids downtime.
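
The TTL guidance above (in-memory cache plus refresh) might look like the sketch below. It builds the POST /oauth/token request without sending it and caches a token until shortly before expiry. Sending the client id and secret as form fields is an assumption (the spec may require a different client authentication method), the host is the production example and may differ for your account, and `build_token_request` and `TokenCache` are hypothetical names.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, Optional, Tuple

TOKEN_URL = "https://api.indexify.dev/oauth/token"  # production example host


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client_credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,          # assumed: credentials in the form body
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL, data=body, method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


class TokenCache:
    """Keep one token in memory and refresh it shortly before it expires."""

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 skew: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # returns (access_token, ttl_seconds)
        self._skew = skew        # refresh this many seconds early
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            self._token, ttl = self._fetch()
            self._expires_at = now + ttl
        return self._token
```

Keeping the cache in process memory avoids writing tokens to disk, and the early-refresh margin reduces the chance of sending a token that expires in flight.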

HTTP status and client handling

How to interpret common responses and what to do next.

Success
- 200 / 201 / 204 — Parse JSON when present.
Client errors
- 400 — Validation; fix payload, ids, or query params per detail.
- 401 — Missing/invalid token or wrong token type for route.
- 403 — Scopes or project mismatch.
- 404 — Unknown resource or missing webhook config.
- 409 — State conflict; read detail for retry vs change strategy.
- 413 / 415 / 422 — Size, media type, or semantic config rejection.
- 429 — Rate limit; use Retry-After and X-RateLimit-*.
Server errors & retries
- 502 / upstream — Transient; backoff with cap; log detail.
- Idempotent retries — GETs safe; POST/PUT use Idempotency-Key where documented.
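
The retry guidance above can be reduced to a small decision function. This sketch treats 429, 502, and 504 as the retryable statuses because those are the ones named on this page; treating 429 as retryable for any method assumes the request was rejected before it was processed. `should_retry` is a hypothetical helper.

```python
TRANSIENT_STATUSES = {502, 504}


def should_retry(method: str, status: int,
                 has_idempotency_key: bool = False) -> bool:
    """Decide whether an automatic retry is reasonable for a response."""
    if status == 429:
        # Assumed: a throttled request was not processed, so any method may retry.
        return True
    if status not in TRANSIENT_STATUSES:
        return False
    if method.upper() == "GET":
        return True  # reads are safe to repeat
    # Mutating calls retry only when an Idempotency-Key protects them.
    return has_idempotency_key
```

Pair this with a capped backoff and a maximum attempt count so transient upstream faults do not turn into unbounded retry loops.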

Limits, uploads, and practical caps

Figures below match the public API spec where stated; platform or plan limits may vary—treat error responses and docs as source of truth.

Uploads & batches
- Multipart file parts; per-file max in spec (often tens of MB) → 413 if exceeded.
- Batch file count cap per endpoint — check spec.
Search, webhooks, concurrency
- Large retrieveTopK / rerank.topK → latency and cost; tune with evals.
- Webhook retryPolicy: maxAttempts 1–10, backoffSeconds 1–300, (maxAttempts-1)*backoffSeconds ≤ 300.
- Match client concurrency to rate-limit headers to avoid 429 storms.
- Huge queries/filters may hit provider/HTTP limits before app logic.
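
The webhook retryPolicy bounds listed above combine three constraints, and the third is easy to miss. A client-side pre-check, sketched here with a hypothetical `validate_retry_policy` helper, can catch invalid combinations before the API rejects them.

```python
from typing import List


def validate_retry_policy(max_attempts: int, backoff_seconds: int) -> List[str]:
    """Return violated constraints for a webhook retryPolicy (empty if valid)."""
    errors: List[str] = []
    if not 1 <= max_attempts <= 10:
        errors.append("maxAttempts must be between 1 and 10")
    if not 1 <= backoff_seconds <= 300:
        errors.append("backoffSeconds must be between 1 and 300")
    # Total wait across retries is capped at 300 seconds.
    if (max_attempts - 1) * backoff_seconds > 300:
        errors.append("(maxAttempts-1)*backoffSeconds must be <= 300")
    return errors
```

For example, 5 attempts with a 60 second backoff gives 4 * 60 = 240 and passes, while 10 attempts with the same backoff gives 9 * 60 = 540 and fails the combined cap even though each value is individually in range.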

Webhook contract checklist

What to implement on your server to consume webhooks safely.

Protocol
- POST + application/json unless docs say otherwise.
- Subscribable events include job lifecycle events (full enum in API docs).
- webhook.test only from the test endpoint.
Security & handler design
- Verify X-Indexify-Signature (HMAC-SHA256, raw body, constant-time).
- Return 2xx fast; queue heavy work.
- Dedupe with deliveryId or composite keys — duplicates happen.
- Repeated failures exhaust retries → possible monitoring email.
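
Signature verification with HMAC-SHA256 over the raw body and a constant-time comparison can be sketched as follows. The encoding of X-Indexify-Signature is not stated on this page, so the sketch assumes a plain hex digest with no prefix; check the Verification section before relying on that. `verify_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Check an HMAC-SHA256 signature computed over the exact request bytes."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = header_value.strip().lower()  # assumed: hex digest, no prefix
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Compute the digest over the bytes exactly as received, before any JSON parsing or re-serialization, since even a whitespace change produces a different signature.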

KB `settings` object (top-level keys)

Reference for the main blocks inside a knowledge base configuration. Full field-level schemas are in ./api-docs and the OpenAPI components.

Top-level settings keys
- pipeline — Stages per ingest (parse, chunk, index, …).
- parse — Output formats, parse.pdf.
- chunk — method (hierarchical | hybrid), size, mergePeers, tokenizer, attachMetadata.
- index — embeddingConfig + optional multimodal index flags.
- search — Default rerankConfig + multimodal retrieval defaults.
- Unknown top-level keys rejected — PATCH only documented shapes.
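
Since unknown top-level keys are rejected, a quick local check before sending a settings update can save a round trip. The key set below is taken from the list above; `unknown_settings_keys` is a hypothetical helper, and the OpenAPI components remain the authority if the list changes.

```python
from typing import Any, Dict, List

KNOWN_TOP_LEVEL_KEYS = {"pipeline", "parse", "chunk", "index", "search"}


def unknown_settings_keys(settings: Dict[str, Any]) -> List[str]:
    """Return top-level keys that are not in the documented settings blocks."""
    return sorted(key for key in settings if key not in KNOWN_TOP_LEVEL_KEYS)
```

A typo such as `chunks` instead of `chunk` shows up in the returned list before the request is made.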

Search request surface (conceptual)

Search request bodies and query strings are defined in the OpenAPI document; use it for exact field names. This lists concepts you will see repeatedly.

Concepts
- query — Embedded with the KB’s embedding model.
- format — Hit text projection (markdown, html, …).
- settings overrides — Merge with KB defaults (searchMode, retrieveTopK, rerank, rerank.topK, includeSourceMetadata, …).
- includeSourceMetadata — Citation-friendly metadata in hits.
- nextCursor — Paging for large result sets.
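
To reason about how per-request overrides interact with KB defaults, it can help to model the merge locally. The sketch below assumes a recursive merge in which nested objects are combined key by key and other values are replaced; the actual server behavior and the exact nesting of fields such as rerank.topK are defined by the OpenAPI document, so treat the shapes here as hypothetical.

```python
import copy
from typing import Any, Dict


def merge_settings(defaults: Dict[str, Any],
                   overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with overrides applied, without mutating either input."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects merge key by key (assumed semantics).
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Under these assumptions, overriding only the rerank top-K leaves the other rerank defaults in place, which matches the intent of sending a minimal override rather than restating the whole configuration.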

Standard headers you should use

Headers that appear across many integrations.

Standard headers
- Authorization: Bearer — Developer JWT or project token per route.
- Content-Type — application/json or multipart/form-data for uploads.
- Idempotency-Key — On supported mutating calls (e.g. webhook upsert).
- X-RateLimit-* — Client-side pacing.
- Retry-After — On 429.
