Inferences
VLM (Vision Language Model) Inference
The VLM API provides advanced document processing capabilities using Vision Language Models. It can intelligently classify and extract structured data from various document types including shipping labels, item labels, bills of lading, receipts, and invoices. The API supports both images and PDF documents with multi-page processing.
Supported File Types
Images
- JPEG/JPG, PNG, BMP, TIFF, GIF, SVG, WebP, HEIC
PDF Documents
- Maximum Pages: 100 pages per PDF
- Multi-page Support: Each page is processed individually and returned in a structured response
Supported Document Types
The VLM API automatically classifies and processes the following document types:
- Shipping Labels - Extract tracking numbers, courier information, sender/recipient details
- Item Labels - Extract product information, SKUs, batch numbers, dimensions
- Bills of Lading - Extract logistics information, container details, shipping data
- Receipts - Extract merchant information, transaction details, itemized purchases
- Invoices - Extract billing information, line items, payment terms
- Other Documents - Flexible extraction for custom document types
Available Models
The VLM API supports multiple model sizes for different use cases:
- orion_small, orion_medium, orion_large - general-purpose models for generic documents, in increasing size and accuracy
- vscan_small, vscan_medium, vscan_large - models specialized for logistics documents, in increasing size and accuracy
New Inference
Create a new VLM inference to process and extract structured data from document images.
Parameters
image string (required)
Base64-encoded data URL or public web URL of the image or PDF to process. Supports images (JPEG, PNG, etc.) and PDF documents (max 100 pages).
- Image format: data:image/jpeg;base64,... or an image URL
- PDF format: data:application/pdf;base64,... or a PDF URL
prompt string (required unless prompt_id is provided)
Custom prompt to guide the VLM processing. Must be between 5 and 8000 characters and contain meaningful content.
prompt_id string
The ID of a saved prompt from the Prompts API to use for this inference. Either prompt or prompt_id is required.
model string (optional)
VLM model to use for processing. Defaults to vscan_small. Options: orion_small, orion_medium, orion_large, vscan_small, vscan_medium, vscan_large.
location_id string (optional)
The ID of the location to attribute this inference to for filtering and organization.
metadata object (optional)
Custom metadata to associate with the inference.
Example Request (Image)
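A minimal sketch in Python using the requests library. The base URL, the /vlm/inferences endpoint path, and the Bearer-token Authorization header are assumptions for illustration, not documented values; substitute your actual endpoint and credentials.

```python
import base64
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: Bearer-token auth
BASE_URL = "https://api.example.com/v1"  # assumption: replace with the real base URL

# Read a local image and encode it as a base64 data URL
with open("shipping_label.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    f"{BASE_URL}/vlm/inferences",        # assumption: endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "image": f"data:image/jpeg;base64,{image_b64}",
        "prompt": "Extract the tracking number, courier, and recipient address.",
        "model": "vscan_small",
    },
)
response.raise_for_status()
print(response.json())
```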
Example Request (PDF)
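The same sketch for a PDF, under the same endpoint and auth assumptions as above. Note that PDFs are passed in the same image parameter, as a data:application/pdf;base64,... URL.

```python
import base64
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: Bearer-token auth
BASE_URL = "https://api.example.com/v1"  # assumption: replace with the real base URL

# Encode a multi-page PDF as a base64 data URL; each page is processed individually
with open("invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    f"{BASE_URL}/vlm/inferences",        # assumption: endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "image": f"data:application/pdf;base64,{pdf_b64}",
        "prompt": "Extract the invoice number, line items, and payment terms from each page.",
        "model": "orion_medium",
        "metadata": {"source": "accounts-payable"},  # optional custom metadata
    },
)
response.raise_for_status()
print(response.json())
```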
Response Format
The VLM API returns structured JSON data based on the detected document type. The model_response is always an array of data objects, where each element represents extracted data from one page.
VLM Response Model (Single Page)
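An illustrative single-page response for a shipping label. All identifiers and extracted values here are invented, and the keys inside model_response depend on your prompt and the detected document type.

```json
{
  "object": "vlm_inference",
  "id": "inf_abc123",
  "status": "completed",
  "model": "vscan_small",
  "model_response": [
    {
      "document_type": "shipping_label",
      "tracking_number": "1Z999AA10123456784",
      "courier": "UPS",
      "recipient": "Jane Doe, 123 Main St, Springfield"
    }
  ],
  "token_count": 1250,
  "created_at": "2024-01-15T10:30:00Z"
}
```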
Multi-Page PDF Response
When processing a PDF with multiple pages, the model_response array contains one object per page:
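An illustrative model_response for a three-page PDF: element 0 corresponds to page 1, element 1 to page 2, and so on. All keys and values are invented; the actual keys depend on your prompt and the detected document type.

```json
{
  "status": "completed",
  "model_response": [
    { "page_type": "invoice_header", "invoice_number": "INV-001" },
    { "page_type": "line_items", "line_item_count": 12 },
    { "page_type": "payment_terms", "terms": "Net 30" }
  ]
}
```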
Response Fields: the response is a VLM Inference Model object; see the VLM Inference Model section below for a description of each field.
Retrieve Inference
Retrieve a specific VLM inference by its ID.
Parameters
inference_id string (required)
The unique identifier of the VLM inference to retrieve.
Example Request
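A minimal retrieval sketch, under the same endpoint and auth assumptions as the earlier examples; the GET /vlm/inferences/{inference_id} path is an assumption.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: Bearer-token auth
BASE_URL = "https://api.example.com/v1"  # assumption: replace with the real base URL

inference_id = "inf_abc123"              # ID returned when the inference was created

response = requests.get(
    f"{BASE_URL}/vlm/inferences/{inference_id}",  # assumption: endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()

inference = response.json()
print(inference["status"], inference["model"])
```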
List Inferences
Retrieve a paginated list of VLM inferences with optional filtering.
Query Parameters
page number (optional)
Page number for pagination. Default: 1.
limit number (optional)
Number of inferences per page. Default: 20, max: 100.
order_by string (optional)
Field to sort by. Options: created_at. Default: created_at.
location_id string (optional)
Filter inferences by location ID.
models string (optional)
Filter inferences by model used.
status string (optional)
Filter inferences by status. Possible values: inferring, completed, error.
Example Request
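A minimal listing sketch with filters, under the same endpoint and auth assumptions; the shape of the paginated envelope (results under a data key) is also an assumption.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: Bearer-token auth
BASE_URL = "https://api.example.com/v1"  # assumption: replace with the real base URL

response = requests.get(
    f"{BASE_URL}/vlm/inferences",        # assumption: endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={
        "page": 1,
        "limit": 50,                     # max 100
        "order_by": "created_at",
        "status": "completed",
        "models": "vscan_small",
    },
)
response.raise_for_status()

for inference in response.json().get("data", []):  # assumption: results under a `data` key
    print(inference["id"], inference["status"], inference["created_at"])
```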
VLM Inference Model
object "vlm_inference"
The description of the model.
id string
Unique identifier for the VLM inference.
organization_id string
Unique identifier for the organization that owns this inference. This will always be your organization ID.
location_id string | null
The hub location to which this inference is assigned, if specified.
organization object
Details about the organization that owns this inference.
location object | null
Details about the location this inference is assigned to.
status string
Processing status of the inference. Possible values: inferring, completed, error.
image_url string
The URL to the processed image used for this inference.
image_hash string
A unique hash for this image that can be used to identify duplicate processed images.
model string
The VLM model used for processing. Options: vscan_small, vscan_medium, vscan_large, orion_small, orion_medium, orion_large.
prompt string
The prompt used to guide the VLM processing.
model_response array
Always an array of extracted data objects, one per page (index 0 = page 1, etc.).
token_count number | null
Number of tokens consumed during processing.
metadata object
Key-value pairs of custom metadata associated with this inference.
created_at string
Creation timestamp in ISO 8601 format.
created_by string | null
User ID who created this inference.
updated_at string
Last update timestamp in ISO 8601 format.
updated_by string
User ID who last updated this inference.
checksum string
MD5 checksum for data integrity verification.
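For reference, an illustrative completed inference object combining the fields above. Every value is invented, and the nested organization and location objects are omitted for brevity.

```json
{
  "object": "vlm_inference",
  "id": "inf_abc123",
  "organization_id": "org_xyz789",
  "location_id": null,
  "status": "completed",
  "image_url": "https://storage.example.com/images/abc123.jpg",
  "image_hash": "d41d8cd98f00b204e9800998ecf8427e",
  "model": "vscan_small",
  "prompt": "Extract the tracking number and courier.",
  "model_response": [
    { "tracking_number": "1Z999AA10123456784", "courier": "UPS" }
  ],
  "token_count": 1250,
  "metadata": {},
  "created_at": "2024-01-15T10:30:00Z",
  "created_by": "user_123",
  "updated_at": "2024-01-15T10:30:05Z",
  "updated_by": "user_123",
  "checksum": "9e107d9d372bb6826bd81d3542a419d6"
}
```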
Best Practices
Image Quality
- Use high-resolution images (minimum 300 DPI)
- Ensure good lighting and contrast
- Avoid blurry or distorted images
- Crop images to focus on the document content
PDF Documents
- Keep PDFs under 100 pages (maximum limit)
- Ensure PDF pages are clear and readable
- Use text-based PDFs when possible for better accuracy
- Consider splitting very large documents into smaller batches
- Processing time scales with page count
Prompt Engineering
- Be specific about what information you want extracted
- Include context about the document type
- Use clear, concise language
- Avoid ambiguous instructions
Model Selection
- Use orion_small for high-volume, simple generic documents
- Use orion_medium for balanced performance on generic documents
- Use orion_large for complex generic documents requiring high accuracy
- Use vscan_small for logistics documents and shipping labels
- Use vscan_medium for balanced performance on logistics documents
- Use vscan_large for complex logistics documents requiring maximum accuracy