
VLM (Vision Language Model) Inference

The VLM API provides advanced document processing capabilities using Vision Language Models. It can intelligently classify and extract structured data from various document types including shipping labels, item labels, bills of lading, receipts, and invoices. The API supports both images and PDF documents with multi-page processing.

Supported File Types

Images

  • JPEG/JPG, PNG, BMP, TIFF, GIF, SVG, WebP, HEIC

PDF Documents

  • Maximum Pages: 100 pages per PDF
  • Multi-page Support: Each page is processed individually and returned in a structured response

Supported Document Types

The VLM API automatically classifies and processes the following document types:

  • Shipping Labels - Extract tracking numbers, courier information, sender/recipient details
  • Item Labels - Extract product information, SKUs, batch numbers, dimensions
  • Bills of Lading - Extract logistics information, container details, shipping data
  • Receipts - Extract merchant information, transaction details, itemized purchases
  • Invoices - Extract billing information, line items, payment terms
  • Other Documents - Flexible extraction for custom document types

Available Models

The VLM API supports multiple model sizes for different use cases:

| Model | Description | Use Case |
| --- | --- | --- |
| orion_small | Fast, lightweight generic model | Quick processing, high volume |
| orion_medium | Balanced performance generic model | General purpose processing |
| orion_large | High accuracy generic model | Complex documents, maximum accuracy |
| vscan_small | Specialized logistics model (small) | Logistics documents, shipping labels |
| vscan_medium | Specialized logistics model (medium) | Logistics documents, balanced performance |
| vscan_large | Specialized logistics model (large) | Complex logistics documents, maximum accuracy |

New Inference

POST
`/v1/inferences/images/vlm`

Create a new VLM inference to process and extract structured data from document images.

Parameters

image string (required)
Base64 encoded data URL or public web URL of the image or PDF to process. Supports images (JPEG, PNG, etc.) and PDF documents (max 100 pages).

  • Image format: data:image/jpeg;base64,... or image URL
  • PDF format: data:application/pdf;base64,... or PDF URL
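
For local files, the data-URL form can be built from the raw bytes. A minimal Node.js sketch (the `toDataUrl` helper name and the `label.jpg` path are illustrative, not part of the API):

```js
// Wrap raw file bytes in the data-URL form the API accepts.
// The MIME type must match the content, e.g. image/jpeg, image/png,
// or application/pdf.
function toDataUrl(buffer, mimeType) {
  return `data:${mimeType};base64,${buffer.toString("base64")}`;
}

// Example (Node.js):
// const fs = require("fs");
// const image = toDataUrl(fs.readFileSync("label.jpg"), "image/jpeg");
```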

prompt string (required if prompt_id is not provided) (max 8000 characters)
Custom prompt to guide the VLM processing. Must be between 5 and 8000 characters and contain meaningful content.

prompt_id string
The ID of a saved prompt from the Prompts API to use for this inference. Either prompt or prompt_id is required.

model string (optional)
VLM model to use for processing. Defaults to vscan_small. Options: orion_small, orion_medium, orion_large, vscan_small, vscan_medium, vscan_large.

location_id string (optional)
The ID of the location to attribute this inference to for filtering and organization.

metadata object (optional)
Custom metadata to associate with the inference.

Example Request (Image)

js
const data = {
  image: "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAASA...", // Base64 encoded image
  prompt: "Extract all shipping information from this label including tracking number, sender, and recipient details.",
  model: "orion_medium",
  location_id: "loc_123456789",
  metadata: {
    source: "mobile_app",
    batch_id: "batch_001"
  }
};

const response = await fetch("https://api.packagex.io/v1/inferences/images/vlm", {
  method: "POST",
  headers: {
    "PX-API-KEY": process.env.PX_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(data),
}).then((res) => res.json());

const inference = response.data;

      

Example Request (PDF)

js
const data = {
  image: "data:application/pdf;base64,JVBERi0xLjQK...", // Base64 encoded PDF
  prompt: "Extract shipping information from each page of this document.",
  model: "vscan_large",
  metadata: {
    document_type: "shipping_manifest"
  }
};

const response = await fetch("https://api.packagex.io/v1/inferences/images/vlm", {
  method: "POST",
  headers: {
    "PX-API-KEY": process.env.PX_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(data),
}).then((res) => res.json());

const inference = response.data;

// model_response is always an array of data objects (one per page)
console.log(`Processed ${inference.model_response.length} page(s)`);

inference.model_response.forEach((pageData, index) => {
  console.log(`Page ${index + 1}:`, pageData);
});

      

Response Format

The VLM API returns structured JSON data based on the detected document type. The model_response is always an array of data objects, where each element represents extracted data from one page.

VLM Response Model (Single Page)

{
  "object": "vlm_inference",
  "id": "vlm_1234567890abcdef",
  "organization_id": "org_1234567890abcdef",
  "location_id": "loc_1234567890abcdef",
  "organization": {
    "id": "org_1234567890abcdef",
    "name": "Acme Corp",
    "logo_url": "https://example.com/logo.png"
  },
  "location": {
    "id": "loc_1234567890abcdef",
    "name": "Main Warehouse"
  },
  "status": "completed",
  "image_hash": "a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6",
  "image_url": "https://example.com/images/vlm_1234567890abcdef.jpg",
  "model": "vscan_medium",
  "prompt": "You are a high-performing OCR scanner and information extractor...",
  "model_response": [
    {
      "document_type": "shipping_label",
      "courier_name": "FedEx",
      "tracking_number": "1234567890123456",
      "dimensions": "12x8x6 inches",
      "weight": "2.5 lbs",
      "recipient": {
        "name": "John Doe",
        "address": {
          "line1": "123 Main St",
          "city": "New York",
          "state": "NY",
          "postal_code": "10001",
          "country": "USA"
        }
      },
      "sender": {
        "name": "Jane Smith",
        "address": {
          "line1": "456 Oak Ave",
          "city": "Los Angeles",
          "state": "CA",
          "postal_code": "90210",
          "country": "USA"
        }
      }
    }
  ],
  "token_count": 1250,
  "metadata": {},
  "created_at": "2024-01-15T10:30:00.000Z",
  "created_by": "user_1234567890abcdef",
  "updated_at": "2024-01-15T10:30:05.000Z",
  "updated_by": "user_1234567890abcdef",
  "checksum": "a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0"
}

      

Multi-Page PDF Response

When processing a PDF with multiple pages, the model_response array contains one object per page:

{
  "object": "vlm_inference",
  "id": "vlm_1234567890abcdef",
  "status": "completed",
  "model_response": [
    {
      "document_type": "shipping_label",
      "tracking_number": "1234567890123456",
      "recipient": {
        "name": "John Doe",
        "address": {
          "line1": "123 Main St",
          "city": "New York",
          "state": "NY",
          "postal_code": "10001"
        }
      }
    },
    {
      "document_type": "shipping_label",
      "tracking_number": "9876543210987654",
      "recipient": {
        "name": "Jane Smith",
        "address": {
          "line1": "456 Oak Ave",
          "city": "Los Angeles",
          "state": "CA",
          "postal_code": "90210"
        }
      }
    }
  ]
}

      

Response Fields:

| Field | Type | Description |
| --- | --- | --- |
| model_response | array | Always an array of extracted data objects, one per page (index 0 = page 1, etc.) |
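
Because each element carries a document_type field, a client can branch on it page by page. A hypothetical handler sketch (the shipping-label fields mirror the examples above; the invoice line_items field name is an assumption):

```js
// Route each page's extracted data by its classified document type.
function summarizePage(pageData) {
  switch (pageData.document_type) {
    case "shipping_label":
      return `Tracking ${pageData.tracking_number} for ${pageData.recipient?.name}`;
    case "invoice":
      // "line_items" is an assumed field name for invoice extractions
      return `Invoice with ${(pageData.line_items || []).length} line item(s)`;
    default:
      return `Unhandled document type: ${pageData.document_type}`;
  }
}
```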

Retrieve Inference

GET
`/v1/inferences/images/vlm/:inference_id`

Retrieve a specific VLM inference by its ID.

Parameters

inference_id string (required)
The unique identifier of the VLM inference to retrieve.

Example Request

js
const response = await fetch("https://api.packagex.io/v1/inferences/images/vlm/inf_vlm_123456789", {
  method: "GET",
  headers: {
    "PX-API-KEY": process.env.PX_API_KEY,
  },
}).then((res) => res.json());

const inference = response.data;

      

List Inferences

GET
`/v1/inferences/images/vlm`

Retrieve a paginated list of VLM inferences with optional filtering.

Query Parameters

page number (optional)
Page number for pagination. Default: 1.

limit number (optional)
Number of inferences per page. Default: 20, max: 100.

order_by string (optional)
Field to sort by. Options: created_at. Default: created_at.

location_id string (optional)
Filter inferences by location ID.

models string (optional)
Filter inferences by model used.

status string (optional)
Filter inferences by status.

Example Request

js
const response = await fetch("https://api.packagex.io/v1/inferences/images/vlm?page=1&limit=10&location_id=loc_123456789", {
  method: "GET",
  headers: {
    "PX-API-KEY": process.env.PX_API_KEY,
  },
}).then((res) => res.json());

const inferences = response.data;
const pagination = response.pagination;
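
The query parameters above can be assembled with the standard URL API. A small sketch that also clamps limit to the documented maximum of 100:

```js
// Build a list-endpoint URL from the supported query parameters.
function buildListUrl({ page = 1, limit = 20, location_id, models, status } = {}) {
  const url = new URL("https://api.packagex.io/v1/inferences/images/vlm");
  url.searchParams.set("page", String(page));
  url.searchParams.set("limit", String(Math.min(limit, 100))); // max 100 per page
  if (location_id) url.searchParams.set("location_id", location_id);
  if (models) url.searchParams.set("models", models);
  if (status) url.searchParams.set("status", status);
  return url.toString();
}
```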

      

VLM Inference Model

object "vlm_inference"
The object type. Always "vlm_inference" for this resource.

id string
Unique identifier for the VLM inference.

organization_id string
Unique identifier for the organization that owns this inference. This will always be your organization ID.

location_id string | null
The hub location to which this inference is assigned, if specified.

organization object
Details about the organization that owns this inference.

| Field | Type | Description |
| --- | --- | --- |
| organization.id | string | Organization ID |
| organization.name | string | Organization name |
| organization.logo_url | string \| null | URL to organization logo |

location object | null
Details about the location this inference is assigned to.

| Field | Type | Description |
| --- | --- | --- |
| location.id | string | Location ID |
| location.name | string | Location name |

status string
Processing status of the inference. Possible values: inferring, completed, error.

image_url string
The URL to the processed image used for this inference.

image_hash string
A unique hash for this image that can be used to identify duplicate processed images.

model string
The VLM model used for processing. Options: vscan_small, vscan_medium, vscan_large, orion_small, orion_medium, orion_large.

prompt string
The prompt used to guide the VLM processing.

model_response array
Always an array of extracted data objects, one per page (index 0 = page 1, etc.).

token_count number | null
Number of tokens consumed during processing.

metadata object
Key-value pairs of custom metadata associated with this inference.

created_at string
Creation timestamp in ISO 8601 format.

created_by string | null
User ID who created this inference.

updated_at string
Last update timestamp in ISO 8601 format.

updated_by string
User ID who last updated this inference.

checksum string
MD5 checksum for data integrity verification.

Best Practices

Image Quality

  • Use high-resolution images (minimum 300 DPI)
  • Ensure good lighting and contrast
  • Avoid blurry or distorted images
  • Crop images to focus on the document content

PDF Documents

  • Keep PDFs under 100 pages (maximum limit)
  • Ensure PDF pages are clear and readable
  • Use text-based PDFs when possible for better accuracy
  • Consider splitting very large documents into smaller batches
  • Processing time scales with page count

Prompt Engineering

  • Be specific about what information you want extracted
  • Include context about the document type
  • Use clear, concise language
  • Avoid ambiguous instructions
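
Applying these guidelines, a prompt for a shipping label might look like the following. The wording is illustrative, not a required template:

```js
// A specific, unambiguous prompt that names the document type and the
// exact fields to extract.
const prompt = [
  "This image is a shipping label.",
  "Extract the tracking number, the courier name,",
  "and the recipient's full name and address.",
  "Return null for any field that is not present on the label.",
].join(" ");
```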

Model Selection

  • Use orion_small for high-volume, simple generic documents
  • Use orion_medium for balanced performance on generic documents
  • Use orion_large for complex generic documents requiring high accuracy
  • Use vscan_small for logistics documents and shipping labels
  • Use vscan_medium for balanced performance on logistics documents
  • Use vscan_large for complex logistics documents requiring maximum accuracy
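
This guidance can be encoded in a small helper. A hypothetical sketch (the logistics/accuracy parameters are illustrative; only the returned model names come from the documentation):

```js
// Pick a model name from the document family and desired accuracy tier.
function pickModel({ logistics = false, accuracy = "medium" } = {}) {
  const family = logistics ? "vscan" : "orion"; // vscan models are logistics-specialized
  const size = { low: "small", medium: "medium", high: "large" }[accuracy];
  return `${family}_${size}`;
}
```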

Error Handling

Common Errors

| Status | Code | Description |
| --- | --- | --- |
| 400 | pdf.too_many_pages | PDF exceeds maximum allowed pages (100) |
| 400 | image.invalid | Invalid image format or corrupted file |
| 400 | image.safety_violation | Image content violates safety guidelines |
| 400 | image.no_text | No extractable text found in image |
| 404 | prompt.not_found | Specified prompt_id not found |
| 408 | api.timeout | Request timed out |
| 429 | api.quota_exceeded | API quota exceeded |

Example Error Response

{
  "error": {
    "message": "PDF exceeds maximum allowed pages. Maximum: 100, Found: 150",
    "code": "pdf.too_many_pages",
    "status": 400
  }
}
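
Given this error shape, a client can separate transient failures (worth retrying) from permanent validation errors. A hedged sketch, assuming the codes listed in the table above:

```js
// Decide whether an error response is worth retrying.
// Timeouts and quota errors are transient; validation errors are not.
function shouldRetry(errorBody) {
  const code = errorBody?.error?.code;
  return code === "api.timeout" || code === "api.quota_exceeded";
}
```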