> ## Documentation Index
> Fetch the complete documentation index at: https://docs.deck.co/llms.txt
> Use this file to discover all available pages before exploring further.

# Storage & document extraction

> Capture documents, spreadsheets, images, and other files from external sources, and extract structured data.

<Note>
  Available as an add-on on paid plans.
</Note>

Tasks can produce files like receipts, invoices, statements, reports, images, and spreadsheets. When storage is enabled on a task, Deck captures the files the agent is instructed to collect and makes them available through the API. If extraction is also enabled, Deck parses supported files and returns structured JSON alongside the raw file.

## Enabling storage on a task

Storage is configured when you create or update a task. Set `storage.enabled` to `true` to capture files. Set `storage.extraction` to `true` to also extract structured data from those files.

```json theme={null}
POST /v2/tasks

{
  "name": "Fetch utility bills",
  "agent_id": "agt_a1b2c3d4...",
  "input_schema": {
    "type": "object",
    "properties": {
      "start_date": { "type": "string" },
      "end_date": { "type": "string" }
    }
  },
  "output_schema": {
    "type": "object",
    "properties": {
      "account_number": { "type": "string" },
      "amount_due": { "type": "number" }
    }
  },
  "storage": {
    "enabled": true,
    "extraction": true,
    "extraction_schema": {
      "type": "object",
      "properties": {
        "account_number": { "type": "string" },
        "amount_due": { "type": "number" }
      }
    }
  }
}
```

| Field                          | Type    | Description                                                                                                                             |
| ------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `storage.enabled`              | boolean | Capture files produced during task execution                                                                                            |
| `storage.extraction`           | boolean | Parse captured files and extract structured data                                                                                        |
| `storage.extraction_schema`    | object  | JSON Schema describing the fields to extract. Required when `extraction` is `true`.                                                     |
| `storage.extraction_prompt`    | string  | Optional natural-language guidance for the extraction model. See [Guiding extraction with a prompt](#guiding-extraction-with-a-prompt). |
| `storage.deduplication`        | boolean | Enable deduplication to skip files that match a previous capture. See [Deduplication](#deduplication).                                  |
| `storage.deduplication_schema` | object  | JSON Schema describing the fields used for duplicate detection. Required when `deduplication` is `true`.                                |

<Note>
  Captured files appear in storage, not the output schema. Don't add fields like file names or file counts; they duplicate what storage returns and can drift from what was actually captured. The storage object is the source of truth for captured files and downloads.
</Note>

## Retrieving storage items

Pass `?include=storage` when you fetch a task run and its captured files come back embedded in the response, each with its extracted data and a pre-signed download `url` inline:

```text theme={null}
GET /v2/task-runs/{run_id}?include=storage
```

The embedded array returns up to 100 items per run. To fetch every file, or to fetch a single item by ID, use the dedicated storage endpoints. The list endpoint returns all of a run's files in one response:

```bash theme={null}
curl https://api.deck.co/v2/task-runs/trun_a1b2c3d4/storage \
  -H "Authorization: Bearer sk_live_your_key_here"
```

```json theme={null}
{
  "data": [
    {
      "id": "stor_x1y2z3...",
      "object": "storage",
      "file_name": "statement_jan_2025.pdf",
      "file_type": "application/pdf",
      "file_size": 245678,
      "url": "https://files.deck.co/stor_x1y2z3...?signature=...",
      "metadata": null,
      "extraction": null,
      "purpose": "output",
      "result": "failure",
      "created_at": "2025-01-23T14:30:00Z"
    },
    {
      "id": "stor_a4b5c6...",
      "object": "storage",
      "file_name": "statement_dec_2024.pdf",
      "file_type": "application/pdf",
      "file_size": 198432,
      "url": "https://files.deck.co/stor_a4b5c6...?signature=...",
      "metadata": null,
      "extraction": {
        "company_name": "EnergyLink",
        "account_number": "58291-44720",
        "billing_date": "2024-12-22",
        "amount_due": 6925.18,
        "currency": "USD"
      },
      "purpose": "output",
      "result": "success",
      "created_at": "2025-01-22T09:15:00Z"
    }
  ],
  "has_more": false,
  "request_id": "req_f5g6h7..."
}
```

## Storage item fields

| Field         | Task run (`include=storage`) | List items | Get item |
| ------------- | :--------------------------: | :--------: | :------: |
| `id`          |               ✓              |      ✓     |     ✓    |
| `file_name`   |               ✓              |      ✓     |     ✓    |
| `file_type`   |               ✓              |      ✓     |     ✓    |
| `file_size`   |               ✓              |      ✓     |     ✓    |
| `url`         |               ✓              |      ✓     |     ✓    |
| `extraction`  |               ✓              |      ✓     |     ✓    |
| `result`      |                              |      ✓     |     ✓    |
| `purpose`     |               ✓              |      ✓     |          |
| `metadata`    |                              |      ✓     |     ✓    |
| `task_run_id` |                              |            |     ✓    |
| `created_at`  |               ✓              |      ✓     |     ✓    |

<ResponseField name="id" type="string">
  Unique identifier, prefixed with `stor_`.
</ResponseField>

<ResponseField name="file_name" type="string">
  Original file name as it appeared on the source.
</ResponseField>

<ResponseField name="file_type" type="string">
  MIME type (`application/pdf`, `image/png`, `text/csv`, etc.). See [Supported files](#supported-files) for supported types.
</ResponseField>

<ResponseField name="file_size" type="integer">
  Size in bytes.
</ResponseField>

<ResponseField name="purpose" type="string">
  `attachment` for a file you provide as task input for the agent to use, `extraction` for a file you provide as task input that Deck extracts data from directly (skipping the agent), or `output` for a file the agent captures during the run.
</ResponseField>

<ResponseField name="created_at" type="datetime">
  When the storage item was created.
</ResponseField>

<ResponseField name="extraction" type="object or null">
  Structured data extracted from the file, if extraction is enabled.
</ResponseField>

<ResponseField name="result" type="string">
  Extraction outcome: `failure` if the file's data couldn't be extracted, otherwise `success` (including files with no extraction). Returned by the list and get-by-id endpoints.
</ResponseField>

<ResponseField name="url" type="string">
  Signed download URL.
</ResponseField>

<ResponseField name="metadata" type="object or null">
  A JSON object of additional key-value details Deck records about the file, or `null` when there are none. Which keys appear depends on the file and how it was captured, so treat it as informational and don't depend on specific keys being present. Returned by the list and get-by-id endpoints.
</ResponseField>

<ResponseField name="task_run_id" type="string">
  The task run that produced this storage item.
</ResponseField>

## Downloading files

Both list and get-by-id responses include a pre-signed `url` you can use to download the raw file. URLs expire after an hour; if one does, re-fetch the item to get a fresh one.

```bash theme={null}
curl https://api.deck.co/v2/storage/stor_x1y2z3 \
  -H "Authorization: Bearer sk_live_your_key_here"
```

## Supported files

These rules apply to every file Deck stores: files an agent captures during a run (`purpose: "output"`), files you provide as task input (`attachment`), and files Deck processes directly (`extraction`).

### File types

The following types can be captured and returned through storage. The **Extraction** column shows which can also be parsed into structured JSON when [extraction](#document-extraction) is enabled.

| Type                    | Captured & returned | Extraction |
| ----------------------- | :-----------------: | :--------: |
| PDF                     |          ✓          |      ✓     |
| CSV                     |          ✓          |      ✓     |
| Plain text (`.txt`)     |          ✓          |      ✓     |
| JSON                    |          ✓          |      ✓     |
| Excel (`.xlsx`, `.xls`) |          ✓          |      ✓     |
| PNG, JPEG, WebP, TIFF   |          ✓          |      ✓     |
| GIF, BMP                |          ✓          |            |
| ZIP                     |          ✓          |            |

<Warning>
  When extraction is enabled, every captured file is sent for extraction, including types the extractor can't parse. A captured GIF, BMP, or ZIP fails extraction, which fails the run with an `extraction_failed` error. If a task with extraction enabled might encounter these types, instruct the agent not to collect them.
</Warning>

### File size

Each captured file can be up to **200 MB**. Files you provide as task input have a smaller limit. See [Sending a file](#sending-a-file).

## Providing files as input

Tasks can accept files as input. The file is uploaded to storage and malware-scanned, then either handed to the agent or extracted directly by Deck, depending on the field's `purpose`:

| `purpose`    | What Deck does                                                              | Available on                                             |
| ------------ | --------------------------------------------------------------------------- | -------------------------------------------------------- |
| `attachment` | The agent receives the file at run time and uses it on the source.          | Enterprise plans                                         |
| `extraction` | Deck extracts structured JSON from the file directly. The agent is skipped. | Enterprise plans with the extraction and storage add-ons |

Both purposes share the same field shape and upload behavior, and differ in what happens after the upload. A single task can declare attachment file inputs or extraction file inputs, not both. Running a task whose input schema mixes both is rejected with an `input_invalid` error.

### Defining the field

Define a file field in the input schema. The shape is the same for both purposes; only the `purpose` constant changes.

```json theme={null}
"resume": {
  "type": "object",
  "properties": {
    "purpose": { "const": "attachment" },
    "file_name": { "type": "string" },
    "content_type": { "type": "string" },
    "data": { "type": "string", "contentEncoding": "base64" }
  }
}
```

| Field          | Description                                                                                                 |
| -------------- | ----------------------------------------------------------------------------------------------------------- |
| `purpose`      | Constant `"attachment"` or `"extraction"`. Marks the field as a file input and selects how Deck handles it. |
| `file_name`    | Original file name, e.g. `resume.pdf`.                                                                      |
| `content_type` | MIME type, e.g. `application/pdf`.                                                                          |
| `data`         | The file contents, base64-encoded.                                                                          |

### Sending a file

Provide the file inline as base64 in the task run input:

```json theme={null}
POST /v2/tasks/task_a1b2c3d4.../run

{
  "credential_id": "cred_a1b2c3d4...",
  "input": {
    "applicant_name": "Jordan Lee",
    "resume": {
      "purpose": "attachment",
      "file_name": "resume.pdf",
      "content_type": "application/pdf",
      "data": "JVBERi0xLjQKJ..."
    }
  }
}
```

Each file can be up to **30 MB**. Larger files, or invalid base64, are rejected synchronously with a validation error and the run isn't created. The `content_type` must be one of the [supported types](#supported-files) and match the actual file contents; a file that fails this check fails the run with a `task_failed` error. After upload, every file is malware-scanned asynchronously. The rest of the run lifecycle depends on the purpose.

### Attachments

With `purpose: "attachment"`, the agent receives the file at run time and uses it on the source: uploading it to a portal, attaching it to a form, or referencing it while completing the task.

The run stays `queued` until every attachment passes the scan, then transitions to `running` and dispatches to the agent. If a file is flagged, the run fails with an `attachment_invalid` error on the task run object. Listen for `task_run.failed` or fetch the run to handle it.

Deck replaces the base64 `data` in the stored input with a `storage_id` reference so the raw bytes aren't carried through the run. The file becomes a storage item with `purpose: "attachment"`, alongside the `output` files the agent captures, and appears under the Input tab on the task run in the Console.

### Extraction

With `purpose: "extraction"`, Deck processes the file directly against the task's `extraction_schema` and returns the structured result on the run. The agent doesn't execute. Use this when you have a document and want JSON back, with no source interaction.

The run transitions to `running` as soon as the file is uploaded, and finalizes when extraction completes. It doesn't sit in `queued` waiting on the scan.

The task must have storage and extraction enabled, with an `extraction_schema` defining the result shape:

```json theme={null}
POST /v2/tasks

{
  "name": "Extract utility bill",
  "agent_id": "agt_a1b2c3d4...",
  "input_schema": {
    "type": "object",
    "properties": {
      "bill": {
        "type": "object",
        "properties": {
          "purpose": { "const": "extraction" },
          "file_name": { "type": "string" },
          "content_type": { "type": "string" },
          "data": { "type": "string", "contentEncoding": "base64" }
        }
      }
    },
    "required": ["bill"]
  },
  "storage": {
    "enabled": true,
    "extraction": true,
    "extraction_schema": {
      "type": "object",
      "properties": {
        "vendor_name": { "type": "string" },
        "total_amount": { "type": "number" },
        "invoice_date": { "type": "string", "format": "date" }
      }
    }
  }
}
```

To run it, send the file in the extraction field. The run still needs a `credential_id` or `source_id` to link the extraction to a user or source, even though the agent doesn't execute.

```json theme={null}
POST /v2/tasks/task_a1b2c3d4.../run

{
  "credential_id": "cred_a1b2c3d4...",
  "input": {
    "bill": {
      "purpose": "extraction",
      "file_name": "january-bill.pdf",
      "content_type": "application/pdf",
      "data": "JVBERi0xLjQKJ..."
    }
  }
}
```

A task accepts one extraction input per run. The file becomes a storage item with `purpose: "extraction"` carrying the extracted data, matching the fields defined in `extraction_schema`. See [Document extraction](#document-extraction) for guidance on writing extraction schemas.

### Reusing a file across runs

Once a file is uploaded, you can reference it on a later run instead of sending the bytes again. Pass the `storage_id` in place of `data`, keeping the same `purpose`:

```json theme={null}
"resume": {
  "purpose": "attachment",
  "storage_id": "stor_x1y2z3..."
}
```

The same shape works with `purpose: "extraction"`. Deck verifies the file belongs to your organization, then copies it for the new run so each run keeps its own input files. The copy is malware-scanned like a fresh upload. If the original has been deleted, by you or by retention, the reference is rejected with a `resource_not_found` error.

## Document extraction

When extraction is enabled, Deck parses the captured files and populates the `extraction` field on each storage item with structured JSON.

Extraction is available for **PDF, CSV, Excel (`.xlsx`, `.xls`), plain text, JSON, and images (PNG, JPEG, WebP, TIFF)**. The extracted data depends on the document. A utility bill produces different fields than a hotel receipt.

Extraction is powered by a language model, not rule-based or positional parsing. Deck packages your schema's field names and descriptions with the document and sends them to the model, which matches document content to your fields by meaning rather than exact keywords: a `start_date` field with a clear description will match a label like "Starting date" even though the text differs. Values are the model's best-fit reading of the document, not verbatim key lookups. If the model matches the wrong value, sharpen the field's `description` or correct it with an [extraction prompt](#guiding-extraction-with-a-prompt).

A file appears in storage once Deck finishes processing it. With extraction enabled, that means a file isn't listed until its extraction succeeds or fails; when it appears, its `extraction` data is already populated.

### Custom extraction schemas

Use the `extraction_schema` field on the task's storage config to define exactly what fields you want extracted. Because the model matches fields by name and `description`, describe each field precisely, the same way you would for [deduplication fields](#deduplication): `"Total amount due, including tax"` works better than a bare `total`.

```json theme={null}
{
  "storage": {
    "enabled": true,
    "extraction": true,
    "extraction_schema": {
      "type": "object",
      "properties": {
        "vendor_name": { "type": "string", "description": "Vendor or supplier name" },
        "total_amount": { "type": "number", "description": "Total amount due, including tax" },
        "invoice_date": { "type": "string", "format": "date", "description": "Invoice issue date" },
        "line_items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string" },
              "amount": { "type": "number" }
            }
          }
        }
      }
    }
  }
}
```

Schemas are plain JSON Schema: a root `object` with `properties`, using types `string`, `number`, `integer`, `boolean`, `object`, `array`, and `null`. Composition and reference keywords (`$ref`, `oneOf`, `allOf`, `anyOf`, `not`, `if`/`then`/`else`, `patternProperties`, `additionalProperties`, `definitions`, `$defs`, `$id`, `$schema`) aren't supported and are rejected with a validation error.

### Guiding extraction with a prompt

`extraction_prompt` is optional free-text guidance applied alongside the schema. Set it on the task's storage config to give the extraction model instructions that the schema's field names can't capture on their own: how to disambiguate similar fields, what format to return a value in, where on the document to look, what to assume when a value is missing, or fixes for mistakes you've seen it make.

```json theme={null}
{
  "storage": {
    "enabled": true,
    "extraction": true,
    "extraction_schema": { "type": "object", "properties": { "...": {} } },
    "extraction_prompt": "Amounts are in euros; return them as numbers with no currency symbol. If both a billing date and a due date appear, use the due date for payment_due_date."
  }
}
```

The prompt steers how values are interpreted but doesn't change the output shape, which is defined entirely by `extraction_schema`. It's configured once on the task and applies to every file that task extracts. When omitted, extraction runs on the schema alone.

### Extraction example

A utility bill extraction might return:

```json theme={null}
{
  "company_name": "EnergyLink",
  "account_number": "58291-44720",
  "billing_date": "2025-01-22",
  "billing_period": {
    "start_date": "2024-12-18",
    "end_date": "2025-01-20",
    "total_days": 33
  },
  "amount_due": 8247.41,
  "payment_due_date": "2025-02-14",
  "currency": "USD",
  "service_locations": [
    {
      "service_type": "Fuel",
      "service_address": {
        "street": "4421 OAK VIEW LN UNIT 3A",
        "city": "CAMBRIDGE",
        "state": "MA",
        "postal_code": "02140"
      },
      "total_usage": 5348,
      "total_usage_unit": "Therms",
      "total_charges": 8214.67
    }
  ]
}
```

### Extraction errors

Each storage item reports its extraction outcome in a `result` field: `failure` if the file's data couldn't be extracted, otherwise `success`. Note that `success` also covers files with no extraction, so it doesn't on its own mean data was extracted.

When extraction fails, `result` is `failure` and `extraction` stays `null`, but the raw file is still kept and available for download. Check `result` to tell which files failed.

If any file in a task run fails extraction, the run completes with a `failure` result and an `extraction_failed` error in the `errors` array indicating how many files were affected. Successfully extracted files in the same run still return their `extraction` data. Failed extractions aren't retried; run the task again to re-extract the file.

## Deduplication

Deduplication tells Deck to skip files that match one captured by a previous run for the same task, source, and credential, so recurring tasks only return new documents.

You define a set of fields that uniquely identify a document. Deck reads those fields from each captured file and compares them against prior captures. If every field matches, the new file is skipped: it's not stored, no `storage.created` event fires, and it's not extracted even if extraction is enabled. Each skip is recorded on the task run, so you can always tell a deduplicated file apart from one the agent never captured. See [Seeing skipped files](#seeing-skipped-files).

### Configuration

Set `deduplication` to `true` and provide a `deduplication_schema` on the task's storage config:

```json theme={null}
{
  "storage": {
    "enabled": true,
    "deduplication": true,
    "deduplication_schema": {
      "type": "object",
      "properties": {
        "account_number": {
          "type": "string",
          "description": "The utility account number"
        },
        "billing_period_start": {
          "type": "string",
          "description": "Start date of the billing period (YYYY-MM-DD)"
        }
      }
    }
  }
}
```

| Field                  | Type    | Description                                                                                                           |
| ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- |
| `deduplication`        | boolean | Turn deduplication on for this task                                                                                   |
| `deduplication_schema` | object  | JSON Schema with a `properties` map of `field_name → { type, description }`. Required when `deduplication` is `true`. |

Each property must declare a `type`, one of `string`, `integer`, `number`, or `boolean`. Nested objects and arrays aren't supported, so pick top-level scalar fields.

Property names are arbitrary; you make them up, and they're just keys for the result. The `description` is what tells Deck where to find the value on each document, so describe each field precisely. For example, `"Account number printed at the top of the bill"` works better than a vague `"account"`.

### Choosing fields

The fields you list together form the dedup key. Two files match only if **every** field is identical. String values are compared case-insensitively, ignoring surrounding whitespace. A few rules of thumb:

* **Pick fields that stay stable for the same logical document.** A monthly bill should have the same account number and billing period every time it's fetched.
* **Avoid volatile fields.** File names, fetch dates, and page numbers will produce false negatives, since the same document looks new every time.
* **Pick enough fields to be unique.** A single field like `vendor_name` will collide across unrelated invoices from the same vendor.
* **Two or three fields is usually right.**

### Field combinations by document type

<Tabs>
  <Tab title="Utility bills">
    Account plus billing period:

    ```json theme={null}
    "deduplication_schema": {
      "type": "object",
      "properties": {
        "account_number": { "type": "string", "description": "Utility account number" },
        "billing_period_start": { "type": "string", "description": "Billing period start date (YYYY-MM-DD)" }
      }
    }
    ```
  </Tab>

  <Tab title="Invoices">
    Vendor plus invoice number:

    ```json theme={null}
    "deduplication_schema": {
      "type": "object",
      "properties": {
        "vendor_name": { "type": "string", "description": "Vendor or supplier name" },
        "invoice_number": { "type": "string", "description": "Invoice number printed on the document" }
      }
    }
    ```
  </Tab>

  <Tab title="Receipts">
    Merchant, date, and total, since receipts often lack a unique ID:

    ```json theme={null}
    "deduplication_schema": {
      "type": "object",
      "properties": {
        "merchant_name": { "type": "string", "description": "Merchant or store name" },
        "transaction_date": { "type": "string", "description": "Date of purchase (YYYY-MM-DD)" },
        "total_amount": { "type": "number", "description": "Total amount charged" }
      }
    }
    ```
  </Tab>
</Tabs>

### Seeing skipped files

A skipped file never appears in storage, but the skip itself is recorded on the task run. Pass `include=storage_deduplicated` when you fetch the run to see which files were deduplicated and what they matched:

```text theme={null}
GET /v2/task-runs/{run_id}?include=storage,storage_deduplicated
```

`storage_deduplicated` is a sibling of `storage` in the response. The two flags compose, but neither requires the other; `include=storage` alone never returns `storage_deduplicated`.

```json theme={null}
{
  "id": "trun_x7Kp2mQvR8sT4wYz",
  "object": "task_run",
  "status": "completed",
  "result": "success",
  "storage": [
    {
      "id": "stor_a1b2c3d4e5f6",
      "object": "storage",
      "file_name": "statement_feb_2025.pdf",
      "file_type": "application/pdf",
      "file_size": 245678,
      "url": "https://files.deck.co/stor_a1b2c3d4e5f6?signature=...",
      "extraction": { "account_number": "4417-88", "billing_period_start": "2025-02-01" },
      "purpose": "output",
      "created_at": "2026-07-09T14:29:12Z"
    }
  ],
  "storage_deduplicated": [
    {
      "file_name": "statement_jan_2025.pdf",
      "duplicate_of": "stor_x1y2z3w4v5u6",
      "task_run_id": "trun_8mNp4qRs2tUv6w",
      "created_at": "2026-07-09T14:30:00Z"
    }
  ]
}
```

The field appears only when you request it **and** the run's task has `deduplication` enabled:

* A dedup-enabled run with no skips returns `"storage_deduplicated": []`. The empty array is meaningful: deduplication ran and nothing was skipped.
* A run whose task doesn't use deduplication omits the field entirely, even when requested.

Entries are inline values, not storage items. They have no `id`, no `object` type, and no endpoint of their own; the task run embed is the only place they appear.

<ResponseField name="file_name" type="string">
  Name of the skipped file as the agent captured it.
</ResponseField>

<ResponseField name="duplicate_of" type="string">
  Storage ID of the file it matched, the one already in storage. Fetch `GET /v2/storage/{duplicate_of}` to get the original's file name, extraction, and a fresh signed download URL. The entry doesn't embed the original's `url` or `file_name`: signed URLs expire, and the original can be deleted by retention, so the ID is the reliable reference.
</ResponseField>

<ResponseField name="task_run_id" type="string">
  The task run that originally captured the matched file. This is not the run you're looking at; it's the earlier run that first retrieved the document.
</ResponseField>

<ResponseField name="created_at" type="datetime">
  When the file was skipped.
</ResponseField>

The matched file follows the normal [retention](#retention) schedule. After it's deleted, the entry keeps its `duplicate_of` value but the ID no longer resolves; `task_run_id` still points to the original run. Skips are recorded going forward only: runs from before this field existed have nothing to show.

### Errors

`deduplication_schema` must be present with at least one property whenever `deduplication` is `true`. Starting a task run without it returns:

```
422 Unprocessable Entity
deduplication_schema must be defined before running a task with deduplication enabled.
```

To disable deduplication, set `deduplication` to `false` (or omit it entirely).

## Events

Storage items emit events you can subscribe to through [event destinations](/events/events):

| Event             | When it fires                                          |
| ----------------- | ------------------------------------------------------ |
| `storage.created` | A new file has been captured and is ready for download |

With extraction enabled, the event fires after extraction finishes, so the item's `extraction` data is already available when you fetch it.

## Retention

Retention period varies by plan. All files are deleted after 90 days.