# Data Lake Delta Files



Socotra supports the replication of Data Lake tables to your own data infrastructure via a series of delta files.
Appropriately permissioned API clients can list and retrieve the relevant files via a pair of endpoints optimized for programmatic consumption.

Getting Started with Data Lake Delta Files [#getting-started-with-data-lake-delta-files]

<Callout>
  Delta File generation is not enabled by default. Contact your Socotra representative for onboarding details.
</Callout>

Overview [#overview]

Consuming the Socotra [Data Lake Delta file API](/api/reporting/data-lake-delta-file) involves a two-step process, repeated whenever new files need to be retrieved:

1. Get an index of available files for a given table.
2. Retrieve the necessary individual files.

The delta files are provided in `sql` or `csv` format, containing all requisite upsert statements (for `sql`) or updated records (for `csv`) to replicate table records in the correct format and order.

Delta files in `csv` format follow RFC 4180 formatting, with comma delimiters and standard quote escaping for fields containing special characters. `null` values are represented as empty fields.
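The CSV conventions above can be handled with a standard RFC 4180 parser. The following is a minimal sketch using Python's built-in `csv` module; the function name `parse_delta_csv` is hypothetical, and the sketch makes no assumption about whether a file begins with a header row.

```python
import csv
import io
from typing import List, Optional


def parse_delta_csv(text: str) -> List[List[Optional[str]]]:
    """Parse delta-file CSV text, mapping empty fields to None."""
    # newline="" leaves line endings untouched so the csv module can
    # handle CRLF and quoted embedded newlines itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Note: the csv module cannot tell an unquoted empty field from a
    # quoted empty string (""), so both become None in this sketch.
    return [[None if field == "" else field for field in row] for row in reader]
```

Quoted fields containing commas or doubled quotes are unescaped by the parser, so `1,"Smith, Jane",` yields the three values `"1"`, `"Smith, Jane"`, and `None`.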

In order to ensure a complete and accurate replication, all delta files for a given table's latest schema version must be consumed, and in the order in which they are presented within the index.

While each delta file enumerated in the index array will include metadata related to the generation of that file's contents, it is not recommended to rely on that metadata to derive the correct order of consumption. **The system handles and guarantees this via the ordering of the files in the index array**.
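As an illustration of the ordering rule, a consumer can simply iterate over the index array as returned and never re-sort it. In this sketch, `download` and `apply` are hypothetical callables supplied by the client (for example, a wrapper around the file retrieval endpoint and a database loader); they are not part of the Socotra API.

```python
from typing import Callable, Iterable, List


def consume_in_index_order(
    delta_files: Iterable[dict],
    download: Callable[[str], bytes],
    apply: Callable[[bytes], None],
) -> List[str]:
    """Apply each delta file exactly in the order the index lists it."""
    applied = []
    for entry in delta_files:
        # Deliberately no sorting by generationTime or any other metadata:
        # the position in the index array is the authoritative order.
        content = download(entry["fileName"])
        apply(content)
        applied.append(entry["fileName"])
    return applied
```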

The maximum size of a delta file is 1000 statements (for `sql`) or 1000 rows (for `csv`).

<Callout>
  There may be more delta files available than can be returned in a single index response. See the section on pagination below.
</Callout>

Delta files are generated at most once every two hours following Data Lake updates. If an update occurs within two hours of the previous delta file generation, the system will generate the next set once the interval has elapsed. This two-hour interval is configurable by environment.

Schema Versioning [#schema-versioning]

Since the schema of any source Data Lake table may evolve over time, each table consumed via the Delta File API has a corresponding version number. The version number is a sequentially incrementing integer.

When a source table schema update occurs, a new schema version is automatically made available in the Delta File API, and all historical data is regenerated into the newest version. Historical versions will remain available, but updated delta files will not be generated for them.

For each table's schema version, files containing the requisite `drop` and `create` statements are also provided.

On first consumption of the Delta File API, the `create` statement will be needed. The `drop` and `create` files will be used in sequence when a new table schema version becomes available.
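The decision described above can be sketched as a small helper that compares the schema version the client last loaded with the version reported by the index response. The function name and the `local_version` bookkeeping are assumptions for illustration; only `version`, `createTableFile`, and `dropTableFile` come from the response schema.

```python
from typing import List, Optional


def schema_files_to_apply(local_version: Optional[int], index: dict) -> List[str]:
    """Return the DDL files to run, in order, before loading delta files."""
    if local_version is None:
        # First consumption: only the create statement is needed.
        return [index["createTableFile"]]
    if index["version"] > local_version:
        # A newer schema version exists: drop the old table, then recreate it.
        return [index["dropTableFile"], index["createTableFile"]]
    # Same schema version as before: no DDL required.
    return []
```

Because historical data is regenerated into the newest version, a client that switches versions would then consume that version's delta files from the beginning of its index.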

Pagination [#pagination]

The number of files available for a specific table and schema version may vary based on data volume and growth rate. The Index API response is limited to 100 Delta files per request. If more than 100 files exist, the consuming client must paginate through results.

To ensure complete indexing, use the `lastFile` parameter in the Delta File Index API request to continue retrieving additional files beyond the initial response.
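A pagination loop along these lines might look like the sketch below. It assumes that `lastFile` takes the `fileName` of the final entry in the previous page and that an empty `deltaFiles` array signals the end of the index; neither detail is confirmed here, so verify both against the API reference. `fetch_index` is a hypothetical client-supplied callable that performs the index request.

```python
from typing import Callable, Iterator, Optional


def iter_delta_files(fetch_index: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every index entry across pages, preserving index order."""
    last_file: Optional[str] = None
    while True:
        page = fetch_index(last_file)
        files = page["deltaFiles"]
        if not files:
            # Assumption: an empty page means there is nothing further.
            return
        yield from files
        # Continue after the last file of the page just received.
        last_file = files[-1]["fileName"]
```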

Client Example [#client-example]

A sample client implementation, illustrating how to consume the API programmatically, is available upon request.

File Index API [#file-index-api]

Clients can retrieve an index of the available files for a particular Data Lake table using the <ApiLink name="getMetadata">Fetch List of Delta Files</ApiLink> endpoint.

<ApiSchema name="DeltaFilesGetRequest" />

Sample DeltaFilesGetRequest [#sample-deltafilesgetrequest]

```json
{
    // required
    "tenantLocator": "b6f8aa30-b978-4934-bef3-627XXXXXXXXX",
    "transformationTable": "DataLakeInvoices",

    // optional
    "deltaFileType": "csv",
    "version": 0,

    // optional, mutually exclusive
    // "startTime": 1734542240221,
    "lastFile": "DataLakeInvoices/version_0/2025/March/b6f8aa30-b978-4934-bef3-627b0e6edd88_DataLakeInvoices_1451606100_1741713134734.csv"
}
```

| Request Property    | Type                    | Description                                                                                                    |
| ------------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------- |
| tenantLocator       | UUID                    | Required; locator of the source tenant                                                                         |
| transformationTable | string                  | Required; name of the desired Data Lake table                                                                  |
| deltaFileType       | enum (`sql`, `csv`)     | Optional; defaults to `sql`                                                                                    |
| version             | int                     | Optional; targets a specific schema version, defaults to the latest if omitted                                 |
| startTime           | Unix timestamp (UTC ms) | Optional; returned files all have a `generationTime` later than `startTime`. Mutually exclusive with `lastFile` |
| lastFile            | string                  | Optional; only files after this file in the index are returned. Mutually exclusive with `startTime`            |
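To make the constraints in the table concrete, the following sketch assembles the request properties and rejects combinations the table rules out. The helper name `build_index_request` is hypothetical, and how the properties are transmitted (for example, as a JSON body or as query parameters) is left to the API reference.

```python
from typing import Optional


def build_index_request(
    tenant_locator: str,
    transformation_table: str,
    delta_file_type: Optional[str] = None,
    version: Optional[int] = None,
    start_time: Optional[int] = None,
    last_file: Optional[str] = None,
) -> dict:
    """Build the DeltaFilesGetRequest properties, omitting unset optionals."""
    if start_time is not None and last_file is not None:
        raise ValueError("startTime and lastFile are mutually exclusive")
    if delta_file_type not in (None, "sql", "csv"):
        raise ValueError("deltaFileType must be 'sql' or 'csv'")
    request = {
        "tenantLocator": tenant_locator,
        "transformationTable": transformation_table,
    }
    optional = {
        "deltaFileType": delta_file_type,
        "version": version,
        "startTime": start_time,
        "lastFile": last_file,
    }
    # Compare against None so that a legitimate version of 0 is kept.
    request.update({k: v for k, v in optional.items() if v is not None})
    return request
```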

<ApiSchema name="DeltaFilesGetResponse" />

<ApiSchema name="DeltaFile" />

Sample DeltaFilesGetResponse [#sample-deltafilesgetresponse]

```json
{
    "version": 0,
    "createTableFile": "DataLakeInvoices/version_0/createTable.sql",
    "dropTableFile": "DataLakeInvoices/version_0/dropTable.sql",
    "s3Bucket": "socotra-kernel-develop-dm-delta",
    "deltaFiles": [
        {
            "deltaFileType": "csv",
            "fileName": "DataLakeInvoices/version_0/2025/March/b6f8aa30-b978-4934-bef3-627b0e6edd88_DataLakeInvoices_1451606100_1741713134934.csv",
            "jobStartTime": 1451606100,
            "jobEndTime": 1741582882,
            "generationTime": 1741713134934,
            "recordCount": 1000,
            "md5HashSum": "a1b2c3d4e5f67890abcdef1234567890"
        },
        {
            "deltaFileType": "csv",
            "fileName": "DataLakeInvoices/version_0/2025/March/b6f8aa30-b978-4934-bef3-627b0e6edd88_DataLakeInvoices_1451606100_1741713135163.csv",
            "jobStartTime": 1451606100,
            "jobEndTime": 1741582882,
            "generationTime": 1741713135163,
            "recordCount": 1000,
            "md5HashSum": "8d4a2f9c1e7b3d5a0f6c8e2b4a9d1f7c"
        },
        {
            "deltaFileType": "csv",
            "fileName": "DataLakeInvoices/version_0/2025/March/b6f8aa30-b978-4934-bef3-627b0e6edd88_DataLakeInvoices_1451606100_1741713135490.csv",
            "jobStartTime": 1451606100,
            "jobEndTime": 1741582882,
            "generationTime": 1741713135490,
            "recordCount": 198,
            "md5HashSum": "e3b0c44298fc1c149afbf4c8996fb924"
        }
    ]
}
```

| Response Property | Type            | Description                                                                                                                |
| ----------------- | --------------- | -------------------------------------------------------------------------------------------------------------------------- |
| version           | int             | Schema version of the table that the returned index applies to                                                             |
| createTableFile   | string          | Path and name of the file containing the SQL statement to create the table in the destination schema                       |
| dropTableFile     | string          | Path and name of the file containing the SQL statement to drop the existing version of the table in the destination schema |
| s3Bucket          | string          | The source S3 bucket required for the <ApiLink name="DeltaFileDownloadRequest">file retrieval request</ApiLink>            |
| deltaFiles        | DeltaFile array | The index of individual files, in the order in which they must be consumed                                                 |
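Each `DeltaFile` entry may carry an `md5HashSum`, which a client could use as an integrity check after download. The sketch below assumes the value is the hexadecimal MD5 digest of the raw file bytes; that interpretation is not stated in this guide and should be confirmed before relying on it. The function name is hypothetical.

```python
import hashlib


def matches_index_entry(content: bytes, entry: dict) -> bool:
    """Compare downloaded bytes against the entry's md5HashSum, if present."""
    expected = entry.get("md5HashSum")
    if expected is None:
        # md5HashSum is optional in the schema; nothing to compare against.
        return True
    # Assumption: the hash is a hex digest of the unmodified file bytes.
    return hashlib.md5(content).hexdigest() == expected.lower()
```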

File Retrieval API [#file-retrieval-api]

Clients can download each individual Delta File using the <ApiLink name="download">Fetch Specific Delta File</ApiLink> endpoint. The response is the file content, streamed as a `StreamingResponseBody<string>`.

<ApiSchema name="DeltaFileDownloadRequest" />
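The download request reuses values from the index response: `s3Bucket` from the top level and `fileName` from each `DeltaFile` entry. A minimal sketch of deriving one request per file, in index order, is shown below; the helper name is hypothetical and the actual HTTP call is intentionally left out.

```python
from typing import List


def build_download_requests(tenant_locator: str, index: dict) -> List[dict]:
    """Create one DeltaFileDownloadRequest per index entry, in index order."""
    return [
        {
            "tenantLocator": tenant_locator,
            "s3Bucket": index["s3Bucket"],
            "fileName": entry["fileName"],
        }
        for entry in index["deltaFiles"]
    ]
```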

See Also [#see-also]

* [Data Lake Delta File API Guide](/api/reporting/data-lake-delta-file)


## API Reference

DeltaFilesGetRequest
Properties:
  tenantLocator (uuid, required)
  transformationTable (Enum DataLakeAccountDataExtensions | DataLakeAccounts | DataLakeAffectedTransactions | DataLakeAuxData | DataLakeBillingHolds | DataLakeClaimDataExtensions | DataLakeClaims | DataLakeCreditDistributions | DataLakeCreditItems | DataLakeDelinquencies | DataLakeDelinquencyReferences | DataLakeDiaries | DataLakeDisbursementDataExtensions | DataLakeDisbursements | DataLakeFaTransactionAccountLines | DataLakeFaTransactions | DataLakeFnolDataExtensions | DataLakeFnols | DataLakeInstallmentItems | DataLakeInstallments | DataLakeInstallmentSettings | DataLakeInvoiceItems | DataLakeInvoices | DataLakeLedgerAccountLineItems | DataLakeLedgerAccounts | DataLakeMoratoriumElections | DataLakeMoratoriums | DataLakeMoratoriumStatuses | DataLakePaymentDataExtensions | DataLakePayments | DataLakePolicies | DataLakePolicyAutoRenewals | DataLakePolicyCoverageTerms | DataLakePolicyDataExtensions | DataLakePolicyElementCharges | DataLakePolicyElements | DataLakePolicyElementTree | DataLakePolicyElementUnderwritingFlags | DataLakePolicyPreferences | DataLakePolicySegments | DataLakePolicyStatuses | DataLakePolicyTerms | DataLakePolicyTransactionChangeInstructions | DataLakePolicyTransactions | DataLakeProducerCodeDataExtensions | DataLakeProducerCodes | DataLakeProducerDataExtensions | DataLakeProducerHierarchy | DataLakeProducers | DataLakeQuoteCoverageTerms | DataLakeQuoteDataExtensions | DataLakeQuoteElementCharges | DataLakeQuoteElements | DataLakeQuoteElementTree | DataLakeQuoteElementUnderwritingFlags | DataLakeQuotes | DataLakeTaskReferences | DataLakeTasks | DataLakeUserAssociations | DataLakeUserQualifications | DataLakeWriteOffs, required)
  deltaFileType (Enum sql | csv)
  version (integer)
  startTime (integer)
  lastFile (string)
  dataProcessedThroughTime (integer)

DeltaFilesGetResponse
Properties:
  version (integer, required)
  createTableFile (string, required)
  dropTableFile (string, required)
  s3Bucket (string, required)
  dataProcessedThroughTime (integer, required)
  deltaFiles (DeltaFile[], required)

DeltaFile
Properties:
  deltaFileType (Enum sql | csv, required)
  fileName (string, required)
  jobStartTime (integer, required)
  jobEndTime (integer, required)
  generationTime (integer, required)
  recordCount (integer)
  md5HashSum (string)

DeltaFileDownloadRequest
Properties:
  tenantLocator (uuid, required)
  s3Bucket (string, required)
  fileName (string, required)