Data Lake - Beta

Warning

This documentation is preliminary and is subject to change frequently until the General Availability release of this feature. Please contact your account representative with any questions or if you have feedback.

Socotra’s Data Lake provides a relational view of your book of business in Socotra. You can issue SQL queries across a broad set of tables, including policies, transactions, field values, and charges.

Getting Started with Socotra Data Lake

Prerequisites

  • Socotra-provided credentials and connection details

  • IP registration with Socotra to allow connections from your IP

  • A MariaDB-compatible client, such as MySQL Workbench

Contact your Socotra representative for onboarding details.

Note

Popular analytical tools such as Tableau or Microsoft Power BI can also be used with their respective MariaDB connectors.

Data Lake Schema

Data Lake is enabled at the Business Account level and by default will replicate data for all test and production tenants in the account. Once connected to Data Lake, users will be presented with a schema for each tenant. While the schemas will be identical, their data will be unique to each tenant.

Querying across tenants is supported via UNION statements and examples will be provided in a soon to follow docs release. A later release of Data Lake will offer control over which tenants’ data is replicated into the lake.

Data Lake schema

Index of Tables

A complete index of tables within the Data Lake schema is available here.

Frequently Asked Questions (FAQ)

Can I connect with Tableau, Microsoft Power BI, or other such analytical tools?

Yes, you can connect with any MariaDB/ODBC-compatible client. Both Tableau and Microsoft Power BI publish instructions for connections to MariaDB:


How can I use datetime values in Data Lake?

Data Lake datetimes are stored as MariaDB datetime(6), storing date and time in the format YYYY-MM-DD HH:MM:SS, and specifies the precision for fractional seconds. In this case the (6) allows for storing up to microseconds (1 millionth of a second). For example: 2024-11-30 12:34:56.123456.

For now, all datetimes are expressed in Coordinated Universal Time (UTC). You can use the MariaDB CONVERT_TZ function to handle conversion; for example, to list all policies effective on or after midnight Jan 1st 2024 PST, use the following query:

SELECT * FROM `data_lake_my_tenant_locator`.`policies`
WHERE start_time_utc < CONVERT_TZ('2024-01-00 12:00:00', 'America/Los_Angeles', 'UTC');

Note

Refer to the following MariaDB documentation for details.


What is a locator?

Socotra services rely heavily on locators throughout the system. Locators are expressed as a ULID for identifying, locating, or referencing an object or entity in the system. ULIDs are a type of unique identifier that combines uniqueness, timestamp ordering, and readability into a 26-character alphanumeric string.


What is the relationship between the Data Lake schema and entity representations in the Socotra API?

While we aim for consistency in data representations, you should expect to see variations between these two systems that reflect different purposes. In comparing Data Lake’s policy and the API’s PolicyResponse object, for example, you can see that they expose many of the same attributes – locator, product name, timestamps, etc. – but that the naming conventions differ slightly. Two significant commonalities between the API and Data Lake include (1) consistent presence of locator identifiers for referential relationships; and (2) a relatively small set of data types for the full range of fields. Not all Data Lake tables will have a corresponding analogue in the API data model, and some tables will have fields not seen in a related API representation, often introduced to facilitate simplified SQL queries against the Data Lake schema.


Why do some tables have an id field while others do not?

id fields help to uniquely identify rows in Data Lake in the absence of any other unique field, aiding the processes that insert and update Data Lake entries. They are distinct from locators in that they are (1) not intended to help link to other entities, and (2) not necessarily surfaced by or consistent with identifiers provided by the Socotra API. A locator value from Data Lake will be identical to the value obtained via API or exposed in the UI. Not all entities in Data Lake will have a locator.


What version of MariaDB is Data Lake on?

Version 10.11.x; you can check the specific version at any time with the SELECT VERSION(); query.


How often is Data Lake data updated?

Data Lake is continuously loaded with new data from the corresponding tenant, usually within minutes and for the majority of customers, at most 2 hours.

In scenarios involving extremely large volumes or complex data structures, such as bulk ingestion of data via Migration APIs or other large scale use cases, data loads may exceed 2 hours.


Are there plans for alternative modes of access to Data Lake data?

Yes - in a subsequent release, we plan to support diff files in .sql format that will facilitate the direct load of Data Lake data into your own data infrastructure, removing the need to consume data from the hosted MariaDB instance.

Current estimated delivery for this feature is at the end of Q1 2025. Please contact your Socotra account representative for more details.