Developer Documentation
# Let's Data : Focus on the data - we'll manage the infrastructure!
Cloud infrastructure that simplifies how you process, analyze and transform data.
## Datasets
### Overview
A dataset is a collection of data tasks grouped together as a logical entity - essentially a data job that the user needs to run. A dataset has a task for each work item in the dataset.
For example, a user may want to map reduce all the files in an S3 bucket to process data. They would define this work - map reduce all the files - as a dataset in #LetsData. Once this dataset is created, #LetsData creates a data task for each file in the S3 bucket (map reduce each file to process data). These data tasks, grouped together, are collectively called a dataset.
The dataset name uniquely identifies these aggregated data tasks - it needs to be unique and alphanumeric (characters such as _, . and | are not allowed). In the map reduce example above, the dataset could be named ClickstreamsMapReduceJune2022.
### User Configuration
Datasets require a few different pieces of configuration to be able to create and successfully run data tasks. A dataset has the following configuration components (a sketch of how these fit together follows the list):
- The Dataset Name: A unique name for the dataset
- The Region: The AWS region for the dataset. LetsData supported AWS regions: [us-east-1, us-east-2, us-west-2, eu-west-1, ap-south-1, ap-northeast-1]
- The Read Connector: A read connector configuration defines where the data that needs to be read by the tasks is located and how it would be read.
- The Write Connector: A write connector configuration defines where the data is going to be written to.
- The Error Connector: An error connector configuration defines where the error records in the data processing will be archived.
- The Compute Engine: The compute engine configuration defines the infrastructure that would be used to run these data tasks.
- The Manifest File: A manifest file configuration that defines the read connector data for each individual task - for example, if the read connector is an S3 bucket, the manifest file specifies which files in the bucket are to be read by the read connector. Each line in the manifest file definition becomes a data task in #LetsData.
- The Access Grant Role: The ARN of the Access Grant IAM Role - the role that grants permissions to read from the read connector destination, write to the write connector and error connector destinations, and access the manifest file and any additional artifacts required by the dataset.
- The Customer Account For Let's Data Resource Access: The AWS account id of an account that should be granted access to the resources that this dataset will create. For example, if #Let's Data is creating the write connector's Kinesis stream and the error connector's S3 bucket, then #Let's Data will give this AWS account id access to these resources.
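A rough sketch of how these configuration components fit together is shown below. The JSON field names and nesting are illustrative assumptions for readability, not the exact #Let's Data schema:

```json
{
    "datasetName": "<unique, alphanumeric dataset name>",
    "region": "<a supported AWS region, e.g. us-east-1>",
    "readConnector": "<where the data to be read is located and how it is read>",
    "writeConnector": "<where the output data is written>",
    "errorConnector": "<where error records are archived>",
    "computeEngine": "<the infrastructure used to run the data tasks>",
    "manifestFile": "<the read connector data for each individual task>",
    "accessGrantRoleArn": "<ARN of the Access Grant IAM Role>",
    "customerAccountForLetsDataResourceAccess": "<AWS account id to grant access to created resources>"
}
```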
### System Configuration & Dataset Lifecycle
The user defines these component configurations to define a dataset - this allows #Let's Data to run the tasks that process the dataset (read, compute and write).
While the user definition completely specifies the dataset for the purposes of successful execution by #Let's Data, the system also appends some additional data structures to the dataset for internal housekeeping. Most of these aren't terribly useful for customers; however, the following are worth mentioning to explain dataset processing.
Dataset Id: The system defines a unique id for each dataset called the datasetId - this is what is used to identify the dataset in #Let's Data. The dataset name, while unique, can be re-used by different users / tenants; a dataset id is unique across the system.
Dataset Status: #Let's Data defines a status for each dataset which represents where the dataset is in its lifecycle. The status can have the following values:
- CREATED - The dataset has been created by the user but the system has not started processing it yet.
- INITIALIZING - The dataset has been picked up by #Let's Data and the system is initializing the resources needed for the processing of the dataset.
- PROCESSING - The dataset has been fully initialized and is now being processed by #Let's Data. This essentially means that the individual tasks defined in the manifest are now being executed by the system.
- COMPLETED - The dataset processing has completed and all tasks have completed successfully. There may still be individual error records archived by the tasks, but task execution has completed reading the read destination and has sent the error and write records to the error and write destinations.
- ERRORED - The dataset processing has completed and all tasks have completed, but at least one task has errored, i.e. that task's read destination (for example, an S3 file) was not completely processed or an exception occurred during processing.
- DESCALED - The dataset has completed processing (success or error) and the user / cost management service has decided to descale (different from reclaiming) the resources that were allocated for the dataset. For example, provisioned throughputs are decreased, lambda concurrency is reclaimed etc. Though not supported yet, a dataset in this state can have its resources re-hydrated to rerun tasks if needed.
- FROZEN - The dataset has completed processing and the user / cost management service has decided to reclaim the resources that were allocated for the dataset. For example, internal queues are deleted, processing tables are deleted and any non user-data infrastructure is reclaimed. User data in the write and error destinations is still available though - this means that dataset consumers can continue processing from a frozen dataset. A frozen dataset cannot be re-hydrated.
- DELETED - The user has decided to delete the dataset - all resources are reclaimed. Zombie records are kept to disallow recreation and to aid delayed processes such as billing.
- Transient Statuses - Users can request different updates to a dataset, such as updating the implementation JAR, updating compute configuration, redriving error tasks, descaling / freezing / deleting the dataset, and stopping the dataset's execution. Processing such a request moves the dataset into a transient state and then back to one of the above states. Concurrent actions on a dataset in these transient states are not allowed. These transient states are mostly self explanatory and are as follows:
- UPDATING - User has submitted a dataset update request such as update the dataset implementation JAR or update compute engine parameters such as memory, timeouts and concurrency. The dataset's update request is being processed.
- REDRIVING - User has submitted a request to REDRIVE the error tasks (after maybe fixing the errors). The system is preparing the resources for re-executing the error tasks.
- STOPPING_ERROR / STOPPING_COMPLETE - User has submitted a request to STOP the dataset tasks. The system is stopping the running tasks and will transition the dataset to the ERRORED / COMPLETED status respectively.
- DESCALING - User has submitted a request to DESCALE the dataset. The dataset's descale request is being processed.
- FREEZING - User has submitted a request to FREEZE the dataset. The dataset's freeze request is being processed.
- DELETING - User has submitted a request to DELETE the dataset. The dataset's delete request is being processed.
This translates to the following Dataset and Task Lifecycle:
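As a rough, non-authoritative sketch inferred from the status descriptions above (the actual lifecycle may include additional transitions), the dataset status flow can be pictured as a map of status to possible next statuses:

```json
{
    "CREATED":           ["INITIALIZING"],
    "INITIALIZING":      ["PROCESSING"],
    "PROCESSING":        ["STOPPING_COMPLETE", "STOPPING_ERROR", "COMPLETED", "ERRORED"],
    "STOPPING_COMPLETE": ["COMPLETED"],
    "STOPPING_ERROR":    ["ERRORED"],
    "COMPLETED":         ["UPDATING", "DESCALING", "FREEZING", "DELETING"],
    "ERRORED":           ["UPDATING", "REDRIVING", "DESCALING", "FREEZING", "DELETING"],
    "REDRIVING":         ["PROCESSING"],
    "UPDATING":          ["PROCESSING", "COMPLETED", "ERRORED"],
    "DESCALING":         ["DESCALED"],
    "DESCALED":          ["FREEZING", "DELETING"],
    "FREEZING":          ["FROZEN"],
    "FROZEN":            ["DELETING"],
    "DELETING":          ["DELETED"]
}
```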
Progress: In addition to the dataset's status, the system also records aggregate task progress (total tasks, completed tasks and errored tasks) in the dataset for a quick determination of the dataset's progress. These counts are eventually consistent and convergently correct.
Execution Logs: While a dataset is executed at least once when it is created, it can be re-executed multiple times when error tasks are redriven. It may be important to know the start and end datetime for each time the dataset is executed. These are recorded in the Dataset Execution Logs structure so that users can know when the dataset was run and correlate with external systems as needed.
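As an illustration, these two structures might look roughly like the following (the field names are assumptions, not the exact #Let's Data schema):

```json
{
    "progress": {
        "totalTasks": 120,
        "completedTasks": 118,
        "erroredTasks": 2
    },
    "executionLogs": [
        { "startDatetime": "2022-06-01T10:15:00Z", "endDatetime": "2022-06-01T11:42:00Z" },
        { "startDatetime": "2022-06-02T09:00:00Z", "endDatetime": "2022-06-02T09:20:00Z" }
    ]
}
```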
### Dataset Initialization Workflow
When a dataset is created it goes through an initialization workflow where internal databases are set up, lambda functions are created, internal queues are initialized, tasks are scheduled and task monitoring is put in place. Users can view the progress of the initialization workflow by calling the datasets view CLI command / API. The initialization workflow lists the following steps as a finite sequence:
Step Name | Details |
---|---|
Create Dataset IAM Execution Role | Creates the execution role for the dataset. All accesses and processing for a dataset's execution are scoped to this permissions role. See our docs for details around how #LetsData secures dataset execution |
Create Dataset's Sagemaker Execution Roles | Creates execution roles for Sagemaker Serverless / Provisioned endpoints if needed. |
Implementation Jar S3 Artifact - Create Internal Copy | Copies the tenant's implementation jar for the dataset to an internal, secure, tenant code implementation bucket. |
Create Dataset Source Code Branch | Forks the #LetsData code to a dataset branch to have a snapshot in time for the code. Runs and reruns will use this same code snapshot. |
Create Dataset Code Config | Adds configuration for the tenant's implementation jar in code repositories. |
Create Dataset Code Build Project | Creates a Code Project for the Dataset's code builds. |
Create Dataset Code Builds | Builds the #LetsData code with the tenant code jar included. Any unimplemented interfaces, or interface mismatch or any compile errors etc would be flagged in this step. |
Write Connector - Create Kinesis Stream | Create the WriteConnector Kinesis stream if required. (ConnectorDestination: Kinesis, ResourceLocation: #LetsData) |
Write Connector - Create Dynamo DB Table | Create the WriteConnector DynamoDB table if required. (ConnectorDestination: DynamoDB, ResourceLocation: #LetsData) |
Write Connector - Create S3 Bucket | Create the WriteConnector S3 bucket if required. (ConnectorDestination: S3, ResourceLocation: #LetsData) |
Write Connector - Create SQS Queue | Create the WriteConnector SQS queue if required. (ConnectorDestination: SQS, ResourceLocation: #LetsData) |
Write Connector - Create Momento | Create the WriteConnector Momento resource if required. (ConnectorDestination: Momento, ResourceLocation: #LetsData) |
Write Connector - Create Kafka Cluster (1/2) | Create the resources required for Kafka Cluster - example IP Address Management, VPC Networking etc if required. (ConnectorDestination: Kafka, ResourceLocation: #LetsData) |
Write Connector - Create Kafka Cluster (2/2) | Create the WriteConnector Kafka cluster if required. (ConnectorDestination: Kafka, ResourceLocation: #LetsData) |
Error Connector - Create Error Bucket | Create the ErrorConnector S3 Bucket if required. (ConnectorDestination: S3, ResourceLocation: #LetsData) |
Create SageMaker Model | Create the Sagemaker model if required. |
Create SageMaker Endpoint | Create the Sagemaker endpoint if required. |
Create Internal Task Database | Create the Internal Task Database that would be used for task management. Each dataset gets a dedicated task database. |
Create Internal Queues | Create the Internal Queues for component communication. Each dataset gets dedicated queues for communication between components. |
Complete Dataset Code Builds | Wait for the completion of the code build started earlier. |
Compute Engine - Create Data Task Lambda Function (1/4) | Create the Data Task Lambda Function permissions and roles |
Compute Engine - Create Data Task Lambda Function (2/4) | Create the Data Task Lambda Function using the built code |
Compute Engine - Create Data Task Lambda Function (3/4) | Configure the Data Task Lambda Function |
Compute Engine - Create Data Task Lambda Function (4/4) | Finalize the Data Task Lambda Function setup |
Compute Engine - Task State Monitor Process (1/3) | Setup Task Monitoring Process IAM permissions and IAM roles |
Compute Engine - Task State Monitor Process (2/3) | Install and start the Task Monitoring Process code |
Compute Engine - Task State Monitor Process (3/3) | Finalize the Task Monitoring Process setup |
Write Connector - Wait For Create Kafka Cluster Completion | Wait for Kafka Cluster creation to be completed if required. |
Compute Engine - Wait For Create Sagemaker Endpoint Completion | Wait for Sagemaker Endpoint creation to be completed if required |
Secure Dataset IAM Execution Role | The dataset execution role is updated with the ARNs of any resources that were dynamically generated as part of the dataset's creation. |
Grant Customer Account Access | Grant the customer account access to any write and error connector destinations (and any other resources) that they need access to. |
Create Tasks (1/2) | Read the manifest file from the dataset configuration and create tasks. |
Create Tasks (2/2) | Kickoff the data tasks for the created tasks. |
Finalize Initialization | Final round of sanity checks and move the dataset from the INITIALIZING state to the PROCESSING state. |
Some steps may not be relevant to your dataset, in which case they are omitted from the step list - for example, a Write Connector Kinesis dataset will not list the Write Connector steps for the other write destinations.
You can use the CLI datasets view command `$ > letsdata datasets view --datasetName <datasetName> --prettyPrint` to view the initialization details of a dataset. The workflow details are a map of the step name and the step's latest result. Steps that have not been started show up as empty maps. You may also see some steps in the WAIT or ERROR status that on a later invocation show as COMPLETED; this is fine since the steps are retried until complete and some transient timeout failures are expected while waiting for resources to be initialized. Here is an abbreviated example JSON initializationWorkflow:
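(The step names below come from the table above; the result fields inside each step are illustrative assumptions, not the exact #Let's Data output.)

```json
{
    "initializationWorkflow": {
        "Create Dataset IAM Execution Role": { "status": "COMPLETED" },
        "Create Dataset Code Builds": { "status": "COMPLETED" },
        "Write Connector - Create Kinesis Stream": { "status": "WAIT" },
        "Compute Engine - Create Data Task Lambda Function (1/4)": { "status": "ERROR" },
        "Create Tasks (1/2)": { },
        "Finalize Initialization": { }
    }
}
```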
### Dataset Redrive Workflow
When a dataset is redriven it goes through a redrive workflow where the tasks are reinitialized. Users can view the progress of the redrive workflow by calling the datasets view CLI command / API. The redrive workflow lists the following steps as a finite sequence:
Step Name | Details |
---|---|
Initialize | Initializes the dataset for redriving. |
Compute Redrive Task List | Computes the tasklist of tasks that are being requested for redrive. |
Update Task State | Updates the task state, checkpoints (if needed) so that task can be redriven. |
Enqueue Tasks For Processing | Enqueue the tasks for processing by the Data Task Function. |
Finalize Redrive Request | Finalize the redrive request. |
You can use the CLI datasets view command `$ > letsdata datasets view --datasetName <datasetName> --prettyPrint` to view all the redrive workflows for a dataset. The workflow details are a map of the workflow's step name and the step's latest result. Steps that have not been started show up as empty maps. You may also see some steps in the ERROR status that on a later invocation show as COMPLETED; this is fine since the steps are retried until complete and some transient timeout failures are expected while waiting for resources to be initialized. Here is an abbreviated example JSON redriveWorkflow list:
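(The step names below come from the table above; the result fields inside each step are illustrative assumptions, not the exact #Let's Data output.)

```json
{
    "redriveWorkflows": [
        {
            "Initialize": { "status": "COMPLETED" },
            "Compute Redrive Task List": { "status": "COMPLETED" },
            "Update Task State": { "status": "ERROR" },
            "Enqueue Tasks For Processing": { },
            "Finalize Redrive Request": { }
        }
    ]
}
```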
See additional details about task redrives and the different configurations in the Error Handling docs: Redriving Tasks Docs.
### Descale / Freeze / Delete Workflows
The Descale / Freeze / Delete workflows are implemented as workflows, but their progress is not yet captured as a JSON step list (unlike the initialization and redrive workflows). You can see the details about the dataset Descale / Freeze / Delete workflows in the Cost Management Docs.
### Update Workflows
Customers can currently update datasets in the following ways:
- Update the Implementation Jar: Updates the dataset's tenant implementation code jar. See `$ > letsdata datasets code help` or the CLI Update Dataset Code Docs for additional details.
- Update the Compute Engine Lambda Configuration: Updates the dataset compute engine's lambda configuration such as concurrency, timeoutInSeconds, memoryLimitInMegabytes and logLevel (see the sketch after this list). See `$ > letsdata datasets compute help` or the CLI Update Dataset Compute Docs for additional details.
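As an illustration, the compute engine lambda parameters mentioned above could be sketched as the following JSON fragment (the nesting and the exact shape expected by the CLI / API are assumptions):

```json
{
    "computeEngine": {
        "concurrency": 25,
        "timeoutInSeconds": 900,
        "memoryLimitInMegabytes": 3072,
        "logLevel": "INFO"
    }
}
```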
### Schema & Example
Here is a representation of a dataset with high level component configuration and an example dataset:
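The sketch below is illustrative only - the field names and example values are assumptions pieced together from the configuration components and system attributes described earlier, not the exact #Let's Data schema:

```json
{
    "datasetName": "ClickstreamsMapReduceJune2022",
    "datasetId": "<system generated unique id>",
    "datasetStatus": "PROCESSING",
    "region": "us-east-1",
    "readConnector": { "connectorDestination": "S3", "bucketName": "<read bucket name>" },
    "writeConnector": { "connectorDestination": "Kinesis", "resourceLocation": "LetsData" },
    "errorConnector": { "connectorDestination": "S3", "resourceLocation": "LetsData" },
    "computeEngine": { "computeEngineType": "Lambda", "concurrency": 10, "memoryLimitInMegabytes": 2048, "timeoutInSeconds": 900, "logLevel": "INFO" },
    "manifestFile": "<the files in the read bucket, one data task per line>",
    "accessGrantRoleArn": "arn:aws:iam::<accountId>:role/<accessGrantRole>",
    "customerAccountForLetsDataResourceAccess": "<AWS account id>",
    "progress": { "totalTasks": 120, "completedTasks": 85, "erroredTasks": 2 },
    "executionLogs": [ { "startDatetime": "2022-06-01T10:15:00Z", "endDatetime": "2022-06-01T11:42:00Z" } ],
    "initializationWorkflow": { },
    "redriveWorkflows": [ ]
}
```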
### Creating, Listing and Viewing Datasets
Here is how one can create a dataset in #Let's Data: