
Developer Documentation

# Let's Data: Focus on the data - we'll manage the infrastructure!

Cloud infrastructure that simplifies how you process, analyze and transform data.

Datasets

Datasets are collections of data tasks grouped together as a logical entity; they can also be thought of as the data jobs that the user needs to run. A dataset has a task for each work item in the dataset.

For example, a user may want to map reduce all the files in an S3 bucket to process data. They will define this work - map reduce all the files - as a dataset in #LetsData. Once this dataset is created, #LetsData will create a data task for each file in the S3 bucket (map reduce each file to process data). These data tasks grouped together are collectively called a dataset.

The dataset name uniquely identifies these aggregated data tasks and needs to be unique and alphanumeric (characters such as _ . | etc. are not allowed). In the map reduce example above, the dataset could be named ClickstreamsMapReduceJune2022.
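
As a quick illustration of this naming rule, a client-side check might look like the sketch below. This is only an illustration of the rule described above; the actual server-side validation in #Let's Data may enforce additional constraints (for example, length limits).

```python
# Minimal sketch of the dataset naming rule: letters and digits only, so characters
# such as _ . | are rejected. Additional server-side constraints are assumed possible.
import re

def is_valid_dataset_name(name: str) -> bool:
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None

print(is_valid_dataset_name("ClickstreamsMapReduceJune2022"))    # True
print(is_valid_dataset_name("Clickstreams_MapReduce.June|2022")) # False
```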

Tasks

Tasks are defined as:

  • the system's representation of a work item (a unit of work for the dataset), with a 1-1 mapping to the manifest's work definition.
  • tasks are executed on compute engines (e.g. AWS Lambda) to read the data from the read destination, call the user data handlers and write to the write destination.
  • during task execution, a task emits metrics and logs, and creates error records and usage records for the different resources it uses. Each of these (metrics, logs, errors, usage records) is treated as a separate resource in #Let's Data, and these resources tie back to the dataset / task (datasetName / taskId); see the sketch after this list.
  • each dataset has a number of tasks. The dataset's manifest file defines the amount of work for a dataset - each work item in the manifest becomes a task in #Let's Data.
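
To show how these pieces relate, here is a minimal, hypothetical sketch of a task and its associated resources. The class and field names (Task, TaskResources, work_item, etc.) are illustrative assumptions, not the actual #Let's Data task schema:

```python
# Hypothetical sketch only: illustrates how metrics, logs, error records and usage
# records tie back to a task via (datasetName, taskId). Not the real #Let's Data schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskResources:
    metrics: List[str] = field(default_factory=list)        # metrics emitted during execution
    logs: List[str] = field(default_factory=list)            # execution logs
    error_records: List[str] = field(default_factory=list)   # archived errors
    usage_records: List[str] = field(default_factory=list)   # resource usage records

@dataclass
class Task:
    dataset_name: str   # the dataset this task belongs to
    task_id: str        # identifies the task within the dataset
    work_item: str      # 1-1 with a work item in the manifest (e.g. an S3 file)
    resources: TaskResources = field(default_factory=TaskResources)
```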

Data Destinations

Data destinations are the different data stores that data can be read from and written to. For example, S3 is the read destination when data is read from S3, Kinesis is the write destination when data is written to Kinesis, and S3 is the error destination when error data is written to S3.

#Let's Data currently supports the following data destinations (and the list is growing):

  • S3 (Read, Write, Error)
  • SQS (Read, Write)
  • DynamoDB (Write)
  • Kinesis (Write)
  • Kafka (Write)

Connectors

Connectors are logical connections to different data destinations.
  • Read Connector: A read connector is created when data needs to be read from a destination.
  • Write Connector: A write connector is created when data needs to be written to a destination.
  • Error Connector: An error connector is a special write connector used to archive the errors from the dataset's tasks.

Manifest File

The manifest file defines the amount of work for a dataset - how the read data source and the reader type map to #Let's Data tasks, plus any additional details around the read destination's execution environment.
In simple words, manifest files define the source data that the dataset will read (for example, the S3 files that should be read) and its mapping to logical task units (for example, the S3 fileType to file mappings for the dataset's tasks).

  • S3 Read Connector: the S3 Read Connector manifest file would define:
    • what files in the S3 bucket need to be read as part of this dataset
    • each file's filetype
    • the mapping between filetypes (for example, metadata_file1.gz maps to data_file1.gz, metadata_file2.gz maps to data_file2.gz etc.). In this example, we create a manifest file that specifies the file types (metadata, data), the individual files that need to be read and their mappings (metadata_file1.gz -> data_file1.gz), (metadata_file2.gz -> data_file2.gz). This manifest file becomes the complete list of data in the dataset that the read connector will process, and each line in the manifest file becomes a task (see the sketch after this list).
  • SQS Read Connector: the SQS Read Connector manifest file would define:
    • # of concurrent tasks to run: this is currently set internally by default to the computeEngine Lambda concurrency.
    • Task stop conditions: user-defined criteria to determine the end of the queue so that the SQS queue reader tasks can complete.
  • Kinesis Read Connector: the Kinesis Read Connector creates a task for each available shard. In addition, the Kinesis Read Connector manifest file would define:
    • Task start from conditions: the user can specify whether to start reading the stream from the Earliest record or from the Latest record.
    • Task stop conditions: user-defined criteria to determine the end of the stream so that the Kinesis stream reader tasks can complete.
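
To make the S3 example concrete, the sketch below builds a manifest in a hypothetical line-per-task format. The bucket name, the fileType labels and the line syntax are assumptions for illustration only; the actual #Let's Data manifest syntax is defined in the manifest documentation.

```python
# Hypothetical S3 read connector manifest: one line per task, mapping a metadata file
# to its data file. The line format and bucket name are placeholders for illustration.
file_mappings = [
    ("metadata_file1.gz", "data_file1.gz"),
    ("metadata_file2.gz", "data_file2.gz"),
]

manifest_lines = [
    f"metadata:s3://my-bucket/{meta}|data:s3://my-bucket/{data}"
    for meta, data in file_mappings
]

# Each line in the manifest becomes one #LetsData task.
with open("manifest.txt", "w") as manifest_file:
    manifest_file.write("\n".join(manifest_lines))
```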

Compute Engine

The dataset compute engine is the compute infrastructure that is used to process the dataset. #LetsData currently supports the AWS Lambda, AWS Lambda & Sagemaker, and Spark compute engines, and we intend to support EC2, ECS and Kubernetes compute engines as well. Here are the currently supported compute engines:
  • AWS Lambda Compute Engine: The user's code is packaged as a Data Task Lambda Function and #LetsData manages the task creation, execution and monitoring.
  • AWS Lambda & Sagemaker Compute Engine: The user's code is packaged as a Data Task Lambda Function and #LetsData manages the task creation, execution and monitoring. Additionally, AI / ML inferences are run through a Sagemaker endpoint. #LetsData can create and manage these Sagemaker endpoints or re-use existing Sagemaker endpoints (a sketch of a Sagemaker inference call follows this section).
  • Spark Compute Engine: The Spark Compute Engine runs the user's Spark code on AWS Lambda and offers infrastructure simplifications such as removing the need for clusters (no cluster provisioning, no cluster management, no cluster scaling issues). It adds infinite elastic scaling (no need for application scheduling, applications run as soon as they are created) and a layer of file level progress / task management that is consistent with #LetsData datasets. Your Spark code will just work out of the box - no jar issues, classpath problems or elaborate session and cluster configurations.
#LetsData creates the selected compute engines for data processing and manages the infrastructure as needed until the data processing is complete. The user does not need to specify any additional infrastructure details or write any code to manage the compute infrastructure. This is what we promise - "Focus on the data and we'll manage the infrastructure."
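
For context on the Sagemaker option, the sketch below shows what a plain inference call against a Sagemaker endpoint looks like with boto3. The endpoint name and payload are placeholders; when running inside a #LetsData dataset, the endpoint creation and invocation are managed for you rather than hand-coded like this.

```python
# Rough sketch of calling a Sagemaker endpoint for inference with boto3.
# The endpoint name and JSON payload are placeholders for illustration.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-inference-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"text": "example document to run inference on"}),
)

prediction = json.loads(response["Body"].read())
print(prediction)
```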

Resource Location

In terms of access, the different resources (S3 buckets, DynamoDB tables, SQS queues etc.) that are read from, written to, and managed by #Let's Data can be divided into two groups: 1./ Customer: resources that are located in external AWS accounts, and 2./ LetsData: resources that are located in the #Let's Data AWS account.

  • Customer: Resources that are not located in the #Let's Data AWS account but are used in dataset processing can either be public or have access limited by the owner. In the latter case, #Let's Data requires that the owner adds #Let's Data to the access lists.
  • Let's Data: Resources that are located in the #Let's Data AWS account are managed completely by #Let's Data - we'll grant the customer account access to these resources to read, write and manage them.

Regardless of the resource location, #Let's Data adheres to the strictest software security principles. The code follows the principle of least privilege, runs in the context of the dataset's user and is granted access only to the resources that it needs.

Access Grants

Resources (S3 buckets, DynamoDB tables, SQS queues etc.) that are external to #Let's Data (in the customer's AWS accounts - ResourceLocation: Customer) and are used by the dataset can either be public or have access limited by the owner.

When access is limited, #Let's Data requires that the resource owner create an IAM role with an IAM policy that grants the #Let's Data AWS AccountId access to these resources. The IAM role ARN is specified as the accessGrantedRoleArn attribute when creating the dataset.
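
As a rough sketch of such an access grant, the boto3 snippet below creates a role that a #Let's Data account could assume, with read access to a source S3 bucket. The account id, role name, policy name and bucket name are all placeholders; use the actual #Let's Data account id and the policy actions your dataset needs.

```python
# Sketch: create an IAM role the #Let's Data AWS account can assume, with S3 read access.
# LETS_DATA_ACCOUNT_ID, the role name, policy name and bucket name are placeholders.
import json
import boto3

LETS_DATA_ACCOUNT_ID = "111111111111"  # placeholder; use the real #Let's Data account id

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{LETS_DATA_ACCOUNT_ID}:root"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="LetsDataAccessGrantRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::my-read-bucket", "arn:aws:s3:::my-read-bucket/*"],
    }],
}

iam.put_role_policy(
    RoleName="LetsDataAccessGrantRole",
    PolicyName="LetsDataS3ReadPolicy",
    PolicyDocument=json.dumps(read_policy),
)

# This ARN is the value to specify as the accessGrantedRoleArn attribute on the dataset.
print(role["Role"]["Arn"])
```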

Customer Account For Access

Resources (S3 buckets, DynamoDB tables, SQS queues etc.) that are located in the #Let's Data AWS account are managed completely by #Let's Data. These are created, read from, written to, managed and deleted by #Let's Data as part of the dataset processing. The customer will need access to these resources to read / write data, manage resources etc.

#Let's Data grants access to the customer's AWS Account Id specified in the customerAccountForAccess attribute in the dataset.

Customers can use the IAM role created by #Let's Data and specified in the customerAccessRoleArn attribute in the dataset to access this data. Detailed sample code on data access for each data destination is available in the docs.
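
As a minimal sketch of that access pattern, the snippet below assumes the provided role and reads an object from an S3 bucket in the #Let's Data account. The role ARN, bucket and key are placeholders; refer to the per-destination samples in the docs for the exact resources your dataset exposes.

```python
# Sketch: assume the #Let's Data provided role (customerAccessRoleArn) and read from S3.
# The role ARN, bucket name and object key are placeholders for illustration.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/CustomerAccessRole",  # placeholder ARN
    RoleSessionName="letsdata-data-access",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

obj = s3.get_object(Bucket="letsdata-output-bucket", Key="output/part-0000.gz")  # placeholders
data = obj["Body"].read()
```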

VPCs

Connector destinations such as AWS Kafka require setting up a Virtual Private Cloud (VPC). #Let's Data automatically creates and secures the VPC at write connector initialization and deletes the VPC when the write connector is deleted. #LetsData provides self-service infrastructure to enable connectivity to these VPCs.