Developer Documentation

# Let's Data : Focus on the data - we'll manage the infrastructure!

Cloud infrastructure that simplifies how you process, analyze and transform data.

Compute Engine: Overview

The Dataset Compute Engine is the compute infrastructure used to process the dataset. #LetsData currently supports the AWS Lambda, AWS Lambda & Sagemaker, and Spark Compute Engines. We intend to support EC2, ECS and Kubernetes compute engines as well. Here are the currently supported compute engines:

AWS Lambda Compute Engine: The user's code is packaged as a Data Task Lambda Function and #LetsData manages the task creation, execution and monitoring.
AWS Lambda & Sagemaker Compute Engine: The user's code is packaged as a Data Task Lambda Function and #LetsData manages the task creation, execution and monitoring. Additionally, AI/ML inferences are run through a Sagemaker endpoint. #LetsData can create and manage these Sagemaker endpoints or reuse existing Sagemaker endpoints.
Spark Compute Engine: The Spark Compute Engine runs the user's Spark code on AWS Lambda and offers infrastructure simplifications such as removing the need for clusters (no cluster provisioning, no cluster management, no cluster scaling issues). It adds infinite elastic scaling (no need for application scheduling; applications run as soon as they are created) and a layer of file-level progress and task management that is consistent with #LetsData datasets. Your Spark code will just work out of the box: no JAR issues, classpath problems or elaborate session and cluster configurations.

#LetsData creates the selected compute engines for data processing and manages the infrastructure as needed until the data processing is complete. The user does not need to specify any additional infrastructure details or write any code to manage the compute infrastructure. This is what we promise - "Focus on the data and we'll manage the infrastructure."

Compute Engine: Details

AWS Lambda Compute Engine

The Lambda compute engine packages the user's data handler implementation JAR along with the #LetsData infrastructure code and creates a Data Task Lambda function for each dataset. It then executes the #LetsData tasks by invoking the Data Task Lambda functions concurrently.

To configure the Lambda function, the user only needs to specify the desired concurrency and, optionally, the Lambda function's memory limit and timeout. The #LetsData AWS Lambda Compute Engine takes care of task execution (runs and reruns, checkpointing and resume capabilities, progress monitoring etc.), handles errors and failures, and builds in diagnostics such as metrics and logging.
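The idea of running tasks under a concurrency cap with reruns can be illustrated with a small local sketch. This is not #LetsData's implementation: the function `run_tasks`, the `invoke` callable, the `max_attempts` parameter and the status strings are all hypothetical names chosen for illustration, and a thread pool stands in for concurrent Lambda invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, invoke, concurrency, max_attempts=3):
    """Run every task through `invoke`, at most `concurrency` at a time.

    Hypothetical sketch: `invoke` stands in for a Data Task Lambda
    invocation, and `max_attempts` models simple reruns on failure.
    """

    def run_one(task):
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                result = invoke(task)
                return {"task": task, "status": "COMPLETED",
                        "attempts": attempt, "result": result}
            except Exception as exc:  # rerun the task on any failure
                last_error = exc
        return {"task": task, "status": "ERRORED",
                "attempts": max_attempts, "error": str(last_error)}

    # max_workers caps how many tasks are in flight at any instant.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves the input order of the tasks.
        return list(pool.map(run_one, tasks))
```

For example, `run_tasks(["file-1", "file-2"], invoke=len, concurrency=2)` would run both hypothetical tasks in parallel and return one status record per task, in input order.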

Task Design

Here is a high level AWS Lambda Data Task function design:

Config

The AWS Lambda Compute Engine requires the following configuration:

region: (Optional) The compute engine region. Defaults to the dataset's region if unspecified. #LetsData creates the DataTask Lambda Functions that do the actual processing in this region, so all your reads, writes and errors will happen from the compute engine region. Supported Regions: [us-east-1, us-east-2, us-west-2, eu-west-1, ap-south-1, ap-northeast-1]
computeEngineType: The computeEngineType value Lambda specifies that this is the AWS Lambda Compute Engine
concurrency: The number of concurrent Lambda invocations allowed at any instant in time.
memoryLimitInMegabytes: The task lambda function's memory limit in MB. Defaults to 5120. Min allowed value: 512, Max Allowed Value: 10240.
timeoutInSeconds: The task lambda function's timeout in seconds. Defaults to 300. Min allowed value: 5, Max Allowed Value: 900
logLevel: The log level for emitted logs. Allowed values are [DEBUG, INFO, WARN, ERROR]. Defaults to WARN
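The configuration fields listed above can be assembled and sanity-checked before a dataset is created. The sketch below is a hedged illustration only: the helper `build_lambda_compute_engine_config` is hypothetical, the flat JSON shape is an assumption (the exact document structure #LetsData expects is not shown here), and the requirement that `concurrency` be a positive integer is also assumed. The field names, defaults, limits and allowed values are taken from the list above.

```python
import json

SUPPORTED_REGIONS = ["us-east-1", "us-east-2", "us-west-2",
                     "eu-west-1", "ap-south-1", "ap-northeast-1"]
LOG_LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]


def build_lambda_compute_engine_config(concurrency, region=None,
                                       memory_limit_mb=5120,
                                       timeout_seconds=300,
                                       log_level="WARN"):
    """Return a dict of Lambda Compute Engine settings (hypothetical helper)."""
    # Assumption: concurrency must be a positive integer.
    if not isinstance(concurrency, int) or concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    if region is not None and region not in SUPPORTED_REGIONS:
        raise ValueError(f"unsupported region: {region}")
    if not 512 <= memory_limit_mb <= 10240:
        raise ValueError("memoryLimitInMegabytes must be in [512, 10240]")
    if not 5 <= timeout_seconds <= 900:
        raise ValueError("timeoutInSeconds must be in [5, 900]")
    if log_level not in LOG_LEVELS:
        raise ValueError(f"logLevel must be one of {LOG_LEVELS}")

    config = {
        "computeEngineType": "Lambda",
        "concurrency": concurrency,
        "memoryLimitInMegabytes": memory_limit_mb,
        "timeoutInSeconds": timeout_seconds,
        "logLevel": log_level,
    }
    # region is optional; when omitted, the dataset's region is used.
    if region is not None:
        config["region"] = region
    return config


if __name__ == "__main__":
    print(json.dumps(
        build_lambda_compute_engine_config(concurrency=10, region="us-east-1"),
        indent=2))
```

Leaving `region` out of the dict when it is not supplied mirrors the documented behavior that the compute engine falls back to the dataset's region.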

Access Grants

No additional access needed.