Tasks
Overview
Tasks are defined as follows:
- A task is the system's representation of a work item (a unit of work for the dataset) and has a 1-1 mapping with a work definition in the manifest.
- Tasks are executed on compute engines (e.g. AWS Lambda) to read the data from the read destination, call the user data handlers and write to the write destination.
- During task execution, a task emits metrics and logs, and creates error records and usage records for the different resources used by the task. Each of these (metrics, logs, errors, usage records) is treated as a separate resource in #Let's Data, and these resources tie back to datasets / tasks (datasetName / taskId).
- Each dataset has a number of tasks. The dataset's manifest file defines the amount of work for the dataset - each work item in the manifest becomes a task in #Let's Data.
For example, the following manifest file would create 3 tasks for the dataset.
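A minimal sketch of what such a manifest might look like, written here as a Python dict for illustration. The field names (readerType, workItems, fileTypeFileNameMap) and the S3 paths are assumptions for this example - the actual manifest schema is covered in the manifest docs.

```python
# Illustrative only: a manifest with three work items (one file each).
# Field names and paths are assumptions, not the actual #Let's Data manifest schema.
example_manifest = {
    "readerType": "S3",
    "workItems": [
        {"fileTypeFileNameMap": {"LOGFILE": "s3://mybucket/logs/part-0001.gz"}},
        {"fileTypeFileNameMap": {"LOGFILE": "s3://mybucket/logs/part-0002.gz"}},
        {"fileTypeFileNameMap": {"LOGFILE": "s3://mybucket/logs/part-0003.gz"}},
    ],
}

# Each of the three work items becomes a separate task (with its own taskId)
# when the dataset is created.
assert len(example_manifest["workItems"]) == 3
```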
Task Schema & Example
Task Schema
A task has a few different components that users might find interesting (an illustrative example record follows this list):
- Dataset References: The task has references to its parent dataset, such as the datasetName and datasetId (and the usual tenantId and userIds)
- Task Id: A unique identifier for the task
- Task Type: The type of the task - these differ slightly across different read destinations. We currently support READER and SQS task types for S3 and SQS destinations respectively
- Reader Definition: A copy of the read connector and manifest file attributes that are relevant to this task and are used in processing. These are the readerType, the manifestId and the fileType / fileName mapping
- Compute Definition: A copy of the compute engine's attributes that are relevant to this task and would be used in processing.
- Worker Process Id: The unique identifier of the worker that processed this task. This isn't terribly important for users, but when tasks are running as different threads on EC2 machines, it can be helpful in diagnosing errors related to a single machine.
- Task Status: The task status structure that holds the current status of the task. It has the following sub-attributes:
  - Task Status: The current status of the task - one of CREATED, PROCESSING, COMPLETED, ERROR or RERUN
  - Last Error: A snippet of the last error / exception that the task encountered, copied into the task for quick debugging
- Task Checkpoint: As the task processes data from the read connector, it periodically checkpoints its progress as a task checkpoint structure in the task. This is used to resume from the last saved checkpoint when there are errors and the task needs to be redriven. The task checkpoint has the following data:
  - checkpointDatetime: The datetime of the checkpoint
  - checkpointId: The documentId that was last processed from the file
  - checkpointProcessId: The processId that created this checkpoint
  - offsetBytes: The byte offset into the file that has been processed up to this checkpoint. If the task restarts, it will start reading from this offset into the file.
  - lastProcessedRecordType: The last processed record type, used to re-initialize the state machine when the task restarts.
- Task Progress: The aggregate progress that this task has made - for example, numRecordsErrored, numRecordsProcessed and numRecordsSkipped
- Task Execution Logs: The start and end datetimes for each run of this task (the initial run and any task redrives). This is useful for correlating the datetimes of a task run with any external systems.
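To make the schema concrete, here is an illustrative task record sketched as a Python dict. The values and the exact field names / casing are assumptions for this example - consult the task API docs for the authoritative schema.

```python
# Illustrative only: a task record shaped after the attributes described above.
# Field names, casing and values are assumptions, not the exact API schema.
example_task = {
    "tenantId": "tenant-1234",
    "userId": "user-5678",
    "datasetName": "web_crawl_dataset",
    "datasetId": "dataset-abcd",
    "taskId": "task-0001",
    "taskType": "READER",                       # READER (S3) or SQS
    "readerDefinition": {
        "readerType": "S3",
        "manifestId": "manifest-0001",
        "fileTypeFileNameMap": {"LOGFILE": "s3://mybucket/logs/part-0001.gz"},
    },
    "computeDefinition": {"computeEngine": "LAMBDA", "timeoutMinutes": 10},
    "workerProcessId": "process-9f2a",
    "taskStatus": {
        "taskStatus": "PROCESSING",             # CREATED | PROCESSING | COMPLETED | ERROR | RERUN
        "lastError": None,
    },
    "taskCheckpoint": {
        "checkpointDatetime": "2023-05-01T12:30:00Z",
        "checkpointId": "document-4521",
        "checkpointProcessId": "process-9f2a",
        "offsetBytes": 10485760,
        "lastProcessedRecordType": "LOGLINE",
    },
    "taskProgress": {
        "numRecordsProcessed": 4521,
        "numRecordsErrored": 3,
        "numRecordsSkipped": 17,
    },
    "taskExecutionLogs": [
        {"startDatetime": "2023-05-01T12:00:00Z", "endDatetime": None},
    ],
}
```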
Task Execution
Task Execution Trace: When a task executes, it reads from the read destination and calls the user handlers to create a document that it then writes to the write destination. During this execution, the task emits metrics and logs, and creates error records and usage records for the different resources used by the task. Each of these (metrics, logs, errors, usage records) is treated as a separate resource in #Let's Data, and these resources tie back to datasets / tasks (datasetName / taskId).
Task Completions: A task either completes successfully (taskStatus: COMPLETED), fails (taskStatus: ERROR) or requests a rerun (taskStatus: RERUN).
- Task Failures: When a task fails, the failure could be from the #Let's Data infrastructure or from unhandled exceptions in the user's data handler code. Error tasks can be redriven (either re-run as is, or after fixing issues such as updating the implementation jar or fixing timeout / concurrency settings). Redriven task executions also generate metrics, logs, errors and usage records.
- Task Rerun: A task may request a rerun by returning a RERUN status. #Let's Data monitors the task's execution environment and can stop the current task run and request a rerun under certain conditions. For example, when the execution time of a lambda task with a 10 minute timeout approaches its timeout threshold, the lambda can checkpoint the task's progress and exit gracefully requesting a RERUN. The rerun task is automatically run again from the last checkpoint. This is useful in the following cases (see the sketch after this list):
  - long-running tasks that exceed the lambda's allowed 15 minute duration
  - tasks that are expected to finish within the allowed timeout but are delayed in execution, for example by a write destination slowdown
  - tasks that run in continuous mode and can be scheduled for rerun in Idle / Timeout scenarios
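As a rough sketch of the checkpoint-and-rerun pattern (not the actual #Let's Data handler interface - the function and parameter names below are hypothetical), a lambda task can watch its remaining execution time, checkpoint its progress, and return a RERUN status so it is scheduled again and resumes from the last checkpoint:

```python
import time

# Hypothetical sketch of the checkpoint-and-rerun pattern described above. None of
# these names come from the #Let's Data SDK; they only illustrate the control flow.
SAFETY_MARGIN_SECONDS = 60            # checkpoint and exit well before the lambda timeout

def run_task(records, saved_offset, deadline_epoch_seconds, write, save_checkpoint):
    """Process records from the last checkpoint; near the timeout, checkpoint and request a RERUN."""
    for offset in range(saved_offset, len(records)):
        if deadline_epoch_seconds - time.time() < SAFETY_MARGIN_SECONDS:
            save_checkpoint(offset)       # persist progress (analogous to offsetBytes in the checkpoint)
            return "RERUN"                # the task is re-run later and resumes from this checkpoint
        write(records[offset].upper())    # stand-in for the user data handler + write connector
        save_checkpoint(offset + 1)       # periodic checkpoint after each record
    return "COMPLETED"

# Toy usage: a 10 minute deadline, resuming from offset 0.
status = run_task(
    records=["rec-a", "rec-b", "rec-c"],
    saved_offset=0,
    deadline_epoch_seconds=time.time() + 600,
    write=print,
    save_checkpoint=lambda offset: None,
)
print(status)   # COMPLETED here; RERUN if the deadline were within the safety margin
```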
Stopping Tasks: Dataset tasks that are running can be stopped using the $ > letsdata dataset stop command. Tasks may need to be stopped for a few different reasons:
- dataset processing is completed and there is no additional work to be done. For example, tasks are reading from SQS in continuous mode and the queue has no more data.
- dataset is processing with errors / failures that need to be corrected. The dataset tasks can be stopped in this case, errors and failures corrected by updating dataset code / compute and then redriving the tasks.
- dataset is processing but is not needed anymore. The dataset tasks can be stopped in this case so they do not process any additional data and accrue costs.
Additionally, deleting a dataset immediately stops the execution of its tasks and can be used in emergency situations where tasks must be stopped to prevent harm to the system.
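For scripting, the stop could be wrapped in a small helper that shells out to the CLI. The letsdata dataset stop command is the one documented above, but the way the dataset is identified (the positional dataset name argument below) is an assumption for this sketch - check the CLI docs for the exact arguments.

```python
import subprocess

# Sketch: stop a dataset's running tasks via the letsdata CLI.
# "dataset stop" is the documented command; the dataset-name argument is an assumption.
def stop_dataset_tasks(dataset_name: str) -> None:
    subprocess.run(["letsdata", "dataset", "stop", dataset_name], check=True)

stop_dataset_tasks("web_crawl_dataset")
```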
Task Design
Task Design: Here are the Task Design Architecture Diagrams for tasks executed on the Lambda Compute Engine and the Lambda and Sagemaker Compute Engine:
Here is a high-level AWS Lambda Data Task function design:
Task Miscellaneous
- Metrics: Datasets come with two dashboards by default, the DatasetSummary dashboard and the TaskDetails dashboard. See the Metrics Docs for details.
- Logs: The Data Task Lambda Function logs are available via the logs API / CLI / Console. See the Logs Docs for details.
- Errors: Task errors are archived at the error destination (currently S3). See the Error Handling Docs for details.
- Usage, Metering, Billing: We do attempt to record service usage at a task level where possible. In cases where it isn't possible, aggregated usage records are emitted for the dataset. The Billing Docs have more details around usage records.