Developer Documentation
# Let's Data : Focus on the data - we'll manage the infrastructure!
Cloud infrastructure that simplifies how you process, analyze and transform data.
## Cost Management
Dataset execution causes #Let's Data to initialize a variety of AWS resources - these include the write connector (e.g. a Kinesis stream), the error connector (e.g. an S3 bucket), and internal components such as queues, database tables, compute resources, etc.
These resources start accruing costs as soon as they are created. The costs fall into two categories: use-time costs and recurring costs.
- Use-Time Costs: Use-time costs accrue only while the resources are actually being used; when the resources sit idle, these costs do not add up. For example, Lambda compute duration is billed only for the time the function is running - if the function is not running, there is no cost. Other examples include per-request charges and bytes-transferred charges.
- Recurring Costs: Recurring costs keep adding up for as long as the resource remains initialized. A Kinesis stream's shard-hour charges are one such example - every hour, the provisioned shards are metered and billed. Even if the dataset is complete and no data is being written to or read from the stream, the shard-hour costs keep accruing. Other examples include S3 bucket storage bytes and DynamoDB storage bytes.
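To make the distinction concrete, here is a minimal sketch comparing how the two cost categories accrue over time. The rates below are made-up illustrative numbers, not actual AWS or #Let's Data pricing:

```python
# Illustrative cost model: use-time costs accrue only while the resource
# is busy; recurring costs accrue for every hour the resource exists.
# All rates are example numbers, not real AWS pricing.

def use_time_cost(busy_hours: float, rate_per_busy_hour: float) -> float:
    """Cost that accrues only while the resource is actually in use."""
    return busy_hours * rate_per_busy_hour

def recurring_cost(provisioned_hours: float, units: int,
                   rate_per_unit_hour: float) -> float:
    """Cost that accrues for as long as the resource stays provisioned."""
    return provisioned_hours * units * rate_per_unit_hour

# A dataset that ran for 2 hours, then sat idle for another 70 hours:
lambda_cost = use_time_cost(busy_hours=2, rate_per_busy_hour=0.50)
kinesis_cost = recurring_cost(provisioned_hours=72, units=4,  # 4 shards
                              rate_per_unit_hour=0.015)

print(f"lambda (use-time):   ${lambda_cost:.2f}")   # stops growing when idle
print(f"kinesis (recurring): ${kinesis_cost:.2f}")  # grows until reclaimed
```

The idle stream ends up costing several times more than the compute that did the actual work - which is exactly why unreclaimed resources inflate the bill.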
Considering that our datasets are active for a finite amount of time, it makes sense to reclaim any resources that were initialized for dataset processing but are no longer being used once the dataset has COMPLETED / ERRORED. A notable exception is data in the write connectors / error connectors, which may still be consumed by external applications. If these unused resources are not reclaimed, they can easily inflate the dataset's costs and result in a larger-than-expected bill.
With this reasoning in mind, we've architected dataset execution and resource management to be optimized for costs - we try to determine whether the dataset execution is complete (and would not require any reruns) and attempt to descale / freeze the resources associated with the dataset. These cost management actions (descale, freeze and delete) are also available to users, who can descale / freeze / delete datasets using their own custom logic if needed.
Here is the cost management section from the dataset lifecycle diagram:
Here is what each of these cost management actions means:
- DESCALED: The dataset has completed processing (success or error) and the user / cost management service has decided to descale (as distinct from reclaiming) the resources allocated for the dataset. For example, provisioned throughputs are decreased, lambda concurrency is reclaimed, etc. Though not yet supported, a dataset in this state could have its resources re-hydrated to rerun tasks if needed.
- FROZEN: The dataset has completed processing and the user / cost management service has decided to reclaim the resources allocated for the dataset. For example, internal queues are deleted, processing tables are deleted, and any non-user-data infrastructure is reclaimed. User data in the write and error destinations is still available, though - this means the dataset is essentially read-only, and dataset consumers can continue processing from a frozen dataset. A frozen dataset cannot be re-hydrated.
- DELETED: The user has decided to delete the dataset - all resources are reclaimed. Zombie records are kept to disallow recreation and to aid delayed processes such as billing.
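The rules above can be sketched as a small state machine. This is an illustration of the transitions described in this section (frozen datasets cannot be re-hydrated, deleted datasets are terminal), not the actual #Let's Data implementation:

```python
from enum import Enum

class CostState(Enum):
    ACTIVE = "ACTIVE"      # dataset resources fully provisioned
    DESCALED = "DESCALED"  # throughputs decreased, re-hydration conceivable
    FROZEN = "FROZEN"      # internal infra reclaimed, read-only, no re-hydration
    DELETED = "DELETED"    # everything reclaimed, only zombie records remain

# Allowed transitions per the rules in this section. Note DESCALED -> ACTIVE
# (re-hydration) is deliberately absent, since it is not supported yet.
ALLOWED = {
    CostState.ACTIVE:   {CostState.DESCALED, CostState.FROZEN, CostState.DELETED},
    CostState.DESCALED: {CostState.FROZEN, CostState.DELETED},
    CostState.FROZEN:   {CostState.DELETED},
    CostState.DELETED:  set(),
}

def transition(current: CostState, target: CostState) -> CostState:
    """Move to the target state, rejecting disallowed transitions."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target

state = transition(CostState.ACTIVE, CostState.DESCALED)
state = transition(state, CostState.FROZEN)
# transition(state, CostState.DESCALED)  # raises: frozen datasets can't re-hydrate
```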
Here is how the different resources map to the descale / freeze / delete actions:
The following command can be used for different Cost Management actions:
## Descaling, Freezing and Deleting Datasets
Here is how one can descale / freeze / delete datasets in #Let's Data:
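As a hypothetical sketch of what issuing such an action might look like in a client script (the helper `set_dataset_cost_action`, its parameters, and the request shape are assumptions for illustration, not the actual #Let's Data CLI or API), validating the action before sending the request could be done like this:

```python
# Hypothetical client-side helper - the real #Let's Data command/API may differ.
VALID_ACTIONS = {"descale", "freeze", "delete"}

def set_dataset_cost_action(dataset_name: str, action: str) -> dict:
    """Validate inputs and build a cost-management request payload."""
    if action not in VALID_ACTIONS:
        raise ValueError(
            f"unknown action {action!r}; expected one of {sorted(VALID_ACTIONS)}")
    if not dataset_name:
        raise ValueError("dataset_name is required")
    # In a real client, this payload would be sent to the service endpoint.
    return {"datasetName": dataset_name, "action": action}

request = set_dataset_cost_action("orders-dataset", "freeze")
print(request)  # {'datasetName': 'orders-dataset', 'action': 'freeze'}
```

Validating the action name locally keeps an invalid request (for example, trying to re-hydrate a frozen dataset) from ever reaching the service.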