# Let's Data : Focus on the data - we'll manage the infrastructure!
Cloud infrastructure that simplifies how you process, analyze and transform data.
## Data Catalog
The LetsData Write Destinations are categorized as either:
- permanent, durable data stores (think databases and storage such as S3, DynamoDB, Vector Indexes)
- ephemeral data containers (such as streams, queues, etc.)
The permanent, durable data stores can be enabled for analytical workloads (OLAP). LetsData enables these by running AWS Glue Crawlers on the data stores to discover the data schema, partitioning and metadata. These are then added to a data catalog / metastore (the AWS Glue Data Catalog). Customers can then run AWS Athena queries using the AWS Glue Data Catalog - these are distributed queries that enable on-demand, massively parallel processing of large datasets.
See the AWS Athena Docs to understand the AWS analytics services and their usage, and the Redshift Docs about Columnar Storage to understand the OLAP / OLTP differences.
Highlights about LetsData's Data Catalog implementation:
- Automatic Data Cataloging: LetsData can automatically catalog datasets that write to S3 destinations by running Glue Crawlers on the S3 data files and creating databases and tables for LetsData users.
- Primitive Data Lake: Adding data to a Data Catalog, defining a permissioning model for access, and enabling distributed queries essentially forms a Data Lake. (There are different terms one comes across when defining a data strategy for OLAP workloads - Data Warehouses, Data Lakes, Data Lakehouses, etc. This external link explains the different terminology.) LetsData's data catalog is a primitive data lake (not a data warehouse).
- Access and Permissioning: AWS Lake Formation is a permissioning model over the AWS Glue Data Catalog that vends automated temporary credentials to different services - Lake Formation has different user roles and allows for permissions delegation and sharing as well. As of now, the LetsData permissioning model allows data catalog access to the data owner only; we do not offer permission sharing / delegation yet.
## Architecture
- Automated Data Catalog: Here is a high-level architecture of how the automated data catalog is implemented:
- Athena Querying: Here are the high-level steps to query a dataset's data catalog tables.
## Details
- Availability: Automated Data Catalog is currently available for the S3 write connectors only (S3, S3 Aggregate File, S3 Spark), and for resourceLocation: LetsData only. resourceLocation: Customer requires some additional architectural redesign which we've deferred for now.
- Enable Config: To enable the automated Data Catalog, add the "addToCatalog": true attribute to the dataset's write connector config, as shown in the sketch below.
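A minimal sketch of the config change (the connectorDestination value and any other write connector attributes shown are illustrative, not the complete write connector schema):

```json
{
    "writeConnector": {
        "connectorDestination": "S3",
        "addToCatalog": true
    }
}
```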
- Catalog Details: The dataset's initialization creates the catalog database and the Glue crawler, and sets up the crawl configuration. The dataset's write connector node is updated with these catalog details - you'll need these in Athena queries. Use the CLI datasets view command or an HTTP GET on the api/datasets API node to view the dataset details. Here are example details:
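A sketch of what the writeConnector.catalog node could look like - all values are illustrative placeholders (the database name reuses the example tenant from below):

```json
"catalog": {
    "awsGlueCrawlerName": "<crawler name created for the dataset>",
    "awsGlueCatalogName": "<default AWS account catalog>",
    "awsGlueDatabaseName": "prodtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155",
    "awsGlueTableNamePrefix": "<table name prefix from the crawl configuration>",
    "athenaQueryOutputPath": "s3://<bucket provisioned for the dataset>/<athena results path>/"
}
```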
Here is what the different attributes in the configuration mean:
- awsGlueCrawlerName: The name of the AWS Glue crawler that was created by the dataset. You can use the AWS CLI / SDK to query details about the crawler and its crawl runs. See additional details in the crawler section below.
- awsGlueCatalogName: The name of the AWS Glue catalog that the crawler will store the schema in. This is currently set to the default AWS account catalog. You'll need this to query the catalog details about the tables and when running actual Athena queries.
- awsGlueDatabaseName: The name of the AWS Glue catalog database. Every LetsData user gets their own database, accessible only to them. The database is named <stack_name>tenant<tenant_id>. For example, the tenant with tenantId d5feaf90-71a9-41ee-b1b9-35e4242c3155 on the prod stack would be assigned the database prodtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155.
- awsGlueTableNamePrefix: The crawler sets up the crawl configuration and specifies the table name prefix. The actual table name(s) are obtained later by calling the AWS Glue GetTables API using credentials from the dataset's customerAccessRoleArn, with awsGlueCatalogName / awsGlueDatabaseName as the catalog name / database name respectively.
- athenaQueryOutputPath: To help with querying that works without any additional setup, we provision an S3 bucket path for each dataset that is configured as the Athena query results location for that dataset. You can run Athena queries with this S3 bucket path as the results output path. The contents of the path are deleted upon dataset deletion (so do save any results that you need).
- Crawler Configuration: We've configured the crawler by default with settings that should work out of the box for most use cases. However, for those who need advanced options, here is how the crawler is set up - we configure the crawler to recrawl everything on each run, with schema change policies of update / delete in database on updates / deletes. Crawler lineage settings are disabled. Additionally, we specify the tableGroupingPolicy to combine compatible schemas (and create a single table). If we should add classifier support / additional options out of the box, do let us know and we can enable these (mailto: support@letsdata.io). (Crawler in AWS Docs) Here is an example crawler configuration:
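As a sketch, here is roughly how those settings map onto the AWS Glue crawler API fields (as they'd appear in a get-crawler response; the name, database, prefix and S3 path are placeholders):

```json
{
    "Crawler": {
        "Name": "<awsGlueCrawlerName>",
        "DatabaseName": "<awsGlueDatabaseName>",
        "TablePrefix": "<awsGlueTableNamePrefix>",
        "Targets": {
            "S3Targets": [ { "Path": "s3://<dataset write destination bucket>/<data path>/" } ]
        },
        "RecrawlPolicy": { "RecrawlBehavior": "CRAWL_EVERYTHING" },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DELETE_FROM_DATABASE"
        },
        "LineageConfiguration": { "CrawlerLineageSettings": "DISABLE" },
        "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
    }
}
```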
## Running Athena Queries
Here are a few example commands that can be used to get details about a dataset's catalog and run Athena queries.
- Access Credentials: To run AWS commands for a dataset's resources, you'll need AWS access credentials - these are granted via an IAM Role that LetsData creates on dataset initialization. The IAM Role ARN is specified as the dataset's customerAccessRoleArn attribute and is configured for access by the customer's AWS account specified as the customerAccountForAccess attribute in the dataset. Any IAM user in the customer's AWS account with access to the STS AssumeRole API can get temporary, time-limited credentials to access the resources of the dataset. Here are quick commands that can be used to get the dataset's access credentials.
- We need the customerAccessRoleArn and createDatetime from the dataset to be able to access the bucket files.
- Suppose that for the current dataset, the customerAccountForAccess is 308240606591, and that this account has an IAM admin user whose credentials are stored in the ~/.aws/credentials file as the profile IamAdminUser308240606591. Here are the contents of the ~/.aws/credentials file:
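For illustration (the key values below are AWS's documentation example placeholders, not real credentials):

```ini
[IamAdminUser308240606591]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```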
- I'll run the following AWS CLI command to get time-limited credentials and save them to the ~/.aws/credentials file as the stsassumerole profile:
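A sketch of the flow - the role ARN placeholder stands in for the dataset's customerAccessRoleArn, and the session name is arbitrary. (If the role requires an external id, for example one derived from the dataset's createDatetime, pass it with --external-id; that detail is an assumption, check your dataset's details.)

```bash
# Get temporary credentials by assuming the dataset's customer access role
aws sts assume-role \
    --role-arn "<customerAccessRoleArn>" \
    --role-session-name letsdata-catalog-access \
    --profile IamAdminUser308240606591

# Copy the AccessKeyId / SecretAccessKey / SessionToken from the output into
# ~/.aws/credentials as the stsassumerole profile:
#
# [stsassumerole]
# aws_access_key_id = <AccessKeyId>
# aws_secret_access_key = <SecretAccessKey>
# aws_session_token = <SessionToken>
```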
- Crawler: The crawler is created when the dataset is initialized and is run upon dataset completion, i.e. when all tasks in the dataset SUCCEED. You can get the status of the crawler using the get-crawler CLI command, start a new crawler run using start-crawler, and stop a running crawler using the stop-crawler CLI command. (The crawler name is in the dataset config as the writeConnector.catalog.awsGlueCrawlerName attribute.) For example:
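(The crawler name below is a placeholder for your dataset's writeConnector.catalog.awsGlueCrawlerName value.)

```bash
# Check the crawler's status and its last crawl details
aws glue get-crawler --name <awsGlueCrawlerName> --profile stsassumerole

# Start a new crawl
aws glue start-crawler --name <awsGlueCrawlerName> --profile stsassumerole

# Stop a running crawl
aws glue stop-crawler --name <awsGlueCrawlerName> --profile stsassumerole
```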
- Table Details: The table(s) are created when the crawler has run successfully. You can get the table details for the created tables using the get-tables CLI command. (The database name is in the dataset config as the writeConnector.catalog.awsGlueDatabaseName attribute.) Additionally, the credentials can also be used to look up further details about the Glue database, tables and partitions. Access to the following commands has been configured for the customer's access credentials: [ "glue:GetTables", "glue:GetTable", "glue:GetDatabases", "glue:GetDatabase", "glue:GetPartitions", "glue:BatchGetPartition", "glue:GetPartitionIndexes", "glue:GetPartition" ]
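For example (database and table names are placeholders):

```bash
# List the tables the crawler created in the dataset's database
aws glue get-tables --database-name <awsGlueDatabaseName> --profile stsassumerole

# Look up the partitions of a specific table
aws glue get-partitions --database-name <awsGlueDatabaseName> --table-name <tableName> --profile stsassumerole
```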
- Run Athena Query: Using the schema from get-tables, construct a query to run - for example, a query to group documents by their language: SELECT language, COUNT(*) as recs FROM b8e4b3e80eeb8c60258c9dab38d94390tldwcb8e4b3e80eeb8c60258c9dab38d94390 GROUP BY language ORDER BY recs DESC. You can run the query using the start-query-execution CLI command. (The database name and catalog name are in the dataset config as the writeConnector.catalog.awsGlueDatabaseName and writeConnector.catalog.awsGlueCatalogName attributes. Additionally, use the writeConnector.catalog.athenaQueryOutputPath to store your results.)
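A sketch of the query run - the placeholders come from the dataset's writeConnector.catalog node, and the get-query-execution / get-query-results calls to poll and fetch results are standard Athena CLI usage, not LetsData-specific:

```bash
# Start the query; the command returns a QueryExecutionId
aws athena start-query-execution \
    --query-string "SELECT language, COUNT(*) as recs FROM <tableName> GROUP BY language ORDER BY recs DESC" \
    --query-execution-context Database=<awsGlueDatabaseName>,Catalog=<awsGlueCatalogName> \
    --result-configuration OutputLocation=<athenaQueryOutputPath> \
    --profile stsassumerole

# Poll until the query state is SUCCEEDED, then fetch the results
aws athena get-query-execution --query-execution-id <QueryExecutionId> --profile stsassumerole
aws athena get-query-results --query-execution-id <QueryExecutionId> --profile stsassumerole
```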