
Developer Documentation

# Let's Data: Focus on the data - we'll manage the infrastructure!

Cloud infrastructure that simplifies how you process, analyze and transform data.

Overview

Here are a few end-to-end examples to get you started with #LetsData.

  • Spark - Extract and Map Reduce: An example that runs a spark reader on the Web Crawl Archive files from S3 and extracts each web crawl record as a #Let's Data document using spark. It then map reduces these documents using spark to compute the 90th percentile web page content length grouped by the web page language and writes the results as a json document to S3. This is a beginner level example: it assumes little or no familiarity with #LetsData and guides you through tasks such as dev machine setup and java project setup, with step by step details on how to grant access etc.
  • Uri Extractor: An example that runs a simple single file reader on the Web Crawl Archive files from S3, extracts the crawled URIs from these files and writes them to Kinesis streams. This is a beginner level example: it assumes little or no familiarity with #LetsData and guides you through tasks such as dev machine setup and java project setup, with step by step details on how to grant access etc.
  • Generate Vector Embeddings: An example that runs a multiple file reader on the Web Crawl Archive files from S3 to extract an indexer document. It then runs the content of each document through SageMaker to generate vector embeddings and writes these vector embeddings to a Kinesis stream. This is an intermediate level example: it reuses the code from the letsdata-common-crawl repository and assumes familiarity with items such as setup, access grants etc. Detailed step by step instructions for common #LetsData tasks and detailed explanations are linked instead of being explained inline.

Details

Example - Spark Extract and Map Reduce

Overview

In this example, we'll read files (web crawl archive files) from S3 using Spark code and extract the web crawl header and the web page content as a LetsData Document. We'll then map reduce these documents using Spark to compute the 90th percentile contentLength grouped by language and write the results as a json document to S3. We'll use the LetsData Spark Compute Engine (which runs on AWS Lambda) to read from S3, run the compute and write the results back to S3.

  • Step 1 - Setup: We'll set up our environment by installing the LetsData CLI, setting up the dev environment for the dev language (Java and Python), and creating a project using the LetsData interfaces.
  • Step 2 - Common Crawl Format: We'll look at the Common Crawl files that we'd be processing and understand their schema, format and the high level transformations that we'll perform.
  • Step 3 - Code Implementations: We'll write an implementation for the LetsData Spark interfaces (SparkMapperInterface, SparkReducerInterface) - SparkMapperInterface will extract the LetsData documents from common crawl files and store them to S3 in an intermediate location. SparkReducerInterface will reduce these intermediate files to compute the 90th percentile contentLength grouped by language and write the results as a json document to S3.
  • Step 4 - Granting Access: We'll look at Access Grants - the LetsData permissioning model that manages access to the different data sources that are created by and consumed by LetsData.
  • Step 5 - Dataset Configuration: We'll create a dataset configuration, which specifies where to read the files from, the write destination and the compute engine details.
  • Step 6 - Create Dataset: We'll run this dataset on LetsData using the LetsData CLI, monitor its execution and look at the execution logs and metrics.
  • Step 7 - View Results: We'll view the results of the spark map reduce dataset by downloading the results file from S3 to the machine and viewing it.

Here are the step by step instructions to run this example:

Step 1: Setup

i. Pre-requisites

  • Development Environment: Assumes that the development environment for Java / Python is set up as follows:
    • Java Development Environment: Assumes a java development environment with Maven, a JDK and IntelliJ installed.
    • Python Development Environment: Assumes a python development environment with Python, Docker, VS Code and GitHub installed.
  • Let's Data Signup: Assumes that you are signed up for #Let's Data and have a username and password that were sent to you in email. You have also logged in once on the #Let's Data Website to complete the registration and set up a new password. (This is required for the password to work via CLI.)
  • AWS CLI Setup: Assumes that you have the aws cli installed and set up with credentials

ii. #Let's Data CLI setup

Download and setup the #Let's Data CLI using the following:

  • Download the letsdata-cli.tar.gz file
  • Unzip it using the following command - this should create a letsdata-cli directory which has the cli JAR file and the letsdata.sh script
  • Run the letsdata.sh script (assumes Java is installed on the machine and JAVA_HOME is set). A sketch of these commands follows this list.
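Here is what those commands might look like - a sketch, assuming the archive was downloaded to the current directory:

```
# Unzip the CLI archive - creates the letsdata-cli directory with the cli JAR and letsdata.sh
tar -xzf letsdata-cli.tar.gz
cd letsdata-cli

# Run the CLI via the bundled script (requires Java on the machine)
./letsdata.sh
```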

iii. #Let's Data Interface Setup:

Download the #Let's Data java interface jars and install them in the local maven repo. (Ref: Interface Docs, Interface Code on GitHub)

iv. Create Dev Project:

  • Create a new IntelliJ project using Maven. Use the following groupId and artifactId
  • Update the pom.xml file: Update the project's pom.xml file to add the #Let's Data interface dependencies and some useful dependencies such as google guava, apache commons-lang3 and junit. Set up the project for JDK8 (you could use a different JDK) and configure the 'maven-assembly-plugin' to build a jar that includes the dependent jars.
  • Create the code files: Create the java code files that will hold the #Let's Data interface implementations (empty skeletons are sketched after this list):
    • Create a new java package com.letsdata.example
    • Create a new java class CommonCrawlSparkMapper in the com.letsdata.example package
    • Create a new java class CommonCrawlSparkReducer in the com.letsdata.example package
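The two classes start out as empty skeletons. A sketch is shown below - the interface names come from the #Let's Data interface jars installed in step iii, and the exact 'implements' clauses and method signatures are defined in the Interface Docs:

```java
// src/main/java/com/letsdata/example/CommonCrawlSparkMapper.java
package com.letsdata.example;

public class CommonCrawlSparkMapper /* implements SparkMapperInterface */ {
    // single partition (spark narrow) transformations - implemented in Step 3.i
}
```

```java
// src/main/java/com/letsdata/example/CommonCrawlSparkReducer.java
package com.letsdata.example;

public class CommonCrawlSparkReducer /* implements SparkReducerInterface */ {
    // multi partition (spark wide) transformations - implemented in Step 3.ii
}
```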

Step 2: The Common Crawl Data

Before we write the #Let's Data Interface implementation, we should look at the data that we'll be processing and understand its schema and format so that we know how to process it. The data we'll process is common crawl data. Common Crawl is an open repository of web crawl data that can be accessed and analyzed by anyone.

  • Let's look at the Nov/Dec 2023 data https://www.commoncrawl.org/blog/november-december-2023-crawl-archive-now-available. Find a file that we should download and look at. Here are the commands (a sketch follows this list):
  • Download the first file in the output to your machine (this is a ~110 MB file) and look at it to understand the format and schema. The files have a 'warcinfo' header and payload record at the start of the file, followed by 'warc conversion' header and payload records for each url that was crawled. Each warc conversion record is separated by the string delimiter '\n\r\n\r\n' and the header and payload within each conversion record are delimited by '\r\n\r\n'.
  • Suppose we were to process the following 4 files in our dataset:
  • Mapper Task: We'll have a Mapper Task (which calls the mapper interface) for each file. The mapper interface implements single partition operations (spark narrow transformations). The mapper task will run spark transformations to convert the header and payload of each 'warc conversion record' to a defined schema. These extracted documents are stored as intermediate files in S3 (internally by the system, in parquet format). In this example case, 4 manifest files -> 4 Mapper Tasks -> 4 intermediate files. Here is what the schema (and the above document converted) would look like:
  • Reducer Task: LetsData creates a reducer task for any reduce operations for the dataset. The intermediate files from the mapper phase are read by the reducer, and any multi-partition operations, shuffles, aggregates or joins are performed on them. (Reducer tasks read all the files and compute spark wide transformations.) The dataframe returned by the Reducer Task is written to the dataset's write destination. In this example case, 4 intermediate files from the mapper tasks -> 1 output file from the reducer task, where the reducer task computes the '90th percentile contentLength' of these pages grouped by the 'language'. Here is an example result file:
  • Overall Architecture: Here is what the overall architecture looks like in this scenario.
  • Now that we understand what we are going to do at a high level, let's dig into the implementation details.
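Here is a sketch of the commands referenced in the first two bullets above for finding and downloading a crawl file. The crawl prefix CC-MAIN-2023-50 is an assumption for the Nov/Dec 2023 crawl (confirm it against the announcement page); the commoncrawl bucket itself is public.

```
# List what the crawl publishes (segments, wet.paths.gz, etc.)
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2023-50/

# wet.paths.gz lists the web crawl archive (WET) files - take the first one
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-50/wet.paths.gz - | gunzip | head -1

# Download that file (~110 MB) and peek at the warcinfo / warc conversion records
aws s3 cp s3://commoncrawl/<path printed by the previous command> ./crawl-file.warc.wet.gz
gunzip -k crawl-file.warc.wet.gz
head -c 4000 crawl-file.warc.wet
```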

Step 3: Code Implementation

    i. Mapper Interface

    • Implementing the mapper interface: Here is the sample code for the CommonCrawlSparkMapper java class - this implements the SparkMapperInterface, which reads the file from S3 using the default Spark Session and the Spark Read functions that we've implemented in SparkUtils. It then transforms the records to the desired schema. The code has comments for each transformation that should help explain what it is doing. (A simplified sketch of the transformation logic follows.)
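The sample code itself isn't reproduced here; below is a simplified sketch of what the mapper's narrow transformation could look like. The method name and signature are hypothetical (the real entry point is defined by SparkMapperInterface, and the file read normally goes through the SparkUtils helpers), and the extracted fields (url, language, contentLength, content) are inferred from the warc headers described in Step 2.

```java
package com.letsdata.example;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_extract;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CommonCrawlSparkMapper /* implements SparkMapperInterface */ {

    // Hypothetical entry point: the default SparkSession and the s3 uri of the
    // single crawl archive file assigned to this mapper task.
    public Dataset<Row> mapRecords(SparkSession spark, String s3FileUri) {
        // Each 'warc conversion' record is separated by the '\n\r\n\r\n' delimiter.
        Dataset<Row> records = spark.read()
                .option("lineSep", "\n\r\n\r\n")
                .text(s3FileUri);   // single column named 'value'

        // Header and payload within a record are separated by '\r\n\r\n';
        // the fields we need are pulled out of the warc header lines.
        return records
                .withColumn("header", split(col("value"), "\r\n\r\n", 2).getItem(0))
                .withColumn("content", split(col("value"), "\r\n\r\n", 2).getItem(1))
                .withColumn("url", regexp_extract(col("header"), "WARC-Target-URI: (\\S+)", 1))
                .withColumn("language", regexp_extract(col("header"), "WARC-Identified-Content-Language: (\\S+)", 1))
                .withColumn("contentLength", regexp_extract(col("header"), "Content-Length: (\\d+)", 1).cast("long"))
                .filter(col("url").notEqual(""))   // drops the leading 'warcinfo' record
                .select("url", "language", "contentLength", "content");
    }
}
```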

    ii. Reducer Interface

    • Implementing the reducer interface: Here is the sample code for the CommonCrawlSparkReducer java class - this implements the SparkReducerInterface, which reads the intermediate files from S3 using the default Spark Session and the Spark Read functions that we've implemented in SparkUtils. It then transforms the records to the desired result. The code has comments for each transformation that should help explain what it is doing. (A simplified sketch of the reduce logic follows.)
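Again a sketch rather than the full sample code: the method name and signature are hypothetical (the real entry point is defined by SparkReducerInterface, and the intermediate files are read via SparkUtils), but the wide transformation itself is just a group-by with an approximate 90th percentile aggregate.

```java
package com.letsdata.example;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CommonCrawlSparkReducer /* implements SparkReducerInterface */ {

    // Hypothetical entry point: the default SparkSession and the s3 uri of the
    // intermediate (parquet) files written by the mapper tasks.
    public Dataset<Row> reduceRecords(SparkSession spark, String intermediateFilesS3Uri) {
        Dataset<Row> documents = spark.read().parquet(intermediateFilesS3Uri);

        // 90th percentile contentLength grouped by language; the returned dataframe
        // is written by #Let's Data to the dataset's write destination as json.
        return documents
                .groupBy(col("language"))
                .agg(expr("percentile_approx(contentLength, 0.9)").alias("contentLengthP90"))
                .orderBy(col("language"));
    }
}
```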

    iii. Build and Upload

    • Build the code jar: Build the code jar by running the following maven command. This should create a spark-extract-and-map-reduce-1.0-SNAPSHOT-jar-with-dependencies.jar file in the target folder. This is the jar that has the code we wrote above and the dependency jars (the #Let's Data interfaces, google guava, apache commons-lang3 and their dependencies).
    • Upload the jar to S3: Upload the jar we created to an s3 bucket so that we can reference it in our dataset configuration. The code here creates a new bucket and uploads to that bucket (change the name to be unique if creating a new bucket). You can also upload to an existing bucket. A sketch of these commands follows this list.
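A sketch of those commands. The bucket name matches the artifactFileS3Link used later in the dataset configuration, and the build assumes the maven-assembly-plugin is bound to the package phase (otherwise add assembly:single):

```
# Build the jar-with-dependencies into the target folder
mvn clean package

# Create the bucket (skip if using an existing bucket) and upload the jar
aws s3 mb s3://letsdata-spark-demotest
aws s3 cp target/spark-extract-and-map-reduce-1.0-SNAPSHOT-jar-with-dependencies.jar s3://letsdata-spark-demotest/
```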

Step 4: Granting Access

  • Grant #Let's Data access to the S3 jar file: We need to grant access to your #Let's Data IAM user so that the dataset can read the jar file in S3 and the commoncrawl files in the commoncrawl bucket. Let's do that.
    • Find the User details: We need the following identifiers from the logged-in user's data to enable access.

      1. #Let's Data IAM Account ARN: the logged-in user's #Let's Data IAM Account ARN. This is the IAM user that was created automatically by #Let's Data when you signed up. All dataset execution is scoped to this user's security perimeter.
      2. UserId: the logged-in user's User Id. We use the userId as the STS ExternalId to follow Amazon's security best practices. This is an additional identifier (similar to MFA) that limits the chance of someone inadvertently gaining access.

      The console's User Management tab lists your IAM user ARN. You can also find it via CLI.

    • Create an IAM Policy that grants access to the S3 jar file and the commoncrawl files (these are public, but the code will use the IAM User to access them and we check for access during validation). This policy in itself does not grant any access yet - it needs to be attached to an IAM Role, which defines who gets the access described by the policy.
    • Create an IAM Role with a trust policy that trusts the #Let's Data IAM User Account. This allows the #Let's Data IAM User account to assume the role, which effectively grants that IAM User account access to the resources defined in any policies attached to the role. (Currently it does not have any policies attached - we'll attach the policy in the next step.)
    • Attach the IAM policy that we created to the IAM Role from the previous step. A sketch of these three steps follows this list.
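A sketch of the three steps above using the AWS CLI. The policy and role names here are placeholders, and the exact policy statements #Let's Data expects are in the Access Grants docs - substitute your #Let's Data IAM Account ARN and your userId (used as the sts:ExternalId).

```
# 1. Policy granting read access to the jar bucket and the commoncrawl bucket
cat > /tmp/letsdata-access-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::letsdata-spark-demotest",
      "arn:aws:s3:::letsdata-spark-demotest/*",
      "arn:aws:s3:::commoncrawl",
      "arn:aws:s3:::commoncrawl/*"
    ]
  }]
}
EOF

# 2. Trust policy that lets the #Let's Data IAM user assume the role,
#    with the userId as the ExternalId
cat > /tmp/letsdata-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "<your #Let's Data IAM Account ARN>" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "<your userId>" } }
  }]
}
EOF

aws iam create-policy --policy-name letsdata-spark-demo-policy \
    --policy-document file:///tmp/letsdata-access-policy.json
aws iam create-role --role-name letsdata-spark-demo-role \
    --assume-role-policy-document file:///tmp/letsdata-trust-policy.json

# 3. Attach the policy to the role
aws iam attach-role-policy --role-name letsdata-spark-demo-role \
    --policy-arn arn:aws:iam::<your AWS account id>:policy/letsdata-spark-demo-policy
```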

Step 5: Creating the Dataset Configuration

Create the dataset configuration: Let's create the dataset configuration by filling in each property one by one. (Ref: Datasets Docs)

  • Browse the datasets create command help for details about the dataset configuration json, its different sections and properties.
  • Here is the sample dataset configuration json that we'd be filling in, one property at a time. (Ref: Datasets Docs)
  • Update the dataset name - this needs to be an alphanumeric name that hasn't been used for another dataset. No hyphens, underscores or other such characters are allowed. (Ref: Datasets Docs)
  • Update the region - since all our resources are in us-east-1, we'll set this to us-east-1. See the Regions docs to understand how regions can be used for cross-region datasets.
  • Update the accessGrantRoleArn - this is the IAM Role ARN that we created earlier that grants your #Let's Data account access to the common crawl and JAR resources. (Ref: Access Grants Docs)
  • Update the customerAccountForAccess - this is your AWS Account. Why? #Let's Data will create resources such as the Kinesis stream and S3 error records - these will be in the #Let's Data account. We will grant access to these resources to the AWS account in 'customerAccountForAccess'. (Ref: Customer Account For Access Docs)

i. Read Connector Config

Create the read connector section in the dataset configuration. (Ref: Read Connector Docs)

  • connectorDestination: We'd be reading from S3, so the connectorDestination would be S3.
  • bucketName: We'll set the S3 bucketName to commoncrawl
  • bucketResourceLocation: This S3 bucket is public and we've granted access through the IAM Role, so let's set this to Customer
  • readerType: We're reading using Spark so we'll set this to SPARKREADER
  • artifactImplementationLanguage: We implemented in Java so set this to Java
  • artifactFileS3Link: The implementation JAR file is uploaded to the customer s3 bucket - we should set this to the s3 link of the jar s3://letsdata-spark-demotest/spark-extract-and-map-reduce-1.0-SNAPSHOT-jar-with-dependencies.jar
  • artifactFileS3LinkResourceLocation: The implementation JAR file is in the customer AWS account - set this to Customer

ii. Write Connector Config

Create the write connector section in the dataset configuration. (Ref: S3 Spark Write Connector Docs)

  • connectorDestination: We'd be writing to S3, so the connectorDestination would be S3.
  • resourceLocation: We'd like #Let's Data to manage this S3 bucket (create it, scale it, write data to it and grant our AWS Account access) - Let's set this to letsdata.
  • writerType: We're writing using Spark so we'll set this to Spark
  • sparkFileFormat: We want to write the result as json files so set this to json
  • sparkWriteOptions: We want spark to write compressed (gzipped) files so let's set that as a write option: {"compression":"gzip"}

iii. Error Connector Config

LetsData's error handling infrastructure, which automatically captures errors and records them to an error destination such as S3, is not currently enabled for Spark. However, we haven't removed this from the config yet. Let's set this to an S3 connector destination with letsdata as the bucket resource. LetsData will create a bucket, but no errors will be recorded in it.

iv. Compute Engine Config

Create the compute engine section in the dataset configuration. We'd be using the Spark Compute Engine. (Ref: Spark Compute Engine Docs)

  • computeEngineType: We'd be using Spark Compute Engine - set this to 'LAMBDA_AND_SPARK'
  • runSparkInterfaces: Specifies which spark interfaces would be created as tasks and run. We could run just the mapper tasks, just the reducer tasks, or both. Set this to MAPPER_AND_REDUCER since we are running both mapper and reducer tasks.
  • concurrency: We'd like 15 concurrent lambda functions for this dataset, i.e. 15 tasks (each task is for an s3 file) processed concurrently. So set this to 15.
  • memoryLimitInMegabytes: the task lambda function's memory limit in MB. We'll set this to the max allowed 10240, i.e. 10 GB.
  • timeoutInSeconds: the task lambda function's timeout in seconds. We'll set this to the max allowed 900 seconds, i.e. 15 mins.
  • logLevel: the log level for emitted logs - let's set this to DEBUG so that we can see debug log statements as well. In production scenarios we recommend setting this to WARN or ERROR since DEBUG logging can cause increased storage costs.
  • sparkMapperInterfaceClassName: the fully qualified name of the class that implements the SparkMapperInterface (for implementations in Java) - let's set this to com.letsdata.example.CommonCrawlSparkMapper
  • sparkReducerInterfaceClassName: the fully qualified name of the class that implements the SparkReducerInterface (for implementations in Java) - let's set this to com.letsdata.example.CommonCrawlSparkReducer

v. Manifest File Config

We've specified the read connector (where we'd read from and how we'll read), the write connector (where we'd write data to) and the compute engine (how we'd process the files). We haven't yet specified which files we want processed in this dataset. This is done via the manifest file. Let's create it. (Ref: Manifest File Docs)

  • manifestType: We'd be using a text manifest file specified in the dataset configuration - set this to 'S3ReaderTextManifestFile'
  • readerType: We are reading using spark - set this to 'SPARKREADER'
  • fileContents: These are the manifest file's contents - the filenames of the s3 files that we want processed

vi. Final Dataset Config

Here is the final dataset configuration that we'll use:
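The exact section and property names come from the datasets create command help; the sketch below just collects the values chosen above into one place (treat the key nesting as an assumption and the angle-bracket placeholders as values you substitute):

```
{
  "datasetName": "SparkExtractAndMapReduceExample",
  "region": "us-east-1",
  "accessGrantRoleArn": "arn:aws:iam::<your AWS account id>:role/letsdata-spark-demo-role",
  "customerAccountForAccess": "<your AWS account id>",
  "readConnector": {
    "connectorDestination": "S3",
    "bucketName": "commoncrawl",
    "bucketResourceLocation": "Customer",
    "readerType": "SPARKREADER",
    "artifactImplementationLanguage": "Java",
    "artifactFileS3Link": "s3://letsdata-spark-demotest/spark-extract-and-map-reduce-1.0-SNAPSHOT-jar-with-dependencies.jar",
    "artifactFileS3LinkResourceLocation": "Customer"
  },
  "writeConnector": {
    "connectorDestination": "S3",
    "resourceLocation": "letsdata",
    "writerType": "Spark",
    "sparkFileFormat": "json",
    "sparkWriteOptions": { "compression": "gzip" }
  },
  "errorConnector": {
    "connectorDestination": "S3",
    "resourceLocation": "letsdata"
  },
  "computeEngine": {
    "computeEngineType": "LAMBDA_AND_SPARK",
    "runSparkInterfaces": "MAPPER_AND_REDUCER",
    "concurrency": 15,
    "memoryLimitInMegabytes": 10240,
    "timeoutInSeconds": 900,
    "logLevel": "DEBUG",
    "sparkMapperInterfaceClassName": "com.letsdata.example.CommonCrawlSparkMapper",
    "sparkReducerInterfaceClassName": "com.letsdata.example.CommonCrawlSparkReducer"
  },
  "manifestFile": {
    "manifestType": "S3ReaderTextManifestFile",
    "readerType": "SPARKREADER",
    "fileContents": "<the 4 crawl file names, one per line>"
  }
}
```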

Step 6: Create Dataset and Monitor Its Execution

  • Create the dataset on #Let's Data either using the CLI or the console.
  • View the dataset on #Let's Data either using the CLI or the console. Once the dataset is created, it takes ~3 mins to initialize the resources (the dataset is in the INITIALIZING state and no tasks have been created yet). Once the dataset is initialized, it moves to the PROCESSING state - this is when tasks start executing. The console currently just shows the status and not the details of the initialization workflow. However, the datasets view command shows the initialization workflow steps and progress against each. (Some steps can transiently show up as errors and then as success (for example, the build step), so if you see an error, retry every ~30 secs a few times to see if it's transient. If it's not, send us an email at support@letsdata.io.)
  • List the dataset tasks on #Let's Data either using the CLI or the console.
  • See the Docs for details on different aspects of the dataset and how to look at logs, metrics etc.

Step 7: View Results

  • To view the results of the spark dataset, LetsData has created an IAM Role (the dataset's customerAccessRoleArn) and granted the customer's account access to it. Any IAM User in the customer's account with access to the STS AssumeRole API can get temporary, time-limited credentials to access the resources of the dataset.
  • We need the customerAccessRoleArn, createDatetime and writeConnector.bucketName from the dataset to be able to access the bucket files.
  • For this current dataset, the customer's account is '308240606591'. I have an IAM Admin user for this account whose credentials are stored in the ~/.aws/credentials file as the profile 'devAtResonanceIAMUser'. I'll run the following AWS CLI command (sketched below) to get time-limited credentials.
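A sketch of the assume-role call and the follow-up download, with the dataset's values as placeholders (if the role requires an ExternalId, add --external-id):

```
# Get temporary credentials by assuming the dataset's customerAccessRoleArn
aws sts assume-role \
    --profile devAtResonanceIAMUser \
    --role-arn <customerAccessRoleArn from the dataset> \
    --role-session-name view-spark-results

# Export the returned Credentials.{AccessKeyId,SecretAccessKey,SessionToken}, then
# list and download the json.gz result files from the dataset's write bucket
export AWS_ACCESS_KEY_ID=<AccessKeyId> AWS_SECRET_ACCESS_KEY=<SecretAccessKey> AWS_SESSION_TOKEN=<SessionToken>
aws s3 ls s3://<writeConnector bucketName from the dataset>/ --recursive
aws s3 cp s3://<writeConnector bucketName from the dataset>/<result file> .
```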