
Read Connectors

Read connectors are logical connections that are created to read data from read connector destinations. For example, an S3 Read Connector is used to read data from S3.

Read Connectors implement the #Let's Data data interface (the user data handlers) and are primarily responsible for making sense of the records that are read and creating output documents. These output documents are then written to the write connector destination by the write connector.

The #Let's Data infrastructure simplifies data processing for customers. Our promise to customers is: "Focus on the data - we'll manage the infrastructure". #Let's Data implements a control plane for the data tasks, reads from and writes to different destinations, scales the processing infrastructure as needed, and builds in monitoring and instrumentation. However, it does not know what the data is, how it is stored in the data files, or how to read the data and create transformed documents. The customer needs to implement the user data handlers (letsdata-data-interface) that tell us what makes a data record in the files - we'll then send the records to these data handlers, where the user can transform the records into documents that can be sent to the write destination.

The #Let's Data data interfaces are defined in the letsdata-data-interface GitHub repos.

We currently support the following read connector destinations:

  • AWS S3 Read Connector
  • AWS SQS Read Connector
  • AWS Kinesis Read Connector
  • AWS DynamoDB Streams Read Connector
  • AWS DynamoDB Table Read Connector

Common Configuration

The read connectors need a few different pieces of information, which are captured in the readConnector configuration section in the dataset. The readConnector configuration differs across data destinations - for example, the S3 Read Connector requires a bucketName, fileTypes and additional details about what needs to be read, while an SQS Read Connector requires the SQS queue name and additional details for reading data from SQS. However, the following readConnector configuration is common to all readConnector implementations (a configuration sketch follows the list):

  • region: (Optional) The read connector region; it defaults to the dataset's region if unspecified. The read connector region tells LetsData two things - 1./ the region where the read destination (S3 bucket, SQS queue) is located and 2./ the region where the user data handler artifact jar is located. LetsData will create clients for this region to connect to these buckets / queues etc. Supported Regions: [us-east-1, us-east-2, us-west-2, eu-west-1, ap-south-1, ap-northeast-1]. As of now, we require the artifact and the read connector to be in the same region - purely because we haven't separated this configuration yet. Ideally, the artifact and the read connector would each have their own region; this is a minor issue that we will fix in upcoming releases.
  • connectorDestination: The data destination for the connector. This can be S3, SQS or any of the supported read connector destinations.
  • artifactImplementationLanguage: The read connector implementation of the #Let's Data interface is called the artifact and is implemented in Java, Python or Javascript. When the implementation is in Java, the artifact is essentially a JAR file with the implementation of the #Let's Data data interface. For Python and Javascript, the artifact is an ECR image with the #Let's Data interface implementation.
  • artifactFileS3Link: The S3 link of the artifact - the JAR file that implements the required interfaces in Java.
  • artifactFileS3LinkResourceLocation: #Let's Data needs to know how to access the artifactFileS3Link. If it is in the customer account (artifactFileS3LinkResourceLocation: Customer), we'll use the IAM Role specified in the dataset's accessGrantRoleArn to access this JAR file - you'll need to add access to this JAR file in the IAM role's policy document. Otherwise (artifactFileS3LinkResourceLocation: LetsData), we'll use the #Let's Data account for access. See Access Grants Docs for details.
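
For orientation, here is a hypothetical readConnector section showing only these common attributes - the values are placeholders, and the connector-specific attributes (bucket names, queue names, reader types, etc.) are covered under each usecase below:

```json
{
  "readConnector": {
    "connectorDestination": "S3",
    "region": "us-east-1",
    "artifactImplementationLanguage": "Java",
    "artifactFileS3Link": "s3://logfileparserimpl-jar/LogFileParserImpl-1.0-SNAPSHOT.jar",
    "artifactFileS3LinkResourceLocation": "Customer"
  }
}
```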

Usecases and Configuration

Each read connector may implement different types of readers for different usecases. For example, the S3 Read Connector can read from a single S3 file, or read from multiple files (for example, metadata and data files) and create the output record by combining data from these different files. Each usecase may require different readConnector configuration elements. Here are the different Read Connectors, the different usecases currently supported, the configuration details, example interface implementations and manifest file examples.

Single File Reader

  • This reader is used when each data document to be extracted is completely contained in a single file and is created from a single data record in that file. A simple example is a log file where each line is a data record and the extracted document is transformed from each data record (a line in the file).
  • For example, if the S3 bucket that has the data contains 3 files, say s3file_1.gz, s3file_2.gz and s3file_3.gz, then 3 Single File Reader Datatasks would be created, one for each file (Datatask s3file_1.gz, Datatask s3file_2.gz, Datatask s3file_3.gz). They may or may not run concurrently depending on the available concurrency.
  • Each task will sequentially process its file and send each record to the user's data handler, which transforms the record into an extracted document.

Config

  • readerType: The data records in S3 files can be contained in a single file or spread across multiple files. The readerType configuration tells the read connector how to read the S3 files to extract the documents. For the Single File Reader case, set this to 'Single File Reader' - this reads a single type of record from a single file.
  • bucketName: The S3 read connector requires the name of the bucket that contains the files to be read.
  • bucketResourceLocation: The bucket's resource location, i.e. the AWS account that owns the bucket. Set this to Customer if the bucket is public or in the customer account. Set this to LetsData if the bucket was created by Let's Data. #Let's Data uses this to determine how to access the files in the S3 bucket. If the resourceLocation is Customer, we'll use the IAM Role specified in the dataset's accessGrantRoleArn to access the S3 files in the bucket - you'll need to add access to the S3 files in the IAM role's policy document. Otherwise, we'll use the #Let's Data account for access. See Access Grants Docs for details.
  • singleFileParserImplementationClassName: The user data handler needs to implement the SingleFileParser interface, which defines how to parse records from the S3 files. Set this to the fully qualified class name of the SingleFileParser implementation. Not required when an ECR image is being used (Python / Javascript). See the Example Interface Implementation and the Step By Step Example - Uri Extractor. A combined configuration sketch follows this list.
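
Putting the common attributes and the Single File Reader attributes together, a hypothetical readConnector configuration for this usecase could look like the following - the bucket name and parser class name are placeholders:

```json
{
  "readConnector": {
    "connectorDestination": "S3",
    "readerType": "Single File Reader",
    "region": "us-east-1",
    "bucketName": "logfile-data-bucket",
    "bucketResourceLocation": "Customer",
    "singleFileParserImplementationClassName": "com.example.LogFileParserImpl",
    "artifactImplementationLanguage": "Java",
    "artifactFileS3Link": "s3://logfileparserimpl-jar/LogFileParserImpl-1.0-SNAPSHOT.jar",
    "artifactFileS3LinkResourceLocation": "Customer"
  }
}
```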

Implementation

The Single File Reader usecase, as explained earlier, is when all the files are of a single type and the records in the file do not follow a state machine. A simple example is a log file where each line is a data record and the extracted document is transformed from each data record (a line in the file).


In this simple example, you'll only need to implement the SingleFileParser interface. Here is a quick look at the interface methods; the implementation has detailed comments on what each method does and how to implement it:

  • getS3FileType(): The logical name of the file type. For example, we could name the fileType LOGFILE.
  • getResolvedS3FileName(): The file name resolved from the manifest name.
  • getRecordStartPattern(): The start pattern / delimiter of the record.
  • getRecordEndPattern(): The end pattern / delimiter of the record.
  • parseDocument(): The logic to skip, error or return the parsed document.

Here is an example implementation:
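
(The sketch below is illustrative: the method signatures and return types are simplified placeholders, and the authoritative SingleFileParser definition - the exact parameters and result types - lives in the letsdata-data-interface repo.)

```java
// LogFileParserImpl.java - hypothetical sketch of a single file parser for a
// line-oriented log file. Signatures are simplified for illustration; a real
// implementation implements the SingleFileParser interface from letsdata-data-interface.
public class LogFileParserImpl {

    // Logical name of the file type handled by this parser.
    public String getS3FileType() {
        return "LOGFILE";
    }

    // Resolve the actual S3 file name from the name listed in the manifest.
    // In this example the manifest entries are already the relative file names.
    public String getResolvedS3FileName(String manifestFileName) {
        return manifestFileName;
    }

    // Records are single lines: a record starts at the beginning of a line ...
    public String getRecordStartPattern() {
        return "";   // assumption: no explicit start delimiter for line-oriented records
    }

    // ... and ends at the newline delimiter.
    public String getRecordEndPattern() {
        return "\n";
    }

    // Transform one raw record (a log line) into an extracted document, or skip it.
    // The real interface returns a result object that can signal skip / error / success;
    // here, null stands in for "skip" and a JSON string stands in for the parsed document.
    public String parseDocument(String record) {
        if (record == null || record.isBlank()) {
            return null; // skip empty lines
        }
        // Minimal transformation: wrap the raw log line in a JSON document.
        return "{\"logLine\": \"" + record.replace("\"", "\\\"") + "\"}";
    }
}
```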

This example assumes that this code is built as LogFileParserImpl-1.0-SNAPSHOT.jar and uploaded to S3 at s3://logfileparserimpl-jar/LogFileParserImpl-1.0-SNAPSHOT.jar

Manifest File - Example

  • Since there is a single file type, the manifest lists only the log file names (not the FileType:FileName format that we use for the multiple file reader) - see the sample manifest sketched after this list.
  • The S3 bucket may have many folders and files and we only need to process November's logfile data as part of the dataset. We add the relative path (Nov22/) from the bucket root to each individual file.
  • This dataset contains 3 files that would be read. Each file would be a Datatask that would run independently and would log its progress separately.
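
A hypothetical manifest for this dataset would then simply be a listing of the three relative file paths, one per line (the file names are illustrative):

```
Nov22/logfile_2022_11_01.gz
Nov22/logfile_2022_11_02.gz
Nov22/logfile_2022_11_03.gz
```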