datasetName / datasetId | datasetStatus | readConnector | writeConnector | errorConnector | computeEngine | manifestFile | datasetProgress | executionLogs |
---|---|---|---|---|---|---|---|---|
S3 Artifact File Access Grant Role ARN
Select Compute Engine Type
Select DeleteType
Select Tasks to Redrive
Select Task Redrive Policy
Move stopped tasks to:
taskId | taskStatus | taskProgress | taskCheckpoint | taskExecutionLogs | readerDefinition | computeDefinition | JSON |
---|---|---|---|---|---|---|---|
• Each datapoint is the percentage of tasks that completed successfully at that minute, plotted on the left axis. 1.0 means 100% of tasks succeeded, 0.0 means all tasks that completed at that minute failed.
• Plotted on the right axis is the number of tasks that completed at that minute.
• For example, a value of 0.75 with number of tasks = 4 means that 75% of the 4 tasks completing at that minute succeeded and 25% failed.
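As a sketch of how such a datapoint could be derived (the event shape, names and sample data below are illustrative assumptions, not the actual Lets Data schema):

```python
from collections import defaultdict

# Hypothetical per-task completion events as (minute, succeeded) pairs;
# illustrative data only.
completions = [
    (0, True), (0, True), (0, True), (0, False),  # minute 0: 4 tasks, 3 succeed
    (1, True), (1, True),                         # minute 1: 2 tasks, both succeed
]

by_minute = defaultdict(list)
for minute, succeeded in completions:
    by_minute[minute].append(succeeded)

# Left axis: success fraction (1.0 == 100%); right axis: completing-task count.
results = {
    minute: (sum(outcomes) / len(outcomes), len(outcomes))
    for minute, outcomes in by_minute.items()
}
# results[0] == (0.75, 4): 75% of the 4 completing tasks succeeded
```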
• Each datapoint is the average latency of completion for the tasks that completed at that minute.
• For example, a value of 196,926.0 milliseconds means that the tasks completing at that minute took roughly 197 seconds to complete on average.
• This metric can be correlated with the number-of-tasks metric for that minute in the previous graph.
• For example, if 4 tasks completed at that minute and the average latency is 196,926.0 at that minute, then the sample size for the average is the 4 tasks that completed that minute.
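A minimal sketch of the averaging, using made-up latency values chosen to reproduce the example figure above (the record shape is an assumption):

```python
# Hypothetical (minute, latency_ms) completion records; illustrative data only.
completions = [
    (5, 200_000.0), (5, 190_000.0), (5, 198_704.0), (5, 199_000.0),
]

minute = 5
latencies = [ms for m, ms in completions if m == minute]
avg_latency_ms = sum(latencies) / len(latencies)  # 196,926.0 ms on average
sample_size = len(latencies)  # matches the task count in the previous graph
```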
• Each datapoint is the percentage of task checkpoints that completed successfully at that minute, plotted on the left axis.
• 1.0 means 100% of checkpoints succeeded, 0.0 means all checkpoints at that minute failed.
• Plotted on the right axis is the average latency of each checkpoint in milliseconds.
• A task checkpoints periodically, for example after reading every 500 records from the file. A task can create multiple checkpoints in a minute, and multiple tasks run concurrently; this datapoint is the average across all of these checkpoints.
• While this isn't controllable by the user, knowing how much time each checkpoint takes can help with diagnosing task issues. The #Let's Data team closely monitors this metric to find any DB throttling / DB latency issues. We expect this metric to be within the ~25 ms range.
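To make the checkpoint cadence concrete, here is a small sketch; the 500-record interval is the example figure from the text, not a documented Lets Data constant:

```python
def checkpoint_positions(num_records, interval=500):
    """Record offsets at which a task would checkpoint while reading a file.

    The 500-record interval is the illustrative figure from the text,
    not a documented constant.
    """
    return list(range(interval, num_records + 1, interval))

# A task that reads 1250 records checkpoints twice, at offsets 500 and 1000.
positions = checkpoint_positions(1250)
```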
• Each datapoint is the sum, across all tasks, of the number of records processed (red), skipped (green) and errored (blue) at that minute, plotted on the left axis.
• These are the composite records being sent to the write destination - not necessarily the records read from the file.
• For example, if each task reads data from 2 files and produces one composite document (or a skip / error decision) from multiple records from each file, the metric counts these as 1.
• Each datapoint is the sum of the Write Connector Put API calls made by the tasks at that minute, plotted on the left axis.
• We batch the Write Connector Put API calls, so each call may carry multiple records and may partially succeed / fail (failed records are retried until completion or task error).
• A task calls the Write Connector Put API when the buffered record count reaches the batch size or the buffer size in bytes reaches the maximum allowed threshold.
• In a minute, a task can call the Write Connector Put API multiple times, and multiple tasks run concurrently. This is the sum across all these Write Connector Put API calls.
• Users can use the volume and latency graphs (along with the Write Connector Bytes Written and Put Retries graphs) to determine whether the Write Connector scaling needs fine-tuning.
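The flush-on-count-or-bytes behavior described above can be sketched as follows; the class, constant names and threshold values are assumptions for illustration, not the actual Lets Data implementation or limits:

```python
# Assumed thresholds for illustration only - not real Lets Data limits.
MAX_BATCH_RECORDS = 500
MAX_BATCH_BYTES = 1_000_000

class PutBatcher:
    """Illustrative buffer that flushes when either threshold is reached."""

    def __init__(self):
        self.buffer = []
        self.buffered_bytes = 0
        self.put_calls = 0

    def add(self, record: bytes):
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        # Flush when the batch-size record count or byte threshold is hit.
        if len(self.buffer) >= MAX_BATCH_RECORDS or self.buffered_bytes >= MAX_BATCH_BYTES:
            self.flush()

    def flush(self):
        if self.buffer:
            self.put_calls += 1  # one Put API call per flushed batch
            self.buffer.clear()
            self.buffered_bytes = 0

batcher = PutBatcher()
for _ in range(1200):
    batcher.add(b"x" * 100)  # small records: the count threshold trips first
batcher.flush()              # flush the trailing partial batch of 200 records
# batcher.put_calls == 3: batches of 500, 500 and 200 records
```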
• Each datapoint is the average percentage of Write Connector Put API calls that were retried by the tasks.
• A 0.0 value means that there were no retries to the Write Connector (the Write Connector is adequately scaled; one could even check whether scaling down slightly causes any issues).
• In a minute, a task can call the Write Connector Put API multiple times, and multiple tasks run concurrently. This is the average retry percentage across all these task Put API calls.
• A value of 0.25 means that, on average, 25% of the Write Connector calls were retried across the tasks. If there are 4 tasks running concurrently, this is the average of the retry percentage of each task: either each task could be retrying 25% of the time before the call succeeds (possible Write Connector scaling issues), or 1 task is retrying 100% of the time while 3 tasks are retrying 0% (possibly an issue with that task - this is a contrived example and is unlikely).
• Users can use the Put Retries graph (along with the Write Connector Bytes Written and the volume and latency graphs) to determine whether the Write Connector stream scaling needs fine-tuning.
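The averaging caveat in the contrived example is easy to demonstrate: two very different per-task retry distributions produce the same 0.25 datapoint (the lists below are made-up illustrative data):

```python
# Per-task retry fractions for one minute; both distributions average to 0.25.
uniform_tasks = [0.25, 0.25, 0.25, 0.25]  # every task retries 25% of its calls
skewed_tasks = [1.0, 0.0, 0.0, 0.0]       # one task retries 100%, the rest 0%

avg_uniform = sum(uniform_tasks) / len(uniform_tasks)
avg_skewed = sum(skewed_tasks) / len(skewed_tasks)
# Both equal 0.25, so the graph alone cannot distinguish the two cases.
```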
• Each datapoint plotted on the left axis is the (avg, min and max) latency of extraction of the record by the user handlers (readers and parser).
• Plotted on the right axis is the sample count for the latency metric.
• This is pure CPU work that the parsers and readers do on the bytes from the file to extract the records and create composite records (plus some work that the system does to put them into buffers etc).
• This may be a good metric for finding performance issues with the parser; we expect these latencies to be < 10 ms (min, avg) and < 30 ms (max, for example for large documents).
• Each datapoint plotted on the left axis is the bytes read by the readers from S3 (in KBs).
• In a minute, the readers can read from S3 many times and there could be multiple readers in a task (one per file type). There could be many tasks running concurrently. This is the average KBs read by all readers across all concurrently running tasks.
• Each datapoint plotted on the right axis is the bytes written by the task to Write Connector (in KBs).
• In a minute, the tasks can write to Write Connector multiple times. There could be many tasks running concurrently. This is the average KBs written per min across all concurrently running tasks.
• Users can look at these metrics to reason about the system's throughput and debug any issues that may arise from network read / write.
• We expect max / avg network throughput of XYZ / ABC from S3 file reads. We expect max / avg network throughput of XYZ / ABC to the Write Connector for each shard (multiply by the number of shards).
• Each datapoint on the left axis (red) is the average latency of reading each individual record from the read destination at that minute.
• For example, a value of 196.0 milliseconds means that the tasks took 196 ms to read the next record from read destination on average at that minute.
• Each datapoint on the right axis (green) is the average of the number of records that were read by the tasks at that minute from the read destination.
• For example, if number of messages is 5000, then tasks read 5000 records on average at that minute from the read destination.
• Each datapoint is the average enqueue time in milliseconds that the readers (tasks) took to enqueue to the compute / writer queue.
• This can be used to diagnose slower downstream components (compute, writers), which can cause the queue to fill up, leaving the reader waiting for space to become available in the queue.
• Each datapoint is the average wait time in milliseconds for a record in the writer queue. This can be used to diagnose the throughput issues.
• For example, a large wait time might require increasing write resources to increase the write throughput.
• Each datapoint is the average dequeue time in milliseconds to dequeue a message from the writer queue. A high dequeue value means that the writer is waiting on an empty queue.
• This is to detect issues where compute may not be enqueuing messages fast enough in cases where compute endpoints are not adequately scaled.
• Each datapoint is the average time writers took to pre-process the message. This could be deserialization, error doc creation or no preprocessing at all.
• A high value means that writers are doing additional processing prior to writing to the destination, which could be an issue.
• There isn't much preprocessing done by the writers, so we expect this to be less than 5 milliseconds.
• Each datapoint on the left axis (red) is the average latency of the write for each individual record at that minute. For destinations that support batched writes, this is the latency of the batched call whereas for multi-threaded batching by Lets Data, this is the latency of each message's write call.
• For example, a value of 196.0 milliseconds means that the writers took 196 ms to write each record on average at that minute.
• Each datapoint on the right axis (green) is the average of the number of records that were written by the writers at that minute.
• For example, if number of messages is 5000, then writers processed 5000 records on average at that minute for the write destination.
• Each datapoint on the left axis (red) is the average total latency for each individual record at that minute and includes read, queueing, compute and write times.
• For example, a value of 196.0 milliseconds means that the record took 196 ms total time on average at that minute to be processed end to end by the task.
• Each datapoint on the right axis (green) is the average of the number of records that were processed by the task at that minute.
• For example, if the number of messages is 5000, then the task processed 5000 records on average at that minute.
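Since the total latency includes read, queueing, compute and write times, the example datapoint decomposes into stage latencies; the stage values below are made-up illustrative numbers:

```python
# Hypothetical per-record stage timings (ms); values are illustrative only.
# The end-to-end datapoint is the sum of the stage latencies.
stage_latency_ms = {"read": 40.0, "queueing": 6.0, "compute": 100.0, "write": 50.0}
total_latency_ms = sum(stage_latency_ms.values())  # 196.0 ms end to end
```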
taskId | taskStatus | logs |
---|---|---|
taskId | taskStatus | numberOfErrors | errors |
---|---|---|---|
resourceType | resource | eventStartTime | eventEndTime | meteringDimension | meteringUnit | meteringValue | billedStatus |
---|---|---|---|---|---|---|---|
fullName | emailAddress | phone | userRole | userStatus | JSON |
---|---|---|---|---|---|
Full Name
Email Address
Phone
User Role
Email Address
Attribute To Update
Attribute Existing Value
Attribute New Value
Email Address
ccBrand | ccLastFour | ccExpiry | paymentMethodType |
---|---|---|---|
Price Name | Product | Description | Unit Amount |
---|---|---|---|
Start Time | End Time | Status | Due Date | Currency | Tax | Total | Amount Due | Amount Paid | Charge | Pay Now | Invoice Line Items |
---|---|---|---|---|---|---|---|---|---|---|---|
Id Token
Access Token
User Profile