Developer Documentation
# Let's Data : Focus on the data - we'll manage the infrastructure!
Cloud infrastructure that simplifies how you process, analyze and transform data.
SDK Interface
The #Let'sData infrastructure simplifies data processing for customers. Our promise to customers: "Focus on the data - we'll manage the infrastructure".
The customer implements user data handlers that tell us what makes a data record in the files - we'll then send the records to these data handlers, where the user can transform the records into documents that are sent to the write destination. This requires implementing the user data handler interfaces, which are defined in the #Let'sData Interface. Let's look at how to implement these.
The #Let'sData infrastructure implements a control plane for the data tasks, reads from and writes to different destinations, scales the processing infrastructure as needed, and builds in monitoring and instrumentation. However, it does not know what the data is, how it is stored in the data files, or how to read it and create the transformed documents. This is where the #Let'sData infrastructure requires the customer code.
#Let'sData Data Model
Documents
#Let'sData defines the following data model classes / interfaces for the transformed document implementations that are required in data processing. These are defined in the namespace com.resonance.letsdata.data.documents.interfaces and are explained as follows:
- DocumentInterface: The DocumentInterface is the base interface for any document that can be returned by the user handlers. All other document interfaces and documents either extend or implement this interface.
- ErrorDocInterface: The ErrorDocInterface extends the DocumentInterface and is the base interface for any error documents that are returned by the user handlers. A default implementation is provided at com.resonance.letsdata.data.documents.implementation.ErrorDoc (ErrorDoc on GitHub) and is used by default. Customers can return errors from handlers using this default implementation or write their own error docs and return these during processing.
- SkipDocInterface: The SkipDocInterface extends the DocumentInterface and is the base interface for any skip documents that are returned by the user handlers. A skip document is returned when the processor determines that the record from the file is not of interest to the current processor and should be skipped from being written to the write destination. A default implementation is provided at com.resonance.letsdata.data.documents.implementation.SkipDoc (SkipDoc on GitHub) and is used by default.
- JAVA only interfaces: The following interfaces are for the stateful readers and are currently available for Java implementations only.
- SingleDocInterface: The SingleDocInterface extends the DocumentInterface and is the base interface for any documents that are transformed from single records and returned by the user handlers. The javadoc on the interface explains these in detail. Available in Java only.
- CompositeDocInterface: The CompositeDocInterface extends the DocumentInterface and is the base interface for any documents that are composited from multiple single docs and returned by the user handlers. The javadoc on the interface explains these in detail. Available in Java only.
#Let'sData Interfaces
The #Let'sData implementation defines:
- S3: 2 parser interfaces and 3 reader interfaces for the different reader types (Single File Reader, Single File State Machine Reader, Multiple File State Machine Reader). In the simplest case, the Single File Reader, you'll need to implement only 1 parser interface. In the most complicated case, the Multiple File State Machine Reader, you'll need to implement parser interfaces (1 for each file type) and a reader interface that combines the parsed records from multiple files into a single, composite output document. The details of each interface are defined later, but before we look at the individual interfaces, let's look at the different usecases and the interfaces that need to be implemented for each.
- S3 Spark: S3 Spark uses the Spark Compute Engine to process the docs, reading from and writing to S3. LetsData's Spark interfaces are inspired by the original Google MapReduce paper - we've defined a Mapper interface (SparkMapperInterface) and a Reducer interface (SparkReducerInterface). The Spark Compute Engine configuration requires a runSparkInterfaces attribute - when this value is set to MAPPER_AND_REDUCER, you'll need to implement both the SparkMapperInterface and the SparkReducerInterface. When runSparkInterfaces: MAPPER_ONLY, implement the SparkMapperInterface only. For runSparkInterfaces: REDUCER_ONLY, implement the SparkReducerInterface only.
- SQS: A reader interface for the SQS Queue Reader reader type - you'll need to implement the QueueMessageReader interface.
- Kinesis: A reader interface for the Kinesis Stream Reader reader type - you'll need to implement the KinesisRecordReader interface.
- DynamoDB Streams: A reader interface for the DynamoDB Streams Reader reader type - you'll need to implement the DynamoDBStreamsRecordReader interface.
- DynamoDB Tables: A reader interface for the DynamoDB Table Reader reader type - you'll need to implement the DynamoDBTableItemReader interface.
Parsers
- S3 - SingleFileParser: The parser interface for the Single File S3 reader usecase. This is where you tell us how to parse the individual records from the S3 file. Since this is a single file reader, there is no state machine maintained.
- JAVA only interfaces: The following interfaces are for the stateful readers and are currently available for Java implementations only.
- S3 - SingleFileStateMachineParser: The parser interface for the Single File State Machine reader usecase. This is where you tell us how to parse the different record types from a file. This class maintains the overall state machine for the file parser and creates the extracted documents from the different file records being read from the file.
Readers
- Spark: Spark Compute Engine interfaces - LetsData's Spark interfaces are inspired by the original Google Map Reduce paper - we've defined a Mapper interface (SparkMapperInterface) and a Reducer interface (SparkReducerInterface).
- QueueMessageReader: The reader interface for the SQS Queue Reader reader type. This is where you construct a document from the SQS queue message.
- SagemakerVectorsInterface: The interface for the Sagemaker Compute Engine's document conversions. This is where you construct a document for vectorization from the read record and then a write document from the vectors.
- KinesisRecordReader: The interface for the Kinesis Stream Reader type. This is where you construct a document from the Kinesis record.
- DynamoDBStreamsRecordReader: The interface for the DynamoDB Streams Reader type. This is where you construct a document from the DynamoDB Stream record.
- DynamoDBTableItemReader: The interface for the DynamoDB Table Reader type. This is where you construct a document from the DynamoDB item.
- JAVA only interfaces: The following interfaces are for the stateful readers and are currently available for Java implementations only.
- S3 - SingleFileStateMachineReader: The SingleFileStateMachineReader implements the logic to combine the individual records parsed by the SingleFileStateMachineParser and output them as a composite doc. For example, if we have a DATAFILE that contains 2 types of records, `metadata record, data record`, and the output doc is constructed by combining these two records, then the SingleFileStateMachineReader combines each `{metadata, data}` record pair into an output doc.
- S3 - MultipleFileStateMachineReader: The reader interface for the Multiple File State Machine reader. This is where you tell us how to make sense of the individual records parsed from multiple files. This class maintains the overall state machine across the files and creates the extracted documents from the different file records being read from the files.
Usecase - Implementation Requirements
- Single File Reader: You'll need to implement the SingleFileParser interface.
Single File Reader Example
The Single File Reader usecase, as explained earlier, is when all the files are of a single type and the records in the file do not follow a state machine. A simple example is a log file where each line is a data record and the extracted document is transformed from each data record (line in the file).
In this simple example, you'll only need to implement the SingleFileParser interface. Here is a quick look at the interface methods; the implementation has detailed comments on what each method does and how to implement it:
- getS3FileType(): The logical name of the file type. For example, we could name the fileType LOGFILE.
- getResolvedS3FileName(): The file name resolved from the manifest name.
- getRecordStartPattern(): The start pattern / delimiter of the record.
- getRecordEndPattern(): The end pattern / delimiter of the record.
- parseDocument(): The logic to skip, error, or return the parsed document.
Here is an example implementation:
This example assumes that this code is built as LogFileParserImpl-1.0-SNAPSHOT.jar and uploaded to S3 at s3://logfileparserimpl-jar/LogFileParserImpl-1.0-SNAPSHOT.jar
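A minimal sketch of what such a LogFileParserImpl could look like. The stand-in types below (ParsedDocument, LogLineDocument) and the method signatures are illustrative assumptions - the real SingleFileParser interface in the com.resonance.letsdata.data package has richer signatures (system state, offsets, ErrorDoc/SkipDoc returns), so treat this as a shape, not the SDK's actual API:

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-ins for the SDK types - the real interface lives in the
// com.resonance.letsdata.data package and its signatures differ.
interface ParsedDocument {}

final class LogLineDocument implements ParsedDocument {
    final String line;
    LogLineDocument(String line) { this.line = line; }
}

interface SingleFileParser {
    String getS3FileType();
    String getResolvedS3FileName(String manifestFileName);
    String getRecordStartPattern();
    String getRecordEndPattern();
    ParsedDocument parseDocument(String fileType, byte[] recordBytes);
}

// Example: each line in a log file is one record; the parsed document echoes the line.
final class LogFileParser implements SingleFileParser {
    @Override public String getS3FileType() { return "LOGFILE"; }   // logical file type name
    @Override public String getResolvedS3FileName(String manifestFileName) {
        return manifestFileName;                                    // manifest name maps 1:1 to the S3 file name
    }
    @Override public String getRecordStartPattern() { return ""; }  // records start at the beginning of a line
    @Override public String getRecordEndPattern() { return "\n"; }  // and end at the newline delimiter
    @Override public ParsedDocument parseDocument(String fileType, byte[] recordBytes) {
        String line = new String(recordBytes, StandardCharsets.UTF_8).trim();
        // A real implementation would return a SkipDoc for uninteresting records
        // or an ErrorDoc for malformed ones; this sketch always parses.
        return new LogLineDocument(line);
    }
}
```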
- Spark Reader: Spark Compute Engine configuration requires a runSparkInterfaces attribute - when this value is set to MAPPER_AND_REDUCER, you'll need to implement SparkMapperInterface and SparkReducerInterface. When runSparkInterfaces: MAPPER_ONLY, implement the SparkMapperInterface only. For runSparkInterfaces: REDUCER_ONLY, implement the SparkReducerInterface only.
Spark Reader Example
The user data handlers for the Spark reader need to implement the SparkMapperInterface and SparkReducerInterface interfaces.
Here is a SparkMapperInterface example implementation that extracts a web crawl record as a #Let's Data document. A SparkReducerInterface example implementation reduces the output docs from the mapper to compute the 90th percentile web page content length grouped by web page language and writes these as a JSON document.
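The mapper / reducer logic described above can be sketched in plain Java. Note the real SparkMapperInterface / SparkReducerInterface run inside the Spark Compute Engine and operate on Spark data structures; the CrawlRecord type and the method signatures here are hypothetical stand-ins that only illustrate the map (extract fields of interest) and reduce (group and aggregate) steps:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical input record: a web crawl entry with a language and a content length.
final class CrawlRecord {
    final String language; final long contentLength;
    CrawlRecord(String language, long contentLength) { this.language = language; this.contentLength = contentLength; }
}

final class CrawlMapper {
    // Mapper step: extract the (language, contentLength) pair of interest from each record.
    Map.Entry<String, Long> map(CrawlRecord record) {
        return Map.entry(record.language, record.contentLength);
    }
}

final class PercentileReducer {
    // Reducer step: group mapper output by language and compute the
    // 90th percentile content length (nearest-rank method) per language.
    Map<String, Long> reduce(List<Map.Entry<String, Long>> mapped) {
        Map<String, List<Long>> grouped = mapped.stream().collect(
            Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<String, Long> result = new HashMap<>();
        for (var e : grouped.entrySet()) {
            List<Long> sorted = e.getValue().stream().sorted().collect(Collectors.toList());
            int idx = (int) Math.ceil(0.9 * sorted.size()) - 1;  // nearest-rank 90th percentile index
            result.put(e.getKey(), sorted.get(Math.max(idx, 0)));
        }
        return result;
    }
}
```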
- SQS Queue Reader: You'll need to implement the QueueMessageReader interface.
SQS Queue Reader Example
The user data handlers for the SQS reader need to implement the QueueMessageReader interface.
Here is an example implementation that echoes the SQS message contents as a #Let's Data document.
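A sketch of such an echo handler, using a simplified stand-in for the QueueMessageReader signature (the real interface's parameters and return document types differ, so the names below are assumptions):

```java
import java.util.Map;

// Simplified stand-in for the SDK's QueueMessageReader - the real interface's
// signature (message attributes, ParseDocumentResult, etc.) differs.
interface QueueMessageReader {
    Map<String, Object> parseMessage(String messageId, String messageBody);
}

// Example: echo the SQS message contents as the output document.
final class EchoQueueMessageReader implements QueueMessageReader {
    @Override public Map<String, Object> parseMessage(String messageId, String messageBody) {
        // A real handler could deserialize the body, skip uninteresting messages,
        // or return an ErrorDoc on malformed input; this sketch just echoes.
        return Map.of("documentId", messageId, "body", messageBody);
    }
}
```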
- Sagemaker Vectors Interface: You'll need to implement the SagemakerVectorsInterface interface.
Sagemaker Vectors Example
The user data handlers for the Sagemaker Vectors Interface need to implement the SagemakerVectorsInterface interface.
Here is an example implementation that 1. extracts a page's content and description for vectorization from a Web Crawl Index Document, and 2. constructs a vector document from the Sagemaker vectors for the write destination.
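The two-step shape of that conversion can be sketched as follows. The WebCrawlIndexDocument type and both method names are hypothetical stand-ins for the SDK's SagemakerVectorsInterface, whose actual signatures differ:

```java
import java.util.*;

// Hypothetical read record: a web crawl index entry.
final class WebCrawlIndexDocument {
    final String url, description, content;
    WebCrawlIndexDocument(String url, String description, String content) {
        this.url = url; this.description = description; this.content = content;
    }
}

final class VectorExtractor {
    // Step 1: extract the fields to be vectorized from the read record.
    Map<String, String> extractDocumentElementsForVectorization(WebCrawlIndexDocument doc) {
        return Map.of("description", doc.description, "content", doc.content);
    }

    // Step 2: construct the write document from the vectors Sagemaker returned
    // for each extracted element.
    Map<String, Object> constructVectorDoc(String url, Map<String, double[]> vectors) {
        Map<String, Object> out = new HashMap<>();
        out.put("documentId", url);
        out.putAll(vectors);
        return out;
    }
}
```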
- Kinesis Stream Reader: You'll need to implement the KinesisRecordReader interface.
Kinesis Stream Reader Example
The user data handlers for the Kinesis reader need to implement the KinesisRecordReader interface.
Here is an example implementation that echoes the Kinesis record contents as a #Let's Data document.
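A sketch of the echo handler for the Kinesis case. Kinesis delivers records as byte payloads keyed by a partition key; the method signature below is an assumption (the real KinesisRecordReader also carries stream, shard, and sequence-number details):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Simplified stand-in for the SDK's KinesisRecordReader - the real interface's
// signature (stream/shard ids, sequence numbers, document types) differs.
final class EchoKinesisRecordReader {
    // Decode the record's byte payload as UTF-8 and echo it as the output document.
    Map<String, String> parseRecord(String partitionKey, ByteBuffer data) {
        String body = StandardCharsets.UTF_8.decode(data).toString();
        return Map.of("documentId", partitionKey, "body", body);
    }
}
```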
- DynamoDB Streams Reader: You'll need to implement the DynamoDBStreamsRecordReader interface.
DynamoDB Streams Reader Example
The user data handlers for the DynamoDB Streams reader need to implement the DynamoDBStreamsRecordReader interface.
Here is an example implementation that echoes the DynamoDB stream record contents as a #Let's Data document.
- DynamoDB Table Reader: You'll need to implement the DynamoDBTableItemReader interface.
DynamoDB Table Reader Example
The user data handlers for the DynamoDB Table reader need to implement the DynamoDBTableItemReader interface.
Here is an example implementation that echoes the DynamoDB table item contents as a #Let's Data document.
- JAVA only interfaces: The following interfaces are for the stateful readers and are currently available for Java implementations only.
- Single File State Machine Reader: You'll need to implement the SingleFileStateMachineParser interface (to parse individual records from the file and maintain a state machine) and the SingleFileStateMachineReader interface (to output a composite doc from the parsed records).
Single File State Machine Reader Example
The Single File State Machine Reader usecase, as explained earlier, is when the data document to be extracted is completely contained in a single file and is created from multiple data records in the file. The records in the file follow a finite state machine. A simple example is a data file where a record's metadata and data are written sequentially as two separate records.
The records in each file follow the following state machine - this state machine is encoded in the SingleFileStateMachineParser implementation as it parses the records from the file.
The SingleFileStateMachineReader implements the logic to combine each {metadata, data} record pair into an output doc.
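The {metadata, data} state machine described above can be sketched as follows. RecordType and the processRecord signature are illustrative assumptions - the real SingleFileStateMachineParser / SingleFileStateMachineReader interfaces carry richer state (offsets, error docs, parser state objects):

```java
import java.util.Map;

// The two record types the file alternates between.
enum RecordType { METADATA, DATA }

final class StateMachineReader {
    private RecordType expected = RecordType.METADATA;  // the file starts with a metadata record
    private String pendingMetadata;

    // Feed records in file order; a composite doc is emitted once a
    // {metadata, data} pair is complete, otherwise null (keep reading).
    Map<String, String> processRecord(RecordType type, String record) {
        if (type != expected) {
            // The file violated the state machine - a real handler would return an ErrorDoc here.
            throw new IllegalStateException("expected " + expected + " but read " + type);
        }
        if (type == RecordType.METADATA) {
            pendingMetadata = record;
            expected = RecordType.DATA;      // metadata must be followed by its data record
            return null;
        }
        expected = RecordType.METADATA;      // pair complete; the next record starts a new pair
        return Map.of("metadata", pendingMetadata, "data", record);
    }
}
```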
- Multiple File State Machine Reader: You'll need to implement either the SingleFileParser or the SingleFileStateMachineParser interface for each file. If the records in a file are of a single record type and do not follow a state machine, use the SingleFileParser interface. If there are multiple record types in the file that follow a state machine, use the SingleFileStateMachineParser interface. You'll also need to implement the MultipleFileStateMachineReader interface - this combines the records returned by the individual file parsers and produces a composite document. It also maintains the state machine across files - you'll add logic to get the next records from each file and combine them into the result doc.
Multiple File State Machine Reader Example
The Multiple File State Machine Reader usecase, as explained earlier, is when the data document to be extracted is created from records in multiple files. A simple example is a dataset where a record's metadata and its data are written to two separate files.
Each file is parsed by its own parser implementation, and the state machine across the files is encoded in the MultipleFileStateMachineReader implementation.
The MultipleFileStateMachineReader implements the logic to combine each {metadata, data} record pair into an output doc.
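The cross-file combination step can be sketched like this: advance both files in lockstep and combine the next record from each into a composite doc. The MultiFileReader class and its method names are hypothetical stand-ins - the real MultipleFileStateMachineReader interface pulls records through the per-file parsers and carries richer state:

```java
import java.util.*;

// Simplified sketch of combining records across two files - the real
// MultipleFileStateMachineReader interface in the SDK has richer signatures.
final class MultiFileReader {
    private final Iterator<String> metadataRecords;  // records parsed by the metadata file's parser
    private final Iterator<String> dataRecords;      // records parsed by the data file's parser

    MultiFileReader(List<String> metadataRecords, List<String> dataRecords) {
        this.metadataRecords = metadataRecords.iterator();
        this.dataRecords = dataRecords.iterator();
    }

    // Advance both files in lockstep and combine each pair into a composite doc;
    // returns null when either file is exhausted.
    Map<String, String> nextCompositeDoc() {
        if (!metadataRecords.hasNext() || !dataRecords.hasNext()) return null;
        return Map.of("metadata", metadataRecords.next(), "data", dataRecords.next());
    }
}
```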