
# Read Connector - Manifest

The manifest file defines the amount of work for a dataset - how the read data source and the reader type are mapped to #Let's Data tasks, and any additional details around the read destination's execution environment.
In simple words, manifest files define the source data that the dataset would be reading (for example, the S3 files that should be read) and how that data maps to logical task units (for example, S3 fileType to file mappings for dataset tasks)

  • S3 Read Connector: The S3 Read Connector manifest file would define:
    • what files in the S3 bucket need to be read as part of this dataset
    • each file's filetype
    • the mapping between files of different filetypes (for example, metadata_file1.gz maps to data_file1.gz, metadata_file2.gz maps to data_file2.gz, and so on). In this example, we create a manifest file that specifies the file types (metadata, data), the individual files that need to be read, and their mappings (metadata_file1.gz -> data_file1.gz), (metadata_file2.gz -> data_file2.gz). This manifest file becomes the complete list of data in the dataset that would be processed by the read connector. Each line in the manifest file becomes a task.
  • SQS Read Connector: The SQS Read Connector manifest file would define:
    • Number of concurrent tasks to run: this is currently set internally by default to the computeEngine Lambda concurrency.
    • Task stop conditions: User-defined criteria to determine the end of the queue so that the SQS queue reader tasks can complete
  • Kinesis Read Connector: The Kinesis read connector creates a task for each available shard. In addition, the Kinesis Read Connector manifest file would define:
    • Task Start From conditions: The user can specify whether to start reading the stream from the Earliest record or from the Latest record.
    • Task stop conditions: User-defined criteria to determine the end of the stream so that the Kinesis stream reader tasks can complete
  • DynamoDB Streams Read Connector: The DynamoDB Streams read connector creates a task for each available shard. In addition, the DynamoDB Streams Read Connector manifest file would define:
    • Task Start From conditions: The user can specify whether to start reading the stream from the Earliest record or from the Latest record.
    • Task stop conditions: User-defined criteria to determine the end of the stream so that the DynamoDB Streams reader tasks can complete
  • DynamoDB Table Read Connector: The DynamoDB Table read connector creates the number of tasks specified in the manifest, which scan the DynamoDB table in parallel. It also allows a readerFilterExpression to filter the items during the scan. The DynamoDB Table Read Connector manifest file would define the following (a hedged sketch appears after this list):
    • numReaderTasks: Number of tasks for the parallel scan.
    • Task stop conditions: User-defined criteria to either stop after a single scan or continue the scan until explicitly stopped.
    • Scan Filter Expressions: User-defined criteria to filter the items during a scan.
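
As an illustration only, here is a minimal sketch of what a DynamoDB Table Read Connector manifest could capture. numReaderTasks and readerFilterExpression are named above; the enclosing json shape, the stopAfterSingleScan attribute name, and the filter expression value are assumptions for this sketch, not the confirmed schema:

```json
{
    "numReaderTasks": 5,
    "readerFilterExpression": "attribute_not_exists(processedAt)",
    "stopAfterSingleScan": true
}
```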

S3 Manifest File Format


The format of each line (Datatask) in the manifest file is as follows (illustrative sketches appear after this list):
  • SingleFileReader / SparkReader / SingleFileStateMachineReader: Since there is only a single file being read, the fileTypes are not specified in the manifest file. The manifest file is simply the fileNames (one per line) that need to be read by the tasks.
  • MultipleFileStateMachineReader: Since the reader is parsing different file types, each task defines the fileType-fileName mapping for the task's readers/parsers.
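To make the two line formats concrete, here are hedged sketches. The file names are made up, and for the multiple-file format the delimiter between the fileType:fileName pairs on a line is an assumption, not the confirmed syntax.

SingleFileReader / SparkReader / SingleFileStateMachineReader - one fileName per line:

```
logfile_1.gz
logfile_2.gz
logfile_3.gz
```

MultipleFileStateMachineReader - fileType:fileName pairs, one task per line:

```
metadata:metadata_file1.gz|data:data_file1.gz
metadata:metadata_file2.gz|data:data_file2.gz
```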

Manifest File Examples

For the examples in the docs, the example manifest files could be as follows (hedged sketches of each appear after this list):
  • Single File Reader

    • Since there is a single file type, the manifest lists only the log file names (not the FileType:FileName format that we use for the multiple file reader)
    • The S3 bucket may have many folders and files and we only need to process November's logfile data as part of the dataset. We add the relative path (Nov22/) from the bucket root to each individual file.
    • This dataset contains 3 files that would be read. Each file would be a Datatask that would run independently and would log its progress separately.
  • Spark Reader

    • Since there is a single file type, the manifest lists only the log file names (not the FileType:FileName format that we use for the multiple file reader)
    • The S3 bucket may have many folders and files and we only need to process November's logfile data as part of the dataset. We add the relative path (Nov22/) from the bucket root to each individual file.
    • This dataset contains 3 files that would be read. Each file would be a Datatask that would run independently and would log its progress separately.
  • Single File State Machine Reader

    • Since there is a single file type, the manifest lists only the log file names (not the FileType:FileName format that we use for the multiple file reader)
    • The files are at the bucket root, so no relative path is specified
    • This dataset contains 3 files that would be read. Each file would be a Datatask that would run independently and would log its progress separately.
  • Multiple File State Machine Reader

    • We're reusing the reader filetypes (logical names) that we defined earlier
    • The S3 bucket may have many folders and files - we only need to process June's metadata and data as part of the dataset. We add the relative paths (June2022/metadata) and (June2022/data) to the metadata and data files respectively.
    • Each line implicitly specifies the mapping of a metadata file to its corresponding data file (metadata_file1 -> data_file1, metadata_file2 -> data_file2, metadata_file3 -> data_file3). The reader would use this implicit mapping to extract documents from these files.
    • This dataset contains 6 files that would be read. Each file pair would be a Datatask (3 tasks in total) that would run independently and would log its progress separately.
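
Hedged sketches of these example manifests follow. The log file names are made up, and the pair delimiter in the multiple-file format is an assumption:

Single File Reader / Spark Reader (3 log files under Nov22/):

```
Nov22/logfile_1.gz
Nov22/logfile_2.gz
Nov22/logfile_3.gz
```

Single File State Machine Reader (files at the bucket root):

```
logfile_1.gz
logfile_2.gz
logfile_3.gz
```

Multiple File State Machine Reader (3 metadata-data pairs, one Datatask per line):

```
metadata:June2022/metadata/metadata_file1.gz|data:June2022/data/data_file1.gz
metadata:June2022/metadata/metadata_file2.gz|data:June2022/data/data_file2.gz
metadata:June2022/metadata/metadata_file3.gz|data:June2022/data/data_file3.gz
```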

Manifest File Types

When a dataset is created, manifest files need to be specified so that #Let's Data knows what data needs to be processed.

A manifest file can be specified in two different ways:

  • inline as text in the dataset json
  • as a text file in an S3 bucket with the link to the S3 file in the dataset json

The dataset json is a json document that is stored in the database; we therefore limit the size of the document to a few hundred kilobytes. With this in mind, specifying a large manifest file as inline text in the dataset json is not recommended. The rule of thumb we follow: if the manifest file text is greater than 164 KB or 4096 lines (each line is a task, so 4096 tasks), we recommend using the S3 link to specify the manifest file.

Hitting these scale limits is not uncommon across use cases. For example, the Index Web Crawl use case that we tested the system with has a manifest file of 80,000 tasks with a total size of ~24.95 MB, and it is specified as a link to a text file in S3.

Here are example jsons for specifying the manifest file as inline text and as an S3 link.

Here are the attributes for the SingleFileReader / SingleFileStateMachineReader / SparkReader inline manifest file definition (a sketch putting them together follows the list):

  • region: Optional - The manifest file region; set to the dataset's region if unspecified. The manifest file region is where the manifest file is located. This should ideally be the same as the artifact and read connector regions, but a different region isn't a problem. LetsData will create manifest file clients in the manifest file region to access the manifest file. Supported Regions: [us-east-1, us-east-2, us-west-2, eu-west-1, ap-south-1, ap-northeast-1]
  • fileContents: The text contents of the manifest; each set of files is specified on a single line
  • manifestType: The manifest type, since this is the inline text manifest definition, we'll specify S3ReaderTextManifestFile
  • readerType: The reader type, copied from the read connector. Set this to SINGLEFILEREADER, SPARKREADER or SINGLEFILESTATEMACHINEREADER
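
Putting these attributes together, an inline text manifest definition could look roughly like the sketch below. The four attributes are the ones documented above; the enclosing manifestFile element name and the file names are assumptions for this sketch. The S3 link variant would swap the inline fileContents for a link to the manifest file in S3 (its exact attribute name isn't shown here):

```json
{
    "manifestFile": {
        "region": "us-east-1",
        "fileContents": "Nov22/logfile_1.gz\nNov22/logfile_2.gz\nNov22/logfile_3.gz",
        "manifestType": "S3ReaderTextManifestFile",
        "readerType": "SINGLEFILEREADER"
    }
}
```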