Import Framework
Summary
Give users the ability to bulk import raw data from a single CSV file into Concourse.
Motivation
This feature will enable more testing because it will be easier to seed Concourse with large and diverse data sets. It will also increase Concourse usage by third-party applications because they will be able to bring data from their existing stores into Concourse easily.
Use Cases
The Import Framework is designed to provide an out of the box pipeline to easily bring raw data into Concourse while also defining an extensible framework that can be used to implement custom logic for specific imports.
Generic CSV import
A user wants to take a CSV file (with or without headers) and atomically import the data into Concourse using a CLI. We chose CSV as the supported format because other formats (e.g. xls, sql) can be converted to it and many languages support it. In the future, we may provide out-of-the-box solutions for additional formats.
Importing into new records
A user wants to import individual lines of a CSV file into a new record.
Importing into existing records
A user wants to import individual lines of a CSV file into one or more existing records by specifying a resolveKey. For each line, the importer should find all the records that have a key mapping to a value equal to one of the values associated with the resolveKey in the line. The importer should then import all the data in the line into each of those records.
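The resolution steps above can be sketched as follows. This is a minimal in-memory illustration, not the Concourse client API: `ResolveKeySketch`, `find`, and `importLine` are hypothetical names, and the sketch assumes that a line with no matching records simply imports nothing.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a data store (not the Concourse client). */
final class ResolveKeySketch {

    // record id -> key -> values stored under that key
    private final Map<Long, Map<String, Set<String>>> records = new LinkedHashMap<>();

    void add(long record, String key, String value) {
        records.computeIfAbsent(record, r -> new HashMap<>())
               .computeIfAbsent(key, k -> new LinkedHashSet<>())
               .add(value);
    }

    Set<String> get(long record, String key) {
        Map<String, Set<String>> data = records.get(record);
        if (data == null || !data.containsKey(key)) {
            return Collections.emptySet();
        }
        return data.get(key);
    }

    /** Returns every record in which {@code key} maps to {@code value}. */
    Set<Long> find(String key, String value) {
        Set<Long> matches = new LinkedHashSet<>();
        for (Map.Entry<Long, Map<String, Set<String>>> e : records.entrySet()) {
            Set<String> values = e.getValue().get(key);
            if (values != null && values.contains(value)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    /**
     * Resolves the target records using every value the line holds for
     * resolveKey, then copies all data in the line into each target.
     * Returns the resolved records (empty if nothing matched).
     */
    Set<Long> importLine(Map<String, List<String>> line, String resolveKey) {
        Set<Long> targets = new LinkedHashSet<>();
        List<String> resolveValues = line.get(resolveKey);
        if (resolveValues != null) {
            for (String value : resolveValues) {
                targets.addAll(find(resolveKey, value));
            }
        }
        for (long record : targets) {
            for (Map.Entry<String, List<String>> e : line.entrySet()) {
                for (String value : e.getValue()) {
                    add(record, e.getKey(), value);
                }
            }
        }
        return targets;
    }
}
```

A line whose resolveKey is "customer_id" with the value 678 would be written into every record that already maps "customer_id" to 678, and into no others.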
Importing data that contains links to existing records
A user wants to import data that contains links to existing records by declaring or transforming raw data into a resolvable link. When the importer encounters a resolvable link, similar to a resolveKey, it finds all the records that have the key specified in the resolvable link mapping to the value specified in the resolvable link on the line. The importer should then link the key associated with the resolvable link on the line to all those resolved records.
Example
Let's assume I have customer data in Concourse. Each customer record has a "customer_id" key that maps to a numeric value. Now I want to import account data, where each account record has a foreign key "reference" to the customer_id of the customer that owns the account. So, I want to import that data and link it to the appropriate customer record that already exists. Using the Import Framework, I should be able to do this by specifying a resolvable link in my raw data.
account_number | customer | account_type |
---|---|---|
12345 | @<customer_id>@678@<customer_id>@ | SAVINGS |
This means that I want to import the line into a new record and link the "customer" key in the record to all the records that have a "customer_id" key that maps to 678.
We should provide a utility and mechanism in the framework for the user to easily convert raw data to a resolvable link without having to know the appropriate format.
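One possible shape for such a utility is sketched below. It builds and recognizes the `@<key>@value@<key>@` format shown in the example table. The class name `ResolvableLinks` and its methods are hypothetical, and the sketch assumes keys never contain the `>` character.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the name and methods are illustrative only. */
final class ResolvableLinks {

    // Matches @<key>@value@<key>@ and requires the same key on both sides.
    private static final Pattern FORMAT =
            Pattern.compile("^@<([^>]+)>@(.*)@<\\1>@$");

    private ResolvableLinks() {}

    /** Wraps a raw value so the importer can treat it as a resolvable link. */
    static String toResolvableLink(String key, String rawValue) {
        return "@<" + key + ">@" + rawValue + "@<" + key + ">@";
    }

    /**
     * Returns the (key, value) pair if the data is a resolvable link,
     * or an empty Optional if it is ordinary raw data.
     */
    static Optional<Map.Entry<String, String>> parse(String data) {
        Matcher m = FORMAT.matcher(data);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new SimpleImmutableEntry<>(m.group(1), m.group(2)));
    }
}
```

With this sketch, `ResolvableLinks.toResolvableLink("customer_id", "678")` yields the cell value used in the example, so the user only supplies the key and the raw value. Because the conversion returns a new string, the raw data itself is never modified.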
Important Questions and Semantics
- This framework should be independently versioned.
  - This means the first release will depend on concourse 0.3.0. This should be compatible with concourse 0.4.0 pre-release versions since there are no breaking API changes between the two.
- This framework should only rely on the client. We should not have any server logic that handles file imports.
- The scope of the import framework is intentionally limited to a single file. A future release may expand the scope.
  - There are lots of nuances involved with importing multiple files:
    - Are all files imported in a single transaction?
    - What happens if the import fails before terminating, do we need to add resume logic?
    - Do we need to use map/reduce to improve the performance of the import process?
- The entire file should be imported as a single transaction
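The single-transaction requirement can be illustrated with an all-or-nothing sketch. The `Store` class below is a hypothetical in-memory stand-in with `stage`, `write`, `commit`, and `abort` methods chosen for illustration; it is not the Concourse client, and treating an empty line as a failure is only a placeholder for whatever validation a real import performs.

```java
import java.util.ArrayList;
import java.util.List;

final class AtomicImportSketch {

    /** Hypothetical transactional target backed by in-memory lists. */
    static final class Store {
        private final List<String> committed = new ArrayList<>();
        private List<String> staged = null;

        void stage() {
            staged = new ArrayList<>();
        }

        void write(String line) {
            if (staged == null) {
                throw new IllegalStateException("no transaction in progress");
            }
            staged.add(line);
        }

        void commit() {
            committed.addAll(staged);
            staged = null;
        }

        void abort() {
            staged = null;
        }

        List<String> committed() {
            return new ArrayList<>(committed);
        }
    }

    private AtomicImportSketch() {}

    /** Imports every line or none: any failure discards all staged writes. */
    static boolean importAll(Store store, List<String> lines) {
        store.stage();
        try {
            for (String line : lines) {
                if (line.trim().isEmpty()) {
                    // Placeholder failure condition for this sketch.
                    throw new IllegalArgumentException("empty line");
                }
                store.write(line);
            }
            store.commit();
            return true;
        } catch (RuntimeException e) {
            store.abort();
            return false;
        }
    }
}
```

The point of the design is that nothing becomes visible in the store until the last line has been processed successfully, so a failed import leaves no partial data behind.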
Implementation Plan
Feature | Description | Notes |
---|---|---|
Config Framework | Create an independently versioned (IV) framework to handle reading/writing Concourse configuration files | https://github.com/cinchapi/concourse-config |
CLI Framework | Create an IV framework to facilitate the creation of client side CLIs that interact with Concourse | https://github.com/cinchapi/concourse-cli |
Test Framework | Create an IV framework that provides a mechanism for spinning up Concourse test environments | |
Convert raw string data to appropriate Java objects | Write shareable logic that can be used to convert raw string data to the appropriate Java object based on sensible rules | |
Abstract logic to import into new records | Create logic to import a single line/group of data into a new record | |
Abstract logic to import into existing records | Create logic to import a single line/group of data into one or more existing records by using the resolveKey to find the appropriate records | |
Utility for converting raw data to a resolvable link | Create some utility method(s) to convert raw data into the format that specifies a resolvable link. This utility should not alter the raw data, but it should convert it in memory and pass it off to the rest of the import logic. | |
Abstract logic to import resolvable links | Create logic to handle resolvable links | |
Abstract wrapping of imports in a single transaction | Make sure that the import happens in a single transaction | |
Generic CSV importer | Write an importer that can handle a generic CSV file with headers | |
Generic CSV import CLI | Write a CLI that uses the generic CSV importer | |
Package CSV import CLI as standalone app | Package the generic CSV importer and CLI as an application that can be run from anywhere (possibly on Windows too!) | |
Package CSV import CLI with concourse-server | Package the generic CSV importer and CLI with concourse-server (similar to what is done with CaSH). | |
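
As an illustration of the "convert raw string data to appropriate Java objects" item, the sketch below applies one possible set of rules. The rules themselves are assumptions, since the plan only calls for "sensible rules": boolean literals become `Boolean`, whole numbers become `Integer` or `Long`, plain decimals become `Double`, and everything else stays a `String`. The class name `RawValues` is hypothetical.

```java
final class RawValues {

    private RawValues() {}

    /** Converts a raw string to a Boolean, Integer, Long, Double, or String. */
    static Object convert(String raw) {
        String s = raw.trim();
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return Boolean.parseBoolean(s);
        }
        if (s.matches("[+-]?\\d+")) {
            try {
                return Integer.parseInt(s);
            } catch (NumberFormatException tooBigForInt) {
                try {
                    return Long.parseLong(s);
                } catch (NumberFormatException tooBigForLong) {
                    return raw; // assumed fallback: keep oversized numbers as text
                }
            }
        }
        if (s.matches("[+-]?\\d+\\.\\d+")) {
            return Double.parseDouble(s);
        }
        return raw;
    }
}
```

Under these assumed rules, the `account_number` value 12345 from the earlier example would be stored as an `Integer`, while `SAVINGS` would remain a `String`.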