Data Science Database
Overview
Data science is an evolving field with both theoretical and engineering bottlenecks. In this assignment you will work on one engineering problem: managing and organizing source data, training data, training run specifications, and trained models. We are building a generic backend that is platform agnostic from a data scientist’s perspective. In other words, the backend does not need to know the exact types of models, their formats, their parameters, etc.
The goal for this project is to be the one open source, public solution for data scientists everywhere. This means you will have to account for multiple users who do not want their data exposed to other users, although some of them may want to make some of their data public. Users should also be able to comment on other users’ data science projects. We will ignore teams for now.
Technology
This system will be built atop Cassandra (http://cassandra.apache.org/).
Concepts
You must decide how to model and store all of the following concepts and the relationships between them.
Data Science Project (or just Project)
A project is one of the basic units. A project can have multiple pieces of source data. A project can have multiple trained models. A project can have multiple training datasets. A project can have multiple training runs.
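The one-to-many relationships above can be sketched in plain Python before committing to a Cassandra schema. This is only an illustrative in-memory model, not a prescribed design: the class name Project, every field name, and the choice of UUIDs as identifiers are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List
from uuid import UUID, uuid4


@dataclass
class Project:
    """Hypothetical project record holding the IDs of the items it owns."""
    owner_id: UUID
    name: str
    project_id: UUID = field(default_factory=uuid4)
    # Each list models one "a project can have multiple ..." relationship.
    source_data_ids: List[UUID] = field(default_factory=list)
    training_dataset_ids: List[UUID] = field(default_factory=list)
    training_run_ids: List[UUID] = field(default_factory=list)
    trained_model_ids: List[UUID] = field(default_factory=list)


project = Project(owner_id=uuid4(), name="demo")
project.source_data_ids.append(uuid4())
project.training_run_ids.extend([uuid4(), uuid4()])
```

In Cassandra, a relationship like this is often stored as a separate table keyed by the parent project's ID rather than as a list on the project itself; which layout fits best is part of the modeling decision this assignment asks for.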
Training Run
A training run is a single run that outputs one or more trained models. It has a single training run specification. A model is a blob (binary large object). We don’t need to know the exact details of the blob. For this assignment you can just generate an arbitrary binary object.
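A minimal sketch of a training run with placeholder model blobs might look like the following. The names TrainedModel, TrainingRun, and make_placeholder_model are hypothetical, and random bytes from os.urandom stand in for a real serialized model.

```python
import os
from dataclasses import dataclass, field
from typing import List
from uuid import UUID, uuid4


@dataclass
class TrainedModel:
    """Hypothetical trained model: an opaque blob plus an identifier."""
    blob: bytes
    model_id: UUID = field(default_factory=uuid4)


@dataclass
class TrainingRun:
    """Hypothetical training run: one specification, one or more models."""
    spec_id: UUID
    models: List[TrainedModel]
    run_id: UUID = field(default_factory=uuid4)

    def __post_init__(self) -> None:
        # Enforce the "one or more trained models" rule.
        if not self.models:
            raise ValueError("a training run must output at least one model")


def make_placeholder_model(size: int = 1024) -> TrainedModel:
    """Create an arbitrary binary object of `size` random bytes."""
    return TrainedModel(blob=os.urandom(size))


run = TrainingRun(
    spec_id=uuid4(),
    models=[make_placeholder_model(), make_placeholder_model(16)],
)
```

Because the blob is opaque, the database layer only needs to store and return the bytes unchanged.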
Training Specification
A training specification holds the parameters that tell the system how to run a training run. We don’t need to know the details of these parameters (and they can definitely evolve and change). We can just label them param_1, param_2, and so on. They should be a mix of strings, integers, lists, etc.
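Since the parameters are free-form and may change over time, one possible approach is to treat the whole specification as a schemaless document. The sketch below assumes the specification is serialized to a JSON string before storage; the parameter values are made up for illustration, and storing the result in a single text column is an assumption, not a requirement.

```python
import json

# Hypothetical specification mixing strings, integers, lists, and a nested map.
spec = {
    "param_1": "some string value",
    "param_2": 32,
    "param_3": [0.1, 0.01, 0.001],
    "param_4": {"flag": True},
}

# Serialize to text so new parameters can be added without a schema change.
encoded = json.dumps(spec, sort_keys=True)

# Reading the specification back recovers the original structure.
decoded = json.loads(encoded)
```

The trade-off is that individual parameters cannot be queried directly when the specification is stored as one string, so a design that needs per-parameter lookups would have to model them differently.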
Data Science Algorithm
We don’t need to explicitly store the algorithm. It is embedded inside and transparent to the database.
Users
As was laid out in the overview, we can have multiple users. Each user can have multiple projects, some of which may be public and some of which may be private.
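The public/private rule can be expressed as a small access check. This is a sketch under stated assumptions: ProjectAccess, can_view, and the use of plain strings as user identifiers are all hypothetical, and projects are assumed to be private unless marked public.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectAccess:
    """Hypothetical visibility record for one project."""
    owner: str
    is_public: bool = False  # assumption: private unless stated otherwise


def can_view(project: ProjectAccess, viewer: str) -> bool:
    """A project is visible to its owner, and to everyone if it is public."""
    return project.is_public or project.owner == viewer


private_project = ProjectAccess(owner="alice")
public_project = ProjectAccess(owner="alice", is_public=True)
```

With these two records, can_view(private_project, "bob") is False, while can_view(public_project, "bob") and can_view(private_project, "alice") are both True.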