Neu.ro Tool Integrations

In our experience, nearly all AI development efforts, whether at large enterprises or new startups, begin by spending the first 3-6 months building their first ML pipelines from the available tools. These custom integrations are time-consuming and expensive to produce, can be fragile, and frequently require drastic changes as project requirements evolve.

Frequently, these custom ML pipelines support only a small set of built-in algorithms or a single ML library, and they are tied to each company's infrastructure. Users cannot easily leverage new ML libraries or share their work with a wider community.

Neu.ro facilitates the adoption of robust, adaptable Machine Learning Operations (MLOps) by simplifying resource orchestration, automation, and instrumentation at every step of ML system construction, including integration, testing, deployment, monitoring, and infrastructure management.

To maintain agility and avoid the pitfalls of technical debt, Neu.ro allows for the seamless connection of an ever-expanding universe of ML tools into your workflow.

We cover the entire ML lifecycle, from Data Collection to Testing and Interpretation. All resources, processes, and permissions are managed through the Neu.ro platform, which can be installed and run on virtually any compute infrastructure, whether on-premise or in the cloud of your choice.

Development

The various components of a machine learning workflow can be split into independent, reusable, modular parts that can be pipelined together to create, test, and deploy models.
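
As a purely illustrative sketch (not Neu.ro-specific code), the idea can be expressed in a few lines of Python: each stage is an independent function that can be developed and tested on its own, and a pipeline is simply their composition. All stage names here are hypothetical:

    from functools import reduce
    from typing import Any, Callable, List

    Stage = Callable[[Any], Any]

    def pipeline(stages: List[Stage]) -> Stage:
        # Chain independent stages: each stage's output feeds the next.
        return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

    # Hypothetical stages; each can be developed, tested, and reused on its own.
    def clean(records):
        return [r for r in records if r is not None]

    def featurize(records):
        return [{"length": len(r)} for r in records]

    run = pipeline([clean, featurize])
    print(run(["abc", None, "de"]))  # [{'length': 3}, {'length': 2}]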

Our toolset integrator, Neu.ro Toolbox, contains up-to-date, out-of-the-box integrations with a wide range of open-source and commercial tools required for modern ML/AI development.

For Development, Neu.ro recommends the open-source DVC and the proprietary Pachyderm.

DVC:

DVC is an open-source version control system for machine learning projects. It tracks ML models and data sets, making models shareable and reproducible.

The complete evolution of every ML model can be tracked with full code and data provenance. This guarantees reproducibility and makes it easy to switch back and forth between experiments.

DVC also makes experiments easy to reproduce by consistently maintaining the combination of input data, configuration, and code that was originally used to run each experiment.
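
As an illustration, DVC's Python API can retrieve a file exactly as it existed for a past experiment. The repository URL, file path, and revision name in this sketch are placeholders:

    import dvc.api

    # Open "data/train.csv" exactly as it existed at the Git revision
    # "baseline-experiment" (a tag, branch, or commit in the project repo).
    with dvc.api.open(
        "data/train.csv",                              # hypothetical path
        repo="https://github.com/example/ml-project",  # hypothetical repo
        rev="baseline-experiment",                     # hypothetical revision
    ) as f:
        train_data = f.read()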

Finally, DVC has full support for data branching: it provides instantaneous Git branching, even with large files. Users can harness the full power of Git branches to try out different ideas instead of resorting to sloppy file suffixes and comments in code. Data is not duplicated: one file version can belong to dozens of experiments. Users can create as many experiments as they want, switch between them instantaneously, and keep a saved history of all attempts.
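
One way to observe this deduplication is through DVC's Python API: dvc.api.get_url resolves a file to its content-addressed location in remote storage, so two experiments sharing the same file version resolve to the same URL. The file path and branch names in this sketch are hypothetical, and it assumes it runs inside a DVC project:

    import dvc.api

    # Both revisions point at the same content-addressed object in the
    # DVC remote when the file contents are identical, so the data is
    # stored only once even though it belongs to two experiments.
    url_a = dvc.api.get_url("model.pkl", rev="experiment-a")
    url_b = dvc.api.get_url("model.pkl", rev="experiment-b")
    print(url_a == url_b)  # True if the two branches share this file version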

Pachyderm:

Pachyderm is a leading ML data management system that enables explainable, repeatable, and scalable Machine Learning (ML) and Artificial Intelligence (AI).

The Pachyderm platform brings together version control for data and the tools to build scalable, end-to-end ML/AI pipelines, while empowering users to develop their code in any language, framework, or tool of their choice. Pachyderm has proven to be an ideal foundation for teams looking to use ML and AI to solve real-world problems reliably.

Core concepts:

  • Datum: In Pachyderm, a datum is the smallest indivisible unit of computation within a job. A job can have one, many, or no datums. Each datum is processed independently with a single execution of the user code, and the results of all datums are then merged together to create the final output commit.
  • Repository: A Pachyderm repository is a location where you store your data inside Pachyderm. Similar to other version-control systems, a Pachyderm repository tracks all the changes to its contents, and creates a history of data modifications that you can access and review. You can store virtually any type of file in a Pachyderm repo.
  • Pipeline: Pachyderm pipelines are the computational component of the Pachyderm platform. They are responsible for reading data from a specified source, such as a Pachyderm repo, transforming it according to the pipeline configuration, and writing the result to an output repo. Pachyderm pipelines can be easily chained together to create a directed acyclic graph (DAG).
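
To make these concepts concrete, the following minimal sketch uses the python_pachyderm client (the pre-7.x Python API); the repo, pipeline, file, and image names are hypothetical. It creates a repository, commits a file, and defines a pipeline whose glob pattern "/*" turns each top-level file into its own datum:

    import python_pachyderm

    client = python_pachyderm.Client()  # connects to localhost:30650 by default

    # A repository to version the raw input data.
    client.create_repo("raw-data")

    # Committing a file creates a new version of the repo's contents.
    with client.commit("raw-data", "master") as commit:
        client.put_file_bytes(commit, "/sample.txt", b"hello pachyderm")

    # A pipeline that reads from "raw-data" and writes to its output repo.
    # With glob="/*", each top-level file is processed as a separate datum,
    # and per-datum results are merged into the final output commit.
    client.create_pipeline(
        "word-count",
        transform=python_pachyderm.Transform(
            cmd=["sh"],
            image="alpine:3.18",
            stdin=[
                "for f in /pfs/raw-data/*; do"
                ' wc -w "$f" > /pfs/out/$(basename "$f").count;'
                " done"
            ],
        ),
        input=python_pachyderm.Input(
            pfs=python_pachyderm.PFSInput(glob="/*", repo="raw-data")
        ),
    )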