Overview of DC/OS Data Science Engine features

DC/OS Data Science Engine supports and develops the interactive computing products Jupyter Notebook and JupyterHub.

The Jupyter Family

The data science community has converged around Jupyter Notebook, an interactive computing notebook developed and supported by Project Jupyter, as the de facto user interface for cloud computing. Jupyter Notebook has proven to be a highly effective tool for prototyping and exploration. DC/OS Data Science Engine is the next-generation interface for Project Jupyter, offering the tools that Jupyter Notebook users already know: terminals, text editors, notebooks, file browsers, data file viewers, and other components, arranged in tabbed work areas.

DC/OS Data Science Engine at the Enterprise level

There are several reasons to adopt DC/OS Data Science Engine at the enterprise level.

  • Setup takes too long - While Jupyter Notebook is an easy tool for data scientists to use on their laptops, provisioning it for enterprise-level use is a much more complex task. When data science teams collaborate on large projects, they spend a significant portion of their time installing the software and libraries they need. Version dependencies and operating system incompatibilities make setup slow and painful. It can take data scientists days or weeks just to prepare the working environment needed to run a complex project. DC/OS Data Science Engine shortens the time to deployment for large projects.
  • Security and siloed environments restrain collaboration - Jupyter Notebook is an effective tool for prototyping and exploration. But when data scientists work in a vacuum on their local machines or workstations, cut off from their peers, they cannot easily collaborate and get real-time feedback. In addition, enforcing data security policy across a large number of siloed environments not only slows down data access but also increases the risk of data breaches. These challenges can slow model development and deployment.
  • Training and tuning large data models takes a long time - The new generation of sophisticated models performs better when fed more data. To train these models in a timely manner, data scientists need access to a large pool of compute resources. Distributed and parallel computing tools can present a steep learning curve and prove challenging to integrate and exploit.

What is included in DC/OS Data Science Engine?

DC/OS Data Science Engine works on any infrastructure (cloud, bare metal, and virtual) and includes:

  • Framework lifecycle for upgrades and updates
  • 24/7 Mesosphere engineering support for all components included in the stack
    • DC/OS Data Science Engine
    • Spark and Spark History Server
    • TensorFlow
    • TensorBoard
    • PyTorch
    • MXNet
    • Horovod on Spark
  • Integration to pool CPU and GPU compute resources across the entire cluster
  • Easily configurable resource quotas to dynamically share cluster resources
  • Secure authentication and authorization (AuthN+Z) to the Notebook UI with OpenID Connect
  • Secured access to datasets on Kerberized HDFS and authenticated S3 buckets
  • Pre-installed Python and R kernels
  • Pre-installed Apache Toree kernels (Spark, Scala)
  • Pre-installed popular Python and R libraries
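
From a Python kernel in a notebook, one quick way to see which of the frameworks listed above are importable is a check like the following. This is a minimal sketch, not part of the product: the helper name available_modules and the FRAMEWORK_MODULES mapping are hypothetical, and the module names are the conventional Python import names for each framework, assumed here rather than taken from this page. Nothing is actually imported; the code only asks the import system whether each module can be located.

```python
from importlib.util import find_spec

# Conventional import names for the frameworks listed above.
# These names are an assumption of this sketch, not stated by the product page.
FRAMEWORK_MODULES = {
    "Spark": "pyspark",
    "TensorFlow": "tensorflow",
    "TensorBoard": "tensorboard",
    "PyTorch": "torch",
    "MXNet": "mxnet",
    "Horovod": "horovod",
}


def available_modules(modules):
    """Map each label to True if the import system can locate its module."""
    result = {}
    for label, module_name in modules.items():
        try:
            # find_spec returns None when a top-level module is not installed.
            result[label] = find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Treat lookup errors as "not available" rather than failing.
            result[label] = False
    return result


if __name__ == "__main__":
    for label, found in available_modules(FRAMEWORK_MODULES).items():
        print(f"{label}: {'available' if found else 'not found'}")
```

Because the check does not import the frameworks themselves, it runs quickly and is safe to use in any kernel. A module reported as available can still fail at import time, for example if a native dependency is missing.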