Provision the Microsoft Data Science Virtual Machine.
The Microsoft Data Science Virtual Machine is a Windows Azure virtual machine (VM) image pre-installed and configured with several popular tools that are commonly used for data analytics and machine learning.
The tools included are:
- Microsoft R Server Developer Edition
- Anaconda Python distribution
- Jupyter notebook (with R, Python kernels)
- Visual Studio Community Edition
- Power BI desktop
- SQL Server 2016 Developer Edition
- Machine learning and data analytics tools
  - Computational Network Toolkit (CNTK): A deep learning software toolkit from Microsoft Research.
  - Vowpal Wabbit: A fast machine learning system supporting techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
  - XGBoost: A tool providing fast and accurate boosted tree implementation.
  - Rattle (the R Analytical Tool To Learn Easily): A tool that makes getting started with data analytics and machine learning in R easy, with GUI-based data exploration, and modeling with automatic R code generation.
  - MXNet: A deep learning framework designed for both efficiency and flexibility.
  - Weka: A visual data mining and machine learning software in Java.
  - Apache Drill: A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Supports ODBC and JDBC interfaces to enable querying NoSQL and files from standard BI tools like Power BI, Excel, and Tableau.
- Libraries in R and Python for use in Azure Machine Learning and other Azure services
- Git, including Git Bash, to work with source code repositories such as GitHub and Visual Studio Team Services
- Windows ports of several popular Linux command-line utilities (such as awk, sed, perl, grep, find, wget, and curl), accessible through the command prompt
Doing data science involves iterating on a sequence of tasks:
- Finding, loading, and pre-processing data
- Building and testing models
- Deploying the models for consumption in intelligent applications
Data scientists use a variety of tools to complete these tasks. It can be quite time consuming to find the appropriate versions of the software, and then download and install them. The Microsoft Data Science Virtual Machine can ease this burden by providing a ready-to-use image that can be provisioned on Azure with several popular tools pre-installed and configured.
The Microsoft Data Science Virtual Machine jump-starts your analytics project. It enables you to work on tasks in various languages including R, Python, SQL, and C#. Visual Studio provides an easy-to-use IDE for developing and testing your code. The Azure SDK included in the VM allows you to build your applications using various services on Microsoft’s cloud platform.
There are no software charges for this data science VM image. You pay only the Azure usage fees, which depend on the size of the virtual machine you provision. More details on the compute fees can be found in the Pricing details section on the Data Science Virtual Machine page.
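Because the only charge is compute, a rough estimate is just the hourly rate for the chosen VM size multiplied by the hours the VM runs. The sketch below illustrates that arithmetic; the function name and the hourly rate are hypothetical placeholders, not actual Azure prices, so take real rates from the Pricing details section.

```python
def estimate_compute_cost(hourly_rate, hours_per_day, days):
    """Estimate compute fees as rate x hours run.

    hourly_rate is a placeholder for the per-hour price of the VM size
    you choose; look up the real figure on the pricing page.
    """
    return hourly_rate * hours_per_day * days


if __name__ == "__main__":
    # Hypothetical rate of 0.50 per hour, running 8 hours a day for 22 days.
    print(estimate_compute_cost(0.50, hours_per_day=8, days=22))
```

The estimate covers compute only; storage and network usage are billed separately under the usual Azure terms.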
Other Versions of the Data Science Virtual Machine
A CentOS image is also available, with many of the same tools as the Windows image. An Ubuntu image is available as well, with many similar tools plus deep learning frameworks.
Prerequisites
Before you can create a Microsoft Data Science Virtual Machine, you must have the following:
- An Azure subscription: To obtain one, see Get Azure free trial.
- An Azure storage account: To create one, see Create an Azure storage account. Alternatively, the storage account can be created as part of the process of creating the VM if you do not want to use an existing account.
Create your Microsoft Data Science Virtual Machine
Here are the steps to create an instance of the Microsoft Data Science Virtual Machine:
- Navigate to the virtual machine listing on the Azure portal.
- Select the Create button at the bottom to open the wizard.
- The wizard used to create the Microsoft Data Science Virtual Machine requires inputs for each of the five steps enumerated on the right of this figure. Here are the inputs needed to configure each of these steps:
  - Basics
    - Name: Name of the data science server you are creating.
    - User Name: Admin account login ID.
    - Password: Admin account password.
    - Subscription: If you have more than one subscription, select the one on which the machine is to be created and billed.
    - Resource Group: You can create a new one or use an existing group.
    - Location: Select the data center that is most appropriate. Usually it is the data center that has most of your data or is closest to your physical location for fastest network access.
  - Size: Select one of the server types that meets your functional requirements and cost constraints. You can get more choices of VM sizes by selecting “View All”.
  - Settings
    - Disk Type: Choose Premium if you prefer a solid-state drive (SSD); otherwise, choose Standard.
    - Storage Account: You can create a new Azure storage account in your subscription or use an existing one in the same Location that was chosen on the Basics step of the wizard.
    - Other parameters: Usually you just use the default values. You can hover over the informational link for help on specific fields if you want to consider non-default values.
  - Summary: Verify that all information you entered is correct.
  - Buy: Click Buy to start the provisioning. A link is provided to the terms of the transaction. The VM does not have any additional charges beyond the compute for the server size you chose in the Size step.
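The same inputs the wizard collects can also be expressed as a scripted deployment. The sketch below is not part of the portal procedure above: it only assembles an argument list for the Azure CLI command `az vm create`, mapping each wizard field to the corresponding flag. The helper name, the resource names, the example VM size, and the image placeholder are all assumptions for illustration; the actual image identifier of the Data Science Virtual Machine must be looked up in the Azure Marketplace.

```python
import shlex

# Placeholder only: replace with the real marketplace image identifier
# (publisher:offer:sku:version) of the Data Science Virtual Machine.
DSVM_IMAGE = "<publisher>:<offer>:<sku>:<version>"


def build_vm_create_command(name, resource_group, location, size,
                            admin_username, image, premium_disk=False):
    """Map the wizard inputs onto `az vm create` arguments.

    The admin password is left out on purpose so it is not embedded in a
    script; `--admin-password` is the flag that corresponds to it.
    """
    # Disk Type in the Settings step: Premium (SSD) or Standard.
    storage_sku = "Premium_LRS" if premium_disk else "Standard_LRS"
    return [
        "az", "vm", "create",
        "--name", name,                      # Basics: Name
        "--admin-username", admin_username,  # Basics: User Name
        "--resource-group", resource_group,  # Basics: Resource Group
        "--location", location,              # Basics: Location
        "--size", size,                      # Size step
        "--storage-sku", storage_sku,        # Settings: Disk Type
        "--image", image,
    ]


if __name__ == "__main__":
    # All values below are hypothetical examples.
    command = build_vm_create_command(
        name="my-dsvm",
        resource_group="my-dsvm-rg",
        location="westus2",
        size="Standard_DS3_v2",
        admin_username="dsvmadmin",
        image=DSVM_IMAGE,
        premium_disk=True,
    )
    # Print the command for review instead of running it.
    print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch free of side effects; review the output and substitute real values before running anything against your subscription.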
