Work with Remote Data
You can read and write data from a remote location using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:
Amazon S3™ (Simple Storage Service)
Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))
Hadoop® Distributed File System (HDFS™)
Amazon S3
MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
s3://bucketname/path_to_file
bucketname is the name of the container and path_to_file is the path to the file or folders.
Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.
Set Up Access
To work with remote data in Amazon S3, you must set up access first:
1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.
2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.
3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.
4. Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)
- AWS_DEFAULT_REGION (optional): Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.
- AWS_SESSION_TOKEN (optional): Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
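The environment-variable route described above can be sketched as follows. This is a minimal illustration, not a required procedure: the credential strings are placeholders, and the region value 'us-east-1' is only an assumed example. The loop reports which variables are visible to the current MATLAB session without printing their values.

```matlab
% Placeholder credentials; replace with the values generated for your IAM user.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

% Optional: set the bucket region explicitly ('us-east-1' is only an example).
setenv('AWS_DEFAULT_REGION', 'us-east-1');

% Report which variables are set in this session, without echoing secrets.
names = {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', ...
         'AWS_DEFAULT_REGION', 'AWS_SESSION_TOKEN'};
for k = 1:numel(names)
    isSet = ~isempty(getenv(names{k}));
    fprintf('%-24s set: %d\n', names{k}, isSet);
end
```

Variables set with setenv last only for the current MATLAB session, so a script like this would typically run before any datastore that points to an s3:// location is created.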
If your Amazon S3 location is authorized for public access, you do not need to set environment variables or configure authentication. For more information on how to configure public access, see Blocking public access to your Amazon S3 storage.
Amazon S3 also allows multiple users to access one account. For more information on Amazon S3 access, see AWS Identity and Access Management (IAM).
If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
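As a rough sketch of the parpool option, the following code forwards whichever AWS variables are set on the client to the pool workers. It assumes Parallel Computing Toolbox is installed and a default cluster profile is configured. The variable names candidates and toForward are illustrative only.

```matlab
% Candidate credential variables; only those set on the client are forwarded.
candidates = {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', ...
              'AWS_DEFAULT_REGION', 'AWS_SESSION_TOKEN'};
isSet = ~cellfun(@(n) isempty(getenv(n)), candidates);
toForward = candidates(isSet);

% Start a pool whose workers receive copies of the selected variables.
% Skipped when the toolbox is missing, nothing is set, or a pool already exists.
if exist('parpool', 'file') ~= 0 && ~isempty(toForward) ...
        && isempty(gcp('nocreate'))
    pool = parpool('EnvironmentVariables', toForward);
end
```

Passing names rather than values means the workers copy the values from the client session, so the secrets do not need to appear in the script.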
Read Data from Amazon S3
The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
  'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
Write Data to Amazon S3
The following example shows how to use a tabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/',tt);
To read your tall data back, use the datastore function.
ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);
Azure Blob Storage
MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
wasbs://container@account/path_to_file/file.ext
container@account identifies the container and its storage account, and path_to_file is the path to the file or folders.
Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See What is Azure for more information.
Set Up Access
To work with remote data in Azure storage, you must set up access first:
1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.
2. Set up your authentication details by setting exactly one of the two following environment variables using setenv:
- MW_WASB_SAS_TOKEN: Authentication via Shared Access Signature (SAS)
  Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.
  In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,
  setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'
  You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.
- MW_WASB_SECRET_KEY: Authentication via one of the Account's two secret keys
  Each Storage Account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:
  setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'
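Because exactly one of the two variables should be set, a quick check before creating a datastore can catch a misconfigured session. The following sketch only inspects the environment. It is an illustrative check, not part of any MATLAB API, and the variable names hasSas and hasKey are hypothetical.

```matlab
% Determine which Azure authentication variables are set in this session.
hasSas = ~isempty(getenv('MW_WASB_SAS_TOKEN'));
hasKey = ~isempty(getenv('MW_WASB_SECRET_KEY'));

if hasSas && hasKey
    % Both set: the instructions above call for exactly one.
    warning('Both MW_WASB_SAS_TOKEN and MW_WASB_SECRET_KEY are set; set exactly one.');
elseif ~hasSas && ~hasKey
    % Neither set: only publicly accessible storage locations can be read.
    disp('No Azure credentials set; only public storage locations are reachable.');
else
    disp('Azure authentication variable found.');
end
```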
If your Azure storage location is authorized for public access, you do not need to set environment variables or configure authentication. For more information on public access to Azure storage, see Configure anonymous public read access.
Azure storage also allows multiple users to access one account. For more information on managing multiple users, see Azure Active Directory Identity and access management operations reference guide.
If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
For more information, see Use Azure storage with Azure HDInsight clusters.
Read Data from Azure
To read data from an Azure Blob Storage location, specify the location using the following syntax:
wasbs://container@account/path_to_file/file.ext
container@account identifies the container and its storage account, and path_to_file is the path to the file or folders.
For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:
location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
  'SelectedVariableNames', {'ArrDelay'});
You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
  'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
Write Data to Azure
This example shows how to read tabular data from Azure into a tall array using a tabularTextDatastore object, preprocess it by removing missing entries and sorting, and then write it back to Azure.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/',tt);
To read your tall data back, use the datastore function.
ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);
Hadoop Distributed File System
Specify Location of Data
MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:
hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file
hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.
For example, you can use either of these commands to create a datastore for the file file1.txt in a folder named data located at a host named myserver:
ds = tabularTextDatastore('hdfs:///data/file1.txt')
ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')
If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.
Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:
'hdfs://myserver:7867/data/file1.txt'
The specified port number must match the port number set in your HDFS configuration.
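The URL forms above can be assembled programmatically. The following sketch builds a location string from a host name, an optional port, and a path, reusing the values from the example above. The variable names host, port, filePath, and location are illustrative.

```matlab
host = 'myserver';            % host name from the example above
port = 7867;                  % set to [] to omit the port
filePath = '/data/file1.txt'; % absolute path on the HDFS file system

% Include the port only when one is given.
if isempty(port)
    location = sprintf('hdfs://%s%s', host, filePath);
else
    location = sprintf('hdfs://%s:%d%s', host, port, filePath);
end
disp(location)
```

The resulting character vector can be passed to a datastore function such as tabularTextDatastore, provided the host and port match your HDFS configuration.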
Set Hadoop Environment Variable
Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.
- Hadoop v1 only: Set the HADOOP_HOME environment variable.
- Hadoop v2 only: Set the HADOOP_PREFIX environment variable.
- If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.
For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.
setenv('HADOOP_HOME','/mypath/hadoop-folder');
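For the case where neither HADOOP_HOME nor HADOOP_PREFIX is set, a fallback to MATLAB_HADOOP_INSTALL might look like the following sketch. The path is the same placeholder used above, and the variable name hadoopFolder is illustrative.

```matlab
hadoopFolder = '/mypath/hadoop-folder';   % placeholder installation folder

% Set MATLAB_HADOOP_INSTALL only when neither version-specific
% variable is already set in this session.
if isempty(getenv('HADOOP_HOME')) && isempty(getenv('HADOOP_PREFIX'))
    setenv('MATLAB_HADOOP_INSTALL', hadoopFolder);
end
```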
HDFS data on Hortonworks or CLOUDERA
If your current machine has access to HDFS data on Hortonworks or CLOUDERA®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or CLOUDERA application edge nodes.
Prevent Clearing Code from Memory
When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:
Clears the definitions of all Java® classes defined by files on the dynamic class path
Removes all global variables and variables from the base workspace
Removes all compiled scripts, functions, and MEX-functions from memory
To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
Write Data to HDFS
This example shows how to use a tabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.
ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
  'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/',tt);
To read your tall data back, use the datastore function.
ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
See Also
datastore | tabularTextDatastore | write | imageDatastore | imread | imshow | javaaddpath | mlock | setenv
Related Topics
- Read and Analyze Hadoop Sequence File
- Work with Deep Learning Data in AWS (Deep Learning Toolbox)