Read and Analyze Hadoop Sequence File
This example shows how to create a datastore for a Sequence file containing key-value data. Then, you can read and process the data one block at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.
Set the appropriate environment variable to the location where Hadoop is installed. In this case, set the MATLAB_HADOOP_INSTALL environment variable.
setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')
Here, hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
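Before creating the datastore, it can help to confirm that the variable is actually set and points at an existing folder. The following is a minimal sketch of such a check, not a required step; the variable name hadoopHome is introduced here for illustration only.

```matlab
% Sketch: check that MATLAB_HADOOP_INSTALL is set and refers to a folder.
% hadoopHome is a hypothetical helper variable, not part of the example.
hadoopHome = getenv('MATLAB_HADOOP_INSTALL');   % returns '' if unset

if isempty(hadoopHome)
    warning('MATLAB_HADOOP_INSTALL is not set.');
elseif exist(hadoopHome,'dir') ~= 7             % 7 means "is a folder"
    warning('MATLAB_HADOOP_INSTALL points to a missing folder: %s', hadoopHome);
else
    fprintf('Using Hadoop installation at %s\n', hadoopHome);
end
```

With the placeholder path /mypath/hadoop-folder from the setenv call, the second branch would be expected to fire unless that folder really exists on your system.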
Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.
ds = datastore('mapredout.seq')
ds = 

  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'
datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
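If you would rather not rely on automatic detection, the datastore function also accepts a 'Type' name-value argument. The sketch below requests a key-value datastore explicitly and then checks the class of the result. The names dsExplicit and isKeyValue are illustrative, and the example assumes mapredout.seq is on the MATLAB path.

```matlab
% Sketch: request a key-value datastore explicitly instead of relying on
% automatic type detection. dsExplicit and isKeyValue are hypothetical names.
dsExplicit = datastore('mapredout.seq','Type','keyvalue');

% Confirm the class of the returned object.
isKeyValue = isa(dsExplicit,'matlab.io.datastore.KeyValueDatastore');
```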
Set the ReadSize property to six so that each call to read reads at most six key-value pairs.
ds.ReadSize = 6;
Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.
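sums = [];
while hasdata(ds)
    T = read(ds);                  % table with at most ReadSize rows
    T.Value = cell2mat(T.Value);   % convert cell array of values to numeric
    sums(end+1) = sum(T.Value);    % append the sum for this subset
end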
View the last subset of key-value pairs read.
T
T = 

      Key       Value
    ________    _____

    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318
Compute the total number of flights operated by all carriers.
numflights = sum(sums)
numflights = 123523
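When the whole file fits in memory, the blockwise total can be cross-checked by reading every key-value pair at once with readall. This is only a sketch under that assumption; dsAll, allData, and totalFlights are names introduced here for illustration, and the example assumes mapredout.seq is on the MATLAB path.

```matlab
% Sketch: cross-check the blockwise total by reading all pairs at once.
% Assumes the data fits in memory. dsAll, allData, and totalFlights are
% hypothetical names, not part of the original example.
dsAll = datastore('mapredout.seq');

allData = readall(dsAll);                      % table with Key and Value
totalFlights = sum(cell2mat(allData.Value));   % Value is read as a cell array
```

If the file is unchanged, totalFlights should agree with numflights. For data too large for memory, the blockwise loop remains the appropriate approach.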
See Also
datastore | KeyValueDatastore | mapreduce | tall