Data Cleaning and Calculations in Tables
This example shows how to clean data stored in a MATLAB® table. It also shows how to perform calculations by using the numeric and categorical data that the table contains.
Because tables and timetables are containers, working with them is somewhat different than working with ordinary numeric arrays. The example shows how to use different tabular subscripting modes, how these modes differ, and the advantages and disadvantages of each mode for different situations. It also shows how to access and assign data, apply transformation and summary functions, convert table variables to different data types, and plot results.
The Ames Housing Data used in this example comes from residential real estate data for the town of Ames, Iowa, in the United States. You can download the original data from an XLS (Excel® Workbook) spreadsheet. The data description is available as a text file. (Used with permission of the copyright holder. Please contact the copyright holder if you wish to publish or redistribute this data.)
Import Spreadsheet Data to Table
The best way to import a spreadsheet into MATLAB is to use the readtable
function, or for data that include timestamps, the readtimetable
function. While the Ames Housing Data includes the sale month and year for each house, the month and year are stored in separate columns. In this case, it is simpler to use readtable
.
Read the housing data. With readtable
you can read data directly from a URL. Store all text data from the spreadsheet as string arrays in the output table. Also, when readtable
reads column headers from a file, it uses them as table variable names and transforms them into valid MATLAB identifiers. To preserve the original names, use the 'VariableNamingRule'
name-value argument.
housing = readtable("","TextType","string");
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property. Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
Display housing
. The table has one variable for each of the 82 columns in the spreadsheet.
housing
housing=2930×82 table
Order PID MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemod_Add RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical x1stFlrSF x2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch x3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
_____ ____________ __________ ________ ___________ _______ ______ _____ ________ ___________ _________ _________ _________ ____________ __________ __________ ________ __________ ___________ ___________ _________ _____________ _________ _________ ___________ ___________ __________ __________ _________ _________ __________ ________ ________ ____________ ____________ __________ ____________ __________ _________ ___________ _______ _________ __________ __________ _________ _________ ____________ _________ ____________ ____________ ________ ________ ____________ ____________ ___________ ____________ __________ __________ ___________ __________ ___________ ____________ __________ __________ __________ __________ __________ __________ ___________ _____________ __________ ___________ ________ ______ _______ ___________ _______ ______ ______ ________ _____________ _________
1 "0526301100" "020" "RL" 141 31770 "Pave" "NA" "IR1" "Lvl" "AllPub" "Corner" "Gtl" "NAmes" "Norm" "Norm" "1Fam" "1Story" 6 5 1960 1960 "Hip" "CompShg" "BrkFace" "Plywood" "Stone" 112 "TA" "TA" "CBlock" "TA" "Gd" "Gd" "BLQ" 639 "Unf" 0 441 1080 "GasA" "Fa" "Y" "SBrkr" 1656 0 0 1656 1 0 1 0 3 1 "TA" 7 "Typ" 2 "Gd" "Attchd" 1960 "Fin" 2 528 "TA" "TA" "P" 210 62 0 0 0 0 "NA" "NA" "NA" 0 5 2010 "WD" "Normal" 2.15e+05
2 "0526350040" "020" "RH" 80 11622 "Pave" "NA" "Reg" "Lvl" "AllPub" "Inside" "Gtl" "NAmes" "Feedr" "Norm" "1Fam" "1Story" 5 6 1961 1961 "Gable" "CompShg" "VinylSd" "VinylSd" "None" 0 "TA" "TA" "CBlock" "TA" "TA" "No" "Rec" 468 "LwQ" 144 270 882 "GasA" "TA" "Y" "SBrkr" 896 0 0 896 0 0 1 0 2 1 "TA" 5 "Typ" 0 "NA" "Attchd" 1961 "Unf" 1 730 "TA" "TA" "Y" 140 0 0 0 120 0 "NA" "MnPrv" "NA" 0 6 2010 "WD" "Normal" 1.05e+05
3 "0526351010" "020" "RL" 81 14267 "Pave" "NA" "IR1" "Lvl" "AllPub" "Corner" "Gtl" "NAmes" "Norm" "Norm" "1Fam" "1Story" 6 6 1958 1958 "Hip" "CompShg" "Wd Sdng" "Wd Sdng" "BrkFace" 108 "TA" "TA" "CBlock" "TA" "TA" "No" "ALQ" 923 "Unf" 0 406 1329 "GasA" "TA" "Y" "SBrkr" 1329 0 0 1329 0 0 1 1 3 1 "Gd" 6 "Typ" 0 "NA" "Attchd" 1958 "Unf" 1 312 "TA" "TA" "Y" 393 36 0 0 0 0 "NA" "NA" "Gar2" 12500 6 2010 "WD" "Normal" 1.72e+05
4 "0526353030" "020" "RL" 93 11160 "Pave" "NA" "Reg" "Lvl" "AllPub" "Corner" "Gtl" "NAmes" "Norm" "Norm" "1Fam" "1Story" 7 5 1968 1968 "Hip" "CompShg" "BrkFace" "BrkFace" "None" 0 "Gd" "TA" "CBlock" "TA" "TA" "No" "ALQ" 1065 "Unf" 0 1045 2110 "GasA" "Ex" "Y" "SBrkr" 2110 0 0 2110 1 0 2 1 3 1 "Ex" 8 "Typ" 2 "TA" "Attchd" 1968 "Fin" 2 522 "TA" "TA" "Y" 0 0 0 0 0 0 "NA" "NA" "NA" 0 4 2010 "WD" "Normal" 2.44e+05
5 "0527105010" "060" "RL" 74 13830 "Pave" "NA" "IR1" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "2Story" 5 5 1997 1998 "Gable" "CompShg" "VinylSd" "VinylSd" "None" 0 "TA" "TA" "PConc" "Gd" "TA" "No" "GLQ" 791 "Unf" 0 137 928 "GasA" "Gd" "Y" "SBrkr" 928 701 0 1629 0 0 2 1 3 1 "TA" 6 "Typ" 1 "TA" "Attchd" 1997 "Fin" 2 482 "TA" "TA" "Y" 212 34 0 0 0 0 "NA" "MnPrv" "NA" 0 3 2010 "WD" "Normal" 1.899e+05
6 "0527105030" "060" "RL" 78 9978 "Pave" "NA" "IR1" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "2Story" 6 6 1998 1998 "Gable" "CompShg" "VinylSd" "VinylSd" "BrkFace" 20 "TA" "TA" "PConc" "TA" "TA" "No" "GLQ" 602 "Unf" 0 324 926 "GasA" "Ex" "Y" "SBrkr" 926 678 0 1604 0 0 2 1 3 1 "Gd" 7 "Typ" 1 "Gd" "Attchd" 1998 "Fin" 2 470 "TA" "TA" "Y" 360 36 0 0 0 0 "NA" "NA" "NA" 0 6 2010 "WD" "Normal" 1.955e+05
7 "0527127150" "120" "RL" 41 4920 "Pave" "NA" "Reg" "Lvl" "AllPub" "Inside" "Gtl" "StoneBr" "Norm" "Norm" "TwnhsE" "1Story" 8 5 2001 2001 "Gable" "CompShg" "CemntBd" "CmentBd" "None" 0 "Gd" "TA" "PConc" "Gd" "TA" "Mn" "GLQ" 616 "Unf" 0 722 1338 "GasA" "Ex" "Y" "SBrkr" 1338 0 0 1338 1 0 2 0 2 1 "Gd" 6 "Typ" 0 "NA" "Attchd" 2001 "Fin" 2 582 "TA" "TA" "Y" 0 0 170 0 0 0 "NA" "NA" "NA" 0 4 2010 "WD" "Normal" 2.135e+05
8 "0527145080" "120" "RL" 43 5005 "Pave" "NA" "IR1" "HLS" "AllPub" "Inside" "Gtl" "StoneBr" "Norm" "Norm" "TwnhsE" "1Story" 8 5 1992 1992 "Gable" "CompShg" "HdBoard" "HdBoard" "None" 0 "Gd" "TA" "PConc" "Gd" "TA" "No" "ALQ" 263 "Unf" 0 1017 1280 "GasA" "Ex" "Y" "SBrkr" 1280 0 0 1280 0 0 2 0 2 1 "Gd" 5 "Typ" 0 "NA" "Attchd" 1992 "RFn" 2 506 "TA" "TA" "Y" 0 82 0 0 144 0 "NA" "NA" "NA" 0 1 2010 "WD" "Normal" 1.915e+05
9 "0527146030" "120" "RL" 39 5389 "Pave" "NA" "IR1" "Lvl" "AllPub" "Inside" "Gtl" "StoneBr" "Norm" "Norm" "TwnhsE" "1Story" 8 5 1995 1996 "Gable" "CompShg" "CemntBd" "CmentBd" "None" 0 "Gd" "TA" "PConc" "Gd" "TA" "No" "GLQ" 1180 "Unf" 0 415 1595 "GasA" "Ex" "Y" "SBrkr" 1616 0 0 1616 1 0 2 0 2 1 "Gd" 5 "Typ" 1 "TA" "Attchd" 1995 "RFn" 2 608 "TA" "TA" "Y" 237 152 0 0 0 0 "NA" "NA" "NA" 0 3 2010 "WD" "Normal" 2.365e+05
10 "0527162130" "060" "RL" 60 7500 "Pave" "NA" "Reg" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "2Story" 7 5 1999 1999 "Gable" "CompShg" "VinylSd" "VinylSd" "None" 0 "TA" "TA" "PConc" "TA" "TA" "No" "Unf" 0 "Unf" 0 994 994 "GasA" "Gd" "Y" "SBrkr" 1028 776 0 1804 0 0 2 1 3 1 "Gd" 7 "Typ" 1 "TA" "Attchd" 1999 "Fin" 2 442 "TA" "TA" "Y" 140 60 0 0 0 0 "NA" "NA" "NA" 0 6 2010 "WD" "Normal" 1.89e+05
11 "0527163010" "060" "RL" 75 10000 "Pave" "NA" "IR1" "Lvl" "AllPub" "Corner" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "2Story" 6 5 1993 1994 "Gable" "CompShg" "HdBoard" "HdBoard" "None" 0 "TA" "TA" "PConc" "Gd" "TA" "No" "Unf" 0 "Unf" 0 763 763 "GasA" "Gd" "Y" "SBrkr" 763 892 0 1655 0 0 2 1 3 1 "TA" 7 "Typ" 1 "TA" "Attchd" 1993 "Fin" 2 440 "TA" "TA" "Y" 157 84 0 0 0 0 "NA" "NA" "NA" 0 4 2010 "WD" "Normal" 1.759e+05
12 "0527165230" "020" "RL" NaN 7980 "Pave" "NA" "IR1" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "1Story" 6 7 1992 2007 "Gable" "CompShg" "HdBoard" "HdBoard" "None" 0 "TA" "Gd" "PConc" "Gd" "TA" "No" "ALQ" 935 "Unf" 0 233 1168 "GasA" "Ex" "Y" "SBrkr" 1187 0 0 1187 1 0 2 0 3 1 "TA" 6 "Typ" 0 "NA" "Attchd" 1992 "Fin" 2 420 "TA" "TA" "Y" 483 21 0 0 0 0 "NA" "GdPrv" "Shed" 500 3 2010 "WD" "Normal" 1.85e+05
13 "0527166040" "060" "RL" 63 8402 "Pave" "NA" "IR1" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "2Story" 6 5 1998 1998 "Gable" "CompShg" "VinylSd" "VinylSd" "None" 0 "TA" "TA" "PConc" "Gd" "TA" "No" "Unf" 0 "Unf" 0 789 789 "GasA" "Gd" "Y" "SBrkr" 789 676 0 1465 0 0 2 1 3 1 "TA" 7 "Typ" 1 "Gd" "Attchd" 1998 "Fin" 2 393 "TA" "TA" "Y" 0 75 0 0 0 0 "NA" "NA" "NA" 0 5 2010 "WD" "Normal" 1.804e+05
14 "0527180040" "020" "RL" 85 10176 "Pave" "NA" "Reg" "Lvl" "AllPub" "Inside" "Gtl" "Gilbert" "Norm" "Norm" "1Fam" "1Story" 7 5 1990 1990 "Gable" "CompShg" "HdBoard" "HdBoard" "None" 0 "TA" "TA" "PConc" "Gd" "TA" "Gd" "GLQ" 637 "Unf" 0 663 1300 "GasA" "Gd" "Y" "SBrkr" 1341 0 0 1341 1 0 1 1 2 1 "Gd" 5 "Typ" 1 "Po" "Attchd" 1990 "Unf" 2 506 "TA" "TA" "Y" 192 0 0 0 0 0 "NA" "NA" "NA" 0 2 2010 "WD" "Normal" 1.715e+05
15 "0527182190" "120" "RL" NaN 6820 "Pave" "NA" "IR1" "Lvl" "AllPub" "Corner" "Gtl" "StoneBr" "Norm" "Norm" "TwnhsE" "1Story" 8 5 1985 1985 "Gable" "CompShg" "HdBoard" "HdBoard" "None" 0 "Gd" "TA" "PConc" "Gd" "TA" "Av" "GLQ" 368 "BLQ" 1120 0 1488 "GasA" "TA" "Y" "SBrkr" 1502 0 0 1502 1 0 1 1 1 1 "Gd" 4 "Typ" 0 "NA" "Attchd" 1985 "RFn" 2 528 "TA" "TA" "Y" 0 54 0 0 140 0 "NA" "NA" "NA" 0 6 2010 "WD" "Normal" 2.12e+05
16 "0527216070" "060" "RL" 47 53504 "Pave" "NA" "IR2" "HLS" "AllPub" "CulDSac" "Mod" "StoneBr" "Norm" "Norm" "1Fam" "2Story" 8 5 2003 2003 "Hip" "CompShg" "CemntBd" "Wd Shng" "BrkFace" 603 "Ex" "TA" "PConc" "Gd" "TA" "Gd" "ALQ" 1416 "Unf" 0 234 1650 "GasA" "Ex" "Y" "SBrkr" 1690 1589 0 3279 1 0 3 1 4 1 "Ex" 12 "Mod" 1 "Gd" "BuiltIn" 2003 "Fin" 3 841 "TA" "TA" "Y" 503 36 0 0 210 0 "NA" "NA" "NA" 0 6 2010 "WD" "Normal" 5.38e+05
⋮
The spreadsheet has some column headers with spaces and other column headers that start with numbers. Column headers become variable names in the output table. By default, readtable
standardizes names with spaces by using camel case, and standardizes names beginning with numbers by prepending them with 'x'
. Although a table can have variable names with spaces and other non-alphanumeric characters in them, the standardization makes working with table variable names more natural. Before standardizing names, readtable
saves the original column headers in housing.Properties.VariableDescriptions
.
housing.Properties.VariableDescriptions
ans = 1×82 cell
{'Order'} {'PID'} {'MS SubClass'} {'MS Zoning'} {'Lot Frontage'} {'Lot Area'} {'Street'} {'Alley'} {'Lot Shape'} {'Land Contour'} {'Utilities'} {'Lot Config'} {'Land Slope'} {'Neighborhood'} {'Condition 1'} {'Condition 2'} {'Bldg Type'} {'House Style'} {'Overall Qual'} {'Overall Cond'} {'Year Built'} {'Year Remod/Add'} {'Roof Style'} {'Roof Matl'} {'Exterior 1st'} {'Exterior 2nd'} {'Mas Vnr Type'} {'Mas Vnr Area'} {'Exter Qual'} {'Exter Cond'} {'Foundation'} {'Bsmt Qual'} {'Bsmt Cond'} {'Bsmt Exposure'} {'BsmtFin Type 1'} {'BsmtFin SF 1'} {'BsmtFin Type 2'} {'BsmtFin SF 2'} {'Bsmt Unf SF'} {'Total Bsmt SF'} {'Heating'} {'Heating QC'} {'Central Air'} {'Electrical'} {'1st Flr SF'} {'2nd Flr SF'} {'Low Qual Fin SF'} {'Gr Liv Area'} {'Bsmt Full Bath'} {'Bsmt Half Bath'} {'Full Bath'} {'Half Bath'} {'Bedroom AbvGr'} {'Kitchen AbvGr'} {'Kitchen Qual'} {'TotRms AbvGrd'} {'Functional'} {'Fireplaces'} {'Fireplace Qu'} {'Garage Type'} {'Garage Yr Blt'} {'Garage Finish'} {'Garage Cars'} {'Garage Area'} {'Garage Qual'} {'Garage Cond'} {'Paved Drive'} {'Wood Deck SF'} {'Open Porch SF'} {'Enclosed Porch'} {'3Ssn Porch'} {'Screen Porch'} {'Pool Area'} {'Pool QC'} {'Fence'} {'Misc Feature'} {'Misc Val'} {'Mo Sold'} {'Yr Sold'} {'Sale Type'} {'Sale Condition'} {'SalePrice'}
In this example, the original variable names are not needed. To delete them, assign an empty cell array to the VariableDescriptions
property.
housing.Properties.VariableDescriptions = {};
Clean Data Before Analysis
You can remove the Order
variable because it is a row index and not needed. To remove one variable from the table, assign an empty array, []
, to the variable, just as you delete rows or columns from a matrix.
housing.Order = [];
There are 81 variables left in the table. For a complete analysis of the housing prices, most of the variables are probably important. But for this example, only a much smaller subset is needed. To delete the unwanted variables one-by-one is tedious. The removevars
function can delete them all at once, but in this case there is an easier way. First list the variables that you want to keep. Then use subscripting to select them and delete the others. Selecting variables by name is often much easier than figuring out their numeric indices.
keep = ["PID" "MSSubClass" "LotFrontage", "LotArea" "Neighborhood" "BldgType" ... "OverallCond" "YearBuilt" "YearRemod_Add" "Foundation" "Heating" ... "CentralAir" "x1stFlrSF" "x2ndFlrSF" "LowQualFinSF" "GrLivArea" ... "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr" ... "GarageType" "MoSold" "YrSold" "SalePrice"]; housing = housing(:,keep)
housing=2930×25 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir x1stFlrSF x2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType MoSold YrSold SalePrice
____________ __________ ___________ _______ ____________ ________ ___________ _________ _____________ __________ _______ __________ _________ _________ ____________ _________ ____________ ____________ ________ ________ ____________ __________ ______ ______ _________
"0526301100" "020" 141 31770 "NAmes" "1Fam" 5 1960 1960 "CBlock" "GasA" "Y" 1656 0 0 1656 1 0 1 0 3 "Attchd" 5 2010 2.15e+05
"0526350040" "020" 80 11622 "NAmes" "1Fam" 6 1961 1961 "CBlock" "GasA" "Y" 896 0 0 896 0 0 1 0 2 "Attchd" 6 2010 1.05e+05
"0526351010" "020" 81 14267 "NAmes" "1Fam" 6 1958 1958 "CBlock" "GasA" "Y" 1329 0 0 1329 0 0 1 1 3 "Attchd" 6 2010 1.72e+05
"0526353030" "020" 93 11160 "NAmes" "1Fam" 5 1968 1968 "CBlock" "GasA" "Y" 2110 0 0 2110 1 0 2 1 3 "Attchd" 4 2010 2.44e+05
"0527105010" "060" 74 13830 "Gilbert" "1Fam" 5 1997 1998 "PConc" "GasA" "Y" 928 701 0 1629 0 0 2 1 3 "Attchd" 3 2010 1.899e+05
"0527105030" "060" 78 9978 "Gilbert" "1Fam" 6 1998 1998 "PConc" "GasA" "Y" 926 678 0 1604 0 0 2 1 3 "Attchd" 6 2010 1.955e+05
"0527127150" "120" 41 4920 "StoneBr" "TwnhsE" 5 2001 2001 "PConc" "GasA" "Y" 1338 0 0 1338 1 0 2 0 2 "Attchd" 4 2010 2.135e+05
"0527145080" "120" 43 5005 "StoneBr" "TwnhsE" 5 1992 1992 "PConc" "GasA" "Y" 1280 0 0 1280 0 0 2 0 2 "Attchd" 1 2010 1.915e+05
"0527146030" "120" 39 5389 "StoneBr" "TwnhsE" 5 1995 1996 "PConc" "GasA" "Y" 1616 0 0 1616 1 0 2 0 2 "Attchd" 3 2010 2.365e+05
"0527162130" "060" 60 7500 "Gilbert" "1Fam" 5 1999 1999 "PConc" "GasA" "Y" 1028 776 0 1804 0 0 2 1 3 "Attchd" 6 2010 1.89e+05
"0527163010" "060" 75 10000 "Gilbert" "1Fam" 5 1993 1994 "PConc" "GasA" "Y" 763 892 0 1655 0 0 2 1 3 "Attchd" 4 2010 1.759e+05
"0527165230" "020" NaN 7980 "Gilbert" "1Fam" 7 1992 2007 "PConc" "GasA" "Y" 1187 0 0 1187 1 0 2 0 3 "Attchd" 3 2010 1.85e+05
"0527166040" "060" 63 8402 "Gilbert" "1Fam" 5 1998 1998 "PConc" "GasA" "Y" 789 676 0 1465 0 0 2 1 3 "Attchd" 5 2010 1.804e+05
"0527180040" "020" 85 10176 "Gilbert" "1Fam" 5 1990 1990 "PConc" "GasA" "Y" 1341 0 0 1341 1 0 1 1 2 "Attchd" 2 2010 1.715e+05
"0527182190" "120" NaN 6820 "StoneBr" "TwnhsE" 5 1985 1985 "PConc" "GasA" "Y" 1502 0 0 1502 1 0 1 1 1 "Attchd" 6 2010 2.12e+05
"0527216070" "060" 47 53504 "StoneBr" "1Fam" 5 2003 2003 "PConc" "GasA" "Y" 1690 1589 0 3279 1 0 3 1 4 "BuiltIn" 6 2010 5.38e+05
⋮
Two of the variable names are not very clear. Rename those variables with better names by using the VariableNames
property.
housing.Properties.VariableNames(["GrLivArea" "LowQualFinSF"]) = ["TotalAboveGroundLivingArea" "LowQualFinishedArea"];
There are two other variable names, starting with 'x'
, that look odd. Another way to rename them is to use the renamevars
function. If you use renamevars
, assign the output to the original table. Otherwise the update is lost.
housing = renamevars(housing,["x1stFlrSF" "x2ndFlrSF"],["FirstFlrArea" "SecondFlrArea"]);
Convert and Clean Up Data Types
Six of the variables are string arrays. Conceptually they all contain categorical data: discrete, nonnumeric values drawn from a small fixed set of possible values or categories. It is almost always a good idea to convert that kind of data to categorical
arrays. You can use the detectImportOptions
function to control the data types of the data you read with readtable
. But instead of starting over, you can convert these table variables to have the categorical
data type. For example, convert the Neighborhood
variable to a categorical
array.
housing.Neighborhood = categorical(housing.Neighborhood);
This assignment overwrites, or replaces, the existing text variable Neighborhood
in the table with a new categorical
variable. Replacement is what enables the assignment to change the data type. In contrast, this assignment, using indexing:
housing.Neighborhood(:) = categorical(housing.Neighborhood)
assigns values into the existing text variable, element by element, rather than replacing the variable. In that case housing.Neighborhood
remains a string array. This behavior is consistent with the behavior of ordinary workspace variables. Assignment by indexing into an array does not change the type of the array. For example, if you index into an array of integers and assign a floating-point value to an element, the value is truncated and stored as an integer.
x = uint32([1 2 3]);
x(2) = 2.2 % converted to 2, as a uint32
x = 1×3 uint32 row vector
1 2 3
Assignment with dot notation is one way to convert the type of a variable in a table. The convertvars
function is another way and has two benefits. First, it avoids any confusion about overwriting as opposed to assignment into a variable. The convertvars
function always overwrites existing variables and converts their type. Second, convertvars
can operate on more than one variable at a time. There are several more text variables in housing
to be converted to the categorical
data type. Changing them one at a time would get tedious, but convertvars
can convert more than one variable in one command.
housing = convertvars(housing,["BldgType" "Foundation"],"categorical");
It is not necessary to explicitly list the variables by name or position in the table. You can find all the table variables that are string arrays and convert them to categorical
variables. To specify table variables that are string arrays, use the function handle @isstring
when calling convertvars
.
housing = convertvars(housing,@isstring,"categorical");
In both cases, assign the output of convertvars
back to the original table. Otherwise, the update is lost.
Sometimes, converting all text variables to categorical
is too much. For example, if the current homeowners' names were present in the data, then it would not make sense to store them in a categorical
variable. Homeowners' names do not define housing categories. You might keep their names in a string array instead.
As another example, the CentralAir
variable is one of the variables that was converted to categorical
. But because its categories are just Y
and N
, it might make more sense to consider it a logical
variable.
summary(housing.CentralAir)
N 196 Y 2734
The logical
data type (like all the integer types) does not allow missing values (analogous to NaN
), while categorical
does. The CentralAir
variable happens to have no missing data values. You can use either logical
or categorical
as the data type for CentralAir
.
any(ismissing(housing.CentralAir))
ans = logical
0
Convert the data type to logical
, with true
corresponding to Y
, using dot notation to overwrite the existing categorical
variable with the new logical
one.
housing.CentralAir = (housing.CentralAir == "Y");
Display the converted data in housing
.
housing
housing=2930×25 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType MoSold YrSold SalePrice
__________ __________ ___________ _______ ____________ ________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ ______ ______ _________
0526301100 020 141 31770 NAmes 1Fam 5 1960 1960 CBlock GasA true 1656 0 0 1656 1 0 1 0 3 Attchd 5 2010 2.15e+05
0526350040 020 80 11622 NAmes 1Fam 6 1961 1961 CBlock GasA true 896 0 0 896 0 0 1 0 2 Attchd 6 2010 1.05e+05
0526351010 020 81 14267 NAmes 1Fam 6 1958 1958 CBlock GasA true 1329 0 0 1329 0 0 1 1 3 Attchd 6 2010 1.72e+05
0526353030 020 93 11160 NAmes 1Fam 5 1968 1968 CBlock GasA true 2110 0 0 2110 1 0 2 1 3 Attchd 4 2010 2.44e+05
0527105010 060 74 13830 Gilbert 1Fam 5 1997 1998 PConc GasA true 928 701 0 1629 0 0 2 1 3 Attchd 3 2010 1.899e+05
0527105030 060 78 9978 Gilbert 1Fam 6 1998 1998 PConc GasA true 926 678 0 1604 0 0 2 1 3 Attchd 6 2010 1.955e+05
0527127150 120 41 4920 StoneBr TwnhsE 5 2001 2001 PConc GasA true 1338 0 0 1338 1 0 2 0 2 Attchd 4 2010 2.135e+05
0527145080 120 43 5005 StoneBr TwnhsE 5 1992 1992 PConc GasA true 1280 0 0 1280 0 0 2 0 2 Attchd 1 2010 1.915e+05
0527146030 120 39 5389 StoneBr TwnhsE 5 1995 1996 PConc GasA true 1616 0 0 1616 1 0 2 0 2 Attchd 3 2010 2.365e+05
0527162130 060 60 7500 Gilbert 1Fam 5 1999 1999 PConc GasA true 1028 776 0 1804 0 0 2 1 3 Attchd 6 2010 1.89e+05
0527163010 060 75 10000 Gilbert 1Fam 5 1993 1994 PConc GasA true 763 892 0 1655 0 0 2 1 3 Attchd 4 2010 1.759e+05
0527165230 020 NaN 7980 Gilbert 1Fam 7 1992 2007 PConc GasA true 1187 0 0 1187 1 0 2 0 3 Attchd 3 2010 1.85e+05
0527166040 060 63 8402 Gilbert 1Fam 5 1998 1998 PConc GasA true 789 676 0 1465 0 0 2 1 3 Attchd 5 2010 1.804e+05
0527180040 020 85 10176 Gilbert 1Fam 5 1990 1990 PConc GasA true 1341 0 0 1341 1 0 1 1 2 Attchd 2 2010 1.715e+05
0527182190 120 NaN 6820 StoneBr TwnhsE 5 1985 1985 PConc GasA true 1502 0 0 1502 1 0 1 1 1 Attchd 6 2010 2.12e+05
0527216070 060 47 53504 StoneBr 1Fam 5 2003 2003 PConc GasA true 1690 1589 0 3279 1 0 3 1 4 BuiltIn 6 2010 5.38e+05
⋮
All the text data has been converted to categorical
variables. But there are still a few things to clean up.
The OverallCond
variable was read in as a numeric array, but its values are all drawn from the integers 1
-10
. You can leave these values as numeric data, but you can think of it as ordinal categorical
data. When a categorical
array is ordinal, its categories have a specified order. For example, the categories 10
and 5
can be compared (10 > 5
, because a house whose condition is rated as a 10
is theoretically nicer than one rated 5
), but for these comparisons, there is no numeric meaning to 10 - 5
. To avoid unintentionally treating OverallCond
as numeric data, convert it to an ordinal categorical
array, which still enables relational comparisons but prevents arithmetic operations. The category names 1
, 2
, and so on are easy to interpret and are acceptable.
housing.OverallCond = categorical(housing.OverallCond,1:10,"Ordinal",true);
Similarly, the MSSubClass
variable consisted of numeric codes in the original spreadsheet. You can think of those values as being categorical
data. Because there is no mathematical order to these particular codes, the categories are nonordinal (or nominal). In this case, readtable
read those values in as text to preserve leading zeroes in the codes. MSSubClass
was then converted to categorical
data.
While MSSubClass
has the data type that you want, you might find it difficult to interpret the codes as categories of houses. The file that describes the Ames Housing Data contains the definitions of the numeric codes. Giving these categories readable names can help you understand the data. To make it clear which names go with which numbers, specify both the categories (code
) and their names (subclass
) in another call to the categorical
function.
code = ["020" "030" "040" "045" "050" "060" "070" "075" "080" "085" "090" "120" "150" "160" "180" "190"]; subclass = ["1-STORY 1946 & NEWER ALL STYLES" ... "1-STORY 1945 & OLDER" ... "1-STORY W/FINISHED ATTIC ALL AGES" ... "1-1/2 STORY - UNFINISHED ALL AGES" ... "1-1/2 STORY FINISHED ALL AGES" ... "2-STORY 1946 & NEWER" ... "2-STORY 1945 & OLDER" ... "2-1/2 STORY ALL AGES" ... "SPLIT OR MULTI-LEVEL" ... "SPLIT FOYER" ... "DUPLEX - ALL STYLES AND AGES" ... "1-STORY PUD (Planned Unit Development) - 1946 & NEWER" ... "1-1/2 STORY PUD - ALL AGES" ... "2-STORY PUD - 1946 & NEWER" ... "PUD - MULTILEVEL - INCL SPLIT LEV/FOYER" ... "2 FAMILY CONVERSION - ALL STYLES AND AGES"]; housing.MSSubClass = categorical(housing.MSSubClass,code,subclass);
The category names for the BldgType
variable are not obvious. As with MSSubClass
, more descriptive names can help you understand the building categories. To display the number of houses in each building category, use the summary
function.
summary(housing.BldgType)
1Fam 2425 2fmCon 62 Duplex 109 Twnhs 101 TwnhsE 233
With only five categories, you can safely list the new category names in the right order without specifying the old names. To rename categories, use the renamecats
function.
types = ["Single-family Detached" "Two-family Conversion" "Duplex" "Townhouse End Unit" "Townhouse Inside Unit"]; housing.BldgType = renamecats(housing.BldgType,types);
The GarageType
variable includes the category NA
, standing for Not Applicable. In GarageType
, NA
means that the house does not have a garage. But it is too easy to confuse NA
with a missing value. A true missing value means it cannot be determined if a house has a garage. But in this housing data, it is always known if a house has a garage. Change that one category name to make its meaning clearer.
housing.GarageType = renamecats(housing.GarageType,"NA","None");
Finally, the PID
variable was read in as a string array. While its values were numeric, some of them had leading zeroes. The readtable
function preserved this information by storing the values as strings. Then the call to convertvars
converted the PID
variable to a categorical
array. PID
stores identification numbers that are unique. Identification numbers are assigned as needed and do not come from a fixed set of values. There is no particular advantage in storing them in a categorical
variable. If every identification number is a category, then adding a new identification number means adding a new category to PID
. It might be more convenient to convert PID
back to a string array. To convert values to strings, use the string
function.
housing.PID = string(housing.PID);
Display the results of the preliminary data cleaning.
housing
housing=2930×25 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType MoSold YrSold SalePrice
____________ _____________________________________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ ______ ______ _________
"0526301100" 1-STORY 1946 & NEWER ALL STYLES 141 31770 NAmes Single-family Detached 5 1960 1960 CBlock GasA true 1656 0 0 1656 1 0 1 0 3 Attchd 5 2010 2.15e+05
"0526350040" 1-STORY 1946 & NEWER ALL STYLES 80 11622 NAmes Single-family Detached 6 1961 1961 CBlock GasA true 896 0 0 896 0 0 1 0 2 Attchd 6 2010 1.05e+05
"0526351010" 1-STORY 1946 & NEWER ALL STYLES 81 14267 NAmes Single-family Detached 6 1958 1958 CBlock GasA true 1329 0 0 1329 0 0 1 1 3 Attchd 6 2010 1.72e+05
"0526353030" 1-STORY 1946 & NEWER ALL STYLES 93 11160 NAmes Single-family Detached 5 1968 1968 CBlock GasA true 2110 0 0 2110 1 0 2 1 3 Attchd 4 2010 2.44e+05
"0527105010" 2-STORY 1946 & NEWER 74 13830 Gilbert Single-family Detached 5 1997 1998 PConc GasA true 928 701 0 1629 0 0 2 1 3 Attchd 3 2010 1.899e+05
"0527105030" 2-STORY 1946 & NEWER 78 9978 Gilbert Single-family Detached 6 1998 1998 PConc GasA true 926 678 0 1604 0 0 2 1 3 Attchd 6 2010 1.955e+05
"0527127150" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 41 4920 StoneBr Townhouse Inside Unit 5 2001 2001 PConc GasA true 1338 0 0 1338 1 0 2 0 2 Attchd 4 2010 2.135e+05
"0527145080" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 43 5005 StoneBr Townhouse Inside Unit 5 1992 1992 PConc GasA true 1280 0 0 1280 0 0 2 0 2 Attchd 1 2010 1.915e+05
"0527146030" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 39 5389 StoneBr Townhouse Inside Unit 5 1995 1996 PConc GasA true 1616 0 0 1616 1 0 2 0 2 Attchd 3 2010 2.365e+05
"0527162130" 2-STORY 1946 & NEWER 60 7500 Gilbert Single-family Detached 5 1999 1999 PConc GasA true 1028 776 0 1804 0 0 2 1 3 Attchd 6 2010 1.89e+05
"0527163010" 2-STORY 1946 & NEWER 75 10000 Gilbert Single-family Detached 5 1993 1994 PConc GasA true 763 892 0 1655 0 0 2 1 3 Attchd 4 2010 1.759e+05
"0527165230" 1-STORY 1946 & NEWER ALL STYLES NaN 7980 Gilbert Single-family Detached 7 1992 2007 PConc GasA true 1187 0 0 1187 1 0 2 0 3 Attchd 3 2010 1.85e+05
"0527166040" 2-STORY 1946 & NEWER 63 8402 Gilbert Single-family Detached 5 1998 1998 PConc GasA true 789 676 0 1465 0 0 2 1 3 Attchd 5 2010 1.804e+05
"0527180040" 1-STORY 1946 & NEWER ALL STYLES 85 10176 Gilbert Single-family Detached 5 1990 1990 PConc GasA true 1341 0 0 1341 1 0 1 1 2 Attchd 2 2010 1.715e+05
"0527182190" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER NaN 6820 StoneBr Townhouse Inside Unit 5 1985 1985 PConc GasA true 1502 0 0 1502 1 0 1 1 1 Attchd 6 2010 2.12e+05
"0527216070" 2-STORY 1946 & NEWER 47 53504 StoneBr Single-family Detached 5 2003 2003 PConc GasA true 1690 1589 0 3279 1 0 3 1 4 BuiltIn 6 2010 5.38e+05
⋮
Create Variable for Date of Sale
The table has separate variables for the month and year of sale. It is more convenient if those variables are combined in one datetime
variable. Assignment by using dot notation is a good way to add a new variable at the right edge of a table. Add the date of sale as a new variable.
housing.LastSoldDate = datetime(housing.YrSold,housing.MoSold,0,"Format","MMM yyyy");
Now delete the two original variables. It is easier to list the variables by name and use removevars
.
housing = removevars(housing,["YrSold" "MoSold"])
housing=2930×24 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _____________________________________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ _________ ____________
"0526301100" 1-STORY 1946 & NEWER ALL STYLES 141 31770 NAmes Single-family Detached 5 1960 1960 CBlock GasA true 1656 0 0 1656 1 0 1 0 3 Attchd 2.15e+05 Apr 2010
"0526350040" 1-STORY 1946 & NEWER ALL STYLES 80 11622 NAmes Single-family Detached 6 1961 1961 CBlock GasA true 896 0 0 896 0 0 1 0 2 Attchd 1.05e+05 May 2010
"0526351010" 1-STORY 1946 & NEWER ALL STYLES 81 14267 NAmes Single-family Detached 6 1958 1958 CBlock GasA true 1329 0 0 1329 0 0 1 1 3 Attchd 1.72e+05 May 2010
"0526353030" 1-STORY 1946 & NEWER ALL STYLES 93 11160 NAmes Single-family Detached 5 1968 1968 CBlock GasA true 2110 0 0 2110 1 0 2 1 3 Attchd 2.44e+05 Mar 2010
"0527105010" 2-STORY 1946 & NEWER 74 13830 Gilbert Single-family Detached 5 1997 1998 PConc GasA true 928 701 0 1629 0 0 2 1 3 Attchd 1.899e+05 Feb 2010
"0527105030" 2-STORY 1946 & NEWER 78 9978 Gilbert Single-family Detached 6 1998 1998 PConc GasA true 926 678 0 1604 0 0 2 1 3 Attchd 1.955e+05 May 2010
"0527127150" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 41 4920 StoneBr Townhouse Inside Unit 5 2001 2001 PConc GasA true 1338 0 0 1338 1 0 2 0 2 Attchd 2.135e+05 Mar 2010
"0527145080" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 43 5005 StoneBr Townhouse Inside Unit 5 1992 1992 PConc GasA true 1280 0 0 1280 0 0 2 0 2 Attchd 1.915e+05 Dec 2009
"0527146030" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 39 5389 StoneBr Townhouse Inside Unit 5 1995 1996 PConc GasA true 1616 0 0 1616 1 0 2 0 2 Attchd 2.365e+05 Feb 2010
"0527162130" 2-STORY 1946 & NEWER 60 7500 Gilbert Single-family Detached 5 1999 1999 PConc GasA true 1028 776 0 1804 0 0 2 1 3 Attchd 1.89e+05 May 2010
"0527163010" 2-STORY 1946 & NEWER 75 10000 Gilbert Single-family Detached 5 1993 1994 PConc GasA true 763 892 0 1655 0 0 2 1 3 Attchd 1.759e+05 Mar 2010
"0527165230" 1-STORY 1946 & NEWER ALL STYLES NaN 7980 Gilbert Single-family Detached 7 1992 2007 PConc GasA true 1187 0 0 1187 1 0 2 0 3 Attchd 1.85e+05 Feb 2010
"0527166040" 2-STORY 1946 & NEWER 63 8402 Gilbert Single-family Detached 5 1998 1998 PConc GasA true 789 676 0 1465 0 0 2 1 3 Attchd 1.804e+05 Apr 2010
"0527180040" 1-STORY 1946 & NEWER ALL STYLES 85 10176 Gilbert Single-family Detached 5 1990 1990 PConc GasA true 1341 0 0 1341 1 0 1 1 2 Attchd 1.715e+05 Jan 2010
"0527182190" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER NaN 6820 StoneBr Townhouse Inside Unit 5 1985 1985 PConc GasA true 1502 0 0 1502 1 0 1 1 1 Attchd 2.12e+05 May 2010
"0527216070" 2-STORY 1946 & NEWER 47 53504 StoneBr Single-family Detached 5 2003 2003 PConc GasA true 1690 1589 0 3279 1 0 3 1 4 BuiltIn 5.38e+05 May 2010
⋮
Explore Data with Plots
Explore the data by making some simple plots. Many basic plotting commands do not accept tables as input arguments. But you can use dot notation to pass one or more table variables into a plotting function. You are taking arrays out of the table and passing them as input arguments to a plotting function.
For example, make a scatter plot of the sale prices of houses in the table as a function of the years in which they were built.
scatter(housing.YearBuilt,housing.SalePrice,20,"filled");
A log transformation of the prices might show a simpler relationship between year and price. Also, you can show more information in the scatter plot by using the living area of the houses to color the markers. The living areas have a long right tail, so it is also useful to show a log transformation of the areas. To transform the two table variables, wrap them in calls to the log
function. Then make another scatter plot.
logSalePrice = log(housing.SalePrice);
logLivingArea = log(housing.TotalAboveGroundLivingArea);
scatter(housing.YearBuilt,logSalePrice,20,logLivingArea,"filled");
Clean Errors in Data
Any large, complex data set collected over a long period of time might contain some errors. Check for errors in the housing data. Dates in the data are a good place to start. First compare YearBuilt
to YearRemod_Add
.
checkRows = housing.YearBuilt > housing.YearRemod_Add; housing(checkRows,:)
ans=1×24 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _______________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ _________ ____________
"0907194160" 1-STORY 1946 & NEWER ALL STYLES 65 10739 CollgCr Single-family Detached 5 2002 2001 PConc GasA true 1444 0 0 1444 1 0 2 0 3 Attchd 2.03e+05 Mar 2009
It is not possible for remodeling to have been done in 2001 if the house itself was built in 2002. If you assume that the YearBuilt
value is known to be the error (an assumption that needs to be confirmed), you can use dot notation to assign 2001 as the year in which this house was built.
housing.YearBuilt(checkRows) = 2001;
As another check, compare the new LastSoldDate
variable to YearBuilt
.
checkRows = housing.YearBuilt > year(housing.LastSoldDate); housing(checkRows,:)
ans=2×24 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _______________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ __________ ____________
"0908154235" 2-STORY 1946 & NEWER 313 63887 Edwards Single-family Detached 5 2008 2008 PConc GasA true 4692 950 0 5642 2 0 2 1 3 Attchd 1.6e+05 Dec 2007
"0908154195" 1-STORY 1946 & NEWER ALL STYLES 128 39290 Edwards Single-family Detached 5 2008 2009 PConc GasA true 5095 0 0 5095 1 1 2 1 2 Attchd 1.8385e+05 Sep 2007
There is another issue. These two houses were sold in late 2007, as shown in the LastSoldDate
variable. But the corresponding value in YearBuilt
is 2008. It might be that for these houses, the years in YearBuilt
were recorded in early 2008 (another assumption needing confirmation). Update the YearBuilt
variable, this time by using dot notation to assign to two rows.
housing.YearBuilt(checkRows) = 2007;
Clean Up Missing Data
The next step in cleaning the data is to check for missing data in the numeric and categorical
variables. The one logical
variable in housing
does not support missing values. The ismissing
function indicates which elements of the table have missing values.
missingElements = ismissing(housing)
missingElements = 2930×24 logical array
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
⋮
The ismissing
function returns a logical
matrix that is the same size as the table. Summing the columns of that matrix gives the number of missing values in each of the variables of the table.
numMissing = sum(missingElements,1)
numMissing = 1×24
0 0 490 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0
Only three of the variables have missing data, but without the variable names it is not easy to tell which variables they are. One way to tell is to index into the VariableNames
property of the table to find the names that correspond to the variables with missing values.
housing.Properties.VariableNames(numMissing > 0)
ans = 1×3 cell
{'LotFrontage'} {'BsmtFullBath'} {'BsmtHalfBath'}
Deciding what to do about missing data is a challenge. If the data is missing at random, and there are only a few missing values, one strategy is to remove those rows from the table. The four missing basement bath values (NaN
s, in this case) occur in only two rows. You can remove those two rows by using the rmmissing
function.
missingBsmtBath = ismissing(housing.BsmtFullBath) | ismissing(housing.BsmtHalfBath); housing(missingBsmtBath,:)
ans=2×24 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _______________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ _________ ____________
"0903230120" 1-STORY 1946 & NEWER ALL STYLES 99 5940 BrkSide Single-family Detached 7 1946 1950 PConc GasA true 896 0 0 896 NaN NaN 1 0 2 Detchd 79000 Mar 2008
"0908154080" 1-STORY 1946 & NEWER ALL STYLES 123 47007 Edwards Single-family Detached 7 1959 1996 Slab GasA true 3820 0 0 3820 NaN NaN 3 1 5 Attchd 2.847e+05 Jun 2008
housing = rmmissing(housing,"DataVariables",["BsmtFullBath" "BsmtHalfBath"]); whos housing
Name Size Bytes Class Attributes housing 2928x24 595935 table
This call to rmmissing
removes only the rows that have missing values in BsmtFullBath
and BsmtHalfBath
. The 490 rows with missing LotFrontage
values are still in the table. You can remove these 490 rows but doing so deletes more than 16% of the data. You also can fill these missing values with the mean frontage value by using the fillmissing
function, but that is not practical for this data. For variables that form a time series, fillmissing
also supports filling variables with interpolated values or moving-window smoothed values. LotFrontage
is not a time series. The data in this variable is a cross-sectional data set.
One commonly used strategy for filling in missing values in cross-sectional data is to create a regression model to predict the missing values in a row from the non-missing data in that row. A simple scatter plot indicates that there is a log-log relationship between the area of a lot and its frontage. That relationship suggests a model.
loglog(housing.LotArea,housing.LotFrontage,'o')
You can use that log-log relationship to fill in the missing LotFrontage
values by regressing the values on LotArea
.
missingValues = ismissing(housing.LotFrontage); beta = polyfit(log(housing.LotArea(~missingValues)),log(housing.LotFrontage(~missingValues)),1); housing.LotFrontage(missingValues) = exp(polyval(beta,log(housing.LotArea(missingValues))));
You can use dot notation to work on data in a table when you use functions such as polyfit
and polyval
that accept numeric vectors but not tables. You can think of a table as a container that is designed to hold data having different types. Functions such as polyfit
that are specifically for numeric inputs do not work on a table because a table often contains nonnumeric data. Even when a table contains only numeric data, it is still a container. The functions must be applied to the contents of the table. Use dot notation to access table variables.
Add the imputed missing values that you calculated with polyfit
and polyval
to the scatter plot. A simple imputation scheme might not be sufficient in a real analysis of this data, but it illustrates how to visualize and make computations on numeric data in a table.
hold on loglog(housing.LotArea(missingValues),housing.LotFrontage(missingValues),'rx') hold off
Arithmetic on Table Variables
Dot notation has been convenient for operations such as converting an existing table variable, adding a new variable, assigning values, plotting, and applying functions like polyval
to a table variable. Dot notation is also convenient for arithmetic operations on table variables. For example, convert the LotFrontage
variable from feet to meters.
housing.LotFrontage = 0.3048 * housing.LotFrontage; housing.Properties.VariableUnits("LotFrontage") = "m"
housing=2928×24 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _____________________________________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ ____________ __________ _________ ____________
"0526301100" 1-STORY 1946 & NEWER ALL STYLES 42.977 31770 NAmes Single-family Detached 5 1960 1960 CBlock GasA true 1656 0 0 1656 1 0 1 0 3 Attchd 2.15e+05 Apr 2010
"0526350040" 1-STORY 1946 & NEWER ALL STYLES 24.384 11622 NAmes Single-family Detached 6 1961 1961 CBlock GasA true 896 0 0 896 0 0 1 0 2 Attchd 1.05e+05 May 2010
"0526351010" 1-STORY 1946 & NEWER ALL STYLES 24.689 14267 NAmes Single-family Detached 6 1958 1958 CBlock GasA true 1329 0 0 1329 0 0 1 1 3 Attchd 1.72e+05 May 2010
"0526353030" 1-STORY 1946 & NEWER ALL STYLES 28.346 11160 NAmes Single-family Detached 5 1968 1968 CBlock GasA true 2110 0 0 2110 1 0 2 1 3 Attchd 2.44e+05 Mar 2010
"0527105010" 2-STORY 1946 & NEWER 22.555 13830 Gilbert Single-family Detached 5 1997 1998 PConc GasA true 928 701 0 1629 0 0 2 1 3 Attchd 1.899e+05 Feb 2010
"0527105030" 2-STORY 1946 & NEWER 23.774 9978 Gilbert Single-family Detached 6 1998 1998 PConc GasA true 926 678 0 1604 0 0 2 1 3 Attchd 1.955e+05 May 2010
"0527127150" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 12.497 4920 StoneBr Townhouse Inside Unit 5 2001 2001 PConc GasA true 1338 0 0 1338 1 0 2 0 2 Attchd 2.135e+05 Mar 2010
"0527145080" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 13.106 5005 StoneBr Townhouse Inside Unit 5 1992 1992 PConc GasA true 1280 0 0 1280 0 0 2 0 2 Attchd 1.915e+05 Dec 2009
"0527146030" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 11.887 5389 StoneBr Townhouse Inside Unit 5 1995 1996 PConc GasA true 1616 0 0 1616 1 0 2 0 2 Attchd 2.365e+05 Feb 2010
"0527162130" 2-STORY 1946 & NEWER 18.288 7500 Gilbert Single-family Detached 5 1999 1999 PConc GasA true 1028 776 0 1804 0 0 2 1 3 Attchd 1.89e+05 May 2010
"0527163010" 2-STORY 1946 & NEWER 22.86 10000 Gilbert Single-family Detached 5 1993 1994 PConc GasA true 763 892 0 1655 0 0 2 1 3 Attchd 1.759e+05 Mar 2010
"0527165230" 1-STORY 1946 & NEWER ALL STYLES 19.049 7980 Gilbert Single-family Detached 7 1992 2007 PConc GasA true 1187 0 0 1187 1 0 2 0 3 Attchd 1.85e+05 Feb 2010
"0527166040" 2-STORY 1946 & NEWER 19.202 8402 Gilbert Single-family Detached 5 1998 1998 PConc GasA true 789 676 0 1465 0 0 2 1 3 Attchd 1.804e+05 Apr 2010
"0527180040" 1-STORY 1946 & NEWER ALL STYLES 25.908 10176 Gilbert Single-family Detached 5 1990 1990 PConc GasA true 1341 0 0 1341 1 0 1 1 2 Attchd 1.715e+05 Jan 2010
"0527182190" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 17.465 6820 StoneBr Townhouse Inside Unit 5 1985 1985 PConc GasA true 1502 0 0 1502 1 0 1 1 1 Attchd 2.12e+05 May 2010
"0527216070" 2-STORY 1946 & NEWER 14.326 53504 StoneBr Single-family Detached 5 2003 2003 PConc GasA true 1690 1589 0 3279 1 0 3 1 4 BuiltIn 5.38e+05 May 2010
⋮
Using dot notation means that the multiplication is applied not to the housing
table, which cannot be done because tables are containers, but rather to its LotFrontage
variable, which is a numeric vector. With dot notation, you extracted LotFrontage
from the table and put the modified version back in.
Another way to access the contents of a table is to subscript into it by using curly braces, just as you use curly braces to extract the contents of a cell array. You can use curly brace subscripting to refer to and operate on data in a table by extracting and reinserting contents. For example, convert LotFrontage
back to feet by using curly brace subscripting.
housing{:,"LotFrontage"} = housing{:,"LotFrontage"} / 0.3048; housing.Properties.VariableUnits("LotFrontage") = "ft";
Dot notation and brace subscripting are different syntaxes for the same kinds of operations. They both work on the contents of a table. Also, they both enable you to specify a table variable and a subset of its rows.
housing.LotArea(1:2)
ans = 2×1
31770
11622
housing{1:2,"LotArea"}
ans = 2×1
31770
11622
While both syntaxes work on the contents of table, there are two subtle differences to consider.
First, a limitation of curly brace subscripting is that it assigns into the contents of a table rather than replacing a variable. For example, this assignment does not change the data type of the LotFrontage
variable in the way that an assignment using dot notation does. The call to the single
function on the right side of the assignment creates an array having the single
data type. But by subscripting into housing
with curly braces, you assign values from that array into the existing table variable. And the data type of LotFrontage
is double
. The values from the right side are converted back to double
by this assignment.
housing{:,"LotFrontage"} = single(housing{:,"LotFrontage"});
Second, a benefit of curly brace subscripting is that, unlike dot notation, it uses the familiar two-dimensional subscripting syntax. This syntax enables you to refer to more than one variable at a time and also to a subset of rows. For example, there are five variables whose units are square feet. Converting these variables to square meters one at a time is tedious. To apply the multiplication to all five variables at once, use curly brace subscripting.
areaVars = ["LotArea" "FirstFlrArea" "SecondFlrArea" "LowQualFinishedArea" "TotalAboveGroundLivingArea"]; housing{:,areaVars} = 0.3048^2 * housing{:,areaVars}; housing.Properties.VariableUnits(areaVars) = "m^2";
A common mistake is to use parenthesis subscripting instead of braces to operate on the contents of a table. While some functions, such as ismissing
or varfun
, do accept a table as their input, many numeric operations, including arithmetic, do not. For example, this assignment using parentheses results in an error. The try-catch
block catches the error and displays it.
try housing(:,areaVars) = 0.3048^2 * housing(:,areaVars); catch ME disp(ME.message) end
Undefined function 'mtimes' for input arguments of type 'table'.
This assignment results in an error because housing(:,areaVars)
is a 2928-by-5 table, not a numeric matrix. If you used curly brace subscripting, such as housing{:,areaVars}
, then the result would be a 2928-by-5 numeric matrix. Because tables are designed to hold data of different types, including nonnumeric data, many functions that make sense for only numeric data do not work on a table. Dot notation and curly brace subscripting exist to give you access to data in a table.
A third way to do calculations on numeric variables in a table is to use the varfun
function. Like curly brace subscripting, varfun
can operate on all or only some of the variables in a table. Unlike curly braces, varfun
operates on each table variable separately. By default, varfun
returns another table containing a variable for each separate result.
Sometimes the operation that you want to apply is an existing function. To pass the function as an argument to varfun
, use a function handle. For example, use the round
function to round data in the variables specified by areaVars
.
roundedAreaTable = varfun(@round,housing,"InputVariables",areaVars)
roundedAreaTable=2928×5 table
round_LotArea round_FirstFlrArea round_SecondFlrArea round_LowQualFinishedArea round_TotalAboveGroundLivingArea
_____________ __________________ ___________________ _________________________ ________________________________
2952 154 0 0 154
1080 83 0 0 83
1325 123 0 0 123
1037 196 0 0 196
1285 86 65 0 151
927 86 63 0 149
457 124 0 0 124
465 119 0 0 119
501 150 0 0 150
697 96 72 0 168
929 71 83 0 154
741 110 0 0 110
781 73 63 0 136
945 125 0 0 125
634 140 0 0 140
4971 157 148 0 305
⋮
If there is no function that does exactly what you want, you can also write an anonymous function to do it.
sqMeters2sqFeet = @(x) x / 0.3048^2;
areaTable = varfun(sqMeters2sqFeet,housing,"InputVariables",areaVars)
areaTable=2928×5 table
Fun_LotArea Fun_FirstFlrArea Fun_SecondFlrArea Fun_LowQualFinishedArea Fun_TotalAboveGroundLivingArea
___________ ________________ _________________ _______________________ ______________________________
31770 1656 0 0 1656
11622 896 0 0 896
14267 1329 0 0 1329
11160 2110 0 0 2110
13830 928 701 0 1629
9978 926 678 0 1604
4920 1338 0 0 1338
5005 1280 0 0 1280
5389 1616 0 0 1616
7500 1028 776 0 1804
10000 763 892 0 1655
7980 1187 0 0 1187
8402 789 676 0 1465
10176 1341 0 0 1341
6820 1502 0 0 1502
53504 1690 1589 0 3279
⋮
Because that result is a table, it can be assigned back into the original table with parenthesis subscripting.
housing(:,areaVars) = areaTable;
housing.Properties.VariableUnits(areaVars) = "ft^2";
It is important to understand the difference between the parentheses in
housing(:,areaVars) = areaTable;
and the braces in
housing{:,areaVars} = 0.3048^2 * housing{:,areaVars};
The two assignments have the same effect. The assignment with parentheses assigns one table to another. The assignment with curly braces explicitly assigns values to the content of the table. The left and right sides of that assignment are numeric matrices. Because curly brace subscripting extracts and reinserts data, it is a convenient way to modify data in place. Contents-to-contents assignment can operate on only one data type at a time, while table-to-table assignment can move data of different types. For example, this assignment results in an error because it involves mixed numeric and categorical
data in brace subscripting.
try housing{:,["LotFrontage" "OverallCond"]} = normalize(housing{:,["LotFrontage" "OverallCond"]}); catch ME disp(ME.message) end
Unable to concatenate the specified table variables.
Because varfun
returns a table, assignment using parenthesis subscripting cannot change the type of any table variables. For example, this assignment does not convert any variables from the double
to single
data type.
housing(:,areaVars) = varfun(@single,housing,"InputVariables",areaVars);
To convert the data types of table variables, use convertvars
, as previously shown.
Row Operations on Data in Table
Because curly brace subscripting extracts the variables from a table as one matrix having one data type, you can use it to perform row operations across numeric variables in a table. For example, a check on the data is to compare the individual square footage variables against TotalAboveGroundLivingArea
. Extract the former by using curly braces. Then compare their row sums to TotalAboveGroundLivingArea
, extracted by using dot notation.
area = housing{:,["FirstFlrArea" "SecondFlrArea" "LowQualFinishedArea"]}
area = 2928×3
1656 0 0
896 0 0
1329 0 0
2110 0 0
928 701 0
926 678 0
1338 0 0
1280 0 0
1616 0 0
1028 776 0
⋮
isequal(sum(area,2), housing.TotalAboveGroundLivingArea)
ans = logical
1
The square footage data is consistent. Another example is to compute the total number of bathrooms in each house by extracting the four different bathroom counts and adding them up across each row.
bathCountVars = ["BsmtHalfBath" "HalfBath" "BsmtFullBath" "FullBath"]; bathCounts = housing{:,bathCountVars}
bathCounts = 2928×4
0 0 1 1
0 0 0 1
0 1 0 1
0 1 1 2
0 1 0 2
0 1 0 2
0 0 1 2
0 0 0 2
0 0 1 2
0 1 0 2
⋮
You might think to sum the rows of that matrix as:
sum(housing{:,bathCountVars},2);
but that sum is not correct. Half-baths count only half as much as full bathrooms. A trend in real estate listings is to account for multiple half-baths by counting them after the decimal point. Matrix multiplication makes that operation one line.
TotalBaths = housing{:,bathCountVars} * [.1; .1; 1; 1];
Replace those four variables with TotalBaths
, rather than adding a new variable at the end of the table. Begin this replacement by using addvars
to add TotalBaths
next to the existing variables.
housing = addvars(housing,TotalBaths, 'After',"HalfBath");
There is a mistake in one row of the data. A townhouse built in 2007 probably does not have four half baths and no full baths.
groupcounts(housing,"TotalBaths")
ans=17×3 table
TotalBaths GroupCount Percent
__________ __________ ________
0.4 1 0.034153
1 442 15.096
1.1 293 10.007
1.2 20 0.68306
1.3 2 0.068306
2 890 30.396
2.1 558 19.057
2.2 29 0.99044
3 349 11.919
3.1 288 9.8361
3.2 6 0.20492
3.3 1 0.034153
4 25 0.85383
4.1 16 0.54645
4.2 3 0.10246
6 2 0.068306
⋮
housing(housing.TotalBaths < 1,:)
ans=1×25 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea BsmtFullBath BsmtHalfBath FullBath HalfBath TotalBaths BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _____________________________________________________ ___________ _______ ____________ _____________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ ____________ ____________ ________ ________ __________ ____________ __________ _________ ____________
"0528228275" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 53 3922 Blmngtn Townhouse Inside Unit 5 2006 2007 PConc GasA true 1402 0 0 1402 0 2 0 2 0.4 2 Attchd 1.942e+05 Jan 2007
The BsmtHalfBath
count should be two full bathrooms. The bathroom counts are all numeric. The assignment with braces updates all three values across that row.
housing{housing.TotalBaths < 1,["BsmtHalfBath" "FullBath" "TotalBaths"]} = [0 2 2.2];
Next use removevars
to delete the redundant original variables.
housing = removevars(housing,bathCountVars)
housing=2928×21 table
PID MSSubClass LotFrontage LotArea Neighborhood BldgType OverallCond YearBuilt YearRemod_Add Foundation Heating CentralAir FirstFlrArea SecondFlrArea LowQualFinishedArea TotalAboveGroundLivingArea TotalBaths BedroomAbvGr GarageType SalePrice LastSoldDate
____________ _____________________________________________________ ___________ _______ ____________ ______________________ ___________ _________ _____________ __________ _______ __________ ____________ _____________ ___________________ __________________________ __________ ____________ __________ _________ ____________
"0526301100" 1-STORY 1946 & NEWER ALL STYLES 141 31770 NAmes Single-family Detached 5 1960 1960 CBlock GasA true 1656 0 0 1656 2 3 Attchd 2.15e+05 Apr 2010
"0526350040" 1-STORY 1946 & NEWER ALL STYLES 80 11622 NAmes Single-family Detached 6 1961 1961 CBlock GasA true 896 0 0 896 1 2 Attchd 1.05e+05 May 2010
"0526351010" 1-STORY 1946 & NEWER ALL STYLES 81 14267 NAmes Single-family Detached 6 1958 1958 CBlock GasA true 1329 0 0 1329 1.1 3 Attchd 1.72e+05 May 2010
"0526353030" 1-STORY 1946 & NEWER ALL STYLES 93 11160 NAmes Single-family Detached 5 1968 1968 CBlock GasA true 2110 0 0 2110 3.1 3 Attchd 2.44e+05 Mar 2010
"0527105010" 2-STORY 1946 & NEWER 74 13830 Gilbert Single-family Detached 5 1997 1998 PConc GasA true 928 701 0 1629 2.1 3 Attchd 1.899e+05 Feb 2010
"0527105030" 2-STORY 1946 & NEWER 78 9978 Gilbert Single-family Detached 6 1998 1998 PConc GasA true 926 678 0 1604 2.1 3 Attchd 1.955e+05 May 2010
"0527127150" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 41 4920 StoneBr Townhouse Inside Unit 5 2001 2001 PConc GasA true 1338 0 0 1338 3 2 Attchd 2.135e+05 Mar 2010
"0527145080" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 43 5005 StoneBr Townhouse Inside Unit 5 1992 1992 PConc GasA true 1280 0 0 1280 2 2 Attchd 1.915e+05 Dec 2009
"0527146030" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 39 5389 StoneBr Townhouse Inside Unit 5 1995 1996 PConc GasA true 1616 0 0 1616 3 2 Attchd 2.365e+05 Feb 2010
"0527162130" 2-STORY 1946 & NEWER 60 7500 Gilbert Single-family Detached 5 1999 1999 PConc GasA true 1028 776 0 1804 2.1 3 Attchd 1.89e+05 May 2010
"0527163010" 2-STORY 1946 & NEWER 75 10000 Gilbert Single-family Detached 5 1993 1994 PConc GasA true 763 892 0 1655 2.1 3 Attchd 1.759e+05 Mar 2010
"0527165230" 1-STORY 1946 & NEWER ALL STYLES 62.496 7980 Gilbert Single-family Detached 7 1992 2007 PConc GasA true 1187 0 0 1187 3 3 Attchd 1.85e+05 Feb 2010
"0527166040" 2-STORY 1946 & NEWER 63 8402 Gilbert Single-family Detached 5 1998 1998 PConc GasA true 789 676 0 1465 2.1 3 Attchd 1.804e+05 Apr 2010
"0527180040" 1-STORY 1946 & NEWER ALL STYLES 85 10176 Gilbert Single-family Detached 5 1990 1990 PConc GasA true 1341 0 0 1341 2.1 2 Attchd 1.715e+05 Jan 2010
"0527182190" 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 57.299 6820 StoneBr Townhouse Inside Unit 5 1985 1985 PConc GasA true 1502 0 0 1502 2.1 1 Attchd 2.12e+05 May 2010
"0527216070" 2-STORY 1946 & NEWER 47 53504 StoneBr Single-family Detached 5 2003 2003 PConc GasA true 1690 1589 0 3279 4.1 4 BuiltIn 5.38e+05 May 2010
⋮
Unlike curly braces, varfun
operates on each variable in a table separately. For that reason, varfun
cannot do row operations. The related function rowfun
can do row operations. It is often simpler and faster to use curly brace subscripting for row operations.
Reductions of Data in Table
In previous sections, the operations on numeric data in the table were transformations that replace the original values. Many other important operations are reductions whose results are scalars. For example, calculate the median price of the values in SalePrice
.
median(housing.SalePrice)
ans = 160000
The median
function works column-wise on matrices. You can use curly brace subscripting to extract those four variables as a numeric matrix. Then you can calculate the medians of the columns of the matrix.
median(housing{:,["LotFrontage", "LotArea" "TotalAboveGroundLivingArea" "SalePrice"]})
ans = 1×4
105 ×
0.0007 0.0944 0.0144 1.6000
This operation does not attach variable names or any other table metadata to the result. As an alternative, you can use varfun
to apply median
to each variable in the table. With varfun
, the result is another table that contains separate numeric results and preserves the names.
varfun(@median,housing,"InputVariables",["LotFrontage", "LotArea" "TotalAboveGroundLivingArea" "SalePrice"])
ans=1×4 table
median_LotFrontage median_LotArea median_TotalAboveGroundLivingArea median_SalePrice
__________________ ______________ _________________________________ ________________
69.183 9436.5 1442 1.6e+05
These two ways to get the medians are equivalent. There is a trade-off between having the variable names preserved in another table and having the results in one numeric row vector. The way you pick depends on what you plan to do with the result.
Operations on Mixed Data Types
Using curly braces when calculating the medians has another drawback. Curly braces require compatible data type for all the variables. That is, the data you extract from the variables must have data types that allow them to be concatenated into one matrix. Ordinal categorical
data can also have median values. Because categorical
and numeric arrays cannot be concatenated, this operation results in an error.
median(housing{:,["LotFrontage", "LotArea" "OverallCond" "TotalAboveGroundLivingArea" "SalePrice"]})
But because varfun
operates on each variable in the table separately, there is no requirement that the variables have the same data type or compatible types allowing concatenation. The only requirement is that all the variables must support the function that is applied. To calculate the medians of ordinal categorical
variables and numeric variables in one function call use varfun
.
varfun(@median,housing,"InputVariables",["LotFrontage", "LotArea" "OverallCond" "TotalAboveGroundLivingArea" "SalePrice"])
ans=1×5 table
median_LotFrontage median_LotArea median_OverallCond median_TotalAboveGroundLivingArea median_SalePrice
__________________ ______________ __________________ _________________________________ ________________
69.183 9436.5 5 1442 1.6e+05
See Also
categorical
| table
| readtable
| varfun
| renamevars
| convertvars
| summary
| ismissing
| rmmissing
| datetime
| removevars
| addvars
| groupcounts