Main Content

Data Cleaning and Calculations in Tables

This example shows how to clean data stored in a MATLAB® table. It also shows how to perform calculations by using the numeric and categorical data that the table contains.

Because tables and timetables are containers, working with them is somewhat different than working with ordinary numeric arrays. The example shows how to use different tabular subscripting modes, how these modes differ, and the advantages and disadvantages of each mode for different situations. It also shows how to access and assign data, apply transformation and summary functions, convert table variables to different data types, and plot results.

The Ames Housing Data used in this example comes from residential real estate data for the town of Ames, Iowa, in the United States. You can download the original data from an XLS (Excel® Workbook) spreadsheet. The data description is available as a text file. (Used with permission of the copyright holder. Please contact the copyright holder if you wish to publish or redistribute this data.)

Import Spreadsheet Data to Table

The best way to import a spreadsheet into MATLAB is to use the readtable function, or for data that include timestamps, the readtimetable function. While the Ames Housing Data includes the sale month and year for each house, the month and year are stored in separate columns. In this case, it is simpler to use readtable.

Read the housing data. With readtable you can read data directly from a URL. Store all text data from the spreadsheet as string arrays in the output table. Also, when readtable reads column headers from a file, it uses them as table variable names and transforms them into valid MATLAB identifiers. To preserve the original names, use the 'VariableNamingRule' name-value argument.

housing = readtable("","TextType","string");
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.

Display housing. The table has one variable for each of the 82 columns in the spreadsheet.

housing
housing=2930×82 table
    Order        PID         MSSubClass    MSZoning    LotFrontage    LotArea    Street    Alley    LotShape    LandContour    Utilities    LotConfig    LandSlope    Neighborhood    Condition1    Condition2    BldgType    HouseStyle    OverallQual    OverallCond    YearBuilt    YearRemod_Add    RoofStyle    RoofMatl     Exterior1st    Exterior2nd    MasVnrType    MasVnrArea    ExterQual    ExterCond    Foundation    BsmtQual    BsmtCond    BsmtExposure    BsmtFinType1    BsmtFinSF1    BsmtFinType2    BsmtFinSF2    BsmtUnfSF    TotalBsmtSF    Heating    HeatingQC    CentralAir    Electrical    x1stFlrSF    x2ndFlrSF    LowQualFinSF    GrLivArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    KitchenAbvGr    KitchenQual    TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu    GarageType    GarageYrBlt    GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive    WoodDeckSF    OpenPorchSF    EnclosedPorch    x3SsnPorch    ScreenPorch    PoolArea    PoolQC     Fence     MiscFeature    MiscVal    MoSold    YrSold    SaleType    SaleCondition    SalePrice
    _____    ____________    __________    ________    ___________    _______    ______    _____    ________    ___________    _________    _________    _________    ____________    __________    __________    ________    __________    ___________    ___________    _________    _____________    _________    _________    ___________    ___________    __________    __________    _________    _________    __________    ________    ________    ____________    ____________    __________    ____________    __________    _________    ___________    _______    _________    __________    __________    _________    _________    ____________    _________    ____________    ____________    ________    ________    ____________    ____________    ___________    ____________    __________    __________    ___________    __________    ___________    ____________    __________    __________    __________    __________    __________    __________    ___________    _____________    __________    ___________    ________    ______    _______    ___________    _______    ______    ______    ________    _____________    _________

      1      "0526301100"      "020"         "RL"          141         31770     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Corner"       "Gtl"       "NAmes"         "Norm"         "Norm"      "1Fam"       "1Story"          6              5           1960           1960          "Hip"       "CompShg"     "BrkFace"      "Plywood"     "Stone"          112          "TA"         "TA"        "CBlock"       "TA"        "Gd"          "Gd"           "BLQ"            639          "Unf"              0          441          1080        "GasA"       "Fa"          "Y"         "SBrkr"        1656            0            0            1656            1               0             1           0             3               1             "TA"              7           "Typ"           2            "Gd"        "Attchd"         1960           "Fin"            2            528           "TA"          "TA"          "P"           210             62               0             0               0           0         "NA"     "NA"         "NA"              0       5        2010       "WD"        "Normal"        2.15e+05
      2      "0526350040"      "020"         "RH"           80         11622     "Pave"    "NA"      "Reg"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "NAmes"         "Feedr"        "Norm"      "1Fam"       "1Story"          5              6           1961           1961          "Gable"     "CompShg"     "VinylSd"      "VinylSd"     "None"             0          "TA"         "TA"        "CBlock"       "TA"        "TA"          "No"           "Rec"            468          "LwQ"            144          270           882        "GasA"       "TA"          "Y"         "SBrkr"         896            0            0             896            0               0             1           0             2               1             "TA"              5           "Typ"           0            "NA"        "Attchd"         1961           "Unf"            1            730           "TA"          "TA"          "Y"           140              0               0             0             120           0         "NA"     "MnPrv"      "NA"              0       6        2010       "WD"        "Normal"        1.05e+05
      3      "0526351010"      "020"         "RL"           81         14267     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Corner"       "Gtl"       "NAmes"         "Norm"         "Norm"      "1Fam"       "1Story"          6              6           1958           1958          "Hip"       "CompShg"     "Wd Sdng"      "Wd Sdng"     "BrkFace"        108          "TA"         "TA"        "CBlock"       "TA"        "TA"          "No"           "ALQ"            923          "Unf"              0          406          1329        "GasA"       "TA"          "Y"         "SBrkr"        1329            0            0            1329            0               0             1           1             3               1             "Gd"              6           "Typ"           0            "NA"        "Attchd"         1958           "Unf"            1            312           "TA"          "TA"          "Y"           393             36               0             0               0           0         "NA"     "NA"         "Gar2"        12500       6        2010       "WD"        "Normal"        1.72e+05
      4      "0526353030"      "020"         "RL"           93         11160     "Pave"    "NA"      "Reg"         "Lvl"       "AllPub"     "Corner"       "Gtl"       "NAmes"         "Norm"         "Norm"      "1Fam"       "1Story"          7              5           1968           1968          "Hip"       "CompShg"     "BrkFace"      "BrkFace"     "None"             0          "Gd"         "TA"        "CBlock"       "TA"        "TA"          "No"           "ALQ"           1065          "Unf"              0         1045          2110        "GasA"       "Ex"          "Y"         "SBrkr"        2110            0            0            2110            1               0             2           1             3               1             "Ex"              8           "Typ"           2            "TA"        "Attchd"         1968           "Fin"            2            522           "TA"          "TA"          "Y"             0              0               0             0               0           0         "NA"     "NA"         "NA"              0       4        2010       "WD"        "Normal"        2.44e+05
      5      "0527105010"      "060"         "RL"           74         13830     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "2Story"          5              5           1997           1998          "Gable"     "CompShg"     "VinylSd"      "VinylSd"     "None"             0          "TA"         "TA"        "PConc"        "Gd"        "TA"          "No"           "GLQ"            791          "Unf"              0          137           928        "GasA"       "Gd"          "Y"         "SBrkr"         928          701            0            1629            0               0             2           1             3               1             "TA"              6           "Typ"           1            "TA"        "Attchd"         1997           "Fin"            2            482           "TA"          "TA"          "Y"           212             34               0             0               0           0         "NA"     "MnPrv"      "NA"              0       3        2010       "WD"        "Normal"       1.899e+05
      6      "0527105030"      "060"         "RL"           78          9978     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "2Story"          6              6           1998           1998          "Gable"     "CompShg"     "VinylSd"      "VinylSd"     "BrkFace"         20          "TA"         "TA"        "PConc"        "TA"        "TA"          "No"           "GLQ"            602          "Unf"              0          324           926        "GasA"       "Ex"          "Y"         "SBrkr"         926          678            0            1604            0               0             2           1             3               1             "Gd"              7           "Typ"           1            "Gd"        "Attchd"         1998           "Fin"            2            470           "TA"          "TA"          "Y"           360             36               0             0               0           0         "NA"     "NA"         "NA"              0       6        2010       "WD"        "Normal"       1.955e+05
      7      "0527127150"      "120"         "RL"           41          4920     "Pave"    "NA"      "Reg"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "StoneBr"       "Norm"         "Norm"      "TwnhsE"     "1Story"          8              5           2001           2001          "Gable"     "CompShg"     "CemntBd"      "CmentBd"     "None"             0          "Gd"         "TA"        "PConc"        "Gd"        "TA"          "Mn"           "GLQ"            616          "Unf"              0          722          1338        "GasA"       "Ex"          "Y"         "SBrkr"        1338            0            0            1338            1               0             2           0             2               1             "Gd"              6           "Typ"           0            "NA"        "Attchd"         2001           "Fin"            2            582           "TA"          "TA"          "Y"             0              0             170             0               0           0         "NA"     "NA"         "NA"              0       4        2010       "WD"        "Normal"       2.135e+05
      8      "0527145080"      "120"         "RL"           43          5005     "Pave"    "NA"      "IR1"         "HLS"       "AllPub"     "Inside"       "Gtl"       "StoneBr"       "Norm"         "Norm"      "TwnhsE"     "1Story"          8              5           1992           1992          "Gable"     "CompShg"     "HdBoard"      "HdBoard"     "None"             0          "Gd"         "TA"        "PConc"        "Gd"        "TA"          "No"           "ALQ"            263          "Unf"              0         1017          1280        "GasA"       "Ex"          "Y"         "SBrkr"        1280            0            0            1280            0               0             2           0             2               1             "Gd"              5           "Typ"           0            "NA"        "Attchd"         1992           "RFn"            2            506           "TA"          "TA"          "Y"             0             82               0             0             144           0         "NA"     "NA"         "NA"              0       1        2010       "WD"        "Normal"       1.915e+05
      9      "0527146030"      "120"         "RL"           39          5389     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "StoneBr"       "Norm"         "Norm"      "TwnhsE"     "1Story"          8              5           1995           1996          "Gable"     "CompShg"     "CemntBd"      "CmentBd"     "None"             0          "Gd"         "TA"        "PConc"        "Gd"        "TA"          "No"           "GLQ"           1180          "Unf"              0          415          1595        "GasA"       "Ex"          "Y"         "SBrkr"        1616            0            0            1616            1               0             2           0             2               1             "Gd"              5           "Typ"           1            "TA"        "Attchd"         1995           "RFn"            2            608           "TA"          "TA"          "Y"           237            152               0             0               0           0         "NA"     "NA"         "NA"              0       3        2010       "WD"        "Normal"       2.365e+05
     10      "0527162130"      "060"         "RL"           60          7500     "Pave"    "NA"      "Reg"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "2Story"          7              5           1999           1999          "Gable"     "CompShg"     "VinylSd"      "VinylSd"     "None"             0          "TA"         "TA"        "PConc"        "TA"        "TA"          "No"           "Unf"              0          "Unf"              0          994           994        "GasA"       "Gd"          "Y"         "SBrkr"        1028          776            0            1804            0               0             2           1             3               1             "Gd"              7           "Typ"           1            "TA"        "Attchd"         1999           "Fin"            2            442           "TA"          "TA"          "Y"           140             60               0             0               0           0         "NA"     "NA"         "NA"              0       6        2010       "WD"        "Normal"        1.89e+05
     11      "0527163010"      "060"         "RL"           75         10000     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Corner"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "2Story"          6              5           1993           1994          "Gable"     "CompShg"     "HdBoard"      "HdBoard"     "None"             0          "TA"         "TA"        "PConc"        "Gd"        "TA"          "No"           "Unf"              0          "Unf"              0          763           763        "GasA"       "Gd"          "Y"         "SBrkr"         763          892            0            1655            0               0             2           1             3               1             "TA"              7           "Typ"           1            "TA"        "Attchd"         1993           "Fin"            2            440           "TA"          "TA"          "Y"           157             84               0             0               0           0         "NA"     "NA"         "NA"              0       4        2010       "WD"        "Normal"       1.759e+05
     12      "0527165230"      "020"         "RL"          NaN          7980     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "1Story"          6              7           1992           2007          "Gable"     "CompShg"     "HdBoard"      "HdBoard"     "None"             0          "TA"         "Gd"        "PConc"        "Gd"        "TA"          "No"           "ALQ"            935          "Unf"              0          233          1168        "GasA"       "Ex"          "Y"         "SBrkr"        1187            0            0            1187            1               0             2           0             3               1             "TA"              6           "Typ"           0            "NA"        "Attchd"         1992           "Fin"            2            420           "TA"          "TA"          "Y"           483             21               0             0               0           0         "NA"     "GdPrv"      "Shed"          500       3        2010       "WD"        "Normal"        1.85e+05
     13      "0527166040"      "060"         "RL"           63          8402     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "2Story"          6              5           1998           1998          "Gable"     "CompShg"     "VinylSd"      "VinylSd"     "None"             0          "TA"         "TA"        "PConc"        "Gd"        "TA"          "No"           "Unf"              0          "Unf"              0          789           789        "GasA"       "Gd"          "Y"         "SBrkr"         789          676            0            1465            0               0             2           1             3               1             "TA"              7           "Typ"           1            "Gd"        "Attchd"         1998           "Fin"            2            393           "TA"          "TA"          "Y"             0             75               0             0               0           0         "NA"     "NA"         "NA"              0       5        2010       "WD"        "Normal"       1.804e+05
     14      "0527180040"      "020"         "RL"           85         10176     "Pave"    "NA"      "Reg"         "Lvl"       "AllPub"     "Inside"       "Gtl"       "Gilbert"       "Norm"         "Norm"      "1Fam"       "1Story"          7              5           1990           1990          "Gable"     "CompShg"     "HdBoard"      "HdBoard"     "None"             0          "TA"         "TA"        "PConc"        "Gd"        "TA"          "Gd"           "GLQ"            637          "Unf"              0          663          1300        "GasA"       "Gd"          "Y"         "SBrkr"        1341            0            0            1341            1               0             1           1             2               1             "Gd"              5           "Typ"           1            "Po"        "Attchd"         1990           "Unf"            2            506           "TA"          "TA"          "Y"           192              0               0             0               0           0         "NA"     "NA"         "NA"              0       2        2010       "WD"        "Normal"       1.715e+05
     15      "0527182190"      "120"         "RL"          NaN          6820     "Pave"    "NA"      "IR1"         "Lvl"       "AllPub"     "Corner"       "Gtl"       "StoneBr"       "Norm"         "Norm"      "TwnhsE"     "1Story"          8              5           1985           1985          "Gable"     "CompShg"     "HdBoard"      "HdBoard"     "None"             0          "Gd"         "TA"        "PConc"        "Gd"        "TA"          "Av"           "GLQ"            368          "BLQ"           1120            0          1488        "GasA"       "TA"          "Y"         "SBrkr"        1502            0            0            1502            1               0             1           1             1               1             "Gd"              4           "Typ"           0            "NA"        "Attchd"         1985           "RFn"            2            528           "TA"          "TA"          "Y"             0             54               0             0             140           0         "NA"     "NA"         "NA"              0       6        2010       "WD"        "Normal"        2.12e+05
     16      "0527216070"      "060"         "RL"           47         53504     "Pave"    "NA"      "IR2"         "HLS"       "AllPub"     "CulDSac"      "Mod"       "StoneBr"       "Norm"         "Norm"      "1Fam"       "2Story"          8              5           2003           2003          "Hip"       "CompShg"     "CemntBd"      "Wd Shng"     "BrkFace"        603          "Ex"         "TA"        "PConc"        "Gd"        "TA"          "Gd"           "ALQ"           1416          "Unf"              0          234          1650        "GasA"       "Ex"          "Y"         "SBrkr"        1690         1589            0            3279            1               0             3           1             4               1             "Ex"             12           "Mod"           1            "Gd"        "BuiltIn"        2003           "Fin"            3            841           "TA"          "TA"          "Y"           503             36               0             0             210           0         "NA"     "NA"         "NA"              0       6        2010       "WD"        "Normal"        5.38e+05
      ⋮

The spreadsheet has some column headers with spaces and other column headers that start with numbers. Column headers become variable names in the output table. By default, readtable standardizes names with spaces by using camel case, and standardizes names beginning with numbers by prepending them with 'x'. Although a table can have variable names with spaces and other non-alphanumeric characters in them, the standardization makes working with table variable names more natural. Before standardizing names, readtable saves the original column headers in housing.Properties.VariableDescriptions.

housing.Properties.VariableDescriptions
ans = 1×82 cell
    {'Order'}    {'PID'}    {'MS SubClass'}    {'MS Zoning'}    {'Lot Frontage'}    {'Lot Area'}    {'Street'}    {'Alley'}    {'Lot Shape'}    {'Land Contour'}    {'Utilities'}    {'Lot Config'}    {'Land Slope'}    {'Neighborhood'}    {'Condition 1'}    {'Condition 2'}    {'Bldg Type'}    {'House Style'}    {'Overall Qual'}    {'Overall Cond'}    {'Year Built'}    {'Year Remod/Add'}    {'Roof Style'}    {'Roof Matl'}    {'Exterior 1st'}    {'Exterior 2nd'}    {'Mas Vnr Type'}    {'Mas Vnr Area'}    {'Exter Qual'}    {'Exter Cond'}    {'Foundation'}    {'Bsmt Qual'}    {'Bsmt Cond'}    {'Bsmt Exposure'}    {'BsmtFin Type 1'}    {'BsmtFin SF 1'}    {'BsmtFin Type 2'}    {'BsmtFin SF 2'}    {'Bsmt Unf SF'}    {'Total Bsmt SF'}    {'Heating'}    {'Heating QC'}    {'Central Air'}    {'Electrical'}    {'1st Flr SF'}    {'2nd Flr SF'}    {'Low Qual Fin SF'}    {'Gr Liv Area'}    {'Bsmt Full Bath'}    {'Bsmt Half Bath'}    {'Full Bath'}    {'Half Bath'}    {'Bedroom AbvGr'}    {'Kitchen AbvGr'}    {'Kitchen Qual'}    {'TotRms AbvGrd'}    {'Functional'}    {'Fireplaces'}    {'Fireplace Qu'}    {'Garage Type'}    {'Garage Yr Blt'}    {'Garage Finish'}    {'Garage Cars'}    {'Garage Area'}    {'Garage Qual'}    {'Garage Cond'}    {'Paved Drive'}    {'Wood Deck SF'}    {'Open Porch SF'}    {'Enclosed Porch'}    {'3Ssn Porch'}    {'Screen Porch'}    {'Pool Area'}    {'Pool QC'}    {'Fence'}    {'Misc Feature'}    {'Misc Val'}    {'Mo Sold'}    {'Yr Sold'}    {'Sale Type'}    {'Sale Condition'}    {'SalePrice'}

In this example, the original variable names are not needed. To delete them, assign an empty cell array to the VariableDescriptions property.

housing.Properties.VariableDescriptions = {};

Clean Data Before Analysis

You can remove the Order variable because it is a row index and not needed. To remove one variable from the table, assign an empty array, [], to the variable, just as you delete rows or columns from a matrix.

housing.Order = [];

There are 81 variables left in the table. For a complete analysis of the housing prices, most of the variables are probably important. But for this example, only a much smaller subset is needed. To delete the unwanted variables one-by-one is tedious. The removevars function can delete them all at once, but in this case there is an easier way. First list the variables that you want to keep. Then use subscripting to select them and delete the others. Selecting variables by name is often much easier than figuring out their numeric indices.

keep = ["PID" "MSSubClass" "LotFrontage", "LotArea" "Neighborhood" "BldgType" ...
        "OverallCond" "YearBuilt" "YearRemod_Add" "Foundation" "Heating" ...
        "CentralAir" "x1stFlrSF" "x2ndFlrSF" "LowQualFinSF" "GrLivArea" ...
        "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr" ...
        "GarageType" "MoSold" "YrSold" "SalePrice"];
housing = housing(:,keep)
housing=2930×25 table
        PID         MSSubClass    LotFrontage    LotArea    Neighborhood    BldgType    OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    x1stFlrSF    x2ndFlrSF    LowQualFinSF    GrLivArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    MoSold    YrSold    SalePrice
    ____________    __________    ___________    _______    ____________    ________    ___________    _________    _____________    __________    _______    __________    _________    _________    ____________    _________    ____________    ____________    ________    ________    ____________    __________    ______    ______    _________

    "0526301100"      "020"           141         31770      "NAmes"        "1Fam"           5           1960           1960          "CBlock"     "GasA"        "Y"          1656            0            0            1656            1               0             1           0             3          "Attchd"        5        2010      2.15e+05
    "0526350040"      "020"            80         11622      "NAmes"        "1Fam"           6           1961           1961          "CBlock"     "GasA"        "Y"           896            0            0             896            0               0             1           0             2          "Attchd"        6        2010      1.05e+05
    "0526351010"      "020"            81         14267      "NAmes"        "1Fam"           6           1958           1958          "CBlock"     "GasA"        "Y"          1329            0            0            1329            0               0             1           1             3          "Attchd"        6        2010      1.72e+05
    "0526353030"      "020"            93         11160      "NAmes"        "1Fam"           5           1968           1968          "CBlock"     "GasA"        "Y"          2110            0            0            2110            1               0             2           1             3          "Attchd"        4        2010      2.44e+05
    "0527105010"      "060"            74         13830      "Gilbert"      "1Fam"           5           1997           1998          "PConc"      "GasA"        "Y"           928          701            0            1629            0               0             2           1             3          "Attchd"        3        2010     1.899e+05
    "0527105030"      "060"            78          9978      "Gilbert"      "1Fam"           6           1998           1998          "PConc"      "GasA"        "Y"           926          678            0            1604            0               0             2           1             3          "Attchd"        6        2010     1.955e+05
    "0527127150"      "120"            41          4920      "StoneBr"      "TwnhsE"         5           2001           2001          "PConc"      "GasA"        "Y"          1338            0            0            1338            1               0             2           0             2          "Attchd"        4        2010     2.135e+05
    "0527145080"      "120"            43          5005      "StoneBr"      "TwnhsE"         5           1992           1992          "PConc"      "GasA"        "Y"          1280            0            0            1280            0               0             2           0             2          "Attchd"        1        2010     1.915e+05
    "0527146030"      "120"            39          5389      "StoneBr"      "TwnhsE"         5           1995           1996          "PConc"      "GasA"        "Y"          1616            0            0            1616            1               0             2           0             2          "Attchd"        3        2010     2.365e+05
    "0527162130"      "060"            60          7500      "Gilbert"      "1Fam"           5           1999           1999          "PConc"      "GasA"        "Y"          1028          776            0            1804            0               0             2           1             3          "Attchd"        6        2010      1.89e+05
    "0527163010"      "060"            75         10000      "Gilbert"      "1Fam"           5           1993           1994          "PConc"      "GasA"        "Y"           763          892            0            1655            0               0             2           1             3          "Attchd"        4        2010     1.759e+05
    "0527165230"      "020"           NaN          7980      "Gilbert"      "1Fam"           7           1992           2007          "PConc"      "GasA"        "Y"          1187            0            0            1187            1               0             2           0             3          "Attchd"        3        2010      1.85e+05
    "0527166040"      "060"            63          8402      "Gilbert"      "1Fam"           5           1998           1998          "PConc"      "GasA"        "Y"           789          676            0            1465            0               0             2           1             3          "Attchd"        5        2010     1.804e+05
    "0527180040"      "020"            85         10176      "Gilbert"      "1Fam"           5           1990           1990          "PConc"      "GasA"        "Y"          1341            0            0            1341            1               0             1           1             2          "Attchd"        2        2010     1.715e+05
    "0527182190"      "120"           NaN          6820      "StoneBr"      "TwnhsE"         5           1985           1985          "PConc"      "GasA"        "Y"          1502            0            0            1502            1               0             1           1             1          "Attchd"        6        2010      2.12e+05
    "0527216070"      "060"            47         53504      "StoneBr"      "1Fam"           5           2003           2003          "PConc"      "GasA"        "Y"          1690         1589            0            3279            1               0             3           1             4          "BuiltIn"       6        2010      5.38e+05
      ⋮

Two of the variable names are not very clear. Rename those variables with better names by using the VariableNames property.

housing.Properties.VariableNames(["GrLivArea" "LowQualFinSF"]) = ["TotalAboveGroundLivingArea" "LowQualFinishedArea"];

There are two other variable names, starting with 'x', that look odd. Another way to rename them is to use the renamevars function. If you use renamevars, assign the output to the original table. Otherwise the update is lost.

housing = renamevars(housing,["x1stFlrSF" "x2ndFlrSF"],["FirstFlrArea" "SecondFlrArea"]);

Convert and Clean Up Data Types

Six of the variables are string arrays. Conceptually they all contain categorical data: discrete, nonnumeric values drawn from a small fixed set of possible values or categories. It is almost always a good idea to convert that kind of data to categorical arrays. You can use the detectImportOptions function to control the data types of the data you read with readtable. But instead of starting over, you can convert these table variables to have the categorical data type. For example, convert the Neighborhood variable to a categorical array.

housing.Neighborhood = categorical(housing.Neighborhood);

This assignment overwrites, or replaces, the existing text variable Neighborhood in the table with a new categorical variable. Replacement is what enables the assignment to change the data type. In contrast, this assignment, using indexing:

housing.Neighborhood(:) = categorical(housing.Neighborhood)

assigns values into the existing text variable, element by element, rather than replacing the variable. In that case housing.Neighborhood remains a string array. This behavior is consistent with the behavior of ordinary workspace variables. Assignment by indexing into an array does not change the type of the array. For example, if you index into an array of integers and assign a floating-point value to an element, the value is truncated and stored as an integer.

x = uint32([1 2 3]);
x(2) = 2.2 % converted to 2, as a uint32
x = 1×3 uint32 row vector

   1   2   3

Assignment with dot notation is one way to convert the type of a variable in a table. The convertvars function is another way and has two benefits. First, it avoids any confusion about overwriting as opposed to assignment into a variable. The convertvars function always overwrites existing variables and converts their type. Second, convertvars can operate on more than one variable at a time. There are several more text variables in housing to be converted to the categorical data type. Changing them one at a time would get tedious, but convertvars can convert more than one variable in one command.

housing = convertvars(housing,["BldgType" "Foundation"],"categorical");

It is not necessary to explicitly list the variables by name or position in the table. You can find all the table variables that are string arrays and convert them to categorical variables. To specify table variables that are string arrays, use the function handle @isstring when calling convertvars.

housing = convertvars(housing,@isstring,"categorical");

In both cases, assign the output of convertvars back to the original table. Otherwise, the update is lost.

Sometimes, converting all text variables to categorical is too much. For example, if the current homeowners' names were present in the data, then it would not make sense to store them in a categorical variable. Homeowners' names do not define housing categories. You might keep their names in a string array instead.

As another example, the CentralAir variable is one of the variables that was converted to categorical. But because its categories are just Y and N, it might make more sense to consider it a logical variable.

summary(housing.CentralAir)
     N       196 
     Y      2734 

The logical data type (like all the integer types) does not allow missing values (analogous to NaN), while categorical does. The CentralAir variable happens to have no missing data values. You can use either logical or categorical as the data type for CentralAir.

any(ismissing(housing.CentralAir))
ans = logical
   0

Convert the data type to logical, with true corresponding to Y, using dot notation to overwrite the existing categorical variable with the new logical one.

housing.CentralAir = (housing.CentralAir == "Y");

Display the converted data in housing.

housing
housing=2930×25 table
       PID        MSSubClass    LotFrontage    LotArea    Neighborhood    BldgType    OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    MoSold    YrSold    SalePrice
    __________    __________    ___________    _______    ____________    ________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    ______    ______    _________

    0526301100       020            141         31770       NAmes          1Fam            5           1960           1960           CBlock       GasA        true            1656               0                  0                        1656                    1               0             1           0             3           Attchd         5        2010      2.15e+05
    0526350040       020             80         11622       NAmes          1Fam            6           1961           1961           CBlock       GasA        true             896               0                  0                         896                    0               0             1           0             2           Attchd         6        2010      1.05e+05
    0526351010       020             81         14267       NAmes          1Fam            6           1958           1958           CBlock       GasA        true            1329               0                  0                        1329                    0               0             1           1             3           Attchd         6        2010      1.72e+05
    0526353030       020             93         11160       NAmes          1Fam            5           1968           1968           CBlock       GasA        true            2110               0                  0                        2110                    1               0             2           1             3           Attchd         4        2010      2.44e+05
    0527105010       060             74         13830       Gilbert        1Fam            5           1997           1998           PConc        GasA        true             928             701                  0                        1629                    0               0             2           1             3           Attchd         3        2010     1.899e+05
    0527105030       060             78          9978       Gilbert        1Fam            6           1998           1998           PConc        GasA        true             926             678                  0                        1604                    0               0             2           1             3           Attchd         6        2010     1.955e+05
    0527127150       120             41          4920       StoneBr        TwnhsE          5           2001           2001           PConc        GasA        true            1338               0                  0                        1338                    1               0             2           0             2           Attchd         4        2010     2.135e+05
    0527145080       120             43          5005       StoneBr        TwnhsE          5           1992           1992           PConc        GasA        true            1280               0                  0                        1280                    0               0             2           0             2           Attchd         1        2010     1.915e+05
    0527146030       120             39          5389       StoneBr        TwnhsE          5           1995           1996           PConc        GasA        true            1616               0                  0                        1616                    1               0             2           0             2           Attchd         3        2010     2.365e+05
    0527162130       060             60          7500       Gilbert        1Fam            5           1999           1999           PConc        GasA        true            1028             776                  0                        1804                    0               0             2           1             3           Attchd         6        2010      1.89e+05
    0527163010       060             75         10000       Gilbert        1Fam            5           1993           1994           PConc        GasA        true             763             892                  0                        1655                    0               0             2           1             3           Attchd         4        2010     1.759e+05
    0527165230       020            NaN          7980       Gilbert        1Fam            7           1992           2007           PConc        GasA        true            1187               0                  0                        1187                    1               0             2           0             3           Attchd         3        2010      1.85e+05
    0527166040       060             63          8402       Gilbert        1Fam            5           1998           1998           PConc        GasA        true             789             676                  0                        1465                    0               0             2           1             3           Attchd         5        2010     1.804e+05
    0527180040       020             85         10176       Gilbert        1Fam            5           1990           1990           PConc        GasA        true            1341               0                  0                        1341                    1               0             1           1             2           Attchd         2        2010     1.715e+05
    0527182190       120            NaN          6820       StoneBr        TwnhsE          5           1985           1985           PConc        GasA        true            1502               0                  0                        1502                    1               0             1           1             1           Attchd         6        2010      2.12e+05
    0527216070       060             47         53504       StoneBr        1Fam            5           2003           2003           PConc        GasA        true            1690            1589                  0                        3279                    1               0             3           1             4           BuiltIn        6        2010      5.38e+05
      ⋮

All the text data has been converted to categorical variables. But there are still a few things to clean up.

The OverallCond variable was read in as a numeric array, but its values are all drawn from the integers 1-10. You can leave these values as numeric data, but you can think of it as ordinal categorical data. When a categorical array is ordinal, its categories have a specified order. For example, the categories 10 and 5 can be compared (10 > 5, because a house whose condition is rated as a 10 is theoretically nicer than one rated 5), but for these comparisons, there is no numeric meaning to 10 - 5. To avoid unintentionally treating OverallCond as numeric data, convert it to an ordinal categorical array, which still enables relational comparisons but prevents arithmetic operations. The category names 1, 2, and so on are easy to interpret and are acceptable.

housing.OverallCond = categorical(housing.OverallCond,1:10,"Ordinal",true);

Similarly, the MSSubClass variable consisted of numeric codes in the original spreadsheet. You can think of those values as being categorical data. Because there is no mathematical order to these particular codes, the categories are nonordinal (or nominal). In this case, readtable read those values in as text to preserve leading zeroes in the codes. MSSubClass was then converted to categorical data.

While MSSubClass has the data type that you want, you might find it difficult to interpret the codes as categories of houses. The file that describes the Ames Housing Data contains the definitions of the numeric codes. Giving these categories readable names can help you understand the data. To make it clear which names go with which numbers, specify both the categories (code) and their names (subclass) in another call to the categorical function.

code = ["020" "030" "040" "045" "050" "060" "070" "075" "080" "085" "090" "120" "150" "160" "180" "190"];
subclass = ["1-STORY 1946 & NEWER ALL STYLES" ...
            "1-STORY 1945 & OLDER" ...
            "1-STORY W/FINISHED ATTIC ALL AGES" ...
            "1-1/2 STORY - UNFINISHED ALL AGES" ...
            "1-1/2 STORY FINISHED ALL AGES" ...
            "2-STORY 1946 & NEWER" ...
            "2-STORY 1945 & OLDER" ...
            "2-1/2 STORY ALL AGES" ...
            "SPLIT OR MULTI-LEVEL" ...
            "SPLIT FOYER" ...
            "DUPLEX - ALL STYLES AND AGES" ...
            "1-STORY PUD (Planned Unit Development) - 1946 & NEWER" ...
            "1-1/2 STORY PUD - ALL AGES" ...
            "2-STORY PUD - 1946 & NEWER" ...
            "PUD - MULTILEVEL - INCL SPLIT LEV/FOYER" ...
            "2 FAMILY CONVERSION - ALL STYLES AND AGES"];
housing.MSSubClass = categorical(housing.MSSubClass,code,subclass);

The category names for the BldgType variable are not obvious. As with MSSubClass, more descriptive names can help you understand the building categories. To display the number of houses in each building category, use the summary function.

summary(housing.BldgType)
     1Fam        2425 
     2fmCon        62 
     Duplex       109 
     Twnhs        101 
     TwnhsE       233 

With only five categories, you can safely list the new category names in the right order without specifying the old names. To rename categories, use the renamecats function.

types = ["Single-family Detached" "Two-family Conversion" "Duplex" "Townhouse End Unit" "Townhouse Inside Unit"];
housing.BldgType = renamecats(housing.BldgType,types);

The GarageType variable includes the category NA, standing for Not Applicable. In GarageType, NA means that the house does not have a garage. But it is too easy to confuse NA with a missing value. A true missing value means it cannot be determined if a house has a garage. But in this housing data, it is always known if a house has a garage. Change that one category name to make its meaning clearer.

housing.GarageType = renamecats(housing.GarageType,"NA","None");

Finally, the PID variable was read in as a string array. While its values were numeric, some of them had leading zeroes. The readtable function preserved this information by storing the values as strings. Then the call to convertvars converted the PID variable to a categorical array. PID stores identification numbers that are unique. Identification numbers are assigned as needed and do not come from a fixed set of values. There is no particular advantage in storing them in a categorical variable. If every identification number is a category, then adding a new identification number means adding a new category to PID. It might be more convenient to convert PID back to a string array. To convert values to strings, use the string function.

housing.PID = string(housing.PID);

Display the results of the preliminary data cleaning.

housing
housing=2930×25 table
        PID                              MSSubClass                          LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    MoSold    YrSold    SalePrice
    ____________    _____________________________________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    ______    ______    _________

    "0526301100"    1-STORY 1946 & NEWER ALL STYLES                              141         31770       NAmes         Single-family Detached         5           1960           1960           CBlock       GasA        true            1656               0                  0                        1656                    1               0             1           0             3           Attchd         5        2010      2.15e+05
    "0526350040"    1-STORY 1946 & NEWER ALL STYLES                               80         11622       NAmes         Single-family Detached         6           1961           1961           CBlock       GasA        true             896               0                  0                         896                    0               0             1           0             2           Attchd         6        2010      1.05e+05
    "0526351010"    1-STORY 1946 & NEWER ALL STYLES                               81         14267       NAmes         Single-family Detached         6           1958           1958           CBlock       GasA        true            1329               0                  0                        1329                    0               0             1           1             3           Attchd         6        2010      1.72e+05
    "0526353030"    1-STORY 1946 & NEWER ALL STYLES                               93         11160       NAmes         Single-family Detached         5           1968           1968           CBlock       GasA        true            2110               0                  0                        2110                    1               0             2           1             3           Attchd         4        2010      2.44e+05
    "0527105010"    2-STORY 1946 & NEWER                                          74         13830       Gilbert       Single-family Detached         5           1997           1998           PConc        GasA        true             928             701                  0                        1629                    0               0             2           1             3           Attchd         3        2010     1.899e+05
    "0527105030"    2-STORY 1946 & NEWER                                          78          9978       Gilbert       Single-family Detached         6           1998           1998           PConc        GasA        true             926             678                  0                        1604                    0               0             2           1             3           Attchd         6        2010     1.955e+05
    "0527127150"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         41          4920       StoneBr       Townhouse Inside Unit          5           2001           2001           PConc        GasA        true            1338               0                  0                        1338                    1               0             2           0             2           Attchd         4        2010     2.135e+05
    "0527145080"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         43          5005       StoneBr       Townhouse Inside Unit          5           1992           1992           PConc        GasA        true            1280               0                  0                        1280                    0               0             2           0             2           Attchd         1        2010     1.915e+05
    "0527146030"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         39          5389       StoneBr       Townhouse Inside Unit          5           1995           1996           PConc        GasA        true            1616               0                  0                        1616                    1               0             2           0             2           Attchd         3        2010     2.365e+05
    "0527162130"    2-STORY 1946 & NEWER                                          60          7500       Gilbert       Single-family Detached         5           1999           1999           PConc        GasA        true            1028             776                  0                        1804                    0               0             2           1             3           Attchd         6        2010      1.89e+05
    "0527163010"    2-STORY 1946 & NEWER                                          75         10000       Gilbert       Single-family Detached         5           1993           1994           PConc        GasA        true             763             892                  0                        1655                    0               0             2           1             3           Attchd         4        2010     1.759e+05
    "0527165230"    1-STORY 1946 & NEWER ALL STYLES                              NaN          7980       Gilbert       Single-family Detached         7           1992           2007           PConc        GasA        true            1187               0                  0                        1187                    1               0             2           0             3           Attchd         3        2010      1.85e+05
    "0527166040"    2-STORY 1946 & NEWER                                          63          8402       Gilbert       Single-family Detached         5           1998           1998           PConc        GasA        true             789             676                  0                        1465                    0               0             2           1             3           Attchd         5        2010     1.804e+05
    "0527180040"    1-STORY 1946 & NEWER ALL STYLES                               85         10176       Gilbert       Single-family Detached         5           1990           1990           PConc        GasA        true            1341               0                  0                        1341                    1               0             1           1             2           Attchd         2        2010     1.715e+05
    "0527182190"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER        NaN          6820       StoneBr       Townhouse Inside Unit          5           1985           1985           PConc        GasA        true            1502               0                  0                        1502                    1               0             1           1             1           Attchd         6        2010      2.12e+05
    "0527216070"    2-STORY 1946 & NEWER                                          47         53504       StoneBr       Single-family Detached         5           2003           2003           PConc        GasA        true            1690            1589                  0                        3279                    1               0             3           1             4           BuiltIn        6        2010      5.38e+05
      ⋮

Create Variable for Date of Sale

The table has separate variables for the month and year of sale. It is more convenient if those variables are combined in one datetime variable. Assignment by using dot notation is a good way to add a new variable at the right edge of a table. Add the date of sale as a new variable.

housing.LastSoldDate = datetime(housing.YrSold,housing.MoSold,0,"Format","MMM yyyy");

Now delete the two original variables. It is easier to list the variables by name and use removevars.

housing = removevars(housing,["YrSold" "MoSold"])
housing=2930×24 table
        PID                              MSSubClass                          LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _____________________________________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    _________    ____________

    "0526301100"    1-STORY 1946 & NEWER ALL STYLES                              141         31770       NAmes         Single-family Detached         5           1960           1960           CBlock       GasA        true            1656               0                  0                        1656                    1               0             1           0             3           Attchd        2.15e+05      Apr 2010  
    "0526350040"    1-STORY 1946 & NEWER ALL STYLES                               80         11622       NAmes         Single-family Detached         6           1961           1961           CBlock       GasA        true             896               0                  0                         896                    0               0             1           0             2           Attchd        1.05e+05      May 2010  
    "0526351010"    1-STORY 1946 & NEWER ALL STYLES                               81         14267       NAmes         Single-family Detached         6           1958           1958           CBlock       GasA        true            1329               0                  0                        1329                    0               0             1           1             3           Attchd        1.72e+05      May 2010  
    "0526353030"    1-STORY 1946 & NEWER ALL STYLES                               93         11160       NAmes         Single-family Detached         5           1968           1968           CBlock       GasA        true            2110               0                  0                        2110                    1               0             2           1             3           Attchd        2.44e+05      Mar 2010  
    "0527105010"    2-STORY 1946 & NEWER                                          74         13830       Gilbert       Single-family Detached         5           1997           1998           PConc        GasA        true             928             701                  0                        1629                    0               0             2           1             3           Attchd       1.899e+05      Feb 2010  
    "0527105030"    2-STORY 1946 & NEWER                                          78          9978       Gilbert       Single-family Detached         6           1998           1998           PConc        GasA        true             926             678                  0                        1604                    0               0             2           1             3           Attchd       1.955e+05      May 2010  
    "0527127150"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         41          4920       StoneBr       Townhouse Inside Unit          5           2001           2001           PConc        GasA        true            1338               0                  0                        1338                    1               0             2           0             2           Attchd       2.135e+05      Mar 2010  
    "0527145080"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         43          5005       StoneBr       Townhouse Inside Unit          5           1992           1992           PConc        GasA        true            1280               0                  0                        1280                    0               0             2           0             2           Attchd       1.915e+05      Dec 2009  
    "0527146030"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER         39          5389       StoneBr       Townhouse Inside Unit          5           1995           1996           PConc        GasA        true            1616               0                  0                        1616                    1               0             2           0             2           Attchd       2.365e+05      Feb 2010  
    "0527162130"    2-STORY 1946 & NEWER                                          60          7500       Gilbert       Single-family Detached         5           1999           1999           PConc        GasA        true            1028             776                  0                        1804                    0               0             2           1             3           Attchd        1.89e+05      May 2010  
    "0527163010"    2-STORY 1946 & NEWER                                          75         10000       Gilbert       Single-family Detached         5           1993           1994           PConc        GasA        true             763             892                  0                        1655                    0               0             2           1             3           Attchd       1.759e+05      Mar 2010  
    "0527165230"    1-STORY 1946 & NEWER ALL STYLES                              NaN          7980       Gilbert       Single-family Detached         7           1992           2007           PConc        GasA        true            1187               0                  0                        1187                    1               0             2           0             3           Attchd        1.85e+05      Feb 2010  
    "0527166040"    2-STORY 1946 & NEWER                                          63          8402       Gilbert       Single-family Detached         5           1998           1998           PConc        GasA        true             789             676                  0                        1465                    0               0             2           1             3           Attchd       1.804e+05      Apr 2010  
    "0527180040"    1-STORY 1946 & NEWER ALL STYLES                               85         10176       Gilbert       Single-family Detached         5           1990           1990           PConc        GasA        true            1341               0                  0                        1341                    1               0             1           1             2           Attchd       1.715e+05      Jan 2010  
    "0527182190"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER        NaN          6820       StoneBr       Townhouse Inside Unit          5           1985           1985           PConc        GasA        true            1502               0                  0                        1502                    1               0             1           1             1           Attchd        2.12e+05      May 2010  
    "0527216070"    2-STORY 1946 & NEWER                                          47         53504       StoneBr       Single-family Detached         5           2003           2003           PConc        GasA        true            1690            1589                  0                        3279                    1               0             3           1             4           BuiltIn       5.38e+05      May 2010  
      ⋮

Explore Data with Plots

Explore the data by making some simple plots. Many basic plotting commands do not accept tables as input arguments. But you can use dot notation to pass one or more table variables into a plotting function. You are taking arrays out of the table and passing them as input arguments to a plotting function.

For example, make a scatter plot of the sale prices of houses in the table as a function of the years in which they were built.

scatter(housing.YearBuilt,housing.SalePrice,20,"filled");

A log transformation of the prices might show a simpler relationship between year and price. Also, you can show more information in the scatter plot by using the living area of the houses to color the markers. The living areas have a long right tail, so it is also useful to show a log transformation of the areas. To transform the two table variables, wrap them in calls to the log function. Then make another scatter plot.

logSalePrice = log(housing.SalePrice);
logLivingArea = log(housing.TotalAboveGroundLivingArea);
scatter(housing.YearBuilt,logSalePrice,20,logLivingArea,"filled");

Clean Errors in Data

Any large, complex data set collected over a long period of time might contain some errors. Check for errors in the housing data. Dates in the data are a good place to start. First compare YearBuilt to YearRemod_Add.

checkRows = housing.YearBuilt > housing.YearRemod_Add;
housing(checkRows,:)
ans=1×24 table
        PID                   MSSubClass               LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _______________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    _________    ____________

    "0907194160"    1-STORY 1946 & NEWER ALL STYLES        65          10739       CollgCr       Single-family Detached         5           2002           2001           PConc        GasA        true            1444              0                   0                        1444                    1               0             2           0             3            Attchd      2.03e+05       Mar 2009  

It is not possible for remodeling to have been done in 2001 if the house itself was built in 2002. If you assume that the YearBuilt value is known to be the error (an assumption that needs to be confirmed), you can use dot notation to assign 2001 as the year in which this house was built.

housing.YearBuilt(checkRows) = 2001;

As another check, compare the new LastSoldDate variable to YearBuilt.

checkRows = housing.YearBuilt > year(housing.LastSoldDate);
housing(checkRows,:)
ans=2×24 table
        PID                   MSSubClass               LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    SalePrice     LastSoldDate
    ____________    _______________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    __________    ____________

    "0908154235"    2-STORY 1946 & NEWER                   313         63887       Edwards       Single-family Detached         5           2008           2008           PConc        GasA        true            4692             950                  0                        5642                    2               0             2           1             3            Attchd         1.6e+05      Dec 2007  
    "0908154195"    1-STORY 1946 & NEWER ALL STYLES        128         39290       Edwards       Single-family Detached         5           2008           2009           PConc        GasA        true            5095               0                  0                        5095                    1               1             2           1             2            Attchd      1.8385e+05      Sep 2007  

There is another issue. These two houses were sold in late 2007, as shown in the LastSoldDate variable. But the corresponding value in YearBuilt is 2008. It might be that for these houses, the years in YearBuilt were recorded in early 2008 (another assumption needing confirmation). Update the YearBuilt variable, this time by using dot notation to assign to two rows.

housing.YearBuilt(checkRows) = 2007;

Clean Up Missing Data

The next step in cleaning the data is to check for missing data in the numeric and categorical variables. The one logical variable in housing does not support missing values. The ismissing function indicates which elements of the table have missing values.

missingElements = ismissing(housing)
missingElements = 2930×24 logical array

   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
      ⋮

The ismissing function returns a logical matrix that is the same size as the table. Summing the columns of that matrix gives the number of missing values in each of the variables of the table.

numMissing = sum(missingElements,1)
numMissing = 1×24

     0     0   490     0     0     0     0     0     0     0     0     0     0     0     0     0     2     2     0     0     0     0     0     0

Only three of the variables have missing data, but without the variable names it is not easy to tell which variables they are. One way to tell is to index into the VariableNames property of the table to find the names that correspond to the variables with missing values.

housing.Properties.VariableNames(numMissing > 0)
ans = 1×3 cell
    {'LotFrontage'}    {'BsmtFullBath'}    {'BsmtHalfBath'}

Deciding what to do about missing data is a challenge. If the data is missing at random, and there are only a few missing values, one strategy is to remove those rows from the table. The four missing basement bath values (NaNs, in this case) occur in only two rows. You can remove those two rows by using the rmmissing function.

missingBsmtBath = ismissing(housing.BsmtFullBath) | ismissing(housing.BsmtHalfBath);
housing(missingBsmtBath,:)
ans=2×24 table
        PID                   MSSubClass               LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _______________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    _________    ____________

    "0903230120"    1-STORY 1946 & NEWER ALL STYLES         99          5940       BrkSide       Single-family Detached         7           1946           1950           PConc        GasA        true             896              0                   0                         896                   NaN             NaN            1           0             2            Detchd          79000      Mar 2008  
    "0908154080"    1-STORY 1946 & NEWER ALL STYLES        123         47007       Edwards       Single-family Detached         7           1959           1996           Slab         GasA        true            3820              0                   0                        3820                   NaN             NaN            3           1             5            Attchd      2.847e+05      Jun 2008  

housing = rmmissing(housing,"DataVariables",["BsmtFullBath" "BsmtHalfBath"]);
whos housing
  Name            Size             Bytes  Class    Attributes

  housing      2928x24            595935  table              

This call to rmmissing removes only the rows that have missing values in BsmtFullBath and BsmtHalfBath. The 490 rows with missing LotFrontage values are still in the table. You can remove these 490 rows but doing so deletes more than 16% of the data. You also can fill these missing values with the mean frontage value by using the fillmissing function, but that is not practical for this data. For variables that form a time series, fillmissing also supports filling variables with interpolated values or moving-window smoothed values. LotFrontage is not a time series. The data in this variable is a cross-sectional data set.

One commonly used strategy for filling in missing values in cross-sectional data is to create a regression model to predict the missing values in a row from the non-missing data in that row. A simple scatter plot indicates that there is a log-log relationship between the area of a lot and its frontage. That relationship suggests a model.

loglog(housing.LotArea,housing.LotFrontage,'o')

You can use that log-log relationship to fill in the missing LotFrontage values by regressing the values on LotArea.

missingValues = ismissing(housing.LotFrontage);
beta = polyfit(log(housing.LotArea(~missingValues)),log(housing.LotFrontage(~missingValues)),1);
housing.LotFrontage(missingValues) = exp(polyval(beta,log(housing.LotArea(missingValues))));

You can use dot notation to work on data in a table when you use functions such as polyfit and polyval that accept numeric vectors but not tables. You can think of a table as a container that is designed to hold data having different types. Functions such as polyfit that are specifically for numeric inputs do not work on a table because a table often contains nonnumeric data. Even when a table contains only numeric data, it is still a container. The functions must be applied to the contents of the table. Use dot notation to access table variables.

Add the imputed missing values that you calculated with polyfit and polyval to the scatter plot. A simple imputation scheme might not be sufficient in a real analysis of this data, but it illustrates how to visualize and make computations on numeric data in a table.

hold on
loglog(housing.LotArea(missingValues),housing.LotFrontage(missingValues),'rx')
hold off

Arithmetic on Table Variables

Dot notation has been convenient for operations such as converting an existing table variable, adding a new variable, assigning values, plotting, and applying functions like polyval to a table variable. Dot notation is also convenient for arithmetic operations on table variables. For example, convert the LotFrontage variable from feet to meters.

housing.LotFrontage = 0.3048 * housing.LotFrontage;
housing.Properties.VariableUnits("LotFrontage") = "m"
housing=2928×24 table
        PID                              MSSubClass                          LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _____________________________________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    ____________    __________    _________    ____________

    "0526301100"    1-STORY 1946 & NEWER ALL STYLES                            42.977        31770       NAmes         Single-family Detached         5           1960           1960           CBlock       GasA        true            1656               0                  0                        1656                    1               0             1           0             3           Attchd        2.15e+05      Apr 2010  
    "0526350040"    1-STORY 1946 & NEWER ALL STYLES                            24.384        11622       NAmes         Single-family Detached         6           1961           1961           CBlock       GasA        true             896               0                  0                         896                    0               0             1           0             2           Attchd        1.05e+05      May 2010  
    "0526351010"    1-STORY 1946 & NEWER ALL STYLES                            24.689        14267       NAmes         Single-family Detached         6           1958           1958           CBlock       GasA        true            1329               0                  0                        1329                    0               0             1           1             3           Attchd        1.72e+05      May 2010  
    "0526353030"    1-STORY 1946 & NEWER ALL STYLES                            28.346        11160       NAmes         Single-family Detached         5           1968           1968           CBlock       GasA        true            2110               0                  0                        2110                    1               0             2           1             3           Attchd        2.44e+05      Mar 2010  
    "0527105010"    2-STORY 1946 & NEWER                                       22.555        13830       Gilbert       Single-family Detached         5           1997           1998           PConc        GasA        true             928             701                  0                        1629                    0               0             2           1             3           Attchd       1.899e+05      Feb 2010  
    "0527105030"    2-STORY 1946 & NEWER                                       23.774         9978       Gilbert       Single-family Detached         6           1998           1998           PConc        GasA        true             926             678                  0                        1604                    0               0             2           1             3           Attchd       1.955e+05      May 2010  
    "0527127150"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER      12.497         4920       StoneBr       Townhouse Inside Unit          5           2001           2001           PConc        GasA        true            1338               0                  0                        1338                    1               0             2           0             2           Attchd       2.135e+05      Mar 2010  
    "0527145080"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER      13.106         5005       StoneBr       Townhouse Inside Unit          5           1992           1992           PConc        GasA        true            1280               0                  0                        1280                    0               0             2           0             2           Attchd       1.915e+05      Dec 2009  
    "0527146030"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER      11.887         5389       StoneBr       Townhouse Inside Unit          5           1995           1996           PConc        GasA        true            1616               0                  0                        1616                    1               0             2           0             2           Attchd       2.365e+05      Feb 2010  
    "0527162130"    2-STORY 1946 & NEWER                                       18.288         7500       Gilbert       Single-family Detached         5           1999           1999           PConc        GasA        true            1028             776                  0                        1804                    0               0             2           1             3           Attchd        1.89e+05      May 2010  
    "0527163010"    2-STORY 1946 & NEWER                                        22.86        10000       Gilbert       Single-family Detached         5           1993           1994           PConc        GasA        true             763             892                  0                        1655                    0               0             2           1             3           Attchd       1.759e+05      Mar 2010  
    "0527165230"    1-STORY 1946 & NEWER ALL STYLES                            19.049         7980       Gilbert       Single-family Detached         7           1992           2007           PConc        GasA        true            1187               0                  0                        1187                    1               0             2           0             3           Attchd        1.85e+05      Feb 2010  
    "0527166040"    2-STORY 1946 & NEWER                                       19.202         8402       Gilbert       Single-family Detached         5           1998           1998           PConc        GasA        true             789             676                  0                        1465                    0               0             2           1             3           Attchd       1.804e+05      Apr 2010  
    "0527180040"    1-STORY 1946 & NEWER ALL STYLES                            25.908        10176       Gilbert       Single-family Detached         5           1990           1990           PConc        GasA        true            1341               0                  0                        1341                    1               0             1           1             2           Attchd       1.715e+05      Jan 2010  
    "0527182190"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER      17.465         6820       StoneBr       Townhouse Inside Unit          5           1985           1985           PConc        GasA        true            1502               0                  0                        1502                    1               0             1           1             1           Attchd        2.12e+05      May 2010  
    "0527216070"    2-STORY 1946 & NEWER                                       14.326        53504       StoneBr       Single-family Detached         5           2003           2003           PConc        GasA        true            1690            1589                  0                        3279                    1               0             3           1             4           BuiltIn       5.38e+05      May 2010  
      ⋮

Using dot notation means that the multiplication is applied not to the housing table, which cannot be done because tables are containers, but rather to its LotFrontage variable, which is a numeric vector. With dot notation, you extracted LotFrontage from the table and put the modified version back in.

Another way to access the contents of a table is to subscript into it by using curly braces, just as you use curly braces to extract the contents of a cell array. You can use curly brace subscripting to refer to and operate on data in a table by extracting and reinserting contents. For example, convert LotFrontage back to feet by using curly brace subscripting.

housing{:,"LotFrontage"} = housing{:,"LotFrontage"} / 0.3048;
housing.Properties.VariableUnits("LotFrontage") = "ft";

Dot notation and brace subscripting are different syntaxes for the same kinds of operations. They both work on the contents of a table. Also, they both enable you to specify a table variable and a subset of its rows.

housing.LotArea(1:2)
ans = 2×1

       31770
       11622

housing{1:2,"LotArea"}
ans = 2×1

       31770
       11622

While both syntaxes work on the contents of table, there are two subtle differences to consider.

First, a limitation of curly brace subscripting is that it assigns into the contents of a table rather than replacing a variable. For example, this assignment does not change the data type of the LotFrontage variable in the way that an assignment using dot notation does. The call to the single function on the right side of the assignment creates an array having the single data type. But by subscripting into housing with curly braces, you assign values from that array into the existing table variable. And the data type of LotFrontage is double. The values from the right side are converted back to double by this assignment.

housing{:,"LotFrontage"} = single(housing{:,"LotFrontage"});

Second, a benefit of curly brace subscripting is that, unlike dot notation, it uses the familiar two-dimensional subscripting syntax. This syntax enables you to refer to more than one variable at a time and also to a subset of rows. For example, there are five variables whose units are square feet. Converting these variables to square meters one at a time is tedious. To apply the multiplication to all five variables at once, use curly brace subscripting.

areaVars = ["LotArea" "FirstFlrArea" "SecondFlrArea" "LowQualFinishedArea" "TotalAboveGroundLivingArea"];
housing{:,areaVars} = 0.3048^2 * housing{:,areaVars};
housing.Properties.VariableUnits(areaVars) = "m^2";

A common mistake is to use parenthesis subscripting instead of braces to operate on the contents of a table. While some functions, such as ismissing or varfun, do accept a table as their input, many numeric operations, including arithmetic, do not. For example, this assignment using parentheses results in an error. The try-catch block catches the error and displays it.

try
    housing(:,areaVars) = 0.3048^2 * housing(:,areaVars);
catch ME
    disp(ME.message)
end
Undefined function 'mtimes' for input arguments of type 'table'.

This assignment results in an error because housing(:,areaVars) is a 2928-by-5 table, not a numeric matrix. If you used curly brace subscripting, such as housing{:,areaVars}, then the result would be a 2928-by-5 numeric matrix. Because tables are designed to hold data of different types, including nonnumeric data, many functions that make sense for only numeric data do not work on a table. Dot notation and curly brace subscripting exist to give you access to data in a table.

A third way to do calculations on numeric variables in a table is to use the varfun function. Like curly brace subscripting, varfun can operate on all or only some of the variables in a table. Unlike curly braces, varfun operates on each table variable separately. By default, varfun returns another table containing a variable for each separate result.

Sometimes the operation that you want to apply is an existing function. To pass the function as an argument to varfun, use a function handle. For example, use the round function to round data in the variables specified by areaVars.

roundedAreaTable = varfun(@round,housing,"InputVariables",areaVars)
roundedAreaTable=2928×5 table
    round_LotArea    round_FirstFlrArea    round_SecondFlrArea    round_LowQualFinishedArea    round_TotalAboveGroundLivingArea
    _____________    __________________    ___________________    _________________________    ________________________________

        2952                154                      0                        0                              154               
        1080                 83                      0                        0                               83               
        1325                123                      0                        0                              123               
        1037                196                      0                        0                              196               
        1285                 86                     65                        0                              151               
         927                 86                     63                        0                              149               
         457                124                      0                        0                              124               
         465                119                      0                        0                              119               
         501                150                      0                        0                              150               
         697                 96                     72                        0                              168               
         929                 71                     83                        0                              154               
         741                110                      0                        0                              110               
         781                 73                     63                        0                              136               
         945                125                      0                        0                              125               
         634                140                      0                        0                              140               
        4971                157                    148                        0                              305               
      ⋮

If there is no function that does exactly what you want, you can also write an anonymous function to do it.

sqMeters2sqFeet = @(x) x / 0.3048^2;
areaTable = varfun(sqMeters2sqFeet,housing,"InputVariables",areaVars)
areaTable=2928×5 table
    Fun_LotArea    Fun_FirstFlrArea    Fun_SecondFlrArea    Fun_LowQualFinishedArea    Fun_TotalAboveGroundLivingArea
    ___________    ________________    _________________    _______________________    ______________________________

       31770             1656                   0                      0                            1656             
       11622              896                   0                      0                             896             
       14267             1329                   0                      0                            1329             
       11160             2110                   0                      0                            2110             
       13830              928                 701                      0                            1629             
        9978              926                 678                      0                            1604             
        4920             1338                   0                      0                            1338             
        5005             1280                   0                      0                            1280             
        5389             1616                   0                      0                            1616             
        7500             1028                 776                      0                            1804             
       10000              763                 892                      0                            1655             
        7980             1187                   0                      0                            1187             
        8402              789                 676                      0                            1465             
       10176             1341                   0                      0                            1341             
        6820             1502                   0                      0                            1502             
       53504             1690                1589                      0                            3279             
      ⋮

Because that result is a table, it can be assigned back into the original table with parenthesis subscripting.

housing(:,areaVars) = areaTable;
housing.Properties.VariableUnits(areaVars) = "ft^2";

It is important to understand the difference between the parentheses in

housing(:,areaVars) = areaTable;

and the braces in

housing{:,areaVars} = 0.3048^2 * housing{:,areaVars};

The two assignments have the same effect. The assignment with parentheses assigns one table to another. The assignment with curly braces explicitly assigns values to the content of the table. The left and right sides of that assignment are numeric matrices. Because curly brace subscripting extracts and reinserts data, it is a convenient way to modify data in place. Contents-to-contents assignment can operate on only one data type at a time, while table-to-table assignment can move data of different types. For example, this assignment results in an error because it involves mixed numeric and categorical data in brace subscripting.

try
    housing{:,["LotFrontage" "OverallCond"]} = normalize(housing{:,["LotFrontage" "OverallCond"]});
catch ME
    disp(ME.message)
end
Unable to concatenate the specified table variables.

Because varfun returns a table, assignment using parenthesis subscripting cannot change the type of any table variables. For example, this assignment does not convert any variables from the double to single data type.

housing(:,areaVars) = varfun(@single,housing,"InputVariables",areaVars);

To convert the data types of table variables, use convertvars, as previously shown.

Row Operations on Data in Table

Because curly brace subscripting extracts the variables from a table as one matrix having one data type, you can use it to perform row operations across numeric variables in a table. For example, a check on the data is to compare the individual square footage variables against TotalAboveGroundLivingArea. Extract the former by using curly braces. Then compare their row sums to TotalAboveGroundLivingArea, extracted by using dot notation.

area = housing{:,["FirstFlrArea" "SecondFlrArea" "LowQualFinishedArea"]}
area = 2928×3

        1656           0           0
         896           0           0
        1329           0           0
        2110           0           0
         928         701           0
         926         678           0
        1338           0           0
        1280           0           0
        1616           0           0
        1028         776           0
      ⋮

isequal(sum(area,2), housing.TotalAboveGroundLivingArea)
ans = logical
   1

The square footage data is consistent. Another example is to compute the total number of bathrooms in each house by extracting the four different bathroom counts and adding them up across each row.

bathCountVars = ["BsmtHalfBath" "HalfBath" "BsmtFullBath" "FullBath"];
bathCounts = housing{:,bathCountVars}
bathCounts = 2928×4

     0     0     1     1
     0     0     0     1
     0     1     0     1
     0     1     1     2
     0     1     0     2
     0     1     0     2
     0     0     1     2
     0     0     0     2
     0     0     1     2
     0     1     0     2
      ⋮

You might think to sum the rows of that matrix as:

sum(housing{:,bathCountVars},2);

but that sum is not correct. Half-baths count only half as much as full bathrooms. A trend in real estate listings is to account for multiple half-baths by counting them after the decimal point. Matrix multiplication makes that operation one line.

TotalBaths = housing{:,bathCountVars} * [.1; .1; 1; 1];

Replace those four variables with TotalBaths, rather than adding a new variable at the end of the table. Begin this replacement by using addvars to add TotalBaths next to the existing variables.

housing = addvars(housing,TotalBaths, 'After',"HalfBath");

There is a mistake in one row of the data. A townhouse built in 2007 probably does not have four half baths and no full baths.

groupcounts(housing,"TotalBaths")
ans=17×3 table
    TotalBaths    GroupCount    Percent 
    __________    __________    ________

       0.4             1        0.034153
         1           442          15.096
       1.1           293          10.007
       1.2            20         0.68306
       1.3             2        0.068306
         2           890          30.396
       2.1           558          19.057
       2.2            29         0.99044
         3           349          11.919
       3.1           288          9.8361
       3.2             6         0.20492
       3.3             1        0.034153
         4            25         0.85383
       4.1            16         0.54645
       4.2             3         0.10246
         6             2        0.068306
      ⋮

housing(housing.TotalBaths < 1,:)
ans=1×25 table
        PID                              MSSubClass                          LotFrontage    LotArea    Neighborhood          BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    BsmtFullBath    BsmtHalfBath    FullBath    HalfBath    TotalBaths    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _____________________________________________________    ___________    _______    ____________    _____________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    ____________    ____________    ________    ________    __________    ____________    __________    _________    ____________

    "0528228275"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER        53          3922        Blmngtn       Townhouse Inside Unit         5           2006           2007           PConc        GasA        true            1402              0                   0                        1402                    0               2             0           2           0.4             2            Attchd      1.942e+05      Jan 2007  

The BsmtHalfBath count should be two full bathrooms. The bathroom counts are all numeric. The assignment with braces updates all three values across that row.

housing{housing.TotalBaths < 1,["BsmtHalfBath" "FullBath" "TotalBaths"]} = [0 2 2.2];

Next use removevars to delete the redundant original variables.

housing = removevars(housing,bathCountVars)
housing=2928×21 table
        PID                              MSSubClass                          LotFrontage    LotArea    Neighborhood           BldgType           OverallCond    YearBuilt    YearRemod_Add    Foundation    Heating    CentralAir    FirstFlrArea    SecondFlrArea    LowQualFinishedArea    TotalAboveGroundLivingArea    TotalBaths    BedroomAbvGr    GarageType    SalePrice    LastSoldDate
    ____________    _____________________________________________________    ___________    _______    ____________    ______________________    ___________    _________    _____________    __________    _______    __________    ____________    _____________    ___________________    __________________________    __________    ____________    __________    _________    ____________

    "0526301100"    1-STORY 1946 & NEWER ALL STYLES                               141        31770       NAmes         Single-family Detached         5           1960           1960           CBlock       GasA        true            1656               0                  0                        1656                    2             3           Attchd        2.15e+05      Apr 2010  
    "0526350040"    1-STORY 1946 & NEWER ALL STYLES                                80        11622       NAmes         Single-family Detached         6           1961           1961           CBlock       GasA        true             896               0                  0                         896                    1             2           Attchd        1.05e+05      May 2010  
    "0526351010"    1-STORY 1946 & NEWER ALL STYLES                                81        14267       NAmes         Single-family Detached         6           1958           1958           CBlock       GasA        true            1329               0                  0                        1329                  1.1             3           Attchd        1.72e+05      May 2010  
    "0526353030"    1-STORY 1946 & NEWER ALL STYLES                                93        11160       NAmes         Single-family Detached         5           1968           1968           CBlock       GasA        true            2110               0                  0                        2110                  3.1             3           Attchd        2.44e+05      Mar 2010  
    "0527105010"    2-STORY 1946 & NEWER                                           74        13830       Gilbert       Single-family Detached         5           1997           1998           PConc        GasA        true             928             701                  0                        1629                  2.1             3           Attchd       1.899e+05      Feb 2010  
    "0527105030"    2-STORY 1946 & NEWER                                           78         9978       Gilbert       Single-family Detached         6           1998           1998           PConc        GasA        true             926             678                  0                        1604                  2.1             3           Attchd       1.955e+05      May 2010  
    "0527127150"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER          41         4920       StoneBr       Townhouse Inside Unit          5           2001           2001           PConc        GasA        true            1338               0                  0                        1338                    3             2           Attchd       2.135e+05      Mar 2010  
    "0527145080"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER          43         5005       StoneBr       Townhouse Inside Unit          5           1992           1992           PConc        GasA        true            1280               0                  0                        1280                    2             2           Attchd       1.915e+05      Dec 2009  
    "0527146030"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER          39         5389       StoneBr       Townhouse Inside Unit          5           1995           1996           PConc        GasA        true            1616               0                  0                        1616                    3             2           Attchd       2.365e+05      Feb 2010  
    "0527162130"    2-STORY 1946 & NEWER                                           60         7500       Gilbert       Single-family Detached         5           1999           1999           PConc        GasA        true            1028             776                  0                        1804                  2.1             3           Attchd        1.89e+05      May 2010  
    "0527163010"    2-STORY 1946 & NEWER                                           75        10000       Gilbert       Single-family Detached         5           1993           1994           PConc        GasA        true             763             892                  0                        1655                  2.1             3           Attchd       1.759e+05      Mar 2010  
    "0527165230"    1-STORY 1946 & NEWER ALL STYLES                            62.496         7980       Gilbert       Single-family Detached         7           1992           2007           PConc        GasA        true            1187               0                  0                        1187                    3             3           Attchd        1.85e+05      Feb 2010  
    "0527166040"    2-STORY 1946 & NEWER                                           63         8402       Gilbert       Single-family Detached         5           1998           1998           PConc        GasA        true             789             676                  0                        1465                  2.1             3           Attchd       1.804e+05      Apr 2010  
    "0527180040"    1-STORY 1946 & NEWER ALL STYLES                                85        10176       Gilbert       Single-family Detached         5           1990           1990           PConc        GasA        true            1341               0                  0                        1341                  2.1             2           Attchd       1.715e+05      Jan 2010  
    "0527182190"    1-STORY PUD (Planned Unit Development) - 1946 & NEWER      57.299         6820       StoneBr       Townhouse Inside Unit          5           1985           1985           PConc        GasA        true            1502               0                  0                        1502                  2.1             1           Attchd        2.12e+05      May 2010  
    "0527216070"    2-STORY 1946 & NEWER                                           47        53504       StoneBr       Single-family Detached         5           2003           2003           PConc        GasA        true            1690            1589                  0                        3279                  4.1             4           BuiltIn       5.38e+05      May 2010  
      ⋮

Unlike curly braces, varfun operates on each variable in a table separately. For that reason, varfun cannot do row operations. The related function rowfun can do row operations. It is often simpler and faster to use curly brace subscripting for row operations.

Reductions of Data in Table

In previous sections, the operations on numeric data in the table were transformations that replace the original values. Many other important operations are reductions whose results are scalars. For example, calculate the median price of the values in SalePrice.

median(housing.SalePrice)
ans = 160000

The median function works column-wise on matrices. You can use curly brace subscripting to extract those four variables as a numeric matrix. Then you can calculate the medians of the columns of the matrix.

median(housing{:,["LotFrontage", "LotArea" "TotalAboveGroundLivingArea" "SalePrice"]})
ans = 1×4
105 ×

    0.0007    0.0944    0.0144    1.6000

This operation does not attach variable names or any other table metadata to the result. As an alternative, you can use varfun to apply median to each variable in the table. With varfun, the result is another table that contains separate numeric results and preserves the names.

varfun(@median,housing,"InputVariables",["LotFrontage", "LotArea" "TotalAboveGroundLivingArea" "SalePrice"])
ans=1×4 table
    median_LotFrontage    median_LotArea    median_TotalAboveGroundLivingArea    median_SalePrice
    __________________    ______________    _________________________________    ________________

          69.183              9436.5                      1442                       1.6e+05     

These two ways to get the medians are equivalent. There is a trade-off between having the variable names preserved in another table and having the results in one numeric row vector. The way you pick depends on what you plan to do with the result.

Operations on Mixed Data Types

Using curly braces when calculating the medians has another drawback. Curly braces require compatible data type for all the variables. That is, the data you extract from the variables must have data types that allow them to be concatenated into one matrix. Ordinal categorical data can also have median values. Because categorical and numeric arrays cannot be concatenated, this operation results in an error.

median(housing{:,["LotFrontage", "LotArea" "OverallCond" "TotalAboveGroundLivingArea" "SalePrice"]})

But because varfun operates on each variable in the table separately, there is no requirement that the variables have the same data type or compatible types allowing concatenation. The only requirement is that all the variables must support the function that is applied. To calculate the medians of ordinal categorical variables and numeric variables in one function call use varfun.

varfun(@median,housing,"InputVariables",["LotFrontage", "LotArea" "OverallCond" "TotalAboveGroundLivingArea" "SalePrice"])
ans=1×5 table
    median_LotFrontage    median_LotArea    median_OverallCond    median_TotalAboveGroundLivingArea    median_SalePrice
    __________________    ______________    __________________    _________________________________    ________________

          69.183              9436.5                5                           1442                       1.6e+05     

See Also

| | | | | | | | | | | |

Related Topics