Data Cleaner

Preprocess and organize column-oriented data

Since R2022a

Description

The Data Cleaner app is an interactive tool for identifying messy column-oriented data, cleaning multiple variables of data at a time, and iterating on and refining the cleaning process.

Using this app, you can:

  • Access column-oriented data in the MATLAB® workspace or import column-oriented data from a file.

  • Explore data by using the visualization, data, and summary views.

  • Sort by a variable, rename a variable, or remove a variable.

  • Retime data in a timetable, stack or unstack table variables, clean missing data, clean outlier data, smooth data, or normalize data.

  • Edit previously performed cleaning steps.

  • Export cleaned data to the MATLAB workspace, or export code for data cleaning as a script or function.

The Data Cleaner app currently supports cleaning only table and timetable data, and only one table or timetable at a time.

Open the Data Cleaner App

  • MATLAB Toolstrip: On the Apps tab, under MATLAB, click the app icon.

  • MATLAB command prompt: Enter dataCleaner.

Examples

Use the Data Cleaner app to preprocess and organize messy timetable data by removing a variable and then retiming, smoothing, and normalizing the data. Then, export the cleaned data to the MATLAB workspace. Your own data might require a different set of cleaning steps.

This example shows how to preprocess and organize time-stamped bicycle traffic data. The data set comes from sensors on Broadway Street in Cambridge, MA. The City of Cambridge provides public access to the full data set at the Cambridge Open Data site.

  1. Open Timetable in Data Cleaner App

    Use the MATLAB Toolstrip or the MATLAB command window to open the Data Cleaner app.

    Load the time-stamped bicycle traffic data by using bikeData = readtimetable("BicycleCounts.csv") in the command window. Then, select Import > Import from Workspace in the Data Cleaner app, and specify the timetable bikeData. Alternatively, import the data by selecting Import > Import from File in the Data Cleaner app.

    Once the timetable is loaded into the app, view the raw data in the Data tab and a data summary in the Summary tab.

    Summary tab with a data summary including timetable statistics and variable statistics

    Explore the timetable data in the Visualization tab. Select the Total, Westbound, and Eastbound timetable variables in the Variables panel.

    The plots suggest that there is a correlation between time of the year and bike traffic.

  2. Remove Variable from Timetable

    The Day variable contains redundant data because the day of data collection is reflected in the timestamp. Interactively remove Day from the timetable by using the Variables panel. To remove the variable, right-click Day and select Delete. Variable removal now appears as a step in the Cleaning Steps panel.

  3. Retime Timetable

    The data summary shows missing and duplicate timestamp values in the timetable. To sort the timetable and establish unique row times, click Retime Timetable in the Cleaning Methods section of the Home tab of the app. Specify Unique row times of input as the selection method and use the Sum method to aggregate. Accept the cleaning parameters to add the cleaning step and update the timetable.

    After accepting the retiming parameters, the updated data summary shows that there are no missing or duplicate timestamp values and that the timestamps are sorted from earliest to latest.

    If retiming is not necessary for your timetable, you can interactively sort by Timestamp or another timetable variable. Access the sorting options by clicking the arrow in the variable header in the Data tab.

  4. Smooth Data

    Because the bicycle traffic spikes for certain days of each week, smoothing can lessen the noise within each week and give better insight into the bicycle traffic trend throughout the year. To smooth the data, use the Smooth Data cleaning method. Select the Moving mean smoothing method and specify a centered 7-day window for smoothing. Accept the cleaning parameters to add the cleaning step and update the timetable.

  5. Normalize Data

    Because the three numeric variables Total, Westbound, and Eastbound have different scales, use normalization to scale by standard deviation. To normalize the data, use the Normalize Data cleaning method. Select Scale as the normalization method and Standard deviation as the scale type.

    To more clearly preview this cleaning step, clear the original data in the legend of the visualizations. Accept the cleaning parameters to add the cleaning step and update the timetable.

    Visualization of normalized data and list of cleaning steps

  6. Export Timetable

    Export the cleaned timetable to the MATLAB workspace by selecting Export > Export to Workspace.

    Alternatively, export timetable cleaning code by selecting Export > Generate Script or Export > Generate Function.
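The interactive steps in this example correspond roughly to the command-line calls below. This is a minimal sketch on synthetic data, not the code that the app generates. The timetable values are invented for illustration, and only the variable names Day, Total, Westbound, and Eastbound follow the example.

```matlab
% Synthetic stand-in for the bicycle counts (values are invented).
% Day offset 5 appears twice and out of order, so the row times are
% unsorted and contain one duplicate.
dayOffsets = [5 0:27]';
Timestamp = datetime(2024,1,1) + days(dayOffsets);
Day = string(day(Timestamp, "name"));
Westbound = 100 + 2*dayOffsets + 30*(mod(dayOffsets, 7) < 2);
Eastbound = 60 + dayOffsets + 10*(mod(dayOffsets, 7) < 2);
Total = Westbound + Eastbound;
bikeData = timetable(Timestamp, Day, Total, Westbound, Eastbound);

% Step 2: remove the redundant Day variable.
cleaned = removevars(bikeData, "Day");

% Step 3: retime to the unique row times and sum duplicate rows.
rowTimes = unique(cleaned.Properties.RowTimes);
cleaned = retime(cleaned, rowTimes, "sum");

% Step 4: smooth with a centered 7-day moving mean.
cleaned = smoothdata(cleaned, "movmean", days(7));

% Step 5: scale each variable by its standard deviation.
cleaned = normalize(cleaned, "scale", "std");
```

The sketch assumes the row times contain no missing values (NaT). If they do, remove those rows before calling retime.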

Parameters

Clean Missing Data

Select one of these values to specify the missing-value indicators.

| Indicators | Indicator Parameters | Description |
| --- | --- | --- |
| Use only standard indicators | Not applicable | Use only standard indicators to detect missing values. Standard missing values depend on the data type: NaN for double, single, duration, and calendarDuration; NaT for datetime; <missing> for string; <undefined> for categorical; {''} for cell of character vectors. |
| Specify non-standard indicators | Indicators | Inside single quotes, list non-standard indicator values to treat as missing, separated by commas. For example, '-99, "N/A"'. |

Select one of these method values and, if necessary, additional method parameters to specify how to handle missing data.

| Method | Method Parameters | Description |
| --- | --- | --- |
| Fill missing | Max gap to fill | Fill missing values. Gaps in data larger than this specified value are not filled (positive scalar). See the Fill method parameter. |
|  | Units | Fill missing values. Specify the gap size unit type. |
| Remove missing | Not applicable | Remove data rows with missing entries. |

Select one of these method values and, if necessary, additional method parameters to specify how to fill missing data.

| Method | Method Parameters | Description |
| --- | --- | --- |
| Constant value | Constant value | Use a constant scalar value. |
| Previous value | Not applicable | Use the previous nonmissing value. |
| Next value | Not applicable | Use the next nonmissing value. |
| Nearest value | Not applicable | Use the nearest nonmissing value as defined by the x-axis. |
| Linear interpolation | Not applicable | Use the linear interpolation of neighboring, nonmissing values. |
| Spline interpolation | Not applicable | Use the piecewise cubic spline interpolation. |
| Shape-preserving cubic interpolation (PCHIP) | Not applicable | Use the shape-preserving piecewise cubic spline interpolation. |
| Modified Akima cubic interpolation | Not applicable | Use the modified Akima cubic Hermite interpolation. |
| Moving median | Moving window type | Center or asymmetrically align the moving window about the current element. |
|  | Window length | Specify the length of the moving window (positive scalar). |
|  | Right half window length (if moving window type is Asymmetric) | Specify the number of window units after the current element to define the window alignment (positive scalar). |
|  | Units | Specify the moving window unit type. |
| Moving mean | Moving window type | Center or asymmetrically align the moving window about the current element. |
|  | Window length | Specify the length of the moving window (positive scalar). |
|  | Right half window length (if moving window type is Asymmetric) | Specify the number of window units after the current element to define the window alignment (positive scalar). |
|  | Units | Specify the moving window unit type. |
| K-nearest neighbors | Num neighbor rows | Specify the number of nearest neighbors (k) to average for the fill value. |
|  | Distance function | Define the distance between rows of data as the Euclidean distance or the scaled Euclidean distance. |
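
The missing-data options above resemble what the standardizeMissing, fillmissing, and rmmissing functions do at the command line. The following is a minimal sketch, assuming hypothetical sensor readings in which -99 is a non-standard missing-value code. It is not the code that the app generates.

```matlab
% Hypothetical hourly readings; -99 is a non-standard missing-value code.
Time = datetime(2024,1,1) + hours(0:7)';
Temp = [20; -99; 22; NaN; NaN; NaN; 26; 27];
tt = timetable(Time, Temp);

% Convert the non-standard indicator to the standard one (NaN for double).
tt = standardizeMissing(tt, -99);

% Fill by linear interpolation, but leave gaps wider than 2 hours unfilled.
% The gap size is measured between the nonmissing neighbors of the gap.
filled = fillmissing(tt, "linear", "MaxGap", hours(2));

% Alternatively, remove every row that still has a missing entry.
trimmed = rmmissing(filled);
```

Here the single missing reading is interpolated, while the three consecutive missing readings span a 4-hour gap and stay missing until rmmissing removes those rows.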
Clean Outlier Data

Select one of these method values to specify how to handle outlier data.

| Method | Description |
| --- | --- |
| Fill outliers | Fill outlier values. See the Fill method parameter. |
| Remove outliers | Remove data rows with outlier values. |

Select one of these method values to specify the fill method for replacing outlier data.

| Method | Description |
| --- | --- |
| Constant value | Use the specified constant scalar value. |
| Center value | Use the center value determined by the find method. |
| Clip to threshold value | Use the lower threshold value for elements less than the lower threshold determined by the find method. Use the upper threshold value for elements greater than the upper threshold determined by the find method. |
| Previous value | Use the previous nonoutlier value. |
| Next value | Use the next nonoutlier value. |
| Nearest value | Use the nearest nonoutlier value. |
| Linear interpolation | Use the linear interpolation of neighboring, nonoutlier values. |
| Spline interpolation | Use the piecewise cubic spline interpolation. |
| Shape-preserving cubic interpolation (PCHIP) | Use the shape-preserving piecewise cubic spline interpolation. |
| Modified Akima cubic interpolation | Use the modified Akima cubic Hermite interpolation. |

Select one of these method values and additional method parameters to specify the detection method for identifying outlier data.

| Method | Method Parameters | Description |
| --- | --- | --- |
| Median | Threshold factor | Outliers are defined as elements more than the specified threshold of scaled median absolute deviations (MAD) from the median. For input data A, the scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)). |
| Mean | Threshold factor | Outliers are defined as elements more than the specified threshold of standard deviations from the mean. This method is faster but less robust than Median. |
| Quartiles | Threshold factor | Outliers are defined as elements more than the specified threshold of interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the input data is not normally distributed. |
| Grubbs | Threshold factor | Outliers are detected using Grubbs’ test, which removes one outlier per iteration based on hypothesis testing. This method assumes that the input data is normally distributed. |
| Generalized extreme studentized deviate (GESD) | Threshold factor | Outliers are detected using the generalized extreme studentized deviate test for outliers. This iterative method is similar to Grubbs but can perform better when multiple outliers are masking each other. |
| Moving median | Threshold factor | Outliers are defined as elements more than the specified threshold of local scaled MAD from the local median over a specified window. |
|  | Moving window type | Center or asymmetrically align the moving window about the current element. |
|  | Window length | Specify the length of the moving window (positive scalar). |
|  | Right half window length (if moving window type is Asymmetric) | Specify the number of window units after the current element to define the window alignment (positive scalar). |
|  | Units | Specify the moving window unit type. |
| Moving mean | Threshold factor | Outliers are defined as elements more than the specified threshold of local standard deviations from the local mean over a specified window. |
|  | Moving window type | Center or asymmetrically align the moving window about the current element. |
|  | Window length | Specify the length of the moving window (positive scalar). |
|  | Right half window length (if moving window type is Asymmetric) | Specify the number of window units after the current element to define the window alignment (positive scalar). |
|  | Units | Specify the moving window unit type. |
| Percentiles | Lower threshold | Outliers are defined as elements outside of the percentile range specified by an upper and lower threshold. |
|  | Upper threshold | Outliers are defined as elements outside of the percentile range specified by an upper and lower threshold. |
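
To make the Median detection rule concrete, the sketch below applies the scaled MAD formula from the table by hand and compares the result with the isoutlier function. The sample vector and the threshold factor of 3 are illustrative assumptions, and the comparison is an illustration rather than a description of how the app is implemented.

```matlab
% Illustrative data with two obvious outliers (100 and 300).
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
thresholdFactor = 3;

% Scaled MAD as defined for the Median method.
c = -1/(sqrt(2)*erfcinv(3/2));               % approximately 1.4826
scaledMAD = c*median(abs(A - median(A)));

% Elements more than thresholdFactor scaled MADs from the median.
manualMask = abs(A - median(A)) > thresholdFactor*scaledMAD;

% The same rule through isoutlier, then the clip fill method.
builtinMask = isoutlier(A, "median", "ThresholdFactor", thresholdFactor);
clipped = filloutliers(A, "clip", "median", "ThresholdFactor", thresholdFactor);
```

With this data the median is 59 and the unscaled MAD is 2, so the upper threshold is about 67.9. Both masks flag the values 100 and 300, and the clip method replaces them with that upper threshold.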
Normalize Data

Select one of these method values and, if necessary, additional method parameters to specify the method for normalizing data.

| Method | Method Parameters | Description |
| --- | --- | --- |
| Z-score | Z-score type | Standard deviation: center and scale to have mean 0 and standard deviation 1. Median absolute deviation: center and scale to have median 0 and median absolute deviation 1. |
| Norm | P-Norm | Scale data by p-norm (positive scalar or Inf for infinity norm). |
| Range | Left limit | Rescale range of data with left and right range limits to an interval of the form [a b], where a < b. |
|  | Right limit | Rescale range of data with left and right range limits to an interval of the form [a b], where a < b. |
| Median IQR | Not applicable | Center and scale data to have median 0 and interquartile range 1. |
| Center | Center Type | Mean: center to have mean 0 by subtracting the mean from the input data. Median: center to have median 0 by subtracting the median from the input data. Numeric scalar: shift center by the specified numeric value. |
| Scale | Scale type | Standard deviation: scale data by standard deviation. Median absolute deviation: scale data by median absolute deviation. First element: scale data by the first element of the data. Interquartile range: scale data by interquartile range. Numeric scalar: scale data by dividing by the specified numeric factor (positive scalar). |
| Center and scale | Center Type | Mean: center to have mean 0 by subtracting the mean from the input data. Median: center to have median 0 by subtracting the median from the input data. Numeric scalar: shift center by the specified numeric value. |
|  | Scale type | Standard deviation: scale data by standard deviation. Median absolute deviation: scale data by median absolute deviation. First element: scale data by the first element of the data. Interquartile range: scale data by interquartile range. Numeric scalar: scale data by dividing by the specified numeric factor (positive scalar). |
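
Several of these normalization methods have counterparts in the normalize function. The following minimal sketch, on an arbitrary illustrative vector, shows the Z-score, Range, and Norm methods. Treat it as an approximation of what each option computes rather than as the code that the app generates.

```matlab
% Arbitrary illustrative data.
A = [2; 4; 4; 4; 5; 5; 7; 9];

% Z-score with standard deviation: mean 0 and standard deviation 1.
z = normalize(A);

% Z-score with median absolute deviation: median 0 and MAD 1.
zRobust = normalize(A, "zscore", "robust");

% Range: rescale to the interval [0 1].
r = normalize(A, "range", [0 1]);

% Norm: scale by the infinity norm so the largest magnitude becomes 1.
n = normalize(A, "norm", Inf);
```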

Smooth Data

Select one of these method values to specify the smoothing method for noisy data.

| Method | Description |
| --- | --- |
| Moving mean | Use the moving average. This method is useful for reducing periodic trends in data. |
| Moving median | Use the moving median. This method is useful for reducing periodic trends in data when outliers are present. |
| Gaussian filter | Use the Gaussian-weighted moving average. |
| Local linear regression (Lowess) | Use linear regression. This method can be computationally expensive, but it results in fewer discontinuities. |
| Local quadratic regression (Loess) | Use quadratic regression. This method is slightly more computationally expensive than local linear regression. |
| Robust Lowess | Use robust linear regression. This method is a more computationally expensive version of local linear regression, but it is more robust to outliers. |
| Robust Loess | Use robust quadratic regression. This method is a more computationally expensive version of local quadratic regression, but it is more robust to outliers. |
| Savitzky-Golay polynomial filter | Use the Savitzky-Golay polynomial filter, which smooths according to a specified polynomial degree and is fitted over each window. This method can be more effective than other methods when the data varies rapidly. |

Select one of these parameter values and additional parameter options to specify the options for data smoothing.

| Parameter | Parameter Options | Description |
| --- | --- | --- |
| Smoothing factor | Smoothing factor | Specify the amount of smoothing (positive scalar). |
| Moving window | Moving window type | Center or asymmetrically align the moving window about the current element. |
|  | Window length | Specify the length of the moving window (positive scalar). |
|  | Right half window length (if moving window type is Asymmetric) | Specify the number of window units after the current element to define the window alignment (positive scalar). |
|  | Units | Specify the moving window unit type. |
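
The difference between the Moving mean and Moving median methods, and between centered and asymmetric windows, can be seen on a short vector with one spike. This is a minimal sketch using the smoothdata function with made-up data. The window lengths are illustrative choices.

```matlab
% Made-up data with a single spike at the fourth element.
x = [1 2 3 100 5 6 7];

% Centered 3-element windows.
meanSmoothed = smoothdata(x, "movmean", 3);      % spike is spread out
medianSmoothed = smoothdata(x, "movmedian", 3);  % spike is rejected

% Asymmetric window: the current element plus the 2 elements before it
% and 0 elements after it.
trailingMean = smoothdata(x, "movmean", [2 0]);
```

The moving mean still shows the spike, only lowered and widened, while the moving median removes it. This is why the median variant suits data with outliers.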
Retime Timetable

Select one of these method values and additional method parameters to specify the selection method for retimed row times.

| Method | Method Parameters | Description |
| --- | --- | --- |
| Time step | Time step | Specify the length of time between consecutive regularly spaced row times in the output table (positive scalar). |
|  | Time step units | Specify the time step units. |
| Sample rate | Sample rate | Specify the number of samples in the output table per unit of time (positive scalar). |
|  | Sample rate units | Specify the sample rate units. |

Select one of these method values to specify the retiming method.

| Method | Description |
| --- | --- |
| Fill with missing | Use the missing data indicators (for example, NaN for numeric variables). |
| Fill with constant | Use the specified constant value. The default value is 0. |
| Fill with previous value | Copy data from the nearest preceding neighbor in the input timetable, proceeding from the end of the vector of row times. If there are duplicate row times, then use the last of the duplicates. |
| Fill with next value | Copy data from the nearest following neighbor in the input timetable, proceeding from the beginning of the vector of row times. If there are duplicate row times, then use the first of the duplicates. |
| Fill with nearest value | Copy data from the nearest neighbor in the input timetable. |
| Linear interpolation | Use linear interpolation. |
| Spline interpolation | Use piecewise cubic spline interpolation. |
| Shape-preserving cubic interpolation (PCHIP) | Use shape-preserving piecewise cubic interpolation. |
| Modified Akima cubic interpolation | Use modified Akima cubic Hermite interpolation. |
| Sum | Use the sum of the values in each time bin. |
| Mean | Use the mean of the values in each time bin. |
| Product | Use the product of the values in each time bin. |
| Minimum | Use the minimum of the values in each time bin. |
| Maximum | Use the maximum of the values in each time bin. |
| Number of values | Use the number of values in each time bin. |
| First value in bin | Use the first value in each time bin. |
| Last value in bin | Use the last value in each time bin. |
| Custom | Use the function specified by the function handle. |
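
The retiming methods fall into two groups: fill and interpolation methods, which produce values at new row times, and aggregation methods, which combine the values in each time bin. The sketch below illustrates one of each with the retime function on a small made-up timetable. The 30-minute time step and the daily bins are illustrative assumptions.

```matlab
% Made-up hourly counts starting at midnight.
Time = datetime(2024,1,1) + hours(0:2)';
Count = [0; 10; 20];
tt = timetable(Time, Count);

% Time step selection with an interpolation method:
% regularly spaced row times every 30 minutes, linearly interpolated.
halfHourly = retime(tt, "regular", "linear", "TimeStep", minutes(30));

% Aggregation method: sum all values that fall in each daily bin.
dailySum = retime(tt, "daily", "sum");
```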
Stack Table Variables

Select one or more table variables to combine.

Unstack Table Variables

Select a table variable containing the names of the new table variables.

Select a table variable to unstack into multiple table variables.

Select one or more table variables to define groups of rows.

Select one of these values to specify the function to aggregate data values into a single value.

| Function | Description |
| --- | --- |
| Sum | Use sum of each group of values. |
| Mean | Use the mean of each group of values. |
| Median | Use the median of each group of values. |
| Mode | Use the mode of each group of values. |
| Maximum | Use the maximum of each group of values. |
| Minimum | Use the minimum of each group of values. |
| First | Use the first value of each group of values. |
| Unique | Use the number of unique values in each group of values. |
| Count | Use the number of values in each group of values. |
| Custom | Use the function specified by the function handle. |
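
Stacking and unstacking are inverse reshaping operations, which the stack and unstack functions perform at the command line. The following is a minimal sketch on a made-up table. The variable names Day, East, West, Direction, and Count are hypothetical and chosen only for illustration.

```matlab
% Made-up wide table: one row per day, one variable per direction.
T = table(["d1"; "d2"], [1; 2], [3; 4], ...
    'VariableNames', ["Day", "East", "West"]);

% Stack East and West into a single Count variable. The Direction
% variable records which original variable each value came from.
S = stack(T, ["East", "West"], ...
    "NewDataVariableName", "Count", "IndexVariableName", "Direction");

% Unstack Count back into one variable per Direction value.
% The remaining variable, Day, defines the groups of rows.
U = unstack(S, "Count", "Direction");

% When a group holds several values, an aggregation function reduces
% them to one value. Here every row is duplicated and then summed.
U2 = unstack([S; S], "Count", "Direction", "AggregationFunction", @sum);
```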

Tips

  • To interactively sort by a data variable, access the sorting options by clicking the arrow in the variable header in the Data tab. The sorting appears as a step in the Cleaning Steps panel.

  • To interactively rename a variable from the data, double-click the variable name in the Variables panel. The renaming appears as a step in the Cleaning Steps panel.

  • To interactively remove a variable from the data, right-click the variable name in the Variables panel and select Delete. The removal appears as a step in the Cleaning Steps panel.

  • To alter previously performed cleaning steps, perform one of these actions:

    • View or edit cleaning parameters by clicking a step in the Cleaning Steps panel.

    • Change the order in which cleaning steps are performed by dragging a step to a new location in the Cleaning Steps panel.

    • Disable cleaning steps by clearing a cleaning step or right-clicking a step and selecting Disable Steps Below in the Cleaning Steps panel.

  • To view only the input data or cleaned data, select or clear elements in the plot legends in the Visualization tab.

Version History

Introduced in R2022a
