Compute by Group

Summarize, transform, or filter by group in the Live Editor

Since R2021b

Description

The Compute by Group task lets you interactively group data and compute summary statistics, perform transformations, or apply filters for each group. The task automatically generates MATLAB® code for your live script.

Using this task, you can:

  • Define groups of data in an array, table, or timetable.

  • Summarize, transform, or filter the data based on each grouping.

  • Output a new table or timetable with the results of the computation.

Compute by Group task in the Live Editor

Open the Task

To add the Compute by Group task to a live script in the MATLAB Editor:

  • On the Live Editor tab, select Task > Compute by Group.

  • In a code block in the script, type a relevant keyword, such as group. Select Compute by Group from the suggested command completions.

Examples

Summarize data by interactively grouping the data, specifying variables to operate on, and computing statistics using the Compute by Group task in the Live Editor.

Create a timetable using the sample file outages.csv. The file contains six columns of data representing electric utility outages. Convert the Region and Cause column-oriented variables to categorical arrays and display the timetable.

outages = readtimetable("outages.csv");
outages.Region = categorical(outages.Region);
outages.Cause = categorical(outages.Cause)
outages=1468×5 timetable
         OutageTime          Region       Loss     Customers       RestorationTime            Cause     
    ____________________    _________    ______    __________    ____________________    _______________

    01-Feb-2002 12:18:00    SouthWest    458.98    1.8202e+06    07-Feb-2002 16:50:00    winter storm   
    23-Jan-2003 00:49:00    SouthEast    530.14    2.1204e+05                     NaT    winter storm   
    07-Feb-2003 21:15:00    SouthEast     289.4    1.4294e+05    17-Feb-2003 08:14:00    winter storm   
    06-Apr-2004 05:44:00    West         434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault
    16-Mar-2002 06:18:00    MidWest      186.44    2.1275e+05    18-Mar-2002 23:23:00    severe storm   
    18-Jun-2003 02:49:00    West              0             0    18-Jun-2003 10:54:00    attack         
    20-Jun-2004 14:39:00    West         231.29           NaN    20-Jun-2004 19:16:00    equipment fault
    06-Jun-2002 19:28:00    West         311.86           NaN    07-Jun-2002 00:51:00    equipment fault
    16-Jul-2003 16:23:00    NorthEast    239.93         49434    17-Jul-2003 01:12:00    fire           
    27-Sep-2004 11:09:00    MidWest      286.72         66104    27-Sep-2004 16:37:00    equipment fault
    05-Sep-2004 17:48:00    SouthEast    73.387         36073    05-Sep-2004 20:46:00    equipment fault
    21-May-2004 21:45:00    West         159.99           NaN    22-May-2004 04:23:00    equipment fault
    01-Sep-2002 18:22:00    SouthEast    95.917         36759    01-Sep-2002 19:12:00    severe storm   
    27-Sep-2003 07:32:00    SouthEast       NaN    3.5517e+05    04-Oct-2003 07:02:00    severe storm   
    12-Nov-2003 06:12:00    West         254.09    9.2429e+05    17-Nov-2003 02:04:00    winter storm   
    18-Sep-2004 05:54:00    NorthEast         0             0                     NaT    equipment fault
      ⋮

Open the Compute by Group task in the Live Editor. To group the data by the five regions where the outages occurred, select outages as the input data and group by unique values of the Region variable. Then, compute on the Loss and Customers variables by selecting All numeric variables in the Compute on field.

The Compute by Group task can perform three different types of computations for groups. To summarize the outage data, set the computation type to Compute stats by group. Then, to compute the mean and maximum values for the numeric variables Loss and Customers, use the Computations per group field to select the Mean and Maximum methods.

The resulting table contains the group observation count, mean power loss, maximum power loss, mean number of affected customers, and maximum number of affected customers for the outages in each region.

Live Task
outageStats=5×6 table
     Region      GroupCount    mean_Loss    max_Loss    mean_Customers    max_Customers
    _________    __________    _________    ________    ______________    _____________

    MidWest         142         1137.7        23141       2.4015e+05        3.972e+06  
    NorthEast       557         551.65        23418       1.4917e+05       5.9689e+06  
    SouthEast       389         495.35       8767.3       1.6776e+05       2.2249e+06  
    SouthWest        26         493.88         2796       2.6975e+05       1.8202e+06  
    West            354         433.37        16659       1.5201e+05         4.26e+06  
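
For reference, the same summary can be computed programmatically with groupsummary. The following is a minimal sketch of code similar to what the task might generate, not the task's verbatim output. Naming Loss and Customers explicitly, instead of selecting all numeric variables, is an assumption made for clarity.

```matlab
% Load the sample data and convert Region to categorical, as in the example.
outages = readtimetable("outages.csv");
outages.Region = categorical(outages.Region);

% Group by unique values of Region and compute the mean and maximum
% of Loss and Customers for each group.
outageStats = groupsummary(outages,"Region",["mean","max"],["Loss","Customers"]);
```

When a method is specified by name, groupsummary omits NaN values, which is why regions with missing Loss or Customers entries still have numeric means.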

Improve the interpretability or appearance of data by interactively grouping data, specifying variables to operate on, and applying a transformation operation using the Compute by Group task in the Live Editor.

Create a timetable using the sample file outages.csv. The file contains six columns of data representing electric utility outages. Convert the Region and Cause column-oriented variables to categorical arrays and display the timetable.

outages = readtimetable("outages.csv");
outages.Region = categorical(outages.Region);
outages.Cause = categorical(outages.Cause)
outages=1468×5 timetable
         OutageTime          Region       Loss     Customers       RestorationTime            Cause     
    ____________________    _________    ______    __________    ____________________    _______________

    01-Feb-2002 12:18:00    SouthWest    458.98    1.8202e+06    07-Feb-2002 16:50:00    winter storm   
    23-Jan-2003 00:49:00    SouthEast    530.14    2.1204e+05                     NaT    winter storm   
    07-Feb-2003 21:15:00    SouthEast     289.4    1.4294e+05    17-Feb-2003 08:14:00    winter storm   
    06-Apr-2004 05:44:00    West         434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault
    16-Mar-2002 06:18:00    MidWest      186.44    2.1275e+05    18-Mar-2002 23:23:00    severe storm   
    18-Jun-2003 02:49:00    West              0             0    18-Jun-2003 10:54:00    attack         
    20-Jun-2004 14:39:00    West         231.29           NaN    20-Jun-2004 19:16:00    equipment fault
    06-Jun-2002 19:28:00    West         311.86           NaN    07-Jun-2002 00:51:00    equipment fault
    16-Jul-2003 16:23:00    NorthEast    239.93         49434    17-Jul-2003 01:12:00    fire           
    27-Sep-2004 11:09:00    MidWest      286.72         66104    27-Sep-2004 16:37:00    equipment fault
    05-Sep-2004 17:48:00    SouthEast    73.387         36073    05-Sep-2004 20:46:00    equipment fault
    21-May-2004 21:45:00    West         159.99           NaN    22-May-2004 04:23:00    equipment fault
    01-Sep-2002 18:22:00    SouthEast    95.917         36759    01-Sep-2002 19:12:00    severe storm   
    27-Sep-2003 07:32:00    SouthEast       NaN    3.5517e+05    04-Oct-2003 07:02:00    severe storm   
    12-Nov-2003 06:12:00    West         254.09    9.2429e+05    17-Nov-2003 02:04:00    winter storm   
    18-Sep-2004 05:54:00    NorthEast         0             0                     NaT    equipment fault
      ⋮

Open the Compute by Group task in the Live Editor. To group the data by the ten causes of the outages, select outages as the input data and group by unique values of the Cause variable. Then, set Compute on to the Loss variable.

The Compute by Group task can perform three different types of computations for groups. To transform the outage data, set the computation type to Transform by group. Then, to fill missing power loss values, set Computation per group to the Fill missing with group mean method.

The resulting timetable contains the outage data with each missing power loss value replaced by the mean power loss for outages with the same cause.

Live Task
outageTransform=1468×5 timetable
         OutageTime          Region       Loss     Customers       RestorationTime            Cause     
    ____________________    _________    ______    __________    ____________________    _______________

    01-Feb-2002 12:18:00    SouthWest    458.98    1.8202e+06    07-Feb-2002 16:50:00    winter storm   
    23-Jan-2003 00:49:00    SouthEast    530.14    2.1204e+05                     NaT    winter storm   
    07-Feb-2003 21:15:00    SouthEast     289.4    1.4294e+05    17-Feb-2003 08:14:00    winter storm   
    06-Apr-2004 05:44:00    West         434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault
    16-Mar-2002 06:18:00    MidWest      186.44    2.1275e+05    18-Mar-2002 23:23:00    severe storm   
    18-Jun-2003 02:49:00    West              0             0    18-Jun-2003 10:54:00    attack         
    20-Jun-2004 14:39:00    West         231.29           NaN    20-Jun-2004 19:16:00    equipment fault
    06-Jun-2002 19:28:00    West         311.86           NaN    07-Jun-2002 00:51:00    equipment fault
    16-Jul-2003 16:23:00    NorthEast    239.93         49434    17-Jul-2003 01:12:00    fire           
    27-Sep-2004 11:09:00    MidWest      286.72         66104    27-Sep-2004 16:37:00    equipment fault
    05-Sep-2004 17:48:00    SouthEast    73.387         36073    05-Sep-2004 20:46:00    equipment fault
    21-May-2004 21:45:00    West         159.99           NaN    22-May-2004 04:23:00    equipment fault
    01-Sep-2002 18:22:00    SouthEast    95.917         36759    01-Sep-2002 19:12:00    severe storm   
    27-Sep-2003 07:32:00    SouthEast    697.41    3.5517e+05    04-Oct-2003 07:02:00    severe storm   
    12-Nov-2003 06:12:00    West         254.09    9.2429e+05    17-Nov-2003 02:04:00    winter storm   
    18-Sep-2004 05:54:00    NorthEast         0             0                     NaT    equipment fault
      ⋮
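
The transformation in this example corresponds roughly to a grouptransform call with the "meanfill" method. This is a sketch under that assumption rather than the exact code the task generates.

```matlab
% Load the sample data and convert Cause to categorical, as in the example.
outages = readtimetable("outages.csv");
outages.Cause = categorical(outages.Cause);

% Group by unique values of Cause and replace each missing Loss value
% with the mean Loss of its group.
outageTransform = grouptransform(outages,"Cause","meanfill","Loss");
```

The output has the same number of rows as the input, with only the Loss variable changed.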

Focus on specific information in a data set by interactively grouping data, specifying variables to operate on, and applying a group filter with Compute by Group.

Create a timetable using the sample file outages.csv. The file contains six columns of data representing electric utility outages. Convert the Region and Cause column-oriented variables to categorical arrays and display the timetable.

outages = readtimetable("outages.csv");
outages.Region = categorical(outages.Region);
outages.Cause = categorical(outages.Cause)
outages=1468×5 timetable
         OutageTime          Region       Loss     Customers       RestorationTime            Cause     
    ____________________    _________    ______    __________    ____________________    _______________

    01-Feb-2002 12:18:00    SouthWest    458.98    1.8202e+06    07-Feb-2002 16:50:00    winter storm   
    23-Jan-2003 00:49:00    SouthEast    530.14    2.1204e+05                     NaT    winter storm   
    07-Feb-2003 21:15:00    SouthEast     289.4    1.4294e+05    17-Feb-2003 08:14:00    winter storm   
    06-Apr-2004 05:44:00    West         434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault
    16-Mar-2002 06:18:00    MidWest      186.44    2.1275e+05    18-Mar-2002 23:23:00    severe storm   
    18-Jun-2003 02:49:00    West              0             0    18-Jun-2003 10:54:00    attack         
    20-Jun-2004 14:39:00    West         231.29           NaN    20-Jun-2004 19:16:00    equipment fault
    06-Jun-2002 19:28:00    West         311.86           NaN    07-Jun-2002 00:51:00    equipment fault
    16-Jul-2003 16:23:00    NorthEast    239.93         49434    17-Jul-2003 01:12:00    fire           
    27-Sep-2004 11:09:00    MidWest      286.72         66104    27-Sep-2004 16:37:00    equipment fault
    05-Sep-2004 17:48:00    SouthEast    73.387         36073    05-Sep-2004 20:46:00    equipment fault
    21-May-2004 21:45:00    West         159.99           NaN    22-May-2004 04:23:00    equipment fault
    01-Sep-2002 18:22:00    SouthEast    95.917         36759    01-Sep-2002 19:12:00    severe storm   
    27-Sep-2003 07:32:00    SouthEast       NaN    3.5517e+05    04-Oct-2003 07:02:00    severe storm   
    12-Nov-2003 06:12:00    West         254.09    9.2429e+05    17-Nov-2003 02:04:00    winter storm   
    18-Sep-2004 05:54:00    NorthEast         0             0                     NaT    equipment fault
      ⋮

Open the Compute by Group task in the Live Editor. To group the data by the year and region in which the outages occurred, use Group by to bin the OutageTime variable by year and group the Region variable by unique values. Then, compute on the power loss by selecting the Loss variable in the Compute on field.

The Compute by Group task can perform three different types of computations for groups. To filter the outage data, set the computation type to Filter by group. Then, set Computation per group to a new local function, and customize the filter by writing a function that returns true for the outlier data to keep and false for the non-outlier data to filter out.

The resulting timetable contains only outlier outage data. Because the filter function calls isoutlier with its default method, an outage is kept when its power loss is more than three scaled median absolute deviations (MAD) from the median loss for its year and region.

Live Task
outageFilter=159×6 timetable
         OutageTime          Region       Loss     Customers       RestorationTime            Cause         year_OutageTime
    ____________________    _________    ______    __________    ____________________    _______________    _______________

    06-Apr-2004 05:44:00    West         434.81    3.4037e+05    06-Apr-2004 06:10:00    equipment fault         2004      
    06-Jun-2002 19:28:00    West         311.86           NaN    07-Jun-2002 00:51:00    equipment fault         2002      
    08-Mar-2005 16:37:00    SouthEast    1339.2    4.3003e+05    10-Mar-2005 20:42:00    winter storm            2005      
    02-Jul-2004 09:16:00    MidWest       15128    2.0104e+05    06-Jul-2004 14:11:00    thunder storm           2004      
    20-Apr-2002 16:46:00    MidWest       23141           NaN                     NaT    unknown                 2002      
    10-Dec-2002 10:45:00    MidWest       14493    3.0879e+06    11-Dec-2002 18:06:00    unknown                 2002      
    18-May-2002 11:04:00    MidWest      1389.1    1.3447e+05    21-May-2002 01:22:00    unknown                 2002      
    22-Sep-2003 00:53:00    MidWest      3995.8    6.7808e+05    23-Sep-2003 03:45:00    unknown                 2003      
    05-Nov-2005 12:46:00    NorthEast    2966.1           NaN    06-Nov-2005 21:40:00    unknown                 2005      
    17-Aug-2002 09:05:00    NorthEast     21673           NaN    19-Aug-2002 21:45:00    unknown                 2002      
    16-Sep-2004 19:42:00    NorthEast      4718           NaN                     NaT    unknown                 2004      
    20-May-2002 10:57:00    NorthEast    9116.6    2.4983e+06    21-May-2002 15:22:00    unknown                 2002      
    05-Sep-2003 20:15:00    SouthEast    1700.1    1.6393e+05    10-Sep-2003 19:59:00    thunder storm           2003      
    20-Sep-2004 12:37:00    SouthEast    8767.3    2.2249e+06    02-Oct-2004 06:00:00    severe storm            2004      
    14-Sep-2005 15:45:00    SouthEast    1839.2    3.4144e+05                     NaT    severe storm            2005      
    14-Sep-2003 16:09:00    NorthEast    2011.3    6.9368e+05    24-Sep-2003 07:44:00    severe storm            2003      
      ⋮

function tf = myFilterFcn(x)
% x is the data in a group from one computation variable
% tf is true, false, or a logical column vector with the same height as x
tf = isoutlier(x);
end
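
The task settings in this example correspond roughly to a groupfilter call that bins OutageTime by year and groups Region by unique values. The sketch below is an approximation, not the task's generated code. The anonymous function filterFcn is a stand-in for the local function myFilterFcn so that the snippet runs on its own.

```matlab
% Load the sample data and convert Region to categorical, as in the example.
outages = readtimetable("outages.csv");
outages.Region = categorical(outages.Region);

% Stand-in for myFilterFcn: true marks the rows to keep in each group.
filterFcn = @(x) isoutlier(x);

% Bin OutageTime by year ("year"), group Region by unique values ("none"),
% and apply the filter to the Loss values of each year-region group.
outageFilter = groupfilter(outages,["OutageTime","Region"],{"year","none"},filterFcn,"Loss");
```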

Parameters

Specify groups by selecting valid workspace grouping variables from the Group by drop-down list. When the data is contained in a table or timetable, additionally select the table variables to group by. You can group by unique values or specify how to bin the data.

From the Compute on drop-down list, select the workspace data to compute on. When the data is contained in a table or timetable, select the table variables to compute on.
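
When the data and grouping variables are workspace arrays rather than table variables, the equivalent programmatic call passes the arrays directly. The following sketch uses small made-up vectors (x, g, and t are hypothetical names) to show grouping by unique values and grouping by bin edges with groupsummary.

```matlab
x = [1;2;3;4];     % data to compute on
g = [1;1;2;2];     % grouping vector, grouped by unique values

% B holds the group sums, BG the group identifiers, BC the group counts.
[B,BG,BC] = groupsummary(x,g,"sum");

t = [1;4;6;9];     % grouping vector to bin

% Bin t with the edges 0, 5, and 10, then sum x within each bin.
Bbinned = groupsummary(x,t,[0 5 10],"sum");
```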

Select one of these computation options:

  • Compute stats by group: Compute a summary (or aggregate) of the data, such as a mean or maximum. You can also supply a custom function by providing a local function name or a function handle. The function must return one entity per group whose first dimension has length 1. For more information, see groupsummary.

  • Transform by group: Transform the data, for example, scale the data by the 2-norm or fill missing data. You can also supply a custom function by providing a local function name or a function handle. The function must return one entity whose first dimension has length 1 or has the same number of rows as the input data. For more information, see grouptransform.

  • Filter by group: Filter members from each group by providing a local function or function handle that defines the filtering computation. The function must return a logical scalar or a logical column vector with the same number of rows as the data, indicating which group members to select. If the function returns a logical scalar, then either all members of the group are filtered out (when the value is false) or none are (when the value is true). If the function returns a logical vector, then a member is filtered out when the corresponding element is false and kept when the corresponding element is true. For more information, see groupfilter.

For all computation types, you can click New to create a new function in the live script that defines the computation. Clicking New automatically inserts an example function into the live script that uses the appropriate syntax for the selected computation type. If you rename the example function, reselect the method from the drop-down list in the live task so that the task uses the new function name.
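
The return-value requirements for custom functions differ by computation type. As an illustration, the following sketch applies one anonymous function of each kind to a small made-up table (T, g, and x are hypothetical names) using groupsummary, grouptransform, and groupfilter.

```matlab
% Small hypothetical table: grouping variable g and data variable x.
T = table(["a";"a";"a";"b";"b"],[1;2;6;10;14],'VariableNames',{'g','x'});

% Compute stats by group: the function returns one value per group.
% Here, the range of x is 5 for group a and 4 for group b.
S = groupsummary(T,"g",@(x) max(x) - min(x),"x");

% Transform by group: the function returns one value per input row.
% Here, the group mean (3 for a, 12 for b) is subtracted from each value.
X = grouptransform(T,"g",@(x) x - mean(x),"x");

% Filter by group: a logical scalar keeps or removes an entire group.
% Here, only groups with at least three members are kept (group a).
F = groupfilter(T,"g",@(x) numel(x) >= 3,"x");
```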

Version History

Introduced in R2021b
