Troubleshoot Common Problems
This section offers advice on solving problems you might encounter with MATLAB® Parallel Server™ software.
License Errors
When starting a MATLAB worker, a licensing problem might result in the message
License checkout failed. No such FEATURE exists. License Manager Error -5
There are many reasons why you might receive this error:
This message usually indicates that you are trying to use a product for which you are not licensed. Look at your license.dat file located within your MATLAB installation to see if you are licensed to use this product.
If you are licensed for this product, this error may be the result of having extra carriage returns or tabs in your license file. To avoid this, ensure that each line begins with either #, SERVER, DAEMON, or INCREMENT. After fixing your license.dat file, restart the network license manager and MATLAB should work properly.
This error may also be the result of an incorrect system date. If your system date is before the date that your license was made, you will get this error.
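The line-prefix rule above can be checked mechanically. The following is a minimal sketch, not a MathWorks tool: the function name find_suspect_lines and the sample text are hypothetical, and skipping blank lines is an assumption that the text above does not state. It applies the rule literally, so treat every reported line as a candidate for manual review rather than a confirmed error.

```python
# Prefixes that the text above says every license file line should begin with.
ALLOWED_PREFIXES = ("#", "SERVER", "DAEMON", "INCREMENT")


def find_suspect_lines(license_text):
    """Return (line_number, line) pairs that do not start with an allowed prefix."""
    suspects = []
    for number, line in enumerate(license_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are not reported
        if not line.startswith(ALLOWED_PREFIXES):
            suspects.append((number, line))
    return suspects


# Made-up sample content; it is not a real license file.
sample = (
    "# placeholder comment\n"
    "SERVER placeholder\n"
    "DAEMON placeholder\n"
    "\tINCREMENT placeholder\n"  # leading tab breaks the rule
)

for number, line in find_suspect_lines(sample):
    print(f"line {number}: {line!r}")
```

To check a real file, read license.dat into a string and pass it to the function.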
If you receive this error when starting a worker with MATLAB Parallel Server software:
You may be calling the startworker command from an installation that does not have access to a worker license. For example, starting a worker from a client installation of the Parallel Computing Toolbox™ product causes the following error:
The mjs service on the host hostname returned the following error:
Problem starting the MATLAB worker.
The cause of this problem is:
==============================================================
Most likely, the MATLAB worker failed to start due to a licensing problem, or MATLAB crashed during startup. Check the worker log file
/tmp/mjs_user/node_node_worker_05-11-01_16-52-03_953.log
for more detailed information. The mjs log file
/tmp/mjs_user/mjs-service.log
may also contain some additional information.
===============================================================
In the worker log files, you see the following information:
License checkout failed.
License Manager Error -15
MATLAB is unable to connect to the license server.
Check that the license manager has been started, and that the MATLAB client machine can communicate with the license server.
Troubleshoot this issue by visiting:
/support/lme/R2009a/15
Diagnostic Information:
Feature: MATLAB_Distrib_Comp_Engine
License path: /apps/matlab/etc/license.dat
FLEXnet Licensing error: -15,570. System Error: 115
If you installed only the Parallel Computing Toolbox product, and you are attempting to run a worker on the same machine, you will receive this error because the MATLAB Parallel Server product is not installed, and therefore the worker cannot obtain a license.
Memory Errors on UNIX Operating Systems
If the number of processes created by the server services on a machine running a Linux® operating system exceeds the operating system limits, the services fail and generate an out-of-memory error. It is recommended that you adjust your system limits. For more information, see Recommended System Limits for Macintosh and Linux (Parallel Computing Toolbox).
Run Server Processes on Windows Network Installation
Many networks are configured not to allow LocalSystem to have access to UNC or mapped network shares. In this case, run the mjs process under a different user with rights to log on as a service. See Set the User.
Required Ports
With Job Manager
BASE_PORT. The mjs_def file specifies and describes the ports required by the job manager and all workers. See the following file in the MATLAB installation used for each cluster process:
matlabroot/toolbox/parallel/bin/mjs_def.sh (on UNIX® operating systems)
matlabroot\toolbox\parallel\bin\mjs_def.bat (on Windows® operating systems)
Communicating Jobs. On worker machines running a UNIX operating system, the ports that MPICH requires to run communicating jobs range from BASE_PORT + 1000 to BASE_PORT + 2000.
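As a worked example of the arithmetic, the sketch below computes the port bounds for a given BASE_PORT. The function name communicating_job_ports is hypothetical, and the value 27350 is used only for illustration; read the actual BASE_PORT from the mjs_def file of your installation.

```python
def communicating_job_ports(base_port):
    """Return the inclusive (low, high) port bounds for communicating jobs."""
    return base_port + 1000, base_port + 2000


# 27350 is an example value only; use the BASE_PORT from your mjs_def file.
low, high = communicating_job_ports(27350)
print(f"Communicating jobs use ports {low} to {high}")
```

With this example value, the range is 28350 to 29350, which is the range a firewall on the worker machines would need to allow.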
With Third-Party Scheduler
Communication Between Workers. Before the worker processes start, you can control the range of ports used by the workers for communicating jobs by defining the environment variable MPICH_PORT_RANGE with the value minport:maxport.
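The sketch below shows one way a launcher script might prepare this variable before starting worker processes. It is an illustration under assumptions: the function name mpich_port_range_env is hypothetical, the port numbers are examples, and how your scheduler passes the environment to workers depends on your setup.

```python
import os


def mpich_port_range_env(min_port, max_port, base_env=None):
    """Return a copy of the environment with MPICH_PORT_RANGE set to min:max."""
    if not 0 < min_port <= max_port <= 65535:
        raise ValueError("expected 0 < min_port <= max_port <= 65535")
    # Copy so that the caller's environment is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["MPICH_PORT_RANGE"] = f"{min_port}:{max_port}"
    return env


# Example ports only; an empty base environment keeps the output short.
env = mpich_port_range_env(30000, 31000, base_env={})
print(env["MPICH_PORT_RANGE"])
```

The returned dictionary can be passed as the env argument of subprocess.run when a script starts the worker processes itself.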
Open Ports on Workers for Inbound Communication from Client. You can control the listening port range workers open to connect to clients for interactive parallel pool jobs.
Use the pctconfig (Parallel Computing Toolbox) function to specify which listening ports workers must open, or
Define the environment variable PARALLEL_SERVER_OVERRIDE_PORT_RANGE with the value "minport maxport". This will override the port range specified with pctconfig.
For Microsoft® HPC Pack, set PARALLEL_SERVER_OVERRIDE_PORT_RANGE in the job template with an addition to the Environments field. For example, to open a listening port in the range 30000 to 31000, add this code to the job template.
PARALLEL_SERVER_OVERRIDE_PORT_RANGE=30000 31000;
For other third-party schedulers, set PARALLEL_SERVER_OVERRIDE_PORT_RANGE in the communicatingJobWrapper.sh script. For example, to open a listening port in the range 29000 to 31000, add this code to the communicatingJobWrapper.sh script.
export PARALLEL_SERVER_OVERRIDE_PORT_RANGE="29000 31000"
To learn more about the communicatingJobWrapper.sh script, see Wrapper Scripts (Parallel Computing Toolbox).
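Note that this variable uses a space between the two ports, unlike the colon used by MPICH_PORT_RANGE. The following sketch generates both forms shown above from a pair of port numbers. The function names are hypothetical helpers, not part of any MathWorks or scheduler API.

```python
def override_port_range_value(min_port, max_port):
    """Return the space-separated "minport maxport" value."""
    if not 0 < min_port <= max_port <= 65535:
        raise ValueError("expected 0 < min_port <= max_port <= 65535")
    return f"{min_port} {max_port}"


def wrapper_script_line(min_port, max_port):
    """Return the export line for communicatingJobWrapper.sh."""
    value = override_port_range_value(min_port, max_port)
    return f'export PARALLEL_SERVER_OVERRIDE_PORT_RANGE="{value}"'


def hpc_pack_environment_entry(min_port, max_port):
    """Return the entry for the Environments field of an HPC Pack job template."""
    value = override_port_range_value(min_port, max_port)
    return f"PARALLEL_SERVER_OVERRIDE_PORT_RANGE={value};"


print(wrapper_script_line(29000, 31000))
print(hpc_pack_environment_entry(30000, 31000))
```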
Client Ports
With the pctconfig (Parallel Computing Toolbox) function, you specify the ports used by the client. If the default ports cannot be used, this function allows you to configure ports separately for communication with the job scheduler and communication with a parallel pool.
Ephemeral TCP Ports with Job Manager
If you use the job manager on a cluster of nodes running Windows operating systems, you must make sure that a large number of ephemeral TCP ports are available on the job manager machine. By default, the maximum valid ephemeral TCP port number on a Windows operating system is 5000, but transfers of large data sets might fail if this setting is not increased. In particular, if your cluster has 32 or more workers, you should increase the maximum valid ephemeral TCP port number using the following procedure:
Start the Registry Editor.
Locate the following subkey in the registry, and click Parameters:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
On the Registry Editor window, select Edit > New > DWORD Value.
In the list of entries on the right, change the new value name to MaxUserPort and press Enter.
Right-click on the MaxUserPort entry name and select Modify.
In the Edit DWORD Value dialog, enter 65534 in the Value data field. Under Base select Decimal. Click OK.
This parameter controls the maximum port number that is used when a program requests any available user port from the system. Typically, ephemeral (short-lived) ports are allocated between the values of 1024 and 5000 inclusive. This action allows allocation for port numbers up to 65534.
Quit the Registry Editor.
Reboot your machine.
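If you need to apply the same setting to several machines, the value can also be written as a registration (.reg) file. This is a sketch of an alternative to the manual steps above, not a procedure described in this section: the function name max_user_port_reg_text is hypothetical, and importing the generated file is at your own discretion. A .reg file stores DWORD values in hexadecimal, so decimal 65534 appears as 0000fffe.

```python
# Registry subkey named in the procedure above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def max_user_port_reg_text(max_port=65534):
    """Return the text of a .reg file that sets MaxUserPort to max_port."""
    if not 5000 <= max_port <= 65534:
        raise ValueError("expected a value from 5000 to 65534")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY}]\n"
        f'"MaxUserPort"=dword:{max_port:08x}\n'  # DWORD written as 8 hex digits
    )


print(max_user_port_reg_text())
```

As with the manual procedure, the machine must be rebooted for the change to take effect.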
Host Communications Problems
If a worker is not able to make a connection with its MATLAB Job Scheduler, or if a client session cannot validate a profile that uses that scheduler, this might indicate communications problems between nodes.
With Command-Line Interface
First, be sure that the machines in question agree on their IP resolutions. The IP address for a particular host should be the same for itself as it is from the perspective of another host. For example, if a process on hostB cannot connect to one on hostA, find out the hostA IP address for itself, then see what the IP address for hostA is from hostB. They should be the same.
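One way to make this comparison is to run the same small script on both hosts and compare what each reports. The sketch below is an illustration, not a MathWorks utility: the function names are hypothetical, it checks IPv4 addresses only, and the addresses in the example are placeholders from a documentation-only range.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that this machine resolves for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry ends with a (address, port) pair; keep unique addresses.
    return sorted({info[4][0] for info in infos})


def views_agree(addresses_seen_locally, addresses_seen_remotely):
    """Return True when both hosts resolved the same set of addresses."""
    return set(addresses_seen_locally) == set(addresses_seen_remotely)


# Placeholder results, as if collected by running resolve_ipv4("hostA")
# once on hostA and once on hostB.
seen_on_host_a = ["192.0.2.10"]
seen_on_host_b = ["192.0.2.10"]
print(views_agree(seen_on_host_a, seen_on_host_b))
```

If the two lists differ, correct the name resolution configuration before you investigate the MATLAB Parallel Server processes themselves.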
If the machines can identify each other, the nodestatus command can be useful for diagnosing problems between their processes. Use the command to determine what MATLAB Parallel Server processes are running on the local host, and which are accessible from remote hosts. If a worker on hostA cannot register with its job manager on hostB, run nodestatus on both hosts to see what each can see on hostB.
On hostB, execute:
nodestatus -remotehost hostB
Then on hostA, run exactly the same command:
nodestatus -remotehost hostB
The results should be the same, showing the same listing of job managers and workers.
If the output indicates problems, run the command again with a higher information level to receive more detailed information:
nodestatus -remotehost hostB -infolevel 3
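If you run these checks often, a small wrapper can build the command line for you. The following is a minimal sketch that uses only the options shown above. The function names are hypothetical, and run_nodestatus assumes that the nodestatus command is on the PATH of the machine where the script runs.

```python
import subprocess


def nodestatus_command(remote_host, info_level=None):
    """Build the nodestatus argument list for the given remote host."""
    command = ["nodestatus", "-remotehost", remote_host]
    if info_level is not None:
        command += ["-infolevel", str(info_level)]
    return command


def run_nodestatus(remote_host, info_level=None):
    """Run nodestatus and return its standard output (assumes it is on PATH)."""
    result = subprocess.run(
        nodestatus_command(remote_host, info_level),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Only print the command here; call run_nodestatus on a cluster host.
print(" ".join(nodestatus_command("hostB", info_level=3)))
```

Run the script on each host and compare the two outputs by eye or with a diff tool.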
With Admin Center GUI
You can diagnose some communications problems using Admin Center.
If you cannot successfully add hosts to the listing by specifying host name, you can use their IP addresses instead (see Add Hosts). If you suspect any communications problems, in the Admin Center GUI click Test Connectivity (see Test Connectivity). This testing verifies that the nodes can identify each other and allow their processes to communicate with each other.
Verify Network Communications for Cluster Discovery
If you want to use the discover cluster capabilities in Parallel Computing Toolbox, your network must be configured to use DNS SRV or DNS TXT records.
DNS SRV Record
When you use DNS for MATLAB Job Scheduler cluster discovery, you require a DNS SRV record for each domain. You can have multiple DNS SRV records for multiple MATLAB Job Schedulers. Use the following general form for each DNS SRV record.
_mdcs._tcp.<domain> <TTL> IN SRV <priority> <weight> <port> <hostname>.
Construct a DNS SRV record for a MATLAB Job Scheduler server using the following parts.
<domain> is the domain name (like company.com or university.edu) that the client machine searches.
<TTL> indicates how long (in seconds) the DNS record can be cached. 3600 is recommended.
IN SRV is required as shown, indicating that this is a service record.
<priority> and <weight> indicate priority and weight values. If you create multiple DNS SRV records, you can specify their priority with these fields. A value of 0 is recommended for each. The lower <priority> is, the higher priority the host has. When two records have the same <priority>, the record with the highest <weight> is used first. Use the <weight> value to specify server preference.
<port> is the port on which you connect to the MATLAB Job Scheduler server. The default port is 27350. If you change the port for the MATLAB Job Scheduler server, change <port> accordingly.
<hostname> is the fully qualified domain name for the host serving the MATLAB Job Scheduler. The machine mjs-1 on the domain company.com has a fully qualified domain name mjs-1.company.com.
A valid DNS SRV record for the company.com network running a MATLAB Job Scheduler on machine mjs-1 might look like this:
_mdcs._tcp.company.com 3600 IN SRV 0 0 27350 mjs-1.company.com.
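To show how the parts fit together, the sketch below assembles a record line from its fields, using the recommended values from the list above as defaults. The function name mjs_srv_record is hypothetical; the resulting text still has to be entered through the standard procedure for your DNS system.

```python
def mjs_srv_record(domain, hostname, port=27350, ttl=3600, priority=0, weight=0):
    """Return a DNS SRV record line for a MATLAB Job Scheduler host."""
    # The general form ends the fully qualified host name with a trailing dot.
    fqdn = hostname if hostname.endswith(".") else hostname + "."
    return f"_mdcs._tcp.{domain} {ttl} IN SRV {priority} {weight} {port} {fqdn}"


print(mjs_srv_record("company.com", "mjs-1.company.com"))
```

With these arguments, the function reproduces the example record for mjs-1 on company.com.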
Note
If multiple domains are required to locate the cluster, use a DNS SRV record for each domain. If the network accessed by users via VPN has different DNS SRV records to your internal network, ensure that a DNS SRV record exists for each domain.
Use the standard procedure for your DNS system to create appropriate DNS SRV records. You can use standard utilities such as the nslookup command to verify that your network is configured with the necessary DNS SRV records. To examine MATLAB Job Scheduler DNS SRV records for the company.com domain, use the following command.
nslookup -type=SRV _mdcs._tcp.company.com
DNS TXT Record
Use DNS TXT records for third-party scheduler cluster discovery. A DNS TXT record associates a text string with a particular domain. To let MATLAB know where to find cluster discovery configuration files, store the locations of cluster discovery configuration files as text strings in DNS TXT records.
You can have multiple DNS TXT records for multiple clusters. Use this general form for each DNS TXT record.
_mdcs._tcp.<domain> IN TXT "discover_folder=<folder>"
Construct a DNS TXT record to discover a third-party scheduler using these parts.
<domain> is the domain name (like company.com or university.edu) that the client machine searches.
IN TXT is required as shown, indicating that this is a text record.
"discover_folder=<folder>" where <folder> is the location of your cluster discovery configuration files.
A valid DNS TXT record for the company.com network running a Slurm scheduler cluster with a cluster discovery configuration file stored in /network/share/discovery might look like this:
_mdcs._tcp.company.com IN TXT "discover_folder=/network/share/discovery"
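The same record can be assembled from its two variable parts. This is a minimal sketch with a hypothetical function name, discovery_txt_record. Rejecting folders that contain a double quote is an assumption made to keep the quoted string well formed, not a documented rule.

```python
def discovery_txt_record(domain, folder):
    """Return a DNS TXT record line that points to a cluster discovery folder."""
    if '"' in folder:
        # Assumption: a double quote would break the quoted text string.
        raise ValueError("folder must not contain a double quote")
    return f'_mdcs._tcp.{domain} IN TXT "discover_folder={folder}"'


print(discovery_txt_record("company.com", "/network/share/discovery"))
```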
Note
If multiple domains are required to locate the cluster, use a DNS TXT record for each domain. If the network accessed by users via VPN has different DNS TXT records to your internal network, ensure that a DNS TXT record exists for each domain.
Use the standard procedure for your DNS system to create appropriate DNS TXT records. You can use standard utilities such as the nslookup command to verify that your network is configured with the necessary DNS TXT records. To examine DNS TXT records for the company.com domain, use the following command.
nslookup -type=TXT _mdcs._tcp.company.com