Product:TIBCO Spotfire Server
Versions:All
Summary:
For scheduled updates with a short reloadFrequency, a situation may arise where a job runs into an issue and the next job is triggered before all of the failed job's retries are complete. If that job also fails, these jobs stay In_Progress forever and no new jobs are triggered for that scheduled update.
Details:
If a failing scheduled update has a reloadFrequency shorter than the combined time it takes for the job to go through all of the retries defined in the configuration, the scheduled update does not recover from the failure. Even if the initial issue is resolved, only one of the first In_Progress jobs runs successfully, and no new jobs are triggered.
- Disabling/enabling the scheduled update does not help.
- Reloading the scheduled update manually does not help.
- Deleting the scheduled update rule and creating a new rule for the same file does not help; the new rule still does not run.
A message similar to the following is logged:
Spotfire.Dxp.Web.Library.ScheduledUpdates;"The update event for *** with JobDefinitionId=***, JobInstanceId=*** will not be performed due to an update is already in progress."
Example: The rule is set to reload every 1 minute. TSS is set to retry 3 times before failure, and also to retry at each reload (the snippet below is from configuration.xml):
<stop-updates-after-repeated-fail>
    <enabled>true</enabled>
    <fails-before-stop>3</fails-before-stop>
    <stop-only-when-cached>false</stop-only-when-cached>
    <always-retry-when-scheduled>true</always-retry-when-scheduled>
</stop-updates-after-repeated-fail>
So when a job fails the initial load, it retries 3 times, which takes more than a minute. In the meantime, new jobs keep being added to the queue, and so on. TSS/WP might not be able to keep track of all these jobs.
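A minimal sketch of the timing, assuming each retry is spaced roughly one update interval (60 seconds) apart; this spacing is an assumption for illustration, not a documented guarantee:

# Illustrative only: rough timing model of one failing job versus the
# 1-minute reload schedule (values taken from the example above).
update_interval_seconds = 60   # <update-interval-seconds>
fails_before_stop = 3          # <fails-before-stop>
reload_every_seconds = 60      # schedule set to reload every 1 minute

# Time one failing job is busy: the initial attempt plus its retries.
retry_window_seconds = (1 + fails_before_stop) * update_interval_seconds

# Reloads that fire while that job is still retrying.
overlapping_reloads = list(range(reload_every_seconds, retry_window_seconds, reload_every_seconds))
print(retry_window_seconds)   # 240
print(overlapping_reloads)    # [60, 120, 180] -> extra jobs queued behind the failing one

If any of these queued jobs also fails, the scheduled update can end up in the stuck In_Progress state described above.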
Resolution:
Set the reload interval of the schedule to a value higher than the combined time it takes for the job to perform all of the retries it is configured to do; it should then work as expected.
So if (in configuration.xml):
<fails-before-stop>3</fails-before-stop> is set to 3 and
<update-interval-seconds>60</update-interval-seconds> is set to 60 (seconds),
then the reload interval of the schedule should be at least 4 minutes (preferably a couple of minutes more, to give some margin), and it should then work as expected.
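A minimal sketch of the arithmetic behind the 4-minute lower bound, assuming one initial attempt followed by the configured retries, each roughly one update interval apart (again an assumption used only for illustration):

fails_before_stop = 3          # <fails-before-stop>3</fails-before-stop>
update_interval_seconds = 60   # <update-interval-seconds>60</update-interval-seconds>
margin_seconds = 120           # "a couple of minutes" of extra margin

min_reload_seconds = (1 + fails_before_stop) * update_interval_seconds
print(min_reload_seconds)                    # 240 -> at least 4 minutes
print(min_reload_seconds + margin_seconds)   # 360 -> a safer 6-minute reload interval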
If the schedule is not configured according to the above and the issue described in this article occurs, the only solution currently available is to restart the Web Player (WP) instance. When the WP instance is restarted, TSS sets all running jobs to "Failed", so the jobs start loading again once the instance is back up.