Monitoring & Troubleshooting

Learn how to monitor your production lines, investigate issues, and troubleshoot common problems.

Overview

Effective monitoring helps your data pipelines run reliably and lets you catch issues early. In this guide, you'll learn about:

  • Monitoring production line execution
  • Investigating failed tasks
  • Common error patterns and solutions
  • Using Pulse Health for operational insights

Prerequisites

  • Production lines that have been executed
  • An understanding of your pipeline's expected behaviour

Monitoring your pipelines

Execution status

Each production line run has a status:

Status    | Meaning                  | Action
Running   | Currently executing      | Monitor progress
Succeeded | Completed successfully   | No action needed
Failed    | One or more tasks failed | Investigate and resolve
Cancelled | Manually stopped         | Review why it was cancelled

For the complete list of task run statuses, including Inactive, Skipped, Blocked, and Stale, see Task Run Statuses.

Viewing run history

  1. Navigate to Operate > Pulse Health
  2. Select your production line
  3. View the run history showing:
    • Start and end times
    • Duration
    • Status
    • Task-level details
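As an illustration of what these run history fields let you derive, the sketch below computes a success rate and an average duration from a few run records. The `Run` record, its field names, and the sample values are hypothetical stand-ins for data you might copy or export from Pulse Health; this is not an Insight Factory API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Run:
    """Hypothetical run-history record; the field names are assumptions."""
    start: datetime
    end: datetime
    status: str  # e.g. "Succeeded", "Failed", "Cancelled"

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def success_rate(runs: list[Run]) -> float:
    """Fraction of the given completed runs that succeeded (0.0 if none)."""
    if not runs:
        return 0.0
    return sum(r.status == "Succeeded" for r in runs) / len(runs)


def average_duration(runs: list[Run]) -> timedelta:
    """Mean duration across the given runs (zero if there are none)."""
    if not runs:
        return timedelta(0)
    return sum((r.duration for r in runs), timedelta(0)) / len(runs)


# Illustrative data only.
runs = [
    Run(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30), "Succeeded"),
    Run(datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 2, 2, 50), "Failed"),
    Run(datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 40), "Succeeded"),
]
print(success_rate(runs))
print(average_duration(runs))
```

Tracking these two numbers over time gives a simple baseline against which an unusually long or failing run stands out.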

Task-level monitoring

Drill into individual tasks to see:

  • Task execution time
  • Rows processed
  • Data volume
  • Error messages (if failed)
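The task-level figures above can be combined to spot bottlenecks. The following is a minimal sketch, assuming you have the per-task metrics available as plain records; `TaskRun`, its fields, and the task names are hypothetical and not part of any product API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRun:
    """Hypothetical per-task metrics; the field names are assumptions."""
    name: str
    seconds: float
    rows: int
    error: Optional[str] = None


def slowest(tasks: list[TaskRun], n: int = 3) -> list[TaskRun]:
    """Return the n tasks with the longest execution time, slowest first."""
    return sorted(tasks, key=lambda t: t.seconds, reverse=True)[:n]


def rows_per_second(task: TaskRun) -> float:
    """Throughput of a task; 0.0 when no time was recorded."""
    return task.rows / task.seconds if task.seconds > 0 else 0.0


# Illustrative data only.
tasks = [
    TaskRun("extract_orders", seconds=120.0, rows=60_000),
    TaskRun("load_customers", seconds=30.0, rows=45_000),
    TaskRun("transform_sales", seconds=300.0, rows=30_000, error="Out of memory"),
]
for task in slowest(tasks, n=2):
    print(task.name, rows_per_second(task))
```

A task that is both slow and low in throughput is usually a better optimisation target than one that is slow simply because it processes a lot of data.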

Troubleshooting failed tasks

Step 1: Identify the failure

  1. Find the failed run in Pulse Health
  2. Identify which task(s) failed
  3. Note the error message

Step 2: Review error details

Click on the failed task to see:

  • Full error message
  • Stack trace (if available)
  • Execution logs
  • Input parameters

Step 3: Common error patterns

Connection errors

Symptoms: "Connection refused", "Timeout", "Authentication failed"

Solutions:

  • Verify connection credentials are correct
  • Check network connectivity
  • Confirm firewall rules allow access
  • Test the connection in the Connections page

Data errors

Symptoms: "Schema mismatch", "Data type error", "Null value in non-null column"

Solutions:

  • Review source data for unexpected values
  • Check column mapping configuration
  • Verify schema enforcement settings
  • Consider adding data validation

Resource errors

Symptoms: "Out of memory", "Cluster unavailable", "Quota exceeded"

Solutions:

  • Increase cluster resources
  • Optimise data processing, for example by using smaller batches
  • Schedule during off-peak hours
  • Review data volume trends
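When triaging many failures, it can help to sort error messages into the three categories above before working through the matching solutions. The sketch below does this with a case-insensitive substring match. The symptom phrases come from the lists above; the category names, the function, and the matching approach are assumptions for illustration, and real error messages may need more specific patterns.

```python
# Symptom phrases are taken from the lists above; the category keys and
# the substring-matching approach are assumptions for illustration.
ERROR_PATTERNS = {
    "connection": ("connection refused", "timeout", "authentication failed"),
    "data": ("schema mismatch", "data type error", "null value in non-null column"),
    "resource": ("out of memory", "cluster unavailable", "quota exceeded"),
}


def classify_error(message: str) -> str:
    """Return the first category whose symptom phrase appears in the message."""
    lowered = message.lower()
    for category, phrases in ERROR_PATTERNS.items():
        if any(phrase in lowered for phrase in phrases):
            return category
    return "unknown"


print(classify_error("ConnectException: Connection refused"))  # connection
print(classify_error("Schema mismatch on column 'id'"))        # data
print(classify_error("Executor lost: Out of memory"))          # resource
print(classify_error("Unexpected end of file"))                # unknown
```

Messages classified as "unknown" are the ones worth reading in full, since they fall outside the common patterns.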

Step 4: Retry execution

After fixing the issue:

  1. Navigate to the failed run
  2. Click Retry to re-run failed tasks
  3. Monitor the retry execution

Retry configuration

Configure automatic retries for transient failures:

Setting            | Description             | Recommended Value
Max Retry Attempts | How many times to retry | 2-3 for transient errors
Retry Wait Time    | Delay between retries   | 30-60 seconds

Configure these settings at the task or task group level.
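To make the interaction of the two settings concrete, here is a minimal sketch of retry behaviour in plain Python. It assumes that Max Retry Attempts counts retries after the first attempt and that only certain error types are treated as transient; both are assumptions for illustration, and `run_with_retries` is a hypothetical helper, not how the platform implements retries.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Which error types count as transient is an assumption for illustration.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def run_with_retries(
    task: Callable[[], T],
    max_retry_attempts: int = 3,
    retry_wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task(); on a transient error, wait and retry up to the limit."""
    retries_used = 0
    while True:
        try:
            return task()
        except TRANSIENT_ERRORS:
            if retries_used >= max_retry_attempts:
                raise  # retries exhausted: surface the last error
            retries_used += 1
            sleep(retry_wait_seconds)


# Demo: a task that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Connection refused")
    return "ok"


waits: list[float] = []  # record waits instead of actually sleeping
result = run_with_retries(flaky, max_retry_attempts=3,
                          retry_wait_seconds=30, sleep=waits.append)
print(result, waits)  # ok [30, 30]
```

Non-transient errors, such as a schema mismatch, propagate immediately in this sketch, which reflects why retries suit transient failures: repeating a task whose input is wrong only delays the investigation.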

Pulse Health overview

Pulse Health provides operational insights:

Dashboard views

  • Pipeline Overview: Status of all production lines
  • Recent Failures: Quick access to problems
  • Execution Trends: Duration and success rate over time
  • Resource Utilisation: Cluster and storage usage

Setting up alerts

Configure alerts for:

  • Task failures
  • Execution duration exceeding threshold
  • Data volume anomalies
  • Consecutive failures

For email notifications on pipeline success or failure, see Notifications.
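The four alert conditions listed above can be expressed as simple checks over recent run data. The sketch below is one way to reason about sensible thresholds before configuring them. The function names, parameters, and default thresholds are hypothetical and are not Pulse Health settings or defaults.

```python
from statistics import mean


def consecutive_failures(statuses: list[str]) -> int:
    """Count failed runs at the end of the list (most recent run last)."""
    count = 0
    for status in reversed(statuses):
        if status != "Failed":
            break
        count += 1
    return count


def evaluate_alerts(
    statuses: list[str],
    durations_min: list[float],
    row_counts: list[int],
    *,
    max_duration_min: float = 60.0,      # assumed threshold
    max_consecutive_failures: int = 2,   # assumed threshold
    volume_tolerance: float = 0.5,       # assumed: 50% deviation from baseline
) -> list[str]:
    """Return the alert conditions triggered by the most recent run."""
    alerts = []
    if statuses and statuses[-1] == "Failed":
        alerts.append("task failure")
    if durations_min and durations_min[-1] > max_duration_min:
        alerts.append("duration over threshold")
    if len(row_counts) >= 2:
        # Baseline is the mean volume of all earlier runs.
        baseline = mean(row_counts[:-1])
        if baseline > 0 and abs(row_counts[-1] - baseline) / baseline > volume_tolerance:
            alerts.append("data volume anomaly")
    if consecutive_failures(statuses) >= max_consecutive_failures:
        alerts.append("consecutive failures")
    return alerts


# Illustrative data only, oldest run first.
print(evaluate_alerts(
    statuses=["Succeeded", "Failed", "Failed"],
    durations_min=[30, 35, 75],
    row_counts=[1000, 1100, 300],
))
```

Comparing against a baseline of earlier runs, rather than a fixed row count, keeps the volume check meaningful as data grows over time.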

Best practices

Proactive monitoring:

  • Review dashboards daily
  • Set up alerts for critical pipelines
  • Track execution duration trends
  • Monitor data quality metrics

Incident response:

  • Document common issues and solutions
  • Create runbooks for critical pipelines
  • Establish escalation procedures
  • Conduct post-incident reviews

Performance optimisation:

  • Identify slow-running tasks
  • Review execution patterns
  • Optimise based on monitoring data
  • Right-size cluster resources

Key Concepts

Term         | Definition
Pulse Health | The operational monitoring dashboard in Insight Factory
Run History  | Record of past production line executions
Retry        | Re-executing a failed task without restarting the entire pipeline
Alert        | Automated notification when conditions are met