Monitoring & Troubleshooting

Learn how to monitor your production lines, investigate issues, and troubleshoot common problems.

Overview

Effective monitoring helps your data pipelines run reliably and surfaces issues early. This guide covers:

Monitoring production line execution
Investigating failed tasks
Common error patterns and solutions
Using Pulse Health for operational insights

Prerequisites

At least one production line that has been executed
An understanding of your pipeline's expected behaviour

Monitoring your pipelines

Execution status

Each production line run has a status:

Status	Meaning	Action
Running	Currently executing	Monitor progress
Succeeded	Completed successfully	No action needed
Failed	One or more tasks failed	Investigate and resolve
Cancelled	Manually stopped	Review why cancelled

For the complete list of task run statuses, including Inactive, Skipped, Blocked, and Stale, see Task Run Statuses.
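The status table above can be expressed as a simple lookup, which is handy when triaging exported run data. This is a minimal sketch only: `RunStatus`, `RECOMMENDED_ACTION`, and `action_for` are hypothetical names for illustration and are not part of any Insight Factory API.

```python
from enum import Enum


class RunStatus(Enum):
    """Run statuses from the table above (names are illustrative)."""
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Recommended follow-up for each status, copied from the table above.
RECOMMENDED_ACTION = {
    RunStatus.RUNNING: "Monitor progress",
    RunStatus.SUCCEEDED: "No action needed",
    RunStatus.FAILED: "Investigate and resolve",
    RunStatus.CANCELLED: "Review why cancelled",
}


def action_for(status_text: str) -> str:
    """Return the recommended action for a status string such as 'Failed'."""
    try:
        status = RunStatus(status_text.strip().capitalize())
    except ValueError:
        # Statuses outside this table (Inactive, Skipped, Blocked, Stale)
        # are documented separately, so no action is guessed here.
        return "See Task Run Statuses"
    return RECOMMENDED_ACTION[status]
```

For example, `action_for("Failed")` returns `"Investigate and resolve"`, while a status that is not in the table falls back to a pointer to Task Run Statuses.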

Viewing run history

Navigate to Operate > Pulse Health
Select your production line
View the run history showing:
- Start and end times
- Duration
- Status
- Task-level details

Task-level monitoring

Drill into individual tasks to see:

Task execution time
Rows processed
Data volume
Error messages (if failed)

Troubleshooting failed tasks

Step 1: Identify the failure

Find the failed run in Pulse Health
Identify which task(s) failed
Note the error message

Step 2: Review error details

Click on the failed task to see:

Full error message
Stack trace (if available)
Execution logs
Input parameters

Step 3: Common error patterns

Connection errors

Symptoms: "Connection refused", "Timeout", "Authentication failed"

Solutions:

Verify connection credentials are correct
Check network connectivity
Confirm firewall rules allow access
Test the connection in the Connections page

Data errors

Symptoms: "Schema mismatch", "Data type error", "Null value in non-null column"

Solutions:

Review source data for unexpected values
Check column mapping configuration
Verify schema enforcement settings
Consider adding data validation

Resource errors

Symptoms: "Out of memory", "Cluster unavailable", "Quota exceeded"

Solutions:

Increase cluster resources
Optimise data processing (for example, by processing smaller batches)
Schedule during off-peak hours
Review data volume trends
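The three error categories above lend themselves to a first-pass classifier for error messages copied out of a failed task. The sketch below assumes you are working with the message text yourself. `ERROR_PATTERNS`, `FIRST_CHECK`, and `classify_error` are hypothetical names, and the phrase lists contain only the symptoms quoted in this guide.

```python
# Symptom phrases taken from the three sections above, grouped by category.
ERROR_PATTERNS = {
    "connection": ("connection refused", "timeout", "authentication failed"),
    "data": ("schema mismatch", "data type error", "null value in non-null column"),
    "resource": ("out of memory", "cluster unavailable", "quota exceeded"),
}

# The first solution listed for each category in this guide.
FIRST_CHECK = {
    "connection": "Verify connection credentials are correct",
    "data": "Review source data for unexpected values",
    "resource": "Increase cluster resources",
}


def classify_error(message: str) -> str:
    """Return 'connection', 'data', 'resource', or 'unknown' for a message."""
    lowered = message.lower()
    for category, phrases in ERROR_PATTERNS.items():
        if any(phrase in lowered for phrase in phrases):
            return category
    # No known symptom matched: the message needs manual investigation.
    return "unknown"
```

Substring matching is deliberately crude. It is a starting point for triage and does not replace reading the full error message and execution logs.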

Step 4: Retry execution

After fixing the issue:

Navigate to the failed run
Click Retry to re-run failed tasks
Monitor the retry execution

Retry configuration

Configure automatic retries for transient failures:

Setting	Description	Recommended Value
Max Retry Attempts	How many times to retry	2-3 for transient errors
Retry Wait Time	Delay between retries	30-60 seconds

Configure these settings at the task or task group level.
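To illustrate how the two settings interact, here is a minimal sketch of retry-with-wait logic. It is not how Insight Factory implements retries. `run_with_retries` and its parameters are hypothetical, and it assumes that Max Retry Attempts counts retries made after the initial attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_retry_attempts: int = 3,
    retry_wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task once, then retry up to max_retry_attempts times on failure."""
    retries_used = 0
    while True:
        try:
            return task()
        except Exception:
            if retries_used >= max_retry_attempts:
                raise  # retries exhausted: surface the last error
            retries_used += 1
            # Wait before the next attempt so a transient fault can clear.
            sleep(retry_wait_seconds)
```

Under these assumptions, a task with 3 retry attempts and a 30 second wait runs at most 4 times and spends at most 90 seconds waiting. This is why low retry counts suit transient errors: a persistent fault only delays the failure.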

Pulse Health overview

Pulse Health provides operational insights through dashboard views and alerts.

Dashboard views

Pipeline Overview: Status of all production lines
Recent Failures: Quick access to problems
Execution Trends: Duration and success rate over time
Resource Utilisation: Cluster and storage usage
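The Execution Trends view reports duration and success rate over time. If you keep your own record of runs, the same two figures can be derived as in the sketch below. The `Run` record and `summarise_runs` function are hypothetical, and the sketch assumes each run carries the start time, end time, and status shown in the run history.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Run:
    """One production line run (illustrative record, not a product type)."""
    start: datetime
    end: Optional[datetime]  # None while the run is still in progress
    status: str              # "Running", "Succeeded", "Failed", or "Cancelled"


def summarise_runs(runs: Sequence[Run]) -> Optional[dict]:
    """Return success rate and mean duration over finished runs, or None."""
    # Only runs that ran to completion count towards the success rate.
    finished = [r for r in runs if r.status in ("Succeeded", "Failed")]
    if not finished:
        return None
    durations = [(r.end - r.start).total_seconds() for r in finished]
    succeeded = sum(1 for r in finished if r.status == "Succeeded")
    return {
        "runs": len(finished),
        "success_rate": succeeded / len(finished),
        "mean_duration_seconds": mean(durations),
    }
```

Running and cancelled runs are excluded here so that manual stops do not distort the success rate. Whether that is the right choice depends on how your team treats cancellations.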

Setting up alerts

Configure alerts for:

Task failures
Execution duration exceeding a threshold
Data volume anomalies
Consecutive failures

For email notifications on pipeline success or failure, see Notifications.
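Two of the alert conditions listed above, duration exceeding a threshold and consecutive failures, reduce to simple checks over recent run data. The sketch below shows that logic under stated assumptions. `consecutive_failures` and `alert_reasons` are hypothetical helpers, the default streak threshold of 3 is an arbitrary example, and none of this reflects how Pulse Health evaluates alerts internally.

```python
from typing import List, Sequence


def consecutive_failures(statuses: Sequence[str]) -> int:
    """Count failed runs at the end of a status list ordered oldest first."""
    count = 0
    for status in reversed(statuses):
        if status != "Failed":
            break
        count += 1
    return count


def alert_reasons(
    statuses: Sequence[str],
    last_duration_seconds: float,
    duration_threshold_seconds: float,
    failure_streak_threshold: int = 3,
) -> List[str]:
    """Return the alert conditions that the most recent runs have triggered."""
    reasons = []
    if last_duration_seconds > duration_threshold_seconds:
        reasons.append("duration exceeded threshold")
    if consecutive_failures(statuses) >= failure_streak_threshold:
        reasons.append("consecutive failures")
    return reasons
```

An empty list means neither condition is met. A streak-based check is useful alongside a plain failure alert because it separates a one-off transient failure from a pipeline that is persistently broken.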

Best practices

Proactive monitoring:

Review dashboards daily
Set up alerts for critical pipelines
Track execution duration trends
Monitor data quality metrics

Incident response:

Document common issues and solutions
Create runbooks for critical pipelines
Establish escalation procedures
Conduct post-incident reviews

Performance optimisation:

Identify slow-running tasks
Review execution patterns
Optimise based on monitoring data
Right-size cluster resources

Key Concepts

Term	Definition
Pulse Health	The operational monitoring dashboard in Insight Factory
Run History	Record of past production line executions
Retry	Re-executing a failed task without restarting the entire pipeline
Alert	Automated notification when conditions are met
