Monitoring & Troubleshooting
Learn how to monitor your production lines, investigate issues, and troubleshoot common problems.
Overview
Effective monitoring ensures your data pipelines run reliably and issues are caught early. In this guide, you'll learn about:
- Monitoring production line execution
- Investigating failed tasks
- Common error patterns and solutions
- Using Pulse Health for operational insights
Prerequisites
- Production lines that have been executed
- An understanding of your pipeline's expected behaviour
Monitoring your pipelines
Execution status
Each production line run has a status:
| Status | Meaning | Action |
|---|---|---|
| Running | Currently executing | Monitor progress |
| Succeeded | Completed successfully | No action needed |
| Failed | One or more tasks failed | Investigate and resolve |
| Cancelled | Manually stopped | Review why it was cancelled |
For the complete list of task run statuses, including Inactive, Skipped, Blocked, and Stale, see Task Run Statuses.
Viewing run history
- Navigate to Operate > Pulse Health
- Select your production line
- View the run history, which shows:
  - Start and end times
  - Duration
  - Status
  - Task-level details
Task-level monitoring
Drill into individual tasks to see:
- Task execution time
- Rows processed
- Data volume
- Error messages (if failed)
Troubleshooting failed tasks
Step 1: Identify the failure
- Find the failed run in Pulse Health
- Identify which task(s) failed
- Note the error message
Step 2: Review error details
Click on the failed task to see:
- Full error message
- Stack trace (if available)
- Execution logs
- Input parameters
Step 3: Match the error to a common pattern
Connection errors
Symptoms: "Connection refused", "Timeout", "Authentication failed"
Solutions:
- Verify connection credentials are correct
- Check network connectivity
- Confirm firewall rules allow access
- Test the connection in the Connections page
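When a task reports "Connection refused" or "Timeout", a quick reachability test from a machine on the same network can help separate network or firewall problems from credential problems. The following is a minimal sketch using only the Python standard library. The `can_reach` helper is a hypothetical name, the host and port in the comment are placeholders, and the check complements rather than replaces the test on the Connections page.

```python
import socket


def can_reach(host: str, port: int, timeout_seconds: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example call (placeholder host and port, replace with your source system):
# can_reach("db.example.internal", 5432)
```

A `True` result only shows that the port is reachable. An "Authentication failed" error can still occur afterwards, which points to the credentials rather than the network.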
Data errors
Symptoms: "Schema mismatch", "Data type error", "Null value in non-null column"
Solutions:
- Review source data for unexpected values
- Check column mapping configuration
- Verify schema enforcement settings
- Consider adding data validation
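A validation step of the kind suggested above could check incoming rows before they reach a schema-enforced target. The following is a minimal sketch in plain Python. The `find_null_violations` helper, the row format (a list of dictionaries), and the column names are assumptions made for illustration.

```python
from typing import Any


def find_null_violations(
    rows: list[dict[str, Any]], required_columns: list[str]
) -> list[tuple[int, str]]:
    """Return (row_index, column) pairs where a required column is missing or None."""
    violations = []
    for index, row in enumerate(rows):
        for column in required_columns:
            if row.get(column) is None:
                violations.append((index, column))
    return violations


# Hypothetical sample rows: row 1 has a null key, row 2 lacks the amount.
rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.00},
    {"order_id": 3},
]
print(find_null_violations(rows, ["order_id", "amount"]))
# → [(1, 'order_id'), (2, 'amount')]
```

Reporting the row index and column together makes it easier to trace a "Null value in non-null column" failure back to the offending source record.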
Resource errors
Symptoms: "Out of memory", "Cluster unavailable", "Quota exceeded"
Solutions:
- Increase cluster resources
- Optimise data processing, for example by using smaller batches
- Schedule during off-peak hours
- Review data volume trends
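The three error families above can be told apart by their symptom strings, which helps when triaging many failed tasks at once. The following is a rough sketch. The `classify_error` helper is hypothetical, and the patterns are only the symptoms listed in this guide, so real error messages will likely need additional patterns.

```python
# Symptom substrings taken from the three error families in this guide.
ERROR_PATTERNS = {
    "connection": ("connection refused", "timeout", "authentication failed"),
    "data": ("schema mismatch", "data type error", "null value in non-null column"),
    "resource": ("out of memory", "cluster unavailable", "quota exceeded"),
}


def classify_error(message: str) -> str:
    """Return the error family whose symptom appears in the message, else 'unknown'."""
    lowered = message.lower()
    for family, symptoms in ERROR_PATTERNS.items():
        if any(symptom in lowered for symptom in symptoms):
            return family
    return "unknown"


print(classify_error("Task failed: Connection refused by host"))  # → connection
print(classify_error("Quota exceeded for workspace"))  # → resource
```

Messages that fall into `unknown` are the ones worth reading in full, along with the stack trace and execution logs.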
Step 4: Retry execution
After fixing the issue:
- Navigate to the failed run
- Click Retry to re-run failed tasks
- Monitor the retry execution
Retry configuration
Configure automatic retries for transient failures:
| Setting | Description | Recommended Value |
|---|---|---|
| Max Retry Attempts | How many times to retry | 2-3 for transient errors |
| Retry Wait Time | Delay between retries | 30-60 seconds |
Configure these settings at the task or task group level.
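Conceptually, the two settings describe a simple retry loop: run the task, and on failure wait for the configured delay before trying again, up to the maximum number of retries. The following Python sketch illustrates that behaviour and is not the platform's implementation. `run_with_retries` and its parameter names are hypothetical, and the sketch assumes that Max Retry Attempts counts retries after the first attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_retry_attempts: int = 3,
    retry_wait_seconds: float = 30.0,
) -> T:
    """Run task, retrying on any exception up to max_retry_attempts times."""
    retries_used = 0
    while True:
        try:
            return task()
        except Exception:
            if retries_used >= max_retry_attempts:
                raise  # Retries exhausted: surface the last error.
            retries_used += 1
            time.sleep(retry_wait_seconds)  # Fixed delay between attempts.
```

Retries like this only help with transient failures such as a brief network drop. A schema mismatch or bad credentials will fail on every attempt, which is why a small retry count is usually enough.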
Pulse Health overview
Pulse Health provides operational insights through dashboard views and alerts.
Dashboard views
- Pipeline Overview: Status of all production lines
- Recent Failures: Quick access to problems
- Execution Trends: Duration and success rate over time
- Resource Utilisation: Cluster and storage usage
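To make the Execution Trends view concrete, the following sketch computes the same two figures, success rate and average duration, from a list of run records. The record shape (`status` and `duration_seconds` keys) and the `summarise_runs` helper are assumptions made for illustration, not an export format of Pulse Health.

```python
from statistics import mean


def summarise_runs(runs: list[dict]) -> dict:
    """Compute success rate and mean duration over finished runs."""
    # Only Succeeded and Failed runs count; Running and Cancelled are ignored.
    finished = [r for r in runs if r["status"] in ("Succeeded", "Failed")]
    if not finished:
        return {"success_rate": None, "mean_duration_seconds": None}
    succeeded = sum(1 for r in finished if r["status"] == "Succeeded")
    return {
        "success_rate": succeeded / len(finished),
        "mean_duration_seconds": mean(r["duration_seconds"] for r in finished),
    }


# Hypothetical run history for one production line.
runs = [
    {"status": "Succeeded", "duration_seconds": 100},
    {"status": "Succeeded", "duration_seconds": 120},
    {"status": "Failed", "duration_seconds": 70},
    {"status": "Succeeded", "duration_seconds": 110},
    {"status": "Running", "duration_seconds": 15},
]
print(summarise_runs(runs))
```

Comparing these figures week over week gives an early signal of pipelines that are slowing down or failing more often.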
Setting up alerts
Configure alerts for:
- Task failures
- Execution duration exceeding a threshold
- Data volume anomalies
- Consecutive failures
For email notifications on pipeline success or failure, see Notifications.
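The alert conditions listed above amount to simple checks over recent run history. The following sketch shows three of them (task failure, consecutive failures, and duration over a threshold). Data volume anomalies are left out because they need a baseline. The function names and default thresholds are hypothetical and are not Pulse Health settings.

```python
def consecutive_failures(statuses: list[str]) -> int:
    """Count the unbroken run of "Failed" statuses at the end of the history."""
    count = 0
    for status in reversed(statuses):  # statuses are ordered oldest to newest
        if status != "Failed":
            break
        count += 1
    return count


def alert_reasons(
    statuses: list[str],
    last_duration_seconds: float,
    max_consecutive_failures: int = 3,
    max_duration_seconds: float = 3600.0,
) -> list[str]:
    """Return the alert conditions that the latest run history triggers."""
    reasons = []
    if statuses and statuses[-1] == "Failed":
        reasons.append("task failure")
    if consecutive_failures(statuses) >= max_consecutive_failures:
        reasons.append("consecutive failures")
    if last_duration_seconds > max_duration_seconds:
        reasons.append("duration over threshold")
    return reasons


history = ["Succeeded", "Failed", "Failed", "Failed"]
print(alert_reasons(history, last_duration_seconds=4200))
# → ['task failure', 'consecutive failures', 'duration over threshold']
```

Treating consecutive failures as a separate condition helps distinguish a one-off transient error from a pipeline that is persistently broken.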
Best practices
Proactive monitoring:
- Review dashboards daily
- Set up alerts for critical pipelines
- Track execution duration trends
- Monitor data quality metrics
Incident response:
- Document common issues and solutions
- Create runbooks for critical pipelines
- Establish escalation procedures
- Conduct post-incident reviews
Performance optimisation:
- Identify slow-running tasks
- Review execution patterns
- Optimise based on monitoring data
- Right-size cluster resources
Key Concepts
| Term | Definition |
|---|---|
| Pulse Health | The operational monitoring dashboard in Insight Factory |
| Run History | Record of past production line executions |
| Retry | Re-executing a failed task without restarting the entire pipeline |
| Alert | Automated notification when conditions are met |