# Ingesting Data from Files

Learn how to ingest data from file sources such as CSV, Excel, and JSON files on SFTP servers or in cloud storage.
## Overview
File-based ingestion is essential for working with data exports, spreadsheets, and file drops. In this guide, you'll learn how to:
- Create connections to file sources (SFTP, ADLS)
- Configure file ingestion tasks
- Handle column mapping and schema inference
- Work with different file formats
## Prerequisites
- An existing Production Line (see Production Lines)
- Access credentials for your file source (SFTP server, Azure Data Lake, etc.)
- Understanding of your source file format and structure
## Step-by-Step Guide

### 1. Create a file source connection
- Navigate to Build > Connections
- Click New Connection
- Select your file source type:
  - SFTP for secure file transfer servers
  - Azure Data Lake Storage Gen2 for ADLS
  - Azure Blob Storage for blob containers
- Enter your connection details and credentials
- Test and save the connection
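Connection details vary by source type. As an illustration only (field names here are hypothetical, not this platform's exact property names), an SFTP connection typically comes down to a handful of settings:

```json
{
  "type": "sftp",
  "host": "sftp.example.com",
  "port": 22,
  "username": "ingest_user",
  "auth": { "method": "ssh_key", "privateKeyRef": "secret://sftp-key" },
  "rootPath": "/exports/daily"
}
```

Prefer key-based authentication and store credentials in your platform's secret store rather than inline.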
### 2. Create a file ingestion task
- Open your production line and navigate to the Graph view
- Add a new task using one of these methods:
  - Click the + button in the graph side menu
  - Right-click an existing node and select Add Task from the context menu
- Enter a unique Code and Name for your task
- Select the appropriate ingestion activity from the Activity dropdown:
  - "Ingest Delimited File to Lakehouse" for CSV files
  - "Ingest Excel Worksheet to Lakehouse" for Excel files
  - "Ingest JSON File to Lakehouse" for JSON files
- Configure the task properties
### 3. Configure file settings
Depending on your file format, you may need to configure:
**For Delimited Files (CSV):**
- Column delimiter (comma, tab, pipe, etc.)
- Quote character
- Header row settings
- Encoding
**For Excel Files:**
- Worksheet name or index
- Header row settings
- Data range
**For JSON Files:**
- JSON path expression
- Array handling
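These settings map directly onto what any parser needs. As an illustration in plain Python (not this platform's API), here is how the delimiter, quote character, header row, and a JSON array path come together when reading files:

```python
import csv
import io
import json

# Delimited file: pipe delimiter, quoted fields, first row is the header.
raw = 'id|name|city\n1|"Doe, Jane"|Berlin\n2|"Roe, Rich"|Lisbon\n'
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar='"')
header = next(reader)            # header row setting: consume one row, use as column names
rows = [dict(zip(header, r)) for r in reader]
print(rows[0])                   # {'id': '1', 'name': 'Doe, Jane', 'city': 'Berlin'}

# JSON file: records nested under a path, array handling = one row per element.
doc = json.loads('{"data": {"orders": [{"id": 1}, {"id": 2}]}}')
records = doc["data"]["orders"]  # equivalent of a JSON path like $.data.orders
print(len(records))              # 2
```

Getting the encoding wrong typically surfaces as garbled characters rather than a hard error, so it is worth checking a sample of ingested text values explicitly.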
### 4. Set up column mapping
- Review the inferred schema
- Adjust column names if needed
- Set appropriate data types
- Add any computed columns
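Conceptually, column mapping is a rename, a type cast, and optionally a derived value per row. A minimal sketch in plain Python (the mapping spec format below is hypothetical, not this platform's):

```python
from datetime import date

# Hypothetical mapping spec: source column -> (destination name, cast function)
mapping = {
    "id":       ("order_id", int),
    "amt":      ("amount", float),
    "ord_date": ("order_date", date.fromisoformat),
}

def map_row(src: dict) -> dict:
    """Apply renames and casts, then add computed columns."""
    dst = {dest: cast(src[col]) for col, (dest, cast) in mapping.items()}
    # Computed column: derived from already-mapped values.
    dst["amount_cents"] = round(dst["amount"] * 100)
    return dst

row = map_row({"id": "7", "amt": "19.99", "ord_date": "2024-05-01"})
print(row["order_id"], row["amount_cents"])  # 7 1999
```

Casting during mapping is also where bad source values first surface, so this is a natural place to decide whether malformed rows should fail the run or be quarantined.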
### 5. Run and verify
- Save your task configuration
- Run the ingestion task
- Verify the data in your Lakehouse
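Verification usually boils down to a couple of sanity checks: the row count matches the source, and key columns are populated. An illustrative helper (plain Python, not a platform feature):

```python
def verify_ingestion(rows, expected_count, key_column):
    """Basic post-ingestion sanity checks: row count and non-empty key column."""
    assert len(rows) == expected_count, (
        f"expected {expected_count} rows, got {len(rows)}"
    )
    missing = [i for i, r in enumerate(rows) if not r.get(key_column)]
    assert not missing, f"rows with empty {key_column}: {missing}"
    return True

# Example: rows as ingested dictionaries
ingested = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}]
print(verify_ingestion(ingested, expected_count=2, key_column="id"))  # True
```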
## Key Concepts
| Term | Definition |
|---|---|
| Schema Inference | Automatic detection of column names and data types from file structure |
| Column Mapping | The process of defining how source columns map to destination columns |
| Delimited File | A text file where columns are separated by a specific character |
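Schema inference can be illustrated in a few lines: try progressively narrower types against every sample value and keep the narrowest one that fits all of them. This is a simplified sketch (real engines also sample booleans, dates, and nullability):

```python
def infer_type(values):
    """Return the narrowest of int/float/string that fits all sample values."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

sample = {
    "id": ["1", "2", "3"],
    "price": ["9.5", "12", "0.99"],
    "city": ["Berlin", "Lisbon"],
}
schema = {col: infer_type(vals) for col, vals in sample.items()}
print(schema)  # {'id': 'int', 'price': 'float', 'city': 'string'}
```

Because inference works from a sample, always review the inferred schema (step 4 above): a column that happens to hold only digits in the sample may still be a string identifier, such as a zip code with leading zeros.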