# Database Sources
This guide covers how to connect relational databases as data sources in Datalinx AI. Database connections enable you to ingest data from your operational systems for transformation and analysis.
## Supported Databases
Datalinx AI supports the following database systems:
| Database | Versions | Connection Method |
|---|---|---|
| PostgreSQL | 11+ | Direct, SSH tunnel |
| MySQL | 5.7+, 8.x | Direct, SSH tunnel |
| Snowflake | Current | Native connector |
| Databricks | Unity Catalog | Native connector |
| SQL Server | 2016+ | Direct |
| Oracle | 12c+ | Direct |
| BigQuery | Current | Service account |
## Adding a Database Source

### Step 1: Navigate to Sources
- Open your workspace
- Click Sources in the sidebar
- Click Add Source
- Select Database as the source type
### Step 2: Choose Database Type
Select your database type from the list. Each type has specific connection requirements.
### Step 3: Configure Connection

#### PostgreSQL / MySQL Example

```
Connection Details:
  Host: db.example.com
  Port: 5432
  Database: production
  Username: datalinx_reader
  Password: ********
Advanced Options:
  SSL Mode: require
  Connection Timeout: 30s
  Read Timeout: 300s
```
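To sanity-check these values before entering them, the same parameters work with any standard PostgreSQL client. Below is a minimal sketch using the psycopg2 driver (not part of Datalinx AI); the host, database, and credentials mirror the example above:

```python
# Standalone check of the PostgreSQL connection details above.
# Requires: pip install psycopg2-binary
import psycopg2

conn = psycopg2.connect(
    host="db.example.com",
    port=5432,
    dbname="production",
    user="datalinx_reader",
    password="********",   # substitute the real password
    sslmode="require",     # matches the SSL Mode option
    connect_timeout=30,    # matches the 30s Connection Timeout
)
with conn.cursor() as cur:
    # Count user tables, roughly what schema discovery will see
    cur.execute(
        "SELECT count(*) FROM information_schema.tables "
        "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')"
    )
    print(f"Found {cur.fetchone()[0]} tables")
conn.close()
```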
#### Snowflake Example

```
Connection Details:
  Account: acme.us-east-1
  Warehouse: COMPUTE_WH
  Database: ANALYTICS
  Schema: RAW_DATA
  Role: DATA_READER
Authentication:
  Method: Key Pair
  Private Key: [uploaded]
  Passphrase: ********
```
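Key-pair authentication can likewise be verified independently with Snowflake's Python connector. A sketch, assuming a PEM private key stored at `rsa_key.p8` and a hypothetical `DATALINX_READER` user; the connector expects the key re-encoded as DER:

```python
# Standalone key-pair check for the Snowflake details above.
# Requires: pip install snowflake-connector-python cryptography
import snowflake.connector
from cryptography.hazmat.primitives import serialization

# Load the PEM key and re-encode it as DER, the format the
# connector's private_key parameter expects.
with open("rsa_key.p8", "rb") as f:        # hypothetical key path
    key = serialization.load_pem_private_key(
        f.read(), password=b"********"     # the key's passphrase
    )
der_key = key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

conn = snowflake.connector.connect(
    account="acme.us-east-1",
    user="DATALINX_READER",   # hypothetical service user
    private_key=der_key,
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="RAW_DATA",
    role="DATA_READER",
)
print(conn.cursor().execute("SELECT current_version()").fetchone())
conn.close()
```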
#### Databricks Example

```
Connection Details:
  Host: acme.cloud.databricks.com
  HTTP Path: /sql/1.0/warehouses/abc123
  Catalog: main
  Schema: raw
Authentication:
  Method: Service Principal
  Client ID: ********
  Client Secret: ********
```
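The host and HTTP path can be confirmed outside the UI with the `databricks-sql-connector` package. This sketch authenticates with a personal access token for brevity; the service-principal method shown above instead exchanges the client ID and secret for an OAuth token:

```python
# Standalone check for the Databricks details above.
# Requires: pip install databricks-sql-connector
from databricks import sql

# Uses a personal access token for simplicity; the service-principal
# flow exchanges the client ID/secret for an OAuth token instead.
with sql.connect(
    server_hostname="acme.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi...",   # placeholder token
) as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES IN main.raw")
        for row in cur.fetchall():
            print(row)
```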
### Step 4: Test Connection

Click Test Connection to verify:

```
Testing connection to PostgreSQL...
✓ Network connectivity
✓ Authentication successful
✓ Schema access verified
✓ Found 24 tables

Connection successful!
```
### Step 5: Configure Schema Discovery

Choose which tables to include:

- All Tables: Include all accessible tables
- Select Tables: Choose specific tables
- Pattern Match: Use wildcards such as `orders_*` and `customer_*` (see the sketch below)
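Assuming the wildcards follow the usual shell-glob semantics, you can preview which tables a pattern would select with Python's `fnmatch` (the table names below are hypothetical):

```python
# Preview which tables a wildcard pattern would select.
from fnmatch import fnmatch

tables = ["orders_2023", "orders_2024", "customer_profiles", "audit_log"]
patterns = ["orders_*", "customer_*"]

selected = [t for t in tables if any(fnmatch(t, p) for p in patterns)]
print(selected)  # ['orders_2023', 'orders_2024', 'customer_profiles']
```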
## Connection Security

### Network Security

#### Direct Connection
For databases accessible from the internet:
- Whitelist Datalinx AI IP addresses
- Enable SSL/TLS encryption
- Use strong credentials
```
# Required IP whitelist
Datalinx AI IPs:
  - 52.1.2.3/32
  - 52.1.2.4/32
  - 52.1.2.5/32
```
#### SSH Tunnel

For databases in private networks:

```
SSH Tunnel:
  Enabled: true
  SSH Host: bastion.example.com
  SSH Port: 22
  SSH User: datalinx
  SSH Key: [uploaded private key]
  Database Host: internal-db.private
  Database Port: 5432
```
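Before configuring the source, you can confirm that the bastion actually reaches the database by replicating the tunnel locally. A sketch using the `sshtunnel` and `psycopg2` packages (not part of Datalinx AI), with the same hosts as above:

```python
# Replicate the tunnel locally to confirm the bastion can reach the DB.
# Requires: pip install sshtunnel psycopg2-binary
import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("bastion.example.com", 22),
    ssh_username="datalinx",
    ssh_pkey="/path/to/private_key",   # placeholder key path
    remote_bind_address=("internal-db.private", 5432),
) as tunnel:
    # The database is now reachable on a forwarded local port
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,
        dbname="production",
        user="datalinx_reader",
        password="********",
    )
    print("Tunnel and database connection OK")
    conn.close()
```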
### Credential Storage
All credentials are encrypted:
- At Rest: AES-256 encryption
- In Transit: TLS 1.3
- Access: Only decrypted during pipeline execution
Use read-only database users for source connections. Datalinx AI never needs write access to your source databases.
## Schema Discovery

### Automatic Discovery
When you add a database source, Datalinx AI automatically:
- Scans available schemas
- Lists all tables
- Discovers column names and types
- Identifies primary keys and relationships
- Samples data for profiling
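One way to preview what discovery will find is to query the standard `information_schema` catalog yourself, which exposes the same table and column metadata on PostgreSQL and MySQL:

```python
# Preview the metadata that discovery reads, via information_schema.
import psycopg2

conn = psycopg2.connect(
    host="db.example.com", dbname="production",
    user="datalinx_reader", password="********",
)
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
        """
    )
    for table, column, dtype in cur.fetchall():
        print(f"{table}.{column}: {dtype}")
conn.close()
```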
### Schema Refresh
Schemas can change. Configure automatic refresh:
| Option | Description |
|---|---|
| On Connect | Refresh each time a pipeline runs |
| Scheduled | Daily/weekly refresh |
| Manual | Only when you click refresh |
### Handling Schema Changes

When source schema changes are detected:

```
Schema Change Detected:
┌─────────────────────────────────────────┐
│ Table: customers                        │
│                                         │
│ + Added columns:                        │
│   - loyalty_tier (VARCHAR)              │
│   - signup_source (VARCHAR)             │
│                                         │
│ ~ Modified columns:                     │
│   - email: VARCHAR(100) → VARCHAR(255)  │
│                                         │
│ - Removed columns:                      │
│   - legacy_id                           │
└─────────────────────────────────────────┘
[Update Mappings] [Ignore] [View Details]
```
## Performance Optimization

### Query Pushdown
Datalinx AI pushes filters to the source database when possible:
```sql
-- Instead of fetching all data:
SELECT * FROM orders;

-- Pushes filter to source:
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
  AND status = 'completed';
```
### Incremental Loading
Configure incremental loading to avoid full table scans:
```
Incremental Config:
  Cursor Column: updated_at
  Cursor Type: TIMESTAMP
  Initial Value: 2024-01-01T00:00:00Z
  Batch Size: 10000
```
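Conceptually, each incremental run selects only the rows past the saved cursor and then advances it. A simplified sketch of that loop against a hypothetical `orders` table (the real pipeline also persists cursor state and handles retries):

```python
# Simplified cursor-based incremental load matching the config above.
import psycopg2

conn = psycopg2.connect(
    host="db.example.com", dbname="production",
    user="datalinx_reader", password="********",
)
cursor_value = "2024-01-01T00:00:00Z"  # the Initial Value from the config

with conn.cursor() as cur:
    # Fetch only rows past the saved cursor, one 10000-row batch at a time
    cur.execute(
        """
        SELECT order_id, updated_at
        FROM orders
        WHERE updated_at > %s
        ORDER BY updated_at
        LIMIT 10000
        """,
        (cursor_value,),
    )
    rows = cur.fetchall()

if rows:
    # Rows are ordered by updated_at, so the last row holds the new cursor;
    # persisting it lets the next run resume here instead of rescanning.
    cursor_value = rows[-1][1]
conn.close()
```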
### Connection Pooling
For high-frequency pipelines:
```
Pool Settings:
  Min Connections: 2
  Max Connections: 10
  Idle Timeout: 300s
```
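These settings correspond to a standard client-side connection pool. A rough equivalent using psycopg2's built-in pool (illustrative only; psycopg2's simple pools don't enforce an idle timeout, and Datalinx AI manages pooling internally):

```python
# Client-side pool with the same min/max bounds as the settings above.
from psycopg2 import pool

db_pool = pool.ThreadedConnectionPool(
    minconn=2,    # Min Connections
    maxconn=10,   # Max Connections
    host="db.example.com",
    dbname="production",
    user="datalinx_reader",
    password="********",
)

conn = db_pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    # Return the connection to the pool instead of closing it
    db_pool.putconn(conn)

db_pool.closeall()
```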
## Troubleshooting

### Connection Failures

| Error | Cause | Solution |
|---|---|---|
| Connection refused | Network/firewall | Check IP whitelist |
| Authentication failed | Wrong credentials | Verify username/password |
| SSL handshake failed | Certificate issue | Check SSL configuration |
| Timeout | Network latency | Increase timeout values |
### Permission Errors

```sql
-- Required permissions for the source user (PostgreSQL syntax)
GRANT CONNECT ON DATABASE mydb TO datalinx_reader;
GRANT USAGE ON SCHEMA public TO datalinx_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datalinx_reader;

-- Cover tables created after the grants are issued
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO datalinx_reader;
```
### Slow Discovery
If schema discovery is slow:
- Limit tables with pattern matching
- Exclude large tables from initial discovery
- Schedule discovery during off-peak hours
## Best Practices

### Security
- Create dedicated users: Don't use admin accounts
- Grant minimal permissions: Read-only access only
- Use SSL/TLS: Always encrypt connections
- Rotate credentials: Update passwords regularly
### Performance
- Use incremental loading: Avoid full table scans
- Index cursor columns: Speed up incremental queries
- Limit concurrent connections: Respect database limits
- Schedule off-peak: Run large syncs during low-traffic periods
### Maintenance
- Monitor schema changes: Enable change detection
- Review connection logs: Check for errors
- Test connections regularly: Catch issues early
- Document sources: Maintain connection documentation
## Related Documentation
- Data Sources Overview - Introduction to data sources
- File Sources - Connecting file-based data
- API Sources - Connecting REST/GraphQL APIs
- Schema Configuration - Working with schemas