
Database Sources

This guide covers how to connect relational databases and cloud data warehouses as data sources in Datalinx AI. Database connections let you ingest data from your operational systems for transformation and analysis.

Supported Databases

Datalinx AI supports the following database systems:

Database     Versions        Connection Method
PostgreSQL   11+             Direct, SSH tunnel
MySQL        5.7+, 8.x       Direct, SSH tunnel
Snowflake    Current         Native connector
Databricks   Unity Catalog   Native connector
SQL Server   2016+           Direct
Oracle       12c+            Direct
BigQuery     Current         Service account

Adding a Database Source

Step 1: Navigate to Sources

  1. Open your workspace
  2. Click Sources in the sidebar
  3. Click Add Source
  4. Select Database as the source type

Step 2: Choose Database Type

Select your database type from the list. Each type has specific connection requirements.

Step 3: Configure Connection

PostgreSQL / MySQL Example

Connection Details:
Host: db.example.com
Port: 5432
Database: production
Username: datalinx_reader
Password: ********

Advanced Options:
SSL Mode: require
Connection Timeout: 30s
Read Timeout: 300s

Snowflake Example

Connection Details:
Account: acme.us-east-1
Warehouse: COMPUTE_WH
Database: ANALYTICS
Schema: RAW_DATA
Role: DATA_READER

Authentication:
Method: Key Pair
Private Key: [uploaded]
Passphrase: ********
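
On the Snowflake side, key-pair authentication requires attaching the public key to the user Datalinx AI connects as, and the role needs read access to the source schema. A minimal sketch in Snowflake SQL, assuming a hypothetical user DATALINX_USER (the key value is truncated here):

-- Attach the RSA public key to the connecting user
-- (DATALINX_USER and the key value are illustrative)
ALTER USER DATALINX_USER SET RSA_PUBLIC_KEY = 'MIIBIjANBgkq...';

-- Give the reader role access to the source database and schema
GRANT USAGE ON DATABASE ANALYTICS TO ROLE DATA_READER;
GRANT USAGE ON SCHEMA ANALYTICS.RAW_DATA TO ROLE DATA_READER;
GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.RAW_DATA TO ROLE DATA_READER;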

Databricks Example

Connection Details:
Host: acme.cloud.databricks.com
HTTP Path: /sql/1.0/warehouses/abc123
Catalog: main
Schema: raw

Authentication:
Method: Service Principal
Client ID: ********
Client Secret: ********
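
On the Databricks side, the service principal needs read access to the catalog and schema in Unity Catalog. A sketch of the grants, assuming a hypothetical principal identifier datalinx-sp (Unity Catalog typically uses the principal's application ID here):

-- Unity Catalog grants for the service principal
-- (`datalinx-sp` is an illustrative principal identifier)
GRANT USE CATALOG ON CATALOG main TO `datalinx-sp`;
GRANT USE SCHEMA ON SCHEMA main.raw TO `datalinx-sp`;
GRANT SELECT ON SCHEMA main.raw TO `datalinx-sp`;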

Step 4: Test Connection

Click Test Connection to verify:

Testing connection to PostgreSQL...
✓ Network connectivity
✓ Authentication successful
✓ Schema access verified
✓ Found 24 tables

Connection successful!

Step 5: Configure Schema Discovery

Choose which tables to include:

  • All Tables: Include all accessible tables
  • Select Tables: Choose specific tables
  • Pattern Match: Use wildcards (e.g., orders_*, customer_*)

Connection Security

Network Security

Direct Connection

For databases accessible from the internet:

  1. Whitelist Datalinx AI IP addresses
  2. Enable SSL/TLS encryption
  3. Use strong credentials

# Required IP whitelist
Datalinx AI IPs:
- 52.1.2.3/32
- 52.1.2.4/32
- 52.1.2.5/32

SSH Tunnel

For databases in private networks:

SSH Tunnel:
Enabled: true
SSH Host: bastion.example.com
SSH Port: 22
SSH User: datalinx
SSH Key: [uploaded private key]

Database Host: internal-db.private
Database Port: 5432

Credential Storage

All credentials are encrypted:

  • At Rest: AES-256 encryption
  • In Transit: TLS 1.3
  • Access: Only decrypted during pipeline execution
Tip: Use read-only database users for source connections. Datalinx AI never needs write access to your source databases.

Schema Discovery

Automatic Discovery

When you add a database source, Datalinx AI automatically:

  1. Scans available schemas
  2. Lists all tables
  3. Discovers column names and types (see the query sketch after this list)
  4. Identifies primary keys and relationships
  5. Samples data for profiling
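
On most supported databases, column discovery amounts to queries against the standard information_schema views. A sketch of roughly what runs against a PostgreSQL source (where 'public' is the default schema):

-- List tables and columns visible to the reader role
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;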

Schema Refresh

Source schemas change over time. Configure when Datalinx AI refreshes its view of the schema:

Option       Description
On Connect   Refresh each time the pipeline runs
Scheduled    Refresh daily or weekly
Manual       Refresh only when you click refresh

Handling Schema Changes

When source schema changes are detected:

Schema Change Detected:
┌────────────────────────────────────────────┐
│ Table: customers                           │
│                                            │
│ + Added columns:                           │
│     - loyalty_tier (VARCHAR)               │
│     - signup_source (VARCHAR)              │
│                                            │
│ ~ Modified columns:                        │
│     - email: VARCHAR(100) → VARCHAR(255)   │
│                                            │
│ - Removed columns:                         │
│     - legacy_id                            │
└────────────────────────────────────────────┘

[Update Mappings] [Ignore] [View Details]

Performance Optimization

Query Pushdown

Datalinx AI pushes filters to the source database when possible:

-- Instead of fetching all data:
SELECT * FROM orders;

-- Pushes filter to source:
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
AND status = 'completed';

Incremental Loading

Configure incremental loading to avoid full table scans:

Incremental Config:
Cursor Column: updated_at
Cursor Type: TIMESTAMP
Initial Value: 2024-01-01T00:00:00Z
Batch Size: 10000
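
With this configuration, each sync reads only rows whose cursor value is newer than the last one saved. The query issued against the source is roughly equivalent to the following sketch (:last_cursor stands in for the stored cursor value):

-- Incremental batch read; :last_cursor is the value saved
-- at the end of the previous sync (illustrative placeholder)
SELECT *
FROM orders
WHERE updated_at > :last_cursor
ORDER BY updated_at
LIMIT 10000;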

Connection Pooling

For high-frequency pipelines:

Pool Settings:
Min Connections: 2
Max Connections: 10
Idle Timeout: 300s

Troubleshooting

Connection Failures

Error                  Cause               Solution
Connection refused     Network/firewall    Check the IP whitelist
Authentication failed  Wrong credentials   Verify username and password
SSL handshake failed   Certificate issue   Check SSL configuration
Timeout                Network latency     Increase timeout values

Permission Errors

-- Required permissions for source user
GRANT CONNECT ON DATABASE mydb TO datalinx_reader;
GRANT USAGE ON SCHEMA public TO datalinx_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datalinx_reader;
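
Note that GRANT ... ON ALL TABLES only covers tables that exist when it runs. On PostgreSQL, you can extend read access to tables created later with default privileges:

-- Also grant SELECT on future tables
-- (run as the role that owns/creates the tables)
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO datalinx_reader;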

Slow Discovery

If schema discovery is slow:

  1. Limit tables with pattern matching
  2. Exclude large tables from initial discovery
  3. Schedule discovery during off-peak hours

Best Practices

Security

  1. Create dedicated users: Don't use admin accounts (see the sketch after this list)
  2. Grant minimal permissions: Read-only access only
  3. Use SSL/TLS: Always encrypt connections
  4. Rotate credentials: Update passwords regularly
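
As a PostgreSQL sketch of the first two practices (the role name and password are illustrative; keep the real password in a secrets manager):

-- A dedicated, read-only role for Datalinx AI
-- (password is a placeholder; never commit real credentials)
CREATE ROLE datalinx_reader LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE mydb TO datalinx_reader;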

Performance

  1. Use incremental loading: Avoid full table scans
  2. Index cursor columns: Speed up incremental queries (see the example after this list)
  3. Limit concurrent connections: Respect database limits
  4. Schedule off-peak: Run large syncs during low-traffic periods
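
For the second item, an index on the cursor column keeps incremental queries fast. Matching the incremental configuration shown earlier:

-- Index the cursor column used for incremental loading
CREATE INDEX idx_orders_updated_at ON orders (updated_at);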

Maintenance

  1. Monitor schema changes: Enable change detection
  2. Review connection logs: Check for errors
  3. Test connections regularly: Catch issues early
  4. Document sources: Maintain connection documentation