Core Concepts
This page explains the fundamental concepts and terminology used throughout Datalinx AI.
Organization
An Organization is the top-level container in Datalinx AI, representing your company or team. Each organization has:
- Isolated data and configurations
- Its own users and roles
- Separate billing and usage tracking
Workspace
A Workspace is an isolated environment within an organization for a specific project or use case. Workspaces contain:
- Data source configurations
- Schema definitions and mappings
- Pipeline definitions
- Monitoring rules
Use separate workspaces for development, staging, and production to prevent accidental changes to live data.
Data Sources
Data Sources are the origins of your raw data. Datalinx AI supports three types:
Database Sources
Direct connections to databases:
- PostgreSQL
- Snowflake
- Databricks
- BigQuery
API Sources
REST API endpoints with schema definitions:
- OpenAPI/Swagger specifications
- Custom REST endpoints
- GraphQL (coming soon)
File Sources
Files stored in cloud storage:
- CSV, JSON, Parquet formats
- AWS S3, Azure Blob, GCS
- SFTP servers
Schemas
Source Schema
The Source Schema describes the structure of your raw data:
- Tables and columns
- Data types
- Relationships (if detected)
Source schemas are automatically discovered when you connect a data source.
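Discovery happens for you, but the metadata it captures is the same kind you could inspect by hand. A minimal sketch against PostgreSQL's standard information_schema (the 'public' schema filter is just the PostgreSQL default):
```sql
-- Inspect the metadata a source schema captures: tables, columns, types.
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```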
Target Schema
The Target Schema defines the desired output structure:
- Standardized table definitions
- Required and optional fields
- Data type specifications
- Canonical field types (email, phone, currency, etc.)
Datalinx AI provides pre-built target schema templates for common use cases.
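As a concrete illustration, a target table behind such a template might look like the following SQL DDL. All names and canonical-type annotations here are hypothetical:
```sql
-- Hypothetical target table illustrating a template's structure.
CREATE TABLE customers (
    customer_id          TEXT PRIMARY KEY,  -- required identifier
    email                TEXT NOT NULL,     -- canonical type: email (required)
    phone                TEXT,              -- canonical type: phone (optional)
    lifetime_value_cents BIGINT             -- canonical type: currency, in cents
);
```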
Mappings
Mappings define how source data transforms into target data. Each mapping specifies:
- Source field(s)
- Target field
- Transformation logic (if any)
Mapping Types
| Type | Description | Example |
|---|---|---|
| Direct | 1:1 field copy | source.name → target.name |
| Expression | SQL transformation | CONCAT(first_name, ' ', last_name) |
| Decorated | Apply standardization | LOWER(email) |
| Computed | Derived value | SUM(line_items.amount) |
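Conceptually, each mapping type corresponds to a SQL expression over the source. A sketch with hypothetical source tables (customers_raw, orders):
```sql
-- One SELECT expression per mapping type.
SELECT
    c.name                                  AS name,            -- Direct: 1:1 field copy
    CONCAT(c.first_name, ' ', c.last_name)  AS full_name,       -- Expression: SQL transformation
    LOWER(c.email)                          AS email,           -- Decorated: standardization
    (SELECT SUM(o.total)
       FROM orders o
      WHERE o.customer_id = c.customer_id)  AS lifetime_value   -- Computed: derived value
FROM customers_raw c;
```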
CTEs (Common Table Expressions)
CTEs are intermediate transformations that create derived datasets. Use CTEs when you need to:
- Join multiple source tables
- Aggregate data before mapping
- Apply complex business logic
- Create reusable sub-queries
```sql
-- CTE Example: Calculate customer lifetime value
WITH customer_orders AS (
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_orders;
```
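The CTE's output columns (order_count, lifetime_value) can then be mapped to target fields just like columns from any other source table.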
Decorators
Decorators are pre-built transformation functions that standardize data:
| Decorator | Purpose | Example |
|---|---|---|
| lowercase | Normalize case | JOHN@EMAIL.COM → john@email.com |
| phone_format | Standardize phone numbers | (555) 123-4567 → +15551234567 |
| date_parse | Parse date strings | 01/15/2024 → 2024-01-15 |
| currency_cents | Convert to cents | $19.99 → 1999 |
| null_if_empty | Convert empty to NULL | "" → NULL |
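Decorators are applied declaratively in mappings, but their behavior corresponds to familiar SQL functions. An illustrative PostgreSQL approximation, not the product's internals (customers_raw and its columns are hypothetical, price is assumed numeric, and phone_format is omitted because robust phone normalization needs more than a one-line expression):
```sql
-- Approximate SQL equivalents of the decorators above.
SELECT
    LOWER(email)                        AS email,        -- lowercase
    TO_DATE(signup_date, 'MM/DD/YYYY')  AS signup_date,  -- date_parse
    CAST(ROUND(price * 100) AS BIGINT)  AS price_cents,  -- currency_cents
    NULLIF(TRIM(notes), '')             AS notes         -- null_if_empty
FROM customers_raw;
```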
Pipelines
A Pipeline applies your mappings to transform source data into target data. Pipelines can be:
- Ad-hoc: Manually triggered
- Scheduled: Run on a cron schedule
- Triggered: Activated by events
Pipeline Modes
| Mode | Description | Use Case |
|---|---|---|
| Full Refresh | Process all data | Initial load, schema changes |
| Incremental | Process only new/changed | Regular updates |
| Backfill | Process historical range | Gap filling |
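Conceptually, the modes differ only in which slice of the source a run reads. A hand-rolled SQL sketch of the idea (the updated_at column and the stored high-water mark are assumptions, not product internals):
```sql
-- Full Refresh: read everything.
SELECT * FROM orders;
-- Incremental: only rows changed since the last successful run.
SELECT * FROM orders
WHERE updated_at > '2024-01-15 00:00:00';  -- stored high-water mark
-- Backfill: reprocess a historical window to fill a gap.
SELECT * FROM orders
WHERE updated_at BETWEEN '2023-12-01' AND '2024-01-01';
```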
Control Plane vs Data Plane
Datalinx AI uses a split architecture for security and scalability:
Control Plane
The Control Plane handles:
- User interface and API
- Configuration management
- Authentication and authorization
- Command orchestration
Data Plane
The Data Plane executes:
- Data transformations (dbt)
- Pipeline runs
- Query execution
- Monitoring checks
Identity Resolution
Identity Resolution links records from different sources to the same entity:
- Match on email, phone, or custom identifiers
- Build an identity graph
- Create canonical customer records
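As a simplified illustration, the most basic match is an equality join on a normalized identifier; real resolution layers fuzzy matching and graph-building on top. Table names here are hypothetical:
```sql
-- Link CRM and storefront records that share a normalized email; each
-- matched pair is an edge in the identity graph.
SELECT
    crm.id  AS crm_contact_id,
    shop.id AS shop_customer_id
FROM crm_contacts crm
JOIN shop_customers shop
  ON LOWER(crm.email) = LOWER(shop.email);
```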
Audiences
Audiences are segments of entities that match criteria you define:
```sql
-- Example: High-value customers
SELECT customer_id
FROM customers
WHERE lifetime_value > 1000
  AND last_order_date > CURRENT_DATE - INTERVAL '90 days';
```
Audiences can be:
- Exported to marketing platforms
- Used for analysis
- Monitored for changes
Reverse ETL
Reverse ETL pushes transformed data back to operational systems:
- CRM updates (Salesforce, HubSpot)
- Marketing platform syncs (Meta, Google Ads)
- Custom webhook destinations
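The shape of a sync is essentially a query that selects the fields the destination expects, keyed by an identifier the destination can match on. A hypothetical sketch:
```sql
-- Select the fields a CRM destination expects for each member of the
-- high-value audience defined above. Names are hypothetical.
SELECT
    customer_id AS external_id,  -- key the destination matches records on
    email,
    lifetime_value
FROM customers
WHERE lifetime_value > 1000;
```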
Monitoring
Monitoring continuously checks data quality:
Rule Types
- Volume: Row counts within expected range
- Freshness: Data updated within time window
- Quality: Values match expected patterns
- Schema: Structure hasn't changed unexpectedly
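Conceptually, each rule type reduces to a query whose result is compared against a threshold. A PostgreSQL-flavored sketch (table and column names are hypothetical):
```sql
-- Volume: yesterday's row count, compared against an expected range.
SELECT COUNT(*) FROM orders
WHERE CAST(created_at AS DATE) = CURRENT_DATE - 1;
-- Freshness: age of the newest record, compared against a time window.
SELECT MAX(updated_at) FROM orders;
-- Quality: count of values violating a pattern (naive email check).
SELECT COUNT(*) FROM customers
WHERE email NOT LIKE '%@%';
```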
Alerts
When rules fail, alerts notify you via:
- Slack
- Webhooks
Service Accounts
Service Accounts provide programmatic access:
- API key authentication
- Limited permissions
- Used by Data Plane workers
- Audit logged
Roles and Permissions
Datalinx AI uses role-based access control (RBAC):
| Role | Capabilities |
|---|---|
| Admin | Full access to organization |
| Developer | Create/edit workspaces and mappings |
| Operator | Run pipelines, view monitoring |
| Viewer | Read-only access |
Next Steps
- Quick Start - Apply these concepts hands-on
- System Requirements - Prepare your environment
- Architecture Overview - Deep dive into system design