Core Concepts

This page explains the fundamental concepts and terminology used throughout Datalinx AI.

Organization

An Organization is the top-level container in Datalinx AI, representing your company or team. Each organization has:

  • Isolated data and configurations
  • Its own users and roles
  • Separate billing and usage tracking

Workspace

A Workspace is an isolated environment within an organization for a specific project or use case. Workspaces contain:

  • Data source configurations
  • Schema definitions and mappings
  • Pipeline definitions
  • Monitoring rules

Best Practice: Use separate workspaces for development, staging, and production to prevent accidental changes to live data.

Data Sources

Data Sources are the origins of your raw data. Datalinx AI supports three types:

Database Sources

Direct connections to databases:

  • PostgreSQL
  • Snowflake
  • Databricks
  • BigQuery

API Sources

REST API endpoints with schema definitions:

  • OpenAPI/Swagger specifications
  • Custom REST endpoints
  • GraphQL (coming soon)

File Sources

Files stored in cloud storage:

  • CSV, JSON, Parquet formats
  • AWS S3, Azure Blob, GCS
  • SFTP servers

Schemas

Source Schema

The Source Schema describes the structure of your raw data:

  • Tables and columns
  • Data types
  • Relationships (if detected)

Source schemas are automatically discovered when you connect a data source.
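
For database sources, discovery typically reads the database's own catalog. As a rough illustration, the lookup against PostgreSQL resembles the following standard information_schema query (the actual discovery queries Datalinx AI runs are internal; this is only a sketch):

-- Sketch: listing tables and columns from a PostgreSQL source
-- using the standard information_schema catalog views
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position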

Target Schema

The Target Schema defines the desired output structure:

  • Standardized table definitions
  • Required and optional fields
  • Data type specifications
  • Canonical field types (email, phone, currency, etc.)

Datalinx AI provides pre-built target schema templates for common use cases.
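
For instance, a customer target schema could be expressed as the following DDL. The table and field names here are hypothetical; canonical field types such as email and currency are enforced by mappings and decorators rather than by the database itself:

-- Hypothetical target schema for standardized customer records
CREATE TABLE target_customers (
    customer_id          BIGINT PRIMARY KEY,    -- required
    email                VARCHAR(255) NOT NULL, -- canonical type: email
    phone                VARCHAR(32),           -- canonical type: phone
    full_name            VARCHAR(255),          -- optional
    lifetime_value_cents BIGINT                 -- canonical type: currency (cents)
)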

Mappings

Mappings define how source data transforms into target data. Each mapping specifies:

  • Source field(s)
  • Target field
  • Transformation logic (if any)

Mapping Types

Type        Description            Example
Direct      1:1 field copy         source.name → target.name
Expression  SQL transformation     CONCAT(first_name, ' ', last_name)
Decorated   Apply standardization  LOWER(email)
Computed    Derived value          SUM(line_items.amount)
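
To see how the four types fit together, here is a sketch of the kind of SELECT a set of mappings could compile into; the table and column names are made up for illustration, and the real compiled form is internal to Datalinx AI:

-- Illustrative SELECT combining all four mapping types
SELECT
    c.name                                 AS name,        -- Direct
    CONCAT(c.first_name, ' ', c.last_name) AS full_name,   -- Expression
    LOWER(c.email)                         AS email,       -- Decorated
    SUM(li.amount)                         AS order_total  -- Computed
FROM customers c
JOIN line_items li ON li.customer_id = c.id
GROUP BY c.name, c.first_name, c.last_name, c.email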

CTEs (Common Table Expressions)

CTEs are intermediate transformations that create derived datasets. Use CTEs when you need to:

  • Join multiple source tables
  • Aggregate data before mapping
  • Apply complex business logic
  • Create reusable sub-queries

-- CTE Example: Calculate customer lifetime value
WITH customer_orders AS (
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_orders

Decorators

Decorators are pre-built transformation functions that standardize data:

Decorator       Purpose                    Example
lowercase       Normalize case             JOHN@EMAIL.COM → john@email.com
phone_format    Standardize phone numbers  (555) 123-4567 → +15551234567
date_parse      Parse date strings         01/15/2024 → 2024-01-15
currency_cents  Convert to cents           $19.99 → 1999
null_if_empty   Convert empty to NULL      "" → NULL
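
Decorators can be chained on a single field. One way to picture the result is as nested SQL function calls (this illustrates the semantics only, not the exact form Datalinx AI generates):

-- Illustration: lowercase followed by null_if_empty on an email field
SELECT
    NULLIF(LOWER(email), '') AS email  -- '' becomes NULL after lowercasing
FROM raw_contacts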

Pipelines

A Pipeline executes mappings to transform source data into target data. Pipelines can be:

  • Ad-hoc: Manually triggered
  • Scheduled: Run on a cron schedule
  • Triggered: Activated by events

Pipeline Modes

Mode          Description               Use Case
Full Refresh  Process all data          Initial load, schema changes
Incremental   Process only new/changed  Regular updates
Backfill      Process historical range  Gap filling
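
Incremental mode, for example, typically relies on a watermark column such as updated_at. A minimal sketch, assuming the source table carries such a column and the timestamp of the last successful run is tracked as a parameter:

-- Sketch of an incremental run: only rows changed since the last run
-- (:last_run_at is an assumed bound parameter, not a Datalinx AI built-in)
SELECT *
FROM orders
WHERE updated_at > :last_run_at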

Control Plane vs Data Plane

Datalinx AI uses a split architecture for security and scalability:

Control Plane

The Control Plane handles:

  • User interface and API
  • Configuration management
  • Authentication and authorization
  • Command orchestration

Data Plane

The Data Plane executes:

  • Data transformations (dbt)
  • Pipeline runs
  • Query execution
  • Monitoring checks

Identity Resolution

Identity Resolution links records from different sources to the same entity:

  • Match on email, phone, or custom identifiers
  • Build an identity graph
  • Create canonical customer records
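
A simplified view of the matching step, assuming two hypothetical source tables that both carry an email identifier (real identity resolution also handles fuzzy matches and transitive links through the identity graph):

-- Sketch: link CRM and storefront records that share a normalized email
SELECT
    crm.contact_id    AS crm_id,
    shop.customer_id  AS shop_id,
    LOWER(crm.email)  AS matched_email
FROM crm_contacts crm
JOIN shop_customers shop
  ON LOWER(crm.email) = LOWER(shop.email)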

Audiences

Audiences are segments of entities that match defined criteria:

-- Example: High-value customers
SELECT customer_id
FROM customers
WHERE lifetime_value > 1000
  AND last_order_date > CURRENT_DATE - INTERVAL '90 days'

Audiences can be:

  • Exported to marketing platforms
  • Used for analysis
  • Monitored for changes

Reverse ETL

Reverse ETL pushes transformed data back to operational systems:

  • CRM updates (Salesforce, HubSpot)
  • Marketing platform syncs (Meta, Google Ads)
  • Custom webhook destinations

Monitoring

Monitoring continuously checks data quality:

Rule Types

  • Volume: Row counts within expected range
  • Freshness: Data updated within time window
  • Quality: Values match expected patterns
  • Schema: Structure hasn't changed unexpectedly
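
Each rule ultimately boils down to a query plus a threshold. As a sketch, a Freshness rule on a hypothetical orders table might evaluate something like the following (the one-hour window is an arbitrary example):

-- Sketch of a freshness check: flag the table stale if no row
-- arrived within the last hour
SELECT
    MAX(created_at) < NOW() - INTERVAL '1 hour' AS is_stale
FROM orders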

Alerts

When rules fail, alerts notify you via:

  • Email
  • Slack
  • Webhooks

Service Accounts

Service Accounts provide programmatic access:

  • API key authentication
  • Limited permissions
  • Used by Data Plane workers
  • Audit logged

Roles and Permissions

Datalinx AI uses role-based access control (RBAC):

Role       Capabilities
Admin      Full access to organization
Developer  Create/edit workspaces and mappings
Operator   Run pipelines, view monitoring
Viewer     Read-only access
