Every application you have ever used sits on top of a data model. Pick the wrong one and you spend the next three years fighting your own database instead of shipping features.
For the past four decades, one model has dominated: the relational model. Flat tables, foreign keys, SQL. It is not glamorous. It is not trendy. But there is a reason almost every bank, airline, hospital, and e-commerce platform still runs on it — and understanding why is the first step to understanding databases at all.
In 1970, Edgar F. Codd published “A Relational Model of Data for Large Shared Data Banks.” The core insight was radical at the time: separate the logical representation of data from its physical storage. Applications should not care whether data lives on disk, in memory, or across ten machines. They should see tables — nothing more.
A relational database organizes data into relations (tables). Each table has:
Columns (attributes) — typed fields like name VARCHAR(100) or price DECIMAL(10,2)
Rows (tuples) — individual records
Primary key — a column (or set of columns) that uniquely identifies each row
Foreign key — a column that references the primary key of another table, establishing a relationship
These four concepts give you everything you need to model surprisingly complex domains.
Four tables. Three foreign keys. That is enough to represent users placing orders containing multiple products — a model used by companies processing billions of dollars in revenue.
SQL (Structured Query Language) is how you talk to relational databases. It is declarative: you describe what data you want, not how to get it. The database engine figures out the execution plan.
JOINs combine rows from multiple tables. There are four types you need to know:
JOIN Type
Returns
INNER JOIN
Only rows with matches in both tables
LEFT JOIN
All rows from left table, NULLs where right has no match
RIGHT JOIN
All rows from right table, NULLs where left has no match
FULL OUTER JOIN
All rows from both tables, NULLs on both sides where no match
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Get all orders with user info and item details
SELECTu.full_name,o.order_id,p.nameASproduct_name,oi.quantity,oi.unit_price,(oi.quantity*oi.unit_price)ASline_totalFROMordersoINNERJOINusersuONo.user_id=u.user_idINNERJOINorder_itemsoiONo.order_id=oi.order_idINNERJOINproductspONoi.product_id=p.product_idWHEREo.status='completed'ORDERBYo.order_id;
Aggregation collapses multiple rows into summary rows:
1
2
3
4
5
6
7
8
9
10
11
12
13
-- Revenue per category, only categories with > $1000 revenue
SELECTp.category,COUNT(DISTINCTo.order_id)ASorder_count,SUM(oi.quantity)AStotal_units,SUM(oi.quantity*oi.unit_price)AStotal_revenueFROMorder_itemsoiJOINproductspONoi.product_id=p.product_idJOINordersoONoi.order_id=o.order_idWHEREo.status='completed'GROUPBYp.categoryHAVINGSUM(oi.quantity*oi.unit_price)>1000.00ORDERBYtotal_revenueDESC;
-- Add a phone number column
ALTERTABLEusersADDCOLUMNphoneVARCHAR(20);-- Add a default value to an existing column
ALTERTABLEproductsALTERCOLUMNstockSETDEFAULT0;-- Rename a column (PostgreSQL)
ALTERTABLEproductsRENAMECOLUMNnameTOproduct_name;-- Add a composite unique constraint
ALTERTABLEorder_itemsADDCONSTRAINTuq_order_productUNIQUE(order_id,product_id);-- Drop a column (be careful in production)
ALTERTABLEusersDROPCOLUMNphone;
In production, ALTER TABLE on large tables can lock the table for minutes or hours. We will cover online DDL strategies in Article 8.
Rule: 1NF + every non-key column depends on the entire primary key, not just part of it.
This applies to tables with composite primary keys. If (order_id, product_id) is the key, then product_name depends only on product_id — it violates 2NF. Move product data to a separate products table.
Normalization eliminates redundancy. But reads often need to JOIN multiple tables — and JOINs have a cost. Sometimes you intentionally denormalize for performance:
1
2
3
4
5
6
7
8
9
10
11
12
-- Instead of joining orders + order_items + products every time,
-- store a pre-calculated total on the order itself
ALTERTABLEordersADDCOLUMNtotal_amountDECIMAL(12,2);-- Update it when items change
UPDATEordersoSETtotal_amount=(SELECTSUM(oi.quantity*oi.unit_price)FROMorder_itemsoiWHEREoi.order_id=o.order_id)WHEREo.order_id=42;
Common denormalization patterns:
Pattern
When to use
Trade-off
Cached aggregates
Dashboard queries that run constantly
Must keep in sync on every write
Materialized views
Complex reporting queries
Stale data between refreshes
Redundant columns
Avoid expensive JOINs in hot paths
Update anomalies return
Summary tables
Time-series rollups (hourly/daily)
Extra storage, ETL complexity
The rule: normalize first, denormalize only when you have measured a performance problem.
Advanced SQL: Subqueries, CTEs, and Window Functions#
-- Users whose total spending exceeds the average user's spending
SELECTu.full_name,user_totals.total_spentFROMusersuJOIN(SELECTo.user_id,SUM(oi.quantity*oi.unit_price)AStotal_spentFROMordersoJOINorder_itemsoiONo.order_id=oi.order_idWHEREo.status='completed'GROUPBYo.user_id)user_totalsONu.user_id=user_totals.user_idWHEREuser_totals.total_spent>(SELECTAVG(sub.total_spent)FROM(SELECTSUM(oi.quantity*oi.unit_price)AStotal_spentFROMordersoJOINorder_itemsoiONo.order_id=oi.order_idWHEREo.status='completed'GROUPBYo.user_id)sub)ORDERBYuser_totals.total_spentDESC;
CTEs can reference themselves — useful for hierarchical data:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Employee org chart: find all reports under a manager
WITHRECURSIVEreportsAS(-- Base case: the manager themselves
SELECTemployee_id,name,manager_id,0ASdepthFROMemployeesWHEREemployee_id=1UNIONALL-- Recursive case: find direct reports of current level
SELECTe.employee_id,e.name,e.manager_id,r.depth+1FROMemployeeseJOINreportsrONe.manager_id=r.employee_id)SELECTdepth,employee_id,nameFROMreportsORDERBYdepth,name;
Window functions compute values across a set of rows related to the current row, without collapsing rows (unlike GROUP BY):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Rank users by spending within each month
SELECTu.full_name,DATE_TRUNC('month',o.created_at)ASmonth,SUM(oi.quantity*oi.unit_price)ASmonthly_spend,RANK()OVER(PARTITIONBYDATE_TRUNC('month',o.created_at)ORDERBYSUM(oi.quantity*oi.unit_price)DESC)ASspend_rank,ROW_NUMBER()OVER(PARTITIONBYDATE_TRUNC('month',o.created_at)ORDERBYSUM(oi.quantity*oi.unit_price)DESC)ASrow_numFROMordersoJOINusersuONo.user_id=u.user_idJOINorder_itemsoiONo.order_id=oi.order_idWHEREo.status='completed'GROUPBYu.full_name,DATE_TRUNC('month',o.created_at)ORDERBYmonth,spend_rank;
Output:
1
2
3
4
5
6
7
8
full_name | month | monthly_spend | spend_rank | row_num
---------------+------------+---------------+------------+---------
David Park | 2023-10-01 | 1245.00 | 1 | 1
Alice Chen | 2023-10-01 | 892.50 | 2 | 2
Carol White | 2023-10-01 | 892.50 | 2 | 3
Bob Martinez | 2023-10-01 | 334.99 | 4 | 4
Alice Chen | 2023-11-01 | 1567.20 | 1 | 1
David Park | 2023-11-01 | 998.00 | 2 | 2
Notice that RANK gives the same rank to tied values (both Alice and Carol get rank 2), while ROW_NUMBER always assigns unique numbers.
LAG and LEAD let you access previous or next rows:
-- Compare each month's revenue to the previous month
WITHmonthly_revenueAS(SELECTDATE_TRUNC('month',o.created_at)ASmonth,SUM(oi.quantity*oi.unit_price)ASrevenueFROMordersoJOINorder_itemsoiONo.order_id=oi.order_idWHEREo.status='completed'GROUPBYDATE_TRUNC('month',o.created_at))SELECTmonth,revenue,LAG(revenue)OVER(ORDERBYmonth)ASprev_month_revenue,ROUND((revenue-LAG(revenue)OVER(ORDERBYmonth))/LAG(revenue)OVER(ORDERBYmonth)*100,1)ASgrowth_pctFROMmonthly_revenueORDERBYmonth;
Modern databases blur the line between relational and document models. PostgreSQL’s jsonb and MySQL’s JSON type let you store semi-structured data alongside traditional columns — best of both worlds when used carefully.
-- GIN index for containment queries (@>, ?, ?|, ?&)
CREATEINDEXidx_events_payloadONeventsUSINGGIN(payload);-- Expression index on a specific JSON path
CREATEINDEXidx_events_error_codeONevents((payload->>'code'))WHEREevent_type='error';
GIN indexes make @> containment checks fast — O(log n) instead of scanning every row.
-- MySQL uses -> for extraction (returns JSON), ->> for text
SELECTpayload->>'$.url'ASurlFROMeventsWHEREevent_type='page_view';-- Generated column + index (MySQL's approach to JSON indexing)
ALTERTABLEeventsADDCOLUMNerror_codeINTGENERATEDALWAYSAS(payload->>'$.code')VIRTUAL;CREATEINDEXidx_error_codeONevents(error_code);
-- BAD: Entire row is JSON (you just built a document DB with extra steps)
CREATETABLEbad_design(idSERIALPRIMARYKEY,dataJSONB-- everything in here
);-- BAD: Querying JSON in WHERE without an index
SELECT*FROMeventsWHEREpayload->>'user_agent'LIKE'%Chrome%';-- full table scan
-- GOOD: Promote frequently-queried fields to real columns
ALTERTABLEeventsADDCOLUMNerror_codeINT;UPDATEeventsSETerror_code=(payload->>'code')::intWHEREevent_type='error';CREATEINDEXidx_events_error_codeONevents(error_code);
Rule of thumb: if you query a JSON field in WHERE or JOIN more than occasionally, it should be a real column.
Many real-world data structures are trees: org charts, category hierarchies, threaded comments, bill-of-materials. Recursive CTEs query these without application-level loops.
WITHRECURSIVEcte_nameAS(-- Base case: starting rows
SELECT...FROMtableWHEREconditionUNIONALL-- Recursive step: join back to the CTE
SELECT...FROMtableJOINcte_nameON...)SELECT*FROMcte_name;
CREATETABLEcategories(category_idINTPRIMARYKEY,nameVARCHAR(100)NOTNULL,parent_idINTREFERENCEScategories(category_id));INSERTINTOcategoriesVALUES(1,'All Products',NULL),(2,'Electronics',1),(3,'Computers',2),(4,'Laptops',3),(5,'Desktops',3),(6,'Phones',2),(7,'Furniture',1),(8,'Desks',7);-- Find all descendants of "Electronics" with their depth
WITHRECURSIVEsubtreeAS(-- Base: start at Electronics
SELECTcategory_id,name,parent_id,0ASdepth,name::textASpathFROMcategoriesWHEREname='Electronics'UNIONALL-- Recursive: find children
SELECTc.category_id,c.name,c.parent_id,s.depth+1,s.path||' > '||c.nameFROMcategoriescJOINsubtreesONc.parent_id=s.category_id)SELECTdepth,pathFROMsubtreeORDERBYpath;
LATERAL lets a subquery in the FROM clause reference columns from preceding tables — like a for-each loop in SQL. This solves problems that are awkward or impossible with regular JOINs.
The classic problem: find the 3 most recent orders per user.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Without LATERAL: window function approach (loads all orders into memory)
WITHrankedAS(SELECT*,ROW_NUMBER()OVER(PARTITIONBYuser_idORDERBYcreated_atDESC)ASrnFROMorders)SELECT*FROMrankedWHERErn<=3;-- With LATERAL: stops after 3 per user (more efficient for large tables)
SELECTu.user_id,u.full_name,recent.*FROMusersuCROSSJOINLATERAL(SELECTorder_id,total_amount,created_atFROMordersoWHEREo.user_id=u.user_idORDERBYo.created_atDESCLIMIT3)recent;
The LATERAL version can use an index on (user_id, created_at DESC) and stops scanning after 3 rows per user.
-- Unnest a JSONB array and join with referenced table
SELECTe.event_id,e.event_type,item->>'sku'ASsku,item->>'qty'ASqtyFROMeventseCROSSJOINLATERALjsonb_array_elements(e.payload->'items')ASitemWHEREe.event_type='purchase';
-- For each product, compute running stats from the last 30 days of orders
SELECTp.name,stats.*FROMproductspCROSSJOINLATERAL(SELECTCOUNT(*)ASorders_30d,COALESCE(SUM(oi.quantity),0)ASunits_30d,COALESCE(AVG(oi.unit_price),0)ASavg_price_30dFROMorder_itemsoiJOINordersoONoi.order_id=o.order_idWHEREoi.product_id=p.product_idANDo.created_at>NOW()-INTERVAL'30 days')statsORDERBYstats.units_30dDESC;
Here is a real-world query that combines CTEs, JOINs, and window functions. Given our e-commerce schema, find the top product in each category by total revenue, along with what percentage of category revenue it represents:
The relational model won because it offers something no other model did at the time: data independence. You can change the physical storage, add indexes, partition tables, and replicate data — all without changing a single line of application code. The SQL interface stays the same.
It is not perfect. Some data (social graphs, time series, documents with deeply nested structures) fits awkwardly into tables. We will explore those alternatives in Article 5. But for most applications — especially those where data integrity matters — the relational model remains the default choice for good reason.
Knowing SQL is necessary but not sufficient. Writing a correct query and writing a fast query are different skills. In the next article, we will look at indexing and query planning — how databases actually find your data, and how to make them find it faster.