Immich × Databricks

Ingest your Immich photo library via the REST API into a Bronze / Silver / Gold medallion lakehouse — with incremental updates, SCD deletion tracking, scheduled jobs, dashboards, and an interactive map.

Values like this ✏ need to match your setup. Hover for guidance.

00 — Bootstrap secrets (run once, then delete)

Creates the Databricks secret scope and stores your Immich credentials. Uses the Databricks REST API directly — no CLI needed. Fill in the four highlighted values, run once, then delete the cell immediately.

Delete this cell after running. Your credentials are now in Databricks Secrets.
Pythoncell 0 — create secrets (run once, then delete)
import requests

# ── Fill these in, run once, then DELETE this cell ────────────────────
DATABRICKS_PAT  = "dapi..."
WORKSPACE_URL   = "https://adb-xxxxxxxx.azuredatabricks.net"
IMMICH_BASE_URL = "https://your-immich-url.com"
IMMICH_API_KEY  = "your_immich_api_key_here"
# ─────────────────────────────────────────────────────────────────────

headers = {"Authorization": f"Bearer {DATABRICKS_PAT}", "Content-Type": "application/json"}
base = f"{WORKSPACE_URL}/api/2.0/secrets"

def create_scope(name):
    r = requests.post(f"{base}/scopes/create", headers=headers,
                      json={"scope": name, "initial_manage_principal": "users"})
    if r.status_code == 200: print(f"Scope '{name}' created.")
    elif "RESOURCE_ALREADY_EXISTS" in r.text: print(f"Scope '{name}' already exists.")
    else: raise Exception(r.text)

def put_secret(scope, key, value):
    r = requests.post(f"{base}/put", headers=headers,
                      json={"scope": scope, "key": key, "string_value": value})
    if r.status_code == 200: print(f"  secret '{key}' stored.")
    else: raise Exception(r.text)

create_scope("immich")
put_secret("immich", "api_key",  IMMICH_API_KEY)
put_secret("immich", "base_url", IMMICH_BASE_URL)
print("\nDone. Now DELETE this cell.")
Expected output
Scope 'immich' created. secret 'api_key' stored. secret 'base_url' stored. Done. Now DELETE this cell.

01 — Workspace setup & secrets

Run at the start of every session. Sets CATALOG, creates schemas, imports all required libraries, and verifies Immich connectivity.

Pythoncell 1 — workspace setup
import requests, json
from datetime import datetime, timezone, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, from_json, to_timestamp, year, month,
    dayofmonth, lower, current_timestamp, lit, when, count
)
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType,
    BooleanType, DoubleType, IntegerType, TimestampType
)

spark = SparkSession.builder.getOrCreate()

CATALOG = spark.sql("SELECT current_catalog()").collect()[0][0]
print(f"Catalog : {CATALOG}")

for schema in ["immich_bronze", "immich_silver", "immich_gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{schema}")
    print(f"  schema ready: {CATALOG}.{schema}")

API_KEY  = dbutils.secrets.get(scope="immich", key="api_key")
BASE_URL = dbutils.secrets.get(scope="immich", key="base_url").rstrip("/")
HEADERS  = {"x-api-key": API_KEY, "Accept": "application/json"}

resp = requests.get(f"{BASE_URL}/api/server/statistics", headers=HEADERS)
resp.raise_for_status()
stats = resp.json()
print(f"\nImmich connection OK")
print(f"  Photos : {stats.get('photos','?')}")
print(f"  Videos : {stats.get('videos','?')}")
Expected output
Catalog : workspace schema ready: workspace.immich_bronze schema ready: workspace.immich_silver schema ready: workspace.immich_gold Immich connection OK Photos : 12453 Videos : 381

02 — Reset Drop and recreate all layers

Only run when you want a completely clean slate. Drops every table including deletion history. Skip on normal runs.

Destructive. Permanently deletes all data in bronze, silver, and gold.
Pythoncell 2 — reset all layers
for schema in ["immich_bronze", "immich_silver", "immich_gold"]:
    full = f"{CATALOG}.{schema}"
    tables = spark.sql(f"SHOW TABLES IN {full}").collect()
    if tables:
        for row in tables:
            spark.sql(f"DROP TABLE IF EXISTS {full}.{row['tableName']}")
            print(f"  dropped: {full}.{row['tableName']}")
    else:
        print(f"  {full}: already empty")
print("\nAll layers cleared.")
First run
workspace.immich_bronze: already empty workspace.immich_silver: already empty workspace.immich_gold: already empty All layers cleared.
Subsequent runs
dropped: workspace.immich_bronze.assets_raw dropped: workspace.immich_bronze.albums_raw dropped: workspace.immich_silver.assets dropped: workspace.immich_silver.albums dropped: workspace.immich_gold.photos_by_month dropped: workspace.immich_gold.camera_summary dropped: workspace.immich_gold.geography dropped: workspace.immich_gold.album_stats dropped: workspace.immich_gold.ratings_summary All layers cleared.

03 — Explore the raw API response

Fetch one asset to inspect the exact structure your Immich version returns. withExif: True is required — without it the search endpoint omits exifInfo.

Pythoncell 3 — inspect a single asset
resp = requests.post(f"{BASE_URL}/api/search/metadata", headers=HEADERS,
                     json={"page": 1, "size": 1, "withExif": True})
resp.raise_for_status()
data = resp.json()

assets_block = data.get("assets", {})
envelope = {k: v for k, v in assets_block.items() if k != "items"}
print("=== Pagination envelope ===")
print(json.dumps(envelope, indent=2))
print("\n=== Full structure of one asset ===")
print(json.dumps(assets_block["items"][0], indent=2))
Expected output (abbreviated)
=== Pagination envelope === { "total": 1, "count": 1, "nextPage": "2", "facets": [] } === Full structure of one asset === { "id": "5a92104a-b6db-...", "originalFileName": "20260524-DSC06146.jpg", "type": "IMAGE", "fileCreatedAt": "2026-05-24T07:29:15.627Z", "localDateTime": "2026-05-24T17:29:15.627Z", "isFavorite": false, "visibility": "timeline", "exifInfo": { "make": "SONY", "model": "ILCE-7CM2", "lensModel": "FE 28-60mm F4-5.6", "fNumber": 4.0, "iso": 4000, "latitude": -31.959281, "longitude": 115.89533, "city": "Burswood", "country": "Australia", "fileSizeInByte": 6637963, "exifImageWidth": 7008, "exifImageHeight": 4672, "rating": 1, "timeZone": "UTC+10" } }

04 — Analyse field coverage

Sample 100 assets to see which fields are populated in your library. Useful before building gold queries that depend on sparse fields like city or rating.

Pythoncell 4 — field coverage
resp = requests.post(f"{BASE_URL}/api/search/metadata", headers=HEADERS,
                     json={"page": 1, "size": 100, "withExif": True})
sample = resp.json().get("assets", {}).get("items", [])
n = len(sample)
top_counts, exif_counts = {}, {}

for asset in sample:
    for k, v in asset.items():
        if v is not None and v != "" and v != [] and v != "0:00:00.00000":
            top_counts[k] = top_counts.get(k, 0) + 1
    for k, v in (asset.get("exifInfo") or {}).items():
        if v is not None and v != "":
            exif_counts[k] = exif_counts.get(k, 0) + 1

print(f"Sample: {n} assets\n")
print("=== Top-level fields ===")
for k, v in sorted(top_counts.items(), key=lambda x: -x[1]):
    bar = "█" * int(v / n * 30)
    print(f"  {k:<28} {bar:<30} {v}/{n}")
print("\n=== exifInfo fields ===")
for k, v in sorted(exif_counts.items(), key=lambda x: -x[1]):
    bar = "█" * int(v / n * 30)
    print(f"  {k:<28} {bar:<30} {v}/{n}")
Expected output (your values will differ)
Sample: 100 assets === Top-level fields === id ██████████████████████████████ 100/100 originalFileName ██████████████████████████████ 100/100 type ██████████████████████████████ 100/100 exifInfo █████████████████████████████ 98/100 === exifInfo fields === fileSizeInByte █████████████████████████████ 98/100 make ████████████████████████ 82/100 model ████████████████████████ 82/100 city ████████████████ 54/100 rating ████████████████ 52/100

05 — Bronze Full refresh ingestion

Fetches every asset from Immich and merges into bronze using Delta's MERGE. New assets are inserted, changed assets updated in place (Type 1), and assets no longer in Immich are flagged as deleted with a timestamp (Type 2 SCD). Run this weekly.

What bronze stores

  • Raw JSON per asset
  • Ingestion timestamp
  • is_deleted flag
  • deleted_at timestamp

SCD approach

  • Deletions: Type 2 (flag)
  • Metadata changes: Type 1
  • Re-imports: new UUID = new row

Tables created

  • immich_bronze.assets_raw
  • immich_bronze.albums_raw
On the very first run the table doesn't exist yet — the cell detects this and falls back to a plain write.
Pythoncell 5a — fetch all assets from Immich
def fetch_all_assets(base_url, headers, page_size=500):
    """
    Fetch all assets with EXIF inline.
    withExif=True is required — without it exifInfo is omitted.
    Stops when a page returns fewer items than page_size.
    """
    all_assets, page = [], 1
    while True:
        resp = requests.post(f"{base_url}/api/search/metadata", headers=headers,
                             json={"page": page, "size": page_size, "withExif": True})
        resp.raise_for_status()
        items = resp.json().get("assets", {}).get("items", [])
        all_assets.extend(items)
        print(f"  Page {page}: +{len(items)}  (total: {len(all_assets)})")
        if len(items) < page_size:
            break
        page += 1
    print(f"\nFetch complete. Total: {len(all_assets)}")
    return all_assets

print("Fetching all assets from Immich...")
raw_assets = fetch_all_assets(BASE_URL, HEADERS)
Expected output
Fetching all assets from Immich... Page 1: +500 (total: 500) Page 2: +500 (total: 1000) ... Page 25: +453 (total: 12453) Fetch complete. Total: 12453
Pythoncell 5b — merge assets into bronze with SCD deletion tracking
raw_schema    = StructType([StructField("raw_json", StringType(), True)])
id_schema     = StructType([StructField("id", StringType())])
BRONZE_ASSETS = f"{CATALOG}.immich_bronze.assets_raw"

incoming_ids = {a["id"] for a in raw_assets}

df_incoming = (
    spark.createDataFrame([(json.dumps(a),) for a in raw_assets], raw_schema)
    .withColumn("_p",          from_json(col("raw_json"), id_schema))
    .withColumn("asset_id",    col("_p.id"))
    .withColumn("ingested_at", current_timestamp())
    .withColumn("source",      lit("immich_api"))
    .withColumn("is_deleted",  lit(False))
    .withColumn("deleted_at",  lit(None).cast(TimestampType()))
    .drop("_p")
)

table_exists = spark.catalog.tableExists(BRONZE_ASSETS)

if not table_exists:
    print("First run — creating bronze table...")
    (df_incoming.write.format("delta").mode("overwrite").saveAsTable(BRONZE_ASSETS))
    print(f"  Wrote {df_incoming.count()} rows")
else:
    df_incoming.createOrReplaceTempView("incoming_assets")

    # Upsert: insert new, update changed
    spark.sql(f"""
        MERGE INTO {BRONZE_ASSETS} AS target
        USING incoming_assets AS source
        ON target.asset_id = source.asset_id
        WHEN MATCHED THEN UPDATE SET
            target.raw_json    = source.raw_json,
            target.ingested_at = source.ingested_at,
            target.is_deleted  = false,
            target.deleted_at  = NULL
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Flag assets no longer in Immich as deleted
    ids_list = ", ".join([f"'{i}'" for i in incoming_ids])
    spark.sql(f"""
        UPDATE {BRONZE_ASSETS}
        SET is_deleted = true, deleted_at = current_timestamp()
        WHERE is_deleted = false
          AND asset_id NOT IN ({ids_list})
    """)

    df_b  = spark.table(BRONZE_ASSETS)
    total = df_b.count()
    active  = df_b.where(col("is_deleted") == False).count()
    deleted = total - active
    print(f"  Total rows : {total}")
    print(f"  Active     : {active}")
    print(f"  Deleted    : {deleted}")
First run
First run — creating bronze table... Wrote 12453 rows
Subsequent runs
Total rows : 12461 Active : 12453 Deleted : 8
Pythoncell 5c — fetch albums & write to bronze
BRONZE_ALBUMS = f"{CATALOG}.immich_bronze.albums_raw"
print("Fetching albums...")
resp = requests.get(f"{BASE_URL}/api/albums", headers=HEADERS)
resp.raise_for_status()
raw_albums = resp.json()
print(f"  Found {len(raw_albums)} albums")

df_albums = (
    spark.createDataFrame([(json.dumps(a),) for a in raw_albums], raw_schema)
    .withColumn("ingested_at", current_timestamp())
    .withColumn("source",      lit("immich_api"))
)
(df_albums.write.format("delta").mode("overwrite").saveAsTable(BRONZE_ALBUMS))
print(f"  Wrote {df_albums.count()} rows to bronze.albums_raw")
Expected output
Fetching albums... Found 47 albums Wrote 47 rows to bronze.albums_raw

05b — Bronze Incremental refresh

Only fetches assets modified since the last ingestion. Much faster than a full refresh — use this daily or on-demand after adding new photos. Does not catch deletions; run the full refresh weekly to handle those.

Workflow: incremental daily (or after big imports) → full refresh weekly → always rebuild silver + gold after either.
Pythoncell 5d — incremental asset fetch & merge
BRONZE_ASSETS = f"{CATALOG}.immich_bronze.assets_raw"

last_run = spark.sql(f"""
    SELECT MAX(ingested_at) AS ts FROM {BRONZE_ASSETS} WHERE is_deleted = false
""").collect()[0]["ts"]

if last_run is None:
    print("No bronze data found — run the full refresh (cell 5) first.")
else:
    since = (last_run - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    print(f"Fetching assets updated since: {since}")

    updated, page, page_size = [], 1, 500
    while True:
        resp = requests.post(f"{BASE_URL}/api/search/metadata", headers=HEADERS,
                             json={"page": page, "size": page_size,
                                   "withExif": True, "updatedAfter": since})
        resp.raise_for_status()
        items = resp.json().get("assets", {}).get("items", [])
        updated.extend(items)
        print(f"  Page {page}: +{len(items)}  (total: {len(updated)})")
        if len(items) < page_size:
            break
        page += 1

    print(f"\n{len(updated)} new/updated assets found")

    if updated:
        df_new = (
            spark.createDataFrame([(json.dumps(a),) for a in updated], raw_schema)
            .withColumn("_p",          from_json(col("raw_json"), StructType([StructField("id", StringType())])))
            .withColumn("asset_id",    col("_p.id"))
            .withColumn("ingested_at", current_timestamp())
            .withColumn("source",      lit("immich_api"))
            .withColumn("is_deleted",  lit(False))
            .withColumn("deleted_at",  lit(None).cast(TimestampType()))
            .drop("_p")
        )
        df_new.createOrReplaceTempView("incoming_assets")
        spark.sql(f"""
            MERGE INTO {BRONZE_ASSETS} AS target
            USING incoming_assets AS source
            ON target.asset_id = source.asset_id
            WHEN MATCHED THEN UPDATE SET
                target.raw_json    = source.raw_json,
                target.ingested_at = source.ingested_at,
                target.is_deleted  = false,
                target.deleted_at  = NULL
            WHEN NOT MATCHED THEN INSERT *
        """)
        print("  Merged into bronze.")
    else:
        print("  Nothing to update.")
Changes found
Fetching assets updated since: 2026-05-31T09:00:00.000Z Page 1: +23 (total: 23) 23 new/updated assets found Merged into bronze.
Nothing new
Fetching assets updated since: 2026-05-31T09:00:00.000Z Page 1: +0 (total: 0) 0 new/updated assets found Nothing to update.

06 — Inspect the bronze layer

Verify what is in bronze including SCD columns. Deleted assets remain in the table with is_deleted = true — this is intentional.

Pythoncell 6 — inspect bronze
df = spark.table(f"{CATALOG}.immich_bronze.assets_raw")
print("=== bronze.assets_raw schema ===")
df.printSchema()

total   = df.count()
active  = df.where(col("is_deleted") == False).count()
deleted = df.where(col("is_deleted") == True).count()
print(f"\n  Total rows : {total}")
print(f"  Active     : {active}")
print(f"  Deleted    : {deleted}")

if deleted > 0:
    print("\n=== Recently deleted assets ===")
    (df.where(col("is_deleted") == True)
       .select("asset_id", "deleted_at", "ingested_at")
       .orderBy(col("deleted_at").desc())
       .limit(5).show(truncate=False))

sample = df.where(col("is_deleted") == False).limit(1).collect()[0]["raw_json"]
print("\n=== Sample raw_json (first 400 chars) ===")
print(sample[:400] + "...")
Expected output
=== bronze.assets_raw schema === root |-- raw_json: string (nullable = true) |-- asset_id: string (nullable = true) |-- ingested_at: timestamp (nullable = false) |-- source: string (nullable = false) |-- is_deleted: boolean (nullable = false) |-- deleted_at: timestamp (nullable = true) Total rows : 12461 Active : 12453 Deleted : 8

07 — Silver Parse & clean

Reads from bronze (active rows only), parses raw JSON into typed columns, flattens exifInfo, and adds derived date fields. Run after every bronze refresh.

Pythoncell 7a — parse assets to silver
exif_schema = StructType([
    StructField("make",             StringType()),
    StructField("model",            StringType()),
    StructField("lensModel",        StringType()),
    StructField("fNumber",          DoubleType()),
    StructField("focalLength",      DoubleType()),
    StructField("iso",              IntegerType()),
    StructField("exposureTime",     StringType()),
    StructField("latitude",         DoubleType()),
    StructField("longitude",        DoubleType()),
    StructField("city",             StringType()),
    StructField("state",            StringType()),
    StructField("country",          StringType()),
    StructField("description",      StringType()),
    StructField("timeZone",         StringType()),
    StructField("dateTimeOriginal", StringType()),
    StructField("rating",           IntegerType()),
    StructField("exifImageWidth",   IntegerType()),
    StructField("exifImageHeight",  IntegerType()),
    StructField("fileSizeInByte",   LongType()),
])
asset_schema = StructType([
    StructField("id",               StringType()),
    StructField("originalFileName", StringType()),
    StructField("type",             StringType()),
    StructField("fileCreatedAt",    StringType()),
    StructField("localDateTime",    StringType()),
    StructField("isFavorite",       BooleanType()),
    StructField("isArchived",       BooleanType()),
    StructField("isTrashed",        BooleanType()),
    StructField("visibility",       StringType()),
    StructField("duration",         StringType()),
    StructField("exifInfo",         exif_schema),
])

print("Parsing bronze assets to silver...")
df_silver = (
    spark.table(f"{CATALOG}.immich_bronze.assets_raw")
    .where(col("is_deleted") == False)        # exclude SCD-deleted assets
    .withColumn("data", from_json(col("raw_json"), asset_schema))
    .select(
        col("data.id").alias("asset_id"),
        col("data.originalFileName").alias("file_name"),
        lower(col("data.type")).alias("media_type"),
        to_timestamp(col("data.fileCreatedAt")).alias("created_at"),
        to_timestamp(col("data.localDateTime")).alias("local_datetime"),
        year(to_timestamp(col("data.localDateTime"))).alias("year"),
        month(to_timestamp(col("data.localDateTime"))).alias("month"),
        dayofmonth(to_timestamp(col("data.localDateTime"))).alias("day"),
        col("data.isFavorite").alias("is_favourite"),
        col("data.isArchived").alias("is_archived"),
        col("data.isTrashed").alias("is_trashed"),
        col("data.visibility").alias("visibility"),
        col("data.duration").alias("video_duration"),
        col("data.exifInfo.make").alias("camera_make"),
        col("data.exifInfo.model").alias("camera_model"),
        col("data.exifInfo.lensModel").alias("lens_model"),
        col("data.exifInfo.fNumber").alias("f_number"),
        col("data.exifInfo.iso").alias("iso"),
        col("data.exifInfo.latitude").alias("latitude"),
        col("data.exifInfo.longitude").alias("longitude"),
        col("data.exifInfo.city").alias("city"),
        col("data.exifInfo.country").alias("country"),
        col("data.exifInfo.fileSizeInByte").alias("file_size_bytes"),
        col("data.exifInfo.exifImageWidth").alias("width_px"),
        col("data.exifInfo.exifImageHeight").alias("height_px"),
        col("data.exifInfo.timeZone").alias("timezone"),
        col("data.exifInfo.rating").alias("rating"),
        col("ingested_at")
    )
    .dropDuplicates(["asset_id"])
    .where(col("asset_id").isNotNull())
)
(df_silver.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.immich_silver.assets"))

total    = df_silver.count()
images   = df_silver.where(col("media_type") == "image").count()
videos   = df_silver.where(col("media_type") == "video").count()
with_gps = df_silver.where(col("latitude").isNotNull()).count()
rated    = df_silver.where(col("rating").isNotNull()).count()
print(f"  Total  : {total}")
print(f"  Images : {images}")
print(f"  Videos : {videos}")
print(f"  GPS    : {with_gps} ({round(with_gps/total*100)}%)")
print(f"  Rated  : {rated} ({round(rated/total*100)}%)")
print(f"\nWritten to: {CATALOG}.immich_silver.assets")
Expected output
Parsing bronze assets to silver... Total : 12453 Images : 12072 Videos : 381 GPS : 8941 (72%) Rated : 6208 (50%) Written to: workspace.immich_silver.assets
Pythoncell 7b — parse albums to silver
album_schema = StructType([
    StructField("id",            StringType()),
    StructField("albumName",     StringType()),
    StructField("description",   StringType()),
    StructField("assetCount",    IntegerType()),
    StructField("startDate",     StringType()),
    StructField("endDate",       StringType()),
    StructField("shared",        BooleanType()),
    StructField("hasSharedLink", BooleanType()),
])
df_albums = (
    spark.table(f"{CATALOG}.immich_bronze.albums_raw")
    .withColumn("data", from_json(col("raw_json"), album_schema))
    .select(
        col("data.id").alias("album_id"),
        col("data.albumName").alias("album_name"),
        col("data.description").alias("description"),
        col("data.assetCount").alias("asset_count"),
        to_timestamp(col("data.startDate")).alias("start_date"),
        to_timestamp(col("data.endDate")).alias("end_date"),
        col("data.shared").alias("is_shared"),
        col("data.hasSharedLink").alias("has_shared_link"),
        col("ingested_at")
    )
    .dropDuplicates(["album_id"])
)
(df_albums.write.format("delta").mode("overwrite").saveAsTable(f"{CATALOG}.immich_silver.albums"))
print(f"  Albums : {df_albums.count()}")
print(f"  Shared : {df_albums.where(col('is_shared') == True).count()}")
Expected output
Albums : 47 Shared : 12

08 — Inspect the silver layer

Pythoncell 8 — inspect silver
df = spark.table(f"{CATALOG}.immich_silver.assets")
print("=== silver.assets schema ===")
df.printSchema()

print("=== 5 sample rows ===")
(df.select("asset_id", "file_name", "media_type", "year", "month",
           "camera_make", "camera_model", "city", "country",
           "file_size_bytes", "rating")
   .limit(5).show(truncate=35))

print("=== Null counts in key columns ===")
cols_to_check = ["camera_make","camera_model","city","country",
                 "latitude","longitude","year","file_size_bytes","rating","timezone"]
null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in cols_to_check
]).collect()[0].asDict()
total = df.count()
print(f"  (out of {total} rows)")
for c, n in null_counts.items():
    pct = round(n / total * 100)
    print(f"  {c:<20} {n:>6} nulls  ({pct}%)")
Expected output
=== silver.assets schema === root |-- asset_id: string |-- file_name: string |-- media_type: string |-- created_at: timestamp |-- local_datetime: timestamp |-- year: integer |-- month: integer |-- day: integer |-- is_favourite: boolean |-- camera_make: string |-- camera_model: string |-- city: string |-- country: string |-- latitude: double |-- longitude: double |-- file_size_bytes: long |-- timezone: string |-- rating: integer |-- ingested_at: timestamp === Null counts in key columns === (out of 12453 rows) camera_make 1823 nulls (15%) city 3512 nulls (28%) country 3512 nulls (28%) latitude 3512 nulls (28%) year 0 nulls (0%) file_size_bytes 154 nulls (1%) rating 6245 nulls (50%) timezone 420 nulls (3%)

09 — Gold Analytics tables

Purpose-built aggregations ready for dashboards. Note: lower rating = better (1 = first place, 5 = last) — queries sort accordingly.

SQL cells: Databricks expands ${CATALOG} automatically. For Python cells replace with {CATALOG} in an f-string.
SQLcell 9a — gold: photos by year & month
CREATE OR REPLACE TABLE ${CATALOG}.immich_gold.photos_by_month AS
SELECT
    year, month, media_type,
    COUNT(*)                                    AS asset_count,
    COUNT(*) FILTER (WHERE is_favourite = true) AS favourite_count,
    ROUND(SUM(file_size_bytes) / 1e9, 2)        AS total_size_gb,
    COUNT(DISTINCT camera_make)                 AS distinct_cameras
FROM ${CATALOG}.immich_silver.assets
WHERE is_trashed = false
GROUP BY year, month, media_type
ORDER BY year, month;
SQLcell 9b — gold: geography
CREATE OR REPLACE TABLE ${CATALOG}.immich_gold.geography AS
SELECT
    COALESCE(country, 'No GPS data') AS country,
    city,
    COUNT(*)                         AS photo_count,
    COUNT(DISTINCT camera_model)     AS cameras_used,
    MIN(local_datetime)              AS earliest_photo,
    MAX(local_datetime)              AS latest_photo,
    AVG(latitude)                    AS avg_lat,
    AVG(longitude)                   AS avg_lon
FROM ${CATALOG}.immich_silver.assets
WHERE is_trashed = false
GROUP BY country, city
ORDER BY photo_count DESC;
SQLcell 9c — gold: camera gear
CREATE OR REPLACE TABLE ${CATALOG}.immich_gold.camera_summary AS
SELECT
    COALESCE(camera_make,  'Unknown') AS make,
    COALESCE(camera_model, 'Unknown') AS model,
    COUNT(*)                          AS photo_count,
    ROUND(AVG(f_number), 1)           AS avg_f_number,
    ROUND(AVG(iso), 0)                AS avg_iso,
    MIN(year)                         AS first_used_year,
    MAX(year)                         AS last_used_year
FROM ${CATALOG}.immich_silver.assets
WHERE media_type = 'image' AND is_trashed = false
GROUP BY camera_make, camera_model
ORDER BY photo_count DESC;
SQLcell 9d — gold: ratings summary (1=best)
CREATE OR REPLACE TABLE ${CATALOG}.immich_gold.ratings_summary AS
SELECT
    rating,
    COUNT(*)                     AS photo_count,
    COUNT(DISTINCT country)      AS countries,
    COUNT(DISTINCT camera_model) AS cameras,
    MIN(year)                    AS earliest_year,
    MAX(year)                    AS latest_year
FROM ${CATALOG}.immich_silver.assets
WHERE is_trashed = false AND rating IS NOT NULL
GROUP BY rating
ORDER BY rating ASC;  -- 1 = best / first place
SQLcell 9e — gold: album stats
CREATE OR REPLACE TABLE ${CATALOG}.immich_gold.album_stats AS
SELECT
    album_name, asset_count, is_shared, has_shared_link,
    start_date, end_date,
    DATEDIFF(end_date, start_date) AS date_span_days
FROM ${CATALOG}.immich_silver.albums
ORDER BY asset_count DESC;

10 — Explore your data

SQLcell 10a — photos per year
SELECT year, SUM(asset_count) AS total
FROM ${CATALOG}.immich_gold.photos_by_month
WHERE media_type = 'image'
GROUP BY year ORDER BY year;
SQLcell 10b — top 20 countries
SELECT country, SUM(photo_count) AS photos
FROM ${CATALOG}.immich_gold.geography
WHERE country != 'No GPS data'
GROUP BY country ORDER BY photos DESC LIMIT 20;
SQLcell 10c — top 30 cities
SELECT city, country, SUM(photo_count) AS photos
FROM ${CATALOG}.immich_gold.geography
WHERE city IS NOT NULL AND country != 'No GPS data'
GROUP BY city, country ORDER BY photos DESC LIMIT 30;
SQLcell 10d — best rated photos by country (1=best)
SELECT country, city,
    COUNT(*) FILTER (WHERE rating = 1) AS first_place,
    COUNT(*) FILTER (WHERE rating = 2) AS second_place,
    COUNT(*) FILTER (WHERE rating = 3) AS third_place,
    COUNT(*) FILTER (WHERE rating IS NOT NULL) AS total_rated
FROM ${CATALOG}.immich_silver.assets
WHERE is_trashed = false AND country IS NOT NULL
GROUP BY country, city ORDER BY first_place DESC LIMIT 20;
Pythoncell 10e — chart: photos per year
import matplotlib.pyplot as plt
df = spark.sql(f"""
    SELECT year, SUM(asset_count) AS photos
    FROM {CATALOG}.immich_gold.photos_by_month
    WHERE media_type = 'image' AND year IS NOT NULL
    GROUP BY year ORDER BY year
""").toPandas()
fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(df["year"], df["photos"], color="#1D9E75", width=0.7)
ax.set_xlabel("Year"); ax.set_ylabel("Photos"); ax.set_title("Photos per year")
for _, row in df.iterrows():
    ax.text(row["year"], row["photos"] + 20, str(row["photos"]), ha="center", fontsize=9)
plt.tight_layout(); display(fig)
Pythoncell 10f — chart: top 15 countries
df = spark.sql(f"""
    SELECT country, SUM(photo_count) AS photos
    FROM {CATALOG}.immich_gold.geography
    WHERE country != 'No GPS data'
    GROUP BY country ORDER BY photos DESC LIMIT 15
""").toPandas()
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(df["country"][::-1], df["photos"][::-1], color="#378ADD")
ax.set_xlabel("Photos"); ax.set_title("Top 15 countries")
plt.tight_layout(); display(fig)

11 — Scheduled Automated jobs

Databricks Jobs run notebooks on a schedule with no manual intervention. Two jobs are recommended: a daily incremental and a weekly full refresh.

Daily — incremental

  • Cells 1, 5d, 7a, 7b, 9a–9e
  • Only new/changed assets
  • Fast — typically under 2 min
  • Does not catch deletions

Weekly — full refresh

  • Cells 1, 5a, 5b, 5c, 7a, 7b, 9a–9e
  • Full sync, catches deletions
  • 10–20 min for large libraries
  • Suggested: Sunday 3am

Ad-hoc

  • Run cell 5d then 7a–9e
  • After a big import session
  • Trigger from the Jobs page
  • Or just run manually

Creating a job

1

In Databricks, click Workflows in the left sidebar → Create job.

2

Give it a name, e.g. immich_daily_incremental or immich_weekly_full.

3

Under Task, set type to Notebook and select your notebook.

4

Under Cluster, select Serverless (only option in Free Edition).

5

Click Add triggerScheduled. Paste the cron expression for the desired schedule.

6

Click Create. You can also trigger manually from the job page at any time.

Cell selection tip: Databricks Jobs run the whole notebook top to bottom. For separate daily/weekly schedules, create two notebooks — e.g. immich_incremental.py starting with cell 1 then cell 5d → 7a → 7b → 9a–9e, and immich_full_refresh.py with cell 1 then 5a → 5b → 5c → 7a → 7b → 9a–9e.
Croncommon schedules
0 3 * * *       # every day at 3am
0 3 * * 0       # every sunday at 3am  ← recommended for full refresh
0 */6 * * *     # every 6 hours
0 3 1 * *       # first of every month at 3am
0 3 * * 1-5     # weekdays only at 3am

Cron format: minute hour day-of-month month day-of-week. Adjust the hour (3) to your local timezone offset from UTC.

12 — AI/BI Databricks dashboards

Databricks AI/BI Dashboards (Lakeview) connect directly to your gold tables. They're the only dashboard product available in Free Edition and are genuinely good for this use case.

Creating a new dashboard

1

In the left sidebar click DashboardsCreate dashboard.

2

Give it a name, e.g. Immich Photo Library.

3

Click Add a visualization to add your first panel.

4

Each panel has a SQL editor at the bottom. Paste a query, click Run, then pick a chart type from the right panel.

5

Drag panels to arrange them. Click Publish when done to get a shareable read-only URL.

In dashboard SQL, replace workspace with your actual catalog name. You cannot use ${CATALOG} widget syntax in dashboard panels — hardcode the catalog name directly.

Panel 1 — photos per year

SQLdashboard panel 1
SELECT year, SUM(asset_count) AS photos
FROM workspace.immich_gold.photos_by_month
WHERE media_type = 'image' AND year IS NOT NULL
GROUP BY year ORDER BY year;

Chart: Bar · X: year · Y: photos

Panel 2 — photos per month heatmap

SQLdashboard panel 2
SELECT year, month, SUM(asset_count) AS photos
FROM workspace.immich_gold.photos_by_month
WHERE media_type = 'image' AND year IS NOT NULL
GROUP BY year, month ORDER BY year, month;

Chart: Heatmap · X: month · Y: year · Color: photos

Panel 3 — top 20 countries

SQLdashboard panel 3
SELECT country, SUM(photo_count) AS photos
FROM workspace.immich_gold.geography
WHERE country != 'No GPS data'
GROUP BY country ORDER BY photos DESC LIMIT 20;

Chart: Bar (horizontal) · X: photos · Y: country

Panel 4 — top 30 cities

SQLdashboard panel 4
SELECT city || ', ' || country AS location,
       SUM(photo_count) AS photos
FROM workspace.immich_gold.geography
WHERE city IS NOT NULL AND country != 'No GPS data'
GROUP BY city, country ORDER BY photos DESC LIMIT 30;

Chart: Bar (horizontal) · X: photos · Y: location

Panel 5 — camera gear

SQLdashboard panel 5
SELECT make || ' ' || model AS camera,
       photo_count, first_used_year, last_used_year
FROM workspace.immich_gold.camera_summary
WHERE make != 'Unknown'
ORDER BY photo_count DESC LIMIT 15;

Chart: Bar (horizontal) · X: photo_count · Y: camera

Panel 6 — rating distribution (1=best)

SQLdashboard panel 6
SELECT
    CASE rating
        WHEN 1 THEN '1 — First place'
        WHEN 2 THEN '2 — Second place'
        WHEN 3 THEN '3 — Third place'
        WHEN 4 THEN '4 — Fourth place'
        WHEN 5 THEN '5 — Fifth place'
    END AS rating_label,
    photo_count
FROM workspace.immich_gold.ratings_summary
ORDER BY rating ASC;

Chart: Bar · X: rating_label · Y: photo_count

Counter widgets — add one panel per query

SQLtotal photos
SELECT SUM(asset_count) AS total_photos
FROM workspace.immich_gold.photos_by_month
WHERE media_type = 'image';
SQLcountries visited
SELECT COUNT(DISTINCT country) AS countries_visited
FROM workspace.immich_gold.geography
WHERE country != 'No GPS data';
SQLtotal storage (GB)
SELECT ROUND(SUM(total_size_gb), 1) AS total_gb
FROM workspace.immich_gold.photos_by_month;
SQLfirst place photos
SELECT COUNT(*) AS first_place_photos
FROM workspace.immich_silver.assets
WHERE rating = 1 AND is_trashed = false;

13 — Bonus Interactive Leaflet map

Standalone bonus section. Exports your geography gold table to a self-contained HTML map with Leaflet — dark-themed, circle markers sized by photo count, clickable popups. Serve it from your VPS alongside this tutorial.

Prerequisites: complete section 09 (gold layer) first. The map is built from immich_gold.geography.
Pythoncell 13a — export geography data
import json

geo_df = spark.sql(f"""
    SELECT country, city,
           ROUND(avg_lat, 4) AS lat,
           ROUND(avg_lon, 4) AS lon,
           photo_count, earliest_photo, latest_photo
    FROM {CATALOG}.immich_gold.geography
    WHERE avg_lat IS NOT NULL AND avg_lon IS NOT NULL
      AND country != 'No GPS data'
    ORDER BY photo_count DESC
""").toPandas()

locations      = json.dumps(geo_df.to_dict(orient="records"))
total_photos   = int(geo_df["photo_count"].sum())
total_locations = len(geo_df)
print(f"Locations with GPS : {total_locations}")
print(f"Photos mapped      : {total_photos}")
print(geo_df[["city","country","photo_count"]].head())
Expected output
Locations with GPS : 312 Photos mapped : 8941 city country photo_count 0 Perth Australia 2103 1 Singapore Singapore 841 2 London UK 734
Pythoncell 13b — generate map HTML and save to DBFS
html = f"""




Photo Map





""" dbutils.fs.put("/FileStore/photo_map.html", html, overwrite=True) print("Saved to /FileStore/photo_map.html") print("Download: Catalog → Files → FileStore in the Databricks UI")
Deploying the map: download photo_map.html from Databricks (Catalog → Files → FileStore), then copy to your VPS: scp ~/Downloads/photo_map.html root@tadhock.ovh:/mnt/blockstorage/tutorial/html/map.html — it'll be live at https://tutorial.tadhock.ovh/map.html immediately.