
Bali Properties Data Platform: 10 sources, 60K+ listings, monthly cycle

Two-pipeline market intelligence infrastructure for Investland Bali Properties: REID Looker Studio API extractor with weighted statistics methodology, plus 9-site scraper with quadtree spatial decomposition for Airbnb at 40K+ listings. Bronze layer on R2, Gold on Supabase, human-in-the-loop staging.

Date: August 2025
Status: Completed
Client: Investland Bali Properties
Bali Properties Data Platform v1.0 · August 2025

- Data sources: 10
- Listings per cycle: 63K+
- API calls / month: 41,202
- Full collection cycle: 14 days

The brief

Investland is a property developer in Bali that owns and manages a portfolio of villas and residential projects. Pricing and positioning those properties accurately requires visibility into the entire Bali market: not just the 200-odd properties Investland touches directly, but the 60,000+ listings, transactions, and aggregate indicators that define what "market rate" actually means in Canggu, Ubud, or Uluwatu.

Two complementary pipelines feed that visibility. The first interfaces with REIDBase, a public dashboard powered by Looker Studio, to capture market-level aggregates (median sold price, days listed, yield estimates) across 63 neighbourhoods. The second scrapes nine listing websites, including Airbnb at 40,000+ rental records, for property-level granularity. They converge on a Supabase Gold layer joined with the firm's own Airtable portfolio, producing a single searchable surface that powers both an interactive dashboard and monthly market reports.

Architecture

(Architecture diagram: REID extractor and 9-site scraper feeding into a Supabase Gold layer joined with Airtable, with Bronze (R2) and Cleaning (Pandas) stages between scraper output and the Gold layer.)

The platform follows a medallion-ish pattern with one important asymmetry: REID data arrives pre-aggregated and writes directly to Gold, while scraped listings flow through the full Bronze → Cleaning → Staging → Gold pipeline. A "Monthly Pulse" cron triggers the scraper fleet; the cleaner runs immediately after, landing output in a staging area that requires single-click human approval before promotion to production. Aggregation SQL views sit on top of the Gold tables to materialize the indicators consumed by the dashboard and reports.
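The gating logic above can be sketched in a few lines. This is a hypothetical orchestrator, not the actual codebase (the real pipeline is Docker-orchestrated across two repositories); the point is the asymmetry: REID aggregates write straight to Gold, while scraped listings must clear a human-approved staging gate.

```python
# Sketch of the monthly-pulse gating logic; all class and method names
# are illustrative placeholders, not the production API.

def monthly_pulse(scrapers, cleaner, staging, gold, reid_extractor):
    # REID data arrives pre-aggregated: direct to Gold.
    gold.write_aggregates(reid_extractor.run())

    # Scraped listings take the long road: Bronze -> Cleaning -> Staging.
    bronze = [rec for s in scrapers for rec in s.run()]
    staging.load(cleaner.clean(bronze))

    # Human gate: nothing reaches the production tables without approval.
    if staging.approved():
        gold.promote(staging.snapshot())
```

The single `if staging.approved()` branch is the whole human-in-the-loop contract: automation runs end to end, but promotion is a deliberate click.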

Pipeline 1: REID API extractor

REIDBase is a public dashboard built on Looker Studio. Its API endpoints serve the dashboard's tiles directly, are not authenticated, and respond with the same pre-aggregated metrics displayed in the UI. There is no transaction-level data exposed, only "median sold price for 2-bedroom villas in Canggu" granularity. The extractor's job is to enumerate every dimension combination and reconstruct sufficient detail through downstream weighting.

Cartesian enumeration: 41,202 calls per cycle

Each monthly extraction iterates over the full Cartesian product of available dimensions. The breakdown across the four metric groups:

| Group | Metrics | Neighbourhoods | Bedrooms | Property types | Calls |
|---|---|---|---|---|---|
| Sales Performance | (2×1 + 4×3) = 14 | 63 | 6 | 2 | 10,584 |
| Market Supply | (4×3 + 2×1) = 14 | 63 | 6 | 2 | 10,584 |
| Built Information | 5×3 = 15 | 63 | 9 | 2 | 17,010 |
| Rental Performance | (2×3 + 1×1) = 7 | 24 | 6 | 3 (incl. Guest House) | 3,024 |
| Total | | | | | 41,202 |

At ~2.3 seconds per call (Looker's natural response cadence; pacing matters here, since the source rate-limits anything faster), one month of data takes ~26 hours of continuous extraction. A full-year backfill is 12–13 days.
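The enumeration itself is a straight Cartesian product over the dimension counts in the table above. A sketch; the numeric indices stand in for the actual metric names and dimension values, which aren't reproduced here:

```python
from itertools import product

# (metrics, neighbourhoods, bedroom options, property types) per group,
# taken from the breakdown table. Indices are placeholders for real values.
GROUPS = {
    "sales_performance":  (14, 63, 6, 2),
    "market_supply":      (14, 63, 6, 2),
    "built_information":  (15, 63, 9, 2),
    "rental_performance": (7, 24, 6, 3),
}

def enumerate_calls(groups):
    """Yield one tuple per API call: the full product within each group."""
    for group, (m, n, b, t) in groups.items():
        for combo in product(range(m), range(n), range(b), range(t)):
            yield (group, *combo)

total = sum(1 for _ in enumerate_calls(GROUPS))
print(total)  # 41202 calls per monthly cycle
```

Each yielded tuple maps to one Looker request, which is also the natural unit for per-tuple progress tracking on resume.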

Weighted statistics: reconstructing granularity

Pre-aggregated metrics aren't the dead end they look like. By weighting each retrieved aggregate by the active inventory it represents (sourced from the Supply group's available_properties_* counts), the pipeline computes composite indicators (weighted median sold price, weighted price per square meter) that approximate what a transaction-level aggregation would produce. Validation against the source dashboard's own composite displays shows ~90% fidelity.

weighted-median.ts (TypeScript)
function weightedMedian(items: { value: number; weight: number }[]): number {
  if (items.length === 0) return 0
  const sorted = [...items].sort((a, b) => a.value - b.value)
  const totalWeight = sorted.reduce((sum, i) => sum + i.weight, 0)
  if (totalWeight === 0) return 0

  const halfWeight = totalWeight / 2
  let cumulative = 0

  for (let i = 0; i < sorted.length; i++) {
    cumulative += sorted[i].weight
    if (cumulative >= halfWeight) {
      // exact midpoint: average two surrounding values
      if (cumulative === halfWeight && i + 1 < sorted.length) {
        return (sorted[i].value + sorted[i + 1].value) / 2
      }
      return sorted[i].value
    }
  }
  return sorted[sorted.length - 1].value
}

// Weight by active inventory at each (neighbourhood × bedroom × type) cell.
// Bias is intentional. Active inventory reflects what's currently transactable.

The bias toward active inventory is deliberate. A sold-property median weighted by 5-year-old historical transactions doesn't reflect what's currently for sale; weighting by today's available inventory keeps the indicator anchored to actionable supply.
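The companion indicator, weighted price per square meter, follows the same idea with a weighted mean instead of a median. A sketch with made-up numbers; the field names (`median_price_sqm`, `available_properties`) are assumptions modelled on the Supply group's `available_properties_*` counts:

```python
# Illustrative only: fold per-cell REID aggregates into one composite,
# weighting each cell by its active inventory.

def weighted_price_per_sqm(rows):
    """Inventory-weighted mean of per-cell median price/sqm."""
    total_weight = sum(r["available_properties"] for r in rows)
    if total_weight == 0:
        return 0.0
    return sum(
        r["median_price_sqm"] * r["available_properties"] for r in rows
    ) / total_weight

# Hypothetical cells for one neighbourhood x bedroom slice:
canggu_2br = [
    {"median_price_sqm": 2100, "available_properties": 80},  # villas
    {"median_price_sqm": 1400, "available_properties": 20},  # townhouses
]
print(weighted_price_per_sqm(canggu_2br))  # 1960.0
```

A high-inventory cell pulls the composite toward itself, which is exactly the intended anchoring to transactable supply.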

Pipeline 2: Multi-site scraper

Where REID gives breadth at the cost of granularity, the scraper fleet gives the inverse: every individual listing across nine major Bali real estate sites, including occupancy data on Airbnb's 40K+ rental properties. The pipeline is split across two repositories (scraping-automation for collection and transform-pipeline for cleaning), orchestrated together with Docker.

The fleet

| Source | Sales | Rental | Occupancy |
|---|---|---|---|
| Airbnb | | 38,000–40,000 | 38,000–40,000 |
| Bali Exception | 900–1,000 | 100–200 | |
| Bali Home Immo | 2,900–3,000 | 1,900–2,000 | |
| Betterplace | 700–900 | 500–600 | |
| Fazwaz | 11,000–12,000 | 1,700–2,000 | |
| James Edition | 700–800 | | |
| Lamudi | 1,300–1,400 | 1,000–1,200 | |
| Propertia | 900–1,000 | | |
| Villa Bali Sale | 2,200–2,400 | 1,200–1,300 | |
| Total | ~20,000 | ~43,000 | ~40,000 |

Total cycle: 14 days. Airbnb alone is 10 of those.

Quadtree spatial decomposition for Airbnb

Airbnb caps each search query at roughly 270 results regardless of pagination. Bali has roughly 40,000 active rental listings. The naive answer is offset pagination; the real answer is recursive spatial subdivision.

(Diagram: at depth 0, a cell returning 270+ results splits into four quadrants; at depth 1, quadrants still at 270+ split again while sparse ones, e.g. 95 or 142 results, stop; by depth 2 every cell is under the cap.)

The algorithm:

quadtree-scrape.py (Python)
MAX_RESULTS_PER_CELL = 250        # ~10% buffer below Airbnb's ~270 cap
MAX_SUBDIVISION_DEPTH = 2         # Bali fits comfortably under depth 2

def scrape_cell(cell, depth=MAX_SUBDIVISION_DEPTH):
    listings = airbnb_search(cell.bbox, max_pages=15)

    # Adaptive subdivision: if a cell is dense and we still have depth budget,
    # split into 4 quadrants and recurse.
    if len(listings) >= MAX_RESULTS_PER_CELL and depth > 0:
        for sub_cell in subdivide(cell):
            listings.extend(scrape_cell(sub_cell, depth=depth - 1))

    return dedupe(listings)

def subdivide(cell):
    mid_lat = (cell.sw_lat + cell.ne_lat) / 2
    mid_lng = (cell.sw_lng + cell.ne_lng) / 2
    return [
        Cell(cell.sw_lat, cell.sw_lng, mid_lat, mid_lng),       # SW
        Cell(cell.sw_lat, mid_lng, mid_lat, cell.ne_lng),       # SE
        Cell(mid_lat, cell.sw_lng, cell.ne_lat, mid_lng),       # NW
        Cell(mid_lat, mid_lng, cell.ne_lat, cell.ne_lng),       # NE
    ]

MAX_SUBDIVISION_DEPTH=2 is a deliberate ceiling. Bali fits comfortably under that depth at the threshold chosen, and bounded recursion is bounded runtime. If a future hot zone demands depth 3, the constant lifts; the algorithm doesn't change.
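Bounded recursion also makes the cost arithmetic trivial. A quick worst-case sketch (the ~270 cap comes from the text above; the per-root framing is illustrative):

```python
# Worst case: every cell at every level is dense and splits into 4.
def max_cells(depth):
    # Cells visited per root cell: 1 + 4 + 16 + ... + 4^depth
    return sum(4 ** d for d in range(depth + 1))

print(max_cells(2))  # 21 cells visited per root cell at the depth-2 ceiling
# Coverage check: 16 leaf cells x ~270 results ≈ 4,300 listings per root
# cell, so a modest grid of root cells comfortably spans Bali's ~40K.
```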

Stealth and resilience

Public scraping is an arms race. The scraper assumes hostility and structures around it: per-domain persistent Chrome profiles (preserves cookies and trust signals across runs), preflight circuit breakers that detect CloudFlare interstitials before committing to a full scrape, humanization jitter (0.4–1.2s) on navigation events, and an in-memory TTL cache (15 min default) to suppress redundant page loads when paths overlap.

The preflight check matters most. Without it, a session that's been silently flagged burns hours producing junk before anyone notices.
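Two of the smaller pieces are easy to sketch. The jitter range and 15-minute TTL are from the description above; the class and function names are illustrative, not the production code:

```python
import random
import time

def human_pause(low=0.4, high=1.2):
    """Humanization jitter between navigation events (seconds)."""
    time.sleep(random.uniform(low, high))

class TTLCache:
    """In-memory page cache; entries expire after ttl seconds."""
    def __init__(self, ttl=15 * 60, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self._store = {}

    def get(self, url):
        hit = self._store.get(url)
        if hit is None:
            return None
        stored_at, page = hit
        if self.clock() - stored_at > self.ttl:
            del self._store[url]    # expired: evict and miss
            return None
        return page

    def put(self, url, page):
        self._store[url] = (self.clock(), page)
```

The cache only pays off when scrape paths overlap (e.g. a listing reachable from two category pages); the TTL keeps it from ever serving a stale page across runs.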

Per-source plugin architecture

Eight of nine sites inherit from a BaseScraper abstract class implementing the common contract: preflight, listing enumeration, detail extraction, write to R2. Each subclass overrides one or two methods to handle source-specific quirks (DOM shape, pagination style, anti-bot defenses). Adding a tenth site is one new class implementing one method, not a refactor.
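The contract looks roughly like this. A minimal sketch; the registry decorator and method names are assumptions modelled on the description, not the actual `BaseScraper`:

```python
from abc import ABC, abstractmethod

SCRAPER_REGISTRY = {}

def register(source_name):
    """Hypothetical decorator: map a source name to its scraper class."""
    def wrap(cls):
        SCRAPER_REGISTRY[source_name] = cls
        return cls
    return wrap

class BaseScraper(ABC):
    def run(self):
        self.preflight()                    # shared circuit breaker
        for url in self.enumerate_listings():
            yield self.extract_detail(url)  # written to R2 in the real system

    def preflight(self): ...                # shared default

    @abstractmethod
    def enumerate_listings(self): ...       # always source-specific

    def extract_detail(self, url): ...      # shared default, often overridden

@register("lamudi")
class LamudiScraper(BaseScraper):
    def enumerate_listings(self):
        # source-specific pagination / DOM handling goes here
        yield from []
```

A tenth site is one more `@register(...)` class with one overridden method, which is the whole point of the plugin shape.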

Airbnb is the standalone outlier. Its quadtree-based phase structure differs enough from the others that fitting it into BaseScraper would have meant either weakening the base or special-casing the orchestrator. Both options were worse than keeping it standalone.

Cleaner pipeline (transform-pipeline)

Pandas-based, registry pattern matching the scraper architecture. Each source has its own cleaner subclass of BaseCleaner; a registry routes raw listings to the correct cleaner based on the source name. The cleaning pass produces 16+ anomaly flags per record: PRICE_OVERFLOW, BUILDING_LARGER_THAN_LAND, SUSPECT_LOW_SALE_PRICE, HIGHER_BUILDING_SIZE, multi-currency bound checks (IDR 999T, USD 50M for sales), and one I'm particularly fond of: per-are price detection. If a land listing's price-per-square-meter falls under $10, it's almost certainly priced per are (Indonesian land unit, 100 sqm) rather than per total. The cleaner flags it for human inspection rather than silently scaling.
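The per-are check itself is a one-liner once the threshold is fixed. A sketch; the $10/sqm cutoff and the flag-not-rescale behavior are from the description, while the flag name and function are illustrative:

```python
ARE_IN_SQM = 100  # 1 are = 100 sqm (Indonesian land unit)

def flag_per_are_suspect(price_usd, land_sqm):
    """Land priced under $10/sqm is almost certainly quoted per are."""
    if land_sqm <= 0:
        return []
    if price_usd / land_sqm < 10:
        # Flag for human review; never silently rescale by ARE_IN_SQM.
        return ["SUSPECT_PRICE_PER_ARE"]
    return []

print(flag_per_are_suspect(25_000, 5_000))  # flagged: $5/sqm
print(flag_per_are_suspect(250_000, 500))   # clean: $500/sqm
```

Flagging instead of auto-correcting matters because a 100× rescale applied to a listing that really was that cheap would corrupt the median far worse than a missing row.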

Flagged records land in a Supabase staging table. A reviewer examines each flag against the source listing, approves or amends, and a single button promotes the staging snapshot to the Gold layer. The dashboard auto-refreshes on promotion.

Convergence: the Gold layer

Both pipelines and the Airtable portfolio terminate in the same Supabase Gold layer. The schema is intentionally flat per concern: REID aggregates land in their own tables (period × neighbourhood × dimension keys), scraped listings populate property-level tables (one row per listing × source), and the Airtable mirror sits alongside as an authoritative own-portfolio snapshot.

The aggregation layer materializes through SQL views that fold all three sources into the indicators the dashboard actually queries. A single searchable view joins own-portfolio rows with comparable market listings, which is what powers the Compare Property feature documented in the dashboard case study.

Promotion from staging to Gold is a deliberate gate. The reviewer sees flagged records grouped by source, can drill into the original listing URL for verification, and either approves the cleaner's interpretation or amends the values manually. Once approved, a single button promotes the staging snapshot to Gold and the dashboard auto-refreshes via Supabase Realtime, so the moment a market shift is verified, decisions can be made against it.

Tradeoffs and lessons

  1. Pre-aggregated data isn't a dead end. Weighted statistics on Cartesian-enumerated metrics reach ~90% fidelity without raw transaction access. The methodology is more important than the source granularity.

  2. Multi-day extraction jobs need resume discipline. A 25-hour run will fail somewhere; per-tuple progress tracking and cookie auto-refresh aren't optional features, they're table stakes.

  3. Quadtree is the right answer when paginated APIs cap by spatial query. Offset pagination doesn't help if the result-set size cap stays the same; subdividing the query domain does.

  4. Staging discipline beats automation purity. Sixteen flag categories cost minutes of human review per cycle. Without them, a 50,000 sqm villa quietly enters the median.

  5. Per-source plugin architecture scales the team, not just the code. The unit of growth is one new class implementing one contract, not a coordinated refactor across the codebase.
