Building the Intelligence
Layer for Real Estate
A chronological look at how the platform evolved: from a crawler stack and research agent into a public data platform, portfolio intelligence system, and AI-native document workflow.
Before ChatGPT had web search, we were serving live content through an LLM. Before “agentic AI” was a buzzword, we were building agents that planned, executed, and analyzed across multiple data sources.
System Architecture
200 Sources
Gov, News, Finance, Legal, Think Tanks
Crawler Stack
150+ scrapers → Content API → Quality scoring
Pinecone + PostgreSQL
Vector store + relational source of truth
ALFReD + Public API
Research agent, public endpoints, and creation tools
Products
ALFReD + CONCaiRGE · ATLAS · Studio
Chronology
Chapter 1
Fall 2023
The Crawler Stack
150+ scrapers. 200 sources. A two-hour heartbeat.
In the fall of 2023, Impact Capitol set out to tackle a massive problem: building a centralized, AI-optimized knowledge hub for real estate information. At the time, most AI work in the industry was focused on predictive modeling, automated valuation models, and price forecasting. We believed generative AI pointed toward a different opportunity entirely: a new way to conduct research and analysis, and in doing so, democratize knowledge, data, and workflows that had historically been reserved for highly technical teams or C-suite operators.
But this was before Perplexity, and before the web-searching ChatGPT people now take for granted. There was no real blueprint for how LLMs could be used in production research workflows, and just as much uncertainty around the capabilities of the models themselves as there was around where the technology was headed. Before making any major decisions about the user-facing AI product, we first had to solve the much less glamorous problem underneath it: collecting, cleaning, and structuring the data.
We cycled through a range of early approaches in those first months: RSS feeds, ElasticSearch, even custom Bing and Google search workflows. Each solution helped in some areas, but none gave us the level of control we wanted around quality, ranking, freshness, and auditability. The more serious we became about building a real product, the clearer it became that we would need to build the ingestion layer ourselves, from scratch.
That meant weeks of painstaking manual work. We inspected relevant domains one by one, looking through sitemaps, robots.txt files, archive structures, article URLs, and the specific HTML patterns that signaled new content. Over time, we built dedicated crawlers across more than 150 of the most important sources in U.S. real estate, economics, politics, housing policy, and current events: the Fed and regional Federal Reserve banks, HUD, Treasury, SEC, OCC, FDIC, FEMA, CFPB, NAR, Zillow, CBRE, HousingWire, Brookings, financial institutions, and legal specialty outlets. The final system settled into 150+ site-specific scrapers, each tailored to the source it monitored rather than forcing every site through one brittle generic pipeline.
However, in late 2023, simply handing a URL to an LLM was nowhere near enough. The models could not reliably conduct research in the agentic way people now expect, and blindly indexing every newly discovered page would have degraded result quality almost immediately. We needed a way to extract, evaluate, process, and maintain thousands of incoming URLs per day in a form that was genuinely useful to an LLM, while still keeping the whole system manageable for a small, non-technical team of three.
That pressure led to a three-part proprietary crawler and RAG pipeline. First came the website scrapers: a large fleet of source-specific scripts with deduplication logic and scheduled runs to discover fresh content pages. Second came the Content Processing API, an AI layer that subscribed to those scrapers, parsed and cleaned page content, generated document-level metadata like qualityScore, publicationDate, is_headline, and headline_category, and then split the content into configurable chunks with their own metadata and quality signals. Third came the Knowledgebase Management UI, with /crawl and /scrape workflows, tabular crawl-job visibility, CSV uploads, PDF ingestion, individual URL ingestion, and admin actions to generate metadata or upsert directly into the knowledge base.
This modular architecture gave us something that was both scalable and operationally realistic. We could enforce quality thresholds like a document score of 8 and a chunk score of 7, ensuring that only strong context entered the searchable knowledge base without requiring constant human review. For a lean team, that mattered as much as model performance itself.
It also gave us flexibility that standard RAG setups at the time generally did not. Instead of treating the embedded text itself as the final retrieval target, we could generate metadata such as search terms, publication dates, quality measures, and categorical labels that improved how user queries matched against the vector store. In practice, that meant the system could compare a user's question against generated search terms while still returning the richer chunk or full-document context to the model. The same metadata also improved the product experience directly, allowing both users and the model to filter by time windows like the last 24 hours or a specific publication range. That degree of control over index structure, retrieval quality, and auditability simply was not standard in the available tooling at the time.
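To make the quality gating concrete, here is a minimal sketch of how the document- and chunk-level thresholds described above could interact. The class shapes and field names are illustrative stand-ins for the pipeline's metadata (`qualityScore`, `publicationDate`, `is_headline`, `headline_category`), not the production schema:

```python
from dataclasses import dataclass, field

# Illustrative thresholds mirroring the ones described above: documents
# need a quality score of at least 8, chunks at least 7, before anything
# enters the searchable knowledge base.
DOC_SCORE_THRESHOLD = 8
CHUNK_SCORE_THRESHOLD = 7

@dataclass
class Chunk:
    text: str
    quality_score: int
    search_terms: list[str] = field(default_factory=list)

@dataclass
class Document:
    url: str
    quality_score: int
    publication_date: str
    is_headline: bool
    headline_category: str
    chunks: list[Chunk] = field(default_factory=list)

def eligible_chunks(doc: Document) -> list[Chunk]:
    """Return only the chunks that clear both quality gates."""
    if doc.quality_score < DOC_SCORE_THRESHOLD:
        return []  # the whole document is excluded from the index
    return [c for c in doc.chunks if c.quality_score >= CHUNK_SCORE_THRESHOLD]
```

The point of the two-level gate is that a weak chunk in a strong document is dropped individually, while a weak document is dropped wholesale before any of its chunks are considered.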
Vector Space
Semantic organization of the knowledge base. Each point is a sentence; color and proximity indicate thematic similarity. Labels like policy, regulatory, residential, and mortgage map to the headline_category metadata generated by the Content Processing API.
Technical Highlights
- PostgresBaseScraperEnhanced base class with per-source scrape_all_sections()
- Consolidated scheduler on a 2-hour cycle with RUN_YYYYMMDD_HHMMSS tracing
- Content Processing API (Express.js/TypeScript) with Firecrawl for JS-heavy sites
- Mistral OCR API for PDF table extraction
- o3-mini quality scoring pipeline — only score > 7 content enters the vector store
- ~800-char chunking with sentence-boundary awareness
- Four-layer deduplication: discovery log, processed results, ON CONFLICT, queue check
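The ~800-character, sentence-boundary-aware chunking in the highlights can be sketched roughly as a greedy packer: split on sentence boundaries, then fill each chunk up to the target size without ever cutting mid-sentence. The production logic is more involved; this is a minimal illustration:

```python
import re

def chunk_text(text: str, target: int = 800) -> list[str]:
    """Greedy sentence-boundary chunking: pack whole sentences into
    ~`target`-character chunks, never splitting mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overshoot.
        if current and len(current) + 1 + len(sentence) > target:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```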
Chapter 2
Spring 2024–2025
ALFReD — The Research Agent
From an early RAG chatbot into a true agentic research system.
With the knowledgebase in place, the next step was putting intelligence on top of it. But generative intelligence in late 2023 and early 2024 was a very different thing than what users now expect from frontier models. The systems were materially less capable, context windows were smaller, and they were far worse at handling edge cases or recovering gracefully when a task fell even slightly outside their core instructions. Finding the right balance between answer depth, latency, cost, and accuracy required thousands of manual tests, repeated prompt refinements, and a constant willingness to rebuild pieces of the product as the models changed.
The first meaningful version of ALFReD was, in retrospect, a relatively simple RAG chatbot. A user would send a message, we would parse it into a search query against the knowledgebase, and then return the full document contents for roughly the top 10 to 20 results. That meant large prompt payloads, but it also produced something that felt genuinely magical at the time. Before ChatGPT search existed, users could ask a question and receive a live, curated stream of real-estate news and analysis synthesized from sources that had been carefully selected and processed for this exact purpose.
That early experience proved the demand, but it also revealed the architectural ceiling. Returning full document contents by default often meant injecting tens of thousands of tokens into the prompt at once, which increased latency, drove up cost, and sometimes diluted the very instructions we needed the model to follow. The more complex the user request became, the more obvious it was that a single retrieval pass and one giant context dump would not be enough.
At the same time, the market was moving quickly. As frontier labs rolled out plugin ecosystems, live web search, and more capable tool use, the definition of a defensible AI product started to shift. The moat was no longer just “we have a chatbot,” but rather “we have proprietary data, specialized workflows, and a system that knows how to use them intelligently.” That broader market transition, from chatbots to agents, defined most of ALFReD's life.
From there, the work became less about prompting a model to answer and more about designing an application that could plan, retrieve, filter, and reason before answering. We pushed ALFReD toward an agentic architecture: one where the model could connect to external tools, decide how much context it actually needed, respect user constraints like source selection or date ranges, and synthesize a response from multiple structured inputs rather than one monolithic text blob.
Three changes were especially important. First, we introduced the Retrieve tool so the model could start with cheap chunk-level retrieval and only expand into full documents when it judged the extra context necessary. Second, we improved instruction-following by introducing a lightweight XML reasoning pattern with <scratchpad> and <answer> blocks, giving the model space to quietly restate requirements and plan before producing a final user-facing response. Third, as live web search became more commoditized, we began building specialized public-data workflows, including an internal API Agent that could conduct multi-step research across structured government datasets and return concise summaries plus chart-ready outputs for ALFReD to use.
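The first of those changes, the two-stage Retrieve pattern, can be sketched like this. In production the model itself judged when to expand into full documents; here a score threshold stands in for that judgment, and the search and fetch callables are hypothetical stand-ins for the vector store and document store:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChunkHit:
    doc_id: str
    text: str
    score: float

def retrieve(
    query: str,
    search_chunks: Callable[[str], list[ChunkHit]],
    fetch_document: Callable[[str], str],
    expand_threshold: float = 0.85,
) -> list[dict]:
    """Two-stage retrieval: cheap chunk-level hits first, full documents
    only when a hit is strong enough to justify the extra context."""
    out = []
    for hit in search_chunks(query):
        expanded = hit.score >= expand_threshold
        out.append({
            "doc_id": hit.doc_id,
            # Expensive full-document path only for strong matches.
            "content": fetch_document(hit.doc_id) if expanded else hit.text,
            "expanded": expanded,
        })
    return out
```

The practical effect is the one described above: the default prompt payload shrinks to chunk size, and tens-of-thousands-of-token context dumps happen only when the extra context is worth paying for.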
ALFReD is ultimately best understood not as one static product, but as a moving target that evolved alongside the model landscape. What started as a compelling retrieval interface became a much more deliberate research system: a Next.js application on the Vercel AI SDK with multi-provider model support, Cognito authentication, PostgreSQL-backed sessions, and a tool layer that grew into more than 25 capabilities across search, filings, public data, property intelligence, and content generation. Over the following year, three major user-facing capabilities were layered directly into that core agent.
Dashboard Demo
A live look at the broader ALFReD shell: navigation, workspace loading, and the product environment users moved through during research workflows.
Interaction Demo
A live look at the interface in motion: search, property context, and conversational research happening inside the same product surface.
Technical Highlights
- Next.js 14 + Vercel AI SDK (streamText / useChat)
- Multi-provider LLM support: OpenAI, Anthropic, Google
- 25+ tools: search, property data, gov APIs, document creation
- Internal API Agent for multi-step public-data research and chart-ready summaries
- 15 public data endpoints spanning crime, climate risk, demographics, schools, grants, housing finance, LIHTC, Opportunity Zones, and investability scoring
- ToolResult<T> envelope pattern with source attribution and citations
- XML scratchpad reasoning for multi-tool alignment
- AWS Cognito auth with JWT + PostgreSQL session management
- Adaptive two-stage retrieval: chunks first, full-content on demand
- BatchData Property API for property search, comps, and BPO reports
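The `ToolResult<T>` envelope in the highlights is a TypeScript pattern; a Python equivalent sketches the same shape — a success flag, an optional typed payload, and source attribution carried on every result so the agent layer can render citations without knowing each tool's internals. Names here are illustrative, not the production API:

```python
from dataclasses import dataclass, field
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Source:
    title: str
    url: str

@dataclass
class ToolResult(Generic[T]):
    """Uniform envelope every tool returns: success flag, optional
    payload, optional error, and the sources behind the answer."""
    ok: bool
    data: Optional[T] = None
    error: Optional[str] = None
    sources: list[Source] = field(default_factory=list)

def success(data: T, sources: Optional[list[Source]] = None) -> ToolResult[T]:
    return ToolResult(ok=True, data=data, sources=sources or [])

def failure(message: str) -> ToolResult[None]:
    return ToolResult(ok=False, error=message)
```

Because every tool — search, filings, public data, property intelligence — returns the same envelope, citation rendering and error handling live in one place instead of 25.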
Chapter 3
Fall 2025
ATLAS — Portfolio-Scale Intelligence
A portfolio-native workspace for maps, spreadsheets, and agentic research.
With CONCaiRGE, ALFReD's single-property concierge workflow, we learned that there was real operational value in using AI to generate downloadable deliverables that could be shared with third parties and stakeholders. But as we kept demoing that workflow to real-estate firms, mortgage companies, and trade organizations, the same request kept resurfacing: could this work at the portfolio level instead of one property at a time?
There was also a growing demand for visual analysis. Clients did not just want an answer in a chat thread; they wanted to see their assets on a map and understand how a portfolio sat inside layers like crime exposure, school quality, CFPB rural and underserved designations, and natural-disaster risk. In other words, they wanted a system that worked for lenders, brokers, asset managers, and property teams who think in geographies, spreadsheets, and shareable outputs, not just prompts and replies.
That requirement exposed the limits of the chatbot format that CONCaiRGE still relied on. A back-and-forth user/assistant flow works well when the task is investigating a single property. It becomes much less natural when the user needs to manage dozens or hundreds of addresses at once, compare incomplete records, preserve structured fields, and iteratively enrich a portfolio over time. The interface itself had to change.
That shift produced ATLAS, our attempt to rethink what agentic AI looked like for more demanding property workflows. Instead of a chat-first product, ATLAS ingested portfolio spreadsheets and created a virtual database that preserved original columns and data types while turning every property into a durable record. The system could then assess completeness at the row level, express that as a visible completion percentage, and treat missing cells as research tasks rather than passive blanks.
The UI revolved around three connected surfaces. First, an interactive map plotted every property by latitude and longitude and layered on external context. Second, an Excel-style spreadsheet editor gave users direct control over the portfolio itself: editing values, reviewing missing fields, and preserving manual overrides. Third, a dynamic chat sidebar let users ask ALFReD to research an individual property, analyze an entire portfolio, edit selected cells, create new columns, or pull in context from both the knowledgebase and the public API stack.
As users or ALFReD filled in missing information, the system automatically created new portfolio versions so the full workflow stayed auditable end to end. That single-source-of-truth approach was especially important because the underlying MLS and property-data landscape was volatile. Regulatory changes and provider inconsistency could cause the quality of results for the same address to fluctuate over time, including missing images or degraded record quality. Without granular, persistent control over the portfolio data itself, property research in a pure chatbot format became fragile and frustrating.
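A minimal sketch of the two ideas in this paragraph — row-level completion tracking and snapshot-before-edit versioning — with hypothetical field names and an in-memory portfolio standing in for the virtual database:

```python
def row_completion(row: dict) -> float:
    """Share of fields in a property record that are filled in; empty
    cells (None or "") count as open research tasks, not passive blanks."""
    if not row:
        return 0.0
    filled = sum(1 for v in row.values() if v not in (None, ""))
    return round(filled / len(row), 2)

def apply_edits(portfolio: list[dict],
                edits: dict[tuple[int, str], object],
                versions: list[list[dict]]) -> list[dict]:
    """Apply cell-level edits and snapshot the prior state first, so the
    full workflow stays auditable end to end."""
    versions.append([dict(r) for r in portfolio])  # snapshot before edit
    updated = [dict(r) for r in portfolio]
    for (row_idx, column), value in edits.items():
        updated[row_idx][column] = value
    return updated
```

Whether the edit comes from a user typing into the spreadsheet or from ALFReD's research, it flows through the same path, which is what makes the version history a single source of truth.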
ATLAS also marked our first serious push into geospatial product design. Using Leaflet and React-Leaflet, we layered school quality zones, FBI crime statistics, CFPB rural and underserved areas, and TIGER-based Census geography onto the map so portfolio analysis could happen spatially as well as textually. While ATLAS never shipped to clients as a production product, the research, prototypes, and feedback loop around it directly influenced what came next: more emphasis on structured data ingestion, richer visual workflows, and ultimately products like Studio that treated AI output as something users needed to inspect, edit, and deliver.
Map Layers Demo
A deeper geospatial look at ATLAS: portfolio addresses, layered spatial context, and the kinds of visual analysis clients kept asking for.
Workflow Demo
An in-product look at ATLAS in motion, showing how the spreadsheet-style workspace could accelerate portfolio enrichment and editing.
Technical Highlights
- Portfolio ingestion into a virtual database that preserves source columns and data types
- Spreadsheet-first UI with per-row completion tracking and manual overrides
- ALFReD Portfolio Agent for cell-level edits, portfolio-wide research, and new-column generation
- Dynamic chat sidebar connected to the knowledgebase and public API stack
- Leaflet + React-Leaflet geospatial layers across school quality, crime, CFPB designations, and TIGER boundaries
- Automatic portfolio versioning and comparison over time
Chapter 4
Winter 2025–26
Studio — Making Research Shareable
The end-to-end research-to-deliverable workflow we had been building toward.
Studio was the culmination of every lesson we had learned in this rapidly evolving industry. It set out to address every limitation of chat-based AI we could put a finger on: the lack of shareability and iterability, the opacity of an LLM's research process, the weak support for citations and native visual understanding, and, through the lens of our own platform, the inability to upload multimodal work materials like spreadsheets, PDF documents, brand logos, and reference graphics.
The core idea was to turn the conversation into a document workflow. The model would produce a first draft, the user would revise it the way they would in PowerPoint or Word, and then the model could step back in as a colleague: visually processing the document and editing it in all the ways a human could. The intent was that Studio could serve as a genuine end-to-end workspace for market research, analysis, and polished client deliverables.
But getting there exposed a fundamental shortcoming in our legacy systems: they were all text-based. The depth and breadth of our knowledgebase and ALFReD's research abilities had always been restricted to what could be coherently processed and transformed into clean text. Embedded website images were reduced to markdown URLs. PDF charts, infographics, and visuals were effectively flattened into placeholders, stripping an enormous amount of contextual value out of the information we were processing and citing in our outputs.
We heard about it directly. In an early Studio demo for a prominent policy think tank, a representative pointed out how the system had thoroughly failed to synthesize and explain the charts in one of their more visual-heavy PDF market reports. That feedback was hard to argue with, and it accelerated a pivot we had already been considering.
What we found was that a vision-native approach was well within the capabilities of the frontier models, surprisingly easy to integrate and scale across individual tools without prohibitive cost or latency, and, most importantly, game-changingly effective for Studio's objectives. We rebuilt the PDF ingestion pipeline so that ALFReD processed every page as both a high-resolution image and its OCR-extracted text. A 50-page report could now be synthesized the way a human would read it, in a fraction of the time and with far stronger retention of visual context like charts, tables, and layouts. Cost and latency stayed manageable because we preserved the original search-and-retrieve tooling: semantic search with metadata filters for document ID, then a vision-enabled retrieve tool that returned both the text and the page image for the model to reason over.
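The dual representation can be sketched as a page record carrying both its OCR text and its rendered image, shaped into the mixed content blocks a vision-capable chat API expects. The exact message shape varies by provider; this one is illustrative:

```python
from dataclasses import dataclass

@dataclass
class PdfPage:
    doc_id: str
    page: int
    ocr_text: str    # embedded and matched via semantic search
    image_url: str   # high-res render returned alongside the text

def page_to_model_content(p: PdfPage) -> list[dict]:
    """Build mixed text+image content blocks for one retrieved page, so
    the model reads the page the way a human would: text plus layout."""
    return [
        {"type": "text",
         "text": f"[{p.doc_id} p.{p.page}]\n{p.ocr_text}"},
        {"type": "image_url",
         "image_url": {"url": p.image_url}},
    ]
```

Retrieval stays cheap because the semantic search still runs over text and metadata; only the handful of pages that actually come back get their images attached.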
That same vision-first foundation unlocked something we had not fully anticipated: an inherited ability for visual expression and design awareness. Studio could understand your brand identity and maintain its integrity across outputs. Users could upload logos, define color palettes with hex values, and the model would use those assets faithfully, knowing where and when to place a logo, staying true to the given color theme, and expressing that consistency across content-rich, transparently cited presentations and reports. Brand Kits became a first-class concept: folders with associated palettes, logos, and reference materials that the model could access through an @mention system.
Under the hood, Studio turned model output into a fully editable canvas with layers, materials, and revision controls instead of a dead-end response. Logos, charts, and uploaded graphics could flow directly into the document workflow, image generation used the current slide as context so visuals matched the deck's style, and the finished work could be exported as polished PPTX or PDF deliverables.
Studio was a pleasure to work on, and the resoundingly positive feedback from live demos with high-profile partners confirmed that it was a genuinely differentiated product with real potential. Even so, market forces outside our control cut its life cycle short. Like many firms building in this space, we were increasingly confronted with the question of differentiation against horizontal tools like Claude, Copilot, and other multimodal workflow products that kept expanding their surface area.
That said, we are still eager to get Studio into others' hands. In the coming weeks, we plan to open-source a version on GitHub built on the Vercel AI SDK, so the architecture and patterns behind it are available for anyone to study, extend, or build on top of. Details to follow.
Agent Revision Demo
A user asks Studio to cite a fact on a slide, the agent researches the PDF to find the source for the new-residents figure, and then inserts a dynamic citation directly into the document.
Citation System Demo
A live look at Studio's more transparent research experience: vectorized document knowledge, dynamic citation links, and a more explorable relationship between source material and output.
Technical Highlights
- Streamed XML markup (draft_slideshow / draft_pdf) rendered client-side into an editable canvas
- Vision-native PDF pipeline: high-res page images + OCR text via pgvector semantic search
- Vision-enabled retrieve tool returning multimodal content (text + page image) to the model
- Dual-provider image generation: Gemini 3 Pro Image Preview + OpenAI gpt-image-1.5 in parallel
- Brand Kits with hex palettes, logo uploads, and @mention-based asset embedding
- HTML serializer for positioned elements (text, images, shapes) with material URL resolution
- PPTX export via PptxGenJS; pixel-accurate PDF export via headless Chromium
- Slide screenshot capture + reference images + HTML style summary as generation context
The Technical Foundation
PostgreSQL as the source of truth. Every system uses PostgreSQL — for crawl history, content storage, user sessions, chat messages, portfolio data, and document state.
Pinecone for semantic intelligence. Multiple indices serve different purposes — general knowledgebase, headlines, FRED/BLS metadata, SEC filings, state regulations — all using OpenAI's text-embedding-3-large.
Docker and AWS ECS for deployment. Every service is containerized with Alpine-based images. Health check endpoints enable ECS auto-recovery. AWS Secrets Manager handles credentials, RDS provides managed PostgreSQL.
TypeScript on the frontend, Python where it makes sense. Web applications, Studio workflows, and the public data layer use TypeScript/Node.js for their ecosystem advantages. The crawler stack still leans on Python where data extraction and processing tooling make it the better fit.
Consistent tool patterns. Whether it's a web scraper, an API client, or an LLM tool, every component follows similar patterns: typed inputs with validation (Zod for TypeScript, Pydantic for Python), structured error handling, retry logic with backoff, and comprehensive logging.
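The retry-with-backoff piece of that shared pattern, sketched with only the Python standard library (the Zod/Pydantic validation half is omitted, and names are illustrative):

```python
import random
import time

def with_retry(fn, attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus jitter — the
    shared shape across scrapers, API clients, and LLM tools."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # 0.5s, 1s, 2s, ... doubled each attempt, with up to 2x jitter
            # so simultaneous retries don't stampede the same endpoint.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The jitter matters once 150+ scrapers share sources: without it, every client that failed at the same moment retries at the same moment.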
Looking Back
What started as a web scraping project became a platform. Each phase solved a real problem — “we need current data” led to the crawler stack, “we need intelligent answers” led to ALFReD, “we need portfolio-scale operations” led to ATLAS, and “we need shareable deliverables” led to Studio. Along the way, the agent itself evolved internally: adaptive retrieval, structured public-data workflows, public API endpoints, and property-level intelligence were layered directly into ALFReD rather than treated as fully separate products.
The progression mirrors the broader evolution of LLM applications: from simple retrieval-augmented generation, to multi-tool agents, to distributed multi-agent systems, to full-stack AI-native workflows.