• Built LangGraph workflow: 4-step pipeline for scene detection, RAG retrieval, safety assessment, and image editing.
• Used GPT-4o / Gemini VLM to analyze home images, detecting room type, hazards, and safety features.
• Built Hybrid RAG (FAISS + BM25) retrieving 105 curated CDC/HSSAT guidelines with risk-level citations.
• Fused deterministic scoring rules with LLM reasoning to output 0-10 safety score and cited recommendations.
• Generated safety improvement previews via Gemini Nano Banana showing recommended modifications.
• Shipped FastAPI + React app with Docker deployment, supporting 11 scene types and real-time analysis.
Deep Dive — SafeLLM / Fall Risk Detection AI System (safellm3/safellm_deploy)
Scope note: this workspace contains multiple iterations (safellm/, safellm2/, safellm3/). The only directory with git history is safellm3/safellm_deploy/ (contains .git/), so this Deep Dive treats that as the “repo” for evidence and history.
1. What This Is (one paragraph)
SafeLLM is a deployable web app + API that takes a single photo of a home environment, classifies the scene into one of 11 fall‑risk categories, retrieves fall‑prevention guidelines from a small curated knowledge base, and returns a structured safety report (score, hazards, prioritized actions, cost/difficulty) plus an optional AI‑generated “visual improvements” image that overlays the recommended fixes. The repo also contains extracted guideline documents (e.g., CDC STEADI PDFs → markdown) under knowledge_base/processed/ for provenance/transparency.
2. Who It’s For + Use Cases
Primary users (as described in the repo docs):
Family caregivers assessing an elderly parent’s home for preventable fall hazards.
Clinicians / discharge planners doing a quick home safety pre‑screen.
Home modification services triaging what to fix first and estimating effort/cost.
Real estate / property managers evaluating accessibility and safety.
What “success” means (repo evidence + gaps):
Success (evidenced): system returns a structured report and can boot/build/test reliably (run_full_deploy_test.sh).
Success (inferred but not measured in repo): fewer missed hazards, fewer hallucinated hazards, actionable fixes, low latency, low cost.
Unknown (not found in repo): defined product metrics (accuracy, NPS, retention, clinical outcomes). Suggested metrics are in §8.
3. Product Surface Area (Features)
A. End‑user web experience (React)
Upload photo → POST /assess (multipart file) from the UI (frontend/src/App.jsx:27).
Live “Analyzing…” UI while waiting (non‑streaming; one blocking request).
Structured results page:
Score + risk level
Hazard lists (critical/important/minor)
Priority action plan
Cost + difficulty
Knowledge Base References section (shows which guidelines were retrieved and match %)
Visual Safety Improvements (polls backend until edited image is ready) (frontend/src/components/Results.jsx:43)
Print report via window.print() (frontend‑only).
B. Backend API (FastAPI)
User‑visible endpoints:
GET / serves the built frontend (frontend/dist/) if present, else returns API info (backend/api.py:235).
GET /health returns workflow status + active model configuration (backend/api.py:267).
POST /assess runs the core pipeline (steps 0–3 sync; step 4 async) (backend/api.py:290).
POST /scene-detect runs scene classification only (backend/api.py:529).
GET /categories returns supported scene categories (backend/api.py:587).
GET /stats returns curated KB chunk counts by category (backend/api.py:609).
GET /edit_status/{image_id} returns async image-edit job status (backend/api.py:642).
GET /edited_images/{image_id}_edited.png serves the edited image (backend/api.py:633).
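A minimal client sketch of this surface (the endpoint paths come from the list above; the multipart field name "file" and response fields such as image_id, safety_score, and status are assumptions, not confirmed from the repo):

```python
# Minimal client sketch for the endpoints above. Assumes the server runs on
# localhost:8000; the multipart field name "file" and the response fields
# (image_id, safety_score, risk_level, status) are illustrative guesses.
import time
import requests

BASE = "http://localhost:8000"

with open("test_images/bathroom2.jpg", "rb") as f:
    resp = requests.post(f"{BASE}/assess",
                         files={"file": ("bathroom2.jpg", f, "image/jpeg")})
resp.raise_for_status()
report = resp.json()
print(report.get("safety_score"), report.get("risk_level"))  # hypothetical fields

# Step 4 runs asynchronously: poll /edit_status until the preview is ready.
image_id = report.get("image_id")  # hypothetical field
while image_id:
    status = requests.get(f"{BASE}/edit_status/{image_id}").json()
    if status.get("status") in ("done", "error"):  # hypothetical status values
        break
    time.sleep(2)
if image_id and status.get("status") == "done":
    png = requests.get(f"{BASE}/edited_images/{image_id}_edited.png").content
    with open("preview.png", "wb") as out:
        out.write(png)
```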
C. Knowledge base tooling
Curated KB stored as structured markdown under knowledge_base/curated_knowledge/ and compiled into JSONL (knowledge_base/curated_chunks/metadata.jsonl) via knowledge_base/process_curated_knowledge.py.
FAISS index built via knowledge_base/create_curated_embeddings.py (OpenAI embeddings).
KB linting via knowledge_base/kb_lint.py + unit tests in tests/test_knowledge_base.py.
D. Deployment/build tooling
Dockerfile builds frontend + runs backend (Cloud Run‑style PORT support).
start_server.sh and start_server.bat for local startup (shell + Windows).
Constraints / caveats (evidence vs unknown):
No auth / no user accounts (evidenced by code search; no auth middleware; see §7).
Docs drift: multiple READMEs describe older model choices and paths (e.g., gpt‑4o vs gemini/gpt‑5; different env var names). Evidence is in file-level references in §11.
B. Repo inventory (top 2–3 levels, focus on runtime)
safellm3/safellm_deploy/
  backend/
    api.py                      # FastAPI server + endpoints
    workflow.py                 # 4-step workflow + providers + async image edit
  frontend/
    src/                        # React UI (upload/results/polling)
  knowledge_base/
    curated_knowledge/          # human-authored markdown hazards by scene
    curated_chunks/             # JSONL metadata for 105 curated chunks
    curated_embeddings/         # FAISS index used at runtime
    processed/                  # extracted source docs (CDC PDFs -> markdown) for transparency
  prompts/
    scene_detection_prompt.py
    safety_assessment_prompt.py
    image_editing_prompt.py
  tests/
    test_knowledge_base.py
  Dockerfile
  requirements.txt
  run_full_deploy_test.sh
  start_server.sh
  test_frontend.py              # manual integration script (skipped under pytest)
D. Key modules (what they do / why they matter)
backend/api.py: FastAPI app that owns the HTTP contract (uploads, responses, polling) and also serves the built SPA in production; it’s the main deployable surface and where reliability controls (EXIF fixes, cleanup, async jobs) live.
backend/workflow.py: Core orchestration for the 4-step pipeline (provider selection, determinism, retrieval wiring, image-edit generation); this is where most AI behavior is defined.
knowledge_base/curated_retrieval.py: Hybrid FAISS+BM25 retrieval that grounds the LLM in a small, scene-filtered knowledge base; it strongly shapes output relevance and consistency.
knowledge_base/process_curated_knowledge.py: “Compiler” from curated markdown → structured JSONL chunks; enforces enumerations (risk levels, hazard types) and creates stable IDs for retrieval.
knowledge_base/create_curated_embeddings.py: Builds the FAISS index used at runtime; without it, retrieval cannot load.
prompts/scene_detection_prompt.py: Defines the scene classifier output shape and allowed categories; constrains LLM1 to avoid adding noise.
prompts/safety_assessment_prompt.py: Defines strict JSON schema + scoring conventions for LLM2; primary control surface for hallucination and output stability.
prompts/image_editing_prompt.py: Converts a small structured “edit plan” into a constrained natural-language image prompt; drives consistent visuals across providers.
frontend/src/App.jsx: Upload handler and environment-based API routing (VITE_API_BASE vs localhost); defines the user flow into /assess.
frontend/src/components/Results.jsx: Results rendering and async polling for the edited image (/edit_status/{image_id}); defines the post-upload UX.
tests/test_knowledge_base.py: Unit tests that protect curated KB processing/validation from regressions.
run_full_deploy_test.sh: Repeatable “deploy simulation” (deps → pytest → build) that makes build confidence auditable.
test_frontend.py: Manual end-to-end script (uploads real images and polls image edits); useful for smoke testing but intentionally skipped in CI-style pytest runs.
C. Key runtime assumptions
Backend is single-process and keeps job status in memory (JOBS = {} in backend/api.py:154), so pending image edits are not durable across restarts.
File storage for uploads/edited images is local disk; cleanup is best-effort (24h window) (backend/api.py:60, called on startup and after successful /assess).
External providers must be reachable for /assess to complete fully (OpenAI embeddings always; Gemini/OpenAI for LLMs; optional OpenRouter/OpenAI Images for step 4).
5. Data Model
There is no database in this deployable repo. Data is stored as:
A. Runtime request state (in-memory)
Per-request workflow state is a Python dict matching WorkflowState in backend/workflow.py (contains image_base64, scene_category, retrieved_knowledge, hazards, etc.).
Async image editing job status is stored in an in-memory dict JOBS (backend/api.py:154, /edit_status/{image_id} at backend/api.py:642).
B. Runtime files (local disk)
uploads/<uuid>.<ext>: incoming images (kept at least long enough for async edit to run).
edited_images/<uuid>_edited.png: the “visual improvements” output image.
Cleanup: files older than 24h are deleted on startup and after successful /assess (backend/api.py:60, backend/api.py:183, backend/api.py:370).
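A minimal sketch of that 24-hour cleanup policy, assuming the uploads/ and edited_images/ directory names above; the function shape is illustrative, not the repo's actual _cleanup_old_images:

```python
# Illustrative stand-in for the 24h cleanup described above (the repo's actual
# implementation is _cleanup_old_images in backend/api.py around line 60).
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # 24h retention window

def cleanup_old_images(dirs=("uploads", "edited_images")) -> None:
    cutoff = time.time() - MAX_AGE_SECONDS
    for d in dirs:
        for path in Path(d).glob("*"):
            try:
                # Remove files last modified more than 24 hours ago.
                if path.is_file() and path.stat().st_mtime < cutoff:
                    path.unlink()
            except OSError:
                pass  # best-effort: ignore files deleted or locked concurrently
```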
C. Curated Knowledge Base (static artifacts in repo)
knowledge_base/curated_chunks/metadata.jsonl: 105 lines (one per chunk) with fields like chunk_id, category, hazard_name, risk_level, keywords, hazard_types, version. (The exact schema is visible by reading any JSONL line; see the processor in knowledge_base/process_curated_knowledge.py. An illustrative record is sketched after this list.)
knowledge_base/curated_embeddings/faiss_index/: FAISS index built from curated chunks.
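Illustrative metadata.jsonl record, shown as a Python dict using the field names listed above; all values are invented for illustration:

```python
# Hypothetical metadata.jsonl record; field names follow the list above,
# all values are invented.
example_chunk = {
    "chunk_id": "bathroom_001",
    "category": "bathroom",
    "hazard_name": "Missing grab bars at tub/shower",
    "risk_level": "critical",
    "keywords": ["grab bar", "tub", "shower", "transfer"],
    "hazard_types": ["missing_equipment"],
    "version": "1.0",
}
```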
D. “Raw/processed” source docs (transparency / provenance)
knowledge_base/raw_documents/ and knowledge_base/processed/: downloaded PDFs (e.g., CDC STEADI) and extracted markdown by category. Example file: knowledge_base/processed/indoor/kitchen/cdc_81518_DS1_extracted.md contains per-page text plus metadata header.
Note: the “raw documents” pipeline in this deployable folder references text_extractor.py (knowledge_base/process_documents.py:18) which is not present in safellm3/safellm_deploy/ (it exists in sibling directories). Running that pipeline here is Unknown (likely broken without copying that file).
Vector store: FAISS, loaded from disk (knowledge_base/curated_retrieval.py:39).
Operational impact:
Every retrieval likely requires embedding the query (cost + latency).
Optimization opportunity: the current retrieval query is category-driven and repeated per scene ("{category} fall safety hazards and improvements for elderly"), so embeddings can be cached per category.
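Because the query is fixed per category, the embedding can be memoized. A minimal sketch, assuming the openai>=1.x client and an illustrative embedding model name:

```python
# Sketch: memoize the category-driven query embedding so repeated /assess calls
# for the same scene type skip the remote embeddings call. Assumes the
# openai>=1.x client; the model name is illustrative, the query template is
# the one quoted above.
from functools import lru_cache
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@lru_cache(maxsize=32)  # one entry per scene category (11 categories today)
def category_query_embedding(category: str) -> tuple:
    query = f"{category} fall safety hazards and improvements for elderly"
    resp = client.embeddings.create(model="text-embedding-3-small", input=query)
    return tuple(resp.data[0].embedding)  # tuple so lru_cache can store it
```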
The backend performs basic “post-LLM” validation and attaches validation_warnings (citation count, summary length, hazard counts) (backend/workflow.py:843–895).
Important consistency note:
The current prompt/schema includes internal contradictions (e.g., schema enumerations that force confidence="high" while other parts mention low/medium; some post-processing expects different hazard field names). This is not a runtime crash in normal paths, but it is a maintainability risk (§9).
E. Visual feedback (image editing; Step 4)
Step 4 is optional and runs asynchronously in the API:
/assess schedules a background task and returns immediately (backend/api.py:405).
Job status is polled via /edit_status/{image_id} (backend/api.py:642).
How the image edit is generated (backend/workflow.py:969+):
Ask LLM2 for a small editing plan (JSON; ≤3 annotations) (backend/workflow.py:980+).
Convert that plan into a constrained natural-language image prompt (build_prompt_from_plan) (backend/workflow.py:1085, prompts/image_editing_prompt.py:150).
Call one of:
Gemini image generation (if IMAGE_EDIT_MODEL starts with gemini-)
OpenRouter image generation (if IMAGE_EDIT_MODEL contains /) (backend/workflow.py:1142, POST https://openrouter.ai/api/v1/chat/completions at backend/workflow.py:1188)
OpenAI Images edit API (client.images.edit) (backend/workflow.py:1272)
Save output to edited_images/<image_id>_edited.png and serve it via backend route (backend/workflow.py:1310+, backend/api.py:633).
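A simplified sketch of this schedule-then-poll pattern, assuming FastAPI BackgroundTasks and an in-memory job dict; names and status values are illustrative, not copied from backend/api.py:

```python
# Simplified sketch of the async step-4 pattern: /assess schedules a background
# job and returns immediately; /edit_status/{image_id} reports progress.
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
JOBS: dict = {}  # image_id -> {"status": "pending" | "done" | "error"}

def run_image_edit(image_id: str) -> None:
    try:
        # ... call the image-edit provider, write edited_images/{image_id}_edited.png ...
        JOBS[image_id] = {"status": "done",
                          "url": f"/edited_images/{image_id}_edited.png"}
    except Exception as exc:  # step 4 is non-fatal: record the error, don't raise
        JOBS[image_id] = {"status": "error", "detail": str(exc)}

@app.post("/assess")
async def assess(background_tasks: BackgroundTasks):
    image_id = str(uuid.uuid4())
    JOBS[image_id] = {"status": "pending"}
    background_tasks.add_task(run_image_edit, image_id)
    return {"image_id": image_id}  # text-report fields omitted for brevity

@app.get("/edit_status/{image_id}")
async def edit_status(image_id: str):
    return JOBS.get(image_id, {"status": "unknown"})
```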
F. Evaluation (what exists)
Evidenced evaluation/testing assets:
Unit tests for curated knowledge processing + linting: tests/test_knowledge_base.py.
A manual end-to-end integration script (skipped under pytest) that uploads test images and polls edit status: test_frontend.py (pytestmark skip at test_frontend.py:16).
Sample test images + saved API outputs: test_images/bathroom2.jpg, test_images/bathroom2_api_result.json, etc.
Not present (gap): regression tests for hallucination rate or grounding quality.
7. Reliability, Security, and Privacy
Reliability & correctness mechanisms
EXIF orientation normalization + downscaling on upload (addresses iPhone camera photos and reduces model token load): backend/api.py:94 with MAX_IMAGE_DIMENSION default 2048 (backend/api.py:91).
Retry logic for Gemini safety assessment (handles transient empty responses): in backend/workflow.py Step 3 (Gemini path).
Async image editing is non-fatal: Step 4 failures do not fail the whole assessment (backend/workflow.py:1320+).
Disk growth control: deletes uploads/edited images older than 24h (backend/api.py:60).
Fixed retrieval k=5 and a category-driven query (backend/workflow.py:551–557).
Strict JSON schemas in the prompts (prompt-level determinism).
Security posture (current)
No authentication / authorization. Anyone with network access to the backend can call /assess.
CORS is fully open (allow_origins=["*"]) (backend/api.py:52), which is convenient for demos but not safe for a public deployment without further controls.
File upload validation checks that the MIME type starts with image/ (backend/api.py:308) but does not enforce a server-side size limit (the frontend UI suggests “max 10MB”, but the backend does not appear to enforce it); a minimal server-side check is sketched after this list.
Prompt injection risk exists via image content (e.g., text in the image). Mitigations in repo are mostly prompt-level constraints + strict JSON output; there is no explicit “prompt injection sanitizer” or allowlist of visual evidence.
In-memory job state means attackers could potentially cause memory growth by spamming /assess (no rate limit).
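One way to close the size-limit gap noted above; a sketch assuming FastAPI's UploadFile, with the 10 MB value taken from the frontend hint:

```python
# Sketch: enforce the 10 MB limit server-side before running the workflow.
# The limit mirrors the frontend hint; the wiring here is illustrative.
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB, matching the UI hint

@app.post("/assess")
async def assess(file: UploadFile = File(...)):
    if not (file.content_type or "").startswith("image/"):
        raise HTTPException(status_code=400, detail="Only image uploads are accepted")
    data = await file.read()
    if len(data) > MAX_UPLOAD_BYTES:
        raise HTTPException(status_code=413, detail="Image exceeds the 10MB limit")
    # ... continue with EXIF normalization, downscaling, and the 4-step workflow ...
    return {"received_bytes": len(data)}
```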
Privacy & data handling
Local disk stores uploaded images and edited images at least temporarily (uploads are retained for async editing; best-effort cleanup after 24h).
Third-party processing:
OpenAI: embeddings (always) and optionally vision + image editing depending on env config.
Google Gemini: vision LLM steps if configured.
OpenRouter: image generation if configured.
Repo docs claim “images are not stored permanently” (README.md), but the actual server implementation stores them on disk and cleans them later. Treat “not stored permanently” as “not intended to be retained long-term,” not as “never written to disk.”
Likely slow parts: vision LLM calls + image generation; retrieval is local, but the embeddings call may be remote.
Evidence: backend/api.py:290, backend/workflow.py:551, backend/workflow.py:969.
Q2: How would you scale this backend horizontally?
Current blockers: in-memory JOBS state and local-disk image storage (not shared across instances).
Make job state durable: Redis / DB + queue (Celery/RQ) or managed task queue.
Move images to object storage (S3/GCS) with signed URLs.
Ensure retrieval artifacts are bundled or cached per instance; warm start loads FAISS index at boot.
Evidence: backend/api.py:154, backend/api.py:633, backend/workflow.py:275.
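A minimal sketch of durable job state with Redis, per the answer above; key naming and the TTL are assumptions:

```python
# Sketch: replace the in-memory JOBS dict with Redis so any instance can answer
# /edit_status/{image_id}. Key naming and the 24h TTL are assumptions.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_job_status(image_id: str, status: dict) -> None:
    r.set(f"job:{image_id}", json.dumps(status), ex=24 * 60 * 60)

def get_job_status(image_id: str) -> dict:
    raw = r.get(f"job:{image_id}")
    return json.loads(raw) if raw else {"status": "unknown"}
```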
Q3: How would you enforce tenant isolation / authentication?
Add auth layer (JWT/OAuth) at FastAPI, enforce per-user rate limits and storage namespaces.
Right now there is no auth; CORS is wildcard.
Evidence: backend/api.py:52 and no auth code found.
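One hedged option for that auth layer: a bearer-token dependency. The library choice (PyJWT) and the claim/secret names are assumptions, not anything the repo uses today:

```python
# Sketch: a bearer-token dependency that could gate /assess. PyJWT and the
# claim/secret names are illustrative choices.
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET = "change-me"  # would come from an env var / secret manager in practice

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
    try:
        claims = jwt.decode(creds.credentials, SECRET, algorithms=["HS256"])
        return claims["sub"]
    except (jwt.PyJWTError, KeyError):
        raise HTTPException(status_code=401, detail="Invalid or expired token")

@app.post("/assess")
async def assess(user_id: str = Depends(current_user)):
    # Per-user rate limits and storage namespaces (e.g., uploads/{user_id}/) hang off user_id.
    return {"user": user_id}
```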
Q4: What’s your data retention policy and how do you implement it?
Intended: temporary storage only; best-effort cleanup after 24 hours (_cleanup_old_images).
In practice: images are written to disk; deletions occur on startup and after successful /assess.
Hybrid weights are explicitly tuned (0.6/0.4) and threshold-filtered to avoid noise.
Curated KB is small and structured, improving precision vs open web scraping.
Evidence: knowledge_base/curated_retrieval.py:68, knowledge_base/curated_retrieval.py:191–194.
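A minimal sketch of weighted hybrid fusion as described above (0.6/0.4 plus a score threshold); the normalization scheme and threshold value are assumptions, not read from curated_retrieval.py:

```python
# Sketch of weighted hybrid fusion: dense (FAISS) and sparse (BM25) scores are
# min-max normalized, combined 0.6/0.4, then threshold-filtered.
def hybrid_rank(dense: dict, sparse: dict,
                w_dense: float = 0.6, w_sparse: float = 0.4,
                threshold: float = 0.3, k: int = 5) -> list:
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {cid: (s - lo) / span for cid, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    fused = {cid: w_dense * d.get(cid, 0.0) + w_sparse * s.get(cid, 0.0)
             for cid in set(d) | set(s)}
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [(cid, score) for cid, score in ranked if score >= threshold][:k]
```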
Q6: How do you control hallucinations?
Hard constraints: strict JSON schema outputs, consistent scoring rules, and “only report what is visible” instructions (prompt-level).
Retrieval grounding: the prompt injects retrieved guidelines and instructs explicit citations; backend warns when citations are missing.
Limitations: no automated grounding verifier; no image-region evidence mapping; confidence field currently may be forced high by schema.
Evidence: prompts/safety_assessment_prompt.py (schema + instructions), backend/workflow.py:843–895.
Q7: Why is the retrieval query category-driven instead of image-driven?
Chosen for determinism and stability: fixed query + fixed k reduces run-to-run variability.
Tradeoff: less adaptive retrieval; relies on small curated KB and LLM2 to choose relevant hazards.
Evidence: backend/workflow.py:551–557.
Q8: How would you evaluate this system offline?
Create a labeled dataset of images per scene + expected hazards and priority actions.
Compute:
scene classification accuracy
hazard precision/recall by risk tier
grounding score: % hazards mapped to retrieved guidelines
cost/difficulty calibration vs human baseline
Add regression tests that run on PRs with fixed seeds and stable snapshots.
Evidence: existing hooks: tests/test_knowledge_base.py, test_images/*, test_frontend.py.
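A sketch of two of these metrics on a hypothetical labeled example; the dataset shape and field names are invented:

```python
# Sketch of hazard precision/recall and a grounding score on one hypothetical
# labeled example; hazard names and chunk IDs are invented.
def hazard_precision_recall(expected: set, predicted: set) -> tuple:
    tp = len(expected & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    return precision, recall

def grounding_score(citations: dict) -> float:
    # Share of predicted hazards that cite a retrieved guideline chunk.
    if not citations:
        return 0.0
    return sum(1 for chunk_id in citations.values() if chunk_id) / len(citations)

# One labeled example (values invented):
p, r = hazard_precision_recall({"loose rug", "no grab bar"}, {"no grab bar", "clutter"})
g = grounding_score({"no grab bar": "bathroom_001", "clutter": None})
print(p, r, g)  # 0.5 0.5 0.5
```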
Debugging & reliability
Q9: Tell me about a tricky production bug you fixed.
Example evidenced in git history: “Root cause of iPhone direct camera upload failures” and Gemini robustness fixes.
Step 4 exceptions are caught and do not fail the main response; job status becomes error and frontend continues showing text results.
Evidence: backend/workflow.py:1320+, backend/api.py:405, backend/api.py:642.
Product sense
Q11: What’s the core user promise and how do you keep it?
Promise: “upload one photo, get actionable fall-prevention improvements.”
Risk: no explicit user feedback loop in product; no saved reports; no trust UI beyond “knowledge base references.”
Evidence: frontend/src/components/Results.jsx + KB reference section.
Q12: If you had to pick one metric to optimize first, what is it?
Suggest: “Actionability rate” (users who implement ≥1 recommendation) + “time-to-first-action.”
Instrument: track which priority actions were shown + user follow-up surveys.
Repo currently has no analytics; would need to add.
Evidence: no telemetry found in code search.
Behavioral
Q13: How did you manage ambiguity in requirements?
Built a small, high-precision curated KB (105 chunks) rather than scraping the web; kept scope to 11 scenes.
Added determinism controls to keep outputs stable for demos and testing.
Validated by unit tests for KB processing and a deploy simulation script.
Evidence: knowledge_base/curated_chunks/metadata.jsonl, backend/workflow.py:551, run_full_deploy_test.sh.
Q14: How do you communicate tradeoffs to stakeholders?
Example: “We use fixed retrieval to reduce variability, which may miss some edge hazards; roadmap adds image-driven query + eval harness.”
Backed by clear evidence anchors and a phased hardening plan (auth, storage, queue).
Evidence: backend/workflow.py:551–557, backend/api.py:154.
13. Roadmap (high-leverage upgrades)
Must (production hardening)
Add auth + rate limiting (protect /assess, mitigate abuse); restrict CORS in production (backend/api.py:52).
Make async image-edit jobs durable (queue + shared state) and move images to object storage (S3/GCS).
Unify docs with current code: model defaults, env var names (VITE_API_BASE vs older VITE_API_URL), deployment steps, and correct file paths.