How GeneE works

From seven public databases to a cited answer in your browser — here is the full pipeline.

Figure: GeneE system architecture. Seven public data sources flow through Python ingestion pipelines into PostgreSQL with pgvector and Typesense. A FastAPI backend on Railway combines stored data with Claude or Gemini RAG synthesis, serving the Next.js frontend on Vercel at genee.bio.

Diagram stages:
Sources: NCBI Gene, UniProt, Gene Ontology, PubTator 3.0, Open Targets, ClinVar, gnomAD
Ingestion: bulk + incremental pipelines (Python), weekly refresh
Storage: PostgreSQL + pgvector (gene data + 384-dim embeddings); Typesense (autocomplete + keyword)
API + AI: FastAPI on Railway (hybrid search + caching); Claude / Gemini RAG (every claim cites a PMID)
Frontend: Next.js on Vercel (genee.bio, ISR / SSG)

Static and on-demand: structured data lives in Postgres; AI summaries are pre-computed in batch and refreshed on demand.

The pipeline, step by step

1. Data aggregation
We pull structured gene data from seven authoritative public databases (NCBI Gene, UniProt, Gene Ontology, PubTator 3.0, Open Targets, ClinVar, and gnomAD) and unify them under a single canonical identifier, the NCBI Gene ID. Bulk FTP loads handle the initial ingest; incremental jobs keep the data fresh.
2. AI summarization with mandatory citations
For each gene, the most relevant PubMed abstracts are retrieved and fed to a large language model (Claude or Gemini). Every factual claim in the resulting summary must cite a specific PMID. Summaries that fail automated citation validation are withheld, and users see the structured data instead.
3. Hybrid search
Queries are matched using both keyword search (Typesense) for exact gene symbols and semantic search (pgvector with 384-dim embeddings) for natural-language questions like "genes involved in DNA repair." Results from both paths are merged via Reciprocal Rank Fusion.
4. Scheduled refresh
A weekly pipeline refreshes ClinVar pathogenic variants every Sunday. Open Targets drugs and disease associations are re-pulled monthly, which is frequent enough to catch each quarterly Open Targets release. AI summaries are regenerated on demand when a corpus snapshot warrants it.
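
The merge in the hybrid search step can be sketched in a few lines of Python. This is a minimal illustration of Reciprocal Rank Fusion, not GeneE's actual code: the function name, the example result lists, and the constant k = 60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first ranked lists of IDs into one fused ranking.

    Each item scores 1 / (k + rank) in every list where it appears;
    the scores are summed across lists. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists, best match first.
keyword_hits = ["BRCA1", "BRCA2", "ATM"]    # stands in for the Typesense path
semantic_hits = ["ATM", "BRCA1", "RAD51"]   # stands in for the pgvector path

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

In this example the two genes returned by both paths (BRCA1 and ATM) rise above the genes returned by only one. Because the fusion uses ranks rather than raw scores, the keyword and vector scores never need to be put on a common scale.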

Why every claim has a citation

Large language models can fabricate plausible-sounding biology when asked open-ended questions. GeneE constrains the model in two ways: it only ever sees PubMed abstracts that PubTator has already linked to the gene in question, and it must attach a PMID to each claim. An automated validator then verifies that every cited PMID actually exists and that chromosome and disease claims match the structured data we already hold.

Summaries that fail this check are not displayed. Where an AI summary is unavailable, gene pages fall back to the original NCBI summary text or the structured data alone.
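
A validator of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not GeneE's implementation: the `(PMID: 123)` citation format, the naive sentence splitting, the function name, and the placeholder PMIDs are hypothetical. "PMID exists" is approximated here by membership in the set of abstracts supplied to the model, and only the chromosome check is shown.

```python
import re

# Assumed citation format: "(PMID: 12345)". The real format may differ.
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")
CHROM_PATTERN = re.compile(r"chromosome\s+(\d{1,2}|X|Y|MT)\b", re.IGNORECASE)


def validate_summary(summary, allowed_pmids, chromosome):
    """Return a list of problems; an empty list means the summary passes."""
    problems = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    for sentence in sentences:
        cited = PMID_PATTERN.findall(sentence)
        if not cited:
            problems.append(f"uncited claim: {sentence}")
        for pmid in cited:
            if pmid not in allowed_pmids:
                problems.append(f"unknown PMID {pmid}")
        # Chromosome mentions must agree with the structured record.
        for chrom in CHROM_PATTERN.findall(sentence):
            if chrom.upper() != chromosome.upper():
                problems.append(
                    f"chromosome {chrom} contradicts structured value {chromosome}"
                )
    return problems


# Placeholder PMIDs ("111", "222"), for illustration only.
draft = (
    "BRCA1 lies on chromosome 17 (PMID: 111). "
    "It helps repair DNA (PMID: 222). "
    "It also cures everything."
)
issues = validate_summary(draft, allowed_pmids={"111"}, chromosome="17")
print(issues)
```

A summary would be shown only when the returned list is empty; otherwise the page falls back to the NCBI summary text or the structured data, as described above.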

Data sources

Full database list, license terms, and attribution are on the Data Sources & Attribution page.