
Clinical AI Summarizer

An extractive clinical summarization engine built on fine-tuned SciBERT. Built to solve information overload for doctors reading EMR notes — without the hallucination risk that makes generative LLMs dangerous in healthcare.

ML Engineer
March 2026
ML / HealthTech
Clinical AI Summarizer overview
43.75%
ROUGE-1 Improvement
0.3782
Final ROUGE-1 Score
0%
Hallucination Risk
Live
Deployed on HF Spaces

The Challenge

Doctors spend an agonizing amount of time reading through lengthy EMR notes, discharge summaries, and lab findings — causing fatigue and delaying critical decisions. The obvious fix is to pass the text to an LLM like ChatGPT. But in a clinical setting, generative LLMs are dangerous: they use abstractive summarization, generating brand new text. If an AI hallucinates a symptom, a timeline, or a drug dosage, it could cost a life.

The Solution

An extractive summarization engine — not a generative one. Instead of writing new text, the engine acts as a highly intelligent highlighter: it scores every sentence by clinical salience and returns only the most critical ones directly from the source material, guaranteeing that every sentence in the summary exists verbatim in the source. Fine-tuned SciBERT (pre-trained on biomedical literature with its own SciVocab vocabulary) powers the scoring, deployed as a FastAPI inference API on Hugging Face Spaces.

What I Built

01

Phase 1 — Built an unsupervised K-Means baseline using raw SciBERT 768-dimensional embeddings, establishing a ROUGE-1 score of 0.2631 as the starting benchmark

02

Phase 2 — Wrote a Greedy Matching algorithm to generate extractive sentence labels from the ccdv/pubmed-summarization dataset, which ships only human-written abstracts and no extractive labels

03

Phase 3 — Implemented BERTsum architecture: inserted [CLS]/[SEP] tokens around every sentence, added interval segment embeddings, and attached a custom PyTorch classification head (BioExtractor) for binary salience scoring

04

Fine-tuned the full model on a Tesla T4 GPU in Google Colab, achieving a ROUGE-1 of 0.3782 — a 43.75% improvement over baseline

05

Phase 4 — Built a FastAPI async inference API with lazy-loading of custom .pt weights into the SciBERT skeleton, deployed to Hugging Face Spaces

06

Implemented chronological re-sorting of extracted sentences before returning the final summary — preserving clinical timeline integrity, which is critical for safe medical use
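The Phase 3 input construction can be sketched in a few lines. This is a hypothetical helper (`build_bertsum_input` is not the project's actual code, and real inputs would go through the SciBERT tokenizer): it wraps every sentence in [CLS]/[SEP] and assigns alternating interval segment ids so the model can tell adjacent sentences apart, with each [CLS] position later scored by the classification head.

```python
def build_bertsum_input(sentences):
    """BERTsum-style input prep for a list of pre-tokenized sentences.

    Returns (tokens, segment_ids, cls_positions): every sentence is wrapped
    in [CLS]/[SEP], segment ids alternate 0/1 per sentence (the "interval
    segment embeddings"), and cls_positions records where each sentence's
    [CLS] token sits so its vector can be fed to the salience classifier.
    """
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        seg = i % 2                        # alternate 0, 1, 0, 1, ...
        cls_positions.append(len(tokens))  # this [CLS] scores this sentence
        wrapped = ["[CLS]"] + sent + ["[SEP]"]
        tokens.extend(wrapped)
        segment_ids.extend([seg] * len(wrapped))
    return tokens, segment_ids, cls_positions
```

In the full model, the hidden state at each `cls_positions` index is what the BioExtractor head maps to a 0/1 salience score.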

Clinical AI Summarizer architecture

The Story

Phase 1 was the lazy route — an unsupervised K-Means pipeline using raw SciBERT embeddings to cluster sentences, picking the centroid of each cluster as a summary sentence. ROUGE-1 came in at 0.2631. The problem: K-Means is context-blind. It groups by mathematical distance, not clinical importance.

Phase 2 was hacking the labels. The ccdv/pubmed-summarization dataset on HuggingFace only had human-written abstracts — no extractive labels. So I wrote a Greedy Matching algorithm: for every sentence across 200 training records, it calculated how much the ROUGE score would increase if that sentence were included. It greedily assembled the combination of original sentences that best matched the human abstract, labeling each sentence 1 (include) or 0 (ignore).

Phase 3 was the heavy lifting. Standard SciBERT produces one embedding for a whole document — useless for sentence-level scoring. I implemented the BERTsum architecture: inserted [CLS] and [SEP] tokens around every individual sentence, added interval segment embeddings so the model could distinguish adjacent sentences, then chopped off the default model head and attached a custom PyTorch nn.Linear classification layer (BioExtractor). After fine-tuning on a Tesla T4 GPU in Google Colab, ROUGE-1 jumped to 0.3782 — a 43.75% improvement. The model had learned to think like a clinician: prioritizing diagnoses, outcomes, and dosages over general fluff.

Phase 4 was deployment. A model in a Colab notebook is useless in production. I built a FastAPI async inference API, but deploying a custom PyTorch architecture isn't like calling AutoModel.from_pretrained() — standard HuggingFace pipelines don't know about the custom BioExtractor head. I had to save the fine-tuned weights as a .pt file, upload it directly into the Hugging Face Space, and write a lazy-loading function that injects the weights into the SciBERT skeleton when the first request arrives.
One final touch: before returning the top N sentences, I sort them back into their original chronological index. In medicine, the timeline is everything — a summary that puts the diagnosis before the symptom creates dangerous clinical confusion.
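That final touch is a two-sort routine: rank by salience to pick the top N, then restore document order before returning. A minimal sketch (the function name `order_summary` and the tuple layout are illustrative, not the project's exact API):

```python
def order_summary(scored_sentences, top_n=3):
    """Pick the top-N sentences by salience, then restore document order.

    scored_sentences: list of (original_index, sentence, salience_score).
    """
    # First sort: highest clinical salience wins.
    top = sorted(scored_sentences, key=lambda s: s[2], reverse=True)[:top_n]
    # Second sort: back into chronological order, so the summary never
    # places a diagnosis before the symptom that prompted it.
    top.sort(key=lambda s: s[0])
    return [sentence for _, sentence, _ in top]
```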

What I Learned

Generative AI is not the right tool for every problem — and knowing when not to use it is more valuable than knowing how to prompt it well. Labeling strategy matters more than model architecture: without the greedy matching step, there's no training signal. Deploying custom PyTorch architectures requires explicitly defining the model class skeleton in your serving code — you can't rely on HuggingFace's standard pipeline abstractions. And chronological ordering, a single sort() call, turned out to be the most clinically important line of code in the whole project.
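The greedy labeling step that creates that training signal can be sketched as follows. This is a simplified reconstruction, not the project's code: `rouge1_f` is a plain unigram-overlap F1 standing in for a real ROUGE implementation, and the function names are hypothetical.

```python
from collections import Counter


def rouge1_f(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 (a simplified ROUGE-1 stand-in)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum(
        (Counter(candidate_tokens) & Counter(reference_tokens)).values())
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def greedy_labels(sentences, abstract, max_picks=3):
    """Label each source sentence 1/0 for extractive training.

    Greedily add whichever unpicked sentence most improves ROUGE-1
    against the reference abstract; stop when no sentence helps.
    """
    ref = abstract.split()
    picked, labels, best = [], [0] * len(sentences), 0.0
    for _ in range(max_picks):
        gains = []
        for i, s in enumerate(sentences):
            if labels[i]:
                continue
            # Candidate summary = picked sentences plus this one,
            # in document order.
            cand = [t for j in sorted(picked + [i])
                    for t in sentences[j].split()]
            gains.append((rouge1_f(cand, ref), i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:   # no sentence improves the score: stop
            break
        best, labels[i] = score, 1
        picked.append(i)
    return labels
```

Run on a toy record, the clinically relevant sentences get labeled 1 and the filler gets 0, which is exactly the binary target the BioExtractor head is trained against.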

Technologies Used

Python · PyTorch · SciBERT · BERTsum · HuggingFace · FastAPI · SpaCy · K-Means · Google Colab · Hugging Face Spaces
