IntelMedica
AI Datasets

Five open datasets. 461,600 clinical sentences. 184 hours of synthetic medical speech. All DOI-registered, all available to the global medical AI research community.

Curated by physicians, vetted for clinical accuracy.

Open datasets: 5
Clinical sentences: 461,600
Synthetic medical speech: 184 hours
Citable artifacts: DOI-registered

Open Datasets

Five medical AI datasets

All DOI-registered. CC-BY-NC-4.0 (research-friendly).

Each dataset is gated. You need a Hugging Face account, and access requests are approved manually, usually within a day. Request access from the dataset's card.
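Once access has been approved, a gated dataset can typically be loaded with the Hugging Face `datasets` library and a personal access token. The following is a minimal sketch, not an official loader: the helper name `load_gated_dataset`, the `<org>` placeholder in the repository ID, and the `train` split name are assumptions for illustration, since this page does not state the organization namespace or the split names.

```python
import os
from typing import Optional


def load_gated_dataset(repo_id: str, split: str = "train", token: Optional[str] = None):
    """Load one split of a gated Hugging Face dataset.

    The account that owns the token must already have been granted access.
    """
    # Fall back to the HF_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF_TOKEN or pass token= to read a gated dataset.")

    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, token=token)


# Hypothetical usage; replace <org> with the real namespace shown on the dataset card:
# ds = load_gated_dataset("<org>/medical-tts-parquet-2-16khz")
```

If the request is still pending, the download fails with an authorization error, so it is worth confirming approval on the dataset card first.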

medical-tts-parquet-2-16khz
License: CC-BY-NC-4.0

Synthetic medical speech for fine-tuning Whisper-based ASR. 19 voices across 3 English accent groups. Sourced from RxNorm and FDA.

Audio-text pairs: 101,475
Duration: 184.1 hrs
Sample rate: 16 kHz mono
DOI: 10.57967/hf/8264
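The figures on this card imply an average clip length, which is useful when budgeting batch sizes for ASR fine-tuning. The arithmetic below is a rough sketch using only the three numbers stated above. It yields an average, not a distribution, since the card does not report minimum or maximum clip lengths.

```python
PAIRS = 101_475          # audio-text pairs (from the card)
DURATION_HOURS = 184.1   # total duration (from the card)
SAMPLE_RATE_HZ = 16_000  # 16 kHz mono (from the card)

total_seconds = DURATION_HOURS * 3600
mean_clip_seconds = total_seconds / PAIRS
mean_clip_samples = mean_clip_seconds * SAMPLE_RATE_HZ

print(f"mean clip length: {mean_clip_seconds:.2f} s")
print(f"mean samples per clip: {mean_clip_samples:,.0f}")
```

The mean works out to roughly 6.5 seconds per clip, well inside the 30-second input window that Whisper models process at 16 kHz.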

nursing-sentences-1
License: CC-BY-NC-4.0

Synthetic nursing-specific clinical documentation sentences. 70/15/15 stratified split for ML training.

Total rows: 40,247
Train / Val / Test: 28K / 6K / 6K
DOI: 10.57967/hf/8425
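For readers unfamiliar with the term, a 70/15/15 stratified split divides each category of rows separately, so train, validation, and test keep roughly the same category mix. The sketch below illustrates the general technique only. It is not the procedure used to build this dataset, and the `category` field, its values, and the function name `stratified_split` are hypothetical, since the card does not name the stratification key or the columns.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split rows into train/val/test, applying the ratios within each stratum."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        # Whatever remains goes to test, so no row is dropped.
        test.extend(members[n_train + n_val:])
    return train, val, test


# Hypothetical rows: 250 "medication" and 750 "vitals" sentences.
rows = [
    {"text": f"sentence {i}", "category": "vitals" if i % 4 else "medication"}
    for i in range(1000)
]
train, val, test = stratified_split(rows, key=lambda r: r["category"])
print(len(train), len(val), len(test))
```

Splitting per stratum keeps rare categories represented in the validation and test sets, which a plain random split does not guarantee.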

physician-sentences-1
License: CC-BY-NC-4.0

Synthetic physician-specific clinical documentation sentences for training medical ASR and language models.

Total rows: 107,906
Train / Val / Test: 75K / 16K / 16K
DOI: 10.57967/hf/8426

general-medical-sentences-1
License: CC-BY-NC-4.0

Synthetic general medical terminology sentences for broad clinical ASR use across specialties.

Total rows: 313,447
Train / Val / Test: 219K / 47K / 47K
DOI: 10.57967/hf/8424
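The headline figure of 461,600 clinical sentences can be checked against the three sentence datasets listed above. The short sketch below sums the stated row counts and estimates split sizes. Applying 70/15/15 to all three datasets is an assumption: only the nursing card states that ratio, although the rounded train, validation, and test counts on the other two cards are consistent with it.

```python
rows = {
    "nursing-sentences-1": 40_247,
    "physician-sentences-1": 107_906,
    "general-medical-sentences-1": 313_447,
}

total = sum(rows.values())
print(f"total sentences: {total:,}")  # 461,600

for name, n in rows.items():
    # Assumed 70/15/15 split; the held-out remainder is divided between val and test.
    n_train = round(n * 0.70)
    n_val = (n - n_train) // 2
    n_test = n - n_train - n_val
    print(f"{name}: ~{n_train:,} train, ~{n_val:,} val, ~{n_test:,} test")
```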

Legacy / reference

medical-tts-parquet-1
Status: Deprecated

Original 24 kHz dataset, superseded by medical-tts-parquet-2-16khz. Kept for reference.

DOI: 10.57967/hf/8249

License and commercial use

All datasets are released under CC-BY-NC-4.0. Free for academic and research use, with attribution.

Commercial deployment requires a separate license. Contact IntelMedica to discuss commercial access.

For research purposes only. Not a medical device. Not intended for clinical decision-making.