Data & API Resources
A curated collection of datasets and APIs to power your research and data science workflows. Each resource includes Python integration details and access information.
Name | URL | Type | Access | Documentation URL | Licensing Information | Category | Notes | Python Support |
---|---|---|---|---|---|---|---|---|
Kaggle Datasets Platform | https://www.kaggle.com/datasets | Dataset | Free (with account) | https://www.kaggle.com/docs/api | Varies by dataset (CC0, CC-BY, etc.) | Machine Learning | Large repository of user-contributed datasets across domains; the Kaggle API lets Python users download data. | Yes |
UCI Machine Learning Repository | https://archive.ics.uci.edu/ | Dataset | Free | https://archive.ics.uci.edu/ | CC BY 4.0 | Machine Learning | Collection of 600+ datasets for ML algorithms; widely used for benchmarking and teaching. | Yes |
OpenML | https://www.openml.org/ | Dataset/API | Free | https://docs.openml.org/ | Open data platform (licenses vary) | Machine Learning | Open platform for sharing datasets, models, and experiments. Provides a Python API (OpenML-Python). | Yes |
Hugging Face Datasets Hub | https://huggingface.co/datasets | Dataset | Free | https://huggingface.co/docs/datasets | Apache 2.0 (library); dataset licenses vary | NLP/CV | Large collection of ready-to-use ML datasets accessible via the Python `datasets` library. | Yes |
TensorFlow Datasets (TFDS) | https://www.tensorflow.org/datasets | Dataset | Free | https://www.tensorflow.org/datasets/overview | Apache 2.0; original dataset licenses vary | Machine Learning | Catalog of ML datasets (images, text, audio, etc.) accessible via Python. Integrates with TensorFlow and other frameworks. | Yes |
OpenAI API (GPT models) | https://openai.com/api/ | API | Paid (pay-as-you-go) | https://platform.openai.com/docs/ | Proprietary (OpenAI terms) | NLP | Provides AI models (GPT, DALL·E, etc.) via a REST API. A Python client supports integration in notebooks. | Yes |
Wikidata SPARQL API | https://query.wikidata.org/ | API/Dataset | Free | https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service | CC0 1.0 Public Domain | Knowledge Graphs | Collaborative knowledge base of 100+ million items, queryable through a SPARQL endpoint. Content is in the public domain. | Yes |
Data.gov (US Gov Open Data) | https://data.gov | Dataset | Free | https://data.gov | Public domain or open licenses | Government Data | Portal aggregating hundreds of thousands of U.S. government datasets. Often analyzed with Python. | Yes |
Nasdaq Data Link (Quandl) | https://data.nasdaq.com/ | API/Dataset | Free & Paid | https://docs.data.nasdaq.com/ | Free (some); premium subscriptions | Quant Finance | Aggregator of financial, economic, and alternative datasets. Many free time series, plus premium data. REST API and Python SDK. | Yes |
Alpha Vantage | https://www.alphavantage.co/ | API | Free (limited) & Paid | https://www.alphavantage.co/documentation/ | Free for personal/academic; paid for more | Financial Data | Stock/forex/crypto market data API. Free tier allows limited calls; paid plans raise the rate limits. Widely used in Python. | Yes |
FRED (Federal Reserve Data) | https://fred.stlouisfed.org/ | API/Dataset | Free (API key) | https://fred.stlouisfed.org/docs/api/fred/ | Public domain (U.S. gov) | Macro-Economics | Macro-economic time series (interest rates, GDP, etc.). Free with registration. Often accessed via the Python `fredapi` package. | Yes |
Bloomberg API (BLPAPI) | https://www.bloomberg.com/professional/support/api-library/ | API | Paid (Terminal subscription) | https://bloomberg.github.io/blpapi-docs/ | Proprietary (Bloomberg T&Cs) | Market Data | Enterprise financial data API for Bloomberg Terminal users (~$24k/yr). Python access via `blpapi`. Strict usage terms. | Yes |
SEC EDGAR Filings API | https://data.sec.gov/ | API | Free | https://www.sec.gov/edgar/sec-api-documentation | Public domain (SEC data) | Regulatory Filings | REST API for corporate filings (10-K, 10-Q, etc.). No key required, but requests must set a User-Agent header. Commonly used with Python for financial/NLP analysis. | Yes |
NASA Open APIs | https://api.nasa.gov | API | Free (API key) | https://api.nasa.gov | Public domain (NASA) | Astronomy/Earth Sci | Catalog of NASA APIs for space/earth data. Free API key for higher rate limits. Used in Python for education, astrophysics, etc. | Yes |
CERN Open Data Portal | http://opendata.cern.ch | Dataset | Free | http://opendata.cern.ch/docs/guide | CC0 or CC BY (varies) | High Energy Physics | Real LHC experimental data for particle physics. Often analyzed in Python (ROOT, Scikit-HEP). | Yes |
Sloan Digital Sky Survey (SDSS) | https://www.sdss.org | Dataset/API | Free (optional registration) | http://skyserver.sdss.org | Public domain (astronomy) | Astrophysics Survey | Astronomical survey data (spectra, images). SQL queries via SkyServer/CasJobs. Widely used in Python (AstroPy, Astroquery). | Yes |
CDS VizieR Catalog Service | http://vizier.cds.unistra.fr | Dataset/API | Free | https://vizier.cds.unistra.fr/vizier-doc/ | Open access catalogs (citation required) | Astrophysical Catalogs | Thousands of published astronomical catalogs. Query by object or coordinates. Often used with Python (Astroquery). Original sources must be cited. | Yes |
Materials Project | https://materialsproject.org | API/Dataset | Free (login) | https://docs.materialsproject.org/ | CC BY 4.0 (data) | Materials Science | Computed properties for inorganic materials (band gaps, etc.). Free REST API with key. Often used with Python (pymatgen). | Yes |
PhysioNet | https://physionet.org | Dataset | Free (some credentialed) | https://physionet.org/about/ | Often open-access; some restricted | Biomedical Signals | Physiological/clinical datasets (ECG, EEG, etc.). Many are open; some need credentialing. Widely used in Python (WFDB, Pandas). | Yes |
NCBI Entrez (E-utilities) | https://www.ncbi.nlm.nih.gov | API | Free | https://www.ncbi.nlm.nih.gov/home/develop/api/ | Public domain (NLM/NCBI) | Bioinformatics | Public APIs for searching biological databases (PubMed, GenBank, etc.). Often used with Biopython or `requests`. | Yes |
UniProt Knowledgebase API | https://www.uniprot.org | API/Dataset | Free | https://www.uniprot.org/help/api_queries | CC BY 4.0 (UniProt data) | Proteomics | Protein sequence and annotation database. REST and SPARQL APIs. Commonly used with Python (Biopython, Bioservices). | Yes |
RCSB Protein Data Bank (PDB) | https://www.rcsb.org | API/Dataset | Free | https://data.rcsb.org/#documentation | Public domain (citation suggested) | Structural Biology | 3D biomolecular structures. No usage restrictions, but citing PDB is encouraged. Used with Python (Biopython, PyMOL, etc.). | Yes |
Cancer Genomics Data Commons (GDC) | https://gdc.cancer.gov | API/Dataset | Free & controlled | https://docs.gdc.cancer.gov/API/Users_Guide/ | NIH/NCI public domain; some restricted | Cancer Genomics | Portal for cancer genomics (TCGA, etc.). Open-access data via REST and Python SDK; raw genomic data is controlled-access. | Yes |
IEEE DataPort | https://ieee-dataport.org | Dataset | Free/Paid (subscription) | https://ieee-dataport.org | Varies by dataset | Engineering Data | Repository for electrical engineering and related fields. Up to 2TB free upload, but downloads often need a subscription. Common in signal/image processing. | Yes |
LibriSpeech ASR Corpus | http://www.openslr.org/12 | Dataset | Free | http://www.openslr.org/12 | Public Domain (LibriVox) | Speech Recognition | ~1000 hours of English speech audio with transcripts for ASR, derived from LibriVox audiobooks. Widely used in Python (TensorFlow, PyTorch). | Yes |
Mozilla Common Voice | https://commonvoice.mozilla.org | Dataset | Free | https://commonvoice.mozilla.org/datasets | CC0 1.0 (public domain) | Speech Recognition | Crowdsourced multilingual speech dataset. Released under CC0. Commonly used in Python (Torchaudio, etc.). | Yes |
ImageNet | http://www.image-net.org | Dataset | Free (research only) | http://www.image-net.org/download | Non-commercial research only | Computer Vision | 14+ million labeled images across 20k categories. Free for non-commercial research; users must accept the terms of access. Fueled deep learning breakthroughs. | Yes |
COCO (Common Objects in Context) | https://cocodataset.org | Dataset | Free | https://cocodataset.org/#home | Images (Flickr), labels CC BY 4.0 | Computer Vision | ~330K images with dense annotations. Widely used in CV tasks. Python tools (`pycocotools`) for loading and evaluation. | Yes |
OR-Library (Beasley's OR Library) | http://people.brunel.ac.uk/~mastjjb/jeb/orlib.html | Dataset | Free | http://people.brunel.ac.uk/~mastjjb/jeb/info.html | Free for research use | Operations Research | Collection of benchmark datasets for TSP, set covering, bin packing, etc. Plain text files. Commonly used with Python OR-Tools, PuLP, etc. | Yes |
TSPLIB | http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ | Dataset | Free | http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ | Public domain (benchmark instances) | Combinatorial Opt | Standard library of TSP and related problems. Widely used to test optimization algorithms. Data parseable in Python (NetworkX, TSPLIB parsers). | Yes |
NEOS Optimization Server | https://neos-server.org/neos/ | API | Free | https://neos-guide.org/neos/ | Free service (academic/public) | Optimization | Cloud-based solver service. Users submit LP/IP/NLP models; NEOS runs them on hosted solvers. Often used with Python Pyomo or similar. | Yes |
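
As an illustration of the access notes in the table, the SEC EDGAR row says that no key is needed but a user agent must be set. The following is a minimal sketch of that pattern using only the Python standard library. The `submissions/CIK##########.json` path, the helper names `build_edgar_request` and `fetch_json`, and the contact string are assumptions for illustration, not taken from the table; check the EDGAR documentation linked above before relying on them.

```python
import json
import urllib.request

# Base URL from the table; the "submissions" path below is an assumption
# based on the public EDGAR API docs, not something stated in the table.
EDGAR_BASE_URL = "https://data.sec.gov"


def build_edgar_request(cik: int, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a request for one company's filing index.

    `cik` is the company's Central Index Key; it is zero-padded to
    10 digits in the URL. `user_agent` should identify you and include
    a contact address.
    """
    if not user_agent.strip():
        raise ValueError("EDGAR requests need a descriptive User-Agent")
    url = f"{EDGAR_BASE_URL}/submissions/CIK{cik:010d}.json"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request is offline; only fetch_json touches the network.
example_request = build_edgar_request(320193, "Example Research admin@example.com")
```

Passing `example_request` to `fetch_json` would perform the actual call. Keeping request construction separate from sending makes the header requirement easy to check without hitting the service, and the same split works for the other key-free APIs listed above.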