List of Public Datasets, from GitHub

NeuroTimes | Oct. 29, 2015



Agriculture

U.S. Department of Agriculture's PLANTS Database



Biology

1000 Genomes

American Gut (Microbiome Project)

Collaborative Research in Computational Neuroscience (CRCNS)

Gene Expression Omnibus (GEO)

Gene Ontology (GO)

Global Biotic Interactions (GloBI)

Sequence Read Archive (SRA)

EBI ArrayExpress

ENCODE project

Human Microbiome Project (HMP)

ICOS PSP Benchmark

MIT Cancer Genomics Data

NIH Microarray data (FTP)

OpenSNP genotypes data

Pathguide: Protein-Protein Interactions Catalog

Protein Data Bank

PubChem Project

PubGene (now Coremine Medical)

Stanford Microarray Data

The Personal Genome Project or PGP

UCSC Public Data


The Catalogue of Life



Climate/Weather

Australian Weather

Brazilian Weather - Historical data (In Portuguese)

Canadian Meteorological Centre

Climate Data from UEA (updated monthly)

Global Climate Data Since 1929

NASA Global Imagery Browse Services

NOAA Bering Sea Climate

NOAA Climate Datasets

NOAA Realtime Weather Models

The World Bank Open Data Resources for Climate Change

UEA Climatic Research Unit

WU Historical Weather Worldwide


Complex Networks

CrossRef DOI URLs

DBLP Citation dataset

NBER Patent Citations

NIST complex networks data collection

Protein-protein interaction network

PyPI and Maven Dependency Network

Scopus Citation Database

Small Network Data

Stanford GraphBase (Steven Skiena)

Stanford Large Network Dataset Collection

The Koblenz Network Collection

The Laboratory for Web Algorithmics (UNIMI)

The Nexus Network Repository

UCI Network Data Repository

UFL sparse matrix collection

WSU Graph Database


Computer Networks

3.5B Web Pages from CommonCrawl 2012

53.5B Web clicks of 100K users in Indiana Univ.

CAIDA Internet Datasets

ClueWeb09 - 1B web pages

ClueWeb12 - 733M web pages

CommonCrawl Web Data over 7 years

CRAWDAD Wireless datasets from Dartmouth Univ.

Criteo click-through data

Open Mobile Data by MobiPerf

UCSD Network Telescope, IPv4 /8 net


Contextual Data

Context-aware data sets from five domains or GitHub


Data Challenges

Challenges in Machine Learning

D4D Challenge of Orange


DrivenData Competitions for Social Good

ICWSM Data Challenge (since 2009)

Kaggle Competition Data

KDD Cup by Tencent 2012

Localytics Data Visualization Challenge

Netflix Prize

Space Apps Challenge

Telecom Italia Big Data Challenge

Yelp Dataset Challenge



Economics

American Economic Association (AEA)

EconData from UMD

Internet Product Code Database


Finance

CBOE Futures Exchange

Google Finance

Google Trends



OSU Financial data


St Louis Federal

Yahoo Finance



Geology

USGS Earthquake Archives

Smithsonian Institution Global Volcano and Eruption Database



GeoSpace/GIS

BODC - marine data of ~22K vars

Cambridge, MA, US, GIS data on GitHub

EOSDIS - NASA's earth observing system data

Factual Global Location Data

Geo Spatial Data from ASU

GeoNames Worldwide

Global Administrative Areas Database (GADM)

Landsat 8 on AWS

Natural Earth - vectors and rasters of the world

OpenStreetMap (OSM)

TIGER/Line - U.S. boundaries and roads

TwoFishes - Foursquare's coarse geocoder

TZ Timezones shapefiles

World countries in multiple formats

List of all countries in all languages




Government

Antwerp, Belgium

Austin, TX, US

Australia (abs.gov.au)

Australia (data.gov.au)

Austria (data.gv.at)



Cambridge, MA, US



Dallas Open Data

Denver Open Data

Durham, NC Open Data

England LGInform






Ghent, Belgium

Glasgow, Scotland, UK

Guardian world governments

Houston Open Data

Indian Government Data

Indonesian Data Portal

London Datastore, UK

Los Angeles Open Data

MassGIS, Massachusetts, U.S.



New Zealand

NYC betanyc

NYC Open Data



Open Government Data (OGD) Platform India

Rio de Janeiro, Brazil


San Francisco Data sets


Singapore Government Data

South Africa


The World Bank

Texas Open Data

Puerto Rico Government

U.K. Government Data


U.S. American Community Survey

U.S. CDC Public Health datasets

U.S. Census Bureau

U.S. National Center for Education Statistics (NCES)

U.S. Department of Housing and Urban Development (HUD)

U.S. Federal Government Agencies

U.S. Federal Government Data Catalog

U.S. Food and Drug Administration (FDA)

U.S. Open Government

UK 2011 Census Open Atlas Project

United Nations

Vancouver, BC Open Data Catalog



Healthcare

EHDP Large Health Data Sets

Gapminder World, demographic databases

Medicare Coverage Database (MCD), U.S.

Medicare Data Engine of medicare.gov Data

Medicare Data File

MeSH, the vocabulary thesaurus used for indexing articles for PubMed

Number of Ebola Cases and Deaths in Affected Countries (2014)


Image Processing

10k US Adult Faces Database

2GB of Photos of Cats (original down as of 20 Aug 2015) or archive version

Stanford Dogs Dataset

The Oxford-IIIT Pet Dataset

Animals with attributes

Affective Image Classification

Face Recognition Benchmark

ImageNet (in WordNet hierarchy)

International Affective Picture System, UFL

Massive Visual Memory Stimuli, MIT

SUN database, MIT

YouTube Faces Database

Indoor Scene Recognition


Machine Learning

Delve Datasets for classification and regression (Univ. of Toronto)

Discogs Monthly Data

eBay Online Auctions (2012)

IMDb Database

Keel Repository for classification, regression and time series

Lending Club Loan Data

Machine Learning Data Set Repository

Million Song Dataset

More Song Datasets

MovieLens Data Sets

RDataMining - "R and Data Mining" ebook data

Registered Meteorites on Earth

Restaurants Health Score Data in San Francisco

UCI Machine Learning Repository

Yahoo! Ratings and Classification Data



Museums

Cooper-Hewitt's Collection Database

Minneapolis Institute of Arts metadata

Natural History Museum (London) Data Portal

Rijksmuseum Historical Art Collection

Tate Collection metadata

The Getty vocabularies


Natural Language

Blogger Corpus

ClueWeb09 FACC

ClueWeb12 FACC

DBpedia - 4.58M things with 583M facts

Flickr Personal Taxonomies

Google Books Ngrams (2.2TB)

Google Web 5gram (1TB, 2006)

Gutenberg eBooks List

Hansards text chunks of Canadian Parliament

Machine Translation of European languages

SMS Spam Collection in English

SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

USENET postings corpus of 2005~2011

Wikidata - Wikipedia databases

Wikipedia Links data - 40 Million Entities in Context

WordNet databases and tools



Physics

CERN Open Data Portal

NSSDC (NASA) data of 550 spacecraft

NASA Exoplanet Archive

Sloan Digital Sky Survey (SDSS) - Mapping the Universe



Psychology/Cognition

OSU Cognitive Modeling Repository Datasets


Public Domains


Archive.org Datasets

CMU JASA data archive

CMU StatLab collections





KDNuggets Data Collections

Microsoft Azure Data Market Free DataSets


Reddit Datasets

RevolutionAnalytics Collection

Sample R data sets

Stats4Stem R data sets


The Washington Post List

UCLA SOCR data collection

UFO Reports

WikiLeaks 9/11 pager intercepts

Yahoo Webscope


Search Engines

Academic Torrents of data sharing from UMB

Archive-it from Internet Archive


DataMarket (Qlik)

Freebase.com of people, places, and things

Harvard Dataverse Network of scientific data


Open Data Certificates (beta)

Statista.com - Statistics and Studies


Social Networks

72 hours #gamergate scrape

Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

May 2011 Calufa Twitter Scrape

Network Twitter Data

Social Twitter Data

Twitter Data for Sentiment Analysis


Social Sciences

Ancestry.com Forum Dataset over 10 years

CMU Enron Email of 150 users

EDRM Enron Email of 151 users, hosted on S3

Facebook Data Scrape (2005)

Facebook Social Networks from LAW (since 2007)

FBI Hate Crime 2013 - aggregated data

Foursquare Social Network in 2010, 2011

Foursquare from UMN/Sarwat (2013)

General Social Survey (GSS) since 1972

GetGlue - users rating TV shows

GitHub Collaboration Archive

MIT Reality Mining Dataset

Mobile Social Networks from UMASS

PewResearch Internet Survey Project

Reddit Comments

SourceForge.net Research Data

StackExchange Data Explorer

Titanic Survival Data Set

Texas Inmates Executed Since 1984

Twitter Graph of entire Twitter site

UCB's Archive of Social Science Data (D-Lab)

UCLA Social Sciences Data Archive

UNIMI/LAW Social Network Datasets

Universities Worldwide

UPJOHN for Labor Employment Research

Yahoo! Graph and Social Data

YouTube Video Social Graph in 2007, 2008

Google Scholar citation relations

Political Polarity Data

GDELT Global Events Database

Skytrax' Air Travel Reviews Dataset



Sports

Betfair Historical Exchange Data

Cricsheet Matches (cricket)

Ergast Formula 1, from 1950 up to date (API)

Football/Soccer resources (data and APIs)

Lahman's Baseball Database

Retrosheet Baseball Statistics


Time Series

Time Series Data Library (TSDL) from MU

UC Riverside Time Series Dataset

Hard Drive Failure Rates

Heart Rate Time Series from MIT



Transportation

Airlines OD Data 1987-2008

Bike Share Systems (BSS) collection

Bay Area Bike Share Data

GeoLife GPS Trajectory from Microsoft Research

Hubway Million Rides in MA

Marine Traffic - ship tracks, port calls and more

NYC Taxi Trip Data 2013 (FOIA/FOILed)

NYC Taxi Trip Data 2009-

OpenFlights - airport, airline and route data

Plane Crash Database, since 1920

RITA Airline On-Time Performance data

RITA/BTS transport data collection (TranStat)

Transport for London (TFL)

Travel Tracker Survey (TTS) for Chicago

U.S. Bureau of Transportation Statistics (BTS)

U.S. Domestic Flights 1990 to 2009

U.S. Freight Analysis Framework since 2007

NYC Uber trip data April 2014 to September 2014


Complementary Collections

DataWrangling: Some Datasets Available on the Web

Inside-r: Finding Data on the Internet

Quora: Where can I find large datasets open to the public?

RS.io: 100+ Interesting Data Sets for Statistics

StaTrek: Leveraging open data to understand urban lives

OpenDataMonitor: An overview of available open data resources in Europe

OpenDataNetwork: A search engine of all Socrata powered data portals ranging from small cities to federal agencies and non-profits

Zenodo: An open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.

