Public Dataset List from GitHub


NeuroTimes | Oct. 29, 2015



A list of high-quality public datasets from all fields, collected on GitHub. It includes:

Agriculture

U.S. Department of Agriculture's PLANTS Database

 

Biology

1000 Genomes

American Gut (Microbiome Project)

Collaborative Research in Computational Neuroscience (CRCNS)

Gene Expression Omnibus (GEO)

Gene Ontology (GO)

Global Biotic Interactions (GloBI)

Sequence Read Archive (SRA)

EBI ArrayExpress

ENCODE project

Human Microbiome Project (HMP)

ICOS PSP Benchmark

MIT Cancer Genomics Data

NIH Microarray data (FTP)

OpenSNP genotypes data

Pathguide: Protein-Protein Interactions Catalog

Protein Data Bank

PubChem Project

PubGene (now Coremine Medical)

Stanford Microarray Data

The Personal Genome Project or PGP

UCSC Public Data

UniGene

The Catalogue of Life

 

Climate/Weather

Australian Weather

Brazilian Weather - Historical data (In Portuguese)

Canadian Meteorological Centre

Climate Data from UEA (updated monthly)

Global Climate Data Since 1929

NASA Global Imagery Browse Services

NOAA Bering Sea Climate

NOAA Climate Datasets

NOAA Realtime Weather Models

The World Bank Open Data Resources for Climate Change

UEA Climatic Research Unit

WU Historical Weather Worldwide

 

Complex Networks

CrossRef DOI URLs

DBLP Citation dataset

NBER Patent Citations

NIST complex networks data collection

Protein-protein interaction network

PyPI and Maven Dependency Network

Scopus Citation Database

Small Network Data

Stanford GraphBase (Steven Skiena)

Stanford Large Network Dataset Collection

The Koblenz Network Collection

The Laboratory for Web Algorithmics (UNIMI)

The Nexus Network Repository

UCI Network Data Repository

UFL sparse matrix collection

WSU Graph Database

 

Computer Networks

3.5B Web Pages from CommonCrawl 2012

53.5B Web clicks of 100K users in Indiana Univ.

CAIDA Internet Datasets

ClueWeb09 - 1B web pages

ClueWeb12 - 733M web pages

CommonCrawl Web Data over 7 years

CRAWDAD Wireless datasets from Dartmouth Univ.

Criteo click-through data

Open Mobile Data by MobiPerf

UCSD Network Telescope, IPv4 /8 net

 

Contextual Data

Context-aware data sets from five domains or GitHub

 

Data Challenges

Challenges in Machine Learning

D4D Challenge of Orange

CrowdANALYTIX dataX

DrivenData Competitions for Social Good

ICWSM Data Challenge (since 2009)

Kaggle Competition Data

KDD Cup by Tencent 2012

Localytics Data Visualization Challenge

Netflix Prize

Space Apps Challenge

Telecom Italia Big Data Challenge

Yelp Dataset Challenge

 

Economics

American Economic Association (AEA)

EconData from UMD

Internet Product Code Database

 

Energy

AMPds

BLUED

COMBED

Dataport

ECO

EIA

HFED

iAWE

PLAID

REDD

UK-DALE

 

Finance

CBOE Futures Exchange

Google Finance

Google Trends

NASDAQ

OANDA

OSU Financial data

Quandl

St. Louis Federal Reserve

Yahoo Finance

 

Geology

USGS Earthquake Archives

Smithsonian Institution Global Volcano and Eruption Database

 

GeoSpace/GIS

BODC - marine data of ~22K vars

Cambridge, MA, US, GIS data on GitHub

EOSDIS - NASA's earth observing system data

Factual Global Location Data

Geo Spatial Data from ASU

GeoNames Worldwide

Global Administrative Areas Database (GADM)

Landsat 8 on AWS

Natural Earth - vectors and rasters of the world

OpenStreetMap (OSM)

TIGER/Line - U.S. boundaries and roads

TwoFishes - Foursquare's coarse geocoder

TZ Timezones shapefiles

World countries in multiple formats

List of all countries in all languages

OpenAddresses

 

Government

Antwerp, Belgium

Austin, TX, US

Australia (abs.gov.au)

Australia (data.gov.au)

Austria (data.gv.at)

Belgium

Brazil

Cambridge, MA, US

Canada

Chicago

Dallas Open Data

Denver Open Data

Durham, NC Open Data

England LGInform

EuroStat

FedStats

Finland

France

Germany

Ghent, Belgium

Glasgow, Scotland, UK

Guardian world governments

Houston Open Data

Indian Government Data

Indonesian Data Portal

London Datastore, UK

Los Angeles Open Data

MassGIS, Massachusetts, U.S.

Mexico

Netherlands

New Zealand

NYC betanyc

NYC Open Data

OECD

Oklahoma

Open Government Data (OGD) Platform India

Rio de Janeiro, Brazil

Romania

San Francisco Data sets

Seattle

Singapore Government Data

South Africa

Switzerland

The World Bank

Texas Open Data

Puerto Rico Government

U.K. Government Data

Uruguay

U.S. American Community Survey

U.S. CDC Public Health datasets

U.S. Census Bureau

U.S. National Center for Education Statistics (NCES)

U.S. Department of Housing and Urban Development (HUD)

U.S. Federal Government Agencies

U.S. Federal Government Data Catalog

U.S. Food and Drug Administration (FDA)

U.S. Open Government

UK 2011 Census Open Atlas Project

United Nations

Vancouver, BC Open Data Catalog

 

Healthcare

EHDP Large Health Data Sets

Gapminder World, demographic databases

Medicare Coverage Database (MCD), U.S.

Medicare Data Engine of medicare.gov Data

Medicare Data File

MeSH, the vocabulary thesaurus used for indexing articles for PubMed

Number of Ebola Cases and Deaths in Affected Countries (2014)

 

Image Processing

10k US Adult Faces Database

2GB of Photos of Cats (original down as of 20 Aug 2015) or Archive version

Stanford Dogs Dataset

The Oxford-IIIT Pet Dataset

Animals with attributes

Affective Image Classification

Face Recognition Benchmark

ImageNet (in WordNet hierarchy)

International Affective Picture System, UFL

Massive Visual Memory Stimuli, MIT

SUN database, MIT

YouTube Faces Database

Indoor Scene Recognition

 

Machine Learning

Delve Datasets for classification and regression (Univ. of Toronto)

Discogs Monthly Data

eBay Online Auctions (2012)

IMDb Database

Keel Repository for classification, regression and time series

Lending Club Loan Data

Machine Learning Data Set Repository

Million Song Dataset

More Song Datasets

MovieLens Data Sets

RDataMining - "R and Data Mining" ebook data

Registered Meteorites on Earth

Restaurants Health Score Data in San Francisco

UCI Machine Learning Repository

Yahoo! Ratings and Classification Data

 

Museums

Cooper-Hewitt's Collection Database

Minneapolis Institute of Arts metadata

Natural History Museum (London) Data Portal

Rijksmuseum Historical Art Collection

Tate Collection metadata

The Getty vocabularies

 

Natural Language

Blogger Corpus

ClueWeb09 FACC

ClueWeb12 FACC

DBpedia - 4.58M things with 583M facts

Flickr Personal Taxonomies

Google Books Ngrams (2.2TB)

Google Web 5gram (1TB, 2006)

Gutenberg eBooks List

Hansards text chunks of Canadian Parliament

Machine Translation of European languages

SMS Spam Collection in English

SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

USENET postings corpus of 2005-2011

Wikidata - Wikipedia databases

Wikipedia Links data - 40 Million Entities in Context

WordNet databases and tools

 

Physics

CERN Open Data Portal

NSSDC (NASA) data of 550 spacecraft

NASA Exoplanet Archive

Sloan Digital Sky Survey (SDSS) - Mapping the Universe

 

Psychology/Cognition

OSU Cognitive Modeling Repository Datasets

 

Public Domains

Amazon

Archive.org Datasets

CMU JASA data archive

CMU StatLab collections

Data360

Datamob.org

Google

Infochimps

KDNuggets Data Collections

Microsoft Azure Data Market Free DataSets

Numbrary

Reddit Datasets

RevolutionAnalytics Collection

Sample R data sets

Stats4Stem R data sets

StatSci.org

The Washington Post List

UCLA SOCR data collection

UFO Reports

Wikileaks 911 pager intercepts

Yahoo Webscope

 

Search Engines

Academic Torrents of data sharing from UMB

Archive-it from Internet Archive

Datahub.io

DataMarket (Qlik)

Freebase.com of people, places, and things

Harvard Dataverse Network of scientific data

ICPSR (UMICH)

Open Data Certificates (beta)

Statista.com - statistics and studies

 

Social Networks

72 hours #gamergate scrape

Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape

May 2011 Calufa Twitter Scrape

Network Twitter Data

Social Twitter Data

Twitter Data for Sentiment Analysis

 

Social Sciences

Ancestry.com Forum Dataset over 10 years

CMU Enron Email of 150 users

EDRM Enron Email of 151 users, hosted on S3

Facebook Data Scrape (2005)

Facebook Social Networks from LAW (since 2007)

FBI Hate Crime 2013 - aggregated data

Foursquare Social Network in 2010, 2011

Foursquare from UMN/Sarwat (2013)

General Social Survey (GSS) since 1972

GetGlue - users rating TV shows

GitHub Collaboration Archive

MIT Reality Mining Dataset

Mobile Social Networks from UMASS

PewResearch Internet Survey Project

Reddit Comments

SourceForge.net Research Data

StackExchange Data Explorer

Titanic Survival Data Set

Texas Inmates Executed Since 1984

Twitter Graph of entire Twitter site

UCB's Archive of Social Science Data (D-Lab)

UCLA Social Sciences Data Archive

UNIMI/LAW Social Network Datasets

Universities Worldwide

UPJOHN for Labor Employment Research

Yahoo! Graph and Social Data

YouTube Video Social Graph in 2007, 2008

Google Scholar citation relations

Political Polarity Data

GDELT Global Events Database

Skytrax' Air Travel Reviews Dataset

 

Sports

Betfair Historical Exchange Data

Cricsheet Matches (cricket)

Ergast Formula 1, from 1950 up to date (API)

Football/Soccer resources (data and APIs)

Lahman's Baseball Database

Retrosheet Baseball Statistics

 

Time Series

Time Series Data Library (TSDL) from MU

UC Riverside Time Series Dataset

Hard Drive Failure Rates

Heart Rate Time Series from MIT

 

Transportation

Airlines OD Data 1987-2008

Bike Share Systems (BSS) collection

Bay Area Bike Share Data

GeoLife GPS Trajectory from Microsoft Research

Hubway Million Rides in MA

Marine Traffic - ship tracks, port calls and more

NYC Taxi Trip Data 2013 (FOIA/FOILed)

NYC Taxi Trip Data 2009-

OpenFlights - airport, airline and route data

Plane Crash Database, since 1920

RITA Airline On-Time Performance data

RITA/BTS transport data collection (TranStats)

Transport for London (TFL)

Travel Tracker Survey (TTS) for Chicago

U.S. Bureau of Transportation Statistics (BTS)

U.S. Domestic Flights 1990 to 2009

U.S. Freight Analysis Framework since 2007

NYC Uber trip data April 2014 to September 2014

 

Complementary Collections

DataWrangling: Some Datasets Available on the Web

Inside-r: Finding Data on the Internet

Quora: Where can I find large datasets open to the public?

RS.io: 100+ Interesting Data Sets for Statistics

StaTrek: Leveraging open data to understand urban lives

OpenDataMonitor: An overview of available open data resources in Europe

OpenDataNetwork: A search engine of all Socrata powered data portals ranging from small cities to federal agencies and non-profits

Zenodo: An open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.






© 2014-2015 NeuroTimes