Harvesting Speech Datasets for Linguistic Research on the Web

Abstract

This project will harvest audio and transcribed data from podcasts, news broadcasts, public and educational lectures, and other sources to create a massive corpus of speech. Tools will then be developed to analyze the different uses of prosody (rhythm, stress, and intonation) in spoken communication.

Principal Investigators

Mats Rooth, Cornell University, US, NSF

Michael Wagner, McGill University, CAN, SSHRC

Links:

Project Website

Github Site for Ezra Software

David Lutz, Parry Cadwallader, & Mats Rooth, A Web Application for Filtering and Annotating Web Speech Data

Jonathan Howell, Meaning And Prosody: On The Web, In The Lab And From The Theorist's Armchair

Kyle Gorman, Jonathan Howell, & Michael Wagner, Prosodylab-aligner: A Tool for Forced Alignment of Laboratory Speech

Article in the Cornell Chronicle

Prosody Datasets Website: Examples

harvesting_speech_datasets_rooth-howell-wagner.final_.wp_.pdf (2.6 MB)


Award News and Updates

October 9, 2019: Round 4 Conference (2020)

...

March 28, 2017: Winners of Round Four of the T-AP Digging into Data Challenge

...

February 27, 2017: White Papers for Digging Round Three (2013) Now Available

All Digging into Data Challenge projects issue a final "white paper" for the public that summarizes their research. Most of the 2013 projects have completed their white papers -- check out the Round Three...

August 17, 2016: "Digging into Data for Research"

Read this piece by Ted Hewitt, President of the Social Sciences and Humanities Research Council of Canada, recently featured in Adjacent Government.

January 20, 2016: Digging Round Three Conference

This conference, held in Glasgow, UK, on January 27-28, 2016, brings together the Digging into Data Round Three project award holders from all participating countries, along with additional stakeholders.

Data Repositories

In advance of the DID Challenge, the funders approached many major repositories of digital materials and asked them to provide contact and technical support information for gaining access to their collections. This list is constantly being updated, so check back often. If you are a representative of such a collection and wish to be added to this list, please contact the DID Challenge organizers.

ARTstor

ARTstor is a digital library of nearly one million images in the areas of art, architecture, the humanities, and social sciences with a set of tools to view, present, and manage images for research and pedagogical purposes.

Biodiversity Heritage Library

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature...

Chronicling America (Library of Congress National Digital Newspaper Program)

As of March 2011, Chronicling America provides free and searchable access to more than 3.3 million pages of historic newspapers, published between 1860 and 1922. These newspapers are selected and digitized by NEH awardees through the National...

Connecting Repositories (CORE)

As of January 2013, the COnnecting REpositories (CORE) system, hosted at The Open University, provides access to millions of open access research articles aggregated from more than 250...

Data Archiving and Networked Services (DANS)

DANS is the Dutch national centre of expertise and repository for research data. We help researchers make their data available for reuse. This allows researchers to use the data for new research and makes published research...

2016 Application materials for the T-AP Digging into Data Challenge.

The 2016 deadline is now over. Thanks to the many international teams that applied!

Main RFP

t-ap.did_.2016.rfp_.english.4.pdf (591.65 KB) (Revised 25 April)
tap-did_rfp_2016_f.3.pdf (363.56 KB) (Revised 28 March)

Instructions on How to Use Online Application System

application_instructions_did_round_4.0.english.27june.pdf (217.4 KB)
app.instructions.french28june.pdf (264.07 KB) (edited 28 June)

Country-Specific Documents

Argentina

mincyt_addendum.2016.pdf (18.74 KB)
mincyt_budgetform_2016.dic_.23.2015_mincyt.2.xls (38 KB)

Brazil

fapesp-addendum_.2016.final20160226_.pdf (311.33 KB)
fapesp_budget_2016.apr_.final_.xlsx (981.76 KB)

Canada

digging_into_data_canadian_funders_joint_addendum_2016_finale.pdf (126.41 KB)
digging_into_data_canadian_funders_joint_addendum_2016_finalf.pdf (147.2 KB)
sshrc_budget_table_2016.final_with_french_tab_2015.xlsx (14.19 KB)
nserc_budget_form.2016.xlsx (10.8 KB)
frqsc-frqnt_joint_budget.form_2016.en-fr_vf.xlsx (15.61 KB)

Finland

aka_addendum.2016.pdf (88.93 KB)
aka_budget_table.2016.xlsx (20.64 KB)

France

anr_addendum.2.dida2016_anr.2.pdf (644.69 KB)
anr_budget_final.1.dida2016_2.xls (450.5 KB)

Germany

dfg_addendum_.2016.09-dec-2015.pdf (111.14 KB)
dfg_budget_template.2016.docx (19.64 KB)
dfg_budget_example.2016.docx (22.3 KB)

Mexico

conacyt_addendum.2016.28.march_.pdf (149.88 KB)
conacyt_budget_form_2016.v2.xlsx (13.3 KB)

Netherlands

nwo_addendum_2.pdf (211.61 KB)
nwo_financial_form_did.2016.docx (41.34 KB)

Portugal*

fct.addendum.version2.pdf (99.03 KB)
fct_budgetform.blank_.2016.xlsx (10.76 KB)

*Note (edited 14 March): FCT (Portugal) is now confirmed as a participant. A new version of the addendum was posted on 15 March.

United Kingdom

uk_joint_addendum_2016.2015_final.pdf (22.93 KB)
uk_joint_budget_template.2016._2015.xlsx (15.51 KB)

United States

neh.addendum.3-march.2016.pdf (98.59 KB)
nsf_addendum_final_03_09_2016.pdf (209.8 KB)
imls_addendum_2016.12-dec-2015.pdf (113.54 KB)
us_joint_budget_form_2016.xls (45 KB)
us_joint_budget_instructions_2016.pdf (186.24 KB)
us_joint_sample_budget_2016.pdf (23.2 KB)