
Dataset 1:

Name of the Corpus:    saved_Adrien_News_Articles_56M

Curator:               Adrien Dubois
Num of Chars:          56,550,653 
Corpus Size:           89 MB      [note that UTF-8 encoded chars can be multi-byte]
Num of Articles:       15,966


Information on the file structure:

Each article is its own .txt file with the following naming convention:
athlete_news_i.txt where "i" is the index of that file in the subfolder.

In total, 15,962 articles are included in the subfolder.
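
The counts in the header can be re-checked against the files on disk. The snippet
below is only a minimal sketch: the function name corpus_stats and the directory
argument are hypothetical, and the glob pattern simply follows the naming convention
described above.

```python
from pathlib import Path


def corpus_stats(corpus_dir, pattern="athlete_news_*.txt"):
    """Return (num_articles, num_chars, num_bytes) for the matching files.

    num_chars counts decoded Unicode characters, num_bytes counts bytes on
    disk; the two differ whenever the text contains multi-byte UTF-8 chars.
    """
    num_articles = num_chars = num_bytes = 0
    for path in Path(corpus_dir).glob(pattern):
        raw = path.read_bytes()
        num_articles += 1
        num_bytes += len(raw)
        # errors="replace" keeps the count going if a file has a bad byte.
        num_chars += len(raw.decode("utf-8", errors="replace"))
    return num_articles, num_chars, num_bytes
```

Calling corpus_stats on the subfolder would give the article, character, and byte
counts to compare with the figures listed at the top of this section.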

Information about the CommonCrawl Dataset:

All of these articles were taken from Hugging Face's "cc_news" dataset, which, as far
as I understand, was originally created by Meta AI for the fastText project. The
"cc_news" dataset is a cleaned-up version of the giant "Common Crawl" dataset run by
the Common Crawl Foundation.

Also note that articles in the "cc_news" dataset are only from 2017 and 2018.

The Common Crawl dataset appears to be free to use. The landing page of their
website states: "Common Crawl maintains a free, open repository of web crawl data
that can be used by anyone" (https://commoncrawl.org/). The terms of use are located
on this webpage if you would like to read them: https://commoncrawl.org/terms-of-use.

I have seen other sources online state that the data is under a Creative Commons
CC-BY 4.0 license, but I found no mention of this on Common Crawl's own website.
However, as long as you credit the Common Crawl Foundation for the dataset, I think
you should be allowed to use all of the data. As far as I understand, the Common
Crawl Foundation would itself handle IP rights complaints (i.e., news publishers not
wanting their articles in the dataset).

Information on the filtering I performed:

I filtered the "cc_news" dataset hosted on Hugging Face, keeping only the articles
in which one of the following famous athletes was mentioned at least once. I curated
this list from ESPN's most famous athletes of 2018, specifically choosing a group of
athletes that would span at least the top 10 most popular sports in the world. The
athletes included in the dataset are:

    Aaron Judge: baseball
    Anthony Joshua: boxing
    Ardie Savea: rugby
    Connor McDavid: ice hockey
    Connor McGregor: mixed martial arts
    Floyd Mayweather: boxing
    Gareth Bale: football
    James Rodriguez: football
    Karch Kiraly: volleyball
    Kevin Durant: basketball
    Lewis Hamilton: Formula 1
    LeBron James: basketball
    Lionel Messi: football
    Lindsey Vonn: alpine skiing
    Ma Long: table tennis
    Magnus Carlsen: chess
    Manny Pacquiao: boxing
    MS Dhoni: cricket
    Nathan MacKinnon: ice hockey
    Neymar Jr: football
    Novak Djokovic: tennis
    Phil Mickelson: golf
    Rohit Sharma: cricket
    Rory McIlroy: golf
    Saina Nehwal: badminton
    Serena Williams: tennis
    Shaun White: snowboarding
    Sidney Crosby: ice hockey
    Sohei Ohtani: baseball
    Stephen Curry: basketball
    Steph Curry: basketball
    Sun Yang: swimming
    Tiger Woods: golf
    Tom Brady: American football
    Virat Kohli: cricket
    Wilfredo Leon: volleyball
    Yuzuru Hanyu: figure skating
    Zhang Jike: table tennis
    Roger Federer: tennis
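
The filtering step described above can be sketched roughly as follows. This is not
the script that was actually used. The matching rule (case-insensitive, whole-phrase),
the function names, the shortened ATHLETES list, and the choice to start the file
index at 0 are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Abbreviated subset of the list above, for illustration only.
ATHLETES = ["Aaron Judge", "LeBron James", "Lionel Messi", "Serena Williams"]


def build_matcher(names):
    """Compile one regex that matches any of the names as a whole phrase."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def mentions_athlete(text, matcher):
    """True if at least one athlete name occurs in the article text."""
    return matcher.search(text) is not None


def save_matching_articles(articles, out_dir, matcher):
    """Write each matching article to athlete_news_<i>.txt; return the count.

    `articles` is any iterable of article strings. The index i starts at 0
    here, which is an assumption about the naming convention.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for text in articles:
        if mentions_athlete(text, matcher):
            (out / f"athlete_news_{count}.txt").write_text(text, encoding="utf-8")
            count += 1
    return count
```

With the Hugging Face "datasets" package installed, the articles argument could be
a generator over the rows of load_dataset("cc_news", split="train"), reading each
row's article body. The field name for the body (assumed to be "text") should be
checked against the dataset card before relying on it.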

=================================================================================


Dataset 2:

Name of the Corpus:    saved_articles_dir_12M

Curator:               babyGPT
Num of Chars:          12,723,919
Corpus Size:           18 MB      [note that UTF-8 encoded chars can be multi-byte]
Num of Articles:       2,760


This corpus was created by running the script 

            run_gatherer.py 

in the Examples directory of babyGPT.  The newspaper sites I used for gathering the
articles are those in the list of URLs in the code of that script, as it appears
when you first install the babyGPT module.

