I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.
[Edit: If you want a light introduction to this, I recommend this double CD]
I wanted to download it … all!
But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …
Firstly, you need to start with the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select the identifier field (only), and set the format to CSV. Set the limit to 30000 (there are more than 25,000 records), and download the huge CSV:
$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv
"identifier"
"78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b"
"78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"
"78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b"
"78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b"
A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings. For example:
78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
→ https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
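Both URL patterns so far can also be built in code. The following is a minimal Python 3 sketch, not the method used in this post. The download URL pattern comes straight from the example above. The search endpoint (advancedsearch.php) and its parameter names (q, fl[], rows, output) are assumptions about what the Advanced Search form submits, so check them against the URL your browser shows. The function names are made up for illustration.

```python
from urllib.parse import urlencode


def search_csv_url(collection, rows=30000):
    """Build a search URL asking for the identifiers of a collection as CSV.

    Assumption: the Advanced Search form submits to advancedsearch.php
    with these parameter names. Verify against your browser's address bar.
    """
    params = [
        ("q", "collection:%s" % collection),
        ("fl[]", "identifier"),   # return only the identifier field
        ("rows", str(rows)),      # the "limit" box on the form
        ("output", "csv"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def download_dir_url(identifier):
    """Turn an item identifier into its directory listing URL."""
    return "https://archive.org/download/%s/" % identifier


print(search_csv_url("georgeblood"))
print(download_dir_url(
    "78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"))
```

Neither function touches the network, so you can inspect the URLs before fetching anything.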
What I want to do is pick the first MP3 file in the directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. This script turns the CSV file of identifiers into a list of MP3 URLs. You could easily adapt it to download FLAC files instead.
#!/usr/bin/python
import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip the CSV header line.
        if row[0] == "identifier":
            continue
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Skip items whose directory listing has no MP3 link.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0]
        link = link.get('href', None)
        link = urlparse.urljoin(url, link)
        print link
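The script above is Python 2 and needs BeautifulSoup. As a rough sketch of the same link-picking step in Python 3 using only the standard library, something like the following should work. The class and function names are invented for illustration, and the sample HTML is a made-up stand-in for a real directory listing, so the parsing runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkFinder(HTMLParser):
    """Collect the href of every <a> tag whose href ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".mp3"):
            self.links.append(href)


def first_mp3_url(listing_html, base_url):
    """Return the absolute URL of the first MP3 link, or None if there is none."""
    finder = Mp3LinkFinder()
    finder.feed(listing_html)
    if not finder.links:
        return None
    return urljoin(base_url, finder.links[0])


# Hypothetical listing: real pages contain much more markup than this.
sample = """
<a href="cover.jpg">cover.jpg</a>
<a href="A%20Prisoner%27s%20Adieu.mp3">A Prisoner's Adieu.mp3</a>
<a href="other.mp3">other.mp3</a>
"""
base = ("https://archive.org/download/"
        "78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b/")
print(first_mp3_url(sample, base))
```

To use it for real you would fetch each listing page (for example with urllib.request.urlopen) and pass the decoded HTML in. The rest of this post sticks with the original script, saved as download.py.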
When you run this it converts each identifier into a download URL:
Edit: Amusingly, WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!
$ ./download.py | head -10
https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b/Jeannine%20I%20Dream%20Of%20You%20%22Lilac%20%20-%20Bar%20Harbor%20Society%20Orch..mp3
https://archive.org/download/78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b/A%20Prisoner%27s%20Adieu%20-%20Jerry%20Irby%20-%20Modern%20Mountaineers.mp3
https://archive.org/download/78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b/If%20I%20Had%20The%20Heart%20of%20A%20Clown%20-%20Bobby%20Wayne.mp3
https://archive.org/download/78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b/How%20Many%20Times%20%28Can%20I%20Fal%20-%20Patty%20Andrews%20And%20Tommy%20Dorsey.mp3
https://archive.org/download/78_ill-forget-you_alan-dean-ball-burns-joe-lipman_gbia0002540a/I%27ll%20Forget%20You%20-%20Alan%20Dean%20-%20Ball%20-%20Burns.mp3
https://archive.org/download/78_it-aint-gonna-rain-no-mo-ya-no-va-a-llover_international-novelty-orchestra-wend_gbia0014114a/It%20Ain%27t%20Gonna%20Rain%20No%20M%20-%20International%20Novelty%20Orchestra.mp3
https://archive.org/download/78_i-still-keep-dreaming_leroy-holmes-and-his-orchestra-sourwine-johnny-corva_gbia0004815b/I%20Still%20Keep%20Dreaming%20-%20Leroy%20Holmes%20and%20his%20Orchestra.mp3
https://archive.org/download/78_it-aint-nobodys-bizness_lulu-belle--scotty-browne-sampsel-markowitz_gbia0010017a/It%20Ain%27t%20Nobody%27s%20Bizness%20-%20Lulu%20Belle%20%26%20Scotty.mp3
https://archive.org/download/78_i-still-get-a-thrill-thinking-of-you_art-lund-johnny-thompson-coots-davis_gbia0002767a/I%20Still%20Get%20A%20Thrill%20%28Thinking%20Of%20You%29%20-%20Art%20Lund.mp3
https://archive.org/download/78_in-the-gloaming_art-hickmans-orchestra-logan_gbia0006430a/In%20The%20Gloaming%20-%20Art%20Hickman%27s%20Orchestra.mp3
And after that you can download as many 78s as you can handle 🙂 by doing:
$ ./download.py > downloads
$ wget -nc -i downloads
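The -nc flag tells wget not to fetch a file that already exists locally, so the command can be re-run after an interruption. If you wanted to see in advance which URLs are still outstanding, a small Python 3 sketch along these lines might do. It assumes the local file name is simply the percent-decoded last component of the URL, which may not match wget's naming in every case, and both function names are hypothetical.

```python
import os
from urllib.parse import unquote, urlsplit


def local_name(url):
    """Guess the local file name for a URL.

    Assumption: the name is the last path component with %XX escapes
    decoded. wget's own naming rules may differ for unusual characters.
    """
    return unquote(os.path.basename(urlsplit(url).path))


def pending(urls, directory="."):
    """Return the URLs whose guessed local file is not yet in directory."""
    return [u for u in urls
            if not os.path.exists(os.path.join(directory, local_name(u)))]


url = ("https://archive.org/download/"
       "78_in-the-gloaming_art-hickmans-orchestra-logan_gbia0006430a/"
       "In%20The%20Gloaming%20-%20Art%20Hickman%27s%20Orchestra.mp3")
print(local_name(url))
```

Feeding it the lines of the downloads file and the directory you ran wget in gives a list of what is left to fetch.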
Update
I only downloaded about 5% of the tracks, but it looks as if downloading them all would come to about 100 GB. Also, most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.
Update #2
Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.