Bulk-downloading images from Google Image Search with Google Colaboratory - スタートアップエンジニアの作ってみた日記

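As the title says, this post shows how to use Colaboratory to fetch a large number of images from Google Image Search.

I'm writing this from memory of something I did about a year ago, so apologies if anything is wrong or out of date.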

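What is Google Colaboratory?
It is a Jupyter notebook environment that runs in the cloud.
It needs almost no tedious setup such as building an environment; all you need is a browser and a Google account.

It also lets you use a GPU for free, which is very handy for things like training machine learning models on large amounts of data.

So that's what we'll be using this time.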

First, open Colaboratory.

Open Google Drive in your browser, click the button for adding a new file, and choose Google Colaboratory.

If it isn't listed, add it from "+アプリの追加" (the item for adding apps).

Once it opens, you get a new notebook.

This is where we'll write the code.

Copy and paste the following code.

import os
from urllib import request as req
from urllib import error
from urllib import parse
import bs4

# Search keyword ('犬' is Japanese for "dog")
keyword = '犬'
# Word used for the directory name and the file names
keyword2 = 'dog'

if not os.path.exists(keyword2):
    os.mkdir(keyword2)
    print(keyword2)

urlKeyword = parse.quote(keyword)
url = 'https://www.google.com/search?hl=ja&q=' + urlKeyword + '&btnG=Google+Search&tbs=0&safe=off&tbm=isch'

headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",}
request = req.Request(url=url, headers=headers)
page = req.urlopen(request)

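html = page.read().decode('utf-8')
html = bs4.BeautifulSoup(html, "html.parser")
# Each search result carried its metadata in an element with these classes
# when this post was written; if Google's markup has changed, elems is empty.
elems = html.select('.rg_meta.notranslate')
counter = 0
for ele in elems:
    # Crudely split the metadata text into key:value pairs
    ele = ele.contents[0].replace('"','').split(',')
    eledict = dict()
    for e in ele:
        # Split on the first ':' only, so the ':' in 'https://' stays in the value
        num = e.find(':')
        eledict[e[0:num]] = e[num+1:]
    # 'ou' holds the URL of the original image
    imageURL = eledict['ou']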

    # Pick the file extension from the URL, falling back to .jpg
    if '.jpg' in imageURL or '.JPG' in imageURL:
        pal = '.jpg'
    elif '.png' in imageURL:
        pal = '.png'
    elif '.gif' in imageURL:
        pal = '.gif'
    elif '.jpeg' in imageURL:
        pal = '.jpeg'
    else:
        pal = '.jpg'

    try:
        # Download the image and save it as <keyword2>/<keyword2><counter><extension>
        with req.urlopen(imageURL) as img:
            with open('./' + keyword2 + '/' + keyword2 + str(counter) + pal, 'wb') as localfile:
                localfile.write(img.read())
        counter += 1
    except UnicodeEncodeError:
        # The URL contains characters that urlopen cannot encode; skip this image
        continue
    except error.URLError:
        # Covers error.HTTPError too (it is a subclass of URLError); skip this image
        continue
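
The substring checks in the script can pick the wrong extension, for example when '.png' only shows up in a query string. As a rough sketch (not part of the original script), the extension could instead be read from the path part of the URL using only the standard library. The name guess_extension, the ALLOWED set, and the example.com URLs are made up for illustration.

```python
import os
from urllib import parse

# Hypothetical helper for illustration; not part of the original script.
ALLOWED = {'.jpg', '.jpeg', '.png', '.gif'}


def guess_extension(image_url, default='.jpg'):
    """Return a lowercase image extension taken from the URL path."""
    # urlparse separates the path from the ?query and #fragment parts
    path = parse.urlparse(image_url).path
    # splitext returns '' as the extension when the path has none
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in ALLOWED else default


print(guess_extension('https://example.com/a/photo.JPG'))
print(guess_extension('https://example.com/img?name=x.png'))
```

If you wanted to try it, the if/elif chain in the loop could be swapped for pal = guess_extension(imageURL). Note that this would save the second URL above as .jpg rather than .png, since the path itself has no extension.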