Retrieving Tesla charger point details, building on the geohash-based method.
geohash, tesla charger, plugshare.com
In the previous installment of this tech blog, we used a geohash-based approach to retrieve each charging station's ID (column 2) and its latitude/longitude (columns 3 and 4); the scraped results are shown in the figure below.
The author was later handed another task: for each charging station, retrieve the number of charging piles, the count of each outlet type, whether it charges a fee, and the pricing details, i.e. the content shown in the figure below.
Since the previous installment already gave us each station's unique id and the URL of its detail page, we can open that URL and watch Chrome's F12 network panel: one of the requests is an API call that fetches the station details, and its JSON response can be inspected in the Preview tab.
Next, let's analyze the JSON content.
From the JSON we can see that this station has 20 charging piles (array indices 0-19 under the stations label); the count can be verified by clicking the "more" icon on the page.
Expanding further reveals the outlets field; in our example, the pile at array index 17 has two outlets, one CHAdeMO and one COMBO.
The cost information is simpler: it is in the cost and cost_description labels under the amenities label.
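Putting these observations together, the relevant part of the response has roughly the following shape (a hand-abbreviated sketch based on the F12 preview; only the fields we care about are shown, and the literal values are placeholders, not real data):

detail = {
    "stations": [  # one entry per charging pile (indices 0-19 in our example)
        {"outlets": [
            {"connector_name": "CHAdeMO", "connector_type": 0},  # placeholder type value
            {"connector_name": "COMBO", "connector_type": 0},    # placeholder type value
        ]},
        # ...the remaining piles...
    ],
    "amenities": [],            # the cost fields appear right after this label in the raw string
    "cost": 0,                  # placeholder: whether/how the station charges
    "cost_description": "",     # placeholder: free-text pricing details
}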
Once the structure of the response is understood, we can start scraping and parsing the data.
As in the previous installment, we use the cloudscraper package, and we have to be careful with the parameters we pass along so that plugshare does not block us.
I read the file produced by the previous installment's crawl, take each station's unique id, assemble it into https://api.plugshare.com/v3/locations/{id}, and issue the request.
Two problems had to be solved along the way: the Chinese text in the response comes back as Unicode escapes, and the response may contain emoji.
The Unicode escapes can be handled with Python's encode/decode; emoji break that conversion and need extra handling to turn them into ? (str.encode('utf-8','replace')).
When I reviewed this again in November 2023, I found the site now also returns UTF-16 content, including UTF-16 emoji (surrogate pairs); since the output is eventually written to a CSV file, I changed the program to strip everything that is not valid UTF-8:
s = s.encode('utf-8').decode("unicode_escape")
s = re.sub('[\ud800-\udfff]+', '', s)  # drop lone UTF-16 surrogates (emoji)
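Wrapped up as a small helper, the cleanup looks like this (a minimal sketch; the function name is mine, not part of the original script):

import re

def cleanApiText(raw):
    # Decode the \uXXXX escapes returned by the API, then drop lone
    # UTF-16 surrogates (emoji halves) that would break the UTF-8 CSV output.
    s = raw.encode('utf-8').decode('unicode_escape')
    return re.sub('[\ud800-\udfff]+', '', s)

print(cleanApiText('\\u5145\\u96fb\\u7ad9'))  # -> 充電站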
What remains is the grunt work of parsing the JSON; my habit is to split on keywords, carving the content out by cutting from the left and the right.
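For reference, the same fields could also be pulled out with the standard json module instead of keyword splitting. A minimal sketch, assuming the field names observed above and that cost/cost_description sit at the top level (which may need adjusting against the real response):

import json
from collections import Counter

def parseDetail(apiResponseJsonStr):
    # Parse the detail response, count the piles, and tally connector types.
    data = json.loads(apiResponseJsonStr)
    stations = data.get('stations', [])
    pileCount = len(stations)
    connectorCounts = Counter(
        outlet.get('connector_name', '')
        for station in stations
        for outlet in station.get('outlets', [])
    )
    return pileCount, connectorCounts, data.get('cost'), data.get('cost_description')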
import requests
import cloudscraper
import time
import re

# Prepare the request headers that cloudscraper will send #
session = requests.Session()
session.headers = {
    'browser': 'firefox',  # works around Cloudflare Error 1020; can be swapped between firefox and chrome
    'Authority': 'api.plugshare.com',
    'Method': 'GET',
    'Path': '/v3/locations/356542',
    'Scheme': 'https',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-TW',
    'Authorization': 'Basic d2ViX3YyOkVOanNuUE54NHhXeHVkODU=',
    'Origin': 'https://www.plugshare.com',
    'Referer': 'https://www.plugshare.com/location/356542',
    'Sec-Ch-Ua': '"Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"',
    'Sec-Ch-Ua-Mobile': '?1',
    'Sec-Ch-Ua-Platform': '"windows"',  # windows, Android
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Mobile Safari/537.36'}
scraper = cloudscraper.create_scraper(delay=10, sess=session, browser={'browser': 'firefox', 'platform': 'windows', 'mobile': False})  # alternative: {'browser': 'chrome', 'platform': 'windows', 'desktop': True}
# Read the file and collect the Tesla charging station ids #
teslaPointIdList = []
f = open('D:/workSpaceReunion/dataInOutput/plugShareDataParsing/step3_requestAPIResultParsing/teslaRequestAPIResultParsing.csv', 'r', encoding="utf-8")
line = f.readline()
while line:
    if line.strip() != '':  # skip empty lines
        teslaPointId = line.split(",")[1]
        teslaPointIdList.append(teslaPointId)
    line = f.readline()
f.close()
# Crawl and parse the detail API for each teslaPointId #
chargerDetailResultList = []
for teslaId in teslaPointIdList:
    # Build the detail-API URL for this station and request it
    url = 'https://api.plugshare.com/v3/locations/' + teslaId
    apiResponseJsonStr = scraper.get(url).text
    #print(apiResponseJsonStr)  # check whether we are being treated as a bot again
    strSplit = apiResponseJsonStr.split("connector_name\":\"")
    first = True
    chargerTypeDict = dict()
    for line in strSplit:
        if first:
            first = False  # the first chunk has no connector name and must be skipped
            # use the skipped chunk to pull out the cost information
            if "\"amenities\":" in line:  # only if the station has this info
                strSplitAmenities = line.split("\"amenities\":")
                amenities = strSplitAmenities[1]
                strSplitCreatedAt = amenities.split(",\"created_at\":")
                createdAt = strSplitCreatedAt[0]
                strSplitCost = createdAt.split("\"cost\":")
                cost = strSplitCost[1]
                strSplitCostDes = cost.split(",\"cost_description\":")
                needCost = strSplitCostDes[0]
                # Decode the escapes, and replace commas and line breaks that would break the CSV columns
                cost_description = strSplitCostDes[1].encode('utf-8').decode("unicode_escape").replace("\"", "").replace(",", "_").replace("\n", "_").replace("\r", "_")
                # Strip UTF-16 surrogates (emoji); see https://stackoverflow.com/questions/67854195/removing-part-of-string-starting-with-ud
                cost_description = re.sub('[\ud800-\udfff]+', '', cost_description)
            else:
                needCost = ""
                cost_description = ""
        else:
            strSplitIn = line.split("\",\"connector_type\"")
            chargerType = strSplitIn[0]
            if chargerType not in chargerTypeDict:
                chargerTypeDict[chargerType] = 1
            else:
                chargerTypeDict[chargerType] = chargerTypeDict[chargerType] + 1
    pileCount = apiResponseJsonStr.count('\"outlets\":')  # count the charging piles
    chargerTypeStrConcact = str(chargerTypeDict).replace("'", "").replace(" ", "").replace(",", "#").replace("{", "").replace("}", "")
    chargerPointDetail = teslaId + "," + str(pileCount) + "," + chargerTypeStrConcact + "," + needCost + "," + cost_description
    print(chargerPointDetail)
    chargerDetailResultList.append(chargerPointDetail)
    time.sleep(1.5)
# Write the results to a file #
chargerDetailPath = 'D://workSpaceReunion/dataInOutput/plugShareDataParsing/step4_requestChargerDetail/chargerDetail.csv'
with open(chargerDetailPath, 'w', encoding='utf-8') as f:
    #f.write('\ufeff')  # UTF-8 BOM (not written for this intermediate file)
    f.write('\n'.join(chargerDetailResultList))
At the end of the loop I sleep 1.5 seconds between API calls: during my experiments I found that even though my header setup gets past the site's protection, firing requests too quickly still gets me flagged as a bot partway through, and 1.5 seconds was the shortest interval that reliably made it through in my tests.
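If a request does get blocked partway through despite the delay, a simple retry with an increasing pause is one way to recover (a sketch, not part of the original script; the function name is mine):

import time

def getWithRetry(scraper, url, maxRetries=3, baseDelay=1.5):
    # Retry the request with a growing delay when the site starts rejecting us.
    for attempt in range(maxRetries):
        response = scraper.get(url)
        if response.status_code == 200:
            return response.text
        time.sleep(baseDelay * (attempt + 1))  # back off a little more each time
    return None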
The execution result is shown in the figure below:
Finally, we write a program that joins the results of the previous and current installments into one complete set of charging-station data.
# Read the charger-pile master file from the previous installment #
teslaPointMasterDict = dict()
f = open('D:/workSpaceReunion/dataInOutput/plugShareDataParsing/step3_requestAPIResultParsing/teslaRequestAPIResultParsing.csv', 'r', encoding="utf-8")
line = f.readline()
while line:
    if line.strip() != '':  # skip empty lines
        teslaPointId = line.split(",")[1]
        teslaPointMasterDict[teslaPointId] = line.replace("\n", "").replace("\r", "")
    line = f.readline()
f.close()
# Read this installment's results and join them on the station id,
# using the master dict from the previous installment to find the matching master row #
chargerDetailPath = 'D://workSpaceReunion/dataInOutput/plugShareDataParsing/step4_requestChargerDetail/chargerDetail.csv'
chargerDetailResultList = []
f = open(chargerDetailPath, 'r', encoding="utf-8")
line = f.readline()
while line:
    if line.strip() != '':  # skip empty lines
        teslaPointId = line.split(",")[0]
        teslaPointPileCnt = line.split(",")[1]
        if teslaPointPileCnt != "0":
            seqs = line.split(",")[1:]
            seq = ','.join(seqs)
            teslaPointMasterDict[teslaPointId] = teslaPointMasterDict[teslaPointId] + ',' + seq.replace("\n", "").replace("\r", "")
            print(teslaPointMasterDict[teslaPointId])
            # collect the joined row into the result list
            chargerDetailResultList.append(teslaPointMasterDict[teslaPointId])
        else:
            print(teslaPointId + ' EMPTY')
            teslaPointMasterDict[teslaPointId] = line.replace("\n", "").replace("\r", "")
    line = f.readline()
f.close()
# Save the joined result #
chargerDetailFinalPath = 'D://workSpaceReunion/dataInOutput/plugShareDataParsing/step5_joinMasterDetail/chargerDetailJoinFinal.csv'
with open(chargerDetailFinalPath, 'w', encoding="utf-8") as f:
    f.write('\ufeff')  # UTF-8 BOM
    f.write('\n'.join(chargerDetailResultList))
So far the programs from both installments scrape the charging-station data correctly. Let's hope plugshare doesn't redesign the site too soon; rewriting the scraper every time would be exhausting.