First stone of a Python webcrawler

I’m learning Python, so I’m trying to write a web crawler in it. The first function collects the set of unique links found on the page at a given base URL. I’m amazed by Python’s simplicity: this is the first stone of a web crawler, and the beginning of a great friendship.

import urllib2
import sys
import re

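def getlinksfromurl(url):
	# Return the set of unique links found in the page at url
	linkset = set()
	usock = urllib2.urlopen(url)
	data = usock.read()
	usock.close()
	# Capture the value of every double-quoted href attribute
	for match in re.finditer(r'href="([^"]*)"', data):
		linkstring = match.group(1)
		if linkstring.startswith('http'):
			# Already an absolute link
			linkset.add(linkstring)
		elif linkstring.startswith('/'):
			# Prefix with the url argument (not sys.argv) so the function works for any caller
			linkset.add(url + linkstring)
		elif linkstring.startswith('./'):
			# Drop the leading dot, then prefix with the url argument
			linkset.add(url + linkstring[1:])
	return linkset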

# Main program
url = sys.argv[1]
linkset=getlinksfromurl(url)
for link in linkset:
	print link
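
The listing above targets Python 2 (urllib2 and the print statement). As a possible next step, here is a minimal sketch of the same idea for Python 3, using only the standard library. It is not a drop-in replacement: the names LinkCollector, extract_links and get_links_from_url are my own, and I am assuming that only http and https links are worth keeping. It swaps the string search for html.parser, so single-quoted attributes are handled too, and lets urllib.parse.urljoin resolve relative links against the base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects every href attribute, resolved against base_url (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # urljoin handles "/path", "./path" and absolute URLs alike
                absolute = urljoin(self.base_url, value)
                # Assumption: keep only web links, skipping mailto:, javascript:, etc.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.add(absolute)


def extract_links(html, base_url):
    """Return the set of unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def get_links_from_url(url):
    """Download the page at url and return its unique links."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, url)
```

Keeping the parsing in extract_links, separate from the download, means it can be tried on a plain string without touching the network, for example extract_links('<a href="/about">About</a>', "http://example.com/").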