Page Scraping Links with Ruby

While tinkering around with Ruby, I figured I would write up a project that involved a little of everything a common Ruby program might have: a little socket work, a little class creation, some array handling and so on. The only thing I didn’t include was a module, but you could easily build one in if you like. The result is a nice little project that scrapes links from a given page. Ruby is gaining in popularity for its expressive syntax, but in my personal opinion that expressiveness brings a few dangers: ambiguity, and code that is hard to read if you are not the one who wrote it, because of the many aliases you have to keep track of. Either way, it can do great things and is perfect for that little script you have always wanted to write. We show you how to page scrape with our favorite gemstone language right here on the Programming Underground!

First of all, we pull in the socket library, which gives us access to the TCPSocket class. We then create a class called SocketClient that is in charge of contacting the host and pulling the page across. It is also in charge of picking out the links and returning them to us when we call the “fetchLinks” method.

We then do a bit of error handling to check whether the page fetch was successful and, if so, we simply store the fetched links in an array. We loop through this array and print the links to the screen one by one. Let’s take a look at how this can be done…

# Require the socket library
require 'socket'

# Create a class which will take a host, port and possibly a page to fetch
class SocketClient
  def initialize(host, port, page = "/")
	@host = host
	@port = port
	@page = page
  end
  
  # Regular expression which pulls all link tags out of the response from the server.
  def fetchLinks
	# \b keeps tags such as <abbr> from matching; the m flag lets a link span several lines
	return @answer.scan(/<a\b.*?<\/a>/im)
  end
	
  # Start kicks off the whole process, contacts the host, fetches the page and stores it in the @answer instance variable.
  def start
	begin
		# Connect on the port passed to the constructor rather than a hard-coded service name
		@socket = TCPSocket.new(@host, @port)
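		@socket.print "GET #{@page} HTTP/1.1\r\n"
		# Ask the server to close the connection when it is done so the read below does not wait on a keep-alive connection
		@socket.print "Host: #{@host}\r\nConnection: close\r\n\r\n"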
		@answer = @socket.gets(nil)
		@socket.close
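		# Only the status line at the very start of the response counts, whatever reason phrase follows the code
		if @answer =~ /\AHTTP\/1\.[01] 200\b/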
			return true
		else
			return false
		end
	rescue
		return false
	end
	
  end
end

# Create an instance of the class then
# make a call to Google and fetch the links from its home page.
client = SocketClient.new("www.google.ca", 80)

# Start the client and, if it succeeds, pull the links into an array
if client.start
	links = client.fetchLinks
	
	# Loop through the array and print them to the screen if successful
	links.each { |item| puts item }
else
	puts "The server responded with an error code or we were unable to connect to the site"
end

As the in-code comments show, we create an instance of our class, give it the host (www.google.ca in our case) along with the port number and it contacts the host for us. We have an optional third parameter which is the exact page to fetch but it will default to the root page if a page is not specified. An example of a page would be something like /subdirectory/index.php to get the page in a subdirectory.
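If you are starting from a full URL rather than separate pieces, the three constructor arguments can be derived from it. The following is only a sketch: `client_args` is a hypothetical helper that is not part of the script above, and it assumes a plain http URL, using Ruby's standard `uri` library.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script): split a full URL
# into the host, port and page values that SocketClient.new expects.
def client_args(url)
  uri = URI.parse(url)
  # Fall back to the root page when the URL has no path, as the class does
  page = uri.path.empty? ? "/" : uri.path
  # Keep any query string attached to the page
  page += "?#{uri.query}" if uri.query
  [uri.host, uri.port, page]
end

host, port, page = client_args("http://www.google.ca/subdirectory/index.php")
puts "#{host} #{port} #{page}"
```

The three values could then be handed straight to SocketClient.new(host, port, page).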

We then start the client by calling the start method and, if all is good (the host replies with a 200 status code), we store the array returned from the fetchLinks method in a variable called “links”. After that we just loop through the links and display them on screen.
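Since the start method only tells us true or false, it can be handy to see which status code actually came back, for example when a site answers with a redirect instead of a 200. Here is a minimal sketch of that idea. The `status_code` helper and the two sample responses are made up for illustration and are not part of the original script.

```ruby
# Hypothetical helper: pull the numeric status code out of a raw HTTP response.
# Returns nil when the text does not start with an HTTP status line.
def status_code(response)
  match = response.to_s.match(/\AHTTP\/\d\.\d (\d{3})/)
  match ? match[1].to_i : nil
end

# Made-up responses standing in for what the socket would return
sample_ok    = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
sample_moved = "HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.google.ca/\r\n\r\n"

puts status_code(sample_ok)     # prints 200
puts status_code(sample_moved)  # prints 301
```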

The script is small and that is the power of Ruby at work. We could condense this even more but for readability I struck a happy medium between lines of code and something that is easy to follow. As a programmer I am sure you can see something like this being part of a bigger project and maybe even change the regular expression in the fetchLinks method to pull out other tags or images etc. You could even build a spider on top of this if you so choose.
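As a rough illustration of swapping the regular expression, the sketch below pulls out just the href values and the image tags. The HTML string is made up and stands in for the server response, and the patterns are one possible choice rather than anything from the original script. Regular expressions are fine for a quick script like this, but they will trip over unusual markup.

```ruby
# Made-up sample markup standing in for the response stored in @answer
html = <<~HTML
  <a href="/about">About</a>
  <img src="/logo.png" alt="Logo">
  <A HREF="http://example.com/">Example</A>
HTML

# Capture only the href value of each anchor tag.
# scan returns one array per match when the pattern has a group, so flatten it.
hrefs = html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']*)["']/i).flatten

# Match whole <img> tags instead of links
images = html.scan(/<img\b[^>]*>/i)

puts hrefs
puts images
```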

I hope you like the script, and feel free to edit it at your leisure. As with all the code on my blog, it is in the public domain. Enjoy and thanks for reading! 🙂

About The Author

Martyr2 is the founder of the Coders Lexicon and author of the new ebooks "The Programmers Idea Book" and "Diagnosing the Problem". He has been a programmer for over 25 years. He works for a hot application development company in Vancouver, Canada, which services some of the biggest tech companies in the world. He has won numerous awards for his mentoring in software development and contributes regularly to several communities around the web. He is an expert in numerous languages including .NET, PHP, C/C++, Java and more.