Tinkering around with Ruby, I figured I would write up a project that involved a little of everything a common Ruby program might have: a little sockets, a little class creation, array handling and so on. The only thing I didn’t include was a module, but you could easily build one in if you like. The result was a nice little project which scrapes links from a given page. Ruby is gaining in popularity for its expressive syntax, but in my personal opinion that comes with a few dangers: ambiguity, and code that is hard to read if you are not the one who wrote it, because of all the aliases you have to keep track of. Either way, it can do great things and is perfect for that little script you have always wanted to write. We show you how to page scrape with our favorite gemstone language right here on the Programming Underground!
First of all, we start by pulling in the socket library, which gives us access to socket handling and the TCPSocket class. We then create a class, which we called SocketClient, that is in charge of contacting the host and pulling the page across. It is also in charge of extracting the links and returning them to us when we call the “fetchLinks” method.
We then do a bit of error handling to check if the page fetch was successful and, if so, we simply store the fetched links in an array. We loop through this array and print the links to the screen one by one. Let’s take a look at how this can be done…
    # Require the socket library
    require 'socket'

    # Create a class which will take a host, port and possibly a page to fetch
    class SocketClient
      def initialize(host, port, page = "/")
        @host = host
        @port = port
        @page = page
      end

      # Regular expression which pulls all link tags out of the response from the server.
      def fetchLinks
        @answer.scan(/<a.*?<\/a>/i)
      end

      # Start kicks off the whole process: contacts the host, fetches the page
      # and stores it in the @answer instance variable.
      def start
        # Connect on the port we were given
        @socket = TCPSocket.new(@host, @port)
        @socket.print "GET #{@page} HTTP/1.1\r\n"
        @socket.print "Host: #{@host}\r\n"
        # Ask the server to close the connection once the page is sent,
        # otherwise reading to the end of the stream may wait on a kept-alive socket
        @socket.print "Connection: close\r\n\r\n"
        # gets(nil) reads everything until the server closes the connection
        @answer = @socket.gets(nil)
        @socket.close

        if @answer =~ /HTTP\/1.1 200 OK/
          true
        else
          false
        end
      rescue
        # Any connection or read error counts as a failed fetch
        false
      end
    end

    # Create an instance of the class, then
    # make a call to Google and fetch the links from its home page.
    client = SocketClient.new("www.google.ca", 80)

    # Start the client and pull the links into an array
    if client.start
      links = client.fetchLinks

      # Loop through the array and print each link to the screen
      links.each { |item| puts item }
    else
      puts "The server responded with an error code or we were unable to connect to the site"
    end
As the in-code comments show, we create an instance of our class, give it the host (www.google.ca in our case) along with the port number and it contacts the host for us. We have an optional third parameter which is the exact page to fetch but it will default to the root page if a page is not specified. An example of a page would be something like /subdirectory/index.php to get the page in a subdirectory.
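To make the optional third parameter a bit more concrete, here is a minimal sketch of the request text that goes over the wire for a given host and page. The helper name build_request is hypothetical and not part of the script; the sketch also sends a Connection: close header, which asks the server to close the socket after the response.

```ruby
# Hypothetical helper (illustration only): builds the raw HTTP/1.1 request
# text for a host and an optional page path.
def build_request(host, page = "/")
  "GET #{page} HTTP/1.1\r\n" \
    "Host: #{host}\r\n" \
    "Connection: close\r\n" \
    "\r\n"
end

# No page given, so the request asks for the root page
puts build_request("www.google.ca")

# Explicit page inside a subdirectory
puts build_request("www.google.ca", "/subdirectory/index.php")
```

The only line that changes between the two calls is the first one: GET / versus GET /subdirectory/index.php.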
We then start the client by calling the start method and, if all is good (the host replies with a 200 status code), we store the array returned from the fetchLinks method in a variable called “links”. After that we just loop through the links and display them on screen.
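The script looks for the literal text HTTP/1.1 200 OK, so a server that answers with HTTP/1.0 or with a redirect is treated as a failure. If you want to see what the server actually said, one possible approach is to pull the number out of the status line. This is only a sketch; status_code is a hypothetical helper name and the sample response is made up.

```ruby
# Hypothetical helper (illustration only): returns the numeric status code
# from a raw HTTP response, or nil if the text has no valid status line.
def status_code(response)
  match = response.to_s.match(/\AHTTP\/\d\.\d (\d{3})/)
  match && match[1].to_i
end

# Made-up response text standing in for what the socket would return
sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

if status_code(sample) == 200
  puts "Page fetched successfully"
else
  puts "Server answered with status #{status_code(sample).inspect}"
end
```

With the code in hand you could, for example, report a 404 differently from a redirect instead of returning a plain false.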
The script is small and that is the power of Ruby at work. We could condense this even more but for readability I struck a happy medium between lines of code and something that is easy to follow. As a programmer I am sure you can see something like this being part of a bigger project and maybe even change the regular expression in the fetchLinks method to pull out other tags or images etc. You could even build a spider on top of this if you so choose.
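As a rough idea of what changing the regular expression might look like, the sketch below pulls out just the href value of each link and the src value of each image instead of the whole tag. The helper names fetch_hrefs and fetch_image_sources and the sample HTML are made up for illustration, and the patterns assume quoted attribute values; a regular expression is a blunt tool for HTML, so treat this as a starting point rather than a complete parser.

```ruby
# Made-up HTML standing in for a fetched page
html = <<~HTML
  <a href="/about">About</a>
  <A HREF='http://example.com/news'>News</A>
  <img src="/logo.png" alt="Logo">
HTML

# Hypothetical helper: returns the href value of every anchor tag.
# The \b after "a" keeps tags such as <abbr> from matching.
def fetch_hrefs(text)
  text.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']/i).flatten
end

# Hypothetical helper: returns the src value of every img tag
def fetch_image_sources(text)
  text.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']*)["']/i).flatten
end

puts fetch_hrefs(html)
puts fetch_image_sources(html)
```

Because each pattern has one capture group, scan returns an array of one-element arrays, and flatten turns that into a plain list of strings.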
I hope you like the script, and feel free to edit it at your leisure. As with all the code on my blog, it is in the public domain. Enjoy and thanks for reading! 🙂