Dev Direct Solution Center


Use wodHttpDLX to download and parse HTML content from a web site.

  For a copy of the sample project please click here to download and install the product evaluation.

Introduction

wodHttpDLX is an HTTP client ActiveX control that provides easy access, at both high and low level, to the complete HTTP protocol. Its primary purpose is to retrieve documents and other resources from web servers; it also provides methods to access the resulting HTML.

Detail

Our solution code illustrates a simple case: we request a web page (here, "http://www.devdirect.com/"), fetch its content, and parse it to extract the URLs of all the pictures so that they can be retrieved.

Set up

Download and run the executable from the above link. This sample will be placed in the directory
C:\Program Files\WeOnlyDo.Com\HttpDLX\Samples\VB\HTML Parser\.

There are example projects for VB, VBS, ASP, Delphi7 and VC.

In the VB6 project, we reference the wodHttpDLXCom object, and event handlers are provided by the development environment.

This example handles two events. The Done event is raised after execution of the Get method has completed. The StateChange event is raised whenever the component changes from its current state to a different one, so it fires while the connection is being established, while content is being fetched, and so on.

In this demonstration we also use two instances of wodHttpDLX: one retrieves the page, and the other retrieves the images whose URLs were parsed from it. wodHtmlParser is part of the wodHttpDLX package and is used only for parsing.

VB6
Dim WithEvents Http1 As wodHttpDLXCom
Dim WithEvents Http2 As wodHttpDLXCom

The above code shows how we declared our wodHttpDLX instances. Now we need to get page content from the URL. We will do this simply by using the Get method inside of the Form_Load Event.

VB6
Private Sub Form_Load()
    ' get main webpage
    Http1.Get "http://www.devdirect.com/"
End Sub
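
The listings so far do not show where the two instances are created. A WithEvents declaration only reserves a variable, so the objects have to be instantiated before Get is called. The following is a minimal sketch of a fuller Form_Load, assuming the objects are created there and that ImagesURL (used later in this article without a declaration) is an ordinary VB Collection. Both points are assumptions, not something taken from the shipped sample.

```vb
' Form-level declarations (the first two lines are from the article)
Dim WithEvents Http1 As wodHttpDLXCom
Dim WithEvents Http2 As wodHttpDLXCom
' Assumption: ImagesURL is a plain VB Collection used as a queue of URLs
Dim ImagesURL As Collection

Private Sub Form_Load()
    ' create the objects before any property or method is used
    Set Http1 = New wodHttpDLXCom
    Set Http2 = New wodHttpDLXCom
    Set ImagesURL = New Collection

    ' get main webpage
    Http1.Get "http://www.devdirect.com/"
End Sub
```

A Collection suits the way ImagesURL is used later: it supports Add, Count, Item and Remove, and its items are numbered from 1.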

Calling Get initiates a series of state changes, which we track in the StateChange event by printing the name of the current state:

VB6
Debug.Print Http1.StateString(Http1.State)

Parsing the content

Following the download, the Done event fires, and this is where we do the parsing. To keep things simple, we create a wodHtmlParser object to do the work for us.

The FileName property is set to the name of the file that holds the page we just downloaded, and the Load method is then called to tell the parser to load the page into memory and parse the HTML tags.

VB6
Private Sub Http1_Done(ByVal ErrorCode As Long, ByVal ErrorText As String)
    If ErrorText <> "" Then
        MsgBox ErrorText
    Else
        Dim Parser1 As New wodHtmlParser
        Parser1.FileName = Http1.Response.FileName
        Parser1.Load

Having loaded the document into the parser, we can now use its methods to access different elements within the document. In this case we use the Filter method to create a collection of wodHtmlEntities which contains just the IMG (image) tags.

VB6
        Dim ents As wodHtmlEntities
        Set ents = Parser1.Parts.Filter("IMG")
        Set ents = ents.Search(ByAttributeName, "src", False)

We can then take each wodHtmlEntity object in the collection and read its attributes by name. In the following code we retrieve the src attribute in order to form the URL for the subsequent image download requests:

VB6
        Dim ent As wodHtmlEntity
        Dim i As Integer
        ' go one level inside and get all XML parts
        For i = 0 To ents.Count - 1
            Set ent = ents(i)
            Dim a As String
            a = ent.Attributes("src").Value
            If Left$(a, 1) <> "/" And InStr(1, a, "://") < 1 Then
                a = Http1.URL & a
            End If
            ImagesURL.Add a
            Debug.Print a
        Next
    End If
    ' start fetching images
    Http2_Done 0, ""
End Sub
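
The loop above prefixes the page URL only to paths that are relative to the document. A src value that starts with "/" is queued unchanged, and simple concatenation gives the wrong result when the page URL ends in a file name. The sketch below shows one way to cover those cases. ResolveImageUrl is a hypothetical helper, not part of wodHttpDLX, and it uses only built-in VB6 string functions. It assumes the page URL is absolute and contains no "/" after a "?", and it does not handle "../" segments or src values that begin with "//".

```vb
' Hypothetical helper (not part of wodHttpDLX): turns the src value of an
' IMG tag into an absolute URL, given the URL of the page it came from.
Private Function ResolveImageUrl(ByVal PageUrl As String, ByVal Src As String) As String
    Dim schemeEnd As Long
    Dim hostEnd As Long
    Dim lastSlash As Long

    ' already absolute (http://..., https://...): use as is
    If InStr(1, Src, "://") > 0 Then
        ResolveImageUrl = Src
        Exit Function
    End If

    ' find the second "/" of "scheme://" and the first "/" after the host
    schemeEnd = InStr(1, PageUrl, "://") + 2
    hostEnd = InStr(schemeEnd + 1, PageUrl, "/")
    If hostEnd = 0 Then
        ' no path at all, e.g. "http://www.devdirect.com"
        PageUrl = PageUrl & "/"
        hostEnd = Len(PageUrl)
    End If

    If Left$(Src, 1) = "/" Then
        ' root-relative: keep scheme and host, replace the whole path
        ResolveImageUrl = Left$(PageUrl, hostEnd - 1) & Src
    Else
        ' document-relative: keep everything up to the last "/" of the page URL
        lastSlash = InStrRev(PageUrl, "/")
        ResolveImageUrl = Left$(PageUrl, lastSlash) & Src
    End If
End Function
```

With this helper in the form module, the If block inside the loop could be replaced by a single line, a = ResolveImageUrl(Http1.URL, a). For example, a page URL of "http://www.devdirect.com/" and an illustrative src of "/images/logo.gif" yield "http://www.devdirect.com/images/logo.gif".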

The final step

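As you can see at the end of the above code, we call the Done event handler of our second wodHttpDLX instance directly. This handler contains the code for image retrieval: each call takes the next URL from ImagesURL and requests it, and the Done event raised when that download completes starts the next one, until no URLs remain.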

VB6
Private Sub Http2_Done(ByVal ErrorCode As Long, ByVal ErrorText As String)
    If Http2.Response.FileName <> "" Then
        ' load picture
        Dim p As IPictureDisp
        On Error Resume Next
        Set p = LoadPicture(Http2.Response.FileName)
        ' and draw it
        Form1.PaintPicture p, Rnd() * Form1.Width / 2, Rnd() * Form1.Height / 2
    End If
    If ImagesURL.Count > 0 Then
        Dim a As String
        a = ImagesURL.Item(1)
        ImagesURL.Remove 1
        Http2.KeepAlive = Never
        Http2.Get a
    End If
End Sub

Summary

When the program runs, the form fills with the images used on the web page that we retrieved.

This simple demonstration shows how to retrieve a page, parse its content in search of specific tags, and retrieve all the images used on it.

Visit WeOnlyDo! Inc. for more information and more samples.