Use wodHttpDLX to download and parse HTML content from a web site.

For a copy of the sample project, please click here to download and install the product evaluation.
Introduction
wodHttpDLX is an HTTP client ActiveX control that provides easy high-level and low-level access to the complete HTTP protocol. Its primary purpose is to retrieve documents and other resources from web servers, and it then provides methods to access the resulting HTML.
Detail
Our solution code illustrates a simple case in which we request a web page (in this case "http://www.devdirect.com/"), fetch its content, and then parse it to extract all the image URLs so that the images can be retrieved.
Set up
Download and run the executable from the above link. This sample will
be placed in the directory
C:\Program Files\WeOnlyDo.Com\HttpDLX\Samples\VB\HTML Parser\.
There are example projects for VB, VBS, ASP, Delphi7 and VC.
In the VB6 project, we reference the wodHttpDLXCom object,
and event handlers are provided by the development environment.
This example handles the Done event, which is raised after execution of the
GET method has completed, and the StateChange event,
which is raised whenever the component changes from its current state
to a different one (that is, it fires while
establishing the connection, fetching content, and so on).
In this demonstration we use two instances of wodHttpDLX. One
is used for page retrieval, while the other is used for retrieving the
images at the parsed URLs. The wodHTMLParser is part
of the wodHttpDLX package and is used only for parsing.
VB6
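Dim WithEvents Http1 As wodHttpDLXCom
Dim WithEvents Http2 As wodHttpDLXCom
' Queue of image URLs still to be downloaded (used by the handlers further down)
Dim ImagesURL As New Collection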
The above code shows how we declared our wodHttpDLX instances. Now we
need to get the page content from the URL. We do this simply by calling
the Get method inside the Form_Load event.
VB6
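Private Sub Form_Load()
    ' WithEvents variables cannot be declared As New, so create the instances here
    Set Http1 = New wodHttpDLXCom
    Set Http2 = New wodHttpDLXCom
    Http1.Get "http://www.devdirect.com/"
End Sub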
Calling Get initiates a series of state changes, which we
keep track of in the StateChange event:
VB6
Debug.Print Http1.StateString(Http1.State)
Parsing the content
Following the download, the Done event will fire,
which is where we do the parsing. To keep things simple,
we create a wodHTMLParser instance to do this for us.
The FileName property is set to the filename of the page we just downloaded, and then the Load method is called to tell the parser to load the page into memory and parse the HTML tags.
VB6
Private Sub Http1_Done(ByVal ErrorCode As Long, ByVal ErrorText As String)
    If ErrorText <> "" Then
        MsgBox ErrorText
    Else
        Dim Parser1 As New wodHtmlParser
        Parser1.FileName = Http1.Response.FileName
        Parser1.Load
Having loaded the document into the parser, we can now use its methods to access different elements within the document. In this case we use the Filter method to create a collection of wodHtmlEntities which contains just the IMG (image) tags.
VB6
        Dim ents As wodHtmlEntities
        Set ents = Parser1.Parts.Filter("IMG")
        Set ents = ents.Search(ByAttributeName, "src", False)
We can then access each individual wodHtmlEntity object in the collection and read its attributes by name. In the following code we retrieve the src attribute in order to form the URL for the subsequent image download requests:
VB6
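        Dim ent As wodHtmlEntity
        Dim i As Integer
        For i = 0 To ents.Count - 1
            Set ent = ents(i)
            Dim a As String
            a = ent.Attributes("src").Value
            ' Prefix relative paths with the page URL; absolute URLs and
            ' paths starting with "/" are left unchanged
            If Left$(a, 1) <> "/" And InStr(1, a, "://") < 1 Then
                a = Http1.URL & a
            End If
            ImagesURL.Add a
            Debug.Print a
        Next
    End If
    ' Start the image downloads by calling the second instance's handler directly
    Http2_Done 0, ""
End Sub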
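The relative-path check in the loop above leaves paths that start with "/" unchanged, so those entries are not yet complete URLs. The following is a minimal sketch of a fuller resolution step, not the sample's own code. ResolveUrl is a hypothetical helper that is not part of wodHttpDLX; it uses only built-in VB6 string functions and does not handle ".." segments or protocol-relative ("//host/...") paths.
VB6
' Hypothetical helper (not part of wodHttpDLX): turns the value of a src
' attribute into an absolute URL, given the URL of the page it came from.
Private Function ResolveUrl(ByVal PageUrl As String, ByVal Src As String) As String
    Dim schemeEnd As Long
    Dim hostEnd As Long

    ' Already absolute (contains a scheme such as http://): use as-is
    If InStr(1, Src, "://") > 0 Then
        ResolveUrl = Src
        Exit Function
    End If

    ' Locate the end of the scheme in the page URL
    schemeEnd = InStr(1, PageUrl, "://")
    If schemeEnd = 0 Then
        ' Not a URL this sketch understands; fall back to concatenation
        ResolveUrl = PageUrl & Src
        Exit Function
    End If

    ' Locate the first "/" after the host name
    hostEnd = InStr(schemeEnd + 3, PageUrl, "/")
    If hostEnd = 0 Then
        ' No path at all, e.g. "http://www.devdirect.com"
        PageUrl = PageUrl & "/"
        hostEnd = Len(PageUrl)
    End If

    If Left$(Src, 1) = "/" Then
        ' Root-relative: keep scheme and host, replace the whole path
        ResolveUrl = Left$(PageUrl, hostEnd - 1) & Src
    Else
        ' Document-relative: keep everything up to the last "/" of the page URL
        ResolveUrl = Left$(PageUrl, InStrRev(PageUrl, "/")) & Src
    End If
End Function
With this helper, the loop body could call ImagesURL.Add ResolveUrl(Http1.URL, a) in place of the inline check. For the page "http://www.devdirect.com/", both "images/logo.gif" and "/images/logo.gif" (made-up paths, for illustration only) resolve to "http://www.devdirect.com/images/logo.gif".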
The final step
As you can see at the end of the above code, we call the
Done event handler of our second wodHttpDLX instance directly. This handler
contains the code for image retrieval.
VB6
Private Sub Http2_Done(ByVal ErrorCode As Long, ByVal ErrorText As String)
    ' Paint the image that has just been downloaded, if there is one
    If Http2.Response.FileName <> "" Then
        Dim p As IPictureDisp
        On Error Resume Next
        Set p = LoadPicture(Http2.Response.FileName)
        Form1.PaintPicture p, Rnd() * Form1.Width / 2, Rnd() * Form1.Height / 2
    End If
    ' Request the next queued image; its Done event brings us back here
    If ImagesURL.Count > 0 Then
        Dim a As String
        a = ImagesURL.Item(1)
        ImagesURL.Remove 1
        Http2.KeepAlive = Never
        Http2.Get a
    End If
End Sub
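The handler above downloads the images one at a time: each completed request takes the next URL off the front of ImagesURL and issues a new Get, and the chain stops when the collection is empty. The queue step on its own can be sketched as below. DequeueUrl is a hypothetical helper name, not part of wodHttpDLX, and the sketch assumes ImagesURL is a standard VB6 Collection of strings.
VB6
' Hypothetical helper: removes and returns the first URL in the queue,
' or returns an empty string when the queue is empty.
Private Function DequeueUrl(ByVal Queue As Collection) As String
    If Queue.Count > 0 Then
        DequeueUrl = Queue.Item(1)   ' VB6 Collections are 1-based
        Queue.Remove 1
    End If
End Function
Inside Http2_Done, the second If block could then be written as a = DequeueUrl(ImagesURL), followed by Http2.Get a only when a is not empty. Keeping a single request in flight means the Response.FileName read at the top of the handler always belongs to the image that has just finished downloading.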
Summary
When the program is run, the form is filled with all the images
used on the web page that we retrieved.
This simple demonstration shows how to retrieve a page, parse its
content in search of specific tags, and retrieve all the images used on it.
Visit WeOnlyDo! Inc. for more information and more samples.