Datastead Software
THttpScan
Version 4.8 build 115 - October 5, 2010
http://www.datastead.com
Contact: contact@datastead.com
Support: support@datastead.com
Note to C++Builder users:
Add a #pragma link "inet.lib" statement at the top of your main .cpp file.
1. If a previous THttpScan package is already installed, remove it first:
- select Component | Install Packages,
- click on "Datastead THttpScan",
- click "Remove",
- click "Yes",
- click "Ok",
- search for "HttpScan.*" and "THtScan.*" files in your Borland directories and delete them, so that no old units remain in the search paths (they would cause errors later).
2. Install the current package:
- unzip the archive into a folder of your choice,
- according to your Delphi or C++Builder version, copy all the Delphi\*.* or CBuilder\*.* archive files to your \Borland\Delphi\Imports or \Borland\CBuilder\Imports directory,
- run Delphi or C++Builder,
- select Component | Install Packages,
- press the "Add" button,
- locate the THtScan.bpl file in the Imports directory and select it,
- select Open,
- select Ok,
- check the Datastead tab at the right of the component palette. The THttpScan object should have been added.
Note: if you get a "THtscanc.??? file not found" error when compiling or linking your project:
- go to Tools | Environment Options | Library, and check that you have ;$(DELPHI)\Imports (Delphi) or ;$(BCB)\Imports (C++Builder) in the Library path, otherwise add it at the end of the edit field.
- go to Project | Options | Packages, check "Build with runtime package", go to the end of the packages edit field, remove ";THtScan", and then uncheck "Build with runtime package".
function Start: Boolean (1st syntax)
Starts downloading and processing the URL set in the StartingUrl property, which must have been set beforehand.
function Start (StartingUrl_: string): Boolean (2nd syntax)
Starts downloading and processing the URL passed in the StartingUrl_ parameter.
procedure Stop
Stops the HttpScan process currently running.
AllowRedirect: boolean = true
If enabled, THttpScan follows redirected URLs. If disabled, redirected URLs are ignored.
ConcurrentDownloads: integer = 6
Number of HTML page downloads running simultaneously. A value between 4 and 20 is a good range, depending on your ISP speed and your processor.
DepthSearchLevel: integer = 3
Depth of the tree of followed pages, starting from the first URL. If the scan is kept on the host of the starting URL with LinkScan = scanInitialSite, a high value allows grabbing an entire web site.
DepthSearchLevel and LinkScan are the most important parameters of THttpScan.
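As an illustration of what a depth limit does, here is a minimal sketch in standard C++ (not VCL code, so it compiles on its own). It walks a hypothetical in-memory link graph breadth-first and stops following links below a given level. The names LinkGraph and crawl are invented for this example, and counting the starting page as level 0 is an assumption: the component's internal numbering is not documented here.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory "web": page URL -> links found on that page.
typedef std::map<std::string, std::vector<std::string> > LinkGraph;

// Breadth-first walk that follows links at most depthSearchLevel levels
// below the starting page (assumption: the starting page is level 0).
std::vector<std::string> crawl(const LinkGraph& web,
                               const std::string& startingUrl,
                               int depthSearchLevel) {
    std::vector<std::string> visited;
    std::set<std::string> seen;
    std::queue<std::pair<std::string, int> > pending;
    seen.insert(startingUrl);
    pending.push(std::make_pair(startingUrl, 0));

    while (!pending.empty()) {
        const std::pair<std::string, int> item = pending.front();
        pending.pop();
        visited.push_back(item.first);
        if (item.second >= depthSearchLevel) continue;  // do not go deeper

        LinkGraph::const_iterator page = web.find(item.first);
        if (page == web.end()) continue;                // no links known for this page
        for (std::size_t i = 0; i < page->second.size(); ++i) {
            const std::string& link = page->second[i];
            // Queue each link only once, one level below the current page.
            if (seen.insert(link).second) pending.push(std::make_pair(link, item.second + 1));
        }
    }
    return visited;
}
```

With a chain of pages a -> b -> d -> e, a limit of 1 visits only a and the pages it links to, while a higher limit reaches d and then e.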
HttpPort: integer = 80
HTTP port of the starting URL.
FileOfResults: string = ''
Complete path of the file in which to store the results of the processing. Leave this property blank if you do not want THttpScan to save the results to a file.
KeywordsFilter: string of keywords separated by char(13)+char(10), not visible in the Object Inspector.
Set of keywords used to filter out URLs. One keyword per line. Very short keywords will eliminate a lot of URLs (e.g. a keyword like "th" eliminates all the URLs containing "th"). Activated by KeywordsFilterEnabled = true.
KeywordsFilterEnabled: boolean = false
If set to true, the KeywordsFilter string list is used to determine whether the URL contains one of the keywords and must be ignored.
KeywordsLimiter: string of keywords separated by char(13)+char(10), not visible in the Object Inspector.
Set of keywords that URLs MUST CONTAIN. One keyword per line. Very short keywords will report a lot of URLs (e.g. a keyword like "th" allows all the URLs containing "th"). Activated by KeywordsLimiterEnabled = true.
KeywordsLimiterEnabled: boolean = false
If set to true, the KeywordsLimiter string list is used to determine whether the URL contains one of the keywords and must be reported.
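The effect of the two keyword lists can be modelled with a plain substring test. The sketch below is standard C++ written for illustration only, not the component's code. The function names are invented, the match is case-sensitive, and the way the two lists combine when both are enabled is an assumption.

```cpp
#include <string>
#include <vector>

// True if url contains at least one of the keywords as a substring.
bool containsAnyKeyword(const std::string& url,
                        const std::vector<std::string>& keywords) {
    for (std::size_t i = 0; i < keywords.size(); ++i) {
        if (!keywords[i].empty() && url.find(keywords[i]) != std::string::npos) return true;
    }
    return false;
}

// Models both lists: a URL is dropped when the filter is enabled and one of
// its keywords matches, or when the limiter is enabled and none matches.
bool urlPasses(const std::string& url,
               bool filterEnabled, const std::vector<std::string>& filter,
               bool limiterEnabled, const std::vector<std::string>& limiter) {
    if (filterEnabled && containsAnyKeyword(url, filter)) return false;
    if (limiterEnabled && !containsAnyKeyword(url, limiter)) return false;
    return true;
}
```

With the single keyword "th", a URL such as http://www.oursite.com/thumbs/a.jpg is dropped when the keyword is used as a filter and kept when it is used as a limiter.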
LeavesFirst: boolean = false
If set to true, and if we think of the pages scanned (starting from the initial URL) as a tree with its branches and leaves, THttpScan scans through the leaves before the branches.
LinkScan: TLinkScan = (scanAllLinks, scanInitialSite, scanInitialPath)
Sets the global way to surf through links.
scanAllLinks: for each HTML page found, all the links are downloaded and scanned, and so on.
scanInitialSite: scans only links owned by the site of the starting URL.
scanInitialPath: scans only links with the same sub path as the starting URL (links of the same tree level and below).
LinkReport: TLinkReport = (reportAllLinks, reportCurrentSiteLinks, reportCurrentPathLinks)
Sets the global way links are reported.
reportAllLinks: reports all links found in the current HTML page,
reportCurrentSiteLinks: reports links owned by the same site as the current HTML page,
reportCurrentPathLinks: reports only links with the same sub path as the current HTML page (links of the same tree level and below).
To explain by an example, let's say we have the following page:
http://www.oursite.com/info/mainpage.htm
In this page we have the following links:
1. http://www.anothersite.com/externalimage.gif
2. http://www.oursite.com/siteimage.gif
3. http://www.oursite.com/info/siteimage2.gif
- if you select reportAllLinks, links #1, #2 and #3 will be reported,
- if you select reportCurrentSiteLinks, only links #2 and #3 will be reported, because they are owned by www.oursite.com, which is the site of mainpage.htm,
- if you select reportCurrentPathLinks, only link #3 will be reported, because it is the only link under the path of mainpage.htm (under http://www.oursite.com/info/).
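The three reporting modes in the example above can be modelled with two small string helpers. This is a simplified sketch in standard C++, not the component's implementation. The helpers hostOf, folderOf and isReported are invented names, and the code assumes absolute URLs without a port or a query string.

```cpp
#include <string>

enum TLinkReport { reportAllLinks, reportCurrentSiteLinks, reportCurrentPathLinks };

// Host part of an absolute URL, e.g. "www.oursite.com".
std::string hostOf(const std::string& url) {
    std::string::size_type start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    const std::string::size_type end = url.find('/', start);
    return url.substr(start, end == std::string::npos ? std::string::npos : end - start);
}

// URL up to and including the last '/', e.g. "http://www.oursite.com/info/".
std::string folderOf(const std::string& url) {
    const std::string::size_type schemeEnd = url.find("://");
    const std::string::size_type from = (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;
    const std::string::size_type slash = url.rfind('/');
    if (slash == std::string::npos || slash < from) return url + "/";  // no path at all
    return url.substr(0, slash + 1);
}

// True if a link found on pageUrl would be reported in the given mode.
bool isReported(TLinkReport mode, const std::string& pageUrl, const std::string& link) {
    switch (mode) {
        case reportAllLinks:
            return true;
        case reportCurrentSiteLinks:
            return hostOf(link) == hostOf(pageUrl);
        case reportCurrentPathLinks: {
            const std::string folder = folderOf(pageUrl);
            return link.compare(0, folder.size(), folder) == 0;  // link starts with folder
        }
    }
    return false;
}
```

Applied to mainpage.htm, this reproduces the example: all three links in the first mode, links #2 and #3 in the second, and only link #3 in the third.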
MaxQueueSize: integer = 5000
Maximum size of the HTML pages queue. The queue grows faster than pages are analyzed: after a few minutes, we can have 50 pages analyzed and 10000 pages in queue. This queue size limitation helps to avoid memory problems with huge queues. New links found are ignored if adding them would make the queue larger than MaxQueueSize.
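The queue limit described above amounts to a bounded queue that silently drops new entries once it is full. A minimal sketch in standard C++ follows. The PageQueue class is hypothetical and only illustrates the rule, it is not how the component stores its queue.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical download queue that refuses new links once it is full.
class PageQueue {
public:
    explicit PageQueue(std::size_t maxQueueSize) : maxQueueSize_(maxQueueSize) {}

    // Returns false (and drops the link) if adding it would exceed the limit.
    bool add(const std::string& url) {
        if (pages_.size() >= maxQueueSize_) return false;
        pages_.push_back(url);
        return true;
    }

    std::size_t size() const { return pages_.size(); }

private:
    std::size_t maxQueueSize_;
    std::deque<std::string> pages_;
};
```

With a limit of 2, the third call to add returns false and the queue size stays at 2.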
Password: string = ''
Needed if the starting URL is username/password protected.
ProxyAddress: string = ''
IP address of the proxy server.
ProxyPassword: string = ''
Password used to authenticate to the proxy server.
ProxyPort: integer
Port of the proxy server.
ProxyType: tProxyType = (PROXY_DIRECT, PROXY_USE_PROXY, PROXY_DEFAULT)
PROXY_DIRECT: direct connection to the Internet, all the Proxy... parameters are ignored.
PROXY_USE_PROXY: the Proxy... parameters are used to authenticate to the proxy server.
PROXY_DEFAULT: the control panel parameters are used.
ProxyUser: string
User name used to authenticate to the proxy server.
Referrer: string = ''
OBSOLETE.
Retries: integer = 3
Number of download retries when a connect or GET error occurs.
SeekRobotsTxt: boolean = false
If set to true, THttpScan searches for robots.txt files at the root of the sites (http://www.hostname.foo/robots.txt). If the file is found, the body content is returned by the OnPageReceived event.
StartingUrl: string = ''
The URL from which the scanning starts. Must be set before calling the Start function if it is called without a URL parameter.
TimeOut: integer = 300
Time (in seconds) allowed to the HTTP thread to connect to a URL before aborting. The thread tries to connect Retries times before the OnError event occurs.
TypeFilter: string of file types separated by char(13)+char(10), not visible in the Object Inspector.
Set of file types used to report only the corresponding URLs. One file type per line (e.g. jpg, gif, mp3). Lowercase only. For jpeg use "jpg" and for mpeg use "mpg" (THttpScan converts jpeg to jpg and mpeg to mpg). Activated by TypeFilterEnabled = true.
TypeFilterEnabled: boolean = false
If set to true, the TypeFilter string list is used to report only the URLs whose file type is found in the TypeFilter list.
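The jpeg/mpeg conversion explains why the list must use "jpg" and "mpg". The following standard C++ sketch models that check. The function names are invented, and lowercasing the link type before the comparison is an assumption of this example rather than documented behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercases one character (helper for std::transform).
char lowerChar(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Lowercases a file type and maps "jpeg" to "jpg" and "mpeg" to "mpg".
std::string normalizeType(std::string type) {
    std::transform(type.begin(), type.end(), type.begin(), lowerChar);
    if (type == "jpeg") return "jpg";
    if (type == "mpeg") return "mpg";
    return type;
}

// True if the normalized link type appears in the type filter list.
bool isTypeReported(const std::string& typeLink,
                    const std::vector<std::string>& typeFilter) {
    return std::find(typeFilter.begin(), typeFilter.end(),
                     normalizeType(typeLink)) != typeFilter.end();
}
```

With a list containing jpg, gif and mp3, a link of type "jpeg" is reported because it is compared as "jpg", while a list entry written "jpeg" would never match.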
UserName: string = ''
Needed if the starting URL is username/password protected.
Working: boolean = false
Read only, not visible in the Object Inspector. Indicates the state of HttpScan: "waiting" or "working". Can be tested before closing the form to know whether downloads are currently running. See also the OnWorking event.
OnError (Sender: TObject; Url: String; ErrorCode: Cardinal; ErrorMsg: String);
Occurs when a "GET" request fails. Returns the URL which failed, with the error code and the error message if available.
OnHttpAuthenticate (Sender: TObject; HostName, Url: string; var UserName, Password: String; RetryCount: Integer; var Cancel: Boolean);
Occurs when an HTTP page requires authentication. You can set the related UserName and Password here, or assign False to Cancel to ignore the HTTP authentication. If a wrong UserName or Password is set, the event occurs again and RetryCount increases.
OnLinkFound (Sender: TObject; UrlFound, TypeLink, FromUrl, HostName, UrlPath, UrlPathWithFile, FileName, ExtraInfos: String; Port: Integer; var WriteToFile: String; HrefOrSrc: Char; CountArea: Integer; var FollowIfHtmlLink: Boolean);
This event occurs each time a link is found and returns the following parameters:
UrlFound: the full address of the link found
TypeLink: type of link (htm, jpg, mpg, cgi, php, etc.)
FromUrl: the referring URL (.htm) from which the link comes
HostName: the host name of the UrlFound address
UrlPath: the URL path (without host name and without file name)
UrlPathWithFile: the URL path (without host name but with file name)
FileName: the file name extracted from the URL path
ExtraInfos: the extra information passed to the URL (e.g. ?param1=v)
Port: the port number used in the HTTP request (usually 80)
WriteToFile: the line to be written to FileOfResults. See the comments about the WriteToFile parameter below.
HrefOrSrc: returns 'S' if the link is an object loaded on the page (a thumbnail for example) and 'H' if the link is the destination URL.
CountArea: each area found receives a sequential number. When a Href or Src link is found, it receives the number corresponding to its area, so that Href / Src link pairs can be associated.
FollowIfHtmlLink: if you set FollowIfHtmlLink to false in this event, THttpScan stops searching in the direction of the current link.
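One way to use CountArea and HrefOrSrc together is to collect links per area, so that a thumbnail can be matched with its destination. The sketch below is standard C++ for illustration. AreaLinks and AreaPairs are hypothetical helpers whose record method would be called from an OnLinkFound handler with the values received, and it assumes at most one Href and one Src per area.

```cpp
#include <map>
#include <string>

// The destination (Href) and the loaded object (Src) sharing one CountArea.
struct AreaLinks {
    std::string href;
    std::string src;
};

// Hypothetical collector fed with the values received in OnLinkFound.
class AreaPairs {
public:
    void record(int countArea, char hrefOrSrc, const std::string& urlFound) {
        AreaLinks& area = areas_[countArea];  // created on first use
        if (hrefOrSrc == 'H') area.href = urlFound;
        else if (hrefOrSrc == 'S') area.src = urlFound;
    }

    // True once both links of the area have been seen.
    bool isComplete(int countArea) const {
        const std::map<int, AreaLinks>::const_iterator it = areas_.find(countArea);
        return it != areas_.end() && !it->second.href.empty() && !it->second.src.empty();
    }

    const std::map<int, AreaLinks>& areas() const { return areas_; }

private:
    std::map<int, AreaLinks> areas_;
};
```

After recording an 'H' link and an 'S' link with the same area number, isComplete returns true for that area and both URLs can be read back from areas().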
OnLog (Sender: TObject; LogMessage: string);
Returns a string which explains the internal process (for debugging purposes).
OnMetaTag (Sender: TObject; Url, ReferringUrl, TagType, Tag1stAttrib, Tag1stValue, Tag2ndAttrib, Tag2ndValue, Tag3rdAttrib, Tag3rdValue: String);
Returns the tag type and attributes of the current HTML page. If there are 5 tags on a page, the event occurs 5 times for this page. The number of attributes differs according to the tag type, so the attribute parameters are called "1st", "2nd" and "3rd".
Url: the URL from which the meta tag is returned
ReferringUrl: the parent URL
TagType: TITLE, META, LINK, BASE, etc.
Tag1stAttrib: tag attribute, according to the TagType. E.g. if TagType = META, returns "NAME", "HTTP-EQUIV", etc.
Tag1stValue: value of the Tag1stAttrib, e.g. if Tag1stAttrib = "NAME", returns "keywords", "description", etc.
Tag2ndAttrib: e.g. if Tag1stAttrib = "NAME" and Tag1stValue = "keywords", returns "CONTENT".
Tag2ndValue: e.g. if Tag1stAttrib = "NAME", Tag1stValue = "keywords" and Tag2ndAttrib = "CONTENT", returns the content string.
Tag3rdAttrib: e.g. if TagType = "LINK", Tag1stAttrib = "REL", Tag1stValue = "STYLESHEET", Tag2ndAttrib = "HREF", Tag2ndValue = "/style/??.css", returns "TYPE".
Tag3rdValue: e.g. "text/css" for the sample above.
If this seems complicated, take a look at the demo, and you will find it is quite simple.
OnPageReceived (Sender: TObject; Hostname, Url, Head, Body: string);
This event occurs each time an HTML page is downloaded and returns the following parameters:
Hostname: host name of the page received
Url: URL of the text page received
Head: head of the HTTP request for the page received
Body: body of the text of the page received
OnProxyAuthenticate (Sender: TObject; var UserName, Password: String; RetryCount: Integer; var Cancel: Boolean);
Occurs when a proxy authentication is required. You can set the related UserName and Password here, or assign False to Cancel to ignore the proxy authentication. If a wrong UserName or Password is set, the event occurs again and RetryCount increases.
OnUpdatedStats (Sender: TObject; InQueue, Downloading, ToAnalyze, Done, Retries, Errors: Integer);
Occurs each time something changes in the HttpScan state. Returns the number of pages in queue (waiting for download), the number of pages currently downloading, the number of pages waiting to be analyzed, the number of pages analyzed (done), the number of retries, and the number of page downloads in error.
OnWorking (Sender: TObject; working_: boolean);
Occurs when HttpScan switches from the "waiting" state to the "working" state, and back. Can be used to detect when HttpScan has finished its job. You can also use the Working property.
Comments about the WriteToFile parameter used in the OnLinkFound event:
WriteToFile contains the string (for the last link found) that will be written to the FileOfResults file. If you leave it untouched, a line like this is written to the file for each link found: "TypeLink";"UrlFound";"HostName".
WriteToFile is useful to write only some kinds of links (e.g. "jpg") to the FileOfResults file, or to choose the information written to the file. For example:
If you want to write your own data to the file, e.g. TypeLink, UrlFound and FromUrl, add the following line in the event:
WriteToFile := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"';
If you want to skip the current link and write nothing to the file for it, add the following lines in the event:
if ...=... then begin
  WriteToFile := '';
end;
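For C++Builder users, the same two ideas (building a custom line, or returning an empty string to skip a link) can be sketched with plain string helpers. This is standard C++ for illustration. The function names are invented, keeping only "jpg" links is an arbitrary choice, and in a real handler the result would be assigned to the WriteToFile parameter.

```cpp
#include <string>

// Builds one line of three fields, each in double quotes, separated by ';'.
std::string quotedLine(const std::string& a, const std::string& b, const std::string& c) {
    return "\"" + a + "\";\"" + b + "\";\"" + c + "\"";
}

// Returns the line to write for a link, or "" to write nothing for it.
// Keeping only "jpg" links is an arbitrary choice made for this sketch.
std::string writeToFileFor(const std::string& typeLink,
                           const std::string& urlFound,
                           const std::string& fromUrl) {
    if (typeLink != "jpg") return "";
    return quotedLine(typeLink, urlFound, fromUrl);
}
```

For a "jpg" link the helper returns a line such as "jpg";"http://www.oursite.com/a.jpg";"http://www.oursite.com/", and for any other link type it returns an empty string.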
LICENSE AGREEMENT
BY INSTALLING, COPYING OR OTHERWISE USING THIS SOFTWARE AND ANY RELATED PRINTED MATERIALS ("SOFTWARE"), YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT.
IF YOU DO NOT AGREE WITH THE TERMS OF THIS AGREEMENT, DO NOT USE THE SOFTWARE.
Copyright
All Datastead components and applications are copyrighted by Michel FORNENGO (hereafter "author"), and shall remain the exclusive property of the author.
General license agreement
This software and any accompanying documentation are protected by International Copyright laws and Treaty provisions.
Any use of this software in violation of copyright law or the terms of this agreement will be prosecuted to the best of the author's ability.
You are hereby authorized to make archival copies of this software for the sole purpose of back-up and protecting your investment from loss.
Under no circumstances may you copy this software or documentation for the purposes of distribution to others. Under no conditions may you remove the copyright notices made part of the software or documentation.
By installing this software you agree that:
- you may not manipulate any binary files included in this package,
- you may not distribute any file included in this package (source code or binaries) to non-licensed people, unless stated in the "Distribution Rights of the licensed versions" section below.
The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated, but is not required.
Evaluation versions
The evaluation versions of the Datastead components have limited features, stop randomly and/or display nag-screens.
The evaluation versions may be used ONLY for evaluation purposes.
Licensed versions
By purchasing a license you are granted the non-exclusive right to develop your end-user applications based on the licensed version of the component you received after ordering the license.
Under no circumstances may you copy this software or documentation for the purposes of distribution to others, unless stated in the "Distribution Rights of the licensed versions" section below.
Under no conditions may you remove the copyright notices made part of the software or documentation.
Distribution Rights of the licensed versions
Delphi or C++Builder native VCL versions of the Datastead components
You are granted a non-exclusive, royalty-free right to produce and distribute your compiled end-user applications (.exe, .dll, etc...) compiled with the licensed Delphi or C++Builder versions of the Datastead components.
If you purchased a license including source code, you MAY NOT redistribute the source code, you may only modify and/or rebuild the component for the purpose of distributing your compiled end-user application.
OCX versions of the Datastead components
You are granted a non-exclusive, royalty-free right:
- to produce and distribute your end-user applications based on the OCX version of the component,
- to distribute the licensed Datastead OCX component, for the sole purpose of running and using your end-user applications based on the OCX version of the component.
Demo projects
The demo projects included in the package are free to use. The purpose of the sample code is to demonstrate how to use the SDK, so the sample code can be reused freely.
Competitive products
You may not use the licensed version of the component to create a competitive product.
Restrictions
Unless stated above in the "Distribution rights of the licensed versions" section, you may not distribute any of the author's commercial source code, compiled units or documentation by any means whatsoever. You may not transfer, lease, lend, copy, modify, translate, sublicense, time-share, or electronically transmit or receive the software or documentation.
Upgrade
The upgrade version of the software constitutes a single product of the author's software that you upgraded. For example, the upgrade and the software that you upgraded cannot both be available for use by two different people at the same time, without written permission from the author.
Limited warranty
Datastead warrants that for a period of ninety (90) days from the date of shipment from Datastead: (i) the media on which the Software is furnished will be free of defects in materials and workmanship under normal use; and (ii) the Software substantially conforms to its published specifications. Except for the foregoing, the Software is provided AS IS. This limited warranty extends only to Customer as the original licensee. Customer's exclusive remedy and the entire liability of Datastead and its suppliers under this limited warranty will be the refund of the Software. In no event does Datastead warrant that the Software is error free or that Customer will be able to operate the Software without problems or interruptions.
This warranty does not apply if the software (a) has been altered, except by Datastead, (b) has not been installed, operated, repaired, or maintained in accordance with instructions supplied by Datastead, (c) has been subjected to abnormal physical or electrical stress, misuse, negligence, or accident, or (d) is used in ultra hazardous activities.
Disclaimer
The Author cannot and does not warrant that any functions contained in the Software will meet your requirements, or that its operations will be error free. The entire risk as to the Software performance or quality, or both, is solely with the user and not the Author. You assume responsibility for the selection of the component to achieve your intended results, and for the installation, use, and results obtained from the Software.
The Author makes no warranty, either implied or expressed, including without limitation any warranty with respect to this Software documented here, its quality, performance, or fitness for a particular purpose. In no event shall the Author be liable to you for damages, whether direct or indirect, incidental, special, or consequential arising out of the use of or any defect in the Software, even if the Author has been advised of the possibility of such damages, or for any claim by any other party.
ALL DATASTEAD SOFTWARE IS NOT DESIGNED, MANUFACTURED, OR INTENDED FOR USE OR RESALE AS ON-LINE CONTROL EQUIPMENT IN HAZARDOUS ENVIRONMENTS REQUIRING FAIL-SAFE PERFORMANCE SUCH AS IN THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT NAVIGATION OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL, DIRECT LIFE SUPPORT MACHINES, OR WEAPONS SYSTEMS, IN WHICH THE FAILURE OF THE SOFTWARE COULD LEAD DIRECTLY OR INDIRECTLY TO DEATH, PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL DAMAGE.
All other warranties of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose, are expressly excluded.
General
This Agreement is the complete statement of the Agreement between the parties on the subject matter, and merges and supersedes all other or prior understandings, orders, agreements and arrangements. This Agreement shall be governed by the laws of France. Exclusive jurisdiction and venue for all matters relating to this Agreement shall be in courts and fora located in France, and you consent to such jurisdiction and venue. There are no third party beneficiaries of any promises, obligations or representations made by Datastead.
THttpScan recursively analyzes HTML pages and reports all the links it finds to a text file: html, mail, jpg, mpeg, mp3, etc.
THttpScan extracts links from HTML pages in the neighborhood of the initial URL. The HTML links found are added to a download queue. THttpScan downloads each related page, extracts the links found, and so on.
- the LinkScan property limits the scanning to the initial site or the initial URL path,
- the LinkReport property lets you report only the links owned by the current site, or the links under the subfolders of the initial link,
- the DepthSearchLevel property limits the level of pages scanned, starting from the initial page, especially when the scanning is not limited to a web site.
By using the LinkScan and LinkReport properties combined with a high DepthSearchLevel value, you can easily scan a whole site or only a subdirectory of a web site.
Events occur for each link found and each page read, returning the URL, meta tags, document type, referrer, host name...
Depending on the line speed, thousands of links may be extracted from a starting URL in a few minutes.
Most common parameters can simply be set from the Object Inspector.
System requirements
Windows Vista / XP / MCE / 2000 / NT / 98 / 95
Delphi or C++Builder
FAQ
Q: does THttpScan support window.open("sub-page.html") links?
A: Yes, THttpScan finds this kind of link. You can test that easily by running Demo.exe included in the demo project and starting from the following page:
http://www.javascript-coder.com/window-popup/javascript-window-open-example1.html
Q: If I pass an invalid URL when calling the Start method, the OnError event is triggered but the error code is 0 and there is no error message. Shouldn't this return a 404?
A: The 404 error code is returned by an existing and responding server when the specified HTML page does not exist.
E.g. for http://www.datastead.com/wrongpage.htm, the server (datastead.com) will return the 404 error.
However, if the server name itself is wrong (e.g. http://www.dataZtead.com/wrongpage.htm), the host "dataZtead.com" does not exist, so it cannot return any error code.
The "0" error code means that the specified server does not exist or does not reply.
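The distinction between the two codes can be turned into a small helper for displaying errors. This is a sketch in standard C++. The describeError function and its wording are invented for this example, and only the meanings of 0 and 404 come from the answer above.

```cpp
#include <string>

// Maps the ErrorCode received in OnError to a short explanation.
std::string describeError(unsigned int errorCode) {
    if (errorCode == 0) return "server does not exist or does not reply";
    if (errorCode == 404) return "server replied: page not found";
    return "error " + std::to_string(errorCode);  // any other code is shown as-is
}
```

Such a helper could be called from an OnError handler to log a readable message next to the URL that failed.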
Q: when I paste a long URL in the edit field, no link is reported.
A: If you paste the URL in a TEdit field, the TEdit field truncates the URL to 255 characters.
The workaround is to use a TMemo field instead (be sure to disable the WordWrap property, otherwise the URL could be broken by line feeds).
Q: when I build my project using C++Builder, I get the following errors:
Unresolved external 'InternetCloseHandle' referenced from BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
Unresolved external 'InternetCrackUrlA' referenced from BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
Unresolved external 'InternetCombineUrlA' referenced from BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
A:
- go to Project | Add to Project,
- in the "file type" listbox at the bottom of the tab, choose ".lib" files,
- navigate to your CBuilder\Lib directory, and choose either inet.lib (CBuilder4) or wininet.lib (CBuilder5 and CBuilder6),
- build and save your project.
Q: when I build my project using C++Builder, I get the following error:
Unable to find package import: THtScan.bpi
A: Go to Project | Options | Packages. Then go to the "Runtime packages" groupbox, at the bottom of the tab. Go to the end of the packages list edit box, and remove THtScan.bpi.