Q: I set up a project with Web Site extraction, but no page was processed. Can WDE not connect?
Q: I set up a project with "URLs from File" extraction and entered the filename, but WDE cannot find any links in the file. Why?
Q: When I run WDE, it uses all my computer's power and the screen hardly refreshes. What can I do?
Q: Can I resume an interrupted session in WDE?
Q: How can I add search engine listings other than those specified in the Engine Listing dialog for specific data mining tasks?
Q: What are the inactive sites shown in the data tab?
Q: Why does the extractor slow down after running for a whole day?
Q: How do I get more data in WDE? When I query a search engine I see millions of matches.
Q: I need to get into a message board community that is username/password protected and get every email there. Can your product do this effectively?
Q: Should I use more threads to complete the session quickly?
Q: I set up a project with Web Site extraction, but no page was
processed. Can WDE not connect?
A: Several things can cause this:
(1) Check your Internet connection - you must be online.
(2) Check your proxy settings. If you are behind a firewall / proxy
server, enter the necessary information in the "New Session
Dialog - Proxy" tab. If you do not know your proxy details, contact your
ISP / system administrator.
(3) Is the site password protected? You cannot extract data from
protected sites.
(4) Make sure the site is not down temporarily or permanently. Check
whether your default browser can load it.
(5) Does the site use some type of redirect? For example, you enter a
URL like http://www.car.com and the server
redirects to http://www.truck.com . In that
case, use http://www.truck.com as
your starting address in the "New Session" dialog.
(6) Check that you did not use an exclude URL filter like "/" or
"com" in "New Session Dialog - URL Filter", which will
prevent WDE from processing any site.
(7) Check that the site's home / index page is not just a Java applet.
Like other spiders, WDE cannot parse Java applets.
(8) WDE does not support the secure https:// protocol.
(9) Finally, did you set a very low request time-out in the "New
Session - Other" tab? The default time-out is 100 seconds. With a
much lower value, WDE may abort the request before the host server replies.
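To illustrate item (6), here is a minimal Python sketch of why an exclude filter such as "/" or "com" blocks every site. It assumes that exclude filters are matched as plain substrings of the URL; that matching rule and the helper name `is_excluded` are assumptions made for illustration, not WDE's documented behavior.

```python
def is_excluded(url, exclude_filters):
    """Return True if any exclude filter occurs anywhere in the URL.

    Assumption: filters are plain substring matches (hypothetical model).
    """
    return any(f in url for f in exclude_filters)


urls = ["http://www.car.com", "http://www.truck.com/models.html"]

# "/" is part of "http://", so this filter excludes every URL.
print([is_excluded(u, ["/"]) for u in urls])

# A narrower filter excludes only the URLs that actually contain it.
print([is_excluded(u, ["/models"]) for u in urls])
```

Under this assumption, a filter should be specific enough that it does not appear in the addresses you want to keep.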
Q: I set up a project with "URLs from File" extraction and entered
the filename, but WDE cannot find any links in the file. Why?
A: Make sure the file exists on disk. The file must contain
one URL per line; no other format is supported. WDE accepts only
lines that start with http://. WDE also rejects URLs that
point to image/binary files, because those files do not contain any text
data to extract.
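Before loading a file, you can pre-check it with a short script. The Python sketch below keeps only lines that start with http:// and skips URLs that end in common image/binary extensions. The extension list and the function name `usable_urls` are assumptions for illustration; WDE's own list of rejected file types is not given here.

```python
# Hypothetical list of image/binary extensions; adjust as needed.
BINARY_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".zip", ".exe")


def usable_urls(lines):
    """Return the lines that look like URLs WDE would accept."""
    urls = []
    for line in lines:
        url = line.strip()
        # Only lines that start with http:// are accepted.
        if not url.startswith("http://"):
            continue
        # Ignore any query string when checking the file extension.
        path = url.split("?", 1)[0].lower()
        if path.endswith(BINARY_EXTENSIONS):
            continue
        urls.append(url)
    return urls


sample = [
    "http://www.car.com\n",
    "www.truck.com\n",                 # missing http://
    "http://www.car.com/logo.gif\n",   # image file
]
print(usable_urls(sample))
```

To check a real file, pass the open file object instead of the sample list, for example `usable_urls(open("urls.txt"))`, where `urls.txt` is a placeholder name.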
Q: When I run WDE, it uses all my computer's power and the screen
hardly refreshes. What can I do?
A: You are probably using a high number of threads.
Decrease the thread value to "5" in the "New Session -
Other" tab. WDE can launch multiple threads simultaneously, but
too high a thread setting may be more than your computer
and/or Internet connection can handle. It also puts an unfair load on
the host server, which may slow the process down.
Q: Can I resume an interrupted session in WDE?
A: Yes. Use the 'File - Open' menu command to open the log file of
the previously stopped session.
Q: How can I add search engine listings other than those specified in the
Engine Listing dialog?
A: It is easy. In the "URL" field, type the search
query URL and replace the search keyword part with the WDE syntax {SEARCH_KEYWORD}.
For example, an AOL query URL for a "Flower Shop" search is:
http://search.aol.com/dirsearch.adp?query=Flower+Shop
Replace the Flower+Shop part with {SEARCH_KEYWORD},
like this:
http://search.aol.com/dirsearch.adp?query={SEARCH_KEYWORD}
After adding the new engine listing, click the "Save"
button.
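The placeholder substitution can be sketched in Python as follows. This is only an illustration of the idea: the function name `build_query_url` is hypothetical, and encoding spaces as "+" is an assumption based on the AOL example above, not a statement of how WDE encodes keywords internally.

```python
from urllib.parse import quote_plus

PLACEHOLDER = "{SEARCH_KEYWORD}"


def build_query_url(template, keyword):
    """Fill the {SEARCH_KEYWORD} placeholder with a URL-encoded keyword."""
    if PLACEHOLDER not in template:
        raise ValueError("template must contain " + PLACEHOLDER)
    # quote_plus encodes spaces as "+", matching the Flower+Shop example.
    return template.replace(PLACEHOLDER, quote_plus(keyword))


template = "http://search.aol.com/dirsearch.adp?query={SEARCH_KEYWORD}"
print(build_query_url(template, "Flower Shop"))
# prints http://search.aol.com/dirsearch.adp?query=Flower+Shop
```

The check for the placeholder catches the common mistake of pasting a query URL without replacing the keyword part.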
Q: Why does the extractor slow down after running for a whole day?
A: Do not use many threads in the New
Session Dialog - Other tab. Use 5 or fewer.
Also avoid very broad searches: the program keeps extracted URLs,
emails, etc. in RAM to avoid duplicate data and to avoid revisiting
sites, so a broad search uses a lot of RAM and may slow the program down.
If you do run a broad search, uncheck the 'View - Display data in data tab'
menu item so that no data is shown in the data tab; this improves performance.
Do not use 'Follow External Sites - Spider Unlimited Loop' in the New Session
Dialog. With this option WDE can wander across the entire Internet and crash easily.
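The memory growth described above can be pictured with a small Python sketch of in-memory duplicate tracking. It is not WDE's implementation; the class `SeenStore` is hypothetical and only shows that every unique URL or email stays in memory for the whole session, so a broader crawl means a larger store.

```python
class SeenStore:
    """Keeps every unique item in memory to filter out duplicates."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, item):
        """Return True if the item was not seen before, else False."""
        if item in self._seen:
            return False
        self._seen.add(item)
        return True

    def __len__(self):
        return len(self._seen)


store = SeenStore()
for url in ["http://www.car.com", "http://www.truck.com", "http://www.car.com"]:
    if store.add_if_new(url):
        print("visit", url)

# The store only ever grows: one entry per unique item found so far.
print(len(store))
```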
Q: How do I get more data in WDE? When I query a search engine I see
millions of matches.
A: To get more results:
(1) Select all search engines and click Save in New Session Dialog ->
Engine Listing Dialog.
(2) Use Intelligent Spidering Mode in the External Site tab.
Note that:
(1) Although you see millions of matches in the search results, search engines
do not deliver more than 1000 results. For example, try to view the 1001st
result in any search engine.
(2) You will see some similar programs reporting huge numbers of emails.
These are not real, targeted addresses; the numbers are meant to persuade
you to buy. Always check the source of extracted emails.
Q: What are the inactive sites shown in the data tab?
A: These are sites that WDE could not connect to. A
site may be down temporarily, or its domain may have expired. If you want to
try these sites later, save the list using the "Save" button and use the
"New Session Dialog - URLs from File" option to process them
later.
Q: I need to get into a message board community that is
username/password protected and get every email there.
Can your product do this effectively?
A: It depends on the kind of authentication the site uses.
It is possible for a password-protected directory (enter the login info in
the New Session Dialog - Login tab).
If the site works like http://mail.yahoo.com/ then it is not
possible.
Q: Should I use more threads to complete the session quickly?
A: That is true for a smaller session that will
complete within a few hours.
But for large-scale sessions that take many hours, use a low thread count
(say 5).
Threads are used to download data simultaneously.
It is not true that more threads always mean faster extraction. After
downloading, the program must analyze and parse the data to extract emails,
phone numbers, etc., and collect internal links for further extraction. So
the more threads you use, the busier the program and CPU become. Use
10 threads for smaller sessions and 5 for large sessions.