Configure HTTrack to mirror websites

Tuesday, 20 October 2015 15:37

Written by shoulders

This article gives you the settings needed to make HTTrack scrape only your chosen website instead of trying to download the whole internet. HTTrack is powerful but needs to be set up correctly to get the best results.

Solution / Explanation

Scan Rules Tab

This is a good generic rule set:

+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/*

This is the same generic rule set with an extra rule for a particular domain (a rule starting with - excludes whatever it matches):

+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/*
-*[name].templates.joomla-monster.com/*

To get just images and HTML, use:
```
-* +mime:text/html +mime:image/*
```
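
The way these rule sets combine can be illustrated with a small Python sketch. It assumes, as I understand HTTrack's filter documentation, that rules are read left to right and that a later matching rule overrides an earlier one. The function `is_allowed` and its `default` parameter are hypothetical names for illustration, not part of HTTrack. The sketch uses Python's `fnmatch` for wildcards, skips `mime:` rules (they need the server's response), and does not handle HTTrack's special `*[name]` syntax.

```python
from fnmatch import fnmatchcase

def is_allowed(url, rules, default=False):
    """Return True if the URL is accepted; the last matching rule wins."""
    verdict = default
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if pattern.startswith("mime:"):
            continue  # MIME rules depend on the response; not modelled here
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict

RULES = "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"

print(is_allowed("example.com/img/logo.png", RULES))       # True
print(is_allowed("ad.doubleclick.net/banner.gif", RULES))  # False
```

The second URL matches `+*.gif` first, but the later `-ad.doubleclick.net/*` rule overrides it, which is why the exclusions are placed after the generic includes.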

Limits Tab

Maximum mirroring depth = empty
- this prevents HTTrack from going off-site
- 0 = scrape everything; the first page is depth 1, subsequent pages and images are depth 2, etc.
Maximum external depth = empty
- this sets the depth to which external websites are scanned
- 0 = scrape everything; the first page is depth 1, subsequent pages and images are depth 2, etc.
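
To make the depth numbering concrete, here is a minimal Python sketch of a depth-limited crawl over an in-memory link graph. It is not how HTTrack is implemented; the `SITE` dictionary, the file names, and the function `pages_within_depth` are made up for illustration. The start page counts as depth 1, and passing `max_depth=None` stands in for leaving the field empty.

```python
from collections import deque

def pages_within_depth(links, start, max_depth=None):
    """Breadth-first walk of a link graph; the start page is depth 1."""
    depth_of = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # at the limit: do not follow this page's links
        for target in links.get(page, []):
            if target not in depth_of:
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of

# Hypothetical site: each page maps to the files it links to.
SITE = {
    "index.html": ["about.html", "logo.png"],
    "about.html": ["team.html"],
    "team.html": ["photo.jpg"],
}

print(pages_within_depth(SITE, "index.html", 2))
# {'index.html': 1, 'about.html': 2, 'logo.png': 2}
```

With a limit of 2 the crawl stops at the pages and images linked directly from the first page; with no limit it would also reach `team.html` and `photo.jpg`.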

Flow Control Tab

Number of connections = 4
- this prevents the scraping from getting flagged because of too many connections
Keep connections persistent = yes

Links Tab

Attempt to detect all links = yes
Get non-HTML files related to a link, e.g. external zip or pictures = yes
- this will get images etc. even if they are off-site
Get HTML files first = yes
- this gets the HTML first, just in case HTTrack gets blocked

Build Tab

No external pages = no
- When enabled, this rewrites all external links (links that need an Internet connection) so that a warning page is shown first ("Warning, you need to be online to go to this link.."). Useful if you want to separate the local and online realms.

Browser ID Tab

I use this browser ID string for better results:

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0

HTML Footer = none
- this removes the HTTrack message from the footer
Language = en, *
Default referer = https://www.google.co.uk/
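
To see what these three settings amount to on the wire, the following Python sketch builds (but does not send) a request carrying the equivalent HTTP headers. The URL `https://example.com/` is a placeholder, and the `Accept-Language` value is the `AcceptLanguage` entry from the example configuration; this only illustrates the headers and is not HTTrack's own code.

```python
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) "
    "Gecko/20100101 Firefox/56.0"
)

# Build the request object only; nothing is sent over the network.
request = Request(
    "https://example.com/",  # placeholder URL
    headers={
        "User-Agent": USER_AGENT,
        "Accept-Language": "en, *",
        "Referer": "https://www.google.co.uk/",
    },
)

# Print the headers a server would receive.
for name, value in request.header_items():
    print(f"{name}: {value}")
```

A server that filters on these headers sees something that looks like an ordinary Firefox visit arriving from a Google search, rather than HTTrack's default identification.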

Spider Tab

Spider: no robots.txt rules
- we want everything
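
For context on what this setting skips, here is a short Python sketch that uses the standard library's `urllib.robotparser` to evaluate a made-up robots.txt file. The rules and URLs are invented for illustration. A crawler that honours robots.txt would leave out the disallowed path, whereas with this option turned off HTTrack attempts it anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a site at example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/index.html", "/private/report.html"):
    allowed = parser.can_fetch("*", "https://example.com" + path)
    print(path, "allowed" if allowed else "disallowed by robots.txt")
```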

Leave everything else as default

Notes

Squarespace uses JavaScript to render the assets on the page, so HTTrack cannot download them.

Example Config

This is my working configuration for HTTrack. Try it on a demo site before doing anything large.

To use this:

Create a file called HTTrack working options.opt
Paste in the code below
Go to (Preferences-->Load Options)
Select the file you have just created
Build your project and it will use the new options

You can save these as your default options by clicking (Preferences-->Save default options).

Near=1
Test=0
ParseAll=1
HTMLFirst=1
Cache=1
NoRecatch=0
Dos=0
Index=1
WordIndex=0
MailIndex=0
Log=1
RemoveTimeout=0
RemoveRateout=0
KeepAlive=1
FollowRobotsTxt=0
NoErrorPages=0
NoExternalPages=0
NoPwdInPages=0
NoQueryStrings=0
NoPurgeOldFiles=0
Cookies=1
CheckType=1
ParseJava=1
HTTP10=0
TolerantRequests=0
UpdateHack=1
URLHack=1
StoreAllInCache=0
LogType=0
UseHTTPProxyForFTP=1
Build=0
PrimaryScan=3
Travel=1
GlobalTravel=0
RewriteLinks=0
BuildString=%%h%%p/%%n%%q.%%t
Category=Ripped Web Sites
MaxHtml=
MaxOther=
MaxAll=
MaxWait=
Sockets=4
Retry=
MaxTime=
TimeOut=
RateOut=
UserID=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
Footer=(none)
AcceptLanguage=en, *
OtherHeaders=
DefaultReferer=https://www.google.co.uk/
MaxRate=25000
WildCardFilters=+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
Proxy=
Port=
Depth=
ExtDepth=
MaxConn=
MaxLinks=
MIMEDefsExt1=
MIMEDefsExt2=
MIMEDefsExt3=
MIMEDefsExt4=
MIMEDefsExt5=
MIMEDefsExt6=
MIMEDefsExt7=
MIMEDefsExt8=
MIMEDefsMime1=
MIMEDefsMime2=
MIMEDefsMime3=
MIMEDefsMime4=
MIMEDefsMime5=
MIMEDefsMime6=
MIMEDefsMime7=
MIMEDefsMime8=
CurrentUrl=
CurrentAction=0
CurrentURLList=
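
If you want to check which options in a file like this are actually set, a few lines of Python are enough. The sketch below assumes the file is plain `KEY=VALUE` text, as it appears above. The function `parse_options` and the shortened `SAMPLE` text are illustrative only. Values are split on the first `=` so that URLs and user-agent strings stay intact.

```python
def parse_options(text):
    """Parse KEY=VALUE lines into a dict, splitting on the first '=' only."""
    options = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options

# A shortened excerpt of the configuration above.
SAMPLE = """\
Sockets=4
KeepAlive=1
FollowRobotsTxt=0
MaxRate=25000
Depth=
DefaultReferer=https://www.google.co.uk/
"""

options = parse_options(SAMPLE)
# Keep only the options that have a value; empty ones fall back to defaults.
explicit = {key: value for key, value in options.items() if value}
for key, value in explicit.items():
    print(f"{key} = {value}")
```

To inspect a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `parse_options`; the encoding is an assumption, so adjust it if the file was saved differently.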

Last modified on Saturday, 09 December 2017 12:27