Tuesday, 20 October 2015 15:37

Configure HTTrack to mirror websites

This article gives you the settings needed to make HTTrack scrape only your chosen website, without trying to download the whole internet. HTTrack is powerful but needs to be set up correctly to get the best results.

Solution / Explanation

Scan Rules Tab

  • This is a good generic rule set
    +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/*
  • This is the same generic rule set with an extra rule that excludes a particular domain
    +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/*
    -*[name].templates.joomla-monster.com/*
    
  • To get only images and HTML, use
    -* +mime:text/html +mime:image/*
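
To make the rule order easier to reason about, here is a minimal Python sketch of how a list of + and - patterns can be evaluated so that the last matching rule wins, which is how the HTTrack filter documentation describes rule priority. It is an approximation only: it handles plain * wildcards, skips the mime: rules (they depend on the server response), and the function name is_allowed and its default argument are assumptions made for illustration, not part of HTTrack.

```python
from fnmatch import fnmatchcase


def is_allowed(url, rules, default=True):
    """Return the verdict of the last +/- pattern that matches the URL."""
    verdict = default  # assumption: what happens when no rule matches
    for rule in rules.split():
        sign, pattern = rule[0], rule[1:]
        if pattern.startswith("mime:"):
            continue  # MIME rules need the response headers; not modelled here
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")  # a later match overrides an earlier one
    return verdict


RULES = "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"

print(is_allowed("example.com/img/logo.png", RULES))      # True
print(is_allowed("ad.doubleclick.net/banner.js", RULES))  # False
```

The second URL matches +*.js first but is then rejected by the later -ad.doubleclick.net/* rule, which is why exclusions are placed after the generic includes.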

Limits Tab

  • Maximum mirroring depth = empty
    • leaving this empty sets no depth limit inside the site
    • 0 = scrape everything; the first page is depth 1, pages and images linked from it are depth 2, etc.
  • Maximum external depth = empty
    • this sets how many levels of external websites are scanned
    • leaving this empty keeps the default, which prevents HTTrack from going off-site

Flow Control Tab

  • Number of connections = 4
    • this prevents the scraping getting flagged because of too many connections
  • Keep connections persistent = yes

Links Tab

  • Attempt to detect all links = yes
  • Get non-HTML files related to a link, eg external zip or pictures = yes
    • this will get images etc. even if they are off-site
  • Get HTML files first = yes
    • this gets the HTML first, just in case HTTrack gets blocked

Build Tab

  • No external pages = no
    • when enabled, this option rewrites all external links (links that need an Internet connection) so that a warning page is shown first ("Warning, you need to be online to go to this link.."). It is useful if you want to separate the local and online realms

Browser ID Tab

  • I use this browser ID string for better results
    Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
  • HTML Footer = none
    • this removes the HTTrack message from the footer
  • Language = en, *
  • Default referer = https://www.google.co.uk/

Spider Tab

  • Spider: no robots.txt rules
    • we want everything

 

Leave everything else as default
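
If you prefer the command-line version of HTTrack, the tab settings above can be approximated with flags. The Python sketch below only assembles and prints a command, it does not run anything. The flag letters are taken from the httrack manual page as I understand it, so check them against httrack --help for your version. The target URL, the output directory and the function name build_httrack_command are placeholders chosen for illustration.

```python
import shlex

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) "
              "Gecko/20100101 Firefox/56.0")
SCAN_RULES = "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"


def build_httrack_command(url, output_dir):
    """Assemble (but do not run) an httrack command mirroring the GUI tabs."""
    return [
        "httrack", url,
        "-O", output_dir,   # where the mirror is written (placeholder path)
        "-c4",              # Flow Control: number of connections = 4
        "-%k",              # Flow Control: keep connections persistent
        "-%P",              # Links: attempt to detect all links
        "-n",               # Links: get non-HTML files related to a link
        "-s0",              # Spider: no robots.txt rules
        "-F", USER_AGENT,   # Browser ID: browser ID string
        "-%l", "en, *",     # Browser ID: language
        "-%R", "https://www.google.co.uk/",  # Browser ID: default referer
        *SCAN_RULES.split(),  # Scan Rules: one argument per filter
    ]


cmd = build_httrack_command("https://example.com/", "mirror")
print(shlex.join(cmd))  # quoted so the * wildcards are safe to paste in a shell
```

Building the command as a list keeps the wildcard filters away from shell expansion; shlex.join is only used to print a copy that can be pasted safely.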

 

Notes

  • Squarespace uses JavaScript to render the assets on the page, so HTTrack cannot download them.

Example Config

This is my working configuration for HTTrack. Try it on a demo site before doing anything large.

To use this

  • create a file called HTTrack working options.opt
  • paste in the code below
  • go to (Preferences-->Load Options)
  • select the file you have just created
  • build your project and it will use the new options

You can save these as your default options by clicking (Preferences-->Save default options)

Near=1
Test=0
ParseAll=1
HTMLFirst=1
Cache=1
NoRecatch=0
Dos=0
Index=1
WordIndex=0
MailIndex=0
Log=1
RemoveTimeout=0
RemoveRateout=0
KeepAlive=1
FollowRobotsTxt=0
NoErrorPages=0
NoExternalPages=0
NoPwdInPages=0
NoQueryStrings=0
NoPurgeOldFiles=0
Cookies=1
CheckType=1
ParseJava=1
HTTP10=0
TolerantRequests=0
UpdateHack=1
URLHack=1
StoreAllInCache=0
LogType=0
UseHTTPProxyForFTP=1
Build=0
PrimaryScan=3
Travel=1
GlobalTravel=0
RewriteLinks=0
BuildString=%%h%%p/%%n%%q.%%t
Category=Ripped Web Sites
MaxHtml=
MaxOther=
MaxAll=
MaxWait=
Sockets=4
Retry=
MaxTime=
TimeOut=
RateOut=
UserID=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
Footer=(none)
AcceptLanguage=en, *
OtherHeaders=
DefaultReferer=https://www.google.co.uk/
MaxRate=25000
WildCardFilters=+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
Proxy=
Port=
Depth=
ExtDepth=
MaxConn=
MaxLinks=
MIMEDefsExt1=
MIMEDefsExt2=
MIMEDefsExt3=
MIMEDefsExt4=
MIMEDefsExt5=
MIMEDefsExt6=
MIMEDefsExt7=
MIMEDefsExt8=
MIMEDefsMime1=
MIMEDefsMime2=
MIMEDefsMime3=
MIMEDefsMime4=
MIMEDefsMime5=
MIMEDefsMime6=
MIMEDefsMime7=
MIMEDefsMime8=
CurrentUrl=
CurrentAction=0
CurrentURLList=
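
Before loading an options file, it can be worth checking that the values you care about were pasted correctly. The sketch below is a minimal Python example that reads Key=Value lines like the ones above into a dictionary and compares a few of them. The function name parse_opt and the short SAMPLE text are illustrative assumptions; in practice you would read the contents of your own .opt file instead.

```python
def parse_opt(text):
    """Parse 'Key=Value' option lines into a dict (empty values are kept)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")  # split on the first '=' only
        options[key] = value
    return options


# A few lines copied from the configuration above, used as sample input.
SAMPLE = """\
Sockets=4
KeepAlive=1
FollowRobotsTxt=0
DefaultReferer=https://www.google.co.uk/
Depth=
"""

opts = parse_opt(SAMPLE)
EXPECTED = {"Sockets": "4", "KeepAlive": "1", "FollowRobotsTxt": "0"}
mismatches = {k: opts.get(k) for k, v in EXPECTED.items() if opts.get(k) != v}
print(mismatches or "all checked options match")
```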

Last modified on Saturday, 09 December 2017 12:27