This article gives you the settings to make HTTrack scrape just your chosen website, without trying to download the whole internet. HTTrack is powerful, but it needs to be set up correctly to get the best results.
Solution / Explanation
Scan Rules Tab
- This is a good generic rule set
+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/*
- This is the same generic rule set, but with one particular domain excluded
+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar +mime:text/html +mime:image/* -*[name].templates.joomla-monster.com/*
- To get just images and HTML, use
-* +mime:text/html +mime:image/*
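If you use the httrack command-line tool instead of WinHTTrack, the same scan rules go at the end of the command, quoted so the shell does not expand them. A sketch using the generic rule set above, with https://example.com/ and ./mirror as placeholders:
httrack "https://example.com/" -O ./mirror "+*.png" "+*.gif" "+*.jpg" "+*.jpeg" "+*.css" "+*.js" "-ad.doubleclick.net/*" "+mime:text/html" "+mime:image/*"
As far as I can tell, when two rules match the same URL the later one wins, so put exclusions after the wildcards they narrow down.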
Limits Tab
- Maximum mirroring depth = empty
- empty means no limit, so the whole of the chosen site is mirrored
- the depth is counted in levels: the first page is 1, pages and images linked from it are 2, and so on
- Maximum external depth = empty
- this sets how many levels deep HTTrack follows links into external websites
- leaving it empty keeps HTTrack from going offsite
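On the command line these two limits map onto the -r (mirroring depth) and -%e (external depth) switches; a sketch, with the depth values purely illustrative:
httrack "https://example.com/" -O ./mirror -r6 -%e0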
Flow Control Tab
- Number of connections = 4
- this keeps the scrape from being flagged because of too many simultaneous connections
- Keep connections persistent = yes
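The command-line equivalents are -c for the number of simultaneous connections and -%k for keep-alive; for example:
httrack "https://example.com/" -O ./mirror -c4 -%k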
Links Tab
- Attempt to detect all links = yes
- Get non-HTML files related to a link, e.g. external ZIP files or pictures = yes
- this will get images etc. even if they are off-site
- Get HTML files first = yes
- this downloads the HTML first, in case HTTrack gets blocked part-way through
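On the command line, -%P enables extended parsing (attempt to detect all links), -n gets non-HTML files 'near' an HTML file even if they are off-site, and -p7 (as far as I can tell) fetches HTML files first:
httrack "https://example.com/" -O ./mirror -%P -n -p7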
Build Tab
- No external pages = no
- this option would rewrite all external links (links that need an Internet connection) so that a warning page is shown first ("Warning, you need to be online to go to this link.."); useful if you want to separate the local and online realms, but we leave it off here
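For reference, if you did want external links cut off, the command-line switch is -x (replace external links by error pages):
httrack "https://example.com/" -O ./mirror -x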
Browser ID Tab
- I use this browser ID string for better results
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
- HTML Footer = none
- this removes the HTTrack message from the footer of each saved page
- Language = en, *
- Default referer = https://www.google.co.uk/
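On the command line, -F sets the browser ID, -%F sets the footer (an empty string should suppress the HTTrack message) and -%l sets the accepted languages; I am not aware of a switch for the default referer, so set that one in the GUI or options file:
httrack "https://example.com/" -O ./mirror -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0" -%F "" -%l "en, *"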
Spider Tab
- Spider: no robots.txt rules
- we want everything
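The command-line equivalent is -s0, which tells the spider never to follow robots.txt rules:
httrack "https://example.com/" -O ./mirror -s0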
Leave everything else at the defaults.
Notes
- Squarespace uses JavaScript to render the assets on the page, so HTTrack cannot download them.
Example Config
This is my working configuration for HTTrack; try it on a demo site before doing anything large.
To use it:
- create a file called HTTrack working options.opt
- paste in the config below
- go to (Preferences-->Load Options)
- select the file you have just created
- build your project and it will use the new options
You can save these as your default options by clicking (Preferences-->Save default options)
Near=1
Test=0
ParseAll=1
HTMLFirst=1
Cache=1
NoRecatch=0
Dos=0
Index=1
WordIndex=0
MailIndex=0
Log=1
RemoveTimeout=0
RemoveRateout=0
KeepAlive=1
FollowRobotsTxt=0
NoErrorPages=0
NoExternalPages=0
NoPwdInPages=0
NoQueryStrings=0
NoPurgeOldFiles=0
Cookies=1
CheckType=1
ParseJava=1
HTTP10=0
TolerantRequests=0
UpdateHack=1
URLHack=1
StoreAllInCache=0
LogType=0
UseHTTPProxyForFTP=1
Build=0
PrimaryScan=3
Travel=1
GlobalTravel=0
RewriteLinks=0
BuildString=%%h%%p/%%n%%q.%%t
Category=Ripped Web Sites
MaxHtml=
MaxOther=
MaxAll=
MaxWait=
Sockets=4
Retry=
MaxTime=
TimeOut=
RateOut=
UserID=Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
Footer=(none)
AcceptLanguage=en, *
OtherHeaders=
DefaultReferer=https://www.google.co.uk/
MaxRate=25000
WildCardFilters=+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar
Proxy=
Port=
Depth=
ExtDepth=
MaxConn=
MaxLinks=
MIMEDefsExt1=
MIMEDefsExt2=
MIMEDefsExt3=
MIMEDefsExt4=
MIMEDefsExt5=
MIMEDefsExt6=
MIMEDefsExt7=
MIMEDefsExt8=
MIMEDefsMime1=
MIMEDefsMime2=
MIMEDefsMime3=
MIMEDefsMime4=
MIMEDefsMime5=
MIMEDefsMime6=
MIMEDefsMime7=
MIMEDefsMime8=
CurrentUrl=
CurrentAction=0
CurrentURLList=
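For reference, a single command line that approximates the configuration above; the URL and output path are placeholders, and it is worth double-checking each switch against httrack --help before relying on it:
httrack "https://example.com/" -O ./mirror \
  -c4 -%k -s0 -%P -n -p7 -A25000 \
  -F "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0" \
  -%F "" -%l "en, *" \
  "+*.png" "+*.gif" "+*.jpg" "+*.jpeg" "+*.css" "+*.js" "-ad.doubleclick.net/*"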