$ wget --spider http://ligman.me/yZtVHF
Spider mode enabled. Check if remote file exists.
--2012-08-08 20:38:24--  http://ligman.me/yZtVHF
Resolving ligman.me... 69.58.188.49
Connecting to ligman.me|69.58.188.49|:80... connected.
HTTP request sent, awaiting response... 301 Moved
Location: http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf [following]
Spider mode enabled. Check if remote file exists.
--2012-08-08 20:38:25--  http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf
Resolving download.microsoft.com... 202.7.177.83, 203.26.28.162, 203.26.28.153, ...
Connecting to download.microsoft.com|202.7.177.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7564363 (7.2M) [application/octet-stream]
Remote file exists.
We can improve the command a little: the version below filters out just the URL we need:
$ wget --spider http://ligman.me/yZtVHF 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1
http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf
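As an aside, curl can do much the same job, if you happen to have it instead of wget. A minimal sketch: -s silences progress output, -I sends a HEAD request so nothing is downloaded, -L follows redirects, and -o /dev/null -w '%{url_effective}' discards the headers and prints only the final URL:

$ curl -sIL -o /dev/null -w '%{url_effective}\n' http://ligman.me/yZtVHF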
If you have an HTML file that contains many shortened URLs, you can use the following commands to replace them with the real URLs:
cat 1.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq > urls.txt
cat urls.txt | while read url1; do url2=$(wget --spider ${url1} 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1); sed -i.bak "s|${url1}|${url2}|g" 1.html; done

Note the double quotes around the sed expression: with single quotes the shell would not expand ${url1} and ${url2}, and sed would look for the literal string ${url1}. If you find a script more readable than a one-liner:
#!/bin/bash
# extract every href target, deduplicate, and save the list
cat 1.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq > urls.txt
cat urls.txt | while read url1
do
    # resolve the shortened URL to its final destination
    url2=$(wget --spider ${url1} 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1)
    sed -i.bak "s|${url1}|${url2}|g" 1.html
done
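One caveat: if wget cannot make sense of a link at all (for example, a relative href like about.html, or a mailto: link), url2 comes back empty and the sed would delete the URL from the file. A minimal sketch of a guard, assuming you only want to rewrite links that actually resolved to something different:

#!/bin/bash
cat 1.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq > urls.txt
cat urls.txt | while read url1
do
    url2=$(wget --spider ${url1} 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1)
    # skip the rewrite if no final URL came back, or if nothing changed
    if [ -n "${url2}" ] && [ "${url1}" != "${url2}" ]; then
        sed -i.bak "s|${url1}|${url2}|g" 1.html
    fi
done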