Shell: replace shortened URLs with the real URLs

Shortened URLs are sometimes used to hide the real download URL. With wget you can resolve the real URL using the --spider option, which follows the redirects and checks that the remote file exists without downloading it:
$ wget --spider http://ligman.me/yZtVHF
Spider mode enabled. Check if remote file exists.
--2012-08-08 20:38:24--  http://ligman.me/yZtVHF
Resolving ligman.me... 69.58.188.49
Connecting to ligman.me|69.58.188.49|:80... connected.
HTTP request sent, awaiting response... 301 Moved
Location: http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf [following]
Spider mode enabled. Check if remote file exists.
--2012-08-08 20:38:25--  http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf
Resolving download.microsoft.com... 202.7.177.83, 203.26.28.162, 203.26.28.153, ...
Connecting to download.microsoft.com|202.7.177.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7564363 (7.2M) [application/octet-stream]
Remote file exists.


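The command can be improved a little so that it prints only the URL we need. wget writes its log to stderr, so 2>&1 sends it into the pipe; grep -o extracts every URL from the log, and tail -n 1 keeps the last one, which is the final destination: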
$ wget --spider http://ligman.me/yZtVHF 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1
http://download.microsoft.com/download/F/F/2/FF2EECEE-397A-45B9-83A4-821243F8DFFD/668836ebook.pdf
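The pipeline above can be wrapped in small shell functions for reuse. This is only a sketch: the names last_url and resolve_url are made up for illustration, and the pattern is slightly stricter than the one above (it stops at whitespace, so a trailing " [following]" is never captured).

```shell
# last_url: print the last http(s) URL found in the text read from stdin.
# (Hypothetical helper name; the pattern stops at whitespace, '"' and '#'.)
last_url() {
    grep -o -E 'https?://[^[:space:]"#]+' | tail -n 1
}

# resolve_url: follow the redirects of "$1" without downloading anything
# and print the final URL. wget logs to stderr, hence the 2>&1.
resolve_url() {
    wget --spider "$1" 2>&1 | last_url
}
```

With these defined, running resolve_url http://ligman.me/yZtVHF should print the same final URL as the command above, provided the short link still redirects.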


If you have an HTML file that contains many shortened URLs, you can use the following commands to replace them with the real URLs. The sed expression must be in double quotes so that the shell expands ${url1} and ${url2}, and -i.bak (no space) is accepted by both GNU and BSD sed:
grep -o -E 'href="([^"#]+)"' 1.html | cut -d'"' -f2 | sort -u > urls.txt
while read -r url1; do url2=$(wget --spider "${url1}" 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1); [ -n "${url2}" ] && sed -i.bak "s|${url1}|${url2}|g" 1.html; done < urls.txt
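Or, as a more readable script:
#!/bin/bash
# Collect the unique href targets from the page.
grep -o -E 'href="([^"#]+)"' 1.html | cut -d'"' -f2 | sort -u > urls.txt
while read -r url1
do
    # Follow the redirects without downloading; the last URL in the log is the real one.
    url2=$(wget --spider "${url1}" 2>&1 | grep -o -E 'http[^"#]*' | tail -n 1)
    # Skip links that could not be resolved, so they are not replaced with nothing.
    [ -n "${url2}" ] || continue
    # Double quotes let the shell expand the variables. Note that sed treats
    # ${url1} as a regex and '&' in ${url2} as "the matched text".
    sed -i.bak "s|${url1}|${url2}|g" 1.html
done < urls.txt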
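URLs often contain characters such as '.', '?' and '&' that sed treats specially, so a plain substitution can match too much or mangle the replacement. The following is a hedged sketch of one way to escape both sides first; the helper names escape_pattern, escape_replacement and replace_url are hypothetical, and matching on the whole href="..." attribute is an added assumption so that a URL is not replaced when it is only a prefix of a longer one.

```shell
# escape_pattern: backslash-escape characters that are special in a sed
# basic regular expression, plus the '|' used as the s||| delimiter.
escape_pattern() {
    printf '%s\n' "$1" | sed 's:[][\\.*^$|]:\\&:g'
}

# escape_replacement: backslash-escape '\', '&' and the '|' delimiter,
# which are the characters special on the replacement side.
escape_replacement() {
    printf '%s\n' "$1" | sed 's:[\\&|]:\\&:g'
}

# replace_url FILE OLD NEW: replace href="OLD" with href="NEW" in FILE,
# keeping a backup in FILE.bak. -i.bak works with GNU and BSD sed.
replace_url() {
    sed -i.bak "s|href=\"$(escape_pattern "$2")\"|href=\"$(escape_replacement "$3")\"|g" "$1"
}
```

Inside the loop above, the sed line could then be swapped for replace_url 1.html "${url1}" "${url2}".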
