Posted: . At: 11:52 AM. This was 2 years ago. Post ID: 15930


Parsing Apache logs with awk is a lot of fun and easy too.


Parsing Apache logs with awk on Linux is a lot of fun. It is interesting to see all of the bots and other visitors you get; there are so many crawlers accessing your website and indexing your pages. The same logs can also reveal unwanted leeching of bandwidth, which can negatively affect your website but is easy to fix.

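This example prints the timestamp, the requested path and part of the user agent for every log line that contains the string "column". The match shown here happens to be a request from YandexBot. To look for all Yandex bots instead, use /YandexBot/ as the pattern.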

┌──(john㉿DESKTOP-PF01IEE)-[/mnt/c/Users/Intel i5/Documents]
└─$ awk '/column/ { print $4,"\n"$7,"\n"$13,"\n"$14,"\n"$15 }' sslaccesslog_securitronlinux.com_2_13_2022
[13/Feb/2022:15:52:02 
/arch-linux/how-to-add-a-column-heading-with-awk-on-linux/ 
(compatible; 
YandexBot/3.0; 
+http://yandex.com/bots)"
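
To see every crawler at once rather than one match, the user agents can be counted. This is a minimal sketch, assuming the log is in the Apache combined format, where splitting on double quotes puts the user agent in field 6. The function name count_agents is a hypothetical helper and not part of any tool.

```shell
# count_agents LOGFILE
# Hypothetical helper: count requests per user agent, busiest first.
# Assumes the Apache combined log format, so that splitting each line on
# double quotes leaves the user agent string in field 6.
count_agents() {
    awk -F'"' '{ agents[$6]++ }
        END { for (a in agents) print agents[a], a }' "$1" |
        sort -rn
}
```

Run it as count_agents access.log and add | head to see only the busiest agents.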

This next example is very useful. It looks for people hotlinking images from your website, that is, embedding them on their own website. The command splits each line on double quotes, so $2 is the request and $4 is the referrer, and it prints the referrer of every image request that did not come from your own site. Hotlinking wastes your bandwidth and is very annoying, but a simple .htaccess edit can fix it. It is good to know how to spot this happening on your server, because then you can fix it.

┌──(john㉿DESKTOP-PF01IEE)-[/mnt/c/Users/Intel i5/Documents]
└─$ awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^https:\/\/securitronlinux\.com/){print $4}' access.log | sort | uniq -c | sort -n
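
The .htaccess fix mentioned above could look something like the following. This is only a sketch and not the exact rules used on this site: it assumes mod_rewrite is enabled, it refuses hotlinked images with a 403 instead of swapping in another image, and print_hotlink_rules is a hypothetical helper name.

```shell
# print_hotlink_rules
# Hypothetical helper: print a hotlink-protection snippet for .htaccess.
# Assumes mod_rewrite is enabled. The quoted 'EOF' stops the shell from
# expanding $ and backslashes inside the rules.
print_hotlink_rules() {
    cat <<'EOF'
RewriteEngine On
# Allow requests that send no Referer header at all
RewriteCond %{HTTP_REFERER} !^$
# Allow requests referred by the site itself
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?securitronlinux\.com/ [NC]
# Refuse every other request for an image
RewriteRule \.(jpe?g|gif|png)$ - [F,NC]
EOF
}
```

Review the output first, then append it with print_hotlink_rules >> .htaccess. If the site sits behind a cache such as Cloudflare, the cache may need to be purged before the change is visible.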

You may also print the bandwidth in total for the period of the access log.

┌──(john㉿DESKTOP-PF01IEE)-[/mnt/c/Users/Intel i5/Documents]
└─$ awk '{ sum += $10 } END { print sum /1024/1024 }' sslaccesslog_securitronlinux.com_2_13_2022
265.992

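This shows the amount of bandwidth in megabytes: field $10 is the response size in bytes, so dividing by 1024 twice converts the total. Awk is very easy to use, and you do not need to use cat and then pipe into it, because awk reads the file itself. That is not how the Linux shell needs to be used. Unneeded use of cat is a good way to lose geek points.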
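
The same idea can be broken down per visitor. This is a minimal sketch, again assuming the combined log format with the client address in field 1 and the response size in field 10. The name bandwidth_by_ip is a hypothetical helper.

```shell
# bandwidth_by_ip LOGFILE
# Hypothetical helper: total bandwidth per client address, largest first.
# Assumes field 1 is the client address and field 10 is the response size
# in bytes. Lines where the size is "-" (no body sent) are skipped.
bandwidth_by_ip() {
    awk '$10 ~ /^[0-9]+$/ { bytes[$1] += $10 }
        END { for (ip in bytes) printf "%.2f %s\n", bytes[ip] / 1024 / 1024, ip }' "$1" |
        sort -rn | head -n 10
}
```

Running bandwidth_by_ip access.log prints up to ten lines, each with a total in megabytes followed by the address.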

The best way to parse and analyze the Apache logs is with the goaccess tool. Install it with apt.

┌──(john㉿DESKTOP-PF01IEE)-[/mnt/c/Users/Intel i5/Documents]
└─$ sudo apt install goaccess

Then run it like this on a log file from the Apache webserver.

┌──(john㉿DESKTOP-PF01IEE)-[/mnt/c/Users/Intel i5/Documents]
└─$ goaccess access.log --log-format=COMBINED -a -o report.html
 [PARSING access.log] {9,235} @ {0/s}

This writes the report to report.html, which you can open in a web browser. The reports are incredible, so give this a go and see how good they look; this is a great way to get some cool insights into the activity on your website. Highly recommended.

As I said, leeching images from other websites is annoying, but it can be nipped in the bud promptly. Once I made the changes in the .htaccess file and wiped the Cloudflare cache, it was fixed. Now the hotlinked images are replaced by another image. So take that, guys. That is what you get for stealing bandwidth. Why not just upload images to your own webspace instead?

But this is the reality of the modern Internet: so many fake websites with poor quality content polluting it. At least it is still possible to have a personal website and gain a top ranking in Google, with hard work and perseverance that pays off after a few years of adding content. Have a massive amount of text content and good SEO work on your website, and the views will come. But it takes time. You just need to be patient, keep writing content and keep tweaking older content to newer standards.

