Posted: . At: 11:09 AM. This was 5 years ago. Post ID: 13052


Nice web scraper to get a listing of all 4chan threads on a certain board.


This web scraper reads a 4chan board page and prints a listing of all threads on that page. It is a short starting point that could be expanded into a more complete script.

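#!/usr/bin/env bash
# Usage: ./scraper.sh <board URL>

set -e

# Fetch the page quietly to stdout, keep only the thread links,
# then prefix each one. Note that the /g/ board is hardcoded in the prefix.
links=( $( wget "$@" -q -O - | grep -oE '</span><a href="thread/[0-9]+"' |
grep -oE 'thread/[0-9]+"' | sed 's/^/"boards.4chan.org\/g\//' ) )

# Print one link per line.
for i in "${links[@]}"
do
    echo "$i"
done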

This is the output of this script when run on Linux.

jason@hoshi:~/Docum$ ./scraper.sh https://boards.4channel.org/g
"boards.4chan.org/g/thread/51971506"
"boards.4chan.org/g/thread/70373121"
"boards.4chan.org/g/thread/70375043"
"boards.4chan.org/g/thread/70373042"
"boards.4chan.org/g/thread/70369732"
"boards.4chan.org/g/thread/70370890"
"boards.4chan.org/g/thread/70368516"
"boards.4chan.org/g/thread/70376650"
"boards.4chan.org/g/thread/70378082"
"boards.4chan.org/g/thread/70377467"
"boards.4chan.org/g/thread/70376099"
"boards.4chan.org/g/thread/70356384"
"boards.4chan.org/g/thread/70347662"
"boards.4chan.org/g/thread/70373097"
"boards.4chan.org/g/thread/70376867"

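The script above always prints boards.4chan.org/g/ in front of each thread, because the board is hardcoded in the sed expression, whatever URL is passed in. Here is a rough sketch of one way to take the board name from the URL instead. The helper name list_threads is made up for this example, and it assumes the URL ends in the board name and that the page markup still contains the `</span><a href="thread/NNN"` pattern the original grep looks for, which may no longer be true.

```shell
#!/usr/bin/env bash
# Sketch: print thread URLs found in board-page HTML read from stdin.
# list_threads is a hypothetical helper; $1 is the board URL.
list_threads() {
    local url="${1%/}"          # drop a trailing slash, if any
    local board="${url##*/}"    # last path component, e.g. "g"
    grep -oE '</span><a href="thread/[0-9]+"' |
        grep -oE 'thread/[0-9]+' |
        sed "s|^|boards.4chan.org/${board}/|"
}

# Usage (needs network access):
#   url=https://boards.4channel.org/g
#   wget -q -O - "$url" | list_threads "$url"
```

Reading the HTML from stdin keeps the download separate from the parsing, so the function can be tried on a saved copy of the page as well.
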
This is a good example of a small, useful web scraper. The same approach can be used to get a listing of news stories from a website.

To get your daily news fix fast, run the one-liner below at the prompt or put it in a script. The output shown here is cropped for brevity; the full command returns quite a long listing of news stories.

jason@hoshi:~/Docum$ curl -s http://feeds.bbci.co.uk/news/rss.xml | grep "<title>" | sed "s/            <title><\!\[CDATA\[//g;s/\]\]><\/title>//;" | grep -v "BBC News"
Brexit: PM cannot 'ignore' soft Brexit MPs, says minister
Edmonton stabbings: Four people hurt in 'random attacks'
Ukraine election: Comedian leads presidential contest - exit poll
Eurostar protest: Man charged with obstructing railway
IS defeat: British fighters emerge after fall of Baghuz
Alex Jones hosted The One Show after miscarriage
Brexit fine: Ex-Vote Leave chairwoman does not apologise over spend
Nazanin Zaghari-Ratcliffe: Mother's Day card delivered to embassy
Boys charged over Birmingham Grindr date robberies
Knife crime: More stop and search powers for police
Model with alopecia wants people to embrace differences
Labour plans national bank using Post Office network
Saudi Arabia 'hacked Amazon boss's phone', says investigator
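
The sed expression in the one-liner above matches a fixed run of spaces in front of `<title>`, so it depends on how the feed happens to be indented. Here is a sketch of a variant that tolerates any leading whitespace and handles titles with or without the CDATA wrapper. The name rss_titles is made up for this example, and it assumes each title sits on its own line, as the output above suggests. It is still line-based text matching, not a real XML parser.

```shell
# rss_titles: hypothetical helper; reads RSS XML on stdin, prints one title per line.
rss_titles() {
    grep '<title>' |
        sed -e 's/^[[:space:]]*<title>//' \
            -e 's/<\/title>[[:space:]]*$//' \
            -e 's/^<!\[CDATA\[//' \
            -e 's/\]\]>$//' |
        grep -v 'BBC News'    # drop the feed's own title lines, as the original does
}

# Usage (needs network access):
#   curl -s http://feeds.bbci.co.uk/news/rss.xml | rss_titles
```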

This shows how easy it is to get information off the web with the command line.

Run the one-liner like this to get only the top 10 stories.

jason@hoshi:~/Docum$ curl -s http://feeds.bbci.co.uk/news/rss.xml | grep "<title>" | sed "s/            <title><\!\[CDATA\[//g;s/\]\]><\/title>//;" | grep -v "BBC News" | head -n 10
Brexit: PM cannot 'ignore' soft Brexit MPs, says minister
Edmonton stabbings: Four people hurt in 'random attacks'
School LGBT teaching row: What is in the No Outsiders books?
Ukraine election: Comedian leads presidential contest - exit poll
Eurostar protest: Man charged with obstructing railway
IS defeat: British fighters emerge after fall of Baghuz
Alex Jones hosted The One Show after miscarriage
Brexit fine: Ex-Vote Leave chairwoman does not apologise over spend
Nazanin Zaghari-Ratcliffe: Mother's Day card delivered to embassy
Boys charged over Birmingham Grindr date robberies

This would be useful to have in your .bashrc, so that the latest news is shown whenever a terminal is opened.
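
Since .bashrc runs for every new terminal, a network fetch there can slow down or stall the prompt when the connection is bad. Here is a sketch of wrapping the one-liner in a function with a curl time limit. The name latest_news, the 5 second limit and the default of 10 stories are all choices made for this example, not anything from the original one-liner.

```shell
# latest_news: hypothetical helper for ~/.bashrc; prints the top N BBC headlines.
latest_news() {
    local count="${1:-10}"    # number of stories, default 10
    # -s: silent; --max-time 5: give up after 5 seconds instead of hanging.
    curl -s --max-time 5 http://feeds.bbci.co.uk/news/rss.xml |
        grep '<title>' |
        sed -e 's/^[[:space:]]*<title><!\[CDATA\[//' -e 's/\]\]><\/title>.*$//' |
        grep -v 'BBC News' |
        head -n "$count"
}

# To show the news in every new terminal, add a call at the end of ~/.bashrc:
#   latest_news 10 2>/dev/null
```

If the fetch fails or times out, the pipeline prints nothing and the shell starts as usual.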

