

Another web scraping example.


This script is a nice web scraping example: it fetches the current month's article archive from the MSNBC website and prints the headlines as plain text.

#!/bin/sh
 
#----------------------------------------------------------------------
# Description: Web scraping news example.
# Author: John Cartwright <>
# Created at: Fri Oct 20 09:40:53 AEDT 2023
# Computer: localhost.localdomain
# System: Linux 5.14.0-284.30.1.el9_2.x86_64 on x86_64
#
# Copyright (c) 2023 John Cartwright  All rights reserved.
#
#----------------------------------------------------------------------
# Configure section:
 
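year=$( date +%Y )
# Force the C locale so the month name is in English, as the archive URL expects.
month=$( LC_ALL=C date +%B | tr '[:upper:]' '[:lower:]' )
 
# End Configure section:
 
# Fetch the archive page for this month, keep only the article list,
# put each link on its own line, then render the HTML as plain text.
curl --silent -L "https://www.msnbc.com/archive/articles/$year/$month" \
| htmlq 'main.MonthPage' | awk '{gsub(/<\/a>/,"</a><br />"); print}' \
| w3m -dump -T text/html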
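
The pipeline calls three external commands, so it can be handy to check for them before fetching anything. Here is a minimal sketch of such a check. The function name missing_tools is made up for this illustration and is not part of the original script.

```shell
# Hypothetical helper: print the names of the given commands that are
# not found on the PATH, separated by spaces. Prints an empty line if
# all of them are available.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    printf '%s\n' "${missing# }"
}

# Warn on stderr if any tool used by the news script is missing.
needed=$(missing_tools curl htmlq w3m)
if [ -n "$needed" ]; then
    echo "Please install: $needed" >&2
fi
```

Placed before the curl line, this only prints a warning. Adding an exit 1 inside the if would stop the script early instead.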

This needs htmlq (along with curl and w3m) to run, but it works well. Below is the output of the script. It shows how a short pipeline can scrape the headlines from a website.

(jcartwright@localhost) 192.168.1.5 Documents  $ ./news.sh 
articles
 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Biden should stop focusing on the economy
Why is the right so angry at Taylor Swift cheering on Travis Kelce?
The anti-abortion pill judge is back — with an alarming new target
OpenAI is encouraging people to use ChatGPT for therapy. That’s dangerous.
Trump set for civil fraud trial in $250 million New York case
With a government shutdown narrowly averted, what happens now?
Why the special counsel cares about Trump’s rant against Milley
Join Mika Brzezinski's #KnowYourValueChallenge
GOP’s Matt Gaetz tees up fight over Speaker McCarthy's gavel
Republicans eye ‘reset’ after failed impeachment inquiry hearing
The curious case of Jamaal Bowman and a congressional fire alarm
Monday’s Campaign Round-Up, 10.2.23
GOP avoids a shutdown, but serious governing problems persist
Jack Smith notes Trump gun incident in support of gag order
Kevin McCarthy has a roadmap to survive as speaker
Matt Gaetz keeps talking, but delays motion to ousting McCarthy
Clarence Thomas recused from Eastman case — so he knows how
Monday’s Mini-Report, 10.2.23
Biden provides an era-defining interview to ProPublica
Donald Trump is running for president while pushing a fascist platform
Gaetz gets the MAGA melee he's always wanted after shutdown charade
3 things that stood out on Day 1 of Trump's New York fraud trial
Trump’s legal woes are bad. His New York civil trial is pure humiliation.
Why McCarthy has reason to worry about Gaetz’s move against him
John Kelly confirms: Trump privately disparaged troops, veterans
Tuberville’s ‘more military’ boast receives necessary pushback
Why it's not 'very unfair' that Trump's fraud case lacks a jury
We shouldn’t let courts decide Trump's 14th Amendment eligibility
Trump adds yet another judge to his rhetorical target list
The curious reason a House Republican is threatening to resign
Tuesday’s Campaign Round-Up, 10.3.23
Key Supreme Court cases, issues to watch in new perilous term
Jane Goodall: How I overcame being called 'just a girl'
The U.S. Army needs to pay the Black St. Louis residents it secretly
experimented on
Trump looks especially angsty at his N.Y. fraud trial. This could be why.
Newsom’s pick adds complexity to California’s Senate race
With Democrats, Kevin McCarthy spent the year sealing his fate
In historic first, House votes to oust Kevin McCarthy as speaker
Gaetz toppled McCarthy. What happens next is anyone’s guess.
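
The script always fetches the current month. Judging by the dates in the headlines, the run above was for October 2023. To fetch some other month, the URL could be built from arguments instead. This is a small sketch, assuming the archive keeps the same /archive/articles/YEAR/MONTH layout for other months, which I have not checked. The name archive_url is made up for this example.

```shell
# Hypothetical helper: build the archive URL for a given year and
# English month name. The month is lowercased, as in the script above.
archive_url() {
    url_year=$1
    url_month=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    printf 'https://www.msnbc.com/archive/articles/%s/%s\n' "$url_year" "$url_month"
}

# Print the URL for October 2023 without fetching anything.
archive_url 2023 October
```

This prints https://www.msnbc.com/archive/articles/2023/october, which can be passed to curl in place of the hard-coded URL, for example: curl --silent -L "$(archive_url 2023 October)"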

Here is a related tip: how to pull one field out of a JSON file from the command line on Linux, using curl and jq.

(jcartwright@localhost) 192.168.1.5 Documents  $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .name' | uniq
Xtreme Adventure
America's Next Top Model
Becker
Bellator
Beverly Hills 90210
Bold and The Beautiful Classics
Bondi Rescue
Baywatch
Diagnosis Murder
Dynasty
48 Hours
Gecko TV
Gunsmoke
Happy Days
Haunt TV
Hawaii Five-O
I Love Lucy
Judge Judy
MasterChef
Matlock
Medical Emergency
Merlin
Mission Impossible
Moviesphere
MTV Biggest Pop
MTV Dating
MTV Drama
MTV Entertainment
MTV Love
MTV Reality
MTV Retro
MTV The Shores
MTV Best Of The EMAs
NatureTime
Nick Classics
Nick Jr
Nick Movies
NickTeen
NickToons
10
Prisoner
10
Nick Rewind
Rush
10
Seasonal Movies
South Park
Survivor
Survivor US
The Brady Bunch
The-Drew-Barrymore-Show
The Graham Norton Show
The Twilight Zone
True Stories
10
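
In the listing above, 10 shows up several times because uniq only drops repeated lines when they are next to each other. One way around that is a common awk idiom that keeps the first occurrence of each line and preserves the original order. This is only a sketch, and unique_lines is just a name chosen for this example.

```shell
# Hypothetical helper: print each distinct input line once, in the
# order it first appears. seen[$0]++ is 0 (false) the first time a
# line is met, so !seen[$0]++ is true only for first occurrences.
unique_lines() {
    awk '!seen[$0]++'
}

# A few lines taken from the listing above; prints 10, Prisoner, Rush.
printf '%s\n' 10 Prisoner 10 Rush 10 | unique_lines
```

To use it with the real data, put it in place of uniq: curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .name' | unique_lines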

And here is how to list a different field, in this case the logo URL, for every entry in the JSON file.

(jcartwright@localhost) 192.168.1.5 ~  $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .logo' | uniq
https://10play.com.au/ip/s3/2023/07/31/692875006a27f9580330b073e080e129-1252238.png
https://10play.com.au/ip/s3/2023/06/28/8db83f030cafec640ddec14612d03daa-1246018.png
https://10play.com.au/ip/s3/2023/06/28/e0449ed804aaacb1de64f87d94912dd3-1246030.png
https://10play.com.au/ip/s3/2023/07/13/5759d0aeb5b418ff82d0a44426fb8919-1249180.png
https://10play.com.au/ip/s3/2023/06/28/d39a21f0c9a2c64970150768b266ecea-1246001.png
https://10play.com.au/ip/s3/2023/07/12/24ba98d5a9e362a7eab844cf00b5abec-1248618.png
https://10play.com.au/ip/s3/2023/07/12/7ddc52ba7477554f24bd1ad819b1d593-1248612.png
https://10play.com.au/ip/s3/2022/09/15/28b62659b93d3b56e4e1c8c2485a202b-1180224.png
https://10play.com.au/ip/s3/2023/06/28/5f85fc3d794a167630b935cb639abcc3-1246005.png
https://10play.com.au/ip/s3/2023/06/28/36fb189fc45e71ab01e7e6d22997a24c-1246037.png
https://10play.com.au/ip/s3/2023/06/28/7dd910de29f31095b69e17f1da8f3f0e-1245997.png
https://10play.com.au/ip/s3/2023/02/23/b81bfc3f4737696257ef326b0e68208e-1219777.png
https://10play.com.au/ip/s3/2023/06/30/ea73899442aeda0541d39da7187fd0ae-1246615.png
https://10play.com.au/ip/s3/2023/06/28/060cdfa84b76c04285a91099d1dc5cfb-1246013.png
https://10play.com.au/ip/s3/2023/07/12/8cb62aeb9de0e12611b68059325da0f3-1248632.png
https://10play.com.au/ip/s3/2023/06/28/78cc7f7508fbc3e0d23f28e209d006cd-1246034.png
https://10play.com.au/ip/s3/2023/06/30/ee75d540e6cdda6df5044708edda6500-1246588.png
https://10play.com.au/ip/s3/2022/09/02/461008ac0d02137f713ec0e4883be6c4-1177265.png
https://10play.com.au/ip/s3/2022/09/02/3fff1b47f7c01770b7f0f319770d5362-1177252.png
https://10play.com.au/ip/s3/2023/06/28/05e69d623cea2b74ae1637c40e02cdd7-1246022.png
https://10play.com.au/ip/s3/2023/06/28/115b6796134f1e983f41c224c31cc7c9-1246084.png
https://10play.com.au/ip/s3/2022/12/05/34e0e28792f221c0b9a2629966e5bd28-1202917.png
https://10play.com.au/ip/s3/2023/07/11/dc85f360d18a325975756c734eb17b69-1248406.png
https://10play.com.au/ip/s3/2022/12/05/6bb229de8fcf96e206e99ca455fc2267-1202912.png
https://10play.com.au/ip/s3/2023/09/25/57fee983dfaa2afa821ae798c49a002b-1266109.png
https://10play.com.au/ip/s3/2023/09/25/87a0cf6001b8049ac6dd935caec9e7a4-1266117.png
https://10play.com.au/ip/s3/2023/09/25/786a6fc1b188d22e46a15061dcf15d2b-1266105.png
https://10play.com.au/ip/s3/2023/09/25/3c671025176136a04748f2b810e72bf3-1266111.png
https://10play.com.au/ip/s3/2023/09/25/68943f881f079398dcdf4681db2fe648-1266107.png
https://10play.com.au/ip/s3/2023/09/25/f51b3f2548dfdb88949cdbd13f037679-1266119.png
https://10play.com.au/ip/s3/2023/09/25/b732245ef2430d278e0b2ef24a704cbe-1266114.png
https://10play.com.au/ip/s3/2023/07/11/3901f5c4522a0c36cfed8630ab781a5e-1248401.png
https://10play.com.au/ip/s3/2023/09/25/b8cf0f3edf522efa6ae95b26dc9098b1-1266101.png
https://10play.com.au/ip/s3/2023/07/12/7b0ce57edb70336a469f938d59645bbe-1248639.png
https://10play.com.au/ip/s3/2023/08/03/b9d7b0881bd3510944d31ddf27fb99ea-1253556.png
https://10play.com.au/ip/s3/2023/07/14/e287e61bf65923f77e8e74f488f53498-1249270.png
https://10play.com.au/ip/s3/2023/08/03/03b870793599a6760249eb6a5392352a-1253558.png
https://10play.com.au/ip/s3/2023/08/03/80794063b69ee9991510e95c601c83e7-1253578.png
https://10play.com.au/ip/s3/2023/08/03/e30c04dad94ed20e16f069dd0914ecb0-1253580.png

Separate the field names with a comma to get two fields from each entry at once.

(jcartwright@localhost) 192.168.1.5 ~  $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .logo,.headers' | uniq
https://10play.com.au/ip/s3/2023/07/31/692875006a27f9580330b073e080e129-1252238.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/06/28/8db83f030cafec640ddec14612d03daa-1246018.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/06/28/e0449ed804aaacb1de64f87d94912dd3-1246030.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/07/13/5759d0aeb5b418ff82d0a44426fb8919-1249180.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
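
With .logo,.headers each value is printed on its own line, so it can be hard to see which lines belong together. When the fields are simple strings, the @tsv filter in jq can put them on a single tab-separated line per entry. This is a sketch, assuming every entry has name and logo fields as the earlier output suggests. The sample JSON below is made up for the example and is not taken from the real file.

```shell
# Hypothetical helper: print "name<TAB>logo" for each entry read from stdin.
name_and_logo() {
    jq -r '.[] | [.name, .logo] | @tsv'
}

# Made-up sample data with name and logo fields like the entries above.
sample='{"one":{"name":"Example One","logo":"https://example.com/one.png"},
         "two":{"name":"Example Two","logo":"https://example.com/two.png"}}'

printf '%s\n' "$sample" | name_and_logo
```

For the real file the same function would be used as: curl -s https://i.mjh.nz/au/all/tv.json | name_and_logo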
