This script is a small web scraping example that fetches the current month's article headlines from the MSNBC website archive.
#!/bin/sh
#----------------------------------------------------------------------
# Description: Web scraping news example.
# Author: John Cartwright <>
# Created at: Fri Oct 20 09:40:53 AEDT 2023
# Computer: localhost.localdomain
# System: Linux 5.14.0-284.30.1.el9_2.x86_64 on x86_64
#
# Copyright (c) 2023 John Cartwright All rights reserved.
#
#----------------------------------------------------------------------
# Configure section:

year=$( date +%Y )
month=$( date +%B | tr '[:upper:]' '[:lower:]')

# End Configure section:

curl --silent -L https://www.msnbc.com/archive/articles/$year/$month \
    | htmlq 'main.MonthPage' | awk '{gsub(/<\/a>/,"</a><br />"); print}' \
    | w3m -dump -T text/html
The script needs htmlq to run (along with curl, awk, and w3m), and it works well. Below is its output. It is a useful example of how to use web scraping to get the news from a website.
(jcartwright@localhost) 192.168.1.5 Documents $ ./news.sh
articles
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Biden should stop focusing on the economy
Why is the right so angry at Taylor Swift cheering on Travis Kelce?
The anti-abortion pill judge is back — with an alarming new target
OpenAI is encouraging people to use ChatGPT for therapy. That’s dangerous.
Trump set for civil fraud trial in $250 million New York case
With a government shutdown narrowly averted, what happens now?
Why the special counsel cares about Trump’s rant against Milley
Join Mika Brzezinski's #KnowYourValueChallenge
GOP’s Matt Gaetz tees up fight over Speaker McCarthy's gavel
Republicans eye ‘reset’ after failed impeachment inquiry hearing
The curious case of Jamaal Bowman and a congressional fire alarm
Monday’s Campaign Round-Up, 10.2.23
GOP avoids a shutdown, but serious governing problems persist
Jack Smith notes Trump gun incident in support of gag order
Kevin McCarthy has a roadmap to survive as speaker
Matt Gaetz keeps talking, but delays motion to ousting McCarthy
Clarence Thomas recused from Eastman case — so he knows how
Monday’s Mini-Report, 10.2.23
Biden provides an era-defining interview to ProPublica
Donald Trump is running for president while pushing a fascist platform
Gaetz gets the MAGA melee he's always wanted after shutdown charade
3 things that stood out on Day 1 of Trump's New York fraud trial
Trump’s legal woes are bad. His New York civil trial is pure humiliation.
Why McCarthy has reason to worry about Gaetz’s move against him
John Kelly confirms: Trump privately disparaged troops, veterans
Tuberville’s ‘more military’ boast receives necessary pushback
Why it's not 'very unfair' that Trump's fraud case lacks a jury
We shouldn’t let courts decide Trump's 14th Amendment eligibility
Trump adds yet another judge to his rhetorical target list
The curious reason a House Republican is threatening to resign
Tuesday’s Campaign Round-Up, 10.3.23
Key Supreme Court cases, issues to watch in new perilous term
Jane Goodall: How I overcame being called 'just a girl'
The U.S. Army needs to pay the Black St. Louis residents it secretly experimented on
Trump looks especially angsty at his N.Y. fraud trial. This could be why.
Newsom’s pick adds complexity to California’s Senate race
With Democrats, Kevin McCarthy spent the year sealing his fate
In historic first, House votes to oust Kevin McCarthy as speaker
Gaetz toppled McCarthy. What happens next is anyone’s guess.
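Because the script relies on several external tools and on the month name coming out in English, a slightly more defensive variant might look like the sketch below. The helper names `need`, `archive_url`, and `fetch_news` are made up for this illustration, and forcing `LC_ALL=C` rests on the assumption that the archive URL always wants an English month name, as in the original script.

```shell
#!/bin/sh
# Sketch only: need, archive_url and fetch_news are hypothetical helper names.

# Report a missing tool on stderr and return non-zero.
need() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing required tool: $1" >&2
        return 1
    }
}

# Build the archive URL from a year and a month name (lower-cased).
archive_url() {
    printf 'https://www.msnbc.com/archive/articles/%s/%s\n' \
        "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

# Fetch and print the headlines for the current month.
fetch_news() {
    for tool in curl htmlq awk w3m; do
        need "$tool" || return 1
    done
    # Assumption: LC_ALL=C keeps the month name in English in any locale.
    url=$(archive_url "$(date +%Y)" "$(LC_ALL=C date +%B)")
    curl --silent -L "$url" \
        | htmlq 'main.MonthPage' \
        | awk '{gsub(/<\/a>/,"</a><br />"); print}' \
        | w3m -dump -T text/html
}
```

Calling `fetch_news` at the end of the file should behave like the original script, except that it stops early with a message when one of the tools is not installed.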
Here is a related tip: how to get information from a JSON file using the command line on Linux.
(jcartwright@localhost) 192.168.1.5 Documents $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .name' | uniq
Xtreme Adventure America's Next Top Model Becker Bellator Beverly Hills 90210 Bold and The Beautiful Classics Bondi Rescue Baywatch Diagnosis Murder Dynasty 48 Hours Gecko TV Gunsmoke Happy Days Haunt TV Hawaii Five-O I Love Lucy Judge Judy MasterChef Matlock Medical Emergency Merlin Mission Impossible Moviesphere MTV Biggest Pop MTV Dating MTV Drama MTV Entertainment MTV Love MTV Reality MTV Retro MTV The Shores MTV Best Of The EMAs NatureTime Nick Classics Nick Jr Nick Movies NickTeen NickToons 10 Prisoner 10 Nick Rewind Rush 10 Seasonal Movies South Park Survivor Survivor US The Brady Bunch The-Drew-Barrymore-Show The Graham Norton Show The Twilight Zone True Stories 10
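One thing worth knowing about the command above: `uniq` only collapses duplicate lines that are next to each other, so repeated names that are not adjacent would survive. A minimal sketch of a variant that removes all duplicates is shown here; `list_field` is a hypothetical helper name, and note that `sort -u` also reorders the output alphabetically.

```shell
#!/bin/sh
# Sketch only: list_field is a hypothetical helper name.

# Print the distinct values of one field from JSON on standard input.
# $1 is the field name, for example "name" or "logo".
list_field() {
    jq -r --arg f "$1" '.[] | .[$f]' | sort -u
}

# Example (requires network access):
#   curl -s https://i.mjh.nz/au/all/tv.json | list_field name
```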
And here is how to list the value of a single field, in this case logo, for every entry in the JSON file.
(jcartwright@localhost) 192.168.1.5 ~ $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .logo' | uniq
https://10play.com.au/ip/s3/2023/07/31/692875006a27f9580330b073e080e129-1252238.png
https://10play.com.au/ip/s3/2023/06/28/8db83f030cafec640ddec14612d03daa-1246018.png
https://10play.com.au/ip/s3/2023/06/28/e0449ed804aaacb1de64f87d94912dd3-1246030.png
https://10play.com.au/ip/s3/2023/07/13/5759d0aeb5b418ff82d0a44426fb8919-1249180.png
https://10play.com.au/ip/s3/2023/06/28/d39a21f0c9a2c64970150768b266ecea-1246001.png
https://10play.com.au/ip/s3/2023/07/12/24ba98d5a9e362a7eab844cf00b5abec-1248618.png
https://10play.com.au/ip/s3/2023/07/12/7ddc52ba7477554f24bd1ad819b1d593-1248612.png
https://10play.com.au/ip/s3/2022/09/15/28b62659b93d3b56e4e1c8c2485a202b-1180224.png
https://10play.com.au/ip/s3/2023/06/28/5f85fc3d794a167630b935cb639abcc3-1246005.png
https://10play.com.au/ip/s3/2023/06/28/36fb189fc45e71ab01e7e6d22997a24c-1246037.png
https://10play.com.au/ip/s3/2023/06/28/7dd910de29f31095b69e17f1da8f3f0e-1245997.png
https://10play.com.au/ip/s3/2023/02/23/b81bfc3f4737696257ef326b0e68208e-1219777.png
https://10play.com.au/ip/s3/2023/06/30/ea73899442aeda0541d39da7187fd0ae-1246615.png
https://10play.com.au/ip/s3/2023/06/28/060cdfa84b76c04285a91099d1dc5cfb-1246013.png
https://10play.com.au/ip/s3/2023/07/12/8cb62aeb9de0e12611b68059325da0f3-1248632.png
https://10play.com.au/ip/s3/2023/06/28/78cc7f7508fbc3e0d23f28e209d006cd-1246034.png
https://10play.com.au/ip/s3/2023/06/30/ee75d540e6cdda6df5044708edda6500-1246588.png
https://10play.com.au/ip/s3/2022/09/02/461008ac0d02137f713ec0e4883be6c4-1177265.png
https://10play.com.au/ip/s3/2022/09/02/3fff1b47f7c01770b7f0f319770d5362-1177252.png
https://10play.com.au/ip/s3/2023/06/28/05e69d623cea2b74ae1637c40e02cdd7-1246022.png
https://10play.com.au/ip/s3/2023/06/28/115b6796134f1e983f41c224c31cc7c9-1246084.png
https://10play.com.au/ip/s3/2022/12/05/34e0e28792f221c0b9a2629966e5bd28-1202917.png
https://10play.com.au/ip/s3/2023/07/11/dc85f360d18a325975756c734eb17b69-1248406.png
https://10play.com.au/ip/s3/2022/12/05/6bb229de8fcf96e206e99ca455fc2267-1202912.png
https://10play.com.au/ip/s3/2023/09/25/57fee983dfaa2afa821ae798c49a002b-1266109.png
https://10play.com.au/ip/s3/2023/09/25/87a0cf6001b8049ac6dd935caec9e7a4-1266117.png
https://10play.com.au/ip/s3/2023/09/25/786a6fc1b188d22e46a15061dcf15d2b-1266105.png
https://10play.com.au/ip/s3/2023/09/25/3c671025176136a04748f2b810e72bf3-1266111.png
https://10play.com.au/ip/s3/2023/09/25/68943f881f079398dcdf4681db2fe648-1266107.png
https://10play.com.au/ip/s3/2023/09/25/f51b3f2548dfdb88949cdbd13f037679-1266119.png
https://10play.com.au/ip/s3/2023/09/25/b732245ef2430d278e0b2ef24a704cbe-1266114.png
https://10play.com.au/ip/s3/2023/07/11/3901f5c4522a0c36cfed8630ab781a5e-1248401.png
https://10play.com.au/ip/s3/2023/09/25/b8cf0f3edf522efa6ae95b26dc9098b1-1266101.png
https://10play.com.au/ip/s3/2023/07/12/7b0ce57edb70336a469f938d59645bbe-1248639.png
https://10play.com.au/ip/s3/2023/08/03/b9d7b0881bd3510944d31ddf27fb99ea-1253556.png
https://10play.com.au/ip/s3/2023/07/14/e287e61bf65923f77e8e74f488f53498-1249270.png
https://10play.com.au/ip/s3/2023/08/03/03b870793599a6760249eb6a5392352a-1253558.png
https://10play.com.au/ip/s3/2023/08/03/80794063b69ee9991510e95c601c83e7-1253578.png
https://10play.com.au/ip/s3/2023/08/03/e30c04dad94ed20e16f069dd0914ecb0-1253580.png
Use it like this to get two different fields at once.
(jcartwright@localhost) 192.168.1.5 ~ $ curl -s https://i.mjh.nz/au/all/tv.json | jq -r '.[] | .logo,.headers' | uniq
https://10play.com.au/ip/s3/2023/07/31/692875006a27f9580330b073e080e129-1252238.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/06/28/8db83f030cafec640ddec14612d03daa-1246018.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/06/28/e0449ed804aaacb1de64f87d94912dd3-1246030.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
https://10play.com.au/ip/s3/2023/07/13/5759d0aeb5b418ff82d0a44426fb8919-1249180.png
{
  "referer": " ",
  "seekable": "0",
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
}
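When the two fields are plain strings, it can be easier to read them side by side on one line per entry. The sketch below uses jq's `@tsv` filter for that; `name_and_logo` is a hypothetical helper name, and it assumes every entry has `name` and `logo` fields holding strings, as the earlier outputs suggest.

```shell
#!/bin/sh
# Sketch only: name_and_logo is a hypothetical helper name.

# Print one tab-separated line per entry: channel name, then logo URL.
# Assumes .name and .logo are strings (a missing field becomes an empty column).
name_and_logo() {
    jq -r '.[] | [.name, .logo] | @tsv'
}

# Example (requires network access):
#   curl -s https://i.mjh.nz/au/all/tv.json | name_and_logo
```

This would not work for `.headers`, since `@tsv` only accepts scalar values and `.headers` is an object.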