Posted: . At: 9:13 AM. This was 7 months ago. Post ID: 18567


Easily scrape data from a table with htmlq on Linux.


Retrieving data from a table on a website and printing it as plain text turned out to be easier than I thought. The pipeline below does it with curl, htmlq and w3m.

╭──(john㉿DESKTOP-PF01IEE)───╮
╰───────────────────────────╾╯(~/Documents)-(172.18.116.29)┋ curl --silent -L https://aec.gov.au/media/2023/09-21a.htm | htmlq td | w3m -dump -T text/html | awk NF
18*
                                                                         68,263
                                                                         52,638
                                                                         45,206
                                                                         22,100
                                                                         14,138
                                                                          4,721
                                                                          3,966
                                                                          1,414
                                                                        212,446
                                                                           1.2%
19
                                                                         82,251
                                                                         64,273
                                                                         55,307
                                                                         27,035
                                                                         16,991
                                                                          5,385
                                                                          4,776
                                                                          2,254
                                                                        258,272
                                                                           1.5%
20-24
                                                                        409,622
                                                                        325,346
                                                                        280,506
                                                                        134,919
                                                                         88,992
                                                                         26,956
                                                                         25,152
                                                                         12,870
                                                                      1,304,363
                                                                           7.4%
25-29
                                                                        421,117
                                                                        348,130
                                                                        290,857
                                                                        140,745
                                                                         94,188
                                                                         28,342
                                                                         28,003
                                                                         15,569
                                                                      1,366,951
                                                                           7.7%
30-34
                                                                        446,040
                                                                        378,590
                                                                        296,065
                                                                        150,544
                                                                         96,875
                                                                         29,502
                                                                         29,677
                                                                         16,690
                                                                      1,443,983
                                                                           8.2%
35-39
                                                                        477,934
                                                                        407,336
                                                                        304,999
                                                                        167,130
                                                                        102,795
                                                                         30,068
                                                                         31,938
                                                                         16,524
                                                                      1,538,724
                                                                           8.7%

The first stage fetches the raw HTML. The --silent option hides curl's progress meter and -L follows redirects.

curl --silent -L https://aec.gov.au/media/2023/09-21a.htm

Then I feed it into htmlq, which takes a CSS selector, to find all TD elements on the web page.

htmlq td

Then w3m renders that HTML as plain text, and awk NF filters out all blank lines, since it only prints lines that contain at least one field.

w3m -dump -T text/html | awk NF
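
The pipeline prints one cell per line, so the rows of the original table run together. Here is a minimal sketch for regrouping them, assuming every row has the same number of cells (11 in the output above: the age group, nine counts and a percentage). The function name cells_to_rows is made up for this example and is not part of htmlq or w3m.

```shell
# cells_to_rows N: join every N input lines into one tab-separated row.
# Hypothetical helper; N defaults to 11, the row width seen in the output above.
cells_to_rows() {
    awk -v n="${1:-11}" '
        {
            gsub(/^[ \t]+|[ \t]+$/, "")          # trim the padding added by w3m
            printf "%s%s", $0, (NR % n ? "\t" : "\n")
        }
        END { if (NR % n) print "" }             # finish a partial last row
    '
}

# Usage (needs network access plus curl, htmlq and w3m):
# curl --silent -L https://aec.gov.au/media/2023/09-21a.htm \
#     | htmlq td | w3m -dump -T text/html | awk NF | cells_to_rows 11
```

If the page has rows with a different number of cells, the grouping will drift, so it is worth checking the first few rows by eye.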

Here is another example: printing the headings along with the table content.

╭──(john㉿DESKTOP-PF01IEE)───╮
╰───────────────────────────╾╯(~/Documents)-(172.18.116.29)┋ curl --silent -L https://aec.gov.au/media/2023/09-21a.htm | htmlq h2,td | w3m -dump -T text/html

The selector below matches both the H2 and TD elements, so the page headings are printed along with the table cells.

htmlq h2,td

The output looks good, although without awk NF the blank lines between cells are kept.

                  17,676,347 Australians are enrolled to vote
                        in the upcoming 2023 referendum
 
Enrolment by state, territory and age
 
18*
 
                                                                         68,263
 
                                                                         52,638
 
                                                                         45,206
 
                                                                         22,100
 
                                                                         14,138
 
                                                                          4,721
 
                                                                          3,966
 
                                                                          1,414
 
                                                                        212,446
 
                                                                           1.2%
 
19
 
                                                                         82,251
 
                                                                         64,273
 
                                                                         55,307
 
                                                                         27,035
 
                                                                         16,991
 
                                                                          5,385
 
                                                                          4,776
 
                                                                          2,254
 
                                                                        258,272
 
                                                                           1.5%
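
Each age row in this data ends with a total and a percentage, which allows a quick sanity check on the scraped numbers. The sketch below adds up the per-state cells and compares them with the total. It assumes the input is the one-cell-per-line output of the htmlq td pipeline (not h2,td, whose heading lines would shift the grouping) and that each row has 11 cells. The function name check_totals is hypothetical.

```shell
# check_totals N: for rows of N cells (label, counts..., total, percent),
# report whether the counts add up to the total. Hypothetical helper.
check_totals() {
    awk -v n="${1:-11}" '
        NF == 0 { next }                         # skip blank lines
        {
            cell = $0
            gsub(/[ \t,]/, "", cell)             # "  68,263" -> "68263"
            pos = count++ % n                    # position within the row
            if (pos == 0)          { label = cell; sum = 0 }
            else if (pos < n - 2)  { sum += cell }
            else if (pos == n - 2) {
                printf "%s\t%s\n", label, (sum == cell + 0 ? "ok" : "mismatch")
            }
        }
    '
}

# Usage (needs network access plus curl, htmlq and w3m):
# curl --silent -L https://aec.gov.au/media/2023/09-21a.htm \
#     | htmlq td | w3m -dump -T text/html | check_totals 11
```

For the 18* row shown above, the eight counts add up to 212,446, which matches the total cell.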

Yet another example: this selects every UL element with the CSS class ‘_2eAhj’ and prints the site guide as bulleted lists.

╭──(john㉿DESKTOP-PF01IEE)───╮
╰───────────────────────────╾╯(~/Documents)-(172.18.116.29)┋ curl --silent -L https://theage.com.au/siteguide | htmlq ul._2eAhj | w3m -dump -T text/html
  • Federal
  • Victoria
  • NSW
  • Queensland
  • Western Australia
 
  • Companies
  • Markets
  • The economy
  • Banking & finance
  • Entrepreneurship
  • Media
  • Workplace
 
  • North America
  • Europe
  • Asia
  • Middle East
  • Oceania
  • South America
  • Africa
 
  • NSW
  • Queensland
  • Western Australia
 
  • News
  • Living
  • Auctions
  • Financing
 
  • AFL
  • Cricket
  • Soccer
  • Racing
  • Tennis
  • NRL
  • Rugby union
  • Netball
  • Basketball
  • Motorsport
  • Cycling
  • Golf
  • NFL
  • Athletics
  • Swimming
  • Boxing

This is very nicely formatted.
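
Some entries, such as NSW and Queensland, appear in more than one list. If a flat list of section names is more useful, the bullets and repeats can be stripped with sed and awk. This is only a sketch: it assumes w3m prints the • bullet exactly as in the output above, and list_items is a made-up name.

```shell
# list_items: keep only bullet lines, drop the bullet, remove repeated
# entries while preserving the original order. Hypothetical helper.
list_items() {
    sed -n 's/^[[:space:]]*• //p' | awk '!seen[$0]++'
}

# Usage (needs network access plus curl, htmlq and w3m):
# curl --silent -L https://theage.com.au/siteguide \
#     | htmlq ul._2eAhj | w3m -dump -T text/html | list_items
```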

And finally, a nice way to get a weather forecast: this prints the contents of every TABLE element on the page, which w3m lays out in columns.

╭──(john㉿DESKTOP-PF01IEE)───╮
╰───────────────────────────╾╯(~/Documents)-(172.18.116.29)┋ curl --silent -L https://www.dailymail.co.uk/weather/australia/index.html | htmlq table | w3m -dump -T text/html
    Location      Condition  Now  Min  Max
Capital Cities
Sydney                       16°C 15°C 24°C
 
Melbourne                    8°C  7°C  26°C
 
Brisbane                     18°C 16°C 28°C
 
Perth                        11°C 9°C  23°C
 
Adelaide                     21°C 15°C 31°C
 
Canberra                     9°C  6°C  24°C
 
Hobart                       8°C  7°C  21°C
 
Darwin                       25°C 25°C 35°C
 
More Top Towns
Sydney                       16°C 15°C 24°C
 
Brisbane                     18°C 16°C 28°C
 
Perth                        11°C 9°C  23°C
 
Melbourne                    8°C  7°C  26°C
 
Adelaide                     21°C 15°C 31°C
 
Hobart                       8°C  7°C  21°C
 
Newcastle                    19°C 14°C 26°C
 
Canberra                     9°C  6°C  24°C
 
Wollongong                   17°C 15°C 23°C
 
Gold Coast                   19°C 17°C 26°C
 
Carrapateena Mine            20°C 18°C 34°C
 
Gruyere Mine                 19°C 18°C 31°C
 
Bridport                     9°C  7°C  17°C
 
Boorowa                      5°C  6°C  24°C
 
Iron Bridge Mine             19°C 19°C 41°C
 
Darwin                       25°C 25°C 35°C
 
Boco Rock                    6°C  7°C  22°C
 
Eliwana                      22°C 18°C 37°C
 
Annuello                     14°C 6°C  31°C
 
Goondiwindi                  19°C 14°C 32°C
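
To pull out a single location instead of the whole table, the text can be filtered with awk. A minimal sketch, assuming the layout shown above where the last three fields of each row are the Now, Min and Max temperatures; city_temps is a hypothetical name.

```shell
# city_temps CITY: print "now min max" for the first row starting with CITY.
# Hypothetical helper; relies on the column layout in the output above.
city_temps() {
    awk -v city="$1" '
        index($0, city " ") == 1 {
            print $(NF - 2), $(NF - 1), $NF      # last three fields: Now Min Max
            exit
        }
    '
}

# Usage (needs network access plus curl, htmlq and w3m):
# curl --silent -L https://www.dailymail.co.uk/weather/australia/index.html \
#     | htmlq table | w3m -dump -T text/html | city_temps 'Gold Coast'
```

Matching on the start of the line rather than on the first field keeps two-word names such as Gold Coast working.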
