Posted: . At: 10:48 AM. This was 3 years ago. Post ID: 14799


How to easily extract text from a website using curl and a simple script.


Pulling the readable text out of a web page can be fiddly, because the article is buried in navigation, scripts and markup. A short pipeline built from curl and nokogiri makes it much easier.

The curl utility downloads a page and prints its raw HTML in the terminal. That is only part of the solution: we still need to filter the output down to one particular div element. That is where nokogiri, an HTML and XML parser for Ruby that also ships a command-line tool, can help.

Install the ruby-nokogiri package first.

┌──[jason@192.168.1.2][~]
└──╼  ╼ $ sudo apt install ruby-nokogiri

Nokogiri is a Ruby library, so Ruby must be installed as well; apt should pull it in as a dependency of this package.

Then we can read an online news article and filter the output down to the div with the class “story-content”.

curl 'https://www.news.com.au/technology/online/security/victorian-government-launches-new-qr-code-checking-system/news-story/c3e9f53b4ba1ee96804acacb143c316a' -s | nokogiri -e 'puts $_.at_css("div.story-content").text'

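The -s (silent) option stops curl from printing its progress meter and error messages, so only the page HTML reaches nokogiri. Inside the -e script, $_ holds the parsed document, and at_css returns the first element that matches the CSS selector. The output below is what I got when I was parsing a news article from news.com.au.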

┌──[jason@192.168.1.2][~]
└──╼  ╼ $ curl 'https://www.news.com.au/technology/online/security/victorian-government-launches-new-qr-code-checking-system/news-story/c3e9f53b4ba1ee96804acacb143c316a' -s | nokogiri -e 'puts $_.at_css("div.story-content").text'
And on the first day of its rollout Victorians are already citing issues, with Melbourne resident Lachlan Thomas unable to register after multiple attempts.“It’s such a simple form – your name, email, phone and tick a couple of boxes and that’s it – yet there’s still issues,” he said. Mr Thomas said he tried registering on multiple browsers on multiple occasions, but still the system would fail to register his details.He then called the support line provided on the online form, and was told the system was “likely overwhelmed” and to “try again later”.“It defeats the purpose of having this tool,” he said.“The government has taken months to develop a QR code system, and when it’s finally launched it’s not working – it’s useless – it’s like the COVIDSafe app.”After registering details online with the free QR code service, users can then download and print a poster with the Victorian government QR code and display it in their businesses.Visitors then must scan the QR code using their smartphone camera.

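The paragraphs run together because .text joins the text of every node inside the div without adding line breaks. I still need to find a way to work with the paragraph tags, but this works OK as a starting point.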

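Here is an example that works on the CNN website. This time the selector is “div.l-container”, which matches a div with the class “l-container” (a leading dot selects a class; an id would use # instead).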

┌──[jason@192.168.1.2][~/Documents]
└──╼  ╼ $ curl 'https://edition.cnn.com/2020/11/30/us/massachusetts-attacks-waltham-trnd/index.html' -s | nokogiri -e 'puts $_.at_css("div.l-container").text'
11 unprovoked attacks in one month have rocked a Massachusetts cityBy Scottie Andrew, CNNUpdated 2124 GMT (0524 HKT) November 30, 2020 Police patrol a neighborhood in Waltham following an attack. (CNN)Eleven unprovoked attacks have occurred in the last month in a Massachusetts city, scaring residents and confounding police. Now, the Waltham Police Department is asking for the public's help. The department released footage of a person in a hoodie, their face obscured, who police say is a suspect. The attacks began November 10. The first five occurred near an apartment complex, minutes from Bentley University, but more recent incidents have occurred downtown, Detective Sgt. Steve McCarthy told CNN. As of last week, there have been 11 unprovoked assaults recorded in the city, McCarthy said. The victims were all targeted with a blunt object by an unknown assailant after dark, he said. One victim, who said he was attacked the day before Thanksgiving, is in the hospital with several breaks in his face and skull, CNN Boston affiliate WCVB reported. Read MoreDavid Cameros, another victim, told WCVB he was hit in the eye with an unknown object before his attacker ran away. "I don't know if it is only one or there are more attackers," Cameros told WCVB. "The aggressor always attacks from behind."Police told WCVB they're not sure whether the same person committed all 11 assaults. Residents who recognize the person in the video have been asked to contact Waltham police and be aware of their surroundings if they go out at night. Waltham, about 10 miles west of Boston, has about 63,000 residents.

So, this works very well for extracting text. Just find the class or id of the div that wraps the article, for example with the browser's page inspector, and then go from there.

