A web scraper for Arduino is a device that downloads interesting or necessary information from a web server.
Typical targets are articles, reviews, product prices, telephone numbers, e-mail addresses,
and current prices of securities on stock exchanges.
The scraper can fetch the information once or repeatedly over a certain period, which makes it possible to monitor, for example,
the development of prices on stock exchanges or of product prices, and
thus to draw attention to discounts and bargains. The most widely used language for scrapers
is Python, but a scraper can be implemented in practically any language. In this case it is
Wiring, a simplified C-like language that can be used to program the ESP8266, ESP32, and
Arduino platforms.
Most websites today use the encrypted HTTPS protocol (port 443). For demonstration, we will
scrape the domain https://forum.hwkitchen.cz/ , which is a popular Czech Arduino-related forum.
On a classic Arduino board, a web scraper can only be used for HTTP services, because the board does not have
enough computing power for encryption. The ESP8266 and ESP32 platforms do allow encryption for HTTPS
connections. The program implementation consists only of wrapping HTTP in the encrypted
channel. The ESP8266 verifies the server with a certificate fingerprint in SHA1 format, and the ESP32 uses a Root CA
certificate. To implement the encrypted connection, we will use the
secure client socket from the header file WiFiClientSecure.h. It basically requests the page like a standard
browser.
But how do I get a fingerprint and a Root CA certificate?
One of the simplest options is to use the cryptographic tool OpenSSL.
OpenSSL comes preinstalled on some Linux systems, so no installation is needed there.
Command to get SHA1 fingerprint (www.marjun.net):
openssl s_client -connect example.com:443 -showcerts < /dev/null 2>/dev/null | openssl x509 -in /dev/stdin -sha1 -noout -fingerprint
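The fingerprint command typically prints a line of the form SHA1 Fingerprint=10:F6:06:... (colon-separated, upper-case hex), while the sketch further below stores the fingerprint as lower-case hex bytes separated by spaces. A minimal sketch of that conversion in standard C++ could look like this. The function name toSketchFingerprint is hypothetical, and the code is meant for a desktop compiler rather than for the board:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: turns the OpenSSL output line
// "SHA1 Fingerprint=10:F6:06:..." into "10 f6 06 ...".
// Everything up to and including '=' is treated as a label and dropped.
std::string toSketchFingerprint(const std::string& opensslLine) {
    std::size_t eq = opensslLine.find('=');
    std::size_t begin = (eq == std::string::npos) ? 0 : eq + 1;
    std::string out;
    for (std::size_t i = begin; i < opensslLine.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(opensslLine[i]);
        if (c == ':') {
            out += ' ';                                  // colon separator -> space
        } else if (std::isxdigit(c)) {
            out += static_cast<char>(std::tolower(c));   // hex digit -> lower case
        }
        // any other character (for example a trailing newline) is skipped
    }
    return out;
}
```

For the input SHA1 Fingerprint=10:F6:06:DF the helper returns 10 f6 06 df, which can then be pasted into the fingerprint constant of the program.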
The command for obtaining a Root CA certificate in .pem format:
openssl s_client -showcerts -verify 5 -connect example.com:443 < /dev/null
The listing is ordered hierarchically from the lowest CA to the highest (Root CA), the so-called Chain of Trust.
In our case, the Root CA is DST Root CA X3, under which the free Let’s Encrypt certificates are issued.
The second way to get the SHA1 fingerprint is directly in the browser when displaying information
about the web server’s certificate, while it is also possible to look at the certification path, where
you can see the name of the Root CA.
The disadvantage of a fingerprint is that it changes every time the server’s certificate changes.
Let’s Encrypt certificates for a webserver are valid for three months and are then renewed. Thus, after three
months at the latest, the program for the ESP8266 has to be updated, as the connection will not be
established with the old fingerprint.
On the other hand, the issuer does not change when Let’s Encrypt certificates are renewed, so the ESP32
with the Root CA certificate can keep working until the DST Root CA X3 certificate expires.
That certificate is valid from 2000 until September 30, 2021.
Web Scraper for Arduino implementation:
We will use the web scraper to retrieve the following information from the forum:
- The names of the main forums
- The number of topics in each forum
- The total number of posts in each forum
- The author of the last post
- From the Recent topics section: the topic name, the topic author, and the author of the last post. The topic author is printed a second time; this is a scraper error, because the name appears twice in the source code although it is shown to the user only once.
Many other pieces of information can be retrieved in the same way: thread URL, thread creation time, last post time, subforums, categories, views.
Connection with webserver:
To connect to the webserver, it is possible to start from the built-in WebClient examples of the WiFiClientSecure
library for the ESP8266 and ESP32 platforms in Arduino Core. For ESP32:
https://github.com/espressif/arduino-es … Secure.ino
After a successful connection and GET request, the response of the webserver can be
read, most often line by line. The HTTP header arrives first,
followed by the payload sent by the web server in response to the request. In this case, the payload is
the HTML source code of the main forum page.
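The exchange described above can be sketched in standard C++ without any network code. The helper names buildGetRequest and splitResponse are hypothetical, and the sample response in the note after the code is made up for illustration. On the board, the same text would travel through the client object.

```cpp
#include <string>
#include <utility>

// Builds a minimal HTTP/1.1 GET request. "Connection: close" asks the
// server to close the connection once the response has been sent.
std::string buildGetRequest(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.1\r\n" +
           "Host: " + host + "\r\n" +
           "Connection: close\r\n\r\n";
}

// Splits a raw response into {header, payload}. The header ends at the
// first empty line (CRLF CRLF). If no empty line is found, the whole
// text is returned as the header and the payload stays empty.
std::pair<std::string, std::string> splitResponse(const std::string& raw) {
    const std::string separator = "\r\n\r\n";
    std::size_t pos = raw.find(separator);
    if (pos == std::string::npos) return {raw, ""};
    return {raw.substr(0, pos), raw.substr(pos + separator.size())};
}
```

Given a made-up response consisting of a status line, a Content-Type line, an empty line, and <html></html>, splitResponse returns the two header lines as the first element and <html></html> as the second. Reading line by line on the microcontroller achieves the same effect without holding the whole response in memory.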
Source code analysis:
The loaded source code needs to be analyzed to find clues on which the scraper can rely in order to
retrieve the dynamic information located at that point in the source code. The better the
page is structured (the more ids and classes it has), the easier the work of the web
scraper becomes, because it is easier to determine where the scraper should expect the given information. The
analysis can also be done in a regular browser.
The analysis showed that it is relatively easy to predict where dynamic information will
appear in the HTML code:
- The forum name is between the HTML snippets: class="forumtitle"> and </a>
- The thread name is between the HTML snippets: class="topictitle"> and </a>
- The number of threads is between the HTML snippets: <dd class="topics"> and <dfn>
- The number of forum posts is between the HTML snippets: <dd class="posts"> and <dfn>
- The author of the topic (or the author of the last post) is between the HTML snippets: class="username"> and </a>
Program implementation:
A single program covers both the ESP8266 and ESP32 platforms.
Preprocessor directives select the code that is compiled for the target platform
chosen by the user in the Arduino IDE. Since we are looking for dynamic information of which
we only know the position, we will not use regular expressions (these are better suited to
searching for e-mail addresses, telephone numbers …). Instead, we will parse each loaded line and
extract the necessary information between the source code snippets that we found during the
analysis. Basically, we will create a substring that represents our
information and write it to the Serial (UART) line.
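The parsing approach can also be tried outside the board. The following is a minimal sketch in standard C++ that mirrors the logic of the midString function from the source code below. The name textBetween and the sample HTML line in the note are made up for illustration and are not copied from the forum.

```cpp
#include <string>

// Returns the text between the first occurrence of start and the next
// occurrence of finish after it, or an empty string if either is missing.
std::string textBetween(const std::string& line,
                        const std::string& start,
                        const std::string& finish) {
    std::size_t from = line.find(start);
    if (from == std::string::npos) return "";
    from += start.size();                       // skip the start marker itself
    std::size_t to = line.find(finish, from);   // search only after the marker
    if (to == std::string::npos) return "";
    return line.substr(from, to - from);
}
```

For a made-up line such as <a href="./viewforum.php?f=1" class="forumtitle">Arduino</a>, calling textBetween with the markers class="forumtitle"> and </a> yields Arduino. A marker found at position 0 is a valid match, which is why the function tests for "not found" rather than comparing the position with zero.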
The source code for ESP8266 was written for Arduino Core 2.5.0 and 2.5.2. It may
not work with the latest version of Arduino Core. The program for ESP32 is compatible with every
version from stable 1.0.0 onward.
Some forums and websites also offer RSS or JSON output, which can be retrieved in a similar way and
parsed for the necessary information. Such formats have predefined fields to
which the values belong, so they are relatively easy to process with a scraper. The
implementation uses only the setup() function, as a single connection is sufficient and the microcontroller
does not flood the server with repeated identical requests.
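If the scraper should later poll the page repeatedly from loop() instead of fetching it once, the requests need to be spaced out so that the server is not flooded. A minimal sketch of such a guard in standard C++ follows. The name isFetchDue is an assumption for illustration, and on the board the current time would come from millis().

```cpp
#include <cstdint>

// Hypothetical guard: returns true when at least intervalMs milliseconds
// have passed since lastFetchMs. Unsigned subtraction keeps the result
// correct even after the 32-bit millisecond counter wraps around.
bool isFetchDue(std::uint32_t nowMs, std::uint32_t lastFetchMs,
                std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - lastFetchMs) >= intervalMs;
}
```

A loop() built on this guard would remember the time of the last request, call isFetchDue with the current time and the chosen interval, and only then open a new connection.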
Source code:
//Webscraper - forum.hwkitchen.cz
//Author:
//Used only for demonstration purposes
const char* ssid = "WIFI_NAME";
const char* password = "WIFI_PASSWORD";
const char* host = "forum.hwkitchen.cz";
const int serverPort = 443; //https port
#if defined(ESP32)
#include <WiFi.h>
//DST ROOT CA X3 - .pem
const static char * test_root_ca PROGMEM = \
"-----BEGIN CERTIFICATE-----\n" \
"MIIDSjCCAjKgAwIBAgIQRK+wgNajJ7qJMDmGLvhAazANBgkqhkiG9w0BAQUFADA/\n" \
"MSQwIgYDVQQKExtEaWdpdGFsIFNpZ25hdHVyZSBUcnVzdCBDby4xFzAVBgNVBAMT\n" \
"DkRTVCBSb290IENBIFgzMB4XDTAwMDkzMDIxMTIxOVoXDTIxMDkzMDE0MDExNVow\n" \
"PzEkMCIGA1UEChMbRGlnaXRhbCBTaWduYXR1cmUgVHJ1c3QgQ28uMRcwFQYDVQQD\n" \
"Ew5EU1QgUm9vdCBDQSBYMzCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB\n" \
"AN+v6ZdQCINXtMxiZfaQguzH0yxrMMpb7NnDfcdAwRgUi+DoM3ZJKuM/IUmTrE4O\n" \
"rz5Iy2Xu/NMhD2XSKtkyj4zl93ewEnu1lcCJo6m67XMuegwGMoOifooUMM0RoOEq\n" \
"OLl5CjH9UL2AZd+3UWODyOKIYepLYYHsUmu5ouJLGiifSKOeDNoJjj4XLh7dIN9b\n" \
"xiqKqy69cK3FCxolkHRyxXtqqzTWMIn/5WgTe1QLyNau7Fqckh49ZLOMxt+/yUFw\n" \
"7BZy1SbsOFU5Q9D8/RhcQPGX69Wam40dutolucbY38EVAjqr2m7xPi71XAicPNaD\n" \
"aeQQmxkqtilX4+U9m5/wAl0CAwEAAaNCMEAwDwYDVR0TAQH/BAUwAwEB/zAOBgNV\n" \
"HQ8BAf8EBAMCAQYwHQYDVR0OBBYEFMSnsaR7LHH62+FLkHX/xBVghYkQMA0GCSqG\n" \
"SIb3DQEBBQUAA4IBAQCjGiybFwBcqR7uKGY3Or+Dxz9LwwmglSBd49lZRNI+DT69\n" \
"ikugdB/OEIKcdBodfpga3csTS7MgROSR6cz8faXbauX+5v3gTt23ADq1cEmv8uXr\n" \
"AvHRAosZy5Q6XkjEGB5YGV8eAlrwDPGxrancWYaLbumR9YbK+rlmM6pZW87ipxZz\n" \
"R8srzJmwN0jP41ZL9c8PDHIyh8bwRLtTcm1D9SZImlJnt1ir/md2cXjbDaJWFBM5\n" \
"JDGFoqgCWjBH4d1QB7wCCZAA62RjYJsWvIjJEubSfZGL+T0yjWW06XyxV3bqxbYo\n" \
"Ob8VZRzI9neWagqNdwvYkQsEjgfbKbYK7p2CNTUQ\n" \
"-----END CERTIFICATE-----\n" ;
#elif defined(ESP8266)
#include <ESP8266WiFi.h>
const char fingerprint[] PROGMEM = "10 f6 06 df 7a 2b f3 f3 08 ed a5 8c e1 9b 15 e5 3f 3f d3 32"; //SHA1 Fingerprint
#endif
#include <WiFiClientSecure.h>
// Returns the text between the first occurrence of start and the next occurrence of finish
String midString(String str, String start, String finish) {
  int locStart = str.indexOf(start);
  if (locStart == -1) return "";
  locStart += start.length();
  int locFinish = str.indexOf(finish, locStart);
  if (locFinish == -1) return "";
  return str.substring(locStart, locFinish);
}
void setup() {
  Serial.begin(115200);
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("");
  Serial.println("WiFi connected successfully");
  Serial.println("IP address: ");
  Serial.println(WiFi.localIP());
  Serial.println("Ready");
  WiFiClientSecure client;
#if defined(ESP32)
  client.setCACert(test_root_ca);
  Serial.println("Using DST ROOT CA X3");
#elif defined(ESP8266)
  client.setFingerprint(fingerprint);
  Serial.printf("Using fingerprint '%s'\n", fingerprint);
#endif
  String url = "/";
  if (client.connect(host, serverPort)) {
    Serial.println("Connection is OKAY!");
    client.print(String("GET ") + url + " HTTP/1.1\r\n" + "Host: " + host + "\r\n" +
                 "User-Agent: ESPBoard\r\n" + "Connection: close\r\n\r\n");
    // Keep reading while the server is connected or buffered data remains
    while (client.connected() || client.available()) {
      String line = client.readStringUntil('\n');
      //Serial.println(line);
      // indexOf returns -1 when the marker is missing; position 0 is a valid match
      if (line.indexOf("class=\"forumtitle\">") >= 0) {
        Serial.println();
        Serial.println("Forum: " + midString(line, "class=\"forumtitle\">", "</a>"));
      }
      if (line.indexOf("class=\"topictitle\">") >= 0) {
        Serial.println();
        Serial.println("Thread: " + midString(line, "class=\"topictitle\">", "</a>"));
      }
      if (line.indexOf("<dd class=\"topics\">") >= 0) {
        Serial.println("Number of themes: " + midString(line, "<dd class=\"topics\">", " <dfn>"));
      }
      if (line.indexOf("<dd class=\"posts\">") >= 0) {
        Serial.println("Number of posts: " + midString(line, "<dd class=\"posts\">", " <dfn>"));
      }
      if (line.indexOf("class=\"username\">") >= 0) {
        Serial.println("Author of post: " + midString(line, "class=\"username\">", "</a>"));
      }
    }
  } else {
    Serial.println("Error connecting to webpage!");
  }
  client.stop();
}
void loop() {
  // Intentionally empty: the page is fetched once in setup()
}
A scraper built on these platforms will not work on websites that use JavaScript-based protection
against robots, such as CAPTCHA, Cloudflare, and the like. The microcontroller
only reads the source code; it cannot run JavaScript or a similar client-side language. For a plain HTTP
scraper, the WiFiClient example for HTTP connections can be used instead. Requests and the payload are
handled in the same way, through the client object.