Make your own crawler
Contents:
Preparation for coding:
1. VirtualBox and Vagrant
2. Install Apache
3. Install MySQL
4. Install PHP
Edit crawler
1. Download and configure PHPcrawl
2. Editing code of the crawler
3. Save fetched information in a database.
4. Sharing a local website inside a local network (optional)
5. User interface
*PHPcrawl cannot fetch information from HTTPS websites because of a bug. Fix this problem this way.
Save fetched information in a database
We have successfully fetched information from a website by using a crawler. But fetching information is meaningless if we just dump it and do nothing with it. We will save the fetched information in MySQL, so that we can put it to good use later on.
We installed MySQL in our virtual machine in the previous post "Install MySQL", so you should already have MySQL available.
Remove tags
The information fetched from websites is full of HTML tags, because (almost) all websites are built with HTML (often combined with other scripts and languages). You can see this for yourself by pressing F12 in your browser.
"CSS" is also a kind of language that is used to customize style of a website
HTML is useful for browsers, but not for us: it contains a lot of information we don't need. We can remove such tags with regular expressions (see the sketch below). For this time, though, we just leave them as they are.
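For reference, here is a minimal sketch of how the tags could be removed later on (an illustration only, assuming the raw page source is in $html; strip_tags() and preg_replace() are PHP built-ins):

// remove script and style blocks first, because strip_tags() keeps their contents
$text = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', '', $html);
// strip the remaining HTML tags
$text = strip_tags($text);
// collapse runs of whitespace into single spaces
$text = trim(preg_replace('/\s+/', ' ', $text));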
Make a database in the virtual machine
First, become the superuser:
$ su
The password is "vagrant". On the virtual machine, sign in to MySQL:
# mysql -u root -p
The password should be "root" if you followed the instructions of this post: Install MySQL. Create a database named "testdb".
mysql>CREATE DATABASE testdb;
Now we have a database called "testdb". Next, create a table.
mysql>CREATE TABLE testdb.testtable (
id int NOT NULL AUTO_INCREMENT,
webinfo text NOT NULL,
PRIMARY KEY (id)
);
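You can check that the table was created correctly:

mysql>DESCRIBE testdb.testtable;

It should list the two columns, "id" and "webinfo".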
We have a database, a table and columns now. We will write a function to save the fetched information into this table. Open NetBeans and the crawler project. Add this variable to the MyCrawler class:
public $howmanyrecordmade = 0;
Now add this code to the handleDocumentInfo function:
$dsn = "mysql:dbname=testdb;host=localhost";
$user = "root";
$password = "root";
try{
    $conn = new PDO($dsn, $user, $password);
    $conn->query('SET NAMES utf8');
    // set the PDO error mode to exception
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // use a prepared statement so quotes and backslashes in the page
    // source cannot break the query or inject SQL
    $sql = 'INSERT INTO testtable (webinfo) VALUES (?)';
    $stmt = $conn->prepare($sql);
    $stmt->execute(array(htmlspecialchars(trim($DocInfo->source))));
    echo "New record created successfully"."<br>";
    $this->howmanyrecordmade++;
}
catch(PDOException $e){
    echo $e->getMessage()."<br>";
}
And add this line at the end of the execution:
echo "How many new record made: ".$crawler->howmanyrecordmade." record(s)".$lb;
Then save the file. Press ctrl + S or save it from the menu bar.
Then start the crawler!
The crawler crawls the webpage little by little...
Then it shows the result summary at the end of the execution.
Check the database with this SQL:
mysql>SELECT * FROM testdb.testtable;
(It can take a long time to print all of the information; to abort, press Ctrl + C.) Or use a GUI tool to browse the database.
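You can also check from PHP. A minimal sketch (same connection details as above; the queries are just examples):

$conn = new PDO("mysql:dbname=testdb;host=localhost", "root", "root");
// count the saved pages
echo "Saved pages: ".$conn->query('SELECT COUNT(*) FROM testtable')->fetchColumn()."<br>";
// show the beginning of the newest record
$row = $conn->query('SELECT webinfo FROM testtable ORDER BY id DESC LIMIT 1')->fetch();
echo substr($row['webinfo'], 0, 100)."...<br>";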
You can see that all pages were saved as they were (including all HTML tags, though they were escaped for security reasons). It looks like this:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
(Omitted)
<p>
Note, that many languages are just under translation, and
the untranslated parts are still in English. Also some translated
parts might be outdated. The translation teams are open to
contributions.
</p>
<div class="warning">
<p>
Documentation for PHP 4 has been removed from the
manual, but there is archived version still available. For
more informations, please read <a href="/manual/php4.php">
Documentation for PHP 4</a>.
</p>
</div>
</div>
(Omitted)
It automatically crawls for information and saves it in your database! :)
But there are still problems:
it saves every piece of information, whatever it is (so it might save duplicate information in the database)
it doesn't check whether the information is already in the database
it saves information without removing HTML tags
the DSN information should be in a separate file for security reasons
there is no user-friendly interface
errors might occur depending on the character encoding (encodings other than UTF-8)
...
Yes, still incomplete.
We will deal with the problems next time.
(By the way, to run the crawler automatically and periodically, use cron on CentOS. As long as the virtual machine is running, cron can execute programs for you automatically. See the sketch below.)
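For example, a crontab entry like this (a sketch; the path to crawler.php is just an assumption about where your script lives) would run the crawler every day at 3 a.m.:

# edit the crontab with "crontab -e"
0 3 * * * php /vagrant/crawler/crawler.php >> /tmp/crawler.log 2>&1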