Sunday, June 18, 2017

User interface

Contents


Preparation for coding:
1. Virtual box and Vagrant

2. Install Apache

3. Install MySQL

4. Install PHP

Edit crawler
1. Download and configure for PHPcrawl

2. Editing code of the crawler

3. Save fetched information in a database.

4. Sharing a local website inside a local network (optional)

5. User interface

One of the remaining problems is that the crawler doesn't have a user-friendly interface. We will build one today. The other problems we will deal with later.

PHPCrawl actually already ships with an example interface (PHPCrawl_083/test_interface/index.php), but we will build a simpler version of it to learn how it works.

Copy and paste the following into a text file, rename it to "index.html", and save it in the shared folder. This will be our simple interface.
<!DOCTYPE HTML>
<html>
<link rel="icon" href="favicon.ico" type="image/x-icon" />

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Crawler</title>
</head>

<body>
    <div class="all" style="vertical-align: middle;">
        <h1>Crawler</h1>
        <p>Enter a URL here:</p>
        <form method="POST" action="./PHPCrawl_083/example.php">
            URL:
            <input type="text" name="urlToCrawl" size="50" value="" />
            <br />
            <p>data amount to crawl (in KB):
                <input type="text" name="dataAmount" size="50" value="250" />
                <br />
            </p>
            <p>page limit:
                <input type="text" name="plimit" size="50" value="" />
                <br />
            </p>
            <p>ignore robots.txt?
                <input type="radio" name="ig_robottxt" value="false"> Yes
                <input type="radio" name="ig_robottxt" value="true" checked> No
            </p>
            <p>Follow mode:
                <input type="radio" name="follow" value="0"> 0
                <input type="radio" name="follow" value="1"> 1
                <input type="radio" name="follow" value="2" checked> 2
                <input type="radio" name="follow" value="3"> 3</p>
            <p>
                <input type="submit" value="Crawl" />
            </p>
        </form>
        <p>Write the URL which you want to crawl.</p>
        <br clear="all">
    </div>
</body>
</html>


If you access the HTML file from a browser, you can see that there is a simple interface now (it doesn't work yet, though).

Change the PHP code of example.php like this:
<?php
// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl-mainclass
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
 
  public $howmanyrecordmade = 0;

  function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
 
    // Print the refering URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;
 
    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;
 
    // Now you should do something with the content of the actual
    // received page or file ($DocInfo->source)
    if($DocInfo->received == true){ // this is executed only when the content was received successfully
            $dsn = "mysql:dbname=testdb;host=localhost";
            $user = "root";
            $password = "root";
            try{
                $conn = new PDO($dsn, $user, $password);
                $conn->query('SET NAMES utf8');
                // set the PDO error mode to exception
                $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
                $sql = 'INSERT INTO testtable(webinfo) VALUES ("'.htmlspecialchars(trim($DocInfo->source)).'")';
                // use exec() because no results are returned
                $conn->exec($sql);
                echo "New record created successfully"."<br>";
                $this->howmanyrecordmade++;
            }
            catch(PDOException $e){
                echo $e->getMessage()."<br>";
            }
    }
 
    echo $lb;
 
    flush();
  }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.

$crawler = new MyCrawler();

// URL to crawl
$urlToCrawl = filter_input(INPUT_POST, 'urlToCrawl');
$crawler->setURL($urlToCrawl);

// Set delay time
$crawler->setRequestDelay(5);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit (the form value is given in KB,
// so multiply by 1024 to get bytes; for testing we don't
// want to download the whole site)
$dataAmount = filter_input(INPUT_POST, 'dataAmount');
$dataAmount = $dataAmount * 1024;
$crawler->setTrafficLimit($dataAmount);

// Obey robots.txt or not. Usually this should be true.
// filter_input() returns a string, so convert "true"/"false" into a real boolean.
$mode = (filter_input(INPUT_POST, 'ig_robottxt') === "true");
$crawler->obeyRobotsTxt($mode);

// Limit page numbers
$plimit = filter_input(INPUT_POST, 'plimit');
if(strlen($plimit) > 0){
    $crawler->setPageLimit($plimit);
}

// Set follow mode
$follow = filter_input(INPUT_POST, 'follow');
$crawler->setFollowMode($follow);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";
 
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "How many new record made: ".$crawler->howmanyrecordmade." record(s)".$lb;
echo "Documents received: ".$report->files_received.$lb;
$byteData = $report->bytes_received;
$megabyteData = round( $byteData/1024/1024, 2);
echo "Bytes received: ".$byteData." bytes".$lb;
echo "Megabytes received: ".$megabyteData." megabytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>

Now we have a simple interface for the crawler, and we can operate it from here.

As you can see, the quality is not good yet, but at least it works. We will deal with the other problems next time.
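Since the database credentials currently sit directly in example.php, one small cleanup worth doing early is moving them into a separate file. A sketch (the file name db_config.php is just an example):

<?php
// db_config.php - keep this file outside the public web root if possible
return array(
    'dsn'      => 'mysql:dbname=testdb;host=localhost',
    'user'     => 'root',
    'password' => 'root',
);

Then, inside handleDocumentInfo(), the connection could be opened like this:

$config = require '/path/to/db_config.php';
$conn = new PDO($config['dsn'], $config['user'], $config['password']);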

Also, PHPCrawl cannot get information from HTTPS websites. You can fix this problem this way.

Sharing a local website inside a local network

Contents

Preparation for coding:
1. Virtual box and Vagrant

2. Install Apache

3. Install MySQL

4. Install PHP

Edit crawler
1. Download and configure for PHPcrawl

2. Editing code of the crawler

3. Save fetched information in a database.

4. Sharing a local website inside a local network (optional)

5. User interface

If you use Vagrant, you can share a local website inside a local network. First, check the IP address that you are using. Run the "ipconfig" command on your command prompt:


IP address related information will be displayed:


The information you have to check is the IP address surrounded by the red square. The first three numbers form the network part of the address, which is shared by every device on the local network; in my case that is 192.168.11.

Then pick a host number that is not used by any other device and append it to this network part. I will use "151" because it does not seem to be taken by another device, so the address for my virtual machine will be "192.168.11.151". We will add this information to the Vagrantfile.

Add this line to the Vagrantfile, inside the Vagrant.configure block:
config.vm.network "public_network", ip: "192.168.11.151"
Save and close the file. Then run "vagrant reload" (if your virtual machine is already running) or "vagrant up" on the command prompt.

If you want to access the website from your phone, make sure your phone is connected to the same Wi-Fi network:


The website (or virtual machine) can now be accessed from any device inside the local network. Just type the IP address that was specified in the Vagrantfile into the browser and press Enter. The website cannot be accessed from outside the local network.

From my phone:


From my PC:



Sunday, June 11, 2017

Save information in database

Make your own crawler

Contents:


Preparation for coding:
1. Virtual box and Vagrant

2. Install Apache

3. Install MySQL

4. Install PHP

Edit crawler
1. Download and configure for PHPcrawl

2. Editing code of the crawler

3. Save fetched information in a database.

4. Sharing a local website inside a local network (optional)

5. User interface

*PHPCrawl cannot get information from HTTPS websites because of a bug. You can fix this problem this way.

Save fetched information in a database


We have successfully fetched information from a website with our crawler. But fetching information is meaningless if we just dump it and do nothing with it, so we will save the fetched information in MySQL and use it for some good purpose later on.

We already installed MySQL in our virtual machine in the previous post "Install MySQL", so MySQL should be available from the start.


Remove tags


The information fetched from websites is full of HTML tags, because (almost) all websites are built with HTML (often combined with other scripts and languages). It looks like this:

You can see this by pressing F12 on your keyboard.
("CSS" is another language, used to customize the style of a website.)

HTML is useful for browsers, but not for us; it contains a lot of useless information. We can remove such tags by using regular expressions.

For now, we just leave them as they are.
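For reference, here is a minimal sketch of how the tags could be stripped later, using strip_tags() and regular expressions (the variable names are just for illustration):

// $DocInfo->source holds the fetched page source
$html = $DocInfo->source;

// Remove script and style blocks first, because strip_tags() keeps their contents
$text = preg_replace('#<(script|style)[^>]*>.*?</\1>#si', ' ', $html);

// Remove the remaining tags and collapse the whitespace
$text = strip_tags($text);
$text = preg_replace('/\s+/', ' ', $text);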


Make a database in the virtual machine


First, become the superuser:
$ su
The password is "vagrant".

On the virtual machine, log in to MySQL:
# mysql -u root -p
The password should be "root" if you followed the instructions of this post: Install MySQL.

Create a database named "testdb":
mysql>CREATE DATABASE testdb;

Now we have a database called "testdb". Next, create a table:
mysql>CREATE TABLE testdb.testtable (
  id int NOT NULL AUTO_INCREMENT,
  webinfo text NOT NULL,
  PRIMARY KEY (id)
);
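Before wiring this into the crawler, you can verify from PHP that the database is reachable. A quick throwaway sketch (using the same credentials as below):

<?php
// check_db.php - quick connection test for the testdb database
$dsn = "mysql:dbname=testdb;host=localhost";
try {
    $conn = new PDO($dsn, "root", "root");
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    echo "Connected to testdb successfully";
} catch (PDOException $e) {
    echo "Connection failed: " . $e->getMessage();
}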

We now have a database, a table, and columns. Next we will make the crawler save the fetched information in this table. Open NetBeans and the crawler's project, then add this variable to the MyCrawler class:
public $howmanyrecordmade = 0;



Now add this code to the handleDocumentInfo function:
    $dsn = "mysql:dbname=testdb;host=localhost";
    $user = "root";
    $password = "root";
    try{
        $conn = new PDO($dsn, $user, $password);
        $conn->query('SET NAMES utf8');
        // set the PDO error mode to exception
        $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $sql = 'INSERT INTO testtable(webinfo) VALUES ("'.htmlspecialchars(trim($DocInfo->source)).'")';
        // use exec() because no results are returned
        $conn->exec($sql);
        echo "New record created successfully"."<br>";
        $this->howmanyrecordmade++;
    }
    catch(PDOException $e){
        echo $e->getMessage()."<br>";
    }
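The INSERT above builds the SQL string by concatenating the page source into it, which is fragile if the source contains quote characters. If you prefer, the same insert can be written with a PDO prepared statement (a sketch, using the same $conn and table as above):

$sql = 'INSERT INTO testtable(webinfo) VALUES (:webinfo)';
$stmt = $conn->prepare($sql);
$stmt->execute(array(':webinfo' => htmlspecialchars(trim($DocInfo->source))));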


And add this line to the end of the script, where the result summary is printed:
echo "How many new record made: ".$crawler->howmanyrecordmade." record(s)".$lb;


Then save the file. Press ctrl + S or save it from the menu bar.


Then start the crawler!


The crawler crawls the webpage little by little...

Then it shows the result summary at the end of the execution:

Check the database by this SQL:
mysql>SELECT * FROM testdb.testtable;
(It can take a long time to display all of the information. To abort the execution, press "Ctrl + Z" or "Ctrl + C".)

Or use some software to see the database:

You can see that all pages were saved as they were (including all HTML tags, which were escaped for security reasons). It looks like this:
&lt;!DOCTYPE html&gt;
&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; lang=&quot;en&quot;&gt;
&lt;head&gt;

(Omitted)

&lt;p&gt;
 Note, that many languages are just under translation, and
 the untranslated parts are still in English. Also some translated
 parts might be outdated. The translation teams are open to
 contributions.
&lt;/p&gt;

 &lt;div class=&quot;warning&quot;&gt;
  &lt;p&gt;
   Documentation for PHP 4 has been removed from the
   manual, but there is archived version still available. For
   more informations, please read &lt;a href=&quot;/manual/php4.php&quot;&gt;
   Documentation for PHP 4&lt;/a&gt;.
  &lt;/p&gt;
 &lt;/div&gt;
&lt;/div&gt;

(Omitted)

It automatically crawls for information and saves it in your database! :)

But the problems are:
- it saves every piece of information, whatever it is (so it might save duplicate information in the database);
- it doesn't check whether the information is already in the database;
- it saves information without removing HTML tags;
- the DSN information should be kept in a separate file for security reasons;
- there is no user-friendly interface;
- errors might occur depending on the character encoding (encodings other than UTF-8; see the sketch below);
...

Yes, it is still incomplete.
We will deal with these problems next time.
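As a preview, the character-encoding problem can be reduced by converting the page source to UTF-8 before inserting it. A sketch, assuming the mbstring extension is installed:

// Convert the fetched source to UTF-8 before saving it
$source = $DocInfo->source;
$encoding = mb_detect_encoding($source, array('UTF-8', 'SJIS', 'EUC-JP', 'ISO-8859-1'), true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $source = mb_convert_encoding($source, 'UTF-8', $encoding);
}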
(By the way, to run the crawler automatically and periodically, you can use cron on CentOS. As long as the virtual server is running, cron can execute programs automatically.)







Tuesday, June 6, 2017

Modify phpcrawl and develop a new web crawler: editing code

Contents:

Preparation for coding:
1. Virtual box and Vagrant

2. Install Apache

3. Install MySQL

4. Install PHP

Edit crawler
1. Download and configure for PHPcrawl

2. Editing code of the crawler

3. Save fetched information in a database.

4. Sharing a local website inside a local network (optional)

5. User interface

What is a web crawler?

A web crawler is a program that crawls the internet and gathers information from it. For example, Google's crawler is a very famous one; it collects website information to be used for Google Search.

We will modify and improve PHPcrawl to learn how to develop a program with PHP.

Preparation


Open /etc/php.ini
# vi /etc/php.ini

And change the value of output_buffering to "off":
output_buffering = off


What is output buffering? - Stack Overflow
Turning off output_buffering is not essential, but we will turn it off for now (if it is on, we have to wait until the crawler finishes crawling before the full result is displayed; if it is off, the results are displayed one by one).
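To see the difference for yourself, a tiny test script like this (just an illustration, not part of the crawler) prints one line per second when output_buffering is off, but shows everything at once at the end when it is on:

<?php
for ($i = 1; $i <= 5; $i++) {
    echo "step " . $i . "<br />";
    flush();   // push the output to the browser immediately
    sleep(1);  // wait a second, like the crawler's request delay
}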

Glance at the code


If you look inside PHPCrawl, you can see that there are many classes. You might be intimidated, thinking you need to understand how all of these classes work, but you don't. While the crawler is running, these classes are used only when they are needed. All we need to do is write how the crawler should behave.

Many classes.

This is similar to how we can drive a car even though we don't know how each part of the car works; we only need to know some of the parts, not all of them.
What we will do is not design a new crawler from scratch, but modify an existing crawler and make it our own. (Actually, many creative products are built on open-source projects.)

Edit the code

First, double-click "example.php" in the "Projects" panel. Then example.php will show up in the code editor.


From this code editor, you can edit example.php's code. But first, let's see how it works. Click the debug button to start the crawler.


If you have checked "stop at first line", the debugger will stop at the first line. Before stepping forward, click "Window" in the menu.


Click "Debugging" => "Variables"


Then you can see which variables are used and what values they hold. It seems that only superglobals are used at the beginning. "Superglobals" refers to "super global variables" such as the $_POST and $_GET arrays. You don't need to care about them yet.



While the execution is suspended at the first line, the browser keeps displaying "Loading" in the tab. The content is, of course, blank, because no code that outputs HTML or text has been executed yet.
HTML is a markup language used to build static web pages. (Static web pages are ones like this blog, which just show words in the browser. Dynamic web pages are ones like Google, which dynamically does many things such as searching for a certain word... Scripting languages like PHP are used to build dynamic web pages.)


Press F7 on the keyboard or click the down arrow to step forward. If you want to jump to the end of the execution, click the green play button. You can see how the program is executed and which classes and methods are used. For now, we don't need to check the details, so just click the green button.



You will see the crawler fetch data page by page every 5 seconds (if not, output_buffering in php.ini may not be set to off):
You shouldn't decrease the delay between requests because, if it is too fast, it can put a heavy load on the web server. (It can even be regarded as an attack sometimes, so be careful.)

Maybe we also want megabytes in the result summary, not just bytes, for how much data was processed.
So we change the last part of the code of example.php like this:
echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
$byteData = $report->bytes_received;
$megabyteData = $byteData/1024/1024;
echo "Bytes received: ".$byteData." bytes".$lb;
echo "Megabytes received: ".$megabyteData." megabytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb; 


And you will see that the "megabytes" information has been added to the result summary:
As you can see, small changes like this are quite easy to accomplish. :)

Do you want to make the crawler fetch only PDF files? Such changes are even easier because they are supported by default (the original creator prepared such methods beforehand, so all we need to do is call them). But if you want changes that are not supported by default, that is where we need to develop things ourselves.
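For example, using only methods we have already seen above, a sketch that restricts the crawler to PDF documents could look like this (the regular expressions are just illustrative):

// Only receive content of files with content-type "application/pdf"
$crawler->addContentTypeReceiveRule("#application/pdf#");

// Don't even request images
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");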