Sunday, July 30, 2017

Download and configure PHPCrawl

Contents:

Preparation for coding:
1. VirtualBox and Vagrant

2. Install Apache

3. Install MySQL

4. Install PHP

Edit crawler
1. Download and configure PHPCrawl

2. Editing code of the crawler

3. Save fetched information in a database.

4. Sharing a local website inside a local network (optional)

5. User interface

PHPCrawl


Download PHPCrawl from here: https://sourceforge.net/projects/phpcrawl/
Then unzip it inside the shared folder of the virtual machine, so that we can access it from inside the virtual machine.

NetBeans


If you don't have NetBeans yet, download and install it from here:
https://netbeans.org/features/php/

I downloaded NetBeans version 8.2.

Run NetBeans as administrator, then choose File --> New Project.


Choose PHP --> PHP Application with Existing Sources, then click Next.



Choose your source folder (the folder that contains the PHPCrawl_083 source code) with "Browse" and choose PHP 7.0 for "PHP version". Then click Next.


Enter your Vagrant URL for PHPCrawl in "Project URL". Then choose "example.php" as the index file.
Then click Finish.


Now you have your open source project. :)


Now we will enable Xdebug in NetBeans.
Right-click the project name and choose "Properties":


Choose "Run Configuration" and "Advanced".


Enter the project's path inside the virtual machine as "Server Path" and its path on the local PC as "Project Path", like below:


From the menu bar, choose "Tools" then "Options":



From "PHP", choose "Debugging", then check "Stop at First Line" then click "OK":


Add this to the code:
// Set delay time
$crawler->setRequestDelay(5.0);
This is because sending too many requests to a web server in a short time can be regarded as a denial-of-service (DoS) attack. For example, Google's crawler is said to access a site about once every 15 seconds.


Then save.


Add some breakpoints and click the debug button in the toolbar. Your PHPCrawl project should then start.
If the lower right corner keeps saying "waiting for connection (netbeans-xdebug)" and never changes, one of your settings is probably wrong. Check that your php.ini is written correctly (especially your IP address) and that the firewall is not blocking the connection.

The program is then executed step by step (press F7 to step forward). In this way, you can see how the program runs and which code is used during execution. We will check how PHPCrawl is executed and where we can modify it.
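As a rough aid for the php.ini check, here is a minimal sketch that reports which remote debugging settings are missing from an ini file. It assumes Xdebug 2.x, where the settings are named xdebug.remote_enable, xdebug.remote_host and xdebug.remote_port (Xdebug 3 renamed them). The function name check_xdebug_ini is made up for this example, and /etc/php.ini is only an assumed location; "php --ini" shows which file is actually loaded.

```shell
# Print "missing: <key>" for every Xdebug 2.x remote debugging setting
# that has no active (uncommented) line in the given ini file.
# check_xdebug_ini is a hypothetical helper, not part of PHP or Xdebug.
check_xdebug_ini() {
    ini_file="$1"
    for key in xdebug.remote_enable xdebug.remote_host xdebug.remote_port; do
        # Lines starting with ";" are comments in php.ini and are not matched.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$ini_file"; then
            echo "missing: $key"
        fi
    done
}

# Usage on the virtual machine (the path is an assumption):
# check_xdebug_ini /etc/php.ini
```

No output means all three keys are present; it does not check that the values (for example the IP address of your PC) are right, so still read those lines yourself.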

Supplemental advice


By the way, if you want execution to stop only at your breakpoints, uncheck "Stop at First Line":



Then the execution will not stop at the first line every time you load a page in the browser.

If Xdebug is not working correctly, check your server log:
# less /var/log/httpd/error_log

To check connections, run this command in Tera Term:
# netstat -an

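After starting a debug session in NetBeans, run the netstat command in Tera Term. You can see the browser's connection to the web server on port 80 of the virtual machine:
tcp6       0      0 192.168.33.10:80       192.168.33.2:49901      ESTABLISHED

You can also see Xdebug's connection from the virtual machine to NetBeans, which listens on port 9000 of the PC:
tcp        0      0 192.168.33.10:52024    192.168.33.2:9000       ESTABLISHED

The debugging port of NetBeans can be changed in "Tools" --> "Options" --> "PHP" --> "Debugging":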
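The full netstat output is long, so here is a small sketch that keeps only established connections on port 80 or port 9000. The function name filter_debug_connections is made up for this example; the ports are the ones shown in the output above and would need changing if you use other ports.

```shell
# Read "netstat -an" output from stdin and keep only ESTABLISHED lines
# whose local or remote address ends in port 80 or port 9000.
# filter_debug_connections is a hypothetical helper name.
filter_debug_connections() {
    grep 'ESTABLISHED' | grep -E ':(80|9000)[[:space:]]'
}

# Usage on the virtual machine:
# netstat -an | filter_debug_connections
```

If the port 9000 line never appears while NetBeans is waiting for a connection, Xdebug is probably not reaching the PC.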

Diagram from Stack Overflow (an answer by Linus Kleen): https://stackoverflow.com/questions/8049776/xdebug-for-remote-server-not-connecting

If the connection is not established, it is quite likely that it is being blocked by your anti-virus program or firewall. Check the settings of your anti-virus program or firewall and allow the connection, as it is a safe one.
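One way to test this from inside the virtual machine is to try opening a TCP connection to the debugger port on the PC. The sketch below uses bash's /dev/tcp redirection together with the timeout command, so it assumes both are installed on the virtual machine. The function name can_reach is made up for this example, and the address 192.168.33.2 and port 9000 are taken from the netstat output earlier; yours may differ.

```shell
# Return 0 if a TCP connection to host $1, port $2 can be opened
# within 3 seconds, non-zero otherwise.
# can_reach is a hypothetical helper; it relies on bash's /dev/tcp feature.
can_reach() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Usage on the virtual machine, while NetBeans is waiting for a connection:
# if can_reach 192.168.33.2 9000; then
#     echo "port 9000 is reachable"
# else
#     echo "port 9000 is blocked, or NetBeans is not listening"
# fi
```

A failure here only tells you that the port could not be reached; it cannot tell a firewall block apart from NetBeans simply not having a debug session started.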