Contents:
Preparation for coding:
1. Virtual box and Vagrant
2. Install Apache
3. Install MySQL
4. Install PHP
Edit crawler
1. Download and configure for PHPcrawl
2. Editing code of the crawler
3. Save fetched information in a database.
4. Sharing a local website inside a local network (optional)
5. User interface
PHPcrawl
Download PHPcrawl from here: https://sourceforge.net/projects/phpcrawl/
Then unzip it inside the shared folder of the virtual machine:
...so that we can access it from inside the virtual machine.
Netbeans
If you don't have Netbeans yet, download and install it from here:
https://netbeans.org/features/php/
I downloaded Netbeans version 8.2.
Run Netbeans as administrator, then choose File --> New Project.
PHP --> PHP Application with existing sources
Then Next.
Enter the URL of PHPCrawl on your Vagrant machine as "Project URL". Then choose "example.php" as the index file.
Then finish.
You have your open source project now.... :)
Now we will enable xdebug in Netbeans.
Right click on the project name and choose "Properties":
Choose "Run Configuration" and "Advanced".
Enter the project's path inside the virtual machine as "Server Path", and the project's path on your local PC as "Project Path", like below:
From the menu bar, choose "Tools" then "Options":
From "PHP", choose "Debugging", then check "Stop at First Line" then click "OK":
Add this to the code:
// Set delay time
$crawler->setRequestDelay(5.0);
This is because sending too many requests to a web server in too short a time can be regarded as a denial-of-service (DoS) attack. For example, Google's crawler is said to send a request every 15 seconds.
Then save.
Add some breakpoints and click the debug button in the menu bar. Your PHPCrawl project should then start.
If the lower right corner keeps saying "waiting for connection (netbeans-xdebug)" and never changes, some of your settings are probably wrong. Check that your php.ini is written correctly (especially your IP address) and that the firewall is not blocking the connection.
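To review the php.ini settings mentioned above, it can help to list only the active Xdebug lines. The following is a minimal sketch: the function name `show_xdebug_settings` is made up for this example, and the sample values assume Xdebug 2.x setting names (`xdebug.remote_enable`, `xdebug.remote_host`, `xdebug.remote_port`) together with the IP address and port that appear in this guide. Your own php.ini may differ.

```shell
#!/bin/sh
# Hypothetical helper: print the active (non-commented) lines of an ini file
# that mention xdebug. Lines starting with ';' or '#' are skipped.
show_xdebug_settings() {
    grep -E '^[[:space:]]*[^;#[:space:]]' "$1" | grep -i 'xdebug'
}

# Example with a throwaway file. The values are illustrative only
# (assumed Xdebug 2.x names; IP address and port as used in this guide).
sample_ini="$(mktemp)"
cat > "$sample_ini" <<'EOF'
; xdebug.remote_enable=0
xdebug.remote_enable=1
xdebug.remote_host=192.168.33.2
xdebug.remote_port=9000
EOF
show_xdebug_settings "$sample_ini"
rm -f "$sample_ini"
```

The example prints the three active lines and skips the commented one. On the virtual machine you would pass the path of your real php.ini instead (often /etc/php.ini, but that location is an assumption).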
The program is then executed step by step (press F7 to step forward). This way, you can see how the program runs and which code is used during execution. We will check how PHPCrawl is executed and where we can modify it.
Supplemental advice
By the way, if you want to stop only at your breakpoints, uncheck "Stop at First Line":
Then the execution will not stop every time you change the page in the browser.
If your xdebug is not working correctly, check your server log:
# less /var/log/httpd/error_log
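The server log can be long, so it may be quicker to look only at the most recent lines that mention Xdebug. This is a small sketch; the function name `recent_xdebug_errors` is hypothetical, and whether Xdebug problems are written to this log at all depends on your configuration.

```shell
#!/bin/sh
# Hypothetical helper: show the last N lines (default 20) of a log file
# that mention "xdebug", ignoring upper/lower case.
recent_xdebug_errors() {
    grep -i 'xdebug' "$1" | tail -n "${2:-20}"
}

# Usage on the virtual machine (log path as used in this guide):
#   recent_xdebug_errors /var/log/httpd/error_log
#   recent_xdebug_errors /var/log/httpd/error_log 5
```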
To check connections, use this command on Teraterm:
# netstat -an
After starting a debug session in Netbeans, run this netstat command in Teraterm. You can see the browser's HTTP connection to the web server on port 80:
tcp6 0 0 192.168.33.10:80 192.168.33.2:49901 ESTABLISHED
Xdebug on the server connects to Netbeans on port 9000:
tcp 0 0 192.168.33.10:52024 192.168.33.2:9000 ESTABLISHED
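In a long netstat listing, the debugger connection is easy to miss. The filter below is a minimal sketch that keeps only established connections whose remote port matches the debugging port. The function name `xdebug_connections` is made up for this example, and it assumes the default port 9000 and the netstat column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` style lines from standard input and
# print ESTABLISHED connections whose foreign (remote) address ends in the
# given port (default 9000).
# Assumed columns: proto recv-q send-q local-address foreign-address state
xdebug_connections() {
    port="${1:-9000}"
    awk -v port="$port" '
        $6 == "ESTABLISHED" {
            n = split($5, parts, ":")      # the port is the last ":" field
            if (parts[n] == port) print $4 " -> " $5
        }'
}

# Example using the two lines shown above as input:
xdebug_connections <<'EOF'
tcp6 0 0 192.168.33.10:80 192.168.33.2:49901 ESTABLISHED
tcp 0 0 192.168.33.10:52024 192.168.33.2:9000 ESTABLISHED
EOF
# prints: 192.168.33.10:52024 -> 192.168.33.2:9000
```

On the virtual machine you would run `netstat -an | xdebug_connections`. If nothing is printed while Netbeans is waiting for a connection, Xdebug has probably not reached Netbeans.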
The debugging port of netbeans can be changed here:
Diagram from Stack Overflow:
Citation from Stack Overflow (an answer from Linus Kleen): https://stackoverflow.com/questions/8049776/xdebug-for-remote-server-not-connecting
If the connection is not established, it is very likely being blocked by your anti-virus program or firewall. Check their settings and allow the connection, as it is a safe one.