Documentation

General Description

The project repository is at https://gitlab.com/Axghouse/axghouse. Before cloning, check that it is up to date with the copy deployed on the server. The system consists of the following main components:

  1. Supervisor - used to run the project's background workers; supervisorctl status shows all running workers, and supervisorctl restart all restarts them.
    Supervisor conf files are located in /etc/supervisor/conf.d
  2. PHP scripts from the application/crons folder are launched from cron; these scripts are described below.
  3. Python scripts for adding links, checking deleted content and email search live in the /root/pythons folder.
  4. The backend is a MySQL database with max connections set to 150.
Background scripts
The schedule of scripts can be seen by running crontab -e
Scripts (the script path /var/www/html/app/crons is omitted below):

 

Operating mode - Script name - Description
Hourly - cron_add_pirate.sh - Creates search engine tasks for projects by phrases
Daily - cron_all - Clears old records and links
Once a day at 11pm - cron_not_deleted_notification.sh - Sends a report to the user about not deleted links
Every 2 hours - cron_delete_content_detect.sh - Checks for deleted content
Every midnight - cron_check_expired_users.sh - Checks users for expiration dates
Hourly - cron_create_project_schedule.sh - Adds schedules for recently added projects
Daily - cron_check_disabled_project.sh - Checks projects that have expired and disables them
Constant - cron_mail_send.sh - Sends takedown notifications
Constant - cron_cloudflare_send.sh - Sends Cloudflare forms

 

 

Python Scripts

These are located in the /root/pythons folder.
update_website_removal_time.py - Runs every 4 hours. Updates the website removal time for websites whose WRT is not >14 days.
The formula for calculating the website removal time is as follows:
  • Select the Deleted or Not Deleted links for the website from the last 3 months for which notifications have been sent.
  • If no links are found, select the last 100 Deleted or Not Deleted links for the website for which notifications have been sent.
    These links can be in either the reference table or the reference_backup table.
  • For each of the links, calculate the removal time. The removal time is the time between when notifications were
    sent and when the link was updated to Deleted.
  • Take the average of the removal times for all found links.
  • Set this average as the Website Removal Time (WRT).
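A minimal sketch of the averaging step, assuming the notification and deletion timestamps have already been loaded from the reference/reference_backup tables (the field names used here are illustrative, not the actual schema):

```python
from datetime import datetime, timedelta

def website_removal_time(links):
    """Average removal time for a list of links.

    Each link is a dict with hypothetical keys 'notified_at' and 'deleted_at'
    (datetime objects); links that were never deleted contribute nothing.
    """
    deltas = [
        link["deleted_at"] - link["notified_at"]
        for link in links
        if link.get("deleted_at")
    ]
    if not deltas:
        return None  # no usable links -> WRT stays "no data"
    return sum(deltas, timedelta()) / len(deltas)

# Example: two links removed after 2 and 4 days -> WRT of 3 days
links = [
    {"notified_at": datetime(2024, 1, 1), "deleted_at": datetime(2024, 1, 3)},
    {"notified_at": datetime(2024, 1, 1), "deleted_at": datetime(2024, 1, 5)},
]
print(website_removal_time(links))  # 3 days, 0:00:00
```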
run_update_ip.py - Runs every hour. Checks for websites that have updated IPs.
run_add_axgbot.py - Final confirmation for full website search links. Runs every 31 minutes.
run_add_mirrors.py - Runs every minute to add mirror links from manually added links.
send_mail.py - Sends email for websites that become >14 days.
send_mail_two.py - Sends email for websites whose removal time is about to increase and that have pending links.
run_fws.py - Creates FWS tasks spread across all participating servers.
update_removal_time.py - Updates the website removal time for websites that have recently deleted links.
delete_user_projects.py - Deletes projects for users that have been disabled.
check-supervisor.py - Ensures that Supervisor is running at all times.
Update IP
run_update_ip.py -- This adds IPs to newly added websites. After the IP is found, the website is queued for adding a hosting email. The script uses Python's built-in socket.gethostbyname function.
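A minimal sketch of the resolution step, assuming the list of website hostnames has already been loaded from the database:

```python
import socket

def resolve_ip(hostname):
    """Resolve a website hostname to its current IPv4 address."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None  # DNS lookup failed; leave the stored IP unchanged

print(resolve_ip("example.com"))
```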

System Logs

All logs are in application/logs folder
Cron scripts write to logs of the same name. To search for a phrase in all logs: grep -rnw 'application/logs' -e 'phrase'. A background script cleans these logs once a week.

Users

Create new users
https://axg.house/users/ Enter the fields (* - required):
  • email *; optionally enable sending copies of all notifications to a second email and enter that second email
  • username * and surname
  • the copyright holder name used for all projects
  • the available content types
  • the search engines available to the user
  • whether to hide the schedule from the user and configure it in the settings
  • phone number, password *
  • the user role (admin, manager, guest)
  • the account expiration date
  • the projects available to managers and guests
Once the user expiration date is set, the crons/cron_check_expired_users script is responsible for enforcing it.
Personal user settings
https://axg.house/account/

Change the name and surname, enter the SMTP mail connection details, enter a signature (applied only when using your personal templates), and change the password.
Edit user data
Invoices management
Change password and delete user account

Projects

Creating Projects
https://axg.house/project

Main tab - visibility for the manager, selecting the content type from the available ones, entering a title, entering values for Author, Year or Artist (these fields vary depending on the selected content type), inserting links to official web resources, as well as a power of attorney.

Search keywords - when you open the tab, keywords are automatically generated from the Title field,
combined with the Author / Year / Artist value if one is present.
Key phrases are also generated from the Translator of copyrighted work field in conjunction with the Title,
provided that in the System -> Search keywords section the translator field is marked with "+" in the keyword phrase.
Content type - affects the field name Author / Year / Artist.
Schedule - scan schedule, uses server time (Germany, UTC+2)
Whitelist - Whitelist for links
Document - Download Files

When you save the project for the first time, a search is automatically launched for all search engines available to the user. After the first save, the RUN NOW button appears - start the search now.

Hitting the RUN SEARCH button starts a search for the selected project across all search engines. We have four search engines (google, bing, yandex, axgbot).
Hitting RUN FWS will run full website search immediately (using search engine axgbot).
Tasks for the engine axgbot are picked by other servers. The servers are: 212.83.171.22:32004 web1
212.83.171.22:32008 web2
212.83.171.22:32012 web3
212.83.171.22:32016 web4
212.83.171.22:32020 web5
212.83.162.31:32024 web6

Search engines use proxies located at /opt/aparser/files/proxy.txt

The other servers are: 135.181.199.23 --Links verifier server (web10)
37.27.2.94 --i.house

 

Disabling projects based on expiration date - Each day, all projects are checked against their expiration dates.
The function that performs this is located at cron/cron_not_deleted_notification.
If the project expiration date is in the past, the project is disabled and won't scan for links.

For each project we can manually add a list of official links. For many others, we scan the official links from Amazon. Scanning is done via a Python script that is manually run from web3 whenever there are new projects.

When a project is deleted, all its associated links also get deleted.

Project audit tool. This tool was added to track the progress of project scanning. It shows the last link added, when it was added, and reasons why the project may not be scanning.

Websites

https://axg.house/website

When you add a new site, it is automatically created in the Websites section; this starts the search for an email inside the site and for the hoster's email based on the whois database via https://search.arin.net/rdap

IP address update:
We check all websites once every 24 hours for new IPs.
If a new IP is found, it is set as the current website IP and all previously found IPs are logged.
This is done with PHP's standard gethostbyname method.
Skip links scanning - sublinks from all links of this website are not added. Links are also not checked for upload links.
Wait for content loading - content will be fetched using Selenium, after a wait (20 sec by default).
Each website also has a RUN NOW button. Clicking it starts a full website search on the selected website immediately. This happens ONLY when the full website search checkbox is checked; if not, full website search is not carried out on the website.
Clicking RUN FWS will run an FWS search at normal speed.
We also have a search method for each website. This method is manually changed. Search bots use the value of this method to determine how the website is searched.

 

The search algorithm
Search engines are given search codes:
Google search - 0, Bing - 6, Yandex - 5, Axgbot - 8
The anchor is compared with the search phrase using an Apache Solr score greater than 75%, or max_score > 0.4 for a full-term search.
Checks: title in the anchor, title in the page content, phrases in the page content.
For Google, a check for the pirate content type in the page content; when the content type checkbox is on, a check for the specified content type field name and value in the anchor; checks against the global whitelist and the project whitelist; and a check for stop words (deleted at the request of the copyright holder, etc.).
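A minimal sketch of the Solr comparison step, assuming anchors are indexed in a local Solr core as described in the Apache Solr section below (the core and field names here are illustrative):

```python
import requests

SOLR_URL = "http://127.0.0.1:8983/solr/search_engines/select"  # core name assumed

def anchor_matches(phrase, threshold=0.4):
    """Query Solr for the search phrase and compare the returned maxScore
    against the full-term-search threshold."""
    params = {
        "q": f'anchor_text:"{phrase}"',  # field name is illustrative
        "fl": "score",
        "rows": 1,
        "wt": "json",
    }
    data = requests.get(SOLR_URL, params=params, timeout=10).json()
    max_score = data.get("response", {}).get("maxScore", 0.0)
    return max_score > threshold
```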
Adding Links (General Algorithm)
  1. Script cron_create_tasks - generates and sends tasks to the aparser for search, taking into account the search engines available to the project owner
  2. The tasks are then moved to axgHouseTest server and distributed among the 4 extra servers.
  3. Each of the 4 extra servers fetches the tasks assigned to it and launches a search for each
  4. Each of the found links undergoes primary filtering based on the anchor - by title or title_eng and phrase. All links that pass this test are stored for final verification
  5. WorkerAddPirate (job-add-pirate.py) - checks the link for the title in the content. If the check is successful, the link is added to the database and the fake check starts
Adding Upload Links / Sublinks
  1. Upload hosts
  2. Extract links from the found page and select them by domain from the Upload hosts database
  3. By title and translator
  4. Using the links added by title, check for the title in the anchor and in the content of the page, and check for stop words (from the group).
    When the content type checkbox is on, check the value of the "Specified content type field name" in the anchor and against the added links.

For some websites, we need to check project titles/authors in particular locations on the page. We have developed a custom method for checking these particular locations, with customized Solr scores for each.
Solr scores are used as follows:
For the specific locations where the title is checked - a Solr score of 3 and above is needed.
For the rest of the links - a Solr score of 4.2 and above is needed.
An example website for which this was developed is libgen.

 

Website audit tool. This tool was added to track the progress of scanning websites using full website search. It shows the last link added, when it was added, and reasons why the website may not be scanning.

Analytics

https://axg.house/analytics

Displays links in the below categories

  1. Grouped by date added
  2. Grouped by search engines.
  3. Links categorised as torrents, free downloads, messengers, fake sites, cyber lockers, social networks, or link shorteners

 

Email Templates

https://axg.house/email_template

Shows all email templates used by the system.
These templates are used in:

  • Email notifications for takedown
  • Account Registration / User Deletion
  • Adding users to projects
  • Project Creation / Expiration / Deletion

 

Check Status

Checks the link for the presence of stop words from the Axghouse group after notifications have been sent.
The check is dependent on the removal time of the associated website. Check duration as below:

  1. If website removal time is '>14days' , do not check status.
  2. If website removal time is 'no data', then we check the link once every 2 days.
  3. Otherwise, check the link after 8 hours
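A minimal sketch of how the check interval follows the website removal time, with the WRT represented as a simple string as it appears in the UI:

```python
from datetime import timedelta

def check_interval(website_removal_time):
    """Return how long to wait before (re)checking a link, or None to skip it."""
    if website_removal_time == ">14days":
        return None                   # do not check status at all
    if website_removal_time == "no data":
        return timedelta(days=2)      # check once every 2 days
    return timedelta(hours=8)         # otherwise check after 8 hours
```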
If stop words are not found:
If the link was added from either full website search or search engines, then we check for the absence of a title/translator.
Statuses that lead to changing to Deleted (see the sketch below):
  1. Presence of stop words
  2. Absence of title/translator
  3. 404, 410, 451 status codes
  4. 403 (non-Cloudflare) code
  5. 302, 301 redirecting to the homepage
The following HTTP codes are referred to as skip codes. Links with such status codes are not changed to Deleted:
400, 401, 402, 407, 429, 501, 503, 504, 596
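A minimal sketch of this decision, assuming the page has already been fetched and checked for stop words and title (the flag names are illustrative):

```python
DELETE_CODES = {404, 410, 451}
SKIP_CODES = {400, 401, 402, 407, 429, 501, 503, 504, 596}

def should_mark_deleted(status_code, has_stop_words, title_found,
                        is_cloudflare, redirected_to_homepage):
    """True if the link meets any of the 'Deleted' criteria listed above."""
    if status_code in SKIP_CODES:
        return False                              # skip codes never lead to Deleted
    if has_stop_words or not title_found:
        return True                               # stop words, or missing title/translator
    if status_code in DELETE_CODES:
        return True
    if status_code == 403 and not is_cloudflare:
        return True                               # 403 counts only when it is not Cloudflare
    if status_code in (301, 302) and redirected_to_homepage:
        return True
    return False
```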

If scraper returns 403 cloudflare, the link content is fetched using selenium. Selenium is able to bypass protection from cloudflare.

For links on which we detect deleted criteria, we confirm one more time after 2 hours.
For sites protected by Cloudflare, we usually get statuses 403 or 410. We then use Selenium to bypass the Cloudflare protection. If Selenium doesn't bypass it, we use Playwright.
In some cases we cannot bypass cloudflare. Hence, in such cases, we don’t change status to Deleted.

Check status lasts 15 days from the time notifications were sent. If during the first 14 days we cannot find deleted criteria, we change the link back from Pending to Not Deleted

Check status can also be manually triggered by clicking the info icon that appears alongside the status of each link on this page https://axg.house/pirate
This only happens if the link status is Pending

When a project goes to Disabled all Pending links for the project go to Not Deleted if they do not meet any criteria.

 

 

Check status script
We have a script named check_deleted_microservice.py that runs on web1, web2, web3, web4, web5, web7 under the /home/moses/scripts folder.
This script runs on cron on each of the mentioned servers.
The script first fetches content using a custom scraper. If stop words are detected, the link is confirmed by fetching the content again using Selenium.
Fetching content with either Selenium or Playwright sets a Google referrer URL: https://www.google.com/url?q=. The script then follows the link displayed on that URL page.

We have set a specific way to fetch content for some websites. This can either be selenium or playwright. With this method, we fetch content for the website using the set method. Some websites require a longer delay because of longer content loading. We found out that a 20 second delay works okay.

Check status can also be manually triggered; when this is done, the check status is done via a flask app that runs on web4.
We can also manually change the status of a link from the Pirate page. Changing a link status to either Deleted or Not Deleted triggers recalculation of Website Removal Time for the website to which the link belongs.
For links on social media, we have developed a way to first log into the social media account using cookies then fetch content from the link. For these, we have cookies login for:

Facebook
Instagram
Tiktok

We also have cookies login for other platforms like Reddit.
The age of the cookies is determined by a script that logs into each social media account every 4 hours. If the cookies expire, an email is triggered asking for a manual update.
Check status uses proxies. We fetch proxies directly from webshare,
https://proxy.webshare.io/api/proxy/list
Sometimes, due to frequent submissions, webshare can throttle our requests. In such cases, we wait a few seconds then retry.
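A minimal sketch of the fetch-and-retry behaviour, assuming a Webshare API token is available (the retry delays and response fields are illustrative):

```python
import time
import requests

PROXY_LIST_URL = "https://proxy.webshare.io/api/proxy/list"

def fetch_proxies(token, retries=5, wait_seconds=5):
    """Fetch the proxy list from Webshare, retrying when throttled (HTTP 429)."""
    headers = {"Authorization": f"Token {token}"}
    for _ in range(retries):
        resp = requests.get(PROXY_LIST_URL, headers=headers, timeout=30)
        if resp.status_code == 429:       # throttled: wait a few seconds, then retry
            time.sleep(wait_seconds)
            continue
        resp.raise_for_status()
        return resp.json().get("results", [])
    raise RuntimeError("Webshare kept throttling the proxy list request")
```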

 

Google Form

Sent within 1 minute after takedown date appears.
We send links for up to 70 projects with the same publisher.

Algorithm
As soon as notifications are sent, tasks are created for google form. The tasks are stored in the table google_form
A script hosted on web8 named google_dmca_two.py runs all the time checking if there are any tasks in the table.
If there are any tasks, they are grouped by publisher and sent to Google
The tasks are sent to form url - https://reportcontent.google.com/forms/dmca_search
Sending is done using python playwright
Process
Script begins by grouping tasks into related projects. Each project is allowed a maximum of 1000 links.
It then groups the projects into related publishers with each publisher containing a maximum of 70 projects.
We then create a text file of the links to upload to google.
Links are uploaded (sent to google) and the form saved in the database.
Response
When successfully sent to google, we get a report id that we can use to query the status of the form.
If the status is Complete, then we update all sent links in the database, setting the field google_sent = 1
If the status is Error, we write an error message to the database

Google form works on web8 (212.83.171.22:32034) via the python script google_dmca_two.py
Google does not accept non-UTF-8 character sets. We have to convert all non-UTF-8 characters in all links to their UTF-8 equivalents. All conversions are in the Google form script google_dmca_two.py
Google script and bing script generate a lot of temp files from playwright. These files need constant removal from the directory as they accumulate pretty quickly. We have a script set to run every 10 min cleaning up the temp folders.
Sometimes, the Google script hangs and fails to send subsequent forms. In such cases, we have to manually kill all Google form processes and then restart the Google script.
We have a script named kill_google_form.py that achieves this. It looks at when the last Google form was sent, how many links it contained and whether there are tasks for Google in the queue.
If the last form was sent more than 30 minutes ago and there are tasks in the queue, then we assume the Google script has hung.
  • If the last form contained fewer than 10k links, restart the Google script 30 min after the last form was sent
  • If the last form contained more than 10k links but fewer than 30k links, restart Google after 45 min
  • If the last form contained more than 30k links, restart Google after one hour.
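A minimal sketch of the hang-detection rule in kill_google_form.py, assuming the last-sent timestamp, link count and queue size have already been read from the database:

```python
from datetime import datetime, timedelta

def restart_delay(links_in_last_form):
    """Restart delay after the last form was sent, based on its size."""
    if links_in_last_form < 10_000:
        return timedelta(minutes=30)
    if links_in_last_form < 30_000:
        return timedelta(minutes=45)
    return timedelta(hours=1)

def google_has_hung(last_sent_at, links_in_last_form, tasks_in_queue, now=None):
    """True when there are queued tasks but no form has been sent recently enough."""
    now = now or datetime.now()
    if tasks_in_queue == 0:
        return False
    return now - last_sent_at > restart_delay(links_in_last_form)
```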

 

Blogspot Form

Sent within 1 minute after takedown date appears.
We send links for up to 10 projects with the same publisher.

Algorithm
Projects are grouped by publisher.
For each project, we get the links not yet sent to the DMCA form.
Then we create a Data Form with max 10 groups (one per project) and max 1000 links total.
We use Python Playwright to send to the URL - https://reportcontent.google.com/forms/dmca_blogger
The page presents a captcha, which we solve using https://anti-captcha.com. The Python script is on web8 and is named blogspot.py
Response
When we get a response from Google, we parse it and determine the status (Complete or Error). If the status is Complete, then we update all sent links in the database, setting the field google_sent = 1
If the status is Error, we write an error message to the database
Blogspot form is currently stopped but is supposed to work from web8

 

Bing Form

Sent in a similar manner as Google Form.
We send links per project to the bing DMCA.

Algorithm
As soon as notifications are sent, tasks are created for bing form. The tasks are stored in the table google_form_bing
A script hosted on web8 named bing_dmca.py runs all the time checking if there are any tasks in the table.
Bing form URL - https://www.bing.com/webmaster/tools/contentremovalform
Sending is done using Python Playwright. We first need to log into Bing using cookies stored on the server.
Sometimes the cookies expire and we have to manually log into Bing and fetch fresh cookies
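A minimal sketch of reusing stored cookies with Playwright, assuming they were previously exported to a storage-state JSON file (the file path is illustrative):

```python
from playwright.sync_api import sync_playwright

STORAGE_STATE = "/home/moses/cookies/bing_state.json"  # illustrative path

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Load the saved cookies/local storage so the session starts already logged in
    context = browser.new_context(storage_state=STORAGE_STATE)
    page = context.new_page()
    page.goto("https://www.bing.com/webmaster/tools/contentremovalform")
    # ...fill in and submit the removal form here...
    browser.close()
```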
Process
Script begins by grouping links into related projects. Each project is allowed a maximum of 1000 links.
We then create a text file of the links to upload to bing.
Links are uploaded and the form results saved in the database.
Response
When the form is successfully sent, we get a ticket id that we can use to query the status of the form.
If we get an error, we write an error message to the database

Bing form works on web8 (212.83.171.22:32034) via the python script bing_dmca_three.py
After successfully sending the form, we update the number of forms sent by this user.

 

Content Types

 

Fields:
Google keywords (for pirate detection) - Specifies a list of words (separated by commas) that must be present on the page along with the Title of the project; if left blank, pages are not added.
Specified content type field name - changes the name of the Author field in the project to the desired one (does not affect the search).
Check Specified content type field on page - whether the presence of the Author field on pages will be checked.
Swap project keywords - changes the order in which the project's search keywords are formed.
Stop words - we check for these words in the title of each link's content; if found, the link is not considered a pirate one.
Screenshot - if this is set, we fetch screenshots for links of this content type.

 

Proxies

We use proxies provided by Webshare. Proxies are used by FWS, the search engines, check status (all forms), the test tool and link verification.

The proxies rotate monthly, between the 19th and 24th of each month. Sometimes, Webshare simply replaces certain proxies at any given time.
After rotation, once the new proxies are provisioned, we have to download the new list and save it on all servers where proxies are required, in the file /opt/aparser/files/proxy.txt.
Download is done every hour.

Proxy API host: https://proxy.webshare.io/api/v2/proxy/list/?mode=direct
Proxy authorization token s7t89waym9igp51mxq0i3el4ac85qd2d5jfp5xqe
When scripts run and need to use proxies, they read them directly from Webshare. Search engines using a-parser also use proxies from Webshare; a-parser cannot read directly from Webshare, so we download the proxies and store them in files in the a-parser directory where it can fetch them directly.
In addition we update all proxy locations using this service https://ipapi.co
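A minimal sketch of the hourly download for a-parser, assuming the token above and the v2 list endpoint (the response field names are illustrative):

```python
import requests

API_URL = "https://proxy.webshare.io/api/v2/proxy/list/?mode=direct"
OUT_FILE = "/opt/aparser/files/proxy.txt"

def download_proxy_list(token):
    """Download the proxy list from Webshare and write it as host:port:user:pass lines."""
    headers = {"Authorization": f"Token {token}"}
    proxies = requests.get(API_URL, headers=headers, timeout=30).json()["results"]
    with open(OUT_FILE, "w") as fh:
        for p in proxies:
            fh.write(f"{p['proxy_address']}:{p['port']}:{p['username']}:{p['password']}\n")

def proxy_location(ip):
    """Look up a proxy's country via the ipapi.co plain-text endpoint."""
    return requests.get(f"https://ipapi.co/{ip}/country_name/", timeout=10).text
```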
All scripts using proxies include

  • Full website search scripts -- fws_solr.py, fetch_links, fws_solr_sublinks.py
  • Check status scripts --check_deleted_microservice.py
  • Search Engines --search_engines.py, job_add_pirate.py

 

System

User Group - by default, words from the 7th group are used in the search
Content types - content types that are selected in the project
Search keywords - contain templates for generating search phrases for each type of content
Email templates - templates for letters; they are formed and stored as files
Not deleted notification - a template for sending notifications about not deleted links
White List - sites from this list are not added; all sites are listed without www
Fake sites - such sites are not searched for links and no complaints are sent to them. The list is replenished automatically if links to other fakes are found on a page, or a redirect to a fake is detected; these entries are marked with the Auto label
Fakesite white - fake site exceptions
Deleted links - basket for deleted links
Link shorteners - URL shortening services; such links must be checked for redirects
System Log - contains a log about some user actions
Upload hostings - a list of hosts that are considered file services; such links are added as Uploads and the title is not checked there. For such links there is also a section (button) Edit blackwords list - a list of strings that should not appear in the link (to exclude pictures, scripts, etc.)

Full Web Search

We have 6 servers involved in full web search.
Five servers (web1, web2, web3, web4, web5) are used for content type id 31 projects, while one server (web10) is used for projects of the remaining content types. Each of the 6 servers has Apache Solr and a-parser running; Solr is available at http://127.0.0.1:8983
Website search is run via three methods:
Run button on each website:
When clicked, it goes through all projects with content type 31 and sets up search tasks for each project on the selected website.
The tasks are then distributed equally among the 5 servers based on the content type of the website.
A script has been developed that inserts the tasks into each server with proper distribution, so that no server has more tasks than the others.

Project run button:
This button runs both search engine tasks and website tasks for the selected project.
For website tasks, it goes through all websites of content type 31 and equally distributes search tasks among all servers.

Content type run button:
This button runs all projects assigned to the content type. Each content type has specific servers its tasks run on.
Content type ID 6: web10
All other content types: the rest of the servers
After tasks are created, we fetch links using fetch_links.py
Then these links are checked by a C++ executable named fws_new that is present on all servers.

Each server has a copy of the main database which is updated every 3 hours.
We run searches on websites or projects. The main server prepares the tasks, assigning each server its own list of tasks. Each server then picks its assigned tasks and begins the website search.
fetch_links.py - located on each server, it begins extracting links from the task
This extraction is done using either:
scraper
playwright
selenium
cloudscraper (https://pypi.org/project/cloudscraper/) .This is used where the above 3 methods cannot bypass cloudflare
We fetch content via the scraper, Selenium or Playwright, and links are then extracted using bs4.
Each link is checked using Apache Solr. If links meet the criteria, they are inserted into the main server's database for final checks and then added to the system.
The criteria checked include stop words, project stop words and title/translator.
For some links we extract upload/download links from specific locations on the page. Information about the exact location to extract from is hard-coded in the scripts. The main scripts involved in checking links are /root/moses/fws_solr.py and /root/moses/fws_solr_sublinks.py.

Before we add any links, we delete any links containing these search URLs (see the sketch after this list).

  • search.php
  • s/?q=
  • .onion
  • /searchrss/
  • .json
  • page=
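A minimal sketch of the extraction and filtering step, assuming the page HTML has already been fetched by the scraper, Selenium or Playwright:

```python
from bs4 import BeautifulSoup

SEARCH_URL_PATTERNS = ["search.php", "s/?q=", ".onion", "/searchrss/", ".json", "page="]

def extract_candidate_links(html):
    """Extract hrefs with bs4 and drop any link containing a search-URL pattern."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if any(pattern in href for pattern in SEARCH_URL_PATTERNS):
            continue  # search pages, feeds, .onion links etc. are never added
        links.append((href, a.get_text(strip=True)))  # keep anchor text for the Solr checks
    return links
```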

 

For all links added we extract mirrors. Mirrors are links on similar websites which have already been grouped under the mirrors website.
The extracted mirrors are checked for existence before being added to the main queue for final checks.
The script for extracting mirrors is named transfer_links_four.py and works in multithreaded mode.

The servers process FWS tasks at varying speeds; some servers are faster than others. When some servers run out of tasks, tasks need to be moved from the other servers to the ones without tasks.
We have an algorithm to balance tasks that works as follows (the script name is balance-tasks.py):
It runs on web1 and moves 30% of the tasks from the server with the most tasks to the server with the fewest tasks.
This happens only if the server with the most tasks has more than 10 tasks.
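A minimal sketch of the balancing rule, assuming the per-server task counts have already been queried (the data structures are illustrative):

```python
def plan_rebalance(task_counts, threshold=10, fraction=0.30):
    """Return (source, target, how_many) or None if no move is needed.

    task_counts maps server name -> number of pending FWS tasks.
    """
    busiest = max(task_counts, key=task_counts.get)
    idlest = min(task_counts, key=task_counts.get)
    if task_counts[busiest] <= threshold or busiest == idlest:
        return None                      # nothing worth balancing
    to_move = int(task_counts[busiest] * fraction)
    return busiest, idlest, to_move

# Example: web2 is overloaded, web5 is idle -> move 30% of web2's tasks to web5
print(plan_rebalance({"web1": 40, "web2": 200, "web3": 55, "web4": 12, "web5": 0}))
```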

When we scan by FWS or check redirecting links, we add the parent website. For example, when we scan website A or check a redirect, we must add only website A even if the URL changed; if website A redirects to website B, we do not add website B.

Scraper (FWS and Uploads)

Content for full website search is fetched using a scraper built in C++ programming language.
Detailed Documentation for the Script
Overview
The script is a comprehensive C++ program designed to manage and process web scraping tasks. It integrates multiple functionalities including web requests, proxy management, MySQL database interaction, and data cleaning. Below is a detailed breakdown of its structure and functionality.
Key Functionalities
- Web Scraping: Using cURL to send HTTP requests and retrieve webpage content.
- Proxy Handling: Reads proxy details from a file and integrates them into web requests.
- User-Agent Randomization: Dynamically fetches and utilizes different user-agents to mimic browser behavior.
- Data Cleaning: Removes unwanted characters, scripts, and HTML tags from scraped content.
- Database Interaction: Connects to a MySQL database to fetch and update data related to projects and web pages.
- Multithreading: Uses std::async to process multiple web pages concurrently.
- Error Handling: Manages errors in database queries, file operations, and network requests.
Dependencies
The program relies on the following libraries:
  1. Standard libraries - iostream, string, vector, fstream, regex, future, unordered_map, algorithm, ctime
  2. Third-party libraries - curl/curl.h (for handling HTTP requests), mysql/mysql.h (for MySQL database connectivity)
  3. External files - search_engine.cpp (the main search engine), plus proxy and user-agent files (required to simulate realistic browsing behavior)

 

Program Workflow
  1. Initialization: connects to the MySQL database, reads proxies and user-agents into memory, and sets a limit on the number of projects or links to process.
  2. Data Fetching: fetches project details and URLs to process from the database.
  3. Data Processing: for each project, cleans and parses webpage content and filters out unwanted data based on predefined rules (e.g., stop words).
  4. Data Validation: ensures retrieved content meets criteria (e.g., length, HTTP response codes) and performs additional checks like title and author verification.
  5. Database Updates: updates or deletes records based on the processing results, and inserts valid links into the database for further use.
  6. Multithreading: uses asynchronous tasks to parallelize the processing of multiple projects.

 

Major Functions and Their Roles
  1. get_user_agent() - User agents are downloaded via a NodeJS package named user-agents. We feed the script the type of user agent we need (Mobile, Desktop or browser type, e.g. Chrome, Firefox) and it returns a fresh user agent. This NodeJS package is installed on all participating servers and the user agent is fetched by a script user_agents.js.
  2. get_proxies() - Reads proxy addresses from a file and stores them in a vector.
  3. replaceAll() - Removes all occurrences of specific substrings within given delimiters.
  4. clean_content() - Removes unwanted characters from strings.
  5. charReplacements() - Replaces accented characters with plain equivalents.
  6. replaceHTMLENTITIES() - Replaces HTML entities (e.g., `&eacute;`) with their corresponding characters.
  7. getContents(string &response) - Strips all HTML tags from a webpage's content.
Uploads
  1. Initialization - Retrieves user-group keywords, proxies, and blacklists; selects random projects and their associated URLs from the database.
  2. Content Processing - Fetches web content for each URL, applies stop word and blacklist filtering, and updates the database with the results.
  3. Multithreading - Processes multiple projects simultaneously for efficiency.
  4. Cleanup - Frees resources and closes database connections.
  5. Execution - Compile the script using a C++ compiler with the necessary library flags: g++ -o scraper add_uploads.cpp -lcurl -lmysqlclient -lpthread, then run the executable: ./scraper

Email Notifications

Email notifications are sent for all new links.
The notifications are divided into:
Admin email notifications - we notify the admin of the websites on which the links have been found.
Hosting email notifications - we search for the hosting email of the website and notify them to take down the link.
If the website is hosted on Cloudflare, we send a separate Cloudflare email.

Hosting Email

Website IPs keep changing. Every 24 hours the system checks the current website IP. The Python script is run_update_ip.py, located in /root/pythons.
If the current IP is different from the previously found IP, the IP is updated but the previous one is cached for reference.
Websites without hosting emails are also checked and updated: when the IP change runs, we need to check the website for the new hosting email.

RIPE Database search
The main search URL is https://rdap.db.ripe.net/ip/ This function runs on web3
The python script is run_update_hosting_email.py
The Python script uses a-parser with the Shop::Amazon parser.
Sometimes during the search we encounter issues with a-parser; if this happens, the task is skipped and rechecked later. During the search on the above URL we can find several emails, but we need the one that contains the term 'abuse' or other given keywords. If two or more emails contain the given keywords, we pick the first one found.
We run 100 websites in parallel.
We also have another script that runs hosting email for newly added websites on web3
Algorithm
  1. If only one email found, just add it
  2. If more than one email found:
    • If any of the emails contains the word 'abuse' or variants of the word, add it
    • If none contains the word 'abuse', then search by the list of keywords
    • If more than one email contains a keyword, add the first email found.
Sometimes the service redirects between three urls
  • https://rdap.db.ripe.net
  • https://rdap.lacnic.net/rdap
  • https://rdap.arin.net/registry
In each case, follow the redirect and fetch the content.
We use python requests package to fetch content with follow redirects set to True
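A minimal sketch of the RDAP fetch and email selection, assuming the abuse keyword list is configurable; the parsing here just scans the returned JSON text for email addresses rather than walking the full RDAP schema:

```python
import re
import requests

RDAP_URL = "https://rdap.db.ripe.net/ip/"

def hosting_email(ip, keywords=("abuse",)):
    """Fetch RDAP data for an IP (following redirects to LACNIC/ARIN) and pick an email."""
    resp = requests.get(RDAP_URL + ip, allow_redirects=True, timeout=30)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", resp.text)
    if len(emails) == 1:
        return emails[0]
    for keyword in keywords:              # prefer 'abuse', then the other keywords
        for email in emails:
            if keyword in email.lower():
                return email              # first matching email wins
    return emails[0] if emails else None
```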

 

Admin Email

We need to add an admin email for all new websites. Every 2 minutes, the system checks whether any new websites have been added. If any new ones are found, it starts the process of finding the admin email.
The python script is run_email_search.py located in /home/moses/scripts of web7.
Process works by first extracting links from the homepage of the website.

Process
Extract links from the homepage of the website
Each link has an anchor tag (email keyword). This tag is compared against a list of allowable keywords.
If the keyword matches, we then extract content from the link.
We then extract emails from the linked pages and build a list of extracted emails.
Email MX records
  1. For each email from the list, we check whether it is a disallowed email (checked against a list of forbidden emails)
  2. If the above check passes, we then check the MX record of the email's domain.
  3. If the MX check passes, the email is added under the particular website.
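A minimal sketch of the MX check, assuming the dnspython package is used for the lookup (the original script may use a different resolver):

```python
import dns.resolver  # dnspython

def has_mx_record(email):
    """True if the email's domain publishes at least one MX record."""
    domain = email.rsplit("@", 1)[-1]
    try:
        return len(dns.resolver.resolve(domain, "MX")) > 0
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.resolver.LifetimeTimeout):
        return False

print(has_mx_record("admin@example.com"))
```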

 

Cloudflare Form

We post cloudflare form to the url https://abuse.cloudflare.com/api/v2/report/abuse_dmca
The url accepts the following POST parameters

  • Email of user
  • Title of the request
  • Name of the organization
  • Address of the organization
  • City of the organization
  • Country of the organization in ISO format
  • Organization name
  • Phone number of the organization
  • List of links containing original work
  • Infringing urls

Documentation found here
For the links containing the original work (official links), we pick those of the project copyright holder. If the copyright holder is empty, then we use the link to the power of attorney.
In addition to the above, the form expects a client ID and client secret sent within the headers
Our values are:
client id: 6d8f8cf364c5b9380a88273624778766.access
client secret: 3038279b4d678feffde8c53002ea8f7ebef551e4f8f691df147a3e96fe85ad2b
The response is either a success or a failure with a detailed JSON description of what happened. We then log this in the logs table.
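A minimal sketch of the submission, assuming the payload and header field names below (they are illustrative; only the endpoint and the credential values come from this document):

```python
import requests

CF_URL = "https://abuse.cloudflare.com/api/v2/report/abuse_dmca"
HEADERS = {  # header names are assumptions, not confirmed by this document
    "Cf-Abuse-Api-Client-Id": "6d8f8cf364c5b9380a88273624778766.access",
    "Cf-Abuse-Api-Client-Secret": "3038279b4d678feffde8c53002ea8f7ebef551e4f8f691df147a3e96fe85ad2b",
}

def send_cloudflare_form(report):
    """POST one DMCA abuse report; 'report' carries the fields listed above."""
    payload = {  # parameter names are illustrative placeholders
        "email": report["email"],
        "title": report["title"],
        "company": report["organization"],
        "address": report["address"],
        "city": report["city"],
        "country": report["country_iso"],
        "tele": report["phone"],
        "original_work": "\n".join(report["official_links"]),
        "urls": "\n".join(report["infringing_urls"]),
    }
    resp = requests.post(CF_URL, data=payload, headers=HEADERS, timeout=60)
    return resp.status_code, resp.json()  # success or a JSON description of the failure
```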

We send Cloudflare forms based on the content type of the project. If the content type has skip cloudflare set, then we skip sending Cloudflare forms for websites with WRT > 14 days for projects of that content type.

Scan Official Links

Each content type is set with an official Amazon URL to check for project official links.
For every new project we scan links from this Amazon URL, checking against the project title and project author (if present).
We have set up a cron task for this on web6 (212.83.171.22:32024) via a script named scan_official_pages_schedule.py. It scans for links by checking the content provided by the a-parser Shop::Amazon parser. The cron script runs every hour.

Apache Solr

Apache Solr is installed on each of the 4 extra servers. It runs at http://127.0.0.1:8983
Three nodes are created on Apache Solr:

  1. search_engines -- for handling full web search tasks
  2. check_status --for checking deleted status
  3. deleted_status --confirming if deleted links are still deleted
Solr is set to have a minimum score of 0.4 for positive tests.
For search terms that are more than 1 word long, we split the terms and search on each term independently.

 

Removal Time

We update each website's average removal time.
The algorithm works as below:

  1. Step 1:
    • Select links (Deleted and Not Deleted) for the given website which were added within the last 3 months.
    • If there are no links from the above step, select the last 100 links (Deleted and Not Deleted) for the website.
  2. Step 2:
    For each link, get its removal time:
    This time is calculated as the difference in time between the time takedown notices were sent and the time the system detected the link as deleted
  3. Step 3:
    Get the average of the times found and set this as the website removal time.
In addition, we update the website removal time via the processes below.
  • By clicking the refresh icon beside the Average Removal Time in the Websites section. The calculation is based on the above formula.
  • When links are changed from Pending to Deleted or from Not Deleted to Deleted, either manually or automatically
We also run a general update once a day for all websites whose WRT is not >14 days. This also uses the above algorithm

 

Pirate/ Links Section

The system gets its links from either search engines or full website search. The searches are done on separate servers named web1, web2,web3,web4,web5
Results /links from these servers are queued on the main server and slowly added into the database.
They are stored in a table named reference. This is the table we read from to display links on the UI.
Some websites have several mirrors. For some of these (i.e. blogspot.com, wordpress.com and tumblr.com), we combine all the mirrors and display them as the parent website when storing them in the system.
All links displayed in the UI are fetched from the reference table. For some websites whose website removal time is >14 days, links older than 2023-12-31 are moved to a backup table named reference_backup.
Links from both tables ought to be checked during check status and when calculating website removal time.

Database Update

The database on the main server is constantly updated, either by users who regularly update parts of the database or by automatic scripts running all the time.
We need to reflect some of these updates on the webX servers, since they have their own copies of the database.
To achieve this, each of the webX servers has a script that updates its database every hour.

Unit Test

We developed a unit test tool for checking functionality before uploading any changes to gitlab. This tool currently works only for axg.house
The smallest testable parts of the application, called units, are individually and independently scrutinised to ensure that each part is error free (and secure). We use PHPUnit for our testing, and it runs on web1

Cookies

In order to access content on social media, we use cookies. We have created users for Facebook, Reddit, TikTok and Instagram.
Cookies for these users are updated regularly when they expire, or manually. Sometimes cookies expire and we need to update them immediately. For this, we developed a script to regularly check them and send an email about it. The script is named cookies_expiry_email.py and runs every 4 hours on web5.

Google and Bing Form
Google and bing form users also use cookies. These cookies sometimes expire and need to be updated. Whenever they expire, bing/google sends an email about that.

 

Content type settings

We have different buttons in the content type settings with different functionalities. They are Run FWS Now, Run All Sites by FWS, Scan official links for all projects and Run Links Collector.

Run FWS Now
This button inserts the content type id into the table "axgbot_run_now" and runs the full website search for all projects.
Run All Sites by FWS
This button inserts the content type id & server id into "create_fws_schedule", which runs the same FWS but on the respective server.
Scan official links for all projects
This button goes through all the projects of the respective content type, scans for official links for the projects which don't have any, and updates the official links.
Run Links Collector
This button is present only for Content-Type Id 33. Once clicked, it runs the Python script "collect_links_33.py" present on the web10 server via the Flask app. It gets the projects of content type 33 and the sites added to those projects in the last 1 year. The script then iterates through each site & searches for links containing the project name. If links match, they are inserted into the reference table.

 

Content Type 33 Script

The Content-Type 33 Script (`collect_links_33.py`) is designed to gather all websites associated with Content-Type 33 projects and process each project to retrieve relevant links. The script runs on the web10 server and is located at /home/mose/collect_links_33.py.

When executed, the script performs the following steps:

  1. Fetches all projects categorized under Content-Type 33 and collects websites added to these projects within the last year.
  2. Iterates through each collected website and checks for specific search inputs.
  3. Performs a search for links containing variations of the project title and collects all matching links.
  4. For each matching link, the script navigates to sublinks up to a depth of 15 to gather additional related links.
  5. All gathered links are stored in the database after undergoing several checks, including:
    • Whitelist and global whitelist verification.
    • Link existence check to avoid duplicate entries.

 

Once links are processed and verified, they are stored in the reference table, making them available for UI display. The script also includes filtering mechanisms to respect existing whitelists and prevent redundant entries.
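A minimal sketch of the depth-limited sublink crawl, assuming pages are fetched with requests and parsed with bs4 (the real script also applies the whitelist, duplicate and Solr checks described above):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_matching_links(start_url, title, max_depth=15):
    """Collect links whose anchor or URL mentions the project title, following their sublinks."""
    seen, matches = set(), []
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=20).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if title.lower() in (a.get_text() + link).lower():
                matches.append(link)
                frontier.append((link, depth + 1))  # follow sublinks of matching links
    return matches
```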

Telegram Bot Integration

The Telegram bot is integrated into the system to assist users with verifying and removing links through a simple, user-friendly interface. The bot communicates with the backend to fetch the link verification status and provides the necessary actions for users. This script is currently located on the web10 server at /home/moses/telegram-bot/telegram.py

Bot Functionality

The bot allows users to start the verification process for links they want to remove. After the user provides a link, the bot checks its availability and gives feedback on whether the link is available, removed, or if there are any issues.

Start Link Verification

When the user starts the bot, they are greeted with a "Start Link Verification" button. Clicking this button initiates the link verification process, prompting the user to submit a link they wish to verify.

Link Verification

After receiving a link, the bot checks its availability and category. If the link is found to be available, the bot estimates the removal time and provides a link to the remov.ee registration page for further processing.

Link Removal Process

If the link is unavailable or removed, the bot informs the user and suggests submitting another link for verification. It also provides an option to visit remov.ee directly to proceed with link removal if applicable.

Backend Integration

The Telegram bot is integrated with the backend system to verify the link status. It uses an API to fetch the link's status, category, and removal time. Additionally, it communicates with the database to ensure that all operations are recorded, including link verification results and user interactions.

Bot Setup

To set up the Telegram bot, ensure that the required dependencies, including the python-telegram-bot library, are installed. The bot must be configured with a valid Telegram bot token and connected to the backend system for link verification operations. Once set up, the bot can be run from a Python script located on your server.
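A minimal sketch of the bot wiring with python-telegram-bot (v20-style API), assuming a placeholder token and a stubbed backend lookup:

```python
from telegram import InlineKeyboardButton, InlineKeyboardMarkup, Update
from telegram.ext import (Application, CallbackQueryHandler, CommandHandler,
                          ContextTypes, MessageHandler, filters)

TOKEN = "TELEGRAM_BOT_TOKEN"  # placeholder

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    keyboard = InlineKeyboardMarkup(
        [[InlineKeyboardButton("Start Link Verification", callback_data="verify")]])
    await update.message.reply_text("Welcome! Verify a link you want removed:",
                                    reply_markup=keyboard)

async def ask_for_link(update: Update, context: ContextTypes.DEFAULT_TYPE):
    await update.callback_query.answer()
    await update.callback_query.message.reply_text("Please send the link to verify.")

async def verify_link(update: Update, context: ContextTypes.DEFAULT_TYPE):
    link = update.message.text.strip()
    status = "available"  # placeholder for the real backend status API call
    await update.message.reply_text(
        f"Link {link}: {status}. Continue at remov.ee to proceed with removal.")

app = Application.builder().token(TOKEN).build()
app.add_handler(CommandHandler("start", start))
app.add_handler(CallbackQueryHandler(ask_for_link, pattern="^verify$"))
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, verify_link))
app.run_polling()
```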

Task Distribution System

The Task Distribution System ensures that tasks with active = 0 are gathered from multiple databases and distributed evenly across servers without affecting tasks marked as active = 1. The script handles bulk operations efficiently, ensuring no duplicate tasks are inserted. This functionality is implemented in the backend for scalable and robust task management.

Functionality Overview

The system collects tasks marked as active = 0 from all specified databases, removes duplicates, and distributes them evenly across servers. Before inserting tasks, it checks to ensure that no tasks with active = 1 for the same project_id exist, thus preserving existing active tasks.

Task Gathering

The script connects to each database and retrieves a list of project_id values for tasks with active = 0. Simultaneously, it gathers all project_id values with active = 1 to ensure that no duplicate tasks are inserted.

Task Filtering

The gathered tasks are filtered to remove duplicates and exclude any tasks with project_id values that are already active (active = 1). This step ensures only valid tasks are considered for distribution.

Task Distribution

The filtered tasks are evenly distributed across the available servers. Each server is assigned a subset of tasks in a round-robin manner, ensuring balanced workload distribution.
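A minimal sketch of the filtering and round-robin split; the backend does this with Laravel collections, and the Python version below only illustrates the distribution logic:

```python
def distribute_tasks(inactive_ids, active_ids, servers):
    """Deduplicate inactive project_ids, drop any that are already active,
    and assign the rest to servers in round-robin order."""
    active = set(active_ids)
    candidates = [pid for pid in dict.fromkeys(inactive_ids) if pid not in active]
    assignment = {server: [] for server in servers}
    for i, pid in enumerate(candidates):
        assignment[servers[i % len(servers)]].append(pid)
    return assignment

# project 9 is already active, so it is skipped; 8 appears twice and is deduplicated
print(distribute_tasks([7, 8, 8, 9, 10], active_ids=[9], servers=["web1", "web2", "web3"]))
# {'web1': [7], 'web2': [8], 'web3': [10]}
```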

Task Insertion

After distributing tasks, the script performs a bulk insertion operation for each server. It uses the insertOrIgnore method to avoid conflicts and skips inserting any tasks with duplicate project_id values. This ensures data integrity and prevents redundant entries.

Backend Integration

The Task Distribution System is tightly integrated with the backend, utilizing Laravel's database query builder for operations. It connects to multiple databases, performs filtering and distribution in-memory using Laravel collections, and executes efficient bulk insert operations.

Setup Instructions

To use the Task Distribution System:

  • Ensure all target databases are configured in the backend with proper connection names (e.g., web1, web2, etc.).
  • Verify that the axgbot_scan_websites table exists in each database and follows the required schema.
  • Deploy the script to the server and schedule it as needed using a job scheduler like Laravel's task scheduler or a CRON job.
The script will automatically handle task distribution without affecting existing active tasks.

 

Link Push Script Integration

The system includes a feature that allows users to push verified links to the main table. This functionality is executed through a Python script that automates the process of handling links. The script is designed to efficiently move data from intermediate tables to the main table, ensuring data integrity and seamless integration.

Button Functionality

A button labeled "Push Links to Main Table" is provided in the interface. When clicked, this button triggers the Python script responsible for transferring links to the main table. The script processes the links, validates their structure, and ensures they are correctly added to the main table for further operations.

How It Works

Upon clicking the button:

  • The frontend sends a request to the backend to execute the Python script.
  • The backend runs the script asynchronously to avoid blocking the user interface.
  • The script fetches data from the database, processes the links, and inserts them into the main reference table.

 

Test Page