Documentation
AXGHOUSE Service
General Description
The project has a repository at https://gitlab.com/Axghouse/axghouse. Before cloning, verify that it is up to date with the copy on the server. The system consists of the following main components:
The schedule of scripts can be seen by running crontab -e
Scripts (the script path /var/www/html/app/crons is omitted below):
Operating mode - Script name - Description
Hourly - cron_add_pirate.sh - Creates search engine tasks for projects by phrases
Daily - cron_all - Clears old records and links
Once a day at 11pm - cron_not_deleted_notification.sh - Sends the user a report about links that were not deleted
Every 2 hours - cron_delete_content_detect.sh - Checks for deleted content
Every midnight - cron_check_expired_users.sh - Checks users for expiration dates
Hourly - cron_create_project_schedule.sh - Adds schedules for recently added projects
Daily - cron_check_disabled_project.sh - Checks projects that have expired and disables them
Constant - cron_mail_send.sh - Sends takedown notifications
Constant - cron_cloudflare_send.sh - Sends Cloudflare forms
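An illustrative crontab layout for these entries is shown below. The exact minute fields and the way the "Constant" scripts are kept running are assumptions, so treat `crontab -e` on the server as the authoritative schedule.

```
# Illustrative only - minute offsets and exact times are assumptions
0 * * * *    /var/www/html/app/crons/cron_add_pirate.sh
0 3 * * *    /var/www/html/app/crons/cron_all
0 23 * * *   /var/www/html/app/crons/cron_not_deleted_notification.sh
0 */2 * * *  /var/www/html/app/crons/cron_delete_content_detect.sh
0 0 * * *    /var/www/html/app/crons/cron_check_expired_users.sh
30 * * * *   /var/www/html/app/crons/cron_create_project_schedule.sh
0 4 * * *    /var/www/html/app/crons/cron_check_disabled_project.sh
# The "Constant" scripts (cron_mail_send.sh, cron_cloudflare_send.sh) run
# continuously, e.g. under a process supervisor, rather than from cron
```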
Python Scripts
These are located in the /root/pythons folder.
update_website_removal_time.py: Runs every 4 hours - Updates the website removal time for websites with WRT not greater than 14 days
The formula for calculating the website removal time is as follows (a sketch of this calculation follows the script list below):
Select the Deleted or Not Deleted links for the website from the last 3 months for which notifications have been sent. If no links are found, select the last 100 Deleted or Not Deleted links for the website for which notifications have been sent.
These links can be in either the reference table or the reference_backup table. For each of the links, calculate the removal time, i.e. the time between when notifications were sent and when the link was updated to Deleted.
Take the average of the removal times for all found links and set it as the Website Removal Time (WRT).
run_update_ip.py: - Runs every hour. Checks for websites that have updated IPs.
run_add_axgbot.py: - Final confirmation for full website search links. Runs every 31 minutes.
run_add_mirrors.py: - Runs every minute to add mirror links from manually added links
send_mail.py: - Sends email for websites whose WRT becomes greater than 14 days
send_mail_two.py: - Sends email for websites whose removal time is about to increase and which have pending links
run_fws.py: - Creates FWS tasks spread across all participating servers
update_removal_time.py: - Updates website removal time for websites that have recently deleted links
delete_user_projects.py: - Deletes projects for users that have been disabled
check-supervisor.py: - Ensures that supervisor is running at all times
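A minimal sketch of the WRT formula described above, assuming each selected link row carries the notification and deletion timestamps (the column names used here are illustrative, not taken from the schema):

```python
from datetime import timedelta

def website_removal_time(links):
    """links: Deleted/Not Deleted rows for one website (last 3 months with notifications
    sent, or the last 100 such rows if none found), from reference or reference_backup."""
    removal_times = [
        (row["deleted_at"] - row["notified_at"]).total_seconds()
        for row in links
        if row.get("deleted_at") and row.get("notified_at")
    ]
    if not removal_times:
        return None
    # the average removal time becomes the Website Removal Time (WRT)
    return timedelta(seconds=sum(removal_times) / len(removal_times))
```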
System Logs
All logs are in the application/logs folder
Cron scripts write to logs of the same name. To search for a phrase in all logs: grep -rnw 'application/logs' -e 'phrase'. A background script cleans these logs once a week.
Users
Enter the fields (* - required): email *;
enable the option to send copies of all notifications to a second email and enter that second email;
username * and surname; select the copyright holder name used for all projects;
select the available content types; select the search engines available to the user; optionally hide the schedule from the user and configure it in the settings;
phone number; password *; user role (admin, manager, guest); account expiration date;
the projects available to managers and guests.
When a user expiration date is set, the crons/cron_check_expired_users script is responsible for enforcing it.
Change the name and surname, enter the details for connecting SMTP mail, enter a signature (valid only when using your personal templates, not when using the formula), and change the password.
Edit user data
Invoices management
Change password and delete user account
Projects
Main tab - visibility for the manager, selecting the content type from the available ones, entering a title, entering Author / Year / Artist values (these fields vary depending on the selected content type), inserting links to official web resources, as well as a power of attorney.
Search keywords - when you open the tab, keywords are automatically generated from the Title field,
combined with the Author / Year / Artist value if one is present.
Key phrases are also generated from the Translator of copyrighted work field in conjunction with the Title,
but only if the "+" has been placed in the translator field of the keyword phrase in the System -> Search keywords section.
Content type - affects the field name Author / Year / Artist.
Schedule - scan schedule; works on server time (Germany, UTC+2)
Whitelist - Whitelist for links
Document - Download Files
When you save the project for the first time, a search is automatically launched across all search engines available to the user. After the first save, the RUN NOW button appears to start the search immediately.
Hitting the RUN SEARCH button starts a search for the selected project across all search engines. We have four search engines (Google, Bing, Yandex, axgbot).
Hitting RUN FWS runs a full website search immediately (using the axgbot search engine).
Tasks for the engine axgbot are picked up by other servers. The servers are:
212.83.171.22:32004 web1
212.83.171.22:32008 web2
212.83.171.22:32012 web3
212.83.171.22:32016 web4
212.83.171.22:32020 web5
212.83.162.31:32024 web6
Search engines use proxies located at /opt/aparser/files/proxy.txt
The other servers are: 135.181.199.23 --Links verifier server (web10)
37.27.2.94 --i.house
Disabling projects based on expiration date - each day all projects are checked against their expiration dates.
The function that performs this is located at cron/cron_not_deleted_notification.
If the project expiration date is in the past, the project is disabled and will not scan for links.
For each project we can manually add a list of official links. For many others, we scan the official links from Amazon. Scanning is done via a Python script that is run manually from web3 whenever there are new projects.
When a project is deleted, all its associated links also get deleted.
Project audit tool. This tool was added to track the progress of scanning projects. It provides the last link added, the time it was added, and the reasons why the project may not be scanning.
Websites
https://axg.house/website
When you add a new site, it is automatically created in the Websites section, and a search starts for the email inside the site and for the hoster, based on the whois database via https://search.arin.net/rdap
IP address update:
We check all websites once every 24 hours for new IPs. Skip links scanning - sub-links from all links of this website are not added; such links are also not checked for upload links.
If a new IP is found, it is set as the current website IP and all previously found IPs are logged.
This is achieved with PHP's standard gethostbyname method.
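A minimal sketch of this check, assuming we only need the resolved address and a comparison against the stored IP (the production code lives in run_update_ip.py and the PHP side; names below are illustrative):

```python
import socket

def check_website_ip(domain: str, current_ip: str | None):
    """Resolve the domain (the same lookup PHP's gethostbyname performs) and report
    whether the stored IP needs updating."""
    try:
        new_ip = socket.gethostbyname(domain)
    except socket.gaierror:
        return current_ip, False  # could not resolve; keep the stored IP
    if new_ip != current_ip:
        # the caller sets new_ip as the current website IP and logs the previous one
        return new_ip, True
    return current_ip, False
```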
Wait for Content loading - Means that content will be received using selenium, after waiting (20 sec by default).
Each website also has a RUN NOW button. Clicking this button starts a full web search on the selected website immediately. This occurs ONLY when the full web search checkbox is checked. If it is not checked, full web search is NOT carried out on the website.
Clicking RUN FWS runs the FWS search at normal speed.
We also have a search method for each website. This method is manually changed. Search bots use the value of this method to determine how the website is searched.
For some websites, we need to check project titles/authors from particular locations, so we have developed a custom method that checks these particular locations with customized Solr scores for each.
Solr scores are used as below:
For specific locations used to check the title - a Solr score of 3 and above is needed.
For the rest of the links - a Solr score of 4.2 and above is needed.
An example website for which this was developed is libgen.
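A hedged sketch of this threshold logic, querying the local Solr select endpoint; the core name, query fields, and how the candidate text is built are assumptions, and only the 3 / 4.2 thresholds come from this document:

```python
import requests

SOLR_SELECT = "http://127.0.0.1:8983/solr/links/select"  # core name is hypothetical

def title_matches(candidate_text: str, from_specific_location: bool) -> bool:
    """Return True when Solr's top relevance score clears the location-specific threshold."""
    threshold = 3.0 if from_specific_location else 4.2
    resp = requests.get(SOLR_SELECT, params={
        "q": candidate_text,  # text extracted from the page location being checked
        "fl": "score",
        "rows": 1,
    }, timeout=30)
    docs = resp.json().get("response", {}).get("docs", [])
    return bool(docs) and docs[0]["score"] >= threshold
```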
Website audit tool. This tool was added to track the progress of scanning websites using full website search. It provides the last link added, the time it was added, and the reasons why the website may not be scanning.
Analytics
https://axg.house/analytics
Displays links in the categories below.
Email Templates
https://axg.house/email_template
Shows all email templates used by the system.
These templates are used in.
Check Status
Checks the link for the presence of stop words from the Axghouse group after notifications have been sent.
The check is dependent on the removal time of the associated website. Check duration as below:
Check status can also be manually triggered by clicking the info icon that appears alongside the status of each link on this page https://axg.house/pirate
This only happens if the link status is Pending
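An illustrative sketch of the stop-word test at the core of this check (the function and parameter names are hypothetical, and the fetching and scheduling around it are omitted):

```python
def has_stop_word(page_text: str, stop_words: list[str]) -> bool:
    """True when any Axghouse-group stop word appears in the fetched page text."""
    text = page_text.lower()
    return any(word.lower() in text for word in stop_words)
```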
Google Form
Sent within 1 minute after takedown date appears.
We send links for up to 70 projects with the same publisher.
Blogspot Form
Sent within 1 minute after takedown date appears.
We send links for up to 10 projects with the same publisher.
Bing Form
Sent in a similar manner as Google Form.
We send links per project to the bing DMCA.
Content Types
Proxies
We use proxies provided by Webshare. Proxies are used by FWS, the search engines, check status (all forms), the test tool and link verification.
The proxies rotate once a month, between the 19th and the 24th. Sometimes Webshare simply replaces certain proxies at any given time.
After rotation, once the new proxies are provisioned, we have to download the new list and save it on all servers that require proxies, in the file /opt/aparser/files/proxy.txt.
Download is done every hour.
Proxy API host: https://proxy.webshare.io/api/v2/proxy/list/?mode=direct
Proxy authorization token s7t89waym9igp51mxq0i3el4ac85qd2d5jfp5xqe
When scripts run and need proxies, they read them directly from Webshare. Search engines using a-parser also use proxies from Webshare. a-parser cannot read directly from Webshare, so we download the proxies and store them in files in the a-parser directory where it can fetch them.
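A hedged sketch of the hourly download step, assuming the Webshare list endpoint returns a JSON body with a results array (the field names and single-page handling are assumptions; check the Webshare API documentation for the exact schema and pagination):

```python
import requests

API_URL = "https://proxy.webshare.io/api/v2/proxy/list/?mode=direct"
TOKEN = "<authorization token listed above>"

def refresh_proxy_file(path: str = "/opt/aparser/files/proxy.txt") -> None:
    """Download the current proxy list and rewrite the file a-parser reads from."""
    resp = requests.get(API_URL, headers={"Authorization": f"Token {TOKEN}"}, timeout=60)
    resp.raise_for_status()
    lines = []
    for proxy in resp.json().get("results", []):  # assumed fields below; pagination omitted
        lines.append(f'{proxy["username"]}:{proxy["password"]}@{proxy["proxy_address"]}:{proxy["port"]}')
    with open(path, "w") as fh:
        fh.write("\n".join(lines))
```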
In addition we update all proxy locations using this service https://ipapi.co
All scripts using proxies include
System
User Group - by default, words from the 7th group are used in the search
Content types - content types that are selected in the project
Search keywords - contains templates for generating search phrases for each content type
Email templates - templates for letters - these are formed and stored as files
Not deleted notification - a template for sending notifications about links that were not deleted
White List - sites from this list are not added; all sites are listed without www
Fake sites - links are not searched on such sites and complaints are not sent to them. The list is replenished automatically if links to other fakes are found on a page or a redirect to a fake is detected; such entries are marked with the Auto label
Fakesite white - exceptions to the fake sites list
Deleted links - basket for deleted links
Link shorteners - URL shortening services; they must be checked for redirects
System Log - contains a log of some user actions
Upload hostings - a list of hosts that are considered file services; such links are added as Uploads and the title is not checked for them. These links also have a section (button) Edit blackwords list - a list of strings that should not appear in a link (to exclude pictures, scripts etc.)
Full Web Search
We have 6 servers involved in full web search.
Run button on each website:
When clicked, it goes through all projects with content type 31 and sets up search tasks for each project on the selected website.
The tasks are then equally distributed among the 5 servers based on the content type of the website.
A script has been developed that inserts the tasks into each server with proper distribution, so that no server has more tasks than the others.
Project run button:
This button runs both search engine tasks and website tasks for the selected project.
For website tasks, it goes through all websites of content type 31 and equally distributes search tasks among all servers.
Content type run button:
This button runs all projects assigned to the content type. Each content type has specific servers on which its tasks run.
Content type ID 6: web10
All other content types: the rest of the servers
After tasks are created, we fetch links using fetch_links.py
These links are then checked by a C++ executable named fws_new that is present on all servers.
Each server has a copy of the main database which is updated every 3 hours.
We run searches on websites or projects. The main server prepares the tasks, assigning each server a list of its own tasks. Each server then picks its assigned tasks and begins the website search.
fetch_links.py - located on each server, it extracts links from each task. We fetch the content via the scraper, Selenium or Playwright, and the links are then extracted using bs4.
This extraction is done using either:
scraper
playwright
selenium
cloudscraper (https://pypi.org/project/cloudscraper/) - used where the above 3 methods cannot bypass Cloudflare
Each link is checked using Apache Solr. If a link meets the criteria, it is inserted into the main server's database for final checks and then added to the system.
Criteria checked include stop words, project stop words and title/translate.
For some links we extract upload/download links from specific locations on the page. Information about the exact location to extract from is hard coded in the scripts. The main scripts involved in checking links are /root/moses/fws_solr.py and /root/moses/fws_solr_sublinks.py.
Before we add any links, we delete any links containing these search urls.
For all links added we extract mirrors. Mirrors are links from similar websites which are already grouped under the mirrors website.
The extracted mirrors are checked for existence before being added to the main queue for final checks.
The script for extracting mirrors is named transfer_links_four.py and works in multithreaded mode.
The servers process FWS tasks at varying speeds; some servers are faster than others. When some servers run out of tasks, we need to move tasks from other servers to the ones without tasks.
We have an algorithm to balance tasks, implemented in a script named balance-tasks.py, that works as below:
It runs on web1 and moves 30% of the tasks from the server with the most tasks to the server with the fewest tasks.
This happens only if the server with the most tasks has more than 10 tasks.
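A minimal sketch of that balancing rule with the data access abstracted away; only the 30% and 10-task figures come from this document, and the names are illustrative:

```python
def plan_rebalance(tasks_per_server: dict[str, int]):
    """Decide which server should hand off tasks, and how many, per the rule above."""
    busiest = max(tasks_per_server, key=tasks_per_server.get)
    idlest = min(tasks_per_server, key=tasks_per_server.get)
    if busiest == idlest or tasks_per_server[busiest] <= 10:
        return None  # nothing to move
    to_move = int(tasks_per_server[busiest] * 0.30)
    return busiest, idlest, to_move  # caller transfers `to_move` tasks between the two
```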
When we scan by FWS or check redirect links, we add the parent website. For example, when we scan website A or check a redirect, we must add only website A even if the URL changed. But if website A redirects to website B, then we do not add website B.
Scraper (FWS and Uploads)
Content for full website search is fetched using a scraper built in the C++ programming language.
- Web Scraping: Using cURL to send HTTP requests and retrieve webpage content.
- Proxy Handling: Reads proxy details from a file and integrates them into web requests.
- User-Agent Randomization: Dynamically fetches and utilizes different user-agents to mimic browser behavior.
- Data Cleaning: Removes unwanted characters, scripts, and HTML tags from scraped content.
- Database Interaction: Connects to a MySQL database to fetch and update data related to projects and web pages.
- Multithreading: Uses std::async to process multiple web pages concurrently.
- Error Handling: Manages errors in database queries, file operations, and network requests.
Standard Libraries - iostream, string, vector, fstream, regex, future, unordered_map, algorithm, ctime
Third-party Libraries - curl/curl.h: For handling HTTP requests.
- mysql/mysql.h: For MySQL database connectivity.
External Files - search_engine.cpp: The main search engine.
- Proxy and user-agent files: Required to simulate realistic browsing behavior.
Compile: g++ -o scraper add_uploads.cpp -lcurl -lmysqlclient -lpthread
- Run the executable: ./scraper
Email Notifications
Email notifications are sent for all new links
The notifications are divided into: Admin email notifications. We notify the admin of the websites on which the links have been found.
Hosting email notifications - we search for the hosting email of the website and notify them to take down the link
If the website is hosted on cloudflare, we send a separate cloudflare email.
Hosting Email
Website IPs keep changing. Every 24 hours the system checks for the current website IP. The Python script is run_update_ip.py, located in /root/pythons.
If the current IP is different from the previously found IP, the IP is updated but the previous one is cached for reference.
Websites without hosting emails are checked and updated. When the IP change runs, we need to check the website again for a new hosting email.
Admin Email
We need to add an admin email for all new websites. Every 2 minutes, the system checks whether any new websites have been added. If any are found, it starts the process of finding the admin email.
The python script is run_email_search.py located in /home/moses/scripts of web7.
The process works by first extracting links from the homepage of the website.
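A minimal sketch of that first step, assuming a plain HTTP fetch and bs4 link extraction (the real run_email_search.py does more than this, and the function name is illustrative):

```python
import requests
from bs4 import BeautifulSoup

def homepage_links(domain: str) -> list[str]:
    """Collect candidate links from the website homepage as the starting point
    of the admin-email search."""
    html = requests.get(f"http://{domain}", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```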
Cloudflare Form
We post cloudflare form to the url https://abuse.cloudflare.com/api/v2/report/abuse_dmca
The url accepts the following POST parameters
Documentation found here
For the links to the original work (official links), we use the project copyright holder. If the copyright holder is empty, we use the link to the power of attorney.
In addition to the above, the form expects a client ID and client secret sent within the headers
Our values are:
client id: 6d8f8cf364c5b9380a88273624778766.access
client secret: 3038279b4d678feffde8c53002ea8f7ebef551e4f8f691df147a3e96fe85ad2b
The response is either a success or a failure, with a detailed JSON description of what happened. We then log this in the logs table.
We send the Cloudflare form based on the content type of the project. If the content type has skip cloudflare set, then we skip sending the Cloudflare form for websites with WRT > 14 days for projects of that content type.
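A hedged sketch of the submission, keeping the payload abstract since the exact POST parameters are not listed here; the header names follow the Cloudflare Access service-token convention and are an assumption, as is sending the payload as JSON:

```python
import requests

ABUSE_URL = "https://abuse.cloudflare.com/api/v2/report/abuse_dmca"
HEADERS = {
    # client ID and secret listed above; the header names are an assumption
    "CF-Access-Client-Id": "6d8f8cf364c5b9380a88273624778766.access",
    "CF-Access-Client-Secret": "<client secret listed above>",
}

def send_cloudflare_form(payload: dict) -> dict:
    """POST the abuse form and return the JSON result (success or failure with details)."""
    resp = requests.post(ABUSE_URL, json=payload, headers=HEADERS, timeout=60)
    result = resp.json()
    # the result is then logged in the logs table (omitted here)
    return result
```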
Scan Official Links
Each content type is set with an official Amazon URL to check for project official links.
For every new project we scan links from this Amazon URL by checking against the project title and project author (if present).
We have set a cron task for this in web6 (212.83.171.22:32024) via script named scan_official_pages_schedule.py. This scans for links by checking the content provided by a-parser Shop::Amazon parser. The cron script runs every hour.
Apache Solr
Apache Solr is installed on each of the 4 extra servers. It runs at the URL http://127.0.0.1:8983
Three nodes are created on Apache Solr:
Removal Time
We update each website's average removal time.
The algorithm works as below:
Pirate/ Links Section
The system gets its links from either search engines or full website search. The searches are done on separate servers named web1, web2, web3, web4, web5.
Results /links from these servers are queued on the main server and slowly added into the database.
They are stored in a table named reference. This is the table we read from to display links on the UI.
Some websites have several mirrors. For some of these (i.e. blogspot.com, wordpress.com and tumblr.com), we combine all the mirrors and display them as the parent website when storing them in the system.
All links displayed in the UI are fetched from the reference table. For some websites whose website removal time is > 14 days, their links are moved to a backup table named reference_backup if the link is older than 2023-12-31.
Links from both tables ought to be checked during check status and also website removal time.
Database Update
The database on the main server is constantly updated, either by users who regularly update parts of the database or by automatic scripts running all the time.
We need to reflect some of these updates on the webX servers, since these have their own copies of the database.
To achieve this, each of the webX servers has a script that updates its database every hour.
Unit Test
We developed a unit test tool for checking functionality before uploading any changes to gitlab. This tool currently works only for axg.house
The smallest testable parts of the application, called units, are individually and independently scrutinised for proper operation to ensure that each part is error free (and secure). We have used PHPUnit for our testing, and it runs on web1.
Cookies
In order to access content on social media, we use cookies. We have created users for Facebook, Reddit, TikTok and Instagram.
Cookies for these users are updated regularly when they expire, or manually. Sometimes cookies expire and we need to update them immediately. For this, we developed a script that regularly checks them and sends an email about it. The script is named cookies_expiry_email.py and runs every 4 hours on web5.
Content type settings
We have different buttons in the content type settings, each with its own functionality: Run FWS Now, Run All Sites by FWS, Scan official links for all projects and Run Links Collector.
Content Type 33 Script
The Content-Type 33 Script (`collect_links_33.py`) is designed to gather all websites associated with Content-Type 33 projects and process each project to retrieve relevant links. The script runs on the web10 server and is located at /home/mose/collect_links_33.py.
When executed, the script performs the following steps:
Once links are processed and verified, they are stored in the reference table, making them available for UI display. The script also includes filtering mechanisms to respect existing whitelists and prevent redundant entries.
Telegram Bot Integration
The Telegram bot is integrated into the system to assist users with verifying and removing links through a simple, user-friendly interface. The bot communicates with the backend to fetch the link verification status and provides the necessary actions for users. The script is currently located on the web10 server at /home/moses/telegram-bot/telegram.py
The bot allows users to start the verification process for links they want to remove. After the user provides a link, the bot checks its availability and gives feedback on whether the link is available, removed, or if there are any issues.
When the user starts the bot, they are greeted with a "Start Link Verification" button. Clicking this button initiates the link verification process, prompting the user to submit a link they wish to verify.
After receiving a link, the bot checks its availability and category. If the link is found to be available, the bot estimates the removal time and provides a link to the remov.ee registration page for further processing.
If the link is unavailable or removed, the bot informs the user and suggests submitting another link for verification. It also provides an option to visit remov.ee directly to proceed with link removal if applicable.
The Telegram bot is integrated with the backend system to verify the link status. It uses an API to fetch the link's status, category, and removal time. Additionally, it communicates with the database to ensure that all operations are recorded, including link verification results and user interactions.
To set up the Telegram bot, ensure that the required dependencies, including the python-telegram-bot library, are installed. The bot must be configured with a valid Telegram bot token and connected to the backend system for link verification operations. Once set up, the bot can be run from a Python script located on your server.
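A minimal sketch of this flow using the python-telegram-bot library mentioned above (v20-style API); the backend lookup, the message texts and the remov.ee handling are placeholders, not the production telegram.py:

```python
from telegram import InlineKeyboardButton, InlineKeyboardMarkup, Update
from telegram.ext import (Application, CallbackQueryHandler, CommandHandler,
                          ContextTypes, MessageHandler, filters)

def check_link_status(link: str) -> dict:
    # placeholder for the backend API call returning status, category and removal time
    return {"status": "available", "removal_time": "unknown"}

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    keyboard = InlineKeyboardMarkup(
        [[InlineKeyboardButton("Start Link Verification", callback_data="verify")]])
    await update.message.reply_text("Welcome! Verify a link you want removed.",
                                    reply_markup=keyboard)

async def ask_for_link(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    await update.callback_query.answer()
    await update.callback_query.message.reply_text("Please send the link you wish to verify.")

async def handle_link(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    info = check_link_status(update.message.text.strip())
    if info["status"] == "available":
        await update.message.reply_text(
            f"Link is available. Estimated removal time: {info['removal_time']}. "
            "Continue at the remov.ee registration page.")
    else:
        await update.message.reply_text(
            "Link is unavailable or removed. Please submit another link.")

app = Application.builder().token("<telegram bot token>").build()
app.add_handler(CommandHandler("start", start))
app.add_handler(CallbackQueryHandler(ask_for_link, pattern="^verify$"))
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_link))
app.run_polling()
```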
Task Distribution System
The Task Distribution System ensures that tasks with active = 0 are gathered from multiple databases and distributed evenly across servers without affecting tasks marked as active = 1. The script handles bulk operations efficiently, ensuring no duplicate tasks are inserted. This functionality is implemented in the backend for scalable and robust task management.
The system collects tasks marked as active = 0 from all specified databases, removes duplicates, and distributes them evenly across servers. Before inserting tasks, it checks to ensure that no tasks with active = 1 for the same project_id exist, thus preserving existing active tasks.
The script connects to each database and retrieves a list of project_id values for tasks with active = 0. Simultaneously, it gathers all project_id values with active = 1 to ensure that no duplicate tasks are inserted.
The gathered tasks are filtered to remove duplicates and to exclude any tasks with project_id values that are already active (active = 1). This step ensures only valid tasks are considered for distribution.
The filtered tasks are evenly distributed across the available servers. Each server is assigned a subset of tasks in a round-robin manner, ensuring balanced workload distribution.
After distributing tasks, the script performs a bulk insertion operation for each server. It uses the insertOrIgnore method to avoid conflicts and skips inserting any tasks with duplicate project_id values. This ensures data integrity and prevents redundant entries.
The Task Distribution System is tightly integrated with the backend, utilizing Laravel's database query builder for operations. It connects to multiple databases, performs filtering and distribution in-memory using Laravel collections, and executes efficient bulk insert operations.
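The filtering and round-robin step can be illustrated with a short Python sketch; the production code uses Laravel's query builder and insertOrIgnore, so this only mirrors the in-memory logic and uses made-up names:

```python
def distribute_tasks(inactive_project_ids, active_project_ids, servers):
    """Drop duplicates and already-active projects, then assign the rest round-robin."""
    active = set(active_project_ids)
    candidates = [pid for pid in dict.fromkeys(inactive_project_ids) if pid not in active]
    assignment = {server: [] for server in servers}
    for i, pid in enumerate(candidates):
        assignment[servers[i % len(servers)]].append(pid)
    return assignment  # each server's list is then bulk-inserted with insertOrIgnore
```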
To use the Task Distribution System:
Link Push Script Integration
The system includes a feature that allows users to push verified links to the main table. This functionality is executed through a Python script that automates the process of handling links. The script is designed to efficiently move data from intermediate tables to the main table, ensuring data integrity and seamless integration.
A button labeled "Push Links to Main Table" is provided in the interface. When clicked, this button triggers the Python script responsible for transferring links to the main table. The script processes the links, validates their structure, and ensures they are correctly added to the main table for further operations.
Upon clicking the button:
Test Page