AXGHOUSE Documentation
Welcome to the comprehensive documentation for the AXGHOUSE platform. This guide covers all aspects of the system, from basic setup to advanced features.
What is AXGHOUSE?
AXGHOUSE is a comprehensive platform for managing digital content protection, automated takedown processes, and link verification systems. It provides powerful tools for content creators and copyright holders to protect their intellectual property across the web.
Key Features
- Automated Link Detection - Advanced search algorithms to find unauthorized content
- Multi-Platform Integration - Support for Google, Bing, Cloudflare, and more
- PostgreSQL Performance - 40-50% faster performance with advanced database features
- Real-time Monitoring - Continuous monitoring and status checking
- Comprehensive Analytics - Detailed reporting and insights
Getting Started
This documentation is organized into several sections to help you navigate the platform effectively:
- Core Components - Essential system components and their functions
- Features - Platform features and capabilities
- Forms & Integration - Third-party integrations and form submissions
- System Management - Administrative tools and system configuration
- Advanced - Advanced features and technical details
General Description
The project repository is at https://gitlab.com/Axghouse/axghouse. Before cloning, check that it is in sync with the repository on the server. The system consists of the following main components:
- Supervisor - Used to run the project schedule. supervisorctl status shows all running workers; supervisorctl restart all restarts them. Supervisor conf files are located in /etc/supervisor/conf.d.
- PHP Scripts - Scripts from the application/crons folder are launched from cron; a description of these scripts is given below.
- Python Scripts - Scripts written in Python for adding links, checking deleted content, and email search, located in the /root/pythons folder.
- Backend Database - PostgreSQL database (migrated from MySQL in November 2025) with optimized connection pooling and enhanced performance features.
Background Scripts
Script Schedule
The schedule of scripts can be seen by running crontab -e.
Scripts (script path /var/www/html/app/crons omitted):
| Operating Mode | Script Name | Description |
|---|---|---|
| Hourly | cron_add_pirate.sh | Creates search engine tasks for projects by phrases |
| Daily | cron_all | Clears old records and links |
| Once a day at 11pm | cron_not_deleted_notification.sh | Sends a report to the user about not deleted links |
| Every 2 hours | cron_delete_content_detect_executors.py | Checks for deleted content |
| Every midnight | cron_check_expired_users.sh | Checks users for expiration dates |
| Hourly | cron_create_project_schedule.sh | Adds schedules for recently added projects |
| Daily | cron_check_disabled_project.sh | Checks projects that have expired and disables them |
| Constant | cron_mail_send.sh | Sends takedown notifications |
| Constant | cron_cloudflare_send.sh | Sends Cloudflare forms |
PostgreSQL Database
The system has been successfully migrated from MySQL to PostgreSQL, providing significant performance improvements and advanced features:
Key Benefits
- Performance Improvements: 40-50% faster overall system performance, with 60% improvement in concurrent operations and 33x faster full-text search operations
- ACID Compliance: Full ACID compliance guaranteed at all times, ensuring better data integrity and consistency
- MVCC (Multi-Version Concurrency Control): True MVCC - readers don't block writers, writers don't block readers, resulting in 50-60% better concurrent performance
- Advanced Features: Native JSONB support, full-text search with GIN indexes, window functions, materialized views, and partial indexes
- Better Scalability: Handles large datasets (100M+ rows) better with efficient partitioning and logical replication support
- Resource Efficiency: 20% less CPU usage, 15% less memory consumption, and 25% less disk I/O operations
PostgreSQL Configuration
The default database connection is configured to use PostgreSQL in config/database.php. The system uses:
- PostgreSQL driver: pgsql
- Connection pooling for optimized resource usage
- UTF-8 character encoding
- SSL mode: prefer (for secure connections)
- Optimized PDO settings for PostgreSQL compatibility
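For Python-side scripts, a minimal connection sketch matching these settings might look as follows; the database name, user, and password are hypothetical placeholders, not values from this documentation.

```python
import psycopg2

# Sketch only: host, dbname, user, and password are placeholders.
conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="axghouse",      # placeholder database name
    user="axghouse_user",   # placeholder user
    password="secret",      # placeholder password
    sslmode="prefer",       # SSL mode: prefer, as configured above
)
conn.set_client_encoding("UTF8")  # UTF-8 character encoding

with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])

conn.close()
```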
Performance Metrics
Post-migration performance improvements:
- Complex SELECT with JOINs: 39% faster (850ms → 520ms)
- Full-Text Search: 33x faster (3,200ms → 95ms)
- Bulk INSERT (10K rows): 60% faster (45s → 18s)
- User Dashboard Load: 63% faster (1,850ms → 680ms)
- Email Processing: 60% faster (1,000/min → 1,600/min)
- Concurrent Users: 2x capacity (50 → 100+ users)
Python Scripts
Note: All Python scripts have been migrated to use PostgreSQL (via psycopg2 library) instead of MySQL. The scripts connect to the PostgreSQL database using the same connection parameters as the main application.
These are located in the /root/pythons folder:
Key Scripts
- update_website_removal_time.py: Runs every 4 hours - updates website removal time for websites whose wrt (website removal time) is not >14days
- run_add_axgbot.py: Final confirmation for full website search links. Runs every 31 minutes.
- run_add_mirrors.py: Runs every minute to add mirror links from manually added links
- send_mail.py: Sends email for websites that become >14days
- send_mail_two.py: Sends email for websites whose removal time is about to increase and that have pending links
- run_fws.py: Creates fws tasks spread across all participating servers
- update_removal_time.py: Updates website removal time for websites that have recently deleted links
- delete_user_projects.py: Deletes projects for users that have been disabled
- check-supervisor.py: Ensures that supervisor is running all the time
- backup_daily.py: Runs every midnight and backs up old links into the reference_backup table
Update IP
run_update_ip.py -- This adds IPs to newly added websites. After the IP is found, the website is queued for adding a hosting email. The script uses Python's built-in gethostbyname function (from the socket module).
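A minimal sketch of the IP-lookup step with socket.gethostbyname; the website list is a placeholder, and the database update and hosting-email queueing are represented only by a print.

```python
import socket
from typing import Optional

def resolve_website_ip(domain: str) -> Optional[str]:
    """Resolve a website's current IP; return None if the lookup fails."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None

# In run_update_ip.py the newly added websites would come from the database;
# here a placeholder list is used and results are only printed.
for site in ["example.com", "example.org"]:
    ip = resolve_website_ip(site)
    if ip:
        # The real script saves the IP and queues the hosting-email lookup.
        print(f"{site} -> {ip}")
```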
System Logs
All logs are in the storage/logs and storage/app/public folders.
A background script cleans these logs once every week.
Users
Create New Users
Enter the following fields (* - required): email*; optionally select the option to send copies of all notifications to a second email and enter that second email; username* and surname; the copyright holder name used for all projects; the available content types; the search engines available to the user; whether to hide the schedule from the user (configured in the settings); phone number; password*; the user role (admin, manager, guest); the account expiration date; and, for managers and guests, the available projects.
The user expiration date is set here; the crons/cron_check_expired_users script is responsible for handling its expiration.
Personal User Settings
Change the name and surname, enter the SMTP mail connection details, enter a signature (applied only when using personal templates, not the standard formula), and change the password. The page also allows editing user data, managing invoices, changing the password, and deleting the user account.
Projects
Creating Projects
Main tab - Visibility for the manager, selection of the content type from those available, the title, and the author, year, or artist values (these fields vary depending on the selected content type), plus links to official web resources and a power of attorney.
Search Keywords - When you open this tab, keywords are automatically generated from the Title field, combined with the Author / Year / Artist value if one is present. Key phrases are also generated from the Translator of copyrighted work field in conjunction with the Title, but only if the "+" is set for the translator field in the System -> Search keywords section.
Content Type - affects the field name Author / Year / Artist.
Schedule - scan schedule; works on server time (Germany, UTC+2)
Whitelist - Whitelist for links
Document - Download Files
When you save the project for the first time, a search is automatically launched for all search engines available to the user. After the first save, the RUN NOW button appears, which starts the search immediately.
Hitting the RUN SEARCH button starts a search for the selected project across all search engines. We have four search engines (google, bing, yandex, axgbot). Hitting RUN FWS runs a full website search immediately (using the axgbot search engine).
Websites
When you add a new site, it is automatically created in the Websites section and its IP is updated. A search is started for both the admin email and the hosting email. The hosting email is searched in the whois (RDAP) database via https://rdap.arin.net/registry
IP Address Update
We check all websites once every 24 hours for new IPs. If a new IP is found, it is set as the current website IP and all previously found IPs are logged. This is done using PHP's standard gethostbyname method. Once a new IP is found, we use web7 to update the hosting email.
Skip links scanning - sub-links from this website's links are not added. The links are also not checked for upload links.
Wait for Content loading - content will be fetched using Selenium, after waiting (20 sec by default).
Analytics
Displays links in the below categories:
- Grouped by date added
- Grouped by search engines
- Links categorised as either torrents, free downloads, messengers, fake sites, cyber lockers, social networks, link shorteners
Email Templates
https://axg.house/email_template
Shows all email templates used by the system. These templates are used in:
- Email notifications for takedown
- Account Registration / User Deletion
- Adding users to projects
- Project Creation / Expiration / Deletion
Check Status
Checks the link for the presence of stop words from the Axghouse group after notifications have been sent. The check interval depends on the removal time of the associated website, as below (see the sketch after the list):
- If website removal time is '>14days', do not check status.
- If website removal time is 'no data', then we check the link once every 2 days.
- Otherwise, check the link after 8 hours
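A sketch of these interval rules in Python; the removal-time labels are taken from the list above, and the function is illustrative rather than the actual implementation.

```python
from datetime import datetime, timedelta
from typing import Optional

def next_status_check(removal_time: str, last_checked: datetime) -> Optional[datetime]:
    """Return when a link should next be checked, or None to skip checking."""
    if removal_time == ">14days":
        return None                               # do not check status
    if removal_time == "no data":
        return last_checked + timedelta(days=2)   # check once every 2 days
    return last_checked + timedelta(hours=8)      # otherwise, check after 8 hours

print(next_status_check("no data", datetime(2025, 1, 1)))
```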
Statuses that lead to changing to Deleted
- Presence of stop words
- Absence of title/translator
- 401, 404, 410, 451 status codes
- 403 status code (non-Cloudflare)
- 301, 302 redirecting to the homepage
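A hedged sketch of the decision rules above; the parameter names and the homepage-redirect flag are simplified placeholders, not the system's actual data model.

```python
from typing import List

def is_deleted(status_code: int, page_text: str, title_present: bool,
               stop_words: List[str], is_cloudflare: bool,
               redirected_to_homepage: bool) -> bool:
    """Approximate the 'changed to Deleted' rules listed above."""
    if any(word.lower() in page_text.lower() for word in stop_words):
        return True                               # presence of stop words
    if not title_present:
        return True                               # absence of title/translator
    if status_code in (401, 404, 410, 451):
        return True                               # removal status codes
    if status_code == 403 and not is_cloudflare:
        return True                               # 403 non-Cloudflare
    if status_code in (301, 302) and redirected_to_homepage:
        return True                               # redirect to homepage
    return False
```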
Enhanced Status Checking (2025): The system now uses curl_cffi scraper as a fallback when the primary scraper fails or times out. This provides better success rates for status checking, especially for JavaScript-heavy websites and sites with anti-bot protection.
Full Web Search
We have 5 servers involved in full web search (web1,web2,web3,web4,web5). Website search is run via three methods:
Run button on each website
When clicked, it goes through all projects with content type 31 and sets up search tasks for each project on the selected website. The tasks are then equally distributed among the 5 servers based on the content type of the website.
Project run button
This button runs both search engine tasks and website tasks for the selected project. For website tasks, it goes through all websites of content type 31 and equally distributes search tasks among all servers.
Content type run button
This button runs all projects pointed to the content type. Each content type has specific servers it runs its tasks on.
Email Notifications
Email notifications are sent for all new links. The notifications are divided into:
- Admin email notifications - We notify the admin of the websites on which the links have been found.
- Hosting email notifications - We search for the hosting email of the website and notify them to take down the link
- Cloudflare notifications - If the website is hosted on cloudflare, we send a separate cloudflare email.
Google Form
Sent within 1 minute after takedown date appears. We send links for up to 70 projects with the same publisher.
Algorithm
As soon as notifications are sent, tasks are created for google form. The tasks are stored in the table google_form. A script hosted on web8 named google_dmca_two.py runs all the time checking if there are any tasks in the table.
The tasks are sent to form url - https://reportcontent.google.com/forms/dmca_search
Blogspot Form
Sent within 1 minute after takedown date appears. We send links for up to 10 projects with the same publisher.
Algorithm
Projects are grouped by publisher. For each project we get the links not yet sent to the DMCA form, then create a Data Form with a maximum of 10 groups (one per project) and a maximum of 1000 links in total.
We use python playwright to send to url - https://reportcontent.google.com/forms/dmca_blogger
Bing Form
Sent in a similar manner as Google Form. We send links per project to the bing DMCA.
Algorithm
As soon as notifications are sent, tasks are created for bing form. The tasks are stored in the table google_form_bing. A script hosted on web8 named bing_dmca.py runs all the time checking if there are any tasks in the table.
Bing form url - https://www.bing.com/webmaster/tools/contentremovalform
Cloudflare Form
We post cloudflare form to the url https://abuse.cloudflare.com/api/v2/report/abuse_dmca
The url accepts the following POST parameters:
- Email of user
- Title of the request
- Name of the organization
- Address of the organization
- City of the organization
- Country of the organization in ISO format
- Organization name
- Phone number of the organization
- List of links containing original work
- Infringing urls
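A rough sketch of such a submission, assuming a JSON POST; the payload keys below are illustrative placeholders mapped to the parameter list above and do not necessarily match the actual Cloudflare API field names.

```python
import requests

# Illustrative only: payload keys are hypothetical placeholders,
# not the documented Cloudflare API schema.
payload = {
    "email": "user@example.com",                        # email of user
    "title": "DMCA takedown request",                   # title of the request
    "name": "Example Org",                              # name of the organization
    "address": "1 Example Street",                      # address of the organization
    "city": "Tallinn",                                  # city of the organization
    "country": "EE",                                    # country in ISO format
    "company": "Example Org",                           # organization name
    "tele": "+3725551234",                              # phone number of the organization
    "original_work": "https://example.com/official",    # links containing original work
    "urls": "https://pirate.example/infringing-page",   # infringing urls
}

resp = requests.post(
    "https://abuse.cloudflare.com/api/v2/report/abuse_dmca",
    json=payload,
    timeout=30,
)
print(resp.status_code, resp.text[:200])
```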
Counter Notice System
The Counter Notice System allows users to submit counter-notifications to Google for links that have been incorrectly flagged for removal. This system validates project owners have active Google Counter Users before processing counter notices.
Google Counter Users
Before submitting counter notices, the system validates that all project owners have active Google Counter Users configured. Each Google Counter User must have:
- Valid contact email matching project owner's email
- Active status (active = 1)
- Complete profile information (name, company, address, etc.)
- Valid cookies for Google authentication
Content Types
Fields:
- Google keywords (for pirate detection) - Specifies a list of words (separated by commas) that must be present on the page along with the Title of the project; if left blank, pages are not added.
- Specified content type field name - Changes the name of the Author field in the project to the desired one (does not affect the search).
- Check Specified content type field on page - Whether the presence of the Author field on pages will be checked.
- Swap project keywords - Changes the order in which the project's search keywords are formed.
- Stop words - We check for these words in the title of each link's content. If found, the link is not considered a pirate one.
- Screenshot - if this is set, we fetch screenshots for links of this content type
Proxies
We use proxies provided by Webshare. Proxies are used by fws, se, check status (all forms), test tool and link verification.
The proxies rotate once a month, between the 19th and 24th. Sometimes Webshare simply replaces certain proxies at any given time. After rotation, once the new proxies are provisioned, we have to download the new list and save it on all servers that require proxies, in the file /opt/aparser/files/proxy.txt.
Proxy API Details:
Proxy API host: https://proxy.webshare.io/api/v2/proxy/list/?mode=direct
Proxy authorization token: s7t89waym9igp51mxq0i3el4ac85qd2d5jfp5xqe
System
- User Group - by default, words from the 7th group are used in the search
- Content types - content types that are selected in the project
- Search keywords - contain templates for generating search phrases for each type of content
- Email templates - templates for letters; they are formed and stored as files
- Not deleted notification - a template for sending notifications about not deleted links
- White List - sites from this list are not added; all sites are listed without www
- Fake sites - such sites are not searched for links and no complaints are sent to them
- Upload hostings - a list of hosts that are considered file services
Removal Time
We update each website's average removal time. The algorithm works as below (see the sketch after the steps):
- Step 1:
- Select links (Deleted and Not Deleted) for the given website which were added within the last 3 months.
- If there are no links from the above step, select the last 100 links (Deleted and Not Deleted) for the website.
- Step 2: For each link, get its removal time. This is calculated as the difference between the time takedown notices were sent and the time the system detected the link as deleted.
- Step 3: Take the average of the times found and set this as the website removal time.
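A simplified sketch of the averaging step, assuming each link record carries a notice-sent timestamp and (when available) a detected-deleted timestamp; links without a detection time are skipped in this sketch.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

def average_removal_time(
    links: List[Tuple[datetime, Optional[datetime]]]
) -> Optional[timedelta]:
    """links: (notice_sent_at, detected_deleted_at) pairs; deleted_at may be None."""
    removal_times = [
        deleted_at - sent_at
        for sent_at, deleted_at in links
        if deleted_at is not None and deleted_at > sent_at
    ]
    if not removal_times:
        return None
    return sum(removal_times, timedelta()) / len(removal_times)

links = [
    (datetime(2025, 1, 1), datetime(2025, 1, 4)),    # removed after 3 days
    (datetime(2025, 1, 2), datetime(2025, 1, 10)),   # removed after 8 days
]
print(average_removal_time(links))   # average -> 5 days, 12:00:00
```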
Text Search Engine
This application is a C++ server that listens for HTTP POST requests on a given port and supports two main endpoints:
/add and /search
Key Functionalities
- Web Scraping: Using cURL to send HTTP requests and retrieve webpage content.
- Proxy Handling: Reads proxy details from a file and integrates them into web requests.
- User-Agent Randomization: Dynamically fetches and utilizes different user-agents to mimic browser behavior.
- Data Cleaning: Removes unwanted characters, scripts, and HTML tags from scraped content.
- Database Interaction: Connects to a PostgreSQL database to fetch and update data related to projects and web pages.
- Multithreading: Uses std::async to process multiple web pages concurrently.
- Timeout Management: 30-second timeout with automatic fallback to curl_cffi scraper
- Enhanced Fallback Chain: Integrated with curl_cffi, Playwright, and Selenium for maximum success rates
2025 Enhancement: The C++ scraper now includes timeout management and seamless integration with the curl_cffi Python scraper. When the C++ scraper takes longer than 30 seconds or encounters specific error codes, it automatically falls back to curl_cffi for better success rates.
curl_cffi Scraper Integration
The system now includes an advanced curl_cffi-based scraper that provides enhanced web scraping capabilities with better anti-detection features and improved performance for modern websites.
New in 2025: curl_cffi integration provides better success rates for JavaScript-heavy websites and enhanced proxy support with automatic retry mechanisms.
Key Features
- Browser Fingerprinting: Uses Chrome 110 impersonation for better compatibility with modern websites
- Proxy Integration: Seamless integration with existing proxy infrastructure from /root/flask/proxies/proxy.txt
- User Agent Rotation: Dynamic user agent fetching from the Node.js script /root/user_agents.js
- Automatic Retry Logic: Built-in retry mechanism for 500 status codes (up to 3 attempts)
- Home Redirect Detection: Intelligent detection of homepage redirects for better link classification
- Timeout Management: 30-second timeout with automatic fallback to curl_cffi when scraper hangs
Integration Architecture
The curl_cffi scraper is integrated into the existing content fetching pipeline:
Fallback Chain
- Primary Scraper: Traditional C++ scraper (with 30-second timeout)
- curl_cffi Fallback: Activated on timeout, 500 errors, or specific status codes (403, 429, 503)
- Playwright/Selenium: Used for JavaScript-heavy content when curl_cffi returns incomplete data
- Aparser: Final fallback for Cloudflare-protected websites
Status Code Handling
- 200, 300 Codes: Properly processed and used when returned from curl_cffi
- 500 Codes: Automatic retry with 2-second delays between attempts
- Home Redirects: Automatically detected and classified as status code 300
- Timeout Scenarios: Immediate fallback to curl_cffi after 30-second scraper timeout
Configuration
The curl_cffi scraper uses the following configuration:
Headers:
- User-Agent: Dynamic from /root/user_agents.js
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: en-US,en;q=0.9
- Accept-Encoding: gzip, deflate, br
- Referer: https://www.google.com/
- Connection: keep-alive
Timeouts:
- Connect Timeout: 60 seconds
- Read Timeout: 120 seconds
- Scraper Timeout: 30 seconds (before curl_cffi fallback)
Retry Logic:
- Max Retries: 3 attempts for 500 errors
- Retry Delay: 2 seconds between attempts
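A sketch of a request using these settings with the curl_cffi requests API; the proxy value and target URL are placeholders, and the real scraper additionally fills in the User-Agent dynamically.

```python
import time
from curl_cffi import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "Connection": "keep-alive",
    # User-Agent would be filled in dynamically from /root/user_agents.js
}

def fetch(url: str, proxy: str, max_retries: int = 3):
    """Fetch a page with Chrome 110 impersonation, retrying 500 responses."""
    for _ in range(max_retries):
        resp = requests.get(
            url,
            headers=HEADERS,
            impersonate="chrome110",              # browser fingerprinting
            proxies={"http": proxy, "https": proxy},
            timeout=120,                          # read timeout from the config above
        )
        if resp.status_code != 500:
            return resp
        time.sleep(2)                             # 2-second delay between attempts
    return resp

# Example (placeholder values):
# print(fetch("https://example.com", "http://user:pass@proxy.example:8080").status_code)
```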
Performance Improvements
Enhanced Success Rates:
- JavaScript-Heavy Sites: Better handling of sites that load content dynamically
- Anti-Bot Protection: Improved bypass capabilities for modern protection systems
- Timeout Recovery: No more hanging scrapers - automatic fallback after 30 seconds
- Proxy Reliability: Better proxy rotation and error handling
Playwright Enhancements
Playwright has been optimized to work better with the curl_cffi integration:
- Increased Timeout: Default timeout increased to 60 seconds for better JavaScript loading
- Wait Strategy: Uses wait_until="load" for more reliable page loading
- Fallback Integration: Automatically triggered when curl_cffi returns incomplete content
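A minimal sync-API sketch of these Playwright settings; the target URL is a placeholder.

```python
from playwright.sync_api import sync_playwright

def fetch_with_playwright(url: str) -> str:
    """Load a page with the increased 60-second timeout and wait_until='load'."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="load", timeout=60_000)  # 60 seconds
        html = page.content()
        browser.close()
        return html

# print(fetch_with_playwright("https://example.com")[:200])
```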
Link Checker Package
The curl_cffi functionality is also available through the Link Checker Python package:
Dependencies:
- curl-cffi>=0.5.0
- psycopg2-binary>=2.9.0 (PostgreSQL support)
- beautifulsoup4>=4.9.0
- selenium>=4.0.0
- playwright>=1.20.0
Deployment Notes:
- Ensure /root/flask/proxies/proxy.txt exists and contains valid proxies
- Verify the /root/user_agents.js script is executable and returns valid user agents
- curl_cffi requires Python 3.7+ and may need compilation on some systems
- Monitor proxy rotation schedules (19th-24th of each month) for uninterrupted service
User Agents
The User Agents page provides a comprehensive management interface for user agent strings used throughout the system. User agents are essential for web scraping operations as they help mimic real browser behavior and avoid detection by anti-bot systems.
Features
- User Agent Database: Stores a collection of user agent strings with metadata including platform, device category, and viewport dimensions
- Search Functionality: Search user agents by user agent string, platform, or device category using case-insensitive matching
- Pagination: Displays 10 user agents per page with navigation controls
- Import from GitHub: One-click import of user agents from the intoli/user-agents repository
User Agent Information
Each user agent entry contains the following information:
- ID: Unique identifier for the user agent record
- User Agent: The complete user agent string (e.g., "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...")
- Created At: Date and time when the user agent was added to the database
- Platform: Operating system or platform (e.g., "Windows", "Mac OS X", "Linux", "Android", "iOS")
- Device Category: Type of device (e.g., "desktop", "mobile", "tablet")
Importing User Agents
The system can import user agents from the intoli/user-agents GitHub repository, which provides a comprehensive collection of real-world user agent strings.
Import Process:
- Click the "Add User Agents" button on the User Agents page
- The system downloads a gzipped JSON file from https://raw.githubusercontent.com/intoli/user-agents/main/src/user-agents.json.gz
- The file is automatically decompressed and parsed
- Existing user agents are cleared (truncated) from the database
- New user agents are imported with the following fields:
- User agent string
- Platform information
- Device category
- Viewport height and width
- Date added timestamp
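A simplified sketch of this import flow, assuming the user_agents table described under Database Schema below and the JSON field names published by the intoli repository; error handling and batching are omitted, and conn is an open psycopg2 connection.

```python
import gzip
import json

import requests

URL = "https://raw.githubusercontent.com/intoli/user-agents/main/src/user-agents.json.gz"

def import_user_agents(conn) -> int:
    """Download, decompress, and import the intoli user-agents collection."""
    raw = requests.get(URL, timeout=60).content
    agents = json.loads(gzip.decompress(raw))

    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE user_agents")   # existing agents are cleared
        for a in agents:
            cur.execute(
                """
                INSERT INTO user_agents
                    (useragent, platform, "deviceCategory",
                     "viewportHeight", "viewportWidth", date_added)
                VALUES (%s, %s, %s, %s, %s, NOW())
                """,
                (
                    a.get("userAgent"),
                    a.get("platform"),
                    a.get("deviceCategory"),
                    a.get("viewportHeight"),
                    a.get("viewportWidth"),
                ),
            )
    conn.commit()
    return len(agents)
```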
Usage in System
User agents from this database are used throughout the system for:
- Web Scraping: Random user agent selection for HTTP requests to avoid detection
- curl_cffi Scraper: Dynamic user agent rotation for enhanced anti-detection capabilities
- Status Checking: User agent rotation when checking link status
- Form Submissions: Browser-like user agents for DMCA form submissions
Best Practices:
- Regularly update the user agent database to include the latest browser versions
- Use the import feature periodically to refresh the collection with new user agents
- Search functionality helps identify specific user agents for testing or debugging
- The system automatically selects random user agents from the database for each request
Database Schema
The user agents are stored in the user_agents table with the following structure:
- id - Primary key (auto-increment)
- useragent - The user agent string
- date_added - Timestamp when the record was created
- platform - Operating system/platform name
- deviceCategory - Device type (desktop, mobile, tablet)
- viewportHeight - Viewport height in pixels
- viewportWidth - Viewport width in pixels
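As an illustration of how a random user agent might be picked from this table, a sketch using an open psycopg2 connection (not the application's actual query):

```python
def random_user_agent(conn, device_category: str = "desktop") -> str:
    """Pick one random user agent string for the given device category."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT useragent
            FROM user_agents
            WHERE "deviceCategory" = %s
            ORDER BY RANDOM()
            LIMIT 1
            """,
            (device_category,),
        )
        row = cur.fetchone()
    # Fallback value if the table has no matching rows.
    return row[0] if row else "Mozilla/5.0"
```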
AWS Credentials (SMTP Password Generator)
The AWS Credentials page provides a tool for generating SMTP passwords for Amazon SES (Simple Email Service). This utility simplifies the process of creating SMTP credentials needed for sending emails through AWS SES.
Purpose
When configuring SMTP settings for AWS SES, you need to generate an SMTP password from your AWS access key secret. This page automates that process by using AWS SDK to convert your AWS secret access key into an SMTP password that can be used with SES SMTP endpoints.
Features
- SMTP Password Generation: Converts AWS secret access keys into SMTP passwords
- AWS Region Selection: Supports all major AWS regions for SES
- AJAX-Based Interface: Real-time password generation without page refresh
- Copy to Clipboard: One-click copy functionality for generated passwords
Required Information
To generate an SMTP password, you need to provide:
- Email: The email address associated with the AWS SES account
- SMTP Username: Your AWS access key ID (e.g., AKIA...)
- SMTP Secret: Your AWS secret access key
- AWS Region: The AWS region where your SES is configured
Supported AWS Regions
The tool supports the following AWS regions:
- US Regions: us-east-2, us-east-1, us-west-2, us-gov-west-1
- Asia Pacific: ap-south-1, ap-northeast-2, ap-southeast-1, ap-southeast-2, ap-northeast-1
- Europe: eu-central-1, eu-west-1, eu-west-2, eu-west-3, eu-south-1, eu-north-1
- Canada: ca-central-1
How It Works
Generation Process:
- Enter your email address, SMTP username (AWS access key ID), and SMTP secret (AWS secret access key)
- Select the appropriate AWS region from the dropdown menu
- Click "Generate SMTP Password" button
- The system executes a Python script (/var/www/credentials.py) that uses the AWS SDK to generate the SMTP password
- The generated password is displayed in a read-only field
- Use the "Copy Password" button to copy the password to your clipboard
Technical Details
The password generation is handled by a Python script located at /var/www/credentials.py. This script:
- Uses AWS SDK (boto3) to convert the secret access key to an SMTP password
- Applies AWS's SMTP password generation algorithm based on the selected region
- Returns the generated password as output
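For reference, AWS also documents a pure-HMAC derivation of the SES SMTP password from a secret access key; the sketch below follows that documented algorithm (the actual credentials.py may implement it differently, e.g. via boto3).

```python
import base64
import hashlib
import hmac

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def ses_smtp_password(secret_access_key: str, region: str) -> str:
    """Derive an SES SMTP password from an AWS secret access key (AWS-documented algorithm)."""
    signature = _sign(("AWS4" + secret_access_key).encode("utf-8"), "11111111")
    signature = _sign(signature, region)
    signature = _sign(signature, "ses")
    signature = _sign(signature, "aws4_request")
    signature = _sign(signature, "SendRawEmail")
    return base64.b64encode(bytes([0x04]) + signature).decode("utf-8")

# Example (placeholder secret):
# print(ses_smtp_password("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "eu-west-1"))
```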
Usage in System
Generated SMTP passwords are used for:
- Email Notifications: Sending takedown notifications and system emails via AWS SES
- User SMTP Configuration: Users can configure their own SMTP settings in account settings
- Cloudflare Form Submissions: Email notifications sent through AWS SES
- System Communications: Automated email communications throughout the platform
Security Notes:
- Keep your AWS secret access keys secure and never share them publicly
- The generated SMTP password is specific to the AWS region selected
- If you regenerate your AWS access keys, you'll need to generate a new SMTP password
- SMTP passwords are different from your AWS console login password
- Ensure the Python script at /var/www/credentials.py has proper permissions and the AWS SDK installed
SMTP Configuration
Once you have the generated SMTP password, you can configure your email client or application with:
- SMTP Host: email-smtp.{region}.amazonaws.com (e.g., email-smtp.eu-west-1.amazonaws.com)
- SMTP Port: 25, 465 (SSL), or 587 (TLS)
- SMTP Username: Your AWS access key ID
- SMTP Password: The generated password from this tool
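A minimal sketch of testing these SMTP settings with Python's smtplib; addresses and credentials are placeholders, and the sender must be a verified SES identity.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "email-smtp.eu-west-1.amazonaws.com"   # email-smtp.{region}.amazonaws.com
SMTP_PORT = 587                                    # TLS

msg = EmailMessage()
msg["Subject"] = "SES SMTP test"
msg["From"] = "sender@example.com"                 # placeholder, verified SES identity
msg["To"] = "recipient@example.com"                # placeholder
msg.set_content("Test message sent through AWS SES SMTP.")

with smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=30) as server:
    server.starttls()                              # upgrade to TLS on port 587
    server.login("AKIAXXXXXXXXXXXXXXXX",           # SMTP username = AWS access key ID
                 "generated-smtp-password")        # password generated by this tool
    server.send_message(msg)
```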
Best Practices:
- Store generated SMTP passwords securely in your user account settings
- Use IAM users with SES-specific permissions rather than root AWS credentials
- Regularly rotate AWS access keys and regenerate SMTP passwords accordingly
- Test SMTP connectivity after generating new passwords
Unit Test
We developed a unit test tool for checking functionality before uploading any changes to GitLab. This tool currently works only for axg.house.
The smallest testable parts of the application, called units, are individually and independently scrutinised for proper operation to ensure that each part is error-free (and secure). We use PHPUnit for our testing, and it runs on web1.
Database Update
The database on the main server is constantly updated, both by users who regularly update parts of it and by automatic scripts running all the time. Some of these updates need to be reflected on the webX servers, since these have their own copies of the database.
To achieve this, each of the webX servers has a script that updates its database every hour.
PostgreSQL-Specific Considerations
When working with PostgreSQL databases across multiple servers:
- Sequence Synchronization: After bulk data operations or data imports, PostgreSQL sequences may need to be reset to prevent duplicate key errors.
- Transaction Handling: PostgreSQL uses true MVCC (Multi-Version Concurrency Control), allowing better concurrent read/write operations without blocking.
- Connection Management: PostgreSQL connections are more efficient, but connection pooling (PgBouncer) is recommended for high-traffic scenarios.
- Data Type Compatibility: PostgreSQL is stricter with data types - ensure proper type casting when transferring data between servers.
Pirate/Links Section
The system gets its links from either search engines or full website search. The searches are done on separate servers named web1, web2, web3, web4, web5. Results/links from these servers are queued on the main server and slowly added into the database.
They are stored in a table named reference. This is the table we read from to display links in the UI. Some websites have several mirrors; for some of these (i.e. blogspot.com, wordpress.com and tumblr.com), we combine all the mirrors and store them under the parent website.
All links displayed in the UI are fetched from the reference table. For websites whose removal time is >14days, links older than 2024-08-01 are moved to a backup table named reference_backup.
Scan Official Links
Each content type is set with an official Amazon URL used to check for the project's official links. For every new project we scan links from this Amazon URL, checking against the project title and project author (if present).
A cron task for this has been set up on web6 (212.83.171.22:32024) via a script named scan_official_pages_schedule.py. It scans for links by checking the content provided by the a-parser Shop::Amazon parser. The cron script runs every hour.
Hosting Email
Website IPs keep changing. Every 24 hours the system checks each website's current IP. The Python script is run_update_ip.py, located in /root/pythons.
If the current IP differs from the previously found IP, the IP is updated and the previous one is cached for reference. The script also checks websites without hosting emails and updates them. Whenever the IP change runs, the website needs to be checked for a new hosting email.
RIPE Database Search
The main search URL is https://rdap.arin.net/registry/ip/. This function runs on web7. The Python script is run_update_hosting_email.py.
Algorithm
- If only one email found, just add it
- If more than one email found:
- If any of the emails contains the word 'abuse' or variants of the word, add it
- If none contains the word 'abuse', then search by the list of keywords
- If more than one email contains a keyword, add the first email found.
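A sketch of this selection logic; the keyword list is a placeholder, not the system's actual list.

```python
from typing import List, Optional

KEYWORDS = ["report", "noc", "support", "admin"]   # placeholder keyword list

def pick_hosting_email(emails: List[str]) -> Optional[str]:
    """Choose a hosting email according to the rules above."""
    if not emails:
        return None
    if len(emails) == 1:
        return emails[0]                       # only one email found: just add it
    for email in emails:
        if "abuse" in email.lower():           # prefer 'abuse' (or variants)
            return email
    for email in emails:
        if any(k in email.lower() for k in KEYWORDS):
            return email                       # first email matching a keyword
    return None

print(pick_hosting_email(["noc@host.example", "abuse@host.example"]))  # abuse@host.example
```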
Admin Email
We need to add an admin email for all new websites. Every 2 minutes, the system checks whether any new websites have been added. If any are found, it starts the process of finding the admin email.
The Python script is run_email_search.py, located in /home/moses/scripts on web7. The process works by first extracting links from the homepage of the website.
Process
- Extract links from the homepage of the website
- Each link has an anchor tag (email keyword); this anchor text is compared to a list of allowable keywords
- If the keyword matches, we then extract content from the link.
- We then extract emails from the links and create a list of extracted emails
Email MX Records
- For each email from the list, we check if it is a disallowed email (checked against a list of forbidden emails)
- If the above check passes, we then check the MX record of the email's domain.
- If the MX check passes, the email is added under the particular website.
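A sketch of the MX check, assuming the dnspython package (the document does not specify which DNS library the script uses); the forbidden-email list is a placeholder.

```python
import dns.exception
import dns.resolver   # pip install dnspython

FORBIDDEN = {"noreply@example.com"}   # placeholder list of forbidden emails

def email_passes_checks(email: str) -> bool:
    """Reject forbidden emails, then require at least one MX record for the domain."""
    if email.lower() in FORBIDDEN:
        return False
    domain = email.rsplit("@", 1)[-1]
    try:
        answers = dns.resolver.resolve(domain, "MX")
        return len(answers) > 0
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return False

print(email_passes_checks("webmaster@gmail.com"))
```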
Registration
User registration system with email verification and account management features.
Cookies
In order to access content on social media, we use cookies. We have created users for Facebook, Reddit, TikTok and Instagram.
Cookies for these users are regularly updated when they expire, or manually. Sometimes cookies expire and we need to update them immediately; for this, we developed a script that regularly checks them and sends an email about it. The script is named cookies_expiry_email.py and runs every 4 hours on web5.
Google and Bing Form
Google and Bing form users also use cookies. These cookies sometimes expire and need to be updated. Whenever they expire, Bing/Google sends an email about it.
Remov.ee Platform
Note: This section documents the Remov.ee platform, which is a separate service from the main AXGHOUSE system. The Remov.ee platform is located at https://remov.ee and operates on the santhosh/remov branch of the repository.
General Description
Remov.ee is a self-service platform that allows users to create projects for content removal requests. The project repository is located on the santhosh/remov branch. Before cloning, you need to check the relevance with the repository on the server.
Test Server
A test server has been created at https://new.remov.ee using a new branch remov_test created from the santhosh/remov branch. This test server is used for uploading new changes before pushing to the live server.
Remov.ee Users
Personal User Settings
Users can manage their personal settings including:
- Email Updates: Update email address and other personal identifying information
- Country Information: Update country and location details
- Subscriptions Management: Manage active subscriptions and payment methods
- Password Management: Change password and delete user account
Remov.ee Projects
Creating Projects
The projects page displays a listing of all projects and their expiry information. Each project has the following action icons:
- Analytics: Displays analytic graphs for the project
- View Links: View all links associated with the project
- Edit: Change project details
- Copy: Duplicate the project
Project Creation Process
When creating a new project, the first step is to add a link:
- The link MUST be a valid URL. The system checks for invalid URLs using JavaScript and reports errors
- The link is verified via Selenium on a special server named LinkVerifier at IP 135.181.199.23 (same process as in registration)
- When adding the link, the system performs the following checks:
- Check if the link already exists in the system
- Check HTTP code using Selenium
- Proceed only if Selenium is NOT 404 and link is NOT whitelisted
- Processing also occurs if the website exists in our websites list
- If all checks pass, the link is added to the system as a Google link (Google SE code is 0)
- The link is also added to a temp table that contains recently added links
Project Status
Projects can be either:
- Active (Enabled): Paid projects that are actively processing links
- Inactive (Disabled): Projects not paid for. Users can activate them by clicking the Pay button
Payment and Invoicing
- Projects already paid for can have their invoices downloaded
- Tax is added to a project at a rate of 22% for users from EU countries who do not have valid VAT
- When a project expires (either by reaching expiry period or when admin deactivates it), the user can reactivate by making a payment
Project Deletion
Projects can be deleted by the admin. Upon deletion, everything associated with the project is deleted, including:
- Links
- Orders
- Documents
- All related data
An email is sent to the user when their projects get deleted.
Project Expiration
When projects are about to expire, email reminders are sent to project owners. The system uses a project email reminder template that describes how far in advance the email reminder should be sent. The email reminder is sent at 11 PM.
Adding New Links
Users can add new links to existing projects via:
https://remov.ee/pirate/add?project={project_id}
- A user is allowed to add a maximum of 100 links within one form
- Each project has a monthly limit of links to be added
- Links to be added plus those already added for the month should not exceed the monthly limit
- Users can add screenshots alongside the links. These screenshots are editable
- Screenshot editing example: https://remov.ee/pirate/edit/{link_id}
Link Status Change
Link status progression:
- Pending Review: Initial status when a link is added
- Pending Removal: Status changes to this after notifications are sent
- Deleted: Status changes to this when the link gets deleted
- Not Deleted: After 14 days if link has not been deleted, status is set to Not Deleted
Remov.ee Pricing
Standard Account Pricing
Prices for copyright content types (video, audio, image, text, software):
| Plan | Monthly Price |
|---|---|
| One link per month | 49 EUR |
| Up to 10 links a month | 99 EUR |
| Up to 100 links a month | 199 EUR |
| Up to 500 links a month | 299 EUR |
| Up to 1k links a month | 599 EUR |
| Up to 10k links a month | 999 EUR |
Impostor Account Pricing
For Impostor accounts, different pricing applies:
| Plan | Monthly Price |
|---|---|
| One link per month | 339 EUR |
| Up to 10 links a month | 990 EUR |
| Up to 100 links a month | 1990 EUR |
| Up to 1k links a month | 2990 EUR |
| Up to 10k links a month | Hidden (may be used in future) |
Monthly Limits: A user is not allowed to add more links than the monthly limit allows. If they want to do so, they need to move to a higher package or create a new project.
Calculating Tax
Tax is added to all payments as follows:
- European Union Users:
- If VAT number is not valid, add 22% tax
- If VAT number is not entered, add 22% tax
- If user is from Estonia, add 22% tax regardless of VAT validity
- Non-EU Users: No tax is added
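A sketch of these rules; VAT validation itself (e.g. via VIES) is out of scope here, so a missing or invalid VAT number is represented by vat_valid=False.

```python
TAX_RATE = 0.22   # 22%

def tax_amount(net_price: float, in_eu: bool, country: str, vat_valid: bool) -> float:
    """Compute the tax to add to a payment according to the rules above."""
    if not in_eu:
        return 0.0                                 # non-EU users: no tax
    if country.upper() == "EE":
        return round(net_price * TAX_RATE, 2)      # Estonia: 22% regardless of VAT validity
    if not vat_valid:                              # missing or invalid VAT number
        return round(net_price * TAX_RATE, 2)
    return 0.0                                     # EU user with a valid VAT number

print(tax_amount(199.0, in_eu=True, country="DE", vat_valid=False))   # 43.78
```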
Remov.ee Registration
Registration of users with project creation and online payment in accordance with chosen options.
Registration Process
The first step before registration is to check the pirate URL:
- The link MUST be a valid URL. The system checks for invalid URLs using JavaScript and reports errors
- The link is then checked via Selenium on IP 135.181.199.23
- The Selenium API has been modified from the standard code so that any codes found in the title tags of the content are returned as HTTP codes. The system checks for "Not found" in title tags plus any other HTTP codes that were previously returned
- Proceed only if Selenium is NOT 404 and link is NOT whitelisted. Processing also occurs if the website exists in our websites list
- If all checks pass, the link is added to the system as a Google link (Google SE code is 0)
Tax Calculation During Registration
Tax is calculated during registration as follows:
- European Union Users:
- System asks for a valid VAT number
- If VAT number is not valid, add 22% tax
- If VAT number is not entered, add 22% tax
- Estonian Users: Add 22% tax regardless of whether they have valid VAT or not
- Non-EU Users: No tax is added
Remov.ee Link Verification
Link Verification Process
Whenever we need to create a project or register a user, we need to verify the infringed link which needs to be deleted. The verification process includes:
- URL Validity Check: First, check whether the URL is valid or not. If it fails, the submit button won't be clickable
- Government Domain Check: Check whether the URL contains ".gov" or not
- Duplicate Check: Check whether the link already exists in our system or not
- Flask App Verification: A Flask app runs in the background to check whether the link is valid, deleted, or not deleted
Flask App Automation
The Flask app uses different types of automation frameworks:
- Selenium: Currently, most link verification is done by Selenium using proxies
- Playwright: Alternative automation framework for link verification
Whenever any user tries to verify a link, the Flask app scrapes the URL using Selenium and checks for deleted stop words. If deleted stop words are found, it won't allow the user to register and create a new project.
Invalid Links Exclusion
We have added regular expressions for certain websites to exclude bad links (which are not direct links) from those websites. Whenever a user tries to verify such a link, the system shows a message on the UI: "Link is not a direct link".
Link Verification Server (flask_server.cpp)
The link verification process is handled by a C++ server application located at flask_server.cpp. This server provides HTTP-based link verification functionality.
Standard Libraries Used
- <iostream>: Input-output stream functionality
- <cstdlib>: Conversions, random number generation
- <cstring>: C-style string manipulation
- <vector>: Dynamic arrays
- <string>: String operations
- <fstream>: File input and output
- <algorithm>: C++ algorithms
- <regex>: Regex-based operations
- <ctime>: Date and time functionality
- <netinet/in.h>, <sys/socket.h>, <arpa/inet.h>, <unistd.h>: Network programming
- <thread>: Multi-threading
- <mysql/mysql.h>: MySQL database interaction
- <curl/curl.h>: HTTP requests
Global Variables
- vector<string> v: Holds proxy addresses
- vector<string> useragents: Stores user-agent strings
- vector<string> all_keywords: Contains keywords used in content filtering
Key Functions
- is_int(char *c): Checks if a string represents an integer
- write_content(): CURL callback to write content fetched from a URL
- finish_with_error(): Handles MySQL errors and closes connections
- random_string(): Generates random alphanumeric strings
- get_user_agent(): Fetches user-agent strings from external script
- get_proxies(): Loads proxy addresses from a file
- split(): Splits a string into tokens based on delimiter
- clean_content(): Removes unwanted characters from content
- replaceHTMLENTITIES(): Replaces HTML entities with characters
- getCode(): Extracts HTTP response code from header content
- extractTitle(): Extracts <title> content from HTML
- replaceAll(): Removes HTML tags or content between markers
- getContents(): Removes all HTML tags from response body
- getContent(): Fetches HTML content using proxy and random user-agent
- processFacebook(): Handles URL content extraction for Facebook via Python script
- useSelenium(): Uses Selenium automation via Python to fetch content
- generateResponse(): Main logic to process URLs, query database, and filter content
- connection_handler(): Handles incoming HTTP requests
- main(): Entry point, sets up socket server and spawns threads
Compilation and Usage
# Compile the program
g++ flask_server.cpp -o flask_server -lmysqlclient -lcurl -lpthread
# Run the server with a specified port
./flask_server <port_number>
Server Behavior:
- Listens for HTTP requests
- Processes URLs to fetch and filter content
- Communicates with MySQL database for configuration and keyword checks
Remov.ee Scheduler
We have three schedulers running on the remov.ee server for sending emails. All functions are configured in the controller Expired.php:
Scheduled Tasks
- Project Expiry Reminder:
- Schedule: 11 PM
- Purpose: Sends email to remind users that their projects are expiring in 24 hours
- Project Expiry:
- Schedule: Midnight
- Purpose: Sends email to users that their project has expired
- Link Status Changed:
- Schedule: 9 AM (only for previously changed links to Deleted or Denied)
- Purpose: Sends emails to users whenever the status of any link added by remov.ee is changed
Remov.ee Analytics
Displays links in the following categories:
- Grouped by date added: Links organized by when they were added to the system
- Grouped by search engines: Links categorized by the search engine that found them
- Link categories: Links categorized as either:
- Torrents
- Free downloads
- Messengers
- Fake sites
- Cyber lockers
- Social networks
- Link shorteners
Remov.ee User IP Address Detection
The system uses ipinfo.io service for IP address detection and tracking.
Configuration
- Service URL: https://ipinfo.io/
- API Token: 40b2dd7fa5fe8c
- Documentation: https://ipinfo.io/developers
Purpose
This service is developed to record and track users' and guests' actions from https://axg.house/statistics. It provides geolocation and other IP-related information for analytics and security purposes.
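A minimal lookup sketch against the ipinfo.io API using the token configured above; the example IP is a placeholder.

```python
import requests

IPINFO_TOKEN = "40b2dd7fa5fe8c"   # token from the configuration above

def lookup_ip(ip: str) -> dict:
    """Fetch geolocation details for an IP address from ipinfo.io."""
    resp = requests.get(
        f"https://ipinfo.io/{ip}",
        params={"token": IPINFO_TOKEN},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

info = lookup_ip("8.8.8.8")
print(info.get("city"), info.get("country"), info.get("org"))
```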