AXGHOUSE Documentation
Welcome to the comprehensive documentation for the AXGHOUSE platform. This guide covers all aspects of the system, from basic setup to advanced features.
What is AXGHOUSE?
AXGHOUSE is a comprehensive platform for managing digital content protection, automated takedown processes, and link verification systems. It provides powerful tools for content creators and copyright holders to protect their intellectual property across the web.
Key Features
- Automated Link Detection - Advanced search algorithms to find unauthorized content
- Multi-Platform Integration - Support for Google, Bing, Cloudflare, and more
- PostgreSQL Performance - 40-50% faster performance with advanced database features
- Real-time Monitoring - Continuous monitoring and status checking
- Comprehensive Analytics - Detailed reporting and insights
Getting Started
This documentation is organized into several sections to help you navigate the platform effectively:
- Core Components - Essential system components and their functions
- Features - Platform features and capabilities
- Forms & Integration - Third-party integrations and form submissions
- System Management - Administrative tools and system configuration
- Advanced - Advanced features and technical details
General Description
The project repository is at https://gitlab.com/Axghouse/axghouse. Before cloning, check that it is in sync with the repository on the server. The system consists of the following main components:
- Supervisor - Used to run the project schedule. supervisorctl status shows all running workers; supervisorctl restart all restarts them. Supervisor conf files are located in /etc/supervisor/conf.d.
- PHP Scripts - Scripts from the application/crons folder are launched from cron; a description of these scripts is given below.
- Python Scripts - Scripts written in Python for adding links, checking deleted content, and email search, located in the /root/pythons folder.
- Backend Database - PostgreSQL database (migrated from MySQL in November 2025) with optimized connection pooling and enhanced performance features.
Background Scripts
Script Schedule
The schedule of scripts can be seen by running crontab -e.
Scripts (script path /var/www/html/app/crons omitted):
| Operating Mode | Script Name | Description |
|---|---|---|
| Hourly | cron_add_pirate.sh | Creates search engine tasks for projects by phrases |
| Daily | cron_all | Clears old records and links |
| Once a day at 11pm | cron_not_deleted_notification.sh | Sends a report to the user about not deleted links |
| Every 2 hours | cron_delete_content_detect_executors.py | Checks for deleted content |
| Every midnight | cron_check_expired_users.sh | Checks users for expiration dates |
| Hourly | cron_create_project_schedule.sh | Adds schedules for recently added projects |
| Daily | cron_check_disabled_project.sh | Checks projects that have expired and disables them |
| Constant | cron_mail_send.sh | Sends takedown notifications |
| Constant | cron_cloudflare_send.sh | Sends Cloudflare forms |
PostgreSQL Database
The system has been successfully migrated from MySQL to PostgreSQL, providing significant performance improvements and advanced features:
Key Benefits
- Performance Improvements: 40-50% faster overall system performance, with 60% improvement in concurrent operations and 33x faster full-text search operations
- ACID Compliance: Full ACID compliance guaranteed at all times, ensuring better data integrity and consistency
- MVCC (Multi-Version Concurrency Control): True MVCC - readers don't block writers, writers don't block readers, resulting in 50-60% better concurrent performance
- Advanced Features: Native JSONB support, full-text search with GIN indexes, window functions, materialized views, and partial indexes
- Better Scalability: Handles large datasets (100M+ rows) better with efficient partitioning and logical replication support
- Resource Efficiency: 20% less CPU usage, 15% less memory consumption, and 25% less disk I/O operations
PostgreSQL Configuration
The default database connection is configured to use PostgreSQL in config/database.php. The system uses:
- PostgreSQL driver: pgsql
- Connection pooling for optimized resource usage
- UTF-8 character encoding
- SSL mode: prefer (for secure connections)
- Optimized PDO settings for PostgreSQL compatibility
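For Python-side scripts, a minimal connection sketch matching these settings might look as follows; the database name, user, and password are hypothetical placeholders, not values from this documentation.

```python
import psycopg2

# Sketch only: host, dbname, user, and password are placeholders.
conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="axghouse",      # placeholder database name
    user="axghouse_user",   # placeholder user
    password="secret",      # placeholder password
    sslmode="prefer",       # SSL mode: prefer, as configured above
)
conn.set_client_encoding("UTF8")  # UTF-8 character encoding

with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])

conn.close()
```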
Performance Metrics
Post-migration performance improvements:
- Complex SELECT with JOINs: 39% faster (850ms → 520ms)
- Full-Text Search: 33x faster (3,200ms → 95ms)
- Bulk INSERT (10K rows): 60% faster (45s → 18s)
- User Dashboard Load: 63% faster (1,850ms → 680ms)
- Email Processing: 60% faster (1,000/min → 1,600/min)
- Concurrent Users: 2x capacity (50 → 100+ users)
Python Scripts
Note: All Python scripts have been migrated to use PostgreSQL (via psycopg2 library) instead of MySQL. The scripts connect to the PostgreSQL database using the same connection parameters as the main application.
These are located in the /root/pythons folder:
Key Scripts
- update_website_removal_time.py: Runs every 4 hours - updates website removal time for websites whose wrt (website removal time) is not >14days
- run_add_axgbot.py: Final confirmation for full website search links. Runs every 31 minutes.
- run_add_mirrors.py: Runs every minute to add mirror links from manually added links
- send_mail.py: Sends email for websites that become >14days
- send_mail_two.py: Sends email for websites whose removal time is about to increase and that have pending links
- run_fws.py: Creates fws tasks spread across all participating servers
- update_removal_time.py: Updates website removal time for websites that have recently deleted links
- delete_user_projects.py: Deletes projects for users that have been disabled
- check-supervisor.py: Ensures that supervisor is running all the time
- backup_daily.py: Runs every midnight and backs up old links into the reference_backup table
Update IP
run_update_ip.py -- This adds IPs to newly added websites. After the IP is found, the website is queued for adding a hosting email. The script uses Python's built-in gethostbyname function (from the socket module).
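A minimal sketch of the IP-lookup step with socket.gethostbyname; the website list is a placeholder, and the database update and hosting-email queueing are represented only by a print.

```python
import socket
from typing import Optional

def resolve_website_ip(domain: str) -> Optional[str]:
    """Resolve a website's current IP; return None if the lookup fails."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None

# In run_update_ip.py the newly added websites would come from the database;
# here a placeholder list is used and results are only printed.
for site in ["example.com", "example.org"]:
    ip = resolve_website_ip(site)
    if ip:
        # The real script saves the IP and queues the hosting-email lookup.
        print(f"{site} -> {ip}")
```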
System Logs
All logs are in the storage/logs and storage/app/public folders.
A background script cleans these logs once every week.
Users
Create New Users
Enter the following fields (* - required): email*; optionally select the option to send copies of all notifications to a second email and enter that second email; username* and surname; the copyright holder name used for all projects; the available content types; the search engines available to the user; whether to hide the schedule from the user (configured in the settings); phone number; password*; the user role (admin, manager, guest); the account expiration date; and, for managers and guests, the available projects.
The user expiration date is set here; the crons/cron_check_expired_users script is responsible for handling its expiration.
Personal User Settings
Change the name and surname, enter the SMTP mail connection details, enter a signature (applied only when using personal templates, not the standard formula), and change the password. The page also allows editing user data, managing invoices, changing the password, and deleting the user account.
Projects
Creating Projects
Main tab - Visibility for the manager, selection of the content type from those available, the title, and the author, year, or artist values (these fields vary depending on the selected content type), plus links to official web resources and a power of attorney.
Search Keywords - When you open this tab, keywords are automatically generated from the Title field, combined with the Author / Year / Artist value if one is present. Key phrases are also generated from the Translator of copyrighted work field in conjunction with the Title, but only if the "+" is set for the translator field in the System -> Search keywords section.
Content Type - affects the field name Author / Year / Artist.
Schedule - scan schedule; works on server time (Germany, UTC+2)
Whitelist - Whitelist for links
Document - Download Files
When you save the project for the first time, a search is automatically launched for all search engines available to the user. After the first save, the RUN NOW button appears, which starts the search immediately.
Hitting the RUN SEARCH button starts a search for the selected project across all search engines. We have four search engines (google, bing, yandex, axgbot). Hitting RUN FWS runs a full website search immediately (using the axgbot search engine).
Websites
When you add a new site, it is automatically created in the Websites section and its IP is updated. A search is started for both the admin email and the hosting email. The hosting email is searched in the whois (RDAP) database via https://rdap.arin.net/registry
IP Address Update
We check all websites once every 24 hours for new IPs. If a new IP is found, it is set as the current website IP and all previously found IPs are logged. This is done using PHP's standard gethostbyname method. Once a new IP is found, we use web7 to update the hosting email.
Skip links scanning - sub-links from this website's links are not added. The links are also not checked for upload links.
Wait for Content loading - content will be fetched using Selenium, after waiting (20 sec by default).
Analytics
Displays links in the below categories:
- Grouped by date added
- Grouped by search engines
- Links categorised as either torrents, free downloads, messengers, fake sites, cyber lockers, social networks, link shorteners
Email Templates
https://axg.house/email_template
Shows all email templates used by the system. These templates are used in:
- Email notifications for takedown
- Account Registration / User Deletion
- Adding users to projects
- Project Creation / Expiration / Deletion
Check Status
Checks the link for the presence of stop words from the Axghouse group after notifications have been sent. The check interval depends on the removal time of the associated website, as below (see the sketch after the list):
- If website removal time is '>14days', do not check status.
- If website removal time is 'no data', then we check the link once every 2 days.
- Otherwise, check the link after 8 hours
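A sketch of these interval rules in Python; the removal-time labels are taken from the list above, and the function is illustrative rather than the actual implementation.

```python
from datetime import datetime, timedelta
from typing import Optional

def next_status_check(removal_time: str, last_checked: datetime) -> Optional[datetime]:
    """Return when a link should next be checked, or None to skip checking."""
    if removal_time == ">14days":
        return None                               # do not check status
    if removal_time == "no data":
        return last_checked + timedelta(days=2)   # check once every 2 days
    return last_checked + timedelta(hours=8)      # otherwise, check after 8 hours

print(next_status_check("no data", datetime(2025, 1, 1)))
```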
Statuses that lead to changing to Deleted
- Presence of stop words
- Absence of title/translator
- 401, 404, 410, 451 status codes
- 403 status code (non-Cloudflare)
- 301, 302 redirecting to the homepage
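A hedged sketch of the decision rules above; the parameter names and the homepage-redirect flag are simplified placeholders, not the system's actual data model.

```python
from typing import List

def is_deleted(status_code: int, page_text: str, title_present: bool,
               stop_words: List[str], is_cloudflare: bool,
               redirected_to_homepage: bool) -> bool:
    """Approximate the 'changed to Deleted' rules listed above."""
    if any(word.lower() in page_text.lower() for word in stop_words):
        return True                               # presence of stop words
    if not title_present:
        return True                               # absence of title/translator
    if status_code in (401, 404, 410, 451):
        return True                               # removal status codes
    if status_code == 403 and not is_cloudflare:
        return True                               # 403 non-Cloudflare
    if status_code in (301, 302) and redirected_to_homepage:
        return True                               # redirect to homepage
    return False
```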
Enhanced Status Checking (2025): The system now uses curl_cffi scraper as a fallback when the primary scraper fails or times out. This provides better success rates for status checking, especially for JavaScript-heavy websites and sites with anti-bot protection.
Full Web Search
We have 5 servers involved in full web search (web1,web2,web3,web4,web5). Website search is run via three methods:
Run button on each website
When clicked, it goes through all projects with content type 31 and sets up search tasks for each project on the selected website. The tasks are then equally distributed among the 5 servers based on the content type of the website.
Project run button
This button runs both search engine tasks and website tasks for the selected project. For website tasks, it goes through all websites of content type 31 and equally distributes search tasks among all servers.
Content type run button
This button runs all projects pointed to the content type. Each content type has specific servers it runs its tasks on.
Email Notifications
Email notifications are sent for all new links. The notifications are divided into:
- Admin email notifications - We notify the admin of the websites on which the links have been found.
- Hosting email notifications - We search for the hosting email of the website and notify them to take down the link
- Cloudflare notifications - If the website is hosted on cloudflare, we send a separate cloudflare email.
Google Form
Sent within 1 minute after takedown date appears. We send links for up to 70 projects with the same publisher.
Algorithm
As soon as notifications are sent, tasks are created for google form. The tasks are stored in the table google_form. A script hosted on web8 named google_dmca_two.py runs all the time checking if there are any tasks in the table.
The tasks are sent to form url - https://reportcontent.google.com/forms/dmca_search
Blogspot Form
Sent within 1 minute after takedown date appears. We send links for up to 10 projects with the same publisher.
Algorithm
Projects are grouped by publisher. For each project we get the links not yet sent to the DMCA form, then create a Data Form with a maximum of 10 groups (one per project) and a maximum of 1000 links in total.
We use python playwright to send to url - https://reportcontent.google.com/forms/dmca_blogger
Bing Form
Sent in a similar manner as Google Form. We send links per project to the bing DMCA.
Algorithm
As soon as notifications are sent, tasks are created for bing form. The tasks are stored in the table google_form_bing. A script hosted on web8 named bing_dmca.py runs all the time checking if there are any tasks in the table.
Bing form url - https://www.bing.com/webmaster/tools/contentremovalform
Cloudflare Form
We post cloudflare form to the url https://abuse.cloudflare.com/api/v2/report/abuse_dmca
The url accepts the following POST parameters:
- Email of user
- Title of the request
- Name of the organization
- Address of the organization
- City of the organization
- Country of the organization in ISO format
- Organization name
- Phone number of the organization
- List of links containing original work
- Infringing urls
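A rough sketch of such a submission, assuming a JSON POST; the payload keys below are illustrative placeholders mapped to the parameter list above and do not necessarily match the actual Cloudflare API field names.

```python
import requests

# Illustrative only: payload keys are hypothetical placeholders,
# not the documented Cloudflare API schema.
payload = {
    "email": "user@example.com",                        # email of user
    "title": "DMCA takedown request",                   # title of the request
    "name": "Example Org",                              # name of the organization
    "address": "1 Example Street",                      # address of the organization
    "city": "Tallinn",                                  # city of the organization
    "country": "EE",                                    # country in ISO format
    "company": "Example Org",                           # organization name
    "tele": "+3725551234",                              # phone number of the organization
    "original_work": "https://example.com/official",    # links containing original work
    "urls": "https://pirate.example/infringing-page",   # infringing urls
}

resp = requests.post(
    "https://abuse.cloudflare.com/api/v2/report/abuse_dmca",
    json=payload,
    timeout=30,
)
print(resp.status_code, resp.text[:200])
```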
Counter Notice System
The Counter Notice System allows users to submit counter-notifications to Google for links that have been incorrectly flagged for removal. This system validates project owners have active Google Counter Users before processing counter notices.
Google Counter Users
Before submitting counter notices, the system validates that all project owners have active Google Counter Users configured. Each Google Counter User must have:
- Valid contact email matching project owner's email
- Active status (active = 1)
- Complete profile information (name, company, address, etc.)
- Valid cookies for Google authentication
Content Types
Fields:
- Google keywords (for pirate detection) - Specifies a list of words (separated by commas) that must be present on the page along with the Title of the project; if left blank, pages are not added.
- Specified content type field name - Changes the name of the Author field in the project to the desired one (does not affect the search).
- Check Specified content type field on page - Whether the presence of the Author field on pages will be checked.
- Swap project keywords - Changes the order in which the project's search keywords are formed.
- Stop words - We check for these words in the title of each link's content. If found, the link is not considered a pirate one.
- Screenshot - if this is set, we fetch screenshots for links of this content type
Proxies
We use proxies provided by Webshare. Proxies are used by fws, se, check status (all forms), test tool and link verification.
The proxies rotate once a month, between the 19th and 24th. Sometimes Webshare simply replaces certain proxies at any given time. After rotation, once the new proxies are provisioned, we have to download the new list and save it on all servers that require proxies, in the file /opt/aparser/files/proxy.txt.
Proxy API Details:
Proxy API host: https://proxy.webshare.io/api/v2/proxy/list/?mode=direct
Proxy authorization token: s7t89waym9igp51mxq0i3el4ac85qd2d5jfp5xqe
System
- User Group - by default, words from the 7th group are used in the search
- Content types - content types that are selected in the project
- Search keywords - contain templates for generating search phrases for each type of content
- Email templates - templates for letters; they are formed and stored as files
- Not deleted notification - a template for sending notifications about not deleted links
- White List - sites from this list are not added; all sites are listed without www
- Fake sites - such sites are not searched for links and no complaints are sent to them
- Upload hostings - a list of hosts that are considered file services
Removal Time
We update each website's average removal time. The algorithm works as below (see the sketch after the steps):
- Step 1:
- Select links (Deleted and Not Deleted) for the given website which were added within the last 3 months.
- If there are no links from the above step, select the last 100 links (Deleted and Not Deleted) for the website.
- Step 2: For each link, get its removal time. This is calculated as the difference between the time takedown notices were sent and the time the system detected the link as deleted.
- Step 3: Take the average of the times found and set this as the website removal time.
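A simplified sketch of the averaging step, assuming each link record carries a notice-sent timestamp and (when available) a detected-deleted timestamp; links without a detection time are skipped in this sketch.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

def average_removal_time(
    links: List[Tuple[datetime, Optional[datetime]]]
) -> Optional[timedelta]:
    """links: (notice_sent_at, detected_deleted_at) pairs; deleted_at may be None."""
    removal_times = [
        deleted_at - sent_at
        for sent_at, deleted_at in links
        if deleted_at is not None and deleted_at > sent_at
    ]
    if not removal_times:
        return None
    return sum(removal_times, timedelta()) / len(removal_times)

links = [
    (datetime(2025, 1, 1), datetime(2025, 1, 4)),    # removed after 3 days
    (datetime(2025, 1, 2), datetime(2025, 1, 10)),   # removed after 8 days
]
print(average_removal_time(links))   # average -> 5 days, 12:00:00
```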
Text Search Engine
This application is a C++ server that listens for HTTP POST requests on a given port and supports two main endpoints:
/add and /search
Key Functionalities
- Web Scraping: Using cURL to send HTTP requests and retrieve webpage content.
- Proxy Handling: Reads proxy details from a file and integrates them into web requests.
- User-Agent Randomization: Dynamically fetches and utilizes different user-agents to mimic browser behavior.
- Data Cleaning: Removes unwanted characters, scripts, and HTML tags from scraped content.
- Database Interaction: Connects to a PostgreSQL database to fetch and update data related to projects and web pages.
- Multithreading: Uses std::async to process multiple web pages concurrently.
- Timeout Management: 30-second timeout with automatic fallback to curl_cffi scraper
- Enhanced Fallback Chain: Integrated with curl_cffi, Playwright, and Selenium for maximum success rates
2025 Enhancement: The C++ scraper now includes timeout management and seamless integration with the curl_cffi Python scraper. When the C++ scraper takes longer than 30 seconds or encounters specific error codes, it automatically falls back to curl_cffi for better success rates.
curl_cffi Scraper Integration
The system now includes an advanced curl_cffi-based scraper that provides enhanced web scraping capabilities with better anti-detection features and improved performance for modern websites.
New in 2025: curl_cffi integration provides better success rates for JavaScript-heavy websites and enhanced proxy support with automatic retry mechanisms.
Key Features
- Browser Fingerprinting: Uses Chrome 110 impersonation for better compatibility with modern websites
- Proxy Integration: Seamless integration with existing proxy infrastructure from /root/flask/proxies/proxy.txt
- User Agent Rotation: Dynamic user agent fetching from the Node.js script /root/user_agents.js
- Automatic Retry Logic: Built-in retry mechanism for 500 status codes (up to 3 attempts)
- Home Redirect Detection: Intelligent detection of homepage redirects for better link classification
- Timeout Management: 30-second timeout with automatic fallback to curl_cffi when scraper hangs
Integration Architecture
The curl_cffi scraper is integrated into the existing content fetching pipeline:
Fallback Chain
- Primary Scraper: Traditional C++ scraper (with 30-second timeout)
- curl_cffi Fallback: Activated on timeout, 500 errors, or specific status codes (403, 429, 503)
- Playwright/Selenium: Used for JavaScript-heavy content when curl_cffi returns incomplete data
- Aparser: Final fallback for Cloudflare-protected websites
Status Code Handling
- 200, 300 Codes: Properly processed and used when returned from curl_cffi
- 500 Codes: Automatic retry with 2-second delays between attempts
- Home Redirects: Automatically detected and classified as status code 300
- Timeout Scenarios: Immediate fallback to curl_cffi after 30-second scraper timeout
Configuration
The curl_cffi scraper uses the following configuration:
Headers:
- User-Agent: Dynamic from /root/user_agents.js
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: en-US,en;q=0.9
- Accept-Encoding: gzip, deflate, br
- Referer: https://www.google.com/
- Connection: keep-alive
Timeouts:
- Connect Timeout: 60 seconds
- Read Timeout: 120 seconds
- Scraper Timeout: 30 seconds (before curl_cffi fallback)
Retry Logic:
- Max Retries: 3 attempts for 500 errors
- Retry Delay: 2 seconds between attempts
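A sketch of a request using these settings with the curl_cffi requests API; the proxy value and target URL are placeholders, and the real scraper additionally fills in the User-Agent dynamically.

```python
import time
from curl_cffi import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "Connection": "keep-alive",
    # User-Agent would be filled in dynamically from /root/user_agents.js
}

def fetch(url: str, proxy: str, max_retries: int = 3):
    """Fetch a page with Chrome 110 impersonation, retrying 500 responses."""
    for _ in range(max_retries):
        resp = requests.get(
            url,
            headers=HEADERS,
            impersonate="chrome110",              # browser fingerprinting
            proxies={"http": proxy, "https": proxy},
            timeout=120,                          # read timeout from the config above
        )
        if resp.status_code != 500:
            return resp
        time.sleep(2)                             # 2-second delay between attempts
    return resp

# Example (placeholder values):
# print(fetch("https://example.com", "http://user:pass@proxy.example:8080").status_code)
```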
Performance Improvements
Enhanced Success Rates:
- JavaScript-Heavy Sites: Better handling of sites that load content dynamically
- Anti-Bot Protection: Improved bypass capabilities for modern protection systems
- Timeout Recovery: No more hanging scrapers - automatic fallback after 30 seconds
- Proxy Reliability: Better proxy rotation and error handling
Playwright Enhancements
Playwright has been optimized to work better with the curl_cffi integration:
- Increased Timeout: Default timeout increased to 60 seconds for better JavaScript loading
- Wait Strategy: Uses wait_until="load" for more reliable page loading
- Fallback Integration: Automatically triggered when curl_cffi returns incomplete content
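A minimal sync-API sketch of these Playwright settings; the target URL is a placeholder.

```python
from playwright.sync_api import sync_playwright

def fetch_with_playwright(url: str) -> str:
    """Load a page with the increased 60-second timeout and wait_until='load'."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="load", timeout=60_000)  # 60 seconds
        html = page.content()
        browser.close()
        return html

# print(fetch_with_playwright("https://example.com")[:200])
```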
Link Checker Package
The curl_cffi functionality is also available through the Link Checker Python package:
Dependencies:
- curl-cffi>=0.5.0
- psycopg2-binary>=2.9.0 (PostgreSQL support)
- beautifulsoup4>=4.9.0
- selenium>=4.0.0
- playwright>=1.20.0
Deployment Notes:
- Ensure /root/flask/proxies/proxy.txt exists and contains valid proxies
- Verify the /root/user_agents.js script is executable and returns valid user agents
- curl_cffi requires Python 3.7+ and may need compilation on some systems
- Monitor proxy rotation schedules (19th-24th of each month) for uninterrupted service
User Agents
The User Agents page provides a comprehensive management interface for user agent strings used throughout the system. User agents are essential for web scraping operations as they help mimic real browser behavior and avoid detection by anti-bot systems.
Features
- User Agent Database: Stores a collection of user agent strings with metadata including platform, device category, and viewport dimensions
- Search Functionality: Search user agents by user agent string, platform, or device category using case-insensitive matching
- Pagination: Displays 10 user agents per page with navigation controls
- Import from GitHub: One-click import of user agents from the intoli/user-agents repository
User Agent Information
Each user agent entry contains the following information:
- ID: Unique identifier for the user agent record
- User Agent: The complete user agent string (e.g., "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...")
- Created At: Date and time when the user agent was added to the database
- Platform: Operating system or platform (e.g., "Windows", "Mac OS X", "Linux", "Android", "iOS")
- Device Category: Type of device (e.g., "desktop", "mobile", "tablet")
Importing User Agents
The system can import user agents from the intoli/user-agents GitHub repository, which provides a comprehensive collection of real-world user agent strings.
Import Process:
- Click the "Add User Agents" button on the User Agents page
- The system downloads a gzipped JSON file from https://raw.githubusercontent.com/intoli/user-agents/main/src/user-agents.json.gz
- The file is automatically decompressed and parsed
- Existing user agents are cleared (truncated) from the database
- New user agents are imported with the following fields:
- User agent string
- Platform information
- Device category
- Viewport height and width
- Date added timestamp
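A simplified sketch of this import flow, assuming the user_agents table described under Database Schema below and the JSON field names published by the intoli repository; error handling and batching are omitted, and conn is an open psycopg2 connection.

```python
import gzip
import json

import requests

URL = "https://raw.githubusercontent.com/intoli/user-agents/main/src/user-agents.json.gz"

def import_user_agents(conn) -> int:
    """Download, decompress, and import the intoli user-agents collection."""
    raw = requests.get(URL, timeout=60).content
    agents = json.loads(gzip.decompress(raw))

    with conn.cursor() as cur:
        cur.execute("TRUNCATE TABLE user_agents")   # existing agents are cleared
        for a in agents:
            cur.execute(
                """
                INSERT INTO user_agents
                    (useragent, platform, "deviceCategory",
                     "viewportHeight", "viewportWidth", date_added)
                VALUES (%s, %s, %s, %s, %s, NOW())
                """,
                (
                    a.get("userAgent"),
                    a.get("platform"),
                    a.get("deviceCategory"),
                    a.get("viewportHeight"),
                    a.get("viewportWidth"),
                ),
            )
    conn.commit()
    return len(agents)
```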
Usage in System
User agents from this database are used throughout the system for:
- Web Scraping: Random user agent selection for HTTP requests to avoid detection
- curl_cffi Scraper: Dynamic user agent rotation for enhanced anti-detection capabilities
- Status Checking: User agent rotation when checking link status
- Form Submissions: Browser-like user agents for DMCA form submissions
Best Practices:
- Regularly update the user agent database to include the latest browser versions
- Use the import feature periodically to refresh the collection with new user agents
- Search functionality helps identify specific user agents for testing or debugging
- The system automatically selects random user agents from the database for each request
Database Schema
The user agents are stored in the user_agents table with the following structure:
- id - Primary key (auto-increment)
- useragent - The user agent string
- date_added - Timestamp when the record was created
- platform - Operating system/platform name
- deviceCategory - Device type (desktop, mobile, tablet)
- viewportHeight - Viewport height in pixels
- viewportWidth - Viewport width in pixels
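As an illustration of how a random user agent might be picked from this table, a sketch using an open psycopg2 connection (not the application's actual query):

```python
def random_user_agent(conn, device_category: str = "desktop") -> str:
    """Pick one random user agent string for the given device category."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT useragent
            FROM user_agents
            WHERE "deviceCategory" = %s
            ORDER BY RANDOM()
            LIMIT 1
            """,
            (device_category,),
        )
        row = cur.fetchone()
    # Fallback value if the table has no matching rows.
    return row[0] if row else "Mozilla/5.0"
```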
AWS Credentials (SMTP Password Generator)
The AWS Credentials page provides a tool for generating SMTP passwords for Amazon SES (Simple Email Service). This utility simplifies the process of creating SMTP credentials needed for sending emails through AWS SES.
Purpose
When configuring SMTP settings for AWS SES, you need to generate an SMTP password from your AWS access key secret. This page automates that process by using AWS SDK to convert your AWS secret access key into an SMTP password that can be used with SES SMTP endpoints.
Features
- SMTP Password Generation: Converts AWS secret access keys into SMTP passwords
- AWS Region Selection: Supports all major AWS regions for SES
- AJAX-Based Interface: Real-time password generation without page refresh
- Copy to Clipboard: One-click copy functionality for generated passwords
Required Information
To generate an SMTP password, you need to provide:
- Email: The email address associated with the AWS SES account
- SMTP Username: Your AWS access key ID (e.g., AKIA...)
- SMTP Secret: Your AWS secret access key
- AWS Region: The AWS region where your SES is configured
Supported AWS Regions
The tool supports the following AWS regions:
- US Regions: us-east-2, us-east-1, us-west-2, us-gov-west-1
- Asia Pacific: ap-south-1, ap-northeast-2, ap-southeast-1, ap-southeast-2, ap-northeast-1
- Europe: eu-central-1, eu-west-1, eu-west-2, eu-west-3, eu-south-1, eu-north-1
- Canada: ca-central-1
How It Works
Generation Process:
- Enter your email address, SMTP username (AWS access key ID), and SMTP secret (AWS secret access key)
- Select the appropriate AWS region from the dropdown menu
- Click "Generate SMTP Password" button
- The system executes a Python script (/var/www/credentials.py) that uses the AWS SDK to generate the SMTP password
- The generated password is displayed in a read-only field
- Use the "Copy Password" button to copy the password to your clipboard
Technical Details
The password generation is handled by a Python script located at /var/www/credentials.py. This script:
- Uses AWS SDK (boto3) to convert the secret access key to an SMTP password
- Applies AWS's SMTP password generation algorithm based on the selected region
- Returns the generated password as output
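For reference, AWS also documents a pure-HMAC derivation of the SES SMTP password from a secret access key; the sketch below follows that documented algorithm (the actual credentials.py may implement it differently, e.g. via boto3).

```python
import base64
import hashlib
import hmac

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def ses_smtp_password(secret_access_key: str, region: str) -> str:
    """Derive an SES SMTP password from an AWS secret access key (AWS-documented algorithm)."""
    signature = _sign(("AWS4" + secret_access_key).encode("utf-8"), "11111111")
    signature = _sign(signature, region)
    signature = _sign(signature, "ses")
    signature = _sign(signature, "aws4_request")
    signature = _sign(signature, "SendRawEmail")
    return base64.b64encode(bytes([0x04]) + signature).decode("utf-8")

# Example (placeholder secret):
# print(ses_smtp_password("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "eu-west-1"))
```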
Usage in System
Generated SMTP passwords are used for:
- Email Notifications: Sending takedown notifications and system emails via AWS SES
- User SMTP Configuration: Users can configure their own SMTP settings in account settings
- Cloudflare Form Submissions: Email notifications sent through AWS SES
- System Communications: Automated email communications throughout the platform
Security Notes:
- Keep your AWS secret access keys secure and never share them publicly
- The generated SMTP password is specific to the AWS region selected
- If you regenerate your AWS access keys, you'll need to generate a new SMTP password
- SMTP passwords are different from your AWS console login password
- Ensure the Python script at /var/www/credentials.py has proper permissions and the AWS SDK installed
SMTP Configuration
Once you have the generated SMTP password, you can configure your email client or application with:
- SMTP Host: email-smtp.{region}.amazonaws.com (e.g., email-smtp.eu-west-1.amazonaws.com)
- SMTP Port: 25, 465 (SSL), or 587 (TLS)
- SMTP Username: Your AWS access key ID
- SMTP Password: The generated password from this tool
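A minimal sketch of testing these SMTP settings with Python's smtplib; addresses and credentials are placeholders, and the sender must be a verified SES identity.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "email-smtp.eu-west-1.amazonaws.com"   # email-smtp.{region}.amazonaws.com
SMTP_PORT = 587                                    # TLS

msg = EmailMessage()
msg["Subject"] = "SES SMTP test"
msg["From"] = "sender@example.com"                 # placeholder, verified SES identity
msg["To"] = "recipient@example.com"                # placeholder
msg.set_content("Test message sent through AWS SES SMTP.")

with smtplib.SMTP(SMTP_HOST, SMTP_PORT, timeout=30) as server:
    server.starttls()                              # upgrade to TLS on port 587
    server.login("AKIAXXXXXXXXXXXXXXXX",           # SMTP username = AWS access key ID
                 "generated-smtp-password")        # password generated by this tool
    server.send_message(msg)
```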
Best Practices:
- Store generated SMTP passwords securely in your user account settings
- Use IAM users with SES-specific permissions rather than root AWS credentials
- Regularly rotate AWS access keys and regenerate SMTP passwords accordingly
- Test SMTP connectivity after generating new passwords
Unit Test
We developed a unit test tool for checking functionality before uploading any changes to GitLab. This tool currently works only for axg.house.
The smallest testable parts of the application, called units, are individually and independently scrutinised for proper operation to ensure that each part is error-free (and secure). We use PHPUnit for our testing, and it runs on web1.
Database Update
The database on the main server is constantly updated, both by users who regularly update parts of it and by automatic scripts running all the time. Some of these updates need to be reflected on the webX servers, since these have their own copies of the database.
To achieve this, each of the webX servers has a script that updates its database every hour.
PostgreSQL-Specific Considerations
When working with PostgreSQL databases across multiple servers:
- Sequence Synchronization: After bulk data operations or data imports, PostgreSQL sequences may need to be reset to prevent duplicate key errors.
- Transaction Handling: PostgreSQL uses true MVCC (Multi-Version Concurrency Control), allowing better concurrent read/write operations without blocking.
- Connection Management: PostgreSQL connections are more efficient, but connection pooling (PgBouncer) is recommended for high-traffic scenarios.
- Data Type Compatibility: PostgreSQL is stricter with data types - ensure proper type casting when transferring data between servers.
Pirate/Links Section
The system gets its links from either search engines or full website search. The searches are done on separate servers named web1, web2, web3, web4, web5. Results/links from these servers are queued on the main server and slowly added into the database.
They are stored in a table named reference. This is the table we read from to display links in the UI. Some websites have several mirrors; for some of these (i.e. blogspot.com, wordpress.com and tumblr.com), we combine all the mirrors and store them under the parent website.
All links displayed in the UI are fetched from the reference table. For websites whose removal time is >14days, links older than 2024-08-01 are moved to a backup table named reference_backup.
Scan Official Links
Each content type is set with an official Amazon URL used to check for the project's official links. For every new project we scan links from this Amazon URL, checking against the project title and project author (if present).
A cron task for this has been set up on web6 (212.83.171.22:32024) via a script named scan_official_pages_schedule.py. It scans for links by checking the content provided by the a-parser Shop::Amazon parser. The cron script runs every hour.
Hosting Email
Website IPs keep changing. Every 24 hours the system checks each website's current IP. The Python script is run_update_ip.py, located in /root/pythons.
If the current IP differs from the previously found IP, the IP is updated and the previous one is cached for reference. The script also checks websites without hosting emails and updates them. Whenever the IP change runs, the website needs to be checked for a new hosting email.
RIPE Database Search
The main search URL is https://rdap.arin.net/registry/ip/. This function runs on web7. The Python script is run_update_hosting_email.py.
Algorithm
- If only one email found, just add it
- If more than one email found:
- If any of the emails contains the word 'abuse' or variants of the word, add it
- If none contains the word 'abuse', then search by the list of keywords
- If more than one email contains a keyword, add the first email found.
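A sketch of this selection logic; the keyword list is a placeholder, not the system's actual list.

```python
from typing import List, Optional

KEYWORDS = ["report", "noc", "support", "admin"]   # placeholder keyword list

def pick_hosting_email(emails: List[str]) -> Optional[str]:
    """Choose a hosting email according to the rules above."""
    if not emails:
        return None
    if len(emails) == 1:
        return emails[0]                       # only one email found: just add it
    for email in emails:
        if "abuse" in email.lower():           # prefer 'abuse' (or variants)
            return email
    for email in emails:
        if any(k in email.lower() for k in KEYWORDS):
            return email                       # first email matching a keyword
    return None

print(pick_hosting_email(["noc@host.example", "abuse@host.example"]))  # abuse@host.example
```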
Admin Email
We need to add an admin email for all new websites. Every 2 minutes, the system checks whether any new websites have been added. If any are found, it starts the process of finding the admin email.
The Python script is run_email_search.py, located in /home/moses/scripts on web7. The process works by first extracting links from the homepage of the website.
Process
- Extract links from the homepage of the website
- Each link has an anchor tag (email keyword); this anchor text is compared to a list of allowable keywords
- If the keyword matches, we then extract content from the link.
- We then extract emails from the links and create a list of extracted emails
Email MX Records
- For each email from the list, we check if it is a disallowed email (checked against a list of forbidden emails)
- If the above check passes, we then check the MX record of the email's domain.
- If the MX check passes, the email is added under the particular website.
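A sketch of the MX check, assuming the dnspython package (the document does not specify which DNS library the script uses); the forbidden-email list is a placeholder.

```python
import dns.exception
import dns.resolver   # pip install dnspython

FORBIDDEN = {"noreply@example.com"}   # placeholder list of forbidden emails

def email_passes_checks(email: str) -> bool:
    """Reject forbidden emails, then require at least one MX record for the domain."""
    if email.lower() in FORBIDDEN:
        return False
    domain = email.rsplit("@", 1)[-1]
    try:
        answers = dns.resolver.resolve(domain, "MX")
        return len(answers) > 0
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return False

print(email_passes_checks("webmaster@gmail.com"))
```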
Registration
User registration system with email verification and account management features.
Cookies
In order to access content on social media, we use cookies. We have created users for Facebook, Reddit, TikTok and Instagram.
Cookies for these users are regularly updated when they expire, or manually. Sometimes cookies expire and we need to update them immediately; for this, we developed a script that regularly checks them and sends an email about it. The script is named cookies_expiry_email.py and runs every 4 hours on web5.
Google and Bing Form
Google and Bing form users also use cookies. These cookies sometimes expire and need to be updated. Whenever they expire, Bing/Google sends an email about it.
Remov.ee Platform
Note: This section documents the Remov.ee platform, which is a separate service from the main AXGHOUSE system. The Remov.ee platform is located at https://remov.ee and operates on the santhosh/remov branch of the repository.
General Description
Remov.ee is a self-service platform that allows users to create projects for content removal requests. The project repository is located on the santhosh/remov branch. Before cloning, you need to check the relevance with the repository on the server.
Test Server
A test server has been created at https://new.remov.ee using a new branch remov_test created from the santhosh/remov branch. This test server is used for uploading new changes before pushing to the live server.
Remov.ee Users
Personal User Settings
Users can manage their personal settings including:
- Email Updates: Update email address and other personal identifying information
- Country Information: Update country and location details
- Subscriptions Management: Manage active subscriptions and payment methods
- Password Management: Change password and delete user account
Remov.ee Projects
Creating Projects
The projects page displays a listing of all projects and their expiry information. Each project has the following action icons:
- Analytics: Displays analytic graphs for the project
- View Links: View all links associated with the project
- Edit: Change project details
- Copy: Duplicate the project
Project Creation Process
When creating a new project, the first step is to add a link:
- The link MUST be a valid URL. The system checks for invalid URLs using JavaScript and reports errors
- The link is verified via Selenium on a special server named LinkVerifier at IP 135.181.199.23 (same process as in registration)
- When adding the link, the system performs the following checks:
- Check if the link already exists in the system
- Check HTTP code using Selenium
- Proceed only if Selenium is NOT 404 and link is NOT whitelisted
- Processing also occurs if the website exists in our websites list
- If all checks pass, the link is added to the system as a Google link (Google SE code is 0)
- The link is also added to a temp table that contains recently added links
Project Status
Projects can be either:
- Active (Enabled): Paid projects that are actively processing links
- Inactive (Disabled): Projects not paid for. Users can activate them by clicking the Pay button
Payment and Invoicing
- Projects already paid for can have their invoices downloaded
- Tax is added to a project at a rate of 22% for users from EU countries who do not have valid VAT
- When a project expires (either by reaching expiry period or when admin deactivates it), the user can reactivate by making a payment
Project Deletion
Projects can be deleted by the admin. Upon deletion, everything associated with the project is deleted, including:
- Links
- Orders
- Documents
- All related data
An email is sent to the user when their projects get deleted.
Project Expiration
When projects are about to expire, email reminders are sent to project owners. The system uses a project email reminder template that describes how far in advance the email reminder should be sent. The email reminder is sent at 11 PM.
Adding New Links
Users can add new links to existing projects via:
https://remov.ee/pirate/add?project={project_id}
- A user is allowed to add a maximum of 100 links within one form
- Each project has a monthly limit of links to be added
- Links to be added plus those already added for the month should not exceed the monthly limit
- Users can add screenshots alongside the links. These screenshots are editable
- Screenshot editing example: https://remov.ee/pirate/edit/{link_id}
Link Status Change
Link status progression:
- Pending Review: Initial status when a link is added
- Pending Removal: Status changes to this after notifications are sent
- Deleted: Status changes to this when the link gets deleted
- Not Deleted: After 14 days if link has not been deleted, status is set to Not Deleted
Remov.ee Pricing
Standard Account Pricing
Prices for copyright content types (video, audio, image, text, software):
| Plan | Monthly Price |
|---|---|
| One link per month | 49 EUR |
| Up to 10 links a month | 99 EUR |
| Up to 100 links a month | 199 EUR |
| Up to 500 links a month | 299 EUR |
| Up to 1k links a month | 599 EUR |
| Up to 10k links a month | 999 EUR |
Impostor Account Pricing
For Impostor accounts, different pricing applies:
| Plan | Monthly Price |
|---|---|
| One link per month | 339 EUR |
| Up to 10 links a month | 990 EUR |
| Up to 100 links a month | 1990 EUR |
| Up to 1k links a month | 2990 EUR |
| Up to 10k links a month | Hidden (may be used in future) |
Monthly Limits: A user is not allowed to add more links than the monthly limit allows. If they want to do so, they need to move to a higher package or create a new project.
Calculating Tax
Tax is added to all payments as follows:
- European Union Users:
- If VAT number is not valid, add 22% tax
- If VAT number is not entered, add 22% tax
- If user is from Estonia, add 22% tax regardless of VAT validity
- Non-EU Users: No tax is added
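A sketch of these rules; VAT validation itself (e.g. via VIES) is out of scope here, so a missing or invalid VAT number is represented by vat_valid=False.

```python
TAX_RATE = 0.22   # 22%

def tax_amount(net_price: float, in_eu: bool, country: str, vat_valid: bool) -> float:
    """Compute the tax to add to a payment according to the rules above."""
    if not in_eu:
        return 0.0                                 # non-EU users: no tax
    if country.upper() == "EE":
        return round(net_price * TAX_RATE, 2)      # Estonia: 22% regardless of VAT validity
    if not vat_valid:                              # missing or invalid VAT number
        return round(net_price * TAX_RATE, 2)
    return 0.0                                     # EU user with a valid VAT number

print(tax_amount(199.0, in_eu=True, country="DE", vat_valid=False))   # 43.78
```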
Remov.ee Registration
Registration of users with project creation and online payment in accordance with chosen options.
Registration Process
The first step before registration is to check the pirate URL:
- The link MUST be a valid URL. The system checks for invalid URLs using JavaScript and reports errors
- The link is then checked via Selenium on IP 135.181.199.23
- The Selenium API has been modified from the standard code so that any codes found in the title tags of the content are returned as HTTP codes. The system checks for "Not found" in title tags plus any other HTTP codes that were previously returned
- Proceed only if Selenium is NOT 404 and link is NOT whitelisted. Processing also occurs if the website exists in our websites list
- If all checks pass, the link is added to the system as a Google link (Google SE code is 0)
Tax Calculation During Registration
Tax is calculated during registration as follows:
- European Union Users:
- System asks for a valid VAT number
- If VAT number is not valid, add 22% tax
- If VAT number is not entered, add 22% tax
- Estonian Users: Add 22% tax regardless of whether they have valid VAT or not
- Non-EU Users: No tax is added
Remov.ee Link Verification
Link Verification Process
Whenever we need to create a project or register a user, we need to verify the infringed link which needs to be deleted. The verification process includes:
- URL Validity Check: First, check whether the URL is valid or not. If it fails, the submit button won't be clickable
- Government Domain Check: Check whether the URL contains ".gov" or not
- Duplicate Check: Check whether the link already exists in our system or not
- Flask App Verification: A Flask app runs in the background to check whether the link is valid, deleted, or not deleted
Flask App Automation
The Flask app uses different types of automation frameworks:
- Selenium: Currently, most link verification is done by Selenium using proxies
- Playwright: Alternative automation framework for link verification
Whenever any user tries to verify a link, the Flask app scrapes the URL using Selenium and checks for deleted stop words. If deleted stop words are found, it won't allow the user to register and create a new project.
Invalid Links Exclusion
We have added regular expressions for certain websites to exclude bad links (which are not direct links) from those websites. Whenever a user tries to verify such a link, the system shows a message on the UI: "Link is not a direct link".
Link Verification Server (flask_server.cpp)
The link verification process is handled by a C++ server application located at flask_server.cpp. This server provides HTTP-based link verification functionality.
Standard Libraries Used
- <iostream>: Input-output stream functionality
- <cstdlib>: Conversions, random number generation
- <cstring>: C-style string manipulation
- <vector>: Dynamic arrays
- <string>: String operations
- <fstream>: File input and output
- <algorithm>: C++ algorithms
- <regex>: Regex-based operations
- <ctime>: Date and time functionality
- <netinet/in.h>, <sys/socket.h>, <arpa/inet.h>, <unistd.h>: Network programming
- <thread>: Multi-threading
- <mysql/mysql.h>: MySQL database interaction
- <curl/curl.h>: HTTP requests
Global Variables
- vector<string> v: Holds proxy addresses
- vector<string> useragents: Stores user-agent strings
- vector<string> all_keywords: Contains keywords used in content filtering
Key Functions
- is_int(char *c): Checks if a string represents an integer
- write_content(): CURL callback to write content fetched from a URL
- finish_with_error(): Handles MySQL errors and closes connections
- random_string(): Generates random alphanumeric strings
- get_user_agent(): Fetches user-agent strings from external script
- get_proxies(): Loads proxy addresses from a file
- split(): Splits a string into tokens based on delimiter
- clean_content(): Removes unwanted characters from content
- replaceHTMLENTITIES(): Replaces HTML entities with characters
- getCode(): Extracts HTTP response code from header content
- extractTitle(): Extracts <title> content from HTML
- replaceAll(): Removes HTML tags or content between markers
- getContents(): Removes all HTML tags from response body
- getContent(): Fetches HTML content using proxy and random user-agent
- processFacebook(): Handles URL content extraction for Facebook via Python script
- useSelenium(): Uses Selenium automation via Python to fetch content
- generateResponse(): Main logic to process URLs, query database, and filter content
- connection_handler(): Handles incoming HTTP requests
- main(): Entry point, sets up socket server and spawns threads
Compilation and Usage
# Compile the program
g++ flask_server.cpp -o flask_server -lmysqlclient -lcurl -lpthread
# Run the server with a specified port
./flask_server <port_number>
Server Behavior:
- Listens for HTTP requests
- Processes URLs to fetch and filter content
- Communicates with MySQL database for configuration and keyword checks
Remov.ee Scheduler
We have three schedulers running on the remov.ee server for sending emails. All functions are configured in the controller Expired.php:
Scheduled Tasks
- Project Expiry Reminder:
- Schedule: 11 PM
- Purpose: Sends email to remind users that their projects are expiring in 24 hours
- Project Expiry:
- Schedule: Midnight
- Purpose: Sends email to users that their project has expired
- Link Status Changed:
- Schedule: 9 AM (only for previously changed links to Deleted or Denied)
- Purpose: Sends emails to users whenever the status of any link added by remov.ee is changed
Remov.ee Analytics
Displays links in the following categories:
- Grouped by date added: Links organized by when they were added to the system
- Grouped by search engines: Links categorized by the search engine that found them
- Link categories: Links categorized as either:
- Torrents
- Free downloads
- Messengers
- Fake sites
- Cyber lockers
- Social networks
- Link shorteners
Remov.ee User IP Address Detection
The system uses ipinfo.io service for IP address detection and tracking.
Configuration
- Service URL: https://ipinfo.io/
- API Token: 40b2dd7fa5fe8c
- Documentation: https://ipinfo.io/developers
Purpose
This service is developed to record and track users' and guests' actions from https://axg.house/statistics. It provides geolocation and other IP-related information for analytics and security purposes.
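A minimal lookup sketch against the ipinfo.io API using the token configured above; the example IP is a placeholder.

```python
import requests

IPINFO_TOKEN = "40b2dd7fa5fe8c"   # token from the configuration above

def lookup_ip(ip: str) -> dict:
    """Fetch geolocation details for an IP address from ipinfo.io."""
    resp = requests.get(
        f"https://ipinfo.io/{ip}",
        params={"token": IPINFO_TOKEN},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

info = lookup_ip("8.8.8.8")
print(info.get("city"), info.get("country"), info.get("org"))
```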