Job Description
We are looking for a highly skilled Python Engineer with deep experience building large-scale web scraping pipelines and AI-powered data processing systems. This role is focused on extracting, normalizing, and enriching large volumes of structured and unstructured data using Python, LLMs (e.g., OpenAI), and AWS-based containerized infrastructure.
You will own the end-to-end lifecycle of data ingestion: from scraping and document processing, through AI-driven enrichment and classification, to deployment in scalable cloud environments.
Key Responsibilities
- Design, build, and maintain high-reliability Python scraping systems for collecting data from complex, dynamic, and unstructured web sources (HTML, PDFs, APIs, documents).
- Implement AI-assisted extraction, classification, summarization, and normalization pipelines using large language models (e.g., OpenAI).
- Develop resilient scraping architectures with rate limiting, retries, proxy management, CAPTCHA handling, and change detection.
- Build data processing pipelines that clean, transform, deduplicate, and enrich scraped content for downstream analytics and ML workflows.
- Develop and maintain containerized Python services using Docker and deploy them at scale via AWS ECS and related services.
- Integrate LLMs into automated workflows for document parsing, entity extraction, taxonomy mapping, and insight generation.
- Design and expose internal APIs for triggering scraping jobs, processing data, and retrieving AI-generated outputs.
- Manage cloud resources across AWS (ECS, S3, Lambda, RDS, CloudWatch) with a focus on scalability, reliability, and cost efficiency.
- Optimize scraping and AI pipelines for performance, throughput, and fault tolerance.
- Implement monitoring, logging, and alerting for long-running scraping and AI workloads.
- Write clear technical documentation covering scraping logic, AI workflows, and deployment patterns.
Qualifications
- Strong Python engineering background with a focus on data ingestion and scraping systems.
- Extensive experience building web scrapers using tools such as BeautifulSoup, Scrapy, Playwright, Selenium, or similar frameworks.
- Hands-on experience integrating LLM APIs (e.g., OpenAI) into production systems.
- Proven ability to handle unstructured data (HTML, PDFs, text blobs) and convert it into structured outputs.
- Experience containerizing Python applications with Docker and deploying them using AWS ECS.
- Solid understanding of AWS services including S3, ECS, Lambda, RDS, and CloudWatch.
- Experience designing and consuming RESTful APIs.
- Familiarity with CI/CD pipelines, Git-based workflows, and automated testing.
- Strong grasp of software engineering best practices: modularity, observability, error handling, and performance optimization.
- Ability to work independently in ambiguous problem spaces and iterate quickly.
- Clear written and verbal communication skills, especially around complex technical systems.
Pay: E£27,500.00 - E£35,000.00 per month
Work Location: Remote