About the project
Intelli Messaging SMS gateway is a carrier-grade messaging technology for enterprise and application providers. The system is designed to process large volumes of SMS traffic and can accept client submissions via various entry points such as SMPP connections, REST API, or email. These submissions are converted into SMPP messages and routed via configurable routes to the other SMS gateways. The system can also receive delivery receipts from those gateways and push notifications about them via different notification channels to the configured recipients. Another type of supported traffic includes Mobile Originated messages that can be received from the SMS gateways and pushed toward customer-defined endpoints.
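To make the idea of "configurable routes" more concrete, here is a minimal sketch of how a submission's destination number might be matched to a downstream gateway. The case study does not describe how routes are actually configured, so prefix-based matching, the `Route` type, the `select_route` function, and the gateway names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    # Both fields are hypothetical; the real route configuration is not described.
    prefix: str   # destination number prefix, e.g. "48"; "" acts as a catch-all
    gateway: str  # name of the downstream SMS gateway


def select_route(destination: str, routes: List[Route]) -> Optional[Route]:
    """Return the route whose prefix is the longest match for the destination."""
    digits = destination.lstrip("+")
    matching = [r for r in routes if digits.startswith(r.prefix)]
    # The longest matching prefix wins; None is returned when nothing matches.
    return max(matching, key=lambda r: len(r.prefix), default=None)


# Illustrative route table (made-up values).
routes = [
    Route(prefix="", gateway="default-gw"),
    Route(prefix="48", gateway="gw-a"),
    Route(prefix="4860", gateway="gw-b"),
]
```

With this table, a message to `+48601234567` would go to `gw-b`, other `+48` numbers to `gw-a`, and everything else to the catch-all route.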
For the past 10 years, SoftwareMill has been engaged in developing a new bulk-messaging gateway that meets the client's expectations for messaging volume, throughput, and availability. This has included software design and development, testing, deployment, support, and project management for the new gateway system.
Team
- 3+ devs
Duration
- 1 year
Team role
- Senior Scala Engineer
- Senior Architect
- Senior DevOps
Industry
- Telecom
Technology
- Scala
- Java
- Akka
- Grafana
- Prometheus
- AWS Cloud
- Kubernetes
- Helm
- Terraform
- MongoDB
- MySQL
- Graylog
- Docker
- Jenkins
Challenges
When the client considered a cloud migration for their 10-year-old Intelli Messaging system, the existing on-premises infrastructure had several issues. The setup suffered from various low-level problems, and the system needed a way to scale cost-effectively.
System services were tied to specific machines based on the predicted resource usage, making it difficult to recover from failures. The services were deployed as sets of Java jars, lacking proper isolation. Deployments were cumbersome, relying on manual symbolic link modifications for rollbacks, causing stress during critical service deployments.
The event store was using an outdated version of MongoDB, and upgrading was a concern due to potential risks. To migrate the system to the cloud, it was necessary to identify the required cloud capabilities and limitations, plan the data migration from the data center to the cloud, and minimize the system downtime window. Some of the customers who were aware of the migration plan added a set of requirements around VPN-based security features.
How we addressed the client's needs
The system faced several challenges in the on-premises setup, including limitations in service scalability and deployment issues. The event store was running on an outdated MongoDB version, causing reluctance to upgrade due to potential risks.
The team identified the cloud capabilities and limitations relevant to the migration. Data had to be moved from the data center to the cloud with as little downtime as possible. Some customers required additional security features based on VPN.
We explored cloud solutions and opted for Amazon Web Services (AWS). We used Amazon Virtual Private Cloud (VPC) to create isolated environments for production, staging, and internal tools. AWS EC2 allowed easy scaling of instances. Infrastructure was defined as code using Terraform, with the Terraform state kept in Amazon S3 and DynamoDB. To handle publicly available entry points, we used AWS Certificate Manager for certificates and Amazon Route 53 for DNS mappings. Docker containers were chosen for isolated service deployment, with images built by Jenkins and stored in Amazon Elastic Container Registry.
Resource usage was measured to select suitable EC2 instance types. Scaling was achieved with Amazon Elastic Kubernetes Service (EKS) and Auto Scaling groups. Data integrity during the migration was a priority, addressed through replication for both the event store (MongoDB) and the reporting database (MySQL). Critical services were safeguarded using dedicated Kubernetes eviction policies. Security requirements were met with AWS Transit Gateway, AWS Site-to-Site VPN connections, the AWS Client VPN service, and a custom VPN server.
Finally, we upgraded the technology stack, including the MongoDB engine, to newer versions.
Results
As an outcome of this work, the Intelli Messaging system has increased overall traffic handling capacity. The solution already had a chance to prove its worth. Shortly after the official launch, it received traffic several times higher than it could handle before. It wasn't just a single spike but lasted for a few hours. The client immediately felt more confident about the stability and throughput of the system.
The system can now automatically scale out or scale in based on the actual resource usage. No manual intervention is needed. All that is required is to maintain the rules which define the scaling triggers. Thanks to the Kubernetes and EKS foundation, the system is much more reliable and resistant to failures. The number of incidents that cannot be resolved by automatic service redeployment has gone down significantly. Service availability is higher, giving the customers more confidence in the quality of service.
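As an illustration of what a scaling trigger rule can look like, the sketch below follows the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a small tolerance of 1. The case study does not say which autoscaling mechanism or thresholds are used, so the function name, parameters, and example numbers are assumptions.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Compute a replica count from observed vs. target resource usage."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Usage is close enough to the target: keep the current size.
        desired = current_replicas
    else:
        # Scale out (ratio > 1) or scale in (ratio < 1), rounding up.
        desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum replica count.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: 4 replicas at 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
```

In a rule of this shape, maintaining the scaling triggers comes down to choosing the target metric value and the minimum and maximum replica counts.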
Physical infrastructure incidents are not a problem anymore. The cloud provider manages hard drive failures and equipment redundancy, taking away the stress and effort required to handle this kind of incident under time pressure.
System owners now have much better insight into the system's costs. AWS provides clear and up-to-date reports about current billing, which makes it easier to make predictions and plan the budget. Based on historical data, the AWS Cost Explorer service can also recommend more cost-effective AWS resource usage.
Service deployment is faster, leaner, and more resilient to potential issues. Rolling back a failed deployment is a matter of seconds instead of minutes. Releases can be performed more often, with a higher level of confidence and a lower level of stress. The migration from the on-premises data center to AWS cloud solutions took about a year. With the system foundation based on the cloud, it became easier to identify the right direction for future system design and performance improvements.