Architecting on AWS: the best services to build a two-tier application

The notion of a scalable, on-demand, pay-as-you-go cloud infrastructure is easily understood by the majority of today’s IT specialists. However, in order to fully reap the benefits of hosting solutions in the cloud, you have to rethink traditional ‘on-premises’ design approaches. This is necessary for a variety of reasons, the most prominent being the need to design for cost and to adopt a design-for-failure approach.

This is the first of a series of posts in which we introduce you to a variety of entry-level AWS services, using the example of architecting a common two-tier application deployment on AWS (e.g. a mod_php LAMP stack). We will use the architecture to explain common infrastructure and application design patterns pertaining to cloud infrastructure.
To start things off, we provide you with a high-level overview of the system and a brief description of the utilised services.

Architecting on AWS

Virtual Private Cloud (VPC)

The VPC allows you to deploy services into segmented networks, reducing the exposure of your services to malicious attacks from the internet. Separating the network into public and private subnets allows you to safeguard the data tier behind a firewall and to connect only the web tier directly to the public internet. The VPC service provides flexible configuration options for routing and traffic management rules. Use an Internet Gateway to enable connectivity to the internet for resources that are deployed within public subnets.
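As a rough illustration of how this network layout could be scripted, the boto3 sketch below creates a VPC with one public and one private subnet and attaches an Internet Gateway. The region, CIDR ranges and the use of a single availability zone are illustrative assumptions only; the reference design spreads its subnets across two AZs.

import boto3

ec2 = boto3.client('ec2', region_name='ap-southeast-2')

# Create the VPC that will contain both tiers and wait until it is usable
vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')['Vpc']
ec2.get_waiter('vpc_available').wait(VpcIds=[vpc['VpcId']])

# Public subnet for the web tier, private subnet for the data tier
public_subnet = ec2.create_subnet(VpcId=vpc['VpcId'], CidrBlock='10.0.0.0/24',
                                  AvailabilityZone='ap-southeast-2a')['Subnet']
private_subnet = ec2.create_subnet(VpcId=vpc['VpcId'], CidrBlock='10.0.1.0/24',
                                   AvailabilityZone='ap-southeast-2a')['Subnet']

# An Internet Gateway plus a default route makes the public subnet internet-facing;
# the private subnet receives no such route and stays unreachable from the internet
igw = ec2.create_internet_gateway()['InternetGateway']
ec2.attach_internet_gateway(InternetGatewayId=igw['InternetGatewayId'], VpcId=vpc['VpcId'])

public_rt = ec2.create_route_table(VpcId=vpc['VpcId'])['RouteTable']
ec2.create_route(RouteTableId=public_rt['RouteTableId'],
                 DestinationCidrBlock='0.0.0.0/0', GatewayId=igw['InternetGatewayId'])
ec2.associate_route_table(RouteTableId=public_rt['RouteTableId'],
                          SubnetId=public_subnet['SubnetId'])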

Redundancy

In our reference design we have spread all resources across two availability zones (AZs) to provide redundancy and resilience in the event of unexpected outages or scheduled system maintenance. As such, each availability zone hosts at least one instance of each service, except for services that are redundant by design (e.g. Simple Storage Service, Elastic Load Balancer, Route 53, etc.).

Web tier

Our web tier consists of two web servers (one in each availability zone) that are deployed on Elastic Compute Cloud (EC2) instances. We balance external traffic to the servers using an Elastic Load Balancer (ELB). Dynamic scaling policies allow us to scale the environment elastically by adding web instances to, or removing them from, the Auto Scaling group. Amazon CloudWatch allows us to monitor demand on our environment and triggers scaling events using CloudWatch alarms.
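As a rough boto3 sketch of how such a scaling policy and alarm can be wired together (the group name, metric and thresholds below are illustrative assumptions, not values from the reference design):

import boto3

autoscaling = boto3.client('autoscaling', region_name='ap-southeast-2')
cloudwatch = boto3.client('cloudwatch', region_name='ap-southeast-2')

# Scale-out policy: add one instance to the (hypothetical) 'web-asg' Auto Scaling group
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName='web-asg',
    PolicyName='scale-out-on-cpu',
    AdjustmentType='ChangeInCapacity',
    ScalingAdjustment=1,
    Cooldown=300)

# CloudWatch alarm that triggers the policy when average CPU stays above 70% for ten minutes
cloudwatch.put_metric_alarm(
    AlarmName='web-asg-high-cpu',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'web-asg'}],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[policy['PolicyARN']])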

Database tier

Amazon’s managed Relational Database Service (RDS) provides the relational (MySQL, MS SQL or Oracle) environment for this solution. In this reference design it is established as a multi-AZ deployment. The multi-AZ deployment includes a standby RDS instance in the second availability zone, which provides increased availability and durability for the database service by synchronously replicating all data to the standby instance.
Optionally, we can also provision read replicas to reduce the demand on the master database. To optimise costs, our initial deployment may only include the master and standby RDS instances, with additional read replicas created in each AZ as dictated by demand.
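A minimal boto3 sketch of such a deployment follows; the identifiers, instance class, storage size and credentials are illustrative placeholders only.

import boto3

rds = boto3.client('rds', region_name='ap-southeast-2')

# Master database with a synchronous standby in a second AZ (Multi-AZ deployment)
rds.create_db_instance(
    DBInstanceIdentifier='app-db',
    Engine='mysql',
    DBInstanceClass='db.m3.medium',
    AllocatedStorage=100,
    MasterUsername='admin',
    MasterUserPassword='change-me',   # placeholder; keep real credentials out of source code
    MultiAZ=True)

# Wait for the master to become available, then add an optional read replica
# to offload read traffic from it
rds.get_waiter('db_instance_available').wait(DBInstanceIdentifier='app-db')
rds.create_db_instance_read_replica(
    DBInstanceIdentifier='app-db-replica-1',
    SourceDBInstanceIdentifier='app-db')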

Object store

Our file objects are stored in Amazon’s Simple Storage Service (S3). Objects within S3 are managed in buckets, which provide virtually unlimited storage capacity. Object Lifecycle Management within an S3 bucket allows us to archive (transition) data to the more cost-effective Amazon Glacier service and/or to remove (expire) objects from the storage service based on policies.
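As a rough sketch of such a lifecycle policy in boto3 (the bucket name, prefix and time spans are illustrative assumptions):

import boto3

s3 = boto3.client('s3')

# Transition objects under 'logs/' to Glacier after 90 days and expire them after a year
s3.put_bucket_lifecycle_configuration(
    Bucket='example-app-assets',   # hypothetical bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-then-expire-logs',
            'Filter': {'Prefix': 'logs/'},
            'Status': 'Enabled',
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 365},
        }]
    })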

Latency and user experience

To minimise latency and enhance the user experience for our worldwide user base, we utilise Amazon’s CloudFront content distribution network. CloudFront maintains a large number of edge locations across the globe; an edge location acts like a massive cache for web and streaming content.

Infrastructure management, monitoring and access control

Any AWS account should be secured using Amazon’s Identity and Access Management (IAM). IAM allows for the creation of users, groups and permissions to provide granular, role-based access control over all resources hosted within AWS.
The above solution is provisioned to the target region using Amazon CloudFormation. CloudFormation supports the provisioning and management of AWS services and resources using scriptable templates. Once the environment has been created, CloudFormation can also update it based on changes made to the ‘scripted infrastructure definition’ (see the sketch below).
We use the Route 53 domain name service for the registration and management of our Internet domain.
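As a rough boto3 sketch of that create-then-update workflow (the stack name, template file and parameters are illustrative assumptions):

import boto3

cfn = boto3.client('cloudformation', region_name='ap-southeast-2')

def read_template():
    # Hypothetical template file kept alongside the project
    with open('two-tier-stack.template') as f:
        return f.read()

# Initial provisioning of the environment
cfn.create_stack(
    StackName='two-tier-app',
    TemplateBody=read_template(),
    Parameters=[{'ParameterKey': 'KeyName', 'ParameterValue': 'my-keypair'}],
    Capabilities=['CAPABILITY_IAM'])
cfn.get_waiter('stack_create_complete').wait(StackName='two-tier-app')

# Later, after the 'scripted infrastructure definition' has been edited,
# the same stack is updated in place rather than rebuilt
cfn.update_stack(
    StackName='two-tier-app',
    TemplateBody=read_template(),
    Parameters=[{'ParameterKey': 'KeyName', 'ParameterValue': 'my-keypair'}],
    Capabilities=['CAPABILITY_IAM'])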

In summary, we have introduced you to a variety of AWS services, each of which has been chosen to address one or more specific concerns regarding the functional and non-functional requirements of the overall system. In our upcoming posts we’ll investigate a number of the above services in more detail, discussing major design considerations and trade-offs in selecting the right service for your solution. In the meantime you can start to learn more about the individual AWS services using the courses that are available from CloudAcademy.

DISCLOSURE: This post was originally created for and sponsored by CloudAcademy.com.

Hadoop streaming R on Hortonworks Windows distribution

The ability to combine executables with the Hadoop map/reduce engine is a very powerful concept known as streaming. Thanks to a very active community, there are ample examples available on the web catering for a large variety of languages and implementations. Sources like this and this have also provided some R language examples that are very easy to follow.
In my case I only needed to look at the integration between the two tools. Unfortunately, though, I hadn’t so far been able to find any detail on how to implement streaming of R on the Hortonworks Windows distribution.

Why Windows, you ask? Well, I guess it comes down to the same reason Hortonworks decided to even consider shipping a Windows distribution in the first place: sometimes it is just easier to reach a market in an environment it perceives as familiar. But that may become a topic for another post someday.

In hindsight everything always appears quite straightforward. Still, I would like to briefly share my findings here to reduce the research time for anyone else who is presented with a similar challenge.

As a first requirement, R obviously needs to be installed on every data node. The same applies to any additional R packages that are used within your application.

Next, you create two files containing the instructions for the map and reduce tasks in R; in the example below the files are named map.R and reduce.R.
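The contents of the two R scripts depend entirely on your job, so they are not reproduced here, but the contract any streaming mapper or reducer has to honour is language-agnostic: read records from stdin and write tab-separated key/value pairs to stdout. Purely as a hypothetical illustration of that contract (shown in Python rather than R for brevity; map.R and reduce.R follow the same pattern when run through Rscript):

# map.py - hypothetical stand-in for map.R: emit "word<TAB>1" for every token read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

# reduce.py - hypothetical stand-in for reduce.R: sum the counts per key arriving on stdin
import sys
counts = {}
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    counts[word] = counts.get(word, 0) + int(count)
for word, total in counts.items():
    print('%s\t%d' % (word, total))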

Assuming that your data is already loaded into HDFS, you issue the following command on the Hadoop command line:

hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.4.0.2.1.1.0-1621.jar -files "file:///c:/Apps/map.R,file:///c:/Apps/reduce.R" -mapper "C:\R\R-3.1.0\bin\x64\Rscript.exe map.R" -reducer "C:\R\R-3.1.0\bin\x64\Rscript.exe reduce.R" -input /Logs/input -output /Logs/output

A couple of comments regarding the streaming jar options used in this command under Windows:

-files: in order to submit multiple files to the streaming process, the comma-separated list needs to be encapsulated in double quotes. Access to the local file system is provided using the file:/// scheme.
-mapper and -reducer: since the R scripts can’t be prefaced with a shebang (hashbang) line on Windows, we need to provide the execution environment as part of the option. As above, the path to the Rscript executable and the name of the R file need to be encapsulated in double quotes.

Using IAM credentials to grant access to AWS services

Having used Amazon Web Services (AWS) for quite some time now, I realised it is about time to start sharing some of my experiences on this blog, particularly considering that I haven’t contributed to the World Wide Web community in a while.
Today’s post is about Amazon’s Identity and Access Management (IAM) service and why it is a good idea to use it.
I am using the great backup solution from CloudBerry to back up important files from my laptop to Amazon S3. While I am absolutely excited about the capabilities of the application, I still did not feel comfortable providing my AWS root account access keys (as described in CloudBerry’s help file) to the application.

Why is that?
To fully understand the issue we need to be aware of the difference between the AWS root credentials and an IAM identity. The AWS root account is provisioned for all AWS users and has full access to all resources and services in the account. Sharing the secret access key for this identity with a third party can get you into big trouble; a malicious piece of software may use the credentials to wipe out all your data, terminate instances or, potentially worse, subscribe to a raft of additional services that you will have to pay for at the end of the month. Scary stuff, right? So please read on.
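To illustrate the kind of least-privilege setup IAM makes possible, here is a rough boto3 sketch that creates a dedicated backup user whose inline policy is limited to a single S3 bucket. The user name, bucket name and policy below are hypothetical examples, not CloudBerry’s documented configuration.

import json
import boto3

iam = boto3.client('iam')

# A dedicated identity for the backup tool instead of handing out the root account keys
iam.create_user(UserName='cloudberry-backup')

# Inline policy that only allows S3 access to one (hypothetical) backup bucket
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': ['s3:ListBucket', 's3:GetObject', 's3:PutObject', 's3:DeleteObject'],
        'Resource': ['arn:aws:s3:::my-laptop-backups',
                     'arn:aws:s3:::my-laptop-backups/*']
    }]
}
iam.put_user_policy(UserName='cloudberry-backup',
                    PolicyName='s3-backup-bucket-only',
                    PolicyDocument=json.dumps(policy))

# These keys, not the root account keys, are what the backup application gets to see
key = iam.create_access_key(UserName='cloudberry-backup')['AccessKey']
print(key['AccessKeyId'], key['SecretAccessKey'])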

Fixing crashes in Business Intelligence Development Studio (BIDS)

I recently started to look at SQL Server Analysis Services to get a better understanding of the design and usage of OLAP cubes.

Unfortunately the development environment (Business Intelligence Development Studio), which is still based on Visual Studio 2008, kept crashing on me for no obvious reason. Being relatively new to the topic I obviously couldn’t judge whether the crashes were caused by my wrongdoing or by something happening in the application itself. In my defence though, I shall say that apps should terminate gracefully after a crash instead of just disappearing from the screen. Just saying!
Whenever I opened Translations, Calculations or Perspectives in the Cube Designer, my UI just disappeared from the screen, rendering those features unusable.
At first, the wide world of the web unfortunately didn’t yield any results, until I tried to work with Cube Actions and was finally presented with an error message that helped me move in the right direction:

Could not load file or assembly 'msmgdsrv, Version=9.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' or one of its dependencies. The system cannot find the file specified.

Researching the error message I found a few pointers that explained how the issue could be solved for MDX queries in SSMS. However, that did not fix my issue with Business Intelligence Development Studio, which is built on top of Visual Studio 2008.

Looking at the syntax of the fix, though, I figured that if the above changes are applied to ssms.exe.config, they should work equally well for devenv.exe by customising devenv.exe.config.

So, editing devenv.exe.config in C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE (your path may be different), I added the snippet below as an additional <dependentAssembly> element in the XML file.

<!-- CP 24/09/2012 added to fix missing msmgdsrv error -->
<dependentAssembly>
  <assemblyIdentity name="msmgdsrv" publicKeyToken="89845dcd8080cc91" />
  <codeBase version="9.0.0.0" href="C:\Program Files (x86)\Microsoft Analysis Services\AS OLEDB\10\msmgdsrv.dll" />
</dependentAssembly>
<!-- end of customisation -->

Needless to say, this solved all the issues I had encountered so far with the development environment; otherwise I wouldn’t have documented them here!

Using regular expressions with Connections profiles population

Today I was challenged to make use of regular expressions in Tivoli Directory Integrator to further filter LDAP distinguished names for the import into IBM Connections Profiles.

According to the Connections 3.0.1 documentation this can be achieved by specifying your regular expression in the source_ldap_required_dn_regex property of the profiles_tdi.PROPERTIES file.

Looking at the underlying TDI Assembly Line code, though, I discovered that source_ldap_required_dn_regex is never read. Instead, a property named source_ldap_required_dn_regex_pattern is evaluated.

Unfortunately, using the right property still doesn’t make the regular expression filter work properly. Instead, I received the following error in the ibmdi.log file when running the sync_all_dns Assembly Line:

ERROR [AssemblyLine.AssemblyLines/_internal_ldap_iterate.5] - [ldap_iterate]
CTGDIS181E Error while evaluating Hook 'After GetNext' in the Component
'ldap_iterate' (ldap_iterate.after_getnext).
com.ibm.jscript.InterpretException: Script interpreter error, line=26, col=74:
Unknown member 'test' in Java class 'java.lang.String'

Hunting down the offending line of code, the cause of the error became quite obvious. According to the JavaScript introduction from the TDI Users online community, TDI-specific objects are Java objects rather than JavaScript objects. In our example the source_ldap_required_dn_regex_pattern variable is a java.lang.String object, which does not have a method named test. The test method belongs to the JavaScript RegExp object, which was apparently expected in this case.

After a brief moment of wondering whether IBM ever tested that bit of code, or whether I am the only one with this issue, I replaced:

if(!lcConf.source_ldap_required_dn_regex_pattern.test(sdn)) {

with:

var re = new RegExp(lcConf.source_ldap_required_dn_regex_pattern);
if(!re.test(sdn)) {

This results in my ‘re’ variable being properly instantiated as a JavaScript RegExp object with the value of the source_ldap_required_dn_regex_pattern string variable. I can then successfully use the JavaScript RegExp.test function to test the value of sdn. As an alternative way of solving the issue, I could also have rewritten the above line of code to use the matches method of the java.lang.String class, passing in the regular expression itself.