Apache Spark: Using AspectJ, Fake-S3 and Local Files to Save Time and Money

We store files in Amazon S3 and run our Spark application in AWS.
Integration tests are slow because they have to call the S3 API to locate the files and then fetch them remotely.

To make integration tests faster and cheaper, we can use fake-s3 and local files instead.

Fake-S3 and How to Install and Run It
See https://github.com/jubos/fake-s3
Saving Time and Money with Fake S3

High Level Idea
The first time we run an integration test with a given set of parameters, we do call Amazon S3: we download the files, save them locally, put them into fake-s3, and change the file path from s3n:// to the local file path. The rest of the program then works with local files.

After that, when we run the integration test with the same parameters, we don't call Amazon S3 at all; we use the local fake-s3 instead.
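
The "download once, reuse afterwards" idea can be sketched with plain JDK classes. This is only an illustration: LocalFileCache, its remote function (which stands in for the real S3 download) and the remoteCalls counter are hypothetical names, not part of the original project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical helper, not part of the original project.
final class LocalFileCache {
    private final Path root;                       // local mirror directory
    private final Function<String, byte[]> remote; // stands in for the real S3 download
    private int remoteCalls = 0;                   // how many times the remote was hit

    LocalFileCache(Path root, Function<String, byte[]> remote) {
        this.root = root;
        this.remote = remote;
    }

    /** Returns the local copy of key, downloading it only when it is missing. */
    Path resolve(String key) {
        Path local = root.resolve(key);
        if (Files.exists(local)) {
            return local; // later runs: no remote call at all
        }
        try {
            Path parent = local.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            remoteCalls++;
            Files.write(local, remote.apply(key)); // first run: fetch and store
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return local;
    }

    int remoteCalls() {
        return remoteCalls;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("local-s3-root");
        LocalFileCache cache = new LocalFileCache(root,
                key -> ("contents of " + key).getBytes(StandardCharsets.UTF_8));
        cache.resolve("logs/part-0001.log"); // downloads
        cache.resolve("logs/part-0001.log"); // served from disk
        System.out.println("remote calls: " + cache.remoteCalls());
    }
}
```

In the real setup the remote function would be the S3 download, and the same local file would also be uploaded to fake-s3.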

Low Level Implementation
We don't want to add this logic to the program code, because it only matters when we run integration tests locally.

Using AspectJ to change behavior for local integration tests only
In our main spring-context.xml:
<import resource="classpath:config/${env}/spring-context.xml"/>
In config/local/spring-context.xml, define an aspect that does the two things listed after the configuration:
<bean id="usingFakeS3Aspect"
 class="commons.util.aspect.UsingFakeS3Aspect">
 <property name="retriesNumber" value="2" />
</bean>

<aop:config>
 <aop:aspect id="akeS3Aspect" ref="usingFakeS3Aspect">
  <aop:pointcut id="pointCutS3PostConstructor"
   expression="execution(* utils.S3Utils.postConstructor(..)) and target(s3Util)" />
  <aop:after method="afterPostConstructor" pointcut-ref="pointCutS3PostConstructor" />

  <aop:pointcut id="pointCutSetLogFilePath"
   expression="execution( * config.QosEventsContextImpl.setLogFilePath(*)) and target(context) and args(logFilePath)" />
  <aop:around method="useLocalFSInsteadOfS3n" pointcut-ref="pointCutSetLogFilePath" />
 </aop:aspect>
</aop:config>
1. After the S3Utils post constructor runs, if test.run.useFakeS3 is true, point the client at fake-s3:
s3.setEndpoint("http://localhost:4567");
s3.setS3ClientOptions(new S3ClientOptions().withPathStyleAccess(true));
2. Around context.setLogFilePath, if test.run.useLocalFS is true and the local files don't exist, get them from Amazon S3, save them locally, and put them into fake-s3. Then change the file path from s3n:// to the local file path.
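
To show what "around" advice does in step 2 without pulling in Spring or AspectJ, here is a minimal sketch that uses a JDK dynamic proxy as a stand-in for the aspect. The LogContext interface, SimpleContext and toLocalPath are hypothetical, and the prefix-stripping path mapping is a simplification: the real aspect builds the local path from context.getFilesToIngest().

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Illustration only: a JDK dynamic proxy plays the role of the around advice.
final class AroundAdviceSketch {
    /** Hypothetical stand-in for the context whose setter is intercepted. */
    interface LogContext {
        void setLogFilePath(String path);
        String getLogFilePath();
    }

    static final class SimpleContext implements LogContext {
        private String path;
        public void setLogFilePath(String path) { this.path = path; }
        public String getLogFilePath() { return path; }
    }

    /** Maps s3n://bucket/key to localRoot/bucket/key; other paths pass through unchanged. */
    static String toLocalPath(String path, String localRoot) {
        String prefix = "s3n://";
        if (path == null || !path.startsWith(prefix)) {
            return path;
        }
        return localRoot + "/" + path.substring(prefix.length());
    }

    /** Wraps target so that setLogFilePath receives a rewritten argument when enabled. */
    static LogContext withLocalFs(LogContext target, String localRoot, boolean enabled) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object[] callArgs = args;
            if (enabled && "setLogFilePath".equals(method.getName())) {
                callArgs = new Object[] { toLocalPath((String) args[0], localRoot) };
            }
            return method.invoke(target, callArgs); // plays the role of pjp.proceed(args)
        };
        return (LogContext) Proxy.newProxyInstance(
                LogContext.class.getClassLoader(), new Class<?>[] { LogContext.class }, handler);
    }

    public static void main(String[] args) {
        boolean useLocalFS = Boolean.parseBoolean(System.getProperty("test.run.useLocalFS", "false"));
        LogContext ctx = withLocalFs(new SimpleContext(), "/some-path", useLocalFS);
        ctx.setLogFilePath("s3n://my-bucket/logs/part-0001.log");
        System.out.println(ctx.getLogFilePath());
    }
}
```

Running it with -Dtest.run.useLocalFS=true prints a path under /some-path; without the flag the s3n:// path is left untouched, which is the same switch the aspect relies on.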
Implementation Code
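import java.io.File;
import java.util.Set;

import org.aspectj.lang.ProceedingJoinPoint;
// Assuming SLF4J is the logging API in use.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.S3ClientOptions;

// S3Utils, AwsBucketConfig and Context are project classes; import them from your own packages.
public class UsingFakeS3Aspect {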
    private static final Logger LOGGER = LoggerFactory.getLogger(UsingFakeS3Aspect.class);
    private S3Utils s3Utils;
    //@After("execution(* utils.S3Utils.postConstructor(..)) && target(s3Util)")
    public void afterPostConstructor(S3Utils s3Util) {
        this.s3Utils = s3Util;
        boolean useFakeS3 = "true".equals(System.getProperty("test.run.useFakeS3", "false"));
        if (useFakeS3) {
            LOGGER.info("Using fake-s3");
            AwsBucketConfig s3 = s3Util.getS3Config();
            s3.setEndpoint("http://localhost:4567");
            s3.setS3ClientOptions(new S3ClientOptions().withPathStyleAccess(true));
        }
    }

    // see http://stackoverflow.com/questions/4312224/aspectj-overwrite-an-argument-of-a-method
    //@Around("execution( * ContextImpl.setLogFilePath(*) ) && target(context) && args(logFilePath)")
    public void useLocalFSInsteadOfS3n(final ProceedingJoinPoint pjp, Context context, String logFilePath)
            throws Throwable {
        boolean useLocalFS = "true".equals(System.getProperty("test.run.useLocalFS"));

        String newLogFilePath = logFilePath;
        if (useLocalFS && logFilePath != null && logFilePath.startsWith("s3n://")) {
            boolean isFirstTime = "false".equals(System.getProperty("test.run.useFakeS3", "false"));
            newLogFilePath = copyFromAWSToFakeS3AndUsingLocalFiles(context.getFilesToIngest(), isFirstTime);
        }
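        // With native AspectJ weaving, binding target(context) means proceed() expects the target first.
        pjp.proceed(new Object[] {context, newLogFilePath});
        // With Spring AOP proxies (as set up by aop:config), proceed() should get only the method arguments:
        // pjp.proceed(new Object[] {newLogFilePath});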
    }

    private String copyFromAWSToFakeS3AndUsingLocalFiles(Set<String> filesToIngest, boolean isFirstTime) {
        String logFilePath;
        StringBuilder sb = new StringBuilder();
        AmazonS3Client fakeS3 = S3Utils.createFakeS3();
        s3Utils.createfakeBucketIfNotExists(fakeS3);
        // check whether this file exists in fake-s3, if not create it
        for (String fileToIngest : filesToIngest) {
            File localFile = new File(LOCAL_S3_ROOT, fileToIngest);
            if (!localFile.exists()) {
                s3Utils.saveAWSFileToLocal(fileToIngest, localFile);
                s3Utils.saveLocalFileToFakeS3(localFile, fakeS3);
            } else {
                if (isFirstTime) {
                    // only save local file to fake-s3 once - the first time.
                    s3Utils.saveLocalFileToFakeS3(localFile, fakeS3);
                }
            }
            sb.append(new File(LOCAL_S3_ROOT, fileToIngest).getAbsolutePath()).append(",");
        }

        if (sb.length() > 0) {
            sb.setLength(sb.length() - 1);
        }
        logFilePath = sb.toString();
        return logFilePath;
    }

    public static final String LOCAL_S3_ROOT = "/some-path";
    // Another approach: this would cause setLogFilePath called again with changed parameters.
    // @After("execution( * ContextImpl.setLogFilePath(*) ) && target(context)")
    public void useLocalFSInsteadOfS3nAnotherApproach(Context context) {
        String logFilePath = context.getLogFilePath();
        boolean useLocalFS = "true".equals(System.getProperty("test.run.useLocalFS"));

        if (useLocalFS && logFilePath != null && logFilePath.startsWith("s3n://")) {
            boolean isFirstTime = "false".equals(System.getProperty("test.run.useFakeS3", "false"));
            context.setLogFilePath(copyFromAWSToFakeS3AndUsingLocalFiles(context.getFilesToIngest(), isFirstTime));
            LOGGER.info("Using local FS instead of S3.");
        }
    }    
}
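
The StringBuilder loop that builds the comma-separated list of local paths could also be written with streams. A minimal sketch, assuming the same "local root plus file name" layout; LocalPathJoiner and joinLocalPaths are hypothetical names:

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: joins the absolute local paths of the files with commas.
final class LocalPathJoiner {
    static String joinLocalPaths(String localRoot, Set<String> filesToIngest) {
        return filesToIngest.stream()
                .map(name -> new File(localRoot, name).getAbsolutePath())
                .collect(Collectors.joining(",")); // no trailing comma to trim
    }

    public static void main(String[] args) {
        Set<String> files = new LinkedHashSet<>(Arrays.asList("a.log", "b.log"));
        System.out.println(joinLocalPaths("/some-path", files));
    }
}
```

Collectors.joining returns an empty string for an empty set, so the length check and setLength call are not needed.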