Improve Apache Commons ArrayUtils removeElements to O(n)

Recently, I read apache commons ArrayUtils removeElements source code and found that its implementation is not efficient when try to get all indexes that needed to be removed.

It's O(m*n): n is the length of origin array, m is the unique count of toBeRemoved.
    final BitSet toRemove = new BitSet();
    for (final Map.Entry<Byte, MutableInt> e : occurrences.entrySet()) {
        final Byte v = e.getKey();
        int found = 0;
        for (int i = 0, ct = e.getValue().intValue(); i < ct; i++) {
            found = indexOf(array, v.byteValue(), found);
            if (found < 0) {
                break;
            }
            toRemove.set(found++);
        }
    }

We can just scan the origin array once to get all indexes that needed to be removed.
    final BitSet toRemove = new BitSet();

    // the only change here, time complexity changed to O(n)
    for (int i = 0; i < array.length; i++) {
        final int key = array[i];

        final MutableInt count = occurrences.get(key);
        if (count != null) {
            count.decrement();
            if (count.intValue() == 0) {
                occurrences.remove(key);
            }
            toRemove.set(i);
        }
    }

I created one JIRA: Improve ArrayUtils removeElements time complexity to O(n)
The complete code
The implementation is quite elegant and robust.
 
    public static int[] removeElements(final int[] array, final int... values) {
        if (isEmpty(array) || isEmpty(values)) {
            return clone(array); // can't just return origin array
        }
        final HashMap<Integer, MutableInt> occurrences = new HashMap<Integer, MutableInt>(values.length);
        for (final int v : values) {
            final Integer boxed = Integer.valueOf(v);
            final MutableInt count = occurrences.get(boxed);
            if (count == null) {
                occurrences.put(boxed, new MutableInt(1));
            } else {
                count.increment();
            }
        }
        final BitSet toRemove = new BitSet();

        // My change here, time complexity changed to O(n)
        for (int i = 0; i < array.length; i++) {
            final int key = array[i];

            final MutableInt count = occurrences.get(key);
            if (count != null) {
                count.decrement();
                if (count.intValue() == 0) {
                    occurrences.remove(key);
                }
                toRemove.set(i);
            }
        }

        return (int[]) removeAll(array, toRemove);
    }

    static Object removeAll(final Object array, final BitSet indices) {
        final int srcLength = ArrayUtils.getLength(array);
        // No need to check maxIndex here, because method only currently called from removeElements()
        // which guarantee to generate on;y valid bit entries.
//        final int maxIndex = indices.length();
//        if (maxIndex > srcLength) { 
//            throw new IndexOutOfBoundsException("Index: " + (maxIndex-1) + ", Length: " + srcLength);
//        }
        final int removals = indices.cardinality(); // true bits are items to remove
        final Object result = Array.newInstance(array.getClass().getComponentType(), srcLength - removals);
        int srcIndex=0;
        int destIndex=0;
        int count;
        int set;
        while((set = indices.nextSetBit(srcIndex)) != -1){
            count = set - srcIndex;
            if (count > 0) {
                System.arraycopy(array, srcIndex, result, destIndex, count);
                destIndex += count;
            }
            srcIndex = indices.nextClearBit(set);
        }
        count = srcLength - srcIndex;
        if (count > 0) {
            System.arraycopy(array, srcIndex, result, destIndex, count);            
        }
        return result;
    } 

Post a Comment

Labels

Java (159) Lucene-Solr (111) Interview (61) All (58) J2SE (53) Algorithm (45) Soft Skills (37) Eclipse (33) Code Example (31) Linux (24) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (16) Defects (14) Text Mining (14) J2EE (13) Network (13) Troubleshooting (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) Problem Solving (9) UIMA (9) html (9) Http Client (8) Maven (8) Security (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) Dynamic Languages (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) How to Interview (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Python (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts