Simplify splitting a String into alpha and numeric parts

  • Requirement: Parse a String into chunks of numeric characters and alpha characters. Alpha chunks should be separated from numeric chunks; all other characters should be ignored.



    Example Data:



    Input           Desired Output
    1A              [1, A]
    12              [12]
    12G             [12, G]
    12ABC-SFS513    [12, ABC, SFS, 513]
    AGE+W#FE        [AGE, W, FE]
    -12WE-          [12, WE]
    -12- &%3WE-     [12, 3, WE]


    Question:



    The code below accomplishes this. However, I am looking for suggestions for a better way to do it (maybe a crazy regex using String.split()?) or for changes that would make this code more readable and easier to follow.



    Code:



    private static String VALID_PATTERN = "[0-9]+|[A-Z]+";

    private List<String> parse(String toParse){
        List<String> chunks = new LinkedList<String>();
        toParse = toParse + "$"; //Added invalid character to force the last chunk to be chopped off
        int beginIndex = 0;
        int endIndex = 0;
        while(endIndex < toParse.length()){
            while(toParse.substring(beginIndex, endIndex + 1).matches(VALID_PATTERN)){
                endIndex++;
            }
            if(beginIndex != endIndex){
                chunks.add(toParse.substring(beginIndex, endIndex));
            } else {
                endIndex++;
            }
            beginIndex = endIndex;
        }
        return chunks;
    }

  • sepp2k (correct answer)

    10 years ago

    First of all, yes, there is a crazy regex you can give to String.split:



    "[^A-Z0-9]+|(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])"


    This splits on any run of characters that are neither digits nor capital letters, and also at every position where a capital letter is directly followed by a digit or a digit is directly followed by a capital letter. The trick is to match the empty position between a letter and a digit (or vice versa) without consuming either character: look-behind checks the character before the split point and look-ahead checks the character after it.
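    As a rough illustration of how that split regex behaves, here is a minimal sketch. The class name SplitChunks and the constant SPLIT_REGEX are made up for this example. One detail worth noting: when the input starts with a separator character, String.split returns an empty string as the first element, so the sketch filters empty chunks out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class SplitChunks {
    // The split regex from the answer: runs of "other" characters, plus the
    // zero-width boundaries between a capital letter and a digit.
    static final String SPLIT_REGEX =
        "[^A-Z0-9]+|(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])";

    static List<String> parse(String toParse) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : toParse.split(SPLIT_REGEX)) {
            // A separator at the very start of the input yields a leading "",
            // which is not a real chunk, so skip it.
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(parse("12ABC-SFS513")); // [12, ABC, SFS, 513]
        System.out.println(parse("-12- &%3WE-"));  // [12, 3, WE]
    }
}
```

    The extra filtering step is one more sign that splitting is not quite the right tool here.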



    However, as you've probably noticed, the above regex is quite a bit more complicated than your VALID_PATTERN. This is because what you're really trying to do is extract certain parts from the string, not split it.






    So finding all the parts of the string that match the pattern and putting them in a list is the more natural approach to the problem. This is what your code does, but in a needlessly complicated way. You can simplify it considerably by using Pattern.matcher like this:



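    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final Pattern VALID_PATTERN = Pattern.compile("[0-9]+|[A-Z]+");

    private List<String> parse(String toParse) {
        List<String> chunks = new LinkedList<String>();
        Matcher matcher = VALID_PATTERN.matcher(toParse);
        // find() advances to the next run of digits or capital letters
        while (matcher.find()) {
            chunks.add(matcher.group());
        }
        return chunks;
    }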





    If you do something like this more than once, you might want to refactor the body of this method into a method findAll which takes the string and the pattern as arguments, and then call it as findAll(toParse, VALID_PATTERN) in parse.
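    One possible shape for that refactoring is sketched below. The class name ChunkParser is invented for the example, and ArrayList is used in place of the LinkedList above purely as a matter of taste; findAll and parse follow the names suggested in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class ChunkParser {
    private static final Pattern VALID_PATTERN = Pattern.compile("[0-9]+|[A-Z]+");

    // Collects every non-overlapping match of pattern in input, left to right.
    static List<String> findAll(String input, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // parse is now a one-line call into the reusable helper.
    static List<String> parse(String toParse) {
        return findAll(toParse, VALID_PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(parse("12ABC-SFS513")); // [12, ABC, SFS, 513]
        System.out.println(parse("AGE+W#FE"));     // [AGE, W, FE]
    }
}
```

    Because findAll takes the pattern as an argument, the same helper could be reused with any other compiled Pattern.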


    I had not looked into the Matcher class. This is excellent.

Licensed under CC-BY-SA with attribution


Content dated before 7/24/2021 11:53 AM