How to parse and group hierarchical list items from an unindented string in Python?

Problem Statement

Given an unindented string as input, perform these steps:

  • Identify the list items at the highest level of the hierarchy within the string. These top-level items can be identified by the following criteria:

    • Numbering systems (e.g., 1., 2., 3.)
    • Lettering systems (e.g., A., B., C.)
    • Bullets (e.g., -, *, •)
    • Symbols (e.g., >, #, §)
  • For each top-level item identified in step 1:

    a. Group it with all subsequent lower-level items until the next top-level item is encountered. Lower-level items can be identified by the following criteria:

    • Prefixes (e.g., 1.1, 1.2, 1.3)
    • Bullets (e.g., -, *, •)
    • Alphanumeric sequences (e.g., a., b., c.)
    • Roman numerals (e.g., i., ii., iii.)

    b. Concatenate the top-level item with its associated lower-level items into a single string, maintaining the original formatting and delimiters. The formatting and delimiters should be preserved as they appear in the input string.

  • Return the resulting grouped list items as a Python list where each element represents a top-level item and its associated lower-level items. Each element in the list should be a string containing the concatenated top-level item and its lower-level items.

  • Exclude any text that appears before the first top-level item and after the last top-level item from the output. Only the content between the first and last top-level items should be included in the output list.

Goal

The goal is to create a Python method that takes an unindented string as input, identifies the top-level items and their associated lower-level items based on the specified criteria, concatenates them into a single string for each top-level item while maintaining the original formatting and delimiters, and returns the resulting grouped list items as a Python list. The output list should match the desired format, with each element representing a top-level item and its associated lower-level items.

Request

Please provide an explanation and guidance on how to create a Python method that can successfully achieve the goal outlined above. The explanation should include the steps involved, any necessary data structures or algorithms, and considerations for handling different scenarios and edge cases.

Additional Details

  • I have attempted to create a Python method to achieve the tasks outlined above, but my attempts have been unsuccessful. The methods I have tried do not produce the expected outputs for the given inputs.

  • To aid in testing and validating the solution, I have created and included numerous sample inputs and their corresponding expected outputs below. These test cases cover various scenarios and edge cases to ensure the robustness of the method.

Code Attempts:

Attempt 1:

def process_list_hierarchy(text):
    # Helper function to determine the indentation level
    def get_indentation_level(line):
        return len(line) - len(line.lstrip())

    # Helper function to parse the input text into a list of lines with their hierarchy levels
    def parse_hierarchy(text):
        lines = text.split('\n')
        hierarchy = []
        for line in lines:
            if line.strip():  # Ignore empty lines
                level = get_indentation_level(line)
                hierarchy.append((level, line.strip()))
        return hierarchy

    # Helper function to build a tree structure from the hierarchy levels
    def build_tree(hierarchy):
        tree = []
        stack = [(-1, tree)]  # Start with a dummy root level
        for level, content in hierarchy:
            # Find the correct parent level
            while stack and stack[-1][0] >= level:
                stack.pop()
            # Create a new node and add it to its parent's children
            node = {'content': content, 'children': []}
            stack[-1][1].append(node)
            stack.append((level, node['children']))
        return tree

    # Helper function to combine the tree into a single list
    def combine_tree(tree, combined_list=[], level=0):
        for node in tree:
            combined_list.append(('  ' * level) + node['content'])
            if node['children']:
                combine_tree(node['children'], combined_list, level + 1)
        return combined_list

    # Parse the input text into a hierarchy
    hierarchy = parse_hierarchy(text)
    # Build a tree structure from the hierarchy
    tree = build_tree(hierarchy)
    # Combine the tree into a single list while maintaining the hierarchy
    combined_list = combine_tree(tree)
    # Return the combined list as a string
    return '\n'.join(combined_list)

Attempt 2:

def organize_hierarchically(items):
    def get_level(item):
        match = re.match(r'^(\d+\.?|\-|\*)', item)
        return len(match.group()) if match else 0

    grouped_items = []
    for level, group in groupby(items, key=get_level):
        if level == 1:
            grouped_items.append('\n'.join(group))
        else:
            grouped_items[-1] += '\n' + '\n'.join(group)

    return grouped_items

Attempt 3:

from bs4 import BeautifulSoup
import nltk

def extract_sub_objectives(input_text):
    soup = BeautifulSoup(input_text, 'html.parser')
    text_content = soup.get_text()

    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text_content)

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []

    # Iterate through the sentences and extract sub-objectives
    current_sub_objective = ""
    for sentence in sentences:
        if sentence.startswith(("1.", "2.", "3.", "4.")):
            if current_sub_objective:
                sub_objectives.append(current_sub_objective)
                current_sub_objective = ""
            current_sub_objective += sentence + "\n"
        elif current_sub_objective:
            current_sub_objective += sentence + "\n"

    # Append the last sub-objective, if any
    if current_sub_objective:
        sub_objectives.append(current_sub_objective)

    return sub_objectives

Attempt 4:

def extract_sub_objectives(input_text, preserve_formatting=False):
    # Modified to strip both single and double quotes
    input_text = input_text.strip('\'"')
    messages = []
    messages.append("Debug: Starting to process the input text.")
    # Debug message to show the input text after stripping quotes
    messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")

    # Define possible starting characters for new sub-objectives
    start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
    messages.append(f"Debug: Start characters defined: {start_chars}")

    # Define a broader range of continuation characters
    continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
    messages.append(f"Debug: Continuation characters defined: {continuation_chars}")

    # Replace escaped newline characters with actual newline characters
    input_text = input_text.replace('\\n', '\n')
    # Split the input text into lines
    lines = input_text.split('\n')
    messages.append(f"Debug: Input text split into lines: {lines}")

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
    # Initialize an empty string to store the current sub-objective
    current_sub_objective = ''
    # Initialize a counter for the number of continuations in the current sub-objective
    continuation_count = 0

    # Function to determine if a line is a new sub-objective
    def is_new_sub_objective(line):
        # Strip away leading quotation marks and whitespace
        line = line.strip('\'"').strip()
        return any(line.startswith(start_char) for start_char in start_chars)

    # Function to determine if a line is a continuation
    def is_continuation(line, prev_line):
        if not prev_line:
            return False
        # Check if the line starts with an alphanumeric followed by a period or parenthesis
        if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
            # Check if it follows the sequence of the previous line
            if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                return False
            elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                return False
            else:
                return True
        # Add a condition to check for lower-case letters followed by a full stop
        if line[0].islower() and line[1] == '.':
            return True
        return any(line.startswith(continuation_char) for continuation_char in continuation_chars)

    # Iterate over each line
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ''
        # Check if the line is a new sub-objective
        if is_new_sub_objective(line):
            messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
            # If we have a current sub-objective, check the continuation count
            if current_sub_objective:
                if continuation_count < 2:
                    messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                    for message in messages:
                        print(message)
                    return None
                # Check the preserve_formatting parameter before adding
                sub_objectives.append(
                    current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
            # Reset the current sub-objective to the new one and reset the continuation count
            current_sub_objective = line
            continuation_count = 0
        # Check if the line is a continuation
        elif is_continuation(line, prev_line):
            messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
            # Increment the continuation count
            continuation_count += 1
        # Handle lines that are part of the current sub-objective but don't start with a continuation character
        elif current_sub_objective:
            messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()

    # If we have a current sub-objective, check the continuation count before adding it to the list
    if current_sub_objective:
        if continuation_count < 2:
            messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
            for message in messages:
                print(message)
            return None
        # Check the preserve_formatting parameter before adding
        sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
        messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")

    # Print the debug messages if no sub-objectives are found
    if not sub_objectives:
        for message in messages:
            print(message)

    return sub_objectives

Sample Data (Inputs and associated Outputs):

For the first sample, just having a play with regexes

(?P<bullet>\d{1,2}\.(?:[\s\S](?![a-z]\.))+\\n)|(?P<letter>(?<=\\n)[a-z]\.(?:[\s\S](?!\-))+\\n)|(?P<dash>(?<=\\n)\-(?:[\s\S](?!\d{1,2}\.))+?\\n)

Test here

A breakdown of the first group
(?P<bullet>\d{1,2}\.(?:[\s\S](?![a-z]\.))+\\n)

?P<bullet> - A named capture group called bullet
\d{1,2}\. - A decimal up to 2 figures long followed by a dot

Pattern wrapped in a non capturing group (?: ... ) and looked for + 1 or many times up to and including a newline \\n
(?:[\s\S](?![a-z]\.))+\\n
-- [\s\S] - any character
-- (?![a-z]\.) - negative lookahead e.g. not followed by a letter with a dot

The other capturing groups follow a similar pattern.

The regex is possibly flawed, but might be a start.

edit: I can’t help but feel there is a recursive solution to this, with \d\., [a-z]\., - representing the depth of recursion.

Playing with Javascript

A simple string example:

const testString = 
  '\n1. one\n2. two\na. three\nb. four\n2. five\na. six\n- seven\n- eight\n3. nine\n';

Simplified from the previous regex. We only need to match up to a newline character.

const subSectionsRx = 
  /(?<=\n)(\d{1,2}\..+?)\n|(?<=\n)([a-z]\..+?)\n|(?<=\n)(\-.+?)\n/g;

Capture groups are in order of indenting. 1: (bullet number) | 2: (bullet letter) | 3: (dash)

function getGroupIndex(captured) {
  for (let index = 1; index < captured.length; index += 1){
    if (captured[index]) {
      return index
    }
  }
  return 0
}

A match e.g. subSectionsRx.exec(strg) returns an array like object e.g. for a bullet number match

[
  '1. one\n', // full match
  '1. one', // ← capture group match with an index of 1
  undefined, 
  undefined, 
  index: 1,
  input: '\n1. one\n2. two\na. three\nb. four\n2. five\na. six\n- seven\n- eight\n3. nine\n', 
  groups: undefined
]

Making use of padStart we can use the groupIndex to define how many tabs are needed as a prefix to the match.

function parseString(strg) {
  const lines = [];
  let match;
  
  while ((match = subSectionsRx.exec(strg)) !== null) {
    let groupIndex = getGroupIndex(match)
    lines.push(''.padStart(groupIndex - 1, '\t').concat(match[groupIndex]))
  }
  return lines
}

console.log(parseString(testString).join('\n'));

Output

1. one
2. two
	a. three
	b. four
2. five
	a. six
		- seven
		- eight
3. nine

Testing on the first sample data

1. Develop a Comprehensive Game Library:
	a. Acquire games from Catan, Ticket to Ride, Pandemic, Codenames, and Azul to enrich the game library.
		- Utilize publicly available resources provided by Catan, Ticket to Ride, Pandemic, Codenames, and Azul to strategically select games.
		- Store acquired games in a organized format such as a database or inventory system for efficient retrieval and analysis.
2. Implement Star Trek Theme Integration:
	a. Leverage Star Trek themes and elements to enhance the gaming experience and attract fans of the franchise.
		- Utilize open-source Star Trek assets such as images, quotes, and lore to incorporate into game components and designs.
		- Use creative storytelling techniques to weave Star Trek narratives into the gameplay and objectives of selected games.
3. Enhance Player Engagement and Interaction:
	a. Incorporate real-time monitoring of player feedback and preferences to continuously update the game library.
		- Utilize player surveys and open-source forums to gather up-to-date information on emerging trends and popular mechanics.
		- Implement social media scraping techniques to extract relevant player insights from publicly available sources and discussions.
4. Establish Optimal Game Selection Strategies:
	a. Employ data analysis and player profiling algorithms to develop optimal game selection strategies for different player groups.
		- Utilize data analysis to identify patterns in player preferences and behaviors based on demographics and gaming history.
		- Implement player profiling strategies to anticipate and cater to the diverse interests and skill levels of different player segments.
5. Perform Game Session Analysis:
	a. Develop algorithms to analyze game sessions and identify potential areas for improvement in game balance, pacing, and player satisfaction.
		- Utilize graph theory and network analysis tools to model player interactions and identify critical points in the game flow.
		- Integrate data from playtesting sessions to profile and understand the progression of game sessions, enabling proactive game refinement strategies.

I know this isn’t python, but the ideas are there and hopefully it is along the right track.

I’m sure my python could be improved

subSectionsRx = r'(?<=\n)(\d{1,2}\..+?)\n|(?<=\n)([a-z]\..+?)\n|(?<=\n)(\-.+?)\n'

def getGroupIndex(groups):
    for index, value in enumerate(groups):
        if value:
            return index

def parseString(strg):
    matches = re.findall(subSectionsRx, strg)
    lines = []

    for groups in matches:
        index = getGroupIndex(groups)
        lines.append(''.rjust(index, '\t') + groups[index])

    return lines

testString = '\n1. one\n2. two\na. three\nb. four\n2. five\na. six\n- seven\n- eight\n3. nine\n'
print('\n'.join(parseString(testString)))

Output:

1. one
2. two
        a. three
        b. four
2. five
        a. six
                - seven
                - eight
3. nine

Thank you for your efforts!

When I test your method against each test case, it shows that 6 out of 37 are successful (or 31 out of 37 are unsuccessful). I put together a small program (shown below) which uses your method and compares the output against each expected output. If you could let me know if I’m missing something, that would be greatly appreciated.

Code without inputs and outputs:

from itertools import groupby, zip_longest

import nltk
from bs4 import BeautifulSoup


def process_list_hierarchy(text):
    # Helper function to determine the indentation level
    def get_indentation_level(line):
        return len(line) - len(line.lstrip())

    # Helper function to parse the input text into a list of lines with their hierarchy levels
    def parse_hierarchy(text):
        lines = text.split('\n')
        hierarchy = []
        for line in lines:
            if line.strip():  # Ignore empty lines
                level = get_indentation_level(line)
                hierarchy.append((level, line.strip()))
        return hierarchy

    # Helper function to build a tree structure from the hierarchy levels
    def build_tree(hierarchy):
        tree = []
        stack = [(-1, tree)]  # Start with a dummy root level
        for level, content in hierarchy:
            # Find the correct parent level
            while stack and stack[-1][0] >= level:
                stack.pop()
            # Create a new node and add it to its parent's children
            node = {'content': content, 'children': []}
            stack[-1][1].append(node)
            stack.append((level, node['children']))
        return tree

    # Helper function to combine the tree into a single list
    def combine_tree(tree, combined_list=[], level=0):
        for node in tree:
            combined_list.append(('  ' * level) + node['content'])
            if node['children']:
                combine_tree(node['children'], combined_list, level + 1)
        return combined_list

    # Parse the input text into a hierarchy
    hierarchy = parse_hierarchy(text)
    # Build a tree structure from the hierarchy
    tree = build_tree(hierarchy)
    # Combine the tree into a single list while maintaining the hierarchy
    combined_list = combine_tree(tree)
    # Return the combined list as a string
    return '\n'.join(combined_list)

def organize_hierarchically(items):
    def get_level(item):
        match = re.match(r'^(\d+\.?|\-|\*)', item)
        return len(match.group()) if match else 0

    grouped_items = []
    for level, group in groupby(items, key=get_level):
        if level == 1:
            grouped_items.append('\n'.join(group))
        else:
            grouped_items[-1] += '\n' + '\n'.join(group)

    return grouped_items

def extract_sub_objectives(input_text):
    soup = BeautifulSoup(input_text, 'html.parser')
    text_content = soup.get_text()

    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text_content)

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []

    # Iterate through the sentences and extract sub-objectives
    current_sub_objective = ""
    for sentence in sentences:
        if sentence.startswith(("1.", "2.", "3.", "4.")):
            if current_sub_objective:
                sub_objectives.append(current_sub_objective)
                current_sub_objective = ""
            current_sub_objective += sentence + "\n"
        elif current_sub_objective:
            current_sub_objective += sentence + "\n"

    # Append the last sub-objective, if any
    if current_sub_objective:
        sub_objectives.append(current_sub_objective)

    return sub_objectives

def extract_sub_objectivesV2(input_text, preserve_formatting=False):
    # Modified to strip both single and double quotes
    input_text = input_text.strip('\'"')
    messages = []
    messages.append("Debug: Starting to process the input text.")
    # Debug message to show the input text after stripping quotes
    messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")

    # Define possible starting characters for new sub-objectives
    start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
    messages.append(f"Debug: Start characters defined: {start_chars}")

    # Define a broader range of continuation characters
    continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
    messages.append(f"Debug: Continuation characters defined: {continuation_chars}")

    # Replace escaped newline characters with actual newline characters
    input_text = input_text.replace('\\n', '\n')
    # Split the input text into lines
    lines = input_text.split('\n')
    messages.append(f"Debug: Input text split into lines: {lines}")

    # Initialize an empty list to store the sub-objectives
    sub_objectives = []
    # Initialize an empty string to store the current sub-objective
    current_sub_objective = ''
    # Initialize a counter for the number of continuations in the current sub-objective
    continuation_count = 0

    # Function to determine if a line is a new sub-objective
    def is_new_sub_objective(line):
        # Strip away leading quotation marks and whitespace
        line = line.strip('\'"').strip()
        return any(line.startswith(start_char) for start_char in start_chars)

    # Function to determine if a line is a continuation
    def is_continuation(line, prev_line):
        if not prev_line:
            return False
        # Check if the line starts with an alphanumeric followed by a period or parenthesis
        if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
            # Check if it follows the sequence of the previous line
            if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                return False
            elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                return False
            else:
                return True
        # Add a condition to check for lower-case letters followed by a full stop
        if line[0].islower() and line[1] == '.':
            return True
        return any(line.startswith(continuation_char) for continuation_char in continuation_chars)

    # Iterate over each line
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ''
        # Check if the line is a new sub-objective
        if is_new_sub_objective(line):
            messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
            # If we have a current sub-objective, check the continuation count
            if current_sub_objective:
                if continuation_count < 2:
                    messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                    for message in messages:
                        print(message)
                    return None
                # Check the preserve_formatting parameter before adding
                sub_objectives.append(
                    current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
            # Reset the current sub-objective to the new one and reset the continuation count
            current_sub_objective = line
            continuation_count = 0
        # Check if the line is a continuation
        elif is_continuation(line, prev_line):
            messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
            # Increment the continuation count
            continuation_count += 1
        # Handle lines that are part of the current sub-objective but don't start with a continuation character
        elif current_sub_objective:
            messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
            # Add the line to the current sub-objective, checking preserve_formatting
            current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()

    # If we have a current sub-objective, check the continuation count before adding it to the list
    if current_sub_objective:
        if continuation_count < 2:
            messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
            for message in messages:
                print(message)
            return None
        # Check the preserve_formatting parameter before adding
        sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
        messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")

    # Print the debug messages if no sub-objectives are found
    if not sub_objectives:
        for message in messages:
            print(message)

    return sub_objectives

import re

def parse_string(strg):
    sub_sections_rx = r'(?<=\n)(\d{1,2}\..+?)\n|(?<=\n)([a-z]\..+?)\n|(?<=\n)(\-.+?)\n'
    matches = re.findall(sub_sections_rx, strg)
    lines = []

    for groups in matches:
        index = next((i for i, value in enumerate(groups) if value), None)
        if index is not None:
            lines.append(''.rjust(index, '\t') + groups[index])

    return lines

import re



class ListHierarchyProcessor:

    def __init__(self, method='process_list_hierarchy'):
        self.method = method
        self.actual_outputs = []
        self.perfect_match_count = 0  # Instance variable to track the number of perfect matches
        self.inputs = [

...
        ]

        self.expected_outputs = [
            ...
           ]

    def process_inputs(self):
        for i, input_text in enumerate(self.inputs):
            print(f"========== Processing Input {i + 1} using method '{self.method}' ==========")
            print()

            # Existing code to process the input based on the method
            if self.method == 'process_list_hierarchy':
                output = process_list_hierarchy(input_text).split('\n')
            elif self.method == 'organize_hierarchically':
                output = organize_hierarchically(input_text.split('\n'))
            elif self.method == 'extract_sub_objectives':
                output = extract_sub_objectives(input_text)
            elif self.method == 'extract_sub_objectivesV2':
                output = extract_sub_objectivesV2(input_text)
            elif self.method == 'parse_string':
                output = parse_string(input_text)
            else:
                raise ValueError(f"Invalid method: {self.method}")

            self.actual_outputs.append(output)

            print("Output:")
            print("-------")
            print(output)
            print()

            print("Expected Output:")
            print("----------------")
            print(self.expected_outputs[i])
            print()

            # Calculate similarity percentage
            similarity_percentage = self.calculate_similarity_percentage(output, self.expected_outputs[i])
            print(f"Similarity Percentage: {similarity_percentage:.2f}%")
            print()

            # Increment perfect_match_count if the output matches the expected output perfectly
            if similarity_percentage == 100:
                self.perfect_match_count += 1

            # Check if the output matches the expected output
            if len(output) != len(self.expected_outputs[i]):
                print("Result: Output does not match the expected output.")
                print("Differences:")
                for j, (output_line, expected_line) in enumerate(zip_longest(output, self.expected_outputs[i], fillvalue="")):
                    if output_line != expected_line:
                        print(f"Line {j + 1}:")
                        print(f"Output:   {output_line or 'None'}")
                        print(f"Expected: {expected_line or 'None'}")
            else:
                print("Result: Output matches the expected output.")

            print("=" * 70)
            print()

        # Print the total count of perfect matches after processing all inputs
        print(f"Total Perfect Matches: {self.perfect_match_count}")

    def calculate_similarity_percentage(self, actual_output, expected_output):
        total_lines = max(len(actual_output), len(expected_output))
        if total_lines == 0:
            return 100
        matching_lines = sum(1 for actual_line, expected_line in zip_longest(actual_output, expected_output, fillvalue="") if actual_line == expected_line)
        similarity_percentage = (matching_lines / total_lines) * 100
        return similarity_percentage


processor = ListHierarchyProcessor(method='parse_string')
#processor = ListHierarchyProcessor(method='organize_hierarchically')
#processor = ListHierarchyProcessor(method='extract_sub_objectives')
#processor = ListHierarchyProcessor(method='extract_sub_objectivesV2')
processor.process_inputs()

Full Code including inputs and outputs:

https://pastebin.com/pEH7rbPq

I only tested it with the first test case as per the output in post#3. It was just an idea to test out some regexes and see if that was viable.

Obviously a bit lightweight compared to your approach, building hierarchies/node trees etc.

My python is somewhat limited, but I will have a more thorough look when I get a chance. Obviously if anyone else here has some input, that would be great.

1 Like

I have just had another look and I was off with the brief. I thought the lower level items needed to be separate list items, but in fact these need to be concatenated with their respective top level items. Is that correct?

Obviously this would be easier to do within the actual test environment, but I will have another go at the exercise when I have time.

So… aside from getting rpg to do your homework for you…

  1. The problem is not well defined in step 1.
  2. The problem is not well defined in step 2a.
  3. The problem does not well define a termination point.

m_hutley I appreciate your concern, but I want to clarify that this is not a homework assignment. This is a real-world problem I’m working on for a personal project. As you can see from my multiple code attempts, I have invested significant time and effort into trying to solve this issue on my own.

I’ve explored various approaches, including parsing the input text into a hierarchy, building tree structures, and using regular expressions to identify patterns. However, despite my best efforts, none of these methods have been able to successfully handle the complex input format and produce the desired output.

To ensure the robustness of my solution, I’ve also created an extensive set of test cases covering different scenarios and edge cases. Unfortunately, my current implementations have failed to pass these test cases. That’s why I’m reaching out to the community for guidance and insights on how to approach this problem effectively. I’m open to any suggestions or alternative perspectives that could help me overcome the challenges I’m facing.


Also, I appreciate your feedback, but I respectfully disagree with your assessment that the problem is not well-defined in steps 1, 2a, and the termination point. Allow me to clarify why these aspects of the problem are, in fact, clearly specified.

Regarding step 1, the criteria for identifying top-level items are explicitly stated:

  • Numbering systems (e.g., 1., 2., 3.)
  • Lettering systems (e.g., A., B., C.)
  • Bullets (e.g., -, *, •)
  • Symbols (e.g., >, #, §)

These criteria provide a clear and unambiguous way to determine the top-level items within the input string. There is no room for interpretation or confusion about what constitutes a top-level item.

Moving on to step 2a, the criteria for identifying lower-level items are also well-defined:

  • Indentation relative to the top-level item
  • Numbering or lettering sequences (e.g., i., ii., iii. or a., b., c.)
  • Bullets or symbols at a lower level than the top-level item

These criteria establish a clear hierarchy and structure for identifying lower-level items associated with each top-level item. The indentation, numbering/lettering sequences, and bullets/symbols provide a consistent and unambiguous way to group lower-level items with their corresponding top-level items.

Lastly, the termination point is clearly specified in the problem statement. The output should only include the content between the first and last top-level items, excluding any text that appears before the first top-level item or after the last top-level item. This termination point is well-defined and leaves no ambiguity about where the relevant content begins and ends.

In summary, the problem statement provides clear and specific criteria for identifying top-level items (step 1), lower-level items (step 2a), and the termination point. These criteria are well-defined and leave no room for misinterpretation or confusion. As a senior developer, I am confident that the problem is adequately specified to enable a robust and accurate solution.

Is there a particular reason you have to take in unindented untrimmed strings from some source? That you havent spent the time that you’ve put into coding attempts to… clean your data before working on it?

Any program is going to struggle with a vague and unspecific definition of the problem.

For clarity, what i mean is this:

What constitutes all possible numbering systems? What constitutes all possible lettering systems? What constitutes bullets? What constitutes symbols?
Is ! a symbol? is it a bullet? I dont know, because “e.g.” could mean anything. Does it need to be included in your solution?
What about just “a”? no period. I just start with a, and then b, and then c, with no punctuation? That’s a lettering system, surely. But can you detect it, without also detecting a sentence that begins with the word “A”?

Course, then you go to the next step, and say:

So a symbol is no longer valid, but roman numerals are? Were they not part of the numbering or lettering systems?

Thank you for your feedback on the problem definition. I appreciate you taking the time to provide a detailed critique. Let me address your points one by one.

Regarding the criteria for identifying top-level items, the thread clearly specifies the following:

  • Numbering systems (e.g., 1., 2., 3.)
  • Lettering systems (e.g., A., B., C.)
  • Bullets (e.g., -, *, $$\cdot$$)
  • Symbols (e.g., >, #, §)

The examples provided for numbering systems and lettering systems explicitly mention the use of a dot (period) as a signifier. The thread states “e.g., 1., 2., 3.” and “e.g., A., B., C.”, making it clear that the presence of a dot is a requirement for these cases.

Regarding your question about the exclamation mark (!), it is not considered a common practice for identifying list items. The thread provides examples of commonly used symbols such as >, #, and §. The exclamation mark is not included in this list, indicating that it should not be treated as a criterion for identifying list items.

As for your concern about detecting a lettering system without punctuation (e.g., “a”, “b”, “c”), the thread does not explicitly mention this case. The examples provided for lettering systems include a dot after each letter (e.g., A., B., C.), implying that the presence of a dot is expected. Detecting a lettering system without punctuation would indeed be challenging without also detecting sentences that begin with the word “A”. However, this case is not explicitly covered in the problem definition.

I understand your point about the potential ambiguity of using “e.g.” in the problem definition. To clarify, the examples provided are meant to be illustrative rather than exhaustive. The thread aims to provide a general understanding of the criteria for identifying top-level items, but it may not cover every possible variation or edge case.

If there are specific scenarios or edge cases that you believe should be addressed in the problem definition, please let me know, and I’ll be happy to discuss them further. We can work together to refine the problem definition and ensure that it provides sufficient clarity and specificity for the task at hand.

The structure of these inputs is a result of the API that my wider program relies on. The API returns data in this specific format, which is beyond my control.

Given that modifying the API itself is not feasible, I have focused my efforts on developing a robust Python parser to handle these inputs effectively. The parser aims to process the unindented and untrimmed strings returned by the API and transform them into a more usable format for further processing within my program. By dedicating time to create this parser, I can ensure that the data is cleaned and structured appropriately for subsequent operations.

While I understand your perspective on cleaning the data before working with it, in this particular case, the data format is dictated by the API. Therefore, my approach has been to develop a solution that can handle the given input structure and convert it into a more manageable format. This allows me to work with the data effectively without relying on changes to the API, which are outside of my control.

And there’s where you lose people. You’re talking out both sides. It’s both not exhaustive, but an explicit list?

You are correct that the lower level items need to be concatenated with their respective top level items, rather than being separate list items. The goal is to group each top-level item with all of its associated lower-level items into a single string, while maintaining the original formatting and delimiters.

The test program I provided allows you to validate any method you design against the sample inputs and expected outputs. It compares the actual output produced by your method (given a sample input) with the expected output for that input. The program calculates a similarity percentage between the actual and expected outputs, giving you a quantitative measure of how closely your method matches the desired behavior. Additionally, for any differences between the actual and expected output arrays, the program specifies how each differing index varies, helping you pinpoint areas that need adjustment in your method.

Thank you for your efforts so far, I really appreciate it.

So let me see if i can stab at the actual problem definition.

We are specifically looking for Top level items that:
Begin with a number followed by a period.
Begin with a single letter followed by a period.
Begin with -, *, (dot), >, # or §, presumably followed by a space?

And sub-level items that:
Contain the top level item followed by a number or a single letter.
Begin with a number followed by a period (If the TLI was not a number followed by a period)
Begin with a letter followed by a period. (If the TLI was not a letter followed by a period)
Begin with a roman numeral sequence followed by a period.
Begin with -,*, (dot), where that symbol is not the same as the current Top level item.

?

Do we need to expect malformed lists? (IE: a letter sequence that starts with b. instead of a.)

Thank you for your message, m_hutley. I understand your perspective on the examples provided in the problem statement. Allow me to clarify the intent behind those examples.

The examples of numbering systems (1., 2., 3.), lettering systems (A., B., C.), bullets (-, *, •), and symbols (>, #, §) were indeed meant to be illustrative rather than exhaustive. The purpose was to convey the general patterns and characteristics that define top-level items in the hierarchical list. These examples serve as a starting point to understand the concept, but they are not meant to be an all-inclusive list of every possible variation.

The point I was trying to make is that the examples follow a certain pattern, and although not every single iteration of that pattern was explicitly mentioned, the pattern itself can be extended and applied to similar cases. For instance, the numbering system examples (1., 2., 3.) imply that other variations like (i., ii., iii.) or (a., b., c.) would also be considered as top-level items. Similarly, the bullet examples (-, *, •) suggest that other common bullet symbols like (○, ■, :black_small_square:) would fall under the same category. The key is to recognize the underlying pattern rather than focusing solely on the specific examples provided.

Regarding your point about exclamation marks, I agree that they may not directly follow the pattern of the included symbols. The symbols mentioned in the problem statement (>, #, §) are commonly used for outlining or demarcating sections, whereas exclamation marks serve a different purpose. However, I apologize if my previous response was unclear or contradictory in this regard. Moving forward, I would be happy to collaborate with you to refine the problem statement and ensure that the examples and explanations are consistent and unambiguous.

Please let me know if you have any specific suggestions or ideas on how we can improve the clarity and coherence of the problem description. I appreciate your detailed feedback and am committed to working together to solve the issues at hand.

Which is fine when you’re trying to explain something. But when you want actual CODE, you have to be specific and exhaustive, or the code wont match what you need. If I code something to look for “1.”, and you come back and say “Well it didnt find ‘One.’, so it fails”… that’s not a problem of the code.

Patterns are great, but computers dont understand “give me all numeric sequences”

I was conscious of that @m_hutley, but it was apparent that christinanamodeo832 had already put in a lot of effort into coding this and had some idea of what they were doing. It certainly didn’t look like the usual 'can you do my assignment for me?` questions we get on here.

basically, the assignment (although vague) reads as a recursive function definition; pseudocode…

Given an input array of sentences (defined to be the initial block of text split on \n);
define an empty result.
#label1
if the array is empty, return your result.
check the first element for a TLI.
if found, and the TLI’s type is not the same as the current TLI, add this line to your result; call this function again, passing by reference the remaining array and this TLI as the current TLI; add its result to your result.
shift the array.
goto label1