
I have two or more .csv files which have the following data:

//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page

//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3

Desired Output file:

//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page, 
2, Testing, 2014-04-04, , , 3

In case some of you wonder: the "." in the header is just additional information in the .csv file and shouldn't be treated as a separator (the "." results from converting a JSON file to CSV, preserving the nesting level of the JSON data). My problem is that I haven't found any solution so far that accepts different column counts. Is there a clean way to achieve this? I don't have code yet, but I thought the following would work:

  • Read two or more files and add each row to a HashMap<Integer, String> //Integer = lineNumber, String = data, so that each file gets its own HashMap
  • Iterate through all indices and add the data to a new HashMap.

Why I think this thought is not so good:

  • If the header and row data of file 1 differ from file 2 (etc.), the columns won't line up correctly.

I think this might result if I do the suggested thing:

//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.

Is there a handy way I can merge the data of two or more files without(!) knowing how many elements the header contains?
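For illustration, here is one way the goal could be sketched (a hypothetical approach using only the JDK, not code I have): build the union of all headers, then pad each row with empty cells for the columns its file lacks.

```java
import java.util.*;

public class CsvMerge {
    // Merge CSV files with different headers: union the headers in order of
    // first appearance, then emit each row with blanks for missing columns.
    public static List<String> merge(List<List<String>> files) {
        LinkedHashSet<String> headers = new LinkedHashSet<>();
        List<Map<String, String>> rows = new ArrayList<>();
        for (List<String> file : files) {
            String[] cols = file.get(0).split(",\\s*");
            headers.addAll(Arrays.asList(cols));
            for (int i = 1; i < file.size(); i++) {
                String[] cells = file.get(i).split(",\\s*");
                Map<String, String> row = new LinkedHashMap<>();
                for (int c = 0; c < cols.length && c < cells.length; c++)
                    row.put(cols[c], cells[c]);
                rows.add(row);
            }
        }
        List<String> out = new ArrayList<>();
        out.add(String.join(", ", headers));
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String h : headers)
                cells.add(row.getOrDefault(h, "")); // blank for absent columns
            out.add(String.join(", ", cells));
        }
        return out;
    }
}
```

This keeps row data under its own column name regardless of how many columns each input file has, which is exactly the pairing problem described above.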

  • post the code you have so far. What library do you use for parsing/writing CSV files? Commented Dec 16, 2014 at 13:57
  • Does the order really matter in the columns? If I were to attempt to write code for this problem, what do I need to know about the order of the columns? Commented Dec 16, 2014 at 13:58
  • As I already said: I don't have code to show that meets the requirements I'm looking for. I can only show the code for the JSON-to-CSV conversion, but that's not relevant here. I wrote my own program that converts to CSV; no external libs. @ThisClark: The order does not matter, as long as the row data ends up in the right column. Commented Dec 16, 2014 at 14:00
  • 1
    I would use a BidiMap in this case, available from Apache Commons Collections commons.apache.org/proper/commons-collections I'm trying to write you some code right now. Commented Dec 16, 2014 at 14:19

1 Answer


This isn't the only answer, but hopefully it can point you in a good direction. Merging is hard; you're going to have to give it some rules, and you need to decide what those rules are. Usually you can break it down to a handful of criteria and then go from there.

I wrote a "database" to deal with situations like this a while back:

https://github.com/danielbchapman/groups

It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is that you read each row into a structure similar to:

(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
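A minimal sketch of loading one file into that shape (hypothetical names; assumes comma-separated lines and that you pass in which column is the id):

```java
import java.util.*;

public class RowReader {
    // Pair each cell with its column header, keyed by the row's id column.
    public static Map<String, Map<String, String>> read(List<String> lines, String idColumn) {
        String[] headers = lines.get(0).split(",\\s*");
        Map<String, Map<String, String>> set = new LinkedHashMap<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] cells = lines.get(i).split(",\\s*");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < headers.length && c < cells.length; c++)
                row.put(headers[c], cells[c]); // column name -> cell value
            set.put(row.get(idColumn), row);
        }
        return set;
    }
}
```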

A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.

Once you have these structures, your lookup can be as simple as:

 public List<Row> process(Data one, Data two) // pseudo code
  {
     List<Row> result = new ArrayList<>();
     for(Row row : one)
     {
       Id id = row.getId();
       Row additional = two.lookup(id);
       if(additional != null)
         merge(row, additional);

       result.add(row);
     }
     return result;
  }

  public void merge(Row a, Row b)
  {
    //Your logic here.... either mutating or returning a copy.
  }
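One possible body for merge, assuming a row is just a Map<String, String> of column name to cell value (my assumption, not the asker's types): copy over any columns the second row has that the first lacks, keeping the first row's values on conflict.

```java
import java.util.*;

public class RowMerge {
    // Mutating merge: add columns from b that a does not already have.
    // On a column conflict, a's value wins.
    public static void merge(Map<String, String> a, Map<String, String> b) {
        for (Map.Entry<String, String> e : b.entrySet())
            a.putIfAbsent(e.getKey(), e.getValue());
    }
}
```

Whether conflicting cells keep a's value, b's value, or both is exactly the kind of rule the answer says you have to decide up front.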

Nowhere in this solution am I worried about the columns, as this just acts on the raw data types. You can easily remap all the column names, either by storing them each time you do a lookup or by recreating them at output.

The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names, etc.) that might save you considerable time and point you in the right direction.

I do a lot of TSV processing in my line of work and maps are my best friends.


4 Comments

Thank you so far. I had nearly the same thought, but instead of 3 HashMaps I thought about a HashMap<Integer, HashMap<String, ArrayList<String>>>. Nonetheless, I think that's a good starting point. I'll let you know if it works. +1 so far. :)
I got fairly stumped on the bidimap anyway - you relieved my guilt for not having an answer ready by now. +1
The Bidi map is a weird datastructure. It is really useful for modeling Map<Key,Key> structures. You can abuse it for performance but for data it scares me.
Hey, just wanted to inform you that I solved this task. It caused me a headache, but I finally did it. I had to write some new classes and functions, so posting it all here would be too much. Big thanks to Daniel B. Chapman, who gave me the idea I needed. Marked as SOLVED.
