Building potential patent portfolios: An integrated approach based on topic identification and correlation analysis
Xian ZHANG1,2, Haiyun XU1, Shu FANG1, Zhengyin HU1,2 & Shuying LI2
1Chengdu Library, Chinese Academy of Sciences, Chengdu 610041, China
2University of Chinese Academy of Sciences, Beijing 100190, China
2015, 8(2):39-51, Received: Apr. 13, 2015 Revised: Jul. 27, 2015 Accepted: Jul. 29, 2015
This work is jointly supported by the Science and Technology Service Network Initiative of Chinese Academy of Sciences (Grant No.: KFJ-EW-STS-032), the West Light Foundation of Chinese Academy of Sciences (Grant No.: Y4C0091001), and the National Social Science Foundation of China (Grant No.: 14CTQ033). This paper is based on a presentation at the 4th Global TechMining Conference, Stadsgehoorzaal, Leiden, the Netherlands.
X. Zhang (zhangx@clas.ac.cn, corresponding author) was responsible for the overall research design, proposed the analysis framework, wrote the paper outline and revised the paper. H.Y. Xu (xuhy@clas.ac.cn) designed the patent analysis framework, performed data analysis, made the figures and tables and revised the paper. S. Fang (fangsh@clas.ac.cn) proposed the research topic and revised the paper. Z.Y. Hu (huzy@clas.ac.cn) completed the details of the patent analysis framework, and S.Y. Li (lisy@mail.las.ac.cn) searched for information on related studies and edited the paper.

"Purpose: This paper suggests a framework to identify important patents for building potential patent portfolios based on patents owned by different assignees so as to highlight the value of individual patents in technology transfer and identify potential collaborators for patent assignees.

Design/methodology/approach: The analysis framework includes the following steps: 1) co-classification analysis based on the International Patent Classification (IPC) codes and Derwent Manual Codes (DMC) to detect sub-tech fields, 2) keyword co-occurrence analysis aiming to understand the core technology information in each patent, and 3) social network analysis used for identifying important technologies and partnerships of key assignees. A case study was conducted with 27,401 chemistry patents filed by a Chinese national research institute.

Findings: The results show that this framework is effective in building potential technological patent portfolios based on patents owned by different assignees and identifying future collaborators for the assignees. This integrated approach based on topic identification and correlation analysis that combines network-based analysis with keyword-based analysis can reveal important patented technologies and their connections and help understand detailed technological information mentioned in patents.

Research limitations: In the keyword analysis, only titles and abstracts of patent documents were used, and the weights of keywords in different parts of the documents were not considered.

Practical implications: The analysis framework provides valuable information for decision makers of large institutions which have many patents with broad application prospects.

Originality/value: Different from previous patent portfolio studies based on the use of a combination of patent analysis indicators, this study provides insights into a method of building patent portfolios to discover the potential of individual patents in technology transfer and promote cooperation among different patent assignees."

1 Introduction
Companies apply for patents to obtain the exclusive right to use, sell and offer to sell the patented technologies for a limited period of time. Brockhoff [1] proposed the concept of patent portfolio, which is the collection of patents owned by an individual or a company. In order to enhance global competitiveness, companies have stepped up their efforts in patent management through patent portfolio analysis. For large enterprises or organizations which have a lot of subsidiary companies, however, it is a challenging task to build potential patent portfolios based on patents owned by different assignees and promote collaboration between these firms.
Current studies on patent portfolio focused primarily on the analysis related to one patent holder and few considered the ways of building potential portfolios of patents held by different assignees[2]. Using a case study of a large research institute in China, this paper intends to suggest an analysis framework for enterprises and organizations to build potential patent portfolios based on patents of different assignees so as to reveal the value of individual patents in technology transfer and identify potential collaborators for these assignees.
2 Related study
One of the major approaches to identification of important patents is co-occurrence analysis based on structured information such as patent classification codes, assignees, or citations. Researchers tried to use the co-occurrence relationships of patent classification codes to conduct co-classification analysis. Like co-citation and co-word analysis, it is assumed that classification codes represent cognitive elements associated with topics, specialities, or fields[3] and analysis of co-occurrence of patent classification codes can measure technological distance, with patents in a given patent category being considered more similar to one another than to those in other patent categories[4,5]. However, one limitation of co-classification analysis is that the classification may be too broad to fully capture all the technologies within a category and meet the goal for a particular analysis.
Patent citations refer to the count of citations of a patent in subsequent patents, and citations of a patent can be used to measure the relative importance of the patent[6]. Many studies[7,8,9,10,11,12,13] used citation counts to evaluate the quality and impact of the cited technologies and the relationships between cited and citing technological areas. However, the patent citation approach may lead to inaccurate results, as applicants and examiners may have different purposes and motivations when citing patents, and thus citations may not reflect the association between technologies[14].
With the development of data mining techniques, analyzing the unstructured data in patent documents such as titles and abstracts has become possible. Yoon & Park[15] extracted keywords through text mining and performed keyword-based morphology analysis. Tseng et al.[16] discussed text mining techniques such as text segmentation, term association, and topic mapping technique and their empirical study found that patent abstract and general summary and summaries from each section in a patent document are the most topic-relevant sections.
Social network analysis (SNA) is applied in identification of core technologies and detection of technological association. For instance, Yoon & Park[17] measured technology distance with the Euclidean distance of keyword vectors and built patent association networks. Luan[18] built the evolution network of solar energy technologies and identified key technologies with betweenness centrality. However, relatively little research has been conducted on using the SNA approach to build patent portfolios and find collaborators for different patent assignees.
3 Methodology
3.1 Framework
The suggested analysis framework is illustrated in Fig. 1, which involves three modules.
• Data preparation. This module includes selection of a patent technology field and appropriate databases for data retrieval. Keywords are first extracted from patent titles and abstracts with Thomson Data Analyzer (TDA). The extracted keywords are then standardized, which involves using singular forms, unifying synonyms, avoiding hyphens and abbreviations, and standardizing phrases. For example, “fuel battery” and “fuel cell” are standardized into “fuel battery”.
① Thomson Data Analyzer (TDA) can be used for patent data mining and visualization. Details are available at http://thomsonreuters.com/en/products-services/intellectual-property/patent-research-and-analysis/thomson-data-analyzer.html.
• Network construction and visualization. We first construct a technology network based on co-occurrence of the IPC codes and DMCs, respectively, and then a keyword co-occurrence network based on keywords extracted from patent titles and abstracts. The network-based patent analysis gives us the overall situation of a relevant technical field, including important patents and the relationships between patents from a network viewpoint, while the keyword-based patent analysis enables us to understand the core technology information within patents on the basis of patent content. To identify the unique characteristics of the patent technology network and keyword network, SNA indicators such as the E-I index[19] and centrality measures[20] are employed. The Girvan-Newman algorithm[21] is employed for clustering and visualization. UCINET (version 6.216) is used as the software tool for the analysis and visualization of data.
• Data interpretation. With the participation of technological experts, the results are interpreted so as to build possible patent portfolios and find potential cooperation opportunities for patent assignees.
Fig. 1    The analysis framework.
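The keyword standardization described in the data preparation module can be sketched as follows. This is a minimal illustration only: the study performed this step in TDA, and the function names `singular` and `standardize`, the synonym table, and the naive singularization rule are assumptions introduced here. The only synonym pair taken from the text is "fuel cell" and "fuel battery".

```python
import re

# Hypothetical synonym table. Only the "fuel cell" -> "fuel battery" pair
# comes from the paper; a real table would be curated by domain experts.
SYNONYMS = {"fuel cell": "fuel battery"}


def singular(word):
    """Very naive singularization (an assumption, not the TDA procedure)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def standardize(keyword, synonyms=SYNONYMS):
    """Lower-case, replace hyphens by spaces, singularize the head noun,
    then map the result to its preferred synonym if one is defined."""
    term = re.sub(r"[-\s]+", " ", keyword.lower()).strip()
    if not term:
        return term
    words = term.split(" ")
    words[-1] = singular(words[-1])
    term = " ".join(words)
    return synonyms.get(term, term)


if __name__ == "__main__":
    for raw in ["Fuel-Cells", "fuel batteries", "Polymers"]:
        print(raw, "->", standardize(raw))
```

In practice the singularization rule would need a proper lemmatizer, and abbreviations would need their own expansion table; both are left out of this sketch.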

3.2 Key steps of analysis process
3.2.1 Construction of patent technology network
It is possible that a patent is given more than one classification code, which means the patent is involved in several technology fields. We used the TDA tool (version 5.1) to build a co-occurrence matrix of IPC codes and DMCs, respectively. The reason why we use both IPC codes and DMCs is that IPC is function-oriented rather than application-oriented, which means that IPC codes can hardly be matched with the industry technology classification used in practice[22]. By comparison, the Derwent manual codes provided by the Derwent Innovation Index (DII) are application-oriented. As a result, using both IPC codes and DMCs to build the technology network gives us a comprehensive view of technology fields or areas.
We refer to Tseng et al.’s study[16], and the similarity between patent J and patent K can be measured by the cosine of the angle between their vectors, following Eq. (1).

$$\mathrm{Sim}(J,K)=\frac{\sum_{i=1}^{n} IPC_{ij}\,IPC_{ik}}{\sqrt{\sum_{i=1}^{n} IPC_{ij}^{2}}\;\sqrt{\sum_{i=1}^{n} IPC_{ik}^{2}}} \qquad (1)$$

where each patent is regarded as a vector over the collection of n IPC codes, IPCij denotes the occurrence frequency of the ith IPC code in patent J, and IPCik denotes the occurrence frequency of the ith IPC code in patent K.
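The cosine measure between two patents can be illustrated with a short sketch. The function name `cosine_similarity` and the two example code lists are hypothetical and chosen only for illustration; the study itself computed the co-occurrence matrices in TDA.

```python
import math
from collections import Counter


def cosine_similarity(codes_j, codes_k):
    """Cosine of the angle between two patents, each given as a list of
    classification codes (a code may repeat; frequencies are counted)."""
    vec_j, vec_k = Counter(codes_j), Counter(codes_k)
    # Only codes present in both patents contribute to the dot product.
    dot = sum(vec_j[c] * vec_k[c] for c in vec_j.keys() & vec_k.keys())
    norm_j = math.sqrt(sum(v * v for v in vec_j.values()))
    norm_k = math.sqrt(sum(v * v for v in vec_k.values()))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # a patent without codes is treated as dissimilar to all
    return dot / (norm_j * norm_k)


if __name__ == "__main__":
    # Illustrative code lists, not taken from the case-study data.
    patent_j = ["C08L", "C08K", "C09D"]
    patent_k = ["C08L", "C09D", "A61K"]
    print(round(cosine_similarity(patent_j, patent_k), 3))
```

Two patents sharing two of their three codes, as in the example, obtain a similarity of 2/3; patents with no code in common obtain 0.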
3.2.2 Construction of keyword co-occurrence network
As nouns and noun phrases express the main information of a sentence, and the title and abstract of a patent document contain the main information of the patented technology[22], keywords in the form of nouns and noun phrases are extracted from the titles and abstracts of patent documents and are used to construct the keyword co-occurrence network.
Each patent document is transformed into a vector of keyword occurrence frequencies. Words occurring more than 100 times are considered to be closely connected with a specific technology and are adopted as keywords. The association strength between two nodes is calculated by applying the cosine measure, following Eq. (2).

$$\mathrm{Sim}(J,K)=\frac{\sum_{i=1}^{n} Term_{ij}\,Term_{ik}}{\sqrt{\sum_{i=1}^{n} Term_{ij}^{2}}\;\sqrt{\sum_{i=1}^{n} Term_{ik}^{2}}} \qquad (2)$$

where Termij denotes the occurrence frequency of the ith keyword in patent J, and Termik denotes the occurrence frequency of the ith keyword in patent K.
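The filtering and counting step that precedes the cosine normalization can be sketched as below. This is a sketch under stated assumptions: the frequency threshold is interpreted here as the number of patent documents in which a keyword appears, and the function name `build_cooccurrence` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs, min_freq):
    """docs: one list of standardized keywords per patent.
    Keeps keywords found in more than `min_freq` documents and counts how
    often each pair of kept keywords appears in the same document."""
    doc_freq = Counter(k for doc in docs for k in set(doc))
    kept = {k for k, f in doc_freq.items() if f > min_freq}
    pairs = Counter()
    for doc in docs:
        present = sorted(set(doc) & kept)  # sorted so (a, b) is canonical
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return kept, pairs


if __name__ == "__main__":
    # Toy documents; the study used a threshold of 100 on real patents.
    docs = [
        ["phenolic resin", "acetone"],
        ["phenolic resin", "acetone", "chitosan"],
        ["phenolic resin", "polyethylene glycol"],
    ]
    kept, pairs = build_cooccurrence(docs, min_freq=1)
    print(sorted(kept))
    print(dict(pairs))
```

The resulting pair counts form the edge list of the keyword co-occurrence network; with the real data the threshold would be set to 100 rather than 1.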
3.2.3 Patent portfolio analysis with SNA indicators
The external-internal index (E-I index)[19], which compares the number of ties between different subgroups (external ties) with the number of ties within subgroups (internal ties), is used to measure the internal or external relationships between different technology fields. The possible values of the E-I index range from –1 to 1. A value close to 1 indicates predominantly external relationships, while a value close to –1 indicates predominantly internal relationships[19]. In addition, betweenness centrality is used to identify important technologies in the keyword co-occurrence network.
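A minimal sketch of the E-I index follows, assuming the commonly used Krackhardt-Stern form (E - I) / (E + I), where E is the number of ties between subgroups and I the number of ties within subgroups. The function name `ei_index` and the toy network are hypothetical; the study obtained its value from UCINET, whose exact settings are not stated in the text.

```python
def ei_index(edges, group_of):
    """edges: iterable of (u, v) ties; group_of: dict mapping node -> group.
    Returns (E - I) / (E + I), which lies between -1 and 1."""
    edges = list(edges)
    external = sum(1 for u, v in edges if group_of[u] != group_of[v])
    internal = len(edges) - external
    if not edges:
        raise ValueError("E-I index is undefined for a network without ties")
    return (external - internal) / (external + internal)


if __name__ == "__main__":
    # Toy network: two fields with two technology codes each.
    group_of = {"a": 1, "b": 1, "c": 2, "d": 2}
    edges = [("a", "b"), ("c", "d"), ("b", "c")]
    print(ei_index(edges, group_of))  # two internal ties, one external tie
```

A network whose ties all stay within fields yields –1, and one whose ties all cross field boundaries yields 1.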
4 Empirical study
4.1 Data collection
We selected the Derwent Innovation Index (DII) database as the data source. DII merges the value-added patent information from the Derwent World Patents Index with the patent citation information from the Derwent Patent Citation Index. As the largest patent database in the world, DII covers over 14.3 million basic inventions from 40 worldwide patent-issuing authorities, spanning chemistry, electrics, electronics and mechanical engineering. Data in DII can be traced back to 1963, with over 30,000 patent records added weekly. In DII, patent information can be retrieved by patent code, assignee, assignee code and inventor, and patents are assigned IPC codes, Derwent class codes, and Derwent manual codes.
According to the World Intellectual Property Organization (WIPO)’s IPC8-Technology Concordance Table revised in January 2013, technologies are grouped into 5 areas (electrical engineering, instruments, chemistry, mechanical engineering and other fields), which are further divided into 35 fields.
We retrieved data on November 20, 2013, using the chemistry IPC codes, for patents filed by a Chinese national research institute and its branch institutes between 1985 and 2012. For this national research institute, chemistry is the area in which it has the most patent applications. In total, we retrieved 27,401 chemistry records. As sufficient data are necessary for building patent portfolios, this paper uses chemistry patents as the data sample.
③ IPC codes are listed in Appendix I.
4.2 Results
4.2.1 Network based patent analysis
Figure 2 displays the co-classification network of IPC codes. Based on the Girvan-Newman algorithm, we found that the 27,401 chemistry patents are related to 11 fields. The E-I index value is –0.094, which means the 11 fields have maintained relatively independent relationships.
Figure 2 illustrates the internal and external relationships of the subgroups. The purple line indicates the internal relations and the green line the external relations. More internal relationships are observed from Fig. 2, especially in fields of chemical engineering, pharmaceuticals and organic fine chemistry. But the fields of food chemistry, micro-structural and nano-technology, and surface technology and coating are found to be weakly connected.
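As a rough reading of the reported E-I value, and assuming it was computed on tie counts in the commonly used form (E - I)/(E + I), the share of internal ties can be backed out as shown below. The symbols E (external ties) and I (internal ties) are introduced here for illustration and do not appear in the original analysis.

```latex
\[
\frac{E - I}{E + I} = -0.094
\;\Longrightarrow\;
\frac{I}{E + I} = \frac{1 + 0.094}{2} = 0.547,
\qquad
\frac{E}{E + I} = \frac{1 - 0.094}{2} = 0.453
\]
```

Under this assumption, roughly 55% of the ties fall within fields and 45% across fields, which is in line with the observation that internal relationships are somewhat more common than external ones.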
Fig. 2    Patent co-classification network based on IPC codes.

There are 188 subclasses in the Derwent manual code classification, which has a clear hierarchy and long-term stability[23]. In our case study, this large research institute’s patents in chemistry cover 182 subclasses. The clustering and visualization of the patents based on the Girvan-Newman algorithm are shown in Fig. 3; the patents can be classified into 6 technological fields: 1) polymer/plastics, 2) general chemistry, 3) catalyst, 4) pharmaceuticals, 5) agriculture, 6) semiconductors, circuits, fireproof materials, ceramics, cement and electrochemistry. The threshold value of the network is set at 0.05 to obtain a satisfactory visualization.
Fig. 3    Patent co-classification network based on Derwent manual codes.

The polymer/plastics field is located in the center of the network. The patents in this technology field are mainly about monomer, concentration, polymerization, natural polymer, addition polymer, condensation polymer, inorganic polymer, polymer blending, aqueous dispersion, additive, property, analyzing, testing, controlling, polymer modification and polymer processing. Polymer processing covers devices, materials and preparation methods for polymer applications and plastics. Polymer application, which has the highest betweenness centrality, is identified as the core node of the whole network.
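Betweenness centrality, used here to single out the core node, can be computed with Brandes' algorithm. The sketch below handles unweighted, undirected networks only and is not the UCINET routine used in the study; the function name `betweenness_centrality` and the toy adjacency list are hypothetical, and the links in the example are invented for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.
    adj: dict mapping every node to a list of its neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of distance from s
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                      # accumulate dependencies backwards
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


if __name__ == "__main__":
    # Hypothetical star-shaped network around one hub node.
    adj = {
        "polymer application": ["monomer", "additive", "polymer blending"],
        "monomer": ["polymer application"],
        "additive": ["polymer application"],
        "polymer blending": ["polymer application"],
    }
    scores = betweenness_centrality(adj)
    print(max(scores, key=scores.get))
```

The node with the highest score lies on the largest number of shortest paths between other nodes, which is the sense in which it acts as a bridge in the network.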
4.2.2 Keyword-based network construction
Using the method in Section 3.2.2, we constructed the patent keyword network. Keywords with high betweenness centrality, such as polymer application, polymerization reaction and fermentation industry, play a bridging role in connecting other patents in the network. These important technologies are usually in the core positions of the network and can be considered key technologies. The keyword network of patents in polymer application is illustrated in Fig. 4.
Fig. 4    The possible patent portfolios of polymer applications.

4.2.3 Portfolios possibility analysis
In view of core nodes in the network, we analyzed the possibilities of building patent portfolios on the 409 patents in polymer application. Figure 4 shows the co-occurrence network of keywords with frequencies of occurrence more than 100 times in patent documents of polymer application. Our findings indicate that this large research institute’s polymer application patents can be roughly summarized into six subject areas: 1) biological polymers, 2) synthetic resin, 3) conductive polymers, 4) engineering plastics, 5) fertilizers and 6) polyamide system.
We observed close relationships between the six subject areas, so a comprehensive protection solution is recommended. First, when the same technology is applied to different fields, these different applications may be considered as a patent portfolio. Second, we suggest patent portfolio protection for diversified technologies and their applications within one of the six subject areas. For instance, in the synthetic resin area, the preparation of phenolic resin involves various solvents (acetone, ethyl acetate, component solvent, etc.) and strengthened materials (polycarbafil, activated charcoal, etc.). As a result, the various preparation methods and materials of phenolic resin can be considered as a patent portfolio, which also highlights the value of individual patents.
4.2.4 Potential cooperation analysis
This paper selected the top 10 assignees of polymer patent applications of this national institute and the keywords of these applications to construct a two-mode co-occurrence network, as illustrated in Fig. 5. We detected multiple technology cooperation areas for these patent assignees.
Fig. 5    Potential partners in polymer application.

In Fig. 5, the red square nodes represent the top-ranked patent assignees. The blue circle nodes represent the key patent technologies. The nodes’ size is proportional to the value of the betweenness centrality of the nodes. The lines between these nodes indicate the relations between the assignees and the patented technologies. The thicker the line, the stronger is the linkage.
As revealed in Fig. 5, A, B and C possess multiple important patents and technologies in polymer application. As a result, there is a potential cooperation opportunity for these institutes. We discuss some of the most promising cooperation opportunities in detail:
• A, B, D and F all focus their research on synthesis and application of phenolic resin and other resins;
• A, B, C, F and G have filed many patents on the preparation and application of polyethylene glycol;
• B, D and E have patent filings in preparation and application of chitosan;
• A and E have both applied for patents on aluminium metal materials and preparations;
• B and C are both involved in the preparation and application of coating materials;
• A, C and F have all applied for patents on polyvinyl alcohol preparation and application.
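The cooperation opportunities listed above can also be read as a one-mode projection of the two-mode assignee-technology network. The sketch below transcribes the six bullet points into a dictionary and counts, for each pair of assignees, the technologies they share. It is an illustration only: the study relied on visual inspection of the network and expert interpretation, and the names `TECH_TO_ASSIGNEES` and `shared_technologies` are introduced here.

```python
from collections import defaultdict
from itertools import combinations

# Assignee-technology links transcribed from the six bullet points above.
TECH_TO_ASSIGNEES = {
    "phenolic resin and other resins": {"A", "B", "D", "F"},
    "polyethylene glycol": {"A", "B", "C", "F", "G"},
    "chitosan": {"B", "D", "E"},
    "aluminium metal materials": {"A", "E"},
    "coating materials": {"B", "C"},
    "polyvinyl alcohol": {"A", "C", "F"},
}


def shared_technologies(tech_to_assignees):
    """Project the two-mode network onto assignees: for every pair of
    assignees, collect the technologies in which both have patents."""
    shared = defaultdict(set)
    for tech, assignees in tech_to_assignees.items():
        for a, b in combinations(sorted(assignees), 2):
            shared[(a, b)].add(tech)
    return dict(shared)


if __name__ == "__main__":
    shared = shared_technologies(TECH_TO_ASSIGNEES)
    # Rank pairs by the number of shared technologies, largest first.
    ranked = sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for pair, techs in ranked[:5]:
        print(pair, sorted(techs))
```

A pair sharing several technologies is a candidate for closer inspection by technological experts; the count alone does not establish that cooperation is feasible.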
5 Discussions and conclusions
The previous studies on patent portfolios were based on the use of a combination of patent analysis indicators and focused primarily on an integration of multidimensional indicator system for exploring the patent assets of a certain assignee. This paper suggests a framework to provide insights into building potential patent portfolios on the technology level, especially focusing on the patents owned by different assignees and identifying potential collaborators. More specifically, the integrated approach based on topic identification and correlation analysis that combines network-based analysis with keyword-based analysis can reveal the overall situation of a relevant technological field and the content of the patent technologies at the same time.
This research needs to be improved in several aspects. For instance, in order to reveal the value of individual patents for technology transfer, we need to perform analysis at a finer level of granularity. To this end, we may consider weighting the words extracted from patent titles and abstracts, or extracting keywords from the patent summary section in addition to titles and abstracts. In addition, more case studies need to be conducted in different technological fields in order to verify the effectiveness of the suggested framework. We will address these issues in our future research.
Appendix I:    WIPO technology concordance in chemistry

1 Brockhoff, K.K. Indicators of firm patent activities. In Technology Management: The New International Language. Portland International Conference on Management of Engineering and Technology. Washington, DC: IEEE Computer Society, 1991: 476-481.
2 Yue, X.P. Inner-firm patent portfolio scale strategy: Based on Cournot model. Journal of Intelligence (in Chinese), 2012, 31(11): 118-122, 135. Retrieved on May 14, 2015, from http://d.wanfangdata.com.cn/Periodical_qbzz201211025.aspx
3 Spasser, M.A. Mapping the terrain of pharmacy: Co-classification analysis of the international pharmaceutical abstracts database. Scientometrics, 1997, 39(1): 77-97. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007/BF02457431. DOI:10.1007/BF02457431
4 Jaffe, A. Technological opportunity and spillovers of R&D: Evidence from firms' patents, profits, and market value. American Economic Review, 1986, 76(5): 984-1001.
5 Kauffman, S., Lobo, J., & Macready, W.G. Optimal search on a technology landscape. Journal of Economic Behavior & Organization, 2000, 43(2): 141-166. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0167268100001141. DOI:10.1016/S0167-2681(00)00114-1
6 Lee, S., Yoon, B., & Park, Y. An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation, 2009, 29(6/7): 481-497. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0166497208001326. DOI:10.1016/j.technovation.2008.10.006
7 Trajtenberg, M. A penny for your quotes: Patent citations and the value of innovations. RAND Journal of Economics, 1990, 21(1): 172-187. Retrieved on May 14, 2015, from http://www.jstor.org/stable/2555502?seq=1#page_scan_tab_contents
8 Narin, F. Patents as indicators for the evaluation of industrial research output. Scientometrics, 1995, 34(3): 489-496. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2FBF02018015. DOI:10.1007/BF02018015
9 Lanjouw, J.O., & Schankerman, M.A. Characteristics of patent litigation: A window on competition. RAND Journal of Economics, 2000, 32(1): 129-151. Retrieved on May 14, 2015, from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=252299
10 Harhoff, D., Scherer, F.M., & Vopel, K. Citations, family size, opposition and the value of patent rights. Research Policy, 2003, 32(8): 1343-1363. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0048733302001245. DOI:10.1016/S0048-7333(02)00124-5
11 Harhoff, D., & Reitzig, M. Determinants of opposition against EPO patent grants - The case of biotechnology and pharmaceuticals. International Journal of Industrial Organization, 2004, 22: 443-480.
12 Haupt, R., Kloyer, M., & Lange, M. Patent indicators for the technology life cycle development. Research Policy, 2007, 36(3): 387-398. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0048733307000054. DOI:10.1016/j.respol.2006.12.004
13 Wang, X., Zhang, X., & Xu, S. Patent co-citation networks of Fortune 500 companies. Scientometrics, 2011, 88(3): 761-770. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2Fs11192-011-0414-x. DOI:10.1007/s11192-011-0414-x
14 Kraslawski, A. Semantic analysis for identification of portfolio of R&D projects. Example of microencapsulation. The 16th European Symposium on Computer Aided Process Engineering and 9th International Symposium on Process Systems Engineering, 2006, 21: 1905-1910. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S1570794606803263. DOI:10.1016/S1570-7946(06)80326-3
15 Yoon, B., & Park, Y. Development of new technology forecasting algorithm: Hybrid approach for morphology analysis and conjoint analysis of patent information. IEEE Transactions on Engineering Management, 2007, 54(3): 588-599. Retrieved on May 14, 2015, from http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4278022. DOI:10.1109/TEM.2007.900796
16 Tseng, Y.H., Lin, C.J., & Lin, Y.I. Text mining techniques for patent analysis. Information Processing & Management, 2007, 43: 1216-1247. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S0306457306002020. DOI:10.1016/j.ipm.2006.11.011
17 Yoon, B., & Park, Y. A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 2004, 15(1): 37-50. Retrieved on May 14, 2015, from http://www.sciencedirect.com/science/article/pii/S1047831003000439. DOI:10.1016/j.hitech.2003.09.003
18 Luan, C.J. Mapping the evolution of technology network and identifying key technologies in the field of solar energy technology via co-occurrence analysis. Journal of the China Society for Scientific and Technical Information (in Chinese), 2013, 32(1): 68-79. Retrieved on May 14, 2015, from http://d.wanfangdata.com.cn/Periodical_qbxb201301009.aspx. DOI:10.3772/j.issn.1000-0135.2013.01.008
19 Krackhardt, D., & Stern, R.N. Informal networks and organizational crises: An experimental simulation. Social Psychology Quarterly, 1988, 51(2): 123-140. Retrieved on May 14, 2015, from http://www.jstor.org/stable/2786835?origin=crossref&seq=1#page_scan_tab_contents. DOI:10.2307/2786835
20 Liu, J. An introduction to social network analysis (in Chinese). Beijing: Social Sciences Academic Press, 2004.
21 Newman, M.E.J., & Girvan, M. Finding and evaluating community structure in networks. Physical Review E, 2004, 69: 026113. Retrieved on May 14, 2015, from http://journals.aps.org/pre/abstract/10.1103/PhysRevE.69.026113
22 Wang, B., Liu, S., Ding, K., et al. Identifying technological topics and institution-topic distribution probability for patent competitive intelligence analysis: A case study in LTE technology. Scientometrics, 2014, 101(1): 685-704. Retrieved on May 14, 2015, from http://link.springer.com/article/10.1007%2Fs11192-014-1342-3. DOI:10.1007/s11192-014-1342-3
23 Thomson Reuters. DWPI classification system. Retrieved on May 14, 2015, from http://ip-science.thomsonreuters.com/support/patents/dwpiref/reftools/classification/